Losing our digital history

Published by Reblogs - Credits in Posts, September 16th, 2024

We're losing our digital history. Can the Internet Archive save it?

12 hours ago

Chris Stokel-Walker

(Credit: Serenity Strull/ Getty Images)

Research shows 25% of web pages posted between 2013 and 2023 have vanished. A few organisations are racing to save the echoes of the web, but new risks threaten their very existence.

It's possible, thanks to surviving fragments of papyrus, mosaics and wax tablets, to learn what Pompeiians ate for breakfast 2,000 years ago. Understand enough Medieval Latin, and you can learn how many livestock were reared at farms in Northumberland in 11th Century England – thanks to the Domesday Book, the oldest document held in the UK National Archives. Through letters and novels, the social lives of Victorians – and who they loved and hated – come into view.

But historians of the future may struggle to understand fully how we lived our lives in the early 21st Century. That's because of a potentially history-deleting combination of how we live our lives digitally – and a paucity of official efforts to archive the world's information as it's produced these days.

However, an informal group of organisations are pushing back against the forces of digital entropy – many of them operated by volunteers with little institutional support. None is more synonymous with the fight to save the web than the Internet Archive, an American non-profit based in San Francisco, started in 1996 as a passion project by internet pioneer Brewster Kahle. The organisation has embarked on what may be the most ambitious digital archiving project of all time, gathering 866 billion web pages, 44 million books, 10.6 million videos of films and television programmes and more. Housed in a handful of data centres scattered across the world, the collections of the Internet Archive and a few similar groups are the only things standing in the way of digital oblivion.

"The risks are manifold. Not just that technology may fail, but that certainly happens. But more important, that institutions fail, or companies go out of business. News organisations are gobbled up by other news organisations, or more and more frequently, they're shut down," says Mark Graham, director of the Internet Archive's Wayback Machine, a tool that collects and stores snapshots of websites for posterity. There are numerous incentives to put content online, he says, but there's little pushing companies to maintain it over the long term.

Despite the Internet Archive's achievements thus far, the organisation and others like it face financial threats, technical challenges, cyberattacks and legal battles from businesses who dislike the idea of freely available copies of their intellectual property. And as recent court losses show, the project of saving the internet could be just as fleeting as the content it's trying to protect.

"More and more of our intellectual endeavours, more of our entertainment, more of our news, and more of our conversations exist only in a digital environment," Graham says. "That environment is inherently fragile."

Saving our history

A quarter of all web pages that existed at some point between 2013 and 2023 now… don't. That's according to a recent study by Pew Research Center, a think tank based in Washington, DC, which raised the alarm about our disappearing digital history. Researchers found the problem is more acute the older a web page is: 38% of web pages that Pew tried to access that existed in 2013 no longer function. But it's also an issue for more recent publications. Some 8% of web pages published at some point in 2023 were gone by October that same year.

This isn't just a concern for history buffs and internet obsessives. According to the study, one in five government websites contains at least one broken link. Pew found more than half of Wikipedia articles have a broken link in their references section, meaning the evidence backing up the online encyclopaedia's information is slowly disintegrating.

With no formalised public efforts to document the web, the Internet Archive has become a critical piece of digital infrastructure (Credit: Serenity Strull/ Getty Images)

But thanks to the work of the Internet Archive, not all those dead links are totally inaccessible. For decades, the Archive's Wayback Machine project has sent armies of robots to crawl through the cascading labyrinths of the internet. These systems download functional copies of websites as they change over time – often capturing the same pages multiple times in a single day – and make them available to the public free of charge.

"When we then went and looked at how many of those URLs were available in the Wayback Machine, we found that two-thirds of those were available in a way," Graham says. In that sense, the Internet Archive is doing what it set out to do – it's saving records of online society for posterity.
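For readers who want to try that kind of check themselves, here is a minimal sketch of asking the Wayback Machine whether it holds a copy of a given address. It assumes the Internet Archive's public availability endpoint at archive.org/wayback/available and the JSON field names shown in the comments; the function names are illustrative only, and this is not a description of how Pew or the Archive ran their own analysis.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint: the Wayback Machine's public availability lookup.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_query(url, timestamp=None):
    """Return the request URL that asks about one page (hypothetical helper)."""
    params = {"url": url}
    if timestamp:
        # Optional date hint in YYYYMMDD-style digits; the closest capture is returned.
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(response):
    """Pick the archived copy's address out of a decoded JSON reply, or None.

    Assumed reply shape:
    {"archived_snapshots": {"closest": {"available": true, "url": "..."}}}
    """
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def lookup(url, timestamp=None):
    """Make the network call and return the snapshot address, or None if absent."""
    with urlopen(build_query(url, timestamp), timeout=10) as reply:
        return closest_snapshot(json.load(reply))
```

Calling lookup("example.com") would return the address of the nearest stored capture, or None if the page was never archived. Parsing is kept separate from the network call so the logic can be checked against a saved reply without contacting the Archive's servers.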

A few other organisations, big and small, work on similar projects. The US Library of Congress, for example, preserves government websites, the sites of congressmembers and a collection of US news sites. The Library of Congress also preserved a copy of every single tweet sent since the founding of Twitter (now known as X), until the project was shut down in 2017. Other governments run their own initiatives. The UK Web Archive conducts an annual crawl of websites with .UK domain names, capturing a snapshot of the British internet at least once a year. In 2022, a band of volunteers set out to save the Ukrainian internet as it was hit by Russian cyberattacks.

But the scope of these projects is narrow, while the Internet Archive aims for a comprehensive approach. Given the available resources, it would be impossible to collect anything close to the whole internet, but its systems cast a broad net. Depending on what you're looking for, the Internet Archive's collection is so thorough it can sometimes feel like a functionally complete record of the web.

Success breeds complacency

The Archive's publicly accessible documents help sustain records of our lives in the current era. It's become a standard practice on Wikipedia to cite copies of websites from the Internet Archive’s Wayback Machine, rather than the original websites themselves. The organisation also preserves a vast collection of media that predates the digital era. The beloved 1977 comedy series Fernwood 2 Night isn't available on any streaming service, but you can watch it free on the Internet Archive. Books, magazines and websites cite the Internet Archive’s scanned digital copies of books that are unavailable in physical libraries. It even acts as a preservation tool for the public; anyone can upload videos, websites and practically anything else to the organisation's servers.

Among the major collections that the Wayback Machine has salvaged from the digital scrapheap are deep records of websites built on GeoCities, a now defunct personal web hosting service. Long before social media, GeoCities was among the first platforms that made it easy for anyone to create their own website. Historians view GeoCities as one of the most important chapters in the early days of the world wide web; without the efforts of the Internet Archive, most of its websites would be lost. In more recent history, a US Congressional Committee relied on the Internet Archive to preserve articles and documents related to the January 6 insurrection.

"Every few years there's a new platform come along and then the economic forces suddenly kind of collapse in it," says Andrew Jackson, preservation registry technical architect at the Digital Preservation Coalition, a UK-based advocacy group and charity that advises on how to preserve the world's online digital archives. "That's one big source of churn."

The tech news website CNET faced backlash in 2023 after reports that the company had deleted tens of thousands of articles, amounting to decades of lost history. Among CNET's responses was a promise that all its deleted articles had been preserved in the Wayback Machine. Many critics argued the company was taking the Internet Archive for granted, passing on its own archival responsibilities.

"Even though Google and other search engines actively incentivise you to maintain stable URLs, it's just technically quite difficult to do that," says Jackson. "Every time a new company kind of revamped its website, it has to work out how much of its new URLs it's going to try and maintain through time."

But it's worth remembering what the Internet Archive is: a non-profit organisation, financed by donations from charitable foundations. It makes for a never-ending project with exponentially growing costs. The Internet Archive volunteered to take on the mantle of being the world's leading library for our digital lives. As the web approaches its fourth decade, this entirely unofficial project has become a foundational pillar of the internet.

But as our reliance on the Internet Archive grows, so too do the threats pecking away at its efforts.

Single points of failure

Last week, the organisation announced a major partnership with Google, where the tech giant's search engine will include links to the Wayback Machine in search results – though neither released financial details about the deal.

But other recent news demonstrates that the project is still fragile. That vulnerability was laid bare in a court case against the Internet Archive by four large book publishers, who alleged that the Internet Archive’s practice of scanning physical books and lending out digital copies breaches US copyright law. Before the pandemic, the Internet Archive would only lend one digital copy at a time for each physical book in its collection. But during the Covid shutdowns, the organisation lifted that restriction, letting patrons borrow unlimited digital copies of books to try and make up for the closure of physical libraries.

A US court ruled that practice was illegal in 2023, and in early September, the Internet Archive’s appeal against that decision was rejected. The organisation previously said that it agreed to pay a publishing industry trade group an undisclosed sum in relation to the case.

With that lawsuit in the rearview, the Internet Archive is fighting yet another court case, this one against music labels over its digitising of records, which could cost it $400m (about £305m) if it loses. It's an amount that could jeopardise the non-profit's survival.

The Internet Archive's three-decades-long collection spans hundreds of billions of web pages (Credit: Serenity Strull/ Getty Images)

Internet Archive's director of library services Chris Freeland said in a statement about the ruling that the organisation is reviewing the court's opinion.

Existential legal battles aren't the only hazards menacing the world of digital preservation. The British Library's UK Web Archive got a taste of some malevolent technical challenges last year when a cyberattack took its digital systems offline in October 2023. Almost a year later, the UK Web Archive is still dealing with the fallout. Online access to much of its collection is still unavailable.

In May 2024, the Internet Archive announced it was in the midst of a large distributed denial of service (DDoS) attack. In a DDoS attack, vandals or other bad actors set up automated systems to bombard websites with visits, attempting to push them offline by overwhelming their servers. At its peak, tens of thousands of concurrent visits were happening every second. Services, including the Wayback Machine, went down. It meant that the regular drumbeat of archiving was disrupted for a time, and there may be permanent gaps in the historical record as a result.

The Internet Archive "was started by one individual, and it has become a kind of linchpin", says Jackson. "It also feels like this potential single point of failure. Although it's a lot more sophisticated than just volunteers, it is one institution in one region, under one legal framework."

The organisation shares these concerns. If the Internet Archive's work stopped and "that void wasn't immediately filled, then much of what is currently made available on the public web would be at risk", says Graham.

He's clear that the Internet Archive won't step back from its responsibilities anytime soon, but the project can use outside help. "There are opportunities for many others to contribute in a variety of ways," he says.

Shared responsibilities, split priorities

With no formal effort to coordinate the preservation of the internet, the project is left to hobbyists, volunteers and a few unofficial bodies that generally operate independently.

"It makes sense that the archival response is decentralised," says Mar Hicks, a historian of technology at the University of Virginia. "But one of the problems is the varied priorities."

Hicks points out that one of the first things any archivist will consider when building an archive is what to prioritise. "And when it's so decentralised, the priorities are going to be very different," Hicks says. "There's going to be people in groups who prioritise trying to grab everything – as much as they possibly can, they might be very completionist." Then there will be others who are focused only on certain areas – for instance, the UK archiving effort.

The concern about such an ad hoc, decentralised approach is that it's possible there's overlap, meaning precious archiving resources are wasted getting duplicate or triplicate copies of the most popular websites – all while some areas that may have historical importance are overlooked because they fall between different groups' responsibilities.

"Archivists will tell you that these issues have existed for a very long time," Hicks says. But they're exacerbated by the level of stuff being produced in our digital world. Nearly a billion emails are sent every day. YouTube reports that more than 500 hours' worth of video content is posted on the platform every minute.

The internet is "essentially a firehose of information and material," says Hicks. "It doesn't make sense to try to catch everything that comes out of the firehose. That wouldn't make sense from a resource standpoint."

In one sense this is an old concern. "We have, as historians, those same problems," says Hicks. "We have a wealth of documents from the past. But we only have certain documents and certain people's voices, and a lot of those voices that were missing were incredibly important, and they've been erased."

For Hicks, there needs to be some sort of priority about what is being saved from the digital footprints of our generation. Otherwise we run the risk that rapidly ballooning costs will sideline efforts to save the history of the web – not to mention the oceans of digital files that live offline.

"If you have to keep everything, it becomes very expensive," says Jackson of the Digital Preservation Coalition. "There's often older content or less compelling content [that] gets lost by the wayside," he says.

"We're not capturing the non-Western world well," admits Jackson. "There are gaps now around incompleteness in different cultural domains."

And while many of those organisations work to fight against their biases and prejudices, they're often left to carry the weight of the task while governments and the companies that run the platforms and websites sit by. "Independent groups of people, who are just caring about it and are willing to spend their free time doing it, are better resourced and more highly skilled than the institutions which are formally responsible," says Jackson.

There's a vacuum, argues Hicks, which few people other than a handful of archivist obsessives are filling. "It's not clear whose responsibility it is to archive [the internet] or whose interest it would serve," Hicks says.

One thing is clear, though, Hicks says: we should all pay up to support the fight for preservation. "From a very pragmatic perspective, if you do not pay these people and make sure that these archives are funded, they will not exist into the future, they will break down and then the whole point of collecting them will have gone out the window," says Hicks. "Because the whole point of the archive is not that it just gets collected, but that it persists indefinitely into the future."

The Enlightenment of the 18th century saw the birth of an international library movement as governments and philanthropists took on the need to preserve and distribute books for the public. But that sense of civic responsibility hasn't extended to the internet. That may be due to the complicated business interests of the digital world, or just the immense technical challenge. Or, perhaps, it's because it doesn't feel like the web needs saving to casual observers. A book is a more obviously finite resource; it can be lost or damaged. But the internet feels so accessible. Anyone with an internet connection can pull up a web browser and dial in a URL. It's all right there – until it isn't.
