More than 9M broken links on Wikipedia are now rescued (archive.org)
689 points by infodocket on Oct 1, 2018 | 100 comments

I worked at the Archive for a few years remotely. It permanently altered my view of the tech world. Here are the notable differences. I think these would apply to several non-profits, but this is my experience.

1. There was no rush to pick the latest technologies. Tried and tested was much better than new and shiny. Archive.org was mostly old PHP and shell scripts (at least the parts I worked on).

2. The software was just a necessity. The data was what was valuable. Archive.org itself had tons of kluges and several crude bits of code to keep it going, but the aim was to keep the data secure and it did that. Someone (maybe Brewster himself) likened it to a ship traveling through time. Several repairs with limited resources have permanently scarred the ship but the cargo is safe and pristine. When it finally arrives, the ship itself will be dismantled or might just crumble but the cargo will be there for the future.

3. Everything was super simple. Some of the techniques to run things etc. were absurdly simple and purposely so to help keep the thing manageable. Storage formats were straightforward so that even if a hard disk from the archive were found in a landfill a century from now, the contents would be usable (unlike if it were some kind of complex filesystem across multiple disks).

4. Brewster, and consequently the crew, were all dedicated to protecting the user. e.g. https://blog.archive.org/2011/01/04/brewster-kahle-receives-.... There was code and stuff in place to not even accidentally collect data so that even if everything was confiscated, the user identities would be safe.

5. There was a mission. A serious social mission. Not just, "make money" or "build cool stuff" or anything. There was a buzz that made you feel like you were playing your role in mankind's intellectual history. That's an amazing feeling that I've never been able to replicate.

Archive.org is truly one of the most underappreciated corners of the world wide web. Gives me faith in the positive potential of the internet.

> There was a mission. A serious social mission. Not just, "make money" or "build cool stuff" or anything. There was a buzz that made you feel like you were playing your role in mankind's intellectual history. That's an amazing feeling that I've never been able to replicate.

This resonates with me. Sometimes we developers need to get off the "move fast and break stuff" bandwagon (which has been going for over a decade now), and consider that we're the ones responsible for preserving almost all human digital heritage of our epoch. There's a simple and obvious method to implement preservation-friendly content implicit in the web architecture: emit/materialize everything as plain HTML, even dynamic content. This is of course antithetical to most of this decade's SPA web development trends, but I think it's worth drawing a line between web content (worth preserving in the first place) and web apps (which have highly volatile content not worth preserving). I feel like this distinction isn't considered sufficiently in our staged web app architecture discussions, which are all about your latest JS MVw framework, to the degree that newbie web devs really don't learn the fundamentals of HTML etc. anymore, and are led to use React, Vue, etc. for content-based web sites.

Thank you for sharing this. I’ve donated (small amounts of) money to Mozilla and Wikipedia in the past. Your post makes me consider donating to archive.org this year.

Edit: typo

How does archive.org make money? I imagine their storage costs must be quite high.

They don’t. They are a non-profit 501(c)(3) charity that relies on donations.

I think what the parent meant is "how does archive.org pay their bills?".

I thought I was answering that. Where did I go wrong?

Are you saying they don't pay their bills?

I'm saying they pay their bills (utilities, hardware costs, salaries) with donations, US dollars obtained from those donating.

> How does archive.org make money?

Donations (https://projects.propublica.org/nonprofits/organizations/943...)

> I imagine their storage costs must be quite high.

No, they aren't. Building and hosting your own storage is cheap. Same reason Backblaze and Dropbox built their own storage systems.

> Building and hosting your own storage is cheap

Archive.org uses S3 extensively. Not exactly cheap.

Can you provide a citation? To my knowledge, the Archive does not use Amazon's S3 storage service, only its own internal storage system [2], which it refers to in places as "S3" [1].

[1] https://archive.org/help/abouts3.txt

[2] https://archive.org/web/petabox.php

To the best of my knowledge, the Archive has its own machines to store data. It is an Archive, and one of the principles was to have the know-how to preserve data even if the cloud providers disappear.

If you're curious to learn more about us, we're hosting our big annual event in SF this Wednesday (Oct 3)! Details: https://blog.archive.org/2018/08/20/save-the-date-building-a...

This will be my first year attending the annual event, and I'm super excited! I'm a long time fan of the Wayback Machine, even though my early work is terribly embarrassing :) Love what the archive does, which is so much more than snapshots of old websites. If you somehow have not read the mission statement, it's worth a minute: https://archive.org/about/

I'm now almost weirdly proud of the crap work of mine that I can find on the wayback machine :-)

For instance: 2nd April 2004 [1] was my first crap Flash animation for my consultancy company at the time. It's terrible, but watching it now gives me so much nostalgia!

[1]: WARNING: REQUIRES FLASH: https://web.archive.org/web/20040402230914/http://noisiadetr...

If you're not hosting that site P2P yet you're just centralizing the distributed Web.

What an entitled attitude to throw at one of mankind's greatest digital resources, without even researching the subject. And you have the balls to ask "When did HN turn into Reddit?" (https://news.ycombinator.com/item?id=18092654) ...

A digital library sort of implies centralizing? I think this thread explores how you can contribute to decentralization of the hosted content:


Archive.org is such a wonderful institution.

The other day, I discovered that the Wayback Machine has been archiving YouTube videos in full HD. Most videos aren't on there, of course, and it seems to only go back as far as ~2012 (HTML5 video switchover?), but some of them are there.

Y'all will be getting more donations from me. :)

I was wondering if something like that existed. A lot of times I add videos to my YouTube Watch Later list, and I often get around to them so much later that it’s not unusual for videos to have been removed by the user or deleted for unknown reasons. And there’s no text to even see what the video was.

That does my head in. I have so many videos on most of my playlists that I'm hardly ever able to guess what the missing video was.

This is why I back up videos with youtube-dl. Internet history can be completely erased at a moment's notice, and some things ought to belong to the public if they're to never see the light of day again otherwise.

I'd suggest a slight revision to your theorem: > Internet history can be completely erased at a moment's notice (unless it would be embarrassing for you later in life)...

Yes, indeed.

In these cases, I just Google the video ID. That's usually enough to find its title somewhere.

Thanks, I had no idea about that. I found their Game Books Collection today, thought that was pretty neat.


I've always wondered this: how does Archive.org work in terms of storage? The internet is massive, and caching every single site periodically for years on end, isn't that an unreasonably huge amount of data?

Edit: I just checked Wikipedia, it says they're using about 15 PB of storage.

Edit 2: 15 PB cost => 15,000 TB x $30/TB = $450,000. Of course, that's a back-of-the-napkin cost (no maintenance, power, etc.). That's not too bad actually.

The Archive currently has about 46 Petabytes of content ("bytes archived"), and over 120 PB of raw disk capacity; the difference is due to data replication, "currently filling" storage, non-storage infrastructure, etc.

We save a lot on web content storage by de-duplicating "revisits" when the page hasn't changed. This works out to save a whole lot for content like jQuery served from a common CDN URL; it doesn't work well when there is a page counter or any trivial changing content on a page.
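A minimal sketch of revisit de-duplication, not the Archive's actual code: it assumes captures are keyed by a SHA-1 digest of the response body, and the `capture` helper and in-memory `store` are hypothetical names used only for illustration. An unchanged payload is recorded as a small "revisit" entry, while a page with a changing counter gets a new digest every time and is stored in full.

```python
import hashlib

def capture(store: dict, url: str, body: bytes) -> dict:
    """Record one capture; keep the payload only if its digest is new."""
    digest = hashlib.sha1(body).hexdigest()
    if digest in store:
        # Same bytes seen before: record a pointer instead of a second copy.
        return {"url": url, "type": "revisit", "digest": digest}
    store[digest] = body
    return {"url": url, "type": "response", "digest": digest}

store = {}
first = capture(store, "https://cdn.example/jquery.js", b"/* jquery */")
second = capture(store, "https://cdn.example/jquery.js", b"/* jquery */")
counter_a = capture(store, "https://example.com/", b"<p>Visitors: 1</p>")
counter_b = capture(store, "https://example.com/", b"<p>Visitors: 2</p>")
print(first["type"], second["type"], counter_a["type"], counter_b["type"])
```

The two counter pages differ by one character but still occupy two full payload slots, which is the weakness described above.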

If you are interested in the storage back-end, it's actually pretty simple: HTTP requests/responses are concatenated and compressed in WARC files (sort of like .tar.gz) that get stored on regular old ext4 filesystems. An index of "what URL captures are in what WARC files on what servers" is continuously generated in the form of, basically, a giant sorted (and sharded) .tsv file; replay requests on web.archive.org look up the URL and timestamp and get a reference to a machine, file, and file offset, and make an HTTP 1.1 range request for the content in question. There are a bunch of other details, like checking robots.txt status, but the core design is super simple, cheap, and (relatively) easy to operate at scale.
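The lookup path described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the real index format: the column layout, the reversed-host URL keys, and the server and file names are all hypothetical. It shows the two steps, a binary search over sorted rows and the construction of a byte-range request.

```python
import bisect
from typing import Optional

# Hypothetical rows: url_key, timestamp, server, warc_file, offset, length
INDEX_TSV = (
    "com,example)/\t20180101000000\tnode-a\tcrawl-001.warc.gz\t1024\t512\n"
    "com,example)/\t20180601000000\tnode-b\tcrawl-042.warc.gz\t2048\t300\n"
    "org,wikipedia)/\t20180301000000\tnode-a\tcrawl-007.warc.gz\t0\t900\n"
)

def load_index(tsv: str) -> list:
    """Parse the TSV and sort by (url_key, timestamp, ...)."""
    rows = [tuple(line.split("\t")) for line in tsv.splitlines() if line]
    return sorted(rows)

def lookup(rows: list, url_key: str, timestamp: str) -> Optional[tuple]:
    """Return the latest capture of url_key at or before timestamp."""
    # Rebuilding the key list per call keeps the sketch short; a real
    # index would be searched on disk without loading it all.
    keys = [(row[0], row[1]) for row in rows]
    i = bisect.bisect_right(keys, (url_key, timestamp))
    if i == 0 or rows[i - 1][0] != url_key:
        return None
    return rows[i - 1]

def range_request(row: tuple) -> tuple:
    """Build the URL and Range header for one index row."""
    _, _, server, warc_file, offset, length = row
    start = int(offset)
    end = start + int(length) - 1  # HTTP byte ranges are inclusive
    return f"http://{server}/{warc_file}", {"Range": f"bytes={start}-{end}"}

rows = load_index(INDEX_TSV)
hit = lookup(rows, "com,example)/", "20180301000000")
print(range_request(hit))
```

Because the timestamps are fixed-width digit strings, plain string comparison orders them chronologically, which is what lets a flat sorted file serve as the index.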

Apart from web crawl content (including, these days, "heavy" video content which is difficult to de-dupe), we have a large amount of live recorded TV, scanned books (raw photos), etc.

(I currently work at IA)

Would be nice if you could store a diff, so a changing counter would only have to store the changed counter after the initial save.

You can do this if you group all site visits into one common WARC and compress it (or dedup it otherwise).

WARC itself does have a method of dedup if the response is the same (or mostly the same), but it works poorly if the content changes.

Wondering the same. What if it used the git protocol itself? Not sure how efficient git is, but if it is, then it's a relatively easy change.

Git doesn't store diffs, it stores whole files.

No, git packs do store base blobs and diffs, which are not visible to end users; the git plumbing hides this and only presents blobs to the user.

Have you evaluated compression algorithms that support custom dictionaries, like zstd? You could generate a compression dictionary for each domain, or just for those above a certain size.
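To illustrate the dictionary idea without a third-party dependency, here is a sketch that uses the preset-dictionary support in Python's standard `zlib` module as a stand-in for zstd. The "per-domain dictionary" is just a hypothetical chunk of shared page boilerplate; zstd would normally train its dictionary from many samples.

```python
import zlib

def compress_with_dict(data: bytes, dictionary: bytes) -> bytes:
    """Deflate data, letting matches refer back into the preset dictionary."""
    compressor = zlib.compressobj(level=9, zdict=dictionary)
    return compressor.compress(data) + compressor.flush()

def decompress_with_dict(blob: bytes, dictionary: bytes) -> bytes:
    """Inflate data; the same dictionary must be supplied again."""
    decompressor = zlib.decompressobj(zdict=dictionary)
    return decompressor.decompress(blob) + decompressor.flush()

# Hypothetical boilerplate shared by every page on one domain.
DOMAIN_DICT = (
    b'<!DOCTYPE html><html><head><meta charset="utf-8">'
    b'<link rel="stylesheet" href="/static/site.css">'
    b'<script src="/static/app.js"></script></head><body>'
    b'<nav><a href="/">Home</a><a href="/about">About</a></nav>'
)
page = DOMAIN_DICT + b"<h1>Visitor counter: 1234</h1></body></html>"

plain = zlib.compress(page, 9)
with_dict = compress_with_dict(page, DOMAIN_DICT)
print(len(page), len(plain), len(with_dict))
```

The trade-off is that the dictionary becomes part of the archive: lose it and every record compressed against it is unreadable, which cuts against the "a single disk found in a landfill should be usable" principle mentioned elsewhere in the thread.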

> (including, these days, "heavy" video content which is difficult to de-dupe)

Still waiting for the Shazam for video to know that 2 videos are the same even when they are of different codecs/framesizes/etc; just based on the visual imagery.

Isn't that Youtube Content ID? Upload a video and see who sends you a takedown

120PB is still an insane amount of data.

I imagine at least a million dollars a year to keep the lights on just for the infrastructure.

Does IA have a big endowment to keep it going for a while?

How do you combat bit-rot/silent corruption?

A lot more than 15PB; it was over 20PB last year. And that figure doesn't count the duplicate copies they keep of everything.

According to this 9-28-2018 article,

"Today, the Wayback Machine houses some 388B web pages, and its parent ... The Internet Archive’s collection, which spans not just the web, but books, audio 78rpm records, videos, images, and software, amounts to more than 40 petabytes, or 40 million gigabytes, of data. The Wayback Machine makes up about 63% of that."


It's essentially the only cure for the serious link-rot problems in Wikipedia's references.

According to their website, it's over 35PB now.

"We preserve 750 million Web pages per week! We’ve saved 35 petabytes (that’s 35,000,000,000,000,000 bytes) of data."

That doesn't account for redundancy and backups. I have asked some storage cost questions on /r/sysadmin and /r/datahoarder in the past about cost per TB and I tend to see numbers ranging from $300 to $1000 a TB of useable space.

That must be for SSDs, which are ~$150/TB vs ~$20/TB for HDDs. I expect IA must use HDDs for the bulk of their data.

SSDs are about $400/TB for raw storage. Prices are substantially different for consumer and enterprise spaces, but 4TB disks (Micron, Samsung, etc) tend to be about $1500 and 8TB around $3200. If you factor in redundancy, you're looking at around $1000-1500 per TB of usable storage.

This also isn't a matter of "just go with cheaper consumer disks". If you care at all about data integrity you won't use consumer-grade SSDs.

Or even tape. For offline redundancy it’s still cheaper.

3-way mirrored spinning storage would cost roughly $100/TB. 12TB disks are about $350 apiece these days. There are a whole bunch of possible considerations and alternate scenarios, but this is probably the worst-case.
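The arithmetic behind that estimate, as a small sketch. It counts raw drive cost only; the function name is made up for illustration, and chassis, power, and staff are deliberately left out.

```python
def usable_cost_per_tb(disk_price_usd: float, disk_tb: float, copies: int) -> float:
    """Drive cost per usable TB when every byte is stored `copies` times."""
    raw_cost_per_tb = disk_price_usd / disk_tb
    return raw_cost_per_tb * copies

mirrored = usable_cost_per_tb(350, 12, copies=3)
print(f"${mirrored:.2f} per usable TB")
```

With the figures quoted above, this comes to $87.50 per usable TB, which rounds up to the "roughly $100/TB" in the comment.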

That does not factor in compression / dedup, which can gain you a substantial amount of savings depending on your work profile; I bet if Archive.org wanted to, they could slash their storage using block-level dedup. Of course that creates potential recoverability headaches, so it's possible they don't.

If the internet grows exponentially, and sites decay in finite time, then that's about the same as caching the current internet, i.e. Google.
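One way to make that claim concrete, under a simplified model that is an assumption rather than a measurement: content is created at a rate that doubles every fixed number of years, and every page lives for a fixed lifetime. Integrating the creation rate shows that everything ever created divided by what is currently live is the constant 1 / (1 - e^(-k * lifetime)), where k = ln 2 / doubling time.

```python
import math

def archive_to_live_ratio(doubling_years: float, lifetime_years: float) -> float:
    """Ratio of all content ever created to content still live.

    Assumes an exponential creation rate c(t) = c0 * exp(k * t) and a
    fixed page lifetime; both are simplifying assumptions.
    """
    k = math.log(2) / doubling_years
    return 1.0 / (1.0 - math.exp(-k * lifetime_years))

for lifetime in (2, 4, 8):
    ratio = archive_to_live_ratio(doubling_years=2, lifetime_years=lifetime)
    print(lifetime, round(ratio, 3))
```

Under this model the archive is only a small constant multiple of the live web, and the multiple shrinks as pages live longer, which is the sense in which archiving everything is "about the same" as caching the present.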


I see this is downvoted. Actually, if a blockchain used proof of actual useful storage or computation, that'd be great. (instead of wasting energy by computing "useless" hashes)

It's probably downvoted because replying 'blockchain' to a question about how Archive.org stores their data is not even a useful answer.

There may well be fascinating ways one can use blockchain for archives of this sort, but the grandparent comment was essentially useless noise.

Proof of storage is used in siacoin, but not to secure the network. I haven't seen a scheme where proof of storage can be used to secure the network.

In terms of making sure the work that is done is useful it's fairly hard to do that because if you let someone dictate what work should be done then you potentially allow them to do the work upfront and a 51% attack becomes pretty easy.

"you potentially allow them to do the work upfront and a 51% attack becomes pretty easy."

I do wonder if there are potential consensus algorithms that rely upon game theory, the threat of being booted off the network, and the need for some investment before they're "trusted" by other peers. The basic idea is that yes, you could 51%-attack the network with a lot less than 51% computing power, but it would never be profitable because if you're discovered you're booted off into your own little partition where nobody plays with you, and the value you get from being part of the network in the future is greater than the value you would get from rewriting history into something only you believe.

51% attacks are possible now - roughly 75% of Bitcoin hashing power is located in China, so if the Chinese government decides that they want to shut it down, they send the military after the largest Bitcoin miners, say "Hash these transactions or else", and take control of the chain. They haven't found it profitable to do so yet, though - what would they gain from it?

The thing is that reputation isn't built into the protocol. Everyone is basically anonymous strangers.

When space-intensive computations are used for consensus, it's usually called proof of space. There are a few projects doing it already.


Proof of space and proof of storage are two different things.

I suppose that would be kind of interesting. Although if it's conceivable that someone could just store all the useful information then you can't just request a random piece of information as a proof of 'work', so you'd need something clever.

Anyway the comment was probably downvoted because it consisted of a single buzzword.

Blockchain actually could be used to provide data integrity in a distributed environment with known bad actors (presuming that they fall under the 50% threshold), but that is not applicable in this case since archive data is not stored in a distributed fashion by clients (as far as I know). Having worked on a production application that uses blockchain, my takeaway was that it cleverly provides distributed ledger transparency, but at a price that makes it very inefficient as a data store.

It's an invaluable resource. In the past week I realized it is quite likely that a site I used to work at which hosts a lot of my portfolio might disappear or be heavily amended. So I located and indexed a complete list of my articles there -- and I was even able to click a button and create an archive for the few pages that IA hadn't bothered to index (I wish I had known about this before, since it was a shock to find that IA can be quite selective, and to find a page you were hoping was there simply isn't, and is now irretrievable).

But as one other commenter here has mentioned, you're only a robots.txt amend away from the oblivion that the entire IMDB comments section fell into [1], so a good archiving system is essential. I use (no affiliation) Save Page WE on Waterfox:


[1] https://news.ycombinator.com/item?id=13571893

Wayback Machine isn’t Googlebot, it doesn’t crawl the web, so there’s no such thing as “hadn’t bothered to index”... Someone, be it a human or a bot, needs to submit a page for archival.

Programmatically submitting to Wayback Machine is trivial enough, so I have cron jobs backing up most of my static sites (in their entirety) periodically.
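A rough sketch of what such a cron job could look like. It assumes the Wayback Machine's Save Page Now endpoint accepts a plain GET of `https://web.archive.org/save/` followed by the page URL; the page list and the User-Agent string are hypothetical, and rate limits or authentication requirements are not handled.

```python
import urllib.request

SAVE_ENDPOINT = "https://web.archive.org/save/"

def save_url(page_url: str) -> str:
    """Build the Save Page Now URL for one page."""
    return SAVE_ENDPOINT + page_url

def submit(page_url: str, timeout: float = 60.0) -> int:
    """Ask the Wayback Machine to capture page_url; return the HTTP status."""
    request = urllib.request.Request(
        save_url(page_url),
        headers={"User-Agent": "site-backup-sketch/0.1"},  # hypothetical UA
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status

# Hypothetical list of pages; a real job might read these from a sitemap.
PAGES = ["https://example.com/", "https://example.com/about.html"]

# Dry run: show what would be requested. Call submit(page) to actually send.
for page in PAGES:
    print(save_url(page))
```

Scheduling the script from cron and swapping the dry-run loop for calls to `submit` would give the periodic backup described above.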

"bothered to index" and "IA can be quite selective" implies that there's a human overseer to the Wayback operation, not true!

Pages are archived automatically, and sites are crawled by robots, not humans.

This just reminded me to donate. I've used the archive several times just in the past few days to resolve 404s on old gamedev blogs. I'm amazed how often what I'm looking for is in the archive considering how big the internet is and how niche the content I'm looking for. Truly an amazing resource comparable to Wikipedia in value.

To everyone considering donating, please set up a monthly donation if it’s within your means!

I used to donate every time something reminded me of the value of the archive. Now I just think “that’s why I have a monthly pledge!”

Also, your employer may match charitable contributions (up to a predefined amount). Check if they do! It’s effectively free money for the Internet Archive.

Googlers can donate with two clicks on G-Give, with Google matching. I highly recommend it.

I got a call a few years ago from a member of a humanitarian organization that had accidentally lost a significant percentage of their web site detailing projects that they had completed over many years with no backups. The people that had completed the work had moved on and they were frantic that the work was gone forever, but the Wayback Machine had almost perfect records to restore everything.

There is a github project out there where you can specify the site, and it will rebuild the content locally from wayback content. Something to consider for last resort recovery.

EDIT: https://github.com/oduwsdl/warrick

They mention at the end of the article that 'content drift' may be a bigger issue than link rot; when the content of the post is simply changed rather than missing, it is much harder to notice.

Is there a scalable way to monitor Wikipedia links to see if the content is changed after originally being posted?

They are already storing every link in the Internet Archive when it gets added, so there should be a reference point to compare against.

One easy option would be to make Internet Archive links available for every single link on Wikipedia, even if it hasn't rotted yet. So a 'live' link to the current content, and an archive link for what it was at the time of linking.

Interesting aspect!

The biggest problem in this would probably be how to recognize if "content" changed. A site can change the full design, navigation, footer and header and everything and still have the exact same "content". For a human being this will be simple enough to understand, but a tool might have its problems with that.
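As a very rough illustration of the problem, here is a sketch that fingerprints only the visible text outside a few "chrome" elements. The list of skipped tags is an assumption, and real pages would need far more careful extraction; the point is only that a redesign can leave the fingerprint unchanged while a one-word edit changes it.

```python
import hashlib
from html.parser import HTMLParser

# Elements treated as site chrome rather than content (an assumption).
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}

class TextExtractor(HTMLParser):
    """Collect text that is not inside any of the SKIP_TAGS elements."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

def content_fingerprint(html: str) -> str:
    """Hash the whitespace-normalized text outside the skipped elements."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.parts).split())
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

old = "<body><header>Old Site</header><p>The fact is 42.</p><footer>2010</footer></body>"
redesigned = '<body><nav><a href="/">Home</a></nav><div><p>The fact is 42.</p></div><footer>2018</footer><script>track()</script></body>'
edited = "<body><header>Old Site</header><p>The fact is 43.</p><footer>2010</footer></body>"

print(content_fingerprint(old) == content_fingerprint(redesigned))
print(content_fingerprint(old) == content_fingerprint(edited))
```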

Yes, this is a fundamental issue if you wanted to do this at scale.

There are a few approaches to this already, using tools like outline.com to pull the content out of the cruft, but I don't know how many of these are general purpose and how many are purpose built for each site (and maintained for the current version of the site, perhaps?)

As seen in the article, most links are to a small number of sites, so perhaps hard coding the content extraction would be feasible, especially for an initial study.

It would be interesting I think to see just how many links have identical content, but you're right in that the number will be skewed greatly if there are any ads or similar included.

You might start with the Wikipedia article's history page.

Does the article's history page keep a record of what the content of the link was when it was updated?

I guess it doesn't, but assuming that IA records every link as it gets added you can use the history page to see which snapshot on the wayback machine corresponds to the link at the time it was added.
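A sketch of that lookup, assuming the Wayback Machine availability API at `https://archive.org/wayback/available` with `url` and `timestamp` query parameters. The sample response below is illustrative, not a real capture, and the date would come from the article's edit history.

```python
import json
from typing import Optional
from urllib.parse import urlencode

API = "https://archive.org/wayback/available"

def availability_query(url: str, added_on: str) -> str:
    """Build a query for the snapshot closest to added_on (YYYYMMDD)."""
    return API + "?" + urlencode({"url": url, "timestamp": added_on})

def closest_snapshot(response_text: str) -> Optional[tuple]:
    """Extract (snapshot_url, timestamp) from an API response, if any."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"], closest["timestamp"]
    return None

# Illustrative response body; the values are made up.
SAMPLE = json.dumps({
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20120103000000/http://example.com/page",
            "timestamp": "20120103000000",
            "status": "200",
        }
    }
})

print(availability_query("example.com/page", "20120101"))
print(closest_snapshot(SAMPLE))
```

Comparing that snapshot with the live page then gives a single before/after pair per link, which is enough for the kind of study suggested here, though not for dating when the drift happened.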

You could use that to compare to now, but unless you made sure to snapshot the content regularly (whenever the page is edited?) you wouldn't necessarily know when the content drift happened, or if multiple changes have happened to the content over time.

Using the history page would likely give you enough information to, for example, do a study on links to find out how many have different content now vs when the link was added.

Ah, no, that would be a separate problem.

Though IA's archival-on-creation mechanism should at least mean that the original reference is preserved.

An explicit Wikipedia link to reference-at-time-of-archive, or better, a diff flag or listing, would be useful.

And, for those who don't know, you can help host the Internet Archive now by running a P2P/decentralized backup of it:


Thank you for pointing this out, I will see what I can do to help, this is the sort of thing I'm more than happy to dedicate resources to. I still have a 36 drive enclosure lying around that would make a nice bit of storage if I can get it to be silent enough for the home.

This is the repo Mitra has been working on for this: https://github.com/internetarchive/dweb-mirror

And they are one robots.txt away from cancellation. For all the good they do, retroactively applying robots exclusions to their crawler is a terrible thing. Luckily there are alternatives for going forward.

They keep the data and just don’t display it. The last thing they need is a court order demanding they delete it.

Sure, the archive is useful today, but its primary purpose is retaining information for future generations. If that means placating copyright holders it’s worth the cost.

> If that means placating copyright holders it’s worth the cost.

In many cases it's nothing to do with the copyright holder. In fact, the complete opposite: someone who subsequently bought the domain and unwittingly stuck a generic robots.txt on the site.

If the IA hid content upon receiving a request from webmaster@domain then that'd make sense. But doing it automatically and retrospectively from robots.txt is bizarre and shows again the dangers of centralisation on the Internet.

I now block IA from my sites as a protest. We need more competition and fresh ideas in digital archiving.

> They keep the data and just don’t display it.

I’ve read about the robots.txt mentioned before but hadn’t seen this mentioned. Any idea if they have this somewhere on the site?

IA are ignoring robots.txt as of earlier this year.

Really? Reference, please!

A few months ago we stopped referring to robots.txt files on U.S. government and military web sites for both crawling and displaying web pages (though we respond to removal requests sent to info@archive.org). As we have moved towards broader access it has not caused problems, which we take as a good sign. We are now looking to do this more broadly.



They are gradually expanding the scope of sites where robots.txt is ignored, perhaps "testing the water".

I spend quite some time on archive.org. Of course the wayback machine is great, but I am mostly interested in old digitized media. There really is some great stuff on there, but what is really missing is organization (and a less broken-seeming website, I guess). It doesn't help you much to have a great archive if no one can find anything. User-curation would help a lot with this.

Found any gems you want to share?

What you're doing is essentially digital archeology, which is super cool. In 50-100 years, if not sooner, people will be digging through digital "graveyards" for evidence of this and that. That's so intriguing.

I like this term “digital archaeology”. I find myself saying it a lot when dealing with “only” 20 year old database data in my day job. Apparently, but not surprisingly, it’s a real thing[1]!

[1] https://en.m.wikipedia.org/wiki/Digital_archaeology

There's long been a saying that "Once it's out there (on the Internet), it's forever," but I used to save links in a Microsoft Word document, and I went through them a few years later and almost none of them worked anymore. The years in which they were saved were 2006 to 2009, and the year I went through them was 2012. The links were from MySpace (which totally overhauled the entire site and all content), Facebook (where users had deleted their profiles or pictures), Tumblr (where bloggers rename their blogs, which changes the URL, or they wipe them clean, or delete their blogs), YouTube (tons of videos and whole accounts have been deleted because of copyright infringement, whether by the account holder or by YouTube itself), Blogspot (same, but also that some bloggers made their blogs private, perhaps to prevent spam-comments or trolling), Yahoo articles (which I see Yahoo deletes after some time), Style.com (Vogue magazine's website of all runway shows, which are now on Vogue.com instead, with a different URL structure), and dozens of other websites that don't exist anymore.

I think the statement about "stuff that's out there" really only applies to famous or public people, where leaked and/or damning photos or videos are quickly copied, saved, and rehosted by websites all over the world, including Twitter, Pinterest, and other platforms. For instance, while Google Images fastidiously won't show you the hacked photos of "Jennifer Lawrence naked," as Google sought to avoid a $100M lawsuit [0], Bing Images, once you turn off Safe Search, shows plenty of sites that host the pictures, with the most frequent such site being a German-based one called "OhFree," but there are at least 3 Blogspot sites as well, I suppose ironically.

[0] > "We've removed tens of thousands of pictures," says the web giant - https://www.hollywoodreporter.com/thr-esq/google-responds-je...

There was a big drama in 2012. http://Archive.is was proactively archiving Wikipedia links. An unauthorized bot (RotlinkBot) was linking to Archive.is. The bot was banned.

I liked how Archive.is was so fast at archiving, and its UI was cleaner. And since it proactively archived links, it still happens today that a dead reference link will be archived in Archive.is, but not in the Wayback Machine.

See https://en.wikipedia.org/wiki/User:RotlinkBot

I'm guessing Archive.is will probably disappear within the next 5 years, taking all data down with it.

Nobody knows who owns or maintains the site, and recently the mysterious owner started taking donations to keep the site running. It's a commercial enterprise.

Slick UI or not, Archive.org's longevity is probably more feasible.

The internet archive is great for static pages but what will happen for today's interactive content with complex data stored across different domains?

They save JavaScript, flash apps, and even some downloads too. Just recently I used them to get an old flash game of a studio that went bust a couple years ago

Developers like using those time saving frameworks because it's easier. The management likes them because it makes linking to, downloading, or otherwise inspecting page content so much harder.

For people that make those kind of sites it is a feature that you can't archive them easily. They are a cancer.

The Archive's crawlers will improve.

If a tree falls in the forest and nobody is there to hear it, does it make a noise? If IA is storing copyrighted (noarchive) content but not displaying it, does that make it acceptable?

Yes, because copyright only lasts for a finite duration. In 100+ years, when the original right-holders have disappeared, or when copyright expires and the original right-holders have no incentive to keep the originals, the IA and similar archiving efforts will be able to make their copies available as a matter of historical importance.

Yes, because then it's saved. One day we'll have a little memory stick that contains the collected and organised works of mankind (all movies, all music, all websites, digitised books, ...) and it'll be thanks to the archivists.

Yes, at least in the United States, copyright law has exemptions for archival purposes. http://www.ala.org/advocacy/copyright/dmca/section108

a worthy donation candidate ;)
