1. There was no rush to pick the latest technologies. Tried and tested was much better than new and shiny. Archive.org was mostly old PHP and shell scripts (at least the parts I worked on).
2. The software was just a necessity. The data was what was valuable. Archive.org itself had tons of kludges and several crude bits of code keeping it going, but the aim was to keep the data secure, and it did that. Someone (maybe Brewster himself) likened it to a ship traveling through time. Repairs made with limited resources have permanently scarred the ship, but the cargo is safe and pristine. When it finally arrives, the ship itself will be dismantled or might just crumble, but the cargo will be there for the future.
3. Everything was super simple. Some of the techniques used to run things were absurdly simple, and purposely so, to help keep the whole thing manageable. Storage formats were straightforward, so that even if a hard disk from the archive were found in a landfill a century from now, the contents would be usable (unlike if it were some kind of complex filesystem spanning multiple disks).
4. Brewster, and consequently the crew, were all dedicated to protecting the user. e.g. https://blog.archive.org/2011/01/04/brewster-kahle-receives-.... There was code in place to avoid even accidentally collecting data, so that even if everything were confiscated, user identities would be safe.
5. There was a mission. A serious social mission. Not just "make money" or "build cool stuff" or anything. There was a buzz that made you feel like you were playing your role in mankind's intellectual history. That's an amazing feeling that I've never been able to replicate.
Archive.org is truly one of the most underappreciated corners of the world wide web. It gives me faith in the positive potential of the internet.
This resonates with me. Sometimes we developers need to get off the "move fast and break stuff" bandwagon (which has been going for over a decade now) and consider that we're the ones responsible for preserving almost all the human digital heritage of our epoch.

There's a simple and obvious method for preservation-friendly content implicit in the web architecture: emit/materialize everything as plain HTML, even dynamic content. This is of course antithetical to most of this decade's SPA web development trends, but I think it's worth drawing a line between web content (worth preserving in the first place) and web apps (whose highly volatile content isn't worth preserving). I feel this distinction isn't considered sufficiently in our web app architecture discussions, which are all about your latest JS MVW framework, to the degree that newbie web devs don't really learn the fundamentals of HTML etc. anymore and are led to use React, Vue, etc. for content-based web sites.
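For illustration, a minimal sketch of that "materialize everything as plain HTML" idea in Python; the post data and output paths are made up, but the point is that even database-backed content can be emitted as self-contained documents a crawler can preserve:

    # Sketch: render "dynamic" content to plain HTML files so an archive
    # crawler sees complete documents instead of an empty JS shell.
    # The posts and the output directory are placeholders.
    import html
    import pathlib

    posts = [
        {"slug": "hello-world", "title": "Hello, world", "body": "First post."},
        {"slug": "second-post", "title": "Second post", "body": "More text."},
    ]

    out = pathlib.Path("site")
    out.mkdir(exist_ok=True)

    for post in posts:
        page = (
            "<!DOCTYPE html>\n"
            f"<html><head><title>{html.escape(post['title'])}</title></head>\n"
            f"<body><h1>{html.escape(post['title'])}</h1>"
            f"<p>{html.escape(post['body'])}</p></body></html>\n"
        )
        # Each URL becomes a plain document; no JS needed to read it later.
        (out / (post["slug"] + ".html")).write_text(page, encoding="utf-8")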
> How does archive.org make money?
> I imagine their storage costs must be quite high.
No, they aren't. Building and hosting your own storage is cheap. Same reason Backblaze and Dropbox built their own storage systems.
Archive.org uses S3 extensively. Not exactly cheap.
For instance, on 2nd April 2004 I made my first crap Flash animation for my consultancy company at the time. It's terrible, but watching it now gives me so much nostalgia!
WARNING: REQUIRES FLASH: https://web.archive.org/web/20040402230914/http://noisiadetr...
The other day, I discovered that the Wayback Machine has been archiving YouTube videos in full HD. Most videos aren't on there, of course, and it seems to only go back as far as ~2012 (HTML5 video switchover?), but some of them are there.
Y'all will be getting more donations from me. :)
Edit: I just checked Wikipedia, it says they're using about 15 PB of storage.
Edit 2: 15 PB cost => 15,000 TB x $30/TB = $450,000. Of course, that's a back-of-the-napkin figure (no maintenance, power, etc.). That's not too bad actually.
We save a lot on web content storage by de-duplicating "revisits" when the page hasn't changed. This saves a whole lot for content like jQuery served from a common CDN URL; it doesn't work well when there is a page counter or any trivially changing content on a page.
If you are interested in the storage back-end, it's actually pretty simple: HTTP requests/responses are concatenated and compressed in WARC files (sort of like .tar.gz) that get stored on regular old ext4 filesystems. An index of "what URL captures are in what WARC files on what servers" is continuously generated in the form of, basically, a giant sorted (and sharded) .tsv file; replay requests on web.archive.org look up the URL and timestamp, get a reference to a machine, file, and file offset, and make an HTTP 1.1 range request for the content in question. There are a bunch of other details, like checking robots.txt status, but the core design is super simple, cheap, and (relatively) easy to operate at scale.
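To make that concrete, here's a toy version of the lookup-and-range-request flow in Python. The index line format, the SURT-style key, and the hostname are invented for illustration; the real index and its sharding are more involved.

    import bisect

    # Toy shard of the sorted index:
    # (canonicalized URL key, timestamp, server, WARC file, offset, length)
    INDEX = sorted([
        ("com,example)/", "20040402230914",
         "wb-node-07", "crawl-0423.warc.gz", 1048576, 4096),
    ])

    def lookup(url_key, timestamp):
        # Binary search works because the index is sorted by (key, timestamp).
        i = bisect.bisect_left(INDEX, (url_key, timestamp))
        if i < len(INDEX) and INDEX[i][0] == url_key:
            return INDEX[i]
        return None

    entry = lookup("com,example)/", "20040402230914")
    if entry:
        _, _, server, warc, offset, length = entry
        # Replay then asks that storage node for just this slice of the WARC:
        print(f"GET http://{server}/{warc} HTTP/1.1")
        print(f"Range: bytes={offset}-{offset + length - 1}")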
Apart from web crawl content (including, these days, "heavy" video content which is difficult to de-dupe), we have a large amount of live recorded TV, scanned books (raw photos), etc.
(I currently work at IA)
WARC itself does have a mechanism for dedup when the response is the same (or mostly the same), but it's terrible if the content changes.
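Spotting those duplicates is straightforward if you have the payload digests; a rough sketch with the warcio library (pip install warcio; the filename is a placeholder):

    from warcio.archiveiterator import ArchiveIterator

    seen = {}  # payload digest -> (url, date) of the first capture

    with open("crawl.warc.gz", "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            digest = record.rec_headers.get_header("WARC-Payload-Digest")
            url = record.rec_headers.get_header("WARC-Target-URI")
            date = record.rec_headers.get_header("WARC-Date")
            if digest in seen:
                # A real writer would emit a compact "revisit" record here,
                # pointing at the first capture instead of storing the body again.
                print(f"duplicate: {url} matches {seen[digest]}")
            else:
                seen[digest] = (url, date)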
Still waiting for the Shazam for video: something that can tell two videos are the same even when they're in different codecs/frame sizes/etc., based purely on the visual imagery.
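A poor man's version can be built from perceptual hashes of sampled frames, which are codec- and resolution-independent for near-identical videos (pip install opencv-python pillow imagehash; the threshold is a guess, and this is nowhere near Shazam-grade):

    import cv2
    import imagehash
    from PIL import Image

    def frame_hashes(path, samples=20):
        # Sample evenly spaced frames and perceptually hash each one.
        cap = cv2.VideoCapture(path)
        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        hashes = []
        for i in range(samples):
            cap.set(cv2.CAP_PROP_POS_FRAMES, i * total // samples)
            ok, frame = cap.read()
            if ok:
                rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
                hashes.append(imagehash.phash(Image.fromarray(rgb)))
        cap.release()
        return hashes

    def likely_same(a, b, threshold=10):
        ha, hb = frame_hashes(a), frame_hashes(b)
        if not ha or not hb:
            return False
        # ImageHash subtraction gives the Hamming distance between hashes.
        avg = sum(x - y for x, y in zip(ha, hb)) / min(len(ha), len(hb))
        return avg < threshold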
I imagine it takes at least a million dollars a year to keep the lights on just for the infrastructure.
Does IA have a big endowment to keep it going for a while?
"Today, the Wayback Machine houses some 388B web pages, and its parent ... The Internet Archive’s collection, which spans not just the web, but books, audio 78rpm records, videos, images, and software, amounts to more than 40 petabytes, or 40 million gigabytes, of data. The Wayback Machine makes up about 63% of that."
It's essentially the only cure for the serious link-rot problems in Wikipedia's references.
"We preserve 750 million Web pages per week! We’ve saved 35 petabytes (that’s 35,000,000,000,000,000 bytes) of data."
This also isn't a matter of "just go with cheaper consumer disks". If you care at all about data integrity you won't use consumer-grade SSDs.
That does not factor in compression/dedup, which can gain you substantial savings depending on your workload; I bet if Archive.org wanted to, they could slash their storage using block-level dedup. Of course that creates potential recoverability headaches, so it's possible they don't.
There may well be fascinating ways one can use blockchain for archives of this sort, but the grandparent comment was essentially useless noise.
In terms of making sure the work that is done is useful, that's fairly hard to do: if you let someone dictate what work should be done, then you potentially allow them to do the work upfront, and a 51% attack becomes pretty easy.
I do wonder if there are potential consensus algorithms that rely upon game theory, the threat of being booted off the network, and the need for some investment before they're "trusted" by other peers. The basic idea is that yes, you could 51%-attack the network with a lot less than 51% computing power, but it would never be profitable because if you're discovered you're booted off into your own little partition where nobody plays with you, and the value you get from being part of the network in the future is greater than the value you would get from rewriting history into something only you believe.
51% attacks are possible now - roughly 75% of Bitcoin hashing power is located in China, so if the Chinese government decides that they want to shut it down, they send the military after the largest Bitcoin miners, say "Hash these transactions or else", and take control of the chain. They haven't found it profitable to do so yet, though - what would they gain from it?
Anyway the comment was probably downvoted because it consisted of a single buzzword.
But as one other commenter here has mentioned, you're only a robots.txt amendment away from the oblivion that the entire IMDb comments section fell into, so a good archiving system is essential. I use Save Page WE on Waterfox (no affiliation):
Programmatically submitting to the Wayback Machine is trivial enough, so I have cron jobs backing up most of my static sites (in their entirety) periodically.
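For anyone curious, the submission side can be as simple as hitting the public Save Page Now endpoint for each URL; a sketch of what such a cron job can run (the URL list is a placeholder; be gentle with request rates):

    import time
    import urllib.request

    URLS = [
        "https://example.com/",
        "https://example.com/about",
    ]

    for url in URLS:
        # Requesting web.archive.org/save/<url> asks the Wayback Machine
        # to capture the page now.
        req = urllib.request.Request(
            "https://web.archive.org/save/" + url,
            headers={"User-Agent": "my-archiver/0.1"},
        )
        with urllib.request.urlopen(req) as resp:
            print(url, resp.status)
        time.sleep(10)  # space out submissions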
Pages are archived automatically, and sites are crawled by robots, not humans.
I used to donate every time something reminded me of the value of the archive. Now I just think “that’s why I have a monthly pledge!”
Is there a scalable way to monitor Wikipedia links to see if the content is changed after originally being posted?
They are already storing every link in the Internet Archive when it gets added, so there should be a reference point to compare against.
One easy option would be to make Internet Archive links available for every single link on Wikipedia, even if it hasn't rotted yet. So a 'live' link to the current content, and an archive link for what it was at the time of linking.
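Pairing a live link with its snapshot-at-time-of-citation is doable today with the Wayback Machine's public availability API; a small sketch (the article URL and timestamp are placeholders):

    import json
    import urllib.parse
    import urllib.request

    def snapshot_at(url, timestamp):
        # Returns the snapshot closest to the given YYYYMMDD timestamp, if any.
        q = urllib.parse.urlencode({"url": url, "timestamp": timestamp})
        with urllib.request.urlopen("https://archive.org/wayback/available?" + q) as r:
            data = json.load(r)
        closest = data.get("archived_snapshots", {}).get("closest")
        return closest["url"] if closest else None

    live = "https://example.com/source-article"
    archived = snapshot_at(live, "20150601")  # when the citation was added
    print("live:", live)
    print("archived:", archived)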
The biggest problem here would probably be how to recognize whether the "content" changed. A site can change its full design, navigation, footer, header and everything and still have the exact same "content". For a human being this is simple enough to judge, but a tool might have problems with it.
There are a few solutions to this already, using services like outline.com to pull the content out of the cruft, but I don't know how many of these are general-purpose and how many are purpose-built for each site (and maintained for the current version of the site, perhaps?)
As seen in the article, most links are to a small number of sites, so perhaps hard coding the content extraction would be feasible, especially for an initial study.
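A site-agnostic first pass might strip the chrome with a readability-style extractor and hash only the main text; a sketch with the trafilatura library (pip install trafilatura; the URLs are placeholders, and this will misfire on plenty of layouts):

    import hashlib
    import trafilatura

    def content_fingerprint(html):
        # Extract the main article text, ignoring navigation/footer chrome.
        text = trafilatura.extract(html or "") or ""
        # Normalize whitespace so cosmetic reflows don't register as change.
        text = " ".join(text.split())
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    then_html = trafilatura.fetch_url(
        "https://web.archive.org/web/20150601/https://example.com/article")
    now_html = trafilatura.fetch_url("https://example.com/article")

    if content_fingerprint(then_html) != content_fingerprint(now_html):
        print("main content has drifted since the citation was added")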
It would be interesting I think to see just how many links have identical content, but you're right in that the number will be skewed greatly if there are any ads or similar included.
I guess it doesn't, but assuming that IA records every link as it gets added you can use the history page to see which snapshot on the wayback machine corresponds to the link at the time it was added.
You could use that to compare to now, but unless you made sure to snapshot the content regularly (whenever the page is edited?) you wouldn't necessarily know when the content drift happened, or if multiple changes have happened to the content over time.
Using the history page would likely give you enough information to, for example, do a study on links to find out how many have different content now vs when the link was added.
Though IA's archival-on-creation mechanism should at least mean that the original reference is preserved.
An explicit Wikipedia link to the reference as it was at archive time, or better, a diff flag or listing, would be useful.
Sure, the archive is useful today, but its primary purpose is retaining information for future generations. If that means placating copyright holders, it's worth the cost.
In many cases it's nothing to do with the copyright holder. In fact, the complete opposite: someone who subsequently bought the domain and unwittingly stuck a generic robots.txt on the site.
If the IA hid content upon receiving a request from webmaster@domain then that'd make sense. But doing it automatically and retrospectively from robots.txt is bizarre and shows again the dangers of centralisation on the Internet.
I now block IA from my sites as a protest. We need more competition and fresh ideas in digital archiving.
I've read about the robots.txt behavior before but hadn't seen this mentioned. Any idea if they describe it somewhere on the site?
They are gradually expanding the scope of sites where robots.txt is ignored, perhaps "testing the water".
What you're doing is essentially digital archeology, which is super cool. In 50-100 years, if not sooner, people will be digging through digital "graveyards" for evidence of this and that. That's so intriguing.
I think the statement about "stuff that's out there" really only applies to famous or public people, where leaked and/or damning photos or videos are quickly copied, saved, and rehosted by websites all over the world, including Twitter, Pinterest, and other platforms. For instance, while Google Images fastidiously won't show you the hacked photos of "Jennifer Lawrence naked," as Google sought to avoid a $100M lawsuit, Bing Images, once you turn off Safe Search, shows plenty of sites that host the pictures, the most frequent being a German-based one called "OhFree," but there are at least 3 Blogspot sites as well, ironically enough.
 > "We've removed tens of thousands of pictures," says the web giant - https://www.hollywoodreporter.com/thr-esq/google-responds-je...
I liked how Archive.is was so fast at archiving, and its UI was cleaner. And since it proactively archived links, it still happens today that a dead reference link will be archived on Archive.is but not in the Wayback Machine.
Nobody knows who owns or maintains the site, and recently the mysterious owner started taking donations to keep the site running. It's a commercial enterprise.
Slick UI or not, Archive.org's longevity is probably more assured.
For people who make those kinds of sites, it's a feature that you can't archive them easily. They are a cancer.