I worked at the Archive for a few years, remotely. It permanently altered my view of the tech world. Here are the notable differences; I think these would apply to several non-profits, but this is my experience.
1. There was no rush to pick the latest technologies. Tried and tested was much better than new and shiny. Archive.org was mostly old PHP and shell scripts (at least the parts I worked on).
2. The software was just a necessity. The data was what was valuable. Archive.org itself had tons of kluges and several crude bits of code to keep it going, but the aim was to keep the data secure, and it did that. Someone (maybe Brewster himself) likened it to a ship traveling through time. Several repairs with limited resources have permanently scarred the ship but the cargo is safe and pristine. When it finally arrives, the ship itself will be dismantled or might just crumble but the cargo will be there for the future.
3. Everything was super simple. Some of the techniques to run things etc. were absurdly simple and purposely so to help keep the thing manageable. Storage formats were straightforward so that even if a hard disk from the archive were found in a landfill a century from now, the contents would be usable (unlike if it were some kind of complex filesystem across multiple disks).
4. Brewster, and consequently the crew, were all dedicated to protecting the user. e.g. https://blog.archive.org/2011/01/04/brewster-kahle-receives-.... There was code and stuff in place to not even accidentally collect data so that even if everything was confiscated, the user identities would be safe.
5. There was a mission. A serious social mission. Not just "make money" or "build cool stuff" or anything. There was a buzz that made you feel like you were playing your role in mankind's intellectual history. That's an amazing feeling that I've never been able to replicate.
Archive.org is truly one of the most underappreciated corners of the world wide web. It gives me faith in the positive potential of the internet.
> There was a mission. A serious social mission. Not just "make money" or "build cool stuff" or anything. There was a buzz that made you feel like you were playing your role in mankind's intellectual history. That's an amazing feeling that I've never been able to replicate.
This resonates with me. Sometimes we developers need to get off the "move fast and break stuff" bandwagon (which has been going on for decades now) and consider that we're the ones responsible for preserving almost all human digital heritage of our epoch. There's a simple and obvious method to implement preservation-friendly content implicit in the web architecture: emit/materialize everything as plain HTML, even dynamic content. This is of course antithetical to most of this decade's SPA web development trends, but I think it's worth drawing a line between web content (worth preserving in the first place) and web apps (which have highly volatile content not worth preserving). I feel like this distinction isn't considered sufficiently in our staged web app architecture discussions, which are all about your latest JS MVw framework, to the degree that newbie web devs really don't learn the fundamentals of HTML etc. anymore, and are led to use React, Vue, etc. for content-based web sites.
Thank you for sharing this. I’ve donated (small amounts of) money to Mozilla and Wikipedia in the past. Your post makes me consider donating to archive.org this year.
Can you provide a citation? To my knowledge, the Archive does not use Amazon's S3 storage system (though they do refer to their own system as "S3" in places [1]); they're on their own internal storage system [2].
To the best of my knowledge, the Archive has its own machines to store data. It is an Archive, and one of the principles was to have the know-how to preserve data even if the cloud providers disappear.
This will be my first year attending the annual event, and I'm super excited! I'm a long time fan of the Wayback Machine, even though my early work is terribly embarrassing :) Love what the archive does, which is so much more than snapshots of old websites. If you somehow have not read the mission statement, it's worth a minute: https://archive.org/about/
I'm now almost weirdly proud of the crap work of mine that I can find on the wayback machine :-)
For instance: 2nd April 2004 [1] was my first crap Flash animation for my consultancy company at the time. It's terrible, but watching it now gives me so much nostalgia!
What an entitled attitude to throw at one of mankind's greatest digital resources, without even researching the subject. And you have the balls to ask "When did HN turn into Reddit?" (https://news.ycombinator.com/item?id=18092654) ...
The other day, I discovered that the Wayback Machine has been archiving YouTube videos in full HD. Most videos aren't on there, of course, and it seems to only go back as far as ~2012 (HTML5 video switchover?), but some of them are there.
I was wondering if something like that existed. A lot of times I add videos to my YouTube Watch Later list, and "later" can often be much later; by the time I get around to them, it's not unusual for videos to have been removed by the user or deleted for unknown reasons. And there's no text left to even see what the video was.
This is why I backup videos with youtube-dl. Internet history can be completely erased at a moment's notice, and some things ought to belong to the public if they're to never see the light of day again otherwise.
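For anyone curious, the sketch below is roughly what my backup script looks like, using youtube-dl's Python API rather than the CLI; the URL list and output template are placeholders, not anything special.

    # Minimal sketch: archive a list of videos with youtube-dl's Python API.
    # The URL list and the output template are placeholders.
    import youtube_dl  # pip install youtube-dl

    urls = ["https://www.youtube.com/watch?v=XXXXXXXXXXX"]  # your saved videos

    opts = {
        "outtmpl": "backup/%(uploader)s/%(title)s-%(id)s.%(ext)s",
        "writeinfojson": True,   # keep title/description even if the video later vanishes
        "ignoreerrors": True,    # skip already-deleted entries instead of aborting the run
    }

    with youtube_dl.YoutubeDL(opts) as ydl:
        ydl.download(urls)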
I'd suggest a slight revision to your theorem:
> Internet history can be completely erased at a moment's notice (unless it would be embarrassing for you later in life)...
I've always wondered this: how does Archive.org work in terms of storage? The internet is massive, and caching every single site periodically for years on end, isn't that an unreasonably huge amount of data?
Edit: I just checked Wikipedia, it says they're using about 15 PB of storage.
Edit 2: 15 PB cost => 15,000 TB x $30/TB = $450,000. Of course, that's a back-of-the-napkin cost (no maintenance, power, etc.). That's not too bad actually.
The Archive currently has about 46 Petabytes of content ("bytes archived"), and over 120 PB of raw disk capacity; the difference is due to data replication, "currently filling" storage, non-storage infrastructure, etc.
We save a lot on web content storage by de-duplicating "revisits" when the page hasn't changed. This works out to save a whole lot for content like jQuery served from a common CDN URL; it doesn't work well when there is a page counter or any trivial changing content on a page.
If you are interested in the storage back-end, it's actually pretty simple: HTTP requests/responses are concatenated and compressed in WARC files (sort of like .tar.gz) that get stored on regular old ext4 filesystems. An index of "what URL captures are in what WARC files on what servers" is continuously generated in the form of, basically, a giant sorted (and sharded) .tsv file; replay requests on web.archive.org look up the URL and timestamp and get a reference to a machine, file, and file offset, and make an HTTP 1.1 range request for the content in question. There are a bunch of other details, like checking robots.txt status, but the core design is super simple, cheap, and (relatively) easy to operate at scale.
Apart from web crawl content (including, these days, "heavy" video content which is difficult to de-dupe), we have a large amount of live recorded TV, scanned books (raw photos), etc.
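To make the replay lookup above concrete, here is a rough sketch; the index field layout, hostnames, and paths are stand-ins for illustration, not our actual schema.

    # Rough sketch of the replay lookup described above. The index row format,
    # hostname, and WARC path are assumptions, not the real schema.
    import gzip, io, urllib.request

    def fetch_capture(index_row):
        # Hypothetical sorted-index row: url, timestamp, server, warc_path, offset, length
        url, ts, server, warc_path, offset, length = index_row.split("\t")
        offset, length = int(offset), int(length)
        req = urllib.request.Request(
            f"https://{server}/{warc_path}",
            headers={"Range": f"bytes={offset}-{offset + length - 1}"},  # HTTP/1.1 range request
        )
        with urllib.request.urlopen(req) as resp:  # expect 206 Partial Content
            member = resp.read()
        # Each capture is its own gzip member; decompressing yields the stored
        # record (WARC headers plus the archived HTTP response).
        return gzip.GzipFile(fileobj=io.BytesIO(member)).read()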
Have you evaluated compression algorithms that support custom dictionaries, like zstd? You could generate a compression dictionary for each domain, or just for those above a certain size.
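For what it's worth, a per-domain dictionary is easy to prototype with the python-zstandard bindings; the directory layout below is made up purely for illustration.

    # Prototype of per-domain dictionary compression with python-zstandard.
    # The samples/ directory layout is invented for this example.
    import glob
    import zstandard as zstd

    samples = [open(p, "rb").read() for p in glob.glob("samples/example.com/*.html")]

    # Train a shared ~110 KB dictionary from one domain's pages
    dictionary = zstd.train_dictionary(112_640, samples)

    cctx = zstd.ZstdCompressor(level=19, dict_data=dictionary)
    dctx = zstd.ZstdDecompressor(dict_data=dictionary)

    blob = cctx.compress(samples[0])
    assert dctx.decompress(blob) == samples[0]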
> (including, these days, "heavy" video content which is difficult to de-dupe)
Still waiting for the Shazam for video to know that 2 videos are the same even when they are of different codecs/framesizes/etc; just based on the visual imagery.
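The usual building block for that is a perceptual hash over sampled frames. Here's a toy difference-hash with Pillow, assuming frames have already been pulled out of both encodes (e.g. with ffmpeg) at matching timestamps; real systems are far more robust than this.

    # Toy difference-hash (dHash) over video frames, assuming frames were
    # already extracted from both encodes at matching timestamps.
    from PIL import Image

    def dhash(frame_path, size=8):
        # Grayscale, shrink to (size+1) x size, compare horizontal neighbours
        img = Image.open(frame_path).convert("L").resize((size + 1, size))
        px = list(img.getdata())
        bits = 0
        for row in range(size):
            for col in range(size):
                bits = (bits << 1) | (px[row * (size + 1) + col] > px[row * (size + 1) + col + 1])
        return bits

    def hamming(a, b):
        return bin(a ^ b).count("1")

    # A small Hamming distance between hashes of corresponding frames suggests
    # the same footage, regardless of codec or frame size.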
"Today, the Wayback Machine houses some 388B web pages, and its parent ... The Internet Archive’s collection, which spans not just the web, but books, audio 78rpm records, videos, images, and software, amounts to more than 40 petabytes, or 40 million gigabytes, of data. The Wayback Machine makes up about 63% of that."
That doesn't account for redundancy and backups. I have asked some storage cost questions on /r/sysadmin and /r/datahoarder in the past about cost per TB, and I tend to see numbers ranging from $300 to $1000 per TB of usable space.
SSDs are about $400/TB for raw storage. Prices are substantially different for consumer and enterprise spaces, but 4TB disks (Micron, Samsung, etc) tend to be about $1500 and 8TB around $3200. If you factor in redundancy, you're looking at around $1000-1500 per TB of usable storage.
This also isn't a matter of "just go with cheaper consumer disks". If you care at all about data integrity you won't use consumer-grade SSDs.
3-way mirrored spinning storage would cost roughly $100/TB. 12TB disks are about $350 apiece these days, which works out to about $29/TB raw, or roughly $90/TB once you triple it for mirroring, before chassis and power. There are a whole bunch of possible considerations and alternate scenarios, but this is probably the worst case.
That does not factor in compression / dedup, which can gain you a substantial amount of savings depending on your work profile; I bet if Archive.org wanted to, they could slash their storage using block-level dedup. Of course that creates potential recoverability headaches, so it's possible they don't.
I see this is downvoted. Actually, if a blockchain used proof of actual useful storage or computation, that'd be great. (instead of wasting energy by computing "useless" hashes)
Proof of storage is used in siacoin, but not to secure the network. I haven't seen a scheme where proof of storage can be used to secure the network.
In terms of making sure the work that is done is useful it's fairly hard to do that because if you let someone dictate what work should be done then you potentially allow them to do the work upfront and a 51% attack becomes pretty easy.
"you potentially allow them to do the work upfront and a 51% attack becomes pretty easy."
I do wonder if there are potential consensus algorithms that rely upon game theory, the threat of being booted off the network, and the need for some investment before they're "trusted" by other peers. The basic idea is that yes, you could 51%-attack the network with a lot less than 51% computing power, but it would never be profitable because if you're discovered you're booted off into your own little partition where nobody plays with you, and the value you get from being part of the network in the future is greater than the value you would get from rewriting history into something only you believe.
51% attacks are possible now - roughly 75% of Bitcoin hashing power is located in China, so if the Chinese government decides that they want to shut it down, they send the military after the largest Bitcoin miners, say "Hash these transactions or else", and take control of the chain. They haven't found it profitable to do so yet, though - what would they gain from it?
I suppose that would be kind of interesting. Although if it's conceivable that someone could just store all the useful information then you can't just request a random piece of information as a proof of 'work', so you'd need something clever.
Anyway the comment was probably downvoted because it consisted of a single buzzword.
Blockchain actually could be used to provide data integrity in a distributed environment with known bad actors (presuming that they fall under the 50% threshold), but that is not applicable in this case since archive data is not stored in a distributed fashion by clients (as far as I know). Having worked on a production application that uses blockchain, my takeaway was that it cleverly provides distributed ledger transparency, but at a price that makes it very inefficient as a data store.
It's an invaluable resource. In the past week I realized it is quite likely that a site I used to work at which hosts a lot of my portfolio might disappear or be heavily amended. So I located and indexed a complete list of my articles there -- and I was even able to click a button and create an archive for the few pages that IA hadn't bothered to index (I wish I had known about this before, since it was a shock to find that IA can be quite selective, and to find a page you were hoping was there simply isn't, and is now irretrievable).
But as one other commenter here has mentioned, you're only a robots.txt change away from the oblivion that the entire IMDB comments section fell into [1], so a good archiving system is essential. I use Save Page WE on Waterfox (no affiliation).
The Wayback Machine isn't Googlebot; it doesn't crawl the web, so there's no such thing as "hadn't bothered to index"... Someone, be it a human or a bot, needs to submit a page for archival.
Programmatically submitting to Wayback Machine is trivial enough, so I have cron jobs backing up most of my static sites (in their entirety) periodically.
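In case it's useful to anyone: the cron job boils down to something like the sketch below, hitting the Save Page Now endpoint for each URL. The URL list is a placeholder, and you'll want real error handling and rate limiting for anything serious.

    # Minimal sketch of a cron-driven Wayback Machine submitter using the
    # Save Page Now endpoint (web.archive.org/save/<url>). URL list is a placeholder.
    import time
    import urllib.request

    URLS = [
        "https://example.com/",
        "https://example.com/blog/",
    ]

    for url in URLS:
        try:
            # A plain GET of /save/<url> asks the Wayback Machine to capture that page
            urllib.request.urlopen("https://web.archive.org/save/" + url, timeout=60)
            print("submitted", url)
        except Exception as exc:
            print("failed to submit", url, exc)
        time.sleep(10)  # be polite; don't hammer the endpoint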
This just reminded me to donate. I've used the archive several times just in the past few days to resolve 404s on old gamedev blogs. I'm amazed how often what I'm looking for is in the archive considering how big the internet is and how niche the content I'm looking for. Truly an amazing resource comparable to Wikipedia in value.
Also, your employer may match charitable contributions (up to a predefined amount). Check if they do! It’s effectively free money for the Internet Archive.
I got a call a few years ago from a member of a humanitarian organization that had accidentally lost a significant percentage of their web site detailing projects that they had completed over many years with no backups. The people that had completed the work had moved on and they were frantic that the work was gone forever, but the Wayback Machine had almost perfect records to restore everything.
There is a GitHub project out there where you can specify the site and it will rebuild the content locally from Wayback content. Something to consider for last-resort recovery.
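If you'd rather not depend on a third-party tool, the Wayback CDX API makes a bare-bones version of this easy to sketch. The snippet below just lists the snapshot URLs you would then download; error handling, rate limiting, and asset rewriting are all left out, and the site name is a placeholder.

    # Bare-bones sketch: list captures for a site via the Wayback CDX API and
    # print the snapshot URLs you would download to rebuild it locally.
    import json
    import urllib.request

    site = "example.com"  # placeholder
    cdx = ("https://web.archive.org/cdx/search/cdx"
           f"?url={site}&matchType=prefix&output=json&filter=statuscode:200&limit=50")

    rows = json.load(urllib.request.urlopen(cdx))
    header, captures = rows[0], rows[1:]  # first row is the field names

    for row in captures:
        fields = dict(zip(header, row))
        print(f"https://web.archive.org/web/{fields['timestamp']}/{fields['original']}")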
They mention at the end of the article that 'content drift' may be a bigger issue than link rot; when the content of the post is simply changed rather than missing, it is much harder to notice.
Is there a scalable way to monitor Wikipedia links to see if the content is changed after originally being posted?
They are already storing every link in the Internet Archive when it gets added, so there should be a reference point to compare against.
One easy option would be to make Internet Archive links available for every single link on Wikipedia, even if it hasn't rotted yet. So a 'live' link to the current content, and an archive link for what it was at the time of linking.
The biggest problem in this would probably be how to recognize if "content" changed. A site can change the full design, navigation, footer and header and everything and still have the exact same "content". For a human being this will be simple enough to understand, but a tool might have its problems with that.
Yes, this is a fundamental issue if you wanted to do this at scale.
There are a few solutions to this already, using services like outline.com to pull the content out of the cruft, but I don't know how many of these are general purpose and how many are purpose-built for each site (and maintained for the current version of the site, perhaps?)
As seen in the article, most links are to a small number of sites, so perhaps hard coding the content extraction would be feasible, especially for an initial study.
It would be interesting I think to see just how many links have identical content, but you're right in that the number will be skewed greatly if there are any ads or similar included.
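As a starting point, even a crude "strip the chrome, hash the text" fingerprint would catch the obvious cases; something like the sketch below, with the caveat that real drift detection would need much smarter boilerplate and ad removal.

    # Crude content fingerprint: drop scripts/styles/nav chrome, normalise
    # whitespace, hash what's left. Catches obvious drift, not subtle edits.
    import hashlib
    import re
    from html.parser import HTMLParser

    SKIP_TAGS = {"script", "style", "nav", "header", "footer"}

    class TextOnly(HTMLParser):
        def __init__(self):
            super().__init__()
            self.depth = 0
            self.chunks = []
        def handle_starttag(self, tag, attrs):
            if tag in SKIP_TAGS:
                self.depth += 1
        def handle_endtag(self, tag):
            if tag in SKIP_TAGS and self.depth:
                self.depth -= 1
        def handle_data(self, data):
            if not self.depth:
                self.chunks.append(data)

    def fingerprint(html):
        p = TextOnly()
        p.feed(html)
        text = re.sub(r"\s+", " ", " ".join(p.chunks)).strip().lower()
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    # Same fingerprint for the archived snapshot and the live page -> probably
    # no drift; different -> flag the citation for human review.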
Does the article's history page keep a record of what the content of the link was when it was updated?
I guess it doesn't, but assuming that IA records every link as it gets added you can use the history page to see which snapshot on the wayback machine corresponds to the link at the time it was added.
You could use that to compare to now, but unless you made sure to snapshot the content regularly (whenever the page is edited?) you wouldn't necessarily know when the content drift happened, or if multiple changes have happened to the content over time.
Using the history page would likely give you enough information to, for example, do a study on links to find out how many have different content now vs when the link was added.
Thank you for pointing this out, I will see what I can do to help, this is the sort of thing I'm more than happy to dedicate resources to. I still have a 36 drive enclosure lying around that would make a nice bit of storage if I can get it to be silent enough for the home.
And they are one robots.txt away from cancellation. For all the good they do, retroactively applying robots exclusions to their crawler is a terrible thing. Luckily there are alternatives going forward.
They keep the data and just don’t display it. The last thing they need is a court order demanding they delete it.
Sure, the archive is useful today, but its primary purpose is retaining information for future generations. If that means placating copyright holders, it's worth the cost.
> If that means placating copyright holders it’s worth the cost.
In many cases it's nothing to do with the copyright holder. In fact, the complete opposite: someone who subsequently bought the domain and unwittingly stuck a generic robots.txt on the site.
If the IA hid content upon receiving a request from webmaster@domain then that'd make sense. But doing it automatically and retrospectively from robots.txt is bizarre and shows again the dangers of centralisation on the Internet.
I now block IA from my sites as a protest. We need more competition and fresh ideas in digital archiving.
A few months ago we stopped referring to robots.txt files on U.S. government and military web sites for both crawling and displaying web pages (though we respond to removal requests sent to info@archive.org). As we have moved towards broader access it has not caused problems, which we take as a good sign. We are now looking to do this more broadly.
I spend quite some time on archive.org. Of course the Wayback Machine is great, but I am mostly interested in old digitized media. There really is some great stuff on there, but what is really missing is organization (and a less broken-seeming website, I guess). It doesn't help much to have a great archive if no one can find anything. User curation would help a lot with this.
What you're doing is essentially digital archeology, which is super cool. In 50-100 years, if not sooner, people will be digging through digital "graveyards" for evidence of this and that. That's so intriguing.
I like this term “digital archaeology”. I find myself saying it a lot when dealing with “only” 20 year old database data in my day job. Apparently, but not surprisingly, it’s a real thing[1]!
There's long been a saying that "Once it's out there (on the Internet), it's forever," but I used to save links in a Microsoft Word document, and when I went through them a few years later, almost none of them worked anymore. The years in which they were saved were 2006 to 2009, and the year I went through them was 2012. The links were from MySpace (which totally overhauled the entire site and all content), Facebook (where users had deleted their profiles or pictures), Tumblr (where bloggers rename their blogs, which changes the URL, or wipe them clean, or delete them entirely), YouTube (tons of videos and whole accounts have been deleted over copyright infringement, whether by the account holder or by YouTube itself), Blogspot (same, but also some bloggers made their blogs private, perhaps to prevent spam comments or trolling), Yahoo articles (which I see Yahoo deletes after some time), Style.com (Vogue magazine's website of all runway shows, which are now on Vogue.com instead, with a different URL structure), and dozens of other websites that don't exist anymore.
I think the statement about "stuff that's out there" really only applies to famous or public people, where leaked and/or damning photos or videos are quickly copied, saved, and rehosted by websites all over the world, including Twitter, Pinterest, and other platforms. For instance, while Google Images fastidiously won't show you the hacked photos of "Jennifer Lawrence naked," as Google sought to avoid a $100M lawsuit [0], Bing Images, once you turn off Safe Search, shows plenty of sites that host the pictures, with the most frequent such site being a German-based one called "OhFree," but there are at least 3 Blogspot sites as well, I suppose ironically.
There was a big drama in 2012. http://Archive.is was proactively archiving Wikipedia links. An unauthorized bot (RotlinkBot) was linking to Archive.is. The bot was banned.
I liked how Archive.is was so fast at archiving, and its UI was cleaner. And since it proactively archived links, it still happens today that a dead reference link will be archived on Archive.is but not in the Wayback Machine.
I'm guessing Archive.is will probably disappear within the next 5 years, taking all data down with it.
Nobody knows who owns or maintains the site, and recently the mysterious owner started taking donations to keep the site running. It's a commercial enterprise.
Slick UI or not, Archive.org's longevity is probably more feasible.
They save JavaScript, Flash apps, and even some downloads too. Just recently I used them to get an old Flash game from a studio that went bust a couple of years ago.
Developers like using those time saving frameworks because it's easier. The management likes them because it makes linking to, downloading, or otherwise inspecting page content so much harder.
For people that make those kinds of sites it is a feature that you can't archive them easily. They are a cancer.
If a tree falls in the forest and nobody is there to hear it, does it make a noise? If IA is storing copyrighted (noarchive) content but not displaying it, does that make it acceptable?
Yes, because copyright only lasts for a finite duration. In 100+ years, when the original rights-holders have disappeared, or when copyright expires and the original rights-holders have no incentive to keep the originals, the Internet Archive and similar archiving efforts will be able to make their copies available as a matter of historical importance.
Yes, because then it's saved. One day we'll have a little memory stick that contains the collected and organised works of mankind (all movies, all music, all websites, digitised books, ...) and it'll be thanks to the archivists.