Wayback Machine Hits 400,000,000,000 (archive.org)
300 points by tweakz on May 9, 2014 | 56 comments



And not one of those hits is from Quora due to their robots.txt:

https://web.archive.org/web/http://www.quora.com/

Good job Quora, preserving all that crowd-sourced content away from the crowd, keeping it from everyone not logged in. Hats off to getting into YC so you can post your job openings on the HN home page and get some press. This doesn't add anything to your image, though; it just takes a little away from YC.

On another, better note: a great big thank you to the Wayback Machine for all of the public good it does. Now there's an organization that is amazing and wonderful and enriching our lives in an open and honest way with information.


Thank you. I've hated Quora since day one precisely because of this - they're the archetype of a company taking all its value from the community and giving absolutely nothing back. Sadly, every time I've voiced this opinion in the Bay Area, I'm met with blank stares.

My dream would be a Quora-like service run by Wikimedia (or a similar organization), but that'll likely never happen.


Man. I hate on Quora. Didn't realize they were YC. Shame.


And why would you do that? I can see why the must-register roadblock is annoying in theory, but it takes a few minutes to register, and then it's an absolute gold mine of knowledge, with an extraordinary community that is very positive, constructive, and helpful. I find it way more engaging than HN, actually, to be honest.


If you want to have your private answer community, that's fine. Don't go whoring it on Google and other search engines as if the results are publicly visible when they're not. I'm also opposed on principle to sites which require registration and (in theory if not in practice) log all of my interactions -- what I read and how I interact with it is very personal information and data. It's why I'm opposed to the idea of logging in to, say, read The New York Times. It's one of a very small number of sites I'd subscribe to, but even then I'd prefer to access it without logging in.

I've created throwaway accounts a few times to see if there's anything sufficiently worthwhile there. I'm largely convinced that 1) there's not and 2) the dynamics of the site work against it scaling or surviving.

So I'm not going to waste my time with it.


As expected, I am getting downvoted because of my pro-Quora stance. You clearly haven't spent much time on the site. The beauty of it is that you can reach, and ask questions of, people you normally can only dream of reaching. NASA engineers, PhDs, entrepreneurs, famous bloggers, actors, filmmakers, world-class athletes, they are all there. If you have a specific question, ask away. I once needed to find out how Gana.com, the biggest music startup in India, operated; I asked its CEO a question, and within a day I had a response. For all this amazing value, I don't see why taking a few minutes to register is such a big deal.


Note that as the parent of your responses, I certainly can't downvote. And from the HN Guidelines: "Resist complaining about being downmodded. It never does any good, and it makes boring reading."

http://ycombinator.com/newsguidelines.html

Regarding Quora: its behavior runs counter to good netizenship. Clearly, people have an issue with that, and among those people is Paul Graham (as several others have cited and referenced in the discussions over the past few days).

Regarding who you can encounter on Quora: in my experience, the level and quality of conversation reached on a clueful but open site tends to be generally higher. You'll find quite expert people on HN (I've run across Charlie Stross a few times), and I've been online long enough to have seen numerous highly-qualified participants on Usenet, Slashdot, reddit, and even G+ (there are still numerous techies there). President Obama is among those who've run reddit AMAs, and I'll routinely find very qualified people within various subreddits there.

And again: the conversations are visible to whomever wants to view them without restriction.

It's not the registration, it's the morality.


I've found the Quora community to be great, but the register-walling and the robots.txt are very closed-minded, especially toward the Internet as a whole. Most of the time when I find a link to Quora, I am turned off by the fact that I can't read more than half an answer without having to sign in. Imagine if Stack Overflow did the same thing? They'd be dead.


The robots.txt thing really ices the cake.

The whole benefit of online discussion in general is to create as wide and persistent an archive of both questions and answers as possible. To that extent, I applaud StackExchange, even with its aggressive moderation (not too different from better-managed subreddits such as /r/askhistorians or /r/askscience): the audience isn't the person asking the question but all people who might want to know the answer to a given question. And they're better served by having a canonical response.

Quora blocking (explicitly) TIA means that that knowledge stays locked in their silo. Pretty much forever. Fuck that.

Just yesterday I was musing on energy (something I do a lot) and the fact that even in such advanced energy systems as nuclear fission and fusion powerplants (real or theoretical), all this advanced energy, provided directly in the form of electron volts, is degraded to heat, used to create steam, and then to spin turbines. That's fundamentally a technology that's hundreds of years old (and I think early steam or heat turbines go back further). Windmills date from the 1400s.

So I checked AskScience on reddit, and it turns out the question's been asked a few times. And I could read the responses. And, while, yes, I've got an account there, it's not necessary for that purpose:

http://www.reddit.com/r/askscience/search?q=electricity+nucl...


Stack Overflow was created in response to a site that was doing something similar to what Quora is doing with search engines, except that site required a paid membership to see the answers. I recall one trick that people used before SO was around, which was to spoof the User-Agent to be GoogleBot. Doing so would show you the answer(s) to the question. :)
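
For the curious, the trick amounted to nothing more than sending Googlebot's User-Agent string with the request. A minimal sketch (the URL is hypothetical, and plenty of sites now detect and penalize this):

  # Fetch a page while claiming to be Googlebot. Hypothetical URL; a site
  # cloaking for search engines would serve the full answer text here.
  import urllib.request

  url = "http://example.com/some-paywalled-question"
  req = urllib.request.Request(url, headers={
      "User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; "
                    "+http://www.google.com/bot.html)",
  })
  with urllib.request.urlopen(req) as resp:
      print(resp.read().decode("utf-8", errors="replace")[:500])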


I thought Google specifically penalised sites which did that?


DISCLAIMER: I'm one of Experts Exchange's volunteer administrators.

Google penalized, very heavily, sites that redirected from SERPs or buried content below a lot of "sign up here" stuff; you can even make the case that it could have been called the "EE Penalty" instead of Panda, if only because Google's web search team (i.e. Matt Cutts) collaborated heavily with the Stack ownership in developing the algorithm changes and consequences.

That's not to excuse EE's management's behavior -- quite the contrary. From the perspective of longtime users, EE's string of decisions, the consequences of those decisions, and its reactions to those consequences starting not quite a decade ago nearly destroyed what had been a vibrant community.

If there's anything good to have come of it, it's that EE has finally moved -- about six or seven months ago -- to a business model that allows the non-member to see what members see: the entire question and solutions. Joining for free does have a few minor advantages; paying (either with a credit card or by answering questions) has more.

But it took a long time for EE to learn those lessons and begin to implement those fixes. Whether Quora will learn them is a whole 'nother story; like so many other sites, it built its systems and market-share without much thought given to how it was going to monetize.

EE made that mistake too, in 1997. Quora has a lot of money backing it up, so it can maintain its facade for a long time... but I wouldn't be placing any bets on it being around 15 years from now if it actually had to depend on income.


I haven't seen an expertsexchange result come up in google in a long time.


At some point they switched to a slightly different model where you would see the normal obfuscated answers with a request to log in at the top of the page, but if you scrolled down far enough, the answers would be there in clear text.

They were certainly walking a very fine line, but Google seemed to give them a pass with this system for quite some time.


You can append ?share=1 to show the content. (A workaround which should not be required but sadly is.)
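
For example (hypothetical question URL):

  http://www.quora.com/Some-question-slug?share=1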


PSA/ranty thing: Just because something is archived in the Wayback machine, do not trust that archive.org will keep it there for all time. If you need something, make a local copy! A few months ago TIA changed their stance on robots.txt. They now retroactively honor robot blocks. Now any site can completely vanish from the archives.

Let's say I died tomorrow. My family lets my domain slip. A squatter buys it and throws up a stock landing page, with a robots.txt that forbids spidering. TIA would delete my entire site from their index.
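
To make the failure mode concrete, here's a minimal sketch using Python's standard robotparser and the "ia_archiver" user-agent token the Archive's crawler has historically used (example.com stands in for the squatted domain):

  # Would the current robots.txt block the Archive's crawler?
  import urllib.robotparser

  rp = urllib.robotparser.RobotFileParser()
  rp.set_url("http://example.com/robots.txt")   # hypothetical squatted domain
  rp.read()

  # A squatter's "User-agent: *" / "Disallow: /" makes this False, and the
  # retroactive policy then hides every past capture of the domain as well.
  print(rp.can_fetch("ia_archiver", "http://example.com/any/old/page.html"))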

I've already lost a few good sites to this sort of thing. If you depend on a resource, archive it yourself.

edit - Official policy: https://archive.org/about/exclude.php

If I am reading it properly, once blocked they never check later in case of a change of heart? No procedure for getting re-indexed at all?


A fair rant, but to correct some misperceptions:

The retroactive application of robots.txt is not a new policy; it's been in place for at least 11 years, and I believe it arrived very soon after the Wayback Machine was first unveiled.

An updated robots.txt does not irreversibly delete prior captures, so if the robots.txt changes again, access to previously-collected material can be re-enabled.

This policy has served to minimize risk and automate the most common removal scenario, when a webmaster wants a total opt-out of current crawling and past display. But the collateral damage to unrelated content from prior domain-owners has grown as the web has aged and more domains have changed hands. (The tradeoff that made sense in 2002 probably doesn't make sense in 2014.)

Figuring out a system that can automate most legitimate exclusions, while rejecting or reversing those that lack a firm basis in content ownership or personal privacy, is a thorny task, but it would be worth pursuing if/when the Wayback Machine has the necessary staff resources.

(My proposal since 2008 has been a DMCA-inspired 'put-back' procedure, where an original content owner can assert, formally, that they are the content owner and do not want the current-day robots.txt applied to captures before a certain date. Then, the current domain-owner would have to counter-notify that to maintain the block. This idea hasn't had legal review, but would reverse some current damage, and any bad-faith blockers would have to go on record with a false claim to maintain the block, potentially exposing them to a third-party legal challenge, with minimal risk to IA.)


Whoops, I probably should have checked the exclude.php page in the Wayback Machine before pinning a date on it. My bad.

> An updated robots.txt does not irreversibly delete prior captures, so if the robots.txt changes again, access to previously-collected material can be re-enabled.

Is there any official statement you can cite? Generally TIA does not mince words and is honest. If they say "removed" then I would assume they are not doing Facebook-style "deletion" shenanigans.


I don't know of a linkable official statement to that effect. Unfortunately much of the information available (for example in the official onsite FAQ) is incomplete or outdated. The Wayback Machine gets only a tiny fraction of the product-management, public communication, documentation, end-user support, and fundraising that it should.


Logging domain registration information (WHOIS records) and having an alternative method of applying opt-outs for sites whose domains have transferred might help address this.

It's not clear-cut. WHOIS records are principally oriented at the domain itself. They may not clearly identify a registrant, registrants' identities may not be clearly evident across renewals, changes, or even registrar-related actions (M&A, failures, splits), the formats (and quality) vary wildly by registrar, some registrars (attempt to) restrict access and use of the records, and more.

But it's a start.

Hell, a TIA archive of WHOIS registrations might itself be useful....


Archive.org honors the robots.txt, at least during the indexing period - okay. But current domain owners should not be allowed to remove historical content of the domain at a later date just by modifying the robots.txt.

A lot of information is lost as domain squatters take over domains and set new robots.txt files. On Wikipedia, for example, you find a lot of reference links that point to archived website URLs on archive.org. Every now and then a vital information source is lost - it's a bit surreal, like book burning. I really like archive.org; this is the single feature I dislike a lot.


I'd like to rave about an underappreciated but absolutely brilliant piece of the Internet Archive's infrastructure: its book reader (called, I gather, "BookReader").

TIA includes copious media archives including video, audio, and books. The latter are based on full-image scans and can be read online.

I generally dislike full-format reading tools: Adobe Acrobat, xpdf, evince, and other PDF readers all have various frustrations. Google's own online book reader is a mass of Web and UI frustrations.

I'm a guy who almost always prefers local to Web-based apps.

TIA's book reader is the best I've seen anywhere, hands down.

It's fast, it's responsive. The UI gets out of the way. Find your text and hit "fullscreen". Hit F11 in your browser to maximize it; you can then dismiss the (subtle) UI controls from the page and you are now ... reading your book. Just the book. No additional crap.

Page turn is fast. Zoomed, the view seems to autocrop to the significant text on the page. Unlike every last damned desktop client, the book remains positioned on the screen in the same position as you navigate forward or backward through the book. Evince, by contrast, will turn a page and then position it with the top left corner aligned. You've got to. Reposition. Every. Damned. Page. Drives me insane (but hey, it's a short trip).

You can seek rapidly through the text with the bottom slider navigation.

About the only additions I could think of would be some sort of temporary bookmark, or the ability to flip rapidly between sections of a book (I prefer reading and following up on footnotes and references, and this often requires skipping between sections of a text).

Screenshot: http://i.imgur.com/Reg8KLB.png

Source: http://archive.org/stream/industrialrevol00toyngoog#page/n6/...

But, to whoever at TIA was responsible for this, thank you. From a grumpy old man who finds far too much online to be grumpy about, this is really a delight.

This appears to be an informational page with more links (including sources):

https://openlibrary.org/dev/docs/bookreader


Wow, I had no idea. If you'd asked me where to read older texts, I'd have said "Project Gutenberg."

This is miles better.


Gutenberg is also an absolute treasure, though its formats tend to be rather spartan. That's understandable when you consider that it started on mainframes (with founder Michael S. Hart hand-typing in the earliest works) and that, for reasons of compatibility, it long standardized on flat ASCII text as a storage format (I'm not sure if this remains the case: ISO-8859-1 is now supported and many works are available in HTML, PDF, and ePub versions, though I understand ASCII remains the reference).

The wealth of material at Gutenberg is pretty staggering.

But yeah, TIA's BookReader is really nice.

If you go to the project page and view some of the sample works (from read.gov and the Internet Archive itself), you'll find some more colorful examples than my boring focus on 19th century economic heterodoxies ;-)


Thanks for the kind words :)


Raj: I mean what I said, it's an amazingly well-executed app.

I'm glad I posted my little kudos if only because it was incentive to find out just what the backstory behind it was. And if I could make one other suggestion: the info link's a bit too successfully buried (I just found it under the book info icon).

I've taken so many shots at so many other products, projects, and websites, I figured I owed someone a bit of praise.

I'm also massively impressed and thankful for what you, Brewster, and the rest of the TIA team are doing. I'm not a full-on regular of the site, but I keep finding interesting and useful things there, and there's no way I could research what I do without sites such as TIA, Gutenberg, and a number of other online digital archives. It's a huge boon.


Have there been any High Scalability articles on their infrastructure? We have a similar need: storing a large volume of text-based content over a period of time, with versioning as well. On top of that we have various metadata. We're currently storing everything in MySQL -- a lightweight metadata row and a separate table for the large (~400KB on average) BLOB fields in a compressed table.

We're looking at ways to improve our architecture: simply bigger+faster hardware? Riak with LevelDB as a backend? Filesystem storage with a database for the metadata? We even considered using version control such as git or hg, but that proved to be far too slow for reads compared to a PK database row lookup.

Any HN'ers have suggestions?


I am a former Archive employee. I can't speak to their current infrastructure (though more of it is open source now - http://archive-access.sourceforge.net/projects/wayback/ ), but as far as the Wayback Machine goes, there was no SQL database anywhere in it. For the purposes of making the Wayback Machine go:

- Archived data was in the ARC file format (predecessor to http://en.wikipedia.org/wiki/Web_ARChive), which is essentially a concatenation of separately gzipped records. That is, you can seek to a particular offset and start decompressing a record, so you could get at any archived web page with a triple (server, filename, file-offset). This let the data be spread across a lot of commodity-grade machines.

- A sorted index of all the content was built that would let you look up a url and get a list of capture times, or map (url, time) to (filename, file-offset). It was implemented by building a sorted text file (sorted first on the url, second on the time), sharded across many machines by simply splitting it into N roughly equal pieces. Binary search across a sorted text file is surprisingly fast -- in part because the first few points you look at in the file remain cached in RAM, since you hit them frequently. (A rough sketch of this kind of search follows after this list.)

- (Here's where I'm a little rusty.) The web frontend would get a request and query the appropriate index machine. Then it would use a little mechanism (network broadcast, maybe?) to find out what server that (unique) filename was on, and request the particular record from that server.
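
As promised above, a rough sketch (not the Archive's actual code) of binary searching a large sorted text file by byte offset, realigning to a line boundary after each seek; index lines are assumed to look like "url timestamp filename offset":

  # Rough sketch: binary search a sorted text file by byte offset.
  # After each seek we skip the (possibly partial) current line, so
  # comparisons always happen on complete lines.
  def lookup(path, key):
      with open(path, "rb") as f:
          lo, hi = 0, f.seek(0, 2)          # hi = file size in bytes
          while lo < hi:
              mid = (lo + hi) // 2
              f.seek(mid)
              if mid:
                  f.readline()              # realign to the next line start
              line = f.readline().decode()
              if line and line < key:
                  lo = mid + 1
              else:
                  hi = mid
          f.seek(lo)
          if lo:
              f.readline()
          matches = []                      # collect all index lines for this key
          for raw in f:
              line = raw.decode()
              if not line.startswith(key):
                  break
              matches.append(line.rstrip("\n"))
          return matches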

(Edit: FYI, my knowledge is 5 years old now. I know they've done some things to keep the index more current than they did back then.)

At the very least, I'd think about getting your blobs out of MySQL and putting them in the filesystem. Filesystems are good at this stuff. You can do something as simple as using the SHA-1 hash of the content as the filename, and then, depending on your filesystem's performance characteristics, add a couple of levels to the directory tree you store them in, e.g. da39a3ee5e6b4b0d3255bfef95601890afd80709 goes into the directory da/39/. Then you stick da39a3ee5e6b4b0d3255bfef95601890afd80709 into a 'pointer' field in your table that replaces the actual data. Obviously this design assumes the content of _that_ file doesn't change: if you want to change the data for that row in the table, you have to write a new file in the filesystem and update the 'pointer'.
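
A minimal sketch of that layout (function names are illustrative; it assumes write-once blobs addressed by their SHA-1 hex digest):

  # Content-addressed blob store: da39a3... lives at <root>/da/39/da39a3...
  import hashlib, os

  def store_blob(root, data):
      digest = hashlib.sha1(data).hexdigest()
      d = os.path.join(root, digest[:2], digest[2:4])
      os.makedirs(d, exist_ok=True)
      path = os.path.join(d, digest)
      if not os.path.exists(path):      # write-once: identical content dedupes
          with open(path, "wb") as f:
              f.write(data)
      return digest                     # goes into the table's 'pointer' field

  def load_blob(root, digest):
      path = os.path.join(root, digest[:2], digest[2:4], digest)
      with open(path, "rb") as f:
          return f.read()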


Thanks! We were writing up a response at the same time:

The Wayback Machine data is stored in WARC or ARC files[0] which are written at web crawl time by the Heritrix crawler[1] (or other crawlers) and stored as regular files in the archive.org storage cluster.

Playback is accomplished by binary searching a 2-level index of pointers into the WARC data. The second level of this index is a 20TB compressed sorted list of (url, date, pointer) tuples called CDX records[2]. The first level fits in core, and is a 13GB sorted list of every 3000th entry in the CDX index, with a pointer to larger CDX block.

Index lookup works by binary searching the first level list stored in core, then HTTP range-request loading the appropriate second-level blocks from the CDX index. Finally, web page data is loaded by range-requesting WARC data pointed to by the CDX records. Before final output, link re-writing and other transforms are applied to make playback work correctly in the browser.
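
A very rough sketch of that lookup path (not the actual Wayback code; the first-level entries are illustrative, and real CDX blocks are compressed rather than plain text):

  # Two-level lookup: bisect the in-core first level, then range-request
  # one second-level CDX block over HTTP.
  import bisect
  import urllib.request

  def lookup(first_level, cdx_url, key):
      # first_level: sorted [(key, block_offset, block_length), ...],
      # one entry per ~3000 CDX lines, small enough to hold in RAM.
      keys = [k for k, _, _ in first_level]
      i = max(bisect.bisect_right(keys, key) - 1, 0)
      _, offset, length = first_level[i]
      req = urllib.request.Request(cdx_url, headers={
          "Range": "bytes=%d-%d" % (offset, offset + length - 1),
      })
      block = urllib.request.urlopen(req).read().decode()
      # each CDX line is roughly "<url-key> <timestamp> ... <warc-file> <offset>"
      return [ln for ln in block.splitlines() if ln.startswith(key)]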

The server stack:

- frontend: Tengine + HAProxy to a pool of Wayback tomcat app servers[3]

- backend: The redis-backed archive.org metadata API[4] for object location and nginx on linux (via ext4) for data service

  [0] http://en.wikipedia.org/wiki/Web_ARChive
  [1] https://github.com/internetarchive/heritrix3
  [2] https://github.com/internetarchive/CDX-Writer
  [3] https://github.com/internetarchive/wayback
  [4] http://blog.archive.org/2013/07/04/metadata-api/
-sam and raj, Internet Archive


And despite the size, it's so much faster than it used to be! Good work.


Why not use a hashtable instead of binary search? I'm assuming your index is immutable and queries against the data structure are essentially random. Another advantage of looking up items by URL hash may be that you can use the hash prefix to direct the query to the appropriate machine (so basically, your infrastructure simply looks like a giant distributed hashtable top to bottom, with no binary searches required).


Former Archive employee (& still occasional contract contributor) here. This was one of my 1st questions when joining in 2003!

Some Wayback Machine queries require sorted key traversal: listing all dates for which captures of an URL are available, the discovery of the nearest-date for an URL, and listing all available URLs beginning with a certain URL-prefix.

Maintaining the canonically-ordered master index of (URL, date, pointer) – that 20TB second-level index rajbot mentions – allows both kinds of queries to be satisfied. And once you've got that artifact, the individual capture lookups can be satisfied fairly efficiently, too. (A distributed-hashtable would then be something extra to maintain.)
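
To illustrate with a toy example (purely illustrative, not the Archive's code; the SURT-style url keys are just placeholders): with one sorted list of (urlkey, timestamp) tuples, bisect gives you all three query types described above, which a plain hashtable can't:

  # Toy example: one sorted index, three kinds of query.
  import bisect

  index = sorted([
      ("com,example)/", "19990117"), ("com,example)/", "20040610"),
      ("com,example)/about", "20010304"), ("com,example)/blog/1", "20130908"),
  ])

  def captures_for(url):                       # all dates for one URL
      lo = bisect.bisect_left(index, (url, ""))
      hi = bisect.bisect_right(index, (url, "~"))
      return index[lo:hi]

  def nearest(url, ts):                        # nearest-date capture
      dates = [t for _, t in captures_for(url)]
      i = bisect.bisect_left(dates, ts)
      candidates = dates[max(i - 1, 0):i + 1]
      return min(candidates, key=lambda d: abs(int(d) - int(ts)))

  def with_prefix(p):                          # all URLs under a prefix
      lo = bisect.bisect_left(index, (p, ""))
      return [e for e in index[lo:] if e[0].startswith(p)]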

Also, the queries aren't random: there are hot ranges, and even a single user's session begins with a range query (all dates for an URL), then visits one URL from that same range. Then loading nearest-date captures for the page's inline resources starts hitting similar ranges, as do followup clicks on outlinks or nearby dates. So even though the master index is still on spinning disk (unless there was a recent big SSD upgrade that escaped my notice), the ranges-being-browsed wind up in main-memory caches quite often.

There's no doubt many places that could be improved, but this basic sorted-index model has fit the application well for a long while, avoided too much domain-specific complexity, and been amenable to many generations of index/sharding/replication/internal-API tweaks.

BTW, the Archive is hiring for multiple technical roles, including a senior role developing a next-generation of the Wayback Machine: https://archive.org/about/jobs.php


Perhaps caching reasons? If people follow links internally within a domain, then randomly distributed hashes would mean a visit to randomly distributed data servers to retrieve every new page. With their architecture, the 'CDX' block should contain similar URLs, and accessing the linked URLs could be a seek within the already-retrieved block.

Just my guess.


Hypertable (store data and metadata in different access groups - basically different files on disk, but in the same table) on top of QFS as the distributed file system, with Reed-Solomon replication (for more efficiency).

You can keep n versions (microsecond timestamps) and sorted, compressed data on disk for fast range reads, and you can compress and group blocks differently across access groups (access groups are just groups of columns).

It was based on Bigtable, which was written to store the crawl data (like you & the Wayback Machine?) for Google search.


I'm using Azure, which supports blob storage of up to around 2GB per blob, has snapshots, and allows you to add metadata to blobs.


If you're looking to donate, they take bitcoin too! https://archive.org/donate/index.php


A little-known fact is that there is a mirror of the Wayback Machine hosted by the Bibliotheca Alexandrina:

http://www.bibalex.org/isis/frontend/archive/archive_web.asp...

I have sometimes had luck retrieving pages from this mirror that were unavailable (or returned errors) in the main site.


Awesome! It's a great tool to go back in time and check out our past websites, full of blinking GIFs and whatnot.

I didn't know that they also maintain the "HTTP Archive", showing website latency over time as well as some interesting live statistics: http://httparchive.org/


As far as I know, Internet Archive does not maintain "HTTP Archive" (http://httparchive.org/). HTTP Archive was founded by and is being maintained by Steve Souders (Chief Performance Officer at Fastly - http://www.fastly.com/). He's previously held titles at both Google ("Head Performance Engineer") and Yahoo!. He's also a co-founder of the popular web-development debug add-on Firebug.

Sources: http://httparchive.org/about.php and http://stevesouders.com/about.php


From the OP's linked article:

  June 15th, 2011 – The HTTP Archive becomes part of the Internet Archive [0], adding data about the performance of websites to our collection of web site content.
[0] https://blog.archive.org/2011/06/15/http-archive-joins-with-...


Oh shucks, I completely missed that! Thanks for linking to it! It doesn't hurt to be wrong occasionally; then you get to learn something. :-)


Thanks, it's nice to see factual data backing trends in web design. :)


Can anyone explain to me how displaying those sites on demand is not copyright infringement? I'm seriously curious, I don't know much about copyright laws.


Not a lawyer, but I guess this probably falls under fair use, which includes (from the Wikipedia article) both search engine use and library archiving of content. http://en.wikipedia.org/wiki/Fair_use


The Archive is also legally a library, which allows them to get away with a lot of things that companies cannot.


Also, doesn't the IA have specific privileges given it by Congressional legislation at some point? I seem to recall that coming up in abandonware discussions.


https://archive.org/about/dmca.php - it looks like the exemptions applied to everyone, for 3 years at least.


It should be fair use (non-commercial, library-purpose usage, etc.).


I still cannot fathom how they are able to store huge amounts of data and not run out of space. Anyone care to explain?


From a conversation with Brewster a few years ago: the doubling of disk drive density has allowed them to stay relatively neutral with respect to space for the Wayback Machine. It still occupies approximately the same footprint as it has for the last 10 years, which is essentially a set of racks about 15-20 feet long altogether, I think.

However, the new TV news and search capability requires substantially more space than even the archive IIRC, or certainly is heading that way.


Funny story about the Wayback Machine and how it helped me. I had let my blog go into disrepair for a couple months, and eventually, when I went back to it, I found that since I hadn't kept up with security updates, I wasn't able to access any of my old posts.

When I went back to start writing again (this time using paid hosting so I didn't have to deal with that), I was disappointed that I wasn't going to have ~20-30 posts I had before. On a hunch, I checked the Wayback Machine and found that it had archived about 15 of my posts! Very excited that I could restore some of my previous writings.


> Before there was Borat, there was Mahir Cagri. This site and the track it inspired on mp3.com created quite a stir in the IDM world, with people claiming that “Mahir Cagri” was Turkish for “Effects Twin” and that the whole thing was an elaborate ruse by Richard D. James (Aphex Twin). (Captured December 29, 2004 and December 7, 2000)

Okay this just blew my mind. Anyone else follow Aphex Twin's various shenanigans? Was this ever investigated further?


Nice work on this over the years, gojomo et al.


Cool, but on a lot of sites (including some of my own, from 10+ years ago to recently) it hardly gets any of the images. Am I the only one experiencing this?


If you're using AWS S3, check your resource policy.


Wow, that's one billion more pages than there are stars in our Milky Way galaxy. That's a lot!



