
Wayback Machine Hits 400,000,000,000 - tweakz
http://blog.archive.org/2014/05/09/wayback-machine-hits-400000000000/
======
leorocky
And not one of those hits is from Quora due to their robots.txt:

[https://web.archive.org/web/http://www.quora.com/](https://web.archive.org/web/http://www.quora.com/)

Good job Quora, preserving all that crowd-sourced content away from the crowd,
keeping it from everyone not logged in. Hats off to getting into YC so you can
post your job openings on the HN home page and get some press. This doesn't
add anything to your image, though; it just takes a little away from YC.

On another, better note: a great big thank you to the Wayback Machine for all
of the public good it does. Now there's an organization that is amazing and
wonderful, enriching our lives in an open and honest way with information.

~~~
dredmorbius
Man. I hate on Quora. Didn't realize they were YC. Shame.

~~~
khalidmbajwa
And why would you do that? I can see why the must-register roadblock is
annoying in theory, but it takes a few minutes to register, and then it's an
absolute gold mine of knowledge, with an extraordinary community that is
very positive, constructive, and helpful. To be honest, I find it way more
engaging than HN.

~~~
hrrsn
I've found the Quora community to be great, but the register-walling and the
robots.txt are very closed-minded, especially toward the Internet as a whole.
Most of the time when I follow a link to a Quora answer, I'm turned away by
the fact that I can't read more than half of it without having to sign in.
Imagine if Stack Overflow did the same thing? They'd be dead.

~~~
gldnspud
Stack Overflow was created in response to a site that was doing something
similar to what Quora is doing with search engines, except that site required
a paid membership to see the answers. I recall one trick that people used
before SO was around, which was to spoof the User-Agent to be GoogleBot. Doing
so would show you the answer(s) to the question. :)
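
For the curious, that trick amounted to nothing more than sending a different
User-Agent header. A minimal sketch in Python (the URL is a placeholder and the
Googlebot string is only illustrative; sites that cloaked content for crawlers
would serve the full answer to a request like this):

    import urllib.request

    # Hypothetical example: fetch a paywalled answer page while identifying
    # as Googlebot. Sites that showed full content only to search engines
    # would render the complete answer for this request.
    url = "https://example.com/some-question"  # placeholder, not a real EE URL
    req = urllib.request.Request(
        url,
        headers={"User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; "
                               "+http://www.google.com/bot.html)"},
    )
    with urllib.request.urlopen(req) as resp:
        print(resp.read().decode("utf-8", errors="replace")[:500])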

~~~
bencollier49
I thought Google specifically penalised sites which did that?

~~~
Netminder_EE
DISCLAIMER: I'm one of Experts Exchange's volunteer administrators.

Google very heavily penalized sites that redirected from SERPs or buried
content below a lot of "sign up here" stuff; you can even make the case that
it could have been called the "EE Penalty" instead of Panda, if only because
Google's web search team (i.e. Matt Cutts) collaborated heavily with the Stack
ownership in developing the algorithm changes and consequences.

That's not to excuse EE's management's behavior -- quite the contrary. From
the perspective of longtime users, EE's string of decisions, the consequences
of those decisions, and its reactions to those consequences starting not quite
a decade ago nearly destroyed what had been a vibrant community.

If there's anything good to have come of it, it's that EE has finally moved --
about six or seven months ago -- to a business model that allows the non-
member to see what members see: the entire question and solutions. Joining for
free does have a few minor advantages; paying (either with a credit card or by
answering questions) has more.

But it took a long time for EE to learn those lessons and begin to implement
those fixes. Whether Quora will learn them is a whole 'nother story; like so
many other sites, it built its systems and market-share without much thought
given to how it was going to monetize.

EE made that mistake too, in 1997. Quora has a lot of money backing it up, so
it can maintain its facade for a long time... but I wouldn't be placing any
bets on it being around 15 years from now if it actually had to depend on
income.

------
keenerd
PSA/ranty thing: Just because something is archived in the Wayback machine, do
not trust that archive.org will keep it there for all time. If you need
something, make a local copy! A few months ago TIA changed their stance on
robots.txt. They now _retroactively_ honor robot blocks. Now any site can
completely vanish from the archives.

Let's say I died tomorrow. My family lets my domain slip. A squatter buys it
and throws up a stock landing page, with a robots.txt that forbids spidering.
TIA would delete my entire site from their index.

I've already lost a few good sites to this sort of thing. If you depend on a
resource, archive it yourself.
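
To make the squatter scenario concrete, a blanket block is only two lines of
robots.txt. Here's a small sketch using Python's stdlib parser (the
"ia_archiver" token is the one historically used to exclude the archive's
crawler, though the wildcard rule refuses any agent string anyway):

    from urllib import robotparser

    # A stock squatter landing page often ships a blanket robots.txt like this:
    squatter_robots = """User-agent: *
    Disallow: /
    """

    rp = robotparser.RobotFileParser()
    rp.parse(squatter_robots.splitlines())

    # Every path on the domain is now off-limits to every crawler, including
    # the archive's -- which, under the retroactive policy, also hides the
    # previously captured pages.
    print(rp.can_fetch("ia_archiver", "http://example.com/any/old/page.html"))  # False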

edit - Official policy:
[https://archive.org/about/exclude.php](https://archive.org/about/exclude.php)

If I am reading it properly, once blocked they never check later in case of a
change of heart? No procedure for getting re-indexed at all?

~~~
gojomo
A fair rant, but to correct some misperceptions:

The retroactive application of robots.txt is not a new policy; it's been in
place for at least 11 years, and I believe it arrived very soon after the
Wayback Machine was first unveiled.

An updated robots.txt does not irreversibly delete prior captures, so if the
robots.txt changes again, access to previously-collected material can be re-
enabled.

This policy has served to minimize risk and automate the most common removal-
scenario, when a webmaster wants a total opt-out of current crawling and past
display. But, the collateral damage to unrelated content from prior domain-
owners has grown as the web has aged, and more domains have changed hands.
(The tradeoff that made sense in 2002 probably doesn't make sense in 2014.)

Figuring out a system that can automate most legitimate exclusions, while
rejecting or reversing those that lack a firm basis in content ownership or
personal privacy, is a thorny task, but it would be worth pursuing if/when the
Wayback Machine has the necessary staff resources.

(My proposal since 2008 has been a DMCA-inspired 'put-back' procedure, where
an original content owner can assert, formally, that they are the content
owner and do _not_ want the current-day robots.txt applied to captures before
a certain date. Then, the current domain-owner would have to counter-notify
_that_ to maintain the block. This idea hasn't had legal review, but would
reverse some current damage, and any bad-faith blockers would have to go on
record with a false claim to maintain the block, potentially exposing them to
a third-party legal challenge, with minimal risk to IA.)

~~~
keenerd
Whoops, I probably should have checked the exclude.php page in the Wayback
Machine before pinning a date on it. My bad.

> An updated robots.txt does not irreversibly delete prior captures, so if the
> robots.txt changes again, access to previously-collected material can be re-
> enabled.

Is there any official statement you can cite? Generally TIA does not mince
words and is honest. If they say "removed" then I would assume they are not
doing Facebook-style "deletion" shenanigans.

~~~
gojomo
I don't know of a linkable official statement to that effect. Unfortunately
much of the information available (for example in the official onsite FAQ) is
incomplete or outdated. The Wayback Machine gets only a tiny fraction of the
product-management, public communication, documentation, end-user support, and
fundraising that it should.

------
dredmorbius
I'd like to rave about an underappreciated but absolutely brilliant piece of
the Internet Archive's infrastructure: its book reader (called, I gather,
"BookReader").

TIA includes copious media archives including video, audio, and books. The
latter are based on full-image scans and can be read online.

I generally dislike full-format reading tools: Adobe Acrobat, xpdf, evince,
and other PDF readers all have various frustrations. Google's own online book
reader is a mass of Web and UI frustrations.

I'm a guy who almost _always_ prefers local to Web-based apps.

TIA's book reader is the best I've seen anywhere, hands down.

It's fast, it's responsive. The UI gets out of the way. Find your text and hit
"fullscreen". Hit 'F11' in your browser to maximize it; you can then dismiss
the (subtle) UI controls off the page and you are now ... reading your book.
Just the book. No additional crap.

Page turn is fast. Zoomed, the view seems to autocrop to the significant text
on the page. Unlike every last damned desktop client, _the book stays in the
same position on the screen_ as you navigate forward or backward through it.
Evince, by contrast, will turn a page and then position it with the top left
corner aligned. You've got to. Reposition. Every. Damned. Page. Drives me
insane (but hey, it's a short trip).

You can seek rapidly through the text with the bottom slider navigation.

About the only additions I could think of would be some sort of temporary
bookmark, or the ability to flip rapidly between sections of a book (I prefer
reading and following up on footnotes and references, which often requires
skipping between sections of a text).

Screenshot: [http://i.imgur.com/Reg8KLB.png](http://i.imgur.com/Reg8KLB.png)

Source:
[http://archive.org/stream/industrialrevol00toyngoog#page/n6/...](http://archive.org/stream/industrialrevol00toyngoog#page/n6/mode/2up)

But, for whoever at TIA was responsible for this, thank you. From a grumpy
old man who finds far too much online to be grumpy about, this is really a
delight.

This appears to be an informational page with more links (including sources):

[https://openlibrary.org/dev/docs/bookreader](https://openlibrary.org/dev/docs/bookreader)

~~~
jsmthrowaway
Wow, I had no idea. If you'd asked me where to read older texts, I'd have
said "Project Gutenberg."

This is _miles_ better.

~~~
dredmorbius
Gutenberg is also an absolute treasure, though its formats tend to be rather
spartan. That's understandable when you consider that it started on mainframes
(with founder Michael S. Hart hand-typing in the earliest works) and that, for
reasons of compatibility, it long standardized on flat ASCII text as a storage
format. (I'm not sure if this remains the case: ISO-8859-1 is now supported and
many works are available in HTML, PDF, and ePub versions, though I understand
ASCII remains the reference.)

The wealth of material at Gutenberg is pretty staggering.

But yeah, TIA's BookReader is really nice.

If you go to the project page and view some of the sample works (from read.gov
and the Internet Archive itself) you'll find some more colorful examples than
my boring focus on 19th century economic heterodoxies ;-)

------
meritt
Have there been any High Scalability articles on their infrastructure? We have
a similar need: storing a large volume of text-based content over a period of
time, with versioning as well. On top of it we have various metadata. We're
currently storing everything in MySQL -- a lightweight metadata row and a
separate table for the large (~400KB on average) BLOB fields in a compressed
table.

We're looking at ways to improve our architecture: simply bigger+faster
hardware? Riak with LevelDB as a backend? Filesystem storage with a database
for the metadata? We even considered using version control such as git or hg,
but that proved to be far too slow for reads compared to a PK database row
lookup.

Any HN'ers have suggestions?

~~~
mmagin
I am a former Archive employee. I can't speak to their current infrastructure
(though more of it is open source now -
[http://archive-access.sourceforge.net/projects/wayback/](http://archive-access.sourceforge.net/projects/wayback/)
), but as far as the Wayback Machine goes, there was no SQL database anywhere
in it. For the purposes of making the Wayback Machine go:

\- Archived data was in ARC file format (predecessor to
[http://en.wikipedia.org/wiki/Web_ARChive](http://en.wikipedia.org/wiki/Web_ARChive)),
which is essentially a concatenation of separately gzipped records. That is,
you can seek to a particular offset and start decompressing a record, so you
could get at any archived web page with a triple (server, filename,
file-offset). Thus it was spread across a lot of commodity-grade machines.
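
A rough sketch of what that record fetch looks like, assuming a file of
concatenated gzip members (the path and offset below are made up, and the real
readers also parse the ARC/WARC record headers):

    import zlib

    def read_record(path, offset, chunk_size=64 * 1024):
        """Decompress a single gzip member starting at a byte offset.

        Because each record in the file is an independently gzipped member,
        one seek plus one member's worth of decompression recovers a single
        archived response without touching the rest of the (huge) file.
        """
        decomp = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)  # expect gzip framing
        out = []
        with open(path, "rb") as f:
            f.seek(offset)
            while not decomp.eof:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                out.append(decomp.decompress(chunk))
        return b"".join(out)

    # Hypothetical usage: the (filename, offset) pair comes from the index lookup.
    # record = read_record("/data/crawl-part-0042.arc.gz", 1_234_567)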

\- A sorted index of all the content was built that would let you look up a
url and get a list of capture times, or map (url, time) to (filename,
file-offset). It was implemented by building a sorted text file (sorted first
on the url, then on the time) and sharded across many machines by simply
splitting it into N roughly equal pieces. Binary search across a sorted text
file is surprisingly fast -- in part because the first few points you look at
in the file remain cached in RAM, since you hit them frequently.
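
A toy version of that lookup, assuming lines of the form "url time filename
offset" sorted lexically (my own simplification, not the Archive's actual
code):

    import os

    def _line_at(f, offset):
        """Return the first complete line starting at or after byte `offset`."""
        f.seek(offset)
        if offset:
            f.readline()        # discard the partial line we landed in
        return f.readline()

    def lookup(index_path, key):
        """Binary-search a sorted text index for the first line >= key.

        The first few probe offsets are the same for every query, so those
        pages stay resident in the OS cache; only the last couple of seeks
        actually touch disk.
        """
        key = key.encode()
        with open(index_path, "rb") as f:
            lo, hi = 0, os.path.getsize(index_path)
            while lo < hi:
                mid = (lo + hi) // 2
                line = _line_at(f, mid)
                if line and line < key:
                    lo = mid + 1
                else:
                    hi = mid
            return _line_at(f, lo).decode()

    # Hypothetical usage against a made-up shard of the index:
    # print(lookup("index-shard-03.txt", "http://example.com/ 2004"))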

\- (Here's where I'm a little rusty) The web frontend would get a request and
query the appropriate index machine. Then it would use a little mechanism
(network broadcast maybe?) to find out which server that (unique) filename was
on, and request the particular record from that server.

(Edit: FYI, my knowledge is 5 years old now. I know they've done some things
to keep the index more current than they did back then.)

At the very least, I'd think about getting your blobs out of MySQL and putting
them in the filesystem. Filesystems are good at this stuff. You can certainly
do something as simple as a SHA-1 hash of the content as the filename, and
then, depending on your filesystem's performance characteristics, you can have
a couple of levels in the tree you store them in, e.g.
da39a3ee5e6b4b0d3255bfef95601890afd80709 goes into the directory da/39/. Then
you stick da39a3ee5e6b4b0d3255bfef95601890afd80709 into the 'pointer' field in
your table that replaces the actual data. Obviously this design assumes the
content of _that_ file doesn't change. If you want to change the data for that
row in the table, you have to write a new file in the filesystem and update
the 'pointer'.
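
A minimal version of that layout (the root path and two-level fan-out are just
one common choice):

    import hashlib
    import os

    def store_blob(root, data):
        """Write content to root/da/39/da39a3...709 and return the SHA-1 hex digest.

        The two-level directory fan-out keeps any single directory from
        accumulating millions of entries; the database row then stores only
        the returned digest in its 'pointer' column.
        """
        digest = hashlib.sha1(data).hexdigest()
        directory = os.path.join(root, digest[:2], digest[2:4])
        os.makedirs(directory, exist_ok=True)
        path = os.path.join(directory, digest)
        if not os.path.exists(path):   # content-addressed: identical data is written once
            with open(path, "wb") as f:
                f.write(data)
        return digest

    # Hypothetical usage: replace the BLOB column with the returned digest.
    # pointer = store_blob("/srv/blobs", page_html.encode("utf-8"))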

~~~
rajbot
Thanks! We were writing up a response at the same time:

The Wayback Machine data is stored in WARC or ARC files[0] which are written
at web crawl time by the Heritrix crawler[1] (or other crawlers) and stored as
regular files in the archive.org storage cluster.

Playback is accomplished by binary searching a 2-level index of pointers into
the WARC data. The second level of this index is a 20TB compressed sorted list
of (url, date, pointer) tuples called CDX records[2]. The first level fits in
core, and is a 13GB sorted list of every 3000th entry in the CDX index, with a
pointer to the larger CDX block.

Index lookup works by binary searching the first level list stored in core,
then HTTP range-request loading the appropriate second-level blocks from the
CDX index. Finally, web page data is loaded by range-requesting WARC data
pointed to by the CDX records. Before final output, link re-writing and other
transforms are applied to make playback work correctly in the browser.
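
A rough sketch of that two-level lookup, with the in-core first level as a
plain sorted Python list and the second level fetched by HTTP range request
(all names, URLs, and key formats here are placeholders, not the Archive's
actual internals):

    import bisect
    import urllib.request

    # Hypothetical in-core first level: one (key, byte_offset) entry for every
    # 3000th line of the big sorted CDX file, loaded once at startup.
    first_level = [
        ("com,example)/ 20040101000000", 0),
        ("org,example)/ 20050101000000", 9_000_000),
        # ... millions more entries ...
    ]

    def fetch_cdx_block(key, cdx_url="https://cdx.example.internal/master.cdx"):
        """Locate the CDX block that may contain `key` and fetch it via a range request."""
        keys = [k for k, _ in first_level]
        i = max(bisect.bisect_right(keys, key) - 1, 0)  # last block starting at or before key
        start = first_level[i][1]
        end = first_level[i + 1][1] - 1 if i + 1 < len(first_level) else ""
        req = urllib.request.Request(cdx_url, headers={"Range": f"bytes={start}-{end}"})
        with urllib.request.urlopen(req) as resp:
            # Scan these lines for the matching (url, date, WARC file, offset)
            # record, then range-request the WARC data it points to.
            return resp.read()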

The server stack:

\- frontend: Tengine + HAProxy to a pool of Wayback tomcat app servers[3]

\- backend: The redis-backed archive.org metadata API[4] for object location
and nginx on linux (via ext4) for data service

    
    
      [0] http://en.wikipedia.org/wiki/Web_ARChive
      [1] https://github.com/internetarchive/heritrix3
      [2] https://github.com/internetarchive/CDX-Writer
      [3] https://github.com/internetarchive/wayback
      [4] http://blog.archive.org/2013/07/04/metadata-api/
    

-sam and raj, Internet Archive

~~~
sytelus
Why not use a hashtable instead of binary search? I'm assuming your index is
immutable and that queries against the data structure are essentially random.
Another advantage of looking up an item by URL hash may be that you can use
the hash prefix to direct the query to the appropriate machine (so basically,
your infrastructure simply looks like a giant distributed hashtable top to
bottom, with no binary searches required).

~~~
gojomo
Former Archive employee (& still occasional contract contributor) here. This
was one of my 1st questions when joining in 2003!

Some Wayback Machine queries require sorted key traversal: listing all dates
for which captures of an URL are available, the discovery of the _nearest-
date_ for an URL, and listing all available URLs beginning with a certain URL-
prefix.
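
To illustrate with a toy sorted list (keys shown as "url date", much simpler
than the Archive's real canonicalized keys), all three query shapes fall out of
ordered traversal, which a plain hashtable can't offer:

    import bisect

    # Toy index: one "url date" key per capture, kept sorted.
    index = sorted([
        "example.com/ 20030215000000",
        "example.com/ 20040509000000",
        "example.com/about 20040601000000",
        "example.org/ 20020101000000",
    ])

    def captures_of(url):
        """All capture dates for one URL -- a contiguous run in the sorted list."""
        lo = bisect.bisect_left(index, url + " ")
        hi = bisect.bisect_right(index, url + " \xff")
        return index[lo:hi]

    def nearest_capture(url, date):
        """The capture of `url` nearest a requested date (here: at or after it)."""
        i = bisect.bisect_left(index, f"{url} {date}")
        return index[i] if i < len(index) and index[i].startswith(url + " ") else None

    def urls_with_prefix(prefix):
        """Every entry whose URL begins with the given prefix."""
        lo = bisect.bisect_left(index, prefix)
        return [e for e in index[lo:] if e.startswith(prefix)]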

Maintaining the canonically-ordered master index of (URL, date, pointer) –
that 20TB second-level index rajbot mentions – allows both kinds of queries to
be satisfied. And once you've got that artifact, the individual capture
lookups can be satisfied fairly efficiently, too. (A distributed-hashtable
would then be something extra to maintain.)

Also, the queries _aren't_ random: there are hot ranges, and even a single
user's session begins with a range query (all dates for an URL), then visits
one URL from that same range. Then loading nearest-date captures for the
page's inline resources starts hitting similar ranges, as do followup clicks
on outlinks or nearby dates. So even though the master index is still on
spinning disk (unless there was a recent big SSD upgrade that escaped my
notice), the ranges-being-browsed wind up in main-memory caches quite often.

There's no doubt many places that could be improved, but this basic sorted-
index model has fit the application well for a long while, avoided too much
domain-specific complexity, and been amenable to many generations of
index/sharding/replication/internal-API tweaks.

BTW, the Archive is hiring for multiple technical roles, including a senior
role developing a next-generation of the Wayback Machine:
[https://archive.org/about/jobs.php](https://archive.org/about/jobs.php)

------
swalsh
If you're looking to donate, they take bitcoin too!
[https://archive.org/donate/index.php](https://archive.org/donate/index.php)

------
pimlottc
A little-known fact is that there is a mirror of the Wayback Machine hosted by
the Bibliotheca Alexandrina:

[http://www.bibalex.org/isis/frontend/archive/archive_web.asp...](http://www.bibalex.org/isis/frontend/archive/archive_web.aspx)

I have sometimes had luck retrieving pages from this mirror that were
unavailable (or returned errors) in the main site.

------
alternize
awesome! it's a great tool to go back in time to check out our past websites
full of blinking gifs and whatnot.

I didn't know that they also maintain the "HTTP Archive", showing website
latency over time as well as some interesting live-statistics:
[http://httparchive.org/](http://httparchive.org/)

~~~
ersii
As far as I know, Internet Archive does not maintain "HTTP Archive"
([http://httparchive.org/](http://httparchive.org/)). HTTP Archive was founded
by and is being maintained by Steve Souders (Chief Performance Officer at
Fastly - [http://www.fastly.com/](http://www.fastly.com/)). He's previously
held titles at both Google ("Head Performance Engineer") and Yahoo!. He's also
a co-founder of the popular web-development debug add-on Firebug.

Sources: [http://httparchive.org/about.php](http://httparchive.org/about.php)
and [http://stevesouders.com/about.php](http://stevesouders.com/about.php)

~~~
alternize
from OP's linked article:

    
    
      June 15th, 2011 – The HTTP Archive becomes part of the Internet Archive [0], adding data about the performance of websites to our collection of web site content.
    

[0] [https://blog.archive.org/2011/06/15/http-archive-joins-
with-...](https://blog.archive.org/2011/06/15/http-archive-joins-with-
internet-archive/)

~~~
ersii
Oh shucks, I completely missed that! Thanks for linking to it! It doesn't
hurt to be wrong occasionally; then you get to learn something... :-)

------
Kenji
Can anyone explain to me how displaying those sites on demand is not copyright
infringement? I'm seriously curious, I don't know much about copyright laws.

~~~
dvirsky
Not a lawyer, but I guess this probably falls under fair use, which includes
(from the Wikipedia article) both search engine use, and library archiving of
content.
[http://en.wikipedia.org/wiki/Fair_use](http://en.wikipedia.org/wiki/Fair_use)

~~~
karasinski
The Archive is also legally a library, which allows them to get away with a
lot of things that companies cannot.

~~~
gwern
Also, doesn't the IA have specific privileges given it by Congressional
legislation at some point? I seem to recall that coming up in abandonware
discussions.

~~~
ivank
[https://archive.org/about/dmca.php](https://archive.org/about/dmca.php) \- it
looks like the exemptions applied to everyone, for 3 years at least.

------
Vecrios
I still cannot fathom how they are able to store huge amounts of data and not
run out of space. Anyone care to explain?

~~~
dwhly
From a conversation with Brewster a few years ago: the doubling of disk drive
density has allowed them to stay relatively neutral with respect to physical
space for the Wayback Machine. It _still_ occupies approximately the same
footprint as it has for the last 10 years, which is essentially a set of racks
about 15-20 feet long altogether, I think?

However, the new TV news and search capability requires substantially more
space than even the web archive IIRC, or certainly is heading that way.

------
jackschultz
Funny story about the Wayback Machine and how it helped me. I had let my blog
go into disrepair for a couple months, and eventually, when I went back to it,
I found that since I hadn't kept up with security updates, I wasn't able to
access any of my old posts.

When I went back to start writing again (this time using paid hosting so I
didn't have to deal with that), I was disappointed that I wasn't going to have
~20-30 posts I had before. On a hunch, I checked the Wayback Machine and found
that it had archived about 15 of my posts! Very excited that I could restore
some of my previous writings.

------
ultrasandwich
> Before there was Borat, there was Mahir Cagri. This site and the track it
> inspired on mp3.com created quite a stir in the IDM world, with people
> claiming that “Mahir Cagri” was Turkish for “Effects Twin” and that the
> whole thing was an elaborate ruse by Richard D. James (Aphex Twin).
> (Captured December 29, 2004 and December 7, 2000)

Okay this just blew my mind. Anyone else follow Aphex Twin's various
shenanigans? Was this ever investigated further?

------
sutro
Nice work on this over the years, gojomo et al.

------
mholt
Cool, but on a lot of sites (including some of my own, from 10+ years ago to
recently) it hardly gets any of the images. Am I the only one experiencing
this?

~~~
buren
If you're using AWS S3, check your resource policy.

------
rietta
Wow, that's one billion more pages than there are stars in our Milky Way
galaxy. That's a lot!

