
Wayback Machine: Now with 240,000,000,000 URLs - cleverjake
http://blog.archive.org/2013/01/09/updated-wayback/
======
thomasvendetta
Seriously, the Wayback Machine is awesome. Just last week I used it to find a
website I made twelve years ago at the ripe age of 10. If there are any
maintainers/developers reading this, thank you.

The fact that they were able to preserve a masterpiece like this means a lot
to me:
[http://web.archive.org/web/20010124071800/http://expage.com/...](http://web.archive.org/web/20010124071800/http://expage.com/thomasvendetta)

~~~
arscan
Agreed. Back in '96 I was a teenager who spent all his free time running a
modest gaming fan website [1]. I dropped it when I went to college a year
later, but it's nice to know all that hard work will forever be memorialized
within the wayback machine. It helps me remember why I became a software
engineer in the first place. Thanks for that.

[1]
[http://web.archive.org/web/19970414022225/http://www.scorche...](http://web.archive.org/web/19970414022225/http://www.scorched.com/)

~~~
citricsquid
Do you still own that domain? It's fantastic.

~~~
arscan
Thanks! Sadly, I brought on a volunteer who bought the domain in his name. Not
a very smart idea in retrospect... but I was young, and not many people
understood the value of a good domain name back in 1996 ;-)

We were up to 20k daily uniques when I quit (not bad for 1997). I wrote the
forum & related software myself in perl, which was an amazing learning
experience.

------
binarycrusader
My only gripe with the wayback machine is that when old sites go offline and
some random domain squatter picks up the domain when it expires, they apply
the _current_ robots.txt to all of the old content making archive.org useless.

robots.txt should have some limit; it shouldn't be applied so aggressively to
content archived years earlier.

~~~
scottbartell
I suppose it's a pretty good approach to help avoid upsetting website owners
and possible lawsuits. While I'm not sure I agree with it, I can appreciate
that they provide a very easy way for websites to opt out.

~~~
ithkuil
A possible pragmatic solution would be to track the site and spot ownership
changes and freeze the robots.txt when it happens.

Reliable ownership-change detection can be tricky, but it's doable IMHO.
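For what it's worth, the freeze rule itself is simple once you have ownership
data. A toy Python sketch (the registrant strings stand in for whatever
WHOIS or other detection signal you'd actually use; all names here are made
up):

```python
# Hypothetical sketch of "freeze robots.txt when ownership changes".
class RobotsTracker:
    def __init__(self):
        self.owner = {}    # domain -> last registrant seen
        self.robots = {}   # domain -> robots.txt applied to archived content
        self.frozen = set()

    def observe(self, domain, registrant, robots_txt):
        """Record a crawl observation; freeze the rules if the owner changed."""
        if domain not in self.owner:
            self.owner[domain] = registrant
        elif self.owner[domain] != registrant:
            # Ownership changed: stop updating, keep the old owner's rules.
            self.frozen.add(domain)
            self.owner[domain] = registrant
        if domain not in self.frozen:
            self.robots[domain] = robots_txt

t = RobotsTracker()
t.observe("example.com", "Original Owner", "User-agent: *\nAllow: /")
t.observe("example.com", "Squatter Inc", "User-agent: *\nDisallow: /")
print(t.robots["example.com"])  # still the original owner's permissive rules
```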

------
NelsonMinar
The Wayback Machine is run by archive.org, a non-profit. If you like what they
do consider donating at <http://archive.org/donate/index.php>

~~~
sroecker
You can even donate some Bitcoin if you don't have a Paypal or Amazon account.

------
lucb1e
Just imagine hosting this beast and then having 1,000 people wanting to scan
the entire thing _every second_! For free!

~~~
hna0002
Even more!

------
attabi
Please, I need help.

I am using wayback-1.6 on Tomcat 5.28 (Java 1.7, Ubuntu 11.04) to display my
arc.gz files, but I get the error below even though this folder contains all
my arc.gz files: /tmp/wayback/files1/IA.arc.gz

Resource Not In Archive

The Resource you requested is not in this archive.

------
comfyred
[http://web.archive.org/web/20020827023250/http://home.netc.n...](http://web.archive.org/web/20020827023250/http://home.netc.net.au/~pennywgt/morryworld/morryworld)

------
tzury
How much is 10 PB, anyway?

[http://archive.org/details/10000000000000000BytesArchived?st...](http://archive.org/details/10000000000000000BytesArchived?start=1735)

------
mtrn
I'm ten percent into an implementation of a 'personal web archive', mostly as
a fun side project. I just wonder whether historical data gets more
interesting as the web ages.

~~~
Tichy
The other day it occurred to me that the digital age might result in some
serious data loss. I was wondering about historical prices for cars. I'm not
sure how you would have gone about it in previous times, but I suppose you
could find old catalogs, adverts in newspapers and stuff like that. But what
if vendors only advertise prices on their web sites? Pages with the old
prices will be gone once new prices go up, and the same goes for the
advertisements.

Not even sure if archives can help - with some algorithmically created content
it might be impossible to index it all.

Just one example - there are surely more. I used to think digital data would
be easier to preserve for the future, but now I am not so sure anymore.

Not even mentioning Facebook, which presumably cannot be archived because of
the walled-garden thing.

------
agatto2
Is there a way to get the list of those URLs? If anybody knows how, show me
the way... or you can email me at agatto2@gmail.com

------
zwieback
Cherished memories: word.com from the late '90s

------
hayksaakian
I wonder how this handles more recent HTML pages with all their JavaScript.

------
ddorian43
Does anyone know what database they use? Or just files and folders?

~~~
ato
The index for Wayback is a massive sorted text file (called a CDX) containing
a line for each URL and timestamp. For very large installations this index is
sharded across multiple servers and queried in parallel. The lookups are done
using plain old binary search.

<http://archive.org/web/researcher/cdx_file_format.php>
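For illustration, the lookup can be sketched like this in Python (the line
format here is heavily simplified; real CDX lines carry more fields, and the
entries below are made up):

```python
import bisect

# Simplified CDX-style lines: "<url-key> <timestamp> <filename> <offset>",
# kept sorted so any URL can be found by plain binary search.
cdx = sorted([
    "com,example)/ 19970414022225 IA-0001.arc.gz 0",
    "com,example)/ 20010124071800 IA-0042.arc.gz 1024",
    "com,scorched)/ 19970414022225 IA-0001.arc.gz 512",
])

def lookup(url_key):
    """Return all captures of url_key via binary search on the sorted index."""
    lo = bisect.bisect_left(cdx, url_key + " ")
    hi = bisect.bisect_right(cdx, url_key + "!")  # '!' sorts just after ' '
    return cdx[lo:hi]

print(lookup("com,example)/"))  # both captures of that URL
```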

Each CDX record maps a URL-timestamp pair to a byte offset into an ARC or WARC
file. These are essentially just gzipped HTTP responses concatenated together:

<http://archive.org/web/researcher/ArcFileFormat.php>
[http://www.digitalpreservation.gov/formats/fdd/fdd000236.sht...](http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml)
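Because each record is an independent gzip member, you can seek straight to
the stored byte offset and decompress exactly one record. A minimal Python
sketch (the record contents are made up; real ARC/WARC records carry extra
headers):

```python
import gzip
import zlib

# Build a toy ARC-like file: three independently gzipped records concatenated.
records = [b"<html>one</html>", b"<html>two</html>", b"<html>three</html>"]
members = [gzip.compress(r) for r in records]
arc = b"".join(members)

# The CDX index would store this byte offset for the second record.
offset = len(members[0])

# A zlib decompressobj with gzip wbits stops at the end of the first member,
# so we get exactly one record even though more data follows the offset.
d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
record = d.decompress(arc[offset:])
print(record)  # b'<html>two</html>'
```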

The document is retrieved and uncompressed, URLs are rewritten, the
navigation-banner JavaScript is injected, and the result is sent to the
client.

The code is here: <https://github.com/internetarchive/wayback>

~~~
agatto2
How do you get hold of the list of URLs?

