
Wayback Machine gets a facelift, new features - anigbrowl
https://archive.org/web/
======
memset
In case anyone doesn't remember the old design, here is a link:
[https://web.archive.org/web/20131016082142/http://archive.or...](https://web.archive.org/web/20131016082142/http://archive.org/web/)

~~~
blueblob
Did you just waybackmachine the waybackmachine?

~~~
gocard
Why didn't this cause an infinite loop?

~~~
rahul286
You can try searching for a snapshot of
[https://web.archive.org/web/20131016082142/http://archive.or...](https://web.archive.org/web/20131016082142/http://archive.org/web/)
like
[https://web.archive.org/web/20131016082142/https://web.archi...](https://web.archive.org/web/20131016082142/https://web.archive.org/web/20131016082142/http://archive.org/web/)
then
[https://web.archive.org/web/20131016082142/https://web.archi...](https://web.archive.org/web/20131016082142/https://web.archive.org/web/20131016082142/https://web.archive.org/web/20131016082142/http://archive.org/web/)

and so on, for an infinite loop!

Oops, this looks like recursion. ;-)

------
jakobe
Oh how I wish the Wayback Machine would ignore robots.txt... So many websites
lost to history because some rookie webmaster put misguided directives into
the file without thinking about the consequences (e.g. blocking all crawlers
except Google).

~~~
jccalhoun
I'm surprised that some upstart search engine hasn't made it a selling point
that they ignore robots.txt, claiming they search the pages Google doesn't, or
something.

~~~
greglindahl
Speaking as an upstart search engine guy (blekko) who also has a bunch of
webpages and a huge robots.txt, that's a bad idea. Such a crawler would be
knocking down webservers by running expensive scripts and clicking links that
do bad things like deleting records from databases or reverting edits in
wikis. You don't want to go there.

~~~
derekp7
Really? I was always taught that search engines only make GET requests, and
that anything that modifies data goes in a POST request. Are there really that
many broken websites out there that haven't already fallen victim to crawlers
that ignore robots.txt?

~~~
greglindahl
Yes, there are a lot of broken websites out there.
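
The classic failure mode looks something like this (a toy sketch in Flask,
not any particular site's code):

```
from flask import Flask

app = Flask(__name__)
records = {1: "first", 2: "second"}

# Flask routes answer GET by default -- and that's the bug: any crawler
# or link prefetcher that simply follows links will wipe out records.
# This should require POST (methods=["POST"]) instead.
@app.route("/delete/<int:record_id>")
def delete(record_id):
    records.pop(record_id, None)
    return "deleted"
```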

~~~
blueblob
I noticed this today. Googling "united check in" and clicking the "check in"
link took me to a page saying the confirmation number I entered was invalid,
even though I never entered one.

------
danso
The "Save Page Now" feature looks great. Hopefully this cures Wikipedia of its
increasing link-rot.

Also, the Supreme Court will be happy:
[http://www.nytimes.com/2013/09/24/us/politics/in-supreme-cou...](http://www.nytimes.com/2013/09/24/us/politics/in-supreme-court-opinions-clicks-that-lead-nowhere.html)

~~~
neilk
I mentored a Google Summer of Code project to do just that - every citation on
Wikipedia would be forwarded to Archive.org for permanent storage, and the
citation link would be modified to offer the cached version as an alternative.

[https://www.mediawiki.org/wiki/User:Kevin_Brown/ArchiveLinks](https://www.mediawiki.org/wiki/User:Kevin_Brown/ArchiveLinks)

For various reasons this didn't get completed or deployed. It's still a good
idea though. IMO it should be rewritten, but it wouldn't be a lot of code. I'd
love to help anyone interested.

(French Wikipedia already does this, by the way. Check out the article on
France, for example - all the footnotes have a secondary link to WikiWix.
[https://fr.wikipedia.org/wiki/France](https://fr.wikipedia.org/wiki/France))

~~~
greglindahl
Alexis said (at the IA 10th Anniversary bash) that they are going to have this
running very soon, using a bot to go over all of Wikipedia and insert archived
links close to the dates of existing references (if available), and also
capturing newly added links.

~~~
neilk
Excellent. Alexis rocks.

------
axefrog
Glad to see they finally have an API; however, I'm a bit disappointed that it
doesn't return the oldest archived date for a site, only the newest. I often
need to check how long ago a site was first archived. The API would have been
very helpful for that, but the closest they provide is an option to query
whether it was archived on a specific date, which is nowhere near as helpful.
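
For reference, this is the sort of query the new availability API answers (a
minimal sketch; the response field names are from memory, so double-check
them):

```
import json
import urllib.request

# Ask for the snapshot closest to a given date. The field names below
# are my recollection of the response format, not verified docs.
query = ("https://archive.org/wayback/available"
         "?url=example.com&timestamp=20060101")
with urllib.request.urlopen(query) as resp:
    data = json.load(resp)

closest = data.get("archived_snapshots", {}).get("closest")
if closest:
    print(closest["timestamp"], closest["url"])
```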

~~~
greglindahl
Their older CDX API provides that functionality:

[https://github.com/internetarchive/wayback/tree/master/wayba...](https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server)
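
Getting the oldest capture is then a couple of lines, since CDX results come
back oldest-first (a sketch; double-check the parameters against the docs
above):

```
import json
import urllib.request

# CDX results are sorted oldest-first by default, so limit=1 returns
# the very first capture. With output=json, rows[0] is the header row.
query = ("http://web.archive.org/cdx/search/cdx"
         "?url=archive.org&output=json&limit=1")
with urllib.request.urlopen(query) as resp:
    rows = json.load(resp)

capture = dict(zip(rows[0], rows[1]))
print("first archived:", capture["timestamp"], capture["original"])
```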

~~~
axefrog
Wow, I never knew it existed, thanks for the link!

------
cypher543
I love the Wayback Machine (and all of Archive.org, really). I recently used
it to reminisce about some old VRML-based chat communities that I frequented
about 10 years ago. It had a record for every single one of them.

~~~
Harelin
Cybertown, perhaps?

~~~
cypher543
Yep! There was also GoonieTown, which didn't last very long and eventually
became VR Dimension. Flatland Rover was another, but it used its own 3DML
engine instead of Blaxxun Contact and VRML. Good times!

~~~
Harelin
I remember both. I was active from '99 to about 2002, and was a City
Councilor/Colony Leader at one point. Nice to run into someone else with a
similar background. The internet just isn't what it was in those days.

------
derwiki
I just launched a similar service called
[https://www.DailySiteSnap.com](https://www.DailySiteSnap.com) that
screenshots, emails, and archives a specified website daily. My use case is
being able to look back at any given day and see what my site looked like,
since Archive.org doesn't refresh my page as often as I update it.

Disclaimer: I'm really not trying to over-market myself, but I figured readers
of this thread might be interested in my project. Happy to take down this post
if it's read as too spammy.

~~~
toomuchtodo
You can always package your site up in a format Archive.org can read:

[http://www.archiveteam.org/index.php?title=Wget_with_WARC_ou...](http://www.archiveteam.org/index.php?title=Wget_with_WARC_output)

[http://www.archiveteam.org/index.php?title=The_WARC_Ecosyste...](http://www.archiveteam.org/index.php?title=The_WARC_Ecosystem)

[http://justsolve.archiveteam.org/wiki/WARC](http://justsolve.archiveteam.org/wiki/WARC)

[http://warcreate.com/](http://warcreate.com/)

~~~
derwiki
Thanks for the info; I was unaware of the ARC/WARC formats. That said, I still
think many people are looking for something simpler, and a daily screenshot is
good enough. In particular, it guarantees the formatting is preserved as
browsers continue to evolve.

~~~
Cameron_D
You can probably make your service do both screenshots and WARC, instead of
loading a site directly, load it through WARC Proxy
([https://github.com/odie5533/WarcProxy](https://github.com/odie5533/WarcProxy)),
that will write out a WARC file and you can still store your screenshot.

Once you have the WARCs you can upload them to Archive.org and they can be
added to the wayback, or you can set up your own service for browsing them,
built off something like warc-proxy [https://github.com/alard/warc-
proxy](https://github.com/alard/warc-proxy) (Yeah, same name different
purpose...)

There is also a MITM version of WARCProxy that will let you store HTTPS sites:
[https://github.com/odie5533/WarcMITMProxy](https://github.com/odie5533/WarcMITMProxy)
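
Wiring it up is just pointing your fetcher at the proxy, something like this
(sketch only; the port is an assumption, use whatever WarcProxy actually
listens on):

```
import urllib.request

# Route the fetch through the local WARC-writing proxy; WarcProxy
# records the exchange to a .warc file on disk as a side effect.
# Screenshot the rendered page however you already do.
proxy = urllib.request.ProxyHandler({"http": "http://127.0.0.1:8000"})
opener = urllib.request.build_opener(proxy)
html = opener.open("http://example.com/").read()
```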

~~~
hncommenter13
As of version 1.14, wget natively supports WARC output (including built-in
gzip compression and CDX index file generation).

[http://www.archiveteam.org/index.php?title=Wget_with_WARC_ou...](http://www.archiveteam.org/index.php?title=Wget_with_WARC_output)

This makes creating a browsable mirror of a site in WARC format fairly
straightforward, as wget will automatically make links relative and fetch the
requisite files (CSS, JS, images) for each page.
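
A minimal invocation looks something like this (sketched as a subprocess
call; flag names per the wget docs):

```
import subprocess

# Mirror a site while recording WARC. --warc-file writes
# example.warc.gz and --warc-cdx writes a CDX index alongside it;
# --mirror and --page-requisites pull each page plus its CSS/JS/images.
# (If your wget build refuses to combine WARC output with
# --convert-links, do a second plain mirror pass for the local copy.)
subprocess.run([
    "wget",
    "--mirror",
    "--page-requisites",
    "--warc-file=example",
    "--warc-cdx",
    "http://example.com/",
], check=True)
```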

~~~
Cameron_D
Yeah, but as far as I can tell, derwiki's service doesn't use wget, so running
a proxy to store the WARCs is the next-simplest thing.

~~~
toomuchtodo
If his service runs on any sort of Linux distro, it's stupid simple to call
wget with a system call. Wget comes standard with all of the most popular
distros.

------
memracom
They still won't let you look at pages if some domainer has acquired the
domain and installed a robots.txt that disallows crawling.

They really should look at the date on the robots.txt and only apply it to
pages retrieved while it is in effect.

Show us the pages from before the robots.txt became so restrictive!
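
In pseudocode, the rule I'm asking for is roughly this (hypothetical names,
obviously not Archive.org's actual code):

```
from datetime import datetime

def snapshot_is_blocked(snapshot_time: datetime,
                        robots_effective_from: datetime,
                        path_disallowed: bool) -> bool:
    # Honor the disallow rule only against pages captured while that
    # robots.txt was actually in effect; older snapshots stay visible.
    return path_disallowed and snapshot_time >= robots_effective_from
```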

------
granttimmerman
Now I have a real reason to search the Wayback Machine on the Wayback Machine!
Then:
[https://web.archive.org/web/20131024095443/https://archive.o...](https://web.archive.org/web/20131024095443/https://archive.org/web/)
Now:
[https://web.archive.org/web/20131029213051/https://archive.o...](https://web.archive.org/web/20131029213051/https://archive.org/web/)

------
slacka
The new "Save Page Now" feature is great, but there is still no way to add
full sites to crawl. For example, I added:
[http://www.cgw.com/Publications/CGW.aspx](http://www.cgw.com/Publications/CGW.aspx)

But it would take hours or days to add every article from every issue.
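
The best workaround I can see is scripting it yourself against the save
endpoint, something like this (a sketch; the URL list is hypothetical, and
the delay is just politeness since this isn't a documented bulk API):

```
import time
import urllib.request

# Hypothetical list -- you'd scrape the issue index for article URLs.
article_urls = [
    "http://www.cgw.com/Publications/CGW.aspx",
    # ...
]

for url in article_urls:
    # Hitting /save/<url> asks "Save Page Now" to capture that page.
    urllib.request.urlopen("https://web.archive.org/save/" + url)
    time.sleep(10)  # conservative delay to avoid hammering the service
```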

------
talles
Thank god they didn't change much. I hate it when extremely functional
websites decide to 'revolutionize' their interface (I'm looking at you, Google
Maps).

I love this service.

------
agumonkey
On the purely aesthetic side, the new input form does clash with the old menu,
and the carousel seems a bit CPU-hungry; maybe a simpler tile grid, as in
Windows Phone 8, would do. That said, I love the service, and the frontend is
probably not the most important part of their system.

------
vbuterin
They accept donations, and they even take Bitcoin:
[https://archive.org/donate/index.php](https://archive.org/donate/index.php)

Be sure to send them some!

------
powertower
Disregarding whatever the rules are about this in the TOS, is there a good way
to download/scrape your old archived website?

~~~
shaunpud
Don't know how good they are, but waybackdownloader.com seems to provide that
service.
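
If you'd rather do it yourself, the CDX API plus the raw-content "id_" URL
form gets you most of the way (a sketch; the "id_" trick is semi-official, so
verify it before relying on it):

```
import json
import urllib.request

site = "example.com"
# List one capture per distinct URL under the site (collapse=urlkey).
query = ("http://web.archive.org/cdx/search/cdx?url=" + site +
         "/*&output=json&collapse=urlkey")
with urllib.request.urlopen(query) as resp:
    rows = json.load(resp)

header, captures = rows[0], rows[1:]
for row in captures:
    cap = dict(zip(header, row))
    # "id_" after the timestamp requests the original bytes without
    # the Wayback replay banner.
    snap = ("https://web.archive.org/web/" + cap["timestamp"] +
            "id_/" + cap["original"])
    data = urllib.request.urlopen(snap).read()
    # ...write `data` to disk under a path derived from cap["original"]
```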

------
fmitchell0
They need to fire their plastic surgeon if a font update and some new spacing
is what counts as a facelift.

