
Internet Archaeology: Scraping time series data from Archive.org - foob
http://sangaline.com/post/wayback-machine-scraper/
======
hartator
Really cool, congrats!

I have built something similar, but to retrieve a backup for one of my dead
websites. It was a fun project.

Shameless plug: [https://github.com/hartator/wayback-machine-downloader/](https://github.com/hartator/wayback-machine-downloader/)

------
natch
Do they still have the program they used to run, where researchers could apply
for direct access to the crawl data?

~~~
mdaniel
Are you thinking of [http://commoncrawl.org](http://commoncrawl.org) or is/was
there actually an Internet Archive program like that? Because that would be
_amazing_

~~~
natch
Funnily enough, the only way I could find info on this now was to go back
through the Wayback Machine to find old versions of archive.org...

Here's a page with some tantalizing information, but I'm gathering from the
lack of current info that maybe this access is a thing of the past:

[https://web-beta.archive.org/web/20060209225202/http://www.a...](https://web-beta.archive.org/web/20060209225202/http://www.archive.org:80/web/researcher/intended_users.php)

Clicking through to an item on the sidebar, it's clear these were actual UNIX
logins made available to researchers with approved projects:

"Research.archive.org houses the personal files of the users on the system.
Each user has access to the directory /home/<login> for file storage. Since
research.archive.org is NFS mounted on all of the hosts, a user's home
directory <blah blah blah>...

...Individual hosts can be accessed using the remote shell (rsh) UNIX command.
The hosts in the cluster have an auto-authenticating script, so the secure
shell (ssh) command is unnecessary. Access to the hosts is limited depending
on the type of user account that is held. User accounts directly on
research.archive.org have access to..."

------
deferredposts
So what is the policy of The Internet Archive on this level of scraping? Do
they have a rate limit in place?

~~~
foob
Yes, they start sending 429 (Too Many Requests) responses if you don't use
appropriate delays. They also provide a public API [0] which I believe is
intended for automated requests of this type (as opposed to crawling the
Wayback Machine website directly).

[0] -
[https://archive.org/help/wayback_api.php](https://archive.org/help/wayback_api.php)
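
For anyone who wants to try the API route, here is a rough sketch of querying
the availability endpoint described at [0] while backing off on 429 responses.
The helper names, the retry count, and the doubling delay are my own
assumptions for illustration; the Internet Archive doesn't publish an exact
rate limit that I know of, so treat the numbers as placeholders.

```python
import json
import time
import urllib.error
import urllib.parse
import urllib.request

# Availability endpoint documented at archive.org/help/wayback_api.php
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(page_url, timestamp=None):
    """Build a query URL for the Wayback availability API."""
    params = {"url": page_url}
    if timestamp:
        # Format is YYYYMMDDhhmmss; a shorter prefix such as "2006" also works.
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urllib.parse.urlencode(params)


def fetch_json(url):
    """Default fetcher: plain GET, body parsed as JSON."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def fetch_with_backoff(url, fetch=fetch_json, sleep=time.sleep,
                       max_retries=5, base_delay=1.0):
    """Call fetch(url), retrying only on HTTP 429.

    The delay starts at base_delay seconds and doubles after each 429.
    This policy is an assumption, not something the Archive specifies.
    """
    delay = base_delay
    for attempt in range(max_retries + 1):
        try:
            return fetch(url)
        except urllib.error.HTTPError as err:
            # Re-raise anything that is not a 429, or a 429 on the last attempt.
            if err.code != 429 or attempt == max_retries:
                raise
            sleep(delay)
            delay *= 2
```

Usage would be something like
`fetch_with_backoff(availability_url("example.com", "2006"))`, then reading
`archived_snapshots` from the returned dict. The `fetch` and `sleep` arguments
are injectable only so the retry logic can be exercised without hitting the
network.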

