
Waybackpack: download the entire Wayback Machine archive for a given URL - ingve
https://github.com/jsvine/waybackpack
======
bakztfuture
Wow this is amazing!!! I've built my whole site around showcasing wayback
content: [http://www.StartupTimelines.org/](http://www.StartupTimelines.org/)

This definitely makes it a whole lot easier, wish I had access to it from day
1. Great work guys/gals!!

Shameless plug, if you found it interesting, please consider donating:

[https://www.tilt.com/tilts/startup-timelines-support-fund](https://www.tilt.com/tilts/startup-timelines-support-fund)

[http://archive.org/donate/](http://archive.org/donate/)?

~~~
brassattax
First visit...

"Sorry, you've already viewed 3 Startup Timelines"

~~~
bakztfuture
Sorry about that - this is a known bug we're trying to fix. If you refresh or
view in incognito mode, it might let you browse after a minute or so. This is
in no way intentional to try to get you to sign up

~~~
RIMR
>This is in no way intentional to try to get you to sign up

Yes, it is. Why else would you put a 3-startup limit for non-registered users,
and present them with a registration page? If the script is malfunctioning,
turn it off. Maybe if I could actually see the content I would sign up, but
since I have to sign up first, I guess I'll do what most others who are
visiting your site for the first time are doing: Go away and never come back.

Do yourself a favor and get rid of intentional annoyances. You're already
funding this thing with donations.

~~~
orik
I think we can take someone at their word when they say:

    Sorry about that - this is a known bug we're trying to fix.
    This is in no way intentional to try to get you to sign up

Be civil. Don't say things you wouldn't say in a face-to-face conversation.
Avoid gratuitous negativity. (-:

I'm sure this bug will be fixed shortly, right bakztfuture?

~~~
bakztfuture
Yes - orik, thank you so much for understanding. We're using a library called
Flask-Limiter, so I'm looking into what could be causing this... I probably
misread the docs somewhere.

Startup Timelines was always meant to be free and accessible; it didn't even
ask you to create an account until last month (I've been running the site for
a year now). I don't want anyone to be upset, so I've quickly made an account
you can use to browse if you've hit this rogue error:

username: hn_user

password: startuptimelines1

(all accounts are full btw)

I'm sorry and hope this doesn't ruin your take on the site forever. There's a
tour that walks you through the site when you register, so, here are
screenshots of the pages:

Tour page 1: [http://i.imgur.com/5DCwdbg.png](http://i.imgur.com/5DCwdbg.png)

Tour page 2: [http://i.imgur.com/o7ghamJ.png](http://i.imgur.com/o7ghamJ.png)

Tour page 3: [http://i.imgur.com/iHN775V.png](http://i.imgur.com/iHN775V.png)

Let me know if there is anything else I can do: bakz[at]bakzdesign.com ...
sorry, and thank you again

------
genop
Internet Archive has an HTTP header called "X-Archive-Wayback-Perf:"

I can guess what it means but maybe someone here has some insight?

It certainly looks like their Tengine (an nginx fork) servers are configured
to expect pipelined requests. They have no problem with more than 100 requests
at a time. See the HTTP header above.

Downloading each snapshot one at a time, i.e., many connections, one after the
other, perhaps each triggering a TIME_WAIT and consuming resources, may not be
the most sensible or considerate approach. If just requesting the history of a
single URL, maybe pipelined requests over a single connection is more
efficient? I'm biased and I could be wrong.

However their robots.txt says "Please crawl our files." I would guess that
crawlers use pipelining and minimize the number of open connections.

I have had my own "wayback downloader" for a number of years, written in shell
script, openssl and sed. It's fast.

IA is one of the best sites on the www. Have fun.
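The single-connection idea above can be sketched with Python's standard-library http.client. This is plain HTTP/1.1 keep-alive reuse rather than true pipelining (the stdlib client waits for each response before sending the next request), and the "wayback-sketch" User-Agent string is a placeholder:

```python
import http.client

def fetch_over_one_connection(host, paths,
                              conn_cls=http.client.HTTPSConnection):
    """Fetch several paths over a single persistent connection,
    avoiding a fresh TCP handshake (and a lingering TIME_WAIT)
    per snapshot. This is keep-alive reuse, not true pipelining,
    but it addresses the same resource concern."""
    conn = conn_cls(host)
    bodies = []
    for path in paths:
        conn.request("GET", path,
                     headers={"User-Agent": "wayback-sketch"})
        resp = conn.getresponse()
        bodies.append(resp.read())  # drain fully before the next request
    conn.close()
    return bodies

# Usage (performs real requests):
#   fetch_over_one_connection("web.archive.org",
#       ["/web/20160101000000/http://example.com/"])
```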

------
TazeTSchnitzel
How much load does this place on the Internet Archive? It'd be a shame if this
thing's access patterns caused them trouble.

~~~
greglindahl
If I read the code correctly, it's one-at-a-time? Which minimizes the stress;
if we're slow, it'll slow down.

It'd be nice if it had identification in the UserAgent, so that we could
complain to the right people if it was a problem.

~~~
jsvine
Hi, Greg! Library author here. I'd be happy to add a configurable UserAgent.
Perhaps the default would be a generic "waybackpack", with an option for the
user to add contact info. Does that sound about right? Or would you prefer a
different approach?

And, yep, the library is intentionally designed only to request one snapshot
at a time.
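That one-at-a-time design is easy to mirror in a few lines. A minimal sketch, assuming the Wayback Machine's public /web/&lt;timestamp&gt;/&lt;url&gt; layout (the "id_" flag requests the raw capture without the Archive toolbar); the contact address and pause length are placeholders:

```python
import time
import urllib.request

def snapshot_url(timestamp, original_url):
    # One capture lives at /web/<14-digit timestamp>/<original URL>;
    # "id_" asks for the raw page without the Archive's toolbar.
    return f"https://web.archive.org/web/{timestamp}id_/{original_url}"

def download_sequentially(timestamps, original_url,
                          user_agent="waybackpack (you@example.com)"):
    """Fetch snapshots strictly one at a time, identifying the client
    so the Archive can reach the right person if traffic ever becomes
    a problem. The address in the default User-Agent is a placeholder."""
    pages = []
    for ts in timestamps:
        req = urllib.request.Request(
            snapshot_url(ts, original_url),
            headers={"User-Agent": user_agent},
        )
        with urllib.request.urlopen(req) as resp:
            pages.append(resp.read())
        time.sleep(1)  # be polite: never more than one request in flight
    return pages
```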

~~~
greglindahl
waybackpack would be a great default; encouraging the actual user to add
contact info would be better for you because we could complain to them instead
of you :-)

~~~
jsvine
Updated, merged, and pushed to PyPI as part of v0.1.0:
[https://github.com/jsvine/waybackpack/pull/5](https://github.com/jsvine/waybackpack/pull/5)

Thanks again for the feedback. Really appreciate it — and the existence of the
Internet Archive and Wayback Machine.

------
hartator
Ha, fun - I made a similar tool not so long ago:
[https://github.com/hartator/wayback-machine-downloader/](https://github.com/hartator/wayback-machine-downloader/)

------
shaunpud
I've been using this with great success too:
[https://github.com/hartator/wayback-machine-downloader](https://github.com/hartator/wayback-machine-downloader)

------
speeder
I wish I had an OS I could run this on... I've wanted a tool like this for a
long time, so I can reconstruct some of the SimCity series documentation.
Maxis had a sort of tradition of team members writing detailed accounts of
their work on a game as a "PR stunt", and the few bits I could scavenge from
Archive.org (this material is no longer available on EA's site) have been a
great help to modding efforts, and to my own effort to "restore" SimCity 4 to
work on modern OSes.

~~~
chungy
What OS might you be running that you can't run Python on?

------
robbiemitchell
I was looking for something like this literally two hours ago. Thanks!

------
nxzero
I've always been confused by how the Wayback Machine works. I feel like if
they partnered with browsers to anonymously hash content and discover new
pages, and did a better job of version control, their index would be a lot
bigger and more granular too.

~~~
greglindahl
The Wayback Machine crawls stuff based on popularity (Alexa top million),
search engine metadata (donated by blekko), the structure of the web, and the
desires of our various crawling partners, ranging from the all-volunteer
ArchiveTeam to 400+ libraries and other institutions who use our Archive-It
system. And, finally, there's always the "save page now" button at
[https://archive.org/web/](https://archive.org/web/)

There are big privacy issues to getting data from browsers. A lot of websites
depend on "secret" URLs, even though that's unsafe, and we don't want to
discover or archive those. That means we need opt-in, and smarts.

We do have a project underway with major browsers to send 404s to us to see if
we have the page... and offering to take the user to the Wayback if we do.

~~~
toomuchtodo
Is there a JSON API call that can be made to archive.org to archive a provided
URL and get a success/fail response back?

~~~
greglindahl
Alas, there's no formal save-page-now API, but if you experiment with using it
from a browser, it's not hard to call from a program: fetch
[https://web.archive.org/save/<url>](https://web.archive.org/save/<url>). The
response is HTML, but if you examine the headers you get back, you'll see that
Content-Location: tells you the permanent archived name of that particular
capture.

I call APIs like this "accidental APIs"! From looking at our traffic, we have
quite a few programmatic users of it.
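That accidental API is simple to call from Python. A minimal sketch, taking the save endpoint and Content-Location behavior as described above (not a documented contract); the contact address is a placeholder:

```python
import urllib.request

SAVE_ENDPOINT = "https://web.archive.org/save/"

def build_save_request(target, contact="you@example.com"):
    """Build a save-page-now request for `target`, with an identifying
    User-Agent so the Archive knows whom to contact."""
    return urllib.request.Request(
        SAVE_ENDPOINT + target,
        headers={"User-Agent": f"save-page-sketch ({contact})"},
    )

def archive_location(response):
    """The permanent capture path from the Content-Location header,
    e.g. /web/20160208123456/http://example.com/ (None if absent)."""
    return response.headers.get("Content-Location")

# Usage (triggers a real capture):
#   with urllib.request.urlopen(build_save_request("http://example.com/")) as r:
#       print(archive_location(r))
```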

~~~
toomuchtodo
Thank you!

------
Negative1
Is there a way to download content using this? There is a zip I'm trying to
get from a particular site and it keeps failing due to some kind of download
cap.

Using --continue with wget doesn't work (I'm guessing they turned it off).

------
pjc50
Does this let you get at material which is hidden due to the current
robots.txt, even though it wasn't in force when the site was crawled?

~~~
mirimir
That would be wonderful!

Also stuff that's been censored for other reasons.

------
sanbor
Is there any way to download the assets of the website too? Right now the
HTML has URLs pointing to archive.org.

------
Kinnard
You're halfway to a blockchain!

~~~
j_s
One area I'm interested in is a legally binding verification of content at a
specific point in time - for example: tracking changes to a breaking news
article on CNN.

I'm not sure what technologies would be required to implement something like
this, but I feel like the Internet Archive would be important, and Bitcoin
might be a way of encouraging verification from a globally distributed network
of third-party verifiers.

~~~
rakoo
This idea has already been explored, here's one implementation:
[https://proofofexistence.com/](https://proofofexistence.com/)

~~~
Kinnard
Hmmm, I'd say not quite. That's a precursor though, sure.

