

Meet PastPages.org, the news homepage archive. (And help keep it alive) - palewire
http://www.pastpages.org/
I created this site because I think it ought to exist. The shifting homepages of major media sites should be saved so they can be studied. Done right, I believe PastPages could serve as a resource for scholars seeking to study coverage of news events, like the upcoming U.S. presidential election.

Regularly collecting the data costs money. So I've organized a Kickstarter in hopes of raising funds to keep it up. http://www.kickstarter.com/projects/651552740/keep-pastpages-alive
======
showerst
Do you have permission to be taking these snapshots? The Newseum does
something similar for print content and has agreements with all of the
organizations so that they don't just Cease & Desist them out of existence.

<http://www.newseum.org/todaysfrontpages/>

~~~
kgen
Doesn't Fair Use cover "commentary, criticism, news reporting, research,
teaching, library archiving and scholarship."? This site seems non-profit and
educational in nature, so I would imagine it would pass the balancing test if
push came to shove.

------
palewire
I created this site because I think it ought to exist. The shifting
homepages of major media sites should be saved so they can be studied. Done
right, I believe PastPages could serve as a resource for scholars
seeking to study coverage of news events, like the upcoming U.S.
presidential election.

Collecting this data costs money. So I've set up a Kickstarter drive to raise
funds. If you'd like to help keep PastPages alive, please consider giving.

[http://www.kickstarter.com/projects/651552740/keep-
pastpages...](http://www.kickstarter.com/projects/651552740/keep-pastpages-
alive)

~~~
atlbeer
What stack are you using for the web page capture? It's a perfectly crisp
capture. I've tried before and never got such good programmatic results.

~~~
palewire
I'm using Selenium's Firefox driver from inside a Django app. There is a Python
binding that's slick once you figure out a couple of timeout-related workarounds
that are necessary. Their forums helped me over that hurdle.
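
For anyone wanting to try the same approach, here is a minimal sketch of a Selenium Firefox capture in Python. It is not the PastPages code: the function names, the filename scheme, the headless flag, and the 60-second timeout are all assumptions, and the timeout handling shown is only one plausible workaround, since the comment above does not say which ones were needed.

```python
import os
from datetime import datetime
from urllib.parse import urlsplit


def screenshot_filename(url, when):
    """Build a filename like 'www.nytimes.com_20120103T080000.png' (hypothetical scheme)."""
    host = urlsplit(url).netloc or "unknown-host"
    return f"{host}_{when:%Y%m%dT%H%M%S}.png"


def capture_homepage(url, out_dir=".", page_load_timeout=60):
    """Load `url` in headless Firefox and save a screenshot; returns the file path."""
    # Imported lazily so screenshot_filename() works without Selenium installed.
    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException

    options = webdriver.FirefoxOptions()
    options.add_argument("-headless")
    driver = webdriver.Firefox(options=options)
    try:
        # Assumed workaround: cap the page load so one slow ad or tracker
        # cannot hang the whole capture run.
        driver.set_page_load_timeout(page_load_timeout)
        try:
            driver.get(url)
        except TimeoutException:
            # Keep whatever has rendered so far instead of failing outright.
            pass
        path = os.path.join(out_dir, screenshot_filename(url, datetime.now()))
        driver.save_screenshot(path)
        return path
    finally:
        # Always shut the browser down, even if the capture failed.
        driver.quit()
```

Calling `capture_homepage("http://www.nytimes.com/")` requires Firefox and geckodriver to be available on the machine.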

------
there
I love the New York Times shots. It's a great demonstration of how off-putting
their interstitial ads are, and how many other sites don't need them.

~~~
palewire
Ha! If you know a way around them I'd appreciate the tip.

~~~
donohoe
Hmm. Not sure if it would make a difference but maybe:

(1) Try this URL instead: <http://www.nytimes.com/pages/>

(2) Hit the URL by date: <http://www.nytimes.com/indexes/yyyy/mm/dd/>

Example: <http://www.nytimes.com/indexes/2011/12/03/>

You can also use this URL structure to get the Homepage from several years
back (as far as 2001), as it was around midnight of that date.

<http://www.nytimes.com/indexes/2001/01/01/>

<http://www.nytimes.com/indexes/2001/09/11/> (Notable)

Not sure if this (or any other section) is of interest too:

<http://www.nytimes.com/indexes/2010/12/03/todayspaper/>

(3) When you scrape the page find out the link it provides to the Homepage and
then try that. I had some success doing that.

What I really want is THIS:

"Reward - NYTimes Login Script"
[http://donohoe.tumblr.com/post/10723388191/reward-nytimes-
lo...](http://donohoe.tumblr.com/post/10723388191/reward-nytimes-login)

which would get around that problem.
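
The date-based pattern in (2) can be sketched in Python as below. The function and constant names are hypothetical; only the `/indexes/yyyy/mm/dd/` structure and the `todayspaper` section come from the URLs listed above.

```python
from datetime import date

# Base path taken from the example URLs in this comment.
NYT_INDEX_BASE = "http://www.nytimes.com/indexes"


def nytimes_index_url(day, section=""):
    """Return the archived index URL for `day`, optionally for one section."""
    url = f"{NYT_INDEX_BASE}/{day:%Y/%m/%d}/"
    if section:
        # Sections such as "todayspaper" are appended as a trailing path segment.
        url += f"{section.strip('/')}/"
    return url


print(nytimes_index_url(date(2011, 12, 3)))
print(nytimes_index_url(date(2010, 12, 3), "todayspaper"))
```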

~~~
palewire
Thanks for this great information. I'm stuck at the jury duty cattle call this
morning but will try to put this into action later.

~~~
jashkenas
Take care with putting this into action -- the external homepage archives
constantly suck in the latest version of any precoded module on the page ...
so for example, this should show the Iowa Caucuses, but shows nearly two
months later instead:

<http://www.nytimes.com/indexes/2012/01/03/>

... a misstep we later fixed. For what it's worth, there's also an internal
version of the homepage archive that doesn't suffer from this problem, and is
snapshotted hourly.

~~~
danso
So even all the external stylesheets/js are (relatively) preserved? That's
pretty slick...was this something that was retroactively applied, or something
that's existed since early iterations of the CMS?

------
sp332
I think hosting a static image is cool, but doesn't the Internet Archive
already have a full-HTML archive of these pages? e.g.
[http://web.archive.org/web/20110729013424/http://www.nytimes...](http://web.archive.org/web/20110729013424/http://www.nytimes.com/)
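
The link above follows the Wayback Machine's `/web/<YYYYMMDDhhmmss>/<original URL>` layout. A minimal Python sketch of building such a link (the function name is hypothetical):

```python
from datetime import datetime


def wayback_url(when, original_url):
    """Build a Wayback Machine link for `original_url` at the timestamp `when`."""
    # The timestamp is 14 digits: year, month, day, hour, minute, second.
    return f"http://web.archive.org/web/{when:%Y%m%d%H%M%S}/{original_url}"


print(wayback_url(datetime(2011, 7, 29, 1, 34, 24), "http://www.nytimes.com/"))
```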

~~~
palewire
It does, and I love that site, but unless I'm mistaken I don't think they grab
often enough to track fast-moving news events.

------
xabi
Same service here, but for newspapers (with more than 1000 newspapers around
the globe):

<http://en.kiosko.net/us/>

