Hacker News
Errant cron task yields yearlong time lapse of nytimes.com (joshuanguyen.com)
61 points by ChrisArchitect on July 18, 2011 | 20 comments



I created a web application and a set of scripts late last year to snapshot sites like that at daily/hourly/minute intervals. I also set up the web app to manage the captured images and turn them into videos.

Some of the interesting things I found:

- interesting to compare news sites' coverage of the same news stories - see who publishes stories first and where on the page...

- quickly analyse site ad and content refresh rates

- instant time lapse videos from web cam sites

- some interesting artistic effects as content changes and moves on sites

- the "photographic record" of web sites made it interesting to catch sites failing to update or being broken at times

- very easy to generate gigabytes of content in small amounts of time!

I haven't had time to extend the project further right now, but I still have jobs running that capture some of the top sites daily, to build some year-long web-time-lapse videos and do something with the content. If anyone has ideas to commercialize the content or technology, let me know.

Note: Technologies used - Ubuntu, bash, CutyCapt, JSP, ImageMagick
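A minimal sketch of that kind of pipeline, assuming CutyCapt and ImageMagick are installed (the paths, schedule, and script name here are hypothetical, not the poster's actual setup):

```shell
#!/bin/bash
# snap.sh -- render a URL to a timestamped PNG
# crontab entry to run it hourly (note: % must be escaped as \% inside crontab):
#   0 * * * * /home/capture/snap.sh http://www.nytimes.com /home/capture/nytimes
url="$1"
dir="$2"
mkdir -p "$dir"
# CutyCapt renders the page with a WebKit engine and writes an image
cutycapt --url="$url" --out="$dir/$(date +%Y%m%d-%H%M).png"

# later, the frames can be assembled into an animation with ImageMagick, e.g.:
#   convert -delay 20 "$dir"/*.png timelapse.gif
```

At one capture per hour per site, it is easy to see how this fills gigabytes quickly.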


Sorry, just realized the more original source is http://news.ycombinator.com/item?id=2777508


The original post suffers from poor titling. Phill MV would have been better off taking a page from TechCrunch's confrontational naming style.


Ah, but the Dylan reference was TOO GOOD TO PASS UP.


Or with partisan hackery and tech-buzzword flavoring: "cron effortlessly disembowels nytimes.com temporal perception"


Mid-terms at 1:19 -- http://www.youtube.com/watch?v=sCKGOiauJCE&feature=playe...

Fun to see the results pile in -- then the usual reaction shots from pols (frowns and smiles depending on party).


I must confess, I actually watched the entire video! Much better than scrolling through hundreds of screen captures in a list, and surprisingly entertaining!

The number of watchmakers that advertised over the course of the year is striking. The ads primarily caught my attention - which is strange, because I rarely look at ads when browsing sites. The one constant among all the ads was a watch manufacturer.

Curious if the person who captured these images had a browsing history for watches, or if that's what everyone witnessed? Next experiment: Two completely different users capture these on the same time interval -- side by side comparison! ;-)


>Curious if the person who captured these images had a browsing history for watches, or if that's what everyone witnessed?

Alas, no. I used a WebKit-to-JPEG generator that should, in theory, be pretty much devoid of any browsing history. I'd be surprised if they've started tracking those by IP!


You shouldn't be too surprised; it's quite possible they were tracking some combination of IP, user agent, and a number of other things to identify the browser. I don't know that you would have ended up with a "watch" preference, though, especially without clicking on watch ads or visiting a watch site.


I would love to see the Times adopt a front-page layout that is more web-native and less an imitation of a newspaper's front page; at 5 columns across in places, it is too squished together at only 970px wide.

It would actually be a great candidate for responsive design: the current column setup could look far nicer with more width when it is available, dropping columns where less screen width is available, similar to http://theconversation.edu.au/ which has a similar number of columns.


how do you use cron to take screenshots like this? does a browser window have to be open somewhere the whole time?


I used http://code.google.com/p/wkhtmltopdf/ . It's an amazing project that I try to use as often as possible, especially whenever a client requires pdfs.
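The same project also ships wkhtmltoimage, which renders a URL straight to an image rather than a PDF; a one-line sketch (output filename hypothetical):

```shell
# render a page to a JPEG using the project's image tool
wkhtmltoimage --quality 80 http://www.nytimes.com/ nytimes.jpg
```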


"Open," yes, but not visible. You can use Xvfb[1] which does everything in memory but doesn't actually show any image.

[1]: http://en.wikipedia.org/wiki/Xvfb
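On a headless server, the usual way to do this is to wrap the capture command in xvfb-run, which starts an in-memory X server for the duration of the command; a sketch, assuming CutyCapt as the capture tool:

```shell
# run the capture inside a virtual framebuffer -- no visible window anywhere
xvfb-run --server-args="-screen 0 1024x768x24" \
  cutycapt --url=http://www.nytimes.com/ --out=nytimes.png
```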


I've often found myself needing -- and thinking about building -- a webapp/site for taking/collecting screenshots of other sites. But I'm not sure it'd actually be useful and I certainly don't know if anyone'd pay for it.


I was thinking of building something like this recently to test my Django + background worker skills.

http://cutycapt.sourceforge.net/ is one of the better screen capture engines I found for OS X. Most are based on WebKit/Qt, with Xvfb for Unix support.


There were quite a few of these that popped up ~5 years ago. I think most of them are gone.


Yeah, most of those were for thumbnailing.

What I'm talking about is most like a screenshot version of archive.org.


That would be useful. Archive.org is better at not breaking than I imagined it would be, but still often leaves broken assets. The combination of the two would be really cool: see the page, and how the page was viewed at the time.


one odd thing about the video is it should be 7+ mins long, but shows up in YouTube as 5:36 or something. Trick: Watch in 480 mode, and it keeps playing to the full length :-S

You gotta keep watching so you don't miss some more big news events like Osama's death!


This just in.... users of NYT.com like yogurt (see banner ads).



