Hacker News
Ask HN: Does anybody need 1.7M screenshots?
6 points by tomw1808 on Oct 10, 2016 | 10 comments
Hi HN,

I just noticed that one of my projects [1], which constantly crawls websites, is still taking a screenshot of each website it visits. I had completely forgotten about the screenshots, as I no longer use them.

In total, there are around 1,790,000 full-page screenshots of sites that were posted on Reddit, Hacker News, Twitter, and financial news sites since Jan 2014.

Don't ask me to open-source them and make them available for download. I don't dare get involved in any licensing issues or whatever.

They are in an S3 bucket. I just got a bill from Amazon...

If you'd like to have them, or you have any idea what I can do with them other than deleting them, contact me at thomas@newscombinator.com

Thomas

[1] http://www.newscombinator.com



This sort of data falls under the category of "I might need this someday but I can't figure out why".

For the reasons you can't think of, perhaps you might consider indexing them locally then throwing them onto Glacier. The odds of you needing every single one of them (thus making Glacier cost-prohibitive) are far less than the odds of you needing one at random.

I haven't done the math on how many months of S3 hosting it takes to equal the one-time cost of uploading to Glacier, primarily because I don't know how big 1,790,000 screenshots are.

Alternatively, provided downloading them all to your local desktop doesn't run your S3 bill to Mars, tape drives can still be a quite cost-effective way to store a LOT of data.


High-resolution PNG screenshots should be a couple hundred kilobytes each at most, so 1.8 million of them would be less than 1 TB of data. If you transfer them to a temporary EC2 instance in the same region (no bandwidth cost) and zip them, it'll probably cost $50 or so in bandwidth to save them locally from there. Then they'll fit on a <$50 commodity hard drive.
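For anyone who wants to redo that back-of-the-envelope math, here is a rough Python sketch. The 200 KB average is just one reading of "a couple hundred kilobytes", and the $0.09/GB egress price is a placeholder assumption, not a quoted AWS price; the function name `estimate_archive` is made up for illustration. Plug in your own numbers.

```python
def estimate_archive(count, avg_kb, egress_usd_per_gb):
    """Return (total_gb, egress_cost_usd) for `count` files of `avg_kb` KB each."""
    # Decimal units: 1 GB = 1,000,000 KB
    total_gb = count * avg_kb / 1_000_000
    return total_gb, total_gb * egress_usd_per_gb


# Assumed inputs: 1.79M files, 200 KB average, placeholder $0.09/GB egress.
total_gb, cost = estimate_archive(1_790_000, 200, 0.09)
print(f"{total_gb:.0f} GB, roughly ${cost:.0f} to download")
```

With these assumed inputs the total comes out well under 1 TB; zipping would only help noticeably if the files compress further, which already-compressed PNGs mostly won't.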


You are both on point. It is "I might need it sometime, but not now". It doesn't cost a fortune, but it is a pain to download and move around a big bucket of files I don't really need at the moment.


> I dare not to get involved in any licensing issues or whatever

You might be covered under the DMCA: https://en.wikipedia.org/wiki/Dmca

And since they're screenshots some might be considered fair use: https://en.wikipedia.org/wiki/Fair_use

Some being blatant copies of logos, etc


I will leave this up here and in some other places. Maybe someone needs it for some simple cgcv tasks or whatever.

I really don't want to publish it. Some websites are already down and no longer reachable, and there is no way for the creators of those websites to take any of it down... Some of these things haunt me and I hope they don't bite me one day.

But thanks for the input.


Sounds cool and it's more like 1.8M.

Sadly I don't think it'd be much use for anyone since Archive.org takes care of all archiving.

The only time I'd see screenshots come in handy is for live previews.


It would be great to see such a collection from the 1990s...


What was the bill? Eeek.


Not much, $30. :) But rereading what I wrote, it looks like I paid thousands, haha.


That's astoundingly low! Was the main cost the CPU resources or b/w?



