
Ask HN: Does anybody need 1.7M screenshots? - tomw1808
Hi HN,

I just noticed that one of my projects [1], which constantly crawls websites, is still taking a screenshot of each website as well. I had completely forgotten about the screenshots, as I no longer use them.

In total there are around 1,790,000 full-page screenshots of sites that were posted on reddit, Hacker News, tweets, and financial news since Jan 2014.

Don't ask me to open-source them and make them available for download. I dare not to get involved in any licensing issues or whatever.

They are in an S3 bucket. Just got a bill from Amazon...

If you'd like to have them, or you have any idea what I can do with them other than deleting them, contact me at thomas@newscombinator.com

Thomas

[1] http://www.newscombinator.com
======
mchannon
This sort of data falls under the category of "I might need this someday but I
can't figure out why".

For the reasons you can't think of, perhaps you might consider indexing them
locally then throwing them onto Glacier. The odds of you needing every single
one of them (thus making Glacier cost-prohibitive) are far less than the odds
of you needing one at random.

I haven't done the math on how many months of S3 hosting it takes to equal the
upload cost once to Glacier, primarily because I don't know how big 1790000
screenshots are.

Alternatively, provided downloading them all to your local desktop doesn't run
your S3 bill to Mars, tape drives can still be quite cost-effective ways to
store a LOT of data, cheap.

~~~
dangrossman
High resolution uncompressed PNG screenshots should be a couple hundred
kilobytes each at most. 1.8 million of 'em would be less than 1TB of data. If
you transfer 'em to a temporary EC2 instance (no bandwidth cost) and zip them,
it'll probably cost $50 or so in bandwidth to save them locally from there.
Then, they'll fit on a <$50 commodity hard drive.
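
A rough back-of-the-envelope sketch of that estimate, for anyone who wants to plug in their own numbers. The average file sizes and the per-GB transfer price below are assumptions for illustration, not measured values or quoted AWS rates:

```python
# Back-of-the-envelope estimate of total size and one-time download cost.
# All inputs are assumptions; adjust them to your own bucket and pricing.

def estimate_total_gb(count: int, avg_kb: float) -> float:
    """Total size in GB, using decimal units (1 GB = 1,000,000 KB)."""
    return count * avg_kb / 1_000_000


def estimate_transfer_cost(total_gb: float, price_per_gb: float) -> float:
    """One-time cost of transferring total_gb out at a flat per-GB price."""
    return total_gb * price_per_gb


COUNT = 1_790_000           # screenshot count from the original post
ASSUMED_PRICE_PER_GB = 0.12  # hypothetical flat egress price in USD

if __name__ == "__main__":
    for avg_kb in (100, 200, 500):  # assumed average screenshot sizes
        gb = estimate_total_gb(COUNT, avg_kb)
        cost = estimate_transfer_cost(gb, ASSUMED_PRICE_PER_GB)
        print(f"{avg_kb} KB avg: {gb:.0f} GB, about ${cost:.2f} to download")
```

Even at an assumed 500 KB average the total stays under 1 TB, and zipping first would only lower the transfer figure.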

~~~
tomw1808
You are both on point. It is "I might need it sometime, but not now", and
it doesn't cost a fortune, but it is a pain to download and move around a big
bucket of files I don't really need at the moment.

------
kfrat
> I dare not to get involved in any licensing issues or whatever

You might be covered under the DMCA:
[https://en.wikipedia.org/wiki/Dmca](https://en.wikipedia.org/wiki/Dmca)

And since they're screenshots some might be considered fair use:
[https://en.wikipedia.org/wiki/Fair_use](https://en.wikipedia.org/wiki/Fair_use)

 _Some_ being blatant copies of logos, etc

~~~
tomw1808
I will leave that up here and in some other places. Maybe someone needs it for
some simple cgcv tasks or whatever.

I really don't want to publish it. Some websites are already down and not
reachable anymore, and there is no way for the creators of those websites to
take any of it down... Some of these things haunt me and I hope they don't
bite me one day.

But thanks for the input.

------
VertexRed
Sounds cool and it's more like 1.8M.

Sadly I don't think it'd be much use for anyone since Archive.org takes care
of all archiving.

The only time that I'd see screenshots come in handy is for live previews.

------
dorfuss
It would be great to see such a collection from the 1990s...

------
majortennis
What was the bill? Eeek

~~~
tomw1808
Not much, $30. :) But rereading what I wrote, it looks like I paid
thousands, haha.

~~~
VertexRed
That's astoundingly low! Was the main cost the CPU resources or b/w?

