

Ask HN: Where can I host a large data-set (35GB) in public? - donohoe

I've been scraping various parts of a web site since 2008 and have
accumulated an interesting data set (tracking changes every hour, and in
some cases every 6 minutes) covering published RSS, markup changes, etc.

The data is cool and I think people would have fun analyzing it and making
mashups. It consists of JSON and HTML fragments, with a simple API of sorts
to navigate:

  2010/index.js

The content of "index.js" is an array showing which months within 2010 are
available:

  [ "01" , "02" , "03" , "04" , "05" , "06" , "07" , "08" ]

Knowing the months, you can then look up the days:

  2010/08/index.js
  [ "01" , "02" , "03" , "04" , "05" , and so on... , "17" , "18" , "19" ]

Within each day you can check the hours:

  2010/08/19/index.js
  [ "00" , "01" , "02" , "03" , "04" , and so on... , "09" , "10" ]

Within each hour, you can get the JSON file:

  2010/08/19/10/index.js
and so on.

Freebase and Amazon Public Data Sets have been suggested, but they don't
look like a good fit. Right now the three TAR files come to approximately
35GB.

Any further suggestions?
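
For anyone who wants a feel for the navigation, here's a minimal sketch of
walking the whole hierarchy in Python. The base URL is a placeholder, since
the data isn't hosted anywhere yet:

  import json
  import urllib.request

  # Placeholder: wherever the data set ends up being hosted.
  BASE = "http://example.com/data"

  def fetch_index(path):
      """Fetch an index.js file and parse the JSON array inside it."""
      with urllib.request.urlopen(f"{BASE}/{path}/index.js") as resp:
          return json.load(resp)

  # Walk year -> month -> day and yield each hourly JSON path.
  def walk(year):
      for month in fetch_index(year):
          for day in fetch_index(f"{year}/{month}"):
              for hour in fetch_index(f"{year}/{month}/{day}"):
                  yield f"{year}/{month}/{day}/{hour}/index.js"

  for path in walk("2010"):
      print(path)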
======
paulgb
How big is it if you gzip or bz2 it? I've found compression to be pretty
effective for datasets like this.

~~~
donohoe
Good suggestion. gzip reduced one file from 2,332,528,640 bytes to
129,440,511 bytes (roughly 18:1).

~~~
jerf
Bzip2 may do better still, though it will take longer. 7z may be worth a try too.
You're getting into the range where you could just toss the whole thing up
somewhere without much hassle. If you're not trying to make money, I'd just
throw the whole data set out there on a torrent. Why take the time to write an
access service when you don't even know what people are going to want to do
with it? First thing a serious researcher is likely to do anyhow is just
scrape the whole thing off your website, at much greater expense to you.
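
For what it's worth, comparing all three takes only a few lines of Python's
standard library (the filename is a placeholder; lzma/xz is the same codec
family 7z uses by default):

  import bz2, gzip, lzma, os

  SRC = "dataset.tar"  # placeholder: one of the three TAR files

  raw = os.path.getsize(SRC)
  for name, opener in [("gz", gzip.open), ("bz2", bz2.open), ("xz", lzma.open)]:
      out = f"{SRC}.{name}"
      # Stream in 1MB chunks so the 2GB+ file never sits in memory.
      with open(SRC, "rb") as fin, opener(out, "wb") as fout:
          while chunk := fin.read(1 << 20):
              fout.write(chunk)
      print(f"{name}: {raw:,} -> {os.path.getsize(out):,} bytes")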

------
oman121
Have you considered setting up a torrent of all the data thus far and then
hosting only recent data? Then every month or so you can just create a new
torrent with the data for that month and you will only be hosting a small
amount of data at one time.
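
Scripting the monthly rotation would be only a few lines; a sketch using
the third-party torf library (the directory path and tracker URL are
placeholders):

  from torf import Torrent  # third-party: pip install torf

  # Build a torrent for last month's directory, e.g. data/2010/08/.
  month_dir = "data/2010/08"
  t = Torrent(path=month_dir,
              trackers=["http://tracker.example.com/announce"])
  t.generate()                        # hashes every piece; slow on big dirs
  t.write("dataset-2010-08.torrent")  # share this file, keep seeding month_dir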

------
there
<http://theinfo.org/>

~~~
donohoe
Doesn't look like they have resources on hosting the data but I'll dig around
their discussion groups some more. Thanks.

------
deno
You can try using Amazon S3 (storage) + CoralCDN (distribution). Use S3's
permission system to block direct access so downloads go through CoralCDN,
and you should get cheap, reliable storage without a crazy bill.
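
A sketch of that permission setup as a bucket policy keyed on CoralCDN's
proxy User-Agent, applied with boto3. The "CoralWebPrx" string is my
assumption (check CoralCDN's docs), and since the header is spoofable this
only deters casual direct downloads; it is not real access control:

  import json
  import boto3  # AWS SDK for Python

  BUCKET = "my-dataset-bucket"  # placeholder bucket name

  # Allow public reads only when the User-Agent looks like CoralCDN's
  # proxy. "CoralWebPrx" is an assumption to verify; the header is easy
  # to spoof, so this just discourages direct (billed) downloads.
  policy = {
      "Version": "2012-10-17",
      "Statement": [{
          "Effect": "Allow",
          "Principal": "*",
          "Action": "s3:GetObject",
          "Resource": f"arn:aws:s3:::{BUCKET}/*",
          "Condition": {"StringLike": {"aws:UserAgent": "*CoralWebPrx*"}},
      }],
  }
  boto3.client("s3").put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))

Readers would then fetch through Coral by appending .nyud.net to the
bucket's hostname, e.g.
http://my-dataset-bucket.s3.amazonaws.com.nyud.net/2010/index.js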

------
ddispaltro
Would infochimps.com be useful for this?

~~~
kellyjosephc
Yes, we'd love to host this dataset for free. Send an email to
upload@infochimps.com and we can arrange the transfer.

~~~
d_r
Sorry! Accidentally clicked the "down" button on your comment and there's no
way to undo it.

------
aw3c2
If the scraping was legal and you're legally able to share the data,
archive.org would be a good choice, I guess.

------
hyuen
make a torrent

