

80 terabytes of archived web crawl data available for research - cleverjake
http://blog.archive.org/2012/10/26/80-terabytes-of-archived-web-crawl-data-available-for-research/

======
kristopolous
Napkin math says it would cost ~$5000 to put the data online and ~$138/month
to power it:

Here's the "Everything will always work" math:

~ 27 3TB SATA drives @ $129.95 [0]

~ 7 machines @ $60.17 [1] [2]

~ 8-port switch @ $11.99 [3]

~ 100ft cat5 cable @ $15.95 [4]

~ 14 cat5 connectors @ ~$5 total.

~ 2 6-prong power strips @ $5 [5].

Total: 27 * $129.95 + 7 * $60.17 + $11.99 + $15.95 + $5 + 2 * $5 = $3972.78.

With decent redundancy: ~$5000 [6].

Monthly power bill: ~$138 [7].

Labor: $0 [8].
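
A quick sketch of the napkin math above, for anyone who wants to re-run it with different prices. The quantities and prices are the ones quoted in the list; the variable names and the tuple layout are just illustrative.

```python
# Quantities and unit prices as quoted in the list above.
parts = [
    ("3TB SATA drive", 27, 129.95),
    ("machine", 7, 60.17),
    ("8-port switch", 1, 11.99),
    ("100ft cat5 cable", 1, 15.95),
    ("cat5 connectors (whole lot)", 1, 5.00),
    ("power strip", 2, 5.00),
]

# Footnote [6]: "decent redundancy" is taken as base price * 1.25.
REDUNDANCY_FACTOR = 1.25

base_total = round(sum(qty * price for _, qty, price in parts), 2)
with_redundancy = round(base_total * REDUNDANCY_FACTOR, 2)
raw_capacity_tb = 27 * 3  # 27 drives of 3 TB each

print(f"base total:      ${base_total:,.2f}")
print(f"with redundancy: ${with_redundancy:,.2f}")
print(f"raw capacity:    {raw_capacity_tb} TB")
```

The base total matches the $3972.78 above, and 27 drives give 81 TB raw, so there is about 1 TB of headroom over the 80 TB dataset before any redundancy.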

You can store basically a copy of "The entire internet" for 1/4th the cost of
a new sedan [9] and power it at 1/5th the cost of using that sedan [10].

I officially christen this the future.

\-----

References:

[0]
[http://www.goharddrive.com/ProductDetails.asp?ProductCode=G0...](http://www.goharddrive.com/ProductDetails.asp?ProductCode=G01-0530&Click=46406)

[1] The cheapest all-in-one with SATA I found was the $49 cubieboard
([http://www.linuxbsdos.com/2012/09/11/cubieboard-raspberry-
pi...](http://www.linuxbsdos.com/2012/09/11/cubieboard-raspberry-pi-
competitor-with-sata-port/)). 4 of them would run you $200 ... putting it at
higher expense.

[2] Breakdown:

$24.99 Motherboard: [http://www.ascendtech.us/asus-p5rc-le-lga775-ddr-
motherboard...](http://www.ascendtech.us/asus-p5rc-le-lga775-ddr-
motherboard_i_mbasus51885472h.aspx)

$3.50 256MB RAM: [http://txmicro.com/256MB-DDR-RAM-PC3200-184-Pin-DIMM-Name-
Br...](http://txmicro.com/256MB-DDR-RAM-PC3200-184-Pin-DIMM-Name-
Brand-p-1035.html)

$4.00 celeron: [http://starmicroinc.net/intel-
celeron-325j-253ghz-256k-533mh...](http://starmicroinc.net/intel-
celeron-325j-253ghz-256k-533mhz-sl7tls-p-927.html)

$8.95 CPU fan: <http://3btech.net/inorlga775co.html>

$21.99 550 W PSU: <http://3btech.net/24pinch550wa.html>

$0.74 Molex -> SATA: [http://www.amazon.com/Syba-SY-CAB40007-Molex-Power-
Inches/dp...](http://www.amazon.com/Syba-SY-CAB40007-Molex-Power-
Inches/dp/B0027AGK3M)

\-----

$60.17

[3] <http://3btech.net/giee24ramo10.html>

[4] <http://www.acnt.com/product.asp?pf_id=NHG08>

[5] <http://www.google.com/shopping/product/18338766357132175733>

[6] Using base price * 1.25.

[7] <http://www.bls.gov/ro9/cpilosa_energy.htm> Assuming $0.15/kWh and an
average draw of 180 W per machine, we have: 7 * 0.180 kW * $0.15/kWh * 24 h/d *
365.25 d / 12 months = $138.06/month.

[8] I mean equity, _cough cough_.

[9] Based on KBB value for a baseline 2013 Nissan Altima
([http://www.kbb.com/nissan/altima/2013-nissan-
altima/25/?vehi...](http://www.kbb.com/nissan/altima/2013-nissan-
altima/25/?vehicleid=374605&intent=buy-new&category=sedan&options=))

[10] Based on 15,291 miles per year average
(<http://www.fhwa.dot.gov/ohim/onh00/bar8.htm>) * IRS mileage rate of $0.55
([http://www.irs.gov/uac/IRS-Announces-2012-Standard-
Mileage-R...](http://www.irs.gov/uac/IRS-Announces-2012-Standard-Mileage-
Rates,-Most-Rates-Are-the-Same-as-in-July)) / 12 months = ~$700/month.
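
The footnote arithmetic for [2], [7] and [10] can be re-derived with a few lines. This is only a sketch using the numbers quoted in the footnotes; the constant names are made up for illustration. One thing it surfaces: the six parts listed under [2] add up to $64.17 rather than $60.17, a $4.00 gap (the same amount as the Celeron line), which would move the hardware total by $28 across 7 machines.

```python
# [2] Per-machine parts, prices as listed in the breakdown.
machine_parts = {
    "motherboard": 24.99,
    "256MB RAM": 3.50,
    "Celeron CPU": 4.00,
    "CPU fan": 8.95,
    "550 W PSU": 21.99,
    "Molex -> SATA adapter": 0.74,
}
per_machine = round(sum(machine_parts.values()), 2)

# [7] Monthly power bill: 7 machines at 180 W average, $0.15 per kWh.
MACHINES = 7
AVG_DRAW_KW = 0.180
PRICE_PER_KWH = 0.15
HOURS_PER_MONTH = 24 * 365.25 / 12
monthly_power = MACHINES * AVG_DRAW_KW * PRICE_PER_KWH * HOURS_PER_MONTH

# [10] Monthly cost of driving: average annual mileage * IRS rate, per month.
MILES_PER_YEAR = 15_291
IRS_RATE_PER_MILE = 0.55
monthly_driving = MILES_PER_YEAR * IRS_RATE_PER_MILE / 12

power_vs_driving = monthly_power / monthly_driving

print(f"per machine:      ${per_machine:.2f}")
print(f"monthly power:    ${monthly_power:.2f}")
print(f"monthly driving:  ${monthly_driving:.2f}")
print(f"power / driving:  {power_vs_driving:.2f}")
```

The power-to-driving ratio lands just under 0.20, which is where the "1/5th the cost of using that sedan" figure comes from.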

~~~
moe
Let me assure you, storage has gotten cheap but not _that_ cheap. ;)

You've omitted the labor cost to assemble, debug and maintain your MacGyver
device. That's easily another $2500/mo (amortized).

Secondly you don't _really_ want to store 80T on the cheapest components you
can possibly get without _a lot_ of testing and planning. This $22 PSU, trust
me, it will come back to haunt you.

Thirdly, "decent redundancy" starts at factor 2.5, not 1.25.

And finally: If you want to put this stuff online and have people actually
download it then you'll soon notice that redundancy is not only needed for
availability but also for performance.

A reasonable ballpark figure for low-end networked storage nowadays is
$0.05/GB per month (it gets much cheaper above 500T). Thus hosting those 80T
should cost roughly $4000/mo, give or take a few.
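
The ballpark is a one-liner, sketched here as a small function so the rate can be varied. It assumes decimal units (1 TB = 1000 GB); the function name is hypothetical.

```python
def monthly_hosting_cost(dataset_tb: float, rate_per_gb_month: float) -> float:
    """Monthly cost of hosted storage, assuming 1 TB = 1000 GB."""
    return dataset_tb * 1000 * rate_per_gb_month


# Figures from the comment above: 80 TB at $0.05/GB per month.
cost = monthly_hosting_cost(80, 0.05)
print(f"${cost:,.0f}/month, ${cost * 12:,.0f}/year")
```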

~~~
kristopolous
> That's easily another $2500/mo (amortized).

I'd be doing this myself, so I'll charge myself $0.

> Secondly you don't really want to store 80T on the cheapest components you
> can possibly get without a lot of testing and planning. This $22 PSU, trust
> me, it will come back to haunt you.

Sure. Of course. Bump that to $45. Ok, another $200. Not huge.

> Thirdly, "decent redundancy" starts at factor 2.5, not 1.25.

If you are serving it to the internet at large. But for personal use, 1.25 is
fine unless you are saying the proper RAID setup is Number of disks * 2.5;
which would be something new to me, for sure.

> And finally: If you want to put this stuff online and have people actually
> download it then you'll soon notice that redundancy is not only needed for
> availability but also for performance.

I don't. The presumption is that it's a copy (my copy, actually), not the
original.

> A reasonable ballpark figure for low-end networked storage nowadays is
> $0.05/GB per month (it gets much cheaper above 500T). Thus hosting those 80T
> should cost roughly $4000/mo, give or take a few.

You might be getting ripped off :-(.

I can get a half-rack (that's 22U) for $900/month. Even at 2.5 redundancy and
if I had to pay for the patch and the switch, it's still way under
$4000/month.

Besides, the _thought experiment_ was to run it from somewhere like my
entryway, near my coat-hanger: "What's this? Oh, it's just the internet; the
Whole Internet. No no, just a copy."

~~~
moe
Ah, I missed the personal use bit.

Yes, of course you can cobble something together when availability does not
matter at all (it might blow the fuse in your apt, though;)).

I was just saying that in an application with even the most basic availability
requirements you're not getting the cost down like that.

I.e. even though you _could_ fit that into one rack, nobody actually would
(redundancy is measured in powers of >=2). And even though you _might_ find an
ISP who won't bitch about you drawing >10 Amps in "half a rack" (cough), you
should still be a little concerned about other tenants screwing around in the
same rack as your only copy of 80T of data that you care about... ;)
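
For what it's worth, the ">10 Amps" figure is consistent with the power numbers quoted earlier in the thread (7 machines at 180 W average). A rough check, assuming 120 V mains and a 15 A household breaker; both of those are assumptions, not figures from the thread:

```python
MACHINES = 7          # from the original parts list
AVG_DRAW_W = 180      # average draw per machine, from the power estimate
MAINS_VOLTS = 120     # assumption: US residential / rack voltage
BREAKER_AMPS = 15     # assumption: a typical household circuit

total_watts = MACHINES * AVG_DRAW_W
amps = total_watts / MAINS_VOLTS
headroom = BREAKER_AMPS - amps

print(f"{total_watts} W -> {amps:.1f} A ({headroom:.1f} A of headroom on the breaker)")
```

Spin-up draw would be higher than the average, so the headroom in practice would be smaller than this suggests.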

------
tisme
One of the major reasons to make something like this available is not research
but to get the eggs in more than one basket. If archive.org ever went offline
permanently that would be a pretty big disaster.

A backup copy of the library of Alexandria would have been a nice-to-have
before it burned down but would have been priceless afterwards.

So please, make all of the archive available in some form. It will be an
insane amount of data but at least there will be some institutions that will
be able to insure this precious resource against various disasters.

~~~
itry
Would somebody really care for the library of Alexandria these days? I'm not
through reading the internet yet.

~~~
forensic
I have no words.

The internet is flooded with pop culture bullshit. The Library contained
precious works from some of the greatest geniuses in history--works that we
know existed but are forever lost.

~~~
kami8845
There are still people who care relatively little for the study of the past.
The internet is also filled with lots of things that aren't "pop culture
bullshit".

~~~
teeja
After I look at what careful study of great geniuses of the past and high
culture did for us in the 1940s and ever since, forgetting the bulk of it
(less technical stuff) is probably best. Most of the (intellectual) past is an
albatross.

------
admp
CommonCrawl also has a fairly large ("The crawl currently covers 5 billion
pages") dataset of this sort, which unlike the one from archive.org is already
available to everyone on S3 under the requester-pays model.

<http://commoncrawl.org/data/accessing-the-data/>

------
bravura
Cool! @matpalm, please run your English extractor scripts on it!
([http://matpalm.com/blog/2011/12/10/common_crawl_visible_text...](http://matpalm.com/blog/2011/12/10/common_crawl_visible_text/))

'If you would like access to this set of crawl data, please contact us at info
at archive dot org and let us know who you are and what you’re hoping to do
with it. We may not be able to say “yes” to all requests, since we’re just
figuring out whether this is a good idea, but everyone will be considered.'

This is annoying, that they're using the enterprise sales model for
distribution. Just put it on S3.

~~~
dalke
"Just?" That's $9000/month for standard storage on S3 - $7000 for reduced
redundancy. Each download would cost them about $4000 in bandwidth fees.

They already know how to store massive amounts of data, and how to send it
over the network. Assuming $100/TB for their own media means it would only
cost them about $4000 to store it themselves.
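
A sketch that backs out the per-GB rates implied by the dollar figures above, taking them at face value and assuming 1 TB = 1000 GB. It does not use any published price list. It also shows that 80 TB at $100/TB comes to $8000; the $4000 figure would correspond to $50/TB.

```python
DATASET_TB = 80
DATASET_GB = DATASET_TB * 1000  # assumption: decimal terabytes

# Dollar figures as quoted in the comment above.
quoted = {
    "S3 standard storage per month": 9000,
    "S3 reduced redundancy per month": 7000,
    "one full download": 4000,
}
implied_rates = {label: dollars / DATASET_GB for label, dollars in quoted.items()}

for label, rate in implied_rates.items():
    print(f"{label}: ${rate:.4f}/GB")

# Own-media cost at the quoted $100/TB, and the $/TB that $4000 would imply.
MEDIA_PRICE_PER_TB = 100
media_cost = DATASET_TB * MEDIA_PRICE_PER_TB
implied_media_price_per_tb = 4000 / DATASET_TB

print(f"media at ${MEDIA_PRICE_PER_TB}/TB: ${media_cost}")
print(f"$4000 total implies ${implied_media_price_per_tb:.0f}/TB")
```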

Assuming you have a 1Gb/s connection, that would take you over 7 days to
download. It's probably both cheaper and faster to write the data to disk and
ship the disks than to do an S3 download.
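
The download estimate is easy to reproduce. This sketch assumes decimal units (80 TB = 80 * 10^12 bytes), a fully saturated 1 Gb/s link, and no protocol overhead, so it is a lower bound.

```python
DATASET_BYTES = 80 * 10**12      # 80 TB, decimal
LINK_BITS_PER_SEC = 10**9        # 1 Gb/s, assumed fully saturated
SECONDS_PER_DAY = 86_400

transfer_seconds = DATASET_BYTES * 8 / LINK_BITS_PER_SEC
transfer_days = transfer_seconds / SECONDS_PER_DAY

print(f"{transfer_seconds:,.0f} s = {transfer_days:.1f} days")
```

At a more typical sustained fraction of line rate the figure only gets worse, which is the argument for shipping disks.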

It reads more like they don't know if or how people want to use this. (They
say they "are interested in exploring how others might be able to interact with
or learn from this content if we make it available in bulk.") Simply making the
data available doesn't give them feedback.

For example, is it sufficiently worthwhile for them to go through the effort
of providing the data on S3, given the costs?

~~~
waterlesscloud
Amazon hosts some large data sets for public usage, I suspect a deal could be
arranged here. The cost of access is then on the user.

<http://aws.amazon.com/publicdatasets/>

~~~
dalke
Talking with Amazon to make a special deal for hosting that data would not be
a "just." The major point remains - archive.org knows how to host and provide
large files, so the issue must be some other factor. I think they want to know
if it's worthwhile to do so.

------
hnwh
Maybe Amazon would be generous enough to host this freely, as they did with
common-crawl?

------
dchichkov
Chain with Lempel-Ziv-Markov and squeeze it onto a couple of 5TB hard drives :)
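
Taking the joke at face value: fitting 80 TB onto two 5 TB drives needs an 8:1 compression ratio. A minimal sketch with Python's standard `lzma` module on a made-up, highly repetitive sample; the sample is not representative of real crawl data, and if the archive is already stored compressed the gain would be far smaller.

```python
import lzma

DATASET_TB = 80
TARGET_TB = 2 * 5  # "a couple of 5TB hard drives"
required_ratio = DATASET_TB / TARGET_TB

# Hypothetical sample: one short HTML line repeated many times.
sample = b"<html><body>hello, archived web</body></html>\n" * 2000

compressed = lzma.compress(sample, preset=9)
achieved_ratio = len(sample) / len(compressed)

# Round-trip check: decompression must give back the original bytes.
restored = lzma.decompress(compressed)

print(f"required ratio: {required_ratio:.0f}:1")
print(f"achieved on toy sample: {achieved_ratio:.0f}:1")
print(f"round trip ok: {restored == sample}")
```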

