80 terabytes of archived web crawl data available for research (archive.org)
79 points by cleverjake on Oct 26, 2012 | 30 comments



One of the major reasons to make something like this available is not research but to get the eggs in more than one basket. If archive.org ever went offline permanently that would be a pretty big disaster.

A backup copy of the library of Alexandria would have been a nice to have before it burned down but would have been priceless afterwards.

So please, make all of the archive available in some form. It will be an insane amount of data but at least there will be some institutions that will be able to insure this precious resource against various disasters.


There is a mirror of the Wayback Machine (which isn't all of archive.org's content, but still): http://www.bibalex.org/isis/frontend/archive/archive_web.asp...

Ironically enough, it is hosted at the New Library of Alexandria, Egypt. :)


Once old domains change hands, some of the new owners start blocking archive.org through robots.txt. It would be nice if there were an elegant solution for this problem.
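For anyone unfamiliar with the mechanism, here is a rough sketch of how a crawler that honors robots.txt would read such a rule, using Python's standard `urllib.robotparser`. The robots.txt content, the example URL, and the use of `ia_archiver` as the blocked crawler name are assumptions for illustration, not something taken from the thread.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that a new domain owner might publish.
# "ia_archiver" is assumed here to be the crawler name being blocked.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""

# Placeholder URL standing in for a page that used to be archived.
EXAMPLE_URL = "http://example.com/old-page.html"


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ia_archiver", "SomeOtherBot"):
        print(agent, "allowed:", is_allowed(ROBOTS_TXT, agent, EXAMPLE_URL))
```

The point is that a two-line file is enough to exclude one named crawler while leaving every other crawler unaffected.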


Would somebody really care for the library of Alexandria these days? I'm not through reading the internet yet.


I have no words.

The internet is flooded with pop culture bullshit. The Library contained precious works from some of the greatest geniuses in history--works that we know exist but were forever lost.


On the other hand, it is still literally true that the internet also contains (although minuscule in proportion) precious works from some of the greatest geniuses in history, that you will not finish reading in your lifetime.

My point is that it's not like there's any shortage of great reading materials. Unless you want some specific materials from the library of Alexandria, it does not make much sense to miss it.

More here: http://www.gwern.net/Culture%20is%20not%20about%20Esthetics


There are still people who care relatively little for the study of the past. The internet is also filled with lots of things that aren't "pop culture bullshit".


After I look at what careful study of great geniuses of the past and high culture did for us in the 1940s and ever since, forgetting the bulk of it (less technical stuff) is probably best. Most of the (intellectual) past is an albatross.


I don't know that we necessarily miss it for the information that was in it, but I think we certainly miss it for the progress we lost without it.

Electricity was discovered and even used in the ancient middle east. Steam powered perpetual motion devices were constructed, but never applied to locomotion.

Can you imagine where we would be now as a species if ideas like these had been allowed to propagate across the Mediterranean thousands of years ago? Steam powered devices are only 350+ years old, and your grandpa's grandpa probably didn't have electricity in his house.


In what way was electricity used? (I support your overall point that better preservation would've sped progress, and wish to know if there's something I didn't know.)


Possibly electroplating.

http://en.wikipedia.org/wiki/Baghdad_Battery

But without positive proof of a series connection I don't see how that is a sustainable theory. Wiring is a necessity.


Neat, thanks -- I'd forgotten about that.


Carl S would be rolling in his grave. You need to watch this:

http://www.youtube.com/watch?v=jixnM7S9tLw


I find it hard to believe I'm reading this comment on Hacker News. If there was any group of people I would expect to estimate the value of such a trove of knowledge about the past at close to its true value, it would be this one.


CommonCrawl also has a fairly large ("The crawl currently covers 5 billion pages") dataset of this sort, which unlike the one from archive.org is already available to everyone on S3 under the requester-pays model.

http://commoncrawl.org/data/accessing-the-data/


Cool! @matpalm, please run your English extractor scripts on it! (http://matpalm.com/blog/2011/12/10/common_crawl_visible_text...)

'If you would like access to this set of crawl data, please contact us at info at archive dot org and let us know who you are and what you’re hoping to do with it. We may not be able to say “yes” to all requests, since we’re just figuring out whether this is a good idea, but everyone will be considered.'

This is annoying, that they're using the enterprise sales model for distribution. Just put it on S3.


"Just?" That's $9000/month for standard storage on S3 - $7000 for reduced redundancy. Each download would cost them about $4000 in bandwidth fees.

They already know how to store massive amounts of data, and how to send it over the network. Assuming $100/TB for their own media means it would only cost them about $4000 to store it themselves.

Assuming you have a 1Gb/s connection, that would take you over 7 days to download. It's probably both cheaper and faster to write the data to disk and ship the disk than to do an S3 download.
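A quick sketch of that download estimate, assuming decimal units (1 TB = 10^12 bytes, 1 Gb/s = 10^9 bits/s) and a link that stays fully saturated with no protocol overhead. The function and constant names are made up for illustration.

```python
DATASET_TB = 80     # size of the crawl data discussed in the thread
LINK_GBPS = 1.0     # assumed sustained download rate


def transfer_days(size_tb: float, link_gbps: float) -> float:
    """Days needed to move size_tb terabytes over a link_gbps link."""
    bits = size_tb * 1e12 * 8               # terabytes -> bits
    seconds = bits / (link_gbps * 1e9)      # bits / (bits per second)
    return seconds / 86_400                 # seconds -> days


if __name__ == "__main__":
    print(f"{transfer_days(DATASET_TB, LINK_GBPS):.1f} days")
```

Any real transfer would be slower than this ideal figure, which only strengthens the case for shipping disks.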

It reads more like they don't know if or how people want to use this (they say they "are interested in exploring how others might be able to interact with or learn from this content if we make it available in bulk"). Simply making the data available doesn't give them feedback.

For example, is it sufficiently worthwhile for them to go through the effort of providing the data on S3, given the costs?


Amazon hosts some large data sets for public usage, I suspect a deal could be arranged here. The cost of access is then on the user.

http://aws.amazon.com/publicdatasets/


Talking with Amazon to make a special deal for hosting that data would not be a "just." The major point remains - archive.org knows how to host and provide large files, so the issue must be some other factor. I think they want to know if it's worthwhile to do so.


It'd be a HELL of a lot cheaper and a MUCH faster transfer rate to mail a few really large capacity hard drives full of the data instead of hosting it on S3.

edit: Just saw dalke's response. Great minds think alike!


Napkin math says it would cost ~$5000 to put the data online and ~$138/month to power it:

Here's the "Everything will always work" math:

~ 27 3TB SATA drives @ $129.95 [0]

~ 7 machines @ $60.17 [1] [2]

~ 8-port switch @ $11.99 [3]

~ 100ft cat5 cable @ $15.95 [4]

~ 14 cat5 connectors @ ~$5 total.

~ 2 6-prong power strips @ $5 [5].

Total: 27 * $129.95 + 7 * $60.17 + $11.99 + $15.95 + $5 + 2 * $5 = $3972.78.

With decent redundancy: ~$5000 [6].

Monthly power bill: ~$138 [7].

Labor: $0 [8].

You can store basically a copy of "The entire internet" for 1/4th the cost of a new sedan [9] and power it at 1/5th the cost of using that sedan [10].

I officially christen this the future.
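For anyone who wants to tweak the numbers, the napkin math above can be sketched as a few lines of Python. Quantities and prices are the ones listed in this comment; the dictionary layout and names are just illustrative choices.

```python
# (quantity, unit price in USD), as listed in the comment
PARTS = {
    "3TB SATA drive": (27, 129.95),
    "machine": (7, 60.17),
    "8-port switch": (1, 11.99),
    "100ft cat5 cable": (1, 15.95),
    "cat5 connectors (lot)": (1, 5.00),
    "power strip": (2, 5.00),
}
DRIVE_TB = 3
REDUNDANCY_FACTOR = 1.25  # footnote [6]: base price * 1.25


def base_cost(parts: dict) -> float:
    """Sum quantity * unit price over every part."""
    return sum(qty * price for qty, price in parts.values())


if __name__ == "__main__":
    total = base_cost(PARTS)
    raw_tb = PARTS["3TB SATA drive"][0] * DRIVE_TB
    print(f"raw capacity:    {raw_tb} TB")
    print(f"base cost:       ${total:,.2f}")
    print(f"with redundancy: ${total * REDUNDANCY_FACTOR:,.2f}")
```

Swapping in a pricier PSU or a different drive size only means changing one entry in `PARTS`.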

-----

References:

[0] http://www.goharddrive.com/ProductDetails.asp?ProductCode=G0...

[1] The cheapest all-in-one with SATA I found was the $49 cubieboard (http://www.linuxbsdos.com/2012/09/11/cubieboard-raspberry-pi...). 4 of them would run you $200 ... putting it at higher expense.

[2] Breakdown:

$24.99 Motherboard: http://www.ascendtech.us/asus-p5rc-le-lga775-ddr-motherboard...

$3.50 256MB RAM: http://txmicro.com/256MB-DDR-RAM-PC3200-184-Pin-DIMM-Name-Br...

$4.00 celeron: http://starmicroinc.net/intel-celeron-325j-253ghz-256k-533mh...

$8.95 CPU fan: http://3btech.net/inorlga775co.html

$21.99 550 W PSU: http://3btech.net/24pinch550wa.html

$0.74 Molex -> SATA: http://www.amazon.com/Syba-SY-CAB40007-Molex-Power-Inches/dp...

-----

$60.17

[3] http://3btech.net/giee24ramo10.html

[4] http://www.acnt.com/product.asp?pf_id=NHG08

[5] http://www.google.com/shopping/product/18338766357132175733

[6] Using base price * 1.25.

[7] http://www.bls.gov/ro9/cpilosa_energy.htm Assuming $0.15/kWh and 180 W average draw per machine, we have: 7 * 0.180 kW * $0.15/kWh * 24 h * 365.25 d / 12 mo = $138.06/month.

[8] I mean equity, cough cough.

[9] Based on KBB value for a baseline 2013 Nissan Altima (http://www.kbb.com/nissan/altima/2013-nissan-altima/25/?vehi...)

[10] Based on 15,291 miles per year average (http://www.fhwa.dot.gov/ohim/onh00/bar8.htm) * IRS mileage rate of $0.55 (http://www.irs.gov/uac/IRS-Announces-2012-Standard-Mileage-R...) / 12 months = ~$700/month.
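The running-cost figures in footnotes [7] and [10] can be checked with a small sketch like the following. The inputs are the ones stated in the footnotes; the assumption that every machine draws its average power around the clock is the comment's own, and the names are illustrative.

```python
MACHINES = 7
AVG_DRAW_KW = 0.180                  # 180 W average per machine, footnote [7]
PRICE_PER_KWH = 0.15                 # USD, footnote [7]
HOURS_PER_MONTH = 24 * 365.25 / 12   # average month length in hours

MILES_PER_YEAR = 15_291              # footnote [10]
IRS_RATE_PER_MILE = 0.55             # USD, footnote [10]


def monthly_power_cost(machines: int, draw_kw: float, price_per_kwh: float) -> float:
    """Electricity cost per month for machines running continuously."""
    return machines * draw_kw * price_per_kwh * HOURS_PER_MONTH


def monthly_driving_cost(miles_per_year: float, rate_per_mile: float) -> float:
    """Cost per month of driving at a flat per-mile rate."""
    return miles_per_year * rate_per_mile / 12


if __name__ == "__main__":
    power = monthly_power_cost(MACHINES, AVG_DRAW_KW, PRICE_PER_KWH)
    driving = monthly_driving_cost(MILES_PER_YEAR, IRS_RATE_PER_MILE)
    print(f"power:   ${power:.2f}/month")
    print(f"driving: ${driving:.2f}/month")
    print(f"ratio:   {power / driving:.2f}")
```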


Let me assure you, storage has gotten cheap but not that cheap. ;)

You've omitted the labor cost to assemble, debug and maintain your MacGyver device. That's easily another $2500/mo (amortized).

Secondly you don't really want to store 80T on the cheapest components you can possibly get without a lot of testing and planning. This $22 PSU, trust me, it will come back to haunt you.

Thirdly, "decent redundancy" starts at factor 2.5, not 1.25.

And finally: If you want to put this stuff online and have people actually download it then you'll soon notice that redundancy is not only needed for availability but also for performance.

A reasonable ballpark figure for low-end networked storage nowadays is $0.05/GB per month (it gets much cheaper above 500T). Thus hosting those 80T should cost roughly $4000/mo, give or take a few.


> That's easily another $2500/mo (amortized).

I'd be doing this myself, so I'll charge myself $0.

> Secondly you don't really want to store 80T on the cheapest components you can possibly get without a lot of testing and planning. This $22 PSU, trust me, it will come back to haunt you.

Sure. Of course. Bump that to $45. Ok, another $200. Not huge.

> Thirdly, "decent redundancy" starts at factor 2.5, not 1.25.

If you are serving it to the internet at large. But for personal use, 1.25 is fine unless you are saying the proper RAID setup is Number of disks * 2.5; which would be something new to me, for sure.

> And finally: If you want to put this stuff online and have people actually download it then you'll soon notice that redundancy is not only needed for availability but also for performance.

I don't. The presumption is that it's a copy (my copy, actually), not the original.

> A reasonable ballpark figure for low-end networked storage nowadays is $0.05/GB per month (it gets much cheaper above 500T). Thus hosting those 80T should cost roughly $4000/mo, give or take a few.

You might be getting ripped off :-(.

I can get a half-rack (that's 22U) for $900/month. Even at 2.5 redundancy and if I had to pay for the patch and the switch, it's still way under $4000/month.

Besides, the thought experiment was to run it from somewhere like my entry-way, near my coat-hanger: "What's this? Oh, it's just the internet; the Whole Internet. No no, just a copy."


Ah, I missed the personal use bit.

Yes, of course you can cobble something together when availability does not matter at all (it might blow the fuse in your apt, though;)).

I was just saying that in an application with most basic availability-requirements you're not getting the cost down like that.

I.e. even though you could fit that into one rack, nobody actually would (redundancy is measured in powers of >=2). And even though you might find an ISP who won't bitch about you drawing >10 Amps in "half a rack" (cough), you should still be a little concerned about other tenants screwing around in the same rack as your only copy of 80T of data that you care about... ;)


A Synology box with a slave unit would do 96T uncompressed with some room for redundancy in there based on 24 4T drives.


I don't see why you would use 20 machines. I think a good place to start would be RAID 5 using 6 drives, so 5 for data and 1 for parity. That gives 6 machines, assuming you're dealing with uncompressed data; assuming it's mostly text, you can probably use 2 or 3.
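A rough sketch of that sizing, assuming one 6-drive RAID 5 group per machine, 3TB drives as in the parent comment, and one drive's worth of capacity per group going to parity. The `compressed_fraction` parameter is a hypothetical knob; the 0.33 used below is a guess for mostly-text data, not a measured figure.

```python
import math


def raid5_groups_needed(
    data_tb: float,
    drive_tb: float,
    drives_per_group: int,
    compressed_fraction: float = 1.0,
) -> int:
    """Number of RAID 5 groups (one per machine) needed to hold the data.

    RAID 5 spends one drive's worth of capacity per group on parity,
    so usable space is (drives_per_group - 1) * drive_tb.
    """
    usable_per_group = (drives_per_group - 1) * drive_tb
    stored_tb = data_tb * compressed_fraction
    return math.ceil(stored_tb / usable_per_group)


if __name__ == "__main__":
    print("uncompressed:", raid5_groups_needed(80, 3, 6))
    # 0.33 is an assumed compression outcome for mostly-text data.
    print("compressed:  ", raid5_groups_needed(80, 3, 6, compressed_fraction=0.33))
```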


How did I screw that up ... woops, let me change it.

Ok done.

I still argue 7 machines though. I mean, sure you could do USB + enclosure or have some more expensive board (with 6 SATA connectors, I don't know how cheap those go). Then you also may need more power, depending on how you use the thing. It's true that fewer machines generally = fewer faults, just as a matter of statistics.

But in reality, in practice, the users will probably have some IBM or SGI solution that is a full-height rack with a bunch of SAS drives or something. I'm sure you've seen those things at trade shows.

But my point here was to try to determine how much it would cost with total baseline OTS hardware.


Wouldn't an adaptation of the Backblaze Pod do..?

http://blog.backblaze.com/2009/09/01/petabytes-on-a-budget-h...


Maybe Amazon would be generous enough to host this for free, as they did with Common Crawl?


Chain with Lempel-Ziv-Markov and squeeze on a couple of 5TB hard drives :)
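As a toy illustration of the idea, Python's standard `lzma` module implements the Lempel-Ziv-Markov chain algorithm. The sample below is an artificial, highly repetitive HTML snippet, so the ratio it prints says nothing about how real crawl data would compress; it only shows the mechanics.

```python
import lzma

# Artificial sample: the same small HTML page repeated many times.
SAMPLE = b"<html><body><p>Hello, archived web!</p></body></html>\n" * 10_000


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size, using LZMA preset 9."""
    compressed = lzma.compress(data, preset=9)
    # Round-trip check: decompressing must give back the exact input.
    if lzma.decompress(compressed) != data:
        raise ValueError("LZMA round trip failed")
    return len(compressed) / len(data)


if __name__ == "__main__":
    print(f"compressed to {compression_ratio(SAMPLE):.4%} of original size")
```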



