
Cool! @matpalm, please run your English extractor scripts on it! (http://matpalm.com/blog/2011/12/10/common_crawl_visible_text...)

'If you would like access to this set of crawl data, please contact us at info at archive dot org and let us know who you are and what you’re hoping to do with it. We may not be able to say “yes” to all requests, since we’re just figuring out whether this is a good idea, but everyone will be considered.'

It's annoying that they're using the enterprise sales model for distribution. Just put it on S3.

"Just?" That's $9000/month for standard storage on S3 - $7000 for reduced redundancy. Each download would cost them about $4000 in bandwidth fees.

They already know how to store massive amounts of data, and how to send it over the network. Assuming $100/TB for their own media means it would only cost them about $4000 to store it themselves.

Assuming you have a 1 Gb/s connection, that would take you over 7 days to download. It's probably both cheaper and faster to write the data to disk and ship the disks than to do an S3 download.
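For anyone who wants to check the download estimate, here is a rough sketch of the arithmetic. The data set size isn't stated in this thread; the 80 TB below is an assumption, picked because it is consistent with the "over 7 days at 1 Gb/s" figure. The function name and the second link speed are just illustrative.

```python
def transfer_days(size_tb: float, link_gbps: float) -> float:
    """Days needed to move size_tb terabytes over a fully used link_gbps link."""
    bits = size_tb * 1e12 * 8             # decimal terabytes -> bits
    seconds = bits / (link_gbps * 1e9)    # link speed in bits per second
    return seconds / 86_400               # seconds per day


# Assumption: the size is not given in the thread; 80 TB matches "over 7 days".
ASSUMED_SIZE_TB = 80

for gbps in (1.0, 0.1):
    days = transfer_days(ASSUMED_SIZE_TB, gbps)
    print(f"{gbps:g} Gb/s: {days:.1f} days")
```

This ignores protocol overhead and assumes the link stays saturated the whole time, so real downloads would likely take longer, which only strengthens the case for shipping disks.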

It reads more like they don't know if or how people want to use this. (The "are interested in exploring how others might be able to interact with or learn from this content if we make it available in bulk.") Simply making the data available doesn't give them feedback.

For example, is it sufficiently worthwhile for them to go through the effort of providing the data on S3, given the costs?

Amazon hosts some large data sets for public use; I suspect a deal could be arranged here. The cost of access would then fall on the user.


Talking with Amazon to make a special deal for hosting that data would not be a "just." The major point remains - archive.org knows how to host and provide large files, so the issue must be some other factor. I think they want to know if it's worthwhile to do so.

It'd be a HELL of a lot cheaper and a MUCH faster transfer rate to mail a few really large capacity hard drives full of the data instead of hosting it on S3.

edit: Just saw dalke's response. Great minds think alike!
