
Google did the terabyte slightly slower (68 seconds) on 4x fewer machines, but did the petabyte in 6 hours and 2 minutes (around 1/3 of the time of Hadoop) on nearly the same number of machines (4000).


Yes, but the key difference here is that you can download the Hadoop source and try out Yahoo's approach yourself.


Sure, you also just need 3600 machines.


You don't have to sort a terabyte in 62 seconds; you could do it on 10 machines, and it would probably take less than a day.

If you are a company that needs to sort a petabyte once in a while and your data already resides on Amazon's cloud, you could get 3600 high-CPU EC2 instances and do it for about $12k.

A petabyte is on the order of 100k bytes for every person on the planet. If you have that much data that you need to sort in less than a day, you can afford it.
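
For what it's worth, here is a back-of-envelope sketch of where numbers in that ballpark could come from. The hourly rate (~$0.20 for a high-CPU medium instance) and the run time (~16 hours, in the same range as the Hadoop result discussed above) are assumptions, not quoted prices for this exact workload:

    # Back-of-envelope numbers; the hourly rate and run time are assumptions.
    instances = 3600
    hourly_rate = 0.20        # USD/hour, assumed high-CPU medium rate
    run_hours = 16            # assumed run time, in the ballpark of the Hadoop result

    cost = instances * hourly_rate * run_hours
    print("estimated cost: $%.0f" % cost)          # ~ $11,520, i.e. about $12k

    petabyte = 10 ** 15
    world_population = 6.8e9                       # roughly the 2009 figure
    print("bytes per person: %.0f" % (petabyte / world_population))   # ~ 147,000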


I'd actually be curious to see what would happen if you tried to sort a petabyte with 3800 EC2 nodes. I wonder how much worse the performance would be. EC2 instance-local storage is pretty slow by default (the "first-write" problem), but if you used multiple EBS volumes on each node you might be able to get decent I/O performance.
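
One way to get a rough feel for that question is a Fermi estimate of the pure disk time per node. The sustained throughput and the number of passes over the data are assumptions; the node count comes from the comment above:

    # Rough Fermi estimate of per-node disk time for a petabyte sort.
    # Throughput and pass count are assumptions; 3800 nodes is from above.
    petabyte = 10 ** 15
    nodes = 3800
    throughput_mb_s = 50      # assumed sustained sequential MB/s per node
    passes = 4                # map read+write plus reduce read+write (assumption)

    data_per_node = petabyte / nodes                      # ~263 GB per node
    disk_hours = data_per_node * passes / (throughput_mb_s * 1e6) / 3600
    print("data per node: %.0f GB" % (data_per_node / 1e9))
    print("disk time per node: %.1f hours" % disk_hours)  # ~5.8 h at these numbers

Anything slower than that per node and the run gets proportionally longer, which is where striping across multiple EBS volumes would help.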


Just a small correction: using their Elastic MapReduce large instances it would cost you less than $3500.


IIRC, the Elastic MapReduce fees are on top of the normal EC2 instance fees?


Ah, my mistake; I missed that line above the pricing. Please disregard my previous comment.

That changes my figure quite a bit too: up to $25K.




