
Google did the terabyte slightly slower (68 seconds) on 4x fewer machines, but did the petabyte in 6 hours and 2 minutes (around 1/3 of the time of Hadoop) on nearly the same number of machines (4000).


Yes, but the key difference here is that you can download the Hadoop source and try out Yahoo's approach yourself.


Sure, you also just need 3600 machines.


You don't have to sort a terabyte in 62 seconds; you could do it on 10 machines, and it would probably take less than a day.

If you are a company that needs to sort a petabyte once in a while and your data already resides on Amazon's cloud, you could get 3600 high-CPU EC2 instances and do it for about $12k.

A petabyte is on the order of 100k bytes for every person on the planet. If you have that much data that you need to sort in less than a day, you can afford it.
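
For what it's worth, here is a back-of-envelope sketch of where numbers in that ballpark could come from. The hourly rate (~$0.20 for a high-CPU medium instance) and the run time (~16 hours, in the same range as the Hadoop result discussed above) are assumptions, not quoted prices for this exact workload:

    # Back-of-envelope numbers; the hourly rate and run time are assumptions.
    instances = 3600
    hourly_rate = 0.20        # USD/hour, assumed high-CPU medium rate
    run_hours = 16            # assumed run time, in the ballpark of the Hadoop result

    cost = instances * hourly_rate * run_hours
    print("estimated cost: $%.0f" % cost)          # ~ $11,520, i.e. about $12k

    petabyte = 10 ** 15
    world_population = 6.8e9                       # roughly the 2009 figure
    print("bytes per person: %.0f" % (petabyte / world_population))   # ~ 147,000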


I'd actually be curious to see what would happen if you tried to sort a petabyte with 3800 EC2 nodes. I wonder how much worse the performance would be. EC2 instance-local storage is pretty slow by default (the "first-write" problem), but if you used multiple EBS volumes on each node you might be able to get decent I/O performance.
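
One way to get a rough feel for that question is a Fermi estimate of the pure disk time per node. The sustained throughput and the number of passes over the data are assumptions; the node count comes from the comment above:

    # Rough Fermi estimate of per-node disk time for a petabyte sort.
    # Throughput and pass count are assumptions; 3800 nodes is from above.
    petabyte = 10 ** 15
    nodes = 3800
    throughput_mb_s = 50      # assumed sustained sequential MB/s per node
    passes = 4                # map read+write plus reduce read+write (assumption)

    data_per_node = petabyte / nodes                      # ~263 GB per node
    disk_hours = data_per_node * passes / (throughput_mb_s * 1e6) / 3600
    print("data per node: %.0f GB" % (data_per_node / 1e9))
    print("disk time per node: %.1f hours" % disk_hours)  # ~5.8 h at these numbers

Anything slower than that per node and the run gets proportionally longer, which is where striping across multiple EBS volumes would help.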


Just a small correction: using their Elastic MapReduce large instances it would cost you less than $3500.


IIRC, the Elastic MapReduce fees are on top of the normal EC2 instance fees?


Ah, my mistake; I missed that line above the pricing. Please disregard my previous comment.

That changes my figure quite a bit too: up to $25K.




