Hadoop sorts a petabyte in 16.25 hours and a terabyte in 62 seconds (yahoo.net)
36 points by voberoi on May 12, 2009 | 16 comments


Google did the terabyte slightly slower (68 seconds) on 4x fewer machines, but did the petabyte in 6 hours and 2 minutes (around 1/3 of Hadoop's time) on nearly the same number of machines (4000).
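
A quick back-of-the-envelope check on those ratios, using only the figures quoted here:

    # Petabyte sort: Hadoop 16.25 hours vs. Google 6 hours 2 minutes
    hadoop_pb_hours = 16.25
    google_pb_hours = 6 + 2.0 / 60
    print(google_pb_hours / hadoop_pb_hours)  # ~0.37, i.e. roughly 1/3 of Hadoop's time

    # Terabyte sort: Hadoop 62 s vs. Google 68 s
    print(68 / 62)  # ~1.10, Hadoop about 10% faster (on ~4x the machines)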


Yes, but the key difference here is that you can download the Hadoop source and try out Yahoo's approach yourself.


Sure, you also just need 3600 machines.


You don't have to sort a terabyte in 62 seconds; you could do it on 10 machines and it would probably take less than a day.

If you are a company that needs to sort a petabyte once in a while and your data resides on the Amazon cloud, you could get 3600 high-CPU EC2 instances and do it for about $12k.

A petabyte is on the order of 100k bytes for every person on the planet. If you have that much data that you need to sort in less than a day, you can afford it.
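
A rough sketch of where the $12k comes from; the ~$0.20/instance-hour rate is my assumption (it is what makes the figure work out), not a published price:

    # Hypothetical cost estimate -- the hourly rate is assumed, not quoted anywhere.
    instances = 3600
    hours = 16.25            # Hadoop's petabyte sort time
    rate = 0.20              # assumed $ per instance-hour
    print(instances * hours * rate)  # ~$11,700, i.e. about $12k

    # "100k bytes for every person on the planet" (taking ~6.7 billion people in 2009)
    print(1e15 / 6.7e9)      # ~150,000 bytes per person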


I'd actually be curious to see what would happen if you tried to sort a petabyte with 3800 EC2 nodes. I wonder how much worse your performance would be? EC2 instance-local storage is pretty slow by default (the "first-write" problem), but if you used multiple EBS volumes on each node you might be able to get pretty good I/O performance.


Just a small correction: using their mapreduce large instances it would cost you less than $3500.


IIRC, the mapreduce fees are on top of the normal EC2 instance fees?


Ah, my mistake, I missed that line above the pricing. Please disregard my previous comment.

Changes my figure quite a bit too. Up to $25K.



I think you mean "3800 machines running Hadoop"


Their report (linked to from the post) goes into greater detail: http://developer.yahoo.com/blogs/hadoop/Yahoo2009.pdf

I'd love to know why the 500 GB and 100 TB sorts ran at about half the speed of the other two (~0.5 TB/min as opposed to ~1 TB/min).


They doubled the RAM before the petabyte sort. Is that what you're talking about?


A minute to sort 1 TB on a system with 11 TB of RAM?


IIRC the rules require the data to be read from disk, sorted, and written back to disk, so that's at least 32GB/s of disk I/O by my calculations. Also, the cluster has fairly weak bisection bandwidth.
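
The arithmetic behind that figure, for anyone checking:

    # 1 TB read from disk plus 1 TB written back, in 62 seconds
    bytes_moved = 2 * 10**12
    seconds = 62
    print(bytes_moved / seconds / 10**9)  # ~32 GB/s of aggregate disk I/O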


With 1000+ machines sorting 1 TB, it should work out to each machine reading, sorting, and writing rows at around 32 MB/s, on machines with 4 SATA disks per node.
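
Roughly, using a round 1000 machines (the report gives the exact node count):

    # Split the ~32 GB/s aggregate I/O from the comment above across the cluster
    aggregate_mb_per_s = 32 * 1000       # 32 GB/s expressed in MB/s
    machines = 1000                      # round figure; the actual count was higher
    per_machine = aggregate_mb_per_s / machines
    print(per_machine)                   # ~32 MB/s per machine
    print(per_machine / 4)               # ~8 MB/s per disk with 4 SATA disks per node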

While it is kind of neat that Hadoop makes it easy to run big jobs on clusters almost nobody can afford, it doesn't look like it's very efficient.

For comparison, check out http://www.ordinal.com


Is their figure a minute to sort 34 GB on 1 machine? I couldn't tell.



