Hadoop sorts a petabyte in 16.25 hours and a terabyte in 62 seconds (yahoo.net)
36 points by voberoi on May 12, 2009 | 16 comments


Google did the terabyte slightly slower (68 seconds) on 4x fewer machines, but did the petabyte in 6 hours and 2 minutes (around 1/3 of Hadoop's time) on nearly the same number of machines (4000).
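
A quick back-of-the-envelope check on those ratios, using only the figures quoted here:

    # Petabyte sort: Hadoop 16.25 hours vs. Google 6 hours 2 minutes
    hadoop_pb_hours = 16.25
    google_pb_hours = 6 + 2.0 / 60
    print(google_pb_hours / hadoop_pb_hours)  # ~0.37, i.e. roughly 1/3 of Hadoop's time

    # Terabyte sort: Hadoop 62 s vs. Google 68 s
    print(68 / 62)  # ~1.10, Hadoop about 10% faster (on ~4x the machines)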


Yes, but the key difference here is that you can download the Hadoop source and try out Yahoo's approach yourself.


Sure, you also just need 3600 machines.


You don't have to sort a terabyte in 62 seconds; you could do it on 10 machines and it would probably take less than a day.

If you are a company that needs to sort a petabyte once in a while and your data resides on the Amazon cloud, you could get 3600 high-CPU EC2 instances and do it for about $12k.

A petabyte is on the order of 100k bytes for every person on the planet. If you have that much data that you need to sort in less than a day, you can afford it.
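
A rough sketch of where the $12k comes from; the ~$0.20/instance-hour rate is my assumption (it is what makes the figure work out), not a published price:

    # Hypothetical cost estimate -- the hourly rate is assumed, not quoted anywhere.
    instances = 3600
    hours = 16.25            # Hadoop's petabyte sort time
    rate = 0.20              # assumed $ per instance-hour
    print(instances * hours * rate)  # ~$11,700, i.e. about $12k

    # "100k bytes for every person on the planet" (taking ~6.7 billion people in 2009)
    print(1e15 / 6.7e9)      # ~150,000 bytes per person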


I'd actually be curious to see what would happen if you tried to sort a petabyte with 3800 EC2 nodes. I wonder how much worse your performance would be? EC2 instance-local storage is pretty slow by default (the "first-write" problem), but if you used multiple EBS volumes on each node you might be able to get pretty good I/O performance.


Just a small correction: using their mapreduce large instances it would cost you less than $3500.


IIRC, the mapreduce fees are on top of the normal EC2 instance fees?


Ah, my mistake, I missed that line above the pricing. Please disregard my previous comment.

Changes my figure quite a bit too. Up to $25K.



I think you mean "3800 machines running Hadoop"


Their report (linked to from the post) goes into greater detail: http://developer.yahoo.com/blogs/hadoop/Yahoo2009.pdf

I'd love to know why the 500 GB and 100 TB sorts ran at about half the speed of the other two (~0.5 TB/min as opposed to ~1 TB/min).


They doubled the RAM before the petabyte sort. Is that what you're talking about?


A minute to sort 1 TB on a system with 11 TB of RAM?


IIRC the rules require the data to be read from disk, sorted, and written back to disk, so that's at least 32GB/s of disk I/O by my calculations. Also, the cluster has fairly weak bisection bandwidth.
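
The arithmetic behind that figure, for anyone checking:

    # 1 TB read from disk plus 1 TB written back, in 62 seconds
    bytes_moved = 2 * 10**12
    seconds = 62
    print(bytes_moved / seconds / 10**9)  # ~32 GB/s of aggregate disk I/O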


With 1000+ machines sorting 1 TB, it should work out to each machine reading, sorting, and writing rows at around 32 MB/s, on machines with 4 SATA disks per node.
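
Roughly, using a round 1000 machines (the report gives the exact node count):

    # Split the ~32 GB/s aggregate I/O from the comment above across the cluster
    aggregate_mb_per_s = 32 * 1000       # 32 GB/s expressed in MB/s
    machines = 1000                      # round figure; the actual count was higher
    per_machine = aggregate_mb_per_s / machines
    print(per_machine)                   # ~32 MB/s per machine
    print(per_machine / 4)               # ~8 MB/s per disk with 4 SATA disks per node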

While it is kind of neat that Hadoop makes it easy to run big jobs on clusters almost nobody can afford, it doesn't look like it's very efficient.

For comparison, check out http://www.ordinal.com


Is their figure a minute to sort 34 GB on 1 machine? I couldn't tell.



