

Hadoop sorts a petabyte in 16.25 hours and a terabyte in 62 seconds - voberoi
http://developer.yahoo.net/blogs/hadoop/2009/05/hadoop_sorts_a_petabyte_in_162.html

======
ariwilson
Google did the terabyte slightly slower (68 seconds) on 4x fewer machines, but
did the petabyte in 6 hours and 2 minutes (around 1/3 of the time of Hadoop)
on nearly the same number of machines (4000).

~~~
tc
Yes, but the key difference here is that you can download the Hadoop sources
and try Yahoo's approach out yourself.

~~~
neilc
Sure, you also just need 3600 machines.

~~~
diego
You don't have to sort a terabyte in 62 seconds; you could do it on 10
machines, and it would probably take you less than a day.

If you are a company that needs to sort a petabyte once in a while and your
data resides on the Amazon cloud, you could get 3600 high-cpu EC2 instances
and do it for about $12k.

A petabyte is on the order of 100k bytes for every person on the planet. If
you have that much data that you need to sort in less than a day, you can
afford it.
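A quick sanity check of that arithmetic (the hourly price and run time below are illustrative assumptions, roughly in line with 2009-era high-CPU EC2 rates, not figures from the post):

```python
# Back-of-envelope check of the numbers above. The instance price and
# wall-clock time are assumptions for illustration, not quoted AWS figures.
PETABYTE = 10**15            # bytes
WORLD_POPULATION = 6.8e9     # rough 2009 figure

bytes_per_person = PETABYTE / WORLD_POPULATION
print(f"{bytes_per_person:,.0f} bytes per person")  # ~147,000, i.e. "on the order of 100k"

instances = 3600
price_per_hour = 0.68        # assumed high-CPU instance rate, USD/hour
hours = 5                    # assumed wall-clock time for the sort
cost = instances * price_per_hour * hours
print(f"${cost:,.0f}")       # ~$12,240, in the ballpark of the $12k estimate
```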

~~~
sanswork
Just a small correction: using their MapReduce large instances, it would cost
you less than $3,500.

~~~
sp332
IIRC, the mapreduce fees are _on top of_ the normal EC2 instance fees?

~~~
sanswork
Ah, my mistake. I missed that line above the pricing. Please disregard my
previous comment.

That changes my figure quite a bit too: up to $25K.

------
tlrobinson
I think you mean "3800 machines running Hadoop"

------
voberoi
Their report (linked to from the post) goes into greater detail:
<http://developer.yahoo.com/blogs/hadoop/Yahoo2009.pdf>

I'd love to know why the 500 GB and 100 TB sorts ran at about half the speed
of the other two (~0.5 TB/min as opposed to ~1 TB/min).

~~~
sp332
They doubled the RAM before the petabyte sort. Is that what you're talking
about?

------
bayareaguy
A _minute_ to sort 1TB on a system with 11TB of RAM?

~~~
wmf
IIRC the rules require the data to be read from disk, sorted, and written back
to disk, so that's at least 32GB/s of disk I/O by my calculations. Also, the
cluster has fairly weak bisection bandwidth.
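That aggregate figure checks out as rough arithmetic, assuming the benchmark requires 1 TB read from disk plus 1 TB of sorted output written back:

```python
# Rough check of the 32 GB/s figure: the sort rules require reading the
# input from disk and writing the sorted output back to disk.
data_moved = 2 * 10**12      # bytes: 1 TB read + 1 TB written
seconds = 62

aggregate_rate = data_moved / seconds
print(f"{aggregate_rate / 10**9:.1f} GB/s")  # ~32.3 GB/s across the cluster
```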

~~~
bayareaguy
1000+ machines to sort 1TB should work out to each machine reading, sorting,
and writing rows at around 32MB/s, on machines with _4 SATA disks per node_.
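Spelling out that per-node estimate (using the round figure of 1,000 machines from the comment, and the same 1 TB in + 1 TB out assumption):

```python
# Per-node view of the 62-second terabyte sort, using a round 1,000
# machines with 4 SATA disks each, as described above.
data_moved = 2 * 10**12      # bytes: 1 TB read + 1 TB written
seconds = 62
machines = 1000
disks_per_node = 4

per_node = data_moved / machines / seconds
per_disk = per_node / disks_per_node
print(f"{per_node / 10**6:.0f} MB/s per node")  # ~32 MB/s
print(f"{per_disk / 10**6:.0f} MB/s per disk")  # ~8 MB/s, well below a single SATA disk's streaming rate
```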

While it is kind of neat that Hadoop makes it easy to run big jobs on clusters
almost nobody can afford, it doesn't look like it's very efficient.

For comparison, check out <http://www.ordinal.com>

~~~
oconnor0
Is their benchmark a minute to sort 34 GB on 1 machine? I couldn't tell.

