

Apache Hadoop Wins Terabyte Sort Benchmark (1 terabyte of data in 209 seconds) - nickb
http://developer.yahoo.com/blogs/hadoop/2008/07/apache_hadoop_wins_terabyte_sort_benchmark.html

======
jcdreads
According to the Amazon Web Services calculator, reproducing this result
yourself would cost $118,553.47. That's for 1 TB of data transfer in and 1/10
of that out (presumably one would pull down only sorted indices, not the
entire sorted dataset), and 910 Extra Large compute instances for an hour. The
price goes up pretty quickly if you spend longer than an hour setting up your
many nodes. Obviously feel free to check my assumptions.

------
bayareaguy
209 seconds on 910 2 GHz nodes gives about 3.8e14 cycles, or roughly 380
instructions per byte at one instruction per cycle (assuming only one of the
two cores on each Xeon was active). That's quite a lot of overhead, especially
given that they could have held most of the sort in the 7 TB of available
cluster memory.

They were probably hitting their interconnect limits.
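The arithmetic above checks out; as a quick sketch (assuming, as the estimate does, one instruction per cycle and 1 TB = 1e12 bytes):

```python
# Back-of-envelope check of the instructions-per-byte estimate.
# Assumptions from the comment: 910 nodes, one active 2 GHz core each,
# one instruction per cycle, 209 seconds, 1 TB = 1e12 bytes.
nodes = 910
clock_hz = 2e9          # 2 GHz
seconds = 209
data_bytes = 1e12       # 1 TB

total_instructions = nodes * clock_hz * seconds
per_byte = total_instructions / data_bytes

print(f"{total_instructions:.2e} instructions")  # 3.80e+14 instructions
print(f"{per_byte:.0f} instructions per byte")   # 380 instructions per byte
```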

------
owenomalley
You wouldn't need to transfer the data in or out. You can generate the input
with one map/reduce program and check the output with another. It takes about
4-5 minutes to generate the data (it takes longer than you might expect
because we write 3 replicas of the input data on 3 different nodes). The
checker program takes a couple of minutes to verify that all of the data is
correctly sorted. There aren't any indices, just flat text files for both
input and output.
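A minimal single-machine sketch of the generate-then-verify idea (the record layout here is an assumption, loosely in the spirit of Hadoop's TeraGen: 100-byte records with a 10-byte random key; this is not the actual Hadoop code):

```python
import random

def generate(n_records, seed=0):
    # Hypothetical generator: each record is a 10-byte random key
    # followed by 90 bytes of filler, 100 bytes total.
    rng = random.Random(seed)
    return [rng.randbytes(10) + b"x" * 90 for _ in range(n_records)]

def check_sorted(records):
    # Valid iff every record's key is <= the next record's key.
    return all(records[i][:10] <= records[i + 1][:10]
               for i in range(len(records) - 1))

data = sorted(generate(1000), key=lambda r: r[:10])
assert check_sorted(data)
```

In the real benchmark both steps are themselves map/reduce jobs, so generation and validation parallelize across the same cluster that runs the sort.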

