Apache Hadoop Wins Terabyte Sort Benchmark (1 terabyte of data in 209 seconds) (yahoo.com)
15 points by nickb on July 3, 2008 | 3 comments



According to the Amazon Web Services calculator, reproducing this result yourself would cost $118,553.47. That's for 1 TB of data transfer in and 1/10 of that out (presumably one would pull down only sorted indices, not the entire sorted dataset), and 910 Extra Large compute instances for an hour. The price goes up pretty quickly if you spend longer than an hour setting up your many nodes. Obviously feel free to check my assumptions.
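In rough terms, the estimate is instance-hours plus data transfer. A minimal sketch of that arithmetic follows; the per-unit prices are placeholder assumptions, not the calculator's actual rates, so substitute the current figures to check the number above:

    # Back-of-envelope structure of the AWS cost estimate above.
    # All prices are placeholder assumptions -- plug in the AWS
    # calculator's actual rates to check the $118,553.47 figure.
    PRICE_PER_INSTANCE_HOUR = 0.80   # assumed $/hr, Extra Large instance
    PRICE_PER_GB_IN = 0.10           # assumed $/GB transferred in
    PRICE_PER_GB_OUT = 0.17          # assumed $/GB transferred out

    instances = 910
    hours = 1                # the sort itself; setup time multiplies this
    gb_in = 1024             # 1 TB of input data
    gb_out = gb_in / 10      # pull down ~1/10 of the data

    cost = (instances * hours * PRICE_PER_INSTANCE_HOUR
            + gb_in * PRICE_PER_GB_IN
            + gb_out * PRICE_PER_GB_OUT)
    print(f"estimated cost: ${cost:,.2f}")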


209 seconds on 910 2 GHz nodes gives about 3.8e14 instructions, or about 380 instructions per byte (assuming only one of the dual cores on each Xeon was active, retiring roughly one instruction per cycle). That's quite a lot of overhead, especially given that they could have held most of the data in the 7 TB of available cluster memory.

They were probably hitting their interconnect limits.
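The parent's arithmetic is easy to reproduce. A quick sketch, assuming one active 2 GHz core per node and roughly one instruction per cycle (both simplifications):

    # Cycles available during the run, divided by bytes sorted.
    seconds = 209
    nodes = 910
    hz = 2e9               # 2 GHz, one active core per node (assumed)
    total_bytes = 1e12     # 1 TB

    instructions = seconds * nodes * hz     # ~3.8e14
    per_byte = instructions / total_bytes   # ~380
    print(f"{instructions:.2e} instructions, ~{per_byte:.0f} per byte")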


You wouldn't need to transfer the data in or out. You can generate the data with one map/reduce program and check it with another. Generating the data takes about 4-5 minutes (longer than the sort itself, because we write 3 replicas of the input data on 3 different nodes). The checker program takes a couple of minutes to confirm that all of the data is correctly sorted. There aren't any indices, just flat text files for both input and output.
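A toy, single-machine sketch of what such a checker does: stream through the output and confirm no key is smaller than the one before it. The 10-byte leading key is an assumed record layout for illustration, and this is not Hadoop's actual validation job, which runs as map/reduce across the cluster:

    import sys

    def check_sorted(lines, key_len=10):
        # Verify each record's key is >= the previous record's key.
        prev = None
        for n, line in enumerate(lines, 1):
            key = line[:key_len]
            if prev is not None and key < prev:
                sys.exit(f"out of order at record {n}: {key!r} < {prev!r}")
            prev = key
        print("all records in order")

    if __name__ == "__main__":
        check_sorted(sys.stdin)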



