
Official Google Blog: Sorting 1PB with MapReduce - Anon84
http://googleblog.blogspot.com/2008/11/sorting-1pb-with-mapreduce.html
======
jwilliams
If my math is right:

(68 sec * 1024) / 60 / 60 / 4 --- 68 sec for 1TB, *1024 to scale up to 1PB,
/60/60 to get hours, and /4 as there are 4x as many computers...

So ideally taking the 1TB sort to 1PB on 4x the hardware would be 4.83hrs.

Google's 1PB sort was in 6 hours - So that's fairly linear. Impressive.
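The arithmetic above can be sketched as a quick back-of-envelope script (the 68 s and 4x figures are from the thread; nothing here is from the blog post itself):

```python
# Linear-scaling estimate: 68 s for 1 TB, scaled to 1 PB on 4x the machines.
tb_sort_seconds = 68     # reported 1TB sort time
scale_factor = 1024      # 1 PB = 1024 TB
machine_ratio = 4        # 4x as many computers for the 1PB run

hours = tb_sort_seconds * scale_factor / 3600 / machine_ratio
print(f"{hours:.2f} hours")  # ~4.8 hours, vs. Google's reported 6
```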

~~~
lutorm
Isn't the MapReduce sort N log N in the number of entries? That would make
(1e13 log 1e13) / (1e10 log 1e10) * 68s / 4 = 6.1h. So they were actually
_faster_ than the possible scaling? Doesn't sound right...
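That calculation checks out numerically (assuming 100-byte records, so 1e10 entries per TB; the log base cancels in the ratio):

```python
import math

# N log N scaling from 1e10 records (1 TB) to 1e13 records (1 PB),
# on 4x the machines, starting from the 68 s 1TB result.
n_tb, n_pb = 1e10, 1e13
ratio = (n_pb * math.log(n_pb)) / (n_tb * math.log(n_tb))  # = 1300
hours = 68 * ratio / 4 / 3600
print(f"{hours:.1f} hours")  # ~6.1 hours
```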

~~~
jwilliams
This is a really good point.

I was assuming that the sort was actually a partial sort - returning the top
"x" (admittedly, even then it isn't linear). However, I was also assuming (x <<
n), which would almost certainly be the case with a large data set. It's a
fairly common approach, but I'm not sure why I automatically assumed it.

If the sort is complete, then yeah, that changes things.
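For what a partial "top x" sort looks like in practice, a heap-based selection runs in O(n log x), which is close to linear in n when x << n. (This is purely illustrative; the blog's benchmark was a complete sort.)

```python
import heapq
import random

# Top-x selection via a heap: O(n log x), near-linear when x << n.
random.seed(0)
data = [random.random() for _ in range(100_000)]  # n = 100,000 entries
top_10 = heapq.nlargest(10, data)                 # x = 10, sorted descending
```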

------
nebula
_We were writing it to 48,000 hard drives (we did not use the full capacity of
these disks, though), and every time we ran our sort, at least one of our
disks managed to break (this is not surprising at all given the duration of
the test, the number of disks involved, and the expected lifetime of hard
disks)._

I'm a bit at a loss here. Does that mean that one in ~50K hard disks fails
within six hours of usage?

~~~
russell
Not at all surprising given the large number of disks involved. Here is a very
interesting analysis of actual Google failure rates:
[http://storagemojo.com/2007/02/19/googles-disk-failure-exper...](http://storagemojo.com/2007/02/19/googles-disk-failure-experience/)
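The expected-failure math works out to roughly one failure per run if you assume an annual failure rate of a few percent (the ~3% here is an assumed figure broadly consistent with the study linked above, not a number from the post):

```python
# Expected disk failures during one 6-hour run of 48,000 disks,
# assuming (hypothetically) a ~3% annual failure rate.
disks = 48_000
annual_failure_rate = 0.03          # assumed AFR, not from the post
run_hours = 6
hours_per_year = 24 * 365

expected_failures = disks * annual_failure_rate * run_hours / hours_per_year
print(f"{expected_failures:.2f}")   # ~1 failure per 6-hour run
```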

~~~
nebula
Thanks for the pointer; it's interesting.

------
slackerIII
Heh... I guess this is what Yahoo gets for talking smack about how quickly
they could sort 1TB:
[http://developer.yahoo.net/blogs/hadoop/2008/07/apache_hadoo...](http://developer.yahoo.net/blogs/hadoop/2008/07/apache_hadoop_wins_terabyte_sort_benchmark.html)

~~~
sh1mmer
Google has been working on MapReduce for 4 years longer than anyone else.

I'm actually pretty happy that we (Yahoo!) are talking about what we are doing
in the open and contributing to an open project.

You could reproduce what we did. Obviously you can't reproduce what Google
did. What's more useful?

------
iowahansen
4000 computers over 6 hours... Makes you wonder about electricity costs of
this experiment.

~~~
wmf
Speaking of electricity, at one point Google claimed that during a Web search
your PC uses more energy sending the request, waiting, and displaying the
results than Google uses to perform the search. So the next time you need to
sort 1PB of data it may be cheaper to send it to Google. :-)

~~~
gsmaverick
Lol, good idea. Although I could see M$ or Yahoo! looking to make some sort of
deal like that.

------
est
OK. Zhuangbility. (Chinglish slang: the ability to show off.)

