Hacker News new | past | comments | ask | show | jobs | submit login

Jeff Dean's original implementation of MapReduce.

It was, IIRC, only 3 C++ classes and just a few hundred lines of code. It outsourced much of the distribution, task running, and disk-access tasks to other Google infrastructure, and only focused on running the computation, collecting results for each key, and distributing to the reducers.

The current (as of ~2012, so not that current anymore) version of MapReduce is much faster and more reliable, but there's a certain elegance to starting a trillion-dollar industry with a few hundred lines of code.

There was another doozy, also by Jeff Dean, in the current (again, as of 2012) MapReduce code. It was an external sorting algorithm, and like most external sorts, it worked by writing a bunch of whole-machine-RAM sized temporary files and then performing an N-way merge. But how did it sort the machine's RAM? Using the STL qsort() function, of course! But how do you sort ~64GB of data efficiently using a standard-library function? He'd written a custom comparator that compared whole records at a time, using IIRC compiler intrinsics that compiled down into SIMD instructions and did some sort of Duff's-Device like unrolling to account for varying key lengths. It was a very clever mix of stock standard library functions with highly-optimized, specialized code.




Minor correction: qsort() is CRT, std::sort() is STL.


My memory's actually hazy over whether it was qsort or sort; my intuition is that it would've been qsort because QuickSort is what you'd use when you need an in-place sort with little additional RAM required, but it's been so long that I honestly don't remember.


Not a C++ programmer, but isn't std::stable_sort usually mergesort, while std::sort is usually an introspective quicksort?


Yes.


Ah, good. I'd initially written std::sort in the comment and then went back and edited it because I was like "Isn't std::sort usually mergesort? That wouldn't work here because it takes extra space." It's been a while since I've written C++.


You are assuming qsort is Quicksort and std::sort is not. Both typically use quicksort; but neither is required to.


In C++11, std::sort() is forbidden from being just a quicksort, as it's required to have worst-case O(N log N) complexity.


I'd read somewhere [1] that the built-in Python sort function has a lot of good / clever optimizations too, though maybe not of the same kinds that you describe, i.e. may not be at machine language level. Tim Peters did a lot of that, per what I read, though others may have too.

[1] Think I read it in the Python Cookbook, 2nd Edition, which is a very good book, BTW.


Time Peters wrote Timsort for python https://en.wikipedia.org/wiki/Timsort

I think quite a few languages adopted it as their default sort.


Java's standard lib uses Timsort for sorting Object[], and quicksort for arrays of primatives.

http://stackoverflow.com/questions/4018332/is-java-7-using-t...


Ah, I remembered his name, but had forgotten that it was called Timsort, thanks.

Interesting that others also used it.


Different kinds of optimizations. Timsort tries to collect runs and falls back on insertion sort for them; it exploits the fact that much real-world data is already partially sorted to reduce the number of comparisons made. This optimization exploits the fact that MapReduce keys are always strings in contiguous areas of memory, and are often fairly large, to compare them really quickly.


Do you have a link to the code?


He's talking about proprietary Google code.


Dead proprietary Google code, too - it's long since replaced, I was looking back in the version control history out of curiosity.


Since it's old and no longer relevant to the Google, it would be really interesting to have that code in some kind of code museum, as I feel it gave an interesting insight on how Google did big data (the real one, not the marketing one) back then. Not sure it that's feasible, but I guess it doesn't hurt to ask.


I wish, but it's unfortunately not my call to make. A few other companies have done this, eg. Microsoft open-sourcing Altair Basic about 30 years after it came out or id open-sourcing DOOM. Maybe if I ever go back to work for them, I can propose it. For now, consider getting a job at Google if you want to peek into the VCS history.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: