
History of massive-scale sorting experiments at Google - vgt
https://cloud.google.com/blog/big-data/2016/02/history-of-massive-scale-sorting-experiments-at-google
======
dzdt
I think the AlphaGo program is getting similar crazy-scale burn-in access to
the GPU computing that Google will soon offer in its cloud. Recall that AlphaGo
was the first program to beat a professional in a tournament setting. They
published in Nature. The emphasis was on their deep learning approach, but the
technical details were pretty impressive. AlphaGo beat Fan Hui, a 2-dan
professional, using a 1202 CPU / 176 GPU distributed system. They don't give
hardware details, but my back-of-the-envelope estimate based on recent
hardware is that they were at around a petaflop. In the Nature paper they
examine how the program's Elo rating varies with computation power; it looks
like reaching world-champion level needs at least an order of magnitude more
computation. They have challenged a player at that level (Lee Sedol) to a
series in March and express a quiet confidence in winning. On their blog they
credit Google Cloud with supplying the compute power, but Google Cloud doesn't
yet offer GPUs publicly. I am thinking AlphaGo is getting to do burn-in on a
massive GPU cloud computing center. Look for public availability shortly after
the Sedol match!

------
jlebar

      Note that this sort [in 2012] was 500 times larger than the
      GraySort large-scale requirement and twice as fast in 
      throughput as the **current 2015** official GraySort winner.
    

(emphasis mine)

~~~
Klathmon
It had to feel fucking amazing to be an engineer that was part of that at the
time. To know (even if it was publicly unknown) that you were able to so
massively destroy a record which still stands 4 years later.

~~~
dpe82
Part of the magic of working at Google is there are lots of projects that can
make you feel that way.

~~~
fapjacks
From what I understand, those kinds of projects are only available to certain
engineers, and the vast majority of engineers working at Google don't get to
play those games. When I asked "If you could change one thing without
veto....", six of my seven interviewers essentially said that they did not get
to work on problems that challenged them. This is actually one of the two big
reasons I chose not to accept an offer.

~~~
thrownaway2424
"Moving huge amounts of data from hither to yon" is not a specialized role at
Google. Anybody with solid C++ skills can find a way to work on that.

~~~
dpe82
Heck, if that's the metric, anybody familiar with SQL or even basic Unix
command-line operations can do it if they really want. It's sometimes easy to
forget how much data we regularly move around without really thinking about it.

------
seanp2k2
I'm sure they're asking about this in interviews now. My experience with
Google interviews was that the interviewers were very keen on proving that
they knew more theoretical CS than I did, rather than talking about what the
actual work would require or entail.

~~~
KMag
I worked at Google, and performed quite a few interviews. At least at that
time, the interviewer had no idea where in the company the software engineer
would go. The people who got hired would go into a pool, and the managers who
needed staff would then horse-trade for them. They needed to hire people who
could be plugged into tons of different positions, including some with
genuinely tough problems, where accidentally doing something in O(N^2) would
turn a half-day compute job into one that literally takes centuries to
complete.

Also, part of the point of a Google style interview is to keep pushing you to
the point where your ability fails, then push in another direction to the
point your ability fails. That way, they get an idea of the dimensions of your
skillset. The vast majority of the people I recommended hiring didn't answer
the questions perfectly.

Or, someone could not quite put enough thought into the integer encoding for
Protocol Buffers. The encoding actually used makes the first bit of every byte
a flag for whether there's a next byte, for a maximum of 10 bytes. The only
advantage of this over encoding the length of the integer as the number of
leading ones in the first byte (like UTF-8) is keeping open the option of a
forward-compatible encoding for longer integer types later, without just
supporting multi-precision integers as byte arrays. If you slide all of those
flag bits into the first byte, you have neither more nor fewer flag bits, so
it takes just as much space, but you gain the ability to use a jump table and
native 8-byte and 4-byte loads instead of many more masking operations and
conditional branches. Giving up on anything longer than int64_t means the
maximum encoding length becomes 9 bytes instead of 10.
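
A minimal sketch of that prefix-length decode (not the real protobuf wire
format; the function name and byte layout are assumptions for illustration):

    #include <stdint.h>

    /* The count of leading one bits in the first byte gives the total
       length (1..9 bytes), so a decoder can dispatch once on the first
       byte (in practice via a 256-entry table, i.e. a jump table) and
       then read the payload without per-byte continuation-flag tests.
       A tuned version would replace the loops with native unaligned
       loads plus a byte swap. */
    static int leading_ones(uint8_t b) {
        int n = 0;
        while (b & 0x80) { n++; b = (uint8_t)(b << 1); }
        return n;
    }

    uint64_t decode_prefix_varint(const uint8_t *p, int *len_out) {
        int len = leading_ones(p[0]) + 1;    /* 0xFF prefix => 9 bytes */
        *len_out = len;
        uint64_t v = p[0] & (0xFFu >> len);  /* data bits in byte 0    */
        for (int i = 1; i < len; i++)
            v = (v << 8) | p[i];             /* remaining payload      */
        return v;
    }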

One day, the speed of the indexing system just dropped by half. It turned out
that someone had written a job that used machine learning to generate a bunch
of regexes for some signal, and it was recompiling the regexes for every page
in the index. Google really needs most of its engineers to be the kind who
will naturally spot this sort of thing in code reviews, because if every
engineer on the indexing team made a mistake like that once a year, the
indexing system would always run at half speed, sometimes even 1/3 or 1/4
speed.
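
The class of mistake is easy to sketch; a hypothetical example with POSIX
regexes (the real job was internal, so the function and setup here are made
up):

    #include <stddef.h>
    #include <regex.h>

    /* Compiling a regex costs far more than executing it, so the
       compile belongs outside the per-page loop. The broken job did
       the equivalent of calling regcomp() once per page in the index. */
    void match_pages(const char **pages, int npages, const char *pattern) {
        regex_t re;
        if (regcomp(&re, pattern, REG_EXTENDED) != 0)  /* compile once */
            return;
        for (int i = 0; i < npages; i++)
            (void)regexec(&re, pages[i], 0, NULL, 0);  /* run per page */
        regfree(&re);
    }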

The indexing system uses as much electricity as a small town. The tiniest of
improvements can mean thousands of dollars per year in savings.

~~~
seiji
_The encoding actually used is that the first bit of every byte is a flag for
if there's a next byte, for a maximum of 10 bytes._

Fun facts: those are commonly known as "vbytes." They are the slowest kind of
variable-width integer encoding. The simplest (naive, and generally considered
"wrong") implementations of variable-width-by-continuation-bits use 10 bytes
maximum. A proper implementation uses 9 bytes maximum.

There's actually no reason to ever use 10 bytes in this encoding. If you use
10 bytes, your last byte holds only _one bit_ of actual user data. That's not
very cool. But your next-to-last byte holds seven bits of user data and _one
bit_ of metadata. We can easily say "if we're at the last possible byte, don't
use metadata, just use all the bits we need." All you have to do is say "if we
are currently at 9 bytes, don't add a 10th byte, just use this 9th byte
directly." Bam: your 9th byte now has 8 bits of user data, and you never roll
over into a useless 10th byte with one bit of data.
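
A minimal sketch of that rule (the function name and the buffer contract, at
least 9 writable bytes, are assumptions):

    #include <stdint.h>

    /* Continuation-bit encoding with the 9-byte cap described above:
       bytes 1..8 each carry 7 data bits plus a continuation flag; the
       9th byte, if reached, carries all 8 remaining data bits with no
       flag, so a uint64_t never spills into a 10th byte. */
    int vbyte_encode(uint64_t v, uint8_t *out) {
        int n = 0;
        while (n < 8 && v >= 0x80) {
            out[n++] = (uint8_t)((v & 0x7F) | 0x80);  /* 7 bits + flag */
            v >>= 7;
        }
        out[n++] = (uint8_t)v;  /* 7 bits, or all 8 if this is byte 9 */
        return n;               /* bytes written, 1..9 */
    }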

The "slide all continuation bits into the first byte" sounds like a trick, but
it's really using a TLV encoding where the first byte just holds a number
between 1 and 8, so the entire integer+metadata is now [1 type byte][1 to 8
user data bytes] = 2 to 9 bytes total. Using this scheme also kills any "1
byte, standalone, variable width integer" capability (unless you're storing
partial values in the first T/L byte, but then that limits you to a much lower
max value for one byte).

~~~
KMag
No, it's not a type-length-value encoding, just a length-value encoding. I
should have been more explicit in my UTF-8 reference. I'm really talking about
a UTF-8-like encoding, except that it doesn't specially mark any of the
continuation bytes and generalizes to lengths of up to 9 bytes, so the
encoding (plus using zigzag encoding to put the sign bit in the least
significant bit) is:

    
    
        0xxxxxxS : 1 byte, 7 bits of data, -64 to 63
        10xxxxxx xxxxxxxS : 2 bytes, 14 bits of data, -8192 to 8191
        110xxxxx xxxxxxxx xxxxxxxS : 3 bytes, 21 bits of data, -(2^20) to 2^20-1
        1110xxxx xxxxxxxx xxxxxxxx xxxxxxxS : 4 bytes, 28 bits of data, -(2^27) to 2^27-1
        ... and so on.

        #include <stdint.h>

        /* Zigzag decode: 0 -> 0, 1 -> -1, 2 -> 1, 3 -> -2, ...
           Unsigned negation avoids relying on signed right shift,
           which is implementation-defined behavior in C. */
        int64_t zigzag_decode(uint64_t in) {
          return (int64_t)((in >> 1) ^ (0 - (in & 1)));
        }
    

You can actually get slightly denser packing with an encoding that has no
non-canonical encodings, by adding a length-dependent constant before the
zigzag decoding step, but that's a bit more complicated to explain, and it
allows some 9-byte encoded values that won't fit in an int64_t.
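
For concreteness, a sketch of that length-dependent constant (the function
name is assumed):

    #include <stdint.h>

    /* bias(len) = number of values representable in fewer than len
       bytes: bias(1) = 0, bias(len) = bias(len-1) + 2^(7*(len-1)).
       A decoder adds the bias to the raw payload before the zigzag
       step, so every integer has exactly one (shortest) encoding. */
    uint64_t length_bias(int len) {
        uint64_t b = 0;
        for (int n = 1; n < len; n++)
            b += 1ULL << (7 * n);
        return b;
    }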

~~~
seiji
Well, the "type" is presumed by a schema or known somewhere by the time you
hit your decoder, so it's technically there even if physically absent (and "LV
encoding" doesn't seem to be a thing with any meaningful search results).

Or we can just store everything as int64_t natively anyway. It's only 8 bytes,
after all (and storage is big these days).

~~~
KMag
Storage is big, but disk bandwidth and network bandwidth are still limiting
factors for many applications.

------
trengrj
It is interesting to compare this with Apache Spark, which won the open source
GraySort competition in 2014 [1].

Apache Spark completed a 1PB sort on 190 EC2 (i2.8xlarge) instances in 234
minutes. Google did their 2011 1PB sort on 8000 computers in 33 minutes. Per
machine that works out to roughly 22 GB/min for the Spark run versus roughly
4 GB/min for Google's, though the hardware generations differ.

Moore's law is at work here and the results aren't very comparable, but it
would be interesting to see a head-to-head test on similarly specced clusters.

[1] [https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html](https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html)

------
vgt
"To reduce the impact of stragglers, we used a dynamic sharding technique
called reduce subsharding. This is the precursor to fully dynamic sharding
used in Dataflow."

Fun ways in which internal Google tech makes its way to Google Cloud services.
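
The post doesn't give the mechanics, but the usual shape of dynamic sharding
is work-stealing over many small shards; a hypothetical sketch, with
process_shard standing in for the per-shard work:

    #include <stdatomic.h>

    /* Split the input into many more shards than workers; each worker
       claims the next unclaimed shard as it finishes, so a straggler
       delays one small shard instead of a big static partition. */
    static atomic_int next_shard;

    void process_shard(int shard);  /* hypothetical per-shard work */

    void worker(int num_shards) {
        for (;;) {
            int shard = atomic_fetch_add(&next_shard, 1);
            if (shard >= num_shards)
                break;              /* all shards claimed */
            process_shard(shard);
        }
    }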

~~~
throwaway6497
I want to learn more about dynamic sharding. It would be great if people could
help me out here. I get the feeling my googling skills will fail me.

------
seeing

      We haven’t found a single use case for the problem as stated.
    

Impressive nonetheless.

~~~
matt_wulfeck
This is my favorite part! There's so much fun and learning to be had in
engineering when solving problems with no real practical use. I don't consider
it a waste of time at all, and I'm glad they went ahead with their experiment.

------
vijucat
If you know the distribution of the data (which can be estimated as part of
the sorting process by sampling the data), it is possible to guess whether the
next number will be larger or smaller than the current one 75% of the time
_without looking at it_, using Cover's pick-the-largest-number trick [1, 2,
3]. I wonder if distributed sorting algorithms know about and incorporate
this; it could considerably reduce the cost of hitting the network in
distributed algorithms. (Asking because I don't work in that field; this is
just an idea that I had.)
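
A quick sanity check of the 75% figure, assuming a known uniform distribution
and a threshold at its median:

    #include <stdio.h>
    #include <stdlib.h>

    /* Threshold strategy: observe x, guess "the next number is larger"
       iff x is below the known median. For iid draws the guess is
       right about 3/4 of the time. */
    int main(void) {
        const int trials = 1000000;
        int correct = 0;
        srand(42);
        for (int i = 0; i < trials; i++) {
            double x = (double)rand() / RAND_MAX;  /* observed number  */
            double y = (double)rand() / RAND_MAX;  /* unseen next one  */
            int guess_larger = (x < 0.5);          /* median threshold */
            if (guess_larger == (y > x))
                correct++;
        }
        printf("correct: %.3f\n", (double)correct / trials);  /* ~0.750 */
        return 0;
    }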

References

1\. "Playing a Trick on Uncertainty" by Thomas Bruss, page 7 of
[http://www.emis.de/newsletter/newsletter50.pdf](http://www.emis.de/newsletter/newsletter50.pdf)

2\. "Tom Cover's Number Guessing Game" by Robert Snapp,
[http://www.ibrarian.net/navon/paper/Tom_Cover_s_Number_Guessing_Game.pdf?paperid=15348704](http://www.ibrarian.net/navon/paper/Tom_Cover_s_Number_Guessing_Game.pdf?paperid=15348704)

3\. "Who discovered this number-guessing paradox?",
[https://math.stackexchange.com/questions/709984/who-discovered-this-number-guessing-paradox](https://math.stackexchange.com/questions/709984/who-discovered-this-number-guessing-paradox)

------
known
I prefer to use "split -b 10G data.txt" and "sort -nrs"

------
azurezyq
>> Nobody really wants a huge globally sorted output. We haven’t found a
single use case for the problem as stated.

Does anyone have a real-world use case for global sorting other than top-k?

~~~
rockinghigh
If you sort records by time you can perform time range queries. You can also
create an index to access random keys (like a MapFile in Hadoop).
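
A minimal sketch of that kind of access, assuming fixed-width records that
start with a big-endian 8-byte key (the record layout is illustrative, not
Hadoop's actual MapFile format):

    #include <stdio.h>
    #include <stdint.h>

    #define REC_SIZE 64  /* assumed: 8-byte big-endian key + payload */

    /* Binary search over a globally sorted file of fixed-width records.
       A sparse index (one key per block, as in MapFile) is the same
       idea with fewer seeks. */
    long find_record(FILE *f, uint64_t key, long nrecs) {
        unsigned char rec[REC_SIZE];
        long lo = 0, hi = nrecs - 1;
        while (lo <= hi) {
            long mid = lo + (hi - lo) / 2;
            if (fseek(f, mid * REC_SIZE, SEEK_SET) != 0 ||
                fread(rec, 1, REC_SIZE, f) != REC_SIZE)
                return -1;                      /* I/O error */
            uint64_t k = 0;
            for (int i = 0; i < 8; i++)
                k = (k << 8) | rec[i];          /* decode the key */
            if (k == key) return mid;
            if (k < key) lo = mid + 1; else hi = mid - 1;
        }
        return -1;  /* not found */
    }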

~~~
amelius
But isn't sorting by time easy, because records come in the correct order? I
would say you don't need an algorithm for that.

~~~
rockinghigh
You may want to sort by settlement date instead of transaction date. There are
plenty of examples where you need to sort before you can index: let's say you
want to create a spatial index. You first need to sort/group spatially (by
tile) to query an area.

------
JavaScriptrr
Only at Google

~~~
fintler
I wonder how fast this could do it:

[http://www.lanl.gov/projects/trinity/specifications.php](http://www.lanl.gov/projects/trinity/specifications.php)

They're claiming 87.0 TB/min on an 80PB filesystem, relative to Google's 36.2
TB/min.

~~~
cshimmin
That's just their I/O bandwidth. There is computing overhead in actually doing
the sort. Also, Google was using redundant persistence.

~~~
tyre
To add to this, Google's results were in 2012.

------
isomorphism
"We are the best, but we did not want to participate in the competition."
Whatever....

------
AI_Overlord
To each their own. While impressive, I do not find it inspiring. Instead,
working on strong AI is, in my mind, the ultimate challenge. Far more amazing
by any measure than anything we have accomplished so far.

~~~
seiji
_working on strong AI is, in my mind, the ultimate challenge._

So is working on antigravity, free energy, and backwards time travel, but
people _don't_ work on those because there are no reasonable approaches we can
try.

Glorifying "AI AI AI" is silly for the same reason: there are no approaches we
can try. Sure, we can identify ten million images per second, but none of that
involves the least bit of "thinking."

~~~
AI_Overlord
Evolution has already proven that strong AI is possible. It is up to us to
discover how. It is only a matter of time.

~~~
seiji
Just one summer of study and it'll be solved, right? It's always been "one
summer away" for the past 60 years.

Get back to us when you have an algorithm for love and art and petrichor.

~~~
AI_Overlord
Emotions are not a requirement for strong AI. I frankly would not waste any
time with them. I just need an AI that can learn and solve problems at the
human level.

I'm not naive enough to think that it can be solved in a short time. I do
think that it is worth it for a person to spend the rest of their life working
on it. There is just nothing more exciting than AI in my opinion.

Just imagine the possibilities...

~~~
wantreprenr007
"Narrow" AI is already here: search, banking, insurance, internet ads, crime
prediction, Siri, plane autopilot (auto-takeoff and landing too), auto-parking,
driver assist, on and on.

(At Trimble, we had a fully autonomous tractor PoC in 2001.)

"Deep" AI (self-directed / human-interactive) will take more time and effort,
and can have (simulated) emotions if so programmed; the determining factor is
how to sell it as a viable product or service that doesn't freak people out
too much or do something stupid like put untrustworthy systems in charge of
live nuclear missiles.

~~~
xiphias
Deep AI already started with deep learning, which improves exponentially every
year if you look at the benchmark results. It already freaks me out that you
can put together cheap drones + guns + self-driving + better-than-human face
detection. Imagine controlling a drone botnet...

