

Sibyl: A System for Large Scale Machine Learning at Google - midko
https://www.youtube.com/watch?v=3SaZ5UAQrQM

======
midko
The talk is about the system's design decisions, not about ML. Bits I found
interesting:

* Tushar Chandra believes ML primitives will soon be available to application developers, just as distributed-systems and database primitives are becoming available in a standard way today

* There was an early design decision not to build Sibyl on top of a custom distributed system, but instead to rely on existing primitives such as MapReduce and GFS.

* 100B+ training examples with 100s of features per example, and use cases with 50TB of data

* Logging all the features of each example would make the logs grow extremely fast, and some features are experimental and come and go. So the logs contain only the example id, and before the model is trained the logs are inner joined with the example database/GFS

* Examples were stored by column (partitioned by feature, with each file containing one feature for many examples) instead of the more common row partitioning, where all features of a batch of examples are stored in the same file. This made feature transformations faster, reduced the amount of data to be read (because some features were less useful than others), and improved compression. Features were compressed further by finding all unique feature values and mapping them to numbers, in a Huffman-encoding-like way. Total compression achieved was 3-5x
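
To make the last two bullets concrete, here is a toy sketch of how I understood them. It is my own illustration, not code from the talk: the names (`logs`, `example_store`, `inner_join`, `to_columns`, `dictionary_encode`) and the sample data are all made up, and giving the most frequent value the smallest number is only my reading of the "Huffman-like" remark.

```python
from collections import Counter

# Hypothetical logs: only the example id and a label are logged, no features.
logs = [("ex1", 1), ("ex2", 0), ("ex4", 1), ("ex9", 0)]

# Hypothetical stand-in for the example database/GFS, keyed by example id.
example_store = {
    "ex1": {"country": "US", "browser": "chrome"},
    "ex2": {"country": "US", "browser": "firefox"},
    "ex3": {"country": "DE", "browser": "chrome"},
    "ex4": {"country": "US", "browser": "chrome"},
}


def inner_join(logs, store):
    """Attach features to logged ids; ids missing from the store are dropped."""
    return [(ex_id, label, store[ex_id]) for ex_id, label in logs if ex_id in store]


def to_columns(rows):
    """Pivot row-wise examples into one list per feature (column layout)."""
    names = sorted({name for _, _, feats in rows for name in feats})
    return {name: [feats.get(name) for _, _, feats in rows] for name in names}


def dictionary_encode(column):
    """Map each unique value to an integer code, most frequent value first.

    Assumption: small codes for frequent values, so a variable-length
    integer encoding would spend the fewest bytes on the common values.
    """
    counts = Counter(column)
    dictionary = sorted(counts, key=lambda v: (-counts[v], str(v)))
    code_of = {value: code for code, value in enumerate(dictionary)}
    return dictionary, [code_of[v] for v in column]


def dictionary_decode(dictionary, codes):
    """Invert dictionary_encode."""
    return [dictionary[c] for c in codes]


joined = inner_join(logs, example_store)   # "ex9" has no stored features
columns = to_columns(joined)               # one list per feature
browser_dict, browser_codes = dictionary_encode(columns["browser"])
```

Each entry of `columns` would be written to its own file in the column layout, so a job that only needs `browser` never reads `country`.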

Towards the end, the talk covers some use cases with big numbers (throughput
per core, for example) that are worth checking out.

