
Machine Learning for Systems and Systems for Machine Learning [pdf] - andrew3726
http://learningsys.org/nips17/assets/slides/dean-nips17.pdf
======
cs702
TPUs are only one part of this eye-opening presentation. Skip to page 28,
where Jeff starts talking about:

* Using reinforcement learning so the computer can figure out how to parallelize code and models on its own. In experiments, the machine beats human-designed parallelization.

* Replacing B-tree indices, hash maps, and Bloom filters with _data-driven indices_ learned by deep learning models. In experiments, the learned indices outperform the usual stalwarts by a large margin in both computing cost and performance, and are auto-tuning.

* Using reinforcement learning to manage datacenter power. Machine intelligence outperforms human-designed energy-management policies.

* Using machine intelligence to replace user-tunable performance options in all software systems, eliminating the need to tweak them with command line parameters like --num-threads=16, --max-memory-use=104876, etc. Machine intelligence outperforms hand-tuning.

* Using machine intelligence for all tasks currently managed with heuristics. For example, in compilers: instruction scheduling, register allocation, loop nest parallelization strategies, etc.; in networking: TCP window size decisions, backoff for retransmits, data compression, etc.; in operating systems: process scheduling, buffer cache insertion/replacement, file system prefetching, etc.; in job scheduling systems: which tasks/VMs to co-locate on same machine, which tasks to pre-empt, etc.; in ASIC design: physical circuit layout, test case selection, etc. Machine intelligence outperforms human heuristics.
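The flag-tuning bullet above can be sketched as the simplest possible search over a flag's values (the workload model below is entirely hypothetical; the systems in the talk use learned models rather than exhaustive measurement):

```python
def workload(num_threads):
    # Hypothetical stand-in for a real program's runtime as a function
    # of its --num-threads flag: too few threads underuses the machine,
    # too many adds contention overhead.
    return 1.0 / num_threads + 0.01 * num_threads

def autotune(candidates):
    # Measure each setting and keep the fastest -- the simplest
    # possible "machine picks the flag for you" loop.
    return min(candidates, key=workload)

best = autotune([1, 2, 4, 8, 16, 32, 64])
```

A learned tuner replaces the exhaustive loop with a model that predicts good settings from workload features, but the interface is the same: the human stops passing the flag.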

IN SHORT: machine intelligence (today, that means deep learning and
reinforcement learning) is going to penetrate and ultimately control EVERY
layer of the software stack, replacing human engineering with auto-tuning,
self-improving, better-performing code.

Eye-opening.

~~~
candiodari
OK, I can understand how a Bloom filter could be replaced by a neural-network predictive model. You could actually train it while items get added. This would make insertion somewhat more expensive, but ...

Ah, so it appears they're advocating using neural networks as index functions into sorted arrays (hash maps are simply arrays sorted by hash instead of by something in the data).
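The learned-Bloom-filter construction can be sketched as a classifier plus an exact backup set that catches the classifier's false negatives, so membership queries never miss an inserted key (the classifier below is a toy stand-in; a real one would be trained on the key distribution, and the backup set would itself be a small Bloom filter):

```python
def build_learned_filter(keys, classifier, threshold=0.5):
    # Keys the classifier misses go into an exact backup set, so
    # there are no false negatives; false positives remain possible,
    # exactly as with a classical Bloom filter.
    backup = {k for k in keys if classifier(k) < threshold}

    def contains(key):
        return classifier(key) >= threshold or key in backup

    return contains

# Toy classifier: pretend the model learned that keys are even numbers.
clf = lambda k: 1.0 if k % 2 == 0 else 0.1
contains = build_learned_filter([2, 4, 7], clf)
```

Note that `contains(6)` would report true even though 6 was never inserted: that is the allowed false-positive side of the Bloom-filter contract.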

So what they do is take a FIXED set of data that you want fast lookups into, already sorted, and train a model on it (one architecture is a 2-layer, 32-wide network with ReLU activations, but they also train hierarchies of models). Because the cost of the worst-case error dominates, they minimize max error rather than average error, which changes the error profile hugely.

They have the following brilliant insight: an index over a database (which gives the position of the data given the search key) is a CDF (cumulative distribution function)! That's brilliant! Of course it is!
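That insight is easy to demonstrate: in a sorted array, a key's position is the empirical CDF evaluated at that key, scaled by the array length. Even a least-squares line can then serve as an approximate index, with the tracked max error bounding the final search window (a sketch with a linear model standing in for the paper's two-layer ReLU net):

```python
import bisect

def fit_linear_index(sorted_keys):
    # Least-squares fit of position ~ a*key + b: the simplest possible
    # learned CDF model.
    n = len(sorted_keys)
    xs, ys = sorted_keys, range(n)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    # Track the worst-case prediction error so lookups only search a
    # small window around the prediction -- this is why max error, not
    # average error, is what gets minimized.
    max_err = max(abs(round(a * x + b) - y) for x, y in zip(xs, ys))
    return a, b, max_err

def lookup(sorted_keys, key, a, b, max_err):
    n = len(sorted_keys)
    guess = round(a * key + b)
    lo, hi = max(0, guess - max_err), min(n, guess + max_err + 1)
    # Binary search only within the guaranteed error bound.
    i = bisect.bisect_left(sorted_keys, key, lo, hi)
    return i if i < n and sorted_keys[i] == key else -1

keys = [2 * i for i in range(100)]  # uniform keys: the CDF is exactly linear
a, b, err = fit_linear_index(keys)
```

On skewed key distributions a single line fits poorly, which is where the paper's hierarchy of models comes in.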

And of course, this is Google. Once you have an index trained, you can translate the neural network model directly into C++ and compile it into machine instructions that don't depend on anything like the TensorFlow libraries. The resulting code can be pasted into anything you want. This may run fast, but seems less than entirely practical ... although I guess you could do the same in Java far more easily and just include that code.
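Translating the trained model into dependency-free code is mechanical once the weights are frozen. A sketch that bakes a linear model's two parameters into a standalone C++ function (real codegen would unroll the full matrix multiplies of each layer; the function name is made up here):

```python
def emit_cpp_index(a, b):
    # Bake the learned parameters into C++ source with no runtime
    # dependency on any ML framework.
    return (
        "inline long predict_pos(double key) {\n"
        f"    return static_cast<long>({a!r} * key + {b!r});\n"
        "}\n"
    )

src = emit_cpp_index(0.5, 0.0)
```

The emitted source compiles anywhere a C++ compiler runs, which is the portability point the comment is making.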

Paper here: [https://arxiv.org/pdf/1712.01208v1.pdf](https://arxiv.org/pdf/1712.01208v1.pdf)

------
cobookman
The Nvidia Titan V can do 110 TFLOPS with 12GB of 1.7 Gb/s memory [1] and sells for $3,000. The TPU v2 does 180 TFLOPS with 64GB of 19.2 Gb/s memory [2].

That's a heck of a performance boost for a chip that's likely costing Google way less than the Nvidia flagship.

[1] [http://www.tomshardware.com/news/nvidia-titan-v-110-teraflops,36085.html](http://www.tomshardware.com/news/nvidia-titan-v-110-teraflops,36085.html)

~~~
shaklee3
It's not clear to me how programmable the tpu is. I'm sure it's great at
convolutions and matrix multiplies. Can it do anything else?

~~~
EvgeniyZh
Neither can tensor cores.

~~~
shaklee3
The tensor core is one part of the GPU. It has plenty of other capabilities.

------
jamesblonde
Great talk, with lots of new insights into what's happening at Google. I really think his point that ImageNet is the new MNIST now holds true. Even research labs should be buying DeepLearning11 servers (10 x 1080Ti) for $15k and training large models in a reasonable amount of time. It may seem that Google is way ahead, but they are just doing synchronous SGD, and it was interesting to see the drop in prediction accuracy from 128 TPU2 cores to 256 TPU2 cores for ImageNet (76% -> 75% accuracy). So the algorithms for distributed training aren't unknown, and with cheap hardware like the DL11 server, many well-financed research groups can compete with this.

~~~
eggie5
Ballpark, how much would it cost to train ImageNet (ILSVRC) on a standard deep CNN architecture (VGG or Inception) on AWS using a p2 or p3?

~~~
jamesblonde
Ballpark - 1,100 dollars on AWS. 44hr 28min (from DAWNBench - [http://dawn.cs.stanford.edu/benchmark/](http://dawn.cs.stanford.edu/benchmark/)) on a DGX-1 (cost: 24.48 dollars/hour on p3.16xlarge). [https://aws.amazon.com/ec2/pricing/on-demand/](https://aws.amazon.com/ec2/pricing/on-demand/)

On a DL11 server, it will take about 60 hrs, and only cost you 15k upfront.
The economics speak for themselves for fp32 training, at this moment in time.
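The ballpark is just the DAWNBench time multiplied by the quoted on-demand rate, and the break-even against the $15k server follows from the same arithmetic (prices as quoted above, not current ones):

```python
hours = 44 + 28 / 60    # DAWNBench ImageNet training time, 44hr 28min
rate = 24.48            # p3.16xlarge on-demand $/hr, as quoted above
aws_cost = hours * rate # cost of one full training run on AWS

# Number of AWS-priced runs that would pay for a $15k DL11 server.
runs_to_break_even = 15000 / aws_cost
```

So roughly $1,100 per cloud run, and the upfront server pays for itself after about 14 training runs, ignoring power and ops costs.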

~~~
eggie5
I didn't know about the dawn project, thank you for the reference and figures.

------
larelli
It looks like this paper has more information:
[https://arxiv.org/pdf/1712.01208v1.pdf](https://arxiv.org/pdf/1712.01208v1.pdf)

------
EvgeniyZh
Was it filmed? If so, when will the video be available?

~~~
swah
Yep - not very useful without the video.

~~~
laythea
But we have HN comments section!

------
nickpsecurity
Great presentation. As far as applications go, I already thought this might be useful in lightweight formal methods: spotting problems and suggesting corrections for failures in Rust's borrow checker, separation logic on C programs, proof tactics, and static analysis tooling. For the Rust example, a person might try to express a solution in the language that fails the borrow checker. If they can't understand why, they submit it to a system that attempts to spot where the problem is. The system might start with humans spotting it and restructuring the code to pass the borrow checker; every instance of that would feed into the learning system, which might eventually do it on its own. There's also potential to use automated equivalence checks/tests between user-submitted code and the AI's suggestions, to help a human in the loop decide if it's worth review before passing on to the other person.

In hardware, both digital and analog designers seem to use lots of heuristics
in how they design things. Certainly could help there. Might be especially
useful in analog due to small number of experienced engineers available.

------
yeukhon
While this is a collective work, honestly, after hearing about JD for so many
years: is there anything he CAN’T do?

~~~
justicezyx
He did little for the TPU.

------
1024core
This is some really cool stuff, I hope this submission gets more upvotes and
reaches a wider audience.

------
novaRom
I speculate that Google will sell the TPUv2 for as little as $500 per PCIe card as early as 2018. Nvidia's Volta tensor cores are essentially the same: 32-bit accumulators and 16-bit multipliers. But GPUs are more general-purpose, which isn't necessary for deep learning, since the most intensive operation is the dot product (y += w*x).

~~~
quadrature
I feel like the cloud play would be much stronger than entering the hardware
market.

------
nl
That "Learned Index Structures" paper makes it pretty clear that Karpathy was right in his widely criticized "Software 2.0" piece.

~~~
ekr
I haven't read that paper (Learned Index Structures), but things like gperf have existed for decades. Are these enhanced data structures dynamic? That is, unlike gperf, which is static, do they reoptimize as you insert new elements?

In the case of the hash table, I assume it's using the model to compute the
hash function.
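Using the model as the hash function amounts to scaling its CDF estimate to the number of buckets: when the model fits the key distribution well, keys spread near-uniformly even if the raw keys are skewed (a sketch with a known CDF standing in for a trained model):

```python
def learned_hash(cdf, num_buckets):
    # Map a key through the model's CDF estimate, then scale to a
    # bucket index. A well-fit CDF yields near-uniform occupancy,
    # i.e. fewer collisions than a generic hash on skewed keys.
    def h(key):
        return min(num_buckets - 1, int(cdf(key) * num_buckets))
    return h

# Toy model: keys uniform on [0, 100), so the true CDF is key/100.
h = learned_hash(lambda k: k / 100.0, num_buckets=10)
```

The `min` clamp keeps keys at the top of the range from overflowing the last bucket.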

~~~
sanxiyn
No, it doesn't handle inserts. On the other hand, the paper writes:

"An ... approach to handling inserts is to build a delta-index. All inserts
are kept in buffer and from time to time merged with a potential retraining of
the model."
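The quoted delta-index approach can be sketched directly: reads consult both the trained index and an insert buffer, and a merge folds the buffer in and retrains (`fit` below is a hypothetical trainer, and lookups fall back to plain membership tests instead of using the model to locate positions):

```python
class DeltaIndex:
    def __init__(self, keys, fit, merge_threshold=4):
        self.fit = fit
        self.keys = sorted(keys)
        self.model = fit(self.keys)  # learned index over the base data
        self.buffer = []
        self.merge_threshold = merge_threshold

    def insert(self, key):
        # New keys go to the buffer; the trained model is untouched.
        self.buffer.append(key)
        if len(self.buffer) >= self.merge_threshold:
            # "From time to time merged with a potential retraining."
            self.keys = sorted(self.keys + self.buffer)
            self.buffer = []
            self.model = self.fit(self.keys)

    def contains(self, key):
        # Check both the delta buffer and the indexed keys.
        return key in self.buffer or key in self.keys
```

The trade-off is classic LSM-style: cheap inserts in exchange for a second structure to consult on reads and periodic merge/retrain cost.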

~~~
pg314
Under some assumptions it does handle inserts. From [1]: _Finally, assume that
the inserts follow roughly a similar pattern as the learned CDF; [...] Under
these assumptions the model might not need to be retrained at all._

[1] [https://www.arxiv-vanity.com/papers/1712.01208v1/](https://www.arxiv-vanity.com/papers/1712.01208v1/)

