
Learning to Optimize Tensor Programs - antinucleon
https://arxiv.org/abs/1805.08166
======
namibj
Compare to polyhedral optimization [0], supported in GCC via Graphite [1] and
in LLVM via Polly [2]. These have a lower ceiling than hand-optimized
assembler, but they automate the tuning of how to nest loops for maximum
benefit from the cache hierarchy. Considering their ability to do so for
general-purpose number-crunching loops, they are rather nice, but some
integration work is still needed, especially for LLVM, as Polly lacks a
natural place in the existing optimization pipeline.
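As a rough illustration (plain Python; the function names are mine, not anything from Graphite or Polly), the kind of loop-nest restructuring these optimizers automate is tiling, which reorders iterations so the working set stays cache-resident:

```python
# Sketch of loop tiling: both functions compute the same C = A * B,
# but the tiled version visits B in small blocks that fit in cache.
# Names and the tile size are illustrative choices, not library APIs.

def matmul_naive(A, B, n):
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

def matmul_tiled(A, B, n, t=4):
    C = [[0.0] * n for _ in range(n)]
    # Outer loops walk t*t tiles; inner loops stay within one tile.
    for ii in range(0, n, t):
        for kk in range(0, n, t):
            for jj in range(0, n, t):
                for i in range(ii, min(ii + t, n)):
                    for k in range(kk, min(kk + t, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + t, n)):
                            C[i][j] += a * B[k][j]
    return C
```

Choosing the tile size and the loop order for a given cache hierarchy is exactly the search space these tools (and this paper) try to navigate automatically.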

[0]:
[https://en.wikipedia.org/wiki/Polytope_model](https://en.wikipedia.org/wiki/Polytope_model)
[1]: [https://gcc.gnu.org/wiki/Graphite](https://gcc.gnu.org/wiki/Graphite)
[2]: [https://polly.llvm.org/](https://polly.llvm.org/)

~~~
sidereal
Polyhedral optimisation is cool (Facebook has been using it to great effect
for ML kernels recently [1]), but it’s not the end of the story. It’s
complementary to this paper, which seems to be about learning an effective and
transferrable cost model to guide the optimisation process (you could use that
learned cost model in a polyhedral optimiser).
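A toy sketch of that idea (everything here is synthetic: the tile/unroll features, the cost function, and the nearest-neighbour "model" are stand-ins for the paper's learned model, just to show the shape of the loop):

```python
# Sketch: measure a few configurations for real, fit a cheap surrogate
# cost model on them, then rank the remaining candidates by *predicted*
# cost instead of benchmarking every one. All numbers are synthetic.

def true_cost(tile, unroll):
    # Hypothetical ground-truth runtime, minimized at tile=16, unroll=4.
    return (tile - 16) ** 2 + 4 * (unroll - 4) ** 2

candidates = [(t, u) for t in (4, 8, 16, 32) for u in (1, 2, 4, 8)]

# "Benchmark" a small seed subset (the expensive step in practice).
measured = [(t, u, true_cost(t, u)) for (t, u) in candidates[::5]]

def predict(tile, unroll):
    # 1-nearest-neighbour surrogate: cost of the closest measured config.
    t, u, c = min(measured,
                  key=lambda m: (m[0] - tile) ** 2 + (m[1] - unroll) ** 2)
    return c

# Pick the candidate the model likes best, without timing all of them.
best = min(candidates, key=lambda c: predict(*c))
```

The paper's contribution is making that surrogate model accurate and transferable across operators and hardware; the search loop around it looks much like this.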

[1]: [https://arxiv.org/abs/1802.04730](https://arxiv.org/abs/1802.04730)

~~~
namibj
Uhh.... this sounds interesting. Maybe we'll see it soon-ish (<5y) in
LLVM/GCC.

~~~
jules
It already exists! Check out Polly for LLVM.

------
antinucleon
Summary: tensor programs can be optimized using machine learning and transfer
learning. The cost model is trained on features extracted from the low-level
AST of the program.

Experiments. Tasks: ResNet, MobileNet, LSTM LM, DQN. Hardware: CUDA GPU / ARM
GPU / ARM CPU. Speedup compared to cuDNN, TensorFlow Lite and ARM Compute
Library: ~1.2x to 3.8x faster in end-to-end tests.

~~~
namibj
Be careful what they benchmark against. The Maxwell/Pascal GEMM routines are
_very_ optimized, since Nervana Systems (in their infancy) made an assembler
for those GPUs, which they showed off by implementing GEMM (single-precision
dense matrix-matrix multiplication). It fell only about 2% short of the
theoretical peak predicted by the clock rate, and they had no idea where those
2% went. They hand-scheduled the instructions to work around the limited
number of register banks and the associated bank conflicts, a concept normal
CPUs afaik only see with DRAM. At that time, the official GEMM, albeit
hand-optimized, only reached about 80% of the theoretical value.
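For context, the "theoretical value" here is just peak FLOP/s derived from core count and clock rate (the numbers below are illustrative, roughly a Maxwell-class card, not exact specs):

```python
# Peak single-precision throughput: cores * clock * 2, since a fused
# multiply-add counts as 2 FLOPs per cycle per core.
# Illustrative Maxwell-class numbers, not an exact datasheet.
cuda_cores = 3072
clock_hz = 1.0e9

peak_flops = cuda_cores * clock_hz * 2     # ~6.1 TFLOP/s

nervana_gemm = 0.98 * peak_flops           # "lacking like 2%"
vendor_gemm_then = 0.80 * peak_flops       # ~80% of peak at the time
```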

~~~
crowwork
The benchmarks are not about GEMM but about real-world deep learning
workloads, which can have very different characteristics from GEMM.

~~~
namibj
I just wanted to caution that one has to be careful what one is comparing
against, as the libraries got significant speed improvements over time
without that being widely advertised. So it matters a lot whether one compares
this library against cuDNN from a month ago or cuDNN from two years ago. The
latter is _much_ slower.

The GEMM example was just there as the details of the optimization have been
published, unlike most other hand-tuned assembler routines for DNN workloads.

~~~
antinucleon
cuDNN v7 was used in the experiments; in the experiments section, each
comparison is listed with a version or commit number.

~~~
namibj
Well, I didn't RTFA. This was not meant to be specific for this article
though.

------
empiricus
Does this mean that it is easy now to get optimized deep learning kernels for
AMD GPUs?

------
bitL
I'd have never thought that the art of low-level optimization would be the
first one with some serious automation from Deep Learning... Very interesting
paper!

------
evrydayhustling
Great example of recursive self improvement.

