
Show HN: High-performance ahead-of-time compiler for Machine Learning - andrew-wja
https://www.scss.tcd.ie/~andersan/projects/live/triNNity.html
======
dragandj
A few minor notes:

1. This seems to be oriented toward convolutions. While convolution is rather
important for image-oriented DNN workloads, there is more to DNNs than that, and
DNNs are not the only technique in machine learning.

2. The graph shows a 2-3x speedup (of convolutions, I suppose) over Intel's
MKL-DNN 0.17 on the _i5-2500K_ processor, which is a rather old low-end
device. If the convolutions in the test used 8-bit integers for storing image
channels (which is possible), this is to be expected, since the i5-2500 does
_not_ support the AVX-512 integer instructions that MKL-DNN employs there. It
does not even have AVX2! If that's the case, just switching to 32-bit floats
could speed up MKL-DNN by almost an order of magnitude. The most informative
test would be something run on Skylake-X, since it does support AVX-512...

~~~
andrew-wja
The speedup is for the _whole_ network, as the graph labels show! The point
of the compiler is that you produce code that implements the _entire_
forward pass, so you can deploy that code where you need to do the inference.

I agree we need AVX-512 -- hopefully I can get access to a Skylake-X machine in
the next few days.

~~~
stochastic_monk
Even with AVX512, CPUs are going to be pretty slow for most networks of decent
size. Is the purpose to easily embed small models into other code?

~~~
andrew-wja
Yes, that's a big part of it. Also, if you want to do something like keyword
spotting in audio on a small device, like a Cortex-M class processor, the
constraints are really, really difficult to satisfy: most of them have
significantly less than 1MB of main memory, for a start! Like this guy, for
example: [https://openmv.io/products/openmv-cam-m7](https://openmv.io/products/openmv-cam-m7)
-- 512KB of RAM, and it runs at 216MHz. You just can't run TensorFlow on
something like that; it takes real effort to produce something that can do
inference in that context.

With that said, the techniques we've developed here are totally applicable to
GPUs as well, and you might even expect that something like algorithmic choice
would have an even bigger effect there, if we're just talking about delta-
execution-time, but that's future work for us!

~~~
stochastic_monk
Thanks for the response! With respect to the GPU, this reminds me of Tensor
Comprehensions. Do you see a fundamental advantage to your approach over their
kernel evolution protocol?

~~~
andrew-wja
Thanks for the question, this really gets to the heart of the issue!

We do see a fundamental advantage, and I'll try to explain what it is.

So TC's kernel evolution is based on autotuning, and what they are autotuning
for is the parameters for a specific implementation strategy for convolution
(what we would call a convolution algorithm).

There are two conceptual problems with this. The first is that you're
constrained by what's written in the algorithm. So for example, you cannot
take a direct convolution algorithm and hope the compiler will optimize it
into a Winograd convolution. You can only move around within the algorithmic
domain. Introducing algorithmic choice is a big conceptual difference.

The second is that this is a problem we know can be solved optimally. The
autotuner can never tell you whether what you have found is the best way of
arranging the network; it can only tell you that what you have is the best
candidate it's seen so far. But using a real optimization formulation, what
comes out is a proof by construction that _no better arrangement exists_.

The obvious downside is just that the optimization might take longer to run
than the autotuning, but what we observe in practice is that it takes the PBQP
solver a few _seconds_ to derive the solution from the microbenchmarked cost
model, and the cost model takes a few minutes to build for most popular
ImageNet networks (VGG-D, MobileNet, SqueezeNet, etc.).

Where the TC approach comes in is at the "leaves" of the problem. When you
have the optimal algorithmic selection, _then_ you can try to wiggle around
to find the best way to map that to the hardware, and you might expect a
return measured in percentage points. But the higher-level algorithmic
selection gives you a return measured in _factors_.
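
To make the difference concrete, here is a toy sketch of the selection problem, assuming made-up costs and a simple chain of layers (the real optimizer microbenchmarks the costs and hands the general graph case to a PBQP solver, so everything below is illustrative only):

    # Toy sketch, not triNNity code: globally optimal algorithm selection over
    # a chain of layers via dynamic programming. All numbers are hypothetical
    # stand-ins for microbenchmarked costs (milliseconds).
    layer_costs = [
        {("direct", "NCHW"): 9.0, ("im2col", "NCHW"): 7.5, ("winograd", "NHWC"): 6.0},
        {("direct", "NCHW"): 5.0, ("im2col", "NCHW"): 4.0, ("winograd", "NHWC"): 4.5},
        {("direct", "NCHW"): 8.0, ("im2col", "NCHW"): 6.5, ("winograd", "NHWC"): 7.0},
    ]

    def transform_cost(layout_a, layout_b):
        # cost of a data layout transformation between adjacent layers
        return 0.0 if layout_a == layout_b else 2.5

    def best_total_cost(layers):
        # best[c] = minimal total cost of the prefix ending in candidate c
        best = dict(layers[0])
        for costs in layers[1:]:
            best = {
                cand: cost + min(prev_cost + transform_cost(prev[1], cand[1])
                                 for prev, prev_cost in best.items())
                for cand, cost in costs.items()
            }
        return min(best.values())

    print(best_total_cost(layer_costs))  # provably minimal for this chain

An autotuner in the TC style would instead fix one algorithm per layer and tune its parameters; nothing in that loop can ever switch, say, a direct convolution to a Winograd one.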

------
fmap
Ok, my first reaction is that this is really wonderful work -
straightforward and with a big payoff at the end.

But this really raises the question: why hasn't this been done before? People
have been throwing resources at machine learning for a decade now, and somehow
nobody has thought to perform instruction selection before executing a model
to optimize the kernels used?

What other low-hanging fruit is out there? Automatic partitioning of networks
over several GPUs and CPUs? Such dynamic load balancing algorithms have been
available in the HPC literature since there was HPC literature. Fusing
multiple primitives to simpler kernels? That's what linear algebra libraries
have been doing for decades. Optimizing internal data layout (although that
seems to be part of this paper)? Optimizing scheduling decisions to minimize
data movement?

---

Also, since the author seems to be reading this thread: have you tried
measuring the tree-width of the instruction selection DAGs you generate for
the PBQP problem? The heuristics for solving these problems in LLVM are
applicable to tree-width <= 2, but could be extended to, e.g., tree-width <= 4
without too much slowdown. I wonder if there is still an iota of performance
to be gained here. :)

~~~
andrew-wja
Hi, author here. There is a staggering amount of low-hanging fruit. I have
been half-seriously blaming GEMM in correspondence. When you have a problem
that looks like GEMM, it's such an attractive hammer to pick up that people
just don't look beyond it to other techniques!

To answer your other questions: we already have auto load balancing and
primitive fusion, albeit rudimentary, but optimizing scheduling is the obvious
next step. We've extended this stuff to use ILP, and we're on our way to press
at the moment!

Re: tree width: the tree widths are huge, but the solver library we're using
handles them :)

~~~
marmaduke
How tied are the compiler aspects to the specific kernels (e.g. for
convolutional layers)? Could the load balancing and fusion logic be broken out
into a library which could work for user-defined kernels?

~~~
andrew-wja
They aren't tied at all -- in fact the optimizer is a totally separate project
(triNNity-optimizer) that just does graph optimization. You can add
user-defined kernels as long as you have some way of microbenchmarking them!
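
For a rough idea of what "microbenchmarking" means here, a sketch of a Python timing harness with a hypothetical kernel and input generator (this is not the triNNity benchmarking infrastructure):

    import statistics
    import time

    def microbenchmark(kernel, make_inputs, repeats=50):
        # Time a user-defined kernel on representative layer shapes and return
        # a robust cost estimate that a graph optimizer could consume.
        kernel(*make_inputs())  # warm-up run
        samples = []
        for _ in range(repeats):
            args = make_inputs()
            start = time.perf_counter()
            kernel(*args)
            samples.append(time.perf_counter() - start)
        return statistics.median(samples)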

------
snaky
'ML' is really confusing nowadays, especially alongside words like 'compiler'
and 'optimizer'.

[https://en.wikipedia.org/wiki/ML_(programming_language)](https://en.wikipedia.org/wiki/ML_\(programming_language\))

~~~
andrew-wja
Fair! Edited.

------
andrew-wja
If anyone would like to know more about the toolkit, I'll be checking the
comment thread here periodically today!

~~~
creatio
Might want to change ML to Machine Learning. I thought it was about a compiler
for ML, the programming language
([https://en.wikipedia.org/wiki/ML_(programming_language)](https://en.wikipedia.org/wiki/ML_\(programming_language\))).

~~~
andrew-wja
Updated to clarify. Part of the toolkit is written in Haskell, I really oughta
know better!

------
andrew-wja
I should mention that we have made Arch Linux packages for all this stuff:
trinnity-git and trinnity-optimizer-git

(you can just git clone the compiler and the demos to get started).

We'd love people to start using and breaking this stuff so we can find the
pain points and improve things!

------
scottlocklin
Is there any plan to make this compile to CUDA? I can't imagine bothering with
DNNs on an Intel CPU in virtually any situation, even if it is 3x faster this
way.

~~~
disdi
Agreed -- and compile to OpenCL?

~~~
andrew-wja
Yes, both! We are also working on ARM Compute and cuDNN backends. Soon :)

------
p1esk
How does this approach compare to Tensor Comprehensions, or TF XLA, or
whatever are the best DNN compilers these days?

~~~
andrew-wja
Full disclosure: I'm a compiler person who for funding reasons moved into
performance of machine learning systems.

None of those things should be called compilers. At best, they are scaffolding
for peephole optimization. When you can get these crazy speedups just from
bolting on an instruction selector, that's a real indicator that a lot of
stuff is just waiting to be done.

For context, MKL-DNN embeds Xbyak, an optimizing JIT targeting SSE4.2, AVX2,
and AVX-512. It sees all the dimensions of the tensors, it knows the strides of
the kernel, and so on and so forth. So for us to be able to beat it by such a
margin just by stepping back and doing something simple at a higher level of
abstraction kinda indicates that the approach it's using is running up against
some conceptual limits. It's not that MKL-DNN's JIT isn't good -- it's great,
and it's a credit to the engineers working on it. But the problem is that the
smarts are being applied in the wrong place!

~~~
p1esk
How does it compare with what the cuDNN auto-tuner does for GPUs?

~~~
andrew-wja
cuDNN's autotuner is only making _local_ decisions; if layer X is connected to
layer Y is connected to layer Z, simply choosing the fastest algorithm to
implement each layer does not guarantee optimality. That's because the
different convolution algorithms work best with different data layouts. For
example, Winograd convolution is usually fastest using the NHWC layout, while
patch-matrix based approaches like im2col are usually fastest using the NCHW
layout.

When two connected layers disagree about the tensor layout, you have to do a
data layout transformation, which is expensive. The key to the performance
we're getting is that we benchmark those data layout transformations, and we
make a _global_ decision about the network, where the optimizer is aware that
the cost for selecting algorithms for connected layers is the total cost of
algo A + layout transform + algo B, not just the cost of algo A + algo B.

So in a nutshell, the optimizer is making use of global information, while the
autotuner is only using local information. If you have a squint at our paper,
we actually model what the autotuner is doing; that's the "local optimal" bar
on the graphs. We always beat it!
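
A concrete (and purely hypothetical) illustration: suppose Winograd on layer X costs 8ms in NHWC and im2col costs 10ms in NCHW, layer Y's fastest algorithm costs 12ms and wants NCHW, and the NHWC-to-NCHW transform costs 5ms. The local autotuner picks Winograd for X and pays 8 + 5 + 12 = 25ms; the global optimizer picks the "slower" im2col for X and pays 10 + 12 = 22ms.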

~~~
p1esk
I see. So what do you intend it to become? Are you building middleware to be
inserted between, say, PyTorch and cuDNN? I'm training convnets and RNNs
written in PyTorch and TF on GPUs; how can I benefit from your work?

~~~
andrew-wja
You're essentially correct, but there is a bit of a problem with PyTorch and
TF specifically, because you don't really have a definition of the model per
se. You construct it dynamically using a Python or C++ program.

The Caffe .prototxt format and the ONNX model format are nice declarative
specifications of what the model is supposed to do, so those are good input
formats for the compiler. I hope more frameworks will prioritize ONNX, because
it's really the wild west out here with every framework reinventing the wheel
for model specification!
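
For what it's worth, getting a declarative model description out of PyTorch is already possible via its ONNX export. A minimal sketch (the model choice, filename, and input shape here are just placeholders):

    import torch
    import torchvision

    # Trace the forward pass of a pretrained model and write it out as a
    # static ONNX graph that an ahead-of-time compiler could consume.
    model = torchvision.models.squeezenet1_1(pretrained=True).eval()
    dummy_input = torch.randn(1, 3, 224, 224)  # example NCHW input
    torch.onnx.export(model, dummy_input, "squeezenet.onnx")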

------
disdi
Amazing work. What changes are needed in this library to make it benchmark
GPUs?

~~~
andrew-wja
No changes are required in the library -- you just need to have some way of
generating the code for the forward pass using e.g. cuDNN (which already has a
heuristic selector!)

~~~
andrew-wja
In the compiler, we need to add backends for cuDNN and probably some OpenCL DNN
framework -- this is already fairly high on our list of priorities.

~~~
disdi
I am really not able to connect the dots here. Bear with my ignorance, since I
come from an OS/drivers background and am not a compiler guy.

Can I compile the code at the link below with nvcc (the Nvidia compiler)?

[https://bitbucket.org/STG-TCD/trinnity/src/b33c94e1fabd18f83...](https://bitbucket.org/STG-TCD/trinnity/src/b33c94e1fabd18f83c8cdd70475bf50e2ade6298/include/dense/cuda/?at=master)

If I write a gemm.cl like the program above, implementing a bunch of CNN
primitives, why can't I compile it with gcc?

To compile any OpenCL program, I need to write the kernel in, say, a sample.cl
file and compile it with gcc -Wall sample.cl -o sample -lOpenCL

This -lOpenCL picks up the libOpenCL.so file from my hardware vendor (Qcom,
Intel) and generates a binary which runs on the GPU (or wherever OpenCL is
available).

Why do I need the triNNity compiler and optimizer at all?

~~~
andrew-wja
The short answer is that there are dozens of ways to _use_ that GEMM to do
convolution (convolution algorithms), and there are on the order of
NUM_ALGORITHMS ^ NUM_LAYERS ways to implement the whole network.

Our toolkit figures out which of those arrangements are the fast ones!
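
For a sense of scale (with made-up numbers): 10 candidate convolution algorithms across 16 convolutional layers already gives 10^16 possible assignments, far too many to benchmark exhaustively -- hence the microbenchmarked cost model plus solver.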

