
DLVM: A modern compiler framework for neural network DSLs - protomok
http://dlvm.org/
======
wcrichton
Current tally of high-performance, deep-learning-oriented DSLs/IRs/compilers,
in no particular order:

\- TensorComprehensions (Facebook):
[https://github.com/facebookresearch/TensorComprehensions](https://github.com/facebookresearch/TensorComprehensions)

\- XLA (Google):
[https://www.tensorflow.org/performance/xla/](https://www.tensorflow.org/performance/xla/)

\- taco (MIT): [http://tensor-compiler.org/](http://tensor-compiler.org/)

\- DLVM (UIUC): [http://dlvm.org/](http://dlvm.org/)

\- nGraph (Intel):
[http://ngraph.nervanasys.com/docs/cpp/](http://ngraph.nervanasys.com/docs/cpp/)

\- TVM (DMLC): [https://github.com/dmlc/tvm](https://github.com/dmlc/tvm)

Honorable mention to Julia ([http://julialang.org](http://julialang.org)) as
well.

~~~
hedgehog
As far as I know Tile/PlaidML (Vertex.AI) is the only DSL+compiler that's
usable for real workloads across a variety of hardware.
[https://github.com/plaidml/plaidml](https://github.com/plaidml/plaidml)

~~~
grandmczeb
Tensorflow + XLA seems pretty usable. Also, it's generally good practice to
note that you're a cofounder of Vertex.AI in discussions like this.

~~~
hedgehog
Yes, I'm cofounder and I pretty much live and breathe the company. I see how
my comment reads as soulless shilling so I'll lay out my perspective and you
can make of it what you will. This is all my personal opinion and not
necessarily related to our product or company.

At a basic level I think making new powerful technology accessible to more
people is on average strongly positive. There are various efforts making good
progress to address different parts of deep learning accessibility such as
Keras (developer-friendly Python API), OpenAI (open basic research & safe AI),
fast.ai (practical training for developers), etc. I'm a fan of all of that
work. PlaidML is the company's contribution to making adoption easier.

For the purposes of proliferation and democratization, making deep learning
work on the most readily available hardware helps people get started with less
friction. PlaidML is a step in that direction. It's fully open source and you
can right now 'pip install' it on Mac/Win/Linux with Intel/AMD/NVIDIA GPU and
have a Keras net running in a couple minutes. There are certainly warts and
some missing features but as far as I know it's the only one an ordinary
practitioner can use right now.

From a "what problem does this solve" standpoint PlaidML is most similar to
Tensor Comprehensions and TVM. Each makes different tradeoffs but might
eventually be able to share components like code generation for OpenCL, LLVM,
etc. Layers like XLA, nGraph, ONNX, NNVM, etc., can mostly be thought of as
stacked on top (they are ways to talk to lower-layer runtimes like PlaidML).
For example, it would be reasonable for a future version of PlaidML
to support TensorFlow integration via XLA or deployment of ONNX models on
OpenCL-capable GPUs.

Anyway, I personally care most about what people can use. There's a cute demo
that will run the pre-trained Keras examples against images from your webcam
on your local GPU. It's quick to try and can serve as the basis for
prototyping a real application:
[https://github.com/plaidml/plaidvision](https://github.com/plaidml/plaidvision)

------
deepnotderp
Why are all the neural network DSLs JIT obsessed?

~~~
grandmczeb
Lots of modern models have very late-binding variables which are hard to
precompile for (sentence length in NMT, for example). That means you're going
to need some form of specialization at runtime, so a JIT makes sense.

~~~
deepnotderp
Just treat it as an infinite loop; there's no need to JIT an optimized
version that late.

~~~
grandmczeb
One of the core operations of the transformer network[1] is a (LxL) x (LxE)
matrix multiply (where L is the sentence length and E is the network width).
Can you be more specific about how you would get good performance without
specializing on L?

[1] [https://arxiv.org/abs/1706.03762](https://arxiv.org/abs/1706.03762)
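For concreteness, the shapes in question look like this (illustrative numbers; only the (LxL) x (LxE) structure comes from the comment above):

```python
import numpy as np

# The attention-weight @ value multiply from the transformer: an (L, L)
# score matrix times an (L, E) value matrix. L is the sentence length,
# known only at runtime; E is the model width. Sizes are illustrative.
L, E = 7, 64
scores = np.random.rand(L, L)   # attention weights
values = np.random.rand(L, E)   # per-token value vectors
out = scores @ values           # shape (L, E), depends on runtime L
```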

~~~
deepnotderp
You use the loop-based GEMM kernel and inject the input sizes as the loop
bounds.
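As I read the suggestion, that would be a single generic kernel whose loop bounds are runtime values, roughly (naive Python sketch; a real kernel would be tuned native code):

```python
import numpy as np

def gemm(A, B):
    """Naive loop-based GEMM: the loop bounds come from the runtime
    shapes, so one compiled kernel handles every input size."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))
    for i in range(M):
        for j in range(N):
            acc = 0.0
            for k in range(K):
                acc += A[i, k] * B[k, j]
            C[i, j] = acc
    return C
```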

~~~
grandmczeb
L can be as small as 1 or larger than 512. For small L it makes sense to do
different optimizations than for large L. A loop-based GEMM doesn't help with
that.
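A minimal illustration of the point (hypothetical Python; the threshold and the tiled variant are stand-ins for the kind of per-size kernel a shape-specializing compiler would pick or generate):

```python
import numpy as np

def matmul_small(A, B):
    # For tiny L, a plain product: tiling overhead isn't worth it.
    return A @ B

def matmul_large(A, B, tile=64):
    # For large L, tile for cache locality -- a stand-in for the kind
    # of optimization that only pays off at larger sizes.
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N))
    for i in range(0, M, tile):
        for k in range(0, K, tile):
            for j in range(0, N, tile):
                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return C

def attention_matmul(scores, values, threshold=128):
    L = scores.shape[0]  # sentence length, known only at runtime
    if L < threshold:
        return matmul_small(scores, values)
    return matmul_large(scores, values)
```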

------
stealthcat
What is meant by "modern"?

