
Tensor Comprehensions - smhx
https://research.fb.com/announcing-tensor-comprehensions/
======
wrs
>produce the high-performance codes that the machine learning community needs

Somewhat OT, but I've been wondering for a long time… Is the HPC community the
only place the word "codes" is used like this? In usual CS parlance
programming is done using a substance called "code" ("the high-performance
_code_ the community needs"), but in HPC literature the word "codes" is used,
as if programming consisted of distinct objects. Does this arise from some
divergent history (would I have called my LINPACK library punched card deck a
"code"?) or what?

~~~
grandmczeb
Code is generally considered a mass noun among software engineers, but "codes"
is pretty commonly used by academics, especially in other disciplines. In
particular, physicists and mathematicians seem to use it pretty frequently, so
that might explain why some in the HPC community use it as well.

I've also noticed that it seems more common among Europeans, but that might
just be my personal experience.

~~~
r00fus
code:codes::math:maths?

~~~
Nition
In the Commonwealth we say maths but we still say code.

I've noticed we also tend to write "computer program" the US way, despite
writing "TV programme".

~~~
qubex
Yup, British-educated here: programme for sequences of activities, program for
the thing the computer runs... it’s just a little oddity I’ve noticed I’ve
picked up and apply quite consistently.

~~~
qubex
Yes... and I use “to programme” to denote the activity of planning activities,
and “to program” to indicate the process of coding instructions into a
computer. I would probably use “programme to program” if I ever had to discuss
the idea of planning one’s intentions to issue instructions to a machine.

------
phaedrus
This web page is also the first I've heard of Halide and Polyhedral
Compilation. This is exciting to me because I've been working on relational
(database) data and logic comprehensions, and in a case of convergent
evolution Halide looks a lot like my notation and Polyhedral Compilation looks
much like diagrams I've been drawing on my whiteboard. Where can I learn more
about this?

~~~
richardlethin
The R-Stream Encyclopedia article talks about how to raise to the polyhedral
model from SSA; this is no problem, especially for the highly regular tensor
computations used in deep learning. If you get in touch
with me I can also get you a copy of the 2008 paper about R-Stream that
described it: [https://www.reservoir.com/publication/final-report-r-
stream-...](https://www.reservoir.com/publication/final-report-r-
stream-3-0-compiler/)

By raising from C you don't need to use or learn a new language. Or one can
generate C from a succinct notation or framework and apply polyhedral
optimization from there. That is what the R-Stream-TF paper did.

~~~
aray
Is this paper published yet? The `Article` link doesn't go anywhere, and
Google Scholar doesn't know where to find a copy.

------
falcor84
Could someone please explain how this compares to the TensorFlow approach? I
can only assume that it's omitted from the article for marketing reasons.

~~~
davesque
My understanding is that Tensor Comprehensions provides a way to automatically
generate optimized CUDA code for algorithms written in a high-level language
that closely mirrors mathematical notation. So you could use it to
automatically find faster low-level implementations for components used in
libraries such as PyTorch and TensorFlow, which usually call out to
hand-written low-level implementations.
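
For a concrete picture, here is roughly the matmul example from the linked
announcement, written against the release-era tensor_comprehensions Python
package (the exact API surface here is an assumption based on that post, so
treat it as a sketch):

    import tensor_comprehensions as tc
    import torch

    # The comprehension is the string below: output(i, j) accumulates
    # products over the reduction index kk, and "+=!" zero-initializes
    # the accumulator before reducing.
    lang = """
    def matmul(float(M,N) A, float(N,K) B) -> (output) {
        output(i, j) +=! A(i, kk) * B(kk, j)
    }
    """

    matmul = tc.define(lang, name="matmul")
    mat1, mat2 = torch.randn(3, 4).cuda(), torch.randn(4, 5).cuda()
    out = matmul(mat1, mat2)  # a CUDA kernel is JIT-compiled on first use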

~~~
p1esk
How can it be both general and fast? For example, cuDNN ops are fast because
they are very specialized and highly tuned. Is a convolution written in TC
going to be as fast as a cuDNN convolution?

Or, if TC's strength is in its generality, then what are the advantages over
something like CuPy for Chainer?

Can someone give an example where TC shines?

~~~
ozinenko
Section 7 of the paper
([https://arxiv.org/abs/1802.04730](https://arxiv.org/abs/1802.04730)) has a
couple of examples.

In short, yes: cuDNN is fast for the _cases it was tuned for_. It is probably
faster on power-of-two sizes, but when you operate on a 26 x 1024954 x 3
tensor, TC can generate specialized code. Want 42 x 17 x 5? TC can generate
differently specialized code. With almost no effort from the user (or
performance engineers).

Can a performance expert do a better job than the TC optimizer? Very likely
yes, but it will also very likely take much more time.

TC is _not_ a framework. It can be integrated with any framework of your
liking.
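
To make the size-specialization point concrete, reusing the matmul sketch
from an earlier comment (same release-era Python API assumption): a kernel is
compiled and cached per concrete input shape, so each of the sizes above gets
its own specialized code.

    # Each distinct input shape triggers its own size-specialized kernel,
    # rather than one generic kernel covering all shapes.
    c1 = matmul(torch.randn(26, 1024954).cuda(),
                torch.randn(1024954, 3).cuda())   # 26 x 1024954 x 3
    c2 = matmul(torch.randn(42, 17).cuda(),
                torch.randn(17, 5).cuda())        # 42 x 17 x 5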

------
cgmg
Related: [http://tensor-compiler.org/codegen.html](http://tensor-
compiler.org/codegen.html). This converts an expression in tensor index
notation into executable code.

~~~
ozinenko
Sure, it is one of the works we cite. It seems to be mostly targeted at sparse
computations and does not have GPU support.

Tensor Comprehensions does not try to manage memory and thus can be integrated
into DL frameworks easily.

~~~
fredrikbk
That is correct (and we appreciate the citation). The tensor compiler (taco)
has so far focused on compiling expressions that contain one or more sparse
tensors and, even though it can generate code for dense expressions just fine,
does not optimize these the way TC, TCE, and XLA do. We are working on a
scheduling language for it so that it will perform well across all types of
formats and on GPUs.

------
tehsauce
I'm a fan of evolutionary algorithms, but are they really effective enough
here to be comparable to an engineer tuning code? They might be able to find a
good configuration of a few canned options, but real optimization often
requires some creativity, or at least an understanding of the hardware. It
will certainly be interesting to see this in practice!

~~~
ozinenko
The crucial part is the polyhedral optimizer, which does indeed include
several GPU-specific heuristics (multilevel parallelization, coalescing, etc.)
and specialization to tensor sizes. The evolutionary autotuner is used to
tweak the parameters of the optimizer. As a result, TC can beat cuBLAS and
cuDNN on certain networks; details are in the report.
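
For flavor, here is a sketch of how the autotuner is surfaced in the Python
bindings; autotune is the documented entry point, but the specific keyword
arguments shown here (generations, pop_size, cache) are assumptions from
memory, not a definitive API.

    # Continuing the matmul sketch above: the evolutionary search tweaks
    # the polyhedral optimizer's parameters (tiling, mapping sizes,
    # unrolling, ...) for these exact input shapes and caches the best
    # configuration it finds.
    matmul.autotune((3, 4), (4, 5),   # tune for these input sizes
                    generations=10,   # length of the evolutionary search
                    pop_size=20,      # candidate configs per generation
                    cache=True)       # remember the winning options
    out = matmul(torch.randn(3, 4).cuda(), torch.randn(4, 5).cuda())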

~~~
p1esk
What would be the relationship between TC and something like CuPy?

~~~
ezyang
CuPy itself is just a framework, and you could slot TC in as a thing that
generates operators for it. CuPy also famously has support for inline CUDA
kernels; the equivalent TC kernels are shorter and autotunable.
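
For comparison, here is what the CuPy inline-kernel route looks like
(cupy.ElementwiseKernel is standard CuPy; the TC counterpart would be a
one-line comprehension along the lines of Z(i) = (X(i) - Y(i)) * (X(i) -
Y(i)), with the tuning left to the autotuner):

    import cupy as cp

    # With CuPy you write the CUDA C body yourself and fix the
    # elementwise structure of the kernel by hand.
    squared_diff = cp.ElementwiseKernel(
        'float32 x, float32 y',       # input parameters
        'float32 z',                  # output parameter
        'z = (x - y) * (x - y)',      # CUDA C snippet
        'squared_diff')               # kernel name

    z = squared_diff(cp.arange(10, dtype=cp.float32),
                     cp.ones(10, dtype=cp.float32))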

------
jabl
Slightly resembles the Tensor Contraction Engine for quantum chemistry/physics
([http://www.csc.lsu.edu/%7Egb/TCE/](http://www.csc.lsu.edu/%7Egb/TCE/)),
though the TCE predates this by well over a decade.

~~~
nicovasilache
Hello, Tensor Comprehensions absolutely uses techniques that have existed for
a few years (Halide) or many decades (the polyhedral model, Einstein notation
(a century old?), ...). The TCE is definitely also a motivational prior work,
one which likewise uses the polyhedral model for optimizing loop nests. What
we tried to achieve here is a solid research tool that makes a subset of the
underlying optimization algorithms usable in practice by non-experts. One such
optimization algorithm is described in our joint 2011 POPL paper with the
authors of TCE (Loop transformations: convexity, pruning and optimization).

------
alexbeloi
The most surprising thing to me is that they can parameterize a nontrivial
section of the implementation space of a function, or that such a section
exists that hasn't been optimized away by the compiler.

~~~
nicovasilache
The thing is that compilers usually lower to SSA form quickly, and SSA is not
the best IR to optimize loops in. Then you fight it to recover a high-level
IR, and in that process it is very easy to lose high-level information. We
don't do this, so we begin from a friendlier starting point.

------
amelius
I think tensors are an overly crude way to model things. It's like computer
science went back to the 60s and replaced all data structures by homogeneous
blocks of memory.

Edit: of course a computer works best with blocks of memory; that doesn't mean
a human developer should have the same view. As a simple example, think of the
output vector of a classifier. Why is it a vector, and not a structure? Or
think of the internals of an LSTM network; there is more structure in there
than just tensors.

~~~
PeterisP
Tensors are the way to execute things effectively. Modeling is about the
"human interface" to the data, but when you need to do stuff with very, very,
very large quantities of it at good performance, you want the system to
transform the data and the desired operations from your "human-friendly" model
to a "machine-friendly" one. That generally means homogeneous blocks of
memory, with the processing "vectorized" as much as possible, so that instead
of item-specific logic you have matrix operations that do the same thing to
many data items at once, in parallel.

It's just like OOP in game programming where performance matters: even if you
want a nice object model for programmer convenience, you also want to ensure
that the object data can be stored sequentially in a homogeneous array, as
that yields a major performance improvement; I seem to recall that Carmack had
a detailed article about that some time ago, but I can't easily find it.
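
A toy sketch of that array-of-structs vs. struct-of-arrays point in plain
NumPy (nothing TC-specific, just an illustration):

    import time
    import numpy as np

    N = 1_000_000

    # "Nice object model": one Python object per item (array of structs).
    class Particle:
        def __init__(self, x, y):
            self.x, self.y = x, y

    aos = [Particle(float(i), float(i)) for i in range(N)]

    # Homogeneous blocks of memory: one array per field (struct of arrays).
    xs = np.arange(N, dtype=np.float32)
    ys = np.arange(N, dtype=np.float32)

    t0 = time.perf_counter()
    for p in aos:                 # item-at-a-time logic
        p.x += 0.1 * p.y
    t1 = time.perf_counter()
    xs += 0.1 * ys                # the same update as one vectorized op
    t2 = time.perf_counter()

    print(f"AoS loop: {t1 - t0:.3f}s  SoA vectorized: {t2 - t1:.3f}s")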

------
charlescearl
Out of curiosity, is there any overlap in objectives or performance with
accelerate
([https://github.com/AccelerateHS/accelerate](https://github.com/AccelerateHS/accelerate))?

~~~
ozinenko
Tensor Comprehensions is mostly targeted at the arithmetic operations that
appear in DL workloads, and the notation strives to be usable by DL experts.
The polyhedral optimizer is oriented towards imperative languages, so we won't
end up doing the same optimizations. Really hard to compare. The spirit of
making parallel programming simpler is common :)

------
grondilu
From the documentation on arXiv:

> Variables not defined anywhere, implicitly become index variables.

That seems like a bold choice. Wasn't there a trend in programming languages,
even very high-level ones, to encourage variable declaration?
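
For example, in a matmul comprehension like the ones in the paper, i, j, and
kk are never declared anywhere:

    # Because i, j, and kk are otherwise undefined, TC treats them as
    # index variables and infers their ranges from the tensor shapes
    # (i in [0, M), j in [0, K), kk in [0, N)).
    lang = """
    def matmul(float(M,N) A, float(N,K) B) -> (C) {
        C(i, j) +=! A(i, kk) * B(kk, j)
    }
    """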

~~~
nicovasilache
This is one reason I personally prefer to call TC a notation. We can't
allocate and declare inside TC; that may change in the future, but for now we
went for the easiest entry point into Halide and the polyhedral IR that we
could think of. You can lower simple C loops or other real languages into
those IRs too, but programs are so much terser in TC that we have come to
expect terseness.

------
pathsjs
I could not find how to integrate this into a C++ program. Does it _need_
ATen or is there a lower level of integration? Is there even the possibility
of getting C bindings?

~~~
ozinenko
Tensor Comprehensions does not try to own memory allocation and CPU/GPU
transfers. ATen is one simple way of getting that, which we used for tests.
Anything convertible to DLPack tensors should work as long as nothing fancy
happens with tensor shapes.

C bindings don't seem to be a priority.

Feel free to use the contacts provided in the documentation here:
[https://facebookresearch.github.io/TensorComprehensions/cont...](https://facebookresearch.github.io/TensorComprehensions/contacts.html).

~~~
pathsjs
It makes sense to let client code handle allocations and whatnot. What I
cannot find is how to pass one or more tensors (given, I suppose, by some
shape parameters and a pointer to the data buffer) to Tensor Comprehensions.

I would have expected that operations expressed in the tensor language could
be compiled once and for all (for a given target) into a DLL, and then it
would be just a matter of passing the right buffer and shape parameters to a
function. I see no reason why this should not be easily handled in C (which
makes it easier to bind it from most languages).

If I understand correctly, DLPack is the format of choice for expressing a
tensor, and ATen is not required, but I cannot find any examples using DLPack.
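
For reference, here is the PyTorch side of the DLPack interchange
(torch.utils.dlpack is standard PyTorch; this is only the client half, since
I can't show how TC ingests the capsule):

    import torch
    from torch.utils.dlpack import to_dlpack, from_dlpack

    # A CUDA tensor owned by the client framework (PyTorch here).
    t = torch.randn(26, 72).cuda()

    # Exported as a DLPack capsule: a small C struct carrying the data
    # pointer, device, dtype, shape, and strides, with no copy.
    capsule = to_dlpack(t)

    # Any DLPack-aware consumer can wrap the capsule; importing it back
    # into PyTorch shows it refers to the same buffer, zero-copy.
    t2 = from_dlpack(capsule)
    assert t2.data_ptr() == t.data_ptr()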

------
tomrod
Neat!

This needs Python bindings, stat!

~~~
nicovasilache
Python bindings are in there :) This needs a tensor library, callable from
Python, that works on GPUs. One direction we are going in is PyTorch via
ATen / Torch tensors; we already use the C++ parts of ATen. Of course any
other CUDA tensor library with minimal alloc/copy/synchronize primitives
would work too. Send a PR? ;)

