
The Tensor Algebra Compiler - dharma1
http://tensor-compiler.org/
======
fredrikbk
Hi Hacker News! I’m one of the developers. This project was also featured in
MIT News yesterday:

[http://news.mit.edu/2017/faster-big-data-analysis-tensor-alg...](http://news.mit.edu/2017/faster-big-data-analysis-tensor-algebra-1031)

The code is available at:

[https://github.com/tensor-compiler/taco](https://github.com/tensor-compiler/taco)

I’m happy to discuss the project and to answer any questions :)

~~~
philipkglass
The MIT story focuses on "big data," but it looks like this might be
applicable to coupled-cluster calculations in physics/chemistry too. Is it?
Can you compare/contrast with e.g. the Cyclops Tensor Framework
([http://solomon2.web.engr.illinois.edu/ctf/](http://solomon2.web.engr.illinois.edu/ctf/))
or NWChem's Tensor Contraction Engine
([http://www.csc.lsu.edu/~gb/TCE/](http://www.csc.lsu.edu/~gb/TCE/))?

~~~
fredrikbk
Larry chose to focus on the big data part because it is intuitive. But I
think you're absolutely correct that it has applications in physics/chemistry
(and machine learning too). We're actually talking to people in our
theoretical physics department who may want to use taco for their QCD
computations. There's also a new issue on our tracker about adding complex
number support for nuclear computations and quantum computing:
[https://github.com/tensor-compiler/taco/issues/116](https://github.com/tensor-compiler/taco/issues/116).

The Tensor Contraction Engine is great work and focuses on dense tensors. We
currently optimize for sparse tensors, so TCE will do better than us for pure
dense expressions. We want to bridge this gap next semester, though.

The Cyclops framework is also great. We discuss it in our related work, but we
did not directly compare against it in our evaluation. The first version of it,
like TCE, focused on dense tensors, and their main focus is on distributed
computing, which we don't support yet (we do shared-memory parallel execution
at the moment). They have some followup work on sparse computations. The
difference from our work is that they, if I read their paper correctly,
transpose the data until they can call pre-existing matrix multiplication
routines. This causes data-movement overheads. Our work compiles expressions
to work on the data at hand, without moving it.
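
For a concrete feel, here is roughly what that looks like in taco's C++ API (a
sketch along the lines of the examples in our documentation; the dimensions
and values are made up):

    #include "taco.h"
    using namespace taco;

    int main() {
      // A compressed sparse fiber (CSF) 3-tensor, a dense vector, and a
      // CSR output matrix.
      Format csf({Sparse, Sparse, Sparse});
      Format csr({Dense, Sparse});

      Tensor<double> B("B", {1024, 1024, 1024}, csf);
      Tensor<double> c("c", {1024}, Format({Dense}));
      Tensor<double> A("A", {1024, 1024}, csr);

      B.insert({0, 0, 0}, 1.0);
      B.insert({7, 3, 512}, 2.5);
      c.insert({0}, 4.0);
      c.insert({512}, 2.0);
      B.pack();
      c.pack();

      // Index notation: taco generates one fused kernel that walks the
      // sparse data structures directly, with no transposes or temporaries.
      IndexVar i, j, k;
      A(i, j) = B(i, j, k) * c(k);

      A.compile();   // generate and compile the kernel
      A.assemble();  // compute the sparsity structure of A
      A.compute();   // compute the values
      return 0;
    }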

------
surban
We did some work on computing derivative expressions (goal is application to
deep learning) for such tensor algebras. I was going to release it on arXiv in
the next few weeks, but now seems to be a good time. Here you go (preliminary
version):

[https://github.com/surban/TensorAlgDiff/raw/master/elemdiff....](https://github.com/surban/TensorAlgDiff/raw/master/elemdiff.pdf)

Our system takes a tensor algebra expression (we call these element-wise
defined tensors) and outputs expressions for the derivatives w.r.t. all of its
arguments. The expression may contain sums, and the indices of the argument
tensors can be any linear combination of the function indices (see our example
for more details). It correctly handles the cases where an index does not
appear in an argument, appears twice, appears as (i+j), etc.
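
For a flavor of the output (this is the standard matrix-multiply example, not
one taken from our paper): for the element-wise definition

    f_{ij} = \sum_k a_{ik} x_{kj}

the system emits a closed-form expression for every element of the derivative,
e.g. for a scalar loss \ell,

    \frac{\partial \ell}{\partial a_{mn}}
      = \sum_{i,j} \frac{\partial \ell}{\partial f_{ij}}
                   \frac{\partial f_{ij}}{\partial a_{mn}}
      = \sum_j \frac{\partial \ell}{\partial f_{mj}} \, x_{nj}

so each (m, n) entry is an explicit sum with no scattering left.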

Code to play with at
[https://github.com/surban/TensorAlgDiff](https://github.com/surban/TensorAlgDiff)

~~~
throwaway613834
I'm not sure if this is a sensible question, but what is the use of this
compared to (say) an autodiff library like FADBAD++ [1]? Is performance the
main advantage, or expressive power, or something else?

[1] [http://www.fadbad.com/fadbad.html](http://www.fadbad.com/fadbad.html)

~~~
surban
If you use autodiff and don't have a 1:1 relationship between function and
argument indices, you might have to do atomic sums or locking when computing
the derivative, because multiple elements of the function derivative correspond
to one element of the argument derivative. On a CPU this might be okay, but on
CUDA GPUs it usually has a performance impact. Thus we transform the
derivatives so that we have an explicit expression for each derivative element
and can therefore use one CUDA thread per derivative element.
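
A minimal CPU sketch of the contrast (the mapping f[k] = x[k % n] and the
function names are hypothetical, not code from our system):

    #include <vector>
    #include <cstddef>

    // Forward pass: f[k] = x[k % n], so several outputs share one input
    // and the backward pass is many-to-one.

    // Scatter (what plain autodiff gives you): accumulate into dx while
    // looping over df. Concurrent GPU threads would need atomicAdd here.
    void backward_scatter(const std::vector<double>& df,
                          std::vector<double>& dx) {  // dx assumed zeroed
      std::size_t n = dx.size();
      for (std::size_t k = 0; k < df.size(); ++k)
        dx[k % n] += df[k];
    }

    // Gather (the transformed form): an explicit expression for each
    // element of dx, so one independent thread per element suffices.
    void backward_gather(const std::vector<double>& df,
                         std::vector<double>& dx) {
      std::size_t n = dx.size();
      for (std::size_t j = 0; j < n; ++j) {
        double s = 0.0;
        for (std::size_t k = j; k < df.size(); k += n)
          s += df[k];  // every k with k % n == j
        dx[j] = s;
      }
    }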

~~~
throwaway613834
Oh wow, interesting. Thanks!

------
comnetxr
Often (in the scientific computing communities that I am in) large tensor
contractions are done by reshaping tensors into matrices, transposing, and
using matrix multiplication. The contraction time can change dramatically
based on the order one contracts the tensors and the relative sizes of the
tensors. It seems this just uses a large nested for loop - how does this
strategy compare to using dedicated matrix multiplication algorithms?

~~~
lsorber
An efficient xgemm kernel is probably faster than the code that taco generates
if you don't need to permute your tensor. A contraction like B(i,k) = T(i,j,k)
* A(j) would require a permutation of T before you could run the matrix
multiplication, though, while taco can just keep the data in place.
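
For illustration, in taco that contraction can be written directly over T's
existing layout (a sketch; the dimensions are made up):

    // Contract the middle mode in place; no permutation of T needed.
    Format dense3({Dense, Dense, Dense});
    Tensor<double> T("T", {64, 32, 64}, dense3);
    Tensor<double> A("A", {32}, Format({Dense}));
    Tensor<double> B("B", {64, 64}, Format({Dense, Dense}));

    IndexVar i, j, k;
    B(i, k) = T(i, j, k) * A(j);
    B.compile(); B.assemble(); B.compute();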

------
macawfish
Awesome!

I made something similar recently: a GLSL code printer for SymPy. I use it for
generating GLSL from Clifford algebras, which I'm using to do
higher-dimensional & conformal geometry.

[https://github.com/sympy/sympy/pull/12713](https://github.com/sympy/sympy/pull/12713)

------
gugagore
Conventional wisdom has it that dense matrix-matrix multiply is generally
faster if you use some cache-aware scheme instead of the textbook 3 nested for
loops.

Is there ever a similar story for sparse matrices?

~~~
fredrikbk
The story for sparse matrices is more complicated. Because of the dependencies
imposed by sparse data structures (you don’t have random access and often have
to traverse them), you cannot do loop tiling without adding if statements (I
think you can probably get rid of these by precomputing, e.g., row halfway
points). It may still make sense, but there’s a bigger cost. However, you can
lay out the data in a tiled way to get better cache blocking; taco lets you do
this by storing a blocked matrix as a 4-tensor.
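
For example, a BCSR-style matrix can be declared as a 4-tensor (a sketch; the
matrix and block sizes are made up):

    // Blocked sparse matrix: sparse over 4x4 blocks, dense inside them.
    Format bcsr({Dense, Sparse, Dense, Dense});
    Tensor<double> A("A", {256, 256, 4, 4}, bcsr);
    Tensor<double> x("x", {256, 4}, Format({Dense, Dense}));
    Tensor<double> y("y", {256, 4}, Format({Dense, Dense}));

    // Blocked SpMV: the inner (ib, jb) loops run over dense blocks,
    // which is where the cache blocking comes from.
    IndexVar i, j, ib, jb;
    y(i, ib) = A(i, j, ib, jb) * x(j, jb);
    y.compile(); y.assemble(); y.compute();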

Because of the lack of random access, some algorithms, like linear-combination
matrix-matrix multiplication, benefit from adding a dense workspace that gives
you a view into one row. Then you can scatter into it and, when you’re done,
copy the nonzeros to the sparse result matrix (this algorithm is sometimes
called Gustavson’s algorithm, after the person who first published it). We
have worked out this optimization within the taco framework (it applies to
other kernels too) and are now writing it up for publication.
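
A minimal sequential sketch of the workspace idea for sparse C = A * B in CSR
(my own illustrative code, not taco’s generated kernel; the struct and names
are made up):

    #include <vector>

    // CSR storage: pos holds row pointers, crd holds column indices.
    struct CSR {
      int rows, cols;
      std::vector<int> pos, crd;
      std::vector<double> val;
    };

    CSR spgemm(const CSR& A, const CSR& B) {
      CSR C{A.rows, B.cols, {0}, {}, {}};
      std::vector<double> w(B.cols, 0.0);  // dense workspace: one row of C
      std::vector<char> seen(B.cols, 0);   // which columns were touched
      std::vector<int> nz;                 // their indices, for compaction
      for (int i = 0; i < A.rows; ++i) {
        nz.clear();
        for (int pA = A.pos[i]; pA < A.pos[i + 1]; ++pA) {
          int k = A.crd[pA];
          for (int pB = B.pos[k]; pB < B.pos[k + 1]; ++pB) {
            int j = B.crd[pB];
            if (!seen[j]) { seen[j] = 1; nz.push_back(j); }
            w[j] += A.val[pA] * B.val[pB];  // random-access scatter
          }
        }
        // Copy the nonzeros out (in first-touch order) and reset the
        // workspace for the next row.
        for (int j : nz) {
          C.crd.push_back(j);
          C.val.push_back(w[j]);
          w[j] = 0.0;
          seen[j] = 0;
        }
        C.pos.push_back((int)C.crd.size());
      }
      return C;
    }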

------
nmca
Hm, the lack of GPU-specific support seems unfortunate. Cool work though, the
formulation as a lattice is pretty sweet :)

~~~
floatboth
Yeah, would be nice to see one of these compute frameworks compile to SPIR-V
and run on Vulkan…

------
ianai
Is there a list of applications this may be especially geared toward? I’m a
little out of practice and this seems interesting.

~~~
fredrikbk
I don't know of an explicit list like this, but some areas that we are
particularly interested in are: data analytics (tensor factorization), machine
learning (anything sparse including perhaps sparse neural networks), and
scientific computing/graphics (general relativity, quantum mechanics such as
QCD, finite element simulations, and apparently nuclear physics). It is
particularly well suited anywhere you have sparse matrices or tensors.

------
realitygrill
Saw you guys at SPLASH; good shit!

~~~
fredrikbk
Thanks! I really like that conference!

------
rsodhi3050
Man, this is awesome!

