JIT native code generation for TensorFlow graphs using Python and LLVM (christianperone.com)
95 points by perone on Aug 22, 2016 | 19 comments

Going to take this opportunity to plug my related project [Likely](www.liblikely.org), a DSL for lowering machine learning inference algorithms.

One of the projects we've built on top of it is a static compiler for Caffe model files. This allows you to execute Caffe models _without_ a runtime dependency on Caffe libraries. Thus you can target OSes and CPUs not supported by mainline Caffe. If you have commercial interest in this capability please reach out to me.

Very interesting. I recently put together a simple Caffe static compiler of my own on the path to FPGA deployment of CNN inference pipelines. I looked at Halide [0] as a possible intermediate representation for CPU, GPU, and FPGA (a step above LLVM IR), but Likely seems like an interesting option.

[0]: http://halide-lang.org/

Targeting FPGAs is something we've been keeping our eye on as well. Most of the pieces are in place. I think the last non-trivial part is removing any malloc()/free() calls. This should be possible as the size and lifetime of memory allocations are known at compile time and can be either moved to the stack or made global as desired.
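A rough sketch of that idea, assuming each buffer's size and live range (first and last step of use) are known at compile time. The `Buffer` type and `plan_static_arena` helper are hypothetical names for illustration and are not part of Likely. The planner assigns every buffer a fixed offset in one global arena using greedy first-fit, so nothing needs malloc()/free() at run time:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Buffer:
    name: str
    size: int   # bytes, known at compile time
    start: int  # first step at which the buffer is live
    end: int    # last step at which the buffer is live (inclusive)


def plan_static_arena(buffers):
    """Assign each buffer a fixed offset in one arena (greedy first-fit).

    Buffers whose live ranges overlap never share bytes; buffers with
    disjoint lifetimes may reuse the same region.
    Returns (offsets, arena_size).
    """
    placed = []   # (offset, buffer) pairs already assigned
    offsets = {}
    # Placing larger buffers first tends to reduce fragmentation.
    for buf in sorted(buffers, key=lambda b: -b.size):
        # Regions occupied by buffers that are live at the same time as buf.
        busy = sorted(
            (off, off + other.size)
            for off, other in placed
            if not (other.end < buf.start or buf.end < other.start)
        )
        offset = 0
        for lo, hi in busy:
            if offset + buf.size <= lo:
                break  # the gap before this region is big enough
            offset = max(offset, hi)
        placed.append((offset, buf))
        offsets[buf.name] = offset
    arena_size = max((off + b.size for off, b in placed), default=0)
    return offsets, arena_size


# Hypothetical three-stage pipeline: conv1_out is dead before fc_out is live.
buffers = [
    Buffer("conv1_out", 4096, start=0, end=1),
    Buffer("pool1_out", 1024, start=1, end=2),
    Buffer("fc_out", 256, start=2, end=3),
]
offsets, arena_size = plan_static_arena(buffers)
```

Here `fc_out` reuses the bytes of `conv1_out`, so the arena is smaller than the sum of the buffer sizes. A real compiler would also have to respect alignment, which this sketch ignores.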

One of the reasons I opted against Halide myself was the feeling that for this domain there ought to be enough information available to the compiler to intelligently pick tiling and vectorization parameters, for example by using Polly [0]. However, in practice, manual tiling and vectorization are hard to beat.

[0] http://polly.llvm.org/
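For anyone unfamiliar with what "tiling" means here, a minimal sketch of the loop structure, assuming matrices stored as plain lists of lists. The `tile` parameter is the hand-picked value a compiler like Polly would try to choose automatically. Pure Python will not show any speedup; this only illustrates the transformation:

```python
def matmul_naive(a, b):
    """Reference triple loop: C = A @ B."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                c[i][j] += a[i][p] * b[p][j]
    return c


def matmul_tiled(a, b, tile=2):
    """Same result, but iterating over tile x tile blocks for locality."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    # Outer loops walk over blocks; inner loops stay inside one block,
    # so a compiled version would keep the working set in cache.
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c
```

The `min(...)` bounds handle matrix dimensions that are not a multiple of the tile size.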

Really nice project, congratulations! I'll add a link in my post about it.

Forgive my ignorance, but it seems like this is just attempting to take advantage of the optimization done by LLVM, yes?

What I would love is a simple way of writing standalone functions that compile into a cross-platform LLVM file that I can call from a variety of other languages on a variety of other systems. In particular, if I train a recurrent network on text data for a chat bot, I want to be able to use that LLVM file + model in a game I release for the PC and for Android without worrying about the NDK/gcc/clang/Windows/OSX build nightmare. The ability to easily and quickly define a model in TensorFlow, write a Python function that takes an array of data, and spits out an array of data would be incredible and would mean that all the work I'm doing for a native Rust library is unneeded.

Admittedly, with Bazel I could create a C++ wrapper for the function which loads the library. It's just... that produces a 150 MB shared library with all the dependencies, and it's also a pain in the ass.

This is actually easy to do: you generate the IR for each function and merge the results into a single Module. After that, you apply passes over the entire module to optimize it, do function inlining, etc.

Cool idea, but is this of any benefit?

Isn't this essentially what TensorFlow does internally, except it inserts CUDA primitives at the right positions...

TensorFlow doesn't yet do loop fusion (though I believe the specific example shown in that article may already be done via constant folding). But if you have a bunch of elementwise operations, JIT-techniques can reduce the number of memory passes over a buffer. If your model is already very computationally dense (time dominated by matmul or convolution), then this won't help as much, but otherwise, JIT techniques can help.
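A toy illustration of the memory-pass argument, using plain Python lists rather than TensorFlow. The three operations (scale, shift, ReLU) and the function names are made up for the example. The unfused version walks a buffer three times and materializes two temporaries; the fused version walks it once:

```python
def unfused(xs):
    """Three elementwise ops, each a separate pass with its own temporary."""
    passes = 0
    t1 = [x * 2.0 for x in xs]        # pass 1: scale
    passes += 1
    t2 = [x + 1.0 for x in t1]        # pass 2: shift
    passes += 1
    out = [max(x, 0.0) for x in t2]   # pass 3: ReLU
    passes += 1
    return out, passes


def fused(xs):
    """The same computation in one loop: one pass, no intermediate buffers."""
    out = [max(x * 2.0 + 1.0, 0.0) for x in xs]
    return out, 1
```

Both return the same values; only the number of trips through memory differs, which is what a JIT can exploit when the ops are cheap relative to memory bandwidth.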

Yes it is, and in my opinion it is very important. The TensorFlow team is actively working on a JIT (https://github.com/tensorflow/tensorflow/issues/164). I'll paste here a relevant part of the Future Work section of the TensorFlow paper:

"We also have a number of concrete directions to improve the performance of TensorFlow. One such direction is our initial work on a just-in-time compiler that can take a subgraph of a TensorFlow execution, perhaps with some runtime profiling information about the typical sizes and shapes of tensors, and can generate an optimized routine for this subgraph. This compiler will understand the semantics of perform a number of optimizations such as loop fusion, blocking and tiling for locality, specialization for particular shapes and sizes, etc."


Indeed, some benchmarks would be interesting.

Maybe P2P TensorFlow as a service would be a neat idea?

E.g. I have data and a TensorFlow model, and anybody can bid to run it quickly on the cheapest available compute power: EC2 spot instances, Google Cloud preemptible VMs, or some spare machine with an NVIDIA CUDA GPU.

I decided to build an obscenely overpowered desktop machine because I can't get myself to trust 3rd parties to run all the TensorFlow code I have. For some reason I can't trust this code in the same way I trust my server side code running on multiple servers across the world.

Are you worried about them stealing your code or something?

Not the code but the data being leaked, stolen by third parties (I assume all servers are pwned), or damaged due to a botched OS install. I prefer the peace of mind.

Cool project. But since TensorFlow works best with a built-in proprietary backend like cuDNN, what role will LLVM play here?

Can't we use Numba directly instead of llvmlite?

Numba works on Python code, not on DAGs generated by TensorFlow.

In practice, what Numba does is turn the Python code into LLVM IR, and then compile that with LLVM. What the OP is doing is turning TensorFlow DAGs into LLVM IR, and then compiling that with LLVM. You can look at Numba and the OP's project both as front ends to LLVM.
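To show what "front end" means in this sense, here is a minimal sketch that lowers a toy DAG to straight-line code and compiles it. Big caveat: it emits Python source and uses the built-in `compile()` as a stand-in for emitting LLVM IR, so it only mirrors the structure of such a front end, not the OP's actual code. The graph format and the names `GRAPH`, `emit_source`, and `jit` are invented for the example:

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical toy graph: node name -> (op, arguments).
GRAPH = {
    "x": ("input", []),
    "w": ("const", [3.0]),
    "mul": ("mul", ["x", "w"]),
    "out": ("add", ["mul", "x"]),
}

BINARY_OPS = {"add": "+", "mul": "*"}


def emit_source(graph, output, fn_name="compiled"):
    """Lower the DAG to straight-line Python source, one statement per node."""
    # Only binary ops depend on other nodes; inputs and constants are leaves.
    deps = {
        name: (args if op in BINARY_OPS else [])
        for name, (op, args) in graph.items()
    }
    inputs = [name for name, (op, _) in graph.items() if op == "input"]
    lines = [f"def {fn_name}({', '.join(inputs)}):"]
    # Visit nodes so that every operand is defined before it is used.
    for name in TopologicalSorter(deps).static_order():
        op, args = graph[name]
        if op == "const":
            lines.append(f"    {name} = {args[0]!r}")
        elif op in BINARY_OPS:
            lines.append(f"    {name} = {args[0]} {BINARY_OPS[op]} {args[1]}")
    lines.append(f"    return {output}")
    return "\n".join(lines)


def jit(graph, output):
    """Compile the generated source and return a callable."""
    namespace = {}
    exec(compile(emit_source(graph, output), "<graph>", "exec"), namespace)
    return namespace["compiled"]


f = jit(GRAPH, "out")  # computes x * 3.0 + x
```

A real LLVM front end would do the same topological walk but emit typed IR instructions (for example through llvmlite's IR builder) instead of source text.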

Numba is a Python JIT that uses llvmlite.

llvmlite is a Python interface to LLVM.

So, JIT compilation for TensorFlow graphs would need llvmlite, as TensorFlow graphs aren't Python.

I'm not sure that LLVM is the correct way to go about this. Don't get me wrong, it can be used, but most of what these frameworks do operates on very large tensors/multi-dimensional arrays. As such, optimizing the computation graph for such arrays, although very similar to standard optimization, also has some significant differences. I do believe, however, that all frameworks should start using the same graph IR and optimization procedure, potentially with different back ends based on hardware and different front ends based on language. I in fact tried to achieve this some time ago, and it is still in progress, but lately I have had no time to work on it. Still, the post is really great.
