>produce the high-performance codes that the machine learning community needs
Somewhat OT, but I've been wondering for a long time… Is the HPC community the only place the word "codes" is used like this? In usual CS parlance programming is done using a substance called "code" ("the high-performance code the community needs"), but in HPC literature the word "codes" is used, as if programming consisted of distinct objects. Does this arise from some divergent history (would I have called my LINPACK library punched card deck a "code"?) or what?
Code is generally considered a mass noun among software engineers, but "codes" is pretty commonly used by academics, especially in other disciplines. In particular, physicists and mathematicians seem to use it pretty frequently, so that might explain why some in the HPC community use it as well.
I've also noticed that it seems more common among Europeans, but that might be just personal experience.
"Codes" tends to refer to error-correcting codes or encoding formats in my experience, while code refers to source or assembly code for machine execution.
One is a concept/idea/format while the other is a set of instructions.
Yup, British-educated here: programme for sequences of activities, program for the thing the computer runs... it’s just a little oddity I’ve noticed I’ve picked up and seem to be quite consistently applying.
Yes... and I use ”to programme” to denote the activity of planning activities, and ”to program” to indicate the process of coding instructions into a computer. I would probably use ”programme to program” if I ever had to discuss the idea of planning one’s intentions to issue instructions to a machine.
And not without meaning. HPC people hold themselves to higher standards than the average Joes throwing stuff at the currently-fashionable "continuous integration server" until the lights turn green. "Codes" are often things that have been proven, written on a legal pad, typed into the machine, and finally validated on a set of test problems. Could you prove that your program has second-order numerical stability? https://en.wikipedia.org/wiki/Numerical_stability
A shining example of the Dunning–Kruger effect. Scientific code is some of the worst-engineered code you'll ever deal with because it's written by people that are great at physics but terrible at software engineering.
It will be poorly documented, have single letter var names everywhere without comments, have a bunch of hidden dependencies on whatever the postdoc had installed on his/her laptop, leak memory but 'terminate before it should be a problem', etc. Oh, and there will almost never be even a single regression test.
Sure, the underlying algorithm might have a nice proof of numerical stability, but the actual software is almost always trash until some poor soul has to re-engineer it if the software turns out to be useful.
Sure, the code is usually pretty ugly, but it also tends to work. The people who write it view software as just another tool, so they care more about working than looking pretty.
> Oh, and there will almost never be even a single regression test.
Bullshit. "Codes" tend to be run against analytical and/or experimental results.
It's not about looking pretty. It's about being composable, extendable, testable, and portable to other languages and architectures.
The fact that you think looking pretty has anything to do with sound software engineering further illustrates my point.
>Bullshit. "Codes" tend to be run against analytical and/or experimental results.
A README that says the author ran it against dataset Y and got X is not a regression test. I'm talking about automated tests run on every proposed patch to ensure the thing is never broken. Real automated tests that cover 90%+ of branches are about as rare as unicorns in scientific computing (much to my annoyance).
Don't get me wrong, I completely understand why people running experiments skip testing (fast iteration). But that leads to the very opposite of "higher standards".
It’s common in Indian English, as in “I have written a code to do the needful, but I have a doubt about it that perhaps you could answer.” Outside that dialect I’d call it an error.
This web page is also the first I've heard of Halide and Polyhedral Compilation. This is exciting to me because I've been working on relational (database) data and logic comprehensions, and in a case of convergent evolution Halide looks a lot like my notation and Polyhedral Compilation looks much like diagrams I've been drawing on my whiteboard. Where can I learn more on this?
The TC paper uses techniques that are in the R-Stream compiler. You can get more information about the R-Stream compiler at https://www.reservoir.com/research/publications/?cat=35 - the first paper there also validates the use of polyhedral compilation for deep learning last year. I particularly recommend the article about R-Stream in the Encyclopedia of Parallel Compilation, David Padua (ed). https://www.reservoir.com/publication/r-stream-compiler-2/
The R-Stream Encyclopedia article talks about how to raise to the polyhedral model from SSA; this is no problem especially for the super regular computations in Tensor computations used in deep learning. If you get in touch with me I can also get you a copy of the 2008 paper about R-Stream that described it: https://www.reservoir.com/publication/final-report-r-stream-...
By raising from C you don't need to use or learn a new language. Or one can generate C from a succinct notation or framework, and polyhedral optimize from there. That is what was in the R-Stream-TF paper.
Could someone please explain how this compares to the TensorFlow approach? I can only assume that it's omitted from the article due to marketing reasons.
My understanding is that Tensor Comprehensions provides a way to automatically generate optimized CUDA code for algorithms written in a high-level language that more closely mirrors the notation used in mathematical formulas. So you could use it to automatically find more optimal low-level implementations for components used in libraries such as PyTorch and TensorFlow which usually call out to hand-written low-level implementations.
How can it be both general, and fast? For example, CuDNN ops are fast because they are very specialized and highly tuned. Is convolution written with TC going to be as fast as CuDNN convolution?
Or, if TC's strength is in its generality, then what are the advantages over something like CuPy for Chainer?
In short, yes CuDNN is fast for the cases it was tuned for. It is probably faster on power-of-two sizes, but when you operate on a 26 x 1024954 x 3 tensor, TC can generate specialized code. Want 42 x 17 x 5? TC can generate differently specialized code. With almost no effort from the user (or performance engineers).
Can a performance expert do better job than TC optimizer? Very likely yes, but it will very likely take much more time.
TC is not a framework. It can be integrated with any framework of your liking.
With cuDNN, some expert had to sit down and write an optimized kernel for your ops. With TC, the idea is you write the simple, tensor comprehension for your op, and then throw the autotuner at it, and get something that approaches or even beats the hand-tuned example. The diagrams in the paper give a sense for some of these situations.
Tensor Comprehensions are not a deep learning framework, but a way to write new optimized operators for deep learning. Hence, Tensor Comprehensions can be integrated with TensorFlow.
That is correct (and we appreciate the citation). The tensor compiler (taco) has focused on compiling expressions that contain one or more sparse tensors so far and, even though it can generate code for dense expressions just fine, does not optimize these like TC, TCE, and XLA does. We are working on a scheduling language for it so that it will perform well across all types of formats and on GPUs.
I'm a fan of evolutionary algorithms, but are they really effective enough here to be comparable to an engineer tuning code? They might be able to find a good configuration of a few canned options but real optimization often requires some creativity or at least an understanding of the hardware. Will certainly be interesting to see this in practice!
Hello, evolutionary algorithms by themselves I am not sure and don't have enough experience with atm. In our particular context the key is the intermediate representation on which the evolution happens and the compiler behaviors it triggers. We are still far from competitive against highly tuned expert code in computation-bound regimes but we do see >50% shared memory BW peak usage in nvprof in multiple cases. This first iteration is aimed at addressing the immediate research productivity needs: no need to write hundreds of lines of framework integration and low-level kernels to get a reasonable CUDA performance. We think we are helping remove the first immediate bottleneck that are very frustrating in practice to ML researchers when their layer does not translate exactly in supported, optimized, vendor libraries.
The crucial part is the polyhedral optimizer which does indeed include several GPU-specific heuristics (multilevel parallelization, coalescing, etc) and specialization to tensor sizes. Evolutionary autotuner is used to tweak the parameters of the optimizer. As a result, TC can beat cublas and cudnn on certain networks; details in the report.
CuPy itself is just a framework, and you could slot TC in as a thing that generates operators for it. CuPy also famously has support for inline CUDA kernels; the equivalent TC kernels are shorter and autotunable.
Slightly resembles the Tensor Contraction Engine for quantum chemistry/physics (http://www.csc.lsu.edu/%7Egb/TCE/ ). Although it predates this by a couple of decades.
Hello, Tensor Comprehensions absolutely use techniques that have been existing for a few years (Halide) or many decades (polyhedral model, Einstein notation(century old?), ...). The TCE is definitely also a motivational prior work which also uses the polyhedral model for optimizing loop nests. What we tried to achieve here, is a solid research tool to make a subset of underlying optimizations algorithms usable in practice by non-experts. One such optimization algorithm is described in our joint 2011 PoPL paper with authors of TCE (Loop transformations: convexity, pruning and optimization).
The most surprising thing to me is that they can parameterize a nontrivial section of the implementation space of a function, or that such a section exists that hasn't been optimized away by the compiler.
The thing is that compilers usually quickly go to SSA form and it is not the best IR to optimize loops in. Then you fight it to extract a high-level IR and this little process makes it very easy to lose high-level information. We don't do this so we begin from a friendlier starting point.
I think tensors are an overly crude way to model things. It's like computer science went back to the 60s and replaced all data structures by homogeneous blocks of memory.
Edit: of course a computer works best with blocks of memory; that doesn't mean a human developer should have the same view. As a simple example, think of the output vector of a classifier. Why is it a vector, and not a structure? Or think of the internals of an LSTM network; there is more structure in there than just tensors.
Tensors are the way to effectively execute things - modeling things is about the "human interface" to the data, but when you need to do stuff with very, very, very large quantities of it at a good performance, then you'd want the system to transform the data and desired operations from your "human-friendly" model to a "machine-friendly" model, and that's generally going to be homogenous blocks of memory, and "vectorizing" processing as much as possible so you have less or no item-specific logic but instead have matrix operations that do the same thing to many data items at once in parallel.
It's just like with OOP in game programming where performance matters - even if you want a nice object model for programmer convenience, you'd also want to ensure that you can store the object data into a homogeneous array sequentially in memory, as that gets you a major performance impact; I seem to recall that Carmack had an in-detail article about that some time ago, but can't easily find it.
Tensor Comprehensions is mostly targeted at arithmetics operations that appear in DL workloads, and the notation strives to be usable for DL experts. Polyhedral optimizer is oriented towards imperative languages, so we won't end up doing the same optimizations. Really hard to compare. The spirit of making parallel programming simpler is common :)
This is one reason I also prefer to call TC a notation personally. We can't allocate and declare inside TC, that may change in the future but for now we went for the easiest entry point to Halide and polyhedral IR we could think of. You can lower simple C loops or other real languages in those IRs too but the programs are so much terser in TC that we have come to expect terseness.
I could not find how to integrate this into a C++ program. Does it need ATen or is there a lower level of integration? Is there even the possibility of getting C bindings?
Tensor Comprehensions does not try to own memory allocation and CPU/GPU transfers. ATen is one simple way of getting that, which we used for tests. Anything convertible to DLPack tensors should work as long as nothing fancy happens with tensor shapes.
It makes sense to let client code handle allocations and whatnot. What I cannot find is how to pass one or more tensors (given, I suppose, by some shape parameters and a pointer to the data buffer) to Tensor Comprehensions.
I would have expected that operations expressed in the tensor language could be compiled once and for all (for a given target) into a DLL, and then it would be just a matter of passing the right buffer and shape parameters to a function. I see no reason why this should not be easily handled in C (which makes it easier to bind it from most languages).
If I understand correctly, DLPack is the format of choice to express a tensor, and ATen is not required, but I cannot find examples using DLPack
Also TC doesn't do data allocation itself and it requires tensor library from users to do that. So if you are using Caffe2 for example, you could use the TcOp that we ship inside TensorComprehensions for Caffe2 and using caffe2 pybindings, you can already write TC in python. No work needed for creating python bindings. We welcome PRs on this :)
As for other tensor libraries, like ATen based on TH* used in torch/pytorch, you have the ability to create tensor but these require integrating tensor library into TC and writing pybindings for them. PRs welcome on this as well.
Python bindings are in there :) This needs a tensor library callable from python that works on GPUs. One direction we are going towards is PyTorch via ATen / Torch tensors; we already use the C++ parts of ATen. Of course any other CUDA tensor library with minimal alloc/copy/synchronize would work too. Send a PR? ;)
Somewhat OT, but I've been wondering for a long time… Is the HPC community the only place the word "codes" is used like this? In usual CS parlance programming is done using a substance called "code" ("the high-performance code the community needs"), but in HPC literature the word "codes" is used, as if programming consisted of distinct objects. Does this arise from some divergent history (would I have called my LINPACK library punched card deck a "code"?) or what?