
Building a Language and Compiler for Machine Learning - one-more-minute
https://julialang.org/blog/2018/12/ml-language-compiler
======
ChrisRackauckas
As someone who works in merging differential equations and machine learning, I
have found this kind of work essential for what I do. Pervasive AD that allows
merging neural networks and diffeq solvers is allowing us to explore all
kinds of new models and new problems. Sure, it doesn't impact vanilla machine
learning all that much (though Zygote.jl does allow for a lot of optimizations
that wouldn't be possible with tracing-based AD), but it definitely opens up a
new wave of AI possibilities.
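
To give a flavour of why this matters, here's a toy sketch (not our actual
stack, which builds on the DifferentialEquations.jl solvers; this just assumes
Zygote.jl and a hand-rolled Euler step with a made-up parameterised
right-hand side):

    using Zygote   # the compiler-based AD mentioned in the post

    # Toy "neural" right-hand side: two parameters standing in for a network.
    f(u, p) = p[1] .* u .+ p[2] .* u .^ 2

    # Hand-rolled, mutation-free Euler integrator so the AD can trace it.
    function solve_euler(u0, p; dt = 0.01, steps = 100)
        u = u0
        for _ in 1:steps
            u = u .+ dt .* f(u, p)
        end
        return u
    end

    loss(p) = sum(abs2, solve_euler([1.0, 2.0], p) .- [0.5, 1.0])

    # Gradient of the loss w.r.t. the parameters, straight through the solver.
    grad_p, = Zygote.gradient(loss, [0.1, -0.05])

The point is that nothing in the solver had to know about the AD or the model.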

~~~
dnautics
There are a few things about Flux that bug me. It automatically assumes that
I want to optimize matrix multiplications by parallelizing them across
cores, which has Amdahl scaling, instead of parallelizing across samples in
the batch, which has Gustafson scaling. It would probably help if batches and
minibatches (or something like that) were datatypes, which they are not. Doing
something like this would probably also help with distributing computation,
down the line.

I'm also not entirely sure what is going on under the hood with Tracker types,
and the documentation is not that great, which became a problem when I was
trying to chase down errors in something really custom I was doing.

I much prefer Knet's way of autodifferentiating, which is more intuitive to
me, but Knet's layering doesn't feel as nice as Flux's.

I really wish GPU computation in Julia had different semantics: make the GPU a
'virtual computational node', accessible using the Distributed module with the
same semantics as a _totally separate node_. That would really make async
distributed batch processing a thing; the system could profile all the nodes
in use and, if we really wanted to get fancy, use something like JuMP to make
the best use of the processing power available to it.
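
To make the sample-level parallelism concrete, the kind of thing I have in
mind looks roughly like this (a sketch using only the stdlib Distributed
module; the minibatch worker is a made-up placeholder):

    using Distributed
    addprocs(4)    # pretend these are GPUs or genuinely separate nodes

    @everywhere function process_minibatch(batch)
        # placeholder for a forward/backward pass over one minibatch
        sum(abs2, batch)
    end

    minibatches = [rand(Float32, 32, 10) for _ in 1:64]

    # Gustafson-style scaling: each worker takes whole minibatches,
    # rather than splitting individual matrix multiplies across cores.
    results = pmap(process_minibatch, minibatches)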

~~~
byt143
Two of your issues are currently being worked on. The autodiff tracker stuff
is temporary until the lower-overhead, almost-invisible compiler-based AD
mentioned in the blog post is fully ready. No custom types needed.
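
Concretely, the end state should be that you write plain Julia on plain arrays
and ask for a gradient, with no tracked wrapper types in sight. A minimal
sketch with Zygote.jl (assuming its current gradient API):

    using Zygote

    # Ordinary Julia functions on ordinary Float64 arrays.
    predict(W, b, x) = W * x .+ b
    loss(W, b, x, y) = sum(abs2, predict(W, b, x) .- y)

    W, b = rand(2, 3), rand(2)
    x, y = rand(3), rand(2)

    dW, db = Zygote.gradient((W, b) -> loss(W, b, x, y), W, b)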

There are also various autobatching packages being developed.

Regarding the GPU semantics, wouldn't that be solved by simply using a
distributed array of GPU arrays?

Since Flux is lightweight, generic, modular, and pure Julia, these things can
be developed in third-party packages.

~~~
taliesinb
Can you point at the autobatching packages? What strategy are they taking?
Will they recognize opportunities to combine compatible operations within a
given function body into a batch? Does one need a cost model for merging and
splitting data into batches?

Also, what does a technique like bucketing even look like with the approach
that Julia is taking? The idea there, of course, is to have 'slop': to combine
many similar examples whose tensor sizes differ by small amounts, and to
carefully define all your primitive operations such that they can ignore
the padding used to combine similar tensors into a uniform shape. Doing this
requires awareness of the tensor sizes all the way back to the way you sample
the training data, so I don't see how compiler magic can achieve the same
performance as you get from bucketing.

Of course, bucketing becomes more complex for things like trees, graphs, and
other higher-level objects. And bucketing can, theoretically, introduce bias
into your gradients if there is any correlation between the gradient of an
example and its tensor shape.
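
For concreteness, the kind of bucketing I mean is roughly this (a toy sketch
for variable-length sequences only; the bucket width and zero-padding scheme
are made up):

    # Group variable-length sequences into buckets of similar length,
    # then pad each bucket to its own maximum so tensor shapes are uniform.
    function bucket_and_pad(seqs; bucket_width = 8)
        buckets = Dict{Int, Vector{Vector{Float32}}}()
        for s in seqs
            key = cld(length(s), bucket_width)   # bucket id from rounded-up length
            push!(get!(buckets, key, Vector{Vector{Float32}}()), s)
        end
        map(collect(values(buckets))) do bucket
            L = maximum(length, bucket)
            # zero padding ("slop"); downstream ops must be written to ignore it
            reduce(hcat, [vcat(s, zeros(Float32, L - length(s))) for s in bucket])
        end
    end

    batches = bucket_and_pad([rand(Float32, n) for n in (3, 5, 9, 12, 14)])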

~~~
byt143
From the blog post:

"Automatic Batching To get the most from these accelerators – which can have
significant overheads per kernel launch, but scale very well over input size –
it is common to batch programs, applying the forwards and backwards passes to
multiple training examples at once. In simple cases, such as with
convolutional nets, it’s simple to handle this by concatenating, say, 10
images along an extra batch dimension. But this task becomes much harder when
dealing with variably-structured inputs, such as trees or graphs.

Most researchers address this by taking on the significant burden of batching
code by hand. Different solutions have been proposed for different frameworks
(DyNet, TensorFlow Fold), which heuristically try to batch some high level
operations together when possible, but these typically either have their own
usability issues or do not achieve the performance of hand-written code.

We suggest that this problem is identical to that of Single Program Multiple
Data (SPMD) programming, which has been well-studied by the language and
compiler community for decades, and becomes visible in more recent approaches
to batching like matchbox. Indeed, it is very similar to the model of
parallelism used by GPUs internally, and has been implemented as a compiler
transform for the SIMD units of CPUs. Taking inspiration from this work, we
are implementing the same transform in Julia to provide SPMD programming both
for scalar SIMD units and for model-level batching. This allows us to reach
the ideal of writing simple code that operates on individual samples, while
still getting the best performance on modern hardware."
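
For the simple case the quote mentions, manual batching just means stacking
samples along an extra dimension and writing ops so they broadcast over it.
A toy sketch in plain Julia (made-up sizes):

    # Per-sample code: a single dense layer applied to one input vector.
    dense(W, b, x) = tanh.(W * x .+ b)

    W, b = rand(4, 3), rand(4)
    samples = [rand(3) for _ in 1:10]

    # Manual batching: concatenate the 10 samples along a batch dimension
    # so one matrix multiply covers the whole minibatch.
    X = reduce(hcat, samples)   # 3×10
    Y = dense(W, b, X)          # 4×10; b broadcasts across the batch

The SPMD transform described above is about deriving that batched version
automatically from the per-sample code, even when the control flow is less
regular than a dense layer.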

------
manojlds
https://news.ycombinator.com/item?id=18593453

------
yahyaheee
I’m finding the ML work being done in Julia very refreshing. It feels like
they are building things right from the ground up and the community is great
to work with.

------
pbalau
> [...] bake for fifteen minutes and out pops a fully-featured ML stack

Where is logging, where are model storage and versioning, where are input data
processing and normalization, where is results processing?

~~~
mlevental
Lowbrow comment.

The hard parts of ML stacks are AD and GPU support, not all of those other
things (I'm sure there has been zero cutting-edge research done on better ways
to log).

~~~
one-more-minute
Yes, and unlike AD and GPU support, things like logging have nothing (special)
to do with ML. Julia has both very nice logging and plenty of good
serialisation options, all of which work nicely with the ML stack. It's
entirely unnecessary to duplicate these tools just so they can be baked into
a huge framework.

~~~
ChrisRackauckas
Well, I think there is something to be said there, though. The reason the
Julia stack is nice is that Julia's standard logging tools can be used for
logging in ML code. Even other things, like Julia's standard progress bars,
just work on ML code. That's quite a surprising result. Tools that build a
sub-language for graph construction, like TensorFlow, have to build and
document such tooling themselves. So newcomers to Julia will search the
package documentation and package code and find nothing. It is a confusing
problem because the functionality exists, but no one thought to document its
usage for this context since it is just the standard Julia usage!
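
For example, dropping the stdlib logging macros and a standard progress bar
straight into a training loop just works. A sketch (ProgressMeter.jl is a
separate package; the loss here is a stand-in for a real model):

    using Logging, ProgressMeter

    fake_loss(epoch) = exp(-0.1 * epoch) + 0.01 * rand()

    @showprogress for epoch in 1:50
        l = fake_loss(epoch)
        epoch % 10 == 0 && @info "training" epoch loss=l
    end

Nothing here knows it's being used "for ML", which is exactly why you won't
find it documented in the ML packages.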

------
xvilka
Hopefully one day Julia won't need a patched LLVM. That would improve
packaging in various distributions too.

