
Machine learning systems are stuck in a rut - feross
https://blog.acolyer.org/2019/06/28/machine-learning-systems-are-stuck-in-a-rut/
======
cs702
Direct link to the Google Brain paper, which is well worth a read:

[https://dl.acm.org/citation.cfm?id=3321441](https://dl.acm.org/citation.cfm?id=3321441)
(click on "PDF" link to read)

Abstract: "In this paper we argue that systems for numerical computing are
stuck in a local basin of performance and programmability. Systems researchers
are doing an excellent job improving the performance of 5-year-old benchmarks,
but gradually making it harder to explore innovative machine learning research
ideas. We explain how the evolution of hardware accelerators favors compiler
back ends that hyper-optimize large monolithic kernels, show how this reliance
on high-performance but inflexible kernels reinforces the dominant style of
programming model, and argue these programming abstractions lack
expressiveness, maintainability, and modularity; all of which hinders research
progress. We conclude by noting promising directions in the field, and
advocate steps to advance progress towards high-performance general purpose
numerical computing systems on modern accelerators."

The main example the authors use to illustrate these issues is capsule
networks, first proposed two years ago.[a]

To date, no one has been able to develop a high-performance implementation of
capsule networks. At present, the best-performing implementations in
TensorFlow and PyTorch must copy, rearrange, and materialize to memory _two
orders of magnitude more data_ than necessary, due to the issues raised by the
authors. See sections 1 and 2 of the paper for the gory details.

Two orders of magnitude. That is pathetic.
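
For intuition, here's a toy PyTorch snippet (my own illustration, not the
paper's capsule example) of the general failure mode: expressing a
sliding-window computation without a fused kernel forces you to materialize
far more data than the input contains.

    # Toy example: extracting all 3x3 patches materializes ~9x the input.
    import torch
    import torch.nn.functional as F

    x = torch.randn(1, 64, 224, 224)   # [batch, channels, H, W]
    patches = F.unfold(x, kernel_size=3, padding=1)
    print(x.numel(), patches.numel())  # second is ~9x the first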

[a] [https://arxiv.org/abs/1710.09829](https://arxiv.org/abs/1710.09829)

~~~
cr0sh
Note: I am not an expert on ML, ANNs, etc - most of what I have done has been
mainly "hobby level" and MOOC learning.

From what I understand, though, what GH is trying to accomplish with capsule
networks (which I have tried, and so far failed, to understand) is optimizing
backpropagation, or possibly removing it entirely.

He has noted in the past that - as far as I know - there is no biological
equivalent to backpropagation for learning. Backprop is a purely artificial
mathematical construct that doesn't occur in natural systems. It is also
extremely energy intensive, at least in the manner it is currently done.

So the question he's hoping to answer, I think, is: what is a proper working
framework for an artificial neural network that can learn without using
backpropagation (or by using it differently)?

I think whoever can solve this will fundamentally remake the field of ML/AI -
much in the same way that backprop (and later "deep learning") did.

~~~
visarga
It's not backprop that is holding back ML - it takes about as much time as
the forward pass and requires 3x the memory. Backprop is necessary for almost
all the deep neural nets that have state-of-the-art results; attempting to
replace it would push us back a decade or more.
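
For intuition, a back-of-envelope sketch (mine, not the parent's numbers): for
a dense layer the backward pass does two matmuls to the forward pass's one,
and the forward activations have to be kept around, which is where the extra
time and memory go.

    # y = x @ W: forward is one matmul; backward is two, plus saved x.
    import numpy as np

    x = np.random.randn(128, 512)   # activations (kept for backward)
    W = np.random.randn(512, 256)
    y = x @ W                       # forward: 1 matmul

    dy = np.random.randn(*y.shape)  # upstream gradient
    dx = dy @ W.T                   # backward matmul 1: input gradient
    dW = x.T @ dy                   # backward matmul 2: weight gradient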

The main problem with ML is the separation of data and compute. It takes a
lot of time and energy to move data around. We need 'in-memory compute'.

~~~
cr0sh
So you're in disagreement with Hinton?

[https://www.axios.com/artificial-intelligence-pioneer-says-w...](https://www.axios.com/artificial-intelligence-pioneer-says-we-need-to-start-over-1513305524-f619efbd-9db0-4947-a9b2-7a4c310a28fe.html)

I've heard the argument made about "in-memory compute" - and I do know that
various companies have built hardware in that direction (some of it with very
low power requirements).

I just find it compelling that Hinton - at least as of 2 years ago - has this
viewpoint that we should be rethinking the concept of backprop...

~~~
jasallen
I'd say this is a difference of scale. Hinton's comments are meant to inspire
and encourage _The Next Great Thing_. We could read the paper that broadly,
but the authors seem to be addressing a more immediate local minimum.

------
YeGoblynQueenne
That's it? We need better frameworks?

I thought this was going to be about this "stuck in a rut":

 _GH: One big challenge the community faces is that if you want to get a paper
published in machine learning now it's got to have a table in it, with all
these different data sets across the top, and all these different methods
along the side, and your method has to look like the best one. If it doesn’t
look like that, it’s hard to get published. I don't think that's encouraging
people to think about radically new ideas.

Now if you send in a paper that has a radically new idea, there's no chance in
hell it will get accepted, because it's going to get some junior reviewer who
doesn't understand it. Or it’s going to get a senior reviewer who's trying to
review too many papers and doesn't understand it first time round and assumes
it must be nonsense. Anything that makes the brain hurt is not going to get
accepted. And I think that's really bad.

What we should be going for, particularly in the basic science conferences, is
radically new ideas. Because we know a radically new idea in the long run is
going to be much more influential than a tiny improvement. That's I think the
main downside of the fact that we've got this inversion now, where you've got
a few senior guys and a gazillion young guys._

[https://www.wired.com/story/googles-ai-guru-computers-think-...](https://www.wired.com/story/googles-ai-guru-computers-think-more-like-brains/)

Which is the granddaddy of the other one. You don't really need better
frameworks that make it easier to explore innovative ideas if you can't ever
hope to publish those innovative ideas, even if you manage to make them work.

NB: "GH" is Geoff Hinton.

~~~
visarga
Ideas are cheap. I have ideas, you have ideas, everyone has ideas. What counts
is beating the state of the art; that is what turns heads and raises eyebrows
(such as ResNet, AlphaZero, WaveNet, BigGAN, and the Transformer). Usually
those results are based on lots of compute and training data, though.

~~~
angel_j
Ideas are cheap, but the skills to engineer ML pipelines and the knowledge
required to grok models are not. And the cost of testing some ideas is not
cheap either.

Lots of people have legit skills, high level math, pro software engineering,
actual ML chops (I got mine), and probably significant learning in other
sciences. But there is no door to walk through on that alone, only strange VC
corridors and FAANGY career mazes.

There should be a wide door to support people with real skills taking chances.
If $100K is worth gambling on so many silly startups, so is a gamble on people
who have gone the extra distance with their learning and experience.

------
gambler
Yes.

If you have an idea about machine learning that doesn't fit the mold, you're
pretty much forced to write your own mini-framework before trying it. And I'm
not even talking about CUDA compilation.

Mainstream languages don't have syntax for basic mathematical objects like
graphs and matrices. Even 2D array support is horrible everywhere.

Try to write the following algorithm:

1. Given an image in greyscale, choose two n*n regions at random.

2. Compute the median difference between pixels in those regions.

3. Add both regions to a graph (unless they are already there) and set the
edge value to that difference.

Trivial to visualize in your head, a nightmare to implement in most languages.
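
For what it's worth, here is a rough sketch of that algorithm in Python with
NumPy and networkx (the function name and graph representation are my own
choices, just to make the steps concrete):

    import numpy as np
    import networkx as nx

    def add_random_patch_edge(image, n, graph, rng=None):
        """Pick two random n*n patches of a greyscale image, compute the
        median pixel difference, and record it as an edge weight."""
        rng = rng or np.random.default_rng()
        h, w = image.shape
        # Top-left corners of the two patches, chosen uniformly at random.
        y0, y1 = rng.integers(0, h - n + 1, size=2)
        x0, x1 = rng.integers(0, w - n + 1, size=2)
        a = image[y0:y0 + n, x0:x0 + n].astype(float)
        b = image[y1:y1 + n, x1:x1 + n].astype(float)
        # Nodes are patch coordinates; add_edge creates missing nodes.
        graph.add_edge((int(y0), int(x0)), (int(y1), int(x1)),
                       weight=float(np.median(a - b)))
        return graph

    g = add_random_patch_edge(np.random.rand(64, 64), n=8, graph=nx.Graph())

Even then, none of this is expressible as plain syntax; you lean on two
libraries for what is conceptually one line of math.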

And what if you want to parallelize this to work on large amounts of data? I
have hopes that AMD's monster processors will make experimentation a little
bit more practical, since you wouldn't be in such a desperate need to push
everything onto the GPU.

------
Tarq0n
Isn't this the kind of application where Julia would shine?

If you're trying to develop something new or cutting edge it makes sense that
frameworks don't necessarily suit you. Frameworks try to make common actions
easy (provide a 'happy path'), very much the opposite of developing something
new.

Something like Julia which offers useful primitives, but can also let you
compile general purpose code for CUDA seems like a better fit, rather than
expecting frameworks to cater to you.

~~~
cs702
The authors of the paper discuss how Julia fares with capsule networks in
section 4.1:[a]

> There are frameworks, such as Julia, which nominally use the same language
> to represent both the graph of operators and their implementations, but
> back-end designs can diminish the effectiveness of such a front end. In
> Julia, while 2D convolution is provided as a native Julia library, there is
> an overloaded conv2d function for GPU inputs which calls NVidia’s cuDNN
> kernel. Bypassing this custom implementation in favor of the generic code
> essentially hits a “not implemented” case and falls back to a path that is
> many orders of magnitude slower.

[a] The paper uses capsule networks as its main example. See my earlier
comment for context and a link to the paper:
[https://news.ycombinator.com/item?id=20304576](https://news.ycombinator.com/item?id=20304576)

~~~
ChrisRackauckas
Julia makes this substantially easier, but yes in many cases you might need to
come up with a cache-optimized tensor operation kernel. The Julia Lab is
building some tooling for the automatic construction of this stuff which is
compatible with the AD frameworks, kind of like Halide, but in the Julia
language and compatible with all of its generics. The real key though is that
packages play nicely in the Julia sphere, so if someone writes good Julia
code for the operation and puts a package up, then you can take that package
and use it in Flux even if they did not write it for ML purposes.
There's a lot of quantum particle folks writing esoteric tensor operations in
Julia that I personally have been pulling stuff from.
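
To give a flavor of what "cache-optimized" means here, a toy illustration
(mine, and far simpler than what Halide-style tooling generates): a tiled
matmul keeps each block cache-resident while it is reused.

    import numpy as np

    def tiled_matmul(A, B, tile=64):
        n, k = A.shape
        _, m = B.shape
        C = np.zeros((n, m))
        for i in range(0, n, tile):
            for j in range(0, m, tile):
                for p in range(0, k, tile):
                    # Each block is reused many times before eviction.
                    C[i:i + tile, j:j + tile] += (
                        A[i:i + tile, p:p + tile] @ B[p:p + tile, j:j + tile]
                    )
        return C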

~~~
cs702
> ...in many cases you might need to come up with a cache-optimized tensor
> operation kernel...

Yes, exactly. That's one of the issues raised by the authors of the paper.

They note that, in practice, most AI researchers will not do that. Researchers
iterate and test code very quickly and cannot invest the time/effort necessary
(a) to figure out how to write a cache-optimized kernel every time they might
need one, or even (b) to wait for an automated kernel-writing tool to finish
searching for and compiling an optimized kernel (which in practice often ends
up being slower than manipulating the data in inefficient ways in order to use
existing, inflexible, prebuilt kernels).

~~~
ChrisRackauckas
>They note that in practice, most AI researchers will not do that.

I think the main issue there is that these kernel compilers are not well-
integrated into the ML libraries. If a kernel compiler were a standard part of
the library, well-documented, easy to build (usually the difficult part is
compiling someone's custom compiler...), and if the compiled results could be
"stored" for future ML models to just use, I think it would see more adoption.
As of now, tensor
compilers for ML frameworks are more of a (good but shaky) research tool. I
think a Julia approach of "do it on Julia code" (even the compiler, so it's
easy to just ]add the package) could definitely break down some of these
barriers to getting it done in a way that garners widespread adoption, but a
lot of work will need to be done in order to get a good enough tensor compiler
for people to care. If anything, this is a fun space with many opportunities.

~~~
cs702
Clearly, that would help. And clearly, there are opportunities for improvement
-- as much as two orders of magnitude (!) in the case of capsule networks.

That said, I think the authors are right about the challenge here: "...current
frameworks excel at workloads where it makes sense to manually tune the small
set of computations used by a particular model or family of models.
Unfortunately, frameworks become poorly suited to research, because there is a
performance cliff when experimenting with computations that haven’t previously
been identified as important. While a few hours of search may be acceptable
before production deployment, it is unrealistic to expect researchers to put
up with such compilation times (recall this is just one kernel in what may be
a large overall computation); and even if optimized kernels were routinely
cached locally, it would be a major barrier to disseminating research if
anyone who downloaded a model’s source code had to spend hours or days
compiling it for their hardware before being able to experiment with it."

EDIT: removed third paragraph.

------
IshKebab
I definitely agree that if you stray outside the "traditional" CNN, LSTM, etc.
it is very hard to implement custom layers in most frameworks. Especially if
they are recurrent. I tried to implement a custom RNN in TensorFlow and gave
up. All of the documentation is "just call tf.Lstm()" or whatever.

The exception is CNTK. That makes it really really easy to describe any
network including recurrent ones. Sadly it doesn't seem to have caught on at
all.

~~~
orbifold
I agree that it is super painful and confusing to implement RNNs in TensorFlow
and had a similar experience starting out. You should take a look at
[https://www.tensorflow.org/api_docs/python/tf/scan](https://www.tensorflow.org/api_docs/python/tf/scan),
which is one of the easier ways of implementing one, I think. Other than that,
PyTorch is also great for experimentation and even has a nice example of
implementing a variant of an LSTM:
[https://pytorch.org/tutorials/advanced/cpp_extension.html](https://pytorch.org/tutorials/advanced/cpp_extension.html)
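
In case it helps, here is a minimal sketch of the tf.scan approach (TF 1.x
graph style; the shapes and variable names are mine, not from the linked
docs):

    import tensorflow as tf

    n_in, n_hidden = 16, 32
    inputs = tf.placeholder(tf.float32, [None, None, n_in])  # [batch, time, feat]
    W_x = tf.Variable(tf.random_normal([n_in, n_hidden]))
    W_h = tf.Variable(tf.random_normal([n_hidden, n_hidden]))
    b = tf.Variable(tf.zeros([n_hidden]))

    def step(h_prev, x_t):
        # One recurrence: h_t = tanh(x_t W_x + h_{t-1} W_h + b)
        return tf.tanh(tf.matmul(x_t, W_x) + tf.matmul(h_prev, W_h) + b)

    # tf.scan iterates over the leading axis, so make the input time-major.
    x_tm = tf.transpose(inputs, [1, 0, 2])        # [time, batch, features]
    h0 = tf.zeros([tf.shape(inputs)[0], n_hidden])
    states = tf.scan(step, x_tm, initializer=h0)  # [time, batch, hidden]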

Contributing to this problem is that there are a ton of low-quality blog posts
on all of these topics.

------
DEADBEEFC0FFEE
Not a rut - a local minimum, surely.

~~~
beobab
What's the difference?

~~~
stevesimmons
It's an ML joke...

------
Arbalest
Is the fix here really just "let's build more infrastructure"? If so, who is
going to fund it? AI/ML is already mostly a marketing term at this stage; it's
all just people who want results now, now, now. Maybe this is the reason that
AI winters have occurred: companies expect all this stuff to be plug and play,
but it really isn't. You've got to put in the hard yards, and that means the
dollars.

~~~
AstralStorm
Were it just dollars. Actual breakthroughs are not exactly predictable.

~~~
Arbalest
Fair enough. I was going to say time too, but money is time as well - it just
means you have to multiply it a bit more.

------
eitland
It's worse [0]. I have some fantastic recommendations from Google Now and
recently from Amazon as well :-/

Here is one brilliant example of something that must be a machine learning
thing that has gotten a little too much freedom:
[http://erik.itland.no/more-fun-with-google-mixing-images-fro...](http://erik.itland.no/more-fun-with-google-mixing-images-from-different-sources)

If you are lucky you can still see them for yourself by searching Google for
mat-table. (It is a component of Angular Material, which the AI/ML thing sort
of recognizes, but not without fuzzing it to include an actual table with a
mat on it as well :-)

Edit: meanwhile, for the first time in years, Google isn't pushing some
_dumb_ dating site front and center but rather text ads for some things that
_could_ be relevant for a software engineer with a lovely wife and small kids:
Mule integration, robotics stuff, and holiday suggestions. Congrats to whoever
managed to convince their boss (or the AI) to try some other options. I'd also
be delighted if next year I got a well-placed ad for a couple of the
conferences I missed out on this year.

Edit 2: Finally got around to posting a couple of screens showing off Amazon's
understanding of Law books, Engineering books, and Humor and Entertainment:
[https://erik.itland.no/fun-with-amazons-ai-machine-learning](https://erik.itland.no/fun-with-amazons-ai-machine-learning)

[0]: since AI is now making things actively worse.

------
PaulHoule
It is not just academia that is stuck with systems that are fast at some
things and slow at others.

Branching, pointer indirection, and variable length data structures all kill
performance. That is where the serialization tax comes from. Arrays of numbers
can be loaded into RAM, even memory mapped, and be used right away.
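
A tiny NumPy illustration of that last point (the file name is made up): a
flat array on disk can be memory-mapped and used immediately, with no parsing
or copying step.

    import numpy as np

    np.save("weights.npy", np.arange(1_000_000, dtype=np.float32))
    # No deserialization: the OS pages the data in on demand.
    arr = np.load("weights.npy", mmap_mode="r")
    print(arr[123456])  # touches only the pages it needs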

Commercial products are very much limited by what we know how to make fast.

------
platz
I think languages like
[http://unisonweb.org/posts/](http://unisonweb.org/posts/) are a bit early,
but in a couple of decades there could be a strong institutional push toward
languages with features like theirs.

------
ilaksh
This does seem to be a big problem. Beyond frameworks not being amenable to
alternative approaches, people simply don't know that non-mainstream
approaches exist. And if they do find out about something different and try to
use it, good luck.

It's not impossible though. For example, I have been thinking about learning
OpenCL. And there are other GPGPU approaches like CUDA and a lot of less
popular stuff.

------
tim_sw
Probably because of TensorFlow; it's optimized for deep learning and
encourages certain kinds of models.

------
angel_j
I have it all figured out, not gonna lie.

