
MLIR Primer: A Compiler Infrastructure for the End of Moore’s Law - sandGorgon
https://ai.google/research/pubs/pub48035
======
m0zg
Meh. I haven't been able to get any speedup out of XLA, and it wasn't for the
lack of trying. I will take TF seriously when it's at least as fast as
PyTorch, and frugal enough with GPU RAM to process batches of the same size as
PyTorch. Right now, PyTorch simply blows it out of the water, and it doesn't
have any advanced compiler backend or anything. Just lots of good old-fashioned
performance engineering elbow grease, nothing fancy.

~~~
svantana
Same here, I guess it's a complex function of hardware and what you're trying
to do. Personally I ended up writing my own autodiffing tensor library in C++,
because all existing solutions had abysmal performance on my problem (lots of
local updates in large tensors). The speedup is >50x compared to TF, pytorch,
julia, jax.
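
To give a flavor of the kind of op I mean (an illustrative sketch only, not
my actual code; the names are made up): a "local update" touches a handful of
entries of a huge tensor, and its adjoint only needs to read those same
entries, so neither pass ever walks the rest of the tensor or goes through
generic gather/scatter kernels.

```cpp
// Sketch of a specialized local-update op with a hand-written adjoint.
// Forward: T[offset + i] += alpha * u[i], touching only u.size() entries
// of a tensor that may have millions of elements.
#include <cstddef>
#include <vector>

void local_add_forward(std::vector<float>& T, std::size_t offset,
                       const std::vector<float>& u, float alpha) {
    for (std::size_t i = 0; i < u.size(); ++i)
        T[offset + i] += alpha * u[i];
}

// Backward: given dL/dT, accumulate dL/du and dL/dalpha. Only the touched
// slice of grad_T is read; dL/dT itself passes through unchanged because
// the update is additive.
void local_add_backward(const std::vector<float>& grad_T, std::size_t offset,
                        const std::vector<float>& u, float alpha,
                        std::vector<float>& grad_u, float& grad_alpha) {
    for (std::size_t i = 0; i < u.size(); ++i) {
        grad_u[i]  += alpha * grad_T[offset + i];  // dL/du_i
        grad_alpha += u[i]  * grad_T[offset + i];  // dL/dalpha
    }
}
```

A general framework has to express the same thing as index tensors plus
scatter ops, with a recorded graph node per update; when you do this millions
of times, that bookkeeping is where the time goes.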

~~~
thecleaner
> Personally I ended up writing my own autodiffing tensor library in C++

But this will not be a general library, right? You must have only included a
certain subset of the functions of TF or PyTorch or whatever. Autodiffing is
also included in certain proprietary libraries, like the ones from NAG. But I
doubt it's possible to achieve a 50x speedup without compromising on
functionality.

~~~
svantana
Of course it's a herculean task to write a library with that many features.
But I don't think that's the issue; it's more that the devs of TF can't
possibly optimize for every use case. For me, I knew what kind of ops I
needed, so I could focus on getting those as fast as possible.

------
jph00
For those interested in learning more about this, Chris Lattner (project lead
on MLIR and Swift for TensorFlow) and I co-taught two lessons covering these
topics. They're lessons 13 and 14 here:
[https://course.fast.ai/videos/?lesson=13](https://course.fast.ai/videos/?lesson=13)

------
mikewarot
Most of the transistors in a modern computation environment are just sitting
there waiting to be touched by an instruction (far, far more transistors in
RAM than in the CPUs, overall).... computing prematurely optimized on the
wrong architecture.... there's still a lot of room to grow, speed wise.

~~~
charlesdaniels
> Most of the transistors in a modern computation environment are just sitting
> there waiting to be touched by an instruction

True, but it's important to note that if all the transistors in a modern CPU
switched at once, it would quickly overheat. This is the "power wall" -- we
can squish more transistors into one die than we can actually turn on at one
time due to electrical and thermal constraints.
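
Roughly, dynamic power scales as activity × C × V² × f, so with a fixed
package budget you can only afford to toggle a fraction of the chip each
cycle. A back-of-the-envelope sketch (every constant below is invented purely
for illustration):

```cpp
// Back-of-the-envelope "power wall" arithmetic. Every constant here is
// invented purely for illustration; the only real content is the scaling
// relation P_dyn ~ activity * C * V^2 * f.
#include <cstdio>

int main() {
    const double C      = 1.0e-7;  // switched capacitance if everything toggled (F)
    const double V      = 0.9;     // supply voltage (V)
    const double f      = 4.0e9;   // clock frequency (Hz)
    const double budget = 100.0;   // package power budget (W)

    const double p_all_on     = 1.0 * C * V * V * f;  // activity factor 1.0: ~324 W
    const double max_activity = budget / p_all_on;    // fraction that can toggle: ~0.31

    std::printf("all-toggling power: %.0f W, affordable activity: %.0f%%\n",
                p_all_on, 100.0 * max_activity);
    return 0;
}
```

With those invented numbers only about a third of the switching capacitance
can toggle each cycle; the rest has to stay dark, which is exactly the
squish-more-than-you-can-power situation.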

> far, far more transistors in RAM than in the CPUs, overall

Also true, and this is an active area of research. Many people have tried
various approaches to performing computations using DDR and other memory
technologies. In the past, people were trying to use DDR to run automata.
These days there seems to be a lot of focus on processor-in-memory
technologies; it turns out memristors can actually be used for computation,
effectively turning the entire memory array into a hugely wide SIMD RISC
processor (a toy sketch of that column-parallel style follows the paper list
below). Here is some recent work presented on this subject:

Real Processing-in-Memory with Memristive Memory Processing Unit (mMPU) -
[https://ieeexplore.ieee.org/document/8825114](https://ieeexplore.ieee.org/document/8825114)

PPAC: A Versatile In-Memory Accelerator for Matrix-Vector-Product-Like
Operations -
[https://ieeexplore.ieee.org/document/8825013](https://ieeexplore.ieee.org/document/8825013)

Parallel Stateful Logic in RRAM: Theoretical Analysis and Arithmetic Design -
[https://ieeexplore.ieee.org/document/8825150](https://ieeexplore.ieee.org/document/8825150)
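
To make the "memory array as a hugely wide SIMD processor" idea concrete, here
is a toy software simulation of the bit-serial, column-parallel style of
arithmetic many of these proposals use (it is not taken from any of the papers
above): operands are stored transposed so that each column of the array is one
lane, and one row-wide logic step advances every lane by one bit.

```cpp
// Toy simulation of bit-serial, column-parallel addition. A uint64_t stands
// in for a 64-column row; a real memristive array would be thousands of
// columns wide and do the row-wide logic in place.
#include <cstdint>
#include <cstdio>

constexpr int BITS = 8;  // lane width: 8-bit unsigned lanes

// a[b] holds bit b of every lane ("bit-plane" layout); addition needs only
// BITS row-wide steps no matter how many lanes (columns) there are.
void add_lanes(const uint64_t a[BITS], const uint64_t b[BITS], uint64_t sum[BITS]) {
    uint64_t carry = 0;
    for (int bit = 0; bit < BITS; ++bit) {
        const uint64_t axb = a[bit] ^ b[bit];
        sum[bit] = axb ^ carry;                        // row-wide XOR
        carry    = (a[bit] & b[bit]) | (carry & axb);  // row-wide carry update
    }
}

int main() {
    uint64_t a[BITS] = {}, b[BITS] = {}, s[BITS] = {};
    const unsigned x = 57, y = 180;  // operands for lane 0; other lanes stay zero
    for (int bit = 0; bit < BITS; ++bit) {
        a[bit] = (x >> bit) & 1u;
        b[bit] = (y >> bit) & 1u;
    }
    add_lanes(a, b, s);
    unsigned result = 0;
    for (int bit = 0; bit < BITS; ++bit)
        result |= static_cast<unsigned>(s[bit] & 1u) << bit;
    std::printf("lane 0: %u + %u = %u (mod 256)\n", x, y, result);  // 57 + 180 = 237
}
```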

> there's still a lot of room to grow, speed wise.

Yes and no. Single-threaded performance is close to tapping out. Production
processes can only shrink so far before physics starts getting in the way.
Pipelines and speculation can only get so deep (and going deeper broadens the
attack surface for security vulnerabilities). Performance growth for massively
parallel workloads
is continuing along at a healthy clip, and will probably continue to do so for
quite some time. Of course the trouble is that end-user desktop software is
generally not massively parallel.
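
That last point is essentially Amdahl's law: once a fixed slice of the work is
serial, piling on cores stops paying off. A quick sketch (the 75% figure is an
arbitrary illustration):

```cpp
// Amdahl's law: speedup(n) = 1 / ((1 - p) + p / n), where p is the fraction
// of the work that parallelizes.
#include <cstdio>

double amdahl(double p, int cores) {
    return 1.0 / ((1.0 - p) + p / cores);
}

int main() {
    const double p = 0.75;  // 75% parallel, 25% stubbornly serial
    const int core_counts[] = {1, 4, 16, 64, 1024};
    for (int cores : core_counts)
        std::printf("%5d cores -> %.2fx speedup (ceiling: %.0fx)\n",
                    cores, amdahl(p, cores), 1.0 / (1.0 - p));
    return 0;
}
```

Even with a thousand cores, the 25% serial slice caps the whole program at 4x.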

~~~
namibj
Actually, we can significantly boost single-threaded raw speed; it just
doesn't help with the memory wall, because the approach is based on MOS
current-mode logic (MCML). We can build 20 GHz CPUs now with passive cooling,
but they don't beat cutting-edge CMOS cores on memory-hard single-threaded
workloads. They do reach 2-cycle add and roughly 3-cycle mul latency, though.
I hope someone just plops down a RISC-V core with that kind of design, paired
with explicit preloading into a tiny cache that gets 2- or 3-cycle load
latency into registers. I'm sure some computations could work well on that
sort of very fast, shallow-pipeline core, which suits highly sequential stuff
like SAT/SMT solvers and other inherently divide-and-conquer algorithms.

~~~
charlesdaniels
> We can't do this for the memory wall however

You're changing the effective throughput whether you do it by upping the
clock rate or deepening the pipeline. Using half the data per cycle at twice
the clock rate will cause the same memory pressure.
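
Put numerically (illustrative numbers only): the memory system cares about
bytes per second, not bytes per cycle, so halving per-cycle demand while
doubling the clock is a wash.

```cpp
// Illustrative only: bandwidth demand = bytes per cycle * cycles per second.
#include <cstdio>

int main() {
    const double bytes_per_cycle_a = 16.0, clock_a = 4.0e9;  // wider core at 4 GHz
    const double bytes_per_cycle_b = 8.0,  clock_b = 8.0e9;  // narrower core at 8 GHz
    std::printf("A: %.0f GB/s   B: %.0f GB/s\n",
                bytes_per_cycle_a * clock_a / 1.0e9,   // 64 GB/s
                bytes_per_cycle_b * clock_b / 1.0e9);  // 64 GB/s
    return 0;
}
```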

> which you can use with like explicit preloading into a tiny cache

That will kill it. As soon as you put it on the compiler designers or
programmers to do something special to realize performance benefits, you're
going to lose to architectures that don't.

Sure, compiler writers and programmers will optimize for your architecture...
if it's popular and widely used. So you have a chicken and egg problem where
you need to get adoption in the first place by running existing workloads
faster.

> We can build 20 GHz CPUs now with passive cooling,

Citation? Like for real, that's cool and I'd like to read about it!

------
lamchob
Albert Cohen, one of the developers, talks about the project at the PLI Summer
School 2019:
[https://youtu.be/3TNT5rFVTUY?t=2580](https://youtu.be/3TNT5rFVTUY?t=2580)

------
taliesinb
See the related HN discussion from earlier today about differentiable
programming with Swift for TensorFlow and its corresponding changes to the
Swift language:
[https://news.ycombinator.com/item?id=20890149](https://news.ycombinator.com/item?id=20890149)

------
sdan
I think this has some great potential... it's just that, as with most things,
it needs to mature before mass adoption.

------
baybal2
I was not able to mentally digest the abstract. What is it actually about?

------
sandGorgon
I feel a pang that the initial targets for this will be elite hardware - both
iOS and Mac.

All the work that Google did around NBU (Next Billion Users), accessibility,
and Android is useless under this initiative.

They should not have chosen Swift. Kotlin Native would have worked equally
well, with day-zero targeting of Android:

[https://resources.jetbrains.com/storage/products/kotlinconf2...](https://resources.jetbrains.com/storage/products/kotlinconf2018/slides/2_kotlin-native-snake.pdf)

~~~
pjmlp
Julia would have been the proper answer given the target community.

Naturally, with Lattner on board, it is Swift all the way, even though its
support is a WIP on Linux and nonexistent on Windows.

As for Kotlin/Native, it is very immature, with memory semantics that are
incompatible with the other Kotlin variants, and to be honest I don't see
Kotlin ever taking off outside Android.

