
Llvm-mca – LLVM Machine Code Analyzer - nkurz
https://llvm.org/docs/CommandGuide/llvm-mca.html
======
debatem1
This is supported today in compiler explorer
([https://godbolt.org](https://godbolt.org), if you haven't played with it).
Just choose a recent clang compiler and hit "add tool"->"LLVM mca".

~~~
davidtgoldblatt
Example usage on some code with interesting pipelining + resource contention
properties: [https://godbolt.org/z/11oyav](https://godbolt.org/z/11oyav)

(The code is from the stream vbyte repo; see
[https://lemire.me/blog/2017/09/27/stream-vbyte-breaking-new-speed-records-for-integer-compression/](https://lemire.me/blog/2017/09/27/stream-vbyte-breaking-new-speed-records-for-integer-compression/)).

Edit: one interesting fact: the first iteration takes 35 cycles to finish, but the reciprocal throughput of the loop (assuming it executes a few times) is 5.3 cycles.
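A rough way to read those two numbers together (my own back-of-the-envelope sketch, not llvm-mca output): the 35-cycle first iteration is the cost of filling the pipeline, after which overlapping iterations retire every ~5.3 cycles, so the average per-iteration cost amortizes toward the reciprocal throughput:

```python
# Back-of-the-envelope: first iteration fills the pipeline (~35 cycles),
# then overlapping iterations retire one every ~5.3 cycles on average.
# Both constants are the figures quoted above, not measured here.
FIRST_ITER, RECIP_THROUGHPUT = 35.0, 5.3

def avg_cycles_per_iter(n_iters):
    total = FIRST_ITER + RECIP_THROUGHPUT * (n_iters - 1)
    return total / n_iters

print(round(avg_cycles_per_iter(1), 1))    # 35.0
print(round(avg_cycles_per_iter(100), 1))  # 5.6 -- approaching 5.3
```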

------
d99kris
On a related note I just found that Intel has EOL'ed their IACA recently:

 _April 2019: Intel® Architecture Code Analyzer has reached its End Of Life.
Users may want to try LLVM-MCA. This is NOT a recommendation to use LLVM-MCA
nor a comment on its accuracy or usefulness. Thanks for being faithful users
of Intel Architecture Code Analyzer throughout the years. We hope it was
useful for you._

From: [https://software.intel.com/en-us/articles/intel-architecture-code-analyzer](https://software.intel.com/en-us/articles/intel-architecture-code-analyzer)

------
benrbray
This is really cool! I wanted to point out this wonderful article also posted
to HN today that shows how LLVM MCA can help guide low level optimizations:

Article: [https://pdziepak.github.io/2019/05/02/on-lists-cache-algorithms-and-microarchitecture/](https://pdziepak.github.io/2019/05/02/on-lists-cache-algorithms-and-microarchitecture/)

HN:
[https://news.ycombinator.com/item?id=19810618](https://news.ycombinator.com/item?id=19810618)

------
mooreed
I want to really love this post. But I am not smart enough to know why it’s
awesome. Anyone care to give a digest for mere mortals?

~~~
tux3
Oh it really is an awesome low-level tool, but it's not as complicated as it
sounds! Here's my attempt at explaining it, hope this helps :)

The gist of it is that your CPU loves to multitask (because that's so much faster), and if you care about performance you want to maximize how many things it's doing at once. You want every part of the CPU to have work on its schedule all the time so it doesn't sit idle. This tool shows you what the schedule for your code looks like so you can optimize it.

--

In more detail, this tool is going to read your code instruction by
instruction, compute what kind of schedule the hardware will be able to make
for executing each instruction, and tell you which circuits might be
overworked or going unused. (Caveat: the CPU makes its own schedule on the fly
as best as it can — this tool is just an approximation in software — but your
compiler's best guess is still a pretty good guess in general!)

For example (actual numbers! [0]) your CPU could have 4 "ports" (circuits)
that can start doing math on integers (with their ALUs) at any instant in
time, or 2 of those "ports" could also multiply floating-point numbers, but
each circuit can only be given one kind of task at a time to keep things
manageable. Well, it turns out making a good schedule is a surprisingly hard
problem, since if you send int work to the first two ports and you didn't
foresee there'd be float work coming after, the first two ports will have a
full schedule while the other two will be doing nothing. A better schedule
could have had all four of them busy in parallel!
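That scheduling trap can be sketched in a few lines of code. This is my own toy simulation, not real llvm-mca output: the port layout (four ports for ints, of which only two also handle floats) is a simplification of the wikichip numbers below, and the greedy scheduler is deliberately naive.

```python
# Toy model of the port-contention example above. Illustrative only:
# four ports each take one micro-op per cycle; all four handle "int"
# work, but only ports 0 and 1 handle "float" work.

def cycles_needed(uops, ports):
    """Greedy: each cycle, place pending uops (in issue order) on the
    first free port that supports them; the rest wait a cycle."""
    pending = list(uops)
    cycles = 0
    while pending:
        used = [False] * len(ports)
        still_waiting = []
        for kind in pending:
            for p, caps in enumerate(ports):
                if not used[p] and kind in caps:
                    used[p] = True
                    break
            else:  # no free port supports this uop this cycle
                still_waiting.append(kind)
        pending = still_waiting
        cycles += 1
    return cycles

PORTS = [{"int", "float"}, {"int", "float"}, {"int"}, {"int"}]

# Naive issue order: the ints grab ports 0 and 1, so the floats have
# nowhere to go and must wait a full extra cycle.
naive = cycles_needed(["int", "int", "float", "float"], PORTS)
# Better order: floats claim ports 0/1, ints take 2/3 -> done in one.
better = cycles_needed(["float", "float", "int", "int"], PORTS)
print(naive, better)  # 2 1
```

The real hardware scheduler is far smarter than this greedy loop, but the shape of the problem is the same: a locally fine assignment can starve a later instruction of the only ports able to run it.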

You don't directly have a say in the schedule — the hardware does its best —
but with LLVM-Mca (or Intel's IACA, which inspired it) you can write code that
you know will be easy to schedule in parallel, and that's already a pretty
awesome tool to have!

[0]:
[https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(client)#Scheduler](https://en.wikichip.org/wiki/intel/microarchitectures/skylake_\(client\)#Scheduler)

~~~
jcranmer
> You don't directly have a say in the schedule — the hardware does its best

That's not quite true. The order you present the instructions influences the
actual execution order greatly, and it's why instruction scheduling remains an
important part of compilers.

~~~
tux3
Yep, I wasn't sure how to word it (it's pretty hard to keep it short without
being too wrong!).

You're right that there are _many many_ things you can do to the code to
influence the scheduling — like the reordering the compiler does — and at the
end of the day that has a predictable impact on the scheduling. Don't get me
wrong, I love my compiler, and the fact that we can influence scheduling is
why LLVM-Mca is useful in the first place.

What I meant to write is that your x86 isn't some kind of mostly statically
scheduled VLIW. The behavior of the hardware is only partly predictable, and
even IACA has to make some tragic simplifications. Tweaking the alignment to
play with fetch boundaries has an effect, vectorizing obviously does, picking
a different mix of instructions can help, artificially loading a port to
prevent a bad scheduling decision down the line is not always entirely stupid,
etc...
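One of the transforms on that list — splitting one accumulator into several independent ones — can be illustrated with a toy latency model. This is my own sketch; the 3-cycle add latency is made up for illustration, and it assumes enough ports to run the independent chains in parallel:

```python
import math

# Toy critical-path model: each add takes ADD_LATENCY cycles and cannot
# start until both of its inputs are ready. The latency value is
# invented for illustration; real numbers come from a scheduling model.
ADD_LATENCY = 3

def serial_sum_depth(n):
    # acc = ((a0 + a1) + a2) + ... : n - 1 adds, each dependent on the
    # previous one, so the whole chain is the critical path.
    return (n - 1) * ADD_LATENCY

def split_sum_depth(n, k):
    # k independent accumulators run in parallel (assuming enough
    # ports), then a balanced tree of adds combines them at the end.
    per_chain = (n // k - 1) * ADD_LATENCY
    combine = math.ceil(math.log2(k)) * ADD_LATENCY
    return per_chain + combine

print(serial_sum_depth(16))    # 45 cycles on the critical path
print(split_sum_depth(16, 4))  # 15 cycles: 9 per chain + 6 to combine
```

The point isn't the exact numbers but that the same 15 adds, presented differently, give the hardware a much shorter dependency chain to wait on — exactly the kind of difference llvm-mca's timeline view makes visible.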

I feel it's important to keep in mind that the hardware scheduler keeps
dynamic statistics on port usage, so in a sense it's more like a JIT than a
compiler; static analysis is only an approximation. What little experience I
have has taught me it's always a good idea to compare IACA and LLVM-Mca's
predicted schedule with a real profiler's output :)

(Thanks for giving some nuance, it's appreciated. I'm not actually a compiler
engineer or doing low-level magic for a living, so if you see anything wrong I
would love to be corrected!)

~~~
souprock
Uh, do you want to be doing low-level magic for a living? You appear to be
capable of it. Here, my "Who is hiring?" comment from yesterday:
[https://news.ycombinator.com/item?id=19797601](https://news.ycombinator.com/item?id=19797601)

~~~
tux3
I would love to, but I'm an ocean away.

------
wyldfire
> The tool currently works for processors with an out-of-order backend, for
> which there is a scheduling model available in LLVM.

Which targets does that include?

~~~
jcranmer
x86(-64), ARM, and AArch64 for sure. It looks like there are scheduling
models for SystemZ, PowerPC, and Hexagon as well, but I'm not sure all of
those are out-of-order processors.

~~~
classichasclass
Most recent POWER CPUs are out-of-order, with the notable exception of the
POWER6.

------
tylerflick
This looks really cool, but unfortunately it has really limited ARM support.

