
LLVM merges machine function splitter for reduction in TLB misses - lelf
https://lists.llvm.org/pipermail/llvm-dev/2020-August/144012.html
======
account42
In their comparisons they explain that the existing Hot Cold Splitting pass creates
new functions for the cold parts, which brings overhead due to calling conventions
and exception handling. Their new method avoids that by using simple jumps.

I wish compilers were better at optimizing function boundaries for functions that
are visible but not inlined, so they could optimize those function call overheads
away automatically. There is no reason for a compiler to limit itself to the system
ABI for completely internal functions.

Besides leaving values in the registers they are already in instead of moving
them to where the ABI says they should be, this could enable passing C++
objects such as std::unique_ptr in registers where the current ABI forbids
it. Or it could eliminate no-op destructors for objects that are always moved
from in the called function.
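
To illustrate the ABI constraint (a rough sketch of my own, not from the comment): under the Itanium C++ ABI, a type with a non-trivial destructor such as std::unique_ptr is passed by invisible reference, i.e. through a stack slot, even though it holds nothing but a single pointer.

    #include <memory>

    // Hypothetical example: 'consume' takes ownership by value. Because
    // std::unique_ptr has a non-trivial destructor, the platform ABI forces
    // it to be passed in memory rather than in a register.
    static int consume(std::unique_ptr<int> p) {
        return *p;
    }

    int caller() {
        // For a purely internal function, a compiler not bound to the system
        // ABI could keep the raw pointer in a register and skip the stack
        // traffic entirely.
        return consume(std::make_unique<int>(42));
    }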

In general, I think compilers should see function boundaries in the source
code as a mere hint when it comes to code generation just like they already do
with the register keyword.

Is there any existing compiler work to optimize function calls where the
implementation and all uses are visible in the translation unit?

~~~
chriswarbo
> There is no reason for a compiler to limit itself to the system ABI for
> completely internal functions.

As far as I understand it, this limitation is also the only thing preventing
tail-call elimination in C/C++. Tail-recursive functions are essentially the
same as 'while' loops (execute some instructions then conditionally go back to
the start); they just have different scope and calling conventions. In
particular, 'while' loops are always inline and can't be referenced from other
functions, but tail-recursive functions can. Hence compilers can apply more
aggressive optimisations to 'while' loops, e.g. re-using the same stack frame
and iterating with a conditional jump.

However, as you say, functions-as-subroutines don't necessarily need to
coincide with functions-as-interface. Tail-recursive functions which are only
used internally can also be optimised to re-use the same stack frame and
iterate with a conditional jump. If such functions are _also_ exposed in the
interface, it's pretty trivial to expose a dummy entry point which invokes the
optimised version (if we want to); although cross-unit tail calls wouldn't be
eliminated in that case.

Of course, this is more useful than just writing 'while' loops in a different
style (although I personally prefer recursion
[https://news.ycombinator.com/item?id=11150346](https://news.ycombinator.com/item?id=11150346)
), since tail calls don't have to be recursive: we can split our code up into
modular, composable, self-contained functions without having to worry about
stack usage (or code bloat caused by inlining into multiple locations).
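
As a concrete sketch (mine, not from the article; mainstream compilers do this kind of thing at -O2, though the exact conditions vary): an internal tail-recursive helper can be compiled down to a loop, with a thin exported wrapper as the interface entry point.

    // Internal tail-recursive helper: the recursive call is the last thing
    // the function does, so the compiler can reuse the stack frame and turn
    // the call into a conditional jump back to the top, i.e. a loop.
    static long sum_to(long n, long acc) {
        if (n == 0)
            return acc;
        return sum_to(n - 1, acc + n);   // tail call
    }

    // Dummy entry point exposed in the interface; it just forwards to the
    // optimised internal version.
    long sum(long n) {
        return sum_to(n, 0);
    }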

~~~
omginternets
>As far as I understand it, this limitation is also the only thing preventing
tail-call elimination in C/C++.

At the risk of introducing a tangent, can someone explain how compilers
_detect_ tail recursion? This is still a complete mystery to me.

~~~
tom_mellior
What part of the detection is giving you trouble? Detecting whether a call is
recursive, or whether it is a tail call that can be optimized? The first part
is relatively easy, you typically have some sort of data structure
representing calls and their targets, and you know what function you're
currently compiling. If the function you're compiling is the same as a call's
target, you have recursion.

LLVM's code is at
[https://llvm.org/doxygen/TailRecursionElimination_8cpp_sourc...](https://llvm.org/doxygen/TailRecursionElimination_8cpp_source.html)

The core of this detection is in findTRECandidate. It iterates backwards over
a basic block, looking for call instructions whose call target is the current
function F:

    
    
        CallInst *CI = nullptr;
        BasicBlock::iterator BBI(TI);
        while (true) {
          CI = dyn_cast<CallInst>(BBI);
          if (CI && CI->getCalledFunction() == &F)
            break;
      
          if (BBI == BB->begin())
            return nullptr;          // Didn't find a potential tail call.
          --BBI;
        }
    

The harder part is detecting whether calls are in fact tail calls that you can
replace by jumps. There are all sorts of special cases, like whether alloca()
might be called by the function.

~~~
omginternets
Sorry, I should have been more precise: I was referring to detecting that a
function is _tail_ recursive.

(Yours is still a helpful explanation, though -- thanks!)

~~~
maximilianburke
I can't say how LLVM does it, but in a compiler I built I detected tail
recursion by scanning the IR: if the value of a recursive call (i.e. a call to
self) was immediately returned, it was considered a candidate for TCE. If the
value was consumed by the parent first, it wasn't.
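
A small illustration of that distinction (my sketch, not the compiler mentioned above):

    // Tail-recursive: the value of the self-call is returned immediately,
    // so the call can become a jump that reuses the current frame.
    static long fact_acc(long n, long acc) {
        if (n <= 1)
            return acc;
        return fact_acc(n - 1, acc * n);   // candidate for TCE
    }

    // Not tail-recursive: the result is consumed by the multiply after the
    // call returns, so the caller's frame has to stay live.
    static long fact(long n) {
        if (n <= 1)
            return 1;
        return n * fact(n - 1);            // not a tail call
    }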

------
jeffbee
The post on llvm-dev is ever so much more interesting.

[https://lists.llvm.org/pipermail/llvm-dev/2020-August/144012...](https://lists.llvm.org/pipermail/llvm-dev/2020-August/144012.html)

~~~
snehasish
The llvm-dev google group retains the formatting of the original email and is
slightly easier to read though it requires a gmail login.

[https://groups.google.com/g/llvm-dev/c/RUegaMg-iqc/m/VFyV9cX...](https://groups.google.com/g/llvm-dev/c/RUegaMg-iqc/m/VFyV9cX4BQAJ)

~~~
moonchild
> requires a gmail login

[https://archive.is/pqLgW](https://archive.is/pqLgW)

~~~
nielsbot
This link + Safari Reader Mode is good

------
MaxBarraclough
> Google engineers found a 2.33% runtime improvement with a ~32% reduction in
> iTLB and sTLB misses

Not bad. Presumably this will benefit every compiler that uses LLVM, not just
Clang.

~~~
monocasa
Or at least any that supports gathering and utilizing profiling information.
Last time I checked, that didn't include Rust yet for instance, but my
information could be out of date there.

~~~
kibwen
rustc does support PGO, but I don't know how mature the support is.
[https://doc.rust-lang.org/rustc/profile-guided-optimization....](https://doc.rust-lang.org/rustc/profile-guided-optimization.html)

~~~
rmdashrfstar
Can I create profile data by running the instrumented binaries through test
cases? Perhaps integration test cases which represent typical end-to-end
functionality of my application.

~~~
fluffy87
Unit tests are a bad fit; you need to run a test that is representative of an
actual workload.

A unit test that exercises code that’s cold in practice would make it look like
hot code, which would be counterproductive.

------
saagarjha
Note that LLVM has had a function outliner for a while; you may have seen its
output already if you've ever looked inside a binary on macOS and noticed
*.cold.* functions, which contain the unlikely-to-execute code cut out of the
functions they came from. This appears to be an improvement on that effort.

~~~
quotemstr
Right. The improvement that this work brings is that it performs the function
split very late and at a low level.

Basically, while the previous outliner split a function into two functions
(the hot one literally calling the cold one as needed) this new thing takes a
single function and splits it into two parts connected to each other by jumps.
The cold part of the function isn't really a function --- it's just a group of
basic blocks that happen to be located far away from the other group of basic
blocks.

By avoiding the call into the cold function and the return to the hot
function, the generated code can be smaller and more register-efficient.
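
A rough sketch of the kind of code this helps (my example, not from the patch):

    // Hypothetical function: profile data shows x is essentially never
    // negative, so the branch body is cold.
    int process(int x) {
        if (x < 0) {
            // Cold basic blocks: the machine function splitter can place
            // them in a separate "process.cold" chunk far from the hot
            // code, reached by a plain jump within the same function
            // rather than a call to an outlined function.
            x = -x;
            x += 1;
        }
        return x * 2;   // the hot path stays small and dense
    }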

~~~
loeg
Might be confusing to debuggers if the address-space range of a single
function is discontiguous. Does the cold portion get an independent symbol
with a derived name, e.g. like "Blocks"?

~~~
quotemstr
Yes --- the collection of cold basic blocks gets named "<origfunc>.cold". But
it's nevertheless not really an independent function from a code POV.

------
ladberg
FYI, this requires profile data to do the splitting. Yet another reason to do
PGO!

~~~
repsilat
For many use-cases FDO ("feedback-driven") can be more convenient and
sometimes more effective than PGO ("profile-driven").

The difference is sampling prod vs sampling separately in test. The arguments
for FDO:

- Prod behaviour/data is always "representative", whereas synthetic or
recorded data can go out of date quickly.

- PGO test fixtures can contain sensitive user data. Instrumenting production processes doesn't put data in more places.

The benefits of both are huge though. The rule of thumb I've seen is a 20%
improvement for FDO over -O3.

~~~
throwaway17_17
I’m interested in reading some more on this concept. The Wikipedia article,
which is usually my first stop for a broad overview and maybe some linked
research, seems to suggest that profile-directed/guided and feedback-guided
optimization are the same thing. Is there anywhere I can read about the
varying approaches?

~~~
jeffbee
They are the same thing. The GP makes a distinction which does not exist.
Witness the fact that two different compilers refer to the exact same feedback
technique as either AutoFDO or SamplePGO.

~~~
repsilat
Ah, I definitely worked in a place where that distinction in terminology was
used, but maybe it isn't widespread. In any case, whatever you call either one
of them, sampling from production processes can have benefits over sampling
from synthetic workloads.

~~~
justinclift
Surely that's a given?

It's literally sampling from a representative workload (production) vs a non-
representative one (anything synthetic).

~~~
repsilat
Maybe so. I think it was a novel idea to me because my intuition around
profiling was formed with an assumption of delivering binaries to users, and
not running server processes. (It's probably also a pre-internet bias of
thinking it would be hard to _get_ prod data, whereas profiling data generated
in a horrible enormous compile&run&profile&recompile process at least doesn't
need to "go" anywhere.)

------
emerged
One issue with PGO is you are optimizing for the particular subset of use
cases which are profiled, on the exact machine you profile on. The 2% quoted
improvement will not necessarily generalize across all use cases for the
software and all machines it runs on. In fact, you may often be taking that 2%
or more away from other use cases.

~~~
drivebycomment
> One issue with PGO is you are optimizing for the particular subset of use
> cases which are profiled, on the exact machine you profile on.

The exact machine part is not true. There's nothing about this particular
optimization that's machine specific - e.g. as the original post explains,
this optimization gives performance boost on Intel and AMD, on Intel due to
reduction in iTLB misses, and on AMD due to reduction in L1 and L2 icache
misses. i.e. this kind of "working-set" reduction translates to any platform.

> In fact, you may often be taking that 2% or more away from other use cases.

In general, it is correct that profile-guided optimization can theoretically
reduce performance, as some aggressive optimizations are only done with a
profile because of the inherent trade-offs they involve (e.g. aggressive
inlining, which can be detrimental to performance if the actually-hot
functions turn out to be entirely different).

However, empirically this is not true in most cases, unless you picked really
bad training input, _and_ your code has extremely different behavior under
different input. Moreover, nowadays with sampled profile, which you can
collect from the real, production runs, it's extremely unlikely for this to
happen.

------
derangedHorse
It always surprises me how many people are familiar with LLVM internals. Just
looking at the 60 unique commenters at the time of posting this is blowing my
mind.

------
IvyMike
> We find that most functions 5 KiB or larger have inlined children more than
> 10 layers deep

Ok, I would not have expected that.

~~~
jackpirate
I wouldn't have expected any functions to be 5KiB or larger. That seems like a
lot of assembly to me, and so it makes sense to me that these functions would
be "accidentally" generated by inlining.

I'm curious what percent of functions are this large?

~~~
mac01021
I've seen plenty of c/c++/java functions that exceeded 200 lines of high level
code. It seems reasonable to me that such a function would be 5k bytes of
machine code, especially on a 64-bit architecture.

Not that such functions are the right way to structure your program...

~~~
account42
> especially on a 64-bit architecture

Pointer size has little impact on code size. amd64 code can even be smaller
than x86 in some cases.

------
Ericson2314
This + multi-return (pushing multiple code addresses in the stack frame)
together mean, I think, that we can just use Result/Either and never unwind,
and leave no performance behind.

~~~
zelly
It doesn't work in Rust, only Clang

~~~
Ericson2314
Why? Other commenters were talking about PGO and saying Rust did support it.

------
emerged
I take it this is a form of profiler guided optimization? Does MSVC already do
this during its PGO? It seems like an obvious thing to do.

~~~
neerajsi
MSVC does have this optimization. It has 3 sections: 'live', 'sick'
(referenced, but uncommon), 'dead' (unreferenced in the profile).

~~~
jeffbee
Yes, this is a pretty old technique. There were papers about hot/cold function
splits based on profile data as early as 1996.

~~~
caf
The technique itself isn't new in clang either. This is about a new
implementation of it, where the difference to the existing implementation is
that this happens later in the process (it's deferred to the machine-specific
code generation phase, whereas the existing implementation happens in the
middle-end and is target-agnostic).

~~~
vlovich123
The other major piece is that the hot-cold split is more efficient. Rather
than thunking out the cold code via a function call it just jumps to the basic
block, making it a more efficient approach (no register spilling and function
call overhead)

~~~
caf
The function call overhead itself is irrelevant, because by definition these
blocks are cold. The saving/restoring of callee-clobbered registers does
affect the code size of the hot function though, so that's important.

~~~
labawi
Blocks are only cold according to an imprecise measurement.

Decreasing the downside means you can apply the (hopefully beneficial)
optimisation more aggressively for more gain, so I would expect it to matter.

------
gosukiwi
LLVM is one impressive (and scary) piece of technology.

~~~
bonzini
FWIW this particular optimization has been in GCC for about 15 years, though
it was only enabled by default in GCC 8. It's not rocket science.

~~~
isatty
Idk, not learning compiler theory is one of my regrets from school and this
does seem like rocket science to me.

~~~
jjice
There are a ton of great books out there! If you're looking for a bit of the
whole shebang, check out Compilers: Principles, Techniques, and Tools (aka,
The Dragon Book). It's an older book, but a lot of what is described in there
is still very relevant and good info. If you want a more high level overview
at first, I'd recommend Crafting Interpreters [0] and Writing a
Compiler/Interpreter in Go [1]. The latter two focus more on lexing, parsing,
and intermediate code generation.

[0] [https://craftinginterpreters.com/](https://craftinginterpreters.com/) [1]
[https://compilerbook.com/](https://compilerbook.com/)

~~~
isatty
I know I should get to it - maybe I should've done that during the COVID
lockdown eh? Thanks for the links, I'll check them out.

------
vlovich123
2% better perf isn't anything to sneeze at but also not super drastic. Will be
curious to see if this unlocks more optimizations over time.

I'm also curious if there's been any work to organize code layout so that code
can be paged out efficiently to reduce memory pressure (so that you're using
all of the code in the pages you are bringing in) & to reduce paging thrash
(you're not evicting a code page only to bring it back for a few hundred
bytes).

~~~
hinkley
Given the greatly improved TLB hit rate, I sort of wonder if the test isn't
abusing the code enough. Does this change the inflection point where
throughput falls from trying to run too many tasks in parallel or
sequentially? 2% response time won't do much, but 10% higher peak load would
change quite a few priorities on a lot of backlogs.

~~~
vlovich123
It’s not too surprising to me. If TLB misses cause your program to run 6%
slower, wouldn’t you expect a 30% reduction in them to give a 2% overall
improvement? Why would it be more? If you’re surprised that TLB misses are not
so costly, consider that cache locality and CPU prediction largely hide the
cost of misses (cache locality means the miss rate is low to begin with,
prediction reduces the cost of a miss). Translation is also backed by a few
layers of caching (the TLB itself, then the page-table entries in the L1, L2,
and L3 caches, and finally RAM).

------
voldacar
So to use this new optimization pass you have to specify it explicitly with an
additional command line flag instead of it simply being turned on as an
ordinary part of PGO?

It seems like most people who just want to squeeze the last drop of speed out
of their code will probably never become aware of it then, unless they are
specifically seeking it out.

~~~
mafuy
It's an experimental feature. If it works well for a while, it may become part
of the default. But it should definitely not be forced on everyone if it's not
yet sure how well it works.

------
person_of_color
Google improving an Apple project. The high ground!

