
Mid-stack inlining in the Go compiler - dcu
https://docs.google.com/presentation/d/1Wcblp3jpfeKwA0Y4FOmj63PW52M_qmNqlQkNaLj0P5o/edit#slide=id.p
======
throwawayish
In case someone else is wondering: What is being called "mid-stack inlining"
here is what is generally understood by the term "inlining".

~~~
CUViper
The presentation makes a distinction between mid-stack and leaf inlining, and
apparently it was only done on leaf calls before because this is less
confusing in backtraces.
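A minimal sketch of that distinction (the function names here are invented for illustration): `leaf` calls no other function, so it can only ever sit at the bottom of a call stack, while `mid` sits between its caller and `leaf`.

```go
package main

import "fmt"

// leaf calls no other function, so it sits at the bottom of any call
// stack; inlining it into a caller is "leaf inlining".
func leaf(x int) int {
	return x * x
}

// mid calls another function, so it appears in the middle of the stack
// main -> mid -> leaf; inlining mid into main is the "mid-stack
// inlining" the presentation describes, and it's what complicates
// backtraces (the mid frame no longer physically exists).
func mid(x int) int {
	return leaf(x) + 1
}

func main() {
	fmt.Println(mid(3))
}
```

Building with `-gcflags=-m` makes the compiler report which of these calls it actually decided to inline.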

~~~
Ericson2314
The point is, as usual, Go is trying to catch up to what everyone else has had
for years. The use of non-standard terminology here raises the suspicion that
the Go people are trying to hide the fact that they are playing catch-up.

~~~
enneff
Comments like these are so depressing. Someone does a bunch of work to improve
the Go compiler and writes a presentation to share their approach (primarily
so that other people working on Go can understand and extend it), and gives it
all away for free. This is textbook open source citizenry, which should be
applauded.

But instead, you come along and criticize them for being too specific with
their terms (!!) and also accuse them of being deceptive. This is not a
marketing exercise. There is no conspiracy here.

~~~
vorg
> writes a presentation to share their approach (primarily so that other
> people working on Go can understand and extend it)

Perhaps Ericson2314's gripe is that it's being posted (and upvoted) on HN,
rather than just a Go-specific forum (e.g. reddit.com/r/golang), and by
implication it's intended to be read by a more general audience.

~~~
enneff
Well that's just silly. People post random stuff to HN all the time. That
something appears here does not mean that HN is the intended audience.

------
zamalek
This is absolutely fantastic work.

Since I learned about continuation-passing style (which Go channels could
_probably_ be formally transformed into), I've been convinced that there's a
better way to do codegen. A better calling convention, better stack
representation, better instruction architecture; I'm not yet sure which - it
nags continuously at the back of my mind, almost as though it's on the tip of
my tongue. In this specific case, it must _surely_ be possible to inline a
continuation with some foreign architecture. I'd _love_ to see some literature
on the more experimental end of this stuff, if anyone has it.
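For readers unfamiliar with the transformation being alluded to, here is a toy sketch (function names invented) of the same computation in direct style and in continuation-passing style: in CPS, a function never returns; it hands its result to an explicit continuation.

```go
package main

import "fmt"

// square is ordinary direct style: it returns its result to the caller.
func square(x int) int {
	return x * x
}

// squareCPS is the same computation in continuation-passing style:
// instead of returning, it passes the result to the continuation k.
// Because every call ends by invoking a continuation, a CPS compiler
// can treat calls and jumps uniformly.
func squareCPS(x int, k func(int)) {
	k(x * x)
}

// addCPS composes in CPS: "the rest of the program" is always explicit.
func addCPS(a, b int, k func(int)) {
	k(a + b)
}

func main() {
	// Direct style: square(3) + 4
	fmt.Println(square(3) + 4)

	// The same expression, CPS-transformed: each step passes its result on.
	squareCPS(3, func(sq int) {
		addCPS(sq, 4, func(sum int) {
			fmt.Println(sum)
		})
	})
}
```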

~~~
pjmlp
Are you aware of "Compiling with Continuations"?

[https://www.amazon.com/Compiling-Continuations-Andrew-W-Appel/dp/052103311X](https://www.amazon.com/Compiling-Continuations-Andrew-W-Appel/dp/052103311X)

~~~
hinkley
That's an interesting read. One of only a handful of tech books I've read
twice. It's thin and doesn't repeat itself all that much so if you read it
twice it's still faster than reading most tech books once.

More recently, though, I heard someone proved mathematically that CPS can be
transformed one-for-one into one of the more conventional models. That doesn't
mean it might not still be easier for humans to deal with, however.

------
sheeshkebab
JVM/Java has had this in its JIT for a while - it's nice to see it coming to
Golang.

A byproduct of that is little hacky tricks like the one below, which make code
faster by restructuring it a little:

[https://techblug.wordpress.com/2013/08/19/java-jit-compiler-inlining/](https://techblug.wordpress.com/2013/08/19/java-jit-compiler-inlining/)

------
chillydawg
9% faster, 15% bigger. I'll take that!

~~~
DannyBee
FWIW: This is actually not that great, but it's a good start.

You should be able to get about 15-20% with about a 3-5% binary size increase.

In fact, with ThinLTO, we often see that gain with binary size _decrease_ from
smart inlining choices.

(The heuristics for inlining take a very very long time to get right and tune)

The issue they will next hit is that inlining is going to make the compiler
slower until they tune the heuristics well.

~~~
kevincox
If I'm understanding correctly, the increase is mostly due to the "debugging"
info that is added, not necessarily due to more code.

~~~
DannyBee
I strongly doubt this. It doesn't say this in the preso, and ...

1\. The compiler is a lot slower, which is usually from code growth and not
debugging info growth. If the compiler is that much slower from debugging info
growth, they have larger issues :)

2\. Usually people do not include debug info sizes in binary sizes, because
DWARF/et al. info can be stripped and put alongside the binary (IE it doesn't
even have to be part of the binary)

~~~
sythe2o0
It does say this. One of the last slides says 4% of the additional size came
from adding more debugging information, excluding anything to do with the new
inlining.

~~~
DannyBee
"If I'm understanding the increase is mostly due to the "debugging" info that
is added, not necessarily due to more code. " vs " One of the last slides says
4% of the additional size came from adding more debugging information,
excluding anything to do with the new inlining"

So no, it doesn't say that it's "mostly due", it says ~25% is due to debugging
information.

------
micah_chatt
What impact would that have on build times? I know a lot of work has gone into
getting back to 1.4 build times, but would the added work of inlining prolong
builds?

~~~
lazard
The compiler got a bit slower:

[https://github.com/golang/go/issues/19386](https://github.com/golang/go/issues/19386)

Those numbers are just for my CLs that fix stack traces but with mid-stack
inlining still off. Turning it on makes builds noticeably slower:

    
    
        $ time ./make.bash
        real: 45.32s  user: 118.67s  cpu: 5.85s
    
        $ time GO_GCFLAGS='-l=4' ./make.bash
        real: 64.51s  user: 167.04s  cpu: 7.12s
    

We'll need to tweak the inlining heuristic to find a good balance between
performance, build times, and binary size.

~~~
Sunset
Please have a switch where I can sacrifice build time for maximum possible
runtime benefits.

~~~
chriswarbo
The compile-time/run-time tradeoff is interesting. Getting the "maximum
possible runtime benefits" probably calls for
[https://en.wikipedia.org/wiki/Superoptimization](https://en.wikipedia.org/wiki/Superoptimization)
:)

------
YesThatTom2
The presentation redacted the stats about how this affects Google performance.
I bet it saves enough CPU hours to pay the author's salary many many many
times over. Good job!

~~~
edgyswingset
Maybe? I'm under the impression that the vast majority of Google's software is
not written in Golang, though.

~~~
SEJeff
But even 1% of their software being go would still see more use than most
software you or I write in our lifetimes.

dl.google.com has been golang since 2013. Imagine the traffic that application
gets!

[https://talks.golang.org/2013/oscon-dl.slide#1](https://talks.golang.org/2013/oscon-dl.slide#1)

------
ainar-g
Whoah, nine percent? That's a lot! Now I wonder whether the improvement is
better or worse on non-x86 platforms.

~~~
lazard
We measured 10% improvement on ppc64.

------
dap
This is interesting work!

That said, it's a little disappointing when runtimes require custom algorithms
or metadata to walk the stack and construct a stack trace. It makes it harder
to build debuggers that grok the state of multiple runtimes (e.g., the Go code
and the C code in the same program). This also affects runtime tracing tools
like DTrace, which by construction can't rely on runtime support for help.

~~~
aclements
We plan to expose all of the inlining information in the DWARF tables so
debuggers won't have any problems with this. Internally, the runtime uses a
different representation just so we can make it more compact and optimized for
the runtime's exact needs. This way, you can also strip the debug info without
breaking the runtime's own ability to walk stacks.

~~~
CUViper
Isn't that what `.eh_frame` is for?

~~~
mnemonik
.eh_frame is DWARF with a couple of tiny tweaks.

~~~
CUViper
Right, but it's an allocated section that doesn't get stripped like debuginfo.

------
snovv_crash
It looks like most of the improvements are in string formatting problems. I'm
curious if better heuristics will help other areas as well.

------
cetinsert
As someone who now professionally uses go on a range of mips devices, I, for
one, do care about any binary size increases!

------
spullara
It doesn't look like this solves inlining library calls?

~~~
aclements
Go already performs cross-package inlining, so it can already inline library
calls. (This is relatively easy to do in Go compared to other languages
because packages must form a DAG. Compiling package A writes out enough
information in the object file for A that compiling package B that depends on
A can inline calls to functions in A.)

~~~
DannyBee
"Compiling package A writes out enough information in the object file for A
that compiling package B that depends on A can inline calls to functions in A"

So it records the calling convention, architecture flags, alignment, and other
ABI pieces etc? As well as an estimate of instruction-level inlining cost,
summary info about arguments, etc, so you effectively decide whether inlining
it will help or hurt, without having the IR around to try?

FWIW: Writing out the info is usually not the hard part, actually, and is
unrelated to the DAG-ness of the packages.

GCC is just the perennial example here, but they refused to write it out for
years for political reasons, not technical ones :)

~~~
aclements
"So it records the calling convention, architecture flags, alignment, and
other ABI pieces etc?"

No. At the moment it records the AST in the object file, because the inliner
works at the Go AST level. In the future it may instead record the SSA
representation (which would obviously give better cost estimates; the current
heuristics are really extremely simple).

"FWIW: Writing out the info is usually not the hard part, actually, and is
unrelated to the DAG-ness of the packages."

The DAG-ness means it's always available when compiling the call site, even if
it's a cross-package call. It means you don't have to do it at link time.

~~~
pcwalton
> The DAG-ness means it's always available when compiling the call site, even
> if it's a cross-package call. It means you don't have to do it at link time.

Why is it any harder to do at link time?

(I've implemented this in a production compiler, and choosing whether to do it
at compile time or link time was a trivial decision.)

~~~
DannyBee
Traditionally, this required a linker that understands there is IR in the
object files. In practice, I don't believe this has been a problem for many
years now (and again, it was only a problem in the open source world, so
saying it's related to the language is kind of strange).

Every good production C++ compiler has had some form of link time optimization
for many years.

IBM's, for example, has been happily cross-optimizing between C++, java,
fortran, PL/IX, etc without any issues, going on at least 15, maybe 25+ years
now (I know it's 15 for sure, i suspect it's closer to 25).

