
Accelerate large-scale applications with BOLT
https://code.facebook.com/posts/605721433136474/accelerate-large-scale-applications-with-bolt/?r=1
======
brendangregg
Great to see this open sourced, and more work in the FDO/PGO space. I also
like the access visualization heatmap: I'd done something similar before but
hadn't found a big use for it yet; showing how it improves access locality is
killer.

------
ebikelaw
It seems unfortunate that they developed their own data format for the input.
Why can't it be in the same format that SamplePGO ingests? It also seems
unfortunate to add yet another stage to the toolchain. We already have either
build, run+profile, rebuild with PGO, link with LTO; or build with SamplePGO,
link with ThinLTO. This adds a second or third rebuild. It already takes a
phenomenally long time to compile a large C++ application, and another pass
isn't going to make it shorter.

~~~
maxpan
There's no need for another build, as BOLT runs directly on a compiled binary
and can be integrated into an existing build system. Operating directly on
machine code allows BOLT to boost performance on top of AutoFDO/PGO and LTO.
Processing the binary directly requires a profile in a format different from
what a compiler expects, since the compiler operates on source code and needs
to attribute the profile to high-level constructs.
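
For concreteness, the workflow is roughly the following (a sketch based on
the project README; exact flags vary by BOLT version, and the binary should
be linked with --emit-relocs to get the most out of it):

    # profile the running binary with LBR sampling
    $ perf record -e cycles:u -j any,u -o perf.data -- ./app <args>
    # convert the perf profile into BOLT's format
    $ perf2bolt -p perf.data -o perf.fdata ./app
    # rewrite the binary with an optimized layout
    $ llvm-bolt ./app -o app.bolt -data=perf.fdata \
        -reorder-blocks=cache+ -reorder-functions=hfsort \
        -split-functions=2 -split-all-cold -dyno-stats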

~~~
AboutTheWhisles
Can BOLT optimize a shared library as well?

~~~
maxpan
Not yet, but support is coming.

------
wocram
Is it possible for this sort of thing to make it into the compilers? Seems
like it might see much more usage that way.

~~~
sanxiyn
AutoFDO is exactly that. Facebook's excuse for developing this is that AutoFDO
didn't work well with HHVM, but that sounds like an AutoFDO bugfix, not a
whole new project. I agree with you that it is better to work on AutoFDO,
because it will see wider usage.

~~~
maxpan
In many cases BOLT complements AutoFDO. AutoFDO affects several optimizations,
and code layout is just one of them. Another critical optimization influenced
by AutoFDO/PGO is function inlining. After inlining, a callee's code profile
is often different from the "pre-inlined" profile seen by the compiler, which
prevents the compiler from making optimal layout decisions. Since BOLT
observes the code after all compiler optimizations, its decisions are not
affected by context-sensitive inlined-function behavior.
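
A minimal C sketch of the effect (hypothetical functions, purely
illustrative):

    /* The same callee inlined at two call sites behaves differently in
       each context. A whole-function profile of clamp() mixes both
       behaviors; BOLT sees each inlined copy separately in the binary. */
    static inline int clamp(int x) {
        if (x < 0)      /* taken on almost every call from site_a, */
            return 0;   /* never from site_b                       */
        return x;
    }

    int site_a(int x) { return clamp(-(x * x)); } /* argument always <= 0 */
    int site_b(int x) { return clamp(x * x); }    /* argument always >= 0 */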

------
ssuresh
Great to see this open sourced. Front-end bottlenecks from I$ misses are a big
problem for several managed runtimes (Node, Java, .NET, etc.). Since this
works on the executable, JITed code will still need to do its own layout
optimization.

------
rattray
Should this read "Speed up large binaries with BOLT"? Seems specific to many-
megabyte compiled programs.

~~~
asfasgasg
bzip2 isn't very large, and they seem to believe they speed that up too. I
don't see why this wouldn't speed up small things. Whether it's a big program
or a small one, it probably has hot code and cold code, and organizing the
code in such a way that the hot stuff is all together is likely to help. Same
goes for making sure you don't have to take branches, etc.
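
You can see the idea with the layout hints compilers already expose by hand
(a sketch; BOLT infers the same hot/cold split from a runtime profile instead
of annotations):

    #include <stdio.h>

    /* Rarely executed: GCC/Clang move cold functions into
       .text.unlikely, away from the hot path. */
    __attribute__((cold, noinline))
    static void report_error(const char *msg) {
        fprintf(stderr, "error: %s\n", msg);
    }

    int parse(int v) {
        if (v < 0) {                        /* cold branch */
            report_error("negative input");
            return -1;
        }
        return v * 2;                       /* hot path stays compact */
    }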

~~~
ebikelaw
On modern x86_64 implementations, tiny changes in code layout can have large
effects. There are plenty of stories of "load-bearing NOPs" that mysteriously
make programs several percent faster, and even experts do not fully understand
these effects. I'd like to hear more about the effect on various programs, not
just bzip2.

~~~
sanxiyn
Google published a paper on this at CGO 2011:

[https://ai.google/research/pubs/pub37077](https://ai.google/research/pubs/pub37077)

------
magicbuzz
Could this conceivably be used to speed up binaries like nginx? I tend to
compile nginx on my systems, as over time there are always custom modules I
want to add or modify anyway.

~~~
sanxiyn
This helps if and only if you suffer from instruction starvation. Given that
the nginx binary is around 1 MB, which fits comfortably in the cache, it is
very unlikely BOLT would help with nginx.
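
One way to check whether a workload is actually front-end bound (a sketch;
event names vary by CPU and kernel, and the worker PID is a placeholder):

    # attach to a running worker for 10 seconds under load
    $ perf stat -e instructions,L1-icache-load-misses,iTLB-load-misses \
          -p <nginx-worker-pid> -- sleep 10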

~~~
AboutTheWhisles
I don't think that is true. Even without memory bandwidth constraints, jumping
to instructions that aren't in cache is going to incur memory latency. If the
instructions are all packed together, the next instructions can be prefetched
and already be in cache.

Also, lower cache levels have less latency yet are much smaller; the L1
instruction cache is typically 32KB. Any linear access of memory will be
prefetched, minimizing the latency of memory access.

~~~
redtuesday
AMD's Zen architecture uses 32KB for data and 64KB for instructions (I was
curious whether there are differences between AMD and Intel designs regarding
L1 cache).

------
polskibus
Can this work be used somehow to improve JITs like .NET's?

~~~
minxomat
A thing to try: Mono(LLVM) -> mkbundle(static, aot full) -> BOLT.
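
Something like the following (hypothetical and untested; how mkbundle
interacts with full AOT varies by Mono version):

    $ mcs -out:app.exe app.cs                  # compile the C# sources
    $ mono --llvm --aot=full app.exe           # ahead-of-time compile via LLVM
    $ mkbundle -o app --static --deps app.exe  # bundle into a native executable
    # then profile ./app and rewrite it with llvm-bolt as usual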

------
himom
Case aside, "bolt" is already a Puppet project. Not a huge deal since the two
projects are in widely different spaces.

------
stochastic_monk
For some bizarre reason, regardless of which browser I use, I can't open this
facebook page. (ERR_CONNECTION_CLOSED)

Is there code or a summary available elsewhere?

------
jedberg
This is an interesting alternative to microservices.

Although it doesn't apply to everyone, one of the advantages of microservices
is avoiding exactly the problem this is solving: the monolithic application
being too big for memory, too big for the CPU cache, etc.

I like this alternate solution -- it involves a lot of up-front work and will
surely require a lot of maintenance, but so does a complex microservices
architecture.

It'll be interesting to see how this holds up in a few years.

~~~
quotemstr
Huh? Having a big monolithic service binary doesn't mean that the whole
binary gets paged in from disk for every request. If you're specializing and
dedicating particular instances to particular tasks, you'll get
cache-mediated locality even if you have a big _binary_ that can, in
principle, do lots of different things.

~~~
jedberg
> doesn't mean that this whole binary gets paged in from disk all the time for
> every request

No, but unless I'm reading this wrong, what this optimizes is what _does_ get
paged in.

Using microservices would naturally limit what _could_ get paged in because
the services would most likely be smaller, whereas with this, the developer
doesn't have to worry about those things and instead this tool optimizes for
them.

~~~
ebikelaw
In my experience if you take one big service and break it into two, each
binary will be pretty much just as big as the original. All of that library
code is still in there, all that static data is still reachable. Nobody knows
why but it is.
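
Easy enough to measure where the bytes go, e.g. (./service is a placeholder
binary):

    $ size ./service                            # text/data/bss totals
    $ nm --size-sort -C ./service | tail -20    # the 20 largest symbols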

~~~
repsilat
Sure, but it's probably not all being executed, so you don't really pay for
what you don't use.

(OTOH, while you may not find yourself waiting on instructions, you'll no
doubt spend all of your time waiting on data...)

------
harigov
From working at large companies like these, it is clear that many of their own
applications never use these technologies. Why then is FB investing in such
technology? Where do they plan to use it?

~~~
paxy
The main example they gave in this article, HHVM, is what all of Facebook's
backend runs on, so I'm not sure what point you're trying to make.

~~~
notacoward
I work on Facebook infra, and to me that's all the _front_ end.

