
Things that make Go fast - davecheney
http://dave.cheney.net/2014/06/07/five-things-that-make-go-fast
======
hendzen
Escape analysis, dead code elimination, and function inlining are standard
optimizations taught in an undergraduate compilers course. Go is cool, but I
wouldn't really cite those as justifications for why it's fast.

~~~
Nitramp
Yes for dead code elimination and function inlining, not so sure about escape
analysis. The author acknowledges that, but there's a detail in Go: it does
the function inlining at compile time (unlike e.g. Java JITs), but still
manages to inline across compilation units (unlike C++, modulo LTO).

That's nice, and presumably what he wanted to point out. It's also nice that
in Go, these things are very straight forward due to the overall simplicity of
the system (unlike C++). The dead code elimination is just a supporting fact
for why that's useful, and again works across compilation boundaries.

I'm not sure about your assertion regarding escape analysis; Java JITs only
learned that trick recently, and are still pretty bad at it. C++ again
suffers from cross-compilation-unit visibility; even if your LTO can detect
an inlineable call, it's AFAIK not possible at that point to move heap
allocations to the stack.
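
To make the escape-analysis point concrete, here is a minimal sketch (names
invented for illustration); `go build -gcflags=-m` prints the compiler's
decision for each allocation:

    package main

    import "fmt"

    type point struct{ x, y int }

    // p never escapes sum, so the compiler can keep it on the stack.
    func sum(n int) int {
        p := &point{1, 2}
        return p.x + p.y + n
    }

    // The pointer is returned, so this allocation must go to the heap.
    func leak() *point {
        return &point{3, 4}
    }

    func main() {
        fmt.Println(sum(0), leak().x)
    }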

This is an interesting pattern in Go: the longer one looks at it, the more
one understands that it's a whole bunch of good decisions in various
subsystems coming together.

~~~
pcwalton
> C++ again suffers from cross-compilation unit visibility; even if your LTO
> can detect an inlineable call, its AFAIK not possible at that time to move
> heap allocations to the stack.

Sure it is. Why not?

C++ compilers don't usually do this because it doesn't help much—explicit
memory management encourages people to not allocate unless necessary in the
first place.

~~~
Nitramp
> Sure it is. Why not?

Do you have a reference for that? I'd expect this to be hard; at link time
you no longer have the C++ source, so it's much harder to make such
decisions.

~~~
pcwalton
You don't need the C++ source, just the IR. LLVM already removes mallocs if
they are unused:

[http://lists.cs.uiuc.edu/pipermail/llvmdev/2010-July/033017....](http://lists.cs.uiuc.edu/pipermail/llvmdev/2010-July/033017.html)

------
Artemis2
Unfortunately, Go's compiler doesn't optimize as well as it could; most of
the optimizations presented here were already being done by compilers in the
80s.

It doesn't help that modern compilers are really complex pieces of software
that took dozens of years to write and improve to the state they are in
today. Hopefully, switching to a compiler written in pure Go in Go 1.4
(IIRC) will allow maintainers to benefit from Go's simplicity.

------
Alupis
The comparison between Go and Java seems unfair, given that they compare a
primitive variable with an object... which has methods and a bunch of other
things that increase its size (for good reason).

Sure, Go may be quick... but a JIT'ed Java program will run at native C
speed... because it has been compiled down to native code at that point...
(and most language performance comparisons I've seen pop up generally ignore
this fact and measure "performance" by timing a run that includes the JVM
firing up and executing cold/non-JIT'ed code... not real-world scenarios for
high-performance code.)

~~~
melling
What is Java's startup and JIT overhead? Go seems to be a good replacement for
when you need a faster Python. For large, long running programs the JIT
probably has better optimizations than the current Go compiler.

~~~
pjmlp
> What is Java's startup and JIT overhead?

Quite fast if you use an AOT compiler.

On the server side, it usually doesn't matter that much. And when it does,
there are JVMs that cache JITed code.

~~~
papaf
This is true for PCs and servers. However, Java startup time on the
Raspberry Pi is horrific.

I recently saw a small server go from a 3-second startup on my PC to 4
minutes on a Raspberry Pi.

~~~
pling
That's pretty much because the CPU on the Pi is awful. I mean really bad.
The CPU came with the SoC they could get their hands on, rather than being
selected as optimal for a desktop/server role.

------
astrange
Function calls aren't that slow on an OoO processor - they're perfectly
predictable branches, so it can just start decoding from the target. There
might be a cache miss, but there might also be fewer cache misses, or even
better, the CPU might skip decoding entirely thanks to a µop cache.

Really, the purpose of inlining is that inlined functions can be specialized
for their new context, which can easily make the total code size smaller. On
x86, size/speed tradeoffs just don't happen like they used to.
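
A hedged sketch of that specialization effect (names invented for
illustration): once clamp is inlined at a call site with constant arguments,
the comparisons fold away and the call can compile down to a constant:

    package main

    import "fmt"

    // clamp is small enough for the gc compiler to inline.
    func clamp(x, lo, hi int) int {
        if x < lo {
            return lo
        }
        if x > hi {
            return hi
        }
        return x
    }

    func main() {
        // After inlining, clamp(42, 0, 255) specializes to the
        // constant 42: no call, no branches.
        fmt.Println(clamp(42, 0, 255))
    }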

~~~
gsg
That's not the whole story. There are other costs associated with calls such
as spilling and imprecision of data flow analyses around a call site.

------
HeroesGrave
Things that make Go fast*

*compared to non-native languages like Python and Java.

Could people please stop calling their favourite language fast just because
it beats an interpreted/VM language?

~~~
pjmlp
Not only that. Usually these comparisons cleverly leave out AOT compilers
for said languages to make theirs look better.

In Java's case there are quite a few JVMs, many of those with AOT compilation
to choose from, even implemented in Java itself.

~~~
marktangotango
For Java, can you name any AOT compilers besides Excelsior JET and GCJ?

~~~
pjmlp
GCJ is dead.

Yes: CodenameOne, JamaicaVM, Aonix Perc and J9 all support AOT compilation
in addition to the normal JIT.

The Oracle HotSpot replacement project, Graal, allows for AOT compilation
via SubstrateVM.

Then there is RoboVM for targeting iOS applications, with WP support being
added now.

Android is replacing Dalvik with ART, which does AOT compilation at
installation time.

Probably a few more that I am not aware of.

~~~
marktangotango
Thanks for the info; interesting that the first four you mention are
commercial products. Two you may find interesting: Avian VM, and XMLVM,
which at one point could translate JVM bytecode to C for compilation with
GCC.

------
zwieback
I was surprised to see stack-check preambles mentioned here. Does that really
happen on every function call? Or does it happen on a context switch? Usually
stack-checking on function entry is considered something that makes code slow.

~~~
4ad
Yes, it happens on every function call. It costs 3 machine instructions. That
is nothing.

There is no "context switch" other than the one triggered by this check
(and other similar mechanisms); Go is cooperatively scheduled; all
preemption is voluntary.

~~~
zwieback
Wow, what can you do in three instructions and what happens when the stack
check fails? Sounds intriguing, think I'll read up on that...

~~~
4ad
Let's take a look at Linux. Other systems are similar.
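
(The source file isn't shown in the original; a plausible a.go that would
produce the disassembly below - assuming foo is any function the compiler
of this era declines to inline, e.g. because it contains a loop - is:)

    package main

    var sink int

    // foo contains a loop, so the gc compiler won't inline it and
    // main.main gets a real CALL instruction.
    func foo() {
        for i := 0; i < 10; i++ {
            sink += i
        }
    }

    func main() {
        foo()
    }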

    
    
        ; go tool objdump -s main.main a
        TEXT main.main(SB) /private/tmp/a/a.go
        	a.go:9	0x400c10	64488b0c25f0ffffff	FS MOVQ FS:0xfffffff0, CX
        	a.go:9	0x400c19	483b21			CMPQ 0(CX), SP
        	a.go:9	0x400c1c	7707			JA 0x400c25
        	a.go:9	0x400c1e	e8ddf90100		CALL runtime.morestack00_noctxt(SB)
        	a.go:9	0x400c23	ebeb			JMP main.main(SB)
        	a.go:10	0x400c25	e8d6ffffff		CALL main.foo(SB)
        	a.go:11	0x400c2a	c3			RET
        	a.go:11	0x400c2b	0000			ADDL AL, 0(AX)
        	a.go:11	0x400c2d	0000			ADDL AL, 0(AX)
        	a.go:11	0x400c2f	00			?
    

On linux/amd64 we can use the Local Executable TLS access model. In
particular, we use a negative offset from the FS segment register to get a
TLS slot (our job is simpler because we are always the main executable).

    
    
        MOVQ FS:0xfffffff0, CX
    

We make use of two TLS variables, g and m (soon we will only use one); a
pointer to g is at -16(FS). We access it in this first instruction.

g is an instance of struct G; see go/src/pkg/runtime/runtime.h:/struct.G.
It contains many things, but it starts like this:

    
    
        struct	G
        {
        	uintptr	stackguard0;
        	uintptr	stackbase;
        ...
    

In particular the first word (at offset zero) is the stackguard, which
indicates the stack limit (it is also used for voluntary preemption, but that
doesn't matter here).

This instruction in the stack check preamble:

    
    
        CMPQ 0(CX), SP
    

Compares the current stack pointer with the stackguard. In most cases we have
enough stack, so the next instruction just skips past the preamble to the real
function code.

    
    
        JA 0x400c25
    

When we don't have enough stack, we call into the runtime (one of the
runtime.morestack functions). This function allocates a new stack (from the
heap). Currently we use contiguous stacks, so if we have complete type
information for the current stack we just copy the old stack to the new
one, fixing up any pointers as dictated by the type information, and then
switch the stack pointer.

If we don't have enough type information (or in previous Go versions), we use
segmented stacks. We allocate a new stack segment, but we don't copy the
stack; we just switch the stack pointer and we take care to be able to do the
reverse operation when we return from the function.

Take a look at the next instruction after the call to runtime.morestack.

    
    
        JMP main.main(SB)
    

We just jump to the beginning of the function as if nothing has happened.
Then the algorithm repeats, but this time we won't fail the stack limit
check, so we will skip past the preamble. Why it jumps to the beginning of
the function instead of just continuing into the body of the function is
left as an exercise for the reader.

We used the Local Executable TLS access model here; sometimes we have to
use the Initial Executable model. If we ever allow Go programs to be loaded
by C programs as dynamic objects, we will have to use more complicated
models.

On ARM we just use a register instead of any form of TLS. On most systems
Go binaries set the FSbase register to some value on the heap, but when we
use cgo, or on platforms that don't support static binaries, we don't touch
FSbase, as it was already set up by libc.

Functions that use little stack (under 120 bytes) can be exempted from
this stack check.
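
A small program to watch this machinery in action - a sketch, assuming a
goroutine's initial stack is small (8K in runtimes of this era), so the deep
recursion forces repeated trips through runtime.morestack:

    package main

    import "fmt"

    // Each frame carries a local array large enough to defeat the
    // small-frame exemption, so every call runs the stack check, and
    // the recursion forces the runtime to grow the stack many times.
    func grow(n int) int {
        var pad [128]byte
        pad[0] = byte(n)
        if n == 0 {
            return int(pad[0])
        }
        return grow(n-1) + 1
    }

    func main() {
        fmt.Println(grow(100000)) // far deeper than the initial stack allows
    }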

~~~
zwieback
Thanks, nice writeup.

------
fiatmoney
Hey, as long as we're talking about Go performance - can we please, please
get some kind of wide vector intrinsics (i.e. without cgo overhead) in a
library, or at least aggressive compiler generation of vector ops that
actually use AVX and the ARM NEON equivalent?

Right now peak floating-point performance isn't even within half of what it
should be on a very recent CPU, and I'd love to be able to deploy Go that
exposes machine learning models to a network interface.
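
For context, the kind of kernel at stake - a scalar dot product, the inner
loop of many ML models (a sketch; the claim is that gc currently emits this
one scalar multiply-add at a time, where AVX could process four float64
lanes per instruction):

    package main

    import "fmt"

    // dot compiles to scalar MULSD/ADDSD instructions today rather
    // than packed AVX operations.
    func dot(a, b []float64) float64 {
        var s float64
        for i := range a {
            s += a[i] * b[i]
        }
        return s
    }

    func main() {
        a := []float64{1, 2, 3, 4}
        fmt.Println(dot(a, a)) // 30
    }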

------
azam3d
Go should replace Java in Android development.

~~~
pling
Go should replace Dalvik and the half-arsed Java runtime implementation on
Android, yes, but I'd take a proper mature JVM over both on _any_ device.

~~~
pjmlp
I am looking forward to the Google IO presentation about ART.

Looking at the official languages both in the iOS and WP 8.x SDKs, Google
should at least give first class support to all major JVM languages.

------
robryk
Nitpick: goroutine context switch can also happen at function calls (when the
stack is being enlarged).

~~~
Rapzid
I guess taking advantage of that could be tricky. If your function gets
inlined...D'oh!

~~~
skj
If your function is inlined (which happens at compile time), then there won't
be any stack growth and the point is moot.

~~~
Rapzid
This is a form of pre-emptive scheduling. It doesn't wait for the stack to
actually need to grow; it deliberately causes the stack check to fail. A
bit of classic Dmitry cleverness:
[https://docs.google.com/document/d/1ETuA2IOmnaQ4j81AtTGT40Y4...](https://docs.google.com/document/d/1ETuA2IOmnaQ4j81AtTGT40Y4_Jr6_IDASEKg0t0dBR8/edit)
[http://golang.org/doc/go1.2#preemption](http://golang.org/doc/go1.2#preemption)

Anyway, I was just offering this scenario up as a bit of curious humour,
where somebody might think they are providing an escape hatch but the
compiler inlines their call, foiling their plans :)
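
For the curious, the situation being joked about looks roughly like this (a
sketch; on a cooperatively scheduled runtime of this era, the spinning
goroutine never reaches a stack-check preamble, so it is never preempted):

    package main

    import (
        "fmt"
        "runtime"
    )

    func main() {
        runtime.GOMAXPROCS(1)
        go func() {
            for {
                // No function calls: no stack-check preamble,
                // hence no voluntary preemption point.
            }
        }()
        runtime.Gosched() // hand the only thread to the busy loop
        fmt.Println("never printed without asynchronous preemption")
    }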

~~~
stcredzero
Curious humor == Classic too clever by half.

