
Size visualization of Go executables using D3 - knz42
https://science.raphael.poss.name/go-executable-size-visualization-with-d3.html
======
cakoose
> as discussed in my previous article [1], Go uses memory instead of registers
> to pass arguments and return values across function calls.

[1] [https://science.raphael.poss.name/go-calling-convention-x86-...](https://science.raphael.poss.name/go-calling-convention-x86-64.html)

This is very surprising for a language that targets somewhat high performance.

Looks like it's a 5-10% performance hit, but makes it easier to provide good
backtrace information:
[https://github.com/golang/go/issues/18597](https://github.com/golang/go/issues/18597)
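To see the trade-off concretely, here is a toy function you can compile and inspect yourself. At the time of this thread Go still used its stack-based ABI (a register-based ABI for amd64 landed later, in Go 1.17); the `//go:noinline` directive keeps the call from being inlined away so the calling convention is actually exercised:

```go
package main

import "fmt"

// A trivial function like the x+y-z example discussed above. Under the
// stack-based calling convention described in the article, x, y, and z
// arrive in stack slots and the result is written back to a stack slot,
// rather than being returned in a register.
//go:noinline
func compute(x, y, z int) int {
	return x + y - z
}

func main() {
	// View the generated assembly with:
	//   go build -gcflags=-S .
	fmt.Println(compute(3, 4, 5)) // prints 2
}
```

The `-gcflags=-S` flag dumps the compiler's assembly output, which is how the linked article produced its listings.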

~~~
userbinator
That is indeed _very_ surprising. I think the fact that Go isn't dead-slow can
be attributed mainly to the sheer speed of CPUs. Looking at that code reminds
me of the early days of PC compilers, especially the output of the trivial
x+y-z function. The complete lack of push/pop instructions also shows a
massive defect in the understanding of how the x86 architecture is supposed to
be used. The icache bloat of doing that is enormous.

 _So return values are passed via memory, on the stack, not in registers like
in most standard x86-64 calling conventions for natively compiled languages._

Wow. That's "worse than cdecl" --- which, despite passing parameters on the
stack, will at least use the accumulator (and high accumulator) for return
values that fit.

 _but makes it easier to provide good backtrace information_

This seems to be a common line of thought but it goes against my belief in how
tools should create efficient code --- anything intended for debugging
purposes only should have zero effect on the executable when not being used,
and compilers should focus on generating the most efficient code. Debugging
information goes in a separate file and there you can put as much detail as
you want. Don't make code generation worse, improve the debugging tools
instead. The code will spend far more time, across everyone who uses it, being
run than debugged.

~~~
drej
I wouldn't call Go dead slow; it really does depend on the use case. Go was
originally designed for systems that don't require absolute maximum
performance, focusing instead on safety and a good standard library for
building things.

You can see from the common use cases, the original authors' backgrounds, and
issue discussions that the target audience is not, e.g., people who want
maximum throughput in data processing systems (something I've been interested
in). The serde libraries are dead slow (but correct), there is little to no
assembly in that code (unlike in the crypto packages), you don't have any
higher-level access to intrinsics to build this yourself, there is no native
(meaning -march) compilation (by design, for portability reasons), etc.

If you try writing high-performance Go, it often starts looking like C rather
than Go (avoiding channels and io.Reader, using unsafe, etc.). It's a shame,
but oftentimes it's your only option. Plus, you don't have clang/gcc
developers helping speed up your code on a daily basis; you "only" have the
Go team and contributors (yes, there is gccgo, but...).
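As a toy illustration of that style (my example, not from the thread): in a hot loop, operating on a []byte directly sidesteps the interface dispatch and buffering layers an io.Reader pipeline would add:

```go
package main

import "fmt"

// countLines scans a byte slice directly instead of wrapping it in an
// io.Reader/bufio.Scanner pipeline. In hot paths this avoids interface
// dispatch and per-call allocations -- part of what "Go starts looking
// like C" means in practice.
func countLines(buf []byte) int {
	n := 0
	for _, b := range buf {
		if b == '\n' {
			n++
		}
	}
	return n
}

func main() {
	data := []byte("a\nb\nc\n")
	fmt.Println(countLines(data)) // prints 3
}
```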

All that considered, I like the language.

~~~
userbinator
_I wouldn't call Go dead slow_

I'm not saying it is -- but rather, that they could get away with such sloppy
code generation because CPUs are so fast now. I did mention that early in the
history of the PC, pretty much all compilers were like that due to other
constraints, and the difference between that and handwritten Asm was enormous.

~~~
drej
Whoops, apologies, misread that one sentence.

Agreed - if one takes the performance on an absolute scale, it is usually
sufficient (also the reason why people use Python or PHP, despite both being,
relatively, quite slow).

It is only once people start comparing it to the next best thing, or when
they want better performance, that they realise there is a lot of
not-so-low-hanging fruit.

------
everdev
Really interesting results and analysis, but small nit:

> there is about 70MB of source code currently in CockroachDB 19.1, and there
> was 50MB of source code in CockroachDB v1.0. The increase in source was just
> ~140%

That's a 40% increase, not 140%. The same mistake appears in percentage
calculations throughout the article.
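The arithmetic, for anyone checking:

```go
package main

import "fmt"

func main() {
	oldMB, newMB := 50.0, 70.0
	// 70 is 140% *of* 50, which is likely where "~140%" came from,
	// but the *increase* is (70-50)/50 = 40%.
	fmt.Printf("%.0f%%\n", (newMB-oldMB)/oldMB*100) // prints 40%
}
```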

That said, super interesting discovery.

~~~
knz42
corrected

------
ploxiln
go minor releases make a surprising difference - you'll see it just from
compiling the same project with go-1.10.z vs go-1.12.z:

[https://github.com/golang/go/issues/27266](https://github.com/golang/go/issues/27266)

a very recent cause of pclntab getting huge is adding preemption safepoint
info for every line/instruction range, and they're looking at alternatives:

[https://github.com/golang/go/issues/24543](https://github.com/golang/go/issues/24543)

------
justinclift
Another utility (CLI-based) for investigating Go binary sizes is `goweight`:

[https://github.com/jondot/goweight](https://github.com/jondot/goweight)

Written about here:

[https://medium.com/@jondot/a-story-of-a-fat-go-binary-20edc6...](https://medium.com/@jondot/a-story-of-a-fat-go-binary-20edc6549b97)

------
userbinator
_The purpose of this data structure is to enable the Go runtime system to
produce descriptive stack traces upon a crash or upon internal requests via
the runtime.GetStack API._

 _In other words, the Go team decided to make executable files larger to save
up on initialization time._

Something about this whole thing just seems _wrong_. How often does (perhaps
_should_) an application crash? How often does (again, perhaps _should_) it
need to retrieve its own stack? ...and _how much_ of the binary is being
taken up just for that purpose?

_Of size/performance trade-offs and use cases_

Why is startup time even the question, when the common-sense approach is to
simply compress this rarely-used table and decompress it _the first time it's
used_, not upon every startup? That's assuming it is always absolutely
necessary to have in the first place, since loading a huge executable isn't
going to be fast anyway.

I feel like this is a case of "the tail wagging the gopher".

~~~
jchw
Eh, it's a bit of a simplification, I think. The pclntab is consulted in more
situations than just an application crash; for example, when logging, you can
have a source line prepended to the beginning of the line, which certainly
uses the pclntab. I would be pretty surprised if there weren't a lot of other
cases where it's consulted.
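Concretely, runtime.Caller is the API behind this: it resolves a program counter to a file and line via the runtime's pc-to-line data, and it is what the standard log package uses when the Lshortfile/Llongfile flags are set:

```go
package main

import (
	"fmt"
	"path/filepath"
	"runtime"
)

// whereAmI reports the file:line of its caller. runtime.Caller performs
// the pc-to-source lookup through the runtime's pclntab data.
func whereAmI() string {
	_, file, line, ok := runtime.Caller(1)
	if !ok {
		return "unknown"
	}
	return fmt.Sprintf("%s:%d", filepath.Base(file), line)
}

func main() {
	fmt.Println(whereAmI())
}
```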

~~~
userbinator
_for example, when logging, you can have a source line prepended to the
beginning of the line, which certainly uses the pclntab_

It needs to go through the table to find which source line corresponds to the
current instruction pointer? That's the only reason I can see for needing it,
and a very roundabout way of getting information which is known at compile
time and could simply be a constant wherever it's used, much like C's
__FILE__ and __LINE__.

~~~
rwj
That's not enough for the stack traces. A crash in Go is way more informative
than failing an assert in C. Whether the extra memory is worth it is a
separate question.

------
chessturk
Would it be possible to pass a flag to go build to change runtime.pclntab to
the pre-Go 1.2 implementation?

I don't actually have the author's use case, though; I tend to build
microservices in Go!

~~~
marcus_holmes
considering the number of flags for Go already, that seems the sensible option
:)

even in docker-and-microservice-land, though, there's a cost to having an
extra 10-50MB of executable to copy around the place... I'm nowhere near
experienced enough in that area to work out whether it counteracts the gains
in initialisation speed.

------
zerotolerance
I put together a little demo looking at the impact of using fmt vs os for
Hello, World.

[https://github.com/allingeek/fmt-vs-os](https://github.com/allingeek/fmt-vs-os)

------
fenollp
Could we see a comparison with both `-ldflags '-s -w'` and `CGO_ENABLED=0`?

I feel like this would solve:

> the Go standard library is not well modularized; importing just one function
> (fmt.Println) pulls in about 300KB of code.
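A rough comparison along those lines might look like this (assuming a buildable module in the current directory; per the linker docs, -s drops the symbol table and -w drops DWARF debug info):

```shell
# Build the same package three ways and compare binary sizes.
go build -o bin-default .
go build -ldflags='-s -w' -o bin-stripped .   # -s: no symbol table, -w: no DWARF
CGO_ENABLED=0 go build -o bin-nocgo .         # pure-Go build, no cgo linkage

ls -l bin-default bin-stripped bin-nocgo
```

Worth noting: -s -w strips debug info but removes neither the pclntab nor the code fmt pulls in, so on its own it wouldn't address the quoted modularization point.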

