I performed the steps in the article and reduced the output binary size from ~35 MB for a raw `go build`, to around 29 MB with the ldflags, all the way down to 8.3 MB after compressing with UPX.
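Roughly this sequence (the binary name is a placeholder, and the exact upx flags may differ from what the article used):

    go build -o app .                    # ~35 MB
    go build -ldflags="-s -w" -o app .   # ~29 MB
    upx --best app                       # ~8.3 MB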
Does UPX still trigger virus scanners, though? As much as I like the idea of tinier binaries, the chance that your users' endpoint security is going to freak out makes it not worth the hassle.
I hadn't tested the startup times of the upx binary before, so I just did by running `time $binary_name --help`. For my binary, this should print a simple help message and quit.
Big: 0.04s user 0.01s system 101% cpu 0.054 total
Med: 0.04s user 0.01s system 99% cpu 0.048 total
Sm : 0.14s user 0.02s system 99% cpu 0.163 total
So you're right - it makes startup take about 3x as long. However, in my case, 3x slower is a fine trade-off to make.
UPX binaries throw off most endpoint security models, and so we have a chicken-and-egg problem where UPX is more commonly used by malware authors than by commercial software vendors. But once UPX binary signatures are in significantly wider use, new models can be developed to better score legit binaries.
35mb seems small to you? That's an absolutely massive binary.
But to answer your question, you're going to see some performance improvements if your binary can fit into lower levels of the CPU cache.
Kdb+/q, for example, is less than 700 KB, which means that the entire thing can fit into the L1 instruction cache on high-end server CPUs. And that size is even more impressive considering it's only dynamically linked to libc and libpthread.
This small size contributes to the out-of-this-world performance the system has. And remember, that's the language and the time-series DB.
Some of the gains here are from compression, does the compressed size matter for fitting into cache? Surely it's stored as the uncompressed size after expansion?
Yeah, if you're simply compressing the binary with upx or something, the uncompressed size is what actually matters. Though I guess it will decompress faster, the smaller it is. But I doubt that upx decompression speed is particularly important.
For context, 35MB is two and a half boxes of high-density floppy disks. There were whole operating systems, complete with applications, smaller than that.
I remember when an OO framework like MacApp would produce a minimal "hello world" program that took up hundreds of KB, and that was considered almost too bloated to be usable. I can't remember exactly how much, but the better part of a floppy to do nothing was ridiculous.
Aren't Go binaries completely static? That sort of is a large part of an operating system (the libraries minus hardware-related libraries) -- Just add a kernel (tens of MB) and whatever userland stuff you need to get your hardware to work (which could be "nothing" in some cases) and you're ready to go.
This is a binary with ~1800 LoC including comments and generous whitespace. Probably 1600-1700 executable SLoC. Mostly some API handling, file reading, and user input requests. I'm running the following Go version:
go version go1.14.4 darwin/amd64
Looks like the Linux and Darwin versions of the binaries are only a few hundred kb different.
I'm not familiar with UPX compression, so it might invalidate what I'm about to say. However, it can be beneficial for performance if you fit more of your executable into CPU cache (actual results may vary based on access patterns and many other factors).
UPX is just a (nice) compressor for binaries, where running the executable really runs a small decompressor on the binary, gets a binary in RAM and runs it. So your binary is not smaller in cache when executed.
This is what I use. I don't bother with the -s -w link flags these days; just running strip over the binary gets the job done, then upx to bring the size down further.
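i.e. roughly (binary name is a placeholder):

    go build -o app .
    strip app     # instead of -ldflags="-s -w"
    upx app       # or upx -9 for heavier compression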
No, but the smaller the binary, the smaller the download size. It also results in smaller Docker containers, which are a popular way to ship Go binaries.
I tried to answer upthread (https://news.ycombinator.com/item?id=23912394), but the long and short of it is that for my example (35 MB with no compression, 27 MB after ldflags="-s -w", 8 MB after UPX), the program takes about 3x longer to start after UPX compression than before.
Since the Go compiler is so primitive, you can strongly influence the size of the program just by restructuring it. An example: if you have a function that converts integer return codes to errors, and you have another function that calls it a lot:
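Something along these lines (my own sketch, not the parent's actual code; step1/step2 are placeholders):

    package example

    import (
        "errors"
        "fmt"
    )

    // placeholder operations that return C-style codes
    func step1() int { return 0 }
    func step2() int { return 1 }

    // g converts integer return codes into errors.
    func g(code int) error {
        switch code {
        case 0:
            return nil
        case 1:
            return errors.New("not found")
        case 2:
            return errors.New("permission denied")
        // ... imagine many more cases here ...
        default:
            return fmt.Errorf("unknown code %d", code)
        }
    }

    // f calls g at many sites; if g is cheap enough to inline,
    // every one of these call sites can get its own copy of the switch.
    func f() error {
        if err := g(step1()); err != nil {
            return err
        }
        if err := g(step2()); err != nil {
            return err
        }
        // ... and so on ...
        return nil
    }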
You're going to get multiple whole copies of the inner function inlined into the outer one. This can add up, if the outer function is long and/or if the inner switch statement is huge.
You can save a lot of code by jumping through an adapter function instead, like:
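(a sketch of the trampoline idea, not the original code; whether it actually holds off the inliner depends on the compiler's inline cost model, as the reply below notes)

    // gViaAdapter is a trampoline: the idea is that f's call sites become
    // gViaAdapter(code) instead of g(code), so the inliner doesn't paste
    // g's whole switch into f at every site.
    func gViaAdapter(code int) error {
        return g(code)
    }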
... because now the compiler doesn't get to inline g into f. These things make a big difference in whole-program size. I wrote about it in the context of tailscale and protobufs here:
It's not really a language problem, it's a compiler problem, so at least in the future we might get a compiler that recognizes error-handling idioms. Or we just get LLVM? The language problems, like the performance-hostile calling convention, are harder to fix.
not gp, but it looks like go's calling convention requires pushing all function args and return values to the stack [1]. that's always going to be less efficient than most C calling conventions, where (iirc) you put what you can in registers and only use the stack if registers aren't enough
EDIT:
and all registers are caller-saved. so when calling a function, all register contents will be saved on the stack (and restored later) even if the function doesn't touch certain registers
Wow, maybe they have some valid reason for that, but from where I'm standing it just seems like an extremely lazy and thoughtless design choice that belongs in a prototype and nowhere else.
i don't think it's intended as a stable target and will probably change at some point. the top google hit was a 3rd party site, and this issue is still open [#16922 doc: document assembly calling convention](https://github.com/golang/go/issues/16922)
btw the source i linked in my original comment links to a proposal to allow passing args in registers
It's more than a compiler problem: Go modules contain assembly routines which assume the Go calling conventions, so changing the convention is now quite breaking.
There's a lesson in there about success, or something...
How can that be true? Every assembly function in Go moves its arguments from memory relative to the stack pointer. If the arguments were in registers instead it would break every assembly function. Does gccgo not support asm?
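For anyone who hasn't written one: a minimal example of what such a function looks like on amd64 under the current stack-based convention. The matching Go file just declares `func add(a, b int64) int64` with no body; names here are made up for illustration.

    // add_amd64.s
    #include "textflag.h"

    // func add(a, b int64) int64
    TEXT ·add(SB), NOSPLIT, $0-24
        MOVQ a+0(FP), AX    // first argument, read off the stack via the FP pseudo-register
        MOVQ b+8(FP), BX    // second argument
        ADDQ BX, AX
        MOVQ AX, ret+16(FP) // the result is written back to the stack as well
        RET

If arguments moved into registers, every offset-from-FP access like the above would stop matching the real calling convention.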
I just tried it and it won't build my existing Go packages that include x86 assembly files. So I conclude that it does indeed break compatibility with asm, which certainly is one way to change the calling convention.
Consider that a hobbyist C compiler isn't going to perform the same optimizations as gcc. Likewise, clang and gcc aren't totally equivalent, either, despite being based on the same language standard. Most languages don't have multiple compilers, but any non-commercial language can in theory have multiple competing compilers that don't give equivalent output for the same input.
It being a compiler problem means you can just accept the performance/size penalty and leave your code as is, hoping that a future, much smarter compiler will accept your current code unchanged and produce output without the penalty.
A language problem would mean you have to rewrite it differently.
A language is both the abstractions that it implements, and the way it implements them. Any problem in the code you write could be solved by changing the compiler as much as it could be solved by changing the code. The two parts of the system are inseparable. I’d say a language can only be as good as its best compiler.
But they are quite different. Some languages are tightly coupled to a single implementation, but that's not true of all of them (Go included). You could write all your applications in the Go language but use a transpiler.
And as other comments call out, it is important to not conflate the two. A language issue is one that can't be easily fixed without breaking backwards compatibility. Compiler optimizations can be made (and is done all the time) without breaking the language.
The better C compilers use better heuristics in these cases, knowing that binary size can mean I$ pressure and therefore be a detriment to performance.
go has //go:noinline for the same effect. I don't think you need these weird trampoline calls and they are probably somewhat fragile since a future compiler might figure out how to inline through them.
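For reference, it's just a directive immediately above the function you don't want duplicated (a minimal sketch; names made up):

    package example

    import "errors"

    // codeToError stands in for the return-code-to-error converter
    // discussed above; //go:noinline keeps exactly one copy of its
    // body no matter how many call sites there are.
    //go:noinline
    func codeToError(code int) error {
        if code != 0 {
            return errors.New("operation failed")
        }
        return nil
    }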
Nobody needs to bounce through trampoline calls like this, but for people willing to chase a modest code size decrease like the 6% in the article, it's worth knowing that how your code is arranged may have a larger effect on program size than you're used to from other compiled languages.
That doesn't sound like Go's compiler is primitive, a primitive compiler wouldn't attempt the inlining at all. This sounds like a performance optimization which isn't something that a primitive compiler would attempt.
The compiler is not doing any sophisticated analysis at all; it can inline because the language is designed to be easy and fast to compile, but in my opinion that comes at the expense of a little too much.
I went through most of the compiler about two years ago to see, amongst other things, how much of its time was spent doing IO, since both compile time and the correctness of dependency handling varied wildly in virtualized environments.
That's where I learned a lot: for example, that the correct way to invoke the compiler is essentially impossible to achieve with the supplied tools, but if you do manage it, it was actually possible to not recompile the system libraries each time you build; and that refactoring code where module scope is used for most things is incredibly tedious.
To compile to/from memory as a service, I had to change approximately 10,000 lines of code without being entirely done, as everything was tied into the module namespace and a slightly too concrete, VFS-like file access mechanism.
Thus both the lifetimes of data and where the IO went were impossible to change without touching almost every type declaration site, and hence often every declaration and attribute reference site. Anything module-based needed to be threaded through in the appropriate places.
Yes, and because some package names are hardcoded to be forbidden by parts of the toolchain unless it knows it is compiling itself, you have to patch the compiler (iirc) and the build tool if you want it to compile itself under a different name, which you kind of have to do when you are doing something incredibly experimental. Otherwise it'd just overwrite itself because of GOPATH. Because of how the compiler/build works, it becomes contagious, so you'll need different names for everything. Well, you could probably get away with a little less renaming if you compiled the compiler with make, but at the point I realized that, I was already too far in for there to be any point in backtracking.
The hardest issue was when it deadlocked on a mutex for a while; it turned out the concurrency in one of the main compiler goroutines was not implemented correctly, so without the implicit serialization of the disk IO, it deadlocked on almost every compile.
Apologies for the digression; Go was kind of forced on me, and some of the most touted "opinionated" parts cost me countless hours of grief.
TBH i do not really know much about Go, though i did spend some time reading the compiler's code last year and found it very readable despite not having any real knowledge of the language. It does try to keep things simple, but at the same time i do not think it is a primitive compiler - that title would probably go to something like the Tiny C Compiler, though even TCC has some optimization features i think (in general i wouldn't say that any compiler that has an optimizer can really be called "primitive") - as it does a lot of things that strictly speaking aren't really necessary to produce executables. I'm certain that it is possible to create a much simpler and much smaller compiler for Go than the official one.
It's even faster (at least on OSX) if you can invoke it correctly, which at least 2 years ago was almost impossible.
If you manage to supply exactly the packages (transitively), and not a wildcard more, it would avoid recompiling the system libraries on each invocation. One would think a compiler where so much effort was put into making it fast would make it easy to have it run fast? The primitives were sort of there, but they didn't really fit together, and I actually had to write quite a bit of script/code to make it work. Got very fast though; I thought I had broken it the first time I got it to work. <return> and instant prompt. Took me a while to convince myself it was actually compiling anything at all.
Cut the recompile time in a VM to less than a tenth. Still slowed down by some pointless file IO (don't remember exactly what) but that was less of an issue.
Vanity package URLs, though, were what broke this camel's back, after GOPATH and some other miscellanea had already strained it.
When I ... probably needed to quickly fork+fix an external package, and suddenly had this additional obstacle, I was so ... unhappy. Already too much work, and then you get more work for exactly no discernible reason at all. It's even called a vanity URL officially.
This didn't seem to change anything for me. I am currently using Go 1.14.4 (Windows), and after trying go1.15beta1, the resultant executables are actually larger.
Another thing I've always found strange is that people always recommend (-ldflags -w), when (-ldflags -s) gives a smaller result, and (-ldflags '-s -w') is identical to (-ldflags -s).
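If anyone wants to reproduce the comparison, something like this (binary names are placeholders):

    go build -o app-default .
    go build -ldflags=-w -o app-w .
    go build -ldflags=-s -o app-s .
    go build -ldflags="-s -w" -o app-sw .
    ls -l app-*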
Given the article is talking about DWARF debugging formats, I believe the numbers in the article are for Linux binaries. I'm not sure if Windows binaries store line numbers in the same way.
Doesn't Windows usually just use a .pdb file next to the exe/dll, containing the debugging information? So stripping debug information on Windows is just deleting (or not distributing) that file.
Yes, .pdb files are used by most toolchains for debug information on Windows.
No, Go doesn't use .pdb files.
On all platforms, Go stores a significant amount of metadata in the binary, in the same format, so that it can be accessed using the same code on all platforms.
This information is used by the runtime. The precise garbage collector needs to know the layout of structures and the stack to know which fields are pointers; to generate readable call stacks, Go needs info about functions and where they come from (source file name, position); etc.
A sidenote: you can convert the DWARF debug info to .pdb using an external tool, which allows debugging Go programs with Windows-native debuggers like WinDbg.
> You can usually save double-digit percentages by stripping debugging information: pass `-ldflags=-w` to `go build`.
Is there a way to make `go build` for a particular project always use `-ldflags=-w`? Or do I have to remember to type it in every time (or hide my `go build` command inside a Makefile or `build.sh`)?
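The build.sh route is only a couple of lines, for what it's worth (a sketch; path and name are whatever you like):

    #!/bin/sh
    # build.sh: always pass the size-stripping flag, forward any extra args
    exec go build -ldflags=-w "$@"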
Shrinking the line table will be irrelevant to performance. As I understand it, the tailscale binary has a hard size requirement to run as a VPN under iOS.
This is a great article on how to shrink Go binaries. The author is the lead of the cryptography and security team within the Go team at Google.