Life without line numbers for 6% smaller Go binaries (commaok.xyz)
95 points by caust1c on July 21, 2020 | 92 comments


https://blog.filippo.io/shrink-your-go-binaries-with-this-on...

This is a great article on how to shrink Go binaries. The author is the lead of the cryptography and security team on the Go team at Google.


Completely agree.

I performed the steps in the article and reduced the output binary size from ~35 MB for a raw `go build`, to around 29 MB for `go build` with the ldflags, all the way down to 8.3 MB compressed with upx.

Really impressive results.


Does UPX still trigger virus scanners, though? As much as I like the idea of tinier binaries, the chance that your user's endpoint security is going to freak out makes it not worth the hassle.


Virus scanners just treat UPX as a compressed file format and scan the binary inside.


Some scanners also treat UPX as generally malicious.


Never mind that, startup times are atrocious with such a huge upx packed binary ...


That's an interesting point.

I hadn't tested the startup times of the upx binary before, so I just did by running `time $binary_name --help`. For my binary, this should print a simple help message and quit.

    Big: 0.04s user 0.01s system 101% cpu 0.054 total
    Med: 0.04s user 0.01s system 99% cpu 0.048 total
    Sm : 0.14s user 0.02s system 99% cpu 0.163 total

So you're right - it makes startup take about 3x as long. However, in my case, 3x slower startup is a fine trade-off to make.


UPX binaries throw off most endpoint security models, and so we have a chicken-and-egg problem where UPX is more commonly used by malware authors than by commercial software vendors. But once UPX binary signatures are in significantly wider use, new models can be developed to better score legitimate binaries.


I don't think UPX triggers (my) virus scanner. The McAfee daemon on my work laptop barfs at the EICAR Test File (https://kc.mcafee.com/corporate/index?page=content&id=KB5974...) but it ignores the UPX binary right next to it.


35MB seems relatively small already. What do you get from an 8MB binary that you don't from one 4x that size?


35 MB seems small to you? That's an absolutely massive binary.

But to answer your question, you're going to see some performance improvements if your binary can fit into lower levels of the CPU cache.

Kdb+/q, for example, is less than 700 KB, which means that the entire thing can fit into the L1 instruction cache on high-end server CPUs. And that size is even more impressive considering it's only dynamically linked to libc and libpthread.

This small size contributes to the out-of-this-world performance the system has. And remember, that's the language and the time-series DB.


Some of the gains here are from compression, does the compressed size matter for fitting into cache? Surely it's stored as the uncompressed size after expansion?


Yeah, if you're simply compressing the binary with upx or something, the uncompressed size is what actually matters. Though I guess it will decompress faster, the smaller it is. But I doubt that upx decompression speed is particularly important.


> 35MB seems relatively small already

35 MB is huge! Are there any assets inside the binary?


> 35 MB is huge!

For context, 35MB is two and a half boxes of high-density floppy disks. There were whole operating systems, complete with applications, smaller than that.


I remember when an OO framework like MacApp would produce a minimal "hello world" program that took up hundreds of KB, and that was considered almost too bloated to be usable. I can't remember exactly how much, but taking the better part of a floppy to do nothing was ridiculous.


Aren't Go binaries completely static? That sort of is a large part of an operating system (the libraries minus hardware-related libraries) -- Just add a kernel (tens of MB) and whatever userland stuff you need to get your hardware to work (which could be "nothing" in some cases) and you're ready to go.


QNX used to be shipped on a single high density floppy.


Were? My initrd + vmlinuz is still smaller than that today on a stock Ubuntu kernel.


Seems like a culture clash between “why would you not use your distro’s shared libraries?” and “containerize all the things!”


For context, the Starbucks app on Android is 92MB. My bank's app is 180MB. They might have assets, but modern app size has become ridiculous.


They are mostly assets, or frameworks like React Native, Flutter, Xamarin, what have you.

Native Android doesn't need such big sizes.


There are no assets (other than a helpfile) inside the binary. This is (compiled) 35 MB of Golang.


What OS/arch? I have a small program that compiles to 16 MB, 14 with -w, on OS X. Compiling for Linux with -w results in a 14 MB binary too.


This is a binary with ~1800 LoC including comments and generous whitespace, probably 1600-1700 executable SLoC. Mostly some API handling, file reading, and user input requests. I'm running the following go version:

go version go1.14.4 darwin/amd64

Looks like the Linux and Darwin versions of the binaries are only a few hundred kb different.


Hello! You must be a fellow Googler.

(The program we at Google use just to build our bloated programs is itself 180MB.)


Cough python cough cough


... nothing except a vague feeling of self superiority, judging from the answers so far.


I mean, you're not entirely wrong.

But also, since this is client side code, a smaller download means smaller utilities on client machines. Which is nice.


I'm not familiar with UPX compression, so it might invalidate what I'm about to say. However, it can be beneficial for performance if more of your executable fits into the CPU cache (actual results may vary based on access patterns and many other factors).


UPX is just a (nice) compressor for binaries: running the executable really runs a small decompressor on the binary, which produces the original binary in RAM and runs it. So your binary is not smaller in cache when executed.


27 MB. Whether that is important is context-specific.


This is what I use. I don't use the -s -w link flags these days; I just run strip over the binary and it gets the job done, then upx to bring the size down further.
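
Concretely, that workflow is something like this (a sketch, assuming the package builds to a binary named `main`):

    go build -o main .
    strip main
    upx main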


Does the smaller binary execute faster?


No, but the smaller the binary, the smaller the download size. It also results in smaller Docker containers, which are a popular way to ship Go binaries.


tl;dr for those wondering: strip debugging and then compress the binary with upx:

    go build -ldflags="-s -w" main.go
    upx --brute main
Just tried that on one of my binaries and went from 12 MB down to 3.6 MB.


What about process startup time?


I tried to answer upthread (https://news.ycombinator.com/item?id=23912394), but the long and short of it is that it takes about 3x longer to start the program for my example (35 MB no compression, 27 MB after ldflags="-s -w", 8 MB after UPX) after UPX compression than before.


Since the Go compiler is so primitive, you can strongly influence the size of the program just by restructuring it. An example: if you have a function that converts integer return codes to errors, and you have another function that calls it a lot:

  func codeToError(i int) error {
    switch i {
       case ...
       ...
    }
  }

  func stuff() error {
    if rv := callJunk(); rv < 0 {
      return codeToError(rv)
    }

    if rv := moreJunk(); rv < 0 {
      return codeToError(rv)
    }

    ...

    return nil
  }
You're going to get multiple whole copies of the inner function inlined into the outer one. This can add up, if the outer function is long and/or if the inner switch statement is huge.

You can save a lot of code by jumping through an adapter function instead, like:

  func callWithErrorsConverted(f func() int, g func(int) error) error {
    return g(f())
  }
... because now the compiler doesn't get to inline g into the caller. These things make a big difference in whole-program size. I wrote about it in the context of tailscale and protobufs here:

https://paper.dropbox.com/doc/Bloaty-Puffs-and-the-Go-Compil...
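
For the curious, here's a self-contained sketch of the adapter version (with illustrative stand-ins for the functions above; you can watch the inliner's decisions with `go build -gcflags=-m` and compare against the direct-call version):

  package main

  import (
    "errors"
    "fmt"
  )

  // stand-ins for the real syscall-style functions
  func callJunk() int { return -1 }
  func moreJunk() int { return 0 }

  func codeToError(i int) error {
    if i >= 0 {
      return nil
    }
    switch i {
    case -1:
      return errors.New("operation not permitted")
    default:
      return fmt.Errorf("unknown code %d", i)
    }
  }

  // the adapter: the wrapper itself is trivially inlineable, but the
  // indirect call through g keeps copies of codeToError's body out of
  // every call site
  func callWithErrorsConverted(f func() int, g func(int) error) error {
    return g(f())
  }

  func stuff() error {
    if err := callWithErrorsConverted(callJunk, codeToError); err != nil {
      return err
    }
    return callWithErrorsConverted(moreJunk, codeToError)
  }

  func main() {
    fmt.Println(stuff())
  }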


I’m pretty happy I don’t have to worry about such details in C and can rely on the compiler to do the easy optimizations for me.


It's not really a language problem, it's a compiler problem, so at least in the future we might get a compiler that recognizes error-handling idioms. Or we just get LLVM? The language problems, like the performance-hostile calling convention, are harder to fix.


> like the performance-hostile calling convention, are harder to fix.

Can you elaborate on what you mean by this? I’m new to go and want to understand the performance implications.


not gp, but it looks like go's calling convention requires pushing all function args and return values to the stack [1]. that's always going to be less efficient than most C calling conventions, where (iirc) you put what you can in registers and only use the stack if registers aren't enough.

EDIT: and all registers are caller-saved, so when calling a function, all register contents will be saved on the stack (and restored later) even if the callee doesn't touch certain registers.

[1](https://dr-knz.net/go-calling-convention-x86-64.html)
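
for illustration, this is roughly what that stack-based ABI looks like from hand-written go assembly (a sketch; `add` is a hypothetical function declared in a .go file as `func add(a, b int64) int64`):

  #include "textflag.h"

  // func add(a, b int64) int64
  TEXT ·add(SB), NOSPLIT, $0-24
    MOVQ a+0(FP), AX    // first arg is read from the stack, via the frame pointer
    MOVQ b+8(FP), BX    // second arg
    ADDQ BX, AX
    MOVQ AX, ret+16(FP) // the result is written back to the stack too
    RET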


Wow, maybe they have some valid reason for that, but from where I'm standing it just seems like an extremely lazy and thoughtless design choice that belongs in a prototype and nowhere else.


it's not lazy, it's simple! ;)

i don't think it's intended as a stable target and will probably change at some point. the top google hit was a 3rd party site, and this issue is still open: [#16922 doc: document assembly calling convention](https://github.com/golang/go/issues/16922)

btw the source i linked in my original comment links to a proposal to allow passing args in registers


Does it help goroutines and the scheduler?


This is not a language problem, but a compiler problem also. gccgo uses registers like C.


It's more than a compiler problem: Go modules contain assembly routines which assume the Go calling conventions, so changing the convention is now quite breaking.

There's a lesson in there about success, or something...


How can that be true? Every assembly function in Go moves its arguments from memory relative to the stack pointer. If the arguments were in registers instead it would break every assembly function. Does gccgo not support asm?


> In contrast to 6g (the original Go compiler) gccgo tries to mimic the native calling convention.

https://dr-knz.net/go-calling-convention-x86-64.html#differe...


I just tried it and it won't build my existing Go packages that include x86 assembly files. So I conclude that it does indeed break compatibility with asm, which certainly is one way to change the calling convention.


true, thanks for pointing that out!


I think Go's calling conventions also don't allow for tail call optimization?


> It's not really a language problem, it's a compiler problem

This has never seemed like such a clear distinction to me. A language is essentially just an instruction set for a compiler.


Consider that a hobbyist C compiler isn't going to perform the same optimizations as gcc. Likewise, clang and gcc aren't totally equivalent, either, despite being based on the same language standard. Most languages don't have multiple compilers, but any non-commercial language can in theory have multiple competing compilers that don't give equivalent output for the same input.


Even commercial languages can have multiple competing compilers.


Well, there is a distinction.

It being a compiler problem, you can just accept the performance/size penalty and leave your code as is, hoping that in the future an eventually-really-smart compiler will accept your current code as is and produce output without the penalty.

A language problem would mean you have to rewrite your code differently.


A language is both the abstractions that it implements, and the way it implements them. Any problem in the code you write could be solved by changing the compiler as much as it could be solved by changing the code. The two parts of the system are inseparable. I’d say a language can only be as good as its best compiler.


But they are quite different. Some languages are tightly coupled to a single implementation, but that's not true of all of them (Go included). You could write all your applications in the Go language but use a transpiler.

And as other comments call out, it is important not to conflate the two. A language issue is one that can't easily be fixed without breaking backwards compatibility. Compiler optimizations can be made (and are, all the time) without breaking the language.


Wait, isn't that trick basically about avoiding an optimization (inlining the function to avoid the call overhead) to reduce code size?

A C compiler would do something similar here and also increase code size for performance, unless you explicitly use `-Os`.


The better C compilers use better heuristics in these cases, knowing that binary size can mean I$ pressure and therefore be a detriment to performance.


They also tend to have ways to override those heuristics (e.g., `__attribute__((noinline))` in GCC).


Go has //go:noinline for the same effect. I don't think you need these weird trampoline calls, and they are probably somewhat fragile, since a future compiler might figure out how to inline through them.
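
For reference, the directive sits directly above the declaration; a minimal sketch reusing the hypothetical codeToError from upthread:

  package errs

  import "errors"

  // codeToError is kept out-of-line: the directive below tells the
  // compiler never to inline it, so each call site pays a call
  // instead of a copy of the body.
  //
  //go:noinline
  func codeToError(i int) error {
    if i == -1 {
      return errors.New("operation not permitted")
    }
    return nil
  }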


Nobody needs to bounce through trampoline calls like this, but for people willing to chase a modest code size decrease like the 6% in the article, it's worth knowing that how your code is arranged may have a larger effect on program size than you're used to from other compiled languages.


Cache effects mean this is not a pure tradeoff.

Any inlining routine needs a cost model. From the article, it seems the Go compiler's cost model is not yet well tuned.


ohhh you think weird stuff doesn’t happen in C compilers? heh


> Since the Go compiler is so primitive...

That doesn't sound like Go's compiler is primitive; a primitive compiler wouldn't attempt the inlining at all. This sounds like a performance optimization, which isn't something a primitive compiler would attempt.


The compiler is not doing any sophisticated analysis at all; it can inline because the language is designed to be easy and fast to compile, but in my opinion at the expense of a little too much.

I went through most of the compiler about two years ago to see, amongst other things, how much of its time was spent on IO, since both compile time and the correctness of dependency handling varied wildly in virtualized environments.

Which is where I learned a lot. For example: the correct way to invoke the compiler is essentially impossible with the supplied tools, but if you manage it, it was actually possible to avoid recompiling the system libraries on every build; and refactoring code where module scope is used for most things is incredibly tedious. To compile to/from memory, as a service, I had to change approximately 10,000 lines of code without being entirely done, because everything was tied into the module namespace and a somewhat too concrete vfs-like file access mechanism. Both the lifetimes of data and where IO went were impossible to change without touching almost every type declaration site, and often every declaration and attribute reference site; anything module-based needed to be threaded through in the appropriate places.

And because some package names are hardcoded to be forbidden by parts of the toolchain unless it knows it is compiling itself, you have to patch the compiler (IIRC) and the build tool if you want to compile it under a different name, which you kind of have to do when you are doing something this experimental; otherwise it would just overwrite itself because of GOPATH. Because of how the compiler/build works, this becomes contagious, so you need different names for everything. You could probably get away with a little less renaming by building the compiler with make, but by the time I realized that, I was too far in for backtracking to be worthwhile.

The hardest issue was a mutex deadlock: it turned out the concurrency in one of the main compiler goroutines was not implemented correctly, so without the implicit serialization of the disk IO, it deadlocked on almost every compile.

Apologies for the digression; Go was kind of forced on me, and some of its most touted "opinionated" parts cost me countless hours of grief.


TBH i do not really know much about Go, though i did spend some time reading the compiler's code last year and found it very readable despite not having any real knowledge of the language. It does try to keep things simple, but at the same time i do not think it is a primitive compiler - that would probably go to something like Tiny C Compiler, though even TCC has some optimization features, i think (in general i wouldn't say that any compiler that has an optimizer can really be called "primitive") - as it does a lot of things that strictly speaking aren't really necessary to produce executables. I'm certain that it is possible to create a much simpler and much smaller compiler for Go than the official one.


Pretty insane that the compiler for such a popular language is so basic.


it's also one of the fastest; these things were a choice.


It's even faster (at least on OSX) if you can invoke it correctly, which at least 2 years ago was almost impossible.

If you manage to supply exactly the packages (transitively), and not a wildcard more, it would avoid recompiling the system libraries on each invocation. One would think a compiler where so much effort was put into making it fast would make it easy to run it fast? The primitives were sort of there, but they didn't really fit together, and I actually had to write quite a bit of script/code to make it work. It got very fast, though; I thought I'd broken it the first time I got it to work. <return> and instant prompt. Took me a while to convince myself it was actually compiling anything at all.

Cut the recompile time in a VM to less than a tenth. Still slowed down by some pointless file IO (I don't remember exactly what), but that was less of an issue.

Vanity package URLs, though, were what broke this camel's back, after GOPATH and some other miscellanea had strained it. When I ... probably needed to quickly fork+fix an external package, and suddenly had this additional obstacle, I was so ... unhappy. Already too much work, and then you get more work for no discernible reason at all. It's even called a vanity URL officially.


A feature flag doesn't have to impact compiler performance unless enabled.


I don't know if it would be in the spirit of Go.

(I'm not for/against the philosophy of the decision making, just pointing it out).


it does, indirectly, in a few ways: code size affects performance, and human effort gets spread over more code paths and more branches.

you could have two totally separate compilers, which go does - there is the gcc variant. i don't know if it tends to produce faster binaries though.


This didn't seem to change anything for me. I am currently using Go 1.14.4 (Windows), and after trying go1.15beta1, the resultant executables are actually larger.

Another thing I've always found strange is that people always recommend (-ldflags -w), when (-ldflags -s) gives a smaller result, and (-ldflags '-s -w') is identical to (-ldflags -s).
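
(Easy to check on your own toolchain with something like this, assuming a main package in the current directory:)

    go build -ldflags=-w -o app-w .
    go build -ldflags=-s -o app-s .
    go build -ldflags="-s -w" -o app-sw .
    ls -l app-w app-s app-sw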


Given the article is talking about DWARF debugging formats, I believe the numbers in the article are for Linux binaries. I'm not sure if Windows binaries store line numbers in the same way.


Doesn't Windows usually just use a .pdb file next to the exe/dll, containing the debugging information? So stripping debug information on Windows is just deleting (or not distributing) that file.


Yes, .pdb files are used by most toolchains for debug information on Windows.

No, Go doesn't use .pdb files.

On all platforms, Go stores a significant amount of metadata in the binary, in the same format, so that it can be accessed with the same code everywhere.

This information is used by the runtime. The precise garbage collector needs to know the layout of structures and the stack, to know which fields are pointers; to generate readable call stacks, Go needs info about functions and where they come from (source file name, position); etc.

A side note: you can convert the DWARF debug info to .pdb using an external tool, which allows debugging Go programs with Windows-native debuggers like WinDbg.
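
(That metadata is what powers readable stack traces; a tiny illustration of the runtime reading it back:)

  package main

  import (
    "fmt"
    "runtime"
  )

  func main() {
    // file and line come from the metadata Go keeps in the binary itself
    pc, file, line, _ := runtime.Caller(0)
    fmt.Println(runtime.FuncForPC(pc).Name(), file, line)
  }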


Not entirely, there's also a debug directory embedded inside executables that gets stripped.


> You can usually save double-digit percentages by stripping debugging information: pass `-ldflags=-w` to `go build`.

Is there a way to make `go build` for a particular project always use `-ldflags=-w`? Or do I have to remember to type it in every time (or hide my `go build` command inside a Makefile or `build.sh`)?
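
(The wrapper option would just be something like this hypothetical build.sh:)

    #!/bin/sh
    # build.sh: wraps go build so the flag is never forgotten
    exec go build -ldflags=-w "$@"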


How much better does a 6% smaller go binary perform?


Shrinking the line table will be irrelevant to performance. As I understand it, the tailscale binary has a hard size requirement to run as a VPN under iOS.


Your HTTP cert is invalid. Just a heads up.


It's fine for me and was issued May 26. The URL when I visited was https://commaok.xyz


His browser might have opened the following link somehow: https://www.commaok.xyz/post/no-line-numbers/

It's serving a certificate for github.com.


There's an error for https://www.commaok.xyz/, for what that's worth.


If it's showing up as invalid for you, you may have locally revoked trust of the DST Root CA X3 IdenTrust root certificate.


The www subdomain serves the wrong cert. https://www.commaok.xyz/ gives you an SSL error but https://commaok.xyz/ does not.


Hi - unrelated to this thread.

In an older thread you mentioned the acronym TFA. The thread was a discussion on sparse files and removing bytes from the front of a file.

What is TFA?


I wonder how much from JavaScript minifying can be reused in this space.


Probably not much? JavaScript minifying works on raw source code, whereas these are binary instructions + debug info.



