Simple techniques to optimise Go programs (stephen.sh)
273 points by sjwhitworth 66 days ago | 83 comments

Most of these techniques sacrifice readability, so before implementing them you should really profile to make sure that the code in question is a hot spot in terms of allocations or CPU usage.

That said, the example using sync.Pool is not quite right. The New function should always return a pointer type to avoid incurring an additional allocation on Get [0].

The code should look like:

  var bufpool = sync.Pool{
    New: func() interface{} {
        buf := make([]byte, 512)
        return &buf
    },
  }

  b := *bufpool.Get().(*[]byte)

  defer bufpool.Put(&b)
[0] see example in https://golang.org/pkg/sync/#Pool

There's a little at the bottom that encourages caution. In a previous draft, I wrote about the importance of proving hot spots before optimising, but took it out as I wanted this to be more a list of techniques.

On sync.Pool: good spot. I'll update it - thanks.

Unfortunately, that's still wrong (I think). Go will still allocate a temporary slice-header on the heap and then de-allocate it when it's dereferenced.

I believe the only way to do this without unsafe code is to pool these slice-headers as in https://github.com/libp2p/go-buffer-pool/blob/master/pool.go.

The risk of write-profile-optimize is that you might write a big slow program with a totally flat profile. If it doesn’t meet performance requirements at that point, then what?

In your hypothetical scenario, you then optimize whatever is easiest to optimize.

I think there is a difference between architectural changes that would inhibit the overall performance of an application vs language specific optimizations like the article is referring to.

My comment re write-profile-optimize was geared towards the latter: strconv is usually uglier to read than fmt.Sprintf, so you should probably leave it alone unless you've profiled and found it to be a problem.

Does that ever happen?

I've seen examples of actual production systems where there is just a general proliferation of slow practices that make the system slow without any one function standing out, yes. In C#, such practices could include using interfaces (for dependency injection), LINQ (for readability), and reflection. All of these have fast alternatives with varying levels of downsides. The downsides aren't as big when you are practiced at them.

Also, when profiling production code, it can be hard to FIND the slow function since the optimizer may inline things.

There is perhaps value in understanding your language and ecosystem very well, so that you have decent intuition for what is fast, and making default-fast choices even if there is a small readability cost. That cost may be made up for by not having to crunch on performance later. As well, many performance choices make code simpler; after all, the goal is to do less, and less is less.

Yes. This is the plague of modern software. It is all over the place. So much software today spends all its time chasing pointers, using the cache poorly, and branching wildly.

You won't see these aspects of poor design on a sampling profiler. You will see it by running e.g. perf on Linux and seeing pitifully low IPC and cache miss numbers.

This is one of the reasons I really love Rust: it not-so-subtly nudges you towards avoiding heap allocations by adding extra boilerplate and making them obvious in type signatures (unless you hide them behind wrapper types).

This contributes to Rust programs generally having good performance characteristics without spending time on optimizations.

I find the exact opposite for Rust, as it often encourages boxing random objects to satisfy odd lifetimes.

That being said, of course, in almost all of these cases you can restructure your program so you don't need to box the values, but if it's not performance critical, why bother? Repeat a couple dozen times across a large codebase and you have the same pointer-chasing issues.

The NLL improvements of last year have improved things quite a bit. It's still not perfect, but in my opinion it has reached a point where this is mostly an issue for Rust beginners.

Some patterns of writing code will be really awkward to realize, but there are usually "more rusty" solutions that you start to apply without even noticing. Once you write code with the desired ownership semantics in mind, it's often (relatively) frictionless.

It is still an issue for doing UI related coding, due to the way it is common to design such kind of systems.

> This is one of the reasons I really love Rust: it not-so-subtly nudges you towards avoiding heap allocations by adding extra boilerplate and making them obvious in type signatures (unless you hide them behind wrapper types).

Could you clarify this? It seems like the opposite to me. Borrowing requires lifetimes in type signatures, but boxing yields owned objects, which can be passed around easily like value types.

I’d say it happens more frequently than being able to optimize yourself out of a perf problem in my experience.

Too many people have taken the “make it correct then make it fast” advice too far. The definition of “correct” needs to include performance parameters from the beginning because usually the bottlenecks that cause real issues are architectural.

It is my extremely studied experience that this is rarely (read: almost never) the case, and that the far greater risk is of radically less- or entirely un-maintainable code produced by developers who have incorrect intuitions about what needs to be fast.

I suppose this is why they call it anecdotal evidence, as our experiences don’t agree.

What I will say is that code that is architecturally correct for the necessary performance requirements is largely no less understandable or maintainable than code that is incorrectly designed for the performance context. You don't end up in the kinds of harder-to-read optimizations you see in this article in those cases any more than you do under the other case.

Another consideration is if your architecture is wrong for your performance space, nothing will help but a rewrite. If your code is optimized in the small too early, you can always rewrite in a way that is more clear.

You're talking about architectural changes; this article is about, essentially, micro-optimizations. Good architecture is often faster than bad.

Your design should indeed take into account performance requirements. But micro-optimizations like these (almost all of these changes are to avoid linear numbers of reallocations, with the exception of string builder) don't give you order of magnitude speedups unless they're in hot loops anyway.

Profile then optimize means profile, then optimize. Designing good software isn't optimization; it's designing good software.

It makes a big difference if all your function signatures in a Go program take a string parameter or a []byte parameter, or if they take some other expensive type by value. Refactoring this later can be close to impossible. I would say rather than micro-optimizations these are the choices that must be made correctly in the early phases of development to avoid having an unfixable performance problem later.

Literally the thread I'm responding to is asking what the value of profile-then-optimize is. My experience is that you are likely hosed if you've gotten to this point in the 80% case. That isn't to say these techniques have no value; I've used many of them myself. But the question was, "how often does it happen that your optimizations are for a flat profile?" For me, the answer is "most of the time".

And the broader context is when talking about micro-optimizations versus reasonable architectures.

You shouldn't micro-optimize before profiling, because it likely won't matter. Bluntly, if you have a flat profile, none of the optimizations in this article are relevant anyway. You'll be able to pull out single digit percentage speedups, maybe.

The profile-then-optimize argument isn't meant to be about architecture. Yes, you should build performant architecture. Yes, you should take time to plan a performant architecture before building[+]. But the question of profile then optimize is never (except in the strange way you're bringing it up) about doing macro-optimizations before you've written a line of code. It's almost always in the context of "don't just try to optimize what you think is slow, because you're almost always wrong".

Big-O style speedups from architectural changes aren't micro-optimizations, they generally sit outside of that conversation entirely.

As an aside, flat profiles are in practice exceedingly rare. Most (useful) programs do the same thing many times. It's very unusual to see a program that isn't, in essence, a loop. And the area inside the loop is going to be hot. The Pareto principle applies to execution time too.

[+]: Maybe startups who gotta ship it to survive as the exception.

> You shouldn't micro-optimize before profiling, because it likely won't matter. Bluntly, if you have a flat profile, none of the optimizations in this article are relevant anyway. You'll be able to pull out single digit percentage speedups, maybe.

Glad we agree.

> The optimize then profile argument isn’t meant to be about architecture

Glad we agree. If only all the people who tell me “correct before performant” agreed with us. In practice, in my experience, this is not the case. People use it in day to day conversations at the earliest parts of conversations about architecture all the time. If they didn’t I wouldn’t have nearly the problems I do with the statement.

> As an aside, flat profiles are in practice exceedingly rare

This seems to be the most controversial part of our disagreement. In my experience, that is flatly untrue. Especially when talking about systems where the performance does not meet the requirements. I can count on 1 hand the number of times I’ve seen systems go from “unacceptable” performance to “acceptable” via micro optimizations. I’ve never seen one go to “great”. I don’t know how to quantify this though, so I’m willing to leave this in the realm of my experience is different than yours.

All that is to say, my experience says that systems that don’t treat performance as first class requirements don’t tend to meet their performance expectations.

All of which is neither here nor there based on the article but is directly related to the question of ‘what do you do with a flat profile’?

Kanev[1] disagrees. Flat profiles are the common case in actual practice.

Edited to add: Since apparently you also work at Google, you should walk over to Svilen's desk and just ask him if profiles of production software are generally flat, or if they generally have hot spots.

1: https://static.googleusercontent.com/media/research.google.c...

You're citing a paper using, as an example, an already highly hand optimized and performance-guided optimized binary.

That's what you get after you profile and optimize.

For me, all the time. But the two points about "write, profile, optimize" that people usually leave out is that profiling/benchmarking is mostly useful as a comparison tool (are your optimizations actually doing anything?) and that part of profiling is asking the question, "how fast can this go?" which is usually a difficult question to answer, but when you get really granular you can usually reason about it.

Your example doesn't seem right. Doesn't the Get perform an allocation anyway because you access the value of the pointer and store it in a new variable b? It's very likely b will escape to the heap.

I agree with nhooyr's analysis. The interface{} will anyway transparently "contain" a pointer-to-the-[]byte, in other words, the []byte value itself will be heap allocated.

(Note for anyone new to this that the "[]byte-value" - we say "the byte slice" - is a distinct thing from the "values stored-in-the-byte-slice", which is a heap allocated backing array)

So maybe we should sacrifice compiler speed for having a better compiler optimize more readable code?

Most compilers skip optimizations when compiling in debug mode, so that it is still quick to iterate/test, then perform the optimizations which take longer to compile for release builds. Obviously it means one can only profile code when it has been compiled in release mode, which I guess has its drawbacks...

I do wish Go would do this. The fast compiler thing is fantastic, but I think we can have the best of all worlds by supporting a --release flag, and I think Go is large enough now that we can justify the extra maintenance cost of dealing with two compilation modes.

Is a "fast compiler" really all that impressive though when a lot of the speed comes from not checking things that other compilers do?

A lot of speed also comes from some decisions about the language architecture. Go packages can be compiled separately and the compilation result in binary form can be just used by code importing the packages - usually no recompilation of these packages is necessary.

The speed isn't from "not checking things that other compilers do"; it's from not performing more aggressive optimizations.

Does it really matter how the compiler does it? Faster compilation speed makes programmers happier. And it's not like you can convince a C++ or Rust or Scala compiler to just trust you with the types this once and compile your program ten times faster.

I would say that yes, it does matter how the compiler is faster. Turbo Pascal, for instance, has a fast compiler (still faster than Go's last time I checked) and it is fast because it was hand-optimized in assembly to be so. That's impressive; especially so back in the days of the 286.

Golang's compiler is fast because they don't check everything they could, the language itself was intentionally dumbed down, and because it doesn't bother dealing with dependencies as tightly (or as well) as other languages. I'd say it is important to know that.

Type checking is a very small part of Rust compile cost; most of it is due to the extensive use of monomorphized code (resulting in code bloat which is not always well-controlled, to put it mildly), as well as overhead in the LLVM-dependent part of compilation. Some features in the pipeline for Rust should enable some improvements, particularly const-generics and specialization.

Go is specifically designed for fast compilation. You can't just skip some checks and get similar results.

Nobody is expecting Rust compile times to ever match those of a language whose premier guiding principle is fast compile times, but the question is whether or not Rust compile times can become fast enough to keep Rust programmers satisfied. I'd say that programmer attention span is generally sufficient to tolerate iteration times of five seconds or less for interactive editing. Who knows if Rust will manage to get there, but the Rust developers I've spoken to seem to earnestly believe that an order of magnitude improvement is possible (over the course of several years of development, however).

I think what people are trying to get at in this thread is that they'd be happy if their Go compiles took 0.5 seconds instead of 0.1 seconds, if it also meant that it lessened the need to manually employ the techniques in the OP. Of course, keeping such tradeoffs within reason is a difficult prospect.

> the Rust developers I've spoken to seem to earnestly believe that an order of magnitude improvement is possible

I'd like to believe this is the case, but has that ever actually happened, as in a programming language seeing that much speedup in its compiler? There are quite a few cases of languages that are generally taken to have long compile times (C++, Haskell, Scala, etc.), but I don't know of a case where any has been able to improve by as much as an order of magnitude.

Often speedups of that magnitude involve "cheating" by somehow rethinking the problem space. An example of this applied to compilers is the venerable ccache, which absolutely does provide a 10x improvement or more compared to naive use of the compiler (in this case the "cheating" is caching build artifacts so that most things don't need to be re-built at all; this is often devastatingly effective). I also suspect that the reason we don't see compiler developers regularly advertising order-of-magnitude improvements is that these improvements often span a decade or more of work. Clang 2019 may not be 10x faster than Clang 2018, but it very well might be compared to Clang 2004.

As for Rust specifically, a benchmark I saw a while back estimated that modern rustc compilation is about 3x faster than it was as of the 1.0 release in 2015. Other "cheating" approaches to reducing build times for Rust are incremental recompilation (which is kind of like built-in ccache), parallel codegen units (the original rustc, ironically, was single-threaded), a compilation mode that stops after typechecking (thereby avoiding codegen and LLVM entirely; this is invoked via `cargo check` and suffices for much of one's interactive development), and an alternative debug-build backend (Cranelift, still very WIP) which is optimized for extreme compilation speed (like a browser JS engine) rather than compilation quality (LLVM being largely designed for the inverse).

Indeed, the first step to making a fast compiler is "design your language from the start such that it can be compiled quickly" (which can be seen in the design of Go). I can think of a few things in Rust that might have been designed slightly differently (in the name of compilation speed) if the authors had the compiler-building wisdom they do today. But there are still plenty of opportunities to improve here.

C++ precompiled headers easily can give you your factor of ten.

Also, taking ‘its compiler’ literally: Turbo Pascal was over ten times as fast as earlier pascal compilers for the same hardware, which were multi-pass.

Isn’t that the entire point of higher order programming languages?

What are "higher order programming languages"? Did you mean "higher level programming languages"? If so, no, the point is that you can program more abstractly, almost always at the expense of performance.

EDIT: Downvoters, I'm terribly curious which of these points you disagree with? Do you believe higher level languages (like Python, JS, or even Go) are faster than lower level languages (like C, C++, etc)? Or do you believe that HLLs are actually faster (or meant to be faster) than LLLs? Or maybe you are downvoting because I don't know what a "higher order language" is?

The downvotes are because you have chosen to make an irritating pedantic point about word use when the meaning was entirely clear in context.

Wasn't trying to be pedantic; whatever is "entirely clear in context" is not entirely clear to me at all. Did he use a term that I'm unfamiliar with or did he use the term I'm familiar with in a way that is incongruent with my understanding?

Informally, there is no difference between higher order programming languages and higher level programming languages.

Thanks for the explanation; I hadn't heard that term before. :)

The JSON marshaler actually caches struct serializations so that you don't incur the reflection cost on every run.

> a previous version of this blog post did not specify that the New() function should return a pointer type. This avoids an extra allocation when returning through the interface{} type.

While this is good advice, it's not entirely correct. Even with the current Go compiler there are ways to use sync.Pool with non-pointer values without incurring the extra allocation, e.g. using a second sync.Pool to reuse the interface{}. Although I would not recommend it, as it's slower and much less maintainable.

> The safe way to ensure you always zero memory is to do so explicitly:

  // reset resets all fields of the AuthenticationResponse before pooling it.
  func (a *AuthenticationResponse) reset() {
      a.Token = ""
      a.UserID = ""
  }

I think this is safer, in face of modifications to the AuthenticationResponse structure, and much clearer in its intent:

  // reset resets all fields of the AuthenticationResponse before pooling it.
  func (a *AuthenticationResponse) reset() {
      *a = AuthenticationResponse{}
  }

> During a garbage collection, the runtime scans objects containing pointers, and chases them. If you have a very large map[string]int, the GC has to check every string within the map, every GC, as strings contain pointers.

This would, of course, be much less of an issue with a generational GC, which doesn't have to scan the entire heap on every collection.

> In A/B tests, we tried delaying the page in increments of 100 milliseconds and found that even very small delays would result in substantial and costly drops in revenue. - Greg Linden, Amazon.com

Just curious, do vendors on Amazon get reimbursed for the drops in revenue during tests like this?

They'll be imperceptible to individual vendors, as the cost is distributed across thousands. "Substantial" is extrapolated from a small (but just large enough to draw strong statistical conclusions from) A/B rollout.

Should Amazon charge vendors more when they identify where to invest more engineering effort, as a result of these A/B tests, which eventually lead to far higher increased revenue?

I'm going with "yes, it's part of the cost of using Amazon as a service to sell your goods"

Replacing strings with ints would be more believable with a real world example.

Dictionary encoding strings to integers is a common compression technique. In GC'd languages it fakes out the garbage collector because your dictionary codes are essentially pointers but not actually pointers from the GC's perspective. You also have the benefit of potentially using a smaller integer than a pointer if you know roughly the number of values you're encoding beforehand.

Does this hold for Go? The tracing is the only relevant cost and tracing is very, very fast, so I would expect this to be negligible? Or am I misunderstanding the hypothetical scenario?

Good point. Here's one off the top of my head: a giant map that caches some count by date. You can represent this as a map[time.Time]int (native Go time type), a map[string]int (time represented in some string format), or a map[int]int (time represented as an integer like 20190718).


I had the same thought but then had a better thought. What type of optimizations would you need in Rust? Are they similar or completely different? I know that Rust doesn't have GC so that removes some of the methods in the original post, but I'm sure there are places in Rust that could be better/faster.

> Allocate capacity in make to avoid re-allocation

Am I the only one who has done this, used append() within a loop, and ended up with a slice that is 2x the desired length, where the first half of the items are all zero values?

I quickly caught it and fixed the approach (instead of append, I used the `i` as the insertion index into my new empty slice which already had capacity allocated), but the ease with which that subtlety could be overlooked turned me off to this approach unless I profile and really find it's a hot path.

Edit to add: I tried his code, and it resulted in the new slice being as expected.

The make function takes the type, length and capacity. If you don’t supply capacity, it defaults to length. So if you want to use append, do something like: make([]int, 0, length)

That will allocate a slice that can fit length ints but has a length of 0 (so append starts at index 0).

You can call len(slice) and cap(slice) to see the difference, but append will insert an element at index len(slice), growing the capacity if necessary

Ahh thanks for the great explanation! Love HN.

make can take three parameters for slices, which is the type, the initial length, and the initial "capacity": https://play.golang.org/p/8d1mS1mGsW6

By default, the slice is filled in and fully initialized. On the one hand, IMHO and IME that's not really what you want if you're going to append things which is a pretty common case, but on the other hand, worrying about "capacity" for slices, while perfectly sensible once you understand the abstraction, is something you can get in trouble with if you're not paying attention, because it makes you really pay attention to the fact the slice has an underlying array.

(Many languages have this concept somewhere, Go just happens to put it in the syntax: A "slice" is an object with a reference to an underlying array, an index of the first element, a length, and a "capacity" which can be smaller than the underlying array has room for. If you want to grow past the underlying array, the runtime then has to create a new one and copy the elements out into that.)

Similarly, if you have a slice of ten elements and you want to take a subset of it and start appending your own bits and pieces, you have to do that carefully: https://play.golang.org/p/QcZOndSp-9x There's a syntax for reducing the capacity of the new slice deliberately, which forces any new "append" to copy to a new slice, producing the behavior you want.

For as tricky as this can get, I'm surprised it doesn't come up more often, but so far in the last five years, this was the source of my bug I believe twice, and one of them I was doing rather inadvisable things anyhow. It turns out that in most "just code" you're not cutting lists up and distributing them various places very often.

You're running into the difference between

    make([]T, 0, n)

    make([]T, n)

I occasionally hit this bug, but it's pretty easy to spot after I introduced it a couple of times. Allocations in Go are slow enough that I usually do this by default unless I know that performance absolutely will never be critical (e.g., doing CLI boilerplate that just needs to be fast on human timescales).

Missing technique: rewrite the performance-critical parts of your Go programs in a different language, and use Cgo to make them accessible to Go code via the standard C ABI. K.I.S.S.

This isn't simpler at all. Managing CGo is many times more difficult than a little optimizing. If you're able to write the hot path more performantly in C or similar, then you're perfectly capable of optimizing the Go version--the principles are the same and you don't need to incur the costs of crossing the CGo boundary nor do you need to reason about the memory/ownership handoff between languages or any of the other half dozen downsides of CGo.

This is not simple in any way. cgo is convenient for being able to use "lib*" packages, but calling these packages is not free, and certainly writing cgo for performance reasons is not likely to be a gain.

Go is plenty fast for most things, notably things that work well with a GC... but cgo doesn't make the GC go away.

This is a good way to ensure a relatively slow (by comparison) security review for your PRs as every dumb string manipulation and every allocation needs to be checked.

'I am deliberately ignoring techniques that require significant effort, or large changes to program structure.'

Hopefully that's /s, as debugging CGo stuff is a royal pita. ;)

Also, the overhead of calling out to C can actually be quite high: https://www.cockroachlabs.com/blog/the-cost-and-complexity-o...

While some overhead and increased debugging effort may be inevitable, it's a mistake to blame them on the use of Cgo itself; the root cause is Go's custom ABI and user-level-threading ("goroutines") model. And I really have to dispute OP's claim that the techniques they mention do not "require significant effort, or large changes to program structure" of their own. In many ways, rewriting some portion of the code can actually be simpler.

Well, it may be true that the "blame" for why CGo is slow is up for debate, it's just a fact that it's rather slow to use.

Go isn't unique in this; there are many languages with runtimes that require similarly intensive amounts of copying and context conversion before C code can run on whatever the data is. However, most of those languages, like CPython, are themselves slow enough that the penalty isn't as noticeable against the general background noise. (In general CPython requires a lot more copying too; Go is closer to C struct and array semantics and can more often get by with some form of memcpy, the internals of Python look nothing like that.) Go is fast enough that it's much easier to get into scenarios where in a tight loop you're spending 90% on CGo overhead if you're not careful with data flow. For those languages that have to copy a lot out of their runtime and are also fairly fast, they'll face the exact same issues. It's not really a "Go" issue per se, but the challenges faced by any language that wants a runtime significantly different from C.

(One of the miracles of Rust is building an environment and runtime that isn't stuck on C's limitations but at the same time can still speak to C really, really cheaply. Plenty of languages have one or the other of those, but there aren't very many that have both. I'm not sure there's any other language that has threaded that particular needle so cleanly.)

> Go isn't unique in this; there are many languages with runtimes that require similarly intensive amounts of copying and context conversion before C code can run on whatever the data is.

I agree, and some of this overhead is even inherent in legitimate, foundational choices such as VM-interpreted/"managed" code (as in Java/.NET; but Python does this as well) or the use of tracing GC (which requires some strict discipline on heap contents, so as to enable the GC itself to reliably "trace" and discover the semantics it cares about). So, I'm definitely not saying that the choices made as part of Go's design are consistently wrong here!

Indeed, Haskell is in a very similar place overall; the "fibers" that GHC uses in its compiled code are implemented via async code underneath, much like Go's goroutines, and Haskell's performance is also well within an order of magnitude of pure C-like code. So the issues you describe are quite well-understood in that context, and they are definitely not regarded as a "reason" to avoid the use of C FFI when performance requirements call for it. But describing Cgo itself as something that's high-overhead and should not be used for that reason is misleading to an even stronger extent, and that's what I was objecting to in the grandparent comment!

Well said. Also worth noting that most of the languages like CPython are very slow because many optimizations are now prohibitively difficult given the "easy C-interop" constraint.

Isn't that more because they chose to support C interop by giving C access to interpreter internals? That's the easiest option to implement, but not the only way to get easy-to-use C interop.

That could be. Although I'm not familiar with any languages that are fast and have a GC and easy C interop. At least none of the major VM languages or (non-embedded) scripting languages. Maybe D or some other similarly non-mainstream language.

Lisp? Common Lisp's FFI facilities are good, from all I've heard.

Could be. I’m not that familiar.

This is Go not Python.

For me, the best way to optimize a Go program is to write it in Rust.

Please don't do this here.

I find it interesting how a lot of Go devs, myself included, agree that Rust is better in terms of language design, performance etc but still use Go. Personally, it's the productivity that has me reaching for it first
