Iterative Optimization on Hot Paths in Go Apps

blaisio · on Oct 5, 2018

I don't really get this article. It isn't necessary to optimize for minimal allocations if it doesn't affect response time. They mention the new technique was making too many allocations; why do they care how many allocations are being made, if response time is the same?

berfarah · on Oct 5, 2018

(author here)

Good question! Because this is on a hot code path, we were allocating new values faster than the garbage collector was able to remove them. Over extended periods of time, the server would trend toward running out of memory.

blaisio · on Oct 5, 2018

Interesting! That sounds like a garbage collector bug! It's probably worth opening a ticket.

dullgiulio · on Oct 5, 2018

In one work: throughput. More CPU time spent doing useful work rather than GC scanning. Makes you need less scaling out to handle increasing load.

blaisio · on Oct 5, 2018

They didn't mention having problems with or even measuring throughput in the article, and the number of allocations a program makes is not the only thing that affects throughput. So again, it is unclear why they would spend time optimizing the number of allocations.

chrisseaton · on Oct 5, 2018

How can you increase throughout without reducing response time?

simcop2387 · on Oct 5, 2018

Throughput is going to be affected by anything that can bottleneck your application. Even if response time is the same, if you reduce the time spent cleaning up between responses and requests then you can handle a higher number of requests with the same number of workers. If you reduce the amount of memory being allocated you'd also be able to run more workers on the same hardware, also increasing throughput. And as you're implying, reducing time spent making responses would also allow you to increase throughput too.

chrisseaton · on Oct 5, 2018

I still don’t understand that maths - if you both get n requests per second and the time per request has not changed then the throughout is the same isn’t it?

Time between request translates directly into response time as someone is waiting during that time aren’t they? If nobody is waiting and it’s not added to anyone’s wait time then who cares?

The answer really is power consumption during GC even if nobody is waiting, but you didn’t mention that.

simcop2387 · on Oct 5, 2018

Throughput involves both time to process a request and how many requests you can process per unit time.

If you only take 10ms to process a request and give the data back to the user, but then take 200ms afterwards cleaning up after yourself (garbage collection, background tasks etc). then you can only serve up less than 5 requests per second per worker. If you allocate per request 200mb and then free it afterwards and only have 1GB of memory on the server, you can only have a max of 5 workers, so in this case you can only have ~25 requests per second throughput. Fixing either of those cases means you can have a higher throughput in the end without having to scale it out across more servers, since you can prevent the future requests from waiting by either being able to have more workers, or have workers do less work between requests. This isn't even necessarily GC work, it could be sending off jobs to send emails or other jobs that were related to the request regardless of what they were. All of this still ties up the worker that could be handling the request.

Also, the number of requests you get has nothing to do with the number of requests you can actually process. You could be able to process 100k requests per second, but only get 200/second, or vice versa.

chrisseaton · on Oct 5, 2018

Sorry I still don’t understand that - if you have a delay caused by GC before you can respond the next request then this adds time to the request which is waiting during that delay which increases response time.

dullgiulio · on Oct 6, 2018

It depends on what you call response time: if you see it server side, you might count it only from when a request is accepted, from the client side, since the request is sent.

Because you can scale out, it makes sense to measure response time from the server perspective (from when the request is accepted.) Silly example...

Say you have two CPU: you can respond to (CPU bound) requests at a time. When the parallel GC kicks in, it fully occupies one CPU. Response time does not change, but you are handling half of the previous requests per second.

You could scale out, and have two machines handle two requests in parallel when both are running GC.

pcwalton · on Oct 5, 2018

More optimization based on making escape analysis happy to reduce allocations. This is yet another data point in favor of a generational garbage collector with bump allocation in the nursery.

chrisseaton · on Oct 5, 2018

But escape analysis is useful no matter what collector you are using.

dullgiulio · on Oct 5, 2018

That's less than a fourth of what the article says.

Copying data is sometimes faster than allocating it (and the TCMalloc-inspired Go allocator is quite quick.) Many small, allocations are inefficient in any language... There are other interesting points.

And anyway, a bump allocator with nursery will spend longer time vetting the nursery and copying long-lived objects. In Go this would not pay back, as the nursery is mostly staying on the stack without special tricks to please the escape analysis.

If you argued that Rust lifetimes system makes it easier to reason like the escape analysis would, I would agree, but you preferred to drop a tweet-size rant...

pcwalton · on Oct 5, 2018

tcmalloc cannot compare to allocation performance in HotSpot, where allocation is about 5 or 6 instructions. This is the same size as a function prologue and epilogue.

You're denying that the generational hypothesis holds for Go heap objects. I see no reason why this would be the case. In fact, Go's situation is very close to that of .NET, where the generational hypothesis certainly holds and therefore .NET has a generational garbage collector. This very article is evidence in favor of the generational hypothesis, since much of the optimizations boil down to reducing short-lived heap allocations.

lossolo · on Oct 5, 2018

And again same thing about HotSpot that you repeat third time in last week-two weeks?

Java has new GC which is non-generational and is focused on low latency like Go GC is. It's called ZGC.

http://openjdk.java.net/jeps/333

pcwalton · on Oct 5, 2018

ZGC has a very different design. Go's GC is non-moving and has no read barrier. ZGC has a read barrier and bump allocation everywhere (using compaction to ensure that memory regions are contiguous). As a result it does not suffer the throughput issues of Go's GC.

lossolo · on Oct 5, 2018

It's not moving/compacting because it doesn't need to - it doesn't have fragmentation issues. Bump allocation is efficient and fast in single threaded programs but almost all Go applications are multi threaded and it would require locks there. Go is using thread local caches for allocation, that's why there is no point in using bump allocation.

This was mentioned many times by the Go team.

Generational hypothesis says that most objects die young. Golang have value types (which Java do not have atm, but Valhalla is coming) and allocate on the stack based on escape analysis (which gets better with time). This means that adding generational GC would not benefit Go as much as you think. Go GC main focus is very low latency, not throughput and handling huge heaps with a lot of generations is not so easy with HotSpot. In many cases you need to tune GC in Java because of that. There are trade offs like with everything.

I am running high performance service written in Golang and I was working on high performance services in Java.

This are the pauses from Go service on XX GB heaps from today:

https://imgur.com/tSXtP4a

This is without tuning GC or writing exceptionally unidiomatic Go code. This is impossible to achieve in Java with Hot Spot without tuning GC/writing your whole code specially to satisfy GC.

pcwalton · on Oct 5, 2018

> It's not moving/compacting because it doesn't need to - it doesn't have fragmentation issues.

It's not about fragmentation. It's about allocation throughput.

> Bump allocation is efficient and fast in single threaded programs but almost all Go applications are multi threaded and it would require locks there. Go is using thread local caches for allocation, that's why there is no point in using bump allocation.

No production bump allocator takes locks!

> This was mentioned many times by the Go team.

Yes, and my view is that they're coming to incorrect conclusions.

> Golang have value types (which Java do not have atm, but Valhalla is coming) and allocate on the stack based on escape analysis (which gets better with time).

1. .NET has value types, and in that runtime the generational hypothesis certainly holds and therefore .NET has a generational garbage collector.

2. Java HotSpot has escape analysis too (and the generational hypothesis still holds). It's primarily for SROA and related optimizations, though, not for allocation performance. That's because, unlike Go, HotSpot allocation is already fast.

> This means that adding generational GC would not benefit Go as much as you think.

I have yet to see any evidence that bump allocation in the nursery would not help Go.

> Go GC main focus is very low latency, not throughput and handling huge heaps with a lot of generations is not so easy with HotSpot.

And for most applications, balancing throughput and latency is more desirable than trading tons of throughput for low latency.

> In many cases you need to tune GC in Java because of that.

The default GC in Java HotSpot has one main knob, just as Go does. Actually, the knob is better in Java, because "max pause time" is easier to understand than SetGCPercent.

lossolo · on Oct 5, 2018

> The default GC in Java HotSpot has one main knob

Do you write that from theory or experience? Because from my experience this was never enough. There are actually whole books written about tuning GC in Java.

> I have yet to see any evidence that bump allocation in the nursery would not help Go.

I can revert what you wrote here and it would be true too.

> Yes, and my view is that they're coming to incorrect conclusions.

I will ask again as someone did before. Did you reach to the Go team about your views and explained that you think they are coming to incorrect conclusions? You can do it easily by writing here[1].

I've seen you are committing a lot of time to write those things about Go in multiple submissions. The best way would be to spend this time talking with Go team and showing your view with proper argumentation. This would be better spent time than writing same thing over and over again on HN in my opinion. If you are right (which I doubt, but I can be wrong too, we are only humans after all) it would benefit whole Go community.

1. https://groups.google.com/forum/#!forum/golang-dev

pcwalton · on Oct 5, 2018

I've talked to the devs on Twitter, yes.

I'm more trying to correct misconceptions around GC than anything else.

lossolo · on Oct 5, 2018

> I've talked to the devs on Twitter, yes.

Can you point me to that discussion? I can't seem to find any discussion in which you participate on Twitter about Go GC with Rick Hudson or Austin Clements, Ian Lance Taylor etc.

jashmatthews · on Oct 5, 2018

https://blog.golang.org/ismmkeynote is worth a read. Apparently, the write barrier overhead makes it not really worth it for the Go ecosystem to adopt a generational or moving GC. I think WebKit also switched to a non-moving GC, but retained bump-allocation into empty pages.

pcwalton · on Oct 5, 2018

I'm familiar with that keynote, and I've written about it before. Long story short, the Go developers never tested a generational GC with bump allocation in the nursery. Without that, they haven't given generational GC a fair shake.

Regarding JavaScriptCore, they are constrained by compatibility with native code that takes pointers into the GC heap, particularly iTunes. So the situation is not really comparable. Nevertheless, Filip has done a better job with reducing allocation overhead in JSC than the Go team has, because the "bump'n'pop" fast path is inlined into allocation sites. It's still a throughput loss over a pure bump allocator, because it has a little more bookkeeping to do [1]—but not by much. Note also that JavaScriptCore has—you guessed it—a generational GC [2], though a non-moving one.

If Go doesn't switch to a generational GC, they should at least adopt bump'n'pop.

[1]: https://trac.webkit.org/browser/trunk/Source/JavaScriptCore/...

[2]: https://webkit.org/blog/7122/introducing-riptide-webkits-ret...

jashmatthews · on Oct 6, 2018

> they should at least adopt bump'n'pop.

I wrote a rudimentary hack of bump'n'pop for CRuby, but in that context, it had almost no effect. The real issue for CRuby is all objects larger than a few bytes are malloced and freed but there's a proposal now for: https://bugs.ruby-lang.org/issues/14858

chrisseaton · on Oct 5, 2018

> Bump allocation is efficient and fast in single threaded programs but almost all Go applications are multi threaded and it would require locks there.

If you’re using bump allocation you do it per thread! You don’t do it globally and use a lock!

valarauca1 · on Oct 5, 2018

    Bump allocation is efficient and fast in single threaded
    programs but almost all Go applications are multi
    threaded

This operation doesn't require locks, just an atomic add, and a branch to ensure you aren't allocating past the end of the current "arena" where memory is being allocated. Trivial optimizations are providing thread-local allocation arenas which remove the need for longer pauses as locality is improved (less cache coherence protocol work for the silicon).

OFC these schemes require some kind of relocation, but they make allocating blindingly fast. The only way you get faster is by pre-faulting the arena, and hinting for the _next_ chunk of the arena to be loaded in L1/L2 cache.

gok · on Oct 5, 2018

But a stack allocation is 0 or 1 instructions.

pcwalton · on Oct 5, 2018

Not if you have to check for stack space, like Go does.

It's no coincidence that a bump allocator and a function prologue have exactly the same structure: check against a limit and bump a thread-local pointer. There are runtimes like Chicken Scheme that actually implement heaps using the stack!

troytchell · on Oct 5, 2018

The creators of Go are Rob Pike and Google. I think they would know if a different GC were better and use it.

I used to sometimes doubt Rob Pike too and think Go had made some mistakes. After using Go and listening to his talks (like [0]) I realized how wrong I was. There's a reason he has helped build two wildly successful programming languages.. How many have you built? Are you smarter than Google?

In the end, embracing Rob Pike's brilliant decisions and taste has made my programming life so much simpler and better. I switched to Acme [1], turned off syntax highlighting on Gitub PR reviews with a userscript[2], and my life is infinitely simpler. Rob Pike has been right on so many things and has made Go successful. I'm sure if a generational GC were better for Go, he would have built and used it, so you're obviously wrong.

[0]: https://www.youtube.com/watch?v=rFejpH_tAHM

[1]: http://acme.cat-v.org/

[2]: https://groups.google.com/d/msg/golang-nuts/hJHCAaiL0so/kG3B...

_ph_ · on Oct 5, 2018

Reducing allocations isn't about the speed of allocations. Yes, allocating with a generational GC is very fast, as is stack allocation.

Reducing allocations is about reducing the time needed to perform garbage collection.

That time is zero for stack based allocation and non-zero for any heap based allocation. Stack based allocation or memory pools etc. reduce the pressure on the GC. Having more GC runs means not only time spent in the GC, with generational GCs it also can mean premature aging of objects - while collecting the nursery is very fast, collecting older generations has a larger cost.

So, with any GC and language, the less heap allocations, the faster your code will run. (Unless of course, your program code gets much worse by trying to avoid heap allocations).

pcwalton · on Oct 5, 2018

To nitpick, the time for stack allocation isn't zero. Stack objects still have to be traced.

But yes, there is some benefit to reducing allocation. The biggest benefit to escape analysis doesn't really have anything to do with allocation at all. Rather it's that promoting objects to the stack enables SROA, which unlocks a lot more optimizations.

This article wasn't about that, though. These were short-lived benchmarks that are dominated by throughput. In those types of benchmarks, allocation speed matters much more than mark and sweep time. So, the best way to optimize for these sorts of workloads is to add generational garbage collection with bump allocation in the nursery.

_ph_ · on Oct 5, 2018

I didn't claim that the cost for stack allocation was zero. I claimed that the cost for stack deallocation is zero. Thats where the stack wins against garbage collectors.

gok · on Oct 5, 2018

You potentially would also have to do work to keep the generational GC happy. It can be easy to accidentally have allocations escape the nursery.