
Iterative Optimization on Hot Paths in Go Apps - jmharvey
https://medium.com/samsara-engineering/iterative-optimization-on-hot-paths-c89827749c20
======
blaisio
I don't really get this article. It isn't necessary to optimize for minimal
allocations if it doesn't affect response time. They mention the new technique
was making too many allocations; why do they care how many allocations are
being made, if response time is the same?

~~~
dullgiulio
In one word: throughput. More CPU time spent doing useful work rather than GC
scanning. It means you need to scale out less to handle increasing load.

~~~
chrisseaton
How can you increase throughput without reducing response time?

~~~
simcop2387
Throughput is going to be affected by anything that can bottleneck your
application. Even if response time is the same, if you reduce the time spent
cleaning up between responses and requests then you can handle a higher number
of requests with the same number of workers. If you reduce the amount of
memory being allocated you'd also be able to run more workers on the same
hardware, also increasing throughput. And as you're implying, reducing time
spent making responses would also allow you to increase throughput too.

~~~
chrisseaton
I still don’t understand the maths - if you get n requests per second and the
time per request has not changed, then the throughput is the same, isn’t it?

Time between requests translates directly into response time, as someone is
waiting during that time, aren’t they? If nobody is waiting and it’s not added
to anyone’s wait time, then who cares?

The answer really is power consumption during GC even if nobody is waiting,
but you didn’t mention that.

~~~
simcop2387
Throughput involves both time to process a request and how many requests you
can process per unit time.

If you only take 10ms to process a request and give the data back to the user,
but then take 200ms afterwards cleaning up after yourself (garbage collection,
background tasks, etc.), then you can serve fewer than 5 requests per second
per worker. If you allocate 200 MB per request and free it afterwards, and
only have 1 GB of memory on the server, you can have a maximum of 5 workers,
so in this case you can only have ~25 requests per second of throughput.
Fixing either of those cases means you can have a higher throughput in the end
without having to scale out across more servers, since you can prevent future
requests from waiting either by being able to run more workers, or by having
workers do less work between requests. This isn't even necessarily GC work; it
could be sending off jobs to send emails or other jobs related to the request.
All of this still ties up a worker that could be handling the next request.

Also, the number of requests you get has nothing to do with the number of
requests you can actually process. You could be able to process 100k requests
per second, but only get 200/second, or vice versa.

~~~
chrisseaton
Sorry, I still don’t understand - if you have a delay caused by GC before you
can respond to the next request, then this adds time to the request that is
waiting during that delay, which increases response time.

~~~
dullgiulio
It depends on what you call response time: if you measure it server-side, you
might count it only from when a request is accepted; from the client side, you
count from when the request is sent.

Because you can scale out, it makes sense to measure response time from the
server's perspective (from when the request is accepted). Silly example...

Say you have two CPUs: you can respond to two (CPU-bound) requests at a time.
When the parallel GC kicks in, it fully occupies one CPU. Response time does
not change, but you are handling half of the previous requests per second.

You could scale out, and have two machines handle two requests in parallel
when both are running GC.

------
pcwalton
More optimization based on making escape analysis happy to reduce allocations.
This is yet another data point in favor of a generational garbage collector
with bump allocation in the nursery.
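For readers unfamiliar with the "making escape analysis happy" pattern the parent refers to: a minimal sketch (hypothetical names, not from the article) is returning a struct by value instead of by pointer, so the compiler can keep it on the stack. `testing.AllocsPerRun` can confirm the difference, and `go build -gcflags='-m'` prints the compiler's escape decisions:

```go
package main

import (
	"fmt"
	"testing"
)

type point struct{ x, y int }

// Returning a pointer that outlives the call (here, stored into a
// global) forces the struct to escape to the heap.
func newPointHeap(x, y int) *point { return &point{x, y} }

// Returning by value lets the struct stay on the caller's stack.
func newPointStack(x, y int) point { return point{x, y} }

var sink *point // global sink keeps the pointer alive, forcing escape

func main() {
	stackAllocs := testing.AllocsPerRun(100, func() {
		p := newPointStack(1, 2)
		_ = p
	})
	heapAllocs := testing.AllocsPerRun(100, func() {
		sink = newPointHeap(1, 2)
	})
	fmt.Println(stackAllocs, heapAllocs) // typically 0 and 1
}
```

The exact escape decisions depend on the compiler version and inlining, which is part of why people call this style of optimization fragile.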

~~~
dullgiulio
That's less than a fourth of what the article says.

Copying data is sometimes faster than allocating it (and the TCMalloc-inspired
Go allocator is quite quick). Many small allocations are inefficient in any
language... There are other interesting points.
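One common Go idiom behind "copy rather than allocate" is reusing buffers from a `sync.Pool` on the hot path (a sketch with hypothetical names, not taken from the article):

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands back previously used buffers, so the hot path copies
// into reused memory instead of making a fresh heap allocation per
// request.
var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

func process(req []byte) int {
	buf := bufPool.Get().(*bytes.Buffer)
	defer bufPool.Put(buf) // return the buffer for the next request
	buf.Reset()
	buf.Write(req) // copy into the reused buffer
	return buf.Len()
}

func main() {
	fmt.Println(process([]byte("hello, world"))) // prints 12
}
```

The copy costs CPU, but it trades a per-request allocation (and the GC work that follows) for memory reuse.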

And anyway, a bump allocator with a nursery will spend more time scanning the
nursery and copying long-lived objects. In Go this would not pay off, as
would-be nursery objects mostly stay on the stack, even without special tricks
to please the escape analysis.

If you argued that Rust's lifetime system makes it easier to reason the way
the escape analysis does, I would agree, but you preferred to drop a
tweet-sized rant...

~~~
pcwalton
tcmalloc cannot compare to allocation performance in HotSpot, where an
allocation is about 5 or 6 instructions - about the same size as a function
prologue and epilogue.

You're denying that the generational hypothesis holds for Go heap objects. I
see no reason why this would be the case. In fact, Go's situation is very
close to that of .NET, where the generational hypothesis certainly holds and
therefore .NET has a generational garbage collector. This very article is
evidence in favor of the generational hypothesis, since much of the
optimization boils down to reducing short-lived heap allocations.

~~~
lossolo
The same thing about HotSpot again, which you've repeated three times in the
last week or two?

Java has a new GC which is non-generational and focused on low latency, like
Go's GC. It's called ZGC.

[http://openjdk.java.net/jeps/333](http://openjdk.java.net/jeps/333)

~~~
pcwalton
ZGC has a very different design. Go's GC is non-moving and has no read
barrier. ZGC has a read barrier and bump allocation everywhere (using
compaction to ensure that memory regions are contiguous). As a result it does
not suffer the throughput issues of Go's GC.

~~~
lossolo
It's not moving/compacting because it doesn't need to - it doesn't have
fragmentation issues. Bump allocation is efficient and fast in single-threaded
programs, but almost all Go applications are multi-threaded, and it would
require locks there. Go uses thread-local caches for allocation; that's why
there is no point in using bump allocation.

This was mentioned many times by the Go team.

The generational hypothesis says that most objects die young. Golang has value
types (which Java does not have atm, but Valhalla is coming) and allocates on
the stack based on escape analysis (which gets better with time). This means
that adding a generational GC would not benefit Go as much as you think. The
Go GC's main focus is very low latency, not throughput, and handling huge
heaps with a lot of generations is not so easy with HotSpot. In many cases you
need to tune the GC in Java because of that. There are trade-offs, like with
everything.

I am running a high-performance service written in Golang, and I have worked
on high-performance services in Java.

These are the pauses from the Go service on XX GB heaps from today:

[https://imgur.com/tSXtP4a](https://imgur.com/tSXtP4a)

This is without tuning the GC or writing exceptionally unidiomatic Go code.
This is impossible to achieve in Java with HotSpot without tuning the GC or
writing your whole code specially to satisfy the GC.

~~~
pcwalton
> It's not moving/compacting because it doesn't need to - it doesn't have
> fragmentation issues.

It's not about fragmentation. It's about allocation throughput.

> Bump allocation is efficient and fast in single threaded programs but almost
> all Go applications are multi threaded and it would require locks there. Go
> is using thread local caches for allocation, that's why there is no point in
> using bump allocation.

No production bump allocator takes locks!

> This was mentioned many times by the Go team.

Yes, and my view is that they're coming to incorrect conclusions.

> Golang have value types (which Java do not have atm, but Valhalla is coming)
> and allocate on the stack based on escape analysis (which gets better with
> time).

1\. .NET has value types, and in that runtime the generational hypothesis
certainly holds and therefore .NET has a generational garbage collector.

2\. Java HotSpot has escape analysis too (and the generational hypothesis
still holds). It's primarily for SROA and related optimizations, though, not
for allocation performance. That's because, unlike Go, HotSpot allocation is
already fast.

> This means that adding generational GC would not benefit Go as much as you
> think.

I have yet to see any evidence that bump allocation in the nursery would not
help Go.

> Go GC main focus is very low latency, not throughput and handling huge heaps
> with a lot of generations is not so easy with HotSpot.

And for most applications, balancing throughput and latency is more desirable
than trading tons of throughput for low latency.

> In many cases you need to tune GC in Java because of that.

The default GC in Java HotSpot has one main knob, just as Go does. Actually,
the knob is better in Java, because "max pause time" is easier to understand
than SetGCPercent.

~~~
lossolo
> The default GC in Java HotSpot has one main knob

Is that from theory or from experience? Because in my experience that was
never enough. There are whole books written about tuning the GC in Java.

> I have yet to see any evidence that bump allocation in the nursery would not
> help Go.

I can invert what you wrote here and it would be just as true.

> Yes, and my view is that they're coming to incorrect conclusions.

I will ask again, as someone did before. Did you reach out to the Go team
about your views and explain why you think they are coming to incorrect
conclusions? You can do it easily by writing here[1].

I've seen you committing a lot of time to writing these things about Go across
multiple submissions. The best use of that time would be talking with the Go
team and backing up your view with proper argumentation. That would be time
better spent than writing the same thing over and over on HN, in my opinion.
If you are right (which I doubt, but I can be wrong too, we are only human
after all) it would benefit the whole Go community.

1\. [https://groups.google.com/forum/#!forum/golang-dev](https://groups.google.com/forum/#!forum/golang-dev)

~~~
pcwalton
I've talked to the devs on Twitter, yes.

I'm more trying to correct misconceptions around GC than anything else.

~~~
lossolo
> I've talked to the devs on Twitter, yes.

Can you point me to that discussion? I can't seem to find any discussion on
Twitter about the Go GC in which you participate with Rick Hudson, Austin
Clements, Ian Lance Taylor, etc.

