
Go code refactoring: the 23x performance hunt - anastalaz
https://medium.com/@val_deleplace/go-code-refactoring-the-23x-performance-hunt-156746b522f7
======
todd8
One of the things I like about Go is that the path of least resistance is
straightforward code. Getting a program working correctly in a straightforward
way is usually the starting point for getting a correct program that runs fast
enough. Go's tools for testing and benchmarking, and its compilation speed,
make it relatively easy to iterate on a program while tuning it.

This example was really nicely done; it's a classic problem, parsing input
data. Some problems, on the other hand, require a different approach. See for
example Knuth's wonderful introduction to boolean satisfiability in _The Art
of Computer Programming, Vol 4B_ [1]. There, the algorithms themselves are
what really matter for performance.

[1]
[https://www.youtube.com/watch?v=g4lhrVPDUG0](https://www.youtube.com/watch?v=g4lhrVPDUG0)

------
pcwalton
"The execution time is now dominated by the allocation and the garbage
collection of small objects (e.g. the Message struct), which make sense
because memory management operations are known to be relatively slow."

A perfect example of why Go should switch to a generational garbage collector
with bump allocation in the nursery, like Java HotSpot, .NET, V8, etc. etc.
use. There are consequences to Go's decision to make extreme throughput
sacrifices in order to optimize for latency (low pause times) above all else.

~~~
cdoxsey
Ian Lance Taylor replied to this question once:
([https://groups.google.com/d/msg/golang-nuts/KJiyv2mV2pU/wdBUH1mHCAAJ](https://groups.google.com/d/msg/golang-nuts/KJiyv2mV2pU/wdBUH1mHCAAJ))

> Now let's consider a generational GC. The point of a generational GC relies
> on the generational hypothesis: that most values allocated in a program are
> quickly unused, so there is an advantage for the GC to spend more time
> looking at recently allocated objects. Here Go differs from many garbage
> collected languages in that many objects are allocated directly on the
> program stack. The Go compiler uses escape analysis to find objects whose
> lifetime is known at compile time, and allocates them on the stack rather
> than in garbage collected memory. So in general, in Go, compared to other
> languages, a larger percentage of the quickly-unused values that a
> generational GC looks for are never allocated in GC memory in the first
> place. So a generational GC would likely bring less advantage to Go than it
> does for other languages.

> More subtly, the implicit point of most generational GC implementations is
> to reduce the amount of time that a program pauses for garbage collection.
> By looking at only the youngest generation during a pause, the pause is kept
> short. However, Go uses a concurrent garbage collector, and in Go the pause
> time is independent of the size of the youngest generation, or of any
> generation. Go is basically assuming that in a multi-threaded program it is
> better overall to spend slightly more total CPU time on GC, by running GC in
> parallel on a different core, rather than to minimize GC time but to pause
> overall program execution for longer.

> All that said, generational GC could perhaps still bring significant value
> to Go, by reducing the amount of work the GC has to do even in parallel.
> It's a hypothesis that needs to be tested. Current GC work in Go is actually
> looking closely at a related but different hypothesis: that Go programs may
> tend to allocate memory on a per-request basis. This is described at
> [https://docs.google.com/document/d/1gCsFxXamW8RRvOe5hECz98Ftk-tcRRJcDFANj2VwCB0/view](https://docs.google.com/document/d/1gCsFxXamW8RRvOe5hECz98Ftk-tcRRJcDFANj2VwCB0/view).
> This is work in progress and it remains to be seen whether it will be
> advantageous in reality.
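
For illustration, a minimal sketch of the stack-vs-heap distinction Ian
describes; the functions here are made-up examples, not from his post. The
compiler's decisions can be checked with `go build -gcflags=-m`.

    package main
    
    // The array p never outlives this call, so escape analysis keeps it on
    // the stack: no GC work at all.
    func sumOfSquares(a, b int) int {
        p := [2]int{a, b}
        return p[0]*p[0] + p[1]*p[1]
    }
    
    // The returned pointer outlives the call, so p "escapes to heap" and
    // becomes work for the garbage collector.
    func newPair(a, b int) *[2]int {
        p := [2]int{a, b}
        return &p
    }
    
    func main() {
        _ = sumOfSquares(3, 4)
        _ = newPair(3, 4)
    }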

~~~
pcwalton
I'm aware of Ian's arguments. He's (a) too dismissive of the huge throughput
benefits of generational GC; (b) ignoring the fact that both HotSpot and C#
allocate objects on the stack too, with effectively superior escape analysis
to that of Go because of speculative devirtualization; (c) neglecting the fact
that Go's situation is almost the same as that of .NET, where the generational
hypothesis certainly holds; (d) treating "request-oriented" GC as something
other than generational GC, when it's really just a somewhat limited form of
generational GC.

It's just a pointless thing for the Go team to dig in their heels over. They
should implement generational GC, like most other languages with optimizing
compilers do.

~~~
weberc2
I've seen you rebut Ian's remarks a number of times. I'm not knowledgeable
enough about GCs to judge, but I would really like to hear some of his
responses to your criticisms. I think I (and probably lots of others) would
learn a lot from the debate.

~~~
geodel
There is most likely none.

The Go team at Google would have plenty of real usage feedback from running
Go systems in their massive data centers. They most likely find the current
design reasonable, all things considered.

This debate about generational GC is theoretical in nature, and it assumes
that Java's GC design is universally applicable irrespective of language and
use case.

There may be indirect validation of Go's approach in Java's own work on value
types:

[http://openjdk.java.net/jeps/169](http://openjdk.java.net/jeps/169)

Key quote:

"In modern JVMs, object allocation is inexpensive, with a cost comparable to
out-of-line procedure calling. But even this cost is often a painful overhead
when compared to individual operations on primitive values. Thus, Java
programmers face a binary choice between existing primitive types (which avoid
allocation) and other types (which allow data abstraction and other benefits
of classes). When they need to define small composite values such as complex
numbers, pixels, or pairs of return values, neither approach serves. This
dilemma often has no good solution, and the workarounds distort Java programs
and APIs."

If we go by the claims that Java and its GC supporters have always made:

1) RAM is cheap.

2) GC is so fast that you need not worry about allocation.

then there should be no need to waste a tremendous multi-year engineering
effort adding value types to Java.
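
For contrast, a sketch of what the JEP is asking for, which Go has had from
day one: small composites that are plain value types with no heap allocation.
(The Complex type here is just an illustration.)

    // Complex is a small composite value; in Go it is a plain value type
    // with no object header, and operating on it involves no allocation.
    type Complex struct {
        Re, Im float64
    }
    
    // mul copies its arguments and result by value; they typically live in
    // registers or on the stack, so the GC never sees them.
    func mul(a, b Complex) Complex {
        return Complex{
            Re: a.Re*b.Re - a.Im*b.Im,
            Im: a.Re*b.Im + a.Im*b.Re,
        }
    }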

~~~
openasocket
That's a straw man. Just because GC is fast doesn't mean it can't be made
faster.

> Go team at Google would have enough real usage feedback from running Go
> systems in their massive data centers. They most likely find it reasonable
> with all things considered.

That's not just an appeal to authority, that's a big assumption. As far as I
know, no Go maintainer has claimed that they have gathered large amounts of
performance data and concluded that a generational GC wouldn't improve
performance.

Also, calling generational GC "Java GC design" is disingenuous. Generational
GC long predates the creation of Java, and is used almost universally. Common
Lisp (various implementations), .NET, the Lua VM (and the LuaJIT VM), BEAM,
SELF, and Haskell (GHC at least) all use generational garbage collection. In
fact, the current Go implementation is the only GC implementation I'm aware of
that isn't generational.

~~~
yoklov
Neither Lua nor LuaJIT uses a generational design; they're both tri-color
incremental mark-and-sweep collectors without a separate nursery.

There was a plan for a generational GC for LuaJIT [0] (not to mention a lot of
other cool features and a very clever design) but it was never implemented
(and at this point it probably never will be).

[0] [http://wiki.luajit.org/New-Garbage-Collector](http://wiki.luajit.org/New-Garbage-Collector)

~~~
weberc2
Also, most of the listed languages either don't have value types or added
them expressly because their GC scheme was insufficient.

------
defertoreptar
Great article. I was recently optimizing a Go program. I was focusing on
trying to improve the underlying algorithms, but after profiling I realized
that most of the time was spent creating temporary slices (many, many times).
When I changed it to reuse existing slices, I saw a bigger performance gain
than from all the other effort combined.
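
The usual shape of that fix is truncating one buffer to length zero each
iteration instead of allocating a fresh slice. A minimal sketch, not the
actual program:

    package main
    
    import "fmt"
    
    // process stands in for whatever consumes each batch.
    func process(batch []int) { fmt.Println(batch) }
    
    func main() {
        jobs := [][]int{{1, 2, 3}, {4, 5}, {6, 7, 8, 9}}
    
        buf := make([]int, 0, 16) // allocated once, with spare capacity
        for _, job := range jobs {
            buf = buf[:0] // reuse the backing array: no new allocation
            for _, v := range job {
                buf = append(buf, v*v)
            }
            process(buf)
        }
    }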

~~~
weberc2
Yeah, this is usually what I run into. I once had a recursive algorithm to
collect a flat list of the nodes in a tree. I found that building slices was
the bottleneck, so I changed the signature to `func (t Tree) CountNodes() int`
and `func (t Tree) Nodes(outSlice []Node)`, and then I'd call it like:

    nodes := make([]Node, t.CountNodes())
    t.Nodes(nodes)

Even though this required me to traverse the tree twice (once to count and
once to populate the list of nodes), the savings were considerable.
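
A minimal sketch of how such an API can be implemented. The Tree and Node
definitions are assumptions, and Nodes here returns the appended slice so the
recursion can track its write position, a slight variation on the signature
above:

    type Node struct{ Value int }
    
    type Tree struct {
        Node     Node
        Children []Tree
    }
    
    // CountNodes walks the tree once so the caller can size the output
    // slice up front.
    func (t Tree) CountNodes() int {
        n := 1
        for _, c := range t.Children {
            n += c.CountNodes()
        }
        return n
    }
    
    // Nodes appends into a caller-provided slice, avoiding any per-call
    // allocations when the slice already has enough capacity.
    func (t Tree) Nodes(out []Node) []Node {
        out = append(out, t.Node)
        for _, c := range t.Children {
            out = c.Nodes(out)
        }
        return out
    }

With that variation the call site becomes `nodes := t.Nodes(make([]Node, 0,
t.CountNodes()))`, still a single allocation of exactly the right size.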

------
RyanZAG
Kind of defeats the purpose of using golang for a task like this. The whole
point of golang is its lightweight green threads (goroutines), but actually
using them in this case is terrible for performance.

The remaining performance left behind is all in memory allocation and garbage
collection, something you could optimize relatively easily if this were
written in C, for example by using a memory pool, so that you wouldn't need
allocation or garbage collection at all.
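
That said, Go does have a standard-library analogue in `sync.Pool`, which
lets a hot path recycle objects rather than allocate fresh ones. A minimal
sketch; the Message fields here are assumptions, not the article's actual
struct:

    package main
    
    import "sync"
    
    // Message stands in for the article's small, frequently allocated struct.
    type Message struct {
        ID      int
        Payload []byte
    }
    
    var messagePool = sync.Pool{
        New: func() interface{} { return new(Message) },
    }
    
    func handle(id int) {
        m := messagePool.Get().(*Message) // reuse a recycled Message if one exists
        m.ID = id
        m.Payload = m.Payload[:0]
        // ... do the real work with m ...
        messagePool.Put(m) // hand it back for reuse instead of discarding it
    }
    
    func main() {
        for i := 0; i < 1000; i++ {
            handle(i)
        }
    }

Unlike a C memory pool, objects in a sync.Pool are still GC-managed (the pool
can be drained during a collection), so this reduces allocation pressure
rather than eliminating the collector entirely.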

Of course if performance isn't a big issue for your task, then none of this is
really important.

~~~
iainmerrick
It's surprising that the per-file Goroutines were so expensive, though. (The
original per-line Goroutine, sure, that's excessive if you care about
performance.) Just using long-lived workers seems non-idiomatic for Go, but it
certainly pays big dividends in this example.
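
For reference, the long-lived-worker version is only a few lines. A sketch,
with the parsing replaced by a stand-in:

    package main
    
    import (
        "fmt"
        "runtime"
        "sync"
    )
    
    func main() {
        lines := make(chan string)
        var wg sync.WaitGroup
    
        // A fixed pool of long-lived workers instead of one goroutine
        // per line or per file.
        for i := 0; i < runtime.NumCPU(); i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for line := range lines {
                    _ = len(line) // stand-in for the real parsing work
                }
            }()
        }
    
        for i := 0; i < 100; i++ {
            lines <- fmt.Sprintf("record %d", i)
        }
        close(lines)
        wg.Wait()
    }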

~~~
jerf
Per-file may have had other problems not related to the Go runtime, such as IO
contention. I'm not going to check it, but it would be easy to verify that
just by using a limited number of them at a time. Spawning a new goroutine in
that case is not strictly necessary, but would still be good software
engineering.
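
One way to run that experiment while still spawning a goroutine per file is
to gate them with a buffered channel used as a semaphore. A sketch;
processFile is a hypothetical stand-in:

    package main
    
    import "sync"
    
    // processFile is a hypothetical stand-in for the per-file work.
    func processFile(path string) {}
    
    func main() {
        paths := []string{"a.log", "b.log", "c.log", "d.log", "e.log"}
    
        sem := make(chan struct{}, 4) // at most 4 files in flight at once
        var wg sync.WaitGroup
    
        for _, p := range paths {
            wg.Add(1)
            sem <- struct{}{} // acquire a slot; blocks while 4 are running
            go func(p string) {
                defer wg.Done()
                defer func() { <-sem }() // release the slot
                processFile(p)
            }(p)
        }
        wg.Wait()
    }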

One of the problems I see repeatedly when people try to benchmark things with
concurrency is that they don't pick a problem that is CPU-intensive enough,
so it ends up blocked on other parts of the machine. For a task like this,
I'd expect optimized Go to easily keep up with a conventional hard drive, and
with just a bit of work, come within perhaps a factor of 2 or 3 of keeping up
with the memory bandwidth of a consumer machine (bearing in mind that since
you're going to read a bit, then write some stuff, you're not going to get
the full sequential read performance out of your RAM). That's not because Go
is teh awesomez but because the problem isn't that hard. To get big
concurrency wins, you need a problem where the CPU is chewing away at
something but isn't constantly hitting RAM or disk or network for it, such
that those systems become the bottleneck.

~~~
val_deleplace
Hi jerf, please note that

- the benchmark was designed to repeatedly parse an in-memory byte slice (not
the hard drive), so IO contention is unlikely here;

- concurrency is a big win when IO is a bottleneck: keep processing dozens of
things while some of them are waiting for data from network or hdd.
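
For readers following along, that kind of benchmark looks roughly like this
with the standard testing package (parseMessages is a hypothetical stand-in
for the article's parser):

    package parser
    
    import "testing"
    
    // parseMessages is a hypothetical stand-in for the parser under test.
    func parseMessages(data []byte) int { return len(data) }
    
    var input = []byte("2018-05-01 12:00:00 GET /index.html 200\n")
    
    func BenchmarkParse(b *testing.B) {
        b.ReportAllocs() // also report allocations per operation
        for i := 0; i < b.N; i++ {
            parseMessages(input) // repeatedly parse the same in-memory slice
        }
    }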

~~~
jerf
"the benchmark was designed to repeatedly parse an in-memory byte slice (not
the hard drive), thus IO contention is unlikely here"

You could still be getting IO contention from the RAM system. RAM is not
uniformly fast; certain access patterns are much faster than others.

"concurrency is a big win when IO is a bottleneck : keep processing dozens of
things while some of them are waiting for data from network or hdd."

Concurrency is a win when IO is a bottleneck on a _single task_. Once you've
got enough tasks running that all your IO is used up, adding more may not only
fail to speed things up, but may slow things down. I'm speaking of situations
where you've _used up_ your IO. The tasks you're benchmarking are so easy per-
byte that I think there's a good chance you used up your IO, which, at this
level of optimization, must include a concept of memory as IO.

I think you'd be helped by stepping down another layer from the Go runtime
and thinking more about how the hardware itself works, regardless of what
code you are running on it. Go can't make the hardware do anything it
couldn't physically do, and I'm not getting a sense that you deeply
understand those limits.

------
crb002
One trick I use is passing data files through the command-line "sort",
"uniq", and "wc" utilities. Those are great baselines for seeing how far off
you are from tuned code on your machine.

------
nefasti
Does anyone know of profiling tools like that for Node.js?

