
Go, don't collect my garbage - jgrahamc
https://blog.cloudflare.com/go-dont-collect-my-garbage/
======
alkonaut
If you make something that is performance critical, and in particular if it's
concurrent, the normal rule is you avoid allocating on the heap at all. Use
object preallocated data (just operate on massive lists), or use pools of data
etc.

Any language/runtime will support just allocating a large blob of data and
then playing with it with code that looks like C. Effectively once you want to
write perf-sensitive code in a garbage-collected language you have to _stop_
writing idiomatic code for that language.

That can mean using data orientation (SoA for example is a very non-idiomatic
thing to do in most C-like languages and especially OO ones). Not using heap
allocation at all for any length of time is definitely non-idiomatic in
Java/C#/Python/js etc., but that's what you need to do if you want any kind of
performance.

There are two truths here: 1) any language can be as fast as C, and 2) when it
is as fast as C, the code also looks like C, regardless of what it looked like
to begin with.

~~~
Kapura
I came into the comments to say basically what you said here. Trying to avoid
GC? Preallocate!

For context, I'm a game developer working in Unity. We do basically all of our
dev in C#, which is a language I love. But because it's garbage collected, you
can see performance hits when the GC runs. In VR dev especially (targeting 90
FPS) this is unacceptable.

So we profile, look to see where we're allocating _anything_ onto the heap and
figure out a way to cache it or re-implement it in a way that makes no
garbage. We recently shipped a mobile VR experience and one of the best parts
was being able to profile my gameplay code and seeing the '0B's next to its
allocations every frame.

I understand that this may not work for everyone, and I fully believe that
CloudFlare's got some talented engineers on the team, but there are ways to
think about and structure programs in a C-like fashion so that you're more
cognizant of the memory you are allocating and using from the heap.

~~~
merb
Can you even do that? In Java you can't allocate any List-like data structure
(LinkedList, ArrayList, arrays, and of course Maps) on the stack. Escape
analysis might do it, but there is no way to force it.

To my knowledge this isn't really possible in C# either. I mean, you can drop
down to a C-like style more easily there, or use stackalloc (similar to Unsafe
usage in Java), but I doubt that's a good idea?

~~~
Kapura
So, I'm not sure what the situation is with Java, but I suspect there might be
some confusion in what you're reading.

In C#, anything instantiated as part of a class goes onto the heap. Value
types that are _declared and used only within the function_ are allocated onto
the stack. For example:

    
    
      public class Example
      {
          public int a;  // Allocated onto the heap when Example is instanced
          public int b;  // Allocated onto the heap when Example is instanced
          public int c;  // Allocated onto the heap when Example is instanced
       
          public void CycleValues()
          {
              int temp = a;  // Allocated onto the stack when function is called
              a = b;  // a, b, and c are already allocated
              b = c;
              c = temp;
              // temp goes away when the function finishes executing
          }
       
          public int SumPositiveValues()
          {
              List<int> positiveList = new List<int>();  // Not a value type: allocates onto the heap
              if (a > 0)
                  positiveList.Add(a);
              if (b > 0)
                  positiveList.Add(b);
              if (c > 0)
                  positiveList.Add(c);
       
              int sum = 0;  // Allocated onto the stack
              foreach (int i in positiveList)
              {
                  sum += i;
              }
              return sum;
              // sum goes away when the function finishes executing
              // positiveList was allocated onto the heap, so it persists (although inaccessible) until garbage collection
          }
      }
    

In the example above, space on the heap is allocated when you create a new
Example(). Calling CycleValues() doesn't allocate anything new onto the heap,
so it isn't creating any garbage. Calling SumPositiveValues() creates a new
List(), which, not being a value type, is allocated onto the heap. When the
function finishes, that memory isn't automatically freed; it creates garbage.
Optimising for heap performance relies a lot on removing calls to new where it
is possible and makes sense to do so.

~~~
merb
Well, that's what I meant. I'm not a game dev, but keeping simple values on
the stack is most often not a problem. However, it's insane how many Lists and
Arrays I need, and I'm a web developer. So I guess a game dev needs even more
of these (inventory, stats, chat, whatever).

~~~
Kapura
Yeah, but the big key is to keep things instantiated only once (where
reasonable). So, if I'm making an RTS, I keep only one list of selected units,
and it persists even if I have no units selected. No reallocation or losing
the reference means nothing is getting garbage collected.

There are bigger convos to be had about data architecture here, but now's
neither the time nor the place for it.

------
jasode
Whenever I see GC, I immediately think of _tradeoffs_. One can optimize a GC
algorithm (or GC configuration setting) to work well for one type of workload.
Likewise, one can always create a pathological workload for any GC
configuration because GC has unavoidable tradeoff between cpu time and memory
footprint.

Therefore, I read his article with the intent of placing his findings in the
"taxonomy" of GC tradeoffs.

In that vein, I noticed that his benchmarks and graphs do not include
measurements for _memory footprint_. Instead, it has _throughput statistics_
such as "ops/sec" or "sign/second".

I think I see why working memory size isn't emphasized. He's testing a crypto
algorithm _" ECDSA-P256 Sign"_. I'm not familiar with it but I assume we can
think of it as a typical hash function[1] that doesn't require unpredictable
dynamic allocation of memory as the algorithm processes the bytes of input. (A
typical hash function uses the same fixed amount of memory whether hashing 1
kilobyte or 1 terabyte.)

If his synthetic benchmark for playing with GOGC had used a different type of
workload... such as a text parser that stressed the GC with thousands of
different-sized memory objects, we'd see memory footprint as part of the
analysis.

[1]
[https://en.wikipedia.org/wiki/Elliptic_Curve_Digital_Signatu...](https://en.wikipedia.org/wiki/Elliptic_Curve_Digital_Signature_Algorithm)

~~~
dullgiulio
It's not just a fixed-length hash: it does computations on the curve points,
which are represented by big.Ints; those take a variable size and are likely
to end up on the heap.

Also, Go's crypto makes heavy use of interfaces, which also makes values
escape to the heap. If such things are a bottleneck, it might be worthwhile to
"un-interface" them.

------
dboreham
This doesn't read like a "GC is slow" problem to me, more like a "GC doesn't
cope well in the face of a highly concurrent allocation workload". That's
pretty bad though. As soon as you go down the rabbit hole of tweaking knobs to
cajole the GC into doing what you expect, darkness lies beyond...

~~~
coldtea
> _in the face of a highly concurrent allocation workload_

Isn't this "highly concurrent" the main selling point of Go?

~~~
eternalban
The original PR line taken by the Go team was that languages like Java
"communicate by sharing" whereas in Go you would "share by communicating". It
wasn't so much about green vs. native threading as it was about the _core
semantics_ of Goroutines and Channels. This original PR also had the annoying
(to people such as myself) aspect of pooh-poohing Java in the context of
concurrency (of all things).

As it happens [lol], facts asserted themselves and (at least some in) the Go
community saw the reason _why_ the "communicate by sharing" paradigm ever got
traction. _"It's performance, stupid"_, to paraphrase Slick Willy; it's not
that other language designers were thoughtless. So now it is "idiomatic" to
see Go code using both paradigms (locking & passing channels). The original
mantra was quietly dropped, along with the "systems programming PL" bit,
somewhere along the line.

This type of article from the OP, in an interesting way, continues this
pattern of "let's do it this [obvious] way" in the face of dealing with
concurrency. No less an authority than Cliff Click has asserted that lock-free
algorithms pretty much _require_ a GC.

Concurrency and memory management seem to be zero-sum games. Root cause is
clearly Physics, and not the mechanism devised to deal with the physical
realities.

~~~
dullgiulio
Channels pass ownership of allocated objects, they don't perform copies or
allocations themselves (once you've allocated the channel and the objects you
want to pass).

I don't think the article says anything about concurrent allocations; in fact,
the various signers don't need to share anything (mutable) to sign the
requests they get.

~~~
eternalban
> Channels pass ownership of allocated objects, they don't perform copies or
> allocations themselves (once you've allocated the channel and the objects
> you want to pass).

And? Did I say they did?

> I don't think the article says anything about concurrent allocations

No it doesn't. Point was that it is not a general purpose approach. GCs are.

~~~
dullgiulio
I've just searched for the word "channel" in the article: zero results.

The "original mantra" is "Channels orchestrate, locks serialize" and hasn't
been dropped by anyone.

~~~
eternalban
2010: [https://blog.golang.org/share-memory-by-
communicating](https://blog.golang.org/share-memory-by-communicating)

There is nothing wrong with that approach. It does put a ceiling on
performance, however.

------
TazeTSchnitzel
The Go team were very proud of their garbage collector that “virtually
eliminates” (IIRC) GC pauses. I guess it does that at the cost of severe
program slowdown in cases like this? It's all about trade-offs, and I don't
trust people who think there's only one thing to optimise for.

~~~
kibwen
While the Go GC is definitely no slouch when it comes to engineering, the
messaging around it has not been entirely forthright:
[https://blog.plan99.net/modern-garbage-
collection-911ef4f8bd...](https://blog.plan99.net/modern-garbage-
collection-911ef4f8bd8e)

~~~
geodel
This blog post is more of a rant. Official blog announcements and marketing
material are not the place to delve deeply into the technical tradeoffs of
designing a GC. He compares it with Java, which is rather ironic considering I
have not seen official Java announcements talking about things not done. It
would be like saying 'We have not been able to deliver value types, the module
story is still half done, and we couldn't write an official HTTP client in a
whole development cycle' in the release announcement of Java 9.

Also, there is no data provided to show how bad Go's GC is compared to Java's.

------
jnordwick
To anybody who has done high-performance, low-latency Java and had to deal
with writing GC-less Java code, this is no surprise at all. In the Java world
we at least had ways around it, with things like sun.misc.Unsafe, avoiding
language features that allocate, pooling, etc. While some of those things can
be done in Go, I don't know if all the techniques are possible. Without them,
I'm not sure Go can really compete in this arena.

~~~
stmw
Exactly right re: Java. You could just use Rust, and get automatic memory
management without GC.

~~~
abiox
One does not simply "just use Rust".

~~~
kibwen
Here, allow me to help: [https://www.rust-lang.org/en-
US/install.html](https://www.rust-lang.org/en-US/install.html)

------
stmw
This kind of tuning ends up being required for all languages that use garbage
collection. The more you use it in production, the more metaprogramming you're
doing with environment variables, command-line options, and your
allocation profiles. The problem is not so much some performance reduction,
it's that the performance reduction is often variable and unexpected, as it is
in this blog post.

------
dm319
Can anyone explain to me why he got better performance with GC turned on (and
optimised) rather than off?

~~~
tom_mellior
Judicious garbage collection reuses memory you have already allocated, and for
which there may already be entries in caches, page tables, garbage collector
metadata, and whatnot.

If you never garbage collect, the runtime has to repeatedly allocate new
memory from the operating system, and all this additional metadata has to be
juggled. If allocating new memory is a system call, that has costs. If you get
lots of page faults because you keep using new pages, that has costs as well.

~~~
emeraldd
It sounds like instead of turning off GC, the algorithm was changed to an
"automatic" allocation/deallocation scheme that's closer to what you might see
with manual memory management in a C like world. I wonder if the Go runtime
has any optimizations to turn things into stack/static allocations in that
scenario.

~~~
tom_mellior
I'm not a Go expert, but my understanding of the blog post was that neither
the algorithm nor any other aspect of the code was changed - only the garbage
collection frequency.

~~~
emeraldd
My comment was in reply to the parent's:

    
    
        > If you never garbage collect, the runtime has to repeatedly allocate new memory from the operating system ...

~~~
tom_mellior
Yeah, that was me. I still don't understand what you're saying. Who is
changing what algorithm in what component?

If you tell the Go runtime not to garbage collect but keep allocating memory,
that memory has to come from somewhere. That somewhere is the operating
system.

------
twotwotwo
FWIW, I think this mostly shows that the GOGC defaults will surprise you when
your program keeps a small amount of live data but allocates quickly and is
CPU bound. By default, it's trying to keep allocated heap size under 2x live
data. If you have a tiny crypto bench program that keeps very little live data
around, it won't take many allocations to trigger a GC. So the same code
benchmarked here would trigger fewer GCs in the context of a program that had
more data live. For example, it would have gone faster if the author had
allocated a 100MB []byte in a global variable.

If your program is far from exhausting the RAM but is fully utilizing the CPU,
you might want it to save CPU by using more RAM, equivalent to increasing GOGC
here. The rub is that it's hard for the runtime to ever be sure what the
humans want without your input: maybe this specific Go program is a little
utility that you'd really like to use no more RAM than it must, to avoid
interference with more important procs. Or maybe it's supposed to be the only
large process on
the machine, and should use a large chunk of all available RAM. Or it's in
between, if there are five or six servers (or instances of a service) sharing
a box. You can imagine heuristic controls that override GOGC in corner cases
(e.g., assume it's always OK to use 1% of system memory), or even a separate
knob for max RAM use you could use in place of GOGC. But the Go folks tend to
want to keep things simple, so right now you sometimes have to play with GOGC
values to get the behavior you want.

------
tuyguntn
If any Cloudflare engineers are here, why is Rust not chosen as main language
for such use cases?

~~~
acdha
I like Rust too but it's a newer language and the Cloudflare team has a
history of using Go since it's targeted pretty squarely at their kind of work.
In the case of doing benchmarks, it's especially unhelpful to make that
suggestion since the point of a benchmark is to learn where the limits are
rather than an invitation to port everything to a different language to avoid
a problem which might remain theoretical forever.

~~~
staticassertion
I don't think they're trying to suggest using Rust; they're just curious why
Cloudflare chose a language with a GC for workloads that apparently do not
play well with GC.

"We already used Go" is a totally valid reason. "It's rare (or as you say
potentially never an issue) that we actually need to avoid GC" is another.

~~~
acdha
“why is Rust not chosen as main language” seemed like an unnecessarily
accusatory way to express that idea, especially given that language choices
are usually driven by a number of factors rather than a single benchmark.

I've seen a lot of advocacy over the years like that which has seemed to
inspire more backlash than adoption. I generally prefer something constructive
in the tradition of showing a better way to do something — i.e. in this case,
perhaps show a Rust crypto library which delivers better performance so people
could judge whether that kind of thing is enough of a draw to be worth
migrating or, in the case of Rust, taking advantage of its embedding
characteristics.

~~~
tuyguntn
First of all, I am not advocating using Rust, I am asking why they didn't
choose language without GC and good performance for given use case.

    
    
        “why is Rust not chosen as main language” seemed like an unnecessarily accusatory way to express that idea, especially given that language choices are usually driven by a number of factors rather than a single benchmark.
    

Don't cut the context; I said "why is Rust not chosen as main language --->
for such use cases? <---". By use case I meant writing performant code without
a GC.

------
readittwice
I would love to see an in-depth analysis why Go's GC behaves this way for this
specific benchmark in the first place. Just playing around with different
values for `GOGC` seems like guess-work and could just paper over the
underlying real issue. I know this probably requires reading the GC's source
code, but that would certainly have been very educational.

------
mcguire
I'm curious about this.

Go has a history of doing the simplest thing that covers the majority of
cases, inherited from Plan 9. For one thing, although I don't know its current
GC story, its original collector was very simple.

I wonder what would happen if you transplanted a recent collector from Java,
where the strategy is to optimize the crap out of anything that moves, into
Go?

Obviously, it wouldn't be as fast as not allocating, or at least not
collecting, in hot code, but what would be the actual penalty?

On the other hand, what about, say, Pony and its per-thread collectors?

~~~
stmw
From a recent talk on the latest JDK GC, it seems to require even more tuning
parameters than before. So the best performance achievable may be higher, but
with more manual work. Which is along the same trade-off curve, with
malloc/free on one end, and a very simple GC on the other.

------
tapirl
Using the SetGCPercent option with a large value is dangerous: it may cause a
program to never get GCed before virtual memory is exhausted. And when that
happens, Go programs become very, very slow.

The current official Go runtime performs very poorly when virtual memory
swapping is involved. It may make the whole OS non-interactive, so that you
must restart your computer.

A SetGCMemoryThreshold(maxMemory int) option would be better for many
programs.

~~~
twotwotwo
Definitely agree that an option for a memory limit would be handy in a lot of cases.
Sometimes a box exists to run one service and you know you want like 75% of
the box's RAM (leaving some for cache, other procs, etc.). Other times, like
this, you don't want GOGC=100's frequent collections but also don't want the
risk of an explosion if live data increases.

Just adding some heuristics to override GOGC's behavior in extreme cases could
help, e.g. assume by default it's OK to use 1% of system RAM and not OK to use
more than 75% of it regardless of live data. I'm not sure we can realistically
hope to get that: it adds complexity, folks dislike arbitrary thresholds, and
any change will break users who have optimized around the current behavior. It
might be that you could approximate the desired behavior for reasonably
well-behaved programs from a third-party package using runtime methods and a
timer, but that's a drag.

------
Arzh
Tuning is part of the job if you need high performance. It can take many forms
from code optimization, build flags, and platform configurations to even
hardware tweaks. Most people hate this kind of stuff, but some of us think
it's fun.

------
tedunangst
Graph complaint: the orange and blue lines aren't labeled as to which is which.

------
littlestymaar
This is a good illustration that Go isn't as simple as its proponents like to
claim[1]. When you face complex engineering challenges, you need tools with a
certain amount of complexity.

[1] especially when comparing to Java when it comes to GC customization.

~~~
lucozade
Funny what different people get from these things.

I was most impressed that he managed to get very close to perfect scalability
by twiddling the one knob on offer.

Java GC customisation is an art. This analysis was entirely mechanical. If the
author wasn't just lucky, i.e. if this is sufficient for a large number of
workload profiles, then that's very good news.

~~~
tveita
The main problem is that it isn't a very good knob for this use case, as
"ratio of freshly allocated data to live data remaining" is only incidental to
the goal of the person writing the program. Therefore you'd expect to have to
retune it due to unrelated changes, like if you allocated a big chunk of
static memory to hold results.

I expect Go's current behaviour is right for memory-intensive programs, where
"limit the wasted memory as a ratio of the actual working set" is a reasonable
high-level goal. For a CPU-intensive program you might have preferred a "max
x% CPU overhead" knob, but just a minimum memory size would work, as this is
mostly an artifact of the very small memory usage of the benchmark program.

Java's G1 has a "max pause time" goal as its preferred knob, but Go has stuck
that knob on low since around version 1.6, which is a good choice for what is
essentially a language for network services.

------
kbart
This is why we will still have C for times to come.

~~~
tebruno99
If you can't figure out how to tune one documented option in Go to fix this
issue caused by one specific use case (which the author did figure out, and it
was resolved)... then you're probably not using C either.

~~~
kbart
_" Then you're probably not using C either."_

In fact, I do - daily. I'm not claiming this exact issue is the key reason,
but C is very simple, with not much of an abstraction layer between software
and hardware, which is where most such (unexpected) performance issues arise.
Yes, it might be a single flag that makes things run normally, but it might
also take a week in production to find such "magic" bugs.

