I'm particularly keen on some of the details around goroutines since Project Loom hopes to bring something similar to the JVM and it's nice to see what the design space looks like.
Another is that Go sometimes gets dinged for not being modern enough in a bunch of different directions, e.g. no Rust-like immutability/borrowing, no generics, no generational GC; but a lot of thought has gone into what the team is working on and where they want to focus the language.
"It isn't that the generational hypothesis isn't true for Go, it's just that the young objects live and die young on the stack."
"... we are led to the conclusion that memory, or at least chip capacity, will follow Moore's law longer than CPUs"
Kudos to the Go team!
Already in the MS-DOS days, even if the string class back then was compiler-framework specific, writing one was part of the rite of passage as a C++ developer.
The typical rite of passage: every C++ developer needs to write their own string and vector classes as part of learning C before C++, instead of starting with proper C++ as we can do nowadays.
And finally there are those shops that don't allow proper C++ in their code, only C with Classes.
The generational hypothesis is not about the heap, it's about all allocations.
You can see what the escape analysis has determined for each variable by building like so:
go build -gcflags "-m"
Unfortunately it can be hard to know just by looking at code whether something will be stack or heap allocated. You can allocate a struct and take its pointer, and have it still end up on the stack, for instance (which is good). You can also have seemingly obvious flows for a variable that the analyzer can't reason about, so it escapes when you didn't think it would. You really just have to run the tool if you want to be sure.
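For instance, here's a tiny sketch (names invented) where taking a pointer doesn't force a heap allocation but returning one does. Build it with the -m flag above and the compiler reports which is which (exact wording varies by Go version):

    // esc.go
    package main

    type point struct{ x, y int }

    // The pointer never leaves the frame: -m reports something like
    // "&point{...} does not escape", and p stays on the stack.
    func sumLocal() int {
        p := &point{1, 2}
        return p.x + p.y
    }

    // The pointer outlives the frame: -m reports "escapes to heap".
    func leak() *point {
        return &point{3, 4}
    }

    func main() {
        println(sumLocal(), leak().x)
    }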
And this is why "get programmers to manually optimize their code to please escape analysis" is not a reasonable memory management strategy.
If profiling finds that there are too many allocations on a hot path, you might want to see why they escape to heap. But that's a last resort strategy.
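If you want a cheap first check before reaching for the escape analyzer, benchmark allocation counts will tell you whether a hot path allocates at all. A sketch (the Sprintf call is just a stand-in for real work; fmt makes its arguments escape):

    // alloc_test.go
    package main

    import (
        "fmt"
        "testing"
    )

    func BenchmarkHotPath(b *testing.B) {
        b.ReportAllocs() // report allocs/op even without -benchmem
        for i := 0; i < b.N; i++ {
            _ = fmt.Sprintf("%d", i)
        }
    }

Run it with go test -bench . and only dig into -gcflags "-m" if allocs/op surprises you.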
Have a look at  and go down to where it says "To find out why the garbage collector is running so much".
It essentially says that java doesn't (ever?) allocate Objects on the stack, but it sometimes approximates it by just storing the scalar values (that would have been part of the object) on the stack, and never allocating the object (with its headers) at all. The end result is similar.
It also has a mention of an easy way to confuse the escape analysis (though Go's escape analysis has many pitfalls as well).
I'm speculating, but it could be partly that the escape analysis and resulting scalar replacement are JIT optimizations, and therefore the Java GC has to be well tuned to handle non-JITed code anyway. It may also have something to do with what kinds of failure cases each language's analysis has.
I tend to be skeptical of the claims that are often made that Go data structures have different GC-relevant characteristics from those of Java, or especially C#.
Java on the other hand is pathologically bad for escape analysis: outside of ints and floats, literally every object in Java is presumed to be on the heap. Even C# typically has less heap pressure than Java because of more value type options, and not type-erasing greatly improves the JIT's escape analysis. After sufficient JITing, Java may reach Go levels of garbage generation, or it may not: it's not deterministic.
Firstly, type erasure and escape analysis are unrelated. Whether an object escapes or not has nothing to do with whether it's a List<Integer> or a List<int>.
Secondly .NET doesn't do escape analysis at all:
Thirdly HotSpot will happily escape analyze and then scalar replace regular objects, it isn't limited to ints and floats. It can make a big difference in many situations.
Fourthly and finally, if it were "easy to see where data escapes" in Go there wouldn't need to be tools to help the programmer find out, and Go developers wouldn't spend time manually tweaking to try and ensure the optimisation works.
Although C# does not have escape analysis, its value type vs reference type distinction serves roughly the same purpose, but under manual control of the programmer.
GraalVM and other Java JITs (J9, Azul, PTG, ...) have much better escape analysis algorithms.
My understanding is, the string itself is on the stack, but the data is on the heap https://github.com/golang/go/blob/2f2e8f9c81d899d8eefb1f2f98...
Likewise, the "double pointer" of an interface is on the stack, but points to the heap: https://research.swtch.com/interfaces
So, from a lifetime perspective, it's still pretty stack-like. If you make a string, and you don't refer to it from outside the function, it's gonna get noticed by escape analysis and be like anything else on the stack.
So both are regular heap allocated values? So the majority of data is still stack (edit: heap) allocated (safe ints, floats and some simple structs, I suppose), so I fail to see how "young objects live and die young on the stack".
It is true that some GCed languages do not have stack allocation at all, but many of them (Java, Haskell) have very powerful inlining capabilities to compensate. I still don't understand why I wouldn't need generations to allocate soon-to-die strings more efficiently.
It doesn't matter that the string's data lives somewhere else. It matters where what owns it lives. The string itself, that is, the pointer and length, both live on the stack, and so, it ends up working like a stack allocated value.
> I still don't understand why I don't need generations for allocating to-die-soon strings in a more efficient ways.
Because the string itself isn't on the heap, and therefore, a generation wouldn't help it.
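For reference, the header in question is just two words (mirrored by reflect.StringHeader); this header can sit in a stack frame even while data points into the heap:

    // the runtime's view of a string value, roughly:
    type stringHeader struct {
        data uintptr // may point into the heap
        len  int
    }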
This is one sort of reason why stack allocation isn't a cure-all - stack allocating a wrapper doesn't help you much if the bulk of the data is still in the heap. What can help, in this situation, is if the compiler can prove (or speculate that) the array is small and then scalar replace the array. However I think doing that sort of thing requires a profiling JIT compiler because statically proving array sizes ahead of time is very difficult.
This kind of thing happens in Ruby a lot; the VM will allocate stuff through malloc that never actually hits the GC; see Aaron Patterson's work.
I think all that happens when a string goes out of scope is that the byte array becomes eligible for GC.
IIUC there's a really huge difference here. Go can quickly allocate the byte array into the GC managed heap. MRI Ruby has to make a slow call to malloc(). Worse still, when collecting we can't just mark that memory available again. Ruby has to detect the type of each object and call free() accordingly.
What makes you say that? Pointers to stack are totally possible in Go; just because the `str` field is a pointer doesn't mean it points to the heap.
The primary advantage of generational GC is the huge throughput advantage that bump allocation in the nursery gives you. If you're not comparing against such a GC, then you're not properly comparing non-generational GC to generational GC. The write barriers you need reduce throughput, but they're ordinarily counterbalanced by the throughput gains you get from bump allocation. Of course you will take a throughput loss if your generational GC is non-copying; that's why people don't usually deploy non-copying generational GCs.
Rather it's assuming the infant mortality hypothesis - that the probability of an allocation being garbage is inversely proportional to its age. Thus you split your allocations into three logical buckets. The first bucket is recent allocations that are super quick to trace (e.g. no bigger than CPU cache), the third bucket is long-lived allocations ideally reachable by global roots rather than short-lived stack frames; and the second bucket is all the stuff that survives collections on the first bucket (hopefully only rooted by code in flight on the stack) and you want to keep out of the third bucket because the third bucket is so damn expensive to trace.
You need write barriers for generational GC, but you need them for concurrent GC anyway.
Generational GC is ideal for servers that have a predictable amount of scratch allocation per request. If first and second buckets are sized correctly and collected at the right time, and the server never makes anything reachable from the third bucket (like a cache - you should use reusable serialized arrays for your caches in generational GC languages, people! And not weak references!), then you're in a great place. It approaches arena allocation in terms of efficiency. I've written app servers in .NET back in the mid 2000s that barely hit 2% of CPU time in GC following these principles, and they allocated pretty freely.
I think generational GC outside of a server request loop context, or with wildly variable allocation profiles, is less great. Not bad mind, but the generational infant mortality hypothesis, and more importantly, lack of middle age death, is less reliable.
(I wrote this before I finished with the actual article. The point in the article about avoiding write barriers with hashing is interesting in particular - somewhat surprised it scales to 100+G heaps, if indeed it does.)
(I don't consider pure semispace GCs to be practical because of memory usage. One benefit of generational GC is that it makes Cheney scans practical by confining the semispace collection to TLABs or similar, which are by their nature small.)
Bump allocation significantly increases throughput only if you allocate a lot. If you don't, maybe you gain more by avoiding write barriers. And, as explained in the article, Go tends to allocate less because it is value-oriented (instead of reference-oriented, where most values are "boxed") and thanks to escape analysis.
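A quick sketch of what value-oriented buys here (the struct is arbitrary):

    package main

    type particle struct{ x, y, z float64 }

    func main() {
        // Value-oriented: one contiguous allocation, no pointers
        // for the GC to trace.
        values := make([]particle, 1000)

        // Reference-oriented, Java-style: a slice of pointers plus
        // up to 1000 separate little heap objects.
        boxed := make([]*particle, 1000)
        for i := range boxed {
            boxed[i] = new(particle)
        }

        _, _ = values, boxed
    }

The second layout is exactly the short-lived-small-object fodder a generational nursery is designed for; the first mostly sidesteps the question.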
The most important point, though, is that you don't know whether the hypothesis that Go wouldn't benefit from generational GC is true until you test it. Nobody has tested it. And there is plenty of indirect evidence to suggest that this is not true. For one, Go programmers have a habit of manually microoptimizing short-lived allocations to keep escape analysis "happy", which is something that becomes much less important with bump allocation in the nursery. Another simple observation is that .NET has the same situation regarding value types as Go does, and .NET uses a copying generational garbage collector. A final observation is that certain situations, like indirect calls, cause objects to be spilled to the heap unconditionally.
I've been writing Go for 8+ years and have never done this, and I wouldn't approve a code review of a teammate that did this.
I agree. But you would probably agree it would be a fairly difficult endeavour to test your hypothesis? To make a fair comparison, we'd need to implement an alternative version of the Go compiler and runtime, replacing the current GC with a copying generational GC, with a similar multi-year effort in optimizing and refining it.
Escape analysis and value types are both useful techniques, but they don't eliminate the need for a fast GC.
I would actually be quite interested in a study of whether equivalent Go/Java programs allocate more or less, especially when pitted against Graal. Presumably it's not hard to prepare such a test.
The weird thing is Java doesn't seem to offer this as an option, even though it seems like it would be simple to add: basically CMS + jemalloc. Maybe the performance is actually so bad that it would not be usable on Java, where there is a lot more heap allocation going on.
The design of Go generally seems to use simple solutions where other people use complex solutions. E.g. for interface dispatch Go has fat pointers, which give middling performance, while Java uses the JIT to remove the dynamic dispatch where it can (great performance) or falls back to looping through the class to find the correct interface vtable to dispatch the method call (terrible performance).
For implementing the generational GC they simply use sticky mark-bits.
SpiderMonkey used to use an allocator much like that of JSC before it switched to copying generational GC, to significant performance improvements.
This may muddy up the language versus Java a little bit, but it also gives a lot more tools to work your code around the runtime's limitations. And that's a perfectly fine design point by me.
If you're going to have platform limitations (no JIT, simple GC) then I need reasonable tools when I hit them. It's much easier to avoid frequent allocations or implement arenas in Go. This takes the pressure of the runtime to be one size fits all.
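For example, one of those tools is sync.Pool. A minimal sketch (the handler is invented) of recycling buffers so a hot path stops generating garbage:

    package main

    import (
        "bytes"
        "sync"
    )

    var bufPool = sync.Pool{
        New: func() interface{} { return new(bytes.Buffer) },
    }

    func handle(payload []byte) {
        buf := bufPool.Get().(*bytes.Buffer)
        buf.Reset()
        buf.Write(payload)
        // ... use buf.Bytes() ...
        bufPool.Put(buf) // recycle instead of leaving it for the GC
    }

    func main() {
        handle([]byte("hello"))
    }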
I know, I know, Java is getting Value Types some day! We'll see where that lands.
Meaning int, long, char, bool, ..., points, rectangles, pairs, colours, enumerations.
Then generics are reified, so each instantiation gets its own copy when value types are used.
Additionally, you can explicitly allocate value types and arrays on the stack and there are APIs for manual memory management.
It sounds like they don't intend to add any more tuning parameters for the time being. Considering they have made it this far without many ways to adjust the garbage collector, I'll bet this remains true.
WRT knobs: apparently they aren't going to add any, but I suspect that's a mistake.
It's apparent from reading this talk that GC design for Go has often been constrained by this "no knobs" design philosophy in ways that seem problematic. For instance they designed a GC to have the shortest pauses possible, but then kept hitting throughput problems on a classic batch job where latency doesn't matter at all (their compiler).
I wrote about this problem some time ago:
The essay points out that GC tuning exists because of fundamental tradeoffs between low pause times and throughput. Java lets you tune the GC precisely to avoid the nasty situation Go got itself into, where the demands of very very different kinds of workload (super latency sensitive Google servers vs compilers) are difficult to reconcile with each other.
You can see this tension come through several times, like when they admit that they would have built a normal "enterprise" style GC (as they rather condescendingly call it), but they felt Go's market position was under threat: they only had a year to make improvements, and were constrained by the requirement not to slow down the Go compiler itself:
So we were limited. We had basically a year of compiler performance improvements that we could eat up by making the GC concurrent. But that was it. We couldn't slow down Go programs. That would have been untenable in 2014 ... Go also desperately needed short term success in 2015.
So they ended up with this design that they knew was kind of weird and not that great, but it let them promote low latency above all else to their users, and they sort of buried the tradeoffs they were making along the way. Which, reading between the lines, might have been related to internal staffing issues at Google - why else would a well funded project like Go have "desperately needed short term success in 2015"? It's not like they sell the Go compiler.
More quotes that support this:
As you can see if you have [request oriented collector] on and not a lot of sharing, things actually scale quite nicely. If you don't have ROC on it wasn't nearly as good.
So they had a GC that scaled better. But it got killed by, you guessed it, the needs of batch jobs:
But that wasn't good enough, we also had to make sure that ROC didn't slow down other pieces of the system. At that point there was a lot of concern about our compiler and we could not slow down our compilers. Unfortunately the compilers were exactly the programs that ROC did not do well at. We were seeing 30, 40, 50% and more slowdowns and that was unacceptable. Go is proud of how fast its compiler is so we couldn't slow the compiler down, certainly not this much.
They could have solved this problem by having two GCs for Go programs tuned for throughput and latency, or doing what the Java guys did and trying to make an adaptive GC like G1 where you can give it pause time targets: if the target is very high then it tunes for throughput otherwise it tunes for latency. But instead they seem to have sacrificed it all on the altar of "no knobs", which is a pity because they ended up introducing two knobs anyway: SetGCPercent and SetMaxHeap.
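(For the record, only the first of those two shipped: SetGCPercent is in runtime/debug today, while SetMaxHeap stayed at the proposal stage linked elsewhere in this thread. A sketch of the knob you do get:)

    package main

    import "runtime/debug"

    func main() {
        // Trade memory for throughput: let the heap grow to ~4x the
        // live set between cycles instead of the default 2x
        // (equivalent to GOGC=300).
        debug.SetGCPercent(300)
        // ... run a batch workload such as a compiler ...
    }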
Well hey, guess what, if you use the G1 GC in Java you are only supposed to tweak two knobs too - max heap size (same as Go) and pause time target. If anything G1's knobs seem better because pause time target is a lot easier for developers to reason about than "heap overhead multiplier". In the end it's unclear their sacrifices were really worth it, although I don't know enough about Go to know what dire straits in 2015 were causing them to need such short term success.
> So they had a GC that scaled better. But it got killed by, you guessed it, the needs of batch jobs
Even in the JVM where you have multiple specialized GCs, the developers also look at each GC in workloads it wasn't designed for and that also seems to influence design decisions. For me this just means they focus on the server use case (=latency) but not at the cost of 50% throughput hit for the compiler (=batch job).
> The math is absolutely fascinating, ping me for the design docs
Ping! Publish them please.
Go 1.5 concurrent garbage collector pacing: https://golang.org/s/go15gcpacing
Proposal: Separate soft and hard heap size goal: https://github.com/golang/proposal/blob/master/design/14951-...
0 - http://hg.openjdk.java.net/jdk/jdk11/file/a0de9a3a6766/src/h...
The next one up in complexity is called the serial collector. It's a stop the world mark/sweep design straight out of the 1980s, but it can be quite effective for small programs like command line tools, simple GUI apps etc because there's no heap overhead and no runtime overhead when not collecting.
After that you get into the more advanced algorithms. In order they'd be parallel, CMS, G1, Shenandoah/ZGC.
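For the curious, you pick between them with a single flag (names as of JDK 11; CMS is deprecated and ZGC still experimental):

    java -XX:+UseSerialGC ...        # stop-the-world mark/sweep
    java -XX:+UseParallelGC ...      # throughput-oriented
    java -XX:+UseConcMarkSweepGC ... # CMS, deprecated since JDK 9
    java -XX:+UseG1GC ...            # the default since JDK 9
    java -XX:+UseZGC ...             # low latency, experimental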
The different engines can be found one level up, here:
The directory you were looking at is code shared between the collectors.
Naturally this is OpenJDK specific, other JVMs might do it differently.
I also think that throughput of memory allocation, not just latency, matters a lot more than people give it credit for.
Agreed. But you also ask for very fast memory allocation, which implies a bump allocator, which implies a copying GC, which implies pinning objects shared with C through cgo (Go's foreign function interface), which would probably make cgo even slower.
> I also think that throughput of memory allocation, not just latency, matters a lot more than people give it credit for.
At 60 fps, minimizing latency is critical. You have to optimize this first. When your longest GC pause is well below 1/60 s, then you can optimize throughput of memory allocation.
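Concretely, the frame budget at 60 fps is 1/60 s ≈ 16.7 ms, so a single 10 ms pause eats most of a frame, while a sub-millisecond pause costs only a few percent of one.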
Anyway, most AAA video game engines being written in C/C++, my guess is they don't use a bump allocator, which means they do quite well with allocators like tcmalloc or jemalloc, probably by allocating as much as possible early, and using object pools.
Note that Unity is an existence proof that most games can get by just fine with their high-level logic in a GCed language.
Go would be fine for lots of game dev, imo.
I wonder if it's possible to speed up FFI in Go...
My understanding is Go's FFI is slow because each call has to copy over to a bigger stack (goroutines start with tiny stacks and grow them on demand, but C code can't do that, natch), and because it has to tell the goroutine scheduler some stuff for hazy reasons.
this github issue has a lot of interesting discussion of go ffi: https://github.com/golang/go/issues/16051
These benchmarks https://github.com/dyu/ffi-overhead seem to show that a C call via cgo is about 14 (!) times slower than a C call from C# in Mono, which itself is about 2 times slower than just a plain C call.
So I think the slowness is caused by something else.
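If anyone wants to reproduce the shape of that measurement, here's a minimal cgo sketch (absolute numbers will vary a lot by Go version and platform):

    package main

    /*
    static int add(int a, int b) { return a + b; }
    */
    import "C"

    import (
        "fmt"
        "time"
    )

    func main() {
        const n = 1000000
        start := time.Now()
        for i := 0; i < n; i++ {
            _ = C.add(C.int(i), 1) // each call crosses the cgo boundary
        }
        fmt.Println("per cgo call:", time.Since(start)/n)
    }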
Thanks for sharing the benchmark.
Not all of them use C++ on the server side.
For instance, what restricts one from doing:
int blah(myobj a);
int blah(myobj &a);
The usual argument is efficiency, since "myobj" may be big and have an expensive copy constructor, and you may indeed see this pattern commonly used in C++ programs. But restrictions? There are none. C++ even makes it possible to assert coherency with copy/assignment constructors, and modern C++ supports the more efficient moves as well.
So, the point is: nothing in the language restricts you from passing things by value. I'd go further and say that blindly passing things by reference is bad practice and detrimental for simple objects, since you now have to consider the lifetime/ownership of the reference you passed along, and this will probably complicate an otherwise simple piece of code.
The argument about fields in memory also doesn't make much sense. How is that different from C++? Keeping references/pointers in structures is possible, but hardly something one is forced to do.
Otherwise the article is great; I just think it misses the point when comparing to "C and C++", since they are not the same language, even though the syntax is similar and there is some level of compatibility.
Interfaces in Go seem pretty inefficient for unions of small types. If you have a slice of interfaces, it's all pointers, even if many of the values fit within a word and could be stored inside the interface.
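A sketch of the cost: since Go 1.4 the interface data word is always a pointer, so even word-sized values get boxed when stored in an interface (the runtime only interns tiny values such as small integers):

    package main

    import "fmt"

    func main() {
        xs := make([]interface{}, 1000)
        for i := range xs {
            xs[i] = i * 1000 // most of these heap-allocate a boxed int
        }
        fmt.Println(xs[1])
    }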
Its glossary is invaluable. Even though the site is old, GC has been around for decades and the state of the art hasn't changed much since it was written. It's opinionated in a good way, explaining why production GCs work as they do.
Note that latency isn't the only metric that matters! The top comment from that discussion says this:
There were tests lately in which Go's GC had almost the best latency. Go was a couple of times faster than Java in mean pause time, but it had 1062 pauses compared to Java's G1 GC, which had only 65 pauses. Time spent in GC was 23.6 s for Go but only 2.7 s for Java. There is no free lunch; you pay for low latency with throughput.
More detail from the same guy:
"Go: 67 ms max, 1062 pauses, 23.6 s total pause, 22 ms mean pause, 91 s total runtime
Java, G1 GC, no tuning: 86 ms max, 65 pauses, 2.7 s total pause, 41 ms mean pause, 20 s total runtime"
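Doing the division on those numbers: Go spent 23.6/91 ≈ 26% of its runtime paused versus 2.7/20 ≈ 13.5% for Java, and Java also finished the whole workload roughly 4.5x faster.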
2. In September '18, ZGC will be available in OpenJDK on Linux/x86 (available today in early access), which also targets sub ms pauses (and guarantees under 10ms): https://wiki.openjdk.java.net/display/zgc/Main
Note that at some point, worst-case latency becomes far less meaningful than throughput, because unless running on a realtime OS, the OS introduces bigger pauses, and it makes no sense for the GC to try and beat them.
The fact that Go has value types helps a lot, but the sub-0.5 ms latency is mainly the result of the GC design, as explained in the discussed article (especially the work on removing stop-the-world pauses as much as possible).
> Note that at some point, worst-case latency becomes far less meaningful than throughput, because unless running on a realtime OS, the OS introduces bigger pauses, and it makes no sense for the GC to try and beat them.
This is already said in the article.
> Go: 67 ms max, 1062 pauses, 23.6 s total pause, 22 ms mean pause, 91 s total runtime
> Java, G1 GC, no tuning: 86 ms max, 65 pauses, 2.7 s total pause, 41 ms mean pause, 20 s total runtime
Otherwise it seems to me that the Java/C# model is the best design for most tasks, which is why they're so popular; it's not a mistake.
This is discussed in the article (basically, Google needed low latency servers):
« If you want 10 answers ask for several more and take the first 10 and those are the answers you put on your search page. If the request exceeds 50%ile reissue or forward the request to another server. If GC is about to run, refuse new requests or forward the requests to another server until GC is done. And so forth and so on.
All these workarounds come from very clever people with very real problems, but they didn't tackle the root problem of GC latency. At Google scale we had to tackle the root problem. Why?
Redundancy wasn't going to scale, redundancy costs a lot. It costs new server farms. »
But they did. The new low-latency Java GCs are more sophisticated than Go's, and deliver pauses that are on the order of OS-caused pauses. The reason Go was able to achieve low latency with a relatively simple design is because 1. it suffers a hit to throughput and 2. that throughput hit, while significant, is not catastrophic because Go relies heavily on value types.
As you wrote, they are new, and weren't available when the decision was made for the Go GC.
> The reason Go was able to achieve low latency with a relatively simple design is because 1. it suffers a hit to throughput and 2. that throughput hit, while significant, is not catastrophic because Go relies heavily on value types.
And are you saying these "new low-latency Java GCs" have no tradeoffs either?
I'm sorry, but we are discussing the keynote of the International Symposium on Memory Management, which is a recognized event in the field, and you are claiming things without any substantial material to offer. Maybe you're right, but I need more than vague assertions to be convinced :-)
A recent presentation on ZGC is here: https://www.youtube.com/watch?v=tShc0dyFtgw
Hudson explains they tried to switch to a generational GC, but for this they needed a write barrier. It was difficult to optimize the write barrier by eliding it whenever possible, because Go is moving to a system where there is a GC safepoint at every instruction (this is because goroutines can be preempted at any time, which is not a requirement for Java threads). In other words, the GC design is also constrained by the way goroutines work.
Hudson also explains that because Go relies a lot on value types, escape analysis is more effective even without interprocedural analysis, which makes generational collection less beneficial than in other languages using a GC.
Keeping the allocation rate low is part of Go's GC design. A language with a "much higher allocation rate" would probably lead to a different design.
Thanks for the link to the presentation on ZGC! I'll watch it soon. But I saw the slide showing the performance goals, and ZGC doesn't sound a lot better than the numbers presented by Hudson for Go:
- "10 ms max pause time" for ZGC versus "two <500 microseconds STW pauses per GC" for Go
- "15 % max throughput reduction" for Java versus "25% of the CPU during GC cycle" for Go"
By the way, I also note that ZGC is "single generation".
ZGC, whose explicit goal is to be a low latency GC for Java, has a goal of 10 ms max pause time, which seems way above the pauses caused by the OS (compared to two <0.5 ms pauses per GC cycle for Go). The number came from the video you shared earlier. But maybe I misunderstood something.
2. It's still very early days for ZGC, but there are more established low-latency Java GCs. A non-realtime one is C4 (used in the Zing JVM), which added generational collection to a design similar to ZGC's. They report a few 0.5-1 ms pauses per hour.
This is definitely great engineering.
I also recommend this short presentation of ZGC: https://dinfuehr.github.io/blog/a-first-look-into-zgc/
It is easy to forget there isn't a single answer with Java.
Google Cache (irony...):