Go does not need a Java-style GC (erik-engheim.medium.com)
153 points by nahuel0x 13 days ago | 219 comments





> In a multithreaded program, a bump allocator requires locks. That kills their performance advantage.

Java uses per-thread pointer bump allocators[1]

> While Java does it as well, it doesn’t utilize this info to put objects on the stack.

Correct, but it does scalar replacement[2] which puts them in registers instead

> Why can Go run its GC concurrently and not Java? Because Go does not fix any pointers or move any objects in memory.

Most Java GCs are concurrent[3], and if you want super low pauses you can get those too[4][5]. Pointers can get fixed while the application is running, using GC barriers.

[1]: https://shipilev.net/jvm/anatomy-quarks/4-tlab-allocation/

[2]: https://shipilev.net/jvm/anatomy-quarks/18-scalar-replacemen...

[3]: https://shipilev.net/jvm/anatomy-quarks/3-gc-design-and-paus...

[4]: https://wiki.openjdk.java.net/display/zgc/Main

[5]: https://wiki.openjdk.java.net/display/shenandoah


I agree. The author seems to know quite a bit about Go and GCs, but doesn't seem to have much experience with Java. As a Java performance engineer, I get the impression he is comparing Go to how he thinks Java works based on what he's read about it.

Additionally, he doesn't seem to know that much about C#, which also has an advanced GC while allowing for C++-like memory management if needed.

It's odd how most people that haven't used a VM with GC are amazed by Go (no VM) and WASM (no GC) but still fail to understand that with a GC _AND_ VM you can code something that doesn't crash even if you make a big mistake!

And they haven't even bothered to try it out!

To use anything other than JavaSE/C# on the server you really need very good arguments!


Or being amazed by Go's compile speed, when Turbo Pascal and Object Pascal compilers were already doing that in the 1980s; or finding WASM innovative, when polyglot bytecodes with no GC also go back to the early 1980s, the Amsterdam Compiler Kit's EM being one example among many.

The innovation in WASM is more about getting all the major players in the browser space to agree and support it as a first-class citizen in the web stack. That polyglot bytecodes existed in the early 1980s does nothing for the web.

What a perfect example, have you even tried Java?

I'm guessing you are coding for the client?

C++ arrogance is the problem here.

About the pipe dream of WASM there are 3 problems:

1) Compile times (both WASM and the browser)

2) If you thought Applets were insecure (btw they weren't), wait until the .js bindings (also a VM with GC...) that you are forced to go through to reach anything from WASM securely get attention!

3) If you build for the browser you have Intel, Nvidia, Microsoft and Google (do you work there? might explain things) to deal with on Windows. You DON'T want that... using C to build for Linux on ARM/RISC-V has to be one of your options, and then all that work you spent on getting WASM to work is wasted (because you won't have the performance/electricity/money for the cycles you need in the long term).

Edit: Please don't replace Python (that you should probably never have touched) with Go...


1) No one was realistically compiling C++ or Python to Java though. WASM is not new tech as the other poster said — it’s people coming together to support one compile target that also works on the web, which itself is the crowning achievement.

2) Building a secure VM is easy. You only need to give it access to things it should have access to. If a VM has only math instructions, it’s not going to access the file system. My computer can’t poke my nose because there is literally no machine instruction for it.

Java did not build its VM that way. Instead, the JVM had full access to everything, individual functions were blacklisted within Java applications, and this was enforced by Java code running in the same machine as the attacker's code. Naturally, every blacklist works like a sieve.


When I think about it, the memory security is probably the weakest point of WASM and probably also the reason nothing of value has come out of that initiative yet.

How does WASM protect memory?

Or maybe there is something valuable made in WASM and I don't know about it. Figma is, I think, but I'm not in the target audience.


Not to mention Turbo Pascal was compiling faster than Go, on a single-core, in-order 20 MHz PC...

With more advanced language features too.

Yep, Turbo Pascal 7.0 vs Go 1.0.

I just wrote some Go last weekend and the compile time was very slow. It reminded me of Scala. Anyway, I switched to Ruby and didn't have to deal with it any more. Turbo Pascal really was fast, but I don't see that in Go.

Scala and SBT are some of my biggest productivity killers. They murder my machine and take ages to compile.

> To use anything other than JavaSE/C# on the server you really need very good arguments!

Haha! Quoting this so you can't delete it.


Most programmers use PHP, Python, Ruby, or Node.js, which are all GC'd bytecode VMs.

What language are you imagining as an argument here?


C# makes it possible to do C (not C++) like memory management but it does not make it easy. Unsafe C# code is really, really, really unsafe, much more unsafe than equivalent C code, and does not compose well. It's improving, with Span/Memory, etc, but it remains an absolute last resort.

C and C++ memory management are alike, and C# doesn't need reference-counted library types.

If you mean RAII, it can easily be done with IDisposable, using, and Roslyn analysers that throw errors when usings are forgotten.

And yes, manual memory management should be a last resort, proven by profiler data.


I do mean RAII, which is fundamental to C++. Memory management without it is not "C++ like".

IDisposable is not a substitute for RAII, and C# has nothing that can manage ownership in equivalent ways as it lacks deterministic destruction. Implementing, for example, a data structure based on unmanaged memory in C# in such a way that it can be safely used by code outside the library without undermining the safety of the language (i.e. without introducing the possibility of programming errors outside the data structure implementation causing leaks or memory corruption) is an exercise in discipline and requires a thorough understanding of the runtime - eg. knowing that an object can be garbage collected while a method on it is executing. I know this because I've done it (as a last resort after extensive profiling and production use of various optimised, managed versions of the library).


Maybe you should better explore the Marshal interop services and the Span-related improvements.

IDisposable is definitely an alternative when coupled with Roslyn analysis or SonarQube.

As if doing it in C++ doesn't require the same knowledge level; worse actually, as it also requires ISO C++ mastery and a good knowledge of how each compiler being used handles IB and UB.

Also, unless static analysis is being used, RAII can also get the ownership wrong, especially with use-after-move errors or smart pointers incorrectly given as arguments.


I used Spans extensively. They help (mostly by providing efficient bounds checking), but not much. The main problem, as I mentioned initially, is the composability of this sort of code.

I wasn't trying to give some kind of defence of C++ here, just pointing out the dissimilarities to C#. This particular piece of code would have been far easier in C++ and RAII would have been a huge help, but the system as a whole would have been an utter nightmare in C++ (which I know because I've worked on multiple similar systems in C++).


Sure, but I am of the opinion that it does compose, if one embraces that it isn't the same as writing C++ in C#. Not trying to downplay your experience on the matter.

For example, here are the Roslyn analysers for a more RAII like experience with IDisposable,

https://github.com/DotNetAnalyzers/IDisposableAnalyzers

Turn "DISP004-Don't ignore created IDisposable." into a compiler error and you have your region allocated RAII like experience.

And, moving the goalposts a bit: for those scenarios where C# fails completely to provide a useful solution, we can rewrite just that module in C++ (preferably C++/CLI when only Windows matters) and have the best of both languages.


There's also ref which is almost like pointers and safe. Span/Memory do not remain an absolute last resort, they are becoming the standard way for string formatting/parsing/(de)serialization and IO.

I didn't say Span/Memory are a last resort, I said C style memory management (AllocHGlobal/Free), despite being somewhat improved by Span/Memory, remains an absolute last resort. Span/Memory aren't primarily aimed at handling unmanaged memory, though they're useful for it.

Of course C-style memory management is a last resort: 99% of devs don't need it, and the 1% that actually need it had better learn to use VTune.

Why is it more unsafe than equivalent C code? I always thought it was simply a way to enable C-like real pointers/native memory...

"Scalar replacement" explodes the object into its class member variables and does not construct a class object at all. That does result in the exact same `sub %esp` (that Go would do for any struct), but it is restricted to only working if every single usage of that class type is fully inlined and the class is never passed anywhere that needs it in its object form.

It's worse than what Go has. Go can stack-allocate any struct and still pass pointers to it to non-inlined functions.
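To make the contrast concrete, here is a minimal Go sketch (mine, not from the thread): `go build -gcflags=-m` should report that p does not escape, even though its address is passed to a call that is never inlined.

    package main

    type point struct{ x, y int }

    //go:noinline
    func dot(p *point) int {
        // Escape analysis still sees that this parameter doesn't leak,
        // so callers may keep their point on the stack.
        return p.x*p.x + p.y*p.y
    }

    func main() {
        p := point{3, 4} // stays on the stack despite &p crossing a call
        println(dot(&p))
    }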


Scalar replacement does not work even in very trivial cases: https://pkolaczk.github.io/overhead-of-optional/

In all those cases, Optionals were inlined, didn't escape, yet they haven't been properly optimized out.


Did you understand why?

I don't know the exact reason here, but from my experience JVMs don't seem to perform optimisations as deep as static compilers. You can see the compiler not only missed scalar replacement here, but also didn't use a conditional move to avoid branching, nor did it perform any loop unrolling. Maybe it is because JITs are constrained by time and resources a lot more.

It'd be worth doing a deep dive into this - you can look at the compiler's intermediate representation to understand why it didn't make an optimisation. There may be something trivial getting in the way.

Scalar replacement (as currently implemented in Java) does not work in real-world programs.

Well-written code does not need it. Poorly written code can not trigger it, because the JIT is too dumb and isn't getting better.

There is no sane test to determine whether a piece of code will be inlined in Java. In practice anything more complex than a byte array is unlikely to be inlined. Even built-in ByteBuffers aren't! Meanwhile the Go compiler treats Go slices just as nicely as arrays, or better.


The JIT is getting better. Major escape analysis upgrades are a big part of where Graal (a drop-in replacement for the HotSpot JIT) gets its performance boosts. EA definitely does work well there because Truffle depends on escape analysis and scalar replacement very heavily. GraalVM CE is better than regular HotSpot at doing it and GraalVM EE is even better again.

The Graal effort is over 10 years old.

Jaotc (Graal's sole mainstreamed part) was recently removed from OpenJDK in release 17, and Oracle says [1] that they are "considering the possibility of using C2 for ahead-of-time".

1: https://mail.openjdk.java.net/pipermail/discuss/2020-Novembe...


It’s almost as if graal and openjdk are separate projects with different use cases. Also, they are not competing (in the usual meaning) since both are developed by Oracle.

Which JIT? There are plenty of them to choose from across Java implementations.

ZGC[4] in particular has me excited, enough so to want to pick up a JVM language.

ZGC and Shenandoah can be slower than G1; they are not silver bullets. The fact that there are 4-5 GCs explains the situation: there is not a single GC that is better than the others.

It really depends on the workload.


Indeed. It is strange that no official JDK document puts the pros/cons of the GCs packaged with standard JDKs in some kind of easy-to-read table/matrix.

Well, unless latency is explicitly a problem with the default (G1) GC, it probably should not be changed to begin with. It is a beast of a GC with a very good balance between throughput and latency. Also, if the latter is problematic, the first thing should be to fiddle with the one G1 knob one should set (unless they really know what they are doing): the target pause time (-XX:MaxGCPauseMillis). Throughput and latency are fundamentally opposing goals.

G1 by default has low enough pauses as well, but it does increase with the speed of garbage creation. But it handles even ridiculously high throughputs with graceful degradation of pause times.

Here is a really great blog on various aspects of modern GCs (and don’t forget that we are at Java 17, with small but significant GC updates in each version): https://jet-start.sh/blog/2020/06/23/jdk-gc-benchmarks-remat...


Right, it is all good. As I said, how difficult would it be for an official document to put out a comparison table? A comparison around latency/throughput/heap size/cost (mem/CPU) should be reasonable enough for developers to choose the right GC for their applications.

ZGC is already available, since JDK 15, September 2020. =)

https://wiki.openjdk.java.net/display/zgc/Main#Main-ChangeLo...


Max pause times of 0.5ms is what got me really interested.

It feels like a huge trade-off of GCs is almost completely gone.

https://malloc.se/blog/zgc-jdk16


> It feels like a huge trade-off of GCs is almost completely gone.

FWIW the tradeoff of low latency GC is usually paid in throughput.

That is definitely the case for Go, which can lag very much behind allocations (so if your allocation pattern is bad enough the heap will keep growing despite the live heap being stable, because the GC is unable to clear the dead heap fast enough for the new allocations).


throughput can be fixed by adding compute. latency cannot. always optimize for latency with gc.

and no, the heap will not keep growing in golang. it'll force threads to help with GC if it's falling behind, thereby reducing the rate of allocations and speeding up the collection.


Random Go dev suddenly more of an expert than the literal best-in-class GC experts that have been working on G1GC, ZGC, Shenandoah, Parallel, ConcMarkSweep and others.

GCs are a matter of tradeoffs. Always optimizing for latency is Go's solution, but there are reasons for everything. It's the very reason why the JVM has so many knobs. Yes, it requires a PhD to know what to tune, but there are many parameters for the thousands of different situations that can arise.


*rolls eyes* I've tuned java's GC for highly available and massive-throughput systems.

pretty familiar with the trade-offs. java's problem isn't GC (in general); its problem is memory layout and the fact it can't avoid generating a metric shit ton of garbage.

G1GC was a good improvement and I stopped paying attention at that point because I no longer had to deal with its problems (left the ecosystem).

I'm not asserting java hasn't improved or that its GC implementations aren't modern marvels. fundamentally they're just a self inflicted wound.

golang wouldn't benefit anywhere near as much as java has from these implementations because guess what... they've attempted GCs that operate under similar assumptions (young allocations, compacting patterns, etc.) and they failed to make an improvement that just increasing the heap size knob wouldn't fix with the current GC.


> it can't avoid generating a metric shit ton of garbage.

IMHO this is an often-overlooked aspect of Java and I get flak for pointing out that Java programs are often full of defensive copying because the Java style encourages it. The JDK is very trashy and highly biased against reusing objects or doing anything in-place without creating garbage. You can't, for example, parse an integer out of the middle of a string.
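For contrast, a quick Go sketch (mine): slicing a string is just a view into the same bytes, so you can parse an integer out of the middle of a string without creating any garbage.

    package main

    import (
        "fmt"
        "strconv"
    )

    func main() {
        s := "order=12345;qty=7"
        // s[6:11] shares s's backing bytes: no copy, no garbage,
        // just a new string header pointing into the middle of s.
        n, err := strconv.Atoi(s[6:11])
        if err != nil {
            panic(err)
        }
        fmt.Println(n) // 12345
    }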

15 years ago when I wrote a lot more Java, I could get away with avoiding JDK libraries to write tight and small Java code (yeah, omg, mutable non-private fields). That was how I got my AVR microcontroller emulator to beat its C competitors. Nowadays if you write that kind of code you get a beatdown.

JVMs work really hard to deal with trashy Java code. Generics make everything worse, too; more hidden casts and adapter methods, etc.


Only in some kinds of apps, like web servers where all the heavy lifting is being done by the database anyway.

Consider a compiler. It's not infinitely scalable to multiple cores. It may not even be multi-threaded at all. It also doesn't care about pause times - for that you want Parallel GC.


depends on the compiler and language. golang seems to counter your position quite handily, having one of the fastest compile times and being highly concurrent.

if your application is so simple it doesn't use concurrency then 99% of the time you can completely remove the need for GC by preallocating slabs.
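To illustrate, "preallocating slabs" can be as little as a free list over one backing slice; a hedged sketch (the design is mine, not from this thread):

    package main

    // Every node lives inside one big backing slice, so the GC sees a
    // single allocation instead of one allocation per node.
    type node struct {
        value int
        next  int // free-list link: index into nodes, -1 = end
    }

    type slab struct {
        nodes []node
        free  int // index of the first free node
    }

    func newSlab(n int) *slab {
        s := &slab{nodes: make([]node, n)}
        for i := range s.nodes {
            s.nodes[i].next = i + 1
        }
        s.nodes[n-1].next = -1
        return s
    }

    func (s *slab) alloc(v int) int {
        i := s.free
        if i < 0 {
            panic("slab exhausted")
        }
        s.free = s.nodes[i].next
        s.nodes[i].value = v
        return i
    }

    func (s *slab) release(i int) {
        s.nodes[i].next = s.free
        s.free = i
    }

    func main() {
        s := newSlab(1024)
        i := s.alloc(42)
        s.release(i)
    }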


The Go compiler is fast because it doesn't do very much, not because Go's GC is good for throughput oriented jobs.

Go has repeatedly tied itself in knots over the years because they have a goal of not making the compiler slower, yet, the GC needs of the compiler are diametrically opposed to the needs of the HTTP servers Go is normally used for.

https://go.dev/blog/ismmkeynote

"As you can see if you have ROC on and not a lot of sharing, things actually scale quite nicely. If you don’t have ROC on it wasn’t nearly as good ... At that point there was a lot of concern about our compiler and we could not slow down our compilers. Unfortunately the compilers were exactly the programs that ROC did not do well at. We were seeing 30, 40, 50% and more slowdowns and that was unacceptable. Go is proud of how fast its compiler is."


For a compiler, the GC strategy that makes most sense is to never deallocate. The compiler is going to die shortly. No need to do the bookkeeping of figuring out how to deallocate before that happens.

There's still a memory tradeoff, some due to GC, some due to Java (lots of runtime reflection...). Guessing 2-4x.

This is a bit of a tangent, but you can get into situations where Java's memory-overhead becomes pretty untenable. I was in a situation of having to keep track of ~1 billion short strings of a median length of maybe 7 characters.

In terms of just data, that should clock in at about 10 GB; in practice it was closer to 24 GB. I tried going with just byte[] instances instead, which didn't help a lot. Using long byte[] instances and indexing into those doesn't help as much as you'd think, because they get sliced up into small objects behind the scenes.

I ended up memory mapping blocks of memory and basically implementing my own append-only allocator.
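The packing idea looks roughly like this, sketched in Go for concreteness (the original was Java over memory-mapped blocks, so this is illustrative rather than the poster's code):

    package main

    // stringArena packs many short strings into one big byte slice:
    // two allocations for the GC to track instead of a billion.
    type stringArena struct {
        data    []byte
        offsets []uint64 // string i spans data[offsets[i]:offsets[i+1]]
    }

    func (a *stringArena) add(s string) int {
        a.offsets = append(a.offsets, uint64(len(a.data)))
        a.data = append(a.data, s...)
        return len(a.offsets) - 1
    }

    func (a *stringArena) get(i int) string {
        end := uint64(len(a.data))
        if i+1 < len(a.offsets) {
            end = a.offsets[i+1]
        }
        // Copies on read; callers that care can take a []byte view instead.
        return string(a.data[a.offsets[i]:end])
    }

    func main() {
        var a stringArena
        i := a.add("hello")
        println(a.get(i))
    }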


FWIW. This would probably present a challenge in most (all?) languages.

For example in libc++ due to SSO an std::string has a minimum size of 24 bytes.

For a billion strings less than 15 chars (+ the null byte) that gets you to 24GB, and that’s optimistically assuming each string is allocated in place.

I doubt heap allocated char* would do much better either. Just having a billion 8 byte pointers eats a lot of memory. You’d really need some sort of string packing scheme similar to what you did in Java.


It's a lot easier to build custom allocators in C++ though.

For one, Java has a maximum mmap size of 2 GB, and as a cherry on top of that turd, you have no control over a mapping's lifecycle. The language is very clearly not designed for this type of work, and if you try to make it do it anyway, it fights you every step of the way.


It has nothing to do with the language but with the API / VM capabilities.

Both those restrictions are fixed; granted, this is not yet in the official API, it's an incubator module (the equivalent of Python's from __future__).

https://docs.oracle.com/en/java/javase/17/docs/api/jdk.incub...


The foreign memory API which is currently incubating should help with most of these limitations: https://openjdk.java.net/jeps/419

Right - specifically they invented a way to make closing an mmapped segment safe and fast. The reason you can't officially (without private APIs) unmap something in current Java is because if you did then other threads would segfault, and "no segfaults" is kind a defining characteristic of Java. The new API fixes this using more VM magic, so closing a mapping from one thread will cause other threads to throw exceptions, but this is done without sacrificing performance (it doesn't require checking a status every time you read from memory).

The Lilliput project aims to address this: https://wiki.openjdk.java.net/display/lilliput

I would never write something like this in java, but to be fair, a program shouldn't be written like this in the first place. If you "need" a billion strings in memory and you didn't design for that with something that would scale better, you messed up a long time ago.

Huh, I didn't really have a problem solving the problem as it came up. Like many scaling problems, it wasn't a problem until it was. Then I fixed it. Now I have a solution that can deal with ten times as many strings as before. If I grow out of that one, I'll come up with a better design.

I could have gotten 10 times as much hardware instead, but that would be an incredible waste of money compared to just spending a few days writing more hardware-efficient code.


I have addressed similar problems using a typed array of contiguous memory and another array of lengths.

I did try exactly that, but the GC overhead was still prohibitively high for my use case, because big arrays in practice are composed of shorter non-contiguous arrays to make garbage collection even remotely possible.

Hmmm, if the typed array only contains primitives, the GC isn't looking inside the array. If the thing pointed to can never contain a pointer, it won't need to be scanned.

A factor of 2.4 is extremely unlikely to be what makes the difference between a program being viable and not. You have to be in an extremely fine-tuned situation for that to make the difference, and unless your usage level is unusually static it's probably only going to make the difference for a few weeks or months.

So for some context, I'm running a search engine on consumer hardware. Coaxing power out of limited hardware is sort of my bread and butter. How many of these strings I can keep in memory limits how many documents I can index. Doubling that number is quite significant.

This number is orthogonal to how many searches I can serve, that is already a solved problem (through similar craftiness). Multiple searches per second sustained load is fine. The search engine has held up to the hacker news front page, which is something many supposedly non-interactive sites struggle with.

This is a non-profit operation, so unless someone just decides to pay me a months salary or two, I'm not able to meaningfully increase the hardware. I'm stuck with what I've got and will make the most of it.


The time of someone who can do that kind of crafty tuning is almost certainly worth more than the hardware. If nothing else, I'd think having you do a week or two of consultancy and buy more hardware with the proceeds would be more efficient in terms of getting things done. (I appreciate that nonprofits have a lot of particular constraints that mean that that's probably not a practical way forward).

Playing a devil's advocate here: a use case might already be hitting the maximum per server RAM.

A factor of 2.4 might mean 2.4 times more hardware. You might need 240 servers instead of 100.


Sure. And? Everything that you need to do to manage 240 servers, you need to do to manage 100 servers. Like I said, if you're in the scaling stage then you'll be getting to 240 in a few months either way, so you need to be ready for it. 2.4x higher operating expenses matter eventually, when you're a mature organisation, but if the software is under active development you almost certainly have higher priorities.

Also, throughput. But latency and throughput are almost universally opposite ends of the same axis — that’s why it’s great that Java allows for choosing a GC implementation.

you can fix throughput problems by adding compute resources; you can't fix latency issues. I'll always pick a GC that optimizes latency over throughput. it's easier to maintain the software.

The GC memory overhead affects all languages with a GC more advanced than refcounting. It certainly does affect Go as well.

Skimmed this and found that a lot of claims are inaccurate and don't have data... Then realized, ugh, it's another Medium post. One particular wrong claim:

> In C#, for example, reference objects and value objects live entirely separate lives. Value objects are always passed by value. You cannot take the address of a value object and pass that around. No, you need to copy the whole object every time.

In C#, there are normal, C-style pointers. (Although they aren't used very often, mostly when interoperating with C APIs.) Furthermore, you can pass a struct by reference if copying the struct is a problem.

I once wrote some performance-critical code where I used structs and passed them around by pointers. I ran the code through a profiler, and then refactored to use normal idiomatic non-pointer code. After running it through the profiler again, there was no performance difference.

---

The author of this post really needs to base their claims on actual data: write some well-tuned programs in the languages compared, and then compare the performance.


In general when talking about quantitative subjects, we need to use quantitative measures. I think the author is nearly there, but in general, unless you have data to back up your claims, they should be ignored. I don't say this snarkily, but in a domain where we actually have hard quantitative measures, they should be used and required for argument.

Quantitative measures would be nice, but I would imagine that it's going to be really difficult to quantitatively compare go and java garbage collectors, without a billion other factors about the language/runtime getting in the way.

Java's GC is the state of the art, without doubt. Other managed languages can be faster for some workloads (by not generating much garbage in the first place, e.g. Go), but where it does come up:

https://news.ycombinator.com/item?id=29323468


Yielding to non-quantitative reasoning when the difficulty of measurement goes up can lead to decisions like the one to launch the Challenger space shuttle in 1986. Difficulty doesn't change the science or the questions we ask. Politics doesn't make something true.

I am not even talking about GCs, I am talking about the intellectual rigor we should use when discussing quantitative subjects. To do anything else is to practice magic.

Research papers on GCs will outline the comparisons and metrics they use to study GC algorithms.


I doubt it'll be that hard. Just write the same memory intense routine in both languages, and time it running in a loop for a couple million executions.

Make sure the code is idiomatic and optimized.


> Make sure the code is idiomatic and optimized.

Those are basically opposite ends of the same axis.

Heavily optimized code is never idiomatic in any language. If the natural, idiomatic code were fast enough it wouldn’t need to be optimized!


Yep. His claims about C# are so egregiously wrong that the entire piece can safely be dismissed.

I'm a bit skeptical.

Yes, Go has value types and pointers. But whether you need a modern GC will undoubtedly depend a lot on the type of algorithms you need to execute. Also, it's great that you can implement (some form of) allocators, and that will definitely help for many algorithms, but that's definitely a case of tradeoff between convenience and readability. Similarly, unless I'm mistaken, TCMalloc "solves" fragmentation in two cases: either allocations are small (very common) or memory allocation maps neatly to threads (much less common). That's two good cases to have, but I wouldn't count on it solving memory allocation on its own for, say, a browser engine or a videogame.

Oh, and hasn't Java's GC been fully concurrent for a while now?

That being said, revisiting the assumptions made by Java (and other languages) is a very good idea.


> Java is a language that basically outsourced memory management entirely to its garbage collector. This turned out to be a big mistake.

Given that Java is not far behind C and C++ for many kinds of workload and in quite some scenarios can outperform C code, I'm not sure that I buy this line of reasoning.

> Doing these updates requires freezing all threads.

Eh, what? There are multi-threaded GCs, what is this article even talking about?

> However, this does not put C# and Java on equal footing with languages like Go and C/C++ in terms of memory management flexibility

In which world are Go and C in the same league when it comes to memory management flexibility? That's like comparing a language that uses a garbage collector to one that requires manual memory management. Because - it is.

> Modern Languages Don’t Need Compacting GCs

> If need be, the Pacer slows down allocation while speeding up marking.

So, you just traded in one drawback for another?

Heck, I get it. The JVM is not a thin graceful fawn. It's a complicated beast that requires years of experience to tame - and even then it will come back from time to time and bite you. There are many good points on which to critique the JVM - but I don't get the feeling that the author of this article has spent much time with modern JVMs, because he isn't pointing out any of them.


The author doesn't really understand how Java escape analysis works, and just focuses on one key aspect: "It does not replace a heap allocation with a stack allocation for objects that do not globally escape."

The author then implies that escape analysis is only used to reduce lock acquisition. Java escape analysis will replace a heap allocation with a stack allocation if the code is fully inlined. This is known as scalar replacement.


I don't know much about this, but upthread various people say that scalar replacement happens very rarely if at all, currently. E.g.: https://news.ycombinator.com/item?id=29324132.

Could you perhaps comment on that, since you seem to have experience?


The issue is not that it doesn't happen - it does, all the time. The problem is it's unpredictable, hard to control and hard to measure. So it's sort of magic that gets blurred into all the other optimizations the VM is doing, and refactoring your code can make the difference between it happening or not.

First of all, that blog post is wrong: it uses Optional, which isn't a value type to start with and contains internal fields.

It is considered a value-like class that will eventually be turned into a value type when Valhalla arrives; until then there are no guarantees about scalar replacement.

Secondly, GraalVM and Azul are much better at it, and the author didn't bother trying their blog post's benchmark on those.


As some of the other comments in the thread allude, this is quite a rudimentary (or rather outdated) understanding of how Java GC operates and ends up (unfortunately) turning an otherwise good comparison into a straw-man argument.

As someone who's worked with Java from the days where "If you want superhigh performance from Java without GC pauses, then just turn off GC and restart your process every X hours" was considered a "valid" way to run high-performance Java systems, I think the changes Java has made to GC are among the biggest improvements to the framework/JVM and have contributed vastly to JVM stability and growth over the last decade.


The fundamentals of Java data/class layouts in memory have remained the same for decades. So the author is right at the big-picture level.

> I think the changes Java has made to GC are among the biggest improvements to the framework/JVM and have contributed vastly to JVM stability and growth over the last decade.

This is of course true. However, the point is that for Java it is an absolute necessity; for Go it may be a nice-to-have.


Cool article, I'm not sure I agree with the headline.

I used to write low-scale Java apps, and now I write memory intensive Go apps. I've often wondered what would happen if Go did have a JVM style GC.

It's relatively common in Go to resort to idioms that let you avoid hitting the GC. Some things that come to mind:

* all the tricks you can do with a slice that have two slice headers pointing to the same block of memory [1]

* object pooling, something so common in Go it's part of the standard library [2]

Both are technically possible in Java, but I've never seen them used commonly (though in fairness I've never written performance critical Java.) If Go had a more sophisticated GC, would these techniques be necessary?

Also Java is supposed to be getting value types soon (tm) [3]

[1] https://ueokande.github.io/go-slice-tricks/

[2] https://pkg.go.dev/sync#Pool

[3] https://openjdk.java.net/jeps/169
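A hedged Go sketch of both idioms from the list above (the buffer type and sizes are invented for illustration):

    package main

    import (
        "bytes"
        "sync"
    )

    // Two slice headers over one backing array: splitting a message
    // into header and body costs no allocation at all.
    func split(msg []byte) (header, body []byte) {
        return msg[:4], msg[4:]
    }

    // A pool of scratch buffers, recycled instead of collected.
    var bufPool = sync.Pool{
        New: func() interface{} { return new(bytes.Buffer) },
    }

    func handle(payload []byte) {
        buf := bufPool.Get().(*bytes.Buffer)
        buf.Reset()
        defer bufPool.Put(buf)
        buf.Write(payload)
        header, body := split(buf.Bytes())
        _, _ = header, body // ... process ...
    }

    func main() {
        handle([]byte("HDR!hello, pool"))
    }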


Object pooling in Java used to be fairly common. I don't see it much anymore in new code, but used to run into it all the time when writing code for Java 1.4/5. Even Sun used pooling when they wrote EJBs. Individual EJBs can be recycled instead of released to the GC.

Nowadays the GC implementations are good enough that it's not worth the effort and complexity.

Though now that I think about it Netty provides an object pooling mechanism.


Pooling objects (for the purpose of minimizing GC) is considered a bad practice in modern Java. The article suggests that compacting, generational collectors are a bad thing, but they can dramatically speed up the amount of time it takes to deallocate memory if most of the objects in a given region of memory are now dead. All you have to do is evacuate the objects that are still alive, and you're done: that region is now available for reuse. The result is that long-lived objects have a greater overhead.

Does object pooling still make sense for direct ByteBuffers nowadays?

Yes. Those aren't GC-controlled, so any argument about GC is irrelevant with direct byte buffers.

Also, object pooling isn't really a GC related hack, it's more useful as a cache booster. Programmers like immutability and garbage collection but your CPU doesn't like these things at all. If you're constantly allocating new objects it doesn't matter if your GC kicks ass, because those objects will always be in parts of memory that are cold in the cache. If you allocate some up front and mutate them, they're more likely to be warm.

Obviously this isn't a language or even VM thing. It's a "mutable variables are good for performance" thing.


> Both are technically possible in Java, but I've never seen them used commonly (though in fairness I've never written performance critical Java.)

I don't know about the Java world, but in C#—especially in games written in Unity—object pooling is very common.


Writing High Performance .NET Code (https://www.writinghighperf.net/) has a chapter on this. In C#, time spent collecting depends on the number of still-living objects. That means you want objects you allocate to be short-lived (dead by the time GC happens) or to live forever (they go to the gen 2 heap and stay there). The book suggests object pooling when the lifetime of objects is between those two extremes, or when objects are big enough for the Large Object Heap.

But at the end of the section, the book says:

  I do not usually run to pooling as a default solution. As a general-purpose mechanism, it is clunky and error-prone. However, you may find that your application will benefit from pooling of just a few types.

What kind of things do you pool in Unity?

Unity aside, I believe it's useful in a lot of areas of gamedev.

Anecdotally, in a lower-level game engine I wrote at one point in C#, object pooling significantly reduced memory overhead (and IIRC increased framerates on complex scenes) when I scaled well past 1000 dynamic, moderately-lived entities. Particles, objects, projectiles, bad guys, etc. I believe can all benefit from pooling, assuming they aren't long-lived.

I do agree it can be error prone, but I'm convinced it's worth it for several places in gaming.


Ideally you would pool anything you might need to dynamically allocate during a level. You want to avoid allocations during game play entirely, if possible.

Unity itself will pool most media assets. Any given texture asset is shared between all object instances that use that texture. The programmer will end up pooling instances of their objects or just use structs and such. It can be tedious but I wouldn't call it more clunky than explicit memory management.

Large collections are actually not a problem at all in games as long as you only run the collection during a load screen.


I've seen it on a large site written in c#. Object pools of stream objects for serializing and deserializing data. This was 10 years ago.

Java has a pretty decent standard library with different list, map and set implementations and quite a few third party libraries with yet more data structures. Honestly, Go felt a bit primitive and verbose to me on that front on the few times I used it. Simplicity has a price and some limitations.

There are also other tricks you can do like for example using off heap memory (e.g. Lucene does this), using array buffers, or using native libraries. There obviously is a lot of very memory intensive, widely used software written for the JVM and no shortage of dealing with all sorts of memory related challenges. I'd even go as far as to argue that quite a few of those software packages might be a little out of the comfort zone for Go. Maybe if it were used more for such things, there would be increased demand for better GCs as well?

Object pooling is pretty common for things like connection pools. For example apache commons pool is used for doing connection pooling (database, http, redis, etc.) in Spring Boot and probably a lot more products. Also there are thread pools, worker pools and probably quite a few more that are pretty widely used and quite a few of those come with the Java standard library. Caching libraries are also pretty common and well supported popular web frameworks like Spring.

A typical Java based search or database software product (Elasticsearch, Kafka, Casandra, etc.) is likely to use all of the above. Likewise for things like Hadoop, Spark, Neo4j, etc.

Of course there's a difference between Java the language and the JVM, which is also targeted by quite a few other languages. For example, I've been using Kotlin for the last few years. There are functional languages like Scala and Clojure. And people even run scripting languages on jython, jruby, groovy, or javascript on it.

There even have been some attempts to make Go run on the JVM. Apparently performance, concurrency and memory management were big motivators for attempting that (you know, stuff the JVM does at scale): https://githubmemory.com/repo/golang-jvm/golang-jvm

Their pitch: "You can use go-jvm simply as a faster version of Golang, you can use it to run Golang on the JVM and access powerful JVM libraries such as highly tuned concurrency primitives, you can use it to embed Golang as a scripting language in your Java program, or many other possibilities."


Not sure if you're in on the joke, but for those who didn't go to the repo itself:

https://github.com/golang-jvm/golang-jvm

It's just a copy-paste of JRuby on April 1st and the readme now includes a rickroll.

Maybe it's irresponsible of them to leave it up in a way that Google still finds as a legitimate-looking search result.


LOL, I was not aware and stepped right into that.

There appear to be other attempts: for example https://github.com/zxh0/jvm.go (might be the same?)

Let's just say people have tried/joked about it but it never took off.


It's the other way round: implementing a JVM in Go, not running Go on the JVM.

> There even have been some attempts to make Go run on the JVM. Apparently performance, concurrency and memory management were big motivators for attempting that (you know, stuff the JVM does at scale):

This seems legit. Just links to their website/Wiki are not working right now.


Object pooling used to be more common in Java. Now it is mainly used for objects that are expensive (incurs latency) to create, not for GC reasons.

How have you found Go in contrast to Java? Is the simplicity worth it?

yes. golang is actually less restrictive than java. and avoids a ton of the bullshit abstractions you see in every java code base.

>ton of the bullshit abstractions

Not sure how that's Java's problem. Most of these abstractions come from older frameworks. You can have the same abstractions/design patterns in Go too.


> In a multithreaded program, a bump allocator requires locks. That kills their performance advantage.

Wait, what?

What's wrong with:

    #include <atomic>
    #include <cstddef>

    constexpr std::size_t HEAP_SIZE = 0x1000000;
    char theHeap[HEAP_SIZE];
    std::atomic<std::size_t> bumpPtr;

    void* bump_malloc(std::size_t size){
        // Relaxed is enough: only the increment itself must be atomic.
        std::size_t returnCandidate = bumpPtr.fetch_add(size, std::memory_order_relaxed);
        if(returnCandidate + size >= HEAP_SIZE){
            // Garbage collect. Super complicated, lets ignore it lol.
            // Once garbage collect is done, try to malloc again. If fail then panic().
        }
        return &theHeap[returnCandidate];
    }
------

You don't even need acq_release consistency here, as far as I can tell. Even purely relaxed memory ordering seems to work, which means you definitely don't need locks or even memory barriers.

The exception is maybe the garbage-collect routine. Locking for garbage collection is probably reasonable (other programmers accept that garbage collection is heavy and may incur large running + synchronization costs), but keeping locks outside of the "hot path" is the goal.

------

This is what I do with my GPU test programs, where locking is a very, very bad idea (but possible to do). Even atomics are kinda-bad in the GPU world but relaxed-atomics are quite fast.

-------

> In Java, this requires 15 000 separate allocations, each producing a separate reference that must be managed.

Wait, what?

    Point[] array = new Point[15000];
This is a singular alloc in Java. 15000 _constructors_ will be called IIRC (My Java is rusty though, someone double-check me on that), but that's not the same as 15000 allocs being called, not by a long shot.

------

Frankly, it seems like this poster is reasonably good at understanding Go performance, but doesn't seem to know much about Java performance or implementations.


    Point[] array=new Point[15000];
In Java, that would create an array of null references; to fill it up you need to create each object so that each Point is valid.
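For what it's worth, the Go-side contrast the article is gesturing at looks something like this (a sketch, not taken from the article):

    package main

    type Point struct{ X, Y float64 }

    func main() {
        // One allocation: 15000 Points laid out contiguously,
        // zero-valued and immediately usable; no per-element
        // constructors and no null references.
        points := make([]Point, 15000)
        points[0] = Point{X: 1, Y: 2}
    }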

I stand corrected on that point.

Even with relaxed memory ordering, you’re still bouncing the cache line around between every CPU that tries to access bumpPtr. I would expect that to be significantly slower than using per-thread heaps (especially if you have a large number of CPUs).

Java bump pointer allocations (ie: normal allocations, except for eg: large array allocations) occur in a thread local allocation buffer. Only when a thread exhausts its current allocation buffer does it need to worry about potentially contending with other threads to acquire a new one.

There are very few falsifiable statements in this article but the extant ones are all demonstrably false, starting with this whopper:

"This typically causes Java programs to have complete freezes of several hundred milliseconds where objects get moved around"

Yeah, i mean, definitely not. The only language I regularly work with that has this property is, surprise, Go.

The proof is in the tasting, as they say. Go look at the gRPC daily benchmarks if you want hard data. Java beats Go latency at the median and at the tail while having substantially better specific throughput. In other words the Java GC has no tradeoff versus the Go GC; it is better in every way.


The Java gRPC SDK is heavily optimized by Google, much more than the Go one. Until recently, Java pauses were a real issue.

I worked with a lot of Java based solutions ( elastic search, hadoop etc ... ) and was never impressed by the GC.


While I agree that the Java gRPC lib is better maintained, I don't agree with your second point. Java's GCs are the state of the art and can manage heap sizes up to 16 terabytes. Other GC implementations would simply die there.

That’s neat, but a lot of applications will never need 16TB.

They are a beast in throughput as well (as well as alternatively, really really good at low-latency with the new GCs made specifically for that): https://news.ycombinator.com/item?id=29323468

Well, you are talking about java applications...

Maybe, but it's still not possible that the Java test has a median latency of 150µs, a tail latency of 350µs, and a regular need to stop the world for hundreds of milliseconds. That statement is simply not compatible with reality.

As I said, the Java SDK is heavily optimized to do the bare minimum of allocations. I mean, if you don't allocate much, the GC is not really an issue.

OK but again, that is a statement totally contrary to the article. The article is claiming two things: that it is easier in Go to avoid the heap while Java is utterly dependent on allocating everything on the heap - which is not consistent with the experience of actual Java and Go programmers - and that Java has a "preference for high throughput and high latency" which it doesn't.

FWIW, I used to write some high-performance JavaScript. One of the tricks of the trade was to allocate early and make sure you avoid allocating once the loading phase has started, hence avoiding triggering the garbage-collector (in most runtimes/languages, gc phases are typically triggered by the allocator). This involved writing very non-idiomatic code, to a large extent reimplementing a form of high-level custom allocator on top of the existing allocator/gc, but it worked.

I'd be very surprised if the same wasn't possible in Java.

I don't know if this is how gRPC is written but it could explain an apparent contradiction.


To be fair, early Java definitely had lots of issues with GC and "hanging" programs due to GC activity. It's a failure of the article to bring it up if modern Java implementations are better on that point. As an example of the bad reputation: I think the latest Eclipse IDE on the latest Java SDK is still sucky, and out of ignorance I blame it on Java.

I was disappointed with the first Go gRPC implementation, it allocated like crazy and was really slow because of it. It's been rewritten since then but I don't know how much better it is now. To get good performance out of Go it is important to think about memory allocations (unfortunately), just as it is in C and C++. Although allocating like crazy in C/C++ will probably mostly result in a general loss of performance, not affect tail latencies in network protocols much.


gRPC Java wins in part by not allocating memory when reading and writing buffers. It uses a Java-based implementation of jemalloc (via Netty) to manage its own memory. No/low allocations means the GC doesn't need to wake up so much.

i've literally experienced worse GC pauses from java (sometimes surpassing 30s) due to allocation. granted, it was before the latest G1 GC, but they absolutely exist and many companies run on older versions of java.

and I doubt you see pauses in golang longer than 30ms. proof please.


Comparing apples to oranges. Java has been around for multiple decades, so there may be some bias on your part based on very old examples. But also, your "average" java program is likely much more complex, with a larger code base, than your average go one.

Also, shitty code does get written by developers quite often, regardless of language — at that point the runtime can only do that much.


this is entirely BS. if by complex you mean overly abstracted with useless boilerplate, then yes, java is larger and more complex.

and your assertion about the runtime only being able to do so much is just wrong here. golang will slow down your application in order to meet GC deadlines.

however it's incredibly rare to see GC STW pauses longer than 10s of ms. not even sure it's possible given how the GC is written. this is true for heaps in the 100s of GBs.

this is why you don't hear about issues with golang's GC since like ~1.7. it just doesn't show up in 99% of use cases.

your basic assertion that java's GC is more advanced and therefore better is completely without merit. GC doesn't operate in isolation. java has more complicated GC systems because the language mandates them. golang entirely sidesteps the issues that force java to require said systems, with better memory layouts and by utilizing the stack.

it's incredibly rare to hear stories about golang's GC because it's simply not a problem.

java still has a bunch of issues and requires different GC tradeoffs as a result, meaning java will likely be in a constant state of 'which GC model do I use'.


Have you followed this very thread?

> golang will slow down your application in order to meet GC deadlines

Hardly a good thing for throughput-oriented applications. Java lets you tune this (the default G1 collector has a target pause time parameter, and then there are also the two new low-latency GCs, Shenandoah and ZGC, which have <1ms maximum pause times).

> golang entirely sidesteps the issues java has requiring said systems with better memory layouts and utilizing the stack.

As mentioned many times, having value types greatly decreases the amount of created garbage, thus a less-modern GC can be enough for most Go applications. But there absolutely are use-cases where new object creation is unavoidable and java will beat Go by a huge margin, due to its more advanced GC, see:

https://news.ycombinator.com/item?id=29323468


> Have you followed this very thread?

indeed i am; people seem to be having issues dealing with the reality of the two languages. with one (golang), the problems you encounter with its GC are easily remedied using memory pools or adding compute. with the other? you don't really have a choice without fighting against the language itself, or relying on improvements to the GC itself. which is precisely why java has had to spend so much effort on its GC to reduce latency, while golang has been able to achieve extremely similar results with a fairly straightforward concurrent mark and sweep.

> Hardly a good thing for throughput-oriented applications.

add more compute. problem solved. it's extremely rare to have situations where you can't add more compute to the problem. that's the point. you can easily solve throughput issues with compute. you can't solve latency issues as easily. if your STW GC pause ends up being 30s+ you're fucked. period.

if your GC is the source of the problem because of garbage generation, generate less garbage. in go that's generally very easy. in java it's very much not.

>beat Go by a huge margin, due to its more advanced GC, see:

you mean the advantage that entirely disappears with a memory pool added in like 4 lines of code? doesn't sound like a huge margin win to me.

I've never once asserted java's GC isn't amazing for the problem it's solving. I'm asserting it's completely unnecessary in golang because golang allows avoiding the garbage to begin with.

you: 'JAVA GC IS AMAZING COMPARED TO GOLANG'S!' me: 'who gives a shit, in practice golang memory allocation problems are trivial to resolve'

it's the in-practice reality of dealing with the two languages that I care about, not the self-inflicted problems java forces on developers.


I'm cautious about believing this sort of claim because I remember reading about why Go doesn't need generics, yet here we are waiting for Go generics to be ready. However, the article convincingly explains why Java has a greater need for a compacting GC: it creates more garbage. This doesn't necessarily mean Go won't benefit from having a generational, compacting GC at some point, for some applications.

Even if it doesn't specifically need a compacting GC, having a swappable, more tunable one might still be a big deal for people who need something other than the current GC. A simple example is the Discord article about how their use case could not work with Go because it always ran the GC every two minutes. If they could have optionally disabled that feature, they theoretically could have kept using Go instead of rewriting in Rust.


That is also possible in Java, with the Epsilon "GC". But it is not really meant for programs that have to run for weeks.

That wasn't my read.

The Discord team that wrote the article I believe you're referencing ( https://discord.com/blog/why-discord-is-switching-from-go-to... ) wanted GC, the problem was that they had a huge cache which took a long time for the GC to scan. They could shrink the cache and it would remain performant in the face of GC, but they took latency hits due to increased cache misses. After a bunch of testing and tweaking they found a goldilocks zone where performance was okay.

Rust was introduced because it was already something other teams were interested in, and when they tried a quick prototype, they avoided the issue entirely, and saw better performance even in the prototype than their finely tuned Go code, so they switched to it.


To me it sounded like if they could have turned off the auto-scan on a time limit it would have been fine, because once it was fully up and running they didn't really need GC: the memory footprint was stable. But because they had no way of turning that off without compiling their own version of the runtime, it wasn't tenable.
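If that read is right, the knobs that exist today are debug.SetGCPercent (GOGC) and explicit runtime.GC(); as I understand it, a negative percent also disables the periodic forced collection in recent Go releases. Whether that would have fit Discord's workload at the time is another question. A sketch:

    package main

    import (
        "runtime"
        "runtime/debug"
    )

    func main() {
        // Disable the pacer: the heap grows until we collect explicitly.
        // (In recent Go this should also stop the two-minute forced GC.)
        debug.SetGCPercent(-1)

        serveRequests() // hypothetical steady-state workload

        // Collect only at moments we choose.
        runtime.GC()
    }

    func serveRequests() { /* ... */ }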

The go gc was greatly improved in the versions subsequent to the ones used by Discord. The timing there was unfortunate.

Besides sometimes engineers just want to use Rust.

Ah yes! A new shiny thing. I've had plenty of conversations where some 'new' framework/language solves a 'huge and unsolvable' problem (that has not actually been deeply investigated), or is just 'cool because XYZ company is using it'... ¯\_(ツ)_/¯

Go creates less garbage and also stack-allocates things using escape analysis, which, as the article explains, is effectively a form of generational garbage collection.

However it's a weak one, the stack allocation acts as a form of extremely limited nursery, but lots of "escaping" objects could well fit into a nursery, to say nothing of "heap" objects (like strings and slices) which always trigger heap allocations.

Furthermore AFAIK most generational GCs have 3 generations, not 2 (let alone 1.5).

It does make the tradeoff more complicated, a generational GC is not simple (especially in a language with ubiquitous mutability), but the casual dismissals are... troubling.

And that's before mentioning the regularly problematic lack of tuning knobs in the Go GC, also often dismissed as "java concerns" (which Go users have to work around with ugly hacks when hit, precisely because there are no tuning knobs).


> Go users have to work around using ugly hacks when hit, because they don't have tuning knobs

Yeah, Java users just have to hire Java performance tuning experts from sprawling Java perf consulting cottage industry. Can't get much simpler than that.


Even if you need an expert, it’s safer and cheaper than rewriting the critical path several times hoping for better behavior.

I don't get this attitude. Optimizing your code is a pretty normal thing to do. Running performance analysis, finding hotspots and tuning them.

I would prefer tuning my code over tuning a garbage collector as tuning code is more transparent. It depends on your background. If you are used to native code development like me, then you are used to thinking about how code get compiled and how memory is used.

I suppose in the Java world, these things are treated as black boxes. The downside of that is that you get no stability under your feet. Everything is up to whatever garbage collector you use and how that is tuned. I am sure many people like that, but I would prefer to be in control over my own code and understand why it performs and doesn't perform. I want optimization to be explicit rather than based on lots of magical tweaking by some GC expert.


Well, I guess YMMV. I have seen more success by rewriting slow stuff.

That's because Java is used to run huge systems (dare I say, "enterprise scale"), whereas golang, especially today with microservices, is not seen in such areas.

However, we still see issues in golang, like:

* https://blog.discord.com/why-discord-is-switching-from-go-to...

* https://news.ycombinator.com/item?id=21670110


I hope that is a joke. Endless Java GC problems with any large system (Cassandra, Hadoop, WebSphere, and so on and on), despite decades of work from top vendors like Oracle, SAP, and IBM, tell a different story. I have seen enough "enterprise scale" Java systems that would die of an out-of-memory error before even "hello world" was served. The "Java runs huge systems" line only works on those who don't know that most of these huge enterprise systems do not run as a single JVM process.

As a user of a system that shits itself every day except weekends on 128GB of heap, I am not gonna buy the "oh, these microservice kids do not know enterprise scale systems" line. I have seen both, and I am not an uncritical fan of either.


Because that's the nature of those systems. If you wrote them in golang they'd almost surely run into similar issues.

As far as I know (I’m not too familiar with Go), Go mostly stack-allocates based on the developer’s intent, e.g. by using structs.

Java doesn’t (yet) have an option for value types that can be reliably stack-allocated, so it resorts to very complex escape analysis. Calling the former "escape analysis" is a bit misleading imo, even if technically true.


Go is a little more elaborate than that: if a value is initialized as a pointer-to-struct (e.g. foo := &Foo{...}) and it doesn't escape the function, Go will allocate it as if it were a value type.

You can stack-allocate in C, with alloca() or by taking the address of a local, and use it like a pointer. So long as you're extremely sure nothing is going to hang onto the pointer beyond the lifetime of that stack frame, it's fine.

Same thing with Go, except that the compiler makes the decision.
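
A small sketch of that decision in action; building with `go build -gcflags=-m` prints the compiler's escape-analysis verdicts (names here are invented):

  package main

  import "fmt"

  type Foo struct{ a, b int }

  // Reported as "does not escape": despite the pointer syntax,
  // the Foo never outlives this frame and is stack-allocated.
  func local() int {
          f := &Foo{a: 1, b: 2}
          return f.a + f.b
  }

  // Reported as "escapes to heap": the pointer is returned,
  // so the Foo must survive the frame.
  func leak() *Foo {
          return &Foo{a: 3, b: 4}
  }

  func main() {
          fmt.Println(local(), leak().a)
  }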


Go designers never said they would never add generics. I've followed Go from the start and they have always been clear that they would consider adding generics if they found a good design for it. However they never had it as a high priority.

Many Go fans have said Go does not need generics, and many, including myself, would still argue that it doesn't. Yet I cannot say I mind Go adding generics. I am just worried about whether Go manages to retain its simplicity. Time will tell.

I have been a professional Swift programmer, and Swift has had ALL the features people have been shouting for years that Go should have, yet it cannot be said to be a better language. A lot of these features also make Swift clunkier to use than Go in many instances.

Having said that, I like both Swift and Go. I just cannot see that more is always more.


The JVM world right now has quite a few GCs and your choice essentially dictates your program's performance tradeoffs.

But I'm not sure there will ever be a point where this is true for Go, considering how little ends up on the heap compared to Java.


> I remember reading about why Go doesn't need generics, yet here we are waiting for Go Generics to be ready

Hey, us generics-naysayers are still out here (grumbling)! :)


This is mostly stupid. Being able to have many GCs is a good thing. The big reason for "value types" is controlling spatial locality in memory, not GCs being bad per se.

Also, they undersell the Java/C# situation. C# has "ref", "out", and "in", but even without those you can always make a reference wrapper that has the value type as a field. So "reference types suck because of copying" is nonsense garbage.


> The big reason for "value types" is controlling spatial locality in memory, not GCs being bad per se.

Spatial locality is one reason. Another very important reason is not generating unnecessary garbage. If you have a lot of microallocations, then your program is going to be slow regardless of whether you use manual memory management, RAII or a tracing GC. There is just a lot more work to do. With manual memory management, you're going to call free() a lot (using big forward linked lists and freeing them with a function that iterates over them, calling free() on each node? yikes). With RAII you're going to spend a lot of time in destructors due to an avalanche of objects freeing the objects they own (structs holding unique_ptr/shared_ptr to structs holding unique_ptr/shared_ptr to...? std::vector<std::string>/std::vector<std::unique_ptr>? these are all antipatterns when it comes to performance). With a GC, you're going to have long GC pauses, because the GC needs to traverse a bigger graph of pointers (deep/broad trees of heap-allocated objects? arrays of boxed objects? yikes).

Another reason is eliminating unnecessary pointer indirection, which helps make sure that instead of loading data with two MOVs from memory, polluting the cache, you need only one.
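
In Go terms (a sketch with invented types): a []Point keeps the payload inline in one contiguous block, while a []*Point adds a pointer load, and a potential cache miss, per element:

  package main

  import "fmt"

  type Point struct{ X, Y float64 }

  // One load per element: the Points live inside the slice's backing array.
  func sumInline(ps []Point) (s float64) {
          for i := range ps {
                  s += ps[i].X
          }
          return
  }

  // Two loads per element: the pointer first, then the Point it
  // refers to, which may be anywhere on the heap.
  func sumBoxed(ps []*Point) (s float64) {
          for _, p := range ps {
                  s += p.X
          }
          return
  }

  func main() {
          fmt.Println(sumInline(make([]Point, 4)), sumBoxed([]*Point{{X: 1}, {X: 2}}))
  }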

For my current project, I use arenas to opt-out of RAII and have just a couple points in my project where memory is freed, instantly in one go. Needless to say, I don't use arrays of boxed elements. It's working great for me. Not only is it faster, but also simpler to understand and I don't need to fight with the borrow-checker as much as in normal (RAII) Rust, because my objects are grouped into longer lifetimes, so I don't need to think about separate lifetimes for every distinct little object.

That said, I'm spending some time gathering ideas for a toy language I'd design for myself to write an OS from the kernel up to userspace in, and I'm leaning towards a GC-based design that relies a lot on value types. One way or another, allocating a lot of little objects is stupid, because in every scenario it makes your program slower and, code quality-wise, at best it doesn't make a difference, but a lot of times it makes your code harder to reason about, because now a lot of things are remote to the place in code you're thinking about at any given time.


> If you have a lot of microallocations, then your program is going to be slow regardless of you using manual memory management, RAII or a tracing GC.

No, that's not true. Tracing GC is mainly O(f(live stuff)), not O(f(dead stuff)), so there is basically no problem with making lots of garbage if you don't have any exotic latency requirements.

> Another reason is eliminating unnecessary pointer indirection, which helps make sure that instead of loading data with two MOVs from memory polluting the cache, you need only one.

Uh, that is locality, what I started with, no?

> I'm leaning towards a GC-based design that relies a lot on value types.

That sounds good.

Compacting GC makes locality less of a problem than one would think with the old Java/Haskell style, but yes, we in Haskell also want unboxed types (the name I prefer) to have more control.

In cases where, if one traverses the parent, they almost certainly traverse the child (e.g. from map nodes to their keys, not to child map nodes), they are a good fit, as with long-lived objects.

----

The overall lesson here is that Haskell/Java programs can be surprisingly fast and not memory-thrashing, but for reasons that are quite surprising. Two Haskell and Rust programs can feel similar in what they mean and how they are written, but then be fast for very, very different reasons.


> No, that's not true. Tracing GC is mainly O(f(live stuff)), not O(f(dead stuff)), so there is basically no problem with making lots of garbage if you don't have any exotic latency requirements.

That's true for copying generational garbage collectors: you only need to copy the live stuff and forget everything else. But in e.g. a mark-sweep collector, sweeping dead stuff takes time. In particular, Chromium's Oilpan can take a long time in the sweeping phase when there are lots of dead objects.

But when there are a lot of microallocations, you also have a lot of live objects, which could otherwise be just a single object in the case of e.g. big arrays.

>> Another reason is eliminating unnecessary pointer indirection, which helps make sure that instead of loading data with two MOVs from memory polluting the cache, you need only one.

> Uh, that is locality, what I started with, no?

Hmm, it seems I highlighted the wrong aspect. It's not just locality, but also the size of the data you're dealing with. Those pointers take space. Another aspect is that arrays of unboxed elements have more predictable access patterns. Iterating over an array of unboxed elements is friendly to cache prefetching; CPUs recognize this (linear) pattern. With boxed elements and a compacting garbage collector, locality may be fine, i.e. elements of the array may be close in memory, but the order in which you fetch elements is pretty random. When the cumulative size of the elements of the array is significantly larger than 64 bytes (the usual size of a cache line), then you're going to be prefetching the wrong regions of the array. There are situations where the order is going to be right, because e.g. the elements were created in the same order in which they appear in the array, but that will be destroyed after some sorting, filtering or whatever.


> That's true for copying generational garbage collectors. You only need to copy the live stuff and forget everything else. But in e.g. a mark-sweep collector, sweeping dead stuff takes time. In particular, e.g. Chromium's Oilpan can take a long time in the sweeping phase, when there are lot of dead objects.

Sure. I got confused about why people choose these designs, which seem to me an awkward compromise, but yes, they exist.

> But when there are a lot of microallocations, you also have a lot of live objects, which could otherwise be just a single object in the case of e.g. big arrays.

Not necessarily in general. Sure, with the array, and I agree that is a bit silly. But there is no general law that more microallocations means more live data.

> Hmm, seems I highlighted the wrong aspect....

Sure those things sound sensible. I don't mean to disagree with any of that.


> Being able to have many GCs is a good thing.

The only advantage I can think of to having a single GC is when using 3rd-party packages. If the tradeoff is always low latency over throughput, then libraries know what to target when optimizing; otherwise it’s less clear what to benchmark against.


I mean, I’ve dabbled in all of these languages, and I much prefer Go’s value types to ref, out, in, and a half dozen GCs. It’s nice that these VM languages have a distinct thing for every eventuality, but I much prefer a single thing that works 99% of the time.

I'm usually the one bashing Go :D, but not this time.

I am saying this blog author is confused about what the runtimes are capable of. I don't mean to say ref, out, and in are good language ergonomics or whatever, but simply that they show the compilation target is capable of expressing these things.

The only thing I know of that the Go runtime can do that these others can't is "interior pointers", i.e. pointers to a field of a larger object. You can always just copy the field into a new box, but that breaks mutation semantics. Java gets away with this precisely because most fields are themselves boxed... but that's exactly the no-control-over-locality problem we're trying to solve.
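
For anyone unfamiliar, an interior pointer in Go looks like this (a sketch, types invented): the GC keeps the whole enclosing object alive through a pointer to one of its fields, and mutation through that pointer is visible in the original:

  package main

  import "fmt"

  type Header struct{ Count int }

  type Packet struct {
          Header  Header
          Payload [64]byte
  }

  func main() {
          p := &Packet{}
          h := &p.Header              // interior pointer into the middle of Packet
          h.Count = 7                 // no copy: mutates the field in place
          fmt.Println(p.Header.Count) // prints 7
  }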


Why do you much prefer

  func changeName(p *Person) {
    p.firstName = "Bob"
  }
to

  void changeName(ref Person p) {
    p.firstName = "Bob";
  }

?

Low Java knowledge, and more fan babble than true critique.

> In C#, for example, reference objects and value objects live entirely separate lives. Value objects are always passed by value. *You cannot take the address of a value object and pass that around.*

The last sentence is wrong. You can pass structs by ref in C# (ref, in, out parameters). Java will also allow taking references to value types and passing them.


The binary-trees benchmark on the Debian Language Shootout[1] involves allocating millions of short-lived trees and traversing them. It is informative about GC performance, even with the caveat that there are 'lies, damned lies, and benchmarks', because many real-world graph analysis and brute-force tree search algorithms similarly allocate zillions of short-lived nodes. For non-GC languages like C/C++/Rust it gives a decent idea of the performance difference between malloc'ing and freeing individual objects vs doing bulk arena allocations:

  language         secs       GC'd language?
  ========         ====       ==============
  C++ (g++)        0.94
  Rust             1.09
  C (gcc)          1.54
  Free Pascal      1.99
  Intel Fortran    2.38
  Java             2.48       yes  <====
  Lisp (SBCL)      2.84       yes
  Ada (GNAT)       3.12
  OCaml            4.68       yes
  Racket           4.81       yes
  C# .NET          4.81       yes
  Haskell (GHC)    5.02       yes
  Erlang           5.19       yes
  F# .NET          6.06       yes
  Node.js          7.20       yes
  Julia            7.43       yes
  Chapel           7.96       yes
  Dart             9.90       yes
  Go              12.23       yes  <====
  Swift           16.15       * "Automatic Reference Counting"
  Smalltalk (VW)  16.33       yes
  PHP             18.64       yes
  Ruby            23.80       yes
  Python 3        48.03       yes
  Lua             48.15       yes
  Perl            53.02       yes
So Java has the fastest GC for this test: 2.48 secs vs 12.23 secs for Golang. The Java code is also notably idiomatic for multicore; it doesn't do heroic "avoid GC by writing C-like code manipulating a fixed global memory array" tricks. The Java code is also more concise.

The 'plain C' code that uses Apache Portable Runtime memory pools instead of standard malloc/free and uses OpenMP #pragma's strikes me as more 'heroic' than 'idiomatic', whereas C++ and Rust use standard libraries/crates and idiomatic patterns. (Note that OpenMP is 'standard' for high-performance C and well supported across GCC/LLVM/Microsoft/Intel compilers. But still....)

OCaml and Haskell made impressive showings for functional languages, which are in practice the easiest for dealing with complicated tree algorithms. That is perhaps why the formally verified C compiler, CompCert, is implemented in OCaml, as is Frama-C for formally verifying ISO C programs, as is the Coq theorem prover, etc.

[1] https://benchmarksgame-team.pages.debian.net/benchmarksgame/... [Edited link]


Look at the Go benchmark program:

https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

Needs to be faster? Use a pool like the C benchmark programs, but super simple: just a slice, local to the goroutine.

Trivially make the Go program 10x faster, 5 minutes of work, a few edits:

  type Tree struct {
          Left  int
          Right int
  }

  func itemCheck(id int, pool []Tree) uint32 {
          tree := &pool[id]
          if tree.Left != -1 && tree.Right != -1 {
                  return uint32(1) + itemCheck(tree.Right, pool) + itemCheck(tree.Left, pool)
          }
          return 1
  }

  func bottomUpTree(depth uint32, pool *[]Tree) int {
          var tree Tree
          if depth > uint32(0) {
                  tree.Left = bottomUpTree(depth-1, pool)
                  tree.Right = bottomUpTree(depth-1, pool)
          } else {
                  tree.Left = -1
                  tree.Right = -1
          }
          id := len(*pool)
          *pool = append(*pool, tree)
          return id
  }

  func inner(depth, iterations uint32) string {
          var (
                  pool []Tree
                  chk  = uint32(0)
          )
          for i := uint32(0); i < iterations; i++ {
                  a := bottomUpTree(depth, &pool)
                  chk += itemCheck(a, pool)
                  pool = pool[:0]
          }
          return fmt.Sprintf("%d\t trees of depth %d\t check: %d",
                  iterations, depth, chk)
  }
If it matters, your Go program can be made as fast as a C program.

Easily.


Several (about 10) years ago, I submitted essentially the same program. It was one of the fastest of all the implementations across all the languages.

However, Isaac rejected it, saying it violated the conditions.


Of course. The problem description [1] specifically calls for the following:

When possible, use default GC; otherwise use per node allocation or use a library memory pool. As a practical matter, the myriad ways to tune GC will not be accepted. As a practical matter, the myriad ways to custom allocate memory will not be accepted. Please don't implement your own custom "arena" or "memory pool" or "free list" - they will not be accepted.

In other words, this problem measures memory allocation performance with the "natural" strategy, which is either a default GC, a memory pool or per-node allocation, in that order of availability. If the target implementation doesn't support GC by default, the code would necessarily be more complex, which offsets the performance benefit. So you can't really compare submissions with different strategies, because they are just doing different things and there is no reasonable way to make them do the same thing [2].

[1] https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

[2] You may still argue that every submission should use a per-node allocation for the fair comparison, but a per-node allocation with GC and without GC is pretty different.


My submission did violate the conditions. I was just recollecting what happened; it is not a complaint.

I am at variance with those conditions, though. I refer to the benchmarks suite itself, broadly. When programs dip into GMP, PCRE, etc., we are no longer comparing the speeds of the different language implementations per se. Now they are mixed with the FFI speeds of those implementations, the choices of the foreign-language libraries/algorithms, etc.

In the same vein, as you remark in your second footnote, for a fair comparison, the same algorithm should be used in all programs. That will shed light on the strengths and weaknesses of the different language implementations better. I don't intend to say that the current system is useless; I am saying that using the same set of algorithms is likely to be more illustrative.


The point of this benchmark (that you're deliberately ignoring) is actually to have at least one program that realistically shows how bad a language's GC is. Go performs terribly because its GC makes lousy tradeoffs, which is the only reason you're annoyed at its requirements. It's actually probably the best benchmark in the whole set on that site.

Can this benchmark be written using the sync.Pool mentioned in the article? If so, then that should be accepted, since sync.Pool comes in the stdlib.
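
For reference, sync.Pool usage looks roughly like this (a sketch; whether the maintainers would accept it as a stdlib pool rather than a custom "free list" is exactly the question):

  package main

  import (
          "fmt"
          "sync"
  )

  type Node struct{ Left, Right *Node }

  var nodePool = sync.Pool{
          // New is called when Get finds the pool empty.
          New: func() interface{} { return new(Node) },
  }

  func main() {
          n := nodePool.Get().(*Node) // freshly made or recycled
          n.Left, n.Right = nil, nil  // reset: pooled values keep old state
          fmt.Println(n)
          nodePool.Put(n) // return it for reuse
  }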

Debian Maintainers doing Debian Things.

Now do the same in Java and it will be as fast, if not faster. The point is that the benchmark is looking at GC performance. In large real-world programs, it is not trivial to make such changes.

Do it. Show me that Java can avoid allocating/collecting the Tree objects. Go has no indirection for the Tree objects in this example. The Trees are contiguous in memory.

Measure the performance.

I have no garbage collection performance problems in the various large Go projects I have worked on. But I am also a competent programmer, so "your mileage may vary".


For the time being, until Valhalla is finalized, Java can emulate this approach simply by using an int array.

After making the modifications, on my local machine, Java ran as low as ~820ms, compared to golang's ~932ms.

golang's GC is no silver bullet as we see here:

* https://blog.discord.com/why-discord-is-switching-from-go-to...

* https://news.ycombinator.com/item?id=21670110


Thanks for doing the test!

No problem. I'll try to download a Valhalla JVM build and see how it fares.

How does Java do without using an array of ints, or Valhalla? In other words, what's the cost of Java's forced indirection?

What happens if you use

  pool := make([]Tree, 0, 256)
instead of

  var pool []Tree
in the Go so it doesn't waste time growing tiny slices?

After making the change in the golang version, I saw it go as low as 867.61ms.

The JVM is very good at inlining. It's not perfect, obviously, but in this benchmark just using an `int[]` wrapped in a class (similar to an ArrayList) produced quicker results than golang. I'm not surprised, since the golang compiler doesn't generate the greatest code.

I tried with Valhalla, with the following array wrapper, and I saw it run as quick as 667ms.

    stretch tree of depth 22  check: 8388607
    2097152  trees of depth 4  check: 65011712
    524288  trees of depth 6  check: 66584576
    131072  trees of depth 8  check: 66977792
    32768  trees of depth 10  check: 67076096
    8192  trees of depth 12  check: 67100672
    2048  trees of depth 14  check: 67106816
    512  trees of depth 16  check: 67108352
    128  trees of depth 18  check: 67108736
    32  trees of depth 20  check: 67108832
    long lived tree of depth 21  check: 4194303
    0.667455718

This is the Java Tree and array wrapper code I used

    inline class Tree {
     public int left; 
     public int right;

     public Tree(int left, int right) {
      this.left = left;
      this.right = right;
     }
    }
    
    ///////////////////////////////////////////////////////////////////////////////
    // Copyright (c) 2001, Eric D. Friedman All Rights Reserved.
    //
    // This library is free software; you can redistribute it and/or
    // modify it under the terms of the GNU Lesser General Public
    // License as published by the Free Software Foundation; either
    // version 2.1 of the License, or (at your option) any later version.
    //
    // This library is distributed in the hope that it will be useful,
    // but WITHOUT ANY WARRANTY; without even the implied warranty of
    // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    // GNU General Public License for more details.
    //
    // You should have received a copy of the GNU Lesser General Public
    // License along with this program; if not, write to the Free Software
    // Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA  02111-1307, USA.
    ///////////////////////////////////////////////////////////////////////////////
    
    final class TIntArrayList {
     protected Tree[] _data;
     protected int _pos;
     public TIntArrayList() {
      _data = new Tree[0];
      _pos = 0;
     }
    
     public void ensureCapacity(int capacity) {
      if (capacity > _data.length) {
       int newCap = Math.max(_data.length << 1, capacity);
       Tree[] tmp = new Tree[newCap];
       System.arraycopy(_data, 0, tmp, 0, _data.length);
       _data = tmp;
      }
     }
     public int size() {
      return _pos;
     }
    
     public void add(Tree val) {
      ensureCapacity(_pos + 1);
      _data[_pos++] = val;
     }
    
     public Tree get(int offset) {
      return _data[offset];
     }
    
     public void resetQuick() {
      _pos = 0;
     }
    }

Would you mind posting a complete example that compiles and runs? Your above changes to Go#8 break it. E.g. the function 'run' needs to be updated to use your new 'bottomUpTree' and 'inner'.

I'm not going to paste 170 lines of code into the thread. Add this line:

  var pool []Tree
to each of the stretch and long-lived closures in run(), pass &pool as the second argument to bottomUpTree(), and pass pool as the second argument to itemCheck().

The call to inner() doesn't change.

It's trivial.


+1 Not bad: your simple pool improved performance from about 10.9 secs to 1.4 secs on my laptop here. However, it does break the stated rules for the benchmark.[1] The Java version doesn't use a pool and doesn't change Tree pointers into ints as your version does. The C/C++/Rust versions are required to use a "library memory pool." The rules state:

  Please don't implement your own custom "arena" or "memory pool"
  or "free list" - they will not be accepted."
Donald Knuth uses integers instead of proper pointers in the source code of TeX, for maximum portability and performance, so your technique is legitimately used by master programmers. But in general, replacing pointers with integer indices into arrays only works for monolithic programs of small to medium size. For example, could the Chromium maintainers rewrite the 20 million lines of C++ into Golang, replacing most pointers with integer offsets into arrays, for performance? No, that's impossible with current tools and practices; the technique won't scale to that, so this benchmark rule isn't totally stupid and unfair. This binary-trees benchmark was originally devised decades ago by Hans Boehm, a notable GC researcher, as a way of testing GC performance that he, as a GC implementer, found informative.

[1] https://benchmarksgame-team.pages.debian.net/benchmarksgame/...


You don't rewrite your whole program to use pooled allocations.

Just the parts of your program where allocation performance needs to be improved.

If part of your program is doing many allocations, and it's too slow, use a simple pool. Go makes pooled allocation easy and safe.

I have written huge Go programs (e.g., https://NN-512.com), and the garbage collector is not a problem. Quite the opposite: it's a major, major benefit.

Pooled allocations are rarely needed.


We can't say that this is a purely GC benchmark. Generating and checking these trees takes time independently of garbage collection. Without a detailed report from the language runtime about how much time was spent in GC, it's just a guess.
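
For what it's worth, Go can produce such a report (a sketch; running with GODEBUG=gctrace=1 prints per-cycle details too):

  package main

  import (
          "fmt"
          "runtime"
  )

  func main() {
          var m runtime.MemStats
          runtime.ReadMemStats(&m)
          fmt.Printf("GC cycles: %d, total pause: %d ns, GC CPU fraction: %.4f\n",
                  m.NumGC, m.PauseTotalNs, m.GCCPUFraction)
  }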

> So Java has the fastest GC for this test, 2.48 secs vs 12.23 secs for Golang.

Furthermore, not mentioning the memory used by the Java/Go programs makes it a very fair comparison. Because GC perf does not depend on memory allocated at all, right?


Right now, in the datacenter, CPU usage is considerably more expensive than RAM usage. RAM consumes comparatively little power, whereas burning-hot CPUs+GPUs are the reason datacenters are favored near cooling water and power stations. 2.48 vs 12.23 seconds for Java and Go is a big deal for how many solar panels or tons of coal are needed to run an app on Xeon or Epyc instances, whereas 1.7GB vs 0.4GB for Java and Go, a 4x difference in low-power memory usage, is not so big a deal.

At any rate I did link to the full table so everyone can see the mem usage, source listings, etc.


Just for fun, set the Java heap to 0.4 gigs, or use GOGC to set the Go heap to 1.7 gigs. If Go is faster, then try some other sizes and draw a graph to see what the lines look like.

> Modern languages such as Go,

golang is modern? it's stuck in the 70's


I like how memory management is done in Nim [1]. You can choose among several garbage collectors, manual memory management, or no memory management at all, depending on what best fits your use case.

[1] https://nim-lang.org/docs/gc.html


The default memory manager also won't do any stop-the-world pauses, due to the way cross-thread communication works.

That's not to say the latest Java GCs aren't incredibly impressive - they are.


Go may have less need of an advanced GC, but surely it could only get faster if it had a Java-like GC.

Did Go ever fix that problem where short term high memory usage can cause large empty heaps because of the lack of compaction?

I remember a rash of articles about this phenomenon a while back. This article seems to think it's not an issue, so I'm wondering when it was fixed.


No, it was never fixed and probably never will be, because it would make Go perform worse in the latency-sensitive microbenchmarks it wants to do well in.

That was never an issue? Maybe you're thinking of the slowness golang had in returning memory to the OS? That was resolved.

JFYI: here are the GC latency numbers (JDK 17): https://kstefanj.github.io/2021/11/24/gc-progress-8-17.html . The experience is vastly different compared to an 8-year-old JDK version (JDK 8).

It’s an insightful article, but I can’t get around the fact that almost every paragraph contains subtle typos. Makes me wonder whether the author bothered to do a final pass on the text, or just dumped everything out of his mind and hit Publish.

Or maybe isn't a native English speaker and did a wonderful job in a second language?

Lots of flaws in the article. Rust doesn't need a Java-like GC because Rust doesn't need GC at all. C# has pointers. "Old" languages like C and C++ also don't need a Java-like GC, so it's not something related to "modern" languages.

Is there a single statement about C# in this piece that isn't a complete falsehood?

What exactly does Go solve that can’t be done in C#?

Headline follows logically from "Does not need Go"

I guess it is the same reasoning as Go not needing generics.

No, it is the same reasoning as Java not needing dense memory layouts.

Work on Valhalla is not funded as a joke though. They are working on that one.

Of course. The point was about the snarky comment above. Go devs took time to implement generics, balancing competing priorities. Java devs are taking time to implement value types, balancing their competing priorities.

From a 1996 point of view it seemed a correct decision, given the language landscape and Java's Smalltalk and Objective-C heritage. Yet 9 years later all languages had generics, except one created by the same authors as Go, and later their newly born language.

Also, there have been extensions, and the work to fix it goes back to 2014, after escape analysis proved not to be good enough, with IBM and Azul having their own extensions for value types.

Even Dart down the corridor got generics first, during the years they weren't needed for Go.


All said and done, it is basically: "there is nuance when people I like don't do the desired thing, and an excuse when people I don't like don't do the same desired thing".

No, Java folks never denied they were eventually needed; in fact, IBM has had the object layout extension in their implementation for almost a decade now.

And Go folks denied the need for generics? LOL. Go hate is blinding your reasoning.

What it needs is Rust-style static lifetime management.

As a Rust developer, contributor, and generally a big fan of Rust, I'd tend to disagree. Having a garbage collector is a lifesaver for many algorithms. Static lifetime management is great, but it isn't something you want to force on developers when they're more interested in coming up with solutions quickly.

Also, I really don't see how it would work in Go. You need a pretty strong type system to make static lifetime management work. Unless something has changed radically in the recent past, that's not something the Go community has had much interest in (yes, I know that generics are one step in that direction, but I'm not aware of further steps being discussed).


If you're at the point where serious GC performance is being examined, then a rewrite in Rust is something that should be considered on the strategic roadmap, if it is code you have control over (versus third-party / OSS software).

I think Go has a maximum expansion footprint. People presumably use Go because it is faster and has a better GC than Java. Rust will probably eat a lot of that territory. That will leave the people who like it strictly for language/idiomatic reasons, and that won't be enough.

The entire meta-point of the post is to try to argue that Go is "better" than Java GC-wise. Well, it is and it isn't in reality, as benchmarks and people in the know have said. If it is "better" it is a very unconvincing win.

As someone wise said 10 years ago, you almost need a 10x performance improvement to get laypeople to notice, and to get dev people to consider switching from legacy/entrenched ways of doing things.

Here, there's basically no real world improvement to point to.


> You require a pretty strong type system to make static lifetime management work

Go is strongly and statically typed.


Go's type system is much, much weaker than what is needed for safe static lifetime management to be even remotely workable.

Static typing is not an on-off switch; there are statically typed languages with extremely weak type systems (e.g. C), languages with somewhat weak type systems (e.g. Java, Go), languages with strong type systems (e.g. OCaml, Haskell), and languages with extremely strong type systems (e.g. ATS, Idris).

And that's a simplification because it's not linear either.


Rust's lifetimes need full subtyping, with covariant and contravariant type constructors (though I don't think it supports annotations for them, and always infers them instead), which I think would make the generics implementation quite a bit more complicated than it currently is...

Yeah, the fact that co/contravariant type constructors aren't (or aren't always?) annotated has always felt a bit awkward to me. Ah, well, it works :)

For some value of "strongly". It has nil and lacks sum types, so it is comparatively much weaker than even Rust/Swift.

It's a spectrum. Rust is further than Go in the direction of "strong type system", and somewhat lateral wrt Haskell, F# or OCaml, for instance (each is more strongly typed in different directions). Idris or Coq are further than Rust in most directions, etc.


