I'm not much of a Java user, nor do I use GC for any performance-critical code, but I write code that interacts over the network with quite a few Java systems. These systems have all kinds of performance problems, a lot of which involve GC pauses.
Some of those systems come from vendors who have very competent teams and invest heavily in their systems, and some have tried fancy, expensive JVMs with supposedly better real-time performance and have failed to get them to work.
So I'd offer a statement in the other direction: it's important not to overstate the ease of avoiding GC performance gotchas. Maybe it can be done, but that doesn't mean that you're going to be able to do it, especially if you don't design all parts of your application from the beginning to operate within the constraints of real-time GC implementations.
This is very important. There's far too much No True Scotsman in typical discussions around this. The fact remains that really good teams try and fail to meet needed performance criteria, even though their app/service meets all other requirements, is well architected, well coded, and the team has experience in GC tuning, pitfalls, etc.
> Maybe it can be done, but that doesn't mean that you're going to be able to do it, especially if you don't design all parts of your application from the beginning to operate within the constraints of real-time GC implementations.
Because... unless your solution is one where you can simply avoid doing a few stupid things that make the GC unhappy, you have to wade deep into managing memory, except you can't do it directly. You have to understand/infer how seemingly innocent code impacts the GC, and then suggest/cajole/plead with the GC to do what you want. It requires the kind of intimate knowledge that GC is supposed to save you from, without giving you the direct manipulation tools.
There are large classes of problems where GC is perfectly fine. Great even. But there are other problem domains where sluggishness or pauses are not acceptable, and not just realtime or soft realtime.
Azul sells a pauseless collector. AIUI the downside is that you need enough memory headroom and spare CPU cores for the collector to keep up with the mutator threads, which presumably is a significant overhead on large heaps.
I think that one of these vendors tried Azul and couldn't get it to work. (I don't know the details. If I understood correctly, the problem wasn't the GC so much as that Azul couldn't run their code correctly or would have required extensive modifications or something along those lines. In any event, they decided not to use Azul's product.)
It even says that the byte-code format was specifically designed with JIT compilation in mind:
> The byte code can be translated on the fly (at runtime) into machine code for the particular CPU the application is running on. For those used to the normal design of a compiler and dynamic loader, this is somewhat like putting the final machine code generator in the dynamic loader.

> The byte code format was designed with this in mind, so the actual process of generating machine code is generally simple. Reasonably good code is produced: it does automatic register allocation and the compiler does some optimization when it produces the bytecode.
The idea that JIT compilation was retroactively bolted onto the Java ecosystem is absurd. Perhaps more than any other technology, Java brought JIT compilation into the mainstream, starting more than 20 years ago.
Parent: what's wrong with the JVM as a multithreaded-JIT?
The original design of the JVM was for embedded systems, not for running server applications with heaps in the 10s of GB of memory.
The threading model, which allows for near arbitrary sharing of memory, is still a significant source of complexity for both the JIT and the GC and has changed multiple times since the JVM's inception.
Modern JVM implementations have little to do with the original design, but still have to deal with design decisions made 20 years ago for completely different architectures and with only limited foresight.
The way Java works is there's a language spec and a VM spec and then organizations are free to implement them as they see fit. So JVM design is referring to what's in the spec.
What's wrong with Sun's implementation? Well, they did the best they could, given that there's no design / spec to work from (for the JIT). The memory model was completely replaced in the spec in ~2005 (to work better on multi-core CPUs and stuff), and the threading model was changed in the spec sometime in the 90s (from green threads to native threads).
For some applications those still are a problem, yes, but others would be happy if those were the only ones they'd experience.
By default D uses a GC for classes (allocated on the heap, Java-style). You can disable/avoid the GC with some additional effort and the language tries to help. For example, there is a @nogc annotation, which emits an error for anything which could trigger the GC, although it does not prevent a stop-the-world event from another thread.
Rust takes kind of the inverse approach. Currently, there is no GC and real time is (relatively) easy. Now they are trying to introduce an optional GC for convenience.
Mixing GC with (real time) non-GC code is quite desirable, but I have not seen an elegant approach, yet.
https://github.com/manishearth/rust-gc actually lets you cleanly mix GC and non-GC heap objects (even put pointers to GC heap objects on the malloc heap and vice versa). So does Servo's current bindings to spidermonkey.
> Now they try to introduce an optional GC for convenience.
> long-lived garbage generally doesn't need to exist
Wonderful. It's not the GC, it's you who's incompetent. Or maybe you're competent, but your app simply doesn't really need to exist. A more complete shit-HN-says comment I couldn't find if I went searching for it.
There's a whole additional class of problems involving the JIT needing to warm up, too, but that's off topic for an article specifically about GC. (Go, for example, would not have that problem at all, but the JIT and GC situations for Java are a bit similar. JIT warmup issues can be avoided using AOT compilation, but that has its own set of issues. What seems to be completely missing, or at least was missing last time I looked, is the ability to generate a JIT profile, save it, restart, and restore the old profile.)
In a dynamic language, people are tempted to write things like `people.pick('name').capitalize().sort()`, where containers of objects are created at every stage of the pipeline just to be discarded moments later. In C you wouldn't write code like that. Or if you did, you'd simply allocate a single block of memory beforehand that's large enough for the entire operation, and reuse that memory block at every stage of the pipeline. In C, more memory-efficient code is less work to write and easier to reason about.
So it really shouldn't be a surprise that in practice GC'd languages like Java are dogs in terms of performance, nor should we expect this to change no matter how clever the garbage collectors get. The "sufficiently intelligent garbage collector" is a creature just as mystical as the "sufficiently intelligent compiler".
The D community encourages `people.pick('name').capitalize().sort()`. They call it Component-Programming and they make it fast. See  for a more realistic example. The trick is to turn the methods into (lazy) iterators (streams in Java land) with static dispatch. Now the compiler can inline everything and optimize it like a C-for-loop.
The problem is that the D garbage collector needs more engineering effort to be on Java level. For example, it is not precise or concurrent yet.
With substructural types, the programmer can tell the compiler that a specific value is meant to be used only once. For instance, the `self`-arguments to `.capitalize()` and `.sort()` can be modified in-place, if only the compiler actually knew that it won't be used elsewhere.
> The "sufficiently intelligent garbage collector" is a creature just as mystical as the "sufficiently intelligent compiler".
The real myth is that a “sufficiently intelligent compiler” can figure out what you really mean without you telling it. Compilers are often better than programmers at identifying opportunities for optimization, as well as correctly carrying them out, but the validity of these optimizations often depends on preconditions that compilers can't figure out on their own: the programmer must tell them.
Fortunately, the science of systematic, reliable and truthful communication between programmer and compiler already exists: https://en.wikipedia.org/wiki/Type_theory
Right, so compare like with like. If you write Java in a style where you're thinking carefully about performance, you can get very good performance - probably faster than you would spending the same amount of time writing C. If you write Java in a style where you think a little bit about performance, and then spend a bit of time profiling your application after, you tend to end up with something faster than your first effort in C and taking a lot less time to write as well. But for some reason people seem to want to compare the dumbest thing you could write in Java that would work at all with the dumbest thing you could write in C that would work at all, regardless of the fact that it would take much more programmer time to produce the latter.
Again looking at C# (or Swift's Lazy collections, or Java streams) your example isn't quite correct. You can absolutely get single-pass lazy evaluation of the elements, or at the very least avoid any temporary intermediates.
The .NET runtime has various GCs, some of which are concurrent GCs that can do thread-local pauses without pausing the entire process, as well as doing extremely short pauses (a few ms) but otherwise running concurrently with other threads doing allocations. The GC even does compaction, avoiding the memory fragmentation problem that has driven many a C programmer crazy. (There are other approaches like V8's JS GC.)
The idea that GC has to be a dog is nonsense. It has the benefit that allocation is just adding sizeof(object) to a pointer, making allocation trivial. It has the runtime overhead of occasionally needing to pause things for memory safety reasons, depending on the specific implementation. It also has a higher high-water mark for memory usage. It has the human overhead of reasoning about living roots and avoiding unnecessary object churn (though that is true of any system).
Reference counting has runtime overhead of incrementing/decrementing reference counts, and human overhead of manually breaking retain cycles and remembering to properly increment/decrement the counts (unless using things like ARC which insert the calls automatically).
malloc/free has runtime overhead to manage slabs and free lists, plus fragmentation problems under some workloads. It has a massive human overhead of making the programmer remember to track and free the allocated memory perfectly or risk leaking or corrupting memory, potentially leading to remote code execution in the worst case.
These are all tradeoffs with costs and benefits.
I haven't used Unity. It's probably better. (XNA was essentially abandonware a couple years after Microsoft hatched it -- a typical myopic lifecycle and a good reason why you can't trust MS with your platforms unless they have a really strong money story behind them).
On my recently shipped C# game, I spent a ton of time writing custom memory management and tracking down memory leaks. Then I hired a coder with a background in web development, and it played out exactly like the grandparent comment described. Guy had Linq queries everywhere generating 4 MB of garbage per frame, because he just didn't know.
The point is, we have to think about memory no matter what. So why bother with GC? It just makes it harder to control.
That said, I honestly don't understand why everyone only talks about the GC and malloc+free abstractions, because they don't materialize in the physical world at all (even when malloc needs a new page, it has to ask the kernel for it); the CPU and kernel have no such mechanisms, so this is a purely abstract limitation in developers' heads. So why not talk about, and measure, what's actually real (if we actually end up in a situation where we need to talk about memory at all, because usually it just works)?
Still, pooling is something everyone should be taught, and it seems very few are.
Actually, no, it does not need to explicitly record the size of the allocation. Some implementations of malloc do this, using what are called "headers". But more modern allocators tend to use techniques which avoid this. They basically follow the pattern:
1. Maintain a list of slabs, where each slab contains objects of a particular size (say, 4, 8, 16, 32, etc, but not necessarily always doubling by 2).
2. You ensure that the slabs are in contiguous regions of virtual memory. The start is always on a page boundary.
3. For each page, you record its meta-information in either a giant array (for small address spaces), or a radix tree (for large address spaces).
4. Upon a free, you can just figure out what page an object lies in, then look up where its meta-information is. Then you have all of the information you need to free an object of a particular size. This function from an old project of mine shows all three techniques (headers, radix tree, and the giant array, also called a Big Bag Of Pages): https://github.com/scotts/streamflow/blob/master/streamflow....
Something else the author did not mention is that when it comes to performance tuning a GCed application, you have another variable: heap size. If the heap size is too small, you will GC too often. If the heap size is too large, your pauses when you do GC may be too large. It may require tuning to find that sweet spot.
Also, this came up in a previous thread (https://news.ycombinator.com/item?id=11058911), but the routines the author outlined for object allocation are quite similar to the ones used in modern implementations of malloc.
Please note that I am talking about techniques for explicit memory management, not garbage collection.
On the other side, this also doesn't bring up that any sort of concurrent data structure is much easier to write when using a GC, and the GC removes the need for expensive memory tracking. It doesn't necessarily make things faster, and libraries like Crossbeam and epoch management in general make things easier, but Crossbeam isn't fully featured, is Rust only, and still suffers from strange bugs.
A worthwhile side note is that when manually managing memory, non-malloc allocators can often be used if really needed that have far higher performance than both malloc/free and a GC.
Which means when you have a value of the type, you can't assume it's in a valid state. You have to put checking code somewhere, either in each method of the type, or in functions that receive the type as a parameter, in clients that create objects of the type, etc.
In C++, you have to know the difference between stack allocations and heap allocations.
If you aren't dynamically allocating anything, malloc and free performance has no bearing. As far as I know, Java has no reliable way to create an object purely on the stack (and it probably shouldn't).
And even if you are dynamically allocating a new object in C++, if it has non-pointer children, it allocates that memory too in the same allocation call.
It is true that a lot of allocations will slow down C++ a ton; but a lot of allocations can be avoided in C++ whereas in Java you don't get that fine of control.
Even then, there is nothing about the malloc+free contract that prohibits you from having an implementation that does some of the things the author talks about. You can make a version of free that marks memory locations to be freed at the nth call of free.
I'm curious on the source for this statement:
> Java's collector can free an unlimited number of objects in time independent of the number of objects freed.
That seems like it can't possibly be true in the generic case.
But once it's in the old generation, you run right into the same issue with fragmentation and having to keep track of where deleted objects are so you can reuse that space. If the idea is that you just keep doing memcopies for all the live objects to reuse the space and avoid looking at dead objects, I'd think that that is its own performance issue?
While I don't buy the GC performance argument, it's OK to use GC, because most of the time it doesn't matter. Sure, it can be "almost as good as" manually managing memory, but that's about as hard to achieve as manually managing your memory. The point of GC is that it relieves the programmer of caring about managing memory, and as such it's only really useful where the system has sufficient resources that the fact that a GC is running doesn't matter. For the vast majority of workloads on modern hardware, this is true.
At the other end of the scale, in real high-performance embedded systems, even malloc+free is too slow, and memory fragmentation is too much of a problem for things with expected uptimes measured in years. GC is a complete non-starter here. Instead these systems use pools of fixed-size buffers that can be O(1) allocated and freed, and never fragment. You can use MRU allocation of buffers from these pools to keep the dcache warm, and you can make your memory-using operations queue until they can reserve the memory they're going to use up front, so that your system can run flat out using all the memory it can without ever crashing from OOM. Both malloc+free and GC (especially GC) can only dream of those characteristics.
This is entirely wrong. Sure this is true if you're writing a general purpose programming language, but if you're writing an app or a game or a tool, you know the usage patterns because you are building them. You don't need to free things at "arbitrary" times, you free chunks at times when it is appropriate for your problem. You know more about your memory allocation patterns than any general purpose GC could ever know, and so you can write code to fit those better.
In a GC language, often you don't care about where and when the memory comes from, but sometimes you do, and when you do, you're stuck.
It is Java where you are stuck. Not GC languages in general.
PJSIP has an interesting approach to this. SIP messages have lots of strings associated with them, are usually relatively short-lived, and typically you'll want to modify these strings or attach new ones. PJSIP does this by hanging a string arena off of each SIP message object.
As you add strings to the SIP message, these get appended to the arena, which is grown automatically as and when it needs to be. If you delete a string, it just sticks around in the arena unused. Arena space can also be shared, e.g. if you copy a substring from one SIP header to another, it'll just be referred to twice using a pointer and length.
Finally, when you free the SIP message, it just frees the entire arena in one go.
Languages runtimes enable lots of other cool things, like JIT compilation and runtime specialization, user-space schedulers and I'm sure some other things as well, but if, as a language designer, all you want is GC, consider if that's worth all the costs of having a runtime.
I don't know of anybody who complains that the cost of actually freeing memory in GC'd languages is somehow significantly worse than in non-GC'd languages. Everybody is ultimately complaining about how expensive it is to figure out what is actually okay to free/release in the first place.
There are plenty of GC-backed languages/runtimes where the GC penalty is isolated or mitigated because of other known constraints of the system. Erlang's no-shared-mutable-state constraints allow its GC system to avoid stopping the world and also allows processes to run on other schedulers while GC is active on a different process running on a different scheduler.
This is far from the truth. First, C has nested structs, whereas in Java all object references are pointers. This leads to many more allocations in Java. Second, C programmers are aware of the cost of malloc, and will try to minimize its use. Reusing memory whenever possible is common, as are simple memory management strategies (such as arena allocation: allocate big chunks of memory, do all your work therein, then free it in one go, for instance at the end of a processing step).
Some C programmers, maybe. I have had discussions about GC vs. manual memory management with a few C programmers that appeared to think that malloc+free is basically free, and that GC is expensive because of, you know, GC. They still tried to avoid malloc()ing memory, but that was mainly because of problems with memory leaks and related issues.
(And just to be clear, it took me very, very long to figure this out myself, mostly because none of the C programs I have worked on were affected by the performance of malloc+free. I have to give the credit to a friend who works in real time systems and one day looked at me as if I was an idiot [I might be, of course!] and told me about how in real-time code you cannot call malloc and friends, and I was like "Why not?", and then he explained it to me.)
But if you're willing to put the effort in to do something like that, there's no reason you can't do the same thing in Java.
As far as I know, you can only control allocation at the level of "retainment": you can keep something from being GC'ed by keeping a reference to it. Potentially you can also reuse memory. So basically, object pools. This is much less efficient than arenas, where there is one malloc and one free (or one per memory page).
Also reusing memory is a bit ugly: you can't specify that things are final anymore, which you can in C -- sure this is no hard guarantee (it isn't in Java either) but it's a useful indication that during the lifetime of a "logical object" (not its memory) it will not change.
I've tried using object pools a couple of times in Java, whenever allocations dominated the running time; it never helped any.
The tracing can happen concurrently with the appropriate config. And I wouldn't be surprised if you could disable GC, at least for all practical purposes (e.g. only GC at 100% space usage or every 10000 years, or some such).
> As far as I know, you can only control allocation at the level of "retainment": you can keep something from being GC'ed by keeping a reference to it. Potentially you can also reuse memory. So basically, object pools. This is much less efficient than arenas, where there is one malloc and one free (or one per memory page).
AIUI you can do that by defining objects backed by an unsafe buffer (technically a non-public API, but there's an understanding that it will be supported and the wheels are turning towards standardizing either it or something equivalent that supports the important use cases).
> Also reusing memory is a bit ugly: you can't specify that things are final anymore, which you can in C -- sure this is no hard guarantee (it isn't in Java either) but it's a useful indication that during the lifetime of a "logical object" (not its memory) it will not change.
I didn't think C supported final at all? I agree that it's less convenient than it might be and a lot less convenient than standard Java, but I'm not convinced it's really any worse than what you'd be doing in C.
Especially since GC doesn't map to the real world in important ways, for instance OS resources such as sockets/threads/pipes/files. You have to build your own reference counting on top of these things anyways.
This is why I have a lot of hope in Rust -- not necessarily because I see it as the ideal, rather because they are asking the latter question and they've got some pretty cool results.
It's time to let go of 'GC' in the classic sense and figure out how we can elide object lifetimes in a clean, rational and deterministic way.
Here's how Java allocates an object of size N: round N up to a multiple of 8 bytes, and bump the allocation pointer by that much. That's it. By using a scheme called a thread-local heap (TLH), there is no contention between threads. There's some additional handling for large objects, and for the case when the TLH is exhausted and another one must be allocated, but generally speaking, the GC is self-tuning so that none of these activities appear on the hot path of an application. The overhead is generally negligible, and all you're left with is the pointer bump.
What is to stop someone from implementing something like this in a C library? It sounds like it could be relatively easily done... or is it the TLH that causes issues?
It actually goes one step further than the TLH and does this round-robin stuff sharing the load between your processes.
To give a recent example, HTML5's canvas API only allows accessing image data with getImageData, which will allocate a large array for you, and then return it. You cannot do this at 60fps, because of GC pauses (and also because Chrome doesn't actually track the memory usage of the ImageData correctly, leading it to run out of memory rather than correctly free the object).
There is no way to pass in a pre-allocated buffer, because allocation is presumed to be free under GC. When developers are asked to track memory usage themselves, they can be a lot more careful with it, and design APIs that take that cost into account.
In apps that need to scale, designing data to be efficiently allocated and freed becomes crucial and the reference by default strategy really falls down.
The problem is that the Java specs do not dictate native access beyond JNI. One standard off-heap API, which provides all the goodies of the native world and escapes the GC, would be a step in the right direction. It is up to the dev to choose which world an object lives in.
First, one reason GCd languages are slow is because they GC more than necessary; not necessarily because of the GC itself. It's not always possible to statically determine what needs to be GCd. Go is pretty good at this, though, especially since pointers are explicit unlike Java. You also have cache issues here. Even if the GC tries to assuage these cache issues by smartly placing objects, this is a far cry from the speed you get from a cached stack.
Secondly, the "GC can free when it wants" is not an indictment of malloc-based heap usage. You can easily plug in an allocator that does the same thing with respect to freeing memory at an opportune moment. IIRC jemalloc does this. You can argue that jemalloc isn't malloc, but since your programming style or language doesn't need to change if you want to use jemalloc over malloc, it doesn't matter -- comparing GC to malloc isn't the real question at hand here, comparing GC to a malloc-like memory management API (or an abstraction over it) is.
Almost all the improvements that a GC has over malloc-like memory management mentioned in the article can be implemented as a policy in a jemalloc-like pluggable allocator. Smartly managing malloc() calls and delaying free() calls is orthogonal to garbage collection, it's just that garbage collectors invariably do this since it's necessary to offset the performance hit of putting everything on the heap.
The only legitimate performance feature mentioned there is that a moving GC can handle the cache smartly. It is true that malloc can't do this. However, malloc doesn't have cache problems in the first place -- managing the cache in a GC is making the best out of a sticky situation; malloc doesn't cause that situation in the first place.
Comparing Go and Java, I have the general impression that garbage collection in Go has reached the point where it just works, while in Java we're still tuning it after all these years.
But part of the reason might be that creating garbage is more easily avoided in Go. Or it might be that people are attempting different things.
Go does have a couple of structural advantages over Java, such as shipping with "value types" on day one, and it's good enough for many uses. But that's probably not accurately described as "just working" in that sense.
Historically Java competed with C++, and its evolution has been defined by trying to get it to match C++'s performance in the general case. Go prefers simple, sound and obvious to maximum performance.
If I know how to arrange my data I can beat Java by 50x.
The thing is, so few people care about (or need) this these days that stuff like the original article gets written.
Of course, there's a teensy-weensy mental cost that you have to pay in Go where you actually think about these pointers, since you have to explicitly add `&` and `*` for it to work well. This is far less than the cost required to think about malloc and free in C, but it's a cost, and part of the cost-benefit tradeoff between Go and Java. You have to think about pointers a little bit, and in return you get decently efficient memory management.
Go's GC is basically "I will GC-manage what you ask me to and even then try to avoid it when possible"
Java, on the other hand, doesn't distinguish between this and has to try and figure everything out itself. The programmer can't hint to the compiler that something doesn't need GC; the compiler has to figure this out on its own. This can't be changed backwards-compatibly, since it involves changing defaults.
Java's GC is "I will GC-manage everything and sometimes try to avoid it"
Instead of having specialized objects for manipulating memory (like deque<>, vector<> ...), this language would statically provide a profile (or signature) of the stack at compile time. It would allow the user to apply strategies that are used in JITs (like inline caching).
For example, detect that an array is push or shift only:
- if it's push-only, preallocate memory by chunk (like vector.reserve()).
- if it's shift-only, trap the shift operator and change the head of the array by an offset (no resize)
The issues of GC and malloc are around the context.
Knowing almost nothing about the JVM, what's the best approach to go about this? I considered implementing my own sets/maps on naked ints / structs of 1-3 naked ints.
Go would be an ok comparison. I suspect Go would come out on top, since it actually tries to avoid GCing things, whereas the Rust compiler wouldn't even be aware that there is a GC going on and wouldn't do any analysis.