Malloc envy (vena.io)
72 points by greenhathacker on April 4, 2016 | hide | past | favorite | 105 comments

> It's also important not to overstate the duration of GC pauses. Applications don't need to stop for the entire duration of a garbage collection cycle: they can continue to run and mutate objects (and even allocate objects), so long as they coordinate with the GC.

I'm not much of a Java user, nor do I use GC for any performance-critical code, but I write code that interacts over the network with quite a few Java systems. These systems have all kinds of performance problems, a lot of which involve GC pauses.

Some of those systems come from vendors who have very competent teams and invest heavily in their systems, and some have tried fancy, expensive JVMs with supposedly better real-time performance and have failed to get them to work.

So I'd offer a statement in the other direction: it's important not to overstate the ease of avoiding GC performance gotchas. Maybe it can be done, but that doesn't mean that you're going to be able to do it, especially if you don't design all parts of your application from the beginning to operate within the constraints of real-time GC implementations.

> Some of those systems come from vendors who have very competent teams and invest heavily in their systems, and some have tried fancy, expensive JVMs with supposedly better real-time performance and have failed to get them to work.

This is very important. There's far too much No True Scotsman in typical discussions around this. The fact remains that really good teams try and fail to meet needed performance criteria, even though their app/service meets all other requirements, is well architected, well coded, and the team has experience in GC tuning, pitfalls, etc.

> Maybe it can be done, but that doesn't mean that you're going to be able to do it, especially if you don't design all parts of your application from the beginning to operate within the constraints of real-time GC implementations.

Because... unless your solution is one where you can simply avoid doing a few stupid things that make the GC unhappy, you have to wade deep into managing memory, except you can't do it directly. You have to understand/infer how seemingly innocent code impacts the GC, and then suggest/cajole/plead with the GC to do what you want. It requires the kind of intimate knowledge that GC is supposed to save you from, without the direct manipulation tools.

There are large classes of problems where GC is perfectly fine. Great even. But there are other problem domains where sluggishness or pauses are not acceptable, and not just realtime or soft realtime.

> But there are other problem domains where sluggishness or pauses are not acceptable, and not just realtime or soft realtime.

Azul sells a pauseless collector. AIUI the downside is that you need enough memory headroom and spare CPU cores for the collector to keep up with the mutator threads, which presumably is a significant overhead on large heaps.

> Azul sells a pauseless collector. AIUI the downside is that you need enough memory headroom and spare CPU cores for the collector to keep up with the mutator threads, which presumably is a significant overhead on large heaps.

I think that one of these vendors tried Azul and couldn't get it to work. (I don't know the details. If I understood correctly, the problem wasn't the GC so much as that Azul couldn't run their code correctly or would have required extensive modifications or something along those lines. In any event, they decided not to use Azul's product.)

I agree. In fact, I don't think one can succeed in mitigating GC latency without understanding just how important it can be. The excellent presentation "How NOT to Measure Latency" explains how, with the wrong measure, latency doesn't seem that important.

The JVM has a lot of other reasons to pause [1]. A big problem with using the JVM as an example is that it's an enormously complex architecture that wasn't really designed for the kind of workloads it is subjected to these days. For example, the complexities involved in having a JIT compiler in a multi-threaded environment that wasn't designed from the ground up for that purpose can be pretty scary (e.g. the method deoptimization issue in the video). The effort that has gone into the various JVM implementations both impresses and scares me.

[1] https://www.youtube.com/watch?v=Y39kllzX1P8

In what sense was the JVM not designed to be a JIT compiler in a multi-threaded environment?

Like (somewhat ironically) Javascript, Java's origins can be very easy to forget, and also like Javascript, both are surprisingly humble. Javascript was banged together in a week, and Java was intended to be a language for developing constrained set-top box UIs in the 1990s. While that does not render them intrinsically unsuitable after decades of development for the sort of tasks that they are put to today, the early days of a language still inevitably color it forever when backwards compatibility is a big concern.

I don't buy this at all. The original Java whitepaper from 1995 mentions both multithreading and JIT compilation: http://www.cs.dartmouth.edu/~mckeeman/cs118/references/Origi...

It even says that the byte-code format was specifically designed with JIT compilation in mind:

> The byte code can be translated on the fly (at runtime) into machine code for the particular CPU the application is running on. For those used to the normal design of a compiler and dynamic loader, this is somewhat like putting the final machine code generator in the dynamic loader.

> The byte code format was designed with this in mind, so the actual process of generating machine code is generally simple. Reasonably good code is produced: it does automatic register allocation and the compiler does some optimization when it produces the bytecode.

The idea that JIT compilation was retroactively bolted onto the Java ecosystem is absurd. Perhaps more than any other technology, Java brought JIT compilation into the mainstream, starting more than 20 years ago.

To distill your contribution to its essence:

Parent: what's wrong with the JVM as a multithreaded-JIT?

You: Java is similar to Javascript. Javascript was terribly designed in the '90s. Therefore Java sucks. QED.

You should try your hand at politics! Don't answer the question (Java); instead, segue into the more comfortable topic/what you want to talk about (Javascript).

Um, jerf gave a reason why Java's history affects it here and noted how it's similar to Javascript in that regard on the side; not the other way around. The JS comparison was not part of their argument.

But that has nothing to do with the question he is responding to, regarding Java's suitability as a multithreaded JIT.

Tu quoque.

The JIT and threading / memory model were bolted on afterwards. The JIT isn't even part of the JVM spec. (Unless this has changed.)

The Sun JVM has had a JIT since 1998: https://en.wikipedia.org/wiki/Java_version_history#J2SE_1.2

Which was after JDK 1.0. The JVM was originally implemented as a bytecode interpreter; Sun licensed the Symantec JIT compiler later for 1.1, then eventually added HotSpot several years after 1.0, derived from Smalltalk technology.

The original design of the JVM was for embedded systems, not for running server applications with heaps in the 10s of GB of memory.

The threading model, which allows for near arbitrary sharing of memory, is still a significant source of complexity for both the JIT and the GC and has changed multiple times since the JVM's inception.

Modern JVM implementations have little to do with the original design, but still have to deal with design decisions made 20 years ago for completely different architectures and with only limited foresight.

You were asking about design, not implementation.

The way Java works is there's a language spec and a VM spec and then organizations are free to implement them as they see fit. So JVM design is referring to what's in the spec.

What's wrong with Sun's implementation? Well, they did the best they could, given that there's no design / spec to work from (for the JIT). The memory model was completely replaced in the spec in ~2005 (to work better on multi-core CPUs and stuff), and the threading model was changed in the spec sometime in the 90s (from green threads to native threads).

Most other safepoints pauses are much shorter and far less common than GC pauses on multi-GB heaps.

For some applications those still are a problem, yes, but others would be happy if those were the only ones they'd experience.

Are you familiar with the D programming language? They think a lot about these issues.

By default D uses a GC for classes (allocated on the heap, Java-style). You can disable/avoid the GC with some additional effort and the language tries to help. For example, there is a @nogc annotation, which emits an error for anything which could trigger the GC, although it does not prevent a stop-the-world event from another thread.

Rust is kind of the inverse approach. Currently, there is no GC and real time is (relatively) easy. Now they are trying to introduce an optional GC for convenience.

Mixing GC with (real time) non-GC code is quite desirable, but I have not seen an elegant approach, yet.

It's not exactly for convenience. The Rust GC support isn't intended to be used in general situations. It's for integrating with languages/applications that have a GC, like spidermonkey or V8.

https://github.com/manishearth/rust-gc actually lets you cleanly mix GC and non-GC heap objects (even put pointers to GC heap objects on the malloc heap and vice versa). So does Servo's current bindings to spidermonkey.

  > Now they try to introduce an optional GC for convenience.

Most of the GC work in Rust, at least at this time, is more about integrating with external GCs. Think "Servo uses javascript objects that Spidermonkey controls" or "A Ruby gem in Rust lets the Ruby GC deal with Ruby objects."

It might be easy, after all, to reduce or eliminate those pauses. Try running the Java programs using the G1 collector [1].


This is probably not a very productive conversation to have, but that simply doesn't match my experience - and frankly a system with "all kinds of performance problems" sounds like incompetence. Short-lived garbage does not cause pauses (GC in a modern JVM can be concurrent, it's only fragmentation that causes pauses), and long-lived garbage generally doesn't need to exist (that's workload-dependent - but the kind of workload where it's important is the kind of workload that's impossible to write without GC). Holding onto objects for longer than they need to live certainly happens, but that's the same problem you have in a non-GC language, and if anything it's easier to diagnose on the JVM which has some excellent tooling around monitoring these things.

and frankly a system with "all kinds of performance problems" sounds like incompetence

long-lived garbage generally doesn't need to exist

Wonderful. It's not the GC it's you who's incompetent. Or maybe you're competent, but your app simply doesn't really need to exist. A more complete shit-HN-says comment I couldn't find if I went searching for it.

So without knowing any details of the Java programs you interact with you're going to make a blind claim that the performance issues are the result of GC pauses?

No. We've spoken at length with the teams that develop these applications.

There's a whole additional class of problems involving the JIT needing to warm up, too, but that's off topic for an article specifically about GC. (Go, for example, would not have the latter problem at all, but the JIT and GC situations for Java are a bit similar. JIT warmup issues can be avoided using AOT compilation, but that has its own set of issues. What seems to be completely missing, or at least was missing last time I looked, was the ability to generate a JIT profile, save it, restart, and restore the old profile.)

I've heard these arguments many times before, about how in theory GC'd languages can have less memory allocation/deallocation overhead. This theoretical benefit never materializes, though. This is because in a language with garbage collection the programmer barely thinks about memory at all, which results in millions of tiny allocations happening all over the place, and that wrecks performance. In contrast, in languages with manual memory management, programmers use strategies that minimize the headaches of manual memory management (memory leaks and double-freed data), which results in vastly fewer memory operations.

Whereas in a dynamic language people are tempted to write things like `people.pick('name').capitalize().sort()`, where containers of objects are created at every stage of the pipeline just to be discarded moments later. In C you wouldn't write code like that. Or if you do, you simply allocate a single block of memory beforehand that's large enough for the entire operation and reuse that memory block at every stage of the pipeline. In C, more memory-efficient code is less work to write and easier to reason about.
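To make the contrast concrete, here is a minimal C sketch of the second approach: one scratch buffer, allocated by the caller up front, carries a hypothetical pick/capitalize/sort pipeline from start to finish (all names and sizes are illustrative, not from the article):

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NAME_LEN 16

static int cmp(const void *a, const void *b) {
    return strcmp((const char *)a, (const char *)b);
}

/* One caller-provided buffer serves every stage: copy ("pick"),
 * capitalize in place, then sort. No per-stage containers are
 * created and discarded. */
void pipeline(const char **names, int n, char out[][NAME_LEN]) {
    for (int i = 0; i < n; i++) {
        snprintf(out[i], NAME_LEN, "%s", names[i]);
        out[i][0] = (char)toupper((unsigned char)out[i][0]);
    }
    qsort(out, (size_t)n, NAME_LEN, cmp);
}
```

Called with `{"carol", "alice", "bob"}`, it leaves `Alice`, `Bob`, `Carol` in the buffer with zero heap allocations along the way.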

So it really shouldn't be a surprise that in practice GC'd languages like Java are dogs in terms of performance, nor should we expect this to change no matter how clever the garbage collectors get. The "sufficiently intelligent garbage collector" is a creature just as mystical as the "sufficiently intelligent compiler".

I still hope for D here. (Maybe C# as well, but I don't know C# well enough)

The D community encourages `people.pick('name').capitalize().sort()`. They call it component programming, and they make it fast. See [0] for a more realistic example. The trick is to turn the methods into (lazy) iterators (streams, in Java land) with static dispatch. Then the compiler can inline everything and optimize it like a C for-loop.

The problem is that the D garbage collector needs more engineering effort to be on Java level. For example, it is not precise or concurrent yet.

[0] http://wiki.dlang.org/Component_programming_with_ranges#Form...

Rust does something very similar, albeit with lambdas for extra syntax goodness.

(to be clear, lambdas aren't necessary for this pattern to work in Rust, just that the stdlib has generic iterator adapters that are created using lambdas letting you do a myriad of things with a single function. Rust lambdas (closures) are on the stack, so this usually gets optimized down to a for loop or something)

> things like `people.pick('name').capitalize().sort()`

With substructural types, the programmer can tell the compiler that a specific value is meant to be used only once. For instance, the `self`-arguments to `.capitalize()` and `.sort()` can be modified in-place, if only the compiler actually knew that it won't be used elsewhere.

> The "sufficiently intelligent garbage collector" is a creature just as mystical as the "sufficiently intelligent compiler".

The real myth is that a “sufficiently intelligent compiler” can figure out what you really mean without you telling it. Compilers are often better than programmers at identifying opportunities for optimization, as well as correctly carrying them out, but the validity of these optimizations often depends on preconditions that compilers can't figure out on their own; the programmer must tell them.

Fortunately, the science of systematic, reliable and truthful communication between programmer and compiler already exists: https://en.wikipedia.org/wiki/Type_theory

> Where in a dynamic language people are tempted to use write things like `people.pick('name').capitalize().sort()`, where containers of objects are created at every stage of the pipeline just to be discarded moments later. In C you wouldn't write code like that. Or if you do, you simply allocate a single block of memory beforehand that's large enough for the entire operation and you reuse that memory block at every stage of the pipeline

Right, so compare like with like. If you write Java in a style where you're thinking carefully about performance, you can get very good performance - probably faster than you would spending the same amount of time writing C. If you write Java in a style where you think a little bit about performance, and then spend a bit of time profiling your application after, you tend to end up with something faster than your first effort in C and taking a lot less time to write as well. But for some reason people seem to want to compare the dumbest thing you could write in Java that would work at all with the dumbest thing you could write in C that would work at all, regardless of the fact that it would take much more programmer time to produce the latter.

This is nonsense. Games using Unity think about it all the time and that's using C# and a GC runtime. (Here's a dirty little secret: C++ games use object pools too and for the same reason! You can't afford to take the malloc/free overhead in the middle of rendering a frame, any more than you can afford a GC pause!)

Again looking at C# (or Swift's Lazy collections, or Java streams) your example isn't quite correct. You can absolutely get single-pass lazy evaluation of the elements, or at the very least avoid any temporary intermediates.

The .NET runtime has various GCs, some of which are concurrent GCs that can do thread-local pauses without pausing the entire process, as well as doing extremely short pauses (a few ms) but otherwise running concurrently with other threads doing allocations. The GC even does compaction, avoiding the memory fragmentation problem that has driven many a C programmer crazy. (There are other approaches like V8's JS GC.)

The idea that GC has to be a dog is nonsense. It has the benefit that allocation is just adding sizeof(object) to a pointer, making allocation trivial. It has the runtime overhead of occasionally needing to pause things for memory safety reasons, depending on the specific implementation. It also has a higher high-water mark for memory usage. It has the human overhead of reasoning about living roots and avoiding unnecessary object churn (though that is true of any system).

Reference counting has runtime overhead of incrementing/decrementing reference counts, and human overhead of manually breaking retain cycles and remembering to properly increment/decrement the counts (unless using things like ARC which insert the calls automatically).

malloc/free has runtime overhead to manage slabs and free lists, plus fragmentation problems under some workloads. It has a massive human overhead of making the programmer remember to track and free the allocated memory perfectly or risk leaking or corrupting memory, potentially leading to remote code execution in the worst case.

These are all tradeoffs with costs and benefits.

My experience with XNA (a C# based gaming environment) was that allocating memory and incurring GC was death. So you bent heaven and earth to avoid making garbage.

I haven't used Unity. It's probably better. (XNA was essentially abandonware a couple years after Microsoft hatched it -- a typical myopic lifecycle and a good reason why you can't trust MS with your platforms unless they have a really strong money story behind them).

Grandparent is 100% correct. Not sure why you brought up games to defend GC... we are the worst possible scenario for GC, barring the space shuttle.

On my recently shipped C# game, I spent a ton of time writing custom memory management and tracking down memory leaks. Then I hired a coder with a background in web development, and it played out exactly like the grandparent comment described. Guy had Linq queries everywhere generating 4 MB of garbage per frame, because he just didn't know.

The point is, we have to think about memory no matter what. So why bother with GC? It just makes it harder to control.

Currently I work as an engineer in video games, and I would say that the next step is to write software completely without malloc+free and GC: you allocate virtual pages directly from the OS, create your own pool allocators in them, and that's it. When you need to "free" stuff you just wipe the whole virtual page: O(1) free, faster than anything else you could imagine. We have only three special cases: string operations (usually implemented with a special string pool), a heap (malloc + free) for libraries that cannot live without it, and everything else on the stack.

That said, I honestly don't understand why everyone only talks about the GC and malloc+free abstractions, because they simply do not materialize in the physical world at all (even malloc will call the kernel if it needs a new page); the CPU and kernel have no such mechanisms, so this is only an abstract limitation in developers' heads. So why not just talk about, and measure, what's actually real (if we actually end up in a situation where we need to talk about memory at all, because usually it just works)?
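The pool scheme described above can be sketched roughly like this. The backing block here comes from malloc for simplicity, where a real engine would take virtual pages straight from the OS (mmap/VirtualAlloc); all names are illustrative:

```c
#include <assert.h>
#include <stdlib.h>

/* A bump-pointer arena: allocations carve slices off one backing block,
 * and "freeing" everything at once is a single pointer reset. */
typedef struct {
    char  *mem;
    size_t used, cap;
} arena;

arena arena_new(size_t cap) {
    arena a = { malloc(cap), 0, cap };
    return a;
}

void *arena_alloc(arena *a, size_t size) {
    size = (size + 15) & ~(size_t)15;   /* 16-byte alignment */
    if (a->used + size > a->cap)
        return NULL;
    void *p = a->mem + a->used;
    a->used += size;
    return p;
}

void arena_wipe(arena *a) { a->used = 0; }   /* O(1) "free everything" */
void arena_drop(arena *a) { free(a->mem); a->mem = NULL; }
```

A per-scene or per-frame allocator would call `arena_wipe` at the end of each scene/frame, reusing the same pages forever.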

Presumably most of the defences written for GCs are from people who haven't spent much time in C. Pool allocators are common elsewhere outside gaming, for example Samba's talloc, Apache httpd's APR and a bunch more places like e.g. boost::pool

Video games end up being a special case where you tend to have somewhat discrete checkpoints and can dump large parts of RAM via pages.

Still pooling is something that everyone should be taught and seems like very few are.

I'm not sure it's that special. At least the arena allocator approach. It's a great solution for servers too. If the connections are short-lived, you can just give a few pages to each one and unmap them at disconnect.

That actually has a name, and is explored in the literature: region-based memory management: https://en.wikipedia.org/wiki/Region-based_memory_management

> An array in C++ has none of these things. However, consider that if the array is ever to be freed by free, then its size must be recorded anyway. That leaves 8 bytes of overhead, which is likely comparable to the overhead caused by malloc's fragmentation.

Actually, no, it does not need to explicitly record the size of the allocation. Some implementations of malloc do this, using what are called "headers". But more modern allocators tend to use techniques which avoid this. They basically follow the pattern:

1. Maintain a list of slabs, where each slab contains objects of a particular size (say, 4, 8, 16, 32, etc, but not necessarily always doubling by 2).

2. You ensure that the slabs are in contiguous regions of virtual memory. The start is always on a page boundary.

3. For each page, you record its meta-information in either a giant array (for small address spaces), or a radix tree (for large address spaces).

4. Upon a free, you can just figure out what page an object lies in, then look up where its meta-information is. Then you have all of the information you need to free an object of a particular size. This function from an old project of mine shows all three techniques (headers, radix tree, giant array, also called a Big Bag Of Pages): https://github.com/scotts/streamflow/blob/master/streamflow....
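A toy version of the header-free size lookup (steps 2-4) might look like this, with a flat side table standing in for the radix tree and every name invented for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1u << PAGE_SHIFT)
#define NPAGES     64                  /* toy: a 256 KiB window */

/* The window stands in for a contiguous region of slab pages; the side
 * table holds per-page meta-information (here, just the object size). */
static _Alignas(PAGE_SIZE) uint8_t window[NPAGES * PAGE_SIZE];
static uint16_t page_obj_size[NPAGES];

static size_t page_index(const void *p) {
    return ((uintptr_t)p - (uintptr_t)window) >> PAGE_SHIFT;
}

void register_slab(void *page, uint16_t obj_size) {
    page_obj_size[page_index(page)] = obj_size;
}

/* What a header-free free() would consult: round the address down to its
 * page and look up the size class. No per-object header required. */
uint16_t object_size(const void *obj) {
    return page_obj_size[page_index(obj)];
}
```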

Something else the author did not mention is that when it comes to performance tuning a GCed application, you have another variable: heap size. If the heap size is too small, you will GC too often. If the heap size is too large, your pauses when you do GC may be too large. It may require tuning to find that sweet spot.

Also, this came up in a previous thread (https://news.ycombinator.com/item?id=11058911), but the routines the author outlined for object allocation are quite similar to the ones used in modern implementations of malloc.

(This only works if destructors/finalizers don't exist.) You still need runtime type information if you want to be able to properly clean up a thing. E.g. what would happen if I put a `std::vector` on the GC heap?

It works just fine with destructors and finalizers, as this is the underlying implementation of explicit memory allocation. One can build a true type-aware object system on top of it. This is how C++ does it.

Please note that I am talking about techniques for explicit memory management, not garbage collection.

Oh, you were talking about a non-GC situation. Right.

Something this doesn't mention is that for a given program written in both Java and, say, C++, the C++ version will have to perform far fewer allocations than the Java one. Escape analysis can help with this, but AFAIK it's not even a full replacement for stack allocation, not to mention the other scenarios where C++ doesn't allocate.

On the other side, this also doesn't bring up that any sort of concurrent data structure is much easier to write when using a GC, and the GC removes the need for expensive memory tracking. It doesn't necessarily make things faster, and libraries like Crossbeam and epoch management in general make things easier, but Crossbeam isn't fully featured, is Rust only, and still suffers from strange bugs.

A worthwhile side note is that when manually managing memory, non-malloc allocators can often be used if really needed that have far higher performance than both malloc/free and a GC.

What kind of strange bugs are you seeing with crossbeam? I haven't heard anyone say this yet, I'm curious :)

They've mostly been in the epoch manager itself - some were recently resolved, one has been pseudo-resolved by not freeing the respective memory, and I'm working on another right now. They're all edge cases, but they manifest as segfaults. Hopefully these can all be resolved soon though!

Ah, interesting. thanks for elaborating, I will have to look into that. Crossbeam is one of my favorite libraries, but I haven't dug into the code as much as I should.

Instead of Java, a comparison should use a language with value types (structs): C#, D, etc

I like that C#/F# have value types, but they have the downside of not being able to do away with the default constructor for the type, meaning it's harder and less efficient to control type invariants (you have to guard validity in methods repeatedly instead of being able to guarantee valid construction). For types for which a default/"0" value is a legal value they are good, of course.

I don't understand. What is the problem with default constructors?

The problem with the default constructors is just that for value types, you can't not have one. For plenty of kinds of things, there's no sensible default value that would make any sense, but your type must admit such a value anyway, there is no choice.

Which means when you have a value of the type, you can't assume it's in a valid state. You have to put checking code somewhere, either in each method of the type, or in functions that receive the type as a parameter, in clients that create objects of the type, etc.

This article discussed Java, so I used Java. Also, C#'s value types are somewhat limited aren't they?

I know too little C#. D is very flexible. It also allows you to explicitly allocate (emplace) classes on the stack or in an array, for example. You can also wrap structs into a reference counting construct and use them as reference types.

This really is an apples to oranges comparison.

In C++, you have to know the difference between stack allocations and heap allocations.

If you aren't dynamically allocating anything, malloc and free performance has no bearing. As far as I know, Java has no reliable way to create an object purely on the stack (and it probably shouldn't).

And even if you are dynamically allocating a new object in C++, if it has non-pointer children, it allocates that memory too in the same allocation call.

It is true that a lot of allocations will slow down C++ a ton; but a lot of allocations can be avoided in C++ whereas in Java you don't get that fine of control.

Even then, there is nothing about the malloc+free contract that prohibits you from having an implementation that does some of the things the author talks about. You can make a version of free that marks memory locations to be freed at the nth call of free.
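That deferred-free idea could be sketched as a thin wrapper over the standard allocator (a toy illustration, not a production scheme; BATCH and the names are made up):

```c
#include <assert.h>
#include <stdlib.h>

/* free() just queues the pointer; queued memory is actually released in
 * one batch on every BATCH-th call. */
#define BATCH 4

static void *pending[BATCH];
static int   npending;

void deferred_free(void *p) {
    pending[npending++] = p;
    if (npending == BATCH) {            /* the "nth call": flush the batch */
        for (int i = 0; i < BATCH; i++)
            free(pending[i]);
        npending = 0;
    }
}
```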

I'm curious on the source for this statement:

> Java's collector can free an unlimited number of objects in time independent of the number of objects freed.

That seems like it can't possibly be true in the generic case.

It's true, because the time to free objects is not related to the number of objects freed- it IS dependent on the number of objects that AREN'T freed. The GC time (in a very general sense) is the amount of time it takes to traverse the live object graph.

Isn't that true only for objects in the 'new' pool? Or does it actually compact the old pool every gc?

I'm not sure I see the connection you're trying to make. The specifics of old pool vs new pool (generation is the preferred term) are dependent on the GC config, but in no case can you find memory that can be freed by traversing a "dead objects" graph. Liveness is what drives the relationship between size and computation time. Big O notation stuff.

So my understanding of the new generation is that it's essentially a giant chunk of contiguous memory, and as you create new objects, you are just adding to the end of it. For whatever reason a GC event happens, and now you can traverse the live graph, and anything alive in the new generation gets copied to the old generation partition of memory. And now you just reset the counter for the new object pool, and voila, you've deleted a whole bunch of young objects just by not copying them.
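That "free by not copying" behavior can be sketched like this. Liveness is passed in explicitly here, where a real collector would discover it by tracing the object graph from the roots (all names and sizes are illustrative):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SPACE 4096

typedef struct {
    unsigned char space[2][SPACE];
    int    active;                 /* which semispace allocations live in */
    size_t used;
} heap;

/* Copy each live object into the other semispace, then flip. The cost is
 * proportional to the live bytes; dead objects are simply left behind and
 * reclaimed by resetting one counter. */
size_t evacuate(heap *h, const size_t offs[], const size_t lens[],
                const int live[], int n) {
    unsigned char *to = h->space[1 - h->active];
    size_t out = 0;
    for (int i = 0; i < n; i++)
        if (live[i]) {
            memcpy(to + out, h->space[h->active] + offs[i], lens[i]);
            out += lens[i];
        }
    h->active = 1 - h->active;
    h->used = out;                 /* everything else is free again */
    return out;
}
```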

But once it's in the old generation, you run right into the same issue with fragmentation and having to keep track of where deleted objects are so you can reuse that space. If the idea is that you just keep doing memcopies of all the live objects to reuse the space and avoid looking at dead objects, I'd think that would be its own performance issue?

Why is this the case?

If anything, Rust shows that you can have reliable "manual" memory management without the need for a GC. The best of both worlds.

While I don't buy the GC performance argument, it's ok to use it, because most of the time it doesn't matter. Sure it can be "almost as good as" manually managing memory, but that's about as hard to achieve as manually managing your memory. The point of GC is that it relieves the programmer of caring about managing memory, and as such it's only really useful where the system has sufficient resources that the fact that a GC is running doesn't matter. For the vast majority of workloads on modern hardware this is true.

At the other end of the scale, in real high-performance embedded systems, even malloc+free is too slow, and memory fragmentation is too much of a problem for things with expected uptimes measured in years. GC is a complete non-starter here. Instead these systems use pools of fixed-size buffers that can be O(1) allocated and freed, and never fragment. You can use MRU allocation of buffers from these pools to keep the dcache warm, and you can make your memory-using operations queue until they can reserve the memory they're going to use up front so that your system can run flat out using all the memory it can without ever crashing from OOM. Both malloc+free and GC (especially GC) can only dream of those characteristics.
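A fixed-size pool of the kind described might be sketched like this: an intrusive free list gives O(1) allocate and free with no possibility of fragmentation, and freeing pushes onto the list head so the most recently used buffer is handed out next, which is the MRU/dcache point above (names and sizes are illustrative):

```c
#include <assert.h>
#include <stddef.h>

#define NBUF  8
#define BUFSZ 256

/* Each block is either on the free list (next) or in use (data). */
typedef union block { union block *next; char data[BUFSZ]; } block;

static block  pool[NBUF];
static block *freelist;

void pool_init(void) {
    freelist = NULL;
    for (int i = 0; i < NBUF; i++) {
        pool[i].next = freelist;
        freelist = &pool[i];
    }
}

void *pool_alloc(void) {
    if (!freelist) return NULL;   /* caller queues until a buffer is free */
    block *b = freelist;
    freelist = b->next;
    return b->data;
}

void pool_free(void *p) {
    block *b = (block *)p;
    b->next = freelist;           /* MRU: last freed is first reused */
    freelist = b;
}
```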

I feel like this article falls into the typical trap: you either do fully manual malloc/free, or stop-the-world mark & sweep. But those are the extremes of the spectrum. What about arena allocators (in dynamic content, for example per-scene and per-frame)? What about selective refcounting? What about mixing systems in various heaps? What about lightweight processes with their own, well-adjusted local heap sizes? There are so many options available in the real world. GC is not Java, and Java also works against a good GC with its lack of value types.

I'm the author of the article, and I agree 100% with you: there are definitely application-specific memory management schemes that beat both malloc and automatic GC. In fact, I'm working on one right now: it's a way of storing a huge balanced tree data structure in arrays to save space.

This whole post is nonsense, but there's one point in particular that I think GC advocates don't understand about managing memory. He writes, "Consider: you need to hand out chunks of memory of any size, one at a time, with no advance warning of usage patterns; and then you need to be able to free arbitrary chunks at arbitrary times."

This is entirely wrong. Sure this is true if you're writing a general purpose programming language, but if you're writing an app or a game or a tool, you know the usage patterns because you are building them. You don't need to free things at "arbitrary" times, you free chunks at times when it is appropriate for your problem. You know more about your memory allocation patterns than any general purpose GC could ever know, and so you can write code to fit those better.

In a GC language, often you don't care about where and when the memory comes from, but sometimes you do, and when you do, you're stuck.

You can put a manual allocator on top of a runtime GC, just like you put it on top of malloc.

It is Java where you are stuck. Not GC languages in general.

Manual deallocators are more interesting.

PJSIP[1] has an interesting approach to this. SIP messages have lots of strings associated with them, are usually relatively short-lived, and typically you'll want to modify these strings or attach new ones. PJSIP does this by hanging a string arena off of each SIP message object.

As you add strings to the SIP message, these get appended to the arena, which is grown automatically as and when it needs to be. If you delete a string, it just sticks around in the arena unused. Arena space can also be shared, e.g. if you copy a substring from one SIP header to another, it'll just be referred to twice using a pointer and length.

Finally, when you free the SIP message, it just frees the entire arena in one go.

[1] http://www.pjsip.org/

Yes, that kind of "transaction-based" memory management is beautiful in applications where it's appropriate: just leak memory until you're done with the current task, and free it in one chunk. That's hard to beat for performance and simplicity.

Check the title of the article. I'm comparing GC versus malloc+free specifically.

One point that seems to be ignored in this discussion is that having a GC creates problems for interfacing with foreign libraries. It's not too hard to create hooks in the runtime to allow calling C functions, but it's very hard to go the other way, and usually you have to write bridge code in C that links to both the language runtime and the foreign library. And no matter what there will be function call overhead that's unacceptable in some circumstances.

Language runtimes enable lots of other cool things, like JIT compilation and runtime specialization, user-space schedulers, and I'm sure some others as well, but if, as a language designer, all you want is GC, consider whether that's worth all the costs of having a runtime.

This article seems like a bit of a strawman.

I don't know of anybody who complains that the cost of actually freeing memory in GC'd languages is somehow significantly worse than in non-GC'd languages. Everybody is ultimately complaining about how expensive it is to figure out what is actually okay to free/release in the first place.

There are plenty of GC-backed languages/runtimes where the GC penalty is isolated or mitigated because of other known constraints of the system. Erlang's no-shared-mutable-state constraints allow its GC system to avoid stopping the world and also allows processes to run on other schedulers while GC is active on a different process running on a different scheduler.

One flaw in this article is that it assumes naive malloc/free use. It assumes that a program written in C and one written in Java (for instance) will need to perform the same number of allocations.

This is far from the truth. First, C has nested structs, whereas in Java all object references are pointers. This leads to many more allocations in Java. Second, C programmers are aware of the cost of malloc, and will try to minimize its use. Reusing memory whenever possible is common, as are some simple memory management strategies (such as arena allocation: allocate big chunks of memory, do all your work therein, then free it in one go, for instance at the end of a processing step).

> C programmers are aware of the cost of malloc, and will try to minimize its use

Some C programmers, maybe. I have had discussions about GC vs. manual memory management with a few C programmers that appeared to think that malloc+free is basically free, and that GC is expensive because of, you know, GC. They still tried to avoid malloc()ing memory, but that was mainly because of problems with memory leaks and related issues.

(And just to be clear, it took me very, very long to figure this out myself, mostly because none of the C programs I have worked on were affected by the performance of malloc+free. I have to give the credit to a friend who works in real time systems and one day looked at me as if I was an idiot [I might be, of course!] and told me about how in real-time code you cannot call malloc and friends, and I was like "Why not?", and then he explained it to me.)

> Second, C programmers are aware of the cost of malloc, and will try to minimize its use. Reusing memory whenever possible is common, as are some simple memory management strategies (such as arena allocation: allocate big chunks of memory, do all your work therein, then free it in one go, for instance at the end of a processing step).

But if you're willing to put the effort in to do something like that, there's no reason you can't do the same thing in Java.

Ah but you can't. The GC will still pause and trace no matter what.

As far as I know, you can only control allocation at the level of "retainment": you can keep something from being GC'ed by keeping a reference to it. Potentially you can also reuse memory. So basically, object pools. This is much less efficient than arenas, where there is one malloc and one free (or one per memory page).

Also reusing memory is a bit ugly: you can't specify that things are final anymore, which you can in C -- sure this is no hard guarantee (it isn't in Java either) but it's a useful indication that during the lifetime of a "logical object" (not its memory) it will not change.

I've tried using object pools a couple of times in Java, whenever allocations dominated the running time; it never helped.

> Ah but you can't. The GC will still pause and trace no matter what.

The tracing can happen concurrently with the appropriate config. And I wouldn't be surprised if you could disable GC, at least for all practical purposes (e.g. only GC at 100% space usage or every 10000 years, or some such).

> As far as I know, you can only control allocation at the level of "retainment": you can keep something from being GC'ed by keeping a reference to it. Potentially you can also reuse memory. So basically, object pools. This is much less efficient than arenas, where there is one malloc and one free (or one per memory page).

AIUI you can do that by defining objects backed by an unsafe buffer (technically a non-public API, but there's an understanding that it will be supported and the wheels are turning towards standardizing either it or something equivalent that supports the important use cases).

> Also reusing memory is a bit ugly: you can't specify that things are final anymore, which you can in C -- sure this is no hard guarantee (it isn't in Java either) but it's a useful indication that during the lifetime of a "logical object" (not its memory) it will not change.

I didn't think C supported final at all? I agree that it's less convenient than it might be and a lot less convenient than standard Java, but I'm not convinced it's really any worse than what you'd be doing in C.

I think all this time spent over the last 50 years trying to improve GC was a kind of sunk-cost fallacy. Instead of answering the question "how do we make GC better/faster/less laggy/take less memory/not freeze the world/be more deterministic," people should have been investing in "how do we elide object lifetimes so programmers don't have to think about them."

Especially since GC doesn't map to the real world in important ways, for instance OS resources such as sockets/threads/pipes/files. You have to build your own reference counting on top of these things anyways.

This is why I have a lot of hope in Rust -- not necessarily because I see it as the ideal, rather because they are asking the latter question and they've got some pretty cool results.

It's time to let go of 'GC' in the classic sense and figure out how we can elide object lifetimes in a clean, rational and deterministic way.

Well, you can't always know the lifetime of an object until runtime, especially in multithreaded situations. There are plenty of reasons to want to work on being able to determine object lifetimes at compile time, but I don't think that work would actually replace what a GC would give you. I think the future will likely be some combination of both.

I wish I could upvote this 10 times. In C the programmer has to manage memory, which is a chore, but the programmer can do it because they _know_ something about the lifetime of their data. In GC languages that knowledge is sent to /dev/null, then heroic efforts are expended to attempt to machine-re-learn it again at runtime. This is crazy. As you observe, the better plan is to relieve the programmer of the chore of manual management, but to do so while retaining their knowledge and understanding of the behavior of their code.

I'm curious about the following:

Here's how Java allocates an object of size N: round N up to a multiple of 8 bytes, and bump the allocation pointer by that much. That's it. By using a scheme called a thread-local heap (TLH), there is no contention between threads. There's some additional handling for large objects, and for the case when the TLH is exhausted and another one must be allocated, but generally speaking, the GC is self-tuning so that none of these activities appear on the hot path of an application. The overhead is generally negligible, and all you're left with is the pointer bump.

What is to stop someone from implementing something like this in a C library? It sounds like it could be relatively easily done... or is it the TLH that causes issues?

You are correct. As I mentioned in https://news.ycombinator.com/item?id=11426339, jemalloc does this (http://linux.die.net/man/3/jemalloc, see the implementation notes), and you can use it pervasively so that all your existing malloc-using code links to jemalloc instead of the regular malloc. Ultimately the allocation strategy is an orthogonal concern to garbage collection, it's just that many GCs spend a lot more time worrying about the allocation strategy since they need to.

It actually goes one step further than the TLH and shares the load round-robin between your processors.

The difference is that free() needs to be able to free objects one-at-a-time without knowing their size, and without moving any live data around. This inevitably leads to fragmentation, one way or another.

My issue is that with GC-able languages, people don't seem to care as much about allocating lots of data, expecting the GC to wipe it all away.

To give a recent example, HTML5's canvas API only allows accessing image data with getImageData, which will allocate a large array for you, and then return it. You cannot do this at 60fps, because of GC pauses (and also because Chrome doesn't actually track the memory usage of the ImageData correctly, leading it to run out of memory rather than correctly free the object).

There is no way to pass in a pre-allocated buffer, because allocation under GC is presumed to be free. When developers are asked to track memory usage themselves, they're able to be a lot more careful with it, and to design APIs that take that cost into account.

If you need to do some low latency and high throughput stuff, sometimes you just can't afford a garbage collector. Like in many videogames.

I will allow that GC can be fast, but you have to work hard and have appropriate language features to keep it "chunky". If you have millions of small traces to follow, the n of your O(n) is going to be big. In contrast a manual collection algorithm can rely on user supplied metadata to free the whole dataset in one instant without having a tracing step.

In apps that need to scale, designing data to be efficiently allocated and freed becomes crucial and the reference by default strategy really falls down.

It's been what, 20+ years we've been hearing the theoretical arguments about how automatic GC can be soooo much better than manual or ref-counted heaps and yet... https://docs.oracle.com/javase/8/docs/technotes/guides/vm/gc...

I like to joke that you could simply comment out the body of free() in any C/C++ program and achieve 99% of the benefits of GC.

What's the point you're trying to make? That GC can't beat manual memory management on arbitrary workloads with no tuning hints? Ok then, no argument here.

One common theme I have seen is all the Big Data Java solutions using home-baked schemes for off-heap storage. This is essentially malloc/free for the Java world, built on the sun.misc.Unsafe API.

The problem is that the Java specs do not define native memory access besides JNI. One standard off-heap API, providing all the goodies of the native world while escaping GC, would be a step in the right direction. It is up to the dev to choose which world an object lives in.

This article seems pretty muddled.

First, one reason GCd languages are slow is because they GC more than necessary; not necessarily because of the GC itself. It's not always possible to statically determine what needs to be GCd. Go is pretty good at this, though, especially since pointers are explicit unlike Java. You also have cache issues here. Even if the GC tries to assuage these cache issues by smartly placing objects, this is a far cry from the speed you get from a cached stack.

Secondly, the "GC can free when it wants" is not an indictment of malloc-based heap usage. You can easily plug in an allocator that does the same thing with respect to freeing memory at an opportune moment. IIRC jemalloc does this. You can argue that jemalloc isn't malloc, but since your programming style or language doesn't need to change if you want to use jemalloc over malloc, it doesn't matter -- comparing GC to malloc isn't the real question at hand here, comparing GC to a malloc-like memory management API (or an abstraction over it) is.

Almost all the improvements that a GC has over malloc-like memory management mentioned in the article can be implemented as a policy in a jemalloc-like pluggable allocator. Smartly managing malloc() calls and delaying free() calls is orthogonal to garbage collection, it's just that garbage collectors invariably do this since it's necessary to offset the performance hit of putting everything on the heap.

The only legitimate performance feature mentioned there is that a moving GC can handle the cache smartly. It is true that malloc can't do this. However, malloc doesn't have cache problems in the first place -- managing the cache in a GC is making the best out of a sticky situation; malloc doesn't cause that situation in the first place.

This all seems language and implementation specific.

Comparing Go and Java, I have the general impression that garbage collection in Go has reached the point where it just works, while in Java we're still tuning it after all these years.

But part of the reason might be that creating garbage is more easily avoided in Go. Or it might be that people are attempting different things.

Go's garbage collection probably looks like it "just works" because the type of person who, in 2016, runs into a GC problem in Go will probably just rewrite in Rust, rather than spend tens of thousands of dollars on a best-of-breed Go runtime (which doesn't exist), the way you can for Java. So I don't think there's a lot of chatter about GC problems. There is some, but I think there's been more "I had a GC'ed language and switched to Rust and WOW! It was more work to write but the performance is blowing me away!"

Go does have a couple of structural advantages over Java, such as shipping with "value types" on day one, and it's good enough for many uses. But that's probably not accurately described as "just working" in that sense.

I've been hearing (from other HN comments) that Go 1.6's GC is still considered less performant than HotSpot's in the average case, and has significantly worse performance in some select edge cases.

Historically Java competed with C++, and its evolution has been defined by trying to get it to match C++'s performance in the general case. Go prefers simple, sound and obvious to maximum performance.

Java is never going to compete with C++ until they have proper memory placement semantics.

If I know how to arrange my data I can beat Java by 50x.

The thing is, so few people care about (or need) this these days that stuff like the original article gets written.

Go distinguishes at the type level between pointers and stack-based data, so it's much easier to have pervasive value types and only GC things which are being pointerified. Add escape analysis to that and it's amazing.

Of course, there's a teensy-weensy mental cost that you have to pay in Go where you actually think about these pointers, since you have to explicitly add `&` and `*` for it to work well. This is far less than the cost required to think about malloc and free in C, but it's a cost, and part of the cost-benefit tradeoff between Go and Java. You have to think about pointers a little bit, and in return you get decently efficient memory management.

Go's GC is basically "I will GC-manage what you ask me to and even then try to avoid it when possible"

Java, on the other hand, doesn't distinguish between this and has to try and figure everything out itself. The programmer can't hint to the compiler that something doesn't need GC; the compiler has to figure this out on its own. This can't be changed backwards-compatibly, since it involves changing defaults.

Java's GC is "I will GC-manage everything and sometimes try to avoid it"

What I'd like is a language that permits me to trap dynamic allocations with knowledge of the callsite.

Instead of having specialized objects for manipulating memory (like deque<>, vector<>, ...), this language would statically provide a profile (or signature) of the stack at compile time. It would allow the user to apply strategies that JITs use (like inline caching).

For example, detect that an array is push-only or shift-only:

- if it's push-only, preallocate memory by chunk (like vector.reserve()).

- if it's shift-only, trap the [] operator and change the head of the array by an offset (no resize)

The issues with GC and malloc come down to context.

I've had to use Java for sets and maps of millions of small objects (containing 1-3 integers). The garbage collector choked pretty early, after a few (< 20) million of them (I think there is no way to hash integers in standard Java without significant object overhead).

Knowing almost nothing about the JVM, what's the best approach to go about this? I considered implementing my own sets / maps on naked int[] / structs of 1-3 naked int[].

Objects are expensive. If you're dealing with millions of them you definitely want to 1.) use primitives where possible (i.e. int, not Integer) and 2.) put those primitives in native arrays instead of collections (i.e. int[], not List<Integer>). It sounds like in your use case that means going from the collection-of-objects pattern to the object-of-collections pattern.

It might be useful to try out the Colt collections: https://en.wikipedia.org/wiki/Colt_(libraries)

Sounds like you might be looking for something like Trove. http://trove.starlight-systems.com/

Someday, when Rust gets its GC interface, we can do some proper benchmarks.

It's still apples to oranges, because Rust will only ever have a fine-grained, optional GC (you choose which objects to pop into the GC), not a pervasive one.

Well, the second step is to make an allocator crate that puts GC in as the system allocator for std.

As far as Java is concerned, _still_ apples-to-oranges since Rust doesn't pervasively mallocate either.

Go would be an ok comparison. I suspect Go would come out on top, since it actually tries to avoid GCing things, whereas the Rust compiler wouldn't even be aware that there is a GC going on and wouldn't do any analysis.

I mean compare the same rust programs, with jemalloc vs with GC.

Ah, okay. That can work.
