Java’s new garbage collector promises low pause times on multi-terabyte heaps (opsian.com)
372 points by ScottWRobinson 5 months ago | 234 comments

> ZGC’s SPECjbb 2015 throughput numbers are roughly comparable with the Parallel GC (which optimises for throughput), but with an average pause time of 1ms and a max of 4ms. This is in contrast to G1 and Parallel, which had average pause times in excess of 200ms.

Not bad! Looking forward to seeing how this performs with a diverse range of workloads as it matures.

I wonder if that would make Java more suitable for game development, since pauses need to be <8ms, ideally.

I think a lack of value types also hurts Java significantly here. Though I believe they're also on the roadmap.

Yeah, implementation is under way for value types; the project is called Valhalla - https://en.wikipedia.org/wiki/Project_Valhalla_(Java_languag...

With generational GC, does this even matter? Generation 0 is pretty much a stack as far as performance and operation go.

Every object requires a header, plus an 8-byte pointer to reference it. This means even an empty object will cost you 16 or more bytes. Even if you have perfect data locality and zero cache misses, you will still exhaust your memory bandwidth on useless data.

Then there is the fact that you still cannot actually control allocation. The allocation pattern ABBAABABBA has worse read performance over all As or all Bs than AAAAABBBBB, but with two arena allocators you can write your code to allocate memory in the irregular pattern and still end up with contiguous data.
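The arena idea can be sketched with primitive int arrays standing in for arenas (all names here are hypothetical):

```java
// Hypothetical sketch: two int[] "arenas" keep A- and B-records contiguous
// even when allocation requests arrive in an interleaved order like ABBAAB.
public class ArenaDemo {
    static final class Arena {
        final int[] slots;
        int next;
        Arena(int capacity) { slots = new int[capacity]; }
        int allocate(int value) { slots[next] = value; return next++; }
    }

    public static void main(String[] args) {
        Arena a = new Arena(8), b = new Arena(8);
        String pattern = "ABBAABAB";               // interleaved allocation order
        for (int i = 0; i < pattern.length(); i++) {
            if (pattern.charAt(i) == 'A') a.allocate(i); else b.allocate(i);
        }
        // Reads of all A-records (or all B-records) now scan a contiguous array.
        System.out.println("A count: " + a.next + ", B count: " + b.next);
    }
}
```

The point is only that the allocation *order* and the memory *layout* are decoupled, which a general-purpose GC heap doesn't let you do directly.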

Problem is memory locality, or lack thereof. Having to chase pointers around and not having things laid out in sequence doesn't play well with the CPU cache.

I'm not arguing against value types - I definitely want those too - but it's not that simple nowadays: a moving garbage collector (like all modern server JVMs have) can do things for locality that non-moving systems (GC'd or not) can't - https://shipilev.net/jvm-anatomy-park/11-moving-gc-locality/

That only applies in a few specific cases. Basically, a linked list or an array of objects containing no further references. Otherwise, those internal references get immediately followed and significantly dilute the locality.

There have been garbage collectors designed to improve locality of predictable accesses, but they are not the norm.

When you embed a composite object into another composite object, in Java there always needs to be a pointer indirection between them. Even if GC and allocation were free, this would be a massive performance loss. (For example, if you have something that would fit into 4 bytes, you now have to fit an 8-byte pointer, and the minimum object size is probably 16 bytes.)

Alternatively you can just flatten everything into a single object, but this loses out on abstraction.

Can't the compiler flatten objects into one another?

For example, if one class includes a reference to another, merge the fields and methods of both classes into one big mega-class, then see if there is any redundant code which can be removed for this new megaclass.

Not in general, because of object mutation, subtyping, and the fact that object fields are semantically references. First, what happens if you update the field to point to a subclass with more fields? They won't fit in the data structure (there are some other problems too, but that's the biggest one).

If the classes have to be final and can't be modified through reflection, you'd probably be closer to a solution, but now any time you take a reference to a value in the inner class, you will be pointing into an object of the outer class, which means the GC has to figure out that the interior pointer is keeping the containing object alive. That's a pretty tricky thing to do, at least in a precise GC.

Ultimately, this is not the sort of optimization that a compiler can easily prove safe in general. You could prove it safe in specific instances (e.g. a local variable that never escapes), but in such cases Java can often allocate the whole thing on the stack anyway (or even in registers). If you want the optimization to work in other cases, creating something semantically different (that doesn't allow interior references and/or mutability) is a better option.

The GC still has to move objects that survive, and this isn't deterministic. It's fine for a lot of code, but for inner loops when latency matters, you want better guarantees than that.

It doesn't matter nearly as much as people believe. Value objects at this point are a kind of red herring. The JVM does aggressive escape analysis and is also very good at capturing short-lived objects. In the big picture, pointer chasing and memory latency are kind of the last frontier for performance grumps. It really becomes an issue with very large object arrays that must be iterated over quickly. In that case it's straightforward to implement your own value types using code-gen, or to use excellent libraries like Chronicle-Values [1].

[1] https://github.com/OpenHFT/Chronicle-Values

I'm not sure why you argue against giving the compiler the information it needs to do its job, and instead recommend these crazy codegen hacks, which even describe themselves as "poor man's" solutions.

> I'm not sure why you argue against giving the compiler the information it needs to do its job

There are several different factors to consider. Value objects will be a valuable addition to the core JVM, if only to standardize the many different approaches in place today. But even the standard value types will have strict limitations -- they will likely be immutable, and it's not clear that you'll be able to easily obtain their memory address. That means many projects will still need things like code-gen. Code gen, and buffer-backed objects in general, isn't a hack at all; it's a proven, battle-tested solution for many domains (finance, scientific sims, big data) that manipulate several gigabytes of objects in a performant manner.

Can you please elaborate?

It's just an object that lives on the stack, not the heap. In lots of languages when you create an object, it's pretty explicit where you're putting it. When Java was created, they wanted to get rid of that complexity, so primitives always go on the stack and objects always* go on the heap.

* I think Java has escape analysis, which means that if it can determine that an object you created doesn't leave its context, it will go on the stack and won't add work for the garbage collector.

It's more about the memory layout than where the objects live. For example, consider a Point class with (x, y) members.

If you have an array of these, each element in the array will be a reference/pointer to a separately allocated Point instance which can be anywhere in the heap. Accessing the array is expensive due to lack of locality (arbitrary memory access). Additionally, each instance has substantial object overhead (typically 8 or 16 bytes).

There are various ways to deal with this. One way is to have separate arrays for x and y. Another way is to use a byte array and use Unsafe to read/write the integers. This is terrible from a developer standpoint and is only acceptable if it is a small part of your application.
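The "separate arrays for x and y" option can be sketched like this (PointsSoA and its methods are made-up names for illustration):

```java
// Hypothetical structure-of-arrays layout for the Point (x, y) example:
// instead of Point[] (an array of references to separately heap-allocated
// objects), keep one primitive array per field so a scan touches
// contiguous memory with no per-element object header.
public class PointsSoA {
    final int[] xs;
    final int[] ys;

    PointsSoA(int n) { xs = new int[n]; ys = new int[n]; }

    void set(int i, int x, int y) { xs[i] = x; ys[i] = y; }

    long sumX() {                 // contiguous, cache-friendly scan over xs
        long sum = 0;
        for (int x : xs) sum += x;
        return sum;
    }
}
```

The cost is exactly what the comment says: you've traded the natural Point abstraction for manual bookkeeping across two arrays.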

Value types solve this by giving you an efficient memory layout while still allowing you to code against them as normal. The slogan is "codes like a class, works like an int".

Value types also give you more control over cache locality. Not having to travel that extra pointer redirection can really add up depending on what you're doing.

There is at least half an order of magnitude of speed benefit from a cache-line-size/alignment-tuned B-tree compared to e.g. a red/black tree or similar, if used to 'only' store pointer-size value types, and even more if the value type takes up less space. A case would be e.g. a float/int pair ordered by floats and used as a priority queue or such. That's an 8-byte value type. Though, for that case, it might need to be doubled up if there is a uniqueness constraint on the integers associated with the floats. LLC cache pressure can be very real.

I highly recommend perf stat -dd and perf top -g -e uops_executed.stall_cycles, as well as -e cycle_activity.stalls_ldm_pending. The former counts how long stalls last, including those where instructions like DIV just take a long time to execute; the latter counts stall events, i.e. one stall per load/store, not weighted by how long the stall actually stalled the core. I'm pretty sure the former ignores cases where just one hyperthread stalls, but I'm not as sure about the latter.

On Haswell and newer, unless using branch taken/not-taken profiling, I recommend setting perf config --system call-graph.record-mode=lbr before using perf top -g and perf record -g, as -fomit-frame-pointer binaries cause issues otherwise.
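The float/int pair mentioned above can already be hand-packed into a long in Java today. This is a sketch under the assumption of non-negative float keys (where the IEEE-754 bit pattern is monotonic, so ordinary long comparison orders by the float); all names are illustrative:

```java
import java.util.Arrays;

// Hypothetical sketch of the 8-byte float/int "value type" described above:
// the float key goes in the high 32 bits and the int payload in the low 32
// bits of a long. For non-negative floats, the IEEE-754 bit pattern is
// monotonic, so sorting the packed longs sorts by the float key.
public class PackedPairs {
    static long pack(float key, int value) {
        return ((long) Float.floatToIntBits(key) << 32) | (value & 0xFFFFFFFFL);
    }
    static float key(long packed)   { return Float.intBitsToFloat((int) (packed >>> 32)); }
    static int   value(long packed) { return (int) packed; }

    public static void main(String[] args) {
        long[] pairs = { pack(3.5f, 30), pack(0.5f, 10), pack(2.0f, 20) };
        Arrays.sort(pairs);                          // orders by float key
        System.out.println(value(pairs[0]));         // payload of smallest key: 10
    }
}
```

A long[] of such pairs is one contiguous allocation with zero per-pair headers, which is the locality win the comment is after.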

Moreover, they can be used in composite objects without pointer chasing. Suppose class A is a 32-byte value type. Then

    class B {
        A a1;
        A a2;
    }
is one contiguous 64-byte block of memory. Without value types, each A will be allocated separately on the heap, and B will be a pair of pointers.

Lack of value types makes it hard to get cache efficient memory layouts in Java.

Of course, if/when a1 also gets referenced from outside while you no longer need B, that's when things become more interesting.

Depends on how they implement value types. They might simply say 'you can't have a reference to a value type': you either embed it in your class, or you copy its value.

You mention it in your footnote, but objects don't necessarily live on the heap - they can also live on the stack, or in registers, or in vector pipelines, or not at all.

Also, an interesting aside is that objects are never literally 'allocated on the stack'. You won't find the same layout of memory on the stack as you do on the heap. Instead, objects are turned into data edges in the compiler's graph. It's much more abstract than literally doing an alloca.

For further details on how objects are not exactly allocated on the stack: https://shipilev.net/jvm-anatomy-park/18-scalar-replacement/

> It's just an object that lives on the stack, not the heap.

That's not quite correct. A reference type might contain a value type field, in which case the value type will still live on the heap.

Conversely, a local variable's type might be a value type, and _still_ involve a heap allocation. For example, std::vector in C++ is a value type but the array storage is heap-allocated.

So it's really about semantics, and not memory layout. Accessing an lvalue with a value type produces a copied rvalue; with reference types you're only copying the reference itself.

> For example, std::vector in C++ is a value type but the array storage is heap-allocated.

This isn't correct. Firstly, C++ doesn't have value or reference types. Where any object is stored depends purely on how it is allocated. Either in the free store (dynamic) or automatic storage (the stack).

std::vector has a constant object size - typically the size of 3 pointers (though that depends on the implementation; the standard doesn't specify): the start of the storage (which is dynamically allocated), the end/capacity (typically either another pointer or a size_t holding the current capacity), and the back (typically either a pointer just past the last element used or a size_t holding the current size). Its storage is dynamically allocated, though, and for a std::vector<T> will be sizeof(T) * capacity bytes (maybe more due to alignment).

std::array is constexpr-safe and does not (directly) dynamically allocate memory - the size has to be known at compile time (hence it's a template argument).

For instance, on my compiler (GCC 4.8.4), for this example program:

  #include <iostream>
  #include <vector>
  #include <array>

  int main() {
        std::vector<int> v_ten(10, 0);
        std::vector<int> v_twenty(20, 0);
        std::array<int, 10> a_ten{0};
        std::array<int, 20> a_twenty{0};

        std::cout << "sizeof(v_ten): " << sizeof(v_ten) << std::endl;
        std::cout << "sizeof(v_twenty): " << sizeof(v_twenty) << std::endl;
        std::cout << "sizeof(a_ten): " << sizeof(a_ten) << std::endl;
        std::cout << "sizeof(a_twenty): " << sizeof(a_twenty) << std::endl;

        return 0;
  }
produces this output:

  sizeof(v_ten): 24
  sizeof(v_twenty): 24
  sizeof(a_ten): 40
  sizeof(a_twenty): 80
edit: added comment about alignment.

The parent is correct. std::vector is a value type in the sense that everyone else is talking about... All types in C++ are, as their members are not inherently stored on the heap. C++ (and C, and Rust) has types that behave like what other languages call value types, even if there is not any other sort of type.

The mention of array in the parent was referring to the dynamic storage of a vector, not std::array.

> When Java was created, they wanted to get rid of that complexity, so primitives always go on the stack and objects always* go on the heap.

This was inherited from Smalltalk. Everything was an "object reference," with tag bits and all, but SmallInteger was optimized by using the low 16 bits of the object reference pointer to store the SmallInteger. So you can think of Smalltalk as having SmallInteger as a "primitive" type with pass by value.

I worked for a company where we had 100-150 microsecond order latency. We had to tune the JVM to extremes, but it is very doable to optimize Java GC for low pause times.

With those latency requirements why did you use Java?

Plenty of people are writing low-latency trading applications in Java where the latency budget is under 100 microseconds. This has been the case for several years now.

It's by no means new or even especially difficult these days. The resulting code is far more maintainable and robust than C++ solutions.

I've had the opposite experience. For green-field applications, it's far easier to write the application in C++ than in Java and meet the latency requirements. Java requires far too much tuning, whereas with C++, from the get-go, you can generally just glance at the code and have a good idea of the latency.

We're currently fighting a Java app that in general has decent latency (10s of usecs), but has outliers of greater than a second when GC kicks in. We don't have that issue with the C++ components of our trading system.

It's always going to be subjective. If you have people experienced in GC tuning, it's pretty easy to weed those GC outliers out. I am currently doing that for one of the applications at the company I work for. For me, it would be far easier to write a low-latency, high-concurrency Java app than a C++ app. The last C++ app that I was asked to write works horribly and bleeds memory. But it was not an important app.

Oof. There are probably some compounding factors that make it more difficult, but usually getting GC pauses under 500ms isn't that hard with current-gen collectors: either being very careful with CMS, or just feeding G1 a lot of memory.

I’ve heard a pretty effective, lazy strategy for this scenario is to oversize the heap to avoid GCs through the trading day and to simply restart the application before the next day. Unsure if it’s viable for all trading platforms, but it makes sense for some I’m sure.

Depends on your order volume. Sure, an oversized heap may work for low order volume, but if you're dealing with high volume, even hundreds of gigabytes of RAM isn't enough for a trading day.

And yeah, we restart everyday (we only trade US equities, so plenty of downtime for a restart).

You would be surprised to know that the majority of trading companies/exchanges use Java for their trading platforms. So does the majority of companies in the finance sector. I have been working in this sector for over 10 years and have interviewed with many of them.

I would be surprised. I was under the impression that kdb+ holds the majority role in fintech.

If you only tuned the JVM, that might be interesting. Most often it's the Java code itself that's being tuned - to the point where there's no point in using Java - with (almost) no GC, off-the-heap memory allocations and all that jazz.

It's not either/or. Everything needs to be tuned to achieve the final latency requirements. This includes tuning Java code to eliminate creating "garbage" - immutable String objects mostly, unnecessary object creation, preferring short-lived objects over long-lived ones, etc. The other part is right-sizing the JVM's generational partitions, using the right GC algorithm, and so on. The next part would be to find the code path with the lowest latency requirement and make it hot/JITted. After that you can tune the OS for things like thread pinning, and you can go further still and tune the network layer. Every little thing counts.
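As a small illustration of the "eliminate garbage" step (LogLine is a hypothetical class, not from any particular codebase):

```java
// Illustrative example of eliminating per-call garbage on a hot path:
// string concatenation in a hot loop creates a throwaway builder and
// intermediate Strings per call; reusing one preallocated StringBuilder
// avoids that churn in the young generation.
public class LogLine {
    private final StringBuilder buf = new StringBuilder(256); // reused across calls

    String format(String symbol, long price, long qty) {
        buf.setLength(0);                 // reset instead of reallocating
        buf.append(symbol).append(' ').append(price).append('x').append(qty);
        return buf.toString();            // one allocation instead of several
    }
}
```

Note this pattern trades a little API cleanliness (the instance is now stateful and not thread-safe) for allocation behavior you can reason about.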

You can go even further and employ what's called mechanical sympathy (https://mechanical-sympathy.blogspot.com/) and write your code in a way that makes it easier for the processor to process faster.

I've never worked with application that use off-heap allocations or no-gc requirements.

This doesn't make much sense. Using object pools and off-heap data structures doesn't warp the code that much, if at all. You still end up with pretty much all clean, simple Java code.

I've seen (10+ years old) low-latency Java apps where people avoided using any OO (i.e. everything is static methods)... and even then the code was still often easier to maintain and understand than equivalent C++ applications. Nobody does this any more anyway.

8ms? Where do you get that number from?

If your player has a 144Hz monitor then your pause time, rounded to the nearest millisecond, has to be 0ms.

Where do you get your number from? 1/(144 Hz) is 7ms. 1/(60 Hz) is 17ms.

I get my number from decades of professional game programming.

In a 7ms frame, you are spending most of that frame doing the work of rendering the actual frame (unless your game is so trivial that the GC is going to be easy / fast anyway). An additional millisecond is going to cause you to miss your deadline and drop a frame. Dropped frames feel really bad.

This sort of stalling is only caused by extremely short lived objects. Pooling and provisioning your data ahead of time goes a long way to fixing these specific issues to the degree that they become unnoticeable. There's a lot of tooling built into LWJGL to support this along with their own vector classes which are easily reclaimed.
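A minimal sketch of that pooling pattern (Vec3 is a hypothetical class here; this is not LWJGL's actual API):

```java
import java.util.ArrayDeque;

// Minimal sketch of the pooling approach described above (hypothetical Vec3,
// not LWJGL's actual API): vectors are provisioned up front and recycled, so
// the per-frame hot path performs no allocations for the GC to chase.
public class Vec3Pool {
    public static final class Vec3 { public float x, y, z; }

    private final ArrayDeque<Vec3> free = new ArrayDeque<>();

    public Vec3Pool(int capacity) {
        for (int i = 0; i < capacity; i++) free.push(new Vec3());
    }

    public Vec3 obtain() {               // falls back to allocating if exhausted
        return free.isEmpty() ? new Vec3() : free.pop();
    }

    public void recycle(Vec3 v) {
        v.x = v.y = v.z = 0f;            // scrub state before reuse
        free.push(v);
    }
}
```

The discipline of pairing every obtain() with a recycle() is exactly the manual-memory-management burden discussed further down the thread.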

Also has it really been a decade since Braid? Wow.

This is true, but it mainly just mitigates the problem. It can never solve the problem, because the whole concept of this kind of scheme is that the memory manager has some volition of its own ... thus it can choose to do things when you don't want or expect (actually this is unavoidable).

Also note that this category of answer is basically saying, "look, if you mostly manage your own memory, then GC takes less time!" That's true, but a large part of the value proposition of GC in the first place was to remove the burden of memory management. Once you are saying actually, GC won't do that for this class of application, then really what you are getting out of GC is memory safety (provided the rest of the language is memory-safe). On the one hand, hey, memory-safety is a benefit. On the other hand, I don't think very many people in game development would trade that much performance just for memory safety.

(And in fact in game development we very often have to do unsafe memory things. So really what ends up being said is "much of the system has memory safety" which, really, does not sound very alluring.)

> That's true, but a large part of the value proposition of GC in the first place was to remove the burden of memory management.

This is a good point. I've spent a fair amount of time trying to build performant stuff on the web, and my conclusion has been that if you really care about memory management, garbage collection is not your friend.

You spend more time than is necessary trying to convince the engine not to allocate anything you don't want, trying to lay out objects ahead of time, etc... I have spent more time learning about how Chrome's garbage collection works than I would have ever spent manually tracking references myself.

And the end result is that I still have to manually track references and be careful about where things are allocated - the garbage collector for me is objectively a net negative, not just for performance but for dev time and program complexity. And I'm not even really doing anything all that complicated in any of the software I build -- it is not hard to run into these kinds of issues.

This is something that I didn't understand until I saw it in my own software, but I strongly suspect that if I could get rid of browser garbage collection I would spend less time thinking about memory than I do now.

Since you're talking about the Chrome GC, I'm going to assume that you're writing code in JS. If that's the case, and you need such performance that you're having to worry about GC and tracking references, then you are probably using the wrong language. Admittedly WebAssembly is pretty new, but it sounds like that is your solution; even asm.js would be worth considering.

> I have spent more time learning about how Chrome's garbage collection works than I would have ever spent manually tracking references myself.

Do you have any good references you can share regarding Chrome's garbage collection? This is a topic I would like to learn more about.

Unfortunately not - there are probably good resources out there, but most of what I know has come out of manual profiling in Chrome, different experiments, and the occasional StackOverflow post that explains a weird behavior or two.

At some point in the future I will probably write a blog post about good techniques for taking back control of JS memory management from the browser, but it probably won't be for a while.

The short answer though is that browsers try to defer garbage collection until it looks like the page isn't doing much; and the problem is that games don't ever really stop doing stuff. This encourages the browser to put off garbage collection and handle it in chunks, which in turn leads to dropped frames because you're doing a lot of deallocation at once.

This is why if you profile a Chrome app that's doing a lot of allocations very quickly, your memory usage will kind of sawtooth all over the place. And you get around that by getting rid of as many allocations as possible.

IMO the lack of garbage collection in WASM is the most exciting thing about it, but different languages have different tradeoffs and WASM isn't appropriate for every application.

You've talked about that before - "we very often have to do unsafe memory things", and that a lot of things done in game dev end up violating a lot of principles that are taught in college - but to be honest I'm not sure. Granted, I haven't worked directly in game dev before, but I've worked on some projects where runtime perf was the _only_ requirement... and I'm just not convinced that these issues couldn't be worked around using constructs like smart pointers, weak pointers, and just general design principles...

That said, The Witness is one of my favorite games of all time, so I think it's safe to say you know what you're doing.

The graphics APIs we use are often just about managing raw buffers of memory -- often different areas of memory that have different performance characteristics (Onion Bus vs Garlic Bus, yo). You simply can't do that under constraints of memory safety.

But beyond that ... "smart pointers" and the like make your program slow, because they postulate that your data is a lot of small things allocated far away from each other on the heap. I spoke about this in more depth at my Barcelona talk earlier this year.

Although I would argue that in games, memory safety is the door to game cheats, bots and piracy.

Wait, why have a GC if you're going to work around it and end up with manual memory management anyways?

So that when you make a mistake, instead of random memory corruption, you get a leak or a slower GC cycle.

I nearly spit my drink out from that dude's response, because I just happened to peek at your username as I was reading.

It's a crazy place here on HN.

60Hz displays give a frame time of ~16.667ms. 8ms is around half of that, giving you the opportunity to drop a frame, but not much more than that.

Hrm, I would think that if you were developing a game in a GC environment, where GC timing turned out to be a problem, you would get rid of dynamic memory allocation, and try to do everything much more like embedded system programming with fixed memory blocks...?

I suppose at the complexity of modern games that's simply not possible anymore...

You can do that, and people do. However a GC language doesn't normally give you the flexibility of getting a block of memory and doing whatever you want inside it, instead you use things like object pools.

You can allocate a block of memory with `ByteBuffer.allocate()` or, if you want it outside the JVM heap, `ByteBuffer.allocateDirect()`.

You lose all the object/memory management features, though there are some projects around to provide struct-like APIs: https://github.com/alaisi/nalloc
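A hand-rolled sketch of that struct-like idea over a plain ByteBuffer (the record layout and names are illustrative):

```java
import java.nio.ByteBuffer;

// Hand-rolled "struct" view over a ByteBuffer, in the spirit of the
// struct-like libraries mentioned above (field names and layout are
// illustrative): each 12-byte record holds an int id followed by two
// floats, stored contiguously outside the reach of object headers.
public class ParticleBuffer {
    static final int RECORD = 12;                 // 4 (id) + 4 (x) + 4 (y)
    final ByteBuffer buf;

    ParticleBuffer(int count) { buf = ByteBuffer.allocateDirect(count * RECORD); }

    void setId(int i, int id)  { buf.putInt(i * RECORD, id); }
    int  getId(int i)          { return buf.getInt(i * RECORD); }
    void setX(int i, float x)  { buf.putFloat(i * RECORD + 4, x); }
    float getX(int i)          { return buf.getFloat(i * RECORD + 4); }
    void setY(int i, float y)  { buf.putFloat(i * RECORD + 8, y); }
    float getY(int i)          { return buf.getFloat(i * RECORD + 8); }
}
```

Libraries like the one linked above essentially generate this kind of accessor boilerplate for you.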

System.Runtime.InteropServices.Marshal in C#'s case.

Also `fixed` in C# [1]

Not used Span<T> and Memory<T> [2] in anger yet, but they also seem to be attacking the issue from the other end

[1] https://docs.microsoft.com/en-us/dotnet/csharp/language-refe...

[2] https://blogs.msdn.microsoft.com/mazhou/2018/03/25/c-7-serie...

I consider object pools to be well within that family of problem solving that I'm talking about.

For a subset of developers, definitely. Overall I don't think it would lead to significantly more adoption, as the main draw (write once deploy everywhere) is already accomplished by engines like Unity.

On smaller heaps it was already possible to get 5-10ms pauses with the (now deprecated) CMS collector. Unlike ZGC it was a non-compacting collector, though, and had some ugly worst-case behaviors.

Yeah, CMS is great until it isn't, and then it ruins your whole day. A bit of a black art to get tuned well, too, and when your workload changes slightly your tuning is no longer effective and the world stops for 30+ seconds. Hopefully ZGC will be more like G1 in terms of "feed it some extra memory and set a latency target" being all the tuning you need most of the time.

Java is already suitable for game development, especially with LibGDX.

Not everyone needs to try to write the next Crysis in it.

It certainly has higher performance than all those HTML5 and Flash games.

The biggest problem is that jMonkeyEngine is the only RAD tooling available, and it's no match for any of the engines that make use of C#.

> Java is already suitable for game development, especially with LibGDX.

I think it is safe to assume that under "game development" most people understand rather high performance demands, not a solitaire clone (especially since the comment you responded to pointed out pause times as crucial).

You can also make a 3d game of large scale, and just let it have hiccups occasionally and still make a billion dollars.

Of course those of us hosting servers for that particular game would have saved a ton of money and pain if they had not used Java, or used Java more cleverly.

The point is that they should have used Java more cleverly.

One can make a game in C++ and make it just as bad.

The lack of value types makes the code quite unwieldy for controlling memory layout, vs. C++, or C#, or Go, or Rust, or anything else.

Again, not every game needs to be the next Crysis.

Allocating every 3d coordinate on the heap probably wasn't a good idea.

Wasn't Minecraft originally written in Java? Might still be.

Minecraft also has generally awful performance (although this is generally attributed to it being coded poorly, not to Java).

Yes, but Microsoft ported it to C++ as a means to have a common code base across mobile devices, game consoles and desktop computers.

Yes. But I believe that Microsoft made an improved version written in C++ after the acquisition.

"improved" is a matter of opinion. The Java version has mod support and the editions are kept at feature parity.

I remember when they promised a modding API... Not even the C++ version has resulted in one... The fact that Notch went with Java might have been a blessing in disguise, because it made the reverse-engineering process semi-automated.

That is known as AAA game development.

The memory mapping trick they use on x86 to avoid masking only works up to the maximum 48 bits of addressable virtual memory, so there are far fewer than 22 free bits in that case.

It's also not quite free since it takes up TLB space.

Agreed - I'd go further and say it only works for addresses in the userspace half, so on Linux you lose an extra bit and only get 47 (I don't know how the mappings are set up for Windows/OSX, sorry). It also might be worth pointing out that you now need 16x the page mappings (since every mapping needs 15 duplicates for the possible flag states). I don't know how Java does its memory management, but if it does lots of small mappings (which I'm guessing it does not) then that could be a concern. And like you mentioned with the TLB, it takes up space and also slows things down a bit: if the flags change, the entry in the TLB will no longer be used (since the address is now different) and the MMU will have to walk the page table again.

In practice, I would wager the performance concerns aren't huge, but that's mostly just a guess. I'd personally be interested in seeing a comparison to masking to see just how much slower masking would be, but obviously it's not like they can just flip a switch to use masking.
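Java can't tag real references, but the masking alternative under discussion can be illustrated by simulating a "colored pointer" in a plain long. The bit positions here loosely follow ZGC's layout (42 address bits, metadata bits above); the constants are illustrative:

```java
// Toy illustration of the masking alternative discussed above. Java cannot
// tag real references, so this simulates a 64-bit "colored pointer": the
// address lives in the low 42 bits and metadata (color) bits sit just above
// it. With masking, every address use must strip the color bits first --
// the extra AND that ZGC's multi-mapping trick avoids.
public class ColoredPointer {
    static final int  ADDRESS_BITS = 42;
    static final long ADDRESS_MASK = (1L << ADDRESS_BITS) - 1;
    static final long MARKED0      = 1L << ADDRESS_BITS;        // illustrative
    static final long REMAPPED     = 1L << (ADDRESS_BITS + 2);  // bit positions

    static long color(long ptr, long bits)  { return ptr | bits; }
    static long address(long ptr)           { return ptr & ADDRESS_MASK; } // the mask ZGC avoids
    static boolean isMarked0(long ptr)      { return (ptr & MARKED0) != 0; }
}
```

With multi-mapping, address() would be a no-op because each color's bit pattern maps to the same physical memory.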

> It also might be worth pointing out that you now need 16x the page mappings

It doesn't. The article mentions it's only 3 mappings since those bits are colors, not arbitrary combinations of flags.

> I don't know how Java does it's memory management but if it does lots of small mappings (Which I'm guessing it does not) then that could be a concern.

openjdk generally uses large contiguous mappings but it may punch holes in the middle of the heap if it's configured to yield back memory to the OS. But applications that dynamically shrink and expand their heaps are not necessarily those that are concerned about the last quantum of page table overhead.

> And like you mentioned with the TLB

On the other hand it does support huge pages to mitigate costs of TLB entries.

> It also might be worth pointing out that you now need 16x the page mappings (Since every mapping needs 15 duplicates for the possible flag states)

Nope. They went into this in the article:

> Since by design only one of remap, mark0 and mark1 can be 1 at any point in time, it’s possible to do this with three mappings. There’s a nice diagram[1] in the ZGC source for this.

[1]: http://hg.openjdk.java.net/zgc/zgc/file/59c07aef65ac/src/hot...

That might not be totally current, as it doesn't cover the finalizable flag, but if it works the same, that would only be four mappings. If it works differently, then it would be a maximum of 6 mappings. Not 16.

And it looks like you only need 3 mappings for the entire space. It's not like you need one per object in memory. Unless I misunderstood the gist of the article.

Many Java applications will gladly take a 10%, or even higher, performance hit if that means more predictable GC times. If you own a latency-sensitive application, it's common for tail latencies to be dominated by GC times. Bringing tail latencies closer to the median would be a huge win for those apps, even if the median itself moved.

I agree that it definitely can't be completely without cost. But I'll speculate that they may have arranged for the don't-care bits in the pointers to all be 0 when GC is not running and doesn't need the bits. If so, that could mitigate the TLB waste during periods when GC isn't running.

In other words, just because the alternate mappings exist doesn't mean the pointers always have values that actually use them.

x86 caches use virtual addresses for indexing (and physical bits for the tag), so this will increase cache pressure too.

It takes more TLB entries, but not physical cache space. Virtual indexing just makes use of the page offset being the same between virtual/physical mappings in order to select the cache set before mmu translation is available.
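To make the parent's point concrete, here's a quick sanity check of why a typical VIPT L1 doesn't alias; the cache parameters below (32 KiB, 8-way, 64-byte lines, 4 KiB pages) are common x86 values assumed for illustration, not taken from the article:

```java
public class ViptCheck {
    public static void main(String[] args) {
        int cacheBytes = 32 * 1024; // 32 KiB L1 data cache
        int ways = 8;               // 8-way set associative
        int lineBytes = 64;         // 64-byte cache lines
        int pageBytes = 4096;       // 4 KiB pages

        int sets = cacheBytes / ways / lineBytes;                      // 64 sets
        int indexBits = Integer.numberOfTrailingZeros(sets);           // 6
        int offsetBits = Integer.numberOfTrailingZeros(lineBytes);     // 6
        int pageOffsetBits = Integer.numberOfTrailingZeros(pageBytes); // 12

        // index + offset (6 + 6 = 12 bits) fit entirely inside the page offset
        // (12 bits), which is identical in the virtual and physical address.
        // So extra virtual aliases cost TLB entries, but no extra cache sets.
        System.out.println((indexBits + offsetBits) <= pageOffsetBits); // true
    }
}
```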

There is also an interesting trick added for Shenandoah to make thread synchronization time bounded.

Basically, you don't want to check for GC in each (allocation-free) loop iteration. On top of the overhead, it makes optimization harder. But if you have 32+ threads, some are probably stuck in long loops for a good while, which means spinlocks for everyone waiting on the stop-the-world GC.

The trick to avoid this is to add an inner loop that does a fixed number of iterations, and then do a GC check between inner loops: https://bugs.openjdk.java.net/browse/JDK-8186027
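A hand-written sketch of the idea; in reality the JIT performs this transformation itself, and the chunk size here is made up:

```java
public class StripMining {
    // Naive counted loop: the JIT may compile this with no safepoint poll
    // inside, so a GC requesting a stop-the-world pause has to wait for the
    // entire loop to finish.
    static long sumNaive(int[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++) sum += data[i];
        return sum;
    }

    // Strip-mined version: a bounded inner loop plus a poll between chunks,
    // so time-to-safepoint is bounded regardless of array length.
    static long sumStripMined(int[] data) {
        final int CHUNK = 1024; // illustrative; the real heuristic lives in the JIT
        long sum = 0;
        for (int start = 0; start < data.length; start += CHUNK) {
            int end = Math.min(start + CHUNK, data.length);
            for (int i = start; i < end; i++) sum += data[i]; // poll-free hot loop
            // <-- a safepoint poll would be emitted here, once per chunk
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] data = new int[100_000];
        for (int i = 0; i < data.length; i++) data[i] = i;
        System.out.println(sumNaive(data) == sumStripMined(data)); // true
    }
}
```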

> Limited to 4tb heaps

I can't believe the world we're living in; this is incredible. My first computer had 512K of RAM.

Java is at least somewhat responsible for the need to get to 4TB in the first place.

Sounds interesting. Could you explain how so?

Probably just a joke because of the tendency of Java applications (and other garbage collected languages to be honest) to consume an ungodly amount of RAM. Although these days I'd probably blame the web for that...

Not garbage collected languages per se; it's languages where data structures can't be value types[1], consequently consuming large amounts of extra memory because everything sits behind 2 or 3 or 5 layers of unnecessary 8-byte pointers.

[1] https://en.wikipedia.org/wiki/Value_type_and_reference_type
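As a rough illustration of the overhead being described, here's some back-of-the-envelope math comparing n points stored as objects vs as flat arrays. The layout constants (12-byte header, 4-byte compressed references, 8-byte alignment) are typical 64-bit HotSpot defaults, assumed for the sake of the example:

```java
public class LayoutMath {
    static long align8(long v) { return (v + 7) & ~7L; }

    // Point[] where class Point { double x, y; }: one reference per element,
    // plus a separately allocated object per element.
    static long objectLayoutBytes(long n) {
        long perObject = align8(12 + 8 + 8); // header + x + y -> 32 bytes
        return n * 4 + n * perObject;        // 4-byte refs + objects (array headers ignored)
    }

    // Plain double[] x, double[] y: just the payload (array headers ignored).
    static long flatLayoutBytes(long n) {
        return 2 * n * 8;
    }

    public static void main(String[] args) {
        long n = 1_000_000;
        System.out.println(objectLayoutBytes(n)); // 36000000
        System.out.println(flatLayoutBytes(n));   // 16000000
    }
}
```

For a million points that's roughly 36 MB of pointer-chased objects versus 16 MB of flat data, before even counting the cache misses from the indirection.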

He's just trying to bash Java. HN readers hate Java for some reason.

Everyone I know in tech hates Java. They've all had to suffer through terrible poorly-performing Java desktop apps (or worse, applets) which has tainted the image of the language for a generation.

I'm not supporting the language-bashing in this thread, but this was the reason I never wanted to develop in Java. That terrible desktop experience created a notable bias against Java, and only later did I learn that it's not that slow for network services (but by then it was too late).

This terrible desktop experience was created by developers that didn't bother to learn Swing before they jumped onto the keyboard.

Writing everything on the main UI thread, not enabling the system L&F, never reading books like "Filthy Rich Clients", not adopting third-party Swing components with a modern L&F...

I suggest you check out JavaFX. It is a mature, modern, flexible UI library that is now considered the standard.

But of course, bad programmers make bad apps, no matter how good the ecosystem is.

I've found that terrible performing Java apps were created by terrible performing engineers.

Java, from my experience, has always been quick and flexible.

It's not 100% wrong. "Everything is an object" comes with some extra memory overhead, plus some GC accounting. Most garbage collectors need a decent amount of "slack" space to avoid OOM or long STW collections as well. So, I could see Java needing 1.5-4x the memory of C++ pretty easily anyways.

512 bytes here.

What kind of PC had 512 bytes of RAM?

KIM-1, MK-14, and other similar systems I built myself. I remember (a couple of years later), at age 17, getting a summer job at a POS terminal company that had just taken delivery of the first 64-kbit DRAMs, and being super excited when my boss allowed me to take some of my wages in RAM chips at their high-volume price.

Not sure about the parent but my family's first computer had a whopping 4k of ram.

BASIC on the Atari 2600 did a remarkable amount with 128 bytes of RAM.

An interesting fact is that the HotSpot team went back on their 20-year conviction that load barriers reduce performance too much. I guess improvements in CPU and VM design have changed their minds.

This is a big deal, and a major departure.

I'm hoping this will improve things; the new architecture more closely resembles Azul's C4. Java on large heaps with non-conforming object allocation patterns (medium-aged objects are the most troublesome) sucks really badly.

They do reduce performance a lot. I think what changed is there are now classes of customers who are willing to take a huge performance hit for reliably low pause times or reliably huge heaps. Whereas in the past heaps were smaller and perhaps latency tails were something people cared about less.

G1 does a lot better here in my (somewhat limited) experience. There's some overhead, and you have to give it some extra space, but the mixed collections are decent at making their way through medium-aged objects that survived to the old generation. Still nowhere close to 5-10ms though, and with a fair bit of memory bandwidth burned copying it through the survivor spaces.

Who are these people designing systems with terabytes of heap under a single GC, but who still want/expect/need low pause times?

I would think if there were processes that needed consistent latency you would isolate them.

I think I read somewhere that the heaps of e-commerce web applications can be huge because of all the loaded strings. I think that was around the Java 9 update, which added compact strings to reduce string size.

Probably really old teams that had their scope change over time, with decades of technical debt, whose large controlling entities are too slow-moving to afford any kind of large refactor or rewrite in a more suitable language.

All I could think during reading - is this going to affect my Minecraft server parameters?

We use Shenandoah everywhere. It’s amazingly great.

Is that checked into OpenJDK for the 11 release? Does it have any advantages to ZGC?

No, it's still a separate fork for Java 8/10/11 as of right now, although some distributions like Red Hat/Fedora do ship with it enabled as a "Technology Preview" on all their enabled JDKs, so you can just get it out of the box. I think Gentoo has it as well.

(If it also means anything, I can confirm that enabling Shenandoah is as simple as taking the corresponding OpenJDK code and replacing the `hotspot` subcomponent with the (compatible) Shenandoah fork'd hotspot component, then just building everything normally.)

> Pointer colouring

(Is this the same as pointer tagging?)

I would say that pointer colouring is a type of pointer tagging. AFAIK the colouring in ZGC represents different states. So the memory page mappings rely on the pointer having one colour (1 of the 4 bits set) whereas pointer tagging is the term for the general concept of encoding information in pointers.

I'm not entirely sure what the point of the multi-mapping is. It says that it means you don't have to do any pointer bitmasking to dereference, but you already have load barriers, which have to at least check to see what phase you are in and possibly do work every time you dereference a pointer, so how much extra latency would doing the bitmask really add? And the problem with the multi-mapping is you will be resolving more virtual addresses, increasing the working set for the TLB. Seems like there could be increased TLB thrashing.
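For a sense of what the masking alternative would look like, here's a toy model using plain longs as "pointers". The 42-bit address and the remap/mark0/mark1 colors follow the article's description, but the exact bit positions are my own assumption, not ZGC's actual layout:

```java
public class ColoredPointers {
    static final long ADDRESS_MASK = (1L << 42) - 1; // low 42 bits: up to 4 TB of heap
    // Color bits; positions 42-44 are assumed here purely for illustration.
    static final long MARK0 = 1L << 42;
    static final long MARK1 = 1L << 43;
    static final long REMAP = 1L << 44;

    // The masking design: strip the metadata on every dereference.
    static long toAddressByMasking(long coloredPtr) {
        return coloredPtr & ADDRESS_MASK;
    }

    public static void main(String[] args) {
        long addr = 0x0000_0123_4567_89A8L; // some offset within the 42-bit heap range
        long colored = addr | MARK0;        // the GC stamps a color onto the pointer

        // Masking: one extra AND on every load.
        System.out.println(toAddressByMasking(colored) == addr); // true

        // The multi-mapping design instead maps the same physical memory at
        // addr, addr|MARK0, addr|MARK1 and addr|REMAP, so 'colored' can be
        // dereferenced directly with no masking instruction -- at the cost of
        // extra virtual mappings and, as the parent notes, TLB pressure.
    }
}
```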

> today AWS will happily rent you an x1e.32xlarge with 128 vCPUs and an incredible 3,904GB of ram.

> ...

> ZGC restricts itself to 4Tb heaps

Isn't a 4TB heap limit short-sighted for a GC intended to last over a decade?

4TB uses 42 bits of address space. In a 64-bit pointer that leaves 22 bits (though x86-64 hardware only translates 48 bits, so 6 of those are practically usable), out of which ZGC uses 4.

If anything, borrow a few more bits, and you should easily be able to address petabytes.

> 4TB uses 42 bits of address space.

Can't you compress that into less than 42 bits?

No, compression only works when you don't need to represent every possibility, or can use fewer bytes for some inputs and more bytes for others.


Why do you need to represent every possibility? You obviously aren't going to squeeze 2^42 objects into 2^42 bytes are you? They each take more than a byte. You don't need to address bytes individually.

There's more holes than pigeons here.

Suppose we want to store 2^(48-n) 2^n byte objects in your hypothetical <42 bits of address space. Say, 2^47 two byte objects. (Alternatively, more than 2^(48-n) 2^n byte objects in 42 bits of address space.)

How do I compute the hash of every object, since by definition I can't address every byte?

You can still read any byte you want by decompressing the pointer and using it as normal.

And why would you want to read raw bytes from an object to hash it? What requires this in Java?

4Tb ought to be enough for anybody.

I can't imagine many use cases where you wouldn't want to use multiple processes before you hit 4TB. But yeah, I'm sure there are some...

Why would you want to introduce the extra complexity of IPC just because your GC sucks and can't handle the heap size that you need?

Besides, processes are a ham-fisted way to get around the PDP-11's limited address space and shouldn't be a thing in 2018 anyway.

Reminds me of the quote "640K ought to be enough for anybody."

(I'm guessing you were thinking of the same quote [which is apparently apocryphal].)

Interested to know what workloads are using multi-TB heap sizes? Largest I've used has only been in the 48-64GB size and even then it seemed like a waste.

1.2 TB Xmx (Java heap) here, reporting in from LinkedIn not fintech.

Basically, graphs are hard to shard ... what if we didn't bother? You can query an in memory graph pretty quickly out into the 2nd or 3rd degree.

Even G1 (untuned) only pauses for 250ms once every few minutes. It's not perfect but good enough to ignore.

250ms is atrocious for finance. Need a small handful of microseconds to not be noticed.

Of course. But this is with zero tuning in place. This is just what the default G1 gets you on that heap size. I'm sure with 2 or 3 flags we could bring that down easily. Given how infrequently we're seeing them (one every few minutes) it's fine for what we're doing.

Well, G1's default pause-time goal (-XX:MaxGCPauseMillis) is 200ms, so it won't even try to do better than that without 'tuning' (telling it to pause for less time).

I have only heard about it in the context of Fintech, where they keep all transaction data around.

At least they seem to be Azul's biggest customer base.

> Some platforms like SPARC have built-in hardware support for pointer masking so it’s not an issue but for x86, the ZGC team use a neat multi-mapping trick.

Does anyone know what this instruction on SPARC is?

"For sale: baby shoes, never worn" has nothing on "low pause times on multi-terabyte heaps"

Nice. But unfortunately, it's owned by Oracle. Use it, and you'll be constantly worrying what their lawyers are up to next.

ZGC is part of OpenJDK and licensed under GPL.

Doesn't matter. Our legal system is such that Oracle can still burn you down, even if you are right.

> Our legal system is such that Oracle can still burn you down, even if you are right.

Do you happen to know of any real-world case that's similar to the scenario you've described?

Google v Oracle comes to mind.

Google should not have tried to screw Sun; they had the option to buy Sun and chose not to.

Google basically won that, didn't they? Despite cloning another company's technology from scratch without paying any kind of licensing fees at all.

Didn't Google end up importing OpenJDK code?

Partially, you cannot pick a random jar from Maven Central and be confident that it will work on Android.

They only support part of the standard library, the OpenJDK parts have been cherry picked and not all are made available on the same Android version and then there are the bytecodes that were introduced since Java 7.

And for Java language support, you need to have Android Oreo version of ART as not all language features get desugared into older versions.

There is some AOSP activity related to Java 9, but so far Google has been silent if there will be any further improvements beyond Java 8.

How so in the case of using GPL code?

I'm not completely sure because I'm not familiar with the topic, but maybe it has to do with the fact that even when they sue you with an impossible-to-win lawsuit, they can destroy you with the cash you'd need to spend just to start defending yourself.

This can also destroy investment rounds for many startups.

Okay but they can do that even if you don't use Java.

And even if you don't do any business with them. Oracle can sue anyone; they sue a lot. Like most companies, they only do so if they think they can make a profit.

They could sue you, with a higher chance of winning, for using Python, JS, or Ruby than for using OpenJDK-derived code.

Oracle holds JavaScript trademark. Nobody's safe :)

I'm fine with calling it EczemaScript

OpenJDK 8 or get out! Seriously though, I am concerned about the future of Java. I am keeping everything on OpenJDK 8 personally.


Basically, Oracle is attempting a hostile takeover of Java. I'm keeping everything I do on OpenJDK 8 to avoid potential legal action.

You know that the majority of OpenJDK developers are Oracle employees, right?

Exactly. Some people think large software such as programming languages, compilers, GCs, etc. can write itself. Or that a few-thousand-dollar donation will be enough to pay top engineers to keep working on it.

Oracle/Google/MS or Apple will continue to exert control over the direction of the languages they developed and open-sourced, because they continue to pay the majority of the development expense.

I didn't know that. However, it doesn't change my stance. Why would I upgrade to Java 9 or 10 when I am at the mercy of Oracle's lawyers?

Why aren't you at the same mercy with Java 8?

That's not much of an answer. If anything, it seems to me like Oracle is ceding more control to the community. I could be wrong though; I don't follow this stuff as closely as some.

Sometime this winter, they will start requiring a license to use Java 9 and up.

Below is a link for the downvoters and skeptics:


Oh dear, the FUD is strong with this one. You're only required to buy a commercial license if you use Oracle's JDK beyond its free public-updates timeframe and if you want the subsequent security patches etc. from Oracle. There are also a whole bunch of $free (and $commercial) OpenJDK alternatives from Azul, Red Hat, IBM, AdoptOpenJDK, etc.

What does FUD mean in this context?

edit: I looked it up.

Your comment is obviously correct and valid. However, my stance is unchanged. I will stick to OpenJDK 8 and avoid Oracle's influence.

Here is the rebuttal which shows that this is patently false for all actual use cases: http://blog.joda.org/2018/08/java-is-still-available-at-zero...

The first sentence from your link:

> Oracle has announced that, effective January 2019, Java SE 8 public updates will no longer be available for "Business, Commercial or Production use" without a commercial license.

Anything you can get now is still free and GPL open sourced. All they're doing is no longer publishing NEW CODE (in the form of backports and bug fixes) to the OpenJDK 8 branch. Instead they're publishing it to OracleJDK.

They're not trying to kill the GPL any more than Red Hat is trying to kill Linux.

I would tell you, but I am honestly worried about possible action from Oracle if I did.

OT: I initially mis-read this as "Jay Z's new garbage collector".

Talk about an intriguing headline.


How do you blacklist the site? I'm thinking that would be an interesting extension or browser feature: an anti-bookmark.

Isn’t Java a pay-to-play runtime now? I thought Oracle changed the licensing model after Java 8?

What they did was change the update model for paid versions of the standard Oracle JDK -- you must have a commercial license with Oracle in order to receive security and bugfix updates for copies of the Oracle JDK.

OpenJDK, on the other hand, was unaffected by this change; things like security mitigations/fixes land in OpenJDK, and make their way to other versions (like older JDK versions, or Oracle enterprise copies) from there. So, you can go download Zulu or OpenJDK binaries from your distributor (such as your Linux distribution) and you'll be fine (assuming they update promptly).

There are also feature distinctions between the Oracle JDK and OpenJDK, but they have been getting smaller over time -- AppCDS and Flight Recorder, which were Oracle-customer-only features, for example, are now in OpenJDK proper (FR will come in JDK 11), and many features such as ZGC (and originally G1, too, I think) were developed in the open and went into OpenJDK directly, right from the start.

If you're just running a bog standard Linux/BSD system with some Java software on it, you're almost certainly OpenJDK already anyway, and your distribution maintainers handle security updates for you.

ZGC is going to be in OpenJDK, the fully open source, reference implementation, of Java: http://openjdk.java.net/projects/zgc/

All of the proprietary bits in Oracle's Java implementation are also being fully open sourced: https://www.infoq.com/news/2017/10/javaone-opening

Get your JVM from Oracle, Azul, Adopt Open JDK, Red Hat.

You can either:

1. Update it every six months, for free, if you want to keep current with security updates.

2. Pay for updates on versions older than six months.

3. Not pay, not upgrade, and live with unsupported versions.

AdoptOpenJDK has indicated its intention to backport patches to older versions for several years past their end-of-life, for free.

No, Stephen Colebourne has recently written a blog post [1] on that subject.

[1] http://blog.joda.org/2018/08/java-is-still-available-at-zero...

No. The change was that you need to pay to get updates for older versions of the runtime.

No it’s still free, they (Oracle) are just not supplying the free JVMs. There are plenty of alternatives for OpenJDK builds that are FOSS.

Oracle still supplies free JVMs. The change was that to get updates for older versions of the JVM, you need to pay.

And even here, it is something that Sun used to do as well.

That is for Java SE 8; later releases are not affected. Besides, ZGC is part of OpenJDK, which is under the GPL.

http://www.oracle.com/technetwork/java/javase/eol-135779.htm... The Oracle JDK will cost money for production deployments but remain free for development, beginning with Java SE 11. OpenJDK will be interchangeable with the Oracle JDK, and Oracle will provide OpenJDK builds for free.

I'm glad they didn't buy into the refcount meme that's so inexplicably popular these days.

For people who want to fight about tracing vs ref-counting, please read this first, on how the two are dual and most collectors use both:


I did a Google search for "refcount meme" and found your comment as the 3rd result, so it's not that popular...

If reference counting were going to help, they'd probably have switched to it already by now. I wonder if the poster just is angling for an opportunity to embark on a rant about one of Swift/Objective-C/ARC, Python, Rust, C++ - or perhaps all of them?

I figured he was talking about Rust, given that it is basically a refcount system but with the added constraint that the refcount can not exceed 1.

It's the only language that is anywhere close to meme status that I know of.

FWIW, people make this analogy, but I think it’s quite misleading; a ref count system can make things live longer, but borrowing can’t.

Arc/Rcs in Rust refcount, but the language doesn't at runtime, or even really at compile time.

I did a search for "freecodyx user", and found no results, so I guess you don't exist.

In order for a meme to exist, it needs to be popular. A person doesn't need to be popular to exist.

It's like you're claiming you are famous because your mom knows who you are.

My point was that your search criteria (including the keyword “meme”) is only applicable for what we might call “internet culture memes”, which is clearly not what was meant by the post you’re replying to. Your results were irrelevant.

Could you elaborate? Why would refcount be a bad idea ? (genuine question)

Many think that refcount isn't a GC algorithm and that it is faster than GC.

Well, refcounting is the basic way of implementing GC and is listed as a GC algorithm in any serious CS book about GC algorithms, like "The Garbage Collection Handbook".

What many refer to as GC is actually the set of algorithms that fall under the umbrella of tracing GC.

Then refcounting is only faster than GC in very constrained scenarios:

- no sharing across threads, otherwise locking is required for refcount updates

- no deep datastructures, otherwise the call stack of cascading deletions will produce a stop similar to tracing GC

- implementations need to be clever about nested deletions when refcount reaches 0, otherwise a stack overflow might happen

- cyclic datastructures need programmer help to break cycles via weak dependencies or are just not allowed

This is only relevant for naive refcount implementations; there are many optimisations, which end up turning a refcounting implementation into a tracing GC in disguise.

Also just because a language uses a tracing GC algorithm, it does not prevent the existence of value types, or manual memory allocation for hot code paths, thus allowing for more productive coding, while offering the tools for memory optimisation when needed.

This is not yet the case with Java, but even here it is part of the language's roadmap to fix this.
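A toy model of two of the pitfalls listed above (cascading deletions and reference cycles); this illustrates naive refcounting only, not how any production runtime implements it:

```java
public class NaiveRc {
    static int freed = 0;

    static final class Node {
        int refs = 1;   // one owning reference at creation
        Node next;

        void retain() { refs++; }

        void release() {
            if (--refs == 0) {
                freed++;
                // Cascading deletion: recursing down a long chain can overflow
                // the stack, and freeing a deep structure all at once produces
                // a pause similar to a tracing collection.
                if (next != null) next.release();
                next = null;
            }
        }
    }

    public static void main(String[] args) {
        // A chain: releasing the head cascades through every node.
        Node a = new Node(), b = new Node(), c = new Node();
        a.next = b;           // a takes over b's initial reference
        b.next = c;           // b takes over c's initial reference
        a.release();
        System.out.println(freed); // 3

        // A cycle: x and y keep each other alive, so counts never reach zero.
        Node x = new Node(), y = new Node();
        x.next = y; y.retain();
        y.next = x; x.retain();
        x.release(); y.release(); // drop the external references
        System.out.println(freed); // still 3 -- the cycle leaked
    }
}
```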

> - no sharing across threads, otherwise locking is required for refcount updates

That is not true. Most atomic refcount implementations are lockfree (you do need synchronization though). This optimization has nothing to do with tracing. Given that you also need synchronization for tracing garbage collection (including frequent stop the world pauses in most cases, albeit brief ones), I don't think this is even really an advantage of tracing at all.

> - no deep datastructures, otherwise the call stack of cascading deletions will produce a stop similar to tracing GC

If your program has data structures that live through several collections, refcounting can be faster than tracing GC because it only traces objects once (when they die) instead of several times. Additionally, you can defer the refcount updates in ways that avoid the need to do lots of work on deallocation, and optimize for short-lived objects, to get many of the effects of generational GC. This also has little to do with tracing (except inasmuch as a reference counter with this optimization has to "trace" new objects to make them initially live, but in some sense this work to update the reference counter for the first time is just moved from object initialization time to a later point in the program). The reason it's not usually done is that it requires precise liveness information by an ambient collector, which complicates the implementation, but "exploiting liveness information" is not the same as "being a tracing GC."

> - implementations need to be clever about nested deletions when refcount reaches 0, otherwise a stack overflow might happen

This is extremely trivial to avoid (if you need to) and has nothing to do with tracing. It's really more a product of user-defined destructors than anything else, which you don't need to provide in order to implement reference counting. In fact, a lot of the supposedly inevitable slowness of reference counting compared to tracing goes away when you ban nontrivial destructors (and conversely, nontrivial destructors make tracing perform much worse).

> This is only relevant for naive refcount implementations, there are many optimisations, which endup turning a refcounting implementation into a tracing GC in disguise.

The most optimized versions of refcounting I'm aware of get their wins from things like precise knowledge about live references and update coalescing--not tracing, except for some young object optimizations as I alluded to above which are more about satisfying the generational hypothesis (they generally have a backup tracer in order to break cycles, but if you're willing to live without that they usually don't need tracing to reclaim all memory). Many optimizations people commonly associate with tracing (e.g. cache friendliness due to compaction) are in practice only relevant for young generations most of the time; for older ones both tracing and reference counted implementations tend to benefit more from a really smart allocator with intelligent memory layout and partial reclamation.

I agree that optimized versions of refcounting are difficult and the fact that you still need a backup tracer for most code discourages people from using it, as well as that tracing doesn't negate a lot of systems optimizations. But a lot of the stuff you're saying is pretty misleading: the things that make tracing efficient can mostly be applied to make reference counting efficient without "turning it into tracing," with the cost that optimized tracing and optimized reference counting both need much more invasive knowledge about the user program (and correspondingly restrict the users) compared to the less optimized versions.

>> - no sharing across threads, otherwise locking is required for refcount updates

> That is not true. Most atomic refcount implementations are lockfree (you do need synchronization though). This optimization has nothing to do with tracing. Given that you also need synchronization for tracing garbage collection (including frequent stop-the-world pauses in most cases, albeit brief ones), I don't think this is even really an advantage of tracing at all.

Not sure I agree with that. Copying references between local variables and reads/writes from the heap all require expensive atomic operations for RC when they can't be optimized away. That's a major performance problem for languages like Swift. I'm not saying that one is better than the other, but this is exactly where tracing GCs shine compared to RC.

In the case of ZGC these don't require atomic operations; you do need a read barrier for reading references from the heap, though. But do not conflate tracing GCs' read & write barriers with atomic barriers.

ZGC is interesting because it has a read barrier instead of a write barrier, yeah. But that's usually a tradeoff you make to reduce pause times, not improve throughput (IIRC it usually reduces throughput and fixing the barrier often requires an atomic write anyway, right?). For deferred / coalesced update RC the atomic overhead is amortized using (essentially) a write barrier (it only triggers at most once per object between reference counting collections, and does a similar amount of work to a traditional write barrier), and loads don't immediately require incrementing a reference count, so you end up in pretty much the same contention ballpark as a typical generational collector.

Again, the optimizations I'm describing are mostly distinct from turning RC into tracing, just applying the same sorts of optimizations we expect from production garbage collectors. The only exception is probably how in RCImmix, heavily referenced objects (4 bits are enough to precisely track something like 99.8% of objects, so "heavily referenced" refers to the other 0.2%) have their reference counts frozen so they don't pay anything until the backup trace starts. But it seems like most of the win from freezing reference counts comes from using fewer bits for the count, not avoiding the updates per se.

With refcount you need to sync every time you add/remove a reference to an object. For many (most?) workloads that's going to be quite substantially more frequent than tracing collections. Whether one or the other wins out probably depends a lot on your use case.

More generally, I think the parent comment was not aiming to say that reference counting is never appropriate, but that blindly going with reference counting because it's now "ew, GC" is misguided.

> With refcount you need to sync every time you add/remove a reference to an object. For many (most?) workloads that's going to be quite substantially more frequent than tracing collections. Whether one or the other wins out probably depends a lot on your use case.

That's not actually true of deferred reference counting with update coalescing. Instead, it defers the refcount increments and decrements to periodic intervals, much like optimized tracing implementations defer tracing to periodic intervals. This avoids reference count storms on the same objects and (especially for new objects) means much less work overall.
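A toy sketch of the coalescing mechanism described above; the API and data structures are made up purely for illustration. Mutations buffer deltas instead of touching the shared count, and a periodic flush applies only the net change per object:

```java
import java.util.HashMap;
import java.util.Map;

public class CoalescedRc {
    static final Map<Object, Integer> counts = new HashMap<>();  // the "real" refcounts
    static final Map<Object, Integer> pending = new HashMap<>(); // buffered deltas
    static int sharedCountWrites = 0; // how often a real count gets touched

    static Object alloc() {
        Object o = new Object();
        counts.put(o, 1);
        return o;
    }

    // Hot path: no write to the shared count, just a buffered delta.
    static void incRef(Object o) { pending.merge(o, 1, Integer::sum); }
    static void decRef(Object o) { pending.merge(o, -1, Integer::sum); }

    // The periodic "collection": apply net deltas, then free dead objects.
    static void flush() {
        for (Map.Entry<Object, Integer> e : pending.entrySet()) {
            if (e.getValue() == 0) continue; // fully coalesced: no write needed at all
            sharedCountWrites++;
            counts.merge(e.getKey(), e.getValue(), Integer::sum);
        }
        pending.clear();
        counts.values().removeIf(c -> c <= 0); // "free" anything that hit zero
    }

    public static void main(String[] args) {
        Object hot = alloc();
        for (int i = 0; i < 10_000; i++) { incRef(hot); decRef(hot); }
        flush(); // 20,000 buffered ops collapse to zero shared-count writes
        System.out.println(sharedCountWrites); // 0
    }
}
```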

How many languages use this approach though? This probably also removes one major advantage of RC'ed systems though: deterministic destruction (e.g. close resource as soon as last reference disappears). And suddenly you need some kind of runtime too.

> How many languages use this approach though?

As I noted, there's at least one high performance GC implementation for Java that uses it, with effectively identical performance to a state of the art tracing GC on all benchmarks considered--it's not a toy idea. I don't think the fact that most production languages don't use it means very much, given how much more effort has been spent tuning tracing compared to reference counting.

> This probably also removes one major advantage of RC'ed systems though: deterministic destruction (e.g. close resource as soon as last reference disappears). And suddenly you need some kind of runtime too.

Completely deterministic destruction in the sense that you describe is pretty much incompatible with achieving the highest collection throughput. In particular it means you always have to trace dead objects independently (and, if all cores are busy, block someone from doing useful work while you deallocate), so most young object optimizations are a nonstarter. If you can stack allocate (or equivalent, e.g. push and pop from a vector with a trivial destructor) almost all your young objects, great; otherwise, for my money, keeping latency-sensitive things in well-scoped regions is far more deterministic and effective than either RC or tracing. The number of non-memory resources in most programs is usually rather low; you can always use traditional reference counting (or linear types or whatever) just for those.

> And suddenly you need some kind of runtime too.

Yes, which I mentioned a number of times. However, I don't think it's unreasonable to afford reference counting the same conveniences that tracing has when you want to talk about performance; otherwise you're not really comparing reference counting to tracing, but naive reference counting to highly optimized tracing. The fact that tracing essentially requires a runtime is certainly a point against it in some contexts (though conservative GCs like Boehm perform surprisingly well, they are usually no match for anything with precise liveness information), but I think people are a bit too conditioned to believe that things like stack maps have to be used for tracing--especially when features like exceptions and destructors that people consider acceptable in languages with small runtimes require similar metadata.

> The most optimized versions of refcounting I'm aware of get their wins from things like precise knowledge about live references and update coalescing-

I know Python, Objective-C, and Swift all use refcounting, but just for my own reading, which language has the most optimized refcounting right now?

The most optimized version I'm aware of is the RCImmix collector for Java: http://users.cecs.anu.edu.au/~steveb/downloads/pdf/rcix-oops.... It achieved performance parity with the regular Immix collector, which was the fastest (in the sense of throughput) available for the Jikes research JVM at the time. A backup collector is required for two reasons: cycle collection, and resetting stuck counts (they gain significant performance wins from keeping only a few bits for the actual refcount, for a variety of reasons: usually only a small percentage of objects need more, and they tend to be the most heavily accessed objects, so they save a lot of time by avoiding updates to them and also gain from being able to reuse objects' existing metadata fields rather than use extra space for the refcount). It's worth noting that most recent work on tracing GCs has been on reducing pause times, not improving throughput, so it wouldn't shock me if Immix is still the state of the art there.

There may have been further developments in improving RC throughput since 2013, but I'm not familiar with them; outside of Apple, there's very little modern research into optimizing reference counting. I know Swift does many cool optimizations of its own, but I'm not sure how restricted they are by having to remain compatible with Objective-C.

This is a straw man.

Yes, of course, given the same code and allocation pattern, there are many cases where tracing GC will give you higher throughput than reference counting, particularly once you add in cycle detection. But in GC'd languages and non-GC'd languages you write code with completely different allocation patterns. In non-GC'd environments you only need to refcount the small set of allocations that are actually shared. GC'd languages usually have semantics that require tracing for every allocation.
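A small Go sketch of that difference in allocation patterns (illustrative only; the counts and types are made up):

```go
package main

import "fmt"

// With value types, idiomatic code makes far fewer tracked allocations
// than code where everything is a heap reference.
type point struct{ x, y int }

func main() {
	// One backing allocation; elements live inline, so there is nothing
	// to count or trace per element.
	inline := make([]point, 1000)

	// One allocation per element plus the slice itself; in an RC'd
	// language, each *point would carry its own count.
	boxed := make([]*point, 1000)
	for i := range boxed {
		boxed[i] = &point{}
	}
	fmt.Println(len(inline), len(boxed)) // 1000 1000
}
```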

Plenty of GC languages have value types and support for manual memory management.

Mesa/Cedar, Active Oberon, Modula-3, D, System C#, Go, C#

It's worth noting that Go's value types are not actually memory safe, however, since it allows nontrivial concurrent mutations to shared multiword objects (as described in https://blog.stalkr.net/2015/04/golang-data-races-to-break-m...). Unless that got fixed recently somehow? I believe in most of the other languages you mention, value types are restricted in various ways to avoid that problem (for instance, mutations are only allowed to value types on the stack that don't escape, or multiword value types can only be updated in ways that preserve data structure validity even if updates "tear", or multiword value types are immutable).

With the exception of Go and C#, there are no limitations on value types.

All those languages are GC enabled systems programming languages.

Regarding Go the memory safety in multithreaded code is orthogonal to having support for value types or not.

It's not orthogonal at all. Nontrivial mutations to multi-word sized values are inherently racy operations that can lead to UB if you don't have restrictions on how value types can be used. D isn't memory safe; Active Oberon employs mandatory object locking on writes (which is a rather heavy runtime solution); Cedar values are either immutable, or always valid with respect to word sized mutations (as far as I can tell, unions, for example, can only exist behind a pointer indirection). Modula-3 objects are not, as far as I can tell, value types (i.e. they must be heap-allocated and mutation creates a new record), and those are the only types it has where a "tearing" mutation could invalidate the runtime representation.

So, I disagree: all of the languages you mentioned either are memory unsafe when you use multiword value types concurrently, have one of the restrictions I mentioned, or (in the case of Active Oberon) conservatively lock on all writes. These value types are not "unrestricted" compared to primitives, all of which can be mutated in-place without allocating or locking, or to heap objects, which can be updated atomically even for interesting values at the cost of an allocation. Moreover, the issue is intimately tied to GC, or more accurately to the programmer not having to explicitly keep track of whether objects are uniquely owned or not (which is generally the function of a GC). As a result, if garbage collected objects can hold references to nontrivial value types, there is almost no way to avoid this problem.

I don't get why you are jumping into memory safety in multi-threaded environments.

We are discussing GC enabled systems programming languages with support for unsafe operations, not Rust's idea of thread safety.

D is memory safe in the context of @safe code, of course @system code is unsafe.

Active Oberon and Modula-3 also have record types, strings and arrays that can be stack allocated, declared on the global data memory segment or just manually allocated in unsafe packages.

In both cases I can also declare untraced pointers for memory blocks in unsafe packages, that the GC will gladly ignore.

Lack of thread safety leads to undefined behavior when you have nontrivial mutations to value types with tearing. That has nothing to do with some abstract notion of thread safety, nor does it have anything to do with Rust. It's a very straightforward consequence of the failure of (for example) things like bounds checking, or interpreting an integer as a pointer.

Record types with fields that don't depend on one another, strings and (fixed-size) arrays only support "uninteresting" mutations that remain valid even in the presence of atomic tearing. They are restricted compared to the general value types that you can have in a language like C++ (for example, sum types, unions, "fat objects" like Go's interfaces, or vectors with dynamic lengths), because for the latter having a write tear can lead to undefined behavior (for instance, overwriting a value vector can lead to the pointed-to vector part temporarily having a different length than the length field would indicate, which can easily lead to a buffer overflow).
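To illustrate the tearing hazard, here is a simulated tear in single-threaded Go code (not an actual data race; `fakeSlice` is a made-up stand-in for a real slice header):

```go
package main

import "fmt"

// Why a torn write to a multiword {pointer, length} value breaks memory
// safety: a concurrent reader can observe the length from one value
// combined with the data pointer from another.
type fakeSlice struct {
	data *[1]int // pretend backing storage
	n    int     // claimed length
}

func main() {
	short := fakeSlice{data: &[1]int{42}, n: 1}
	long := fakeSlice{data: nil /* imagine a 1000-element array */, n: 1000}

	// Simulate a tear: the reader sees short's pointer with long's length.
	torn := fakeSlice{data: short.data, n: long.n}

	// A bounds check against torn.n now passes for indices far past the
	// 1-element backing array. In an unchecked language that is a buffer
	// overflow; with real Go slices it defeats the bounds check itself.
	fmt.Println(torn.n > 500, len(torn.data)) // true 1
}
```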

The memory safe fragment of D has similar restrictions on its value types to the above. For example, its value arrays must have a size known at compile time.

Of course you can support such features in the unsafe fragment of a language. However, this is a pretty significant deterrent to actually using the feature and idiomatic code will avoid it most of the time.

If the manually allocated memory may contain references to managed objects then it still needs to be traced by the GC.

That or you start pinning managed objects or pull everything related into manual world. Both kinda suck.

Manually allocated memory in some of those languages is only allowed in the context of unsafe packages.

In any case, the point is that you should only do that for the code paths that actually matter, after profiling the application and not everywhere.

Be productive, make use of the GC, naturally taking into consideration the algorithms and data structures being used.

If that is still not enough to meet the application's SLAs, then manual allocation comes into play.

Sure, even Java lets you manually allocate memory via JNI, but negligible amounts of code written in those languages actually uses manual management. Reference counting is uncompetitive in those languages because idiomatic Java/C#/Go make lots of allocations and pass around lots of pointers with no lifetime semantics.

I can also code in C with malloc()/free() everywhere.

The features are available; it is up to the programmers to use them.

But (a) that's not idiomatic (b) malloc/free aren't refcounted.

I don't see a straw man in the GP comment. He was clearly comparing tracing GC vs refcounting, while you're conflating memory management with value semantics. Specifically, you're assigning the benefits of value semantics to the absence of a GC and vice versa. While GC'd languages more frequently omit value semantics, they're orthogonal and there are languages (e.g., Go) that profit from both a GC and value semantics.

> GC'd languages and non-GC'd languages you write code with completely different allocation patterns.

[citation needed]

> - no sharing across threads, otherwise locking is required for refcount updates

If you are sharing across threads you will need a synchronization mechanism, whether you have reference counting or not.

So it's not exactly fair to assign this cost to reference counting.

General reference counting does make adding or removing a reference a read/write operation, though of course non-naive implementations don't do this at every turn. Not to mention tracing GC has a reference cost as well.
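A naive thread-safe refcount sketch in Go makes that cost visible (the names and layout are hypothetical; real implementations defer and coalesce these updates):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Naive thread-safe refcounting: every retain/release is an atomic
// read-modify-write, turning mere copies of a reference into writes to
// shared memory. A tracing GC's mutator pays none of this per access.
type counted struct {
	refs int64
	val  string
}

func retain(c *counted) { atomic.AddInt64(&c.refs, 1) }

// release reports whether the last reference was just dropped.
func release(c *counted) bool {
	return atomic.AddInt64(&c.refs, -1) == 0
}

func main() {
	c := &counted{refs: 1, val: "shared"}
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		retain(c) // each thread that shares c pays an atomic RMW
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = c.val
			release(c)
		}()
	}
	wg.Wait()
	fmt.Println(release(c)) // true: the original reference was the last
}
```

Those contended atomic updates on hot objects are exactly what schemes like deferred and coalesced RC try to eliminate.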

Tracing GCs don't need to synchronize at each access point.

One can optimize refcounting with deferred reference counting algorithms, but then that is already the skeleton implementation of a tracing GC.

> Then refcounting is only faster than GC in very constrained scenarios:

Throughput is not the only concern when designing a memory management strategy though.

> Many think that refcount isn't a GC algorithm and that it is faster than GC.

As usual, people are confused by improper use of terminology. It's like async in Python: three quarters of the complexity is in the abuse of terminology.

From the hacker folklore (AI koans, available at http://catb.org/jargon/html/koans.html):

Moon instructs a student

One day a student came to Moon and said: “I understand how to make a better garbage collector. We must keep a reference count of the pointers to each cons.”

Moon patiently told the student the following story:

“One day a student came to Moon and said: ‘I understand how to make a better garbage collector...

In addition to what others have said, the choice of memory management schemes has big knock-on effects. It influences how much runtime metadata you need to retain (e.g. stack maps), it has lots of implications for the FFI, and may constrain various tricks like unions, tagged pointers, NaN boxing, etc. So narrowly comparing schemes in terms of just allocation throughput won't paint the whole picture.

Load-increment-store is slow. Load-decrement-branch-if-zero is slow. And you still need a tracing collector for cycles. So it's slow and still nondeterministic.

Okay, what's better, from a generalist standpoint? I have a decent but very high-level sketched-out understanding of refcounting, but I know nothing else about what's out there, and would be curious to find out.

Particularly what would work well on x86 and ARM.

An odd and probably completely wrongheaded idea just came to mind: add collectible objects to a lock-free queue in one thread, collect them in another. I wonder if you could do the mark-and-sweep stage across threads too...?

Better depends on tradeoffs. Refcounting misses a key feature that ZGC, C4, Shenandoah and other Java GCs have (but that Go, for example, does not): compaction. Secondly, when refcounted objects are accessed concurrently, the refcounting needs to be thread safe, i.e. use atomics, and this has significant performance overheads.

Pragmatically, in high-throughput situations modern GCs perform better than simple refcounting. The new thing in Java land is that the newest GCs have vastly reduced stop-the-world times (trending towards OS jitter levels), while preserving compaction and without too bad a throughput hit (even with the misfeature of finalizers).

My personal experience with Shenandoah shows that this changes the way we think about heap settings for Java: just set -Xmx to 60% of machine RAM and let idle collections make sure we never use more than 4% on normal days.

For me the biggest letdown of RC is dirtying an otherwise pristine, constant page of memory with unneeded updates: every use of a variable writes to it, because the refcount has to be updated. You are also forced to use two-word values instead of one (many schemes need even more), with cache thrashing and complicated atomics. That's why a GC, especially a compacting one, is usually so much faster than refcounting.

Start by reaching for "The Garbage Collection Handbook", which I mention in another thread.

"It is time to unmask the computing community as a Secret Society for the Creation and Preservation of Artificial Complexity." ~Dijkstra

It's very impressive work, no doubt. And as long as no one asks me to debug anything running on top, I don't have any issues with it.

At least with straight reference counting I know what's going on under the hood; that's worth a lot. Sharing pointers across threads is tricky business, and I can't really see what that has to do with anything. And it's quite likely that it will run faster, simply because it's much less complex.

I've always thought the Monty Python Sketch about The Royal Society for Putting Things on top of Other Things was a perfect metaphor for modern "full stack" software development.


And the Ministry of Silly Walks is the perfect metaphor for user interface design.


A full stack is one push from overflow, I feel that sums it up pretty well :)

Silly walks is a masterpiece, and people think they're joking...

A lot of the early Java press about their use of GC involved decrying the state of other interpreted languages that were still stuck on reference counting, and the problem of circular references.
