
Java’s new garbage collector promises low pause times on multi-terabyte heaps - ScottWRobinson
https://www.opsian.com/blog/javas-new-zgc-is-very-exciting/
======
nicktelford
> ZGC’s SPECjbb 2015 throughput numbers are roughly comparable with the
> Parallel GC (which optimises for throughput) but with an average pause time
> of 1ms and a max of 4ms. This is in contrast to G1 and Parallel, which had
> average pause times in excess of 200ms.

Not bad! Looking forward to seeing how this performs with a diverse range of
workloads as it matures.

~~~
earenndil
I wonder if that would make java more suitable for game development, since
pauses need to be <8ms, ideally.

~~~
nicktelford
I think a lack of value types also hurts Java significantly here. Though I
believe they're also on the roadmap.

~~~
pkulak
With generational GC, does this even matter? Generation 0 is pretty much a
stack as far as performance and operation go.

~~~
pdpi
Problem is memory locality, or lack thereof. Having to chase pointers around
and not having things laid out in sequence doesn't play well with the CPU
cache.
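
The pointer-chasing point can be made concrete with a toy comparison (a hypothetical sketch, not from the article): in today's Java an array of objects is an array of references, each dereferenced per element, whereas a flat primitive layout (roughly what value types would enable) keeps the data contiguous and cache friendly.

```java
// Illustrative only: contrasts a reference-array layout with a flat primitive
// layout. The Point class and helper names are invented for this sketch.
public class Locality {
    static final class Point {
        final double x, y;
        Point(double x, double y) { this.x = x; this.y = y; }
    }

    // Pointer-chasing layout: Point[] holds references to separately allocated objects.
    static double sumBoxed(Point[] pts) {
        double s = 0;
        for (Point p : pts) s += p.x + p.y;  // each access follows a pointer
        return s;
    }

    // Flat layout: interleaved primitives, contiguous in memory.
    static double sumFlat(double[] xy) {
        double s = 0;
        for (double v : xy) s += v;          // sequential reads, cache friendly
        return s;
    }

    public static void main(String[] args) {
        int n = 4;
        Point[] pts = new Point[n];
        double[] xy = new double[2 * n];
        for (int i = 0; i < n; i++) {
            pts[i] = new Point(i, i + 1);
            xy[2 * i] = i;
            xy[2 * i + 1] = i + 1;
        }
        System.out.println(sumBoxed(pts) == sumFlat(xy)); // same result, different layout
    }
}
```

Both loops compute the same sum; the difference is purely in memory layout, which is what the CPU cache cares about.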

~~~
wrmsr
I'm not arguing against value types as I definitely want those too but it's
not that simple nowadays - a moving garbage collector (like all modern server
jvm's have) can do things for locality that non-moving systems (gc'd or not)
can't - [https://shipilev.net/jvm-anatomy-park/11-moving-gc-
locality/](https://shipilev.net/jvm-anatomy-park/11-moving-gc-locality/)

~~~
ghusbands
That only applies in a few specific cases. Basically, a linked list or an
array of objects containing no further references. Otherwise, those internal
references get immediately followed and significantly dilute the locality.

There have been garbage collectors designed to improve locality of predictable
accesses, but they are not the norm.

------
obl
The memory mapping trick they use on x86 to avoid masking only works up to the
maximum 48 bits of addressable virtual memory, so there are far fewer than 22
free bits in that case.

It's also not quite free since it takes up TLB space.

~~~
DSMan195276
Agreed - I'd go farther to say it only works for addresses in the userspace
half, so on Linux you lose an extra bit and only get 47 (I don't know how the
mappings are set up for Windows/OSX, sorry). It also might be worth pointing
out that you now need 16x the page mappings (since every mapping needs 15
duplicates for the possible flag states) - I don't know how Java does its
memory management, but if it does lots of small mappings (which I'm guessing it
does not) then that could be a concern. And like you mentioned with the TLB,
it takes up space and also slows things down a bit: if the flags change, the
entry in the TLB will no longer be used (since the address is now different)
and the MMU will have to walk the page table again.

In practice, I would wager the performance concerns aren't huge, but that's
mostly just a guess. I'd personally be interested in seeing a comparison to
masking to see just how much slower masking would be, but obviously it's not
like they can just flip a switch to use masking.

~~~
pitaj
> It also might be worth pointing out that you now need 16x the page mappings
> (Since every mapping needs 15 duplicates for the possible flag states)

Nope. They went into this in the article:

> Since by design only one of remap, mark0 and mark1 can be 1 at any point in
> time, it’s possible to do this with three mappings. There’s a nice
> diagram[1] in the ZGC source for this.

[1]:
[http://hg.openjdk.java.net/zgc/zgc/file/59c07aef65ac/src/hot...](http://hg.openjdk.java.net/zgc/zgc/file/59c07aef65ac/src/hotspot/os_cpu/linux_x86/zGlobals_linux_x86.hpp#l39)

That might not be totally current, as it doesn't cover the _finalizable_ flag,
but if it works the same, that would only be four mappings. If it works
differently, then it would be a maximum of 6 mappings. Not 16.

~~~
shaftway
And it looks like you only need 3 mappings for the _entire_ space. It's not
like you need one per object in memory. Unless I misunderstood the gist of the
article.

------
Tarean
There is also an interesting trick added for Shenandoah to make thread
synchronization time bounded.

Basically, you don't want to check for GC in each (allocation-free) loop
iteration. On top of the overhead, it makes optimization harder. But if you
have 32+ threads, then some are probably stuck in loops for a good while,
which means spinlocks for everyone waiting on the stop-the-world GC.

The trick to avoid this is to add an inner loop that runs a fixed number of
iterations, with a GC check between inner loops:
[https://bugs.openjdk.java.net/browse/JDK-8186027](https://bugs.openjdk.java.net/browse/JDK-8186027)
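
A user-level sketch of the idea (the real transformation happens inside the JIT, not in user code; chunk size and the poll flag here are invented for illustration):

```java
// Loop "strip mining" sketch: instead of polling for a safepoint every
// iteration, run a fixed-size inner chunk and poll once per chunk. The CHUNK
// value and the volatile flag stand in for JIT internals (see JDK-8186027).
public class StripMining {
    static volatile boolean safepointRequested = false; // stand-in for the real poll page
    static long polls = 0;

    static long sum(long n) {
        final int CHUNK = 1000;                // arbitrary chunk size for the sketch
        long s = 0;
        for (long i = 0; i < n; ) {
            long end = Math.min(i + CHUNK, n);
            for (; i < end; i++) {             // inner loop: no poll, freely optimizable
                s += i;
            }
            polls++;                           // one poll per chunk, not per iteration
            if (safepointRequested) { /* park this thread for the GC */ }
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(sum(10_000)); // 49995000
        System.out.println(polls);       // 10 polls instead of 10000
    }
}
```

This bounds how long any thread can go without reaching a safepoint, without paying a check on every iteration.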

------
exabrial
> Limited to 4tb heaps

I can't believe the world we're living in, this is incredible. My first
computer had 512k of RAM.

~~~
dboreham
512 bytes here.

~~~
Cyph0n
What kind of PC had 512 bytes of RAM?

~~~
dboreham
KIM-1, MK14, and other similar systems I built myself. I remember (a couple of
years later), at age 17, getting a summer job at a POS terminal company that
had just taken delivery of the first 64kbit DRAMs, and being super excited
when my boss allowed me to take some of my wages in RAM chips at their
high-volume price.

------
AboutTheWhisles
Who are these people designing systems that have terabytes of heap under a
single GC, but still want/expect/need low pause times?

I would think if there were processes that needed consistent latency you would
isolate them.

~~~
Aweorih
I think I read somewhere that the heaps of e-commerce website applications can
be very large because of all the loaded strings. I think that was around the
Java update (9?) where compact strings were added to reduce string memory
usage.

------
ryanobjc
An interesting fact is that the HotSpot team went back on their 20-year
conviction that load barriers reduce performance too much. I guess
improvements in CPU and VM designs have changed their minds.

This is a big deal, and a major departure.

I'm hoping this will improve things; this new architecture more closely
resembles Azul's C4. Java with large heaps and non-conforming object
allocation patterns (medium-aged objects are the most troublesome) sucks
really badly.

~~~
repolfx
They do reduce performance a lot. I think what changed is there are now
classes of customers who are willing to take a huge performance hit for
reliably low pause times or reliably huge heaps. Whereas in the past heaps
were smaller and perhaps latency tails were something people cared about less.

------
cridenour
All I could think while reading: is this going to affect my Minecraft server
parameters?

------
foobarbazetc
We use Shenandoah everywhere. It’s amazingly great.

~~~
kjeetgill
Is that checked into OpenJDK for the 11 release? Does it have any advantages
to ZGC?

~~~
aseipp
No, it's still a separate fork for Java 8/10/11 as of right now, although some
distributions like Red Hat/Fedora do ship with it enabled as a "Technology
Preview" on all their enabled JDKs, so you can just get it out of the box. I
think Gentoo has it as well.

(If it also means anything, I can confirm that enabling Shenandoah is as
simple as taking the corresponding OpenJDK code, replacing the `hotspot`
subcomponent with the (compatible) Shenandoah forked hotspot component, and
then just building everything normally.)

------
lamby
> Pointer colouring

(Is this the same as pointer tagging?)

~~~
rlmw
I would say that pointer colouring is a type of pointer tagging. AFAIK the
colouring in ZGC represents different states. So the memory page mappings rely
on the pointer having one colour (1 of the 4 bits set) whereas pointer tagging
is the term for the general concept of encoding information in pointers.

------
openasocket
I'm not entirely sure what the point of the multi-mapping is. It says that it
means you don't have to do any pointer bitmasking to dereference, but you
already have load barriers, which have to at least check to see what phase you
are in and possibly do work every time you dereference a pointer, so how much
extra latency would doing the bitmask really add? And the problem with the
multi-mapping is you will be resolving more virtual addresses, increasing the
working set for the TLB. Seems like there could be increased TLB thrashing.
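
To make the trade-off concrete, here is a toy sketch of what a colored-pointer load barrier conceptually does. The bit positions, mask names, and "healing" step are invented for illustration (ZGC's real layout lives in its sources; this is not it). The point is that the barrier only tests bits against a "bad mask" that changes each phase; with multi-mapping, the colored address itself is a valid mapping, so no mask is applied on the actual dereference.

```java
// Hypothetical colored-pointer load barrier. Bit positions are assumptions,
// not ZGC's actual layout.
public class ColoredPointers {
    static final int SHIFT = 42;                      // assumed heap-offset width (4 TB)
    static final long ADDRESS_MASK = (1L << SHIFT) - 1;
    static final long MARKED0  = 1L << SHIFT;         // invented color bits
    static final long MARKED1  = 1L << (SHIFT + 1);
    static final long REMAPPED = 1L << (SHIFT + 2);

    static long slowPathCalls = 0;

    // badMask holds the colors that are *wrong* for the current GC phase.
    static long loadBarrier(long ref, long badMask) {
        if ((ref & badMask) != 0) {
            slowPathCalls++;
            // slow path: mark/relocate/remap, then fix ("heal") the color
            ref = (ref & ADDRESS_MASK) | REMAPPED;
        }
        return ref;  // with multi-mapping, this colored value is directly dereferenceable
    }

    public static void main(String[] args) {
        long offset = 0x1234L;
        long ref = offset | MARKED0;              // colored by an earlier phase
        long badMask = MARKED0 | MARKED1;         // this phase: only REMAPPED is good
        long healed = loadBarrier(ref, badMask);
        System.out.println(healed == (offset | REMAPPED));          // pointer was healed
        System.out.println(loadBarrier(healed, badMask) == healed); // fast path now
        System.out.println(slowPathCalls);
    }
}
```

With bitmasking instead of multi-mapping, every dereference would need the `& ADDRESS_MASK` step; the question raised above is whether that extra ALU work matters next to the barrier check itself.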

------
schmichael
> today AWS will happily rent you an x1e.32xlarge with 128 vCPUs and an
> incredible 3,904GB of ram.

> ...

> ZGC restricts itself to 4Tb heaps

Isn't a 4TB heap limit short-sighted for a GC intended to last over a decade?

~~~
phoe-krk
4TB uses 42 bits of address space. It leaves you with 22 bits, out of which
ZGC uses 4.

If anything, borrow a few more bits, and you should easily be able to address
petabytes.
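
The arithmetic behind this is easy to check (the bit names in the comment are ZGC's four metadata flags):

```java
// 4 TB = 2^42 bytes, so a heap offset needs 42 bits of a 64-bit pointer,
// leaving 22 free, of which ZGC's metadata uses 4.
public class BitBudget {
    public static void main(String[] args) {
        long fourTb = 4L * 1024 * 1024 * 1024 * 1024;                 // 2^42
        int addressBits = 64 - Long.numberOfLeadingZeros(fourTb - 1); // 42
        int freeBits = 64 - addressBits;                              // 22
        int metadataBits = 4; // marked0, marked1, remapped, finalizable
        System.out.println(addressBits + " " + freeBits + " " + (freeBits - metadataBits));
    }
}
```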

~~~
chrisseaton
> 4TB uses 42 bits of address space.

Can't you compress that into less than 42 bits?

~~~
negativegate
No, compression only works when you don't need to represent every possibility,
or can use fewer bytes for some inputs and more for others.

[https://en.wikipedia.org/wiki/Pigeonhole_principle#Uses_and_...](https://en.wikipedia.org/wiki/Pigeonhole_principle#Uses_and_applications)

~~~
chrisseaton
Why do you need to represent every possibility? You obviously aren't going to
squeeze 2^42 objects into 2^42 bytes are you? They each take more than a byte.
You don't need to address bytes individually.

There's more holes than pigeons here.

~~~
AaronFriel
Suppose we want to store 2^(48-n) 2^n byte objects in your hypothetical <42
bits of address space. Say, 2^47 two byte objects. (Alternatively, more than
2^(48-n) 2^n byte objects in 42 bits of address space.)

How do I compute the hash of every object, since by definition I can't address
every byte?

~~~
chrisseaton
You can still read any byte you want by decompressing the pointer and using it
as normal.

And why would you want to read raw bytes from an object to hash it? What
requires this in Java?

------
igravious
4Tb ought to be enough for anybody.

~~~
jopsen
I can't imagine many use cases where you wouldn't want to use multiple
processes before you hit 4TB. But yeah, I'm sure there are some...

~~~
M_Bakhtiari
Why would you want to introduce the extra complexity of IPC just because your
GC sucks and can't handle the heap size that you need?

Besides, processes are a ham-fisted way to get around the PDP-11's limited
address space and shouldn't be a thing in 2018 anyway.

------
stuff4ben
Interested to know what workloads are using multi-TB heap sizes? Largest I've
used has only been in the 48-64GB size and even then it seemed like a waste.

~~~
kjeetgill
1.2 TB Xmx (Java heap) here, reporting in from LinkedIn not fintech.

Basically, graphs are hard to shard ... what if we didn't bother? You can
query an in memory graph pretty quickly out into the 2nd or 3rd degree.

Even G1 (untuned) only pauses for 250ms once every few minutes. It's not
perfect, but good enough to ignore.

~~~
hermitdev
250ms is atrocious for finance. Need a small handful of microseconds to not be
noticed.

~~~
kjeetgill
Of course. But this is with zero tuning in place. This is just what the
default G1 gets you on that heap size. I'm sure with 2 or 3 flags we could
bring that down easily. Given how infrequently we're seeing them (one every
few minutes) it's fine for what we're doing.

~~~
repolfx
Well, G1's default pause time goal is 200ms, so it won't even try to do better
than that without 'tuning' (telling it to pause for less time).

------
silverlake
> Some platforms like SPARC have built-in hardware support for pointer masking
> so it’s not an issue but for x86, the ZGC team use a neat multi-mapping
> trick.

Does anyone know what this instruction on SPARC is?

------
quotemstr
"For sale: baby shoes, never worn" has nothing on "low pause times on multi-
terabyte heaps"

------
amelius
Nice. But unfortunately, it's owned by Oracle. Use it, and you'll be
constantly worrying what their lawyers are up to next.

~~~
kevinherron
ZGC is part of OpenJDK and licensed under GPL.

~~~
amelius
Doesn't matter. Our legal system is such that Oracle can still burn you down,
even if you are right.

~~~
geezerjay
> Our legal system is such that Oracle can still burn you down, even if you
> are right.

Do you happen to know of any real-world case that's similar to the scenario
you've described?

~~~
Sir_Cmpwn
Google v Oracle comes to mind.

~~~
lima
Didn't Google end up importing OpenJDK code?

~~~
pjmlp
Partially. You cannot pick a random jar from Maven Central and be confident
that it will work on Android.

They only support part of the standard library, the OpenJDK parts have been
cherry-picked and not all are made available on the same Android version, and
then there are the bytecodes that were introduced since Java 7.

And for Java language support, you need the Android Oreo version of ART, as
not all language features get desugared into older versions.

There is some AOSP activity related to Java 9, but so far Google has been
silent on whether there will be any further improvements beyond Java 8.

------
redsymbol
OT: I initially mis-read this as "Jay Z's new garbage collector".

Talk about an intriguing headline.

------
waterbear
Isn’t Java a pay-to-play runtime now? I thought Oracle changed the licensing
model after Java 8?

~~~
koolba
No, it’s still free; they (Oracle) are just not supplying the free JVMs. There
are plenty of alternative OpenJDK builds that are FOSS.

~~~
antonvs
Oracle still supplies free JVMs. The change was that to get updates for older
versions of the JVM, you need to pay.

~~~
pjmlp
And even here, it is something that Sun used to do as well.

------
M_Bakhtiari
I'm glad they didn't buy into the refcount meme that's so inexplicably popular
these days.

~~~
ahamez
Could you elaborate? Why would refcount be a bad idea ? (genuine question)

~~~
pjmlp
Many think that refcount isn't a GC algorithm and that it is faster than GC.

Well, refcounting is the basic way of implementing GC and is listed as a GC
algorithm in any serious CS book on the subject, like "The Garbage
Collection Handbook".

What many refer to as GC are actually the algorithms that fall under the
umbrella of tracing GC.

Then refcounting is only faster than GC in very constrained scenarios:

\- no sharing across threads, otherwise locking is required for refcount
updates

\- no deep data structures, otherwise the call stack of cascading deletions
will produce a stop similar to tracing GC

\- implementations need to be clever about nested deletions when refcount
reaches 0, otherwise a stack overflow might happen

\- cyclic data structures need programmer help to break cycles via weak
dependencies, or are just not allowed

This is only relevant for naive refcount implementations; there are many
optimisations, which end up turning a refcounting implementation into a
tracing GC in disguise.

Also just because a language uses a tracing GC algorithm, it does not prevent
the existence of value types, or manual memory allocation for hot code paths,
thus allowing for more productive coding, while offering the tools for memory
optimisation when needed.

This is not yet the case with Java, but even here it is part of the language's
roadmap to fix this.
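
Two items from the list above (cascading deletion, and the stack-overflow risk when counts hit zero recursively) can be shown with a toy manual-refcount node. This is purely illustrative; the class and counters are invented, and Java itself of course does none of this:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy manual refcounting: when a count hits zero, the deletion cascades to
// everything the object held. Using an explicit worklist instead of recursion
// keeps a long chain from overflowing the call stack.
public class Refcount {
    static int liveObjects = 0;

    static final class Node {
        int refs = 1;                       // starts owned by its creator
        final List<Node> children = new ArrayList<>();
        Node() { liveObjects++; }
    }

    static void addChild(Node parent, Node child) {
        child.refs++;                       // parent now holds a reference too
        parent.children.add(child);
    }

    static void release(Node node) {
        Deque<Node> worklist = new ArrayDeque<>();
        worklist.push(node);
        while (!worklist.isEmpty()) {       // iterative: bounded stack usage
            Node n = worklist.pop();
            if (--n.refs == 0) {
                liveObjects--;
                worklist.addAll(n.children); // cascade: release everything it held
            }
        }
    }

    public static void main(String[] args) {
        // A chain of 100_000 nodes; recursive deletion would risk stack overflow.
        Node head = new Node();
        Node cur = head;
        for (int i = 0; i < 100_000; i++) {
            Node next = new Node();
            addChild(cur, next);
            release(next);                  // drop creator's ref; parent keeps it alive
            cur = next;
        }
        System.out.println(liveObjects);    // 100001
        release(head);                      // one release cascades down the whole chain
        System.out.println(liveObjects);    // 0
    }
}
```

The single `release(head)` doing 100,001 deletions in one go is exactly the "stop similar to tracing GC" the list warns about.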

~~~
Jweb_Guru
> \- no sharing across threads, otherwise locking is required for refcount
> updates

That is not true. Most atomic refcount implementations are lockfree (you do
need synchronization though). This optimization has nothing to do with
tracing. Given that you also need synchronization for tracing garbage
collection (including frequent stop the world pauses in most cases, albeit
brief ones), I don't think this is even really an advantage of tracing at all.

> \- no deep datastructures, otherwise the call stack of cascading deletions
> will produce a stop similar to tracing GC

If your program has data structures that live through several collections,
refcounting can be faster than tracing GC because it only traces objects once
(when they die) instead of several times. Additionally, you can defer the
refcount updates in ways that avoid the need to do lots of work on
deallocation, and optimize for short-lived objects, to get many of the effects
of generational GC. This also has little to do with tracing (except inasmuch
as a reference counter with this optimization has to "trace" new objects to
make them initially live, but in some sense this work to update the reference
counter for the first time is just moved from object initialization time to a
later point in the program). The reason it's not usually done is that it
requires precise liveness information by an ambient collector, which
complicates the implementation, but "exploiting liveness information" is not
the same as "being a tracing GC."

> \- implementations need to be clever about nested deletions when refcount
> reaches 0, otherwise a stack overflow might happen

This is extremely trivial to avoid (if you need to) and has nothing to do with
tracing. It's really more a product of user-defined destructors than anything
else, which you don't need to provide in order to implement reference
counting. In fact, a lot of the supposedly inevitable slowness of reference
counting compared to tracing goes away when you ban nontrivial destructors
(and conversely, nontrivial destructors make tracing perform much worse).

> This is only relevant for naive refcount implementations, there are many
> optimisations, which endup turning a refcounting implementation into a
> tracing GC in disguise.

The most optimized versions of refcounting I'm aware of get their wins from
things like precise knowledge about live references and update coalescing--not
tracing, except for some young object optimizations as I alluded to above
which are more about satisfying the generational hypothesis (they generally
have a backup tracer in order to break cycles, but if you're willing to live
without that they usually don't need tracing to reclaim all memory). Many
optimizations people commonly associate with tracing (e.g. cache friendliness
due to compaction) are in practice only relevant for young generations most of
the time; for older ones both tracing and reference counted implementations
tend to benefit more from a really smart allocator with intelligent memory
layout and partial reclamation.

I agree that optimized versions of refcounting are difficult and the fact that
you still need a backup tracer for most code discourages people from using it,
as well as that tracing doesn't negate a lot of systems optimizations. But a
lot of the stuff you're saying is pretty misleading: the things that make
tracing efficient can mostly be applied to make reference counting efficient
without "turning it into tracing," with the cost that optimized tracing and
optimized reference counting both need much more invasive knowledge about the
user program (and correspondingly restrict the users) compared to the less
optimized versions.

~~~
readittwice
> > \- no sharing across threads, otherwise locking is required for refcount
> > updates

> That is not true. Most atomic refcount implementations are lockfree (you do
> need synchronization though). This optimization has nothing to do with
> tracing. Given that you also need synchronization for tracing garbage
> collection (including frequent stop the world pauses in most cases, albeit
> brief ones), I don't think this is even really an advantage of tracing at all.

Not sure I agree with that. Copying references between local variables, and
reads/writes from the heap, all require expensive atomic operations for RC
when they can't be optimized away. That's a major performance problem for
languages like Swift. I'm not saying that one is better than the other, but
this is exactly where tracing GCs shine compared to RC.

In the case of ZGC these don't require atomic operations; you do need a read
barrier when reading references from the heap, though. But do not conflate a
tracing GC's read & write barriers with atomic barriers.

~~~
Jweb_Guru
ZGC is interesting because it has a read barrier instead of a write barrier,
yeah. But that's usually a tradeoff you make to reduce pause times, not
improve throughput (IIRC it usually _reduces_ throughput and fixing the
barrier often requires an atomic write anyway, right?). For deferred /
coalesced update RC the atomic overhead is amortized using (essentially) a
write barrier (it only triggers at most once per object between reference
counting collections, and does a similar amount of work to a traditional write
barrier), and loads don't immediately require incrementing a reference count,
so you end up in pretty much the same contention ballpark as a typical
generational collector.

Again, the optimizations I'm describing are mostly distinct from turning RC
into tracing, just applying the same sorts of optimizations we expect from
production garbage collectors. The only exception is probably how in RCImmix,
heavily referenced objects (4 bits are enough to precisely track something
like 99.8% of objects, so "heavily referenced" refers to the other 0.2%) have
their reference counts frozen so they don't pay anything until the backup
trace starts. But it seems like most of the win from freezing reference counts
comes from using fewer bits for the count, not avoiding the updates per se.
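
The frozen-count idea is simple enough to sketch (field width and names invented here; this is not RCImmix's actual code): with a few bits per object, a count that saturates "sticks" and that object is simply left for the backup trace to reclaim.

```java
// Sketch of a saturating ("sticky") small reference count: 4 bits cover the
// vast majority of objects; once a count hits the max it is frozen and the
// object is handed off to the backup tracing collection.
public class StickyCount {
    static final int MAX = (1 << 4) - 1;             // 4-bit count saturates at 15

    static int increment(int count) {
        return count == MAX ? MAX : count + 1;       // stuck counts never change
    }

    static int decrement(int count) {
        return count == MAX ? MAX : count - 1;       // a stuck count can't be trusted either
    }

    public static void main(String[] args) {
        int c = 0;
        for (int i = 0; i < 20; i++) c = increment(c); // 20 refs: count saturates
        System.out.println(c);                         // 15: stuck
        for (int i = 0; i < 20; i++) c = decrement(c);
        System.out.println(c);                         // still 15: only the trace frees it
    }
}
```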

