Go GC: Prioritizing low latency and simplicity (golang.org)
214 points by dbaupp on Sept 1, 2015 | 137 comments

This is very similar to Java's CMS collector[1]. Unfortunately, 10ms is still far too long for some applications. I have hopes for something like Azul's pauseless GC[2] eventually becoming a common GC strategy, but I'm not holding my breath. In the meantime, there's always sun.misc.Unsafe[3] in the JVM world :-(

[1] - https://docs.oracle.com/javase/8/docs/technotes/guides/vm/gc... [2] - http://www.azulsystems.com/zing/pgc [3] - http://mishadoff.com/blog/java-magic-part-4-sun-dot-misc-dot...

Go already gives better tools than Java for managing native memory through cgo, which is a far less painful interface than JNI (believe me, I've written a lot of JNI for Hadoop). Go also has value types which is a huge win for managing memory. (And even if Java gets value types later, the whole Java standard library will still take years to change to use them, if it ever does.)

Plus, sun.misc.Unsafe is probably going away, according to Oracle. http://www.infoq.com/news/2015/07/oracle-plan-remove-unsafe

From what I've heard, Azul has a great GC, but the throughput is extremely low. It's really only a practical solution for high frequency finance and places like that where latency is everything, and throughput is nothing (you can buy another 100 servers or high-end hardware). Note: I'm talking about their software product which runs on vanilla hardware, not their hardware product, which I understand is far superior.

A lot of people on HN also seem to be taking the statement that maximum GC latency will be 10ms as a statement that there will often be 10ms pauses. Hopefully, the average latency will be far less, in the 1 or 2 ms range, and 10ms will be something that only happens on huge heaps in certain conditions. This should be similar to what has happened on Android, where GC pauses are pretty rare and typically only 1 or 2 ms.

> the whole Java standard library will still take years to change to use them, if it ever does.

All Java collections (which are what really matters) are being retrofitted alongside the introduction of value types. As soon as value types are introduced (and the HotSpot team is working hard on that right now), all collections will be fully value-ready.

> Go already gives better tools than Java for managing native memory through cgo, which is a far less painful interface than JNI

JNI is being replaced by Project Panama: http://openjdk.java.net/projects/panama/ (you can already use a similar FFI with JNR, which serves as a blueprint for Panama's FFI: https://github.com/jnr I've used JNR to write a FUSE filesystem in Java without a line of C, and unlike JNA, it's fast!)

Besides, HotSpot runs C now, too, and quite well:



> Plus, sun.misc.Unsafe is probably going away, according to Oracle.

... only to be replaced by something much better: https://www.youtube.com/watch?v=ycKn18LtNtk (Unsafe isn't going away until replacements are available).

> From what I've heard, Azul has a great GC, but the throughput is extremely low.

Not at all. Just a little lower than with HotSpot's throughput collector, and possibly higher than with G1 (although G1 changes a lot, so that might not be true).

It feels weird to say this, but Oracle's stewardship of the jvm is making me really hopeful as an occasional ml developer.

Consider the wish list: I want a garbage collected language where, for a handful of large/important data structures, I can sidestep gc and carefully control memory layouts for cache friendliness. I'd also like direct interop with blas and my aforementioned data structures.

It looks like I may get all of this!

And yes, I've done a bunch of work with misc.unsafe but it's nowhere near as nice as it could be. What the jvm really buys you is not having to build once for each platform; I distributed code that relied on c++11 features on 3 platforms while there was mixed compiler support and it was a bloody nightmare.

> I can sidestep gc and carefully control memory layouts for cache friendliness

Memory layout and GC are two completely orthogonal issues. You will be able to control memory layout quite well with Valhalla (value types) and even on a finer-grained level with Panama if you need C interoperability. VarHandles (hopefully in Java 9) will give you safe access to off-heap memory. Currently you can do that with Unsafe, which is more work but still less than C++.

> What the jvm really buys you is not having to build once for each platform

Oh, I'd say it buys you a lot more: seamless polyglotism, exceptional performance even for dynamic stuff (dynamic languages, esp. w/ Graal, but even cool bytecode manipulation in Java or even simple code loading/swapping), and you get all that performance with unprecedented observability into the running platform.

Value types will provide the ability to allocate storage embedded in a heap object or on the stack, but they don't provide layout control (i.e. order of fields in the layout). It's a good change, but let's not exaggerate.

As the requirement was "layout control for cache friendliness" value types are all you need (or 99.99% of what you can possibly need). For interop, there's Panama. Let's not nitpick.

99.99% is perhaps your estimate, but not necessarily others. This is also not likely what people would consider "layout control" if they're coming from a language that allows field-level layout control. Being able to place frequently used together fields manually is quite useful in quite a few circumstances.

> Being able to place frequently used together fields manually is quite useful in quite a few circumstances.

I have almost never seen this make a difference outside of, say, GPU programming. The fact that Java's optimizer is much better than that of Go will make a much larger difference in execution speed.

This is mostly an issue for large objects (i.e. objects that span multiple cache lines) that have different access patterns for various fields (i.e. clusters of fields accessed together).

The other aspect of layout control is cacheline padding, which is also not present in the JVM. There's @Contended, but it's a blunt tool and not currently a public API (it's in sun.misc).

>The fact that Java's optimizer is much better than that of Go will make a much larger difference in execution speed

Yes, but that's orthogonal.

Both grouping and padding are possible with Java's value types. The only thing you don't have fully guaranteed (i.e. across multiple JVMs) control over is ordering and alignment.

Padding is possible with today's java (at least Hotspot), although it's ugly and relies on Hotspot's existing layout implementation.

Yes, you can group but as I mentioned in the other reply, you get "blob" granularity/control. This is unlikely to be an issue for a value type instance itself, but say you want to have multiple value types (each with several fields) embedded into a heap-allocated object. Assume this total embedded blob spans multiple cachelines, but you have clustered access patterns. So you've grouped some data together into value types, but you have no control over how that will be laid out in the "host" object.

Again, I'm happy that Oracle will be adding value types -- they will definitely help. I contend, however, that calling it "layout control" is a slight stretch in the general sense; it's perhaps fair to call it that when comparing against today's java, but not against languages that allow explicit field-by-field layout (along with padding, alignment, etc).

> it's perhaps fair to call it that when comparing against today's java, but not against languages that allow explicit field-by-field layout (along with padding, alignment, etc).

It's fair to call it that when comparing against any language or runtime targeting general-purpose development, but not compared to languages designed for different purposes. I agree with that.

However, we're not discussing those languages here so I don't see the relevance. I'm also sure that many statements are wrong when compared to quantum computing, but I don't think there's any need to make that qualification. It's pretty clear from the context.

>It's fair to call it that when comparing against any language or runtime targeting general-purpose development

Unless you consider, e.g., C, C++, and Rust as not general-purpose. Even C# has StructLayout where you can control the struct layout. Heck, Go and C# allow you to embed a fixed size array into a struct (this facility is not in the works for Java at the moment).

> Unless you consider, e.g., C, C++, and Rust as not general-purpose.

Of course they're not! Even Stroustrup now calls C++ "a language for those who need or want to work as close to the hardware as possible". None of these languages -- unlike Java or Go -- has simplicity as a major design goal.

> Even C# has StructLayout where you can control the struct layout.

Yeah, that's precisely my point of giving you something you don't need. That C# does it doesn't mean it's actually significant for any significant number of programs.

> this facility is not in the works for Java at the moment

Paul Sandoz is working on something similar with one of the VarHandles variants (he said there's a prototype already), namely, indexed access to object fields, which would make this a language, rather than a JVM concern.

But value types do let you group fields (with sub-component values).

Also, I've noticed that whenever I say Java does X, you say, "Oh, no! It does X - ε!" Now, to me, that's nitpicking, especially considering that a perfect general-purpose language/runtime designed to be simple (for some definition of simple) should give you 90+% performance in 99% of general-purpose use cases (or 95% in 95% etc.). If it does any better then one of two possibilities is true: 1/ it's magic, or 2/ it's not a perfect simple language/runtime because it could have been made simpler (by whatever definition of simple it's chosen).

Anyone who can't settle for anything less than 100% performance or does something that's outside 95% of the use cases knows not to use such a general-purpose language/runtime, and, instead, uses a more domain-specific language/runtime or one that's not designed to be simple.

>But value types do let you group fields (with sub-component values).

They let you treat the fields of a value type as a "blob", you have no control over how they're laid out within that blob.

>Also, I've noticed that whenever I say Java does X, you say, "Oh, no! It does X - ε!" Now, to me, that's nitpicking, especially considering that a perfect general-purpose language/runtime designed to be simple (for some definition of simple) should give you 90+% performance in 99% of general-purpose use cases (or 95% in 95% etc.). If it does any better then one of two possibilities is true: 1/ it's magic, or 2/ it's not a perfect simple language/runtime because it could have been made simpler (by whatever definition of simple it's chosen).

Nothing personal, but I find your JVM related posts as borderline fanboyism (and I say this as someone that greatly respects the engineering in Hotspot, despite certain things bugging me). It's not about 100% or 95% performance; it's about not making exaggerated claims since, as you say, there's no magic.

> I find your JVM related posts as borderline fanboyism

I think they're much less fanboyish than any other language/runtime discussion on HN.

> it's about not making exaggerated claims

I am not making exaggerated claims because perfect for me (and, by definition 95% of people) is precisely how I've defined the requirements from a runtime like Java (or Go, or Python, for that matter). So, if I say "it gives you all the layout control you need", I don't mean all the layout control you need to write a DSP, or all the layout control you need to get 100% performance -- only 99% performance. I think most people make the same assumption (because they're not writing DSPs), and I find your posts nitpicky and possibly misleading for the target audience (who, by and large, don't write DSPs or medical devices or particle accelerator beam controllers -- at least certainly on threads not discussing those interesting but extreme use cases). I don't think it's reasonable to qualify every statement for some negligible (possibly nonexistent) portion of the participants, especially considering that that minority already knows the statements don't completely apply to them. Not doing so doesn't qualify as exaggeration IMO, but reasonable expressiveness, or else every discussion will be bogged down in irrelevant detail that will only distract from the main point.

So, yes, I stand behind my claim that value types give you all the control you need to get 99% performance for 99% of the use cases people would normally use Java for.

X - ε, for very large values of ε.


I think I disagree about the observability -- vtune is a lot easier to use when just tuning straight C++ rather than java

Are you familiar with JMH's perfasm? http://psy-lob-saw.blogspot.com/2015/07/jmh-perfasm.html

And for profiling apps on production, I've yet to encounter a more thorough, low-overhead profiler than Java Flight Recorder.

no but i'll check it out this afternoon. Thanks!

VTune is not too dissimilar to JProfiler.

And as pron mentioned plenty of tools exist for lower level access.

Vtune gives access to PMU counters as well as attributing them to assembly. JProfiler is a purely java level profiler (it won't even tell you Hotspots in the JVM itself, nevermind assembly). They're not really comparable.

jprofiler, at least for my use cases, isn't really similar to vtune at all. I know what my hot spots are: it's the inner bits of algorithms that run a few billion to a few trillion times. What I need to do is understand, as granularly as possible, the exact instructions and how the various caches and memory are operating. Convex and tree optimizers are generally memory speed limited and my goal is to have this code run at eg 0.9+ of memory b/w speed.

Which of the JNR projects is the one you used? There are three that sound like they would do the exact same thing:

https://github.com/jnr/jnr-invoke https://github.com/jnr/jnr-ffi https://github.com/jnr/jffi

jnr-ffi is the high-level API (and the one I used). jffi is a low-level ffi used by jnr-ffi. Unfortunately, jnr-ffi is still missing one piece (which can be done with the low-level FFI), namely creating a function pointer to a Java method dynamically and passing it to C code.

I don't know if Azul Zing throughput is "extremely low" but it does sacrifice it for latency. I had a discussion with Gil Tene on this recently where he dove into some details: https://groups.google.com/forum/m/#!topic/mechanical-sympath...

In particular, the LVB is akin to an array range check on each reference read.
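To illustrate the analogy (a toy sketch, not Azul's actual mechanism): a loaded-value barrier tests a condition on every reference load, much like a bounds check, and takes a slow "heal" path only when the reference points at memory the collector is relocating. Here's a runnable Go sketch with a made-up tag bit:

```go
// Toy model of a loaded-value barrier: every reference load performs a
// cheap test (the "range check" of the analogy); only references into
// pages being relocated take the slow healing path, which fixes up both
// the value and the slot it was loaded from.
package main

import "fmt"

const relocMask = 1 << 63 // hypothetical "being relocated" tag bit

// heal stands in for the slow path that forwards a reference to the
// object's new location; here it just strips the tag bit.
func heal(v uint64) uint64 { return v &^ relocMask }

// loadRef plays the role of the barrier around a reference load.
func loadRef(slot *uint64) uint64 {
	v := *slot
	if v&relocMask != 0 { // the range-check-like test on every load
		v = heal(v)
		*slot = v // self-healing: later loads skip the slow path
	}
	return v
}

func main() {
	slot := uint64(0x1000 | relocMask)
	fmt.Printf("%#x\n", loadRef(&slot)) // 0x1000 after healing
}
```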

> "...From what I've heard, Azul has a great GC, but the throughput is extremely low. It's really only a practical solution for high frequency finance and places like that where latency is everything, and throughput is nothing..."

Well, you heard wrong.

Zing is used in plenty of throughput-intensive and throughput-centric applications, and sustainable throughput on Zing tends to be higher (not lower) than with other JVMs. E.g. Cassandra clusters tend to carry higher production loads per machine when powered by Zing (compared to OpenJDK or HotSpot on the same hardware). All while dramatically improving their latency behavior and consistency.

Specifically, on similarly sized heaps and workloads, the C4 collector's throughput is better than CMS's and close to ParallelGC's. And since its throughput scales linearly with the amount of empty heap configured and since (unlike OpenJDK/HotSpot) Zing places no practical pause-related caps on how much memory can be applied, it tends to beat both on efficiency in actual configurations.

The notion that good latency behavior has to come at the expense of throughput is just a silly myth. There are plenty of examples that disprove it. Zing/C4 is just one of many.

I thought Zing was basically HotSpot licensed and modified to use C4 (plus a few other minor things). How come it's faster than HotSpot overall? Are there a lot of Azul-specific compiler optimisations there too?

- Zing is based on HotSpot, and its biggest visible change is C4, but it changes a lot more than just the collector. E.g. it addresses pretty much all the causes of JVM glitches. (You can see a discussion of the many other reasons JVMs pause here: https://www.youtube.com/watch?v=Y39kllzX1P8).

- The reason Zing tends to carry higher throughput in production is that in most Java-based systems, production throughput levels are limited not by system capacity, but by how far you can drive the JVMs before the glitches start being unbearable, and what looks like occasional small hiccups at lower throughputs starts looking more like epileptic seizures at load. Capacity planning and sizing usually aim to keep peak production loads below the levels that lead to these "I don't want to go there" behaviors. By taking out the various glitching/pausing/stalling behaviors typically associated with JVMs under load, Zing extends the smooth operating range such that it comes much closer to the traditional "how much can this hardware handle?" capacity and sizing behavior people are used to in non-Java and non-GC'ed environments.

- When you compare raw throughput or speed (with no SLAs, e.g. "how long does this multi-hour batch job take to complete?"), with similar configurations Zing is usually comparable to OpenJDK/HotSpot [Where comparable typically means within a +/-15-20% range. Sometimes faster, sometimes slower.] But once people apply simple knob twists in the applications (like turning up heap sizes, caches, and other using-memory-no-longer-hurts related settings) they often get more raw throughput per instance or machine through simple efficiency benefits (like the elimination of raw work that comes from higher in-process, on-heap cache hit rates).

Thanks, fascinating answer.

In this old thread [1] one of the Azul developers outlines a potential path to having a C4 type GC for Go. Interestingly he mentions moving to a generational collector first, before moving to concurrent. It seems the Go developers have decided to go the other way, i.e. make the collector concurrent, but not generational. One of the reasons given for going generational first is implementing a write-barrier, but from the blog post it sounds like that is implemented for this concurrent collector anyway, so maybe it doesn't really matter.

[1] - https://groups.google.com/d/msg/golang-dev/GvA0DaCI2BU/SmEel...

+1 on Azul. They've pretty much solved it by improving on past methods, combining hardware and software. Go could do the same thing. I keep wondering about putting a dedicated FPGA on the memory bus that does nothing but concurrent GC. Have a mechanism to keep the processor(s) and it from stepping on each others' toes. Might work wonders.

Two factors require any GC support to be behind the CPU's MMU:

1) GC itself operates on virtual addresses.

2) If you want concurrent collection you're probably going to need a read barrier, and that will require some GC / MMU interaction.

The Azul Vega had a lot of interesting features to support GC (and other Java constructs), but the most important by far is the HW read barrier.

1. I expect, if we use a standard OS, that a modification might be needed where the FPGA will know the real address. The FPGA might be given enough info to figure it out with DMA, or it might be something like a custom syscall that gives the FPGA details.

2. Barriers are one method. They're among the most expensive. An IBM prototype used careful lock management (supported by lock registers) and scheduling to avoid barriers. There's actually quite a few ways in the literature to avoid barriers in concurrency. I figure a team would have to leverage them plus asynchronous I/O from CPU to FPGA for optimal performance. I see it as a series of hand-offs to FPGA which, once it has necessary information, acts on those hand-offs with CPU assuming it completed after certain time, seeing a part of memory saying so, or receiving an interrupt.

That's a rough sketch.

The virtual addressing part might be fine now that the GPUs are doing it, and hypervisors support programmable passthrough using the x86 IOMMU (VT-d etc) features.

Though I'm convinced custom hardware is doomed here for the usual custom hardware reasons. Maybe GPUs have gotten good enough at pointer chasing to be usable here?

Custom hardware is actually flourishing in datacenters. My preferred architecture is Cavium Octeon III style of many-core RISC, accelerators for everything, plenty I/O, and hypervisor support. Selling like hotcakes. Adapteva's stuff outperforms CPU's & GPU's at performance-per-watt-per-area with sales to HPC people. There's similarly at least a few custom hardware companies in each segment doing something that's hard or not cost-effective with existing hardware or software.

I agree that the risk is high, though, to the point that one shouldn't depend on it. So, I'd advise selling a system w/ services that's profitable which just happens to use such custom hardware. A high-performance, easy-to-manage, easy-to-integrate... already worth buying... platform that also has hardware-supported GC and/or memory safety. The sales of the system & licensing of the software subsidize hardware costs, which are structured to be cheap anyway. Start with FPGA's, then S-ASIC's, then advanced S-ASIC's or finally ASIC's. The NRE stays as low as volume can support.

Relevant example of this model (and evidence for my GC idea) is Azul Systems Vega machines. Those are custom hardware for Java supporting native bytecodes, a bunch of RAM, a pauseless GC, and easy enterprise integration. So, while we're all speculating, they're selling custom hardware w/ pauseless GC's. I'm just trying to work out a different, cheaper design hopefully integrating with Intel/AMD.


Note that they support a whole range of hardware, software, and services to diversify income. Any one thing shouldn't sink them, esp unfavorable hardware. That's the model to copy.

> Maybe GPUs have gotten good enough at pointer chasing to be usable here?

They haven't.

I enjoy comments such as this because I enjoy putting a thought, an idea in my head and rolling it around, seeing where it goes and what problems it has. This reply may be a bit rambling as a result.

If the FPGA on the bus is doing concurrent GC, we'd need a way to mark pointers and integers reliably. It means breaking all C code that does pointer manipulation, and breaking custom tagged pointer implementations. Niche languages and applications wouldn't have the buy in, and justifiably so, to set global policy on memory. On a developer machine you could experiment, but on a consumer device you would need to use whatever the OS and the bulk of applications have decided on.

Maybe that's not a bad thing though? The OS would need to provide implementations of malloc and free, and other primitives, that are tied to the hardware. But I suppose that's not unreasonable.

Except when it comes to virtualization. The FPGA isn't fixed function, but on the timescales that modern VMMs provide slices of CPU time to virtual machines, reprogramming an FPGA is an eternity. So we would gain in potential performance but lose in flexibility, substantially.

The biggest problem would be legacy C code. Code that treats "void*" and "long" as interchangeable values. And that includes a lot of hand-written JITs, where coercing values between pointers and word-sized integers is prevalent.

Rambling aside on potential upsides and downsides: The world would probably be a better place if pointers and integers were separated and strongly typed down to the hardware level. It would be a pain, but if the null pointer is a billion dollar mistake, placing all bets on the Princeton architecture may yet be a trillion dollar mistake this century.
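For anyone rolling the idea around: the classic software answer to "how does the GC tell pointers from integers" is low-bit tagging, which hardware-level typing would generalize. A toy Go sketch (the encoding is illustrative, not any particular runtime's):

```go
// Low-bit tagging: aligned pointers have zero low bits, so an integer
// can be distinguished from a pointer by storing it shifted left one
// bit with the low bit set. A collector scanning memory words can then
// tell which words are traceable references and which are raw data.
package main

import "fmt"

const intTag = 1

func boxInt(n int64) uint64   { return uint64(n)<<1 | intTag }
func isInt(w uint64) bool     { return w&intTag != 0 }
func unboxInt(w uint64) int64 { return int64(w) >> 1 } // arithmetic shift restores sign

func main() {
	w := boxInt(21)
	fmt.Println(isInt(w), unboxInt(w)) // true 21
}
```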

This wouldn't work with just any C or OS code; I'd assume that. It would take a modified or clean-slate runtime whose language can work with it. It might not tolerate virtualization, either. As I was heading for bed, I decided to at least Google to see if anyone did anything with FPGA's and GC's, so as not to leave your head with nothing to think on tonight. ;) Here's what I found:

FPGA-aware garbage collection in Java (2005) https://buytaert.net/files/fpl05-paper.pdf

(Modified Jikes VM to use FPGA coprocessor for collection. Result was good performance with around 2.32% overhead on memory-intensive benchmarks.)

Fine-Grained Parallel Compacting Garbage Collection through Hardware-Supported Synchronization (2010) http://www.ikr.uni-stuttgart.de/Content/Publications/Archive...

Stall-free, real-time collector for FPGA's (2012) http://researcher.watson.ibm.com/researcher/files/us-bacon/B...

(This is on one chip with tiny resources for GC around 1% and 4-17% overhead.)

Maybe I can get something better for you after some rest. Note that my comment to the other person has more details of where I was going with this.


Azul have a similar concept with their Java co-processors. I think they originally went the co-processor route because up until recently (i.e. last few years) the CPU instructions they take advantage of didn't exist on commodity chips.


I'm aware of it. My main recommendation, partly to eliminate x86 risks, was to switch to better, multi-core RISC processors while adding onboard garbage collection to them. One of the best was a Scheme machine that had it in the hardware memory subsystem, transparent to the CPU or application. I wanted to see about building something like that with Java compatibility and accidentally found the awesome Vega systems in the process.

The advantages of implementing it in raw hardware are many: easy to build into a state machine with a high clock-rate (even HLS should work); won't waste CPU time or cache; can be set up to analyse segments of memory in parallel because it's hardware; latency with the right algorithm can be lower. That combined with multicore RISC and some concurrency extensions I know of would give plenty of performance while knocking out tons of errors. Can't exactly do that with Intel/AMD chips unless we buy their semi-custom designs for untold millions, now can we?

So, next idea (for Intel/AMD) customers is to put it on the memory bus with some synchronization mechanism between CPU task and FPGA. This will use far less CPU time than a CPU that's doing GC stuff non-stop. That's on top of above benefits. Plus, if a workload doesn't need low-latency GC, it has a whole FPGA to use for acceleration. :)

Why would an FPGA be a better solution than software running on another core?

.. while keeping in mind that for some workloads GC is a large chunk of the app's work. For example, 10 Xeon cores worth of GC throughput (out of say 24) would be a pretty tall order for a FPGA, and as a fixed resource it easily becomes an Amdahl's law bottleneck.

It would be a cool thing to try still, and maybe doable with COTS hw: https://www-ssl.intel.com/content/www/us/en/embedded/technol... + http://www.hotchips.org/wp-content/uploads/hc_archives/hc21/...

It's a tall order because you set it up to be. Real system design would call for a balancing act, as usual. Remember that you can put a bunch of GC's on one FPGA that all run concurrently with access to shitloads of I/O and/or fast memory bus. Amdahl's law shouldn't kick in any more than with concurrent GC's in general. The parallelism, simplicity, and tech like in your link should make it faster than an on-board collector. The concept isn't speculation as it's already been done in two different ways:

Fine-Grained Parallel Compacting Garbage Collection through Hardware-Supported Synchronization (2010) http://www.ikr.uni-stuttgart.de/Content/Publications/Archive...

Stall-free, real-time collector for FPGA's (2012) http://researcher.watson.ibm.com/researcher/files/us-bacon/B...

The question is, "Can modern CPU's and off-chip FPGA's keep in sync without performance getting dragged down?" The FPGA's have gotten faster. The CPU's I/O have gotten faster. So, I'm sure it can be done but it might be difficult enough to be someone's Master thesis. ;)

Besides, I call for replacing current chips with open ones easy to modify for acceleration and security. Gaisler LEON4 SPARC, Rocket RISC-V, Cambridge's BERI/CHERI MIPS64... these all come to mind. Plan was to put them onto a high-end FPGA w/ concurrent GC's to test the scheme. Once it worked, ASIC conversion time baby. S-ASIC's are $200-500k on average with resulting production & packaging being way cheaper after that. Just hoping there's a few companies that would split the cost to eliminate most memory and control flow issues forever. ;)

I'm guessing since it's sitting on the memory bus it could intercept pointer modifications and synchronously update its graph?

Memory bus is just for speed. You don't want it doing stuff like that lol. See these two LISP machines for where my inspiration of putting it on memory bus came from:


(See section 7 for a radical... err, really old... way to do concurrent GC. Full paper available at ACM/IEEE or if you Google LISP processors Guy Steele enough.)

Scheme machine by Burger 1995


(See the main graphic and specifics in storage section later. Once again, GC-like stuff is handled in memory management part of the processor. This processor knows that, though, to assist GC a bit. Also different in that it was specified and then synthesized to heterogenous hardware with DDD "correct-by-construction" toolkit.)

So, have fun with those. Plus, Google hardware-assisted or hardware garbage collection to get lots of interesting results already done.

See my reply to anarcticpuffin.

There are plans for adding one to hotspot[1]. As I understand it it can't use the same tricks as C4 because that would require kernel support, but unlike CMS it will be compacting, thus avoiding the terrible failure modes of CMS.

There also is some work being done to make the CMS failures a little less terrible by parallelizing them.[2]

[1] http://openjdk.java.net/jeps/189 [2] https://bugs.openjdk.java.net/browse/JDK-8086706

As HotSpot's GC team is mostly working on G1 and grooming it as a CMS replacement (it has replaced the parallel GC as the default GC in current JDK9 builds), I don't think there's much further work on CMS.

The CMS improvements I've mentioned to are being contributed by a 3rd party. Based on the mailing list posts[1] I think it was google.

[1] http://mail.openjdk.java.net/pipermail/hotspot-gc-dev/2015-J...

That's really cool! I hope it gets merged...

I'm really surprised this is so new. I'm only beginning to enter the field and falling asleep the other night one of my thoughts was 'why doesn't the GC happen piecemeal?' AKA 'who said we have to collect everything, just unload 2 bins worth'. Mainly WRT use in Android, where it causes perceivable hitching

The Android Framework team has been working on improving the runtime and its garbage collector for quite some time. There have been many improvements to the garbage collection strategy. With ART on Lollipop, according to its creator, there should never be a visible GC pause when the app is in the foreground. More information here: http://www.anandtech.com/show/8231/a-closer-look-at-android-... and in various I/O talks. Extremely bad code will still cause problems (but that's true for whatever memory management strategy you rely on) but in my experience, on a Nexus 5, GC pauses are no longer a real issue on Android.

Still, Samsung is here to samsung things up. Comparisons of scrolling performance between a flagship Samsung device and a Nexus are just as expected: https://www.reddit.com/r/GalaxyS6/comments/3ck9no/scrolling_...

That's another topic entirely, but I think that another key factor is to move as much of the UI work as possible to a separate thread, not just to do heavy work in the background. It is starting to happen with ripple animations on a RenderThread, but I think that this strategy could be generalized.

> Still, Samsung is here to samsung things up

3rd time I've laughed with more than 60% intensity at something I read on the internet in the last 2 years. thank you!!! hahahaaha

for reference, my last 100% laugh that returned me to the days of cartoons and childhood was this comment about how to install and activate your non-rental-Cable-modem: https://www.reddit.com/r/technology/comments/27szuw/comcast_...

On my tablet (not nexus) with Android 5.0 I can surely see when everything comes to a stall.

Tablets are indeed complex.

At best they have the same SoC as a good phone, but they typically have way more pixels to push around.

NoName Chinese tablets are often a disaster and even high end tablets are hard to get right.

Even putting graphics aside, things like slow internal storage can also wreck performance.

might be 5.0 mem leak issues and general bugs

My experience is a lot of people/groups get obsessed with one metric, in this case average speed. A pauseless GC is always going to be a bit slower than one that pauses once in a blue moon (meaning all the time, from the user's standpoint). It requires a really strong will to go against that sort of grain.

Just remembered though: a friend who is an Azul developer said that x86 processors did not have the right instructions to support pauseless GC until very recently.

So probably a chicken and the egg problem.

can you open that up more? I'm skeptical. Which instructions?

Over my pay grade, which is why I'm slopping away at ARM Cortex firmware instead of at Azul. I'd ask my friend but he's on vacation. I do know Azul originally ran their JVM on custom processors. (Target market: high speed trading systems, etc.) I can't see why they would have done that unless older x86 processors just couldn't do what they needed.

A little internet searching, leads to this and some other stuff. https://www.artima.com/lejava/articles/azul_pauseless_gc.htm...

I think the deal is, if the GC moves something, it can leave a bunch of dangling pointers. Those need to be fixed up to point to the new place before they get dereferenced. Except Azul uses a read barrier to trap when a dangling pointer is dereferenced, so the fix up can be lazy.

Interesting side note - Azul's GC isn't "pauseless" it's pause less.

"To create a garbage collector for the next decade, we turned to an algorithm from decades ago. Go's new garbage collector is a concurrent, tri-color, mark-sweep collector, an idea first proposed by Dijkstra in 1978."

So, the best GC for the next decade was proposed in 1978. Whatever computer scientists have proposed since then apparently hasn't been as good.... so they can all just pack it up and go home?

I'd be really really curious to hear from some academic computer scientists if the best that can be done was first proposed back then, and they've all just failed to find anything better since. Cause this is just downright depressing otherwise (If you actually want to believe that we're making progress).

> Whatever computer scientists have proposed since then apparently hasn't been as good.... so they can all just pack it up and go home?

1. Most modern high-performance garbage collectors are both concurrent/incremental and generational. Generational GC is really important for throughput and I think Go will need to get it eventually.

2. The tri-color marking algorithm plus the idea of a nursery and tenured generation are all very old as far as ideas go, but the devil is in the details, and that's where the research lies. Hash tables, for example, date back to the '50s, but there's been constant research on refinements of the technique up to the present day.

Sometimes old ideas become worthwhile again because the hardware landscape changes.

An example from compilers: Register allocation for expressions (no cyclic dependencies) is an old and solved problem. For an expression like a+b+c+d+e+f, you wanted to order it like (a+(b+(c+(d+(e+f))))), because that minimizes register use. Then newer CPUs got out-of-order processing. Now you want ((a+b)+c)+(d+(e+f)), because in contrast to the previous order the CPU can now sum in parallel. Another case, where digging up an old algorithm helps.

What probably also happens a lot is that people reinvent stuff without knowing previous work.

I'm no academic (just a CS student), but from the very little I understand about GC algorithms, there's no free lunch. In other words, choosing GC isn't like shopping for a new condo... different algorithms have different advantages and implementation difficulties. There may also be caveats with some algorithms that others aren't plagued by. "Better" means different things to different people and in different situations.

For what it's worth, many of today's finest, most important algorithms are from the 50s-70s, or maybe special adaptations of them. Theory doesn't obsolete like technologies do.

No need to get depressed by this. Hard problems are still hard (until proven otherwise), and taking out the trash is the same today as it has ever been.

> For what it's worth, many of today's finest, most important algorithms are from the 50s-70s, or maybe special adaptations of them.

This is only the case/sensible for algorithms that have not been practically improved.

Another reason is that writing an efficient gc is a very complicated software engineering task. A concurrent continuously compacting pause-less gc is the cold fusion of the software world (i'm half-kidding because a ccc gc does work in practice). :) So Dijkstra could propose it, but he couldn't (afaik) code a high-performance proof of concept. If there was a toughness rating scale for engineering problems, then writing a CRUD app in Django would be a one and a good gc a 10.

The problem is that you have really complicated system (the gc) meeting low-level language. You need to write it in C or C++ for performance and for direct memory access. But it's a language that isn't suitable for complex algorithms so if you could choose, you'd rather pick Lisp or something. It wouldn't surprise me if it takes Google a very long time to get their awesome gc that they are planning for Go up and running even if they throw heaps of engineers at the problem.

> Today 16 gigabytes of RAM costs $100 and CPUs come with many cores, each with multiple hardware threads. In a decade this hardware will seem quaint.

This is a potentially dangerous worldview. While it is true for the server/desktop/laptop segment, it is less of a surety for phones, and downright scary for smaller, low-power embedded/IoT devices.

Also it assumes that only one program is running on a machine and it's ok to use all the machine's resources. In practice that isn't true - if your web application uses twice as much server memory then you're going to be paying nearly twice as much for your cluster of servers. That makes a big difference.

"The most amazing achievement of the computer software industry is its continuing cancellation of the steady and staggering gains made by the computer hardware industry." (Henry Petroski)

Right! Had we not squandered all those wonderful hardware performance gains on useless stuff like web browsers, searching the entire internet in milliseconds or streaming HD video over the net, we could now play Pac-Man thousands of times faster than in 1983!!!

It's not a limitation, just a tuning knob that can be used to trade off GC cycles against memory usage. If you're in a low-memory environment, tune the knob to use more CPU.

My comment was targeting that particular statement and was not meant to be interpreted as a deficiency in this particular GC design, though if I were to comment about that, I have to say I remain a bit skeptical about how far you can effectively scale it down via this single knob technique; scaling up, I'm more certain about.

Fair enough. It might not scale down that well, especially since on low-memory systems you often worry about CPU usage and power usage as well. But Go uses a lot less memory to begin with, so the default GOGC value of 100% (half of application memory used, half free) will still be less than a similar Java program, even ignoring that Java also needs a heap larger than the allocated size to get good performance.

What if you're in a low-everything environment?

Then don't use Go. Go was pitched as a systems language, but has found its niche as a server language. It's to be expected that the maintainers will make decisions that tune it for servers.

Don't use a language which has GC. Allocate all objects statically; no memory management needed at runtime.

At some point you can no longer afford the luxury of a garbage collector. As the tech gets better, that point gets lower and lower, but there are still limitations.

Phones have two gigs of RAM and four cores these days. Low what? If you're talking software on an internet connected fridge, then, maybe, but that's a pretty niche area even in today's mostly theoretical and overpriced IOT world.

>smaller, low-power embedded/IoT device

Do you want a GC there anyways?

There's a market for embedded Java. Atego makes a variety of runtimes for real-time use.


Latency under 1ms per their material.

That's interesting, from a business perspective. How did they arrive at that market via CAD?

What do you mean arrive "via CAD?" What is CAD in this context?

I mean how did PTC go from selling solid modelling software for IRIX to hawking an embedded JVM?

Hell, I didn't even know they were PTC now. On an acquisition or name-changing spree it seems lol. Aonix developed and marketed the Java technology. Got acquired by Atego and now apparently PTC. You will probably find the Wayback Machine's copy more informative with the non-JS links on left leading to PDF's showing its capabilities:


They had standard Java support, a better native API, GC pauses down to 100 microseconds, AOT compilation, and so on. I especially liked how they supported separation kernels like INTEGRITY and non-x86 architectures. The ability to use strong, user-mode isolation on the VM plus obfuscation of codebase and processor ISA makes for high resistance to whatever Internet throws at you. ;) Plus, embedded boards can be really cheap w/ Freescale's doing I/O, crypto, and TCP acceleration.

So, I thought it was cool shit that could've benefited mainstream market if they thought of things like this and applied it properly. Could always build, in parallel, a FOSS-based setup in case they pulled out or some other crap that happens with proprietary companies. Can just painlessly move off their stack. Meanwhile, their stack rocked in its domain, still does given numbers I'm seeing in Go etc, and it's worth copying in some ways.

Yes, very much so. GC on powerful servers is a well trodden path. Low power devices much less so.

Isn't it a less trodden path because it sacrifices too much and gains too little, especially when we have languages coming up like Rust with the memory safety of a GC but no GC runtime cost?

I ask because I don't have much low-level/embedded experience, and I'm unaware of why you'd want a gc on such low power systems. If there's a use case though, I'd love to learn about it!

I think the question is easier to answer if we view "garbage collection" as automatic memory management in general, rather than a more specific case of automatic memory management via a runtime task colloquially known as the garbage collector. I believe automatic memory management is a universally desirable property whether you are running on a server or a small device. However, since you brought up Rust, I also think it is valid to debate the value of a runtime GC as opposed to another automatic memory management technique like automatic reference counting. I don't think that question is embedded-specific though: it applies just as much to the server environment.

How is tracing GC vs. pervasive reference counting relevant to Rust?

I merely used automatic reference counting as one alternative automatic memory management technique. Rust, as I understand it, uses a similar alternative, which is, at least on paper, similar to enforcing a max refcount of one on pointers, statically. I'm far from a Rust expert so feel free to correct me if I'm wrong.

I'd say that "reference counting" implies a dynamic/runtime component, which is not how Rust manages memory (unless you're specifically using the `Rc` pointer in the stdlib).

I think it's important to be clear here because Swift and Objective-C rely on a technology known as "Automatic Reference Counting" (ARC), which does imply the usual dynamic cost of reference counting.

I think the go guys are clearly focusing on server applications.

> potentially dangerous worldview

those tools are the same freaks that make Chrome take 150MB-200MB per Tab and ruined my perfectly usable 8GB-RAM system in a matter of 2 years.

How reliable is this one? Browsers these days are infuriating memory hogs, and Chrome is the champion, especially on OSX.

I haven't had any problems with it. As somebody who has ~100 tabs open at any given time, this extension is a godsend.

you know, considering I manage with 300GB / month I don't see why I should need 200MB for each website. Just keep it serialized and compressed until I need it.

it's too late. I have 32GB now

That pretty much validates their motivation.

haha. sigh...

>> "Go's new garbage collector is a concurrent, tri-color, mark-sweep collector"

So basically, it's the same GC that Mike Pall had planned to implement for LuaJIT 3.0. Too bad Mike recently announced he was transitioning away from doing daily LuaJIT work.


Actually he wanted to use a quad-color implementation, which AFAIK is a Mike Pall tweak of a tri-color collector.

Actually, it's the same GC already used by Lua and Luajit. Read the note at the end of the section on tri-color collectors?

OCaml's collector has also been the three color variety (4, including a color for a free list). But it hasn't been concurrent, which is the non-trivial part.

> Maintaining this invariant is the job of the write barrier, which is a small function run by the mutator whenever a pointer in the heap is modified.

Synchronization whenever a heap pointer is changed? That does not sound like "low latency".

In GC terminology, "barrier" simply means that some kind of hook is run on a read or write of a pointer; it does not necessarily imply a barrier in the caching or synchronization sense.

A typical implementation of a write barrier involves adding an item to some kind of thread-local queue.

I was wondering, why not use ARC, as in C++ and Obj-C?

Simple reference counting is quite possibly the worst possible choice for automatic garbage collection, especially since it is so intuitively appealing. The common objection is that cycles are hard to handle; leading to either a broken behaviour (leaking garbage) or adding complicated cycle detection (negating the premise of a simple solution).

Apart from cycles, reference counting has several other issues. Some of my pet objections are:

* The need to add a mutable field to every object increases the active memory footprint significantly (especially bad for caches).

* The frequent updates to that field must be done using atomic operations if the objects may be accessed by different threads (even though the objects themselves may be immutable).

* The updates to the reference field trashes the caches in a multi-core processor.

* The amount of work done managing these updates can easily become a significant part of the run-time of the program, in some cases dwarfing the actual work.

* Uncontrolled, arbitrarily long pauses will happen when deletions cascade.

There are really cool and smart algorithms that make reference counting much better, but then the appeal of a simple natural algorithm for collecting garbage has come and gone.

>There are really cool and smart algorithms that make reference counting much better, but then the appeal of a simple natural algorithm for collecting garbage has come and gone.

That's probably true, but if the win is pauseless automatic memory management that doesn't require twice as much memory as using a GC, then it may be worth it.

I think, realistically, we'll have to accept that both tracing GC and reference counting (however smart) will always suffer from some drawbacks that manual memory management can solve. The reverse is also true.

Sure, a really smart implementation (coalescing reference count updates, concurrent de-allocations to get pauselessness, etc.) could have some nice properties.

However, while the amount of virtual memory needed for a tracing GC might seem large, it is not active memory. Reference counting on the other hand will significantly increase the amount of active used memory, especially if many small objects are common.

As for GC vs. manual memory management, both have their uses. For most systems I would argue that GC is the way to go since it is quite efficient and is much more productive. On the other hand, I actually like writing code that has manual memory management. My current interest is learning more Rust for that kind of development.

Reference counting generally suffers from the cyclic reference problem, e.g. if you lose a reference to a doubly-linked-list, the entire data structure leaks and won't get freed. In Obj-C, this problem is tackled via manually annotating some pointers as "weak", explicitly (and you have to be extra careful with them).

There are less fundamental issues as well, like high contention on the reference count in highly parallel environments, and more memory fragmentation compared to a compacting garbage collector. Also, memory allocation tends to be faster on a compacting garbage collected heap (simply increment the heap pointer instead of searching for a large enough available block). Additionally, freeing memory when things run out of scope causes challenges when optimizing code, like making it more difficult to optimize function invocations that otherwise would have been eliminated as tail calls with a garbage collector. In practice, in some cases, the cost of a tracing garbage collector might be well worth it when you amortize it over the lifetime of the program, compared to keeping track of reference counts at runtime.

The classic problem: cycles. Either you make the programmer manually deal with cycles via weak references, which can be error prone and allow memory bugs to creep in, or you have a secondary GC to collect cycles. The first one increases programmer complexity and the second one increases implementation complexity, which are, respectively, the first and second things Go attempts to avoid.

There's also the issue that ARC vs. GC is a tradeoff between CPU overhead and memory overhead, and that in the scenarios Go was initially designed for (Google's servers) memory is much cheaper.

The Go FAQ addresses this: http://golang.org/doc/faq#garbage_collection

Upon re-reading, it's distressing to hear that the GC latency has been raised from "not significant" to "low."

In addition to the cyclic references problem, rc has a high constant overhead (memory traffic from constant manipulation of rc fields) and a problem with pauses (cascading deallocations). Also it precludes efficient shared memory concurrent access of managed objects.

As always with Go, they assume nothing that has happened in the last 3 decades is worth looking at, and retrofit their justifications for their choices to fit that. It's inherently a language that is 3 decades behind other modern languages. It's embarrassing.

I'd prioritize predictability over anything else.

Will this collector evolve into a generational one in 1.6? Or is it already generational?

Quite a lot of hubris in thinking they are just going to solve this as if no one has been working on it. Who knows, maybe they have, but color me skeptical.

They're not claiming to have a new solution. They cite a Dijkstra algorithm from 1978, and explain why they believe it is a good solution for certain modern applications.

Sure, but they also didn't get into what would make their approach special.

Are there any well used GC that aren't tricolor?

Good GC is an open problem. Even though there has been much work done on the subject, there is still plenty of room for improvement!

Server GC, for example, typically tends to be less concurrent to improve throughput. Client GC is more responsive, but also sacrifices throughput. Since Go is positioned for server work, it is very interesting that they chose what are typically seen as client-side priorities.

Unpredictable latency is a huge no-no in these days of responsive server applications. Long GC pauses are just not acceptable to many server developers, so I don't think it's correct to consider it only a client-side problem. If you look at the mechanical sympathy mailing list, you'll see a large group of Official Knob Turners for the Java garbage collectors discussing their trade.

The point is not that latency doesn't matter. It's that throughput and latency are more or less a fundamental tradeoff in GC, and latency isn't the only thing that matters.

Yes, that's true, but this GC lets you get more throughput by using more memory, so you can still decide how you want to make that trade-off. But no amount of knobs can save you from a long stop-the-world pause, so they made absolutely the right decision, in my opinion. With multi-core server CPUs, throughput is relatively cheap to obtain. Latency is hard, you can't just throw money at the problem.

Although Go was conceived as a language for the server, that's becoming less apt as time passes. It's being used in many places that the language designers hadn't initially thought of, such as iOS and Android. So the trade-offs they need to make today have to consider all possible use cases, not just the server one.

Also, I think it's really important that pause times be as short as possible on the server as well. If they aren't, you'd probably see an unhealthy spike in 90+ percentile responses.

HotSpot has also switched to a low latency collector (G1) as the default GC: http://www.infoq.com/news/2015/07/Oracle-Confirms-G1-Default...

They say[1]: "Limiting GC pause times is, in general, more important than maximizing throughput. Switching to a low-pause collector such as G1 should provide a better overall experience, for most users, than a throughput-oriented collector."

[1]: http://openjdk.java.net/jeps/248

Throughput is a problem that solves itself. Wait a year and buy a new computer. :)

Not kidding, though, the latency hit from stop the world only gets worse scaling from 32 cores to 64. May as well call it "batch" GC. On the other hand, concurrent GC will keep the same latency but hopefully double throughput on such a machine.

Unlikely. GC, especially non-generational GC, is a memory bound application with low arithmetic intensity and locality. The scaling will likely be sublinear, especially with that many CPUs.

Yes, perhaps a poor assumption. But at what point will the scaling go negative? Even 25% more throughput (same latency) seems preferable to 100% more latency (same throughput). Is it fair to assume latency will be linear with the "server" GC?

I'm sorry, what exactly is your criticism?

Presenting a new GC implementation as solving the low latency problem that people have been working on for decades without any real proof.

Garbage collection was solved long ago: just manage your own memory ;-)

This comment shouldn't have been hit so hard. It's a valid response to people complaining about the throughput/latency tradeoff, especially in special areas like small embedded devices. Go's GC can't just remove all the latency. If you can't stand the tradeoff, then you have to manage your own memory in a way that suits your special case.

My guess? It's not being downvoted because people think managing your own memory is verboten. It's being downvoted because it's a smart-ass comment seemingly meant to be humorous more than insightful. Your reply is about 10x more useful than the parent.
