Hacker News new | past | comments | ask | show | jobs | submit login
Modern garbage collection (medium.com/octskyward)
434 points by junke on Dec 20, 2016 | hide | past | favorite | 243 comments

An interesting shift in the wind for modern garbage collection comes from microservices.

The work on JVM garbage collectors over the last ten years has increasingly been about supporting bigger and bigger heaps efficiently, as physical machines have got bigger, and programs' (or at least application servers') heaps have grown to fill them. Even Shenandoah, a current prototype collector, is designed around large heaps.

But with microservices, we use our cavernous machines to house many small heaps instead. That leaves us with rather different garbage collection challenges.

So, it's possible that the Go developers will get to ignore decades of progress, as they love to do, and still be competitive, because that progress was towards a different future to the one we arrived at.

All the Hacker News kids are crazy about microservices right now.

In three years, people will have realized it wasn't such a good idea, and moved on to the next "this will solve everything" fad.

Once you have seen enough of these fads go by, it's pretty obvious.

I'd be careful about extrapolating future directions of computer science from something that seems cool this year.

> All the Hacker News kids are crazy about microservices right now.

> In three years, people will have realized it wasn't such a good idea, and moved on to the next "this will solve everything" fad.

Microservices are simply the youngsters rediscovering that encapsulation is a damn good idea.

As I explain to people, microservices do NOT make your life easier. If you can exist in a monolith, that's way easier to implement.

However, once things start getting complicated, microservices make certain things possible.

Microservices go beyond encapsulation in important ways, and regress in other ways.

Build/deploy/release cycles become independent with microservices, which can be great. Also you can write microservices in different languages.

But the APIs can become less rich. Try passing callbacks between microservices written in different languages. And it can be more of a pain to change APIs. And if there's any state to worry about, you need to manage that and maybe use a stateful protocol (which are harder to get right).

> Build/deploy/release cycles become independent with microservices, which can be great.

I accept that coarse-grained services can have independent lifecycles, but not microservices. I'm working on a microservice-based project, and two things we've learned are that there are almost no features we can implement in just one microservice, and that if we don't test microservices together, we get bugs. So, in practice, their lifecycles are tightly coupled.

Some people ding this for being a 'distributed monolith' rather than some kind of 'true microservices', but i don't see how we could factor this application into micro-sized bits that would not be tightly coupled.

> Also you can write microservices in different languages.

I worked somewhere that tried that for a while (although small services, rather than microservices). It turns out that it makes it really hard to share people, libraries, and learning between teams, and that the task-specific advantages of particular languages mostly aren't great enough to make that worthwhile. We managed okay with some bits in Java and some in Scala, but adding Clojure and Groovy was too much. I think there are some bits in R now, but they're owned by a completely separate part of the organisation, so Conway's law allows it.

> It turns out that it makes it really hard to share people, libraries, and learning between teams

The idea is that you don't do that. It's a matter of duplicating functionality across teams so that each team isn't beholden to another team. pushing that library doesn't suddenly break things, for example.

Whether you agree with it or not, when you choose microservices, you choose to avoid sharing libraries like that. In the small it's probably ok (across microservices in 1 team, depending on circumstances), but in the large the gain isn't worth the added maintenance hassle.

Libraries, sure. But if you've chosen an approach where you can't move people and learning between teams, you've chosen failure.

I don't even know what that means.

The independence of microservice is not always achieved. You often end up with a mess of interrelated services depending on fairly narrow version ranges to retain compatibility. The more homogeneous your production environment is, the less that's a problem. The worst case is when you have a dozen (or hundreds...) of production clusters that are all independent and not necessarily in-synch as far as deployed version goes.


Developers often gloss over how truly difficult it is to make good API choices. And when the wrong choices are made, it goes wrong quickly.

Microservices probably amplify this effect, but it's hard to say.

And if its a business application, and the company survives long enough, it will invariably get complicated.

Do they still use FastCGI ?

I doubt it

You want to make your way in the CS field? Simple. Calculate rough time of amnesia (hell, 10 years is plenty, probably 10 months is plenty), go to the dusty archives, dig out something fun, and go for it. It’s worked for many people, and it can work for you.

        — Ron Minnich 

If you don't know Ron - http://www.socallinuxexpo.org/scale8x/blog/interview-ron-min...

People today could do well reading

Computer Architecture and Parallel Processing Paperback – International Edition, January 1, 1986

by Kai Hwang (Author)


Services are a good idea. SOA (serviced oriented architecture) has been out for more than 20 years and it's not going away.

Micro-services is just a new hype buzzword. Classic case of "smaller is cuter" http://www.smbc-comics.com/?id=2932

Who said microservices don't need large heaps? The prefix 'micro' means the share of responsibility and boundary of functionality, not any hints at memory size.

Of cause JVM needs to focus on large heaps. Now it still stucks somewhere at 32GB for performance reason. And for heavy-lifting application like HBase/ElasticSearch, that is currently OK, but still makes it quite inconvenient to utilize larger memory(you might need deploy multiple instances on the same box, etc).

You can't invent a new buzzword and tell other people they are solving the wrong problem.

> Now it still stucks somewhere at 32GB for performance reason

Short version: No it's not.

Long version: Heap is under 30 GB or over 40 GB. Never between. There is a JVM optimization that allows it to use 4 byte pointers up to 30GB of heap.

What is the reasoning for this? I think JVM still has problem handling real large heaps, the GC stop will be unbearable..Does thing change recently?

There is no problem to handle large heaps.

There is simply a gap: A 30 GB heap (with 4 bytes pointers) may be similar in capacity to a 40 GB heap (with 8 bytes pointers).

Guides and documentation (elastisearch) are simplifying this as "java can't do more than 30 GB heap", which is factually incorrect. You can do a 60GB if you want.

> There is no problem to handle large heaps.

That's not true, dealing with large heaps is problematic in any garbage collected language because due to fragmentation you end up in stop-the-world full GC phases. Which for a big heap (i.e. bigger than 4 GB!), you can end up with latencies measured in seconds or even minutes.

The article mentions the CMS GC for Java, which has this problem. The G1 GC, introduced in Java 7 I think, is much better, however from what I've read it too isn't immune to full GCs. Of course for Java you also have the Azul "pauseless GC", which is supposed to never do this, but it is a commercial solution. And the article also mentioned another GC currently being contributed by Red Hat, which is good to hear.

Define large? I never had problem by increasing a heap from 10 to 20 GB.

Sure, if you go from 300MB to 30GB, the GC might behave differently and maybe some applications can't take it.

I don't agree with the snarky tone of this comment. I'm not a Go programmer, but I've seen many concepts touted as "progress" over the past 20 years which have been dead ends.

I think it is reasonable for designers to curate advances to address the problems they are trying to solve.

The comment doesn't come off as snarky, but instead a milder version of the content of the article.

Namely: I don't agree with Go marketing their GC as if it magically provides no trade offs. Since reading the original design doc for it, I've assumed that Google was simply lazy/didn't care about Android, because the Android GC didn't provide this magic.

> I don't agree with Go marketing their GC as if it magically provides no trade offs.

I don't believe this to be true. They always point out painstakingly, that the advances made on pause times come with less throughput, for example.

What the go people are doing (and with which I tend to agree and I believe it's hard to argue against the success they're having), is to say that they believe it's possible to build a GC which is good enough for most modern use cases and doesn't come with a lot of tuneables.

(However, even that comes with tradeoffs; it limits the use cases that go can be used for. For large enterprises, being able to minutely tune a GC might be an important thing)

Apologies for the brusque contrarianism, but this simply isn't true. I spent hours pouring over all the material they published when starting on this journey, because I was astounded that a team with a reputation for purposely avoiding groundbreaking work in favor of productivity today, had done something truly groundbreaking.

It wasn't remotely clear that the advancement here was "We don't provide settings, because we're guesstimating your workload matches the standard Go use case!", and all the histrionics further masked this truth.

At this point I'm just repeating the article, though. :)

So, it's possible that the Go developers will get to ignore decades of progress, as they love to do, and still be competitive, because that progress was towards a different future to the one we arrived at.

Are other fields as bad as programming? Do engineering people get to do the equivalent of having a paper published or getting to present at a conference because they implemented something in language A which was already done three years ago in language B? Do other fields have people grousing about how no one uses X, even though it is clearly "better," where "better" is basically an unsubstantiated opinion?

Some embarrassingly large number of "best practices" in our field are pretty much based on a feeling it "seems right" combined with a "well, it worked a couple times before" level of empiricism. Vital decisions about tooling, such as choice of programming language, are made by some combination of: following the herd, following one's own prejudices, or being enamored of a nifty idea.

One thing a company at the scale of Google could probably afford to do, would be to collect real data about the effects of programming practices.

Go isn't a research project. It's more a manifesto on programming philosophy born of "old man rage" in language form. This works out mostly because the old men in question happen to include Rob Pike and Ken Thompson. However, that results in a very specific feel and set of tradeoffs in Go.

Go isn't a research project.

Never said it was.

It's more a manifesto on programming philosophy born of "old man rage" in language form.

I suspect this has something to do with why I like it so much!

You might not agree with the design decisions, but to say one of the most successful contemporary systems programming languages was born out "old man rage" is insulting

Small nit but many of us wouldn't consider Go a systems programming language. I'd say "service programming langauge" is more appropriate. It's missing many things that would make it appropriate for interacting with hardware and limited resources.

to say one of the most successful contemporary systems programming languages was born out "old man rage" is insulting

Should I be insulted by your finding it insulting? My thought on reading that phrase was, "Right on!"

With all the discussions of ageism and sexism in tech, classifing a significant contribution to the industry as the result of "old man rage" is inappropriate IMHO

I can see how it would imply that negative connotation.

But being classifiable as an "old man" myself, I rather find it more annoying that it should be taken as a negative. Maybe "old man rage" should be taken as an indicator that the lessons of experience have been neglected? Instead, in our culture, it is taken to mean something foolish. I think this says something about our culture.

To be fair, I meant it in both contexts at the same time. A more polite term might be "opinionatedly retro". Go's conservative bend is both a positive and negative.

Somewhat like the reviews my lovers give me.

What about in your un-honest opinion?

I would rather people called it an application programming language. It relies heavily on it's runtime, and I've yet to see a piece of go code running on bare metal that isn't a hack or a proof of concept. However, I've seen multiple applications that rely on the OS to have nice abstractions over low-level inter-computer IPC (ethernet, tcp/ip).

I keep wondering how much success it would have had if the design team was still at AT&T.

Are other fields as bad as programming?

K-12 education for sure. Completely fad (procurement) driven, utterly disconnected from reality (best available science) with egos and bent morality added for giggles.

Mind the feedback loops and incentives.

And don't trust the opinion of anyone who hasn't maintained code they (personally) wrote and deployed +5 years ago.

Partially true, and Joe Armstrong (creator of erlang) makes similar arguments. But that doesn't work well for everything.

Erlang is more clear about their assumptions and goals. They don't claim to have an amazing new GC, they just say they have a simple generational GC, and that's all they need because they assuming many small heaps.

it's also not an STW GC so the pauses affect only small number of processes.

Where is the data kept?

The very large heaps are partially driven by distributed in-memory storage systems.

Hmm, well if the processing is microservices, presumably the data is in a NoSQL database.

More seriously, though, isn't most large in-memory data storage done off the garbage collected heap? I have to confess i'm actually not sure what anyone is doing with >20 GB of normal objects.

Search (elastic/solr), Cassandra, presto, and most other java backend storage and data processing engines need 20+Gb (at least) to be mildly useful. It's not uncommong to go with 64gb+ heaps there too.

Microservices/API won't help with lowering these.

yes, exactly this. and by having scalable GC algos so you don't need to do "off heap" tricks in java, things are much easier.

I'm sure people do use huge heaps, but most web applications I know of (microservice or not) store the bulk of their data in a (separate) database. So, unless your application is a database, which might want a huge heap so it can cache a lot in-memory, I don't think huge heaps are really necessary. IMHO, at least.

Since it's not really possible to do reliable giant user space caches in the POSIX model of things you'd usually delegate to the page cache.

I'd love to know about other GC schemes. Between no GC and whole VM GC. By type GC, localized sub tree GC.

edit: thank you all for your answers, much appreciatedly

Check out Singularity. It was a research effort from Microsoft Research to build a whole OS in a modified .NET

Singularity was a single address space OS, thus there were no "processes" in a traditional sense. They still had a notion of isolated spaces though partly to allow different heaps to be GCd independently, and to use different GC algorithms (with different barriers) exactly due to the lack of a one-size-fits-all approach. Messaging between these process-like-things was done by moving objects onto a "transfer heap" and back again, which boiled down to just a simple pointer write. There were no syscalls.

A very nice architecture, from a theoretical perspective. However it's not clear to me that for standard desktop-sized machines you really need segmented GC spaces these days. With an incremental collector you can probably get pause times in the milliseconds range for a 32GB heap and I guess 32GB is as much RAM as you need on a Pro-level machine. For servers of course you go bigger.

One interesting GC algorithm I came across did thread-local collections. Objects were tracked with write barriers to discover when they became 'globally reachable' and moved to a global heap. Threads could stop independently whilst others continued running to collect their thread local heaps. That opens up the possibility in desktop/mobile apps of having a main thread that's doing an animation of some kind in a low-garbage way, whilst the background thread does work as quickly as possible, with the background thread collecting independently whilst the animation thread still runs.

Erlang has a per-process GC - each Erlang process has it's own heap and it's own garbage collector. There's no VM-wide GC. This usually means the GC pauses apply only to a small part of your system, while the other parts are running completely normally. There's a good article that describes it in depth: https://www.erlang-solutions.com/blog/erlang-19-0-garbage-co...

Also if you have a lot of data you use an ets table which is more or less an in-process but off-gc-heap key-value database.

Thanks, I've been wondering how Erlang does GC.

Of course, the cost is extra heap fragmentation.

In Erlang case, that is not so much of a problem as most allocated objects have essentially same size. On the other hand this causes another problem that GC's and allocators in Erlang VMs tend to have enormous overhead because most most GC'd objects are very small with sizes comparable to per-object GC overhead.

I joked once that in lisp, a vast majority of memory allocation are basically malloc(2). I wondered if it made fragmentation irrelevant.

You might be interested in:

"A Unified Theory of Garbage Collection"


Basic idea is that tracing and reference counting are duals of each other. Reference counting finds dead objects, tracing finds live objects. Optimized variants of each of these tend to approach the other side.

I used to idly browse papers from programming language conferences, and there was often interesting GC stuff. PLDI is probably your bet bet - here's 2016:


When there were two GC papers:

Idle time garbage collection scheduling - "We implemented idle time garbage collection scheduling in V8, an open-source, production JavaScript virtual machine running within Chrome. We present performance results on various benchmarks running popular webpages and show that idle time garbage collection scheduling can significantly improve latency and memory consumption."

Assessing the limits of program-specific garbage collection performance - "We demonstrate that there is considerable promise to reduce garbage collection costs by developing program-specific collection policies."

Or going back to 2012:

And then there were none: a stall-free real-time garbage collector for reconfigurable hardware - "We present the first implementation of a complete garbage collector in hardware (as opposed to previous "hardware-assist" techniques), using an FPGA and its on-chip memory. Using a completely concurrent snapshot algorithm, it provides single-cycle access to the heap, and never stalls the mutator for even a single cycle, achieving a deterministic mutator utilization (MMU) of 100%."

That's by David Bacon, one of the runtime implementation greats, who has been after low-pause GC for years. A golden oldie from him is:

A Pure Reference Counting Garbage Collector - "When processor or memory resources are limited, the Recycler usually runs at about 90% of the speed of the mark-and-sweep collector. However, with an extra processor to run collection and with a moderate amount of memory headroom, the Recycler is able to operate without ever blocking the mutators and achieves a maximum measured mutator delay of only 2.6 milliseconds for our benchmarks. End-to-end execution time is usually within 5%."

As for by-type GC, i remember reading a paper years (decades?) ago, perhaps from Microsoft, where they'd tried putting every class in its own region, the idea being that when you wanted to collect some particular region, you could exploit your knowledge of types and their fields to only scan regions which could possibly have pointers to it (eg to collect BlogPost, you have to scan User). Clever, but didn't work well in practice, because so much garbage is lists or strings that can be pointed to from a lot of places.

ISMM, http://dl.acm.org/event.cfm?id=RE149, is also likely to have papers on garbage collection. The table of contents from 2016: http://dl.acm.org/citation.cfm?id=2926697

Nice! There's a paper in there called 'Block-free concurrent GC: stack scanning and copying' which opens by claiming that "on-the-fly Garbage Collectors (GCs) are the state-of-the-art concurrent GC algorithms today". I'd never heard of on-the-fly GC, so i've learned something!

However, on googling, i find that Edsger Dijkstra wrote about it in 1975:


Hey, maybe that's old enough that Go can use it?

I believe MS does in fact use a typed heap in IE, but for security reasons (some sort of UAF mitigation iirc).

Look at what Erlang's BEAM VM does. It has per-"process" heaps iirc.

Take a look at The Garbage Collection Handbook by Jones et. al. ISBN-13 is 978-1420082791.

> So, it's possible that the Go developers will get to ignore decades of progress, as they love to do, and still be competitive, because that progress was towards a different future to the one we arrived at.

Progress in the wrong direction is not progress.

Ignoring "progress" that was in the wrong direction is progress.

A lot of 'decades of progress' in software is similar to progress in College tuitions, Health Insurance premiums and Cable TV bills. Yes advanced solutions have been developed and now price has to extracted by any means irrespective of user needs or agreement.

> It has been known since 1984 that most allocations “die young” i.e. become garbage very soon after being allocated.

1983: A Real-Time Garbage Collector Based on the Lifetimes of Objects, Lieberman/Hewitt


  Since objects with short lifetimes account for a large
  portion of storage use, it is worth optimizing a garbage
  collector to reclaim storage for these objects
  more quickly.

I'd be interesting to know to what degree Go's aggressive stack-based allocation/escaping makes this hypothesis less strong than other languages.

For example, Java a few generations ago was very poor at escaping heap allocation to the stack and so much the opposite skewing towards heap allocation over stack allocation, so I'm sure the JVMs of this era were an extremely good fit for the generational hypothesis.

One thing I wish people in the PL/GC community would take more seriously is the idea of explicitly separate heaps or heap partitions. If it's a problem that pause times can be proportional to heap size with some approaches, then the problem can be alleviated by having more small heaps instead of one large. I've used this approach in my own code, including kernel code, to very good effect. Yet somehow the language features to support this properly have never existed in any language that I know of. At best we get heuristics to approximate it; more often, nothing.

Research is done. Here is a tech talk on a form of GC integrated into the JVM where the user can annotate regions:



The region markings are used as hints. Allocations are put into the region when possible and then the region is discarded at the end in one go, which is of course a well explored idea but this one is interesting in that you only scope out the regions and don't need to manually assign each allocation. Also allocations that naturally exist beyond the end of the region work properly without additional effort.

One issue is that it's often quite hard to beat a properly tuned generational collector even with techniques that intuitively 'feel' right. As one example, Azul experimented with aggressive hardware-assisted stack allocation on their custom CPU and abandoned it because they found it didn't really beat well tuned generational collection.

Awesome, thanks!

The problem is you either need the equivalent of the Rust borrow-checker to validate that the partition boundaries are respected or the partitions are just hints which makes the scope of optimizations much smaller.

It isn't GC but NeXTSTEP had allocation regions under this theory; that's where alloc:withZone: came from. In reality programmers mixed up the zones and there were way too many crashes as a result. Without some way to enforce the separation you just don't get much benefit.

For traditionally GC'd languages you still need some way to track dependencies between separate heaps, which means that you end up with what is essentially overcomplicated generational GC. Handling inter-heap dependencies manually on the other hand gets you into space of hierarchical allocators (see samba's talloc or apache's apr_pool) which is something that still needs some kind of manual memory management.

In Nim, each thread has its own heap :-)

If you need multiple heaps, you can already run multiple OS processes that communicate through message passing. Message passing is a little inefficient compared with shared memory concurrency, but you have to pick your poison.

.Net has AppDomains. You can't access heaps in one AppDomain from the other. I don't know if it's done so that GC can pause only one AppDomain; I always viewed it as more of a security and sandboxing feature.

erlang has this I think?


In databases, it's common to do arena based allocation. You allocate objects of similar lifetime (life of a query, life of a transaction, life of a procedure execution, etc.), and free them all at once when the lifetime is up.

When hacking postgresql, for instance, using pfree() is fairly uncommon. You just allocate, (which goes to the current arena by default), and then the arena is wiped out all at once later.

Of course, sometimes you might use a lot of memory temporarily in a loop, and you need to pfree() it each iteration. But that is the exception.

I think of each arena as a tiny heap (though of course they can grow large).

This was also an optimization you could do in Objective-C on NeXTSTEP/OpenStep/macOS. Allocating methods took an optional "zone" parameter (`allocWithZone:`, `copyWithZone:`), which was basically an arena.

In this case, the bigger benefit wasn't the allocation/deallocation – but the fact that you would be working with memory on the same virtual memory page, IIUC.

I've seen this strategy appear before in real-time situations where new allocations remain in a near-constant band: As long as you have enough memory to make it through one loop, you could use it as the defining strategy of your game engine or signal processor, at least w/r to temporary allocations.

I've also heard of GC algorithms using exactly this behavior for the nursery stage.

Arena allocation in C/C++ is doing something similar to generational GC. I was about to type "simulating it" but that's not quite right ... they exploit similar observations to similar effects in different ways.

In C/C++ you have to manually decide how short lived allocations are ahead of time and either stack or arena allocate them. Freeing all the allocations on the stack or the arena is then very fast.

In a generational GC the programmer doesn't annotate allocations like that ahead of time. Instead it is assumed that all allocations might be short lived. How long they actually live is then tracked and eventually they're moved out of the "arena" (young gen space) by copying them. But again, freeing the contents of the young gen is very fast because you don't actually touch dead data at all, you only copy the live objects.

I've heard of GCs that can actually do "pre-tenuring": when the VM knows an allocation will definitely live for a long time based on profiling data it can be allocated directly into the old gen space. That reduces your minor GC pause times further.

All this adds overhead of course, but makes automatic something C/C++ devs would do themselves. Given the potentially catastrophic security consequences of screwing up in the C/C++ model I've become a lot less sympathetic to that approach over the years. How much extra performance is a security advisory and possible resulting botnet worth to you?

Arenas have fast allocation (because you often have no freelist and just increment the end). They also allow you to quickly reclaim old data, which is important for a database because a query might run long enough for the allocated memory to be considered old (for one example). And they don't require compaction or copying, because objects of the same lifetime are physically near each other in memory.

It's fair to draw an analogy to generational GC, but they aren't the same.

Rust is an example of a language that can do arena allocation without the dangers of C++.

EDIT: Based on my experience with postgresql, and what I've learned about rust so far, I am starting to believe that object lifetime is a fundamental concept in CS. It's possible that sophisticated GCs have, to some degree, hidden the importance.

Arenas have fast allocation (because you often have no freelist and just increment the end). They also allow you to quickly reclaim old data, which is important for a database because a query might run long enough for the allocated memory to be considered old (for one example). And they don't require compaction or copying, because objects of the same lifetime are physically near each other in memory.

This is exactly how the Gen0 nursery works. The surviving objects get copied to Gen1 before the Gen0 bump pointer is reset. Of course, in C++ arenas there cannot be surviving objects or that is implemented manually.

The main difference here is that what is in the nursery is decided by the GC, whereas a manual arena requires you to pre-decide what is in the arena or not.

Arenas also have the benefit of incredibly quick allocation as well. Malloc/new can easily get you into performance trouble just as well as free.

WRT security Rust is my go-to these days(they also support both typed and un-typed arenas). Much better memory safety guarantees and C ABI makes embedding in a larger application straightforward.

Apache did the same thing in 1.3.xx (I'm not familiar with the 2.x codebase, it was not popular in 2000). You had a per-request pool to allocate from and once the request was done, the whole pool was free()'d. It was my first encounter with that strategy and it's always struck me as the sanest way to go for a long-lived process with C's memory model.

Mike Hearn criticizes the Go implementers for making misleading claims about building a "garbage collector for the next decade", but I don't see them making claims that they are solving garbage collection for any language other than Go?

There's an underlying assumption (both in the article and in comments here) that the results of garbage collection research are generally applicable across languages. It's not obvious that this is true. While you can certainly get a lot of ideas from previous research, memory allocation and access patterns can differ significantly between languages (due to different language features and common idioms), and what seemed like a win for one language might not actually apply for a different one.

I didn't get that at all. I read two main criticisms, both of which aren't explicit in the Go implementors' claims but are strongly implied by the language and style:

1) they've solved for Go 2) this is comparatively better (almost mockingly so) than other VMs because of the advanced nature of their implemtation

The post suggest that 1) is overreaching and misleading: GC is always about trade-offs. It suggests that 2) is also false, because their implementation is not particularly advanced and just follows from the trade-offs they made in 1).

It does rather seem like over-marketing what they've done, regardless of whether it useful in many Go use-cases.

Article has lot of guess work about Go's GC. However in comments section some shortcomings of his claims are pointed out including one by Go GC implementer, he added more guesses about his assumptions. This may all be very valid but there are no numbers for that in this piece.

There's not "a lot" of guess work. It's quite clear to read that it is mostly about GC trade-offs and technologies in the general sense, not especially critical of Go's implementation. There are very few absolute claims of good/bad, if any. A poster on a Go forum admits that the latency improvement cost 20% CPU usage, and the article doesn't call this necessarily bad.

You're making a very uncharitable interpretation, and unjustifyably calling them guesses only makes it more so.

It is lot of guess because there are no memory usage numbers or throughput numbers to get general sense of how Go GC is worse for applications.

Even in the thread about 20% CPU usage increase mentions that Go implementer considers this much higher than expected and is likely a bug. But author make it sound in article as Go guys are calling it reasonable tradeoff for low latency.

Yeah. I find it a haughty article because he's not acknowledging the elephant in the room, which is that GC is not a solved problem despite all the whiz-bang techniques ... and that the Go people aren't being dishonest or ignorant of these techniques, but rather, judging those techniques not to be so useful for the class of programs they care about.

The following statement from the article addresses this:

As Go is a relatively ordinary imperative language with value types, its memory access patterns are probably comparable to C# where the generational hypothesis certainly holds and thus .NET uses a generational collector.

I think that's reasonable. I would say it is on the Go developers to point out why they think Go would have seriously different performance characteristics from similar C# network applications using Task-based asynchrony (similar to goroutine + channel based asynchrony in Go).

I don't think this is an actual claim made by go developers. Their arguments for choosing the algorithm they did was (at least as far as I remember, feel free to point me to a source) a) go for implementation simplicity, readability and clarity first then b) once you figure out where it needs improvement, do that.

They didn't choose mark-and-sweep because they thought other algorithms are worse, go has special needs or anything else. At least this would be the first time I've heard about it. The same goes, from what I've seen, for most of the arrogance ascribed to them (in this article and elsewhere); it usually hinges on making up or misrepresenting things they've allegedly said.

I think the only point being made by the article is that their focus seems to be on pause times above all else.

a) I wasn't replying to the article, but to a particular comment about it.

b) I disagree with that summary of the article. I believe it's first paragraph should be an adequate indication of what it's point is, that's how you usually write articles, and it's distinctively not making that point.

And lastly, c) I also don't believe that this statement, if it were the point, is at all true. At best it's hyperbole, but even on the surface, of the list of 15 design criteria for a GC mentioned in the article, at least 7 are explicitly taken into account and considered an optimization goal for the go GC (not all on equal footing and some in a different way than the author might prefer, but claiming they were ignored is just false). I can definitely say that for pause times, pause distribution, compaction, concurrency, scaling, tuning and portability.

I keep seeing papers with the lede paragraph and sometimes the conclusions written in a different voice that the rest of the paper. I suspect "men of good faith but limited understanding" are contributing to the editing (;-))

Here is from Techempower benchmark C# vs Go[1]. To me C# is not looking very impressive. For memory I do not have a better reference than[2]. Here again C# seems to use 2-10 times more memory than Go.

1. https://www.techempower.com/benchmarks/#section=data-r13&hw=...

2. http://benchmarksgame.alioth.debian.org/u64q/compare.php?lan...

These tests don't seem to specifically test GC, so it doesn't seem to be a good comparison.

Comparing Go and C# as languages don't seem to be the goal of this exercise, either. Instead, the article was making the point that the garbage characteristics have generational components. This is true of C#, so I would expect it to be true of Go as well. If that is the case, whatever Go's current performance numbers are, they could probably be made better by a generational GC.

Isn't GC in some part defines application performance characteristics? Why would pure GC benchmarks be useful if an application with better GC performs worse than application with worse GC?

You don't seem to understand how performance benchmarking works.

Unless you seek to optimize a single application, a benchmark is meant to be broadly representative of a wide category of applications whose performance is dominated by the component being benchmarked.

For instance, an application which allocates no memory is a poor benchmark of the memory allocator, and it is also a poor general representative of program performance if most programs' performance is dominated by memory management.

Most of the article is about GC being a set trade-offs. I don't think that's an elephant here. The official promotional wording around Go is being criticised for not acknowledging the drawbacks of the approach.

Strange argument in the article. As I do not see Official Java materials are telling us about Java's massive memory bloat in typical data structures as this:


I suppose it depends what you consider official Java materials, but the presentations about Project Valhalla and the future of the JVM (by the official Java team) do discuss that sort of thing extensively.

However, this discussion is about GC, not generics specialisation. Java has a guidebook explaining the tradeoffs and how to configure your collector:


In this case official would be like Oracle release new version of Java with statements in release news like

"We continue to have significant memory bloat in standard Java Collection data structures. We added a higher overhead and slower G1 GC which will be default going forward. In Java 9 we also hope to break many user programs which depends on Sun internal APIs"

Now this sounds ridiculous because announcement of major product/feature are suppose to highlight advantages just like Go's new GC announcement blog.

Tradeoffs are discussed in mailing lists and bug trackers.

Oh, they have acknowledged drawbacks from the very beginning, they just never deemed them nearly as important as the author of this blog post.

Go is a pretty ordinary language in most respects: it's not all that different from other imperative languages. It's not like Haskell where the entire evaluation model, compilation approach and set of basic algorithms changes.

Also, I should note that I've never heard of a language specific GC algorithm. The closest I can think of would be conservative collection which is a variant of non-generational non-compacting mark/sweep for the case where you don't have pointer maps. Usually that means C++. In a conservative GC any memory contents that looks like it might be a pointer is treated as if it is. It turns out that integers that look like pointers but aren't are pretty rare, so whilst this type of GC can actually leak memory (!), in practice it is still very useful.

The algorithms don't need to be different for the results to be different. A lot of GC language research is based on Java, for example, where it's harder to avoid creating garbage, and therefore generational GC is a large win. It's not clear if that result applies to languages where typical code creates less garbage.

Early versions of Go used the Boehm garbage collector which worked okay for a 64 bit architecture but not as well for 32 bits since false pointers are more likely - another example of how a smallish language change can have big effects.

I'm not a GC researcher but I've had enough trouble getting reliable benchmark results to be skeptical of drawing any general conclusions from cross-language benchmarks. Performance tends to depend on implementation details.

> It has been known since 1984 that most allocations “die young” i.e. become garbage very soon after being allocated. This observation is called the generational hypothesis and is one of the strongest empirical observations in the entire PL engineering space. It has been consistently true across very different kinds of programming languages and across decades of change in the software industry: it is true for functional languages, imperative languages, languages that don’t have value types and languages that do.

Is it really? The linked paper is about Smalltalk. A lot of the research is about Java. Functional probably talks about Haskell and ML. All those languages rely on GC completely.

Is this hypothesis still true, when a language allows the programmer to mix different memory management techniques? For example, if you have smart pointer and stack allocation, do objects still "die young"? It might be that these young objects never reach the GC, because they are now allocated elsewhere, which means we have mostly long living objects under GC management.

I guess a selective use of the Boehm GC would be the only reasonable way to falsify the hypothesis. Did anybody try that?

We had languages like Mesa/Cedar, Modula-2+, Modula-3, Oberon, Oberon-2, Active Oberon, Component Pascal, Eiffel that allow for GC algorithms, stack allocation and if really required unsafe memory management.

Even .NET allows for it, but most .NET related research lives within MSR walls, and many don't follow it because Microsoft.

So we got Java as the main platform for GC research and only now, thanks to market pressure are we seeing such features planned for Java 10.

> So we got Java as the main platform for GC research

Perhaps we did, but it's not because it was the only option. Python has the same need, has always been open source, and has been very popular for as long as Java. Ruby was even more popular than Python for a while, JavaScript might be now. Why didn't much GC research happen in those environments? Is it because people who do GC research looked down their nose at mere scripting languages, even though out in the field all of these have been used for more serious kinds of development? Or is it something else?

Most python users aren't interested if you break C extensions, and the C extension interface strongly constrains how many parts of a Python VM can look -- have to maintain reference counts, can't change memory layout significantly, can't change how objects are represented.

Probably because those communities aren't welcoming much that type of research and rather use C than an improved VM stack.

There are better options to run Python and Ruby than CPython and MRI in terms of performance, yet they lack mainstream adoption.

My guess? The Python and Ruby developers have never shown any great interest in performance. Javascript performance only become recently popular.

EDIT: I should clarify that "Python and Ruby developers" is meant to mean the language implementers, not ordinary users. And this isn't intended to be a dig at those languages--only that they clearly have different goals for their languages than implementors of Java, Go, C++, etc.

Java has a think called Jikes, which is a JVM implemented in Java, garbage collectors included. Being implemented in a higher level language makes it much easier to hack on, and in a time-pressured academic environment that makes it a more attractive platform for research than VMs written in C++ (or worse, C). Hundreds of papers have been published on Jikes-based research over the years.

Even Cliff Click, one of the HotSpot Server Compiler authors, is nowadays pushing for doing exactly that, instead of C++ or C.

Also something that many aren't aware regarding Oracle's JVM development, is the decrease of C++ code, being replaced by intrinsics and better optimizations across releases.

I compared loc of C++ JDK9 vs JDK8.

JDK9 651408(C++) lines

JDK8 497588(C++) lines

I do not know the reason but newer JDK seems to have quite a bit more C++ code.

That was my understanding from following up the mailing lists discussions.

I think the short answer is that it's just a massive amount of difficult work to get one of these high-performance GC algorithms implemented, and it has to be highly customized to the language implementation.

Also, the original implementations of Python and Ruby are so slow in general that it's not clear that improving the GC is the best use of one's time.

In the open-source Common Lisp world, CMUCL and SBCL have for years used Douglas Crosher's multigenerational collector. Guido and Matz would have done well to use it too -- it's much better, probably, than either of theirs. It's still far behind the JVM, but the point is, Python and Ruby didn't necessarily need the very latest research; they just needed 1995 technology instead of 1975 technology.

It is more likely Oracle/IBM etc put more dollars where their GC is.

Python is using reference counting.

If you look at a language like Java, where most operations produce some amount of garbage, the observation holds true. If you have a function performing some work, then all memory allocated which is not referenced by the return result, is garbage. And the introduction of generational GC with Hotspot back then, did a lot to improve the Java performance. This is no golden bullet though, as when the young generation overruns during a function call, not all memory allocations local to that function call can be collected, and even worse, they might be promoted to older generations, where they are rather costy to get rid of.

Another view on this is, that ideally, you don't perform short lived heap allocations in the first place. Go for example tends to require less heap allocations than Java, and the compiler tries to put them on the stack, if they don't escape the function in which they are allocated. So Go manages to avoid those allocations, which generational GCs are especially good at cleaning up.

The idea of generational GC is still considered by the Go team - they plan to have a special heap for each goroutine. This heap would be collected when the groutine exits. This approach would combine the sound idea of generational GC with a pretty good heuristic for the life time of the young generation.

The stack has very little to do with it; see my other comment. The reason Go's GC needs to work less is because Go has arrays-of-structs, while Java doesn't (yet; should be coming in Java 10).

And IBM's JVM already has an extension called Packed Objects.


There are languages whose execution model inherently produce large amounts of short-lived objects as sideffect of execution of some code with no explicit allocations, Smalltalk and Scheme with call/cc are probably best examples with Python being comparable, albeit for different reasons.

On the other hand, very often it's convenient to just return some freshly allocated object out of function invocation (non-obvious, but trivial example: string), which is why the generational hypothesis tends to hold for almost any language.

Sure, but it's also worth pointing out that the class of programs that behave this way is the class of programs that are inefficient to begin with.

If you handle strings, and you are copying and freeing strings all the time, that's just slow code.

And it's not the case that GC makes the slow code faster... it's that certain techniques enable GC to not fail catastrophically on slow code.

It's a little bit disturbing that folks are ready to extrapolate this to some kind of universal rule.

In a high-end video game, for example, we hardly allocate anything short-term at runtime. The whole program is architected to avoid it. When we allocate, it tends to be things with medium-to-long-term life (texture maps, sound effects, render targets, whatever) so a generational system is useless there. In fact we don't use GC on these kinds of things at all, because we also need to control exactly when they are deallocated so we can put something in their place, because memory is limited.

You have things that only live for a short while do you not? Object pooling, stack allocation, etc are designed to solve that problem as I understand it.

While the objects in the pools or stacks may seem transient (on your screen for only a few moments), one simple technique is to only allocate those objects at load time and then flip them in an "on" or "off" state during gameplay. That way you're not doing any real allocation after your game space has loaded.

No, those are only seen as "ways to solve the problem" by the functional-style programmers who created the problem in the first place.

Stack allocation is what your computer is built from the ground up to do. It is not some kind of workaround or optimization, it is how software was originally designed to work.

Game engines are strongly generational too, you just don't notice because the short lived allocations are all put on the stack by the developer.

I think at this point you are abusing terminology such that it's meaningless.

I am not just talking about putting copies of things on the stack, but having few copies to begin with, etc.

> It might be that these young objects never reach the GC, because they are now allocated elsewhere, which means we have mostly long living objects under GC management.

Where elsewhere? The stack? One of the things people always get wrong about GC challenges is "what about stack allocation?". Modern servers come with up to 6TB of RAM. Applications that are interesting from a GC perspective (i.e., lots of concurrency, data structures etc.) commonly employ heaps that are between 10s and 100s of GBs today. The amount of memory available on stacks is 0.5-10% (thousands of threads are required for the high amount) at most. In other words, stack allocation does not remove the needle by much (although it may affect the frequency of young-gen collections). The Java team discovered that when they added escape analysis and stack allocation, and the effect was not dramatic. The stack simply does not matter that much in the grand scheme of things (arrays-of-structs, OTOH, matter a great deal).

Ok, then array-of-structs. Maybe obstacks (region-based allocation). All that stuff that cannot be done in Java, Smalltalk, etc. Does that invalidate the generational hypothesis or not?

> All that stuff that cannot be done in Java

Maybe not on the standard implementation, but just like C and C++, Java also enjoys some extensions.

Packed Objects in IBM J9


ObjectLayout in Azul


And in some standard form in Java 10


I guess we'll see when Java gets value types soon, but the Go people can answer that now; only problem is, I'm not sure people build as many and as-data-heavy applications in Go as they do in Java just yet. Also, realtime Java VMs do support arenas (even scoped arenas). So, I don't know the answer, but let's put it this way: I haven't heard anyone saying it isn't true -- and that would be a pretty big thing -- even though there should be enough indicators by now.

C# can do value types and arrays of structs and it has a generational collector, so presumably Microsoft felt that C# fitted the hypothesis.

Go does escape analysis and stack allocation. Presumably this would be pointless if Go allocations rarely died young.

Bear in mind that the hypothesis talks about allocations in general. That means things that get put on the stack count as "dying young".

It's common for Go programs to have millions of stacks. A million simultaneous goroutines in a Go program is pretty common. Not so much for other languages.

Not to mention that in Go you create and destroy stacks much more often (probably orders of magnitude compared to typical Java program).

Go stacks, however, are heap objects... The thing about stack memory is that it's constant size and easy to manage. Go stacks, however, are just heap data structures (I think they used to be linked-list of arrays, and now they're "growable" arrays, like ArrayList, but I'm not sure). It's true that Go uses a different GC algorithm for managing the stacks, but it's still not stack memory. Quasar fibers on the JVM use the same approach. You can have millions of fibers, each with their own stack, but the stack themselves are heap objects.

The fact that stack segments are allocated from the heap, has absolutely no consequence whatsoever. (And in fact, they are allocated from system memory, not from GCd memory.)

That's the main thing of consequence, because those stacks are managed like heap memory. For example, you could implement each and every object as a lightweight thread with a stack -- in fact, that's not too far from how Erlang does it -- but that doesn't change the behavior of that memory. What matters isn't what it's called or what programming abstraction is exposed to the programmer, but how it behaves, which in turn dictates how it's managed. If your program consists of millions of lightweight threads -- as it's done in Erlang, Java with Quasar, or Go -- the algorithms required to manage that "stack" memory are very similar to the algorithms for managing plain heap memory, and not at all like how (heavyweight thread) stacks are managed. So the answer to "how is memory be managed if more things are on stack" is "pretty much the same" if those stacks really behave like heap objects, and are themselves managed as such.

> the algorithms required to manage that "stack" memory are very similar to the algorithms for managing plain heap memory, and not at all like how (heavyweight thread) stacks are managed.

No, this makes no sense. It's managed absolutely identical to how any plain stacks in C are managed. Stacks are created and destroyed when scheduling units (threads or goroutines) are created and destroyed. Whether that memory is allocated by some user level library or the kernel (albeit with POSIX threads, you can, and in fact it's common for thread memory to be allocated by the calling process beforehand) has absolutely no relevance whatsoever. All that matters is that the lifetime of the stack is 1:1 mapped to the lifetime of the schedulable entity.

Of course in Go goroutine stacks are dynamic, which is an optimization that makes Go programs use less memory, but that has no semantic effect.

The thing that makes "plain" stacks easy to manage is that they're of constant size and long lifetime. If the the stack (and its schedulable entity) may be very short lived, and the stack itself may resize, then the stack requires the same (or very similar) dynamic memory management as a heap object. It is the precisely the semantics that make no difference here, but the actual behavior and lifetime of data in memory. It doesn't matter that the lifetime of the memory is that of the thread, if the lifetime of the thread is not at all like that of a heavyweight thread. From the perspective of the programmer it's a stack; from the perspective of the runtime -- it's a heap object (or close enough; there are some assumption you can make -- which Go makes -- that makes managing that memory a little easier, but those assumptions are, as usual, tradeoffs).

"Managing" stacks does not matter here. That's trivial. What matters is the lifetime of the data that lives on the stack. You correctly pointed out that in Java, the data on the stack represents a small percentage of all data, so escape analysis has a limited amount of effect on decreasing memory pressure.

However, in Go, this percentage is much higher because there are orders of magnitude more stacks.

But those stacks aren't stacks from a memory management perspective! They're just heap objects (that happen to be exposed to the user using a stack abstraction): The data in them is heap data, their lifetime is plain heap lifetime, and their memory management (which is what we're talking about and isn't at all trivial) is heap memory management (pretty much). It's just another way of abstracting objects via message passing. You can call your objects "stacks", but they're still the same objects.

Let me take this to the extreme: Imagine a language with no assignable variables at all, but with processes and channels. In that language, you're able to create an arbitrary mutable variable as a process with a channel, that can get `set` or `get` messages. The value of the "variable" is kept in the process's "stack". In fact, this is pretty much how the process calculi work (one of which, CSP, is one of the inspirations behind Go's design). That this is how you choose to design your programming abstractions makes no difference to the runtime implementation. Your stacks would behave just like normal variables behave in languages with different abstractions.

> But those stacks aren't stacks from a memory management perspective!

Yes they are, it doesn't matter who is allocating them, the kernel, or the program itself (which as I mentioned, also happens with POSIX threads).

> The data in them is heap data

No it isn't, not anymore than data in POSIX thread stacks is heap data. You do realise that the memory for POSIX thread stacks also comes from "the heap" (regardless of who is allocating it), and it's not from the process stack segment, right?

> their lifetime is plain heap lifetime

No, their lifetime is the lifetime of the function activation record, just like in C with POSIX threads. The lifetime of some value that lives on the stack is the lifetime of the stack frame, not the lifetime of the stack segment.

> and their memory management (which is what we're talking about and isn't at all trivial) is heap memory management (pretty much).

No, the memory management of values that live on the stack is stack memory management. Values live as long as their stack frame they belong to lives. Stack frame allocation is a pointer decrement, and stack frame deallocation is a stack frame increment (plus some amortised O(1) cost caused by copying stack mechanism).

The lifetime of the stack segments themselves does indeed resemble (or is) heap-based memory allocation, but that is true in C, Java, Go, and any other language, and doesn't matter (or, if you take the position that it matters, it matters the same in all these languages and environments). And allocating stack segments is trivial.

> It's just another way of abstracting objects via message passing.

??? Message passing!? Wat.


> Let me take this to the extreme: [...]

If you invent new language implementations and change the normal definition of various concepts, I guess anything can be true in that particular frame of reference...

We keep talking past each other. The objects that are allocated on the stacks are managed like stack objects, but the stacks themselves are heap objects. Storing objects on the stack is, as you say O(1) (or, rather, amortized O(1) for dynamic stacks), but the same is true for using some Stack class in an OO language. It is the stack itself, the container of those objects, that behaves like a plain heap object. This is how pretty much all heap objects behave: they all primitives in fields or arrays, but what the runtime needs to manage is the container. So, yes, you can store the same heap data contained in lightweight thread stacks rather than in objects, but that won't change how that data behaves.

When you create a runtime for lightweight threads (as I have), what matters isn't what the programmer sees but what the runtime sees. POSIX threads and lightweight threads appear the same to the programmer, but behave very differently from the runtime's perspective, and require very different memory management (and very different scheduling) algorithms. What matters to the runtime is: how big each stack is, whether it's constant in size or can grow and shrink, and how often it's allocated and deallocated. It's those parameters that make the difference, not the user abstraction. The point of my example was to clarify that the stack abstraction can be used in different ways. Yes, POSIX thread stacks are also managed in some heap, but their usage parameters make them very different from lightweight thread stacks; the latter behave like a plain object would.

In Erlang, Go and Quasar, the lightweight thread stacks are managed very differently from how the OS manages heavyweight thread stacks, and this is because the two behave very differently despite exposing the same abstraction. You keep repeating that lightweight threads and heavyweight threads are the very same abstraction but what matters is the usage parameters, and that's why their implementation is so different.

BTW, this is true not only for memory management, but for scheduling as well. The scheduling algorithm for heavyweight threads simply does not work well for lightweight threads and vice versa. The reason is that the patterns of blocking and unblocking, and the number of threads -- i.e. the parameters of the behavior -- are radically different, in spite of the abstraction being exactly the same.

I would think the GC's work would correlate more closely to the number of objects than the raw heap size, and presumably most applications that operate on such large data will have relatively few, very large blobs in the heap as opposed to 100s of GBs of <1Kb objects. Am I mistaken?

That very much depends on the application, but for "ordinary" server-side business applications the answer is no, there are a lot of relatively small objects, organized in various data structures: maps (caches, lookups), tree indices (in-memory search), all concurrent.

I work on a JVM app that routinely allocates GBs, and sometimes 10s and occasionally 100s of GBs, of small objects intricately connected. (It's a static analyzer, and the applications it analyzes can get into the MLoCs.) So there's one anecdatum for you.

The title of Go's GC announcement is "Go GC: Prioritizing low latency and simplicity", and I think that a variation of that would be a better title for this. Some users may prefer a different tradeoff, and I think it's important to know that there is a tradeoff. But as this article says "I have no doubt that many Go users are very happy with the new runtime.", and that seems like a worthwhile metric.


That there are tradeoffs is the whole point, the messaging makes it sound like there aren't.

The title says "prioritizing"; if that's not synonymous with "making tradeoffs", I don't know what is.

> But as this article says "I have no doubt that many Go users are very happy with the new runtime.", and that seems like a worthwhile metric.

A metric for what? For it making many Go users very happy, or for making rather big claims like "future of GC"?

Making your users happy seems like a reasonable goal. Neither the article nor the Go announcement suggest this is the "future of GC".

First line of the Go announcement: "Go is building a garbage collector (GC) not only for 2015 but for 2025 and beyond"

to me, that suggests they think they have the future of GC

I thought it was implied that they were talking about a GC for Go specifically.

There is a difference between "a future-proof GC" and "the future of GC".

Interesting article.

I haven't internalized all the parameters of the Go GC, but I wonder if there's a way to tune it on the fly. For example, if I want to run at 60 FPS (16ms/frame), and I've measured that actually doing all the CPU side work took 10ms, can I let the GC have ~5ms of GC time (rather than the <1ms advertised) at the end of my frame to increase throughput (read: less cycles burned overtime & better power efficiency)?

If not in Go, can I do this elsewhere like on the JVM?

Hopefully this isn't too off topic (since the article is about trade-offs of throughput vs latency).

That's a realtime GC and yes, you can buy a JVM from IBM that has one, they call it the Metronome collector:


Metronome has pause time in the hundreds of microseconds range. The tradeoff it makes is that, amongst other things, it requires read barriers i.e. every time you read a pointer from the heap you have to execute some extra code to interact with the GC subsystem. If you think about how often you read pointers from the heap in a typical program, well, that turns into a lot of barriers and a corresponding execution performance hit.

Per the linked article, if their claims are to be trusted, the barrier overhead is only 4%, which is amazingly low. Low enough that it's inconsequential for any application actually needing low pause times, if the claims are correct, which I unfortunately do not know.

Funnny, I remember years ago reading one of the patents the GC is probably based on, and thinking the entire thing was brilliant.

FWIW, Go 1.8's GC uses essentially the same techniques as Metronome.

I don't think you can do it with Go. But you can do it with Nim for example, there if you set 3 ms max GC time you have guarantee that you will only get 3ms of GC time. You can set max time for Java GC also, but from what I seen it's not so strict on G1 collector.

Is there a way to ensure that the amount of time you set aside for GC is, err, enough? I imagine the outcome of it being insufficient would be quite bad... eventually.

I don't recall a lot about it, but what you described is "real-time garbage collection". If you google that term, you should find some work/writeups related to it :)

But why does Go prioritise low latency? I would think that low latency is not super important for web-servers since there are various other factors adding up to the total latency (what i mean here: time spent between call of the http request to finished receiving all bytes).

Consider jitter, rather than latency. If you're building a video calling service. You don't want random 10ms gaps between your frames.

Also consider fan-out. GC pauses add to the long-tail of latency. If your service depends on many backend services, your overall latency may well depend on the maximum of all your backends.

Processing video in a GC system is best handled like some other things that are continuous ongoing processing without doing a GC.

1. Allocate all your structures up front.

2. Your main video processing loop (or compression, encryption, signal processing, etc) does not do any memory allocation.

This is also like writing games in Java.

It prioritizes latency because it's written by Googlers, and Google prioritizes latency. I don't think we need any more complicated explanation.

From thew point of view of a capacity planner and performance engineer, one goes for low latency first, because for many algorythms, the best throughput is seen when latency is lowest. Some cases will be different, but starting with removing time wasted sitting in a queue is A Universal Good (;-))

... because a typical Google service will have to call many other services to process a request (keyword: fan-out). The effective latency is badly impacted by the _worst_ latencies of these services.

the world of computing isn't just web servers. besides, the vast majority of language implementations already favor throughput over latency when it comes to GC. when latency is a requirement for your project you have no choice but fall back on languages like C. it would be nice to have a viable alternative with managed memory.

There is still hope for a Modula-3 replacement with the ongoing work on Swift, .NET Native, D, Nim.

Even if it still isn't quite here today.

I know you are being facetious but is there actually a connection between Modula-3 and Swift, .NET Native, D, Nim?

Just a set of common features.

Memory safe system programming languages[0] with a GC by default, value types which can be allocated on global data/stack/registers, generic programming, ability to have more control over resource management than just finalizers, allocate memory outside GC control when performance requires it.

[0] .NET Native is not yet there, but is incrementally getting a few of the System C# features

yeah, but when i hear go i think of web servers ;)

It sounds like the Go GC would be well optimized for video games, where a 20ms pause would cause a dropped frame.

When processes are short-lived, sometimes the best GC is exit(2).

One thing I don't miss now writing in Swift is a garbage collector. Are there other languages that do something similar to Swift? It seems most languages either use a GC of some kind or make you do your own heap management.

Swift has a garbage collector.


Check chapter 5.

I think it's both fair to say that reference counting is a form of garbage collection (dynamic lifetime determination is dynamic lifetime determination), but also remark that reference counting has notably distinct performance characteristics (some good, some bad) compared to a tracing collector.

Both being GC algorithms, what one should discuss is implementation advantages and disadvantages of each algorithm.

Just like sorting or graph navigation algorithms get discussed, for example.

However somehow we get this pop culture of GC vs RC instead.

To me, a true garbage collector means any cyclic references can be cleared. That is not the case in Swift... you can still leak memory.

Which is not true if a data structure happens to have a strong reference, like in a cache or event handler, thus "leaking" in a true garbage collector.

That doesn't defeat the definition of a garbage collector. You wouldn't then say no GC exists... but in the case of no inbound references then that's memory that can definitively be cleared.

C++'s shared_ptr (not a language feature but in standard library), Python (has GC but only for detecting cycle references AFAIK), Perl, PHP. Lots of OOP libraries for C: COM, GLib. RC, ARC in Rust.

Actually C++ got a tracing GC API in C++11.

As for the rest, RC is GC.

Check chapter 5 in one of the most widely academic accepted books about GC algorithms.


> Actually C++ got a tracing GC API in C++11.

Technically true, but in reality it is just a token effort to please Hans Bohem and get him to focus on designing the C++ concurrent memory model instead :).

As of today, I don't think any implementation actually offers optional GC, so those functions are alway just trivially implemented.

Are GCs scaling well to new hardware? Memory sizes are increasing, as are core counts. But if those extra cores are being used to GC the larger memory, are we effectively losing those extra cores?

That's partly what the article is about.

In a pure mark/sweep GC, as the heap size goes up, the work needed to complete a GC cycle and free some memory goes up too. Thus the more memory you use the more cores you need.

In more complicated collectors that isn't necessarily the case: collection is incremental and done only as much as needed to keep up with the creation of garbage.

As I say in the article, there are reports of people running JVM/G1 on terabyte sized heaps. The problem they have is that if they experience a concurrent mode failure (which G1 is not designed to optimise because in a properly tuned system it's meant to never happen) you're looking at, like, a 5-10 minute pause time whilst the entire heap is mark/sweep and cleaned out as much as possible. So with such huge heaps you'd better make sure the collector doesn't fall behind!

I can't wait to see how ROC (Request Oriented Collector) [1] will benefit Go GC. Go is mostly used in web services which means you could collect garbage from whole request when it's over. ROC will try to do that. You can read more about it in the link that I've provided.

1. https://docs.google.com/document/d/1gCsFxXamW8RRvOe5hECz98Ft...

It's addressed in the article. Basically, you can mimick that with a generational GC as long as the youg generation space is large enough to hold all objects from a request.

Note that advanced collectors like G1 auto-adjust the young gen size as the program runs. This means pause times can take a little while to settle, but after the GC warms up you should find it's more or less like the ROC:


(note: that article is about a rather old version but the graphs illustrate the concept)

This may not help you if you can't tolerate warmup latency spikes. In that case you need to do some GC tuning to tell the collector enough that it can get it right from the start.

> implementing a generational collector requires the ability to move things around in memory

While this is true for traditional implementations with separate heap regions, it is not necessary. One can easily represent new and old generations as linked lists in per-object overhead and just change pointers in such overhead to move objects between generations. There is at least one Smalltalk VM that works in this way and IIRC traditional BEAM (ie. non-HiPE) Erlang VM has GC that works in essentially same way.

> One can easily represent new and old generations as linked lists

Sure, it's easy but non-optimal. A side-effect of a moving generational garbage collector is heap compaction and better temporal locality.

The Boehm collector claims to be generational, and doesn't move memory, but I haven't found the details of its generationality.

The relevant paper describes it as "mostly-parallel, mostly-generational". The idea is that it works like traditional tri-color mark and sweep collector, but white objects stay white between partial collections unless they are modified (ie. reside on memory page that was modified and thus caused memory access trap). In essence it seems to me that difference between generational and incremental/parallel mode lies only in what causes white objects to revert to black (ie. if it happens after each sweep (incremental) or only when sweep does not produce enough free memory (generational)).

Why use garbage collection at all when value semantics and smart pointers cover virtually all the use cases for GC, much more efficiently?

A generational GC is way more efficient than `std::shared_ptr`. `std::shared_ptr` is implementing a reference counting garbage collection, which is like the most inefficient way of implementing a GC.

FYI: - Allocating memory in a generational GC is as fast as allocating something on the stack. It's basically changing the value in a register. It's order of magnitude faster than calling `malloc()`. - A generational garbage collection is slower than freeing a stack frame, but has a lower per item overhead than calling `free()`.

tl;dr: for most programs, generational GCs are the fastest at allocating dynamic memory.

Yeah, but all that efficiency in allocation is paid for on the back end when all threads using the same heap grind to a screeching halt to do GC. Real world applications (incl. web apps at scale) do not need to perform allocations fast. They do need to run within strict space and time constraints. Value semantics and smart pointers guarantee these constraints. GC does not.

GC doesn't preclude value semantics or lack deterministic release of resources, not all languages are Java.

Also not all GC algorithms imply stop the world.

Regarding RC, it still is a GC algorithm, and Herb Sutter had a nice presentation where he shown how a cascade release of resources with RC algorithms can lead to indeterministic release time and worse, stack overflow.

> A generational GC is way more efficient than `std::shared_ptr`

that's true, but it's also a moot point because std::shared_ptr is almost never the semantics you want. You want std::unique_ptr.

Also, you can get away without allocating memory dynamically in a lot of cases (thus not incurring overhead of malloc).

After C++11 and Rust I'm starting to think garbage collection is holding us back from doing research into better approaches to automatic memory management.

> that's true, but it's also a moot point because std::shared_ptr is almost never the semantics you want. You want std::unique_ptr.

unique_ptr still has the malloc()/free()[1] overhead, I'm assuming the parent poster just misspoke. One major issue with shared_ptr (which unique_ptr doesn't have) is that it must work properly when different threads have different shared_ptr's (copies of each other) pointing to the same object. This adds overhead.

> Also, you can get away without allocating memory dynamically in a lot of cases (thus not incurring overhead of malloc).

Ok, but we're specifically talking about GC and RC here, so I don't see why you'd bring that up.

> After C++11 and Rust I'm starting to think garbage collection is holding us back from doing research into better approaches to automatic memory management.

GC literally is automatic memory management. (And as others have pointed out, so RC is a form of GC.)

Also, there's lots and lots of research into various different ways of doing automatic memory management.

[1] Alright, they don't literally use malloc/free, but you know what I mean.

> Ok, but we're specifically talking about GC and RC here, so I don't see why you'd bring that up.

well, my point was, in garbage-collected languages you don't have the option to never even engage the garbage collector by creating variables on the stack, while in languages like C++ you can avoid the overhead of malloc/free by using stack variables.

> GC literally is automatic memory management.

It is, but there are other approaches to automatic memory management that don't introduce another entity that does stuff to your program at runtime. C++'s smart pointers is one such approach, see also Automatic Reference Counting and Rust's borrow checking. I wish there was more research done into compile-time techniques to prevent resource leaks.

> well, my point was, in garbage-collected languages you don't have the option to never even engage the garbage collector by creating variables on the stack, while in languages like C++ you can avoid the overhead of malloc/free by using stack variables.

Oh, I see. I can see what you mean, but there are solid techniques for avoiding GC in e.g. JVM-based langauges -- for an example see e.g. Disruptor/Aeron.

I'll happily grant that it's a problem that it can be quite hard to see if you're actually avoiding GC when using these patterns. It'd be interesting if there were a "@no-gc" annotation which could enforce such patterns on the source code. (So, essentially you'd have a compile-time "GC by default, but you can opt-out and the compiler will tell you when you violate that". One can dream.)

The main problem with std::shared_ptr as a library type, is not being able to forbid developers to call get() or pass a reference or pointer to it.

agreed, but that doesn't affect performance (i was responding to grandparent's performance argument).

Calling malloc is not that expensive if you're using jemalloc or any other modern allocator. And with generational GC allocation is not just a register increment because Java has mandatory object initialization and all objects will be pushed to 2nd gen heap eventually. std::shared_ptr is quite fast when been used correctly (no excessive sharing between threads and no abusing).

Not all languages have Java's constraints, and "using something correctly" necessarily implies a cognitive cost, which might is probably less than the cognitive cost of optimizing around a GC, but worse than not optimizing at all (which is often appropriate for the majority of code paths through most applications).

Because calling malloc() & free() is expensive? That's just an implementation detail of your reference-counting GC or malloc(). You can also just pre-allocate memory in batches which makes allocation just changing a pointer to some pre-allocated memory.

Many reference based GC's do that, e.g. perl. Or have I entirely misunderstood your comment?

One problem amongst several with allocating memory in batches (or using a pool allocation scheme) is that you end up making poor use of the cache. After a while you have a lot of free memory mixed up with the used memory in your pool (see the article - the generational hypothesis), but because it's spread out a lot of that unused memory gets loaded into cache lines when accessing the memory which is still being used.

Having said all that, pool allocators can work really well for certain workloads, especially client-server stuff. Apache uses a pool allocator for serving requests. Pool allocators also fit well with languages like C/C++, but that's just because those languages are very poorly suited to GC, there are much better designed languages which integrate better with modern GC.

malloc and free are indeed far more expensive than allocation in a runtime system with copying GC.

> A generational garbage collection is slower than freeing a stack frame, but has a lower per item overhead than calling `free()`.

Almost, but not always. The Cheney on the MTA [1] paper describes a generational GC where the first generation is the stack. So for collections on the first generation, collection is equivalent to clearing a stack frame.

[1] http://dl.acm.org/citation.cfm?doid=214448.214454

I don't think you quite understand the problem... The cost isn't in literally freeing objects but in doing the mark/{sweep,copy}. So Cheney's collection is not equivalent to decrementing the stack pointer.

The only special things about the stack per se are a) arena[0] allocation and b) special ISA and OS support. When people say "stack allocation" the important points are (a) as well as c) static lifetime analysis. These are different concepts: Rust, for example, achieves (c) on the heap as well as the native stack.

[0] I.e., three-pointer, although for the stack the OS manages at least one of them

Your conclusion may be correct, but you're confusing "generational" GC with "copying" GC. There can be generational GCs that use freelists (eg. Boehm) and copying GCs that are not generational (I think the first copying GC for Smalltalk was like that).

>It's order of magnitude faster than calling `malloc()`

That depends on the underlying heap, Microsoft Low Fragmentation Heap is very fast for example.

The big plus with reference counting is that you (can) get a very attractive scoped release of resources.

GC shifts the cognitive burden off the programmer for the default case. I've done more than my share of programming in C++ and Rust, and I spend a lot of time thinking about memory management when it really doesn't matter. Personally, I prefer to only think about performance concerns when I'm working on something that needs to be fast--and for many applications these hot paths aren't a majority of the code base.

I definitely agree that it shifts a burden in the easy case, but once you develop an intuition (which happens at a different point for everyone) for what's going on, in my experience, it fades into the background. And when you make a mistake, the compiler is there to tell you what you did wrong, you fix it, done.

YMMV, of course.

I agree, but intuiting still takes more brain power than not having to choose at all. After programming in languages like Rust and C++ for the better part of a decade, intuiting about memory is still noticeably more taxing than using a GC. I don't know how common my experience is.

The pointers can form loops.

Alas, having no garbage collector at all is a blissful state of simplicity when performance matters. It's not very difficult to clean up one's objects manually in a well-thought-out codebase.

Experience (i.e. the number of memory-related bugs) suggests otherwise.

What about smart pointers? It gives the flexibility and benefits of garbage-collection without the performance degradation. It's not a silver bullet but it helps.

What I'm getting at is, Rust is the only modern language (in vogue at the moment) that does not use garbage collection. It would be nice if more languages didn't require a garbage collector but gave the option to use one.

If there are some more of such languages please chime in and provide a link to them! :)

> flexibility and benefits of garbage-collection

Not really, as it doesn't allow/collect cycles.

> without the performance degradation

Only for single-threaded code.

I agree with the rest of your comment!

A smart pointer is just a way to add custom behavior when it's created and destroyed. The question is what the smart pointer does. Usually the answer is update a reference count, which happens so frequently that it ends up being a lot more expensive (especially in bus bandwidth) than occasionally tracing live objects between long periods of useful work.

D has optional GC. However in hindsight it is perhaps not so smart. The problem is that it fractures the ecosystem: some libraries require GC and others don't. If you don't want to use it you have to rely only on non-GC libraries.

Smart pointers are as hard to manage as pointers.

There are different kind of smart pointers and you gotta think what you want to do with them. A GC is easier.

GC is certainly easier. However, it is important to understand the life-cycle of objects regardless. To be able to control them (if desired) is good because in many cases one would prefer to have that control instead of being forced to use garbage collector.

In an ideal world, garbage collector makes sense because we don't want to have to worry about life-cycle management when we are solving other problems. Memory management can indeed be an annoying implementation detail, but as it stands there are certainly limiting factors in hardware which are impacted by generic object-management for specific application domains.

GC isn't the the only way to achieve memory safety.

> well-thought-out codebase.

Which in real life means a single developer ownership codebase.

Or a meticulous team of very high quality engineers with top notch documentation and design practices. Note that I do agree with you, however, because as you said:

> in real life

In real life most companies don't want to pay for "meticulous team of very high quality engineers with top notch documentation and design practices".

They rather pay for outsourced consultants that deliver something that works within a fixed price contract.

Knuth's code has bugs. NASA's code has bugs. There's no reason to believe that high quality software engineers even exist.

High Quality != 0 bugs.

(Small) teams of high quality engineers do exist, but are very expensive.

And most humans are lazy f%$#s who throw things away rather than recycling them.

> It's not very difficult to clean up one's objects manually in a well-thought-out codebase.

Given the rarity of a well thought out codebase, I'd say it is. As soon as multiple people work on something confusion always ensues.

I read "typical worst-case pause time" in go and don't know whether to laugh or cry.

We are seeing massively longer pauses in production and the suggestion is to go to 1.8, which is not ready.

Reminds me of ruby - religious people practicing on production to get the nirvana.

Yes, it's getting better, or at least less bad, but yeesh. We'll just waste resources architecting around long pauses, and rewrite critical stuff in C, for a few more years...

We should focus on multiple small heaps instead - like erlang does it. Low latency, high capacity, no stop the world, simple implementation.

I thought this was going to be about actual garbage collection and municipal infrastructure.

That's what I get for being non-tech and browsing HN!

So, why GC instead of region memory allocation like in Rust? (and some other less known languages)

Rust doesn't use region-based memory management, though the borrow checker does somewhat resemble it. And the reason to use a GC is that it often works well enough for tasks that aren't especially demanding.

I would love to see an implementation of this http://wiki.luajit.org/New-Garbage-Collector

Why? It's still designed to use the worst of the three possible strategies, which Mike tend to ignore. A modern GC is compacting, with mark&copy or copying, if you have enough memory. Mark&sweep is only good for small devices with not enough memory, or a FFI or bad ABI/VM. Walking the heap is always at least 10-100x slower than walking the stack.


Because I don't know a lot about garbage collectors, but I am familiar with Mikes amazing work on LuaJIT and I trust that he knows enough about what he's talking about (based on the quality of his luajit work) for enough of his claims to be true for this to be interesting, useful or both.

The thing you have to understand about Mike's work on the new LuaJIT GC design is that it's heavily constrained by the Lua API. It's not a very interesting GC, though it does improve slightly on the standard tricolor incremental marking GC. See my reply to your other post.

I implemented something heavily based on that once¹, though no work has gone into optimizing much. It's not particularly state-of-the-art, but it's a good design for what it is. I'll probably use the quad-color strategy in my next incremental GC.

Note that while Mike's is strictly non-copying/non-compacting (because of Lua's API) mine does copy between generations and compacts the oldest generation. This makes arena management simpler and faster. I think that's the only significant difference between my implementation and Mike's design.

I suspect it's still a bit buggy, though it passes my test suite™. Non-trival GCs can be tricky. And to reiterate, it's not especially fast, because no work has gone into making it fast. (It's not slow either, because the design is good, but don't expect performance competitive with current production GCs².)

1. https://github.com/alpha123/yu

2. Well, it probably smokes Ruby and Python, but those go solidly in the ‘toy GC’ category.

>But if you take one thing away, let it be this: garbage collection is a hard problem, really hard, one that has been studied by an army of computer scientists for decades. So be very suspicious of supposed breakthroughs that everyone else missed. They are more likely to just be strange or unusual tradeoffs in disguise, avoided by others for reasons that may only become apparent later.

I really take issue with this. Imagine all of the progress that would have been delayed or quite possibly not achieved had people not persevered against the traditional thinking. Yes, GC is a hard problem but just because something's been researched for a long period of time does not mean there is no room left for innovation. Breakthroughs come some people that think differently not from people that think it's all been done before.

Excellent post, thank you.

Applications are open for YC Winter 2022

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact