Interesting that they are publishing the algorithm before implementing it. I wish more people did this. This way, if the algorithm doesn't work out, we'll have some way of knowing. If they hadn't published the algorithm, tried it, and then abandoned it, then nobody would have any way of knowing that there had been a negative result.
I don't have high hopes for this algorithm, because my intuition is that transactions aren't common enough to make a big difference. Looking forward to seeing the implementation and empirical data.
"Transactions" here doesn't mean roll-back-able units like in a database, just small bits of code that terminate (and, the authors hope, leave some garbage behind that you can clean up without scanning a bunch of RAM). Request-oriented collector would have been about as good a name.
This is kind of a spin on generational collection, built around the observation that a lot of Go programs are HTTP/RPC/whatever servers.
Yes, I've often considered that this is probably how Java's collector tends to behave in practice, particularly with the eden and young generation. If you're handling an RPC call that returns promptly, it's probable that all of the structures you've allocated on behalf of that call will be part of the young generation and can be efficiently collected.
Indeed, it's one reason why I've assumed that Rust's memory model might turn out to work well for services (not having had a chance to try it): there's a clear entry point where the request begins and ends, and all of the memory allocated from that point onward can be discretely tracked and freed upon completion. In the services that I've built, the majority of the allocations are for these short-lived objects, with a relatively smaller amount being related to long-lived objects like connections to other services or caches. I imagine that websites and services might behave like this, though I haven't studied it carefully.
I'll be interested to see how transaction-specific optimization turns out, and whether it can do better than a general purpose generational collector.
It's also not exactly original to Go... the Ur language, for instance, employs a request-oriented collector (arguably, though the documentation claims it does not use garbage collection).
Thanks for the link! I was actually at that talk :) To quote from the relevant slide:
* Transactions are integrated into Ur/Web at a deep level, so, whenever we run out of space, we can always abort the execution, allocate a larger heap, and restart.
* As a further optimization, we use region-based memory management, inferring a stack structure to allow freeing whole sets of objects at key points during execution.
Not sure how to reach twotwotwo directly. I was hoping to get permission to rename the GC to ROC from TOC ("request" instead of "transaction"). It's simply a better name and the bird makes for better visuals on the slides. I'm Rick Hudson, rlh@, and can be emailed directly at golang.org.
> I don't have high hopes for this algorithm, because my intuition is that transactions aren't common enough to make a big difference.
Every request served by net/http's Server, net/rpc's Server, and similar gets its own goroutine. That is to say, its own transaction. Memory allocated within that transaction will be associated with it. Should make a positive difference for server software.
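A minimal sketch of that pattern (the handler and payload type are hypothetical, just to show where the per-request garbage comes from):

```go
package main

import (
	"encoding/json"
	"net/http"
)

// respPayload stands in for the per-request data a handler builds up: it is
// allocated when the request arrives and becomes garbage as soon as the
// handler (and the goroutine serving the request) returns.
type respPayload struct {
	Items []string `json:"items"`
}

func handler(w http.ResponseWriter, r *http.Request) {
	// Everything allocated here lives only for this request/"transaction".
	p := &respPayload{Items: []string{"a", "b", "c"}}
	_ = json.NewEncoder(w).Encode(p)
}

func main() {
	http.HandleFunc("/", handler)         // net/http serves each request on its own goroutine
	_ = http.ListenAndServe(":8080", nil) // hypothetical port
}
```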
While trying to implement the new LuaJIT GC, one of the things I tested adding early on was support for explicitly creating arenas from Lua and setting the two active arenas that allocations happen from, to support this kind of allocation pattern. The minor GC mode also works well with this pattern, only having to touch modified objects and only sweeping away fresh white objects.
Interesting. The transactional hypothesis sounds like a refinement of the generational hypothesis.
Seems like the main advantage of this transaction-oriented collector over a typical generational collector is that it precisely identifies the old-to-young pointers and thus may scan fewer objects in total.
I'm having trouble seeing how this is actually different from a normal generational collector to begin with in practice (as opposed to implementation details).
The differences I read into it are:
- There is one heap rather than a young and a tenured heap, but tenured objects are marked as such.
- Rather than moving an object from one heap to another, the write barrier tags an object that gets written into a tenured object as tenured itself (and presumably all its children). Not all collectors move actively (as opposed to lazily at collection time) to the tenured heap, but I'm pretty sure it has been done.
- 'Young generation' collections happen at a specific time, when a goroutine terminates, and work more or less like a quick version of a semispace copying collector, in that they just drop all objects not marked reachable.
So there is interesting stuff in there, and it seems like a very appropriate approach for an event-driven environment, but it still seems like a generational collector to me.
Also, I'm confused how the separation of the goroutines is maintained in this heap. When a goroutine advances its current pointer, what happens if it bumps into another goroutine's initial pointer?
> - There is one heap rather than a young and a tenured heap, but tenured objects are marked as such.
That's a big difference. One of the largest, if not the largest, advantages of a standard generational GC in practice is bump allocation in the nursery. If I understand correctly, because this GC doesn't unconditionally tenure surviving objects (instead only tenuring objects that cross thread boundaries), the algorithm loses fast bump allocation, because objects in the nursery can become noncontiguous.
Right, I was actually going to comment that the allocation scheme in this algorithm seems painfully slow compared to the kind of allocation you can get out of a separate nursery where you're just allocating the next free space. I forgot by the time I finished the post.
First-fit seems like it could be quite slow here.
The implication seems to be that they can at least avoid locking for allocation (and deallocation and promotion) as long as they find a fit within the reserved region for the goroutine, but I'm still confused how they maintain this separation safely.
They like their (mostly-)concurrent collections, and concurrent and compacting together seem harder to do (think e.g. Azul) than just concurrent or just compacting. And of course without compacting, a slower allocator that can use already-fragmented memory becomes attractive.
Don't know how much of a factor it is in this design, but might also just be expensive to introduce heap compaction to Go now, after going this far with heap objects having stable pointers. Might be a lot of random spots in the code that still assume that.
Not that there aren't benefits to compacting, including the bump allocation, less bloat as programs run a long time, and help with locality. Just, describing where Go's at.
> I'm still confused how they maintain this separation safely
Don't know about this design, but in other places they have some goroutine-local data structures and then do the expensive thread coordination stuff after they run out of resources. Seems plausible that could be the general approach here too.
I read the other day a comment that I thought was pretty interesting regarding compacting vs non-compacting. The idea was that compacting took the amortized cost of searching for free space and shoved it all into a single time span. Allocations are then just simple fast increments.
EDIT: Found the post.
> In the CLR allocation is blazing fast: just increment the heap head pointer. You don't have to take care of problems traditional allocators face: find a big enough fitting free hole. This has a cost: when objects are destroyed, the free memory on the heap becomes fragmented. This needs to be compacted, and this is where the GC pause is needed, as the references for moved objects need to be updated. Basically the cost of finding a free hole in memory at allocation was moved here.
> This was done intentionally, as studies found that most real world applications produce a load where most objects are short lived: this cheap allocation is needed, and a multigenerational GC can take care of them cheaply after destroyed (as for most of the time you only scan and free the youngest generation for unreachable objects, limiting the GC pause time).
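A toy illustration of the trade-off the quote describes, with made-up allocator structures (this is not how any real runtime is implemented): bump allocation is just a pointer increment, while a first-fit allocator has to search for a fitting hole.

```go
package main

import "fmt"

// Toy bump allocator: a slab plus an offset. "Allocating" is an increment;
// reclaiming space requires compacting/resetting the slab, which is where a
// copying/compacting GC pays its cost.
type bumpArena struct {
	buf []byte
	off int
}

func (a *bumpArena) alloc(n int) []byte {
	if a.off+n > len(a.buf) {
		return nil // out of space: a real runtime would collect/compact here
	}
	p := a.buf[a.off : a.off+n]
	a.off += n
	return p
}

// Toy first-fit free list: allocation walks the list looking for a big
// enough hole -- the cost that compaction moves to collection time.
type hole struct{ off, size int }

func firstFit(holes []hole, n int) (int, bool) {
	for i, h := range holes {
		if h.size >= n {
			holes[i].off += n
			holes[i].size -= n
			return h.off, true
		}
	}
	return 0, false
}

func main() {
	a := &bumpArena{buf: make([]byte, 1<<10)}
	fmt.Println(len(a.alloc(100)), a.off) // 100 100

	holes := []hole{{off: 0, size: 64}, {off: 128, size: 256}}
	off, ok := firstFit(holes, 100)
	fmt.Println(off, ok) // 128 true
}
```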
Plus, Go has now acquired a culture of heavily optimising code to "not allocate", so allocation is less of an issue for it than for runtimes like Java.
It's not that those libraries don't allocate; they work with stack allocation / caller-provided buffers. You can reuse the provided buffers / output objects... but a lot of req/resp code doesn't, so it's just pushing the cost higher up the stack (app vs. libraries). Which itself is a good argument for a "transactional" GC.
> Seems like the main advantage of this transaction-oriented collector over a typical generational collector is that it precisely identifies the old-to-young pointers and thus may scan fewer objects in total.
I'm confused as to what this means. Write barriers, just as suggested here, are the standard solution for this issue in generational garbage collectors, and they precisely identify the old-to-young pointers as well. (In fact, the nursery and tenured generations are often separated in memory, so you can test just by masking a few bits off.)
I agree with the sibling comment that this seems to be a simple generational GC, just with a different policy for tenuring than the typical textbook one.
The standard solution is that the write barrier updates a card table, which generally reserves a bit for every page. When it comes time to mark through old-to-young pointers, the collector scans the card table - if it sees a bit set in the card table, it must mark through /all objects/ on that page - even if only one of the objects on the page has a cross-generational pointer.
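A rough sketch of that card-marking scheme (card size, table size, and structure are invented for illustration; real collectors differ in the details):

```go
package main

import "fmt"

const cardShift = 9 // one card per 512 bytes of heap (illustrative)

// One dirty byte per card; many collectors use a byte rather than a bit so
// the barrier is a single unconditional store.
var cardTable [1 << 20]byte

// writeBarrier is what the compiler conceptually emits on a pointer store
// into the heap: record that the card containing slotAddr may now hold an
// old-to-young pointer.
func writeBarrier(slotAddr uintptr) {
	cardTable[slotAddr>>cardShift] = 1
}

// scanDirtyCards shows the imprecision being discussed: every object on a
// dirty card must be re-scanned, even if only one of them actually points
// into the young generation.
func scanDirtyCards(scanCard func(cardBase uintptr)) {
	for i, dirty := range cardTable {
		if dirty != 0 {
			scanCard(uintptr(i) << cardShift)
			cardTable[i] = 0
		}
	}
}

func main() {
	writeBarrier(0x1234)
	scanDirtyCards(func(base uintptr) { fmt.Printf("re-scan card at %#x\n", base) })
}
```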
Well, another solution (which I believe is common) is to just tenure the young object (and recursively tenure objects reachable from it) if the write barrier trips, which preserves the precision.
There isn't anything fundamental to generational GC that requires imprecision. It's an implementation detail. In fact, David Ungar's original generational GC paper from 1984 [1] uses precise remembered sets. GHC's GC [2] among others uses precise remembered sets.
Tenuring the young object in the write barrier is not correct, since the young objects reachable from the newly-tenured object won't be marked in the young collection. That is why TOC then traces this newly-tenured object, which is novel to my knowledge.
Yeah, of course you need to recursively tenure objects, and I should have been more precise about that. I've heard of doing this before; it isn't novel.
Indeed, as you say, write barriers precisely identify old-to-young pointers at the time of the write. But most collectors defer tracing and track this old-to-young pointer in a card-table, which results in lost precision i.e. now the collector only knows that some object in this region has an old-to-young pointer.
In practice (advanced) generational collectors already use TLABs (thread-local allocation buffers) and multiple allocation regions with separate liveness calculation.
That's a pretty similar effect and could probably be tweaked further to achieve something that separates objects into different regions based on transactions.
I'm having trouble understanding what the practical difference between the transaction concept and something like RAII. It may just be the example they chose (goroutine processes an incoming request, does not "share" the contents with anyone else, and terminates when complete) but nonetheless I'm having a mental block. Can anyone clarify?
Note that I'm not saying that they should've used RAII along with existing GC, I'm just trying to distinguish the transaction concept and the RAII concept.
RAII is completely static; the lifetimes of all (uniquely-owned) objects are determined at compile time with a set of rules based on the language syntax. But the "published"/"local" distinction as in this GC is dynamic: the compiler does not (and cannot) statically work out which objects are published and instead determines that at runtime.
This basic distinction leads to completely different implementations.
If you take the corollary to RAII (resource disposal is destruction) as a given, then this (and any other automatic GC) system does not fit it. In C++ RAII, deallocation and deinitialization happens deterministically. In this it does not.
Go already allocates on the stack anything it can prove doesn't escape. I don't see much work going on in escape analysis, and I hope they will improve it.
Nah, some of those things went in. I remember seeing a change to keep small string/byte slices' bytes and the smallest size map on the stack sometimes now, for example. https://twitter.com/golang_cls posts a selection of change discussions if you like this kind of stuff. The test/escape_* tests in the Go source tree show what escape analysis is expected to do and not do. Dmitry Vyukov, the author of the doc you linked, worked on a lot of the changes he was talking about; impressed with the range of his contributions (also including the race checker and go-fuzz).
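A small example of the kind of decision escape analysis makes; you can see the compiler's reasoning with `go build -gcflags=-m`:

```go
package main

// sum's pointer never escapes, so the compiler can keep the allocation on
// the stack.
func sum(xs []int) int {
	total := new(int)
	for _, x := range xs {
		*total += x
	}
	return *total
}

// makeCounter returns the pointer, so the value outlives the frame and must
// be heap-allocated.
func makeCounter() *int {
	return new(int)
}

func main() {
	_ = sum([]int{1, 2, 3})
	_ = makeCounter()
}
```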
Plenty of things in C++ are heap allocated. When you create a vector or dynamic string, it isn't allocated on the stack. Yes RAII would still take care of it but that's a separate thing.
At the risk of sounding a bit pedantic, a vector or string isn't required to be on the heap.
You can use either a fixed-size arena allocator, or one using alloca (you'd better know what you're doing), and still allocate those things on the stack.
In cases where you know the bounds of your input and runtime constraints you can get some really incredible performance boosts (multiple orders of magnitude) with those approaches.
Additionally, in C++ strings of less than 23 (IIRC) bytes are not heap allocated (they reuse fields in the existing structure). Whether this is a good thing is a matter of opinion, of course.
> Additionally, in C++ strings of less than 23 (IIRC) bytes are not heap allocated (they reuse fields in the existing structure).
Not quite. SSO is a lib-specific optimisation. LLVM's libc++ has a 22 characters SSO[0], recent MSVC and recent GNU libstdc++ have 15 characters SSO, Apache and older libstdc++ didn't have SSO at all (and used COW).
The SSO-23 POC/proposal does provide a 23-characters SSO for a 24 bytes string structure (such as LLVM's, MSVC 2013 and modern G++ strings are 32 bytes and SSO-23's scheme provides space for 31 characters in the same stack space).
[0] all capacities exclude the final NUL as it's required and unusable outside of C compatibility
I'm not at all an expert in this area but to me this sounds like a similar concept as Rust-esque ownership, at least internally in the runtime. If the runtime knows that an object wasn't shared by its goroutine owner and the owner goes out of scope then cleanup becomes (theoretically) much faster, if only because scope is greatly limited vs GC that has to scan the entire heap.
It's no more similar to Rust than any generational collector is (which is to say: not much at all). Rust's memory management story is primarily static, while this is dynamic.
Rust's memory management story is not primarily static. Anything that is allocated on the stack or anchored by a stack allocation is dynamic; it can fail, and Rust checks for failure at runtime.
But rust determines at compile time when almost all allocations need to be freed (the exceptions being Rc/Arc reference counting boxes), whereas it seems like a transaction/request GC would discover that at runtime.
In other words, the allocations and deallocations are determined statically, they aren't static allocations in and of themselves.
A nice thing about this is that it keeps the collections happening while no users are waiting. Before seeing this I figured what they'd do for throughput might be more of a Java/C#-like compacting young gen collection, which still introduces (short!) pauses.
Also, fun wrinkle: seems like occasionally, wrapping a slowish, mostly-self-contained, garbage-heavy function call in a goroutine could be a useful hint to this collector. It seems like it's only useful when 1) a transactional GC on the calling thread or full GC would really be slower (e.g. calling thread keeps lots of live data so GC on it takes more work), and 2) the launch/communication costs (expect at least µsecs) are small enough relative to gains. But interesting it's even possible for it to work out that way.
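Something like this sketch, purely hypothetical (whether it actually helps would depend entirely on the collector and on measurement):

```go
package main

import "fmt"

// garbageHeavy stands in for a slowish, mostly self-contained,
// allocation-heavy computation.
func garbageHeavy(n int) int {
	total := 0
	for i := 0; i < n; i++ {
		buf := make([]byte, 1024) // short-lived garbage
		total += len(buf)
	}
	return total
}

// wrapped runs the work on its own goroutine; under a goroutine-scoped
// collector its garbage could in principle be reclaimed cheaply when the
// goroutine exits, at the cost of the launch and channel communication.
func wrapped(n int) int {
	ch := make(chan int, 1)
	go func() { ch <- garbageHeavy(n) }()
	return <-ch
}

func main() {
	fmt.Println(wrapped(1000))
}
```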
I have a generational art program that I wrote under Go 1.4 which was absolutely crushed by 1.5's GC. The program heavily abuses goroutines, so I'm cautiously optimistic that this optimisation would allow me to keep my version of Go current once again.
Did you open a bug report with some datasets (profile dumps, etc.)? Go developers treat these kinds of regressions very, very seriously. They absolutely want to avoid situations where users are pinned to an old version because of a performance regression.
I did test in 1.6, and the performance and output were still significantly worse than in 1.4.
The part I'm saddest about is that because some of the goroutines abuse involved literally building in a race condition, the aggressive garbage collection made the resulting output less artistically interesting (on top of taking an hour to complete instead of a few minutes).
I think you're probably right, but this whitepaper gave me a sliver of hope that my goroutines might be free to work some nondeterministic magic once again
Interesting work. If I am understanding correctly, you are essentially using the randomness generated by the scheduler to produce art pieces. Do you have a public link to your work?
Alas, I've been hoarding almost all of the most interesting images I've created using the process, but I wrote up a short piece when I got the initial program running [1], and I do have a few examples of pictures generated using the derivations of the code that still make use of the core of the algorithm (these are all huge images!) [2][3][4][5][6].
Looking over your code, did you ever run it single-threaded to compare the performance? I see you spawning what appears to be a crapton of goroutines, each of which performs a very small number of operations, then terminates. As cheap as goroutines are, they still have some cost to startup and shutdown, and it's not a good idea to spawn a goroutine if it's going to do roughly the same order of work inside the goroutine as the spawn & shutdown costs. This is a pessimal situation. I'm not sure if you're seeing the GC fail or if you're seeing the scheduler choke under the harsh load.
If you just want nondeterminacy, I'd suggest using math/rand directly, not accidents of scheduling... which, as was just proved, can disappear as quickly as they appear. In fact it doesn't even take something as serious as a language upgrade for that to happen; that kind of indeterminacy can disappear if you so much as add a member to a struct or have another process on your system eating 100% of one CPU or something. It's not reliable, not even a little.
On my side I've experimented in the code a bunch with the number of goroutines as well as the number of different CPUs handling the requests, and although they seem to alter the nature of the image the program still produces output in a reasonable amount of time for most non-pathological values.
As to scheduling v. random functions, much of what I like about the original is that the scheduling is completely out of my hands and it's not guaranteed to be consistently distributed or reproducible. It's as though the computer were making its own decisions.
I have a project started to rewrite the whole thing in C++ using random numbers and entirely Manhattan-distance estimations for speed, but working with C++ is incredibly unpleasant and difficult to justify when I need to find a real job.
That's interesting -- my use case had up to 2 minute GC pauses in 1.4, and now runs without a hiccup. For a variety of reasons, some code-side and some on Go's side. I didn't know there were use cases that were negatively affected by GC work done in the last year or two.
Well, in my case specifically, it was a program that generated coloured images based on a set of simple rules (namely, don't reuse the same RGB colour for any pixel & try to be as close to the colour of neighbouring pixels as possible). I wrote a brief overview around the time I finished the first version[1].
More recently, I wrote a new routine which generates several of those and then averages the RGB channel values on a per-pixel basis, and then repeatedly re-averages the original images with those average images to create more "generations." I haven't written anything up about that because the code required to sum and average pixels in a PNG is significantly less interesting.
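In simplified form the averaging looks something like this (a sketch rather than the actual code):

```go
package main

import (
	"image"
	"image/color"
)

// averageImages produces a new image whose pixels are the per-channel mean
// of the input images (all assumed to share the same bounds).
func averageImages(imgs []*image.RGBA) *image.RGBA {
	bounds := imgs[0].Bounds()
	out := image.NewRGBA(bounds)
	for y := bounds.Min.Y; y < bounds.Max.Y; y++ {
		for x := bounds.Min.X; x < bounds.Max.X; x++ {
			var r, g, b uint32
			for _, img := range imgs {
				c := img.RGBAAt(x, y)
				r += uint32(c.R)
				g += uint32(c.G)
				b += uint32(c.B)
			}
			n := uint32(len(imgs))
			out.SetRGBA(x, y, color.RGBA{uint8(r / n), uint8(g / n), uint8(b / n), 255})
		}
	}
	return out
}

func main() {
	a := image.NewRGBA(image.Rect(0, 0, 2, 2))
	b := image.NewRGBA(image.Rect(0, 0, 2, 2))
	_ = averageImages([]*image.RGBA{a, b})
}
```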
For folks looking for a collector with a similar ideal (though with a potentially more useful constraint model), you might see the Pony language [1] collector [2].
The essence being to take advantage of some form of isolation in a concurrent system to dynamically infer something that looks like scoping/lifetimes.
I'd like to see this tried, along with a couple of wrinkles. Could information from the escape analysis done by the Golang compiler be used to provide "hints" to the garbage collector to make it more efficient? Could such hints be combined with further hints from the programmer, such as "finalize" calls or tags? Could runtime tracing information be used to derive even more hints for more efficient GC?
Note: "finalize"-hints in Go are just curly braces, which begin and end blocks. Variables local to the scope are "finalized" (i.e. collectable, assuming they live on the heap) when the block ends.
I'm not sure I understand: do you mean that they are collected after the program counter hits the last instruction of a given scope or are you implying that scope-local variables outlive their scope somehow?
I've been thinking about this before as RAII with optional deferral of reclamation when batching would be advantageous, for instance when you have high allocation pressure and small blocks.
RAII is using one C++ misfeature (destructors) to work around another misfeature (inability to do cleanup when leaving a lexical scope). Go has a far more elegant solution to the latter; if you're not familiar with the defer statement, check it out.
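For anyone who hasn't seen it, a typical use (the file path is made up):

```go
package main

import (
	"fmt"
	"io"
	"os"
)

func readFile(path string) ([]byte, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	// Cleanup is declared right next to acquisition and runs when the
	// function returns, on every path (early returns and panics included).
	defer f.Close()
	return io.ReadAll(f)
}

func main() {
	data, err := readFile("/etc/hostname") // hypothetical path
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("read %d bytes\n", len(data))
}
```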
Eh, defer is more similar to Java's try-with-resources or C#'s using, which both put the burden on the user of the object and cannot handle recursive cleanup (cleanup of trees). So I consider it much worse than destructors.
Except that deterministic destruction is actually a feature that enables cleanup when leaving a scope. Manually peppering your code with defer statements is often doing work a computer can do better than you can.
We really need to stop thinking of memory as a 'store'. Treat it as a cache and (a) GC naturally becomes a background process, and (b) you get persistent memory for free.
When considered as 'store', the value image is at a specific memory location, i.e. a specific offset in process heap. GC's difficult role here is to manage these allocations and compact the heap, etc.
When considered as a 'cache', the entire heap is a fixed sized cache. (LRU or better 2Q to deal with ephemeral objects). The actual memory object is at an offset in a mapped segment. (If mapped to a file, then voila, persistent memory).
As objects are created, they are immediately placed in the 'process cache'. So, to take the example of the OP's 'transactional' memobjs, these are cache entries that get promoted to the LRU section of a 2Q, get utilized, and then are naturally 'garbage collected' by eviction.
This scheme is not universal. Very long lived processes that sporadically access 'ancient' objects will stress the VM/FS. One can extend the temporal range of this scheme by employing a set of rotating FS-backed memory segments, but of course there will be a limit (e.g. your FS).
Does that clarify the scheme?
[p.s. of course one can used a leveled cache hierarchy, with optimal realization mapping to L1, L2, wired-memory. But a basic version can just use a single level 2Q cache.]
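A minimal sketch of the 2Q flavour I mean (sizes and policy invented purely for illustration): new entries land in a probation FIFO and are only promoted to a protected LRU if touched again; eviction from probation is the "garbage collection by eviction" above.

```go
package main

import "container/list"

type entry struct {
	key       string
	val       interface{}
	protected bool
}

// twoQ is a minimal 2Q-flavoured cache: new entries go into a FIFO
// "probation" queue and are promoted into a "protected" LRU on re-use.
// Eviction from probation plays the role of collecting ephemeral objects.
type twoQ struct {
	probation, protected       *list.List
	index                      map[string]*list.Element
	maxProbation, maxProtected int
}

func newTwoQ(p1, p2 int) *twoQ {
	return &twoQ{
		probation: list.New(), protected: list.New(),
		index:        map[string]*list.Element{},
		maxProbation: p1, maxProtected: p2,
	}
}

func (c *twoQ) put(k string, v interface{}) {
	c.index[k] = c.probation.PushFront(&entry{key: k, val: v})
	if c.probation.Len() > c.maxProbation {
		old := c.probation.Back() // ephemeral entry "collected" by eviction
		c.probation.Remove(old)
		delete(c.index, old.Value.(*entry).key)
	}
}

func (c *twoQ) get(k string) (interface{}, bool) {
	el, ok := c.index[k]
	if !ok {
		return nil, false
	}
	ent := el.Value.(*entry)
	if ent.protected {
		c.protected.MoveToFront(el)
		return ent.val, true
	}
	// Promote on re-use: move from probation to the protected LRU.
	c.probation.Remove(el)
	ent.protected = true
	c.index[k] = c.protected.PushFront(ent)
	if c.protected.Len() > c.maxProtected {
		old := c.protected.Back()
		c.protected.Remove(old)
		delete(c.index, old.Value.(*entry).key)
	}
	return ent.val, true
}

func main() {
	c := newTwoQ(2, 2)
	c.put("a", 1)
	c.put("b", 2)
	c.get("a")    // "a" promoted to protected
	c.put("c", 3) // probation: c, b
	c.put("d", 4) // probation over capacity: "b" evicted
	if _, ok := c.get("b"); !ok {
		println("b was evicted")
	}
}
```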
> My late friend Alain Fournier once told me that he considered the lowest form of academic work to be taxonomy. And you know what? Type hierarchies are just taxonomy.
Possibly Mr. Fournier meant 'foundational' by "lowest". If not, he may wish to revisit the foundational work performed by "lowly" academics such as a Mr. Darwin in the 19th century.
[p.s. we need to alert the CPU designers that they have been cluelessly using caches all these years. If only they had higher guidance from an infallible mind..]
Both systems have active runtime sub-systems, GC, scheduler, and a compiler front-end.
Java, of course, has a virtual machine and intermediate representation (byte codes), and a JIT to make it performant. The runtime class loading mechanism is a key +input here. The JIT aspect is a mix, with the -input being the upfront startup cost.
Go has far better C linkage, access to underlying memory object image, and provides the (glaringly missing from JVM) unsigned integer types.
Java has a more modern type system and first-class metadata facilities, with the -input being "more rope". Go has an 'interesting' type system, with the -input being source-level fragility, but the +input here is "less rope".
Having extensively worked with concurrent systems in both languages, the distinction ultimately boils down to a question of whether language-level continuation (fiber/green) as provided by Go vs. library/ByteCodeEngineering (various Java libs) is a critical requirement.
I honestly think it should be clear at this point that Go is really displacing Python and the notion of Go vs. Java is a false dichotomy.
This is changing. There already are 3rd-party FFI libraries for java[0][1] which are far less cumbersome to use than JNI and in the future there will be Project Panama[2] which will provide that out of the box.
> the distinction ultimately boils down to a question of whether language-level continuation (fiber/green) as provided by Go vs. library/ByteCodeEngineering (various Java libs) is a critical requirement.
Excellent point. This doesn't seem to be brought up very often (or at least it's rarely emphasized) in Java/Go comparisons.
Also, the Go advantage regarding AOT compilation to native code only exists if one ignores that the majority of commercial third-party JVMs do have it as a feature.
Except that Java already has G1GC[0], a region-based collector. And if you're willing to shell out money there even is a pauseless, fully concurrent collector[1] scaling to hundreds of gigabytes.
Shenandoah[0]. From what I've read it's contributed to openjdk by redhat. Their claims about pause times aren't quite as strong as those of Azul, but <100ms pauses for 100GB+ heaps will still be leagues ahead of most GCs.
It's probably also less CPU-efficient than Azul's, since it uses forwarding pointers instead of page-table trickery. But if you have such a large heap you'll also have many cores. Here's a performance comparison to G1 [1].
The Go 1.7 GC is already on par with most Java GCs and arguably better. Our Go projects with large heaps now have better pause times than the similar Java programs, all having tried many of the Java GCs...
Part of the reason for this is that Java GC development has focused more on throughput, whereas Go GC development has focused more on pause time. It's rather hard to compare throughput between two such different languages, though.
It isn't a language freeze but a compatibility promise.
"It is intended that programs written to the Go 1 specification will continue to compile and run correctly, unchanged, over the lifetime of that specification. At some indefinite point, a Go 2 specification may arise, but until that time, Go programs that work today should continue to work even as future "point" releases of Go 1 arise (Go 1.1, Go 1.2, etc.)."
Granted that means it is effectively a freeze on backwards incompatible changes, but it does leave the door open for additions to the language.
Having said that, the go authors are definitely very conservative with language changes.
That stupid issue means that packages like os return 'error' even when it will always be of type '*PathError', so you have to be careful to note whether the package you're using screwed up by returning concrete error structs when you do `if err != nil`, or else you'll end up comparing a nil interface against a nil struct... not what you want.
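The gotcha, when a package does return the concrete type, looks like this (a contrived illustration, not actual os code):

```go
package main

import (
	"fmt"
	"os"
)

// badOpen makes the mistake described above: its signature names the
// concrete error type instead of the error interface.
func badOpen(path string) *os.PathError {
	return nil // "no error"
}

func main() {
	var err error = badOpen("/tmp/x") // hypothetical path
	// err is now a non-nil interface wrapping a nil *os.PathError,
	// so the usual check does the wrong thing:
	fmt.Println(err != nil) // prints true
}
```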
This hinders the ability to make richer error types somewhat.
In addition, it means you have to always read the docs because the type system cannot represent the possible errors you can get there safely.
Second, this is bad because of places in the standard library (like io.Reader.Read) that return two values (numRead, err), both of which are useful.
Due to the lack of sum types, you can't actually tell in multiple-returns of errors whether the results are mutually exclusive or not. What's more, without sum types, there's no way to express to the type system correctly that you indeed are in a non-nil branch of a value or so on.
Finally, there's no macros or other ways in the language to have more expressive error checking than the brain-dead "if err != nil". Sure, what go has is better than java's terrible implementation of exceptions, but it's so much worse than what you get in rust or ocaml or so on.
Actually no. The Go compiler already has "stack promotion" (using "escape analysis"). TOC is in addition to that, and I think it's intended to cover the cases not covered by stack promotion: for example, function A calls function B, which allocates some memory and returns it to function A, which "consumes" it before returning. In this case the allocation can't be promoted to the stack, because B returns and its stack frame gets destroyed... but the allocated memory can still be safely released when function A returns.
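A concrete version of that shape (nothing here demonstrates what TOC itself would do; it just shows why the allocation escapes B's stack but is dead once A returns):

```go
package main

import "fmt"

// functionB allocates a slice and returns it; barring inlining, the value
// outlives B's stack frame, so escape analysis has to heap-allocate it.
func functionB() []int {
	xs := make([]int, 1000)
	for i := range xs {
		xs[i] = i
	}
	return xs
}

// functionA consumes the allocation and never lets it escape further, so
// the memory is garbage by the time A returns -- the case a
// goroutine/transaction-scoped collector could reclaim cheaply.
func functionA() int {
	xs := functionB()
	sum := 0
	for _, x := range xs {
		sum += x
	}
	return sum
}

func main() {
	fmt.Println(functionA())
}
```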
Each release of Go seems to have yet another garbage collection algorithm, targeting a different use case each time. Presumably the previous use cases are left behind, ready for a future update going back to target them too. This leads to an infinite loop of GC alterations.
Does anyone feel confident that a 'good enough for everyone' GC will ever be produced? If not, surely the language designers should settle on picking a type of program that they will optimise for, and sticking to it.
Without full memory control you're always going to be making trade-offs in terms of performance. Hopefully the GC is "good enough" for your use case, but there will never be a one-size-fits-all GC.
FWIW that "regression" looks like a program that depends on particular runtime characteristics to generate "artistic" output. Not really the kind of issue you can realistically develop around.
In almost all practical programs the new GC (in 1.5) was a significant win, or at worst no change.
> The Go GC is constantly improving, and its performance in Go 1.6 and 1.7 is considered very good by most users.
Also true of Ruby or PHP. Which perform abysmally.
"Most users" needs aren't what we look for to define technical excellence - if anything, I most often see "most users" used to guide the compromising of technical standards.
GC highlights the time-memory tradeoff at its worst, because GC-based languages typically do not allow the programmer to explicitly express their memory model, forcing a lot of unnecessary "garbage" to be created. In Ruby, all values that do not fit in a tagged pointer are heap objects. In Go and Java some types are specifically optimized as stack-based "value" types, which reduces the amount of work for the GC.
I personally like the "ARC" approach of ObjC and Swift, where the compiler statically adds and optimizes reference-counting routines, promotes objects to the stack when possible, and requires the programmer to manually resolve cyclic references (and Xcode supports nice methods to debug and visualize memory in order to detect unintentional cycles). Accidental memory cycles are rare enough and quite debuggable, at the very little mental cost of marking some references as "weak" or "unowned". But the long-term performance and memory exhaustion issues caused by GCs are completely bypassed.
The problem with RC as an algorithm is that it increases locking on shared data structures, and thus cache contention, and you still have a "stop the world" moment when destroying a complex data structure causes a cascade of deletions.
There was a Swift talk exactly about struct vs reference types and making use of COW data structures to reduce ARC impact.
> Does anyone feel confident that a 'good enough for everyone' GC will ever be produced?
Probably not, GC is always full of tradeoffs. That's why the JVM has a bunch of tuneable GCs, and there are third-party JVMs with their own strategies and GC tradeoffs (e.g. Azul).
A while ago, I had an idea for a "tidal" collector. If you look at what a thread in a typical server application is doing, it's in a loop where it picks up some task (a network request, a job from a queue, etc), then goes and does a load of work on it. If you plot the stack depth over successive loops, you should see a sort of rising and falling tide - low tide when the task is picked up, and high tide during processing. Along with that rising and falling tide, you should get a rising and falling volume of live objects. Because garbage collection has a cost proportional to the number of live objects, for best efficiency, you want to collect when there are the fewest live objects. Which should be low tide.
To make use of this idea (which might be false, but at least sounds plausible), you'd need two things: a layout where each thread has its own memory (at least, enough of its own memory to hold all the objects that escape eden but become garbage during a request cycle; you might want a shared old generation), and some magic device which watches the stack depth and identifies the moment of low tide.
G1 might provide the right layout. I think every thread gets its own eden region; if you could tweak it so that each one promotes into its own survivor regions, and then be able to find the survivor regions for a given thread, you just need to be able to decide when is the right time to collect. You might be able to do that on the back of the dynamic profiling infrastructure already in the JVM; you would want to sample each thread every now and then to identify the method, even the particular line of code, where low tide happens, and then compile in a GC trigger there.
I am a humble application programmer, so the only work I ever did on this idea was maybe telling people about it on newsgroups or forums (hi everybody!). I'd be amazed if nobody else has thought of it. But I've never heard of anyone trying it. Perhaps it's just a terrible idea.
The proposed Go design sounds really quite like this. With a huge advantage: because Go tends to use these fire-and-forget goroutines, the problem of identifying low tide goes away, and you can just collect when the goroutine dies.