Interesting that they are publishing the algorithm before implementing it. I wish more people did this. This way, if the algorithm doesn't work out, we'll have some way of knowing. If they hadn't published the algorithm, tried it, and then abandoned it, then nobody would have any way of knowing that there had been a negative result.
I don't have high hopes for this algorithm, because my intuition is that transactions aren't common enough to make a big difference. Looking forward to seeing the implementation and empirical data.
"Transactions" here doesn't mean roll-back-able units like in a database, just small bits of code that terminate (and, the authors hope, leave some garbage behind that you can clean up without scanning a bunch of RAM). Request-oriented collector would have been about as good a name.
This is kind of a spin on generational collection, built around the observation that a lot of Go programs are HTTP/RPC/whatever servers.
Yes, I've often considered that this is probably how Java's collector tends to behave in practice, particularly with the eden and young generation. If you're handling an RPC call that returns promptly, it's probable that all of the structures you've allocated on behalf of that call will be part of the young generation and can be efficiently collected.
Indeed, it's one reason why I've assumed that Rust's memory model might turn out to work well for services (not having had a chance to try it): there's a clear entry point where the request begins and ends, and all of the memory allocated from that point onward can be discretely tracked and freed upon completion. In the services that I've built, the majority of the allocations are for these short-lived objects, with a relatively smaller amount being related to long-lived objects like connections to other services or caches. I imagine that websites and services might behave like this, though I haven't studied it carefully.
I'll be interested to see how transaction-specific optimization turns out, and whether it can do better than a general purpose generational collector.
It's also not exactly original to Go... the Ur language, for instance, employs a request-oriented collector (arguably, though the documentation claims it does not use garbage collection).
Thanks for the link! I was actually at that talk :) To quote from the relevant slide:
* Transactions are integrated into Ur/Web at a deep level, so, whenever we run out of space, we can always abort the execution, allocate a larger heap, and restart.
* As a further optimization, we use region-based memory management, inferring a stack structure to allow freeing whole sets of objects at key points during execution.
Not sure how to reach twotwotwo directly. I was hoping to get permission to rename the GC to ROC from TOC ("request" instead of "transaction"). It's simply a better name and the bird makes for better visuals on the slides. I'm Rick Hudson, rlh@, and can be emailed directly at golang.org.
> I don't have high hopes for this algorithm, because my intuition is that transactions aren't common enough to make a big difference.
Every request served by net/http's Server, net/rpc's Server, and similar gets its own goroutine. That is to say, its own transaction. Memory allocated within that transaction will be associated with it. Should make a positive difference for server software.
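A minimal sketch of that pattern (the handler and payload type are hypothetical, just to show where the per-request garbage comes from):

```go
package main

import (
	"encoding/json"
	"net/http"
)

// respPayload stands in for the per-request data a handler builds up: it is
// allocated when the request arrives and becomes garbage as soon as the
// handler (and the goroutine serving the request) returns.
type respPayload struct {
	Items []string `json:"items"`
}

func handler(w http.ResponseWriter, r *http.Request) {
	// Everything allocated here lives only for this request/"transaction".
	p := &respPayload{Items: []string{"a", "b", "c"}}
	_ = json.NewEncoder(w).Encode(p)
}

func main() {
	http.HandleFunc("/", handler)         // net/http serves each request on its own goroutine
	_ = http.ListenAndServe(":8080", nil) // hypothetical port
}
```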
While trying to implement the new LuaJIT GC, one of the things I tested adding early on was support for explicitly creating arenas from Lua and setting the two active arenas that allocations happen from, to support this kind of allocation pattern. The minor GC mode also works well with this pattern, only having to touch modified objects and only sweeping away fresh white objects.
Interesting. The transactional hypothesis sounds like a refinement of the generational hypothesis.
Seems like the main advantage of this transaction-oriented collector over a typical generational collector is that it precisely identifies the old-to-young pointers and thus may scan fewer objects in total.
I'm having trouble seeing how this is actually different from a normal generational collector to begin with in practice (as opposed to implementation details).
The differences I read into it are:
- There is one heap rather than a young and a tenured heap, but tenured objects are marked as such.
- Rather than moving an object from one heap to another, the write barrier tags an object that gets written into a tenured object as tenured itself (and presumably all its children). Not all collectors move actively (as opposed to lazily at collection time) to the tenured heap, but I'm pretty sure it has been done.
- 'Young generation' collections happen at a specific time, when a goroutine terminates, and work more or less like a quick version of a semispace copying collector, in that they just drop all objects not marked reachable.
So there is interesting stuff in there, and it seems like a very appropriate approach for an event-driven environment, but it still seems like a generational collector to me.
Also, I'm confused how the separation of the goroutines is maintained in this heap. When a goroutine advances its current pointer, what happens if it bumps into another goroutine's initial pointer?
> - There is one heap rather than a young and a tenured heap, but tenured objects are marked as such.
That's a big difference. One of the largest, if not the largest, advantages of a standard generational GC in practice is bump allocation in the nursery. If I understand correctly, because this GC doesn't unconditionally tenure surviving objects (instead only tenuring objects that cross thread boundaries), the algorithm loses fast bump allocation, because objects in the nursery can become noncontiguous.
Right, I was actually going to comment that the allocation scheme in this algorithm seems painfully slow compared to the kind of allocation you can get out of a separate nursery where you're just allocating the next free space. I forgot by the time I finished the post.
First-fit seems like it could be quite slow here.
The implication seems to be that they can at least avoid locking for allocation (and deallocation and promotion) as long as they find a fit within the reserved region for the goroutine, but I'm still confused how they maintain this separation safely.
They like their (mostly-)concurrent collections, and concurrent and compacting together seem harder to do (think e.g. Azul) than just concurrent or just compacting. And of course without compacting, a slower allocator that can use already-fragmented memory becomes attractive.
Don't know how much of a factor it is in this design, but might also just be expensive to introduce heap compaction to Go now, after going this far with heap objects having stable pointers. Might be a lot of random spots in the code that still assume that.
Not that there aren't benefits to compacting, including the bump allocation, less bloat as programs run a long time, and help with locality. Just, describing where Go's at.
> I'm still confused how they maintain this separation safely
Don't know about this design, but in other places they have some goroutine-local data structures and then do the expensive thread coordination stuff after they run out of resources. Seems plausible that could be the general approach here too.
I read the other day a comment that I thought was pretty interesting regarding compacting vs non-compacting. The idea was that compacting took the amortized cost of searching for free space and shoved it all into a single time span. Allocations are then just simple fast increments.
EDIT: Found the post.
> In the CLR allocation is blazing fast: just increment the heap head pointer. You don't have to take care of problems traditional allocators face: find a big enough fitting free hole. This has a cost: when objects are destroyed, the free memory on the heap becomes fragmented. This needs to be compacted, and this is where the GC pause is needed, as the references for moved objects need to be updated. Basically the cost of finding a free hole in memory at allocation was moved here.
> This was done intentionally, as studies found that most real world applications produce a load where most objects are short lived: this cheap allocation is needed, and a multigenerational GC can take care of them cheaply after destroyed (as for most of the time you only scan and free the youngest generation for unreachable objects, limiting the GC pause time).
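A toy illustration of the trade-off the quote describes, with made-up allocator structures (this is not how any real runtime is implemented): bump allocation is just a pointer increment, while a first-fit allocator has to search for a fitting hole.

```go
package main

import "fmt"

// Toy bump allocator: a slab plus an offset. "Allocating" is an increment;
// reclaiming space requires compacting/resetting the slab, which is where a
// copying/compacting GC pays its cost.
type bumpArena struct {
	buf []byte
	off int
}

func (a *bumpArena) alloc(n int) []byte {
	if a.off+n > len(a.buf) {
		return nil // out of space: a real runtime would collect/compact here
	}
	p := a.buf[a.off : a.off+n]
	a.off += n
	return p
}

// Toy first-fit free list: allocation walks the list looking for a big
// enough hole -- the cost that compaction moves to collection time.
type hole struct{ off, size int }

func firstFit(holes []hole, n int) (int, bool) {
	for i, h := range holes {
		if h.size >= n {
			holes[i].off += n
			holes[i].size -= n
			return h.off, true
		}
	}
	return 0, false
}

func main() {
	a := &bumpArena{buf: make([]byte, 1<<10)}
	fmt.Println(len(a.alloc(100)), a.off) // 100 100

	holes := []hole{{off: 0, size: 64}, {off: 128, size: 256}}
	off, ok := firstFit(holes, 100)
	fmt.Println(off, ok) // 128 true
}
```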
Plus, Go has now acquired a culture of heavily optimising code to "not allocate", so allocation is less of an issue for it than for runtimes like Java.
It's not that those libraries don't allocate; they work with stack allocation / caller-provided buffers. You can reuse the provided buffers / output objects... but a lot of req/resp code doesn't, so it's just pushing the cost higher up the stack (app vs. libraries). Which itself is a good argument for a "transactional" GC.
> Seems like the main advantage of this transaction-oriented collector over a typical generational collector is that it precisely identifies the old-to-young pointers and thus may scan fewer objects in total.
I'm confused as to what this means. Write barriers, just as suggested here, are the standard solution for this issue in generational garbage collectors, and they precisely identify the old-to-young pointers as well. (In fact, the nursery and tenured generations are often separated in memory, so you can test just by masking a few bits off.)
I agree with the sibling comment that this seems to be a simple generational GC, just with a different policy for tenuring than the typical textbook one.
The standard solution is that the write barrier updates a card table, which generally reserves a bit for every page. When it comes time to mark through old-to-young pointers, the collector scans the card table - if it sees a bit set in the card table, it must mark through /all objects/ on that page - even if only one of the objects on the page has a cross-generational pointer.
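A rough sketch of that card-marking scheme (card size, table size, and structure are invented for illustration; real collectors differ in the details):

```go
package main

import "fmt"

const cardShift = 9 // one card per 512 bytes of heap (illustrative)

// One dirty byte per card; many collectors use a byte rather than a bit so
// the barrier is a single unconditional store.
var cardTable [1 << 20]byte

// writeBarrier is what the compiler conceptually emits on a pointer store
// into the heap: record that the card containing slotAddr may now hold an
// old-to-young pointer.
func writeBarrier(slotAddr uintptr) {
	cardTable[slotAddr>>cardShift] = 1
}

// scanDirtyCards shows the imprecision being discussed: every object on a
// dirty card must be re-scanned, even if only one of them actually points
// into the young generation.
func scanDirtyCards(scanCard func(cardBase uintptr)) {
	for i, dirty := range cardTable {
		if dirty != 0 {
			scanCard(uintptr(i) << cardShift)
			cardTable[i] = 0
		}
	}
}

func main() {
	writeBarrier(0x1234)
	scanDirtyCards(func(base uintptr) { fmt.Printf("re-scan card at %#x\n", base) })
}
```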
Well, another solution (which I believe is common) is to just tenure the young object (and recursively tenure objects reachable from it) if the write barrier trips, which preserves the precision.
There isn't anything fundamental to generational GC that requires imprecision. It's an implementation detail. In fact, David Ungar's original generational GC paper from 1984 [1] uses precise remembered sets. GHC's GC [2] among others uses precise remembered sets.
Tenuring the young object in the write barrier is not correct, since the young objects reachable from the newly-tenured object won't be marked in the young collection. That is why TOC then traces this newly-tenured object, which is novel to my knowledge.
Yeah, of course you need to recursively tenure objects, and I should have been more precise about that. I've heard of doing this before; it isn't novel.
Indeed, as you say, write barriers precisely identify old-to-young pointers at the time of the write. But most collectors defer tracing and track this old-to-young pointer in a card-table, which results in lost precision i.e. now the collector only knows that some object in this region has an old-to-young pointer.
In practice (advanced) generational collectors already use TLABs (thread-local allocation buffers) and multiple allocation regions with separate liveness calculation.
That's a pretty similar effect and could probably be tweaked further to achieve something that separates objects into different regions based on transactions.
I'm having trouble understanding what the practical difference between the transaction concept and something like RAII. It may just be the example they chose (goroutine processes an incoming request, does not "share" the contents with anyone else, and terminates when complete) but nonetheless I'm having a mental block. Can anyone clarify?
Note that I'm not saying that they should've used RAII along with existing GC, I'm just trying to distinguish the transaction concept and the RAII concept.
RAII is completely static; the lifetimes of all (uniquely-owned) objects are determined at compile time with a set of rules based on the language syntax. But the "published"/"local" distinction as in this GC is dynamic: the compiler does not (and cannot) statically work out which objects are published and instead determines that at runtime.
This basic distinction leads to completely different implementations.
If you take the corollary to RAII (resource disposal is destruction) as a given, then this (and any other automatic GC) system does not fit it. In C++ RAII, deallocation and deinitialization happens deterministically. In this it does not.
Go already allocates on the stack anything it can prove doesn't escape. I don't see much work going on in escape analysis, and I hope they will improve it.
Nah, some of those things went in. I remember seeing a change to keep small string/byte slices' bytes and the smallest size map on the stack sometimes now, for example. https://twitter.com/golang_cls posts a selection of change discussions if you like this kind of stuff. The test/escape_* tests in the Go source tree show what escape analysis is expected to do and not do. Dmitry Vyukov, the author of the doc you linked, worked on a lot of the changes he was talking about; impressed with the range of his contributions (also including the race checker and go-fuzz).
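A small example of the kind of decision escape analysis makes; you can see the compiler's reasoning with `go build -gcflags=-m`:

```go
package main

// sum's pointer never escapes, so the compiler can keep the allocation on
// the stack.
func sum(xs []int) int {
	total := new(int)
	for _, x := range xs {
		*total += x
	}
	return *total
}

// makeCounter returns the pointer, so the value outlives the frame and must
// be heap-allocated.
func makeCounter() *int {
	return new(int)
}

func main() {
	_ = sum([]int{1, 2, 3})
	_ = makeCounter()
}
```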
Plenty of things in C++ are heap allocated. When you create a vector or dynamic string, it isn't allocated on the stack. Yes RAII would still take care of it but that's a separate thing.
At the risk of sounding a bit pedantic, a vector or string isn't required to be on the heap.
You can use either a fixed-size arena allocator, or one using alloca (you'd better know what you're doing), and still allocate those things on the stack.
In cases where you know the bounds of your input and runtime constraints you can get some really incredible performance boosts (multiple orders of magnitude) with those approaches.
Additionally, in C++ strings of less than 23 (IIRC) bytes are not heap allocated (they reuse fields in the existing structure). Whether this is a good thing is a matter of opinion, of course.
> Additionally, in C++ strings of less than 23 (IIRC) bytes are not heap allocated (they reuse fields in the existing structure).
Not quite. SSO is a lib-specific optimisation. LLVM's libc++ has a 22 characters SSO[0], recent MSVC and recent GNU libstdc++ have 15 characters SSO, Apache and older libstdc++ didn't have SSO at all (and used COW).
The SSO-23 POC/proposal does provide a 23-characters SSO for a 24 bytes string structure (such as LLVM's, MSVC 2013 and modern G++ strings are 32 bytes and SSO-23's scheme provides space for 31 characters in the same stack space).
[0] all capacities exclude the final NUL as it's required and unusable outside of C compatibility
I'm not at all an expert in this area but to me this sounds like a similar concept as Rust-esque ownership, at least internally in the runtime. If the runtime knows that an object wasn't shared by its goroutine owner and the owner goes out of scope then cleanup becomes (theoretically) much faster, if only because scope is greatly limited vs GC that has to scan the entire heap.
It's no more similar to Rust than any generational collector is (which is to say: not much at all). Rust's memory management story is primarily static, while this is dynamic.
Rust's memory management story is not primarily static. Anything that is allocated on the stack or anchored by a stack allocation is dynamic; it can fail, and Rust checks for failure at runtime.
But rust determines at compile time when almost all allocations need to be freed (the exceptions being Rc/Arc reference counting boxes), whereas it seems like a transaction/request GC would discover that at runtime.
In other words, the allocations and deallocations are determined statically, they aren't static allocations in and of themselves.
A nice thing about this is that it keeps the collections happening while no users are waiting. Before seeing this I figured what they'd do for throughput might be more of a Java/C#-like compacting young gen collection, which still introduces (short!) pauses.
Also, fun wrinkle: seems like occasionally, wrapping a slowish, mostly-self-contained, garbage-heavy function call in a goroutine could be a useful hint to this collector. It seems like it's only useful when 1) a transactional GC on the calling thread or full GC would really be slower (e.g. calling thread keeps lots of live data so GC on it takes more work), and 2) the launch/communication costs (expect at least µsecs) are small enough relative to gains. But interesting it's even possible for it to work out that way.
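Something like this sketch, purely hypothetical (whether it actually helps would depend entirely on the collector and on measurement):

```go
package main

import "fmt"

// garbageHeavy stands in for a slowish, mostly self-contained,
// allocation-heavy computation.
func garbageHeavy(n int) int {
	total := 0
	for i := 0; i < n; i++ {
		buf := make([]byte, 1024) // short-lived garbage
		total += len(buf)
	}
	return total
}

// wrapped runs the work on its own goroutine; under a goroutine-scoped
// collector its garbage could in principle be reclaimed cheaply when the
// goroutine exits, at the cost of the launch and channel communication.
func wrapped(n int) int {
	ch := make(chan int, 1)
	go func() { ch <- garbageHeavy(n) }()
	return <-ch
}

func main() {
	fmt.Println(wrapped(1000))
}
```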
I have a generational art program that I wrote under Go 1.4 which was absolutely crushed by 1.5's GC. The program heavily abuses goroutines, so I'm cautiously optimistic that this optimisation would allow me to keep my version of Go current once again.
Did you open a bug report with some datasets (profile dumps, etc.)? Go developers treat these kinds of regressions very, very seriously. They absolutely want to avoid situations where users are pinned to an old version because of a performance regression.
I did test in 1.6, and the performance and output were still significantly worse than in 1.4.
The part I'm saddest about is that because some of the goroutines abuse involved literally building in a race condition, the aggressive garbage collection made the resulting output less artistically interesting (on top of taking an hour to complete instead of a few minutes).
I think you're probably right, but this whitepaper gave me a sliver of hope that my goroutines might be free to work some nondeterministic magic once again
Interesting work. If I am understanding correctly, you are essentially using the randomness generated by the scheduler to produce art pieces. Do you have a public link to your work?
Alas, I've been hoarding almost all of the most interesting images I've created using the process, but I wrote up a short piece when I got the initial program running [1], and I do have a few examples of pictures generated using the derivations of the code that still make use of the core of the algorithm (these are all huge images!) [2][3][4][5][6].
Looking over your code, did you ever run it single-threaded to compare the performance? I see you spawning what appears to be a crapton of goroutines, each of which performs a very small number of operations, then terminates. As cheap as goroutines are, they still have some cost to startup and shutdown, and it's not a good idea to spawn a goroutine if it's going to do roughly the same order of work inside the goroutine as the spawn & shutdown costs. This is a pessimal situation. I'm not sure if you're seeing the GC fail or if you're seeing the scheduler choke under the harsh load.
If you just want nondeterminacy, I'd suggest using math/rand directly, not accidents of scheduling... which, as was just proved, can disappear as quickly as they appear. In fact it doesn't even take something as serious as a language upgrade for that to happen; that kind of indeterminacy can disappear if you so much as add a member to a struct or have another process on your system eating 100% of one CPU or something. It's not reliable, not even a little.
On my side I've experimented in the code a bunch with the number of goroutines as well as the number of different CPUs handling the requests, and although they seem to alter the nature of the image the program still produces output in a reasonable amount of time for most non-pathological values.
As to scheduling v. random functions, much of what I like about the original is that the scheduling is completely out of my hands and it's not guaranteed to be consistently distributed or reproducible. It's as though the computer were making its own decisions.
I have a project started to rewrite the whole thing in C++ using random numbers and entirely Manhattan-distance estimations for speed, but working with C++ is incredibly unpleasant and difficult to justify when I need to find a real job.
That's interesting -- my use case had up to 2 minute GC pauses in 1.4, and now runs without a hiccup. For a variety of reasons, some code-side and some on Go's side. I didn't know there were use cases that were negatively affected by GC work done in the last year or two.
Well, in my case specifically, it was a program that generated coloured images based on a set of simple rules (namely, don't reuse the same RGB colour for any pixel & try to be as close to the colour of neighbouring pixels as possible). I wrote a brief overview around the time I finished the first version[1].
More recently, I wrote a new routine which generates several of those and then averages the RGB channel values on a per-pixel basis, and then repeatedly re-averages the original images with those average images to create more "generations." I haven't written anything up about that because the code required to sum and average pixels in a PNG is significantly less interesting.
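In simplified form the averaging looks something like this (a sketch rather than the actual code):

```go
package main

import (
	"image"
	"image/color"
)

// averageImages produces a new image whose pixels are the per-channel mean
// of the input images (all assumed to share the same bounds).
func averageImages(imgs []*image.RGBA) *image.RGBA {
	bounds := imgs[0].Bounds()
	out := image.NewRGBA(bounds)
	for y := bounds.Min.Y; y < bounds.Max.Y; y++ {
		for x := bounds.Min.X; x < bounds.Max.X; x++ {
			var r, g, b uint32
			for _, img := range imgs {
				c := img.RGBAAt(x, y)
				r += uint32(c.R)
				g += uint32(c.G)
				b += uint32(c.B)
			}
			n := uint32(len(imgs))
			out.SetRGBA(x, y, color.RGBA{uint8(r / n), uint8(g / n), uint8(b / n), 255})
		}
	}
	return out
}

func main() {
	a := image.NewRGBA(image.Rect(0, 0, 2, 2))
	b := image.NewRGBA(image.Rect(0, 0, 2, 2))
	_ = averageImages([]*image.RGBA{a, b})
}
```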
For folks looking for a collector with a similar ideal (though with a potentially more useful constraint model), you might see the Pony language [1] collector [2].
The essence being to take advantage of some form of isolation in a concurrent system to dynamically infer something that looks like scoping/lifetimes.
I'd like to see this tried, along with a couple of wrinkles. Could information from the escape analysis done by the Golang compiler be used to provide "hints" to the garbage collector to make it more efficient? Could such hints be combined with further hints from the programmer, such as "finalize" calls or tags? Could runtime tracing information be used to derive even more hints for more efficient GC?
Note: "finalize"-hints in Go are just curly braces, which begin and end blocks. Variables local to the scope are "finalized" (i.e. collectable, assuming they live on the heap) when the block ends.
I'm not sure I understand: do you mean that they are collected after the program counter hits the last instruction of a given scope or are you implying that scope-local variables outlive their scope somehow?
I've been thinking about this before as RAII with optional deferral of reclamation when batching would be advantageous, for instance when you have high allocation pressure and small blocks.
RAII is using one C++ misfeature (destructors) to work around another misfeature (inability to do cleanup when leaving a lexical scope). Go has a far more elegant solution to the latter; if you're not familiar with the defer statement, check it out.
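For anyone who hasn't seen it, a typical use (the file path is made up):

```go
package main

import (
	"fmt"
	"io"
	"os"
)

func readFile(path string) ([]byte, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	// Cleanup is declared right next to acquisition and runs when the
	// function returns, on every path (early returns and panics included).
	defer f.Close()
	return io.ReadAll(f)
}

func main() {
	data, err := readFile("/etc/hostname") // hypothetical path
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("read %d bytes\n", len(data))
}
```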
Eh, defer is more similar to Java's try-with-resources or C#'s using, which both put the burden on the user of the object and cannot handle recursive cleanup (cleanup of trees). So I consider it much worse than destructors.
Except that deterministic destruction is actually a feature that enables cleanup when leaving a scope. Manually peppering your code with defer statements is often doing work a computer can do better than you can.
We really need to stop thinking of memory as a 'store'. Treat it as a cache and (a) GC naturally becomes a background process, and (b) you get persistent memory for free.
When considered as 'store', the value image is at a specific memory location, i.e. a specific offset in process heap. GC's difficult role here is to manage these allocations and compact the heap, etc.
When considered as a 'cache', the entire heap is a fixed sized cache. (LRU or better 2Q to deal with ephemeral objects). The actual memory object is at an offset in a mapped segment. (If mapped to a file, then voila, persistent memory).
As objects are created, they are immediately placed in the 'process cache'. So, to take the example of the OP's 'transactional' memobjs, these are cache entries that get promoted to the LRU section of a 2Q, get utilized, and then are naturally 'garbage collected' by eviction.
This scheme is not universal. Very long lived processes that sporadically access 'ancient' objects will stress the VM/FS. One can extend the temporal range of this scheme by employing a set of rotating FS-backed memory segments, but of course there will be a limit (e.g. your FS).
Does that clarify the scheme?
[p.s. of course one can used a leveled cache hierarchy, with optimal realization mapping to L1, L2, wired-memory. But a basic version can just use a single level 2Q cache.]
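A minimal sketch of the 2Q flavour I mean (sizes and policy invented purely for illustration): new entries land in a probation FIFO and are only promoted to a protected LRU if touched again; eviction from probation is the "garbage collection by eviction" above.

```go
package main

import "container/list"

type entry struct {
	key       string
	val       interface{}
	protected bool
}

// twoQ is a minimal 2Q-flavoured cache: new entries go into a FIFO
// "probation" queue and are promoted into a "protected" LRU on re-use.
// Eviction from probation plays the role of collecting ephemeral objects.
type twoQ struct {
	probation, protected       *list.List
	index                      map[string]*list.Element
	maxProbation, maxProtected int
}

func newTwoQ(p1, p2 int) *twoQ {
	return &twoQ{
		probation: list.New(), protected: list.New(),
		index:        map[string]*list.Element{},
		maxProbation: p1, maxProtected: p2,
	}
}

func (c *twoQ) put(k string, v interface{}) {
	c.index[k] = c.probation.PushFront(&entry{key: k, val: v})
	if c.probation.Len() > c.maxProbation {
		old := c.probation.Back() // ephemeral entry "collected" by eviction
		c.probation.Remove(old)
		delete(c.index, old.Value.(*entry).key)
	}
}

func (c *twoQ) get(k string) (interface{}, bool) {
	el, ok := c.index[k]
	if !ok {
		return nil, false
	}
	ent := el.Value.(*entry)
	if ent.protected {
		c.protected.MoveToFront(el)
		return ent.val, true
	}
	// Promote on re-use: move from probation to the protected LRU.
	c.probation.Remove(el)
	ent.protected = true
	c.index[k] = c.protected.PushFront(ent)
	if c.protected.Len() > c.maxProtected {
		old := c.protected.Back()
		c.protected.Remove(old)
		delete(c.index, old.Value.(*entry).key)
	}
	return ent.val, true
}

func main() {
	c := newTwoQ(2, 2)
	c.put("a", 1)
	c.put("b", 2)
	c.get("a")    // "a" promoted to protected
	c.put("c", 3) // probation: c, b
	c.put("d", 4) // probation over capacity: "b" evicted
	if _, ok := c.get("b"); !ok {
		println("b was evicted")
	}
}
```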
> My late friend Alain Fournier once told me that he considered the lowest form of academic work to be taxonomy. And you know what? Type hierarchies are just taxonomy.
Possibly Mr. Fournier meant 'foundational' by "lowest". If not, he may wish to revisit the foundational work performed by "lowly" academics such as a Mr. Darwin in the 19th century.
[p.s. we need to alert the CPU designers that they have been cluelessly using caches all these years. If only they had higher guidance from an infallible mind..]
Both systems have active runtime sub-systems, GC, scheduler, and a compiler front-end.
Java, of course, has a virtual machine and intermediate representation (byte codes), and a JIT to make it performant. The runtime class loading mechanism is a key +input here. The JIT aspect is a mix, with the -input being the upfront startup cost.
Go has far better C linkage, access to underlying memory object image, and provides the (glaringly missing from JVM) unsigned integer types.
Java has a more modern type system and first-class metadata facilities, with the -input being "more rope". Go has an 'interesting' type system, with the -input being source-level fragility, but the +input here is "less rope".
Having extensively worked with concurrent systems in both languages, the distinction ultimately boils down to a question of whether language-level continuation (fiber/green) as provided by Go vs. library/ByteCodeEngineering (various Java libs) is a critical requirement.
I honestly think it should be clear at this point that Go is really displacing Python and the notion of Go vs. Java is a false dichotomy.
This is changing. There already are 3rd-party FFI libraries for java[0][1] which are far less cumbersome to use than JNI and in the future there will be Project Panama[2] which will provide that out of the box.
> the distinction ultimately boils down to a question of whether language-level continuation (fiber/green) as provided by Go vs. library/ByteCodeEngineering (various Java libs) is a critical requirement.
Excellent point. This doesn't seem to be brought up very often (or at least it's rarely emphasized) in Java/Go comparisons.
Also, the Go advantage regarding AOT compilation to native code only exists if one ignores that the majority of commercial third-party JVMs do have it as a feature.
Except that Java already has G1GC[0], a region-based collector. And if you're willing to shell out money there even is a pauseless, fully concurrent collector[1] scaling to hundreds of gigabytes.
Shenandoah[0]. From what I've read it's contributed to openjdk by redhat. Their claims about pause times aren't quite as strong as those of Azul, but <100ms pauses for 100GB+ heaps will still be leagues ahead of most GCs.
It's probably also less CPU-efficient than Azul's, since it uses forwarding pointers instead of page-table trickery. But if you have such a large heap you'll also have many cores. Here's a performance comparison to G1 [1].
The Go 1.7 GC is already on par with most Java GCs and arguably better. Our Go projects with large heaps now have better pause times than the similar Java programs, all having tried many of the Java GCs...
Part of the reason for this is that Java GC development has focused more on throughput, whereas Go GC development has focused more on pause time. It's rather hard to compare throughput between two such different languages, though.
It isn't a language freeze but a compatibility promise.
"It is intended that programs written to the Go 1 specification will continue to compile and run correctly, unchanged, over the lifetime of that specification. At some indefinite point, a Go 2 specification may arise, but until that time, Go programs that work today should continue to work even as future "point" releases of Go 1 arise (Go 1.1, Go 1.2, etc.)."
Granted that means it is effectively a freeze on backwards incompatible changes, but it does leave the door open for additions to the language.
Having said that, the go authors are definitely very conservative with language changes.
That stupid issue means that packages like os return 'error' even when it will always be of type '*PathError', so you have to be careful to note whether the package you're using screwed up by returning concrete error structs when you do `if err != nil`, or else you'll end up comparing a nil interface against a nil struct... not what you want.
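The gotcha, when a package does return the concrete type, looks like this (a contrived illustration, not actual os code):

```go
package main

import (
	"fmt"
	"os"
)

// badOpen makes the mistake described above: its signature names the
// concrete error type instead of the error interface.
func badOpen(path string) *os.PathError {
	return nil // "no error"
}

func main() {
	var err error = badOpen("/tmp/x") // hypothetical path
	// err is now a non-nil interface wrapping a nil *os.PathError,
	// so the usual check does the wrong thing:
	fmt.Println(err != nil) // prints true
}
```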
This hinders the ability to make richer error types somewhat.
In addition, it means you have to always read the docs because the type system cannot represent the possible errors you can get there safely.
Second, this is bad because of places in the standard library (like io.Reader.Read) that return two values (numRead, err), both of which are useful.
Due to the lack of sum types, you can't actually tell in multiple-returns of errors whether the results are mutually exclusive or not. What's more, without sum types, there's no way to express to the type system correctly that you indeed are in a non-nil branch of a value or so on.
Finally, there's no macros or other ways in the language to have more expressive error checking than the brain-dead "if err != nil". Sure, what go has is better than java's terrible implementation of exceptions, but it's so much worse than what you get in rust or ocaml or so on.
Actually no. The Go compiler already has "stack promotion" (using "escape analysis"). TOC is in addition to that, and I think it's intended to cover the cases not covered by stack promotion: for example, function A calls function B, which allocates some memory and returns it to function A, which "consumes" it before returning. In this case the allocation can't be promoted to the stack, because B returns and its stack frame gets destroyed... but the allocated memory can still be safely released when function A returns.
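A concrete version of that shape (nothing here demonstrates what TOC itself would do; it just shows why the allocation escapes B's stack but is dead once A returns):

```go
package main

import "fmt"

// functionB allocates a slice and returns it; barring inlining, the value
// outlives B's stack frame, so escape analysis has to heap-allocate it.
func functionB() []int {
	xs := make([]int, 1000)
	for i := range xs {
		xs[i] = i
	}
	return xs
}

// functionA consumes the allocation and never lets it escape further, so
// the memory is garbage by the time A returns -- the case a
// goroutine/transaction-scoped collector could reclaim cheaply.
func functionA() int {
	xs := functionB()
	sum := 0
	for _, x := range xs {
		sum += x
	}
	return sum
}

func main() {
	fmt.Println(functionA())
}
```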
Each release of Go seems to have yet another garbage collection algorithm, targeting a different use case each time. Presumably the previous use cases are left behind, ready for a future update going back to target them too. This leads to an infinite loop of GC alterations.
Does anyone feel confident that a 'good enough for everyone' GC will ever be produced? If not, surely the language designers should settle on picking a type of program that they will optimise for, and sticking to it.
Without full memory control you're always going to be making trade-offs in terms of performance. Hopefully the GC is "good enough" for your use case, but there will never be a one-size-fits-all GC.
FWIW that "regression" looks like a program that depends on particular runtime characteristics to generate "artistic" output. Not really the kind of issue you can realistically develop around.
In almost all practical programs the new GC (in 1.5) was a significant win, or at worst no change.
> The Go GC is constantly improving, and its performance in Go 1.6 and 1.7 is considered very good by most users.
Also true of Ruby or PHP. Which perform abysmally.
"Most users" needs aren't what we look for to define technical excellence - if anything, I most often see "most users" used to guide the compromising of technical standards.
GC highlights the time-memory tradeoff at its worst, because GC-based languages typically do not allow the programmer to explicitly express their memory model, forcing a lot of unnecessary "garbage" to be created. In Ruby, all values that do not fit in a tagged pointer are heap objects. In Go and Java some types are specifically optimized as stack-based "value" types, which reduces the amount of work for the GC.
I personally like the "ARC" approach of ObjC and Swift, where the compiler statically adds and optimizes reference-counting routines, promotes objects to the stack when possible, and requires the programmer to manually resolve cyclic references (and Xcode supports nice methods to debug and visualize memory in order to detect unintentional cycles). Accidental memory cycles are rare enough and quite debuggable, at the very little mental cost of marking some references as "weak" or "unowned". But the long-term performance and memory exhaustion issues caused by GCs are completely bypassed.
The problem with RC as an algorithm is that it increases locking on shared data structures, and thus cache contention, and you still have a "stop the world" moment when destroying a complex data structure causes a cascade of deletions.
There was a Swift talk exactly about struct vs reference types and making use of COW data structures to reduce ARC impact.
> Does anyone feel confident that a 'good enough for everyone' GC will ever be produced?
Probably not, GC is always full of tradeoffs. That's why the JVM has a bunch of tuneable GCs, and there are third-party JVMs with their own strategies and GC tradeoffs (e.g. Azul).
A while ago, I had an idea for a "tidal" collector. If you look at what a thread in a typical server application is doing, it's in a loop where it picks up some task (a network request, a job from a queue, etc), then goes and does a load of work on it. If you plot the stack depth over successive loops, you should see a sort of rising and falling tide - low tide when the task is picked up, and high tide during processing. Along with that rising and falling tide, you should get a rising and falling volume of live objects. Because garbage collection has a cost proportional to the number of live objects, for best efficiency, you want to collect when there are the fewest live objects. Which should be low tide.
To make use of this idea (which might be false, but at least sounds plausible), you'd need two things: a layout where each thread has its own memory (at least, enough of its own memory to hold all the objects that escape eden but become garbage during a request cycle; you might want a shared old generation), and some magic device which watches the stack depth and identifies the moment of low tide.
G1 might provide the right layout. I think every thread gets its own eden region; if you could tweak it so that each one promotes into its own survivor regions, and then be able to find the survivor regions for a given thread, you just need to be able to decide when is the right time to collect. You might be able to do that on the back of the dynamic profiling infrastructure already in the JVM; you would want to sample each thread every now and then to identify the method, even the particular line of code, where low tide happens, and then compile in a GC trigger there.
I am a humble application programmer, so the only work I ever did on this idea was maybe telling people about it on newsgroups or forums (hi everybody!). I'd be amazed if nobody else has thought of it. But I've never heard of anyone trying it. Perhaps it's just a terrible idea.
The proposed Go design sounds really quite like this. With a huge advantage: because Go tends to use these fire-and-forget goroutines, the problem of identifying low tide goes away, and you can just collect when the goroutine dies.