
Go 1.4+ Garbage Collection Plan and Roadmap - crawshaw
http://golang.org/s/go14gc
======
kator
STW has to die. I love that the goal is 10ms but in some environments the
world has passed you by in 10ms and with STW you've timed out every single
connection.

I want badly to fall in love with Go; I've enjoyed using it for some of my
projects, but I've seen serious challenges with more than four cores at very
low latency (talking 9ms network RTT) and high QPS. I cheated in one
application by pinning a Go process to each CPU and lying to Go that there
was only one CPU, but then you lose all the cool features of channels etc.
That helped some, but an Nginx/LuaJIT implementation of the same solution
still crushed it on the same box with an identical workload.

If we have to have STW, it would be nice to have it configurable; in some
environments trading memory for latency is fine and should be an option.

The way Azul's Zing handles GC for Java is quite brilliant. I'm not sure how
much of that is open technology, but it would be cool to see the Go team
review what was learned in the process of maturing Java for low-latency,
high-QPS systems.

~~~
f2f
> lied to Go and told it there was only one CPU but then you lose all the
> cool features of channels etc.

Channels work just fine in a single-CPU context. Concurrency is not
parallelism.

~~~
dinkumthinkum
That struck me as odd. It made me question the poster's entire results based
on that statement.

~~~
Jare
I took it to mean he launched several copies of the process, each copy pinned
to one (and only one) core.

~~~
kator
Correct. It was the only way to get Go to scale to a reasonable response rate
for this application, and it meant my CPU loading was uneven, since it
depended on the randomness of which socket connections persisted on one CPU
or another.

------
eloff
People use the JVM in soft realtime / financial applications, and the trick is
to reduce allocations, especially of objects that are long-lived enough to
make it to gen 2.

Go is better suited than Java for those kinds of applications, because it's
easier to avoid allocations (Java doesn't have stack allocations, but Go
does.) Also hard upper limits on GC time are very helpful for those cases
where allocations can't be reduced any further. The standard library
additionally has a Pool type that allows for reducing pressure on the GC
through object reuse.

~~~
tomlu
> Also hard upper limits on GC time are very helpful for those cases where
> allocations can't be reduced any further

Yes, but 10ms is too long to be useful for games. I'd rather take 1ms every
frame than 10ms sometimes.

~~~
eloff
I don't buy that argument. Lua has a much worse STW garbage collector and it's
commonly used in games.

Furthermore with games you can do tricks like allocating out of an arena with
bump pointers and then resetting the pointer at the end of the frame.

Lots of games are developed with JVM, .NET, or Lua. GC doesn't seem to be a
show stopper for them, you just have to be smart about allocations.

~~~
jblow
Lots of _low-end_, non-premium-experience games are made with those things.
And e.g. games that use Lua usually only use it for high-level gameplay logic,
i.e. most of the running code is in C++ or something.

And yes, it's a problem.

See the other comments here about VR. With VR you want to render at 90 frames
per second, in other words, you get 11 milliseconds to draw the scene _twice_
(once for each eye). That is 5.5 milliseconds to draw the scene. If you pause
and miss the frame deadline, it induces nausea in the user.

But this comment drives me up the wall:

"GC doesn't seem to be a show stopper for them, you just have to be smart
about allocations..."

 _The whole point of GC is to create a situation where you don't have to
think about allocations!_ If you have to think about allocations, GC has
failed to do its job. This is obvious, yet there are all these people walking
around with some kind of GC Stockholm Syndrome.

So now you are trapped in a situation where not only do you have to think
about allocations, and optimize them down, etc, etc, but you have also lost
the low-level control you get in a non-GC'd language, and have given up the
ability to deliver a solid experience.

Bad trade.

~~~
enneff
> The whole point of GC is to create a situation where you don't have to think
> about allocations!

Nope. The point of GC is memory safety.

GC also means you don't have to think about _freeing_ memory, which is
important in concurrent systems.

But even if GC _was_ about "not thinking about allocations", what's bad about
only having to think about allocations when it's important? Code clarity
trumps performance, except at bottlenecks.

~~~
jblow
Well, you're being a bit revisionist.

You can get memory safety without GC, and a number of GC'd systems do not
provide memory safety.

If you think that, for concurrent systems, it is a good idea to let
deallocations pile up until some future time at which a lot of work has to be
done to rediscover them, and during this discovery process ALL THREADS ARE
FROZEN, then you have a different definition of concurrency than I do. Or
something.

If you want to know about code clarity, then understanding what your program
does with memory, and expressing that clearly in code rather than trying to
sweep it under the rug, is really good for clarity. Try it sometime.

------
davidtgoldblatt
For those interested in additional technical details of high-performance
garbage collection, the book cited (The Garbage Collection Handbook: The Art
of Automatic Memory Management) is a _fantastic_ reference. It's one of the
best-written technical books I own, and distills much of the modern
literature. If you need to do GC performance tuning or reason about memory
management issues in the JVM, having this book around will be very useful.

~~~
bcantrill
My apologies for going off topic for a second, but is it well-known that
Amazon seems to use Uber-esque surge pricing?! I (like apparently a lot of
people) went to Amazon to buy the book cited as soon as I saw it (and it looks
like a great book!). There were eight copies left in stock, and I bought it
for $62.69.[1] When I went back to Amazon a few minutes later (following
someone else's link), there was only one copy left in stock, and it was listed
for $98.60.[2] With my apologies again for being off-topic, is this a known
phenomenon?

Anyway, thanks for seconding their recommendation of the book; I'm looking
forward to reading it -- and glad I saved the 35 clams!

[1]
[https://www.evernote.com/shard/s249/sh/53d7b93a-d737-41c4-8e...](https://www.evernote.com/shard/s249/sh/53d7b93a-d737-41c4-8e2d-be8a48d88407/a6cc6a65f6890b77759ec1c9aef90a35)

[2]
[https://www.evernote.com/shard/s249/sh/a1c55244-d64e-40cc-a6...](https://www.evernote.com/shard/s249/sh/a1c55244-d64e-40cc-a6b0-b3941533a0e2/da97f45d87bdb50b47c80fb5cc02e16a)

~~~
Someone
As another poster said, these aren't 100% identical books, but you can see
the same thing happen for exactly the same book, too, when two booksellers
that sell through the Amazon web site have bots that try to outsmart each
other. For an extreme example, see [http://news.discovery.com/tech/amazon-lists-books-for-23-mil...](http://news.discovery.com/tech/amazon-lists-books-for-23-million-bucks.htm)

~~~
bcantrill
Crap -- I'm an idiot. (Though at least the one I purchased is the right one.)
If I could only downvote myself, I would...

------
chetanahuja
Seems like the 10ms pause thing provokes a much sharper reaction (at least
among this crowd) than this little nugget:

 _" Hardware provisioning should allow for in-memory heap sizes twice as large
as reachable memory."_

So I know "memory is cheap"(TM), but surely a 100% physical RAM overhead for
your memory management scheme is worth at least a small amount of
hand-wringing. No?

~~~
acqq
It's normal, and not even too big a demand, once you accept that the system
depends on the GC. They are just being honest, even slightly optimistic.
Given the current state of the art, you should be suspicious of anybody who
claims that GC'd systems don't need significantly more RAM.

See, for example:

[http://sealedabstract.com/wp-content/uploads/2013/05/Screen-...](http://sealedabstract.com/wp-content/uploads/2013/05/Screen-Shot-2013-05-14-at-10.15.29-PM.png)

(quoted in [http://sealedabstract.com/rants/why-mobile-web-apps-are-slow...](http://sealedabstract.com/rants/why-mobile-web-apps-are-slow/))

According to the graph, most GCs start being really slow even with 3 times
more memory than manual management needs.

~~~
rbehrends
That is misleading (and that article is really bad in other ways, too, but
that's a different story).

First of all, the "most GCs" are the non-generational GCs.

That's well known. The amount of tracing work that a non-generational
collector has to do per collection is proportional to the size of the live
set. Thus, collecting less frequently (by increasing how large the heap may
grow before another collection has to occur) makes GC faster, with the cost
falling roughly in inverse proportion to how much bigger you make the heap.

Generational collection can greatly mitigate tracing of the live set when you
have many short-lived objects that never leave the nursery.

Second, the benchmark compares the allocation/collection work of various
garbage collectors against an oracular memory manager using malloc()/free().
Not only do alternatives that avoid automatic memory management not
necessarily match this performance (naive reference counting tends to be even
slower, pool allocation can also have considerable memory overhead, etc.);
more importantly, the overhead applies only to the
allocation/collection/deallocation work. For example, if your mutator uses
90% of the CPU with an oracular memory manager, then a 100% GC overhead means
only 10% total application overhead.

~~~
acqq
I really welcome links to better measurements and graphs. Please give me the
hard data, properly presented; don't make claims without citations. I really
want to learn more.

Correct me if I'm wrong, but even generational GCs present their own
problems: the costs of having the GC increase not only when you have too
little memory but also when you try to use too much (more than e.g. 6-8 GB,
which can be needed in server applications). As far as I know, only Azul's
proprietary GC is claimed to avoid most of the problems typical of
practically all other known GCs. fmstephe, in his comment here, linked to one
discussion in which the Azul GC's author participated. But nowhere have I
read the claim that any GC doesn't need significantly more RAM than manual
management.

~~~
rbehrends
> I really welcome the links to the better measurements and graphs.

Simply look at the picture you referenced and fully understand what it says.
Read the accompanying paper also.

> Correct me if I'm wrong, but even the "generational" GCs present their own
> problems

Every memory management scheme has pros and cons, yes.

~~~
acqq
Please be clear. Do you claim it's misleading that the GCs need at least twice
as much RAM to be performant? If so, based on what actually do you claim that
the graph I linked doesn't support that? Can you give an example of some
system that does better, with measurements etc?

~~~
rbehrends
> Do you claim it's misleading that the GCs need at least twice as much RAM to
> be performant?

This is not what you wrote. You said that "most GCs start being really slow
even with 3 times more memory than needed with the manual management", while
the generational mark-sweep collector has essentially zero overhead with 3x
RAM in that benchmark. The "most GCs" you're referring to are algorithms that
are decades behind the state of the art.

Also, "really slow" is a fuzzy term and I am not sure how you come to that
conclusion from the image.

Remember, they're compared to an oracular allocator that has perfect knowledge
about lifetime/reachability without actually having to calculate it. That
ideal situation rarely obtains in the real world. The paper uses this case to
have a baseline for quantitative comparison (similar to how in some situations
speeds are expressed as a fraction of c), not because it represents an actual
and realistic implementation.

~~~
acqq
You answered none of what I asked. I asked for links, measurements, graphs.

Your only arguments: after noting that I wrote "most need even 3 times more,"
you give an example of one that needs 2 times more. Then you complain that
"really slow" is fuzzy. Then you claim that the "ideal situation rarely
obtains in the real world."

I asked you for graphs and links.

~~~
rbehrends
My point is that you don't understand your own source. The "links,
measurements, graphs" are in the paper you referenced, they just say something
different from what you believe they are saying.

If you're struggling with understanding the paper, there's really nothing more
I can do to help.

~~~
acqq
Apart from claiming that I use "fuzzy" words, or that my set of "most GCs"
unsurprisingly doesn't include the kind that Go still doesn't have and
probably won't have for some years more, what have I actually written that
you refuted?

------
SEJeff
10 milliseconds is far too long a STW pause for anything in the financial
industry. I can also see it not being great for robotics or several other
latency-critical industries.

It's a shame too. I love writing Go.

~~~
schmichael
You mention two industries with extremely demanding performance profiles
(finance and robotics). I don't know that I'd trust any 5-year-old platform
in those contexts.

There are lots of other contexts where the performance profile outlined in
this document is sufficient. If you love writing Go, then I'd suggest staying
out of the financial industry and robotics.

~~~
jblow
And video games.

And avionics/aerospace.

And self-driving cars. And medical equipment.

etc, etc. You can list lots of fields for which this is unacceptable, and they
are a lot of the really interesting fields.

~~~
ecuzzillo
Not sure about self-driving cars. I work in robotics, and there are a few
modules that need to hit hard real-time deadlines. You wouldn't use GC
languages for those, but there are a LOT of other parts that don't, and you
can get a BIG win by writing those in a more concise language.

------
BinaryIdiot
I'm absolutely in love with C++'s RAII scheme, but it seems almost no
languages use it; instead they go with complex garbage-collection schemes
that require pausing.

I want to like Go, but a language that targets native development while still
using garbage collection just seems like an odd pairing to me. Maybe it's
just me, especially since I rarely get an opportunity to do native
development.

~~~
dilap
ObjC with automatic reference counting (and now Swift, I assume, though I
haven't looked into it) is close to garbage collection, but has deterministic
destruction.

The price you pay is you have to manually avoid cycles by annotating some
references as weak.

It's a very nice scheme, actually.

~~~
pcwalton
Thread-safe reference counting usually has lower throughput than tracing
garbage collection. If it's throughput you're after, you're not going to get
it with atomic reference counting.

~~~
masklinn
OTOH, refcounting has a more stable (and usually lower on average) memory
usage, and no pauses (if cycles are ignored).

~~~
Tuna-Fish
> and no pauses (if cycles are ignored)

Actually, even without cycles the destruction time is in general unbounded.
Just think what happens when you allocate a _very_ long linked list one
element at a time and then drop the head. With a bit of ill luck or
intention, you can make each element be allocated from a different page, with
enough different pages that they fall out of the TLB. In that situation, even
without having pages written out to disk, you can expect each element to take
~a thousand cycles to free. On a million-element list, that's ~10^9 cycles,
or ~500ms on a 2GHz machine, just for freeing the head.

~~~
masklinn
That specific issue exists in manual memory management schemes as well.

------
dualogy
"Quantitatively this means that for adequately provisioned machines limiting
GC latency to less than 10 milliseconds (10ms), with mutator (Go application
code) availability of more than 40 ms out of every 50 ms" -- once they get
there, they should try to make those targets customizable, with default
values being 10ms/50ms. That'd be marvellous.

~~~
dsymonds
Those numbers are also for a current average $1k computer; a faster computer
would naturally lower those numbers.

~~~
kator
While a more expensive computer may also be doing more memory allocations. :-)

------
rurban
Be fast and use Cheney (copying, double memory) or be slow with mark-and-sweep
and only the current memory usage. Nothing new here.

Catching up to older GCs via concurrent GC states is fine and dandy, but it
is still just catching up, and it requires GC state; Cheney does not. And a
typical Cheney GC pause is 3-10ms, not 10-50ms.

------
ChikkaChiChi
Go is a language built on concurrency. Today's computers are usually allocated
a lot of memory resources.

Garbage collection is required, but even a hybrid STW approach only reduces
latency; it doesn't eliminate it. Nor does there seem to be any foreseeable
way of allowing developers to issue a GC request in a timely fashion.

What if, prior to enacting GC, Go concurrently shifted from its current
memory allocation to an exact clone, cleaned up, then shifted back? Or maybe
it cleans as it's cloning, enabling a shift to a newer, cleaner allocation?
Sure, there would be latency during the switch, but it would be considerably
less than stopping everything and waiting for GC to finish.

~~~
howeyc
Please forgive my ignorance, but aren't you basically describing a "mark and
sweep"[1]?

[1]
[https://en.wikipedia.org/wiki/Mark_and_sweep#Moving_vs._non-...](https://en.wikipedia.org/wiki/Mark_and_sweep#Moving_vs._non-moving)

~~~
ChikkaChiChi
I'm new to these concepts, so please forgive me for coming up with something
that already exists.

I guess then the question becomes why would the roadmap avoid such an
implementation?

~~~
hedgehog
C4 is a well-known pauseless GC; the paper is good, and digging through the
citations will get you pretty caught up on the theory and practice:

[http://www.azulsystems.com/sites/default/files/images/c4_pap...](http://www.azulsystems.com/sites/default/files/images/c4_paper_acm.pdf)

------
sudhirj
When there's talk of making sure the GC doesn't take up more than 10ms out of
every 50ms, this is only when the GC is actually happening and not during
regular running, right?

~~~
bearbin
Yes. The 10ms blocked out of every 50ms, plus at most 25% of CPU time
otherwise, applies only while the GC is running.

------
Thaxll
Is it possible to make a realtime game server with the current GC?

~~~
masklinn
Depends what you mean by "realtime". A 10ms pause is still 60% of a 16ms
frame budget, potentially every 3 frames. You're probably not simulating the
game at 60fps, but it's worth about 2000 miles of network delay[0].

All in all, it depends on whether you're serving a turn-based roguelike, an
RTS, or an FPS. It won't be an issue for the first, may not be for the
second; the latter, though...

[0]
[http://www.ibiblio.org/harris/500milemail.html](http://www.ibiblio.org/harris/500milemail.html)

~~~
marvy
I smile every time I read that story. I'm up-voting for the link alone. (But
yeah, it really sounds like they didn't have games in mind when they designed
Go.)

~~~
gillianseed
Well, 'games' is quite a wide concept; a ton of games have been released
through XNA, which used a 'stop the world' garbage collector until late in
its existence, IIRC.

So there's no reason your typical indie-style game couldn't be written in Go,
but of course if you want to write an AAA-style 'push the boundaries of
realtime graphics and physics' game, Go or any other garbage-collected
language is not a very likely candidate.

~~~
marvy
Now that you mention it, I guess you're right. Even XNA would be bad if you're
pushing the limits of the machine.

------
jaekwon
Curious: how "real time" can a large application get in Go by judiciously
working with preallocated structs and slices? Perhaps if the underlying
system libraries don't require much garbage collection, you can avoid
stopping the world for too long.

~~~
hedgehog
That lets you make GC infrequent, but the amount of heap determines the pause
time when the GC comes calling. If you have a large number of live objects
with pointers in them, then pauses are unavoidable with the current Go GC. It
sounds like things will be a lot better in 1.5, though, if all goes according
to plan.

~~~
TylerE
If you're pre-allocating you can just turn the GC off ;)

~~~
hedgehog
Certainly, if you could get away from all allocations, but it's pretty likely
you'll end up with some from the standard library or other external code.

------
smegel
As a side issue, are there any plans to switch to gccgo as the main compiler?
It can do everything that 6g can do, from what I understand, and produces
much faster code thanks to GCC's well-tuned optimizer.

~~~
twotwotwo
Nope; gccgo is a separate project. Its good code generation is totally cool
for particular purposes, but the Go project has its own priorities that
they're going to implement in their own toolchain.

------
jblow
Availability of 40ms out of every 50ms... how many 9s is that? Oh wait.

"Go, the language with zero nines of availability."

~~~
kyrra
From my understanding, this only happens when the GC runs. Currently (Go 1.3)
the GC runs every few minutes or when you manually invoke it, and it stops
the world and does the entire cleanup at once. Under this plan, the GC will
instead do the cleanup in small 10ms bursts, so you won't have a 200ms pause
every few minutes.

