
Recycling memory buffers in Go - jgrahamc
http://blog.cloudflare.com/recycling-memory-buffers-in-go
======
stcredzero
Buffer caching, the technique that never goes away.

GC does not mean your memory management is solved. It means that your memory
management works well enough for development and sometimes for undemanding
tasks. That said, generational GC can be like mediocre buffer caching "for
free." But this still doesn't make all explicit buffer caching go away.

Given that buffer caching is "the technique that never goes away," why isn't
some facility for handling this built into the runtimes of languages with GC?
You could tell the GC that this particular type of object needs to be cached,
with a cache of this size and growth policy, and then programmers never have
to write and debug code that does this.

EDIT: Another way to think about this: because GC is designed to be universal
and foolproof, it's easy to get GC into a corner case where its behavior is
very suboptimal. In the case of "memory pressure" you can think of this as the
GC having to scan the graph of references far more often than it really needs
to. However, the optimization techniques for many of these cases are fairly
generic. What if optimization techniques were built into languages and
runtimes? Instrumentation could then be used to empirically determine which
optimizations are actually needed, and those could simply be "switched on" for
particular classes or types.

~~~
fiatmoney
That shouldn't usually require any explicit support from the runtime; it can
exist as library code (and does - for instance Apache Commons:
[http://commons.apache.org/proper/commons-pool/](http://commons.apache.org/proper/commons-pool/) ).

~~~
stcredzero
_> That shouldn't usually require any explicit support from the runtime; it
can exist as library code_

Well, of course. The fact that people write their own logically implies that
it can exist as library code. But if you build in support from the runtime, it
might be possible to squeeze out a bit more performance. For example, perhaps
it's possible to tell the GC that certain objects don't have to be considered
as roots. (And to the extent that this is possible with current libraries,
it's because there is support built into the runtime.)

EDIT: Also, having these mechanisms as library code probably opens up more
opportunities for introducing implementation errors.

~~~
jamesaguilar
Generic buffer caching support looks an awful lot like generational GC. It's
faster when you hand-roll it because you know how you're going to use the
buffers and which buffers are going to be reused.

~~~
stcredzero
I'm proposing something that's not quite as foolproof and automatic as
generational GC, but more optimal and a lot quicker and safer than hand-
rolling. Also something for which profiling is enough to "know how you're
going to use the buffers and which buffers are going to be reused."

So basically, take something that's a programming problem and reduce it to
"profile, then throw some switches."

~~~
jamesaguilar
I guess I'm just not seeing how that's going to result in anything different
than a userspace freelist library.

~~~
stcredzero
If you do everything correctly, not too much. But you can't assume everyone
will.
Also, having the thing baked into the language/runtime will encourage use.
It's along the lines of what happens with refactoring facilities that aren't
part of the browser/editor/ide. They stay "that esoteric thing over there" and
don't get used that much.

------
zwieback
Maybe I didn't read the code right, but where does it check whether a
retrieved buffer is the right size? The makebuffer() routine appears to
generate random-sized buffers, but that length then doesn't get checked.

My experience from embedded systems is that buffer rings are great if you can
predict two or three common sizes that cover your typical allocation needs.

I think even for people that work mostly in GC environments it's a good idea
to read through the basic mechanism of heap management and allocation
strategies. This article is a good intro to the basic idea and I like the
charting and analysis being done.

~~~
jgrahamc
The sample code does not check the buffer size. This is a slightly artificial
example to show the technique and have some interesting graphs to back it up.

------
jgrahamc
There's a proposal to add something like this in Go itself:
[https://code.google.com/p/go/issues/detail?id=4720](https://code.google.com/p/go/issues/detail?id=4720)

~~~
stock_toaster
Unfortunately, it looks like it isn't going anywhere right now:

[https://groups.google.com/d/topic/golang-dev/QtSIOFsNFrQ/discussion](https://groups.google.com/d/topic/golang-dev/QtSIOFsNFrQ/discussion)

------
thirsteh
This is nice, but I can't help but feel it's a little wrong that we're still
making free-lists like in the olden days.

~~~
tptacek
You absolutely don't need to do any of this to write Go programs that
outperform Ruby or Python.

~~~
thirsteh
Generally, sure. But it's still pretty easy to write garbage-generating Go
applications that exhaust memory because the GC isn't compacting. In some
applications this is a no-brainer: use a free list. In others, like when you
don't actually know what buffer size you'll need, it's not so easy (although
you could argue you should just use fixed-size chunks for whatever you're
doing, and recycle those, but that's a lot of work.)

What I'm saying boils down to: Free lists are definitely useful, but the GC
could be better.

~~~
jbooth
"Could be better" is of course true for every single situation, no matter how
good.

But even Java, which has what is generally regarded as the most advanced GC
ever written, still has lots of applications doing the free-list of
ByteBuffers or even better, the free-list of off-heap-allocated
DirectByteBuffers which don't have to be moved during tenured generation
compactions.

It seems pretty reasonable to me to have one or two freelists at the spots
you're frequently allocating 64kb buffers, along with using the GC for the
other 5,000 allocations in your project.

~~~
thirsteh
Sure. No disagreement.

Let me put it another way: Go's GC is still in its infancy compared to those
in most other languages. It's perfectly fine to use, but there are some big
things (e.g. no compaction) that stick out when using the language, whether
you're working with free-lists or something else.

~~~
tptacek
I think you're going to find that even with no attention to allocation at all,
your Golang programs are going to outperform their Ruby and Python
equivalents.

That's a dicier assertion to make versus other languages, notably Java, but if
you're coming to Golang from a high level language like Python, you're
probably going to be happy with the performance you get out of naive code.

~~~
thirsteh
You're right (except for some cases like crypto and regex where other
languages "cheat" by using C bindings or unstable versions), but that isn't
where it has bitten me. In general all my Go programs have been more than fast
enough, but on several occasions I've had memory starvation problems that I
would not have had in other languages (even if there _is_ a way to do it
without a lot of allocation and fragmentation in Go.)

I love Go, but "it's fast" isn't a good reason not to improve a subpar GC.
(Which fortunately isn't a position the Go team is taking. I know everyone
there would like to see a lot of interesting things happen with the GC. It
just hasn't been one of the more pressing issues so far, and I've agreed. It's
also a very difficult and "boring" thing to work on for most people, so it's
not surprising that somebody hasn't just added an Azul C4-style concurrent GC
in their spare time.)

~~~
pcwalton
Azul C4 requires a kernel module or a specialized kernel to support all the
MMU tricks that it relies on to achieve maximum performance. I don't think
it's really an appropriate choice for general purpose GC in a language that's
designed to be easy to deploy.

Java HotSpot's collector makes the most sense IMHO: stop-the-world compacting
in the young generation, mostly concurrent in the tenured generation. It
achieves better throughput than Azul C4 too, according to the C4 paper.

~~~
agentS
I vaguely remember reading somewhere that they figured out a way to run their
collector on stock Linux; I think the price they paid in exchange was (even)
lower throughput (trying to track down reference so I know I wasn't
daydreaming) (yes, I daydream about GCs).

A while ago, I read through the kernel patch they provided, and it was
basically adding batch APIs for mremap so you don't have to take a TLB flush
for every call. All mutator threads also need to be paused at a safepoint
while the remaps happen, so the batched API gives much shorter pause times.

------
thrownaway2424
This technique puts a lock in your serving path which will necessarily limit
parallelism. Go channel operations are not "non-blocking" in every sense. Even
if you have them in a select block your thread has to take the mutex to find
out if the buffer is empty (on receive) or full (on send).

------
felixge
FWIW: Node.js uses a buffer cache like this by default:
[https://github.com/joyent/node/blob/master/lib/buffer.js](https://github.com/joyent/node/blob/master/lib/buffer.js)

~~~
felixge
Would the down-voters care to explain? I've pointed out how another platform
deals with allocating raw memory buffers as it's relevant to the discussion.

I've made no claims of it being superior, and in fact, am largely using Go
myself these days.

~~~
thirsteh
Using a shared buffer list isn't very efficient in a language with concurrency
since you have to synchronize to take from/add to the list.

~~~
cmbaus
All heap access must be synchronized. Using pooled buffers may actually be
faster because you can ensure that only one thread is accessing a specific
pool. Just create a new pool when a new thread is created.

~~~
thirsteh
Or, you know, each thread could allocate only when it needs something and
handle its own recycling. Node.js has a particular advantage in that it isn't
concurrent in any way (which, of course, has many other downsides.)

~~~
bad_user
You can certainly build concurrent systems on top of Node; what it doesn't
have is preemptive multi-threading, which sucks IMHO. But speaking of Go, I
also don't like that it does M:N multi-threading instead of 1:1 like the JVM,
as the scheduling it does is suboptimal.

~~~
GhotiFish
Oh man, I'm pretty keen to write many types of subsystems, but thread
scheduling?

I'll take stuff I don't want to mess with for 300, Alex.

------
wzsddtc
Coincidentally, this is how we pooled buffers on Vine for Android to make
camera initialization efficient.

