
Show HN: Mmm – manual memory management for Go - teh_cmc
https://github.com/teh-cmc/mmm
======
tptacek
This is certainly interesting work. It seems like the kinds of designs for
which it would make a huge difference are also the kinds of designs that would
underperform in raw C as well, though; a fair amount of memory optimization in
C also involves minimizing pointers.

~~~
eternalban
> ... memory optimization in C also involves minimizing pointers.

Please elaborate. Do you mean reducing overhead via compressed address-space
references, or offset-based ones?

~~~
JoachimSchipper
Chiefly, just using fewer pointers. That is, turn

    
    
        struct foo {
            struct bar *bar;
            struct baz *baz;
        };
    

into

    
    
        struct foo {
            struct bar bar;
            struct baz baz;
        };
    

Of course, this is not always possible - but a C program written naively/for
encapsulation/flexibly tends to have a lot of places where you can perform
such an optimization.

The performance win, on modern platforms, comes chiefly from reducing the
number of (unpredictable/non-cached) pointer dereferences.

~~~
pcwalton
> The performance win, on modern platforms, comes chiefly from reducing the
> number of (unpredictable/non-cached) pointer dereferences.

The parenthetical is important. Pointers are just fine as long as you don't
dereference them and suffer cache misses as a result.

It's also worth noting that what you described is not always a worthwhile
optimization, even if you can do it. If, for example, you traverse
an array of the objects containing a sub-object frequently and follow the
pointer to that object only rarely, it's often worthwhile to allocate it
separately to improve cache locality of that traversal. This is even more
important if you're doing a lot of structure copies of the outer structure;
avoiding copying more bits saves a lot of time. The latter case comes up
surprisingly often in my experience, and I've improved performance of code
quite a few times with separate allocations and pointer indirection.

~~~
tptacek
Nitpick: they're fine in otherwise-reasonable code as long as they aren't
shredding the cache. But there are lots of unreasonable designs that waste
tons of space, further burning up locality even when the pointers aren't
traversed, that are due to pointer overuse.

~~~
pcwalton
Yes, definitely true; sizeof(void *) and malloc slop aren't free.

~~~
jheriko
sizeof( void* ) is free at run-time.

if not then the compiler team should rethink their careers...

~~~
ctz
I think the parent is saying that pointers need storage too and this can add
up, not that sizeof expressions are computationally difficult :)

------
slimsag
It seems a bit contradictory to write a manual memory management library in Go
and _then use type reflection as a shoehorned version of generics_.

It must be that the author only cares about GC pauses, i.e. 'nondeterministic'
pauses -- not really perf itself?

~~~
teh_cmc
Yes, as I said in this comment [1], my real issue here is lengthy GC pauses,
which cause peaks in response times.

Regarding overall performance, I'm not really convinced that the reflect
calls affect it that much (there are two reflect calls per Read() or Write()
call), but I honestly just don't know; I'd have to benchmark to be sure. If
it does turn out to be an issue, you can still use Pointer() to keep
references into your unmanaged heap, and then work with your data without GC
or reflection overhead.

[1]
[https://news.ycombinator.com/item?id=10650885](https://news.ycombinator.com/item?id=10650885)

------
eljimmy
The benchmarks seem to revolve around invoking the collector. I'm very
curious how common it would be to do that often enough for it to matter in
terms of overall speed.

~~~
jerf
Infrequently enough that it's _usually_ premature optimization to spend too
much time worrying about it. But in Go's core application space, high-load
network applications, frequently enough that it's something you should be
aware of, and there's a realistic chance of optimizing everything else to the
point that GC time becomes your biggest problem, even after the latest work on
it.

You'd still want to reach for mmm only after A: verifying that GC time really
is your biggest problem via profiling and B: taking more normal steps to
minimize overallocation first, because with value types Go has some tools (if
not necessarily "a lot" of such tools, but definitely _some_) for dealing
with that. But if you end up backed against the wall, this may be helpful.

You could also arguably add a C: did you really mean to use a language with
manual memory in the first place, or perhaps Rust? Or can you factor just the
relevant bit out into such a language and interact via some RPC mechanism back
to the Go code base? But as the situation becomes arbitrarily complicated
there simply ceases to be a silver bullet.

~~~
teh_cmc
I couldn't have said it better. This is almost word for word why I decided to
build mmm: I maintain a few high-load RPC services in Go, each of which stores
millions, if not tens of millions, of items in its cache.

In this configuration, each incoming request means allocations inside Go's RPC
package [1], which in turn means that a GC pass will be triggered if
GOGC [2] has been reached, which in turn means that the GC will have to
scan all of those long lived pointers (Go's GC is not generational), which in
turn means a huge peak in response time.

This basically leaves me with three possible solutions:

\- hack into Go's RPC package to minimize allocations, which is a huge price
to pay just to delay the inevitable

\- build my caches in a language that offers manual memory management, then
query those via RPC from my Go services; but I don't want to add a new
language into the mix

\- provide a generic solution for manual memory management in Go, which is
where we are now

[1] [https://golang.org/pkg/net/rpc/](https://golang.org/pkg/net/rpc/)

[2]
[https://golang.org/pkg/runtime/debug/#SetGCPercent](https://golang.org/pkg/runtime/debug/#SetGCPercent)

~~~
giovannibajo1
There's also #4, using a ready-to-use external cache that can be accessed over
RPC/network. Like memcache, or Redis. I'm sure you've thought of it, so can
you explain why it didn't work in your scenario? I'm just curious.

~~~
jerf
Well, at the limits of performance, you're still looking at additional copies
to get the cached data from the cache to the program, then probably another
copy to get it from the input to the output, plus several context switches
(which, even if they may ultimately have a relatively small effect on
throughput, will increase latency). Having it in-process can be faster, if
you're willing to pay the complexity price.

Again, it's almost certainly premature optimization to start with that design,
but if that's where your optimization leads you, it's not that surprising.

FWIW, I wrote something myself that hits a similar problem, but in a
completely different dimension:
[https://github.com/thejerf/gomempool](https://github.com/thejerf/gomempool)
_My_ problem was that I had an otherwise rather placid program (from an
allocation perspective) that liked to allocate buffers for messages that were
many hundreds of kilobytes to low numbers of megabytes in size. In normal
usage, only maybe one or two of these are ever in use at a time, but I use
hundreds per second. In my case, each individual GC was actually not that big
a deal, but I was triggering them every few seconds. The GC would see a lot of
large allocs, and then _successfully_ clean them up, meaning that the next
batch of large allocations would be seen as crossing the threshold again. My
stats clearly showed that on this system, once I started pooling my []byte I
never even filled up the memory pool itself, and my GCs plummeted so far that
I wouldn't even particularly care if they took half-a-second apiece anymore,
which they don't. Almost everything other than those large message buffers
was stack-alloc'ed anyhow.
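A minimal sketch of that pooling idea using the standard library's sync.Pool
(not gomempool itself; the buffer size and handler are made up):

```go
package main

import (
	"fmt"
	"sync"
)

// Reuse large []byte buffers instead of letting each one become
// garbage and repeatedly tripping the GC's allocation threshold.
var bufPool = sync.Pool{
	New: func() interface{} { return make([]byte, 1<<20) }, // 1 MiB
}

func handleMessage(process func([]byte)) {
	buf := bufPool.Get().([]byte)
	defer bufPool.Put(buf) // return the buffer instead of discarding it
	process(buf)
}

func main() {
	handleMessage(func(b []byte) {
		fmt.Println("buffer size:", len(b))
	})
}
```

Note that sync.Pool contents may still be reclaimed during a GC, which is one
reason a dedicated pool like gomempool can behave more predictably.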

------
_rlh
You used runtime.GC() in your benchmark. This tells the Go runtime to do an
aggressive stop-the-world GC, which it does. The normal GC is concurrent,
and if you use it, latency should not be a problem. For your use case I'd
remove runtime.GC(), time the RPCs, and look for latency spikes. Report back
if you can; if you are still seeing high latency, we can file a bug report
with the Go team.

~~~
teh_cmc
Yes, these benchmarks use forced GC calls (i.e. all phases are STW) because
it's the only (good) way I can think of to make these tests deterministic
(maybe you know of a better way? I'd definitely be interested).

Of course, I don't have such calls on production systems; and while concurrent
collections greatly reduce the time spent in STW phases, latency spikes are
still a very real issue, as I explained here [1]. The monitoring on my
production systems leaves absolutely no doubt that those latency spikes are
due to garbage collections: once the GC was out of the picture, everything
was flat and fast again.

[1]
[https://news.ycombinator.com/item?id=10650885](https://news.ycombinator.com/item?id=10650885)

~~~
_rlh
Some folks export GODEBUG=gctrace=1, which makes the GC dump a lot of
interesting information, including stop-the-world latency. It doesn't
increase determinism (the GC cycle still starts when the runtime decides),
but it does provide visibility into concurrent GC latency in a benchmark or
an actual application. Perhaps you already use it, and that is how you
noticed the latency problems to start with.

I do know that you need Go 1.5, released last August, to get the low-latency
GC. If throughput is an issue, some folks adjust GOGC to use as much RAM as
they can. If none of this helps, file an issue report with the Go team. You
seem to have a nice, well-thought-out workaround, but a reproducible gctrace
showing hundreds of milliseconds of GC latency on heaps with millions of
objects will be of interest to the Go team and might really help them.

I hope this helps.

------
jheriko
it does feel a little backwards, but i can see the utility.

i still hope there is a better solution out there for memory management than
using a GC. they are pretty much golden sledgehammers for this problem and
enable lots of very bad practice to occur without leaking (this is both a
strength and a weakness though...).

