
Optimizing M3: Halving Our Metrics Ingestion Latency by Forking the Go Compiler - roskilli
https://eng.uber.com/optimizing-m3
======
jerf
While I won't claim this is unique to Go, I've had some similar good
experiences cloning out various bits of Go for my own crazy purposes. The
standard library and compiler are relatively clean code for what they are, and
it's relatively easy for a pro developer to fork them temporarily like this,
or pick up a standard library that almost does what you need and add what you
need. I've forked encoding/json to add an attribute that collects "the rest"
of the JSON attributes not automatically marshaled, both in and out. I've
forked encoding/xml to add a variety of things I needed to write an XML
sanitizer (in which you are concerned with things like "how long is the
attribute tag" _during_ parsing; it's too late to be presented with a
4-gigabyte attribute in user code, since by then it's already brought your
server to its knees). I
saved weeks by being able to start with a solid encoder/decoder backend and be
able to follow it and bring it up to snuff, rather than start from scratch. A
coworker forked crypto/tls because it was the easiest TLS implementation to
break in deliberate ways to test for the various SSL vulnerabilities that have
emerged over the years.

Of course I recommend this more as a last resort than the first thing you
reach for, but it's a fantastic option to have in the arsenal, even if you
don't reach for it often. I encourage people to at least consider it.

~~~
stubish
This is something I noticed is different from Python. Go modules love to hide
all their internals, and it's made impossible to access the privates. If you
need to tweak behavior, you are forced to fork. Unlike Python, where it is not
unusual to reach in, ignoring the suggestions about privates, and mess around
at runtime ('monkey patching'). Both approaches have problems, but the Python
approach seems faster.

~~~
jerf
At this point in my career, I'd fork the Python library anyhow, if it were for
any non-trivial professional usage. Monkeypatching is a maintenance nightmare.

Of course, if it's a true one-off script or some personal hackery, go for it.
But I wouldn't consider it a professional option.

(I find myself thinking more and more lately about how to program
professionally in the long term, rather than simply solving the problem at
hand.)

------
andrewfromx
Summary: developers were calling a method over and over, 30 levels deep in the
stack, and just barely staying under Go's initial 2K stack size limit. That
is, they were getting great performance because everything happened to be 1.8K
or 1.9K, just not quite 2K or more. Then, a change, and performance went
terrible: on the opposite side of 2K, at 2.1K or 2.2K, an entire extra 2K had
to be allocated, for a total of 4K, to fit everything. The engineers stopped
at nothing to find the root cause, looking at the assembly of the binary and,
yes, forking the Go compiler.
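The tier boundary described above can be illustrated with a toy model of the
runtime's doubling policy (a simplification; the real runtime also reserves
guard space, so the numbers here are only illustrative):

```go
package main

import "fmt"

// nextStackSize is a toy model of Go's stack-growth policy:
// start at 2 KB and double until the needed space fits.
func nextStackSize(needed int) int {
	size := 2 * 1024
	for size < needed {
		size *= 2 // each doubling also pays to copy the old stack
	}
	return size
}

func main() {
	fmt.Println(nextStackSize(1900)) // just under the limit: stays at 2048
	fmt.Println(nextStackSize(2200)) // just over: grows to 4096 (one copy)
}
```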

------
abalone
Ah, memory management. Here’s my basic understanding:

\- Go initially allocates a 2KB stack per goroutine. When a goroutine exceeds
it, the runtime copies the whole stack into 2x the space.

\- This was happening once or twice per request. They didn’t explain exactly
why all that stack memory was being used (maybe someone can chime in), but
contributing factors were a 30 function deep call stack and a minor code
change that tipped it into the next stack growth tier.

\- Also this doesn’t get freed up until garbage collection runs.

\- They worked around it by implementing a kind of goroutine pool that keeps
assigning work to the same (stack-expanded) goroutines, staying ahead of the
garbage collector.

My takeaways:

1\. Fantastic analysis and job well done.

2\. Pooling does not seem to be how things “should” work in Go. It’s more of a
hack around undesirable allocator / garbage collector behavior.

3\. I’m really interested in reference counted languages like Swift on the
server for these reasons. I know ARC means more predictable latency when it
comes to garbage collector behavior (which is only indirectly the problem
here). Now I’m really curious how Swift allocates the stack and whether it
would avoid this “morestack” growth penalty that Go has.

~~~
Someone
Swift doesn’t have threads in the same sense that Go has them. The _language_
doesn’t have special constructs for creating threads; you use _library_ calls
to do that.

You typically either use Grand Central Dispatch, or directly make OS calls to
create threads. In either case, you get way larger stack sizes, 512kB for
secondary threads by default
([https://developer.apple.com/library/archive/documentation/Co...](https://developer.apple.com/library/archive/documentation/Cocoa/Conceptual/Multithreading/CreatingThreads/CreatingThreads.html))

Also, AFAIK, those stacks never grow. If you try to grow a thread’s stack too
much, it’s game over (for that thread, I think, but possibly even for the
application).

Go’s smaller stacks allow for way more goroutines to exist in parallel but, as
this article shows, if Go has to grow a stack, things slow down.

~~~
saagarjha
Trying to grow a stack too much will cause a segmentation fault, which is
recoverable but usually not something you would want to do.

~~~
masklinn
You can't grow a C stack; it's pre-allocated and fixed size. Trying to _use_
too much of a stack (beyond what's initially allocated) is why you're going to
segfault (in the best case), as you'll be traipsing on unallocated addresses
and hopefully unmapped pages.

~~~
saagarjha
I'm using "grow" in the sense of moving the stack pointer down. But yes,
you're correct, it's trying to access that memory that will cause a
segmentation fault. And to address your first point, I'm sure there's hijinks
you can perform with mmap to add a couple more pages if you so desire.

~~~
masklinn
> I'm sure there's hijinks you can perform with mmap to add a couple more
> pages if you so desire.

I don't think you can do that, but I think you'd just create a bigger stack
when spawning your process / thread: the stack allocation is virtual (at least
on unices, possibly not on Windows?), so on a 64-bit system it's unlikely to
be very expensive; physical pages will get allocated on demand as stack use
increases.

This could probably be tested / benchmarked using pthreads:
pthread_attr_setstacksize lets you define the (virtual) stack size before
spawning the thread.

------
gen220
Fascinating read. Although the idea of using thread pools evokes manual
pthread management, this post is rather convincing that such "hand-holding" is
necessary in applications with strict SLAs. Alas, the magic the Go team has
worked with goroutines doesn't yield a free lunch for _everybody_.

If we accept that pooling is necessary in some cases, I'm curious – is there a
common library that these applications use?

In trying to answer my own question, I found that M3 has a mature-looking
implementation of such an abstract solution.
[https://github.com/m3db/m3/tree/master/src/x/sync](https://github.com/m3db/m3/tree/master/src/x/sync).

Elsewhere, I couldn't find anything similar in the usual suspects. CockroachDB
has one-off, specific implementations in the places where they've decided
pooling is worth it. Looks like Kubernetes uses the stdlib's `sync.Pool`
interface in a similar way, but doesn't use a full-fledged "routine pool".

Do people at Uber think this is a robust enough solution to be used outside of
M3? Seems like it might be useful in the stdlib as an implementation of
`sync.Pool` :)

~~~
richieartoul
We use this
[https://github.com/m3db/m3/blob/master/src/x/sync/pooled_wor...](https://github.com/m3db/m3/blob/master/src/x/sync/pooled_worker_pool.go)

all over our code base, so it's definitely stable enough to use in your own
projects if you have a need, although it does require some tuning.

My guess is that the Go team would not consider this critical / core enough to
include in the standard library and I'd be inclined to agree with them.

~~~
gen220
Awesome, thanks for the ref.

I'd agree that it probably doesn't belong in the stdlib, as there aren't many
programs that would really benefit from it. OTOH, it's a good one to keep in
your pocket for the few applications that would.

------
kjksf
It seems to me they could have used the following hack to fix this issue:

    
    
      var dontOptimizeMeBro byte
    
      //go:noinline
      func makeStackBig() {
        // A ~16 KB local forces the runtime to grow this
        // goroutine's stack past 16 KB in a single step.
        var buf [16384]byte
        dontOptimizeMeBro = buf[0] + buf[len(buf)-1]
      }
    

Call this at the start of the goroutine.

What it does, I hope, is extend the stack to 16 KB once (as opposed to going
from 2 KB to 4 KB to 8 KB to 16 KB and paying for copying the memory multiple
times).

The stack stays big for the remaining lifetime of the goroutine.

~~~
richieartoul
Yep, that works too, although I didn't benchmark the difference in
performance. The nice thing about the worker pool, though, is that it
auto-tunes the stack size based on the workload.

------
pcwalton
This is a case in which generational GC can help. If you allocate goroutine
stacks in the nursery, then you can use a bump allocator, which makes the
throughput extremely fast. Throughput of allocation matters just as much as
latency does!

(By the way, Rust used to cache thread stacks back when it had M:N threading,
because we found that situations like this arose a lot.)

~~~
shereadsthenews
You can't bump-allocate 6K contiguous to an existing 2K allocation, and Go
currently requires the stack to be contiguous.

~~~
pcwalton
That's right, but you can bump-allocate a new 6K and copy over. It's a lot
faster than falling through to a general-purpose malloc implementation.

~~~
shereadsthenews
That’s not obviously true. Page-sized and page-aligned allocation is pretty
fast in Go. Having a generational GC where all pointers are movable could have
a systemic impact on the performance of the whole program.

~~~
pcwalton
If it were fast enough, then caching stacks wouldn't be a win. Bump allocation
is like 6 instructions.

The idea that generational GC would not be a win does not match the experience
of any other language. Generational GC is virtually always a win for languages
like Go. This is just another reason why Go should adopt it.
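For concreteness, bump allocation is just an offset increment plus a bounds
check. Here is a toy arena in Go, illustrative only; it is nothing like the
actual Go runtime, and a real nursery would trigger a minor collection on
exhaustion rather than fail:

```go
package main

import "fmt"

// BumpArena hands out memory by incrementing an offset into a
// pre-reserved buffer: no free lists, no size classes.
type BumpArena struct {
	buf []byte
	off int
}

func NewBumpArena(size int) *BumpArena {
	return &BumpArena{buf: make([]byte, size)}
}

// Alloc returns n bytes, or nil when the arena is exhausted
// (a real nursery would run a minor GC at this point).
func (a *BumpArena) Alloc(n int) []byte {
	if a.off+n > len(a.buf) {
		return nil
	}
	b := a.buf[a.off : a.off+n]
	a.off += n
	return b
}

func main() {
	arena := NewBumpArena(8 * 1024)
	stack := arena.Alloc(6 * 1024) // e.g. a fresh 6K goroutine stack
	fmt.Println(len(stack), arena.off) // prints "6144 6144"
}
```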

~~~
readittwice
Isn't it a win because the goroutine now starts with 4K of stack instead of
2K? That makes the stack large enough that it doesn't have to be expanded on
every request.

------
robocat
Except it was the second growth, just exceeding the 4096-byte stack size, that
was causing the issue:

"it looked like the goroutine stack was growing from 4 kibibytes to 8
kibibytes"

------
shereadsthenews
I seem to recall there are a couple of github issues around reusing goroutines
and their stacks or being able to specify the stack size of a new goroutine
instead of making it a runtime constant. Either would be very helpful for
those of us using Go at scale.

~~~
nine_k
Would the segmented stack of the early Go implementations be helpful in such a
case?

~~~
richieartoul
Yeah, I think this issue would not have cropped up with the early segmented
stack implementation, but segmented stacks had their own issues (which is why
the Go team migrated away from them).

------
billsmithaustin
Hopefully Uber won't have to maintain their forked Go compiler for too long.

~~~
richieartoul
We don't! We only briefly forked the compiler to prove our RCA, but we
resolved the issue with a custom worker pool (linked in the blog post). We
compile all our code using the same compiler as everyone else :)

~~~
bdamm
Very nice! Also another proof point for why it's nice to have open source
tools.

------
iampims
At scale, details matter. Great read.

------
curiousDog
Great read. So a key takeaway would be to make sure to prime the connection
pool and also re-use it. Isn't re-using goroutines a bit of an anti-pattern,
though?

~~~
aflag
I don't think that's the takeaway. Reusing goroutines was only necessary in a
very specific situation at a very large scale.

The article is very good in providing ideas and tools you can try to use
whenever you find yourself in a similar situation.

------
lostmsu
Great tech article!

