Hacker News new | past | comments | ask | show | jobs | submit login
Optimizing M3: Halving Our Metrics Ingestion Latency by Forking the Go Compiler (uber.com)
206 points by roskilli 5 months ago | hide | past | web | favorite | 67 comments

While I won't claim this is unique to Go, I've had some similar good experiences cloning out various bits of Go for my own crazy purposes. The standard library and compiler are relatively clean code for what they are, and it's relatively easy for a pro developer to fork them temporarily like this, or pick up a standard library that almost does what you need and add what you need. I've forked encoding/json to add an attribute that collects "the rest" of the JSON attributes not automatically marshaled, both in and out. I've forked encoding/xml to add a variety of things I needed to write an XML sanitizer (in which you are concerned with things like "how long the attribute tag" is during parsing; it's too late to be presented with a 4 gigabyte attribute in the user code, it's already brought your server to its knees). I saved weeks by being able to start with a solid encoder/decoder backend and be able to follow it and bring it up to snuff, rather than start from scratch. A coworker forked crypto/tls because it was the easiest TLS implementation to break in deliberate ways to test for the various SSL vulnerabilities that have emerged over the years.

Of course I recommend this more as a last resort than the first thing you reach for, but it's a fantastic option to have in the arsenal, even if you don't reach for it often. I encourage people to at least consider it.

This is something I noticed different to Python. Go modules love to hide all their internals, and its made impossible to access the privates. If you need to tweak behavior, you are forced to fork. Unlike Python, where it is not unusual to reach in, ignoring the suggestions about privates, and mess around at runtime ('monkey patching'). Both approaches have problems, but the Python approach seems faster.

At this point in my career, I'd fork the Python library anyhow, if it were for any non-trivial professional usage. Monkeypatching is a maintenance nightmare.

Of course, if it's a true one-off script or some personal hackery, go for it. But I wouldn't consider it a professional option.

(I find myself thinking more and more lately about how to program professionally in the long term, rather than simply solving the problem at hand.)

It's certainly not unique to Go; this is encouraged (if rarely done) in the Erlang community.

In fact, this is the reason that "Erlang/OTP" is a concept distinct from "Erlang." OTP is a distribution of Erlang, a "fork" that remixes together the core Erlang components in a certain way.

(Specifically, OTP is Ericsson's distro of Erlang targeted at building telecom switches. That means it contains not just regular language stdlib stuff, but libraries like "Megaco/H.248: a protocol for control of elements in a physically decomposed multimedia gateway".)

You're free to make your own distro of Erlang; it's extremely easy (and in fact, everyone is implicity doing it every time they use Erlang's release building process to build an app. An Erlang "release" is a customized, derivative Erlang distribution! You can build a new SDK with the release tooling just as easily as you can build an app runner.)

Even if it's rarely taken advantage of, it's clear that "forking your own Erlang distro and customizing it" is the idiomatic approach for many potential user stories. If you study the Erlang "kernel" library, you'll notice that there are many customizations which the kernel "expects" people to make, but not in the sense that there is any stable ABI grip-point to slot in a plugin via a configuration stanza. Rather, Erlang just has a trivial implementation of the logic in the kernel, sitting there in well-factored module. The kernel authors are expressing a clear intent by doing so: if you want some logic more fancy than this trivial version, just fork the kernel and plop in your own version of this module that does something else!

Personally, I love this way of thinking. Rather than the usual problems cropping up where something in the runtime that wasn't "quite right" for the application's use-case, was then reimplemented as a non-runtime-integrated library, with other warts and higher overhead as a result; instead, users are encouraged to just solve their problem in the runtime. Sometimes that results in upstreaming a patch, but that's not-at-all the goal. The goal is just to have software that has e.g. exactly one IO path, which everything uses.

(Side note: I find it mystifying that Ericsson's OTP release of Erlang is still considered by the community to be "the" Erlang SDK. It's as if Ubuntu was the only Debian-alike and there was no Debian, even though it's very clear exactly what Debian would look like. A "core" Erlang distro—with all the generally-useful OTP stuff like supervisors, but without the telecom-specific stuff—is totally possible; as is rebasing OTP to be a downstream distro of it. But nobody really seems to care about doing so.)

I have had great results creating my own non validating xml tokenizer using the function signatures of the standard library.

I agree wholeheartedly that the code in the standard library is very clean.

One thing that Go does better here than I've seen in other (compiled) languages so far, is that you can literally just cd into GOROOT, do your changes and a build of your project will pick up on the changed stdlib and re-compile it on-the-fly. I pretty regularly do that as a debugging aide. Of course, it isn't just as simple with the compiler itself, you still have to re-build that one manually.

Summary: developers were calling a method over and over 30 levels deep inside the stack and just barely not going over the 2k golang stack size initial limit. i.e. they were getting great performace because everything happened to be 1.9k or 1.8k, or just not quite 2k or more. Then, a change, and performance went terrible. On the opposite side of 2k with 2.1k or 2.2k an entire extra 2K more had to be allocated for a total of 4k to fit everything. Engineers stop at nothing to find the RCA looking at assembly of the binary and yes, forking the go compiler.

Ah, memory management. Here’s my basic understanding:

- Go initially allocates 2KB stack per routine. When it exceeds it it copies all of it into 2x the space.

- This was happening once or twice per request. They didn’t explain exactly why all that stack memory was being used (maybe someone can chime in), but contributing factors were a 30 function deep call stack and a minor code change that tipped it into the next stack growth tier.

- Also this doesn’t get freed up until garbage collection runs.

- They worked around it by implementing a kind of go routine pool that keeps assigning work to the same (stack-expanded) routines, staying ahead of the garbage collector.

My takeaways:

1. Fantastic analysis and job well done.

2. Pooling does not seem to be how things “should” work in Go. It’s more of a hack around undesirable allocator / garbage collector behavior.

3. I’m really interested in reference counted languages like Swift on the server for these reasons. I know ARC means more predictable latency when it comes to garbage collector behavior (which is only indirectly the problem here). Now I’m really curious how Swift allocates the stack and whether it would avoid this “morestack” growth penalty that Go has.

Swift doesn’t have threads in the same sense that go has them. The language doesn’t have special constricts for creating threads; you use library calls to do that.

You typically either use Grand Central Dispatch, or directly make OS calls to create threads. In either case, you get way larger stack sizes, 512kB for secondary threads by default (https://developer.apple.com/library/archive/documentation/Co...)

Also, AFAIK, those stacks never grow. If you try to grow a thread’s stack too much, it’s end of game (for that thread, I think, but possibly even for the application)

Go’s smaller stacks allow for way more goroutines to exist in parallel, but as this article shows, if go has to grow a stack, things slow down.

Trying to grow a stack too much will cause a segmentation fault, which is recoverable but usually not something you would want to do.

You can't grow a C stack, it's pre-allocated and fixed size. Trying to use too much of a stack - beyond what's initially allocated — is why you're going to segfault (in the best case), as you'll be traipsing on unallocated addresses and hopefully unmapped pages.

I'm using "grow" in the sense of moving the stack pointer down. But yes, you're correct, it's trying to access that memory that will cause a segmentation fault. And to address your first point, I'm sure there's hijinks you can perform with mmap to add a couple more pages if you so desire.

> I'm sure there's hijinks you can perform with mmap to add a couple more pages if you so desire.

I don't think you can do that, but I think you'd just create a bigger stack when spawning your process / thread: the stack allocation is virtual (at least on unices, possibly not on Windows?) so on a 64b system it's unlikely to be very expensive, physical pages will get allocated on-demand as stack use increases.

This could probably be tested / benched using pthread: pthread_attr_setstacksize lets you define the (virtual) stack size before spawning the process.

You can only do that if the new virtual memory address range isn’t used yet for something else. If so, you can almost just as well make the stack larger to start with. Certainly on 64-bit systems with their enormous address spaces, that’s probably the wise thing yo do.

Go Can’t do that because it is designed for allowing to run thousands of short-lived threads, even on 32-bit systems.

Threadsafe refcounting is just terrible for throughput. Not only do the atomic ops hurt you, but logically read-only operations like data structure traversals require modifying refcounts, and therefore become read-write operations, with associated cache line ping-ponging between cores.

The Swift people are smart and maybe they have a plan to mitigate this but I don't know what it might be.

So using Go and Java on the server makes sense when the developer ergonomics of GC outweigh the performance costs (and the developer ergonomics disadvantages of having to analyze GC performance). Using Rust makes sense when performance matters more. Swift seems somewhere in between; maybe there's a sweet spot there but, but it wouldn't be surprising if not.

> Swift seems somewhere in between

I would say Swift is easier to learn/use than Rust while aiming to deliver some of the same performance characteristics. At least that's the sweet spot it's shooting for on the server.

On throughput, two things:

- Reference counting only applies object references. It does not apply to value types, and Swift is very strong on the "value semantics" approach to coding. It's like a sensible incorporation of some functional programming concepts without getting dogmatic about it. I'm not sure exactly how this will work out on the server but it seems very promising from a performance and thread safety standpoint.

- Even when we consider throughput burdened by ARC, latency is probably more important of a metric than throughput for a whole lot of services. Throughput is sort of the knee-jerk metric which I feel is a holdover from desktop benchmarking days. For services at scale, p99 latency is where you get the real headaches. And that's one area where ARC is especially good.

Refcounting makes sense when latency and responsiveness are more important than throughput. That’s true of both native UIs and most kinds of servers, which between them cover a good fraction of code being written today, maybe the majority of code.

I’m not yet convinced by Swift the language, but I think it’s positioned in just the right place.

For me the benefit of Rust like language is the added benefit provided by the type system. Even when performance is not the critical factor. There some level of unit tests that are not required in Rust like language which would be required in Java, even more so in Golang.

It would be nice if the initial stack per routine size was configurable. Ultimately, it would be interesting if the Go runtime could profile the typical stack usage for new routines and dynamically adjust the pre-allocated stack size configuration. It's a little crude and might waste some memory, but it would help.

Isn't lacking such nicities part of what keeps Go simple enough that forking the runtime is a relatively lightweight approach? The other extreme is a toolchain clever enough it requires carnal knowledge of its workings to coax reliable performance out of it

Does this mean go doesn't do (enough) inlining of functions?

Everyone agrees this is a problem. Today Go only inlines a function under certain very narrow circumstances.

In C# you can apply an `AggressiveInlining` to remove certain limiting restrictions, making it more likely that a method will be inlined - does Go have something similar?

This has actually improved in the latest release, version 1.12. I'm sure it's still behind other languages, though.

I can only imagine where Go will be with compiler optimizations in 3-5 years.

Look at gcc, it has tons, but it took years to get there.

Go has the added advantage of being able to learn off this knowledge

> I can only imagine where Go will be with compiler optimizations in 3-5 years.

Nowhere, optimisations cost CPU time and the Go developers focus on compilation time. They could already have implemented plenty of well-known and well-understood optimisation passes in the 10 years since the initial release, that they didn't do so and restricted themselves to the sort of optimisation passes you'd find in a bytecode interpreter was by design.

In theory gccgo is where you'd get an optimising go compiler, but apparently even with gcc 8.1's improvements it has issues which lead to the end result being slower than the mainline compiler. I'm sure they'd be glad to get more people involved to resolve this.

> Look at gcc, it has tons, but it took years to get there.

> Go has the added advantage of being able to learn off this knowledge

Go could have used LLVM, they specifically refused to do so on grounds of compilation time being too slow, preferring to not optimise instead.

Yet D, Delphi, Eiffel, .NET Native, Free Pascal, OCaml among others, manage to offer rich languages, deliver performance while having fast compilation times.

I'm pretty sure OCaml doesn't use an optimising compiler, I don't think Delphi or FreePascal do either, not sure about the others. Just as Go does, these compilers will perform some optimisations but generally fairly local and simple. They "deliver performance" in the same way Go does.

According to its documentation, dmd doesn't even do function inlining until you specifically opt into that with `-inline`.

I am pretty sure you are wrong.

OCaml has multiple implementations, it is a matter to choose the right one for the task at hand.

Delphi can make use of C++Builder's backend.

Eiffel uses its MELT VM for development, or compilation via C and C++ code generation, thus enjoying all their optimization capabilities.

D has three main implementations available, including GPGPU support on the ldc one.

.NET Native uses Visual C++'s C2 backend.

Seems like none of them "deliver performance while having fast compilation times", rather that you can pick between fast compilation and high-performance result. Which is pretty much the theoretical intention between mainline go[0] and gccgo.

And I will thus restate my original assertion: in 3-5 Go (the mainline compiler) will be nowhere in terms of optimisations. If you want an optimising Go compiler, work on and help with gccgo.

[0] whose maintainers simply don't care for advanced optimisations at the cost of compilation speed

On the the contrary, in some of the mentioned cases it just the difference between debug and release builds.

Go's community approach seems to keep being "you are holding it wrong", disregarding experience from other language communities.

Swift memory management lost to Go, C# and OCaml in the benchmark of the user space implementation of network drivers, presented at last year's CCC.

Yes I've seen that[1] but I think it's premature to extrapolate general conclusions about server engineering from a low-level network driver. Chris Lattner has noted that Swift will take longer to optimize for cases like low-level systems programming because unlike most languages it was initially designed for generality.[2]

What's interesting about Swift on the server is that the super lightweight runtime and lack of a tracing garbage collector hold the potential of much more consistent latency with low memory overhead. And as for reference counting overhead, application-level code is much more likely to be able to take advantage of Swift's preferred functional-esque "value type semantics", which don't suffer from that.

[1] https://news.ycombinator.com/item?id=18788069

[2] https://developers.slashdot.org/story/17/01/23/085232/slashd...

Value type semantics exist in plenty of GC enabled languages, like those from CCC benchmarks, hardly a Swift advantage.

Didn't say they weren't. Not sure if you misunderstood my comment.

It's a mistake to extrapolate too much about the "reference counting vs tracing GC" question from this one low-level driver benchmark. In fact reading their paper, they don't understand why Swift ARC traffic was so high.[1] They weren't allocating a lot of objects or copying. It's probably just an unoptimized use case or bug, not something fundamental in the memory management model. Swift is still quite young.

[1] https://github.com/ixy-languages/ixy.swift/blob/master/perfo...

Check the language report from Mesa/Cedar, which used reference counting with a cycle collector, already back then they weren't that inspiring.

The fact that the language designers switched to tracing GC on their further work, namely Modula-2+ and Modula-3. Or that Wirth after his 2nd sabbatical year at Xerox PARC decided for tracing GC on his system programming languages, Finally Microsoft own research, also using tracing GCs.

All in all quite telling, where they seen performance gains to be held.

Swift's approach makes sense from Objective-C compatibility point of view. Which also made sense given the failure to add a tracing GC while keeping the underlying C semantics.

From performance point of view not so much.

Having a tracing GC doesn't remove the possibility to have stack allocations, even when they look like heap ones, implement pools or whatever.

Unfortunately Java's approach of not having the same type system as older GC enabled systems programming languages created a bad picture of what is actually available out there.

This is not really strong evidence. A analysis from the 80s and a possibly buggy driver benchmark. You’re digging deep.

Look, tracing GC has come a long way. I’m not saying it’s unsuitable for servers at all. I’m just excited to see what sidestepping its problems entirely can do for server apps, especially with respect to tail latency and energy efficiency, increasingly salient metrics for services at scale.

OCaml is hardly a language specialized for low level network drivers

Actually OCaml has low-level memory manipulation features that help like cstruct.[1] Swift isn't currently optimized for the direct manipulation of external device memory. That's solvable, but moreover it's not really something you can extrapolate from to draw conclusions about server applications.

[1] https://dsheets.github.io/codoc/cstruct.1.5.0/_build/lib/cst...

I have used goroutine pooling. It has it uses when you want to limit resource usage.

Fascinating read. Although the idea of using thread pools evokes the pthread management, this post is rather convincing that such "hand-holding" is necessary in applications with intense SLAs. Alas, the magic the Go team has worked with routines doesn't yield a free lunch for everybody.

If we accept that pooling is necessary in some cases, I'm curious – is there a common source that these applications use?

In trying to answer my own question, I found that M3 has a mature-looking implementation of such an abstract solution. https://github.com/m3db/m3/tree/master/src/x/sync.

Elsewhere, I couldn't find anything similar in the usual suspects. CockroachDB has one-off, specific implementations in the places where they've decided pooling is worth it. Looks like Kubernetes uses the stdlib's `sync.Pool` interface in a similar way, but doesn't use a full-fledged "routine pool".

Do people at Uber think this is a robust enough solution to be used outside of m3? Seems like it might be useful in the stdlib as an implementation of `sync.Pool` :)

We use this https://github.com/m3db/m3/blob/master/src/x/sync/pooled_wor...

all over our code base so its definitely stable enough to use in your own projects if you have a need, although it does require some tuning.

My guess is that the Go team would not consider this critical / core enough to include in the standard library and I'd be inclined to agree with them.

Awesome, thanks for the ref.

I'd agree that it probably doesn't belong in the stdlib, as there aren't many programs that would really benefit from it. OTOH, it's good one to keep in pocket, for the few applications that would.

What fix/patch could the Go team do here to mitigate the issue from happening? It does seem like it would be nice if you could specifically to the runtime the amount of pre-alloc stack size needed. Would this help?

I think this is a trade-off, between configurability/expressiveness and simplicity. Historically, the Go team has favored simplicity. My impression is that the Go team would believe that exposing such a runtime config isn't a good idea. Personally, I agree with you it'd be nice to pre-alloc. But I also think that exposing such an option would probably create more bugs than it would solve.

I believe they'd prefer to improve the runtime's internal stack allocation algorithm, while keeping it opaque to the average developer. The proposal mentioned elsewhere in this thread, to re-use allocated routines "under the covers" is an example in that direction.

Until they implement something like that (essentially, routine-pooling) in the runtime, these apps will just have to use routine pooling at the application level – which I suppose isn't that bad, after all?

It seems to me they could have used the following hack to fix such issue:

  var dontOptimizeMeBro byte

  // go:noinline
  func makeStackBig() {
    var buf [16386]byte
    dontOptimizeMeBro = buf[0] + buf[len(buf)-1]
Call this at the start of the goroutine.

What it does, I hope, is extend stack to 16kb, once (as opposed to going from 2kb to 4kb then to 8kb then to 16kb and paying for coyping the memory multiple times).

The stack stays big for the remaining lifetime of the goroutine.

Yep that works too although I didn't benchmark the difference in performance. The nice thing about the worker pool though is that it auto-tunes the stack size based on the workload.

This is a case in which generational GC can help. If you allocate goroutine stacks in the nursery, then you can use a bump allocator, which makes the throughput extremely fast. Throughput of allocation matters just as much as latency does!

(By the way, Rust used to cache thread stacks back when it had M:N threading, because we found that situations like this arose a lot.)

You can't bump-allocate 6K contiguous to an existing 2K allocation, and Go currently requires the stack to be contiguous.

That's right, but you can bump-allocate a new 6K and copy over. It's a lot faster than falling through to a general-purpose malloc implementation.

That’s not obviously true. Page sized and page aligned allocation is pretty fast in Go. Having a generational GC where all pointers are movable could have systemic impact in the performance of the whole program.

If it were fast enough, then caching stacks wouldn't be a win. Bump allocation is like 6 instructions.

The idea that generational GC would not be a win does not match the experience of any other language. Generational GC is virtually always a win for languages like Go. This is just another reason why Go should adopt it.

Isn't it a win because the goroutine now starts with 4K of stack instead of 2K? This makes the stack large enough, such that the stack doesn't have to expanded on every request.

It's possible. Go has been replaying the entire history of GC research at about 2.5x forward speed. They should intercept the state of the art eventually.

It might make allocation a bit faster, but is this really where performance was lost?

I guess for the runtime it is more expensive to copy the stack contents (so nearly 2K) and e.g. update pointers in it (a simple memcpy might not be enough).

Except it was the second growth just exceeding the 4096 stack size that was causing the issue:

"it looked like the goroutine stack was growing from 4 kibibytes to 8 kibibytes"

I seem to recall there are a couple of github issues around reusing goroutines and their stacks or being able to specify the stack size of a new goroutine instead of making it a runtime constant. Either would be very helpful for those of us using Go at scale.

Would the segmented stack of the early Go implementations be helpful in such a case?

Yeah I think this issue would not have cropped up with the early segmented stack implementation, but segmented stacks had their own issues (which is why the Go team migrated away from them)

Hopefully Uber won't have to maintain their forked Go compiler for too long.

We don't! We only briefly forked the compiler to prove our RCA, but we resolved the issue with a custom worker pool (linked in the blog post). We compile all our code using the same compiler as everyone else :)

Very nice! Also another proof point for why it's nice to have open source tools.

At scale, details matter. Great read.

Great read, so a key takeway would be to make sure to prime the connection pool and also re-use it. Isn't re-using goroutines a bit of an anti-pattern though?

I don't think that's the takeway. Reusing goroutines was only necessary in a very specific situation at a very large scale.

The article is very good in providing ideas and tools you can try to use whenever you find yourself in a similar situation.

Great tech article!

Applications are open for YC Winter 2020

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact