Hacker News new | past | comments | ask | show | jobs | submit login
Pprof++: A Go Profiler with Hardware Performance Monitoring (uber.com)
240 points by vquemener 4 months ago | hide | past | favorite | 38 comments

Full disclosure: I used to work in Uber's infrastructure group.

I'm always happy to see improvements to Go's profiling - the standard library's profiling is good, but it hasn't improved much since Go 1.0 (it took years just to get flame graphs integrated, and it's still near-impossible to get a usable dump of the in-memory object graph).

That said, I'm _very_ wary of tools that require a fork of the compiler and/or runtime. Uber's programming language research group is small (though mighty) and has a lot of demands on their time. Even over the medium term, it'll be hard to allocate time to keep this project current.

In the Go issue tracker, there's a long discussion [1] of the formal proposal [2] for these improvements. My understanding of the feedback is that the changes introduce too many Linux and x86-specific APIs. I'd love to see the author narrow the proposal to just improve timer fidelity when the OS exposes the PMU, without any new APIs. The more controversial proposal to collect richer platform- and architecture-specific data can be discussed separately.

Discussion [1]: https://github.com/golang/go/issues/36821

Proposal [2]: https://go.googlesource.com/proposal/+/refs/changes/08/21950...

They also forked Android to do similar profiling improvement [1] and even wrote a paper on that [2]. It's an impressive achievement and I hope they tried to upstream it as well.

The Android ART profiler today is still kinda limited (too high overhead or too imprecise) so we tend to switch over to simpleperf [3]. However I think there are things that only in-langue profilers can do.

  [1] https://eng.uber.com/nanoscope/
  [2] https://2018.splashcon.org/details/vmil-2018/9/Profiling-Android-Applications-with-Nanoscope
  [3] https://developer.android.com/ndk/guides/simpleperf

> it's still near-impossible to get a usable dump of the in-memory object graph

This, lately, is my #1 gripe. I just cannot get viewcore to provide valuable insights. Anybody know of any projects in this space?

Meanwhile, Java has incredible support for several kinds of profiling for a decade.

This is the reason I don't liKe Go: anything Google deems unimportant (like generics or packaging) either take many years or never happen. The whole language reeks of such zealotry. In fact there's many Google projects where I've seen popular GitHub issues linger for many years because the core devs just don't care about usage outside big G.

Google generally does a bad job of open source stewardship and Golang is no different.

It's a shame but not a surprise that outside companies who have married their horses to Google and Go find themselves fighting hard just to have decent tooling that virtually every other language has.

Forgive my ignorance, but I have a very general question. The article says Go runs on millions of cores at Uber. Obviously, the company is interested in maximizing the performance, hence the Pprof++ project.

My experience with Go is very limited, but in my tests it was always slower than C. Sometimes just a bit, sometimes 2-5 times. So my question is: looking back, don't you guys regret choosing this language?

Please don't misunderstand me, I don't intend to start any flamewars, but it seems like you're very much focused on CPU-bound performance, and the choice of language is not neutral in this case.

Full Disclosure: I do not work at Uber but at another large enterprise that is at Ubers scale and is almost 100% Go across 100s of micro services.

Inside an enterprise, there’s more to a language then just the performance (though that is a large factor). You also have to take into account existing tooling (both internal and external), developer experience, whether you can find enough developers to code it and source code maintainability, as well as many other common concerns with a language. Most languages will do well in a few cases but none are best in class in all cases (or everyone would use it). Go does well enough in the performance category while also doing moderately to extremely well in other categories. In CPU bound tasks that don’t rely on CGo, go does extremely well in my experience. I think in general though, for most enterprises, Go strikes a happy medium and makes the right trade offs that most developers are willing to make.

Facebook managed to optimize mercurial (Python) to be faster than git (C) for their huge monorepo use case simply because Python made it so much easier to implement more efficient algorithms.

If you have some fairly simple function/task, then yeah, a C version will probably blow the Go version away almost all of the time. But that's not necessarily indicative of real-world performance of a full application.

And of course, there are other interests than "maximize performance" too, such as developer efficiency.

Ya got a source/article on that? I'm curious about it.

Overall I agree. I'd take a speed hit for ease of development most of the time, but there are degrees of speed hit that are acceptable depending on the context.

As both a C and Go developer, in my experience, 2-5x execution time increase going from C to Go does not sound reasonable at all.

You could get a 5x slowdown like that for an interpreter written in Go (versus C), but that's a special case and usually it could be replaced by switching from an interpreter to a compiler. (Code generation.)

I think finding a good balance between performance and development time & risks is important and C does not seem like a very good choice regarding the second part. At least for web stuff.

I doubt most of Uber microservices are CPU bound. Yes it's natural for a fleet of this size to also be scrutinized for CPU savings but that's not the only cost involved when choosing a language. Good tooling, dev experience and availability of competent engineers comes to mind.

At least when I was at Uber (left in mid-2019), this was absolutely the case: most services weren't CPU-bound, but a few were. Rewriting the CPU-bound services in C++, Rust, or some other language might have helped, but would have required a _ton_ of well-understood, reliable libraries for networking, observability, authn/authz, database access, etc.

In nearly all cases, there was plenty of room to make the Go service faster. A more careful choice of data structure and algorithm, finer-grained locking, fan-out across a goroutine pool, or just avoiding a zillion heap allocations solved most problems. I don't recall any cases where Go was simply incapable of meeting our performance needs.

As a side benefit, services with more stringent performance requirements often exposed inefficiencies in widely-used libraries. Optimizing those libraries made every Go service faster and cut our overall compute spend. Avoiding rewrites in C++ or Rust let those wins add up over time.

Thanks for the context. Good to keep in mind for the long term.

That said, I’m very thankful that tools like this are being shared with the community, even if it’s less than perfect. It’s great that we have access to so many tools and so much research.

Agreed! This is super cool, and I hope a working prototype makes discussion of the formal proposal move a little faster.

I've also used `perf` to profile Go apps, and the results can be pretty interesting. For example, the Go profiler doesn't continue out of the process and into the kernel, so are your I/O waits on the network, or are they on the local filesystem? `perf` can show you the difference. (And it does come up in the real world; you can often have Go functions that operate on io.Readers that are connected to the network, or are connected to disk files. A profile that extends into the kernel can show you which case is slow.)

I've been consistently disappointed by Go's mutex profiling. It is kind of useful if you can synthesize the contention, but less useful if the contention isn't obvious. For my most recent contention debugging issue, I was very close to modifying the runtime (mostly sync.Mutex) to emit events that perf could sample on, and build a flame graph for cases like "took the fast path on sync.Mutex". What takes the fast path under synthetic load could easily take the slow path under different load, and that would let you identify the bottlenecks before they occur. I ended up not needing it, so didn't do it, but it's interesting to me that other people have run into the case where the built-in profiles aren't quite enough. There are definitely a lot of improvements to be made here.

(Getting ahead of lock contention has come up a lot in my Go career. When I was at Google, I had an application that computed metrics over hundreds of thousands of concurrent network streams. I used the standard library for metrics, which synchronized on sync.RWMutex, and it was just too slow to synchronize millions of events per second. This sort of bottleneck will be obvious to anyone, probably; you simply can't block all metric-increments on reads every 15 seconds and expect to fully utilize use 96vCPUs that are trying to write ;) No profile ever pointed me at the culprit in a straightforward manner; I had to dig into the time-per-assembly-instruction profile and noticed inordinate amount of time being spent in an instruction incidental to the implementation of sync.Mutex. I ended up switching to atomic.AddInt64, which solved my problem. I think the overhead of a write barrier is actually too heavy and wouldn't do that today; I'd just maintain goroutine-local counters and synchronize them occasionally. You can still fall behind, of course, but your program can easily detect that case and log a message like "we're too slow, buy another computer!" in that branch. Easier than reading the profiles.)

Thought it might also be helpful to add that I usually invoke perf like this:

    perf record --call-graph dwarf -e cycles  --switch-events --sample-cpu -p $(pidof my-go-process)
And then view the results in hotspot (https://github.com/KDAB/hotspot).

I played a little with off-cpu profiling (-e sched:sched_switch), but Go kind of outsmarts that by doing its own scheduling (taking Goroutines off Machines when it thinks it's going to block inside a syscall, so the resulting flamegraphs are usually rooted in runtime.park_m; though if the runtime gets something wrong, you'd be able to see it here.)

A lot of people don’t grok the contention profiler, even people who work on service performance full-time. The problem is the docs. What you’re getting from the contention profile is samples of call stacks for contended releases, that is when a lock is released and there are waiters. This isn’t explained anywhere, to my recollection.

Yup, there is an upstream issue to improve this: https://github.com/golang/go/issues/44920

FWIW I'm planning to work on the problem from downstream as well by adding more docs here: https://github.com/felixge/go-profiler-notes . I haven't gotten to the mutex profile yet, but I've covered the block profile in great detail. It actually overlaps with the mutex profile for contention, but it tracks Lock() rather than Unlock() which is subtly different like you pointed out!

Very interesting points about mutex profiles. When you say fast path, do you mean spin locks?

For analyzing I/O via perf and looking at kernel stacks, I'm surprised that would work. I'd think that perf will not show you much waiting on the network, because the netpoller implementation in Go uses epoll and friends to make that non-blocking under the hood. For disk I/O it probably works. Maybe I'm missing something?

There's a CompareAndSwap to do a quick acquire when there isn't contention: https://github.com/golang/go/blob/master/src/sync/mutex.go#L...

If you look at the slow path, there are many interesting paths that a lock can take in there. They are all potentially interesting when you're debugging contention.

I think this is great work. Hopefully this can motivate upstream to reconsider the proposal for it. In the long run I don't think anybody wants to run with a forked version of Go for better profiling.

> The ability to monitor go programs with a very high sampling frequency — up to 10s of microseconds

I'm curious about the overhead of this and will probably try to measure. This work certainly overcomes the limitations of setitimer(2) for sampling rate, but faster sampling is going to increase the performance overhead significantly.

In particular Go's built-in stack unwinding APIs use gopclntab (which is Go's Plan9 inspired version of eh_frame). This is rather slow compared to frame pointer unwinding (up to 55x in my testing). I think on average you can expect to spend at least 1 microsecond on unwinding, and then some additional time hashing the stack and updating the in-memory profile (perhaps another microsecond? I haven't measured this part yet). I've started writing some of this up here [1] but it's still WIP.

Anyway, based on the above, sampling every 10usec will likely slow the target program by ~20%. This is probably a very optimistic estimate because it doesn't take into account all the CPU cache thrashing that will happen.

As I said, I'll try to take a closer look later, but if anybody has some insights/comments here I'd be very interested.

[1] https://github.com/DataDog/go-profiler-notes/blob/main/stack...

Disclaimer: I work at Datadog on Continuous Go Profiling, but I'm here speaking for myself.

I suspect you're right about overhead. 10us seems much more aggressive than even the author is actually suggesting, though - the motivating example was a service with latencies in the 10s of ms, so sampling every few ms would be sufficient.

Also, if anyone reading isn't familiar with sampling profilers, Felix gave a great talk explaining them: https://github.com/felixge/talks/blob/master/fgprof/

Yeah, decent sampling for requests that last 10s of ms seems absolutely feasible with this approach, unlike the current setitimer(2) stuff.

I'm really looking forward to taking it for a spin soon and see how far the sampling rate can be pushed without introducing noticeable overhead in practice!

Nice to see Milind and the rest of the team are still making progress! One of the alternatives explored in the original proposal was using a go library that wraps perf, which I explored quite a bit previously. I haven't had time to continue that work, but it is possible to use a perf library within benchmarks (ex: https://github.com/hodgesds/compressbench) to some success. Getting this profiler merged into runtime would be really nice. Alternatively, if someone wants to extend a perf library that is able to resolve runtime stacks using eBPF that could be an interesting alternative.

It is so odd that companies with such immense engineering budgets and computational costs continue to use low resolution profiling techniques such as sampling instead of modern high resolution tracing based solutions. The gap in insight you can get between seeing some things and literally everything is similar to the gap between no profiler and a sampling profiler. You can actually trace the direct causes of slowness instead of just the symptoms. It is so massive that every time we move a system that was optimized using a sampling profiler to a tracing profiler we learn just how little we knew about how it worked and how stupid the program was behaving and routinely see 30%-50% speedups from those insights on even heavily optimized code.

The only real downsides to tracing over sampling is that the overhead is higher than low resolution sampling down to ~100 usecs per sample (single-digit to low double digit overhead) and that any performant solution basically requires instrumentation and thus results in small overhead if you want it to be runtime toggleable and requires tooling support to inject. However, given the immense benefits, it is baffling that companies continue to struggle with such minimal tooling and platforms that only have such tooling available.

That is...so the opposite of my experience. Sampling has the critical benefit of not biasing the profiling towards frequent-but-fast spots. Tracing profilers give you very precise but inaccurate information on your hotspots.

This is really cool stuff, we're at Pyroscope [1] would love to add support for it.

Too bad it's a fork of Go runtime. As @akshayshah mentioned, it seems like Go team is hesitant about straight up merging this. I wonder if a better approach would be to open up Go runtime apis so that this can be a module instead of a whole fork. I think opening up these APIs would also enable more innovation in profiling space.

[1]: https://github.com/pyroscope-io/pyroscope

"Modularizing" the runtime sounds like a good idea, but also adds new constraints: they would have to document the interfaces between the various modules and then either keep them unchanged, which would hinder further development; or if they change the module interfaces, the modules will have to be updated nevertheless, so having modules won't be much of an advantage...

I'm curious about the results under the section titled "Demonstration."

The author claims that ground truth is that each goroutine utilizes 10% of the CPU time (so stipulated that this should be the case). But, what if the results shown are accurate, i.e. that the results are the actual CPU time (because of idiosyncrasies of scheduling between the OS, the go runtime, and anything else happening on that system).

Does running the new profiler show less variance in the results from that initial experiment? Showing this result would strengthen the claim that the "out of the box" solution is inaccurate.

Trying to figure out how this would be of use to anyone else, in practice. You need a hacked up runtime and a privileged process to use the PMU. Compared to just using perf, it seems like a hassle.

I'm happy to rebuild the binary when I'm stuck on a performance problem. Obviously, the best case is having all the instrumentation on the production system that is experiencing problems, of course. I've run into problems that only manifest themselves after running for several hours, so the debugging cycle is slow when you aren't collecting what you need to debug the issue. In other cases, I've had bugs where a test case can reproduce them immediately, but the profiler still provides more targeted guesses than randomly changing something to see if it improves the outcome (even if the edit/test cycle is fast enough to support this). That's the case where this sounds helpful.

Yeah, but you know what also works? Running perf and running that through normal pprof...

The examples seem to run fine and report counters without any special privileges. Building the toolchain and then building your own code with the toolchain is trivial too, I was up and running in 10 minutes, much of which was waiting for the builds.

That suggests your system has a liberal setting of /proc/sys/kernel/perf_event_paranoid or your process is privileged, running as root or with either CAP_SYS_ADMIN or CAP_PERFMON. Generally unprivileged processes cannot access the PMU, because the PMU can be used to take over the system.

On a similar note, is there any similar tool exists for Rust?

I know one crate that offers a Go-like pprof experience: https://github.com/tikv/pprof-rs

But that crate doesn't (yet?) Use the hardware based perf counters, so this may not answer your actual question.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact