I'm always happy to see improvements to Go's profiling - the standard library's profiling is good, but it hasn't improved much since Go 1.0 (it took years just to get flame graphs integrated, and it's still near-impossible to get a usable dump of the in-memory object graph).
That said, I'm _very_ wary of tools that require a fork of the compiler and/or runtime. Uber's programming language research group is small (though mighty) and has a lot of demands on their time. Even over the medium term, it'll be hard to allocate time to keep this project current.
In the Go issue tracker, there's a long discussion of the formal proposal for these improvements. My understanding of the feedback is that the changes introduce too many Linux- and x86-specific APIs. I'd love to see the author narrow the proposal to just improve timer fidelity when the OS exposes the PMU, without any new APIs. The more controversial proposal to collect richer platform- and architecture-specific data can be discussed separately.
Discussion: https://github.com/golang/go/issues/36821
Proposal: https://go.googlesource.com/proposal/+/refs/changes/08/21950...
The Android ART profiler today is still kinda limited (too high overhead or too imprecise), so we tend to switch over to simpleperf. However, I think there are things that only in-language profilers can do.
This, lately, is my #1 gripe. I just cannot get viewcore to provide valuable insights. Anybody know of any projects in this space?
This is the reason I don't like Go: anything Google deems unimportant (like generics or packaging) either takes many years or never happens. The whole language reeks of such zealotry. In fact, there are many Google projects where I've seen popular GitHub issues linger for many years because the core devs just don't care about usage outside big G.
Google generally does a bad job of open source stewardship and Golang is no different.
It's a shame but not a surprise that outside companies who have hitched their wagons to Google and Go find themselves fighting hard just to have decent tooling that virtually every other language has.
My experience with Go is very limited, but in my tests it was always slower than C. Sometimes just a bit, sometimes 2-5 times. So my question is: looking back, don't you guys regret choosing this language?
Please don't misunderstand me, I don't intend to start any flamewars, but it seems like you're very much focused on CPU-bound performance, and the choice of language is not neutral in this case.
Inside an enterprise, there’s more to a language than just the performance (though that is a large factor). You also have to take into account existing tooling (both internal and external), developer experience, whether you can find enough developers to code in it, and source code maintainability, as well as many other common concerns with a language. Most languages will do well in a few cases but none are best in class in all cases (or everyone would use it). Go does well enough in the performance category while also doing moderately to extremely well in other categories. In CPU-bound tasks that don’t rely on CGo, Go does extremely well in my experience. I think in general though, for most enterprises, Go strikes a happy medium and makes the right trade-offs that most developers are willing to make.
If you have some fairly simple function/task, then yeah, a C version will probably blow the Go version away almost all of the time. But that's not necessarily indicative of real-world performance of a full application.
And of course, there are other interests than "maximize performance" too, such as developer efficiency.
Overall I agree. I'd take a speed hit for ease of development most of the time, but there are degrees of speed hit that are acceptable depending on the context.
In nearly all cases, there was plenty of room to make the Go service faster. A more careful choice of data structure and algorithm, finer-grained locking, fan-out across a goroutine pool, or just avoiding a zillion heap allocations solved most problems. I don't recall any cases where Go was simply incapable of meeting our performance needs.
As a side benefit, services with more stringent performance requirements often exposed inefficiencies in widely-used libraries. Optimizing those libraries made every Go service faster and cut our overall compute spend. Avoiding rewrites in C++ or Rust let those wins add up over time.
That said, I’m very thankful that tools like this are being shared with the community, even if they're less than perfect. It’s great that we have access to so many tools and so much research.
I've been consistently disappointed by Go's mutex profiling. It is kind of useful if you can synthesize the contention, but less useful if the contention isn't obvious. For my most recent contention debugging issue, I was very close to modifying the runtime (mostly sync.Mutex) to emit events that perf could sample on, and build a flame graph for cases like "took the fast path on sync.Mutex". What takes the fast path under synthetic load could easily take the slow path under different load, and that would let you identify the bottlenecks before they occur. I ended up not needing it, so didn't do it, but it's interesting to me that other people have run into the case where the built-in profiles aren't quite enough. There are definitely a lot of improvements to be made here.
(Getting ahead of lock contention has come up a lot in my Go career. When I was at Google, I had an application that computed metrics over hundreds of thousands of concurrent network streams. I used the standard library for metrics, which synchronized on sync.RWMutex, and it was just too slow to synchronize millions of events per second. This sort of bottleneck will be obvious to anyone, probably; you simply can't block all metric-increments on reads every 15 seconds and expect to fully utilize 96 vCPUs that are trying to write ;) No profile ever pointed me at the culprit in a straightforward manner; I had to dig into the time-per-assembly-instruction profile, where I noticed an inordinate amount of time being spent in an instruction incidental to the implementation of sync.Mutex. I ended up switching to atomic.AddInt64, which solved my problem. I think the overhead of a write barrier is actually too heavy and I wouldn't do that today; I'd just maintain goroutine-local counters and synchronize them occasionally. You can still fall behind, of course, but your program can easily detect that case and log a message like "we're too slow, buy another computer!" in that branch. Easier than reading the profiles.)
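To make the "goroutine-local counters, synchronized occasionally" idea concrete, here is a minimal sketch. The names countEvents and flushEvery are hypothetical and the batch size is illustrative rather than tuned; this is not the metrics library mentioned above, just one way the pattern could look.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// flushEvery is an illustrative batch size (an assumption, not a tuned value).
const flushEvery = 1024

// countEvents runs `workers` goroutines that each record `perWorker` events.
// Each goroutine increments a plain local variable and only touches the
// shared total once per flushEvery events, plus one final flush.
func countEvents(workers, perWorker int) int64 {
	var total int64
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			var local int64
			for i := 0; i < perWorker; i++ {
				local++ // hot path: no lock, no atomic
				if local == flushEvery {
					atomic.AddInt64(&total, local)
					local = 0
				}
			}
			atomic.AddInt64(&total, local) // flush the remainder
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&total)
}

func main() {
	fmt.Println(countEvents(8, 100000))
}
```

The trade-off is that a reader of the shared total can lag behind by up to flushEvery events per goroutine, which is usually acceptable for metrics that are scraped every few seconds.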
perf record --call-graph dwarf -e cycles --switch-events --sample-cpu -p $(pidof my-go-process)
I played a little with off-CPU profiling (-e sched:sched_switch), but Go kind of outsmarts that by doing its own scheduling: it takes Goroutines off Machines when it thinks they're going to block inside a syscall, so the resulting flamegraphs are usually rooted in runtime.park_m. Though if the runtime gets something wrong, you'd be able to see it here.
FWIW I'm planning to work on the problem from downstream as well by adding more docs here: https://github.com/felixge/go-profiler-notes . I haven't gotten to the mutex profile yet, but I've covered the block profile in great detail. It actually overlaps with the mutex profile for contention, but it tracks Lock() rather than Unlock(), which is subtly different, like you pointed out!
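For anyone who wants to compare the two profiles side by side, here is a minimal sketch that turns both on around a deliberately contended mutex. The helper names contend and profileContention are made up for illustration; the runtime and runtime/pprof calls are the standard library ones. Recording every event (rate 1) is fine for a toy like this but is probably too expensive for production.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/pprof"
	"sync"
	"time"
)

// contend makes several goroutines fight over one mutex so that both
// profiles have something to record.
func contend(goroutines int, hold time.Duration) {
	var mu sync.Mutex
	var wg sync.WaitGroup
	for i := 0; i < goroutines; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock() // waiting here is what the block profile records
			time.Sleep(hold)
			mu.Unlock() // the mutex profile blames this call site
		}()
	}
	wg.Wait()
}

// profileContention enables both profiles, runs the workload, and returns
// how many distinct stacks each profile has recorded.
func profileContention() (blockStacks, mutexStacks int) {
	runtime.SetBlockProfileRate(1)             // sample every blocking event
	prev := runtime.SetMutexProfileFraction(1) // sample every contention event
	defer runtime.SetBlockProfileRate(0)
	defer runtime.SetMutexProfileFraction(prev)

	contend(4, 5*time.Millisecond)
	return pprof.Lookup("block").Count(), pprof.Lookup("mutex").Count()
}

func main() {
	b, m := profileContention()
	fmt.Println("block stacks:", b, "mutex stacks:", m)
}
```

To see the actual stacks instead of just counts, pprof.Lookup("block").WriteTo(os.Stdout, 1) (and the same for "mutex") prints them in text form.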
For analyzing I/O via perf and looking at kernel stacks, I'm surprised that would work. I'd think that perf will not show you much waiting on the network, because the netpoller implementation in Go uses epoll and friends to make that non-blocking under the hood. For disk I/O it probably works. Maybe I'm missing something?
If you look at the slow path, there are many interesting paths that a lock can take in there. They are all potentially interesting when you're debugging contention.
> The ability to monitor go programs with a very high sampling frequency — up to 10s of microseconds
I'm curious about the overhead of this and will probably try to measure. This work certainly overcomes the limitations of setitimer(2) for sampling rate, but faster sampling is going to increase the performance overhead significantly.
In particular, Go's built-in stack unwinding APIs use gopclntab (which is Go's Plan 9-inspired version of eh_frame). This is rather slow compared to frame pointer unwinding (up to 55x in my testing). I think on average you can expect to spend at least 1 microsecond on unwinding, and then some additional time hashing the stack and updating the in-memory profile (perhaps another microsecond? I haven't measured this part yet). I've started writing some of this up, but it's still WIP.
Anyway, based on the above, sampling every 10usec will likely slow the target program by ~20%. This is probably a very optimistic estimate because it doesn't take into account all the CPU cache thrashing that will happen.
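The arithmetic behind that ~20% is just per-sample cost divided by sampling interval. A small sketch, assuming roughly 2 usec per sample (about 1 usec of unwinding plus the unmeasured guess of about 1 usec for hashing and bookkeeping from above); overheadPercent is a hypothetical helper, and the model ignores cache effects entirely:

```go
package main

import "fmt"

// overheadPercent estimates the share of CPU time spent taking samples,
// given the cost of one sample and the interval between samples, both in
// microseconds. It is a back-of-the-envelope model only.
func overheadPercent(perSampleMicros, intervalMicros float64) float64 {
	return 100 * perSampleMicros / intervalMicros
}

func main() {
	// ~2 usec per sample, one sample every 10 usec.
	fmt.Printf("10 usec interval: %.2f%%\n", overheadPercent(2, 10))
	// Same per-sample cost at the default 100 Hz rate (one sample per 10,000 usec).
	fmt.Printf("100 Hz default:   %.2f%%\n", overheadPercent(2, 10000))
}
```

Under the same assumptions, the default 100 Hz rate costs three orders of magnitude less, which is why the per-sample cost only starts to matter at these very high frequencies.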
As I said, I'll try to take a closer look later, but if anybody has some insights/comments here I'd be very interested.
Disclaimer: I work at Datadog on Continuous Go Profiling, but I'm here speaking for myself.
Also, if anyone reading isn't familiar with sampling profilers, Felix gave a great talk explaining them: https://github.com/felixge/talks/blob/master/fgprof/
I'm really looking forward to taking it for a spin soon and seeing how far the sampling rate can be pushed without introducing noticeable overhead in practice!
The only real downsides of tracing compared to sampling are that the overhead is higher than that of low-resolution sampling, down to ~100 usecs per sample (single-digit to low double-digit overhead), and that any performant solution basically requires instrumentation, which leaves a small overhead if you want it to be toggleable at runtime and requires tooling support to inject. However, given the immense benefits, it is baffling that companies continue to struggle with such minimal tooling and with platforms that only have such tooling available.
Too bad it's a fork of the Go runtime. As @akshayshah mentioned, it seems like the Go team is hesitant about straight up merging this. I wonder if a better approach would be to open up the Go runtime APIs so that this can be a module instead of a whole fork. I think opening up these APIs would also enable more innovation in the profiling space.
The author claims the ground truth is that each goroutine utilizes 10% of the CPU time (that is, it is stipulated that this should be the case). But what if the results shown are accurate, i.e. they reflect the CPU time each goroutine actually got (because of idiosyncrasies of scheduling between the OS, the Go runtime, and anything else happening on that system)?
Does running the new profiler show less variance in the results from that initial experiment? Showing this result would strengthen the claim that the "out of the box" solution is inaccurate.
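One way to check the variance yourself is to rerun an equal-work experiment several times and compare the per-goroutine shares. The sketch below is not the author's benchmark: spin and profileEqualWork are hypothetical names, and the workload is an arbitrary arithmetic loop. It labels each goroutine so that samples can be grouped per worker when the profile is opened in go tool pprof.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"runtime/pprof"
	"strconv"
	"sync"
)

// spin is a stand-in CPU-bound workload. Every goroutine runs the same
// number of iterations, so ideally each gets an equal share of samples.
func spin(iters int) uint64 {
	var x uint64 = 1
	for i := 0; i < iters; i++ {
		x = x*6364136223846793005 + 1442695040888963407
	}
	return x
}

// profileEqualWork runs `goroutines` identical workers under the built-in
// CPU profiler and returns the raw profile bytes.
func profileEqualWork(goroutines, iters int) ([]byte, error) {
	var buf bytes.Buffer
	if err := pprof.StartCPUProfile(&buf); err != nil {
		return nil, err
	}
	sink := make([]uint64, goroutines) // keeps the results live
	var wg sync.WaitGroup
	for g := 0; g < goroutines; g++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			labels := pprof.Labels("worker", strconv.Itoa(id))
			pprof.Do(context.Background(), labels, func(context.Context) {
				sink[id] = spin(iters)
			})
		}(g)
	}
	wg.Wait()
	pprof.StopCPUProfile()
	return buf.Bytes(), nil
}

func main() {
	data, err := profileEqualWork(10, 50000000)
	if err != nil {
		fmt.Println("profiling failed:", err)
		return
	}
	fmt.Println("profile size in bytes:", len(data))
}
```

Writing the returned bytes to a file and repeating the run a handful of times would show whether the skew is stable (pointing at the profiler) or moves around between runs (pointing at scheduling noise).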
But that crate doesn't (yet?) use the hardware-based perf counters, so this may not answer your actual question.