Hacker News
Practices for writing high-performance Go (github.com)
498 points by ingve 6 months ago | 97 comments

Almost none of this is about Go. 80% down they mention some basic stuff about GC, but nothing very specific about the Go GC. Then after that it is back to very basic language-agnostic optimization ideas.

I was hoping for more insights about Go specifically.

I did a Go course a while ago; interestingly enough, at least on the first day, it wasn't about Go at all. The trainer basically stated that if you know any curly-brace language, you can write Go.

And I think that's the point. The language doesn't get in your way, and it's not any specific language feature that will make your code go fast or slow - unless you abuse it, or use it when you don't need to.

I mean, if you're happy with C then by all means; Go isn't claiming to be faster than other languages, not this close to the metal. It mainly aims to take away some of the mental overhead you get with C and similar languages. And compile times.

Thanks for this, I just started getting into and writing Go code.

I have a question that I don't remember being answered in any tutorial I've done so far. I've written a lot of C code and I typically make memory managed lists, so if I need a new common object I grab one from the list to avoid free/malloc as much as possible. Does Go do this automatically, or should I still do this on my own? I'm writing a long-running server, not a utility or something short lived.

There is sync.Pool [0] but I wouldn't reach for it until you've profiled your application[1] and found that the specific allocation in question is a hot spot.

As of now[2], pools get cleared when a GC occurs, so they can be a bit quirky to use in practice. This behavior is changing, though.

[0] https://golang.org/pkg/sync/#Pool

[1] https://blog.golang.org/profiling-go-programs

[2] https://github.com/golang/go/issues/22950

That is still a useful tool in Go, but you don’t need to use it as much as you’d think. The GC is pretty good at short-lived objects. The canonical implementation is sync.Pool if you don’t want to build one yourself.
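For anyone who hasn't used it, here is a minimal sketch of the sync.Pool pattern mentioned above. The `render` function and the choice of `*bytes.Buffer` are made up for illustration; only `sync.Pool`, `Get` and `Put` come from the standard library.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable *bytes.Buffer values. New is only called
// when the pool has nothing to offer.
var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

// render is a hypothetical hot-path function that needs scratch space.
func render(name string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset() // a pooled buffer may still hold old data
	defer bufPool.Put(buf)

	buf.WriteString("hello, ")
	buf.WriteString(name)
	return buf.String() // String copies, so reusing buf later is safe
}

func main() {
	fmt.Println(render("gopher"))
}
```

Whether this beats a plain allocation depends on the workload, so measure before and after.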

Usually I use pools to avoid the allocation cost. The GC sweep is not usually my problem.

Profile it and see if it is a problem in Go, I would say.

I don't know if the Go runtime has optimizations for this kind of thing inherently, and it might.

Profile, and see.
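A cheap first measurement, before reaching for a full profiler, is counting allocations per call. A rough sketch using `testing.AllocsPerRun` from the standard library; `allocating`, `allocsPerCall` and `sink` are hypothetical names for illustration.

```go
package main

import (
	"fmt"
	"testing"
)

// sink is package-level, so anything stored in it must live on the heap.
var sink []byte

// allocating stands in for a function suspected of being an allocation hot spot.
func allocating() { sink = make([]byte, 1024) }

// allocsPerCall reports the average number of heap allocations per call of f.
func allocsPerCall(f func()) float64 {
	return testing.AllocsPerRun(1000, f)
}

func main() {
	fmt.Printf("allocs per call: %.0f\n", allocsPerCall(allocating))
}
```

This only tells you how many allocations happen, not how expensive they are; for that you still need the profiler.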

That's ... not helpful. I already profile and profiling indicates that allocations are slow and sweeps are fast.

Go's optimization for this is escape analysis, which is a best-effort attempt to put things on the stack. Because the escape analysis is naive (which isn't necessarily a bad thing), it means lots of things are still allocated on the heap, and those are the allocations I'm referring to.

The bigger issue is usually stack vs heap allocation and it’s hard to reason about that. But the good news is that unlike a JIT system you can actually pretest golang escape analysis.

Look at the -m gcflag to see what is escaping, then see if that is where the allocation hotspot is. If so see if you can get the escape to happen where you want.

Otherwise you need to go allocation-free. Notice that Go generally gets _worse_ at heap allocation over time, so it’s likely worse than early profiles suggest.
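To make the -m suggestion concrete, here is a small sketch with one value that can stay on the stack and one that has to escape. The type and function names are invented for illustration; the exact wording of the compiler's diagnostics varies between Go versions.

```go
package main

import "fmt"

type point struct{ x, y int }

// sumOnStack keeps p local, so escape analysis can leave it on the stack.
func sumOnStack(x, y int) int {
	p := point{x, y}
	return p.x + p.y
}

// newOnHeap returns a pointer to p, so p outlives the call and the
// compiler has to place it on the heap.
func newOnHeap(x, y int) *point {
	p := point{x, y}
	return &p
}

func main() {
	// Build with: go build -gcflags=-m
	// and look for a diagnostic about p being moved to the heap.
	fmt.Println(sumOnStack(1, 2), newOnHeap(3, 4).x)
}
```

If the escaping allocation is also the one your profile flags, that is the place to restructure.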

> The bigger issue is usually stack vs heap allocation and it’s hard to reason about that.

I don't think it's especially hard to reason about in the main cases, and when you're unsure (as you mention) you can actually print out the escapes. Would be really interesting to have an editor that would show you where things were escaping (although that could easily lead to premature optimization).

.NET has the ability to allocate on the stack, using `stackalloc` - does Go have a similar feature?

I’ve written them myself and used sync.Pool. I’d carefully test where you are using them, though, as I’ve had an instance where an implementation went from a lot better to a lot worse than the naive allocation due to escape analysis getting better across a Go version upgrade.
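A hand-written free list does not need much code. Below is one possible sketch built on a buffered channel, in the spirit of the "leaky buffer" example in Effective Go; the `conn` type, the 4096-byte capacity and all names are assumptions for illustration.

```go
package main

import "fmt"

// conn is a hypothetical object that is expensive to allocate.
type conn struct{ buf []byte }

// freeList is a fixed-capacity free list backed by a buffered channel.
type freeList struct{ items chan *conn }

func newFreeList(n int) *freeList {
	return &freeList{items: make(chan *conn, n)}
}

// get reuses a parked object if there is one, otherwise allocates.
func (f *freeList) get() *conn {
	select {
	case c := <-f.items:
		return c
	default:
		return &conn{buf: make([]byte, 0, 4096)}
	}
}

// put parks the object for reuse, or drops it for the GC if the list is full.
func (f *freeList) put(c *conn) {
	c.buf = c.buf[:0]
	select {
	case f.items <- c:
	default:
	}
}

func main() {
	fl := newFreeList(8)
	c := fl.get()
	c.buf = append(c.buf, "payload"...)
	fl.put(c)
	fmt.Println(fl.get() == c) // the parked object is handed out again
}
```

Unlike sync.Pool, nothing here is dropped when a GC runs, but every get and put goes through one shared channel, so it is not a per-CPU structure.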

The Go authors have steadfastly refused to offer users any means of writing a CPU-local freelist, even though the utility of such things is obvious and evidenced by the fact that they use such things all over the runtime package. They just think that _you_ are too stupid to be allowed to do such a thing. Honestly with the attitude that the Go authors treat their users I don't know why anyone tries to write high performing code in that language. C++ is there, after all.

> Honestly with the attitude that the Go authors treat their users I don't know why anyone tries to write high performing code in that language. C++ is there, after all.

I work in C++. I'm sometimes surprised at the way C++ treats its users. There seems to be a culture of pre-optimization. It's hard to write the most abstract application code without constantly thinking about performance under the surface. That's baked right in. Concepts which are used everyday by programmers can barely be cleanly summarized in a paragraph, even by well respected luminaries of the field. Trivia you "just have to remember," impinges on almost every single function or method you write.

That said, there's also a lot of awesome things in C++. It just represents a particular set of cost/benefit dials. Golang represents another.

C++ is unique in that it aims to be feature-rich, while preserving as much as possible the property that if you don't use it, you don't pay for it.

It breaks its own rule occasionally -- RTTI, exceptions, [0] standard library machinery with always-on thread safety -- but the C++ folks go to extreme lengths in the name of performance.

There's no small irony in the way embedded folks write off C++ as too heavyweight, given that C++ tortures itself in the name of not forcing bloat upon the programmer.

[0] https://llvm.org/docs/CodingStandards.html#do-not-use-rtti-o...

> C++ is unique in that it aims to be feature-rich, while preserving as much as possible the property that if you don't use it, you don't pay for it.

"You don't pay for it," should really be, you == the-CPU doesn't pay for it. On the other hand you == the-programmer has to think about it all the time. For some domains/contexts, yes this is actually very desirable. In that case C++ becomes a performance/optimization Swiss army knife.

Other people might want it dialed in for, "You don't have to think too much about it, until it's time to optimize, and then you can't optimize all the way, but you get enough of the way there."

> it aims to be feature-rich

Visual Basic had a Replace() function back in 1998, to replace substrings. C++ now has many amazingly advanced high-level features, but still no built-in way to replace a substring. I don’t think I needed to write my own string replacing function in any other language I used (I’ve been programming for a living since 2000).

I like C++ and use it a lot. But these seemingly small issues with its standard library escalate quickly. String handling, IO, localization, date & time, multithreading before C++11, many standard collections, and other parts are just not good enough. I pretty much stopped writing complete apps in C++; nowadays I’m using C++ for dll/so components I consume from higher-level languages like C#, Python or golang. And when I do, I often choose to ignore large parts of the standard library in favor of alternatives from ATL, EASTL, or my own.

C's string operations look so much cleaner and more complete than the "algorithmic and templating" ones of C++. And C++ can quickly become difficult to read after overloading various operators on classes or after using fancier, less-known features. I personally find C++ too complex and confusing. I know Google dictates a reasonable subset of C++ for Fuchsia, and that is good. Btw, dependency management is also sometimes a real pain with C/C++ - then they had to use crazy build tools like GN or Bazel to get it to compile in various environments and platforms... Programming in Go/C# lets you quickly focus on getting the work done.

> so much cleaner and complete than those "algorithmic and templating" ones of C++

That's C++ strings, too: https://docs.microsoft.com/en-us/cpp/atl-mfc-shared/referenc...

Not only do they have a better API (replace, tokenize, implicit cast to const pointers), these strings are often faster. That particular class is Windows-only, but nothing prevented the C++ standard folks from coming up with conceptually similar cross-platform stuff. BTW, that CString class predates the C++ standard library by many years.

> And C++ can quickly become difficult to read after overloading various operators on classes or after using fancier less-known features.

Yes, and I saw quite a lot of code like that.

But in other cases these features help with readability. I often code math-heavy stuff in C++ processing vectors, matrices, quaternions, complex numbers, etc. The ability to implement custom math operators on these structures IMO helps with readability.

What doesn't help is the ability to abuse them, like C++ iostreams do with `operator <<` everywhere.

Not just operators, it's generally too easy to abuse features of the languages, writing code that's very hard to work with. Unfortunately, not doing that requires lots of experience with the language.

I'm not planning to switch due to the good parts. First-party SIMD intrinsics support. Trivial interop with C and C++ libraries: hard requirements like OS kernel APIs and GPU APIs, industry standards like libpng, or just very nice to have like Eigen. A very small runtime makes it possible to build dynamic libraries, consume them from anywhere, and not worry about binary size or runtime dependencies. Also tools like debuggers and profilers are very good.

But when performance is less critical, I'm more productive using other, higher-level languages.

It is a trait inherited from the C culture, although in C it is even worse, micro-optimizing every line of code as it is being written.

A C dev would break into a cold sweat as s/he types virtual, or operator something().

Actually C programmers frequently use dynamic dispatch - even in the Linux Kernel for example. They just like to be explicit about it.

A search on comp.lang.c archives might reveal other points of view.

Are you trying to counter an argument based on relevant practice (the Linux kernel) with a newsgroup from the 90s? Without even giving a reference?

The Linux kernel is one example, as if there aren't more C coders out there.

And yes, I am. I don't care about earning HN brownie points just to prove something that will be hand waved anyway.

"The key point here is our programmers are Googlers, they’re not researchers. They’re typically, fairly young, fresh out of school, probably learned Java, maybe learned C or C++, probably learned Python. They’re not capable of understanding a brilliant language but we want to use them to build good software. So, the language that we give them has to be easy for them to understand and easy to adopt."

– Rob Pike

That is taken out of context. Practically all Google server code is written in C++ or, at a much smaller scale, Java. Go is used for logs analysis, internal services, monitoring and automation, etc. Nobody at Google reaches for Go when they are thinking of high performance nor would anyone try to optimize a large service in Go. They would just write it in C++ instead as soon as it starts to cost non-trivial amounts of money. Remember that Pike wrote Sawzall and Go replaces it, in a similar role.

High performance is a very vague term nowadays. There are companies with Google-scale problems that don't use C++ as their main language (Netflix, Uber, etc.); any modern runtime can do "high performance", especially for backend applications. YouTube's primary language is Python, for instance.

Netflix OK, but Uber? They did 4 billion rides in 2017. That's naively about 130 rides per second, with nice parallelism due to the locality of the physical rider. That's not an insane amount of data crunching.
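The back-of-the-envelope number checks out. A quick sketch of the arithmetic, assuming the rides are spread evenly over a 365-day year (the function name is made up):

```go
package main

import "fmt"

// ridesPerSecond spreads a yearly total evenly over a 365-day year.
func ridesPerSecond(ridesPerYear float64) float64 {
	const secondsPerYear = 365 * 24 * 60 * 60 // 31,536,000
	return ridesPerYear / secondsPerYear
}

func main() {
	fmt.Printf("%.0f rides/s\n", ridesPerSecond(4e9))
}
```

That comes to roughly 127 rides per second on average; real traffic is bursty, so peaks would be higher than the average.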

You could say the same for Netflix, it's a giant CDN so why do they need 150k ec2 instances?

The problem they solve looks simple on the outside, the reality is different.


In my experience, most non-trivial sites are similar to icebergs of code - users might see 5% of the codepath in normal usage for what they want, but there's boatloads of OLAP, data analysis, instrumentation, error handling, and reporting being done on those transactions to help make that 5% both more profitable and reliable.

It’s not just rides, but also matching drivers with riders, geofence lookups, etc. It’s not a CRUD app.

Hmm, that seems a bit exaggerated. I seem to recall someone on the Go team describing the rewrite of a server that was originally in C++ and became much faster. (Largely because it started out as a heap of legacy code that was neglected for years.)

That's probably pretty unusual, but the point is, just because it was written in C++ doesn't mean it's any good. Not everything gets serious attention from people who know what they're doing.

You are thinking of dl.google.com. It was a Go program that replaced a very old single-threaded C++ server (using the ancient SelectServer C++ core, deprecated at that time). The thing you have to realize about Google infrastructure is it does not require vertical scalability of its service backends. It is very typical to write a program and deploy it with 100 replicas having one CPU each on 100 different machines. A variety of load balancers (all written in C++, naturally) paper over the complexity. Nobody at Google expects a Go program to occupy an entire machine because Borg packs hundreds of services onto a single machine. The question for _you_ is: do _you_ have Borg or another workload coordinator that allows you to do this? Or do you have "the database machine" and "the server" where you expect individual processes to scale up to many cores?

BTW the reason dl.google.com rewrite was faster was not because it was in Go, it was because the C++ server was serving off its local disk and the rewrite was serving off a cluster file system with ~infinite I/O capabilities. Apples and oranges.

It was even crazier: when the original download server was written, local disk was faster, mainly because the network wasn't too fast (rack locality was a concern, way back when), but also because GFS chunk servers weren't, either. At the time of the rewrite, Firehose and co. were being deployed everywhere, D did a better job at serving bytes and, later, local disk use was placed in a lower QOS level. Unless you were one of the few teams that had a good rationale for dedicated machines, if you fought for I/O time on a given disk against D, invariably you lost.

Apples and oranges, though still an important lesson: many optimizations happen at the level of interactions between services. C++ is more micro-optimizable than Go, but it doesn't matter when the low hanging fruit has nothing to do with language (and I would argue this is more often than not).

It is not out of context; it was stated just like that in a talk given by Rob Pike at Microsoft, which is available on Channel 9.

For a company with supposedly high hiring standards. Granted domain knowledge also matters but that isn't what the people described have. So what does Google teach them?

Everyone I met there had my respect. I find "not capable of understanding a brilliant language" to be baseless contempt and I'm surprised he didn't get more heat over it. I wouldn't work with him, nor whatever people he's trying in vain to accommodate.

Yikes. I think he is arguing against overwhelming programmers with complexity and the ability to hammer themselves with pointers, leaks and other banes of existence for average C programmers.

That's really quite ironic given: https://news.ycombinator.com/item?id=19825787

I should have worded that more gently. The greatest programmers in the world have demonstrated that even they can't use C reliably.

Concurrency is a bit easier in Go than in C++.

Disagree. I can write a server in C++ that, with careful thought and a bit of planning, will be able to exploit the resources of the kind of 88-core dual-socket machines that are mainstream today. No amount of planning will allow me to write a Go program that does the same thing on the same machine. Small amounts of concurrency might seem easy in Go but lots of concurrency is hard.

"I can write a server in C++ that, with careful thought and a bit of planning, will be able to exploit the resources of the kind of 88-core dual-socket machines that are mainstream today."

This is downvoted grey as I write this, and as someone who has been called a "Go shill" on occasion... it's exactly right. Go is a decent language for writing code in a fairly straightforward manner and getting pretty good performance out of it, but if you need to squeeze every bit of performance out of your hardware, it's a bad choice. (I can give you choices that are worse by an order of magnitude, or even more in some cases, but it's still a bad choice.) You have a fairly smooth optimization ride up to 1.5-2x slower than C for most use cases, a few pathological edge cases where it's grossly worse (many clustered around these "every drop of performance" problems!) and a few where it'll reach parity, and then you're going to hit a brick wall.

An 88 core system is probably not impossible to sensibly use with Go, but you're going to be more constrained. I'd imagine it can probably serve web requests really well, but it's more likely to hit pathological cases if you hammer certain global resources.

Arguably, precisely part of the point of Go was that C++ makes you pay for that level of performance in code complexity and cognitive overhead all the time, even when you don't remotely need it. (I can't prove this, but I'd guess the median "cloud service" is grotesquely overprovisioned on the smallest AWS instance. The "cloud services" that leap to mind are things like the AWS auth servers or Netflix content servers or the Google crawling or indexing servers, but while those are huge and important, they're also in many dimensions the exceptions. A good chunk of the popularity of "serverless" is probably a result of this.) When you do need every bit of performance, though, the list of viable options is short.

As a fellow gopher, I agree, Go isn't trying to compete with C/C++ on performance, it's trying to give you safety and ease of use of high level language.

I think the point about concurrency typifies this: channels aren't as fast as mutexes and semaphores, but they make shared code easier for co-workers to reason about.

If you're in a domain where performance is still king, Go isn't trying to find a place there.
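To illustrate the mutex-versus-channel trade-off described above, here is the same shared counter written both ways. This is only a sketch with invented names; it says nothing about relative speed, which you would have to benchmark on your own workload.

```go
package main

import (
	"fmt"
	"sync"
)

// mutexCounter guards shared state with a lock.
type mutexCounter struct {
	mu sync.Mutex
	n  int
}

func (c *mutexCounter) inc() {
	c.mu.Lock()
	c.n++
	c.mu.Unlock()
}

// countWithMutex runs `workers` goroutines that each add `per` to one counter.
func countWithMutex(workers, per int) int {
	var c mutexCounter
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < per; j++ {
				c.inc()
			}
		}()
	}
	wg.Wait()
	return c.n
}

// countWithChannel sends every increment to a single owner goroutine,
// so the running total is never shared between goroutines.
func countWithChannel(workers, per int) int {
	incs := make(chan int)
	total := make(chan int)
	go func() {
		sum := 0
		for d := range incs {
			sum += d
		}
		total <- sum
	}()
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < per; j++ {
				incs <- 1
			}
		}()
	}
	wg.Wait()
	close(incs)
	return <-total
}

func main() {
	fmt.Println(countWithMutex(8, 1000), countWithChannel(8, 1000))
}
```

The channel version has a single owner for the state, which is the part that tends to be easier to review; the mutex version does less work per increment.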

Presumably he was downvoted for responding to “concurrency in Go is easier” with “Disagree. I can get every bit of performance out of an 88 core CPU with C++”. It’s not a coherent counter argument; no one claimed Go was as efficient as C++, only that concurrency is easier.

How would Rust compare?

The benchmarks game puts C, Rust, and C++, in that order, roughly on par in performance, with Go being about 2-3x slower. No idea if that's accurate. Sampling bias means people who like performance are optimizing the languages used for performance, and people who just like to get something working quit after the first benchmark in Go or Python is finished. https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

If you discard regexredux, then Rust is faster than C and C++ on average: see the average bar in the "How many times slower" graph [1].

The regexredux program is an outlier for Rust because replacing a regex match in a string is slower in the regex crate: its author chose to implement a safer but slower algorithm. To fix this, the regex crate must be updated or replaced. I spent two weekends on this.

[1]: https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

> To fix this, regex crate must be updated or replaced.

I somehow doubt pcre2 is being changed to make the tiny toy C programs run better.

> I spent two weekends on this.

So shouldn't we assume the program performance simply reflects all-those-hours you've spent working on it?

Look at the program:

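  use regex::{NoExpand, Regex};

  fn find_replaced_sequence_length(sequence: String) -> usize {
      // Replace the following patterns, one at a time:
      let substs = vec![
          ("tHa[Nt]", "<4>"),
          ("aND|caN|Ha[DS]|WaS", "<3>"),
          ("a[NSt]|BY", "<2>"),
          ("<[^>]*>", "|"),
          ("\\|[^|][^|]*\\|", "-"),
      ];
      // Perform the replacements in sequence, feeding each result into the next.
      // (The into_iter(), Regex::new(..).unwrap() and trailing .len() lines were
      // cut from the excerpt and are reconstructed here as an assumption.)
      substs.into_iter()
          .fold(sequence, |s, (re, replacement)| {
              Regex::new(re).unwrap()
                  .replace_all(&s, NoExpand(replacement)).into_owned()
          })
          .len()
  }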

It measures the performance of the RE engine. I can switch from the regex crate to PCRE2, and the program's performance will match C.

> I can switch from regex crate to PCRE2, and program performance will match C.

Perhaps it would; those measurements have not been made.

What does that have to do with re-writing libraries to make tiny toy programs run better?

What does that have to do with program performance being a proxy for programmer effort?

mandelbrot is outlier in Rust, because… :-)

Because hardware acceleration is used in C and C++ versions. I will fix this soon.

> Sampling bias means people who like performance are optimizing the languages used for performance …

Also, there might be a "something to prove" bias :-)

Also, Dropbox rewrote the core stuff from Go to Rust when they needed more performance. So that is one example/anecdote.

Concurrency in Rust is extremely easy, because it is safe by design. Just import the rayon crate and change iter() to par_iter() [1]. The compiler will point out problems, e.g. it will not allow you to send a type that cannot be used concurrently until it is wrapped in an Arc (an atomically reference-counted pointer).

[1]: https://docs.rs/rayon/1.0.3/rayon/

You're talking about a very specific scenario; out of the box, Go is of course easier than C++ for concurrency. Who, by the way, runs applications on an 88-core server? Why would you do that instead of splitting the work into smaller chunks?

> Who, by the way, runs applications on an 88-core server? Why would you do that instead of splitting the work into smaller chunks?

Graph analysis and/or routing, for one. Distributed Dijkstra is pretty impractical (the optimal lower bound on the number of messages required equals the number of edges in the graph!)

For example, I don't work for Google, but I can pretty much guarantee you that an individual Google Maps routing query runs on a single, extremely-high-memory (but not NUMA!) node, which has likely been optimized for speed using plain-old threading on a high-core-count CPU, with the upper bound on the number of useful cores being the system architecture's per-socket memory bandwidth.

Where do you get high core count computers that aren’t numa?

I highly doubt Google runs any workload on Borg with CPU count > 24. I don't work there, so I'm just assuming, but it goes against all modern scaling/deployment patterns.

There are various workloads that occupy entire machines, they are all very carefully written by dedicated performance experts, in C++.

I would be very, very curious if this is actually true. I agree with the intuition behind the statement but I am actually curious how it is done.

I think you're ignoring caching/preprocessing in your analysis.

I am, yeah; but it’s an infinite regress—if these aren’t the specs of the machine that answers your query, then they’re the specs of the machine that does the spec-work to build the cache.

Or the machine that designs the machine that answers your query.

Anyone doing scale up in the HPC space. It's a massive market. This link has a picture in it showing the industries that buy that stuff:


Although they cost a ton, there were at least two reasons to like NUMA machines:

1. You program them much like multithreaded machines instead of message passing like MPI. You do need to account for locality. There's OS, library, and documentation support for handling that, though. Porting a multithreaded library to NUMA is a much smaller problem than clustering it.

2. A massive amount of memory with lower-latency access than on clusters. If your data is in memory, it's much faster than if it's being moved in and out of memory. Then, when it is moved, it's moved faster. Low-latency reduces the damage of many smaller copies, too.

For such reasons, I always wanted an SGI or Cray machine. Modern, multi-core machines with plenty of RAM are good enough for most of my purposes, though. I do plan to do some model-checking of software in the near future. The amount of RAM used grows exponentially or something like that with the size of the program. Small ones already use GB of RAM in the analyses. Some are getting parallelized a bit. Obviously, 88 cores with 100+GB of RAM could be pretty useful if handling programs twice or three times as large. :)

Btw, there are also languages designed specifically to take advantage of parallelism in many HPC situations. Like Go, they were intended to let you describe the algorithms in a high-level way with the compiler synthesizing efficient implementations for everything from multi-cores to NUMA to clusters. That's the theory. The best one from my prior research was Chapel. Then, there's simpler ones for stuff like data parallel with Cilk being an example.



ParaSail was a recent one with interesting design. I'm not sure what its current status is in terms of usability.


You're making a different argument, that one does not need so much vertical scalability, and I might agree, but that doesn't negate the fact that Go has limits of vertical scalability that C++ doesn't face.

The original claim was that concurrency is easier in Go, not that Go has a higher performance ceiling. You are the one making the different argument. :)

Could you describe how you would go about designing this, and what libraries/frameworks you would use? AFAIK coroutines are only coming with C++20, so I guess you would still use std::thread, right?

std::thread has only been around since C++11, so most existing HPC codebases probably either use pthreads[1] directly, or an organization- or project-specific library wrapping pthreads. Some used Boost.Thread, which was effectively the predecessor of std::thread.

If you want to saturate every core on a very parallel problem like "handle as many packets as you can", a very rough first approach would be to spin up 88 threads (one per core). Within each thread, you could use something like Boost.Fiber[2] (similar to goroutines, except mapped N:1 to OS threads, rather than M:N) to avoid blocking the threads on IO. This paper[3] has a good overview of different concurrency approaches.

If you're doing something that's not embarrassingly parallel[4], then often there is a well-researched approach for your specific domain. The same general ideas apply to keeping your cores busy, but you're often more bounded by communication between threads, memory bandwidth, etc.

[1] https://en.wikipedia.org/wiki/POSIX_Threads

[2] https://www.boost.org/doc/libs/1_70_0/libs/fiber/doc/html/fi...

[3] http://www.sosp.org/2001/papers/welsh.pdf

[4] https://en.wikipedia.org/wiki/Embarrassingly_parallel
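For comparison, the "one worker per core" shape described above looks roughly like this in Go. It is a sketch with invented names, and the doubling is a stand-in for real packet handling. Note that goroutines are placed on cores by the Go scheduler, so this does not pin a worker to a core the way a native thread with explicit affinity could.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// processAll fans packets out to one worker goroutine per CPU and
// returns the sum of the handled values.
func processAll(packets []int) int {
	workers := runtime.NumCPU()
	jobs := make(chan int, len(packets))
	partial := make([]int, workers) // one slot per worker, never shared
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for p := range jobs {
				partial[id] += p * 2 // stand-in for real packet handling
			}
		}(w)
	}
	for _, p := range packets {
		jobs <- p
	}
	close(jobs)
	wg.Wait()

	total := 0
	for _, s := range partial {
		total += s
	}
	return total
}

func main() {
	fmt.Println(processAll([]int{1, 2, 3, 4}))
}
```

The single jobs channel is itself a shared resource, and the adjacent per-worker slots may share cache lines, so a serious version would shard the input and pad the slots.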

Where's the evidence that Go can't handle large amounts of concurrency?

All over my profiler. Go can deal with something embarrassingly parallel, same as any other language, but getting it to scale up to high core counts with small RPCs is impossible, for the reasons the OP mentioned: you can't avoid its global allocator and you don't control the scheduler either. You will spend all of your CPU time in runtime.findrunnable and runtime.mallocgc.

> You will spend all of your CPU time in runtime.findrunnable

Profile your channel latency, then design around that. "Small RPCs?" With an 88 core server, should you be making that many remote procedure calls? Find the 20% most intensive code and take that out of the hands of the scheduler, as much as possible.

> and runtime.mallocgc

Even with sync.Pool? Can you design those parts of the system to mostly use the stack?

Not for nothing but channels become a bottleneck pretty early when you start doing highly concurrent golang.

You can very carefully design around it using only channels but it quickly starts making more sense to just use a different concurrency abstraction.

My experience has been channels are a really good way to make highly concurrent code readable for people who hate concurrency (ymmv).

They aren't as efficient, but using them as a semaphore (i.e. only signaling) can facilitate fairly-efficient, easier-to-read shared memory. But I'm kind of proving the whole...

> You can very carefully design around it using only channels

...part of your post.
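A sketch of the signaling-only use described above: a buffered channel of empty structs acts as a semaphore, and the data itself stays in ordinary shared memory. The `sem` type and `appendAll` are hypothetical names for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// sem is a counting semaphore: the buffer size is the number of permits.
type sem chan struct{}

func (s sem) acquire() { s <- struct{}{} }
func (s sem) release() { <-s }

// appendAll lets n goroutines append to one shared slice, using a
// one-permit semaphore so that only one of them touches it at a time.
func appendAll(n int) []int {
	guard := make(sem, 1) // one permit: behaves like a lock
	var out []int
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(v int) {
			defer wg.Done()
			guard.acquire()
			out = append(out, v) // shared memory, touched only while holding the permit
			guard.release()
		}(i)
	}
	wg.Wait()
	return out
}

func main() {
	fmt.Println(len(appendAll(100)))
}
```

With a larger buffer the same type limits how many goroutines run a section at once, which is where it differs from a plain mutex.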

I guarantee you the "little bit of planning" you're doing in C++ is far greater than the equivalent planning in Go to avoid the global allocator in your goroutines (i.e., allocate on the stack, use object pools, etc). I'm sure you still won't get to the same performance ceiling that C++ allows, but it doesn't mean "Go is bad at concurrency".

The philosophy of Go, as far as I can tell, is opinionated simplicity. In this way, it is fantastic at concurrency.

It gives you a simplified fork in goroutines, a way to safely pass by value or signal between these through channels, and structures like select to pseudo-randomly deal with race conditions.

That stuff is all awesome!
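For readers who haven't seen those three pieces together, here is a minimal sketch. `pick` is a made-up helper; when both channels are ready at once, select chooses between them pseudo-randomly, so the output of `main` can differ from run to run.

```go
package main

import (
	"fmt"
	"time"
)

// pick returns the first value to arrive on a or b, tagged with its
// source, or "timeout" if neither channel delivers in time.
func pick(a, b <-chan string, timeout time.Duration) string {
	select {
	case v := <-a:
		return "a:" + v
	case v := <-b:
		return "b:" + v
	case <-time.After(timeout):
		return "timeout"
	}
}

func main() {
	a := make(chan string, 1)
	b := make(chan string, 1)
	go func() { a <- "hello" }() // goroutines as the lightweight "fork"
	go func() { b <- "world" }()
	fmt.Println(pick(a, b, time.Second))
}
```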

But when you are talking about performance, opinionated simplicity is a bad philosophy. You are going to want the finer grained control the poster is asking for, and which Go deliberately doesn't allow for. C, C++, and Rust will always have a speed advantage because they lack the overhead of the GC and were designed with allowing the developer minute control of the system, which is the space to make such optimizations.

That said, opinionated simplicity means every Go codebase I drop into looks roughly the same. That is not true in the above languages, precisely because they allow for more control.

To me, Go's ideal use case is the domains of Java, Python, NodeJS, and even Elixir -- where it has a performance edge in many cases and where its consistent style really shines.

All that to say, Go is bad at performance optimized concurrency, as designed.

Your post doesn’t address mine. I gave specific advice for addressing the GP’s issue. It’s much easier to move your allocs out of the hot path in Go than it is to do virtually anything in C++. I’m not claiming Go will compete with C++ on performance, only that his argument isn’t a good reason to write off Go. (I’ve written a lot of both languages).

Is there a GitHub issue for this problem? If not, would you consider filing one?


It should probably explain why one cannot "avoid its global allocator" by using pools and/or stack objects.

Stack is great, then you will spend more time in runtime.morestack because the Go authors have wisely decided that we are all too dumb to request specific stack sizes and we must rely on copying and doubling stacks larger than 2KiB even if we know in advance that 2KiB is not enough. This compares poorly with native threads where the stacks are dynamically allocated one page at a time without copying and we can specify the stack size _limit_ per thread if we need to. This is another place where the Go authors just arbitrarily decided that users cannot be trusted.

sync.Pool is fun but it has known scalability limits that are yet to be addressed in a released runtime (see https://go-review.googlesource.com/c/go/+/166960/ for example). You also can't use sync.Pool for anything that you need to be non-ephemeral, like CPU-local counters, because the GC can just blow them away at any time and the finalizers run at arbitrary future times or possibly never.

The runtime.morestack issue is actually being tracked btw [0]. I'm also getting the impression that you want Go to be another C++, but I don't understand why.

All of your listed issues have additional details and discussions around them. It's not "we don't add this stuff because you are too dumb to use it". It's more "we don't know how to add this currently so that it would properly interact with other features and be non-confusing". There are some exceptions to this rule (the time.Time issue comes to mind) but overall they tend to listen to the developer base. Also, keep in mind that there are relatively few core Go developers, so they have to be extremely careful about what they add to the language. They are the ones who will have to support it for years to come.

As for C++ - yes, nothing beats it in highly optimized cases. And there is nothing wrong with that. The problem is not only that writing C++ for those highly optimized cases is an extremely hard task by itself. It is also finding people who can actually do it and produce code that will be supportable in the future (and I'm not even talking about UB, races or memory leaks).

[0]: https://github.com/golang/go/issues/18138

Edit: wording

My point is that you should file a Github issue if there isn't one.

Why "should"? Nobody owes the writing of issues.

If you depend on an open source project, and can articulate a problem with it that can be fixed, then I do feel you should file an issue to contribute your insight.

Up to you. If absolute performance is your reason, then you'll still end up writing your own pool like games seem to do. This is mainly to avoid the garbage collector. I'd try without first to see if that's really necessary.

Try using sync.Pool - it does the same thing: https://golang.org/pkg/sync/
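For anyone who hasn't used it, here is a minimal sketch of the sync.Pool pattern being suggested. The names bufPool and render are made up for illustration, and as noted elsewhere in the thread, the pool may drop its contents when the GC runs, so it only suits short-lived scratch objects. Profile first before reaching for it.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool (hypothetical name) hands out reusable *bytes.Buffer values.
// New is only called when the pool has nothing to hand back.
var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

// render (hypothetical helper) builds a string using a pooled buffer
// instead of allocating a fresh buffer on every call.
func render(name string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset() // a pooled buffer may still contain data from a previous use
	defer bufPool.Put(buf)

	buf.WriteString("hello, ")
	buf.WriteString(name)
	return buf.String() // String copies the bytes, so returning it is safe
}

func main() {
	fmt.Println(render("gopher"))
}
```

Whether this actually beats plain allocation depends on the workload, so it is worth measuring with and without the pool.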

The "How to Optimize" section is worth reading even if you don't write any Go.

This looks very useful, I'd definitely buy a book on the subject. Does anyone know about more Go-specific performance book/articles out there? Please share!

any thoughts on whether to use values or pointers to large structs?

in practice I haven’t seen any major performance overhead when passing values about.

You should use pointers if you need to mutate the content of the struct, and values otherwise.

Almost all types that could hold a lot of data have reference semantics, so if you have a struct with a large slice, or a large map, all of the actual data is going to be on the heap, and not affect copy performance.

The only time you might want to consider using a pointer to improve copy performance is if you have lots of non-heap-allocated data in your struct, and in practice the only way that happens is via large arrays. That is, e.g. `[8192]string`, and not simply `[]string` (which is an efficient slice). But you should never do it preemptively, you should always benchmark both styles and only switch to pointers once you have proof it's meaningfully affecting performance.
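A rough sketch of what "benchmark both styles" could look like, assuming a made-up struct big with an inline `[8192]int64` array. The function names are hypothetical, and `//go:noinline` is only there so the compiler doesn't optimize the call (and the copy) away. It uses testing.Benchmark so it runs with plain `go run`; in a real project you would more likely write ordinary Benchmark functions in a _test.go file.

```go
package main

import (
	"fmt"
	"testing"
)

// big is a hypothetical struct whose data is stored inline in a large
// array, so copying the struct copies all 8192 elements.
type big struct {
	data [8192]int64
}

// firstByValue receives a full copy of the struct.
//
//go:noinline
func firstByValue(b big) int64 { return b.data[0] }

// firstByPointer receives only a pointer to the struct.
//
//go:noinline
func firstByPointer(b *big) int64 { return b.data[0] }

// sink keeps the results alive so the calls are not eliminated.
var sink int64

func main() {
	testing.Init() // register testing flags; harmless outside "go test"

	var b big
	b.data[0] = 42

	byValue := testing.Benchmark(func(tb *testing.B) {
		for i := 0; i < tb.N; i++ {
			sink = firstByValue(b)
		}
	})
	byPointer := testing.Benchmark(func(tb *testing.B) {
		for i := 0; i < tb.N; i++ {
			sink = firstByPointer(&b)
		}
	})

	fmt.Println("by value:  ", byValue)
	fmt.Println("by pointer:", byPointer)
}
```

The numbers will vary by machine and Go version, and a microbenchmark like this says little about a struct that holds only slices or maps, which is the more common case.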

There is no straight answer to this, so, it depends. I think it's really one of those problems that isn't important until you can actively measure a performance issue. Until then, focus on clarity and functionality and beware of premature optimization; it's much easier to measure and optimize code if the code and its intent are clear.

I don't know what's idiomatic in Go, but I would do whatever makes the code easiest to understand (if that could be an issue). Modern compilers are very good at optimising pass-by-value into a pointer, so write first then measure and then optimise where required.

I also wonder about this. I think some benchmarks are in order, with checks for the impact of escape analysis/copy to heap.

Passing pointers is typically a code smell in Go, given that "don't communicate by sharing memory, share memory by communicating" hinges on values being copied when they are passed into e.g. channels, so that you don't need mutexes everywhere, which would defeat the purpose of such a design.

Not that you should never have mutexes in idiomatic Go, but their use usually has to be warranted.
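To illustrate the copy-on-send point, here is a small sketch with a hypothetical job type and helper functions. Sending a struct value on a channel gives the receiver its own copy, so the worker can mutate it without a mutex and the sender's value is untouched.

```go
package main

import "fmt"

// job is a hypothetical work item passed by value.
type job struct {
	id    int
	count int
}

// worker receives its own copy of each job, mutates the copy, and sends
// the result back. No locking is needed because nothing is shared.
func worker(in <-chan job, out chan<- job) {
	for j := range in {
		j.count++
		out <- j
	}
	close(out)
}

// process sends one job through a worker and returns the result.
func process(original job) job {
	in := make(chan job)
	out := make(chan job)
	go worker(in, out)

	in <- original // the struct is copied into the channel
	close(in)
	return <-out
}

func main() {
	original := job{id: 1}
	result := process(original)
	fmt.Println(original.count, result.count) // prints "0 1"
}
```

If the channel carried *job instead, sender and worker would share the same struct and you would be back to needing some form of synchronization.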
