The article is from a database company, so I'll assume that approximates the scope. My scope for the GC discussion would include other parts that could be considered similar software: cluster-control plane (Kubernetes), other databases, and possibly the first level of API services to implement a service like an internal users/profiles or auth endpoints.
The tricky thing is that GC works most of the time, but if you are working at scale you really can't predict user behavior, so all of those GC-tuning parameters that were set six months ago no longer work properly. A good portion of production outages are likely related to cascading failures caused by overly long GC pauses, and a good portion of developer time is spent testing and tuning GC parameters. It is easier to remove, or simply not allow, GC languages at these levels in the first place.
On the other hand IMO GC-languages at the frontend level are OK since you'd just need to scale horizontally.
> A good portion of production outages are likely related to cascading failures caused by overly long GC pauses, and a good portion of developer time is spent testing and tuning GC parameters
After 14 years in JVM dev in areas where latency and reliability are business critical, I disagree.
Yes, excessive GC stop-the-world pauses can cause latency spikes, and excessive GC time is bad, and yes, when a new GC algorithm is released that you think might offer improvements, you test it thoroughly to determine whether it's better or worse for your workload.
But a "good portion" of outages and developer time?
Nope. Most outages occur for the same old boring reasons - someone smashed the DB with an update that hits a pathological case and deadlocks processes using the same table, a DC caught fire, someone committed code with a very bad logical bug, someone considered a guru heard that gRPC was cool, used it without adequate code review, and didn't understand that gRPC's load balancing defaults to pick-first, etc. etc.
The outages caused by GC were very very few.
Outages caused by screw-ups or a lack of understanding of the subtleties of a piece of tech are as common here as they are in every other field of development.
Then there's the question of what outages GCed languages _don't_ suffer.
I've never had to debug corrupted memory, or how a use after free bug let people exfiltrate data.
You're lucky! When OpenJDK was still closed-source Hotspot from Sun, we chased bugs that Sun confirmed were defects in how Hotspot handled memory (and this was on an ECC'd system, of course), although these days I can't recall anything remotely related.
> or how a use after free bug let people exfiltrate data.
Yeah, have only ever hit one or two JVM bugs in very rare circumstances - which we usually fixed by upgrading.
> Technically you're just outsourcing it :)
Haha, very true. Luckily, to developers who are far better at that stuff than the average bear.
The recent log4j rigmarole is a great example of what I was describing in JVM dev though - no complicated memory issues involved, definitely not GC related, just developers making decisions using technologies that had very subtle footguns they didn't understand (the capacity to load arbitrary code via LDAP was, AFAIK, very poorly known, if not forgotten, until Log4Shell).
> You're lucky! When OpenJDK was still closed-source Hotspot from Sun, we chased bugs that Sun confirmed were defects in how Hotspot handled memory (and this was on an ECC'd system, of course), although these days I can't recall anything remotely related.
I mean sure. I remember having similar issues with early (< 2.3) Python builds as well. But in the last decade of my career, only a handful of outages were caused by Java GC issues. Most of them happened for a myriad of other architectural reasons.
> After 14 years in JVM dev in areas where latency and reliability are business critical
What sort of industry/use cases are we talking about here? There is business critical and there is mission critical, and if your experience is in network applications, as your next paragraph seems to imply, then no offence, but you have never worked with critical systems where a nondeterministic GC pause can send billions worth of metal into the sun or kill people.
The parent of this thread is about systems languages and how GC languages are rarely the right tool.
Then there's your comment about having experience that the GC doesn't really matter in critical environments, which I personally find not to be true at all, but I am interested in which domain your comment is based on.
Go doesn't offer a bunch of GC tuning parameters, really only one, so your concerns about complex GC tuning here seem targeted at some other language like Java.
This is a drawback in some cases, since one size never truly fits all, but it dramatically simplifies things for most applications, and the Go GC has been tuned for many years to work well in most places where Go is commonly used. The developers of Go continue to fix shortcomings that are identified.
Go’s GC prioritizes very short STWs and predictable latency, instead of total GC throughput, and Go makes GC throughput more manageable by stack allocating as much as it can to reduce GC pressure.
Generally speaking, Go is also known for using very little memory compared to Java.
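For reference, a minimal sketch of that one knob, assuming a reasonably recent Go toolchain: GOGC controls how much the heap may grow between collections, and it can be set via the environment (`GOGC=200`) or at runtime.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// GOGC=100 is the default: trigger a collection roughly when the live
	// heap has doubled since the last one. Raising it trades memory for
	// less GC CPU time; lowering it does the opposite.
	old := debug.SetGCPercent(200)
	fmt.Println("previous GOGC:", old)

	// Newer Go releases also expose a soft memory limit (GOMEMLIMIT /
	// debug.SetMemoryLimit), but GOGC remains the primary dial.
	debug.SetGCPercent(old) // restore the previous setting
}
```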
Java _needs_ lots of GC tuning parameters because you have practically no way of tuning the way your memory is used and organized in Java code. In Go you can actually do that. You can decide how data structures are nested, you can take pointers to the inside of a block of memory. You could make e.g. a secondary allocator, allocating objects from a contiguous block of memory.
Java doesn't allow those things, and thus it must instead give you lots of levers to pull on to tune the GC.
It is just a different strategy for achieving the same thing.
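A rough sketch of what that kind of control can look like in Go; the `Arena` type here is illustrative, not a standard library API:

```go
package main

import "fmt"

// Nested by value: Inner lives inline inside Outer's memory rather than
// behind a pointer, so one Outer is a single contiguous object.
type Inner struct{ A, B int64 }

type Outer struct {
	ID    int64
	Inner Inner
}

// A toy secondary allocator: hand out objects carved from one contiguous
// backing slice. Everything allocated here becomes collectible together
// once the arena itself is unreachable.
type Arena struct {
	buf  []Inner
	next int
}

func NewArena(capacity int) *Arena {
	return &Arena{buf: make([]Inner, capacity)}
}

// New returns an interior pointer into the backing array; the GC keeps the
// whole block alive as long as any such pointer exists. It panics with an
// index-out-of-range error if the arena is exhausted.
func (a *Arena) New() *Inner {
	p := &a.buf[a.next]
	a.next++
	return p
}

func main() {
	o := Outer{ID: 1, Inner: Inner{A: 2, B: 3}}
	fmt.Println(o)

	arena := NewArena(1024)
	x := arena.New()
	x.A = 42
	fmt.Println(*x)
}
```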
Counter-example: The Go GC is tuned for HTTP servers at latency sensitive companies like Google. It therefore prioritizes latency over throughput to an astonishing degree, which means it is extremely bad at batch jobs - like compilers.
What language is the Go compiler written in? Go.
This isn't fixable by simply writing the code differently. What you're talking about is in the limit equivalent to not using a GCd language at all, and you can do that with Java too via the Unsafe allocators. But it's not a great idea to do that too much, because then you may as well just bite the bullet and write C++.
Java doesn't actually need lots of GC tuning parameters. Actually most of the time you can ignore them, because the defaults balance latency and throughput for something reasonable for the vast majority of companies that aren't selling ad clicks. But, if you want, you can tell the JVM more about your app to get better results like whether it's latency or throughput sensitive. The parameters are there mostly to help people with unusual or obscure workloads where Go simply gives up and says "if you have this problem, Go is not for you".
> it is extremely bad at batch jobs - like compilers.
> What language is the Go compiler written in? Go.
I do not see what you are trying to say.
The Go compiler is plenty fast in my experience, especially compared to say `javac`. The startup time of `javac` (and most java programs) is atrocious.
You can't compare throughput of two totally different programs to decide whether an individual subsystem is faster. The Go compiler is "fast" because it doesn't do much optimization, and because the language is designed to be simple to compile at the cost of worse ergonomics. Why can't it do much optimization? Partly due to their marketing pitch around fast compilers and partly because Go code runs slowly, because it's using a GC designed to minimize pause times at the cost of having lots of them.
The algorithmic tradeoffs here are really well known and there are throughput comparisons of similar programs written in similar styles that show Go falling well behind. As it should, given the choices made in its implementation, in particular the "only one GC knob" choice.
Yes, my comments were targeted at Java and Scala. Java has paid the bills for me for many years. I'd use Java for just about anything except for high-load infrastructure systems. And if you're in, or want to be in, that situation, then why risk finding out two years later that a GC-enabled app is suboptimal?
I'd guess you'd have no choice if, in order to hire developers, you had to choose a language that people found fun to use.
Is Go's GC not copying/generational? I think "stack allocation" doesn't really make sense in a generational GC, as everything sort of gets stack allocated. Of course, compile-time lifetime hints might still be useful somehow.
I can see the difference being that you have to scan a generation but the entire stack can be freed at once, but it still seems like an overly specific term. The general term elsewhere for multiple allocations you can free at the same time is "arenas".
From a conceptual point of view, I agree, but... in practice, stacks are incredibly cheap.
The entire set of local variables can be allocated with a single bump of the stack pointer upon entry into a function, and they can all be freed with another bump of the stack pointer upon exit. With heap allocations, even with the simplest bump allocator, you still have to allocate once per object, which can easily be an order of magnitude more work than what you have to do with an equivalent number of stack allocated objects. Your program doesn't magically know where those objects are, so it still also has to pay to stack allocate the pointers to keep track of the heap objects. Then you have the additional pointer-chasing slowing things down and the decreased effectiveness of the local CPU caches due to the additional level of indirection. A contiguous stack frame is a lot more likely to stay entirely within the CPU cache than a bunch of values scattered around the heap.
Beyond that, and beyond the additional scanning you already mentioned, in the real world the heap is shared between all threads, which means there will be some amount of contention whenever you have to interact with the heap, although this is amortized some with TLABs (thread-local allocation buffers). You also have to consider the pointer rewriting that a generational GC will have to perform for the survivors of each generation, and that will not be tied strictly to function frames. The GC will run whenever it feels like it, so you may pay the cost of pointer rewriting for objects that are only used until the end of this function, just due to the coincidence of when the GC started working. I think (but could be wrong/outdated) that generational GCs almost all require both read and write barriers that perform some synchronization any time you interact with a heap allocated value, and this slows the program down even more compared to stack objects. (I believe that non-copying GCs don't need as many barriers, and that the barriers are only active during certain GC phases, which is beneficial to Go programs, but again, stack objects don't need any barriers at all, ever.)
GCs are really cool, but stack allocated values are always better when they can be used. There's a reason that C# makes developer-defined value types available; they learned from the easily visible problems that Java has wrestled with caused by not allowing developers to define their own value types. Go took it a step further and got rid of developer-defined reference types altogether, so everything is a value type (arguably with the exception of the syntax sugar for the built-in map, slice, channel, and function pointer types), and even values allocated behind a pointer will still be stack allocated if escape analysis proves it won't cause problems.
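A small sketch of that in Go; the types are made up, but `go build -gcflags=-m` is the standard way to see the compiler's escape-analysis decisions:

```go
package main

import "fmt"

type vec struct{ x, y, z float64 }

// sum never lets v escape, so the compiler is free to keep it on the stack
// (or entirely in registers).
func sum(a, b vec) vec {
	v := vec{a.x + b.x, a.y + b.y, a.z + b.z}
	return v // returned by value, copied into the caller's frame
}

var sink *vec // package-level variable used to force an escape

// leak stores a pointer to v somewhere that outlives the call, so escape
// analysis moves v to the heap ("&v escapes to heap" in the -m output).
func leak() {
	v := vec{1, 2, 3}
	sink = &v
}

func main() {
	fmt.Println(sum(vec{1, 2, 3}, vec{4, 5, 6}))
	leak()
}
```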
You can’t do it that way because each object has to have its own lifetime.
If you allocate them all as a single allocation, then the entire allocation would be required to live as long as the longest lived object. This would be horribly inefficient because you couldn’t collect garbage effectively at all. Memory usage would grow by a lot, as all local variables are continuously leaked for arbitrarily long periods of time whenever you return a single one, or store a single one in an array, or anything that could extend the life of any local variable beyond the current function.
If you return a stack variable, it gets copied into the stack frame of the caller, which is what allows the stack frame to be deallocated as a whole. That’s not how heap allocations work, and adding a ton of complexity to heap allocations to avoid using the stack just seems like an idea fraught with problems.
If you know at compile time that they all should be deallocated at the end of the function… the compiler should just use the stack. That’s what it is for. (The one exception is objects that are too large to comfortably fit on the stack without causing a stack overflow.)
I edited my comment before you posted yours. Making heap allocations super complicated just to avoid the stack is a confusing idea.
Generational GCs surely do not do a single bump allocate for all local variables. How could the GC possibly know where each object starts and ends if it did it as a single allocation? Instead, it treats them all as individual allocations within an allocation buffer, which means bumping for each one separately. Yes, they will then get copied out if they survive long enough, but that’s not the same thing as avoiding the 10+ instructions per allocation.
It’s entirely possible I’m wrong when it comes to Truffle, but at a minimum it seems like you would need arenas for each size class, and then you’d have to bump each arena by the number of local variables of that size class. The stack can do better than that.
In your example, the objects are all the same size. That would certainly be easy.
If you have three local objects that are 8 bytes, 16 bytes, and 32 bytes… if you do a single 48 byte allocation on the TLAB, how can the GC possibly know that there are three distinct objects, when it comes time to collect the garbage? I can think of a few ways to kind of make it work in a single buffer, but they would all require more than the 48 bytes that the objects themselves need. Separate TLAB arenas per size class seem like the best approach, but it would still require three allocations because each object is a different size.
I understand you’re some researcher related to Truffle… this is just the first I’m hearing of multiple object allocation being done in a single block with GC expected to do something useful.
> If you have three local objects that are 8 bytes, 16 bytes, and 32 bytes… if you do a single 48 byte allocation on the TLAB, how can the GC possibly know that there are three distinct objects, when it comes time to collect the garbage?
Because the objects are self-describing - they have a class which tells you their size.
Ok, so it's not as simple as bumping the TLAB pointer by 48. Which was my point. You see how that's multiple times as expensive as stack allocating that many variables? Even something as simple as assigning the class to each object still costs something per object. The stack doesn't need self-describing values because the compiler knows ahead of time exactly what every chunk means. Then the garbage collector has to scan each object's self-description… which is way more expensive than stack deallocation, by definition.
You’re extremely knowledgeable on all this, so I’m sure that nothing I’m saying is surprising to you. I don’t understand why you seem to be arguing that heap allocating everything is a good thing. It is certainly more expensive than stack allocation, even if it is impressively optimized. Heap allocating as little as necessary is still beneficial.
Go does not put class pointers on stack variables. Neither does Rust or C. The objects are the objects on the stack. No additional metadata is needed.
The only time Go has anything like a class pointer for any object on the stack is in the case of something cast to an interface, because interface objects carry metadata around with them.
These days, Go doesn’t even stack allocate all non-escaping local variables… sometimes they will exist only within registers! Even better than the stack.
> These days, Go doesn’t even stack allocate all non-escaping local variables… sometimes they will exist only within registers!
Did you read the article I linked? That's what it says - and this isn't stack allocation, it's SRA. Even 'registers' is overly constrained - they're dataflow edges.
Go isn't just doing SRA, as far as I understood it from your article, though it is certainly doing that too. Go will happily allocate objects on the stack with their full in-memory representation, which does not include any kind of class pointer.
As can be seen in the disassembly, the object created in "Process" does not leave the stack until it is copied to the heap in "ProcessOuter" because ProcessOuter is sending the value to a global variable. The on-stack representation is the full representation of that object, as you can also see by the disassembly in ProcessOuter simply copying that directly to the heap. (The way the escape analysis plays into it, the copying to the heap happens on the first line of ProcessOuter, which can be confusing, but it is only being done there because the value is known to escape to the heap later in the function on the second line. It would happily remain on the stack indefinitely if not for that.)
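The original code isn't reproduced here, but a hypothetical pair along these lines shows the behavior being described (the names and fields are made up):

```go
package main

type payload struct{ a, b, c int64 }

var global *payload // assigning into this forces an escape to the heap

//go:noinline
func Process() payload {
	p := payload{a: 1, b: 2, c: 3} // built in Process's stack frame
	p.a += p.b
	return p // returned by value into the caller's frame
}

//go:noinline
func ProcessOuter() {
	// Because the next line makes p escape, the compiler heap-allocates p
	// here and copies the returned value into it.
	p := Process()
	global = &p
}

func main() { ProcessOuter() }
```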
It's cool that Graal does SRA, but Go will actually let you do a lot of work entirely on the stack (SRA'd into registers when possible), even crossing function boundaries. In your SRA blog post example, when the Vector is returned from the function, it has to be heap allocated into a "real" object at that point. Go doesn't have to do that heap allocation, and won’t have to collect that garbage later.
Most of the time, objects are much smaller than this contrived example, so they will often SRA into registers and avoid the stack entirely… and this applies across function boundaries too, from what I’ve seen, but I haven’t put as much effort into verifying this.
I was addressing your statement that seemed to say class metadata is included in every language when dealing with stack variables. I trusted your earlier statements that this was definitely the case with Java/Truffle. Misunderstandings on my part are entirely possible.
Sorry I haven’t had time to read your article. It’s on my todo list for later.
> at a minimum it seems like you would need arenas for each size class
But TLABs are heterogeneous by size. Objects of all sizes go in one linear allocation space. So allocating two objects next to each other is the same as allocating them both at the same time.
No, outside special code it is impossible to know at compile time how many heap allocations a function will have.
The stack has the requirement that its size must be known at compile time for each function. In oversimplified terms, its size is going to be the sum of the sizes of all the syntactically local variables.
So for example you cannot grow the stack with a long loop, because the same variable is reused over and over in all the iterations.
You can instead grow the heap as much as you want with a simple `while(true){malloc(...)}`.
> The stack has the requirement that its size must be known at compile time for each function.
Not really. You can bump your stack pointer at any time. Even C has alloca and VLAs. In lots of languages it's dangerous and not done because you can stack overflow with horrible results, and some (but not all) performance is lost because you need to offset some loads relative to another variable value, but you can do it.
What the stack really requires is that any values on it go full nasal demon after the function returns, so you'd better be absolutely certain the value can't escape - and detecting that is hard.
Not quite - the stack is constantly being reused and thus always hot in the cache. The young gen is never hot because it's much larger and the frontier is constantly moving forwards, instead of forwards and backwards.
> The tricky thing is that GC works most of the time, but if you are working at scale you really can't predict user behavior, so all of those GC-tuning parameters that were set six months ago no longer work properly. A good portion of production outages are likely related to cascading failures caused by overly long GC pauses, and a good portion of developer time is spent testing and tuning GC parameters. It is easier to remove, or simply not allow, GC languages at these levels in the first place.
Getting rid of the GC doesn't absolve you of the problem, it just means that rather than tuning GC parameters, you've encoded usage assumptions in thousands of places scattered throughout your code base.
> A good portion of production outages are likely related to cascading failures caused by overly long GC pauses, and a good portion of developer time is spent testing and tuning GC parameters.
Can’t really accept that without some kind of quantitative evidence.
No worries. It is not meant to be quantitative. For a few years of my career that has been my experience. For this type of software, if I'm making the decision on what technology to use, it won't be any GC-based language. I'd rather not rely on promises that GC works great, or is very tunable.
One could argue that I could just tune my services from time to time. But I'd just reduce the surface area for problems by not relying upon it at all -- both a technical and a business decision.
If you're needing to fight the GC to prevent crashes or whatever then you have a system design issue not a tooling/language/ecosystem issue. There are exceptions to this but they're rare and not worth mentioning in a broad discussion like this.
Sadly very few people take interest in learning how to design systems properly.
Instead they find comfort in tools that allow them to over-engineer the problems away. Like falling into zealotry on things like FP, zero-overhead abstractions, "design patterns", containerization, manual memory management, etc, etc. These are all nice things when properly applied in context but they're not a substitute for making good system design decisions.
Good system design starts with understanding what computers are good at and what they suck at. That's a lot more difficult than it sounds because today's abstractions try to hide what computers suck at.
Example: Computers suck at networking. We have _a lot_ of complex layers to help make it feel somewhat reliable. But as a fundamental concept, it sucks. The day you network two computers together is the day you've opened yourself up to a world of hurt (think race conditions) - so, like, don't do it if you don't absolutely have to.
It's because system design is a lot less theoretically clean than something like FP, zero-cost abstractions, GC-less coding, containerization, etc, and forces programmers to confront essential complexity head-on. Lots of engineers think that theoretically complex/messy/hacky solutions are, by their nature, lesser solutions. Networking is actually a great example.
Real life networking is really complicated and there are tons of edge cases. Connections dropping due to dropped ACKs, repeated packets, misconfigured MTU limits causing dropped packets, latency on overloaded middleboxes resulting in jitter, NAT tables getting overloaded, the list goes on. However most programmers try to view all of these things with a "clean" abstraction and most TCP abstractions let you pretend like you just get an incoming stream of bytes. In web frameworks we abstract that even further and let the "web framework" handle the underlying complexities of HTTP.
Lots of programmers see a complicated system like a network and think that a system which has so many varied failure modes is in fact a badly designed system and are just looking for that one-true-abstraction to simplify the system. You see this a lot especially with strongly-typed FP people who view FP as the clean theoretical framework which captures any potential failure in a myriad of layered types. At the end of the day though systems like IP networks have an amount of essential complexity in them and shoving them into monad transformer soup just pushes the complexity elsewhere in the stack. The real world is messy, as much as programmers want to think it's not.
> The real world is messy, as much as programmers want to think it's not.
You hit the nail on the head with the whole comment and that line in particular.
I'll add that one of the most effective ways to deal with some of the messiness/complexity is simply to avoid it. Doing that is easier said than done these days because complexity is often introduced through a dependency. Or perhaps the benefits of adopting some popular architecture (eg: containerization) is hiding the complexity within.
> It's because system design is a lot less theoretically clean
Yea this is a major problem. It's sort of a dark art.
> Computers suck at networking. We have _a lot_ of complex layers to help make it feel somewhat reliable.
I've got bad news pal: your SSD has a triple-core ARM processor and is connected to the CPU through a bus, which is basically a network, complete with error correction and the exact same failure modes as your connection to the New York Stock Exchange. Even the connection between your CPU and its memory can produce errors; it's turtles all the way down.
Computer systems are imperfect. No one is claiming otherwise. What matters more is the probability of failure, rates of failure in the real world, P95 latencies, how complex it is to mitigate common real world failures, etc, etc, etc.
"Turtles all the way down" is an appeal to purity. It's exactly the kind of fallacious thinking that leads to bad system design.
The difference with distributed (networked) systems is that they are expected to keep working even in the presence of partial (maybe Byzantine) failures.
The communication between actors itself is not the problem; unreliable communication between unreliable actors is.
If any of my CPU, RAM, or motherboard has a significant failure, my laptop is just dead; they all can assume that all the others mostly work and simply fail if they don't.
>Computers suck at networking. ... The day you network two computers together is the day you've opened yourself up to a world of hurt.
This is actually a pretty insightful comment, and something I haven't thought about in a number of years, since networking disparate machines together to create a system is now so second nature in modern software that we don't think twice about the massive amount of complexity we've suddenly introduced.
Maybe the mainframe concept wasn't such a bad idea, where you just build a massive box that runs everything together so you never get HTTP timeouts or failed connections to your DB, since they're always on.
> I'd rather not rely on promises that GC works great, or is very tunable.
I'm always puzzled by statements like these. What else do you want to rely on? The best answer I can think of is "The promise that my own code will work better", but even then: I don't trust my own code, my past self has let me down too many times. The promise that code from my colleagues will do better than GC? God forbid.
It's not like not having a GC means that you're reducing the surface area. You're not. What you're doing is taking on the responsibility of the GC and gambling on the fact that you'll do the things it does better.
The only thing that I can think of that manually memory managed languages offer vs GC languages is the fact that you can "fix locally". But then again, you're fixing problems created by yourself or your colleagues.
It's impossible to spend any time tuning Go's GC parameters as they intentionally do not provide any.
Go's GC is optimized for latency, it doesn't see the same kind of 1% peak latency issues you get in languages with a long tail of high latency pauses.
Also consider API design - Java APIs (both in standard & third-party libs) tend to be on the verbose side and build complex structures out of many nested objects. Most Go applications will have less nesting depth, so it's inherently an easier GC problem.
System designs that rely on allocating a huge amount of memory to a single process exist in a weird space - big enough that perf is really important, but small enough that single-process is still a viable design. Building massive monoliths that allocate hundreds of GB at peak load just doesn't seem "in vogue" anymore.
If you are building a distributed system, keeping any individual process's peak allocation to a reasonable size is almost automatic.
You tune GC in Go by profiling allocations, CPU, and memory usage. Profiling shows you where the problems are, and Go has some surprisingly nice profiling tools built in.
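For example, the standard `net/http/pprof` package exposes those profiles over HTTP (a sketch; the port and setup here are just one common arrangement):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// In a real service you'd normally bind this to an internal-only address.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```

Then `go tool pprof http://localhost:6060/debug/pprof/heap` (or `/debug/pprof/profile` for CPU) gives you an interactive view of where allocations and CPU time are going.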
Unlike turning a knob, which has wide reaching and unpredictable effects that may cause problems to just move around from one part of your application to another, you can address the actual problems with near-surgical precision in Go. You can even add tests to the code to ensure that you're meeting the expected number of allocations along a certain code path if you need to guarantee against regressions... but the GC is so rarely the problem in Go compared to Java, it's just not something to worry about 99% of the time.
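A minimal sketch of such a test using the standard `testing` package; `formatKey` is a made-up stand-in for whatever hot path you want to guard:

```go
package mypkg

import "testing"

// formatKey is a placeholder for a hot-path function under test.
func formatKey(id int) string {
	buf := make([]byte, 0, 16)
	buf = append(buf, "key-"...)
	buf = append(buf, byte('0'+id%10))
	return string(buf)
}

func TestFormatKeyAllocations(t *testing.T) {
	// AllocsPerRun reports the average number of heap allocations per call.
	allocs := testing.AllocsPerRun(1000, func() {
		_ = formatKey(7)
	})
	if allocs > 2 {
		t.Fatalf("formatKey allocates %.1f times per call, want at most 2", allocs)
	}
}
```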
If knobs had a "fix the problem" setting, they would already be set to that value. Instead, every value is a trade off, and since you have hundreds of knobs, you're playing an impossible optimization game with hundreds of parameters to try to find the set of parameter values that make your entire application perform the way you want it to. You might as well have a meta-tuner that just randomly turns the knobs to collect data on all the possible combinations of settings... and just hope that your next code change doesn't throw all that hard work out the window. Go gives you the tools to tune different parts of your code to behave in ways that are optimal for them.
It's worth pointing out that languages like Rust and C++ also require you to tune allocations and deallocations... this is not strictly a GC problem. In those languages, like in Go, you have to address the actual problems instead of spinning knobs and hoping the problem goes away.
The one time I have actually run up against Go's GC when writing code that was trying to push the absolute limits of what could be done on a fleet of rather resource constrained cloud instances, I wished I was writing Rust for this particular problem... I definitely wasn't wishing I could be spinning Java's GC knobs. But, I was still able to optimize things to work in Go the way I needed them to even in that case, even if the level of control isn't as granular as Rust would have provided.
I think I tinkered with the GC for less than a week in my eight years of experience, including some systems stuff - maybe this is true at FANG scale but not for me!
As many have replied, the available levers for 'GC-tuning' in Go are almost non-existent. However, what we do have influence over is "GC pressure", which is a very important metric we can move in the right direction if the application requires it.
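One common way of moving that metric, sketched with `sync.Pool` from the standard library (the `render` function is illustrative):

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool reuses buffers across calls so each request doesn't create fresh
// garbage for the collector to chase.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

func render(name string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	defer func() {
		buf.Reset() // leave it clean for the next caller
		bufPool.Put(buf)
	}()

	fmt.Fprintf(buf, "hello, %s", name)
	return buf.String()
}

func main() {
	fmt.Println(render("world"))
}
```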