The Go collector isn't generational or moving, and the write barrier AIUI is only used for getting it to run concurrently. The barrier records interesting writes when the collector is running, to avoid "losing" objects whose liveness changed while the collector was running.
Barriers are pretty cheap; [0] claims a 0.9% time overhead for a card-marking barrier common in generational collection and 1.6% for an object-logging barrier which is also useful for concurrent collection. Apparently they're cheaper nowadays, but the results aren't published yet [1]. That's not to say that the barriers are free, but it seems feasible that collector optimisations could still cover that ground.
It can help throughput, still, by running the collector concurrently. I've felt that while working on a new parallel (but not yet concurrent) collector for SBCL; the program parallelises well, but a serial collector hurts worse than it should by stopping the program.
> Barriers are pretty cheap; [0] claims a 0.9% time overhead for a card-marking barrier common in generational collection and 1.6% for an object-logging barrier which is also useful for concurrent collection. Apparently they're cheaper nowadays, but the results aren't published yet [1]. That's not to say that the barriers are free, but it seems feasible that collector optimisations could still cover that ground.
A quick look at the paper indicates that researchers were able to find cases in which the cost was as low as 0.9%. I don't think it's a good idea to operate as if that is the cost.
It's not a big deal. People give up more than 1% in suboptimal compiler flags alone. People give up far more than that by using unoptimized or lightly optimized code. People give up more than that by being too lazy to use profile guided optimization. Or by using the system malloc. To say that a 1% loss in performance disqualifies something as a systems language is absurd.
One important difference in all of those is that they are decisions made by the user of the programming language, not the author.
Sub-optimal performance decisions made by the user are their own business (and, perhaps, their customers'), and not really the language authors' concern. The converse is not true. Sub-optimal performance decisions made by the language authors end up affecting everybody.
I agree that saying a 1% difference would disqualify it as a systems language is hyperbolic. I disagree with the assertion that it's not a big deal. Having a core development team that considers 1% to be a big deal is a very desirable trait in a systems language. Achieving very high performance goals often comes through an accretion of many such 1% (and sub-1%) decisions.
The difference in performance between existing systems languages is greater than 1%. This is just not a big deal. People will not choose their systems language based on a 1% performance difference. It's so far down the list of considerations that it's a total non-issue.
He just did, it's called Rust. And if someone invented a language 5% faster than C, he'd still use C. If someone had invented a language faster than C when he started working on Linux, he would still have chosen C. He didn't pick it because it was the fastest (and it likely wasn't back then - C compilers have come a long way.)
Because practice is not the same as theory. It's both faster and slower than C, depending on the program, but overall slower. The answer to why is deeply complex and varies on a case-by-case basis. Rust is far more complex than C, so some language features compile down to a LOT more IR (intermediate representation) than C, and LLVM is left to try and make the best of it. Many problems in compilers are NP-complete, and the compiler will use heuristics and other techniques to get an approximately good enough answer. Like register allocation. The more complex the starting state, the less well that tends to work out. But there are other things too: Rust has bounds checks, for example, while C often does not. C has null-terminated strings, while Rust strings store a length. Rust's references are guaranteed not to alias, but the compiler doesn't yet (last I checked anyway) optimize based on that. It will one day though. This is just touching on an extremely nuanced and complex subject. But C is typically faster than Rust, and will likely stay that way.
This question is impossible to answer in a vacuum. What if the language made developers 90% more productive or 90% less likely to introduce a memory bug or offered 3x faster compile times? A 1% runtime performance hit would be well worth it if that were the case. Perhaps with the extra time available, developers could find better algorithms/ways of expressing their programs that would ultimately lead to faster runtime performance on average even with the 1% hit in certain microbenchmarks.
He certainly hasn't accepted D as a language for core Linux work, despite the benefits it offers without any of the supposed performance tradeoffs. And he uses GCC for compilation, despite the existence of specialized compilers that offer better runtime performance.
That is a common mistake: malloc()/free() aren't deterministic, especially in threaded code or on a heavily loaded system, and ISO C doesn't even provide guarantees of how deterministically they should return to the caller across C implementations.
At system scale, a 1% loss is a huge deal. Perf teams on operating systems do things for systemwide performance with small percentage wins all the time. Whether this is the right trade off for D, whether it’s feasible to flip a global switch for “slower code but optimal GC” or whether it’s an intractable problem I can’t say, but Walter is correct on this point.
> Perf teams on operating systems do things for systemwide performance with small percentage wins all the time.
And they evidently spend very little time on the things that would actually improve the lives of their users in macroscopic ways. The fact that our dominant model of an operating system is essentially c + shell is a tragedy. The operating system does almost nothing for us to help write fast, correct programs in a reasonable amount of time. It is of course for this reason that unix itself has barely improved in a material way in 30 years except in these kinds of microbenchmarks. Third party tooling and languages have of course improved, but largely to cover the gaps that the operating system has largely failed to address in a meaningful way.
> The operating system does almost nothing for us to help write fast, correct programs in a reasonable amount of time.
Um, when PCs switched from real mode to protected mode operating systems, that was an enormous boost to programming. It's hard to overstate it.
It is contextual. Things like system IPC where 1% would be huge are also typically not written in C, but in a mix of C and assembly.
But writing at that level of specificity for an entire system greatly increases creation and maintenance time costs. You may lose the opportunity to make future performance improvements as a result.
How much would Boeing pay to reduce the weight of an airplane by 1%? How much would Ferrari pay to make their F1 cars 1% faster? How much would car companies pay to be able to make 1% more fuel efficient cars? How much would your electric bill save if rates were reduced 1%?
1% is a lot of leverage. One reason I got paid well when I worked in industry is because I could write code that ran faster. Shaving 1% off execution speed is a big deal.
> How much would Boeing pay to reduce the weight of an airplane by 1%?
I don't know but this seems like a trivial decision given how many metrics they must track about the relationship between weight and capacity/fuel cost - either you pay less than you save, or you don't pay it.
> How much would Ferrari pay to make their F1 cars 1% faster?
Probably a lot, but this is a (somewhat literal) Red Queen's race.
> How much would car companies pay to be able to make 1% more fuel efficient cars?
Probably nothing, their precision is already several times that. Empirically, they also don't care much.
> How much would your electric bill save if rates were reduced 1%?
Less than 5 euros a year - though this year maybe more, could be as high as 10. So, negligible - not even worth the time to write this comment, probably.
Overall it seems like "1%" is often a fairly meaningless measure and I'm not sure what any of these examples are supposed to illustrate to me about compiler design or language choice since the context for each seems to matter more than anything else.
> Less than 5 euros a year - though this year maybe more, could be as high as 10. So, negligible - not even worth the time to write this comment, probably.
Then apply this 1%-saved logic to everything else: food, clothing, leisure, insurance, medical bills, sport.
Electricity is not 100% of the cost of "everything else". Nor are those part of my electricity bill.
If you want to make a point about where 1% efficiency matters, go ahead and make it! I might even agree! But it still won't have anything to do with systems programming languages.
whether you agree or not, whether you understand or not, whether i'm able to convey the point in proper english or not, it doesn't change the facts
your system being 1% faster means everyone who depends on your system will be 1% faster, and if you yourself apply that logic to your system, it's commutative, and the results end up being impressive
if you don't understand that, then there is no point arguing further
How? OP made a claim but OP hasn’t written a systems programming language as prevalent as D. It’s a baseless claim that’s either the result of inexperience or attempted flame bait. Walter is the only compiler expert on HN and our arguments should at least be thoughtful
Are you really saying that a 1% performance loss makes something "not a systems programming language"? Of all the weird definitions of "systems language" I've heard this has to be the weirdest.
1% compared to what? Hand-written assembly? C? C++?
Clang and GCC differ by way more than 1% in most benchmarks. Does that mean that one of them can't compile system languages or something?
I mean, I would agree that there definitely is a performance threshold below which you wouldn't consider a language a "systems language" - i.e. it would be silly to write an OS in the language. But 1% seems at least an order of magnitude off.
I see D as a non-compromising language (in terms of eating your cake and having it too) and that's why I like it. I think the authors don't want to limit its performance at the design level. For equivalent code, they don't want it to only ever be able to run 99% as fast as C, they want it to be able to run, at least theoretically, as fast as C or better. That's probably where the 'systems language' comparison came from. In practice though, I'm willing to bet DMD, GDC and LDC result in more than 1% difference in performance (but that's just speculation).
I would expect to be able to find another 1% slowdown somewhere in an implementation of another "system programming language" - would that also disqualify it from being such a language?
It's a competitive business. If a compiler generates 1% faster code, you need 1% fewer servers in your server farm, which translates to enormous amounts of money. Or if your hedge fund trading software runs 1% faster, you can get your trades in faster than the other guy, eating his lunch. Or would you like to reduce your costs of rented servers in the cloud by 1%?
> if your hedge fund trading software runs 1% faster
Jane Street has used OCaml for about 20 years.
Last year, OCaml merged an improvement to its GC that reduced latency by 75% (by one measure), and reduced execution time by 23% when compiling OCaml and around 6 to 7% when compiling Coq libraries.
Hedge fund traders are using D, for performance. I've even had a hedge fund trader come to me to teach them how to implement their own compiler, as they wanted to make it faster.
Remember the stories about traders trying to shave milliseconds off of their trading speed?
Milliseconds is at least 20 years out of date. Last I worked in the industry we measured our latency in micros, and that was 10 years ago. Even then, anyone that actually cared about latency never left the switch: all FPGA & ASIC.
I even met a hedge fund that ran their back test platform mostly on hardware because it allowed them to run more tests, which was a strategic advantage.
That said, 1% for free makes the ghost of Admiral Hopper happy and that’s reason enough for all of us.
Right, I don't mean to downplay the significance of 1%, but aren't there many ways to win or lose 1%? More specifically perhaps, are there so few GCed pointers that a faster collector cannot recuperate that 1%?
moonchild's suggestion to monomorphise against GC/explicitly managed pointers might push that figure even lower; evidently you know D programs better than I, but it doesn't seem unreasonable that the 1% can be won back by a faster GC.
A 1% slowdown is a ridiculously low number IMHO. I would argue that almost no program is within 1% of the optimal speed (however you define it). Therefore practically it does not matter. If you think it matters then the consequence would be that you write everything in assembler.
You won't notice it on your desktop. But you will notice it in cases I mentioned. Top shelf games also are not going to give up that 1%, or so the game devs tell me.
And 70 million people eat at Mickey D's every day; that doesn't make it good in any way. Whenever I play Unity games (rarely, to be honest, as I play maybe 5-20 hours a year nowadays, but this has been consistent) I brace myself for a laggy experience. Most people don't seem to care about this, but it really makes me want to close the game when there's a laggy scene (and it definitely puts the tech used on my blacklist).
A minor, not-well-thought-out use of shared pointers in C++ can easily add much more to execution time (and yes, I know they are not often used in game dev)
The vast majority of native games I've played don't have the kind of lag that Unity games have (and they seem to do much more). The only really bad one I can remember was Neverwinter Nights 2... that one was really unplayable even on god-tier computers
The only games where I noticed it to the point that it was actively bothering me and distracting me from the game were:
- Unity games
- NWN2 which I mentioned
- CrossCode which is an HTML5 game (while being an absolutely great game, performance wise this one was by far the worst I ever experienced, it's laggy on a fucking GTX1070 / 8750H / 32GB RAM system for a 2D pixel-art rpg and made me quit the game in rage more than once)
> And yet even the likes of Sony and Nintendo are working in major titles with Unity.
I don't think this is a reasonable remark.
The whole point is that the 1% penalty is added on top of whatever choices "the likes of Sony and Nintendo" make. If they have alternatives that ensure them a performance win just by choosing the right tool to work with Unity, they won't choose the wrong one.
It definitely is, because, despite being mostly C# outside the graphics engine itself, they have decided it is the right tool for some of their projects.
D had a chance in the games industry with Remedy and lost it; instead a supposedly lesser language is being adopted by major platform owners.
> It definitely is, because, despite being mostly C# outside the graphics engine itself, they have decided it is the right tool for some of their projects.
No, it was not a reasonable remark. Pointing to someone else's choice changes nothing. The point that you keep missing is that performance is a key decision factor, one among many, and a baseline performance penalty imposed by a particular choice of programming language is a factor that adds to all the other choices. If you degrade the performance of your offering without any relevant tradeoff, you're making it harder to justify its adoption.
Feel free to insist on your personal assertion. Those who care about performance feel strongly about gratuitously piling on performance penalties without any meaningful tradeoff to show for them.
A key factor, and yet they are using Unity, because even for Sony, Nintendo and Microsoft, that tradeoff is worthwhile from a business point of view.
Minecraft would never have happened if Notch had been busy discussing whether doing it in Java would be acceptable at all.
Just like too many on HN dream of being the next FANG, too many worry about ultimate performance when their games would hardly achieve a fraction of Minecraft's success when placed on the market.
As somebody said above, there is a difference between performance choices made by platform providers vs application writers (or game studios).
Getting acceptable performance in my game despite my focus on playability and time to market is one of the benefits of platform choice. I want to focus on my game, not overcoming platform limitations.
I am not saying a single 1% makes a difference but the idea that performance in the platform does not matter is wrong.
> the idea that performance in the platform does not matter is wrong.
But nobody said this. Maybe we're too deep in the thread for anyone to remember, so let's refresh our memories. The claim was not that performance doesn't matter, or that 1% never matters, but that any language 1% slower than D is currently could categorically not be considered a systems programming language.
That's even more absolutist and absurd than "performance does not matter".
It matters when it's significant, e.g. Python can indeed be an order of magnitude slower than a native, compiled language. But much more than 1% can easily accumulate from all the things that are forced on you by not writing assembly, so I don't think it is a fair cut-off point.
Especially since research OSs were written in managed languages, often with better performance!
Uh you know who’s so patiently answering your questions, right? My guess is that he’s been world famous at this since you were a twinkle in your daddy’s eye. He’s had a minute to think about it...
Blind appeals to authority are not appropriate here. It’s pretty clear that Walter has staked out an unreasonable position, regardless of how renowned (and rightly so!) he is for his work on D.
If my position is unreasonable, why hasn't the C Standard endorsed garbage collection? Or the C++ Standard? Rust's entire reason for existing is to find a way to make non-gc memory safe.
ISO C++11 introduced support for GC APIs, thus endorsing garbage collection in C++.
These APIs were removed in ISO C++23, only because the biggest C++ GC customers, like Epic and Microsoft, never made use of them and carried on using their own implementations.
Your position is unreasonable because I think you know better than to espouse it. You don't have to be an expert on everything, of course, but considering the knowledge you already have you should at least be able to not make the claims you're making in this thread, if not arrive at what I mention here.
Your specific claim at the top of this chain is in response to someone asking "why don't you do this thing [that has 1% performance impact]", to which you respond "I cannot bundle a 1% performance hit in my language, because this would not make it a systems language". Putting aside the usual debate of what a systems language even is, if we consider the ones that are typically not up for debate: C, C++, Rust, these differ in performance by far more than 1% on typical workloads. As some commenters have mentioned, compiler optimizations alone cause pessimizations of greater than 1%, so looking at a number like this as any qualifier of how feasible something is doesn't make sense.
Taking a step back, I feel like you are missing what people actually mean when they're talking about "1%". Like, yes, 1% of Facebook's server load is $$$. Making a dozen 1% improvements to SQLite is a good improvement. But, like, you're conflating this with what you're doing, and it's not at all related. There are companies using Python in production right now trying to save 1%! The reason why this is a "rational" decision is that performance work is not actually a function of how much percent you can shave off your workload, but how you can balance a couple of people trying to wring a couple percent off of your existing code. The unfortunate truth is pretty much all code, even the stuff running billions of CPU hours in datacenters, is leaving tens of percent on the table at the very least just by using a high level language, with poor data access patterns, etc. The reason this is OK is that rewriting all of the code into perfect assembly or whatever is not a feasible task. It would be super tedious, error prone, and require huge amounts of effort. So it becomes relevant for specific people to try shave off a few percent here and there around an inherently inefficient codebase, because that's where the balance lies. Compared to the effort it took to write or migrate the code, the 1% win is always going to be a small fraction of the engineering cost. Otherwise, you'd just replace the thing altogether.
So, circling back around to your point: a 1% loss isn't actually catastrophic. If you put 1% extra code in for no reason and it was easy to get rid of it with something a little smarter, people would rightfully be up in arms. But if you bring actual benefit that is very hard to get any other way, then it's usually going to be welcomed. I mean, people are putting in specialized hardware to slow down their general C++ application code "just" 5-10% to get a fraction of the benefits of actually having memory safety. There's a limit to how much people will tolerate, and it's definitely not like 30%, but if the only penalty was 1% at runtime and you could never have to worry about freeing memory again this would probably be a good tradeoff.
(To answer your bit about why the C standard doesn't have garbage collection: multiple reasons. One is that people who use C have very picky views of what garbage collection should look like on their platform. And the other bit is that garbage collection is typically not "just" 1% CPU overhead; it has interactions with things like latency and memory usage, which I think other people have actually pointed out about this 1% figure. The reason I specifically called you out was because you accepted this and decided to argue for that number, rather than saying "oh, well, there's actually more to it than just that 1%".)
> Blind appeals to authority are not appropriate here.
True! Duly upvoted. Although I feel his arguments are closely reasoned and I don't agree with your point. And the person I responded to brought nothing new to the table.
Where is it that Walter is obviously wrong, by the way? Not trying to be argumentative. I just did not see the holes in his argument that you do.
1% fewer servers in the farm means 1% less expense on the server farms. That is a lot only if the server farm is your main cost and you run on thin margins.
My guess is that it is not that big of a niche, although the players are probably big
If you look at the revenue cloud service providers get, and that revenue is based on CPU time used, 1% is a huge amount of money.
It's like weight on an airplane. Boeing told me back in 1980 that saving one pound was worth a quarter million dollars of investment. That'd be like a million bucks today.
There are massive python and bash codebases used at all of the major cloud service providers. And none of the cloud service providers are using custom-compiled PGO-optimized OSes specifically optimized for the exact architecture and instruction set and resource set that they will be deployed on, despite >1% performance advantages for doing so. Because sometimes the complexity of doing things the fastest way possible is not worth the performance improvement.
The boeing example is illuminating. One could conceive of infinite possibilities where it would be better to spend the extra million dollars than to get something that saved a pound but didn't meet other project requirements. What good would it be to buy a $1M cheaper 747 if the landing gear broke on each landing? Everything is a tradeoff, even things that are extremely important like weight on airplanes or runtime performance in computers.
I mean, 1% is fine if it comes with no other tradeoffs, but that's probably not the case. I would trade a 1% improvement in performance for a 1% improvement in developer productivity almost every single time (very rarely are businesses performance constrained such that they can afford to slow development for marginal performance deltas), and when you're operating near the margins of performance, it's usually more likely that you'll have to halve productivity to eke out a small amount of performance improvement.
I guess the problem is usually not about 1% throughput, but about where the 1% is and how it affects the latency. You can have the 1% overhead evenly distributed which is normally fine, but you can also have a latency spike for some operation (and I guess this is the case for barriers), which IMO really hurts in some cases.
Barriers are probably too small to cause latency spikes. The object logging barriers are a bit complicated, as they write an object into a thread-local log and infrequently have to get more memory for the log. But a card barrier is rather simple, and could look like
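(a sketch of the logic in D for illustration rather than the actual SBCL assembly; cardTable, cards and the constants are assumed names and values)

    // illustrative constants; the real values depend on heap size
    enum cardShift = 9;                    // each card covers 2^9 = 512 bytes of heap
    enum size_t cards = (1 << 16) - 1;     // <cards>: (power-of-two card count) - 1, fits an immediate
    __gshared ubyte* cardTable;            // base address of the card table

    // dirty the card covering the written slot, then perform the store
    void writeBarrier(void** slot, void* value)
    {
        cardTable[(cast(size_t) slot >> cardShift) & cards] = 1;
        *slot = value;
    }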
where <cards> is an immediate equal to one less than the (power-of-two) number of cards, each card covering 2^9 = 512 bytes of heap. This is roughly the write barrier in SBCL on x86-64; there isn't much of a reason to think it will cause a latency spike.
On the other hand, "Barriers revisited" does mention a pathology about how cards can introduce false sharing, but that's still more of a smear than a spike.
Register allocation is an NP-hard problem. It's not feasible to reach a single "good" decision, you'll always have to choose how much compiler time should be assigned to it, and what heuristics to use.
This isn’t a problem in practice - you don’t need a single best solution, because you can get a good-enough solution very easily - linear scan. Classic example of an over-theorised problem.
It does. D offers a mix of manual and GC allocation, and the people who want higher overall performance lean towards more manual. Any penalty that applies to pointer code then gets amortized against less use of the GC.
It's a different tradeoff if the language's primary allocation scheme is GC.
The obvious approach to that issue is to support garbage-collected objects as a special address space within the program. You could use normal read-only pointers to address these, but write pointers to them would be specialized and incur that memory barrier overhead.
Why does this have to be a runtime check? Perhaps the compiler could optimize pointer writes that can be proven not to affect garbage-collected objects, by dispensing with the memory barrier overhead for those.
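A minimal sketch of that idea, assuming hypothetical gcBase/gcEnd bounds for the GC address space (these are not an existing D runtime API): the barrier runs only when the written slot lies inside the GC range, and a compiler that can prove a write lands outside it could drop both the check and the barrier.

    __gshared void* gcBase, gcEnd;   // assumed bounds of the GC-managed address space

    // the "write pointer" path: record the write only when it targets GC memory
    void storePtr(void** slot, void* value)
    {
        auto p = cast(void*) slot;
        if (p >= gcBase && p < gcEnd)
            recordWrite(slot);       // whatever the collector needs: dirty a card, log the slot, ...
        *slot = value;
    }

    void recordWrite(void** slot) { /* collector-specific bookkeeping */ }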
I'd probably go even further and say that if there's any hidden check or restriction on reading or writing from a pointer, then it's not a systems language. I'm glad D does not have anything like that.
That seems like a pretty arbitrary definition of "systems language". Lots of hidden stuff happens in "systems languages"--compilers do unintuitive magic all the time, and the languages themselves have their own runtimes.
I've explained[0][1] in the past why this is nonsense, and seem to have eventually convinced[2] the one person who actually read the literature. Beyond that--crickets.
I've worked with languages with multiple fundamental pointer types (needed for 16 bit code). This kind of thing never worked well. For example, you'd need 4 versions of strcpy() to deal with 2 pointer types. What a mess DOS programming was with that.
Yes, in much the same fashion as with template functions: multiple definitions are generated, and callers are rewritten to target the appropriate specialisation.
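A rough sketch of how that could look in D, with hypothetical Gc/Raw wrapper types standing in for the two pointer flavours (these are not existing library types): one source definition, one instantiation per flavour, and each call site is bound to the matching specialisation at compile time.

    // hypothetical pointer flavours, for illustration only
    struct Gc(T)  { T* p; }   // traced by the collector
    struct Raw(T) { T* p; }   // untraced / manually managed

    // one definition; the compiler emits strLen!(Gc!char) and strLen!(Raw!char) as needed
    size_t strLen(P)(P s)
    {
        size_t n;
        while (s.p[n] != '\0')
            ++n;
        return n;
    }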
Code bloat comes with its own performance problems (code caching).
Besides, I seriously doubt that there's much gold in unambiguously recognizing stack pointers. I try to put as much as possible on the stack, but it just doesn't seem like much. The more stack allocation you manage to do, the more code bloat gets generated. Any malloc'd pointer will look just like a GC pointer, so you'll be paying the penalty for all of those.
How these tradeoffs play out will be difficult to determine in advance. If you've the confidence that it will work out favorably, by all means, give it a try.
Of course, one could add user annotations, but annotations are generally disliked.
> Any malloc'd pointer will look just like a GC pointer, so you'll be paying the penalty for all of those.
I don't see why. A small amount of manual tagging for extern(C) things—done in this case as part of the standard library, so opaque to user code—will go a long way in that regard. And in any custom allocator, automatic inference should work just fine, since the data returned will be clearly derived from the already-tagged malloc/mmap/VirtualAlloc/whatever (type info does still have to be associated manually, of course, as part of the GC.addRange dance, to avoid spurious pinning).
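(For context, the GC.addRange dance mentioned above looks roughly like this today; a usage sketch with an illustrative Node type, using the real core.memory API:)

    import core.memory : GC;
    import core.stdc.stdlib : malloc, free;

    struct Node { Node* next; int value; }

    void example()
    {
        // a manually managed block that may contain pointers into the GC heap
        auto nodes = cast(Node*) malloc(Node.sizeof * 16);
        GC.addRange(nodes, Node.sizeof * 16, typeid(Node)); // register it, with type info
        // ... use nodes ...
        GC.removeRange(nodes);                              // unregister before freeing
        free(nodes);
    }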
That would all come undone with separate compilation. You'd have to recompile the entire program every time.
You'd also have to keep around a massive amount of data at compile time. A variable that's a pointer to a struct can't just be represented as a pointer to a struct. It'd have to have a map of which pointers in the struct are managed, for every pointer to that struct.
The way the D compiler is implemented now, a struct S has only one instance, as every pointer to a struct's type points to that same instance. And still sometimes people compile programs so large the compiler runs out of memory.
If you were going to debate in a public forum, you should have anticipated some pedantic loser like myself coming along and pointing out your No True Scotsman fallacy.
> You were complaining about 1% performance deltas else-thread (...)
You should revisit that corner of the thread to refresh your memory. The discussion was about adding an extra 1% performance penalty for reasons, not the importance of a 1% delta either way.
You could always use huge pointers in the general case, and treat 16-bit pointers as an optimization only. This is probably how these platforms might be best supported by something like modern GCC/LLVM.
> And since there are all kinds of pointers in D, one no longer can use a moving GC allocator, because it cannot know exactly where 100% of the GC pointers are.
Go's GC is non-moving and it can't safely be made moving in the current state[0].
It also used to be partially conservative until 1.3[1][2][3]: the heap was precise, but the stack was conservative.
It's also worth mentioning that being conservative only on the stack, even when moving, is often good enough. SBCL still works that way, and the space and time overheads are pretty tiny [0].
AFAICT pinning is temporary? E.g. even Java's native interface allowed pinning objects for a limited amount of time, which had to be incorporated into the design of Java's long series of GCs.
Erg, but soon the Virgil GC will have to support pinning, because you can directly call the kernel to do I/O with on-heap byte arrays. Currently I get away with that because Virgil is single-threaded only, but that's gotta change for Wizard to compete with more advanced VMs.
With syscalls you can mmap the ethernet controller's DMA region and write directly to it; zero copy to the wire and back. In Virgil, there's literally no restriction on what Linux syscalls you can do, so why not that too. Though Range<byte> of off-heap memory (a new Virgil feature implemented with fat pointers) is probably better; you don't even have to leave the memory-safe core language to manipulate off-heap data.
> And since there are all kinds of pointers in D, one no longer can use a moving GC allocator, because it cannot know exactly where 100% of the GC pointers are.
I was astonished to learn that researchers found a way to implement a compacting malloc (!!!) using very clever virtual memory tricks, which they used to demonstrate memory usage improvements in a long-running Redis instance running their drop-in malloc replacement.
That's just trading a fragmented heap for fragmented page tables, though. Modern OS's even support "huge" pages to specifically avoid that page table overhead.
Still one of the best ideas in the field in recent years. I will note that it also works for non-moving collectors, and if they are precise, like Go's, they can also update pointers and eliminate the redundant page table entries.
I think many people look at GC in the wrong way, as if absolute maximum speed were always the goal. It depends on what the goal is, and GC performance can be as fast as needed, or good enough. GC is also a convenience, one that many may find worthwhile, to not have to worry about memory management.
An optional GC also means you are free to manage memory manually, if some bit of extra performance really and truly is needed. As for language wars, it's arguably better to make the case for how to make manual memory management more convenient to use when the optional GC is turned off.
The GC made sense in comparison to C style memory management. But with Swift, Rust and modern C++ the argument can be made that GC doesn't bring anything to the table.
In the beginning the biggest drawback seemed to be performance, so people have been hell-bent on fixing that for decades now. Completely ignoring that the "don't need to care about memory" mindset is flawed. You always need to think about memory, and if you do you might as well do it yourself. It doesn't take more time to do in the long run and you are forced to design a better system because of it.
For scripts it is a really nice thing to have. But for a proper language writing large-scale applications? No thanks.
> But with Swift, Rust and modern C++ the argument can be made that GC doesn't bring anything to the table.
Well, it would be a bad argument :D
A reference counter is arguably the slowest option, so Swift is out of the picture (besides, it leaks circular references). Rust is a good idea, but it is a tradeoff. Quite a lot of programs can't be expressed with strictly tree-like lifecycles, and then you are either left with the yet again slower (A)RC, or unsafe. C++ is in pretty similar shoes. They are good for the niche they were made for: systems/low-level programming.
Let me quote from the Garbage collector handbook: “above all, memory management is a software engineering issue. Well-designed programs are built from components [..] that are highly cohesive and loosely coupled. [..] modules should not have to know the rules of the memory management game played by other modules. [..] GC uncouples the problem of memory management from interfaces.”
It says nothing about not thinking about memory, you are free to optimize it where you need it.
EDIT: Oh and dynamic allocation is much faster with a modern GC than with malloc and friends, plus a moving GC can defragment heap.
> A reference counter is arguably the slowest option, so Swift is out of the picture
It's not that simple. Obligate refcounting has lower throughput than obligate tracing GC but offsets that with more predictable latency, which might be good for the Swift use case. Rust uses a mix of stack-based allocation, manual heap allocation and RC - and the RC only needs to manage "owners" as determined via the borrowck, as opposed to ephemeral references. So it is way more efficient than something like Swift ARC, where refcounting is essentially ubiquitous and any operations via references will incur slow atomic RC updates.
Well, ever seen a C++ program that seemingly finished execution, but still hasn't exited?
That’s shared pointers recursively freeing up their reference trees. There is a great paper showing that RC tracks object death, while tracing GCs track object “liveness”, so its essentially two sides of the same coin.
As for RC, sure, at compile time much of the work can be omitted, but the atomic writes needed for the counters are fundamentally slow operations on modern architectures.
You can do reference counting without making it atomic, by prohibiting sharing across threads. It’s not fundamentally required. See Rust’s Rc, for example.
> But with Swift, Rust and modern C++ the argument can be made that GC doesn't bring anything to the table.
GC will always be needed to solve problems dealing with general graphs containing possible cycles. Not coincidentally, this describes many of the GOFAI problems for which GC (in the context of LISP) was first developed.
I mean, what you’re saying is probably technically true for very specific problems that are contrived specifically to show shortcomings in refcounting, but in practice I’ve never seen it matter.
Weak and unowned pointers are pretty simple to deal with, and I’ve never seen a situation where it’s really confusing whether a pointer should be strong or weak in swift. Everything mostly just works, and if you need to point back to the object that owns you, it’s usually pretty clear by the abstraction you’re building that you need a weak/unowned reference to do it.
> and if you need to point back to the object that owns you
That's not the interesting case; that is indeed easily addressed by "weak" pointers. What's interesting is the case of a spaghetti reference graph where none of the objects definitely "owns" any other, and even the choice of GC roots might be dynamic.
> With Rust and C++ (don't know about Swift), you still have to spend a lot of time thinking about memory.
You don't think about memory, you just think about how to ensure that you're always referencing live objects. And you need that kind of thinking anyway to deal with non-memory resources, where a GC won't help you at all.
You can almost just ignore it all as long as you stick to RAII and keep customizing allocators where you need to. It is quite reasonable because you can start coding fast and later improve performance in the allocation patterns.
There are other ways, such as customizing new/delete.
My argument is that you always need to think about memory - even in GC languages. Being able to not think about memory is not a feature, it is a problem.
Swift is a GC language, reference counting is a GC algorithm as per CS literature.
Modern C++ unfortunately still has too many people writing classical C with classes for it to be properly safe.
Rust, yes, affine types are great, but their usability cost is only justified in scenarios where having any kind of automatic memory management isn't acceptable.
> The GC made sense in comparison to C style memory management. But with Swift, Rust and modern C++ the argument can be made that GC doesn't bring anything to the table.
Haskell needs a GC too. You can't do ergonomic functional programming without GC (it has been attempted).
Haskell needs a GC because it can't figure out whether a closure might outlive any of the variables it references. This can be addressed by a Rust-like borrow checker. (In the rare general case where multiple independent pieces of code need to keep some object alive, this can be done via refcounting which of course is a kind of GC.)
gc makes it easier to conceive of and program algorithms that use cyclical graphs, a pretty broad problem. try and implement these structures in rust and you quickly run into headaches. solvable headaches but headaches all the same. they're trivial in a gc language
"Convenience" is exactly right and syntactic sugar is underrated.
Just like static typing is a convenience. And dynamic typing is a (different) convenience. And unit tests are a convenience.
The computer runs fine without any of these. And none of them are a magic wand that eliminates all bugs. Instead all of them make debugging easier by reducing the search space, usually eliminating fairly trivial bugs. Which is more useful than it sounds because a lot (the majority?) of bugs are pretty trivial and at the same time hard to find because they are often so trivial that they stare us in the face and we can't see them.
But again, all these things are conveniences for the human programmer, they don't make one iota of a difference for the actual computer running the bits.
If I understand you right, then following your logic, all programming languages are conveniences. The computer runs just fine if we give it a binary to execute. It's not clear to me where you'd draw the line.
Is that wrong though? Compilers are "just" a kind of program that runs on our computers that afford us the convenience of writing programs in a more human friendly manner. The main argument in my mind against considering programming languages an awesomely powerful convenience would be that this is reductionist and that in the same manner, literally every invention ever was just a convenience as well.
But maybe the view of modern society as a tower built on the back of a million million one time conveniences that have since become indispensible isn't that far from the truth either?
It's not always a convenience. Perhaps when dealing with the potential elimination of certain classes of programmer error, but there are definitely use cases where specialized GC can solve runtime problems more efficiently than hand-written memory management.
As a real world example, consider a firmware boot loader I wrote for a previous client. They needed to apply code polymorphism (think, extreme ASLR) when they applied an over-the-air update or a factory reset of firmware. This is part of their defense in depth strategy: if a gadget attack were discovered in their system, they wanted to mitigate its effectiveness across all deployments. The problem with this approach is that boot loaders are tightly space constrained. To implement a sound code polymorphism strategy, around a dozen different heuristics must be applied in several passes over the code. These heuristics had to be weighed against timing and cache coherency constraints in the code under mutation, but most critically, these heuristics had to fit in the very limited space set aside for the boot loader. The GC was an optimization. It significantly reduced the code size of the heuristics.
I always bring up SLAs when arguing about language X vs Y.
Even if language X loses in microbenchmarks against language Y, if the application written in X is still within the project delivery SLA for acceptance testing, who cares about the microbenchmarks?
GCs don't eliminate leaking, they eliminate use-after-free, and can also be used together with other language design choices to eliminate any kind of memory corruption (like in Java).
Indeed, many think of GC languages as if GC was the only way to allocate memory.
Java is one of the languages to blame for such misunderstanding, many other languages, even Lisp variants like Interlisp and Common Lisp, provide all the tooling to manage resources like one would do in C and co languages.
Having a GC doesn't preclude those other features.
In D's case, we have the problem that language designers are against moving the GC forward, as they refuse to introduce managed pointers alongside untraced pointers.
So even optimizations that Mesa/Cedar, Modula-3 and Active Oberon were capable of before D came to be, aren't possible in D with its current language design.
> we have the problem that language designers are against moving the GC forward
This is incorrect. There even was a concurrent collector written for D, but it failed for technical reasons, not because anyone blocked it.
The idea that I would block any improved GC implementation for D is just silly. D is free and open source and Boost licensed. I couldn't possibly stop anyone from doing a better one.
You're free to prove me wrong about what can be done, I would welcome your implementation!
Plenty of languages have already proven D wrong, as mentioned in my language list; my contribution isn't required.
C# already offers what I care about in managed languages with C- and C++-like capabilities.
And honestly, why would anyone go through the trouble of proving you wrong, if in the end it meets the same fate as the concurrent one, which only works on Linux anyway?
I do, and it isn't called Managed C++, rather C++/CLI.
Managed C++ was replaced by C++/CLI in .NET 2.0.
It is the best way to write bindings to native libraries on the .NET ecosystem without having to bother with getting P/Invoke or COM bindings attributes quite right.
However it is constrained to Windows, so for portable .NET code, P/Invoke it is.
Because C# covers all bases, especially since they started bringing Midori features into C# 7 onwards; as I mentioned, I only use it for non-portable code out of convenience.
In fact C# actually has three models of pointers: regular references, JIT intrinsics like IntPtr/UIntPtr, and raw unsafe pointers.
(~gc side-track Walter) Has anyone ever tried using a caching model for application memory? LRU-evict memory objects to disk, with pseudo-persistent-memory application behavior?
Your keyword here is 'transparent persistence' (or 'orthogonal persistence'). There was quite a lot of work on this a few decades ago - https://dblp.org/db/conf/pos/index.html
p.s.
those seem to be persistent object focused. Persistence in a caching memory model would be a side-effect (and doesn't even have to be used). Do you know if there exists research primarily focused on 'a caching memory model' as an alternative to (general) GC or reference counting strategies?
Oh, I see. I am not aware of any work in that domain. Caching tends to be rather application-specific; despite all that phk says about varnish, mmap is not really a good general solution to the problem. Something language-specific with some awareness of the object model and perhaps access patterns could be an improvement, but not, I think, more than an incremental improvement.
It's an idea that's been trying to get me seriously interested /g for a while now. I didn't wish to assert novelty (but to date haven't found anything).
The general idea is pretty straightforward: you have a ~fixed size chunk of (virtual) memory that active memory objects reside in. Garbage and very rarely used references get flushed to disk. Presumably, if the operating memory requirements (i.e. active objects) are met by the available Ln cache layer, this should be a viable alternative to 'collecting garbage'. We're trading the overhead of tracing/ref-counting for the cost of the cache mechanism. If the cache is tiered, then the 'garbage' and 'rarely used' objects will end up in nice VM blocks that the OS will flush to disk. One approach basically requires the use of one annotation at language level to distinguish 'long lived' objects.
Something like "Address/memory management for a gigantic LISP environment or, GC considered harmful" <https://dl.acm.org/doi/10.1145/1317203.1317206>? I think that uses usual LRU mechanisms to determine what to page though.
The one that's in druntime at the moment? Or a real concurrent GC? There are currently two different parallel GCs in druntime as I write this, i.e. a forking one and a parallel-marking one (which is enabled by default).
I don't think it is slower; the real answer is 'it depends'. D's GC will only run when it needs to grow its buffer, so D's GC can actually be much faster than Go's.
The problem, however, is actually 2 problems:
- stop the world: nobody wants that in a world with lots of cores and threads
- it doesn't scale: the more pointers in your heap, the more it has to scan; it traverses your WHOLE heap whenever it needs its buffer to grow, and that doesn't scale well
So it's good when you don't have much in your heap, and it starts to lose its benefits the bigger your program becomes; I wouldn't use it for my servers
But that's not the main problem of D, since the GC is optional; it's just not competitive with what's available in the market today
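(For what it's worth, the knobs behind "the GC is optional" are real druntime APIs; a usage sketch of deferring collection around a latency-sensitive section:)

    import core.memory : GC;

    void latencySensitiveSection()
    {
        GC.disable();            // allocations won't trigger a collection in here
        scope (exit)
        {
            GC.enable();
            GC.collect();        // pay for the pause at a point of our choosing
            GC.minimize();       // return free pages to the OS where possible
        }
        // ... allocation-heavy, pause-intolerant work ...
    }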
The people who want to drag D into the Java/C# territory are the problem in my opinion
D would be better if it focused on being a systems language, and took what C had to offer and put it to the next level: simplify the language, boost the existing features, allocators, pattern matching, tooling, compiler performance, hot-reload, binary patching
That's the thing I want to hear about when there is a new version, not the endless GC topics
> The people who want to drag D into the Java/C# territory are the problem in my opinion
Absolutely and that's the majority of D community.
> took what C had to offer and put it to the next level
ImportC is fantastic stuff that Walter is working on.
> simplify the language
Sane defaults, but the ship has sailed. Reminds me of a talk Scott Meyers gave a while ago "The last thing D needs (is to hire him)". I think it's time they hire him.
> compiler performance
Rather focus on just LDC or GDC and drop DMD altogether. DMD is a good piece of software, but for such a small community, I find it alarming that they waste human effort across 3 different compilers and still complain about a lack of resources.
> ImportC is fantastic stuff that Walter is working on.
I agree, it's one of the things that stands out when you decide to pick a system language: "how does it play with C? can I easily consume the ecosystem?"
> Rather focus on just LDC or GDC and drop DMD altogether. DMD is a good piece of software, but for such a small community, I find it alarming that they waste human effort across 3 different compilers and still complain about a lack of resources.
I disagree: there is value in having your own backend. DMD compiles so fast that it's a comparative advantage; they should never give that up
GDC/LDC are great because that allows D to be highly portable, even if they are slower to compile than DMD
Even Zig people decided to maintain their own self hosted backend for that reason, performance and independence
They learnt from D: a real language has its own backend; if you don't have one, then you are just LLVM sugar
>> GDC/LDC are great because that allows D to be highly portable, even if they are slower to compile than DMD
So you'd use DMD for development because it compiles fast, and once debugging/testing/etc are completed, you build the "production release" using LDC/GDC?
No problem with LLVM, but multiple implementations are healthy. And in any case I'd say it makes sense that DMD came into existence alongside the idea of D in the first place, seeing Walter's experience (and, well, the lack of LLVM back then).
Java/C# territory is already better than D, even for low-level programming, because in the last 10 years that D spent changing projects to attract the next wave to the language "because we need X for Y", they grew out of where they were in 2010, when Andrei's book came out.
D doesn't even have hardware vendors shipping commercial embedded SDKs for baremetal development like Java, C# and Go have in 2022.
I actually find it one of the biggest downfalls of D. IMO, you have to pick one, among other things because of what's listed in the article. Riding the fence leads to a worse experience for both sides.
> Riding the fence leads to a worse experience for both sides.
I don't entirely agree, I think having a garbage collected pointer (GCP) can make a lot of sense as a performance optimisation: manual memory management (MMM) and reference counting (RC) have stampeding characteristics similar to GC pauses when releasing large hierarchies, e.g. persistent data structures or large trees of widgets: the whole tree is freed synchronously and recursively, and if it's large that can be very noticeable.
This stampeding is more predictable than a GC pause, but it's no less problematic when it occurs, and mitigating it can be difficult (you have to manually shunt the objects you'd like to release off-thread). A GCP can do that shunting on its own, letting a GC thread perform the releases asynchronously, and piecemeal.
Furthermore, GCP means you're not expecting to precisely track allocations at the system level, which means the usual tricks (arenas, bump allocators, ...) can be applied by default to GC pointers, where they usually can't be to MMM or RC pointers. This decreases allocation and deallocation overhead by reducing the amount of work the system has to do.
> you have to manually shunt the objects you'd like to release off-thread. A GCP can do that shunting on its own, letting a GC thread perform the releases asynchronously, and piecemeal.
You could also use async programming to do the same thing in GC-free code. Or just use an arena that can be freed all at once.
> You could also use async programming to do the same thing in GC-free code.
No? Allocations are blocking, "async programming" won't do anything.
Unless you have an async dealloc which does the shunting implicitly, at which point you don't need an async dealloc, you can just have a sync one which shunts the actual dealloc and mandate a background thread. Which means now you're mandating a background thread for freeing memory. And you hope that the load is light enough that the implicit thread which you've tasked with all deallocations (rather than just the ones which make sense) keeps up.
> Or just use an arena that can be freed all at once.
That is essentially the same thing: you now have a different allocation strategy for these, except it's a lot more limited and specific.
Modern malloc/free implementations must satisfy requirements such as being scalable. That eliminates any chance to implement blocking deallocation from non-owner threads. A modern implementation already has to be async in some ways. You can check the paper behind mimalloc, for example.
Your objects in trees and arrays don't have the same lifecycle?
Quoting GP:
> ... reference counting (RC) have stampeding characteristics similar to GC pauses when releasing large hierarchies, e.g. persistent data structures or large trees of widgets: the whole tree is freed synchronously and recursively, and if it's large that can be very noticeable.
This is mostly where RC & MMM fail. This is also where you are supposed to use arena allocators.
If you have a tree you're constantly adding to / deleting from, you can use a Pool backed by an Arena. At that point you don't use pointers, only indices into the pool. When you want to remove something from the tree, you mark the objects in the pool "deleted" until you want something added back again.
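A minimal sketch of that pool-plus-indices pattern in D (the names and layout are illustrative, not a library API):

    // hypothetical pool; in practice the slots would live in one arena block
    struct Node
    {
        int  value;
        uint left  = uint.max;   // child indices into the pool; uint.max means "no child"
        uint right = uint.max;
        bool deleted;
    }

    struct Pool
    {
        Node[] slots;

        // reuse a "deleted" slot if one exists, otherwise append
        uint add(Node n)
        {
            foreach (i, ref s; slots)
                if (s.deleted) { s = n; return cast(uint) i; }
            slots ~= n;
            return cast(uint) (slots.length - 1);
        }

        // removal just flips a flag: no recursive free, no dangling pointers
        void remove(uint i) { slots[i].deleted = true; }
    }

Dropping the whole tree then amounts to resetting the arena behind the pool, which is the "freed all at once" case mentioned earlier.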
> Your objects in trees and arrays don't have the same lifecycle?
No, there is a nested set of lifetimes.
Take a hierarchy of widgets, for instance: if you change a sub-view, you're going to reclaim the widgets composing that sub-view, replacing them with the widgets of another sub-view.
Then there are persistent data structures, where lifetimes are reversed (higher nodes are younger): when you update a node, you're going to invalidate all the nodes on the path to it, but the nodes which are not on that path are still alive.
> If you have a tree you're constantly adding to / deleting from, you can use a Pool backed by an Arena. At that point you don't use pointers, only indices into the pool. When you want to remove something from the tree, you mark the objects in the pool "deleted" until you want something added back again.
So you're implementing an ad-hoc GC by hand (and so is everyone else) and making your traversal more complicated.
If there's a GC pointer provided (by the language or a library), in the vast majority of cases it will work well enough and you won't need to waste time on that, you can solve actual issues instead.
Sure, it’s a great optimization opportunity — though I would argue that replacing pointers with indexes can be a bad thing (often seen in Rust when people don’t want to argue with the compiler), since then not even valgrind and the like can catch your “use-after-free” errors.
I agree about riding the fence. The main reason D is so criticised for having a GC is that it gives users a choice. I'd also rather have it either go full GC or full manual/semi-manual memory management, probably leaning more towards the GC side personally.
There are other aspects where D gets hit from both sides because it never picks a side, instead it tries to cater to both sides, never going full into any of them. I guess there is some benefit in a language that's general and doesn't force you into specific paradigms, but it also increases the surface area (maintenance area) of the language and adds complexity for developers when every step of the way you have to consider the alternatives.
Agreed that riding the fence doesn't give a satisfactory result. In theory, GC is optional in Go, but I don't think anyone is seriously running without a GC, or writing code that works well without one.
Interesting distinction, but once the object is replaced by scalars, those scalars are placed on the stack.
Do I understand it correctly that you are saying real stack allocation would involve allocating the whole object, including the header, on the stack, and passing references to such objects to (non-inlined) functions that worked on either stack allocated or heap allocated objects?
> Interesting distinction, but once the object is replaced by scalars, those scalars are placed on the stack.
No, they become data-flow edges, so they could be in a register, part of an addressing operation, value-numbered away, or nothing at all if they're never used; they end up on the stack only as a worst-case fallback.
> Do I understand it correctly that you are saying real stack allocation would involve allocating the whole object, including the header, on the stack
Yes, which is useful in some cases, but generally a lot weaker than full scalar replacement.
I love the garbage collector in Nim. Not only can you choose between different memory management strategies, but you can also tune the garbage collector to your liking, and it's pretty fast by default.
Not to speak of Nim's ORC, which is automatic memory management, but arguably not garbage collection. Instead it's, roughly, automatic reference counting with cycle detection. The stopping times are absolutely minuscule!
Not at all professing any love of Nim, but I do think their concept of various options for memory management was a good one. Not all programmers have the same goals and problems.
Part of the issue can be that languages that do not provide any options for memory management can make it seem that GC is more of a liability than it is. Nim, D, and other such languages are at least giving options, versus none. The lack of convenient choices might be the greater issue, rather than GC deserving the stigma.
> Not all programmers have the same goals and problems.
That's why we have different programming languages. The danger of making a language a jack-of-all-trades is that it will be master of none. Restrictions are more often than not a good thing: they give both humans and machines the power to reason (yielding, for example, memory safety).
D has a long history, but it has always been a language in transition. It is going through a period of refinement in which features are added and the underlying abstractions are made more efficient, so I expect the performance problems to be solved reasonably soon. For now, D developers should look at the current status of D's garbage collector and how to improve it.
When D was originally developed it was intended to be a language in between C and Java: still fairly low-level, though not as low-level as C and not as high-level as Java. It still managed to outperform Java and rival C. Ten years from now it'll be even faster.
> 1. It enables a killer feature - CTFE that can allocate memory. C++ doesn't do that. Zig doesn't do that.
As anything but a compiler expert, I don't understand how GC is a necessary condition of having memory allocation within CTFE [0]; can somebody expand on that?
The CTFE interpreter knows what "new" does, and just calls the compiler's "new" function to allocate memory. If the user wrote their own custom allocator, then the CTFE would have to interpret that, which is kind of a big mess.
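For example (a hypothetical snippet, not from the compiler's documentation), an ordinary D function that allocates with new can be evaluated by the CTFE interpreter simply by using it in a constant context:

    // The allocation below is performed by the compiler's CTFE interpreter
    // when squares() is evaluated at compile time.
    int[] squares(int n)
    {
        auto a = new int[](n);   // GC allocation; fine under CTFE
        foreach (i; 0 .. n)
            a[i] = i * i;
        return a;
    }

    enum table = squares(5);     // forces compile-time evaluation
    static assert(table == [0, 1, 4, 9, 16]);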
2. Zig and C++ require any code to be run at compile time to be specially marked (comptime, constexpr). D will run any code that appears in a const-expression at compile time.
3. Zig and C++ require that if a function is to be used for CTFE, the entire function must be compatible with CTFE. D only requires the path taken through the function to be compatible with CTFE. D even has a `__ctfe` pseudo-variable that can be used to branch within a function to compile-time and run-time paths.
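A small illustration of points 2 and 3 (my own example): the printf call below is not CTFE-compatible, but because the __ctfe branch skips it at compile time, only the path actually taken matters, and no special marking is needed to call the function in a constant context:

    import core.stdc.stdio : printf;

    int doubleIt(int n)
    {
        if (!__ctfe)
            printf("doubleIt called at run time\n");  // not CTFE-able, but this
                                                      // branch is skipped under CTFE
        return 2 * n;
    }

    enum ten = doubleIt(5);     // unmarked function, evaluated at compile time
    static assert(ten == 10);

    void main()
    {
        printf("%d\n", doubleIt(21));  // ordinary run-time call, prints the message
    }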
Point 2 is not really true for Zig: functions don't need to be specially marked to be callable at compile time. The comptime keyword is sometimes required to force compile-time resolution (Zig currently doesn't eagerly try to resolve function calls at compile time), but in any compile-time-only context not even that is necessary.
Similarly, point 3 is not really true either: what matters in Zig as well is the path the function took during evaluation, like so:
    const std = @import("std");

    fn double(n: usize) usize {
        if (n % 2 == 0) {
            std.debug.print("even!", .{});
        }
        return 2 * n;
    }

    const MyArrayType = [double(5)]u8; // this works
    //const Bad = [double(6)]u8; // this will fail to compile
You are right about allocation not working during comptime evaluation at the moment, but this is not a final design decision, just the current status quo.
D's actually worse than Zig in regard to point 3. Kristoff's example demonstrates why: in D you would have to change the "if" to a "static if", because D will always evaluate a normal "if" at runtime even if its condition is comptime-known; it will not do this automatically for you.
The bigger issue D has with this example is that normal parameters are always runtime. If you wanted this to work in D you would need to implement two versions of "double": one that takes n as a template parameter and one that takes it as a runtime parameter. D keeps comptime and runtime parameters separate, making comptime-knownness a "parameter color", which in practice means having to implement things twice if you want them at comptime, and often in different ways. There are some things that can work with both, but that's a small subset.
> D will always evaluate a normal "if" at runtime even if its condition is comptime-known; it will not do this automatically for you.
WRONG. A plain "if" inside a function being evaluated by CTFE works without problems.
static if has nothing to do with CTFE; conceptually, static if is a beefed-up #ifdef/#endif done properly.
The biggest issue D has is all the FUD and misconceptions that are propagated about it.
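For what it's worth, here's a minimal sketch of the distinction (my own example, not from either poster): a plain if inside a function is simply interpreted when CTFE runs that function, while static if is resolved during semantic analysis, much like a structured #ifdef:

    int pick(int n)
    {
        if (n > 0)      // ordinary if: interpreted when CTFE runs the function
            return n;
        return -n;
    }

    enum a = pick(-3);  // CTFE evaluates the plain if just fine
    static assert(a == 3);

    template Size(T)
    {
        static if (T.sizeof > 4)   // static if: resolved while compiling the template
            enum Size = "big";
        else
            enum Size = "small";
    }

    static assert(Size!long == "big");
    static assert(Size!int  == "small");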
C++20 has limited support for allocating memory at compile time: the allocations can't escape the constant-expression context (anything allocated during constant evaluation must be freed before that evaluation ends), so that limits usefulness.
There are proposals to extend it, but there are some non-trivial const correctness issues to be solved there, AFAIK.
Somewhat on topic, but it feels like D evangelists and Walter Bright keep extolling D everywhere, yet D has yet to be used as widely as other modern languages. D was forever the "C++ replacement", and I feel like it still hasn't shaken that perception in people's minds even though it has come into its own as a unique language.
Since Andrei Alexandrescu decided to refocus on C++, I think that kind of settled the future of the language: core members going back to a language that is apparently less capable.
> The GO GC can also take advantage of always knowing exactly where all the GC pointers are, because there are only GC pointers.
You can call the mmap syscall (or a wrapper like C.malloc) from Go just fine, and obtain non-GC pointers. Not all Go pointers are pointers into the GC-managed region. This makes me question the rest of the opinions herein.
Non-pointers, yes. (EDIT: Pointers to non-Go memory are fine, though generally impossible to use in e.g. mmapped file data. Pointers to Go memory stored outside of the GC region will not be seen by the GC, nor guaranteed valid past a single CGo function call.) A common trick is to e.g. treat a mapped file as a large array of structs.
I would love it if some language would implement a segmented heap, where each part could be GCed separately.
Erlang has this model with its lightweight processes. And it's a great model that helps not only with GC, but also with guaranteeing no shared state between different parts of the code.
That would mean disallowing cyclical references across different GC heaps; non-cyclical references would just create additional GC roots. It would be a step away from totally automated memory management and towards something closer to "smart" reference counting.
Because it's "optional". Some of D's features (https://dlang.org/spec/garbage.html#op_involving_gc) depend on it. It's optional only in the sense that you can avoid using those features. But if they're never to be used, why do they even exist?
How interesting! I know virtually nothing of GCs, but posts like this suggest that there is a huge amount of research behind something that looks so deceptively simple as calling a "free/delete" here and there. Kudos to whoever works on this stuff!
I've implemented novel GC algorithms in firmware. There are some algorithms that are just more compact when GC is available. Algorithms and heuristics that are graph heavy or require pruning nodes with possible cyclic references are just more elegant and code space efficient with GC. However, firmware is highly resource and timing constrained, which means that the GC is specialized.
I wouldn't worry about the speed of a garbage collector. Rather, focus on its efficiency: actual work done per cycle and per byte.
Collectors can run on a separate core and show double the speed, but then you have to halve the total number of working processes on your machine. Thus total work done per cycle remains the same.
Or it can just be lazy and run collection half as often, and thus show half the processor use. But RAM use will then double for the same workload. In the end, actual work done per byte of memory remains the same.
So focus on efficiency. The user can spend the spare cores and RAM on parallelizing their own work, or on doing more work.
If one wanted a fast GC, they could begin with code generation or implementing a VM to work with something like ORCA, which has more powerful sharing and exclusivity semantics than Rust; Pony is the example language implementation. In benchmarks, it stomps Azul C4 and BEAM/HiPE.
Basically, you should only use GC for small litter, i.e. short-lived tiny objects, and for larger and longer-lived objects use the deterministic memory management tools available in the language and its libraries.
[0] "Barriers reconsidered, friendlier still!" https://users.cecs.anu.edu.au/~steveb/pubs/papers/barrier-is... [1] https://twitter.com/stevemblackburn/status/14942409060061102...