I do wonder what the appropriate metrics for a garbage collector should be. This post focuses entirely on the main-thread times, but maybe "total CPU time used" would be better? Especially if it were split up using hardware performance counters (e.g., how much time is spent waiting on L1, L2, and L3 cache, and on DDR4 RAM).
I'd imagine that having multiple cores work on application + garbage collection in parallel would cause more main-memory hits (the garbage collector would have to keep its state in cache, probably L3 cache, meaning the application has less L3 cache to work with). So overall throughput would be lower than the single-threaded solution.
But yes, latency is king for UI programs. I guess I'm just musing about the "what to measure" problem.
Using a separate thread gets you the same GC benefits.
There have been some attempts to make it easier to use threads on the web like the excellent Comlink library:
And even with many background threads, in WASM: https://developers.google.com/web/updates/2018/10/wasm-threa...
Without knowing too much about the intricacies: wouldn't it be beneficial to have a system (kernel) provided GC? Surely there are no-cost ways of sharing userspace pointer usage with the kernel?
As such: you must never allow a side-channel attack against other users through the kernel. This means that a significant amount of inefficiency must be introduced (i.e., flushing TLB caches, among other things) whenever the kernel executes.
The current plan of action for most OSes is therefore to avoid running the kernel at all. Linux's "futex", for example, is purely userspace in the "hot" case. (In the "cold" case, such as when a task switch becomes necessary, the kernel does run. But these cases are kept as rare as possible.)
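To illustrate the hot/cold split: the uncontended case of a futex-style lock is a single atomic compare-and-swap in userspace, and the kernel is only invoked on contention. This is a minimal sketch, not Linux's actual implementation; the names and the three-state encoding are illustrative.

```c
#include <assert.h>
#include <stdatomic.h>

/* Sketch of a futex-style lock fast path (illustrative only).
 * 0 = unlocked, 1 = locked. The uncontended path never enters
 * the kernel. */
typedef struct { atomic_int state; } fastlock;

static int fastlock_trylock(fastlock *l) {
    int expected = 0;
    /* Hot path: one atomic compare-and-swap, no system call. */
    return atomic_compare_exchange_strong(&l->state, &expected, 1);
}

static void fastlock_unlock(fastlock *l) {
    /* A real implementation would call futex(FUTEX_WAKE) here if
     * waiters were recorded -- that is the "cold" kernel path. */
    atomic_store(&l->state, 0);
}
```

A real futex also has to park the losing thread in the kernel (FUTEX_WAIT) and track whether waiters exist; the point here is only that the common case stays entirely in userspace.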
Similarly: I'm not sure what moving the garbage collector into the kernel would accomplish at all. It'd frankly be faster as a library, since you wouldn't need to do a secure task-switch.
Maybe a bit far-fetched, but parts of the kernel (will never happen) or kernel modules (possible) could use it as well?
Spectre is a side channel over any branch-predicted loop (and I assume most kernel code has a for-loop __somewhere__ that can serve as a target; even a stray memcpy is sufficient), where the Spectre gadget can read memory across the timing-attack side channel, by measuring the time it takes for the branch prediction to occur in the target for-loop.
That's why TLB, L1, and branch-predictor data need to be discarded whenever kernel code starts executing, to prevent these side channels from leaking information. These caches and branch-predictor data are worth many hundreds, maybe thousands, of clock cycles. You don't want to do kernel task switches willy-nilly.
> Kernel should know better when it's good time to mark and sweep
The optimal time to mark and/or sweep depends on the application and the goals of the programmer. There's a reason why Java ships more than three garbage collectors for the user to choose between.
Throughput-based code (e.g., number crunching) wants a different garbage-collection strategy than UI-based code (i.e., what we have here: a heavily latency-optimized GC, but probably with much worse throughput figures).
Throughput-based code is happy with "stop the world", because "stop the world" provably does the fewest garbage-collection cycles. "Stop the world" means you __only__ run the GC when you run out of memory, and not a moment sooner! Because you're out of memory, you have to stop the world, causing large latency spikes at random times. But from a throughput perspective, this is the most efficient approach: the fewest memory reads/writes and CPU clock cycles spent on garbage-collection duties.
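The "only collect when out of memory" policy can be sketched in a few lines. This is a toy, with a stub standing in for a real mark/sweep, and all names are made up; it exists only to show that exhaustion is the single GC trigger.

```c
#include <assert.h>

/* Toy allocator illustrating the "stop the world" policy: collection
 * runs only when the heap is exhausted, never sooner. */
#define HEAP_CELLS 4
static int heap_used = 0;
static int gc_runs = 0;

static void collect(void) {      /* stop the world; stub mark/sweep */
    gc_runs++;
    heap_used = 0;               /* pretend every object was garbage */
}

static int gc_alloc(void) {
    if (heap_used == HEAP_CELLS) /* out of memory: the ONLY trigger */
        collect();
    return heap_used++;          /* hand out the next free cell */
}
```

A latency-optimized collector would instead interleave many small increments of that `collect()` work between allocations, trading total cycles for shorter pauses.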
I don't understand why you'd have to do task switching in the first place.
You can't mark pointers as unused without locking/syncing because kernel can reallocate pointers in the middle of scanning, no? That's the only reason, isn't it? But in kernel you'd know about it and you could mark unused pointers in single, lock/synchronisation free pass, no? In other words you don't need to stop the world if you are in kernel space because you know about all reallocations that are happening?
Because each programming community has a particular workflow / paradigm that works for them, each garbage collector is explicitly designed around its language's quirks. For example: Lisp's "cons" cells are all of a consistent size, so the collector is practically a constant-sized-allocation collector, possibly implemented using bitsets over the car/cdr cells. Java is throughput-focused by default, because of its large number of server-based customers.
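The bitset idea for fixed-size cons cells can be made concrete: one bit per cell records allocated/free, and sweeping every unmarked cell collapses to a single AND over a word. This is an illustrative sketch, not any real Lisp's implementation.

```c
#include <assert.h>
#include <stdint.h>

/* One bit per fixed-size cons cell: bit i set => cell i allocated. */
static uint32_t live_bits;

static int cons_alloc(void) {    /* returns a cell index, or -1 if full */
    for (int i = 0; i < 32; i++)
        if (!(live_bits & (1u << i))) {
            live_bits |= 1u << i;
            return i;
        }
    return -1;
}

static void sweep_bits(uint32_t marked) {
    live_bits &= marked;         /* one AND frees every unmarked cell */
}
```

This only works because every cell is the same size; a heap of mixed-size objects needs per-object headers or size classes instead.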
> I don't understand why you'd have to do task switching in the first place.
Wrong word. I meant system call. Either way, switching "into" kernel-context is expensive if only because of the security procedures necessary.
By turning garbage-collection into a system call, you've introduced a potential security issue into the kernel (and therefore: security-related slowdowns). If anything, you want to keep as much functionality outside of the kernel as possible.
> You can't mark pointers as unused without locking/syncing because kernel can reallocate pointers in the middle of scanning, no? That's the only reason, isn't it? But in kernel you'd know about it and you could mark unused pointers in single, lock/synchronisation free pass, no? In other words you don't need to stop the world if you are in kernel space because you know about all reallocations that are happening?
I would recommend that you write a malloc/free implementation by hand.
All garbage collection really is, is free() becoming a no-op, with all the work done in the malloc() call instead. It turns out that malloc() has all the information about which nodes were allocated, as well as which ones are alive or dead.
The kernel never has this information, because the malloc() call threw it away in the first place. If the programmer _had_ the information, they would have called free().
But the programmer decided that calling free() was either too hard, or error-prone for their use case, and decided to go with the easier malloc-only interface. That means that malloc() now has to work a lot harder to "rediscover" the locations of all pointers that must be freed.
It doesn't matter if these routines are in the kernel, in user-space, or taking place on the moon over a TCP/IP port by carrier-pigeons. The fundamental tradeoff you have is that your code must rediscover live vs dead pointers on its own, independently from the programmer.
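That "rediscovery" step is exactly a trace from the roots: nothing remembers which cells are dead, so the collector re-derives liveness, then reclaims everything it couldn't reach. A minimal sketch over a fixed pool of cells (all names illustrative):

```c
#include <assert.h>
#include <stddef.h>

/* Minimal mark/sweep over a fixed pool: the collector rediscovers
 * live vs. dead cells on its own by tracing from the roots. */
#define POOL 8
typedef struct cell { struct cell *next; int marked; } cell;
static cell pool[POOL];

static void mark(cell *c) {
    for (; c && !c->marked; c = c->next)
        c->marked = 1;           /* reachable from a root => live */
}

static int sweep(void) {         /* returns the number of dead cells */
    int freed = 0;
    for (int i = 0; i < POOL; i++) {
        if (!pool[i].marked)
            freed++;             /* unreachable => reclaim */
        pool[i].marked = 0;      /* reset for the next cycle */
    }
    return freed;
}
```

With explicit free(), the programmer supplies the liveness information directly and this entire traversal disappears; that is the tradeoff described above.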
Tanenbaum was right after all.
The Switch's performance also doesn't seem to suffer from having a microkernel, especially if we look at its sales.
Finally, Linux has long been tainted with microkernel-like stuff: type 1 hypervisors, userspace drivers like DPDK, Project Treble (userspace drivers with Android IPC), systemd, security-critical applications like browsers moving back to the UNIX process-per-task model, containers, ...
In hypervisor-based cloud systems, doing microservices and serverless, with language runtimes that are hypervisor-aware, who cares about monolithic kernels?
Nintendo consoles aren't exactly targeting high-performance games. What they are targeting, with extreme prejudice, is people using their hardware in ways Nintendo didn't bless; the need to lock down the console probably supersedes everything else.
> userspace drivers like DPDK
Which moves all processing into userland because the large number of system calls necessary for traditional network operations is too damn slow. Newer kernel APIs like io_uring are trying to address some of that.
> Project Treble (userspace drivers with Android IPC)
I have an eight-core Android phone; I don't understand why I need eight cores to run a phone, yet apparently performance did not hit rock bottom back when feature phones ran on a Java-based system.
Everything you mention screams horribly bad performance to me.
But even if you throw in the towel on that idea, GC isn't one-size-fits-all. If you built a GC that was so tunable that it covered all use cases and profiles then it would probably be as complex to tune as it would be to write your own.
And that said, there is already a GC provided by the kernel. We just call it "manual memory management", and wrapping it efficiently for automatic use is the process of designing a GC for your language runtime.
Can the GC move objects? (Then you must be careful storing pointers across any point where a GC can occur.) Is it the user's responsibility to note when an object changes? (Which can be faster, but needs lots of bookkeeping code.) Are objects kept alive by pointers into their interior, or only by pointers to their start?
You could have a system GC for Go, one for Python, one for Java, etc., which is basically the JVM/.NET approach, except they extended it to the whole VM, not just the GC. In practice, though, applications want different versions of the runtime and increasingly end up shipping their entire runtime with the app anyway (Go just went all-in and made this the default, which is actually really nice from a developer and user perspective).
So while a system GC (give or take plugability) sounds nice, in practice it seems like the ecosystem is moving the other direction.
This is an interesting observation. I think in both cases the ideal of having one large system remains elusive, and the more robust, practical solution is RPC (basically more, smaller control loops rather than a single larger one). To abstract this out even further, I've noticed that this seems to be a common pattern in how reliable distributed systems are architected, from Erlang's OTP to Kubernetes' controller approach. I'm actually really surprised this "controller" or "control system" architecture vs. monolith architecture isn't discussed more frequently, at least in terms of reliability tradeoffs.
D, Java and .NET provide such APIs.
By mixing GC with reference counting for pinning such references.
Naturally you can get the reference count wrong.
Still much better than just doing plain old manual memory management, as those are corner cases, while with manual memory management every allocation is a possible leak.
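The pinning scheme being described can be sketched as a per-handle pin count that a (hypothetical) moving collector consults before relocating an object. All names here are made up for illustration; forgetting an unpin leaves the count wrong, which is exactly the corner case mentioned above.

```c
#include <assert.h>

/* Illustrative sketch: each handle carries a pin count, and a moving
 * collector may only relocate an object whose count is zero. */
typedef struct { int pins; } handle;

static void pin(handle *h)   { h->pins++; } /* pointer handed to native code */
static void unpin(handle *h) { h->pins--; } /* native code is done with it */

static int gc_may_move(const handle *h) {
    return h->pins == 0;     /* pinned objects must stay put */
}
```

Real APIs in this family work the same way in spirit: JNI's Get/ReleaseArrayElements pairs and .NET's GCHandle pinning both bracket native access, and both leak pins if the release side is missed.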
Incidentally, this scenario was exactly one of the motivating examples presented for Linear Haskell (JVM/Haskell interoperability) by Tweag at the recent functional programming languages conference.
Given how intrusive this one is compared to a conservative Boehm GC? No, especially not when every GC ends up inheriting design warts like finalize(). I moved my entire Java codebase away from finalize() to the half-dozen dedicated cleanup classes Java provided (that was before try-with-resources).
> wouldn't it be beneficial to have system (kernel) provided gc?
I can only think of giving the kernel the ability to force cleanup in low memory situations as a positive and for that you probably wouldn't have to put the entire GC into the kernel.
...but isn't a "low memory situation" the only situation in which you'd actually have to do it in the first place (at least thoroughly/more expensively, bailing out once enough has been reclaimed)?
Only in a throughput based collector.
This here is a latency-optimized collector. It wants to run many, many small cleanups so that the main-application thread continues forward with the smallest amount of delay. Big cleanups require bigger delays.
Isn't it already there? You allocate memory with malloc(), free it with free(), and the system takes care of fragmentation and things like that. The GCs that we are used to are usually one level higher, but I imagine that you could also go one level lower, where memory management is really manual: for example, allocating all memory at the start of the program and doing everything yourself.
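That "one level lower" approach is essentially an arena: grab one big block up front, allocate by bumping a pointer, and "free" everything with a single reset instead of per-object free() calls. A minimal sketch, with made-up names and a fixed-size static arena for simplicity:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal arena: one block up front, bump-pointer allocation,
 * and a single reset instead of per-object free(). */
#define ARENA_SIZE 1024
static unsigned char arena[ARENA_SIZE];
static size_t arena_top;

static void *arena_alloc(size_t n) {
    if (arena_top + n > ARENA_SIZE)
        return NULL;             /* arena exhausted */
    void *p = &arena[arena_top];
    arena_top += n;              /* bump the pointer */
    return p;
}

static void arena_reset(void) {  /* "free" everything at once */
    arena_top = 0;
}
```

Note this sketch ignores alignment; a production arena would round `arena_top` up to the alignment of the requested type before handing out a pointer.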