High-performance garbage collection for C++ V8 (2020) (v8.dev)



It makes sense for a JavaScript engine to optimize for main-thread latency. But the code I write for fun is usually throughput-bound, meaning a 2nd thread should try to minimize the additional work to be done.

I do wonder what the appropriate metrics for a garbage collector should be. This post focuses entirely on the main-thread times, but maybe "total CPU-time used" would be better? Especially if it was split up using hardware performance counters (ex: how much time waiting on L1, L2, L3 cache, and DDR4 RAM).

I'd imagine that having multiple cores work on application + garbage collection in parallel would cause more main-memory hits (the garbage collector would have to keep its state in cache, probably L3 cache, meaning the application has less L3 cache to work with). So overall throughput would be lower than the single-threaded solution.

--------------

But yes, latency is king for UI programs. I guess I'm just musing about the "what to measure" problem.


On a related note, if you are building a UI, an even better solution for computationally intensive work is to offload it onto a separate process instead of just a separate thread. This ensures that the UI thread is not blocked as often due to GC. Obviously this isn't really possible in JS (outside of Electron), but this can be a good idea in languages like Java.


There's overhead to a process that doesn't exist with threads. You can definitely split your work across threads in JS using Workers, ServiceWorkers and the main thread. Most apps don't take advantage of this very well though.

Using a separate thread gets you the same GC benefits.

There have been some attempts to make it easier to use threads on the web like the excellent Comlink library: https://github.com/GoogleChromeLabs/comlink


I think this is possible to do with JS + WASM/Service workers?

And even with many background threads, in WASM: https://developers.google.com/web/updates/2018/10/wasm-threa...


It has been possible for quite a while with background workers.


Looking from a distance, it feels like making the GC a library is... important. There seem to be a lot of GCs, many extremely sophisticated. Wouldn't it be great if they consolidated over time on a common API and could be exchanged with minimal effort?

Without knowing too much about the intricacies, wouldn't it be beneficial to have a system (kernel) provided GC? Surely there are no-cost ways of sharing userspace pointer usage with the kernel?


Kernel code is secure code. That means you need to defend against a variety of speculative and/or side-channel attacks. This is because the superuser has the passwords (and other "secure" data) to the entire system in the superuser's memory space. Lesser users must never gain access to superuser memory. The kernel is the keys to your kingdom, since it's responsible for protecting the superuser (and the other users).

As such: you must never allow a side-channel attack against other users through the kernel. This means that a significant amount of inefficiency must be introduced (ie: flushing TLB caches, among other things) whenever the kernel executes.

-----------

The current plan of action for most OSes is therefore to never have the kernel run. Linux's "futex", for example, is purely userspace in the "hot" case. (In the "cold" case, such as when a task-switch becomes necessary, the kernel would run. But these cases are minimized as much as possible.)
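
To make that fast/slow split concrete, here's a rough sketch of a futex-backed lock (hand-rolled in the style of Drepper's "Futexes Are Tricky", not glibc's actual implementation): the uncontended path is a single atomic compare-exchange in userspace, and the futex syscall is only reached when a thread actually has to sleep or be woken.

    // Sketch only: a futex-backed mutex. The uncontended path never enters the kernel.
    #include <atomic>
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static long futex(std::atomic<int>* addr, int op, int val) {
        return syscall(SYS_futex, reinterpret_cast<int*>(addr), op, val,
                       nullptr, nullptr, 0);
    }

    class FutexLock {
        std::atomic<int> state_{0};  // 0 = free, 1 = locked, 2 = locked + waiters
    public:
        void lock() {
            int c = 0;
            if (state_.compare_exchange_strong(c, 1)) return;  // fast path: no syscall
            if (c != 2) c = state_.exchange(2);                // announce contention
            while (c != 0) {
                futex(&state_, FUTEX_WAIT_PRIVATE, 2);         // sleep in the kernel
                c = state_.exchange(2);
            }
        }
        void unlock() {
            if (state_.exchange(0) == 2)                       // someone may be asleep
                futex(&state_, FUTEX_WAKE_PRIVATE, 1);         // slow path: wake one
        }
    };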

Similarly: I'm not sure what moving the garbage collector into the kernel would accomplish at all. It'd frankly be faster as a library, since you wouldn't need to do a secure task-switch.


You wouldn't have to preallocate memory. The kernel should know better when it's a good time to mark and sweep. Unpinning memory doesn't have to have immediate effects. I don't think side-channel attacks would be possible; the only task for the kernel is to deallocate, at some unknown point in time, memory that is no longer referenced, which by definition should not be visible from user space?

Maybe a bit far-fetched, but parts of the kernel (will never happen) or kernel modules (possible) could use it as well?


Meltdown is a side-channel over any memory-dereference that is in L1 cache. (by measuring the time it takes to read memory: you can determine what is or isn't in L1 cache). As such, any memory read is vulnerable to a meltdown-like attack unless the appropriate caches are cleared (if the cache is cleared, then the Meltdown gadget takes the same amount of time on all memory and no longer works)
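
A benign illustration of that timing primitive (x86-specific; nothing here is an exploit, just the measurement): the same load is dramatically slower when the cache line has been flushed, and that difference is exactly what such a gadget observes.

    #include <cstdint>
    #include <cstdio>
    #include <x86intrin.h>   // __rdtsc, _mm_clflush, _mm_lfence

    static uint8_t probe[64];

    static uint64_t time_one_load(const volatile uint8_t* p) {
        _mm_lfence();
        uint64_t start = __rdtsc();
        (void)*p;                    // the load being timed (volatile: not optimized out)
        _mm_lfence();
        return __rdtsc() - start;
    }

    int main() {
        _mm_clflush(probe);                      // evict the line: next load goes to DRAM
        uint64_t cold = time_one_load(probe);
        uint64_t warm = time_one_load(probe);    // line is now in L1
        std::printf("cold: %llu cycles, warm: %llu cycles\n",
                    (unsigned long long)cold, (unsigned long long)warm);
    }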

Spectre is a side-channel over any branch-predicted loop (and I assume most kernel code has a for-loop __somewhere__ that can serve as a target. Even a stray memcpy is sufficient), where the Spectre gadget can read memory across the timing-attack side channel. (by measuring the time it takes for the branch-prediction to occur in the target for-loop)

That's why the TLB, L1, and branch-prediction data need to be discarded whenever kernel code starts executing, to prevent these side channels from leaking information. These caches / branch-prediction data are worth many hundreds, maybe thousands, of clock cycles. You don't want to do kernel task-switches willy-nilly.

> Kernel should know better when it's good time to mark and sweep

The optimal time to mark and/or sweep depends on the application and goals of the programmer. There's a reason why Java has over 3 garbage collectors for the user to configure.

Throughput-based code (ex: number crunching) wants a different garbage collection strategy than UI-based code (aka: what we have here, a grossly latency-optimized GC but probably much worse throughput figures).

Throughput-based code is happy with "stop the world", because "stop the world" provably does the fewest garbage collect cycles. "Stop the World" means you __only__ run the GC when you run out of memory, and not a minute sooner!! Because you're out of memory, you have to stop the world, causing large latency spikes randomly. But from a throughput perspective, this is the most efficient: least amount of memory reads/writes or CPU-clock cycles spent doing garbage collect duties.
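
As a toy sketch of that policy (made-up names, with a bump allocator standing in for a real heap): the collector is only ever invoked from the allocator, and only when the heap is exhausted.

    // Hypothetical throughput-first policy: collect only when allocation fails.
    // collect_garbage() stands in for a full stop-the-world mark/sweep.
    #include <cstddef>
    #include <cstdio>

    constexpr std::size_t kHeapSize = 1 << 20;
    static unsigned char heap[kHeapSize];
    static std::size_t heap_top = 0;

    static void* bump_allocate(std::size_t bytes) {
        if (heap_top + bytes > kHeapSize) return nullptr;
        void* p = heap + heap_top;
        heap_top += bytes;
        return p;
    }

    static void collect_garbage() {
        // Placeholder for: pause every mutator thread, mark from roots, sweep,
        // resume. Compressed to a reset here so the sketch stays short.
        std::puts("GC: stopping the world");
        heap_top = 0;
    }

    void* gc_alloc(std::size_t bytes) {
        if (void* p = bump_allocate(bytes)) return p;  // common case: no GC work at all
        collect_garbage();                             // rare, large pause
        return bump_allocate(bytes);                   // nullptr = genuinely out of memory
    }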


Couldn't this be solved with strategy hint flags if necessary?

I don't understand why you'd have to do task switching in the first place.

You can't mark pointers as unused without locking/syncing because the kernel can reallocate pointers in the middle of scanning, no? That's the only reason, isn't it? But in the kernel you'd know about it and you could mark unused pointers in a single, lock/synchronisation-free pass, no? In other words, you don't need to stop the world if you are in kernel space, because you know about all reallocations that are happening?


> Couldn't this be solved with strategy hint flags if necessary?

How is that easier than just the programmer choosing a garbage-collecting library and/or language that works for their use cases? Ex: Javascript / V8 gets this GC from this blogpost. Java gets their 3 or 4 command-line configurable GCs. Go gets their own. Lispers use their Lisp-gc implementation. Etc. etc.

Because each programming community has a particular workflow / paradigm that works for them, each specific garbage collector is explicitly designed for each language's quirks. For example: Lisp's "cons" is practically a constant-sized allocation, so the GC can possibly be implemented using bitsets because all car/cdrs are of a consistent size. Java is throughput-focused by default, because of its large number of server-based customers.

And of course: this Javascript / V8 focuses on latency optimization for the best UI responsiveness.

> I don't understand why you'd have to do task switching in the first place.

Wrong word. I meant system call. Either way, switching "into" kernel-context is expensive if only because of the security procedures necessary.

By turning garbage-collection into a system call, you've introduced a potential security issue into the kernel (and therefore: security-related slowdowns). If anything, you want to keep as much functionality outside of the kernel as possible.

> You can't mark pointers as unused without locking/syncing because kernel can reallocate pointers in the middle of scanning, no? That's the only reason, isn't it? But in kernel you'd know about it and you could mark unused pointers in single, lock/synchronisation free pass, no? In other words you don't need to stop the world if you are in kernel space because you know about all reallocations that are happening?

I would recommend that you write a malloc/free implementation by hand.

All garbage collection is, is when free becomes a no-op, and all the work is done in the malloc() call. It turns out that "malloc()" has all the information for which nodes were allocated, as well as which ones are alive and/or dead.

The kernel never has this information, because the malloc() call threw it away in the first place. If the programmer _had_ the information, they would have called free().

But the programmer decided that calling free() was either too hard, or error-prone for their use case, and decided to go with the easier malloc-only interface. That means that malloc() now has to work a lot harder to "rediscover" the locations of all pointers that must be freed.

It doesn't matter if these routines are in the kernel, in user-space, or taking place on the moon over a TCP/IP port by carrier-pigeons. The fundamental tradeoff you have is that your code must rediscover live vs dead pointers on its own, independently from the programmer.
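
A minimal sketch of that tradeoff (hypothetical names, not any real collector): gc_free() is a no-op, so gc_malloc() has to keep a registry of everything it ever handed out, and a collection pass has to rediscover which entries are still reachable on its own.

    #include <cstddef>
    #include <cstdlib>
    #include <unordered_map>

    struct Allocation { std::size_t size; bool marked; };
    static std::unordered_map<void*, Allocation> live_set;  // every block ever handed out

    void* gc_malloc(std::size_t size) {
        void* p = std::malloc(size);
        if (p) live_set[p] = {size, false};  // remember it: only the collector can free it
        return p;
    }

    void gc_free(void*) {
        // Intentionally a no-op: the programmer has opted out of telling us when
        // an object dies, so liveness has to be rediscovered at collection time
        // (e.g. by conservatively scanning stacks and globals for addresses that
        // fall inside entries of live_set).
    }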


Correction on futexes: You've mostly got hot and cold inverted, here. A hot mutex/futex or code segment is one that many threads are trying to use at once. A cold one is one that is not very busy. So cold futexes are typically purely userspace and hot ones become less so.


The current plan of action for most OSes is to migrate to micro-kernel like designs.

Tanenbaum was right after all.


Isn't that going to get extremely costly with all the workarounds for hardware-based side-channel attacks? The only way I see micro kernels making sense is if we find a way to make the CPU cache less leaky without wasting tons of performance on it.


Most OSes in high integrity computing are micro kernels already.

The Switch's performance also doesn't seem to suffer from having a micro-kernel, especially if we look at its sales.

Finally, Linux has long been tainted with micro-kernel-like stuff: type 1 hypervisors, userspace drivers like DPDK, Project Treble (userspace drivers with Android IPC), systemd, security-critical applications like browsers moving back to the UNIX process-per-task model, containers, ...

In hypervisor-based cloud systems, doing microservices and serverless, with language runtimes that are hypervisor-aware, who cares about monolithic kernels?


> The Switch performance

Nintendo consoles aren't exactly targeting high-performance games. What they are targeting with extreme prejudice is people using their hardware in ways they did not bless; the need to lock down the console probably supersedes everything else.

> userspace drivers like DPDK

Which drops all processing into userland because the large number of system calls needed for traditional network operations is too damn slow. Newer kernel APIs like io_uring are trying to address some of that.

> Project Treble (userspace drivers with Android IPC)

I have an eight core android phone, I do not understand why I need eight cores to run a phone but apparently performance did not hit rock bottom back when feature phones ran on a Java based system.

Everything you mention screams horribly bad performance to me.


That's not really feasible for languages that are fully bootstrapped, where the GC is written in the language itself.

But even if you throw in the towel on that idea, GC isn't one-size-fits-all. If you built a GC that was so tunable that it covered all use cases and profiles then it would probably be as complex to tune as it would be to write your own.

And that said, there is already GC provided by the kernel. We just call that "manual memory management" and wrapping it efficiently for automatic usage is the process of designing a GC for your language runtime.


There are many fundamental decisions which can't really be pluggable.

Can the GC move objects? (Then you must be careful about storing raw pointers across any point where a GC can occur.) Is it the user's responsibility to note when an object changes? (Which can be faster, but needs lots of bookkeeping code.) Are objects kept alive by internal pointers, or only by pointers to their start?
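
For example, the second question is essentially "does the language have a write barrier?". A hypothetical sketch of what that bookkeeping looks like (made-up names, not cppgc's or anyone else's real API):

    #include <unordered_set>

    struct GcObject { virtual ~GcObject() = default; };   // base of all managed objects

    // Objects whose pointer fields changed since the last (incremental) mark step.
    static std::unordered_set<GcObject*> remembered_set;

    inline void write_barrier(GcObject* owner) {
        remembered_set.insert(owner);   // collector will re-scan this object's fields
    }

    // Hypothetical managed-pointer field: every assignment is recorded.
    template <typename T>
    class GcMember {
        GcObject* owner_;
        T* value_ = nullptr;
    public:
        explicit GcMember(GcObject* owner) : owner_(owner) {}
        GcMember& operator=(T* p) {
            value_ = p;
            write_barrier(owner_);      // the extra bookkeeping mentioned above
            return *this;
        }
        T* get() const { return value_; }
    };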


I imagine one could look to the JVM GC interfaces for inspiration; they seem to have made a pluggable architecture for GC. That said, I don't know how beneficial a system-wide pluggable GC architecture would be, considering how much the tradeoffs vary for GC and how intricately those tradeoffs are related to the host programming language (e.g., Go's upfront allocations are quite expensive, but in exchange the collection is super low latency--that works very well for Go, but it wouldn't work so well for languages that idiomatically generate tremendous amounts of garbage (basically any functional language, Java, .NET, Python, etc.)).

You could have a system GC for Go, one for Python, one for Java, etc., which is basically the JVM/.NET approach, except they extended it to the whole VM--not just the GC. However, in practice it seems like applications want different versions of the runtime and increasingly end up shipping their entire runtime with the app anyway (Go just went all-in and made this the default, which is actually really nice from a developer and user perspective).

So while a system GC (give or take plugability) sounds nice, in practice it seems like the ecosystem is moving the other direction.


What if you want to make two languages cooperate? If they both have their own GC, then you can't garbage collect across language boundaries, which can be annoying. I think this is similar to the microservices debate, where you have either one big transactional database or a bunch of disconnected ones where committing a transaction involving multiple services becomes increasingly difficult.


> What if you want to make two languages cooperate? If they both have their own GC, then you can't garbage collect across language boundaries, which can be annoying. I think this is similar to the microservices debate, where you have either one big transactional database or a bunch of disconnected ones where committing a transaction involving multiple services becomes increasingly difficult.

This is an interesting observation. I think in both cases the ideal of having one large system remains elusive, and the more robust, practical solution is RPC (basically more, smaller control loops rather than a single larger control loop). To abstract this out even further, I've noticed that this seems to be a common pattern in how reliable distributed systems are architected--from Erlang's OTP to Kubernetes' controller approach. I'm actually really surprised this "controller" or "control system" architecture vs. monolith architecture isn't discussed more frequently, at least in terms of reliability tradeoffs.


You have GC APIs at the FFI level to let runtimes communicate and pin references across runtimes.

D, Java and .NET provide such APIs.


But do these runtimes allow circular references across language boundaries?


How do you think something like mixing .NET and COM/UWP libraries works?

By mixing GC with the reference counting of pinning such references.

Naturally you can get the reference count wrong.

Still much better than just doing plain old manual memory management, as those are corner cases, while with manual memory management every allocation is a possible leak.

Incidentally, this scenario was exactly one of the ones presented as motivation for Linear Haskell (JVM/Haskell interoperability) by Tweag at the recent functional programming languages conference.


That is how OSes implemented in systems programming languages with tracing GC work: the GC is an OS service.

https://people.inf.ethz.ch/wirth/ProjectOberon/Sources/Kerne...


> Wouldn't it be great if they consolidated over time on a common API?

Given how intrusive this one is compared to a conservative Boehm GC? No, especially not when every GC ends up inheriting design warts like finalize(). I moved my entire Java codebase away from finalize() to the half-dozen dedicated cleanup classes Java provided (that was before try-with-resources).

> wouldn't it be beneficial to have system (kernel) provided gc?

I can only think of giving the kernel the ability to force cleanup in low memory situations as a positive and for that you probably wouldn't have to put the entire GC into the kernel.


> I can only think of giving the kernel the ability to force cleanup in low memory situations as a positive and for that you probably wouldn't have to put the entire GC into the kernel.

...but isn't "low memory situation" the only situation when you'd have to actually do it in the first place (at least thoroughly/more expensively, bailing out once enough is reclaimed)?


> ...but isn't "low memory situation" the only situation when you'd have to actually do it in the first place

Only in a throughput based collector.

This here is a latency-optimized collector. It wants to run many, many small cleanups so that the main-application thread continues forward with the smallest amount of delay. Big cleanups require bigger delays.


Yes, makes sense.


> Without knowing too much about intricacies, wouldn't it be beneficial to have system (kernel) provided gc?

Isn't it already there? You allocate memory with malloc(), free it with free(), and the system takes care of fragmentation and stuff like that. The GCs that we are used to are usually one level higher, but I imagine that you could also go one level lower, where memory management is really manual. For example, allocating all memory at the start of the program and doing everything yourself.


glibc isn't anywhere close to the kernel. glibc (where malloc() / free() live) is 100% userspace.


When a process dies, the OS automatically frees all resources held by the process. You can view that as a form of GC.


Unfortunately, they didn't follow up on the scanning part. How does that work? OK, the allocator is obviously cooperating, but how do you get from scanning a piece of C++ memory to the class of the object?


When scanning the stack, you interpret every word as a potential pointer. Then, you ask your GC's allocator "is this pointer's address in one of your pages?". If it isn't, you're done. If it is, then you check whether it points to the start of an object (there's an "is object start" bitmap for each word on the page). If this passes too, then you effectively reinterpret_cast it to a "GC object" base class which is shared by all GC objects, and call the virtual "trace" function from there.
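
A rough sketch of that filter chain (hypothetical data structures, not V8's actual code): every word on the stack is treated as a *potential* pointer and run through the allocator's own metadata before being traced.

    #include <cstdint>
    #include <unordered_set>

    struct GcObject { virtual void Trace() = 0; virtual ~GcObject() = default; };

    struct GcHeap {
        std::unordered_set<uintptr_t> pages;          // base addresses of GC-owned pages
        std::unordered_set<uintptr_t> object_starts;  // stands in for the per-page bitmap
        static constexpr uintptr_t kPageMask = ~uintptr_t{0xFFF};  // 4 KiB pages (assumed)
        bool ContainsAddress(uintptr_t a) const { return pages.count(a & kPageMask) != 0; }
        bool IsObjectStart(uintptr_t a) const { return object_starts.count(a) != 0; }
    };

    void ScanStackConservatively(const GcHeap& heap, const uintptr_t* stack_bottom,
                                 const uintptr_t* stack_top) {
        for (const uintptr_t* slot = stack_bottom; slot < stack_top; ++slot) {
            uintptr_t candidate = *slot;                      // any word might be a pointer
            if (!heap.ContainsAddress(candidate)) continue;   // not in a GC page: ignore
            if (!heap.IsObjectStart(candidate)) continue;     // not an object start: ignore
            // Looks like a real object: treat it as live and trace its fields.
            reinterpret_cast<GcObject*>(candidate)->Trace();
        }
    }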


Maybe a stupid question, but how would this detect random data byte sequences that accidentally form a valid GC object pointer?


It doesn't. That is what it means for GC to be conservative.


All GC objects have GC headers prepended, from which you can lookup the class and how to scan it.

https://github.com/v8/v8/blob/master/src/heap/cppgc/heap-obj...
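
Roughly, the idea is the sketch below (a made-up layout; the linked file has the real one): the allocator places a small header directly before every object, so given an object's address the collector can step back one header and look up which class it is and how to trace it.

    #include <cstddef>
    #include <cstdint>

    struct GcInfo {                     // per-class metadata, registered at startup
        void (*trace)(void* object);    // knows where this class keeps its pointers
        std::size_t size;
    };

    struct HeapObjectHeader {
        uint32_t gc_info_index;         // which GcInfo describes the payload
        uint32_t size_and_flags;        // object size, mark bit, etc.
    };

    // The header sits directly before the payload the program sees.
    inline HeapObjectHeader* HeaderFromPayload(void* payload) {
        return reinterpret_cast<HeapObjectHeader*>(
            static_cast<uint8_t*>(payload) - sizeof(HeapObjectHeader));
    }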




