Hacker News new | past | comments | ask | show | jobs | submit login
[dupe] A Hardware Accelerator for Tracing Garbage Collection (ieee.org)
57 points by andrekandre 4 days ago | hide | past | web | favorite | 23 comments

I'm skeptical that cpu usage is really the problem with multicore CPUs. I would think it is much more about concurrency.

I'm also a bit surprised there isn't more of a transition to modern C++ for problems where so much time is spent fighting the garbage collection. Ownership and memory management in C++11 and up really isn't a big deal. Keep your allocations to a minimum, make sure you are moving instead of copying. Shared pointers are there, but I actually never use them.

You would have to spend an awful lot of time fighting garbage collection to make C++ worthwhile in an organization that isn't already well-versed in C++ development. The learning curve is massive, the builds are slow, the errors are terrible (although better than they once were), and the tooling fights you every step of the way (CMake still makes me shudder).

Even if you use C++ you can still benefit from hardware allocator acceleration. Here's a paper on hardware acceleration of tcmalloc. http://www.samxi.org/papers/kanev_asplos2017.pdf

The new garbage collector on Java 12 is fine for almost every case. Hardly ever pauses more than a few milliseconds. Shenandoah looks promising as well.

At least in Java land, within in the few years GC will be something you don't have to think about unless you need sub ms promise

Isn’t that the same with as C? You can use alloca to allocate on the stack, giving you the same benefit as RAII (I think?). I haven’t used C++ so maybe I’m speaking out of my ass though.

But regardless, you could use the Boehm GC for objects you don’t feel like tracking the lifetime of.

RAII can help you protect heap memory (and other resources besides memory) because it's more about registering code to run on the end of object lifetimes rather than just the memory being deallocated. Alloca isnt really a great option for async flows for instance, to say nothing of file descriptors, or abstract concepts like 'the idea of this mutex being locked'. Alloca is also terrible for memory passed on to other threads obviously..

Albeit most of that is moot when compared to a GCed language like the global discussion.

Stack memory doesn't need to be managed regardless. With boehm gc you are going to run into the same problems of gc. RAII works well for lots of things - heap memory, mutex locks, reference counts, files, io, etc.

This is one of those cases where active memory might prove extremely useful.

Imagine store instructions that can tell memory some metadata about what's stored. "This is a pointer in an object at offset O to another object". "This is an object with a header you can use to paint objects". "This is no longer an object or a pointer". Also imagine instructions like "clear all object paint", "do a one layer trace", and "report unreachable objects". The one-layer trace would have every cell that has a pointer paint the pointed to object with either a "reachable by a reachable object" or "reachable by an object whose reachable state is unknown".

Then GC could proceed as follows: clear all object painting, use the CPU to scan roots to the first objects on the heap and mark those reachable, do two runs of one-layer traces, report unreachable objects and reclaim them.

Something like that.

What if memory was so cheap and there were other read and write paths for words that you could have a multitude of extra metadata associated with a location. I have toyed with a language that had bidirectional references so a piece of referenced data would know every originating pointer. It would then be an O(#pointers) operation to move a piece of data on the heap.

The thermal budget of something like that would be bananas I would think.

It needn't all happen at once. Memory talking to memory is still subject to bandwidth limits. It would take time, but not CPU, and it should run faster than a CPU could do it.

That would be pretty awesome if a compiler could infer the lifetimes of (some) objects. It would especially help for paradigm that involve a lot of short-term allocations, like FP.

Anyone have some info about how the Vega from Azul did gc acceleration? Came up in a discussion awhile back and we couldn't seem to find any technical details.

Mainly boils down to write barriers with interesting semantics, and an MMU that let's you in user space update a subset of the permissions without a transition to user space so you can quickly mark pages as "under GC migration" and know you'll get traps if another thread/core tries to modify the references on that page out from under you.

I touch on it here: https://news.ycombinator.com/item?id=19650813 which includes a link to their paper on it.

Would CPU support for tagged memory make this easier?

"Those who don't know Lisp Machines are doomed to reinvent them, poorly."

Eh, this is pretty different than Lisp Machines of old.

Most of the minicomputers they we're competing with were more or less true von Neumann architectures without caches. That meant that one those systems the I fetch of the interpreter itself was constantly competing for memory bandwidth with the actual Lisp code it's interpreting and it's data. By pushing the interpreter into microcode, you've created a Harvard arch where the interpreter can fetch out of uCode entirely and not pollute the main system bus. IMO this is why Lisp machines dies off coinciding with the usage of caches (particularly I caches) in those kinds of computers. At that point the hot path of the interpreter can totally live in I cache, so it's not polluting system bandwidth the same way anymore, and now the biggest reason in favor of a dedicated lisp machine is moot.

This work in the article is actually really cool and goes way beyond any lisp machines by off loading GC to another independent unit that runs in parallel to the main CPU, interleaving into the regular system memory bandwidth in a clean way.

Why don't you provide some comments on how Lisp machines implemented GC acceleration, and how the approach outlined in this paper is worse? Go...

Memory allocation management should not be a big deal, and it is not the kind of problem that can easily be solved at the language level.

Correct memory management needs information on the allocation context, in some cases, it can be handled automatically by a framework (like a game engine) or manually by the coder, but it is really cumbersome to force this into the language...

In my opinion, garbage collection is a bane.

Functional programming can’t really exist in any meaningful way without GC. If you’re not a FP developer, it’s easy to think that memory allocation falls into two categories: a) short-term objects that can be allocated on the stack, or b) long-term objects that the lifetime of can be tracked easily. But with FP a lot of short-term garbage is created that isn’t known at runtime, thus necessitating GC.

I program professionally in Scala and without GC, the paradigm of strong FP would be (almost) impossible. Fortunately, the JVM handles this pretty exceptionally.

I think a GC in hardware is a fantastic idea and hope Intel and others incorporate it into x86-64. I don’t know how practical this is as I’m not a hardware guy, but a hyper-efficient pauseless concurrent GC that could keep garbage out of the cache sounds like a massive win in my book.

> that isn’t known at runtime

Typo: s/runtime/compile time

Also, this isn't particular to FP; every style of programming requires dynamic memory management for real-world programs (modulo pedantry about embedded programming).

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact