Hacker News new | comments | show | ask | jobs | submit login
Azul's Pauseless Garbage Collector (artima.com)
180 points by DanielRibeiro 2586 days ago | hide | past | web | favorite | 50 comments

> Self healing is a unique capability that, as far as we know, the Azul collector is the only one to do, and it can only be done with a read barrier.

The Metronome GC from IBM Research (http://www.research.ibm.com/people/d/dfb/papers/Bacon05High....) also uses a read-barrier to fix up pointers to objects which have moved.

Although, I don't think it does marking in the read barrier.

Always taking on the most difficult thing instead of waiting until you have no choice but to clean house sounds like the best strategy for java garbage collection, running your life, running a country, you name it.

That doesn't really map to computers, which spend an enormous amount of time waiting and doing nothing. Putting it off lets you collect when nothing is happening, which would otherwise be "wasted" time.

More in general, putting things off frequently means being able to handle them without interruption when the time is right; it's only a problem if the right time doesn't come up frequently enough (too busy). And context switching is pretty much guaranteed to be expensive in every circumstance, be it computers, your life, or a country.

Back to computers again, this would mean that the problem is being too busy, not that you put it off - they'll likely take about the same amount of time no matter when they're handled; if anything, they're more likely to be faster if done all at once rather than bit by bit, due to cache behaviors. At that point you know you have other, larger problems.

ok, when all urgent and important things have been attended to, do the most important, non-urgent thing regardless of how hard it is, until its done or something urgent and important needs doing.

I disagree. Context switches are always expensive. If I devote 60 seconds to washing dishes, then 60 seconds to picking up trash, then to coding, then to mopping, then to laundry... You get the idea. On the other hand ignoring hard (or boring) tasks has other negative consequences. I try to balance my activities and devote reasonable amount of time to each task when there is enough work to be done. Multiplexing poor old me, if you will. Pause the world scenarios are not that detremental as long as your total throughput does not diminish.

On topic: so they traded some large unexpected pauses for a lower throughput rate. I like the lazy object remapping idea. I usually like stability over increased throughput and this scheme seems to provide just that.

Sometimes the most difficult things becomes easier over time, either because it is solved by someone else, or the situation changes where it is not necessary to solve it any more . For example, "Should I rewrite this in C or just wait buy faster hardware next year?"

"... or write a faster compiler because everyone has a secret desire to write a compiler?"

A compiler of course! :) [And then once you have your compiler, like Click did, GC becomes an obvious next hack!]

Not really. you shouldn't be always doing the hardest thing, but the thing that gives you the most value/ROI, or in terms of >1 entity, the thing with which you have the most comparative advantage.

please see my earlier response to the first response to my comment

Sounds like fascinating work. I would have expected big unpredictable performance problems from using a read barrier to lazily fixup references to moved objects, given how expensive each page fault is in a superscalar pipeline.

But it's hard to recommend a vendor who's opaque about pricing. If we have to ask, we probably can't afford it.

Page faults are usually expensive because the bits you need must be read in from disk. In this case the bits you need are still in RAM, and could even conceivably be in cache. It's probably not tremendously more expensive than a cache miss (which is still quite expensive, but hopefully the collector is fixing up pointers fast enough that actual page faults are relatively rare).

They'd be about as expensive as throwing an exception. And the live objects would, I assume, quickly pull all the other live objects into the active pages. I love this hack.

Furthermore they're using 2mb pages instead of the usual 4kb (on x86). That cuts the overhead by a factor of 512 alone.

The remaining problem I see is cache-trashing because on every minor page-fault you have to run a part of your mark&sweep algo, surely kicking out all the application data&code from caches. That's inevitable less efficient than a stop-the-world approach.

On their hardware appliance, they use a special instruction to implement the barrier. For software appliance, the performance may be a real problem.

If i'm not mistaken, the 'read barrier' is only active during a mark of the heap, not all the time. So, considering that marking can be parallelized and there's a good chance a lot of the heap is now dead objects (you only mark live ones) you shouldn't be too affected.

To be fair, if you've got a budget that supports a multi-developer team and you fundamentally need their solution, you can afford their prices. They're not insane, and if you just ask they're not opaque.

On the other hand, did you know you can turn off the garbage collector on CPython?

Could anyone explain the "self-healing" algorithm in simplistic terms?

From what I gathered, when they compact a page of memory, moving all the objects within it to different locations, they will set a marker on all pointers to be "unset". Then, while program execution is still going on, the GC thread will be busily going through the pointers and correcting them to their new locations as necessary, then setting the marker flag. If, during this period, the executing code tries to use an unmarked pointer, a "read barrier" is hit in the VM, and the GC code corrects that pointer ("self-heals"), sets the marker, then allows execution to continue.

Do I have this right? What about the initial unsetting of all these markers? It would seem to require going through all pointers before you want to compact a page, and I would suspect they're being more clever than that.

You don't mark pointers, you unmap a page of virtual memory. This instantly invalidates all pointers to that page without touching them, and allows you to immediately reuse that physical memory by mapping it to a new virtual address. When you eventually dereference a pointer to an unmapped page the processor's MMU throws a page fault. The VM catches the fault and fixes up the pointer on the spot.

What I'd like to know is how you can guarantee that all garbage is eventually collected in a system like this, and how you can guarantee that you've fixed up all pointers to an unmapped virtual page so you can reuse it.

> how you can guarantee that you've fixed up all pointers to an unmapped virtual page so you can reuse it.

Maybe they don't.

amd64 effectively has a 48-bit addressing limit. You can unmap 1,000 4K pages each second for over two years before you need to reuse a page address.

If I'm reading the article correctly, they do fix up all the pointers on the next gc run. For every scanned pointer just check first whether it points to a stale page (and update it), then do the liveness marking. btw, they're using 2mb pages to cut down on page faulting overhead. And they're talking about terabytes of allocations per second (because they're compacting the first gc generation with this scheme, too), so virtual memory deallocation needs to happen somewhat timely. Furthermore at least the linux kernel never deallocates page tables, so no longer used virtual memory is definitely not free.

An interesting idea, but I'd imagine the tables you need to maintain to fixup pointers would become prohibitively large before you ran out of address space.

Perhaps the VM doesn't use tables but a pointer remapping algorithm, something along the lines of (page_address * large_prime_number) mod 2^48.

The problem is you're not just moving whole pages, you're moving and compacting the objects within that page. Each object in the page has a new offset in the new page (or pages).

Only if you had a very large number of undead per live instance.

So does that mean there are conceiveable advantages to using the full 64bit addressing scheme? What about 128 bits?

No, that means that only having 48 bits of physically addressable memory gives you the ability to have many times that amount of virtual memory -- and that virtual memory can be used to implement this fun page-faulting shenanigans without worrying about taking up any real memory address space. You have enough virtual memory for all your physical memory, and these tricks, and then some.

The GC process does a full mark-and-sweep collection, so eventually it will have processed the entire heap. At that point, all garbage that existed when the collection started will have been collected, although new garbage will have been created by the program in the meantime. Similarly, all objects that survived the collection will have had their pointers fixed, and new objects will have had the correct pointers in the first place.

The purpose of the read barrier is to allow the program to keep running while collection is happening. It lets the VM trap access to the parts of memory that the GC is working on, and do little bits of the collection algorithm in the program threads so they don't have to stop and wait for the GC thread to finish what it's doing.

I guess I don't understand how the mark phase can work while the program is running. If the program is constantly modifying references then how does the GC know when it's done marking?

If the program modifies an object that's already been scanned during this mark phase, does the GC have to go back and re-scan that object? If so, then how can you guarantee the mark phase will complete without stopping the program at some point?

That's definitely the hard part. :-)

From the interview: "We will find all the live objects in the heap in a single pass. We will never have to revisit any reference." So, no the GC never has to revisit any object, and it completes when it's scanned all live objects.

The read-barrier is what prevents the program from interfering with the marking phase. That bit about unmapping virtual memory happens later when marked objects are being moved. As the GC walks through memory it's marking references by setting (or clearing) a bit in the pointer. Meanwhile the program goes about modifying objects by copying references - first reading them into a register and then writing them back out to the object's storage on the heap. The read triggers the read-barrier. If the reference being read has already been marked, great. If not, the trap handler marks the reference before allowing the program to continue. The write operation then writes a marked reference on to the heap, so no need for the GC thread to return to that object.

Make sense?

How do you mark a single reference? Don't you have to mark all its children recursively, in the worst case scanning the whole heap (if this cycle just started and the GC thread hasn't gotten far yet) before returning control to the program?

Ah good catch. That detail isn't apparent from the interview. See this paper for more: https://www.usenix.org/events/vee05/full_papers/p46-click.pd.... Basically, references contain flags that indicate whether the GC thread has already visited that reference. When the read-barrier traps, the handler checks the flag. If the reference has been visited already, fine. If not, it gets added to the GC thread's queue of objects to visit. Then the program marks the reference as visited, and continues. New objects are always marked visited, and aren't collected until the following GC cycle, so the current mark phase is guaranteed to complete.

So when you're marking, you're working off a mark stack. You pop an object off the market stack, scan it and mark each of its children, then put any newly-marked children onto the mark stack. Note that the mark stack separates when an object gets marked from when it is scanned and its children marked. So when the read barrier triggers, you can mark the object and put it on the mark stack, as if you'd just scanned its parent. The object's children will get recursively marked when the marker processes the object from the mark stack, but that can happen independently of the read barrier fault.

Marking is concurrent. The read barrier trap only has to mark the immediate reference. If the mutator that triggered the trap proceeds to reference one of the reference's children before the marker thread has gotten around to it, that will trigger another trap.

BTW, you can read more in Cliff Click's USENIX paper from 2005: https://www.usenix.org/events/vee05/full_papers/p46-click.pd...

Ah, that makes things clearer - thanks.

> What I'd like to know is how you can guarantee that all garbage is eventually collected in a system like this.

Me too. If you can't guarantee that, it seems that when you unmap a page of virtual memory, you'd have to make sure never to use that page again. You'd also have to keep around your table of "mappings from old pointers to new pointers" forever, just in case you encounter a lingering bad pointer and need to correct it.

Yes, exactly, in fact I was just editing my post to add that concern! Their garbage collector constantly scans the heap fixing up pointers so you don't have to wait for every reference to hit a page fault, but it seems difficult to guarantee that you're done if the program is twiddling references while the collector is doing its work. Perhaps there is also a write barrier which makes sure to never write a pointer to a collected page.

Most Generational GC's (including Hotspots, which is the origin of Azul's JVM) will have the JIT insert a write barriers for all pointer sets. This is to keep track of cross generational references (so you don't have to scan the entire heap).

The hard part has always been dealing with reads, (which are much more common and expensive to put a software barrier around), and Azul has quite brilliantly figured a way to handle this both in their specialized hardware, and now their VM.

You're probably right. Although he doesn't mention the write barrier, this is usually required for a generation GC.

I think the trick is that after you complete the next mark phase, you have visited every possible pointer to the unmapped page.

Only if the program hasn't written a new reference to the unmapped page in that time. You need to check on writes too.

<quote> Can you explain how Azul's pauseless garbage collector works?

Gil Tene: At Azul, we've been building high performance Java virtual machines (JVM)s for the past eight years now. We started the company in 2002, and we aimed to solve several scalability and infrastructure problems for Java back then.... </quote>

Anyone else really hate then people don't actually answer the question? They should have edited that useless part out.

The whole rest of the article is him explaining how it works!

Additionally, "we started a company to make Java faster and garbage collection was around #3 on our list" is informative. There's only one sentence that could really be cut ("founded in 2005" or somesuch.)

No, they should have just edited out the question at the beginning, seeing as the article isn't in a Q&A format anyway. Then the company history wouldn't have felt out of place.

a little corporate history should be okay if the rest of the article is actually answering the question.

Wwwwow. The Boehm-Demers-Weiser GC can't move objects because, you know, C is gnarly and you can't really trust anything to update pointers in-place. But what if you could? This work seems to indicate that you could have a compacting collector with arbitrary C programs, given the extra hooks they added to Linux (http://lwn.net/Articles/392307/).

Azul continues to impress.

The guy says his company solved some difficult engineering problems, but this is not language of an engineer. "On Zing they generally tend to be below a millisecond." - what on earth does it mean? He tested it? How? What are the results? What is the confidence interval?

What happens with objects that can't fit on a single page, such as a large array?

The obvious answer is to use more than one page. Why wouldn't that work with a special case for "large allocations", like most heap implementations?

Good idea. Now that I think about it, a large object is trivial in this sysyem. You dedicate some group of contiguous pages solely to the object. This results in some potential waste of the last page, which is probably acceptable. You would not ever need to perform compaction in those pages, just pointer remapping. When the object eventually dies, you release all the pages at once.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | DMCA | Apply to YC | Contact