
The Azul Garbage Collector - DanielRibeiro
http://www.infoq.com/articles/azul_gc_in_detail
======
Roboprog
Here's what I got out of the article:

* They are inserting special x86 instructions around the object-access instructions the JIT emits, to "trap" uses of references to objects whose cleanup/relocation is in progress. The trap works around the in-progress GC work to "heal" the reference in those rare instances; usually the x86's branch prediction lets the check simply fall through.

* It almost sounds like part of the GC's relocation work uses a mechanism not unlike "transactional memory": either "commit" a block of moves of live objects, or roll them back in case of a conflict caused by the running application accessing/updating/creating something at an inopportune moment.

* One of the diagrams suggests that there are N GC threads corresponding to N application threads. If there is in fact a one-to-one correspondence, rather than just "there are many of both kinds of thread", I wonder if they have thread-specific sub-heaps, and employ some kind of processor affinity to bind an application thread and its corresponding GC "shadow" to the same processor? Maybe that's automatic anyway, based on the memory region in use? Localizing these tasks together might avoid processor cache misses. I may have read much more into one of the diagrams than was really meant, though.

Even if they don't have thread-specific heaps, I like the idea of heaps tied to individual threads, migrating objects/references to a global heap only when they have in fact been shared between threads, or are anchored to some sort of static context.
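As a rough sketch of that first point (entirely hypothetical names and layout, not Azul's actual code): model a reference as a long handle carrying a "may have been relocated" flag bit, with a fast path that falls through when the bit is clear and a rare slow path that "heals" the reference from a forwarding table.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a "heal on trap" read barrier. References are
// modeled as long handles whose high bit (NMT_BIT here) marks "object may
// have been relocated"; the forwarding table stands in for GC metadata.
public class ReadBarrierSketch {
    static final long NMT_BIT = 1L << 62;                        // relocation flag
    static final Map<Long, Long> forwarding = new HashMap<>();   // old addr -> new addr

    // The check the JIT would emit after every reference load: the common
    // case (flag clear) falls through; the rare case traps into heal().
    static long loadBarrier(long ref) {
        if ((ref & NMT_BIT) == 0) {
            return ref;        // fast path: branch-predicted, no extra work
        }
        return heal(ref);      // slow path: fix the reference once
    }

    // Slow path: look up the forwarded address and drop the flag, so later
    // loads of the healed reference take the fast path again.
    static long heal(long ref) {
        long oldAddr = ref & ~NMT_BIT;
        Long moved = forwarding.get(oldAddr);
        return (moved != null) ? moved : oldAddr;
    }

    public static void main(String[] args) {
        long plain = 0x1000L;
        long flagged = 0x2000L | NMT_BIT;
        forwarding.put(0x2000L, 0x9000L);          // GC relocated this object

        System.out.println(loadBarrier(plain));    // 4096: untouched
        System.out.println(loadBarrier(flagged));  // 36864: healed to new address
    }
}
```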

Anybody care to provide an alternate interpretation of some of this?

~~~
rayiner
<http://www.infoq.com/articles/azul_gc_in_detail#appendixa>

^^^ Details how their read barrier works.

------
nobody_nowhere
Has anyone tried this in a high-volume production environment? Sounds very
interesting.

~~~
gthank
Azul's systems are pretty much the definition of high-volume. Nobody (sane)
buys an embarrassingly parallel machine with custom hardware and software just
to host their blog on it.

~~~
nobody_nowhere
Also, it looks like they're now selling software as well as HW solutions.
That's potentially more interesting than when it was just an appliance
solution.

~~~
jacques_chester
They're also selling Azul hardware as a service.

~~~
nobody_nowhere
Ok, cool, sounds like it's time to evaluate them again. Surprised their sales
team didn't call us back about the new offerings!

------
alecco
I wonder how much thrashing of the L1/L2/L3 caches these passes cause. How often, and how bad.

~~~
Roboprog
Some of that would depend upon how much the work can be localized per
processor, since there are multiple caches with multiple CPUs. But, yes, I
very much wonder as well.

"you can't write memcached in Java" <http://roboprogs.com/devel/2010.12.html>

Or can you??? (efficiently, given the right JVM?)

~~~
cagatayk
The linked article is naive about the GC algorithms used in current Java VMs,
which go to a lot of trouble to avoid tracing or scanning the whole heap. Card
marking
([http://www.ibm.com/developerworks/java/library/j-jtp11253/#2...](http://www.ibm.com/developerworks/java/library/j-jtp11253/#2.0))
is one way to avoid this cost. As for building something like memcached, you
can use native memory directly via direct byte buffers in Java; this is
essentially what Terracotta BigMemory (<http://www.terracotta.org/bigmemory>)
does.
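A minimal sketch of that direct-byte-buffer approach (my own toy example, not BigMemory's design): keep the values in a direct ByteBuffer, which the GC never scans, and keep only small offset/length records on the Java heap. A real off-heap cache adds eviction, free-space management, and concurrency control.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Toy off-heap value store: payloads live in native memory via a direct
// ByteBuffer; only the index (key -> {offset, length}) stays on the heap.
public class OffHeapStore {
    private final ByteBuffer arena = ByteBuffer.allocateDirect(1 << 20); // 1 MiB arena
    private final Map<String, int[]> index = new HashMap<>();

    public void put(String key, String value) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        int offset = arena.position();
        arena.put(bytes);                        // copy payload off-heap
        index.put(key, new int[] { offset, bytes.length });
    }

    public String get(String key) {
        int[] rec = index.get(key);
        if (rec == null) return null;
        byte[] out = new byte[rec[1]];
        ByteBuffer view = arena.duplicate();     // independent position/limit
        view.position(rec[0]);
        view.get(out);
        return new String(out, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        OffHeapStore store = new OffHeapStore();
        store.put("greeting", "hello from native memory");
        System.out.println(store.get("greeting")); // prints the stored value
    }
}
```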

~~~
Roboprog
At some point I am going to have to experiment with the java command-line
options and redo my string-whacking benchmark program. Still, I can't help but
think that while these refinements limit the trips made wandering about the
heap, there are a good number of times when all of those gigabytes of pages
have to be marched through the CPU caches, displacing active work, just to
check on things that haven't changed status. Perhaps with enough cores, some
of them are simply left alone the vast majority of the time to do productive
work with data in cache. I'd like to see some measurements of this, and how
the effectiveness is affected by worker threads vs. CPU cores available, as
well as by how many background GC threads there are. Data, anybody???

At any rate, the defaults for Java are slower than those for Perl when doing
many string operations on a single thread. Measurements:
<http://roboprogs.com/devel/2009.12.html>. I have since rerun these tests on a
6-core AMD, with largely similar results. Of course, when using threads or
fork, the comparison breaks down, as these constructs are implemented so
differently between the two languages, to say the least.

