
What sort of allocation rates can servers handle? - r4um
http://stuff-gil-says.blogspot.com/2014/10/what-sort-of-allocation-rates-can.html
======
thoughtpolice
> When they can choose (and there is a choice) to use collectors that don't
> pause, the pressure to keep allocation rates down changes, moving the "this
> is too much" lines up by more than an order of magnitude.

You mean - like choosing the only garbage collector on earth that can
actually do this - for the JVM only - which your old company wrote and keeps
guarded as a heavily proprietary secret? The currently published
specification for the design (the C4 paper) requires untenable, unmergeable,
and extensive hacks in the operating system's virtual memory layer that will
never be widely accepted, and the current design in your product (which
alleviates this) is again a heavily guarded secret that will never be
revealed, will never have source available, and costs a shitload of money.
Which basically means Azul is going to have the only available pauseless GC
for the foreseeable future?

Well, now that you said it like that - you're right! We really _do_ have a
choice... a choice of actually using and working with things that exist and
are widely available today, or already living in a land of unlimited money for
licensing of a _single_ product that might be completely and totally
irrelevant to most of us. I guess the 'choice' isn't so clear, when you say it
like that.

Look, that aside, great blog post, and extremely informative. I'm a giant fan
of Gene's work, seriously. But this work is Azul's brainchild and they'll
never let it go. And I don't blame them for that. But drop the rhetoric,
please - few people have the 'choice' of spending shitloads of money on a
proprietary JVM like Azul.

~~~
wahern
I don't think that there's anything magical about C4. All the basics are
discernible from this presentation:
[http://www.youtube.com/watch?v=5uljtqyBLxI](http://www.youtube.com/watch?v=5uljtqyBLxI)

Azul originally started out as a hardware company. A pauseless collector is
easy as long as you have transactional memory (at a minimum hardware which
supports strong LL/SC, allowing you to build the rest in software). No
commodity CPUs support this, then or now, so Azul's first product was a
specialized CPU and memory architecture with real transactional memory
semantics.

But nobody wants to buy specialized hardware these days. So they ended up
having to emulate transactional memory on x86. That means _emulating_ the
memory architecture of their original CPU. Again, there's no secret sauce to
doing this. Things like QEMU are free--which is to say, many people know
how to do it. It just takes a lot of programmer hours to build it, and many,
many more to optimize it.

Long story short, Azul's collector is truly pauseless, but its transactional
throughput in the vast majority of workloads sucks because it's an emulator
(for the JVM) built atop another emulator (for their original memory
architecture). And while you can speed up emulating the JVM using JIT
techniques, that doesn't work for emulating a transactional memory
architecture. There are endless tweaks you can make, including playing around
with memory page tables, etc, but at the end up the day you're still stuck
emulating the semantics you need, causing huge latencies relative to your
clock speed, which already suck on Ghz scale systems.

This won't change anytime soon. Even Intel's new TSX features are not
strong enough to be able to implement true transactional memory. Real
transactional memory requires a way to atomically signal contention for every
word of memory, or at least lines of words. It requires not only a specialized
CPU, but specialized memory controllers, and specialized memory. Something
like TSX would only provide a marginal performance improvement for C4, if
anything at all, because almost all their effort goes into arranging to avoid
locking, not making locking faster.

~~~
giltene
While it's true that there is nothing magical in C4, pretty much everything
you describe here is wrong (as in the exact opposite of what is going on).
Specifically:

\- C4 makes no use of hardware transactional memory.

\- C4 works great (and natively) on commodity x86-64 cores.

\- C4 does not emulate anything (not memory architecture, not HTM).

\- C4's transactional throughput kicks ass.

If you want to know how C4 works, all you need to do is read the actual C4
ISMM paper:
[http://www.azulsystems.com/sites/default/files/images/c4_pap...](http://www.azulsystems.com/sites/default/files/images/c4_paper_acm.pdf)
<And yes, I'm one of the authors>

How you got to your posted conclusions from the link you posted is a mystery
to me. Cliff doesn't say anything much about the GC mechanism there: he is
talking about lessons learned in designing custom hardware. And yes, Vega had
some cool GC-assist features, but they are in no way part of the C4 collector
mechanism.

Since you raised the bogus, unsubstantiated assertion that "[Zing's]
transactional throughput in the vast majority of workloads sucks" based purely
on your mis-interpretation of what C4 actually does, I feel I must correct
that notion.

To do that, I'll point out that Zing (with C4) is currently used, in
production, to run some of the highest transactional throughput and most
latency sensitive applications in the world. Zing's sustainable throughput
(the throughput at which the system still meets the required SLA) is generally
dramatically better than other JVMs running on exactly the same modern x86
hardware. Yes, you read that right: Zing gets more production-worthy TPS out
of the same x86 hardware. Not less.

This is why people who actually need good throughput and good latency tend to
graduate to using Zing, and start enjoying time in their hammocks as a result.
See [http://mail.openjdk.java.net/pipermail/hotspot-gc-use/2014-O...](http://mail.openjdk.java.net/pipermail/hotspot-gc-use/2014-October/002045.html)
for a happy example.

C4 is being used in everything from low latency trading (Algo, HFT, smart
routing wire risk) to Online Retail (think black Friday workloads) and travel
sites. C4 is used for everything from 1GB compute-heavy and messaging
workloads to 1TB data-heavy analytics applications. It powers big data
workloads. It powers search. It powers Java servers of all kinds, big and
small. And none of those are complaining about throughput. Quite the opposite.

Peace.

~~~
wahern
You're wrong.

If you read their published algorithm and are familiar with lock-less
algorithms, it's clear that theirs is a transactional memory algorithm.
Specifically, their LVB primitive. If this isn't obvious to you, I would
recommend reading the seminal transactional memory algorithms from the 1970s
and 1980s, including everything written by Hoare. Most of those are available
from the ACM library. You particularly need to pay close attention to how
wait-free algorithms are achieved.

"Transactional memory" is not a marketing term, nor a synonym for a particular
set of CPU instructions. It's a class of lock-free algorithms, especially
lock-free, wait-free algorithms. And the C4 collector very clearly fits into
that class of algorithms. Its use of page remapping and read/write page
protections is precisely how you would emulate strong transactional memory
primitives on x86.

I think this terse quotation (from their own research paper) sums up the
relationship between the Vega hardware and the Linux software-based
implementation:

"Azul has created commercial implementations of the C4 algorithm on three
successive generations of its custom Vega hardware (custom processor
instruction set, chip, system, and OS), as well on modern X86 hardware. While
the algorithmic details are virtually identical between the platforms, the
implementation of the LVB semantics varies significantly due to differences in
available instruction sets, CPU features, and OS support capabilities."
([http://www.azulsystems.com/sites/default/files/images/c4_pap...](http://www.azulsystems.com/sites/default/files/images/c4_paper_acm.pdf))

Regarding the performance of C4, the reason Azul doesn't publish TPC
benchmarks is because there's no avoiding the immense costs of their page
mapping hacks. From the paper above: "the garbage collector needs to sustain a
page remapping at a rate that is 100x as high as the sustained object
allocation rate for comfortable operation."

Page remapping is insanely expensive at the micro-granularity needed. They
mitigate the cost by batching requests, but it's still significant.
Furthermore, they must use atomic reads and writes for internal pointers.
Those are cache killers.

I never said C4 can't be faster for particular workloads. Obviously for
workloads sensitive to latency a pauseless collector can be faster overall.
But as a general matter, those workloads are not in the majority. Ergo, for
the majority of workloads C4 will not be faster, at least not on commodity
hardware architectures.

You can continue to believe the hype, and believe that Azul possesses some
sort of magical fairy dust, using techniques entirely beyond the comprehension
of mere mortals. Or you can read about and learn how it _actually_ works.
Their algorithm and implementations are all laudable and significant
achievements. But there's nothing magical or secret about them.

~~~
giltene
> You're wrong.

Well. One of us is wrong.

I created the C4 algorithm. And I wrote the paper. I'm pretty sure I know what
it does.

:-)

You raise another "I think this is how it works and therefore it must be slow"
point in this posting that you didn't raise before, so let me address it to
avoid confusion:

> "...Page remapping is insanely expensive at the micro-granularity needed."

If you actually read the paper (e.g. Table 2), you would have realized that
C4 uses a page remapping mechanism that (in 2011) could sustain >700TB/sec
of page remapping speed. Using my recommendation that the remapping rate
needs to be able to handle 100x the allocation rate, that's enough to keep
up with 7TB/sec of allocation. So I think we have some headroom. The numbers
have only gotten better since with Zing's loadable module.
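(As a back-of-the-envelope restatement of the arithmetic above - my own
sketch, using only the figures quoted in the comment, not anything from
Azul's code:)

```java
public class RemapHeadroom {
    public static void main(String[] args) {
        double remapRateTBps = 700.0; // sustained page-remap rate cited from Table 2 (2011)
        double safetyFactor = 100.0;  // remap rate should cover ~100x the allocation rate
        double sustainableAllocTBps = remapRateTBps / safetyFactor;
        System.out.println(sustainableAllocTBps + " TB/sec"); // 7.0 TB/sec
    }
}
```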

~~~
wahern
To be more specific, when I said that C4 "emulates" the Vega architecture, I
was referring to the way that you presumably need to be able to invalidate
an operation when another thread concurrently reads or writes shared memory
in the middle of a transaction (i.e. copying a block and updating pointers). By
emulate, I meant you need to construct a primitive that provides an efficient
lock-free large block copying operation, similar to what the Vega cache system
provides. Of course, that's only a primitive--you need to then build compound
operations atop that, and there's more than enough engineering there to keep
people busy for quite awhile.

Again, if you can claim that in fact your page mapping and protection schemes
are not analogous to the transactional memory support in the Vega hardware
(which looks to be both small block transactional memory tied into the cache
controller, as well as some useful page-level operations built into the memory
controller), then mea culpa.

~~~
giltene
Sheesh. Read the C4 paper. See if you can point to a single place in the paper
where we mention transactional memory. Or a need to invalidate an operation in
the middle of some transaction. Or any form of emulation. C4 simply doesn't do
or make any use of that stuff.

You seem to conflate Vega's [very cool] hardware transactional memory
capabilities (SMA, OTC) with GC. Vega never used transactional memory for GC
purposes. It used transactional memory to support OTC and transparent
concurrent execution of synchronized blocks. Nothing to do with GC, and
nothing to do with C4.

And yes. I can assertively "claim" that the page mapping and protection
schemes in C4 are not analogous to the transactional memory support in Vega,
and have nothing to do with caches or memory controllers. Vega (just like C4
on x86) used page mapping and protection schemes for GC purposes.

------
ougawouga
Bodes well for Rust as a server language. No GC, and almost as pleasant to
write as a high level language. And with memory safety and a compiler that
catches errors like crazy.

------
robotresearcher
It looks like this is in the specific context of Java application servers. Is
that correct? Can anyone explain how the advice generalizes?

~~~
jallmann
> Can anyone explain how the advice generalizes?

The factors constrained by the hardware can be generalized -- cacheability,
memory bandwidth, etc. The rest is dependent on the performance
characteristics of the runtime, especially the GC. While GC is inapplicable to
languages with manually managed memory, allocator performance and
fragmentation can still be problematic -- but this is usually much easier to
tune. OCaml, which has a very well-designed generational GC, is designed for
extremely high allocation rates in its minor heap, to the point where the
hardware is the bottleneck -- although compaction of the major heap is still a
concern sometimes.

While a lot of the recent research into "mechanical sympathy" is fascinating,
a lot of it seems driven by the problems brought by high-level languages such
as Java. If you're at the point where L1/L2 hit rates are a concern, use a
language with inherent control of the memory layout of data structures. The
kind of engineering done by Azul, etc is really about making applications
utilize the full capabilities of the hardware without being saddled by the
underlying platform (eg, JVM).

------
cperciva
Can we have s/allocation/memory allocation/ in the title here? It doesn't make
much sense otherwise.

~~~
aristidb
What other allocation rates would you immediately think of? Of course
allocation rate can mean other things, but memory allocation was the first I
thought of.

~~~
amelius
Well, it still doesn't make much sense to me. The article speaks of an
allocation rate of 400 MB/sec. However, AFAIK, a memory allocator takes time
proportional to the number of allocations, NOT the total amount of memory
allocated. Or does the "B" stand for block, and not byte somehow? Confused.

~~~
TheLoneWolfling
> However, AFAIK, a memory allocator takes time proportional to the number of
> allocations, NOT the total amount of memory allocated.

Depends on the allocator. On a C-style malloc? "Yes". On a Java-style new? No.
Because Java is designed to be sandboxed, and as such cannot allow
uninitialized memory to exist. And as such has to make sure that when you
create that 1000-element array that all 1000 elements are null. And as such,
takes time proportional to both the number of allocations and the amount of
memory allocated.
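(The zeroing requirement is easy to observe: Java guarantees a fresh array
comes back filled with the type's default value, so the runtime has to
initialize every element - or zero the backing pages - making the work grow
with the size of the allocation, not just the count of allocations. A
minimal illustration, my own:)

```java
public class ZeroedAllocation {
    public static void main(String[] args) {
        // The JVM must hand back a fully zeroed array: all 1000 slots
        // are initialized, so cost scales with allocation size.
        long[] a = new long[1000];
        long total = 0;
        for (long v : a) total += v;
        System.out.println(total); // 0 - every element was initialized to 0L
    }
}
```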

~~~
stormbrew
Er, this is conflating two different things. Initialization is always linear
in the number of objects unless it's just not done, but generally speaking even
in C or C++ you will still need to perform some kind of initialization on
those elements. But really, this has nothing to do with the allocator, which
is at a lower level.

Malloc is linear in the number of allocations. Automatic tracing garbage
collectors are linear in the live set (because they tend to only explore live
objects, and there are techniques to make disposing of the non-live objects
happen in less than linear time[1]). This can really go either way as to which
is better, depending on your allocation rate, which is why this is a point of
contention.

But none of that has to do with how long it takes to initialize the objects.

[1] Semi-space and variants, which are usually used in the allocation-rate-
heavy young or nursery generation. Of course if there's extensive use of
finalizers you still have to touch those objects to deallocate them.
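To make footnote [1] concrete, here's a toy semi-space sketch (entirely my
own construction, not code from any real JVM): allocation is just a pointer
bump, and collection copies only the live objects into the other half-space,
so garbage costs nothing to reclaim.

```java
// Toy semi-space sketch: allocate() is O(1) per object regardless of
// garbage, and collect() is linear in the live set only.
public class ToySemiSpace {
    private int[] from;   // current allocation half-space
    private int[] to;     // reserve half-space
    private int top = 0;  // bump pointer

    ToySemiSpace(int capacity) {
        from = new int[capacity];
        to = new int[capacity];
    }

    // Allocate n words; returns the offset, or -1 if the space is full.
    int allocate(int n) {
        if (top + n > from.length) return -1;
        int offset = top;
        top += n;          // the entire cost of allocation: one addition
        return offset;
    }

    // Evacuate the live objects (given as [offset, length] pairs) into the
    // reserve space and flip; anything not copied is reclaimed for free.
    // (A real collector would also fix up pointers to the survivors.)
    int[] collect(int[][] live) {
        int newTop = 0;
        int[] forwarded = new int[live.length];
        for (int i = 0; i < live.length; i++) {
            System.arraycopy(from, live[i][0], to, newTop, live[i][1]);
            forwarded[i] = newTop;
            newTop += live[i][1];
        }
        int[] tmp = from; from = to; to = tmp; // flip the half-spaces
        top = newTop;
        return forwarded;  // new offsets of the survivors
    }

    int used() { return top; }

    public static void main(String[] args) {
        ToySemiSpace heap = new ToySemiSpace(100);
        heap.allocate(10);                  // offset 0; becomes garbage
        int b = heap.allocate(20);          // offset 10; stays live
        heap.allocate(30);                  // offset 30; becomes garbage
        heap.collect(new int[][] { { b, 20 } }); // only b is copied
        System.out.println(heap.used());    // 20: cost tracked the live set
    }
}
```

This is why an allocation-heavy nursery favors this scheme: a high death
rate means collect() touches almost nothing, no matter how fast you
allocated.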

