

Dealing With JVM Limitations in Apache Cassandra - pron
http://nosql.mypopescu.com/post/17940621304/dealing-with-jvm-limitations-in-apache-cassandra

======
shin_lao
_Cliff Click: Many concurrent algorithms are very easy to write with a GC and
totally hard (to downright impossible) using explicit free._

Wrong, we did it.

~~~
sparky
Did what, exactly? The 'impossible' bit may be meant in more of a practical
sense than theoretical, since obviously you can just implement your own
garbage collector if it comes to that, but it's well established in the lock-
free algorithm literature that algorithms that assume garbage collection are
less complicated than those that use some other SMR (safe memory
reclamation) scheme like hazard pointers, epochs, or Pass The Buck. My own
experience porting ConcurrentSkipListMap from JDK 6 to C supports this.

Reference counting is usually not too bad from a complexity standpoint, but
doesn't perform well for contended structures and has the cycle-collection
problem, which may or may not matter depending on the algorithm.
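For anyone unfamiliar with the cycle-collection problem mentioned above, here
is a minimal C++ sketch (using `std::shared_ptr`; the `CycleNode` type is made
up for illustration) of two refcounted objects keeping each other alive after
every external reference is gone:

```cpp
#include <cassert>
#include <memory>

// Two refcounted nodes that point at each other never reach a count of
// zero, so neither is ever freed. A tracing GC would collect both.
struct CycleNode {
    std::shared_ptr<CycleNode> next;
};

bool cycle_leaks() {
    std::weak_ptr<CycleNode> observer;   // watches without owning
    {
        auto a = std::make_shared<CycleNode>();
        auto b = std::make_shared<CycleNode>();
        a->next = b;          // a keeps b alive
        b->next = a;          // b keeps a alive: a reference cycle
        observer = a;
    }                         // locals a and b are destroyed here...
    return !observer.expired(); // ...but the nodes survive: count never hit 0
}
```

Whether this matters depends, as noted, on the algorithm: acyclic structures
(lists, trees with parent-to-child-only ownership) never hit it.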

I guess my point is: if you've solved this problem, the rest of us would love
to hear how you did it :)

~~~
shin_lao
Our software is written in C++11 and we are highly asynchronous AND lockfree.

This was neither hard nor impossible, and surprisingly (<-- irony) we have
excellent performance with a very low memory footprint (in other words, all
the problems discussed in the post don't exist for us).

We have a custom memory allocator for scalability. As for memory management,
we either avoid copying data altogether (zero-copy) or, when we have to track
a buffer's lifetime, we use reference-counted memory (which does not have the
problems you point out when correctly implemented and used).

We don't yet have a MapReduce-like feature, but it is coming.

Actually the most difficult part was the network layer.

We have a workable version with which you can play here:
<http://www.wrpme.com/> Feel free to contact me in private if you have more
questions.

~~~
sparky
Thanks for the pointer, I look forward to playing around with wrpme. For
context, my work on lock-free algorithms and data structures targets
processors with 128-1024 cores. I'm curious about your take on reference
counting; my statement was not meant to be inflammatory, but was based on
something I consider fundamental about reference counting: refcounting a data
structure requires, at minimum, an atomic increment and decrement for each
element you traverse that must not be freed for your algorithm to be correct.
(You can elide the inc/dec in some cases if your algorithm can detect and
recover from freed elements, but these elisions usually cost you either added
algorithmic complexity or frequent traversal restarts.)
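To make the per-element cost concrete, here is a sketch in C++ of an
intrusively refcounted list traversal. All names are made up, and this is a
cost-model illustration, not a complete concurrent list (safely acquiring
`next` under concurrent removal needs extra protocol, which is exactly the
complexity being discussed):

```cpp
#include <atomic>

// Node with an intrusive atomic reference count.
struct Node {
    explicit Node(int v) : value(v) {}
    int value;
    std::atomic<int> refs{1};   // one reference held by the list itself
    Node* next{nullptr};
};

void acquire(Node* n) { n->refs.fetch_add(1, std::memory_order_relaxed); }

void release(Node* n) {
    // The last decrement must synchronize before the node is freed.
    if (n->refs.fetch_sub(1, std::memory_order_acq_rel) == 1)
        delete n;
}

// A traversal pays one atomic increment and one atomic decrement per
// element visited -- the per-element overhead described above.
int sum(Node* head) {
    int total = 0;
    for (Node* cur = head; cur != nullptr; ) {
        acquire(cur);            // pin cur so a concurrent remover can't free it
        total += cur->value;
        Node* next = cur->next;
        release(cur);
        cur = next;
    }
    return total;
}
```

For a cheap per-element operation like summing integers, those two atomic
operations can dominate the loop body, which is the 100-200% instruction
overhead figure below.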

Given that traversals require more or less one inc/dec per element, the
ultimate efficiency of your algorithm is bounded by the throughput of atomic
operations to a small number of memory locations (pretty low on x86 where they
require cache-to-cache transfers, higher on something like a GPU where atomic
functional units are colocated with the last-level cache), and by the amount
of other work you're doing per element. For something like a linked list (or
skip list) traversal, the inc/dec represents a 100-200% increase in the number
of instructions per element, since the amount of other work is so low. In
contrast, on a garbage collected system you can vary the collection frequency
to achieve arbitrary amortized overhead.

So the overhead of refcounting is pretty variable, and depends on:

  * Whether you can elide refcounts in some cases
  * How much other work you're doing per element traversed
  * The contention distribution across the elements in your data
    structure (the first few elements of a linked list, for example,
    are highly contended because everybody starts there)
  * How many concurrent accessors you have
  * How often the accessors are hammering on the data structure
    (vs. doing other things in the program)
  * The throughput of atomic operations on your platform

To my mind, none of these have to do with 'correct' implementation or usage,
they simply depend on the algorithm and the application. If your application
consists of 1000 processors maintaining a sorted set of integers in a skip
list, I don't see an easy way around making refcounting perform well. What am
I missing?

~~~
shin_lao
You're correct about the reference counter, the trick is that the compiler can
reduce the manipulations on the counter drastically with copy elision. C++
compilers are _very_ good at it. That's the easy part.
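As a sketch of what "the compiler reduces counter manipulations" means in
practice (this is illustrative C++, not wrpme's actual code): return-value
optimization, pass-by-const-reference, and `std::move` all avoid the atomic
inc/dec pairs that a naive pass-`shared_ptr`-by-value style would generate.

```cpp
#include <memory>
#include <utility>

std::shared_ptr<int> make_value() {
    return std::make_shared<int>(42);   // RVO: no extra inc/dec on return
}

// Passing by const reference touches the counter zero times.
int read(const std::shared_ptr<int>& p) { return *p; }

// Transferring ownership with std::move also skips an inc/dec pair.
int consume(std::shared_ptr<int> p) { return *p; }

int demo() {
    auto p = make_value();            // count == 1
    int v = read(p);                  // no atomic operations on the count
    return v + consume(std::move(p)); // ownership moved, still no inc/dec
}
```

The counter is only touched when a new long-lived owner genuinely appears,
not on every function call along the way.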

Now our implementation is a graph of small services (we call them
nano-daemons) that slice a request into elementary tasks. The nano-daemons
communicate with each other using messages that are copied around. This means
a little memory overhead but absolutely no access contention. As you know, C++
is very good at using the stack for that kind of allocation, and the stack is
FAST.

The actual content is only copied in the final phase, into the network card's
buffer. Reference counting is used to make sure the data is kept alive long
enough, even if a parallel removal request arrives in the meantime. The
content isn't actually manipulated in parallel; an update implies the creation
of a new buffer, further reducing contention.
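That read-mostly scheme can be sketched like this (names are hypothetical and
this is not wrpme's code): readers pin the current buffer through its
refcount, while an update publishes a brand-new buffer rather than mutating
the shared one.

```cpp
#include <atomic>
#include <memory>
#include <string>

class Entry {
public:
    // One refcount increment pins the buffer for this reader, keeping it
    // alive even if a parallel update or removal swaps it out meanwhile.
    std::shared_ptr<const std::string> get() const {
        return std::atomic_load(&buf_);
    }

    // Each update creates a fresh buffer; readers still holding the old
    // one are unaffected, so content is never mutated in parallel.
    void update(std::string data) {
        std::atomic_store(&buf_,
            std::make_shared<const std::string>(std::move(data)));
    }

private:
    std::shared_ptr<const std::string> buf_ =
        std::make_shared<const std::string>();
};
```

The only shared atomic traffic here is the refcount bump in `get()`, which
matches the "a little contention, in theory" claim below.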

So in other words, if you have many parallel gets on the same data, you have,
in theory, a little contention caused by the reference counter being
incremented.

In practice, reference counting doesn't show up in profiling (so far).
Actually most time is spent in the asynchronous callbacks waiting for I/O
completion, something which we are very proud of! :)

We've benchmarked our software on a grid of 200 servers with 12 cores each and
it does much better than other products, especially for entries larger than 8
KiB.

If you think about it, you realize that somehow, at some point, you will have
a bottleneck, whatever design you choose.

