
Wait-free queueing and ultra-low latency logging - mortoray
http://mortoray.com/2014/05/29/wait-free-queueing-and-ultra-low-latency-logging/
======
emerson_clarke
Another approach is just to use two stacks, one for writing and one for
flushing.

User threads write log lines directly to buffers obtained from an allocator,
usually via a TLS-mediated stream. The use of an allocator avoids locking on
system calls during memory allocation and minimizes copying between user code
and the eventual flush to disc/network. Buffers are pushed onto the write
stack using an atomic CAS; if no buffers are available from the allocator,
the user thread may spin or force a flush in the same way as the flushing
thread.

A single flushing thread watches the write stack on a timer. When the stack
reaches some threshold, the thread uses an atomic CAS to swap the head
pointers of the flush and write stacks, then enumerates the flush stack,
writes all buffers to disc/network, and frees the buffers back to the
allocator (again using atomic operations). It is subtle, but done right,
user threads and the flushing thread interact optimally in response to
demand.

This solution is much more flexible than a ring buffer. In testing it was
also much simpler and faster than complicated patterns like the Disruptor,
and competitive with expensive hardware logging solutions.

It recognizes that much of the overhead in logging comes from expensive
copying of log data in memory, and it ensures that minimal context switching
takes place, which is essential if you are not to defeat the entire point of
fast lock-free algorithms.

Unlike a ring buffer it has no blocking or performance degradation when the
buffer gets full, and it requires no large chunk of memory to be permanently
allocated, although the allocator may periodically allocate new temporary
memory if its buckets are full or if a log line is too large for the maximum
bucket size.

What you end up with is a logging system where user threads are minimally
impacted during writes and throughput is able to max out the disc/network.

~~~
mortoray
The problem with this approach is that it requires coordination on when the
swap of the two stacks is done. Using CAS doesn't really help. The consumer
doesn't know if the producer is currently writing into the stack or not. It
still needs another mechanism to determine when it is safe to read from that
stack.

~~~
emerson_clarke
I think you misunderstand the CAS operation. It performs an atomic compare and
swap on a pointer (32 or 64 bits), so obviously it requires no coordination to
switch out the head pointer of a stack.

For example, on an LP64 architecture:

    
    
      // CAS(address, newValue, comparand) atomically stores newValue when the
      // current value equals comparand, and returns the previous value.
      // Detach the whole write stack by swapping its head with null.
      Item * first = writers.first;
      while( CAS((long *)&writers.first, (long)0, (long)first) != (long)first )
      {
          first = writers.first;
      }
    

Here first represents the flush stack, and writers represents the write
stack. We just swap the head pointer of the write stack with a null pointer;
the swap only succeeds if no other thread has modified the head in the
meantime. This works because pushing to the stack is performed using a
similar atomic CAS of the head pointer.
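
For completeness, the push side might look like this (a sketch under the same
CAS(address, newValue, comparand) convention; not the original code):

    void push(Stack &writers, Item *item)
    {
        Item *first = writers.first;
        item->next = first;
        // Publish the new head; retry if another thread pushed, or the
        // flusher detached the stack, since we read 'first'.
        while (CAS((long *)&writers.first, (long)item, (long)first) != (long)first)
        {
            first = writers.first;
            item->next = first;
        }
    }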

~~~
mortoray
No, I'm saying you have coordinated the writing to the stack. It's not enough
to just swap pointers to the stacks themselves, but you have to know how much
data has been written to the stack.

Perhaps your description is incomplete?

~~~
emerson_clarke
The stack head pointers can be swapped and flushed without knowing how much
data is in it.

However, if you need to know the count/size...

Then in general you must accept that the count/size of the write stack is
dynamic, so even if you use an integer to track the value (and keep it
updated with atomic exchange or increment), at the point that you read its
value on one thread it may have already changed on another.

So it doesn't really matter if you reset the value to 0 using an interlocked
exchange (this can't be done as an atomic unit with respect to the head
pointers unless your platform supports a double CAS). Some loss of count/size
information will occur.

Alternatively you can safely iterate through the stack to calculate the
count/size anytime you need it (provided the next pointers are also treated
atomically).

Neither option will give you the exact size since, as discussed above, it's
always dynamic, but this in no way impedes the functioning of the write/flush
stacks as described. In my solution I just set the value of the count/size to
0 and disregard any loss of information since it is not critical.
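
For example, once the flush stack has been detached it is private to the
flushing thread, so counting it is a plain walk of the list (a sketch, same
types as the earlier snippet):

    int count = 0;
    for (Item *item = first; item != 0; item = item->next)
        ++count;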

~~~
acqq
> disregard any loss of information since it is not critical.

Excuse me, but I'd like to have it clearly stated: which information is
actually lost in your implementation? Do some of the messages passed in to be
written actually end up not being written? Which part ends up not knowing the
count/size?

------
ajtulloch
For some excellent examples of concurrent queues in C++, Facebook's `folly`
C++ library contains a really clean lock-free SPSC queue [1], and a really
fast MPMC queue [2].

[1]: https://github.com/facebook/folly/blob/master/folly/ProducerConsumerQueue.h

[2]: https://github.com/facebook/folly/blob/master/folly/MPMCQueue.h

~~~
mortoray
Note it's important to be "wait-free" and not just "lock-free". The first item
here does appear to be "wait-free".

~~~
fmstephe
I am curious about your definition of wait-free and lock-free. There are a
number of conflicting definitions for these two terms floating around the
internet.

Take a very simple problem: incrementing a shared counter. In pseudo-code
(with no exception handling):

--- A lock-based solution would look like this:

    
    
        public void inc() {
            this.l.lock()
            this.value++
            this.l.unlock()
        }
    

This has the crucial property that if a thread gets descheduled by the OS
while it holds the lock, no other thread can make progress until that thread
has been restored and releases the lock. Importantly, this is true whether we
use user-space spin locks or OS-integrated locks.

--- A lock-free solution would look like this:

    
    
        public void inc() {
            int old = this.value
            while(!compareAndSet(&this.value, old, old+1))
                old = this.value
        }
    

This has the charming property that if a thread gets descheduled by the OS
anywhere inside this method, other threads will not be impacted. However,
there is a pathological case where a single thread may repeatedly fail the
call to compareAndSet(...). While this does mean that some other thread(s)
must be making progress, an unlucky thread may be delayed indefinitely.

--- A wait-free solution for this on x64 would look like this:

    
    
        public void inc() {
            // do some inline assembly
            XADD this.value 1
            // that's enough assembly
        }
    

where the XADD instruction is guaranteed to always work atomically and safely
regardless of how many other threads are also xadding to the same address. In
practice wait-free algorithms are exceptionally complex and often slower than
their lock-free counterparts. Wait-free techniques are most popular with
academics (because they are so hard :) and with hard real-time systems because
you can't risk failing indefinitely on a contended compareAndSet(...).
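
For what it's worth, both flavours are directly expressible with C++11
atomics (a sketch; on x86-64, fetch_add compiles down to a LOCK XADD, so no
inline assembly is needed):

    #include <atomic>

    std::atomic<int> value{0};

    // Lock-free: a CAS loop that can, in principle, retry forever under
    // contention. compare_exchange_weak reloads 'old' with the current
    // value whenever it fails.
    int inc_lock_free() {
        int old = value.load();
        while (!value.compare_exchange_weak(old, old + 1)) {}
        return old + 1;
    }

    // Wait-free: every thread completes in a bounded number of steps.
    int inc_wait_free() {
        return value.fetch_add(1) + 1;
    }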

This is a good explanation of what I understand to be lock-free and wait-free

http://rethinkdb.com/blog/lock-free-vs-wait-free-concurrency/

Here is a really good discussion about it

https://groups.google.com/forum/#!topic/mechanical-sympathy/Tm3xRcvpnzE

Truly no flame intended. I just often find it hard to discuss these things
over the net because of the range of possible definitions.

~~~
mortoray
I'm going on what wikipedia describes as lock-free
(https://en.wikipedia.org/wiki/Lock-free). In that theoretical sense there is
no difference between a mutex lock and a spin-lock, but they may both be
"lock-free". By the definition presented there virtually all programs are
lock-free... that would be better termed "deadlock-free".

Most people I've met, though, assume lock-free just implies not using
mutexes, but that spin-locks are fine. There appears to be a conflict between
this definition (the one you've given) and the theoretical one.

I generally stick to the term "wait-free" since its meaning is less
ambiguous.

~~~
fmstephe
The definition for lock-free written there is the same as the one I wrote
above.

The crucial sentence is found in the second paragraph, where non-blocking is
the umbrella term covering lock-, wait- and obstruction-free algorithms.

"In modern usage, therefore, an algorithm is non-blocking if the suspension of
one or more threads will not stop the potential progress of the remaining
threads"

Neither spin locks nor OS locks satisfy this part of the definition. A thread
holding a lock that is suspended will prevent any progress being made by any
other thread.

I would avoid using wait-free in general because it is a very specialised kind
of non-blocking algorithm.

~~~
mortoray
I think I better understand what "lock-free" means now, thank you for the
explanation.

I will review what I said in my article and ensure I'm not spreading any
misinformation. When I learned of lock-free I was presented with a
spin-lock-like system as an example, but that is clearly incorrect.

The key, I guess, is that any thread could halt at any point and the other
threads are not blocked (within obvious practical limitations). This doesn't
mean other threads may not have to redo some work, such as finding a new
terminal node in a lock-free list.

In practice lock-free is likely sufficient, even in real-time systems. One
would need a very high level of contention to render lock-free incapable
(though with a high number of cores it's definitely possible).

~~~
fmstephe
It's a very tricky collection of definitions. All descriptions of it are a bit
vague. For instance under lock-free on wikipedia we read

"An algorithm is lock-free if it satisfies that when the program threads are
run sufficiently long at least one of the threads makes progress (for some
sensible definition of progress)."

That definitely sounds like locks would be included. I am pretty sure that if
my program runs for long enough that descheduled thread will get rescheduled
and continue to make progress. And then there is the hand-wavy 'sensible
definition of progress'; what is that?

Take as an example a single-producer single-consumer queue, which is
certainly non-blocking, with two methods:

    
    
        // Returns true if o was successfully added to the queue, false otherwise
        public boolean enqueue(Object o)
    
        // Returns null if the queue is empty, otherwise returns a FIFO object
        public Object dequeue()
    

There can only be two threads running, so let's suspend one indefinitely to
test that our implementation is non-blocking.

If we suspend the producer then the consumer will eventually stop returning
objects from dequeue() and just return null. If we suspend the consumer then
the producer will eventually start returning false from enqueue(). In either
of these cases we could definitely argue that our system has stopped making
progress, and the definition 'actually enqueue or dequeue some useful thing'
seems like a sensible one. But the definition should really just be 'always
return from enqueue or dequeue', and this second definition allows us to say
our queue is non-blocking.
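
For concreteness, a minimal bounded SPSC queue with those two
always-returning methods might look like this in C++ (a sketch, not
fmstephe's code; folly's ProducerConsumerQueue linked above is a production
version of the same idea):

    #include <atomic>
    #include <cstddef>

    template <typename T, size_t Capacity>
    class SpscQueue {
        T slots[Capacity + 1];        // one slot kept empty to tell full from empty
        std::atomic<size_t> head{0};  // consumer position
        std::atomic<size_t> tail{0};  // producer position

    public:
        // Always returns: false when full; never blocks or spins.
        bool enqueue(const T &v) {
            size_t t = tail.load(std::memory_order_relaxed);
            size_t next = (t + 1) % (Capacity + 1);
            if (next == head.load(std::memory_order_acquire))
                return false;         // full
            slots[t] = v;
            tail.store(next, std::memory_order_release);
            return true;
        }

        // Always returns: false when empty; never blocks or spins.
        bool dequeue(T &out) {
            size_t h = head.load(std::memory_order_relaxed);
            if (h == tail.load(std::memory_order_acquire))
                return false;         // empty
            out = slots[h];
            head.store((h + 1) % (Capacity + 1), std::memory_order_release);
            return true;
        }
    };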

It's pretty hard to pin down what constitutes a 'sensible definition of
progress'. For a user-space spin lock, saying that a thread continues to spin
in a tight loop could be a sensible definition of progress, but it isn't
helpful for our purposes of deciding if an algorithm or data structure is
non-blocking. This trickiness makes it surprisingly difficult to coherently
discuss non-blocking x-free algorithms.

(Luckily, in practice, these algorithms are so much fun that it is worth the
difficulty :)

~~~
mortoray
Yes, they are fun.

At least in practice the decision to use one or the other would come from
variance guarantees. From this view we can define a "practical" meaning to
each one.

* blocking: very high variance in operation time

* lock-free: lower average, likely lower variance, but prone to spikes

* wait-free: lowest variance, approaching zero, cannot have spikes

------
jzwinck
It's unfortunate that literal strings in C++ cannot be programmatically
distinguished from char* buffers. It would be useful if there were a separate
type for literal strings which could implicitly decay to char* when needed.
And functions should be able to return that type, because you may have
"literal_string toString(MyEnum)" which always returns a literal string (or
perhaps null).

Also, yes, Boost has some lock-free stuff now. It didn't back when the author
was writing the code described.

~~~
humanrebar
There are user-defined string literals (1) in C++11. They can do pretty much
whatever you want.

Also, using templates it is trivial to distinguish between arrays of
characters and character pointers.

Finally, there is a proposal for a string_view (2), which could be used to
represent string literals, no copies needed.

(1) http://en.cppreference.com/w/cpp/language/user_literal

(2) http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3609.html
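
For what it's worth, the array/pointer distinction looks like this (a minimal
sketch of my own, not from the proposal): a reference-to-array parameter
binds to string literals without decaying, so the length is known at compile
time.

    #include <cstddef>
    #include <cstdio>

    // Binds to string literals (and other char arrays) without decay;
    // N is deduced at compile time and includes the trailing '\0'.
    template <std::size_t N>
    void log_literal(const char (&literal)[N]) {
        std::printf("literal of %zu bytes: %s\n", N, literal);
    }

    int main() {
        log_literal("no copy needed");  // N deduced as 15
        // const char *p = "runtime pointer";
        // log_literal(p);              // would not compile: p is not an array
        return 0;
    }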

------
FooBarWidget
It's a bit sad that all these low-latency approaches require burning CPU in a
loop, which uses tons of power.

~~~
fmstephe
In this particular case you should be able to use no-op instructions to spin
without consuming much power. Since the consumer only spins when the queue is
empty we know there will be space on the ring buffer and producers won't be
impacted. When the consumer wakes from a no-op loop and finds work on the
queue it can switch back to a hot loop for a certain amount of time before
returning to a no-op cold loop.

My understanding is that even having very small no-op pauses can significantly
reduce the amount of energy used, and crucially heat generated, while only
having a very modest impact on latency. A good fit for bursty low latency
systems.

To be clear, I am only relating something that another developer described to
me. I've not actually implemented this myself. :)
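
A rough sketch of that hot/cold strategy (my interpretation, assuming x86 and
the _mm_pause() intrinsic; not code from the thread):

    #include <atomic>
    #include <immintrin.h>  // _mm_pause

    // Wait for 'ready' to become true: probe hot for a bounded number of
    // iterations, then drop into a cold loop where probes are separated by
    // PAUSE, cutting power and heat at a modest cost in wake-up latency.
    inline void wait_for(std::atomic<bool> &ready) {
        for (int hot = 10000; !ready.load(std::memory_order_acquire); ) {
            if (hot > 0)
                --hot;        // hot phase: re-check immediately
            else
                _mm_pause();  // cold phase: throttled spin
        }
    }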

~~~
a-priori
I think your colleague is talking about the PAUSE instruction (which is also
known as REP NOP, since they encode to the same bytes). It's a special
instruction that hints to the processor that it's in a spin-loop waiting on a
synchronisation variable to change. It's used in tight wait loops like this:

    
    
    wait_loop:
        pause               ; hint to the CPU that this is a spin-wait loop
        cmp eax, sync_var   ; has the synchronisation variable changed?
        jne wait_loop
    

The PAUSE instruction introduces a small delay to synchronise the spin loop
to the memory bus frequency. This is a power saving, since the value of
_sync_var_, as seen from the perspective of this code, can't change faster
than that. Also, because the CPU can execute much faster than the memory bus,
it prevents many memory requests from piling up while in the loop (because of
out-of-order execution; those requests will have to be unrolled when the loop
exits), making the loop faster to exit. Because of Hyperthreading, the pause
also gives another thread a brief opportunity to execute on the same core.

[https://software.intel.com/sites/default/files/m/d/4/1/d/8/1...](https://software.intel.com/sites/default/files/m/d/4/1/d/8/17689_w_spinlock.pdf)

Intel does, however, recommend other approaches to dealing with tight loops
like this. See here:

https://software.intel.com/en-us/articles/long-duration-spin-wait-loops-on-hyper-threading-technology-enabled-intel-processors/

~~~
arielweisberg
How would there be many requests? Wouldn't it load the cache line once into
the shared state and then spin waiting for the line to be invalidated before
reloading?

~~~
a-priori
Honestly, I don't know how this interacts with cache lines.

As far as I know, Intel has not released any official details about what the
PAUSE instruction does other than that it slows down spin loops to a
reasonable rate. The best source I know of for this information is the _Intel®
64 and IA-32 Architectures Optimization Reference Manual_
(http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf).
It is fairly vague about what the instruction does,
but gives some useful information in 13.5.3 "Spin-Wait Loops", as well as 8.4
"Thread Synchronization", and in particular 8.4.2 "Synchronization for Short
Periods" which says:

"On a modern microprocessor with a superscalar speculative execution engine,
[a spin loop] results in the issue of multiple simultaneous read requests from
the spinning thread. These requests usually execute out-of-order with each
read request being allocated a buffer resource. On detection of a write by a
worker thread to a load that is in progress, the processor must guarantee no
violations of memory order occur. The necessity of maintaining the order of
outstanding memory operations inevitably costs the processor a severe penalty
that impacts all threads."

So it seems that the concern here is that (without a PAUSE) the speculative
execution engine effectively unrolls the spin loop and executes a sequence of
reads on the synchronisation variable. This queues up a list of pending
memory operations that need to be unwound _in order_ to ensure that the
result is the same as if they were executed sequentially -- even though in
this case it wouldn't make any difference.

~~~
arielweisberg
I think the harm here is for other threads on the same core that share the
same queue of in-flight loads.

Coherence-wise, it won't interfere with other cores and their access to
memory.

That said, how hyper-threads share CPU resources is a moving target. My
understanding, through hearsay, is that in the past many resources were
statically partitioned with hyper-threading enabled, but things are now
moving towards allocating resources dynamically, which means that a thread
wasting resources would be sucking up capacity that could be used by another
hyper-thread.

------
zwieback
Nice writeup, enjoyed reading this.

Instead of pointers to string literals, did you consider tokens, e.g. a big
enum with a matching string table for the consumer? That's what we did in the
past in device drivers, although for space reasons rather than speed.
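
A sketch of the token scheme (hypothetical names, my example): the hot path
logs only the enum value, and the consumer (or an offline decoder) maps it
back through the matching table.

    #include <cstdio>

    enum LogEvent { EV_STARTUP, EV_QUEUE_FULL, EV_SHUTDOWN, EV_COUNT };

    // Matching string table, indexed by LogEvent; only the consumer needs it.
    static const char *const kEventStrings[EV_COUNT] = {
        "startup complete",
        "queue full, producer forced flush",
        "shutdown requested",
    };

    int main() {
        int token = EV_QUEUE_FULL;  // what actually travels through the queue
        std::printf("%s\n", kEventStrings[token]);
        return 0;
    }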

~~~
mortoray
That's essentially what pointers to string literals are. They are indexes into
a constant data segment which contains the actual strings.

~~~
zwieback
Very true. The main reasons we used them were: a) they can be smaller, b)
they can be flushed to disk for later decoding more easily, and c) it's
easier to implement a selective logging algorithm if you're only interested
in certain events.

------
VikingCoder
> A key requirement for logging is to write statements, from any thread, in
> order, to a single log-file.

I do not agree that it's a necessary requirement that all threads must write
to a single log-file.

> Formatting strings, required by a log system, is a slow operation.

I also do not agree that it's a necessary requirement that a log system must
format strings. Binary log files have their uses. I've used Google Protocol
Buffers quite happily. They may not be appropriate for extremely high-speed
logs, like a low-latency trading system implies, but they have their uses. I'd
be tempted to try something like Cap'n Proto, if I were taking a whack at it.

~~~
kentonv
FWIW, CloudFlare uses Cap'n Proto for logging.

~~~
VikingCoder
Yeah, what do you know about it?

;-)

------
userbinator
> Profiling revealed that copying the format string was a significant part of
> the overall time.

Not surprising. In general, memory allocations and copying are to be avoided
unless absolutely necessary, if you want efficient code. I've made huge
performance improvements to systems simply by getting rid of a memory copy
that was located in a tight loop. As the saying goes, "the fastest way to do
something is to not do it at all."

Also, does anyone find the term "wait-free queueing" somewhat oxymoronic? A
queue is usually something to wait in.

~~~
mortoray
Well, only one side of the queue was wait-free. The other side, the consumer,
was a whole mess of queues and buffers. It's simply the process of putting
something into the queue that is fast. Getting out of that queue is a
terribly slow operation.

------
lomnakkus
I'm not sure if it's just me, but I found the lack of reference to system time
interesting. Maybe it's one of those things everyone does (but never talks
about), but if you're logging, one of the absolute performance killers can
actually turn out to be getting the system time via gettimeofday() or
whatever. That's a syscall which involves a context switch, which... (It goes
downhill from there.)

Much better to just get the time every once in a while and cache it.
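
A sketch of that caching approach (my example, not from the thread): a
housekeeping thread refreshes a shared timestamp, and log writers read an
atomic instead of making a syscall.

    #include <atomic>
    #include <chrono>
    #include <thread>

    std::atomic<long long> cached_ns{0};  // wall-clock time in nanoseconds

    // Housekeeping thread: refresh the cached time every millisecond or so.
    void clock_updater() {
        for (;;) {
            auto now = std::chrono::system_clock::now().time_since_epoch();
            cached_ns.store(
                std::chrono::duration_cast<std::chrono::nanoseconds>(now).count(),
                std::memory_order_relaxed);
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
    }

    // Log writers: one relaxed atomic load, no syscall, ~1ms resolution.
    long long log_timestamp() {
        return cached_ns.load(std::memory_order_relaxed);
    }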

~~~
nkurz
Historically this is true, but recent x64 Linux has made a lot of improvement
here. In particular, gettimeofday() is usually optimized so as not to require
a context switch: http://blog.tinola.com/?e=5.
And then there are some even faster options that do the caching for you:
https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_MRG/1.3/html/Realtime_Tuning_Guide/sect-Realtime_Tuning_Guide-General_System_Tuning-gettimeofday_speedup.html
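
For example, on Linux the _COARSE clock ids return a kernel-cached timestamp
(roughly timer-tick resolution) without reading the hardware clock at all:

    #include <time.h>

    // CLOCK_REALTIME_COARSE: much cheaper than CLOCK_REALTIME, at the
    // cost of roughly jiffy (timer-tick) resolution.
    long long coarse_timestamp_ns() {
        timespec ts;
        clock_gettime(CLOCK_REALTIME_COARSE, &ts);
        return ts.tv_sec * 1000000000LL + ts.tv_nsec;
    }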

------
zwieback
Most of my high speed experience is in embedded systems with direct control
over HW.

Question: for these higher-level systems, is it possible to prevent preemption
for a very short time? I'm guessing that a userland app can't do that easily
but if it could then that would effectively give you larger atomic operations,
which can aid in implementing lock-free solutions.

~~~
javert
If I understand your suggestion, that only works when all the threads that
share a critical section can only run on the same core.

For instance, if you have a producer on core 0 and a consumer on core 1, even
if the producer disables preemption, the consumer can still enter the
critical section (unless you protect it with locks).

Off the top of my head I actually don't know if you can turn off preemption
from userspace in Linux; I've never needed to.

You can give threads sched_fifo or sched_rr priorities that will prevent them
from being interrupted by anything of lower priority (but that doesn't include
the kernel itself... unless you are on preempt_rt and then there are things
you can do).
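
For reference, giving a thread such a priority from userspace looks like this
on Linux (a sketch; it requires root or an appropriate RLIMIT_RTPRIO):

    #include <pthread.h>

    // Give the calling thread a real-time FIFO priority so it is only
    // preempted by higher-priority tasks, not ordinary time-shared threads.
    bool make_fifo(int priority) {
        sched_param sp{};
        sp.sched_priority = priority;  // 1..99 on Linux
        return pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp) == 0;
    }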

~~~
zwieback
Yes, excellent point. Multi-core changes how critical sections would be
implemented and whether they'd be useful at all.

------
llogiq
Note that log4j2, which can log asynchronously, uses the LMAX Disruptor,
which is a Java implementation of the same technique.

------
easytiger
I wonder about the implementation of the ringbuffer itself. Would
boost::spsc_queue be suitable for a similar task?

~~~
mortoray
That definitely looks like it's providing the same functionality my ring
buffer did. It's impossible to say whether it is as efficient as mine was
without profiling it. It's possible. Even if it's almost as efficient, or even
half as efficient, I'd still consider using it. Writing this stuff consumes a
lot of time.

I notice the newest versions of this class added a "consume" function. I
assume it's for the same reason my class did it. Instead of pushing/popping
you can directly modify memory in the ring buffer. This avoids one copy
operation.
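
If I'm reading the boost docs right, usage is roughly as follows (an
assumption on my part; check the spsc_queue reference): consume_one applies a
functor to the element in place, so the consumer never pops into a local
copy.

    #include <boost/lockfree/spsc_queue.hpp>

    boost::lockfree::spsc_queue<int, boost::lockfree::capacity<1024>> q;

    void drain() {
        // The functor runs against the slot in the ring buffer itself.
        while (q.consume_one([](int &v) { /* process v in place */ })) {}
    }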

~~~
easytiger
interesting, i've actually been oblivious to the mechanics when i'm using it
(too much business logic to do).

I'm also using it and burning a whole core doing a busy wait. I was going to
try and fix this but i'm not sure how much i care. if we aren't paying for cpu
time then it doesn't matter

------
dllthomas
Nice - this is pretty much exactly my solution to the same problem. One thing
I noticed on my architecture (not sure how well it generalizes) is that
explicitly flushing the cacheline from the sending core dropped cache misses
quite a bit.

~~~
nkurz
Do you mean flushing it with _mm_clflush()/CLFLUSH? Which architecture, and
what's your use case? Any theories on why this was helping?

~~~
dllthomas
Right. Architecture is Sandy Bridge, use case is sending messages from one
core to another.

The theory is straightforward: After I populate a message slot (sized at 1
cache line) in the ring buffer I _know_ I'm not going to need to do anything
with that memory on that core anytime soon. Pre-emptively evicting it from
cache in favor of something that has a greater chance of being used soon has a
chance of avoiding a cache miss. There remains the question of whether the CPU
can figure out enough of this on its own. Empirically, the answer was "no" in
my case.
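
Concretely, the eviction is a one-liner after populating the slot (a sketch;
_mm_clflush is the SSE2 intrinsic for CLFLUSH, and publication of the slot
index with proper ordering is omitted):

    #include <cstring>
    #include <emmintrin.h>  // _mm_clflush

    struct alignas(64) MessageSlot { char payload[64]; };  // one cache line

    void send(MessageSlot *slot, const void *data, std::size_t len) {
        std::memcpy(slot->payload, data, len);  // populate the message slot
        // We won't touch this line again soon, and the receiving core is
        // about to pull it, so evict it from this core's cache right away.
        _mm_clflush(slot);
    }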

