
C++11 lock-free command buffers for multi-thread rendering - tuxalin
http://alinloghin.com/articles/command_buffer.html
======
warmwaffles
It's an interesting idea, but you still have single threaded rendering in the
end with OpenGL. However, you are able to keep that thread constantly busy
which is a big plus.

~~~
monocasa
This fits really well with the single submission queue model, particularly if
you're careful to use AZDO style OpenGL. You merge sort the lists once you're
done with the frame, and then you have a pretty good approximation of the
minimum state changes needed as you iterate through the sorted list on your
submission thread.

Now, the better implementations I've seen use per thread lists during queueing
so you don't need any of the barriers or atomics here at all, but this is a
good first pass.
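The per-thread-list idea can be sketched roughly like this (names like `Command`, `perThreadLists`, and `mergeForSubmission` are illustrative, not from the article's code): each worker records into its own list with no synchronization, and the submission thread concatenates and sorts by key once the frame is done.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical layout for illustration; a real engine packs GPU state
// (shader, textures, depth, etc.) into the bits of the sort key.
struct Command { uint64_t sortKey; };

// One list per worker thread: no sharing, so no atomics or barriers are
// needed while the frame is being recorded.
std::vector<Command> perThreadLists[4];

// Called on the submission thread once all workers have finished the frame.
std::vector<Command> mergeForSubmission() {
    std::vector<Command> merged;
    for (auto& list : perThreadLists)
        merged.insert(merged.end(), list.begin(), list.end());
    // Sorting by key groups commands with similar GPU state together,
    // approximating the minimum number of state changes at submission time.
    std::sort(merged.begin(), merged.end(),
              [](const Command& a, const Command& b) { return a.sortKey < b.sortKey; });
    return merged;
}
```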

~~~
warmwaffles
Agreed, this is actually a fun exercise to think about.

------
dragontamer
Hmm, I was curious about the implementation. I checked out the code:
[https://github.com/tuxalin/CommandBuffer/blob/master/Command...](https://github.com/tuxalin/CommandBuffer/blob/master/CommandBuffer.h)

EDIT: I was wrong. Editing out a lot of wrong stuff I put here. Thanks to
scott_s for pointing out something I missed.

~~~
scott_s
The member m_currentIndex is a std::atomic<uint32_t>. The loads and stores to
that member specify the memory consistency, and will use the appropriate
fences as needed to meet those requirements.

~~~
dragontamer
Wow. I feel like I herp-derped hard. I don't know how I missed that.

I'll look at this later. Apparently I'm not in the right state of mind yet.

~~~
tuxalin
No worries, too much work tends to do that to you. I recommend some beers to
fix the state of mind.

Regarding the implementation, I presumed you were asking about the lock-free
logic. Apart from m_currentIndex, which is used with the m_commands vector,
the LinearAllocator also has to be lock-free; see its alloc function. The
allocator is used for storing the command packets (along with their command
data) and also their auxiliary data (if they have any).

Another implementation detail of the allocator is that it has two modes: an
auto-alignment mode, which requires two atomic operations (a load acquire and
a compare_exchange_weak) and is slower but uses less memory, and another mode
that uses the alignment of the CommandPacket class, which is faster (one
fetch_add) but consumes more memory.
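The two modes can be sketched like this (an illustrative reconstruction, assuming a bump-pointer design; the repo's actual LinearAllocator differs in its details and naming):

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Illustrative lock-free linear (bump) allocator sketch.
class LinearAllocator {
    uint8_t* m_buffer;
    size_t m_capacity;
    std::atomic<size_t> m_offset{0};
public:
    LinearAllocator(uint8_t* buf, size_t cap) : m_buffer(buf), m_capacity(cap) {}

    // Fast mode: a single fetch_add. Every allocation is padded to a fixed
    // alignment, so the offset stays aligned by invariant -- speed at the
    // cost of extra memory.
    void* allocFixed(size_t size, size_t fixedAlign) {
        size_t padded = (size + fixedAlign - 1) & ~(fixedAlign - 1);
        size_t old = m_offset.fetch_add(padded, std::memory_order_relaxed);
        return old + padded <= m_capacity ? m_buffer + old : nullptr;
    }

    // Auto-align mode: the padding depends on the current offset, so we need
    // a load plus a compare_exchange_weak retry loop -- slower, but no wasted
    // padding beyond what this particular allocation requires.
    void* allocAligned(size_t size, size_t align) {
        size_t old = m_offset.load(std::memory_order_acquire);
        size_t aligned, next;
        do {
            aligned = (old + align - 1) & ~(align - 1);
            next = aligned + size;
            if (next > m_capacity) return nullptr;
        } while (!m_offset.compare_exchange_weak(old, next,
                                                 std::memory_order_release,
                                                 std::memory_order_acquire));
        return m_buffer + aligned;
    }
};
```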

And another detail is that the CommandPacket uses an intrusive list approach;
that's how you actually chain multiple commands together and do an indirect
function call (instead of virtual or other blasphemies) to dispatch the
command.
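The intrusive-list-plus-function-pointer idea looks roughly like this (names are hypothetical; the repo's packet layout and dispatch signature differ):

```cpp
// Rough sketch: each packet carries its own "next" link and a plain function
// pointer, so dispatch is one indirect call with no vtable lookup.
struct CommandPacket {
    CommandPacket* next;            // intrusive link to the chained command
    void (*dispatch)(const void*);  // indirect function call per command
    const void* data;               // command-specific payload
};

// Example command for illustration: counts how many times it was dispatched.
static int g_dispatched = 0;
void drawDispatch(const void*) { ++g_dispatched; }

// Submission walks the chain and makes one indirect call per command.
void submit(const CommandPacket* packet) {
    for (; packet != nullptr; packet = packet->next)
        packet->dispatch(packet->data);
}
```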

Hope that clears some things up about it.

~~~
dragontamer
I was getting ready to type something up, but you've basically answered my
questions!

> an auto-alignment mode which requires two atomic operations (load acquire
> and compare_exchange_weak) and will be slower but lower memory usage, the
> other mode uses the alignment of the CommandPacket class which will be
> faster (one fetch_add) but consume more memory.

That's the main thing I was wondering about, especially the relative speeds of
the implementation.

~~~
tuxalin
Glad to hear that!

I don't have the exact numbers anymore. Usually, when you have around 1000
commands/calls you won't really feel a difference; in that case even
std::sort is fast enough and there's no point in using a radix sort. However,
when you go over 10k commands/calls you'll feel it (it can be up to a msec on
the dispatch side).

You can check this great post that provides numbers (see part 4):
[https://blog.molecular-matters.com/2014/11/06/stateless-layered-multi-threaded-rendering-part-1/](https://blog.molecular-matters.com/2014/11/06/stateless-layered-multi-threaded-rendering-part-1/)

The performance should be very similar, and it even goes further by adding
thread-local storage to reduce false sharing even more, which I didn't add
but should be considered, as it's easy to do.

~~~
dragontamer
Okay, now that I've had some time to look at other things and come back to
this, lemme offer some more detailed questions.

COMMAND_QUAL::addCommand uses fetch_add(1, std::memory_order_relaxed). It
seems like this is sufficient memory consistency for the list itself, but I'm
worried about the eventual sort() call.

In particular, your update in "addCommand" is as follows:

    const uint32_t currentIndex = m_currentIndex.fetch_add(1, boost::memory_order_relaxed);
    // asserts removed
    CommandPair& pair = m_commands[currentIndex];
    pair.cmd = packet;
    pair.key = key;


How does COMMAND_QUAL::sort() know that the other threads are "done adding"
and that pair.key is a valid value for comparison? My race-condition concern
would be: "sort" has been called while some thread is still setting
pair.key = key, which may corrupt the command/key packet.

-----------

EDIT: Oh, you answered this in your blogpost. Lol. You assume that the user
waits for the worker threads to finish and only then calls sort.

Okay, that's perfectly acceptable in my eyes. I'll leave this post up for
discussion purposes, even if I answered my own question.

~~~
tuxalin
Correct, you could also use a double buffering approach to avoid that, if
possible, depending on your use case. Feel free to reach out if you have any
other questions.

