C++11 lock-free command buffers for multi-thread rendering

warmwaffles · on June 17, 2019

It's an interesting idea, but you still have single threaded rendering in the end with OpenGL. However, you are able to keep that thread constantly busy which is a big plus.

tuxalin · on June 17, 2019

Indeed, that's a limitation of OpenGL which is addressed with Vulkan, however you can still use this with Vulkan and make use of the other features like: command sorting, less verbose API or command keys. However, in most cases you wouldn't need the lock-free aspect of it (i.e. should be configurable), since in the end you would dispatch commands from their own threads, as monocasa well-pointed, that is something I do plan to do at some point (a solution better suited for Vulkan/D3D12), but sadly didn't had the free time.

monocasa · on June 17, 2019

This fits really well with the single submission queue model, particularly if you're careful to use AZDO style OpenGL. You merge sort the lists once you're done with the frame, and then you have a pretty good approximation of the minimum state changes needed as you iterate through the sorted list on your submission thread.

Now, the better implementations I've seen use per thread lists during queueing so you don't need any of the barriers or atomics here at all, but this is a good first pass.

warmwaffles · on June 17, 2019

Agreed, this is actually a fun exercise to think about

dragontamer · on June 17, 2019

Hmm, I was curious about the implementation. I checked out the code: https://github.com/tuxalin/CommandBuffer/blob/master/Command...

EDIT: I was wrong. Editing out a lot of wrong stuff I put here. Thanks to scott_s for pointing out something I missed.

scott_s · on June 17, 2019

The member m_currentIndex is a std::atomic<uint32_t>. The loads and stores to that member specify the memory consistency, and will use the appropriate fences as needed to meet those requirements.

dragontamer · on June 17, 2019

Wow. I feel like I herp-derped hard. I don't know how I missed that.

I'll look at this later. Apparently I'm not in the right state of mind yet.

tuxalin · on June 17, 2019

No worries, too much work tends to do that to you, I recommend some beers to fix the state of mind.

Regarding the implemention, I presumed you asked about the lock-free logic, apart from m_currentIndex which is used with the m_commands vector, the LinearAllocator also has to be lock-free, see it's alloc function. The allocator is used for storing the command packets (along with their command data) and also their auxiliary data (if they have that).

Another implementation detail of the allocator is that it has two modes, an auto-alignment mode which requires two atomic operations (load acquire and compare_exchange_weak) and will be slower but lower memory usage, therother mode uses the alignment of the CommandPacket class which will be faster (one fetch_add) but consume more memory.

And another detail is the CommandPacket uses a intrusive list approach, that's how you actually chain multiple commands together and do an indirect function call (instead of virtual or other blasphemies) with dispatch of the command.

Hope that clears some things up about it.

dragontamer · on June 17, 2019

I was getting ready to type something up, but you've basically answered my questions!

> an auto-alignment mode which requires two atomic operations (load acquire and compare_exchange_weak)and will be slower but lower memory usage, ther other mode uses the alignment of the CommandPacket class which will be faster (one fetch_add) but consume more memory.

That's the main thing I was wondering about, especially the relative speeds of the implementation.

tuxalin · on June 17, 2019

Glad to hear that!

I dont have anymore the exact numbers, usually when you have around 1000 commands/calls you won't really feel a difference, in that case even using std::sort is fast enough and there's no point in using radix sort, however when you go over 10k commands/calls you'll feel it (can be up to a msec on the dispatch side).

You can check this great post that provides numbers (see part 4): https://blog.molecular-matters.com/2014/11/06/stateless-laye...

The performance should be very similar and it even goes further by adding thread local storage to reduce false sharing even more, which I didn't add but should be considered as it's easy to do.

dragontamer · on June 17, 2019

Okay, now that I've had some time to look at other things and come back to this, lemme offer some more detailed questions.

COMMAND_QUAL::addCommand uses fetch_add(1, std::memory_order_relaxed); It seems like this is sufficient for memory-consistency for the list, but I'm worried about the eventual sort() command.

In particular, your update in "addCommand" is as follows:

    const uint32_t currentIndex = m_currentIndex.fetch_add(1, boost::memory_order_relaxed);
    // asserts removed
    CommandPair& pair = m_commands[currentIndex];
    pair.cmd = packet;
    pair.key = key;

How does COMMAND_QUAL::sort() know that the other threads are "done adding" and that pair.key is a valid value for comparison? My race-condition would be "sort" has been called, while some thread is still setting pair.key=key, which may corrupt the command/key packet.

-----------

EDIT: Oh, you answered this in your blogpost. Lol. You assume that the user waits and safely calls sort after waiting.

Okay, that's perfectly acceptable in my eyes. I'll leave this post up for discussion purposes, even if I answered my own question.

tuxalin · on June 18, 2019

Correct, you could also use a double buffering approach to avoid that, if possible, depending on your use case. Feel free to reach out if you have any other questions.