Optimizing a ring buffer for throughput (2021) (rigtorp.se)
161 points by signa11 on June 27, 2023 | 56 comments



There is a cool trick using virtual memory that allows fast and gapless ring buffers.

See this page (about half way down) https://ruby0x1.github.io/machinery_blog_archive/post/virtua...
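For anyone who wants the gist without clicking through, here is a minimal Linux-only sketch of the idea (assuming memfd_create(2) is available; error handling abridged, and the names are mine, not from the linked post):

    #include <sys/mman.h>
    #include <unistd.h>
    #include <cstddef>

    // Map the same physical pages twice, back to back, so any read or write may
    // run past the "end" of the buffer and land at its beginning automatically.
    void* map_double(size_t size) {             // size must be a multiple of the page size
        int fd = memfd_create("ring", 0);        // anonymous file backing the ring
        if (fd < 0 || ftruncate(fd, (off_t)size) != 0) return nullptr;

        // Reserve 2*size of contiguous address space, then overlay two views of the file.
        void* base = mmap(nullptr, 2 * size, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (base != MAP_FAILED) {
            mmap(base, size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, 0);
            mmap((char*)base + size, size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, 0);
        }
        close(fd);                               // the mappings keep the memory alive
        return base == MAP_FAILED ? nullptr : base;  // base[i] and base[i + size] alias the same byte
    }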


It is a cool trick, and even without virtual memory, if the maximum read size has a low upper bound, just copying the first few bytes from the beginning to the end can be worth it to avoid the complexity and overhead of indexing into the ring buffer modulo its size.
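A rough sketch of that variant, assuming a known small upper bound on a single read (names and sizes are made up for illustration):

    #include <cstddef>

    constexpr size_t RING_SIZE = 4096;
    constexpr size_t MAX_READ  = 64;            // low upper bound on a single read

    unsigned char ring[RING_SIZE + MAX_READ];   // extra tail mirrors the first MAX_READ bytes

    void write_byte(size_t pos, unsigned char b) {       // pos < RING_SIZE
        ring[pos] = b;
        if (pos < MAX_READ) ring[RING_SIZE + pos] = b;   // keep the mirror in sync
    }

    // A reader at any position p < RING_SIZE can now copy up to MAX_READ bytes
    // starting at &ring[p] without any wrap-around handling.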

This clever ring buffer has hidden costs that don't show up in microbenchmarks testing only the push() and pop() operations on an initialized ring buffer. These can be very high and impact other software as well.

The initialisation requires multiple system calls to set up the aliased memory mapping, and the buffer has to be a multiple of the page size. This can be very expensive on big systems, e.g. shooting down the TLBs on all other CPUs with interprocessor interrupts to invalidate any potentially conflicting mappings. On the other end, small systems often use variations of virtually indexed data caches and may be forced to mark the aliased ring buffer pages as uncacheable unless you can get the cache coloring correct. And even for very small ring buffers you use at least two TLB entries. A "dumber" ring buffer implementation can share part of a large page with other data, massively reducing the TLB pressure.

Abusing the MMU to implement an otherwise missing ring addressing mode is tempting and can be worth it for some use-cases on several common platforms, but be careful and measure the impact on the whole system.


It sprang from old mainframe implementations - there's a 2002 (??) Virtual Ring Buffer in Phil Howard's LibH collection (2007 snapshots of 2002 copyright code based on circa 1980 implementations IIRC):

* https://web.archive.org/web/20140625200016/http://libh.slash...

VRB:- These functions implement a basic virtual ring buffer system which allows the caller to put data in, and take data out, of a ring buffer, and always access the data directly in the buffer contiguously, without the caller or the functions doing any data copying to accomplish it.

* The "Hero" code

http://libh.slashusr.org/source/vrb/src/lib/h/vrb_init.c

* Front & back end source directories:

https://web.archive.org/web/20100706121709/http://libh.slash...


If anyone is considering using this, I think the code has a couple of concurrency bugs, but I do not know if this was ever intended to be used in a multithreaded setting. vrb_get() contains the following code.

    //-- Limit request to available data.
    if ( arg_size > vrb_data_len( arg_vrb ) ) {
       arg_size = vrb_data_len( arg_vrb );
    }

    //-- If nothing to get, then just return now.
    if ( arg_size == 0 ) return 0;

    //-- Copy data to caller space.
    memcpy( arg_data, arg_vrb->first_ptr, arg_size );
If less data is available than requested at the time of the check, but additional data gets added to the buffer before arg_size gets assigned, then arg_size can end up larger than what the caller requested and the memcpy overflows the target buffer. At least vrb_read() and vrb_write() have the same bug.
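For what it's worth, the double read seems easy to avoid; something along these lines (this is only a fragment reusing the library's names, and in a real concurrent setting the single read should also be an atomic load with acquire ordering):

    //-- Read the available length once, so a concurrent producer cannot
    //-- widen arg_size between the check and the memcpy.
    size_t avail = vrb_data_len( arg_vrb );
    if ( arg_size > avail ) arg_size = avail;

    //-- If nothing to get, then just return now.
    if ( arg_size == 0 ) return 0;

    //-- Copy data to caller space.
    memcpy( arg_data, arg_vrb->first_ptr, arg_size );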


Sadly I can't ask the author anymore; I suspect this was a stripped-down, minimal-feature port of a more battle-tested cross-platform library that dates to before the times of modern multi multi multi core chips.

Concurrency issues existed in the days of yore .. but they arose in different ways with different timings.

It's an odd bit of code archeology recalling a concept (mmap ring buffers) from decades past and then hunting to find the best remaining example - much of LibH was 'clean' rewrites of code from the author's past work.


Are the memory semantics programmer-friendly on all (or most) platforms when using aliasing mappings like this? E.g. if two CPUs concurrently update the memory through different VM address aliases, are the observed semantics identical to accessing it through only one address, or not?


I think POSIX requires it to work. On all sane architectures that use physically indexed caches (or VIPT) that's easy. On other architectures the OS has to bend over backward to preserve the fiction.


VIPT caches only allow this if all virtual aliases of the physical memory backing them map to the same cache set. Let's look at an example: writing to the first cache line through both aliases of such a double-mapped ring buffer. The data cache is indexed by the virtual address while the MMU translates it. If the virtual addresses have the correct cache coloring, they always hit the same set, the physical address is compared against the tag, and everything works. But if the virtual addresses don't index to the same set of cache lines, you get data corruption, because there will be two dirty cache lines with conflicting data for the same physical address waiting to be written back in an unknowable order. Good luck debugging this should your system allow you to create such a memory mapping. At least Linux and FreeBSD used to allow you to set up this time bomb on ARMv5, and as the kernel you have only two bad options to choose from: break software relying on this, or make it unbearably slow by making aliased memory uncacheable.


Nobody (not even IBM AFAIK) uses coloring anymore so this is moot. ARMv5 isn't even supported by Linux anymore.

All modern caches, whether PIPT, VIPT, or even VIVT, must work with this scheme as it's semantically transparent to virtual memory. The performance of line-crossing accesses is a totally different issue and I would never use this "trick".


You mean virtually indexed caches. VIPT is a transparent improvement on physically indexed caches and is used by most architectures today.

Otherwise, yes, that's what I mean by the OS bending over backward.

Linus has choice words for such architectures.


AKA the Magic Ring Buffer.


Always worth linking to ryg / Fabian Giesen: https://fgiesen.wordpress.com/2012/07/21/the-magic-ring-buff...


Incidentally, the non-portable remap_file_pages(2) was great for doing this sort of page manipulation, but it has been deprecated and these days it is equivalent to doing the separate mmaps.


I didn't know the details of the MESI protocol, so that prompted me to look it up, and NC State University has a 10-minute lecture on YouTube that makes it very easy to grasp - https://www.youtube.com/watch?v=wqF9O7odNKY

Apparently Intel uses a variant called MESIF and AMD went for MOESI (https://www.realworldtech.com/common-system-interface/5/, https://en.wikipedia.org/wiki/MESIF_protocol) (https://www.amd.com/system/files/TechDocs/24593.pdf, https://en.wikipedia.org/wiki/MOESI_protocol)


Thanks for this.

I used this multiproducer multiconsumer ringbuffer

https://www.linuxjournal.com/content/lock-free-multi-produce...

I am inspired by the LMAX Disruptor. I like the idea of splitting all IO into two halves: submit and reply. At one point I was looking at slides from some company that used the LMAX Disruptor pattern for their IO and achieved high performance.

In my extremely naive Java benchmark to test locks, I get 35,871,639 locks per second on 12 threads with contention. I am trying to work out how to improve the performance of the multiproducer multiconsumer ringbuffer, which gets 10 million requests per second; it should be higher.

https://github.com/samsquire/multiversion-concurrency-contro...

I want to create a message queue that uses ringbuffers to communicate between threads efficiently and elegantly and handles the splitting of IO in half. I want to combine coroutines and threads together.


LMAX Disruptor was very inspirational to me as well.

I've found a lot of problem spaces it fits into perfectly. For example, any sort of MPSC arrangement. Literally any. The most obvious being if you want to DIY your own database.

If performance is of interest to you, you may want to check out the latest variants of Disruptor on .NET. Figures nearing 500 million per second range are present here, and will likely be exceeded in future versions of .NET:

https://medium.com/@ocoanet/improving-net-disruptor-performa...

The thing .NET has to its advantage (not sure if this is true anymore) is how structs vs classes work in memory. This is why there is a speedup with ValueDisruptor<T>.

> how to improve the performance of the multiproducer multiconsumer

> multiconsumer

Fundamentally, I think multi-consumer is going to chop a whole order of magnitude off no matter how you slice it. Having just one consumer (aka a 'writer') is a foundational aspect of the performance games played in this arena.


If you're interested in queues, you might want to check out the USENIX BBQ paper from last year: https://www.usenix.org/conference/atc22/presentation/wang-ji...

They've gone with a block-based approach for SPSC/MPSC/MPMC that blows the linux implementation, DPDK and the LMAX Disruptor out of the water.


This is very interesting... I've never thought to divide the buffer into multiple blocks. Are they requiring some low-level OS abstraction to pull this off or are we looking at a novel technique at a higher level (i.e. could I patch an existing implementation)?


Thanks for your reply and thanks for your insight and wisdom.

500 million requests per second. I think that's awesome for _inter-thread communication_. I have a sharded bank simulation which naively sends money between accounts, and it gets 700 million transactions a second as the aggregate of 12 threads all transacting local to a thread, so 500 million across threads seems good (I split money up rather than accounts per thread). I see that some web frameworks can handle up to 600,000 requests per second on TechEmpower benchmarks. So we need to combine the best of each!

Especially the part about one writer. I have a journal entry called "Golden concurrency", inspired by someone's comment on HN (I really should have linked it!) about writing to a database from 3 processes; they decided to split the database into 3 databases rather than suffer contention from concurrency control.

"Golden concurrency" is my term or desire that degradation caused by multiple producers/multiple consumers in different configurations could be optimised for and could be a scenario that is handled without performance degradation. I suspect you need a different design if you have 1 producer:1 consumer producers to consumers, or 1 producer:N consumers or N producers:1 consumer or N producers:N consumers.

I enjoyed the following benchmarks with Kafka and RedPanda which made me think of this too.

https://jack-vanlightly.com/blog/2023/5/15/kafka-vs-redpanda...


That benchmark was comparing apples to oranges. Redpanda fsynced to disk and kafka saved to memory with deferred writes. Here is a response https://redpanda.com/blog/why-fsync-is-needed-for-data-safet...


I usually avoid solving that problem. There’s a good workaround — batch items, and either transfer ownership, or implement shared ownership, instead of copy.

For an example, look at how multimedia frameworks are designed: Media Foundation, gstreamer, ffmpeg, and to a lesser extent V4L2 and ASIO. For audio, they often handle samples coming in at 48 kHz, and they are usually heavily multithreaded, yet they don't have ring buffers for individual samples; their ring buffers hold batches of them.


> I chose to allow any queue size as opposed to allowing only sizes that are a power-of-two. This means that at least one queue item is unused in order to disambiguate between the empty queue and full queue state.

Don't you still need an unused queue item even with a power-of-two size? Isn't the point of having a power-of-two size that you can calculate the next item index in the buffer using binary rollover to go back to 0, as opposed to requiring a comparison and a branch?


Normally I use monotonically increasing indices to avoid the one-element issue, as there is no wraparound, then use masking to map them to actual buffer positions. With non-power-of-two sizes it becomes a more expensive operation, so you have to wrap on increment instead.
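A minimal single-threaded sketch of that scheme (power-of-two capacity assumed; the atomics and memory ordering a real SPSC queue needs are left out):

    #include <cstdint>
    #include <cstddef>

    template <typename T, size_t CAPACITY>             // CAPACITY must be a power of two
    struct MonotonicRing {
        static_assert((CAPACITY & (CAPACITY - 1)) == 0, "power of two");
        T buf[CAPACITY];
        uint64_t head = 0;                             // total pushes so far, monotonically increasing
        uint64_t tail = 0;                             // total pops so far

        bool push(const T& v) {
            if (head - tail == CAPACITY) return false; // full: every slot is usable
            buf[head & (CAPACITY - 1)] = v;            // mask instead of modulo
            ++head;
            return true;
        }
        bool pop(T& out) {
            if (head == tail) return false;            // empty
            out = buf[tail & (CAPACITY - 1)];
            ++tail;
            return true;
        }
    };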


While this should be better, it now needs a check for overflow in the indices, unless they are 64-bit numbers, when you may hope for a computer reboot long before overflow.

Overflow in 32-bit indices can be surprisingly frequent in fast computers.


As a matter of fact, integer overflow does not matter in this case. The head and tail indices of an empty buffer would still be equal, and those of a full buffer would still differ by exactly its size: so all you need to do is use a wide enough integer for the buffer's size to fit inside it.


Well, yes, you use 64 bit indices. As you are padding your indices to cacheline size anyway, you have bits to spare.

You can also still use 32-bit indices and let them wrap around naturally and make it work, as long as your queue size is less than 2^31. Comparison just becomes a little bit trickier. That's how TCP works, for example.
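The "trickier comparison" is the usual serial-number arithmetic; a small sketch (assuming the indices never drift more than 2^31 apart):

    #include <cstdint>

    // true if a is "after" b, even across the 32-bit wrap (TCP-style comparison)
    inline bool seq_after(uint32_t a, uint32_t b) {
        return (int32_t)(a - b) > 0;
    }

    // number of queued elements with free-running 32-bit indices;
    // unsigned subtraction handles the wrap as long as the queue size < 2^31
    inline uint32_t queue_count(uint32_t writeIdx, uint32_t readIdx) {
        return writeIdx - readIdx;
    }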


If it's a power of two you can use a longer than required index. An empty buffer will have all index bits equal. A full buffer will have the higher bits different but the lower bits equal. This requires extracting only the lower bits from the index before using it, but depending on the instruction set and bit position this is either very cheap or completely free.


You can distinguish full and empty using x-y=0 vs x-y=N if the length N is a power of two and smaller than the range of the integer type holding the indices x and y. No unused element is needed then.


> You can distinguish full and empty using x-y=0 vs x-y=N if the length N is a power of two

Why wouldn't this work for, e.g., N=13?


The trick with powers of two is that you can just look at the MSB of `x XOR y` where x and y have one more bit than necessary to store the size.

So for instance if you have a buffer with 8 entries you have a read pointer and a write pointer, both wrapping at 16 (4bits). When you address the buffer you ignore the MSB, when you want to test for empty you do `read == write`, when you want to test for full you do `read XOR write == 0b1000`.

So you only have extremely cheap comparisons, bitwise XOR and counter increments to implement all operations.
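A compact sketch of that scheme with 8 slots and 4-bit counters (no atomics, just the index arithmetic):

    #include <cstdint>

    constexpr uint8_t SIZE = 8;                        // power of two
    constexpr uint8_t MASK = SIZE - 1;                 // low bits select the slot

    struct XorRing {
        int     slots[SIZE];
        uint8_t rd = 0, wr = 0;                        // counters wrap at 2*SIZE (one extra bit)

        bool empty() const { return rd == wr; }
        bool full()  const { return uint8_t(rd ^ wr) == SIZE; }  // low bits equal, extra bit differs

        bool push(int v) {
            if (full()) return false;
            slots[wr & MASK] = v;
            wr = (wr + 1) & (2 * SIZE - 1);
            return true;
        }
        bool pop(int& out) {
            if (empty()) return false;
            out = slots[rd & MASK];
            rd = (rd + 1) & (2 * SIZE - 1);
            return true;
        }
    };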


If you want more than an SPSC queue, I've written `lockfree`, a collection of SPSC and MPMC data structures built along the same principles the author here used: https://github.com/DNedic/lockfree.

The library is written in standard C++11 (but additional APIs for higher C++ versions have been added), uses no dynamic allocation, and is configurable, so it is both big-metal and deeply-embedded friendly.


A point that AFAICT is not articulated in the article is why the two cached fields should be in their own dedicated cache line (e.g. why readIdxCached_ can not share the cache line with writeIdx_).


They should be in the same cacheline as you suggest. The writer will always touch both writeIdx and readIdxCached, so it makes sense for them to be together. Splitting them won't affect synthetic benchmarks, but it's just wasting space in a larger application.


There might be an additional optimization in having the writer also cache its write index on the cache line together with the read index cache. This way the writer would only do writes to the write index cache line. The hardware might be able to optimize this. I wonder how it interacts with UMWAIT on the latest cores.


Reading its own values is never a problem: the writeIndex cacheline will switch from E (or M) to S (or O), but it is never evicted, so reads from it never miss. For frequent writes the read will be fulfilled by the store buffer anyway.


I think it will be invalidated due to RFO when the reader reads the write index. Only when multiple readers read the same cache line without any intervening write will the RFO heuristic be disabled.


False sharing perhaps? At least that is my shoot-from-the-hip response.

https://stackoverflow.com/questions/22766191/what-is-false-s...


I don’t get it. Both readIdx_ and writeIdxCached_ are “owned” by the consumer. (That being the point of a “cached” copy.) The producer might read readIdx_ occasionally, whenever it has gone through the whole buffer, but otherwise only the consumer accesses these. So why should we avoid putting them on the same cache line?


Yeap. The linked cpp ref doc has a nice example:

https://en.cppreference.com/w/cpp/thread/hardware_destructiv...


In the example you linked it's comparing two threads reading from the same or separate cache lines, no? If so, that's not really the point I was referring to (as the two variables I mentioned as example are accessed by a single thread, not by two threads).


False sharing is the reason for splitting readIdx and writeIdx, but there is no reason to split readIdx and writeIdxCached.


Another simple trick is duplicating the buffer pointer field and buffer size field for the consumer and producer (instead of having a single vector field). Paradoxically this makes the queue object smaller, as it makes use of the padding otherwise wasted on cache alignment.


> Another simple trick is duplicating the buffer pointer field and buffer size field for consumer and producer

Do you mean something like this?

    struct Inner {
        int* data;
        size_t size;
        size_t cachedIdx;
    };

    struct ringbuffer3 {
        // No vector<int> [...] here, saves a cache line
        alignas(64) std::atomic<size_t> readIdx_{0};
        alignas(64) Inner writeInner;
        alignas(64) std::atomic<size_t> writeIdx_{0};
        alignas(64) Inner readInner;

      // Constructor ...
    };
Am I understanding correctly?

(edit: formatting)


Yes, except that you do not want to alignas writeInner and readInner. The size of your whole ringbuffer should be 128 bytes.
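So something like this, assuming 64-byte cache lines and a typical 64-bit ABI (the field names are made up, loosely following the article's ringbuffer3):

    #include <atomic>
    #include <cstddef>

    struct ringbuffer3 {
        // Producer's cache line: its index plus its private copies of
        // the buffer pointer, the size and the cached read index.
        alignas(64) std::atomic<size_t> writeIdx_{0};
        int*   dataW_ = nullptr;
        size_t sizeW_ = 0;
        size_t readIdxCached_ = 0;

        // Consumer's cache line: the mirror image.
        alignas(64) std::atomic<size_t> readIdx_{0};
        int*   dataR_ = nullptr;
        size_t sizeR_ = 0;
        size_t writeIdxCached_ = 0;
    };
    static_assert(sizeof(ringbuffer3) == 128, "two cache lines");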


It might also be better for performance since the two cores can RFO the buffer pointer cache lines from each other.


Neither the producer nor the consumer ever writes into that line though, so it should stay in S state forever.


There is a trick to get rid of the wrap-around test. Use modular arithmetic (best if ring size is a power of 2, then even when the integer wraps everything still works):

    next_write_idx = write_idx + 1;
    if (write_idx - read_idx == RING_SIZE) return full; // the ring already holds RING_SIZE items
    ring[write_idx & (RING_SIZE - 1)] = data;
    write_idx = next_write_idx;
This eliminates one possible incorrectly predicted branch and allows you to use all slots of the ring. The original code always leaves one slot empty to avoid the ambiguity between empty and completely full (where in both cases, the indexes have the same value). This could matter for small rings.

The way to think of the indexes: they hold the number of items written to or read from the ring since reset.

I guess a similar optimization to the cached entries is to pop as many items as possible before returning:

    local_write_idx = write_idx;
    local_read_idx = read_idx;
    while (local_read_idx != local_write_idx)
       process(ring[local_read_idx++ & (RING_SIZE - 1)]);
    read_idx = local_read_idx;
    return;
This has other advantages: the ring read is a tight loop and you are repeatedly calling "process"- the working set (the cache and branch predictors) will be all set up for it after the first call.


Is there any sort of guide for a webdev to learn enough c++/c/assembly/hardware knowledge to understand this sort of blogpost deeply?

I am vaguely aware of some of the topics discussed in here, like how when he mentions aligning the write and read indices to the cache line size he is referencing the L1 cache on the processor, how the L1 and L2 caches are different, and how every step up in the processor's memory hierarchy is an order of magnitude difference in access time. I know this about as well as I know that the stratosphere and troposphere are separate layers of the atmosphere, in that I know they exist but I don't understand all the implications.

I'd like to learn more of this(relative to my own software layer) deep magic but I don't know where to start.


You stated all the implications, there's not significantly more to cache coherence than what you discussed here.

Well, there's significantly more if you're trying to implement coherence as an architect, but for the software author that's pretty much it.

Cache lines have a size; if you have a line that's shared between two cores, they keep needing to perform an eviction each time one of the cores takes exclusive control to write to that cache line. This is slow, the end.

If you want the definitive source for this stuff, the Intel Architecture SDM, Volume 3 Ch12 would be a place to start. Then the Intel Optimization manual, Ch 3.6/3.7, and the Ch 2.x.y chapters for the micro-architecture, which describe the cache layout.
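The practical takeaway fits in a few lines of layout (illustrative only, assuming 64-byte cache lines):

    #include <atomic>

    struct Shared {                        // both counters live in one 64-byte line:
        std::atomic<long> a{0};            // every increment by one core evicts the
        std::atomic<long> b{0};            // line from the other core (false sharing)
    };

    struct Split {                         // one line each: the cores never contend
        alignas(64) std::atomic<long> a{0};
        alignas(64) std::atomic<long> b{0};
    };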


You can read the oft-reposted What Every Programmer Should Know About Memory[0]. I also wrote a comment about this sort of thing recently that includes more links[1].

[0]: https://news.ycombinator.com/item?id=36420502

[1]: https://news.ycombinator.com/item?id=33760117


Great, now I need someone to recommend resources that would give me the foundation to understand any of your comment.


A course or book on computer architecture will help a lot. This one [0] by Udacity is quite good at explaining certain concepts. Onur Mutlu has his computer architecture lectures on Youtube as well. [1] He has newer ones on his channel.

[0] https://youtube.com/playlist?list=PLAwxTw4SYaPmqpjgrmf4-DGla... [1] https://youtube.com/playlist?list=PL5PHm2jkkXmi5CxxI7b3JCL1T...


It's just amortizing for large transfers. For frequent near-empty cases it still has the shared-pointers problem. NVMe has a really clever way to avoid this, but it depends on the fact that there's [occasional] communication on the reverse stream (the SQ/CQ pair): the cache-line-sized entries in the submit queue have a "phase" bit in the last word. The reader can just read that to know when the entry has been updated (the polarity of the bit toggles each time we loop around, and you can just use the top+1 index bit if power-of-two sized).

The back pressure is handled with a credit scheme, and the producer gets an updated copy of the most recently known consumer read counter with every completion message back.

Using this scheme you can achieve the optimal performance with just a single cache line of traffic for each cache-line sized message.

Unfortunately I haven't found a SPSC Rust crate that does this, but https://github.com/polyfractal/bounded-spsc-queue [abandoned] comes close.
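A rough sketch of the phase-bit idea (SPSC only, power-of-two size, back pressure left to the credit scheme described above, memory ordering kept to the minimum):

    #include <atomic>
    #include <cstdint>
    #include <cstddef>

    constexpr size_t SIZE = 64;                         // power of two

    struct Slot {
        std::atomic<uint8_t> phase{0};                  // toggles each lap around the ring
        uint64_t payload;
    };

    struct PhaseRing {
        Slot slots[SIZE];
        uint64_t head = 0;                              // producer-only counter
        uint64_t tail = 0;                              // consumer-only counter

        void push(uint64_t v) {                         // caller must know there is space (credits)
            Slot& s = slots[head & (SIZE - 1)];
            s.payload = v;
            // Publish by writing the phase expected on this lap: lap parity + 1.
            s.phase.store(uint8_t(((head / SIZE) + 1) & 1), std::memory_order_release);
            ++head;
        }
        bool pop(uint64_t& out) {
            Slot& s = slots[tail & (SIZE - 1)];
            uint8_t expected = uint8_t(((tail / SIZE) + 1) & 1);
            if (s.phase.load(std::memory_order_acquire) != expected)
                return false;                           // slot not written yet on this lap
            out = s.payload;
            ++tail;
            return true;
        }
    };
The point being that the consumer never touches a shared index line on the hot path; the read counter only travels back on the completion traffic.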


This seems to suffer from race conditions. Consider what happens when two threads push at the same time.


The solution to your caching problem is more caching


The solution is to avoid synchronous blocking operations like migrating cache lines more often than required. Another solution is to offer a batch produce()/consume() API to process multiple elements per invocation. A double-mapped ring buffer is also tempting because it avoids having to deal with modulo addressing.
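A sketch of what a batch consume can look like (an assumed API, not from the article): read the producer index once, drain everything seen, and publish the consumer index once at the end.

    #include <atomic>
    #include <cstddef>

    template <typename T, typename F>
    size_t consume_batch(T* ring, size_t capacity,
                         const std::atomic<size_t>& writeIdx,
                         std::atomic<size_t>& readIdx,
                         F&& process) {
        size_t w = writeIdx.load(std::memory_order_acquire);   // one shared-line read per batch
        size_t r = readIdx.load(std::memory_order_relaxed);    // owned by the consumer
        size_t n = 0;
        for (; r != w; ++r, ++n)
            process(ring[r % capacity]);
        readIdx.store(r, std::memory_order_release);           // one publish for the whole batch
        return n;
    }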



