This was specifically designed for one producer (the QEMU processor), and many consumers (processing the trace of instructions and memory accesses). I can't remember what specific tunings I did to help with this specific model. Data structures like this can always be tuned slightly based on your workload, by changing structure shapes, shared cache lines, etc.
With about 3-4 consumers QEMU was not really blocking on the reporting of every single instruction executed, which is really cool. This requires a low noise system, just having a browser open can almost halve these numbers since there's just more cache coherency bus traffic occurring. If having a browser open doesn't affect your benchmark, it's probably not bottlenecking on the uarch yet ;)
Ultimately, mempipe is not really a big thing I've talked about, but it's actually what makes cannoli so good and enables the design in the first place.
You can see a massive improvement for shared hyperthreads, and on-CPU-socket messages.
In this case, about ~350 cycles to a local core, ~700 cycles to a remote core, ~90 cycles to the sibling hyperthread. As long as you're on Skylake+, divide these by your clock rate to get the time in seconds, e.g. 350 cycles / 4 GHz ≈ 87.5 nanoseconds for local-core IPC.
Anyways, since you expressed disappointment in the ~1000 cycle cost: that's about right. The latency between cores is actually quite high and there's not much you can do about it, especially on a system like x86 which has extremely strong cache coherency by default. One thing that is really important to understand is that the ownership of cache lines dramatically affects the cost of memory.
For IPC, this effectively requires one thread writing to memory (thus making that cache line modified on that core, and evicted from all other cores). Then, when the polling thread checks in on the line, it has to demote that cache line from modified to shared by flushing it out (usually to L2 or L3, but also writing out to memory). This causes memory traffic and constantly means the cores are fighting over the same cache line. Since x86 is strongly ordered and caches are coherent, this traffic is extremely expensive.

Think of a write as "tell all other cores that you modified this memory, so they have to evict/invalidate their cache lines". And a read as "tell all cores to flush their cache lines if they're modified, then wait for them to tell me they're done, then I can read the memory". This is effectively a massively blocking operation. The simple act of reading the mailbox/ticket/whatever from another core to check if a message is ready will actually dramatically affect the speed at which the other core can write to it (as now that write is effectively full latency).
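To make the cost concrete, here's a minimal sketch (not the mempipe code, just an illustration) of measuring how long it takes to bounce a single cache line between two threads. It assumes an x86_64 machine so `_rdtsc` is available, and it pessimistically puts both flags on the same line on purpose:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

const ROUNDS: u64 = 1_000_000;

fn main() {
    // Both threads hammer this one value, so the cache line holding it
    // bounces between the two cores' L1s on every hand-off.
    let flag = Arc::new(AtomicU64::new(0));

    let ponger = {
        let flag = Arc::clone(&flag);
        thread::spawn(move || {
            for i in 0..ROUNDS {
                // Wait until the pinger has published the next odd value...
                while flag.load(Ordering::Acquire) != 2 * i + 1 {
                    std::hint::spin_loop();
                }
                // ...then hand the line back by writing the next even value.
                flag.store(2 * i + 2, Ordering::Release);
            }
        })
    };

    let start = unsafe { core::arch::x86_64::_rdtsc() };
    for i in 0..ROUNDS {
        flag.store(2 * i + 1, Ordering::Release);
        while flag.load(Ordering::Acquire) != 2 * i + 2 {
            std::hint::spin_loop();
        }
    }
    let end = unsafe { core::arch::x86_64::_rdtsc() };

    ponger.join().unwrap();
    // Each round is two hand-offs (ping + pong), i.e. two cache line transfers.
    println!("~{} cycles per one-way transfer", (end - start) / (2 * ROUNDS));
}
```

Run it with the two threads pinned to different cores (and then to sibling hyperthreads) and you can reproduce the local-core vs. same-core numbers above.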
There are some tricks you can do to get extremely low latency between cores. One of them, is making sure you're on cores that are physically near each other (eg. on the same processor socket). This is only really relevant on servers, but it's a big thing. You can actually map out the physical processor layout, including on a single die, based on the latency between these cores. It's quite subtle and requires low noise, but it's really cool to map out the grid of cores on the actual silicon due to timing.
Another trick that you can do is have both threads on the same core, thus, using hyperthreads. Hyperthreads share the same core and thus a lot of resources, and are able to actually skip some of the more expensive coherency traffic, as they share the same L1 cache (since L1 is per-core). The lowest latency you will be able to observe for IPC will be on the same core with hyperthreads, but that's often not really useful for _processing_ the data, since performance will not be great with two busy hyperthreads on one core. But in theory, you can signal a hyperthread, the hyperthread can then go and raise some other signal, while the original hyperthread still continues doing some relevant work. As long as one of them is blocking/halted, the other won't really be affected by having two things scheduled on the same core.
Finally, the most reasonable trick is making sure your tickets/buffers/mailboxes/whatever are _not_ sharing the same cache lines (unless they contain data which is passed at the same time). Once again, the CPU keeps things in sync at cache-line granularity. So having two pieces of data being hammered by two cores on the same cache line is asking for hundreds of cycles per trivial data access. This can be observed in an extreme case on many-core systems, with multiple sockets, fighting over locks. I've done this on my 96C/192T system and I've been able to get single `lock inc [mem]` instructions to take over 15,000 cycles to complete. Which is unreal for a single instruction. But that's what happens when there's 200-500 cycles of overhead every single time that cache line is "stolen" back from other cores.

So, effectively, keep in your head which state cache lines will be in. If they're going to be modified on one core, make sure they're not going to be read on another core while still being written. These transitions are expensive: you're only going to get your 3-4 cycle "random L1 data hit" performance if the cache line being read is in the exclusive, modified, or shared state, and if it's being written, it has to be exclusive or modified. Anything else and you're probably paying hundreds of cycles for the access, and thus also probably hurting the other side.
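For illustration, a minimal sketch of keeping two hot flags off each other's cache lines with an alignment wrapper (crossbeam's `CachePadded` is the off-the-shelf version of this; 64 bytes assumes a typical x86 line size):

```rust
use std::sync::atomic::AtomicU64;

// Each field gets its own 64-byte cache line, so a core hammering
// `producer_seq` never invalidates the line `consumer_seq` lives on
// (i.e. no false sharing between the two signal words).
#[repr(align(64))]
struct Padded<T>(T);

struct SharedState {
    producer_seq: Padded<AtomicU64>, // written by the producer, read by the consumer
    consumer_seq: Padded<AtomicU64>, // written by the consumer, read by the producer
}

// Sanity check: with the padding, the two counters can't share a line.
const _: () = assert!(std::mem::align_of::<Padded<AtomicU64>>() == 64);
```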
Ultimately, what you're asking from the CPU is actually extremely complex. Think about how hard it would be for you to manage keeping a copy of a database in sync between hundreds of writers and readers (cores). The CPU is doing this automatically for you under the hood, on every single memory access. It is _not_ free. Thus, you really have to engineer around this problem, batch your operations, find a design that doesn't require as intense of IPC, etc. On more weakly ordered systems you can use some more tricks in page tables to get a bit more control over how cache coherency should be applied for various chunks of memory to get more explicit control.
Oh sliding in here late, but it’s also extremely important to pin your threads to specific cores. I kinda only think about computers in this mode so I didn’t bring it up, but the kernel will effectively randomly schedule your process to different physical cores. If you are doing intense IPC or otherwise relying on specific cache usage, getting assigned to a different core is a massive loss of existing state which takes time to recover from!
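For anyone who hasn't done this before, a minimal Linux-only sketch of pinning via the `libc` crate (crates like `core_affinity` wrap the same call); the core numbers here are arbitrary examples:

```rust
// Pin the current thread to one core so the kernel can't migrate it and
// throw away its cache and branch-predictor state mid-benchmark.
fn pin_to_core(core: usize) {
    unsafe {
        let mut set: libc::cpu_set_t = std::mem::zeroed();
        libc::CPU_ZERO(&mut set);
        libc::CPU_SET(core, &mut set);
        // pid 0 = "the calling thread"
        let ret = libc::sched_setaffinity(0, std::mem::size_of::<libc::cpu_set_t>(), &set);
        assert_eq!(ret, 0, "sched_setaffinity failed");
    }
}

fn main() {
    pin_to_core(2);
    // ... run the producer here, and pin the consumer to a nearby core ...
}
```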
> You can actually map out the physical processor layout, including on a single die, based on the latency between these cores. It's quite subtle and requires low noise, but it's really cool to map out the grid of cores on the actual silicon due to timing.
This is a very cool comment in general, but I'm intrigued by this bit in particular. I'd love to see an implementation, if anyone knows of any.
I’ve also played around with this in my research kernel, but never thoroughly enough to write up. Something I should revisit, I think it’d be a really fun thing to discuss and work on. Doing this level of timing requires BIOS configuration to make the CPU as deterministic as possible (turn off dynamic throttling, make sure you’re keeping the CPU thermally stable, etc).
I always highly encourage people to write advanced benchmarks. It’s a great way to learn how computers work!
Sticking with x86, I'm pretty sure CPUID can tell you the topology of all the cores you can query. Not that I'd tell anyone not to infer it from timing; that just sounds like fun.
It can tell you specified topology, like cores, threads, NUMA nodes. But it can’t tell you the physical locations of cores. Processors are binned based on what cores are functional after fabrication, thus your 12 core processor is probably a 16-18 core processor, and the 4 disabled cores are “random” based on manufacturing defects. Knowing this is completely unimportant, but cool. Ultimately yes, cpuid will tell you any relevant topology, since anything beyond this likely requires trillions of iterations to even detect.
You mention the most reasonable trick is to just avoid hammering/read-write the same cache lines; I guess you didn't hit this as a need because your QEMU instruction pipe was fast enough, but would you batch up your events along cache lines, fill up a line, and signal it's free to the consumers instead?
Yeah. So I forget the original tuning I did for that project. But, I fill up a buffer which is on its own cache line (Chunk) and then signal that the chunk is ready for ownership on the other side, thus sending it. I’m not sure why the signaling atomics aren’t on their own cache lines, I imagine I tried both and this was faster? There’s also a chance I never tried it because I felt I didn’t need it? I’m not sure!
Edit: Yep, looking at this I think I see some room for improvement in design. I'm getting about 60-70 ns/iter right now on my Intel(R) Core(TM) i9-9900KS CPU @ 4.00GHz with turbo on, and a loaded up system. This code is 2 years old and I've gotten a lot better at designing stuff like this. But, it's still really good. It's something I'd like to revisit at some point.

The main bottleneck I think I see is that `cur_seq` should be on its own cache line, as that's the only heavily thrashed value between cores. Most of the others are primarily in the shared cache line state, until a writer hits a mailbox. This design searches linearly through mailboxes for work to do; perhaps it would be faster to save the previous index? I'm also not sure I always need to have readers strongly sequenced, such that they can process more than one message at a time, but this could have been a requirement of the way I was using it?

While in theory storing/caching some information is great, in practice it often means unpredictable and dependent loads to use it. Eg, storing the previous index now means the CPU has to fetch the previous index from memory, and then loop on it. The compiler also has to treat this as a "volatile" value, and thus will struggle to do things like loop unrolling and bounds checking removal. With a low N (~4-16, the number of mailboxes), it's probably always faster to just `for _ in 0..16` instead of doing smarts here, as the processor can do all 16 of those effectively in parallel by the time the CPU has even loaded the "cached index". Once again, tuning is extremely specific to a workload and parameters.
For maximum performance, it's pretty much always required to try out some un-ergonomic API. In this case the sequencing is nice, but perhaps I could rewrite the code that uses mempipe to not care and handle sequencing someplace else? I forget. In theory I could have multiple banks of tickets, and switch between them on some level of pressure or timeout. Using the same signal line between the writer and the reader (eg, they both must write to client_owned) is probably a mistake. I think it'd be better to use two indexes, bumped on one side when a writer writes, and bumped on the other side when a reader is done. This would allow better "bursting" of data in my opinion? Who knows, even a data structure as simple as this effectively requires building it and trying to really determine how it performs. So many resources in use in the CPU and it's a tough balance in your head.
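To make the "two indexes, one bumped by each side" idea concrete, here's a rough single-producer/single-consumer ring sketch; this is not mempipe's actual layout, and `Ring`, `try_send`, `try_recv` and the 16-slot/64-byte sizes are all illustrative assumptions:

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicUsize, Ordering};

#[repr(align(64))]
struct Padded<T>(T);

const SLOTS: usize = 16; // power of two so the modulo stays cheap

/// The writer only ever bumps `head`, the reader only ever bumps `tail`,
/// so neither signal line is written by both sides.
struct Ring {
    head: Padded<AtomicUsize>,                    // next slot the writer fills
    tail: Padded<AtomicUsize>,                    // next slot the reader drains
    slots: [Padded<UnsafeCell<[u8; 64]>>; SLOTS], // one cache line of payload per slot
}

// Safety: exactly one producer and one consumer, coordinated via head/tail.
unsafe impl Sync for Ring {}

impl Ring {
    /// Producer side.
    fn try_send(&self, msg: &[u8; 64]) -> bool {
        let head = self.head.0.load(Ordering::Relaxed);
        let tail = self.tail.0.load(Ordering::Acquire);
        if head.wrapping_sub(tail) == SLOTS {
            return false; // full: caller can spin, yield, or drop
        }
        unsafe { *self.slots[head % SLOTS].0.get() = *msg };
        self.head.0.store(head.wrapping_add(1), Ordering::Release); // publish
        true
    }

    /// Consumer side.
    fn try_recv(&self, out: &mut [u8; 64]) -> bool {
        let tail = self.tail.0.load(Ordering::Relaxed);
        let head = self.head.0.load(Ordering::Acquire);
        if head == tail {
            return false; // empty
        }
        *out = unsafe { *self.slots[tail % SLOTS].0.get() };
        self.tail.0.store(tail.wrapping_add(1), Ordering::Release); // hand slot back
        true
    }
}
```

Because each side only writes its own index, the other side mostly sees those lines in the shared state, and the "bursting" behavior described above falls out naturally.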
Write your own. Wrote one fairly recently in good-ole-sepples that uses UNIX domain sockets and pushes around half a gigabyte a second on weak ARM hardware. Domain sockets are great. Shared memory is even better. If you need a portable solution between processes, not threads, I recommend domain sockets and a simple binary protocol for high throughput.
The article mentions that the perf of unix domain sockets was not significantly better:
> I was surprised at how similarly most things performed. I did a cursory investigation into Linux-specific approaches like dbus and Unix Domain Sockets, but they seemed to be in the same ballpark as the non-shared memory approaches.
I wholeheartedly agree. To add a bit more carrot for people on the fence, unlike TCP/UDP sockets, you don't have to be concerned about endianness because the bits never leave the machine. For a lot of use-cases, this is a big win that doesn't get as much appreciation as I think it should.
> unlike TCP/UDP sockets, you don't have to be concerned about endianness because the bits never leave the machine
Why would you be concerned about endianness even if you're using TCP or UDP, if you control the protocol?
Little endian won; in 99.9% of cases there's no reason to use big endian unless you're implementing an existing protocol which explicitly mandates big endian. Doing something over the network doesn't magically make it better to use big endian, and the whole "network endian = big endian" is just silly.
This is reasonable, but a bit unfortunate specifically for TCP/UDP. The extremely strong convention is that all network protocols are "network byte format" which is Big-Endian.
Obviously no issue for local IPC, but when things become published network protocols, little-endian makes me sad.
> This is reasonable, but a bit unfortunate specifically for TCP/UDP. The extremely strong convention is that all network protocols are "network byte format" which is Big-Endian.
This is really only relevant for packet headers, tho. What goes into payload is entirely up to the developer. There is no specific reason to keep endianness the same between TCP/UDP and its payload. As long as it's clear which one to use.
RFC 1700 is marked as obsolete. IIRC, network order being big-endian is because it allowed faster packet switching? But we're at the point where I can do dual-border routing with DPI orders of magnitude faster than switches from that era could do basic switching.
Some people choose little-endian for payload because they are:
a) unaware of endianness and will discover a funny thing when they suddenly start talking between BE and LE nodes. Rust forces you to choose when you write or read numbers.
b) Think since their CPU is LE then payload should be LE as well for performance. This is false because even on ancient hardware this would not take more than 3 cycles (in a context of network communication shaving off 3 cycles is a hilarious endeavor)
c) Aware of all of this and choose LE because they felt like it.
> Think since their CPU is LE then payload should be LE as well for performance. This is false because even on ancient hardware this would not take more than 3 cycles (in a context of network communication shaving off 3 cycles is a hilarious endeavor)
Having to do BE/LE conversions means that you can't just reinterpret cast your network buffers into whatever is the native application message type. Yes, there are ways around that (by wrapping every integer field with an endian converting accessor), but it is a significant amount of yak shaving.
Well, the native application type depends on what platform that application is running. That means LE in 99.9% of cases. Don't need to sell LE for payload to me, I'm already sold. It's people that think network byte order has anything to do with payload that are confused.
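For anyone following along, a tiny example of the explicit choice Rust pushes on you at the serialization boundary (the field name and values here are just illustrative):

```rust
fn main() {
    let len: u32 = 0xDEAD_BEEF;

    // The endianness decision is spelled out at every boundary: LE for a
    // local/IPC protocol...
    let wire_le = len.to_le_bytes();
    // ...or BE if you're matching traditional "network byte order".
    let wire_be = len.to_be_bytes();

    assert_eq!(u32::from_le_bytes(wire_le), len);
    assert_eq!(u32::from_be_bytes(wire_be), len);

    // On an LE host, reading the LE field is effectively a plain load; the BE
    // variant costs a byte swap (a single bswap instruction on x86).
    println!("le = {:02x?}, be = {:02x?}", wire_le, wire_be);
}
```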
The use case for splice is moving data from one file descriptor to another without reading it, for example port forwarding/mirroring (or taking over sendfile, dumping a file to a socket or socket to file). It doesn't cover IPC where you have messages that are actually being sent from one process to another and presumably, you want to read and write those bytes.
That was the original implementation. I believe this has since been relaxed. I think the kernel will allocate a pipe internally in some cases. The man pages are known to be incomplete.
In fact I think sendfile is implemented with the splice machinery now.
At least try to be polite to the cache coherence system, please. Do a loop where you check (with a plain relaxed (when available) read) whether the compare and swap should work and, if not, do your platform’s pause operation (REP NOP on x86) and try again. Only do compare-and-swap if the optimistic read thinks it will work.
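A minimal sketch of that pattern (test-and-test-and-set) in Rust, assuming a simple `AtomicBool` lock word; `std::hint::spin_loop()` lowers to PAUSE/REP NOP on x86:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Spin on a cheap relaxed read, which keeps the cache line in the shared
/// state, and only attempt the expensive compare-and-swap when the read
/// suggests it will succeed.
fn lock(flag: &AtomicBool) {
    loop {
        if !flag.load(Ordering::Relaxed) {
            // Looks free: now actually try to take it.
            if flag
                .compare_exchange_weak(false, true, Ordering::Acquire, Ordering::Relaxed)
                .is_ok()
            {
                return;
            }
        }
        std::hint::spin_loop(); // back off before re-checking
    }
}

fn unlock(flag: &AtomicBool) {
    flag.store(false, Ordering::Release);
}
```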
Is shared memory access naturally async like IO/io_uring? If not, asking for async is misplaced and any implementation would be synonymous with wrapping sync calls in `async fn` wrappers.
I think it's as naturally async as anything else here actually. If you're calling a function via IPC over shared memory, you're sending a message, signalling it's there to the other process, and waiting for a response message with the result.
The only thing needed to turn this async is to change the "get result over IPC" function from "block until result is available" to "check if result is available, return TRY_AGAIN if it's not"
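As a rough sketch of what that looks like as a future (not any particular library's API; a real implementation would park on a futex/eventfd and only wake when signalled, rather than re-waking itself):

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll};

/// Completes once the other process has flipped `ready` in shared memory.
struct IpcReply {
    ready: Arc<AtomicBool>, // stand-in for a flag living in the shared mapping
}

impl Future for IpcReply {
    type Output = ();

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.ready.load(Ordering::Acquire) {
            Poll::Ready(()) // "result is available"
        } else {
            cx.waker().wake_by_ref(); // "TRY_AGAIN": ask the executor to poll again
            Poll::Pending
        }
    }
}
```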
In practice it's not desirable to go without an explicit wake feature, we need some way to know that now it's likely to succeed. The mechanism needn't be 100% reliable but "Always try again" is essentially exactly as useless as blocking.
I work on a widely deployed (~11,000 servers and growing) piece of software that does not use any OS sleep/wake primitive for our MPSC queues -- just polling. It works really well with the right overall design.
The core primitive for IPC isn't "request and wait for a response," it's "send a single message and return immediately." Sure, if you want an RPC framework over IPC, maybe you want async blocking, but that's sort of a higher level concern than basic IPC.
It seems that iceoryx2 intends to support request-response in the future, but doesn't today:
Even with fire-and-forget, your send queue might become full. So you either drop the message or you have to wait for it to empty. Similarly, on the receiving end you have to block waiting for the queue to become non-empty.
In fact that's exactly the reason that async exists for networking.
In iceoryx2, we use something that we call event concept, which can either use unix-domain sockets so that you have something to select/wake/epoll on or a variant where we use a semaphore stored in shared memory. See: https://github.com/eclipse-iceoryx/iceoryx2/tree/main/iceory...
The unix-domain socket has the advantage that you can combine it with external non-iceoryx events, but you pay with a small performance hit. The semaphore event is usually faster.
As a user, you can configure your IPC service architecture and use the mechanism that suits you best (but I have to admit that we have not yet documented this in detail). For the zero_copy service variants, it is done here: https://github.com/eclipse-iceoryx/iceoryx2/blob/main/iceory....
I believe you could do something with e.g. Linux futexes to support this kind of thing. But in general, polling at the receiver is going to be lower latency / higher throughput.
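A rough Linux-only sketch of the futex idea via the `libc` crate; the wrappers and the `seq` word are illustrative, and for cross-process use the futex word would live in the shared mapping (and must not use `FUTEX_PRIVATE_FLAG`):

```rust
use std::ptr;
use std::sync::atomic::{AtomicU32, Ordering};

/// Sleep until `*word` is no longer `expected` (or until a spurious wake).
fn futex_wait(word: &AtomicU32, expected: u32) {
    unsafe {
        libc::syscall(
            libc::SYS_futex,
            word.as_ptr(),                 // address the kernel watches
            libc::FUTEX_WAIT,
            expected,                      // only sleep if it still holds this value
            ptr::null::<libc::timespec>(), // no timeout
        );
    }
}

/// Wake up to `n` waiters sleeping on `word`.
fn futex_wake(word: &AtomicU32, n: i32) {
    unsafe {
        libc::syscall(libc::SYS_futex, word.as_ptr(), libc::FUTEX_WAKE, n);
    }
}

fn main() {
    let seq = AtomicU32::new(0);
    seq.store(1, Ordering::Release); // publish new state...
    futex_wake(&seq, 1);             // ...then kick any sleeper
    // A consumer would spin briefly on `seq`, then fall back to
    // futex_wait(&seq, last_seen) when nothing shows up.
}
```

The usual trade-off applies: spin for a bit first (low latency when traffic is steady), and only fall into the futex when the queue goes quiet.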
With this sort of thing you'd typically wrap the underlying IPC channel in a system thread and redirect messages to/from it within the local process via an async channel.
The dedicated system thread for the channel means the local process will never block on the incoming IPC events and the redirected messages will be buffered/enqueued as determined by the async runtime.
If you want to use push notifications to wake up processes, iceoryx2 has the event messaging pattern where a listener waits for incoming events, and a notifier in another process can send specific events to all waiting listeners. See this example: https://github.com/eclipse-iceoryx/iceoryx2/tree/main/exampl...
What... it's shared memory? What is asynchronous about touching a few atomic vars and shared memory (shared pages), exactly? You could absolutely poll on this sort of thing in a non-blocking manner, but there is no OS primitive here to trigger a Waker.
Build an event count on top of eventfd and you can get fast-pathed blocking behavior on your shared memory queue that is also async-awaitable on top of your event loop of choice.
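A rough sketch of the eventfd plumbing underneath that (not a full event count: a real one skips the syscall when nobody is sleeping; `Event`, `notify`, and `wait` are illustrative names):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

struct Event {
    pending: AtomicU64, // fast path: published-but-unconsumed message count
    efd: i32,           // slow path: fd to block on, or register with epoll/async
}

impl Event {
    fn new() -> Self {
        let efd = unsafe { libc::eventfd(0, 0) };
        assert!(efd >= 0, "eventfd failed");
        Event { pending: AtomicU64::new(0), efd }
    }

    /// Producer: publish one message, then kick the fd so sleepers wake up.
    fn notify(&self) {
        self.pending.fetch_add(1, Ordering::Release);
        let one: u64 = 1;
        unsafe { libc::write(self.efd, &one as *const u64 as *const _, 8) };
    }

    /// Consumer slow path: block in the kernel until the producer writes.
    fn wait(&self) {
        let mut buf: u64 = 0;
        unsafe { libc::read(self.efd, &mut buf as *mut u64 as *mut _, 8) };
    }
}
```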
Pretty good write up - I reached the same conclusion doing something similar a few years back.
For the record, I got even better perf with shared memory by avoiding the mutex in `raw_sync::events::BusyEvent` - I used a blocking loop with atomics, and yielded the CPU (`yield_now` in Rust) between the iterations of the loop.
Downside is that it becomes a bit less consistent. And it is a _lot_ more complex than a unix domain socket :)
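Something like this minimal sketch of the loop described above (the flag would live in the shared mapping; `wait_for_signal`/`signal` are illustrative names):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Spin on an atomic flag instead of taking a mutex; yielding between checks
/// keeps latency low when a message is imminent without pegging the core,
/// at the cost of less consistent wakeup times.
fn wait_for_signal(flag: &AtomicBool) {
    while !flag.load(Ordering::Acquire) {
        std::thread::yield_now();
    }
    flag.store(false, Ordering::Release); // consume the event
}

fn signal(flag: &AtomicBool) {
    flag.store(true, Ordering::Release);
}
```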
After looking up the code for raw_sync, I came to the same conclusion. My first question was whether the shared memory case is safe on a weaker memory model like ARM or Power. Seeing the mutex, it may be too safe.
IPC having a lot of "overhead" is nonsense (other than, like, managing a mutex or semaphore). I've used many header-only IPC libraries (this one[1] worked great on Windows/MacOS). It's way easier to get a hang of instead of jerry-rigging TCP or UDP to do IPC (yuck).
Thanks for sharing this. I recently built something similar for a closed source project and am curious how I didn’t run across this library in my research.
Briefly: my implementation uses a templated RAII-style wrapper that persists objects in shared memory for as long as an instance of that object remains in scope. Once the last object goes out of scope, shared memory is freed. It works great so far. I’ll probably reference this library to compare implementations and see what I can learn. Thanks again.
If you need something more powerful, Boost.Interprocess is also a thing[1]. I used it like 10 years ago and seemed to work well. Looking at the docs, it even has a RAII-ish thing called `boost::interprocess::remove_shared_memory_on_destroy`.
Would be good if it had unix domain sockets (UDS) tested. Compared to a normal TCP/UDP socket, UDS does not go through the network stack in the kernel, which is pointless overhead when sender and receiver are on the same machine.
It causes a panic if the write to stdout fails (e.g. if it's a pipe and the other side was closed, or if stdout is redirected to a file and the disk is full).
The rust compiler requires handling this potential error in some way, or you get a compiler error. unwrap() "handles" the IO error by crashing the program with a panic, which is the simplest way of avoiding the compiler error. It's fine for such a toy program, but would be inappropriate for production code.
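For comparison, a small sketch of handling the error instead of unwrapping (the `BrokenPipe` special case is the common choice when output is piped into something like `head`):

```rust
use std::io::{self, Write};

fn main() {
    let stdout = io::stdout();
    let mut out = stdout.lock();

    // A closed pipe or full disk surfaces as an io::Error we can inspect,
    // rather than a panic from unwrap().
    if let Err(e) = writeln!(out, "result: {}", 42) {
        if e.kind() == io::ErrorKind::BrokenPipe {
            // The reader went away; exit quietly.
            std::process::exit(0);
        }
        eprintln!("failed to write to stdout: {e}");
        std::process::exit(1);
    }
}
```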
https://github.com/MarginResearch/cannoli/blob/main/.assets/...
A little blog on Cannoli itself here: https://margin.re/2022/05/cannoli-the-fast-qemu-tracer/