Disruptor-rs: better latency and throughput than crossbeam (github.com/nicholassm)
204 points by nicholassm83 33 days ago | 35 comments



I played around with the original (Java) LMAX Disruptor, which was an interesting and different way to achieve low latency and high throughput. I didn't find a whitepaper; here are some references[0], which include a Martin Fowler post[1].

[0] https://github.com/LMAX-Exchange/disruptor/wiki/Blogs-And-Ar...

[1] https://martinfowler.com/articles/lmax.html


here you go:

Disruptor, Thompson, Farley, et al 2011

https://lmax-exchange.github.io/disruptor/files/Disruptor-1....


This is really cool to see. Is anyone potentially working on integrating this with Tokio to bring these performance benefits to the async ecosystem? Or maybe I should ask first: would it make sense to look at this as a foundational library for the multi-threaded async frameworks in Rust?


Probably doesn't make sense. Busy-waiting is fast when you can dedicate a core to the task, but it means you cannot run many tasks in parallel on a small set of physical cores. When you oversubscribe, performance degrades quickly.

Tokio and other libraries such as pthreads allow a thread to wait for something and be woken when the event occurs. This is what allows a scheduler to multiplex many tasks onto a very small set of cores without running useless instructions checking for status.
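A minimal sketch of the two waiting styles (illustrative only, assuming a bare AtomicBool flag; real schedulers and the disruptor's wait strategies are more involved):

    use std::sync::atomic::{AtomicBool, Ordering};
    use std::sync::{Condvar, Mutex};

    // Busy-wait: lowest latency, but the core stays 100% hot while idle.
    fn spin_wait(ready: &AtomicBool) {
        while !ready.load(Ordering::Acquire) {
            std::hint::spin_loop(); // CPU hint; does not free the core
        }
    }

    // Blocking wait: the OS parks the thread and can reuse the core,
    // at the cost of microseconds (or more) of wake-up latency.
    fn block_wait(pair: &(Mutex<bool>, Condvar)) {
        let (lock, cvar) = pair;
        let mut ready = lock.lock().unwrap();
        while !*ready {
            ready = cvar.wait(ready).unwrap();
        }
    }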

As for a foundational library, I think you want things that are composable, and low-latency stuff is not that composable IMO.

Not saying they are bad, but low latency requires a global effort across your system, and using such a library without being aware of these limitations will likely cause more harm than good.


Tokio focuses on high throughput by default, since it mostly uses a yield_now backoff strategy. That should work for most applications.

Latency-sensitive applications tend to have a different goal: they trade extra CPU and RAM usage for lower latency first and throughput second.
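For illustration, here's a hypothetical tiered backoff that walks that spectrum; the thresholds are made up, and this is neither Tokio's nor disruptor-rs's actual strategy:

    use std::sync::atomic::{AtomicBool, Ordering};

    fn wait_with_backoff(ready: &AtomicBool) {
        let mut attempts: u32 = 0;
        while !ready.load(Ordering::Acquire) {
            attempts += 1;
            if attempts < 1_000 {
                std::hint::spin_loop(); // latency-first: stay hot on the core
            } else if attempts < 10_000 {
                std::thread::yield_now(); // give other runnable threads a turn
            } else {
                // CPU-friendly: stop burning cycles, accept wake-up latency.
                std::thread::sleep(std::time::Duration::from_micros(50));
            }
        }
    }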


I agree, the disruptor is more about low latency. And the cost is very high: a 100% utilized core. This is a great trade-off if you can make money by being faster, such as in e-trading.


High-throughput networking does the same thing: it polls the network adapter rather than waiting for interrupts.

The cost is not high; it's much less expensive to have a CPU operating efficiently than doing no useful work because it's syncing caches and context switching to handle an interrupt.

These libraries are for busy systems, not systems waiting 30 minutes for the next request to come in.

Basically, in an underutilized system, most of the time you poll there is nothing there, wasting CPU on the poll; in a high-throughput system, when you poll there is almost ALWAYS data ready to be read, so interrupts are less efficient when utilization is high.


Running half the cores of an industrial Xeon or Zen under 100% load implies very serious cooling. I suspect that running them all at 100% load for hours is just infeasible without e.g. water cooling.


Nah, it will just clock down. Server CPUs are designed to support all cores at 100% utilization indefinitely.

Of course you can get different numbers if you invent a nonstandard definition of utilization.


Of course server CPUs can run all cores at 100% indefinitely, as long as the cooling can handle it.

With 300 W to 400 W TDP (Xeon Platinum 9200) and two CPUs per typical 2U case, cooling is a real challenge, hence my mention of water cooling.


I disagree. Air cooling 1 kW per U is a commodity now. It's nothing special. (Whether your data center can handle it is another topic.)


Suppose I have a trading system built on Tokio. How would I go about using this instead? What parts need replacing?

Actually, looking at the code a bit, it seems like you could replace the select statements with the various handlers and hook up some threads to them. It would indeed cook your CPU, but that's OK for certain use cases.
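Something like this sketch, loosely following the README of the disruptor crate (the event and handler names are mine; check the docs for exact signatures). The handler closure plays the role a select! arm would, but runs on a dedicated busy-spinning thread:

    use disruptor::*;

    struct TickEvent { price: f64 }

    fn main() {
        // The factory pre-populates every slot of the ring buffer.
        let factory = || TickEvent { price: 0.0 };
        // Runs on its own consumer thread; BusySpin pins it in a hot loop.
        let processor = |event: &TickEvent, _sequence: Sequence, _end_of_batch: bool| {
            // handle the event here, where a select! arm used to be
        };
        let mut producer = build_single_producer(64, factory, BusySpin)
            .handle_events_with(processor)
            .build();
        producer.publish(|event| { event.price = 42.0; });
    }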


I would love to give you a good answer, but I've been working on low-latency trading systems for a decade, so I have never used async/actors/fibers/etc. I would think it implies a rewrite, as async is fundamentally baked into your code if you use Tokio.


Depends on what "fundamental" means. If we're talking about how stuff is scheduled, then yes of course you're right. Either we suspend stuff and take a hit on when to continue, or we hot-loop and latency is minimized at the cost of cooking a CPU.

But there's a bunch of the trading system that isn't that part, though. All the code that deals with the incoming exchange format might still be useful somehow. All the internal messages might keep the same format as well. The logic of putting events on some sort of queue for another worker (task/thread) to handle seems pretty similar to me. You are just handling the messages immediately rather than waking up a thread for them, and that seems to be the tradeoff.


These libs are more about hot paths / cache coherency and allowing single-CPU processing (no cache-coherency issues or lock contention) than anything else. That is where the performance comes from, referred to as "mechanical sympathy" in the original LMAX paper.

Originally computers were expensive and lots of users wanted to share a system, so a lot of OS design went into that. LMAX flips the script: computers are cheap, and you want the computer doing one thing as fast as possible, which isn't a good fit for modern OSes designed around the exact opposite idea. This is also why bare metal is many times faster than VMs in practice: you aren't sharing someone else's computer with a bunch of other programs polluting the cache.
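One concrete example of the cache-coherency point above is padding hot counters to a cache line so the producer and consumer cursors never false-share. A sketch, assuming 64-byte cache lines (typical for x86_64):

    use std::sync::atomic::AtomicI64;

    // Force each cursor onto its own cache line. Without this, two cores
    // bumping adjacent counters would ping-pong the shared line between
    // their caches on every write.
    #[repr(align(64))]
    struct PaddedCursor {
        sequence: AtomicI64,
    }

    struct RingCursors {
        producer: PaddedCursor, // owns its own cache line
        consumer: PaddedCursor, // owns its own cache line
    }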


Yeah, I agree. But the ideas of mechanical sympathy carry over into more than one kind of design. You can still be thinking about caches and branch prediction while writing things in async. It's just the awareness of it that allows you to make the tradeoffs you care about.


Eh... not really. The main problem is that it becomes incredibly hard to reason about the exact sequencing of things (which matters a lot for mechanical sympathy) in the async world.


Tokio's focus is on low tail latencies for networking applications (as mentioned). But it doesn't employ yield_now for waiting on a concurrent condition to occur, even as a backoff strategy, as that fundamentally kills tail latency under the average OS scheduler.


How would you? These are completely at odds: async is about suspending tasks that are waiting for something so that you can do other stuff in the meantime, while low latency is about spinning a core at 100% so you can start working the instant the thing you're waiting for arrives. You can't do both x)


No, not really; this is for synchronous processing. The events get overwritten, so by the time your async handler fires you're processing an item that has mutated.

What you're looking for is io_uring on Linux or IOCP on Windows; I don't think macOS has something similar, maybe kqueue.


Is there anything specific to Rust that this library does which modern C++ can’t match in performance? I’d be very interested to understand if there is.


For Rust users there's a significant difference:

* It's a Cargo package, which is trivial to add to a project. Pure-Rust projects are easier to build cross-platform.

* It exports a safe Rust interface. It has configurable levels of thread safety, which are protected from misuse at compile time.

The point isn't whether C++ can match the performance, but that you don't have to use C++ and can still get the performance, plus other niceties.

This is "is there anything specific to C++ that assembly can't match in performance?" one step removed.
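A sketch of the general compile-time-protection mechanism (not disruptor-rs's actual types): if a handle contains a non-Sync field, the compiler itself rejects sharing it across threads.

    use std::cell::Cell;

    struct SingleProducerHandle {
        cursor: Cell<i64>, // Cell is !Sync, so &SingleProducerHandle is !Send
    }

    fn main() {
        let handle = SingleProducerHandle { cursor: Cell::new(0) };
        // Moving the handle into exactly one thread compiles fine:
        std::thread::spawn(move || handle.cursor.set(1)).join().unwrap();
        // But sharing a reference to it from two threads would be a
        // compile error, because `Cell<i64>` is not `Sync`.
    }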


I expected that to be true. You just never know whether the Rust compiler has some more advanced/modern tricks that can only be accessed easily by writing Rust, without dropping down to assembly.


There is a trick in truly exclusive references (marked noalias in LLVM). C++ doesn't even have the lesser form of C's restrict pointers. However, a truly performance-focused C or C++ library would tweak the code to get the desired optimizations one way or another.

A more nebulous Rust perf advantage is the ability to rely on the compiler to check lifetimes and the immutability/exclusivity of pointers. This allows using fine-grained multithreading, even with 3rd-party code, without the worry that it's going to cause heisenbugs. It allows library APIs to work with temporary complex references that would be footguns otherwise (e.g. prefer string_view over string; don't copy inputs defensively, because it's known they can't be mutated or freed even by a broken caller).
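A tiny example of the noalias win: because two `&mut` references are guaranteed not to alias, the compiler can fold this to a constant, whereas the equivalent C++ function taking two `int*` parameters must reload `*a` after the store through `b`.

    // The compiler may compile this down to "return 3": the write through
    // `b` cannot have changed `*a`, since &mut references never alias.
    pub fn store_and_sum(a: &mut i32, b: &mut i32) -> i32 {
        *a = 1;
        *b = 2;
        *a + *b
    }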


> C++ doesn't even have the lesser form of C restrict pointers.

Standard C++ doesn't, but `noalias` is available in basically every major compiler (including the more niche embedded toolchains).


And they're all extremely buggy, to the point where Rust has disabled and re-enabled noalias many times over as bugs are constantly discovered in LLVM, because Rust is its only major user.


No, there shouldn’t be.

Rust is not magic, and you can compile both with LLVM (clang++).

If you specify that the pointers don’t alias, and don’t use any language sugar that adds overhead on either side, the performance will be very similar.


I agree.

The Rust implementation even needs to use a few unsafe blocks (to work with UnsafeCells internally) but is mostly safe code. Other than that you can achieve the same in C++. But I think the real benefit is that you can write the rest of your code in safe Rust.


While you’re not explicitly saying this, C++ in Rust’s terms, is all unsafe. In a multi-threading context like this, that’s even more important.


I'm trying to be polite. :-) And there is a lot of great C++ code and developers out there - especially in the e-trading/HFT space.


Nit: C/C++ is safer than Rust's `unsafe`. There are constraints (no aliasing) that must be upheld in `unsafe` code.


Unless, that is, the rest of your code is already in C++ and you're interested in this new, better disruptor implementation. That's probably a common situation for people interested in this topic. Any recommendations for those in that situation? Perhaps existing C++ implementations already match this, idk.


So nice! I was just reading about the disruptor, since I had an idea of using ring buffers with atomic operations to back Rust channels with lower latency for inter-thread communication without locks, and now I see this. Gonna take a read!
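For reference, here's roughly what the core of that idea looks like: a minimal single-producer/single-consumer ring buffer backed by atomics. An illustrative sketch, not hardened code; a real channel also needs waking, batching, and drop handling.

    use std::cell::UnsafeCell;
    use std::sync::atomic::{AtomicUsize, Ordering};

    // Capacity must be a power of two so masking replaces modulo.
    const CAP: usize = 64;

    pub struct Spsc<T> {
        slots: [UnsafeCell<Option<T>>; CAP],
        tail: AtomicUsize, // only the producer writes this
        head: AtomicUsize, // only the consumer writes this
    }

    // Sound only under the single-producer/single-consumer contract.
    unsafe impl<T: Send> Sync for Spsc<T> {}

    impl<T> Spsc<T> {
        pub fn new() -> Self {
            Self {
                slots: std::array::from_fn(|_| UnsafeCell::new(None)),
                tail: AtomicUsize::new(0),
                head: AtomicUsize::new(0),
            }
        }

        /// Producer side: hands the value back if the buffer is full.
        pub fn push(&self, value: T) -> Result<(), T> {
            let tail = self.tail.load(Ordering::Relaxed);
            if tail.wrapping_sub(self.head.load(Ordering::Acquire)) == CAP {
                return Err(value); // full
            }
            unsafe { *self.slots[tail & (CAP - 1)].get() = Some(value) };
            self.tail.store(tail.wrapping_add(1), Ordering::Release);
            Ok(())
        }

        /// Consumer side: None if the buffer is empty.
        pub fn pop(&self) -> Option<T> {
            let head = self.head.load(Ordering::Relaxed);
            if head == self.tail.load(Ordering::Acquire) {
                return None; // empty
            }
            let value = unsafe { (*self.slots[head & (CAP - 1)].get()).take() };
            self.head.store(head.wrapping_add(1), Ordering::Release);
            value
        }
    }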


Is there also a decent C++ implementation of the disruptor out there?


Here’s one I’ve actually used/played with (though never measured performance of): https://github.com/lewissbaker/disruptorplus

And here’s one I saw linked on HN recently: https://github.com/0burak/imperial_hft/tree/main/distuptor



