I feel like this article doesn't address the crux of why multithreading can't be faster. Kinda smells like LLM-style writing too - a lot of text that sounds like it's explaining something but skips too many details to be a great argument.
It's a (possibly machine-generated) translation of a Chinese article linked to at the bottom, which contains a number of diagrams that the translation lacks (even though it refers to a diagram at least once).
I mean, the only caveats of multithreading (apart from bugs) is whether you can obtain full linear scaling or if your data structures require synchronization to the point of becoming bottle-necks.
In case of a database, the majority of the work done by the process is in accessing big, shared data structures, which can lead to a lot of congestion. One has to go out of their way to create structures that do not require writers to have exclusive access that penalizes read (e.g., a write log that is occasionally merged), but then you still have to sort out write congestion. When perfected to allow maximum performance, the structure still ends up more complicated and will not allow fully linear scaling.
With a single-threaded model you lose parallelism, but you gain the ability to use the simple structures with reckless abandon, and performance is linear over the average request processing time.
Other types of applications do not necessarily have these limitations. E.g., you can make a multi-threaded webserver that requires no synchronization as it works off static data with requests and connections being entirely thread-local. The application could also be compute-bound, with the time spend accessing contentious shared structures being just a minor part of the execution time.
Yeah, this looks like summarization by LLM (probably from Chinese article). For example, there are definitely more than 5 datatypes in Redis: LLMs have no idea what can be summarized in which way.
The biggest per core "efficiency" improvement will come from RDMA, since IO tends to be the largest bottleneck in Redis (and now Valkey). However, if you are running with deep pipelines (sending multiple commands as a batch without waiting for responses from each commands) than the benefits of RDMA are pretty limited since the bottleneck is on command processing. Multithreading doesn't help much either.
One of the strategies that is being used in the multi-threading is using CPU memory prefetching to pull memory closing to the CPU so that we aren't stalling on fetching data from main memory while executing commands. We still want to try to apply these techniques without multi-threading to improve the efficiency of single or double core installations.
I'm guessing that RDMA only applies for performance benchmarks of clustered / sharded Redis setups right?
I would expect vertical scaling to have better performance for most use cases as you can get cheap servers with very large amounts of RAM and high number of CPU cores.
No, the thing about RDMA is that it significantly lowers latency and allows for copying of data without involving the cpu. If you use RoCE, you need advanced network switches that support PFC, ECN, DCBX that allows for lossless ethernet with the ability to pause frames to see any benefit. While lossy RoCE do work on newer generation of network interfaces, it comes with quirks and may not be any faster than TCP.
Of course multi-threaded is faster (more QPS) as long as there is no high contention for resources. Especially things like read-heavy workloads will see a big improvement. Single-threaded implementations have the advantage to be much simpler, robust and don't require synchronization.
This is basically the original source of the claim, https://valkey.io/blog/unlock-one-million-rps/. We are working on a followup that goes more into depth about how to setup the benchmark so that is easier to reproduce.
I don't know if it's the multithreading, but keydb is a multithreaded fork of redis and in our benchmark it's very fast; often faster than the advertised 5x.
reply