Hi Marc (talawahtech)! Thanks for the exhaustive article.
I took a short look at the benchmark setup (https://github.com/talawahtech/seastar/blob/http-performance...), and wonder if some simplifications there lead to overinflated performance numbers. The server here executes a single read() on the connection - and as soon as it receives any data it sends back headers. A real world HTTP server needs to read data until all header and body data is consumed before responding.
Now given the benchmark probably sends tiny requests, the server might get everything in a single buffer. However every time it does not, the server will send back two responses to the client - and at that point the client will already have a response for the follow-up request before actually sending it - which overinflates numbers. Might be interesting to re-test with a proper HTTP implementation (at least read until the last 4 bytes received are \r\n\r\n, and assume the benchmark client will never send a body).
Such a bug might also lead to a lot more write() calls than what would be actually necessary to serve the workload, or to stalling due to full send or receive buffers - all of those might also have an impact on performance.
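To show what I mean by "read until \r\n\r\n", here is a minimal sketch using plain blocking POSIX reads (not Seastar's API), assuming the client never sends a body; the function name and buffer handling are just placeholders:

    #include <string.h>
    #include <unistd.h>

    /* Keep reading until the last 4 bytes received are "\r\n\r\n".
       Returns 0 on success, -1 on error, EOF, or buffer overflow. */
    static int read_full_request(int fd, char *buf, size_t cap)
    {
        size_t used = 0;
        while (used < 4 || memcmp(buf + used - 4, "\r\n\r\n", 4) != 0) {
            if (used == cap)
                return -1;                 /* request larger than buffer */
            ssize_t n = read(fd, buf + used, cap - used);
            if (n <= 0)
                return -1;                 /* error or peer closed */
            used += (size_t)n;
        }
        return 0;
    }

Only once that returns should the response be written; note it only checks the tail of the accumulated buffer, so pipelined requests would still need real parsing.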
Yea, it is definitely a fake HTTP server which I acknowledge in the article [1]. However based on the size of the requests, and my observation of the number of packets per second in/out being symmetrical at the network interface level, I didn't have a concern about doubled responses.
Skipping the parsing of the HTTP requests definitely gives a performance boost, but for this comparison both sides got the same boost, so I didn't mind being less strict. Seastar's HTTP parser was being finicky, so I chose the easy route and just removed it from the equation.
For reference though, in my previous post[2] libreactor was able to hit 1.2M req/s while fully parsing the HTTP requests using picohttpparser[3]. But that is still a very simple and highly optimized implementation. FYI, from what I recall, when I played with disabling HTTP parsing in libreactor, I got a performance boost of about 5%.
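If it helps, the picohttpparser call looks roughly like this (a from-memory sketch, so treat the surrounding buffer handling as simplified; parse_request and its arguments are just placeholders):

    #include <stddef.h>
    #include "picohttpparser.h"

    /* Returns bytes consumed (>0), -2 if the request is still incomplete,
       -1 on a parse error. buf/len hold everything read so far; prev_len is
       how much was already scanned on an earlier incomplete attempt. */
    static int parse_request(const char *buf, size_t len, size_t prev_len)
    {
        const char *method, *path;
        size_t method_len, path_len, num_headers = 16;
        int minor_version;
        struct phr_header headers[16];

        return phr_parse_request(buf, len, &method, &method_len,
                                 &path, &path_len, &minor_version,
                                 headers, &num_headers, prev_len);
    }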
> Yea, it is definitely a fake HTTP server which I acknowledge in the article
It's not actually an HTTP server though... For these purposes, it's essentially no more useful than netcat dumping out a preconfigured text file. Titling it "HTTP Performance showdown" is doubly bad here since there are no real-world (or even moderately synthetic) HTTP requests happening; you just always get the same static set of data for every request, regardless of what that request is. Call it whatever you like, but that isn't HTTP. A key part of the performance equation on the web is the difference in response time involved in returning different kinds (sizes and types) of responses.
A more compelling argument could be made for the improved performance you can get by bypassing the kernel's networking, but this article isn't it. What this article demonstrates is that in this one very narrow case, where you always want to return the same static data, there are vast speed improvements to be had. This doesn't tell you anything useful about performance in essentially any real-world use of the web, and its premise falls down when you consider that kernel interrupt speeds are unlikely to be the bottleneck in most servers, even caches.
I'd really love to see this adapted to do actual webserver work and see what the difference is. A good candidate might be an in-memory static cache server of some kind. It would require URL parsing to feed out resources but would better emulate an environment that might benefit from this kind of change and certainly would be a real-world situation that many companies are familiar with. Like it or not, URL parsing is part of the performance equation when you're talking HTTP.
Correct, it is a fake HTTP server, serving a real HTTP workload. This post is about comparing two different networking stacks (kernel vs DPDK) to see how they handle a specific (and extreme) HTTP workload. From the perspective of the networking stack, the networking hardware, and the AWS networking fabric between instances, these are real HTTP requests and responses.
> I'd really love to see this adapted to do actual webserver work and see what the difference is.
Take a look at my previous article[1]. It is still an extreme/synthetic benchmark, but libreactor was able to hit 1.2M req/s while fully parsing the HTTP requests using picohttpparser[3].
From what I recall, when I played with disabling HTTP parsing in libreactor, the performance improvement was only about 5%.
ScyllaDB uses Seastar as an engine, and the DynamoDB-compatible API uses HTTP parsing, so this use case is real. Of course the DB has much more to do than this benchmark with a static HTTP reply, but Scylla also uses many more cores on the server, so it is close to real life. We do use the kernel's TCP stack, due to all of its features and also since we don't have capacity for a deeper analysis.
Some K/V workloads are affected by the networking stack, and we recently saw issues when we didn't choose the ideal interrupt mode (multiqueue vs single queue on small machines).
A few questions if you will; it's interesting work, and I figure you're on the ScyllaDB team?
1. Is a 5s experiment with a 1s warmup really a representative workload? How about running for several minutes or tens of minutes? Do you observe the same result?
2. How about 256 connections on 16 vCPUs creating contention against each other and therefore skewing the experiment results? Aren't they competing for the same resources against each other?
3. Are the experiment results reproducible on different machines (at first use the same and then similar SW+HW configurations)?
4. How many times is the experiment (benchmark) repeated, and what about the statistical significance of the observed results? How do you make sure that what you're observing, and hence drawing a conclusion from in the end, is really what you thought you were measuring?
I am at ScyllaDB, but Marc did completely independent work.
The client vcpus don't matter that much; the experiment compares the server side, the client shouldn't suck.
When we test ScyllaDB or other DBs, we run benchmarks for hours and days. This is just a stateless, static http daemon, so short timing is reasonable.
The whole intent is to make it a learning experience, if you wish to reproduce, try it yourself. It's aligned with past measurements of ours and also with former Linux optimizations by Marc.
I'm myself doing a lot of algorithmic design but I also enjoy designing e2e performance testing frameworks in order to confirm theories I or others had on a paper. The thing is that I fell too many times into a trap without realizing that the results I was observing weren't what I thought I was measuring. So what I was hoping for is to spark a discussion around the thoughts and methodologies other people from the field use and hopefully learn something new.
The Data Plane Development Kit (DPDK) is an open source software project managed by the Linux Foundation. It provides a set of data plane libraries and network interface controller polling-mode drivers for offloading TCP packet processing from the operating system kernel to processes running in user space. This offloading achieves higher computing efficiency and higher packet throughput than is possible using the interrupt-driven processing provided in the kernel.
Basically, it's what people use when they want to bypass the kernel network stack because they think it's slow. They then spend the next few years writing their own stack until they realise they've just re-written what the kernel does, and it's slower and full of exploits.
Yeah, receiving packets is fast when you aren't doing anything with them.
Networking stacks are used in two ways: endpoints and forwarders.
Forwarders are usually not doing much with packets, just reading a few fields and choosing an output port. These are very common functions in network cores (telcos, datacenters, ...).
DPDK is not well-suited for endpoint application programming, even though you can still squeeze out some additional performance there.
But don't dismiss a framework that is currently so widely deployed just because you are not familiar with its most common use-case.
Sometimes you really don't do much more on packets than receive and store in (e.g) ring buffers until you can trigger some computation on aggregated data. Sometimes your application has critical latency and you need very light parsing. Not everyone uses tcp and all the options, some people are just down to jumbo udp packets, no fragmentation, and 800Gb/s rx...
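That receive-and-stash pattern is basically the classic rte_eth_rx_burst() poll loop feeding an rte_ring; a rough sketch, assuming the EAL, the port, and the ring have already been initialized elsewhere:

    #include <rte_ethdev.h>
    #include <rte_mbuf.h>
    #include <rte_ring.h>

    #define BURST_SIZE 32

    /* Poll one RX queue and stash whatever arrives into a ring for a
       separate worker to aggregate/process later. */
    static void rx_loop(uint16_t port_id, uint16_t queue_id, struct rte_ring *ring)
    {
        struct rte_mbuf *bufs[BURST_SIZE];

        for (;;) {
            uint16_t n = rte_eth_rx_burst(port_id, queue_id, bufs, BURST_SIZE);
            if (n == 0)
                continue;                  /* busy poll: nothing arrived */

            unsigned int queued =
                rte_ring_enqueue_burst(ring, (void **)bufs, n, NULL);

            /* Drop what didn't fit so mbufs aren't leaked. */
            for (unsigned int i = queued; i < n; i++)
                rte_pktmbuf_free(bufs[i]);
        }
    }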
It’s typically done when end-to-end latency is more important than full protocol compliance and hardening against myriad types of attackers. High frequency trading is typically where you see these sorts of implementations being really compelling.
For these types of applications there are proprietary implementations that you can buy from vendors that are more suited to latency sensitive applications.
The next level of optimization after kernel bypass is to build or buy FPGAs which implement the wire protocol + transport as an integrated circuit.
Fair enough. My knowledge of what’s on the bleeding edge and in mass adoption in algo trading is fairly dated. I haven’t been adjacent to that industry in over 5 years. Even back then there were turnkey ASICs you could buy that would implement the network plus whatever protocols you used to talk to the exchange.
I think the neatest thing that sticks out to me is how logging in implementations of low latency trading applications was essentially pushed to the network. Basically they just have copper or fiber taps between each node in their system and it gets sent to another box that aggregates the traffic and processes all the packets to provide trace level logging through their entire system. Even these have solutions you can buy from a number of vendors.
That's not the reality. The kernel network stack does a lot of things that for some applications are not needed, so DPDK is used for those cases when they need very high performance. For example, packet capture, routing, packet generation... I don't think I've seen anyone rewriting what the kernel does precisely because when you use DPDK you don't want to do what the kernel does.
this is completely wrong. as the article points out, there are plenty of tcp stacks and other user space stacks available. dpdk is used in places where performance matters, and just because you haven't worked in that area doesn't mean it's useless. all telcos use it for the most part.
Normally the endpoint stack you need is more limited initially, and you don't have to carry all the baggage. The biggest advantage you get in the end is zero-copy networking, and if what you do involves tons of reading and writing to the network, you can definitely gain performance in both bandwidth and latency by better using your CPU.
You can do XDP as well and gain zero copy, but only for a small subset of NICs, and you still need to implement your own network stack - I don't see a way to avoid that. There are existing TCP stacks for DPDK that you can use as well.
eBPF pushes computation into the kernel to be colocated with kernel data/events. DPDK pushes all of the relevant data (the network packets) into user space to be colocated with user data/events.
You can mix both a bit, selecting where packets go with an eBPF program and running higher-level stacks in user space, where user space is still dealing with raw packets rather than sockets.
DPDK has an XDP driver too so you can use that and also work with network cards that do not support XDP or the zero copy variant for more flexibility. Though DPDK can be a pain in and of itself, it is fairly opinionated and I don't like all of its opinions. It has its roots in switching and routing applications and using it for an endpoint can have some mismatches in opinion. It is open source so you can modify it which works out for me in the end.
I wouldn't say XDP is an alternative to DPDK. XDP severely limits what you can do as you insert code inside of the kernel. DPDK exposes the NIC to regular user code so you can do what you want. XDP is useful when you want to modify what the kernel does with packets before they go to their destination application, DPDK when you want to do whatever with the packets without sending them to any other application.
That's why the parent mentions AF_XDP. AF_XDP allows you to get raw packets into userspace, bypassing the kernel stack. There users can run e.g. a custom TCP stack to process the packets. That's similar to what can be done with DPDK.
And both tools allow exporting a "regular NIC" to the user. With AF_XDP one would probably just run the AF_XDP socket in a way that intercepts packets for a certain port and leaves all other traffic alone. That traffic would then flow through the regular kernel stack, and the network interface would continue to show up as if XDP weren't there.
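To make the steering part concrete, here is a rough sketch of the kind of XDP program involved - redirect only the port you care about into an XSKMAP and XDP_PASS everything else (port 8080 and the map size are made-up examples):

    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <linux/in.h>
    #include <linux/ip.h>
    #include <linux/tcp.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>

    struct {
        __uint(type, BPF_MAP_TYPE_XSKMAP);
        __uint(max_entries, 64);
        __type(key, __u32);
        __type(value, __u32);
    } xsks_map SEC(".maps");

    SEC("xdp")
    int steer(struct xdp_md *ctx)
    {
        void *data = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;

        struct ethhdr *eth = data;
        if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
            return XDP_PASS;

        struct iphdr *ip = (void *)(eth + 1);
        if ((void *)(ip + 1) > data_end || ip->protocol != IPPROTO_TCP)
            return XDP_PASS;

        struct tcphdr *tcp = (void *)ip + ip->ihl * 4;
        if ((void *)(tcp + 1) > data_end || tcp->dest != bpf_htons(8080))
            return XDP_PASS;

        /* Hand the packet to the AF_XDP socket registered for this RX queue;
           on newer kernels the last argument acts as the fallback action. */
        return bpf_redirect_map(&xsks_map, ctx->rx_queue_index, XDP_PASS);
    }

    char _license[] SEC("license") = "GPL";

Everything that doesn't match the filter keeps flowing through the normal kernel stack, which is what keeps the interface looking "regular".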
DPDK is more equivalent to XDP. eBPF is not really comparable, since you can't run large programs inside the kernel's BPF VM. DPDK is much, much more than just a packet processing framework.
At the point you've gotten to, syscall overhead is definitely going to be a big thing (even without Spectre mitigations enabled) -- I'd be very curious to see how far a similar io_uring benchmark would get you.
It supports IOPOLL (polling of the socket) and SQPOLL (kernel side polling of the request queue) so hopefully the fact that application driving it is in another thread wouldn't slow it too much... With multi-shot accept/recv you'd only need to tell it to keep accepting connections on the listener fd, but I'm not sure if you can chain recvs to the child fd automatically from kernel or not yet... We live in interesting times!
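For the accept side at least, the multishot variant in liburing looks something like this sketch (listen_fd is assumed to be a listening socket set up elsewhere, and the recv side is omitted):

    #include <liburing.h>

    /* Accept connections forever via one multishot accept SQE. */
    static void accept_loop(int listen_fd)
    {
        struct io_uring ring;
        io_uring_queue_init(256, &ring, 0);

        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_multishot_accept(sqe, listen_fd, NULL, NULL, 0);
        io_uring_submit(&ring);

        struct io_uring_cqe *cqe;
        while (io_uring_wait_cqe(&ring, &cqe) == 0) {
            if (cqe->res >= 0) {
                int conn_fd = cqe->res;   /* new connection: queue recv()s on it here */
                (void)conn_fd;
            }
            /* If IORING_CQE_F_MORE is clear, the accept stopped and must be re-armed. */
            io_uring_cqe_seen(&ring, cqe);
        }
    }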
I would love to see an io_uring comparison as well; while it's a substantial amount of work to port an existing framework to io_uring, at the point where you're considering DPDK, io_uring seems relatively small by comparison.
I personally started a new framework and I went for io_uring for simplicity, that is already giving most of what I need -- asynchronous I/O with no context switching.
DPDK is huge and inflexible, it does a lot of things which I'd rather be in control of myself and I think it's easier to just do my own userspace vfio.
oh! That's not obvious at all from the man page (io_uring_enter.2)
    If the io_uring instance was configured for polling, by specifying IORING_SETUP_IOPOLL in the call to io_uring_setup(2), then min_complete has a slightly different meaning. Passing a value of 0 instructs the kernel to return any events which are already complete, without blocking. If min_complete is a non-zero value, the kernel will still return immediately if any completion events are available. If no event completions are available, then the call will poll either until one or more completions become available, or until the process has exceeded its scheduler time slice.
... Well, TIL -- thanks! and the NAPI patch you pointed at looks interesting too.
Yes, I would be very interested in that as well. I work with a DPDK based app and have sometimes thought about how close to the same performance we could get by using as many kernel optimizations as possible (io_uring, xdp, I don't know what else). It would be an interesting option to give the app since it would greatly simplify the deployment sometimes, as dealing with the DPDK drivers and passthrough in a container and cloud world isn't always the easiest thing.
> I am genuinely interested in hearing the opinions of more security experts on this (turning off speculative execution mitigations). If this is your area of expertise, feel free to leave a comment
Are these generally safe if you have a machine that does not have multi-user access and is in a security boundary?
For regulatory compliance, that’s still not acceptable because it opens the door to cross user data access or privilege escalation.
If someone exploited a process with a dedicated unprivileged user, had legit limited access, or got into a container on a physical host, they might be able to leverage it for the forces of evil.
There’s really no such practical thing as single user Linux. If you’re running a network exposed app without dropping privileges, that’s a much bigger security risk than speculative execution.
Now, if you were skipping an OS and going full bare metal, then that could be different. But an audit for that would be a nightmare :).
There’s definitely sneakernet Linux boxes. I’ve worked at a bunch of places with random Linux boxes running weird crap totally off the actual network because nobody was particularly sure how to get those programs updated. Technical debt is a pita!
I suspect you are conflating security regulations for Unix users with regulations targeting users of the system.
Why would a regulatory framework care if a Linux box running one process was vulnerable to attacks that involve switching UIDs?
Conversely, why would that same regulatory framework not care if users of that network service were able to impersonate each other / access each others’ data?
Most of the controls are about auditability and data access.
But the control frameworks are silly sometimes. Then add in that they’re enforced by 3rd party auditor consultants looking for any reason to drag it out.
And yeah, I tried to get this past them for an old singleton system to avoid having to buy a bigger non-standard server.
The worst is the fake regulation that is PCI. But this one was a SOX audit, where I tried to avoid buying new gear.
I hate auditors. Making sense doesn’t matter. Their interpretation of the control does. I do have a playbook of “how I meet X but not the way you want me to”, but lost that one. Probably spent more $ arguing than the HW cost.
You kind of have to consider the whole system. Who has access, including hypothetical attackers that get in through network facing vulnerabilities, and what privilege escalations they could do with the mitigations on or off.
I've run production systems with mitigations off. All of the intended (login) users had authorized privilege escalation, so I wasn't worried about them. There was a single primary service daemon per machine, and if you broke into that, you wouldn't get anything really useful by breaking into root from there. And the systems where I made a particularly intentional decision were more or less syscall limited; enabling mitigations significantly reduced capacity, so mitigations were disabled. (This was in line with guidance from the dedicated security team where I was working.)
Browser tabs from different origins are equivalent of the multi-user access here, at least in cases where the speculative execution vulnerability can be exploited from JS. Same for other workloads where untrusted parties have a sufficient degree of control on what code executes.
Generally.. depends on what you mean by generally. In the casual sense of the word, speculative execution attacks are not very common, so it can be said that most people are mostly safe from them independent of mitigations. Someone might also use "generally safe" to mean proven security against a whole class of attacks, in which case the answer would be no.
Anytime the discussion around turning off mitigations comes up, it's trumped by the "why don't we just leave them on and buy more computers" trump card.
An easier solution than mitigating is to just upgrade your AWS instance size.
As far as I am aware, the capability required to exploit Meltdown is compute and timers. For example, an attacker running a binary directly on the host, or a VM/interpreter (js, wasm, python, etc) executing the attacker's code that exposes a timer in some way.
If you just have a dumb "request/response" service you may not have to worry.
A database is an interesting case. Scylla uses CQL, which is a very limited language compared to something like javascript or SQL - there's no way to loop as far as I know, for example. I would probably recommend not exposing your database directly to an attacker anyways, that seems like a niche scenario.
If you're just providing, say, a gRPC API that takes some arguments, places them (safely) into a Scylla query, and gives you some result, I don't think any of those mitigations are necessary and you'll probably see a really nice win if you disable them.
This is my understanding as a security professional who is not an expert on those attacks, because I frankly don't have the time. I'll defer to someone who has done the work.
Separately, (quoting the article)
> Let's suppose, on the one hand, that you have a multi-user system that relies solely on Linux user permissions and namespaces to establish security boundaries. You should probably leave the mitigations enabled for that system.
Please don't ever rely on Linux user permissions/namespaces for a multi-tenant system. It is not even close to being sufficient, with or without those mitigations. It might be OK in situations where you also have strong auth (like ssh with fido2 mfa), but if your scenario is "run untrusted code" you can't trust the kernel to do isolation.
> On the other hand, suppose you are running an API server all by itself on a single purpose EC2 instance. Let's also assume that it doesn't run untrusted code, and that the instance uses Nitro Enclaves to protect extra sensitive information. If the instance is the security boundary and the Nitro Enclave provides defense in depth, then does that put mitigations=off back on the table?
Yeah that seems fine.
> Most people don't disable Spectre mitigations, so solutions that work with them enabled are important. I am not 100% sure that all of the mitigation overhead comes from syscalls,
This has been my experience and, based on how the mitigation works, I think that's going to be the case. The mitigations have been pretty brutal for syscalls - though I think the blame should fall on intel, not the mitigation that has had to be placed on top of their mistake.
Presumably io_uring is the solution, although that has its own security issues... like an entirely new syscall interface with its own bugs, lack of auditing, no ability to seccomp io_uring calls, no meaningful LSM hooks, etc. It'll be a while before I'm comfortable exposing io_uring to untrusted code.
This was a fascinating read and the kernel does quite nicely in comparison - 66% of DPDK performance is amazing. That said, the article completely nails the performance advantage: DPDK doesn't do a lot of stuff that the kernel does. That stuff takes time. If I recall correctly, DPDK abstractions themselves cost a bit of NIC performance, so it might be interesting to see a comparison including a raw NIC-specific kernel bypass framework (like the SolarFlare one).
> If I recall correctly, DPDK abstractions themselves cost a bit of NIC performance, so it might be interesting to see a comparison including a raw NIC-specific kernel bypass framework (like the SolarFlare one).
DPDK performs fairly well, even better for the most part. For some years I maintained a modification of the ixgbe kernel driver for Intel NICs that allowed us to perform high-performance traffic capture. We finally moved to DPDK once it was stable enough and we had the need to support more NICs, and in our comparisons we didn't see a performance hit.
Maybe manufacturer-made drivers can be better than DPDK, but if I had to guess that would be not because of the abstractions but because of the knowledge of the NIC architecture and parameters. I remember when we tried to do a PoC of a Mellanox driver modification and a lot of the work to get high performance was understanding the NIC options and tweaking them to get the most for our use case.
Is there a good comparison of these technologies? I've used DPDK for high-rate streaming data and it roughly doubled my throughput over 10GE. I hear people using things like DMA over Ethernet, and it sounds like there are several competing technologies. My use case is to get something from the phy layer into GPU memory as fast as possible; latency is less important than throughput.
What you're looking for is RDMA. It was mostly restricted to Infiniband (IB) back in the days, but nowadays you probably want RoCEv2. You can look at iWARP too but I think nowadays RoCE won.
In any case, the standard software API for RDMA is ibverbs. All adapters supporting RDMA (be it IB, RoCE or iWARP) will expose it. You can get cloud instances with RDMA on AWS and Azure.
> I am not 100% sure that all of the mitigation overhead comes from syscalls, but it stands to reason that a lot of it arises from security hardening in user-to-kernel and kernel-to-user transitions.
Will io_uring be also affected by Spectre mitigations given it has eliminated most kernel/user switches?
And did anyone do a head-to-head comparison between io_uring and DPDK?
Good point. This is more of a tcp stack comparison between the kernel and userspace. Seastar has a sharded (per core) stack, which is very beneficial when the number of threads is high
You can set up one or many rings per core, but the idea I alluded to elsewhere in this comment section of spending 2 cores to do kernel busy polling and userspace busy polling for a single ring is less useful if your alternative makes good use of all cores.
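For reference, the kernel-side busy polling that costs that extra core is just the SQPOLL setup, roughly like this (the core number and idle timeout are arbitrary examples):

    #include <liburing.h>

    static int setup_sqpoll_ring(struct io_uring *ring)
    {
        struct io_uring_params p = {
            .flags = IORING_SETUP_SQPOLL | IORING_SETUP_SQ_AFF,
            .sq_thread_cpu  = 3,      /* pin the kernel-side poller to a dedicated core */
            .sq_thread_idle = 2000,   /* ms of idleness before the poller thread sleeps */
        };
        return io_uring_queue_init_params(256, ring, &p);
    }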
What's amazing is that the seastar tcp stack hasn't been changed over the past 7 years, while the kernel received plenty of improvements (in order to close the gap vs kernel bypass mechanisms).
Still, for >> 99% of users, there is no need to bypass the kernel.
Why is this interesting to anyone?
Haven't we all moved to https by now?
Optimizing raw http seems to me like a huge waste of time by now.
I say that as someone who has spent years optimizing raw http performance.
None of that matters these days.
For me the interest comes from seeing the speed boost between regular kernel/optimized kernel/DPDK. What you put behind the RX layer doesn't really matter, but it's good to see numbers and things to do when your RX system isn't giving you enough throughput.
I feel this would be a good place to use a SPARK-based TCP stack. You're bypassing the kernel and have to run stuff as root or with risky CAP_ rights, so your stack should be as solid as possible.
Might also give people here some ideas on how to combine symbolic execution, proof, C and SPARK code and how to gain confidence in each part of a network stack.
I think there's even some ongoing work climbing up the stack up to HTTP but not sure of the plan (not involved).