The timeout argument does not work as intended. The timeout is
checked only after the receipt of each datagram, so that if up to
vlen-1 datagrams are received before the timeout expires, but then no
further datagrams are received, the call will block forever.
I've done similar tests with UDP applications. It's possible to get 500K pps on a multi-core system with a test application that isn't too complex, or uses too many tricks. The problem is that the system spends 80% to 90% of its time in the kernel doing IO. So you have no time left to run your application.
Another alternative is pcap and PF_RING, as seen here: https://github.com/robertdavidgraham/robdns
That might be useful. Previous discussion on robdns: https://news.ycombinator.com/item?id=8802425
The point of all that work to context switch into processes to handle small amounts of network I/O is that very often THE CORRECT SOFTWARE ARCHITECTURE is for multiple address-space-separated processes to be doing small amounts of network I/O. That I/O "means something" to a larger data model being implemented by the software.
It's true that for some tasks that "look like routing" there's no point to having that kind of external data model. The packets are the data being operated on. So there's little value in process separation and you might as well DMA them all streamwise into a single process to do it. And that's great stuff, but AFAICT it's really not what the linked article is about.
Ultimately, all those packets are going to end up in conventional processes, because that's where conventional processing needs to happen. There are very good reasons why we like our page-protected address space separation in this world!
Apparently you can also use the SO_RCVTIMEO socket option, which is a way to specify a timeout for all receive operations on the socket.
Makes me wonder how often bypassing the kernel is used in production networked applications.
Netmap is commonly used in HFT as well as packet filtering applications. I believe Verisign is running some of the root DNS servers with netmap as well, getting millions of connections per second.
(It's slightly worse than that due to extra cache and TLB pressure, but I doubt that matters in this workload.)
> They both have two six core 2GHz Xeon processors. With hyperthreading (HT) enabled that counts to 24 processors on each box.
24 * 50,000 = 1,200,000
> we had shown that it is technically possible to receive 1Mpps on a Linux machine
So the original proposition was correct.
> two cores busy with handling RX queues, and third running the application, it's possible to get ~650k pps
That's ~200k pps per core, so 4x the initial bet.
If you really want to squeeze out all the performance of your network card, what you should use is something like DPDK.
If you really want maximum throughput with RDMA, I think the best is to go InfiniBand.
InfiniBand was the way to go 5/10 years ago, when 10G Ethernet was not there.
Nowadays, most of companies that invested in IB years ago are stuck with a dead infrastructure. It costs a lot, there is very little knowledge about the techno, and most support for it is dropping (e.g. glusterfs).
Sad truth is most of these old IB infra are now used for IPoIB ...
Source: been working in HFT firms implementing IB RDMA, then GBEth RDMA, now proprietary NIC RDMA
A: "Use BSD"
Q: Why is there such a strong focus on trying to get Linux network performance when (I think) everyone agrees BSD is better at networking? What does Linux offer beyond the network that BSD doesn't when it comes to applications that demand the fastest networks?
ps. I think the markdown filter is broken, I can't make a literal asterisk with a backslash. Anyone know how HN lets you make an inline asterisk?
We're not consuming/using the built in network stacks anyway, we're using the OS as a content delivery system. Get us something we can get to the cores on, we're going to pin our applications directly to those cores keeping the kernel relegated to scheduling tasks on whichever NUMA core is further away from the PCIe bus running into the CPU. We'll have the CPU's we're using pegged in a constant spinlock anyway which would make the scheduler think really, really hard about running tasks there. We don't use realtime kernels as it's better for us to pay the price on the occasional spike outlier than to raise the baseline latency up.
Due to my own unfamiliarity, I don't know what BSD's equivalent to isolcpus is. I don't know how to taskset on a BSD. I don't know if the infinibad/ethernet controller's firmware/bypass software works. I don't know how BSD's scheduler works (not that we usually care, but there are times when one can't avoid work needing to be scheduled, things like rpc calls to shut down an app, ssh if you can spare the clock cycles for key verification, etc).
Would dtrace come in handy? Most definitely. Is that enough for us to abandon what we know works? Not yet.
However, we're talking about very specific needs: super high performance networking. If you have that specific of a need, wouldn't you want something unfamiliar if it solves the problem best?
I remember these claims being made in the late 90s, and perhaps they were true back then, but it's been 15 years, and I would be surprised if Linux hasn't caught up by virtue of its faster development pace, greater mindshare, and increased corporate/datacenter usage.
So, in all seriousness: what recent, well argued essays/papers can you refer me to so I can understand the claim that BSD networking is still better than Linux in 2015?
It was also linked to from HN: https://news.ycombinator.com/item?id=8167126
Both have some pretty good sources (and some not so good sources too)
Shipped in FreeBSD by default, developed for FreeBSD first -- and that's just the bleeding edge side of things.
Time for a epoll for kqueue swap, and make this performance debate go away for both, Linux and FreeBSD once and for all. No reason for this pissing contest.
(As in, the stuff that facilitates registered I/O is based on concepts that can be traced back to VMS, released in 1977. Namely, the Irp, plus, a kernel designed around waitable events, not runnable processes.)
epoll requires more syscalls to do the same stuff.
that's not responsible for the differences speed, but at this level syscalls are a meaningful expense.
At that point ease of management or package installation probably matters more to developers and it might simply come down to things like driver support and other stability/performance issues where Linux has gotten a LOT of highly-specialized attention from hardware vendors and the HPC world. Back when 10G hardware was just starting to enter the market, we bought cards which came with a Linux driver but it took awhile before that was ported to FreeBSD and longer still before the latency had been optimized as much.
And the chance of a Linux distro having a newer version of something important is unlikely unless a new distro was just released and picked the latest release of a particular software to standardize their package on for the lifetime of this new OS.
Either way -- FreeBSD will continue (or be capable of) getting updates to track upstream and your Linux distro will only be backporting security fixes.
In most cases what actually happens is that people use the main distribution repo for the 99% of packages which are stable and when you need something newer you add external apt/yum/etc. sources for those specific projects.
In the Ubuntu world, there's a huge ecosystem supporting this style where you upload source packages and they'll build and host the binary packages for you:
This approach gives you the speed, reliability and security benefits of binary packages with the currency of ports and, more importantly, allows you to opt-in only where you specifically know you need new features.
In many cases, the repos are maintained directly by the upstream project – see e.g. https://www.varnish-cache.org/installation/debian for what that process looks like.
In other cases, you have to decide whether you trust a particular developer. If not, you can choose to create your own version – which could be as simple as taking an source package, auditing it to whatever level you want, and signing it with your own trusted key.
Look, I ran servers using OpenBSD for years in the 90s and FreeBSD in the early 2000s. I respect the work which has gone into the ports system but the reason to use it is not security and advocacy based on limited understanding will not accomplish anything useful. If you want to praise ports, talk about how much easier it makes it to have the latest version of everything installed — and do your homework to be ready to explain how that's meaningfully better than e.g. a Debian user tracking the testing or unstable repositories.
Furthermore, third party repos not by upstream or trusted OS developers is a nightmare. I regularly spend time trying to find trustworthy 3rd party repos to get newer versions of developer tools into RHEL 5/6. Sometimes it's just random RPMs on an FTP or rpmfind.net style sites. Not trustworthy at all. And sometimes I can't even build the package myself because the tool refuses to build because the rest of the OS is too old.
Long term release Linux distros make life hell.
It sounds like you don't know how the Debian packaging system works regarding security. If you are installing from Debian repos (as opposed to third-party repos), then all binary packages go through the ftpmasters. The packages are checksummed, and the checksums are GPG-signed. Each package's maintainer or team handles building the binary from source. Of course, you can also download the source package yourself with a simple command, and then build it yourself. But if you don't trust the Debian maintainers to verify the integrity of the source packages they build, then you shouldn't be using Debian at all. This is no different than using a BSD. The ports tree could also be compromised.
Third-party repos are always a risk. That's one of the nice things about PPAs: their maintainers can use the same security mechanisms that the regular distro repos use, but with their own GPG keys. Again, if you don't trust the maintainer, I guess you should be building everything yourself. LFS gets old though, right?
And that's another good thing about Debian: almost anything you could want is already in Debian proper, so resorting to third-party repos or building manually is rarely necessary.
For long-term use, you can use testing or unstable or both, which are effectively rolling releases. There is also the backports repo for stable. And if you need to build a package yourself, between the Debian tools and checkinstall, it's not hard.
Debian and Ubuntu are the only distros I use, and for good reason. They solved most of these problems a long, long time ago. Compared to Windows or other Linux distros, it seems more like heaven than hell.
By the way, I'm no expert on BSDs--but do they even have any cryptographic signatures in the system at all, or is it just package checksums? Checksums by themselves don't prove anything; you need a way to sign the checksums to verify they haven't been altered. Relying on unsigned checksums is akin to security theater.
I don't think there is any particular problem with Linux. The core problem of "slow" performance is mostly due to limitations of BSD API (recvmmsg is a good example of a workaround), and due to the feature richness of Linux.
Also, doing the math: 2GHz / 350k pps = 5714 cycles per packet. 6k cycles to deliver the packet from NIC to one core, then pass it between cores and copy over to application. That ain't bad really.
Setting aside the fact that a typical fanboi comment is at the top of HN post, are you seriously contending that since one OS does a thing there's no need for other OS's to do the same thing?
E.g. on freebsd I recently ran into an ENOBUFS that I've never seen on linux. Although man pages said that they could happen.
Obviously it provides nothing extra, and certainly nothing better, when it comes to a server role. Other than support contracts. And support for 3rd-party apps. And hardware drivers/compatibility. And users that know how to operate it. And brand name recognition. And size of software development community, libraries, tools. And industry reliance on non-portable software (Docker).
Answer to your question: technical superiority is dwarfed by a more popular product.
Docker is the new shiny kid on the block, but the industry is far from reliant on it. It's also a bit misleading to call the software non-portable in a Linux v FreeBSD debate - it's not that the software is non-portable, it's that it uses a feature that is not available in FreeBSD. In a contest about bragging rights, that's significant.
If so, thanks for the aneurism. Windows is decades ahead, thanks to a fundamentally more performant kernel and driver architecture, the key parts of which have been present since NT. (And VMS, if you want to get picky.)
If you need something specific with better performance than that, you should probably look at moving more of the network stack to your application layer.
It all comes back to how the kernel can get data back to the caller: https://speakerdeck.com/trent/pyparallel-how-we-removed-the-....
Windows has a distinct advantage thanks to design decisions made in VMS by Cutler et al; namely, I/O via I/O request packets, and much better memory management.
POSIX systems are at a disadvantage because of the readiness-oriented nature of I/O, leaving the caller responsible for "checking back" when something can be done, versus "here, do, this, tell me when it's done". (https://speakerdeck.com/trent/pyparallel-how-we-removed-the-...)
; Preparing to listen the incoming connection (passive socket)
; int listen(int sockfd, int backlog);
; listen(sockfd, 0);
mov eax, 102 ; syscall 102 - socketcall
mov ebx, 4 ; socketcall type (sys_listen 4)
; listen arguments
push 0 ; backlog (connections queue size)
push edx ; socket fd
mov ecx, esp ; ptr to argument array
int 0x80 ; kernel interruption (this is the "receive packet" call along with MOVs above)
We have an app server that currently handles 40K concurrent users per node. I get ~63pps only:
TX eth0: 42780 pkts/s RX eth0: 64676 pkts/s
TX eth0: 41570 pkts/s RX eth0: 63401 pkts/s
TX eth0: 41867 pkts/s RX eth0: 63697 pkts/s
TX eth0: 41585 pkts/s RX eth0: 63187 pkts/s
TX eth0: 40408 pkts/s RX eth0: 61912 pkts/s
TX eth0: 41445 pkts/s RX eth0: 63299 pkts/s
TX eth0: 41119 pkts/s RX eth0: 63186 pkts/s
TX eth0: 41502 pkts/s RX eth0: 63153 pkts/s
TX eth0: 40465 pkts/s RX eth0: 62118 pkts/s
TX eth0: 42105 pkts/s RX eth0: 63986 pkts/s
But this is utilizing 7 of 8 cores on each node... with CPU util very low.
There are additional things at play here, too, including what the NIC driver's strategy for interrupt generation is and how interrupts are balanced across the available cores, whether there are cores dedicated to interrupt handling and otherwise isolated from the I/O scheduler, various sysctl settings, etc.
There's further gains here if you want to get really into it.
RouterOS is based on the Linux kernel see https://en.wikipedia.org/wiki/MikroTik
Good luck receiving 50k pps of UDP packets, doing some processing and _sending_ 50k pps responses back. It actually is hard.
And even if you were using the slow API, even with ping:
[7 year old desktop]$ sudo ping -q -f -c 100000 localhost
PING localhost.localdomain (127.0.0.1) 56(84) bytes of data.
--- localhost.localdomain ping statistics ---
100000 packets transmitted, 100000 received, 0% packet loss, time 954ms
rtt min/avg/max/mdev = 0.004/0.004/0.195/0.002 ms, ipg/ewma 0.009/0.005 ms