- epoll() seems to have LIFO behavior, which can result in uneven load balancing across workers (when accept()ing from a shared socket)
- using SO_REUSEPORT can worsen latency under high load.
- blocking accept() on Linux is pretty buggy; you basically can't close() the underlying socket
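For context on the SO_REUSEPORT point, here's a minimal sketch (Python, purely illustrative) of the mechanism: each worker binds its own listening socket to the same port, and the kernel then hash-balances incoming connections across those sockets. That's also the source of the latency problem: once a connection lands in a busy worker's queue, it waits there even if other workers are idle.

```python
import socket

def reuseport_listener(port):
    # Each worker would create its own socket like this; SO_REUSEPORT
    # lets them all bind to the same address:port.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("127.0.0.1", port))
    s.listen(128)
    return s

a = reuseport_listener(0)       # port 0: let the kernel pick one
port = a.getsockname()[1]
b = reuseport_listener(port)    # second bind to the same port succeeds
```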
The LIFO behavior of epoll could maybe be worked around by having just one process receive accept events at a time and pass the baton to the next process once it accepts a client. I can't believe I have to think of such things. I have a daemon that uses epoll that currently scales quite well on one CPU for what it does, but which I'll probably have to make support multiple processes (or threads) at some point.
And then there's the fact that epoll is not fork-safe.
Is there anything you could suggest looking into to improve nginx's handling of a high volume of large POST bodies (5 KB each, tens of thousands per second)?
I'm using kernel 4.13 with BBR congestion control on a 20Gbps network and seem to hit this weird bottleneck where it doesn't matter how many nginx processes I have; it performs similarly terribly on both 16-core and 64-core servers. (Of course irq/process affinity is in place, which makes me think it's an nginx issue.)
There's a setting called client_body_buffer_size. (http://nginx.org/en/docs/http/ngx_http_core_module.html#clie...) It defaults to 8k on x86-64.
If you exceed that setting, it writes the body to a temporary file, which slows things down.
Maybe try bumping that up to 16k?
Also see the other related settings on that page.
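Something like this (values are assumptions for the 5 KB-body case described above, not recommendations):

```nginx
http {
    # Keep the typical POST body (~5 KB plus headers) entirely in
    # memory so nginx never spills it to a temp file.
    client_body_buffer_size 16k;

    # If a body does exceed the buffer, make sure the temp path is on
    # fast storage (path here is hypothetical).
    client_body_temp_path /var/cache/nginx/client_temp;
}
```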
Edit: Maybe also test out switching from urlencoded to multipart for the post? For some types of data, urlencoded can bloat the size quite a lot.
And thanks OP. I'm sure thousands of companies do these kinds of optimizations, but CloudFlare is one of the few that bother to share them.
I believe the FIFO/LIFO refers to the queue of processes waiting to receive a request. Since each process calls accept()/epoll(), then loops around and calls it again, FIFO would ensure that the one that just processed a request goes back to the end of the line.
Articles like this make for good reading.
Put that in the context of a multi-socket system, and core caching and NUMA issues (even turbo clocking on an individual core) would seem to imply that LIFO is in fact exactly what you want, no?
You will end up with one worker handling these 10 connections/requests and spinning at 100% CPU while other workers idle. I'm not saying this bad load balancing is a problem affecting everyone; it very much depends on your load pattern.
I.e. it's possible for one worker to first serve ten tiny requests (e.g. index.html), then wait while the clients chew on it, then have all ten clients simultaneously request a large asset.
Also, IIRC this is a proxy. If you run out of CPU copying data between file descriptors before you run out of bandwidth, I'd be very surprised. I think sendfile(2) makes it especially cheap.
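To make the sendfile(2) point concrete, here's a sketch of the zero-copy path (what nginx uses with `sendfile on;`): the kernel moves file bytes straight to the socket with no user-space copy. The socketpair, temp file, and sizes are just for illustration.

```python
import os
import socket
import tempfile

# A 16 KB "asset" standing in for a static file.
payload = b"x" * 16384
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(payload)
tmp.close()

# A connected socket pair standing in for a client connection.
a, b = socket.socketpair()

with open(tmp.name, "rb") as f:
    offset, remaining = 0, len(payload)
    while remaining:
        # The kernel copies file pages to the socket directly;
        # the bytes never pass through this process's buffers.
        sent = os.sendfile(a.fileno(), f.fileno(), offset, remaining)
        offset += sent
        remaining -= sent
a.close()

data = b""
while len(data) < len(payload):
    chunk = b.recv(65536)
    if not chunk:
        break
    data += chunk
os.unlink(tmp.name)
```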
I am saying it is possible, not that it is better. Very different things.
As already pointed out, I imagine there could be some subtle benefit to delivering a majority of connections to a single process while it isn't overloaded, so even perfectly even load balancing can have some disadvantages.
majke, the benchmark you did with SO_REUSEPORT: was that against a NUMA system / recent kernel? I've completely lost track of it, but I think I remember reading that there has been some work on mapping NIC queues, cores, and workers together, so that a flow can be processed by a worker on the same core that handled its NIC queue. I'm just curious whether the benchmark shown was able to take advantage of that (and, as an aside, how much of a benefit mapping these together really has).
You can imagine a situation when a single worker gets most of the traffic and runs out of cpu (while other workers idle).
NT's IO Completion Ports implemented some kind of CPU-balanced work delivery mechanism from the beginning, presumably inspired by Cutler's previous work on VMS's $QIO. AIX afaik also had something similar. Linux not so much...but in the intervening 20 years I had assumed epoll() was fixing this.
The Linux community is known to be infected with a drug-resistant strain of NIHV -- the Not Invented Here Virus. This means that if there's a thing like kqueue, or NT I/O Completion Ports, or Solaris Event ports, and Linux could copy and/or improve one (or more) of those, then you can count on Linux to invent a new thing that isn't as good as any of the others.
Perhaps some pharmaceutical will someday work on a treatment, cure, and/or vaccine for that strain of NIHV. More likely, the community will someday evolve resistance to NIHV, or else make NIHV moot by driving out all the competition.
The good news is that you can resist NIHV yourself. All you have to do is not believe that you alone can solve mankind's problems in a vacuum.
"epoll() seem to have LIFO behavior which can result in uneven load balancing across workers (in case of accept()ing from a shared socket)"
Which is unrelated to keepalives and pipelining and best addressed in the kernel.
The round-robin patches seemed to be stalled precisely because everyone is bickering over solutions to problems that they've partly created for themselves. They've gone down the rabbit hole and disappeared.
If you want strongly fair scheduling and latency guarantees, just dequeue the sockets and assign them however you want. You introduce a small amount of latency, but you'd get similar latency by enforcing round-robin behavior in the kernel. LIFO is the effective behavior precisely because it's faster for the already running process to dequeue the next event than to park the running process and fire up a sleeping process.
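That "dequeue the sockets and assign them however you want" approach can be sketched in user space. Here one acceptor thread is the only caller of accept() and hands connections to workers in strict round-robin; the Python threads and in-process queues are just illustrative stand-ins for worker processes and fd passing.

```python
import queue
import socket
import threading

NUM_WORKERS = 2  # assumption for the sketch

def worker(q):
    while True:
        conn = q.get()
        if conn is None:        # shutdown sentinel
            return
        conn.sendall(b"hi")     # stand-in for real request handling
        conn.close()

queues = [queue.Queue() for _ in range(NUM_WORKERS)]
threads = [threading.Thread(target=worker, args=(q,)) for q in queues]
for t in threads:
    t.start()

lst = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
lst.bind(("127.0.0.1", 0))
lst.listen(16)

def acceptor(n):
    # Single accept() caller: pays a small queueing latency in
    # exchange for strictly fair distribution across workers.
    for i in range(n):
        conn, _ = lst.accept()
        queues[i % NUM_WORKERS].put(conn)

threading.Thread(target=acceptor, args=(4,)).start()

clients = [socket.create_connection(lst.getsockname()) for _ in range(4)]
replies = [c.recv(2, socket.MSG_WAITALL) for c in clients]
for q in queues:
    q.put(None)
```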
Everybody talks about Windows IOCP, but Windows does precisely this same thing: a pool of threads with a simple scheduling scheme that uses similar asynchronous polling interfaces--just not interfaces exposed to user processes.

I suppose it's a PITA to implement this in user space compared to the prepackaged solution Windows provides, but then again the Unix/Linux model is to make it easy to implement these solutions yourself, as opposed to providing one solution without exposing the underlying interfaces. Remember, scheduling, locking, and IPC primitives are much faster on Linux than on Windows. That's not coincidental. If you're not prepared to roll your own--if you want the vendor to do all the work for you--don't use Linux. The best features (and feature additions) in the Unix universe aren't oriented toward specific production problems, but toward interfaces that make it easier to roll your own solutions. Adding more flags to epoll long ago passed the point of diminishing returns.
>"This behavior causes the busiest process, the one that only just went back to event loop, to receive the majority of the new connections."
What is meant by a process "that only just went back to the event loop"? The worker process is the event loop, no? Isn't the worker process always running its event loop?
Or is this just meant to say when the worker process is not in the section of the event loop that's responsible for the enqueueing and dequeueing of events?
anyone have any insight on this behavior on other systems?
(e.g. FreeBSD kqueue)
LIFO is also the strategy for Completion Ports in Windows:
> Threads that block their execution on an I/O completion port are released in last-in-first-out (LIFO) order, and the next completion packet is pulled from the I/O completion port's FIFO queue for that thread. This means that, when a completion packet is released to a thread, the system releases the last (most recent) thread associated with that port, passing it the completion information for the oldest I/O completion.