
Why does one Nginx worker take all the load? - porker
https://blog.cloudflare.com/the-sad-state-of-linux-socket-balancing/
======
majke
Author here. There are three highlights in this blog post:

\- epoll() seems to have LIFO behavior, which can result in uneven load
balancing across workers (when accept()ing from a shared socket); a minimal
sketch of that setup follows this list

\- using REUSEPORT can worsen latency in a high-load case.

\- blocking accept() on Linux is pretty buggy: you basically can't close() the
underlying socket
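
For readers who haven't seen the setup, here is a minimal sketch (not the post's benchmark code) of the shared-socket epoll-and-accept pattern the first point is about; error handling, the request handling itself, and the port/worker count are all placeholders:

```c
/* Minimal sketch: one listening socket created before fork(), each worker
 * with its own epoll instance watching that shared fd.  Which idle worker
 * the kernel wakes on a new connection is what the LIFO observation is
 * about.  Error handling omitted. */
#include <sys/epoll.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>

static void handle_conn(int fd) { /* read request, write response */ close(fd); }

static void worker(int lfd)
{
    int ep = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = lfd };
    epoll_ctl(ep, EPOLL_CTL_ADD, lfd, &ev);

    for (;;) {
        struct epoll_event out;
        /* All idle workers block here on the same listening fd. */
        if (epoll_wait(ep, &out, 1, -1) < 1)
            continue;
        int cfd = accept(lfd, NULL, NULL);
        if (cfd >= 0)
            handle_conn(cfd);
    }
}

int main(void)
{
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port = htons(8080),
                                .sin_addr.s_addr = htonl(INADDR_ANY) };
    bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
    listen(lfd, 128);

    for (int i = 0; i < 4; i++)          /* 4 workers sharing one lfd */
        if (fork() == 0)
            worker(lfd);
    for (;;) pause();
}
```

In the SO_REUSEPORT variant from the second point, each worker would instead create and bind its own listening socket (setsockopt() with SO_REUSEPORT before bind()), trading the wakeup imbalance for the latency issue described in the post.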

~~~
FBISurveillance
Thanks for the post!

Is there anything you could suggest looking into to improve nginx's
processing of a high volume of large POST bodies (5KB, tens of thousands per
second)?

I'm using kernel 4.13 with BBR congestion control on a 20Gbps network and
seem to hit this weird bottleneck where it doesn't matter how many nginx
processes I have; it works similarly terribly on both 16-core and 64-core
servers. (Of course irq/process affinity is in place, which makes me think
it's an nginx issue.)

~~~
spydum
Not an nginx user here, but I do know sometimes certain modules in Apache will
add significant latency on POST payloads by waiting for the entire POST
payload to complete before sending to the backend (in a reverse proxy setup).
The idea is that if the backend fails, it can retry on the next node. For
large payloads this sucks for latency. No idea if nginx has this problem.

~~~
edwhitesell
This is a configurable option in nginx. You can have it wait until the entire
request is received before passing it to the upstream (backend) or have it
stream as data arrives. I seem to recall there is still a manageable buffer
when streaming too, though it's been a few years since I looked at that in
detail.
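
For reference, a sketch of what the streaming variant might look like in nginx config (the location, upstream name, and buffer size here are placeholder values; proxy_request_buffering is the directive that switches between buffering the whole body and streaming it):

```nginx
# Sketch only: stream request bodies to the upstream instead of buffering
# the entire POST first.  Names and sizes are placeholders.
location /upload {
    proxy_http_version 1.1;          # chunked request bodies need HTTP/1.1 upstream
    proxy_request_buffering off;     # pass the body upstream as it arrives
    client_body_buffer_size 16k;     # read buffer for the incoming body
    proxy_pass http://backend;
}
```

The tradeoff is the one spydum describes: once nginx has started sending the body upstream, it can no longer retry the request on another backend.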

------
foxhill
with regard to the epoll-and-accept method - why is the LIFO queue bad? sure,
you're seeing more load on one core, but it must be finishing requests by the
time it starts working on the next one - i.e. not causing requests to stall as
a result of the imbalance.

put that in the context of a multi-socket system, then core-caching and NUMA
issues (even turbo clocking on an individual core) would seem to imply that
LIFO is in fact exactly what you want, no?

~~~
majke
Imagine an HTTP server supporting HTTP keepalives. Say 10 new connections come
in. Say on average they will all go to a single worker. Then say all of the
connections request a heavy asset.

You will end up with one worker handling these 10 connections/requests and
spinning at 100% CPU while the other workers idle. I'm not saying this bad
load balancing is a problem that affects everyone; it very much depends on
your load pattern.

~~~
otterley
If the first worker isn't actually idle, then the scheduler will assign the
next incoming request to an idle worker. Am I mistaken? If not, what's the
problem here?

~~~
Filligree
Load assignment, with this architecture, happens only when a connection is
first opened. HTTP keep-alive means there's a disconnect between when it's
opened, and when it becomes expensive.

I.e. it's possible for one worker to first serve ten tiny requests (e.g.
index.html), then wait while the clients chew on it, then have all ten clients
simultaneously request a large asset.

~~~
otterley
I am not sure it's possible to solve this problem generally. Doing so would
require the kernel be able to predict the future.

Also, IIRC this is a proxy. If you run out of CPU copying data between file
descriptors before you run out of bandwidth, I'd be very surprised. I think
sendfile(2) makes it especially cheap.

~~~
brianwawok
Sure there is. Pass around the connection socket as needed, or have a layer in
front of the processing layer to only hand out the work when it is good to go.

~~~
otterley
Can you point to a working example of this that actually solves the problem?
i.e., that is demonstrably more efficient for any given request than the FIFO
wakeup method?

~~~
brianwawok
No, you would need to test it for your workload.

I am saying it is possible, not that it is better. Very different things.

~~~
otterley
That's why I said "it's not possible to solve this problem generally." That
is, there's no general solution to the problem, one that is optimal for all
workloads.

------
dboreham
Somewhat amazed that this stuff _still_ isn't properly sorted out given that I
was working on applications with a keen interest in the problem in 1996!!

NT's IO Completion Ports implemented some kind of CPU-balanced work delivery
mechanism from the beginning, presumably inspired by Cutler's previous work on
VMS's $QIO. AIX afaik also had something similar. Linux not so much...but in
the intervening 20 years I had assumed epoll() was fixing this.

~~~
cryptonector
You assume things in Linux-land are "designed". It's more like natural
evolution. It's very organic. As you'd expect, there's a lot of decomposition
around, and it kinda smells.

The Linux community is known to be infected with a drug-resistant strain of
NIHV -- the Not Invented Here Virus. This means that if there's a thing like
kqueue, or NT I/O Completion Ports, or Solaris Event ports, and Linux could
copy and/or improve one (or more) of those, then you can count on Linux to
invent a new thing that isn't as good as any of the others.

Perhaps some pharmaceutical company will someday work on a treatment, cure,
and/or vaccine for that strain of NIHV. More likely, the community will
someday evolve resistance to NIHV, or else make NIHV moot by driving out all
the competition.

The good news is that you can resist NIHV yourself. All you have to do is not
believe that you alone can solve mankind's problems in a vacuum.

~~~
dboreham
I don't remember it quite like that in this specific case. Rather, I heard
there was pushback from somewhere (Redhat legal??) against implementing a
completion port model, fearing patent action by MS. IBM had a mutual
license agreement with MS I believe, which explains why AIX has something
similar. The patents in question may even be in the set litigated over in the
wake of Cutler moving to MS from DEC (which MS paid $$$ for and hence might be
inclined to defend aggressively).

~~~
cryptonector
Maybe, but they did everything wrong in epoll.

------
_Codemonkeyism
Why doesn't NGINX implement work stealing? Wouldn't that help?

~~~
tyingq
You have control of the queue, but not the entries. You can pass the whole
queue/socket around via sendmsg() (sketched below), but not single entries in
the queue. This is hard to solve well in user space.
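
To make "pass the whole queue/socket around" concrete, here is a sketch of the mechanism (a hypothetical send_fd() helper, not anything nginx ships): the listening fd is handed to another process as SCM_RIGHTS ancillary data over a Unix-domain socket, and the receiver ends up with its own descriptor for the same socket and accept queue. The recvmsg() side, with a matching control buffer, is omitted.

```c
/* Sketch: send an open file descriptor (e.g. a listening socket) to another
 * process over a Unix-domain socket using SCM_RIGHTS.  Error handling and
 * the receiving side are omitted. */
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <string.h>

static ssize_t send_fd(int unix_sock, int fd_to_send)
{
    char dummy = 'x';                      /* must send at least one byte */
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    char cbuf[CMSG_SPACE(sizeof(int))];
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = cbuf, .msg_controllen = sizeof(cbuf),
    };

    struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
    cm->cmsg_level = SOL_SOCKET;
    cm->cmsg_type  = SCM_RIGHTS;           /* ancillary data carries an fd */
    cm->cmsg_len   = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cm), &fd_to_send, sizeof(int));

    return sendmsg(unix_sock, &msg, 0);    /* kernel installs a dup in the peer */
}
```

That is how the whole socket can change hands; per the point above, there is no equivalent for handing over a single pending connection in the queue without accept()ing it first.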

~~~
derefr
So, what would a non-userspace solution look like? An HTTP keepalive +
pipelining + HTTP2 implementation in the kernel that forwards the demuxed
messages as separate packets onto a specially-typed socket, such that a
prefork daemon can accept(2) individual HTTP request packets?

~~~
tyingq
There are a few problems outlined in the article. I was referring to this one:

 _" epoll() seem to have LIFO behavior which can result in uneven load
balancing across workers (in case of accept()ing from a shared socket)"_

Which is unrelated to keepalives and pipelining and best addressed in the
kernel.

------
bogomipz
I have a question about the following passage regarding the LIFO nature of
epoll-and-accept:

>"This behavior causes the busiest process, the one that only just went back
to event loop, to receive the majority of the new connections."

What is meant by a process "that only just went back to the event loop"? The
worker process is the event loop, no? Isn't the worker process always running
an event loop?

Or is this just meant to say when the worker process is not in the section of
the event loop that's responsible for the enqueueing and dequeueing of events?

~~~
dcow
At the end of the event loop you wait for new events. That part.

~~~
bogomipz
Thanks. For clarification, the event queue that nginx's worker process is
using for its input is the same accept() queue that the passive nginx listener
socket is populating. Is that correct?

------
zerop
For more interest, read the nginx chapter of "The Architecture of Open Source
Applications": www.aosabook.org/en/nginx.html

------
cat199
soo..

anyone have any insight on this behavior on other systems? (e.g. FreeBSD
kqueue)

~~~
barrkel
IOCP on Windows would use a thread pool to dispatch work after async sockets
come due for some action. And after work for any given socket is completed,
it's possible for work to start on a different socket's completion, with the
"context switch" (or rather, continuation callback) happening in userland,
rather than requiring a kernel-level context switch. This inversion of control acts a bit
like work stealing.

~~~
majke
Comment from reddit:
[https://www.reddit.com/r/programming/comments/78f1if/why_doe...](https://www.reddit.com/r/programming/comments/78f1if/why_does_one_nginx_worker_take_all_the_load/dotnesp/)

LIFO is also the strategy for Completion Ports in Windows:

> Threads that block their execution on an I/O completion port are released in
> last-in-first-out (LIFO) order, and the next completion packet is pulled
> from the I/O completion port's FIFO queue for that thread. This means that,
> when a completion packet is released to a thread, the system releases the
> last (most recent) thread associated with that port, passing it the
> completion information for the oldest I/O completion.

[https://msdn.microsoft.com/en-us/library/windows/desktop/aa3...](https://msdn.microsoft.com/en-us/library/windows/desktop/aa365198\(v=vs.85\).aspx)

~~~
barrkel
Yes, but that's only for the duration of work on that completion; it's not
like sockets are owned by the thread, so it doesn't have the same imbalance
effect where, if the original thread is not available, completions get
starved. The LIFO nature is a side effect of the user-mode selection of the
next completion to run; something round-robin would be slower, needing kernel
transitions.

