Why does one Nginx worker take all the load? (cloudflare.com)
164 points by porker on Oct 24, 2017 | 49 comments

Author here. There are three highlights in this blog post:

- epoll() seems to have LIFO behavior, which can result in uneven load balancing across workers (when accept()ing from a shared socket)

- using SO_REUSEPORT can worsen latency under high load

- blocking accept() on Linux is pretty buggy; you basically can't close() the underlying socket
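
For reference, the SO_REUSEPORT variant can be sketched in a few lines (a hedged illustration, not code from the post; each "worker" here is just a second socket in one process): every worker binds its own listening socket to the same address, and the kernel spreads new connections across the sockets by flow hash.

```python
import socket

# Each worker opens its own listening socket on the same port; the
# kernel load-balances new connections across them (Linux >= 3.9).
def reuseport_listener(port=0):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("127.0.0.1", port))
    s.listen(128)
    return s

a = reuseport_listener()
port = a.getsockname()[1]
b = reuseport_listener(port)   # a second "worker" binds the same port
print(a.getsockname() == b.getsockname())  # both listen on one address
```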

Wow. This is a terrible state of affairs!

The LIFO behavior of epoll could perhaps be worked around by having just one process receive accept events at a time, passing the baton to the next process once it accepts a client. I can't believe I have to think of such things. I have a daemon that uses epoll that currently scales quite well on one CPU for what it does, but which I'll probably have to make support multiple processes (or threads) at some point.

And then there's the fact that epoll is not fork-safe.

The other possibility is to have a thread in each process whose only job is to block in accept4(2), add the new connection to a queue, and send an event to the other thread in the same process.
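
A minimal sketch of that idea (names and the loopback client are illustrative assumptions; a real server would use accept4() and signal its event loop, e.g. via an eventfd, rather than just queueing):

```python
import socket, threading, queue

lsock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
lsock.bind(("127.0.0.1", 0))
lsock.listen(16)
conns = queue.Queue()

def acceptor():
    # Dedicated thread: block in accept() and hand each new
    # connection to the event-loop thread through a queue.
    while True:
        c, _ = lsock.accept()
        conns.put(c)

threading.Thread(target=acceptor, daemon=True).start()

# A client connects; the "event loop" drains the queue and serves it.
client = socket.create_connection(lsock.getsockname())
served = conns.get(timeout=5)
client.sendall(b"ping")
data = served.recv(4)
print(data)  # b'ping'
```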

Thanks for the post!

Is there anything you could suggest looking into to improve nginx's processing of high-volume large POST bodies (~5 KB, tens of thousands per second)?

I'm using kernel 4.13 with BBR congestion control on a 20 Gbps network and seem to hit this weird bottleneck where it doesn't matter how many nginx processes I have; it performs similarly terribly on both 16-core and 64-core servers. (Of course irq/process affinity is in place, which makes me think it's an nginx issue.)

You mention 5KB...is there some chance that it exceeds 8KB?

There's a setting called client_body_buffer_size (http://nginx.org/en/docs/http/ngx_http_core_module.html#clie...). It defaults to 8k on x86-64.

If you exceed that setting, it writes the body to a temporary file, which slows things down.

Maybe try bumping that up to 16k?
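
As a sketch, the bump might look like this (16k is just the value suggested above, not a tuned recommendation):

```nginx
http {
    # Default is 8k on x86-64; bodies larger than the buffer are
    # written to a temporary file before processing.
    client_body_buffer_size 16k;
}
```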

Also see the other related settings on that page.

Edit: Maybe also test out switching from urlencoded to multipart for the post? For some types of data, urlencoded can bloat the size quite a lot.

Not an nginx user here, but I do know that certain modules in Apache will sometimes add significant latency to POST payloads by waiting for the entire payload to complete before sending it to the backend (in a reverse-proxy setup). The idea is that if the backend fails, it can retry on the next node. For large payloads this sucks for latency. No idea if nginx has this problem.

This is a configurable option in nginx. You can have it wait until the entire request is received before passing it to the upstream (backend), or have it stream the data as it arrives. I seem to recall there is still a manageable buffer when streaming too, though it's been a few years since I looked at that in detail.
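
The option being recalled is presumably proxy_request_buffering (available since nginx 1.7.11; the location and upstream name below are placeholders):

```nginx
location /upload {
    # off: stream the request body to the upstream as it arrives;
    # on (the default): read the whole body before proxying.
    proxy_request_buffering off;
    proxy_pass http://backend;
}
```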

At that speed it might not be Nginx. Check out Cloudflare's other blog posts too.


And thanks OP. I'm sure thousands of companies do these kinds of optimizations, but Cloudflare is one of the few that bother sharing.

You made a kind of throwaway comment in the blog post about FIFO helping, but it wasn't clear to me why it would result in any different behaviour from LIFO. Wouldn't that mean it just hits the first host all the time instead of the last?

(work at cloudflare but not on the same team as the blog author)

I believe the FIFO/LIFO refers to the queue of processes waiting to receive a request. Since each process calls accept()/epoll(), then loops around and calls it again, FIFO would ensure the one that just processed a request goes to the end of the line.

It hits the first worker, and that first worker leaves the FIFO queue, and thus is no longer the first worker anymore. When it calls epoll again, it'll end up as the last one instead.
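
The difference shows up in a toy model of the wait queue (an illustrative sketch only: worker/connection counts are arbitrary and requests are served instantly):

```python
from collections import Counter, deque

def simulate(order, workers=4, connections=20):
    waiting = deque(range(workers))   # workers parked in epoll/accept
    served = Counter()
    for _ in range(connections):
        # LIFO wakes the most recently parked worker; FIFO the oldest.
        w = waiting.pop() if order == "lifo" else waiting.popleft()
        served[w] += 1
        waiting.append(w)  # worker handles the request, then re-parks
    return served

print(simulate("lifo"))  # one worker serves every connection
print(simulate("fifo"))  # connections spread evenly, 5 per worker
```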

Ahh of course.

Thanks for the nicely written article, especially the pictures.

Articles like this make for good reading.

With regard to the epoll-and-accept method: why is the LIFO queue bad? Sure, you're seeing more load on one core, but it must be finishing requests by the time it starts working on the next one, i.e. not causing requests to stall as a result of the imbalance.

Put that in the context of a multi-socket system, and core caching and NUMA issues (even turbo clocking on an individual core) would seem to imply that LIFO is in fact exactly what you want, no?

Imagine an HTTP server supporting HTTP keepalives. Say 10 new connections come in. Say on average they all go to a single worker. Then say all of the connections request a heavy asset.

You will end up with one worker handling these 10 connections/requests and spinning at 100% CPU while other workers idle. I'm not saying this bad load balancing is a problem affecting everyone; it very much depends on your load pattern.

If the first worker isn't actually idle, then the scheduler will assign the next incoming request to an idle worker. Am I mistaken? If not, what's the problem here?

Load assignment, with this architecture, happens only when a connection is first opened. HTTP keep-alive means there's a disconnect between when a connection is opened and when it becomes expensive.

I.e. it's possible for one worker to first serve ten tiny requests (e.g. index.html), then wait while the clients chew on it, then have all ten clients simultaneously request a large asset.

I am not sure it's possible to solve this problem generally. Doing so would require the kernel be able to predict the future.

Also, IIRC this is a proxy. If you run out of CPU copying data between file descriptors before you run out of bandwidth, I'd be very surprised. I think sendfile(2) makes it especially cheap.

Sure there is. Pass around the connection socket as needed, or have a layer in front of the processing layer to only hand out the work when it is good to go.

Can you point to a working example of this that actually solves the problem? i.e., that is demonstrably more efficient for any given request than the FIFO wakeup method?

No, you would need to test it for your workload.

I am saying it is possible, not that it is better. Very different things.

That's why I said "it's not possible to solve this problem generally." That is, there's no general solution to the problem, one that is optimal for all workloads.

I find this very interesting.

As already pointed out, I imagine there could be some subtle benefit to delivering a majority of connections to a single process while it isn't overloaded, so even perfectly even load balancing can have some disadvantages.

majke, the benchmark you did with SO_REUSEPORT, was that against a NUMA system / recent kernel? I've completely lost track of it, but I think I remember reading that there has been some work on mapping the NIC queues, cores, and workers together, allowing a flow to be processed on the same core as its worker. I'm just curious whether the benchmark shown was able to take advantage of that or not (and, as an aside, how much of a benefit mapping this together really has).

HTTP isn't particularly my thing, so if I'm wrong, please correct me! However, I was under the impression that a keep-alive session wouldn't return to the accept stage - the socket is still open, surely?

Correct. The point is that the worker may be relatively idle when it accept()s plenty of connections; the load only comes afterwards, when the requests on those connections start to flow.

You can imagine a situation when a single worker gets most of the traffic and runs out of cpu (while other workers idle).

One solution would be for the workers to pass each other the socket descriptor (using sendmsg()) once they notice that there are idle brethren and they have a lot of work in the queue.
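
That handoff can be sketched with socket.send_fds()/recv_fds(), the Python 3.9+ wrappers around the sendmsg()/recvmsg() SCM_RIGHTS machinery (the socketpairs below stand in for real workers and a real accepted connection):

```python
import socket

# Control channel between a "busy" worker and an "idle" sibling.
busy, idle = socket.socketpair()

# Stand-in for a connection the busy worker accept()ed.
client, conn = socket.socketpair()

# Busy worker hands the connection's fd to the idle worker (SCM_RIGHTS).
socket.send_fds(busy, [b"take this one"], [conn.fileno()])
msg, fds, _, _ = socket.recv_fds(idle, 1024, 1)

# The idle worker now holds a duplicate fd for the same connection.
handed_off = socket.socket(fileno=fds[0])
client.sendall(b"hello")
data = handed_off.recv(5)
print(msg, data)  # b'take this one' b'hello'
```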

Could that cause a race condition with multiple processes using epoll on the same file descriptors?

So I think the magic piece of information missing from the article is that workers can have multiple open per-client sockets "on the go": if one worker gets all the client sockets, you're not getting parallelism.

Somewhat amazed that this stuff _still_ isn't properly sorted out given that I was working on applications with a keen interest in the problem in 1996!!

NT's IO Completion Ports implemented some kind of CPU-balanced work delivery mechanism from the beginning, presumably inspired by Cutler's previous work on VMS's $QIO. AIX afaik also had something similar. Linux not so much...but in the intervening 20 years I had assumed epoll() was fixing this.

You assume things in Linux-land are "designed". It's more like natural evolution. It's very organic. As you'd expect, there's a lot of decomposition around. It kinda smells as you'd expect.

The Linux community is known to be infected with a drug-resistant strain of NIHV -- the Not Invented Here Virus. This means that if there's a thing like kqueue, or NT I/O Completion Ports, or Solaris Event ports, and Linux could copy and/or improve one (or more) of those, then you can count on Linux to invent a new thing that isn't as good as any of the others.

Perhaps some pharmaceutical will someday work on a treatment, cure, and/or vaccine for that strain of NIHV. More likely, the community will someday evolve resistance to NIHV, or else make NIHV moot by driving out all the competition.

The good news is that you can resist NIHV yourself. All you have to do is not believe that you alone can solve mankind's problems in a vacuum.

I don't remember it quite like that in this specific case. Rather I heard that there was push back against implementing a completion port model from somewhere (Redhat legal??) fearing patent action by MS. IBM had a mutual license agreement with MS I believe, which explains why AIX has something similar. The patents in question may even be in the set litigated over in the wake of Cutler moving to MS from DEC (which MS paid $$$ for and hence might be inclined to defend aggressively).

Maybe, but they did everything wrong in epoll.

Why doesn't NGINX implement work stealing? Wouldn't that help?

You can control the queue, but not the entries. You can pass the whole queue/socket around via sendmsg(), but not single entries in the queue. This is hard to solve well in user space.

So, what would a non-userspace solution look like? An HTTP keepalive + pipelining + HTTP2 implementation in the kernel that forwards the demuxed messages as separate packets onto a specially-typed socket, such that a prefork daemon can accept(2) individual HTTP request packets?

There are a few problems outlined in the article. I was referring to this one:

"epoll() seem to have LIFO behavior which can result in uneven load balancing across workers (in case of accept()ing from a shared socket)"

Which is unrelated to keepalives and pipelining and best addressed in the kernel.

You can accept a socket and pass it to worker processes/threads using whatever scheme desired.

There are a few problems mentioned in the article, but the main one is before that. It's how to spread the load of the accept() on a single shared socket across more than one worker in a balanced way.

Right, they use EPOLLEXCLUSIVE then complain when no other waiters are woken up. Thundering herd is a problem when you're using hundreds of threads. If you have a smaller number of waiters it's irrelevant. Moreover, under load only a few waiters, if any, will be sleeping on the queue when a new connection comes in, so there won't be any herd at all.

The round-robin patches seemed to be stalled precisely because everyone is bickering over solutions to problems that they've partly created for themselves. They've gone down the rabbit hole and disappeared.

If you want strongly fair scheduling and latency guarantees, just dequeue the sockets and assign them however you want. You introduce a small amount of latency, but you'd get similar latency by enforcing round-robin behavior in the kernel. LIFO is the effective behavior precisely because it's faster for the already running process to dequeue the next event than to park the running process and fire up a sleeping process.

Everybody talks about Windows IOCP, but Windows does precisely this same thing: a pool of threads with a simple scheduling scheme that uses similar asynchronous polling interfaces, just not interfaces exposed to user processes. I suppose it's a PITA to implement this in user space compared to the prepackaged solution Windows provides, but then again the Unix/Linux model is to make it easy to implement these solutions yourself, as opposed to providing one solution without exposing the underlying interfaces.

Remember, scheduling, locking, and IPC primitives are much faster on Linux than on Windows. That's not coincidental. If you're not prepared to roll your own, if you want the vendor to do all the work for you, don't use Linux. The best features (and feature additions) in the Unix universe aren't oriented toward specific production problems, but toward interfaces that make it easier to roll your own solutions. Adding more flags to epoll long ago passed the point of diminishing returns.

I have a question about the following passage regarding the LIFO nature of epoll-and-accept:

>"This behavior causes the busiest process, the one that only just went back to event loop, to receive the majority of the new connections."

What is meant by a process "that only just went back to the event loop"? The worker process is the event loop, no? Isn't the worker process always running an event loop?

Or is this just meant to say when the worker process is not in the section of the event loop that's responsible for the enqueueing and dequeueing of events?

At the end of the event loop you wait for new events. That part.

Thanks. For clarification, the event queue that nginx's worker process is using for its input is the same accept() queue that the passive nginx listener socket is populating. Is that correct?

For more, read the nginx chapter in "The Architecture of Open Source Applications": www.aosabook.org/en/nginx.html


Anyone have any insight into this behavior on other systems (e.g. FreeBSD kqueue)?

It looks like nginx's ability to accept several connections in a row can improve throughput but increases latency. In DragonFly BSD it is now possible to tune the "batch size": http://lists.dragonflybsd.org/pipermail/commits/2017-October...

IOCP on Windows uses a thread pool to dispatch work after async sockets become due for some action. After work for any given socket is completed, it's possible for work to start on a different socket's completion, with the "context switch" (or rather, continuation callback) happening in userland rather than requiring a kernel transition. This inversion of control acts a bit like work stealing.

Comment from reddit: https://www.reddit.com/r/programming/comments/78f1if/why_doe...

LIFO is also the strategy for Completion Ports in Windows:

> Threads that block their execution on an I/O completion port are released in last-in-first-out (LIFO) order, and the next completion packet is pulled from the I/O completion port's FIFO queue for that thread. This means that, when a completion packet is released to a thread, the system releases the last (most recent) thread associated with that port, passing it the completion information for the oldest I/O completion.


Yes, but only for the duration of work on that completion; it's not like sockets are owned by the thread, so it doesn't have the same imbalancing effect, where completions get starved if the original thread is not available. The LIFO nature is a side effect of the user-mode selection of the next completion to run; something round-robin would be slower with kernel transitions.

On FreeBSD with kqueue and without the accept mutex, worker load is also uneven (it becomes more even as load increases). With the accept mutex, load is more even, but the mutex adds some overhead.
