
Author here. There are three highlights in this blog post:

- epoll() seems to have LIFO behavior, which can result in uneven load balancing across workers (when accept()ing from a shared socket)

- using SO_REUSEPORT can worsen latency under high load.

- blocking accept() on Linux is pretty buggy: you basically can't close() the underlying socket

Wow. This is a terrible state of affairs!

The LIFO behavior of epoll could maybe be worked around by having just one process receive accept events at a time and pass the baton to the next process when it accepts a client. I can't believe I have to think about such things. I have a daemon that uses epoll that currently scales quite well on one CPU for what it does, but which I'll probably have to make support multiple processes (or threads) at some point.

And then there's the fact that epoll is not fork-safe.

The other possibility is to have a thread in each process that just blocks in accept4(2), adds the accepted connection to a queue, and sends an event to the other thread in the same process.

Thanks for the post!

Is there anything you could suggest looking into to improve nginx's handling of a high volume of large POST bodies (~5KB each, tens of thousands per second)?

I'm using kernel 4.13 with BBR congestion control on a 20Gbps network and seem to hit this weird bottleneck where it doesn't matter how many nginx processes I have; it performs similarly terribly on both 16-core and 64-core servers. (Of course, irq/process affinity is in place, which makes me think it's an nginx issue.)

You mention 5KB...is there some chance that it exceeds 8KB?

There's a setting called client_body_buffer_size (http://nginx.org/en/docs/http/ngx_http_core_module.html#clie...). It defaults to 8k on x86-64.

If you exceed that setting, it writes the body to a temporary file, which slows things down.

Maybe try bumping that up to 16k?
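If it helps, a minimal sketch of that tweak (16k is just the suggestion above; tune to your actual payload sizes):

```nginx
http {
    # keep ~5KB request bodies in memory instead of spilling
    # them to temp files (default buffer is 8k on x86-64)
    client_body_buffer_size 16k;
}
```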

Also see the other related settings on that page.

Edit: Maybe also test out switching from urlencoded to multipart for the post? For some types of data, urlencoded can bloat the size quite a lot.

Not an nginx user here, but I do know sometimes certain modules in Apache will add significant latency to POST payloads by waiting for the entire POST payload to complete before sending it to the backend (in a reverse proxy setup). The idea is that if the backend fails, it can retry on the next node. For large payloads this sucks for latency. No idea if nginx has this problem.

This is a configurable option in nginx. You can have it wait until the entire request is received before passing to the upstream (backend) or have it stream as data arrives. I seem to recall there is still a manageable buffer when streaming too, though it's been a few years since I looked at that in detail.
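For anyone looking for it, the directive in question is proxy_request_buffering (for the stock proxy module; the fastcgi/uwsgi/scgi modules have their own equivalents). A sketch of the streaming setup, with the upstream name being illustrative:

```nginx
location /upload {
    # stream the request body to the upstream as it arrives
    # instead of waiting for the whole POST to be received
    proxy_request_buffering off;
    proxy_pass http://backend;
}
```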

At that speed it might not be Nginx. Check out Cloudflare's other blog posts too.


And thanks OP. I'm sure thousands of companies do these kinds of optimizations, but Cloudflare is one of the few that bothers to share them.

You made a kind of throwaway comment in the blog post about FIFO helping, but it wasn't clear to me why it would result in any different behaviour from LIFO. Wouldn't that mean it just hits the first host all the time instead of the last?

(work at cloudflare but not on the same team as the blog author)

I believe the FIFO/LIFO refers to the queue of processes waiting to receive a request. Since each process calls accept()/epoll(), then loops around and calls it again, FIFO would ensure the one that just processed a request goes back to the end of the line.

It hits the first worker, and that first worker leaves the FIFO queue, and thus is no longer the first worker anymore. When it calls epoll again, it'll end up as the last one instead.

Ahh of course.

Thanks for the nicely written article, especially the pictures.

Articles like this make for good reading.
