
How we scaled Nginx - sahin-boydas
https://blog.cloudflare.com/how-we-scaled-nginx-and-saved-the-world-54-years-every-day/
======
toast0
> Once in a while, a request gets slowed down enough to matter. My colleague
> Ivan Babrou ran some I/O benchmarks and saw read spikes of up to 1 second.
> Moreover, some of our SSDs have more performance outliers than others.

If you're running Intel DC 3x10 SSDs, check for a firmware update that
improves 'maximum latency' in some cases; the update was released some years
ago, but people might not have noticed it.

~~~
dev_dull
Is there a place for benchmarks where we can discover the “best” SSDs? I find
their performance varies wildly.

~~~
windows_tips
https://www.tomshardware.com/reviews/best-ssds,3891.html

~~~
olavgg
These are consumer-level SSDs that are well known among professionals to be
slow. They are good for quick bursts of traffic but throttle under heavy
sustained use. They also have horrible sync-write performance:
https://forums.servethehome.com/index.php?threads/did-some-write-benchmarks-of-a-few-ssds.15231/

They belong in a desktop/gaming machine, not in a production web server.

Here is an old review with enterprise drives and the Samsung 840 Pro, a
consumer drive that received excellent reviews when it was released; notice
how it is the worst performer over time:
https://www.storagereview.com/micron_m500dc_enterprise_ssd_review

The exception is the Optane 900P, which is blazing fast! A game changer!
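
If you want to check a drive yourself, the quick-and-dirty version is timing
write()+fsync() pairs, which is exactly the pattern where these drives fall
apart. A minimal sketch (illustrative only; use fio for anything serious):

    /* Time write()+fsync() pairs to see sync-write latency.
     * Illustrative sketch only -- no error handling, and the
     * filename is arbitrary. Watch the tail, not the mean. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        char buf[4096] = { 0 };
        for (int i = 0; i < 1000; i++) {
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            write(fd, buf, sizeof(buf));
            fsync(fd);                  /* force it to stable storage */
            clock_gettime(CLOCK_MONOTONIC, &t1);
            long us = (t1.tv_sec - t0.tv_sec) * 1000000
                    + (t1.tv_nsec - t0.tv_nsec) / 1000;
            printf("%ld us\n", us);
        }
        close(fd);
        return 0;
    }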

------
enitihas
Non-blocking disk I/O is one thing where NT is really ahead of all *nix OSes.
Unlike, say, network I/O, where we have all sorts of platforms (Go, Node) that
let you scale by doing async I/O, there aren't many options for disk I/O,
primarily because *nix itself offers so few primitives for it.

~~~
khc
Author of the post (and the engineer who did the work) here.

There are ways to do non-blocking disk I/O in *nix (aio/io_submit on Linux),
but all of them require you to have an open file descriptor first. Does NT
allow you to open a file in an async fashion?
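
To make the first point concrete, here is a rough sketch of the kernel AIO
path (illustrative; most error handling omitted and the path is hypothetical).
The read is submitted asynchronously, but the fd it needs still comes from a
plain, potentially blocking open():

    /* Rough sketch of Linux kernel AIO. Note the async machinery
     * only covers the read -- the fd must come from a normal open(),
     * which is the blocking step in question. io_submit() is also
     * only reliably non-blocking with O_DIRECT. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/aio_abi.h>

    int main(void)
    {
        /* Step 1: no async open() exists -- this can stall on a cold
         * dentry/inode, which is exactly the long-tail problem.
         * The path here is hypothetical. */
        int fd = open("/tmp/example", O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        aio_context_t ctx = 0;
        syscall(SYS_io_setup, 128, &ctx);

        /* Step 2: only the read itself is asynchronous. */
        char buf[4096] __attribute__((aligned(4096)));
        struct iocb cb = { 0 };
        cb.aio_fildes = fd;
        cb.aio_lio_opcode = IOCB_CMD_PREAD;
        cb.aio_buf = (unsigned long)buf;
        cb.aio_nbytes = sizeof(buf);

        struct iocb *cbs[1] = { &cb };
        syscall(SYS_io_submit, ctx, 1, cbs);

        struct io_event ev;
        syscall(SYS_io_getevents, ctx, 1, 1, &ev, NULL);
        printf("read %lld bytes\n", (long long)ev.res);

        syscall(SYS_io_destroy, ctx);
        close(fd);
        return 0;
    }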

~~~
drewg123
Netflix kernel engineer here. We use FreeBSD's async sendfile() and not aio,
so it would be a bit harder for us to fix open latency, since we're not using
aio.

I had not thought about open latency being an issue; that's fascinating.
Looking at one of our busy 100G servers with NVMe storage, sampling over a
few-minute period I see openat syscall latencies of no more than 8ms, with
almost everything below 65us. However, the workload is probably different
from yours (longer-lived connections, fewer but larger files, fewer opens in
general). E.g., we probably don't have nearly the "long tail" issue you do.

~~~
khc
Right, I suspect you have way fewer files than we do and everything is in the
dentry cache. Pretty sure most of your files are bigger than 60KB too :-)
(which is our p90)

~~~
pg314
Have you looked into using something like SQLite instead of the filesystem? [1]

[1] https://www.sqlite.org/fasterthanfs.html

~~~
Kalium
SQLite makes a _ton_ of sense for systems that don't need to worry about
concurrent writes. It's possible that a CDN's cache system might need to
concern itself with concurrent writes.
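
For a sense of what the read path could look like, here is a minimal sketch
against a hypothetical assets table using SQLite's C API. WAL mode allows
concurrent readers alongside a single writer, though still only one writer at
a time:

    /* Minimal sketch of serving a small cached asset out of SQLite
     * instead of open()/read(). The database name, table layout, and
     * key are all hypothetical. */
    #include <sqlite3.h>
    #include <stdio.h>

    int main(void)
    {
        sqlite3 *db;
        sqlite3_stmt *stmt;

        if (sqlite3_open("cache.db", &db) != SQLITE_OK) return 1;
        sqlite3_exec(db, "PRAGMA journal_mode=WAL;", NULL, NULL, NULL);

        /* One prepared statement replaces the open()+read()+close()
         * trio; there is no per-asset file descriptor, hence no
         * blocking open() at all. */
        sqlite3_prepare_v2(db, "SELECT body FROM assets WHERE key = ?1",
                           -1, &stmt, NULL);
        sqlite3_bind_text(stmt, 1, "/img/logo.png", -1, SQLITE_STATIC);

        if (sqlite3_step(stmt) == SQLITE_ROW) {
            const void *blob = sqlite3_column_blob(stmt, 0);
            int len = sqlite3_column_bytes(stmt, 0);
            fwrite(blob, 1, len, stdout);  /* hand bytes to the client */
        }

        sqlite3_finalize(stmt);
        sqlite3_close(db);
        return 0;
    }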

------
erikb
I love how happy they are to work around blocking open().

This is a very common way of thinking, but in fact there are only two ways to
handle I/O, and no matter what you do, you always end up with one of them:

Path 1, blocking I/O: your process continues to the point where the I/O
starts, sends the corresponding request to the kernel, and waits until it
gets a response, potentially forever. This uses very few resources, but
potentially hanging forever is a steep price. So usually people put the I/O
in a thread/fork and have the parent enforce a timeout while waiting.
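
A minimal sketch of that thread-plus-timeout pattern (illustrative: the
helper name and global state are just for brevity, and a timed-out worker
thread keeps running and leaks its fd):

    /* Do the blocking open() in a worker thread; give the parent a
     * bounded wait via pthread_cond_timedwait(). Sketch only. */
    #include <fcntl.h>
    #include <pthread.h>
    #include <time.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  done = PTHREAD_COND_INITIALIZER;
    static int result_fd = -2;           /* -2 means "still pending" */

    static void *opener(void *path)
    {
        int fd = open((const char *)path, O_RDONLY); /* may block */
        pthread_mutex_lock(&lock);
        result_fd = fd;
        pthread_cond_signal(&done);
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int open_with_timeout(const char *path, int timeout_sec)
    {
        pthread_t t;
        pthread_create(&t, NULL, opener, (void *)path);
        pthread_detach(t);

        struct timespec deadline;
        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_sec += timeout_sec;

        pthread_mutex_lock(&lock);
        while (result_fd == -2)
            if (pthread_cond_timedwait(&done, &lock, &deadline) != 0)
                break;                    /* timed out */
        int fd = result_fd;
        pthread_mutex_unlock(&lock);
        return fd;                        /* -2 means "gave up waiting" */
    }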

Path 2, non-blocking I/O: when the process hits the I/O, it fails almost
immediately if the desired resource (file, port, whatever) is not available.
So usually you write a loop and repeatedly poll for the resource to become
ready. This has a rather high cost: your code gets more complicated (loops,
exceptions, etc.) and whatever you are doing I/O against sees more activity
(e.g., if you poll a webserver, each client process constantly creates load
on it). The advantage is that you can't hang forever, because you break the
loop after x seconds or y retries.
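
A minimal sketch of that pattern for a socket connect (note that on Linux
this doesn't help with regular files, since O_NONBLOCK is ignored for
ordinary disk reads; that is exactly why blocking open()/read() is the hard
case here):

    /* Set O_NONBLOCK, let the call return immediately, then wait a
     * bounded amount of time with poll(). Sketch only. */
    #include <errno.h>
    #include <fcntl.h>
    #include <poll.h>
    #include <sys/socket.h>

    int connect_with_timeout(int sock, const struct sockaddr *addr,
                             socklen_t len, int max_wait_ms)
    {
        fcntl(sock, F_SETFL, fcntl(sock, F_GETFL) | O_NONBLOCK);

        if (connect(sock, addr, len) == 0)
            return 0;                 /* connected immediately */
        if (errno != EINPROGRESS)
            return -1;                /* immediate, non-hanging failure */

        struct pollfd pfd = { .fd = sock, .events = POLLOUT };
        if (poll(&pfd, 1, max_wait_ms) <= 0)
            return -1;                /* bounded wait: cannot hang forever */

        int err = 0;
        socklen_t elen = sizeof(err);
        getsockopt(sock, SOL_SOCKET, SO_ERROR, &err, &elen);
        return err ? -1 : 0;          /* deferred connect() result */
    }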

You might feel this sucks (at least I do), but there are no other options.
Pick the version you can live with more easily, tune the knobs you can fiddle
with (timeouts, retries), and then move on to other problems.

------
noncoml
Relevant stackoverflow question, which makes for an interesting read:

https://stackoverflow.com/questions/22780822/linux-kernel-aio-open-system-call

------
LinuxBender
Does Cloudflare submit all of its improvements to nginx upstream? Does
nginx.org accept/merge the improvements?

~~~
khc
Author here.

I've talked to an nginx product manager, and he told me that changes specific
to one customer are unlikely to be accepted.

Also, we took some shortcuts in our implementation, so it may not be suitable
for upstream as-is anyway.

~~~
LinuxBender
Understood. Perhaps you could make your improvements modular in some cases so
that people can toggle them on, either as nginx modules or as compile flags
in the nginx core?

~~~
khc
In this case it isn't possible to implement the change as an nginx module,
but we are looking into releasing the patch as-is.

~~~
devwastaken
It would be really great for some public projects if your internal
modifications and updates to nginx were a public repo. That repo could be
compiled and packaged for use by open source projects that benefit from those
modifications. I say this because I've seen multiple patches from Cloudflare
floating around, but it's very difficult for one person to track them all
down, know which version of nginx each targets, and adapt the patches to
newer nginx releases such as security updates. If you modify nginx
internally, I don't doubt there are lots of changes and improvements over
time that never get organized or published publicly.

I think it'd be great if more companies released their own 'opinionated'
versions that update along with their infrastructure, say, if I
hypothetically wanted to host an OpenStreetMap tile server using some
features Cloudflare has in their nginx builds. It makes it easy for
white hats to test, too.

IIRC I've been interested in HPACK for small HTTP responses where the headers
are larger than the body, but if I want to use the HPACK patch I have to
re-implement it every time an update comes out that modifies the file.

------
kev009
https://www.slideshare.net/facepalmtarbz2/new-sendfile-in-english

------
iopuy
Does nginx still force you to recompile the program to get access to the web
application firewall? I remember this being a sticking point years ago when
evaluating the product.

~~~
merlincorey
No, nginx supports dynamic modules now.
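
For example, a module built as a shared object is loaded from nginx.conf at
runtime; the module name below is the ModSecurity v3 connector, shown as an
illustration:

    load_module modules/ngx_http_modsecurity_module.so;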

------
brian-armstrong
If they had written it in Rust this never would have happened

~~~
deathanatos
My understanding is that most of Rust's standard library translates to the
same blocking calls on Linux, so it would be plagued by the same issues as C.
(And the solution reached in the article would work equally well in Rust.)

There are certainly async I/O libraries for Rust (e.g., Tokio), but those are
limited by the primitives the OS gives them. (AFAIK, Tokio's core libraries
don't directly do async disk I/O; there is tokio-fs, and it does it by
shunting the work to a threadpool.) The fact that disk I/O is so uniquely
special on Linux affects any language, as it is an aspect of the kernel
itself.

I _love_ Rust, and while there are a ton of compelling reasons to use it, I
don't think it's fair to say it would have prevented this from happening, in
this particular case.

