
How We Designed for Performance and Scale - fcambus
http://nginx.com/blog/inside-nginx-how-we-designed-for-performance-scale/
======
rdtsc
Btw there are new flags in the kernel to help avoid the shared-memory accept
mutex -- EPOLLEXCLUSIVE and EPOLLROUNDROBIN

This should round robin accept in the kernel, and not wake up all the epoll
listeners.

[https://lwn.net/Articles/632590/](https://lwn.net/Articles/632590/)

~~~
15155
Isn't this exactly what EPOLLET (edge triggering) does?

~~~
_wmd
Edge vs level triggered refers to how a change in an fd's state is reflected
in userspace: level triggering will cause repeated wakeups while the condition
remains true, whereas edge triggering will cause exactly one, until the
monitored state flips from true to false and back again.
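
The distinction can be demonstrated with Python's `select.epoll` on a pipe (a
minimal sketch; Linux only):

```python
import os
import select

r, w = os.pipe()
os.set_blocking(r, False)
os.write(w, b"x")

# Level-triggered (the default): the fd is re-reported on every poll
# for as long as unread data remains.
lt = select.epoll()
lt.register(r, select.EPOLLIN)
assert lt.poll(0)        # readable
assert lt.poll(0)        # still reported: the level is still "true"

# Edge-triggered: readiness is reported once per transition; polling
# again without draining the pipe returns nothing.
et = select.epoll()
et.register(r, select.EPOLLIN | select.EPOLLET)
assert et.poll(0)        # the edge is reported once...
assert not et.poll(0)    # ...and not again until a new edge occurs

os.write(w, b"y")        # new data arriving is a new edge
assert et.poll(0)
```

This is also why edge-triggered consumers must drain the fd completely before
sleeping again.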

The new flags relate to what happens when multiple threads are waiting on the
same set of file descriptors (i.e. sleeping on the same epoll FD passed as the
first parameter to epoll_wait(), or having the same client FD in multiple
epoll sets -- sorry I'm not sure which way around it is).

Previously the kernel had no support for waking exactly one thread to handle
one event, so if there was a single set shared among a bunch of sleeping
tasks, all tasks would be scheduled, causing (presumably) synchronization
contention on the kernel's internal structures. The new mode ensures only a
single task is woken up for an event on a single FD, even when multiple tasks
are waiting on it.

------
nailer
This is a lovely article, but:

> The fundamental basis of any Unix application is the thread or process.
> (From the Linux OS perspective, threads and processes are mostly identical;
> the major difference is the degree to which they share memory.)

It's better to be specific in performance discussions, rather than use
'thread' and 'process' interchangeably.

As well as the memory-sharing difference the article mentions, threads (which
are called Lightweight Processes, or LWPs, in Linux 'ps') are granular.

    
    
        ps -eLf
    

NLWP in the output above is 'number of lightweight processes', i.e. the number
of threads.

Processes are not granular: they're made up of one or many threads. IIRC it
can be beneficial to assign threads of the same process to the same physical
core or same die for cache affinity. There are all kinds of performance
situations where 'threads' and 'processes' do not mean the same thing. Being
specific is rad.

~~~
alexchamberlain
In Linux, both are just entries in the process table. Both are created by the
`clone` syscall, and the "normal" ways of creating each simply share different
sets of resources by _default_.

You're right to say that treating them differently can be beneficial in some
situations, but it really depends.

~~~
nailer
Yep. I guess the most basic case is: "is this single process multithreaded, as
I have a multicore machine and there's only one PID"
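
On Linux that check can be sketched by reading the `Threads:` field (the NLWP
count) from `/proc/self/status`:

```python
import threading

def count_threads():
    # Linux-only: the Threads: field of /proc/self/status is the NLWP count,
    # i.e. how many kernel tasks share this one PID
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("Threads:"):
                return int(line.split()[1])

before = count_threads()
stop = threading.Event()
workers = [threading.Thread(target=stop.wait) for _ in range(3)]
for t in workers:
    t.start()

# still one PID, but three more lightweight processes
assert count_threads() == before + 3

stop.set()
for t in workers:
    t.join()
```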

------
cshimmin
> You can reload configuration multiple times per second (and many NGINX users
> do exactly that)

I thought this was an interesting remark. Can anyone clue me in to what these
"many users" might be doing, that requires them to reload configuration so
frequently?

~~~
threeseed
Service Discovery. Dynamically generating routing rules so microservices can
have pretty URLs. For example with us when you deploy a new version of a
microservice it starts on a random port, registers with Consul on startup and
then dynamically regenerates and reloads Nginx.
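
As a hedged sketch of that pipeline (the service name, addresses, and ports
are made up, and the reload step is shown but commented out), the regeneration
amounts to templating an `upstream` block from the current catalog:

```python
# backend instances as a registry like Consul might report them
# (hypothetical data)
instances = [("10.0.0.5", 31742), ("10.0.0.6", 28019)]

def render_upstream(name, servers):
    # build an nginx upstream block listing the live backends
    lines = [f"upstream {name} {{"]
    lines += [f"    server {host}:{port};" for host, port in servers]
    lines.append("}")
    return "\n".join(lines) + "\n"

conf = render_upstream("ordersvc", instances)
print(conf)
# in production: write conf into nginx's config directory, then
# subprocess.run(["nginx", "-s", "reload"], check=True)
```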

------
nodesocket
"NGINX’s binary upgrade process achieves the holy grail of high-availability;
you can upgrade the software on the fly, without any dropped connections,
downtime or interruption in service."

Is this really true? I remember seeing an article[1] recently on using an
iptables hack to prevent dropping connections when reloading haproxy. Does
nginx actually provide zero-downtime configuration reloads?

[1] [https://medium.com/@Drew_Stokes/actual-zero-downtime-with-haproxy-18318578fde6](https://medium.com/@Drew_Stokes/actual-zero-downtime-with-haproxy-18318578fde6)

~~~
gull
The holy grail of high-availability isn't upgrading stateless software. It's
upgrading stateful ones.

Like upgrading when a data structure changes between versions. The HTTP
protocol nginx serves is stateless and by comparison far simpler. The same
goes for Erlang: it offers nothing more than simple function replacement, and
that's not enough to handle data structure changes either.

~~~
toast0
In Erlang, if you need a new data structure for your state, you can check if
your state is old, and upgrade it and continue on. In a gen_server, you might
have something like

    
    
      handle_call(Request, From, State) when is_record(State, state) ->
          handle_call(Request, From, upgrade_state(State));
      handle_call(Request, From, State) when is_record(State, state2) ->
          ...
    

(you'll want to do something similar on handle_cast and handle_info if you use
those). You have to do a little work, but I don't see how you avoid that?

~~~
gull
That's not guaranteed to be safe.

Having new code check if your state is old and upgrading isn't enough. You
also need to check old code doesn't process new state. That becomes harder
under concurrency.

[http://en.wikipedia.org/wiki/Dynamic_software_updating#Updat...](http://en.wikipedia.org/wiki/Dynamic_software_updating#Update_safety)

Erlang doesn't offer this check.

_"Old code may still be evaluated because of processes lingering in the old
code."_

[http://www.erlang.org/doc/reference_manual/code_loading.html...](http://www.erlang.org/doc/reference_manual/code_loading.html#id86424)

~~~
toast0
> Having new code check if your state is old and upgrading isn't enough. You
> also need to check old code doesn't process new state. That becomes harder
> under concurrency.

If we're talking about a gen_server, the state is per process, and once the
process has switched to the new code, it won't go back, so there's no problem
with old code and new state. In non gen_server code, you do need to be careful
about when you hit a boundary that gets you into new code; you'd typically
want it to be your process's main loop, since that usually tail recurses and
doesn't leave a stack in the old code. It is difficult to reason about a
situation where you call into new code, and that returns to old code; it's
much better to avoid it.

The concurrent case is OK too, each process manages its own state, and
upgrades it when it switches to new code. Are you thinking about changes to
messages that are being passed and/or global state? In that case, like with
any distributed system, you need to load in stages: first load code that can
handle old and new messages, then trigger sending new messages (code load or
config setting), then load code that only handles new messages.
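
The staged scheme can be sketched in Python (the message fields here are
hypothetical): the intermediate deployment accepts both wire formats, so
producers can be flipped to the new format at any point afterwards.

```python
def handle(msg):
    # stage-1 code: normalize the old wire format into the new one,
    # so old and new producers can coexist during the rollout
    if "v" not in msg:                 # old format had no version field
        msg = {"v": 2, "user": msg["uid"], "name": msg["username"]}
    assert msg["v"] == 2
    return f'{msg["user"]}:{msg["name"]}'

assert handle({"uid": 7, "username": "ada"}) == "7:ada"       # old producer
assert handle({"v": 2, "user": 7, "name": "ada"}) == "7:ada"  # new producer
```

Once no old producers remain, a final deployment can drop the normalization
branch.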

~~~
gull
You touched on most points, primarily avoiding the situation of reasoning
about concurrent old and new code. I didn't know a gen_server manages code
loading like this, thank you for that.

An upgrade doesn't involve only the in-memory state per process, though. It
also involves state outside the process, like state on disk. Even if each
process upgrades its own state (I'm assuming the gen_server isn't limited to
in-memory state; I don't know), it isn't safe for an old process to read a
data structure from disk that differs from the one the new process uses. You
can't just upgrade old processes in stages.

An upgrade can also involve multiple processes. It's hard to upgrade all of
them at once. As you mentioned, in the hardest case of all, a distributed
system, loading in stages may be the only option, provided the system was
explicitly designed such that old and new processes can coexist without safety
issues.

[http://pmg.csail.mit.edu/upgrades/](http://pmg.csail.mit.edu/upgrades/)

------
amelius
One thread per CPU, and non-blocking I/O: that sounds like the usual way to
approach the problem. I'm surprised it uses state machines to handle the non-
blocking I/O, because modern software engineering provides much more pleasant
approaches, such as coroutines.
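
For illustration (a hedged sketch, not nginx's actual code): the same
header-parsing step written as an explicit state machine and as a generator
coroutine. In the coroutine, the suspension point itself replaces the stored
state variable.

```python
# 1. explicit state machine: "where am I" lives in a state variable
def parse_fsm(lines):
    headers, state = {}, "headers"
    for line in lines:
        if state == "headers":
            if line == "":
                state = "body"     # blank line ends the header section
            else:
                k, v = line.split(": ", 1)
                headers[k] = v
    return headers

# 2. generator coroutine: "where am I" is the point of suspension
def parse_coro():
    headers = {}
    while True:
        line = yield               # suspend until the next line arrives
        if line == "":
            return headers         # blank line: finish with the result
        k, v = line.split(": ", 1)
        headers[k] = v

lines = ["Host: example.com", "Accept: */*", ""]
assert parse_fsm(lines) == {"Host": "example.com", "Accept": "*/*"}

g = parse_coro()
next(g)                            # advance to the first yield
try:
    for line in lines:
        g.send(line)
except StopIteration as done:
    assert done.value == {"Host": "example.com", "Accept": "*/*"}
```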

~~~
nly
Coroutines have the distinct disadvantage of needing a stack, much like
threads. So-called 'stackless' coroutines aren't really so different from
computed gotos in a state machine.

~~~
amelius
For parsing HTML, you wouldn't need that large a stack. Further, you could
make the stack grow dynamically at very small cost. Also, I bet that even if
stackless coroutines are close to FSMs, they are much easier to program.

~~~
exacube
You need green/user-space threads if you want to use coroutines effectively,
because if a coroutine blocks, it blocks the entire OS thread from doing
anything else.

------
McElroy
I did as they said at the bottom and gave them my e-mail and other personal
details so I could download the eBook they were giving free preview copies of,
"Building Microservices". Unfortunately, they only sent a link to a PDF, so
it's not usable to me. Just a heads up so you can save yourself the time of
discovering that. (I'll wait until the book is finished and then buy it so I
get an ePub. I like O'Reilly and have bought many books there before.)

~~~
andorov
If you have a Kindle, you can send a PDF to your Kindle email address with the
subject 'convert':

[http://www.amazon.com/gp/sendtokindle/email](http://www.amazon.com/gp/sendtokindle/email)

------
simi_
I had a feeling I'd read about nginx before:
[http://aosabook.org/en/nginx.html](http://aosabook.org/en/nginx.html)

The whole book is worth a read, although I found some sections painfully
boring (perhaps my limited attention span is to blame).

~~~
the_why_of_y
There's a 3rd volume "The Performance of Open Source Applications" now, and it
has a chapter on another high performance HTTP server, Warp:

[http://www.aosabook.org/en/posa/warp.html](http://www.aosabook.org/en/posa/warp.html)

It's interesting what kind of performance one can get out of GHC nowadays. The
article says the authors of Warp had to implement a new parallel IO manager
for GHC to get there, but that was merged into GHC 7.8.

------
saurabhtandon
Interesting overview. I wish they had included some comparative data to show
the significance and efficiency of this approach versus other/older
approaches.

------
dschiptsov
They forgot to mention pool-allocated buffers, zero-copy strings, and a very
clean, layered codebase where every syscall was counted.

The original nginx is a rare example of the best in software engineering: deep
understanding of principles and an almost obsessive attention to detail (which
is obviously good). Its success is justified.

