
Fast Servers - mr_tyzic
http://geocar.sdf1.org/fast-servers.html
======
someone13
Okay, this is a really cool post, but I have a small bit of criticism - the
code samples are _really hard_ to read. I'd recommend, at minimum, adding a
bit more whitespace so you don't end up with lines like this:

    
    
        if(e[i].events&(EPOLLRDHUP|EPOLLHUP))close(e[i].data.fd);
    

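For example, even just this much whitespace makes the same statement far easier
to scan:

    if (e[i].events & (EPOLLRDHUP | EPOLLHUP))
        close(e[i].data.fd);
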
Despite that minor criticism - pretty cool stuff!

~~~
kev009
I thought it was misformatting but wow look at his code
[https://github.com/geocar/dash/blob/master/d.c](https://github.com/geocar/dash/blob/master/d.c)

~~~
geocar
I've been known to write things for other people, in other ways at times[1],
[2], but I'll usually start out writing this way[3], then expand it out
_simply_ to avoid that recoil that you just experienced.

I actually find working this way very helpful to finding bugs and noticing
redundancies, and if you're curious about techniques that improve the
programmer, this method is worth some study.

[1]:
[https://github.com/geocar/mmap/blob/master/mmap.cpp](https://github.com/geocar/mmap/blob/master/mmap.cpp)

[2]: [https://github.com/geocar/cdp-tools/blob/master/cdp-
listen.c](https://github.com/geocar/cdp-tools/blob/master/cdp-listen.c)

[3]:
[https://github.com/geocar/con/blob/master/con.c](https://github.com/geocar/con/blob/master/con.c)

~~~
swinglock
But what is that method? To my untrained eye it just looks obfuscated.

~~~
geocar
I am still working on a good explanation. Here's what I've come up with so
far:

I have noticed every page I scroll causes a comprehension loss of around 90%,
so in reading something that is 10 pagefuls long, I might only be able to
reproduce a tiny part of the program.

I find that by not scrolling, and just moving my eyes, I rapidly absorb the
program, and I find most bugs just by reading the code. This practice is
absolutely impossible for me if I have to scroll very far, and is made
difficult by scrolling at all.

A lot of J/K/Q/KDB programmers do this because it is similar to APL, but I
don't have an APL background. I simply do not know how to do this except to
write horizontally.

I learned a lot about this method by studying Arthur's code (example[1]). He
frequently recommends Iverson's paper[2]. I just try to make every character
count because it's still taking up space.

Maybe keeping that in mind will help you read each word carefully -- maybe
even out loud, if it helps.

[1]:
[http://www.nsl.com/papers/origins.htm](http://www.nsl.com/papers/origins.htm)

[2]:
[http://www.jsoftware.com/papers/tot.htm](http://www.jsoftware.com/papers/tot.htm)

~~~
kev009
To me it looks machine generated and it would be hard for me to mentally keep
track and tick off "okay I know what the lines above do" but if it works for
you and the people you work with are fine with it then good for you and I
don't mean that sarcastically.

The way I work is similar to that of most people I know, though perhaps with
less screen real estate: I can fit a hundred vertical lines of one text buffer
comfortably on my monitors (plural), and I can have about a dozen buffers up
at the same time. If needed, that can be the same document scrolled to
different areas. I also use a tool like cscope or for some languages an IDE to
get additional context (type info, available operations), and that's the only
way for me to quickly comprehend truly large bodies of code (kernels, etc). I
don't think your strategy would work well for the projects I work on.

~~~
geocar
We will see.

One of the fastest databases in the world (kdb) is written this way, and I was
able to make an operating system kernel this way (kparc), and now a dynamic web
server, and so on.

Maybe it would help you to keep an open mind to know it took some time to
learn how to read and write code this way?

Perhaps it would also help to know it has actually improved my ability to work
on code that looks more like what you are used to.

One of the hardest things to do is to look at something so alien, and suppress
the recoil and discomfort and really be critical:

Here is something you cannot read; something others can do that you cannot.

Don't you want to figure out how to do something that others can do?

Maybe you will find other limitations: its alien nature is certainly a
limitation, but maybe you're right and there are others.

But in the meantime, it produces small fast and correct code quickly, and
that's pretty amazing. Maybe there are other benefits as well.

We will see.

~~~
kev009
I guess it seems like I'm bashing you by expressing shock but I don't mean
that. Seriously, more power to you for expressing yourself in less space than
I can or desire to.

~~~
geocar
I don't think you're bashing me.

I'm telling you the shock is normal, and that this is something you can learn
to do.

However the desire surprises me: If someone does something I cannot do, and it
is beneficial, I want to learn it.

Here I point to smaller, faster, and more-correct programs, written in less
time. Why would any programmer not want to be able to do _that_?

I never understood Paul's Blub essay either, though.

------
Matthias247
I can't completely understand the main message of what is proposed here.

Using multiple threads where one thread accepts connections and others process
the connections is already quite standard (e.g. look at how Netty for Java
works with a boss thread and worker threads).

However, the pattern won't work with blocking IO as suggested on the referenced
page if your worker thread is supposed to handle multiple connections. Even
if poll tells you the connection is readable there might not be enough data
for a complete request - so you need a state machine for reading again. Or you
block until you have read a complete request and thereby block other
connections that should be served by the same thread(pool). And if you block
on writing responses then one slow connection will block the processing of the
others.
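
Concretely, the per-connection bookkeeping ends up looking something like this
(a rough sketch, not anything from the article; the `conn` struct and the
blank-line framing are assumptions):

    #define _GNU_SOURCE          /* for memmem() */
    #include <errno.h>
    #include <string.h>
    #include <unistd.h>

    struct conn {
        int    fd;
        char   buf[8192];
        size_t len;              /* bytes buffered so far */
    };

    /* Called when poll/epoll says the fd is readable. Returns 1 once a
     * complete request ("\r\n\r\n"-terminated here) is buffered, 0 if we
     * must wait for more data, -1 on error or EOF. */
    static int on_readable(struct conn *c)
    {
        ssize_t n = read(c->fd, c->buf + c->len, sizeof(c->buf) - c->len);
        if (n == 0) return -1;                        /* peer closed */
        if (n < 0)  return errno == EAGAIN ? 0 : -1;
        c->len += (size_t)n;
        return memmem(c->buf, c->len, "\r\n\r\n", 4) ? 1 : 0;
    }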

It should also be considered that by no means all network protocols follow
a pure request -> response model. If the server may send responses
asynchronously (out of order), or if there is multicast/broadcast support in
the protocol, the requirements on the software architecture look different.

------
markpapadakis
There is nothing fundamentally new described in the blog post, although
pinning threads to cores is almost never the default mode of operation (though
many popular servers expose an option for turning this on).

As someone else stated in another post, SO_REUSEPORT and accept4() are, all
things considered, the best way to accept connections across multiple threads.
Again, most modern servers support this by default, if supported by the
underlying operating system (e.g. nginx).

By the way, Aerospike accepts new connections in a single thread and then
adds the socket FD to one of the i/o threads (round-robin selection scheme)
directly, using e.g. epoll_ctl(fd, EPOLL_CTL_ADD, ..).
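
Roughly this shape (a sketch of that handoff, not Aerospike's actual code;
NWORKERS and worker_epfd are placeholders):

    /* One thread accepts, then hands each new fd to a worker's epoll
     * instance round-robin via epoll_ctl(). Sketch only. */
    #define _GNU_SOURCE              /* for accept4() */
    #include <sys/epoll.h>
    #include <sys/socket.h>

    #define NWORKERS 4
    extern int worker_epfd[NWORKERS];   /* one epoll fd per i/o thread */

    void accept_loop(int listen_fd)
    {
        for (unsigned i = 0; ; i++) {
            int fd = accept4(listen_fd, NULL, NULL, SOCK_NONBLOCK);
            if (fd < 0)
                continue;
            struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
            epoll_ctl(worker_epfd[i % NWORKERS], EPOLL_CTL_ADD, fd, &ev);
        }
    }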

See [http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-
ma...](http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-
context.html) for costs of context switching, and performance improvement when
pinning threads to cores (it's quite impressive). Also, note that according to
the author, on average, a context switch is 2.5-3x more expensive when using
virtualization.

You may also want to read [https://medium.com/software-development-2/high-
performance-s...](https://medium.com/software-development-2/high-performance-
services-using-coroutines-ac8e9f54d727#.ho0s7q28b) -- it's been a long time
since I wrote this, but it describes how one can deal with asynchronous I/O
and other operations that may block a thread.

~~~
geocar
I'm glad Aerospike uses this technique.

Yes, it is not new. I've been using this method in one form or another (e.g.
SIGIO/F_SETOWN) for more than a decade; however, nginx (relatively modern?)
_doesn't_ use this technique and could really benefit from it on some
workloads. I wrote about it after I casually mentioned it to a roomful of quant
programmers and found out they didn't know about it; I do not think most
servers use this technique.

I've added a note about accept4(). When I have some time I intend to repeat my
benchmarks with it to see how it does. Context switching is much cheaper than
memory so I do not expect much.

You should publish benchmarks that are easy for other people to reproduce. I
tried running Aerospike on my laptop and was unable to get above 25,000
writes/sec. I'm certain I'm doing something wrong, because the same hardware
easily gets above 60,000 writes/sec on KDB per CPU.

~~~
markpapadakis
I am not using AeroSpike nor am I affiliated with them. I just studied their
code base (I study codebases as a hobby). There is a _lot_ of room for
improvement there IMO but that's true for the majority of codebases I went
through :)

------
halayli
This is not an ideal design, surprised it has this many upvotes.

Transferring the accepted connections this way involves an extra system call
to epoll/kqueue to add the event into the particular thread and the accepting
thread can become a bottleneck under high load.

A better design would be to share the listening socket across threads and have
each thread accept at its own pace; at least this avoids the additional
kqueue/epoll system call needed to add the new fd into the thread's poller,
though it does cause lock contention in the OS, which is still less expensive
than a system call. What's even better, if you're on a newer Linux version or
on BSD, is to use SO_REUSEPORT, which allows each thread to bind/accept on the
same port and avoids the lock contention issue.

Also, you should consider using accept4() to set the non-blocking flag during
accept instead of making an additional system call to set it to non-blocking.
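
A rough sketch of the SO_REUSEPORT variant (error handling omitted; each worker
thread sets up its own listener on the same port, so there is no shared accept
lock):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    /* Each worker thread calls this; the kernel then spreads incoming
     * connections across the listening sockets. */
    int reuseport_listener(unsigned short port)
    {
        int one = 1;
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
        struct sockaddr_in sa = { .sin_family = AF_INET,
                                  .sin_port   = htons(port) };
        sa.sin_addr.s_addr = htonl(INADDR_ANY);
        bind(fd, (struct sockaddr *)&sa, sizeof(sa));
        listen(fd, SOMAXCONN);
        return fd;
    }

    /* ...and accept4(fd, NULL, NULL, SOCK_NONBLOCK) then sets the
     * non-blocking flag in the same call as the accept. */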

~~~
geocar
One overlooked benefit of this design is that you can have several dedicated
accepting-threads. The benefit is magnified when you have something like HTTP
where the client speaks first.

Another overlooked benefit of this design is that individual cores may not
_ever_ need to read memory -- the entire task can run in L1 or L2. If a single
worker becomes too complicated this benefit is lost, and memory is much _much_
slower than cache.

SO_REUSEPORT is not a panacea, and I do not agree that you should have threads
accept _and_ handle requests, but I do think SO_REUSEPORT is useful for making
upgrades easier. I have (historically) simply blocked outgoing ICMP port-
unreachable messages for ports that I know are supposed to be running since
that's more portable than SO_REUSEPORT.

accept4() is a good idea. I've added a note about that. Thank you.

------
wittrock
Backpressure between threads and utilization could be hard here. Balance
between the speed of the accept, request, and worker threads is something I'm
curious about. In theory, you could create pools of each if you find that one
set bottlenecks the other. Also, workload isolation is important--I'm curious
how the author deals with (or avoids) transferring ownership of large amounts
of memory and cache between cores without incurring significant transfer
overhead.

~~~
geocar
Try not to transfer memory.

On small examples, I arrange memory access so that once the memory has been
migrated, a lack of thrashing will be good enough. If I need a hot start, I
can use another trick:

On Linux, you can create separate processes that share fds but not memory. On
modern UNIX you could pass the epoll/kqueue fds around using sendmsg. This
allows you to get everything ready before hitting bind() and listen().
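
The fd-passing part looks roughly like this (a sketch of sendmsg() with
SCM_RIGHTS over a unix-domain socket, not code from the article):

    /* Pass a file descriptor (e.g. an epoll/kqueue fd) to another process
     * over a unix-domain socket using SCM_RIGHTS ancillary data. */
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    int send_fd(int unix_sock, int fd_to_send)
    {
        char dummy = 0, ctrl[CMSG_SPACE(sizeof(int))];
        memset(ctrl, 0, sizeof(ctrl));

        struct iovec  iov = { .iov_base = &dummy, .iov_len = 1 };
        struct msghdr msg = { 0 };
        msg.msg_iov        = &iov;
        msg.msg_iovlen     = 1;
        msg.msg_control    = ctrl;
        msg.msg_controllen = sizeof(ctrl);

        struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
        c->cmsg_level = SOL_SOCKET;
        c->cmsg_type  = SCM_RIGHTS;          /* "this message carries an fd" */
        c->cmsg_len   = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(c), &fd_to_send, sizeof(int));

        return sendmsg(unix_sock, &msg, 0) < 0 ? -1 : 0;
    }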

------
RyanZAG
This is pretty old, right? These days 'Fast Servers' refers to bare-metal
network interface without a kernel which can achieve far higher throughput
than passing it through a kernel/epoll.

~~~
pstrateman
Direct memory mapping from NIC to userspace with a userspace TCP/IP stack can
significantly accelerate ultra low latency processing.

The effect basically disappears when you're looking at webpages that take
10-1000ms to generate.

~~~
RyanZAG
Agreed, as soon as per-request processing becomes a large portion of the work
it will dwarf these types of optimizations.

The article is talking specifically about "_simple one-page performant
servers that easily get into the 100k requests/second territory_", which can
see very large throughput increases with direct NIC access.

~~~
geocar
dash[1] is a _dynamic_ webserver that uses the logic described in the article
(content logic is in Q/KDB).

I have considered writing a userspace TCP HTTP server like the one you
suggest, but it would be a demonstration of how to write a state machine and
not how to better use the facilities that the operating system provides for
writing a server.

[1]: [https://github.com/geocar/dash](https://github.com/geocar/dash)

------
nbevans
This pattern is called "I/O completion ports" or more generally "overlapped
I/O" in Windows and has been there since the very first version of Windows NT.

~~~
bediger4000
I believe it was present in Windows NT from the start because it was present
in DEC's VMS operating system, of which Dave Cutler was also a key developer.
It may date to before VMS for all I know. It's not at all new.

------
vog
Very good pattern description, which demonstrates nicely that when basic
system assumptions change, new patterns need to be embraced.

I'd love to see more patterns like that.

~~~
eva1984
I think this pattern has already been the norm for a while, a.k.a. async
programming.

------
chillaxtian
Good job of explaining their paradigm, but I did not see any explanation as to
why it is better than traditional epoll.

what advantage do you gain by separating accepting connections from handling
request / response?

~~~
huhtenberg
Consider the case of 1000 clients trying to establish an SSL connection all at
once.

------
amelius
> Fast Servers

Fast in what sense? Throughput, or latency?

> One thread per core

Oh, I guess the answer to the previous question is "throughput".

~~~
RyanZAG
Could you explain why one thread per core translates to targeting throughput
and not latency?

Each core can only handle a fixed number of instructions per second regardless
of number of threads assigned to that core. If a core receives 100 requests,
it could handle them all sequentially with 1 thread, or utilize 2-100 threads
to handle them.

When using 1 thread, the first request will be completed within the minimum
amount of time the core is able to do the work. The second request will
complete in the same amount of time after the first.

When using 2 threads, the core will carry out the work in an interleaved
fashion between the 2 threads. Both requests will complete at the same time as
the 2nd request completed in our 1-thread example above.

I'll try to make it clearer - assume a request takes 5 ms:

    
    
      1 Thread
        Request 1: start processing at 0ms, complete at 5ms
        Request 2: start processing at 5ms, complete at 10ms
    
      2 Threads
        Request 1: start processing at 0ms, complete at 10ms
        Request 2: start processing at 0ms, complete at 10ms
    

As can be seen, the average latency is better with only 1 thread - avg 7.5ms
latency when 2 requests arrive at the same time. 2 threads gives us avg 10ms
latency.

Throughput should theoretically remain the same if you disregard context
switching time and cache invalidation. In the presence of both, the single
thread method should outperform in terms of throughput as well.

~~~
amelius
Suggestion: Try with 1 job of X ms, and N jobs of 1ms, and figure out for
different N where the break-even point of X lies. That might give some more
insight.

------
notacoward
"One worker per core" is too simplistic. Yes, as I wrote in a pretty well-
known article a dozen years ago, it's a good starting point. Yes, it can avoid
some context switching. On the other hand, when you have to account for
threads doing background stuff it can be too many, leaving you with context
thrashing between oversubscribed cores. When you account for threads blocking
for reasons beyond your control (e.g. in libraries or due to page faults) it
might be too _few_. The right number of worker threads is usually somewhere
around the number of cores, but depends on your exact situation and can even
vary over time.

Those who do not know the lessons of history...

------
derefr
This is all about achieving a dataflow architecture by exploiting CPU cache
hierarchy, right?

We're basically talking about going _from_ a design where 10k little "tasklet"
state machines are each scheduled onto "scheduler" threads; where during its
turn of the loop, each tasklet might run whatever possibly-"cold" logic it
likes...

...and turning it _into_ a collection of cores that each have a thread for
evaluating a particular state-machine state pinned, where each "tasklet" that
gets scheduled onto a given core will always be doing the same logic, so that
logic can stay "hot" in that core's cache—in other words, so that each core
can function closer to a SIMD model.

Effectively, this is the same thing done in game programming when you lift
everything that's about to do a certain calculation (i.e. is in a certain FSM
state) up into a VRAM matrix and run a GPGPU kernel over it†. This kernel is
your separate "core" processing everything in the same state.

Either way, it adds up to _dataflow architecture_ : the method of eliminating
the overhead of context-switching and scheduling on general-purpose CPUs, by
having specific components (like "I/O coprocessors" on mainframes, or
(de)muxers in backbone switches) for each step of a pipeline, where that
component can "stay hot" by doing exactly and only what it does in a
synchronous manner.

The difference here is that, instead of throwing your own microcontrollers or
ASICs at the problem, you're getting 80% of the same benefit from just using a
regular CPU core but making it avoid executing any non-local jumps: which is
to say, not just eliminating OS-level scheduling, but eliminating any sort of
top-level per-event-loop switch that might jump to an arbitrary point in your
program.

This is way more of a win for CPU programming than just what you'd expect by
subtracting the nominal time an OS context-switch takes. Rewriting your logic
to run as a collection of these "CPU kernels"—effectively, restricting your
code the same way GPGPU kernels are restricted, and then just throwing it onto
a CPU core—keeps any of the kernel's cache-lines from being evicted, and
builds up (and never throws away) an excellent stream of branch-prediction
metadata for the CPU to use.

The _interesting_ thing, to me, is that a compiler, or an interpreter JIT,
could (theoretically) do this "kernel hoisting" for you. As long as there was a
facility in your language to make it clear to the compiler that a particular
function _is_ an FSM state transition-function, then you can code regular
event-loop/actor-modelled code, and the compiler can transform it into a
collection of pinned-core kernels like this as an _optimization_.

The compiler can even take a hybrid approach, where you have some cores doing
classical scheduling for all the "miscellaneous, IO-heavy" tasklets, and the
rest of the cores being special schedulers that will only be passed a tasklet
when it's ready to run in that scheduler's preferred state. With an advanced
scheduler system (e.g. the Erlang VM's), the profiling JIT could even notice
when your runtime workload has changed to now have 10k of the same transition-
function running all the time, and generate and start up a CPU-kernel-thread
for it (temporarily displacing one of its misc-work classical scheduler
threads), descheduling it again if the workload shifts so that it's no longer
a win.

Personally, I've been considering this approach with GPGPU kernels as the
"special" schedulers, rather than CPU cores, but they're effectively
equivalent in architecture, and perhaps in performance as well: while the GPU
is faster because it gets to run your specified kernel in true SIMD parallel,
your (non-NUMA) CPU cores get to pass your tasklets' state around "for free",
which often balances out—ramming data into and out of VRAM is expensive, and
the fact that you're buffering tasklets to run on the GPGPU as a group
potentially introduces a high-latency sync-point for your tasklets. Soft-
realtime guarantees might be more important than throughput.

---

† A fun tangent for a third model: if your GPGPU kernel outputs a separate
dataset for each new FSM state each of the data members was found to
transition to, and you have _other_ GPGPU kernels for each of the other state-
transition functions of your FSM waiting to take those datasets and run them,
then effectively you can make your whole FSM live entirely on the GPU as a
collection of kernel "cores" passing tasklets back and forth, the same way
we're talking about CPU cores above.

While this architecture probably wins over both the entirely-CPU and CPU-
passing-to-GPGPU models for pure-computational workloads (which is, after all,
what GPGPUs are supposed to be for), I imagine it would fall over pretty fast
if you wanted to do much IO.

Does anyone know if GPUs marketed specifically as GPGPUs, like the Tesla
cards, have a means for low-latency access to regular virtual memory from
within the GPGPU kernel? If they did, staying entirely on the GPGPU would
definitely be the dominant strategy. At that point, it might even make sense
to have, for example, a GPGPU Erlang VM, or entirely-in-GPGPU console
emulators (imagine MAME's approach to emulated chips, but with each chip as a
GPGPU kernel.)

If you can get that, then effectively what you've got at that point is less a
GPU, and more a meta-FPGA co-processor with "elastic allocation" of gate-
arrays to your VHDL files. System architecture would likely change _a lot_ if
we ever got _that_ in regular desktop PCs.

~~~
pheo
Woah.

I think that many "servers" today aren't going to benefit, though, from the
level of optimization that you're talking about. Network bandwidth is at least
an order of magnitude slower than memory and PCI throughput. Only for the most
heavyweight computational tasks (i.e. gaming, finance, scientific computing,
etc.) is this kind of multiplexing optimization very helpful. If you need to
get that kind of performance, you would userspace-map off the NIC.

With how fast modern processors (CPUs and GPUs) are, what's much more important
nowadays is not stalling on the flow from the NIC. The solution of the
original author seems to me like a very simple fix for the old naive paradigm
of state machine switching off from a single thread.

I think you might be (impressively) too close to the metal, and guilty of
premature optimization. For example, a simple database application would
benefit much more from keeping the slow I/O bottlenecks flowing than from the
compute multiplexing.

~~~
derefr
Ah, true; this whole thing makes very little sense for the general case. Most
workloads aren't like this.

The real goal here, in my mind, was to allow people in those particular high-
performance domains (gaming especially, though the other ones would probably
benefit too) to code at a higher abstraction level.

In my ideal world, a physics engine, for example, would be implemented in code
as actors sending messages—with one actor for each physical (e.g. collide-
able) object. Not only is that really expensive; the naive approach wouldn't
even work! (You need "tick" events spammed to every physics-actor once per
frame, which is just absolutely ridiculous.) But then the compiler
transparently takes that and _spits out_ something that isn't actor-modelled
at all, but rather is a modern-day physics engine doing GPGPU SIMD to arrays
of quaternions.

Or, in short: I want to code games in Elixir, without an impedance mismatch in
the places where one of my actors' components turns out to need groupwise
synchronous shared-memory evaluation rather than piecewise asynchronous
reduction-scheduled evaluation. (That sounds like something that should be
followed up with "and a pony!", but—given that the Erlang BEAM VM already has
the infrastructure of "dirty" schedulers for CPU-bound tasks, and an LLVM IR
step in its JIT which could be shunted over to a PTX-ISA target—this actually
shouldn't even be that hard an extension to make.)

------
jbarzyc
check out [http://dpdk.org](http://dpdk.org). I just delivered a session on
DPDK and SR-IOV. There's a whole set of libraries, classifications, and
frameworks to tune/tweak linux systems on x86 in user space.

------
deathanatos
It's hard to say from the article, but is the first example single threaded?
If so, then yes, adding more threads is of course going to speed it up.

However, they don't each need their own queue; epoll supports being polled by
>1 thread. In such a setup _any_ thread available can handle any request; in
the author's setup, you're going to need to make sure any particular thread
doesn't get too bogged down if the requests are not equal. (That pick function
is important.) I'd be more curious how those two compared. (The author's is
certainly slightly easier to write, I think.)
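
Something like this, roughly (a sketch; handle_event() is a placeholder, and in
practice you would likely register fds with EPOLLONESHOT and re-arm them after
each event so two threads cannot pick up the same connection at once):

    /* One shared epoll fd; N worker threads all call epoll_wait() on it,
     * so whichever thread is free picks up the next ready event. */
    #include <sys/epoll.h>

    void handle_event(int fd);           /* hypothetical request handler */

    void *worker(void *arg)
    {
        int epfd = *(int *)arg;          /* the same epfd in every thread */
        struct epoll_event ev;
        for (;;) {
            if (epoll_wait(epfd, &ev, 1, -1) == 1)
                handle_event(ev.data.fd);
        }
    }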

~~~
geocar
If you can get into cache and stay there, you can enjoy 100x or even 1000x
speedup.

Having your CPU do a lot of different things is a good way to get out of cache
:)

------
JoshTriplett
I'd love to see benchmark numbers comparing this approach to others.

~~~
geocar
My dash webserver uses this trick, and I've published[1] benchmarks comparing
it to KDB's built-in webserver and NodeJS (which uses libev)

[1]:
[https://github.com/geocar/dash#performance](https://github.com/geocar/dash#performance)

------
listic
Why aren't servers using the proposed pattern already?

~~~
acidx
My web server uses the same pattern. There's an article explaining it in
detail here:
[http://tia.mat.br/posts/2014/10/06/life_of_a_http_request.ht...](http://tia.mat.br/posts/2014/10/06/life_of_a_http_request.html)

------
cbsmith
Has everyone forgotten the c10k site, which covers all this?

~~~
geocar
Are you sure this technique is documented on C10k[1]?

I think most people use a single epollfd/kqueue instead of one per thread.

[1]: [http://www.kegel.com/c10k.html](http://www.kegel.com/c10k.html)

~~~
cbsmith
The listed strategies are:

1. Serve many clients with each thread, and use nonblocking I/O and
level-triggered readiness notification.

2. Serve many clients with each thread, and use nonblocking I/O and readiness
change notification.

3. Serve many clients with each thread, and use asynchronous I/O and
completion notification.

4. Serve one client with each server thread.

5. Build the server code into the kernel.

6. Bring the TCP stack into userspace.

You may notice a lot of them are "with each thread".

~~~
geocar
The trick isn't "use a bunch of threads" but to have each CPU do a separate
task (N1 threads for accept, N2 threads for read, N3 threads for write, N4
threads for close, etc.) and to use a separate keventfd/epollfd for each
thread.

The tasks can still use blocking calls (or nonblocking, as they like), which
makes it different from strategy 1 or 2. It shards by task, which is different
from 3 or 4. Completion-notification is not explicit. It is clearly different
from 5 or 6.

I think you should read it carefully. Popular webservers like nginx and
varnish do not implement this method, so I do not believe it is common, and
there are a few neat features of this method:

1. Tasks will usually fit into cache. If the kernel hotpaths to IP they
usually won't fall out of cache either. This can produce 100x or even 1000x
speed improvements for complex tasks.

2. It may be easier to convert an existing program to this design than to
replace everything with nonblocking IO, or to move into kernel space, or to
attach to a userland TCP stack.

These things are important if you want to write faster servers.
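
In rough outline, the shape is something like this (a sketch only, not the
actual code; read_request() and the epoll fds are stand-ins):

    /* One epoll fd per task, each serviced by its own thread(s): the accept
     * stage adds new fds to the read stage's epoll, the read stage hands
     * them on to the write stage, and so on. */
    #include <sys/epoll.h>

    extern int read_epfd, write_epfd;     /* one epoll fd per task */
    void read_request(int fd);            /* hypothetical per-task work */

    void *read_stage(void *unused)
    {
        (void)unused;
        struct epoll_event e[64];
        for (;;) {
            int n = epoll_wait(read_epfd, e, 64, -1);
            for (int i = 0; i < n; i++) {
                int fd = e[i].data.fd;
                read_request(fd);
                /* hand the fd on to the next stage's epoll */
                epoll_ctl(read_epfd, EPOLL_CTL_DEL, fd, NULL);
                struct epoll_event ev = { .events = EPOLLOUT, .data.fd = fd };
                epoll_ctl(write_epfd, EPOLL_CTL_ADD, fd, &ev);
            }
        }
        return NULL;
    }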

~~~
cbsmith
What if you have to map accept, read, write & close (so four types of tasks)
to six cores? ;-)

Yes, nginx & varnish have different approaches. Their approaches tend to work
better for their type of application.

One thread or process per core, with its own event loop, is indeed a very
standard pattern. The process-per-core style approaches tend to shard per fd
(e.g., nginx), while the thread-per-core style tend to either shard by IO
request or don't shard at all (each thread has an independent poll of all
outstanding IOs). Particularly if you use a continuations style paradigm, it
makes it really easy to convert over existing code to the design.

In practice, you will typically have better cache locality with the fd than
your position in the state machine (and the model you are describing isn't
quite where you are in the state machine, but you can adjust it to fit). If
your state machine is simple enough, your approach might get better
_instruction_ cache hit rate, but you are basically guaranteeing that _data_
has to move between the processor caches in very rapid succession. In
practice, not only are you forcing the fd's to be passed between processor
caches, but there tends to be data locality with that fd. While it is easy to
have the instruction cache replicated across all processors, the data mutates
quite rapidly. Again, this stuff is covered in the discussions on the c10k
site.

------
jbarzyc
check out dpdk.org

