
84% of a single-threaded 1KB write in Redis is spent in the kernel - antics
http://blog.nullspace.io/kernel-latency.html
======
jandrewrogers
This is why high-performance commercial databases do most or all of their I/O
management and scheduling in userspace. It is not a new idea; it is much more
efficient. However, that means you need to reimplement most of what the kernel
does in an optimal way.

This is relatively common for closed source software but you almost never see
these types of userspace I/O designs in open source, which means OSS designs
are often leaving integer factors worth of efficiency and performance on the
table. For some use cases, companies are very successful selling into this
efficiency and performance gap with closed source.

Part of the lack of open source is that these designs are not portable, due to
OS dependencies and sometimes hardware dependencies if a design is hardcore. I
think this is a partial cop-out; a Linux-only database engine would address the
vast majority of real-world deployments. A bigger reason is that the design
and implementation of these kinds of userspace kernels is a very high-skill,
low-level art that, frankly, is way outside the expertise of most open
source software contributors. For databases in particular, more often than not
even the basic design elements of the software are naively done (e.g. MongoDB),
and that is lower-hanging fruit.

~~~
gaius
Indeed. Oracle basically uses the OS just to bootstrap itself and hand over
the initial chunk of memory, then it manages everything internally. It even
speaks NFS itself rather than going through the VFS: it opens a socket from
userland code to the NetApp and reads and writes the raw protocol.

------
antirez
Totally... this is why pipelining makes Redis 10x faster: fewer syscalls.

Basically, to make Redis much faster we need to work on three different related
things:

1) Less kernel friction.

2) Threaded I/O: this is the part worth threading, with a global lock around
query execution so you don't go crazy with concurrency and complex data
structures. Memcached did it right.

3) Pipelining: better client support for pipelining, so that it's easy to tell
the client in what order I need my replies, and unrelated replies can be glued
together easily (a minimal sketch of the syscall savings follows this list).
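
Point 3 is easy to see at the protocol level. Here is a minimal C sketch
(assuming a Redis server listening on 127.0.0.1:6379; hypothetical demo code,
not Redis's client library) that sends three inline PING commands with a
single write(2) and collects all three replies with a single read(2). The work
per command barely changes, but the syscall count drops from six to two:

    /* Pipelining sketch: batch three Redis commands into one write(2)
     * and read all the replies back with one read(2). */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>
    
    int main(void) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) { perror("socket"); return 1; }
    
        struct sockaddr_in addr = {0};
        addr.sin_family = AF_INET;
        addr.sin_port = htons(6379);              /* assumes Redis on localhost */
        inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);
        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0) {
            perror("connect");
            return 1;
        }
    
        /* Three inline commands, one syscall. An unpipelined client would
         * pay a write() plus a read() per command instead. */
        const char *cmds = "PING\r\nPING\r\nPING\r\n";
        write(fd, cmds, strlen(cmds));
    
        char buf[64];
        ssize_t n = read(fd, buf, sizeof(buf) - 1); /* "+PONG\r\n" x3, usually one packet */
        if (n > 0) {
            buf[n] = '\0';
            printf("%s", buf);
        }
        close(fd);
        return 0;
    }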

~~~
jdub
Point (2) makes me think of LMDB. Have you looked into it much? I wonder if it
would be an interesting storage substrate for a threaded Redis.

~~~
antirez
Point "2" better applies when the storage substrate is memory and operations
are O(1) or logarithmic, because in that case, the time to serve the query is
comparable small compared to the time needed to process the reply and send
back the response. With an on-disk storage, I would go for a classic on-disk
database setup where different queries are served by different threads.

~~~
hyc_symas
LMDB beats all other "classic" on-disk databases for read performance. It
happens to beat all other in-memory systems for read performance too, since
its reads require no locks.

[http://symas.com/mdb/memcache/](http://symas.com/mdb/memcache/)
[http://symas.com/mdb/inmem/](http://symas.com/mdb/inmem/)
[http://symas.com/mdb/ondisk/](http://symas.com/mdb/ondisk/)

------
yuriks
The concept of Redis has always baffled me. A hash table is a very fast data
structure. As soon as you put it in a dedicated server, the cost of the
actual lookup is instantly eclipsed by the need to parse a text protocol and
do network I/O to communicate with the client.

So I'd be willing to say that the problem here isn't that the kernel stack is
slow per se, but that the workload is so small that it makes the overhead look
ridiculous, when it would be very much acceptable if your server did more
actual work.

~~~
falcolas
The value of Redis is never going to be found with a single server. It's going
to be found when you use Redis to synchronize the state of multiple servers.

~~~
moe
Redis can actually be very useful on a single server, too, as a fast, robust,
convenient, potentially shared datastore.

Performance is usually a secondary concern in these use-cases.

~~~
bherms
Yep, for data stores (caches, etc.), Redis is awesome and super fast, though
not as time-tested as memcached.

And all of the Redis set/list operations are super valuable. You just need to
be careful once you start relying on Redis at scale for things you take for
granted when you start playing around with it. For example: ZUNIONSTORE on
lots of large sorted sets is "O(N)+O(M log(M)) with N being the sum of the
sizes of the input sorted sets, and M being the number of elements in the
resulting sorted set." When you start off using it, it's awesome and super
fast, but before you know it, the blocking, single-threaded architecture will
crash and burn as your data scales up (see the sketch below). Luckily we have
clustering now :)
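
To make the complexity concrete, here is a hypothetical hiredis call (the key
names dest, tag:a, and tag:b are made up). Because Redis executes commands on
a single thread, every element of every input set is scanned and the result
sorted before any other client gets served:

    /* Hypothetical sketch using hiredis: one blocking ZUNIONSTORE. The
     * server scans N = |tag:a| + |tag:b| elements and builds the M-element
     * result on its single thread; every other client waits meanwhile. */
    #include <hiredis/hiredis.h>
    #include <stdio.h>
    
    int main(void) {
        redisContext *c = redisConnect("127.0.0.1", 6379);
        if (c == NULL || c->err) {
            fprintf(stderr, "connect failed\n");
            return 1;
        }
        /* O(N) + O(M log M), all inside this one call. */
        redisReply *r = redisCommand(c, "ZUNIONSTORE dest 2 tag:a tag:b");
        if (r != NULL && r->type == REDIS_REPLY_INTEGER) {
            printf("result cardinality: %lld\n", r->integer);
        }
        if (r != NULL) freeReplyObject(r);
        redisFree(c);
        return 0;
    }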

------
mattgodbolt
This is why techniques like kernel bypass are used in high-throughput or low-
latency systems like finance. Things like
[http://www.openonload.org/](http://www.openonload.org/)

This does tie you to a specific network vendor, but I can see the argument
for moving more commodity networking hardware in this direction too.

~~~
falcolas
Personally, I see a lot of value in kernel modules which let you bypass the
kernel entirely for simple I/O. It requires more work on the application's
end, and more libraries to support the disparate hardware, but it would help
with many such workloads.

------
caf
This article from last month is relevant:

 _Improving Linux networking performance_

[https://lwn.net/Articles/629155/](https://lwn.net/Articles/629155/)

HN discussion:
[https://news.ycombinator.com/item?id=8931431](https://news.ycombinator.com/item?id=8931431)

------
JoeAltmaier
InfiniBand was conceived around this issue. The additional overhead of kernel
I/O includes the user/kernel space switch, copying between user and kernel
buffers, and waiting/polling for interrupts. That stuff doesn't get any
smaller as networks get faster, so today we're at the point where it dominates
the actual hardware I/O time for many network devices.

There's been periodic interest in 'virtual hardware', where the hardware
presents multiple interfaces to different users. This way the driver can run
in user mode, since there's no need to control and share the hardware
registers in the kernel.

------
wallflower
In the talk below, Rian Hunter, Dropbox's third engineer, discusses the
latency incurred by OpenSSL when they were designing and implementing their
extremely high-performance notification servers. Also, the pitch he gives at
the end for joining Dropbox is one of the most genuine and heartfelt I've
ever seen.

[http://www.youtube.com/watch?v=FBRIeoEr8GU](http://www.youtube.com/watch?v=FBRIeoEr8GU)

------
nteon
To be clear, 80% of kernel-time in that 1KB write is spent in fsync(), _not_
in the network stack. Network overhead is roughly similar between read & write
requests. What Arrakis seems to be able to do is avoid the overhead of write &
sync, presumably because it doesn't go through the VFS + filesystem + block IO
code paths.
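
The write-versus-fsync split is easy to reproduce. Here is a minimal sketch
(my own illustration, not the paper's instrumentation; absolute numbers will
vary with filesystem and disk) that times a 1KB write(2) and then the fsync(2)
that follows it:

    /* Time a 1KB write(2) and the fsync(2) after it. The write usually just
     * lands in the page cache; the fsync forces it to stable storage and
     * tends to dominate, matching the paper's breakdown. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>
    
    static long elapsed_us(struct timespec a, struct timespec b) {
        return (b.tv_sec - a.tv_sec) * 1000000L + (b.tv_nsec - a.tv_nsec) / 1000L;
    }
    
    int main(void) {
        char buf[1024];
        memset(buf, 'x', sizeof(buf));
        int fd = open("bench.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }
    
        struct timespec t0, t1, t2;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        write(fd, buf, sizeof(buf));   /* page cache only: cheap */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        fsync(fd);                     /* wait for stable storage: expensive */
        clock_gettime(CLOCK_MONOTONIC, &t2);
    
        printf("write: %ld us, fsync: %ld us\n",
               elapsed_us(t0, t1), elapsed_us(t1, t2));
        close(fd);
        return 0;
    }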

~~~
hurin
>To be clear, 80% of kernel-time in that 1KB write is spent in fsync(), _not_
in the network stack

Are you sure?

 _Of the total 3.36 μs (see Table 1) spent processing each packet in Linux,
nearly 70% is spent in the network stack_

~~~
nteon
Table 1 is looking specifically at getting a chunk of data off the wire and
into the user's code. Look at Table 2 for a comparison of Redis reads and
writes: the average time for a write is 163 μs, of which 137 μs is spent in
fsync(2).

~~~
hurin
Right, but you can't compare Table 2 data to the _network stack_; Table 2 is
only timing the Redis operations, which, as stated, take up significantly
less time than the time spent in the _network stack_.

------
noir-york
Where exactly in the networking stack is the time being spent?

If it is in the IP/TCP layers, then moving that to user space does not, by
itself, necessarily reduce latency; it merely shifts it elsewhere. If the
latency is due to kernel/userland memory copies, then that is a different
matter.

~~~
antics
The point of moving it to user space is that you can offload a lot of the jobs
of the TCP/IP stack directly to hardware. In some cases this dramatically
speeds up your I/O.

Consider, for example, that the dominant costs are things like demultiplexing
and security checks. If you choose to implement multiplexing with virtual
network cards, then you get true zero-copy multiplexing, which is much faster
than the software equivalent. And many of the security checks can be
eliminated by using some combination of packet filters and logical disks.
(The security, BTW, seems to be one big difference from RDMA, which might be
an alternative, but I'm not really an expert.)

Some things _can't_ be offloaded to the hardware, like naming and access
control. But that's fine.

(NB, I'm not arguing _for_ this paper's position necessarily, I just thought
it was interesting, and the motivation was good enough to start me thinking
about how I might get around the kernel.)

~~~
rsanders
The life of a data frame has become pretty complicated. Along the way it may
pass through one or more layers of virtualization, one or more layers of
network-specific mangling like iptables doing filtering and NAT, and a combo
of the two in OS-based virtualized networks connecting VMs and containers both
intra- and inter-host.

Pushing more of the stack into hardware is probably a good idea for single-
tenant datacenters that can deploy a lot of e.g. Redis appliances, but those
of us just renting capacity in the cloud are going to suffer from Amdahl's Law
if you can only accelerate the part of the system adjacent to real hardware
NICs.

------
wmf
While we're looking at OSDI proceedings, check out the IX paper:
[https://www.usenix.org/conference/osdi14/technical-sessions/presentation/belay](https://www.usenix.org/conference/osdi14/technical-sessions/presentation/belay)
It shows how the kernel can provide high throughput and low latency without
giving up sharing and protection.

------
wyager
When testing [https://github.com/wyager/Neks](https://github.com/wyager/Neks)
, pipelining made the server something like 30x faster for the exact same
reason. Syscalls are tremendously expensive compared to the work of processing
a request.

------
bwross
Solutions that pull the TCP stack out of the kernel perform so much better
because they're bypassing all the internal bureaucracy the kernel otherwise
imposes to make it as easy as possible for userspace applications to use the
network without stepping on other applications' toes.

The kernel socket API is designed so that programs have to do as little
thinking as possible to get their own personal slice of the shared and noisy
network. It provides an easy abstraction, and that requires the kernel to do a
lot of messy stuff for you:

- When you're using TCP sockets, the kernel makes copies of everything your
application writes and holds them in a buffer, just in case it needs to resend
them when the other side doesn't acknowledge receipt. If the socket's buffer
fills up, your application blocks on I/O until some space is freed.

- It holds ports open in a lingering state long after they're closed, just in
case it needs to re-transmit the last bytes. This can be disabled (see the
sketch after this list), but it's on by default.

- It takes care of all the congestion control for you, but it's tuned for the
general case, and as a result there are a lot of edge cases which perform very
badly for the problem they're trying to solve. Redis is probably one such edge
case.
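
For the first two items, the kernel does expose per-socket dials. A minimal
sketch of using them via setsockopt(2), which tunes the behavior rather than
removing it:

    /* Sketch: the knobs the kernel exposes for the first two items above. */
    #include <netinet/in.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>
    
    int main(void) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) { perror("socket"); return 1; }
    
        /* Item 1: the kernel's retransmit buffer can be resized, not removed. */
        int sndbuf = 1 << 20;      /* request a 1 MiB send buffer */
        setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf));
    
        /* Item 2: an abortive close (RST) skips the lingering TIME_WAIT
         * state. This is the "can be disabled" knob, caveats included. */
        struct linger lg = { .l_onoff = 1, .l_linger = 0 };
        setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof(lg));
    
        close(fd);
        return 0;
    }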

Of course, all of this is fine and desirable for general applications, but it
ends up being problematic if you're trying to solve a problem where
performance is the chief concern.

It's tempting to say the problem is that the kernel has to do way too much to
provide that easy abstraction, but really the problem is that the kernel
provides no way around it. You pretty much have the option of using its
cushy stream abstraction at the cost of performance, or you use a userspace
TCP stack on raw sockets, which requires running as root and disabling TCP in
the kernel (otherwise the kernel stomps all over your TCP negotiations[1]).

There are some other transport layer protocols (SCTP, DCCP, etc.), as well as
application layer protocols built on UDP, that remove some of the abstractions
TCP provides and as a result require less in-kernel bureaucracy, but those
solutions don't seem to be very popular or well-supported.

It would be nice if the kernel would provide some lower level system calls
that could be selectively used to move parts of TCP into the application
(e.g., retaining copies of data in case of re-transmission). Alas, I don't
think there's much push for that, because a) it's hard, and b) the current
situation is fine for 99% of network applications.

[1] [http://jvns.ca/blog/2014/08/12/what-happens-if-you-write-a-tcp-stack-in-python/](http://jvns.ca/blog/2014/08/12/what-happens-if-you-write-a-tcp-stack-in-python/)

------
riffraff
Maybe relevant:
[http://info.iet.unipi.it/~luigi/netmap/](http://info.iet.unipi.it/~luigi/netmap/),
a "framework for high speed packet I/O implemented as a kernel module for
FreeBSD and Linux".

------
BrianEatWorld
I am not very familiar with OSes and things at the kernel level. Can anyone
answer the question in the comments of the post?

"How did you measure the time spent in each section (HW, kernel, app)? How did
you get such granularity?"

~~~
wmf
From the Arrakis paper: "To analyze the sources of overhead, we record
timestamps at various stages of kernel and user-space processing." You can
probably implement this with something like perf dynamic tracing:
[http://www.brendangregg.com/perf.html#DynamicTracing](http://www.brendangregg.com/perf.html#DynamicTracing)

------
easytiger
I wonder if anyone has tried it with Solarflare?

~~~
neomantra
Solarflare has a whitepaper on accelerating memcached:
[http://10gbe.blogspot.com/2014/12/memcached-3x-faster-than-intel.html](http://10gbe.blogspot.com/2014/12/memcached-3x-faster-than-intel.html)

I just whipped this up in 5 minutes and didn't do much tuning (e.g. no
isolcpus or interrupt changes), but here's a single-client 1024-byte SET
redis-benchmark running against localhost, with and without TCP Loopback
Acceleration... redis 2.8.4 on a dual E5-2630 @ 2.30GHz; the card is an
SFN5122F, but this is all loopback. I'm not claiming anything, just doing it
because somebody pondered...

    
    
    * plain jane
      /usr/bin/redis-server
      redis-benchmark -t set -q -n 1000000 -d 1024 -c 1
      SET: 21258.96 requests per second
    
    * unaccelerated server, unaccelerated client
      numactl --physcpubind 1,3,5 --preferred 1 /usr/bin/redis-server
      numactl --physcpubind=7,9 --preferred 1 redis-benchmark -t set -q -n 100000 -d 1024 -c 1
      SET: 14293.88 requests per second
    
    * TCP loopback accelerated server, unaccelerated client
      EF_NAME=hn EF_TCP_SERVER_LOOPBACK=2 EF_TCP_CLIENT_LOOPBACK=2 onload -p latency numactl --physcpubind 1,3,5 --preferred 1 /usr/bin/redis-server
      numactl --physcpubind=7,9 --preferred 1 redis-benchmark -t set -q -n 100000 -d 1024 -c 1
      SET: 25967.28 requests per second
    
    * TCP loopback accelerated server, accelerated client
      EF_NAME=hn EF_TCP_SERVER_LOOPBACK=2 EF_TCP_CLIENT_LOOPBACK=2 onload -p latency numactl --physcpubind 1,3,5 --preferred 1 /usr/bin/redis-server
      EF_NAME=hn onload -p latency numactl --physcpubind=7 --preferred 1 redis-benchmark -t set -q -n 1000000 -d 1024 -c 1
      oo:redis-benchmark[13454]: Sharing OpenOnload 201405-u1 Copyright 2006-2012 Solarflare Communications, 2002-2005 Level 5 Networks [4,hn]
      SET: 96098.41 requests per second
    

Edit: formatting fixes

~~~
easytiger
Not bad considering it's "for free". Thanks a lot for the post.

