
Evio – Fast event-loop networking for Go
https://github.com/tidwall/evio
======
tidwall
This project is not intended to be a general purpose replacement for the
standard Go net package or goroutines. It's for building specialized services
such as key value stores, L7 proxies, static websites, etc.

You would not want to use this framework if you need to handle long-running
requests (milliseconds or more). For example, a web api that needs to connect
to a mongo database, authenticate, and respond; just use the Go net/http
package instead.

There are many popular event-loop-based applications in the wild, such as
Nginx, Haproxy, Redis, and Memcached. All of these are single-threaded, very
fast, and written in C.

The reason I wrote this framework is so I can build certain network services
that perform like the C apps above, but I also want to continue to work in Go.

~~~
Acconut
> It's for building specialized services such as key value stores, L7 proxies,
> static websites, etc.

First of all, thank you for publishing this project. It's very interesting in
my opinion since I never thought about the benefits of an event loop. Would
you mind explaining briefly why an event loop is a better fit for these
applications? Is it due to performance and efficiency?

~~~
jerf
I'd suggest that's not the right way to look at it. To a first approximation,
"everything" is using an event loop nowadays, in that everything is using the
same fundamental primitives to handle and dispatch events. In particular, this
includes the Go runtime; run "strace" on a Go network program and you'll see
these same calls pop up in the strace.

What this does instead is give a Go program _direct_ access to the event loop.
The benefit is that it bypasses all the machinery Go wraps around the internal
event-loop calls: the machinery that lets it offer you a thread-like
interface, integrate with the channel and concurrency primitives, maintain
your position in the call stack between events, and so on. The penalty is...
exactly the same thing: you lose all the nice stuff the Go runtime implements
on top, the thread-like interface and the rest, and are back to a lower-level
interface that offers fewer services.

The performance of the Go runtime is "pretty good", especially by scripting
language standards, but if you have sufficiently high performance
requirements, you will not want to pay the overhead. The pathological case for
all of these nice high-level abstractions is a server that handles a ton of
network traffic of some sort and needs to do a little something to every
request, maybe just a couple dozen _cycles'_ worth of something, at which
point paying what could be a few hundred cycles for all this runtime nice
stuff that you're not using becomes a significant performance drain. Most
people are not doing things where they can service a network request in a few
dozen cycles, and the longer it takes to service a single request the more
sense it makes to have a nice runtime layer providing you useful services, as
its overhead shrinks as a percentage of your program's CPU time. For the most
part, if you are so much as hitting a database over a network connection, even
a local one, in your request, you've already greatly exceeded the amount of
time you're paying to the runtime, for instance.

It does seem to me that a lot of people are a bit bedazzled by the top-level
stuff that various languages offer, and forget that under the hood, everyone's
using the event-based interfaces. What differs between Node and Twisted and
all of the dozens or hundreds of other viable wrappers over these calls is the
services automatically provided, not whether or not they are "event loops". Go
is an event loop at the kernel level. Node is an event loop at the kernel
level. Erlang is an event loop at the kernel level. They aren't all the same,
but "event-based" vs. "not event-based" is not the distinction; it's a
question of what they lay on top of the underlying event loop, not whether
they use it. Even pure OS threads are, ultimately, event loops under the hood,
just in the kernel rather than the user space.

~~~
pcwalton
> It does seem to me that a lot of people are a bit bedazzled by the top-level
> stuff that various languages offer, and forget that under the hood,
> everyone's using the event-based interfaces.

Yup. It's all very similar under the hood.

The most important difference between I/O models is whether the paradigm
involves _explicit_ vs. _implicit_ management of the event loop. Callback
models like Node, async/await style models like those of C#, and low-level
primitives like IOCP, epoll, and kqueue fall into the former category.
Go/Erlang, plain old threads, and even Unix processes fall into the latter
category. There are advantages and disadvantages of each model.

Within each of these broad categories, the distinctions are, IMHO, much less
interesting, and they're often made out to be more significant than they
actually are. In particular, the distinction between runtimes like Go and
regular OS pthreads is often made out to be more important than it really is,
when the difference ultimately boils down to the CPU privilege level that
thread management runs at.

~~~
tychver
Patrick, on the 2.6+ Linux kernels, is there a significant difference between
threads and processes? It seems like both threads and processes are created
via clone and the only difference is memory access?

I often hear "context switching between threads is cheaper" but pthreads still
have their own PID and everything, so is this really the case?

Is there really much advantage to pthreads over the way PostgreSQL does things
with efficient CoW sharing between processes for the binary?

~~~
cormacrelf
The significance of the distinction depends entirely on the use case.

Yes, they’re both created with clone, but with different levels of sharing. A
pthread will share the virtual address space of its parent, which makes shared
memory simple to implement; use the same pointer and you’re done. CoW is not
“sharing” really, because you can’t communicate over it, it just saves some
creation overhead.

With CoW, technically nothing gets copied initially, but as soon as the new
process starts executing, it’s going to start copying the stack frame and any
other regions it’s using. With a pthread you can be certain it will just copy
the stack.

Context switches are usually cheaper when you don’t need to throw out the old
virtual address space (and invalidate the Translation Lookaside Buffer).
Pthreads share virtual address space, so there is no need to flush the TLB.

In a use case like Postgres, you don’t necessarily need to optimise for
context switches. If you have a lot of concurrent connections, each of which
has one process, then you’ll only hit limits with context switching overhead
if very few of those connections are fighting over any locks or spending much
time in IO at all. This is atypical, so usually those other factors hit you
first.

~~~
anarazel
> The significance of the distinction depends entirely on the use case.

Indeed.

> Context switches are usually cheaper when you don’t need to throw out the
> old virtual address space (and invalidate the Translation Lookaside Buffer).
> Pthreads share virtual address space, so there is no need to flush the TLB.

I believe the cost of that has been reduced somewhat due to tagged TLBs on
modern hardware.

> In a use case like Postgres, you don’t necessarily need to optimise for
> context switches. If you have a lot of concurrent connections, each of which
> has one process, then you’ll only hit limits with context switching overhead
> if very few of those connections are fighting over any locks or spending
> much time in IO at all. This is atypical, so usually those other factors hit
> you first.

Yea. There's a number of limitations in postgres due to the process model, but
they're imo not TLB / context switch related. The biggest issue is that
dynamically sharing memory between processes is harder, because there's no
guarantee that post-fork memory allocations can portably be placed at the same
virtual addresses in every process. That in turn makes it more complicated to
have shared datastructures, because you need to use relative pointers and
such. That's not a problem for the main buffer pool etc, which is allocated
when postgres is started, but it is problematic e.g. for memory shared between
multiple processes working on the same query (say the memory for a shared
hashtable in a hashjoin).

~~~
cormacrelf
> you need to use relative pointers and such

I don't think this qualifies as a performance overhead, though, beyond the odd
isub.

~~~
anarazel
> > you need to use relative pointers and such

> I don't think this qualifies as a performance overhead, though, beyond the
> odd isub.

It ends up as one. The reason is less the additional instruction(s) and more
that you actually need to ferry around additional data. In common scenarios
you'll end up with a number of mappings shared between processes, so you can't
just assume a single base address per process. Instead you have to associate
the specific mapping with each relative pointer, and that does add overhead,
both programming-wise and in runtime efficiency.
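
As a toy illustration of the relative-pointer pattern (hypothetical names, not
Postgres code): nodes in a shared segment are linked by byte offsets from the
segment base rather than raw pointers, so the structure is valid at whatever
address each process maps it, but every traversal pays the extra
base-plus-offset arithmetic:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// arena stands in for a shared mapping: nodes are addressed by byte
// offsets from the arena base, so the structure stays valid no matter
// where each process happens to map the segment.
type arena []byte

// put writes a node {next offset, value} at offset off.
// Offset 0 is reserved to mean "nil".
func (a arena) put(off, next, val uint32) {
	binary.LittleEndian.PutUint32(a[off:], next)
	binary.LittleEndian.PutUint32(a[off+4:], val)
}

// sum walks the offset-linked list starting at off; every hop pays the
// base+offset arithmetic that a real pointer dereference would not need.
func (a arena) sum(off uint32) (s uint32) {
	for off != 0 {
		s += binary.LittleEndian.Uint32(a[off+4:])
		off = binary.LittleEndian.Uint32(a[off:])
	}
	return s
}

func main() {
	a := make(arena, 64)
	// Build the list 10 -> 20 -> 30 out of offset-linked nodes.
	a.put(8, 16, 10)
	a.put(16, 24, 20)
	a.put(24, 0, 30)
	fmt.Println(a.sum(8)) // 60
}
```

With several shared mappings in play, each relative pointer also has to carry
(or imply) which mapping it is relative to, which is the extra ferried data
mentioned above.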

------
crawshaw
One of my favorite things about Go is that it cuts through the "threads vs.
events" debate by offering thread-style programming with event-style scaling
using what you might call green threads (compiler-assisted cooperative
multitasking with the programming semantics of preemptive multitasking).

That is, I can write simple blocking code, and my server still scales.

Using event loop programming in Go would take away one of my favorite things
about the language, so I won't be using this. However I do appreciate the
work, as it makes an excellent bug report against the Go runtime. It gives us
a standard to hold the standard library's net package to.

~~~
pcwalton
It doesn't really "cut through" the debate any more than any other
implementation of threads does. The only difference between Go and plain old
one-thread-per-connection is that regular threads run in the kernel, while Go
threads run in userspace. That's not a _semantic_ difference, only an
implementation detail (a large detail, to be clear, but still an
implementation detail).

There were historical implementations of pthreads, such as NGPT, that used
precisely the same model as Go, and they were abandoned because the advantages
over 1:1 were not sufficient to justify the complexity.

~~~
kjksf
What you call a "Go thread" has a precise name (goroutine) and running in
userspace is hardly the only difference between a goroutine and a kernel
thread.

Creating and destroying kernel threads is significantly more expensive.

A kernel thread has a fixed-size stack, and if you go beyond it, you crash. Which means
that you have to create kernel threads with worst-case-scenario stack sizes
(and pray that you got it right).

A goroutine has an expandable stack and starts with a very small one (which is
partly why creating it is faster; setting up kernel page mappings to create a
contiguous space for a large stack is not free).
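
A small sketch of what those cheap stacks buy you (the function name is mine,
just for illustration): spawning a hundred thousand goroutines and joining
them is routine, where the same number of kernel threads would be far more
expensive to create:

```go
package main

import (
	"fmt"
	"sync"
)

// spawnMany launches n goroutines and waits for all of them. Each starts
// on a small growable stack, so even a very large n is cheap; it returns
// how many of them actually ran.
func spawnMany(n int) int {
	done := make([]bool, n)
	var wg sync.WaitGroup
	wg.Add(n)
	for i := 0; i < n; i++ {
		go func(i int) {
			done[i] = true // each goroutine touches its own slot
			wg.Done()
		}(i)
	}
	wg.Wait()
	count := 0
	for _, d := range done {
		if d {
			count++
		}
	}
	return count
}

func main() {
	fmt.Println(spawnMany(100000)) // 100000
}
```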

Finally, goroutine scheduling is different than kernel thread scheduling: a
blocked goroutine consumes no CPU cycles.

On a 4-core CPU there is no point in running more than 4 busy kernel threads,
but the kernel scheduler has to give each thread a chance to run. The more
threads you have, the more time the kernel spends on the pointless work of
ping-ponging between them. That hurts throughput, especially when we're
talking about high-load servers (serving thousands or even millions of
concurrent connections).

The Go runtime only creates about as many threads as there are CPUs and
avoids this waste.

That's why high-perf servers (like nginx) don't just use kernel thread per
connection and go through considerable complexity of writing event driven
code.

Go gives you the straightforward programming model of thread-per-connection
with scalability and performance much closer to an event-driven model.

You work on Rust and are well informed about this topic so I'm sure you know
all of that.

Which is why it amazes me the lengths to which you go to denigrate Go in that
respect and minimize what is a great and unique programming model among
mainstream languages.

~~~
pcwalton
> What you call a "Go thread" has a precise name (goroutine)

I call goroutines threads because they are user-level threads.

As an analogy, NVIDIA calls local threadgroups "warps", but that doesn't make
them not local threadgroups.

> Creating and destroying kernel threads is significantly more expensive.

Because kernel threads usually have larger stacks. But they don't always have
large stacks: that is configurable. Other than the stack size, the primary
difference is simply that kernel threads are created in kernel space and user
threads are created in userspace.

> A kernel thread has a fixed stack and if you go beyond, you crash. Which
> means that you have to create kernel threads with worst-case-scenario stack
> sizes (and pray that you got it right).

You can do stack switching in 1:1 too. After all, if you couldn't, then Go
couldn't do stack switching at all, since goroutines are built on top of
kernel threads.

Go's small stacks are really a property of the moving GC, not a property of
the threading model.

> In a 4 core CPU there is no point in running more than 4 busy kernel threads
> but kernel scheduler has to give each thread a chance to run.

> Go runtime only creates as many threads as CPUs and avoids this waste.

Not if they're blocked doing I/O!

If they're not blocked doing I/O, then Go tries to do preemption just as the
kernel does. (I say "tries to" because Go currently cannot preempt outside
function boundaries; this is a significant downside of M:N threading compared
to 1:1 kernel threading.)

> That's why high-perf servers (like nginx) don't just use kernel thread per
> connection and go through considerable complexity of writing event driven
> code.

High-performance servers like nginx use an event loop because it's the only
way to get the absolute fastest performance, with no overhead of stacks at
all. That the project described in the article outperforms Go's goroutines is
evidence of exactly that.

It would be possible, and interesting, to do Go-like 1:1 threading with small
stacks.

> Go gives you the straightforward programming model of thread-per-connection
> with scalability and performance much closer to an event-driven model.

Sure. But that's mostly because of the GC, not because of the M:N threading
model.

> Which is why it amazes me the lengths to which you go to denigrate Go in
> that respect and minimize what is a great and unique programming model among
> mainstream languages.

It's not unique. As I said, NGPT used to do M:N for pthreads. Solaris used to
do M:N for pthreads. The JVM used to do M:N.

~~~
dullgiulio
Nope, the JVM used to do M:1, which is very different from M:N.

------
olivierva
The Go network stack already makes use of epoll and kqueue:
[https://golang.org/src/runtime/netpoll_epoll.go](https://golang.org/src/runtime/netpoll_epoll.go)
So I'm not quite sure why this would be faster, since almost all I/O in Go is
event driven, including the networking stack.

~~~
willvarfar
The benchmarks at the bottom of the readme show quite an improvement (with a
single thread it seems).

I would speculate the performance win comes from avoiding stack switching and
using fewer channels.

I've done lots of event loops in the past (eg hellepoll in c++) and think that
the cost of that is on the programmer - keeping track of things, callbacks,
state machines and things and avoiding using the stack for state etc is all
hard work and easy to mess up.

I am reminded of this post I saw on HN a while ago
[https://www.mappingthejourney.com/single-post/2017/08/31/episode-8-interview-with-ryan-dahl-creator-of-nodejs/](https://www.mappingthejourney.com/single-post/2017/08/31/episode-8-interview-with-ryan-dahl-creator-of-nodejs/)
Ryan Dahl, creator of node.js, would just use Go today ;)

~~~
nly
> I've done lots of event loops in the past (eg hellepoll in c++) and think
> that the cost of that is on the programmer - keeping track of things,
> callbacks, state machines and things and avoiding using the stack for state
> etc is all hard work and easy to mess up.

This is improving, even in C++. This is what the core loop of a line-based
echo server could look like in C++17 (and something very similar compiles
today on my machine)

        void echo_loop (tcp::socket socket) {
            io::streambuf buffer;
            std::string line;
            std::error_code ec;
            do {
                ec = co_await async_read_line (socket, buffer, line);
                if (ec)
                    break;
                ec = co_await async_write (socket, line);
            } while (!ec);
        }

~~~
tidwall
Wow. That looks really simple.

~~~
nly
Unfortunately it's just exposition, but here[0] is a version that works with
Clang 5 + Boost

Echo specific code starts on line 167. Everything above will hopefully be
provided by the standard library once both the Networking TS and Coroutine TS
merge in to C++20.

One nice thing about lines 1 - 165 though, is that it demonstrates how easy it
is to extend the native coroutine capabilities in C++ to support arbitrary
async libraries, even if the author of those libraries didn't know anything
about coroutines. All this happens without breaking the ability to call these
coroutines from C. You can even use async C libraries that only provide a
void* argument to your callback.

[0]
[https://gist.github.com/anonymous/d9a258136431a352516122d1c9c2ca58](https://gist.github.com/anonymous/d9a258136431a352516122d1c9c2ca58)

------
fooyc
I love Go because I never had to write asynchronous, callback driven programs
in this language. I hope it won't become the norm in Go, too.

~~~
tidwall
It won't become the norm. I promise you, cross my heart and hope to die, stick
a needle in my eye.

------
indescions_2017
I'd be interested in the level seven reverse proxy application. As well as
unix domain socket message queues. There are probably many other places in the
networking pipeline evio could provide a boost.

It's a testament to what is possible through the "syscall" and "golang/x/sys"
facilities. As well as your confidence in playing with Linux internals ;)

~~~
tidwall
Thanks! The L7 proxy should be pretty sweet. :)

------
bsaul
Not sure I understand what the use case is. As soon as you start doing
something on the event loop, you need some way to perform the operation in
another "thread" (or goroutine or whatever). And then you start to need some
kind of concurrency mechanism, and pay the price.

Stripping those mechanisms out to pretend the event handling is faster only
works if you never intend to perform any real computation. That's never true
in practice... Or am I missing something?

~~~
dboreham
Not the OP but typically you resort to these tactics when you want to shave
the last ms of the server's response time, and/or get that last 1000
requests/s/core performance. You have a "fast path" that is simple and event
driven and hand off operation processing to regular threads for the (less
frequent) more complex operations.

So you're not missing anything.

~~~
nvarsj
Aren’t you losing a lot of the benefit by using a GC’d language in the first
place?

------
brian-armstrong
At this point, why not just use C++? I feel like people are trying to stretch
Go way past what it's good for. It's not going to replace C++ where C++ is
effective, and it shouldn't :)

~~~
pcwalton
Because memory safety is much easier to achieve in Go than in C++, even
"modern C++".

------
adrianratnapala
Worst of both worlds, only faster?

------
cdoxsey
This is single-threaded? What are you going to do with the other 31 or 63
cores?

The single-threaded nature of applications like Redis and Haproxy is a
significant impediment to their vertical scalability. CPUs aren't getting
faster, we're just getting more cores, so anything that assumes there's only a
single core seems like a dead end.

Haproxy literally just added multithreading support in 1.8.

~~~
meritt
The CPU is rarely the bottleneck and for both Redis/HAProxy the vertical
scalability solution has been to launch multiple processes or forks with
different core affinities. There are downsides of course (no IPC) but I still
argue that CPU is not the bottleneck for 99% of usage scenarios.

HAProxy added threading support in 1.8 as you pointed out and Redis has
started the same (for a certain subset of processing) in 4.0 as well. They're
getting there but concurrency is tough.

To suggest that his product is a "dead end" due to not supporting threading
seems a bit premature, as Redis and HAProxy are extremely well-regarded in
their niche and they made it there without threading, and we've been at
maximal clock speed for nearly a decade.

~~~
cdoxsey
> There are downsides of course (no IPC) but I still argue that CPU is not the
> bottleneck for 99% of usage scenarios.

I suppose my experience might be unusual, but I frequently log in to
c3.8xlarge redis machines that have a single core pegged at 100% and the rest
doing nothing. Yes multiple processes help, but that requires updating clients
and makes it harder to share memory.

> To suggest that his product is a "dead end" due to not supporting threading
> seems a bit premature, as Redis and HAProxy are extremely well-regarded in
> their niche and they made it there without threading.

Well yeah, CPUs hitting their GHz limit and the dramatic increase in the
number of cores per machine is a relatively recent phenomenon.

I just think it's weird to start a brand new project making those same
assumptions, especially when the underlying programming language was
explicitly designed with concurrency in mind.

It'd be like building a new networking library in Rust which ditches memory
safety.

