
Uvloop: Fast Python networking - c17r
http://magic.io/blog/uvloop-blazing-fast-python-networking/
======
haberman
> However, the performance bottleneck in aiohttp turned out to be its HTTP
> parser, which is so slow, that it matters very little how fast the
> underlying I/O library is.

This is exactly the same observation that motivated the Mongrel web server for
Ruby, 10 years ago this year.

"Mongrel is a small library that provides a very fast HTTP 1.1 server for Ruby
web applications. [...] What makes Mongrel so fast is the careful use of a
Ragel [C] extension to provide fast, accurate HTTP 1.1 protocol parsing. This
makes the server scream without too many portability issues." --
[https://github.com/mongrel/mongrel](https://github.com/mongrel/mongrel)

And its successor Thin:

"Thin is a Ruby web server that glues together 3 of the best Ruby libraries in
web history: (1) the Mongrel parser, the root of Mongrel speed and security"
-- [http://code.macournoyer.com/thin/](http://code.macournoyer.com/thin/)

In case anyone is still wondering, parsing in Ruby/Python/Lua is pretty slow
compared to C/C++. That's why I personally have been really interested for a
long time in writing parsers in C that can be used from higher level
languages. That way you can get the best of both worlds.

~~~
coderdude
It's the best of both worlds but doesn't shield you from the worst of one of
the worlds. Untrusted input is still reaching code that has direct access to
system memory. Hopefully not, anyway. But probably. Still, it's the way to go
if performance is key.

~~~
haberman
Very true. Thankfully fuzzing tools are getting better all the time. LLVM's
libfuzzer is great.

~~~
ericfrederich
I'd recommend watching this:
[https://www.youtube.com/watch?v=y0hyqzR6hIY](https://www.youtube.com/watch?v=y0hyqzR6hIY)

He goes into detail about how these types of libraries are particularly
difficult or impossible to fuzz. He uses OpenSSL as an example, but I would
imagine an HTTP library is similar.

~~~
haberman
> He goes into detail about how these types of libraries are particularly
> difficult or impossible to fuzz.

I just watched the video on 2x and I don't think that's a fair summary. He
seems positive on fuzzing in general and mentions that fuzzing found two
extremely tricky bugs in libsndfile and flac.

He does point out that there are some cases like OpenSSL that are particularly
difficult to fuzz completely because they are encrypted and heavily stateful,
creating transient keys on the fly and such. I don't think HTTP has this
problem, for the most part.

Cool to see a video of Erik de Castro Lopo though -- I've worked with that guy
since the early 2000s when I was working on Audacity (which uses his excellent
libsndfile internally -- or at least did at the time).

------
SEJeff
Also note that David Beazley (google him if you're not aware) has a
competitor to asyncio called curio, which is arguably also a direct competitor
to this:

[http://curio.readthedocs.io/en/latest/](http://curio.readthedocs.io/en/latest/)

The caveat is that it uses the new async/await coroutine bits that just landed
in Python 3.5, so it only works with Python 3.5+. He also gave a talk on
concurrency in Python at last year's PyCon:

[https://www.youtube.com/watch?v=MCs5OvhV9S4](https://www.youtube.com/watch?v=MCs5OvhV9S4)

~~~
cderwin
Unless I'm mistaken, the @asyncio.coroutine decorator is equivalent to async
def, and yield from is functionally a drop-in for await, so you should be able
to use it with at least 3.4, maybe 3.3. Not that that's much better, though.

~~~
masklinn
That makes it functionally equivalent to uvloop (which from my understanding
is a drop-in replacement for the built-in asyncio eventloop).

------
halayli
This isn't a fair comparison. The "HTTP server" presented isn't doing any of
the checks/validation that a typical web server does.

Compare

[https://github.com/MagicStack/vmbench/blob/master/servers/as...](https://github.com/MagicStack/vmbench/blob/master/servers/asyncio_http_server.py)

to

[https://github.com/nodejs/node/blob/master/lib/_http_outgoin...](https://github.com/nodejs/node/blob/master/lib/_http_outgoing.js)

[https://github.com/nodejs/node/blob/master/lib/_http_server....](https://github.com/nodejs/node/blob/master/lib/_http_server.js)

[https://github.com/golang/go/blob/master/src/net/http/server...](https://github.com/golang/go/blob/master/src/net/http/server.go)

The benchmark is almost equivalent to testing raw event-loop performance
against a complete HTTP server.

To make this test fair, you'd need to write the equivalent of
asyncio_http_server.py in the other languages.

~~~
RedCrowbar
> This isn't a fair comparison. The "HTTP server" presented isn't doing any
> of the checks/validation that a typical web server does.

How so? It uses a binding to http-parser, just like nodejs.

~~~
halayli
Parsing is just a small part of an HTTP server. You can check the links I
posted to see how much work is involved in validating HTTP headers (after
parsing) and creating HTTP responses.

------
windlep
This is quite interesting, but I don't find req/sec very interesting at all.
Benchmarks like these should measure concurrency, that is, how much is being
done at once, not overall req/sec (which could almost be explained away purely
by gains in lower latency).

These benchmarks seem to only use 10 clients concurrently, max. That's
ridiculously low.

A few questions I'd like to see answered.

How many clients can you connect to each server, with each one pinging once
every 5 minutes, before the CPU gets overloaded?

How much memory is used per connection?

How does PyPy+Twisted fare in this?

~~~
1st1
> These benchmarks seem to only use 10 clients concurrently, max. That's
> ridiculously low.

I've just updated the post with more details on the HTTP benchmarks and
attached the correct full-results file [1]. The concurrency level for the HTTP
benchmarks is 300, not 10.

To answer other questions, I'll have to run some benchmarks tomorrow :)

[1] [http://magic.io/blog/uvloop-blazing-fast-python-networking/http-bench.html](http://magic.io/blog/uvloop-blazing-fast-python-networking/http-bench.html)

~~~
windlep
Cool! For the client test I'm referring to, these would be long-lived clients
that stay around and just TCP ping for an echo server, rather than HTTP calls
that connect/disconnect.

In my experience, PyPy+Twisted is around 5-25x faster than CPython+Twisted,
and it smoked asyncio as well. It would be great to see how uvloop compares
there, and of course, someday when PyPy supports Python 3.5, there's no reason
it couldn't use uvloop via cffi, I'd hope.

~~~
1st1
> [..] Would be great to see how uvloop compares there [..]

Yep, I'm curious to see what will happen there. Do you have any suggestions on
what tool to use to generate the load?

> [..] someday when PyPy supports Python 3.5, there's no reason it couldn't
> use uvloop via cffi I'd hope.

We'll figure that out! ;)

~~~
windlep
I wrote a tool to evaluate memory per connection one layer up (at websocket
level), using autobahn. It works in asyncio, so it should work fine with
uvloop.

[https://github.com/bbangert/ssl-ram-testing/](https://github.com/bbangert/ssl-ram-testing/)

------
1st1
I'm the dev behind uvloop. AMA.

~~~
ruffrey
> at least 2x faster than nodejs, gevent, as well as any other Python
> asynchronous framework

I did not see any benchmarks in the repo to support this. How was this
statistic determined?

~~~
danappelxx
Building on to this, how does it compare to raw libuv in C? Personally, I'm
not surprised that Python (especially Cython) is faster than node in this
case, but I'd still like to see how much less overhead there is compared to
node.

~~~
1st1
> Building on to this, how does it compare to raw libuv in c?

Building something in C is very hard. uvloop wraps all libuv primitives in
Python objects which know how to manage the memory safely (i.e. not to "free"
something before libuv is done with it). So development time wise, uvloop is
much better.

As for the performance, I guess you'd be able to squeeze another 5-15% if you
write the echo server in C.

> I'm not surprised that python (especially cython) is faster than node in
> this case

Cython is a statically typed compiled language, it can be anywhere from 2x to
100x faster than CPython.

~~~
danappelxx
> So development time wise, uvloop is much better.

Yeah, I definitely get that. I'm just trying to see the smaller picture here.

> Cython is a statically typed compiled language, it can be anywhere from 2x
> to 100x faster than CPython.

Ah, my bad for not knowing the difference between Cython and CPython. It seems
to me, then, that this isn't really a fair comparison to node, is it?
Naturally a statically typed language is going to be faster than a dynamic
one. Good on you for including a comparison with Go, though.

~~~
Sean1708
> It seems to me, then, that this isn't really a fair comparison to node, is
> it?

Isn't node written in C?

~~~
danappelxx
Yes, but JavaScript is still a dynamic, garbage collected language.

~~~
1st1
Most of nodejs internals are in C++ on top of libuv. Only a thin layer of JS
interfaces wraps that.

Python is also a dynamic, GCed language. uvloop is built with Cython, which
uses the Python object model (and all of its overhead!) and the CPython C-API
extensively (so it's slower than a pure C program using libuv).

------
cjbprime
Wow. What's the intuition for why python on top of libuv is 2x faster than
node on top of libuv?

~~~
1st1
Two things I'd check first:

1. The benchmarks make the servers generate a huge number of objects, so maybe
the GC is under too much pressure.

2. Another possibility is that the v8 JIT can't optimize some of the JS code in
nodejs, or does a poor job of it.

That said, only careful profiling can answer your question :)

~~~
geekuillaume
I would be interested in seeing the performance difference in the NodeJS TCP
echo benchmark from using piping instead of reading/writing manually:

    socket.pipe(socket);

------
nitely
Asyncio is pretty fast, but as soon as you write any Python logic, requests
per second will drop significantly.

I'm surprised uvloop is faster than node.js, since libuv was developed for the
latter. Kudos to the author.

~~~
1st1
> Asyncio is pretty fast, but as soon as you write any Python logic, requests
> per second will drop significantly.

Sure, but it really depends on how complex your Python code is.

> I'm surprised uvloop is faster than node.js, since libuv was developed for
> the latter. Kudos to the author.

Thanks!

------
messel
Looks good. I wonder why these benchmarks never include an actual database. I'm
using Firebase with nodejs, and no matter how many requests per second my
server can respond with, it's ultimately constrained by data requests and
memory (i.e. how many connections I can wait on).

When I see numbers like 50k requests per second, it's meaningless unless I
have no database, or only some kind of in-memory cache.

~~~
sametmax
It's not meaningless for new-gen apps, which trade a lot of DB requests for
message passing. In a micro-services app, when you update something you
propagate the change, so you get x client updates for one DB request instead
of x + 1 DB requests. In that context, broadcasting quickly to a lot of
clients is important.

------
patrickg_zill
Are you using this code in production?

~~~
1st1
I'd say it's not yet ready for production. The test coverage is fairly decent,
though, so I hope we'll make a stable release soon.

That said, uvloop should be fully compatible with asyncio. All APIs should be
ready, so you can start testing it in your projects.

------
ascotan
I think claims about being faster than X require showing the code used in the
benchmarking.

~~~
1st1
Everything is in the blog post:
[https://github.com/MagicStack/vmbench/tree/master/servers](https://github.com/MagicStack/vmbench/tree/master/servers)

------
MrBra
Time to take a rest for JS? :)

------
pritambarhate
It would be great to know how this compares to async frameworks in Java, for
example Netty and Vert.x, in the benchmarks.

~~~
1st1
Feel free to make a PR to
[https://github.com/MagicStack/vmbench](https://github.com/MagicStack/vmbench)!

------
noplay
Does it mean that uvloop could allow the use of the same event loop
implementation on the three major OS?

~~~
1st1
asyncio already allows that.

uvloop, as of now, only runs on *nix; but I hope we'll have Windows supported
soon too.

------
smegel
It is interesting that uvloop-streams is almost identical to gevent in
performance. Gevent is based on libev, the project libuv was forked from.

What exactly is the -streams addition that makes uvloop-streams perform so
much worse than plain uvloop?

~~~
1st1
The streams implementation is a pretty big chunk of Python code that manages
flow control, buffering, and integration with coroutines. You don't always
need all that when you're writing a protocol, since you can implement those
pieces more efficiently as part of the protocol parser.

~~~
smegel
Is it possible to add a benchmark for gevent using raw sockets rather than
StreamServer (assuming that adds similar overhead)?

~~~
1st1
AFAIK, StreamServer should actually affect the benchmark in a positive way.
In gevent's case, StreamServer isn't about high-level abstractions; it's about
making sure that client sockets always have a READ flag set in the I/O
multiplexer (epoll, kqueue, etc.).

------
distracteddev90
The benchmarks for node.js are terribly misleading. The node.js
implementations only ever spawn a single process, and thus node is only
running on a single core and uses only a single thread.

Specifically, the HTTP server example [1] doesn't even bother using the
standard-library Cluster module [2]. Cluster is specifically designed for
distributing server workloads across multiple cores.

All node.js services/applications I've worked on in the past 3 years (that are
concerned with scale) utilize a multi-process node architecture.

The current benchmark can only claim that a single python process that spawns
multiple threads is 2x faster than a single node.js process that spawns only
one thread.

This fact may be interesting to some, but is irrelevant to real world
performance.

[1]:
[https://github.com/MagicStack/vmbench/blob/master/servers/no...](https://github.com/MagicStack/vmbench/blob/master/servers/nodejs_http_server.js)

[2]:
[https://nodejs.org/api/cluster.html](https://nodejs.org/api/cluster.html)

~~~
1st1
There is nothing misleading about the benchmarks. It is explicitly stated that
ALL frameworks were benchmarked in single-process, single-thread mode.

Yes, in production you should run your nodejs app in a cluster, your Python
apps in a multiprocess configuration, and you should never use GOMAXPROCS=1
for your Go apps!

Running all benchmarks in a multiprocess configuration wouldn't add anything
new to the results.

~~~
distracteddev90
The main premise in my comment is that the benchmarks do not resemble real
world performance, and are therefore misleading.

The comment above
([https://news.ycombinator.com/item?id=11626762](https://news.ycombinator.com/item?id=11626762))
further expands on why these kinds of benchmarks, although interesting, have
no real value.

Each implementation does something wildly different and responds to different
inputs with completely different outputs.

To put it metaphorically: if you put a car engine in two completely different
chassis and race them on a track, you aren't gaining any real insight into the
relative performance of the engine in the two vehicles.

Also, just to be clear, my qualms are with the benchmarks alone, I think the
library is great! Thanks for all the hard work :)

~~~
1st1
I guess I look at the benchmarks in a bit of a different light.

These benchmarks primarily compare event loops and their performance. The TCP
benchmark is very fair; the HTTP one, maybe not so much. The point is to show
that you can write super-fast servers in Python too; you just need a fast
protocol parser.

As for the HTTP benchmarks, I plan to add more stuff to httptools and
implement the complete HTTP protocol in it. I'll rerun the benchmarks, but I
don't expect more than a 20% performance drop.

