> However, the performance bottleneck in aiohttp turned out to be its HTTP parser, which is so slow, that it matters very little how fast the underlying I/O library is.
This is exactly the same observation that motivated the Mongrel web server for Ruby, 10 years ago this year.
"Mongrel is a small library that provides a very fast HTTP 1.1 server for Ruby web applications. [...] What makes Mongrel so fast is the careful use of an Ragel [C] extension to provide fast, accurate HTTP 1.1 protocol parsing. This makes the server scream without too many portability issues." -- https://github.com/mongrel/mongrel
And its successor Thin:
"Thin is a Ruby web server that glues together 3 of the best Ruby libraries in web history: (1) the Mongrel parser, the root of Mongrel speed and security" -- http://code.macournoyer.com/thin/
In case anyone is still wondering, parsing in Ruby/Python/Lua is pretty slow compared to C/C++. That's why I personally have been really interested for a long time in writing parsers in C that can be used from higher level languages. That way you can get the best of both worlds.
It's the best of both worlds, but it doesn't shield you from the worst of one of them: untrusted input is still reaching code that has direct access to system memory. Hopefully it isn't, but it probably is. Still, it's the way to go if performance is key.
These days you would probably want to write the parser part in Rust, with a small amount of unsafe code to implement a C-compatible API that could then be called from Python, or wherever. I did this for some regexp-based log parsing code written in Python, and saw a considerable (2-3x) performance win. The main outstanding issue is that Rust isn't as easy to distribute as C to random end users (e.g. it likely requires the user to have rustc installed for |pip install| to work, which most users won't, and which isn't always possible through standard package managers).
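To make the "write it in Rust/C, call it from Python" idea concrete, here is a minimal ctypes sketch. The library name (libfastparse.so) and the parse_line symbol are made up for illustration; the real names and signature depend on whatever C-compatible API you expose.

    # Hypothetical example: calling a C-ABI parser (written in C or Rust) via ctypes.
    import ctypes

    lib = ctypes.CDLL("./libfastparse.so")          # made-up library name
    lib.parse_line.argtypes = [ctypes.c_char_p, ctypes.c_size_t]
    lib.parse_line.restype = ctypes.c_int

    def parse_line(data):
        # Pass raw bytes to the native parser and return its status code.
        return lib.parse_line(data, len(data))

    print(parse_line(b"GET /index.html HTTP/1.1\r\n"))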
Absolutely it's a positive that this is an active area of research. One day these types of problems are going to be something future programmers joke about. Or don't even know existed.
He goes into detail about how these types of libraries are particularly difficult or impossible to fuzz. He uses OpenSSL as an example, but I would imagine an HTTP library would be similar.
> He goes into detail about how these types of libraries are particularly difficult or impossible to fuzz.
I just watched the video on 2x and I don't think that's a fair summary. He seems positive on fuzzing in general and mentions that fuzzing found two extremely tricky bugs in libsndfile and flac.
He does point out that there are some cases like OpenSSL that are particularly difficult to fuzz completely because they are encrypted and heavily stateful, creating transient keys on the fly and such. I don't think HTTP has this problem, for the most part.
Cool to see a video of Erik de Castro Lopo though -- I've worked with that guy since the early 2000s when I was working on Audacity (which uses his excellent libsndfile internally -- or at least did at the time).
On a related note, we go to great lengths in the Kestrel HTTP server [0] (which also uses libuv) to have fast HTTP parsing. As an example, we attempt to read the method and the HTTP version as longs and compare them to pre-computed longs in order to have fast comparisons and to reuse strings containing standard methods and versions (reducing memory allocation is the main driver of our optimizations) [1], so we don't have to allocate those strings on every request. We do a similar thing for headers [2]. We also manage a lot of our own memory, despite using a garbage collected language [3].
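For readers who haven't seen the trick, here is a rough Python rendition of the idea (Kestrel itself does this in C#; all names below are illustrative, not Kestrel's actual code):

    # Compare a fixed-width chunk of the request line as one integer against
    # precomputed constants, instead of allocating and comparing strings.
    def _key(m):
        return int.from_bytes(m.ljust(8, b"\x00"), "little")

    def _mask(m):
        return int.from_bytes(b"\xff" * len(m) + b"\x00" * (8 - len(m)), "little")

    # (mask, value, name) for a few common methods; the trailing space keeps
    # "GET " from matching a longer token.
    KNOWN_METHODS = [(_mask(m), _key(m), m.strip().decode())
                     for m in (b"GET ", b"PUT ", b"POST ", b"DELETE ")]

    def match_method(buf):
        chunk = int.from_bytes(buf[:8].ljust(8, b"\x00"), "little")
        for mask, value, name in KNOWN_METHODS:
            if chunk & mask == value:   # one masked integer comparison
                return name             # reuse the interned name, no new allocation
        return None

    print(match_method(b"GET /index.html HTTP/1.1"))  # -> "GET"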
How much of a parsing stack can you build on top of Ragel? I gave Ragel a good look over recently when checking out parsers for Ruby, because Ragel has a Ruby binding. I chose a Ruby PEG† library (Parslet) because I decided that there was too much of a gap between the low level finite automata machinery that Ragel provides and the parser generation that I was looking for. Was I wrong to decide that, I wonder.
† On a related note, I'm finding it difficult to find (fast) GLR or GLL parser generators for Ruby
I think you're right: the gap between Ragel and what you want in higher-level languages is too wide to make a stack out of it IMO.
I'm working on what I consider a parsing stack. The key observation, in my opinion, is that you need a way of representing structured data in the high-level language that is both rich and efficient. I think Protocol Buffers can serve that role.
Once you have the structured data representation, you can write parsers to read/write it. But you want the parsers to be independent of how you represent the structured data in any one language.
I've spent the last little while browsing through the upb and gazelle repos. I'm struggling to get a high-level overview of what you want to do. I'm curious though because what you're saying here on HN intrigues me.
Also note that David Beazley (google him if you're not aware) has a competitor to asyncio called curio, which is arguably also a direct competitor to this:
The caveat is that it uses the new async/await coroutine bits that just landed in Python 3.5, so it only works with Python 3.5+. He also gave a talk on concurrency in python recently at last year's PyCon:
Unless I'm mistaken, the decorator @asyncio.coroutine is equivalent to async def and yield from is functionally a drop-in for await, so you should be able to use it with at least 3.4, maybe 3.3. Not that that's much better though.
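A minimal sketch of that equivalence with stock asyncio (the decorated generator form works on 3.4, the native form needs 3.5+):

    import asyncio

    # Python 3.4 style: generator-based coroutine
    @asyncio.coroutine
    def fetch_old():
        yield from asyncio.sleep(0.1)
        return "done (3.4 style)"

    # Python 3.5+ style: native coroutine
    async def fetch_new():
        await asyncio.sleep(0.1)
        return "done (3.5 style)"

    loop = asyncio.get_event_loop()
    print(loop.run_until_complete(fetch_old()))
    print(loop.run_until_complete(fetch_new()))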
Keep in mind that the HTTP server in the benchmarks actually uses the httptools parser, which is a full-blown HTTP parser. A lot of heavy-lifting is done by the parser.
I plan to add a complete HTTP protocol implementation to httptools, but honestly, I don't expect it to be more than 20% slower.
Parsing is just a small part of an http server. You can check the links I posted to see how much work is involved in validating http headers (after parsing), and creating http responses.
This is quite interesting, but I don't find req/sec very interesting at all. These should be about matters of concurrency, that is, how much is being done at once, not how many req/sec overall are done (which could almost be explained away purely by the gains in lower latency).
These benchmarks seem to only use 10 clients concurrently, max. That's ridiculously low.
A few questions I'd like to see answered.
How many clients can you connect to each server, and have each one ping once per 5 mins, before the CPU gets overloaded?
> These benchmarks seem to only use 10 clients concurrently, max. That's ridiculously low.
I've just updated the post with more details on HTTP benchmarks and attached the correct full-results file [1]. The concurrency level for HTTP benchmarks is 300, not 10.
To answer other questions, I'll have to run some benchmarks tomorrow :)
Cool! For the client test I'm referring to, these would be long-lived clients that stay around and just TCP ping for an echo server, rather than HTTP calls that connect/disconnect.
In my experience, PyPy+twisted is around 5-25x faster than CPython/twisted, and it smoked asyncio as well. Would be great to see how uvloop compares there, and of course, someday when PyPy supports Python 3.5, there's no reason it couldn't use uvloop via cffi, I'd hope.
I wrote a tool to evaluate memory per connection one layer up (at websocket level), using autobahn. It works in asyncio, so it should work fine with uvloop.
No questions, just a thank you for including relevant information about the tests content, concurrency, adding percentile boxes to graphs, etc. (and the tested environment itself!) It's refreshing to see a benchmark taken seriously rather than "we tested some stuff, here are 3 numbers, victory!".
1) What makes uvloop which is based on libuv 2x faster than node.js which is also based on libuv?
2) Can uvloop be used with frameworks like flask or django?
3) gevent uses monkey patching to turn blocking libraries such as DB drivers non-blocking, does uvloop do anything similar? If not how does it work with blocking libraries?
1) I don't know :( I've answered a similar question in this thread with a couple of guesses.
2) No, they have a different architecture. Although I heard that there is a project to integrate asyncio into django to get websockets and http/2.
3) asyncio/uvloop require you to use explicit async/await. So, unfortunately, the existing networking code that isn't built for asyncio can't be reused. On the bright side, there are so many asyncio DB drivers and other modules now!
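For contrast with gevent's implicit monkey patching, the explicit style looks roughly like this with aiopg, one of those asyncio Postgres drivers (a sketch based on its documented usage; the DSN is made up):

    import asyncio
    import aiopg

    dsn = "dbname=mydb user=me password=secret host=127.0.0.1"  # made-up credentials

    async def go():
        pool = await aiopg.create_pool(dsn)
        async with pool.acquire() as conn:
            async with conn.cursor() as cur:
                await cur.execute("SELECT 1")
                async for row in cur:       # every I/O point is an explicit await
                    print(row)

    loop = asyncio.get_event_loop()
    loop.run_until_complete(go())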
Having a web framework (a next generation Flask if you will) built ground up with concurrency is the missing key. I have used Flask for many years, but this is the right time to introduce a new framework - because of the internal restructuring and slowdown of Flask's maintainers (and I say this with the utmost respect).
If you are keen, this has the potential to be the killer application for Python 3.
One problem with this is that the entire ecosystem has to get on board with async. Maybe a new framework would make it compelling enough, who knows.
This was/is the big issue with Tornado, IMO (and Tornado has been around for ages in framework time). Tornado is only async if the entire call stack all the way down to the http socket is async, using callbacks instead of returning values. This means that any 3rd party client library you use has to be completely written asynchronously, and none are in python. So you end up with a lot of tedious work re-implementing http client libraries for Twilio or Stripe or whatever you're using.
I'm curious to see where asyncio goes in python, but I'm a bit skeptical after seeing how much of a pain it was to use Tornado on a large web app. In the meantime I'll be using Gevent + Flask, which isn't perfect since it adds some magic & complexity but has the huge upside of letting you keep using all the libraries you're used to.
There is already http://www.tornadoweb.org/en/stable/ which is python3 compatible and shipped in production software across lots of companies (last I heard hipmunk and quora use it)
Tornado is excellent... But Flask is better than excellent. The mental map of Flask is incredible. Tornado is a little hard to grok. Now one may argue that Tornado is hard by choice...to not mask the complexity.
But then we have node..the most hip of frameworks out there. Node's true innovation was not performance, but to simplify the mental model of async. Obviously there's all the callback hell and all..but still.
I'm pretty sure both Node and Tornado have the same mental model of async. Both have callbacks and both have async/await functionality that makes it look more like blocking code.
The main benefit of Node as I see it is that the entire Node community uses the same IO Loop whereas Python's community is fragmented between normal sync code and multiple different IO Loops (asyncio will probably help with this).
You totally should. I think people are hungry for a next gen async web framework... And if it leverages Python 3, then so much the better. Flask cannot go here even if it wants to, because its core philosophy is to be WSGI compliant. You don't necessarily have to adhere to that.
Just one point, please make sure you have designed DB access as a core part of your framework (e.g. [1]). Too many frameworks discount database interaction until it's too late.
Oh and please please choose your name so that it doesn't conflict on Google search. http://www.spinframework.org
> Just one point, please make sure you have designed DB access as a core part of your framework (e.g. [1]). Too many frameworks discount database interaction until it's too late.
I strongly advise against this. One of the reasons Flask is so attractive is the fact that it does not enforce any database on you.
Thanks to its decoupled design you can use it purely as a routing library, which is great! Letting the framework decide something important as the database is a bad idea. [1]
That is ok - but there is a representative library that works. Loosely coupled but definitely working is beautiful... and this is why I use Flask in my startup.
What frequently happens is a web framework built without a thought for any kind of DB interaction (or as you put it... a routing library). In something like an async web framework, that could leave users hanging. For example, psycopg vs psycopg2 vs psycogreen vs psycopg2-cffi. Tell me which one to use and benchmark it.
I agree, it's a fine line. And we can keep going back and forth on whether a framework should "recommend" or not recommend. But in a case like this - I think there will NOT be a lot of libraries that are compatible with the async use case. I would hope that this framework will recommend... but not ship "batteries included".
If such a framework was built around asyncio, do we really need yet another database layer? 1 of the best things about Flask is it doesn't reinvent the wheel.
I'd like to chat with you about this, and share ideas for APIs and architectures. Even if we don't end up working with each other, we can benefit from sharing ideas about what a next gen framework in Python should look like.
We are actually working on a project like this with Tygs (https://github.com/Tygs/tygs, which will use crossbar.io), and right now it's taking ages just to get the app life cycle right.
Indeed, we want to make it very easy to use, especially a clean/discoverable API, simple debugging, clear errors, etc. Which is the stuff async frameworks are not good at, and an incredible added value.
It's a lot more work than we initially thought. When you are doing async, you no longer think in a sequence of actions ordered in time, but in events. So for your framework to be usable, you must provide hooks to:
- do something at init
- register a component
- do something when a component is registered
- do something once it's ready
- do something when there is an error
- do something when it shuts down
With minimum boilerplate, and maximally clear error handling when the user tries to do it at the wrong time or in the wrong way.
But we learned while coding the project that, with asyncio, exceptions in coroutines don't break the event loop (except KeyboardInterrupt, which is a weird hybrid), while exceptions coming from the loop itself do break it.
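A minimal illustration of that asymmetry with plain asyncio (assuming Python 3.5): the failed task's exception is stored, the loop keeps running, and asyncio only logs "Task exception was never retrieved" when the task is garbage collected.

    import asyncio

    async def boom():
        raise ValueError("exception inside a coroutine")

    async def heartbeat():
        for i in range(3):
            await asyncio.sleep(0.1)
            print("loop is still running", i)

    loop = asyncio.get_event_loop()
    loop.create_task(boom())               # exception is stored on the Task, not raised here
    loop.run_until_complete(heartbeat())   # keeps running despite the failed task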
Plus you have to make a nice setup, which auto-starts the event loop with good defaults so that people don't have to think about it for simple apps. But it must be overridable, and it must handle the case where your framework is embedded inside an already started event loop, and make it easy to do so.
It's one of the strong points of gevent: you don't have to think about it. With asyncio/twisted, you get the benefit of explicit switch points and parallelism, but you pay the price in verbosity and complexity. We try to strike a balance, and it turns out to be harder than expected.
Then you have to provide clear error reporting, and especially iron out the mix of Tasks, Futures, coroutine functions and coroutines. Provide helpers so that common scheduling is done easily...
And you haven't even talked about HTTP yet. This is just proper asyncio management. This is why nobody has made a killer framework yet: it's a looooooot of work, it's hard, and it's very easy to get it wrong. Doing like <sync framework> but async doesn't cut it.
You've basically described Twisted. It's unfortunate that when twisted was initially developed python didn't have some of the syntactic niceness (coroutines, async) that would have made it a bit easier/cleaner to use.
- It's verbose compared to flask.
- It reinvents the wheel, while flask uses great components such as werkzeug.
- If you want to make a component, it's complex.
- It doesn't come batteries-included for the Web like Django, just the bare minimum.
- It misses the opportunity to provide task queues, RPC or PUB/SUB, which are key components of any modern stack and are made easy by having persistent connections.
- It ignores async/await's potential for unifying threads/processes/asyncio and doesn't allow easy multi-CPU use.
Don't get me wrong, I think tornado is a great piece of software, but it's no match for innovative projects we see in Go or NodeJS such as Meteor.
We will have to agree to disagree on many of your points as they are mostly personal opinions, but the last one is flat wrong. Tornado has for a long time had great support for fully leveraging all CPU cores. Happy to show sample code if you like.
I'm the main person in Django working on 2), and this is interesting for sure, though our current code is based on Twisted since we need python 2 compatibility (there's some asyncio code, but not a complete webserver yet)
I'm working on a Python CLI that uses asyncio/aiohttp to make and process requests to a 3rd party API. Anyway, I ran into the 10,000 socket problem today and ended up using a semaphore, which actually boosted the overall performance. Why is that? Is it just because the CPU is overwhelmed otherwise?
It depends. Maybe you aren't closing the sockets properly (shutdown + close). Or maybe, because of how TCP works, the sockets are stuck in timeouts and don't really close for a long period of time. If it's something like that, then your old connections aren't really closing, and new ones can't be created.
Or maybe it's a simple problem of aiohttp performance -- as shown in the blog post, its HTTP parser is a bit slow.
In general, I'd recommend using fewer sockets and implementing some pipelining of API requests.
Are you making all connections within a single session?
Not long ago I saw an example here where someone was creating a new session for every single connection. That is not an optimal way of using it. If you make all requests within the same session, aiohttp will make use of keep-alive, which in turn will reuse existing connections and reduce overhead. You also won't need a semaphore, since you can define a limit in the TCPConnector.
Why did you have performance issues? As others said, you were making thousands of connections, and each socket needs to sit in the TIME_WAIT state for 2 minutes after closing (a limitation of TCP; SCTP does not have this problem). So if you use up all connections within a short time, you'll essentially run out of them. Some people enable tcp_tw_reuse/recycle, and that solves this issue, but it makes your connections no longer follow the RFC and you might encounter strange issues later on. The advice above should resolve your problem without any hacks.
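A rough sketch of that advice, assuming a reasonably recent aiohttp (one session for everything, with the connector's limit capping open sockets, so no semaphore is needed):

    import asyncio
    import aiohttp

    async def fetch(session, url):
        async with session.get(url) as resp:
            return await resp.read()

    async def main(urls):
        # A single session reuses keep-alive connections; limit caps open sockets.
        conn = aiohttp.TCPConnector(limit=100)
        async with aiohttp.ClientSession(connector=conn) as session:
            return await asyncio.gather(*(fetch(session, u) for u in urls))

    loop = asyncio.get_event_loop()
    loop.run_until_complete(main(["http://example.com/"] * 10))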
Not sure what you mean, the benchmark section lists ~40k packets/s for nodejs and asyncio, ~100k for uvloop (for 1KiB messages; similar difference for 10 and 100 KiB) - and ~20k req/s for nodejs, and ~37k for uvloop w/httptools. Interestingly, uvloop pulls ahead for 100KiB request size for the http case.
Building on to this, how does it compare to raw libuv in C? Personally, I'm not surprised that python (especially cython) is faster than node in this case, but I still want to see how much less overhead there is compared to node.
> Building on to this, how does it compare to raw libuv in c?
Building something in C is very hard. uvloop wraps all libuv primitives in Python objects which know how to manage the memory safely (i.e. not to "free" something before libuv is done with it). So development time wise, uvloop is much better.
As for the performance, I guess you'd be able to squeeze another 5-15% if you write the echo server in C.
> I'm not surprised that python (especially cython) is faster than node in this case
Cython is a statically typed compiled language, it can be anywhere from 2x to 100x faster than CPython.
Yeah, I definitely get that. I'm just trying to see the smaller picture here.
>Cython is a statically typed compiled language, it can be anywhere from 2x to 100x faster than Python.
Ah, my bad for not knowing the difference between Cython and CPython. It seems to me, then, that this isn't really a fair comparison to node, is it? Naturally a statically typed language is going to be faster than a dynamic one. Good on you for including a comparison with Go, though.
Most of nodejs internals are in C++ on top of libuv. Only a thin layer of JS interfaces wrap that.
Python is also a dynamic, GCed language. uvloop is built with Cython, which uses the Python object model (and all of its overhead!), and CPython C-API extensively (so it's slower than a pure C program using libuv).
Wherever you use asyncio, it should be safe to just start using uvloop (once we graduate it from beta).
uvloop shouldn't behave any differently, I've paid special attention to make sure it works exactly the same way as asyncio (down to when its objects are garbage collected).
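For reference, the drop-in switch looks roughly like this (based on uvloop's documented usage):

    import asyncio
    import uvloop

    # Swap the default loop implementation; everything else stays plain asyncio.
    asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())

    async def hello():
        await asyncio.sleep(0.1)
        return "hello from uvloop"

    loop = asyncio.get_event_loop()   # now a uvloop loop
    print(loop.run_until_complete(hello()))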
Not really practical. uvloop is designed to work in tandem with asyncio, which is a Python 3-only module. asyncio, in turn, requires 'yield from' support, something that Python 2 doesn't have.
The problem with Trollius was that asyncio packages needed to explicitly add support for it to work (because Python 2 does not have yield from). Several packages (I remember aiohttp was one of them) did not want to do that.
Python is actually designed in such a way that you can install multiple major versions and they can coexist perfectly fine together.
For example you can install python 2.6, 2.7, 3.3, 3.4, & 3.5 all on one host without any conflicts. The limitation is that many distributions prefer to not maintain different versions of supposedly the same language.
If you use RedHat or CentOS you can just use https://ius.io/ and get access to the other python versions. This is one of few repos that makes sure the packages don't conflict with system ones.
The HTTP benchmarks were actually run under concurrency level of 300. I've just updated the post with extra details. See also the full report of HTTP benchmark [1]
CPython uses ref counting, whereas Node/V8 uses GC. Ref counting schemes generally use much less memory, and GC scans can be expensive when there's a lot of new data, as is the case for a web server.
Looks good. Wonder why these benchmarks never include an actual database. I'm using firebase with nodejs and no matter how many requests per second my server can respond with, it's ultimately constrained by data requests and memory (ie how many connections can I wait on).
When I see numbers like 50k requests per second it's meaningless, unless I have no database or some kind of in memory cache only db.
It's not meaningless for new gen apps which trade a lot of db requests for message passing. In a micro-services app, when you update something, you propagate the change, and you get x client updates for one db request instead of x + 1 db requests. In that context, broadcasting quickly to a lot of clients is important.
The streams implementation is a pretty big chunk of Python code that manages flow control, buffering, and integration with coroutines. And you don't always need all that when you're writing a protocol, since you can implement those pieces more efficiently as part of the protocol parser.
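A minimal sketch of the protocol-level alternative with plain asyncio (no StreamReader/StreamWriter layer; bytes from the transport go straight into your parser):

    import asyncio

    class EchoProtocol(asyncio.Protocol):
        def connection_made(self, transport):
            self.transport = transport

        def data_received(self, data):
            # A real server would feed `data` into its protocol parser here.
            self.transport.write(data)

    loop = asyncio.get_event_loop()
    server = loop.run_until_complete(loop.create_server(EchoProtocol, "127.0.0.1", 8888))
    try:
        loop.run_forever()
    finally:
        server.close()
        loop.run_until_complete(server.wait_closed())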
AFAIK, StreamServer should actually affect the benchmark in a positive way. In the gevent case, StreamServer isn't about high-level abstractions, it's about making sure that client sockets always have a READ flag set in the I/O multiplexer (epoll, kqueue, etc.)
Unless you're working with a legacy project, in which case you wouldn't be replacing the lib for this kind of thing anyway, I don't see any reason why you would want to use 2.7.x over 3.x unless you desperately need something like gevent.
The benchmarks for node.js are terribly misleading. The node.js implementations only ever spawn a single process, and thus node is only running on a single core and uses only a single thread.
Specifically, the http server example(1), doesn't even bother using the standard library provided Cluster module(2). Cluster is specifically designed for distributing server workloads across multiple cores.
All node.js services/applications I've worked on in the past 3 years (that are concerned with scale) utilize a multi-process node architecture.
The current benchmark can only claim that a single python process that spawns multiple threads is 2x faster than a single node.js process that spawns only one thread.
This fact may be interesting to some, but is irrelevant to real world performance.
There is nothing misleading about the benchmarks. It is explicitly said that ALL frameworks were benchmarked in single-process and single-thread modes.
Yes, in production you should run your nodejs app in cluster, your Python apps in a multiprocess configuration, and you should never use GOMAXPROCS=1 for your go apps in production!
Running all benchmarks in multiprocess configuration wouldn't add anything new to the results.
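For completeness, a multiprocess asyncio configuration of the sort mentioned above might look roughly like this (a sketch assuming Linux SO_REUSEPORT and Python 3.5+; it is not taken from the benchmark code):

    import asyncio
    import multiprocessing

    class EchoProtocol(asyncio.Protocol):
        def connection_made(self, transport):
            self.transport = transport

        def data_received(self, data):
            self.transport.write(data)

    def worker():
        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)
        # reuse_port lets every worker bind the same port (SO_REUSEPORT),
        # so the kernel spreads incoming connections across processes.
        loop.run_until_complete(
            loop.create_server(EchoProtocol, "0.0.0.0", 8888, reuse_port=True))
        loop.run_forever()

    if __name__ == "__main__":
        procs = [multiprocessing.Process(target=worker) for _ in range(4)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()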
Each implementation does something wildly different and responds to different inputs with completely different outputs.
To put it metaphorically, if you put a car engine in two completely different chassis and then race them on a track, you aren't gaining any real insight into relative performance of the engine in the two vehicles.
Also, just to be clear, my qualms are with the benchmarks alone, I think the library is great! Thanks for all the hard work :)
I guess I look at the benchmarks in a bit different light.
These benchmarks are primarily comparing event loops and their performance. The TCP benchmark is very fair; the HTTP one, maybe not so much. The point is to show that you can write super fast servers in Python too, as long as you have a fast protocol parser.
As for the HTTP benchmarks, I plan to add more stuff to httptools and implement a complete HTTP protocol in it. I'll rerun the benchmarks, but I don't expect more than a 20% performance drop.
Since this is a benchmark of eventloop based frameworks, it makes sense to only spawn a single eventloop and test against that. I looked through the code for the python servers and they are all configured for a single event loop, making this a comparison on equal footing.
Yes it's true you normally run multiple node processes in production, but you likewise normally run multiple asyncio/tornado/twisted processes in production as well. I don't see it as a big deal, or misleading to compare them in this sense.
No, even Go is explicitly configured to only use one scheduler:
> We use Python 3.5, and all servers are single-threaded. Additionally, we use GOMAXPROCS=1 for Go code, nodejs does not use cluster, and all Python servers are single-process.