This is exactly the same observation that motivated the Mongrel web server for Ruby, 10 years ago this year.
"Mongrel is a small library that provides a very fast HTTP 1.1 server for Ruby web applications. [...] What makes Mongrel so fast is the careful use of an Ragel [C] extension to provide fast, accurate HTTP 1.1 protocol parsing. This makes the server scream without too many portability issues." -- https://github.com/mongrel/mongrel
And its successor Thin:
"Thin is a Ruby web server that glues together 3 of the best Ruby libraries in web history: (1) the Mongrel parser, the root of Mongrel speed and security" -- http://code.macournoyer.com/thin/
In case anyone is still wondering, parsing in Ruby/Python/Lua is pretty slow compared to C/C++. That's why I personally have been really interested for a long time in writing parsers in C that can be used from higher level languages. That way you can get the best of both worlds.
He goes into detail about how these types of libraries are particularly difficult or impossible to fuzz. He uses OpenSSL as an example but I would imagine an http library being similar.
I just watched the video on 2x and I don't think that's a fair summary. He seems positive on fuzzing in general and mentions that fuzzing found two extremely tricky bugs in libsndfile and flac.
He does point out that there are some cases like OpenSSL that are particularly difficult to fuzz completely because they are encrypted and heavily stateful, creating transient keys on the fly and such. I don't think HTTP has this problem, for the most part.
Cool to see a video of Erik de Castro Lopo though -- I've worked with that guy since the early 2000s when I was working on Audacity (which uses his excellent libsndfile internally -- or at least did at the time).
This uses Dahl's original http_parser.c FSM. With a little work you can write a WSGI handler around it. Highly recommend.
† On a related note, I'm having trouble finding (fast) GLR or GLL parser generators for Ruby
I'm working on what I consider a parsing stack. The key observation, in my opinion, is that you need a way of representing structured data in the high-level language that is both rich and efficient. I think Protocol Buffers can serve that role.
Once you have the structured data representation, you can write parsers to read/write it. But you want the parsers to be independent of how you represent the structured data in any one language.
The caveat is that it uses the new async/await coroutine bits that just landed in Python 3.5, so it only works with Python 3.5+. He also gave a talk on concurrency in Python at last year's PyCon:
The benchmark is almost equivalent to testing raw eventloop performance against a complete http server.
To make this test fair, you need to write the same code in asyncio_http_server.py in other languages.
I plan to add a complete HTTP protocol implementation to httptools, but honestly, I don't expect it to be more than 20% slower.
How so? It uses a binding to http-parser, just like nodejs.
These benchmarks seem to only use 10 clients concurrently, max. That's ridiculously low.
A few questions I'd like to see answered.
How many clients can you connect to each server, and have each one ping once per 5 mins, before the CPU gets overloaded?
How much memory is used per connection?
How does PyPy+Twisted fare in this?
I've just updated the post with more details on HTTP benchmarks and attached the correct full-results file. The concurrency level for HTTP benchmarks is 300, not 10.
To answer other questions, I'll have to run some benchmarks tomorrow :)
In my experience, PyPy+twisted is around 5-25x faster than CPython/twisted, and smoked asyncio as well. Would be great to see how uvloop compares there, and of course, someday when PyPy supports Python 3.5, there's no reason it couldn't use uvloop via cffi I'd hope.
Yep, I'm curious to see what will happen there. Do you have any suggestions on what tool to use to generate the load?
> [..] someday when PyPy supports Python 3.5, there's no reason it couldn't use uvloop via cffi I'd hope.
We'll figure that out! ;)
2) Can uvloop be used with frameworks like flask or django?
3) gevent uses monkey patching to turn blocking libraries such as DB drivers non-blocking, does uvloop do anything similar? If not how does it work with blocking libraries?
2) No, they have a different architecture. Although I heard that there is a project to integrate asyncio into django to get websockets and http/2.
3) asyncio/uvloop require you to use explicit async/await. So, unfortunately, the existing networking code that isn't built for asyncio can't be reused. On the bright side, there are so many asyncio DB drivers and other modules now!
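To make "explicit async/await" concrete, here is a minimal sketch. The names `fetch` and `main` are made up for illustration, `asyncio.sleep` stands in for real network I/O, and `asyncio.run` assumes Python 3.7+ (newer than the 3.5 discussed here). The point is that the event loop can only switch between tasks at an `await`.

```python
import asyncio


async def fetch(name, delay, log):
    """Hypothetical I/O-bound operation; asyncio.sleep stands in for a network call."""
    log.append(name + " start")
    # Explicit suspension point: control returns to the event loop here,
    # which lets the other coroutine make progress in the meantime.
    await asyncio.sleep(delay)
    log.append(name + " done")
    return name


async def main():
    log = []
    # Both coroutines run concurrently on one thread; gather returns
    # results in argument order, not completion order.
    results = await asyncio.gather(
        fetch("a", 0.05, log),
        fetch("b", 0.01, log),
    )
    return results, log
```

A blocking call such as `time.sleep` inside `fetch` would stall every other task on the loop, which is why code written for blocking sockets can't simply be dropped in.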
Having a web framework (a next generation Flask if you will) built ground up with concurrency is the missing key. I have used Flask for many years, but this is the right time to introduce a new framework - because of the internal restructuring and slowdown of Flask's maintainers (and I say this with the utmost respect).
If you are keen, this has the potential to be the killer application for Python 3.
This was/is the big issue with Tornado, IMO (and Tornado has been around for ages in framework time). Tornado is only async if the entire call stack all the way down to the http socket is async, using callbacks instead of returning values. This means that any 3rd party client library you use has to be completely written asynchronously, and none are in Python. So you end up with a lot of tedious work re-implementing http client libraries for Twilio or Stripe or whatever you're using.
I'm curious to see where asyncio goes in python, but I'm a bit skeptical after seeing how much of a pain it was to use Tornado on a large web app. In the meantime I'll be using Gevent + Flask, which isn't perfect since it adds some magic & complexity but has the huge upside of letting you keep using all the libraries you're used to.
The main benefit of Node as I see it is that the entire Node community uses the same IO Loop whereas Python's community is fragmented between normal sync code and multiple different IO Loops (asyncio will probably help with this).
Just one point, please make sure you have designed DB access as a core part of your framework (e.g. ). Too many frameworks discount database interaction until it's too late.
Oh and please please choose your name so that it doesn't conflict on Google search. http://www.spinframework.org
 http://initd.org/psycopg/docs/advanced.html#async-support vs https://github.com/chtd/psycopg2cffi
I strongly advise against this. One of the reasons Flask is so attractive is the fact that it does not enforce any database on you.
Thanks to its decoupled design you can use it purely as a routing library, which is great! Letting the framework decide something as important as the database is a bad idea.
What frequently happens is a web framework without a thought for any kind of DB interaction (or as you put it... a routing library). In things like an async web framework, that could leave users hanging. For example, psycopg vs psycopg2 vs psycogreen vs psycopg2-cffi. Tell me which one to use and benchmark it.
I agree, it's a fine line. And we can keep going back and forth on whether a framework should "recommend" or not recommend. But in a case like this, I think there will NOT be a lot of libraries that will be compliant with the async use case. I would hope that this framework will recommend... but not ship "batteries included".
Most database libraries don't play very well with non-blocking code. In fact, nodejs DB libraries were specifically designed for this.
Building async frameworks is not trivial - http://initd.org/psycopg/docs/advanced.html#async-support
Indeed, we want to make it very easy to use, especially a clean/discoverable API, simple debugging, clear errors, etc. Which is the stuff async frameworks are not good at, and an incredible added value.
It's a lot more work than we initially thought. When you are doing async, you now think not in terms of a sequence of actions ordered in time, but in terms of events. So for your framework to be usable, you must provide hooks to:
- do something at init
- register a component
- do something when a component is registered
- do something once it's ready
- do something when there is an error
- do something when it shuts down
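As a rough illustration of the hooks listed above, here is a minimal sketch. The `App` class, its `on`/`register`/`run` methods and the event names are all hypothetical, not the API of any existing framework, and `asyncio.run` in the usage assumes Python 3.7+.

```python
import asyncio


class App:
    """Hypothetical framework core exposing lifecycle hooks (illustration only)."""

    EVENTS = ("init", "registered", "ready", "error", "shutdown")

    def __init__(self):
        self._hooks = {name: [] for name in self.EVENTS}
        self.components = []

    def on(self, event):
        """Decorator that registers an async callback for a lifecycle event."""
        def decorator(callback):
            self._hooks[event].append(callback)
            return callback
        return decorator

    async def _fire(self, event, *args):
        # Run the callbacks for one event in registration order.
        for callback in self._hooks[event]:
            await callback(*args)

    async def register(self, component):
        """Register a component, then notify the 'registered' hooks."""
        self.components.append(component)
        await self._fire("registered", component)

    async def run(self, main):
        """Drive the lifecycle: init, ready, main, then error/shutdown hooks."""
        await self._fire("init")
        try:
            await self._fire("ready")
            await main()
        except Exception as exc:
            await self._fire("error", exc)
        finally:
            await self._fire("shutdown")
```

Usage would look like `@app.on("ready")` above an `async def`, followed by `asyncio.run(app.run(main))`. A real framework would also need ordering guarantees, timeouts and error handling inside the hooks themselves, which is where much of the work goes.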
But we learned while coding the project that, with asyncio, exceptions in coroutines don't break the event loop (except KeyboardInterrupt, which is a weird hybrid), while exceptions from the loop itself do break it.
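The first half of that is easy to check with plain asyncio. A minimal sketch, with made-up names `failing` and `main`, assuming Python 3.7+ for `asyncio.run` and `asyncio.get_running_loop`:

```python
import asyncio


async def failing():
    raise ValueError("boom")


async def main():
    loop = asyncio.get_running_loop()
    task = loop.create_task(failing())
    # Give the task a chance to run and fail in the background.
    await asyncio.sleep(0.01)
    # The exception is stored on the task; the loop itself keeps running.
    loop_alive = loop.is_running()
    try:
        await task  # the stored exception is re-raised only here
    except ValueError as exc:
        return loop_alive, str(exc)
    return loop_alive, None
```

If nobody ever awaits the task or calls `task.exception()`, asyncio only logs a "Task exception was never retrieved" message when the task is garbage collected, which is part of why clear error reporting takes extra work in a framework.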
Plus you have to provide a nice setup that auto-starts the event loop with good defaults, so that people don't have to think about it for simple apps. But it must be overridable, and it must handle the case where your framework is embedded inside an already started event loop, and make that easy to do.
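One possible shape for that "auto-start, but embeddable" entry point, as a minimal sketch: `start` is a hypothetical helper, not part of asyncio, and it assumes Python 3.7+ for `asyncio.run` and `asyncio.get_running_loop`.

```python
import asyncio


def start(coro):
    """Hypothetical entry point: run `coro` whether or not a loop is already running."""
    try:
        loop = asyncio.get_running_loop()
    except RuntimeError:
        # No loop in this thread: create one with default settings,
        # run the coroutine to completion and return its result.
        return asyncio.run(coro)
    # Embedded case: schedule on the host application's loop and hand back
    # the Task so the caller decides when to await it.
    return loop.create_task(coro)


async def hello():
    return "hi"
```

Called from synchronous code, `start(hello())` returns `"hi"`. Called from inside a running loop, it returns a Task. The differing return type is a deliberate simplification here, and a real framework would probably want a more uniform contract.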
It's one of the strong points of gevent: you don't have to think about it. With asyncio/twisted, you have the benefit of explicit process switching and parallelism, but you pay the price in verbosity and complexity. We try to strike a balance, and it turns out to be harder than expected.
Then you have to provide clear error reporting, and especially iron out the implementation mixing Tasks, Futures, coroutine functions and coroutines. Provide helpers so that common scheduling is done easily...
And you haven't even talked about HTTP yet. This is just proper asyncio management. This is why nobody has made a killer framework yet: it's a looooooot of work, it's hard, and it's very easy to get it wrong. Doing like <sync framework> but async doesn't cut it.
- It's verbose compared to flask.
- It reinvents the wheel, while flask uses great components such as werkzeug.
- If you want to make a component, it's complex.
- It doesn't come battery included for the Web like Django, just the bare minimum.
- It misses the opportunity to provide task queues, RPC or PUB/SUB, which are key components of any modern stack and are made easy by having persistent connections.
- It ignores async/await's potential for unifying threads/processes/asyncio and doesn't allow easy multi-CPU use.
Or maybe it's a simple problem of aiohttp performance -- as shown in the blog post, its HTTP parser is a bit slow.
In general, I'd recommend using fewer sockets and implementing some pipelining of API requests.
Not long ago I saw an example here where someone was creating a new session for every single connection. This is not an optimal way of using it. If you make your requests within the same session, aiohttp will use keep-alive, which in turn reuses existing connections and reduces overhead. You also won't need a semaphore, since you can set a limit in TCPConnector.
Why did you have performance issues? As others said, you were making thousands of connections, and each socket needs to stay in the TIME_WAIT state for 2 minutes after closing (a limitation of TCP; SCTP does not have this problem). So if you use up all connections within a short time, you'll essentially run out of them. Some people use tcp_tw_reuse/recycle, which solves this issue, but then your connections no longer follow the RFC and you might encounter strange issues later on. The advice above should resolve your problem without any hacks.
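Roughly what that looks like in code, as a minimal sketch assuming the third-party aiohttp package is installed. The function names and the default of 20 connections are made up for illustration, and `asyncio.run` in the usage note assumes Python 3.7+.

```python
import asyncio

import aiohttp  # third-party: pip install aiohttp


async def fetch_status(session, url):
    """Fetch one URL through the shared session and return its HTTP status."""
    async with session.get(url) as response:
        # Read the body so the connection can go back to the pool.
        await response.read()
        return response.status


async def fetch_statuses(urls, max_connections=20):
    """Fetch all URLs through ONE session with a capped connection pool."""
    # The connector limit bounds simultaneous connections, so no semaphore
    # is needed around the individual requests.
    connector = aiohttp.TCPConnector(limit=max_connections)
    async with aiohttp.ClientSession(connector=connector) as session:
        tasks = [fetch_status(session, url) for url in urls]
        return await asyncio.gather(*tasks)
```

Something like `asyncio.run(fetch_statuses(list_of_urls))` would then run the whole batch over at most `max_connections` sockets, reusing them via keep-alive where the server allows it.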
I did not see any benchmarks in the repo to support this. How was this statistic determined?
Building something in C is very hard. uvloop wraps all libuv primitives in Python objects which know how to manage the memory safely (i.e. not to "free" something before libuv is done with it). So development time wise, uvloop is much better.
As for the performance, I guess you'd be able to squeeze another 5-15% if you write the echo server in C.
> I'm not surprised that python (especially cython) is faster than node in this case
Cython is a statically typed compiled language, it can be anywhere from 2x to 100x faster than CPython.
Yeah, I definitely get that. I'm just trying to see the smaller picture here.
>Cython is a statically typed compiled language, it can be anywhere from 2x to 100x faster than Python.
Ah, my bad for not knowing the difference between Cython and CPython. It seems to me, then, that this isn't really a fair comparison to node, is it? Naturally a statically typed language is going to be faster than a dynamic one. Good on you for including a comparison with Go, though.
Isn't node written in C?
Python is also a dynamic, GCed language. uvloop is built with Cython, which uses the Python object model (and all of its overhead!), and CPython C-API extensively (so it's slower than a pure C program using libuv).
uvloop shouldn't behave any differently, I've paid special attention to make sure it works exactly the same way as asyncio (down to when its objects are garbage collected).
Work on Trollius was stopped a few weeks ago, there wasn't enough interest/use. Call for interested maintainers, http://trollius.readthedocs.io/deprecated.html#deprecated
Not very, it's an alternative event loop for asyncio which was introduced in 3.4 and builds upon other Python 3 features (e.g. `yield from`)
For example you can install python 2.6, 2.7, 3.3, 3.4, & 3.5 all on one host without any conflicts. The limitation is that many distributions prefer to not maintain different versions of supposedly the same language.
If you use RedHat or CentOS you can just use https://ius.io/ and get access to the other python versions. This is one of few repos that makes sure the packages don't conflict with system ones.
How does it compare to PyPy?
Once PyPy3 is available, will this work with it?
We'll find a way!
1. The benchmarks make servers generate a huge number of objects, so maybe, the GC is under too much pressure.
2. Another possibility is that the v8 JIT can't optimize some JS code in nodejs, or does a poor job.
That said, only careful profiling can answer your question :)
How so? Clustering for node just makes it run in several OS processes. You can (and should) do the same for Python code, it's easy.
I'm surprised uvloop is faster than node.js since libuv was developed for the latter. Kudos to the author.
Sure, but it really depends on how complex your Python code is.
> I'm surprise uvloop is faster than node.js since libuv was developed for the latter. Kudos to the author.
When I see numbers like 50k requests per second it's meaningless, unless I have no database or some kind of in memory cache only db.
That said, uvloop should be fully compatible with asyncio. All APIs should be ready, so you can start testing it in your projects.
uvloop, as of now, only runs on *nix; but I hope we'll have Windows supported soon too.
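For anyone who wants to try it, here is a minimal sketch of swapping the loop in, with a fallback to the stock asyncio loop when uvloop isn't installed (for example on Windows). The helpers `make_loop`, `run` and `add` are made-up names. `uvloop.new_event_loop()` is uvloop's own loop factory, and the rest is plain asyncio.

```python
import asyncio

try:
    import uvloop  # third-party; per the comment above, *nix only for now
except ImportError:
    uvloop = None


def make_loop():
    """Return a uvloop-based event loop if available, else the default asyncio one."""
    if uvloop is not None:
        return uvloop.new_event_loop()
    return asyncio.new_event_loop()


async def add(a, b):
    # Trivial coroutine; it behaves the same on either loop implementation.
    await asyncio.sleep(0)
    return a + b


def run(coro):
    """Run one coroutine to completion on a fresh loop and clean up afterwards."""
    loop = make_loop()
    try:
        return loop.run_until_complete(coro)
    finally:
        loop.close()
```

Since the application code only talks to the asyncio API, nothing else should need to change when the loop is swapped.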
What exactly is the -streams addition that makes uvloop-streams perform so much worse than plain uvloop?
Specifically, the http server example(1) doesn't even bother using the Cluster module(2) provided by the standard library. Cluster is specifically designed for distributing server workloads across multiple cores.
All node.js services/applications I've worked on in the past 3 years (that are concerned with scale) utilize a multi-process node architecture.
The current benchmark can only claim that a single python process that spawns multiple threads is 2x faster than a single node.js process that spawns only one thread.
This fact may be interesting to some, but is irrelevant to real world performance.
Yes, in production you should run your nodejs app in cluster, your Python apps in a multiprocess configuration, and you should never use GOMAXPROCS=1 for your go apps in production!
Running all benchmarks in multiprocess configuration wouldn't add anything new to the results.
The comment above (https://news.ycombinator.com/item?id=11626762) further expands on why these kinds of benchmarks, although interesting, have no real value.
Each implementation does something wildly different and responds to different inputs with completely different outputs.
To put it metaphorically, if you put a car engine in two completely different chassis and then race them on a track, you aren't gaining any real insight into relative performance of the engine in the two vehicles.
Also, just to be clear, my qualms are with the benchmarks alone, I think the library is great! Thanks for all the hard work :)
These benchmarks are primarily comparing event loops and their performance. The TCP benchmark is very fair; HTTP, maybe not so much. The point is to show that you can write super fast servers in Python too; you just need a fast protocol parser.
As for the HTTP benchmarks, I plan to add more stuff to httptools and implement a complete HTTP protocol in it. I will rerun the benchmarks, but I don't expect more than a 20% performance drop.
Yes it's true you normally run multiple node processes in production, but you likewise normally run multiple asyncio/tornado/twisted processes in production as well. I don't see it as a big deal, or misleading to compare them in this sense.
It doesn't matter anyway; with one thread per core it would be pretty straightforward to scale on beefier machines.
> We use Python 3.5, and all servers are single-threaded. Additionally, we use GOMAXPROCS=1 for Go code, nodejs does not use cluster, and all Python servers are single-process.