
Making 1M requests with Python-aiohttp - dante9999
http://pawelmhm.github.io/asyncio/python/aiohttp/2016/04/22/asyncio-aiohttp.html
======
terom
Re the EADDRNOTAVAIL from socket.connect(),

If you're connecting to 127.0.0.1:8080, then each connection from 127.0.0.1 is
going to be assigned an ephemeral TCP source port. There are only a finite
number of such ports available, on the order of ~30-50k, which limits the
number of connections from a single address to a specific endpoint.

If you're doing 100k TCP connections with 1k concurrent connections, it's
feasible that you'll run into those limits, with TCP connections hanging
around in TIME_WAIT state after close().
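
As a rough back-of-the-envelope sketch of that limit (the 32768-60999
ephemeral range and the 60-second TIME_WAIT are common Linux defaults,
assumed here rather than taken from the post):

```python
# Back-of-the-envelope: how fast a client can burn through ephemeral
# ports when every closed connection lingers in TIME_WAIT.
# Assumed Linux defaults (check /proc/sys/net/ipv4/ip_local_port_range):
ephemeral_ports = 60999 - 32768 + 1   # ~28k usable source ports
time_wait_seconds = 60                # typical TIME_WAIT duration

# Sustainable connection rate to a single (dst ip, dst port) pair:
max_conn_per_second = ephemeral_ports / time_wait_seconds
print(round(max_conn_per_second))     # roughly 470 connections/second
```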

Not that this would be a documented errno for connect(), but it's the
interpretation that makes sense.

[http://www.toptip.ca/2010/02/linux-eaddrnotavail-address-not...](http://www.toptip.ca/2010/02/linux-eaddrnotavail-address-not.html)
[http://lxr.free-electrons.com/source/net/ipv4/inet_hashtable...](http://lxr.free-electrons.com/source/net/ipv4/inet_hashtables.c?v=4.4#L572)

~~~
ahuang
Generally it's the upper 32k ports that are ephemeral, and if you churn
through more than that per minute in connections, you'll run into the
TIME_WAIT issue.

A hacky way to get around that is to enable tcp_tw_reuse, which will let you
reuse ports, but it can be risky if you get a SYN from the previous
connection that happens to line up with the sequence number of the current
connection (which will close your connection). That shouldn't happen often,
and if you can tolerate a small amount of failure it's an easy way to get
around this limit.

[0] [http://blog.davidvassallo.me/2010/07/13/time_wait-and-port-r...](http://blog.davidvassallo.me/2010/07/13/time_wait-and-port-reuse/)

~~~
e12e
For benchmarking loopback connections, addressing really shouldn't be an
issue, as you have an entire /8 subnet (127.0.0.0/8) to split between your
client(s) and server(s). You would need some logic to set up e.g. 10,000
listening servers and 1,000,000 clients to get it working, and at some point
you'd probably run into memory or other limits.
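
A quick stdlib check of the size of that block:

```python
import ipaddress

# The IPv4 loopback block really is a full /8: any 127.x.y.z loops back.
loopback = ipaddress.ip_network("127.0.0.0/8")
print(loopback.num_addresses)  # 16777216 (2**24) distinct addresses

# So a loopback benchmark could spread load over many address pairs,
# e.g. clients on 127.0.0.2, 127.0.0.3, ... each with its own
# ephemeral port range.
```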

I'm a little surprised some simple googling didn't turn up any examples of
this - I'm sure someone has tried it out in order to do some benchmarking of
high-performance network servers/services?

Apparently IPv6 changes this to a single loopback address (::1), but then
again, with IPv6 you can use entire subnets per network card.

------
tooker
I have a library for doing coordinated async IO in Python that addresses
some of the scheduling and resource contention issues hinted at in the later
part of this post. It's called cellulario, in reference to containing async
IO mechanics inside a cell wall.

    https://github.com/mayfield/cellulario

And an example of using it to manage a multi-tiered scheme where a first
layer of IO requests seeds another layer and then you finally reduce all the
responses:

    https://github.com/mayfield/ecmcli/blob/master/ecmcli/api.py#L456

~~~
simonw
This looks really promising. I've often wanted to be able to do exactly this:
run a bunch of async code in the middle of an otherwise synchronous block
(classic example: writing a Django view which fires off a bunch of parallel
HTTP API requests and continues once all of them have either returned or timed
out).

~~~
tooker
That's almost exactly the use case I began with. It unapologetically
requires Python 3.5+, but if you're already there I'd be happy to see and
support some of your use cases. Hit me up on GitHub if you want to try it
and need some guidance (the docs are nonexistent).

------
sandGorgon
I really keep wishing there were benchmark comparisons of asyncio/aiohttp
with gevent/Python 2. Performance would be a killer reason to migrate
immediately to Py3.

What I suspect, though, is that asyncio is not all that much better than
gevent. Can someone correct me on this?

~~~
riyadparvez
Is there anything inherent to Python 3 that is slower than Python 2? Or is
it just that some of the performant packages still haven't been ported to
Python 3?

~~~
sandGorgon
I keep looking for a reason to switch to Python 3 and can't find one. Plus
if I want to use the cool stuff in PyPy... then I'd better not!

Overall - very little reason to consider Py3 at all. Performance would have
been one - if there were a comparison between gevent and asyncio.

~~~
mrweasel
> I keep looking for a reason to switch to Python 3 and can't find one.

Unicode? Not having to deal with encoding all over the place has been well
worth the switch to Python 3. If performance is a huge issue, I honestly
don't know why you would stay on Python (regardless of version).

I wouldn't want to switch back to Python 2.7 if I can avoid it. There's
honestly no reason not to go with 3.4 or 3.5 at this point, unless you
happen to have a large Python 2 code base.

~~~
sago
This is a superficially trivial bit of syntactic sugar, but an example of the
way small tweaks can provide big impact. This:

    do_something(*some_args, *some_more_args)

is rocking my world right at the moment. That's a massive time-saving
feature I've been waiting for, and worth the price of a 3.5 upgrade.
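
For reference, this is the extended unpacking from PEP 448 (Python 3.5). A
minimal runnable illustration, with names mirroring the snippet above:

```python
# Multiple unpackings in one call, and in literals too (PEP 448, 3.5+).
def do_something(*args):
    return args

some_args = [1, 2]
some_more_args = [3, 4]

print(do_something(*some_args, *some_more_args))  # (1, 2, 3, 4)
print([*some_args, *some_more_args])              # [1, 2, 3, 4]
print({**{"a": 1}, **{"b": 2}})                   # {'a': 1, 'b': 2}
```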

~~~
kenneth_reitz
Oh cool, I didn't know about this!

------
velox_io
The 1 million in the title is misleading (1M per hour is nothing to write
home about - only 278/sec). There are frameworks that are able to hit 1M per
minute plus (16,666/sec).

~~~
jorge_leria
1M per minute is something. Could you name those frameworks?

~~~
jc4p
Elixir is the name I see thrown around the most when it comes to stuff like
this: [http://www.phoenixframework.org/blog/the-road-to-2-million-w...](http://www.phoenixframework.org/blog/the-road-to-2-million-websocket-connections)

------
ben_jones
Does anyone enjoy doing async work in Python? I've done a few hobby projects,
and honestly I was yearning for JavaScript + an async lib after a while. As
great as Python is, maybe we should _yield_ async programming to the
languages designed for it?

~~~
darpa_escapee
This is a genuine question: in what ways is Python's async implementation
lacking? Could it have been baked in a better way?

In what ways are languages that were supposedly designed for async
programming different from Python?

Python is definitely lacking an elegant interface for async programming.

~~~
bdarnell
I think that Python 3.5 now has a very elegant interface for async
programming. I prefer Tornado to the standard library's asyncio, but the new
keywords are nice for both packages (disclaimer: I'm the maintainer of
Tornado).

The downsides have nothing to do with the design of the language. The problem
is that introducing a new concurrency model late in a language's life splits
the ecosystem. Most existing packages are synchronous, so if you want to build
asynchronous systems you must avoid packages like requests, django, or
sqlalchemy and find (or develop) asynchronous equivalents for the
functionality you need.

Javascript has an advantage here not because the design of the language is
especially well-suited for asynchronous programming, but because it never went
through a synchronous/multi-threading phase. Every javascript package is
designed for asynchronous use.

~~~
nbadg
Yeah, my biggest complaint personally is that combining multithreading and
async is a massive pain in the ass. Now, realistically, you aren't usually
going to want to do that, except if you have multiple event loops, or are
bridging between external synchronous code and internal async code. Otherwise,
I really enjoy async python -- of course, I'm also the kind of person who has
written my own event loops using synchronous code before, so maybe I'm just
crazy like that.

------
imaginenore
1,000,000 requests in 52 minutes is just 320 req/sec.

Am I missing something? What's so amazing about this?

I just deployed some production feed that serves 1955 requests/second on a
cheap VPS in freaking PHP, one of the slowest languages out there.

~~~
aaossa
Why do you say it's not amazing? Honestly curious here :)

~~~
imaginenore
Because it's trivial.

I would be interested in anything doing 10,000+ req/sec on a cheap VPS. 320 is
nothing.

People achieve 2 million requests/second with C++ on EC2:

[https://medium.com/swlh/starting-a-tech-startup-with-c-6b5d5...](https://medium.com/swlh/starting-a-tech-startup-with-c-6b5d5856e6de#.38xs3etwg)

~~~
Cyph0n
Absolutely excellent article. I always keep C++ at the back of my mind in case
I need it some day, so I think the list of libraries they used will be useful
for me in the future. Thanks.

~~~
gst
If you want to write fast C++ Web services I recommend a look at Seastar:
[http://www.seastar-project.org/](http://www.seastar-project.org/)

------
philippb
I'm the CTO at KeepSafe. We open sourced aiohttp.

We wrote aiohttp for our production system. We build everything on aiohttp.
In our production systems we constantly run more requests than in the
benchmark, with business logic on each request.

The main reason we like aiohttp a lot is that we can write asynchronous code
that reads like synchronous code and does not have callbacks.
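
Not KeepSafe's actual code, but a minimal sketch of the style being
described - await reads top-to-bottom like synchronous code, with no
callbacks (names are made up; asyncio.run is the modern 3.7+ entry point,
used here for brevity):

```python
import asyncio

async def fetch_user(user_id):
    await asyncio.sleep(0)  # stand-in for a real network round-trip
    return {"id": user_id}

async def handler(user_id):
    user = await fetch_user(user_id)  # reads like a blocking call
    return "hello, user %d" % user["id"]

print(asyncio.run(handler(42)))  # hello, user 42
```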

------
takeda
IMO you should place all requests within a single ClientSession().

This will provide two benefits:

1\. You won't need to use a semaphore. To limit connections you create a
TCPConnector() object with its limit set to the value you used in the
semaphore and pass it to the ClientSession(); aiohttp will then not make
more connections than the limit set (the default behavior is an unlimited
number of connections).

2\. With a single ClientSession(), aiohttp will make use of keep-alive (i.e.
it will reuse the same connections for subsequent requests, but it will keep
at most the limit of connections you set in the TCPConnector() object).

This should improve performance further, and (given a sane limit) it'll also
solve the issue with the "Cannot assign requested address" error.

BTW: Even without a limit set, aiohttp will try to reduce the number of open
connections, so it might still fix the connection error issue as long as
individual requests don't take long. It's still a good idea to set a limit,
just to be nice to the remote server.
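
A sketch of the pattern described above, assuming aiohttp's
ClientSession/TCPConnector API (the URL list and limit are placeholders):

```python
import asyncio

import aiohttp

async def fetch_all(urls, limit=1000):
    # One shared session; the connector caps concurrent connections,
    # so no client-side semaphore is needed, and keep-alive reuses
    # sockets across requests.
    connector = aiohttp.TCPConnector(limit=limit)
    async with aiohttp.ClientSession(connector=connector) as session:

        async def fetch(url):
            async with session.get(url) as resp:
                return await resp.read()

        return await asyncio.gather(*(fetch(u) for u in urls))

# bodies = asyncio.run(fetch_all(["http://localhost:8080/"] * 1000))
```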

------
nbadg
First off, awesome to see more benchmarks (even if it's just personal
experimentation) for synchronous vs asyncio performance. I think the real
argument for asyncio _right now_ is that it makes it very easy for you to
write extremely efficient code, even for hobbyist projects. Even though your
experiment is only handling 320 req/s, that you were able to do that so
quickly and with very, very little optimization is, I think, a testament to
the potential for asyncio.

Some pointers:

The event loop is still a single thread and therefore subject to the GIL. That
means that at any given time, only one coroutine is running in the loop. This
is important for several reasons, but probably the most relevant are that

1\. within any given coroutine, execution flow will always be consistent
between yield/await statements.

2\. synchronous calls within coroutines will _block the entire event loop_.

3\. most of asyncio was not written with thread safety in mind.

That second one is really important. When you're doing file access, eg where
you're doing "with open('frank.html', 'rb')", that's something you may want to
consider moving into a run_in_executor call. That _will_ block the coroutine,
but it will return control to the event loop, allowing other connections to
proceed.

Also, more likely than not, the "too many open files" error is a result of
you opening frank.html, not of sockets. I haven't run your code with asyncio
in debug mode[1] to verify that, but that would be my intuition. You would
probably handle more requests if you changed that -- I would do the file
access in a run_in_executor with a max executor workers of 1000. If you want
to surpass that, use a process pool instead of a thread pool, and you should
be ready to go, though it's worth mentioning that disk IO is hardly ever
CPU-bound, so I wouldn't expect much of a performance boost otherwise.
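
A minimal sketch of that run_in_executor suggestion (the path is a
placeholder, and the worker count is the 1000 suggested above):

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# A thread pool for blocking disk reads; the event loop keeps serving
# other connections while a worker thread sits in open()/read().
executor = ThreadPoolExecutor(max_workers=1000)

def read_file(path):
    with open(path, "rb") as f:  # blocking call, runs off the loop
        return f.read()

async def handle(path):
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, read_file, path)
```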

Also, the placement of your semaphore acquisition doesn't make any sense to
me. I would create a dedicated coroutine like this:

    async def bounded_fetch(sem, url):
        async with sem:
            return await fetch(url)

and modify the parent function like this:

    for i in range(r):
        task = asyncio.ensure_future(bounded_fetch(sem, url.format(i)))
        tasks.append(task)

That being said, it also doesn't make any sense to me to have the semaphore in
the client code, since the error is in the server code.

[1] [https://docs.python.org/3/library/asyncio-dev.html#debug-mod...](https://docs.python.org/3/library/asyncio-dev.html#debug-mode-of-asyncio)

~~~
dante9999
Thanks for the feedback.

> You would probably handle more requests if you changed that -- I would do
> the file access in a run_in_executor with a max executor workers of 1000.

This is a really good point. I'm going to check this and edit the post to
add this information there.

> Also, the placement of your semaphore acquisition doesn't make any sense to
> me. I would create a dedicated coroutine like this:

Looking at my semaphore code the day after writing it, I do wonder if I'm
using it correctly. I assumed it works correctly because it fixed my "too
many open files" exception, so it seems to mean that I'm no longer exceeding
the 1024 open files limit. Can you clarify why you think my use of the
semaphore does not make sense and why your suggestion is better? What is the
benefit of a dedicated coroutine?

> That being said, it also doesn't make any sense to me to have the semaphore
> in the client code, since the error is in the server code.

I admit that I focused more on my client than on the server. One thing that
worries me about my test server is that it does not print any exceptions.
Either it does not fail at all, which seems unlikely, or it fails silently,
which is more likely and is bad. So I need to check my server code to see
what exactly happens there.

> it also doesn't make any sense to me to have the semaphore in the client
> code, since the error is in the server code.

The main reason for the semaphore in the client code is that it should stop
the client from making over 1k connections at a time. My logic here is that
if the client won't make 1k connections at a time, the server won't receive
1k connections at a time, and thus there will be no problem of too many open
files on the server (it won't have to send more than 1k responses). However,
I see that this logic may not be totally correct; another comment points out
that it's possible for sockets to "hang around" after closing:
[https://news.ycombinator.com/item?id=11557672](https://news.ycombinator.com/item?id=11557672)
so I need to review that and edit the post.

> [https://docs.python.org/3/library/asyncio-dev.html#debug-mod...](https://docs.python.org/3/library/asyncio-dev.html#debug-mode-of-asyncio)

This looks really great, I will look into it. Thanks.

~~~
nbadg
No problem, it's especially hard to find external feedback for side projects
and experiments so I try to give it when I can.

> I assumed it works correctly because it fixed my "too many open files"
> exception

It works, so at the end of the day that's what matters. The client vs server
question, from my perspective, ultimately comes down to a question of test
realism; in a real-world deployment you couldn't limit connections with
client-side code because there are multiple clients. That's what I mean by "it
doesn't make sense given that the error is server-side".

> Can you clarify why you think my use of the semaphore does not make sense
> and why your suggestion is better? What is the benefit of a dedicated
> coroutine?

I'm saying that mostly, but not exclusively, from a division of concerns
standpoint. You're acquiring the semaphore in a completely different context
than you're releasing it. On the one hand, that's partly a programming style
issue. On the other hand, it can also have some really important consequences:
for example, it's actually the event loop itself that is releasing the
semaphore for you when the task is done. Because of the way the event loop
works, it's hard to say exactly when the semaphore will be released. You want
to hold it for the absolute minimum time possible, since it's holding up
execution of other connections in the loop. Putting it into a dedicated
coroutine makes it clearer what's going on, makes it such that the acquirer
and releaser of the semaphore are the same, and means you are definitely
holding the semaphore for the minimum amount of time possible (since, again,
execution flow will not leave any particular coroutine until you yield/await
another). In general I would say that releasing the semaphore in a callback is
significantly more fragile, and mildly to moderately less performant, than
creating a dedicated coroutine to hold the semaphore and handle the request.

Does that all make sense?

> Either it does not fail at all, which seems unlikely, or it fails silently,
> which is more likely and is bad.

That's a fair statement, I think. As an aside, the print statement is slow, so
keep that in mind. It might actually be faster to have a single memory-mapped
file for the whole thing, and then just append the error and traceback to the
file. The built-in traceback library can be very useful for that. That's also
a bit more realistic, since obviously IRL you wouldn't be using a print
statement to keep track of errors. On a similar note, because file access is
so slow, you'd be best off figuring out some way to remove the part where the
server accesses the disk once per connection entirely. On a real-world system
you'd possibly use some kind of memory caching system to do that, especially
if you're just reading files and not writing them. That allows you to use a
little more memory (potentially as little as enough to have a single copy of
the file in memory) to drastically improve performance.
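
A minimal sketch of that caching idea - a plain in-process dict; a real
deployment might use memcached or an LRU cache with invalidation instead:

```python
_cache = {}

def cached_read(path):
    # Only the first request for a given path hits the disk; every
    # later request is served straight from memory.
    if path not in _cache:
        with open(path, "rb") as f:
            _cache[path] = f.read()
    return _cache[path]
```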

~~~
dante9999
> Does that all make sense?

yeah it does make sense.

> in a real-world deployment you couldn't limit connections with client-side
> code

Yeah, that's a very good point. But in a real-world scenario, handling this
would not be that easy. Limiting the number of available connections on the
server side is not a trivial task to implement. Setting up your server to
avoid failures and simply return either 503 (service unavailable) to some
clients or 429 (too many requests) to others would probably require quite a
lot of coding. It's also not very clear to me how this would be implemented;
how do people implement things like this? Just putting a check for the
number of open files before the line that opens the file, and setting the
response code to 500 or 429 before opening the file? This would only stop
the server from opening too many "html" files, but would not stop the server
from getting flooded with connections. Is my aiohttp app even the right
place to add checks like this? Wouldn't it be better to use haproxy or nginx
or some other load-balancing service in front of the aiohttp app and let it
handle excess traffic?

Another thing that comes to mind (I need to check this later) is that
perhaps some partial "handling" of cases like this could/should be
implemented in the aiohttp library. I'm not sure how it behaves now, but
maybe it should simply fail to open the file, return 500 to the client, and
print a noisy traceback about open files to my logs? I didn't see this
behavior when doing my tests, so either it didn't occur, it is not
implemented in aiohttp, or it occurred and I somehow missed it. From my
experience with Twisted I know that this is how Twisted resources behave: if
you have an unhandled exception, Twisted just returns 500 to the client and
shows the traceback in the logs.

~~~
nbadg
Keep in mind that 5XX error codes are for server errors and 4XX codes are for
client errors. Returning 429 would imply "too many connections (from your
computer)", not total for the service. Choosing to return a 503 for over-taxed
servers is, as far as I can tell, done maybe half the time. Depending on the
kind of service you're running, you might want to enforce a server timeout
that says "after a certain number of milliseconds of local response time,
return a 5XX error code and abandon the connection". That would be a
particular component in an overall strategy for handling high load, which
would heavily bias towards serving the easiest responses first. That may or
may not be a good idea: what if the "expensive" requests are from paying
customers accessing account pages, and the "inexpensive" ones are from a
sudden spike in traffic to your homepage due to some good press somewhere? Of
course eventually, you'd want to separate these two kinds of traffic entirely,
such that customers are only affected by outages that they create. You can
then focus on expanding your capacity to handle customers directly, instead of
trying to lump that in with the much more unpredictable behavior of general
web traffic.

> Just putting a check for the number of open files before the line that
> opens the file, and setting the response code to 500 or 429 before opening
> the file?

So actually this is one of the big benefits of putting the semaphore limiting
file access within its own dedicated coroutine (except on the server side
instead of the client). It allows you to handle the connection without having
to deal with immediate responses. What that means in practice is that your
server will be slower to respond under high load, but until it hits the
client's (browser) timeout limit, you'll still be able to respond. It actually
doesn't require any extra code to do that. Note that this isn't the only way
to achieve this result, but it's probably the most direct, and simplest,
especially given the approach you've taken with the code thus far.
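
A toy, self-contained illustration of that server-side pattern (pure
asyncio, not an actual aiohttp handler): a semaphore caps how many requests
touch the simulated disk at once, and the excess simply waits instead of
failing:

```python
import asyncio

async def handle(sem, counter):
    async with sem:  # under load, wait here instead of erroring out
        counter["now"] += 1
        counter["peak"] = max(counter["peak"], counter["now"])
        await asyncio.sleep(0.01)  # stand-in for the blocking file read
        counter["now"] -= 1

async def main(n_requests=50, limit=10):
    sem = asyncio.Semaphore(limit)
    counter = {"now": 0, "peak": 0}
    await asyncio.gather(*(handle(sem, counter) for _ in range(n_requests)))
    return counter["peak"]

print(asyncio.run(main()))  # peak concurrency never exceeds the limit (10)
```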

A load balancer sits on top of that, ideally monitoring metrics like server
CPU usage, memory load, or (most directly) request response time, and then
shifts around requests between servers accordingly, to minimize the delay
incurred in the aforementioned "wait for semaphore (or other synchronization
primitive)" part.

At the end of the day, until you start hitting the limit of concurrent
connections that others have mentioned, you don't really actually need to
worry very much about how many connections you have open at once. You just
want to focus on handling every connection you have as quickly as possible.

------
henryw
Looks pretty interesting to do async in Python. I once did something similar
in Node (async by default) with a few lines of code. I think I scraped 12 or
20 million real URLs in 8 hours on a $5 cloud VM. It was limited by network
bandwidth.

------
azinman2
"Everyone knows that asynchronous code performs better when applied to network
operations"

Ummm that seems a bit far reaching.

~~~
15155
It depends on what "network operations" you are trying to do.

For high-concurrency purposes, asynchronous programming is far more scalable
(see: epoll/kqueue + state machines).

For high-throughput, low-concurrency operations, it doesn't matter as much.

~~~
azinman2
I happen to know of a very major tech company whose scale is insane, yet
their core C++ code is based on highly tuned blocking threads. It's not a
given that async is the only way to scale.

