
Async Python is not faster - haybanusa
http://calpaterson.com/async-python-is-not-faster.html
======
phodge
How is this result surprising? The point of coroutines isn't to make your code
execute faster, it's to prevent your process sitting idle while it waits for
I/O.

When you're dealing with external REST APIs that take multiple seconds to
respond, then the async version is substantially "faster" because your process
can get some other useful work done while it's waiting. Obviously the async
framework introduces some overhead, but that bit of overhead is probably a lot
less than the 3 billion cpu cycles you'll waste waiting 1000ms for an external
service.
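
To illustrate (a toy sketch, with sleep standing in for some slow external
API; both "requests" complete in about one second of wall-clock time instead
of two):

        import asyncio

        async def call_external_api(name):
            # stand-in for a slow REST call; while this coroutine is
            # suspended the event loop is free to run something else
            await asyncio.sleep(1)
            return name

        async def main():
            # both calls are in flight at once: ~1s wall clock, not ~2s
            print(await asyncio.gather(call_external_api("users"),
                                       call_external_api("orders")))

        asyncio.run(main())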

~~~
calpaterson
I think it is surprising to a lot of people who do take it as read that async
will be faster.

As I describe in the first line of my article, I don't think that people who
think async is faster have unreasonable expectations. It seems very intuitive
to assume that greater concurrency would mean greater performance - at least
on some measure.

> When you're dealing with external REST APIs that take multiple seconds to
> respond, then the async version is substantially "faster" because your
> process can get some other useful work done while it's waiting.

I'm afraid I also don't think you have this right conceptually. An async
implementation that does multiple ("embarrassingly parallel") tasks in the
same process - whether that is DB IO waiting or microservice IO waiting - is
not necessarily a performance improvement over a sync version that just starts
more workers and has the OS kernel scheduler organise things. In fact in
practice an async version is normally lower throughput, higher latency and
more fragile. This is really what I'm getting at when I say async is not
faster.

Fundamentally, you do not waste "3 billion cpu cycles" waiting 1000ms for an
external service. Making alternative use of the otherwise idle CPU is the
purpose (and IMO the proper domain) of operating systems.

~~~
john-radio
> Fundamentally, you do not waste "3 billion cpu cycles" waiting 1000ms for an
> external service. Making alternative use of the otherwise idle CPU is the
> purpose (and IMO the proper domain) of operating systems.

Sure, the operating system can find other things to do with the CPU cycles
when a program is IO-locked, but that doesn't help _the program_ that you're
in the situation of currently trying to run.

> An async implementation that does multiple ("embarrassingly parallel") tasks
> in the same process - whether that is DB IO waiting or microservice IO
> waiting - is not necessarily a performance improvement over a sync version
> that just starts more workers and has the OS kernel scheduler organise
> things. In fact in practice an async version is normally lower throughput,
> higher latency and more fragile. This is really what I'm getting at when I
> say async is not faster.

You're right. "Arbitrary programs will run faster" is not the promise of
Python async.

Python async does help a program work faster in the situation that phodge just
described (waiting for web requests, or waiting for a slow hardware device),
since the program can do other things while waiting for the locked IO (unlike
a Python program that does not use async and could only proceed linearly
through its instructions). That's the problem that Python asyncio purports to
solve. It is still subject to the Global Interpreter Lock, meaning it's still
bound to one thread. (Python's multiprocessing library is needed to overcome
the GIL and break a program out into multiple processes, at the cost that
cross-process communication now becomes expensive.)

~~~
quietbritishjim
> unlike a Python program that does not use async and could only proceed
> linearly through its instructions

This isn't how it works. While Python is blocked in I/O calls, it releases the
GIL so other threads can proceed. (If the GIL were never released then I'm
sure they wouldn't have put threading in the Python standard library.)

> Python's multiprocessing library is needed to overcome the GIL

This is technically true, in that if you _are_ running up against the GIL then
the only way to overcome it is to use multiprocessing. But blocking IO isn't
one of those situations, so you can just use threads.
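
A trivial sketch of that (time.sleep standing in for any blocking call that
releases the GIL, e.g. a socket read):

        import threading, time

        def blocking_io():
            time.sleep(1)  # the GIL is released while this thread blocks

        start = time.time()
        threads = [threading.Thread(target=blocking_io) for _ in range(10)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        print(time.time() - start)  # ~1 second, not ~10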

The comparison here is not async vs just doing one thing. It's async vs
threads. I believe that's what the performance comparison in the article is
about, and if threads were as broken as you say then obviously they wouldn't
have performed better than asyncio.

--------

As an aside, many C-based extensions also release the GIL when performing
CPU-bound computations, e.g. numpy and scipy. So the GIL doesn't even prevent
you from using multithreading in CPU-heavy applications, so long as they are
relatively large operations (e.g. a few calls to multiply huge matrices
together would parallelise well, but many calls to multiply tiny matrices
together would heavily contend the GIL).
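
Roughly this shape, for instance (a sketch only; the actual speedup depends on
your BLAS build and core count):

        from concurrent.futures import ThreadPoolExecutor
        import numpy as np

        matrices = [np.random.rand(2000, 2000) for _ in range(4)]

        def square(m):
            return m @ m  # numpy releases the GIL inside the multiply

        # the threads genuinely overlap because the heavy lifting happens
        # in C with the GIL released
        with ThreadPoolExecutor(max_workers=4) as pool:
            results = list(pool.map(square, matrices))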

~~~
gshulegaard
> > Python's multiprocessing library is needed to overcome the GIL

> No it's not, just use threads.

I just wanted to expand on this a little to describe some of the downsides to
threads in Python.

Multi-threaded logic can be (and often is) _slower_ than single-threaded logic
because threading introduces the overhead of lock contention and context
switching. David Beazley did a talk illustrating this in 2010:

[https://www.youtube.com/watch?v=Obt-vMVdM8s](https://www.youtube.com/watch?v=Obt-vMVdM8s)

He also did a great talk about coroutines in 2015 where he explores threading
and coroutines a bit more:

[https://www.youtube.com/watch?v=MCs5OvhV9S4&t=525s](https://www.youtube.com/watch?v=MCs5OvhV9S4&t=525s)

In workloads that are often "blocked", like network calls or I/O-bound
workloads, threads can provide similar benefits to coroutines but with more
overhead. Coroutines seek to provide the same benefit without as much overhead
(no lock contention, fewer context switches by the kernel).

These are probably not the right guidelines for everyone, but I generally use
them when thinking about concurrency (and pseudo-concurrency) in Python:

- Coroutines where I can.

- Multi-processing where I need real concurrency.

- Never threads.
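
In code, those guidelines come out roughly like this (a sketch; `fetch` and
`crunch` are hypothetical stand-ins for IO-bound and CPU-bound work):

        import asyncio
        from concurrent.futures import ProcessPoolExecutor

        async def fetch(url):  # hypothetical IO-bound task
            await asyncio.sleep(0.1)
            return url

        def crunch(n):  # hypothetical CPU-bound task
            return sum(i * i for i in range(n))

        async def main():
            # coroutines where I can: IO-bound fan-out on one event loop
            pages = await asyncio.gather(*(fetch(u) for u in "abc"))
            # multi-processing where I need real concurrency
            loop = asyncio.get_running_loop()
            with ProcessPoolExecutor() as pool:
                totals = await asyncio.gather(
                    *(loop.run_in_executor(pool, crunch, 10**6)
                      for _ in range(4)))
            return pages, totals

        if __name__ == "__main__":
            asyncio.run(main())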

~~~
quietbritishjim
Ah ha! Now we have finally reached the beginning of the conversation :-)

The point is, many people think (including you judging by your comment, and
certainly including me up until now but now I'm just confused) that in Python
asyncio is better than using multiple threads with blocking IO. The point of
the article is to dispel that belief. There seems to be some debate about
whether the article is really representative, and I'm very curious about that.
But then the parent comment to mine took us on an unproductive detour based on
the misconception that Python threads don't work at all. Now your
comment has brought up that original belief again, but you haven't referenced
the article at all.

~~~
gshulegaard
I didn't reference the article because I provided more detailed references
which explore the difference between threads and coroutines in Python in much
greater depth.

The point of my comment is to say that neither threads nor coroutines will
make Python _faster_ in and of themselves. Quite the opposite, in fact:
threading adds overhead, so unless the benefit is greater than the overhead
(e.g. lock contention and context switching) your code will actually be net
slower.

I can't recommend the videos I shared enough, David Beazley is a great
presenter. One of the few people who can do talks centered around live coding
that keep me engaged throughout.

> The point is, many people think (including you judging by your comment, and
> certainly including me up until now but now I'm just confused) that in
> Python asyncio is better than using multiple threads with blocking IO. The
> point of the article is to dispel that belief.

The disconnect here is that this article isn't claiming that asyncio is not
faster than threads. In fact the article only claims that asyncio is not a
silver bullet guaranteed to increase the performance of any Python logic. The
misconception it is trying to clear up, in its own words, is:

> Sadly async is not go-faster-stripes for the Python interpreter.

What I, and many others are questioning is:

A) Is this actually as widespread a belief as the article claims it to be?
None of the results are surprising to me (or apparently some others).

B) Is the article accurate in its analysis and conclusion?

As an example, take this paragraph:

> Why is this? In async Python, the multi-threading is co-operative, which
> simply means that threads are not interrupted by a central governor (such as
> the kernel) but instead have to voluntarily yield their execution time to
> others. In asyncio, the execution is yielded upon three language keywords:
> await, async for and async with.

This is a really confusing paragraph because it seems to mix terminology. A
short list of problems in this quote alone:

- Async Python != multi-threading.

- Multi-threading is not co-operatively scheduled; threads are indeed
interrupted by the kernel (context switches between threads in Python do
actually happen).

- Asyncio is co-operatively scheduled, and pieces of logic have to yield to
allow other logic to proceed. This is a key difference between asyncio
(coroutines) and multi-threading (threads).

- Asynchronous Python can be implemented using coroutines, multi-threading,
or multi-processing; it's a common noun, but the quote uses it as a proper
noun, leaving us guessing what the author intended to refer to.
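
To make the co-operative point concrete, a toy example (not from the article):
a coroutine that never awaits starves everything else on the loop, where the
kernel would happily have preempted an equivalent thread:

        import asyncio

        async def hog():
            # CPU-bound and never awaits: nothing else runs meanwhile,
            # because asyncio only switches tasks at await points
            sum(i for i in range(30_000_000))

        async def polite():
            for _ in range(3):
                print("tick")
                await asyncio.sleep(0)  # explicit yield back to the loop

        async def main():
            await asyncio.gather(hog(), polite())

        asyncio.run(main())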

Additionally, there are concepts and interactions which are missing from the
article such as the GIL's scheduling behavior. In the second video I shared,
David Beazley actually shows how the GIL gives compute intensive tasks higher
priority which is the opposite of typical scheduling priorities (e.g. kernel
scheduling) which leads to adverse latency behavior.

So looking at the article as a whole, I don't think the underlying intent of
the article is wrong, but the reasoning and analysis presented are at best
misguided. Asyncio is not a performance silver bullet; it's not even real
parallelism. Multi-processing and use of C extensions are the bigger bang for
the buck when it comes to performance. But none of this is surprising, and it
is expected if you really think about the underlying interactions.

To rephrase what you think I thought:

> The point is, many people think (including you judging by your comment, and
> certainly including me up until now but now I'm just confused) that in
> Python asyncio is better than using multiple threads with blocking IO.

Is actually more like:

> Asyncio is more efficient than multi-threading in Python. It is also
> comparatively more variable than multi-processing, particularly when dealing
> with workloads that saturate a single event loop. Neither multi-threading
> nor asyncio is actually parallel in Python; for that you have to use multi-
> processing to escape the GIL (or some C extension which you trust to safely
> execute outside of GIL control).

---

Regarding your aside example, it's true some C extensions can escape the GIL,
but oftentimes it's with caveats and careful consideration of where/when you
can escape the GIL successfully. Take for example this scipy cookbook
regarding parallelization:

[https://scipy-cookbook.readthedocs.io/items/ParallelProgramm...](https://scipy-cookbook.readthedocs.io/items/ParallelProgramming.html)

It's not often the case that using a C extension will give you truly
concurrent multi-threading without significant and careful code refactoring.

~~~
camgunz
For single processes you’re right, but this article (and a lot of the activity
around asyncio in Python) is about backend webdev, where you’re already
running multiple app servers. In this context, asyncio is almost always
slower.

------
orf
His async code creates a pool with only 10 max connections[1] (the default),
whereas his sync pool[2], with a Flask app that has 16 workers, has
significantly more database connections.

I expect upping this number would have a positive effect on asyncio numbers
because the only thing[3] this[4] is[5] measuring[6] is how many database
connections you have, and is about as far from a realistic workload as you can
get.

Change your app to make 3 parallel requests to httpbin, collect the responses
and insert them into the database. That's an actually realistic asyncio
workload rather than a single DB query on a very contested pool. I'd be very
interested to see how sync frameworks fare with that.
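
Something in this shape, say (a rough sketch, assuming aiohttp and asyncpg and
a made-up `responses` table; not code from his repo):

        import asyncio
        import aiohttp
        import asyncpg

        async def fetch(session, url):
            async with session.get(url) as resp:
                return await resp.text()

        async def handle_request(pool, session):
            # three upstream calls in parallel, then one insert
            bodies = await asyncio.gather(
                fetch(session, "https://httpbin.org/get"),
                fetch(session, "https://httpbin.org/uuid"),
                fetch(session, "https://httpbin.org/ip"))
            async with pool.acquire() as conn:
                await conn.execute(
                    "INSERT INTO responses (a, b, c) VALUES ($1, $2, $3)",
                    *bodies)

        async def main():
            pool = await asyncpg.create_pool("postgresql://localhost/test")
            async with aiohttp.ClientSession() as session:
                await handle_request(pool, session)

        asyncio.run(main())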

1. [https://github.com/calpaterson/python-web-perf/blob/master/a...](https://github.com/calpaterson/python-web-perf/blob/master/async_db.py#L9)

2. [https://github.com/calpaterson/python-web-perf/blob/master/s...](https://github.com/calpaterson/python-web-perf/blob/master/sync_db.py#L11)

3. [https://github.com/calpaterson/python-web-perf/blob/master/a...](https://github.com/calpaterson/python-web-perf/blob/master/app_aio.py#L8)

4. [https://github.com/calpaterson/python-web-perf/blob/master/a...](https://github.com/calpaterson/python-web-perf/blob/master/app_flask.py#L11)

5. [https://github.com/calpaterson/python-web-perf/blob/master/a...](https://github.com/calpaterson/python-web-perf/blob/master/app_sanic.py#L11)

6. [https://github.com/calpaterson/python-web-perf/blob/master/a...](https://github.com/calpaterson/python-web-perf/blob/master/app_starlette.py#L9)

~~~
calpaterson
Hi - as mentioned in the article all connections went through pgbouncer
(limited to 20) and I was careful to ensure that all configurations saturated
the CPU so I'm pretty confident they were not waiting on connections to open.
Opening a connection from pgbouncer over a unix socket is very fast indeed -
my guess is perhaps a couple of orders of magnitude faster than without it. 20
connections divided by 4 CPUs is a lot, and pretty much all CPU time was still
spent in Python.

Sidenote here: one thing I found but didn't mention (the reason I put in the
pooling, both in Python and pgbouncer) is that otherwise, under load, the
async implementations would flood postgres with open connections and everything
would just break down.

I think making a database query and responding with JSON is a very realistic
workload. I've coded that up many times. Changing it to make requests to other
things (mimicking a microservice architecture) is also interesting and if you
did that I'd be interested to read your write up.

~~~
supermatt
Aren't you still capping the throughput by the query rate of your connection
pool though? By limiting that, you are limiting the application as a whole -
i.e. your benchmark is bound by the speed of your database, and has (almost)
nothing to do with the performance of a specific python implementation.

~~~
arghwhat
Only if there are spare resources left to saturate the connection pools, which
didn't seem to be the case.

If the system as a whole is well saturated, and the python processes dominate
the system load with a DB load proportional to the requests served, then I
don't think we would hit any external bottlenecks.

The benchmarks performed are not that great (e.g., virtualized, same machine
for all components, etc.), but I don't think the errors are enough to throw
off the result. Note, of course, that such results are not universal, and
_some_ loads might perform better async.

------
zzzeek
I am SUPER happy someone else is finally looking at this. It is long past time
that the reflexive use of asyncio or systems like gevent/eventlet for no other
reason than "hand-wavy SPEED" come to an end. That web applications that
literally serve just one user at a time are built in Tornado for "speed". (my
example for this is the otherwise excellent SnakeViz:
[https://jiffyclub.github.io/snakeviz/](https://jiffyclub.github.io/snakeviz/)
which IMO should have just used wsgiref).

As the blog post apparently cites as well (woo!), I've written about the myth
of "async == speed" some years ago here and my conclusions were identical.

[https://techspot.zzzeek.org/2015/02/15/asynchronous-python-a...](https://techspot.zzzeek.org/2015/02/15/asynchronous-python-and-databases/)

~~~
calpaterson
Hi - yes loved your blogpost! Also very tired of the "async magic performance
fairy dust" :)

It's a difficult myth to dispel and I think the situation in terms of public
mindshare is much worse now than it was in 2015. Some very silly claims from
the async crowd now have basically widespread credence. I think one of the
root causes is that people are sometimes very woolly about how
multi-processing works. One of the others is that I think it's easy to make
the conceptual mistake of 1 sync worker = 1 async worker and do a comparison
that way.

One of my worries is that right now it feels like everything in Python is
being rewritten in asyncio and the balkanisation of the community could well
be more problematic than 2 vs 3.

~~~
throwaway894345
For me it's worth the effort to deal with async if it means not having to deal
with uwsgi or other frontends. But in general I think Python has so many
problems (packaging, performance, distribution, etc) that it doesn't make
sense IMO to invest in new Python projects.

~~~
1337shadow
uWSGI is a lot of joy for me, really; I've never been happier with my
deployments since I discovered uWSGI back in 2008 or so, and nowadays it
supports plenty of languages, so there's just nothing I don't deploy on uWSGI
anymore.

Python packaging is something that I have fully automated (maintaining over 50
packages here) and that I'm pretty happy with.

I fail to see the problem with Python packaging, maybe because I have an
aggressive continuous integration practice? (Always integrate upstream
changes, contribute to dependencies that I need, and when I'm not doing TDD
it's only because I don't yet have proof that the code I'm writing is actually
going to be useful.) That's not something everybody wants to do (I don't
understand their reasoning though).

People would rather freeze their dependencies and then cry because upgrading
is a lot of work, instead of upgrading at the rhythm of upstream releases. If
other package managers or other languages have packaging features that
encourage what I consider to be non-continuous integration then good for them,
but that's not how a hacker like me wants to work. Being able to "ignore
upstream releases" is not a good feature; it made me a sad developer, really.
"Ignoring non-latest releases" has made me a really happy developer.

Most performance issues are not imputable to the language. If they are,
they're probably not affecting all your features; you can still rewrite the
feature that Python is not performing well for in a compiled language. I need
most of my code to be easy to manipulate, and very little of it to actually
outperform Python.

I've recently re-assessed whether I should keep going with Python for another
10 years, tried a bunch of languages and frameworks; at the end of the month I
still wanted a language that is easy to manipulate with basic text tools,
that's sufficiently easy so that I can onboard junior colleagues on my tools,
and that provides sufficiently advanced OOP, because I find it efficient for
structuring and reusing code.

Python does what it claims, it solves a basic human-computer problem; let's
face it: it's here to stay and shine, and its wide ecosystem seems like solid
proof. Whether it makes sense to invest in a project or not should not depend
on the language anyway.

~~~
throwaway894345
> uWSGI is a lot of joy for me, there's nothing I don't deploy on uWSGI, even
> PHP code.

Oh man, we moved away from uwsgi to async a couple of years ago and that's
been one of the best decisions we've made. Async is no walk in the park, but
not having to deal with uwsgi configuration, etc has been well worth it.

> Python packaging is something that I have fully automated (maintaining over
> 50 packages here) and that I'm pretty happy with.

Yeah, I don't doubt this. Many people have found a happy path that works for
them, but I've found that those tend to be people who don't have significant
constraints (e.g., they don't need fast builds, or they don't care about
reproducibility, or they don't have to deal with a large number of regular
contributors, or etc).

> Most performance issues are not imputable to the language.

This isn't true in a meaningful sense. For the most part, if you're doing
anything more complicated than a CRUD app, you will run into performance
problems with Python almost immediately upon leaving the prototype phase, and
your main options for improving performance are horizontal scaling
(multiprocess/multihost parallelism) or rewriting the hot path in a faster
language. As previously discussed, these options only work for certain use
cases where the ratio of de/serialization to real work is low, so you often
find yourself without options. Further, horizontal scaling is expensive
(compute is expensive) and rewriting in a different language is differently
expensive (you now have to integrate a separate build system and employ
developers who are not only well-versed in the new language, but also in
implementing Python extensions specifically).

On the other hand, if you chose a language like Go, you would be in the same
ballpark of maintainability, onboarding, etc (many would argue Go is easier to
write and maintain due to simplicity and static typing) but you would be in a
much better place with respect to packaging and performance. You likely
wouldn't need to optimize anything since naive Go tends to be 10-100X faster
than naive Python, and if you needed to optimize, you can do so in-language
without paying any sort of de/serialization overhead (parallelism, memory
management, etc), allowing you to eke out another magnitude of performance.
There are other options besides Go that also give performance gains, but they
often involve trading off
simplicity/packaging/deployment/tooling/ecosystem/etc.

> If they are, it's probably not affecting all your features, you can still
> rewrite the feature that Python is not well performing for into a compiled
> language.

This is true, but "rewriting features" is usually prohibitively expensive, and
it's often non-trivial to figure out up-front which features will have
performance problems in the future such that you could otherwise avoid a
rewrite.

> Python does what it claims, it solves a basic human-computer problem; let's
> face it: it's here to stay and shine.

Yes, Python is here to stay, but that's more attributable to network effects
and misinformation than merit in my experience.

~~~
1337shadow
Well, we can't use uWSGI for ASGI, but it's still good for us for anything
else; I literally have zero uWSGI configuration files, just a uWSGI command in
a container command.

> Many people have found a happy path that works for them, but I've found that
> those tend to be people who don't have significant constraints (e.g., they
> don't need fast builds, or they don't care about reproducibility, or they
> don't have to deal with a large number of regular contributors, or etc).

I'm really curious about this statement: building a Python codebase for me
means building a container image, and if the system packages or Python
dependencies don't change then it's really going to take less than a minute.
What does your build look like?

Can you define "a large number of regular contributors"?

What do you mean "they don't need reproducibility"? I suppose they just build
a container image in a minute and then go over and deploy on some host. If a
dependency breaks the code, it's still reproducible, but broken; then it means
it has to be fixed, rather than ignored. A temporary version pin is fine
though.

> This is true, but "rewriting features" is usually prohibitively expensive,
> and it's often non-trivial to figure out up-front which features will have
> performance problems in the future such that you could otherwise avoid a
> rewrite.

If Go is so much easier to write then I fail to see how it can be a problem to
use Go to rewrite a feature for which performance is mission critical, and for
which you have final specifications in the Python implementation you're
replacing. But why write it in Go instead of Rust, Julia, Nim, or even
something else?

You're going to choose the most appropriate language for what exactly you have
to code. If you're trying to outperform an interpreted language and/or don't
care about being stuck with a rudimentary pseudo-object oriented feature set
then choose such a compiled language. Otherwise, Python is a pretty decent
choice.

> Yes, Python is here to stay, but that's more attributable to network effects
> and misinformation than merit in my experience.

If Go was easier to write and read, why would they implement a Python subset
in Go for configuration files, instead of just having configuration files in
Go? go.starlark.net Oh right, because it's not as easy to read and write as
Python, and because you'd need to recompile. So apparently, even Google, who
basically invented Go, also seem to need to support some Python dialect.

10-100X performance is most probably something you'll never need when starting
a project, unless performance is mission critical from the start. Static types
and compilation are an advantage for you, but for me dynamic typing and
interpretation mean freedom (again, I'm going to TDD on one hand and fix
runtime exceptions as soon as I see them in applicative monitoring anyway).

I don't believe comparing Python and Go is really relevant; comparing PHP and
Ruby and Python, for example, would seem more appropriate. When you say "people
shouldn't need Python because they have Go", I fail to see the difference with
just saying "people shouldn't need interpreted languages because there are
compiled languages".

Humans need a basic programming language that is easy to write and read,
without caring about having to compile it for their target architecture;
Python claims to do that, and does it decently. If you're looking for more, or
something else, then nobody said that you should be using Python.

I might be wrong, but when I'm talking about Humans, I'm referring to what I
have seen during the last 20 years as 99% of the projects out there in the
wild, not the 1% of projects that have extremely specific mission critical
performance requirements, thousands of daily contributors, and the like. Those
are also pretty cool, and they need pretty cool technology, but it's really
not the same requirements. For me, saying everybody needs Go would look a bit
like saying everybody needs k8s or AWS. Languages are many and solve different
purposes. The one that Python serves is staying, not by misinformation, but
because of Human nature.

~~~
throwaway894345
> What does your build look like?

Running tests, building a PEX file, putting the PEX file into a container
image. We have probably about a dozen container images and counting at this
point. The tests take a long time (because Python is 2+ orders of magnitude
slower than other languages), and our CI bill is killing us (we're looking
into other CI providers as well).

> Can you define "a large number of regular contributors"?

More than 20 (although our eng org is 30-50). Multiple teams. You don't want
to hold everyone's hand and show them all the tips and tricks you've found for
working around the quirks of Python packaging or give them an education on
wheels, bdists, sdists, virtualenvs, pipenvs, pyenvs, poetries, eggs, etc.
They were promised Python was going to be easy and they wouldn't have to learn
a bunch of things, after all.

> What do you mean "they don't need reproducibility"? I suppose they just
> build a container image in a minute and then go over and deploy on some
> host.

Container images aren't reproducible in practice. Moreover, they have to also
be reproducible for local development, and we use Macs, and Docker for Mac is
prohibitively slow. We need something else to make sure developers aren't
dealing with dependency hell.

> If Go is so much easier to write then I fail to see how it can be a problem
> to use Go to rewrite a feature for which performance is mission critical,
> and for which you have final specifications in the Python implementation
> you're replacing.

Both can be true: Go is easier to write than Python and it's still
prohibitively expensive to rewrite a whole feature in Go. If the feature is
small, well-designed, and easily isolated from the rest of the system, then
rewriting is cheap enough, but these cases are rare and "opportunity cost" is
a real thing--time spent rewriting is time not spent building new features.

> But why write it in Go instead of Rust, Julia, Nim, or even something else?

Because Rust slows development velocity by an order of magnitude and Julia and
Nim aren't mature general-purpose application development languages.

> You're going to choose the most appropriate language for what exactly you
> have to code. If you're trying to outperform an interpreted language and/or
> don't care about being stuck with a rudimentary pseudo-object oriented
> feature set then choose such a compiled language. Otherwise, Python is a
> pretty decent choice.

Yes, you have to choose the most appropriate language, but I contend that
Python is a pretty rubbish choice for reasons that people often fail to
consider up front. E.g., "My app will never need to be fast, and if it needs
to be, I can just rewrite the slow parts in C!".

> If Go was easier to write and read, why would they implement a Python subset
> in Go for configuration files, instead of just having configuration files in
> Go? go.starlark.net Oh right, because it's not as easy to read and write as
> Python, and because you'd need to recompile. So apparently, even Google, who
> basically invented Go, also seem to need to support some Python dialect.

Starlark is pretty cool though, and I use it a lot; I just wish it were
statically typed. But this is apples and oranges. Starlark is an embedded
scripting language, not an app dev language. Different design goals. It also
probably derives from something which pre-dates Go.

> 10-100X performance is most probably something you'll never need when
> starting a project, unless performance is mission critical from the start.

You would be surprised. As soon as you're doing something moderately complex
with a small-but-not-tiny data set you can easily find yourself in the tens of
seconds. And 100X is the difference between a subsecond request and an HTTP
timeout. It matters a lot.

> Static types and compilation are an advantage for you, but for me dynamic
> typing and interpretation mean freedom (again, I'm going to TDD on one hand
> and fix runtime exceptions as soon as I see them in applicative monitoring
> anyway).

We do TDD for our application development too and we still see hundreds of
typing errors in production every week. I think your idea of "static typing"
is jaded by Java or C++ or something; you can have fast, flexible iteration
cycles with Go or many of the newer classes of statically typed languages, as
previously mentioned. "Type inference" (in moderation) is your friend. Anyway,
Go programs can often compile in the time it takes a Python program to finish
importing its dependencies. A Go test can complete in a fraction of the time
it takes for pytest to _start_ testing (no idea why it takes so long for it to
find all of the tests).

> I don't believe comparing Python and Go is really relevant; comparing PHP
> and Ruby and Python, for example, would seem more appropriate. When you say
> "people shouldn't need Python because they have Go", I fail to see the
> difference with just saying "people shouldn't need interpreted languages
> because there are compiled languages".

"compiled" and "interpreted" aren't use cases. "General app dev" is a use
case. Python and Go compete in the same classes of tools: web apps, CLI
applications, devops automation, lambda functions, etc. PHP and Ruby are also
in many of these spaces as well. I don't especially care if Python is the
fastest interpreted language (it's not by a long shot), I care if it's fast
enough for my application (it's not by a long shot).

> Humans need a basic programming language that is easy to write and read,
> without caring about having to compile it for their target architecture;
> Python claims to do that, and does it decently. If you're looking for more,
> or something else, then nobody said that you should be using Python.

Lots of people recommend Python for use cases for which it's not well suited,
and since so many Python dependencies are C, you absolutely have to worry
about recompiling for your target architecture, and it's much, much harder
than with Go (to recompile a Go project for another architecture, just set the
OS and the architecture via the `GOOS` and `GOARCH` env vars and rerun `go
build`--you'll have a deployable binary before your Python Docker image
finishes building).

> I might be wrong, but when I'm talking about Humans, I'm referring to what
> I have seen during the last 20 years as 99% of the projects out there in the
> wild, not the 1% of projects that have extremely specific mission critical
> performance requirements

Right, Python is alright for CRUD apps or any other kind of app where the
heavy lifting can easily be off-loaded to another language. There's still the
build issues and everything else to worry about, but at least performance
isn't the problem. But I think you'll be surprised to find out that lots of
apps don't fit that bill.

> For me, saying everybody needs Go would look a bit like saying everybody
> needs k8s or AWS.

I'm not saying everyone needs Go, I'm saying that Go is a better Python than
Python. There are a handful of exceptions--there's not currently a solid Go-
alternative for django, and I wouldn't be surprised if the data science
ecosystem was less mature. But for general purpose development, I think Go
beats Python at its own game. And I've been playing that game for a decade
now. This conversation has been pretty competitive, but I really encourage you
to give Go a try--I think you'll come around eventually, and you can learn it
so fast that you can be writing interesting programs with it in just a few
hours. Check out the tour: [https://tour.golang.org](https://tour.golang.org).

~~~
1337shadow
I understand that if you're building a PEX file then all dependencies must be
reinstalled into it every time; however, you might still be able to leverage
container layer caching to save the download time.

CI bills are awful; I always deploy my own CI server, a gitlab-runner where I
also spawn a Traefik instance to practice eXtreme DevOps.

More than 20 daily contributors, that's nice, but I must admit that I have
contributed to some major Python projects that don't have a packaging problem,
such as Ansible or Django. So I'm not sure the number of contributors is
really a factor in packaging success. That said, sdists and wheels are things
that happen in CI for me; it's just a matter of adding this to my
.gitlab-ci.yml:

    
    
        pypi:
            stage: deploy
            script: pypi-release
    

And adding TWINE_{USERNAME,PASSWORD} to CI. The other trick is to use the
excellent setupmeta or something like that (OpenStack also has a solution) so
that setup.py discovers the version based on the git tag or publishes a dev
version.

That's how I automate the packaging of all my Python packages (I have
something similar for my NPM packages). As for virtualenvs, it's true that
they are great, but I don't use them; I use pip install --user, which has the
drawback that you need all your software to run with the latest releases of
dependencies, otherwise you have to contribute the fixes. But I'm a happier
developer this way, and my colleagues aren't blocked by a breaking upstream
release very often; they will just pin a version if they need to keep working
while somebody takes care of changing our code and contributing to
dependencies to make everything work with the latest versions.

I don't think that other languages are immune to version compatibility issues;
I don't think that problem is language dependent. Either you pin your versions
and forget about upstream releases, or you aggressively integrate upstream
releases continuously into your code and your dependencies.

> My app will never need to be fast

I maintain a governmental service that was in production in less than 3
months, then 21 months of continuous development, serving 60m citizens with a
few thousand administrators, as sole techie, on a single server, for the third
year. Needless to say, my country has never seen such a fast and useful
project. I have not optimized anything. Of course you can imagine it's not my
first project in this case. For me, "Python's speed is most often not a
problem" is not a lie; I've proved it.

The project does have a slightly complex database, the administration
interface does implement really tight permission granularity (each department
has its own admin team with users of different roles), and it did have to
iterate quickly, but you know the story with Django: changing the DB schema is
easy, migrations are generated by Django, you can write data migrations
easily, tests will tell you what you broke, you write new tests (I also use
snapshot testing, so a lot of my tests actually write themselves), and
upgrading a package is just as easy as fixing anything that broke when running
the tests.

You seem to think that Python is outdated because it's old, and that's also
what I thought when I went over all the alternatives for my next 10 years of
app dev. I was ready to trash all my Python, really. But that's how I figured
that the human-computer problem Python solves will just _always_ be relevant.
I'll assume that you understand the point I made on that and that we simply
disagree here.

Or maybe we don't really disagree, I'll agree with you that a compiled
language is better for mission-critical components, but any of these will
almost always need a CRUD and that's where Python shines.

But I've not always been making CRUDs with Python; I have 2 years of
experience as an OpenStack developer, and I must admit that Python fit the
bill pretty well here too. Maybe my cloud company was not big enough to have
problems, or we just avoided the common mistakes. I know people like Rackspace
had hard times maintaining forks of the services; I was the sole maintainer of
4 network service rewrites which were basically 1 package using OpenStack as a
framework (like I would use Django), to simply listen on RabbitMQ and do stuff
on SDN and SSH. Then again, I think not that many people actually practice
CI/CD correctly, so that's definitely going to be a problem for them at some
point.

> there's not currently a solid Go-alternative for django

That's one of the things that put me off: I tried all the Go web frameworks,
and they are pretty cool, but will they ever reach the productivity levels of
Django, Rails or Symfony?

Meanwhile, I'm just waiting for the day someone puts me in charge of something
where performance is sufficiently critical that I need to rewrite it in a
compiled language; if I could have the chance to do some ASM optimizations
that would also be a lot of fun. Another option is that I have something to
contribute to a Go project, but so far, Go developers seem to be doing really
fine without me, for sure :)

Why do I choose it for general purpose development? I guess I'm stuck with "I
love OOP", just like "the little functional programming Python offers".

I really enjoyed this conversation too, would like to share it on my blog if
you don't mind, thank you for your time, have a great weekend.

------
ahupp
This is true as far as it goes, but is not testing the (very common) areas
where async shines.

Imagine you're loading a profile page on some social networking site. You
fetch the user's basic info, and then the information for N photos, and then
from each photo the top 2 comments, and for each comment the profile pic of
the commentor. You can't just fetch all this in one shot because there's data
dependencies. So you start fetching with blocking IO, but that makes your wait
time for this request proportional to the number of fetches, which might be
large.

So instead, you ideally want your wait to be proportional to the depth of your
dependency tree. But composing all these fetches that way is _hard_ without
the right abstraction. You can cobble it together with callbacks but it gets
hairy fast.

So (outside of extreme scenarios) it's not really about whether async is
abstractly faster than sync. It's about how real developers would solve the
same problem with/without async.
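
A sketch of the shape in asyncio (hypothetical fetch stubs; the point is that
the total wait tracks the depth of the tree, not the number of fetches):

        import asyncio
        from types import SimpleNamespace

        # stand-ins for the real service calls
        async def fetch_user(uid):
            await asyncio.sleep(0.05)
            return SimpleNamespace(photo_ids=[1, 2, 3])

        async def fetch_photo(pid):
            await asyncio.sleep(0.05)
            return {"id": pid}

        async def fetch_top_comments(pid, n):
            await asyncio.sleep(0.05)
            return [SimpleNamespace(author_id=a) for a in range(n)]

        async def fetch_profile_pic(aid):
            await asyncio.sleep(0.05)
            return f"pic-{aid}"

        async def load_photo(pid):
            photo = await fetch_photo(pid)
            comments = await fetch_top_comments(pid, n=2)
            pics = await asyncio.gather(
                *(fetch_profile_pic(c.author_id) for c in comments))
            return photo, comments, pics

        async def load_profile_page(uid):
            user = await fetch_user(uid)
            # every photo subtree runs concurrently: total wait is
            # proportional to depth (user -> photo -> comment -> pic)
            return await asyncio.gather(
                *(load_photo(p) for p in user.photo_ids))

        asyncio.run(load_profile_page(42))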

(Source: I worked on product infrastructure in this area for many years at FB)

~~~
reggieband
I felt baffled by this thread until I read this response. async/await for me
has always been about managing this kind of dependency nightmare. I guess if
all you have to do is spawn 100 jobs that run individually and report back to
some kind of task manager, then the performance gains of threads probably beat
async/coroutine-based approaches on a pure speed benchmark. But when I have
significant chains of dependent work, the very idea of using bare threads and
callbacks to manage that is annoying.

At least in Typescript nowadays, the ability to just mark a function `async`
and throw an `await` in front of its invocation drastically lowers the barrier
to moving something from blocking to non-blocking. In the same cases, if I had
to recommend the same change with thread pools and callbacks (and the manual
book-keeping around all that), most developers just wouldn't bother.

~~~
sicromoft
> just mark a function `async` and throw an `await` ... to [move] something
> from blocking to non-blocking.

That's not how it works. `async` and `await` are merely syntactic sugar around
callbacks. Everything in javascript is already nonblocking[1], whether or not
you use async/await.

[1] There are a few rare exceptions in node js (functions suffixed with
"Sync"), but in the same vein, they are blocking whether or not you use
async/await.

~~~
peferron
The argument was about the developer experience, not how things work behind
the scenes. It's super simple for a developer to write this, for example:

    
    
        // fetchA and fetchB stand in for any promise-returning calls
        const a = fetchA()  // starts an async operation
        const b = fetchB()  // starts another async operation
        // Resolve a and b concurrently
        const [x, y] = await Promise.all([a, b])
        // Do something with x and y
    

You can naturally achieve that with callbacks but there's more boilerplate
involved. I'm not familiar with Python so I don't know what it would look like
without async.

Edit: I just re-read your comment and the one you were responding to, and do
agree that async/await don't "move" things from blocking to non-blocking. It
just helps using already non-blocking resources more easily. It will not help
you if you're trying to make a large numerical computation asynchronous, for
example. In this regard it's very different from Golang's `go`, which will run
the computation in a separate goroutine, which itself will run concurrently
(with Go's scheduler deciding when to yield), and in parallel if the
environment allows it.

~~~
earthboundkid
As someone who works in both Python and JavaScript regularly, JS’s async is
just leagues easier and better. It’s night and day. Even something as simple
as new Promise or Promise.all is way more confusing in Python. It’s very
different.

------
alexhutcheson
A lot of the debate and discussion here seems to come from the fact that the
example program demonstrates concurrency _across_ requests (each concurrent
request is being handled by a different worker), but no concurrency _within_
each request: The code to serve each request is essentially one straight line
of execution, which pauses while it waits for a DB query to return.

A more interesting example would be a request that requires multiple blocking
operations (database queries, syscalls, etc.). You could do something like:

    
    
        # Non-concurrent approach
        def handle_request(request):
          a = get_row_1()
          b = get_row_2()
          c = get_row_3()
          return render_json(a, b, c)


        # asyncio approach
        async def handle_request(request):
          a, b, c = await asyncio.gather(
            get_row_1(),
            get_row_2(),
            get_row_3())
          return render_json(a, b, c)

        # Naive threading approach (note: pass the callable and its args
        # separately - threading.Thread(target=get_row_1(a_q)) would call
        # get_row_1 eagerly on the main thread)
        def handle_request(request):
           a_q = queue.SimpleQueue()
           t1 = threading.Thread(target=get_row_1, args=(a_q,))
           t1.start()
           b_q = queue.SimpleQueue()
           t2 = threading.Thread(target=get_row_2, args=(b_q,))
           t2.start()
           c_q = queue.SimpleQueue()
           t3 = threading.Thread(target=get_row_3, args=(c_q,))
           t3.start()

           t1.join()
           t2.join()
           t3.join()

           return render_json(a_q.get(), b_q.get(), c_q.get())


        # concurrent.futures with a ThreadPoolExecutor (again, submit the
        # callable itself, not the result of calling it)
        def handle_request(request, thread_pool):
          a = thread_pool.submit(get_row_1)
          b = thread_pool.submit(get_row_2)
          c = thread_pool.submit(get_row_3)
          return render_json(a.result(), b.result(), c.result())
    

These examples demonstrate what people find appealing about asyncio, and would
also tell you more about how choice of concurrency strategy affects response
time for each request.

~~~
knite
This is a great point; surprised you received no follow-up comments!

------
berbc
Is speed really a good reason for using async? If I remember correctly,
asynchronous I/O was introduced to deal with many concurrent clients.

Therefore, I would have liked to see how much memory all those workers use,
and how many concurrent connections they can handle.

~~~
jillesvangurp
I think speed is the wrong word here. A better word is throughput.

The underlying issue with python is that it does not support threading well
(due to the global interpreter lock) and mostly handles concurrency by forking
processes instead. The traditional way of improving throughput is having more
processes, which is expensive (e.g. you need more memory). This is a common
pattern with other languages like ruby, php, etc.

Other languages use green threads / co-routines to implement async behavior
and enable a single thread to handle multiple connections. On paper this
should work in python as well, except it has a few bottlenecks, which the
article outlines, that result in throughput being somewhat worse than the
multi-process & synchronous versions.
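
The single-thread-many-connections idea is easy to see with asyncio's streams
API (a minimal echo-server sketch):

        import asyncio

        async def handle(reader, writer):
            # each connection is a coroutine; while one awaits network
            # IO, the one thread services all the others
            while data := await reader.read(1024):
                writer.write(data)
                await writer.drain()
            writer.close()

        async def main():
            server = await asyncio.start_server(handle, "127.0.0.1", 8888)
            async with server:
                await server.serve_forever()

        asyncio.run(main())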

~~~
throwaway894345
> which is expensive (e.g. you need more memory)

Memory is cheap; the cost is in constant de/serialization. Same with "just
rewrite the hotspots in C!"-style advice; de/serialization can easily eat
anything you saved by multiprocessing/rewriting. Python is a deceptively hard
language, and a lot of this is a direct result of the "all of CPython is the
public C-extension interface!" design decision (significant limitations on
optimizations => heavy dependency on C-extensions for anything remotely
performance sensitive => package management has to deal extensively with the
nightmare that is C packaging => no meaningful cross-platform artifacts or
cross compilation => etc).

~~~
Grimm1
Memory is not cheap when dealing with the real-world cost of deploying a
production system. The pre-fork worker model used in many sync cases is very
resource intensive, and depending on the number of workers you're probably
paying a lot more for the box it's running on; of course this is different if
you're running on your own metal, but I have other issues with that.

~~~
throwaway894345
> Memory is not cheap when dealing the real world cost of deploying a
> production system.

What? What makes you say that? What did you think I was talking about if not a
production system? To be clear, we're talking about the overhead of single-
digit additional python interpreters unless I'm misunderstanding something...

~~~
Grimm1
Observed costs from companies running the pre-fork worker model vs alternative
deployment methods. And just in the benchmark, they're running double-digit
interpreters, which I've seen as more common, and expensive.

~~~
throwaway894345
Double-digit interpreters per host? Where is the expense? Interpreters have a
relatively small memory overhead (<10mb). If you're running 100 interpreters
per host (you shouldn't be), that's an extra $50/host/year. But you should be
running <10/host, so an extra $5/host/year. Not ideal, but not "expensive",
and if you care about costs your biggest mistake was using Python in the first
place.

~~~
Grimm1
I don't know where you're seeing the <10mb; in the situation I saw, they were
easily consuming 30mb per interpreter. Even my cursory search around now shows
them at roughly 15-20mb, so assuming the 30mb Gunicorn was just misconfigured,
that's still an extra $100 per host using your estimate and what I'm seeing
Googling around; across a situation where there are multiple public APIs,
that's adding up pretty quickly.

Another Google search shows me that Gunicorn, for instance, using high memory
on fork isn't exactly uncommon either.

Edit: I reworded some stuff up there and tried to make my point more clear.

~~~
throwaway894345
The interpreter overhead on macOS is 7.7mb. I can't speak to Gunicorn
configuration, but it's far from the only game in town.

~~~
Grimm1
Totally fair point; my experience with fork-type deploys has only been
Gunicorn, so I'll take this as a challenge to try some others out.

------
rlpb
I find it interesting that all the talk here is about performance, and nobody
has mentioned any benefits of Async Python when performance isn't an issue.

I use trio/asyncio to more easily write correct complex concurrent code when
performance doesn't matter. See "The Problem with Threads"[1].

For this use case, Async Python probably still isn't faster, but that doesn't
matter. Let's not throw out the baby with the bathwater :)
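
As a taste of why: trio's nurseries tie every concurrent task to a lexical
scope (a small sketch; the point is correctness, not speed):

        import trio

        async def worker(name):
            await trio.sleep(1)
            print(f"{name} done")

        async def main():
            # every task started here is guaranteed to have finished (or
            # been cancelled, with errors propagated) by the time the
            # nursery block exits - no orphaned tasks to reason about
            async with trio.open_nursery() as nursery:
                nursery.start_soon(worker, "a")
                nursery.start_soon(worker, "b")

        trio.run(main)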

[1]
[https://www2.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-...](https://www2.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-1.pdf)

~~~
rukittenme
What's the point of writing concurrent code if it's not faster?

~~~
Jtsummers
Contrasting with jdlshore, concurrency can make programs much _easier_ to
reason about, when done well. This is a benefit of both Go and Erlang, though
they use different approaches.

Concurrency can help you separate out logic that is often commingled in non-
concurrent code, but doesn't need to be. As a real-world example, I used to do
safety critical systems for aircraft. The linear, non-concurrent version,
included a main loop that basically executed a couple dozen functions. Each
function may or may not have dependencies on the other functions, so
information was passed between them over multiple passes through this main
loop (as their order was fixed) using shared memory.

A similar project had about a dozen processes, each running concurrently.
There was no speed improvement, but the connection between each activity was
handled via channels (equivalent in theory to Go's channels, less like
Erlang's mailboxes as the channels could be shared). We knew it was correct
because each process was a simple state machine, separated cleanly from all
other state machines.

The second system's code was _much_ simpler, there was no juggling (in our
code) of the state of the system, compared to managing the non-concurrent
logic. If a channel had data to be acted on, the process continued, otherwise
it waited. Very simple. And it turns out that many systems can be modeled in a
similar fashion (IME). Of course, we had a very straightforward communication
mechanism (again, essentially the same as Go channels except it was a library
written in, as I recall, Ada by whoever made the host OS).
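
A toy version of that shape in Python, with queues playing the role of the
channels (obviously nothing like the real avionics code):

        import threading, queue

        def stage(inbox, outbox, transform):
            # each "process" is a simple state machine: block on the
            # channel, act on a message, emit, repeat - no shared state
            while (msg := inbox.get()) is not None:
                outbox.put(transform(msg))
            outbox.put(None)  # propagate shutdown downstream

        a_to_b = queue.Queue()
        b_out = queue.Queue()
        t = threading.Thread(target=stage,
                             args=(a_to_b, b_out, lambda m: m * 2))
        t.start()

        for m in [1, 2, 3]:
            a_to_b.put(m)
        a_to_b.put(None)  # close the channel
        t.join()
        while (result := b_out.get()) is not None:
            print(result)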

~~~
rukittenme
Signals are not dependent on concurrency. And you don't need multiple
processes to implement a state machine.

I mean think about it. What's the difference between sending message A and then
message B versus sending messages A and B into a queue and letting some async
process pop from it? Less complexity and guaranteed message delivery come for
free in single-threaded code.

Am I wrong? What am I missing?

~~~
jdlshore
I don't think you're wrong, but in Jtsummers' specific case, I think multi-
processing probably would be simpler. You don't have to implement the event
loop, there's no risk of tromping on other processes' data, and if a process
gets into an invalid state, you can just die without impacting others.

You'd need a good watchdog and error handling, but presumably some of that
came for "free" in their environment.

Although if you take out the "free" OS support, watchdog, etc., I agree that
there's likely a place between "shared memory spaghetti" and "multi-
processing" that's simpler than both.

~~~
Jtsummers
Exactly this. I had started my own reply and refreshed and saw yours, thanks.

The other benefit of the concurrent design (versus the single-threaded
version) was that it was actually _much_ simpler. This was critical for our
field because that system is still flying, now 12 years later, and will
probably be flying for another 30-50 years. The single-threaded system was
unnecessarily complex. Much of the complexity came from having to include code
to handle all the state juggling between the separate tasks, since each had
some dependency on each other (not a fully connected graph, but not entirely
disconnected either). The concurrent design made it trivial to write something
very close to the most naive version possible, where waiting was something
that only happened when external input was needed. So the coordination between
each task just fell out naturally.

You still have to care about locking the system up, but in our case, because
each process was sufficiently reduced to its essentials, this was easy to
evaluate and reason about.

------
ChrisMarshallNY
I use async for UI work, but don't have much of an opinion for servers.

I suspect that the best async is that supported by the server OS, and the more
efficiently a language/compiler/linker integrates with that, the better.
JIT/interpreted languages introduce new dimensions that I have not
experienced.

I do have some prior art in optimizing libraries, though. In particular, image
processing libraries in C++. My opinion is that optimization is sort of a
"black art," and async is anything but a "silver bullet." In my experience,
"common sense" is often trumped by facts on the ground, and profilers are more
important than careful design.

I have found that it's actually possible to have _worse_ performance with
threads, if you write in a blocking fashion, as you have the same timeline as
sync, but with thread management overhead.

There are also hardware issues that come into play, like L1/2/3 caches,
resource contention, look-ahead/execution pipelines and VM paging. These can
have _massive_ impact on performance, and are often only exposed by running
the app in-context with a profiler. Sometimes, threading can exacerbate these
issues, and wipe out any efficiency gains.

In my experience, well-behaved threaded software needs to be written, profiled
and tuned, in that order. An experienced engineer can usually take care of the
"low-hanging fruit," in design, but I have found that profiling tends to
consistently yield surprises.

T.A.N.S.T.A.A.F.L.

~~~
unilynx
> profilers are more important than careful design.

> I have found that it's actually possible to have worse performance with
> threads, if you write in a blocking fashion

But isn't excessive blocking/synchronization something that should already be
tackled in your design instead of being reworked after the fact?

I would expect profiling mostly to lead to micro-optimisations, e.g. combining
or splitting the time a lock is taken, but when you're still designing you can
look at avoiding as much need for synchronization as possible, e.g. sharing
data copy-on-write (not requiring locks as long as you have a reference)
instead of having to lock the data when accessing it.

As another commenter says

> with asyncio we deploy a thread per worker (loop), and a worker per core. We
> also move cpu bound functions to a thread pool

you can't easily go from e.g. thread-per-connection to a worker pool. That
should have been caught during design.

~~~
ChrisMarshallNY
> But isn't excessive blocking/synchronization something that should already
> be tackled in your design instead of being reworked after the fact?

Yes and no. Again, _I have not profiled or optimized servers or interpreted
/JIT languages,_ so I bet there's a new ruleset.

Blocking can come from unexpected places. For example, if we use dependencies,
then we don't have much control over the resources accessed by the dependency.

Sometimes, these dependencies are the OS or standard library. We would
sometimes have to choose alternate system calls, as the ones we initially
chose caused issues which were not exposed until the profile was run.

In my experience, the killer for us was often cache-breaking. Things like the
length of the data in a variable could determine whether or not it was bounced
from a register or low-level cache, and the impact could be astounding. This
could lead to remedies like applying a visitor to break up a [supposedly]
inconsequential temp buffer into cache-friendly bites.

Also, we sometimes had to recombine work that we had sent to threads, because
splitting it up was causing cache misses.

Unit testing could be useless. For example, the test images that we often used
were the classic "Photo Test Diorama" variety, with a bunch of stuff crammed
onto a well-lit table, with a few targets.

Then, we would run an image from a pro shooter, with a Western prairie
skyline, and the lengths of some of the convolution target blocks would be
different. This could sometimes cause a cache miss, with a buffer being
demoted. This taught us to use a large pool of test images, which was
sometimes quite difficult. In some cases, we actually had to use synthesized
images.

Since we were working on image processing software, we were already doing this
in other work, but we learned to do it in the optimization work, too.

When my team was working on C++ optimization, we had a team from Intel come in
and profile our apps.

It was pretty humbling.

------
woofie11
Cooperative multitasking came out slower than preemptive in the nineties, so
this is unsurprising in the generic case.

I think my question is whether async Python is slower in the case it was
designed for -- many, long-running open sockets.

Async was traditionally used server-side for things like chat servers, where I
might have millions of sockets simultaneously open.

~~~
ris
> Cooperative multitasking came out slower than preemptive in the nineties

This wasn't really the reason for the shift away from cooperative
multitasking, it was really because cooperative multitasking isn't as robust
or well behaved unless you have a lot of control over what tasks you have
trying to run together.

In theory cooperative multitasking should have better _throughput_ (latency is
another story) because each task can yield at a point where its state is much
simpler to snapshot rather than having to do things like record exact register
values and handle various situations.

~~~
woofie11
... I never meant to imply that performance was the reason for the switch.

We've had a track record of technologies which:

1) Automated things (relieving programmers from thinking about stuff)

2) Were expected to make stuff slower

3) In reality, sped stuff up, at least in the typical case, once algorithms
got smart

That's true for interpreted/dynamic languages, automated memory
management/garbage collection, managed runtimes of different sorts, high-level
descriptive languages like SQL, etc.

Sometimes, it took a lot of time to figure out how to do this. Interpreters
started out an order of magnitude or more slower than compilers. It took until
we had bytecode+JIT for performance to roughly line up. Then, once we started
doing profiling / optimization based on data about what the program was
actually doing, and potentially aligning compilation to the individual user's
hardware, things suddenly got a smidgen faster than static compilers.

There is something really odd to me about the whole async thing with Python.
Writing async code in Python is super-manual, and I'm constantly making
decisions which ought to be abstracted away for me, and where changing the
decisions later is super-expensive. I'd like to write.

~~~
mpweiher
> It took until we had bytecode+JIT that performance roughly lined up.

It really didn't. Yes, in highly specialized benchmark situations, JITs
sometimes manage to outperform AOT compilers, but not in the general case,
where they usually lag significantly. I wrote a somewhat lengthy piece about
this, _Jitterdämmerung_ :

[https://blog.metaobject.com/2015/10/jitterdammerung.html](https://blog.metaobject.com/2015/10/jitterdammerung.html)

Discussed at the time:

[https://news.ycombinator.com/item?id=10344601](https://news.ycombinator.com/item?id=10344601)

~~~
woofie11
Well, if you wanna go that route, in the general case, code will be structured
differently. On one side, you have duck typing, closures, automated memory
management, and the ability to dynamically modify code.

On the other side, you don't.

That linguistic flexibility often leads to big-O level improvements in
performance which aren't well-captured in microscopic benchmarks.

If the question is whether GC will beat malloc/free when translating C code
into a JIT language, then yes, it will. If the question is whether malloc/free
will beat code written assuming memory will get garbage collected, it becomes
more complex.

~~~
mpweiher
Objective-C has duck typing (if you want), closures, automated memory
management and the ability to dynamically modify code.

And is AOT compiled.

GC can only "beat" malloc/free if it has several times the memory available,
and usually also only if the malloc/free code is hopelessly naive.

And you've got the micro-benchmark / real-world thing backward: it is JITs
that sometimes do really well on microbenchmarks but invariably perform
markedly worse in the real world. I talk about this at length in my article
(see above).

------
compressedgas
> Function colouring is a big problem in Python

Not when you know how to call sync functions from async functions and vice
versa.

A sync function can call an async function via:

    import asyncio

    # spin up a fresh event loop and drive the coroutine to completion
    loop = asyncio.new_event_loop()
    result = loop.run_until_complete(asyncio.ensure_future(red(x), loop=loop))

An async function can call a sync function via:

    # hand the blocking call off to the default executor
    loop = asyncio.get_event_loop()
    result = await loop.run_in_executor(None, blue, x)

Where red and blue are defined as:

    async def red(x):
        pass

    def blue(x):
        pass

Note that the documentation is wrong about recommending create_task over
ensure_future. That recommendation results in more restrictive code as
create_task only accepts a coroutine and not a task.

This works for regular functions; I don't know how it works for generators.

~~~
WoodenChair
You perfectly illustrated why this is a problem. Calling functions from one
side to the other involves ceremony. Ceremony adds cognitive overhead and
decreases readability.

~~~
PaulHoule
I've written Python functions that "call" another function either async or
not, depending on what inspecting the function (or its result) reveals.

For instance, imagine a "maybe_await" helper that just calls the function if
it is synchronous, and awaits it otherwise.
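
A minimal sketch of that idea (maybe_await is hypothetical; this variant
inspects the call's result rather than the function itself, which also handles
plain functions that return awaitables):

    import asyncio
    import inspect

    async def maybe_await(fn, *args, **kwargs):
        # Call fn, then await the result only if it is awaitable.
        result = fn(*args, **kwargs)
        if inspect.isawaitable(result):
            result = await result
        return result

    def blue(x):       # plain sync function
        return x

    async def red(x):  # coroutine function
        return x

    async def main():
        print(await maybe_await(blue, 1), await maybe_await(red, 2))

    asyncio.run(main())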

~~~
mattbillenstein
It's very heavy to do this, is it not? Like, you inspect the function on each
call to figure out whether it's async or not?

~~~
PaulHoule
For things that happen at UI speed it isn't bad. If I wanted to do something a
million times a second I'd worry about it.

Some Javascript frameworks, such as Vue, often do something similar in that
you can pass either a sync or async callback and it does the right thing for
either. In that case you could potentially inspect the function once and call
it many times.

------
lovasoa
Async python is faster when you use it for running parallel tasks. In this
benchmark, you are running a single database request per query, so there is no
advantage to being asynchronous: a pool of processes will scale just as well
(but it will use more memory). The point of async is that it lets you easily
make a Postgres query, AND an HTTP query, AND a redis query in parallel.
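
A minimal sketch of that pattern, with asyncio.sleep standing in for the real
drivers (asyncpg, aiohttp and aioredis would be the usual suspects):

    import asyncio

    async def query_postgres():
        await asyncio.sleep(0.05)   # stand-in for a real DB call
        return "pg-row"

    async def fetch_http():
        await asyncio.sleep(0.05)   # stand-in for an HTTP request
        return "http-body"

    async def query_redis():
        await asyncio.sleep(0.05)   # stand-in for a cache lookup
        return "redis-value"

    async def handle_request():
        # The three waits overlap: total ~50ms rather than ~150ms.
        return await asyncio.gather(query_postgres(), fetch_http(), query_redis())

    print(asyncio.run(handle_request()))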

~~~
emptysea
Couldn’t threads handle that use case?

~~~
lovasoa
Yes they can. But threads are a pain to work with in python, as compared to
async.

------
hombre_fatal
One big difference between one thread per request vs single-threaded async
code is that synchronization and accessing shared resources is trivial when
all of your code is running on a single thread.

An entire category of data races like `x += 1` becomes impossible without you
even thinking about it. And that's often worth it for something like a game
server where everything is beating on the same data structures.
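
For anyone who hasn't seen it bite, a small demonstration of why `x += 1` is a
race under threads: it executes as separate load/add/store steps, and CPython
can switch threads between them.

    import threading

    counter = 0

    def bump():
        global counter
        for _ in range(100_000):
            counter += 1   # load, add, store - a thread switch can land in between

    threads = [threading.Thread(target=bump) for _ in range(8)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    print(counter)  # often less than 800000: some increments were lost

In single-threaded async code the same increment can only be interrupted at an
explicit await, so this particular race cannot happen.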

I don't use Python, so I guess it's less of an issue in Python since you're
spawning multiple processes rather than multiple threads so you're already
having to share data via something out of process like Redis and using its own
synchronization guarantees.

But for example the naive Go code I tend to read in the wild always has data
races here and there since people tend to never go 100% into a channel / mutex
abstraction (and mutexes are hard). And that's not a snipe at Go but just a
reminder of how easy it is to take things for granted when you've been writing
single-threaded async code for a while.

~~~
wwright
FWIW, Rust gives you the same simplicity (no data races at runtime) with
threads as well.

(Not necessarily on topic, but if you’re really excited about dodging data
races, I figured it would give you something fun to look at!)

~~~
ric2b
Not in the same way though, it catches the possibility of data races and
forces you to rewrite until all the memory accesses are safe. That's more
complex to program, you might need to redesign some of your data structures,
for example.

------
kkirsche
This reminds me of Rob Pike's Go talk about how concurrency is not
parallelism. I think the Python community may be hitting this issue: async is
meant to model concurrent behavior, not always or necessarily to facilitate
parallel activity.

~~~
chooseaname
I think a good chunk of Python developers expected (expect?) async to be a
"get out of GIL free card". It's not.

------
mikkelam
Techempower [1] has a really great collection of benchmarks using highly
controlled test setups that I like to look at to compare web frameworks. Not
affiliated with them, but it's relevant to the post.

[1]
[https://www.techempower.com/benchmarks/#section=data-r19&hw=...](https://www.techempower.com/benchmarks/#section=data-r19&hw=ph&test=fortune&l=zijzen-1r)

------
Grimm1
Async is useful for high IO where you may have a lot of downtime between
requests. Are you pulling many requests from different servers with different
response times, communicating with a DB, or pulling out large response bodies?
Async is probably going to do better, since each one of those synchronously
represents a potentially large idling period during which other requests could
have gotten work done.

As to the article, the comparisons are good but fail to mention resource
constraints. With Gunicorn, forking 16 instances is going to be a lot heavier
on memory, so for a little more RPS you're probably spending a decent chunk of
change more to run your workload. I don't think that's worth it, considering
the async model in Python is pretty easy to grok these days and under this
benchmark shares a similar performance profile.

Now, that said, if I had to guess, these numbers are fine for the average API,
but if you're doing something like high-throughput web crawling, or need to
serve something on the order of tens of thousands to hundreds of thousands of
RPS, async will win out on speed and resource use, and ultimately cost.

Plus, at one point they were like "we could only get an 18% speed up with
Vibora". I haven't used it myself, but an 18% performance increase at really
any level of load is fantastic. Hand-waving that off tells me the workloads
deemed "realistic" don't take into account real high-RPS workloads like you
might see at major tech companies.

~~~
meritt
> forking 16 instances is going to be a lot heavier on memory

It really depends on how the application is designed. Fork operates through
mmap and copy-on-write. It's extremely lightweight by default.

A well-designed fork-based application will already have loaded everything
necessary to run a given process into memory, won't munge any of the existing
shared memory, and will only allocate and free memory associated with new
events/connections/etc.

When programmed that way, individual forks are incredibly light on resources.
All the workers are sharing the exact same core application code and logic in
memory.
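
A POSIX-only sketch of that preload-then-fork pattern (the config dict is a
trivial stand-in for real application state):

    import os

    # Loaded *before* forking, so these pages are shared copy-on-write
    CONFIG = {"pool_size": 4}

    def worker(n):
        # Reading CONFIG touches shared pages; only a write would copy them
        print(f"worker {n} (pid {os.getpid()}) sees pool_size={CONFIG['pool_size']}")

    children = []
    for n in range(4):
        pid = os.fork()
        if pid == 0:          # in the child
            worker(n)
            os._exit(0)
        children.append(pid)

    for pid in children:
        os.waitpid(pid, 0)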

~~~
Grimm1
"All the workers are sharing the exact same core application code and logic in
memory."

Oh interesting, are you saying an intelligent forking implementation is able
to share static portions of memory with multiple children?

I was perhaps under the naive assumption forking was pretty much just a full
memory copy of the parent.

~~~
meritt
Yep, but that's simply how the Linux kernel works [1]. As a programmer you
need to essentially load up all the modules/libraries/data/etc. that you will
need into memory _before_ the fork, and treat the forked processes as read-
only as much as possible from a resource perspective. If you modify anything
from the parent, you get your own page as soon as that happens. [2][3]

[1]
[https://www.informit.com/articles/article.aspx?p=368650](https://www.informit.com/articles/article.aspx?p=368650)

[2] [https://en.wikipedia.org/wiki/Copy-on-
write](https://en.wikipedia.org/wiki/Copy-on-write)

[3]
[https://en.wikipedia.org/wiki/Fork_(system_call)](https://en.wikipedia.org/wiki/Fork_\(system_call\))

------
ohyes
This is just a fundamental misunderstanding of what concurrency is. I do not
see why it requires a benchmark.

Concurrency is many things at once. That's it.

Async frameworks end up with better concurrency properties because you're not
paying the memory and context switching overhead of an entire 'thread' for
each thing that you are trying to do at the same time. Instead you are paying
the (normally cheaper) overhead of what is essentially a co-routine call.

The disadvantage being that you have to manage these context switches
yourself, and that they tend to happen more frequently (to maintain the
illusion that we are doing many things all at the same time on a single cpu).
There is no way that an async framework would ever have better straight line
performance than a synchronous one, simply because of all of these extra
context switches, and that's fine because that's not what it is for.

Imagine I want to have 10,000 requests held open at the same time. Your Flask
server with 16 workers is going to have a tough time, as you don't have enough
workers to service that many connections; requests won't get serviced and
things will start to time out. An async framework multiplexes those workers so
that each can individually handle multiple requests at once. Multiplexing in
this way costs you something performance-wise.

If you were to crank up the concurrency beyond 100 at once (the default in the
posted scripts), you would start getting different results.

------
rajandatta
Excellent article. Well done. Great to see that you examined throughput,
latency and other measures. It may not answer all the questions that arise in
real-life situations and workloads, but we need more numerical experiments to
really understand how this works.

------
birdyrooster
No one said it was faster. We said it scaled better. That's because blocking
all execution on IO is bad for time-sensitive tasks like web requests.

If you want to actually go faster, the asyncio interfaces used by the
aiomultiprocess module get you there by maintaining an event loop across
multiple processes. You can save time and memory by sharding your data set and
aggregating the return data.

------
brodouevencode
Great article, but don't just abandon async entirely. There are still use
cases. For me: I use it to pull data from several external APIs all at once.
That data is then married up to produce another data object with some
special-sauce computation. All of these network calls run in parallel, so the
network overhead (DNS lookups, SSL handshakes, etc.) is incurred at the same
time instead of one call after another, as it would be in synchronous mode.
IIRC the benchmarks for this went from over 3 minutes to just over 20 seconds.

So there's still utility; YMMV.

------
tannhaeuser
It's about time someone put this into perspective with figures before more and
more people rush to implement business apps in async style (= 80's cooperative
multitasking). There are exceptions of course; for example, Node.js was
originally envisioned for e.g. game servers, where async's purported robustness
in the presence of a massive number of open sockets supposedly helps. But I
think for the vast majority of workloads going async has a terrible impact on
your codebase (either with callback hell or by deprecating most of the host
language's flow-control primitives like try/catch in favour of hard-to-debug
ad-hoc constructs such as Promises). Another price to pay is grokking Node.js'
streams (streams2/streams3) and domain APIs, and its unhelpful exception-
handling story, with subtle changes even as late as v13. As I hear, Python's
async APIs aren't uncontroversial either.

Now the next thing I'd be interested to see debunked is multithreading vs
multiple processes with shared memory (SysV shmem). I'm not very sure, but I'd
not be surprised to hear that the predominance of multithreaded runtimes (JVM,
most C++ appservers) is purely a cargo-cult effect. As far as I remember,
threads were introduced for small and isolated problems in GUI programs, like
code completion in IDEs; they were never intended to replace OS processes and
their isolation guarantees.

~~~
waheoo
Kevlin Henney has a lot to say about concurrent processing; I think it was one
of these talks:

[https://youtu.be/2yXtZ8x7TXw](https://youtu.be/2yXtZ8x7TXw)

[https://youtu.be/ZsHMHukIlJY](https://youtu.be/ZsHMHukIlJY)

Threading is faster, but really only if you're willing to give up your locks
and design for it properly.

------
philote
In a post about async web frameworks in Python, I'm really surprised that
Tornado was not included.

~~~
yla92
Yeah. I was expecting to see Tornado among the async web frameworks as well.
We use it at work for almost all the backend related work and are happy with
it.

------
bit_logic
Overall, I think the whole reactive programming style such as node and async
python are mistakes at this point. They come at too high a cost for code
complexity and maintainability. Synchronous style was always superior with
only one flaw, using OS threads.

But now there are solutions both existing and upcoming such as Go and Java
Project Loom that fix that one flaw. I don't see much appeal in the reactive
style at this point.

------
kwhitefoot
> Sadly async is not go-faster-stripes for the Python interpreter.

Surely this is back to front: async _is_ go-faster stripes precisely because
go-faster stripes don't actually make anything go faster.

------
jordic
It would be pretty nice to see the benchmark with what people are actually
using in the async world (asyncpg + uvloop). Taking a look, I found it's using
aiopg (who is using this?) without uvloop.

~~~
calpaterson
Hi - many of the configurations do use uvloop.

For what it's worth, I think people are using aiopg because it works with
SQLAlchemy whereas asyncpg does not.

I kept the database driver the same because I'm testing sync vs async, not
database drivers. I would be interested in testing asyncpg, particularly as a
performance claim is a big part of that library's documentation, but another
time.

~~~
GamblersFallacy
Hi, a few suggestions. Your benchmark repo's requirements.txt shows uvloop is
not being installed. In addition, the bash script calling uvicorn doesn't set
uvloop for the loop parameter. For example, serve-uvicorn-starlette.sh should
be:

    uvicorn --port 8001 --workers $PWPWORKERS app_starlette:app --loop uvloop

The uvicorn docs should point out what a big difference uvloop makes.

~~~
fernandotakai
uvicorn selects uvloop automatically if you have it installed (I just tested
on my machine, without passing --loop).

------
gotzmann
That's interesting. In the PHP world the situation is radically different:
modern frameworks based on libevent really do speed up web apps, by up to 10x.

I've thoroughly benchmarked my own framework [1] for REST APIs, and it now
outperforms many of the Go / Node.js platforms on Techempower [2]

[1] [https://github.com/gotzmann/comet](https://github.com/gotzmann/comet)

[2]
[https://www.techempower.com/benchmarks/#section=test&runid=e...](https://www.techempower.com/benchmarks/#section=test&runid=e12e0b2d-fc4a-4894-b619-cda198516483)

------
dotdi
EDIT: Read the article again, cleared up why the worker count differs.

Anyway, I'm running quite a few small Python services on cheap VPSs, i.e.,
shit performance, and using async was beneficial for me, with performance
being ~30% better. They are bread-and-butter apps that read from Postgres, do
some HTTP requests, process the results, and potentially write stuff back to
the DB. Same performance gain for other services that have HTTP servers.

In my mind, hardware can be used more efficiently with async, since while one
routine is waiting for a result, other routines can run in the meantime.

~~~
martius
What matters is that the server app uses as much CPU as it can when it needs
it.

With a sync server, a worker is inactive as long as it is waiting on IO, so
you need enough workers to maximize the chance that all workers are busy,
else, some clients are waiting even if you've got the CPU to deal with them.

With an async server, a single worker handles many clients simultaneously, in
theory, a single worker per core is sufficient to eat all the CPU available.

~~~
jordic
With asyncio we deploy a thread per worker (loop), and a worker per core. We
also move CPU-bound functions to a thread pool.
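
A minimal sketch of that layout for a single worker. Note that, because of the
GIL, a thread pool only truly parallelizes CPU-bound work that releases the
GIL (e.g. in C extensions); a ProcessPoolExecutor is the alternative.

    import asyncio
    from concurrent.futures import ThreadPoolExecutor

    pool = ThreadPoolExecutor(max_workers=4)   # one pool per worker process

    def crunch(n):
        # CPU-bound: would stall the event loop if run inline
        return sum(i * i for i in range(n))

    async def main():
        loop = asyncio.get_running_loop()
        # the loop keeps serving other tasks while crunch runs in the pool
        result = await loop.run_in_executor(pool, crunch, 1_000_000)
        print(result)

    asyncio.run(main())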

------
kortex
How difficult would it be to write a Python runtime, let's call it MPython
(meta), that forks into separate Python interpreters, 100% orthogonal except
for channels/shm? It could even use separate entry points, but all under the
same PID. Multiprocessing (IIRC) forks from the first Python process, and this
breaks some things depending on when you fork; e.g. gRPC with torch.dataloader
with multiprocessing will crash.

Does that get you anything, or am I misunderstanding how multiprocessing/fork
works?
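
For what it's worth, the stdlib's "spawn" start method already gives each
worker a fresh interpreter rather than a fork of the parent, which avoids the
fork-timing crashes mentioned above, though it provides neither shared memory
nor a single PID. A minimal sketch:

    import multiprocessing as mp

    def work(x):
        return x * x

    if __name__ == "__main__":
        # "spawn" starts clean interpreters instead of forking the parent
        mp.set_start_method("spawn")
        with mp.Pool(4) as pool:
            print(pool.map(work, range(8)))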

------
devy
IMO, async Python frameworks have proliferated in the era of AI research
projects gaining popularity. Some AI researchers who already know Python from
working on AI frameworks like PyTorch pick up these async Python frameworks
via cookiecutter templates [1][2][3], which help many others quickly spin up a
backend API for an AI application, in the name of being "fast", both in terms
of setup and in terms of responding to requests.

[1]:
[https://github.com/tiangolo/fastapi](https://github.com/tiangolo/fastapi)

[2]: [https://github.com/tiangolo/full-stack-fastapi-
postgresql](https://github.com/tiangolo/full-stack-fastapi-postgresql)

[3]: [https://github.com/tiangolo/uvicorn-gunicorn-fastapi-
docker](https://github.com/tiangolo/uvicorn-gunicorn-fastapi-docker)

------
lend000
I use regular threads in Python3 even for I/O and network requests, just
because they are what I am familiar with and I haven't yet found a compelling
reason to port projects over that are working fine with the standard thread
model (granted, I don't produce that many threads at a time). I haven't
noticed any performance hits with the GIL or my shared resource locks, but I'm
pretty careful about my concurrent programming.

Perhaps someone here can explain what the asyncio paradigm does for you beyond
"ease of use" when it doesn't get you past the single processor / GIL issue.
In what environments are the "os threads" created by the Python engine
actually that expensive? I suppose if you are just starting out it may be
easier to grok, but then it won't transfer as well to other programming
environments besides perhaps NodeJS.

------
daemonk
This seems like a resource allocation issue, then? Async Python is just
starting a bunch of jobs with no regard to how each job claims CPUs, whereas
sync Python is using native OS threads, which I guess do a much better job of
allocating CPUs?

For async python, when you make 1000 requests, does it immediately register
1000 jobs across your CPUs via workers for processing? Does that just mean
each job takes a tiny piece (1/1000) of the resource pie resulting in slower
performance for all jobs?

Whereas in sync python you are saying you can only perform X number of jobs at
a time where X is the number of allocated workers. So resource allocation is
roughly divided into X parts.

You also have a db connection pool layer after the server code. Isn't that
ultimately your bottleneck? I wonder if your async server is saturating the
CPUs making the connection pool slow.

------
peterthehacker
I'm trying to figure out how to run these benchmarks on my own machine and
experiment with some tweaks to the implementation, but it's unclear how to run
these benchmarks from start to finish. I don't see any instructions for
running the benchmarks in the github repository.

@calpaterson can you provide guidance?

I'd like to try an alternative query pattern. The current pattern implemented
in the benchmarks is: select 1 row in 1 query. I'd like to try an
implementation with 2 queries - select count(*) from table, and select * from
table limit 10 - which is a very common pattern for a REST list view. I would
hypothesize that the async apps would perform better in this case, but I'm
curious what this benchmark will show.

~~~
calpaterson
Hi - you will need to pip install the requirements into a virtualenv. Then set
$PWPWORKERS (e.g. to 1 to start with) and run serve-gunicorn-flask.sh. That
will get you a gunicorn instance up and running. From then on you'll need to
set up nginx, pgbouncer and postgres. I used unix sockets between all of these
but using TCP/IP is fine. The data generation script is checked in, as is the
schema.

Before you start you should know that Tudor M (see a PR on the project)
experimented with changing the query patterns (to three queries, but not a
count(*)). It doesn't change matters and the basic reason for that is that
nothing has changed - simply having more blocking or non-blocking IO is
irrelevant to throughput - except that the more yields you have the more
problematic your response times are going to be under load.

------
Spiritus
Funny how in the TechEmpower Web Framework Benchmarks[1], the async frameworks
are basically destroying their sync counterparts.

[1]
[https://www.techempower.com/benchmarks/](https://www.techempower.com/benchmarks/)

~~~
hedora
None of the performant async frameworks in those benchmarks are written in
python (unless I missed one).

The takeaway of this article is that python’s async io implementations perform
poorly.

That’s surprising, since async I/O is usually used for performance reasons,
and in most other languages, async I/O can be much faster.

~~~
Spiritus
The idea is to filter so that it only shows Python. Not to compare with other
languages.

------
martius
I'm not sure this is a realistic benchmark. A couple of remarks:

16 workers is not that much considering that modern servers can have a lot of
cores available, and I expect that the more workers you need the more likely
you'll hit other bottlenecks:

* the more workers you need, the more memory you consume (workers are processes, not threads),

* I don't know how OS schedulers behave these days, but a general-purpose OS scheduler may consume some CPU time you'd rather give to your app.

I understand the point on latency variation though: preemptive multitasking
will slice the CPU time "fairly" between workers, while in a cooperative
multitasking situation, it would be the job of the programmer to yield after
some time.

~~~
calpaterson
Hi - I am confident that 16 workers was the right number for that application
deployed on that machine. The machine is described in the article. If you took
this app and put it on a machine with 8 cores clearly it would make sense to
try 32 workers - but in practice I think few Python apps are so IO bound as
this one. Most of the time, just over 2 * cpu count is about the right number.

I suspect that scheduler overhead is not a realistic consideration for a
Python program. My understanding is that switching executing process takes
microseconds at worst, which would be too small to notice from the point of
view of a Python programmer.

On "it would be the job of the programmer to yield after some time" \- I'm
always personally suspicious of any technique that rests on programmer
diligence. My experience suggests not to require (or even expect!) programmer
diligence, even from my own (I assure you, god like) programming abilities.
Secondly, yielding more often probably would not help (and in fact I half-
suspect part of the problem is the frequent yielding at every async/await
keyword!).

~~~
martius
Hi, thanks for your response :)

Edit: I've been downvoted so I'll add a clarification: usually, it is believed
that async shines against other models once you reach a certain scale
([https://en.wikipedia.org/wiki/C10k_problem](https://en.wikipedia.org/wiki/C10k_problem)).
This benchmark shows that async app frameworks are slower than the sync ones
when running at a given scale, and since the article doesn't give many details
on the incoming traffic, I can only assume that it's low, since it saturates
4 cores.

I believe that your conclusion that "Async python is not faster" is an
overgeneralization of your use case.

I'm not saying that the configuration in your benchmark is not correct, I am
saying that this benchmark may not yield the same results if you try to scale
it on bigger hardware.

I believe that scheduler overhead can't be ruled out (not for Python nor any
other program) on a server, since we've sometimes observed that the scheduler
can be the bottleneck under some circumstances. For instance, some Linux
schedulers used to show poor performance when using nested cgroups with
resource quotas enabled.

Also, I'd like to state my first point again: you need to see how the number
of workers will influence the memory usage on your system. Especially with
python, if you've got a lot of workers, you can expect some memory
fragmentation that can impact the perf of your system.

------
jordic
The benchmark would be better with something like this on the asyncio side :)
[https://news.ycombinator.com/item?id=12227507](https://news.ycombinator.com/item?id=12227507)

Old stories that come again and again

------
xvilka
A good alternative to Python is OCaml. It can be interpreted as bytecode or
compiled natively. With Multicore OCaml coming [1], along with Domains and
algebraic effects, it can be a viable alternative in many cases where Python is
currently used. Moreover, it offers strict but flexible typing.

[1] [https://discuss.ocaml.org/t/multicore-ocaml-
may-2020-update/...](https://discuss.ocaml.org/t/multicore-ocaml-
may-2020-update/5898)

------
papito
It's not? Try this - create 10 Postgres queries and run them in sequence with
standard Python.

Now yield all those calls asynchronously as an array. What is this even about?

~~~
brianwawok
Now run them in 10 Pythons using multi-proc. Now run them in 10 Python
threads.

~~~
papito
Except that multi-threaded applications are inherently more complex and are
extremely challenging to debug.

~~~
brianwawok
Multi-threading is only hard if you share state. Keep state separate and life
is good.

------
nDmitry
Just finished writing a fairer benchmark a few days ago. It utilizes all
cores, has DB pools of the same capacity for all tested languages, uses
asyncpg in the async Python version, etc.

[https://github.com/nDmitry/web-benchmarks](https://github.com/nDmitry/web-
benchmarks)

Long story short - asyncio is twice as fast... (results are at the bottom of
the readme).

------
mkchoi212
"the more performance sensitive Python code you can replace the better you
will do. This is Python performance tactic with a long history"

------
jupp0r
The general issue of serving one vs multiple clients per thread has been
discussed extensively in the last two decades; see
[http://www.kegel.com/c10k.html](http://www.kegel.com/c10k.html)

I'm not familiar with Python, but it seems like there is a glaring performance
bug if using one thread per connection is faster than using async IO.

~~~
megaman821
Threads are quite fast; it shouldn't be that shocking that they are as fast or
faster than async. What would be shocking is if threads had less memory usage.

So for the c10k problem on a machine with 2 GB of RAM, async will win because
threads will exhaust the memory of the machine. Give that same machine 200 GB
of RAM and threads may end up being faster.

~~~
jupp0r
That's a very simple model of how memory and CPU interact. If you actually end
up using hundreds of gigabytes of memory, there will be implications for cache
hit rates, TLB misses, page table sizes and many other things that make me
wary of guessing performance in such a case.

There is also the not-much-discussed issue of having shared resources between
all of these threads, and the impact of such a threading model on the
engineering part of writing such a program. I personally haven't seen the
thread-per-connection model in a successful large-scale server.

------
JoeAltmaier
Async is about latency. About not stalling the UI thread on I/O etc. Not about
speed in the usual sense.

------
aliceryhl
I would be really interested in seeing some numbers on how long the async code
spends before each yield.

------
PythonicAlpha
Very interesting work and results.

It also would be interesting to see the memory footprints of the different
solutions.

------
willcipriano
In my experience, if the workload supports it multiprocess works well in
Python. The overhead for a new interpreter is cheap and this sidesteps the GIL
issue. If you require communication between your 'threads', ZeroMQ(ØMQ) makes
this simple and fast.
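
A minimal sketch of that processes-plus-ZeroMQ pattern using pyzmq (the socket
address and message are illustrative):

    import multiprocessing
    import zmq

    def worker():
        ctx = zmq.Context()
        sock = ctx.socket(zmq.PULL)
        sock.connect("tcp://127.0.0.1:5555")
        print("worker got:", sock.recv_string())   # blocks until a message arrives

    if __name__ == "__main__":
        p = multiprocessing.Process(target=worker)
        p.start()
        ctx = zmq.Context()
        sock = ctx.socket(zmq.PUSH)
        sock.bind("tcp://127.0.0.1:5555")
        sock.send_string("hello from the parent")
        p.join()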

------
antoncohen
I think this blog post on Python, Gunicorn, and Gevent is relevant:

[https://rachelbythebay.com/w/2020/03/07/costly/](https://rachelbythebay.com/w/2020/03/07/costly/)

~~~
wetmore
This blog post is mentioned in TFA

~~~
antoncohen
Yes, sorry I should have been more explicit. The article does mention the
Rachel by the Bay blog post, but in a way that makes it sound like a different
issue. I think they are more related.

I think the Rachel by the Bay blog post does a good job of explaining what
event loops are doing under the hood, and how that can lead to bad tail
latency for web requests.

------
jlg23
This debate reminds me on "Node.js Is Bad Ass Rock Star Tech", from 2012:
[https://www.youtube.com/watch?v=bzkRVzciAZg](https://www.youtube.com/watch?v=bzkRVzciAZg)

------
rubyn00bie
tldr;

Increasing throughput doesn't mean faster; it means more efficient use of your
resources.

Making workloads asynchronous or parallel only increases throughput, not
speed. You get more done overall; you don't get each individual thing done
faster.

---

Async is only faster if you're not CPU constrained. I don't think anyone is
surprised.

The following is really simplified; but, hopefully this makes things more
clear for folks...

 _Assuming ONE cpu, with ONE thread_

Synchronous call:

[A: Start]---------------->[B: Finish]

Asynchronous call:

[A: Start]-------->[B: Pause]...(sleep)...[C: Resume]----->[D: Finish]

There is no way to make the async call faster than the synchronous call,
period. By simply having the operation pause/wait/resume (context switch) it
has introduced overhead that is not present in the synchronous operation.

So WTF async?

Async is only useful when the context-switching overhead is less than the time
the I/O operation takes. That's it... So when you have I/O-bound tasks that
take more time than it does to switch contexts (and you carefully manage how
many contexts you have), you can have increased _throughput_.
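
A quick way to see both halves of that claim, with asyncio.sleep standing in
for a 100ms I/O wait:

    import asyncio
    import time

    async def io_task():
        await asyncio.sleep(0.1)   # pretend this is a 100ms network call

    async def main():
        t0 = time.perf_counter()
        await io_task()
        print(f"one call: {time.perf_counter() - t0:.2f}s")    # ~0.10s: no speedup

        t0 = time.perf_counter()
        await asyncio.gather(*(io_task() for _ in range(100)))
        print(f"100 calls: {time.perf_counter() - t0:.2f}s")   # still ~0.10s total

    asyncio.run(main())

A single call is no faster than its synchronous equivalent, but a hundred of
them overlap into roughly the time of one: throughput, not speed.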

------
awinter-py
yeah, having operated a largeish nodejs system that had to be realtime: the
latencies were all over the place when it got loaded down

M:N (kernel threads to green threads) seems to be the established wisdom for
Go now, with other langs catching up, but unlike Python, Go has no GIL so is
'less likely to get stuck' (I say, without a footnote)

Wider focus on observability makes me optimistic that we'll do better as an
industry at packing software onto hardware, and understanding which workloads
benefit from what.

------
danthemanvsqz
The benchmark doesn't reflect how I would use asyncio. Instead of simply
hitting a DB I'd like to see adding an API call to the middle of the request.

------
Thristle
Any reason why Django wasn't tested? It supports both the sync standard and
the async standard, and is AFAIK the most popular web framework (way more so
than Flask).

~~~
Ralfp
Django is not async yet. You can run it over ASGI, but parts of it (e.g. the
ORM) need an extra compat layer (the sync_to_async wrapper) for async to work.

~~~
Thristle
Still, should have been tested as a sync framework alongside flask

------
fxtentacle
Not surprised. The bottleneck in Python tends to be the Global Interpreter
Lock. That's also why multithreading only rarely helps and why people attempt
multi-process execution instead.

But Python is an excellent language for quick prototyping and for controlling
other things, like coordinating the GPUs that do the actual compute work. So I
don't quite get why we need to make Python usable for webservers, when we
already have other languages optimized for that purpose, e.g. Google's Go.

~~~
ris
The GIL is not an issue with async python. Async python is single-threaded.

~~~
ghostwriter
You can start multiple threads within the same process that don't use an
event loop but run preemptively nonetheless, which introduces GIL contention
into the thread running the event loop. For instance, you can have a
ThreadPoolExecutor running in the same process as the thread with an
instantiated event loop.

~~~
ris
I'm sure you can. And if my aunt had balls, she'd be my uncle.

~~~
ghostwriter
then she's definitely your uncle and you had better double-check her pronouns,
because ThreadPoolExecutor exists for a reason: it's widely used in tandem
with run_in_executor() [1], and the pool itself can be shared with other non-
executor-related tasks scheduled to run preemptively.

[1] [https://docs.python.org/3/library/asyncio-
eventloop.html#asy...](https://docs.python.org/3/library/asyncio-
eventloop.html#asyncio.loop.run_in_executor)

------
ronreiter
Nginx on an async server does not make sense.

~~~
VWWHFSfQ
nginx offers many benefits when fronting an application server: for instance,
TLS termination/offload, request buffering, and connection pooling, to name a
few.
------
ericls
What happens if there's an ASGI-speaking web server implemented in C/C++/Rust?

------
hypewatch
Why do you use less than half the workers for the async libraries?

uwsgi+flask - 16 workers; uvicorn+starlette - 5 workers

The highest-throughput examples in your benchmark all have 16 workers. I also
don't see any hardware data... Does your machine have 16 cores?

~~~
calpaterson
Hi - this is explained in detail in the article.

~~~
hypewatch
I went straight to the code and didn't read past the GitHub link - but I've
read through it now. The process described isn't in the code, though.

One other thing I noticed is that this uses aiopg for the async db queries
instead of asyncpg, which is more widely adopted and IMO much better.

I was hoping to re-run these benchmarks myself with asyncpg.

Looks like actually running the benchmarks would take a bunch of manual work.
In fact I don’t see instructions for running these benchmarks to replicate
your results.

------
nojvek
I don't understand the benchmark. Different libs have different worker counts.
Is the actual throughput test doing any async work, like calling a DB or a
REST service before it serves a response?

How is this an apples-to-apples benchmark?

------
thayne
Did the author use an async library to access the database? If not, the
benefits of async are diminished by the fact that every request is still
synchronously waiting on database I/O.

------
ryanthedev
Not a real test.

No response time simulation...

Learn what async code does for you.

~~~
ectospheno
Do you have a link handy for a real test?

------
makz
Last time I checked async in Python didn’t even work properly, has this
changed lately?

------
jlokier
I'm not surprised by the result. Another commenter said:

> async I/O is faster because it avoids context switches and amortizes kernel
> crossings

I think this is widely believed, but it's not particularly true for async I/O
(of the coroutine kind meant by async/await in Python, NodeJS and other
languages, rather than POSIX AIO).

With non-blocking-based async I/O, there are often more system calls for the
same amount of I/O compared with threaded I/O, and rarely fewer. How much more
depends on the pattern of I/O.

Consider: with async I/O, non-blocking read() on a socket will return -EAGAIN
sometimes, then you need a second read() to get the data later, and a bit more
overhead for epoll or similar. Even for files and recent syscalls like
preadv2(...RWF_NOWAIT), there are at least two system calls if the file is not
already in cache.

Whereas, threaded I/O usually does one system call for the same results. So
one blocking read() on a socket to get the same data as the example above, one
blocking preadv() to get the same file data.
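
A small illustration of that syscall pattern in Python, using the stdlib
selectors module (the host and request are illustrative, and a real reactor
would loop rather than read once):

    import selectors
    import socket

    sock = socket.create_connection(("example.com", 80))
    sock.sendall(b"GET / HTTP/1.0\r\nHost: example.com\r\n\r\n")
    sock.setblocking(False)

    sel = selectors.DefaultSelector()
    try:
        data = sock.recv(4096)   # syscall 1: may raise BlockingIOError (EAGAIN)
    except BlockingIOError:
        sel.register(sock, selectors.EVENT_READ)
        sel.select()             # syscall 2: epoll/kqueue wait for readiness
        data = sock.recv(4096)   # syscall 3: the read a blocking recv() does in one

    print(data.splitlines()[0])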

Every system call is two user<->kernel transitions (entry, exit). The number
of these transitions is one of the things we're talking about reducing with
async/await style userspace scheduling.

Threaded I/O puts all context switches in kernel space, but these add zero
user<->kernel transitions, because all the context switches happen _inside_ an
existing I/O system call.

Another way of looking at it: async replaces every kernelspace context switch
with a kernel entry/exit transition pair _instead_, plus a userspace context
switch.

So the question becomes: does the speed of userspace context switches _plus
kernel entry/exit costs for extra I/O system calls_ compare favourably against
kernel context switches _which add no extra kernel entry/exit costs_?

If the kernel scheduler is fast inside the kernel, and kernel entry/exit is
slow, this favours threaded I/O. If the kernel scheduler is slow even inside
the kernel (which it certainly used to be in Linux!), and kernel entry/exit
for I/O system calls is fast, it favours async.

This is despite userspace scheduling and context switching usually being
extremely fast if done sensibly.

Everything above applies to async I/O versus threaded I/O and counting
user<->kernel transitions, assuming them to be a significant cost factor.

The argument doesn't apply to async that is _not_ being used for I/O. Non-I/O
async/await is fairly common in some applications, so that tilts the balance to
userspace scheduling, but nothing precludes using a mix of scheduling methods.
In fact doing blocking I/O in threads, "off to the side" of an async userspace
scheduler is a common pattern.

It also doesn't apply when I/O is done without system calls. For example
memory-mapped I/O to a device. Or if the program has threads communicating
directly without entering the kernel. io_uring is based on this principle, and
so are other mechanisms used for communicating among parallel tasks purely in
userspace using shared memory, lock-free structures (urcu etc) and
ringbuffers.

------
crimsonalucard1
No man. Nodejs will beat the flask benchmark. For this specific test there is
no downside to async.

What’s going on here is python specific.

~~~
kerkeslager
1. I don't agree with your conclusion that it's Python-specific. You don't
have evidence for that - you made that up. And no, I'm not interested in
whatever benchmark you're going to want to post, because it's not a test of
this situation - it cannot possibly be, because when you introduce JS, you're
also going to be introducing literally hundreds of other factors which could
affect the performance. The assertion you are making is not one you can
possibly know.

2. For this specific test, there is a downside to async, as shown by the
test. Even _if_ what's going on here were Python-specific (which is still
something you made up), downsides to async which only occur in a Python
environment are still downsides to async. The title of this post is "Async
Python is not faster" - that conclusion is incorrect for many reasons, but
none of those reasons include the words "NodeJS", "JS", or anything else that
is not in the Python ecosystem.

3. What is going on is probably specific to the tools being used, which is
why I said "those downsides certainly don't apply to every project". In fact,
they probably don't apply to the idiomatic ways of implementing this in
Tornado, for example. But note how I said "probably", because I don't know for
sure, and I'm not comfortable with making things up and stating them as facts.

~~~
crimsonalucard1
>And no I'm not interested in whatever benchmark you're going to want to post,

That's rude.

Let's put it this way: NodeJS and nginx leveled the playing field. They
destroyed the LAMP stack and made async the standard way of handling high
loads of IO. That alone should indicate to you that there is something very
wrong with how you're thinking about things.

You know the theory of asyncio? Let me restate it for you: if coroutines are
basically the SAME thing as routines, but with the extra ability to allow
tasks to be done in parallel with IO, then what does that mean?

It means that 5 async workers in theory should be more performant than 5 sync
workers FOR highly concurrent IO tasks.

The logic is inescapable.

So what does it mean, if you run tests and see that 5 async workers are NOT
more performant than 5 sync workers ON PYTHON exclusively? The theory of
asyncio makes perfect logical sense right? So what is logically the problem
here?

The problem IS PYTHON. That's a theorem derived logically. No need for
evidence or data driven techniques.

There's this idea that data drives the world and you need evidence to back
everything up. How many data points do you need to prove 1 + 1 = 2? Put that
in your calculator 200 times and you've got 200 data points. Boom, data-driven
buzzword. That's what you're asking from me, btw: a benchmark, a data point to
prove what is already logical. Then you hilariously decided to dismiss it
before I even presented it.

Look, I say what I say not from evidence, but from logic. I can derive certain
issues about the system from logic. Just follow the logic I gave you above and
tell me where it went wrong, and tell me why I need some dumb data point to
prove 1+1=2 to you.

There is NOTHING made up above. It is pure logic derived from the assumption
of what AsyncIO is doing.

>But note how I said "probably" because I don't know for sure, and I'm not
comfortable with making things up and stating them as facts.

But you seem perfectly comfortable being rude and accusing me of making stuff
up. I'm not comfortable going around the internet and trashing other people's
theories with accusations that they are making shit up. If you disagree, say
so; I respect that. I don't respect the part where you're saying I'm making
stuff up.

~~~
kerkeslager
> > And no I'm not interested in whatever benchmark you're going to want to
> post,

> That's rude.

It wasn't rude, it was predictive, and I predicted correctly. You literally
ignored the second half of the sentence where I already explained why your
incorrect conclusion is incorrect.

Your logic makes perfect sense, in a world where I/O bound processes, JIT
versus interpretation differences, garbage collection versus reference
counting differences, etc., don't exist. But those things do exist in the
_real_ world, so if your logic doesn't include them, you're quite likely to be
wrong. In general, an interpreted concurrent system is far too complex to make
performance predictions about based only on logic, because your logic can't
possibly include all the relevant variables.

> No need for evidence or data driven techniques.

Well, there's where you're wrong. It turns out that if you actually collect
evidence through experimentation, you'll discover results that are not
predicted by your logic.

> Then you hilariously decided to dismiss it before i even presented it.

Well, you presented basically what I predicted, so... I wasn't wrong.

> But you seem perfectly comfortable in being rude and accusing me of making
> stuff up. I'm not comfortable in going around the internet and trashing
> other peoples theories with accusations that they are making shit up. If you
> disagree say it, I respect that. I don't respect the part where you're
> saying I'm making stuff up.

You are, in fact, making stuff up. If you are offended by accurate description
of your behavior, behave better.

~~~
crimsonalucard1
>Your logic makes perfect sense, in a world where I/O bound processes, JIT
versus interpretation differences, garbage collection versus reference
counting differences, etc., don't exist. But those things do exist in the real
world, so if your logic doesn't include them, you're quite likely to be wrong.
In general, an interpreted concurrent system is far too complex to make
performance predictions about based only on logic, because your logic can't
possibly include all the relevant variables.

Hey genius, look at the test that the benchmark ran. The benchmark is IO
bound. This is a SPECIFIC test about IO-bound processes. The benchmark is not
referring to real-world applications; it is SPECIFICALLY referring to IO.

The literal test is thousands of requests, and each handler for those requests
does nothing but query a database. If you look at a single request, almost 99%
of the time is spent on IO.

Due to the above, anything that has to do with the Python interpreter, JIT,
garbage collection and reference counting becomes NEGLIGIBLE in the context of
the TEST in the ARTICLE ABOVE. I suspect you didn't even read it completely.

Does that concept make sense to you? You can use relativity rather than
Newtonian physics to calculate the trajectory of a projectile, BUT that
involves UNNECESSARY overhead coming from relativity, because the accuracy
gained from the extra calculations IS NEGLIGIBLE.

>You are, in fact, making stuff up. If you are offended by accurate
description of your behavior, behave better.

Now that things have been explained completely clearly to you, do you see how
you are the one who is completely wrong, or are you incapable of ever admitting
you're wrong and apologizing to me like you should? I mean, literally, that
statement above is embarrassing once you get how wrong you are.

~~~
kerkeslager
> Anything that has to do with the python interpreter, JIT, garbage collection
> and reference counting becomes NEGLIGIBLE

Well, it's odd that you say that, when previously you were claiming that the
result was caused by Python. Is it caused by Python, or is Python negligible?

> You can use relativity rather then newtonian physics to calculate the
> trajectory of a projectile BUT it is involves UNNECESSARY overhead coming
> from Relativity because the accuracy gained from the extra calculations ARE
> NEGLIGIBLE.

Man, you sure are willing to make broad statements that you cannot possibly
know.

You're really sure that there's _no_ context where the accuracy gained by
relativity would be useful?

[http://www.astronomy.ohio-
state.edu/~pogge/Ast162/Unit5/gps....](http://www.astronomy.ohio-
state.edu/~pogge/Ast162/Unit5/gps.html)

Again, this is you making stuff up. The worst part here is that a basic
application of logic, which you claim to have such a firm grasp on that you
don't need evidence, would indicate that you _cannot possibly_ know the things
you are claiming to know. You really think you know all the possible cases
where someone might want to calculate the trajectory of a projectile?
_Really?_

~~~
crimsonalucard1
>Well, it's odd that you say that, when previously you were claiming that it
was caused by Python. Is it caused by Python, or is Python negligible?

It's not odd. Think harder. I'm saying that under the benchmark, and according
to the logic of what SHOULD be going on under asyncio, it SHOULD be negligible.
So such performance differences between Python and Node SHOULDN'T matter, and
that's why you CAN compare NodeJS and Python.

But actual testing shows an unexpected discrepancy that says the problem is
Python-specific, because logically there's no other explanation.

>You're really sure that there's no context where the accuracy gained by
relativity would be useful?

Dude I already know about that. I just didn't bring it up, because I'm giving
you an example to help you understand what "NEGLIGIBLE" means. You're just
being a pedantic smart ass.

There are many, many cases where relativity is not needed because it's
negligible. In fact, the overwhelming majority of engineering problems don't
need to touch relativity. You are aware of this, I am aware of this. No need
to be a pedantic smart ass. The other thing that gets me is that it's not even
anything new; many people know that satellites travel fast enough for
relativity to matter.

>Again, this is you making stuff up

Dude nothing was made up.

I never said "there's no context where the accuracy gained by relativity would
be useful". "No context" is something YOU made up.

In fact "made up" is too weak of a word. A better word is an utter lie.

That's right. You're a liar. I'm not being rude. Just making an observation.
Embarrassed yet?

~~~
dang
I've banned this account for repeatedly doing flamewars. Would you please stop
creating accounts to break HN's rules with? You're welcome here if, and only
if, you sincerely want to use this site in the intended spirit.

If you don't want to be banned, you're welcome to email hn@ycombinator.com and
give us reason to believe that you'll follow the rules in the future.

[https://news.ycombinator.com/newsguidelines.html](https://news.ycombinator.com/newsguidelines.html)

------
heipei
I'm not versed enough in Python and asyncio to replicate or understand his
benchmark; however, from my simplistic view, async (or any concurrent
framework really) should almost always be faster with any modern kind of
application.

Let me give an example: If you run a web-service, chances are that you're
gonna make some network call as part of processing a request, be it database,
network, search index, etc. With async, these calls free up your application
to work on other requests until the call returns. I'd say that many modern
web-services are just a stitching-together of external calls (get-user-info-
from-db, retrieve-user-items-from-db), so the only real work left for the web
application to handle is to wait and parse/encode some JSON to client. If you
can't max out your bandwidth in terms of JSON performance with one core you
just start as many async processes as you have cores. The big advantage over
sync processes then is that your async process can also handle insanely-long-
running background calls while still crunching through the other requests.
Someone please explain to me if I'm missing something here.

~~~
twic
This isn't correct at all. If you have threads, then when one of them does
something blocking, it is suspended, which frees up the processor for other
threads to do work. It's not like the computer is sitting there idle waiting
for IO to happen.

If you're writing synchronous code and not using threads, then yes, your
analysis is right, but that would be a daft thing to do!

The difference between sync and async is mostly whether the state associated
with a task (like an incoming HTTP request being serviced) is kept on a
dedicated native thread stack (as it is with sync), or in some sort of
coroutine structure (as with async). Thread stacks may be somewhat more
efficient, but you have to allocate a whole stack upfront for each thread, so
if you want to have lots of threads, you need to dedicate a lot of memory to
that, and that goes badly. For applications with small numbers of tasks in
flight at once, we shouldn't expect a lot of difference between sync and async
code. But for tasks with huge numbers of tasks (chat servers are the classic
example, but high-traffic webservers with lots of blocking calls in the
backend are another), async code should keep chugging on where sync code just
falls over.

tl;dr async is about the number of tasks you can handle at once, not the speed
with which you handle each task.

~~~
heipei
Yeah no, I get that, but what I'm saying is that with sync I can handle only
as many requests concurrently as I have processes / OS threads running, which
is usually the number of cores in my system. With async, I can have thousands
of requests in flight, all of them just waiting for the response of the
backend, and all I have to do is start a single OS thread.

~~~
arethuza
In a purely sync world you can have tens of threads per process without much
difficulty - so definitely much heavier resource-wise than async, but not
_that_ bad.

~~~
barrkel
10s?

There are over 2000 threads running on my Linux laptop, with a couple of
database servers, an IDE and two browsers open. Firefox has 323 threads,
Chrome 420, and Slack 69.

Async code with continuations and no second execution of continuations is
isomorphic to threaded code. The chief differences are that in threaded code,
the pointer to the continuation function and its closure are on the stack
instead of on the heap, and context switches are done by the OS (pro:
fairness, con: context switching overhead). Stack allocation and deallocation
is generally faster than heap allocation, but because it's contiguous you need
to pay for the high water mark. Even then, that's not expensive unless you
have loads of recursion or locally bound state.

~~~
arethuza
Yeah - I knew it was large numbers but didn't have them handy, so I used 10s :-)

