
Some thoughts on asynchronous Python API design in a post-async/await world - piotrjurkiewicz
https://vorpus.org/blog/some-thoughts-on-asynchronous-api-design-in-a-post-asyncawait-world/
======
quotemstr
The idea espoused in this blog post, that

> if you have N logical threads concurrently executing a routine with Y yield
> points, then there are N**Y possible execution orders that you have to hold
> in your head

is actively harmful to software maintainability. Concurrency problems don't
disappear when you make your yield points explicit.

Look: in traditional multi-threaded programs, we protect shared data using
locks. If you avoid explicit locks and instead rely on complete knowledge of
all yield points (i.e., all possible execution orders) to ensure that data
races do not happen, then you've just created a ticking time-bomb: as soon as
you add a new yield point, you invalidate your safety assumptions.
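To make the time bomb concrete, here is a minimal asyncio sketch (names and the withdrawal scenario are illustrative): inserting a single new await between a check and the update it guards breaks an invariant that held before the refactor.

```python
import asyncio

balance = 100  # shared state, "protected" only by knowledge of yield points

async def withdraw(amount):
    global balance
    if balance >= amount:          # check
        # A later refactor adds an await here -- a brand-new yield point --
        # and the check-then-act sequence is no longer atomic:
        await asyncio.sleep(0)     # e.g. an audit-log write
        balance -= amount          # act

async def main():
    await asyncio.gather(withdraw(80), withdraw(80))

asyncio.run(main())
print(balance)  # goes negative: both tasks passed the check before either subtracted
```

Remove the `await asyncio.sleep(0)` line and the code is race-free again, which is exactly the fragility being described: the safety argument lives in the absence of a yield point, not in any visible synchronization.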

Traditional lock-based preemptive multi-threaded code isn't susceptible to
this problem: it already embeds maximally pessimistic assumptions about
execution order, so adding a new preemption point cannot hurt anything.

Of course, you can use mutexes with explicit yield points too, but nobody
does: the perception is that cooperative multitasking (or promises or
whatever) frees you from having to worry about all that hard, nasty multi-
threaded stuff you hated in your CS classes. But you haven't really escaped.
Those dining philosophers are still there, and now they're _angry_.

The article claims that yield-based programming is easier because the fewer
the total number of yield points, the less mental state a programmer needs to
maintain. I don't think this argument is correct: in lock-based programming,
we need to keep _zero_ preemption points in mind, because we assume every
instruction is a yield point. Instead of thinking about N**Y program
interleavings, we think about how many locks we hold. I bet we have fewer
locks than you have yields.

To put it another way, the composition properties of locks are much saner than
the composition properties of safety-through-controlling-yield.

I believe that we got multithreaded programming basically right a long time
ago, and that improvement now rests on approaches like reducing mutable shared
state, automated thread-safety analysis, and software transactional memory.
Encouraging developers to sprinkle "async" and "await" everywhere is a step
backward in performance, readability, and robustness.

~~~
vomjom
It's not clear what you're suggesting as an alternative. My understanding is
that you're suggesting thread-per-request, which has many known flaws. There
are three approaches to serving requests:

1\. Thread-per-request. This is a simple model. You have a fixed-size thread
pool of size N, and once you hit that limit, you can't serve any more
requests. Thread-per-request has several sources of overhead, which is why
people recommend against it: thread limits, per-thread stack memory usage,
and context switching.

2\. Coroutine style handling with cooperative scheduling at synchronization
points (locks, I/O). This is how Go handles requests.

3\. Asynchronous request handling. You still have a fixed-size thread pool
handling requests, but you no longer limit the number of simultaneous requests
with the size of that thread pool. There are several different styles of async
request handling: callbacks, async/await, and futures.

#2 and #3 are more common these days because they don't suffer from the many
drawbacks of the thread-per-request model, although both suffer from some
understandability issues.
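The ceiling in model #1 is easy to see in a sketch (pool size and sleep duration are made up): with N workers, request N+1 simply waits in the queue until a thread frees up.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def handle_request(i):
    time.sleep(0.2)          # stand-in for blocking I/O
    return i

# A pool of 2 "request handler" threads serving 4 requests.
with ThreadPoolExecutor(max_workers=2) as pool:
    start = time.monotonic()
    results = list(pool.map(handle_request, range(4)))
    elapsed = time.monotonic() - start

# 4 requests / 2 threads: two sequential batches, so roughly 0.4s, not 0.2s.
print(results, round(elapsed, 1))
```

With an async model, all four waits would overlap on one thread and the wall-clock time would stay near a single request's latency.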

~~~
quotemstr
Those options aren't as distinct as you might imagine. Would calling it fiber-
per-request make you happy?

(By the way: most of the time, a plain-old-boring thread-per-request is just
fine, because most of the time, you're not writing high-scale software. If you
have at most two dozen concurrent tasks, you're wasting your time worrying
about the overhead of plain old pthread_t.)

I'm using a much more expansive definition of "thread" than you are. Sure, in
the right situation, maybe M:N threading, or full green threads, or whatever
is the right implementation strategy. There's no reason that green threading
has to involve the use of explicit "async" and "await" keywords, and it's
these keywords that I consider silly.

~~~
vomjom
(I agree that thread-per-request works just fine in the majority of cases, but
it's still worthwhile to write about the cases where it doesn't work.)

Responding to your original post: you argue that async/await intends to solve
the problem of data races. That's not why people use it, nor does it tackle
that problem at all (you still need locks around shared data).

It only tries to solve the issue of highly-concurrent servers, where requests
are bound by some resource that request-handling threads have to wait for
(typically I/O).

Coroutines/fibers are not an alternative to async servers, because they need
primitives that are either baked into the language or the OS itself to work
well.

~~~
cderwin
Please correct me if I'm wrong, but doesn't asyncio in the form of async/await
(or any other ways to explicitly denote context switches) solve the problem of
data races in that per-thread data structures can be operated on atomically by
different coroutines? My understanding is that unless data structures are
shared with another thread, you don't usually need locks for shared data.
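That matches my understanding: as long as a read-modify-write sequence contains no await, coroutines on a single event loop cannot interleave inside it. A quick sketch (the threaded equivalent of this counter would need a lock):

```python
import asyncio

counter = 0  # shared between coroutines, but all on one thread

async def bump(n):
    global counter
    for _ in range(n):
        counter += 1            # no await inside, so no interleaving here
        await asyncio.sleep(0)  # yields only *between* increments

async def main():
    await asyncio.gather(bump(1000), bump(1000))

asyncio.run(main())
print(counter)  # always 2000: no lost updates, no lock needed
```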

~~~
omribahumi
I think that the biggest argument against it is code changes. Think about a
code change that adds an additional yield point without proper locking.

Has any language tackled this with lazy locking, i.e. locking only on yield?
Maybe this could even be done at compile time.

------
codethief
I find it surprising that no one here comments on the actual topic of the blog
post: Namely that the internal implementation of asyncio is opaque at best and
this unfortunately propagates upwards to the public API. Personally, I have
taken a look at its source code a few times as well (to understand what my
code was doing because the docs were lacking details) and I remember that the
callback hell paired with those additional user-space buffers the author
mentions really made it a major PITA to reason about. Now, why should anyone
worry about asyncio's internals? Heck, if everything was working, I wouldn't
mind, either. However, as pointed out in the blog, there are quite a few edge
cases where it isn't. Plus, the documentation traditionally doesn't do a
particularly good job at explaining things. Or is it just the fact that the
API is quite confusing sometimes that has caused me to take a look at the
source code more often than I care to admit? (Compare
[https://news.ycombinator.com/item?id=12829759](https://news.ycombinator.com/item?id=12829759))
Whatever it is, the fact is that asyncio's internals _do_ matter
unfortunately.

…which is why I was happy to hear that not all hope is lost and that someone
created an alternative. Now, I haven't taken a look at curio yet, so maybe I'm
a bit quick to judge, but I found it very refreshing that barely a minute
spent reading the documentation left me with a good idea of
how it works and how I can use it. Kudos to the author(s), I will definitely
give it a try!

------
justinsaccount
I feel like I'm too dumb to understand any of this. And I've been writing
python for 12 years.

Just give me greenlets or whatever and let me run synchronous code
concurrently.

    
    
      async def proxy(dest_host, dest_port, main_task, source_sock, addr):
        await main_task.cancel()
        dest_sock = await curio.open_connection(dest_host, dest_port)
        async with dest_sock:
          await copy_all(source_sock, dest_sock)
    

Are you kidding me? Simplified that is

    
    
      async def func():
        await f()
        dest_sock = await f()
        async with dest_sock:
          await f()
    

Every other token is async or await. No thank you.

~~~
jeswin
Are you saying that using greenlets is any simpler than this? IMO that
mechanism looks way more complex, and it will probably be less efficient.

The point is this: threads are still expensive in bulk (the CPU has to shuffle
a lot of data every time you switch). So all kernels have mechanisms to
support parallel IO operations. An async library will use the best available
kernel mechanism for IO; epoll on Linux, kqueue on BSDs, maybe IO Completion
Ports on Windows (not sure). Turns out, doing that requires some help from the
language itself, or the code turns into a pyramidal mess. The async keyword
addresses the readability aspect of that code.
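For what it's worth, Python's stdlib selectors module is exactly this kernel-mechanism abstraction: DefaultSelector picks epoll, kqueue, etc. for the platform. A minimal sketch, using a socketpair in place of a real client connection:

```python
import selectors
import socket

sel = selectors.DefaultSelector()  # epoll on Linux, kqueue on BSD/macOS

# A connected pair of sockets standing in for a client connection.
server_side, client_side = socket.socketpair()
server_side.setblocking(False)

sel.register(server_side, selectors.EVENT_READ)

client_side.sendall(b"hello")      # make server_side readable
events = sel.select(timeout=1)     # blocks until the kernel reports readiness
for key, mask in events:
    data = key.fileobj.recv(1024)
    print(data)  # b'hello'

sel.unregister(server_side)
server_side.close()
client_side.close()
```

Libraries like asyncio and curio are, at bottom, loops around a call like `sel.select()` plus the machinery to park and resume coroutines.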

So:

a) It's more complex than synchronous code

b) But it solves the performance problem without too much cognitive overhead
(once you get used to it).

~~~
quotemstr
> threads are still expensive in bulk

They don't _have_ to be. First of all, even ordinary threads are more
efficient than you might think. On a really awful low-end Android 4.1 device,
I can pthread_create and pthread_join over 5,000 threads per second. On a real
computer, my X1 Carbon Gen4, I can create and join over 110,000 threads per
second. (And keep in mind that each create-join pair also forces two full
context switches.)
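The claim is easy to measure for yourself; here is a rough Python version of the same microbenchmark (numbers will vary wildly by machine, and Python threads add interpreter overhead on top of the raw pthread_create/pthread_join cost):

```python
import threading
import time

N = 1000

def noop():
    pass

start = time.monotonic()
for _ in range(N):
    t = threading.Thread(target=noop)
    t.start()     # roughly a pthread_create
    t.join()      # roughly a pthread_join
elapsed = time.monotonic() - start

print(f"{N / elapsed:.0f} create/join pairs per second")
```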

For most applications, performance of regular threads is perfectly adequate.
In these environments, the maintainability and debuggability advantages of
using plain old boring threads makes it really hard to justify using something
exotic.

But suppose you do have big performance requirements: you can still use
normal-looking threaded code. There's a difference between how we represent
threads in source code and how we implement them. It's possible to provide
green, userspace-switched threads without requiring "await" and "async"
keywords everywhere. GNU Pth did it a long time ago, and there are lots of
other fibers implementations.

> the CPU has to shuffle a lot of data every time you switch

Any green-threaded system (with or without explicit preemption points) _also_
does context switches! Such a system maintains in user space a queue of things
to work on: as the system switches from one of these work items to another,
it's switching contexts! You have the same kind of register reloading and
cache coldness problems that switching thread contexts has. There's no
particular reason that you can do it much better than the kernel can do it,
especially since switching threads in the same address space is pretty
efficient.

~~~
int_19h
The problem with all green thread implementations that I know of is that
they're language and/or framework-specific. So the moment you start using
them, you get the same set of problems as using setjmp/longjmp in C across the
boundaries of foreign code - it either just blows up spectacularly, or at the
very least violates invariants because the interleaving code is not aware that
someone's pulling the rug from under it.

This can only be solved by standardizing a fiber API _and_ (per platform) ABI,
and by forcing all libraries in the ecosystem to be aware of fibers if their
behavior differs with threads in any way (e.g. if TLS and FLS are distinct).

Callbacks (and hence promises), on the other hand, work with what we already
have, and are trivially passed across component boundaries as a simple
function pointer + context pointer, or some suitable equivalent expressible in
C FFI. For example, I can take an asynchronous WinRT API (which returns a
future-like COM object), and wrap it in a Python library that returns
awaitable futures; with neither WinRT being aware of the specifics of Python
async, nor with Python aware of how WinRT callbacks are implemented under the
hood. On the other hand, if WinRT used Win32 fibers for asynchrony, Python
would have to be aware of them as well.
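The wrapping trick described here fits in a few lines: given any callback-taking API (simulated below with threading.Timer standing in for a foreign library that fires callbacks on its own thread), an event loop can expose it as an awaitable via a Future and call_soon_threadsafe.

```python
import asyncio
import threading

def callback_api(arg, on_done):
    """Stand-in for a foreign callback-based API (e.g. a C library):
    it invokes on_done(result) later, from its own thread."""
    threading.Timer(0.05, lambda: on_done(arg * 2)).start()

def as_awaitable(arg):
    loop = asyncio.get_running_loop()
    fut = loop.create_future()
    # Bridge the foreign-thread callback back onto the event loop thread.
    callback_api(arg, lambda result: loop.call_soon_threadsafe(
        fut.set_result, result))
    return fut

async def main():
    return await as_awaitable(21)

result = asyncio.run(main())
print(result)  # 42
```

Neither side needs to know anything about the other beyond "call this function pointer with this result," which is the interoperability argument above.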

~~~
anonymoushn
I expect you can also use a callback that switches greenlets, or one that
passes the values it got to Lua's coroutine.resume.

~~~
int_19h
How can it switch green threads without breaking any foreign code currently on
the stack? Consider what happens when said code holds an OS mutex, for
example.

The only way I see this working is if your green threads roll their own stack
on the heap, and switch that, without touching the OS stack. But then how is
the result fundamentally different from promise chains? Their callbacks and
captured state essentially form that very same green stack.

~~~
quotemstr
To start a fiber, you allocate some memory, set RSP to the end of that memory,
set your other registers to some arbitrary initial state, and jump to your
fiber routine. To switch fibers, you set RSP to some other block of memory,
restore your registers, and set PC to whatever it was when you last switched
away from that fiber. There's nothing magical, and it works with almost all
existing code. If you hold a mutex and switch to a different fiber, the mutex
stays held. How could it be otherwise?

~~~
int_19h
I was thinking of a situation where thread-aware but not fiber-aware code uses
mutex to synchronize with itself, which breaks with fibers because they reuse
the same thread, and the mutex is bound to that thread (so if another fiber
tries to acquire that mutex, it's told that it already has it, and proceeds to
stomp over shared data with impunity).

But upon further consideration, I realize that in this narrow scenario - where
fibers are used in conjunction with callback-based APIs - this shouldn't
apply, because you can't synchronize concurrent callback chains with plain
mutexes, either.

Having said all that, are there any actual implementations that seamlessly
marry fibers with callbacks? I don't recall seeing any real world code that
pulled that off. Which seems to imply that there are other problems here.

Of note is that CLR tried to support fibers, and found it to be something that
was actually fairly expensive. By extension, this also applies to any code
running on top of that VM:

"If you call into managed code on a thread that was converted to a fiber, and
then later switch fibers without involvement w/ the CLR, things will break
badly. Our stack walks and exception propagation will rely on the wrong
fiber’s stack, the GC will fail to find roots for stacks that aren’t live on
threads, among many, many other things."
([http://joeduffyblog.com/2006/11/09/fibers-and-the-clr/](http://joeduffyblog.com/2006/11/09/fibers-and-the-clr/))

GC is a sticking point here, it seems - clearly it needs to be fiber-aware to
properly handle roots in switched-out fibers.

------
Animats
The main use case for all this async stuff is handling a huge number of
simultaneous stateful network connections. At least, that was what Twisted was
used for. Are there other use cases for this sort of thing that justify all
the complexity that comes with it?

~~~
int_19h
It lets you easily write responsive UI apps without worrying about things like
threads - you treat your app as a single conceptual thread, and use async IO
operations on it by awaiting them. Since in practice every operation callback
is a new item posted onto the event loop, this doesn't block said loop at any
point, and UI remains responsive. So the developer can think in simple terms
like "if this button is clicked, [await] download this file, then update this
label and [await] send this email", instead of background worker threads with
condition variables etc.
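The single-conceptual-thread style looks roughly like this sketch, where download_file, update_label, and send_email are hypothetical stand-ins for real UI and I/O calls:

```python
import asyncio

async def download_file(url):
    await asyncio.sleep(0.01)    # stand-in for real async I/O
    return f"contents of {url}"

def update_label(text):
    print("label:", text)        # stand-in for a UI update

async def send_email(body):
    await asyncio.sleep(0.01)    # stand-in for real async I/O

async def on_button_clicked():
    # Reads top to bottom like sequential code; at each await the event
    # loop is free to process other UI events, so nothing blocks.
    data = await download_file("http://example.com/report")
    update_label("downloaded")
    await send_email(data)
    return data

data = asyncio.run(on_button_clicked())
```

In a real UI framework the handler would be scheduled on the framework's own event loop rather than via asyncio.run, but the shape of the code is the same.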

In particular, WinRT heavily promotes this approach for UWP apps.

~~~
TimJYoung
My problem with this justification is that these problems have been solved for
a long time with simple message passing. In Win32, you just post a message to
a window handle from the background thread to notify the UI of status updates,
etc. Yes, you _do_ need to worry about shared state/locks if that "message"
includes more than a simple integer. But, these are also solved problems and
rarely require more exotic lock-less queues, stacks, etc. for the majority of
applications that use these types of architectures for UI background
processing because the performance implications are inconsequential. Using a
shared stack that uses a simple critical section will work fine for managing
the messages, especially since Windows now has critical sections that can use
spin locks to help minimize context switches.
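For comparison, the message-passing shape translated into Python terms is a background thread posting to a queue that the UI loop drains; queue.Queue below is a stand-in for PostMessage to a window handle, and the loop is a stand-in for the window procedure.

```python
import queue
import threading

messages = queue.Queue()   # plays the role of the window's message queue

def background_work():
    for pct in (25, 50, 75, 100):
        messages.put(("progress", pct))   # like PostMessage(hwnd, ...)
    messages.put(("done", None))

worker = threading.Thread(target=background_work)
worker.start()

# The "UI loop": drain messages and update state on the UI thread only.
progress_log = []
while True:
    kind, payload = messages.get()
    if kind == "done":
        break
    progress_log.append(payload)

worker.join()
print(progress_log)  # [25, 50, 75, 100]
```

Because only the UI thread touches the UI state, the queue's internal lock is the only synchronization required.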

~~~
int_19h
It's a solved problem in the sense that yes, you can do it that way. But it's
also more conceptually complicated, and much easier to get it wrong, which is
evident by the fact that so many desktop apps on Windows _still_ lock up
occasionally. Not so with UWP apps.

------
tschellenbach
Are there any languages that have really nailed this? I've used gevent,
eventlet, (both python), promises, callbacks (node) and none of them come
close to being as productive as synchronous code.

I'd like to try out Akka and Elixir in the future.

~~~
quotemstr
C++? Java? Python? The traditional thread model isn't bad merely because it's
traditional. I much prefer it to promise hell and to async-everything. About
the only thing that beats it is CSP, which you can _also_ represent
sequentially without funky new keywords and which you can implement as a
library for C++, Java, or Python.

I never understood why people tout Go's goroutine feature so much. You can
have it in literally any systems language.
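A goroutine-and-channel shape really can be had as a plain library; here is a minimal Python sketch (threads standing in for goroutines, a bounded Queue standing in for a channel, None as an ad-hoc close sentinel):

```python
import queue
import threading

def go(fn, *args):
    """Spawn a 'goroutine': just a thread."""
    t = threading.Thread(target=fn, args=args, daemon=True)
    t.start()
    return t

def producer(ch):
    for i in range(5):
        ch.put(i)        # send on the channel (blocks when full)
    ch.put(None)         # sentinel: channel closed

def consumer(ch, out):
    while (item := ch.get()) is not None:   # receive until closed
        out.append(item * item)

channel = queue.Queue(maxsize=1)   # small buffer, CSP-style handoff
results = []
p = go(producer, channel)
c = go(consumer, channel, results)
p.join()
c.join()
print(results)  # [0, 1, 4, 9, 16]
```

What a library like this cannot give you is Go's cheap stacks and ecosystem-wide nonblocking I/O, which is the counterpoint raised in the reply below.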

~~~
reality_czech
The whole point of Golang is that every library and every project that uses Go
will support coroutines and channels. Sure, you can write a toy project in a
language like C that has these concepts, but your toy library will effectively
be unusable with all of the other libraries that have ever been written for C.
Any library that calls a blocking function will break your coroutine
abstraction.

It's like saying that indoor plumbing is no big deal-- it's just liquid moving
through a pipe. Well yes. Yes, it is. But if you don't have plumbing in your
neighborhood, or a sewage treatment plant in your city, you can't fake it by
fooling around in your garage. And frankly, it's not going to smell like a
rose.

~~~
int_19h
> The whole point of Golang is that every library and every project that uses
> Go will support coroutines and channels.

Of course, this also means that Go is making it hard for its libraries to be
used by other languages. So it's probably a bad candidate to write something
like a cross-plat UI toolkit, if you hope for its wide use.

In contrast, threads and callbacks are both well-supported in existing
languages; so if you write a library in C using either, pretty much any
language will be able to consume it.

~~~
reality_czech
That's a fair point. Go was not designed to be used to write libraries-- so
much so that the language didn't even have support for dynamically loaded
libraries for a very long time. (I'm not sure if they ever implemented their
DLL proposal that was out there for a long time... I'm too lazy to check now.)
The idea was you would write AWS-style microservices rather than using
libraries.

In general, "turducken" designs are awkward and difficult to debug. Ask
someone what a joy debugging or writing JNI or CPython code is some time.
People often prefer "pure" libraries even when the performance is a little
worse. C is the king of libraries awkwardly jammed into existing programming
languages, but it's a dubious crown to have. Rust is trying to break into this
space, but I'm not sure whether it's really a space worth being in.

------
systems
I think it's because it is really, actually new: Perl 6 has only been
considered production-worthy since 2016.

Also, as of now, most people who have used it complain that it is slow.

Give it two more years before you worry, and for now continue with Python or
whatever you like to use.

No one is in a rush to make Perl 6 popular ... it is not a commercial project
... so don't bet your career on Perl 6 ... yet.

