
Show HN: Parallelise pipelines of Python async iterables - michalc
https://github.com/michalc/asyncio-buffered-pipeline
======
Galanwe
Am I the only one struggling with async/await in _any_ language?

I did my fair share of tornado/twisted/zeromq ioloop programming, and even
though I completely agree that it gets messy real quick, I always "got it".

With async/await though, I always struggle and need a really long time to get
a clear picture of what is happening.

Maybe my mental model of how it works is not optimal so I would love for
people to describe how they "read" async/await code.

I typically see await very much like I would read a "yield", except that
instead of yielding a value, it yields a promise to the ioloop, which will
"next()" the coroutine once the promise resolves.

Still, it seems every language is adopting it, so I guess a majority of people
find it simpler to reason about. I just don't :(

Note: the same applies for async/await in JS. I can easily get my head around
the "then" style, but struggle with async/await. A bit less than in python
though.
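
Written out literally, my mental model is something like this toy sketch
(plain generators, no asyncio; fake_fetch and the mini-loop are made up for
illustration):

    # Toy illustration of the "await is yield" mental model: not asyncio,
    # just a generator yielding fake "promises" that a loop resolves.
    def fake_fetch(url):
        return {"url": url, "status": 200}  # pretend this resolves later

    def coroutine():
        response = yield ("fetch", "https://example.com")   # "await"
        body = yield ("fetch", response["url"] + "/body")   # "await"
        return body["status"]

    def run(gen):
        # The "ioloop": resolve each yielded promise, send() the result back.
        result = None
        while True:
            try:
                op, arg = gen.send(result)
            except StopIteration as stop:
                return stop.value
            if op == "fetch":
                result = fake_fetch(arg)

    print(run(coroutine()))  # 200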

~~~
formerly_proven
Try to grasp coroutines outside the async/await framework. Alternatively, try
to understand coroutines within the framework as state machines, like so:

    
    
        async def foo():
            conn = await DB.connect()
            results = await conn.query("SELECT bar, baz;")
            restresult = await REST.get("somewhere.else.prod.example.net/v1/BAR", id=results.id)
            return restresult.xyz
    

->
    
    
        class foo:
            future = None
            conn = None
            results = None
            restresult = None

            def call(self):
                # The outer future is what the caller ultimately awaits
                self.future = Future()
                future = DB.connect()
                future.add_done_callback(self._db_conn)
                return self.future

            def _db_conn(self, fut):
                try:
                    self.conn = fut.result()
                except Exception as e:
                    self.future.set_exception(e)
                    return
                future = self.conn.query("SELECT bar, baz;")
                future.add_done_callback(self._results)

            def _results(self, fut):
                try:
                    self.results = fut.result()
                except Exception as e:
                    self.future.set_exception(e)
                    return
                future = REST.get("somewhere.else.prod.example.net/v1/BAR", id=self.results.id)
                future.add_done_callback(self._restresult)

            def _restresult(self, fut):
                try:
                    self.restresult = fut.result()
                except Exception as e:
                    self.future.set_exception(e)
                    return
                self.future.set_result(self.restresult.xyz)

~~~
jnwatson
I think this bottom-up approach is the best way to understand. Using libevent
or libev for a while is a great way to understand the power of async.

Async re-merges functions that have been shattered due to having to block.

~~~
formerly_proven
There are some projects which highlight what the language is doing versus what
a library is implementing on top of that (e.g. asynker). I found that a great
help in figuring out "what" async "is".

------
jamesmishra
I had a lot of trouble writing and maintaining an asyncio Python codebase.
It's still hard to find library support, and there are "function coloring"[1]
issues that tend to make asyncio usage spread.

So I started using multiprocessing and multithreading available via
concurrent.futures. I built a wrapper package called `ori`[2] that lets you
build chains of threadpools and processpools without having to enter the
Python asyncio kingdom.
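
The core idea, sketched with only the standard library (this is plain
concurrent.futures, not ori's actual API):

    # Sketch of chaining pools with only the standard library
    # (plain concurrent.futures; not ori's actual API).
    from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

    def download(url):      # I/O-bound: fine in threads
        return "payload for " + url

    def crunch(payload):    # CPU-bound: better in processes
        return len(payload)

    def main():
        urls = ["https://example.com/%d" % i for i in range(10)]
        with ThreadPoolExecutor(max_workers=4) as threads, \
                ProcessPoolExecutor(max_workers=4) as processes:
            payloads = threads.map(download, urls)
            print(list(processes.map(crunch, payloads)))

    if __name__ == "__main__":  # needed for ProcessPoolExecutor on spawn platforms
        main()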

[1]: https://journal.stuffwithstuff.com/2015/02/01/what-color-is-your-function/

[2]: https://ori.technology.neocrym.com/en/latest/ori.poolchain/#module-ori.poolchain

------
tda
So what this does is introduce a (configurable) buffer between chained async
generators. So this does not introduce any kind of multi processor
parallelism, just makes an async generator easily be able to wait on multiple
iterations in parallel. So basically the need for queues is abstracted away,
looks quite nice if your use case requires something like this

~~~
jacobwilliamroy
Yeah the title is an oxymoron.

"Parallelized async"

~~~
michalc
Hi, author here,

I was slightly torn on the usage of the word "parallelise", in that yes, CPU-
wise, no tasks will progress in parallel.

However, the point of the library is to allow certain parts of a pipeline to
progress at the same time in the real-world. In the example, they are calls to
asyncio.sleep, but in real cases, they could be HTTP requests. As in, the
bytes of the request/response will be going across the wire at the same time,
so the total wall-clock time of (certain) pipelines will go down.

So, I claim "parallelise" _is_ appropriate.
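
A self-contained sketch of what I mean (independent of the library): two
one-second "requests" wrapped in tasks take about one second of wall-clock
time in total, not two:

    import asyncio
    import time

    async def fake_request(n):
        await asyncio.sleep(1)  # stand-in for bytes going over the wire
        return n

    async def main():
        start = time.monotonic()
        await asyncio.gather(fake_request(1), fake_request(2))
        print("took %.1fs" % (time.monotonic() - start))  # ~1.0s, not ~2.0s

    asyncio.run(main())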

~~~
jacobwilliamroy
I am familiar with all of the cheats: delaying evaluation, kicking the can
onto the network, and pretending that everything _isn't_ just being fed into a
mutex on the other side.

And for writing to disk, async is basically the only way to "parallelize" such
operations.

I just think it's inappropriate because, if I were looking for a
parallelization library like multiprocessing and found this, I might waste 15
minutes or so before figuring out that this isn't parallel, it's just async.

Async is okay, but it does nothing to improve performance the way
parallelization does. Async provides good UX, parallelization provides good
performance. Conflating the two just muddies the water for people who like to
do high-performance computing.

------
hackingthenews
Concurrency/parallelism is the number one reason I move a project from Python
to any other language. It's one of the few areas where Python can't provide a
graceful frontend or an acceptable runtime, but it is very impressive how much
the community is willing to do to mitigate the challenges.

~~~
mystickphoenix
Process/thread pools help to mitigate the brain damage in Python
significantly, but I hear you. Python's concurrency story is the main reason I
started looking at other languages, and it's why I picked up Clojure as a
second language.

------
joshribakoff
How does this compare to something like Rx?
[http://reactivex.io/languages.html](http://reactivex.io/languages.html) Is it
not applicable because generators are pull-based and observables are push-
based? Could the generator be wrapped in an observable (as I commonly do with
promises, which are eager) to compose them that way?
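
To illustrate what I mean by wrapping: a generic pull-to-push adapter (not
RxPY's actual API) would be something like:

    import asyncio

    async def gen():
        for i in range(3):
            await asyncio.sleep(0)
            yield i

    def to_push(async_iterable, on_next, on_complete):
        # Pull from the async iterable and push each value at a subscriber,
        # which is roughly what wrapping a generator in an observable does.
        async def pump():
            async for value in async_iterable:
                on_next(value)
            on_complete()
        return asyncio.ensure_future(pump())

    async def main():
        await to_push(gen(), on_next=print, on_complete=lambda: print("done"))

    asyncio.run(main())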

------
war1025
We have something very similar to this that we wrote at my work.

Ours is based off Twisted, which works pretty similarly to the new async stuff
in python3.

The thing ours does, which I can't tell whether this one supports, is: you
pass it a list of functions you want it to call (in python3 you would just
pass it coroutines, I suppose, but in Twisted an async function gets scheduled
automatically, so you need to pass them in an uncalled state), then you
specify the "width" of the pipeline, and it runs through your list of calls,
ensuring that "width" number of calls are in flight at a time.

It's a super useful little utility to have around.
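
In asyncio terms, a rough sketch of the "width" idea (a hypothetical helper,
not our actual Twisted code) might look like:

    import asyncio
    from functools import partial

    async def run_with_width(coroutine_functions, width):
        # Hypothetical helper: keep at most `width` calls in flight at a time.
        # The functions are passed uncalled, mirroring the Twisted version.
        semaphore = asyncio.Semaphore(width)

        async def run_one(fn):
            async with semaphore:
                return await fn()

        return await asyncio.gather(*(run_one(fn) for fn in coroutine_functions))

    # e.g., with a hypothetical `fetch` coroutine function:
    #     await run_with_width([partial(fetch, url) for url in urls], width=5)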

------
sandGorgon
Another way is to use Trio -
[https://trio.readthedocs.io/en/stable/](https://trio.readthedocs.io/en/stable/)
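
A minimal sketch of what that looks like in Trio:

    import trio

    async def child(n):
        await trio.sleep(1)
        print("child", n)

    async def main():
        # Tasks live inside a nursery; the block exits when they all finish
        async with trio.open_nursery() as nursery:
            for n in range(3):
                nursery.start_soon(child, n)

    trio.run(main)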

------
petters
I think something like this should be in the standard library.

~~~
bryogenic
It is! You just need to use asyncio correctly and put the coroutines you want
run in parallel into tasks. It's the second example in the asyncio
documentation for coroutines and tasks:
https://docs.python.org/3/library/asyncio-task.html#asyncio.create_task
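
For reference, that second example is essentially this (adapted from the
linked docs):

    import asyncio

    async def say_after(delay, what):
        await asyncio.sleep(delay)
        print(what)

    async def main():
        # Wrapping coroutines in tasks starts them concurrently,
        # so this takes ~2 seconds, not ~3
        task1 = asyncio.create_task(say_after(1, "hello"))
        task2 = asyncio.create_task(say_after(2, "world"))
        await task1
        await task2

    asyncio.run(main())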

~~~
petters
Hmm, so you're saying that this library is not really needed? I'll look into
this.

~~~
michalc
Hi, author here,

While yes, running a coroutine in a new task is in the standard library, that
is only a component of what this library does, which is:

- given an async iterable,

- iterating over it in a new task, into a buffer,

- and then returning an iterable that yields values from this buffer as it's
filled.

I don't _think_ this logic is in the standard library, at least not at
https://docs.python.org/3/library/asyncio-task.html#asyncio.create_task from
what I can see.
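
As a rough sketch (not the library's actual code, and with the error handling
simplified), those three steps amount to something like:

    import asyncio

    async def buffer(iterable, buffer_size=1):
        # Rough sketch of the steps above; not the library's actual code
        queue = asyncio.Queue(maxsize=buffer_size)
        done = object()  # sentinel: the source iterable is exhausted

        async def fill():
            try:
                async for value in iterable:
                    await queue.put((None, value))
            except Exception as exception:
                await queue.put((exception, None))
            else:
                await queue.put((None, done))

        task = asyncio.create_task(fill())  # iterate in a new task, into a buffer
        try:
            while True:
                exception, value = await queue.get()
                if exception is not None:
                    raise exception
                if value is done:
                    break
                yield value  # yield values from the buffer as it's filled
        finally:
            task.cancel()  # don't leave the filler task hanging around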

~~~
scaramanga
Yeah, this actually looks pretty cool. It's a nice way to abstract a common
pattern which is otherwise quite tedious, namely that of spawning a bunch of
tasks and having them feed each other work via asyncio queues.

Have you thought about making it work as a decorator?

~~~
michalc
Indirectly: the initial version was a single function, so could have been used
in a decorator easily enough.

The issue is what happens on exceptions to avoid tasks hanging around forever.
Exceptions propagate “forward” in the pipeline fairly automatically. However,
getting the ones “behind” the exception to notice: that was tricky. The best I
came up with was the bit of state wrapped up in buffered_pipeline that notices
the exceptions and cancels the tasks.

------
stuaxo
Fantastic, I've been wanting something like this since before async.

What is the mechanism used to parallelize: threads, processes, or something
else? It's worth mentioning in the README.

