
Memory efficiency of parallel IO operations in Python - underyx
https://code.kiwi.com/memory-efficiency-of-parallel-io-operations-in-python-6e7d6c51905d
======
konschubert
Honest question.

Can somebody explain to me why async IO is so important and why it is better
than using the operating system scheduler?

If process A is blocked because of IO, then the thing that needs to be done
will have to wait for the IO anyway.

Of course, in a server context, process A cannot handle new server requests
while it is blocked. But luckily we can run more than one process, so process
B will be free to pick it up. I will need to run a few more worker processes
than there are CPU cores, but is there a problem with that?

EDIT:

I'm thinking now the problem is maybe that running more workers than there are
cores will mean that the server accepts more concurrent connections than it
can handle? If I use async code and run exactly as many workers as I have
cores, the workers will never be blocked. But then, I have the scenario where
multiple async callbacks resolve in short sequence, but cannot be picked up by
a worker because all workers are busy.

So, in both scenarios (no async but more workers than cores VS async with as
many workers as cores) it can happen that the server puts too much on its
plate and accepts more than it can handle.

I have a feeling that this is a fundamental problem that manifests itself
differently in both paradigms, but exists nonetheless?

~~~
zzzeek
I have an unpopular opinion. Non blocking IO is useful when we want to scale
to lots and lots of parallel IO channels. If we don't have lots, only dozens,
there is no strong need for non blocking IO. You don't actually need anything
like coroutines or event driven frameworks to use non blocking IO, just loop
through the sockets and tend to the ones that are ready, though event driven
frameworks can make it much nicer (though still significantly more awkward
than writing code in a synchronous paradigm).
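
A minimal sketch of that plain loop using the stdlib selectors module (the
address and the echo behavior are just for illustration):

    import selectors
    import socket

    sel = selectors.DefaultSelector()

    # Listening socket, non-blocking so accept() never stalls the loop.
    server = socket.socket()
    server.bind(("localhost", 8000))
    server.listen()
    server.setblocking(False)
    sel.register(server, selectors.EVENT_READ)

    while True:
        # Block until at least one socket is ready, then tend to each.
        for key, _ in sel.select():
            sock = key.fileobj
            if sock is server:
                conn, _ = sock.accept()
                conn.setblocking(False)
                sel.register(conn, selectors.EVENT_READ)
            else:
                data = sock.recv(1024)
                if data:
                    sock.sendall(data)  # naive echo; can stall on a full buffer
                else:
                    sel.unregister(sock)
                    sock.close()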

Within the world of "we want non blocking IO", the world has gone crazy.
JavaScript has created a whole generation of programmers who think inside out
callback / yielded code is how everything should be done, and that threads are
"too hard". They hardly understand the original purpose of non blocking IO and
conflate the fact that they can conceptualize event driven code better than
they can threads with "well that's why it's so much faster" (which usually it
isn't). I can't retire fast enough from this industry.

~~~
quentinp
I think many people agree that speed is not a reason to choose non-blocking IO
over threads.

However, many people do believe that async/await and event loops make
reasoning about non-blocking IO much easier. Has your opinion changed since
[http://techspot.zzzeek.org/2015/02/15/asynchronous-python-and-databases/](http://techspot.zzzeek.org/2015/02/15/asynchronous-python-and-databases/)?

~~~
zzzeek
> However, many people do believe that async/await and event loops make
> reasoning about non-blocking IO much easier.

they do, until their program is mysteriously having MySQL server drop their
connections randomly because something CPU-bound snuck in and the server dumps
non-authenticated connections after ten seconds. Three days of acquiring and
poring over HAProxy debug logs from the production system finally reveals the
issue that never really should have happened in the first place because the
server is only handling about 30 requests per second, and of course the fix is
to switch that part of the program to threads.

asyncio certainly makes it easier to reason about non-blocking IO but it also
means you have to construct your own preemptive multitasking system by hand,
given only points of IO where context can actually switch. We're coding in
high level scripting languages. Low level details like memory allocation,
garbage collection, and multitasking should be taken care of for us.
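
A toy sketch of that failure mode and the usual fix (the hash workload is
just a stand-in for whatever CPU-bound thing snuck in):

    import asyncio
    import hashlib

    def expensive(data: bytes) -> str:
        # CPU-bound: while this runs inline, no other coroutine can make
        # progress, so idle connections can hit server-side timeouts.
        return hashlib.pbkdf2_hmac("sha256", data, b"salt", 500_000).hex()

    async def handler(data: bytes) -> str:
        # return expensive(data)  # inline: stalls the whole event loop
        # Hand it to a thread so the loop stays free to context-switch
        # (asyncio.to_thread is 3.9+; older versions use run_in_executor).
        return await asyncio.to_thread(expensive, data)

    print(asyncio.run(handler(b"payload")))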

~~~
quentinp
Threads are a leaky abstraction, so it makes sense to explore other solutions.
To cite another low level detail from your list: I don't think Rust is less
usable because there is no garbage collection.

But keep in mind asyncio has many issues of its own, which is why I'm happy
that alternatives like
[http://trio.readthedocs.io/](http://trio.readthedocs.io/) are possible in
Python.

------
wenc
There is also a drop-in replacement for asyncio called uvloop [0]. It claims
to be faster than asyncio, gevent, node.js, etc. and comparable to golang.

[0] [https://magic.io/blog/uvloop-blazing-fast-python-networking/](https://magic.io/blog/uvloop-blazing-fast-python-networking/)

~~~
jwandborg
More precisely, uvloop is an "asyncio.AbstractEventLoop" implementation using
libuv.

It can be used instead of the default asyncio.SelectorEventLoop[0] on Linux
(though not on Windows, at least:
[https://github.com/MagicStack/uvloop/issues/14](https://github.com/MagicStack/uvloop/issues/14)).

You will still, for better or worse, use the asyncio standard library when
writing code running on uvloop.

[0]: [https://docs.python.org/3/library/asyncio-eventloops.html#available-event-loops](https://docs.python.org/3/library/asyncio-eventloops.html#available-event-loops)
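
Swapping it in is a two-liner; this is the documented policy-based setup,
and the rest is ordinary asyncio code:

    import asyncio
    import uvloop

    # Replace the default event loop policy with uvloop's libuv-based one.
    asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())

    async def main():
        # Ordinary asyncio code; it now runs on a libuv event loop.
        await asyncio.sleep(0.1)

    asyncio.run(main())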

------
scott_s
> In my case, there was a ~40% speed increase compared to sequential
> processing. Once a code runs in parallel, the difference in speed
> performance between the parallel methods is very low.

But they didn't actually present the total processing time for all of the
methods - I assume all of the parallel methods were about 17 seconds?
(Compared to the sequential baseline of 29 seconds.) And how were the threaded
frameworks configured? How many threads were they told to use (or just the
default?), how many threads can they use, and what kind of parallel hardware
did they run on?

This blog post presents the decision as one-dimensional; it claims all
parallelization methods are the same, so the only dimension to choose on is
memory efficiency. But I'm skeptical that all parallelization methods _are_
the same, and the experimental design gives me no information on that front.

~~~
nickcw
I had a squint at the ThreadPoolExecutor code on GitHub and it uses the
default parameters. Exactly how many threads you get depends on which version
of Python you are running:

> Changed in version 3.5: If max_workers is None or not given, it will default
> to the number of processors on the machine, multiplied by 5, assuming that
> ThreadPoolExecutor is often used to overlap I/O instead of CPU work and the
> number of workers should be higher than the number of workers for
> ProcessPoolExecutor.

Not knowing how many things are happening in parallel means it is difficult to
draw conclusions from it.
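
To make the degree of parallelism explicit rather than version-dependent,
pass max_workers yourself (the URL list and fetch function here are just
placeholders):

    from concurrent.futures import ThreadPoolExecutor
    from urllib.request import urlopen

    URLS = ["https://example.com/"] * 100  # placeholder workload

    def fetch(url):
        with urlopen(url) as resp:
            return resp.read()

    # Pin the pool size explicitly instead of relying on the
    # version-dependent default for max_workers.
    with ThreadPoolExecutor(max_workers=100) as pool:
        pages = list(pool.map(fetch, URLS))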

~~~
ku3o
Thanks for pointing this out. I updated the article.

------
barrkel
Title should refer to concurrent IO, not parallel IO.

~~~
nine_k
On the interpreter side, most of the I/O read time is waiting for an OS buffer
to fill with the right amount of data. Waiting happens to occur "in parallel"
when using async I/O.

I suppose that sending is similar: you pass the OS a buffer, and wait for
completion. Sending of the data occurs while you wait.

So, if you can parallelize waits, the OS could be doing strictly parallel I/O
for you (e.g. via two network interfaces, or network and disk), even though
your code is concurrent but not parallel, and you don't run two sync OS I/O
calls in parallel.
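
A small sketch of that with plain asyncio (example.com/example.org are just
placeholder hosts):

    import asyncio

    async def fetch_head(host):
        # Nearly all of this coroutine's time is spent waiting on the OS.
        reader, writer = await asyncio.open_connection(host, 80)
        writer.write(b"HEAD / HTTP/1.0\r\nHost: " + host.encode() + b"\r\n\r\n")
        await writer.drain()
        data = await reader.read(200)
        writer.close()
        return data

    async def main():
        # Concurrent, not parallel: one thread, but the two waits overlap,
        # so the OS can service both sockets at the same time.
        return await asyncio.gather(fetch_head("example.com"),
                                    fetch_head("example.org"))

    print(asyncio.run(main()))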

~~~
zzzeek
> Waiting happens to occur "in parallel" when using async I/O.

> and you don't run two sync OS I/O calls in parallel.

you can run sync IO calls and wait for each in separate threads; the GIL is
released for IO. the waiting is "in parallel" just as much with a threaded /
blocking approach as with a non-blocking one.
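
A quick sketch of that point (placeholder URLs; wall time lands near the
slowest single request, not the sum of both):

    import time
    from concurrent.futures import ThreadPoolExecutor
    from urllib.request import urlopen

    def fetch(url):
        # urlopen blocks this thread, but the GIL is released while it
        # waits on the socket, so the other thread keeps waiting too.
        with urlopen(url) as resp:
            return resp.status

    urls = ["https://example.com/", "https://example.org/"]
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=2) as pool:
        statuses = list(pool.map(fetch, urls))
    print(statuses, time.perf_counter() - start)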

------
azylman
Something seems off here - they mention using 100 workers (a worker for every
request). I would expect that to perform way more than 40% faster unless
there's a ton of overhead in creating those workers.

------
trevman
If you have a lot of RAM, I highly recommend you look at Plasma + PyArrow.

------
yahyaheee
Sounds like a good job for Go :)

------
blattimwind
tl;dr

If you want to parallelize[1] network I/O, use async. Otherwise, don't.

[1] Technically: not parallelize, but overlap.

~~~
nullp0tr
Concurrent is the technical term :)

~~~
blattimwind
Overlapping is a technical term, too, just not from the Unix world.

------
xstartup
I use gevent + gunicorn; it improved my API's latency and throughput by 10x.
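
For reference, that setup is just a worker-class flag (the app module
"myapp:app" is hypothetical; gevent must be installed):

    gunicorn --worker-class gevent --workers 4 --worker-connections 1000 myapp:app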

