Hacker News new | comments | show | ask | jobs | submit login
Memory efficiency of parallel IO operations in Python (kiwi.com)
167 points by underyx 4 days ago | hide | past | web | favorite | 58 comments

Honest question.

Can somebody explain to me why async IO is so important and why it is better than using the operating system scheduler?

If process A is blocked because of IO, then then thing that needs to be done will need to wait for the IO anyways.

Of course, in a server context, process A cannot handle new server requests while it is blocked. But luckily we can run more than one process, so process B will be free to pick it up. I will need to run a few more worker processes than there are CPU cores, but is there a problem with that?


I'm thinking now the problem is maybe that running more workers than there are cores will mean that the server accepts more concurrent connections than it can handle? If I use async code and run exacly as many workers as I have cores, the workers will never blocked. But then, I have the scenario where multiple async callbacks resolve in short sequence, but cannot be picked up by a worker because all workers are busy.

So, in both scenarios (no async but more workers than cores VS async with as many workers as cores) it can happen that the server puts too much on its plate and accepts more than it can handle.

I have a feeling that this is a fundamental problem that manifests itself differently in both paradigms, but exists notheless?

I have an unpopular opinion. Non blocking IO is useful when we want to scale to lots and lots of parallel IO channels. If we don't have lots, only dozens, there is no strong need for non blocking IO. You don't actually need anything like coroutones or event driven frameworks to use non blocking IO, just loop through the sockets and tend to the ones that are ready, though event driven frameworks can make it much nicer (though still significantly more awkward than writing code in a synchronous paradigm).

Within the world of "we want non blocking IO", the world has gone crazy. JavaScript has created a whole generation of programmers who think inside out callback / yielded code is how everything should be done, and that threads are "too hard". They hardly understand the original purpose of non blocking IO and conflate the fact that they can conceptualize event driven code better than they can threads with "well that's why it's so much faster" (which usually it isn't). I can't retire fast enough from this industry.

I think many people agree that speed is not a reason to choose non-blocking IO over threads.

However, many people do believe that async/await and event loops make reasoning about non-blocking IO much easier. Has your opinion changed since http://techspot.zzzeek.org/2015/02/15/asynchronous-python-an...?

> However, many people do believe that async/await and event loops make reasoning about non-blocking IO much easier.

they do, until their program is mysteriously having MySQL server drop their connections randomly because something CPU-bound snuck in and the server dumps non-authenticated connections after ten seconds. Three days of acquiring and poring over HAProxy debug logs from the production system finally reveals the issue that never really should have happened in the first place because the server is only handling about 30 requests per second, and of course the fix is to switch that part of the program to threads.

asyncio certainly makes it easier to reason about non-blocking IO but it also means you have to construct your own preemptive multitasking system by hand, given only points of IO where context can actually switch. We're coding in high level scripting languages. Low level details like memory allocation, garbage collection, and multitasking should be taken care of for us.

Threads are a leaky abstraction, so it makes sense to explore other solutions. To cite another low level detail from you list: I don't think Rust is less usable because there is no garbage collection.

But keep in mind asyncio has many issues of its own, which is why I'm happy that alternatives like http://trio.readthedocs.io/ are possible in Python.

I don't think that's such an unpopular opinion. At least I think I'm mostly on board. :)

I don't really want to write callback-driven code anywhere, but even for simple definitely-not-thousands-of-concurrent-connections kinda problems, the typical language's standard library operation of "do this one thing and wait indefinitely until it's done" gets fairly annoying, working around it with threads is also annoying, and I'm just kinda hoping that the people working on making the async IO user experience better will accidentally make my life easier as well.

I suspect many people would be very sad if you retired from the industry! Am pretty sure I would quit if sqlalchemy stopped being developed.

If you flip the perspective and look at it from an async-first point of view, then OS-level processes are just a task implementation where every operation is a (potential) yield point and the state passed between them is the C stack. There are two flaws to this, one arguable, one less so:

1. Yield points are implicit rather than explicit, so their interaction with other effects is unpredictable. https://glyph.twistedmatrix.com/2014/02/unyielding.html has a description of the problem; more generally think about the reputation that thread safety bugs have. (There are those who argue that the advantages of implicit pervasive yielding outweigh the disadvantages).

2. At every yield the processor has to swap out the full C stack (usually 4K or 8K). This is a slow operation ("context switch") and inefficient when often the only information that actually needed to be passed from one task to the next was a single integer (e.g. a socket ID, user ID, SQL query ID, etc.) or something similarly small. Whereas with userspace scheduling, a task switch only has to pass the actual task state that's needed for the task in question.

Small point, but stack-pointers are swapped... not the full stack. The way you've worded it makes it seem like the whole stack is copied which of course is not the case.

True and in a machine with 0 cache, it's 100% true. Though, in real world with caches, it a little muddier. Yes, the stack pointer is switched. Depending on how much is referenced off that stack pointer, though, much of the stack may be in fact switched - in the cache.

Right. And I think a typical stack size is ballpark 8MB, not 8KB, these days.

It varies. 8MiB is believable. That's 8MiB (plus a guard page) of address space. Typically it's divided into 4KiB ("small") pages which are only backed with real RAM once accessed. So maybe only 4KiB of real RAM, depending on what your program is doing...plus whatever bookkeeping the kernel needs, which I think can be as much as 128KiB on Linux, unfortunately. (At least something makes it use that much for me at work, although it might be related to something special I'm doing.)

It's also possible (but not common) to release accessed-but-no-longer-needed pages. Doing this when returning to a thread pool avoids situations where one occasional deep call chain causes all the threads in the pool to eventually become huge.

I was working with pthreads recently and the default thread stack size is 8M.

Thread swapping has a non-negligible cost but I'm not sure what you mean by swapping the stack. Aren't they just swapping stack pointers? That's how I'd naively implement it.

You still have to bank a whole bunch of registers and there might be detrimental side effects but I can't imagine why the stack would get in the way. Process swapping is obviously much worse since you change the process ID and the virtual memory mapping but even then I don't see why you'd ever copy the stack.

Yes, you're right. The size of the stack is relevant to how much memory each switched-out task takes up, but not directly to how costly the switching process is.

A thread requires at minimum a new stack segment. That is 1kb or something like that, but if you plan on accepting a lot of connections and keep them on hold, it adds up fast and quickly becomes the bottleneck of your server.

A new process requires a new stack and a new heap segment, that later one is usually on the MB range.

Besides, starting and stopping threads take some time. If you are serving small files, forking at their start means that your process will spend most of its time forking. And if you don't use an easy architecture that forks on every connection, it isn't much more difficult to go all the way and make your server fully asynchronous.

> Can somebody explain to me why async IO is so important and why it is better than using the operating system scheduler?

You still do.

The difference is with thread-based IO you block a task until the IO scheduler is done with one operation, while with async IO you block a task until the IO scheduler is done with any operation.

The reason why async can be more efficient is that you have fewer tasks and possibly less task-housekeeping-overhead.

It's important to note that disk I/O generally has poor support for asynchronous operation (because disk I/O traditionally meant that high I/O concurrency brutally murders throughput, which has changed). It's something that's being worked on for Linux, though.

Suppose your system has 10,000 network connections open but two cores.

For each of those network connections, your program needs to retain some state for what it is doing.

There are various places to put the state. You can put the state in a callstack and use blocking IO, but then you have to have one process per connection, which has quite a high minimum memory overhead.

So people have developed a lot of frameworks which keep (connection,program) state in various ways and use the operating system's asynchronous frameworks, so you can have two worker processes handling thousands of connections.

I really enjoy phrasing the problems in terms of where the state lives. The implicit state machine of where in the code you are and all the crud in your stack is so ingrained in how I reason about code, it's like it isn't even there at all! But if I have to think hard about state and state transitions on every yield point or whatever, it's a bit of a headache.

CPU tasks (threads/processes) are quite heavy, for short and bursty connections the async model is a bit better in terms of memory usage but takes a bit more CPU in exchange (since you're effectively doing task scheduling on top of the OS task scheduler)

Multiprocessing has a lower CPU overhead and is usually the better choice if you have less but heavier connections (ie do a lot of work on each connection) since the OS scheduler can then properly allocate resources to each process (ie, providing a minimum amount of CPU time to all processes so no connection gets stuck). Async can do that too but it doesn't have the resource allocation granularity that processes have (In go or JS I can't just allocate less resources like CPU time to a connection)

Using thread/processes might work well if you have a bunch of workers that don't need to share a lot of state between them. Something like a simple web server for instance.

If however you need to keep a lot of shared state up-to-date it may become a bit messy. Consider a multiplayer video game server for instance, everybody needs to know where everybody else is. If you have different threads or process for every connection you need to have some sort of synchronization mechanism to update the shared game state. Meanwhile async I/O kind of hides the nasty details of synchronizing multiple connection and you end up with a more or less serialized and single-threaded events. You can handle each update one at a time synchronously.

Web services generally manage to go the first route because the synchronization is typically handled at the database level so there's no shared state at the webapp layer so each worker is effectively independent from the rest.

If we continue looking at memory usage as is the focus of the article, then I suspect that this multi-process model will have the highest memory usage of all the approaches. I’m under the impression that processes are quite heavyweight in resources compared to using threads or async.

Also, you may only need the concurrency for a small portion of your code, and the other approaches may be simpler to code and maintain (I’m looking at async).

One of the benefits of async code as compared to threading is a much easier to reason about data model. In comparison to threading, where shared data structures can change at any time, async code an be assured of data consistency within a given "block" of code - whether that block is defined by being within a single callback function or by being between `await`s in Python.

This is not a very accurate comment. Async code, cooperative scheduling, threading, and shared memory are all completely independent. You can have any combination of those.

It comes down to this: Mutual exclusion is a requirement for concurrent operations.

This can be accomplished with mutexes (usually how thread-based implementations work), cooperative scheduling (how some async implementations work, but not all), shared nothing architectures (one request per process, actor/CSP model), among other approaches.

There are plenty of hybrids out there. libevhtp is a threaded async web server. It screams. Erlang presents a no shared memory model along with an aggressive scheduler, allowing for one request per process designs, except in this case a process is extremely lightweight. Golang does something similar with goroutines, similarly lightweight units of execution.

It's important to handle many socket connections for once. Where "many" may be in the order of million. If you wonder who on earth need that many connections, there are many use-cases like pub/sub servers. Native threads and processes use an excessive amount of ram and context switching is way slower (more so for processes) in comparison. Not to mention asyncIO allows sharing memory (without copying), which processes will not (message passing through pipes or sockets is way slower) and since asyncIO is non-preemptive, it make it easier to deal with shared state than when using preemptive native-threads.

Doing asyncIO was a way to solved the c10k[0] issue back in the day.

[0] https://en.wikipedia.org/wiki/C10k_problem


Fixed. Thanks :D

There is really only 1 problem: Linux doesn't handle 50000 threads very well. There isn't anything fundamental about that, because Async IO can be implemented with coroutines which are basically lightweight threads.

linux handles 50K threads just fine (I work for a company that does this).

I guess you are right - I haven't been able to find the sources I was remembering that gave me this idea. Maybe it was just about pid_max. The default of about 32K clearly wouldn't cut it.

The max is only 4M so I guess I'll change my statement to be linux doesn't handle millions of threads very well.

You're probably thinking of Linux's older thread system, LinuxThreads, which didn't scale at all. It was replaced with NPTL, which scales far better. I think also the kernel scheduler got much better.

I'm not aware of any system that manages millions of threads well- it's also not clear what the usecase for this would be.

I'm not thinking of LinuxThreads, but internally Linux still treats threads and processes as one. So each thread is assigned a PID.

As per the top post, millions of threads means there is no need to implement your own scheduling system with async IO, which can handle millions of idle connections.

"Can somebody explain to me why async IO is so important and why it is better than using the operating system scheduler?"

There are many comments discussing the general case, but there's also a specific case that is relevant here, which is that Python is not very good at threading. Retrofitting threading onto dynamic scripting languages 10+ years into their lifetime has not proved a very successful project, with results ranging from "unusable" to "usable but probably not something you should put into production" [1].

So for these languages, "spawning lots of threads and using the OS scheduler" is simply not an option. The only way to recover any sort of concurrency is via async-style operations, where there's various ways of wrapping more or less syntax sugar around it but fundamentally at any given moment, only one instruction is being executed by the interpreter.

(I don't think dynamic scripting languages have any fundamental reason why they can't support threading, it's just really damned hard to retrofit that on to a code base that was optimized for many years for single-threaded behavior. The dynamic scripting languages that are popular all date back to the 1990s. Theoretically a new threadable one could be developed, but I suspect there's a lot of reasons why it would have a hard time gaining any traction, because this hypothetical new language would be trying to go toe-to-toe with all these other languages with decades of experience in the dynamic field, and on the "but we have working concurrency!" side you face competition from Go and other up-and-comers like Crystal, and I'm not sure there's enough sunlight in that niche to allow anything to grow.)

[1]: Generally, when I say this, people claim that there are some dynamic scripting languages that do support threading. Please point me at the exact module that implements it and show me some community consensus that it is safe to use in production. Last I knew, when I said this about a year ago, PHP was closest with a threading library, but community consensus was still "Yeah, don't use this in production." I have no issue with acknowledging that some dynamic scripting language has finally run the gauntlet to having a production-ready threading library, because the point I'm defending is that it was a gauntlet in the first place. If PHP does have production-ready threading, it was a project that took something like a full third or half of its lifetime to accomplish!

> with results ranging from "unusable" to "usable but probably not something you should put into production" [1].

> So for these languages, "spawning lots of threads and using the OS scheduler" is simply not an option.

Threaded Python applications are extremely common and are in widespread production use. The popular mod_wsgi Apache plugin is typically run in "daemon" mode where the Python code is run in a threaded server.

The issue where "spawning lots of threads" is not an option is when you are trying to parallelize IO in the range of many hundreds/thousands of concurrent connections within a single process. But that is not the general use case, because that process can only use one CPU core at a time which even for an IO-heavy process still presents a limiting factor. The typical "I want to run a web server" has fairly CPU-busy Python processes where a process typically handles a few dozen simultaneous requests, and for CPU parallelism you use multiple processes. Python has more CPU-busyness than people expect sometimes because it is after all an interpreted scripting language.

In the case of Python, adding threading was accompanied by adding the Global Interpreter Lock (GIL) that protects mutations of all data structures. Because of this, you can use Python threads for I/O-waiting just fine, but can't use them for CPU-bound tasks.

A similar situation exists in Ruby.

How about IronPython?

There is also a drop-in replacement for asyncio called uvloop [0]. It claims to be faster than asyncio, gevent, node.js, etc. and comparable to golang.

[0] https://magic.io/blog/uvloop-blazing-fast-python-networking/

More precisely, uvloop is an "asyncio.AbstractEventLoop" implementation using libuv.

It can be used instead of the default asyncio.SelectorEventLoop[0] on Lunix(? not Windows at least: https://github.com/MagicStack/uvloop/issues/14).

You will still, for better or worse, use the asyncio standard library when writing code running on uvloop.

[0]: https://docs.python.org/3/library/asyncio-eventloops.html#av...

> In my case, there was a ~40% speed increase compared to sequential processing. Once a code runs in parallel, the difference in speed performance between the parallel methods is very low.

But they didn't actually present the total processing time for all of the methods - I assume all of the parallel methods were about 17 seconds? (Compared to the sequential baseline of 29 seconds.) And how were the threaded frameworks configured? How many threads were they told to use (or just the default?), how many threads can they use, and what kind of parallel hardware did they run on?

This blog post presents the decision as one-dimensional; it claims all parallelization methods are the same, so the only dimension to choose on is memory efficiency. But I'm skeptical that all parallelization methods are the same, and the experimental design gives me no information on that front.

I had a squint at the ThreadPoolExecutor code on github and it uses the default parameters. Exactly how many threads you get depends on which version of python you are running:

> Changed in version 3.5: If max_workers is None or not given, it will default to the number of processors on the machine, multiplied by 5, assuming that ThreadPoolExecutor is often used to overlap I/O instead of CPU work and the number of workers should be higher than the number of workers for ProcessPoolExecutor.

Not knowing how many things are happening in parallel means it is difficult to draw conclusions from it.

Thanks for pointing this out. I updated the article.

> it claims all parallelization methods are the same, so the only dimension to choose on is memory efficiency

From a speed (overall script duration), they are very similar and the differences between them might be as well caused by a different state of a network. My main interest was how the methods perform regarding memory usage. That's why I focused on one dimension only.

And it's fair to care about memory usage, but I think it's unlikely that your circumstances regarding runtime performance (both your testing script and testing environment) will generalize to other scenarios.

Title should refer to concurrent IO, not parallel IO.

On the interpreter side, most of the I/O read time is waiting for an OS buffer to fill with the right amount of data. Waiting happens to occur "in parallel" when using async I/O.

I suppose that sending is similar: you pass the OS a buffer, and wait for completion. Sending of the data occurs while you wait.

So, if you can parallelize waits, the OS could be doing strictly parallel I/O for you (e.g. via two network interfaces, or network and disk), even though your code is concurrent but not parallel, and you don't run two sync OS I/O calls in parallel.

> Waiting happens to occur "in parallel" when using async I/O. > and you don't run two sync OS I/O calls in parallel.

you can run sync IO calls and wait for each in separate threads. the GIL is released for IO. the waiting is "in parallel" just as much with a threaded / blocking approach as with a non-blocking.

Waiting for multiple concurrent things to complete isn't parallelism, in the on-topic jargon meaning of the term.

Parallelism is things physically being executed simultaneously, increasing throughput via physically multiple resources. The multiple IOs you're waiting on might be executing in parallel if you have e.g. a RAID array or SAN in the loop, but it's an implementation detail and isn't what the article is about.

if I write a web crawler and send off requests to 100 web servers, then handle results as they come in, those 100 web servers are working in parallel to get me my data back. So while I can't process the IO in parallel I am parallelizing my workload. Without parallelism of some form, your program would take the entire time the non-concurrent form takes.

Those web servers are working concurrently; they may or may not be working in parallel, that's an implementation detail. Without examining the wait states on the processes, you can't easily say.

The two terms mean something technically different. The article isn't about parallelism.

Something seems off here - they mention using 100 workers (a worker for every request). I would expect that to perform way more than 40% faster unless there's a ton of overhead in creating those workers.

I use gevent + gunicorn, it improved by API's latency and throughput by 10x.

If you have a lot of RAM I highly recommend you look at Plasma + PyArrow

Sounds like a good job for Go :)


If you want to parallelize[1] network I/O, use async. Otherwise, don't.

[1] Technically: not parallelize, but overlap.

Well, it's use async if you need the ~40% decrease in ram usage for your concurrent I/O access.

Concurrent is the technical term:)

Overlapping is a technical term, too, just not from the Unix world.

Very few of my networking projects risk running out of RAM. I'm usually fine with spawning a new thread per connection.

Applications are open for YC Summer 2018

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact