When you're dealing with external REST APIs that take multiple seconds to respond, the async version is substantially "faster" because your process can get other useful work done while it waits. Obviously the async framework introduces some overhead, but that bit of overhead is probably a lot less than the 3 billion CPU cycles you'd otherwise waste waiting 1000ms for an external service.
As I describe in the first line of my article, I don't think that people who think async is faster have unreasonable expectations. It seems very intuitive to assume that greater concurrency would mean greater performance - at least on some measure.
> When you're dealing with external REST APIs that take multiple seconds to respond, then the async version is substantially "faster" because your process can get some other useful work done while it's waiting.
I'm afraid I also don't think you have this right conceptually. An async implementation that does multiple ("embarrassingly parallel") tasks in the same process - whether that is DB IO waiting or microservice IO waiting - is not necessarily a performance improvement over a sync version that just starts more workers and has the OS kernel scheduler organise things. In fact in practice an async version is normally lower throughput, higher latency and more fragile. This is really what I'm getting at when I say async is not faster.
Fundamentally, you do not waste "3 billion CPU cycles" waiting 1000ms for an external service. Making alternative use of the otherwise idle CPU is the purpose (and, IMO, the proper domain) of operating systems.
Sure, the operating system can find other things to do with the CPU cycles when a program is IO-locked, but that doesn't help the program that you're in the situation of currently trying to run.
> An async implementation that does multiple ("embarrassingly parallel") tasks in the same process - whether that is DB IO waiting or microservice IO waiting - is not necessarily a performance improvement over a sync version that just starts more workers and has the OS kernel scheduler organise things. In fact in practice an async version is normally lower throughput, higher latency and more fragile. This is really what I'm getting at when I say async is not faster.
You're right. "Arbitrary programs will run faster" is not the promise of Python async.
Python async does help a program work faster in the situation that phodge just described (waiting for web requests, or waiting for a slow hardware device), since the program can do other things while waiting for the locked IO (unlike a Python program that does not use async and could only proceed linearly through its instructions). That's the problem that Python asyncio purports to solve. It is still subject to the Global Interpreter Lock, meaning it's still bound to one thread. (Python's multiprocessing library is needed to overcome the GIL and break out a program into multiple threads, at the cost that cross-thread communication now becomes expensive).
This isn't how it works. While Python is blocked in I/O calls, it releases the GIL so other threads can proceed. (If the GIL were never released then I'm sure they wouldn't have put threading in the Python standard library.)
> Python's multiprocessing library is needed to overcome the GIL
This is technically true, in that if you are running up against the GIL then the only way to overcome it is to use multiprocessing. But blocking IO isn't one of those situations, so you can just use threads.
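To make this concrete, here's a small illustrative sketch (mine, not from the thread) showing that blocking calls release the GIL, so plain threads overlap their waits. `time.sleep` stands in for any blocking I/O call (a socket read, a database query) that drops the GIL while it waits:

```python
import threading
import time

def blocking_io():
    # time.sleep releases the GIL while blocked, just like a blocking
    # socket or file read would, so other threads can run in the meantime
    time.sleep(0.2)

start = time.monotonic()
threads = [threading.Thread(target=blocking_io) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.monotonic() - start

# ten 0.2s waits overlap, so total wall time is ~0.2s rather than ~2s
print(f"10 x 0.2s blocking calls took {elapsed:.2f}s total")
```

If the GIL were held across blocking I/O, this would take about two seconds; in practice it finishes in roughly the time of a single wait.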
The comparison here is not async vs just doing one thing. It's async vs threads. I believe that's what the performance comparison in the article is about, and if threads were as broken as you say then obviously they wouldn't have performed better than asyncio.
As an aside, many C-based extensions also release the GIL when performing CPU-bound computations e.g. numpy and scipy. So GIL doesn't even prevent you from using multithreading in CPU-heavy applications, so long as they are relatively large operations (e.g. a few calls to multiply huge matrices together would parallelise well, but many calls to multiply tiny matrices together would heavily contend the GIL).
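The same effect can be demonstrated with the standard library alone: CPython's `hashlib` releases the GIL when hashing buffers larger than about 2 KB, so hashing big blocks in threads can genuinely run in parallel. A sketch (timings vary by machine and core count, so none are asserted):

```python
import hashlib
import threading
import time

# 32 MB buffer - comfortably over the threshold at which hashlib
# drops the GIL during the hash computation
data = b"x" * (32 * 1024 * 1024)

def hash_it(out, i):
    out[i] = hashlib.sha256(data).hexdigest()

# hash the buffer four times using four threads
start = time.perf_counter()
out = [None] * 4
threads = [threading.Thread(target=hash_it, args=(out, i)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.perf_counter() - start

# same work done serially, for comparison
start = time.perf_counter()
serial_out = [hashlib.sha256(data).hexdigest() for _ in range(4)]
serial = time.perf_counter() - start

print(f"4 hashes, threaded: {threaded:.2f}s, serial: {serial:.2f}s")
```

On a multi-core machine the threaded version is noticeably faster; with many tiny buffers instead, the GIL is retained per call and the threads would contend, matching the large-vs-tiny matrix point above.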
> No it's not, just use threads.
I just wanted to expand on this a little to describe some of the downsides to threads in Python.
Multi-threaded logic can be (and often is) slower than single-threaded logic because threading introduces overhead of lock contention and context switching. David Beazley did a talk illustrating this in 2010:
He also did a great talk about coroutines in 2015 where he explores threading and coroutines a bit more:
In workloads that are often "blocked", like network calls or other I/O-bound workloads, threads can provide similar benefits to coroutines, but with more overhead. Coroutines seek to provide the same benefit without as much overhead (no lock contention, fewer context switches by the kernel).
It's probably not the right guidelines for everyone but I generally use these when thinking about concurrency (and pseudo-concurrency) in Python:
- Coroutines where I can.
- Multi-processing where I need real concurrency.
- Never threads.
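As a minimal sketch of the "coroutines where I can" guideline (my example, not from the talks): three simulated I/O waits overlap on one event loop, so the batch finishes in roughly the time of one wait. `asyncio.sleep` stands in for a real network call:

```python
import asyncio
import time

async def fake_request(i):
    # awaiting yields to the event loop, like awaiting a real socket;
    # asyncio.sleep(0.2) simulates a 200ms network round trip
    await asyncio.sleep(0.2)
    return i

async def main():
    start = time.monotonic()
    # gather schedules all three coroutines concurrently on one thread
    results = await asyncio.gather(*(fake_request(i) for i in range(3)))
    return results, time.monotonic() - start

results, elapsed = asyncio.run(main())
print(results, f"{elapsed:.2f}s")
```

The waits overlap, so the total is ~0.2s rather than ~0.6s - the same concurrency a thread pool would give, but with no locks and no kernel context switches between the tasks.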
The point is, many people think (including you, judging by your comment, and certainly including me up until now, but now I'm just confused) that in Python asyncio is better than using multiple threads with blocking IO. The point of the article is to dispel that belief. There seems to be some debate about whether the article is really representative, and I'm very curious about that. But then the parent comment to mine took us on an unproductive detour based on the misconception that Python threads don't work at all. Now your comment has brought up that original belief again, but you haven't referenced the article at all.
The point of my comment is to say that neither threads nor coroutines will make Python _faster_ in and of themselves. Quite the opposite, in fact: threading adds overhead, so unless the benefit outweighs that overhead (e.g. lock contention and context switching) your code will actually be net slower.
I can't recommend the videos I shared enough, David Beazley is a great presenter. One of the few people who can do talks centered around live coding that keep me engaged throughout.
> The point is, many people think (including you judging by your comment, and certainly including me up until now but now I'm just confused) that in Python asyncio is better than using multiple threads with blocking IO. The point of the article is to dispel that belief.
The disconnect here is that this article isn't claiming that asyncio is not faster than threads. In fact, the article only claims that asyncio is not a silver bullet guaranteed to increase the performance of any Python logic. The misconception it is trying to clear up, in its own words, is:
> Sadly async is not go-faster-stripes for the Python interpreter.
What I, and many others are questioning is:
A) Is this actually as widespread a belief as the article claims it to be? None of the results are surprising to me (or apparently some others).
B) Is the article accurate in its analysis and conclusion?
As an example, take this paragraph:
> Why is this? In async Python, the multi-threading is co-operative, which simply means that threads are not interrupted by a central governor (such as the kernel) but instead have to voluntarily yield their execution time to others. In asyncio, the execution is yielded upon three language keywords: await, async for and async with.
This is a really confusing paragraph because it seems to mix terminology. A short list of problems in this quote alone:
- Async Python != multi-threading.
- Multi-threading is not co-operatively scheduled; threads are indeed interrupted by the kernel (context switches between threads in Python do actually happen).
- Asyncio is co-operatively scheduled and pieces of logic have to yield to allow other logic to proceed. This is a key difference between Asyncio (coroutines) and multi-threading (threads).
- Asynchronous Python can be implemented using coroutines, multi-threading, or multi-processing; it's a common noun but the quote uses it as a proper noun leaving us guessing what the author intended to refer to.
Additionally, there are concepts and interactions which are missing from the article such as the GIL's scheduling behavior. In the second video I shared, David Beazley actually shows how the GIL gives compute intensive tasks higher priority which is the opposite of typical scheduling priorities (e.g. kernel scheduling) which leads to adverse latency behavior.
So looking at the article as a whole, I don't think the underlying intent of the article is wrong, but the reasoning and analysis presented is at best misguided. Asyncio is not a performance silver bullet, it's not even real concurrency. Multi-processing and use of C extensions is the bigger bang for the buck when it comes to performance. But none of this is surprising and is expected if you really think about the underlying interactions.
To rephrase what you think I thought:
> The point is, many people think (including you judging by your comment, and certainly including me up until now but now I'm just confused) that in Python asyncio is better than using multiple threads with blocking IO.
Is actually more like:
> Asyncio is more efficient than multi-threading in Python. It is also comparatively more variable than multi-processing, particularly when dealing with workloads that saturate a single event loop. Neither multi-threading nor Asyncio is actually concurrent in Python; for that you have to use multi-processing to escape the GIL (or some C extension which you trust to safely execute outside of GIL control).
Regarding your aside example, it's true some C extensions can escape the GIL, but oftentimes it's with caveats and careful consideration of where/when you can escape the GIL successfully. Take for example this scipy cookbook regarding parallelization:
It's not often the case that using a C extension will give you truly concurrent multi-threading without significant and careful code refactoring.
Threads and async are not mutually exclusive. If your system resources aren't heavily loaded, it doesn't matter, just choose the library you find most appropriate. But threads require more system overhead, and eventually adding more threads will reduce performance. So if it's critical to thoroughly maximize system resources, and your system cannot handle more threads, you need async (and threads).
Absolutely false. OS threads are orders of magnitude lighter than any Python coroutine implementation.
But Python threads, which carry extra weight from a cross-platform abstraction layer on top of the underlying OS threads, are not lighter than Python coroutines.
You aren't choosing between Python threads and unadorned OS threads when writing Python code.
I'm pointing out that this is a Python problem, not a threads problem, a fact which people don't understand.
- receives an HTTP POST multipart/form-data that contains three file parts. The first part is JSON.
- parses the form.
- parses the JSON.
- depending upon the JSON accepts/rejects the POST.
- for accepted POSTs, writes the three parts as three separate files to S3.
It runs behind nginx + uwsgi, using the Falcon framework. For parsing the form I use streaming-form-data which is cython accelerated. (Falcon is also cython accelerated.)
I tested various deployment options. cpython, pypy, threads, gevent. Concurrency was more important than latency (within reason). I ended up with the best performance (measured as highest RPS while remaining within tolerable latency) using cpython+gevent.
It's been a while since I benchmarked and I'm typing this up from memory, so I don't have any numbers to add to this comment.
> ...it would take possibly 8 gigs of memory.
No. Nothing is 'taken' when virtual memory is requested.
During any operation that does not need to modify Python objects, it is safe to unlock the GIL. Yielding control to the OS to wait on I/O is one such example, but doing heavy computation work in C (e.g. numpy) can be another.
Single-threaded performance is a major issue as that's most Python code.
Any solution which wants to consider per-object locking has to consider removing refcounting, or locking the refcount bits separately, as locking/unlocking objects to twiddle their refcounts is going to be ridiculously expensive.
Ultimately, the Python ownership and object model is not conducive to proper threading, as most objects are global state and can be mutated by any thread.
Workers (which usually live in a new process) are not efficient. Processes are extremely expensive, and subjectively harder for exception handling. Threads are lighter weight, and even better are async implementations that use a much more scalable FSM to handle this.
Offloading work to things not subject to the GIL is the reason async Python got so much traction. It works really well.
On the web when the bulk of your application code time is waiting on APIs, database queries, external caches or disk I/O it creates a dramatic increase in the capacity of your server if you can do it with minimal RAM overhead.
It's one of the big reasons I've always wanted to see Techempower create a test version that continues to increase concurrency beyond 512 (as high as maybe 10k). I think it would be interesting.
Python doesn't block on I/O.
Edit: sorry I can do better.
If you're using async/await to not block on I/O while handling a request, you still have to wait for that I/O to finish before you return a response. Async adds overhead because you schedule the coroutine and then resume execution.
The OS is better at scheduling these things because it can do it in kernel space in C. Async/await pushes that scheduling into user space, sometimes in interpreted code. Sometimes you need that, but very often you don't. This is in conflict with "async the world", which effectively bakes that overhead into everything. This explains the lower throughput, higher latency, and higher memory usage.
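That per-call machinery is easy to measure. A rough micro-benchmark (an illustrative sketch of mine; absolute numbers vary by machine) comparing a trivial function called directly with the same function awaited as a coroutine - the work is identical, so the gap is pure async overhead paid in user space:

```python
import asyncio
import time

N = 100_000

def sync_step(x):
    return x + 1

async def async_step(x):
    return x + 1

def run_sync():
    start = time.perf_counter()
    acc = 0
    for _ in range(N):
        acc = sync_step(acc)  # a plain function call
    return acc, time.perf_counter() - start

async def run_async():
    start = time.perf_counter()
    acc = 0
    for _ in range(N):
        # each iteration creates a coroutine object and drives it to
        # completion - the user-space scheduling cost discussed above
        acc = await async_step(acc)
    return acc, time.perf_counter() - start

sync_acc, sync_t = run_sync()
async_acc, async_t = asyncio.run(run_async())
print(f"plain calls:   {sync_t:.3f}s")
print(f"awaited calls: {async_t:.3f}s")
```

Each step is cheap either way, but when you "async the world" that constant factor is baked into every call path, whether or not the call ever actually blocks.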
So effectively this means "run more processes/threads". If you can only have 1 process/thread and cannot afford to block, then yes async is your only option. But again that case is pretty rare.
If you go back to the origins of Erlang, the intent was to build a language that would make it easier to write software for telecom (voice) switches; what comes out of that is one process for each line, waiting for someone to pick up the line and dial or for an incoming call to make the line ring (and then connecting the call if the line is answered). Having this run as an isolated process allows for better system stability --- if someone crashes the process attached to their line, the switch doesn't lose any of the state for the other lines.
It turns out that a 1980s design for operational excellence works really well for (some) applications today. Because the processes are isolated, it's not very tricky to run them in parallel. If you've got a lot of concurrent event streams (like users connected via XMPP or HTTP), assigning each a process makes it easy to write programs for them, and because Erlang processes are significantly lighter weight than OS processes or threads, you can have millions of connections to a machine, each with its own process.
You can absolutely manage millions of connections in other languages, but I think Erlang's approach to concurrency makes it simpler to write programs to address that case.
Immutable data, heap isolated by concurrent process and lack of shared state, combined with supervision trees made possible because of extremely low overhead concurrency, and preemptive scheduling to prevent any one process from taking over the CPU...create that operational consistency.
It's a combination of factors that have gone into the language design that make it all possible though. Very big and interesting topic.
But it does create a significant capacity increase. Here's a simple example with websockets.
I built a service that was making a lot of requests - so many that at some point we ran out of the 65k connection limit for basic Linux polling (we needed to switch to kpoll). Some time after that we ran out of other resources, and switching from threads to threads+greenlets really solved our problem.
This is very true, especially when actual work is involved.
Remember, the kernel uses the exact same mechanism to have a process wait on a synchronous read/write as it does for a process issuing epoll_wait. Furthermore, isolating tasks into their own processes (or, sigh, threads) allows the kernel scheduler to make much better decisions, such as scheduling fairness and QoS to keep the system responsive under load surges.
Now, async might be more efficient if you serve extreme numbers of concurrent requests from a single thread if your request processing is so simple that the scheduling cost becomes a significant portion of the processing time.
... but if your request processing happens in Python, that's not the case. Your own scheduler implementation (your event loop) will likely also end up eating some resources (remember, you're not bypassing anything, just duplicating functionality), and is very unlikely to be as smart or as fair as that of the kernel. It's probably also entirely unable to do parallel processing.
And this is all before we get into the details of how you easily end up fighting against the scheduler...
I'd guess the C++ event loop is more important than the JIT?
Maybe a better comparison is quart (with eg uvicorn)
Or Sanic / uvloop?
There's also the major issue of backpressure handling, but that's a whole other story, and not unique to Python.
My major issue with the post I replied to is that there are a bunch of confounding issues that make the comparison given meaningless.
Moreover, even if there's _still_ a discrepancy, unless you're profiling things, the discussion is moot. This isn't to say that there aren't problems (there almost certainly are), but that you should get as close as possible to an apples-to-apples comparison first.
So really we're in agreement. You're talking about reimplementing python specific things to make it more performant, and that is exactly another way of saying that the problem is python specific.
It's neither fair nor correct to mush together CPython's async/await implementation with the implementation of asyncio.SelectorEventLoop. They are two different things and entirely independent of one another.
Moreover, it's neither fair nor correct to compare asyncio.SelectorEventLoop with the event loop of node.js, because the former is written in pure Python (with performance only tangentially in mind) whereas the latter is written in C (libuv). That's why I pointed you to uvloop, which is an implementation of asyncio.AbstractEventLoop built on top of libuv. If you want to even start with a comparison, you need to eliminate that confounding variable.
Finally, the implementation matters. node.js uses a JIT, while CPython does not, giving them _much_ different performance characteristics. If you want to eliminate that confounding variable, you need to use a Python implementation with a JIT, such as PyPy.
Do those two things, and then you'll be able to do a fair comparison between Python and node.js.
All that matters is the concurrency model, because the application he's running is barely doing anything except IO, and anything outside of IO becomes negligible: after enough requests, those sync worker processes will all be spending the majority of their time blocked by an IO request.
The basic essence of the original claim is that sync is not necessarily better than async for all cases of high-IO tasks. I bring up node as a counter-example because that async model IS faster for THIS same case. And bringing up node is 100% relevant because IO is the bottleneck, so it doesn't really matter how much faster node is executing, as IO should be taking most of the time.
Clearly and logically the async concurrency model is better for these types of tasks so IF tests indicate otherwise for PYTHON then there's something up with python specifically.
You're right, we are in disagreement. I didn't realize you completely failed to understand what's going on and felt the need to do an apples to apples comparison when such a comparison is not Needed at all.
And I'm saying all those confounding variables you're talking about are negligible and irrelevant.
Why? Because the benchmark test in the article is a test where every single task is 99% bound by IO.
What each task does is make a database call AND NOTHING ELSE. Therefore you can safely say that for either a Python or a Node request, less than 1% of a single task will be spent on processing while 99% of the task is spent on IO.
You're talking about scales on the order of 0.01% vs. 0.0001%. Sure maybe node is 100x faster, but it's STILL NEGLIGIBLE compared to IO.
It is _NOT_ nonsense.
You Do not need an apples to apples comparison to come to the conclusion that the problem is Specific to the python implementation. There ARE NO confounding variables.
No, you're asserting something without actual evidence, and the article itself doesn't actually state that either: it contains no breakdown of where the time is spent. You're assuming the issue lies in one place (Python's async/await implementation) when there are a bunch of possible contributing factors _which have not been ruled out_.
Unless you've actually profiled the thing and shown where the time is used, all your assertions are nonsense.
Show me actual numbers. Prove there are no confounding variables. You made an assertion that demands evidence and provided none.
It's data science that is causing this data-driven attitude to invade people's minds. Do you not realize that logic and assumptions play a big role in drawing conclusions WITHOUT data? In fact, if you're a developer you know about a way to DERIVE performance WITHOUT a single data point or benchmark or profile. You know about this method, you just haven't been able to see the connections, and your model of how this world works (data-driven conclusions only) is highly flawed.
I can look at two algorithms and I can derive with logic alone which one is O(N) and which one is O(N^2). There is ZERO need to run a benchmark. The entire theory of complexity is a mathematical theory used to assist us at arriving AT PERFORMANCE conclusions WITHOUT EVIDENCE/BENCHMARKS.
Another thing you have to realize is the importance of assumptions. Things like 1 + 1 = 2 will remain true always, and a profile or benchmark run on a specific task is an accurate observation of THAT task. These are both reasonable assumptions to make about the universe. They are also the same assumptions YOU are making every time you ask for EVIDENCE and benchmarks.
What you aren't seeing is this: The assumptions I AM making ARE EXACTLY THE SAME: reasonable.
>you're asserting something without actual evidence, and the article itself doesn't actually state that either: it contains no breakdown of where the time is spent
Let's take it from the top shall we.
I am making the assumption that tasks done in parallel ARE Faster than tasks done sequentially.
The author specifically stated that he made a server where each request fetches a row from the database, and he is saying that his benchmark consisted of thousands of concurrent requests.
I am also making the assumption that for thousands of requests and thousands of database requests MOST of the time is spent on IO. It's similar to deriving O(N) from a for loop. I observe the type of test the author is running and I make a logical conclusion on WHAT SHOULD be happening. Now you may ask why is IO specifically taking up most of the time of a single request a reasonable assumption? Because all of web development is predicated on this assumption. It's the entire reason why we use inefficient languages like python, node or Java to run our web apps instead of C++, because the database is the bottleneck. It doesn't matter if you use python or ruby or C++, the server will always be waiting on the db. It's also a reasonable assumption given my experience working with python and node and databases. Databases are the bottleneck.
Given this highly reasonable assumption, and in the same vein as using complexity theory to derive performance speed, it is highly reasonable for me to say that the problem IS PYTHON SPECIFIC. No evidence NEEDED. 1 + 1 = 2. I don't need to put that into my calculator 100 times to get 100 data points for some type of data driven conclusion. It's assumed and it's a highly reasonable assumption. So reasonable that only an idiot would try to verify 1 + 1 = 2 using statistics and experiments.
Look, you want data and no assumptions? First you need to get rid of the assumption that a profiler and benchmark are accurate and truthful. Profile the profiler itself. But then you're making another assumption: that the profiler that profiled the profiler is accurate. So you need to get me data on that as well. You see where this is going?
There is ZERO way to make any conclusion about anything without making an assumption. And Even with an assumption, the scientific method HAS NO way of proving anything to be true. Science functions on the assumption that probability theory is an accurate description of events that happen in the real world AND even under this assumption there is no way to sample all possible EVENTS for a given experiment so we can only verify causality and correlations to a certain degree.
The truth is blurry and humans navigate through the world using assumptions, logic and data. To intelligently navigate the world you need to know when to make assumptions and when to use logic and when data driven tests are most appropriate. Don't be an idiot and think that everything on the face of the earth needs to be verified with statistics, data and A/B tests. That type of thinking is pure garbage and it is the same misguided logic that is driving your argument with me.
I do a lot of Django and Nodejs and Django is great to sketch an app out, but I've noticed rewriting endpoints in Nodejs directly accessing postgres gets much better performance.
Just my 2c
Although note that, while Node.js is fast relative to Python, it's still pretty slow. If you're writing web-stuff, I'd recommend Go instead for casually written, good performance.
So you have two implementations of async that are both bottlenecked by IO. One is implemented in node. The other in python.
The node implementation behaves as expected in accordance with theory, meaning that for thousands of IO-bound tasks it performs faster than a fixed number of sync worker threads (say 5 threads).
This makes sense, right? Given thousands of IO-bound tasks, eventually all 5 threads must be doing IO and therefore blocked on every task, while the single-threaded async model is always context switching whenever it encounters an IO task, so it is never blocked and is always doing something...
Meanwhile the python async implementation doesn't perform in accordance with theory. 5 async workers are slower than 5 sync workers on IO-bound tasks. 5 sync workers should eventually be entirely blocked by IO while the 5 async workers should never be blocked at all... Why is the python implementation slower? The answer is obvious:
It's python specific. It's python that is the problem.
NodeJS is faster than flask because of the concurrency model and NOT because of the JIT.
The python async implementation being slower than the python sync implementation means one thing: Something is up with python.
The poster implies that with the concurrency model the outcome of these tests are expected.
The reality is, these results are NOT expected. Something is going on specifically with the python implementation.
I've tested this extensively on Linux. There is no more CPU used for threads vs epoll.
On the other hand, if you don't get the epoll implementation exactly right, you may end up with many spurious calls. E.g. simply reading slow data from a socket in golang on Linux incurs considerable overhead: a first read that is short, another read that returns EWOULDBLOCK, and then a syscall to re-arm the epoll. With OS threads, that is just a single call, where the next call blocks and eventually returns new data.
Edit: one thing I haven't considered when testing is garbage collection. I'm absolutely convinced that up to 10k connections, threads or async doesn't matter, in C or Rust. But it may be much harder to do GC over 10k stacks than over 8.
This seems to also be a major motivation behind io_uring.
The issue is memory usage, which OS threads take a lot of.
Would userland scheduling be more CPU efficient? Sure, probably in many cases. But I don't think that's the problem with handling many thousands of concurrent requests today.
Literally the first thing any concurrency course starts with in the very first lesson is that scheduling and context overhead are not negligible. Is it so hard to expect our professionals to know basic principles of what they are dealing with?
Co-routines are not necessarily faster than threads, but they yield a performance improvement if one has to spin up thousands of them: they have less creation overhead and consume less RAM.
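The creation-overhead half of this is easy to see (an illustrative sketch, not a claim about any specific workload): spawning ten thousand asyncio tasks on one event loop is quick, because each task is just a small heap object rather than an OS thread with its own stack:

```python
import asyncio
import time

async def worker():
    # a trivial task; asyncio.sleep(0) just yields once to the event loop
    await asyncio.sleep(0)

async def main(n):
    start = time.perf_counter()
    # create n tasks up front, then wait for all of them to finish
    tasks = [asyncio.create_task(worker()) for _ in range(n)]
    await asyncio.gather(*tasks)
    return time.perf_counter() - start

elapsed = asyncio.run(main(10_000))
print(f"10,000 coroutines created and completed in {elapsed:.3f}s")
```

Creating 10,000 OS threads instead would mean 10,000 kernel scheduling entities and stacks, which is the scale at which the coroutine argument starts to matter (and, as the reply below argues, often still less than you'd think).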
This hardly matters when spinning up a few thousand threads. Only memory that's actually used is committed, one 4k page at a time. What is 10MB these days? And that is main memory, while it's much more interesting what fits in cache. At that point it doesn't matter if your data is in heap objects or on a stack.
Add to that the fact that Python stacks are mostly on the heap, the real stack growing only due to nested calls in extensions. It's rare for a stack in Python to exceed 4k.
Theoretically it makes no sense. A Task manager executing tasks in parallel to IO instead of blocking on IO should be faster... So the problem must be in the implementation.
This is because, when they are first shown it, the examples are faster - effectively at least - because they get the given jobs done in less wallclock time due to reduced blocking.
They learn that, but often don't get told (or work out themselves) that in many cases the difference is so small as to be unmeasurable, or that in other circumstances there can be negative effects (overheads others have already mentioned in the framework, more things waiting on RAM in a part-processed state which could lead to thrashing in a low-memory situation, greater concurrent load on other services such as a database and the IO system it depends upon, etc.).
As a slightly off-the-topic-of-async example: back when multi-core processing was first becoming cheap enough that it was not just affordable but the default option, I had great trouble trying to explain to a colleague why two IO-intensive database processes he was running were so much slower than when I'd shown him the same processes (I'd run them sequentially). He was absolutely fixated on the idea that his four cores should make concurrency the faster option. I couldn't get through to him that in this case the flapping heads on the drives of the time were the bottleneck, and the CPU would be practically idle no matter how many cores it had while the bottleneck was elsewhere.
Some people learn the simple message (async can handle some loads much more efficiently) as an absolute (async is more efficient) and don't consider at all that the situation may be far more nuanced.
You mean concurrent tasks in the same process?
And I don't think I'm alone nor being unreasonable.
This is a quintessential example of not seeing the forest for the trees.
The point of coroutines is absolutely to make my code execute faster. If a completely I/O-bound application sits idle while it waits for I/O, I don't care and I should not care because there's no business value in using those wasted cycles. The only case where coroutines are relevant is when the application isn't completely I/O bound; the only case where coroutines are relevant is when they make your code execute faster.
It's been well-known for a long time that the majority of processes in (for example) a webserver are I/O bound, but there are enough exceptions to that rule that we need a solution to situations where the process is bound by something else, i.e. CPU. The classic solution to this problem is to send off CPU-bound processes to a worker over a message queue, but that involves significant overhead. So if we assume that there's no downside to making everything asynchronous, then it makes sense to do that--it's not faster for the I/O bound cases, but it's not slower either, and in the minority-but-not-rare CPU-bound case, it gets us a big performance boost.
What this test is doing is challenging the assumption that there's no downside to making everything asynchronous.
In context, I tend to agree with the conclusion that there are downsides. However, those downsides certainly don't apply to every project, and when they do, there may be a way around them. The only lesson we can draw from this is that gaining benefit from coroutines isn't guaranteed or trivial, but there is much more compelling evidence for that out there.
I think rather the point is to make your APPLICATION either finish in less time, or to not take MORE time when given more load.
The code runs as fast as it runs, coroutines notwithstanding.
> I think rather the point is to make your APPLICATION either finish in less time, or to not take MORE time when given more load.
And from my perspective, I don't think it's unreasonable for me to expect you to try to understand what I'm trying to communicate, rather than attempting to force me to use different words. The burden of communication is shared by both speaker and listener.
The surprising conclusion of the article is that in a realistic scenario, the async web frameworks will output fewer requests/sec than the sync ones.
I'm very familiar with Python concurrency paradigms, and I wasn't expecting that at all.
Add to that zzzeek's article (the guy wrote SQLAlchemy...) stating async is also slower for db access, and this makes async less and less appealing, given the additional complexity it adds.
Now apart from doing a crawler, or needing to support websockets, I find it hard to justify asyncio. In fact, with David Beazley hinting that you can probably get away with spawning 1000 threads, it raises more doubts.
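Beazley's point is easy to try for yourself. A sketch spawning a thousand sleeping threads (the sleep stands in for a blocked client connection); on a modern OS this completes in well under a second:

```python
import threading
import time

N = 1000
done = []
lock = threading.Lock()

def handler():
    # Stand-in for a client blocked on I/O; sleep, like real
    # blocking syscalls, releases the GIL.
    time.sleep(0.1)
    with lock:
        done.append(1)

start = time.perf_counter()
threads = [threading.Thread(target=handler) for _ in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start  # roughly 0.1s + startup cost
```

The main real cost is memory for thread stacks, which is mostly virtual address space on Linux.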
The whole point of async was that, at least when dealing with a lot of concurrent I/O, it would be a win compared to threads+multiprocessing. If just by cranking the number of sync workers you get better results for less complexity, this is bad.
Is the price of the context switching too high, or are you compensating for the weaknesses of each system by handling I/O concurrently in async, but smoothing out the blocking code outside of the await thanks to threads?
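That mixed approach is available out of the box: a sketch using `asyncio.to_thread` (Python 3.9+) to push a blocking call onto a worker thread while the event loop keeps running coroutines (the function names here are invented):

```python
import asyncio
import time

def blocking_call():
    # Legacy sync code we can't or won't rewrite; the sleep
    # stands in for a blocking driver call and releases the GIL.
    time.sleep(0.5)
    return "sync result"

async def async_io():
    await asyncio.sleep(0.5)
    return "async result"

async def main():
    # Both run concurrently: the blocking call goes to a worker
    # thread, the coroutine stays on the event loop.
    return await asyncio.gather(asyncio.to_thread(blocking_call), async_io())

start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start  # ~0.5s, not 1.0s
```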
Making a _clean_ benchmark for it would be really hard, though.
The author of "black" suggested that the cause of the slow down may be that asyncio actually starved postgres for resources:
but threads get you the same thing with much less overhead. this is what benchmarks like this one and my own continue to confirm.
People often are afraid of threads in Python because "the GIL!" But the GIL does not block on IO. I think programmers reflexively reaching for Tornado or whatever don't really understand the details of how this all works.
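The "GIL does not block on IO" point is easy to demonstrate: C-level blocking calls (socket reads, file reads, `time.sleep`) release the GIL, so sync threads overlap their waits. A sketch:

```python
import threading
import time

def fake_io():
    # time.sleep stands in for a blocking socket read; like real
    # I/O syscalls, it releases the GIL while waiting.
    time.sleep(0.5)

start = time.perf_counter()
threads = [threading.Thread(target=fake_io) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start  # ~0.5s, not 4 x 0.5 = 2.0s
```

The GIL only serializes the *interpreter*; four threads waiting on I/O wait concurrently.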
That is not true, at least not in general. The whole point of using continuations for async I/O is to avoid the overhead of using threads: the scheduler overhead, the cost of saving and restoring processor state when switching tasks, the per-thread stack space, and so on.
The use of async either as callbacks, or user threads, or coroutines, is a convenience layer for structuring your code. As I understand, that layer does add some overhead, because it captures an environment, and has to later restore it.
Async and parallel always use more CPU cycles than sequential. There is no question. The real questions are: do you have cycles to burn, will doing so bring the wall clock time down, and is it worth the complexity of doing so?
I was thinking this would be about using multiprocessing to fire off two or more background tasks, then handle the results together once they all completed. If the background tasks had a large enough duration, then yeah, doing them in parallel would overcome the overhead of creating the processes and the overall time would be reduced (it would be "faster"). I thought this post would be a "measure everything!" one, after they realized for their workload they didn't overcome that overhead and async wasn't faster.
Given what the post was actually about, my response was more like "...duh".
Waiting for I/O does not usually waste any CPU cycles; the thread is not spinning in a loop waiting for a response, the operating system will simply not schedule the thread until the I/O request has completed.
You are making dinner. You start to boil water for the potatoes. While that happens, you prepare the beef. Async.
You and your girlfriend are making dinner. You do the potatoes, she does the beef. Parallel.
You can perhaps see how you could have asynchronous and parallel execution at the same time.
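In asyncio terms, a sketch of the single-cook version: one task "boils" while the other "preps", interleaved on a single thread, so the whole dinner takes as long as the longest step, not the sum:

```python
import asyncio

async def boil_potatoes():
    await asyncio.sleep(0.3)  # waiting on the pot, not working
    return "potatoes"

async def prepare_beef():
    await asyncio.sleep(0.2)
    return "beef"

async def make_dinner():
    # Async: one cook, both steps in flight at once.
    return await asyncio.gather(boil_potatoes(), prepare_beef())

dinner = asyncio.run(make_dinner())
```

The "you and your girlfriend" version would be `multiprocessing` (or a second worker process): two cooks, each with their own CPU.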
In the context of a Web server, a request is handled by a single Python process (so don’t give me that “OS scheduler can do other things”). Async matters here because your request turnover can be higher, even if the requests/sec remains the same.
In the cooking example, each request gets a single cook. If that cook is able to do things asynchronously, he will finish a single meal faster.
If it were only parallel, you could have more cooks - because they would be less demanding - but they would each be slower.
There is a bit of nuance here, in that the async-chef would make any individual meal slower than a sync-chef, once the number of outstanding requests is large. The sync-chef would indeed have overall higher wait times, but each meal would process just as fast as normal (eg. more like a checkout line at a grocery store).
I prefer the grocery store checkout line metaphor for this reason. If a single clerk was "async" and checking out multiple people at once, all the people in a line would have roughly the same average wait time for a small line size. A "sync" clerk would have a longer line with people overall waiting longer, but each individual checkout would take the same amount of time once the customer managed to reach the clerk.
This is pertinent when considering the resources utilized during the job. If a sync clerk only ever holds a single database connection, while an async clerk holds one for every customer they try to check out at the same time, the sync clerk will be far more friendly to the database (but less friendly to the customers, when there aren't too many customers at once).
The sync chef doesn't occupy the frying pan when he's boiling potatoes, so in some sense he only really does as much as he can. Having hundreds of sync chefs would likely be more efficient in terms of order volume, _but not order latency._
It depends on what you mean by "faster". HTTP requests are IO-bound, thus it is to be expected that the throughput of an IO-bound service benefits from a technology that prevents your process from sitting idle while waiting for IO.
Thus it's surprising that Python's async code performs worse, not better, in both throughput and latency.
> When you're dealing with external REST APIs that take multiple seconds to respond, then the async version is substantially "faster"
The findings reported in the blog post you're commenting on are the exact opposite of your claim: Python's async performs worse than its sync counterpart.
“Faster” is misleading because the speed improvements that you get with async are very dependent on load. At low levels there are typically negligible or no speed gains, but at higher levels the benefit will be incredibly obvious.
The one caveat to this is cases where async allows you to run two requests in parallel, rather than sequentially. I would argue that this is less about async than it is about concurrency, and how async work can make some concurrent work loads more ergonomic to program.
> “Faster” is misleading
> "At low levels there is going to typically be negligible or no speed gains, but at higher levels the benefit will be incredibly obvious."
there are no "speed" gains period. the same amount of work will be accomplished in the same amount of time with threads or async. async makes it more memory efficient to have a huge number of clients waiting concurrently for results on slow services, but all of those clients walking off with their data will not be reached "faster" than with threads.
the reason that asyncio advocates say that asyncio is "faster" is based on the notion that the OS thread scheduler is slow, and that async context switches are some combination of less frequent and more efficient such that async is faster. This may be the case for other languages but for Python's async implementations it is not the case, and benchmarks continue to show this.
You are not waiting for that 1000ms, and you haven't been for 35 years, since the first OSes started featuring preemptive multitasking.
When you wait on a socket, the OS will remove you from the CPU and schedule someone who is not waiting. When data is ready, you are placed back. You aren't wasting the CPU cycles waiting, only the ones the OS needs to save your state.
Actually standing there and waiting on the socket is not a thing people have done for a long time.
The point is that async IO allows your own process/thread to progress while waiting for IO. Preemptive multitasking just assigns the CPU to something else while waiting, which is good for the box as a whole, but not necessarily productive for that one process (unless it is multithreaded).
This doesn’t surprise me at all, as I’ve had to deal with async python in production, and it was a performance and reliability nightmare compared to the async Java and C++ it interacted with.
...with the goal of making your application faster.
In doing that, you're removing natural parallelism, and end up competing with the kernel scheduler, both in performance and in scheduling decisions.
That doesn't matter though. If you think the average python user is looking for "concurrency without parallelism" with no speed/performance goal in mind, you totally have the wrong demographic.
The fact that the language chose to implement asyncio on a single thread (again, the end user doesn't care that this is the case; it could have been a thread/core abstraction like goroutines), with little gain, which led to a huge fragmentation of its library ecosystem, is bad. Even worse that it was done in 2018. Doesn't matter how smart you are about the internals.
Python implements things on a single thread due to language restrictions (or rather, reference implementation restrictions), as the GIL, as always, disallows parallel interpreter access, so multiple Python threads serve little purpose other than waiting for sync I/O. It's been many years since I followed Python development, but back then all GIL removal work had unfortunately come to a halt...
> ... no. With the goal
I assumed those meant the end user of the language (it is fair to assume the person you responded to meant that). The goal of the language itself was probably to stay trendy - e.g. JS/Golang/Nim/Rust/etc had decent async stories, where python didn't. Python needed async syntax support as the threading and multiprocessing interfaces were clunky compared to others in the space. What they ended up with arguably isn't good.
I'm pretty familiar with those restrictions which is why I expected this thread to be more of "yeah it sucks that its slower" instead of pulling the "coroutines don't technically make anything faster per se" argument which is distracting.
Then it was people saying “Guys, stop buying surgical masks, The science says they don’t work it’s like putting a rag over your mouth.”
All of these so-called expert know-it-alls were wrong, and now we have another expert on asynchronous python telling us he knows better and he’s not surprised. No dude, you’re just another guy on the internet pretending he’s a know-it-all.
If you are any good, you’ll realize that nodejs will beat the flask implementation any day of the week and the nodejs model is exactly identical to the python async model. Nodejs blew everything out of the water, and it showed that asynchronous single threaded code was better for exactly the test this benchmark is running.
It’s not obvious at all. Why is the node framework faster than python async? Why can’t python async beat python sync when node can do it easily? What is the specific flaw within python itself that is causing this? Don’t answer that question, because you don’t actually know, man. Just do what you always do and wait for a well-intentioned humble person to run a benchmark, then comment on it with your elitist know-it-all attitude claiming you’re not surprised.
Is there a word for these types of people? They are all over the internet. If we invent a label maybe they’ll start becoming self aware and start acting more down to earth.
Python started single process, added threading, and then bolted async on top of that. And CPython is a pretty straight interpreter.
A comparison between Node and PyPy would be more informative, but PyPy has a far less mature JIT and still has to deal with Python's dynamism.
> If we invent a label maybe they’ll start becoming self aware and start acting more down to earth.
You can't lecture people into self-awareness, any more than experts can lecture everyone into wearing masks.
If you say IO is the bottleneck, then you're claiming there is no significant difference between python and node. That's what a bottleneck means.
> The concurrency model for IO should determine overall speed.
"Speed" is meaningless, it's either latency or throughput. Yeah, yeah, sob in your pillow about how mean elites are, clean up your mascara, and learn the correct terminology.
We've already claimed the concurrency model is asynchronous IO for both python and node. Since they are both doing the same basic thing, setting up an event loop and polling the OS for responses, it's not an issue of which has a superior model.
> If python async is slower for IO tasks then sync then that IS an unexpected result and an indication of a python specific problem.
Both sync and async IO have their own implementations. If you read from a file synchronously, you're calling out to the OS and getting a result back with no interpreter involvement. This is a simple single-threaded server in C. All it does is tell the kernel, "here's my IO, wake me up when it's done."
When you do async work, you have to schedule IO and then poll for it. This is an example of doing that in epoll in straight C. Polling involves more calls into the kernel to tell it what events to look for, and then the application has to branch through different possible events.
And you can't avoid this if you want to manage IO asynchronously. If you use synchronous IO in threading or processes, you're still constructing threads or processes. (Which makes sense if you needed them anyway.)
So unless an interpreter builds its synchronous calls on top of async, sync necessarily has less involvement with both the kernel and interpreter.
The reason the interpreter matters is because the latency picture of async is very linear:
* event loop wakes up task
* interpreter processes application code
* application wants to open / read / write / etc
* interpreter processes stdlib adding a new task
* event loop wakes up IO task
* interpreter processes stdlib checking on task
* kernel actually checks on task
Since an event loop is a single-threaded operation, each one of these operations is sequential. Your maximum throughput, then, is limited by the interpreter being able to complete IO operations as fast as it is asked to initiate them.
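You can put a rough number on that per-task interpreter cost (a sketch; the absolute figure varies by machine and Python version): each `await asyncio.sleep(0)` is one full trip through the scheduler, and its cost bounds how fast the loop can turn over I/O tasks.

```python
import asyncio
import time

N = 10_000

async def hop():
    # One full round trip through the event loop scheduler.
    await asyncio.sleep(0)

async def main():
    for _ in range(N):
        await hop()

start = time.perf_counter()
asyncio.run(main())
per_hop_us = (time.perf_counter() - start) / N * 1e6  # microseconds/hop
```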
I'm not familiar enough with it to be certain, but Node may do much of that work in entirely native code. Python is likely slow because it implements the event loop in python.
So, not only is Python's interpreter slower than Node's, but it's having to shuffle tasks in the interpreter. If Node is managing a single event loop all in low level code, that's less work it's doing, and even if it's not, Node can JIT-compile some or all of that interpreter work.
This is my claim that this SHOULD be what's happening, under the obvious logic that tasks handled in parallel to IO should be faster than tasks handled sequentially, and under the assumption that IO takes up way more time than local processing.
Like I said the fact that this is NOT happening within the python ecosystem and assuming the axioms above are true, then this indicates a flaw that is python specific.
>The reason the interpreter matters is because the latency picture of async is very linear:
I would say it shouldn't matter if done properly because the local latency picture should be a fraction of the time when compared to round trip travel time and database processing.
>Python is likely slow because it implements the event loop in python
Yeah, we're in agreement. I said it was a python specific problem.
If you take a single task in this benchmark for python, and the interpreter spends more time processing the task locally than the total round-trip travel time plus database processing time... then this means the database is faster than python. If database calls are faster than python, then this is a python-specific issue.
I expect upping this number would have a positive effect on asyncio numbers because the only thing this is measuring is how many database connections you have, and is about as far from a realistic workload as you can get.
Change your app to make 3 parallel requests to httpbin, collect the responses and insert them into the database. That's an actually realistic asyncio workload rather than a single DB query on a very contested pool. I'd be very interested to see how sync frameworks fare with that.
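A sketch of that suggested workload shape, with the httpbin calls and the DB insert mocked by `asyncio.sleep` so it runs standalone (`fetch_httpbin` and `save_to_db` are invented names; a real version would use something like aiohttp and asyncpg):

```python
import asyncio

async def fetch_httpbin(i):
    # Stand-in for an HTTP GET to httpbin; the sleep mocks the
    # network round trip.
    await asyncio.sleep(0.1)
    return {"id": i, "body": f"response-{i}"}

async def save_to_db(rows):
    # Stand-in for an INSERT through an async DB driver.
    await asyncio.sleep(0.05)
    return len(rows)

async def handle_request():
    # The three upstream fetches run concurrently, then the
    # collected responses go to the database in one insert.
    rows = await asyncio.gather(*(fetch_httpbin(i) for i in range(3)))
    return await save_to_db(rows)

inserted = asyncio.run(handle_request())
```

A sync framework would need threads (or sequential requests) to match the fan-out step, which is exactly where this kind of endpoint differs from a single DB query.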
Sidenote here: one thing I found but didn't mention (the reason I put in the pooling, both in Python and pgbouncer) is that otherwise, under load, the async implementations would flood postgres with open connections and everything would just break down.
I think making a database query and responding with JSON is a very realistic workload. I've coded that up many times. Changing it to make requests to other things (mimicking a microservice architecture) is also interesting and if you did that I'd be interested to read your write up.
If the system as a whole is well saturated, and the python processes dominate the system load with a DB load proportional to the requests served, then I don't think we would hit any external bottlenecks.
The benchmarks performed are not that great (e.g., virtualized, same machine for all components, etc.), but I don't think the errors are enough to throw off the result. Note, of course, that such results are not universal, and some loads might perform better async.
Doesn't this prove that async is waiting for connections when you put a limit on it? The only way async wins is if it is free to hit the db whenever it needs to.
And the reasoning is explained in the article:
"The rule I used for deciding on what the optimal number of worker processes was is simple: for each framework I started at a single worker and increased the worker count successively until performance got worse."
I don't see how that is a more "realistic" asyncio workload.
It might be a workload that async is better suited for, but the point of the article is to compare async web frameworks, which will often be used just to fetch and return some data from the db.
If you had an endpoint which needed to fetch 3 items from httpbin and insert them in the db it may make sense to use asyncio tools for that, even within the context of a web app running under a sync framework+server like Falcon+Gunicorn.
In my experience Python web apps (Django!) often spend surprisingly little time waiting on the db to return results, and relatively a large amount of time cpu-bound instantiating ORM model instances from the db data, then transforming those instances back into primitive types that can be serialized to JSON in an HTTP response. In that context I am not surprised if sync server with more processes is performing better. In this test it's not even that bad... the 'ORM' seems to be returning just a tuple which is transformed to a dict and then serialized.
until then I'd pretty much bought the hype that the new async frameworks running on Uvicorn were the way to go
I'm very glad to see this kind of comparative test being made, it's very useful, even if it later gets refined and added to and the results more nuanced
Highly disagree as the database is just another IO connection to a server, which is asyncio bread and butter. Being able to stream data from longer running queries without buffering and whilst serving other requests (and making other queries) is really quite powerful.
But yeah, if you're maxing out your database with sync code then async isn't going to make it magically go faster.
Despite this I think it's quite rare to hit this limit, at least in the orchestration-style use cases I use asyncio for. With those I value making independent progress on a number of async tasks rather than potentially being blocked waiting for a worker thread to become available.
1. One Postgresql connection is a forked process and has memory overhead (4MB iirc) + context switching.
2. A connection can only execute 1 concurrent query (no multiplexing).
3. Asyncpg to be fast, uses the features that I mentioned in my parent post. Those can only be used in Session Pooling https://www.pgbouncer.org/features.html.
The whole point of async is to do some other work while waiting for a query (e.g. a different query).
If you have 10 servers with 16 cores, each vcore has 1 python process, each python process doing 10 simultaneous queries. 10 * 16 * 10 = 1600 opened connections.
The best way IMHO: Is to use autocommit connections. This way your transactions execute in 1 RPC. You can keep multiple connections opened with very light CPU and pooling is best.
I've done 20K short lived queries/second from 1 process with only ~20 connections opened in Postgresql (using Pgbouncer statement pooling).
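For reference, the relevant pgbouncer knob is `pool_mode`; a minimal sketch of a statement-pooling setup (the database name, host, and sizes are placeholders):

```ini
[databases]
; placeholder connection string
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
listen_port = 6432
; statement pooling: a server connection is returned to the pool
; after every statement, so ~20 backend connections can serve
; thousands of short-lived autocommit queries per second
pool_mode = statement
default_pool_size = 20
```

Note that statement pooling rules out multi-statement transactions and the session-level features asyncpg leans on, which is why asyncpg needs session pooling instead.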
As the blog post apparently cites as well (woo!), I've written about the myth of "async == speed" some years ago here and my conclusions were identical.
It's a difficult myth to dispel, and I think the situation in terms of public mindshare is much worse now than it was in 2015. Some very silly claims from the async crowd now have basically widespread credence. I think one of the root causes is that people are sometimes very woolly about how multi-processing works. Another is that I think it's easy to make the conceptual mistake of 1 sync worker = 1 async worker and do a comparison that way.
One of my worries is that right now it feels like everything in Python is being rewritten in asyncio and the balkanisation of the community could well be more problematic than 2 vs 3.
this is exactly why the issue is so concerning for me as well.
Ok, in 2015 it was a pain, but with Python 3.8 it's actually only joy & fun in my opinion.
> the balkanisation of the community could well be more problematic than 2 vs 3
If you could call Python2 code from Python3 or vice-versa as easily as you can do with async then it would be comparable.
Python packaging is something that I have fully automated (maintaining over 50 packages here) and that I'm pretty happy with.
I fail to see the problem with Python packaging, maybe because I have an aggressive continuous integration practice? (Always integrate upstream changes, contribute to dependencies that I need, and when I'm not doing TDD it's only because I don't yet have proof that the code I'm writing is actually going to be useful.) That's not something everybody wants to do (I don't understand their reasoning though).
People would rather freeze their dependencies and then cry because upgrading is a lot of work, instead of upgrading at the rhythm of upstream releases. If other package managers or other languages have packaging features that encourage what I consider to be non-continuous integration, then good for them, but that's not how a hacker like me wants to work. Being able to "ignore upstream releases" is not a good feature; it made me a sad developer, really, while "ignoring non-latest releases" has made me a really happy developer.
Most performance issues are not imputable to the language. If they are, it's probably not affecting all your features, you can still rewrite the feature that Python is not well performing for into a compiled language. I need most of my code to be easy to manipulate, and very little of it to actually outperform Python.
I've recently re-assessed whether I should keep going with Python for another 10 years. I tried a bunch of languages and frameworks; at the end of the month I still wanted a language that's easy to manipulate with basic text tools, that's sufficiently easy so that I can onboard junior colleagues on my tools, and that provides sufficiently advanced OOP, because I find it efficient for structuring and reusing code.
Python does what it claims, it solves a basic human-computer problem, let's face it: it's here to stay and shine, and its wide ecosystem seems like solid proof. Whether it makes sense to invest in a project or not should not depend on the language anyway.
Oh man, we moved away from uwsgi to async a couple of years ago and that's been one of the best decisions we've made. Async is no walk in the park, but not having to deal with uwsgi configuration, etc has been well worth it.
> Python packaging is something that I have fully automated (maintaining over 50 packages here) and that I'm pretty happy with.
Yeah, I don't doubt this. Many people have found a happy path that works for them, but I've found that those tend to be people who don't have significant constraints (e.g., they don't need fast builds, or they don't care about reproducibility, or they don't have to deal with a large number of regular contributors, or etc).
> Most performance issues are not imputable to the language.
This isn't true in a meaningful sense. For the most part, if you're doing anything more complicated than a CRUD app, you will run into performance problems with Python almost immediately upon leaving the prototype phase, and your main options for improving performance are horizontal scaling (multiprocess/multihost parallelism) or rewriting the hot path in a faster language. As previously discussed, these options only work for certain use cases where the ratio of de/serialization to real work is low, so you often find yourself without options. Further, horizontal scaling is expensive (compute is expensive) and rewriting in a different language is differently expensive (you now have to integrate a separate build system and employ developers who are not only well-versed in the new language, but also in implementing Python extensions specifically).
On the other hand, if you chose a language like Go, you would be in the same ballpark of maintainability, onboarding, etc (many would argue Go is easier to write and maintain due to simplicity and static typing) but you would be in a much better place with respect to packaging and performance. You likely wouldn't need to optimize anything since naive Go tends to be 10-100X faster than naive Python, and if you needed to optimize, you can do so in-language without paying any sort of de/serialization overhead (parallelism, memory management, etc), allowing you to eke out another magnitude of performance. There are other options besides Go that also give performance gains, but they often involve trading off simplicity/packaging/deployment/tooling/ecosystem/etc.
> If they are, it's probably not affecting all your features, you can still rewrite the feature that Python is not well performing for into a compiled language.
This is true, but "rewriting features" is usually prohibitively expensive, and it's often non-trivial to figure out up-front which features will have performance problems in the future such that you could otherwise avoid a rewrite.
> Python does what it claims, it solves a basic human-computer problem, let's face it: it's here to stay and shine.
Yes, Python is here to stay, but that's more attributable to network effects and misinformation than merit in my experience.
> Many people have found a happy path that works for them, but I've found that those tend to be people who don't have significant constraints (e.g., they don't need fast builds, or they don't care about reproducibility, or they don't have to deal with a large number of regular contributors, or etc).
I'm really curious about this statement; building a python codebase for me means building a container image, and if the system packages or python dependencies don't change then it's really going to take less than a minute. What does your build look like?
Can you define "a large number of regular contributors".
What do you mean, "they don't need reproducibility"? I suppose they just build a container image in a minute and then go and deploy on some host. If a dependency breaks the code, it's still reproducible, but broken; then it means it has to be fixed rather than ignored, though a temporary version pin is fine.
> This is true, but "rewriting features" is usually prohibitively expensive, and it's often non-trivial to figure out up-front which features will have performance problems in the future such that you could otherwise avoid a rewrite.
If Go is so much easier to write then I fail to see how it can be a problem to use Go to rewrite a feature for which performance is mission critical, and for which you have final specifications in the python implementation you're replacing. But why write it in Go instead of Rust, Julia, Nim, or even something else ?
You're going to choose the most appropriate language for what exactly you have to code. If you're trying to outperform an interpreted language and/or don't care about being stuck with a rudimentary pseudo-object oriented feature set then choose such a compiled language. Otherwise, Python is a pretty decent choice.
> Yes, Python is here to stay, but that's more attributable to network effects and misinformation than merit in my experience.
If Go were easier to write and read, why would they implement a Python subset in Go for configuration files, instead of just having configuration files in Go? go.starlark.net Oh right, because it's not as easy to read and write as Python, and because you'd need to recompile. So apparently, even Google, who basically invented Go, also seems to need to support some Python dialect.
10-100X performance is most probably something you'll never need when starting a project, unless performance is mission-critical from the start. Static types and compilation are an advantage for you, but for me dynamic typing and interpretation mean freedom (again, I'm going to TDD on one hand and fix runtime exceptions as soon as I see them in applicative monitoring anyway).
I don't believe comparing Python and Go is really relevant, comparing PHP and Ruby and Python for example would seem more appropriate, when you say "people shouldn't need Python because they have Go" I fail to see the difference with just saying "people shouldn't need interpreted languages because there are compiled languages".
Humans need a basic programing language that is easy to write and read, without caring about having to compile it for their target architecture, Python claims to do that, and does it decently. If you're looking for more, or something else, then nobody said that you should be using Python.
I might be wrong, but when I'm talking about Humans, I'm referring to what I have seen during the last 20 years as 99% of the projects out there in the wild, not the 1% of projects that have extremely specific mission-critical performance requirements, thousands of daily contributors, and the like. Those are also pretty cool, and they need pretty cool technology, but it's really not the same requirements. For me, saying everybody needs Go would look a bit like saying everybody needs k8s or AWS. Languages are many and serve different purposes. The one that Python serves is staying, not through misinformation, but because of Human nature.
Running tests, building a PEX file, putting the PEX file into a container image. We have probably about a dozen container images and counting at this point. The tests take a long time (because Python is 2+ orders of magnitude slower than other languages), and our CI bill is killing us (we're looking into other CI providers as well).
> Can you define "a large number of regular contributors".
More than 20 (although our eng org is 30-50). Multiple teams. You don't want to hold everyone's hand and show them all the tips and tricks you've found for working around the quirks of Python packaging or give them an education on wheels, bdists, sdists, virtualenvs, pipenvs, pyenvs, poetries, eggs, etc. They were promised Python was going to be easy and they wouldn't have to learn a bunch of things, after all.
> What do you mean "they don't need reproductibility" ? I suppose they just build a container image in a minute and then go over and deploy on some host.
Container images aren't reproducible in practice. Moreover, they also have to be reproducible for local development, and we use Macs, and Docker for Mac is prohibitively slow. We need something else to make sure developers aren't dealing with dependency hell.
> If Go is so much easier to write then I fail to see how it can be a problem to use Go to rewrite a feature for which performance is mission critical, and for which you have final specifications in the python implementation you're replacing.
Both can be true: Go is easier to write than Python and it's still prohibitively expensive to rewrite a whole feature in Go. If the feature is small, well-designed, and easily isolated from the rest of the system, then rewriting is cheap enough, but these cases are rare and "opportunity cost" is a real thing--time spent rewriting is time not spent building new features.
> But why write it in Go instead of Rust, Julia, Nim, or even something else ?
Because Rust slows development velocity by an order of magnitude and Julia and Nim aren't mature general-purpose application development languages.
> You're going to choose the most appropriate language for what exactly you have to code. If you're trying to outperform an interpreted language and/or don't care about being stuck with a rudimentary pseudo-object oriented feature set then choose such a compiled language. Otherwise, Python is a pretty decent choice.
Yes, you have to choose the most appropriate language, but I contend that Python is a pretty rubbish choice for reasons that people often fail to consider up front. E.g., "My app will never need to be fast, and if it's fast I can just rewrite the slow parts in C!".
> If Go were easier to write and read, why would they implement a Python subset in Go for configuration files, instead of just having configuration files in Go? go.starlark.net Oh right, because Go is not as easy to read and write as Python, and because you'd need to recompile. So apparently, even Google, who basically invented Go, seem to need to support some Python dialect. Starlark is pretty cool though, and I use it a lot; I just wish it were statically typed.
Apples and oranges. Starlark is an embedded scripting language, not an app dev language. Different design goals. It also probably pre-dates Go, or at least derives from something which pre-dates Go.
> 10-100X performance is most probably something you'll never need when starting a project, unless performance is mission critical from the start.
You would be surprised. As soon as you're doing something moderately complex with a small-but-not-tiny data set you can easily find yourself in the tens of seconds. And 100X is the difference between a subsecond request and an HTTP timeout. It matters a lot.
> Static types and compilation are an advantage for you, but for me dynamic typing and interpretation mean freedom (again, I do TDD on one hand and fix runtime exceptions as soon as I see them in application monitoring anyway).
We do TDD for our application development too, and we still see hundreds of typing errors in production every week. I think your idea of "static typing" is colored by Java or C++ or something; you can have fast, flexible iteration cycles with Go or many of the newer classes of statically typed languages, as previously mentioned. "Type inference" (in moderation) is your friend. Anyway, Go programs can often compile in the time it takes a Python program to finish importing its dependencies. A Go test can complete in a fraction of the time it takes for pytest to start testing (no idea why it takes so long for it to find all of the tests).
> I don't believe comparing Python and Go is really relevant; comparing PHP, Ruby, and Python, for example, would seem more appropriate. When you say "people shouldn't need Python because they have Go", I fail to see the difference from just saying "people shouldn't need interpreted languages because there are compiled languages".
"compiled" and "interpreted" aren't use cases. "General app dev" is a use case. Python and Go compete in the same classes of tools: web apps, CLI applications, devops automation, lambda functions, etc. PHP and Ruby are also in many of these spaces as well. I don't especially care if Python is the fastest interpreted language (it's not by a long shot), I care if it's fast enough for my application (it's not by a long shot).
> Humans need a basic programming language that is easy to write and read, without caring about having to compile it for their target architecture. Python claims to do that, and does it decently. If you're looking for more, or for something else, then nobody said that you should be using Python.
Lots of people recommend Python for use cases for which it's not well suited, and since so many Python dependencies are C, you absolutely have to worry about recompiling for your target architecture, and it's much, much harder than with Go (to recompile a Go project for another architecture, just set the OS and the architecture via the `GOOS` and `GOARCH` env vars and rerun `go build`--you'll have a deployable binary before your Python Docker image finishes building).
> I might be wrong, but when I'm talking about humans, I'm referring to what I have seen during the last 20 years as 99% of the projects out there in the wild, not the 1% of projects that have extremely specific mission-critical performance requirements
Right, Python is alright for CRUD apps or any other kind of app where the heavy lifting can easily be off-loaded to another language. There's still the build issues and everything else to worry about, but at least performance isn't the problem. But I think you'll be surprised to find out that lots of apps don't fit that bill.
> For me, saying everybody needs Go would look a bit like saying everybody needs k8s or AWS.
I'm not saying everyone needs Go, I'm saying that Go is a better Python than Python. There are a handful of exceptions--there's not currently a solid Go-alternative for django, and I wouldn't be surprised if the data science ecosystem was less mature. But for general purpose development, I think Go beats Python at its own game. And I've been playing that game for a decade now. This conversation has been pretty competitive, but I really encourage you to give Go a try--I think you'll come around eventually, and you can learn it so fast that you can be writing interesting programs with it in just a few hours. Check out the tour: https://tour.golang.org.
CI bills are awful; I always deploy my own CI server, a gitlab-runner where I also spawn a Traefik instance to practice eXtreme DevOps.
More than 20 daily contributors, that's nice, but I must admit that I have contributed to some major Python projects that don't have a packaging problem, such as Ansible or Django. So, I'm not sure the number of contributors is really a factor in packaging success. That said, sdists and wheels are things that happen in CI for me; it's just a matter of adding this to my .gitlab-ci.yml:
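The original snippet isn't shown here, but as a rough sketch (job name, image, and stage are illustrative, not the commenter's actual file), such a CI job might look like:

```yaml
# Hypothetical .gitlab-ci.yml job: build an sdist + wheel, upload on tags.
pypi:
  image: python:3
  stage: deploy
  script:
    - pip install build twine
    - python -m build       # produces dist/*.tar.gz and dist/*.whl
    - twine upload dist/*   # credentials via TWINE_USERNAME / TWINE_PASSWORD
  only:
    - tags
```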
That's how I automate the packaging of all my Python packages (I have something similar for my NPM packages). As for virtualenvs, it's true that they are great, but I don't use them; I use pip install --user, which has the drawback that you need all your software to run with the latest releases of dependencies, otherwise you have to contribute the fixes. But I'm a happier developer this way, and my colleagues aren't blocked by a breaking upstream release very often; they'll just pin a version if they need to keep working while somebody takes care of changing our code and contributing to dependencies to make everything work with the latest versions.
I don't think that other languages are immune to version compatibility issues; I don't think that problem is language dependent. Either you pin your versions and forget about upstream releases, or you aggressively and continuously integrate upstream releases into your code and your dependencies.
> My app will never need to be fast
I maintain a governmental service that was in production in less than 3 months, followed by 21 months of continuous development, serving 60M citizens with a few thousand administrators, as the sole techie, on a single server, for the third year now. Needless to say, my country has never seen such a fast and useful project. I have not optimized anything. Of course, you can imagine it's not my first project of this kind. For me, "Python's speed is most often not a problem" is not a lie; I proved it.
The project does have a slightly complex database, the administration interface does implement really tight permission granularity (each department has its own admin team with users of different roles), and it did have to iterate quickly, but you know the story with Django: changing the DB schema is easy, migrations are generated by Django, you can write data migrations easily, tests will tell you what you broke, you write new tests (I also use snapshot testing, so a lot of my tests actually write themselves), and upgrading a package is just as easy as fixing anything that broke when running the tests.
You seem to think that Python is outdated because it's old, and that's also what I thought when I went over all the alternatives for my next 10 years of app dev. I was ready to trash all my Python, really. But that's how I figured out that the human-computer problem Python solves will just always be relevant. I'll assume that you understand the point I made on that and that we simply disagree here.
Or maybe we don't really disagree, I'll agree with you that a compiled language is better for mission-critical components, but any of these will almost always need a CRUD and that's where Python shines.
But I've not always been making CRUDs with Python; I have 2 years of experience as an OpenStack developer, and I must admit that Python fit the bill pretty well there too. Maybe my cloud company was not big enough to have problems, or we just avoided the common mistakes. I know people like Rackspace had a hard time maintaining forks of the services; I was the sole maintainer of 4 network-service rewrites, which were basically 1 package using OpenStack as a framework (like I would use Django) to simply listen on RabbitMQ and do stuff on SDN and SSH. Then again, I think not that many people actually practice CI/CD correctly, so that's definitely going to be a problem for them at some point.
> there's not currently a solid Go-alternative for django
That's one of the things that put me off: I tried all the Go web frameworks, and they are pretty cool, but will they ever reach the productivity levels of Django, Rails or Symfony?
Meanwhile, I'm just waiting for the day someone puts me in charge of something where performance is sufficiently critical that I need to rewrite it in a compiled language; if I could have the chance to do some ASM optimizations, that would also be a lot of fun. Another option is that I have something to contribute to a Go project, but so far, Go developers seem to be doing really fine without me, for sure :)
Why would I choose it for general-purpose development? I guess I'm stuck with "I love OOP", just like with "the little functional programming Python offers".
I really enjoyed this conversation too, would like to share it on my blog if you don't mind, thank you for your time, have a great weekend.
In general, you can get higher throughput with asyncio because you don't have context switches, but it comes at the cost of latency. So hand-wavy, indeed. It really depends what sort of speed you're after.
Imagine you're loading a profile page on some social networking site. You fetch the user's basic info, and then the information for N photos, and then from each photo the top 2 comments, and for each comment the profile pic of the commenter. You can't just fetch all this in one shot because there are data dependencies. So you start fetching with blocking IO, but that makes your wait time for this request proportional to the number of fetches, which might be large.
So instead, you ideally want your wait to be proportional to the depth of your dependency tree. But composing all these fetches that way is hard without the right abstraction. You can cobble it together with callbacks but it gets hairy fast.
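A hedged asyncio sketch of that shape (all the fetch functions here are invented stand-ins, with sleeps simulating network latency): the total wait is proportional to the depth of the dependency tree (four levels), not to the number of fetches.

```python
import asyncio

# Hypothetical fetchers standing in for real network calls.
async def fetch_user(uid):
    await asyncio.sleep(0.01)
    return {"id": uid, "photos": [1, 2]}

async def fetch_photo(pid):
    await asyncio.sleep(0.01)
    return {"id": pid, "top_comments": [10 * pid, 10 * pid + 1]}

async def fetch_comment(cid):
    await asyncio.sleep(0.01)
    return {"id": cid, "author": f"user{cid}"}

async def fetch_avatar(author):
    await asyncio.sleep(0.01)
    return f"{author}.jpg"

async def load_profile(uid):
    user = await fetch_user(uid)                                # level 1
    photos = await asyncio.gather(                              # level 2: all photos at once
        *(fetch_photo(p) for p in user["photos"]))
    comments = await asyncio.gather(                            # level 3: all comments at once
        *(fetch_comment(c) for photo in photos for c in photo["top_comments"]))
    avatars = await asyncio.gather(                             # level 4: all avatars at once
        *(fetch_avatar(c["author"]) for c in comments))
    return user, photos, comments, avatars

user, photos, comments, avatars = asyncio.run(load_profile(1))
```

Each `gather` fans out one level of the tree concurrently, which is exactly the "wait proportional to depth" composition that gets hairy with hand-rolled callbacks.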
So (outside of extreme scenarios) it's not really about whether async is abstractly faster than sync. It's about how real developers would solve the same problem with/without async.
(Source: I worked on product infrastructure in this area for many years at FB)
At least in TypeScript nowadays, the ability to just mark a function `async` and throw an `await` in front of its invocation drastically lowers the barrier to moving something from blocking to non-blocking. In the same cases, if I had to recommend the same change with thread pools and callbacks (and the manual book-keeping around all that), most developers just wouldn't bother.
Yeah, that's an extremely painful way to write threaded code. Much more normal is to simply block your thread while waiting for others to .Join() and return their results, likely behind an abstraction layer like a Future.
The only time you really need to use callbacks is when you need to blend async and threaded code, and you aren't able to block your current thread (e.g. Android main thread + any thread use is an example of this). But there are much much easier ways to deal with that if you need to do it a lot - put your primary logic in a different, blockable thread.
There are a few rare exceptions in Node.js (functions suffixed with "Sync"), but in the same vein, they are blocking whether or not you use async/await.
const a = fetchA()  // an async operation (hypothetical, returns a Promise)
const b = fetchB()  // another async operation
// Resolve a and b concurrently
const [x, y] = await Promise.all([a, b])
// Do something with x and y
Edit: I just re-read your comment and the one you were responding to, and do agree that async/await don't "move" things from blocking to non-blocking. It just helps using already non-blocking resources more easily. It will not help you if you're trying to make a large numerical computation asynchronous, for example. In this regard it's very different from Golang's `go`, which will run the computation in a separate goroutine, which itself will run concurrently (with Go's scheduler deciding when to yield), and in parallel if the environment allows it.
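To make that contrast concrete, here's a hedged Python sketch (function names invented): the CPU-bound function gains nothing from being marked `async`, so it's pushed onto an executor instead.

```python
import asyncio

def heavy(n):
    # CPU-bound work: merely wrapping this in an `async def` would not make
    # it run concurrently with the event loop.
    return sum(i * i for i in range(n))

async def main():
    loop = asyncio.get_running_loop()
    # Offload to an executor so the event loop stays responsive. The default
    # executor is a thread pool; for CPU-bound CPython code a
    # ProcessPoolExecutor is what actually buys parallelism, because of the GIL.
    return await loop.run_in_executor(None, heavy, 10_000)

result = asyncio.run(main())
```

This is the manual equivalent of what `go heavy(n)` gives you implicitly in Go, where the runtime scheduler decides when to yield.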
A more interesting example would be a request that requires multiple blocking operations (database queries, syscalls, etc.). You could do something like:
# Non-concurrent approach
a = get_row_1()
b = get_row_2()
c = get_row_3()
return render_json(a, b, c)
# asyncio approach
async def handle_request(request):
    a, b, c = await asyncio.gather(get_row_1(), get_row_2(), get_row_3())
    return render_json(a, b, c)
# Naive threading approach
a_q = queue.SimpleQueue()
t1 = threading.Thread(target=get_row_1, args=(a_q,))
b_q = queue.SimpleQueue()
t2 = threading.Thread(target=get_row_2, args=(b_q,))
c_q = queue.SimpleQueue()
t3 = threading.Thread(target=get_row_3, args=(c_q,))
for t in (t1, t2, t3):
    t.start()
return render_json(a_q.get(), b_q.get(), c_q.get())
# concurrent.futures with a ThreadPoolExecutor
def handle_request(request, thread_pool):
    a = thread_pool.submit(get_row_1)
    b = thread_pool.submit(get_row_2)
    c = thread_pool.submit(get_row_3)
    return render_json(a.result(), b.result(), c.result())
Therefore, I would have liked to see how much memory all those workers use, and how many concurrent connections they can handle.
The underlying issue with python is that it does not support threading well (due to the global interpreter lock) and mostly handles concurrency by forking processes instead. The traditional way of improving throughput is having more processes, which is expensive (e.g. you need more memory). This is a common pattern with other languages like ruby, php, etc.
Other languages use green threads / co-routines to implement async behavior and enable a single thread to handle multiple connections. On paper this should work in python as well except it has a few bottlenecks that the article outlines that result in throughput being somewhat worse than multi process & synchronous versions.
Taken from Stephen Cleary's SO answer on this topic: https://stackoverflow.com/a/31192718
Memory is cheap; the cost is in constant de/serialization. Same with "just rewrite the hotspots in C!"-style advice; de/serialization can easily eat anything you saved by multiprocessing/rewriting. Python is a deceivingly hard language, and a lot of this is a direct result of the "all of CPython is the public C-extension interface!" design decision (significant limitations on optimizations => heavy dependency on C-extensions for anything remotely performance sensitive => package management has to deal extensively with the nightmare that is C packaging => no meaningful cross-platform artifacts or cross compilation => etc).
What? What makes you say that? What did you think I was talking about if not a production system? To be clear, we're talking about the overhead of single-digit additional python interpreters unless I'm misunderstanding something...
Another Google search shows me Gunicorn, for instance; high memory usage on fork isn't exactly uncommon either.
Edit: I reworded some stuff up there and tried to make my point more clear.
Maybe the author is concerned that many people are jumping the gun on async/await before we all fully understand why we need it at all, and that's true. But that paradigm was introduced (borrowed) to solve a completely different issue.
I would love to see how many concurrent connections those sync processes handle.
Maybe what you're getting at is cases where there are a large number of (fairly sleepy) open connections? E.g. for push updates and other websockety things. I didn't test that, I'm afraid. The state of the art there seems to be using async, and I think that's a broadly appropriate usage, though that is generally not very performance-sensitive code, except that you try to do as little as possible in your connection-manager code.
I use trio/asyncio to more easily write correct complex concurrent code when performance doesn't matter. See "The Problem with Threads".
For this use case, Async Python probably still isn't faster, but that doesn't matter. Let's not throw out the baby with the bathwater :)
This is great for React or Vue front-end applications which get their state updated when things happen in the outside world (e.g. somebody else starts the music player, and that gets relayed).
When CPU performance is an issue (say generate a weather video from frames) you want to offload that into another process or thread, but it is an easy programming style if correctness matters.
Concurrency can help you separate out logic that is often commingled in non-concurrent code, but doesn't need to be. As a real-world example, I used to do safety critical systems for aircraft. The linear, non-concurrent version, included a main loop that basically executed a couple dozen functions. Each function may or may not have dependencies on the other functions, so information was passed between them over multiple passes through this main loop (as their order was fixed) using shared memory.
A similar project had about a dozen processes, each running concurrently. There was no speed improvement, but the connection between each activity was handled via channels (equivalent in theory to Go's channels, less like Erlang's mailboxes as the channels could be shared). We knew it was correct because each process was a simple state machine, separated cleanly from all other state machines.
The second system's code was much simpler, there was no juggling (in our code) of the state of the system, compared to managing the non-concurrent logic. If a channel had data to be acted on, the process continued, otherwise it waited. Very simple. And it turns out that many systems can be modeled in a similar fashion (IME). Of course, we had a very straightforward communication mechanism (again, essentially the same as Go channels except it was a library written in, as I recall, Ada by whoever made the host OS).
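A rough sketch of that channel-connected state-machine style, using Python queues in place of the original Ada channel library (the two tasks here are invented for illustration, not the avionics code):

```python
import queue
import threading

def doubler(inbox, outbox):
    # State machine 1: block on the input channel, transform, forward.
    while True:
        item = inbox.get()
        if item is None:          # shutdown sentinel
            outbox.put(None)
            return
        outbox.put(item * 2)

def accumulator(inbox, results):
    # State machine 2: block on its channel and fold values into a total.
    total = 0
    while True:
        item = inbox.get()
        if item is None:
            results.append(total)
            return
        total += item

a, b = queue.Queue(), queue.Queue()
results = []
workers = [threading.Thread(target=doubler, args=(a, b)),
           threading.Thread(target=accumulator, args=(b, results))]
for w in workers:
    w.start()
for n in (1, 2, 3):
    a.put(n)
a.put(None)
for w in workers:
    w.join()
# results holds [12]: each value doubled, then summed
```

Each process's entire state is local to its loop; if a channel has data, it continues, otherwise it waits, which is what makes each one easy to reason about in isolation.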
I mean think about it. Whats the difference between sending message A and then message B versus sending messages A and B into a queue and letting some async process pop from it? Less complexity and guaranteed message delivery come for free in single-threaded code.
Am I wrong? What am I missing?
You'd need a good watchdog and error handling, but presumably some of that came for "free" in their environment.
Although if you take out the "free" OS support, watchdog, etc., I agree that there's likely a place between "shared memory spaghetti" and "multi-processing" that's simpler than both.
The other benefit of the concurrent design (versus the single-threaded version) was that it was actually much simpler. This was critical for our field because that system is still flying, now 12 years later, and will probably be flying for another 30-50 years. The single-threaded system was unnecessarily complex. Much of the complexity came from having to include code to handle all the state juggling between the separate tasks, since each had some dependency on each other (not a fully connected graph, but not entirely disconnected either). The concurrent design made it trivial to write something very close to the most naive version possible, where waiting was something that only happened when external input was needed. So the coordination between each task just fell out naturally.
You still have to take care not to lock the system up, but in our case, because each process was sufficiently reduced to its essentials, this was easy to evaluate and reason about.
I.e. when handling many inputs and outputs, I can write my own loop around epoll etc. and write logic to keep track of queues of data to send per-target, etc. Or I can use a runtime that provides that for me and lets me mostly pretend things are running on their own.
Given how slow I/O operations are, and how much modern code depends on the network, we typically need some concurrency in our code. So for me, almost always, the question isn't, "which concurrency choice is fastest?" but rather, "which concurrency choice is fast enough while leading to code with the least bugs?"
It's like multi-threading 2+2.
I suspect that the best async is that supported by the server OS, and the more efficiently a language/compiler/linker integrates with that, the better. JIT/interpreted languages introduce new dimensions that I have not experienced.
I do have some prior art in optimizing libraries, though. In particular, image processing libraries in C++.
My opinion is that optimization is sort of a "black art," and async is anything but a "silver bullet." In my experience, "common sense" is often trumped by facts on the ground, and profilers are more important than careful design.
I have found that it's actually possible to have worse performance with threads, if you write in a blocking fashion, as you have the same timeline as sync, but with thread management overhead.
There are also hardware issues that come into play, like L1/2/3 caches, resource contention, look-ahead/execution pipelines and VM paging. These can have massive impact on performance, and are often only exposed by running the app in-context with a profiler. Sometimes, threading can exacerbate these issues, and wipe out any efficiency gains.
In my experience, well-behaved threaded software needs to be written, profiled and tuned, in that order. An experienced engineer can usually take care of the "low-hanging fruit," in design, but I have found that profiling tends to consistently yield surprises.
While Windows has had asynchronous I/O for ages, it's still one kernel transition per operation, whereas Linux can batch these now.
I suspect that all the CPU-level security issues will eventually be resolved, but at a permanently increased overhead for all user-mode to kernel transitions. Clever new API schemes like io_uring will likely have to be the way forward.
I can imagine a future where all kernel API calls go through a ring buffer, everything is asynchronous, and most hardware devices dump their data directly into user-mode ring buffers by default without direct kernel involvement.
It's going to be an interesting new landscape of performance optimisation and language design!
> I have found that it's actually possible to have worse performance with threads, if you write in a blocking fashion
But isn't excessive blocking/synchronization something that should already be tackled in your design, instead of trying to rework it after the fact?
I would expect profiling to mostly lead to micro-optimisations, e.g. combining or splitting the time a lock is taken, but when you're still designing you can look at avoiding as much need for synchronization as possible, e.g. sharing data copy-on-write (not requiring locks as long as you hold a reference) instead of having to lock the data when accessing it.
As another commenter says
> with asyncio we deploy a thread per worker (loop), and a worker per core. We also move cpu bound functions to a thread pool
You can't easily go from, e.g., thread-per-connection to a worker pool; that should have been caught during design.
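The copy-on-write idea mentioned above can be sketched as follows (an illustrative class, leaning on the CPython-specific assumption that reading or swapping an attribute reference is atomic):

```python
import threading

class CowMap:
    # Readers never lock: they grab a reference to an immutable-by-convention
    # snapshot. The writer builds a fresh copy and swaps the reference.
    def __init__(self, data):
        self._data = dict(data)
        self._write_lock = threading.Lock()   # only writers contend

    def snapshot(self):
        return self._data                     # atomic reference read

    def update(self, **changes):
        with self._write_lock:
            new = dict(self._data)            # copy...
            new.update(changes)               # ...on write
            self._data = new                  # atomic reference swap

cfg = CowMap({"timeout": 1})
view = cfg.snapshot()       # a reader holding the old version
cfg.update(timeout=5)       # writer publishes a new version
# `view` still sees timeout == 1; fresh snapshots see 5
```

Readers never observe a half-updated structure, so no read-side lock is needed, which is the design-time alternative to lock tuning after the fact.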
Yes and no. Again, I have not profiled or optimized servers or interpreted/JIT languages, so I bet there's a new ruleset.
Blocking can come from unexpected places. For example, if we use dependencies, then we don't have much control over the resources accessed by the dependency.
Sometimes, these dependencies are the OS or standard library. We would sometimes have to choose alternate system calls, as the ones we initially chose caused issues which were not exposed until the profile was run.
In my experience, the killer for us was often cache-breaking. Things like the length of the data in a variable could determine whether or not it was bounced from a register or low-level cache, and the impact could be astounding. This could lead to remedies like applying a visitor to break up a [supposedly] inconsequential temp buffer into cache-friendly bites.
Also, we sometimes had to recombine work that we had sent to threads, because that caused cache hits.
Unit testing could be useless. For example, the test images that we often used were the classic "Photo Test Diorama" variety, with a bunch of stuff crammed onto a well-lit table, with a few targets.
Then, we would run an image from a pro shooter, with a Western prairie skyline, and the lengths of some of the convolution target blocks would be different. This could sometimes cause a cache-hit, with a demotion of a buffer. This taught us to use a large pool of test images, which was sometimes quite difficult. In some cases, we actually had to use synthesized images.
Since we were working on image processing software, we were already doing this in other work, but we learned to do it in the optimization work, too.
When my team was working on C++ optimization, we had a team from Intel come in and profile our apps.
It was pretty humbling.
I think my question is whether async Python is slower in the case it was designed for -- many, long-running open sockets.
Async was traditionally used server-side for things like chat servers, where I might have millions of sockets simultaneously open.
This wasn't really the reason for the shift away from cooperative multitasking, it was really because cooperative multitasking isn't as robust or well behaved unless you have a lot of control over what tasks you have trying to run together.
In theory cooperative multitasking should have better throughput (latency is another story) because each task can yield at a point where its state is much simpler to snapshot rather than having to do things like record exact register values and handle various situations.
We've had a track record of technologies which:
1) Automated things (relieving programmers from thinking about stuff)
2) Were expected to make stuff slower
3) In reality, sped stuff up, at least in the typical case, once algorithms got smart
That's true for interpreted/dynamic languages, automated memory management/garbage collection, managed runtimes of different sorts, high-level descriptive languages like SQL, etc.
Sometimes, it took a lot of time to figure out how to do this. Interpreters started out an order of magnitude or more slower than compilers. It took until we had bytecode+JIT for performance to roughly line up. Then, when we started doing profiling/optimization based on data about what the program was actually doing, and potentially aligning compilation to the individual user's hardware, things suddenly got a smidgen faster than static compilers.
There is something really odd to me about the whole async thing with Python. Writing async code in Python is super-manual, and I'm constantly making decisions which ought to be abstracted away for me, and where changing the decisions later is super-expensive. I'd like to write.
None of that is true.
Even SQL, which models declarative work in the form of queries, requires significant tuning all the time.
The rest of the list is egregious.
> things suddenly got a smidgeon faster than static compilers.
No, they did not.
It really didn't. Yes, in highly specialized benchmark situations, JITs sometimes manage to outperform AOT compilers, but not in the general case, where they usually lag significantly. I wrote a somewhat lengthy piece about this, Jitterdämmerung:
Discussed at the time:
On the other side, you don't.
That linguistic flexibility often leads to big-O level improvements in performance which aren't well-captured in microscopic benchmarks.
If the question is whether GC will beat malloc/free when translating C code into a JIT language, then yes, it will. If the question is whether malloc/free will beat code written assuming memory will get garbage collect, it becomes more complex.
And is AOT compiled.
GC can only "beat" malloc/free if it has several times the memory available, and usually also only if the malloc/free code is hopelessly naive.
And you've got the micro-benchmark / real-world thing backward: it is JITs that sometimes do really well on microbenchmarks but invariably perform markedly worse in the real world. I talk about this at length in my article (see above).
Of the things you mention, I agree on SQL, and "managed runtimes" is generic enough that I cannot really judge.
I'm thoroughly unconvinced about the rest being faster than the alternatives (and that's why you don't see many SQL servers written in interpreted languages with garbage collection).
There's a big difference between normal code and hand-tweaked optimized code. SQL servers are extremely tuned, performant code. Short of hand-written assembly tuned to the metal, little beats hand-optimized C.
I was talking about normal apps. If I'm writing a generic database-backed web app, a machine learning system, or a video game. Most of those, when written in C, are finished once they work, or at the very most have some very basic, minimal profiling / optimization.
For most code:
1) Expressing that in a high-level system will typically give better performance than if I write it in a low-level system for V0, the stage I first get to working code (before I've profiled or optimized much). At this stage, the automated systems do better than most programmers do, at least without incredible time investments.
2) I'll be able to do algorithmic optimizations much more quickly in a high-level programming language than in C. With a reasonable, time-bounded investment, my high-level code tends to be faster than my low-level code -- I'll have the big-O level optimizations finished in a fraction of the time, so I can do more of them.
3) My low-level code gets to be faster once I get into a very high level of hand-optimization and analysis.
Or in other words, I can design memory management better than the automated stuff, but my get-the-stuff-working level of memory management is no longer better than the automated stuff. I can design data structures and algorithms better than PostgreSQL specific to my use case, but those won't be the first ones I write (and in most cases, they'll be good enough, so I won't bother improving them). Etc.
> If I'm writing a generic database-backed web app
If you are writing a system where performance does not matter, then performance does not matter.
> a machine learning system or a video game. Most of those, when written in C, are finished once they work, or at the very most have some very basic, minimal profiling / optimization.
Wait, what? ML engine backends and high-level descriptions, and video games are some of the most heavily tuned and optimized systems in existence.
> At this stage, the automated systems do better than most programmers do, at least without incredible time investments.
General-purpose JIT languages are so far from being an actual high-level declarative model of computation that the JIT compiler cannot perform any kind of magic of the kind you are describing.
Even actual declarative, optimizable models such as SQL or Prolog require careful thinking and tuning all the time to make the optimizer do what you want.
> 2) I'll be able to do algorithmic optimizations much more quickly in a high-level programming language than in C.
C is not the only low-level AOT language. C is intentionally a tiny language with a tiny standard library.
Take a look at C++, D, Rust, Zig and others. In those, changing a data structure or algorithm is as easy as in your usual JIT one like C#, Java, Python, etc.
> 3) My low-level code gets to be faster once I get into a very high level of hand-optimization and analysis.
You seem to be implying that a low-level language disallows you from properly designing your application. Nonsense.
> I can design memory management better than the automated stuff, but my get-the-stuff-working level of memory management is no longer better than the automated stuff
You seem to believe low-level programming looks like C kernel code of the kind of a college assignment.
It's not binary. Performance always matters, but there are different levels of value to that performance. Writing hand-tweaked assembly code is rarely a good point on the ROI curve.
> Wait, what? ML engine backends and high-level descriptions, and video games are some of the most heavily tuned and optimized systems in existence.
Indeed they are. And the major language most machine learning researchers use is Python. There is highly-optimized vector code behind the scenes, which is then orchestrated and controlled by tool chains like PyTorch and Python.
> Take a look at C++, D, Rust, Zig and others. In those, changing a data structure or algorithm is as easy as in your usual JIT one like C#, Java, Python, etc.
I'd recommend working through SICP.
> You seem to be implying that a low-level language disallows you from properly designing your application. Nonsense.
Okay: Here's a challenge for you. In Scheme, I can write a program where I:
1) Write the Lagrangian, as a normal Scheme function. (one line of code)
2) Take a derivative of that, symbolically (it passes in symbols like 'x and 'y for the parameters). I get back a Scheme function. If I pretty-print that function, I get an equation rendered in LaTeX.
3) Compile the resulting function into optimized native code
4) Run it through an optimized numeric integrator.
This is all around 40 lines of code in MIT-Scheme. Oh, and on step 1, I can reuse functions you wrote in Scheme, without you being aware they would ever be symbolically manipulated or compiled.
If you'd like to see how this works in Scheme, you can look here:
That requires being able to duck type, introspect code, have closures, GC, and all sorts of other things which are simply not reasonably expressible in C++ (at least without first building a Lisp in C++, and having everything written in that DSL).
The MIT-Scheme compiler isn't as efficient as a good C++ compiler, so you lose maybe 10-30% performance there. And all you get back is a couple of orders of magnitude for (1) being able to symbolically convert a high-level expression of a dynamic system to the equations of motion suitable for numerical integration (2) compile that into native code.
(and yes, I understand C++11 kinda-added closures)
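Loosely, the workflow described above can be sketched in plain stdlib Python (this is a toy for illustration only -- the real thing is MIT Scheme with the scmutils library, and every name below is made up): expressions are ordinary tuples, so a normal function that builds them can be differentiated symbolically and then "compiled" back into a fast native-Python function for numeric integration.

```python
# Toy symbolic system: expressions are numbers, variable names (strings),
# or tuples like ('+', a, b). Illustrative only -- not scmutils.

def d(expr, var):
    """Symbolic derivative of expr with respect to var."""
    if isinstance(expr, (int, float)):
        return 0.0
    if isinstance(expr, str):
        return 1.0 if expr == var else 0.0
    op, a, b = expr
    if op in ('+', '-'):
        return (op, d(a, var), d(b, var))
    if op == '*':  # product rule
        return ('+', ('*', d(a, var), b), ('*', a, d(b, var)))
    raise ValueError(f'unknown op {op!r}')

def to_src(expr):
    """Render an expression tree as Python source."""
    if isinstance(expr, (int, float)):
        return repr(expr)
    if isinstance(expr, str):
        return expr
    op, a, b = expr
    return f'({to_src(a)} {op} {to_src(b)})'

def compile_fn(expr, *params):
    """'Compile' an expression tree into an ordinary Python function."""
    return eval(f"lambda {', '.join(params)}: {to_src(expr)}")

# Step 1: the Lagrangian of a unit harmonic oscillator (m = k = 1),
# written as a normal function -- it never knows it will be differentiated.
def lagrangian(x, v):
    return ('-', ('*', 0.5, ('*', v, v)), ('*', 0.5, ('*', x, x)))

L = lagrangian('x', 'v')

# Step 2: Euler-Lagrange gives acceleration = (dL/dx) / (d^2 L/dv^2).
force = compile_fn(d(L, 'x'), 'x', 'v')         # evaluates to -x
mass = compile_fn(d(d(L, 'v'), 'v'), 'x', 'v')  # evaluates to 1

# Steps 3-4: integrate the compiled equations (semi-implicit Euler).
x, v, dt = 1.0, 0.0, 0.001
for _ in range(3142):  # roughly half a period (pi seconds)
    v += force(x, v) / mass(x, v) * dt
    x += v * dt
print(x)  # close to -1.0, i.e. cos(pi)
```

The point is the same as in the Scheme version: because code is data, the symbolic, compilation, and numeric stages compose without the original `lagrangian` author having planned for any of them.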
Read again what I wrote. Even the model itself is optimized. The fact that it is written in Python or in any DSL is irrelevant.
> I used to think that too before I spent years doing functional programming.
I have done functional programming in many languages, ranging from lambda calculus itself to OCaml to Haskell, including inside and outside academia. It does not change anything I have said.
Perhaps you spent way too many years in high-level languages that you have started believing magical properties about their compilers.
> prided myself on being able to implement things like highly-optimized numerical code with templates.
Optimizing numerical code has little to do with code monomorphization.
It does sound like you were abusing C++ thinking you were "optimizing" code without actually having a clue.
Like in the previous point, it seemed you attributed magical properties to C++ compilers back then, and now you do the same with high-level ones.
How do you even manage to write code in Lisp etc. "like C++"? What does that even mean?
> You putting "Python" and "Java" in the same sentence shows this isn't a process you've gone through yet. Java has roughly the same limitations as C and C++.
Pure nonsense. Java is nowhere close to C or C++.
> Here's a challenge for you.
I would use Mathematica or Julia for that. Not Scheme, not C++. Particularly since you already declared the last 30% of performance is irrelevant.
You are again mixing up domains. You are picking a high-level domain and then complaining that a low-level tool does not fit it nicely. That has nothing to do with the discussion, and we could apply that flawed logic to back any statement we want.
> It does sound like you were abusing C++ thinking you were "optimizing" code without actually having a clue.
> Like in the previous point, it seemed you attributed magical properties to C++ compilers back then, and now you do the same with high-level ones.
I think at this point, I'm checking out. You're making a lot of statements and assumptions about who I am, what my background is, what I know, and so on. I neither have the time nor the inclination to debunk them. You don't know me.
When you make it personal and start insulting people, that's a good sign you've lost the technical argument. Technical errors in your posts highlight that too.
If you do want to have a little bit of fun, though, you should look up the template-based linear algebra libraries of the late nineties and early 00's. They were pretty clever, and for a while, were leading in the benchmarks. They would generate code, at compile time, optimized to the size of your vectors and matrixes, unroll loops, and similar. They seem pretty well-aligned to your background. I think you'll appreciate them.
Except for a few very special cases, it is perfectly fine to block on I/O. Operating systems have been heavily optimized to make synchronous I/O fast, and can also spare the threads to do this.
Certainly in client applications, where the amount of separate I/O that can be usefully accomplished is limited, far below any limits imposed by kernel threads.
Where it might make sense is servers with an insane number of connections, each with fairly low load, i.e. mostly idle, and even in server tasks quality of implementation appears to far outweigh whether the server is synchronous or asynchronous (see attempts to build web servers with Apple's GCD).
For lots of connections actually under load, you are going to run out of actual CPU and I/O capacity to serve those threads long before you run out of threads.
Not when you know how to call sync functions from async functions and vice versa.
A sync function can call an async function via:
loop = asyncio.new_event_loop()
result = loop.run_until_complete(asyncio.ensure_future(red(x)))
And an async function can call a sync function via:
loop = asyncio.get_event_loop()
result = await loop.run_in_executor(None, blue, x)
async def red(x):
This works for regular functions; I don't know how it works for generators.
result = asyncio.run(red(x))
Not seeing any particular readability issue with that usage either. If you don't call asyncio.run or await on the result of an async function call, then you get a coroutine as the result.
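Both directions mentioned in this thread can be shown in one runnable sketch (the `red`/`blue` names follow the thread; the sleeps are stand-ins for a slow DB query):

```python
import asyncio
import time

async def red(x):     # async-style "query"
    await asyncio.sleep(0.01)
    return x * 2

def blue(x):          # sync-style equivalent, blocking
    time.sleep(0.01)
    return x * 2

# Sync code calling the async function:
result_sync = asyncio.run(red(21))

# Async code calling the blocking function without stalling the event
# loop: run_in_executor hands it to a thread pool and awaits the result.
async def main():
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, blue, 21)

result_async = asyncio.run(main())
print(result_sync, result_async)  # 42 42
```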
Still much harder to think about/read than:
result = red(x)
Async in this case has less ceremony than threads, but still enough to make things explicit.
For instance, imagine a "maybe_await" helper that just calls the function if it is synchronous, or awaits it otherwise.
result = yourfunc()
result = asyncio.run(result)
I believe this would also allow a non-async function to return a coroutine, I suppose.
Anyway, in this case chances are there will be no performance overhead: if there's any IO-bound operation running in the coroutine, the iscoroutine check on the result and the await will happen before the function is done.
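The imagined "maybe_await" helper could look roughly like this (a hypothetical sketch, using the stdlib's inspect.isawaitable to decide whether to await):

```python
import asyncio
import inspect

async def maybe_await(value):
    # Await the result only if it is awaitable; otherwise pass it through.
    if inspect.isawaitable(value):
        return await value
    return value

def blue(x):          # sync implementation
    return x + 1

async def red(x):     # async implementation
    return x + 1

async def main():
    a = await maybe_await(blue(41))  # plain value, passed through
    b = await maybe_await(red(41))   # coroutine, awaited
    return a, b

print(asyncio.run(main()))  # (42, 42)
```

Note that the call site still needs one `await` (on maybe_await itself), so it only hides whether the callee is sync or async, not the async-ness of the caller.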
result = asyncio.run(red(x))
For example, when someone accesses a descriptor in Django, this could end up being a query to the db (transparent) but dangerous. With asyncio you explicitly await something to return execution to the event loop.
At least to me that sounds like safer behaviour.
But the difference between asyncio.run(red(x)) and blue(x)... isn't. There's no difference which matters. They are just different implementations of the same behaviour.
If red and blue are both the same DB query, with the only difference being red is async-style and blue sync-style, these two lines have exactly the same program behaviour:
result = asyncio.run(red(x))
result = blue(x)
It's almost the opposite of Python's usual duck-typing parsimony, which normally allows equivalent things to be used in place of each other without ceremony.
Zen is not respected by explicit asyncio; just try to compose asyncio with iterators.
This problem doesn't exist with gevent, and composability is a desired thing in any programming language. Python's asyncio fractioned the community that was previously doing implicit asyncio with sync interfaces, and the current state of API is not an example of composable primitives that follow the Zen of Python:
> Beautiful is better than ugly.
> Simple is better than complex.
> Readability counts.
> Special cases aren't special enough to break the rules.
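The iterator-composition friction can be seen in a small sketch (all names illustrative): plain builtins like sum() can't consume coroutines or async iterables, so the familiar generator-expression idioms have to be rewritten with async comprehensions.

```python
import asyncio

async def fetch(x):
    # Stand-in for an awaitable operation.
    await asyncio.sleep(0)
    return x * 2

async def arange(n):
    # An async generator: usable with `async for`, not with `for`.
    for i in range(n):
        yield i

async def main():
    # sum(fetch(x) for x in range(3)) would NOT work: the generator
    # yields coroutines, not numbers, and sum() cannot await them.
    values = [await fetch(x) async for x in arange(3)]
    return sum(values)

print(asyncio.run(main()))  # 6
```

With gevent-style implicit async, the original `sum(fetch(x) for x in range(3))` would have worked unchanged, which is the composability point being made here.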
Meanwhile, my pretty python foo = bar().something has gone all foo = (await bar()).something
An entire category of data races like `x += 1` become impossible without you even thinking about it. And that's often worth it for something like a game server where everything is beating on the same data structures.
I don't use Python, so I guess it's less of an issue in Python since you're spawning multiple processes rather than multiple threads so you're already having to share data via something out of process like Redis and using its own synchronization guarantees.
But for example the naive Go code I tend to read in the wild always has data races here and there since people tend to never go 100% into a channel / mutex abstraction (and mutexes are hard). And that's not a snipe at Go but just a reminder of how easy it is to take things for granted when you've been writing single-threaded async code for a while.
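To the `x += 1` point: under a single-threaded event loop, a read-modify-write with no await inside it cannot be interleaved by another task, so the result is deterministic even with many concurrent tasks. A small illustrative sketch:

```python
import asyncio

counter = 0

async def bump(n):
    global counter
    for _ in range(n):
        counter += 1            # no await inside the read-modify-write,
        await asyncio.sleep(0)  # so no other task can interleave it

async def main():
    # Ten "concurrent" tasks hammering the same variable.
    await asyncio.gather(*(bump(1000) for _ in range(10)))

asyncio.run(main())
print(counter)  # always 10000; with preemptive threads this could race
```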
(Not necessarily on topic, but if you’re really excited about dodging data races, I figured it would give you something fun to look at!)
As to the article: the comparisons are good, but it fails to mention resource constraints. With Gunicorn, forking 16 instances is going to be a lot heavier on memory, so for a little more RPS you're probably spending a decent chunk of change more to run your workload. I don't think that's worth it, considering the async model in Python is pretty easy to grok these days and shares a similar performance profile under this benchmark.
Now, that said, if I had to guess, these numbers are fine for the average API. But if you're doing something like high-throughput web crawling, or need to serve on the order of tens of thousands to hundreds of thousands of RPS, async will win out on speed, resource use, and ultimately cost.
Plus, at one point they were like "we could only get an 18% speed up with Vibora". I haven't used it myself, but an 18% performance increase at really any level of load is fantastic. Hand-waving that off tells me the workloads deemed "realistic" don't take into account real high-RPS workloads like you might see at major tech companies.
It really depends on how the application is designed. Fork operates through mmap and copy-on-write. It's extremely lightweight by default.
A well-designed fork-based application will already have everything necessary to run a given process loaded into memory, won't munge any of the existing shared memory, and will only allocate and free memory associated with new events/connections/etc.
When programmed that way, individual forks are incredibly light on resources. All the workers are sharing the exact same core application code and logic in memory.
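A minimal demonstration of fork's copy-on-write semantics on a POSIX system (illustrative; os.fork is Linux/macOS only):

```python
import os

data = list(range(1_000_000))  # allocated once, in the parent

pid = os.fork()
if pid == 0:
    # Child: its pages are shared with the parent copy-on-write.
    # Reading costs nothing extra; this write copies only the touched page.
    data[0] = -1
    os._exit(0)

os.waitpid(pid, 0)
# The child's write never reaches the parent's copy.
print(data[0])  # 0
```

One caveat specific to CPython: reference-count updates dirty pages even on reads, so real-world sharing between forked workers is less perfect than the raw copy-on-write model suggests.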
Oh interesting, are you saying an intelligent forking implementation is able to share static portions of memory with multiple children?
I was perhaps under the naive assumption forking was pretty much just a full memory copy of the parent.