In response to the multiple comments here complaining that multithreading is impossible in Python without using multiple processes, because of the GIL (global interpreter lock):
This is just not true, because C extension modules (i.e. libraries written to be used from Python but whose implementations are written in C) can release the global interpreter lock while inside a function call. Examples of these include numpy, scipy, pandas and tensorflow, and there are many others. Most Python processes that are doing CPU-intensive computation spend relatively little time actually executing Python, and are really just coordinating the C libraries (e.g. "multiply these two matrices together").
The GIL is also released during IO operations like writing to a file or waiting for a subprocess to finish or send data down its pipe. So in most practical situations where you have a performance-critical application written in Python (or more precisely, the top layer is written in Python), multithreading works fine.
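A minimal sketch of the I/O case (the URL and thread count are arbitrary placeholders): because the GIL is dropped while each thread blocks on the network, the downloads overlap even though this is plain Python threading.

```python
import threading
import time
from urllib.request import urlopen

URLS = ["https://example.com"] * 8  # placeholder URLs

def fetch(url):
    # The GIL is released while this thread is blocked on the network,
    # so the eight fetches overlap instead of running one after another.
    with urlopen(url) as resp:
        resp.read()

start = time.perf_counter()
threads = [threading.Thread(target=fetch, args=(u,)) for u in URLS]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"fetched {len(URLS)} pages in {time.perf_counter() - start:.2f}s")
```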
If you are doing CPU intensive work in pure Python and you find things are unacceptably slow, then the simplest way to boost performance (and probably simplify your code) is to rewrite chunks of your code in terms of these C extension modules. If you can't do this for some reason then you will have to throw in the Python towel and re-write some or all of your code in a natively compiled language (if it's just a small fraction of your code then Cython is a good option). But this is the best course of action regardless of the threads situation, because pure Python code runs orders of magnitude slower than native code.
> complaining that multithreading is impossible in Python without using multiple processes, because of the GIL ... this is not true
I think some people's view is that if you're writing in C then you're not really writing a Python program, so they consider it impossible in Python. Which seems a reasonable point to me.
Your argument is that Python is fine for multithreading... as long as you actually write C instead of Python.
If a, b and c are numpy arrays then this function releases the GIL and so will run in multiple threads with no further work and with little overhead (if a, b and c are large). I would describe this as a function "written in Python", even though numpy uses C under the hood. It seems you describe this snippet as being "written in C instead of Python"; I find that odd, but OK.
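The snippet being discussed isn't reproduced above, but a minimal sketch of the kind of function in question might look like this (names are illustrative; '@' is NumPy's matrix-multiply operator):

```python
import numpy as np

def combine(a, b, c):
    # '+' and '@' dispatch to NumPy's compiled code; for large arrays the
    # matrix multiply in particular runs with the GIL released, so several
    # threads calling this can genuinely overlap.
    return a @ b + c
```

Calling a function like `combine` from a handful of `threading.Thread` workers on large arrays is the situation being described here.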
But, if I understand you right, you are also suggesting that the other commenters here who talk about the GIL would also describe this as "written in C". They realise that this releases the GIL and will run on multiple threads, but the point of their comments is that a proper pure-Python function wouldn't. I disagree. I think that most others would describe this function as "written in Python", and when they say that functions written in Python can't be parallelised they do so because they don't realise that functions like this can be.
This only gives you parallelism in one specific situation though - where operations like '+' and '@' take a long time. If they were fine-grained operations, then this doesn't help you.
If, instead of operating on a numerical matrix, you were operating on something like a graph of Python objects, something like a graph traversal would be hard to parallelise, because you could not release the GIL for long enough to get anything done.
I agree that not quite all situations will be covered by coarse locks like this. But many will, and my original comment was meant to draw attention to those situations. Your previous comment seemed to be saying that everyone already knew about those, but I still believe that some people commenting or reading here weren't aware you could release the GIL with just a numpy call.
I did also concede that if you do have to write your algorithm completely from scratch, with no scope for using existing C extensions (be they general-purpose like numpy or more specialist ones that implement the whole algorithm), then yes, you'll be caught by the GIL, so I agree with you on that. But I also made the point that you'll be caught even more (orders of magnitude more!) by the slowness of Python, so any discussion about parallelism or the GIL is a red herring. It's like worrying that your car's windscreen will start to melt if you travel at 500mph; even if that's technically true, it's not the problem you should be focusing on.
It's interesting you mention graphs because the most popular liberally licensed graph library is NetworkX, which is indeed pure Python and so presumably isn't particularly fast. There are graph libraries written as C extension modules but I believe they are less popular and less liberally licensed (GPL-style rather than BSD-style). So I definitely agree that this is a big weakness of the Python ecosystem.
This is a good point. Fine-grained parallelism works well with C extensions, but coarse-grained does not. Which is why data science and MATLAB-like tasks are so often done in Python without worrying too much about the Python penalty. But if you have a lot of small dense matrix operations, even VBA in Excel will be faster by more than 10x, because you keep popping back into Python.
Could you explain the return statement in your example? I only know '@' as a decorator in Python. This looks like invalid syntax to me; what am I missing?
A whole lot depends on what exactly it is that someone wants to get out of using threading.
The GIL means that a single Python interpreter process can execute at most one Python thread at a time, regardless of the number of CPUs or CPU cores available on the host machine. The GIL also introduces overhead which affects the performance of code using Python threads; how much you're affected by it will vary depending on what your code is doing. I/O-bound code tends to be much less affected, while CPU-bound code is much more affected.
All of this dates back to design decisions made in the 1990s which presumably seemed reasonable for the time: most people using Python were running it on machines with one CPU which had one core, so being able to take advantage of multiple CPUs/cores to schedule multiple threads to execute simultaneously was not necessarily a high priority. And most people who wanted threading wanted it to use in things like network daemons, which are primarily I/O-bound. Hence, the GIL and the set of tradeoffs it makes. Now, of course, we carry multi-core computers in our pockets and people routinely use Python for CPU-bound data science tasks. Hindsight is great at spotting that, but hindsight doesn't give us a time machine to go back and change the decisions.
Anyway. This is not the same thing as "multithreading is impossible". This is the same thing as "multithreading has some limitations, and for some cases the easiest way to work around them will be to use Python's C extension API". Which is what the parent comment seemed to be saying.
> All of this dates back to design decisions made in the 1990s which presumably seemed reasonable for the time ... Hence, the GIL and the set of tradeoffs it makes. Now, of course, we carry multi-core computers ... Hindsight is great at spotting that, but hindsight doesn't give us a time machine to go back and change the decisions.
Sadly I don't think this is _quite_ true. I believe GILs are used in a number of interpreters and fall prey to the common problem where either coarsening the locks or making them finer ruins somebody's day. I believe Guido van Rossum hung the GILectomy on two main issues: the interpreter must remain relatively simple, and C extensions cannot be slowed down.
I'm not disagreeing with the decision (necessarily), but it isn't simply a holdover from a bygone era. It was a decision that has been reaffirmed and upheld numerous times.
I'm familiar with the various attempts to remove the GIL over the years.
The thing is, in the 90s the choices that produced the GIL as it exists were not bad ones; that's why I went to the trouble of explaining how it affects threaded code and why those effects can be considered reasonable tradeoffs for what was known at the time, in implementing threading (without completely breaking the ecosystem of Python + Python extensions, which was already significant even back then).
Of course, knowing what's known today about the directions computing and the use of Python went in, different decisions might end up being made, but at this point it's very difficult (more difficult than people typically expect) to undo them or make different choices.
I've done it once, converting about 15 lines of Python to Rust. It was completely painless and resulted in a large speedup (changed a hotspot that was taking approximately 90% of execution time in a scientific simulation to approximately 0%).
The type system and expressive macros seem like a big win over C to me.
Care to share a bit more detail on how you did this? Was there some interfacing library that you used analogous to Cython/SWIG/etc.? Presumably you didn't code directly against the C API (in python.h)?
The Rust library for interfacing with Python is https://github.com/dgrunwald/rust-cpython. This library understands things like Python arrays, objects, etc. and provides nice Rust interfaces to them. Basically I just have to write a macro that specifies what functions I'm exposing to Python, and other than that I'm writing normal Rust. On the Python side I'm importing them and calling them like any other Python library.
I really wish he had shown his numpy code. He said at 13:46 "Numpy actually doesn't help you at all because the calculation is still getting done at the Python level". But his function could be vectorised with numpy using functions like numpy.maximum or numpy.where, in which case the main loop will be in C not Python. I can't figure out from what he said whether his numpy code did that or not.
But either way, it's interesting that in this case the numpy version is arguably harder to write than the Cython version: rather than just adding a few bits of metadata (the types), you have to permute the whole control flow. If there's only a small amount of code you want to convert, I would still say it's better to use numpy (if it actually is fast enough), because getting the build tools onto your computer for Cython can be a pain. And for some matrix computations there are speed improvements beyond the fact that it's implemented in C, e.g. matrix multiplication is faster than the naive O(n^3) version.
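His actual function isn't shown, so as a hedged illustration only, this is the kind of transformation being described, using numpy.maximum so the loop runs in C rather than in Python bytecode:

```python
import numpy as np

def clamp_loop(x):
    # Python-level loop: every iteration is interpreted bytecode
    out = np.empty_like(x)
    for i in range(len(x)):
        out[i] = x[i] if x[i] > 0 else 0.0
    return out

def clamp_vectorised(x):
    # Same result, but the loop happens inside NumPy's C implementation
    return np.maximum(x, 0.0)

x = np.random.randn(1_000_000)
assert np.allclose(clamp_loop(x), clamp_vectorised(x))
```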
Because we want first class python multithreading, like many other languages have. If we have to drop down into C, might as well use another language with first class multi-threading like java, kotlin, golang or swift and avoid all the other issues that come with slow GIL languages.
I think you're thinking of the multiprocessing module, which uses separate processes to bypass the GIL. That's why the arguments and results need to be pickleable: pickle is a serialisation protocol, so it allows you to communicate the contents of objects between different processes. If you use threads within a single process, you don't need to pickle the objects; you just pass the object directly.
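To illustrate the difference (a toy example of my own, not from the article): an object holding something unpicklable can be handed to a thread as-is, but it would have to survive pickling to be sent to another process.

```python
import pickle
import threading

class Job:
    def __init__(self):
        self.lock = threading.Lock()  # locks are not picklable

job = Job()

# Threads share the parent's memory, so the object is passed directly:
t = threading.Thread(target=lambda j: print(type(j).__name__), args=(job,))
t.start(); t.join()

# Sending the same object to a worker process means pickling it first,
# which fails for things like locks, open files and sockets:
try:
    pickle.dumps(job)
except TypeError as exc:
    print("not picklable:", exc)
```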
> (note that you must be using Python 2 for this workshop and not using Python 3. Complete this workshop using Python 2, then read about the small changes if you are interested in using Python 3)
I find it strange that nobody ever seems to mention Python's concurrent.futures module [0], which has been around since Python 3.2. I think asyncio got a lot of attention when it came out in Python 3.4 and concurrent.futures took a back seat. This article also doesn't mention the module in its Python 2 and 3 differences link.
asyncio is a good library for asynchronous I/O, but concurrent.futures gives us some nifty tooling which makes concurrent programming (with ThreadPoolExecutor) and parallel programming (with ProcessPoolExecutor) pretty easy to get right. The Future class is an elegant way of continuing execution while a background task is being executed.
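A small sketch of both executors (the task bodies are stand-ins I made up): ThreadPoolExecutor for I/O-bound fan-out, ProcessPoolExecutor for CPU-bound work, with a Future used to keep doing other things while a result is pending.

```python
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def io_task(n):
    time.sleep(0.5)        # stand-in for network/disk I/O; the GIL is released here
    return n

def cpu_task(n):
    return sum(i * i for i in range(n))  # stand-in for CPU-bound work

if __name__ == "__main__":
    # I/O-bound: 8 tasks complete in roughly 0.5s total, not 4s
    with ThreadPoolExecutor(max_workers=8) as pool:
        print(list(pool.map(io_task, range(8))))

    # CPU-bound: work is pickled out to separate processes, sidestepping the GIL
    with ProcessPoolExecutor() as pool:
        future = pool.submit(cpu_task, 2_000_000)  # returns a Future immediately
        print("doing other work while the task runs...")
        print(future.result())                     # block only when the answer is needed
```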
ThreadPoolExecutor and ProcessPoolExecutor were exactly what I was waiting for someone to mention. I was doing some Python as a systems architect at my previous position, and now as a full-time data scientist my life has pretty much been consumed by Python. Unsurprisingly, a lot of my initial work is retrieving and cleaning very large volumes of data, the latter usually being I/O bound and the former being CPU bound, and frankly I and a lot of my team immediately default to using ThreadPoolExecutor and ProcessPoolExecutor respectively, because of how simple and performant they are. Perhaps asyncio is more familiar terminology to people coming from web dev, so that's why they're gravitating towards it, but there are few times when I find myself needing that particular tooling outside of web dev anyway.
Relative performance compared to C is somewhere between one and two orders of magnitude slower. Considering how much harder and more error-prone multi-core is, maybe first try a fast sequential solution.
Yeah. I recently switched some Blender Python algorithms I wrote to Swift/Metal, and the speedup was somewhere between 1,000x and 1,000,000x depending on the algorithm.
Yeah, properly written Python is at the extreme worst ~1000x slower than speeding it up with a NumPy/Numba/C/Fortran/etc. implementation. Brute-force loopy code in Python I've seen is 100x slower than the compiled alternatives. So I agree: these extreme numbers are a sign of writing the worst possible Python implementation of a thing and then saying Python sucks.
Who would have guessed that compiled, static, non-dynamic, hardware-accelerated code would be a ton more performant than interpreted, highly dynamic, garbage-collected and very powerful code that is not hardware accelerated.
Why? I can run my Python plotting script with multiprocessing in one of the blade servers at work and get the job done quickly. All without translating a big bunch of code to C.
I love Python. But it's seriously incapable of doing non-trivial concurrent tasks. The multiprocessing module doesn't count. I hope the Python core devs take some inspiration from golang for developing the right abstractions for concurrency.
Concurrent or parallel? For concurrency, python has asyncio, which many people consider a success.
For parallel execution, there's the GIL, but in practice it rarely matters, because once you want to do parallel execution, you have most likely a computationally intensive task to do, at which point you call down to C or something, and then GIL doesn't matter.
These are all quite a lot harder to use than Go, and often they don't play well together. For example, there are lots of sync libraries (the Docker API, the AWS SDK, etc) that can't be turned async, so other folks have had to go through the trouble of forking and porting to async and since those other folks are often not affiliated with the original dev teams, who knows what the quality level of those libraries may be? We've also had a lot of problems with asyncio alone--often developers forgetting to await an async call or doing something (I'm not sure what exactly) that causes processes to hang indefinitely. It's all quite a lot more complex than Go's concurrency model.
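For what it's worth, the "forgot to await" failure mode mentioned above looks like this (a toy example); the call silently does nothing and Python only emits a RuntimeWarning:

```python
import asyncio

async def save(record):
    await asyncio.sleep(0.01)  # stand-in for an async DB or API call

async def main():
    save({"id": 1})            # bug: coroutine created but never awaited - nothing runs
    await save({"id": 2})      # correct

asyncio.run(main())            # warns: "coroutine 'save' was never awaited"
```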
And all of that is really just I/O parallelization; there's also CPU parallelization, and I don't believe Python has anything that's quite as easy as "Do these two things in parallel". Pretty much everything requires a lot of marshalling and process management which can easily slow a program down instead of improving it.
Python is great for a lot of things, and the community has found many creative workarounds for its shortcomings, but Go beats Python in I/O and CPU parallelism handily.
While I agree that python isn't ideal beyond a certain scope, I think you're overstating how bad it is. My team and I have built a number of non-trivial machine learning products with pipelines that use both the ThreadPool and ProcessPool components successfully. The headaches we have are related more to the fact that Python is dynamic than its concurrency story.
OP probably is overstating a bit, but it is hard to efficiently parallelize computation in Python. For example, if you have a large Python object graph that you need to compute over, you can't easily parallelize the computation without paying a significant serialization cost. You can probably alleviate that by carefully choosing algorithms that minimize the amount of serialization per worker process, but at the end of the day all of this is still quite a lot harder than using shared memory and goroutines. And not to mention Go is 1-2 orders of magnitude faster than Python in single-threaded execution... Python is great for lots of things, but efficient parallel programming in Python is _hard_, even if there are a handful of cases where it's not so hard.
I successfully do concurrent+parallel computing with Python using asyncio combined with ProcessPoolExecutor. I can see why perhaps that doesn't scratch your itch, but it sure scratches my web-crawling itch.
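A rough sketch of that combination (the fetch/parse functions are placeholders, not the parent poster's code): asyncio handles the concurrent fetching while a ProcessPoolExecutor takes the CPU-bound parsing off the event loop.

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor

def parse(html):
    # CPU-bound parsing runs in a worker process, off the event loop
    return html.count("<a ")

async def fetch(url):
    await asyncio.sleep(0.1)            # stand-in for an async HTTP GET (e.g. aiohttp)
    return "<a href='#'>link</a> " * 10

async def crawl(urls):
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor() as pool:
        pages = await asyncio.gather(*(fetch(u) for u in urls))
        counts = await asyncio.gather(
            *(loop.run_in_executor(pool, parse, page) for page in pages)
        )
    return dict(zip(urls, counts))

if __name__ == "__main__":
    print(asyncio.run(crawl([f"https://example.com/{i}" for i in range(4)])))
```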
I think a lot of this complexity can be avoided by just writing single threaded python and using GNU parallel for running it on multiple cores. You can even trivially distribute the work across a cluster that way.
This is the approach I've taken, albeit at the "top level" of the program. Since I know I don't have to deal with Windows I much prefer simply piping to parallel instead of xargs, or calling make -j8, or similarly letting some shell wrapper handle it over dealing with the overhead inside of python, especially multiprocessing.
However, where I think having this stuff available inside of python is useful is that it's cross platform and consumable from "higher levels" of python. A library can do some mucky stuff internally to speed computation but still present a simple sync interface, all without external dependencies.
Did they ever fix the global interpreter lock? It's sort of a showstopper for doing stuff concurrently in Python. I've done a bit of batch processing using the multiprocessing module, which uses processes instead of threads. This works, but it is a bit of a kludge if you are used to languages that support concurrency properly.
You make some extremely large claims about ZProc, what advantages does it have over every other message-passing library for every other language ever built? (including the other zeromq bindings?)
TBH, your claims sound like you've just "discovered" message passing, which many, many languages, runtimes and operating systems have been using for years/decades.
(https://en.wikipedia.org/wiki/Message_passing)
In other words... it's not a revolution.
ZProc seems to simply be a small library that pickles data structures through a central (pubsub?) server.
This is not the way to get remotely close to "high performance". What you've created here is pretty much what multiprocessing gives you already in a more performant solution (i.e. no zeromq involved).
> What you've created here is pretty much what multiprocessing gives you already in a more performant solution (i.e. no zeromq involved)
Minor point of pedantry which I'll state because it's an often-overlooked timesaver for folks developing on multiprocessing: not only is MP potentially faster for transferring data between processes compared to this solution, but it can also be way, way faster in situations where you have all your data before creating your processes/pool and just want to farm it out to your MP processes without waiting for it all to be chunked/pickled/unpickled.
Because of copy-on-write fork magic, many multiprocessing configurations (including the default) can "send" that data to child processes in constant* time, if the data's already present in e.g. a global when children are created.
This pattern can be used to totally bypass all considerations of performance/CPU/etc. for pickling/unpickling data and lends a massive speed boost in certain situations--e.g. a massive dataset is read into memory at startup, and then ranges of that dataset are processed in parallel by a pool of MP processes, each of which will return a relatively small result-set back to the parent, or each of which will write its processed (think: data scrubbing) range to a separate file which could be `cat`ed together, or written in parallel with careful `seek` bookkeeping.
Unix-ish OSes only, though (unless the fork() emulation in WSL works for this--I have not tested that).
* Technically it's O(N) for the size of data you have in memory at process pool start, because fork() can take time, but the multiplier is small enough in practice compared to sending data to/from MP processes via queues or whatever that it might as well be constant.
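A sketch of the pattern being described (the dataset and chunking are made up, and this relies on the Unix "fork" start method): the parent loads the data into a global before creating the pool, the children read it via copy-on-write, and only the small per-chunk results travel back through a queue.

```python
import multiprocessing as mp

BIG_DATA = None  # populated in the parent before the pool is created

def process_range(bounds):
    start, end = bounds
    # Read-only access to the global inherited via fork: nothing is pickled
    # to get the data here; only `bounds` and the small result cross processes.
    return sum(BIG_DATA[start:end])

if __name__ == "__main__":
    BIG_DATA = list(range(10_000_000))  # stand-in for the real dataset
    chunks = [(i, i + 1_000_000) for i in range(0, 10_000_000, 1_000_000)]

    ctx = mp.get_context("fork")        # Unix-only; spawn would not inherit BIG_DATA
    with ctx.Pool(processes=4) as pool:
        results = pool.map(process_range, chunks)
    print(sum(results))
```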
Note that this works for big objects, but not for small objects. E.g. if you fork-share a large list of integers or dicts or something like that, then you don't get any memory usage benefits, because every access will cause a refcount-write and that will copy the whole page containing the object.
> * Technically it's O(N) for the size of data you have in memory at process pool start
It's not quite that simple; sharing n pages can take very little time or a bit more time; it depends on how the pages are mapped; sharing a large mapping doesn't take longer than a small mapping.
> this works for big objects, but not for small objects
Very true; I went into some more detail about my typical use case above. Using MP for lots of small objects that you've already extracted from raw data/IO/whatever is a game of diminishing returns. It's in situations like that where traditional shared-memory starts looking more and more attractive. When I get to that point, while multiprocessing and some other packages provide a few nice abstractions over shmem, I start looking for other platforms than Python.
> It's not quite that simple; sharing n pages can take very little time or a bit more time
Definitely; I was simplifying in order to compare the overhead of fork with the overhead of pickling/shipping/unpickling data. Sharing large pieces of data with even very slow fork()ing is, in my experience, so much faster than the [de]serialize approach that it is effectively constant in comparison, but I didn't mean to discount the complexities of what make certain forking situations faster/slower than others.
> Because of copy-on-write fork magic, many multiprocessing configurations (including the default) can "send" that data to child processes in constant* time, if the data's already present in e.g. a global when children are created.
Have you tried this or got it working? The fly in the ointment is the reference count. Add a reference and BOOM, you suddenly have a huge copy. It can be made to work efficiently in certain cases, but it takes a lot of care.
In practice, I find reference-count related issues with this pattern to be minor.
Most of the situations where I care enough about memory and/or pickling overhead fall into the "take a giant block of binary/string data and process ranges of it in parallel" family, in which case there aren't too many references until the subprocesses get to work. If I had more complex structures of data I'd probably get a little less performance bang for my buck, but even then I suspect it would be much faster than multiprocessing's strategy: pickling and sending data between processes via pipes is many times slower than moving the equivalent amount of data by dirty-writing pages into a forked child.
That's not meant to discount anything y'all are saying, though: refcounts are definitely a very important thing to be mindful of in this situation. A child comment suggests gc.freeze, which can help, but can't entirely save you from thinking about this stuff.
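The gc.freeze mitigation (Python 3.7+) mentioned above looks roughly like this; note that it only stops the collector from dirtying object headers, and does nothing about ordinary refcount writes:

```python
import gc
import multiprocessing as mp

BIG = [bytes(1024) for _ in range(100_000)]  # stand-in for a large in-memory dataset

def worker(i):
    return len(BIG[i])

if __name__ == "__main__":
    gc.disable()
    gc.freeze()   # move existing objects to a "permanent generation" so a later
                  # collection in the children doesn't touch (and copy) their pages
    ctx = mp.get_context("fork")
    with ctx.Pool(4) as pool:
        print(sum(pool.map(worker, range(1000))))
```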
It's also very important to be mindful of what happens with your program at shutdown: if you have a big set of references shared via fork(), and all your children shut down around the same time, your memory usage can shoot up as each child tries to de-refcount all objects in scope. This applies even if each child was only operating on a subset of the references shared to it. If you're processing, say, 1GB of data from the parent in 8 children on a 4 core system (doing M>N(cpu) because e.g. children spend some time writing results out to the FS/network), a near-simultaneous shutdown could allocate 9GB of memory in the very worst case, which can cause OOM or unexpected swapping behavior. Throttled shutdowns using a semaphore or equivalent are the way to go in that case.
> in which case there aren't too many references until the subprocesses get to work.
In my workload that's exactly when it hits.
We ran into this when sharing different parts of a huge matrix with different workers. We had to be extra careful that we did not create new references in the subprocesses. We were operating at a scale where, if we got it wrong, the OOM killer would get us.
Working with memory-mapped arrays is more forgiving.
Wouldn't you be better off using a Database for that kind of work?
> Because of copy-on-write fork magic, many multiprocessing configurations (including the default) can "send" that data to child processes in constant time
The `multiprocessing.Pool` uses a `multiprocessing.Queue` in the background to retrieve the results after completion.
The `multiprocessing.Queue` in turn uses `multiprocessing.connection.Pipe` and sends the pickled objects over to the wire.
So I don't see how this is any better than ZMQ.
Just because stuff has an API that doesn't look like message passing doesn't mean it can't be doing that in the background. Which is funny, because that's the whole point of ZProc.
I realize the subtle difference that CPython uses pipes, not sockets, unlike ZMQ. But that doesn't really make a difference now, does it?
Proof:
Process Pool worker, returning the result by using `outqueue.put()`
No pipes or queues are used as part of the example code above. It transfers the large piece of data without serialization.
The point of the original post is that MP lets you do more than just serialize/ship data around after pool start time; there are substantial optimizations you can do if you know lots of the data you need to process early on.
I totally agree. It's just a better way of doing the things zmq already perfected.
Like, tell me if you've ever seen a python object that has a `dict` API, but does message passing in the background.
> central (pubsub?) server.
Central server, yes. It uses PUB-SUB for state watching and REQ-REP for everything else.
> you've just "discovered" message-passing
Guess you're right? 2 years is peanuts on that time scale...
P.S. Thanks for all the feedback, I've been dying to hear something for a while now.
I would suggest you don't make dramatic claims for a subject that has decades of theory behind it with a huge amount of nuance depending on the exact workload and characteristics of the machines in question.
Don't get me wrong, message-passing has some advantages, but they certainly aren't that it 'solves' parallelism. If you wish to know more, investigate:
- Smalltalk and Erlang (for message passing languages).
- QNX (for a message-passing OS)
- mpi4py (for a message-passing Python library; MPI is the grandfather of message-passing libraries and runs everywhere).
- Occam & the transputer for an example of a hardware message-passing implementation (actually it's Communicating Sequential Processes, but for your purposes it would be enlightening).
If you could point out some stuff from ZProc's page, that would be nice!
> mpi is the grandfather of message passing libraries
Never heard of it before, but a simple Google search reveals that it _might_ be more performant than zmq, but not as fault-tolerant and flexible. It really looks like a niche thing, judging from this comment by Pieter Hintjens:
> Why smart cloud builders are betting everything on 0MQ. In detail, compare to the alternatives. Hand-rolling your own TCP stack is insane. Using any broker-based product won't scale. Buying licenses from IBM or TIBCO would eat up your capital. Supercomputing products like MPI aren't designed for this scale. There is literally no alternative.
P.S. Seems like you know quite a lot about this topic. Do you have any projects of your own that I can see?
Bottom line, I think most people would be happy doing message-passing parallelism in the real world. Sure, it doesn't look that good in theory, but it works damn well in practice.
Nothing against zeromq, its good s/w, but like all tools it must be used appropriately.
...also, nanomsg is the 'improved' successor.
Also, MPI isn't a 'niche' thing, its the way that a large proportion of high-performance applications have been implemented for a few decades (think Crays & weather prediction).
ZeroMQ has a few simple web apps using it (I exaggerate slightly).
That depends on your workload. 0MQ is fine software, but they solve different problems.
The problem here is the claims you're making. You've written some utility classes around 0MQ for some applications, which is a real thing, so I'd rewrite your GitHub readme to just demonstrate what problems you've solved with it (and at what kind of scale). Making big, sweeping claims gets you into these kinds of threads, because extraordinary claims require extraordinary evidence :-)
Like the vast majority of software engineers, my work is not open source.
But since you seem to like arguments from authority: I've got around 25 years of experience in software, ranging from hard-real-time embedded defence software to safety-critical train braking systems. I've been software architect on systems selling tens of millions of products, and I'm currently working in the IoT space.
I've architected and implemented software on servers, desktops, embedded and mobile platforms.
But no, you aren't likely to find my stuff on GitHub.
If you're interested in IoT (or embedded s/w in general), get away from MicroPython.
The primary characteristic of most embedded products is to be low-cost. When you're selling millions of products, cost counts. You can't waste cycles or resources on Python.
MicroPython is a toy for the 'makers'. Similarly the JS equivalents. No real high volume product would use those technologies.
Won't argue about pyboard, because that's more of a gimmick. (Way too expensive for what it does)
I think the cost and time saved by developing on MicroPython vs a lower-level language like C would outweigh the cost associated with wasted CPU cycles.
However, I agree with the fact that no real product would use mpy right now in production because of the infancy of the project.
It certainly looks promising.
It's definitely NOT a toy.
OTOH JS is just a bad choice for this kind of work IMO. (I'm a firm believer that JS is just a bad choice for anything in general, but IoT is just madness.)
In embedded s/w, the cost of the h/w outweighs everything else. OK, so your 'easily developed' Python app will cost (say) $10K less to develop... but the resource requirements mean you need to go up $0.5 on the processor...
Oops, on your 500,000 devices you've suddenly wasted $250,000. All because you couldn't be bothered to save a few KB of RAM.
Python is a toy for embedded s/w... don't get me going on catching bugs at runtime in an embedded system because you're using a dynamically typed language! Now you have to upgrade 500,000 devices over the air (cellular data costs), which will take a week, meanwhile your customers are fuming because their data is being lost!
Do yourself a favour and learn your craft instead of trotting out obviously wrong statements to people who know better. That's not a way to make a good impression.
Oh yes, regarding Erlang: trust me, I'm aware of 'let-it-fail' and have implemented it in production on large distributed systems, and trust me... that does not justify writing embedded s/w in a dynamically-typed language.
> My library lets you do parallelism in a unique way
That's a big claim which you don't really back up as much as you need to. Unique is an extremely high bar in this very busy field.
There are several other similar red flags on the linked GitHub; I think your enthusiasm is running away from you a little. You might want to dial the ten-dollar language back a bit – it made me immediately suspicious ("utterly perfect", for example is another danger phrase).
It's the combination of grandiose language + solution-in-search-of-a-problem which leads to that.
If you're going to sell hard, what I would want to see is a large, complex, high-traffic system which makes extensive use of this; if you compare and contrast with Ray, which I've also only just encountered in this thread, there's a real problem (distributed hyperparameter optimization) which they've built a solution for with the library, and that immediately lends it credibility; I know the system can be used for something because it has been.
Thought linking it there would make it better, but I'll just remove it...
And you do make a good point. It doesn't really solve anything technically. But would you agree that it exposes a better API for doing much of the same stuff?
I wouldn't know without using it. That's where "software using this library" is a really useful bit of social proof. Think of Django; even without looking at the code you have a lot of evidence that it can conveniently solve a wide range of real problems.
>> Zproc uses a Server, which is responsible for storing and communicating the state.
>>
>> This isolates our resource (state), eliminating the need for locks.
So you've just invented a new name for a coordinator process and called it a new fashion in computation?
You're probably right, but see my comment above: not only is MP possibly superior at being a pickling/arbitrating server, but it also supports taking advantage of copy-on-write semantics on Unix-ish systems to transfer memory to children at startup in constant time with no pickling/unpickling necessary.
They did not, which is why this "course" illustrates taking advantage of multiple cores via multiprocessing without mentioning the GIL at all. Which is a little misleading if you think about it.
Also, by having the introductory chapter be about "functional programming" (which incidentally Python does not do well), he completely bypasses the serious issue of shared state.
Which goes to show that parallelism in Python is more like a gimmick than a real-world solution since it doesn't let you do in-process shared-memory processing via threads in parallel which is so important for many applications. In my case, the vast majority of the time I do not want to farm workers out to different operating system processes and deal with serialization and communication, but this is the only way for Python code to take advantage of multiple cores [1].
[1] Another way is to write a module in C and have Python code call into it on a new thread and release the GIL while doing so, but of course this is even worse pain-wise than doing it with multiprocessing and you end up writing/compiling C.
> It lets you do message passing parallelism without the effort of tedious wiring.
You'll be doing message passing without ever dealing with sockets!
Also, shared-memory parallelism is hard to get right regardless of which language you use. I would recommend strongly against it, unless you're writing some really really really niche thing where message passing is a bottleneck (it isn't most of the time).
The mantra that shared-memory parallelism is hard to get right, repeated to the point where platitudes like "unless you're writing some really really really niche thing" get uttered, is something I find entirely erroneous through my own experience.
There are idiot-proof thread-safe datastructures and producer/consumer APIs that map extremely well to most problems that come up in practice in the domain, that one should confidently use. Refusing to do shared memory parallelism because of the _abstract potential for havoc_ rather than any practical justifications based on the problem-at-hand is throwing out the baby with the bathwater and is not the mark of competent engineering.
You must be some sort of programming GOD, I guess.
The problem is that it's _hard_ to get right.
For example, it's not trivial to use locks when you're working at an abstraction level higher than operating systems. Most people don't even realise there is a race in their application, because locks are inherently non-enforcing. Code written with locks is also really hard to read and reason about.
Message passing just makes it a little more trivial to avoid the pitfalls associated with parallel programming.
I also found that it lets you avoid busy waiting in certain places, which is always a performance advantage :)
Can you shed some light on those "idiot-proof thread-safe datastructures"?
I do concurrency in Java all the time with CompletableFuture and threadsafe data structures provided by various libraries, e.g. the Guava caches, and I rarely need to use locks or semaphores. It's a good set of abstractions that make concurrency pretty close to idiot-proof.
Futures in particular make it easy to write concurrent code close to the way you would write single-threaded code, because all of the threading is handled behind the scenes.
It uses 100% CPU, true, but when the duration of the lock is extremely small (i.e. nanoseconds to microseconds) the total CPU usage is less than arranging for an OS-level context switch.
In other words, you use it when synchronising with hardware or when implementing test-and-set primitives for higher level mechanisms. Crucially, the time that the lock is held for must be very short.
Given those restrictions and use cases you get a very efficient low latency locking mechanism.
"By "perfect MT programs", I mean code that's easy to write and understand, that works with the same design approach in any programming language, and on any operating system, and that scales across any number of CPUs with zero wait states and no point of diminishing returns."
That doesn't mean to say it's "perfect" or "solves" multithreading, just that it's easy to write and understand and portable across architectures.
That says nothing of how optimal it is for concurrency or parallelism, ease-of-use-wise or performance-wise, just that it's 'easy'.
Easy to write and understand is something completely different from correctness, robustness, scalability, etc.
All those must be considered if you think you have 'solved' parallelism, but they are orthogonal to 'easy to understand'.
Perfect _implies_ that it's easy to write and understand, but it's not the whole picture. It's just a feature that _he_ thinks is _crucial_ to it being perfect.
You get my point right?
Like sure, you could implement a _perfect_, I don't know, GNOME desktop in assembly language, but it wouldn't be easy to write and understand.
He thinks it's essential that it should be easy to read and write for it to be perfect.
Unfortunately, he's not with us any more, so we can't even ask him to confirm :(
Obligatory "concurrency != parallelism" statement; concurrency is fine on both platforms with Python threading in a single process with a GIL; parallelism is less of a done deal.
While it's a very big hammer, consider experimenting with Celery for your parallelism needs on Windows. I've had good results using per-script Celery "clusters" with either a filesystem (on a ramdisk for extra speed) or an embedded Redis backend to accomplish pretty nice bidirectional RPC-ish parallelism. The initial setup is much more complicated than something like goroutines, but once you get it working you can boilerplate it onto other tasks without much trouble.
It still won't save you from memory constraints imposed by the lack of good fork() emulation, though. Hopefully the WSL stuff will either bring better fork() emulation, or allow support for shared memory objects (e.g. multiprocessing.Value) in order to ease some of that pain.
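For anyone who hasn't seen Celery before, the basic shape is something like this (the broker URL and task body are placeholders; start a worker with `celery -A tasks worker`, then call the task with `.delay(...)` from your script and `.get()` the AsyncResult):

```python
# tasks.py
from celery import Celery

# Placeholder broker/backend; a filesystem or Redis backend both work,
# as described above.
app = Celery("tasks",
             broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/0")

@app.task
def crunch(n):
    # Stand-in for CPU-heavy work executed in a worker process
    return sum(i * i for i in range(n))
```

From the driver script, `[crunch.delay(2_000_000) for _ in range(8)]` followed by `.get()` on each result gives the RPC-ish parallelism described, spread across the worker pool.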
mpi4py should be included. It's a wrapper for the MPI library, which is the de facto standard for scientific computing:
https://mpi4py.readthedocs.io/en/stable/
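A minimal mpi4py sketch (the numbers are arbitrary; launch with something like `mpiexec -n 4 python script.py`):

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Each rank takes a strided slice of the work
local_sum = sum(range(rank, 1_000_000, size))

# Combine the partial results on rank 0
total = comm.reduce(local_sum, op=MPI.SUM, root=0)
if rank == 0:
    print("total:", total)
```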
If your tasks are fairly coarse-grained (taking >50ms each), Celery [1] has existed for several years; it takes a bit of setting up but works well, and it's very flexible. If your needs are simple, don't forget that your common or garden webserver can parallelize workloads too (distribute web requests to workers on multiple cores); it depends mostly on your client code for fan-out, and redis has worked well for synchronization for me.
Nowadays you can also use serverless to parallelize coarse-grained workloads in the cloud.
Multi-core parallelism isn't so interesting for serious computation. You want to be able to use large distributed HPC systems, but Python doesn't seem to have the equivalent of https://pbdr.org for R.
One more epic discussion on Python, where we have the unique opportunity to learn that using C libraries from Python is "cheating".
I could not agree more
It's definitely cheating to use C code, with the exception of most Python libraries that are already, to a large extent, nothing more than thin wrappers over existing C libraries. Or the tiny fact that the most popular implementation of Python by far, CPython, is almost 50% implemented in the C language, including the standard library. The author even dared include "C" in the name of the implementation.
Those cheaters, becoming bolder and bolder every day.
The GIL has considerable benefits: I don’t have to worry about whether Python functions are thread-safe. Thread-based parallelism is hard to get right, and given the number of workarounds, Python’s GIL is a total non-issue.
Can you elaborate on that? Is there a blog post somewhere that illustrates the problem you're talking about? I was under the assumption that Python interpreters run single-threaded.