The mechanism PyParallel uses is that it still has the one main thread with the GIL. It also has parallel threads, each with its own heap and a trivial pointer-incrementing (bump) allocator backed by a thread-local pool that deallocates on completion by dropping the whole pool. Either the main thread is running, in typical Python fashion, or some parallel threads are running. When parallel threads are running, the main Python heap is all marked read-only and all reference counting is disabled. I believe communication from parallel threads to the main thread is done by sending serialised objects.
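A toy model of that allocator scheme (class and method names are mine, not PyParallel's actual C internals):

```python
class Arena:
    """Toy model of a parallel-context allocator: allocation is a pointer
    bump into a thread-local block, and deallocation is dropping the
    whole block when the parallel context completes."""

    def __init__(self, size):
        self.buf = bytearray(size)   # the thread-local pool
        self.offset = 0              # the pointer that only ever increments

    def alloc(self, n):
        start = self.offset
        if start + n > len(self.buf):
            raise MemoryError("arena exhausted")
        self.offset = start + n      # trivial pointer-incrementing allocation
        return memoryview(self.buf)[start:start + n]

    def release_all(self):
        self.offset = 0              # no per-object frees: drop the pool wholesale
```

There are no per-object frees and no refcounting, which is what makes allocation in the parallel context so cheap.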
> I believe communication from parallel threads to the main thread is done by sending serialised objects.
That's one of the open areas of exploration... how do you share things in a shared-nothing approach? This isn't as important for stateless I/O-driven applications (like an HTTP server), but it becomes very important when trying to leverage PyParallel for, well, exactly that, parallel computation.
Another way is to use a faster serialiser (marshal, ujson) and write the objects to files on SSDs with flock, but again that's a trick, and slower than real data sharing as in Java and Go.
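A minimal sketch of that file-based approach, assuming POSIX flock semantics (function names are mine):

```python
import fcntl      # POSIX-only advisory locking via flock()
import marshal    # fast, Python-specific serialiser

def write_shared(path, obj):
    """Writer side: take an exclusive lock, then dump the object."""
    with open(path, "wb") as f:
        fcntl.flock(f, fcntl.LOCK_EX)
        marshal.dump(obj, f)
        fcntl.flock(f, fcntl.LOCK_UN)

def read_shared(path):
    """Reader side: take a shared lock, then load the object."""
    with open(path, "rb") as f:
        fcntl.flock(f, fcntl.LOCK_SH)
        obj = marshal.load(f)
        fcntl.flock(f, fcntl.LOCK_UN)
    return obj
```

Even with an SSD underneath, every exchange pays serialise/write/read/deserialise costs, which is why this is slower than genuinely shared memory.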
There's a more recent but slightly shorter deck that focuses more on the general concurrency/parallelism problem here: https://speakerdeck.com/trent/parallelism-and-concurrency-wi...
less is more :) one of the best and most memorable talks i went to was 4 slides long.
that said i've been thinking a lot the past week or so about a machine i can dedicate to scientific computing, preferably using Sage notebooks or something like that. i have some long running evaluations (graphs) that i would rather not crush my laptop for. and so i get to thinking about the kind of machine i want to do this on; something parallel is tempting because of the higher speed gains. that said, i'm wondering if that would help Python (and hence Sage) or not. so now i'm looking at something like iJulia (Julia in an IPython notebook) and cobbling together something like this.
then i see your parallel python mods and wonder if i should try this. should i? will it help me in my use case (speeding up a Sage server)?
You probably don't want to use all of Sage for that anyway because you are most likely not using GAP, for example. Instead gut out parts of Sage and look at IPython's interface to Spark http://nbviewer.ipython.org/gist/JoshRosen/6856670 or implement a protocol for your specific computation and use PyParallel.
I highly doubt you can just point Sage at a new Python implementation, tell it to roll with it, and expect things to be automagically faster.
Also, if you have a long running job, then why would you do it in a notebook that requires you to keep your browser open?
(Spot on, by the way.)
"Completeness" isn't really a good argument against shortening the talk ruthlessly. I watched the video and at this speed I hardly think I remember more than maybe 10 slides' worth anyway.
However, between accomplishments like Micropython (huge potential for Python on mobile/resource-constrained devices), PyPy's slow but steady gains, and projects like this, it's at least an interesting time for Pythonistas.
Now, if we could only get an optional static type checker... (heresy, I know). Dynamic typing is great for quick prototyping, and I would never want to lose that in Python, but I'm very uneasy now taking on any large projects or long-term projects without static typing. Mypy holds some promise here, but I think it will take sponsorship from a big company to push something like this to a mature state.
People have been taking serious looks at the GIL for a long time (there were technically working GIL-less versions of 1.5). The issue has never been that no one wanted to work on it (let alone that it was impossible), but that so far the single-threaded performance hit has been too large to be acceptable to the core team.
> Now, if we could only get an optional static type checker... (heresy, I know).
Er… how would that be heretical when Python 3's annotations were introduced to support exactly this?
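For instance, annotations are already enough to hang a checker off of. Here's a toy decorator (mine, not mypy's) that enforces them at call time, where a real static checker would verify the same facts ahead of time:

```python
import functools
import inspect

def typechecked(func):
    """Toy enforcement of a function's annotations at call time --
    a stand-in for what a static checker verifies before running."""
    hints = func.__annotations__
    sig = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = sig.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            if name in hints and not isinstance(value, hints[name]):
                raise TypeError(f"{name} must be {hints[name].__name__}")
        result = func(*args, **kwargs)
        if "return" in hints and not isinstance(result, hints["return"]):
            raise TypeError("bad return type")
        return result
    return wrapper

@typechecked
def double(x: int) -> int:
    return x * 2
```

The annotations are ordinary Python metadata; what's missing is the tooling that reads them statically.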
I know well about Python 3 annotations, since that's what gave me hope to begin with. But I had the impression that every serious proposal I've seen on the Python Dev mailing list to integrate a static type checker in CPython had been met with rejection.
However, I just checked the archives and see now I was wrong. There have been multiple discussions about this, and even a working group.
GvR has given static types in Python the cold shoulder, but that attitude is clearly not shared by everyone in the community.
Glad to have been wrong in this case.
Good luck encoding this simple contract (in Racket) exactly in your type system:
  (define/contract (fun numbers)
    (-> (listof (and/c (integer-in 0 255) even?)) any)
    ;; do something with a list of even integers in a given range
    numbers)
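For contrast, the same constraint is trivial as a dynamic check in Python (check_even_bytes is my naming, not anything from pycontracts):

```python
def check_even_bytes(numbers):
    """Runtime contract: a list of even integers in the range 0..255."""
    if not all(isinstance(n, int) and 0 <= n <= 255 and n % 2 == 0
               for n in numbers):
        raise ValueError("contract violated: expected even ints in 0..255")
    return numbers
```

The check runs the code; encoding "is even" in the type system itself is what requires dependent types.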
Basically, my parent said that static typing can express many things that testing (which is not static, because it runs the code) can. This is true, but there are things you can't express in any of the commonly used static type systems and I just gave an example of such thing.
Other than this there's nothing wrong with static typing in general and I'd be happy if Python got some kind of (optional) static type system, maybe in the vein of Typed Clojure and Typed Racket or Erlang's Dialyzer. Still, static type systems are not silver bullets and have their drawbacks - and that's all I wanted to convey in my comment.
Edit:  the type system feature which would let you encode the example contract I posted is called "dependent typing" and is available only in a very few languages.
Racket is more powerful, but that is a bad example.
I really don't understand what's going on in this thread; I'm stating an obvious fact that you can't encode things like "is even" in a static type system without dependent types, and that this is a drawback of such systems, and I get downvoted. I never complained about my downvotes before, but this time I seriously don't understand what's happening. Did I fail that miserably in my effort to explain my position, or is this position controversial?
For the record, I replied to this:
> [dynamic in nature] Tests are not the only way to ensure correctness (and they aren't even the only thing you should use, even when they’re mandated by policy) – did you consider using strong [static?] types encoding exactly the assumptions you spot-check with tests?
With an example of dynamic test (a contract) which you'd have a problem expressing in any but the most advanced type systems.
Did claudius mean strong but dynamic typing, and I misunderstood? But then there is no difference at all between ensuring something in unit tests and in a type system, as both are executed at runtime.
I honestly have no idea and I really would like to know which part of what I wrote deserved downvotes to avoid writing such things in the future if for no other reason.
Looks like I misunderstood you; that's why my reply doesn't make much sense. But it is possible to verify that kind of constraint statically if you accept a certain number of false errors, which all static type systems generate to varying degrees.
Your example and the equivalent pycontract are static types that lack a static-time verifier, not dynamic ones (pycontracts can only create static types AFAIK).
As much as you realize that having tests also doesn't mean error-free code.
>Even with static-typing, it's probably a good idea to have tests.
For much less stuff. With a proper compiler, half of the kind of tests people do in Ruby land are totally needless. You can refactor and you know from the first recompile what has broken and where.
That would probably make it easier to decouple from "Normal" python code too.
I'm not a very senior software dev or anything, but I imagine I'd like having a well-written, compact .py file detailing the logic, and a separate 'doc'(?) file containing long-winded explanations of what some key functions do and listing all the type consistencies I absolutely wish to guarantee.
In fact, this is exactly what OCaml does (or can do). The interface file is not always required, but people usually include it for clarity, to make sure the compiler's type inference works according to their wishes, or to limit visibility of certain types and functions.
So, if I were to implement what you said, I'd go with a file containing interface classes, i.e. classes that have getters/setters and type checking built in and documented, and another file defining logic operating on them.
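A minimal sketch of such an interface class (the class and its fields are illustrative):

```python
class Pixel:
    """Interface class: the contract lives in documented, type-checked
    properties; the logic operating on Pixels lives in another file."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    @property
    def x(self):
        """Horizontal position; guaranteed to be an int."""
        return self._x

    @x.setter
    def x(self, value):
        if not isinstance(value, int):
            raise TypeError("x must be an int")
        self._x = value

    @property
    def y(self):
        """Vertical position; guaranteed to be an int."""
        return self._y

    @y.setter
    def y(self, value):
        if not isinstance(value, int):
            raise TypeError("y must be an int")
        self._y = value
```

The properties make the type guarantees part of the interface file, while the implementation file never has to repeat the checks.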
Another pattern is to use something like:
    # interface module
    ... define key constraints ...

    def do_stuff(...):
        """Long-winded docstring about do_stuff."""

    from .impl import do_stuff  # the real implementation replaces the stub
Do you have data for this claim or is it a hunch? I've seen this meme repeated a lot in HN.
No, I don't have any solid data, but I do think the majority of people posting about Go on HN have either Python or Ruby backgrounds. I've also found a lot of Python people in the Rust community (which I personally prefer vastly to Go).
People need more performance, particularly multicore performance. Traditional Python supporters can put their head in the sand about this if they want, but these highly-performant new languages have clearly found a niche among Python folks tired of trying to optimize all the time.
And yes, there's a fine line between conducting needed optimizations and wasting time prematurely optimizing, but people would clearly rather spend a little more time up front in exchange for a big speedup. The choice is no longer between C and Python -- there's a nice middle ground.
Right. I find it odd because I don't get why Go is a supposed replacement for Python. Does Go have a framework like Django, a good SQL API/ORM, numerical computing packages, scientific packages, machine learning? This is where I see people using Python the most.
> And yes, there's a fine line between conducting needed optimizations and wasting time prematurely optimizing, but people would clearly rather spend a little more time up front in exchange for a big speedup
Yeah, fast by default is not only acceptable but desirable; I don't think using a modern compiled language counts as premature optimization. Using Go doesn't look a lot more complex or time-consuming than Python: you lose some flexibility but gain other things (e.g., being able to distribute binaries).
Personally, I'm not keen on it - I find that roughly 2/3rds of my code end up being error handling - and hope Rust eventually takes off.
I agree, the status-quo isn't great at the moment if you want to use Python and optimally exploit all your cores. I actually cover that in a more recent presentation: https://speakerdeck.com/trent/parallelism-and-concurrency-wi...
First of all, I don't even particularly like Golang as a language, so you're incredibly wrong about my motivations.
Secondly, if you don't think Python to Golang adoption is on the rise among SV start-ups, you're living in a bubble. Look up HN Golang posts and count how many of them are from Python backgrounds. This may not represent the entire global Python programming community, but it's without doubt an important and influential segment of it.
Finally, I clearly struck some kind of nerve, since my comment is at the top of the page. You may not like my opinion, but it's evidently a common one.
And I agree that the tooling is part of what makes static types so powerful, but I actually think the tooling might emerge organically if there's a well-designed, effective, and open-source static type checker available.
Or neither, given that it's okay to have specialized tools for different jobs. Speed/networking was never supposed to be Python's motto.
I'm not even sure if I can adequately summarize it here :-)
(I'm planning on covering this stuff in a subsequent presentation.)
I played around with a few approaches to the things you're asking. Had good results with specialized interlocked container-type classes (e.g. xlist(); a simplified list-type object) that parallel threads could use to persist simple scalar objects (string, int, bytes).
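As a rough pure-Python stand-in for what such an interlocked container's interface might look like (the real xlist is lock-free C; this sketch just uses a plain lock):

```python
import threading

class XList:
    """Stand-in for an interlocked list: parallel threads push simple
    scalar objects; the main thread drains the results later."""

    def __init__(self):
        self._lock = threading.Lock()   # the real thing uses interlocked ops
        self._items = []

    def push(self, item):
        with self._lock:
            self._items.append(item)

    def drain(self):
        with self._lock:
            items, self._items = self._items, []
        return items
```

The interesting property is that pushes from parallel contexts never touch the main heap's refcounts; this sketch only captures the calling convention, not that guarantee.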
I think there's definitely room for sharing techniques that exploit the fact that threads inherently share address space -- I don't want to say that shared-nothing is the only paradigm supported and all communication must be done by message passing (like Rust?), because that's not the best solution for all problems.
As for the main-thread/parallel-thread pause/run relationship... it won't be as black and white as I allude to in the deck -- the main thread will still be running, albeit with a limited memory view (i.e. you'll be restricted with what you can do in the main thread whilst parallel threads are running).
Ideally, the only time all the parallel threads get paused is when global state needs to be updated by the main thread. Constantly pausing the parallel threads just because the main thread needs to do some periodic work won't be ideal.
The application you're referring to... is it something that exists, or are you just using hypothetical examples? I'm always curious to hear of architectures where threads need to constantly talk to each other in order for work to get done -- does your app fit this bill?
The Lonestar benchmark suite from Texas includes good examples http://iss.ices.utexas.edu/?p=projects/galois/lonestar.
Re: target problems... the catalyst behind PyParallel can ultimately be tied back to the discussions on python-ideas@ in Sept/Oct 2012 that led to Python 3.4's asyncio.
I wanted to show that, hey, there's a different way you can approach async I/O that, when paired with better kernel I/O primitives, actually allows you to exploit parallelism too (i.e. use all my cores).
I'm particularly interested in problems that are both I/O-bound (or driven) and compute heavy, which is common in the enterprise. The parallel aspect of PyParallel is an area I'm still fleshing out (I wanted to get the async stuff working first, and I'm happy with the results). I definitely want to spend the next sprint focusing on using PyParallel for parallel computation problems, where you typically go from sequential execution, fan out to parallel compute, then fan back in to sequential execution. This is common with aggregation-oriented "parallel data" problems.
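That fan-out/fan-in shape can be sketched with today's stdlib (modulo the GIL serializing the CPU-bound part, which is of course the whole point of PyParallel):

```python
from concurrent.futures import ThreadPoolExecutor

def square(n):
    # stand-in for the compute-heavy per-item work
    return n * n

def fan_out_fan_in(data, workers=4):
    """Sequential -> parallel map (fan out) -> sequential reduce (fan in)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(square, data))   # fan out across workers
    return sum(partials)                          # fan back in sequentially
```

The sequential/parallel/sequential structure is the same one aggregation-oriented "parallel data" problems exhibit.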
I'm definitely less familiar with problems that inherently require a lot of cross-talk between threads, like the agglomerative clustering referred to in that paper linked above.
Now, all that being said, I did have some good results simply wrapping Windows' synchronization primitives (http://msdn.microsoft.com/en-us/library/windows/desktop/ms68...) and exposing via Python: http://hg.python.org/sandbox/trent/file/0e70a0caa1c0/Lib/asy...
Things like `async.signal_and_wait(object1, object2)` are actually pretty darn useful. Again, it's thanks to the vibrant set of synchronization primitives provided by Windows.
One of the big reasons it's effective in Servo is that it can prove that multiple threads working on a shared, mutable data structure are operating in a safe manner.
There's also a journal article under submission which is a more consumable version of his dissertation work: http://www.haripyla.com/wordpress/wp-content/uploads/2012/11...
Well... er, that's a bit of a vague sentence :-)
> but what about interrupting a parallel thread from the main thread and send data to it?
That is the most UNIX-ey signal-ly thing I've ever heard :-) That's sort of what's wrong with signals on UNIX, a paradigm that's useful at the process level when you have one thread of execution, but falls apart in a multithreading world.
The correct approach is to use a mechanism like IOCP: https://speakerdeck.com/trent/pyparallel-how-we-removed-the-...
That slide depicts how the kernel pushes completion packets onto an I/O completion port, such that they can be processed by waiting threads, but that's just one example. Anything can push a completion packet to an IOCP -- in fact, that's exactly how you'd get your parallel worker threads to gracefully shutdown: have the main thread enqueue a "shutdown please" completion packet (via PostQueuedCompletionStatus()), which you'd detect in your parallel threads.
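That shutdown-via-sentinel pattern can be sketched with queue.Queue standing in for the completion port (names and numbers mine; GetQueuedCompletionStatus/PostQueuedCompletionStatus are the Windows calls being mimicked):

```python
import queue
import threading

SHUTDOWN = object()  # sentinel standing in for the "shutdown please" packet

def worker(port, results):
    while True:
        packet = port.get()          # GetQueuedCompletionStatus() analogue
        if packet is SHUTDOWN:
            port.put(SHUTDOWN)       # re-post so sibling workers also see it
            break
        results.append(packet * 2)   # process the completion packet

port = queue.Queue()
results = []
threads = [threading.Thread(target=worker, args=(port, results))
           for _ in range(4)]
for t in threads:
    t.start()
for i in range(10):
    port.put(i)                      # normal completion packets
port.put(SHUTDOWN)                   # PostQueuedCompletionStatus() analogue
for t in threads:
    t.join()
```

Because anything can enqueue a packet, shutdown is just another message rather than an out-of-band interrupt.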
> Also, what would PyParallel look like if it has to support all these NIX systems?
The GIL-sidestepping techniques via Py_PXCTX are conceptually platform independent -- they'll work fine on any POSIX platform (although things like interlocked lists that I get for free on Windows and OS X will need to be re-implemented on other platforms).
The intrinsic pairing between asynchronous I/O and parallelism (that is, automatically and efficiently handling work in an I/O-driven system on all hardware cores) needs kernel-level support: https://speakerdeck.com/trent/parallelism-and-concurrency-wi...
Linux and BSD are the odd ones out here. AIX copied the IOCP API verbatim soon after NT 4.0 came out (I suspect they recognized a good thing when they saw it!), Solaris implemented something very close with event ports (except for the kernel being cognizant of event port concurrency, which is a key piece), and OS X got Grand Central Dispatch, which is a wildly different API (and a bit nicer, to be honest), but semantically equivalent to IOCP+threadpools on Windows.
There are two outcomes re: Linux/BSD/POSIX support.
1. They implement the same kernel-level primitives for async I/O and synchronization supported by Windows/OSX, which PyParallel would be able to use directly.
2. They don't, so the PyParallel-backend is implemented via existing primitives (epoll/kqueue etc).
I'm of the opinion that the Windows kernel-level primitives are fundamentally superior at the architectural level compared to the existing facilities provided by Linux/BSD/POSIX, evidenced by: a) better performance on identical hardware, b) less code required to achieve the desired effect, and c) much cleaner code versus the alternative.
The latter is, at a conceptual level, not limited to Windows.
The proof-of-concept implementation that pairs the two concepts is Windows-only, at the moment, because Windows simply has better out-of-the-box scaffolding for this sort of stuff: https://speakerdeck.com/trent/parallelizing-the-python-inter...
As for non-Windows implementations, there are other operating systems out there that provide similar primitives: AIX copied the IOCP API from Windows verbatim, Solaris has event ports, and OS X got GCD.
I'd love to see the Linux kernel provide semantically equivalent primitives. You simply can't achieve the same effect without kernel-level thread dispatching support tied into the mix (https://speakerdeck.com/trent/parallelism-and-concurrency-wi...).
I'd love to see that happen before the year is out :-)
Is it possible to get the thread ID efficiently on Linux or other *NIX systems, in a fashion similar to the one presented for Windows in the slides? I have no idea how much calling pthread_self() every time impacts performance, but it would be better if a faster method were available.
(Source: http://koala.cs.pub.ro/lxr/glibc/nptl/pthread_self.c and http://koala.cs.pub.ro/lxr/glibc/nptl/sysdeps/x86_64/tls.h#L...)
(Funnily enough that intrinsic doesn't (or didn't at the time) appear to be exposed by MSVC, so I just stuck with __readfsdword(0x48) on Windows.)
This mechanism is essentially the only reason why amd64 in long mode still has limited support (complete enough to implement this, nothing much more) for segment registers.
Does anyone (or Trent) know how this appointment will support or hinder further development of PyParallel?
I assume that Continuum could accelerate this development or direct Trent's efforts to other areas.
Either way, I'm sure there will be benefits for the Python ecosystem.
It's been a busy time since I first presented PyParallel to the core Python developers at PyCon last year -- incidentally it was also when I met Peter and Travis and joined Continuum.
I've since relocated from East Lansing to NYC, via visas and trips to Australia and whatnot, and have been busily engaged with client consultancy here in NYC since arriving officially around July/August last year.
Peter (President) and Travis (CEO) are very supportive of PyParallel, and it's actually an incredibly good fit within Continuum's existing ecosystem. I'm primarily engaged with consultancy at the moment, but we're looking at having me spend more time on PyParallel development very soon. Watch this space!
Rewriting the original program in Java would make the program about 100 times faster... on one core.
Now let's see: do we put up with a hacked Python and complex, error-prone thread synchronization code for an 8x gain, or grab the 100x gain in a single thread in Java?
While I don't mind Python getting faster, threading should be the last thing to try, after everything else to make Python run fast in one thread has been exhausted.
His point was about using Java instead of Python. It was that at best, a perfect parallel CPython can yield an N times more performant program (where N = number of cores). And that's for a perfect parallel CPython, and only when there's very little sync/sharing overhead and the program is fully parallelizable. So, even on an 8-core machine, parallelizing Python can yield at best c*8 better performance, where c < 1.
Now, what he says is, there are languages that run stuff 10 and 100 times faster than CPython on a SINGLE core. So maybe start from there (improving single core CPython performance), something which has much more room to speed up our code, even if it's not parallel.
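Putting rough numbers on that argument (both factors are illustrative, not measurements):

```python
cores = 8
c = 0.8                        # assumed parallel efficiency, c < 1 (sync/sharing overhead)

best_parallel_gain = c * cores # at best ~6.4x from parallelizing CPython
single_core_rewrite_gain = 100 # the rough CPython-vs-fast-runtime factor cited above
```

Even under the generous efficiency assumption, the single-core ceiling dwarfs the parallel one, which is the whole argument for improving single-core CPython first.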
His point was about using the language with the most productive / fastest libraries. Single-thread python vs 8-thread python vs java makes no difference if the majority of your processing time is spent inside a highly optimised native library.
Neither I nor the OP told anybody to use Java instead of Python. What we said is that it would be better for performance to optimize CPython's core speed instead of its multi-core capabilities.
So the remark about Python having more libs than Java is beside the point, since Java wasn't mentioned as a migration option, but as an example of how far single-core performance can be taken, and as advice to try to get some of that into Python.
Oh, and the "but still it doesn't matter because Python has fast native libs" is not an argument either, because if that was enough people wouldn't care about parallelizing Python to get more speed. Which was the whole topic that started this thread.
> if [native code] was enough people wouldn't care about parallelizing Python
They would, for the same reason that they care about parallelizing Java -- once you're hitting the limits of single-thread speed (whether by being fast yourself or by having fast extensions), multithreading is next on the list.
> there's plenty more to be gained yet from that single thread.
Except that there isn't, if your single thread is spending most of its time in highly optimised native code extensions already :P
Numba is one option -- rather than executing the CPython innards, have it switched out for an LLVM-optimized kernel instead.
Of course development time will also go up 10,000 times! </irony>