PyParallel: How we removed the GIL and exploited all cores (speakerdeck.com)
170 points by trentnelson on June 7, 2014 | 85 comments



A quick summary, from the slides:

The mechanism PyParallel uses is that it still has the one main thread with the GIL. It also has parallel threads, each with its own heap and a trivial pointer-incrementing allocator in a thread-local pool that deallocates on completion by dropping the whole pool. Either the main thread is running, in typical Python fashion, or some parallel threads are. While parallel threads are running, the main Python heap is marked read-only and all reference counting is disabled. I believe communication from parallel threads to the main thread is done by sending serialised objects.
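
A minimal sketch of what that model looks like from user code, based on the submit_work/run style API in the deck (hedged: the exact names and signatures here are from memory and partly assumptions):

    import async  # PyParallel's async module, per the deck

    def work():
        # Runs in a parallel thread: the main heap is read-only here, and
        # every allocation comes from this thread's own bump allocator,
        # which is dropped wholesale (no refcounting) when we return.
        return sum(i * i for i in range(10 ** 6))

    async.submit_work(work)
    async.run()  # main thread steps aside; parallel threads run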


InfoQ did a surprisingly accurate summation, too: http://www.infoq.com/articles/PyParallel

> I believe communication from parallel threads to the main thread is done by sending serialised objects.

That's one of the open areas of exploration... how do you share things in a shared-nothing approach? It isn't as important for stateless, I/O-driven applications (like an HTTP server), but it becomes very important when trying to leverage PyParallel for, well, exactly that: parallel computation.


I tend to do that with multiprocessing queues, but at some point the serialisation becomes the bottleneck.

Another way is to use a faster serialiser (marshal, ujson) and write the objects to files on SSDs with flock, but again that's a trick that's slower than real data sharing like Java and Go have.
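
For concreteness, the queue pattern in question (standard library only; the pickle on put()/get() is exactly where the serialisation cost shows up):

    from multiprocessing import Process, Queue

    def worker(q):
        # The result dict is pickled when put on the queue...
        q.put({"result": sum(range(10 ** 6))})

    if __name__ == "__main__":
        q = Queue()
        p = Process(target=worker, args=(q,))
        p.start()
        print(q.get())  # ...and unpickled here, in the parent
        p.join()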


Good summary ... which highlights Greenspun's tenth rule as applied to Erlang.


If you want to see me try to get through 153 slides in 45 minutes (and fail), this talk was recorded here: http://vimeo.com/79539317.

There's a more recent but slightly shorter deck that focuses more on the general concurrency/parallelism problem here: https://speakerdeck.com/trent/parallelism-and-concurrency-wi...


What would be terrific is a 400 word article summarizing what you did.


There is a fair summary written by Jonathan Allen on InfoQ [0].

[0] http://www.infoq.com/articles/PyParallel


I inadvertently ran into this article a few weeks ago and was pleasantly surprised... it's not a bad summary at all (especially considering I had no contact with Jonathan).


> 153 slides in 45 minutes (and fail)

less is more :) one of the best and most memorable talks i went to was 4 slides long.

that said, i've been thinking a lot the past week or so about a machine i can dedicate to scientific computing, preferably using Sage notebooks or something like that. i have some long-running evaluations (graphs) that i would rather not crush my laptop with. and so i get to thinking about the kind of machine i want to do this on; something parallel is tempting because of the higher speed gains. that said, i'm wondering if that would help Python (and hence Sage) or not. so now i'm looking at something like iJulia (Julia in an IPython notebook) and cobbling together something like this.

then i see your parallel python mods and wonder if i should try this. should i? will it help in my use case (speeding up a Sage server)?


My guess is that you can't just drop PyParallel in without porting Sage, because it's not clear how you would pass around objects like those used for GAP. Sage would also have to be rewritten to use the new API of implementing protocol-based classes.

You probably don't want to use all of Sage for that anyway, because you are most likely not using GAP, for example. Instead, gut out the parts of Sage you need and look at IPython's interface to Spark (http://nbviewer.ipython.org/gist/JoshRosen/6856670), or implement a protocol for your specific computation and use PyParallel.

I highly doubt you can just point Sage at a new Python implementation, tell it to roll with it, and expect things to be automagically faster.

Also, if you have a long running job, then why would you do it in a notebook that requires you to keep your browser open?


^^^ what turnersr said.

(Spot on, by the way.)


thanks, turnersr, all good points.


I use cloud.sagemath.com for offloading heavy computational work.


Depending on what kind of graph problem you might consider graphlab or one of the other specialty graph processing systems out there.


Fwiw, 2min/slide (averaged) isn't a bad heuristic to start with.


If you don't have time for 153 slides, maybe you shouldn't have 153 slides in the deck ;-)

"Completeness" isn't really a good argument against shortening the talk ruthlessly. I watched the video and at this speed I hardly think I remember more than maybe 10 slides' worth anyway.


But we wouldn't know that if he hadn't tried :)


I'm really glad to see some of the Python committers taking a serious look at the GIL. Python is either poised for great victory (given its rapid rate of adoption in academia) or slow failure (given the rapid rate at which server apps are starting to migrate from Python to Go).

However, between accomplishments like MicroPython (huge potential for Python on mobile/resource-constrained devices), PyPy's slow but steady gains, and projects like this, it's at least an interesting time for Pythonistas.

Now, if we could only get an optional static type checker... (heresy, I know). Dynamic typing is great for quick prototyping, and I would never want to lose that in Python, but I'm now very uneasy taking on any large or long-term project without static typing. Mypy holds some promise here, but I think it will take sponsorship from a big company to push something like this to a mature state.


> I'm really glad to see some of the Python committers taking a serious look at the GIL.

People have been taking serious looks at the GIL for a long time (there were technically working GIL-less versions of 1.5). The issue has never been that no one wanted to work on it (let alone that it was impossible), but that so far the single-threaded performance hit has been too large to be acceptable to the core team.

> Now, if we could only get an optional static type checker... (heresy, I know).

Er… how would that be heretical when Python 3's annotations were introduced to support exactly this?


>Er… how would that be heretical when Python 3's annotations were introduced to support exactly this?

I know about Python 3 annotations well, since that's what gave me hope to begin with. But I had the impression that every serious proposal I'd seen on the python-dev mailing list to integrate a static type checker into CPython had been met with rejection.

However, I just checked the archives and see now I was wrong. There have been multiple discussions about this, and even a working group.

GvR has given static types in Python the cold shoulder, but that attitude is clearly not shared by everyone in the community.

Glad to have been wrong in this case.


There is also the mypy project (http://mypy-lang.org/). They are working on a type-checked variant of Python. Currently it's just a pre-processing stage, but they eventually want to make it possible to AOT-compile using the type information to get a faster language.
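
For example, with Python 3 annotations a checker in the mypy vein can flag type errors before the code ever runs; a minimal sketch (not mypy's exact output):

    def average(values: list, count: int) -> float:
        return sum(values) / count

    average([1, 2, 3], "three")  # flagged statically: "three" is not an int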


It was originally part of the author's PhD work; it hasn't been touched since he graduated.


I wrote a small XML parsing library for a firewall API recently. Just writing about 15 functions handing data off to each other made me nervous about not enforcing types. The best I could do was write function signatures and descriptions.


Types are not the only way to ensure correctness (and they aren't even the only thing you should use, even when they're available) - did you consider using tests?


Tests are not the only way to ensure correctness (and they aren’t even the only thing you should use, even when they’re mandated by policy) – did you consider using strong types encoding exactly the assumptions you spot-check with tests?


> did you consider using strong types encoding exactly the assumptions you spot-check with tests?

Good luck encoding this simple contract (in Racket) exactly in your type system:

    (define/contract (fun numbers)
      (-> (listof (and/c (integer-in 0 255)
                         even?))
          any)
      ;; do something with a list of even integers in a given range
      null)

I hope you like your Zero and Succ n and are proficient with Agda or ATS or something like that.


I have no idea what you just said. Given that we're talking about type checking in python, care to give an example of a weakness of type checking in python?


We're even, because I don't know what "weakness" you're talking about, or why the heck you're bringing Python into an abstract discussion of static typing.

Basically, my parent said that static typing can express many things that testing (which is not static, because it runs the code) can. This is true, but there are things you can't express in any of the commonly used[1] static type systems, and I just gave an example of such a thing.

Other than this there's nothing wrong with static typing in general and I'd be happy if Python got some kind of (optional) static type system, maybe in the vein of Typed Clojure and Typed Racket or Erlang's Dialyzer. Still, static type systems are not silver bullets and have their drawbacks - and that's all I wanted to convey in my comment.

Edit: [1] the type system feature which would let you encode the example contract I posted is called "dependent typing" and is available only in a very few languages.


Pycontracts does that.

Racket is more powerful, but that is a bad example.
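
Something in this vein, perhaps (hedged: PyContracts syntax reproduced from memory, not verified against its docs):

    from contracts import contract, new_contract

    # Register a custom 'even' contract; the callable returns True/False
    # for a candidate value.
    new_contract('even', lambda x: x % 2 == 0)

    @contract(numbers='list(int,>=0,<=255,even)')
    def fun(numbers):
        # do something with a list of even integers in a given range
        return None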


This is an example of a thing you can't encode in (most) static type systems, not something that can't be done in Python! On the contrary, it is very much possible in Python, thanks to its dynamicity...

I really don't understand what's going on in this thread; I'm stating the obvious fact that you can't encode things like "is even" in a static type system without dependent types, and that this is a drawback of such systems, and I get downvoted. I never complained about my downvotes before, but this time I seriously don't understand what's happening. Did I fail that miserably in my effort to explain my position, or is this position controversial?

For the record, I replied to this:

> [dynamic in nature] Tests are not the only way to ensure correctness (and they aren't even the only thing you should use, even when they’re mandated by policy) – did you consider using strong [static?] types encoding exactly the assumptions you spot-check with tests?

With an example of a dynamic test (a contract) which you'd have a problem expressing in any but the most advanced type systems.

Did claudius mean strong but dynamic typing, and I misunderstood? But then there is no difference at all between ensuring something in unit tests and in a type system, as both are executed at runtime.

I honestly have no idea and I really would like to know which part of what I wrote deserved downvotes to avoid writing such things in the future if for no other reason.


There are some people around here who think a downvote is the appropriate answer to a post they disagree with, which is very unfortunate.

Looks like I misunderstood you; that's why my reply does not make much sense. But it is possible to verify that kind of constraint statically if you accept a certain number of false errors, which all static type systems generate to varying degrees.

Your example and the equivalent pycontract are static types that are missing a static-time verifier, not dynamic ones (pycontracts can only create static types AFAIK).


Yes, hate them. Gimme types.


You hate tests, therefore want static-typing. You do realize that static-typing doesn't mean error-free code, right? Even with static-typing, it's probably a good idea to have tests.


>You hate tests, therefore want static-typing. You do realize that static-typing doesn't mean error-free code, right?

As much as you realize that having tests also doesn't mean error-free code.

>Even with static-typing, it's probably a good idea to have tests.

For much less stuff. With a proper compiler, half of the kinds of tests people write in Ruby land are totally needless. You can refactor, and you know from the first recompile what has broken and where.


There are a lot of ideas out there on how to do optional runtime type checking in a Pythonic way. My contribution is ensure (https://github.com/kislyuk/ensure), for example check out the ensure_annotations decorator (https://ensure.readthedocs.org/en/latest/#ensure.ensure_anno...).
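
For instance (adapted from memory from the ensure docs):

    from ensure import ensure_annotations

    @ensure_annotations
    def f(x: int, y: float) -> float:
        return x + y

    f(1, 2.3)  # ok
    f(1, 2)    # raises at call time: y is an int, not the annotated float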


I kinda wish I could keep all the type definitions in a separate file, myself. Like a map of the project.

That would probably make it easier to decouple from "normal" Python code too.

I'm not a very senior software dev or anything, but I imagine myself liking a well-written, compact .py file detailing the logic, and a separate 'doc'(?) file containing long-winded explanations of what some key functions do and listing all the type consistencies I absolutely wish to guarantee.


>I kinda wish I could keep all the type definitions in a separate file, myself. Like a map of the project.

In fact, this is exactly what OCaml does (or can do)[0]. The interface file is not always required, but people usually include it for clarity, to make sure the compiler's type inference works according to their wishes, or to limit visibility of certain types and functions.

0. http://stackoverflow.com/a/3268836


Have you used Sphinx for documenting your project? (You can also use RtD -- for example, https://ensure.readthedocs.org/ has the docs for the project linked above.) Sphinx makes it fairly straightforward to inline documentation and code while maintaining highly readable docs. Also, you can collapse and otherwise manage docstrings in IDEs like PyCharm.

So, if I were to implement what you said, I'd go with a file containing interface classes, i.e. classes that have getters/setters and type checking built in and documented, and another file defining logic operating on them.

Another pattern is to use something like:

    class Widget(object):
        pass  # ... define key constraints ...

    """
    Long-winded docstring about do_stuff
    """
    from .impl import do_stuff
So all the implementation complexity of do_stuff is hidden in .impl (which also avoids stray exports).


> (given the rapid rate at which server apps are starting to migrate from Python to Go).

Do you have data for this claim or is it a hunch? I've seen this meme repeated a lot in HN.


Thank you for not asking in a rude and confrontational manner.

No, I don't have any solid data, but I do think the majority of people posting about Go on HN have either Python or Ruby backgrounds. I've also found a lot of Python people in the Rust community (which I personally vastly prefer to Go).

People need more performance, particularly multicore performance. Traditional Python supporters can put their head in the sand about this if they want, but these highly-performant new languages have clearly found a niche among Python folks tired of trying to optimize all the time.

And yes, there's a fine line between conducting needed optimizations and wasting time prematurely optimizing, but people would clearly rather spend a little more time up front in exchange for a big speedup. The choice is no longer between C and Python -- there's a nice middle ground.


> No, I don't have any solid data, but I do think the majority of people posting about Go on HN have either Python or Ruby backgrounds.

Right. I find it odd because I don't get why Go is a supposed replacement for Python. Does Go have a framework like Django, a good SQL API/ORM, numerical computing packages, scientific packages, machine learning? This is where I see people using Python the most.

> And yes, there's a fine line between conducting needed optimizations and wasting time prematurely optimizing, but people would clearly rather spend a little more time up front in exchange for a big speedup

Yeah, fast by default is not only acceptable but desirable; I don't think using a modern compiled language counts as premature optimization. It doesn't look like using Go is a lot more complex or time-consuming than Python. You lose some flexibility but gain other things (e.g., being able to distribute binaries).


Go is vaguely Pythonic (extensive stdlib, and it generally adheres to the Zen of Python). Even if there's not much in terms of frameworks, it's great for replacing small parts of a server that have to handle a lot of load. A lot of people have existing Python servers that grew into performance hotspots. It fits into the same sort of niche that node.js does.

Personally, I'm not keen on it -- I find that roughly two thirds of my code ends up being error handling -- and hope Rust eventually takes off.


> People need more performance, particularly multicore performance. Traditional Python supporters can put their head in the sand about this if they want, but these highly-performant new languages have clearly found a niche among Python folks tired of trying to optimize all the time.

I agree, the status-quo isn't great at the moment if you want to use Python and optimally exploit all your cores. I actually cover that in a more recent presentation: https://speakerdeck.com/trent/parallelism-and-concurrency-wi...


It's a case of "keep repeating it and maybe it'll be true someday, err, I hope". It's something you often see when people make mental investments in certain paths/things/ideas. And then they have to keep repeating it, both to try to make it true and to shield themselves from reality. Sort of like node.js.


Wow, that's so full of wrong assertions and ad hominem I don't even know where to start.

First of all, I don't even particularly like Golang as a language, so you're incredibly wrong about my motivations.

Secondly, if you don't think Python-to-Golang adoption is on the rise among SV start-ups, you're living in a bubble. Look up HN Golang posts and count how many of them are from people with Python backgrounds. This may not represent the entire global Python programming community, but it's without doubt an important and influential segment of it.

Finally, I clearly struck some kind of nerve, since my comment is at the top of the page. You may not like my opinion, but it's evidently a common one.


In my sample size of one, I've been using python for new projects almost exclusively for a few years (still maintaining a bunch of other legacy code) - but now in cases where Go is suitable (things where I don't need Python's enormous and well-versioned third party library ecosystem) I've been switching to it. In general, it feels like Python and C had a baby and it inherited the best of both, so I can see myself pushing for it a lot more in future.


Personally I don't see why someone would switch from Python to Go. Go feels like a major step down from Python, in everything except for speed and parallelism.


Well then the reason for someone switching would probably be to gain speed and parallelism. :) I also think static typing is a big reason, for large projects at least.


That's not really heresy, considering the existence of TypeScript. You're probably right about it needing sponsorship from a big company. Adding type checking to a language is difficult enough, but the tooling around it is what really makes it powerful.


Yeah, TypeScript is great. But by heresy I meant that the concept is pretty much rejected among many (most?) core Python developers, not that it's not feasible.

And I agree that the tooling is part of what makes static types so powerful, but I actually think the tooling might emerge organically if there's a well-designed, effective, and open-source static type checker available.


>I'm really glad to see some of the Python committers taking a serious look at the GIL. Python is either poised for great victory (given its rapid rate of adoption in academia) or slow failure (given the rapid rate at which server apps are starting to migrate from Python to Go)

Or neither, given that it's okay to have specialized tools for different jobs. Speed/networking was never supposed to be Python's motto.


If you like static typing and dynamic features, take a look at C#; it has features inspired by dynamic and functional languages, plus excellent support for parallelism and concurrency. Gone are the times when C# was only an object-oriented Java alternative.


Is there some mechanism to selectively escape the protection of each thread's memory? If not, how do I communicate between threads if my application has some kind of shared state? Do I have to pass messages to the main thread each time, for them to be forwarded to another parallel thread? Does that create a kind of BSP-style point where all the parallel threads stop and the main thread synchronises between them? How do you stop this from becoming a bottleneck, and how do you manage load if all parallel threads have to stop to allow the main thread to run?


So, I have actually spent a lot of time thinking about these exact problems, but it would have been too overwhelming to try to include that information in this initial deck.

I'm not even sure if I can adequately summarize it here :-)

(I'm planning on covering this stuff in a subsequent presentation.)

I played around with a few approaches to the things you're asking about. I had good results with specialized interlocked container-type classes (e.g. xlist(), a simplified list-type object) that parallel threads could use to persist simple scalar objects (str, int, bytes).

I think there's definitely room for sharing techniques that exploit the fact that threads inherently share address space -- I don't want to say that shared-nothing is the only paradigm supported and all communication must be done by message passing (like Rust?), because that's not the best solution for all problems.

As for the main-thread/parallel-thread pause/run relationship... it won't be as black and white as I allude to in the deck -- the main thread will still be running, albeit with a limited memory view (i.e. you'll be restricted in what you can do in the main thread whilst parallel threads are running).

Ideally, the only time all the parallel threads get paused is when global state needs to be updated by the main thread. Constantly pausing the parallel threads just because the main thread needs to do some periodic work won't be ideal.

The application you're referring to... is it something that exists, or are you just using hypothetical examples? I'm always curious to hear of architectures where threads need to constantly talk to each other in order for work to get done -- does your app fit this bill?


I'm just talking theory, but there is a whole class of applications that are very difficult to express without any shared state, or with state shared only by relatively expensive message passing.

The Lonestar benchmark suite from Texas includes good examples http://iss.ices.utexas.edu/?p=projects/galois/lonestar.


Thanks for the pointer to Lonestar, I'll review. (PDF link to the relevant paper: http://iss.ices.utexas.edu/Publications/Papers/ispass2009.pd...)

Re: target problems... the catalyst behind PyParallel can ultimately be tied back to the discussions on python-ideas@ in Sept/Oct 2012 that led to Python 3.4's asyncio.

I wanted to show that, hey, there's a different way you can approach async I/O that, when paired with better kernel I/O primitives, actually allows you to exploit parallelism too (i.e. use all my cores).

I'm particularly interested in problems that are both I/O-bound (or I/O-driven) and compute-heavy, which is common in the enterprise. The parallel aspect of PyParallel is an area I'm still fleshing out (I wanted to get the async stuff working first, and I'm happy with the results). I definitely want to spend the next sprint focusing on using PyParallel for parallel computation problems, where you typically go from sequential execution, fan out to parallel compute, then fan back in to sequential execution. This is common with aggregation-oriented "parallel data" problems.

I'm definitely less familiar with problems that inherently require a lot of cross-talk between threads, like the agglomerative clustering referred to in that paper linked above.

Now, all that being said, I did have some good results simply wrapping Windows' synchronization primitives (http://msdn.microsoft.com/en-us/library/windows/desktop/ms68...) and exposing via Python: http://hg.python.org/sandbox/trent/file/0e70a0caa1c0/Lib/asy...

Things like `async.signal_and_wait(object1, object2)` are actually pretty darn useful. Again, it's thanks to the vibrant set of synchronization primitives provided by Windows.
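
A hedged sketch of the kind of thing that enables -- only signal_and_wait() itself comes from the wrappers above (it mirrors Win32's SignalObjectAndWait); the event constructor shown is an assumption:

    import async

    ready = async.event()  # assumed wrapper over a Win32 event object
    done = async.event()

    # Atomically signal 'ready' and block until 'done' is signalled --
    # no window for a missed wakeup between the two operations.
    async.signal_and_wait(ready, done)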


Rust doesn't require all communication to be done by message passing.

One of the big reasons it's effective in Servo is that it can prove that multiple threads working on a shared, mutable data structure are operating in a safe manner.


There is no memory protection - all threads of a process share the same address space, the address space of the process.


As chrisseaton points out, you very much can change the permissions on a process's memory at the page level. The mprotect system call lets you change the protection on regions of memory, and you can do fancier things by unmapping memory. A friend of mine from grad school did some really elaborate memory-protection tricks to implement runtime systems for speculation ("Exploiting Coarse-Grain Speculative Parallelism", http://www.haripyla.com/wordpress/wp-content/uploads/2012/11...) and deadlock elimination (see chapters 2 and 3 of his dissertation, "Safe Concurrent Programming and Execution", http://vtechworks.lib.vt.edu/bitstream/handle/10919/19276/Py...).


Quickly scanned both those papers -- they look very interesting, I look forward to reading both in detail.


The source code is available on his website: http://www.haripyla.com/research/

There's also a journal article under submission which is a more consumable version of his dissertation work: http://www.haripyla.com/wordpress/wp-content/uploads/2012/11...


There are seven slides titled "Memory Protection" about how he changes the permissions of memory pages used by different threads. If each thread has its own heap and can't write to the global heap, how do you communicate between threads? Through the main thread? Doesn't that create a bottleneck?


So writing to the main thread from a parallel context crashes the thread, but what about interrupting a parallel thread from the main thread and sending data to it? Not that one should do something like that, but how does PyParallel handle it? Also, what would PyParallel look like if it had to support all these *NIX systems? As the OP pointed out, all we do now for "async IO" on these *NIX systems is basically polling the crap out of the OS.


> So writing to the main thread from a parallel context crashes the thread

Well... er, that's a bit of a vague sentence :-)

> but what about interrupting a parallel thread from the main thread and sending data to it?

That is the most UNIX-ey, signal-ly thing I've ever heard :-) That's sort of what's wrong with signals on UNIX: a paradigm that's useful at the process level when you have one thread of execution, but falls apart in a multithreading world.

The correct approach is to use a mechanism like IOCP: https://speakerdeck.com/trent/pyparallel-how-we-removed-the-...

That slide depicts how the kernel pushes completion packets onto an I/O completion port, such that they can be processed by waiting threads, but that's just one example. Anything can push a completion packet to an IOCP -- in fact, that's exactly how you'd get your parallel worker threads to gracefully shut down: have the main thread enqueue a "shutdown please" completion packet (via PostQueuedCompletionStatus()), which you'd detect in your parallel threads.
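
In rough ctypes form, that shutdown pattern looks something like this (the Win32 calls are real; the scaffolding -- port handle, sentinel key, worker count -- is illustrative):

    import ctypes
    from ctypes import wintypes

    kernel32 = ctypes.windll.kernel32
    INFINITE = 0xFFFFFFFF
    SHUTDOWN_KEY = 0xBEEF  # sentinel completion key workers check for

    # Main thread: enqueue one "shutdown please" packet per worker.
    for _ in range(num_workers):  # num_workers/iocp defined elsewhere
        kernel32.PostQueuedCompletionStatus(iocp, 0, SHUTDOWN_KEY, None)

    # Worker loop: process completion packets until the sentinel arrives.
    nbytes = wintypes.DWORD()
    key = ctypes.c_size_t()
    ov = ctypes.c_void_p()
    while kernel32.GetQueuedCompletionStatus(
            iocp, ctypes.byref(nbytes), ctypes.byref(key),
            ctypes.byref(ov), INFINITE):
        if key.value == SHUTDOWN_KEY:
            break
        # ... handle the completed I/O described by 'ov' ...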

> Also, what would PyParallel look like if it had to support all these *NIX systems?

The GIL-sidestepping techniques via Py_PXCTX are conceptually platform independent -- they'll work fine on any POSIX platform (although things like interlocked lists that I get for free on Windows and OS X will need to be re-implemented on other platforms).

The intrinsic pairing between asynchronous I/O and parallelism (that is, automatically and efficiently handling work in an I/O-driven system on all hardware cores) needs kernel-level support: https://speakerdeck.com/trent/parallelism-and-concurrency-wi...

Linux and BSD are the odd ones out here. AIX copied the IOCP API verbatim soon after NT 4.0 came out (I suspect they recognized a good thing when they saw it!), Solaris implemented something very close with event ports (except for the kernel being cognizant of event port concurrency, which is a key piece), and OS X got Grand Central Dispatch, which is a wildly different API (and a bit nicer, to be honest), but semantically equivalent to IOCP+threadpools on Windows.

There are two outcomes re: Linux/BSD/POSIX support.

1. They implement the same kernel-level primitives for async I/O and synchronization supported by Windows/OSX, which PyParallel would be able to use directly.

2. They don't, so the PyParallel-backend is implemented via existing primitives (epoll/kqueue etc).

I'm of the opinion that the Windows kernel-level primitives are fundamentally superior at the architectural level compared to the existing facilities provided by Linux/BSD/POSIX, evidenced by: a) better performance on identical hardware, b) less code required to achieve the desired effect, and c) much cleaner code versus the alternative.


Could someone more knowledgeable please comment on whether this is Windows-only? And if so, could the mods append (Windows) to the submission title?


There are two components to PyParallel: the alternate approach to async I/O afforded by Windows and IOCP, and the changes to CPython that facilitate multiple interpreter threads running simultaneously in parallel.

The latter is, at a conceptual level, not limited to Windows.

The proof-of-concept implementation that pairs the two concepts is Windows-only, at the moment, because Windows simply has better out-of-the-box scaffolding for this sort of stuff: https://speakerdeck.com/trent/parallelizing-the-python-inter...

As for non-Windows implementations, there are other operating systems out there that provide similar primitives: AIX copied the IOCP API from Windows verbatim, Solaris has event ports, and OS X got GCD.

I'd love to see the Linux kernel provide semantically equivalent primitives. You simply can't achieve the same effect without kernel-level thread-dispatching support tied into the mix (https://speakerdeck.com/trent/parallelism-and-concurrency-wi...).


Thank you for explaining that! So, theoretically, OS X could use this paradigm, assuming PyParallel is (re?)written to make use of GCD? How likely do you feel that is to happen, and sooner or later?


Yup, OS X wouldn't be that hard to port to at all, as all the primitives I need are already provided by the OS. Vastly different APIs, so a bit of code is still needed, but it's definitely viable.

I'd love to see that happen before the year is out :-)


Very intriguing, thanks for the excellent slides.

Is it possible to know the thread ID efficiently on Linux or other *NIX systems, in a fashion similar to the one presented for Windows in the slides? I have no idea how much calling pthread_self() every time impacts performance, but it would be better if a faster method were available.


pthread_self() pretty much returns a thread-local variable. Short of inlining the function to one mov instruction, I don't think it's possible to make it faster.

(Source: http://koala.cs.pub.ro/lxr/glibc/nptl/pthread_self.c and http://koala.cs.pub.ro/lxr/glibc/nptl/sysdeps/x86_64/tls.h#L...)
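
The Python-level analog is just as cheap, for what it's worth: threading.get_ident() bottoms out in pthread_self() (GetCurrentThreadId() on Windows), so it's effectively one thread-local read per call:

    import threading

    tid = threading.get_ident()  # opaque per-thread ID, cheap to fetch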


Ah, I just reviewed the slides... it looks like I don't actually mention __readfsbase(); I thought I did. Basically, on amd64, you can just read the base address of one of the FS/GS segment registers, and that'll be guaranteed to be unique amongst all threads in a given process.

(Funnily enough that intrinsic doesn't (or didn't at the time) appear to be exposed by MSVC, so I just stuck with __readfsdword(0x48) on Windows.)


Speculating here, but all that method needs to do is access some thread-record struct. The big question is whether it does a context switch to access this struct in memory every time. Maybe the struct is in some read-only memory and can be read from user space, but not written?


On i386 and amd64, on essentially all operating systems, there is some kind of thread-description structure that is accessible through one of the segment registers. It contains the low-level thread context, some way to get the thread ID, and user-accessible space for thread-local variables (or a pointer to such space). Getting some kind of thread ID is thus only about one "mov dest, [?S: some offset]", which is what both pthread_self() on Linux and GetCurrentThreadId() on Windows do. (Note that Linux stores a pointer to the pthread_t there, while Windows stores an opaque number that you have to convert to a handle yourself if you need one. The point is that in neither case does the function do any kind of involved computation, as would be the case if it had to traverse a list of currently running threads.)

This mechanism is essentially the only reason why amd64 in long mode still has limited support for segment registers (complete enough to implement this, but not much more).


Congratulations to @trentnelson on joining Continuum.

Does anyone (or Trent) know how this appointment will support or hinder further development of PyParallel? I assume that Continuum could accelerate this development or direct Trent's efforts to other areas.

Either way, I'm sure there will be benefits for the Python ecosystem.


Thanks darkseas :-)

It's been a busy time since I first presented PyParallel to the core Python developers at PyCon last year -- incidentally, that's also when I met Peter and Travis and joined Continuum.

I've since relocated from East Lansing to NYC, via visas and trips to Australia and whatnot, and have been very busy with client consultancy here in NYC since arriving officially around July/August last year.

Peter (President) and Travis (CEO) are very supportive of PyParallel, and it's actually an incredibly good fit within Continuum's existing ecosystem. I'm primarily engaged with consultancy at the moment, but we're looking at having me spend more time on PyParallel development very soon. Watch this space!


A perfectly written parallel Python app without the GIL would ideally be 8 times faster on an 8-core system, compared to stock Python.

Rewriting the original program in Java would make the program about 100 times faster... on one core.

Now let's see: do we deal with a hacked Python, with complex, error-prone thread synchronization code in Python, for an 8x gain, or grab the 100x gain in a single thread in Java?

While I don't mind Python getting faster, threading should be the last thing to try, once everything else for making Python run fast in one thread has been exhausted.
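
The arithmetic, spelled out with the numbers above taken at face value:

    cores = 8
    parallel_python = cores  # best case: perfect scaling, zero sync cost
    java_single_core = 100   # the single-core figure quoted above
    print(java_single_core / parallel_python)  # 12.5 -- Java still ahead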


Give me NumPy, SciPy, Matplotlib, scikit-learn, and IPython in Java and we are talking.


I think you missed his whole point. Whoosh.

His point was about using Java instead of Python. It was that, at best, a perfectly parallel CPython can yield an N-times more performant program (where N = number of cores). And that's for a perfect parallel CPython, and only when there's very little sync/sharing overhead and the program is fully parallelizable. So, even on an 8-core machine, parallelizing Python can yield at best c*8 better performance, where c < 1.

Now, what he says is that there are languages that run stuff 10 and 100 times faster than CPython on a SINGLE core. So maybe start there (improving single-core CPython performance), which has much more room to speed up our code, even if it's not parallel.


I think you missed his whole point. Whoosh.

His point was about using the language with the most productive / fastest libraries. Single-threaded Python vs 8-threaded Python vs Java makes no difference if the majority of your processing time is spent inside a highly optimised native library.


I think you missed my whole point. Whoosh.

I (and the OP) didn't tell anybody to use Java instead of Python. What we said is that it would be better for performance to optimize CPython's core speed instead of its multi-core capabilities.

So the remark about Python having more libs than Java is beside the point, since Java wasn't mentioned as a migration option, but as an example of how far single-core performance can be taken, and as advice to try to get some of that into Python.

Oh, and "but it still doesn't matter because Python has fast native libs" is not an argument either, because if that was enough people wouldn't care about parallelizing Python to get more speed -- which was the whole topic that started this thread.


I think you missed my whole point. Whoosh. (PS. This is silly :P)

> if [native code] was enough people wouldn't care about parallelizing Python

They would, for the same reason that they care about parallelizing Java -- once you're hitting the limits of single-thread speed (whether it's by being fast yourself or by having fast extensions), multithreading is next on the list.


Coldtea got my point pretty well. Everyone else gets a whoosh.

I'm not saying abandon Python for Java. I'm saying that if you're going to optimize Python, first optimize single-thread Python performance before getting to parallel processing, because extracting performance out of parallelism is hard, while languages like Java (and JavaScript recently) show there's plenty more to be gained yet from that single thread.


A whoosh for you too :)

> there's plenty more to be gained yet from that single thread.

Except that there isn't, if your single thread is spending most of its time in highly optimised native code extensions already :P


You're right -- even if you're running efficiently on all cores, it's still Python. I actually make this exact point in a separate presentation that I did a few weeks ago: https://speakerdeck.com/trent/parallelism-and-concurrency-wi...

Numba is one option -- rather than executing the CPython innards, have it switched out for an LLVM-optimized kernel instead.


<irony>Why not implement everything in assembly language, use all the parallelizing tricks that e.g. the Intel architecture has to offer, and get a speed-up of (in some selected cases) 10,000 times?

Of course, development time will also go up 10,000 times!</irony>



