
Solving multi-core Python
https://lwn.net/Articles/650521/
======
trentnelson
I attempted to describe the progress I've made with PyParallel already here:
[https://mail.python.org/pipermail/python-
ideas/2015-June/034...](https://mail.python.org/pipermail/python-
ideas/2015-June/034342.html)

And here: [https://mail.python.org/pipermail/python-
ideas/2015-June/034...](https://mail.python.org/pipermail/python-
ideas/2015-June/034260.html)

I'm getting excellent scaling and performance across the board for the simple
TEFB (TechEmpower Framework Benchmarks) tests
([https://bitbucket.org/tpn/pyparallel/src/8528b11ba51003a9821...](https://bitbucket.org/tpn/pyparallel/src/8528b11ba51003a9821ceb75683ee96ed33db28a/examples/tefb/tefb.py?at=3.3-px)),
and have implemented something that really shows where PyParallel shines: an
instantaneous wiki search REST API:
[https://bitbucket.org/tpn/pyparallel/src/8528b11ba51003a9821...](https://bitbucket.org/tpn/pyparallel/src/8528b11ba51003a9821ceb75683ee96ed33db28a/examples/wiki/wiki.py?at=3.3-px).

Quoting:

I particularly like the wiki example as it leverages a lot of benefits
afforded by PyParallel's approach to parallelism, concurrency and asynchronous
I/O:

    
    
        - Load a digital search trie (datrie.Trie) that contains every
          Wikipedia title and the byte-offset within the wiki.xml where
          the title was found.  (Once loaded the RSS of python.exe is about
          11GB; the trie itself has about 16 million items in it.)
    
        - Load a numpy array of sorted 64-bit integer offsets.  This allows
          us to do a searchsorted() (binary search) against a given offset
          in order to derive the next offset.
    
        - Once we have a way of getting two byte offsets, we can use ranged
          HTTP requests (and TransmitFile behind the scenes) to efficiently
          read random chunks of the file asynchronously.  (Windows has a
          huge advantage here -- there's simply no way to achieve similar
          functionality on POSIX in a non-blocking fashion (sendfile can
          block, a disk read() can block, a memory reference into a mmap'd
          file that isn't in memory will page fault, which will block).)
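The two-offset lookup described above can be sketched with the standard library's `bisect` module standing in for numpy's `searchsorted`; the offsets below are hypothetical stand-ins, not real wiki.xml values:

```python
import bisect

# Hypothetical sorted byte offsets of article starts within wiki.xml.
offsets = [0, 4096, 10240, 65536, 131072]

def article_range(start):
    """Given an article's start offset, binary-search for the next
    article's start to derive the (start, end) byte range to serve
    via a ranged HTTP request.  end is None when the article runs
    to the end of the file."""
    i = bisect.bisect_right(offsets, start)  # first offset > start
    end = offsets[i] if i < len(offsets) else None
    return start, end
```

Because the offsets are kept sorted, each lookup is O(log n) even with ~16 million entries.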

~~~
scott_s
Apparently, you should expect to hear from him soon:
[https://mail.python.org/pipermail/python-
ideas/2015-June/034...](https://mail.python.org/pipermail/python-
ideas/2015-June/034275.html)

Antoine Pitrou said that Snow's proposal was similar to your work. What do you
see as the similarities?

~~~
trentnelson
I cover that in the other e-mail I quoted. The similarity is simply focusing
on solving the problem with a single process and threads versus multiple
processes via fork().

I don't agree with the subinterpreter approach. In fact, my opinion is that
the best way to solve the problem is to use the approach taken by PyParallel.

Granted, I would think that wouldn't I, being the author of PyParallel and all
;-)

------
ngoldbaum
The situation around multiple cores in Python (up to and including the link in
the OP) is very nicely summarized in these notes by Nick Coghlan:

[http://python-
notes.curiousefficiency.org/en/latest/python3/...](http://python-
notes.curiousefficiency.org/en/latest/python3/multicore_python.html)

~~~
dtxcoolbits
"...it’s very easy to envision a future where CPython is used for command line
utilities (which are generally single threaded and often so short running that
the PyPy JIT never gets a chance to warm up) and embedded systems, while PyPy
takes over the execution of long running scripts and applications, letting
them run substantially faster and span multiple cores without requiring any
modifications to the Python code"

IMHO this is the most sensible and logical approach, and it doesn't break any
existing code. It also seems like what will likely happen anyway. Out of
all the options, PyPy-STM seems like the best approach because:

"pypy-stm is fully compatible with a GIL-based PyPy; you can use it as a drop-
in replacement and multithreaded programs will run on multiple cores."

[http://pypy.readthedocs.org/en/latest/stm.html](http://pypy.readthedocs.org/en/latest/stm.html)
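"Drop-in" here means that an ordinary `threading` program, whose CPU-bound threads serialize under a GIL, is exactly the code pypy-stm aims to spread across cores unmodified. A minimal sketch of such a program (the workload and chunk sizes are illustrative):

```python
import threading

def count_primes(lo, hi, out, idx):
    """CPU-bound work: count primes in [lo, hi) by trial division."""
    n = 0
    for c in range(max(lo, 2), hi):
        if all(c % d for d in range(2, int(c ** 0.5) + 1)):
            n += 1
    out[idx] = n

# Four threads splitting the range [0, 10000) into equal chunks.
results = [0] * 4
threads = [threading.Thread(target=count_primes,
                            args=(i * 2500, (i + 1) * 2500, results, i))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
total = sum(results)  # 1229 primes below 10000
```

Under GIL-based CPython (or PyPy) the four threads run one at a time; the pypy-stm claim is that this same program, unchanged, would use four cores.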

------
maxerickson
In case it isn't obvious, this got lots of discussion on the mailing list it
was posted to:

[https://mail.python.org/pipermail/python-
ideas/2015-June/thr...](https://mail.python.org/pipermail/python-
ideas/2015-June/thread.html#34177)

------
zabbadabba
This proposal is needlessly complicated. All Python has to do is put all
interpreter state into a single struct to allow for completely independent
interpreter instances. Then the GIL can be discarded altogether. And finally
they have to change the C API to include a PythonInterpreter* as the first
parameter rather than relying on the thread-local data hack.

~~~
JoshTriplett
Interpreter state isn't the only problem; if that were the only issue, it'd be
a more tractable problem.

However, built on top of CPython's C API are a huge pile of libraries that
assume they can manipulate global or per-data-structure state without locking.

~~~
zabbadabba
Avoiding fixing the C API for the sake of backwards compatibility is the
reason for the GIL mess. Python 3 had a chance to correct the API situation
since they were starting anew, but it didn't happen. All these API
half-measures are just dancing around the real problem: not having truly
independent interpreter instances.

~~~
pekk
Python 3 wasn't really starting anew, they were also balancing the ongoing
screaming about every difference from Python 2.

~~~
zabbadabba
It is curious how they completely revamped Python, making most Python3 code
incompatible with Python2, and yet they wanted to preserve C API
"compatibility". Compatibility with what? Most Python3 modules had to be
rewritten anyway; it was the perfect opportunity to fix the C API and get rid
of the GIL once and for all.

~~~
ericsnow
you forgot your <hand-wavy gross exaggeration> tags

------
crdoconnor
>However, in CPython the GIL means that we don't have parallelism, except
through multiprocessing which requires trade-offs.

I can't speak for anybody else, but I personally haven't felt that
inconvenienced by these trade-offs at all.

The major one is needing to pickle objects before sending them back and forth
between processes. Since I prefer to keep the thread interfaces as tight as
possible anyway, this only ended up being a major problem once: when I was
trying to pickle a stack trace before sending it to the parent process.
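That failure mode is easy to reproduce: traceback objects refuse to pickle, so the usual fix is to format the traceback into a string on the child side before sending it. A self-contained sketch (the function name is illustrative):

```python
import pickle
import sys
import traceback

def failing_child():
    """Stand-in for work that blows up inside a child process."""
    try:
        1 / 0
    except ZeroDivisionError:
        tb = sys.exc_info()[2]
        # The raw traceback object can't cross a process boundary:
        try:
            pickle.dumps(tb)
            raw_picklable = True
        except TypeError:
            raw_picklable = False
        # ...but a formatted string can.
        text = traceback.format_exc()
        return raw_picklable, text

raw_picklable, text = failing_child()
# raw_picklable is False; text is an ordinary picklable str
```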

~~~
vosper
I was inconvenienced by multiprocessing just today: We were using it to spin
up a bunch of workers, and have spent some time debugging an issue with
SQLAlchemy, which requires special management when forking: "it’s usually
required that a separate Engine be used for each child process. This is
because the Engine maintains a reference to a connection pool that ultimately
references DBAPI connections - these tend to not be portable across process
boundaries." [1]

So you have to be careful when dealing with forking and databases, which has
led us to subclass multiprocessing.Process specifically for our (common) case
of wanting to continue to use the session object in the child process without
having to think about recycling the Engine object. It was that code we have
been trying to debug today, because in some specific cases we still run into
issues (yes, we read the docs). Also, when people (usually new engineers)
don't know about this, they blindly use the multiprocessing module
directly (and who can blame them) and end up spending some time debugging
intermittent connectivity issues until someone says "oh, you should use the
xyz module, it handles the DB stuff for you".

[1]
[http://docs.sqlalchemy.org/en/rel_0_9/core/connections.html#...](http://docs.sqlalchemy.org/en/rel_0_9/core/connections.html#basic-
usage)

~~~
crdoconnor
>it’s usually required that a separate Engine be used for each child process.

Yeah, provided you do this there should be no headache.

>So you have to be careful when dealing with forking and databases, which has
led us to subclass multiprocessing.Process specifically for our (common) case
of wanting to continue to use the session object in the child process without
having to think about recycling the Engine object.

What use case led to this being a non-negotiable requirement?

~~~
vosper
It's not non-negotiable, it's for safety and convenience and DRY: We want our
engineers to be able to use the database in their parallel processing jobs
without having to understand the internals of SQLAlchemy, Engines, and
forking.

------
mianos
If this lands, I'll move from 2.7 to 3.x

------
BuckRogers
What problem does this solve? It does not solve the multicore "problem". I'm
not a huge fan of this proposal, for a few reasons.

This doesn't really add much that wasn't possible before, and doesn't really
solve any technical or PR issue. The PR issue will never be resolved with
CPython; those people who don't understand it are free to write multithreaded
Java apps. But I think explicitly spinning up pthreads should be reserved for
writing systems software. I'm assuming subinterpreters means they'll be
allocated as pthreads, because this is meant to "use all your cores". Looking
forward, this proposal sounds great as long as we're on 4 or 8 cores max; at a
certain point it starts to look like a gimmicky ideal created to fit the
technology of 2015. The ultimate multicore and multinode solution at a
non-systems level, if we really need every single language to solve that
specific problem, would be Erlang's approach.

It's yet more technical churn rather than innovation in a language (Python3)
that was itself born of technical churn.

To offer something constructive as well: I think an ideal solution would be a
more implicit approach. Think gevent for pthreads or subinterpreters. That
would be a lot more work to figure out; this proposal looks more like a hack.
I don't think this type of "improvement" will draw people to Python3 either.
I wish they'd stop throwing crap at the wall to see what sticks, constantly
expanding the language, which is bad. The core dev team doesn't know what to
do, but usually in that case it's better to do nothing. Thus Python3 is
looking more and more like a playground or experimental branch as time goes
on. I'm increasingly thinking my "Python3 migration" will be to Swift or Go.

~~~
pekk
I couldn't detect any real proposal in anything you said except that Python
programmers should switch to Swift, and your main thesis seems to be that
Python 3 is bad because it changes.

~~~
sfk
Even by HN standards, your comment is one of the most peculiar summaries I've
read.

------
angry_octet
It seems to me that there is a lot of muddled explanation going on here, if
not muddled thinking.

The goal of fine-grained parallelism in Python (fgpp) must be to improve the
speed of code for which fat parallelism (user-concocted, error-prone 'multi-
threading' or message passing) doesn't gain much, and for which calling out to
an implementation in another language or a specialised library has overhead or
doesn't make sense.

IMHO, the most important part of fgpp must be that it is easily reasoned
about, both by compiler/runtime and programmers, so as to maximise scope for
transformations and user-driven design decisions (as opposed to VOODOO). It
would ideally avoid the costs of premature optimisation, or the often
inefficient flattening required for vectorisation.

So I think any proposal should address these questions, with appropriate
benchmarks too, before it can be considered seriously.

~~~
ericsnow
Thanks for the feedback.

------
Scramblejams
Here's what I want: Something with programmer productivity similar to Python,
but type-inferred for predictably good execution speed, with more support for
functional programming style, with green threads preemptively executing in
parallel and immutability baked deeply into the whole thing. And then a track
record of deployment so the ugly GC and type system edge cases are ironed out
and libraries are plentiful.

Elixir/Erlang will probably never be fast enough, Go didn't commit to
immutability and its functional programming support is minimal, Rust could
have all that but it's too low-level...

I guess OCaml is as close as it comes for the moment -- it's got everything I
listed except a reasonable parallel execution story, but it looks like that's
being worked on at the moment.

~~~
beagle3
I don't think there's anything quite like that, but Nim seems to be a better
fit than OCaml.

~~~
Scramblejams
Tried that. It's quite a big and complex language; there are lots of rough
edges (I had it crashing from a double free within five minutes of trying it
out); the bus factor is in the basement; and although I didn't stick with it
long enough to know for sure, there was much more mutability there than made
me comfortable during my short sojourn.

------
rurban
This looks like a new Python 4, but why not? It's a big problem that needs to
be resolved once and for all.

------
smegel
Wow, Python will become the new Perl, how cool is that?!

Not very.

~~~
coldtea
What does that even mean?!

Not much.

