

Tornado Without a GIL on PyPy STM - fijal
http://morepypy.blogspot.com/2014/11/tornado-without-gil-on-pypy-stm.html

======
haberman
I recently read this paper about Julia
([http://arxiv.org/abs/1411.1607](http://arxiv.org/abs/1411.1607)). Now when I
hear about PyPy I can't help but think about this quote from that paper:

    
    
        New users also want a quick explanation as to why
        Julia is fast, and whether somehow the same “magic
        dust” could also be sprinkled on their traditional
        scientific computing language. [...]  Julia is
        fast because we, the designers, developed it that
        way for us, the users. Performance is fragile,
        like accuracy, one arithmetic error can ruin an
        entire otherwise correct computation. We do not
        believe that a language can be designed for the
        human, and retrofitted for the computer. Rather a
        language must be designed from the start for the
        human and the computer.
    

To me, Python and Ruby are both perfect examples of languages that were
designed for the human, and ever since have seen extensive effort to retrofit
them for fast execution by computers.

I respect the work of the PyPy team, particularly given the raving reviews
I've seen lately of how RPython is a boon to language designers who can use it
to prototype their languages and get a decently-performing VM in not very much
time:
[http://tratt.net/laurie/blog/entries/fast_enough_vms_in_fast...](http://tratt.net/laurie/blog/entries/fast_enough_vms_in_fast_enough_time)

But I can't help but think that languages like Python and Ruby will start to
fall to languages like Swift, Julia, Go, etc. that were designed with
performance in mind. I'm not saying this will happen soon, but these languages
are showing that you can have your cake and eat it too.

I'm not sure how JavaScript and Lua fit into this analysis. They weren't
specifically designed for performance, but they have been very successfully
optimized. Lua is a very simple language and Mike Pall is a genius, so LuaJIT
has been very successful at speeding up Lua. JavaScript is a little more
complicated, but an immense amount of resources has gone into optimizing it,
and it has also gotten quite fast.

~~~
nickbauman
All languages other than pure machine code were designed for humans first.

~~~
_delirium
I don't think that's quite true. A lot of languages are a mix of designed-for-
humans and designed-for-machine. They do aim to be higher-level than machine
code, but it's quite common to have design decisions at least partly driven by
considerations from the compiler side as well. Not necessarily only designing
for efficient execution (though that is one); other design-for-machine
considerations can include ease of parsing and ease of compiler
implementation.

------
reduce
Am I reading this correctly? Looking at the benchmark code, why not do a
comparison against the default Tornado setup, which is to fork one process per
core? So STM Tornado is allowed to use multiple cores in this benchmark, but
vanilla Tornado is not?

    
    
            http_server = HTTPServer(Application(),
                                     xheaders=True,
                                     )
            http_server.bind(port)
            http_server.start(0) # Forks one sub-process per core
    
    

[1] [https://bitbucket.org/kostialopuhin/tornado-stm-bench/src/65...](https://bitbucket.org/kostialopuhin/tornado-stm-bench/src/65144cda7a1f8e56db6d3d9057d7d39a1652e622/utils.py?at=default)

~~~
fijal
The problem with multiple processes is the "share nothing" model. It works for
some problems, but it blatantly fails for a whole variety of other problems.
STM tries to address those problems where "share nothing" does not work, e.g.
because there is interesting data to be shared (albeit with few conflicts) or
the memory overhead of N processes is just too much.

------
nickbauman
I have personally found A* particularly difficult to scale across cores
because it's never a shared-nothing problem, for two reasons. One, each core
has to know about the other cores' search space and avoid it (to avoid
duplicating effort), so you will contend on some sort of 'visited node' cache.
Two, the graph you're building must itself be shared, obvs.

Are most of the interesting problems to solve always limited by Amdahl's Law?
Will we never see the gains of single-core speed we saw in the last century
again?
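The contention described above can be sketched in Python (the graph, node names, and coarse lock are all made up for illustration; this shows the bottleneck STM aims to remove, not pypy-stm's actual API):

```python
import threading
from collections import deque

# a toy graph; every worker must consult one shared "visited" set
graph = {0: [1, 2], 1: [3], 2: [3, 4], 3: [5], 4: [5], 5: []}
visited = set()
visited_lock = threading.Lock()

def search(start):
    frontier = deque([start])
    while frontier:
        node = frontier.popleft()
        with visited_lock:      # every single expansion hits this lock
            if node in visited:
                continue        # another worker already expanded it
            visited.add(node)
        frontier.extend(graph[node])

workers = [threading.Thread(target=search, args=(s,)) for s in (0, 2)]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(sorted(visited))   # -> [0, 1, 2, 3, 4, 5]
```

Each node is expanded exactly once, but only because every expansion serializes on the lock -- which is exactly the kind of low-conflict shared access STM is meant to make cheap.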

~~~
desdiv
_One, each core has to know about the other cores' search space and avoid it
(to avoid duplicating effort), so you will contend on some sort of 'visited
node' cache._

Just make the visited node cache public and immutable.

_Two, the graph you're building must itself be shared, obvs._

If the graph is immutable then there's zero problem with it being shared.

~~~
jerven
You can't make the cache immutable, because then it will be empty at the start
and stay that way ;)

The cache has to mutate and be shared as that's the work completed list. As
each thread completes a bit of work (visits a node) it needs to communicate it
with the other threads.

~~~
desdiv
In Scala, it's:

    
    
        var cache = Seq(1,2,3)   // an immutable Seq held in a mutable var
        cache :+= 4              // rebinds `cache` to a new Seq(1,2,3,4)
    

The cache is immutable and freely shareable. Any other thread can come in and
read it and be guaranteed that its current state is valid.

~~~
jerven
":+=" returns a copy of this sequence with the element appended.

Meaning the cache is no longer shared, as all threads end up having a
thread-local cache. You can't update shared mutable state and have
immutability at the same time.

Immutability is great when you can have it, but sometimes it's not possible.
Shared mutable state is something to be avoided as much as possible, but
sometimes we need it.
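The point can be made concrete in Python, mirroring the Scala snippet above (tuples play the role of Scala's immutable Seq; the "threads" are simulated by holding two references):

```python
# "Appending" to an immutable sequence produces a brand-new object,
# so a thread that grabbed the old reference never sees the update.
cache = (1, 2, 3)

snapshot = cache        # another thread reads the shared reference
cache = cache + (4,)    # the ":+=" equivalent: rebind to a fresh tuple

print(cache)     # -> (1, 2, 3, 4)
print(snapshot)  # -> (1, 2, 3)  the old reference is now stale
```

Unless every reader re-fetches the one shared reference (atomically), each thread's view of the "cache" silently diverges -- which is jerven's objection.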

------
error54
Note: if anyone was wondering what STM is, it stands for Software
Transactional Memory. The Read the Docs page gives a good overview[1] of what
pypy-stm is.

1-[http://pypy.readthedocs.org/en/latest/stm.html](http://pypy.readthedocs.org/en/latest/stm.html)

------
fijal
While we're here: please donate to PyPy STM, which is a purely crowdfunded
effort:
[http://pypy.org/tmdonate2.html](http://pypy.org/tmdonate2.html)

~~~
tshepang
Please create a PyPy team on Gratipay.

~~~
fijal
Here is mine - [https://gratipay.com/fijal/](https://gratipay.com/fijal/) -
not exactly killing it.

~~~
tshepang
Better that it be a team account, to avoid having donors try to decide "who
contributes how much"... let the team decide. Also, please offer it as an
alternative to PayPal on the project website.

~~~
fijal
We can't really be creating accounts everywhere for minor donations. I support
your sentiment, but the official PyPy bookkeeping has to be done in a proper
way via the Software Freedom Conservancy, so going through all those services
for a few $$$ is simply not worth it.

~~~
tshepang
Why doubt the amount that can be received from Gratipay? What would you need
to see to consider using it for PyPy funding? Would a promise of, say,
$500/week be enough to make it worth the bother?

------
exacube
This is some great progress!

But I should say that there is still too much overhead in using STM; you will
still be able to outperform STM-4 very easily (and by a large margin) by
running 4 instances of Tornado with HAProxy or some other lightweight router
on top. A comparison graph for that setup should have been the benchmark.
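For concreteness, the baseline being asked for might look something like the following HAProxy sketch (ports, names, and layout are made up; this is an illustrative fragment, not the benchmark's actual config):

```
# hypothetical haproxy.cfg: round-robin across four vanilla Tornado
# processes, one per core
frontend http_in
    bind *:8000
    default_backend tornado_pool

backend tornado_pool
    balance roundrobin
    server t1 127.0.0.1:8001
    server t2 127.0.0.1:8002
    server t3 127.0.0.1:8003
    server t4 127.0.0.1:8004
```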

------
zaphar
I like that they instrumented the STM code enough that you can debug
slowdowns when writing your code. Never underestimate the power of
well-instrumented languages.

------
pjmlp
Maybe one day PyPy will become the canonical implementation of Python.

~~~
Animats
Python's Little Tin God wouldn't like it.

Python's feature set is basically what's easy to do in a naive interpreter.
Everything is a dictionary. Anything can be changed from anywhere. With
"setattr", you can patch one thread's code from another. It's elegant, and
very difficult to speed up. Google tried, with van Rossum on board. Their
"Unladen Swallow" JIT compiler project crashed and burned.

The PyPy group has made a fast Python compiler/interpreter/JIT system. It's
really hard. The initial funding from the European Union got them started, but
wasn't enough. They really try to handle all the hard cases. This requires two
interpreters and a JIT. They have to handle "oh no, someone patched object A
from thread R, invalidating code that's running in thread S". There's a
"backup interpreter" that kicks in for hard cases, and once control is out of
the area in trouble, the JIT can recompile it. (This is an old and
oversimplified description.)

This transactional memory thing is very clever. It has to separate things at
run time that probably should have been separated at compile time, of course.
It's impressive that they can get it to work. It's a lot like how a
superscalar CPU works, including transaction commit and backup at the
retirement unit.

Python gets into this mess because, like C and C++, the language doesn't
really know about concurrency. (Threads came late to UNIX, and C predates
threads. So C has an excuse for backing into concurrency.) Python has the C
model of concurrency; at the user level, it's treated as a library issue.
Internally, though, it needs a lot of locking, because there's so much mutable
state in the interpreter.

It would be a lot easier if the language were restricted a little. But then It
Wouldn't Be Python(tm). The price of this is huge complexity layered on a
simple model, and probably years of obscure bugs in PyPy.

~~~
pjmlp
> C predates threads

There were other OSes exploring threads and co-routines in their designs
while UNIX was being developed at AT&T.

As for the rest I agree with you.

Personally I don't have any use for Python besides the occasional shell
script, but as a user of applications written in Python I would like them to
perform fast.

~~~
Animats
Yes, and UNIVAC 1108 Exec 8 had threads in 1967. But threads didn't come to
UNIX until the 1980s.

------
amelius
I'm wondering, how does PyPy compare to nodejs?

~~~
Fede_V
Like apples compare to oranges :)

Cheeky comment aside, PyPy is an alternative interpreter for Python and, more
broadly, a meta-framework that makes it easy to write a tracing JIT for the
language of your choice.

It's much faster than regular Python. The only downside (which, admittedly,
is very, very big) is that while the PyPy developers released their own FFI
library for interfacing with C code (which works brilliantly), using ctypes
and the CPython C API is very kludgy. This means that a lot of very important
Python libraries which are basically thin wrappers around C/Fortran code
(SciPy, NumPy, etc.) don't really work.

There is some effort around reimplementing NumPy in PyPy, but it's going very
slowly.

Edit: PyPy is also a heroic effort by a small team of developers who receive
some donations. JS is probably the language that has received the most money
and attention toward making it run fast, from Google, Mozilla, etc. In terms
of results per dollar, PyPy is incredible.

~~~
TillE
> but using ctypes and the CPython C API is very kludgy

It's easy enough to use CPython extension modules via cpyext, but it's really
slow.

------
bwlandstreet
Sometimes I forget the number of great contributions / projects that Bret
Taylor keystroked. It's a motivating reminder to build.

~~~
fijal
I think your comment is moderately off-topic. This is _really_ not about
Tornado at all; rather, it's about the fact that you can take a medium-sized
program and get STM in PyPy not to crash.

