
How fast can we make interpreted Python? (2013) - luu
http://arxiv.org/abs/1306.6047
======
piquadrat
Title should say 2013. These days, PyPy is way past 25% average speedup with
respect to CPython.

/edit to be fair, the project README points out the difference in goals and
performance between Falcon and PyPy quite nicely:
[https://github.com/rjpower/falcon/blob/master/README.md](https://github.com/rjpower/falcon/blob/master/README.md)

~~~
haberman
I thought the 25% number was Falcon vs. CPython, not PyPy vs. CPython. The
paper suggests (though does not say directly) that PyPy is faster than Falcon,
and that the only benefit of Falcon is compatibility with existing C
extensions.

~~~
Animats
That's about right. That's about where Unladen Swallow maxed out. To get
beyond that point, you have to do more global analysis and JIT compilation and
recompilation, as PyPy does.

Every Python symbol reference requires a dictionary lookup. Every function
call and operator requires that the type of the left hand side be examined for
dispatching purposes. More than 99% of the time, the symbol and type will be
the same as last time, and it's a huge win to assume that it will be and
compile code for that case. You still have to be prepared for the times when
it isn't, and have a backup system. That's basically what PyPy does.
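
The dictionary-lookup cost is easy to see even without a JIT; a minimal sketch (function names here are illustrative) showing the saving from assuming a name resolves the same way every time:

```python
import timeit

# Every global/builtin reference goes through dict lookups (module dict,
# then builtins). Binding the name to a local lets the interpreter use an
# indexed array slot instead -- the same class of saving a JIT gets by
# assuming "the symbol is the same as last time".

def use_global(data):
    total = 0
    for x in data:
        total += len(x)   # LOAD_GLOBAL: dict lookup on every iteration
    return total

def use_local(data, _len=len):
    total = 0
    for x in data:
        total += _len(x)  # LOAD_FAST: indexed local slot, no dict lookup
    return total

data = [[0] * 3] * 100_000
print(timeit.timeit(lambda: use_global(data), number=10))
print(timeit.timeit(lambda: use_local(data), number=10))
```

The local-binding version is typically measurably faster in CPython; a JIT makes the same assumption automatically and keeps a deoptimization path for when it breaks.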

More stuff is mutable in Python than really needs to be mutable.

~~~
rcarmo
That is true. I've been fooling around with Hy (hylang.org) for a fairly long
time because it's nicer than Clojure for doing glue, and that has made me
wonder how fast some things would go if Python had native immutable data
types.

------
epx
Quite old. Plain Python is still slow, but PyPy is in the league of Node.js.

I rewrote the same program (compare two images, generate a third that shows
the diff) in a number of languages. Considering CPython as the reference
implementation (1x), I got 100x in Rust, 60x in Go, 12x in Node.js and 10-11x
in PyPy.

Initially I got 4x with PyPy, but after a light refactor, removing some map()s
and zip()s that were gratuitous (3-element lists), PyPy went really fast.
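
A hypothetical before/after of the kind of refactor described (not the actual program): for fixed-size 3-element pixels, replacing map()/zip() with explicit indexing gives a tracing JIT a simple loop to specialize.

```python
# "Before": idiomatic but allocates iterators and closures per pixel.
def diff_pixels_before(a, b):
    return [list(map(lambda p, q: abs(p - q), pa, pb))
            for pa, pb in zip(a, b)]

# "After": explicit indexing over the known 3 channels; a straight-line
# loop body that PyPy can compile to tight machine code.
def diff_pixels_after(a, b):
    out = []
    for i in range(len(a)):
        pa, pb = a[i], b[i]
        out.append([abs(pa[0] - pb[0]),
                    abs(pa[1] - pb[1]),
                    abs(pa[2] - pb[2])])
    return out
```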

~~~
joshvm
You've picked a poor example, I think. And you're probably coding in an
inefficient way.

Here's the OpenCV way for a pair of 2048x2048 images:

    
    
        import cv2
        import time
    
        t = time.clock()
    
        a = cv2.imread("./image_1.tiff", cv2.IMREAD_GRAYSCALE)
        b = cv2.imread("./image_0.tiff", cv2.IMREAD_GRAYSCALE)
    
        c = a - b
    
        cv2.imwrite("out.tiff", c)
    
        print(time.clock() - t)
      

Takes about 0.2 CPU seconds on average for me (note using time.clock, not
time.time on UNIX).

    
    
        #include <opencv2/opencv.hpp>
        #include <ctime>
    
        using namespace cv;
        using namespace std;
    
        int main(void){
    
          clock_t begin = clock();
    
          Mat a = imread("./image_1.tiff", IMREAD_GRAYSCALE);
          Mat b = imread("./image_0.tiff", IMREAD_GRAYSCALE);
    
          Mat c = a - b;
    
          imwrite("out.tiff", c);
    
          clock_t end = clock();
          double elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;
          cout << elapsed_secs << endl;
        }
    

Again, about 0.2 seconds. The difference is negligible if you use the right
libraries. Python should not be your bottleneck for high performance code.

~~~
Someone
Now, try implementing that matrix subtraction, or something more complex such
as blurring or edge detection, in pure Python, and compare results.

Also, are you sure that doesn't measure disk speed?

~~~
joshvm
I get around 0.07 seconds without the write in C++ and the same in Python
(good call though).

I agree that in pure Python it'd be slower, but realistically why would you do
that? Unless you work somewhere where you're forced to write your own
libraries... but even then you could still implement them in C.

~~~
Someone
If you only removed the write and not the reads from the timing, I would guess
the reads (even with warm caches) still dominate the time.

And you would want to do it in pure Python if you want to answer the question
"how fast can we make interpreted Python?". Using C extensions for that is
cheating, as it isn't Python and it isn't interpreted. You don't answer the
question "how fast can you run?" with "30 km an hour, using a bicycle",
either.

If you make those images large enough (and I guess 2k x 2k is large enough),
any language that uses OpenCV to do the job will give results in the same
ballpark. For example, the difference between Python implementations that can
call OpenCV can be made as small as you want.
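
For concreteness, here is a sketch of what the pure-Python version of the subtraction step would look like (nested lists standing in for image buffers, clamped at zero like OpenCV's saturating uint8 arithmetic):

```python
def subtract_images(a, b):
    """Per-pixel difference of two equal-sized grayscale images stored
    as nested lists, clamped at 0 to mimic saturating uint8
    subtraction. Every pixel pays full interpreter dispatch cost."""
    return [[max(pa - pb, 0) for pa, pb in zip(ra, rb)]
            for ra, rb in zip(a, b)]
```

On a 2048x2048 image this loop executes millions of bytecode operations where the OpenCV call runs one vectorized C kernel, which is the gap the paper's interpreter work is trying to narrow.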

~~~
joshvm
In IPython, %timeit gives 4.3ms per loop, just on a-b. In C++ it's about the
same.

I agree that the question is valid - making vanilla Python faster is cool. My
point was that this particular example (image processing) was flawed, because
it's not something a sane person would ever do in pure Python.

~~~
Someone
_"In IPython, %timeit gives 4.3ms per loop, just on a-b. In C++ it's about
the same."_

Of course it is about the same. Except for function entry and function exit,
which should be a few thousand instructions at most, it runs the exact same
instruction sequence (if you are using identical versions, compiler, and
compiler flags).

If you want an easily measurable difference, use way smaller images and make a
few thousand or even a few million calls, or look at the Python sources to see
how efficiently it calls into C.

------
_ihaque
A stack->register JIT compiler for Python that maintains compatibility with
existing C extensions because it runs as an extension within CPython. On the
benchmarks in the paper it averages 25% faster, with speedups of up to 2.5x.

Source link:
[https://github.com/rjpower/falcon/](https://github.com/rjpower/falcon/)
(doesn't show up in the paper till the very end!)
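
A toy illustration of the stack->register idea (not Falcon's actual IR or code): each stack slot becomes a named virtual register, so `a + b` collapses from a sequence of push/pop operations into one three-address instruction.

```python
# Convert a tiny stack-machine program into register (three-address)
# form. Real converters also coalesce registers and handle control
# flow; this only shows the core slot-naming trick.

def stack_to_register(stack_code):
    regs, out, stack = 0, [], []
    for op, *args in stack_code:
        if op == "LOAD":
            r = f"r{regs}"; regs += 1
            out.append(("MOVE", r, args[0]))   # load variable into a register
            stack.append(r)
        elif op == "BINARY_ADD":
            b, a = stack.pop(), stack.pop()
            r = f"r{regs}"; regs += 1
            out.append(("ADD", r, a, b))       # one 3-address instruction
            stack.append(r)
        elif op == "RETURN":
            out.append(("RETURN", stack.pop()))
    return out

prog = [("LOAD", "a"), ("LOAD", "b"), ("BINARY_ADD",), ("RETURN",)]
for ins in stack_to_register(prog):
    print(ins)
```

The register form removes the stack traffic that a naive bytecode interpreter re-executes on every dispatch.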

------
antman
I have been using Cython, as presented here [0], with huge speedups. Prototype
in Python and then make a few changes to produce Cython code.

[0] [https://spacy.io/blog/writing-c-in-cython](https://spacy.io/blog/writing-c-in-cython)
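
A sketch of that workflow (the function and its name are illustrative): start from plain Python that already compiles under Cython unchanged, then add static types only where profiling says it matters.

```python
# Step 1: plain Python prototype. This exact code also compiles with
# Cython as-is, usually for a modest speedup.
def pairwise_dist_sq(xs, ys):
    total = 0.0
    for i in range(len(xs)):
        dx = xs[i] - ys[i]
        total += dx * dx
    return total

# Step 2 (in a .pyx file) would add static types so Cython emits a
# C-level loop, e.g.:
#     def pairwise_dist_sq(double[:] xs, double[:] ys):
#         cdef double total = 0.0, dx
#         cdef Py_ssize_t i
#         ...
```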

------
smortaz
there's also Microsoft's effort:

[https://github.com/Microsoft/Pyjion](https://github.com/Microsoft/Pyjion)

[disclaimer]

~~~
oldmanjay
What are you disclaiming? I have to admit, I was already a bit baffled when
people started "disclaiming" their credentials, against all definitional
expectation; seeing a bare disclaimer with nothing obviously being disclaimed
leaves me completely confused.

------
heydenberk
Previous posting:
[https://news.ycombinator.com/item?id=6112995](https://news.ycombinator.com/item?id=6112995)

25 comments and good discussion there.

------
pepijndevos
For all those talking about PyPy, have a look at Pyston, an ongoing effort by
Dropbox to build a fast Python.

[https://github.com/dropbox/pyston](https://github.com/dropbox/pyston)

~~~
forgotpwtomain
Any idea why they decided to write their own JITed Python implementation
rather than putting additional support into PyPy?

~~~
Fede_V
C-API compatibility, I'm guessing. I agree it was a strange choice, though.

~~~
andreasvc
PyPy also has higher memory usage.

------
mangeletti
Something worth noting:

Python 3.6 has a number of refactorings of standard library components[1], and
while that doesn't affect the CPython interpreter itself, these enhancements
should do a lot to speed up applications that make heavy use of the standard
library.

1\. Some are mentioned at
[https://docs.python.org/3.6/whatsnew/3.6.html](https://docs.python.org/3.6/whatsnew/3.6.html)
(search page for "fast")

------
vegabook
Ah yes. 25% faster.

I get 50x faster (that would be 5000%) just dipping into Numpy when I need to
(admittedly with AVX), and C when I have to. Both are like, trivially easy.

Why are we bothering with making native Python marginally faster when it is
already the perfect tool for the mission it is there to accomplish (glue), and
there are dozens of other tools which are optimized for performance?
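
The "dip into NumPy" pattern in miniature (function names illustrative): move the inner loop into a vectorized C kernel instead of speeding up the interpreter around it.

```python
import numpy as np

# Pure-Python inner loop: one bytecode dispatch per element.
def diff_python(a, b):
    return [abs(x - y) for x, y in zip(a, b)]

# NumPy version: one call into a compiled (possibly SIMD/AVX) kernel
# regardless of array size.
def diff_numpy(a, b):
    return np.abs(np.asarray(a) - np.asarray(b))
```

For large arrays the NumPy version's per-element cost approaches that of C, which is where speedups of the magnitude mentioned above come from.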

~~~
Animats
_" Both are like, trivially easy."_

Until you screw up memory allocation and have to debug.

There's a race condition in Python 3's CPickle that corrupts memory.[1] I
can't reproduce it well enough to submit a bug report that won't be ignored.

[1] [http://bugs.python.org/issue23655](http://bugs.python.org/issue23655)

~~~
burfog
Debugging gets way easier if you finish the C-to-Python transition, ripping
out the last bit of Python. Having the interpreter running makes debugging way
harder than it needs to be. Lose that, and suddenly you can take advantage of
all sorts of powerful debugging tools. (valgrind, less-painful use of a
standard debugger, -fsanitize= compiler options, C interpreters, coverity,
etc.)

Ditching the Python also dramatically improves start-up latency.

------
exabrial
But why? I'm not trolling; I mean this as sincere, non-hostile criticism...
Static types alone improve code quality and execution speed, and it's been
shown that hardly anyone actually exploits dynamic typing anyway.

------
eveningcoffee
Is it at all possible to make Python really fast, given that it depends on the
GIL, which to my understanding makes Python performance memory bound?

~~~
andreasvc
The GIL is an issue when you want parallelism. It is not relevant when talking
about single thread performance.

~~~
eveningcoffee
As I understand it, it eliminates some optimization possibilities and adds
overhead.

I do not have time to check the Python code to see how it is actually
implemented (hence my question) but based on my knowledge it would imply at
least a CAS operation to check and take the lock, writing the register values
into memory (cache) and applying a memory barrier.

You cannot keep the values in registers (eliminating optimization
possibilities), and you add considerable overhead by needing memory barriers
and CAS operations.

I am not claiming that Python does it like this; I am just assuming that it
would have to, to obtain the guarantees of the GIL.

~~~
smegel
The GIL is a condition variable and a mutex. Nothing fancy. In a single
threaded program, it gets acquired once, if at all.

~~~
eveningcoffee
A mutex and a shared condition variable are expensive compared to a single
instruction.

But I had the impression that it is acquired before every atomic Python
instruction, and it looks like it is actually acquired for a predefined group
of instructions (100) that are then executed inside one GIL time frame [0].

Therefore it actually should not be a big obstacle to making Python code run
fast with a JIT compiler.

[0]
[http://www.dabeaz.com/python/GIL.pdf](http://www.dabeaz.com/python/GIL.pdf)
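
For reference, the "every 100 ticks" scheme from those slides was replaced in CPython 3.2 by a time-based switch interval, which you can inspect and tune:

```python
import sys

# A thread holding the GIL is asked to release it every switch-interval
# seconds (default 5 ms in CPython >= 3.2). A single-threaded program
# acquires the lock once and never contends on it.
print(sys.getswitchinterval())
```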

------
baccheion
Very... if a VM or JIT compiler is added to the mix, along with optional type
hints. There are many potential Python optimizations that can't be done today
but become possible once something can provide real-time code
monitoring/optimization.

------
continuations
How does Falcon compare to PyPy performance wise?

------
amelius
Why not write a Py->JS transpiler to solve the efficiency problem?

~~~
rpearl
I can't tell if you're joking, but if not... No, JS isn't more efficient than
Python.

~~~
infogulch
I couldn't tell either. Poe's law in full effect.

