How fast can we make interpreted Python? (2013) (arxiv.org)
125 points by luu on April 18, 2016 | 78 comments

Title should say 2013. These days, PyPy is way past 25% average speedup with respect to CPython.

/edit to be fair, the project README points out the difference in goals and performance between Falcon and PyPy quite nicely: https://github.com/rjpower/falcon/blob/master/README.md

The Pyston team at Dropbox has a presentation where they say PyPy's "real world" performance isn't as good as the benchmarks would suggest; allegedly it's slower than CPython on many metrics.


It's an argument I haven't heard anywhere else (and one I'm not in a position to substantiate), but presumably Dropbox, who use Python at scale, have as valid a perspective on this issue as anyone.

I wrote a game renderer with pygame; it's >10x faster in PyPy... it's voxels (in the Comanche sense).

That's just an anecdote but obviously you should check your use case and find what tool is best suited for your project. There is no magic cure-all, but in my experience pypy comes close.

I'm pretty confident it's also going to be a magic speed boost for a decent size web app I work on, but it remains to be tested with this particular app so I'll check it out... The most bulletproof way to find out what's faster is to test it.

Comanche's "voxels" are an elevation map.

That's why I said "in the comanche sense", since this style of raycasting was commonly and erroneously referred to as "voxels" in the 90s (all other raycasters having basically flat floors).

"elevation map" doesn't mean anything in terms of projection, you can project an elevation map onto a 2D display in a hundred ways, "raycaster" does mean something here, a raycasted elevation map or raycasted heightfield is what this is, anyway it's just an example of what types of problems can be solved 10x faster with pypy.

I thought the 25% number was Falcon vs. CPython, not PyPy vs. CPython. The paper suggests (though does not say directly) that PyPy is faster than Falcon, and that the only benefit of Falcon is compatibility with existing C extensions.

That's about right. That's about where Unladen Swallow maxed out. To get beyond that point, you have to do more global analysis and JIT compilation and recompilation, as PyPy does.

Every Python symbol reference requires a dictionary lookup. Every function call and operator requires that the type of the left hand side be examined for dispatching purposes. More than 99% of the time, the symbol and type will be the same as last time, and it's a huge win to assume that it will be and compile code for that case. You still have to be prepared for the times when it isn't, and have a backup system. That's basically what PyPy does.
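The "assume it will be the same as last time" trick is an inline cache. Here is a minimal sketch of the idea in plain Python (hypothetical, for illustration only; PyPy and other JITs do this at the machine-code level, not with Python objects):

```python
# Sketch of a monomorphic inline cache at one call site: remember the type
# seen last time and the method it resolved to, take a fast path while the
# type keeps matching, and fall back to a full lookup when it doesn't.

class CallSiteCache:
    def __init__(self):
        self.cached_type = None     # type observed on the previous call
        self.cached_method = None   # method resolved for that type

    def dispatch(self, obj, name):
        tp = type(obj)
        if tp is self.cached_type:          # fast path: >99% of calls
            return self.cached_method(obj)
        # slow path: full lookup, then refill the cache for next time
        method = getattr(tp, name)
        self.cached_type, self.cached_method = tp, method
        return method(obj)

cache = CallSiteCache()
print(cache.dispatch(3, "__neg__"))   # slow path: resolves int.__neg__
print(cache.dispatch(7, "__neg__"))   # fast path: same type as last time
```

The backup system the comment mentions is the slow path: when the cached type stops matching, the cache is simply refilled.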

More stuff is mutable in Python than really needs to be mutable.

That is true. I've been fooling around with Hy (hylang.org) for a fairly long time because it's nicer than Clojure for doing glue, and that has made me wonder how fast some things would go if Python had native immutable data types.

> That's about right. That's about where Unladen Swallow maxed out. To get beyond that point, you have to do more global analysis and JIT compilation and recompilation, as PyPy does.

Unladen Swallow did JIT compilation and recompilation, despite not doing better.

But compatibility with C extensions is a major selling point.

25% speedup for pure Python code is not that appealing in numerical analysis code, when you can use NumPy, or sprinkle a few typedefs into a critical function and compile with Cython for a 1000x speedup.
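The gap shows up even without NumPy or Cython: pushing a loop out of Python-level bytecode and into C (here just the built-in sum) is a stdlib-only illustration of why library dispatch beats a modest interpreter speedup. Timings are machine-dependent; the speedup factor is illustrative:

```python
import timeit

n = 1_000_000
xs = list(range(n))

def python_loop():
    # every iteration pays interpreter dispatch and boxed-int overhead
    total = 0
    for x in xs:
        total += x
    return total

def builtin_sum():
    # the loop runs in C inside the interpreter
    return sum(xs)

assert python_loop() == builtin_sum() == n * (n - 1) // 2
t_loop = timeit.timeit(python_loop, number=10)
t_sum = timeit.timeit(builtin_sum, number=10)
print("Python loop: %.3fs  builtin sum: %.3fs" % (t_loop, t_sum))
```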

If C extension compatibility is broken, then many Python programs will still be slower overall, despite the pure-Python speedup.

The cython community could definitely use some love -- there's a ton of low-hanging fruit (unboxing arrays of extension classes, for example) they haven't gotten to because of time/funding constraints.

Not an apples-to-apples comparison, but my guess is the state of the art in JIT compilation is way ahead of what the cython compiler can detect; some of the JIT tricks would likely work in cython's static translation step.

And to be clear I love cython, it's very useful as-is -- there's a large community of people for whom C & python expertise have gone hand in hand for years, and cython is the tool they end up using to max productivity and minimize surprises.

Rereading my comment, it can easily be misunderstood. Sorry about that. The 25% average speedup they cite is indeed relative to CPython. Thanks for clarifying!

I think the newest version of RPython has better FFI; they've been blogging about lxml compatibility for a few months.

PyPy only does Python 2. Some people want to use the new features of Python 3.

Thanks, we added the year above.

Quite old. Plain Python is still slow, but PyPy is in the same league as Node.js.

I rewrote the same program (compare two images, generate a third that shows the diff) in a number of languages. Considering CPython as the reference implementation (1x), I got 100x in Rust, 60x in Go, 12x in Node.js and 10-11x in PyPy.

Initially I got 4x with PyPy, but I did a light refactor, removing some map()s and zip()s that were gratuitous (3-element lists), and then PyPy went real fast.
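A hypothetical illustration of that kind of refactor (not the actual renderer code): replacing map()/zip() over 3-element colour tuples with direct indexing removes per-pixel iterator allocations, which older PyPy versions JIT-compiled much more readily:

```python
def blend_zip(c1, c2):
    # original style: allocates a zip iterator and a map iterator per call
    return tuple(map(lambda pair: (pair[0] + pair[1]) // 2, zip(c1, c2)))

def blend_unrolled(c1, c2):
    # refactored style: plain indexing, no intermediate iterators
    return ((c1[0] + c2[0]) // 2,
            (c1[1] + c2[1]) // 2,
            (c1[2] + c2[2]) // 2)

# both produce the same result; only the allocation pattern differs
assert blend_zip((10, 20, 30), (30, 40, 50)) == (20, 30, 40)
assert blend_unrolled((10, 20, 30), (30, 40, 50)) == (20, 30, 40)
```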

You've picked a poor example, I think. And you're probably coding in an inefficient way.

Here's the OpenCV way for a pair of 2048x2048 images:

    import cv2
    import time

    t = time.clock()

    a = cv2.imread("./image_1.tiff", cv2.IMREAD_GRAYSCALE)
    b = cv2.imread("./image_0.tiff", cv2.IMREAD_GRAYSCALE)

    c = a - b

    cv2.imwrite("out.tiff", c)

    print time.clock() - t
Takes about 0.2 CPU seconds on average for me (note using time.clock, not time.time on UNIX).

    #include <opencv2/opencv.hpp>
    #include <ctime>
    #include <iostream>

    using namespace cv;
    using namespace std;

    int main(void) {
      clock_t begin = clock();

      Mat a = imread("./image_1.tiff", IMREAD_GRAYSCALE);
      Mat b = imread("./image_0.tiff", IMREAD_GRAYSCALE);

      Mat c = a - b;

      imwrite("out.tiff", c);
      clock_t end = clock();
      double elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;
      cout << elapsed_secs << endl;
      return 0;
    }
Again, about 0.2 seconds. The difference is negligible if you use the right libraries. Python should not be your bottleneck for high-performance code.

> The difference is negligible if you use the right libraries.

OpenCV is written in C++, it's going to be fairly efficient to call out to it in any language.

For me, most of the time spent in our code is in 'business logic' which necessarily must be in the main language of the codebase.

That's where PyPy gets its wins.

My conclusion is: as long as you don't do anything new, Python is fast enough.

That's true, but most of the time you can express the "new" stuff in terms of existing fast libraries.

I've had a few cases where you can't, and have become a big fan of Cython for that. It lets you add C typedefs to Python code, and then compile the module at import time. Example here: http://pastebin.com/sF8KmyiU

All of pure Python is still allowed in these modules, but the typedeffed variables become pure C variables instead of objects, and loops become pure C loops. For this particular function, I got a 1000x speedup compared to the original Python code.

In the end, this isn't Python any more - but it's close enough, and only needed for loops that run over millions of items.
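A minimal sketch of that pattern (hypothetical, not the linked pastebin): with cdef-typed locals and a typed memoryview, the hot loop compiles to a plain C loop instead of operating on Python objects:

```cython
# cython: language_level=3
# Illustrative only: cdef-typed variables become raw C doubles/ints,
# so the loop body never touches Python objects.
def dot(double[:] xs, double[:] ys):
    cdef double total = 0.0
    cdef Py_ssize_t i, n = xs.shape[0]
    for i in range(n):
        total += xs[i] * ys[i]
    return total
```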

Now, try implementing that matrix subtraction, or something more complex such as blurring or edge detection, in pure Python, and compare the results.

Also, are you sure that doesn't measure disk speed?

I get around 0.07 seconds without the write in C++ and the same in Python (good call though).

I agree that in pure Python it'd be slower, but realistically why would you do that? Unless you work somewhere where you're forced to write your own libraries... but even then, you could write those libraries in a lower-level language.

If you only removed the write and not the reads from the timing, I would guess the reads (even with warm caches) still dominate the time.

And you would want to do it in pure Python if you want to answer the question "how fast can we make interpreted Python?". Using C extensions for that is cheating, as it isn't Python and it isn't interpreted. You don't answer the question "how fast can you run?" with "30 km an hour, using a bicycle", either.

If you make those images large enough (and I guess 2k x 2k is large enough), any language that uses OpenCV to do the job will give results in the same ballpark. For example, you can make the difference between Python implementations that can call OpenCV as small as you want it.

Not the parent, but I think the point is that, in the real world, the vast majority of use cases for which Python is slow are ones where you would use an existing library written in a lower-level language. So questions like "How fast can we make matrix multiplication in Python?" are irrelevant for the vast majority of Python developers because NumPy exists, and it's always going to be faster than anything you can write in pure Python.

In Ipython %timeit gives 4.3ms per loop, just on a-b. In C++ it's about the same.

I agree that the question is valid - making vanilla Python faster is cool. My point was that this particular example (image processing) was flawed, because it's not something a sane person would ever do in pure Python.

"In Ipython %timeit gives 4.3ms per loop, just on a-b. In C++ it's about the same."

Of course it is about the same. Except for function entry and function exit, which should be a few thousand instructions at the most, it runs the exact same instruction sequence (if you are using identical library versions, compiler, and compiler flags).

If you want an easily measurable difference, use way smaller images, and make a few thousand or even a few million calls, or look at the python sources to see how efficiently it calls into C.

Ok. But I would do it in Numba.

Be aware that Python OpenCV just calls into C++; it's only a wrapper. SciPy or NumPy might be better examples.

I was under the impression that Numpy also just calls BLAS underneath? Hence why doing element-wise calls in Numpy is far, far faster than simply doing nested for loops.

But I think this is the great strength of Python. It's a glue language. If you need speed, you can always write a wrapper around a C/C++ library.
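As a tiny illustration of the glue point: you don't even need an extension module to call C from Python; the stdlib ctypes module can load a shared library directly. This sketch calls the C math library (library name assumed; the fallback "libm.so.6" is Linux-specific):

```python
import ctypes
import ctypes.util
import math

# locate and load the C math library
libname = ctypes.util.find_library("m") or "libm.so.6"
libm = ctypes.CDLL(libname)

# declare the C signature: double cos(double)
libm.cos.restype = ctypes.c_double
libm.cos.argtypes = [ctypes.c_double]

# calling into C: no interpreter overhead inside cos itself
assert abs(libm.cos(0.0) - 1.0) < 1e-12
assert abs(libm.cos(math.pi) + 1.0) < 1e-12
```

Real wrappers (NumPy, OpenCV's Python bindings) do the same thing more robustly with generated extension code, but the principle is identical.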

Funny you should say C/C++, because BLAS is Fortran.

Actually, it's mostly assembly (depending on the particular implementation). Lapack is Fortran though.

> SciPy or numpy might be better examples.

Python comprises less than 50% of the code in both of those repositories, but they are certainly great for learning the CPython API.

I've never used Rust, how would it compare to a C++ solution?

The standard response from the Rust team is that Rust should match or beat non-SIMD C++ performance and if it doesn't, you should file a bug.

Note: The first thing anybody will ask when you complain about Rust being slow is whether you compiled with optimizations turned on (`cargo build --release`) since it tends to make a 10-15x difference.

Theoretically, the Rust borrow checker also knows enough about your code's aliasing and dispatch semantics that the additional information could be used for deeper optimizations than are available in either C or C++. Numerical analysis in Rust could compete with Fortran in performance, but I don't know whether any of that has been actualized in Rust yet.

I think some of that might start to come along with the growth of compiler plugins, which are feature-gated to nightly builds right now.

By what measurement?

I think the only thing I could say is that it would be safer? And possibly terser and possibly easier to understand.

As a C++ developer I find it hilarious that Node.js is in a high-performance league! (Your benchmark, for example, shows Rust as 6x faster.)

But it's high performance given the semantics of the language. The work that has gone into making V8 perform as it does is extraordinary and should be respected, not mocked as 'hilarious'.

I believe "high performance" usually refers to Node.js's non-blocking I/O.

The benchmarks show that by using another language his code went 8x faster. Perhaps if he optimized his C++ it would go even faster. It's funny that people are saying 10x off the theoretical maximum is high performance. I wonder if these people are living in a JavaScript bubble.

A stack-to-register JIT compiler for Python that maintains compatibility with existing C extensions, because it runs as an extension within CPython. An average of 25% faster, up to 2.5x, on the benchmarks in the paper.

Source link: https://github.com/rjpower/falcon/ (doesn't show up in the paper till the very end!)
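For context, CPython's bytecode is stack-based; a quick look with the stdlib dis module shows the push/pop traffic that a register-based translation (Falcon's approach, conceptually) avoids. The register form in the comment below is illustrative, not Falcon's actual output:

```python
import dis

def add(a, b):
    return a + b

# CPython's stack machine: push a, push b, add the two stack tops, return.
ops = [ins.opname for ins in dis.get_instructions(add)]
print(ops)

# A register-based form would encode the whole body as a single
# three-address instruction, e.g. "ADD r2, r0, r1", eliminating the
# explicit push/pop traffic the stack machine needs.
assert any(op.startswith("LOAD_FAST") for op in ops)
```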

I have been using Cython, as presented here [0], with huge speedups: prototype in Python, then make a few changes to produce Cython code.


There's also Microsoft's effort:



What are you disclaiming? I have to admit, having been a bit baffled when people started disclaiming their credentials against all definitional expectation, just seeing a bare disclaimer with nothing obviously being disclaimed, I am completely confused.

This is actually really cool!

Previous posting: https://news.ycombinator.com/item?id=6112995

25 comments and good discussion there.

For all those talking about PyPy, have a look at Pyston, an ongoing effort by Dropbox to build a fast Python.


Any idea why they decided to write their own JITed python implementation rather than putting in additional support for PyPy ?

According to this slide, their "real world" tests of PyPy didn't live up to the benchmarks, and in fact showed "no clear improvement" compared with CPython.


C-API compatibility, I'm guessing. I agree it was a strange choice, though.

PyPy also has higher memory usage.

I'm kinda jealous that there is so much effort put into Python... I like to use ruby as a scripting language...

Something worth noting:

Python 3.6 has a number of refactorings of standard-library components[1], and while that doesn't affect the CPython interpreter itself, these enhancements should do a lot to speed up applications that make heavy use of the standard library.

1. Some are mentioned at https://docs.python.org/3.6/whatsnew/3.6.html (search page for "fast")

Ah yes. 25% faster.

I get 50x faster (that would be 5000%) just dipping into Numpy when I need to (admittedly with AVX), and C when I have to. Both are like, trivially easy.

Why are we bothering with making native Python marginally faster when it is already the perfect tool for the mission it is there to accomplish (glue), and there are dozens of other tools which are optimized for performance?

"Both are like, trivially easy."

Until you screw up memory allocation and have to debug.

There's a race condition in Python 3's CPickle that corrupts memory.[1] I can't reproduce it well enough to submit a bug report that won't be ignored.

[1] http://bugs.python.org/issue23655

Debugging gets way easier if you finish the C-to-Python transition, ripping out the last bit of Python. Having the interpreter running makes debugging way harder than it needs to be. Lose that, and suddenly you can take advantage of all sorts of powerful debugging tools. (valgrind, less-painful use of a standard debugger, -fsanitize= compiler options, C interpreters, coverity, etc.)

Ditching the python also dramatically improves start-up latency.

Because we need faster glue? :)

super glue!

Precisely. :)

Also, see above my comment on using Hy (hylang.org) for LISPy glue.

But why? I'm not trolling; I mean this as sincere, non-hostile criticism... Static types alone improve code quality and execution speed, and it's been shown that hardly anyone actually exploits dynamic types anyway.

Is it at all possible to make Python really fast, given that it depends on the GIL, which to my understanding makes Python performance memory-bound?

The GIL is an issue when you want parallelism. It is not relevant when talking about single thread performance.

As I understand it, it eliminates some optimization possibilities and adds overhead.

I do not have time to check the Python code to see how it is actually implemented (hence my question) but based on my knowledge it would imply at least a CAS operation to check and take the lock, writing the register values into memory (cache) and applying a memory barrier.

You cannot keep the values in the registers (eliminating optimization possibilities), and you add considerable overhead by needing memory barriers and CAS operations.

I am not claiming that Python does it like this; I am just assuming that it would have to, to obtain the guarantees of the GIL.

The GIL is a condition variable and a mutex. Nothing fancy. In a single threaded program, it gets acquired once, if at all.

A mutex and a shared condition variable are expensive compared to a single instruction.

But I had the impression that it is acquired before every atomic Python instruction, and it looks like it is actually held for groups of a predefined number of bytecode instructions (100) that are then executed inside one GIL time frame [0].

Therefore it actually should not be a big obstacle to make Python code run fast by a JIT compiler.

[0] http://www.dabeaz.com/python/GIL.pdf
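A stdlib-only sketch of what this means in practice: two CPU-bound threads interleave under the GIL rather than running in parallel, but because the interpreter periodically releases the lock, both make progress and the results stay correct. No speedup from the second thread should be expected:

```python
import threading

COUNT = 200_000
results = [0, 0]

def worker(idx):
    # CPU-bound loop: only one thread executes bytecode at a time under
    # the GIL; the interpreter periodically releases the lock so the
    # other thread can be scheduled.
    total = 0
    for i in range(COUNT):
        total += i
    results[idx] = total

threads = [threading.Thread(target=worker, args=(i,)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

expected = COUNT * (COUNT - 1) // 2
assert results == [expected, expected]
print("both threads done; execution was serialized but correct")
```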

Could you explain why GIL wouldn't make it CPU bound, rather than memory bound?

I am sorry; by memory I meant the cache memory that is in fact inside the CPU. That was an error on my side.

I read [0] and could not infer anything to confirm my understanding, but if we consider the semantics of the GIL, there should be some form of guarantee in place that two CPU cores see a coherent picture of the cache (otherwise two threads would see incoherent values of the same variable, or would not see the changes at all). This is usually achieved by some form of memory barriers and by flushing the registers into memory, and this, in my experience (which comes from Java), makes the code an order of magnitude or two slower.

[0] https://wiki.python.org/moin/GlobalInterpreterLock

Very, if a VM or JIT compiler is added to the mix, along with optional type hints. Many Python optimizations that can't be done today become possible if something can provide real-time code monitoring/optimization.

How does Falcon compare to PyPy performance wise?

Why not write a Py->JS transpiler to solve the efficiency problem?

A big part of the reason why the likes of V8 are fast is because they can concentrate on being single-threaded. If you wanted to write a transpiler, it would need to be targeted at a JS implementation that allowed for multiple threads to be run at the same time. If you did that, you'd have a JS runtime with similar problems to CPython.

I can't tell if you're joking, but if not... No, JS isn't more efficient than Python.

I couldn't tell either. Poe's law in full effect.

A Python JITing VM written in Python, compiled to JavaScript, can actually be faster than CPython (after warmup!)


Maybe not, but if you are doing single threaded stuff in python you will probably get better performance by using node. Now, I'm no python or javascript programmer, but I generally get a lot better performance for small scripts using node.

Sarcasm detector is also failing.

But even compiling to C would not magically solve the issues.
