/edit to be fair, the project README points out the difference in goals and performance between Falcon and PyPy quite nicely: https://github.com/rjpower/falcon/blob/master/README.md
It's an argument I haven't heard anywhere else (and one I'm not in a position to substantiate), but presumably Dropbox, who use Python at scale, have as valid a perspective on this issue as anyone.
That's just an anecdote, but obviously you should check your use case and find what tool best suits your project. There is no magic cure-all, but in my experience PyPy comes close.
I'm pretty confident it will also be a big speed boost for a decent-sized web app I work on, but that remains to be tested with this particular app, so I'll check it out. The most bulletproof way to find out what's faster is to test it.
"elevation map" doesn't mean anything in terms of projection, you can project an elevation map onto a 2D display in a hundred ways, "raycaster" does mean something here, a raycasted elevation map or raycasted heightfield is what this is, anyway it's just an example of what types of problems can be solved 10x faster with pypy.
Every Python symbol reference requires a dictionary lookup. Every function call and operator requires that the type of the left hand side be examined for dispatching purposes. More than 99% of the time, the symbol and type will be the same as last time, and it's a huge win to assume that it will be and compile code for that case. You still have to be prepared for the times when it isn't, and have a backup system. That's basically what PyPy does.
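A small illustration of those dictionary lookups, and of why assuming "same as last time" pays off: hoisting a lookup out of a loop by hand is roughly what the JIT does for you automatically. (The function names here are made up for the example.)

```python
import math

def use_global(n):
    # each iteration does two dict lookups: the name 'math' in the
    # module globals, then the attribute 'sqrt' in the math module's dict
    s = 0.0
    for i in range(n):
        s += math.sqrt(i)
    return s

def use_local(n, sqrt=math.sqrt):
    # the lookup happens once, at definition time; inside the loop
    # 'sqrt' is a fast local variable access
    s = 0.0
    for i in range(n):
        s += sqrt(i)
    return s
```

On CPython the second version is measurably faster; PyPy makes the first one just as fast by caching the lookup result inside the compiled trace and guarding on it.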
More stuff is mutable in Python than really needs to be mutable.
Unladen Swallow did JIT compilation and recompilation as well, yet it never managed to do much better.
A 25% speedup for pure Python code is not that appealing for numerical analysis code, when you can use NumPy, or sprinkle a few type declarations into a critical function and compile it with Cython for a 1000x speedup.
If C extension compatibility is broken, many Python programs will still end up slower overall, despite the pure-Python speedup.
Not an apples-to-apples comparison, but my guess is that the state of the art in JITs is way ahead of what the Cython compiler can detect; some of the JIT tricks would likely also work in Cython's static translation step.
I rewrote the same program (compare two images, generate a third that shows the diff) in a number of languages. Considering CPython as the reference implementation (1x), I got 100x in Rust, 60x in Go, 12x in Node.js and 10-11x in PyPy.
Initially I got 4x with PyPy, but after a light refactor, removing some map()s and zip()s that were gratuitous (over 3-element lists), PyPy went really fast.
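To give a feel for that kind of refactor (hypothetical code, not the actual diff tool): a tracing JIT handles a fixed-size unrolled version much better than a per-pixel zip() over 3-element sequences, which allocates an iterator and a fresh list on every call.

```python
def diff_pixel_zip(p, q):
    # allocates a zip iterator plus a result list for every single pixel
    return [abs(a - b) for a, b in zip(p, q)]

def diff_pixel_unrolled(p, q):
    # explicit indexing over a fixed 3-element tuple: no iterator
    # allocation, and much easier for the JIT to optimize
    return [abs(p[0] - q[0]), abs(p[1] - q[1]), abs(p[2] - q[2])]
```

Both return the same per-channel differences; the second form just avoids the per-pixel allocations that dominated the hot loop.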
Here's the OpenCV way for a pair of 2048x2048 images:
import time
import cv2

t = time.perf_counter()
a = cv2.imread("./image_1.tiff", cv2.IMREAD_GRAYSCALE)
b = cv2.imread("./image_0.tiff", cv2.IMREAD_GRAYSCALE)
c = a - b
print(time.perf_counter() - t)
And the C++ version:

#include <opencv2/opencv.hpp>
#include <ctime>
#include <iostream>

using namespace cv;
using namespace std;

int main() {
    clock_t begin = clock();
    Mat a = imread("./image_1.tiff", IMREAD_GRAYSCALE);
    Mat b = imread("./image_0.tiff", IMREAD_GRAYSCALE);
    Mat c = a - b;
    clock_t end = clock();
    double elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;
    cout << elapsed_secs << endl;
    return 0;
}
OpenCV is written in C++, it's going to be fairly efficient to call out to it in any language.
For me, most of the time in our code is spent in 'business logic', which necessarily lives in the main language of the codebase.
That's where PyPy gets its wins.
I've had a few cases where you can't, and I've become a big fan of Cython for those. It lets you add C type declarations to Python code and then compile the module at import time. Example here: http://pastebin.com/sF8KmyiU
All of pure Python is still allowed in these modules, but the typed variables become plain C variables instead of Python objects, and loops over them become pure C loops. For this particular function, I got a 1000x speedup compared to the original Python code.
In the end, this isn't Python any more - but it's close enough, and only needed for loops that run over millions of items.
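To give a flavor of what that looks like (a made-up example, not the function from the pastebin link): in a .pyx module you declare C types with cdef, and Cython compiles the loop down to plain C.

```cython
# sum_squares.pyx -- hypothetical example of Cython type declarations
def sum_squares(long n):
    cdef long i            # 'i' becomes a C long, not a Python object
    cdef double total = 0.0
    for i in range(n):     # compiles to a plain C for-loop
        total += i * i
    return total
```

With pyximport installed, `import pyximport; pyximport.install()` is what gives you the compile-at-import-time behavior mentioned above.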
Also, are you sure that doesn't measure disk speed?
I agree that in pure Python it'd be slower, but realistically, why would you do that? Unless you work somewhere where you're forced to avoid third-party libraries... but even then you could still implement your own C extension.
And you would want to do it in pure Python if you want to answer the question "how fast can we make interpreted Python?" Using C extensions for that is cheating: it isn't Python, and it isn't interpreted. You don't answer the question "how fast can you run?" with "30 km an hour, using a bicycle", either.
If you make those images large enough (and I guess 2048x2048 is large enough), any language that uses OpenCV to do the job will give results in the same ballpark. For example, you can make the difference between Python implementations that call OpenCV as small as you want.
I agree that the question is valid - making vanilla Python faster is cool. My point was that this particular example (image processing) was flawed, because it's not something a sane person would ever do in pure Python.
Of course it is about the same. Except for function entry and exit, which should cost a few thousand instructions at most, it runs the exact same instruction sequence (assuming identical library versions, compiler, and compiler flags).
If you want an easily measurable difference, use much smaller images and make a few thousand (or even a few million) calls, or look at the Python sources to see how efficiently it calls into C.
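A sketch of that approach (sum() stands in here for any C-implemented function; the sizes are arbitrary): both variants below sum a million numbers in total, so any difference between them is pure per-call overhead.

```python
import timeit

big = list(range(100_000))

# 10 calls over 100k elements each: time dominated by the C summation loop
t_big = timeit.timeit(lambda: sum(big), number=10)

# 1M calls over 1 element each: time dominated by call/dispatch overhead
t_small = timeit.timeit(lambda: sum([1]), number=1_000_000)
```

On CPython the second timing comes out many times larger, even though both do the same total amount of arithmetic.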
But I think this is the great strength of Python. It's a glue language. If you need speed, you can always write a wrapper around a C/C++ library.
Python comprises less than 50% of the code in both of those repositories, but they are certainly great for learning the CPython API.
Note: The first thing anybody will ask when you complain about Rust being slow is whether you compiled with optimizations turned on (`cargo build --release`) since it tends to make a 10-15x difference.
I think the only thing I could say is that it would be safer? And possibly terser, and possibly easier to understand.
Source link: https://github.com/rjpower/falcon/ (doesn't show up in the paper till the very end!)
25 comments and good discussion there.
Python 3.6 has a number of refactorings of standard library components, and while that doesn't affect the CPython interpreter itself, these enhancements should do a lot to speed up applications that make heavy use of the standard library.
1. Some are mentioned at https://docs.python.org/3.6/whatsnew/3.6.html (search page for "fast")
I get 50x faster (that would be 5000%) just dipping into Numpy when I need to (admittedly with AVX), and C when I have to. Both are like, trivially easy.
Why are we bothering with making native Python marginally faster when it is already the perfect tool for the mission it is there to accomplish (glue), and there are dozens of other tools which are optimized for performance?
Until you screw up memory allocation and have to debug.
There's a race condition in Python 3's C pickle implementation that corrupts memory. I can't reproduce it reliably enough to submit a bug report that won't be ignored.
Ditching the Python also dramatically improves start-up latency.
Also, see above my comment on using Hy (hylang.org) for LISPy glue.
I do not have time to check the Python source to see how it is actually implemented (hence my question), but based on my knowledge it would imply at least a CAS operation to check and take the lock, writing the register values out to memory (cache), and applying a memory barrier.
You cannot keep the values in registers (which eliminates optimization opportunities), and you add considerable overhead through the memory barriers and CAS operations.
I am not claiming that Python does it like this; I am just assuming that it would have to, to obtain the guarantees of the GIL.
But I had the impression that it was acquired before every atomic Python instruction, and it looks like it is actually acquired for a group of a predefined number of instructions (100) that are then executed inside one GIL time slice.
Therefore it actually should not be a big obstacle for a JIT compiler to make Python code run fast.
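For what it's worth, the "100 instructions" figure was Python 2's check interval (sys.setcheckinterval); CPython 3.2+ replaced it with a time-based switch interval. Either way the bookkeeping happens once per interval, not once per instruction:

```python
import sys

# CPython 3 asks the running thread to release the GIL every N seconds,
# not after every bytecode instruction
print(sys.getswitchinterval())  # default is 0.005 (5 ms)
```

You can tune this with sys.setswitchinterval(), though the default is fine for almost everything.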
I read  and could not infer anything to confirm my understanding. But if we consider the semantics of the GIL, there should be some form of guarantee that two CPU cores see a coherent picture of the cache (otherwise two threads would see incoherent values of the same variable, or would not see each other's changes at all). This is usually achieved by some form of memory barriers and by flushing registers to memory, which in my experience (which comes from Java) makes code an order of magnitude or two slower.
But even compiling to C would not magically solve the issues.