
How fast can we make interpreted Python? - iskander
http://www.phi-node.com/2013/06/how-fast-can-we-make-interpreted-python.html
======
cwzwarich
There are a couple of things you want to do (some of which overlap with the
article):

1) Use a register-based VM (with a sliding and growing register file) instead
of a stack-based VM. In theory you can make a stack-based VM fast with lots of
macroinstructions that fuse smaller operations together, but it isn't worth
it.
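
A toy sketch of the difference in Python (opcode formats invented for illustration): evaluating "2 + 3" costs the stack VM three dispatches, while the register VM folds it into one three-address instruction.

```python
# Toy illustration of "2 + 3" on both designs. Opcode formats are
# invented; the point is the dispatch count.

def run_stack(code, consts):
    stack = []
    for op, arg in code:          # one dispatch per instruction
        if op == "LOAD":
            stack.append(consts[arg])
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
    return stack[-1]

def run_register(code, regs):
    for op, dst, a, b in code:    # operands are explicit registers
        if op == "ADD":
            regs[dst] = regs[a] + regs[b]
    return regs

# three instructions / dispatches:
stack_result = run_stack([("LOAD", 0), ("LOAD", 1), ("ADD", None)],
                         consts=[2, 3])
# one instruction / dispatch:
regs = run_register([("ADD", 2, 0, 1)], regs=[2, 3, 0])
```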

2) Use inline caching for method calls, property accesses, and primitive
operations that do type checks. In an interpreter you can modify the
instruction stream even on platforms that disallow modification of executable
code. I know this isn't the origin of the technique in bytecode interpreters,
but here's a paper describing it in case it's not obvious:

[http://www.lirmm.fr/~ducour/Doc-objets/ECOOP10/papers/6183/61830429.pdf](http://www.lirmm.fr/~ducour/Doc-objets/ECOOP10/papers/6183/61830429.pdf)
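
A minimal sketch of the monomorphic flavour of this, at the Python level (in a real interpreter the cache lives in the mutable instruction stream next to the opcode; here each call site is modeled as a small object instead):

```python
# Monomorphic inline cache sketch for attribute reads. The fast path
# is a single type check; the slow path does the full lookup and
# (re)fills the cache.

class AttrCacheSite:
    def __init__(self, name):
        self.name = name
        self.cached_type = None   # the type seen last time at this site
        self.hits = 0             # instrumentation for the example

    def load(self, obj):
        if type(obj) is self.cached_type:   # fast path: type check only
            self.hits += 1
            return obj.__dict__[self.name]
        # slow path: generic lookup, then cache if the attribute
        # lives in the instance dict
        value = getattr(obj, self.name)
        if self.name in obj.__dict__:
            self.cached_type = type(obj)
        return value

class Point:
    def __init__(self):
        self.x = 1

site = AttrCacheSite("x")
p = Point()
first = site.load(p)    # miss: fills the cache
second = site.load(p)   # hit: skips the generic lookup
```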

3) Pick your value encoding carefully. You almost always want fast immediate
integers. On 64-bit platforms it is quite common these days to repurpose some
of the NaN range in IEEE doubles for type tags to enable storing doubles in
immediate values.
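
The trick is visible even from Python, using struct to reinterpret bits. Normal double arithmetic never produces most quiet-NaN bit patterns, so those spare patterns can carry a type tag plus a payload. The tag layout below is invented for illustration (real VMs differ), with a 32-bit unsigned int payload:

```python
import struct

QNAN = 0x7FF8000000000000
INT_TAG = QNAN | (1 << 48)      # hypothetical "immediate int" tag

def box_double(d):
    # reinterpret the double's 8 bytes as a 64-bit integer
    return struct.unpack("<Q", struct.pack("<d", d))[0]

def unbox_double(bits):
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

def box_int(i):
    return INT_TAG | (i & 0xFFFFFFFF)

def is_int(bits):
    # no ordinary double has these top 16 bits
    return (bits >> 48) == (INT_TAG >> 48)

def unbox_int(bits):
    return bits & 0xFFFFFFFF

bits_d = box_double(3.5)
bits_i = box_int(42)
```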

4) Write your interpreter in assembly. Compilers generate terrible code for
interpreters, even (especially?) with the use of computed goto / labels-as-
values extensions. The register allocators of traditional compilers are
designed to optimize loops by moving spill code outside of them and to reduce
the impact of function calls. They will not be able to realistically allocate
registers across different instruction bodies, and they won't be able to make
the correct tradeoff about how much work to push into the slow path of
instruction bodies.

5) Rearrange your instruction bodies based on execution / transition
frequencies to improve instruction cache performance.

6) Pay close attention to the boundaries between your interpreter and the
runtime libraries / the FFI. You don't want to take a bigger hit than you need
to every time you call out to native code.

~~~
iskander
>Rearrange your instruction bodies based on execution / transition frequencies
to improve instruction cache performance.

Do you mean...group all the frequent operations together so they overlap on
cache lines? It's hard to tell how much this would help, have you tried it?

~~~
cwzwarich
When I was working on WebKit we would rearrange instruction bodies to
influence the generated code based on opcode statistics, but back then the
interpreter was using computed goto, so there wasn't quite a direct connection
between the placement of the input code and the generated code. It's unlikely
that any two instruction implementations will overlap on cache lines, since
they are all typically larger than a cache line, but more temporal coherency
throughout code execution will improve performance, especially on CPUs with
smaller caches.

You can do it automatically by gathering statistics on frequent instruction
pairs. In practice greedy algorithms for code scheduling work fairly well,
assuming you have meaningful statistics.
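
Something like this greedy pass, sketched in Python (opcode names invented): given an executed opcode trace, order the instruction bodies so frequent successor pairs land adjacent in memory, starting from the hottest opcode.

```python
from collections import Counter

def greedy_layout(trace):
    pair_freq = Counter(zip(trace, trace[1:]))
    op_freq = Counter(trace)
    placed, order = set(), []
    cur = op_freq.most_common(1)[0][0]          # hottest opcode first
    while len(placed) < len(op_freq):
        order.append(cur)
        placed.add(cur)
        # chase the most frequent not-yet-placed successor
        succs = [(n, b) for (a, b), n in pair_freq.items()
                 if a == cur and b not in placed]
        if succs:
            cur = max(succs)[1]
        else:
            remaining = [op for op, _ in op_freq.most_common()
                         if op not in placed]
            if remaining:
                cur = remaining[0]
    return order

trace = ["LOAD", "ADD", "LOAD", "ADD", "LOAD", "ADD", "RET"]
layout = greedy_layout(trace)
```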

------
jerf
It seems to me that over the past ten years I've heard this story so many
times: "Python sucks. Let's do the obvious thing that makes it faster." Then,
a month or two later: "I did the obvious thing and it's sometimes faster but
often slower - net no gain, or possibly a loss," to which the response is
obviously "No sale."

I say this merely as an interesting observation. I've come to consider this a
_de facto_ counterargument to the claim that languages aren't slow, only
implementations are. It may be theoretically true, but in practice, as nice as
Python may be to use, it has proved a very difficult language to speed up.
(PyPy has taken a very good run at it, but it sure wasn't a case of "I'll just
do this easy, obvious thing." PyPy seems to have hit Python's performance with
multiple PhD-thesis level attacks, and it's still certainly not C in the
general case.)

~~~
iskander
>It may be theoretically true, but in practice, as nice as Python may be to
use, it has proved a very difficult language to speed up.

I disagree with you. Python isn't much harder to speed up than Lua, and in
some ways it's better behaved than JavaScript. Still, both of those languages
enjoy implementations significantly faster than CPython. Really, it's not the
semantics of the language which hold back Python's performance, but rather the
fact that extension modules are extremely tightly coupled to a particular
interpreter implementation. Lua has a clean interface with C, and JavaScript
implementations generally force the outside world to use doubly indirect
handles on objects. CPython, on the other hand, shamelessly flaunts its
internals for the whole world to see.

The only reason that PyPy has needed "multiple PhD-thesis level attacks" to
get near completion is that their approach is insanely ambitious. They didn't
write a JIT. Instead they wrote a toolkit for partially evaluating
interpreters on source files, generating native code by tracing an
interpreter while it itself runs a program. It's nuts! It's amazing that PyPy
works, and the amount of effort is totally unsurprising.

Had they gone a more traditional route, the whole thing could have been done
in a year or two. They would, however, still face resistance from a Python
community that wants to neither give up nor rewrite their PyObject-laced
libraries.

~~~
fijal
I think you underestimate the complexity of the Python language.

Note that PyPy is not the only project that tried this - remember Psyco? There
are reasons why after 3 years Armin said "I give up, let's do PyPy". It's not
the "well behaved" part - that can be worked around. Python is simply more
complex than JavaScript or Lua, and by complex I mean just bigger: all the
extension modules that everyone naturally expects to be fast (even just the
stdlib), the descriptor protocol, crazy frame access semantics. That makes it
very labour-intensive to do the right thing. Look what happened to Unladen
Swallow - they did not really get anywhere within a year. Several of PyPy's
optimizations that took forever to do are genuinely new stuff, whether you do
the JIT by hand or generate it automatically.

~~~
nostrademons
The last time I talked with the Unladen Swallow guys (a couple years ago),
they were pretty clear that one of their main stumbling blocks was supporting
the Python/C API, and wanting to have complete compatibility with C extension
modules. While we can't really know how hard the task would've been if they'd
lifted that requirement - it was baked into their design from an early stage -
when I'd floated the idea of doing a from-the-ground-up LLVM-based
implementation of Python, they seemed significantly more optimistic. It
wouldn't be all that useful for most people, but it would've worked fine for
my use case.

Alas, it's doubtful that a Python implementation that sacrifices C extensions
would get all that far with mainstream adopters, as so many useful libraries
are done as C extensions.

~~~
fijal
They were optimistic at the beginning too (even with C extensions). How would
it fill your usecase in a way that pypy does not?

~~~
nostrademons
We were looking for something easily embeddable, but all host modules were
provided by the application, so there was no need for outside C extensions.
And the set of libraries that was importable was restricted and coding
styleguides banned advanced language features like metaprogramming, so we
could afford to cut corners on corner cases of the language. "Decent"
performance (i.e. more like Java than CPython) was a requirement, as was
multithreading support and lack of a GIL, and RAM usage was also at a premium
(which was probably the largest argument against PyPy...also, this was a
couple years ago, when PyPy was not as mature).

------
beagle3
If you haven't already looked, LuaJIT's source code (and Mike Pall, when
asked) is a treasure trove of speedup ideas.

One idea that stood out to me (and which I first saw in LuaJIT, and as far as
I know originated with Pall) is: when rewriting loop code, unroll at least 2
iterations of the loop. (The first executes and conditionally continues into
the second; the second loops onto itself). So far, just extra work.

However, any kind of constant folding algorithm is now immediately elevated
into a "code hoisting out of loop" algorithm at no extra cost - e.g., SSA form
gets that kind of code motion.

I'm not sure Python can make much use of that, because it is nearly impossible
to guarantee idempotence of operations - but in case you can somehow make that
guarantee, that can be very significant for e.g. function name lookups.

A possible way to use that is to have the loop opcode have two branch targets:
"namespaces modified" (which goes to the first iteration, which reloads
values) and "namespaces unmodified" (which loops at the 2nd iteration, relying
on the constant folding and not looking up in dicts again). This could make
calls like "a.b.c.d.e.f" require 0 lookups in most iterations of most loops --
but would also require a global "namespace modified" flag.
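
A Python-level sketch of that two-branch-target idea (the version-bumping discipline itself is assumed, not implemented; a real VM would hook namespace mutation): a chained lookup is cached against a global "namespace version" counter, so the unmodified branch does zero lookups.

```python
from types import SimpleNamespace

class ChainCache:
    def __init__(self):
        self.version_seen = -1
        self.value = None
        self.lookups = 0          # instrumentation for the example

    def get(self, root, path, version):
        if version != self.version_seen:     # "namespaces modified"
            obj = root
            for name in path:
                obj = getattr(obj, name)
                self.lookups += 1
            self.value = obj
            self.version_seen = version
        return self.value                    # "unmodified": 0 lookups

a = SimpleNamespace(b=SimpleNamespace(c=7))
cache = ChainCache()
first = cache.get(a, ("b", "c"), version=0)    # 2 lookups
second = cache.get(a, ("b", "c"), version=0)   # cache hit, 0 lookups
```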

~~~
fijal
We have this optimization in PyPy (so yes, Python can do it). There is even a
paper [0]. LuaJIT is a great source of inspiration but Python is just a huge
language, which makes it much harder.

[http://www.maths.lth.se/matematiklth/vision/publdb/reports/pdf/ardo-bolz-etal-dls-12.pdf](http://www.maths.lth.se/matematiklth/vision/publdb/reports/pdf/ardo-bolz-etal-dls-12.pdf)

------
xenonflash
I don't think it's correct to say that CPython is slow. What you can more
accurately say about CPython is that its performance is highly variable. Some
things are very fast, while others are comparatively slow.

The slow things tend to be the sort of numerical loops that you see in micro-
benchmarks. It's no coincidence that the version of Python in the linked
article saw its greatest speed up in a numerical loop, but only modest
improvement elsewhere. It's exactly this sort of simple repetitive operation
where interpreter overhead matters the most.

Language features that encapsulate complex functionality tend to be harder to
speed up in CPython because the VM operates at a fairly high level. In effect
you're just kicking off a large subroutine that is written in C, and you're
really executing native code until that operation is complete. You're not
going to improve very much on that no matter how much you try.

What this means is that speed will depend heavily on the type of application
program being written, and also on how much the programmer takes advantage of
the unique language features. It also makes realistic cross language
benchmarks difficult because the right way to do something in Python may not
have a direct equivalent in another language. The result tends to be "lowest
common denominator" benchmarks, which are exactly the sort of algorithms which
CPython does worst at.

~~~
toolslive
It really is a slow interpreter. Python comes with a pystone benchmark, and
CPython is invariably the slowest of all interpreters. That being said, if you
take a look at its implementation, it's immediately obvious why. The
interpreter is basically a simple C switch, with no optimization whatsoever.
Simply threading the interpreter would make it about a factor of 2 faster (at
least that's what the experts claim you gain by threading).

~~~
xenonflash
Pystone isn't a performance benchmark, or at least it isn't a useful one. It's
more of a regression test to see if anything has changed between versions.
It's not useful as a performance benchmark because it doesn't weight the
results according to how much the individual features matter in real life.
There are three implementations of Python besides CPython that are in
commercial use. Two are much slower than CPython (up to three times slower),
and PyPy is (currently) faster in some applications and slower in others.

The CPython interpreter is _not_ a simple switch. It uses computed gotos if
you compile it with gcc. Microsoft VC doesn't have the language support needed
for writing fast interpreters, so the Python source is written in a way that
will default to a switch if you compile it with MS VC. So, on every platform
except that one, it's a computed goto.
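
Python itself can't express computed gotos, but the core idea behind threaded/predecoded dispatch - resolve each opcode to its handler once, up front, instead of re-consulting a dispatch table on every step - can be sketched at the Python level (toy opcodes, not CPython's):

```python
def op_push(state, arg):
    state.append(arg)

def op_add(state, arg):
    b, a = state.pop(), state.pop()
    state.append(a + b)

TABLE = {"PUSH": op_push, "ADD": op_add}

def run_switch_style(code):
    """Central loop with a table lookup per step (the 'switch')."""
    state = []
    for op, arg in code:
        TABLE[op](state, arg)
    return state[-1]

def run_predecoded(code):
    """Resolve each opcode to its handler once; the hot loop then
    calls handlers directly, like resolved label addresses."""
    decoded = [(TABLE[op], arg) for op, arg in code]
    state = []
    for handler, arg in decoded:
        handler(state, arg)
    return state[-1]

program = [("PUSH", 2), ("PUSH", 3), ("ADD", None)]
```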

Modern CPU performance is very negatively affected by branch prediction
failure and cache effects. A lot of the existing literature that you may see
on interpreter performance is obsolete because it doesn't take those factors
into account, but rather assumes that all code paths are equal. Threading
worked well with older CPUs, not so well with newer ones.

I am currently working on an interpreter that recognises a subset of Python
for use as a library in complex mathematical algorithms. As part of this I
have benchmarked multiple different interpreter designs and also compared
them to native ('C') code. It is possible to get a much faster interpreter,
provided you limit it to doing very simple things repetitively. These simple
things also happen to be the sorts of things which are popular with benchmark
writers (because they're easy to write cross-language benchmarks for), but
which CPython does not do well at.

A sub-interpreter which targets these types of problems should give improved
performance in this area. Rewriting the entire Python interpreter, though,
would probably have little value, as the characteristics of opening a file,
doing set operations, or handling exceptions are entirely different from
adding two numbers together.

There is no such thing as a single speed "knob" which you can crank up or down
to improve performance. There are many, many, features in modern programming
languages, all of which have their own characteristics. Picking out a
benchmark which happens to exercise one or a few of them will tell you nothing
about how a real world application will perform unless it corresponds to the
actual bottlenecks in your application. For that, you need to know the
application domain and the language inside and out.

One thing about Python developers is that they tend to be very pragmatic. When
someone comes to them with an idea, they say "show me the numbers in a real
life situation". More often than not, the theoretical advantage of the
approach being espoused evaporates when subjected to that type of analysis.

~~~
toolslive
[http://hg.python.org/cpython/file/16fe29689f3f/Python/ceval....](http://hg.python.org/cpython/file/16fe29689f3f/Python/ceval.c#l1334)

looks like a switch to me.

Anyway, I've been told the CPython interpreter is kept very simple on purpose,
to allow it to function as a standard 'definition' of the language behaviour.
A simple JIT does wonders, as does a less brain-dead GC. Superinstructions,
threading, ... are all possible. But you're absolutely right: it's really
difficult to predict how much each improvement would contribute.

~~~
xenonflash
Have a look at the lines starting at line 821 in the very file you referenced.
I have quoted a bit of it here:

"Computed GOTOs, or the-optimization-commonly-but-improperly-known-
as-"threaded code" using gcc's labels-as-values extension (...) At the time of
this writing, the "threaded code" version is up to 15-20% faster than the
normal "switch" version, depending on the compiler and the CPU architecture."

They also have an explanation of the branch prediction effect which I
mentioned earlier.

They have both methods (switch and computed goto) since some compilers don't
support computed gotos, and some people want to use alternative compilers
(e.g. Microsoft VC).

In my own interpreter, I tried both switch and computed gotos, as well as
another method called "replicated switch". I auto-generate the interpreter
source code (using a simple script) so that I could change methods easily for
comparison. In my own testing, computed gotos were about 50% faster than a
simple switch, but keep in mind that is strictly doing numerical type code.
More complex operations would water that down somewhat, as less of the
execution time would be due to dispatch overhead.

Computed gotos aren't really any more complex than a switch once you
understand the format, and as I said above you can convert between the two
with a simple script. What does get complex is doing Python level static or
run time code optimization to try to predict types or remove redundant
operations from loops. CPython doesn't do that, while Pypy does this
extensively. It's these types of compiler and run-time re-compile
optimizations which make the big difference.

Overall, my interpreter is currently about 5.5 times faster than CPython with
the specific simple benchmark program I tested. However, keep in mind it only
does (and only ever will do) a narrow subset of the full Python language.
Performance is never the result of a single technique. It's the result of many
small improvements each of which address a specific problem.

~~~
toolslive
So the conclusion really is: CPython is way slower than it should be.
Question: if the subset is small, isn't it better to use something like Shed
Skin?
[http://code.google.com/p/shedskin/](http://code.google.com/p/shedskin/)

I once looked at it, and it does a fairly literal translation. The only
problem is that it changes the semantics of the primitive types. For example,
a Python integer becomes a C++ int (and the overflow semantics change).
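
That semantic change is easy to demonstrate: Python ints are arbitrary precision, while a C++ int wraps. Simulating what a signed 32-bit int would typically hold after wraparound (formally, signed overflow is undefined behaviour in C++):

```python
def wrap_int32(n):
    """Model two's-complement wraparound of a signed 32-bit int."""
    n &= 0xFFFFFFFF
    return n - 0x100000000 if n >= 0x80000000 else n

fine_in_python = 2 ** 31          # no overflow in Python
as_cpp_int = wrap_int32(2 ** 31)  # wraps to INT_MIN
```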

------
bsimpson
As a web developer, I've often heard neckbeards bickering about Python's
performance, but haven't had a real point-of-reference to understand how bad
it can be until recently.

I've started working on a side project that processes geo data in AppEngine.
My dataset includes many long lists of numbers (lats, longs, altitudes,
timestamps, etc.). A 700 route dataset is about 25MB in a sqlite database, but
trying to access any significant portion of it quickly maxes out the 4GB of
RAM available on either of my dev machines (which is more than I could
reasonably expect to be provisioned in the cloud). I mentioned this as a
potential bug to the relevant Googler at I/O this year and he basically said
"that's not us, that's Python."

It's mind-boggling how quickly you can burn through your RAM in CPython.
Hopefully you can produce something that will eventually make its way back
into CPython and lift everyone's boats. Unfortunately, even if Falcon helped
on my dev machine, I can't imagine it being taken up on cloud platforms like
AppEngine.

~~~
gregorsamza
A dataset with "many long lists of numbers" sounds like an ideal use case for
NumPy, have you tried using that?
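
A rough illustration of why unboxed numeric storage helps so much here (using the stdlib array module as a stand-in for NumPy, since it makes the same point and needs no binary dependency): a Python list of floats holds a pointer per element, each pointing at a full float object, while an array stores raw 8-byte doubles inline.

```python
import sys
from array import array

n = 100_000
as_list = [float(i) for i in range(n)]
as_array = array("d", as_list)    # contiguous raw doubles

# list cost = the pointer array plus every boxed float object
list_bytes = sys.getsizeof(as_list) + sum(sys.getsizeof(x) for x in as_list)
array_bytes = sys.getsizeof(as_array)
```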

~~~
scardine
GAE allows only pure Python. No binary modules like NumPy.

~~~
kirubakaran
Numpy is supported
[https://developers.google.com/appengine/docs/python/tools/li...](https://developers.google.com/appengine/docs/python/tools/libraries27)

------
dchichkov
Can we make function calls cheaper?

From my observations, in pretty much any unoptimized (interpreted CPython)
code, function calls are nearly always the bottleneck. Speed is directly
bound by the number of function calls being performed, not by ponderous data
structures.

~~~
iskander
The ponderousness of these data structures isn't just about memory consumption
or having to use boxed numbers. As far as performance goes, PyObjects infect
everything in the interpreter. For example, when you're calling a Python
function, after a long run-around in ceval, PyObject_Call, the function
object's function_call method, you'll finally get back to ceval which creates
a frame via a lengthy call to PyFrame_New. The whole process is a mess of
allocating, deconstructing, increfing, decrefing, and tag-checking.
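
That whole run-around is opaque from Python, but dis at least shows that it all hangs off a single call instruction with nothing for the interpreter to inline around (the opcode is named CALL_FUNCTION on older CPythons and CALL on newer ones, hence the prefix check):

```python
import dis

def add(a, b):
    return a + b

def caller():
    return add(1, 2)

ops = [ins.opname for ins in dis.get_instructions(caller)]
has_call = any(op.startswith("CALL") for op in ops)
```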

------
fijal
The title is very misleading - it should be called "How fast can we make
CPython", because it talks about compatibility with PyObject* and stuff.

While I would argue that we can't make interpreted Python particularly fast,
the _actual_ topic in question is much harder than that.

------
jaegerpicker
While this is an undeniably cool project from the tech side, I think it's
usually a better idea to rewrite bottlenecks of the kind this helps with as a
C extension. It's a fairly easy process (MUCH easier than in Java, for
example) and only a small amount of code needs to be in C itself, but you can
get huge performance increases without changing the Cython environment. I've
always thought one of Python/Ruby's greatest strengths is the easy C
integration.

~~~
fernly
I think you mean the "CPython environment"? "Cython" is something else.

------
fernly
No mention of the somewhat confusingly named Cython [1]? It addresses the same
issues in a different way.

[1] [http://en.wikipedia.org/wiki/Cython](http://en.wikipedia.org/wiki/Cython)

------
freework
Python is typically used to build applications that are IO bound. Squeezing
more performance out of the interpreter is not going to translate to any real
gains for most Python users these days.

~~~
kkowalczyk
That's circular thinking.

Because Python is slow, Python is not used in scenarios where speed is
crucial. That much is true.

However, if Python was faster, it would be used in those scenarios, so more
people would be using it for speed critical code so it would provide real
gains for a great many Python programmers.

This is exactly what happened with JavaScript: before V8 JavaScript was in
exactly the same position as Python. Not many people were writing large
programs in JavaScript because JavaScript was too slow. V8 sped up JavaScript
10x+ and people started writing much larger apps that do require that speed.
If JavaScript speed suddenly dropped to pre-V8 speeds, we would all find the
most popular web apps unusably slow.

~~~
cookiecaper
It's probably worth noting that Google already attempted a V8-like
transformation for Python with Unladen Swallow, and that that attempt mostly
failed. Perhaps it was just prioritized differently, and that's why V8 was
successful and Unladen Swallow wasn't.

------
int3
Interesting approach. One criticism: The paper mentions that compile times max
out at 1.1ms "for the most complex function" in the benchmark (AES), and
therefore it is sufficient to just compile everything. However, those
benchmarks seem too small to justify that conclusion.

------
dschiptsov
The answer is very simple:

1) Know what ought to be done - do it and send the patches.

2) Need "speed" - write that part in C.)

~~~
sherjilozair

        2) Need "speed" - write that part in C.)

That's not so easy. Interfacing Python and C code is also incredibly hard, and
no one true way exists.

~~~
gizmo686
>That's not so easy. Interfacing Python and C code is also incredibly hard,
and no one true way exists.

Can you elaborate on this? I've worked on Python C extensions (just minor
updates and fixes, I've never been the one to write significant chunks of
one), and it seems like interfacing Python with C is pretty straightforward.

~~~
lmm
If you use the CPython C API it's easy - but you also bind yourself closely to
CPython. If you know performance is going to be important it's probably better
to bite the bullet and use PyPy - which means you have to use the somewhat
cruder cffi to interface with C code.
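
There's also a stdlib middle ground worth mentioning (my suggestion, not from the parent): ctypes works on both CPython and PyPy, unlike the CPython C API. Loading libc by name is platform-dependent; this sketch assumes a Unix-like system.

```python
import ctypes
import ctypes.util

# find_library may return None on some systems; CDLL(None) then falls
# back to the running process's symbols on Unix.
libc = ctypes.CDLL(ctypes.util.find_library("c") or None)
libc.abs.argtypes = [ctypes.c_int]
libc.abs.restype = ctypes.c_int

result = libc.abs(-5)
```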

------
mappu
This is bikeshedding, but "The only hard problems in computer science are
cache invalidation and naming things" - there's already a quite popular,
multi-paradigm programming language named Falcon[1].

1. [http://www.falconpl.org/](http://www.falconpl.org/)

~~~
simonh
Sorry, but the two hard problems in programming are actually cache
invalidation, naming things and off by one errors.

------
catmanjan
Haha oh my god Terry A. Davis commented on it!

SQUAWK SQUAWK SQUAWK

