
Why is Python slow - zbb
http://blog.kevmod.com/2016/07/why-is-python-slow/
======
alt_
Database died. Google cache:
[http://webcache.googleusercontent.com/search?q=cache:5V0TMa0...](http://webcache.googleusercontent.com/search?q=cache:5V0TMa0cMecJ:blog.kevmod.com/2016/07/why-is-python-slow/+&cd=1&hl=en&ct=clnk&gl=uk)

The gist of it is:

* Python spends almost all of its time in the C runtime

This means that it doesn't really matter how quickly you execute the "Python"
part of Python. Another way of saying this is that Python opcodes are very
complex, and the cost of executing them dwarfs the cost of dispatching them.
Another analogy I give is that executing Python is more similar to rendering
HTML than it is to executing JS -- it's more of a description of what the
runtime should do rather than an explicit step-by-step account of how to do
it.
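One way to see how much a single "complex opcode" hides is with the standard
`dis` module. This is a minimal sketch (opcode names vary between CPython
versions, e.g. BINARY_ADD became BINARY_OP in 3.11):

```python
import dis

def add(a, b):
    return a + b

# One source-level "+" compiles to a single add opcode, but executing
# that opcode in CPython means type checks, __add__ lookup, boxing the
# result, and reference counting -- all done in the C runtime.
for ins in dis.get_instructions(add):
    print(ins.opname)
```

The dispatch of those few opcodes is trivial next to the C work each one does.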

Pyston's performance improvements come from speeding up the C code, not the
Python code. When people say "why doesn't Pyston use [insert favorite JIT
technique here]", my question is whether that technique would help speed up C
code. I think this is the most fundamental misconception about Python
performance: we spend our energy trying to JIT C code, not Python code. This
is also why I am not very interested in running Python on pre-existing VMs,
since that will only exacerbate the problem in order to fix something that
isn't really broken.

~~~
smegel
> This means that it doesn't really matter how quickly you execute the
> "Python" part of Python. Another way of saying this is that Python opcodes
> are very complex, and the cost of executing them dwarfs the cost of
> dispatching them.

That doesn't really explain why Python is slow. You're just explaining how
Python works. Why should C code be slow? Usually it is fast. Just saying the
opcodes are complex doesn't really help, because if a complex opcode takes a
long time, it is usually because it is doing a great deal of work.

Java used to have the opposite problem. It was doing too much at the "Java
bytecode" level, such as string manipulation - so they added more "complex"
opcodes written in C/C++ to speed things up, significantly.

What you really need to explain is why Python is _inefficient_. Bloated data
structures and pointer hopping for simple things like adding two numbers may
be a big reason. I know Perl had many efficiencies built in, and was
considered quite fast at some point (90s?).
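The "bloated data structures" point can be made concrete with
`sys.getsizeof`. A quick sketch (the exact byte counts are CPython- and
platform-specific, so treat them as ballpark figures):

```python
import sys

# In CPython even a small number is a full heap object carrying a type
# pointer and a reference count, not a bare machine word.
print(sys.getsizeof(1))    # typically 28 bytes on a 64-bit build
print(sys.getsizeof(1.0))  # typically 24 bytes, vs. 8 for a C double
```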

~~~
kamaal
>>I know Perl had many efficiencies built in, and was considered quite fast at
some point (90s?).

There are a lot of threads in Perlmonks that talk in detail about speeding up
Perl, related project et al.

To summarize: languages like Perl and Python are slow because they do a lot of
work out of the box that languages like C don't. Therefore, when you talk of
translating Python to C, or Perl to C, what you are really talking about is
translating all that extra work back into C, which will run about as fast as
Perl or Python itself.

The easier you make things for the interpreter, the faster your code can run,
and vice versa.

Python is slow for the very reason it's famous: it's easy for the programmer.

~~~
pjmlp
Lisp and Smalltalk like languages run circles around Python and Perl
performance.

Enjoy the same powerful features, have JIT and AOT compilers to native code.

It all boils down to how much the language designers care about performance.

~~~
infinite8s
And also how much the language designers care about proper language design.

------
chrisseaton
I think it's true that language implementations such as Ruby and Python spend
most of their time running the C parts of the code. I did a talk saying the
same thing about Ruby a couple of weeks ago, but referring to the Java code in
JRuby,
[https://ia601503.us.archive.org/32/items/vmss16/seaton.pdf](https://ia601503.us.archive.org/32/items/vmss16/seaton.pdf).

But this doesn't mean that a JIT is not going to help you. It means that you
need a more powerful JIT which can optimise through this C code. That may mean
that you need to rewrite the C in a managed language such as Java or
RPython which you can optimise through (which we know works), or maybe we
could include the LLVM IR of the C runtime and make that accessible to the JIT
at runtime (which is a good idea, but we don't know if it's practical).

I work on an implementation of Ruby, and we make available the IR of all our
runtime routines (in our case implemented in Java) to a powerful JIT, so that
we can inline from the interpreter into the runtime and back again.

In the case of Python, PyPy does the same thing, allowing the JIT to optimise
between the interpreter and runtime, as they're both written in RPython.

So I think the problem the Pyston project needs to solve is how to allow the
JIT to see the runtime routines and optimise through them like it does with
Python code.

~~~
brianwawok
Pypy makes your app take many times the memory for something like 20% better
perf. Which is good, but often maybe not worth the effort.

~~~
e12e
Eh...

    
    
      $ cat <<eof > float.py
      import itertools
      s = sum(itertools.repeat(1.0, 100000000))
      print(s)
      eof
    
      $ time python float.py 
      100000000.0
    
      real    0m0.602s
      user    0m0.596s
      sys     0m0.004s
    
      $ time python3 float.py 
      100000000.0
    
      real    0m0.603s
      user    0m0.600s
      sys     0m0.000s
    
      $ time pypy float.py 
      100000000.0
    
      real    0m0.211s
      user    0m0.088s
      sys     0m0.004s
    

That's with _no warmup_ for the pypy variant (or indeed the other python
variants). Or, slightly more "robust":

    
    
       $ python -m timeit -s "import itertools as i" \
                     "sum(i.repeat(1.0, 100000000))"
      10 loops, best of 3: 594 msec per loop
    
      $ python3 -m timeit -s "import itertools as i" \
                     "sum(i.repeat(1.0, 100000000))"
      10 loops, best of 3: 592 msec per loop
    
      $ pypy -m timeit -s "import itertools as i" \
                  "sum(i.repeat(1.0, 100000000))"
      10 loops, best of 3: 68.2 msec per loop
    

Pypy actually does pretty well here:

    
    
      $ cat float.cpp 
      #include<iostream>
    
      int main() {
        double s = 0;
        for (int i = 0; i < 100000000; ++i) {
            s++;
        }
    
        std::cout << s << std::endl;
        return 0;
      }
    
      $ g++ --std=c++14 -O3 float.cpp
      $ time ./float
      1e+08
    
      real    0m0.237s
      user    0m0.236s
      sys     0m0.000s
    

Note that the C++ code uses a loop, not a lazy generator. Lazy ranges may be
coming in C++17 as proposal N4286.

~~~
jerf
Summing a list of numbers is easy mode for a JIT. You've got a tight loop with
one type that can be statically shown never to be violated at run time.
Unfortunately, unless that's actually your workload, the speed with which a
JIT-based system can add numbers is not relevant to how fast it runs in
practice. Any JIT that can't tie C on that workload is broken somehow.

Personally, I think people often go quite overboard with the "benchmarks are
useless" idea, but this benchmark really is useless, because it will never
produce any differences between JITs and thus can't show whether one is good
or bad.

~~~
chrisseaton
> it will never produce any differences between JITs and thus can't show
> whether one is good or bad

It can tell you which JITs can't even manage to remove the loop, which is
useful to know.

~~~
e12e
Apparently none of cpython, pypy or gcc manages to remove the loop in this
case. I actually think it is interesting that this "slow" code in cpython is
within [ed: ~10x] of pypy/jit/machine code (c++ probably should do better; I'm
not all that familiar with gcc - maybe -O3 isn't enough to unroll loops and/or
vectorize).

Actually code like this arguably should be a win for a high-level language
with an optimization pass; ideally the whole thing should be translated to a
constant at compile-time.

~~~
chrisseaton
Ah right, I think that's because the accumulator is a double. I missed that. I
think it should still be possible, but compilers probably don't bother.

------
saboot
I never quite grasped the actual machinations of Python until I watched Philip
Guo's lectures on Python internals.

[https://www.youtube.com/watch?v=LhadeL7_EIU](https://www.youtube.com/watch?v=LhadeL7_EIU)

It's a bit long, and definitely spans several sittings, but I feel like I
really understand Python better now. Relevant to the post, it covers the
complexity and bookkeeping (frames, exceptions, objects, types, stacks,
references) that occur behind the curtain and drive Python's slow native
performance.

~~~
amelius
Is the slowness due to the structure of the language, or is it because of the
implementation?

I guess the latter, because PyPy performs a lot better I hear.

~~~
jerf
PyPy performs better, but when you perform 2-3x better than something
~40x slower than C, you still don't end up with a "fast implementation". Just,
"not as slow". If you've got Python code in hand and you want it to go faster,
PyPy can have a great bang-for-the-buck, but if you want it to be legitimately
approaching the limits of the capabilities of the hardware, you'll need a
different approach.

But let me once again underline that if you have Python code in hand, and you
want it to be faster, PyPy is a great option. I'm not being critical of PyPy.

A common mantra I've heard dozens of times in the last ~20 years is that
there's no such thing as a slow language, only slow implementations. But after
witnessing the effort to create "fast" implementations for a lot of slow
languages over the past 10 years, and seeing so many of them plateau out at
about 10x slower than C, I no longer believe this. Or at least, I no longer
believe it is practically true. If there is an implementation of Python
somewhere in theoretical program space that is as fast as C, it does not
appear to me that it will be possible for humans to produce it.

~~~
bluejekyll
I agree, though the speed of the JVM, for instance, is not quite as bad. C
comes at a development cost; Rust makes this better, but memory management is
still something that you have to get comfortable with.

The question that really nags at me is why do people want interpreted
languages in all of these cases? When you're deploying code, you inevitably go
through a series of steps in deployment where throwing in a compile wouldn't
destroy the workflow.

I think for many of these cases - the GIL is a great example - the language
has over-optimized for development at the cost of its runtime.
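A minimal sketch of what the GIL costs for pure-Python CPU-bound work
(timings are machine-dependent; on CPython the threaded version is typically
no faster than the serial one, since only one thread runs bytecode at a time):

```python
import threading
import time

def burn(n):
    # pure-Python busy loop: holds the GIL while executing bytecode
    while n:
        n -= 1

N = 2_000_000

t0 = time.perf_counter()
burn(N); burn(N)
serial = time.perf_counter() - t0

t0 = time.perf_counter()
threads = [threading.Thread(target=burn, args=(N,)) for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()
threaded = time.perf_counter() - t0

# On CPython, expect threaded to be roughly equal to serial, not half.
print(f"serial {serial:.2f}s, threaded {threaded:.2f}s")
```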

~~~
mikelevins
I'm guessing that people don't usually really want an interpreted language;
usually what they want is a language they like, and the one they like happens
to be interpreted.

I can imagine reasons to want an interpreted language. As a matter of fact,
I've written several implementations over the past 20 years of a hobby
language, some of them compiled, some of them interpreted, and in some of the
later cases I consciously chose interpretation because dynamic runtime
introspection and the ability to see (expanded) source code directly in the
runtime during a breakloop was something I wanted, because I was experimenting
with runtime semantics and I wanted to be able to see it directly at runtime
with the minimum possible change from the source code as-written.

I've also written interpreters sometimes because I'm actually interested in
interpreters per se.

But most people probably don't want interpreters for those kinds of reasons.
As I said, I think it's more likely that most of the time when someone wants
"an interpreted language", what they really wanted is some particular language
whose most prominent implementation happens to be interpreted.

That raises the question of why implementations are interpreted, of course.
The answers, I think, are some combination of the answers I gave above and the
fact that interpreters are really easy to write, especially if you choose the
right source language. Simple compilers are not much harder, but easier is
easier. I'm generally inclined to start with an easy interpreter, myself,
(unless what I'm interested in is compilation strategies) because I get from
zero to experimenting with semantics that much quicker, and experimenting with
semantics is usually where the fun is.

------
Animats
Poor article. This subject has been covered many times, and others have put in
the key references.

Python's dynamism is part of the problem, of course. Too much time is spent
looking up objects in dictionaries. That's well known, and there are
optimizations for that.
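The dictionary point is easy to observe directly: by default, instance
attributes really live in a dict. A quick sketch (conceptual only - CPython
softens this with key-sharing dicts and, in newer versions, inline caches):

```python
class Point:
    def __init__(self):
        self.x = 1
        self.y = 2

p = Point()
# p.x is (conceptually) a hash-table lookup performed at run time...
print(p.__dict__)              # {'x': 1, 'y': 2}
assert p.x == p.__dict__["x"]

# ...and the dict can be mutated at any time, so the lookup can't
# simply be compiled away:
p.__dict__["x"] = 99
print(p.x)                     # 99
```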

Python's concurrency approach is worse. Any thread can modify any object in
any other thread at any time. That prevents a wide range of optimizations.
This is a design flaw of Python. Few programs actually go mucking with stuff
in other threads; in programs that work, inter-thread communication is quite
limited. But the language neither knows that nor provides tools to help with
it.

This is a legacy from the old C approach to concurrency - concurrency is an OS
issue, not a language issue. C++ finally dealt with that a little, but it took
decades. Go deals with it a little more, but didn't quite get it right; Go
still has race conditions. Rust takes it seriously and finally seems to be
getting it right. Python is still at the C level of concurrency understanding.

Except, of course, that Python has the Global Interpreter Lock. Attempts to
eliminate it result in lots of smaller locks, but not much performance
increase. Since the language doesn't know what's shared, lots of locking is
needed to prevent the primitives from breaking.

It's the combination of those two problems that's the killer. Javascript has
almost as much dynamism, but without extreme concurrency, the compiler can
look at a block of code and often decide "the type of this can never change".

~~~
EE84M3i
Note that threading is not the only place this can happen: it can also happen
as the result of signal handlers. cpython can trigger signal handlers between
any two python opcodes.
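A minimal sketch of that behavior (assumes Python 3.8+ for
`signal.raise_signal`; CPython defers delivery to the boundary between two
bytecode instructions of the main thread):

```python
import signal

calls = []

def handler(signum, frame):
    # This Python-level handler runs between two bytecode instructions
    # of the main thread, so it can mutate any state "out from under"
    # whatever code happened to be executing.
    calls.append(signum)

signal.signal(signal.SIGINT, handler)
signal.raise_signal(signal.SIGINT)  # delivered at the next opcode boundary
print(calls)  # the handler has already run by the time we get here
```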

------
beagle3
(based on discussion, can't get to website)

Any discussion that does not compare to LuaJIT2 is suspect in its conclusions.
On the surface, Lua is almost as dynamic as Python, and LuaJIT2 is able to do
wonders with it.

Part of the problem with Python (one that Lua doesn't share) is that things
you use all the time can potentially shift between two loop iterations, e.g.

    
    
        for x in os.listdir(PATH):
            y = os.path.join(PATH, x)
            process_file(y)
    

There is no guarantee that "os.path" (and thus "os.path.join") called by one
iteration is the same one called by the next iteration - process_file() might
have modified it.

It used to be common practice to cache useful routines (e.g. start with
"os_path_join = os.path.join" before the loop and call "os_path_join" instead
of "os.path.join"), thus avoiding the repeated lookup on each iteration. I'm
not sure why it isn't common anymore - it would also likely help PyPy and Pyston
produce better code.
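The caching trick looks like this (a sketch; the hoisting speedup is modest on
modern CPython, which caches attribute lookups, but the two versions are easy
to compare):

```python
import os.path
import timeit

names = ["file%03d" % i for i in range(1000)]

def lookup_every_time():
    # two attribute lookups (os.path, then .join) on every iteration
    return [os.path.join("/tmp", n) for n in names]

def hoisted():
    join = os.path.join  # resolve the dotted name once, before the loop
    return [join("/tmp", n) for n in names]

assert lookup_every_time() == hoisted()
print(timeit.timeit(lookup_every_time, number=200))
print(timeit.timeit(hoisted, number=200))
```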

This is by no means the only thing that makes Python slower - my point is that
if one looks for inherently slow issues that cannot be solved by an
implementation, comparison with Lua/LuaJIT2 is a good way to go.

~~~
xamuel
There isn't even any guarantee that "os.path" is the vanilla attribute lookup
it's presented as. For all we know, it could be a @property-decorated method
thousands of lines long.

~~~
beagle3
But as long as you know it is the same thing, you could move ("hoist") it
outside the loop. In Lua, you generally can (and LuaJIT2 does). In Python,
rarely if ever.

------
jacquesm
Python is high level glue between efficient C functions. It's blindingly fast
_if_ you are allowing those efficient C functions to do the heavy lifting.
That's why you can do image processing in python, right up until the point
where something you need to do doesn't have a ready made primitive and then
your program will slow down tremendously if you don't take the time to write
the thing you're missing in C.

~~~
sitkack
As glue, it isn't particularly efficient, either in runtime or in programmer
affordances. Ctypes and cffi are both fairly new and not very friendly. The
number of programmers who "drop to native" is ridiculously small. A better
glue language would make this almost transparent.

~~~
vegabook
every python programmer who has ever used Numpy or Pandas - and in my opinion
this is where Python shines and why it's so huge in scientific programming -
is "dropping into native". So actually a large number of people are doing so. And
I find Python to be an excellent glue language with almost anything I can
think of being possible, and much of my heavy lifting being extremely
efficient, especially if you use a modern AVX-enabled Numpy.

Arguably anybody who ever accessed a database in Python is also "dropping into
native". That's why no sane database is written in Python, but plenty of
database-using applications are.

The only language I have discovered that approaches the efficiency of Python
as "glue" is R, but it's about 20x slower, and doesn't even try to be threaded
(which can be a big problem for IO sensitive glue tasks).

~~~
sitkack
The fact that extensions using native code exist doesn't mean that low-friction
affordances exist for the median programmer to use native code in their
applications. I think we will start to see some interesting projects in this
space after Python 3.6 ships.

------
0xmohit
The following may also be of interest:

\- Why Python is Slow: Looking Under the Hood [0]

\- Fast Python, Slow Python by Alex Gaynor [1]

[0] [https://jakevdp.github.io/blog/2014/05/09/why-python-is-
slow...](https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/)

[1]
[https://www.youtube.com/watch?v=7eeEf_rAJds](https://www.youtube.com/watch?v=7eeEf_rAJds)

------
merb
Actually I never found Python "slow". I found it "slow" at some things, but
not "all things are slow with python". What was problematic was pulling big
lists from a database and doing some stuff with them.

Also it was really awkward that "threading" is not really great in python,
especially not when you are using 16-core servers. I mean, you could create
vm's for that or dockerize it, but that means deployment complexity
increases, which wasn't our goal. But I've seen a lot of successful python
deployments, and if you have enough manpower you pretty surely can run with
CPython just fine.

~~~
brianwawok
On a 16-core server you can run 32 copies of your program, a la Celery. It
will peg the CPUs just dandy.

~~~
retrogradeorbit
My experience is that it is not "dandy". We have had a lot of trouble pegging
the CPUs on our celery worker boxes (doing CPU bound jobs). You get more than
one CPU utilised, sure, but we never can seem to get all cores fully utilised.
We rewrote some of the tasks into a single multi-threaded JVM process pulling
off the rabbit queue and they instantly and consistently pegged every CPU at
100%. I wish I knew how to get our celery worker farm to full utilisation
because it would save us a fair bit of money.

~~~
jstanley
He's saying run 32 individual processes, not 32 threads within one process.
Python's global interpreter lock will knobble you if you're using threads.

~~~
darpa_escapee
Celery has a worker pool of separate Python processes that jobs can be
offloaded to. It sidesteps the GIL because it doesn't use threading.

~~~
brianwawok
Well two ways. Either launch 32 or 64 copies of the process using
multiprocessing, or 1000 threads with geventlet.

I have never found a box I couldn't peg :)

I guess if you had 1 gb ram and 16 cores it would be challenging in python..
but a few gb of ram and we are on.

------
eldude
Can someone ELI5 why Python is more like rendering HTML than executing JS?
This is confusing to me since much of node.js/V8 is C, yet AFAIK (from the
title and my experience) it's faster, and I don't recall anything
intrinsically more declarative (i.e. HTML) when writing python compared to JS.
They both feel very similar as scripting languages to me.

IOW, from my limited ignorant perspective, this feels more like the WHAT than
the underlying differentiating WHY.

It's possible it's in there and I missed it.

EDIT: FWICT from the linked slides[1], it's the result of 2 issues: 1.
Expensive dynamic language features and 2. python is like node.js, but as if
you only called V8 bindings and so VM performance was irrelevant. This is
strange to me; while I feel I can conceptualize the difference, I still don't
know enough to understand why it is so compared with node.js.

[1]
[https://docs.google.com/presentation/d/10j9T4odf67mBSIJ7IDFA...](https://docs.google.com/presentation/d/10j9T4odf67mBSIJ7IDFAtBMKP-
DpxPCKWppzoIfmbfQ/mobilepresent?slide=id.g142ccf5e49_0_453)

~~~
dvogel
The HTML comparison was referring to the python bytecode rather than the
language. A more apt comparison would be CISC vs RISC processors, where the V8
IR is more like a RISC processor.

------
alephnil
The argument is that Python spends most of its time in the C runtime, but that
is really only a property of the current implementation. If I use Jython, or
even implement Python on top of bare metal, making Python the OS, it would not
spend any time in the C runtime because there wouldn't be one - but that's
beside the point.

The question is whether an implementation could be made significantly faster
than the current Python interpreter, or whether there are properties of
Python's semantics that make that hard. One such thing is the amount of
dynamic behavior that is allowed in Python, which requires most values to be
boxed, even for basic types like numbers. There are dynamic languages (LuaJIT
was mentioned, but Javascript, Julia and several Lisp implementations could be
mentioned as well) that are considerably faster than Python, so why isn't
Python fast?

Personally I think that Python could be made quite a bit faster than it is,
but such a new Python system would almost certainly be incompatible with a
large number of the Python libraries that interface with native code. For most
Python users, this availability of libraries is a major motivation for using
Python in the first place, and a new fast implementation without such
compatibility would be worthless for most users.

------
jlarocco
The notion that Python spends most of its time in its C code isn't
particularly insightful. That's how interpreters traditionally work. It just
raises the question of _why_ that C code is slower than other interpreters'.

Personally, I don't mind Python's speed. I don't use it for runtime speed, I
use it for development speed. If I need runtime speed I use C++. Lately,
though, I'm getting to the point where I write Common Lisp about as quickly as
Python _and_ it typically runs 4-5x faster than Python, so I've just been
using that.

------
e12e
From the linked [lwn] article:

"Another example he gave demonstrates the slowness of the C runtime:

    
    
        import itertools
        sum(itertools.repeat(1.0, 100000000))
    

That will calculate the sum of 100 million 1.0s. But it is six times slower
than the equivalent JavaScript loop. Float addition is fast, as is sum(), but
the result is not.

Larry Hastings asked what it was that was slowing everything down. Modzelewski
replied that it is the boxing of the numbers, which requires allocations for
creating objects. Though an audience member did point out with a chuckle that
you can increase the number of iterations and Python will still give the right
answer, while JavaScript will not."

Reminded me about the excellent talk about Julia: "Julia: to Lisp or not to
Lisp?"
[https://www.youtube.com/watch?v=dK3zRXhrFZY](https://www.youtube.com/watch?v=dK3zRXhrFZY)

One thing he points out early is that both the C99 and the R6RS Scheme specs
are 20% numerical. Correct and (reasonably) fast numbers and arithmetic are
actually pretty hard to get right on a computer - if you want to abstract
away hardware "short-cuts" and allow for precise arithmetic by default.

It will be interesting to see how much type hints (eliminating some of the
boxing/unboxing) will help python. And if it turns out to really be a good fit
for the language -- everyone wants "free" performance, but transitioning to a
(even partially) typed language is certainly not "free".

Another point about the above loop: ideally, even if it can't be
optimized/memoized down to a constant, it really shouldn't have to be much
slower than its C counterpart, except for handling bignums in some way or
other (perhaps on overflow only).

[lwn] [https://lwn.net/Articles/691243/](https://lwn.net/Articles/691243/)

~~~
bjourne
> Another point with the above loop, is that ideally, even if it can't be
> optimized/memoized down to a constant - it really shouldn't have to be much
> slower than its C counterpart. Except for handling bignums in some way or
> other (perhaps on overflow only).

Sadly, bignum handling is very expensive even with type hinting. Essentially,
every time two numbers are added or subtracted you have to check for overflow:

    
    
        int z = hw_add(x, y);
        if (over/under-flowed?()) {
          bignum* bn = new bignum();
          bignum_add(to_bignum(x), to_bignum(y), &bn);
          return bn;
        }
        return z;
    

So most modern statically typed languages (none named, none forgotten!)
actually cheat and use modular arithmetic instead. For example, the
straightforward way of computing 9223372036854775807 + 1 on a 64bit system in
one of those languages is likely to yield an incorrect result.

Which imho is complete bullshit because there is no reason to sacrifice
correctness for performance other than to win benchmarking contests.
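Python itself sits on the correctness side of this trade-off: its ints are
arbitrary precision, so the sum above doesn't wrap. A quick check:

```python
# No wraparound: CPython promotes to a bignum transparently when a
# result no longer fits in a machine word.
print(9223372036854775807 + 1)  # 9223372036854775808, not a negative wrap
print(2 ** 64 + 1)              # still exact, well past 64 bits
```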

~~~
rurban
This overflow check is extremely cheap: just check the overflow flag on the
cpu. jo +4; jmp bignum; 6 bytes. C compilers didn't support it until recently,
so everyone just used inline assembly. Now we have it in gcc 5 and clang 3.4.

One good trick, started by lua, is to use double for everything and check the
mantissa of the normalized result for overflow. A double can represent the
whole int32 range. With doubles and nan-tagging you have a unique, fast
representation of all primitives and can use a double-word stack, which is the
biggest win. Look at how python, perl and ruby do this. Horrible. Then look at
how a fast dynamic language does it.

~~~
bjourne
Everything is relative. :) Compared to the add instruction itself, the
overflow check is very expensive. But the big problem is that your types
widen. E.g:

    
    
        // x and y are hardware integers >= 0
        var z = x + y
        for (var i = 0; i < z; i++) {
            ...
        }
    

If your language used modular arithmetic, 'z' would also be a hardware integer
and the induction variable 'i' could be kept in a cpu register. But with real
arithmetic using bignums you can't assume that anymore and each iteration of
the loop must do a relatively expensive comparison of 'i' against 'z'. In
pseudo code:

    
    
        var z = x + y
        var i = 0
        if (!hw_int(z)) i = new_bignum(0)
        while (true) {
          // i < z
          if (hw_ints?(i, z)) {
            if (i >= z) break
          } else {        
            if (bignum_gte(i, z)) break
          }        
          ... body ...
          // i++
          if (hw_int?(i)) {
            i++;
          } else {
            i = bignum_add(i, 1)
          }
        }

~~~
rurban
No, it's super cheap. Overflows happen <1% of the time, so you mark the branch
as UNLIKELY, and the branch prediction unit just doesn't care about all those
bignum branches.

If you know you need bigints or bignums, just type them and use their variants
from the beginning.

Most better dynamic languages already use tagged ints, so they have only say
~30 bits for an int and do a lot of those overflow checks, and they still beat
the hell out of the poorly designed dynamic languages with fully boxed ints or
nums.

------
makecheck
Python has had some surprising costs but some well-known cases have
straightforward work-arounds.

For example, at one point, the _way_ you called a multi-dot function actually
mattered (not sure if this persists in the latest Python releases). If your
function is like "os.path.join", the interpreter would seem to incur a cost
for each dot: once to look up "os.path" and once to look up "path.join". This
meant that there was a noticeable difference between calling it directly
several times, and calling it only through a fully resolved value (e.g. say
"path_join = os.path.join" and call it only as "path_join" each time).

Another good option is to use the standard library’s pure-C alternatives for
common purposes (containers, etc.) if they fit your needs.

------
__s
Yet we got a ~5% speed advantage when I refactored bytecode to wordcode
[https://bugs.python.org/issue26647](https://bugs.python.org/issue26647)

------
alayne
Just because you're in the C runtime doesn't mean you're doing productive
work. If it isn't the bytecode VM that is slow, then why is PyPy able to be
much faster in many cases?

In my experience, Python is easily 10-20x slower than a compiled language when
doing computational work where you can't just call into a big C function, just
like you would expect to see from any interpreter. I won't generally use it
for anything data intensive.

------
saynsedit
What's expensive about the runtime is the redundant type/method-dispatch not
the opcode dispatch. The runtime is constantly checking types and looking up
methods in hash tables.

Gains can be made by "inter-bytecode" optimization in the same vein as inter-
procedural optimization.

If you can prove more assumptions about types between the execution of
sequential opcodes, you can remove more type checks and reduce the amount of
code that must be run and make it faster.

E.g.:

    
    
        01: x = a + b
        02: y = x + c
    

If we have already computed that a and b are Python ints, then we can assume
that x is a Python int. To execute line 2, we just then need to compute the
type of c, thus saving time.

The larger the sequence of opcodes you are optimizing over, the better gains
you can make.

Right now I think Pyston uses a traditional inline-cache method. This only
works for repeated executions. The code must eagerly fill the inline cache of
each opcode to get the speed I'm talking about.

Another reason Python's runtime is slow is that there is no such thing as
undefined behavior, and programming errors are always checked for. E.g. the
sqrt() function always checks whether its argument is >= 0, even though it's a
programming error to use it incorrectly and that should never happen. This
can't be fixed by a compiler project; it's a problem at the language level.
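For instance, CPython's math.sqrt really does validate its argument and raise
rather than produce undefined behavior:

```python
import math

try:
    math.sqrt(-1.0)
except ValueError as e:
    # the C implementation checked the domain before computing anything
    print("checked:", e)
```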

Being LLVM based, I think Pyston has its greatest potential for success as an
ahead-of-time mypy compiler (with type information statically available). IMO,
leave the JITing to PyPy.

------
max_
Why can't someone just write another compiled language with 100% Python
syntax? E.g. Julia, only more general-purpose.

~~~
allendoerfer
I think Nim ([http://nim-lang.org/](http://nim-lang.org/)) is what you are
looking for.

~~~
max_
Is that a joke? cause I find it really funny..

~~~
alehander42
why?

~~~
max_
Nim's syntax is far away from that of Python; it's more like "clean"
JavaScript.

I wasn't intending to be sarcastic. I apologize if it came across that way.

~~~
alehander42
I think it's very pythonic.

Can you give examples?

Also, "clean" JavaScript can look like CoffeeScript which look very much like
Python/Ruby, you can almost say that "clean" versions of languages tend to
converge to something like that for some ppl

~~~
collyw
CoffeeScript is terrible. Full of implicit rules and magic. Python generally
seems to be the opposite.

~~~
alehander42
I don't think it's magical syntactically. Semantically, maybe?

------
rurban
The arguments presented here are extremely dumb. Of course everyone already
knows that the ops themselves are slow. What not many people know is that all
of this could easily be optimized. JavaScript had the very same problems, but
had good engineers to overcome the dynamic overhead. PHP 7 just restructured
their data; Lua and LuaJIT are extreme examples of small data and ops winning
the cache race. Look at V8, which was based on Strongtalk. Look at Guile. Look
at any Lisp. All these problems were already solved in the '80s, and then
again in the '90s, and then again in the 2000s.

Python, like Perl and Ruby, is plagued by dumb engineering; Python is arguably
the worst. They should just look at Lua or V8: how to slim down the ops to one
word, how to unbox primitives (int, bool), how to represent numbers and
strings, how to speed up method dispatch, how to inline functions, how to
optimize internals, how to call functions and represent lexicals, how to win
with optional types. Basically almost everything done in Python, Perl and Ruby
is extremely dumb. And it will not change. I'm still wondering how PHP 7 could
overcome the very same problems, and I believe it was the external threat from
HHVM and some kind of inner revolution that convinced them to rewrite it. I
blame Google, who had the chance to improve it after they hired Guido, and
went nowhere. You still have the old guys around resisting any improvements.
They didn't know how to write fast VMs in the old days, and they still don't.

~~~
orf
> javascript had the very same problems, but had good engineers to overcome
> the dynamic overhead

Sure, it also had Google to funnel an insane amount of money and engineering
time/skill into it. If it's so simple to do then put all your ideas into a PEP
and submit it.

~~~
rurban
A PEP? Python needs to be rewritten completely. Cython, PyPy or mypy are good
starts, but their op and data layout is still suboptimal and not best
practice.

And I have no interest in Python as a language, only in fast dynamic VMs. E.g.
Potion could just add a Python parser and get a 200x function-call speedup,
just as was done for Ruby or Perl. The problem is the library.

------
shadowmint
Python is made faster by optimizing the C implementation of its opcodes, not
by doing any optimization at a higher level?

Really?

Sounds more like: it's _more convenient for us_ to try to optimize CPython at
the opcode level, because it's _technically difficult_ to apply JIT techniques
at a higher level given _the way cpython is implemented_.

    
    
        Pyston's performance improvements come from speeding up the C code, not the Python code.
    
        When people say "why doesn't Pyston use [insert favorite JIT technique here]", my question
        is whether that technique would help speed up C code.  I think this is the most fundamental
        misconception about Python performance: we spend our energy trying to JIT C code, not Python 
        code.  
    
        This is also why I am not very interested in running Python on pre-existing VMs, since that 
        will only exacerbate the problem in order to fix something that isn't really broken.
    

...really?

All I can fathom is that this project is about trying to take the existing
cpython implementation and make it faster by applying various magical hacks at
a very low level, rather than trying to address any of the more difficult
problems about why the cpython runtime is slow.

This is exactly the opposite approach from PyPy (i.e. reimplement CPython in a
way which is fundamentally better); and it certainly seems to be yielding some
interesting results.

...but I think I'm a little skeptical that it's the only solution.

It just happens to be the solution they've decided to pursue.

------
elchief
Who cares? You're not using it because it's performant. You use it because
it's fast to code in and to ship. If you only care about performance, write C
or Java or assembly and extend your ship date.

~~~
collyw
Obviously being down voted for being too pragmatic.

------
markhahn
Important clarification: when the author says "C runtime", what he means is
"the Python runtime, written in C". The C runtime proper is, of course, libc,
libm, etc. I'm not sure why the author thinks Python's high-level IR (please
don't call it a VM) is a good thing, or unoptimizable (to a better IR or to
native code). Perhaps he's never read about Smalltalk/Self optimization (which
is eye-opening!).

------
ebbv
For me this post isn't an argument against Mozilla implementing a similar
dialogue; it's an argument for implementing one and improving it.

To really have meaningful improvements would require totally reworking how
extensions deal with the DOM and requiring them to ask for different levels of
permission. That would break most plugins as written and probably require
significant work to implement.

------
dicroce
IMHO it's about references (pointers) and the inability of the memory
prefetcher to optimize memory accesses. To fix this, languages need true value
types.

------
grx
Why is the website down?

~~~
kmod
Something in my server's configuration eats more and more memory, until the
OOM killer decides that killing the MySQL database looks like a dandy way to
reclaim memory.

~~~
poooogles
Try Varnish; for stuff like this it's pretty perfect.

------
therestisgone
Very nice to see some circular reasoning here. Because Python is slow,
everyone dispatches to C code. Because everyone dispatches to C code,
optimising the Python interpreter is not worth it.

------
bellajbadr
Error establishing a database connection

~~~
0xmohit
At least it doesn't emit the entire stack trace along with the message!

Question: are static websites so hard to do?

~~~
anc84
Static websites take away one of the crucial pieces of blogging: Comments.

~~~
ssalazar
It's trivial in Jekyll to add support for a 3rd-party comment service, e.g.
Disqus.

~~~
e12e
Only if you are comfortable using a 3rd-party comment service (or injecting
3rd-party JavaScript into your site in the first place). I know it's common,
but even with the advances in CORS and whatnot, it still (IMNHO) defeats a lot
of the benefits of a static web site.

That said, I would probably prefer embedding disqus or
[https://muut.com/](https://muut.com/) comments to running a complex php
application _just for comments_.

------
known
Any scripting language is slow

~~~
rurban
Only the old dumb ones (Python, Perl, Ruby; formerly also PHP). Normal dynamic
scripting languages are written by engineers and are therefore very fast.

~~~
pkd
Are you implying that people like Larry Wall, Yukihiro Matsumoto and Guido van
Rossum, all of whom have advanced degrees in computer science, are not (good?)
engineers?

~~~
rurban
They might be good hackers, but have no idea how a VM should be implemented.
That's why you got this mess.

Just look at how they designed the ops, the primitives, the calling
convention, the stack, the hash tables. This is not engineering; it was
amateurish compared to existing practice.

~~~
pkd
Can you give some examples of more professional VMs which support dynamic
languages without an extra compilation step apart from the JVM?

~~~
rurban
I gave already and I didn't call them more professional. I called them
engineered, in opposite to those simple adhoc implementations. The only part
why worse is better in this regard was because of getting business attraction.

Lua, V8, Guile (but just last year), Smalltalk and family (esp. Strongtalk and
the other pre-V8 projects), ML and family (MLton and OCaml), Haskell (after
many years of being too slow), any better Scheme (>20), any better Lisp (>5),
PHP 7 (just last year), and then the tiny Lua-based ones, like Potion, Wren,
tinyrb, tinypy or TvmJIT (with a simple method AST->JIT), or partially Io,
Lily, or the other post-Smalltalk languages, like Maru (simplest AST->JIT) or
the soda family around Alan Kay and Ian Piumarta.

When you design a 20-100x slower calling convention without proper support for
lexical variables, with slow data, no unboxing, no tagging, slow ops, no
inlining and no fast memory reclamation, you are in the pre-SICP days and
should be treated accordingly.

Not even talking about the modern static languages and type systems which are
now catching up via LLVM: not Go or Rust, but stuff like Pony or Julia.

------
SFJulie
Well, C/Fortran/C++/ASM are a PITA when it comes to dynamic structures like
the ones used for handling configurations.

Plus the author missed that boxing in Python tends to fragment data in memory
in an uncontrollable way, making the L1/L2/L3 caches (on x86), or any memory
architecture with a similar layout, very hard to use effectively.
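A rough way to see this boxing overhead from within Python itself (a sketch; exact object sizes vary by CPython version and platform):

```python
import array
import sys

# Each element of a Python list of floats is a separate boxed heap object,
# reached through a pointer; an array.array stores raw 8-byte C doubles
# contiguously.
boxed = [float(i) for i in range(1000)]
packed = array.array("d", range(1000))

print(sys.getsizeof(boxed[0]))   # size of one boxed float (e.g. 24 on 64-bit CPython)
print(packed.itemsize)           # -> 8 (one raw C double per element)
# The list also stores a pointer per element, and each box can live anywhere
# on the heap -- hence the cache-unfriendly access pattern described above.
```

The pointer-chasing, not just the extra bytes, is what defeats the prefetcher: consecutive list elements are not necessarily consecutive in memory.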

If you code often you know the 80/20 rule: 80% of your code is the setup
preparing for the 20% of heavy lifting.

Numpy (which relies on Fortran) is a nice proof that, when done correctly,
Python is really useful.

Computing a moving average in pure Python is 10,000 times slower than using
numpy (especially if you use an FFT).
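The exact speedup depends on the window size and implementation, but the shape of the comparison looks like this (a sketch assuming numpy is installed; `moving_average_py` and `moving_average_np` are illustrative names, not library functions):

```python
import numpy as np

def moving_average_py(xs, k):
    """Simple moving average in pure Python: one interpreted loop per window."""
    return [sum(xs[i:i + k]) / k for i in range(len(xs) - k + 1)]

def moving_average_np(xs, k):
    """The same computation vectorized in numpy: a single C-level pass."""
    return np.convolve(xs, np.ones(k) / k, mode="valid")

data = list(range(10))
print(moving_average_py(data, 3))
print(moving_average_np(data, 3))
```

Both produce the same values; the numpy version does all the arithmetic on unboxed doubles in C, which is where the orders-of-magnitude gap comes from on large inputs.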

So... I am saying that since Python (like Tcl or Perl) is a good language for
FFI (foreign function interface) work, it should be used this way.

And thanks to the often unfairly hated GIL, it makes it possible to use non-
thread-safe foreign-language libraries in a thread-safe way.

All said and done, if Python used this way is slow, I dare say it is because
some coders do not understand how to build their data/execution flow. And this
is not language-dependent, but a question of the coder.

I thought there were 10x coders in the past.

I recently realized there are /10 coders in fact.

The ones who get pissed at coding when «it does not work the way it should»
and expect the language to magically do most of the job without learning.

The 1x coders, on the other hand, are boring, slow coders who accept that when
the «stuff» is not behaving the right way, it may not be the stuff that is
broken, but they themselves who have a misconception.

Being a 1x coder is not a state; it is a trajectory that can degrade or
improve with new challenges and a poor/good state of mind, hence the
misconception about 10x.

~~~
lokedhs
It is possible to make a dynamic language natively compiled and very fast.
Common Lisp is arguably even more dynamic than Python, and SBCL manages to
generate code that rivals C in performance.

~~~
PeCaN
I don't consider Common Lisp “more dynamic” than Python, at least not in the
sense that you can modify code at runtime and such. Frequently, Common Lisp
code is not particularly dynamic, because you can get the same level of
expressiveness without relying on mucking around with magic at runtime.

Also SBCL only generates really fast code if you use a lot of type hints and
(optimize (speed 3) (safety 0)). That said, when you do, it's _really_ fast.

~~~
lispm
Typically, Common Lisp and its implementations provide a wide range of
performance options. It's not limited to the model of some more primitive
language runtimes, which provide only two modes (simplified): slow if
implemented in the language itself, fast if it runs mostly in runtime and
library functions written in C. There are Common Lisp implementations which
follow this model too, but there are also many which have sophisticated
runtimes with optimizing native-code compilers.

To get to really fast code in Common Lisp one needs to write relatively low-
level code with type declarations, optimization hints, low-level operators
etc.

There is a part of Common Lisp which is as dynamic as (or more dynamic than?)
Python: everything written in CLOS (multi-dispatch, multi-methods, method
combinations, ...), especially when using the CLOS MOP. Is it slower or
faster than comparable Python code? I have no idea and I haven't seen any
interesting benchmarks. The amount of CLOS use depends on the implementations.
Some implementations have a lot of their library code and also much of the
language itself (everything IO, error handling, ...) written using CLOS.

~~~
lokedhs
I have written plenty of heavily CLOS-based code that is still very fast. I
doubt that Python could come close in performance.

I think it would be an interesting exercise to do some benchmarks to get some
actual data behind both of our guesses.

As I am not a very experienced Python developer, would you be interested in
writing some test code that is representative of typical Python code? I'd be
happy to port it over to CLOS so that we can run some benchmarks.

I think such results would be of interest to the HN audience.

