
Python performance: it’s not just the interpreter - kmod
http://blog.kevmod.com/2020/05/python-performance-its-not-just-the-interpreter/
======
barrkel
The article is a fine example of incremental optimization of some Python,
replacing constructs that the standard Python interpreter executes with
significant overhead with others that trigger fewer of those same overheads.

The title isn't quite right, though. Boxing, method lookup, etc. come under
"interpreter" too.

There's a continuum of implementation between a naive interpreter and a full-
blown JIT compiler, rather than a binary distinction. All interpreters beyond
line-oriented things like shells convert source code into a different format
more suitable for execution. When that format is in memory, it can be
processed to different degrees to achieve different levels of performance.

For example, if looking up "str" is an issue, an interpreter could cache the
last function it got for "str" conditional on some invalidation token (e.g.
"global_identifier_change_count"), so it doesn't in fact need to look it up
every time.
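
A toy model of that in Python itself, just to show the shape - the cache and
the invalidation counter here are invented for illustration; a real version
would live inside the interpreter:

    
    
        # bumped whenever any global is rebound (hypothetical invalidation token)
        global_identifier_change_count = 0
        
        def set_global(env, name, value):
            global global_identifier_change_count
            env[name] = value
            global_identifier_change_count += 1   # invalidates all caches below
        
        def make_cached_lookup(env, name):
            cache = {"version": -1, "value": None}
            def lookup():
                if cache["version"] != global_identifier_change_count:
                    cache["value"] = env[name]    # slow path: a real dict lookup
                    cache["version"] = global_identifier_change_count
                return cache["value"]             # fast path: one int compare
            return lookup
        
        env = {"str": str}
        lookup_str = make_cached_lookup(env, "str")
        assert lookup_str() is str
        set_global(env, "str", repr)              # rebinding invalidates the cache
        assert lookup_str() is repr
    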

Boxing can be eliminated by using type-specific representations of the
operations, and choosing different code paths depending on checking the actual
type. Hoisting those type checks outside of loops is then a big win. Add
inlining, and the hoisting logic can see more loops to move things out of.
Inlining also gets rid of your argument passing overhead.
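
In Python-flavored pseudocode, the shape of that specialization might look
like this (hypothetical - a real implementation would guard and hoist inside
the interpreter, not in user code):

    
    
        def sum_generic(xs):
            total = 0
            for x in xs:
                total = total + x    # boxed values, dynamic dispatch per iteration
            return total
        
        def sum_specialized(xs):
            # type check hoisted outside the loop: guard once on the first element
            if xs and type(xs[0]) is int:
                total = 0
                for x in xs:         # an interpreter could run this loop on an
                    total += x       # unboxed int representation after the guard
                return total
            return sum_generic(xs)   # fall back to the generic path
    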

None of this requires targeting machine code directly; you can optimize the
interpreted code itself, and in fact that's what the back end of a
retargetable optimizer looks like - the intermediate representation is an
interpretable format.

Of course things get super-complex super-quickly, but that's the trade-off.

~~~
kmod
I have to disagree that those are "interpreter" overheads, since the later
versions of the benchmark do not use the Python interpreter at all yet still
suffer from these overheads. Maybe disagreeing on the wording is pedantic
though, since I think the real discussion is what can be done about it.

We definitely agree directionally: you can create specific implementations of
operations that are fast for their inputs. But look at how specific we have to
get here. It's not just on the type of the callable: we have to know the exact
callable we are calling (the str type). And its behavior is heavily dependent
on its argument type. So we need a code path that is specifically for calling
the str() function on ints. I would argue that this is prohibitively
specialized for an interpreter, but one can imagine a JIT that is able to
produce this kind of specialization, and that's exactly what this blog post is
trying to motivate.

~~~
acqq
> So we need a code path that is specifically for calling the str() function
> on ints. I would argue that this is prohibitively specialized for an
> interpreter,

I would argue it isn't. It's actually less about the ints but about the tuple
allocation:

"Now that we've removed enough code we can finally make another big
optimization: no longer allocating the args tuple This cuts runtime down to
1.15s." (down from 1.40s)

It seems to me that having a special case for one argument, avoiding the
tuple allocation on each call, is nothing "prohibitively specialized" and
would benefit all the functions called with one argument, of which there are
plenty.

Regarding ints vs. something else - somewhere in the code there is already
different code that does str of an int vs. str of something else. It's just
about dispatching to that code for that type, and that doesn't have to be
costly given the speculative execution in today's CPUs. The people who
produce compiled code know about that and can exploit it.

~~~
kmod
There's a minor but important point: the amount of work str() has to do
depends on its input type. In particular, depending on whether arg.__str__()
returns a string, a subtype of string, or not a string at all, str() will
call different tp_init functions. So we can't get rid of the args tuple until
after the optimization of knowing that arg.__str__() returns a string (which
is why I did them in that order). So I guess the condition is a little weaker
than str(int), but it's more than str() of one arg.
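
From the Python side, a small illustration (my example, not from the post) of
how the work depends on what __str__ returns:

    
    
        class SubStr(str):
            pass
        
        class ReturnsStr:
            def __str__(self):
                return "plain"           # the fast case: an exact str
        
        class ReturnsSubtype:
            def __str__(self):
                return SubStr("sub")     # a str subtype: handled, but differently
        
        class ReturnsNonStr:
            def __str__(self):
                return 42                # not a string at all
        
        print(str(ReturnsStr()))         # plain
        print(str(ReturnsSubtype()))     # sub
        try:
            str(ReturnsNonStr())
        except TypeError as e:
            print(e)                     # __str__ returned non-string (type int)
    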

~~~
burfog
I think you've proven that the language is a lost cause. It's never going to
perform OK.

With all the breakage of Python 3, it's a shame that the causes of slowness
were not eliminated. The language could have been changed to allow full ahead-
of-time compilation, without all the crazy object look-up. Performance might
have been like that of Go.

~~~
khc
Then it will become Perl 6, which no one uses.

~~~
lizmat
Perl 6 is mostly not used because it has been renamed to Raku
([https://raku.org](https://raku.org) using the #rakulang tag on social
media). Raku is very much being used. If you want to stay up-to-date with Raku
developments, you can check out the Rakudo Weekly News
[https://rakudoweekly.blog](https://rakudoweekly.blog)

------
umvi
I've found it often doesn't matter how fast or slow Python is if the
bottleneck is outside of Python's control.

For example, I wrote an lcov alternative in Python called fastcov[0]. In a
nutshell, it leverages GCC 9 gcov's ability to send JSON reports to stdout to
generate a coverage report in parallel (utilizing all cores).

Recently someone emailed me and advised that if I truly wanted speed, I
needed to abandon Python for a compiled language. I had to explain, however,
that as far as I can tell, the current bottleneck isn't the Python
interpreter, but GCC's gcov. Python 3's JSON parser is fast enough that
fastcov can parse and process a gcov JSON report _before_ gcov can serialize
the next one.

So really, if I rewrote it in C++ using the most blisteringly fast JSON
library I could find, it would just mean the program will spend more time
blocking on gcov's output.
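
The pipelined structure is roughly this - a simplified sketch, not fastcov's
actual code, assuming GCC 9's `--json-format` and `--stdout` gcov flags:

    
    
        import json
        import multiprocessing
        import subprocess
        
        def parse_gcov(gcda_path):
            # ask gcov (GCC >= 9) to emit its JSON report on stdout,
            # so no intermediate files are written or read
            out = subprocess.check_output(
                ["gcov", "--json-format", "--stdout", gcda_path])
            return json.loads(out)
        
        def coverage_reports(gcda_paths):
            # one gcov per core; json.loads on one report finishes before
            # the next gcov process has serialized its output
            with multiprocessing.Pool() as pool:
                return pool.map(parse_gcov, gcda_paths)
    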

In summary: profile your code to see where the bottlenecks are, and then fix
them. Python is "slow", yes, but often the bottlenecks are outside of Python,
so it doesn't matter anyway.

[0]
[https://github.com/RPGillespie6/fastcov](https://github.com/RPGillespie6/fastcov)

~~~
zeta0134
Also of course, like any other language, Python is not immune to I/O wait.
Recently I was using Python to do some low level audio processing (nothing
fancy) and wondering why it was taking so long to process such a small file.
Turns out the "wave" stream reader I was using isn't internally buffered, like
most of the normal file readers are, so each single-byte read was making the
round trip to disk. Simply reading the whole file in one go sped up the
program's execution more than 10x.
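
The shape of the fix, as a minimal sketch (file name hypothetical):

    
    
        import wave
        
        with wave.open("input.wav", "rb") as w:
            n = w.getnframes()
            # slow: one tiny read per frame, each a round trip through the reader
            # frames = [w.readframes(1) for _ in range(n)]
            # fast: one bulk read, then process the bytes in memory
            data = w.readframes(n)
    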

I like to think I've been doing this long enough to not make such silly
mistakes, but when you're focused on a totally different area of the problem,
things like this are still quite easy to miss.

~~~
nh2
I'd argue that these things are comparatively easy to address, as one use of
`strace` will immediately show bad IO patterns like that, and it works the
same way across all programming languages.

------
sandGorgon
> _impressively, PyPy [PyPy3 v7.3.1] only takes 0.31s to run the original
> version of the benchmark. So not only do they get rid of most of the
> overheads, but they are significantly faster at unicode conversion as well._

It is super unfortunate that PyPy struggles for funding every quarter.
Funding a million for PyPy from the top 3 Python shops (Google, Dropbox,
Instagram?) should be a rounding error for these guys... and it has the
potential to pay off in hundreds of millions at least (given the overall
infrastructure spend).

~~~
staticassertion
[https://blog.pyston.org/](https://blog.pyston.org/)

You can look at the last release of Pyston for why companies are _not_ funding
this work. They're just moving to other languages.

Consider that Dropbox, Google, and Instagram likely spend _much more_ than 1M
on optimizing/improving Python, just to have it still be a relatively slow
language.

At some point it becomes way cheaper to just move to other languages that
don't require that constant level of effort. Think about how much time and
money is spent on things like:

* Adding types to Python

* Improving performance

when you could just move performance-critical code to Go/Rust as Dropbox has
done, for a fraction of the cost, with far less maintenance burden.

"Make Python faster" is just a losing game imo. It is fundamentally never
going to be as fast as other languages, it's far too dynamic (and that's a
huge appeal of the language). Just look at the optimizations done here -
moving `str` to local scope to avoid a global lookup? And can you even avoid
that? Not without changing semantics - what if I _want_ to change `str`
globally?
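
(For reference, the trick looks like this - and it changes semantics exactly
as described, since rebinding `str` globally no longer affects the function:)

    
    
        def count_up(n, str=str):    # bind the builtin once, at definition time
            for i in range(n):
                str(i)               # LOAD_FAST: no global/builtins dict lookups
    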

Still, I was surprised by some of the sorta nutty wins that were achieved
here. There's clearly some perf left on the table. I'm not an expert on
interpreters, but it just seems _really_ hard to build a semantically
equivalent Python ("I can change global functions from a background thread")
that can automatically optimize these things.

~~~
catblast
> "Make Python faster" is just a losing game imo. It is fundamentally never
> going to be as fast as other languages, it's far too dynamic (and that's a
> huge appeal of the language).

This cannot be overstated. Unfortunately, Python is especially perversely
dynamic. After all, JavaScript and Ruby are highly dynamic too, but for a
number of reasons they have avenues that allow more meaningful optimization
gains without changing language semantics. (And although it is true that
Google had tremendous resources to pour into V8, it's not like this point
doesn't stand - LuaJIT, as an example.)

Python imho took the performance/productivity trade-off way too far. You can
get some very effective dynamism with some minimal restrictions that won't so
badly shoot you in the foot on performance opportunities later. Frankly, for
mainstream development, Kotlin or C# gives you a very pleasant, productive
language with a strong ecosystem without paying such a penalty. Swift is
good. Go isn't personally my cup of tea, but sure.

Python sort of got a foothold in science... but that's never been known as a
field of quality software engineering.

~~~
pjmlp
This is always mentioned as a motivating factor, yet Self, Smalltalk, Dylan,
Julia, Common Lisp, and JavaScript are just as dynamic and manage to have a
good set of JITs to choose from.

For example in Smalltalk you can completely change the structure of a given
class across the whole process space, thus invalidating everything that the
JIT tried to optimize.

So no, Python isn't any special snowflake, rather there isn't the willingness
to actually spend the resources improving its JIT story.

~~~
pas
Is Julia _really_ just as dynamic? Isn't the problem with Python that the
low-level C API still needs to be respected during JIT, which just kills
performance? (That's why PyPy largely doesn't support it, right?)

Even JS doesn't suffer from this, because folks can't just load V8 extensions
in their browser, and Node went through quite a few NAPI versions - forced by
V8 changes.

That said, of course it'd be possible to speed up Python/CPython by pouring a
lot more money into it. But the CPython codebase is already old, has many
problems and a lot of backward-compatibility gotchas, and there are
relatively few people willing to work on it, because there's not a big reason
to do so. Whereas with JS, Google (and thus Mozilla too) was strongly
incentivized to make it fast.

~~~
pjmlp
Julia, I am not sure how far its dynamism goes, but Dylan, Common Lisp and
Smalltalk certainly are.

You can even at any point in time stop execution in the debugger, rewrite part
of the world and just press continue as if that was how the code was
originally written to start with.

~~~
catblast
> You can even at any point in time stop execution in the debugger, rewrite
> part of the world and just press continue as if that was how the code was
> originally written to start with.

Bit of a straw man, because you wouldn't do this regularly in code. Python,
on the other hand, is a relatively large language at this point, and plain
idiomatic Python code leans more heavily on JIT-unfriendly constructs than
the other languages mentioned. Meanwhile, CL has a whole concept of "compile
time" that doesn't really exist in Python. Hence the "perversely" part.

PyPy has used similar tricks to Smalltalk, Self, and JS/V8, many of which
were old hat in the 90s, but PyPy demonstrates that writing a performant JIT
with reasonable memory requirements for real-world code is much harder for
Python.

~~~
pjmlp
For me the only thing that PyPy sadly demonstrates is that the Python
community doesn't care about JIT support, and anyone who wishes to use
languages that embrace JITs should look elsewhere instead of facing
continuous disappointment.

------
stabbles
This is a fun benchmark in C++, where you can see that GCC has a more
restrictive small string optimization. On my desktop the main python example
runs in 3.1s. Then this code

    
    
        #include <cstdint>
        #include <string>
        
        void example() {
          for (int64_t i = 1; i <= 20; ++i) {
            for (int64_t j = 1; j <= 1'000'000; ++j) {
              std::to_string(j);
            }
          }
        }
    

runs with GCC in 2.0s and with clang in 133ms, so 15x faster.

I've also benchmarked it in Julia:

    
    
        function example()
          for j = 1:20, i = 1:1_000_000
            string(i)
          end
        end
    

which runs in 592ms. Julia has no small string optimization and does have
proper unicode support by default.

None of the compilers can see that the loop can be optimized out.

~~~
slavik81
The result of std::to_string is defined to be the same as sprintf, so it
depends on the current locale, which is global, mutable state, and initialized
to the user's chosen value at runtime. That makes the optimization impossible
to do safely without special knowledge about the locale functions.

The awful way that C hacked on locales via implicit global state has been a
disaster, leading to both slow and broken code. If you want fast, predictable
conversions between numbers and strings, you need to use std::to_chars. Or,
std::format for localized conversions.

~~~
slavik81
After a little more spelunking, it seems that this particular case was
possible to optimize. While you can set the thousands separator for integers
in C++ via locales, it only applies to iostreams and not to sprintf. You can
also set it for sprintf in POSIX, but it doesn't apply to integers unless you
opt-in via the format specifier.

So, there is no applicable locale information needed for integer formatting
with std::to_string, and GCC began to replace the naive sprintf-based
implementation with one based on std::to_chars in 2017:
[https://gcc.gnu.org/legacy-ml/gcc-patches/2017-05/msg02103.html](https://gcc.gnu.org/legacy-ml/gcc-patches/2017-05/msg02103.html)

The problems I previously mentioned would apply if that were a float, though,
and I would still recommend using std::to_chars just so you don't have to
think about it.

------
FartyMcFarter
This article seems to be using a very specific definition of interpreter,
which is perhaps not what most people think of when they hear "interpreter"?

If I understand correctly, they call the module generating Python opcodes from
Python code the "interpreter", and everything else is a "runtime". But Python
opcodes are highly specific to CPython, and they are themselves interpreted,
right? Calling the former "interpreter" and the latter something else seems
like an artificial distinction.

Not only is this definition of "interpreter" strange, but their definition of
"runtime" also seems strange; in other languages, the runtime typically refers
to code that assists in very specific operations (for example, garbage
collection), not code that executes dynamically generated code.

~~~
tom_mellior
> If I understand correctly, they call the module generating Python opcodes
> from Python code the "interpreter",

No, you misunderstand. They explicitly define the interpreter as "ceval.c
which evaluates and dispatches Python opcodes". Maybe "evaluate and dispatch"
suggest something else to you, but ceval.c really is the code that iterates
over a list of opcodes and executes the associated computations. This is
absolutely 100% the part of Python that is the interpreter. The module that
_generates_ Python opcodes is the "compiler" (or "bytecode compiler"), and the
article specifically points out that it's not included.

~~~
gsnedders
But at what point do function calls from ceval.c stop counting as part of the
interpreter? Okay, at some point calling a C function clearly ceases to be
part of the interpreter, but is the entirety of an attribute lookup (i.e.,
executing the LOAD_ATTR instruction) on an ordinary Python object part of the
interpreter or the runtime? In plenty of VMs the object representation is an
intrinsic part of the VM design, with the VM having deep knowledge of it.

~~~
tom_mellior
That's a very blurry distinction, and I'm not very interested in it. I was
correcting the OP's first two paragraphs; I didn't take a stance on the
third.

------
pedrovhb
> And impressively, PyPy [PyPy3 v7.3.1] only takes 0.31s to run the original
> version of the benchmark. So not only do they get rid of most of the
> overheads, but they are significantly faster at unicode conversion as well.

Wow, that's pretty impressive. I never really got to use PyPy though, as it
seems that for most programs either performance doesn't really matter (within
a couple of orders of magnitude), or numpy/pandas is used, in which case the
optimization in calling C outweighs any others.

Can anyone share use cases for PyPy?

~~~
6c696e7578
> Can anyone share use cases for PyPy?

Well, anything that you need to do where the libraries are there and waiting
for you.

If you get into the territory of missing libraries it can be a bit of a pain.
Otherwise, it's a breath of fresh air, as it's almost a drop-in replacement.

~~~
hangonhn
Just to clarify, does it matter if these libraries are pure Python? What I
have a hard time understanding is how PyPy is almost a drop-in replacement
but then has issues with missing libraries. Couldn't you just pip install the
libraries, or just literally get the source and run them if they're pure
Python?

~~~
6c696e7578
> does it matter if these libraries are pure Python

Yes; from what I remember, the issue is where the libraries have compiled C
extensions, which don't work well with PyPy's JIT compiler.

As long as what you're running is pure Python code, your code will be OK.

------
rurban
I did similar studies about a decade ago for Perl, with similar results.

But what he's missing are two much more important things:

1\. Smaller data structures. They are way overblown, both the ops and the
data. Compress; trade for smaller data and simpler ops. In my latest VM I use
32-bit words for each op and each datum.

2\. Inlining. A telltale sign is when the calling convention (arg copying) is
your biggest profiler contributor.

Python's bytecode and optimizer are now much better than Perl's, but it's
still 2x slower than Perl. Python has by far the slowest VM. All the object
and method hooks are insane. Still no unboxing, and no optimizing refcounting
away, which is what brought PHP ahead of the game.

------
anentropic
When it says that "argument passing" was responsible for 31% of the time, do I
understand right that we're talking about this line in the inner loop?

    
    
        str(i)
    

...and the time is spent packing i into a tuple (i,) and then unpacking it
again?

Are keyword args faster? Or do they do the same thing but via a dict instead
of a tuple, I guess?

~~~
kmod
Yep! It's slightly worse than you would think. Here's a (slightly edited)
version of how the argument passing works

    
    
      static PyObject *
      unicode_new(PyTypeObject *type, PyObject *args, PyObject *kwds)
      {
          PyObject *x = NULL;
          static char *kwlist[] = {"object", "encoding", "errors", 0};
          char *encoding = NULL;
          char *errors = NULL;
      
          PyArg_ParseTupleAndKeywords(args, kwds, "|Oss:str",
                                      kwlist, &x, &encoding, &errors);
      }
    

Notice the call to PyArg_ParseTupleAndKeywords -- it takes an arguments tuple
and a format string and _executes a mini interpreter to parse the arguments
from the tuple_. It has to be ready to receive arguments as any combination of
keywords and positional, but for a given callsite the matching will generally
be static.
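
You can see the static callsite shape in the bytecode (CPython 3.8-era
opcodes; the annotations are mine):

    
    
        import dis
        
        def f(i):
            return str(i)
        
        dis.dis(f)
        # LOAD_GLOBAL    0 (str)
        # LOAD_FAST      0 (i)
        # CALL_FUNCTION  1     <- always exactly one positional arg at this
        # RETURN_VALUE         site, yet the args still get packed into a
        #                      tuple and re-parsed inside unicode_new
    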

~~~
anentropic
And is that literally every Python function doing that under the hood? Even a
1-arity function?

I don't get what the format string is for.

Anyway, thanks for answering!

~~~
kmod
It used to be this way, but at some point they added a faster calling
convention and moved a number of things to it. Calling type objects still
falls back to the old convention though.

------
no_gravity
I wanted to play with variations of the code. For that it is useful to make it
output a "summary" so you know the variation you tried is computationally
equivalent.

For the first benchmark, I added a combined string length calculation:

    
    
        def main():
            r = 0
            for j in range(20):
                for i in range(1000000):
                    r += len(str(i))
            print(r)
        main()
    

When I execute it:

    
    
        time python3 test.py
    

I get 8.3s execution time.

The PHP equivalent:

    
    
        <?php
        function main() {
         $r = 0;
         for ($j=0;$j<20;$j++)
           for ($i=0;$i<1000000;$i++)
             $r += strlen($i);
         print("$r\n");
        }
        main();
    

When I execute it:

    
    
        time php test.php
    

Finishes in 1.4s here. So about 6x faster.

Executing the Python version via PyPy:

    
    
        time pypy test.py
    

Gives me 0.49s. Wow!

For better control, I did all runs inside a Docker container. Outside the
container, all runs are about 20% faster. Which I also find interesting.

Would like to see how the code performs in some more languages like
Javascript, Ruby and Java.

~~~
iruoy
I've added Rust using this code:

    
    
        fn main() {
            let mut r = 0;
    
            for _x in 0..20 {
                for y in 0..1_000_000 {
                    r += y.to_string().len();
                }
            }
    
            println!("{}", r);
        }
    

Surprisingly, PyPy is the fastest:

    
    
        % hyperfine target/release/perftest "php perftest.php" "python perftest.py" "pypy perftest.py" -w 3      
        Benchmark #1: target/release/perftest
        Time (mean ± σ):     624.8 ms ±   9.8 ms    [User: 623.0 ms, System: 0.8 ms]
        Range (min … max):   614.5 ms … 644.0 ms    10 runs
        
        Benchmark #2: php perftest.php
        Time (mean ± σ):     697.8 ms ±  18.3 ms    [User: 696.7 ms, System: 1.1 ms]
        Range (min … max):   650.1 ms … 718.0 ms    10 runs
        
        Benchmark #3: python perftest.py
        Time (mean ± σ):      3.326 s ±  0.071 s    [User: 3.313 s, System: 0.003 s]
        Range (min … max):    3.232 s …  3.419 s    10 runs
        
        Benchmark #4: pypy perftest.py
        Time (mean ± σ):     270.7 ms ±   5.7 ms    [User: 257.5 ms, System: 13.0 ms]
        Range (min … max):   257.8 ms … 277.8 ms    11 runs
        
        Summary
        'pypy perftest.py' ran
            2.31 ± 0.06 times faster than 'target/release/perftest'
            2.58 ± 0.09 times faster than 'php perftest.php'
            12.29 ± 0.37 times faster than 'python perftest.py'

~~~
eggsnbacon1
My Java timing is very close to Rust's. I have a feeling PyPy is eliminating
the string conversion as dead code. When I alter my Java toy code to not
prevent elimination of the toString() call, it runs close to the speed of
PyPy.
~~~
apta
I found Java's to be quicker than Rust:

Java:

    
    
        time java Test
        117777800
                0.60 real         0.52 user         0.12 sys
    

Rust:

    
    
        time ./test
        117777800
                1.94 real         1.89 user         0.00 sys

~~~
steveklabnik
What flags did you use with Rust?

~~~
apta
Both `-O` and `-C opt-level=3`. I'm guessing this is one of those high
throughput scenarios where having a GC is quicker because it's able to defer
deallocation until the end of the program.

~~~
steveklabnik
-O is opt level 2, incidentally.

Yeah, I think you’re right. Very interesting!

------
apalmer
I think he is making the distinction between 3 different categories that
readers are in general lumping into the 'interpreter':

1) Python is executed by an interpreter (necessary overhead)

2) Python as a language is so dynamic/flexible/ergonomic that it has to do
things that have overhead (necessary complexity unless you change the
language)

3) the specific implementation of the interpreter achieves 1 and 2 in ways
that can be significantly slower than necessary

Seems he is pointing out that a lot of performance issues that are generally
thought to be due to 1 and 2 are really due to 3.

~~~
gsnedders
> 2) Python as a language is so dynamic/flexible/erfonomic that it has to do
> things that have overhead (necessary complexity unless you change the
> language)

As PyPy demonstrates, much of this doesn't need to have anywhere near the
overhead that it does in CPython. You can absolutely do better than CPython
without changing the language, and you can do better than CPython without a
JIT if you start specialising code.

------
adsharma
Static subset of <language>

While many here correctly observe that "too much dynamism" can become a
performance bottleneck, one has to analyze the common subset of Python that
most people use and see how much dynamism is intrinsic to those use cases.

Other languages like JS have tried a static subset (Static TypeScript is a
good example) that can be compiled to other languages - usually C.

Python has had RPython, but no one uses it outside of the compiler community.

The argument here is that Python doesn't have to be one language. It could be
2-3 languages with similar syntax, catering to different use cases and having
different linters.

* A highly dynamic language that caters to "do this task in 10 mins of coding" use case. This could be used by data scientists and other data exploration use cases.

* A static subset where performance is at a premium. Typically compiled down to another language. Strict typing is necessary. Performance sensitive and a large code base that lives for many years.

* Some combination of the two (say for a template engine type use case).

A problem with the static use case is that the typing system in Python is
incomplete. It doesn't have pattern matching and the other tools needed to
support algebraic data types. Newer languages such as Swift, Rust and Kotlin
are more competitive in this space.
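
For context, the closest current Python typing gets to an algebraic data type
is a tagged union with manual dispatch - no exhaustiveness checking, no
destructuring (my sketch, not from the comment):

    
    
        from dataclasses import dataclass
        from typing import Union
        
        @dataclass
        class Circle:
            radius: float
        
        @dataclass
        class Rect:
            w: float
            h: float
        
        Shape = Union[Circle, Rect]    # no compiler-checked exhaustive match
        
        def area(s: Shape) -> float:
            # manual isinstance chains stand in for pattern matching
            if isinstance(s, Circle):
                return 3.141592653589793 * s.radius ** 2
            return s.w * s.h
    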

------
chrisseaton
Kevin knows much more than I do about optimising Python, but aren't lots of
the things listed as 'not interpreter overhead' only slow because they're
being interpreted? For example you only need integer boxing in this loop
because it's running in an interpreter. If it was compiled that would go away.
So shouldn't we blame most of these things on 'being interpreted'?

~~~
jsnell
Not really. Integer boxing is totally orthogonal to compilation vs.
interpretation. Assuming we need to preserve program semantics, a naive
compiler would have exactly the same boxing overhead. And on the other hand,
a smart interpreter could eliminate it (e.g. stack-allocate integers by
default, dynamically detect conditions that require promoting them to the
heap, annotate the str builtin as escape-safe).

~~~
chrisseaton
Isn't removing overhead like boxing the main point of a compiler? Seems like
if you write a compiler that doesn't do that you might as well not have
bothered.

------
mark-r
> The benchmark is converting numbers to strings, which in Python is
> remarkably expensive for reasons we'll get into.

I was a bit disappointed that converting numbers to strings was the only
thing he didn't actually discuss. I've discovered that the conversion
function is unnecessarily slow, basically O(n^2) in the number of digits,
despite being based on an algorithm from Knuth.

------
jokoon
I'll never understand why there are so many fast, alternative python
interpreters.

Is the language correctness of the official interpreter causing lower
performance? Does it prevent using some modules? What's at stake here?

I'm planning to use python as a game scripting language but I hear so much
about performance issues that it scares me to try learning how to use it in a
project. I love python though.

~~~
entha_saava
They are mostly JITs, or have specific purposes (like Stackless Python).

JITs like PyPy have a raw performance advantage, but the implementation is
more complex, and there is startup-time overhead. Also, I have heard the
CPython reference implementation prefers clarity of code over optimizations.

As for game scripting, Lua is the most used one.

------
throwaway894345
> In this post I hope to show that while the interpreter adds overhead, it is
> not the dominant factor for even a small microbenchmark. Instead we'll see
> that dynamic features -- particularly, features inside the runtime -- are to
> blame.

I'm probably uneducated here, but I don't understand the distinction between
the runtime and the interpreter for an interpreted language? Isn't the
interpreter the same as the runtime? What are the distinct responsibilities of
the interpreter and the runtime? Is the interpreter just the C program that
runs the loop while the runtime is the libpython stuff (or whatever it's
called)?

~~~
PaulDavisThe1st
Consider a program written in Python. Imagine it converted into some "internal
representation" that will be used by the interpreter to execute it.

We can ask questions like "when I write a statement in Python that adds two
integers (typically a single machine instruction in a C program), how many
machine instructions are executed while running that statement?"

Those are questions about the interpreter - essentially, the raw speed and
execution patterns of the language.

But when you're running a Python program (or one written in many other
languages these days), there is a lot going on inside the process that you
cannot map back to the Python statements you wrote. The most obvious is memory
management, but there are several others.

This stuff takes time to execute and can change the behavior of the statements
that you did write in subtle (and not so subtle ways).

Note that this kind of issue exists for compiled languages too: when you run
a program written in C++, the compiler will have inserted varying amounts of
code into the executable to manage a variety of things that do not map back
explicitly to the lines you wrote. This is the "C++ runtime" at work, and
even though the "C++ language" may run at essentially machine speed, the
runtime still adds overhead in some places.

Interpreted languages are the same, just worse (by various metrics).

~~~
throwaway894345
I understand the notion of a "runtime" in the compiled language case; my
question is about the distinction between an interpreter and a runtime in the
interpreted language case. Perhaps out of naivety, I would think that the
interpreter _is_ the runtime--the interpreter manages the call stack, memory,
etc.

~~~
PaulDavisThe1st
Well, imagine for example that you wrote this in python:

    
    
        s = "hello" + "world"
    

You could measure the execution of the internal representation of this, and
that would be all about the speed of the interpreter.

At some point, something may have to clean up temporaries and that may happen
asynchronously with respect to the actual statement execution. Yes, it happens
in the same process as the interpreter's execution of the code, but it's a
theoretically "optional" cost that has no specific relationship with the
statement.

There are interpreters, for example, where you can disable garbage collection.
If the overall execution time decreased, it's not because the interpreter got
faster, it's just doing less of the "runtime" work.
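
In CPython specifically, that looks like this - the cycle collector can be
switched off while reference counting keeps running:

    
    
        import gc
        
        gc.disable()      # stop the cyclic garbage collector
        # ... run the workload; refcounting still frees most objects ...
        gc.enable()
    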

~~~
throwaway894345
So is there anything besides the GC which satisfies this definition of
"runtime"? What other asynchronous activities are going on apart from those of
the Python program itself?

------
erdewit
Replacing str(i) with the f-string f'{i}' lets it run about 2x faster.
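
A quick way to check that claim yourself (timings vary by machine; in
2020-era CPython, f-strings use a dedicated FORMAT_VALUE opcode, while str(i)
goes through a full call to the str type):

    
    
        import timeit
        
        # one million conversions each, by default
        print(timeit.timeit("str(i)", setup="i = 12345"))
        print(timeit.timeit("f'{i}'", setup="i = 12345"))
    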

------
inglor
The article is quite good, but the Node.js example is wrong. OP is measuring
dead-code elimination and general overhead time.

No strings get harmed in the process; run Node.js with --trace-opt to see
what's happening.

~~~
kmod
I don't know how to interpret the results of --trace-opt, but when I increase
the iteration bounds 10x the running time increases 10x, so I don't think the
code is being eliminated. This is with node v12.13.0

------
nojito
A potential issue with benchmarks like this is that there are instances where
the initial findings don't scale.

I would be interested to see how it does over an operation that takes 1
minute, 5 minutes, 10 minutes.

------
aldanor
Since people are talking about speed and JIT here, it's worth mentioning Numba
([http://numba.pydata.org](http://numba.pydata.org)). Being in the quant field
myself, it's often been a lifesaver - you can implement a c-like algorithm in
a notebook in a matter of seconds, parallelise it if needed and get full numpy
api for free. Often times if you're doing something very specific, you can
beat pandas/numpy versions of the same thing by an order of magnitude.
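
A minimal sketch of the kind of loop Numba handles well (assumes numba is
installed; the digit-counting workload is mine, chosen to stay in pure
machine arithmetic):

    
    
        from numba import njit
        
        @njit
        def count_digits(n):
            # compiled to machine code on first call; no boxed ints in the loop
            total = 0
            for i in range(n):
                x, digits = i, 1
                while x >= 10:
                    x //= 10
                    digits += 1
                total += digits
            return total
        
        print(count_digits(1_000_000))
    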

------
eggsnbacon1
for reference, a pure java version through JMH that takes 0.38 seconds on my
machine. This uses parallel stream so its multithreaded.

Single threaded it takes 0.71 seconds. Removing the blackhole to allow dead
code elimination takes single thread down to 0.41 seconds. This is close to
PyPy, which I assume is dead code eliminating the string conversion as well.

    
    
        package org.example;
    
        import org.eclipse.collections.api.block.procedure.primitive.IntProcedure;
        import org.eclipse.collections.impl.list.Interval;
        import org.openjdk.jmh.annotations.Benchmark;
        import org.openjdk.jmh.infra.Blackhole;
        
        public class MyBenchmark {
        
          @Benchmark
          public void testMethod(final Blackhole blackhole) {
            Interval.oneTo(20)
                .parallelStream().forEach((IntProcedure)
                    outer ->
                        Interval.oneTo(1000000)
                            .forEach(
                                (IntProcedure)
                                    inner ->
                                        // prevent dead code elimination
                                        blackhole.consume(Integer.toString(inner))));
          }
        }

~~~
6c696e7578
Not too surprising, as dynamic languages take 4-20 times longer at numerical
work, in my rough experience.

Java/C/C#/C++/Rust etc. are roughly at the same end of the spectrum (unless
you're creating stupid numbers of objects).

Perl/Python/Ruby are dynamic, so expect slower results.

I like the threaded approach you're using.

~~~
eggsnbacon1
> like the threaded approach you're using.

Thanks :) Eclipse Collections can also do batched loops, which might speed
this up. Telling Java how many threads you want will probably help as well.

Interestingly, I tried plain old for(i) loops and the result was exactly the
same. At least for Eclipse Collections, the syntactic sugar for ranges and
forEach is completely optimized away, apparently.

~~~
palinkapika
You might already know this, but there is one potential caveat with APIs like
that when it comes to performance or at least measuring their performance. The
hot loop (the code that actually iterates over your data points) does not live
in your code base. But performance often depends on what the JIT compiler
makes out of that loop. If the API is used at several locations in your
program, the compiler might not be able to generate code that is optimal for
your call-site and inputs. Instead, it will generate generalized code that
works for all inputs but might be slower.

However, when writing benchmarks, there is often no other code around to force
the JIT compiler to generate such generalized code.

The following code demonstrates this. If the parameter _warmup_ is _false_, I
invoke the _forEach_ methods with different inputs first (they do the same
thing but are different methods in the Java bytecode); the purpose is to
force the compiler to generate generalized code:

    
    
        @Param({"true", "false"})
        public boolean warmup;
    
        @Setup
        public void setup() {
            if (!warmup) {
                Interval.oneTo(20).forEach(
                        (IntProcedure) i -> Interval.oneTo(1_000_000).forEach(
                                (IntProcedure) j -> Integer.toString(j)));
                Interval.oneTo(20).forEach(
                        (IntProcedure) i -> Interval.oneTo(1_000_000).forEach(
                                (IntProcedure) j -> Integer.toString(j)));
                Interval.oneTo(20).forEach(
                        (IntProcedure) i -> Interval.oneTo(1_000_000).forEach(
                                (IntProcedure) j -> Integer.toString(j)));
            }
        }
    
        @Benchmark
        public void coollectionsBlackhole(Blackhole blackhole) {
            Interval.oneTo(20).forEach(
                    (IntProcedure) i -> Interval.oneTo(1_000_000).forEach(
                            (IntProcedure)j -> blackhole.consume(Integer.toString(j))));
        }
    
    
        @Benchmark
        public void collectionsDeadCode() {
            Interval.oneTo(20).forEach(
                    (IntProcedure) i -> Interval.oneTo(1_000_000).forEach(
                            (IntProcedure)j -> Integer.toString(j)));
        }
    

And the for loop implementations for reference:

    
    
        @Benchmark
        public void loopBlackhole(Blackhole blackHole) {
            for (int j = 0; j < 20; j++) {
                for (int i = 0; i < 1_000_000; i++) {
                    blackHole.consume(Integer.toString(i));
                }
            }
        }
    
        @Benchmark
        public void loopDeadCode() {
            for (int j = 0; j < 20; j++) {
                for (int i = 0; i < 1_000_000; i++) {
                    Integer.toString(i);
                }
            }
        }
    
        @Benchmark
        public void loopBlackholeOnly(Blackhole hole) {
            for (int j = 0; j < 20; j++) {
                for (int i = 0; i < 1_000_000; i++) {
                    hole.consume(0xCAFEBABE);
                }
            }
        }
    

On my desktop machine, this gives me the following results (Java
11/Hotspot/C2):

    
    
      Benchmark              (warmup)  Mode  Cnt    Score   Error  Units
      collectionsDeadCode        true  avgt   99  162.482 ± 1.615  ms/op
      collectionsDeadCode       false  avgt   99  218.816 ± 4.217  ms/op
      coollectionsBlackhole      true  avgt   99  235.122 ± 1.362  ms/op
      coollectionsBlackhole     false  avgt   99  270.192 ± 1.627  ms/op
      loopBlackhole              true  avgt   99  207.214 ± 1.162  ms/op
      loopBlackhole             false  avgt   99  206.711 ± 0.932  ms/op
      loopBlackholeOnly          true  avgt   99   74.774 ± 0.180  ms/op
      loopBlackholeOnly         false  avgt   99   74.359 ± 0.180  ms/op
      loopDeadCode               true  avgt   99  143.394 ± 0.900  ms/op
      loopDeadCode              false  avgt   99  142.654 ± 0.795  ms/op
    

All results are for single threaded code. Now, the difference is not huge, but
significant. Overall it looks like the invocation of toString(int) is not
removed and accounts for most of the runtime.

Just to be clear: I am not saying one should stay way from the stream APIs. As
soon as the work per item is more than just a few arithmetic operations, there
is a good chance the difference in runtime is negligible. But when doing
numerical work (aggregations etc.), a simple loop might be the better option.

Finally, these differences are of course compiler-dependent. For example, C2
might behave differently than Graal and who knows what the future brings.

~~~
gergo_barany
> For example, C2 might behave differently than Graal and who knows what the
> future brings.

Yes, the GraalVM compiler is more aggressive on some optimizations that can be
especially helpful with streams. Here are numbers from my machine (using
ops/s, so _higher_ is better!):

C2 (on Java(TM) SE Runtime Environment (build 1.8.0_251-b08)):

    
    
        Benchmark                          (warmup)   Mode  Cnt   Score   Error  Units
        MyBenchmark.collectionsDeadCode        true  thrpt   10   2.179 ± 0.070  ops/s
        MyBenchmark.collectionsDeadCode       false  thrpt   10   1.640 ± 0.055  ops/s
        MyBenchmark.coollectionsBlackhole      true  thrpt   10   1.661 ± 0.109  ops/s
        MyBenchmark.coollectionsBlackhole     false  thrpt   10   1.571 ± 0.062  ops/s
        MyBenchmark.loopBlackhole              true  thrpt   10   1.829 ± 0.077  ops/s
        MyBenchmark.loopBlackhole             false  thrpt   10   1.855 ± 0.062  ops/s
        MyBenchmark.loopBlackholeOnly          true  thrpt   10  19.775 ± 0.184  ops/s
        MyBenchmark.loopBlackholeOnly         false  thrpt   10  20.213 ± 0.491  ops/s
        MyBenchmark.loopDeadCode               true  thrpt   10   2.022 ± 0.057  ops/s
        MyBenchmark.loopDeadCode              false  thrpt   10   1.990 ± 0.074  ops/s
    

GraalVM Community Edition (current master, close to yesterday's 20.1 release,
JDK 8):

    
    
        Benchmark                          (warmup)   Mode  Cnt   Score   Error  Units
        MyBenchmark.collectionsDeadCode        true  thrpt   10   3.151 ± 0.118  ops/s
        MyBenchmark.collectionsDeadCode       false  thrpt   10   2.790 ± 0.053  ops/s
        MyBenchmark.coollectionsBlackhole      true  thrpt   10   2.713 ± 0.137  ops/s
        MyBenchmark.coollectionsBlackhole     false  thrpt   10   2.488 ± 0.031  ops/s
        MyBenchmark.loopBlackhole              true  thrpt   10   2.862 ± 0.073  ops/s
        MyBenchmark.loopBlackhole             false  thrpt   10   2.835 ± 0.043  ops/s
        MyBenchmark.loopBlackholeOnly          true  thrpt   10  24.396 ± 1.278  ops/s
        MyBenchmark.loopBlackholeOnly         false  thrpt   10  24.103 ± 0.957  ops/s
        MyBenchmark.loopDeadCode               true  thrpt   10   3.235 ± 0.097  ops/s
        MyBenchmark.loopDeadCode              false  thrpt   10   3.205 ± 0.168  ops/s
    

GraalVM Enterprise Edition (current master, close to 20.1, JDK 8):

    
    
        Benchmark                          (warmup)   Mode  Cnt   Score   Error  Units
        MyBenchmark.collectionsDeadCode        true  thrpt   10   3.748 ± 0.046  ops/s
        MyBenchmark.collectionsDeadCode       false  thrpt   10   2.879 ± 0.042  ops/s
        MyBenchmark.coollectionsBlackhole      true  thrpt   10   2.792 ± 0.059  ops/s
        MyBenchmark.coollectionsBlackhole     false  thrpt   10   2.479 ± 0.040  ops/s
        MyBenchmark.loopBlackhole              true  thrpt   10   3.070 ± 0.046  ops/s
        MyBenchmark.loopBlackhole             false  thrpt   10   3.077 ± 0.057  ops/s
        MyBenchmark.loopBlackholeOnly          true  thrpt   10  25.200 ± 0.760  ops/s
        MyBenchmark.loopBlackholeOnly         false  thrpt   10  25.752 ± 0.481  ops/s
        MyBenchmark.loopDeadCode               true  thrpt   10   3.812 ± 0.161  ops/s
        MyBenchmark.loopDeadCode              false  thrpt   10   3.909 ± 0.109  ops/s
    

Amusingly, on the OP's benchmark GraalVM also does better than C2, and it
somehow (I haven't tried to analyze this yet) makes the serial stream version
as fast as the parallel one:

C2:

    
    
        Benchmark                            Mode  Cnt  Score   Error  Units
        MyBenchmark.originalParallelStream  thrpt   10  2.547 ± 0.336  ops/s
        MyBenchmark.originalSerialStream    thrpt   10  1.719 ± 0.041  ops/s
    

GraalVM Community Edition:

    
    
        Benchmark                            Mode  Cnt  Score   Error  Units
        MyBenchmark.originalParallelStream  thrpt   10  2.809 ± 0.141  ops/s
        MyBenchmark.originalSerialStream    thrpt   10  2.868 ± 0.090  ops/s
    

(I work in the GraalVM compiler team at Oracle Labs, I don't speak for Oracle,
I'm not claiming anything about whether this is a good or bad benchmark, etc.)

~~~
palinkapika
The results for the parallel code are indeed interesting. However, the speedup
< 2 with C2 is suspicious. Even on an old dual core / 4 thread system (common
worker pool should default to 3 workers), I would expect it to be above 2 for
this kind of workload.

The last time I looked into Graal (over a year ago), I ran into something I
could not make sense of.

We were designing an API with methods similar to Arrays.setAll(double[],
IntToDoubleFunction) which we expected to be called with large input arrays
from many places in the code base.

The performance of the code generated by C2 dropped as one would expect once
you were using more that two functions (lambda functions were no longer
inlined and invoked for each element). For simple arithmetic operations the
runtime increased by around 5x.

The performance of the code generated by Graal was competitive with C2's code
for mono- and bimorphic scenarios for a much larger number of input functions.
I don't recall the exact number, but I believe it was around 25. However, once
we hit that threshold, performance tanked. I believe the difference was
closer to 15x than C2's 5x.

We never figured out what caused this.

Do you have an idea? Or, since this is probably outdated data, do you know
what the expected behavior is nowadays? Is there maybe any documentation on
this?

~~~
gergo_barany
_> However, the speedup < 2 with C2 is suspicious. Even on an old dual core /
4 thread system (common worker pool should default to 3 workers), I would
expect it to be above 2 for this kind of workload._

The speedup I found on C2 is 2.547 / 1.719 ~= 1.5, the OP's speedup (also on
C2) was 0.71 s / 0.38 s ~= 1.9. Not the same factor, but not above 2 either.
The benchmark is quite small, so the overhead of setting up the parallel
computation might matter. The JDK version might matter as well.

 _> Graal was competitive with C2's code for mono- and bimorphic scenarios for
a much larger number of input functions. I don't recall the exact number, but
I believe it was around 25._

If we are talking about a call like Arrays.setAll(array, function) and you
have one or two candidates for function, then you are mono- or bimorphic. If
you have 25 candidates for function, you are highly polymorphic. I don't
understand what you mean by a mono- or bimorphic scenario with up to 25
candidates.

Performance will usually decrease as you transition from a call site with few
candidates to one with many candidates. I agree that a factor of 15x looks
steep, but it's hard to say more without context. The inliners (Community and
Enterprise use different ones) are constantly being tweaked, though that's not
my area of expertise. It's very possible that this behaves differently than it
did more than a year ago.

If you have a concrete JMH benchmark you can link to and are willing to post
to the GraalVM Slack (see signup link from
[https://www.graalvm.org/community/](https://www.graalvm.org/community/)),
that would be great. Alternatively, you could drop me a line at <first . last
at oracle dot com>, and I could have a look at a benchmark, without promising
anything.

~~~
palinkapika
> If we are talking about a call like Arrays.setAll(array, function) and you
> have one or two candidates for function, then you are mono- or bimorphic. If
> you have 25 candidates for function, you are highly polymorphic. I don't
> understand what you mean by a mono- or bimorphic scenario with up to 25
> candidates.

Sorry, I phrased that poorly. What I was trying to say was that the moment we
introduced a 3rd candidate the performance with C2 decreased by ~5x. Whereas
with Graal the performance stayed virtually the same when introducing a 3rd or
4th candidate. And it was performing well. But it did decrease at some point
after adding more and more candidates and then the performance hit was much
bigger.

I will reach out to my colleagues still working on this next week (long
weekend ahead where I live). If we can reproduce this with a current Graal
release, I'll share the benchmark – always happy to learn something new, even
if it is 'just' why our benchmark is broken. :)

~~~
eggsnbacon1
I really appreciate the in-depth replies on HN. Something Reddit lost ages
ago. I gotta say your posts have introduced me to a lot of dark magic in the
JVM that isn't really documented anywhere unless you count source code. Much
appreciated!

------
tuananh
Can anyone explain to me why this is a lot faster than j.toString() or
String(j)?

    
    
      for (let i = 0; i < 20; i++) {
        for (let j = 0; j < 1000000; j++) {
            `${j}`
        }
      }
    

I got

    
    
       Executed in  177.12 millis    fish           external
       usr time  159.12 millis   91.00 micros  159.03 millis
       sys time   17.96 millis  443.00 micros   17.52 millis

------
hpcjoe
I know this is supposed to be about Python optimization. However, the post
switches over to C near the beginning of the process. Hence it really is
about how to optimize Python applications by rewriting them in C (or other
fast languages).

Which, IMO, isn't optimizing Python, apart from tangential API/library usage.

I've been under the impression that I'll get the best performance out of a
language when I write code that leverages the best idiomatic features, and
natural aspects of the language. If I have to resort to another language along
the way to get needed performance, I guess the question is, isn't this a
strong signal that I should be looking at different languages ... specifically
fast ones?

Most of the fast elements of python are interfaces to C/Fortran code (numpy,
pandas, ...). What is the rationale for using a slow language as glue versus
using a faster language for processing?

~~~
shepardrtc
> What is the rationale for using a slow language as glue versus using a
> faster language for processing?

It's much quicker and easier to put together a Python program than a C/C++
program. The number of libraries out there for Python is incredible.

~~~
hpcjoe
Honestly, this is subjective. I generally agree that some languages are well
designed for fast prototyping, but not really great for computationally
intensive production. I see Python in this role.

Note that when the original author wished to move this to a faster execution
capability, they had to change languages. So they get the disadvantage of
having to do the port anyway.

How does this save time?

------
varelaz
I'm really sick of this kind of benchmark. I have never seen a real-world
Python program that doesn't depend on any IO or doesn't have some C code
behind wrapped calls. If your code is heavy with computations, there are the
numpy/scipy libs, which are very good at this. These optimizations bring
< 10% more speed to a real project/program, but will require a lot of
developer time to support. If performance is the key feature and very
critical, then Python is likely not the right choice, because Python is more
about flexibility and the ability to maintain and write solid, easy-to-read
code.

~~~
VHRanger
Hard disagree.

Learning the tool you're working with means you know patterns to write
generally more efficient code.

Even if you're going to use numpy/cython/cffi for faster submodules, writing
faster code in general is a good thing.

~~~
varelaz
I don't mind knowing the limitations. I'm saying that these optimizations are
usually very hard tradeoffs, and the opposite side of them is code
readability, speed of developer work, and the ability to keep things working
and maintainable. I tried Cython and PyPy. Both are really good if your
project is started with them, but if you decide to migrate to them in order
to increase performance, it's like rewriting the project in another language.
Also, both have a lot of limitations, and CPython still gives you a lot more
flexibility in decision making (choosing frameworks, libraries, and
approaches to solve a given problem).

------
earthboundkid
I would be interested in seeing the performance difference for f"{i}". My
intuition is that it would be faster.

------
drcongo
That was really interesting, even though a lot of it (almost all the C) was
way over my head. Thank you.

------
carapace
Site broken, "Hug of Death"?

> This page isn’t working

> blog.kevmod.com is currently unable to handle this request.

> HTTP ERROR 500

~~~
kmod
This is what I get for hosting my own wordpress server. It was struggling and
I installed a caching plugin which took down the site. Should be back up,
sorry!

~~~
SketchySeaBeast
Installing plugins into Wordpress kind of feels like doing your own brain
surgery.

------
Mikhail_K
On my somewhat old Linux machine his main() takes 5.22 seconds. Meanwhile,
this Julia code

    
    
      @time map(string, 1:1000000);
    

reports an execution time of 0.18 seconds. But that includes compilation
time; if I use BenchmarkTools, which runs the code repeatedly, I get 88.6
milliseconds.

~~~
eggsnbacon1
the outer loop runs 20 times as well

------
g8oz
I'd be interested in seeing how PHP performs running equivalent code.

~~~
kmod
I don't know PHP, but feel free to send a pull request!

~~~
owyn
Using range() in PHP it takes about 1.3 seconds. The PHP docs imply that it's
a generator now, but I'm not positive about that. I wrote an equivalent PHP
function just using for loops, calling strval($x) 20 million times, and on my
laptop it runs in .9 seconds. The second form, creating 20 lists with 1M
elements, runs in .4 seconds. Without needing to write 300 lines of Py/C
stuff. So... shrug? Microbenchmarks? I guess writing the optimized code was
the fun part, and the actual benchmark/timing part doesn't really matter.
It's just for loops... though the other comments are right that benchmarks
definitely matter when you're doing more varied "work".

For what it's worth, I did benchmark a big application in PHP "for real", and
parsing a large configuration file (10,000+ lines) on every request did take
up about 15% of the wall clock time. We optimized a few things there because
it was worth it. It was a HUGE application of about ~1M LoC, and an average
request was about 300-400 milliseconds, so... I guess it wasn't doing a lot
of for loops?

Edit: I had a few minutes before my next meeting to code golf this so I
decided to test if range() worked with yield [1]

    
    
      <?
      function gen() {  // 1.3s
        foreach (range(0,20) as $i) {
          foreach (range(0,1000000) as $j) {
            strval($j);
          }
        }
      }
      function foo() {  // .9s
        for ($i = 0; $i < 20; $i++) {
          for ($j = 0; $j < 1000000; $j++) {
            strval($j);
          }
        }
      }
      function bar() {  // .4s
        for ($i = 0; $i < 20; $i ++) {
          $x = range(0, 1000000);
        }
      }
    
      // [1]  returns in .06 seconds
      // pretty sure this just executes yield once and does no real work just like me this morning
    
      function gen2() {
       foreach (range(0,20) as $i) {
         foreach (range(0,1000000) as $j) {
            yield;
         }
       }
      }

------
dzonga
This is an area Nim could have come in and improved on, i.e. if Nim had the
ability to port your Python code and have it running 98% on Nim, most Python
users would have been there already.

------
oneiftwo
I've always assumed array iteration is more expensive when you don't know the
size of the objects and/or they aren't contiguous.

------
antb123
Hmm, so complain and then say PyPy is 7 times faster (and 4 times faster than
Node.js)?

~~~
dguaraglia
I don't see a complaint, just an analysis of things that make code slow.
Assuming that PyPy will just fix the problem is unrealistic, considering
PyPy's limitations (it doesn't support every architecture Python supports, it
lags a Python minor version or two behind, etc.).

Python is a great language and PyPy a great tool, but let's not become
complacent or - worse - dismiss valid information just because we like them.

------
andybak
[EDIT - posted in haste. I should RTFA]

Ctrl+F javascript - nothing.

At first glance this seems to be "dynamism == slow", but surely you need to
explain why Python is slower than JavaScript and has for many years resisted
a lot of effort to match the performance of V8 and its cousins?

~~~
cosarara
If you had done ^F NodeJS, you would have found the relevant paragraph. Or,
you can read the whole thing.

~~~
andybak
Yeah - I just went to delete my comment on the grounds I hadn't RTFA

