The article is a fine example of incremental optimization of some Python, replacing constructs that the standard Python interpreter executes with a lot of overhead with others that trigger fewer of those same overheads.
The title isn't quite right, though. Boxing, method lookup, etc. come under "interpreter" too.
There's a continuum of implementation between a naive interpreter and a full-blown JIT compiler, rather than a binary distinction. All interpreters beyond line-oriented things like shells convert source code into a different format more suitable for execution. When that format is in memory, it can be processed to different degrees to achieve different levels of performance.
For example, if looking up "str" is an issue, an interpreter could cache the last function it got for "str" conditional on some invalidation token (e.g. "global_identifier_change_count"), so it doesn't in fact need to look it up every time.
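Something like this, as a minimal sketch (the names are hypothetical, not real CPython internals):

    import builtins

    global_identifier_change_count = 0   # bumped whenever a global/builtin is rebound
    _cached_str = None
    _cached_version = -1

    def lookup_str(module_globals):
        global _cached_str, _cached_version
        if _cached_version != global_identifier_change_count:   # invalidation check
            _cached_str = module_globals.get("str", builtins.str)
            _cached_version = global_identifier_change_count
        return _cached_str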
Boxing can be eliminated by using type-specific representations of the operations, and choosing different code paths depending on checking the actual type. Hoisting those type checks outside of loops is then a big win. Add inlining, and the hoisting logic can see more loops to move things out of. Inlining also gets rid of your argument passing overhead.
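A rough sketch of the hoisting idea in plain Python (illustrative only; a real implementation would do this on an intermediate representation and actually unbox the ints):

    def stringify_all(values):
        # Type check hoisted out of the loop: decide once which path to take.
        if all(type(v) is int for v in values):
            return ["%d" % v for v in values]   # int-specialized conversion path
        return [str(v) for v in values]         # generic path, full dynamic dispatch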
None of this requires targeting machine code directly, you can optimize interpreter code, and in fact that's what the back end of retargetable optimizers looks like - intermediate representation is an interpretable format.
Of course things get super-complex super-quickly, but that's the trade-off.
As someone who has written Python professionally, here is my gripe with it. As soon as performance requirements start cropping up in any Python project, all the "easiness" of Python starts to fade. You have to start looking for optimisations left, right and centre, rewriting your code with C calls to improve performance.
All this has taught me a thing or two. Now if I know that the product will be used where performance or scalability is required, I straight away opt for one of Rust, Go or Java. I don't want recurring effort spent tuning the codebase to get better performance. I know I might get some opinions here, but Python is a scripting language for rapid prototyping and I plan to use it as such.
Yup. Use the right tool for the job. Python's performance is definitely something that the implementers of Python should worry about, but as a user, you have choices and you should exercise them. If your problem requires relatively high performance: Java, Go and Rust are all good alternatives. If absolute pedal-to-the-metal, no compromises, performance is needed: C, C++ are your choices. Heck, you can even inline assembly in some cases.
This is especially true when writing code that needs to run on many cores -- Python starts getting really cumbersome in this case, which can be avoided using the relatively elegant threading models in alternative languages.
Task based parallelism for a web server or background job server is really easy and that's 99% of the cases covered. Problems which _need_ full multi-threading with shared mutable state are actually pretty rare.
I have to disagree that those are "interpreter" overheads, since the later versions of the benchmark do not use the Python interpreter at all yet still suffer from these overheads. Maybe disagreeing on the wording is pedantic though, since I think the real discussion is what can be done about it.
We definitely agree directionally: you can create specific implementations of operations that are fast for their inputs. But look at how specific we have to get here. It's not just on the type of the callable: we have to know the exact callable we are calling (the str type). And its behavior is heavily dependent on its argument type. So we need a code path that is specifically for calling the str() function on ints. I would argue that this is prohibitively specialized for an interpreter, but one can imagine a JIT that is able to produce this kind of specialization, and that's exactly what this blog post is trying to motivate.
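To make that concrete, here's a hedged sketch (purely illustrative, not Pyston's or CPython's actual mechanism) of the kind of guarded fast path a JIT could install at such a callsite:

    def call_str(func, arg):
        # Guards: the callable must be exactly the builtin str (not shadowed or
        # subclassed) and the argument must be exactly an int.
        if func is str and type(arg) is int:
            return int.__repr__(arg)   # direct int -> decimal-string path
        return func(arg)               # generic fallback ("deoptimization")

    assert call_str(str, 42) == "42"
    assert call_str(len, "abc") == 3   # anything else still takes the slow path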
It's all tied together because there is a shared value/calling model that permeates both the runtime and the interpreter, and IMO with CPython much of this is academic: unless people are willing to break source compatibility with native extensions (especially with Python 2 vs Python 3 still lingering, even if most have moved to Python 3), it's too much of a historic mess to clean up. (Although it's a good sign that the official Python docs point to CFFI and ctypes and discourage new development against the C API.)
After doing my thesis work (a JS-subset -> C compiler) I was a bit disappointed in the results on weaker processors, and it led me to experiment with a series of toy compilers. Those experiments really showed how intrinsic a good value model is to determining your performance _ceiling_.
From memory, you gained a factor of 3-5 from improving or eliminating each of these:
- bytecode dispatch (vs native code)
- naive dynamic type tests (vs optimized paths)
- obj/refcount management (vs just using values directly with a GC-backed model with cheap stack-root registrations)
- naive property lookups (vs optimized variants).
If you started with a naive, ref-counted implementation of a dynamic JS/Python-like language you were within an order of magnitude of CPython's level of perf; as you check off these boxes you get closer to V8 (I used C compilers as backends for code generation, so I had a slight edge on the low-level code generation part).
The CPython model is great for integrating with hand-written C/C++ code, but the price you pay is a low ceiling, since it requires you to manage refcounts (at the very least at function call boundaries, even if you optimize them with deferrals) and, as you mention in the article, pass values via objects.
Special str for ints is not unusual, I'll specifically disagree with you there. Formatting and string output are incredibly common, especially in scripting languages, and repay specialized optimization.
IIRC, the Delphi compiler converts WriteLn('foo: ', 42) into WriteStr, WriteInt and WriteLn calls, for example. StringBuilder classes in Java and .NET have overloads to specialize appending non-boxed int. Etc.
> So we need a code path that is specifically for calling the str() function on ints. I would argue that this is prohibitively specialized for an interpreter,
I would argue it isn't. It's actually less about the ints and more about the tuple allocation:
"Now that we've removed enough code we can finally make another big optimization: no longer allocating the args tuple This cuts runtime down to 1.15s." (down from 1.40s)
It seems to me that having a special case for one argument, avoiding the tuple allocation on each call, is nothing "prohibitively expensive", and it would benefit all functions called with one argument, of which there are plenty.
Regarding ints vs something else -- somewhere in the code there is already different code that does str of an int vs str of something else. It's just about reaching that code for that type, and that doesn't have to cost much given today's speculative execution in CPUs. The people who produce compiled code know about that and can use it.
There's a minor but important point: the amount of work str() has to do depends on its input type. In particular, if arg.__str__() returns a string, a subtype of string, or not a string at all, str() will call different tp_init functions. So we can't get rid of the args tuple until after the optimization of knowing that arg.__str__() returns a string (which is why I did them in that order). So I guess the condition is a little bit weaker than str(int), but it's more than str() of one arg.
I'm still very sure that Python's source code could be refactored to avoid the allocation completely in the zero- or one-argument case for all the functions that currently always expect the tuple. But I also understand that you can't achieve that in your case without changing the existing code base.
I think you've proven that the language is a lost cause. It's never going to perform OK.
With all the breakage of Python 3, it's a shame that the causes of slowness were not eliminated. The language could have been changed to allow full ahead-of-time compilation, without all the crazy object look-up. Performance might have been like that of Go.
Perl 6 is mostly not used because it has been renamed to Raku (https://raku.org using the #rakulang tag on social media). Raku is very much being used. If you want to stay up-to-date with Raku developments, you can check out the Rakudo Weekly News https://rakudoweekly.blog
> There's a continuum of implementation between a naive interpreter and a full-blown JIT compiler
And many full-blown JIT compilers also include an interpreter for things like deoptimisation and rolling forward to a safe state. Few things are a pure JIT (I think the .NET runtime is), and actually you don't want that, as it limits how far you can optimise.
It always depends on the use-case and language characteristics. V8 used to be JIT-only for a long time (generating naive x86 code isn't much slower than generating bytecode); IIRC a big part of the reason they added an interpreter mode was to decrease memory pressure, since JS usage had gone through the roof over the years (and the fact that most JS code is only ever run once).
Also, faster JIT compilers for JS etc. make many special cases and paths for type specialization to avoid GC pressure etc., whereas a .NET or Java runtime mostly has primitive-static paths (Java pre-invokedynamic), i.e. most of the time an int, double or object reference can't appear in the same "variable slot". The particular object type matters more for invocation optimizations (especially when it comes to interfaces) but less for data movement in generic code.
> IIRC a big part of the reason they added an interpreter mode was to decrease memory pressure, since JS usage had gone through the roof over the years
There were a whole bunch of reasons:
- IIRC, average memory on devices Chrome runs on was _decreasing_ over time (due to growth in low-end mobile devices) while average JS code size was increasing. Memory consumption from the relatively naïvely generated machine code was sizeable.
- The baseline JIT was based on the original V8 JIT, which was becoming a maintenance nightmare (it was no longer used for any optimised tier, IIRC), in part because it was very much originally written for a single architecture and then ported elsewhere, and the per-architecture code size was massive compared with the newer JITs.
- Increasing technical debt in the original JIT.
- Reducing the amount of duplicate implementation of new language features across the various tiers.
> The title isn't quite right, though. Boxing, method lookup, etc. come under "interpreter" too.
See also jsnell's answer here, however. E.g. (with my interpretations added, all possible errors in them mine): a "naive compiler would have exactly the same boxing overhead" (during the execution), "and on the other hand a smart interpreter could eliminate it" (if it is programmed to recognize these cases during the execution).
To add to that, if the previous processing of the source allows some work to be done outside of the loop during the interpretation phase, then it's not technically the "interpretation" that is sped up; the speedup is achieved by having the "interpreter" code do less work.
Arguing from your perspective, one can call "an interpreter" everything that takes the source as input and outputs the result. But personally, having worked on compilers, I believe that even the internals of today's "interpreted" languages have to be looked at with more granularity. As an example, in the broader Perl world it's completely normal to speak about the "compiler" of Perl. What "comes under interpreter" depends on whom you ask.
I believe that Python can be made faster than it is now, with no downside other than maybe having the sources of Python in some "less easy to understand" form, and I believe the post we're all commenting on illustrates that. "Common" operations could have less default overhead, whatever we call all of that.
I'd love to have a source where these optimizations are explained/shown. I'm building my own, and the only way you find them is by luck, looking at some codebases.
Every time someone talks about inherent interpreter overhead, I like to remind people of k [1][2], which is faster than C despite being a relatively naive interpreter.
Python is also blazingly fast when your Python program spends 99% of its time outside of Python and inside some highly optimized library like NumPy. That's because the interpreter overhead doesn't matter when you are not interpreting.
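A minimal sketch of that pattern, assuming numpy is installed -- the million-element loop runs inside numpy's C code instead of the bytecode loop:

    import numpy as np

    xs = np.arange(1_000_000, dtype=np.int64)
    total = int((xs * xs).sum())   # one vectorized pass, no per-element interpretation
    print(total)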
I've found it often doesn't matter how fast or slow python is if the bottleneck is outside of python's control.
For example, I wrote an lcov alternative in Python called fastcov[0]. In a nutshell, it leverages gcov 9's ability to send JSON reports to stdout to generate a coverage report in parallel (utilizing all cores).
Recently someone emailed me and advised that if I truly wanted speed, I needed to abandon python for a compiled language. I had to explain, however, that as far as I can tell, the current bottleneck isn't the python interpreter, but GCC's gcov. Python 3's JSON parser is fast enough that fastcov can parse and process a gcov JSON report before gcov can serialize the next JSON.
So really, if I rewrote it in C++ using the most blisteringly fast JSON library I could find, it would just mean the program will spend more time blocking on gcov's output.
In summary: profile your code to see where the bottlenecks are and then fix them. Python is "slow", yes, but often the bottlenecks are outside of python, so it doesn't matter anyway.
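For example, a quick first pass with the standard-library profiler (the workload here is just a stand-in):

    import cProfile

    def main():
        for i in range(1_000_000):
            str(i)

    cProfile.run("main()", sort="cumulative")   # shows where the time actually goes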
Also of course, like any other language, Python is not immune to I/O wait. Recently I was using Python to do some low level audio processing (nothing fancy) and wondering why it was taking so long to process such a small file. Turns out the "wave" stream reader I was using isn't internally buffered, like most of the normal file readers are, so each single-byte read was making the round trip to disk. Simply reading the whole file in one go sped up the program's execution more than 10x.
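The fix was essentially this (file name is a placeholder, not my original code) -- one bulk read instead of many tiny ones:

    import wave

    with wave.open("input.wav", "rb") as w:
        frames = w.readframes(w.getnframes())   # read all the audio data in one go
    # ...process the in-memory `frames` bytes from here on...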
I like to think I've been doing this long enough to not make such silly mistakes, but when you're focused on a totally different area of the problem, things like this are still quite easy to miss.
I'd argue that these things are comparatively easy to address, as one use of `strace` will immediately show bad IO patterns like that, and it works the same way across all programming languages.
> So really, if I rewrote it in C++ using the most blisteringly fast JSON library I could find, it would just mean the program will spend more time blocking on gcov's output.
Until the day when clang starts outputting gcov at 30x the rate GCC does it, and then fastcov will become the bottleneck.
That said, I'm a user and it is much better than lcov, so thank you for it :-)
Very true... and let's hope that day does come, because it means we can generate coverage reports even faster (if not with my tool, with someone else's)!
Node's json parser is so fast that I've seen people recommend using it over static initialization of object graphs as recently as last week. I had to check the date, thinking that I was looking at some post from 2016. Nope, spring 2020.
As an aside, have you benchmarked running 1.5x or 2x gcov instances per core? Most old code is written as aggressively sequential, so any stall causes it to sit doing absolutely nothing while waiting for an operation to complete. The throughput curve for parallel tasks often has a peak somewhere in the interval 1 < n < 3. When n > 2 it's time to have a heart-to-heart with the authors.
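Something along these lines (the paths and gcov flags are my assumptions, not fastcov's actual code) -- oversubscribe the pool a bit so a process blocked on I/O doesn't leave a core idle:

    import glob, os, subprocess
    from concurrent.futures import ThreadPoolExecutor

    def run_gcov(gcda):
        # gcov 9+: JSON intermediate format written to stdout
        return subprocess.run(["gcov", "--json-format", "--stdout", gcda],
                              capture_output=True).stdout

    gcda_files = glob.glob("**/*.gcda", recursive=True)
    workers = max(1, int(os.cpu_count() * 1.5))   # try 1.5x-2x workers per core
    with ThreadPoolExecutor(max_workers=workers) as pool:
        reports = list(pool.map(run_gcov, gcda_files))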
Code gets reused, even if you don't plan for it. Prototypes go to production. Chunks of code from one project, where performance doesn't matter, wind up in another project where performance is important.
For this reason, starting with slow code (either Python or just lazy OO-heavy C++) sets you down a path to future trouble. There will never be time for a rewrite, and if you think there is time then a case of "second system syndrome" may be what stops you.
There is also the matter of personal habit. You use what you are comfortable with. If you get comfortable writing slow code, then you'll start projects with slow code by default. You'll hit the above problems far more often than somebody who is in the habit of writing fast code by default.
No offense, but that seems like bikeshedding to me. At what point do you stop making premature optimizations like that? The commenter had a need, wrote a package to fulfill the need in a language they were comfortable with, and got the job done. Performance is important to the project and the performance goals are met.
Why should they have spent more time to accomplish the same goal and same performance?
I think the parent post may also be assuming that C++ code takes little more effort/time to write than Python. My experience is that this couldn't be further from the truth. I recently did some DSP work, and prototyping it in Python and then rewriting the code in C++ easily saved me calendar time as well as effort and headache. Python code has a tendency to just work for numerical applications, and it would have been a bigger headache to check my work in C++ without answers that were easier to verify in Python (I remember reading, years ago, a benchmarking thread that got shut down hard by someone pointing out that only the Python code was giving the correct answers, due to various floating point issues). On top of that, I needed to use templates to get more compile-time substitution to squeeze out performance, and struggling with templates and the type checker would have been a context-switching nightmare without having the algorithms already working.
Btw I highly recommend using pythran for your use case which apart from speeding up python significantly allows you to translate your python code directly to C++. Here is a blog post exactly about this use case: https://serge-sans-paille.github.io/pythran-stories/pythran-...
How much time is actually spent in the python runtime/interpreter? That python can glue some libraries together at a high level was never really in doubt.
>impressively, PyPy [PyPy3 v7.3.1] only takes 0.31s to run the original version of the benchmark. So not only do they get rid of most of the overheads, but they are significantly faster at unicode conversion as well.
It is super unfortunate that PyPy struggles for funding every quarter. Funding a million for PyPy by the top 3 Python shops (Google, Dropbox, Instagram?) should be a rounding error for these guys... and it has the potential to pay off in hundreds of millions at least (given the overall infrastructure spend).
You can look at the last release of Pyston for why companies are not funding this work. They're just moving to other languages.
Consider that Dropbox, Google, and Instagram likely spend much more than 1M on optimizing/improving Python, just to have it still be a relatively slow language.
At some point it becomes way cheaper to just move to other languages that don't require that constant level of effort. Think about how much time and money is spent on things like:
* Adding types to Python
* Improving performance
when you could just move performance critical code to Go/ Rust as Dropbox has done, for a fraction of the cost, with far less maintenance burden.
"Make Python faster" is just a losing game imo. It is fundamentally never going to be as fast as other languages, it's far too dynamic (and that's a huge appeal of the language). Just look at the optimizations done here - moving `str` to local scope to avoid a global lookup? And can you even avoid that? Not without changing semantics - what if I want to change `str` globally?
Still, I was surprised by some of the sorta nutty wins that were achieved here. There's clearly some perf left on the table, and I'm not an expert on interpreters, it just seems really hard to build a semantically equivalent Python ("I can change global functions from a background thread") that can automatically optimize these things.
> "Make Python faster" is just a losing game imo. It is fundamentally never going to be as fast as other languages, it's far too dynamic (and that's a huge appeal of the language).
This cannot be overstated. Unfortunately python is especially perversely dynamic. After all Javascript and Ruby are highly dynamic, but for a number of reasons have avenues that allow more meaningful optimization gains without changing language semantics. (And although it is true that Google had tremendous resources to pour into V8, it's not like this point doesn't stand - luajit as an example).
Python imho took the performance/productivity trade off way too far. You can get some very effective dynamism with some minimal restrictions that won't so badly shoot yourself in the foot for performance opportunities later. Frankly, for mainstream development, Kotlin or C# gives very pleasant, productive languages with strong ecosystems without paying such a penalty. Swift is good. Go isn't personally my cup of tea, but sure.
Python sort of got a foothold in science.. but that's never been known for being a field of quality software engineering.
Python isn't perversely dynamic, no more so than Self which was running within a factor of 2 of C like 30 years ago, and PyPy optimizes Python just fine. The major stumbling block is that the C API unintentionally sucks and requires lots of work to support https://morepypy.blogspot.com/2018/09/inside-cpyext-why-emul...
LuaJIT is a great example of how experimentation and the right tradeoffs can give you a faster language runtime without a huge effort but it's not at all a great example of a highly optimizing compiler.
There's no "map" or "hidden class" optimization in LuaJIT, for example, and getting good performance means avoiding repeated table lookups.
What exactly is known for being a field of quality software engineering?
100% of Python users find it to be fast enough for their use case. Let's stop beating this horse into a black hole, it doesn't need to be fast at converting integers to unicode 100,000 times, to allow quality software engineering.
To compare speed of CPython with quality of anything is such a narrow view, it is self-defeating.
> 100% of Python users find it to be fast enough for their use case.
This is unlikely and an odd thing to say especially in the context of a thread about people rewriting their software in a different stack in part because of python performance issues. There are plenty of people using python that feel the pain and need to spend resources on performance improvements.
The fact is, very many line-of-business apps will require more and more complicated hardware resources compared to using "boring shit" like .NET or Java, or Go (which is actually pretty boring, startup hype aside). I'm no huge fan of Java, but I don't feel any less productive in Kotlin than I do in Python; for many things it is even better. Python, aside from a few things, still looks like a 1989 language with a few newer features - other than its basically lazy binding, the language doesn't have many amazing tricks up its sleeve. Meanwhile 30 years of progress has been rolled into the mainstream of C#, Swift, and Kotlin.
> To compare speed of CPython with quality of anything is such a narrow view
I'm not saying you can't write quality software with python. What I was saying is the only niche that Python has maybe picked up a significant mindshare compared to alternatives is scientific computing. And I say this with no insult, but as a former grad student in the sciences and having written quite a bit of monstrous python - it's not a field of quality software engineering.
I've seen badly written scientific Python code, and I've seen very badly written enterprise Java code. I've also written probably more than a hundred web apps using Django that never needed to be written in Java. And Django is a great work of software engineering, having not much to do with scientific software.
Some apps need to be re-written for performance reasons, or optimized, that is however, not the most common case.
For all the companies that are complaining about the performance issues, I suspect they are complaining about the change of their incentives or circumstances, unless they made an uninformed choice with choosing Python to begin with.
The missing piece of the discussion here seems to be socio-political (human) aspect of computer language usage. When choosing a "slow" language, many other factors beside performance and ecosystem are considered. The primary usage of a computer language is actually communicating with the fellow human beings, and there is a huge cost and overhead associated with all the software writing and interpretation. That is why it baffles me when Python is dismissed entirely on a pure performance basis.
I would wager that in a large percentage of cases you would see the same or larger gains from going for better algorithms as you would from going to a different language. Your example of scientific computing (and the point about software engineering is well taken) is a good one, because in scientific computing people tend to research and use the fastest algorithms, and once you do, Python (+numpy) is often fast enough.
Using numpy is fast, because it just uses the forever-optimized stuff bundled with it, and uses not much Python. But of course offers a handy interface.
Python suffers from the same problems as C in this regard. It's very powerful and easy to get going with, because it doesn't force you into some "better" paradigm/architecture/ideology/thinking. (Like, let's say, mypy or Rust do.)
And that's okay. 80+% of Python scripts/programs are fine without that rigor; there are more important problems to worry about. (Like making the company profitable sooner instead of spending a few more months figuring out the types/bindings for mypy/Rust.)
> 100% of Python users find it to be fast enough for their use case.
As someone who uses python because it tends to be installed on my targets and comes with a decent standard library, I would like the hours I spend optimizing my code back. Especially those hours I lost before I ended up porting the problematic code to C++ or Java.
That's the intended humor: they stop being Python users if they outgrow it. The language and ecosystem are growing regardless. I've been using it professionally since 2007, and some tasks I know it is not good at, and that's when I know to use another tool. The point is that it is a tool that has its uses, and I consider it the best tool for certain categories. It doesn't need to be the best tool for all categories. That seems pretty "absolutist".
This is always mentioned as motivation factor, yet Self, Smalltalk, Dylan, Julia, Common Lisp, JavaScript are just as dynamic and manage to have a good set of JITs to choose from.
For example in Smalltalk you can completely change the structure of a given class across the whole process space, thus invalidating everything that the JIT tried to optimize.
So no, Python isn't any special snowflake, rather there isn't the willingness to actually spend the resources improving its JIT story.
This is a common view but I've never heard it from someone who has tried to optimize Python. Personally I think that Python is as much more dynamic than JavaScript as JavaScript is than C. (I can't talk for the other languages.)
Just look at the small example in this post:
- converting things to strings invokes several layers of dynamic dispatches
- Python has an extremely expressive calling convention that can cause a lot of overhead
Another example that I like to give is that in Python you can inspect the stack variables of exited frames! The Python-on-mature-VM projects have also not resulted in great performance.
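A small self-contained illustration -- the locals of a frame that has already exited are still reachable through the traceback:

    import sys

    def inner():
        secret = 42
        raise ValueError("boom")

    try:
        inner()
    except ValueError:
        tb = sys.exc_info()[2]
        while tb.tb_next is not None:   # walk to the innermost (already exited) frame
            tb = tb.tb_next
        print(tb.tb_frame.f_locals)     # {'secret': 42}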
I hear your argument a lot, and I disagree and I also believe it misguides optimization efforts ("let's just apply JS techniques to Python") including our early ones with Pyston. It's part of what I'm trying to get at with this blog post, but maybe I'll write another one specifically about Python vs other languages.
Is Julia really just as dynamic? Isn't the problem with Python that the low-level C API still needs to be respected during JIT, which just kills performance. (That's why PyPy largely doesn't support it, right?)
Even JS doesn't suffer from this, because folks can't just load V8 extensions in their browser, and Node went through quite a few NAPI versions - forced by V8 changes.
That said, of course it'd be possible to speed up Python/CPython with pouring a lot more money into it. But ... the CPython codebase is already old and has many problems, a lot of backward compatibility gotchas, and relatively few people willing to work on it. Because there's not a big reason to do so. Whereas with JS Google (and thus Mozilla too) was very incentivized to make it fast.
Julia (and likely Dylan and CL which are all similar languages) are not nearly as dynamic as Python. Or more accurately, they are nearly as dynamic, but writing code like that is not idiomatic and it will lead to performance similar to Python.
The most important factor is that those languages were designed with JIT in mind, with a clear separation of compile time and runtime. Not only the macros, which are low-cost abstractions, but also how all the dynamic parts of the language can actually be resolved during this period, like type inference and static/multiple dispatch, which are enabled by its carefully crafted type system that bridges the dynamic world and the static world. Idiomatic Julia is not strictly a dynamic language but a superposition of many static programs, one of which will be selected for each runtime pass.
So changing a variable's type in Python has basically no effect, but in Julia, while allowed, it causes what's called type instability, and the compiler will be pessimistic and create a dynamic box for the type that can't be optimized. Which is also why global variables are so damaging to performance in the language, since they can't be inferred. Defining or redefining a function (not a lambda) or types, or importing a library dynamically during runtime, is another feature that is allowed but avoided in practice, since it will invalidate the JIT cache and force a recompilation. The culture of performance-aware programming is the second key factor in their speed.
Julia, I am not sure how far its dynamism goes, but Dylan, Common Lisp and Smalltalk certainly.
You can even at any point in time stop execution in the debugger, rewrite part of the world and just press continue as if that was how the code was originally written to start with.
> You can even at any point in time stop execution in the debugger, rewrite part of the world and just press continue as if that was how the code was originally written to start with.
Bit of a straw man, because you wouldn't do this regularly in code. Python, on the other hand, is a relatively large language at this point, and plain idiomatic Python code leans relatively heavily on JIT-unfriendly constructs compared to the other languages mentioned. Meanwhile, CL has a whole concept of "compile time" that doesn't really exist in Python.
Hence the "perversely" part.
PyPy has used similar tricks to Smalltalk, Self, and JS/V8, many of which were old hat in the 90s, but PyPy demonstrates that writing a performant JIT with reasonable memory requirements for real-world code is much harder for Python.
For me the only thing that PyPy sadly demonstrates is that the Python community doesn't care about JIT support, and anyone who wishes to use languages that embrace JITs should look elsewhere instead of facing continuous disappointment.
> there isn't the willingness to actually spend the resources improving its JIT story.
We should admit that funding is really important here. The Opensmalltalk-vm is currently sitting on a known optimization (speculative inlining) that has the potential to improve performance 2-3x. But those guys don't have the funding to implement it. Things like JIT/VM work are specialized and complex, and small free-time contributions from people FOSS style won't cut it.
JS has less dynamism in various places (operators especially, which allows you to statically type more, and therefore elide guards).
But yeah, I very much agree: Python isn't special here, it's just a willingness to spend resources (though a lot of that is tied to the CPython C API: either you constrain your VM design choices to be compatible with it directly, or have to add overheads to any/many C API calls).
> Frankly, for mainstream development, Kotlin or C# gives very pleasant, productive languages with strong ecosystems without paying such a penalty
If dynamic is important, Groovy is an excellent option - it preserves almost all of the desirable dynamic properties of Python but is by default 3-5x as fast and with trivial effort 10-20x as fast, approaching Java speeds essentially, and with all of the advantages of that runtime (full concurrency, etc).
I think only Dropbox actually put any significant resources into it?
Google has donated a few tens of thousands of dollars to PyPy, and it didn't actually support Unladen Swallow; that was just a side project by a few employees.
Instagram did contribute the GC.freeze work but that's all I know of? Do they also donate?
It's nothing like the effort these companies put into building V8 or HHVM. PyPy was all research funding and tiny compared to those investments.
Well, Google has their own Python type system and employed Guido for years. I'm sure they've spent millions on improving Python or their usage of Python.
I suspect Instagram has spent > 1M dollars of eng time on Python.
What I'm trying to say is that these companies do spend money on the problem of Python performance. I doubt much of it is upstreaming improvements to Python, though I'm sure some is (at least indirectly, by funding Guido and other major Python developers); most is probably just improving their own internal libraries or, more likely, finding ways to FFI.
I think that is the main issue, while other dynamic language communities embrace their JIT efforts, in Python the majority just rewrites the code in C and moves on.
There is even this skewed vision that doing that is still writing Python.
You will get into some long arguments with some guys or gals trying to make the point that C isn't Python just because it gets a couple of Py_... calls.
Rewrites are the devil though, at least that's what Joel Spolsky taught me.
Additionally, spending time at a really large tech company taught me that rewrites apparently never work (or at least it's better to build compilers for your shitty slow code rather than rewrite it in a better language).
The linked post concerns Dropbox, who to be fair are an infrastructure company with low margins selling a commodity product. It doesn't surprise me that they would find such a move more important than say, Instagram who have a very large margin selling a differentiated product.
I do really appreciate Python types (and things like MonkeyType from IG for adding them) as they are an incremental improvement to a widely used language.
Rewrites have probably saved Dropbox 10's of millions of dollars. They just have to be done well. They've rewritten many critical components in the last few years (magic pocket is a big one, nucleus) to great success.
Where rewrites are bad is when someone comes to a codebase, goes "wow this sucks, I can do this better", and starts from scratch. This is worth calling out because it's quite common, but it's very easy to avoid.
I'm a fan of Python for what it's good for. I use it professionally, and built a product around it. I'd be happy to see Python perf wins, but I'm never going to expect it to compete with other languages.
> Rewrites are the devil though, at least that's what Joel Spolsky taught me.
From scratch rewrites of anything large are, absolutely. But it's entirely possible to rewrite code component-by-component in many cases, and that's what most of the successful moves have done.
Even assuming you're right - that all of these companies should move away from Python because it's a losing game - it's STILL going to be efficient to spend a million on a drop in replacement that gives a boost for free....while the moving away is happening
Where are you getting this information that they're moving to other languages? Isn't Instagram almost completely Python? I talk to them at PyCon every year and they spend a lot of time talking about it.
The post I linked talks about Dropbox moving to other languages. I didn't mean to imply that I knew Instagram was also moving away from it, sorry, I see how I misworded that.
This is a fun benchmark in C++, where you can see that GCC has a more restrictive small string optimization. On my desktop the main python example runs in 3.1s. Then this code
    #include <cstdint>
    #include <string>

    void example() {
        for (int64_t i = 1; i <= 20; ++i) {
            for (int64_t j = 1; j <= 1'000'000; ++j) {
                std::to_string(j);  // convert and discard, matching the Python benchmark
            }
        }
    }
runs with GCC in 2.0s and with clang in 133ms, so 15x faster.
I've also benchmarked it in Julia:
    function example()
        for j = 1:20, i = 1:1_000_000
            string(i)
        end
    end
which runs in 592ms. Julia has no small string optimization and does have proper unicode support by default.
None of the compilers can see that the loop can be optimized out.
The result of std::to_string is defined to be the same as sprintf, so it depends on the current locale, which is global, mutable state, and initialized to the user's chosen value at runtime. That makes the optimization impossible to do safely without special knowledge about the locale functions.
The awful way that C hacked on locales via implicit global state has been a disaster, leading to both slow and broken code. If you want fast, predictable conversions between numbers and strings, you need to use std::to_chars. Or, std::format for localized conversions.
After a little more spelunking, it seems that this particular case was possible to optimize. While you can set the thousands separator for integers in C++ via locales, it only applies to iostreams and not to sprintf. You can also set it for sprintf in POSIX, but it doesn't apply to integers unless you opt-in via the format specifier.
So, there is no applicable locale information needed for integer formatting with std::to_string, and GCC began to replace the naive sprintf-based implementation with one based on std::to_chars in 2017: https://gcc.gnu.org/legacy-ml/gcc-patches/2017-05/msg02103.h...
The problems I previously mentioned would apply if that were a float, though, and I would still recommend using std::to_chars just so you don't have to think about it.
I rewrote the benchmark in Perl and Julia as well. I kept it close to its origins, and included interpreter startup time.
Original python:
    #!/usr/bin/env python3
    def main():
        for j in range(20):
            for i in range(1000000):
                str(i)
    main()
/usr/bin/time ./st.py
3.16user 0.02system 0:03.19elapsed 99%CPU
Perl port
    #!/usr/bin/env perl
    use strict;
    sub main {
        my ($i,$j);
        foreach $j (0 .. 20) {
            foreach $i (1 .. 1000000) {
                sprintf "%i",$i;
            }
        }
    }
    main();
/usr/bin/time ./st.pl
2.31user 0.00system 0:02.31elapsed 99%CPU
And julia
    #!/usr/bin/env julia
    function main()
        for j in 0:20
            for i in 1:1000000
                string(i)
            end
        end
    end
    main()
/usr/bin/time ./st.jl
1.43user 0.52system 0:01.30elapsed 149%CPU
This isn't a very useful test of anything. The optimizations ceased being Python after the use of map with list. Then it delved into C. Which is precisely the point that Julia is attempting to solve by providing the power and performance without the multi-language costs/overhead.
Note: This is Julia 1.4.1, Perl 5.30.2, and Python 3.8.2 running on my 2 year old linux laptop with a kaby lake processor.
FYI, the Perl equivalent is probably "$i" (which in my benchmarks results in ~41% of the runtime). It doesn't appear to be optimized away either, as an empty loop, and one with a simple incrementing variable that's printed at the end, result in 19% of the running time.
In Perl the appropriate way to convert a number to a string is to interpolate it or use a string based operator ("." for concatenation, etc) on it and let it convert the type itself. sprintf is very heavy and does quite a bit more than simply convert a number to a string.
Neat, the way I benchmarked was through BenchmarkTools.jl, which runs the functions sufficiently many times to get a statistically relevant result. It will not measure compilation time, but I did not include that for gcc / clang either.
If it's using the small string optimization due to short strings, you're cutting allocations almost in half, I believe, as well as putting said strings on the stack instead of the heap, which will give... significant performance gains. Short strings in the C++ standard library can be crazy fast.
Do you mean allocations as in heap allocations? If so, yeah, I guess I should have stated it better, as it still needs to set aside space on the stack for the short string.
The stack space for the whole function call is allocated in one step. You only get individual allocations when you are messing around with variable length arrays or alloca.
This article seems to be using a very specific definition of interpreter, which is perhaps not what most people think of when they hear "interpreter" ?
If I understand correctly, they call the module generating Python opcodes from Python code the "interpreter", and everything else is a "runtime". But Python opcodes are highly specific to CPython, and they are themselves interpreted, right? Calling the former "interpreter" and the latter something else seems like an artificial distinction.
Not only is this definition of "interpreter" strange, but their definition of "runtime" also seems strange; in other languages, the runtime typically refers to code that assists in very specific operations (for example, garbage collection), not code that executes dynamically generated code.
> If I understand correctly, they call the module generating Python opcodes from Python code the "interpreter",
No, you misunderstand. They explicitly define the interpreter as "ceval.c which evaluates and dispatches Python opcodes". Maybe "evaluate and dispatch" suggest something else to you, but ceval.c really is the code that iterates over a list of opcodes and executes the associated computations. This is absolutely 100% the part of Python that is the interpreter. The module that generates Python opcodes is the "compiler" (or "bytecode compiler"), and the article specifically points out that it's not included.
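Concretely, the opcodes that ceval.c loops over and dispatches are easy to inspect with the standard dis module:

    import dis

    def f(i):
        return str(i)

    dis.dis(f)
    # On CPython 3.8 this prints something like (details vary by version):
    #   LOAD_GLOBAL      0 (str)
    #   LOAD_FAST        0 (i)
    #   CALL_FUNCTION    1
    #   RETURN_VALUE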
But at what point do function calls from ceval.c stop counting as part of the interpreter? Okay, calling a C function clearly at some point ceases being part of the interpreter, but is the entirety of an attribute lookup (i.e., executing the LOAD_ATTR inst) on an ordinary Python object part of the interpreter or the runtime? In plenty of VMs the object representation is an intrinsic part of the VM design, with the VM having deep knowledge of it.
That's a very blurry distinction, and I'm not very interested in it. I was correcting the OP's first two paragraphs, I didn't take a stance on the third.
Etymology says that "interpreter" is made of "inter-" (between) and "-pretium" (price, value, worth, merit).
So an interpreter is a middle man/woman whose merit is to help people communicate, I guess.
For interpreted languages, there are indeed two interpreters: the one that translates from source to bytecode (or whatever technique the implementation uses) - in Forth it is called the "outer interpreter" - and the one that translates bytecode to machine code (in Forth it is called the "inner interpreter").
That distinction between those two interpreters allows to understand, for instance, how you can "run" Lisp by compiling Lua bytecode.
"Runtime" is to me a fuzzy word, similar to "frameworks" that are actually just libraries. I am not in position to be a terminology Nazi but I think there's some misuse sometimes. To me the runtime of a language is everything the "inner interpreter" does. This includes "bytecode dispatch" and indeed, garbage collection but also more trivial things like performing the additions the user asks for.
> For interpreted languages, there are indeed two interpreters: the one that translates from source to bytecode (or whatever technique the implementation uses) - in Forth it is called the "outer interpreter' - and the one who translates bytecode to machine code (in Forth it is called the "inner interpreter").
This seems to be nonstandard usage that is entirely limited to Forth. Standard usage for the translator from source code to bytecode is "compiler". I'm not aware of any exceptions to this except for Forth.
> That distinction between those two interpreters allows to understand, for instance, how you can "run" Lisp by compiling Lua bytecode.
I, for one, don't understand what you mean by this.
> This seems to be nonstandard usage that is entirely limited to Forth
Certainly. But sometimes artificially changing the terms avoids misunderstandings.
> I, for one, don't understand what you mean by this.
I was referring to what is usually called a "transpiler": the "compiler" translates source code for language A (eg Lisp) into bytecode for language B (eg Lua), then the "runtime" for language B is used to translate bytecode B into actual actions. So you end up with an "interpreter" (actually two interpreters in the etymological sense) that takes in Lisp source code but uses Lua runtime to execute it.
Conflating compilation and interpretation does not "avoid misunderstandings". Quite on the contrary. Compilation and interpretation do very different things.
Example: Imagine a Lisp program that prints "hello world" in an infinite loop. If you compile this to Lua bytecode, the compilation process (a) will not print "hello world", and (b) will not run into an infinite loop. Running the Lua bytecode will (a) print "hello world", and (b) will do so in an infinite loop.
Surely you agree these are very different things. Compilation translates input code into output code without executing it. Interpretation executes the input code, usually without further translating it into another representation. Why would you confuse them by trying to apply the same term to them when we have very good distinct terms for these distinct concepts?
Etymology does not come into it. Words mean what they mean, not what some related word might have meant in the past.
> Compilation translates input code into output code without executing it.
I think it is not totally true. There's Forth immediate words, there's Lisp macros and to a lesser extent there's C++ templates (which are Turing-complete).
> Why would you confuse them by trying to apply the same term to them when we have very good distinct terms for these distinct concepts?
Because from a slightly more generalized perspective, that's the same thing.
Compilation: translate source code to bytecode
Interpretation: translate bytecode into side-effects
If I say that side-effects can be formalized into a language, I think I have most Haskell programmers with me. If you run your program in a VM, you can compare/examine the bytes of the snapshots before and after. You can use diff to turn those changes into another language. And you can use patch to replay those exact changes.
> Etymology does not come into it. Words mean what they mean, not what some related word might have meant in the past.
You realize you say that on Hacker News?
> There's Forth immediate words, there's Lisp macros and to a lesser extent there's C++ templates (which are Turing-complete).
This is true. It's not what we have been discussing so far, but yes, there is also a notion of executing code at compile time. We might call compile-time code execution "outer interpretation", but that would still be less clear than "compile-time code execution". It would also be different from what you had wanted to call "outer interpretation" so far.
> Compilation: translate source code to bytecode
> Interpretation: translate bytecode into side-effects
https://en.wiktionary.org/wiki/translate#Verb lists three senses of "translate", the first two of which have six sub-senses each. Non-technical language can be vague. That is why we introduce technical terms to speak of technical things.
> You realize you say that on Hacker News?
I don't know what you are trying to tell me. Which is a fitting end to this discussion about your insistence on communicating less clearly than you might.
The fact that you are counting the second set of meanings of translate, which is archaic (who said "words have meaning, not what some related word might have meant in the past"?), shows that you know how weak your argument is.
> That is why we introduce technical terms to speak of technical things.
Yep. I introduced two technical terms "outer interpreter" and "inner interpreter", along with their definitions.
> I don't know what you are trying to tell me. Which is a fitting end to this discussion about your insistence on communicating less clearly than you might.
Being able to accept that others may have a different terminology, embrace it and speak with their words: this is communication. You don't have a monopoly on the meaning of words.
C++ templates are not usefully complete: which would mean that they can calculate any possible unit of C++ syntax (the relevant representation in their domain of operation).
Turing Complete means that any general recursive function can be computed; but not necessarily with a useful representation of inputs and outputs or reasonable use of resources.
Analogy: suppose you can compute any function over the domain of the whole numbers, but these have a goofy representation like SSSSS0 (successor of successor of ... zero). Your system cannot actually compute "42", only 42 S's followed by 0, which is interpreted as 42, so if the requirement is to calculate the actual digits "42", the system fails to deliver. Moreover, the representation it works with is tremendously time- and space-wasting.
> And impressively, PyPy [PyPy3 v7.3.1] only takes 0.31s to run the original version of the benchmark. So not only do they get rid of most of the overheads, but they are significantly faster at unicode conversion as well.
Wow, that's pretty impressive. I never really got to use PyPy though, as it seems that for most programs either performance doesn't really matter (within a couple of orders of magnitude), or numpy/pandas is used, in which case the optimization in calling C outweighs any others.
If you are concerned about performance, it really shines. It speeds up computational workloads, for sure. But it also improves performance in lots of I/O scenarios too.
Anything where function call overhead is an issue and you don't have a native library escape hatch. Parsers written in Python are inherently slow. Especially so with parser combinators.
Well, anything that you need to do where the libraries are there and waiting for you.
If you get into the territory of missing libraries it can be a bit of a pain. Otherwise, it's a breath of fresh air as it's almost a drop in replacement.
Just to clarify, does it matter if these libraries are pure Python? What I have a hard time understanding is how PyPy is almost a drop-in replacement but then have issues with missing libraries. Couldn't you just pip install the libraries or just literally get the source and run them if they're pure python?
While unlikely to be a common use case, it's super handy to deploy onto Flatcar Linux because pypy is just "unpack the tgz and run," without dealing with a lot of crazy shared library dependencies
I did similar studies about a decade ago for Perl, with similar results.
But what he's missing are two much more important things.
1. Smaller data structures.
They are way overblown, both the ops and the data. Compress them; trade for smaller data and simpler ops.
In my latest VM I use 32-bit words for each op and for each piece of data.
2. Inlining. A telltale sign is when the calling convention (arg copying) is the biggest contributor in your profiler.
Python's bytecode and optimizer are now much better than Perl's, but it's still 2x slower than Perl. Python has by far the slowest VM. All the object and method hooks are insane. Still no unboxing and no optimizing refcounting away, which is what brought PHP ahead of the game.
Notice the call to PyArg_ParseTupleAndKeywords -- it takes an arguments tuple and a format string and executes a mini interpreter to parse the arguments from the tuple. It has to be ready to receive arguments as any combination of keywords and positional, but for a given callsite the matching will generally be static.
It used to be this way, but at some point they added a faster calling convention and moved a number of things to it. Calling type objects still falls back to the old convention though.
I wanted to play with variations of the code. For that it is useful to make it output a "summary" so you know the variation you tried is computationally equivalent.
For the first benchmark, I added a combined string length calculation:
    def main():
        r = 0
        for j in range(20):
            for i in range(1000000):
                r += len(str(i))
        print(r)
    main()
When I execute it:
time python3 test.py
I get 8.3s execution time.
The PHP equivalent:
    <?php
    function main() {
        $r = 0;
        for ($j=0;$j<20;$j++)
            for ($i=0;$i<1000000;$i++)
                $r += strlen($i);
        print("$r\n");
    }
    main();
When I execute it:
time php test.php
Finishes in 1.4s here. So about 6x faster.
Executing the Python version via PyPy:
time pypy test.py
Gives me 0.49s. Wow!
For better control, I did all runs inside a Docker container. Outside the container, all runs are about 20% faster. Which I also find interesting.
Would like to see how the code performs in some more languages like Javascript, Ruby and Java.
Due to Rust's concern about thread-safety, the standard println! macro acquires a lock prior to writing to stdout. This means that if you're doing a lot of printing, it can be quite slow due to the extra work of acquiring and releasing that lock.
This is fairly well known and if it's a bottleneck in your program the recommendation is to acquire the lock yourself and then write directly to stdout.
My Java timing is very close to Rust's. I have a feeling PyPy is eliminating the string conversion as dead code. When I alter my Java toy code to not prevent elimination of the toString() call, it runs close to the speed of PyPy.
> I have a feeling PyPy is eliminating the string conversion as dead code.
If I'm not mistaken, the specific variation of the benchmark shown upthread could not be eliminating the string conversion, as it's adding the length of the resulting string to a variable which is later printed.
Both `-O` and `-C opt-level=3`. I'm guessing this is one of those high throughput scenarios where having a GC is quicker because it's able to defer deallocation until the end of the program.
I think he is making the distinction between 3 different categories that readers are in general lumping into the 'interpreter':
1) Python is executed by an interpreter (necessary overhead)
2) Python as a language is so dynamic/flexible/ergonomic that it has to do things that have overhead (necessary complexity unless you change the language)
3) the specific implementation of the interpreter achieves 1 and 2 in ways that can be significantly slower than necessary
Seems like he is pointing out that a lot of performance issues that are generally thought to be due to 1 and 2 are really due to 3.
> 2) Python as a language is so dynamic/flexible/ergonomic that it has to do things that have overhead (necessary complexity unless you change the language)
As PyPy demonstrates, much of this doesn't need to have anywhere near the overhead that it does in CPython. You can absolutely do better than CPython without changing the language, and you can do better than CPython without a JIT if you start specialising code.
While many here correctly observe that "too much dynamism" can become a performance bottleneck, one has to analyze the common subset of Python that most people use and see how much dynamism is intrinsic to those use cases.
Other languages like JS have tried a static subset (Static TypeScript is a good example) that can be compiled to other languages - usually C.
Python has had RPython, but no one uses it outside of the compiler community.
The argument here is that python doesn't have to be one language. It could be 2-3 with similar syntax catering to different use cases and having different linters.
* A highly dynamic language that caters to "do this task in 10 mins of coding" use case. This could be used by data scientists and other data exploration use cases.
* A static subset where performance is at a premium. Typically compiled down to another language. Strict typing is necessary. Performance sensitive and a large code base that lives for many years.
* Some combination of the two (say for a template engine type use case).
A problem with the static use case is that the typing system in python is incomplete. It doesn't have pattern matching and other tools needed to support algebraic data types. Newer languages such as swift, rust and kotlin are more competitive in this space.
Kevin knows much more than I do about optimising Python, but aren't lots of the things listed as 'not interpreter overhead' only slow because they're being interpreted? For example you only need integer boxing in this loop because it's running in an interpreter. If it was compiled that would go away. So shouldn't we blame most of these things on 'being interpreted'?
Not really. Integer boxing is totally orthogonal to compilation vs interpretation. Assuming we need to preserve program semantics, a naive compiler would have exactly the same boxing overhead. And on the other hand, a smart interpreter could eliminate it (e.g. stack-allocate integers by default, dynamically detect conditions that require promoting them to the heap, annotate the str builtin as escape-safe).
Isn't removing overhead like boxing the main point of a compiler? Seems like if you write a compiler that doesn't do that you might as well not have bothered.
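For a concrete sense of what "boxing" means in CPython (a small aside, not from the parent comments): every int is a full heap-allocated object rather than a bare machine word, which sys.getsizeof makes visible.
import sys

# Each CPython int is a heap object ("boxed"), carrying object-header overhead.
print(sys.getsizeof(1))        # typically 28 bytes on a 64-bit CPython build
print(sys.getsizeof(10**100))  # larger values need more digit storage, so more bytes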
> The benchmark is converting numbers to strings, which in Python is remarkably expensive for reasons we'll get into.
I was a bit disappointed that converting numbers to strings was the only thing he didn't actually discuss. I've discovered that the conversion function is unnecessarily slow, basically O(n^2) on the number of digits. This despite being based on an algorithm from Knuth.
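If you want to check the scaling claim on your own machine, a rough timeit sketch like the one below works (numbers vary by machine and CPython version; the digit counts stay under the 4300-digit int-to-str limit that newer CPython versions enforce by default).
import timeit

# Rough measurement of how int -> str cost grows with digit count.
for digits in (500, 1000, 2000, 4000):
    n = 10 ** digits - 1                      # an integer with `digits` decimal digits
    t = timeit.timeit(lambda: str(n), number=100)
    print(f"{digits:>5} digits: {t:.4f}s for 100 conversions")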
I'll never understand why there are so many fast, alternative python interpreters.
Is the language correctness of the official interpreter causing lower performance? Does it prevent using some modules? What's at stake here?
I'm planning to use python as a game scripting language but I hear so much about performance issues that it scares me to try learning how to use it in a project. I love python though.
They are mostly JITs or have specific purposes (like stackless python).
JITs like PyPy have a raw performance advantage, but the implementation is more complex, and there is startup time overhead. Also, I have heard the CPython reference implementation prefers clarity of code over optimizations.
> In this post I hope to show that while the interpreter adds overhead, it is not the dominant factor for even a small microbenchmark. Instead we'll see that dynamic features -- particularly, features inside the runtime -- are to blame.
I'm probably uneducated here, but I don't understand the distinction between the runtime and the interpreter for an interpreted language? Isn't the interpreter the same as the runtime? What are the distinct responsibilities of the interpreter and the runtime? Is the interpreter just the C program that runs the loop while the runtime is the libpython stuff (or whatever it's called)?
Concretely they are the same, for an interpreted language.
I think the distinction they are trying to make is between slowness due to general constraints on interpreted languages vs. slowness because of particular features or implementation details of CPython's runtime.
It is kind of like, bicycles are slower than motorcycles, in general. Let's say I have a Snake 3000 bicycle with a max speed of 4mph. And people say, it's slow because it's a bicycle; if you want a fast vehicle, get a motorcycle. And the author comes along and says, well, yeah, bicycles are not that fast, but the Snake 3000 has an alternator that uses 80% of the bike's mechanical energy to run a seat warmer; if you removed that you could reach 20mph. So is the real reason for the Snake 3000's slowness "because it is a bicycle" or "because it has a pedal-powered seat warmer"?
(The answer is both of course, but with respect to Python people rarely acknowledge the second one.)
That's a good question, I updated the post with a quick blurb on this. There may be a number of potential ways to distinguish them, so I gave the one that I was using for the blog post.
Consider a program written in Python. Imagine it converted into some "internal representation" that will be used by the interpreter to execute it.
We can ask questions like "when I write a statement in Python that adds two integers (typically a single machine instruction in a C program), how many machine instructions are executed while running that statement?"
Those are questions about the interpreter - essentially, the raw speed and execution patterns of the language.
But when you're running a Python program (or one written in many other languages these days), there is a lot going on inside the process that you cannot map back to the Python statements you wrote. The most obvious is memory management, but there are several others.
This stuff takes time to execute and can change the behavior of the statements that you did write in subtle (and not so subtle) ways.
Note that this kind of issue exists even for compiled languages: when you run a program written in C++, the compiler will have inserted into the executable varying amounts of code to manage a variety of things that do not map back explicitly to the lines you wrote. This is the "C++ runtime" at work, and even though the "C++ language" may run at essentially machine speed, the runtime still adds overhead in some places.
Interpreted languages are the same, just worse (by various metrics).
I understand the notion of a "runtime" in the compiled language case; my question is about the distinction between an interpreter and a runtime in the interpreted language case. Perhaps out of naivety, I would think that the interpreter is the runtime--the interpreter manages the call stack, memory, etc.
Well, imagine for example that you wrote this in python:
s = "hello" + "world"
You could measure the execution of the internal representation of this, and that would be all about the speed of the interpreter.
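One concrete way to look at that internal representation (a small illustrative aside): CPython's dis module prints the bytecode the interpreter loop actually steps through. Variables are used below because literal strings like "hello" + "world" get constant-folded at compile time.
import dis

def concat(a, b):
    s = a + b
    return s

# dis prints CPython's bytecode, the "internal representation"
# that the evaluation loop executes.
dis.dis(concat)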
At some point, something may have to clean up temporaries and that may happen asynchronously with respect to the actual statement execution. Yes, it happens in the same process as the interpreter's execution of the code, but it's a theoretically "optional" cost that has no specific relationship with the statement.
There are interpreters, for example, where you can disable garbage collection. If the overall execution time decreased, it's not because the interpreter got faster, it's just doing less of the "runtime" work.
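In CPython you can try that experiment with the gc module; note that gc.disable() only switches off the cyclic collector, while reference counting still frees most objects, so the difference may be small for a given workload. A rough sketch:
import gc
import time

def churn():
    # allocate and drop lots of short-lived objects
    return sum(len(str(i)) for i in range(1000000))

t0 = time.perf_counter(); churn(); t1 = time.perf_counter()

gc.disable()   # only disables the cyclic collector; refcounting keeps running
t2 = time.perf_counter(); churn(); t3 = time.perf_counter()
gc.enable()

print(f"gc enabled:  {t1 - t0:.3f}s")
print(f"gc disabled: {t3 - t2:.3f}s")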
So is there anything besides the GC which satisfies this definition of "runtime"? What other asynchronous activities are going on apart from those of the Python program itself?
C does not have an interpreter (well, the most used implementations don't) but often has a runtime (libc). It's really a runtime and not just a library, because a C compiler can insert calls to its functions even if they aren't part of your code at all, like here: https://gcc.godbolt.org/z/_w78qd
I don't know how to interpret the results of --trace-opt, but when I increase the iteration bounds 10x the running time increases 10x, so I don't think the code is being eliminated. This is with node v12.13.0
Since people are talking about speed and JIT here, it's worth mentioning Numba (http://numba.pydata.org). Being in the quant field myself, it's often been a lifesaver - you can implement a c-like algorithm in a notebook in a matter of seconds, parallelise it if needed and get full numpy api for free. Often times if you're doing something very specific, you can beat pandas/numpy versions of the same thing by an order of magnitude.
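For the curious, a minimal sketch of what that looks like (the digit count is done arithmetically because Numba's nopython mode doesn't accelerate str(), and the first call includes JIT compilation time):
from numba import njit

@njit
def digit_length_sum(n):
    # arithmetic digit count of every i in [0, n), since nopython mode can't speed up str()
    total = 0
    for i in range(n):
        d = 1
        x = i
        while x >= 10:
            x //= 10
            d += 1
        total += d
    return total

digit_length_sum(10)                 # first call triggers JIT compilation
print(20 * digit_length_sum(1000000))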
For reference, here is a pure Java version through JMH that takes 0.38 seconds on my machine. This uses a parallel stream so it's multithreaded.
Single threaded it takes 0.71 seconds. Removing the blackhole to allow dead code elimination takes the single-threaded time down to 0.41 seconds. This is close to PyPy, which I assume is dead-code-eliminating the string conversion as well.
package org.example;

import org.eclipse.collections.api.block.procedure.primitive.IntProcedure;
import org.eclipse.collections.impl.list.Interval;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.infra.Blackhole;

public class MyBenchmark {

    @Benchmark
    public void testMethod(final Blackhole blackhole) {
        Interval.oneTo(20).parallelStream().forEach((IntProcedure) outer ->
            Interval.oneTo(1000000).forEach((IntProcedure) inner ->
                // prevent dead code elimination
                blackhole.consume(Integer.toString(inner))));
    }
}
thanks :)
Eclipse Collections can also do batched loops which might speed this up. Telling Java how many threads you want will probably help as well
Interestingly, I tried plain old for(i) loops and the result was exactly the same. At least for Eclipse Collections, the syntactic sugar for ranges and forEach is completely optimized away, apparently.
You might already know this, but there is one potential caveat with APIs like that when it comes to performance or at least measuring their performance. The hot loop (the code that actually iterates over your data points) does not live in your code base. But performance often depends on what the JIT compiler makes out of that loop. If the API is used at several locations in your program, the compiler might not be able to generate code that is optimal for your call-site and inputs. Instead, it will generate generalized code that works for all inputs but might be slower.
However, when writing benchmarks, there is often no other code around to force the JIT compiler to generate such generalized code.
The following code demonstrates this. If the parameter warmup is set, I invoke the forEach methods with different inputs first (they do the same but are different methods in the Java byte code). The purpose is to force the compiler to generate generalized code:
@Param({"true", "false"})
public boolean warmup;
@Setup
public void setup() {
if (!warmup) {
Interval.oneTo(20).forEach(
(IntProcedure) i -> Interval.oneTo(1_000_000).forEach(
(IntProcedure) j -> Integer.toString(j)));
Interval.oneTo(20).forEach(
(IntProcedure) i -> Interval.oneTo(1_000_000).forEach(
(IntProcedure) j -> Integer.toString(j)));
Interval.oneTo(20).forEach(
(IntProcedure) i -> Interval.oneTo(1_000_000).forEach(
(IntProcedure) j -> Integer.toString(j)));
}
}
@Benchmark
public void coollectionsBlackhole(Blackhole blackhole) {
Interval.oneTo(20).forEach(
(IntProcedure) i -> Interval.oneTo(1_000_000).forEach(
(IntProcedure)j -> blackhole.consume(Integer.toString(j))));
}
@Benchmark
public void collectionsDeadCode() {
Interval.oneTo(20).forEach(
(IntProcedure) i -> Interval.oneTo(1_000_000).forEach(
(IntProcedure)j -> Integer.toString(j)));
}
And the for loop implementations for reference:
@Benchmark
public void loopBlackhole(Blackhole blackHole) {
    for (int j = 0; j < 20; j++) {
        for (int i = 0; i < 1_000_000; i++) {
            blackHole.consume(Integer.toString(i));
        }
    }
}

@Benchmark
public void loopDeadCode() {
    for (int j = 0; j < 20; j++) {
        for (int i = 0; i < 1_000_000; i++) {
            Integer.toString(i);
        }
    }
}

@Benchmark
public void loopBlackholeOnly(Blackhole hole) {
    for (int j = 0; j < 20; j++) {
        for (int i = 0; i < 1_000_000; i++) {
            hole.consume(0xCAFEBABE);
        }
    }
}
On my desktop machine, this gives me the following results (Java 11/Hotspot/C2):
All results are for single threaded code. Now, the difference is not huge, but significant. Overall it looks like the invocation of toString(int) is not removed and accounts for most of the runtime.
Just to be clear: I am not saying one should stay away from the stream APIs. As soon as the work per item is more than just a few arithmetic operations, there is a good chance the difference in runtime is negligible. But when doing numerical work (aggregations etc.), a simple loop might be the better option.
Finally, these differences are of course compiler-dependent. For example, C2 might behave differently than Graal and who knows what the future brings.
I didn't think about this!! That could be a big performance hit to pay for nice APIs.
I think inlining might save us here? I'm not sure what the metrics are around inlining lambdas... Does it look at the size of the code wrapping the lambda or the size of the lambda block itself to decide inlining eligibility?
If it only considers the size of the wrapper, maybe the small forEach method will always get inlined with the block it calls?
EDIT: according to this https://stackoverflow.com/questions/44161545/can-hotspot-inl...
HotSpot will inline lambda method calls as long as the depth doesn't exceed MaxInlineLevel (9 by default). Assuming the API designers made their methods small enough to be inlined, they should all be flattened out.
Yes, this is strongly related to inlining. Without the warmup code C2 most likely inlines your lambdas into the code generated for the for each method. At least I have seen this behavior with similar APIs (you can use a tool called JITWatch to visualize compiled and inlined methods). However, this does not scale.
Last time I checked, C2 inlines at most two implementations at a polymorphic call-site (in this case the line that calls your lambda). If you pass in more lambda functions, it won't inline them and might de-optimize existing compiled code to remove previously inlined functions (there are also cases where one implementation that is heavily used gets inlined and the others will be invoked via a function call). Graal does not seem to have the same limit, but when I tested it two years ago it had other limits that averaged out to roughly the same performance.
Thus, increasing the max inline level does not necessarily help (you can already do that in Java 8 and 11) for this particular problem. What you would want is that the compiler inlines the for each method and the local lambdas into the methods using the API (the benchmark methods in our case), i.e., you want the compiler to copy the hot loop somewhere higher up in the call stack and then optimize it. But apparently this is easier said than done.
But again, this is probably only relevant for workloads where there is very little work to do for each item.
I looked around and indeed this still appears to be the case. Megamorphic call sites will not be inlined.
From reading mailing list messages, the main reason for this shortcoming appears to be that C2 is such a mess that nobody wants to touch it, and they're writing a new compiler instead (Graal).
If optimizations start getting targeted towards Graal I'll be pissed. Oracle gimps the performance of the free version, the one thing you should never ever do with a programming language.
> For example, C2 might behave differently than Graal and who knows what the future brings.
Yes, the GraalVM compiler is more aggressive on some optimizations that can be especially helpful with streams. Here are numbers from my machine (using ops/s, so higher is better!):
C2 (on Java(TM) SE Runtime Environment (build 1.8.0_251-b08)):
Amusingly, on the OP's benchmark GraalVM also does better than C2, and it somehow (I haven't tried to analyze this yet) makes the serial stream version as fast as the parallel one:
(I work in the GraalVM compiler team at Oracle Labs, I don't speak for Oracle, I'm not claiming anything about whether this is a good or bad benchmark, etc.)
The results for the parallel code are indeed interesting. However, the speedup < 2 with C2 is suspicious. Even on an old dual core / 4 thread system (common worker pool should default to 3 workers), I would expect it to be above 2 for this kind of workload.
The last time I looked into Graal (over a year ago), I ran into something I could not make sense of.
We were designing an API with methods similar to Arrays.setAll(double[], IntToDoubleFunction) which we expected to be called with large input arrays from many places in the code base.
The performance of the code generated by C2 dropped as one would expect once you were using more than two functions (lambda functions were no longer inlined and were invoked for each element). For simple arithmetic operations the runtime increased by around 5x.
The performance of the code generated by Graal was competitive with C2's code for mono- and bimorphic scenarios for a much larger number of input functions. I don't recall the exact number, but I believe it was around 25. However, once we hit that threshold, performance tanked. I believe the difference was closer to 15x than C2's 5x.
We never figured out what caused this.
Do you have an idea? Or, since this is probably outdated data, do you know what the expected behavior is nowadays? Is there maybe any documentation on this?
> However, the speedup < 2 with C2 is suspicious. Even on an old dual core / 4 thread system (common worker pool should default to 3 workers), I would expect it to be above 2 for this kind of workload.
The speedup I found on C2 is 2.547 / 1.719 ~= 1.5, the OP's speedup (also on C2) was 0.71 s / 0.38 s ~= 1.9. Not the same factor, but not above 2 either. The benchmark is quite small, so the overhead of setting up the parallel computation might matter. The JDK version might matter as well.
> Graal was competitive with C2's code for mono- and bimorphic scenarios for a much larger number of input functions. I don't recall the exact number, but I believe it was around 25.
If we are talking about a call like Arrays.setAll(array, function) and you have one or two candidates for function, then you are mono- or bimorphic. If you have 25 candidates for function, you are highly polymorphic. I don't understand what you mean by a mono- or bimorphic scenario with up to 25 candidates.
Performance will usually decrease as you transition from a call site with few candidates to one with many candidates. I agree that a factor of 15x looks steep, but it's hard to say more without context. The inliners (Community and Enterprise use different ones) are constantly being tweaked, though that's not my area of expertise. It's very possible that this behaves differently than it did more than a year ago.
If you have a concrete JMH benchmark you can link to and are willing to post to the GraalVM Slack (see signup link from https://www.graalvm.org/community/), that would be great. Alternatively, you could drop me a line at <first . last at oracle dot com>, and I could have a look at a benchmark, without promising anything.
> If we are talking about a call like Arrays.setAll(array, function) and you have one or two candidates for function, then you are mono- or bimorphic. If you have 25 candidates for function, you are highly polymorphic. I don't understand what you mean by a mono- or bimorphic scenario with up to 25 candidates.
Sorry, I phrased that poorly. What I was trying to say was that the moment we introduced a 3rd candidate the performance with C2 decreased by ~5x. Whereas with Graal the performance stayed virtually the same when introducing a 3rd or 4th candidate. And it was performing well. But it did decrease at some point after adding more and more candidates and then the performance hit was much bigger.
I will reach out to my colleagues still working on this next week (long weekend ahead where I live). If we can reproduce this with a current Graal release, I'll share the benchmark – always happy to learn something new, even if it is 'just' why our benchmark is broken. :)
I really appreciate the in-depth replies on HN. Something Reddit lost ages ago. I gotta say your posts have introduced me to a lot of dark magic in the JVM that isn't really documented anywhere unless you count source code. Much appreciated!
Interesting indeed!
I'll try to avoid a long rant, but I won't go near Graal because Oracle is stuffing many of the performance optimizations behind a paywall (wtf). Who writes a compiler and makes the code slower for the free version? That's like something MS would do in 1995.
If Graal does replace OpenJDK eventually and Oracle keeps the performance optimizations behind an enterprise license, I have a feeling the Java community will fork the project away from Oracle. Just like they did for OpenOffice, MySQL, and Jenkins.
Another comment about Graal. Since performance is so important, it's likely a third party like Red Hat will want to merge code into the community edition that does the same thing as code already present in an "official version", like Red Hat already did for GC with Shenandoah. What is Oracle going to do? Say "no, because it competes with our commercial branch"? I think what's going to happen is what happened with Shenandoah, where the Oracle builds are crippled by excluding features. What happened there? 99% of people switched to AdoptOpenJDK.
I'm not attacking you here; I realize you have no control over the matter. And thanks for Graal, it's super impressive! I'm just salty that the community decided to replace C2 a decade ago, yet Oracle has decided not to release the full replacement back to the community. Many companies differentiate themselves by having commercial versions of tools that perform better. That's fine with me. I'm salty because Graal was how the community decided to replace C2. It's the reference implementation. Now the main benefit of that replacement, better performance, is locked behind a paywall. This is not what anyone intended, and had they known this would happen they probably would have just rewritten C2 and left it in HotSpot.
I know this is supposed to be about Python optimization. However, the post switches over to C near the beginning of the process. Hence it really is about how to optimize Python applications by rewriting them in C (or other fast languages).
Which IMO, isn't about optimizing Python, apart from tangential API/library usage.
I've been under the impression that I'll get the best performance out of a language when I write code that leverages its idiomatic features and natural aspects. If I have to resort to another language along the way to get the needed performance, I guess the question is: isn't this a strong signal that I should be looking at different languages ... specifically fast ones?
Most of the fast elements of python are interfaces to C/Fortran code (numpy, pandas, ...). What is the rationale for using a slow language as glue versus using a faster language for processing?
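To make the "glue" pattern concrete, here is a hedged NumPy sketch of the same digit-length sum as the benchmark: the per-element work runs inside NumPy's C loops, and Python only orchestrates the calls.
import numpy as np

# Vectorised digit-length sum for 0..999999, repeated 20 times as in the benchmark.
i = np.arange(1000000)
boundaries = 10 ** np.arange(1, 8)                       # 10, 100, ..., 10000000
digits = np.searchsorted(boundaries, i, side='right') + 1
print(20 * int(digits.sum()))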
Honestly, this is subjective. I generally agree that some languages are well designed for fast prototyping, but not really great for computationally intensive production. I see Python in this role.
Note that when the original author wished to move this to a faster execution capability, they had to change languages. So they get the disadvantage of having to do the port anyway.
I'm really sick of this kind of benchmark. I have never seen a real-world Python program that doesn't depend on IO or doesn't have some C code behind wrapped calls. If your code is heavy with computations, there are numpy/scipy libs that are very good at this. These optimizations bring < 10% more speed to a real project/program, but will require a lot of developer time to support. If performance is a key feature and very critical, then Python is likely not the right choice, because Python is more about flexibility, maintainability, and the ability to write solid, easy-to-read code.
I don't mind knowing the limitations. I'm saying that these optimizations are usually very hard tradeoffs, and on the other side of them sit code readability, development speed, and maintainability. I tried Cython and PyPy. Both are really good if your project is started with them, but if you decide to migrate to them in order to increase performance, it's like rewriting the project in another language. Also, both have a lot of limitations, and CPython still gives you a lot more flexibility in decision making (choosing frameworks, libraries and approaches to solving a given problem).
This is what I get for hosting my own wordpress server. It was struggling and I installed a caching plugin which took down the site. Should be back up, sorry!
On my somewhat old Linux machine his main() takes 5.22 seconds. Meanwhile, this Julia code
@time map(string, 1:1000000);
reports an execution time of 0.18 seconds. But that includes compilation time; using BenchmarkTools, which runs the code repeatedly, I get 88.6 milliseconds.
Using range() in php it takes about 1.3 seconds. The PHP docs imply that it's a generator now but I'm not positive about that. I wrote an equivalent php function just using for loops and calling strval($x) 20 million times and on my laptop it runs in .9 seconds... The second form of creating 20 lists with 1m elements runs in .4 seconds. Without needing to write 300 lines of Py/C stuff. So... shrug? microbenchmarks? I guess writing the optimized code was the fun part and the actual benchmark/timing part doesn't really matter. It's just for loops... the other comments are right that benchmarks definitely matter when you're doing more varied "work" though.
For what it's worth, I did benchmark a big application in PHP "for real", and parsing a large configuration file (10,000+ lines) on every request did take up about 15% of the wall clock time. We optimized a few things there because it was worth it. It was a HUGE application of about ~1M LoC and an average request was about 300-400 milliseconds so... I guess it wasn't doing a lot of for loops?
Edit: I had a few minutes before my next meeting to code golf this so I decided to test if range() worked with yield [1]
<?
function gen() { // 1.3s
    foreach (range(0,20) as $i) {
        foreach (range(0,1000000) as $j) {
            strval($j);
        }
    }
}

function foo() { // .9s
    for ($i = 0; $i < 20; $i++) {
        for ($j = 0; $j < 1000000; $j++) {
            strval($j);
        }
    }
}

function bar() { // .4s
    for ($i = 0; $i < 20; $i++) {
        $x = range(0, 1000000);
    }
}

// [1] returns in .06 seconds
// pretty sure this just executes yield once and does no real work just like me this morning
function gen2() {
    foreach (range(0,20) as $i) {
        foreach (range(0,1000000) as $j) {
            yield;
        }
    }
}
This is an area Nim could have come in and improved on, i.e. if Nim had the ability to port your Python code and have it running 98% unchanged on Nim, most Python users would have been there already.
I don't see a complaint, just an analysis of things that make code slow. Assuming that PyPy will just fix the problem is unrealistic, considering PyPy's limitations (it doesn't support every architecture Python supports, it lags behind the latest Python version by a minor release or two, etc.)
Python is a great language and PyPy a great tool, but let's not become complacent or - worse - dismiss valid information just because we like them.
At first glance this seems to be "dynamism == slow", but surely you need to explain why Python is slower than JavaScript and has for many years resisted a lot of effort to match the performance of V8 and its cousins?