Python 3.11 is faster than 3.8 (jott.live)
378 points by brrrrrm on Oct 26, 2022 | 295 comments



Checking with my own “benchmark”, a gameboy emulator in several different languages[0]; it’s CPU-bound but spread across ~3k lines of code, so slightly more representative of real-world apps than a single-function tight-loop microbenchmark:

     zig: Emulated 600 frames in  0.24s (2521fps)
      rs: Emulated 600 frames in  0.37s (1626fps)
     cpp: Emulated 600 frames in  0.40s (1508fps)
     nim: Emulated 600 frames in  0.44s (1367fps)
      go: Emulated 600 frames in  1.75s (342fps)
     php: Emulated 600 frames in 23.74s (25fps)
      py: Emulated 600 frames in 26.16s (23fps)   # PyPy
      py: Emulated 600 frames in 33.10s (18fps)   # 3.11
      py: Emulated 600 frames in 61.43s (9fps)    # 3.10
Doubling the speed is pretty nice :D Still the slowest out of all implementations though :P

[0] https://github.com/shish/rosettaboy

EDIT> updated the nim compiler flags to build in release mode like most other languages, thanks @plainOldText!


Just a quick glance at your repo, and I'm noticing you're running Zig `zig build -fstage1 -Drelease-fast=true` and Rust `cargo run --release` with the release flags on. You should do the same for Nim `nimble build -d:release --opt:speed`; Go too.


Updated nim’s compiler flags; it seems to be ~4x faster now :D

    nim: Emulated 600 frames in  0.44s (1367fps)
Do you happen to know the right flags for release-mode Go? Last time I checked (admittedly years ago) I thought they just had the one “reasonably fast and reasonably debuggable” build mode


With Nim `nimble build -d:danger -d:lto --passC:-march=native` I just got 1920 frames/s while with your rust build only 1237 fps on the same machine (EDIT: and 1307 frames/s with C++.)


In that case, did you also run rust with the equivalent RUSTFLAGS="-C target-cpu=native"? :)

Perhaps benchmarks should be compared on equal footing, say, with the default release flag or all optimizations turned on, otherwise they're improper.


I tried. That actually made the rust slower for me (i7-6700k, gcc-12.2, rustc-1.64). 1180 frames/s. And without the -march=native the Nim was at 1620. And it also did not help the C++ branch (but helped Nim about 1.2x).

But really the original author/poster should run a set like this on his box. I cannot even compile/run all his things. The point of my comment was just to give a reference for how far off impressions can be based on build flags. PGO (available to Nim, C++, but maybe not to Rust yet?) is a whole other set of maybe nothing burgers or maybe big improvements.

(But, btw, I could not agree more that all experiments in this entire general space should have various big, bold disclaimers. Over-concluding from these things is rampant.)


Yes, tweaking the compiler flags can alter the performance substantially. I'm glad to see Nim so fast though.


And with the author's hot-off-the-presses nim flags I get only 1464 fps. So, 1920/1464 = 1.31 for my nim compile flags vs. his new ones, only a little less than the 2521/1626 = 1.55 that people were finding interesting.

For something super jumpy like a simulator, I would find it unsurprising for PGO to make up (or surpass) the difference to Zig in both Nim and C++. 20 years ago there was this ACOVEA [1] project to try to discover great sets of gcc flags that could often find 2X improvements in object code speed for me.

The range from build flags/procedures is often much greater than the supposedly interesting cross-language variation. These things often more measure developer experience/persistence than something intrinsic (and build flags/procedures are only part of that experience/persistence).

[1] https://github.com/Acovea/libacovea


I do not, sorry. I'm sure some Go developer can chime in.


There is no release flag for Go.


To elaborate, Go is always in release mode because release compiles are about as fast as other languages' debug modes.


I'd phrase that as "release mode by default", not "always in release mode".

There are various debug things you can turn on, such as the race detector, the memory sanitizer, or coverage tracking.
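
For example (standard toolchain flags; `./...` is just an illustrative package pattern):

    go build -race ./...   # data race detector
    go build -msan ./...   # memory sanitizer (requires clang)
    go test -cover ./...   # coverage tracking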


Is there such a thing for go?


It's the default for Go; with Go you have to explicitly turn on the debug features that you want.


It's amazing that Zig is faster than both Rust and C++. Kudos to the Zig team & Andrew Kelley!

I wonder what optimizations Zig does that lets it generate machine code / LLVM bitcode faster than both C++ and Rust (which certainly have larger teams backing them), at least in the case of this Gameboy emulator project.


Yeah, I’m really not sure how it managed that. I am slightly suspicious, because while I was working on the zig version I spent more time running into compiler bugs[1] than writing my code, and so there’s a chance that it’s running so fast by throwing away random chunks of important behaviour… but it still passes all of the test suite, so as far as I can tell this specific build is working correctly.

[1] half the time I’d make the compiler crash; the other half it would generate a binary which crashes at runtime, with weird heisenbug behaviour like “adding a print statement to log how far down a function I am causes the code to stop crashing at all” — like right now there is a load-bearing print statement which shows the address of the SDL Window object, because otherwise the compiler seems to optimise the Window out of existence and then a few lines later it segfaults on the null pointer…


> a load-bearing print statement

Is the most glorious thing I ever heard.


> like right now there is a load-bearing print statement which shows the address of the SDL Window object, because otherwise the compiler seems to optimise the Window out of existence

that sounds like you have undefined behavior in your program


Does zig have computed goto? That's important for emulators and virtual machines, and standard C/C++ don't have it. Projects will often go out of their way to have a MinGW/gcc-built module for core loops with that (gcc has it as an extension) even if the main project is built with MSVC.


If Rust wasn't able to elide bounds checks, that is most likely where the perf difference is from. Analyzing the assembly output from Godbolt or looking at the MIR can help.

https://stefan-marr.de/2022/10/cost-of-safety-in-java/

Unsafety buys you a little more performance, but the baseline should be the safe version, not the unsafe version. It is like having to explain on a case-by-case basis why you aren't using lead pipes for this application.

Always default to safety. The difference between safe and unsafe native code is usually single digit percentage points. Or weeks on a Moore scale.


This should also be the case for nim. Generated C always looks very simple with -d:danger (when all memory access checks disappear).


If I do nimble build -d:lto -d:danger --passC:-march=native I get 1920 fps, while if I do nimble build -d:lto -d:release --passC:-march=native I still get 1775 fps. So, at least for Nim, the checks are only a 1.08x slowdown. Not so bad compared to the 2521 vs 1626 = 1.55x Zig-Rust gap on the author's machine.

Heck, I see 1.33x differences in run times between the first & second run of the rust branch of his benchmark, but only 1.05x diffs in run times between 1st & 2nd Nim branch.

In my experience, reasoning about things like this is rarely as simple as "bounds checks" which are (often) the most highly predictable branches.


True. Recently I noticed that std/sha1 takes 8x the time on M1 release vs danger. Yet on amd64 it's only a <10% difference.


I think you replied to the wrong comment, mine was about zig vs C++.


I am talking about the whole stack of benchmarks and why there is a spread in perf. I am replying to you and the top level comment, if only the convos were a graph and not a tree.


Here's his latest numbers after optimizations. From the repo.

     zig: Emulated 600 frames in  0.23s (2880fps)
      rs: Emulated 600 frames in  0.35s (1691fps)
     cpp: Emulated 600 frames in  0.40s (1519fps)
     nim: Emulated 600 frames in  0.44s (1367fps)
      go: Emulated 600 frames in  1.78s (338fps)
     php: Emulated 600 frames in  9.28s (65fps)
      py: Emulated 600 frames in 33.10s (18fps)
That's after Python cheated with a native implementation of bitblt while all other implementations are doing it pixel by pixel, which is more correct.


Tables like this routinely lead to over-conclusion. Merely adding -d:lto (which changes no semantics) to the nimble build line boosts fps by 1.27x for me on nim-devel gcc-12 Skylake, making it 1.4x faster than the Rust on the same machine.

Applying BS scaling to the above table, Nim would score 1736, only slightly faster than the Rust. Who knows what little flag tweaks in cpp/rs/zig could similarly re-arrange the numbers?

Now, why is -d:lto (or -d:release) not the default build mode? Well, because it takes time and (some) people already complain about compile times and workflow (sometimes). Trade offs abound almost as much as over-conclusion. ;-)


Very cool idea for a benchmark, thanks for the numbers. Pretty readable code as well, nice work. Is your benchmark running headless? I feel like that could be a source of noise but idk.


Yeah, all these benchmarks are measured in headless mode - all the calculations of what should be on-screen are done, and pixels are written into a buffer ready to be displayed, but the Window is never opened and the buffer isn’t blitted.


I am shocked that PHP of all things has faster speed than python.


I am not.

PHP is traditionally used solely for websites. Some of those have grown rather large, to the point that having engineers optimize the language is cheaper than buying more servers.

Python, on the other hand, is first and foremost a scripting language. When performance does matter, you often end up using a wrapper around a C library, like NumPy. This means there is relatively little money in optimizing the Python interpreter.


There are websites that are large-scale software. Google uses Python for a lot of its products.

It would seem sensible that since Facebook poured a lot of resources in optimizing PHP, Google would have done the same for Python.

Also, Python is the first or second most popular language.


> It would seem sensible that since Facebook poured a lot of resources in optimizing PHP

Well, they sort of half-assedly tried years ago - they employed GvR at one point. Then they gave up, and they continue to lean heavily on C++, Java, and the thing they developed in the interim: Go.

> Also, Python is the first or second most popular language

Hogwash. For all the open source and other development that occurs and is “indexed” on internet discussion forums there is countless boring ass shit behind the scenes in sweatshops around the world and corporate back rooms. PHP, Java, and C# are still probably more popular to start.


> For all the open source and other development that occurs and is “indexed” on internet discussion forums there is countless boring ass shit behind the scenes in sweatshops around the world and corporate back rooms.

Much of which is also in Python.


I mean, sure, but I’m just saying point me to a source that is ranking Python as #1 or #2 most popular. If it’s TIOBE or something similar it’s meaningless.

I say this as someone that has made their career mostly on the “top 5” TIOBE languages - but that ranking is a bunch of crap for the overall commercial software world.

C? Give me a fucking break. Never in 20 years have I known there to be more C than Java jobs.

And now everyone and their dog including doctors and other non tech-savvy professionals are doing Intro to Python courses (and then never touching it again) so they too can feel like they know something about ML or stats. Lot of noise.


>I mean, sure, but I’m just saying point me to a source that is ranking Python as #1 or #2 most popular. If it’s TIOBE or something similar it’s meaningless.

According to the 2022 Stack Overflow developer survey, if we exclude HTML, SQL and Bash, the top five languages (as in being used by most developers) are: JavaScript, Python, TypeScript, Java and C#.


https://madnight.github.io/githut/#/pushes/

Checking here, if we go by pushes on GitHub, JS, Python, Ruby, Java, PHP are the top for 2021.

This is of course biased to only stuff that's on GitHub and public (assuming, but I doubt they're publishing private data on BigQuery).

No idea how much of that is pet projects vs production of course.


JS is also very popular. Not for the same things as Python but in some aspects it's becoming modern PHP (which is still popular, if maybe aging somewhat kinda like Perl). But Python is definitely very popular for new back room stuff (source: long digital sweatshop career)


Corporate has a faceless horde of Java devs.


Lots of C#, too. And I specialized in C# since it seemed a little bit more enjoyable than Java at the time and the jobs were plenty. Of course, Java changed quite a bit since then so it might be more desirable than it was.


> Python is the first or second most popular language.

By what metrics? The RedMonk quarterly from June shows Python at #2, and PHP at #4.

https://redmonk.com/sogrady/2022/10/20/language-rankings-6-2...


Most metrics: TIOBE, RedMonk, PYPL, the Stack Overflow survey. The top 10 and even top 5 are generally the same; the order might differ a bit.

I've looked at various sources over the last 5 years and not much changed. I hoped some languages like Nim, Crystal or Zig would pick up a bit of steam, but no.

Go and Rust moved up a bit and now seem safer bets to invest some time into.


> > Python is the first or second most popular language.

> By what metrics?

Hmm, why don’t you answer your own question?

> RedMonk quarterly from June shows Python at #2,

That clearly is one that puts it “first or second”, yes.


Yes, it was partially an answer, and it's not saying Python isn't high up.

But still wanted to know by what metrics the GP was making their claim.


Instagram is a Django website (Django is a Python web framework).


It can't be explained by lack of effort. There have been several serious attempts to make Python run faster, including one by some Google engineers. Many of them failed, and the ones that succeeded to some extent aren't mainstream. It's hard.

For Python, the C integration probably makes things harder.


Interesting how a language written solely for websites, beats others in their own game which isn't about websites.


PHP is a much simpler language which helps a lot with optimizations. For instance, consider something like property access.

In PHP, doing `$a->b` accesses field `b` of object `$a`. The implementation is essentially type check (making sure `$a` is an object) and hash table lookup. If this fails, then it calls `__get` method if it exists.

In Python on the other hand, `a.b` involves `__getattr__`, `__getattribute__`, `__mro__`, `__get__`, `__set__`, `__delete__` and `__dict__`. Here is a description of how the attribute access works: https://docs.python.org/3/howto/descriptor.html#overview-of-....
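
A minimal sketch of one of those extra steps (data descriptors found on the type's MRO are consulted before the instance `__dict__`; the class names are made up for illustration):

    class Meters:                       # a data descriptor: defines __get__ and __set__
        def __get__(self, obj, objtype=None):
            return "from the descriptor"
        def __set__(self, obj, value):
            pass

    class Point:
        x = Meters()

    p = Point()
    p.__dict__["x"] = "from the instance dict"
    print(p.x)                          # "from the descriptor" - the type wins

Every `a.b` in Python has to check for all of this machinery; PHP's `$a->b` doesn't.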


PHP typically beats Python on CPU bound benchmarks: https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

(Measures Python 3.10.4 against PHP 8.1.5 - so expecting these to change a bit).


Facebook dumped a ton of effort into making PHP fast over the last couple of decades. No one has been able to make Python especially fast without breaking compatibility with important libraries (Python exposed virtually the entire interpreter as its extension surface, and since Python is so slow, extensions are a major part of the ecosystem as they're the main way to recoup performance, which in turn means that changing the extension interface to make things faster would break a bunch of the ecosystem, and the Python maintainers are pretty scarred after the 2->3 breaking changes). Pypy comes close, and it has been grinding away to get compatibility, but last I checked you still couldn't so much as talk to a Postgres database through a reputable package.


The last couple of decades? Facebook is not even two decades old. And they turned it into their own language.


Yeah, it was a crude estimate. Call it 10-15 years if you want.


It's not as surprising if you look at the primary use case of PHP. It takes a request, does a bunch of stuff, and returns the result to the web server. And it needs to do all of it before the user who clicked on the link gets bored and goes somewhere else. (Or the API client calling it times out...)


How is that relevant? This is not about it being faster for "its primary use case" (optimized for that, etc).

It's generally faster than Python in doing the same things as Python does, unrelated to web too.


The point is that PHP's primary, original use case is latency-sensitive in a way that Python's isn't. So it's always been subject to more performance pressure.


I think they did a lot of work similar to what Python 3.11 did, and more, when releasing PHP 7. I'm absolutely not in the loop but remember postings here on HN a few years ago.

Meta/FB is also pretty invested in Hack (was once a php dialect, again, out of the loop), maybe they contributed a thing or two?


PHP 5 was kind of slow, to the point that it threatened the future of Facebook (now Meta). They eventually transformed it to C++ for a speed improvement (HipHop), then went on to build a VM with JIT compilation. You could then deploy your app by transferring a sqlite file containing the compiled byte code.

At Wikimedia we adopted it, which cut our CPU usage by half and saved a few hundred servers. We had some Facebook engineers helping, which involved patching the Linux kernel while at it. Those were good times.

Eventually PHP 7 followed up with a similar approach and had more or less the same performance as HHVM. Facebook went then to focus on the Hack language (a dialect of PHP with strong typing) and eventually phased out back compat with Zend.

From what I remember, Sara Golemon at Facebook did a lot of outreach to open source projects and gave us a lot of assistance (as did others at Facebook).


> PHP 5 was kind of slow

I remember it being slower than PHP4 for a while


php7 was such a huge leap they skipped php6 (although there were other reasons for that).


> Version 6 is generally associated with failure in the world of dynamic languages. PHP 6 was a failure; Perl 6 was a failure. It's actually associated with failure also outside the dynamic language world - MySQL 6 also existed but never released.

TIL MySQL 6

https://wiki.php.net/rfc/php6


Whereas Python doesn't even plan for 4 because 3 hurt so bad


Meanwhile Java(ECMA)Script bucking the trend by getting its failure out of the way at 4.


PHP has always been several times faster than Python. Even more so with the speed updates post 7.


100% not shocked at all.

I recall a couple of times over the past 15-20 years where a current python significantly outperformed a current php version, but my recollection is that php was usually a bit faster, or sometimes a lot faster.

PHP 7 was released in December 2015, and gained significant speed bumps and better memory usage, with the average php execution time being cut in half, or more, generally without any code change whatsoever. It was quite remarkable.

The path from 7.x-8.1 so far has generally seen incremental speed bumps again - usually somewhere between 3-8% improvements per release. Obviously this is going to be dependent on use cases, but overall it's been a fairly steady set of speed improvements over the last 7 years.


Clearly we need a Python to PHP transpiler (trigger word) so we can be webscale (trigger word).


PHP has always been among the fastest of the untyped scripting languages, even from the web 1.0 days.


That doesn't relate to Python, though.


PHP itself is pretty fast. It's the things that people do in PHP that get slow. Most of the Yahoo frontends were rebuilt in PHP in 200x because it was fast enough and much more usable than the thing they used before (trigger warning: hf2k)

Of course, people then go and build up sculptures of objects that will be thrown away after every request, and that stuff makes everything slow (that style of code is why PHP wasn't good enough for Facebook, IMHO), but you can build trash sculpture in any language.


Back around 2003 I was hosting an internal web app for a fortune 50 company in python. It could handle 2 users at a time. I rewrote it in PHP. Scaled to hundreds with trivial work. Probably could have handled far more.


The key here is that you rewrote the whole thing. The old and new languages aren't as relevant.


PHP 5 and onwards have had a lot of resources put into making it fast. The surprise for me is how close Python is getting to closing that gap.


Looking at Rust vs. Zig, you have the FLAGS register in a bitfield for Zig but it's separate bools for Rust. This is probably making your Rust code slower because the CPU can't set multiple flags at once.

I'm also wondering if all your #[inline(always)] is slowing things down.


I ran the Rust and Zig implementations through a profiler and (at least on my machine, using Zig nightly 0.10.0-dev.4583+875e98a57) the vast majority of the time difference is from the number of calls to the GPU paint_tile_line function- there's some behavioral difference upstream in the emulator and the GPU modules are just not doing the same work.

For the benchmark's 600 frames, the Rust emulator calls it 2500049 times, but the Zig emulator only calls it 1808121 times. This is very roughly the ratio between the reported times.

At least one source of this discrepancy is that the Zig emulator doesn't think any sprites are active, but that doesn't account for all of it, and I'm hitting some crashes trying to run it with the display, so I'm probably not going to investigate it further.

(I also hit the same sort of strange Zig compiler unreliability as some other comments have mentioned, where I had to add some logging in seemingly random places to get the Zig emulator to run at all.

And the argument parser just blatantly returns a dead pointer to the ROM path; when I first ran it, it just gave "error: InvalidUtf8" until I tracked that down.)


Fascinating, thanks a bunch for looking into this. Benchmarking is hard!


Yeah, I really like zig’s approach to bitfields, everything Just Works with no faffing about with bit-shifting and OR/AND'ing. I forget the exact reason I used separate bools for rust, but I remember spending a day trying to do something zig-like and failing…

IIRC each of the #[inline] statements was tested and each made a noticeable performance improvement. That’s especially true for things like RAM::get() - since the gameboy does I/O by having different chunks of RAM act differently (some address ranges are just RAM, some are hardware controls, some read data from the cartridge, etc) you can replace the hundred-line generic get() with a single instruction if you happen to know that you are looking up one hard-coded address, and that address has no special behaviour.


Very cool benchmark. I don't think this will be a huge difference but could you try mypyc? I'm definitely curious how it compares. It uses mypy to perform type checks and claims "Existing code with type annotations is often 1.5x to 5x faster when compiled. Code tuned for mypyc can be 5x to 10x faster."[1]

[1] https://mypyc.readthedocs.io/en/latest/introduction.html
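
For reference, a rough sketch of what that looks like (file and function names made up; mypyc ships with mypy):

    # fib.py - ordinary type-annotated Python; mypyc compiles it to a C extension
    def fib(n: int) -> int:
        if n <= 1:
            return n
        return fib(n - 1) + fib(n - 2)

Then `pip install mypy` and `mypyc fib.py`; a plain `import fib` afterwards picks up the compiled extension.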


This might speed up the py impl a few %

  --- a/py/src/cpu.py
  +++ b/py/src/cpu.py
  @@ -260,7 +260,6 @@ class CPU:
               param = None
               cmd_str = cmd.name
               self.PC += 1
  -        self._debug_str = f"[{self.PC:04X}({ins:02X})]: {cmd_str}"


Your comment in PHP src here https://github.com/shish/rosettaboy/blob/master/php/run.sh

says

# opcache in 8.1 gives a nice speedup (25s to 10s)

Is the 23.7 seconds above using the 8.1 opcache?


I think a bigger thing to point out is that it's within shouting distance of PyPy, and there's plenty more work that can be done to make it faster. Next up is a mini-JIT, which should help in places with tight loops.


I find it /really/ funny that PHP is faster than Python at this. As a PHP code hacker, I love that PHP keeps coming up faster than Python in a ton of tasks


That's not much of a flex, Python is deliberately slow--the reasoning was that the implementation should be simple and they'd just expose the entire interpreter as the extension API, and then people would just write extensions in C when they needed the speed. Since vanilla Python is so slow, the entire ecosystem is highly dependent on C extensions for passable performance, and since the C extension API is virtually the whole interpreter, changes that would make the interpreter fast would typically break much of the ecosystem.

Unfortunately, "just write the performance-sensitive bits in C" is pretty impractical because it only works when you're handing the C routine a relatively small amount of data relative to the amount of work to be done on that data (otherwise the costs of marshaling to C data structures will quickly eat up any gains from processing in C). And unfortunately, it turns out that a whole bunch of code works this way, so the whole bargain of "slow interpreter + easy C extensions" breaks down for a lot of real-world applications, but now we're locked into it.


CPython can be fast once you eliminate the main bottleneck: the interpreter itself.

Loops add a lot of overhead because the interpreter has to make its dispatch jumps on every iteration. Figuring out a way to minimize that overhead by using built-in data structures and the stdlib will speed up your code by an order of magnitude.

Don't forget, the built-in types are already running in C.
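
A minimal sketch of the difference (exact timings are machine-dependent):

    import timeit

    data = list(range(1_000_000))

    def manual_sum(xs):
        total = 0
        for x in xs:          # one interpreter dispatch round-trip per element
            total += x
        return total

    print(timeit.timeit(lambda: manual_sum(data), number=10))  # pure-Python loop
    print(timeit.timeit(lambda: sum(data), number=10))         # loop runs in C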


I mean, yes, but that's probably not viable in many real-world applications. Usually you have a pretty complex data model (some graph-like structure) with Python methods that traverse it. You can't easily push that into native Python structures (at least not with any significant performance gain), and while you can push the whole thing into Rust or some faster language, at that point all of the interesting stuff is happening in Rust so why use Python at all (why pay the significant costs of a hybrid application)?


At least writing Python extensions is relatively easy (I understand; I've not done it), unlike PHP. PHP leans heavily on poorly documented C macros and the internals are tricky to grasp. At least it now has a minimal FFI module.


Honestly I think it’s much better to optimize for the native use case rather than FFI.

FFI can pervade an ecosystem making changes (including performance optimizations) to the host language more difficult. It also tends to complicate build and deployment stories. For example, from my Mac I can trivially build a native binary that will Just Work on any Linux system, even one without a libc. Contrast that with Python where I can’t even install popular packages onto most non-Ubuntu Linux distros. And there’s a lot of other things like that which crop up with pervasive FFI.

I’m convinced that the FFI sweet spot is “difficult but possible” and native performance should be good enough 99% of the time.


I'd be super curious to run a profiler on those Python builds and see where it's spending most of its time.


To make the Rust release build run faster you should enable LTO (Link Time Optimization). In Cargo.toml put:

    [profile.release]
    lto = true


Interesting. On the trifecta of money vs pain vs speed, Go seems to be a reasonable compromise.


FWIW I personally found Rust the least-painful language, but that may well be confirming my pre-established biases :)

- With Zig I kept running into compiler bugs, plus no package manager (I’ve vendored SDL and Clap into the source tree)

- C++ I’d occasionally shoot myself in the foot in ways that other languages would have caught, plus no package manager (OS-level package management does an OK job, so long as you don’t mind using old versions, and faffing about with different operating systems acting very differently)

- The pain from Rust was one time where the compiler wanted me to specify a lifetime, and I didn’t understand, so I just spammed lifetime specifiers in various places until it compiled. I’ve been using Rust for a couple of years now and I still don’t really understand lifetimes, but thankfully 99% of the time I can avoid them.

- Nim was a relatively nice language but massively lacking in available libraries (like even parsing command line arguments took me a day just trying to find a library which worked)

- Go is pretty nice, my main pain is the tolerable but constantly-annoying verboseness of error handling (`err := foo(); if err != nil {return err}` compared to rust’s `foo()?`)

- PHP I just hate on a deep and personal level thanks to years of being a PHP4/5 developer. The language is actually mostly-ok-ish these days, but the standard library is still full of frustration like inconsistent parameter orders within a family of functions.

- Python is all-round really nice to write, but the test suite takes like 20 minutes to run, which really messes with my flow-state


"Rust the least-painful language"

" I’ve been using Rust for a couple of years now and I still don’t really understand lifetimes"

Seems like a major pain point.


It would be if I ran into it regularly -- but after using the language for a variety of professional and personal projects for a couple of years, this is the only time I’ve actually needed to manually-specify lifetimes since the compiler normally figures it out for me :)


As long as you don't get too crazy with references in structures or async code, lifetimes are not going to chase you.


Generics too right?


I know people say that about everything, but Rust generics are very readable and do make sense after you understand what problem they are solving.

They do look intimidating to start with, admittedly, and I'll concede that's a negative point for Rust. But it does get better if you practice for a bit.


No I just meant using generics seems to require explicit lifetimes decently often. I don't understand why.


Not sure that's the case btw. I have noticed it with some libraries but I've used and created a fair amount of generics without having to annotate stuff with lifetimes.

Lifetimes are necessary when you want to explicitly say "variable X will live just as long as variable Y", or sometimes it's more complex (i.e. you have to specify 2 or more separate lifetimes and then return something that pertains to only one of them) but it's still fairly predictable if you keep it all in your head while coding.

Don't get me wrong I still hate it but it's not as terrible as many people make it out to be. It's hard to get into but also very logical and graspable.


The reason it's not a pain would be explained by the rest of the sentence which you omitted: "but thankfully 99% of the time I can avoid them"


They are saying rust is painful. Just that they find the other languages even more painful.


Except that they don't really say rust was painful....just that there was one specific moment / aspect that they found tricky.


This is actually a really nice write up for people deciding which language to learn/use if they aren't constrained.


Re CLIs in Nim: most find https://github.com/c-blake/cligen easy to use.


Nim does seem nice. I worry that it will end up like D though... An interesting/cool/neat language with seemingly relatively little adoption. I'm keeping my eye on it though.


C++ has two relatively good package managers, conan and vcpkg.


It's probably "slow" because of SDL and cgo, not native Go code. A GB emulator doesn't do much, actually; it's all fixed arrays, bit shifting, switch cases, etc.

I ran a quick pprof and indeed it's spending a lot of time in cgo:

  Showing nodes accounting for 28720ms, 67.67% of 42440ms total
  Dropped 145 nodes (cum <= 212.20ms)
  Showing top 10 nodes out of 53
      flat  flat%   sum%        cum   cum%
   13080ms 30.82% 30.82%    16600ms 39.11%  runtime.cgocall
    4720ms 11.12% 41.94%     4750ms 11.19%  main.(*RAM).get
    2840ms  6.69% 48.63%    33070ms 77.92%  main.(*GPU).tick
    1970ms  4.64% 53.28%     3720ms  8.77%  runtime.mallocgc
    1450ms  3.42% 56.69%     1470ms  3.46%  main.(*RAM).set
    1160ms  2.73% 59.43%    41350ms 97.43%  main.(*GameBoy).tick
    1000ms  2.36% 61.78%     3160ms  7.45%  runtime.exitsyscall
     890ms  2.10% 63.88%     1610ms  3.79%  main.(*CPU).tick_interrupts
     820ms  1.93% 65.81%      850ms  2.00%  runtime.casgstatus
     790ms  1.86% 67.67%     5530ms 13.03%  main.(*CPU).tick


> It's probably "slow" because of sdl and cgo not native Go code

The other languages are also using SDL via their respective interacting-with-C interfaces - what makes Go special here?



The author stated that the benchmarks run in headless mode, so I am not sure that it's SDL that's slowing it down here.

Even if it is SDL slowing it down, Go FFI being slow is still a real disadvantage compared to the other languages, and you can't just pretend like it doesn't exist in this case.


Yup, a 240x performance difference (zig vs. py) has very little to do with the language, vm, or whatever. As soon as I saw that, I dismissed the benchmark.

By these standards a 10 year old cpu with a beefy GPU will beat any new cpu as well.


>Yup.. a 240x performance difference (zig-py) has very little to do with the language, vm, or whatever.

It absolutely does. A simple for loop in the standard Python interpreter will literally take 100x longer than the same thing in a language like C/C++, try for yourself if you don't believe me. CPython is unbelievably slow.


Very interesting... Simple loop (0 -> 320000000) and add 1 to a variable.

I couldn't reproduce 100x (no optimization flags, otherwise it won't do anything)

  Apple clang version 14.0.0 (clang-1400.0.29.102) -> 0m0.347s
  ruby 3.1.2p20 -> 0m11.314s
  Python 3.8.12 -> 0m19.662s
So ruby is about 1.7x faster than python, but the C version is "only" 32x faster than ruby, and 55x faster than python.

I thought the differences would be smaller these days
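
The Python side of that is essentially this sketch (run under `time`):

    # loop.py - sketch of the microbenchmark described above
    n = 0
    for i in range(320_000_000):
        n += 1
    print(n)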


What does a GPU have to do with any of this? I think you may be a little confused about something but I am not sure what.


It depends, really. In this case, a game emulator, you can only get away with it because it's for an ancient game console. Otherwise you definitely cannot afford a 5x slowdown compared to C++ for a game.


Maximize pain for mediocre speed and money?


Also C# is quite reasonable.


Can you try PyPy as well?


Apparently not :( (These benchmarks were done on an M1 MacBook Pro)

    $ brew install pypy3
    pypy3: The x86_64 architecture is required for this software.


$ brew install pyenv

$ pyenv install pypy3.9-7.3.9

I like using PyEnv for managing my Python versions. It will natively compile Python builds and should be doing the same on M1 (which is what I'm using). `pyenv install --list` shows you what is available.

EDIT: Not sure why they don't have newer versions of PyPy there (I don't use PyPy) but all it takes is a PR to here: https://github.com/pyenv/pyenv


Thanks! Added to the list, it’s suspiciously only a little bit faster than CPython 3.11 though, which probably needs more investigation…

EDIT> Looks like CPython prefers using a dict as a lookup table for opcodes (which is what this implementation does), while PyPy prefers having a long series of if-statements. Hmm.


That makes sense, because PyPy can JIT a ladder of if/else into a compare and jump each, whereas a dict lookup can be an order of magnitude more complex. If they're 8-bit opcodes, maybe having a list-based lookup table will perform similarly on Python 3 and still optimize on PyPy?

Or maybe https://docs.python.org/3/library/array.html rather than list.
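
A minimal sketch of the two dispatch styles (the handlers are placeholders):

    def op_nop(cpu): pass
    def op_inc_a(cpu): pass

    # dict-based: hash the opcode, then probe for it
    DISPATCH_DICT = {0x00: op_nop, 0x3C: op_inc_a}

    # list-based: a dense 256-entry table indexed directly by the 8-bit opcode
    DISPATCH_LIST = [op_nop] * 256
    DISPATCH_LIST[0x3C] = op_inc_a

    opcode = 0x3C
    DISPATCH_DICT[opcode](None)   # hash + bucket lookup
    DISPATCH_LIST[opcode](None)   # plain indexed load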


Can you do Node 19.0 and Ruby 3.1.x too please?


TypeScript and Java are the two remaining languages that I feel I know well enough to translate 3k lines of code into in a weekend each -- I have approximately zero knowledge of Ruby though, and no motivation to learn it ^^; Pull requests are welcome!


OK, I'll watch your repo and track the updates, and when you release the JS/TS port I'll give it a try and port it to Ruby and send it your way.

Best wishes


do you use numpy for the numerical python stuff


what about java graalvm ?


I wanted to see what the results for pypy would be.

On my machine (very similar, a macbook pro m1 max):

    python 3.10: 52s
    python 3.11: 35s
    pypy 3.9.12: 10s
(This test is basically a perfect test for JITs: one loop repeated many times)

https://gist.github.com/llimllib/7af8144a92d3c2e1fc58be62988...


Out of curiosity I ran the same test on a linux laptop with the Ryzen 7 PRO 6850U CPU.

    python 3.10: 60s
    python 3.11: 46s
    pypy 3.9.12:  6s
Looks like pypy performs comparatively better on x86_64


makes sense I guess, it's had a lot more development time I'm sure. Thanks!


I think PyPy runs under Rosetta on M1, so the overhead is probably from that.


I don't think so; they released ARM support a while ago: https://doc.pypy.org/en/latest/release-2.1.0.html

(I could be wrong)

edit: oh, huh I think you're right that it's running here on rosetta:

    $ file installs/python/pypy3.9-7.3.9/bin/pypy
    installs/python/pypy3.9-7.3.9/bin/pypy: Mach-O 64-bit executable x86_64
Wonder if there's a way to run a native version?

edit 2: there are nightlies here: https://buildbot.pypy.org/nightly/py3.9/

Running the latest, a native binary gives more than 2x speedup:

    # first you have to allow all the unsigned binaries to run
    $ xattr -dr com.apple.quarantine pypy-c-jit-106295-5dd3b18303e2-macos_arm64/bin/*
    # then we get 3.5s:
    $ time pypy-c-jit-106295-5dd3b18303e2-macos_arm64/bin/pypy nbody.py 10000000
    -0.169075164
    -0.169077842

    real 0m3.522s
    user 0m3.468s
    sys 0m0.045s
pypy continues to impress!


Why is there such a dramatic performance gap / slowdown for 3.10 + 3.11 compared to 3.9.12?


That's PyPy 3.9.12, which has a JIT. The other two are standard CPython, which doesn't JIT the code.


Does pypy support numpy now?


yes

    $ pip install numpy
    <snip building wheel>
    $ python --version && python -c "import numpy; print(numpy.identity(5))"
    Python 3.9.12 (05fbe3aa5b0845e6c37239768aa455451aa5faba, Mar 29 2022, 09:54:47)
    [PyPy 7.3.9 with GCC Apple LLVM 13.0.0 (clang-1300.0.29.30)]
    [[1. 0. 0. 0. 0.]
     [0. 1. 0. 0. 0.]
     [0. 0. 1. 0. 0.]
     [0. 0. 0. 1. 0.]
     [0. 0. 0. 0. 1.]]


Oops, thanks for clarifying.


That's 3.9.12 of PyPy, not CPython


Not OP, but are you taking into consideration that it is not Python 3.9, but pypy, which is distinct


> On my machine (very similar, a macbook pro m1 max):

Should make essentially no difference then, since I rather doubt the Python implementation of n-body can leverage the GPU, or strains the RAM so much that the 200GB/s of the Pro (IIRC) would be an issue.


It looks very close for 3.11. The measurement from the parent comment includes two other python implementations (not covered in the article).


“Besides strict typing and having a generally ugly syntax, C++ also requires ahead of time compilation.”

What’s wrong with strict typing and ahead of time compilation as long as it compiles fast? Doesn’t this prevent many runtime errors that can occur in Python?

Python is a higher level language than C++ so it requires less effort but newer compiled/typed languages offer more of what Python is good at


> What’s wrong with strict typing and ahead of time compilation as long as it compiles fast? Doesn’t this prevent many runtime errors that can occur in Python?

Nothing, as long as the compilation is fast. C++ compilation is not fast. In the large C++ projects I've worked on over the last few years, compilation is bordering on 2 hours for a full build, and 10-60s for incremental builds. At one point, incremental changes were taking 15 minutes to link (resolved by [0]). Go is a great example of fast compilation and strict typing IMO.

[0] https://devblogs.microsoft.com/cppblog/improved-linker-funda...


Go was designed, in part, as a response to C++'s slow compile speeds. The Go compiler has gotten a bit slower over the years, but it's still pretty fast, especially when compared to C++ or Rust.

> In 2007, build engineers at Google instrumented the compilation of a major Google binary. The file contained about two thousand files that, if simply concatenated together, totaled 4.2 megabytes. By the time the #includes had been expanded, over 8 gigabytes were being delivered to the input of the compiler, a blow-up of 2000 bytes for every C++ source byte.

https://go.dev/talks/2012/splash.article#TOC_5.

I also hate how just adding some definition to a .h file that's only referenced in one .c (or .cpp) file will recompile loads just because the header file changed. Maybe there's ways to improve on that (ccache?), but I mostly write C to contribute to open source projects (rather than my own) and it can be an annoying wait.


You can limit the impact of this with forward declarations, and by only putting what is needed for the public API into the headers. The rest can all live in the .cpp


C++ (especially template-heavy with lots of headers and so on) is certainly not fast to compile. But 2 hours, that's crazy. What sort of codebase, and on what kind of machine?

A 128 cores threadripper is a great workstation to compile C++ code fast :D


For reference Unreal Engine 5 full build (building a lot of stuff you don't need) takes like 30 minutes on a 3950x. Typical incremental build is either 15 seconds, or a minute if I touch some popular header. For full build not only you can have a lot more cores, you can use IncrediBuild.


In my experience, a large Unreal Game can take as long as the engine on top of it. My current project is about 2 minutes of project code plus 15 minutes of engine code but my last project was more project code than engine code.


Unreal engine games, on a 3990x Threadripper with 128GB ram and NVMe SSD drives.


Chromium is known for taking hours.


That is true. But Chromium is not in the realm in which the compiled vs. interpreted debate is relevant. It is slow to compile, but it would also be very slow to run if it was written in Python.

(Unless a hypothetical Node.js rewrite would be comparable on speed. I wonder how it would do on memory.)


More like an hour or less on reasonable, consumer hardware[1]. A lot less on workstation/server CPUs, or with IncrediBuild.

[1] https://youtu.be/nRaJXZMOMPU?t=770


I am on Gentoo so I have some build time stats!

My system has a 3900x, 32GB ram, and a fast NVME SSD for reference.

The last chromium build took 1h:45m to complete. This build had LTO enabled and -j18 passed to ninja.

Without LTO it would probably be closer to an hour I think but I don't seem to have a non-LTO build in the logs.

Your estimation seems pretty accurate!


Most people will use ccache under Unixen.


And for those of us who work primarily with MSVC?


> Go is a great example of fast compilation and strict typing IMO.

I think an even better example might be OCaml. Ocaml's compilation speed (last time I checked) was on par with Go, but it provides a much nicer (IMO) type system.


OCaml is another example of fast compilation and static typing.


I have worked with very few C++ projects that compile fast, and my experience is that the errors in C++ projects are often more severe than the errors in Python projects. Even with modern C++ smart pointer style, I’ve seen all sorts of stuff like dangling pointers / use-after-free, buffer overruns, etc. All in code that had been reviewed.


Yes, C++ was created in the 1980s and a lot has been added since...

That’s why I was suggesting newer compiled languages incorporating some of what we’ve learned over the past 4 decades. eg type inference


In general, the more work the type system is doing for you, the slower the compilation speed. Kotlin and Scala, for instance, both compile far more slowly than Java. I think even C# compiles very slowly compared to Java. You mention type inference, but type inference actually slows down compilation, and the more sophisticated the type inference, the slower the compilation!

Similarly, Rust compilation speed is much slower than Go's, whose type system does relatively little for you. I haven't seen exact numbers, but I've seen people say Rust and C++ have similar compilation speeds, and both are, in general, slow to compile.


Rust gives you so much more for the same compilation speed though.


Try any larger Swift project and suddenly C++ looks pretty fast. Agreed on the pointers though.


In my experience a good pipeline with decent unit tests makes c++ and python work fine.


And Valgrind (and / or Memory Sanitizer), they mostly remove a big source of problems in C++ code.


>Python is a higher level language than C++ so it requires less effort but newer compiled/typed languages offer more of what Python is good at

I see Nim as clearly better if you're writing larger software. It's a shame its usage is low and thus there aren't many Nim resources.


I think Nim is best placed as the choice for the few performance-critical, CPU-bound parts at the fast functional core of a larger Python codebase.


I see a strong convergence of typed/fast/expressive languages between python/php/ruby and cpp/ada. Rust is one trendy instance, but I believe it's gonna carve out a spot of its own.


> What’s wrong with strict typing and ahead of time compilation as long as it compiles fast? Doesn’t this prevent many runtime errors that can occur in Python?

Nothing, it's all good things: It's easier to write, easier to debug, and the compiler (not the user) catches bugs.


C++ compilation is anything but fast.


I'm really impressed with the performance improvements in Python 3.11.

I ran a very basic benchmark against a local web application: I got 413.56 requests/second on 3.10 and the exact same code gave me 533.89 requests/second on 3.11.

That's a big enough increase that I think it's worth actively upgrading projects. Usually I wait for a few months for things to settle in first.


That n-Body simulation is a bad case for Python today in that you loop over the differential equation solver in Python. (It's like the very branchy semantic web and old AI stuff that I do for fun... I am migrating a lot of that to Java, PyPy helps a great deal but Java does a lot better.)

If you are doing heavy matrix math, numpy runs at FORTRAN speed, tools like scikit-learn and Tensorflow also get high performance by doing the heavy lifting outside Python.
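
A minimal sketch of that difference, with pairwise differences standing in for the heavy lifting (shapes and sizes are illustrative):

    import numpy as np

    pos = np.random.rand(500, 3)   # 500 bodies in 3D

    def pairwise_py(p):            # pure-Python loops: interpreter overhead per element
        n = len(p)
        out = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                out[i, j] = np.sqrt(np.sum((p[i] - p[j]) ** 2))
        return out

    def pairwise_np(p):            # vectorized: loops run inside numpy's compiled kernels
        d = p[:, None, :] - p[None, :, :]
        return np.sqrt((d ** 2).sum(axis=-1))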


Sounds like a case of "write the code to show what it should look like and show what numbers you get compared to the code they posted"?

But remember that the exercise wasn't "to run the n-body simulation", or even "to run maths", but just to see how fast "plain Python" is with the release of 3.11 - using numpy and scipy, which rely on compiled code that they can hand work off to, would make any runtime value you get completely meaningless for the purposes of benchmarking pure Python =)


The viability of Python for scientific work is predicated on using Python as glue code for code written in other languages. If pandas ran at the speed of Python people really would be using Julia or some other language instead of Python.


Agreed! Which is why this is a benchmark of Python, not of "running scientific computations". Of course you're going to use numpy/scipy for that. Or you start using R, if you really care. This article isn't about what the code does, there are a million improvements to make if that's what we were looking at. It's about how fast a small program that uses some standard pure Python patterns runs today, compared to a few versions ago, with some "but what if not Python?".

Because "you wouldn't write this code" applies to those examples as well. In JS you'd tap into a C-compiled native library the exact same way numpy/scipy does in Python. And in C, if you absolutely needed the fastest performance, the code would be full of micro optimizations and a sprinkling of assembler.


Who cares about "pure" Python? The whole point is that you have powerful libraries at your fingertips and you can leverage those to get fast code. Can you do the same in Go? JavaScript? PHP?

I'm tired of people comparing languages but then leaving out the major winning points for Python. Numpy & co. are an integral part of Python; no serious dev would use just "pure" Python for numerical methods, ever. So let's compare real-world Python, shall we? I doubt Go has a chance then. /rant


Turns out other people don't care if you're tired of something, because they're not you. Plenty of folks do care about how much pure Python has improved. That's how it got to the front page.

If you want to test baseline performance, you want a little program that's small enough to be easily understood, but just elaborate enough to hit enough of the standard programming patterns. So it's basically irrelevant what this code actually does, we're just looking at how much faster Python has gotten. Because that's something we care about. We don't care that "go is faster" or "C is faster", we use Python and we like to know that the newer versions actually are substantially better than the older versions we used to (or still have to) work with.


Clearly the solution is to add a jitter to Python that identifies code that looks like a matrix multiplication, and calls numpy instead.


> numpy runs at FORTRAN speed

Not everyone will be aware that this meant as praise. ;-)


Yep.

FORTRAN codes persist today because (1) the old school memory model of FORTRAN is fast, and (2) it is so easy to write numeric codes that do the wrong thing with rounding and numerical instability. There's a reason why Foreman Acton wrote a book titled Numerical methods that (usually) work.

https://www.amazon.com/Numerical-Methods-that-Work-Spectrum/...

Code something up in C, Haskell, OCaml or CUDA and you miss out on the 40+ years of experience people have had with a FORTRAN code from the 1970s.


A lot of physics simulations use FORTRAN for the existing libraries, and there's a way to run it on GPUs. It's not going anywhere.


> (2) it is so easy to write numeric codes that do the wrong thing with rounding and numerical instability.

Are you saying FORTRAN avoids these problems, or that it is prone to them? If the former, how does it do it?


Ideally, people who understood numerics wrote the code in the 1970s and it has gotten heavy use since then so if there are problems with numerical instability they've been detected and solved.

Today somebody who doesn't know numerics frequently codes something up for the wrong reasons (e.g. to learn a new language, because they think the 1970s FORTRAN code is obsolete, ...) and never did the testing to know that the code they wrote is numerically stable or not.

That is, you might think it is pretty easy to code something numerical up, and sometimes it is, but frequently you write something that's a little bit wrong and sometimes you write something that's terribly wrong, sometimes it isn't even wrong.

It's not that FORTRAN is necessarily more accurate than another language, but that you can trust a code that has been around for 40 years and codes that have been around 40 years have been written in FORTRAN.

---

As for the memory model I think about it the most when I write embedded programs for my Arduino.

It drives me nuts that C diddles the stack pointer around meaninglessly when for most of the programs I write there are a small number of parameters that decide the size of all the arrays (like an old FORTRAN program) and local variables, recursion and all that are a source of problems and not solutions.

The only reason I write C for that thing at all is that some of the programs I write are performance sensitive and I could get a bigger boost running the C code on an ARM or ESP32 than I could get writing AVR8 assembly and eliminating meaningless loads, stores and other activity that C does "just because".


It is still informative to see how slow native Python really is.


I agree. This was honestly news to me - and I very often use Python for maths. However, I would never write the code as he did (instead I would rely on numpy/scipy). So I would also be interested in a numpy version of the same test.


How exactly would you write a gameboy emulator in numpy/scipy?

It's sequential code with fiddly side effects. I know, I've written one.

But I'm generally curious if this is in-fact possible in someway.


Numpy would probably be even slower here. Numpy is good when you have large arrays, but it adds roughly 0.1 to 1 µs of overhead per call.
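
A quick sketch that shows the per-call overhead dominating on tiny arrays (exact numbers are machine-dependent):

    import timeit
    import numpy as np

    a = np.arange(5)
    b = np.arange(5)
    xs, ys = list(range(5)), list(range(5))

    # on 5 elements the fixed per-call cost dwarfs the actual arithmetic
    print(timeit.timeit(lambda: a + b, number=100_000))
    print(timeit.timeit(lambda: [x + y for x, y in zip(xs, ys)], number=100_000))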


Without validating anything myself, I was able to find this post https://hilpisch.com/Continuum_N_Body_Simulation_Numba_27072... which showed a simple n-body program sped up by ~670 times when moving from pure python to numpy+numba.


2 things to notice: the first is that this is with 5 bodies while your link was with 1000. For 1000 bodies, numpy is a noticeable speedup (100x). For 5 particles (I used the same code as your article but adjusted the number of particles) numpy is 5x slower. Adding numba would make this fast again since it would remove the overhead, but at that point, just use a fast language in the first place.


To paraphrase:

> That benchmark shows that Python is slow. If you avoid writing your code in Python it can be really fast!


You should give JAX a go.

https://github.com/google/jax


+1 for JAX. Basically designed to be the successor to TensorFlow, and much nicer to work with. Strangely I've not seen it discussed around HN much but it's what I do 100% of my work in these days.

Whilst I'm here: shameless self-promotion for Equinox and Diffrax:

https://github.com/patrick-kidger/equinox https://github.com/patrick-kidger/diffrax

Which are neural network and differential equation libraries for JAX.

[Obligatory I-am-a-googler-my-opinions-do-not-represent-your-employer...]


I feel like JAX usually comes up on PyTorch posts (in the same way PyTorch usually comes up in the comments on TF posts).

I think the delta between JAX and PT on commodity hardware is just a little bit too small to really create much splash on HN.


The point of making Python faster is not just making the above n-body simulation faster, but also improving performance when Python is used as glue code for other, more efficient systems.

But the language now has a much better shot at getting faster, given the amount of resources dedicated to CPython, similar to what happened with JavaScript. I hope CPython eventually is as fast as JavaScript.


I see a lot of comments about how other languages are faster.

Please think about your actual workload and take them with a grain of salt.

For instance, for most web apps, you spend a large amount of time waiting on database responses. It looks nothing like the modeling the tests in the article do. Benchmarks are not typical workloads.

Don't just assume because some random person on hn says zig is faster you should rewrite your business apps.


I've actually found this to be completely untrue in practice. Just about every web service I've dealt with in production, if not all of them, have been CPU bound. This is for a number of reasons:

1) Network speeds have increased dramatically compared to CPU speeds.

2) People don't optimize code very much.

3) Web apps tend to do more work per request than they did in the 90s.

Regardless, I've never seen an app saturate its network pipe, but I've seen plenty saturate all their CPU cores, including relatively well-tuned ones. For instance, I wrote a Netty-based reverse proxy app once, and while I got it to run far faster than the typical app in that company, it was still CPU-bound in all my tests.


+1 to this.

We picked Python's asyncio-based Tornado for our services. As soon as we hit scale we were CPU bound.

Deep diving into profiling came up with JSON parsing being the culprit.

It's a painful problem to crack once you hit those limits. In some cases you can genuinely skip parsing the JSON (by returning a raw JSON-containing string to the client), such as when you are simply getting data from a cache or db.

In other cases you simply can't skip it. For eg when you are interfacing with a 3rd party library that will only speak JSON. At that stage you are stuck.

You could try a wrapper around a faster native JSON parser (say, ujson), but it will be a trade-off between the parsing time saved and the time taken to copy the string to the FFI parser and copy back the results, plus all the complexity that entails.

Or you could hand it off as a job to an async queue (this might be the canonical architectural approach to prevent blocking the event loop), but then you have just shifted the problem to a different place, where you'll still need to throw more instances at it. And this adds extra latency.

I too was in the "don't optimize prematurely" camp but picking Python today for new services IMO would be taking that principle a bit too far.

Especially considering the ergonomics that modern languages like Golang or Rust offer.


> It's a painful problem to crack once you hit those limits.

If you hit that doing something novel and obscure, sure. If it is literally parsing JSON, you spend a little effort researching non-stdlib JSON parsing libraries, pick one of the several stable much-faster-than-stdlib ones, and move on.
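As a sketch of how small that change usually is (using orjson as one example of a stable, much-faster-than-stdlib parser; not necessarily what the parent comment used):

    import orjson  # pip install orjson; a Rust-backed JSON library

    def parse(payload: bytes) -> object:
        # orjson.loads accepts bytes directly, skipping an extra decode step
        return orjson.loads(payload)

    def serialize(obj) -> bytes:
        # note: orjson.dumps returns bytes, not str, unlike json.dumps
        return orjson.dumps(obj)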


Like I mentioned, we did that. You get a small constant-factor speedup. But moving these services to Rust took our instance count from 32 down to 2.

Once the team has the know-how to write a service in Rust/Golang, or even modern, pleasant-to-write Java, it becomes hard to justify picking Python for a new service at all.


He explicitly mentioned that path…


He explicitly mentioned some of the troubles you might have implementing and maintaining that yourself via FFI.

He very much did not explicitly mention that it's already done, with established results, and that the described “pain” isn't something you have to take on at all.


And then there is the resource usage: a typical dynamic interpreted language uses orders of magnitude more CPU and memory than code compiled to machine code. Enough of these apps running in a data center can add up to a lot of wasted resources!


I'd argue a lot of web apps just serialize to JSON after doing minimal business logic, most of which boils down to DB requests.

Once again, I said to look at what your load actually is. If you're CPU bound, then yes, buy a bigger instance, optimize, or switch languages.

It's like how if you are putting up a blog, you probably shouldn't be looking at running kubernetes clusters for just that blog.


You might be surprised to find out how CPU-intensive even something as simple as a DB request can be. Your driver has to do the work of communicating with the DB server via its protocol, which is similar in cost to any other network request. More importantly, it has to deserialize the result and marshal the data into objects on the managed heap. Most DB drivers don't support streaming, so for larger requests you have to read all the data into memory, then convert it all into objects at once.
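(Where the driver does support streaming, it looks like this hedged sketch using psycopg2's server-side "named" cursors; `big_table` is hypothetical, and rows arrive in batches of `itersize` instead of being materialized all at once:)

    import psycopg2

    conn = psycopg2.connect("dbname=app")
    with conn.cursor(name="stream") as cur:  # named => server-side cursor
        cur.itersize = 2000                  # rows fetched per round trip
        cur.execute("SELECT * FROM big_table")
        for row in cur:                      # iterate without loading everything
            print(row)                       # stand-in for real row handling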

And that's not even taking into account frameworks and ORMs like Hibernate, which itself can multiply the CPU used several-fold on top of JDBC, or whatever lower-level interface it wraps. I've never measured frameworks in dynamic languages, but I have no reason to believe they aren't similarly inefficient.

And, yes, one of the optimizations I did for my reverse proxy app was to upgrade the JSON library, which brought a significant performance boost. But it's not the only source of CPU usage for apps, nor was it the only major optimization I successfully applied.


Yeah, I've profiled slow Python apps before, and it's almost always serialization that's the sticking point. The default implementation of date parsing in particular is really slow.
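To make the date-parsing point concrete, a quick illustrative comparison (numbers will vary by machine): the general-purpose `strptime` is far slower than the special-cased `fromisoformat`.

    import timeit
    from datetime import datetime

    s = "2022-10-26 12:34:56"
    t_strptime = timeit.timeit(
        lambda: datetime.strptime(s, "%Y-%m-%d %H:%M:%S"), number=100_000)
    t_iso = timeit.timeit(lambda: datetime.fromisoformat(s), number=100_000)
    print(f"strptime: {t_strptime:.2f}s  fromisoformat: {t_iso:.2f}s")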


> I've never seen an app saturate it's network pipe

True, but that has nothing to do with what OP said:

> you spend a large amount of time waiting on database responses.


Yeah this is a constant excuse for Python's slowness but I have yet to work with a Python codebase that wasn't slow.

I mean I could buy it if Python was maybe 5-10x slower than "fast" languages, but the benchmarks people are throwing around show it is 50-100x slower. Every Python codebase I have used (apart from one-off scripts I guess) has eventually run into the "ok it's slow, how can we make it faster?" barrier.


So, python is slower than go, rust, zig, etc. OK we agree.

Now, does it matter? Not all codebases are the same. If I'm serving API requests, does 55ms vs 15ms make a difference? 110ms vs 75ms? If you're running analysis over a huge data set and an iteration takes 100ms vs 50ms? Yeah, that could be the difference between weeks and days.

Almost like someone should look at what they're doing before saying 'IshKebab said python is slow, so we should rewrite all our code'


I've worked at places where both your comparisons absolutely matter. 55ms vs 15ms for a domain-specific search engine breaks a budget of 20ms. 110ms vs 75ms broke the budget of an ML API. Not everyone has the luxury of not caring about speed.


Well those sort of time differences would make a difference.

But usually 90% of the response time for an API request is the time it took the SQL server to execute the relevant queries.


For most cases it just doesn't matter.

I make software that requires searching over large datasets (image recognition for construction drawings). It's a web app running python on the server and it feels instantaneous for users as they're searching.

Most people aren't even doing that - they're just pushing and pulling data from a db. Building maintainable software is what really matters.


The issue is that at the start of the project it doesn't matter, but in my experience projects always reach a point where it does matter. And at that point you have 500k lines of Python that you can't do anything with, and you're looking enviously at your competitors who chose Go or Java or C# or Typescript or Kotlin or Swift or Nim or Dart or Rust or ...

That's why I'd never start a new project in Python (among other reasons). It backs you into a poor performance corner.


>If I'm serving api requests, does it being 55ms vs 15ms make a difference?

Yes: per worker, it's the difference between being able to serve 66 r/s or 18 r/s.


>For instance, for most web apps, you spend a large amount of time waiting on database responses.

Still, for web apps there are huge gaps based on the language and framework used. [0]

[0] https://www.techempower.com/benchmarks/#section=data-r21


>"C++ is a compiled language, which means that it lacks some of the convenience of Python and JavaScript. Besides strict typing and having a generally ugly syntax, C++ also requires ahead of time compilation."

To me, ahead-of-time compilation is a convenience that in many cases catches bugs before they present themselves in all their glory to a customer.

As for "generally ugly syntax" - beauty is in the eye of the beholder. I am, for example, multilingual and do not get hung up on syntax unless it resembles Brainfuck. However, having fn instead of function, not having brackets around the parameter list, or having variables use special characters does not qualify as "beauty". It is just a different way of doing the same thing, and it often feels like it is done for the sole purpose of being different, or as a half-arsed attempt to make parsing simpler.


> As for "generally ugly syntax" - beauty is in the eyes of the beholder.

It's not entirely subjective. For example, a language that is inconsistent in its use of syntax could fairly be described as "objectively ugly".


You just uglified nearly every language.


Ok. Maybe I should have qualified the example a little more.

My point is - I don't buy that "language aesthetics are entirely subjective".


If aesthetics were objective, then people would not be creating "ugly" languages. The fact that they do shows that, unless they are making Brainfuck on purpose, they believe their creations are not ugly. After that, it only comes down to what percentage of people believe that language XYZ is a "beauty".


Javascript being 41x faster than Python seems a bit excessive.

In my experience, having done quite a lot of dynamic language benchmarks, Javascript is about as fast as PHP and both are about 6x faster than Python.

Has someone looked at the Python code used here, if it has any obvious gotchas?


There is one. The author's implementation of `combinations` runs in quadratic time, making copies of lists on each iteration. I replaced it with `itertools.combinations`, which uses an iterator, and found a pretty big difference:

     ~/p/not_that_slow  python3.11 modified.py 5000000
    -0.169075164
    0.183753791
    Time: 6.335 s

     ~/p/not_that_slow  python3.11 main.py 5000000
    -0.169075164
    -0.169083134
    Time: 48.515 s
Other than this, the only modification I made is to include a `print` statement at the end to show the time taken.

EDIT: That is not actually equivalent, my bad. The iterator is consumed and the execution ends early; I didn't realize the algorithm iterated over it multiple times.
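For reference, a sketch of the substitution and the gotcha (with stand-in objects instead of the benchmark's particles): `itertools.combinations` returns a one-shot iterator, so if the pairs are walked more than once per run they need to be materialized.

    from itertools import combinations

    bodies = ["a", "b", "c", "d"]  # stand-ins for the particle objects

    # roughly what the original quadratic helper produced:
    pairs_naive = [(bodies[i], bodies[j])
                   for i in range(len(bodies))
                   for j in range(i + 1, len(bodies))]

    # iterator version; list() makes it safe to iterate repeatedly:
    pairs_fast = list(combinations(bodies, 2))
    assert pairs_naive == pairs_fast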


How are you using it? What you printed out (modified.py) seems to be a different result than expected.


Heavy list creation is one way to really slow down Python.

And itertools is a real gem, full of lots of goodies to handle list manipulation tasks.


It's a benchmark that is basically the ideal case for a JIT, so it's really Python vs. JS that got compiled to native code.

I’m in no way criticizing the benchmark but that’s the reason for the starker difference.


let me know if you find anything! Happy to edit the post with a correction.

For a little more information, this is using Bun 0.2.1, which uses JavaScriptCore (JSC), a part of WebKit which powers Safari. Since I'm running on an M1 Pro (Apple's ARM chip), there is probably somewhat of a benefit in using JSC.


It's not excessive in my experience. Javascript is usually very fast, much faster than PHP and Python.

There are many flaws in this benchmark but the order of magnitude looks right to me: https://benchmarksgame-team.pages.debian.net/benchmarksgame/...


Even a 2x-faster Python still makes for a terribly slow language.

Improvements look amazing relative to old Python. But compare it to PHP, Javascript, Lua. Ok, it might be better than Ruby sometimes.


I thought Ruby was now faster than Python. (I know that wasn't always the case, though.)


For a long time many of the .NET Runtime's most fundamental and performance-sensitive functions were written in C++. In the beginning the JIT wasn't fast enough, or C# didn't have good enough native interoperability, or it didn't have the tools to write fast code. As the years went on the JIT got faster, and more and more improvements were added to help write high performance C# code. Now there's a mass effort to port most of the runtime to C# to make it safer, more maintainable, and often faster. RyuJIT may not generate code as nice as LLVM but there's so much overhead calling into native code that there are gains to be had.

Imagine the alternate timeline where numpy is removing some of its C function calls because Python is fast enough on its own. I'm not sure we can make it there from here.


The boundaries are getting fuzzier. Implement this in Python _with JAX_ with little if any extra effort, run it on GPU, and you get performance you’d alternatively get only by writing custom CUDA kernels in C++.
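For the curious, a minimal sketch of what that looks like (an illustrative pairwise-gravity step, not any particular article's code): `jax.jit` traces the Python function once and compiles it with XLA, so the same code dispatches to CPU, GPU, or TPU.

    import jax
    import jax.numpy as jnp

    @jax.jit
    def step(pos, vel, mass, dt=0.01, eps=1e-3):
        # pairwise displacement vectors, shape (n, n, 3)
        d = pos[None, :, :] - pos[:, None, :]
        # softened inverse cube distance avoids division by zero at i == j
        inv_r3 = (jnp.sum(d * d, axis=-1) + eps) ** -1.5
        acc = jnp.sum(d * (mass[None, :, None] * inv_r3[:, :, None]), axis=1)
        vel = vel + dt * acc
        return pos + dt * vel, vel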


I don’t think the borders are that fuzzy.

Who has a C++ repo that isn’t in a low level language because it has to be?

There is no production repo where python gets fast enough to replace Rust/C++.

Still great to see improvements because there are repos where python replaces other languages.


Yeah but Python with Rust modules might be quite effective.


Even CuPy too. I have been using CuPy for some of our own work, and it's so easy to do GPU-based scientific computing now that it's almost funny.
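For readers who haven't tried it: CuPy mirrors the numpy API closely enough that porting is often little more than an import swap (a hedged sketch; assumes an NVIDIA GPU and a matching cupy build):

    import cupy as cp  # near drop-in numpy replacement backed by CUDA

    x = cp.random.random((4096, 4096)).astype(cp.float32)
    y = (x @ x).sum()   # matmul and reduction both run on the GPU
    print(float(y))     # copies just the scalar result back to the host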


Nice speedup! I'd love to see a comparison between 3.8 and 3.11 of the same script, but using numpy


wouldn't you expect numpy to mostly mitigate the differences?


Now just waiting for the GitHub Actions runners to be updated with Python 3.11 (deployment next week - they deploy image updates weekly).

https://github.com/actions/runner-images/issues/6459


The actual title should be “Python 3.11 is much faster than 3.8” to match the page.

That missing word is important.


Indeed - I clicked expecting there to be some kind of catch.


I am not sure the JS vs. C++ benchmark is really fair. Using the Unix `time` utility, you are also including the startup of the JS runtime.


> Python is a popular but reputably slow interpreted language.

_sigh_

Why do people still say that?

1. There are no interpreted "languages", only interpreters for said language. One can compile or interpret anything.

2. When Java didn't have a JIT, it was still called a "compiled language", even though it was running bytecode, same as Python.

I think this "interpreted" vs "compiled" distinction is an anachronism. Pure interpreters are almost extinct.

> (Javascript) It's a JIT compiled language with far more investment

Oh, so it's compiled now?

It's indeed due to investment, it wasn't always the case.

> C++ is a compiled language

Or is it?

https://root.cern/cling/


What they don't tell you is that python 3.8 was slower than 3.3.


Can you test some alternate pythons: pypy, pyston, pyston-lite ?


Does someone have these times for more languages?

Would be curious to see the performance of Java, Rust, Ruby, w/e.

I can estimate by using the energy consumption of each language as compared to C (https://storage.googleapis.com/cdn.thenewstack.io/media/2018...) but real numbers are appreciated.


I am very partial to Python, but I have hit a lot of bottlenecks with it. I find Go to be a reasonable intermediate, balancing ease of use and flexibility with some speed trade-offs. I should probably learn Rust, but it seems like a lot for writing web apps, and I could never get my head around lifetimes.


> JS is 41x faster than Python

Except that Pythonistas often delegate such computations to libraries like numpy that achieve C-level speed. Python is supposed to be a glue language, JS is supposed to be a language that can run on browsers, so of course the two have different goals and performances.


I don't think there's much holding JS back from also being a glue language. JavaScript has certainly evolved beyond being only a browser language these days.

The only request I'd have is more granular garbage collection (marking FFI memory as refcounted for more immediate destruction).


TypeScript is better as a glue language than Python IMO.


Is the performance difference of Python and Javascript due to the language or due to the runtime?

If one would compile both, the Python version and the Javascript version, to WebAssembly and execute that, how would the performance differ then?


The WebAssembly version of this exact test takes about twice as long as running natively. I just happened to do exactly this benchmark a few minutes ago:

Python 3.11.0 native: 0m43.755s

Python 3.11.0 WebAssembly (on node.js): 1m31.305s (about 2.1x as long).

For what it is worth, I also maintain a Python --> Javascript transpiler (https://www.npmjs.com/package/pylang) and I tried it on exactly this benchmark (with some very minor changes):

PyLang: 0m19.386s (on node.js) about half as long as Python 3.11.0 native

The article uses bun, which uses Safari's JS runtime, which sometimes has significantly better performance than node.js for these sorts of benchmarks (especially for Webassembly), in my experience.

These tests were all on an M1 Macbook Pro.


It's due to the runtime.

JavaScript has a runtime that updates as it executes. It notices a loop is being run frequently and then compiles a new version of the loop that runs faster.

Python doesn't have this functionality.

WebAssembly is for running things in the browser and would slow everything down (Python, JS, C++). The tests in the post are run natively (on my command line).


Python does have this functionality, but it's hamstrung by the implementation.

Javascript is, for the most part, completely sandboxed and separate. Which means the JIT for Javascript is totally free to do neat things like move an object from one region of memory to another.

CPython, on the other hand, has to care about things like "was this object created in C? Is it pinned to that memory location? Does this method end up delegating to C?"

And you see that in the complexity it takes to introduce native methods in both. Moving and using data from JS to WebAssembly is a PITA. Accessing data in a Python object from C (and vice versa) is trivial.

That triviality is what gets in the way of JIT optimizations.


I don't buy this argument. WebAssembly is a vastly different beast than trivial C extensions (since WASM is a browser thing). JavaScript has both and is still much faster.

Bun (the runtime I used for the post), has a native C interface that's far simpler to use than Python (even with pybind/nanobind): https://github.com/oven-sh/bun#bunffi-foreign-functions-inte...

Deno (a v8-based runtime) has the same thing: https://deno.land/manual@v1.25.4/runtime/ffi_api

These use the standard C-ABI directly without requiring headers (unlike Python).

Node.js standardized a more Python-like C API called Node-API: https://nodejs.org/api/n-api.html

I believe Python can continue to add performance and is not hamstrung by the implementation.


The difference between all of these and how the Python C API works is that one layer of abstraction and translation.

Take the Deno example: in order to invoke a C function, you have to tell Deno "load this lib, this dynamic function is here, and the method signature looks like this".

Look at what types are permitted to be passed between the two as well. Deno isn't exposing Deno objects to C; it's exposing only simple primitive types. The CPython FFI, on the other hand, grants C full access to Python objects.

That one layer of abstraction and distance makes all the difference and is hard to backport in.

There's a reason alternative Pythons (PyPy, GraalPy) don't really support it and struggle to support libs that rely heavily on it (like TensorFlow).


You can expose objects. Here's how it is done in Bun: https://github.com/facebookresearch/shumai/blob/main/shumai/...

We've been using this feature heavily in Shumai.

I think you are vastly overestimating the complexity associated with this (user exposed ref-counting/garbage collection) and may not be totally up to date on what's implemented.


> I think you are vastly overestimating the complexity associated with this

No, I think you are misunderstanding the problem.

I'm not saying that, with a good 3rd party lib, generating FFI APIs can't be easy and slick. I'm saying that non-python languages have more complicated FFI APIs that practically necessitate these sorts of generation libraries (like bun).

When you grab a bit of memory out of an object in C with python, you are reaching directly into python's internal representation of the object and tickling the bits there.

When you do that with deno/javascript/others, there's a layer of abstraction introduced to keep our native method from directly tickling the bits that the VM is aware of.

That's the problem.

From C, you can create a new python object, store it off in a global variable, send it back to the python vm, and later in a thread go tickle some of the bits and see that tickling in the Python VM. Because that object is the same one used by Python and C.

The complication that arises with CPython is that many very popular libraries (numpy, tensorflow, pandas) rely HEAVILY on the fact that the objects they are working with in C are the same ones Python uses. That's why they've been so slow to port to PyPy, if at all.

And that ability for C libraries to very deeply interact with the VM is exactly the problem that makes it hard to improve CPython's JIT story. That's the reason other Python JITs, like PyPy, either don't support CPython's FFI capabilities or support them only in a very limited way.

The reason FFI works so well with other languages is they drew very tight and clear boundaries around how interactions work and who owns what when.

So please, stop spamming "bun". It's a non sequitur.


I'll concede that Python has more nuanced bits of complexity exposed, but I suspect a willingness to break compatibility for the sake of performance would benefit the language a lot more than it would hurt it. I personally hope the hacker mentality and the immensely perf-driven nature of JavaScript is one day adopted by Python.

For those following, I want to make it clear that the "bits" being discussed are extremely nuanced and won't get in the way of doing obviously useful things in JavaScript.

Here's how you create a JS Object from C using Node-API: https://nodejs.org/api/n-api.html#napi_create_object

It conforms to ECMAScript 6.1.7 Object Types and is very much the same internal representation used by the VM.

Here's how one might tickle the internal representation used by Node.js (V8 in this case) to change the value of the underlying property of a JavaScript object from C: https://nodejs.org/api/n-api.html#napi_set_element


> I suspect a willingness to break compatibility for the sake of performance would benefit the language a lot more than it would hurt it

It took a long time for the community to recover from the python3 incompatibility with python2. I don't think we are ready for another round.


> but I suspect a willingness to break compatibility for the sake of performance would benefit the language a lot more than it would hurt it.

That's already happened, in the form of GraalPy and PyPy. The community has voted, and it wants compatibility more than it wants performance.


> Accessing data in a Python object from c ( and vice versa) is trivial. That triviality is what gets in the way of JIT optimizations.

Can't it be automated? Accessing data is often done in patterns that are very similar.

Perhaps we should access Python from Rust, would that help?


Absolutely, and it usually is. The FFI isn't generally hard to use because of the automation.


Just some extra info to add to this for OP/others. The keyword to Google to learn more about this would be a "JIT compiler". If you were to run each of those benchmarks to loop only a few times (or steps for the n-body example), say n=10, they would appear much slower because the JIT compiler needs lines of code to be run multiple times before it can begin making optimizations.

"pypy" is an alternative runtime for Python which can do JIT optimizations like Javascript interpreters do. Another commenter in a different subthread already posted a quick speed comparison involving pypy, it's much faster than the standard python runtime (called CPython).

But pypy isn't necessarily a synonym for Python (which almost always means CPython). It doesn't have 100% compatibility with 3rd party libraries (it's probably 99.9% but just 1 unsupported dependency is enough to prevent your whole project from going pypy).

And also with most Python projects, the python code isn't the slow part of the stack. If you're making a web app, your cache and database is probably where 99% of your request time is spent. For scientific computing, data analysis, or ai training, you're usually using 3rd party libraries written in a "fast" language like C/Fortran/Rust with python hooks, and your python just acts as glue code to pass data between these library calls.


> Is the performance difference of Python and Javascript due to the language or due to the runtime?

It's almost always the runtime. Unless the language spec makes it particularly difficult to implement a feature in a way that's performant while still respecting the spec.

Which is why the discussion on which languages are "faster" is seldom productive. Languages are not fast or slow by themselves. And, even when they are on average, implementing stuff on Assembly is no guarantee of speed. Quite often, higher level abstractions lead to optimizations that are difficult to replicate if one is operating at a lower level.

As a very trivial example, consider short-circuit evaluation. It's something that you explicitly have to code in assembly, but most languages implement it out of the box. Or garbage collection: allocating and deallocating memory on demand is memory-efficient, but it's not necessarily what you want for performance. In many cases, deferring garbage collection allows often-accessed memory to be retained, whereas a C implementation would be constantly allocating and freeing. Fixing that requires programmer effort.
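To make the short-circuit example concrete (a trivial illustration in Python): the branch is part of the language's semantics, whereas in assembly you would write the conditional jump yourself.

    def expensive_check():
        print("expensive_check ran")
        return True

    # `and` settles on the first falsy operand, so the call never happens
    result = False and expensive_check()
    print(result)  # False, and nothing was printed by expensive_check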

What I think matters the most is the ability of a language (+runtime) to access lower layers when needed (ASM blocks in C, FFI in Python).


>And, even when they are on average, implementing stuff on Assembly is no guarantee of speed. Quite often, higher level abstractions lead to optimizations that are difficult to replicate if one is operating at a lower level.

When browsing Code Golf, more often than not the fastest implementation is in assembler. And sometimes it can be 2x-4x faster than the next-fastest implementation, which is usually C++.


While with enough thrust even pigs can fly (witness the heroic efforts to make JS fast), some languages are indeed inherently faster than others, either because their semantics are easier to translate efficiently to an actual machine or because there is less abstraction between the language and the machine.


It's both. Language semantics make it hard to optimize Python, even for the runtime / even for a JIT.


An underrated factor in this conversation is implementation effort. Javascript is arguably the most widely run interpreted language in the world. Pre-Chrome, JS was uniformly rather slow, but Google was betting on webapps being good and usable, so they brought on some top-tier compilation/VM talent to build V8. Once fast Javascript existed, it became possible to do more natively in the browser, and other browser vendors joined in an arms race. Tons of effort overall; tons of money and time from top-tier experts.

There was no equivalent competition between major organizations to make Python fast. Folks have tried, but not on the same scale. Yes, factors like language design and existing ecosystem entanglements played a role here, but if Python had had the investment in performance Javascript had (financial and expert attention), it'd be dramatically faster today (though it may have required a fork or similar, depending on how tied to simplicity Guido was feeling).

To be clear, I'm not saying nobody tried to make Python faster, or that Python devs don't know what they're doing. I know very well both of those things aren't true. But, the kind of sophistication required to make Python fast in absolute terms is very hard to build and maintain, and for various reasons nobody who has found Python to be too slow has found "invest in a bunch of person-years of VM engineer attention dedicated to performance work" to be their best path.


I think you are right, but there are signs of great effort: PyPy, Unladen Swallow, and many other projects, some still around and some not.


Python and JS are basically the same language in terms of semantics.


> Python and JS are basically the same language in terms of semantics.

Python is dynamic class-based OO with multiple inheritance. JS is dynamic prototypal OO, though it has recently added convenience syntax implementing single-inheritance class-based OO on top of it.

They are not the same.


It's still redirecting to another object at runtime in both cases. The details are obviously not exactly identical but there's no great difference in the object model just because one is called a "class" and one is called a "prototype".


Python also has much stronger typing, in that it won't convert the types of things willy-nilly without being instructed to.


Well, as long as both of them are descendants of Modula-2 :)

But no. While both languages share some similarities (dynamically typed, some form of "objects" and "classes"), there are a lot of differences, both at the language level and at the implementation level.


What are some important differences do you think?


Lexical scope and first-class functions in JavaScript, as a descendant of Lisp.


Testing this out on a simple project. As a straw poll measurement, I was seeing 4.5 to 6 seconds on version 3.9 and 4.29 to 4.4 seconds on version 3.11. Definitely a noticeable improvement, for what it's worth.


Anecdotally: wow, just by upgrading from Python 3.8 to 3.11 I measured a speedup from ~90 sec to ~60 sec runtime (in simulation code that mostly manipulates Python built-in numbers and container datatypes).


I would love to see the OP benchmarks with Taichi applied: https://github.com/taichi-dev/taichi


Semantic Versioning with numbers > 9 really breaks my brain sometimes - I read the headline as "Python 3.1.1 is faster than 3.8" - a performance regression - and thought "Oh no!"


I’m still patiently waiting for Python to eventually decide to jump from 3.X directly to X+1


Wow bun is fast! Would be interesting to see how NumPy or CuPy compare.


> Wow bun is fast!

Relative to python? I ran the benchmark in the post with both node.js and bun, and running it with bun consistently took ~60% more time. That was with bun v0.1.10 though. I tried with bun v0.2.1 but it just crashed.


Node (19.0.0) is quicker than bun (0.2.2) which is quicker than Deno (1.26.2) in my tests. (Edit: added Deno)

  mitata 'node sim.ts 10000000' 'bun sim.ts 10000000' 'deno run --allow-read sim_deno.ts 10000000'
  cpu: 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
  runtime: shell (x86_64-unknown-linux-gnu)
  
  benchmark                                       time (avg)             (min … max)
  ----------------------------------------------------------------------------------
  node sim.ts 10000000                        803.35 ms/iter  (773.08 ms … 848.1 ms)
  bun sim.ts 10000000                            1.23 s/iter        (1.19 s … 1.4 s)
  deno run --allow-read sim_deno.ts 10000000     1.37 s/iter       (1.27 s … 1.69 s)
  
  summary
    node sim.ts 10000000
     1.53x faster than bun sim.ts 10000000
     1.71x faster than deno run --allow-read sim_deno.ts 10000000


How much faster? Should I update immediately or at my own convenience? Is this a game-changing speedup for my business?


I'm seeing about a 50% speedup. It depends what you mean by game-changing, but you should certainly check whether 3.11 is a drop-in replacement for whatever you are doing.
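One low-effort way to answer "is it worth it for my workload": run the same timeit snippet under both interpreters and compare (here `app.hot_path` is a hypothetical stand-in for a real function from your codebase):

    import sys
    import timeit

    from app import hot_path  # hypothetical: your actual hot code path

    t = timeit.timeit(hot_path, number=1_000)
    print(f"{sys.version.split()[0]}: {t:.3f}s for 1000 calls")

Run the file with python3.8 and again with python3.11; if the hot path is pure Python, the delta should show up immediately.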


Does anyone actually choose Python with speed as a defining factor? It's one of the slowest popular languages.


most of the machine learning industry uses it almost exclusively as the driving language for GPUs


Yeah, but that's because all the well-documented tooling was built for Python as a first-party language, no? It's not much of a choice.


I wonder what the performance would be if the code had been compiled with GraalVM.


Does anybody have a Cython benchmark for this one?


What about Python 3.10?


Python is in its awkward stage, something like PHP in ~2008. It's not super fast, it's got a good following, but changes to the core APIs cause breakage, it's being used for things that maybe it shouldn't be, and it's facing a lot of competition from other languages.

I had to laugh a few months back when someone suggested we switch a newish project to python from PHP 8.1 because "python is built in C" and "it's better for multithreaded math functions". A lot of misinformation out there.


That anecdote says more about the people you work with than about Python.


The second half of my reply, sure, but the first is just facts. I'm sure Python's got a future; it's just going to have to step up a little.

Also, yeah, that guy wasn't great. As I said, I had to laugh.


Waiting for Python 3.11 for Workgroups


It is going to come on half a dozen or so double density disks.


At this point virtually everyone agrees that dynamic typing is only good for scripting and cross-domain code, not for building robust software applications. Considerable effort has been spent over the years adding typing back to dynamically typed languages (TypeScript; Python has several approaches).

If you want to build real software, you should be looking at Go or Rust, not Python. They are already fast, they already have concurrency, and now with Go's generics it's safe to say that their type systems are exactly what's needed to build real software in 2022.


You can build robust software applications with Python. It just needs different development practices. Using static typing with Python is a bit dumb, IMO.

Basically, you write a ton of 3k-line scripts, have them talk to each other over a message queue like RabbitMQ, and have an SQL server, e.g. PostgreSQL, do all the heavy lifting for you. (See the sketch at the end of this comment.)

You can't build every program like that, but 80% of the programs written in businesses can be done like that. Typical CRUD stuff.

And you can develop it at 3x the speed compared to doing it in a more traditional language like Java/C#.

Python is a powerful language if you use it for the right use cases.
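A minimal sketch of one of those workers, using pika against RabbitMQ (the queue name and handler are made up for illustration; acknowledgements keep a crashed worker from losing jobs):

    import json
    import pika  # pip install pika

    def handle(ch, method, properties, body):
        job = json.loads(body)
        print("processed", job)  # real work (and the SQL) goes here
        ch.basic_ack(delivery_tag=method.delivery_tag)

    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = conn.channel()
    channel.queue_declare(queue="jobs", durable=True)
    channel.basic_consume(queue="jobs", on_message_callback=handle)
    channel.start_consuming()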


How do I run the latest diffusion model in Go or Rust?


I said it was good for cross-domain code.



