Python 3.11 is faster than 3.8 (jott.live)
378 points by brrrrrm on Oct 26, 2022 | 295 comments



Checking with my own “benchmark”, a gameboy emulator in several different languages[0]; it’s CPU-bound but spread across ~3k lines of code, so slightly more representative of real-world apps than a single-function tight-loop microbenchmark:

     zig: Emulated 600 frames in  0.24s (2521fps)
      rs: Emulated 600 frames in  0.37s (1626fps)
     cpp: Emulated 600 frames in  0.40s (1508fps)
     nim: Emulated 600 frames in  0.44s (1367fps)
      go: Emulated 600 frames in  1.75s (342fps)
     php: Emulated 600 frames in 23.74s (25fps)
      py: Emulated 600 frames in 26.16s (23fps)   # PyPy
      py: Emulated 600 frames in 33.10s (18fps)   # 3.11
      py: Emulated 600 frames in 61.43s (9fps)    # 3.10
Doubling the speed is pretty nice :D Still the slowest out of all implementations though :P

[0] https://github.com/shish/rosettaboy

EDIT> updated the nim compiler flags to build in release mode like most other languages, thanks @plainOldText!


Just a quick glance at your repo, and I'm noticing you're running Zig `zig build -fstage1 -Drelease-fast=true` and Rust `cargo run --release` with the release flags on. You should do the same for Nim `nimble build -d:release --opt:speed`; Go too.


Updated nim’s compiler flags; it seems to be ~4x faster now :D

    nim: Emulated 600 frames in  0.44s (1367fps)
Do you happen to know the right flags for release-mode Go? Last time I checked (admittedly years ago) I thought they just had the one “reasonably fast and reasonably debuggable” build mode


With Nim `nimble build -d:danger -d:lto --passC:-march=native` I just got 1920 frames/s while with your rust build only 1237 fps on the same machine (EDIT: and 1307 frames/s with C++.)


In that case, did you also run rust with the equivalent RUSTFLAGS="-C target-cpu=native"? :)

Perhaps benchmarks should be compared on equal footing, say, with the default release flag or all optimizations turned on, otherwise they're improper.


I tried. That actually made the rust slower for me (i7-6700k, gcc-12.2, rustc-1.64). 1180 frames/s. And without the -march=native the Nim was at 1620. And it also did not help the C++ branch (but helped Nim about 1.2x).

But really the original author/poster should run a set like this on his box. I cannot even compile/run all his things. The point of my comment was just to give a reference for how far off impressions can be based on build flags. PGO (available to Nim, C++, but maybe not to Rust yet?) is a whole other set of maybe nothing burgers or maybe big improvements.

(But, btw, I could not agree more that all experiments in this entire general space should have various big, bold disclaimers. Over-concluding from these things is rampant.)


Yes, tweaking the compiler flags can alter the performance substantially. I'm glad to see Nim so fast though.


And with the author's hot-off-the-presses nim flags I get only 1464 fps. So, 1920/1464 = 1.31 for my nim compile flags vs. his new ones, only a little less than the 2521/1626 = 1.55 that people were finding interesting.

For something super jumpy like a simulator, I would find it unsurprising for PGO to make up (or surpass) the difference to Zig in both Nim and C++. 20 years ago there was this ACOVEA [1] project to try to discover great sets of gcc flags that could often find 2X improvements in object code speed for me.

The range from build flags/procedures is often much greater than the supposedly interesting cross-language variation. These things often more measure developer experience/persistence than something intrinsic (and build flags/procedures are only part of that experience/persistence).

[1] https://github.com/Acovea/libacovea


I do not, sorry. I'm sure some Go developer can chime in.


There is no release flag for Go.


To elaborate, Go is always in release mode because release compiles are about as fast as other languages' debug modes.


I'd phrase that as "release mode by default", not "always in release mode".

There are various debug things you can turn on, such as the race detector, the memory sanitizer, or coverage tracking.
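
For example (standard toolchain flags; `./...` is just an illustrative package pattern):

    go build -race ./...   # data race detector
    go build -msan ./...   # memory sanitizer (requires clang)
    go test -cover ./...   # coverage tracking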


Is there such a thing for go?


It's the default for Go; with Go you have to explicitly turn on the debug features that you want.


It's amazing that Zig is faster than both Rust and C++. Kudos to the Zig team & Andrew Kelley!

I wonder what optimizations Zig does that lets it generate machine code / LLVM bitcode faster than both C++ and Rust (which certainly have larger teams backing them), at least in the case of this Gameboy emulator project.


Yeah, I’m really not sure how it managed that. I am slightly suspicious, because while I was working on the zig version I spent more time running into compiler bugs[1] than writing my code, and so there’s a chance that it’s running so fast by throwing away random chunks of important behaviour… but it still passes all of the test suite, so as far as I can tell this specific build is working correctly.

[1] half the time I’d make the compiler crash; the other half it would generate a binary which crashes at runtime, with weird heisenbug behaviour like “adding a print statement to log how far down a function I am causes the code to stop crashing at all” — like right now there is a load-bearing print statement which shows the address of the SDL Window object, because otherwise the compiler seems to optimise the Window out of existence and then a few lines later it segfaults on the null pointer…


> a load-bearing print statement

Is the most glorious thing I ever heard.


> like right now there is a load-bearing print statement which shows the address of the SDL Window object, because otherwise the compiler seems to optimise the Window out of existence

that sounds like you have undefined behavior in your program


Does zig have computed goto? That's important for emulators and virtual machines, and standard C/C++ don't have it. Projects will often go out of their way to have a MinGW/gcc-built module for core loops with that (gcc has it as an extension) even if the main project is built with MSVC.


If Rust wasn't able to elide bounds checks, that is most likely where the perf difference is from. Analyzing the assembly output from Godbolt or looking at the MIR can help.

https://stefan-marr.de/2022/10/cost-of-safety-in-java/

Unsafety buys you a little more performance, but the baseline should be the safe version, not the unsafe version. It is like having to explain on a case-by-case basis why you aren't using lead pipes for this application.

Always default to safety. The difference between safe and unsafe native code is usually single digit percentage points. Or weeks on a Moore scale.


This should also be the case for nim. Generated C always looks very simple with -d:danger (when all memory access checks disappear).


If I do nimble build -d:lto -d:danger --passC:-march=native I get 1920 fps, while if I do nimble build -d:lto -d:release --passC:-march=native I still get 1775 fps. So, at least for Nim, the checks are only a 1.08x slowdown. Not so bad compared to the 2521 vs 1626 = 1.55x Zig-Rust gap on the author's machine.

Heck, I see 1.33x differences in run times between the first & second run of the rust branch of his benchmark, but only 1.05x diffs in run times between 1st & 2nd Nim branch.

In my experience, reasoning about things like this is rarely as simple as "bounds checks" which are (often) the most highly predictable branches.


True. Recently I noticed that std/sha1 takes 8x the time on M1 release vs danger. Yet on amd64 it's only a <10% difference.


I think you replied to the wrong comment, mine was about zig vs C++.


I am talking about the whole stack of benchmarks and why there is a spread in perf. I am replying to you and the top level comment, if only the convos were a graph and not a tree.


Here's his latest numbers after optimizations. From the repo.

     zig: Emulated 600 frames in  0.23s (2880fps)
      rs: Emulated 600 frames in  0.35s (1691fps)
     cpp: Emulated 600 frames in  0.40s (1519fps)
     nim: Emulated 600 frames in  0.44s (1367fps)
      go: Emulated 600 frames in  1.78s (338fps)
     php: Emulated 600 frames in  9.28s (65fps)
      py: Emulated 600 frames in 33.10s (18fps)
That's after Python cheated with a native implementation of bitblt while all other implementations are doing it pixel by pixel, which is more correct.


Tables like this routinely lead to over-conclusion. Merely adding -d:lto (which changes no semantics) to the nimble build line boosts fps by 1.27x for me on nim-devel gcc-12 Skylake, making it 1.4x faster than the Rust on the same machine.

Applying BS scaling to the above table, Nim would score 1736, only slightly faster than the Rust. Who knows what little flag tweaks in cpp/rs/zig could similarly re-arrange the numbers?

Now, why is -d:lto (or -d:release) not the default build mode? Well, because it takes time and (some) people already complain about compile times and workflow (sometimes). Trade offs abound almost as much as over-conclusion. ;-)


Very cool idea for a benchmark, thanks for the numbers. Pretty readable code as well, nice work. Is your benchmark running headless? I feel like that could be a source of noise but idk.


Yeah, all these benchmarks are measured in headless mode - all the calculations of what should be on-screen are done, and pixels are written into a buffer ready to be displayed, but the Window is never opened and the buffer isn’t blitted.


I am shocked that PHP of all things has faster speed than python.


I am not.

PHP is traditionally used solely for websites. Some of those have grown rather large, to the point that having engineers optimize the language is cheaper than buying more servers.

Python, on the other hand, is first and foremost a scripting language. When performance does matter, you often end up using a wrapper around a C library, like NumPy. This means there is relatively little money in optimizing the Python interpreter.


There are websites that are large-scale software. Google uses Python for a lot of its products.

It would seem sensible that since Facebook poured a lot of resources in optimizing PHP, Google would have done the same for Python.

Also, Python is the first or second most popular language.


> It would seem sensible that since Facebook poured a lot of resources in optimizing PHP

Well, they sort of half-assedly tried years ago - they employed GvR at one point. Then they gave up, and they continue to lean heavily on C++, Java, and the thing they developed in the interim: Go.

> Also, Python is the first or second most popular language

Hogwash. For all the open source and other development that occurs and is “indexed” on internet discussion forums there is countless boring ass shit behind the scenes in sweatshops around the world and corporate back rooms. PHP, Java, and C# are still probably more popular to start.


> For all the open source and other development that occurs and is “indexed” on internet discussion forums there is countless boring ass shit behind the scenes in sweatshops around the world and corporate back rooms.

Much of which is also in Python.


I mean, sure, but I’m just saying point me to a source that is ranking Python as #1 or #2 most popular. If it’s TIOBE or something similar it’s meaningless.

I say this as someone that has made their career mostly on the “top 5” TIOBE languages - but that ranking is a bunch of crap for the overall commercial software world.

C? Give me a fucking break. Never in 20 years have I known there to be more C than Java jobs.

And now everyone and their dog including doctors and other non tech-savvy professionals are doing Intro to Python courses (and then never touching it again) so they too can feel like they know something about ML or stats. Lot of noise.


>I mean, sure, but I’m just saying point me to a source that is ranking Python as #1 or #2 most popular. If it’s TIOBE or something similar it’s meaningless.

According to the 2022 Stack Overflow developer survey, if we exclude HTML, SQL and Bash, the top five languages (as in being used by most developers) are: JavaScript, Python, TypeScript, Java and C#.


https://madnight.github.io/githut/#/pushes/

Checking here, if we go by pushes on GitHub, JS, Python, Ruby, Java, PHP are the top for 2021.

This is of course biased to only stuff that's on GitHub and public (assuming, but I doubt they're publishing private data on BigQuery).

No idea how much of that is pet projects vs production of course.


JS is also very popular. Not for the same things as Python but in some aspects it's becoming modern PHP (which is still popular, if maybe aging somewhat kinda like Perl). But Python is definitely very popular for new back room stuff (source: long digital sweatshop career)


Corporate has a faceless horde of Java devs.


Lots of C#, too. And I specialized in C# since it seemed a little bit more enjoyable than Java at the time and the jobs were plenty. Of course, Java changed quite a bit since then so it might be more desirable than it was.


> Python is the first or second most popular language.

By what metrics? The RedMonk quarterly from June shows Python at #2, and PHP at #4.

https://redmonk.com/sogrady/2022/10/20/language-rankings-6-2...


Most metrics: TIOBE, RedMonk, PYPL, the Stack Overflow survey. The top 10 and even top 5 are generally the same; the order might differ a bit.

I've looked at various sources over the last 5 years and not much changed. I hoped some languages like Nim, Crystal or Zig would pick up a bit of steam, but no.

Go and Rust moved up a bit and now seem safer bets to invest some time into.


> > Python is the first or second most popular language.

> By what metrics?

Hmm, why don’t you answer your own question?

> RedMonk quarterly from June shows Python at #2,

That clearly is one that puts it “first or second”, yes.


Yes, it was partially an answer, and it's not saying Python isn't high up.

But still wanted to know by what metrics the GP was making their claim.


Instagram is a Django website (Django is a Python web framework).


It can't be explained by lack of effort. There have been several serious attempts to make Python run faster, including one by some Google engineers. Many of them failed, and the ones that succeeded to some extent aren't mainstream. It's hard.

For Python, the C integration probably makes things harder.


Interesting how a language written solely for websites, beats others in their own game which isn't about websites.


PHP is a much simpler language which helps a lot with optimizations. For instance, consider something like property access.

In PHP, doing `$a->b` accesses field `b` of object `$a`. The implementation is essentially type check (making sure `$a` is an object) and hash table lookup. If this fails, then it calls `__get` method if it exists.

In Python on the other hand, `a.b` involves `__getattr__`, `__getattribute__`, `__mro__`, `__get__`, `__set__`, `__delete__` and `__dict__`. Here is a description of how the attribute access works: https://docs.python.org/3/howto/descriptor.html#overview-of-....
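
A minimal sketch of one of those extra steps (data descriptors found on the type's MRO are consulted before the instance `__dict__`; the class names are made up for illustration):

    class Meters:                       # a data descriptor: defines __get__ and __set__
        def __get__(self, obj, objtype=None):
            return "from the descriptor"
        def __set__(self, obj, value):
            pass

    class Point:
        x = Meters()

    p = Point()
    p.__dict__["x"] = "from the instance dict"
    print(p.x)                          # "from the descriptor" - the type wins

Every `a.b` in Python has to check for all of this machinery; PHP's `$a->b` doesn't.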


PHP typically beats Python on CPU bound benchmarks: https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

(Measures Python 3.10.4 against PHP 8.1.5 - so expecting these to change a bit).


Facebook dumped a ton of effort into making PHP fast over the last couple of decades. No one has been able to make Python especially fast without breaking compatibility with important libraries (Python exposed virtually the entire interpreter as its extension surface, and since Python is so slow, extensions are a major part of the ecosystem as they're the main way to recoup performance, which in turn means that changing the extension interface to make things faster would break a bunch of the ecosystem, and the Python maintainers are pretty scarred after the 2->3 breaking changes). Pypy comes close, and it has been grinding away to get compatibility, but last I checked you still couldn't so much as talk to a Postgres database through a reputable package.


The last couple of decades? Facebook is not even two decades old. And they turned it into their own language.


Yeah, it was a crude estimate. Call it 10-15 years if you want.


It's not as surprising if you look at the primary use case of PHP. It takes a request, does a bunch of stuff, and returns the result to the web server. And it needs to do all of it before the user who clicked on the link gets bored and goes somewhere else. (Or the API client calling it times out...)


How is that relevant? This is not about it being faster for "its primary use case" (optimized for that, etc).

It's generally faster than Python in doing the same things as Python does, unrelated to web too.


The point is that PHP's primary, original use case is latency-sensitive in a way that Python's isn't. So it's always been subject to more performance pressure.


I think they did a lot of work similar to what Python 3.11 did, and more, when releasing PHP 7. I'm absolutely not in the loop but remember postings here on HN a few years ago.

Meta/FB is also pretty invested in Hack (was once a php dialect, again, out of the loop), maybe they contributed a thing or two?


PHP 5 was kind of slow, to the point that it threatened the future of Facebook (now Meta). They eventually transformed it to C++ for a speed improvement (HipHop), then went on to build a VM with JIT compilation. You could then deploy your app by transferring a sqlite file containing the compiled byte code.

At Wikimedia we adopted it, which cut our CPU usage by half and saved a few hundred servers. We had some Facebook engineers helping, which involved patching the Linux kernel while at it. Those were good times.

Eventually PHP 7 followed up with a similar approach and had more or less the same performance as HHVM. Facebook went then to focus on the Hack language (a dialect of PHP with strong typing) and eventually phased out back compat with Zend.

From what I remember, Sara Golemon at Facebook did a lot of outreach to open source projects and gave us a lot of assistance (as did others at Facebook).


> PHP 5 was kind of slow

I remember it being slower than PHP4 for a while


php7 was such a huge leap they skipped php6 (although there were other reasons for that).


> Version 6 is generally associated with failure in the world of dynamic languages. PHP 6 was a failure; Perl 6 was a failure. It's actually associated with failure also outside the dynamic language world - MySQL 6 also existed but never released.

TIL MySQL 6

https://wiki.php.net/rfc/php6


Whereas Python doesn't even plan for 4 because 3 hurt so bad


Meanwhile Java(ECMA)Script bucking the trend by getting its failure out of the way at 4.


PHP has always been several times faster than Python. Even more so with the speed updates post 7.


100% not shocked at all.

I recall a couple of times over the past 15-20 years where a current python significantly outperformed a current php version, but my recollection is that php was usually a bit faster, or sometimes a lot faster.

PHP 7 was released in December 2015, and gained significant speed bumps and better memory usage, with the average php execution time being cut in half, or more, generally without any code change whatsoever. It was quite remarkable.

The path from 7.x-8.1 so far has generally seen incremental speed bumps again - usually somewhere between 3-8% improvements per release. Obviously this is going to be dependent on use cases, but overall it's been a fairly steady set of speed improvements over the last 7 years.


Clearly we need a Python to PHP transpiler (trigger word) so we can be webscale (trigger word).


PHP has always been among the fastest of the untyped scripting languages, even from the web 1.0 days.


That doesn't relate to Python, though.


PHP itself is pretty fast. It's the things that people do in PHP that get slow. Most of the Yahoo frontends were rebuilt in PHP in 200x because it was fast enough and much more usable than the thing they used before (trigger warning: hf2k)

Of course, people then go and build up sculptures of objects that will be thrown away after every request, and that stuff makes everything slow (that style of code is why PHP wasn't good enough for Facebook, IMHO), but you can build trash sculpture in any language.


Back around 2003 I was hosting an internal web app for a fortune 50 company in python. It could handle 2 users at a time. I rewrote it in PHP. Scaled to hundreds with trivial work. Probably could have handled far more.


The key here is that you rewrote the whole thing. The old and new languages aren't as relevant.


PHP 5 and onwards have had a lot of resources put into making it fast. The surprise for me is how close Python is getting to closing that gap.


Looking at Rust vs. Zig, you have the FLAGS register in a bitfield for Zig but it's separate bools for Rust. This is probably making your Rust code slower because the CPU can't set multiple flags at once.

I'm also wondering if all your #[inline(always)] is slowing things down.


I ran the Rust and Zig implementations through a profiler and (at least on my machine, using Zig nightly 0.10.0-dev.4583+875e98a57) the vast majority of the time difference is from the number of calls to the GPU paint_tile_line function- there's some behavioral difference upstream in the emulator and the GPU modules are just not doing the same work.

For the benchmark's 600 frames, the Rust emulator calls it 2500049 times, but the Zig emulator only calls it 1808121 times. This is very roughly the ratio between the reported times.

At least one source of this discrepancy is that the Zig emulator doesn't think any sprites are active, but that doesn't account for all of it, and I'm hitting some crashes trying to run it with the display, so I'm probably not going to investigate it further.

(I also hit the same sort of strange Zig compiler unreliability as some other comments have mentioned, where I had to add some logging in seemingly random places to get the Zig emulator to run at all.

And the argument parser just blatantly returns a dead pointer to the ROM path; when I first ran it, it just gave "error: InvalidUtf8" until I tracked that down.)


Fascinating, thanks a bunch for looking into this. Benchmarking is hard!


Yeah, I really like zig’s approach to bitfields, everything Just Works with no faffing about with bit-shifting and OR/AND'ing. I forget the exact reason I used separate bools for rust, but I remember spending a day trying to do something zig-like and failing…

IIRC each of the #[inline] statements was tested and each made a noticeable performance improvement. That’s especially true for things like RAM::get() - since the gameboy does I/O by having different chunks of RAM act differently (some address ranges are just RAM, some are hardware controls, some read data from the cartridge, etc) you can replace the hundred-line generic get() with a single instruction if you happen to know that you are looking up one hard-coded address, and that address has no special behaviour.


Very cool benchmark. I don't think this will be a huge difference but could you try mypyc? I'm definitely curious how it compares. It uses mypy to perform type checks and claims "Existing code with type annotations is often 1.5x to 5x faster when compiled. Code tuned for mypyc can be 5x to 10x faster."[1]

[1] https://mypyc.readthedocs.io/en/latest/introduction.html
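
For reference, a rough sketch of what that looks like (file and function names made up; mypyc ships with mypy):

    # fib.py - ordinary type-annotated Python; mypyc compiles it to a C extension
    def fib(n: int) -> int:
        if n <= 1:
            return n
        return fib(n - 1) + fib(n - 2)

Then `pip install mypy` and `mypyc fib.py`; a plain `import fib` afterwards picks up the compiled extension.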


This might speed up the py impl a few %

  --- a/py/src/cpu.py
  +++ b/py/src/cpu.py
  @@ -260,7 +260,6 @@ class CPU:
               param = None
               cmd_str = cmd.name
               self.PC += 1
  -        self._debug_str = f"[{self.PC:04X}({ins:02X})]: {cmd_str}"


Your comment in PHP src here https://github.com/shish/rosettaboy/blob/master/php/run.sh

says

# opcache in 8.1 gives a nice speedup (25s to 10s)

Is the 23.7 seconds above using the 8.1 opcache?


I think a bigger thing to point out is that it's within shouting distance of PyPy, and there's plenty more work that can be done to make it faster. Next up is a mini-JIT, which should help in places with tight loops.


I find it /really/ funny that PHP is faster than Python at this. As a PHP code hacker, I love that PHP keeps coming up faster than Python in a ton of tasks


That's not much of a flex, Python is deliberately slow--the reasoning was that the implementation should be simple and they'd just expose the entire interpreter as the extension API, and then people would just write extensions in C when they needed the speed. Since vanilla Python is so slow, the entire ecosystem is highly dependent on C extensions for passable performance, and since the C extension API is virtually the whole interpreter, changes that would make the interpreter fast would typically break much of the ecosystem.

Unfortunately, "just write the performance-sensitive bits in C" is pretty impractical because it only works when you're handing the C routine a relatively small amount of data relative to the amount of work to be done on that data (otherwise the costs of marshaling to C data structures will quickly eat up any gains from processing in C). And unfortunately, it turns out that a whole bunch of code works this way, so the whole bargain of "slow interpreter + easy C extensions" breaks down for a lot of real-world applications, but now we're locked into it.


CPython can be fast once you eliminate the main bottleneck: the interpreter itself.

Loops add a lot of overhead because the interpreter has to make its dispatch jumps on every iteration. Figuring out a way to minimize that overhead by using built-in data structures and the stdlib will speed up your code by an order of magnitude.

Don't forget, the built-in types are already running in C.
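
A minimal sketch of the difference (exact timings are machine-dependent):

    import timeit

    data = list(range(1_000_000))

    def manual_sum(xs):
        total = 0
        for x in xs:          # one interpreter dispatch round-trip per element
            total += x
        return total

    print(timeit.timeit(lambda: manual_sum(data), number=10))  # pure-Python loop
    print(timeit.timeit(lambda: sum(data), number=10))         # loop runs in C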


I mean, yes, but that's probably not viable in many real-world applications. Usually you have a pretty complex data model (some graph-like structure) with Python methods that traverse it. You can't easily push that into native Python structures (at least not with any significant performance gain), and while you can push the whole thing into Rust or some faster language, at that point all of the interesting stuff is happening in Rust so why use Python at all (why pay the significant costs of a hybrid application)?


At least writing Python extensions is relatively easy (I understand; I've not done it), unlike PHP. PHP leans heavily on poorly documented C macros and the internals are tricky to grasp. At least it now has a minimal FFI module.


Honestly I think it’s much better to optimize for the native use case rather than FFI.

FFI can pervade an ecosystem making changes (including performance optimizations) to the host language more difficult. It also tends to complicate build and deployment stories. For example, from my Mac I can trivially build a native binary that will Just Work on any Linux system, even one without a libc. Contrast that with Python where I can’t even install popular packages onto most non-Ubuntu Linux distros. And there’s a lot of other things like that which crop up with pervasive FFI.

I’m convinced that the FFI sweet spot is “difficult but possible” and native performance should be good enough 99% of the time.


I'd be super curious to run a profiler on those Python builds and see where it's spending most of its time.


To make the Rust release build run faster you should enable LTO (Link Time Optimization). In Cargo.toml put:

    [profile.release]
    lto = true


Interesting. On the trifecta of money vs pain vs speed, Go seems to be a reasonable compromise.


FWIW I personally found Rust the least-painful language, but that may well be confirming my pre-established biases :)

- With Zig I kept running into compiler bugs, plus no package manager (I’ve vendored SDL and Clap into the source tree)

- C++ I’d occasionally shoot myself in the foot in ways that other languages would have caught, plus no package manager (OS-level package management does an OK job, so long as you don’t mind using old versions, and faffing about with different operating systems acting very differently)

- The pain from Rust was one time where the compiler wanted me to specify a lifetime, and I didn’t understand, so I just spammed lifetime specifiers in various places until it compiled. I’ve been using Rust for a couple of years now and I still don’t really understand lifetimes, but thankfully 99% of the time I can avoid them.

- Nim was a relatively nice language but massively lacking in available libraries (like even parsing command line arguments took me a day just trying to find a library which worked)

- Go is pretty nice, my main pain is the tolerable but constantly-annoying verboseness of error handling (`err := foo(); if err != nil {return err}` compared to rust’s `foo()?`)

- PHP I just hate on a deep and personal level thanks to years of being a PHP4/5 developer. The language is actually mostly-ok-ish these days, but the standard library is still full of frustration like inconsistent parameter orders within a family of functions.

- Python is all-round really nice to write, but the test suite takes like 20 minutes to run, which really messes with my flow-state


"Rust the least-painful language"

" I’ve been using Rust for a couple of years now and I still don’t really understand lifetimes"

Seems like a major pain point.


It would be if I ran into it regularly -- but after using the language for a variety of professional and personal projects for a couple of years, this is the only time I’ve actually needed to manually-specify lifetimes since the compiler normally figures it out for me :)


As long as you don't get too crazy with references in structures or async code, lifetimes are not going to chase you.


Generics too right?


I know people say that about everything, but Rust generics are very readable and do make sense after you understand what problem they are solving.

They do look intimidating to start with, admittedly, and I'll concede that's a negative point for Rust. But it does get better if you practice for a bit.


No I just meant using generics seems to require explicit lifetimes decently often. I don't understand why.


Not sure that's the case btw. I have noticed it with some libraries but I've used and created a fair amount of generics without having to annotate stuff with lifetimes.

Lifetimes are necessary when you want to explicitly say "variable X will live just as long as variable Y", or sometimes it's more complex (i.e. you have to specify 2 or more separate lifetimes and then return something that pertains to only one of them) but it's still fairly predictable if you keep it all in your head while coding.

Don't get me wrong I still hate it but it's not as terrible as many people make it out to be. It's hard to get into but also very logical and graspable.


The reason it's not a pain would be explained by the rest of the sentence which you omitted: "but thankfully 99% of the time I can avoid them"


They are saying rust is painful. Just that they find the other languages even more painful.


Except that they don't really say rust was painful....just that there was one specific moment / aspect that they found tricky.


This is actually a really nice write up for people deciding which language to learn/use if they aren't constrained.


Re CLIs in Nim: most find https://github.com/c-blake/cligen easy to use.


Nim does seem nice. I worry that it will end up like D though... An interesting/cool/neat language with seemingly relatively little adoption. I'm keeping my eye on it though.


C++ has two relatively good package managers, conan and vcpkg.


It's probably "slow" because of SDL and cgo, not native Go code. A GB emulator doesn't do much, actually; it's all fixed arrays, bit shifting, switch cases, etc.

I ran a quick pprof and indeed it's spending a lot of time in cgo:

  Showing nodes accounting for 28720ms, 67.67% of 42440ms total
  Dropped 145 nodes (cum <= 212.20ms)
  Showing top 10 nodes out of 53
      flat  flat%   sum%        cum   cum%
   13080ms 30.82% 30.82%    16600ms 39.11%  runtime.cgocall
    4720ms 11.12% 41.94%     4750ms 11.19%  main.(*RAM).get
    2840ms  6.69% 48.63%    33070ms 77.92%  main.(*GPU).tick
    1970ms  4.64% 53.28%     3720ms  8.77%  runtime.mallocgc
    1450ms  3.42% 56.69%     1470ms  3.46%  main.(*RAM).set
    1160ms  2.73% 59.43%    41350ms 97.43%  main.(*GameBoy).tick
    1000ms  2.36% 61.78%     3160ms  7.45%  runtime.exitsyscall
     890ms  2.10% 63.88%     1610ms  3.79%  main.(*CPU).tick_interrupts
     820ms  1.93% 65.81%      850ms  2.00%  runtime.casgstatus
     790ms  1.86% 67.67%     5530ms 13.03%  main.(*CPU).tick


> It's probably "slow" because of sdl and cgo not native Go code

The other languages are also using SDL via their respective interacting-with-C interfaces - what makes Go special here?



The author stated that the benchmarks run in headless mode, so I am not sure that it's SDL that's slowing it down here.

Even if it is SDL slowing it down, Go FFI being slow is still a real disadvantage compared to the other languages, and you can't just pretend like it doesn't exist in this case.


Yup, a 240x performance difference (zig vs. py) has very little to do with the language, vm, or whatever. As soon as I saw that, I dismissed the benchmark.

By these standards a 10 year old cpu with a beefy GPU will beat any new cpu as well.


>Yup.. a 240x performance difference (zig-py) has very little to do with the language, vm, or whatever.

It absolutely does. A simple for loop in the standard Python interpreter will literally take 100x longer than the same thing in a language like C/C++, try for yourself if you don't believe me. CPython is unbelievably slow.


Very interesting... Simple loop (0 -> 320000000) and add 1 to a variable.

I couldn't reproduce 100x (no optimization flags, otherwise it won't do anything)

  Apple clang version 14.0.0 (clang-1400.0.29.102) -> 0m0.347s
  ruby 3.1.2p20 -> 0m11.314s
  Python 3.8.12 -> 0m19.662s
So ruby is about 1.7x faster than python, but the C version is "only" 32x faster than ruby, and 55x faster than python.

I thought the differences would be smaller these days
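
The Python side of that is essentially this sketch (run under `time`):

    # loop.py - sketch of the microbenchmark described above
    n = 0
    for i in range(320_000_000):
        n += 1
    print(n)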


What does a GPU have to do with any of this? I think you may be a little confused about something but I am not sure what.


It depends, really. In this case, a game emulator, you can only get away with it because it's for an ancient game console. Otherwise you definitely cannot afford a 5x slowdown compared to C++ for a game.


Maximize pain for mediocre speed and money?


Also C# is quite reasonable.


Can you try PyPy as well?


Apparently not :( (These benchmarks were done on an M1 MacBook Pro)

    $ brew install pypy3
    pypy3: The x86_64 architecture is required for this software.


$ brew install pyenv

$ pyenv install pypy3.9-7.3.9

I like using PyEnv for managing my Python versions. It will natively compile Python builds and should be doing the same on M1 (which is what I'm using). `pyenv install --list` shows you what is available.

EDIT: Not sure why they don't have newer versions of PyPy there (I don't use PyPy) but all it takes is a PR to here: https://github.com/pyenv/pyenv


Thanks! Added to the list, it’s suspiciously only a little bit faster than CPython 3.11 though, which probably needs more investigation…

EDIT> Looks like CPython prefers using a dict as a lookup table for opcodes (which is what this implementation does), while PyPy prefers having a long series of if-statements. Hmm.


That makes sense, because PyPy can JIT a ladder of if/else into a compare and jump each, whereas a dict lookup can be an order of magnitude more complex. If they're 8-bit opcodes, maybe having a list-based lookup table will perform similarly on Python 3 and still optimize on PyPy?

Or maybe https://docs.python.org/3/library/array.html rather than list.
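
A minimal sketch of the two dispatch styles (the handlers are placeholders):

    def op_nop(cpu): pass
    def op_inc_a(cpu): pass

    # dict-based: hash the opcode, then probe for it
    DISPATCH_DICT = {0x00: op_nop, 0x3C: op_inc_a}

    # list-based: a dense 256-entry table indexed directly by the 8-bit opcode
    DISPATCH_LIST = [op_nop] * 256
    DISPATCH_LIST[0x3C] = op_inc_a

    opcode = 0x3C
    DISPATCH_DICT[opcode](None)   # hash + bucket lookup
    DISPATCH_LIST[opcode](None)   # plain indexed load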


Can you do Node 19.0 and Ruby 3.1.x too please?


TypeScript and Java are the two remaining languages that I feel I know well enough to translate 3k lines of code into in a weekend each -- I have approximately zero knowledge of Ruby though, and no motivation to learn it ^^; Pull requests are welcome!


OK, I'll watch your repo and track the updates, and when you release the JS/TS port I'll give it a try and port it to Ruby and send it your way.

Best wishes


do you use numpy for the numerical python stuff


what about java graalvm ?


I wanted to see what the results for pypy would be.

On my machine (very similar, a macbook pro m1 max):

    python 3.10: 52s
    python 3.11: 35s
    pypy 3.9.12: 10s
(This test is basically a perfect test for JITs: one loop repeated many times)

https://gist.github.com/llimllib/7af8144a92d3c2e1fc58be62988...


Out of curiosity I ran the same test on a linux laptop with the Ryzen 7 PRO 6850U CPU.

    python 3.10: 60s
    python 3.11: 46s
    pypy 3.9.12:  6s
Looks like pypy performs comparatively better on x86_64


makes sense I guess, it's had a lot more development time I'm sure. Thanks!


I think PyPy runs under Rosetta on M1, so the overhead is probably from that.


I don't think so; they released ARM support a while ago: https://doc.pypy.org/en/latest/release-2.1.0.html

(I could be wrong)

edit: oh, huh I think you're right that it's running here on rosetta:

    $ file installs/python/pypy3.9-7.3.9/bin/pypy
    installs/python/pypy3.9-7.3.9/bin/pypy: Mach-O 64-bit executable x86_64
Wonder if there's a way to run a native version?

edit 2: there are nightlies here: https://buildbot.pypy.org/nightly/py3.9/

Running the latest, a native binary gives more than 2x speedup:

    # first you have to allow all the unsigned binaries to run
    $ xattr -dr com.apple.quarantine pypy-c-jit-106295-5dd3b18303e2-macos_arm64/bin/*
    # then we get 3.5s:
    $ time pypy-c-jit-106295-5dd3b18303e2-macos_arm64/bin/pypy nbody.py 10000000
    -0.169075164
    -0.169077842

    real 0m3.522s
    user 0m3.468s
    sys 0m0.045s
pypy continues to impress!


Why is there such a dramatic performance gap / slowdown for 3.10 + 3.11 compared to 3.9.12?


That's PyPy 3.9.12, which has a JIT. The other two are standard CPython, which doesn't JIT the code.


Does pypy support numpy now?


yes

    $ pip install numpy
    <snip building wheel>
    $ python --version && python -c "import numpy; print(numpy.identity(5))"
    Python 3.9.12 (05fbe3aa5b0845e6c37239768aa455451aa5faba, Mar 29 2022, 09:54:47)
    [PyPy 7.3.9 with GCC Apple LLVM 13.0.0 (clang-1300.0.29.30)]
    [[1. 0. 0. 0. 0.]
     [0. 1. 0. 0. 0.]
     [0. 0. 1. 0. 0.]
     [0. 0. 0. 1. 0.]
     [0. 0. 0. 0. 1.]]


Oops, thanks for clarifying.


That's 3.9.12 of PyPy, not CPython


Not OP, but are you taking into consideration that it is not Python 3.9, but pypy, which is distinct


> On my machine (very similar, a macbook pro m1 max):

Should make essentially no difference then, since I rather doubt the Python implementation of n-body can leverage the GPU, or strains the RAM so much that the 200GB/s of the Pro (IIRC) would be an issue.


It looks very close for 3.11. The measurement from the parent comment includes two other python implementations (not covered in the article).


“Besides strict typing and having a generally ugly syntax, C++ also requires ahead of time compilation.”

What’s wrong with strict typing and ahead of time compilation as long as it compiles fast? Doesn’t this prevent many runtime errors that can occur in Python?

Python is a higher level language than C++ so it requires less effort but newer compiled/typed languages offer more of what Python is good at


> What’s wrong with strict typing and ahead of time compilation as long as it compiles fast? Doesn’t this prevent many runtime errors that can occur in Python?

Nothing, as long as the compilation is fast. C++ compilation is not fast. In the large C++ projects I've worked on over the last few years, compilation is bordering on 2 hours for a full build, and 10-60s for incremental builds. At one point, incremental changes were taking 15 minutes to link (resolved by [0]). Go is a great example of fast compilation and strict typing IMO.

[0] https://devblogs.microsoft.com/cppblog/improved-linker-funda...


Go was designed, in part, as a response to C++'s slow compile speeds. The Go compiler has gotten a bit slower over the years, but it's still pretty fast, especially when compared to C++ or Rust.

> In 2007, build engineers at Google instrumented the compilation of a major Google binary. The file contained about two thousand files that, if simply concatenated together, totaled 4.2 megabytes. By the time the #includes had been expanded, over 8 gigabytes were being delivered to the input of the compiler, a blow-up of 2000 bytes for every C++ source byte.

https://go.dev/talks/2012/splash.article#TOC_5.

I also hate how just adding some definition to a .h file that's only referenced in one .c (or .cpp) file will recompile loads just because the header file changed. Maybe there's ways to improve on that (ccache?), but I mostly write C to contribute to open source projects (rather than my own) and it can be an annoying wait.


You can limit the impact of this with forward declarations, and by only putting what is needed for the public API into the headers. The rest can all live in the .cpp


C++ (especially template-heavy with lots of headers and so on) is certainly not fast to compile. But 2 hours, that's crazy. What sort of codebase, and on what kind of machine?

A 128 cores threadripper is a great workstation to compile C++ code fast :D


For reference Unreal Engine 5 full build (building a lot of stuff you don't need) takes like 30 minutes on a 3950x. Typical incremental build is either 15 seconds, or a minute if I touch some popular header. For full build not only you can have a lot more cores, you can use IncrediBuild.


In my experience, a large Unreal Game can take as long as the engine on top of it. My current project is about 2 minutes of project code plus 15 minutes of engine code but my last project was more project code than engine code.


Unreal engine games, on a 3990x Threadripper with 128GB ram and NVMe SSD drives.


Chromium is known for taking hours.


That is true. But Chromium is not in the realm in which the compiled vs. interpreted debate is relevant. It is slow to compile, but it would also be very slow to run if it was written in Python.

(Unless a hypothetical Node.js rewrite would be comparable on speed. I wonder how it would do on memory.)


More like an hour or less on reasonable, consumer hardware[1]. A lot less on workstation/server CPUs, or with IncrediBuild.

[1] https://youtu.be/nRaJXZMOMPU?t=770


I am on Gentoo so I have some build time stats!

My system has a 3900x, 32GB ram, and a fast NVME SSD for reference.

The last chromium build took 1h:45m to complete. This build had LTO enabled and -j18 passed to ninja.

Without LTO it would probably be closer to an hour I think but I don't seem to have a non-LTO build in the logs.

Your estimation seems pretty accurate!


Most people will use ccache under Unixen.


And for those of us who work primarily with MSVC?


> Go is a great example of fast compilation and strict typing IMO.

I think an even better example might be OCaml. Ocaml's compilation speed (last time I checked) was on par with Go, but it provides a much nicer (IMO) type system.


OCaml is another example of fast compilation and static typing.


I have worked with very few C++ projects that compile fast, and my experience is that the errors in C++ projects are often more severe than the errors in Python projects. Even with modern C++ smart pointer style, I’ve seen all sorts of stuff like dangling pointers / use-after-free, buffer overruns, etc. All in code that had been reviewed.


Yes, C++ was created in the 1980s and a lot has been added since...

That’s why I was suggesting newer compiled languages incorporating some of what we’ve learned over the past 4 decades. eg type inference


In general, the more work the type system is doing for you, the slower the compilation speed. Kotlin and Scala, for instance, both compile far more slowly than Java. I think even C# compiles very slowly compared to Java. You mention type inference, but type inference actually slows down compilation, and the more sophisticated the type inference, the slower the compilation!

Similarly, Rust compilation speed is much slower than Go's, whose type system does relatively little for you. I haven't seen exact numbers, but I've seen people say Rust and C++ have similar compilation speeds, and both are, in general, slow to compile.


Rust gives you so much more for the same compilation speed though.


Try any larger Swift project and suddenly C++ looks pretty fast. Agreed on the pointers though.


In my experience a good pipeline with decent unit tests makes c++ and python work fine.


And Valgrind (and / or Memory Sanitizer), they mostly remove a big source of problems in C++ code.


>Python is a higher level language than C++ so it requires less effort but newer compiled/typed languages offer more of what Python is good at

I see Nim as clearly better if you're writing larger software. It's a shame its usage is low and thus there aren't many Nim resources.


I think Nim is best placed as the choice for the few performance-critical, CPU-bound parts at the fast functional core of a larger Python codebase.


I see a strong convergence of typed/fast/expressive languages between python/php/ruby and cpp/ada. Rust is one trendy instance, but I believe it's gonna carve out a spot of its own.


> What’s wrong with strict typing and ahead of time compilation as long as it compiles fast? Doesn’t this prevent many runtime errors that can occur in Python?

Nothing, it's all good things: It's easier to write, easier to debug, and the compiler (not the user) catches bugs.


C++ compilation is anything but fast.


I'm really impressed with the performance improvements in Python 3.11.

I ran a very basic benchmark against a local web application: I got 413.56 requests/second on 3.10 and the exact same code gave me 533.89 requests/second on 3.11.

That's a big enough increase that I think it's worth actively upgrading projects. Usually I wait for a few months for things to settle in first.


That n-Body simulation is a bad case for Python today in that you loop over the differential equation solver in Python. (It's like the very branchy semantic web and old AI stuff that I do for fun... I am migrating a lot of that to Java, PyPy helps a great deal but Java does a lot better.)

If you are doing heavy matrix math, numpy runs at FORTRAN speed, tools like scikit-learn and Tensorflow also get high performance by doing the heavy lifting outside Python.
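
A minimal sketch of that difference, with pairwise differences standing in for the heavy lifting (shapes and sizes are illustrative):

    import numpy as np

    pos = np.random.rand(500, 3)   # 500 bodies in 3D

    def pairwise_py(p):            # pure-Python loops: interpreter overhead per element
        n = len(p)
        out = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                out[i, j] = np.sqrt(np.sum((p[i] - p[j]) ** 2))
        return out

    def pairwise_np(p):            # vectorized: loops run inside numpy's compiled kernels
        d = p[:, None, :] - p[None, :, :]
        return np.sqrt((d ** 2).sum(axis=-1))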


Sounds like a case of "write the code to show what it should look like and show what numbers you get compared to the code they posted"?

But remember that the exercise wasn't "to run the n-body simulation", or even "to run maths", but just to see how fast "plain Python" is with the release of 3.11 - using numpy and scipy, which rely on compiled code that they can hand work off to, would make any runtime value you get completely meaningless for the purposes of benchmarking pure Python =)


The viability of Python for scientific work is predicated on using Python as glue code for code written in other languages. If pandas ran at the speed of Python people really would be using Julia or some other language instead of Python.


Agreed! Which is why this is a benchmark of Python, not of "running scientific computations". Of course you're going to use numpy/scipy for that. Or you start using R, if you really care. This article isn't about what the code does, there are a million improvements to make if that's what we were looking at. It's about how fast a small program that uses some standard pure Python patterns runs today, compared to a few versions ago, with some "but what if not Python?".

Because "you wouldn't write this code" applies to those examples as well. In JS you'd tap into a C-compiled native library the exact same way numpy/scipy does in Python. And in C, if you absolutely needed the fastest performance, the code would be full of micro optimizations and a sprinkling of assembler.


Who cares about "pure" Python? The whole point is that you have powerful libraries at your fingertips and you can leverage those to get fast code. Can you do the same in Go? JavaScript? PHP?

I'm tired of people comparing languages but then leaving out the major winning points for Python. Numpy & co. are an integral part of Python; no serious dev would use just "pure" Python for numerical methods, ever. So let's compare real-world Python, shall we? I doubt Go has a chance then. /rant


Turns out other people don't care if you're tired of something, because they're not you. Plenty of folks do care about how much pure Python has improved. That's how it got to the front page.

If you want to test baseline performance, you want a little program that's small enough to be easily understood, but just elaborate enough to hit enough of the standard programming patterns. So it's basically irrelevant what this code actually does, we're just looking at how much faster Python has gotten. Because that's something we care about. We don't care that "go is faster" or "C is faster", we use Python and we like to know that the newer versions actually are substantially better than the older versions we used to (or still have to) work with.


Clearly the solution is to add a jitter to Python that identifies code that looks like a matrix multiplication, and calls numpy instead.


> numpy runs at FORTRAN speed

Not everyone will be aware that this meant as praise. ;-)


Yep.

FORTRAN codes persist today because (1) the old school memory model of FORTRAN is fast, and (2) it is so easy to write numeric codes that do the wrong thing with rounding and numerical instability. There's a reason why Foreman Acton wrote a book titled Numerical methods that (usually) work.

https://www.amazon.com/Numerical-Methods-that-Work-Spectrum/...

Code something up in C, Haskell, OCaml or CUDA and you miss out on the 40+ years of experience people have had with a FORTRAN code from the 1970s.


A lot of physics simulations use FORTRAN for the existing libraries, and there's a way to run it on GPUs. It's not going anywhere.


> (2) it is so easy to write numeric codes that do the wrong thing with rounding and numerical instability.

Are you saying FORTRAN avoids these problems, or that it is prone to them? If the former, how does it do it?


Ideally, people who understood numerics wrote the code in the 1970s and it has gotten heavy use since then so if there are problems with numerical instability they've been detected and solved.

Today somebody who doesn't know numerics frequently codes something up for the wrong reasons (e.g. to learn a new language, because they think the 1970s FORTRAN code is obsolete, ...) and never did the testing to know that the code they wrote is numerically stable or not.

That is, you might think it is pretty easy to code something numerical up, and sometimes it is, but frequently you write something that's a little bit wrong and sometimes you write something that's terribly wrong, sometimes it isn't even wrong.

It's not that FORTRAN is necessarily more accurate than another language, but that you can trust a code that has been around for 40 years and codes that have been around 40 years have been written in FORTRAN.

---

As for the memory model I think about it the most when I write embedded programs for my Arduino.

It drives me nuts that C diddles the stack pointer around meaninglessly when for most of the programs I write there are a small number of parameters that decide the size of all the arrays (like an old FORTRAN program) and local variables, recursion and all that are a source of problems and not solutions.

The only reason I write C for that thing at all is that some of the programs I write are performance sensitive and I could get a bigger boost running the C code on an ARM or ESP32 than I could get writing AVR8 assembly and eliminating meaningless loads, stores and other activity that C does "just because".


It is still informative to see how slow native Python really is.


I agree. This was honestly news to me - and I very often use Python for maths. However, I would never write the code as he did (instead I would rely on numpy/scipy). So I would also be interested in a numpy version of the same test.


How exactly would you write a gameboy emulator in numpy/scipy?

It's sequential code with fiddly side effects. I know, I've written one.

But I'm generally curious if this is in-fact possible in someway.


Numpy would probably be even slower here. Numpy is good when you have large arrays, but it adds roughly 0.1 to 1 µs of overhead per call.
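
A quick sketch that shows the per-call overhead dominating on tiny arrays (exact numbers are machine-dependent):

    import timeit
    import numpy as np

    a = np.arange(5)
    b = np.arange(5)
    xs, ys = list(range(5)), list(range(5))

    # on 5 elements the fixed per-call cost dwarfs the actual arithmetic
    print(timeit.timeit(lambda: a + b, number=100_000))
    print(timeit.timeit(lambda: [x + y for x, y in zip(xs, ys)], number=100_000))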


Without validating anything myself, I was able to find this post https://hilpisch.com/Continuum_N_Body_Simulation_Numba_27072... which showed a simple n-body program sped up by ~670 times when moving from pure python to numpy+numba.


2 things to notice: the first is that this is with 5 bodies while your link was with 1000. For 1000 bodies, numpy is a noticeable speedup (100x). For 5 particles (I used the same code as your article but adjusted the number of particles) numpy is 5x slower. Adding numba would make this fast again since it would remove the overhead, but at that point, just use a fast language in the first place.


To paraphrase:

> That benchmark shows that Python is slow. If you avoid writing your code in Python it can be really fast!


You should give JAX a go.

https://github.com/google/jax


+1 for JAX. Basically designed to be the successor to TensorFlow, and much nicer to work with. Strangely I've not seen it discussed around HN much but it's what I do 100% of my work in these days.

Whilst I'm here: shameless self-promotion for Equinox and Diffrax:

https://github.com/patrick-kidger/equinox https://github.com/patrick-kidger/diffrax

Which are neural network and differential equation libraries for JAX.

[Obligatory I-am-a-googler-my-opinions-do-not-represent-your-employer...]


I feel like JAX usually comes up on PyTorch posts (in the same way PyTorch usually comes up in the comments on TF posts).

I think the delta between JAX and PT on commodity hardware is just a little bit too small to really create much splash on HN.


The point of making Python faster is not just making the above n-body simulation faster, but also improving performance when Python is used as glue code for other, more efficient systems.

But the language now has a much better shot at getting faster, given the amount of resources dedicated to CPython, similar to what happened with JavaScript. I hope CPython eventually is as fast as JavaScript.


I see a lot of comments about how other languages are faster.

Please think about your actual workload and take them with a grain of salt.

For instance, for most web apps, you spend a large amount of time waiting on database responses. It looks nothing like the modeling the tests in the article do. Benchmarks are not typical workloads.

Don't just assume because some random person on hn says zig is faster you should rewrite your business apps.


I've actually found this to be completely untrue in practice. Just about every web service I've dealt with in production, if not all of them, have been CPU bound. This is for a number of reasons:

1) Network speeds have increased dramatically compared to CPU speeds.

2) People don't optimize code very much.

3) Web apps tend to do more work per request than they did in the 90s.

Regardless, I've never seen an app saturate its network pipe, but I've seen plenty saturate all their CPU cores, including relatively well-tuned ones. For instance, I wrote a Netty-based reverse proxy app once, and while I got it to run far faster than the typical app in that company, it was still CPU-bound in all my tests.


+1 to this.

We picked Python's asyncio-based Tornado for our services. As soon as we hit scale we were CPU bound.

Deep diving into profiling came up with JSON parsing being the culprit.

It's a painful problem to crack once you hit those limits. In some cases you can genuinely skip parsing the JSON (by returning a raw JSON-containing string to the client), such as when you are simply getting data from a cache or db.

In other cases you simply can't skip it. For eg when you are interfacing with a 3rd party library that will only speak JSON. At that stage you are stuck.

You could try a wrapper around a faster native JSON parser (say, ujson), but it will be a trade-off between the parsing time saved and the time taken to copy the string to the FFI parser and copy back the results, plus all the complexity that entails.

Or you could hand it off as a job to an async queue (this might be the canonical architectural approach to prevent blocking the event loop), but then you have just shifted the problem to a different place, where you'll still need to throw more instances at it. And this adds extra latency.

I too was in the "don't optimize prematurely" camp but picking Python today for new services IMO would be taking that principle a bit too far.

Especially considering the ergonomics that modern languages like Golang or Rust offer.


> It's a painful problem to crack once you hit those limits.

If you hit that doing something novel and obscure, sure. If it is literally parsing JSON, you spend a little effort researching non-stdlib JSON parsing libraries, pick one of the several stable much-faster-than-stdlib ones, and move on.
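As a sketch of how small that change usually is (using orjson as one example of a stable, much-faster-than-stdlib parser; not necessarily what the parent comment used):

    import orjson  # pip install orjson; a Rust-backed JSON library

    def parse(payload: bytes) -> object:
        # orjson.loads accepts bytes directly, skipping an extra decode step
        return orjson.loads(payload)

    def serialize(obj) -> bytes:
        # note: orjson.dumps returns bytes, not str, unlike json.dumps
        return orjson.dumps(obj)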


Like I mentioned, we did that. You get a small constant-factor speedup. But moving these services to Rust took our instance count from 32 down to 2.

Once the team has the know-how to write a service in Rust/Golang, or even modern, pleasant-to-write Java, it becomes hard to justify picking Python for a new service at all.


He explicitly mentioned that path…


He explicitly mentioned some of the troubles you might have implementing and maintaining that yourself via FFI.

He very much did not explicitly mention that it's already done, with established results, and that the described “pain” isn't something you have to take on at all.


And then there is the resource usage: a typical dynamic interpreted language uses orders of magnitude more CPU and memory than code compiled to machine code. Enough of these apps running in a data center can add up to a lot of wasted resources!


I'd argue a lot of web apps just serialize to JSON after doing minimal business logic, most of which boils down to DB requests.

Once again, I said to look at what your load actually is. If you're CPU bound, then yes, buy a bigger instance, optimize, or switch languages.

It's like how if you are putting up a blog, you probably shouldn't be looking at running kubernetes clusters for just that blog.


You might be surprised to find out how CPU-intensive even something as simple as a DB request can be. Your driver has to do the work of communicating with the DB server via its protocol, which is similar in cost to any other network request. More importantly, it has to deserialize the result and marshal the data into objects on the managed heap. Most DB drivers don't support streaming, so for larger requests you have to read all the data into memory, then convert it all into objects at once.
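(Where the driver does support streaming, it looks like this hedged sketch using psycopg2's server-side "named" cursors; `big_table` is hypothetical, and rows arrive in batches of `itersize` instead of being materialized all at once:)

    import psycopg2

    conn = psycopg2.connect("dbname=app")
    with conn.cursor(name="stream") as cur:  # named => server-side cursor
        cur.itersize = 2000                  # rows fetched per round trip
        cur.execute("SELECT * FROM big_table")
        for row in cur:                      # iterate without loading everything
            print(row)                       # stand-in for real row handling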

And that's not even taking into account frameworks and ORMs like Hibernate, which itself can multiply the CPU used several-fold on top of JDBC, or whatever lower-level interface it wraps. I've never measured frameworks in dynamic languages, but I have no reason to believe they aren't similarly inefficient.

And, yes, one of the optimizations I did for my reverse proxy app was to upgrade the JSON library, which brought a significant performance boost. But it's not the only source of CPU usage for apps, nor was it the only major optimization I successfully applied.


Yeah, I've profiled slow Python apps before, and it's almost always serialization that's the sticking point. The default implementation of date parsing in particular is really slow.
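To make the date-parsing point concrete, a quick illustrative comparison (numbers will vary by machine): the general-purpose `strptime` is far slower than the special-cased `fromisoformat`.

    import timeit
    from datetime import datetime

    s = "2022-10-26 12:34:56"
    t_strptime = timeit.timeit(
        lambda: datetime.strptime(s, "%Y-%m-%d %H:%M:%S"), number=100_000)
    t_iso = timeit.timeit(lambda: datetime.fromisoformat(s), number=100_000)
    print(f"strptime: {t_strptime:.2f}s  fromisoformat: {t_iso:.2f}s")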


> I've never seen an app saturate it's network pipe

True, but that has nothing to do with what OP said:

> you spend a large amount of time waiting on database responses.


Yeah this is a constant excuse for Python's slowness but I have yet to work with a Python codebase that wasn't slow.

I mean I could buy it if Python was maybe 5-10x slower than "fast" languages, but the benchmarks people are throwing around show it is 50-100x slower. Every Python codebase I have used (apart from one-off scripts I guess) has eventually run into the "ok it's slow, how can we make it faster?" barrier.


So, python is slower than go, rust, zig, etc. OK we agree.

Now, does it matter? Not all codebases are the same. If I'm serving API requests, does 55ms vs 15ms make a difference? 110ms vs 75ms? If you're running analysis over a huge data set and an iteration takes 100ms vs 50ms? Yeah, that could be the difference between weeks and days.

Almost like someone should look at what they're doing before saying 'IshKebab said python is slow, so we should rewrite all our code'


I've worked at places where both your comparisons absolutely matter. 55ms vs 15ms for a domain-specific search engine breaks a budget of 20ms. 110ms vs 75ms broke the budget of an ML API. Not everyone has the luxury of not caring about speed.


Well those sort of time differences would make a difference.

But usually 90% of the response time for an API request is the time it took the SQL server to execute the relevant queries.


For most cases it just doesn't matter.

I make software that requires searching over large datasets (image recognition for construction drawings). It's a web app running python on the server and it feels instantaneous for users as they're searching.

Most people aren't even doing that - they're just pushing and pulling data from a db. Building maintainable software is what really matters.


The issue is that at the start of the project it doesn't matter, but in my experience projects always reach a point where it does matter. And at that point you have 500k lines of Python that you can't do anything with, and you're looking enviously at your competitors who chose Go or Java or C# or Typescript or Kotlin or Swift or Nim or Dart or Rust or ...

That's why I'd never start a new project in Python (among other reasons). It backs you into a poor performance corner.


>If I'm serving api requests, does it being 55ms vs 15ms make a difference?

Yes: per worker, it's the difference between being able to serve 66 r/s or 18 r/s.


>For instance, for most web apps, you spend a large amount of time waiting on database responses.

Still, for web apps there are huge gaps based on the language and framework used. [0]

[0] https://www.techempower.com/benchmarks/#section=data-r21


>"C++ is a compiled language, which means that it lacks some of the convenience of Python and JavaScript. Besides strict typing and having a generally ugly syntax, C++ also requires ahead of time compilation."

To me, ahead-of-time compilation is a convenience that in many cases catches bugs before they present themselves in all their glory to a customer.

As for "generally ugly syntax" - beauty is in the eye of the beholder. I am, for example, multilingual and do not get hung up on syntax unless it resembles Brainfuck. However, having fn instead of function, not having brackets around the parameter list, or having variables use special characters does not qualify as "beauty". It is just a different way of doing the same thing, and it often feels like it is done for the sole purpose of being different, or as a half-arsed attempt to make parsing simpler.


> As for "generally ugly syntax" - beauty is in the eyes of the beholder.

It's not entirely subjective. For example, a language that is inconsistent in its use of syntax could fairly be described as "objectively ugly".


You just uglified nearly every language.


Ok. Maybe I should have qualified the example a little more.

My point is - I don't buy that "language aesthetics are entirely subjective".


If aesthetics were objective, then people would not be creating "ugly" languages. The fact that they do shows that, unless they are making Brainfuck on purpose, they believe their creations are not ugly. After that, it only comes down to what percentage of people believe that language XYZ is a "beauty".


Javascript being 41x faster than Python seems a bit excessive.

In my experience, having done quite a lot of dynamic language benchmarks, Javascript is about as fast as PHP and both are about 6x faster than Python.

Has someone looked at the Python code used here, if it has any obvious gotchas?


There is one. The author's implementation of `combinations` runs in quadratic time, making copies of lists on each iteration. I replaced it with `itertools.combinations`, which uses an iterator, and found a pretty big difference:

     ~/p/not_that_slow  python3.11 modified.py 5000000
    -0.169075164
    0.183753791
    Time: 6.335 s

     ~/p/not_that_slow  python3.11 main.py 5000000
    -0.169075164
    -0.169083134
    Time: 48.515 s
Other than this, the only modification I made is to include a `print` statement at the end to show the time taken.

EDIT: That is not actually equivalent, my bad. The iterator is consumed and the execution ends early; I didn't realize the algorithm iterated over it multiple times.
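For reference, a sketch of the substitution and the gotcha (with stand-in objects instead of the benchmark's particles): `itertools.combinations` returns a one-shot iterator, so if the pairs are walked more than once per run they need to be materialized.

    from itertools import combinations

    bodies = ["a", "b", "c", "d"]  # stand-ins for the particle objects

    # roughly what the original quadratic helper produced:
    pairs_naive = [(bodies[i], bodies[j])
                   for i in range(len(bodies))
                   for j in range(i + 1, len(bodies))]

    # iterator version; list() makes it safe to iterate repeatedly:
    pairs_fast = list(combinations(bodies, 2))
    assert pairs_naive == pairs_fast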


How are you using it? What you printed out (modified.py) seems to be a different result than expected.


Heavy list creation is one way to really slow down Python.

And itertools is a real gem, full of lots of goodies to handle list manipulation tasks.


It's a benchmark that is basically the ideal case for a JIT, so it's really Python vs. JS that got compiled to native code.

I’m in no way criticizing the benchmark but that’s the reason for the starker difference.


let me know if you find anything! Happy to edit the post with a correction.

For a little more information, this is using Bun 0.2.1, which uses JavaScriptCore (JSC), a part of WebKit which powers Safari. Since I'm running on an M1 Pro (Apple's ARM chip), there is probably somewhat of a benefit in using JSC.


It's not excessive in my experience. Javascript is usually very fast, much faster than PHP and Python.

There are many flaws in this benchmark but the order of magnitude looks right to me: https://benchmarksgame-team.pages.debian.net/benchmarksgame/...


Even a 2x-faster Python still makes for a terribly slow language.

Improvements look amazing relative to old Python. But compare it to PHP, Javascript, Lua. Ok, it might be better than Ruby sometimes.


I thought Ruby was now faster than Python. (I know that wasn't always the case, though.)


For a long time many of the .NET Runtime's most fundamental and performance-sensitive functions were written in C++. In the beginning the JIT wasn't fast enough, or C# didn't have good enough native interoperability, or it didn't have the tools to write fast code. As the years went on the JIT got faster, and more and more improvements were added to help write high performance C# code. Now there's a mass effort to port most of the runtime to C# to make it safer, more maintainable, and often faster. RyuJIT may not generate code as nice as LLVM but there's so much overhead calling into native code that there are gains to be had.

Imagine the alternate timeline where numpy is removing some of its C function calls because Python is fast enough on its own. I'm not sure we can make it there from here.


The boundaries are getting fuzzier. Implement this in Python _with JAX_ with little if any extra effort, run it on GPU, and you get performance you’d alternatively get only by writing custom CUDA kernels in C++.
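For the curious, a minimal sketch of what that looks like (an illustrative pairwise-gravity step, not any particular article's code): `jax.jit` traces the Python function once and compiles it with XLA, so the same code dispatches to CPU, GPU, or TPU.

    import jax
    import jax.numpy as jnp

    @jax.jit
    def step(pos, vel, mass, dt=0.01, eps=1e-3):
        # pairwise displacement vectors, shape (n, n, 3)
        d = pos[None, :, :] - pos[:, None, :]
        # softened inverse cube distance avoids division by zero at i == j
        inv_r3 = (jnp.sum(d * d, axis=-1) + eps) ** -1.5
        acc = jnp.sum(d * (mass[None, :, None] * inv_r3[:, :, None]), axis=1)
        vel = vel + dt * acc
        return pos + dt * vel, vel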


I don’t think the borders are that fuzzy.

Who has a C++ repo that isn’t in a low level language because it has to be?

There is no production repo where python gets fast enough to replace Rust/C++.

Still great to see improvements because there are repos where python replaces other languages.


Yeah but Python with Rust modules might be quite effective.


Even CuPy too. I have been using CuPy for some of our own work, and it's so easy to do GPU-based scientific computing now that it's almost funny.
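For readers who haven't tried it: CuPy mirrors the numpy API closely enough that porting is often little more than an import swap (a hedged sketch; assumes an NVIDIA GPU and a matching cupy build):

    import cupy as cp  # near drop-in numpy replacement backed by CUDA

    x = cp.random.random((4096, 4096)).astype(cp.float32)
    y = (x @ x).sum()   # matmul and reduction both run on the GPU
    print(float(y))     # copies just the scalar result back to the host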


Nice speedup! I'd love to see a comparison between 3.8 and 3.11 of the same script, but using numpy


wouldn't you expect numpy to mostly mitigate the differences?


Now just waiting for the GitHub Actions runners to be updated with Python 3.11 (deployment next week - they deploy image updates weekly).

https://github.com/actions/runner-images/issues/6459


The actual title should be “Python 3.11 is much faster than 3.8” to match the page.

That missing word is important.


Indeed - I clicked expecting there to be some kind of catch.


I am not sure the JS vs. C++ benchmark is really fair. Using the Unix `time` utility, you are also including the startup of the JS runtime.


> Python is a popular but reputably slow interpreted language.

_sigh_

Why do people still say that?

1. There are no interpreted "languages", only interpreters for said language. One can compile or interpret anything.

2. When Java didn't have a JIT, it was still called a "compiled language", even though it was running bytecode, same as Python.

I think this "interpreted" vs "compiled" distinction is an anachronism. Pure interpreters are almost extinct.

> (Javascript) It's a JIT compiled language with far more investment

Oh, so it's compiled now?

It's indeed due to investment, it wasn't always the case.

> C++ is a compiled language

Or is it?

https://root.cern/cling/


What they don't tell you is that python 3.8 was slower than 3.3.


Can you test some alternate pythons: pypy, pyston, pyston-lite ?


Does someone have these times for more languages?

Would be curious to see the performance of Java, Rust, Ruby, w/e.

I can estimate by using the energy consumption of each language as compared to C (https://storage.googleapis.com/cdn.thenewstack.io/media/2018...) but real numbers are appreciated.


I am very partial to Python, but I have hit a lot of bottlenecks with it. I find Go to be a reasonable intermediate, balancing ease of use and flexibility with some speed trade-offs. I should probably learn Rust, but it seems like a lot for writing web apps, and I could never get my head around lifetimes.


> JS is 41x faster than Python

Except that Pythonistas often delegate such computations to libraries like numpy that achieve C-level speed. Python is supposed to be a glue language, JS is supposed to be a language that can run on browsers, so of course the two have different goals and performances.


I don't think there's much holding JS back from also being a glue language. JavaScript has certainly evolved beyond being only a browser language these days.

The only request I'd have is more granular garbage collection (marking FFI memory as refcounted for more immediate destruction).


TypeScript is better as a glue language than Python IMO.


Is the performance difference of Python and Javascript due to the language or due to the runtime?

If one would compile both, the Python version and the Javascript version, to WebAssembly and execute that, how would the performance differ then?


The WebAssembly version of this exact test takes about twice as long as running natively. I just happened to do exactly this benchmark a few minutes ago:

Python 3.11.0 native: 0m43.755s

Python 3.11.0 WebAssembly (on node.js): 1m31.305s (about 2.1x as long).

For what it is worth, I also maintain a Python --> Javascript transpiler (https://www.npmjs.com/package/pylang) and I tried it on exactly this benchmark (with some very minor changes):

PyLang: 0m19.386s (on node.js) about half as long as Python 3.11.0 native

The article uses bun, which uses Safari's JS runtime, which sometimes has significantly better performance than node.js for these sorts of benchmarks (especially for Webassembly), in my experience.

These tests were all on an M1 Macbook Pro.


It's due to the runtime.

JavaScript has a runtime that updates as it executes. It notices a loop is being run frequently and then compiles a new version of the loop that runs faster.

Python doesn't have this functionality.

WebAssembly is for running things in the browser and would slow everything down (Python, JS, C++). The tests in the post are run natively (on my command line).


Python does have this functionality, but it's hamstrung by the implementation.

Javascript is, for the most part, completely sandboxed and separate. Which means the JIT for Javascript is totally free to do neat things like move an object from one region of memory to another.

CPython, on the other hand, has to care about things like "was this object created in C? Is it pinned to that memory location? Does this method end up delegating to C?"

And you see that in the complexity it takes to introduce native methods in both. Moving and using data from JS to WebAssembly is a PITA. Accessing data in a Python object from C (and vice versa) is trivial.

That triviality is what gets in the way of JIT optimizations.


I don't buy this argument. WebAssembly is a vastly different beast than trivial C extensions (since WASM is a browser thing). JavaScript has both and is still much faster.

Bun (the runtime I used for the post), has a native C interface that's far simpler to use than Python (even with pybind/nanobind): https://github.com/oven-sh/bun#bunffi-foreign-functions-inte...

Deno (a v8-based runtime) has the same thing: https://deno.land/manual@v1.25.4/runtime/ffi_api

These use the standard C-ABI directly without requiring headers (unlike Python).

Node.js standardized a more Python-like C API called Node-API: https://nodejs.org/api/n-api.html

I believe Python can continue to add performance and is not hamstrung by the implementation.


The difference between all of these and how the Python C API works is that one layer of abstraction and translation.

Take the Deno example: in order to invoke a C function, you have to tell Deno "load this lib, this dynamic function is here, and the method signature looks like this".

Look at what types are permitted to be passed between the two as well. Deno isn't exposing Deno objects to C; it's exposing only simple primitive types. The CPython FFI, on the other hand, grants C full access to Python objects.

That one layer of abstraction and distance makes all the difference and is hard to backport in.

There's a reason alternative Pythons (PyPy, GraalPy) don't really support it and struggle to support libs that rely heavily on it (like TensorFlow).


You can expose objects. Here's how it is done in Bun: https://github.com/facebookresearch/shumai/blob/main/shumai/...

We've been using this feature heavily in Shumai.

I think you are vastly overestimating the complexity associated with this (user exposed ref-counting/garbage collection) and may not be totally up to date on what's implemented.


> I think you are vastly overestimating the complexity associated with this

No, I think you are misunderstanding the problem.

I'm not saying that, with a good 3rd party lib, generating FFI APIs can't be easy and slick. I'm saying that non-python languages have more complicated FFI APIs that practically necessitate these sorts of generation libraries (like bun).

When you grab a bit of memory out of an object in C with python, you are reaching directly into python's internal representation of the object and tickling the bits there.

When you do that with deno/javascript/others, there's a layer of abstraction introduced to keep our native method from directly tickling the bits that the VM is aware of.

That's the problem.

From C, you can create a new python object, store it off in a global variable, send it back to the python vm, and later in a thread go tickle some of the bits and see that tickling in the Python VM. Because that object is the same one used by Python and C.

The complication that arises with CPython is that many very popular libraries (numpy, tensorflow, pandas) rely HEAVILY on the fact that the objects they are working with in C are the same ones Python uses. That's why they've been so slow to port to PyPy, if at all.

And that ability for C libraries to very deeply interact with the VM is exactly the problem that makes it hard to improve CPython's JIT story. That's the reason other Python JITs, like PyPy, either don't support CPython's FFI capabilities or support them only in a very limited way.

The reason FFI works so well with other languages is they drew very tight and clear boundaries around how interactions work and who owns what when.

So please, stop spamming "bun". It's a non sequitur.


I'll concede that Python has more nuanced bits of complexity exposed, but I suspect a willingness to break compatibility for the sake of performance would benefit the language a lot more than it would hurt it. I personally hope the hacker mentality and the immensely perf-driven nature of JavaScript is one day adopted by Python.

For those following, I want to make it clear that the "bits" being discussed are extremely nuanced and won't get in the way of doing obviously useful things in JavaScript.

Here's how you create a JS Object from C using Node-API: https://nodejs.org/api/n-api.html#napi_create_object

It conforms to ECMAScript 6.1.7 Object Types and is very much the same internal representation used by the VM.

Here's how one might tickle the internal representation used by Node.js (V8 in this case) to change the value of the underlying property of a JavaScript object from C: https://nodejs.org/api/n-api.html#napi_set_element


> I suspect a willingness to break compatibility for the sake of performance would benefit the language a lot more than it would hurt it

It took a long time for the community to recover from the python3 incompatibility with python2. I don't think we are ready for another round.


> but I suspect a willingness to break compatibility for the sake of performance would benefit the language a lot more than it would hurt it.

That's already happened, in the form of GraalPy and PyPy. The community has voted, and it wants compatibility more than it wants performance.


> Accessing data in a Python object from c ( and vice versa) is trivial. That triviality is what gets in the way of JIT optimizations.

Can't it be automated? Accessing data is often done in patterns that are very similar.

Perhaps we should access Python from Rust, would that help?


Absolutely, and it usually is. The FFI isn't generally hard to use because of the automation.


Just some extra info to add to this for OP/others. The keyword to Google to learn more about this would be a "JIT compiler". If you were to run each of those benchmarks to loop only a few times (or steps for the n-body example), say n=10, they would appear much slower because the JIT compiler needs lines of code to be run multiple times before it can begin making optimizations.

"pypy" is an alternative runtime for Python which can do JIT optimizations like Javascript interpreters do. Another commenter in a different subthread already posted a quick speed comparison involving pypy, it's much faster than the standard python runtime (called CPython).

But pypy isn't necessarily a synonym for Python (which almost always means CPython). It doesn't have 100% compatibility with 3rd party libraries (it's probably 99.9% but just 1 unsupported dependency is enough to prevent your whole project from going pypy).

And also with most Python projects, the python code isn't the slow part of the stack. If you're making a web app, your cache and database is probably where 99% of your request time is spent. For scientific computing, data analysis, or ai training, you're usually using 3rd party libraries written in a "fast" language like C/Fortran/Rust with python hooks, and your python just acts as glue code to pass data between these library calls.


> Is the performance difference of Python and Javascript due to the language or due to the runtime?

It's almost always the runtime. Unless the language spec makes it particularly difficult to implement a feature in a way that's performant while still respecting the spec.

Which is why the discussion on which languages are "faster" is seldom productive. Languages are not fast or slow by themselves. And, even when they are on average, implementing stuff on Assembly is no guarantee of speed. Quite often, higher level abstractions lead to optimizations that are difficult to replicate if one is operating at a lower level.

As a very trivial example, consider short-circuit evaluation. It's something that you explicitly have to code in assembly, but most languages implement it out of the box. Or garbage collection: allocating and deallocating memory on demand is memory-efficient, but it's not necessarily what you want for performance. In many cases, deferring garbage collection allows often-accessed memory to be retained, whereas a C implementation would be constantly allocating and freeing. Fixing that requires programmer effort.
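To make the short-circuit example concrete (a trivial illustration in Python): the branch is part of the language's semantics, whereas in assembly you would write the conditional jump yourself.

    def expensive_check():
        print("expensive_check ran")
        return True

    # `and` settles on the first falsy operand, so the call never happens
    result = False and expensive_check()
    print(result)  # False, and nothing was printed by expensive_check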

What I think matters the most is the ability of a language (+runtime) to access lower layers when needed (ASM blocks in C, FFI in Python).


>And, even when they are on average, implementing stuff on Assembly is no guarantee of speed. Quite often, higher level abstractions lead to optimizations that are difficult to replicate if one is operating at a lower level.

When browsing Code Golf, more often than not the fastest implementation is in assembler. And sometimes it can be 2x-4x faster than the next-fastest implementation, which is usually C++.


While with enough thrust even pigs can fly (witness the heroic efforts to make JS fast), some languages are indeed inherently faster than others, either because their semantics are easier to translate efficiently to an actual machine or because there is less abstraction between the language and the machine.


It's both. Language semantics make it hard to optimize Python, even for the runtime / even for a JIT.


An underrated factor in this conversation is implementation effort. Javascript is arguably the most widely run interpreted language in the world. Pre-Chrome, JS was uniformly rather slow, but Google was betting on webapps being good and usable, so they brought on some top-tier compilation/VM talent to build V8. Once fast Javascript existed, it became possible to do more natively in the browser, and other browser vendors joined in an arms race. Tons of effort overall; tons of money and time from top-tier experts.

There was no equivalent competition between major organizations to make Python fast. Folks have tried, but not on the same scale. Yes, factors like language design and existing ecosystem entanglements played a role here, but if Python had had the investment in performance Javascript had (financial and expert attention), it'd be dramatically faster today (though it may have required a fork or similar, depending on how tied to simplicity Guido was feeling).

To be clear, I'm not saying nobody tried to make Python faster, or that Python devs don't know what they're doing. I know very well both of those things aren't true. But, the kind of sophistication required to make Python fast in absolute terms is very hard to build and maintain, and for various reasons nobody who has found Python to be too slow has found "invest in a bunch of person-years of VM engineer attention dedicated to performance work" to be their best path.


I think you are right, but there are signs of great effort: PyPy, Unladen Swallow, and many other projects, some still around and some not.


Python and JS are basically the same language in terms of semantics.


> Python and JS are basically the same language in terms of semantics.

Python is dynamic class-based OO with multiple inheritance. JS is dynamic prototypal OO, though it has recently added convenience syntax implementing single-inheritance class-based OO on top of it.

They are not the same.


It's still redirecting to another object at runtime in both cases. The details are obviously not exactly identical but there's no great difference in the object model just because one is called a "class" and one is called a "prototype".


Python also has much stronger typing, in that it won't convert the types of things willy-nilly without being instructed to.


Well, as long as both of them are descendants of Modula-2 :)

But no. While both languages share some similarities (dynamically typed, some form of "objects" and "classes"), there are a lot of differences, both at the language level and at the implementation level.


What are some important differences do you think?


Lexical scope and first-class functions in JavaScript, as a descendant of Lisp.


Testing this out on a simple project. As a straw poll measurement, I was seeing 4.5 to 6 seconds on version 3.9 and 4.29 to 4.4 seconds on version 3.11. Definitely a noticeable improvement, for what it's worth.


Anecdotally: wow, just by upgrading from Python 3.8 to 3.11 I measured a speedup from ~90 sec to ~60 sec runtime (in simulation code that mostly manipulates Python built-in numbers and container datatypes).


I would love to see the OP benchmarks with Taichi applied: https://github.com/taichi-dev/taichi


Semantic Versioning with numbers > 9 really breaks my brain sometimes - I read the headline as "Python 3.1.1 is faster than 3.8" - a performance regression - and thought "Oh no!"


I’m still patiently waiting for Python to eventually decide to jump from 3.X directly to X+1


Wow bun is fast! Would be interesting to see how NumPy or CuPy compare.


> Wow bun is fast!

Relative to python? I ran the benchmark in the post with both node.js and bun, and running it with bun consistently took ~60% more time. That was with bun v0.1.10 though. I tried with bun v0.2.1 but it just crashed.


Node (19.0.0) is quicker than bun (0.2.2) which is quicker than Deno (1.26.2) in my tests. (Edit: added Deno)

  mitata 'node sim.ts 10000000' 'bun sim.ts 10000000' 'deno run --allow-read sim_deno.ts 10000000'
  cpu: 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
  runtime: shell (x86_64-unknown-linux-gnu)
  
  benchmark                                       time (avg)             (min … max)
  ----------------------------------------------------------------------------------
  node sim.ts 10000000                        803.35 ms/iter  (773.08 ms … 848.1 ms)
  bun sim.ts 10000000                            1.23 s/iter        (1.19 s … 1.4 s)
  deno run --allow-read sim_deno.ts 10000000     1.37 s/iter       (1.27 s … 1.69 s)
  
  summary
    node sim.ts 10000000
     1.53x faster than bun sim.ts 10000000
     1.71x faster than deno run --allow-read sim_deno.ts 10000000


How much faster? Should I update immediately or at my own convenience? Is this a game-changing speedup for my business?


I'm seeing about a 50% speedup. It depends what you mean by game-changing, but you should certainly check whether 3.11 is a drop-in replacement for whatever you are doing.
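One low-effort way to answer "is it worth it for my workload": run the same timeit snippet under both interpreters and compare (here `app.hot_path` is a hypothetical stand-in for a real function from your codebase):

    import sys
    import timeit

    from app import hot_path  # hypothetical: your actual hot code path

    t = timeit.timeit(hot_path, number=1_000)
    print(f"{sys.version.split()[0]}: {t:.3f}s for 1000 calls")

Run the file with python3.8 and again with python3.11; if the hot path is pure Python, the delta should show up immediately.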


Does anyone actually choose Python with speed as a defining factor? It's one of the slowest popular languages.


most of the machine learning industry uses it almost exclusively as the driving language for GPUs


Yeah, but that's because all the well-documented tooling was built for Python as a first-party language, no? It's not much of a choice.


I wonder what the performance would be if the code had been compiled with GraalVM.


Does anybody have a Cython benchmark for this one?


What about Python 3.10?


Python is in its awkward stage, something like PHP in ~2008. It's not super fast, it's got a good following, but changes to the core APIs cause breakage, it's being used for things that maybe it shouldn't be, and it's facing a lot of competition from other languages.

I had to laugh a few months back when someone suggested we switch a newish project to python from PHP 8.1 because "python is built in C" and "it's better for multithreaded math functions". A lot of misinformation out there.


That anecdote says more about the people you work with than about Python.


The second half of my reply, sure, but the first is just facts. I'm sure Python's got a future; it's just going to have to step up a little.

Also, yeah, that guy wasn't great. As I said, I had to laugh.


Waiting for Python 3.11 for Workgroups


It is going to come on half a dozen or so double density disks.


At this point virtually everyone agrees that dynamic typing is only good for scripting and cross-domain code, not for building robust software applications. Considerable effort has been spent over the years adding typing back to dynamically typed languages (TypeScript; Python has several approaches).

If you want to build real software, you should be looking at Go or Rust, not Python. They are already fast, they already have concurrency, and now with Go's generics it's safe to say that their type systems are exactly what's needed to build real software in 2022.


You can build robust software applications with Python. It just needs different development practices. Using static typing with Python is a bit dumb, IMO.

Basically, you write a ton of 3k-line scripts, have them talk to each other over a message queue like RabbitMQ, and have an SQL server, e.g. PostgreSQL, do all the heavy lifting for you. (See the sketch at the end of this comment.)

You can't build every program like that, but 80% of the programs written in businesses can be done like that. Typical CRUD stuff.

And you can develop it at 3x the speed compared to doing it in a more traditional language like Java/C#.

Python is a powerful language if you use it for the right use cases.
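A minimal sketch of one of those workers, using pika against RabbitMQ (the queue name and handler are made up for illustration; acknowledgements keep a crashed worker from losing jobs):

    import json
    import pika  # pip install pika

    def handle(ch, method, properties, body):
        job = json.loads(body)
        print("processed", job)  # real work (and the SQL) goes here
        ch.basic_ack(delivery_tag=method.delivery_tag)

    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = conn.channel()
    channel.queue_declare(queue="jobs", durable=True)
    channel.basic_consume(queue="jobs", on_message_callback=handle)
    channel.start_consuming()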


How do I run the latest diffusion model in Go or Rust?


I said it was good for cross-domain code.



