Oh, I'm familiar with numba and while it certainly helps, it has plenty of its own issues. You don't always get a performance gain, and you only find that out at the end of a refactor. Your code can get less readable if you need to shuttle data in and out of formats it's compatible with (looking at you, List()).
To say nothing of adding yet another long dependency chain to the language (Python 3.11 is still not supported even though work started in Aug of last year).
I do wonder if the effort put into making this slow language fast could have been put to better use, such as improving a language with Python's ease of use but which was built from the beginning with performance in mind.
I've rewritten real-world performance-critical numpy code in C and easily gotten a 2-5x speedup on several occasions, without having to do anything overly clever on the C side (i.e. no SIMD or multiprocessing in the C code, for example).
Did you rewrite the whole thing or just drop into C for the relevant module(s)? Because the ability to chuck some C into the performance critical sections of your code is another big plus for Python.
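For anyone who hasn't done it, the ctypes route is the low-ceremony version of this: no extension module, no build step on the Python side. A minimal sketch, calling sqrt from the system's libm just to show the mechanism (in a real use you'd load your own compiled .so/.dylib; library lookup is platform-dependent, this assumes find_library can locate libm):

```python
import ctypes
import ctypes.util

# Locate and load the C math library. On a real project this would be
# your own compiled shared library, e.g. ctypes.CDLL("./fastkernel.so").
libm = ctypes.CDLL(ctypes.util.find_library("m"))

# Declare the C signature so ctypes converts arguments correctly;
# without this, ctypes would pass a Python float as an int and
# assume an int return value.
libm.sqrt.argtypes = [ctypes.c_double]
libm.sqrt.restype = ctypes.c_double

print(libm.sqrt(2.0))  # → 1.4142135623730951
```

The declaration step is the part people forget: ctypes does no type checking against the C header, so a wrong argtypes/restype silently gives you garbage rather than an error.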
But... pretty much any language can interoperate with C; its calling conventions have become the universal standard. I mean, I still remember at $previousJob when I was deprecating a C library and carefully searched for any mention of the include file... only to discover that a whole lot of Fortran code depended on the thing I was changing, and I had just broken all of it (since Fortran doesn't use include files the same way, my search for "#include <my_library" didn't return any hits, but the function calls were there nonetheless).
Julia, to use the great-great-grand-op's example, seems to also have a reasonably easy C interop (I've never written any Julia, so I'm basing this off skimming the docs, dunno, it might actually be much more of a pain than it looks like here).
I’ve done the same but moved from vanilla numpy to numba. The code mostly stayed the same and it took a couple hours vs however long a port to C or Rust would have taken.
For a package whose pitch is "Just apply one of the Numba decorators to your Python function, and Numba does the rest.", a few hours of work is a long time.
A 2-5x speedup is not a lot; I would say a rewrite from Python to C isn't worth it unless you get an order-of-magnitude improvement.
Because if you weigh the benefit against the cost of the rewrite, the cost of maintaining/updating the C code, and the usual C footguns like manual memory management, there isn't much benefit left.
I highly doubt that numpy can ever be the bottleneck. In a typical Python app there are other things, like I/O, that consume resources and become the bottleneck before you run into numpy's limits and can justify a rewrite in C.
I haven't personally run into IO bottlenecks so I have no idea how you would speed those up in Python.
But there are two schools of thought I've heard from people regarding how to think about these bottlenecks:
1. IO/network is such a bottleneck so it doesn't matter if the rest is not as fast as possible.
2. IO/network is a bottleneck so you have to work extra hard on everything else to make up for it as much as possible.
I tend to fall in the second camp. If you can't work on the data as it's being loaded and have to wait till it's fully loaded, then you need to make sure you process it as quickly as possible to make up for the time you spend waiting.
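And sometimes you can split the difference: if the data arrives in chunks, you can prefetch the next chunk in a background thread while crunching the current one, so part of the IO wait disappears. A toy sketch of that pattern, where load_chunk and process are hypothetical stand-ins for real IO and real number crunching:

```python
import concurrent.futures
import time

def load_chunk(i):
    # Stand-in for a slow read (disk/network); sleep simulates latency.
    time.sleep(0.05)
    return list(range(i * 4, i * 4 + 4))

def process(chunk):
    # Stand-in for the CPU-bound work.
    return sum(x * x for x in chunk)

def pipeline(n_chunks):
    """Prefetch chunk i+1 in a worker thread while processing chunk i."""
    results = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(load_chunk, 0)
        for i in range(n_chunks):
            chunk = future.result()           # wait only for the prefetched load
            if i + 1 < n_chunks:
                future = pool.submit(load_chunk, i + 1)  # overlap the next load
            results.append(process(chunk))
    return results

print(pipeline(3))  # → [14, 126, 366]
```

Threads are fine here because the waiting happens in IO (where the GIL is released); if the "load" were itself CPU-bound Python, you'd want processes instead.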
In my typical python apps, it's 0.1-20 seconds of IO and pre-processing, followed by 30 seconds to 10 hours of number crunching, followed by 0.1-20 seconds of post processing and IO.
A 2-5x speedup barely seems worth rewriting something for, unless we're talking about calculations that take literally days to complete, or you're working on the kernel of some system that is used by millions of people.