In my opinion it's this sort of short-sighted thinking that has cursed the Python project. "Everyone uses CPython" leads to "let's just let third-party packages depend on any part of CPython", which leads to "we can't optimize CPython because it might break a dependency", which leads to "CPython is too slow, the ecosystem needs to invest heavily in C extensions (including NumPy)", which leads to "we can't create alternate Python implementations because the ecosystem depends concretely on CPython"[^1], and probably also to the mess that is Python package management.
I'm not sure the NumPy/Pandas hegemony over Python scientific computing will last; eventually the ecosystem might move toward Arrow or something else. In this case it's probably not such a big deal, because Arrow's mainstream debut will probably predate any serious adoption of Cython. But if it didn't, the latter would effectively preclude the former: Arrow becomes infeasible because everyone is using Cython/NumPy and Cython/Arrow performance is too poor to justify the move; and since no one is making the move, it's not worth investing in an Arrow special case in Cython, so no one gets the benefits that Arrow confers over NumPy/Pandas.
[^1]: Yes, PyPy exists and its maintainers have done yeoman's work striving for compatibility with the ecosystem, and still (last I checked) you couldn't do such exotic things as "talk to a Postgres database via a production-ready (read: maintained, performant, secure, tested, stable, etc.) package".
You are mixing up "how things are implemented" with "stuff that data scientists interact with."
Arrow is a low-level implementation detail, like BLAS. "Using" Arrow in data science in Python would mean implementing an Arrow-backed Pandas (or Pandas-like) DataFrame.
Your rank-and-file data scientist doesn't even know that Arrow exists, let alone that you can theoretically implement arrays, matrices, and data frames backed by it.
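To make that concrete, here's a minimal sketch of the division of labor, assuming pyarrow (and pandas, for the last step) is installed; the specific calls are illustrative, not a prescription:

```python
# Sketch: what "Arrow-backed" means at the API level. The columnar data
# lives in Arrow's memory layout, not in NumPy ndarrays.
import pyarrow as pa
import pyarrow.compute as pc

table = pa.table({"a": [1, 2, 3], "b": [10.0, 20.0, 30.0]})

# Computation goes through Arrow's compute kernels, not NumPy.
total = pc.sum(table["b"])
print(total.as_py())  # 60.0

# The data scientist, meanwhile, would only ever see a familiar
# DataFrame-shaped object on top (here via conversion to pandas).
df = table.to_pandas()
print(df)
```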
If you want to break the hegemony of Numpy, you will have to reimplement Numpy using CFFI instead of the CPython C API. There is no other way, unless you get everyone to switch to Julia.
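For the curious, the CFFI route looks roughly like this in ABI mode; a hypothetical sketch that wraps libm's sqrt rather than a real array kernel (assumes a Unix-like system where libm can be loaded by name):

```python
# Minimal CFFI ABI-mode sketch: call native code without touching the
# CPython C API. A real NumPy replacement would wrap its own C kernels
# this way instead of libm.
from cffi import FFI

ffi = FFI()
ffi.cdef("double sqrt(double x);")  # declare the C signature we need
libm = ffi.dlopen("m")              # load libm by name (Unix-like systems)

print(libm.sqrt(2.0))               # 1.4142135623730951
```

The relevant property is that alternative interpreters like PyPy can JIT straight through CFFI calls, whereas the CPython C API pins an extension to CPython's internals.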
Scientists are typically not trained computer scientists. They do not care about, nor appreciate, these technical arguments. They have two datasets, A and B, and want their sum, expressed in a neat, tidy form:
C = A + B
Python with NumPy services exactly that need. We all have our grievances with the status quo, but Python needs data-processing acceleration from somewhere. In my view, Python needs to implement a JIT to alleviate 95% of the need for NumPy.
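Spelled out in full, that's the entire program (a minimal runnable version):

```python
import numpy as np

# Two datasets, expressed as arrays.
A = np.array([1.0, 2.0, 3.0])
B = np.array([10.0, 20.0, 30.0])

# Their sum, in a neat tidy form. NumPy dispatches this to a compiled
# elementwise loop; no Python-level iteration happens.
C = A + B
print(C)  # [11. 22. 33.]
```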
Scientists aren't the only players at the scientific computing table these days. There's increasing demand to bring data science applications to market, which implies engineering requirements in addition to the science requirements.
> In my view, Python needs to implement a JIT to alleviate 95% of the need for NumPy.
NumPy is just statically typed arrays. That seems like the best case for AOT compilers, no? I'm all for a JIT as well, but I don't have faith in the Python community to get us there.
A JIT works great here too. It would see the iteration and the associated mathematical calculations as a hotspot and optimize only those parts, which is easy since the arrays are statically typed and sized.
I say this as a Computer Scientist at NASA who tends to rewrite scientific code in straight C. But for many workloads, a JIT would make my team more productive, basically for free from the user's perspective.
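To illustrate the hotspot point, here's a sketch of what such a loop looks like under an existing JIT for numerical Python, Numba; a hypothetical interpreter-level JIT would compile the same loop without any decorator:

```python
import numpy as np
from numba import njit

@njit  # compile this function to machine code on first call
def pairwise_sum(a, b):
    # This explicit loop is the "hotspot": because the arrays are
    # statically typed and sized float64 buffers, it compiles down
    # to a tight native loop with no per-element interpreter overhead.
    out = np.empty_like(a)
    for i in range(a.shape[0]):
        out[i] = a[i] + b[i]
    return out

A = np.arange(1_000_000, dtype=np.float64)
B = np.arange(1_000_000, dtype=np.float64)
C = pairwise_sum(A, B)  # first call compiles; subsequent calls run at C speed
```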
Yes, a JIT would work well too, and I would strictly prefer a JIT, but I don't think we're likely to see a JIT-compiled Python with good ecosystem compatibility in the next decade. Good luck to the people who are using Python these days, but I'm tired of fighting the same major problems we had 15 years ago. Other ecosystems solved those problems and are actually improving materially.
Julia was created to tackle problems in applied disciplines (physics, neuroscience, genetics, materials engineering, etc.), so I didn't expect it to be picked up by your everyday app developer or by the overly abstract functional programmer. As an aside, I personally think Julia can do much more than that: I would say it can do at least as much as Python is capable of today, but better.
The ecosystem is slowly expanding beyond the applied disciplines, because when those people need to code something else, e.g. a web site for their research data, they reach, as usual, for the hammer they already know.
I'm really interested in Julia's performance for general-purpose application development. It's great that it can work with large numerical arrays very efficiently, but what about the large, diverse graphs of small objects you commonly find in general-purpose applications? I think I want a hybrid between Julia and the JVM, or something.
Where I'm able to, yes, but Numba's poor treatment of SWIG makes interfacing with standard tooling a royal pain. In many cases I've rewritten Numba code in Cython or C for this very reason.
The hope is to create a new C API that doesn't expose CPython interpreter details and can easily be supported by interpreters other than CPython, and then to port C-based extensions to it. Sadly, they don't seem to be making much progress in 2020/2021. And I don't think it will eliminate Cython/NumPy overhead entirely, so Cython adding NumPy-specific features would still improve performance.
Also, PyPy now has a compatibility shim for CPython extension modules. But last time I checked, it was slower than CPython at running one of my NumPy-based programs (corrscope), due to the interfacing overhead.