We were very heavy numba users at my former company. I would even go so far as to say numba was probably the biggest computational enabler for the product. I’ve also made a small contribution to the library.
It’s a phenomenal library for developing novel computationally intensive algorithms on numpy arrays. It’s also more versatile than Jax.
In presentations, I’ve heard Leland McInnes credit numba often when he speaks about his development of UMAP.
We built a very computationally intensive portion of our application with it and it has been running in production, stable, for several years now.
It’s not suitable for all use cases. But I recommend testing it if you need to do somewhat complex calculations iterating over numpy arrays for which standard numpy or scipy functions don’t exist. Even then, we were often surprised at how much we could speed up some of those calculations by moving them inside numba.
Edit: an example of a very small function I wrote with numba that speeds up an existing numpy function (note - written years ago, and numba has undergone quite a lot of changes since!): https://github.com/grej/pure_numba_alias_sampling
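Not the linked code itself, but a minimal hypothetical sketch of the same pattern: an explicit loop over a numpy array with no clean vectorized equivalent (here an early-exit threshold search), compiled with @njit.

    import numpy as np
    from numba import njit

    @njit(cache=True)
    def first_crossing(values, threshold):
        # Return the index of the first element exceeding `threshold`,
        # or -1 if none does. A plain loop like this is slow in pure
        # Python but compiles to tight machine code under @njit.
        for i in range(values.shape[0]):
            if values[i] > threshold:
                return i
        return -1

    x = np.random.default_rng(0).standard_normal(1_000_000)
    print(first_crossing(x, 3.0))  # first call includes JIT compilation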
Disclosure - I now work for Anaconda, the company that sponsors the numba project.
Numba compiles functions down to machine code or CUDA kernels; that's it.
XLA is "higher level" than what Numba produces.
You may be able to get the equivalent of jax via numba+numpy+autograd[1], but I haven't tried it before.
IMHO, jax is best thought of as a numerical computation library that happens to include autograd, vmapping, pmapping and provides a high level interface for XLA.
I have built a numerical optimisation library with it, and although a few things became verbose, it was a rather pleasant experience. The natural vmapping made everything a breeze, and I didn't have to write the gradients for my test functions, except for special cases involving exponents and logs that needed a bit of delicate care.
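For anyone who hasn't used it, that pattern looks roughly like this (a toy sketch with a made-up test function, not the parent's library): jax.grad supplies the gradient automatically and jax.vmap batches evaluation over many points.

    import jax
    import jax.numpy as jnp

    def rosenbrock(x):
        # Classic optimisation test function; scalar output.
        return jnp.sum(100.0 * (x[1:] - x[:-1] ** 2) ** 2 + (1.0 - x[:-1]) ** 2)

    grad_f = jax.jit(jax.grad(rosenbrock))   # gradient without hand-deriving it
    batch_f = jax.vmap(rosenbrock)           # evaluate a whole batch of points at once

    xs = 1.5 * jnp.ones((8, 5))              # 8 candidate points in 5 dimensions
    print(batch_f(xs).shape)                 # (8,)
    print(grad_f(xs[0]).shape)               # (5,)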
I’ve been disappointed with Jax, which I was trying to use for reverse-mode automatic differentiation.
The issue is that XLA JIT compilation is very slow and easily adds half a minute of overhead to the first call of the base function just by using jax.numpy instead of numpy, which made it a non-starter for my use case. It’s definitely optimised for large flow computations where the JIT overhead is dwarfed by the rest.
In the end I reverted to using autograd which did the job fine.
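For reference, the autograd fallback looks roughly like this (illustrative function only, not the parent's code); there is no compilation step, so no first-call overhead:

    import autograd.numpy as anp   # thin numpy wrapper that records operations
    from autograd import grad

    def loss(w):
        # Made-up scalar function, just for illustration.
        return anp.sum(anp.tanh(w) ** 2)

    dloss = grad(loss)             # reverse-mode gradient, evaluated eagerly
    print(dloss(anp.linspace(-1.0, 1.0, 5)))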
I had never heard of Taichi until now; I’m curious how it compares.
Software from our group (cij[1], qha[2]) was developed when numba seemed to be the best option for JIT. In hindsight it caused more pain than expected: it generated a lot of deprecation warnings due to an unstable API, it locked numpy to a specific version (1.21, if I remember correctly) because of compatibility issues, and when the M1 Macs came out there was for a long time no llvmlite port to the new platform, so the code could not run on those machines.
If I had to do it again I would just use plain numpy, or use JAX from Google if JIT is really necessary.
I have personally gotten a lot of mileage from just writing the compute heavy parts of my code in C++ and exposing it to Python with a tool like PyBind11 [1] or NumpyEigen [2]. I find tools like numba and cython to be more trouble than they're worth.
I prototype in python or whatever; then, if the project survives into market and has legs, I either buy more hardware or rewrite the expensive parts in C++.
That reduces calendar time, risk, and cost. And I'm likely to make better decisions once the code and market are better understood, after the prototype has been tested under real-world conditions and the requirements have changed (like they always seem to do).
+1 for pybind11. I wrote Python bindings using pybind11 for two C++ based simulators: MOOSE and Smoldyn. It was surprisingly easy to use given how badly the Python C-API and C++ tooling suck. Though you do have to create binary wheels for every Python version and platform separately.
pypy is great if you are not already using numpy heavily. Pure Python libraries like networkx and myhdl showed a 20x speedup when I used it a couple of years ago. For pure Python code, pypy provides a free lunch.
As a slight contrast to the other responses, I found setting up maturin (Rust + Python) very straightforward since the documentation is recent, and I find it's easy to write parsers in Rust because the ADT syntax is very terse.
When I wrote my bachelor thesis years back I worked on a particle-in-cell code [1] that makes heavy use of numba for GPU kernels. At the time it was the most convenient way to do that from Python. I remember spending weeks optimizing these kernels to eke out every last bit of performance I could (which interestingly enough did eventually involve using atomic operations and introducing a lot of variables [2] instead of using arrays everywhere, to keep things in registers instead of slower caches).
I remember the team being really responsive to feature requests back then and I had a lot of fun working with it. IIRC compared to using numpy we managed to get speedups of up to 60x for the most critical pieces of code.
As someone who uses the python numerical computing libraries extensively, Numba is my biggest disappointment in the ecosystem.
The main problem with Numba is that simple functions are easy enough, and this lulls you into a false sense of security: that things will work.
Unfortunately, every time it turns into a hair-tearing exercise of trying to structure the code so that none of Numba's vast array of unpredictable edge cases is hit.
The error messages are often infuriatingly bad.
At this point I've banned Numba from our codebase. If there's a case for Numba, we just do it in C++ instead.
I use numba a lot nowadays. It works perfectly well on all platforms (Linux, Windows, Mac, even the M1) and gives speedups as expected (a few percent for already well-vectorized numpy code, and extra-large speedups for loopy code). I strongly recommend it for the performance-critical parts of your code. Many things are not supported yet, so it has to be used with care. I remember needing a missing scipy special function, and in the end I implemented it myself by vectorizing math.erf: it was surprisingly easy to do and a big success in terms of performance.
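Something along these lines, presumably (a sketch, not the parent's actual code): numba's @vectorize turns a scalar math.erf call into a NumPy ufunc that broadcasts over whole arrays.

    import math
    import numpy as np
    from numba import vectorize

    @vectorize(["float64(float64)"])
    def erf_nb(x):
        # Scalar math.erf compiled into a ufunc that works element-wise on arrays.
        return math.erf(x)

    x = np.linspace(-3.0, 3.0, 1_000_000)
    y = erf_nb(x)   # no Python-level loop; compiled on first call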
As a side note, it is now easy to write Rust code that can be used directly from Python - https://github.com/PyO3/pyo3.
It cannot use NumPy and other libraries (since it is Rust), but at the same time I see its potential for creating high-performance code to be used in a Python numerical environment.
I am really intrigued by the Codon project, which aims to be a JIT compiler for Python with Numba/JAX decorator syntax: https://github.com/exaloop/codon
It's not going to take off, since it doesn't have full (or even most) API compatibility with Python. Numba seems strictly better because it can interop with Python.
I don't think PyPy uses LLVM, so I wonder which produces better code.
That said, they're targeted at different audiences. I feel Numba is targeted at data science and machine learning and even AI.
I feel a large portion of using or programming a computer is structural and not the actual work of adding numbers together. Very little of the code generated does the useful part a computer does: addition. The rest is control flow management and data placement! It's all preparation for the code to do an addition. The hard part is putting together the structure for the computer to do things that are useful.
So we invented methods, variables, classes, functions, closures, and expressions to make creating that structure easier.
I thought about creating a language which tries to eliminate the structure that most programs accumulate and focus on the critical addition or calculation and let the computer do the arrangement. A JIT compiler for structure.
I'm discovering JS at the moment. I don't fully understand the async model, but a Promise seems like a generic constraint of "the result is now available".
Maybe you could express the "flow management" as other constraints?
I'm thinking of the code for your average CRUD app, or even a desktop compositor. A compositor copies pixels from multiple places into one place. Surely that can be defined with a simple loop? But no, there are hundreds of APIs in the way. Add Wayland and X11 and you have something that is opaque and understood by very few people.
The motivation behind my comment was that most of programming computers is gluing together APIs to shift data from one place to another before doing something useful with it. The APIs themselves do very little addition or subtraction of data; they actually just move data around and put it in the right place.
Maybe defining where things should be, declaratively, in order to do a calculation would be useful. Then the shape of the calculation defines the data structure, rather than the data structure defining the calculation.
For a compositor, I'd think the set of pixels being changed (an "invalidation") is a good example: the constraint would be to update it on the screen.
Unchanged? Don't bother, leave it as-is. I think that's how Intel power saving works.
Now think about the MVC model: some changes in the data could result in a change in the view if the data currently shown on screen is what has changed - like triggers in SQL.
I wonder if you could have everything work like that?
You're right, and thank you for bringing async up.
And thank you for bringing up constraint propagation.
One of my ideas is defining formulas that act as materialized views over other materialized views. We can layer materialized views on top of one another and then work out a derived formula that is potentially closer to what we actually want; rather than calculating each underlying view, we can compute the derived formula directly.
Is this differential dataflow?
I think it's an application of algebra, and JIT compilers could do it to expressions if we fed the symbolic expressions of programming languages into sympy or a computer algebra system.
In React, the framework diffs virtual DOM nodes to see what has changed. There is also dirty-region checking (damage regions) in old games. These problems are mathematically well defined.
> I think it's an application of algebra, and JIT compilers could do it to expressions if we fed the symbolic expressions of programming languages into sympy or a computer algebra system
Yes, and the constraints could then be used to reduce the computational cost, giving higher performance and lower latency.
A while back, a good friend (we even shared HN accounts for a while lol) pointed me to pipelinedb: a PostgreSQL timeseries plugin for continuously updating """materialized views"""
I use a lot of quotes because it wasn't like either a regular view (computed when you query it, which introduces latency) or a materialized view (frozen, needs to be refreshed, same problem), but more like the NO_HZ tickless kernel: the update of the calculations was triggered by the arrival of new data, not the passage of time (which would be wasteful).
The general approach makes a lot of sense to me, and I see how it could be used for more generic problems.
I've only used numba once but I was really impressed. We have an analysis at work that runs hundreds of times a day which uses a Hampel filter written in numpy, but still requires iterating over an array. Just adding a @numba.jit decorator above the function gave us a 10x speed improvement.
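For context, a Hampel filter of that sort looks roughly like this (an illustrative sketch, not the actual production code): the rolling median/MAD logic needs an explicit loop, which is exactly what @numba.njit handles well.

    import numpy as np
    import numba

    @numba.njit(cache=True)
    def hampel(x, window=5, n_sigmas=3.0):
        # Replace points that deviate from the local median by more than
        # n_sigmas * (scaled MAD) with that median. The explicit loop is
        # what makes plain numpy awkward here and numba fast.
        k = 1.4826  # scale factor so MAD estimates the standard deviation
        y = x.copy()
        for i in range(window, x.shape[0] - window):
            win = x[i - window:i + window + 1]
            med = np.median(win)
            mad = k * np.median(np.abs(win - med))
            if np.abs(x[i] - med) > n_sigmas * mad:
                y[i] = med
        return y

    signal = np.sin(np.linspace(0.0, 10.0, 500))
    signal[100] = 10.0            # inject an outlier
    clean = hampel(signal)        # first call compiles; later calls are fast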
* PyPy JITs everything, so it can do _normal_ Python numerical code that is quite fast and regular Python code that is fast. However, its interactions with libraries like NumPy add overhead, and it seems like it can't JIT code that interacts with NumPy in a useful way (AFAIK, would be happy to be proven wrong). So not useful for optimizing numeric functions that interact with libraries like NumPy.
* Plain old NumPy and friends. This is great... if the operation you want is already available as a "vectorized" API. "Vectorized" in this context does NOT mean SIMD, it's a Python-specific usage, see below.
* Numba: JIT compilation specifically focused on interop with NumPy and similar libraries. Lets you write a subset of Python, but unlike NumPy you can use for loops and still go fast.
* AOT compilation: Cython, Rust, C++, etc. You have a longer feedback loop, but you have a full programming language, especially if you avoid Cython. OTOH Cython has nicer Python interop, so for a simple just-a-little-addon it can be easier to use if you don't already know Rust. You really shouldn't be writing new C++ in this day and age (but wrapping an existing library is useful). Like C++, Cython doesn't help with memory safety. Cython also suffers from involving two compilers, so debugging can be harder, especially if you use the C++ interop; if you are wrapping an existing C++ library, I'd probably start with PyBind11 based on long-ago experience with Boost::Python.
Good overview. "Vectorized" is an old term that's been around since the early days of supercomputers and maybe before; I'm not sure where it came from. Numba does a bunch of different things for code written to the Numpy API, including CUDA acceleration. Certain machine learning frameworks like PyTorch and JAX also roughly follow the Numpy API because it is widely familiar and easy enough to work with. The kind of code that benefits from this kind of acceleration is hard to write yourself. A lot of workloads lean on linear algebra operations that are conceptually simple but complicated to implement with good performance, which is why all of this tooling isn't just a couple thousand lines of C. Good overview of matmul on CPU:
Cray supercomputers used to have special "vector" units that would perform operations on multiple scalars (e.g. 128 doubles) in parallel, a bit like GPU units. Any algorithm that could be cast in a form that benefited from this type of parallelism would be called vectorizable. Linear algebra obviously fits perfectly (but, depending on the problem, you might need to juggle the vector dimensions).
Vectorizing code was fairly straightforward using later versions of Fortran. It was all quite sweet and productive, but could not provide the required HPC scaling, so was eventually abandoned in favor of massively parallel designs.
I think this is backwards: the early Crays (and maybe other machines) used essentially serial but aggressively pipelined CPUs, while modern CPUs do actually execute operations in parallel via multiple ALUs; from a programmer's point of view, though, they are kind of similar (fixed-size vectors).
What is a vector unit vs. what is SIMD etc. can be confusing, but at least some Wikipedians are sure the early Crays had true vector processing (which may not be parallel processing according to your definition).
Yes, in the Cray design a vector instruction processes one item per clock so total latency increases with number of elements. Only parallel in the sense of pipelining being a form of parallelism. Something like an AVX-equipped Intel CPU processes all elements in parallel to deliver a result in an essentially constant number of cycles.
Indeed! Converting one's entire code base to a different language ecosystem, finding equivalents to each of your third-party dependencies, is less painful than employing a library to selectively compile a few performance bottlenecks in your code.
(Modules like PyJulia facilitate a more incremental approach.)
I would. It's much safer than Python which has a package ecosystem that is known to be critically unsafe. https://pytorch.org/blog/compromised-nightly-dependency/ is just the latest example of security issues it's been having, https://moyix.blogspot.com/2022/09/someones-been-messing-wit... is general numerical incorrectness which is non-local and cannot be turned off, and that's not even getting to the specific inaccuracies of Numba. I'd switch away from a numerically incorrect security issue today!
Are there a standard set of benchmarks these python JIT projects use?
I’m very interested in adding something like this to some projects but it needs to be 10-100x faster to be worth the hassle. Otherwise, for our applications, it’s a better time investment to rewrite in Go and get the speed and pro tooling than to further optimize python.
I’m surprised you’d rewrite in Go rather than Julia. I’d expect Julia would be much easier to translate to from Python and have much better support for any mathematical operation.
If you have numeric code that's too slow in Numba your next stop will likely involve a big multi-language effort and GPU specialists and none of that would be in Go except maybe a wrapper for your apps.
Does anyone know how this approach of adding decorators to numerical functions compare to Elixir's Nx approach of compiling those functions through a specialized macro for numerical computations? Would Numba benefit if (PEP 638?) macros were added to python?
I think numba still makes sense for loopy algorithms, but not so much if you're more vector oriented, given that Jax is more or less a drop-in replacement for numpy and is shockingly fast.
JAX introduced a lot of cool concepts (e.g. autobatching (vmap), autoparallel (pmap)) and supported a lot of things that PyTorch didn't (e.g. forward mode autodiff).
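As an illustration of the forward-mode piece (toy function and values, made up for the example), jax.jvp evaluates a function and a Jacobian-vector product in a single forward pass:

    import jax
    import jax.numpy as jnp

    def f(x):
        # Toy vector-valued function, just for illustration.
        return jnp.stack([jnp.sin(x[0]) * x[1], x[0] ** 2 + x[1]])

    x = jnp.array([0.3, 2.0])
    v = jnp.array([1.0, 0.0])   # tangent direction

    # Forward-mode autodiff: f(x) and J(x) @ v together, without building
    # the full Jacobian.
    y, jvp_out = jax.jvp(f, (x,), (v,))
    print(y, jvp_out)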
And at least for my applications (scientific computing), it was much faster (~100x) due to a much better JIT compiler and reduced Python overhead.
...but! PyTorch has worked hard to introduce all of the former, and the recent PyTorch 2 announcement was primarily about a better JIT compiler for PyTorch. (I don't think anyone has done serious non-ML benchmarks for this though, so it remains to be seen how this holds up.)
There are still a few differences. E.g. JAX has a better differential equation solving ecosystem. PyTorch has a better protein language model ecosystem. JAX offers some better power-user features like custom vmap rules. PyTorch probably has a lower barrier to entry.
(FWIW I don't know how either hold up specifically for DSP.)
I'd honestly suggest just trying both; always nice to have a broader selection of tools available.
That's a bit too cynical, I think. People post follow-up/related stories because the brain likes to follow chains of associations.
You're right that these chains tend towards already-familiar associations, which lower their value as HN stories. The best HN stories are the ones that can't be predicted from any existing sequence: https://hn.algolia.com/?dateRange=all&page=0&prefix=true&sor...