We were very heavy numba users at my former company. I would even go so far as to say numba was probably the biggest computational enabler for the product. I’ve also made a small contribution to the library.
It’s a phenomenal library for developing novel computationally intensive algorithms on numpy arrays. It’s also more versatile than Jax.
In presentations, I’ve heard Leland McInnes credit numba often when he speaks about his development of UMAP.
We built a very computationally intensive portion of our application with it and it has been running in production, stable, for several years now.
It’s not suitable for all use cases. But I recommend testing it if you need to do somewhat complex calculations iterating over numpy arrays for which standard numpy or scipy functions don’t exist. Even then, we were often surprised at how much we could speed up some of those calculations by moving them inside numba.
Edit: an example of a very small function I wrote with numba that speeds up an existing numpy function (note - written years ago, and numba has undergone quite a lot of changes since!): https://github.com/grej/pure_numba_alias_sampling
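Not the linked code itself, but a minimal hypothetical sketch of the same pattern: an explicit loop over a numpy array with no clean vectorized equivalent (here an early-exit threshold search), compiled with @njit.

    import numpy as np
    from numba import njit

    @njit(cache=True)
    def first_crossing(values, threshold):
        # Return the index of the first element exceeding `threshold`,
        # or -1 if none does. A plain loop like this is slow in pure
        # Python but compiles to tight machine code under @njit.
        for i in range(values.shape[0]):
            if values[i] > threshold:
                return i
        return -1

    x = np.random.default_rng(0).standard_normal(1_000_000)
    print(first_crossing(x, 3.0))  # first call includes JIT compilation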
Disclosure - I now work for Anaconda, the company that sponsors the numba project.
Numba compiles functions down to machine code or CUDA kernels; that's it.
XLA is "higher level" than what Numba produces.
You may be able to get the equivalent of jax via numba+numpy+autograd[1], but I haven't tried it before.
IMHO, jax is best thought of as a numerical computation library that happens to include autograd, vmapping, pmapping and provides a high level interface for XLA.
I have built a numerical optimisation library with it, and although a few things became verbose, it was a rather pleasant experience. The natural vmapping made everything a breeze, and I didn't have to write the gradients for my test functions, except for special cases involving exponents and logs that needed a bit of delicate care.
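For anyone who hasn't used it, that pattern looks roughly like this (a toy sketch with a made-up test function, not the parent's library): jax.grad supplies the gradient automatically and jax.vmap batches evaluation over many points.

    import jax
    import jax.numpy as jnp

    def rosenbrock(x):
        # Classic optimisation test function; scalar output.
        return jnp.sum(100.0 * (x[1:] - x[:-1] ** 2) ** 2 + (1.0 - x[:-1]) ** 2)

    grad_f = jax.jit(jax.grad(rosenbrock))   # gradient without hand-deriving it
    batch_f = jax.vmap(rosenbrock)           # evaluate a whole batch of points at once

    xs = 1.5 * jnp.ones((8, 5))              # 8 candidate points in 5 dimensions
    print(batch_f(xs).shape)                 # (8,)
    print(grad_f(xs[0]).shape)               # (5,)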
I’ve been disappointed with Jax, which I was trying to use for reverse-mode automatic differentiation.
The issue is that XLA JIT compilation is very slow and easily adds half a minute of overhead to the first call of the base function just by using jax.numpy instead of numpy, which made it a non-starter for my use case. It’s definitely optimised for large flow computations where the JIT overhead is dwarfed by the rest.
In the end I reverted to using autograd which did the job fine.
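For reference, the autograd fallback looks roughly like this (illustrative function only, not the parent's code); there is no compilation step, so no first-call overhead:

    import autograd.numpy as anp   # thin numpy wrapper that records operations
    from autograd import grad

    def loss(w):
        # Made-up scalar function, just for illustration.
        return anp.sum(anp.tanh(w) ** 2)

    dloss = grad(loss)             # reverse-mode gradient, evaluated eagerly
    print(dloss(anp.linspace(-1.0, 1.0, 5)))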
I had never heard of Taichi until now; I’m curious how it compares.
Software from our group (cij[1], qha[2]) was developed when numba seemed to be the best option for JIT. In hindsight it caused more pain than expected: it generated a lot of deprecation warnings due to an unstable API, it locked numpy to a specific version (1.21, if I remember correctly) because of compatibility issues, and when the M1 Macs came out there was for a long time no llvmlite port to the new platform, so the code could not run on those machines.
If I had to do it again I would just use plain numpy, or use JAX from Google if JIT is really necessary.
I have personally gotten a lot of mileage from just writing the compute heavy parts of my code in C++ and exposing it to Python with a tool like PyBind11 [1] or NumpyEigen [2]. I find tools like numba and cython to be more trouble than they're worth.
I prototype in python or whatever; then, if the project survives into market and has legs, I either buy more hardware or rewrite the expensive parts in C++.
That reduces calendar time, risk, and cost. And I'm likely to make better decisions once the code and market are better understood, after the prototype has been tested under real-world conditions and the requirements have changed (like they always seem to do).
+1 for pybind11. I wrote Python bindings using pybind11 for two C++ based simulators: MOOSE and Smoldyn. It was surprisingly easy to use given how badly the Python C-API and C++ tooling suck. Though you do have to create binary wheels for every Python version and platform separately.
pypy is great if you are not already using numpy heavily. Pure Python libraries like networkx and myhdl showed a 20x speedup when I used it a couple of years ago. For pure Python code, pypy provides a free lunch.
As a slight contrast to the other responses, I found setting up maturin (Rust + Python) very straightforward since the documentation is recent, and I find it's easy to write parsers in Rust because the ADT syntax is very terse.
When I wrote my bachelor thesis years back I worked on a particle-in-cell code [1] that makes heavy use of numba for GPU kernels. At the time it was the most convenient way to do that from Python. I remember spending weeks optimizing these kernels to eke out every last bit of performance I could (which interestingly enough did eventually involve using atomic operations and introducing a lot of variables [2] instead of using arrays everywhere, to keep things in registers instead of slower caches).
I remember the team being really responsive to feature requests back then and I had a lot of fun working with it. IIRC compared to using numpy we managed to get speedups of up to 60x for the most critical pieces of code.
As someone who uses the python numerical computing libraries extensively, Numba is my biggest disappointment in the ecosystem.
The main problem with Numba is that simple functions are easy enough, and this lulls you into a false sense of security: that things will work.
Unfortunately, every time it turns into a hair-tearing exercise of trying to structure the code so that none of Numba's vast array of unpredictable edge cases is hit.
The error messages are often infuriatingly bad.
At this point I've banned Numba from our codebase. If there's a case for Numba, we just do it in C++ instead.
I use numba a lot nowadays. It works perfectly well on all platforms (Linux, Windows, Mac, even the M1) and gives speedups as expected (a few percent for already well-vectorized numpy code, and extra-large speedups for loopy code). I strongly recommend it for the performance-critical parts of your code. Many things are not supported yet, so it has to be used with care. I remember needing a missing scipy special function, and in the end I implemented it myself by vectorizing math.erf: it was surprisingly easy to do and a big success in terms of performance.
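Something along these lines, presumably (a sketch, not the parent's actual code): numba's @vectorize turns a scalar math.erf call into a NumPy ufunc that broadcasts over whole arrays.

    import math
    import numpy as np
    from numba import vectorize

    @vectorize(["float64(float64)"])
    def erf_nb(x):
        # Scalar math.erf compiled into a ufunc that works element-wise on arrays.
        return math.erf(x)

    x = np.linspace(-3.0, 3.0, 1_000_000)
    y = erf_nb(x)   # no Python-level loop; compiled on first call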
As a side note, it is now easy to write Rust code that can be used directly from Python - https://github.com/PyO3/pyo3.
It cannot use NumPy and other libraries (since it is Rust), but at the same time I see its potential for creating high-performance code to be used in a Python numerical environment.
I am really intrigued by the Codon project, which aims to be a JIT compiler for Python with Numba/JAX decorator syntax: https://github.com/exaloop/codon
It's not going to take off, since it doesn't have full (or even most) API compatibility with Python. Numba seems strictly better because it can interop with Python.
I don't think PyPy uses LLVM, so I wonder which produces better code.
That said, they're targeted at different audiences. I feel Numba is targeted at data science and machine learning and even AI.
I feel a large portion of using or programming a computer is structural and not the actual work of adding numbers together. Very little of the code generated does the useful part a computer does: addition. The rest is control flow management and data placement! It's all preparation for the code to do an addition. The hard part is putting together the structure for the computer to do things that are useful.
So we invented methods, variables, classes, functions, closures, and expressions to make creating that structure easier.
I thought about creating a language which tries to eliminate the structure that most programs accumulate and focus on the critical addition or calculation and let the computer do the arrangement. A JIT compiler for structure.
I'm discovering JS at the moment. I don't fully understand the async model, but a Promise seems like a generic constraint of "the result is now available".
Maybe you could express the "flow management" as other constraints?
I'm thinking of the code for your average CRUD app, or even a desktop compositor. A compositor copies pixels from multiple places into one place. Surely that can be defined with a simple loop? But no, there are hundreds of APIs in the way. Add Wayland and X11 and you have something that is opaque and understood by very few people.
The motivation behind my comment was that most of programming computers is gluing together APIs to shift data from one place to another before doing something useful with it. The APIs themselves do very little addition or subtraction of data; they actually just move data around and put it in the right place.
Maybe defining where things should be, declaratively, in order to do a calculation would be useful. Then the shape of the calculation defines the data structure, rather than the data structure defining the calculation.
For a compositor, I'd think the set of pixels being changed (an "invalidation") is a good example: the constraint would be to update it on the screen.
Unchanged? Don't bother, leave it as-is. I think that's how Intel power saving works.
Now think about the MVC model: some changes in the data could result in a change in the view if the data currently shown on screen is what has changed - like triggers in SQL.
I wonder if you could have everything work like that?
You're right, and thank you for bringing async up.
And thank you for bringing up constraint propagation.
One of my ideas is defining formulas that act as materialized views over other materialized views. We can layer materialized views on top of one another and then work out a derived formula that is potentially closer to what we actually want; rather than calculating each underlying view, we can compute the derived formula directly.
Is this differential dataflow?
I think it's an application of algebra, and JIT compilers could do it to expressions if we fed the symbolic expressions of programming languages into sympy or a computer algebra system.
In React, the framework diffs virtual DOM nodes to see what has changed. There is also dirty-region checking (damage regions) in old games. These problems are mathematically well defined.
> I think it's an application of algebra, and JIT compilers could do it to expressions if we fed the symbolic expressions of programming languages into sympy or a computer algebra system
Yes, and the constraints could then be used to reduce the computational cost, giving higher performance and lower latency.
A while back, a good friend (we even shared HN accounts for a while lol) pointed me to pipelinedb: a PostgreSQL timeseries plugin for continuously updating """materialized views"""
I use a lot of quotes because it wasn't like either a regular view (computed when you query it, which introduces latency) or a materialized view (frozen, needs to be refreshed, same problem), but more like the NO_HZ tickless kernel: the update of the calculations was triggered by the arrival of new data, not the passage of time (which would be wasteful).
The general approach makes a lot of sense to me, and I see how it could be used for more generic problems.
I've only used numba once but I was really impressed. We have an analysis at work that runs hundreds of times a day which uses a Hampel filter written in numpy, but still requires iterating over an array. Just adding a @numba.jit decorator above the function gave us a 10x speed improvement.
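For context, a Hampel filter of that sort looks roughly like this (an illustrative sketch, not the actual production code): the rolling median/MAD logic needs an explicit loop, which is exactly what @numba.njit handles well.

    import numpy as np
    import numba

    @numba.njit(cache=True)
    def hampel(x, window=5, n_sigmas=3.0):
        # Replace points that deviate from the local median by more than
        # n_sigmas * (scaled MAD) with that median. The explicit loop is
        # what makes plain numpy awkward here and numba fast.
        k = 1.4826  # scale factor so MAD estimates the standard deviation
        y = x.copy()
        for i in range(window, x.shape[0] - window):
            win = x[i - window:i + window + 1]
            med = np.median(win)
            mad = k * np.median(np.abs(win - med))
            if np.abs(x[i] - med) > n_sigmas * mad:
                y[i] = med
        return y

    signal = np.sin(np.linspace(0.0, 10.0, 500))
    signal[100] = 10.0            # inject an outlier
    clean = hampel(signal)        # first call compiles; later calls are fast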
* PyPy JITs everything, so it can do _normal_ Python numerical code that is quite fast and regular Python code that is fast. However, its interactions with libraries like NumPy add overhead, and it seems like it can't JIT code that interacts with NumPy in a useful way (AFAIK, would be happy to be proven wrong). So not useful for optimizing numeric functions that interact with libraries like NumPy.
* Plain old NumPy and friends. This is great... if the operation you want is already available as a "vectorized" API. "Vectorized" in this context does NOT mean SIMD, it's a Python-specific usage, see below.
* Numba: JIT compilation specifically focused on interop with NumPy and similar libraries. Lets you write a subset of Python, but unlike NumPy you can use for loops and still go fast.
* AOT compilation: Cython, Rust, C++, etc. You have a longer feedback loop, but you have a full programming language, especially if you avoid Cython. OTOH Cython has nicer Python interop, so for a simple just-a-little-addon it can be easier to use if you don't already know Rust. You really shouldn't be writing new C++ in this day and age (but wrapping an existing library is useful). Like C++, Cython doesn't help with memory safety. Cython also suffers from involving two compilers, so debugging can be harder, especially if you use the C++ interop; if you are wrapping an existing C++ library, I'd probably start with PyBind11 based on long-ago experience with Boost::Python.
Good overview. "Vectorized" is an old term that's been around since the early days of supercomputers and maybe before; I'm not sure where it came from. Numba does a bunch of different things for code written to the Numpy API, including CUDA acceleration. Certain machine learning frameworks like PyTorch and JAX also roughly follow the Numpy API because it is widely familiar and easy enough to work with. The kind of code that benefits from this kind of acceleration is hard to write yourself. A lot of workloads lean on linear algebra operations that are conceptually simple but complicated to implement with good performance, which is why all of this tooling isn't just a couple thousand lines of C. Good overview of matmul on CPU:
Cray supercomputers used to have special "vector" units that would perform operations on multiple scalars (e.g. 128 doubles) in parallel, a bit like GPU units. Any algorithm that could be cast in a form that benefited from this type of parallelism would be called vectorizable. Linear algebra obviously fits perfectly (but, depending on the problem, you might need to juggle the vector dimensions).
Vectorizing code was fairly straightforward using later versions of Fortran. It was all quite sweet and productive, but could not provide the required HPC scaling, so was eventually abandoned in favor of massively parallel designs.
I think this is backwards: the early Crays (and maybe other machines) used essentially serial but aggressively pipelined CPUs, while modern CPUs do actually execute operations in parallel via multiple ALUs; from a programmer's point of view, though, they are kind of similar (fixed-size vectors).
What is a vector unit vs. what is SIMD etc. can be confusing, but at least some Wikipedians are sure the early Crays had true vector processing (which may not be parallel processing according to your definition).
Yes, in the Cray design a vector instruction processes one item per clock so total latency increases with number of elements. Only parallel in the sense of pipelining being a form of parallelism. Something like an AVX-equipped Intel CPU processes all elements in parallel to deliver a result in an essentially constant number of cycles.
Indeed! Converting one's entire code base to a different language ecosystem, finding equivalents to each of your third-party dependencies, is less painful than employing a library to selectively compile a few performance bottlenecks in your code.
(Modules like PyJulia facilitate a more incremental approach.)
I would. It's much safer than Python which has a package ecosystem that is known to be critically unsafe. https://pytorch.org/blog/compromised-nightly-dependency/ is just the latest example of security issues it's been having, https://moyix.blogspot.com/2022/09/someones-been-messing-wit... is general numerical incorrectness which is non-local and cannot be turned off, and that's not even getting to the specific inaccuracies of Numba. I'd switch away from a numerically incorrect security issue today!
Are there a standard set of benchmarks these python JIT projects use?
I’m very interested in adding something like this to some projects but it needs to be 10-100x faster to be worth the hassle. Otherwise, for our applications, it’s a better time investment to rewrite in Go and get the speed and pro tooling than to further optimize python.
I’m surprised you’d rewrite in Go rather than Julia. I’d expect Julia would be much easier to translate to from Python and have much better support for any mathematical operation.
If you have numeric code that's too slow in Numba your next stop will likely involve a big multi-language effort and GPU specialists and none of that would be in Go except maybe a wrapper for your apps.
Does anyone know how this approach of adding decorators to numerical functions compare to Elixir's Nx approach of compiling those functions through a specialized macro for numerical computations? Would Numba benefit if (PEP 638?) macros were added to python?
I think numba still makes sense for loopy algorithms, but not so much if you're more vector oriented, given that Jax is more or less a drop-in replacement for numpy and is shockingly fast.
JAX introduced a lot of cool concepts (e.g. autobatching (vmap), autoparallel (pmap)) and supported a lot of things that PyTorch didn't (e.g. forward mode autodiff).
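As an illustration of the forward-mode piece (toy function and values, made up for the example), jax.jvp evaluates a function and a Jacobian-vector product in a single forward pass:

    import jax
    import jax.numpy as jnp

    def f(x):
        # Toy vector-valued function, just for illustration.
        return jnp.stack([jnp.sin(x[0]) * x[1], x[0] ** 2 + x[1]])

    x = jnp.array([0.3, 2.0])
    v = jnp.array([1.0, 0.0])   # tangent direction

    # Forward-mode autodiff: f(x) and J(x) @ v together, without building
    # the full Jacobian.
    y, jvp_out = jax.jvp(f, (x,), (v,))
    print(y, jvp_out)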
And at least for my applications (scientific computing), it was much faster (~100x) due to a much better JIT compiler and reduced Python overhead.
...but! PyTorch has worked hard to introduce all of the former, and the recent PyTorch 2 announcement was primarily about a better JIT compiler for PyTorch. (I don't think anyone has done serious non-ML benchmarks for this though, so it remains to be seen how this holds up.)
There are still a few differences. E.g. JAX has a better differential equation solving ecosystem. PyTorch has a better protein language model ecosystem. JAX offers some better power-user features like custom vmap rules. PyTorch probably has a lower barrier to entry.
(FWIW I don't know how either hold up specifically for DSP.)
I'd honestly suggest just trying both; always nice to have a broader selection of tools available.
That's a bit too cynical, I think. People post follow-up/related stories because the brain likes to follow chains of associations.
You're right that these chains tend towards already-familiar associations, which lower their value as HN stories. The best HN stories are the ones that can't be predicted from any existing sequence: https://hn.algolia.com/?dateRange=all&page=0&prefix=true&sor...