From Python to Numpy (labri.fr)
348 points by haraball on Jan 9, 2017 | 48 comments

As someone who has used numpy for many years and written a great deal of production code using it, I was surprised when I read through this and saw some numpy tricks that I didn't know regarding the speeds of various operations! This is really a fantastic reference that provides a deeper level of understanding of what numpy does under the hood.

One thing I will highlight, which the author only touched on briefly, is that numpy combined with numba is a phenomenal combination for very computationally intensive problems.

The folks at Continuum Analytics have really done a fantastic job building numba (numba.pydata.org), which JIT compiles a subset of python functions using LLVM, and is designed to work seamlessly with numpy arrays. Numba makes it much easier to speed up performance bottlenecks and allows you to easily create numpy ufuncs which can take advantage of array broadcasting.
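To make the ufunc point concrete, here is a minimal sketch, assuming numba is installed. The function name `clipped_diff` and the sample arrays are made up for illustration, and the fallback to `np.vectorize` is only there so the snippet still runs (slowly, uncompiled) without numba.

```python
import numpy as np

try:
    # numba's decorator compiles a scalar function into a NumPy ufunc
    from numba import vectorize
except ImportError:
    # Assumption for this sketch: without numba, use NumPy's uncompiled
    # np.vectorize so the example still runs (no speed-up in that case).
    vectorize = np.vectorize


@vectorize
def clipped_diff(a, b):
    """Scalar kernel: a - b, clipped below at zero."""
    d = a - b
    return d if d > 0 else 0.0


# Broadcasting: a (3, 1) column against a (2,) row gives a (3, 2) result.
col = np.array([[1.0], [2.0], [3.0]])
row = np.array([0.5, 2.5])
result = clipped_diff(col, row)
print(result.shape)
```

The kernel is written for scalars only; the broadcasting over array shapes comes from the ufunc machinery.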

Can I ask how intensively you have used Numba and over what period? I'm interested in how Numba has progressed over the last few years, with a view to using it over Cython.

My team and I looked at Numba a year ago or so for optimisation of a fairly large calculation, and found that the speed-ups were impressive where they worked, but were not consistent or predictable.

We used Cython for large parts, and while there was boilerplate and incantations, the gains were achievable, incremental and certain. The annotation tools were also quite helpful for identifying bottlenecks where Cython code could be effective.

Incidentally, once we decided that Cython was our go-to tool, we often wrote simple looping code rather than vectorised code because it was simpler to transition to Cython, à la Julia.

Sure. I've used numba for the past 1 1/2 years, and I've seen it grow quite a bit. For example, when I first started using it, there was a separate product called numbapro that did all of the GPU JIT, which they've now included in numba.

Whether it would be appropriate vis-à-vis Cython really depends on your application.

First, Cython is fantastic as well, and my endorsement of numba doesn't take anything away from it. Cython is much more fully featured and mature, in the sense that you can really develop your own data structures and control flows. Pretty much anything you could do in C, you could do in Cython. I've written Cython and it also plays very nicely with numpy.

In comparison, Numba is much more limited. You are basically limited to using numpy arrays and matrices as your data structures, and you really need to understand exactly what is going to be used prior to the jit loop or you won't be able to use it in nopython mode (which is where you get the most benefit). It also doesn't really handle strings at all. One fairly recent thing Numba does is allow you to use a list of a single type within nopython mode. Under the hood it handles the malloc for you.

My endorsement of Numba really boils down to ease of integration with existing python codebase. For me the "killer feature" was the ability to simply comment out the @jit or @njit decorator and step through the code like I would step through normal python code, then just turn it back on again when I needed it. The other was that numba gained the ability early in our adoption to chain functions together, so while you can't generate a numpy array in the nopython mode, for instance, you can generate a numpy array in a @jit function (object mode) outside of the nopython mode, then call the looping function (nopython mode) from that jitted function, and numba handles that seamlessly and cuts out a lot of the overhead. For us, our speed of development of a custom algorithm has really been helped greatly by Numba.
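A rough sketch of that workflow, assuming numba is installed. The `USE_JIT` switch, the `maybe_njit` helper and both function names are hypothetical, not part of numba; the only numba call used is `njit`. Flipping the switch off has the same effect as commenting out the decorator, so the loop can be stepped through as ordinary Python.

```python
import numpy as np

USE_JIT = True  # hypothetical switch: set to False to debug as plain Python

try:
    from numba import njit
except ImportError:
    # Assumption for this sketch: without numba, run uncompiled.
    USE_JIT = False


def maybe_njit(func):
    """Hypothetical helper: apply nopython-mode JIT only when USE_JIT is on."""
    return njit(func) if USE_JIT else func


@maybe_njit
def running_sum(values, out):
    # C-like loop over preallocated arrays: the style nopython mode handles well.
    total = 0.0
    for i in range(values.shape[0]):
        total += values[i]
        out[i] = total
    return out


def cumulative(values):
    # Allocate the arrays outside the compiled loop, then call the kernel.
    values = np.asarray(values, dtype=np.float64)
    out = np.empty_like(values)
    return running_sum(values, out)


print(cumulative([1, 2, 3]))
```

Here the outer function is left as plain Python rather than an object-mode `@jit` function; the division of labour (allocate outside, loop inside) is the same idea.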

The other thing I will mention is that when I first started, getting LLVM to work with numba was a nightmare on different OSes. That has completely gone away with improvements in the conda package manager.

All that said, you cannot go wrong with Cython; it just has a little more of a learning curve and was a little trickier to implement in our codebase.

>> once we decided that Cython was our go-to tool, we often wrote simple looping code rather than vectorised code because it was simpler to transition to Cython, à la Julia.

If you're used to doing this with cython, you might find it even easier to do this with numba. This is how I develop all the time with numba now. I find that it's incredibly beneficial to step through it as though it was just regular python initially during algorithm and test development, then once the algorithm is right and tests pass, turn on the jit when ready. You sort of get a sense for what numba will accept and still have performant no-python mode jitting after a while, and knowing those limitations actually tends to cause me to write more modular code to take advantage of the speed boosts.

Just to jump on the numba train, I've generally found it to reliably obtain C like performance from C-like Python code. This property also holds when you use Python as a preprocessor language for generating computational kernels, which provides a lot of flexibility not evident in the documentation.

It also has simple-to-use openmp-like multicore parallelization, limited class support, AOT compilation and CUDA & AMD HSAIL support.

Are you using python as a preprocessor to glue together text strings of numba code which you then eval? Or are you using python to generate the numba AST?

Travis Oliphant, Numpy creator, is CEO of Continuum Analytics.

Immediately recognized the domain name. Months ago I was doing yet another search on how to do geospatial plotting with Matplotlib, the kind that mostly works out of the box in R/ggplot2, but, because of some latent fragmentation from Py2v3, was not well-documented anywhere in Python/matplotlib. And while I've come to really like and respect Matplotlib, the documented examples stray far from what they should for purposes of API illustration, and so learning it has been a test of patience.

Anyway, Mr. Rougier's Matplotlib tutorial was informative, concise, and beautiful. Actually, I think my appreciation for matplotlib came from reading his guide: https://www.labri.fr/perso/nrougier/teaching/matplotlib/

The site is down for me, but you can see the content nicely formatted on GitHub: https://github.com/rougier/from-python-to-numpy

For cmd-f: mirror

You mean search? Not everybody uses a Mac ...

Yet somehow you figured out what he meant.

It was not tongue in cheek. I do not know what cmd is. I assume it is a modifier key in an Apple keyboard, since it is not in a Windows keyboard, which I have (although I am on linux).

Sure, I guess he means search, but he could just say that.

Or should I start talking using references to my keyboard shortcuts? I mostly use emacs, very often with custom bindings, or tmux, so it will not make sense to a lot of people.

But hey, it's a free world.

I'd be very curious to know if there is any impact to choosing Numpy C ordered arrays or Fortran ordered arrays. As a long time Matlab user (since 1993) who moved to Python 3 years ago, I have always defaulted to Fortran order because it was what I was used to and seemed more intuitive. I did play with C ordered arrays but didn't find an advantage in my limited investigation.

I think it depends on whether an algorithm is forced to traverse an array in the cache-efficient direction or not. Oftentimes you can't choose whether to make your outer loop rows or columns so the performance could go either way.

There may still be a few routines that expect C-ordered arrays and so require a copy be made when given a Fortran-ordered array, especially as you extend to one of the libraries that use NumPy. For the most part, however, Fortran-ordered arrays should work well. It all comes down to the expectation of the routine writer.
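A small sketch of what that copy looks like in practice, using only standard NumPy calls. The array shape is arbitrary, and `np.ascontiguousarray` stands in here for whatever conversion a C-order-only routine might do internally.

```python
import numpy as np

a_c = np.arange(12, dtype=np.float64).reshape(3, 4)  # C order (row-major), NumPy's default
a_f = np.asfortranarray(a_c)                         # same values, column-major layout

# Same contents, different memory layout.
print(a_c.flags['C_CONTIGUOUS'], a_f.flags['F_CONTIGUOUS'])
print(a_c.strides, a_f.strides)  # byte steps per axis: (32, 8) vs (8, 24)

# A routine that insists on C order has to copy a Fortran-ordered input...
converted = np.ascontiguousarray(a_f)
copied = not np.shares_memory(converted, a_f)

# ...while a C-ordered input passes through without a copy.
passthrough = np.ascontiguousarray(a_c)
no_copy = np.shares_memory(passthrough, a_c)

print(copied, no_copy)
```

Which order is faster for a given loop still depends on the traversal direction, as the parent comment says; this only shows where the hidden copies come from.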

Does anyone have a recommendation for something similar to this but for Python itself? I have been trying to find something that is not necessarily an intro or crash course book but a book with tips, great explanations, and neat examples (which this e-book(?)/site has).

I see that the author has responded to a couple of comments here. Thank you for your great work! It's always great to have nice reference material with concise examples. I think this will be helpful to everyone (beginners and advanced Python users alike)!

I would recommend Julien Danjou's "The Hacker's Guide to Python". He charges $29 for the PDF, but provides updates every year or so. I think it's a great book for getting more depth out of Python. :)

Topics include: modules/libraries, documentation, distribution, virtual environments, unit testing, methods/decorators, functional programming, optimization, scaling, RDBMS, and more.


Thanks for the suggestion! I downloaded the free chapter from the book and I'm going to check it out tonight. If I enjoy it, I might have to buy the physical and electronic copy!

Thank you. For some Python introduction, you can have a look at the end of this page: https://github.com/rougier/Scipy-Bordeaux-2016/blob/master/i...

Thank you for the suggestion!

I'm skimming through Effective Python, and so far I think it's a pretty good "best practices" guide for intermediate Python developers.

Seconded. `Effective Python` is a fantastic resource detailing a lot of the ways Python differs from other languages and how to write to the language's strengths.

I'm no hot shit, but it definitely moved me from 'middling hobbyist' to 'semi-capable dev'.

Becoming a "semi-capable dev" is exactly what I'm trying to aim for. I started learning Python at a relatively young age and I picked up a set of bad habits. I ended up becoming a Mechanical Engineer so I never had the time or opportunity to code more to become more proficient and write better code. I'm currently taking the self-driving car course through Udacity. I'm hoping this would be a good opportunity to correct my bad habits and become a better coder.

There's Fluent Python from O'Reilly, which is a pretty good follow-on for people who are comfortable with the basics and want to dig into some more advanced features.


Yes, Effective Python is awesome!

Hey I work at Dataquest (dataquest.io) and we have a lot of intermediate & advanced Python content. It's all done through an in-browser coding environment which lets us do answer checking and so on.

This book is amazing! The author's sense of humor especially makes reading it fun.

> For example, can you tell what the two functions below are doing? Probably you can tell for the first one, but unlikely for the second (or your name is Jaime Fernández del Río and you don't need to read this book).

I just checked the first example from the introduction to vectorization: http://www.labri.fr/perso/nrougier/from-python-to-numpy/#id5 (add_python and add_numpy), and the benchmark results are nearly the same: 75.4ns and 77.7ns respectively.
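For anyone wanting to reproduce the comparison, here is one way it could be set up with `timeit`. The two function bodies are an assumption reconstructed from their names (an element-wise list addition versus `np.add`), and the input size and repeat count are arbitrary; results will vary with input size and machine.

```python
import timeit

import numpy as np


def add_python(z1, z2):
    # Assumed pure-Python version: element-wise sum of two lists.
    return [a + b for a, b in zip(z1, z2)]


def add_numpy(z1, z2):
    # Assumed vectorized version: one call into NumPy.
    return np.add(z1, z2)


n = 1000          # arbitrary input size for this sketch
repeats = 1000    # arbitrary number of timed calls

list1, list2 = list(range(n)), list(range(n))
arr1, arr2 = np.arange(n), np.arange(n)

# Each function gets the input type it is meant for (lists vs. arrays).
t_python = timeit.timeit(lambda: add_python(list1, list2), number=repeats)
t_numpy = timeit.timeit(lambda: add_numpy(arr1, arr2), number=repeats)

print(f"add_python: {t_python / repeats * 1e6:.1f} us per call")
print(f"add_numpy:  {t_numpy / repeats * 1e6:.1f} us per call")
```

With very small inputs the fixed per-call overhead dominates both versions, so it is worth trying several values of `n` before drawing conclusions.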

Anyone (author) know what was used to generate the cover image of cubes and shadows?

Edit: it's sketchup - there's a .skp file in the data/ subdirectory of the github repo for the book.

Is there an epub/mobi version?

Does the book exist in PDF?

Not yet, but I'm working on it. Meanwhile, you can try rst2latex.py on the sources for a very rough draft.

Amazing work. If you take a look at latex please consider also XeTeX for its beautiful font support. It can give a lot of satisfaction when packaging good content. Microtype is also nice if one sticks to pdflatex.

I've failed to build pdf from book.tex produced by rst2latex.py (the issues are likely fixable if you work with latex).

I've converted the rst files to e-book using rst2html.py + calibre instead.

> be warned that I'm a bit picky about typography & design: Edward Tufte is my hero

And it shows, the theme is beautiful. Also some of the best ASCII diagrams I've seen. Worth a look at the source, even if you don't care about Python.

Thanks. If you love ascii art, make sure to have a look at https://casual-effects.com/markdeep/

Wouldn't normally criticize website design, but since this came up... yes, the fonts are pretty and all, but on my humble 24" monitor the site uses barely half of my horizontal space and looks awkward (the table of contents especially). "Mobile-first"?

> but on my humble 24" monitor the site uses barely half of my horizontal space and looks awkward (the table of contents especially). "Mobile-first"?

That's actually a feature, as lines that are too long are harder to read (https://en.wikipedia.org/wiki/Line_length). It's the same as why books actually have empty margins.

To me, the font is too thin to be easy to read.

I don't mind empty margins, but having 60% of my screen empty for no reason is excessive, and I do read slower because of it. If I want shorter lines, I'll just resize my browser, thanks.

I agree with you about the font being too thin, but the line length annoyed me much more.

Optimizing for super-rare users who have a deliberate desire for long lines is less important than optimizing for the vastly-more-common case that the browser's size is large as an artifact of content the user was viewing on a previous site, but the user still prefers a sane line length for text content.

Citation that these users are super rare? Almost everyone I've had conversations with on the subject bemoans too-large margins on websites.

That's odd. I wonder if these people that like looong lines are mostly younger people that have read more from screens than books (but not so young that they've read more from phones than computer screens :))? A book with very long lines would be super weird.

Agreed. I'm specifically referencing wasted space. As I've gotten older, that "wasted" space gets replaced with zoomed fonts using ctrl+=. I've found this attitude and use isn't all that unique anecdotally, so I found the assertion that it's "rare" to be contrary to my own experience. I have no citation the other way though!

On the flip side you can use Reader Mode to make long lines shorter, but you can't use Anti-Reader Mode to make short lines longer.

My problem is that the notes are off-screen - there seems to be an assumption that I'm reading on a widescreen monitor...

Yes, sorry for that. I need to work a bit more on a more responsive CSS.

Fonts are barely readable: too thin, too white on my browser/screen.

I really appreciated the problem vectorization chapter. New approaches require new thinking and this is often forgotten when teaching new concepts.
