

Accelerating Python Libraries with Numba - jasonsoja
http://www.continuum.io/blog/numba_growcut

======
aschreyer
It is a bit unfortunate that you have to "unroll" your code to get the most
out of Numba. Hopefully at some point it will be able to translate NumPy
expressions such as np.abs(X - y).sum(axis=1) efficiently into LLVM. Those
extreme performance improvements are misleading, though; in my experience the
speed-up compared to concise, optimal NumPy code is more in the range of
50-100%.

~~~
dbecker
I'm sure numba is an impressive product. But I keep seeing examples on the
continuum blog with an obscene number of for loops, and correspondingly great
speedups.

My only response is: "My code doesn't look like that."

I'd be more impressed if they didn't cherry pick code with a dozen for-loops,
and instead showed a moderate speedup from more standard (and vectorized)
code.

~~~
onalark
(I wrote this last blog post)

In general, we're looking for results that will get people excited, and in
some communities, we can get fairly silly speedups over native Python, so we
are still picking that low-hanging fruit.

However, it's also important to be tied in to real applications. I'd be happy
to take a crack at applying numba to your code. Do you have a reasonably small
example we could take a look at?

~~~
aschreyer
The _Ultrafast Shape Recognition_ (USR) algorithm is a very simple yet
interesting application used in drug discovery that I tried speeding up with
Numba (the similarity calculation part). The NumPy implementation looks
roughly like this:

    
    
        import numpy as np

        def usr(X, y, S=0.9, N=10):
            # similarity = 1 / (1 + Manhattan distance / 12)
            scores = 1.0 / (1.0 + 1/12.0 * np.abs(X - y).sum(axis=1))
            scores = scores[scores >= S]    # keep only hits above the cutoff
            scores.sort()
            return scores[-N:][::-1]        # top N scores, best first
    

Where X.shape could be (2000000, 12) (or more rows) and y.shape (12,). The
idea is to retrieve the top N most similar hits above a similarity score of S.
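
For a quick self-contained run, a minimal sketch of calling it on synthetic
data (the random arrays below are just placeholders, not real USR
descriptors):

        import numpy as np

        # ~2 million candidate descriptors of length 12 plus one query
        X = np.random.rand(2000000, 12)
        y = np.random.rand(12)

        top_hits = usr(X, y, S=0.9, N=10)   # at most N scores >= S, best first
        print(top_hits)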

~~~
onalark
This is a good one :)

The numba code isn't as pretty as it could be because slicing doesn't work for
overlapping memory regions or wraparound indexing yet, and we don't have
inlining :(

Here's what I get on a 2.6 GHz Intel Core i7.

I rewrote your code to minimize memory traffic, then jitted it with numba:

    
    
      import numpy as np
      from numba import autojit

      def usr_numba(x, y, S, num_best):
          m, n = x.shape
          best = np.zeros(num_best)      # running top-N scores, best first
          best_low = 0.0                 # smallest score currently kept

          for i in xrange(m):
              # Manhattan distance between row i and the query, element by element
              d = abs(x[i, 0] - y[0])
              for j in xrange(1, n):
                  d += abs(x[i, j] - y[j])
              d = 1.0 / (1.0 + 1/12.0 * d)   # similarity score
              if d > best_low and d > S:
                  # find the insertion point in the sorted top-N array
                  k = 0
                  for k in xrange(0, num_best):
                      if d > best[k]:
                          break
                  # shift the lower scores down and drop in the new one
                  for l in xrange(num_best - 1, k, -1):
                      best[l] = best[l - 1]
                  best[k] = d
                  best_low = best[num_best - 1]
          return best

      _usr = autojit()(usr_numba)
    
      In [1]: import numba_usr
    
      jitted kernel checks out
    
      N = 1000000
      usr   (s): 0.233645
      numba (s): 0.0115586
      20X speedup
    
      N = 2000000
      usr   (s): 0.566954
      numba (s): 0.023487
      24X speedup
    
      N = 4000000
      usr   (s): 1.14992
      numba (s): 0.0472016
      24X speedup
    
      N = 8000000
      usr   (s): 2.34968
      numba (s): 0.092601
      25X speedup
    
      N = 10000000
      usr   (s): 2.96395
      numba (s): 0.116032
      26X speedup
    
      N = 20000000
      usr   (s): 17.4779
      numba (s): 0.236304
      74X speedup
    

For the case aschreyer is interested in, I see a 24x speedup from half a
second to two hundredths of a second. For a really big problem (2 x 10^7),
numba is still well under a second and the numpy code is starting to really
suffer.
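
If you want to sanity-check these numbers, a stripped-down version of the kind
of timing loop behind them looks roughly like this; the real harness is in the
gist below, and newer numba releases spell autojit as numba.jit:

        import time
        import numpy as np

        # usr and usr_numba (and _usr = autojit()(usr_numba)) as defined above
        N = 2000000
        X = np.random.rand(N, 12)
        y = np.random.rand(12)

        _usr(X, y, 0.9, 10)                  # warm-up call, includes compile time

        t0 = time.time(); usr(X, y, 0.9, 10);  t_numpy = time.time() - t0
        t0 = time.time(); _usr(X, y, 0.9, 10); t_numba = time.time() - t0

        print("usr   (s): %g" % t_numpy)
        print("numba (s): %g" % t_numba)
        print("%dX speedup" % round(t_numpy / t_numba))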

My full code is here: <https://gist.github.com/ahmadia/5550933>

I'm putting it into a wakari notebook so you can actually check me on this :)

_Edit 1_ - Made the speedup a little more comprehensible (and fixed gist)

~~~
aschreyer
Thanks, that was a good example indeed! Improving the memory traffic was
actually really important, because the real-life application has N=20M+, up to
200M. Around 85M calculations per second (20 million rows in ~0.24 s) is
pretty impressive, I have to say, and the example really helps in
understanding how to write efficient Numba code.

------
tlarkworthy
I get 10 ms to cluster and segment an image with tuned NumPy and OpenCV, with
better results. Indeed, the loops around the inner patch convolution are the
killer. I used min-eigenvalue corner detectors to turn a pixel location into a
vector of corner responses at different spatial scales. Those routines run in
parallel in OpenCV, so I got rid of the inner patch kernel and it all
delegates to BLAS. No loops.
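
A rough sketch of that kind of pipeline, with guessed scales and parameters
(this is not the actual code): compute a min-eigenvalue corner response at a
few spatial scales, stack the responses into per-pixel feature vectors, and
cluster them with k-means.

        import cv2
        import numpy as np
        from sklearn.cluster import KMeans

        img = cv2.imread("frame0108.jpg", cv2.IMREAD_GRAYSCALE).astype(np.float32)

        # corner (min-eigenvalue) response images at several spatial scales
        scales = [3, 7, 15]                   # block sizes picked arbitrarily
        responses = [cv2.cornerMinEigenVal(img, blockSize=b) for b in scales]

        # one feature vector per pixel: its response at each scale
        features = np.dstack(responses).reshape(-1, len(scales))

        # segment by clustering the per-pixel features with k-means
        labels = KMeans(n_clusters=4).fit_predict(features)
        segmented = labels.reshape(img.shape)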

~~~
onalark
I'm always interested in working with practical examples. Do you have some
sample code I could look at?

~~~
tlarkworthy
EDIT: Formatting is a bit whacked. I have two consecutive frames of a sonar
image (same size) as the input; you will have to swap those parts out, but it
ran fine for me. There are more parameters in the "DEFAULT_PARAMS" dict, as I
copied this from a much larger program. I work in greyscale, so that might
actually be a problem for generalization.

EDIT2: deleted as the src has been truncated; <http://pastebin.com/dPMsRF78>

~~~
onalark
Hi tlarkworthy, thanks for sharing your code with me!

- It's really hard to make valid comparisons against data I don't have access
to. Do you have any open data sets to try this comparison against?

- It looks like the meat of the work is being done here by scikit-learn. As
was mentioned earlier, Numba at this stage is mostly useful for improving
kernel performance, not large library routines.

I'm planning on taking a deeper look into some of the scikit-learn kernels in
the future. Keep an eye open for a blog post from Continuum on this.

~~~
tlarkworthy
OK, these are the images; feel free to replicate them, distribute them,
whatever, there is no licensing:
<http://img716.imageshack.us/img716/5808/frame0108.jpg> and
<http://img254.imageshack.us/img254/9562/frame0109.jpg>

Yeah, I use OpenCV and scikit-learn to do the heavy lifting indeed. But then,
if I used Numba, surely that is what you are advocating too? I tried unrolling
the inner kernel a few different ways before settling on the code you have
before you. It doesn't compute the feature vector exactly how I would want,
BUT IT'S _really_ FAST, which is essential for sonar analysis that has to run
on an AUV in real time. I will fit my math to the library to meet my CPU
budgets. Anyway, I hope you can use this as a benchmark or something, even if
it's implemented in a totally different way than you would do it.

feel free to email me tom _dot_ larkworthy <at> gmail

Tom

~~~
tlarkworthy
Oh, I should also add that the clustering (k-means) is very fast. The part
analogous to the inner loop is "feature_vectors", which has two main cases:
case 1, when nothing has been computed yet, it calculates the corner response
_images_ at all the different spatial scales (big operations); case 2,
feature_vectors just selects data from the corner images for the pixels
demanded.

Now, my algorithm is sparse, so it's normally just selecting a subset of
pixels, although since it calculates the spatial responses for the whole image
anyway, that doesn't really make any odds to my runtime.
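
A tiny sketch of that two-case structure (a guess at its shape, not the real
feature_vectors):

        import cv2
        import numpy as np

        class FeatureVectors(object):
            def __init__(self, img, scales=(3, 7, 15)):
                self.img = img.astype(np.float32)
                self.scales = scales
                self._responses = None       # nothing computed yet

            def __call__(self, rows, cols):
                if self._responses is None:
                    # case 1: compute corner response images at all scales (big ops)
                    self._responses = np.dstack(
                        [cv2.cornerMinEigenVal(self.img, blockSize=b)
                         for b in self.scales])
                # case 2: just pick out the responses for the requested pixels
                return self._responses[rows, cols, :]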

~~~
onalark
Thanks. I'll take a look.

------
CoffeeDregs
I looked at Numba a few weeks ago and it looked very impressive. One question
I had that I did not see addressed in Numba's documentation was whether Numba
is _generally_ applicable or focused on mathematical computation. I understand
that math loops might be most in need of a speed-up, but it would still be
nice to speed up non-math-centric code (e.g. Django...).

~~~
kingkilr
It is not general purpose.

~~~
onalark
yet. :)

~~~
fijal
Do you really think it'll ever be able to compile arbitrary Python and make it
fast most of the time? Cython never got there, and numba has a vastly
different purpose.

~~~
art187
Cython isn't 1) able to compile Python or 2) fast?

Hate to have to tell all my Cython code it needs to break itself because the
hackernews thinks so. =P

~~~
fijal
Cython isn't able to compile all Python. It definitely can compile a large
subset of Python, but certainly not all of it. I don't actually know how fast
the compiled Cython is - in my (very limited) experience, if you don't provide
types, it's not a massive speedup over CPython.

------
skierscott
Are there any Python vs. Numba vs. Cython vs. C examples?

~~~
travisoliphant
Yes, there are several. See slide 15 of this talk
<http://www.slideshare.net/teoliphant/numba-siam-2013> and also the GitHub
repo: <https://github.com/teoliphant/speed>

Note that the array expressions previously available only in NumbaPro have
been moved to Numba (our plan for all premium products is to move features
into open source as we get the funds to support that).

~~~
travisoliphant
But, the upshot is that Numba produces code that is either faster or roughly
the same speed as C or Cython.

