Hacker News new | past | comments | ask | show | jobs | submit login

If you really care about extracting every possible ounce of performance out of Python's scientific stack, which relies extensively on Numpy, the best practical guide I have found for doing that is "From Python to Numpy," by Nicolas P. Rougier of Inria: https://www.labri.fr/perso/nrougier/from-python-to-numpy/

Here's a typical example of the kinds of optimizations this guide teaches you, in this case by avoiding the creation of temporary copies of Numpy arrays in memory:

  # Create two int arrays, each filled with with one billion 1's.
  X = np.ones(1000000000, dtype=np.int)
  Y = np.ones(1000000000, dtype=np.int)

  # Add 2 * Y to X, element by element:
  
  # Slowest
  %time X = X + 2.0 * Y
  100 loops, best of 3: 3.61 ms per loop

  # A bit faster
  %time X = X + 2 * Y
  100 loops, best of 3: 3.47 ms per loop

  # Much faster
  %time X += 2 * Y
  100 loops, best of 3: 2.79 ms per loop

  # Fastest
  %time np.add(X, Y, out=X); np.add(X, Y, out=X)
  100 loops, best of 3: 1.57 ms per loop
That's a 2.3x speed improvement (from 3.61 ms to 1.57 ms) on a simple vector operation (your mileage will vary!).[1] This only scratches the surface. The guide goes into quite a bit of explicit detail about how Numpy arrays are constructed and stored in memory and always explains the underlying reasons why some operations are faster than others. In addition, the guide has a section titled "Beyond Numpy" that points to even more ways of improving performance, e.g., by using Cython, Numba, PyCUDA, and a range of other tools.

I highly recommend reading the whole thing!

--

[1] Example copied from here: https://www.labri.fr/perso/nrougier/from-python-to-numpy/#an...




Faster than fastest:

  import numba
  @numba.vectorize(nopython=True)
  def add2(x, y):
      return x + 2 * y

  add2(X, Y, out=X)
This is 40% faster on my machine. It only needs to read X and Y once, and write X once. It does take a bit of extra time on the first run, because it uses JIT compilation (LLVM).

Numba lets you implement vectorized operations in terms of elements. In the above, x and y are scalars. You get support for various NumPy "ufunc" features for free, such as the out parameter used here.

Ref: http://numba.pydata.org/numba-doc/dev/user/vectorize.html


In Julia this would read using broadcasting:

    X = ones(Int, 1_000_000_000)
    Y = ones(Int, 1_000_000_000)

    # explicit dot notation
    X .+= 2 .* Y

    # or with a macro
    @. X += 2 * Y
The + and * do not allocate, the operations are fused into element-wise operations so the vectors are traversed only once, and the loop is compiled to vectorized code. No need for Numpy / BLAS.


That's impressive! What does fusion mean? I've heard fusion multiple times and I've the feel that fusion (in haskell at least) is something very advanced that's not really thought. Can you explain?


Two passes with one operation each are fused into one pass that has two operations?

Too bad common languages today (Python and its extensions for example) are a big step back from Matlab when it comes to matrix and vector operations syntax.


Not only now you are employing a foreign data type (numpy array) that requires separate learning independent (potentially confusing) to your existing Python knowledge, you also need mentally differentiate:

    X = X + 2.0 * Y
    X = X + 2 * Y
    X += 2 * Y
    X += Y; X += Y
, all irrelevant to your semantic logic.

Why not realizing that it is much more straight forward to directly do that in C? It is only a bit more to type but a much simpler mental load to understand (and maintain in the long run). If performance is at stake, spell it out in C (with x86 intrinsics if necessary) and put them into the semantics of your code.

Concerned with performance (in ms order) in Python is ill.


There's no need to mentally differentiate between all those approaches when writing code with Python's scientific stack.

In practice, code is initially written with relatively little regard for how it affects the performance of Numpy and other libraries like it, and then, afterwards, and only if necessary, these techniques are used to optimize those lines of code that prove critical to performance, typically only a small fraction of all lines.

However, if you're working on a project for which every line of code is performance-critical, then I would agree with you, Python would not be a good choice for that.


When you use numpy, you're not one semicolon away from a segfault.


How could it possibly take on the order of 1ms to operate on arrays of size 1e9?


numpy calls BLAS libraries, which are heavily optimized and parallelized. Note that it is also very dependent on which particular BLAS library is used, as several are older and not nearly as optimized as newer ones (the difference can be rather dramatic). If you're curious, the best one to use is OpenBLAS.


BLAS really shines when you do matrix multiplication, for element-wise operations the best you can do is to add numbers using SIMD instructions or put the load to GPU, and most numeric libraries already when possible. The benchmark about seems unrealistic, here are results from my newest MaBook Pro:

    In [2]: import numpy as np

    In [3]: X = np.ones(1000000000, dtype=np.int)

    In [4]: Y = np.ones(1000000000, dtype=np.int)

    In [5]: %time X = X + 2.0 * Y
    CPU times: user 10.4 s, sys: 27.1 s, total: 37.5 s
    Wall time: 46 s

    In [6]: %time X = X + 2 * Y
    CPU times: user 8.66 s, sys: 26 s, total: 34.7 s
    Wall time: 42.6 s

    In [7]: %time X += 2 * Y
    CPU times: user 8.58 s, sys: 23.2 s, total: 31.8 s
    Wall time: 37.7 s

    In [8]: %time np.add(X, Y, out=X); np.add(X, Y, out=X)
    CPU times: user 11.3 s, sys: 25.6 s, total: 36.9 s
    Wall time: 42.6 s
No surprise, Julia makes nearly the same result:

    julia> X = ones(Int, 1000000000);
    julia> Y = ones(Int, 1000000000); 

    julia> @btime X .= X .+ 2Y
      34.814 s (6 allocations: 7.45 GiB)

UPD. Just noticed 7.45Gib allocations. We can get rid of it as:

    julia> @btime X .= X .+ 2 .* Y
      20.464 s (4 allocations: 96 bytes
or:

    julia> @btime X .+= 2 .* Y
      20.098 s (4 allocations: 96 bytes)


I could have not noticed use of swap in the previous test, so I repeated it on a Linux box and 1e8 numbers (instead of 1e9). Julia took 100.583ms while Python 207ms (probably due to double reading of the array). So I guess adding 1e9 numbers should take about 1 second on a modern desktop CPU.


I think the benchmark was probably done on a supercomputer. But that's really interesting how well Julia did. I did a basic logistic regression ML implementation in it years ago and I was impressed, but I stopped following its progress. Might have to keep it on my radar!


I still don't see how it's possible, no matter how optimized it is. Assuming 8-byte ints (which is what np.int seems to be on 64-bit) you're looking at reading at least 16GB of data since you're operating on two 8GB arrays and you have to read the data in each one at least once. If you can do that in a millisecond, that's a memory bandwidth of about 16TB/s. I thought modern CPUs had memory bandwidth of tens of GB/s, maybe low hundreds for really high-end stuff, and some brief searching seems to confirm that. What am I missing?

Edit: testing the given code on my 2013 Mac Pro, the fastest one at the end completes in one second or so (just eyeballing it), which makes a lot more sense.


The example the OP gave was from a tutorial/website that is hosted at the Laboratoire Bordelais de Recherche en Informatique. I imagine they probably have some heavy duty machines to crunch numbers on.


Not only that, but going through the array twice is apparently faster than doing it once and multiplying by 2. Is multiplying more expensive than fetching/storing from memory? This is counter-intuitive. I must be missing something.


You're going through the array twice whatever you do (once when multiplying by two then a second time when adding the arrays together), Python/NumPy isn't clever enough to figure out that it can be done in a single loop.


Good point.


Would this be as fast as 'Fastest'?

    %time X += Y; X += Y
It's a bit easier to read, imo.


Looks like it:

  In [1]: %%timeit x = np.ones(100000000); y = np.ones(100000000)
     ...: np.add(x, y, out=x)
     ...: np.add(x, y, out=x)
     ...: 
  1 loops, best of 3: 287 ms per loop

  In [2]: %%timeit x = np.ones(100000000); y = np.ones(100000000)
     ...: x += y
     ...: x += y
     ...: 
  1 loops, best of 3: 287 ms per loop

  In [3]: %%timeit x = np.ones(100000000); y = np.ones(100000000)
     ...: np.add(x, y, out=x)
     ...: np.add(x, y, out=x)
     ...: 
  1 loops, best of 3: 286 ms per loop

  In [4]: %%timeit x = np.ones(100000000); y = np.ones(100000000)
     ...: x += y
     ...: x += y
     ...: 
  1 loops, best of 3: 280 ms per loop




Applications are open for YC Winter 2021

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact

Search: