Typically straightforward Numpy gets me from 40× slower than straightforward C to 5× slower. Tricky Numpy (output arguments, conversions to lower precision, weird SciPy functions) gets me another factor of 2. C SIMD (intrinsics or GCC’s portable vector types) gets me a factor of 4 faster than straightforward C. Presumably CUDA would buy me another factor of 100, but I haven't tried it, and I haven't tried Numba either.
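For what it's worth, here's a rough sketch (with made-up array shapes) of what I mean by "tricky" Numpy: output arguments to avoid temporaries, dropping to float32, and reaching for a SciPy routine instead of chaining several vectorized ops.

```python
import numpy as np
from scipy import ndimage

# Made-up shapes, just for illustration; float32 is the "lower precision" trick.
a = np.random.rand(4096, 4096).astype(np.float32)
b = np.random.rand(4096, 4096).astype(np.float32)
out = np.empty_like(a)

# Straightforward Numpy: every intermediate result allocates a temporary
# and makes another pass over memory.
c = 2.0 * a + b

# "Tricky" Numpy: output arguments reuse one buffer and skip the temporaries.
np.multiply(a, 2.0, out=out)
np.add(out, b, out=out)

# Reaching for a SciPy routine: one C-implemented pass instead of
# several chained vectorized operations.
smoothed = ndimage.uniform_filter(out, size=3)
```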
You're probably expecting a bit much from CUDA. If you have heavily optimized CPU code running on a high-core-count Xeon, the gain is probably more like 2-3x. The reason CUDA is so popular is that it makes that kind of speedup comparatively easy to achieve. Optimizing x86 to the last 10% is a dark art only very few programmers are capable of, while writing decent GPU code is IMO an order of magnitude easier, i.e. just a craft.
The main differences being: high memory bandwidth vs. heavy reliance on cache, and a unified programming model for vectorization and multicore parallelism.
Yes, but I think that's much less emphasized on GPU. If you have a data-parallel algorithm, then as long as you design the array ordering to allow coalesced access, the memory architecture will usually already give you better performance than what you can hope to get from a CPU, even with heavily cache-optimized code that's basically unmaintainable (as it will likely perform very differently on the next architecture).
Without lots of CPU cores, and with a high-end NVIDIA card, your speedup expectations can just become a bit higher: typically 100x when comparing GPU-friendly algorithms to unoptimized (but native) CPU code, or 10x when comparing them to decently optimized code running on slower CPUs.
Generally I think a performance/cost comparison is more useful: Take the price of the GPU and compare it to something with equivalent cost in CPU+RAM.
> Typically straightforward Numpy gets me from 40x slower than straightforward C to 5x slower.
I find this hard to believe. What kind of numerical work are you doing? Even something as simple as matrix-matrix multiplication should be hard to beat with C, unless your C code is using a cache-efficient algorithm.
Branch-heavy code, for example trading order book updating.
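Something like this (a made-up, simplified order-book loop) is what I have in mind; every iteration branches on the message and mutates shared state, so there's no natural way to phrase it as a Numpy expression:

```python
def apply_updates(book, messages):
    """Hypothetical, simplified order-book maintenance loop."""
    for side, price, size in messages:
        levels = book[side]
        if size == 0:
            levels.pop(price, None)   # level cleared
        else:
            levels[price] = size      # level added or replaced
    return book

book = {"bid": {}, "ask": {}}
messages = [("bid", 99.5, 10), ("ask", 100.5, 7), ("bid", 99.5, 0)]
print(apply_updates(book, messages))  # {'bid': {}, 'ask': {100.5: 7}}
```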
People always say "use numpy", but that is only possible if your algorithm can be described in terms of vectorized operations. For many kinds of processing, the only alternative is C/C++ (through Cython).
> People always say "use numpy", but that is only possible if your algorithm can be described in terms of vectorized operations. For many kinds of processing, the only alternative is C/C++ (through Cython).
I think using numpy is always a good first step after just trying to improve the algorithm; Numpy will be less effort than going to Cython. After that, Cython is a good next step. I seriously don't know of any situation where I would do the kind of micro-optimizations mentioned in the article.
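As a sketch of what that first step usually looks like in practice (the function is just a toy example), the same computation moves from an interpreted Python loop into Numpy's compiled loops before anyone needs to reach for Cython:

```python
import numpy as np

def rms_error_loop(pred, target):
    # Plain Python: one interpreted iteration per element.
    total = 0.0
    for p, t in zip(pred, target):
        total += (p - t) ** 2
    return (total / len(pred)) ** 0.5

def rms_error_numpy(pred, target):
    # Same computation expressed as whole-array operations.
    diff = pred - target
    return float(np.sqrt(np.mean(diff * diff)))

pred = np.random.rand(100_000)
target = np.random.rand(100_000)
print(rms_error_loop(pred, target), rms_error_numpy(pred, target))
```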
My personal experience is that you can actually get another factor of 2 or 3 speed-up by ditching Cython and using actual C instead (I think it's because optimizers have a hard time cleaning up the C that Cython produces), even if you've turned off things like bounds checking.
I guess you haven't tried it, then. But your lack of knowledge is not a reasonable justification for attacking my integrity.
> Even something as simple as matrix-matrix multiplication
That's the best case for Numpy, not the worst. SGEMM is indeed just as fast when invoked from Numpy as when invoked from C, at least for large matrices.
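For example (assuming a Numpy build linked against an optimized BLAS, which is exactly what makes large matrix products the best case):

```python
import numpy as np
from scipy.linalg.blas import sgemm  # the single-precision GEMM a C caller would use too

n = 2048  # arbitrary size for illustration
a = np.random.rand(n, n).astype(np.float32)
b = np.random.rand(n, n).astype(np.float32)

c1 = a @ b               # dispatches to the BLAS sgemm under the hood
c2 = sgemm(1.0, a, b)    # the same routine, called explicitly

# Both spend essentially all of their time inside the same optimized kernel,
# so hand-written C has nothing left to win back.
print(np.allclose(c1, c2, rtol=1e-3))
```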
Indeed, numpy is awesome. I used to do my computational experiments (some of them involving neural networks) the plain way (using classic data structures and the languages' built-in arithmetic operators and functional facilities) in many different languages. Once I tried Python with numpy, my mind was blown: it's so much faster than anything. Now I feel like I enjoyed writing functional code more, but given the performance difference I can hardly imagine going back. So the very reason I use Python is performance.