Typically straightforward Numpy gets me from 40× slower than straightforward C to 5× slower. Tricky Numpy (output arguments, conversions to lower precision, weird SciPy functions) gets me another factor of 2. C SIMD (intrinsics or GCC’s portable vector types) gets me a factor of 4 faster than straightforward C. Presumably CUDA would buy me another factor of 100, but I haven't tried it, and I haven't tried Numba either.
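For what it's worth, here's a rough sketch (with made-up array shapes) of what I mean by "tricky" Numpy: output arguments to avoid temporaries, dropping to float32, and reaching for a SciPy routine instead of chaining several vectorized ops.

```python
import numpy as np
from scipy import ndimage

# Made-up shapes, just for illustration; float32 is the "lower precision" trick.
a = np.random.rand(4096, 4096).astype(np.float32)
b = np.random.rand(4096, 4096).astype(np.float32)
out = np.empty_like(a)

# Straightforward Numpy: every intermediate result allocates a temporary
# and makes another pass over memory.
c = 2.0 * a + b

# "Tricky" Numpy: output arguments reuse one buffer and skip the temporaries.
np.multiply(a, 2.0, out=out)
np.add(out, b, out=out)

# Reaching for a SciPy routine: one C-implemented pass instead of
# several chained vectorized operations.
smoothed = ndimage.uniform_filter(out, size=3)
```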
You're probably expecting a bit much from CUDA. If you have heavily optimized CPU code running on a high-core-count Xeon, the gain is probably more like 2-3x. The reason CUDA is so popular is that it makes that kind of speedup comparatively easy to achieve. Optimizing x86 to the last 10% is a dark art only very few programmers are capable of, while writing decent GPU code is IMO an order of magnitude easier, i.e. just a craft.
The main differences being: high memory bandwidth vs. heavy reliance on cache, and a unified programming model for vectorization and multicore parallelism.
Yes, but I think that's much less emphasized on GPU. If you have a data-parallel algorithm, then as long as you design the array ordering to allow coalesced access, the memory architecture will usually already give you better performance than what you can hope to get from a CPU, even with heavily cache-optimized code that's basically unmaintainable (as it will likely perform very differently on the next architecture).
Without lots of CPU cores, and with a high-end NVIDIA card, your speedup expectations can just become a bit higher: typically 100x when comparing GPU-friendly algorithms to unoptimized (but native) CPU code, or 10x when comparing them to decently optimized code running on slower CPUs.
Generally I think a performance/cost comparison is more useful: Take the price of the GPU and compare it to something with equivalent cost in CPU+RAM.
> Typically straightforward Numpy gets me from 40x slower than straightforward C to 5x slower.
I find this hard to believe. What kind of numerical work are you doing? Even something as simple as matrix-matrix multiplication should be hard to beat with C, unless your C code is using a cache-efficient algorithm.
Branch-heavy code, for example trading order book updating.
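Something like this (a made-up, simplified order-book loop) is what I have in mind; every iteration branches on the message and mutates shared state, so there's no natural way to phrase it as a Numpy expression:

```python
def apply_updates(book, messages):
    """Hypothetical, simplified order-book maintenance loop."""
    for side, price, size in messages:
        levels = book[side]
        if size == 0:
            levels.pop(price, None)   # level cleared
        else:
            levels[price] = size      # level added or replaced
    return book

book = {"bid": {}, "ask": {}}
messages = [("bid", 99.5, 10), ("ask", 100.5, 7), ("bid", 99.5, 0)]
print(apply_updates(book, messages))  # {'bid': {}, 'ask': {100.5: 7}}
```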
People always say "use numpy", but that is only possible if your algorithm can be described in terms of vectorized operations. For many kinds of processing, the only alternative is C/C++ (through Cython).
> People always say "use numpy", but that is only possible if your algorithm can be described in terms of vectorized operations. For many kinds of processing, the only alternative is C/C++ (through Cython).
I think using numpy is always a good first step after just trying to improve the algorithm; Numpy will be less effort than going to Cython. After that, Cython is a good next step. I seriously don't know of any situation where I would do the kind of micro-optimizations mentioned in the article.
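As a sketch of what that first step usually looks like in practice (the function is just a toy example), the same computation moves from an interpreted Python loop into Numpy's compiled loops before anyone needs to reach for Cython:

```python
import numpy as np

def rms_error_loop(pred, target):
    # Plain Python: one interpreted iteration per element.
    total = 0.0
    for p, t in zip(pred, target):
        total += (p - t) ** 2
    return (total / len(pred)) ** 0.5

def rms_error_numpy(pred, target):
    # Same computation expressed as whole-array operations.
    diff = pred - target
    return float(np.sqrt(np.mean(diff * diff)))

pred = np.random.rand(100_000)
target = np.random.rand(100_000)
print(rms_error_loop(pred, target), rms_error_numpy(pred, target))
```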
My personal experience is that you can actually get another factor of 2 or 3 speed-up by ditching Cython and using actual C instead (I think it's because optimizers have a hard time cleaning up the C that Cython produces), even if you've turned off things like bounds checking.
I guess you haven't tried it, then. But your lack of knowledge is not a reasonable justification for attacking my integrity.
> Even something as simple as matrix-matrix multiplication
That's the best case for Numpy, not the worst. SGEMM is indeed just as fast when invoked from Numpy as when invoked from C, at least for large matrices.
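For example (assuming a Numpy build linked against an optimized BLAS, which is exactly what makes large matrix products the best case):

```python
import numpy as np
from scipy.linalg.blas import sgemm  # the single-precision GEMM a C caller would use too

n = 2048  # arbitrary size for illustration
a = np.random.rand(n, n).astype(np.float32)
b = np.random.rand(n, n).astype(np.float32)

c1 = a @ b               # dispatches to the BLAS sgemm under the hood
c2 = sgemm(1.0, a, b)    # the same routine, called explicitly

# Both spend essentially all of their time inside the same optimized kernel,
# so hand-written C has nothing left to win back.
print(np.allclose(c1, c2, rtol=1e-3))
```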
Indeed, numpy is awesome. I used to do my computational experiments (some of them involving neural networks) the plain way (using classic data structures and the languages' built-in arithmetic operators and functional facilities) in many different languages. Once I tried Python with numpy, my mind was blown: it's so much faster than anything. Now I feel like I enjoyed writing functional code more, but given the performance difference I can hardly imagine going back. So the very reason I use Python is performance.