Hacker News

Some thoughts on my experiments with SIMD Programming:

1. AVX2 is good, but tedious to use manually. The core difficulty with AVX2 is that it is SIMD of SIMD: it's 2-way SIMD over 128-bit lanes. Moving data "across" the two 128-bit halves can only be done with a handful of cross-lane instructions (vperm2i128, vpermd, and the like)... or through L1 cache (write to memory, then read it back into another register).

2. "#pragma omp simd" seems to be the most portable way to attempt to "force" auto-vectorization. It's supported by GCC, Clang, ICC, and other compilers. Visual Studio unfortunately does NOT support this pragma, but Visual C++ has well-documented auto-vectorization features of its own.

3. If you are sticking with Visual C++, its auto-vectorization capabilities are pretty good. Enable the relevant compiler warnings so that you know which loops fail to auto-vectorize, and read those warnings carefully. https://docs.microsoft.com/en-us/cpp/parallel/auto-paralleli...
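Concretely, the switch to turn on is /Qvec-report (this is the documented MSVC flag; the filename here is just a placeholder):

```shell
# Ask MSVC to report on auto-vectorization. Level 1 lists the loops
# that were vectorized; level 2 also prints a reason code for every
# loop that was NOT vectorized.
cl /O2 /Qvec-report:2 kernel.cpp
```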

4. If you keep reaching for the SIMD button, the GPU programming model seems superior. If you must stay on a CPU, try ISPC, Intel's SPMD Program Compiler (https://ispc.github.io/), so that your "programming model" is at least correct.

5. If a huge portion of your code is SIMD-style, a dedicated GPU is better. GPUs have more FLOPS and more memory bandwidth. GPUs have faster local memory (aka "shared memory" on NVidia, or "LDS" on AMD) and faster thread-group communication than a CPU. Know how amazing vpshufb is? Well, GPUs will knock your socks off with ballot(), CUDA's __shfl(), AMD's cross-lane operations, and more (https://gpuopen.com/amd-gcn-assembly-cross-lane-operations/).

6. To elaborate on point #5: GPUs simply have better gather/scatter capabilities. The GPU's "shared" or "LDS" memory (only 64 kB or so) is very small, but it provides arbitrary gather/scatter across the "lanes" of a GPU SIMD unit. GPUs even support relatively efficient atomic operations. Yes, even vpshufb looks limited compared to what is available on GPUs.

7. Raw AVX2 code seems "easy" if all you need is a single 128-bit or 256-bit register. For example, if you are working with complex doubles (a real plus an imaginary component), it is very straightforward to write 128-bit SIMD code for the arithmetic. But if you are writing "true SIMD" code, in the style of the seminal 1986 paper "Data Parallel Algorithms" (https://dl.acm.org/citation.cfm?id=7903), then stick with ISPC or GPU-style coding instead.

8. Be sure to read that paper, "Data Parallel Algorithms", to get insight into true SIMD programming. GPU programmers already know what is in there, but it's still cool to read one of the first papers on the subject (from 1986, no less!).




I can't comment on the GPU points, but you may be better off leaving the vectorization to GCC than using the simd pragma. On Haswell, the pragma gets you AVX2 but not FMA, so you lose a factor of two on GEMM, for instance. The GCC manual also gives an example of the ivdep pragma.


Wouldn't that just be a question of a missing optimization?

If I recall correctly, OpenJDK can use FMA thanks to Intel's contributions.


I don't see how OpenJDK is related to the OpenMP pragma. GCC has no problem using FMA if you just let it, i.e. avoid the pragma, which simply says "simd".


I understood that GCC's auto-vectorization currently wouldn't use FMA, and hence gave an example of a compiler whose auto-vectorization does make use of it, assuming I remember Intel's session at Oracle Code One correctly.



