> If you count shared memory scatter/gather, CPU SIMD already has both. Scatter only very recently: it first appeared in AVX-512. Gather has been available for 5 years now; _mm256_i32gather_ps was introduced in AVX2, albeit it's not particularly fast.
You can say that again. It's still not very fast, btw.
https://www.agner.org/optimize/instruction_tables.pdf
VPSCATTERDD on a ZMM register is measured at 17 clock cycles (!!!), while VPGATHERDD on a ZMM register is 9 clock cycles. Gather/scatter on CPUs is very, very slow!
It's actually faster to gather/scatter through a for-loop than to use the VPGATHERDD or VPSCATTERDD instructions.
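To make that concrete, here's a minimal sketch (my own illustration, assuming AVX2; the index layout is made up) of the same 8-lane gather done both ways:

```c++
#include <immintrin.h>

// The same 8-lane gather done two ways. gather_intrinsic compiles to the AVX2
// gather instruction (vgatherdps); gather_scalar is the plain for-loop version
// that the comment above claims is often just as fast or faster on current CPUs.
static __m256 gather_intrinsic(const float* base, const int* idx) {
    __m256i vidx = _mm256_loadu_si256((const __m256i*)idx);
    return _mm256_i32gather_ps(base, vidx, 4);   // scale = 4 bytes per float
}

static __m256 gather_scalar(const float* base, const int* idx) {
    alignas(32) float tmp[8];
    for (int i = 0; i < 8; ++i)
        tmp[i] = base[idx[i]];                   // 8 ordinary scalar loads
    return _mm256_load_ps(tmp);
}
```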
In contrast, Shared Memory on GPUs is a full crossbar on NVidia and AMD.
-------------
I think my details are getting a bit AMD-specific. Lemme do a citation:
https://gpuopen.com/amd-gcn-assembly-cross-lane-operations/
> As a previous post briefly described, GCN3 includes two new instructions: ds_permute_b32 and ds_bpermute_b32. They use LDS hardware to route data between the 64 lanes of a wavefront, but they don't actually write to an LDS location.
The important tidbit is that the Load/Store units of AMD's GPU core can support a gather/scatter to separate LDS memory banks at virtually no cost. This is a full crossbar that allows the GPU to swap data between the lanes of its SIMD registers.
Intel CPUs do NOT have this feature, outside of vpshufb. I'm arguing that GPU Shared memory is of similar performance to vpshufb (a little bit slower, but still way faster than even a CPU's Gather/Scatter).
So yes, the "bpermute" and "permute" instructions on AMD are a full crossbar and can execute within 2 clock cycles (if there are no bank conflicts). That's 64 dwords that can be shuffled around in just 2 clock cycles.
In contrast, Intel's gather/scatter to L1 cache is 9 or 17 clocks respectively.
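For anyone who doesn't read GCN assembly, here's a plain C++ model (my own sketch, not AMD's code, with the lane addressing simplified) of what those two instructions compute across a 64-lane wavefront, and why one is a register "gather" and the other a register "scatter":

```c++
#include <cstdint>

// Semantic model of the two GCN cross-lane instructions over a 64-lane wavefront:
//   ds_bpermute_b32 ("backwards permute"): each lane PULLS a value -> a register gather
//   ds_permute_b32  ("forwards permute") : each lane PUSHES a value -> a register scatter
void bpermute_model(uint32_t dst[64], const uint32_t src[64], const uint32_t lane[64]) {
    for (int i = 0; i < 64; ++i)
        dst[i] = src[lane[i] % 64];       // lane i reads from lane[i]
}

void permute_model(uint32_t dst[64], const uint32_t src[64], const uint32_t lane[64]) {
    for (int i = 0; i < 64; ++i)
        dst[lane[i] % 64] = src[i];       // lane i writes to lane[i]
}
```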
-----------
The important thing here is to do a high-speed permute. The programmer can choose to use vpgatherdd, vpshufb, GPU permute, GPU bpermute, GPU LDS Memory, etc. etc. It doesn't matter for program correctness: it all does the same thing.
But GPUs have the highest-performance shuffle and permute operators. Even if you go through LDS Memory. In fact, the general permute operators of AMD GPUs just go through LDS memory, that's their underlying implementation!
> while VPGATHERDD on a ZMM register is 9 clock cycles.
Quite expected because RAM in general is much slower than registers. Even L1 cache on CPU / groupshared on GPU. On both CPUs and GPUs, you want to do as much work as possible with the data in registers.
The reason why it's OK on GPUs is massive parallelism hiding latency, i.e. the hardware runs other threads instead of just waiting for data to arrive. But even on CPUs, if the requested data is not too far away (in L1 or L2), the hyperthreading in most modern CPUs does an acceptable job in this situation.
> Intel CPUs do NOT have this feature, outside of vpshufb
Yes they do. If you prefer assembly, look at the vpermps instruction. It can permute lanes arbitrarily, even across 128-bit lanes (unlike vpshufb/vshufps/etc.), with the permutation indices taken from another vector register. Quite fast: specifically, on Haswell/Broadwell/Skylake it's 3 cycles latency, 1 cycle throughput, single micro-op.
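For example, via the C intrinsic that compiles to vpermps (a minimal sketch; the data and index values are made up):

```c++
#include <immintrin.h>

// Arbitrary cross-lane permute of 8 floats with indices taken from another register.
// _mm256_permutevar8x32_ps compiles to a single vpermps instruction (AVX2).
__m256 permute_example() {
    __m256  data = _mm256_setr_ps(0.f, 1.f, 2.f, 3.f, 4.f, 5.f, 6.f, 7.f);
    __m256i idx  = _mm256_setr_epi32(7, 3, 5, 1, 6, 2, 4, 0);   // any order, across 128-bit halves
    return _mm256_permutevar8x32_ps(data, idx);                 // result: {7,3,5,1,6,2,4,0}
}
```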
Stream compaction can be used to remove redundant whitespace from strings, or to "compress" raytracer rays so that they can all be read by a simple load (as opposed to a gather/scatter operation). How do you perform stream compaction?
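(For reference, in scalar code stream compaction is just a filtered, packed copy; a toy sketch, with a made-up keep-the-positives predicate:)

```c++
#include <cstddef>

// Scalar reference for "stream compaction": keep only the elements that pass a
// predicate, packed contiguously at the front of the output. Returns the count.
size_t compact_scalar(const float* in, float* out, size_t n) {
    size_t k = 0;
    for (size_t i = 0; i < n; ++i)
        if (in[i] > 0.0f)          // made-up predicate: keep the positive values
            out[k++] = in[i];
    return k;
}
```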
Now, how do you do this using vpshufb? You can't. vpshufb only works in the "backwards" (gather) direction, and won't efficiently solve this stream-compaction problem. GPUs can solve this problem very efficiently, but Intel's current implementation of AVX-512 is missing the reverse of vpshufb needed to perform this operation.
Or as I've been trying to say: vpshufb is equivalent to a "gather" over SIMD registers. But Intel is MISSING a scatter over SIMD registers. The instruction is just... not there. I've looked for it, and it doesn't exist. As such, GPU code (such as the stream compaction algorithm) CANNOT be implemented efficiently on a CPU.
I mean, you can use vpscatterdd to implement it, but again... vpscatterdd is 17 clock cycles. That's way too slow.
> Quite expected because RAM in general is much slower than registers. Even L1 cache on CPU / groupshared on GPU. On both CPUs and GPUs, you want to do as much work as possible with the data in registers.
The L1 cache has the bandwidth to do it; it just isn't wired up correctly for this mechanism. Intel's load/store units can read/write 32 bytes at a time, in parallel across 2x load units + 1x store unit.
But the thing is: writing many small "chunks" of 4 bytes here and there (i.e. a vpgatherdd / vpscatterdd operation) is not a contiguous group of 32 bytes. Therefore, Intel's cores lose a LOT of bandwidth in this case.
GPUs, on the other hand, have a great many load/store units, effectively one per SIMD unit. As such, reading/writing LDS "shared" memory on AMD GPUs can be done 32 at a time.
So the equivalent "vpgatherdd" over LDS cache will execute in something like 2 clock ticks on AMD GPUs (assuming no bank conflicts), while it'd take 9 clock ticks on Intel cores.
Again, LDS is so fast on GPUs that effectively any LDS load/store is as fast as a vpshufb instruction. (Not quite: vpshufb is 1 clock tick and doesn't have to worry about bank conflicts, so vpshufb is still faster... but a GPU's gather/scatter capabilities are downright incredible.)
How long do you think it will be before Intel implements a true crossbar, so that the vpgatherdd and vpscatterdd instructions can actually execute quickly on a CPU?
-----------
GPUs actually implement a richer language for data movement. Intel could very easily fix this problem by adding a "backwards vpshufb" instruction, but I'm not aware of anything like that existing... or of any plans to implement something like that.
> Now, how do you do this using vpshufb? You can't.
You keep mentioning vpshufb even though it's unable to move data across 128-bit lanes.
Here's how to do that with vpermps; there's some overhead, but not much, and it's very likely much faster than RAM access even when the data is in L1: https://stackoverflow.com/a/36951611/126995
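Roughly, the left-packing idea looks like this (my own sketch, not a copy of the linked answer; it uses a plain precomputed table for the shuffle indices, and the keep-the-positives predicate is made up):

```c++
#include <immintrin.h>
#include <cstddef>

// One permutation vector per possible 8-bit keep-mask (built once at startup).
alignas(32) static int g_perm[256][8];

static bool build_table() {
    for (int m = 0; m < 256; ++m) {
        int k = 0;
        for (int i = 0; i < 8; ++i)
            if (m & (1 << i)) g_perm[m][k++] = i;   // lane indices of kept elements, packed left
        for (; k < 8; ++k) g_perm[m][k] = 0;        // don't-care lanes
    }
    return true;
}
static const bool g_ready = build_table();

// Left-pack the positive elements of src into dst, 8 floats per iteration (AVX2).
static size_t compact_vpermps(const float* src, float* dst, size_t n) {
    size_t out = 0;
    for (size_t i = 0; i + 8 <= n; i += 8) {
        __m256  v    = _mm256_loadu_ps(src + i);
        int     mask = _mm256_movemask_ps(_mm256_cmp_ps(v, _mm256_setzero_ps(), _CMP_GT_OQ));
        __m256i idx  = _mm256_load_si256((const __m256i*)g_perm[mask]);
        _mm256_storeu_ps(dst + out, _mm256_permutevar8x32_ps(v, idx));  // vpermps packs the kept lanes
        out += (size_t)_mm_popcnt_u32((unsigned)mask);
    }
    return out;  // scalar tail loop for the last n % 8 elements omitted
}
```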
Besides, newer CPUs have AVX-512, which has the vcompressps instruction just for that use case.
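A minimal sketch of that, assuming an AVX-512F machine (same made-up keep-the-positives predicate as above):

```c++
#include <immintrin.h>
#include <cstddef>

// Stream compaction with a single AVX-512 instruction per 16 floats:
// vcompressps stores only the lanes selected by the mask, packed contiguously.
size_t compact_avx512(const float* src, float* dst, size_t n) {
    size_t out = 0;
    for (size_t i = 0; i + 16 <= n; i += 16) {
        __m512    v    = _mm512_loadu_ps(src + i);
        __mmask16 keep = _mm512_cmp_ps_mask(v, _mm512_setzero_ps(), _CMP_GT_OQ);
        _mm512_mask_compressstoreu_ps(dst + out, keep, v);   // vcompressps to memory
        out += (size_t)_mm_popcnt_u32(keep);                 // advance by number of kept lanes
    }
    return out;  // scalar tail loop for the last n % 16 elements omitted
}
```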
> The L1 cache has the bandwidth to do it
It has the bandwidth, but there's extra latency involved. Registers are faster.
It's the same on GPUs; that's why nVidia recommends the CUDA lane-permute (warp shuffle) intrinsics over scatter/gather RAM access.