
AMD RDNA 1.0 Instruction Set Architecture - dragontamer
https://gpuopen.com/compute-product/amd-rdna-1-0-instruction-set-architecture/
======
dragontamer
There are a lot of changes from Vega: Wave32 to match NVidia's warp size
(although Wave64 is still supported for backwards compatibility). It seems
like Wave64 still has some nifty tricks in it, so anyone writing for Vega can
rest assured that they'll still get benefits on RDNA / Navi.
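
The wave-size change matters most for cross-lane code: a shuffle-style reduction takes log2(wave_size) steps, so code hard-coded for 64 lanes does the wrong thing at 32. A toy CPU sketch (plain Python, not GPU code; the butterfly pattern is the standard one, not a specific ISA instruction):

```python
# Toy CPU model of a cross-lane "butterfly" (shuffle-XOR) reduction, to show
# why wave size matters: the number of exchange steps is log2(wave_size).
# Illustration only -- real GPU code would use wave shuffle instructions.

def wave_reduce_sum(lanes):
    """Sum across all lanes; every lane ends up holding the total."""
    n = len(lanes)                # wave size: 32 or 64
    vals = list(lanes)
    offset = 1
    while offset < n:             # log2(n) steps
        # each lane adds the value held by its XOR-partner lane
        vals = [vals[i] + vals[i ^ offset] for i in range(n)]
        offset *= 2
    return vals

print(wave_reduce_sum(list(range(32)))[0])   # 496  (5 steps)
print(wave_reduce_sum(list(range(64)))[0])   # 2016 (6 steps)
```

The same kernel source compiled for Wave32 simply runs one fewer exchange step than the Wave64 build.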

The L0 / L1 / L2 cache structure looks complicated. The "s_waitcnt"
instruction can now wait on loads and stores separately (loads on vmcnt,
stores on the new vscnt counter), which is kinda nifty. I'm wondering if
that's how memory barriers are supposed to be handled now (on Vega, a barrier
compiles into an L1 cache flush, but that buffer_wbinvl1 instruction no
longer exists in RDNA).
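
A toy model of why split counters help (my own sketch, not the hardware's actual mechanism; the drain is modeled as instantaneous): waiting on loads no longer has to stall on unrelated in-flight stores.

```python
# Toy model of RDNA's split memory counters: vmcnt tracks outstanding
# vector loads, vscnt outstanding vector stores. An "s_waitcnt vmcnt(0)"
# equivalent can then wait for loads alone, which a single shared counter
# (as on Vega) could not express. CPU illustration only.

class WaitcntModel:
    def __init__(self):
        self.vmcnt = 0    # outstanding loads
        self.vscnt = 0    # outstanding stores

    def issue_load(self):
        self.vmcnt += 1

    def issue_store(self):
        self.vscnt += 1

    def waitcnt(self, vmcnt=None, vscnt=None):
        # Block until each named counter drains to its threshold;
        # here the drain is modeled as instantaneous.
        if vmcnt is not None:
            self.vmcnt = min(self.vmcnt, vmcnt)
        if vscnt is not None:
            self.vscnt = min(self.vscnt, vscnt)

m = WaitcntModel()
m.issue_load(); m.issue_store(); m.issue_store()
m.waitcnt(vmcnt=0)           # wait only for the load...
print(m.vmcnt, m.vscnt)      # 0 2   ...stores still in flight
```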

The main issue is that bpermute / permute now only work across a Wave32. I'm
pretty sure this will break my code, but it makes sense, because RDNA's
execution model is substantially different from Vega's.
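
For concreteness, a toy CPU model of ds_bpermute_b32 ("backward permute": each lane *reads* from the lane its address selects) confined to 32 lanes. The byte addressing (lane index times 4) follows the ISA docs; returning 0 for out-of-range lanes is my simplification, not guaranteed hardware behavior.

```python
# Toy model of ds_bpermute_b32 restricted to a single Wave32, as on RDNA.
# Each lane i reads src[addr[i] // 4]; addresses are in bytes per the ISA.
# Out-of-wave reads yield 0 here -- a simplification for illustration.

WAVE = 32

def ds_bpermute_b32(addr, src):
    """dest[i] = src[addr[i] // 4]; reads are confined to the Wave32."""
    out = []
    for i in range(WAVE):
        lane = addr[i] // 4
        out.append(src[lane] if 0 <= lane < WAVE else 0)
    return out

# Usage: rotate values left by one lane within the wave.
vals = list(range(32))
addr = [((i + 1) % 32) * 4 for i in range(32)]
print(ds_bpermute_b32(addr, vals)[:4])   # [1, 2, 3, 4]
```

Code written for Vega that permutes across all 64 lanes of a wavefront has no equivalent single instruction here, which is presumably what breaks.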

-----

DPP has changed. DPP8 is an arbitrary 8-way swizzle, while DPP16 is a much
more limited 16-way swizzle (designed for "scan", i.e. parallel prefix-sum,
algorithms). I'm curious what the plan is for broadcasting: maybe SGPRs are
now the faster way of doing things, rather than the old DPP32 operations.
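
To show what DPP16 is aimed at: a Hillis-Steele inclusive scan, where each step adds the value shifted in from `offset` lanes below within a 16-lane row. This is a toy CPU sketch of the pattern; the function names are mine, not ISA mnemonics, and the zero-fill on vacated lanes stands in for bound_ctrl-style behavior.

```python
# Toy model of a DPP16-style inclusive prefix sum ("scan") within 16-lane
# rows. Each step shifts values up by `offset` lanes (like a row-shift DPP
# pattern) and accumulates. CPU illustration only; names are not mnemonics.

ROW = 16

def row_shr(vals, offset):
    """Shift values up by `offset` lanes within each 16-lane row;
    vacated lanes contribute 0."""
    out = []
    for i in range(len(vals)):
        j = i - offset
        base = (i // ROW) * ROW          # start of this lane's row
        out.append(vals[j] if j >= base else 0)
    return out

def prefix_sum_row16(vals):
    """Hillis-Steele inclusive scan within each 16-lane row."""
    offset = 1
    while offset < ROW:                  # 4 steps for a 16-lane row
        shifted = row_shr(vals, offset)
        vals = [a + b for a, b in zip(vals, shifted)]
        offset *= 2
    return vals

print(prefix_sum_row16([1] * 16))   # lanes hold 1 through 16
```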

EDIT: I guess V_READLANE_B32 and V_WRITELANE_B32 would, in theory, be better
"broadcast" instructions than the old DPP commands.
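
The broadcast idea, sketched as a toy CPU model (my illustration, not ISA semantics beyond the basic read-one-lane behavior): V_READLANE_B32 pulls a single lane's VGPR value into a scalar register, and SGPR operands are uniform across the wave, so every lane sees it.

```python
# Toy model of broadcasting via V_READLANE_B32: read one lane's value into
# a scalar (SGPR-like) register, which all lanes then consume uniformly.
# CPU illustration only.

def v_readlane_b32(vgpr, lane):
    """Read a single lane's value into a scalar register."""
    return vgpr[lane]

def broadcast(vgpr, lane):
    """Every lane picks up the scalar, since SGPRs are wave-uniform."""
    s = v_readlane_b32(vgpr, lane)
    return [s] * len(vgpr)

print(broadcast(list(range(32)), 5)[:3])   # [5, 5, 5]
```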

