Same for x86. The "string" instructions for copying between regions of memory are the closest thing to memory-to-memory DMA, a feature commonplace on microcontrollers.
Of course there is some architected program state (the registers, the FP stack, flags, and various other processor state) that can be saved and loaded, but let's stick to byte-addressable storage.
Versus Intel's vpopcntX reg1, reg2, where X determines the element size.
Seems like the Cray had several 64*64 = 4096-bit vector registers, but you worked on them only 64 bits at a time, whereas current Intel CPUs have 512-bit vector registers, up from 256 bits for AVX2.
Are those Intel vector register sizes going to keep increasing until they catch up with the old Cray? Or was going up from 256 to 512 bits chosen to fit something else in the CPU architecture, like being able to fill the register in so many clock cycles?
What makes you say that? There seem to be the usual vector-vector instructions:
161ijk Vi Vj+Vk
I don't see how they could. The vector size has increased from XMM to YMM to ZMM, there is obviously no more room for expansion ;-)
One thought comes to mind:
WMM, standing for "Wider MM register"
EXMM, EYMM, EZMM -- as in A, AX, EAX, RAX...
> What makes you say that?
Ye olde Crays used 'vector pipelining', meaning that while vector registers held many elements, there was only one ALU. So a single vector instruction took many cycles to execute. OTOH, this let the execution units be well utilized even without a cache, heroic OoO, etc.
A 64x wide logical wavefront on a Vega64 would be physically executed by a 16x wide ALU. The ALU would pipeline itself over 4 clock ticks, providing the programmer a logical 64x wide vector.
The opposite. GPUs seem to be converging on 1024 bits wide (32x 32-bit).
GPUs used to be 64x 32-bit wide (2048 bits), but both AMD and NVidia seem to have settled on 32x 32-bit wide (1024 bits).
It seems that at around 1024 bits wide, it's more appropriate to parallelize your processors instead of increasing vector size. Ex: instead of 32x compute units with 64x threads per CU (32-bit each), you should have 64x compute units with 32x threads per CU, for the same total lane count.
The smaller size (32x instead of 64x) makes thread divergence easier to handle.
AMD Vega64 was logically 64x wide (2048 bits), although it was physically a 16x wide processor: the 16 cores per vALU would repeat themselves for 4 clock cycles, so logically 64 cores but physically only 16.
By switching to Navi's 32x wide organization instead, efficiency went up but overall TFlops went down. The AMD 5700 XT is 40x 2x 32x 32-bit in organization (40 compute units, 2x 32-wide SIMDs per compute unit, 32 bits each), for a total of 2560 cores.
Vega64 was 64x 4x16x 32-bit in organization, for a total of 4096 SIMD cores.
Vega64 and the 5700 XT are roughly the same speed in practice, despite the 5700 XT having only 62% of the cores and fewer TFlops than the Vega64. I guess the cryptocoin miners preferred the ol' Vega64, but in practice it's easier to write efficient programs for a narrower SIMD unit.
Just as a modern programmer learns about the DEC PDP-11 and its influence on the C programming language, a modern GPU programmer could look at these Cray notes and learn about that machine's influence on the modern GPU.
The SIMD principles on this Cray have found their way into normal CPUs, in the form of AVX instructions or AVX-512.