
Cray-2 vectorization instruction notes by former Principal Engineer of Cray - kick
https://web.archive.org/web/20070929163851/http://klausler.com/cray2.txt
======
fulafel
Back when the Cell was around (PS3), there were sometimes discussions about
the similarities between it and the Cray: local memory, and a different insn
set for the vector processors vs the "CPU". I guess the Cray was easier to
program because you could still address the shared memory from the vector
programs without DMA or other hoops.

~~~
Const-me
I guess the Cell is more efficient due to the DMA. Note the Cray manual says
"There is no path between Local Memory and real memory. Vector registers must
be used to implement block copies.", i.e. you have to spend CPU cycles copying
data.
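
A rough C sketch of what a block copy through a vector register looks like,
assuming 64 elements of 64 bits per register. `VLEN` and
`block_copy_via_vreg` are made-up names for illustration, and the two
`memcpy` calls only stand in for a vector load and a vector store; this is
not Cray code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define VLEN 64  /* assumed: 64 elements per vector register */

/* Copy nwords 64-bit words from src to dst in strips of at most VLEN words,
   staging every strip in a buffer that plays the role of a vector register.
   Returns the number of strips, i.e. load/store pairs the CPU had to issue. */
static size_t block_copy_via_vreg(uint64_t *dst, const uint64_t *src, size_t nwords)
{
    assert(nwords == 0 || (dst != NULL && src != NULL));

    uint64_t vreg[VLEN];  /* stand-in for one vector register */
    size_t strips = 0;

    for (size_t done = 0; done < nwords; done += VLEN) {
        size_t n = nwords - done < VLEN ? nwords - done : VLEN;
        memcpy(vreg, src + done, n * sizeof vreg[0]);  /* "vector load"  */
        memcpy(dst + done, vreg, n * sizeof vreg[0]);  /* "vector store" */
        strips++;
    }
    return strips;
}
```

Copying 130 words this way takes three strips (64 + 64 + 2), all of them
issued by the processor itself rather than by a DMA engine.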

~~~
datenwolf
> i.e. you have to spend CPU cycles copying data

Same for x86. "String" instructions for copying between regions of memory are
the closest thing to a memory-to-memory DMA, a feature commonplace on
microcontrollers.

~~~
fulafel
The cache is the only[1] core-private memory in x86. If you think about the
cache as the local storage, there's automatic memory-to-memory DMA between
cache and shared main memory :)

[1] Of course there is some architected program state including the registers,
fp stack, flags, and various other processor state that can be saved/loaded
but let's keep to byte-addressable storage

------
Erwin
I like the infix notation, e.g. "Vi Pvj", the population count instruction,
which takes vector j as input and writes the count of set bits in each
element into i.

Versus Intel's vpopcntX reg1, reg2 where X determines element size.
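
For anyone who wants it spelled out, a minimal C sketch of what a per-element
population count computes, assuming 64 elements of 64 bits each. The names
`VLEN`, `popcount64` and `vector_popcount` are hypothetical, and the element
count `n` is a plain parameter here rather than anything the hardware
provides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VLEN 64  /* assumed: 64 elements per vector register */

/* Count the set bits in one 64-bit word (Kernighan's method:
   each pass clears the lowest set bit). */
static unsigned popcount64(uint64_t x)
{
    unsigned count = 0;
    while (x != 0) {
        x &= x - 1;
        count++;
    }
    return count;
}

/* Roughly what "Vi Pvj" describes: vi[e] = popcount(vj[e])
   for the first n elements. */
static void vector_popcount(uint64_t vi[VLEN], const uint64_t vj[VLEN], size_t n)
{
    assert(n <= VLEN);
    for (size_t e = 0; e < n; e++)
        vi[e] = popcount64(vj[e]);
}
```

Each output element is independent of the others, which is why the operation
fits a vector instruction so naturally.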

Seems like Cray had several 64*64 = 4096-bit vector registers, but you worked
on them only 64 bits at a time, whereas current Intel CPUs have 512-bit vector
registers, up from 256-bit for AVX2.

Are those Intel vector register sizes going to increase until they catch up to
the old Cray? Or was going up from 256 to 512 bits chosen to fit something
else in the CPU architecture, such as being able to fill the register in a
certain number of clock cycles?

~~~
tom_mellior
> Seems like Cray had several 64*64 = 4096-bit vector registers, but you worked
> on them only 64 bits at a time

What makes you say that? There seem to be the usual vector-vector
instructions:

    161ijk  Vi Vj+Vk

> Are those Intel vector register sizes going to increase

I don't see how they could. The vector size has increased from XMM to YMM to
ZMM, there is obviously no more room for expansion ;-)

~~~
jabl
> > but you worked on them only 64 bits at a time

> What makes you say that?

Ye olde Crays used 'vector pipelining', meaning that while vector registers
held many elements, there was only one ALU. So a single vector instruction
took many cycles to execute. OTOH this enabled the execution units to be well
utilized even without a cache, heroic OoO etc.
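
A toy C model of that idea, for illustration only: one adder, one element
entering it per clock. `VLEN` and `vector_add_pipelined` are made-up names,
the add is treated as a plain integer add (an assumption), and pipeline
startup latency and chaining are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VLEN 64  /* assumed: 64 elements per vector register */

/* Toy model of a "Vi Vj+Vk" style instruction on a single pipelined adder.
   One element is issued per clock, so the return value is the number of
   clocks spent issuing elements (startup latency not modeled). */
static size_t vector_add_pipelined(uint64_t vi[VLEN], const uint64_t vj[VLEN],
                                   const uint64_t vk[VLEN], size_t n)
{
    assert(n <= VLEN);

    size_t clocks = 0;
    for (size_t e = 0; e < n; e++) {
        vi[e] = vj[e] + vk[e];  /* the one ALU handles element e this clock */
        clocks++;
    }
    return clocks;
}
```

So a full 64-element add keeps the single adder busy for 64 clocks off one
instruction fetch, instead of finishing in one go the way a 512-bit SIMD add
does.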

~~~
gpderetta
Yes, if I understand correctly, at the time instruction fetch/dispatch was the
bottleneck, so vector instructions would keep the execution units busy with
data streamed directly from main memory (there was no need for cache because,
at least for throughput oriented applications, main memory was not
significantly slower than the cpu itself).

~~~
jabl
Also, Crays of yore used SRAM for main memory. And back then there was also
much less of a gap between memory bus speed and cpu speed. This combined with
the vector pipelining made caches somewhat unnecessary.

~~~
jtlienwis
Cray-2 was all DRAM, except for maybe one of the first to ship, which was SRAM.

------
acoye
I kind of want to run Doom on this too: 167i-k Vi *QVk reciprocal square root
approximation

~~~
ygra
Doom doesn't have arbitrary normals; wasn't that from Quake?

~~~
smcl
Quake III Arena, iirc

~~~
acoye
You are correct. I was running low on caffeine.
[https://en.wikipedia.org/wiki/Fast_inverse_square_root](https://en.wikipedia.org/wiki/Fast_inverse_square_root)
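
For reference, a small C sketch of the bit trick that article describes:
a crude initial guess followed by one Newton-Raphson step. `fast_rsqrt` is a
made-up name, it assumes 32-bit IEEE 754 floats, and it says nothing about how
the Cray instruction actually computes its approximation.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Approximate 1/sqrt(x) for positive, finite x.
   Assumes float is a 32-bit IEEE 754 single. */
static float fast_rsqrt(float x)
{
    assert(x > 0.0f);

    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);    /* reinterpret the float's bits safely */
    bits = 0x5f3759dfu - (bits >> 1);  /* magic-constant initial guess */

    float y;
    memcpy(&y, &bits, sizeof y);
    return y * (1.5f - 0.5f * x * y * y);  /* one Newton-Raphson refinement */
}
```

The `memcpy` calls replace the pointer cast in the famous version, which is
undefined behavior in standard C.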

------
codezero
Really looking forward to someone who can comment on this to break it down for
those of us who want to know but can't even. :)

~~~
dragontamer
The "vector supercomputers" of the 80s and 90s serve as the basis of modern
GPUs. A GPU programmer can immediately see the similarity between the assembly
language here and modern GPU assembly languages (AMD RDNA GPUs or NVidia
Turing GPUs).

Just as a modern programmer learns about the DEC PDP-11 and its influence on
the C programming language, a modern GPU programmer could look at these Cray
notes and learn about the influence of that machine on the modern GPU.

------------

The SIMD principles of this Cray have found their way into normal CPUs, in the
form of AVX or AVX-512 instructions.

------
acoye
Hey look, it had the NSA's magic instruction: 106ij0 Si PSj population count

