The Cray vector processors had a set of 8 64-element x 64-bit 'vector' (V) registers, as well as 8 64-bit 'scalar' (S) and 8 24-bit 'address' (A) registers - so it would be sort of similar to 4096-bit wide SIMD. When you did an operation like a vector add, you could do "V0 = V1 + V2", and it would automatically do 64 consecutive adds, finishing in 64 + a few cycles (since the hardware was still only doing 1 add per cycle). As someone else mentioned, it also supported "vector chaining", so if your next instruction was "V2 = V0 * V3", it could take each result from the adder and pipe it straight into the multiplier, so your addition and multiplication were nearly fully overlapped (and you're cruising along at 160 MFLOPS in 1976!). I think it might have supported 3 chains, so you could very briefly peak at 240 MFLOPS, but you couldn't sustain it because of the startup latencies involved.
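In scalar terms, that chained pair of vector instructions computes something like the following (just a C sketch of the per-element semantics, not actual Cray code - the names are mine):

    #define VLEN 64  /* each Cray vector register held 64 elements */

    /* Semantics of "V0 = V1 + V2" followed by the chained "V2 = V0 * V3".
     * The hardware streamed one element per cycle through the adder, and
     * chaining let each sum flow straight into the multiplier, so the two
     * statements below effectively ran overlapped in time. */
    void vadd_then_chained_vmul(double v0[VLEN], const double v1[VLEN],
                                double v2[VLEN], const double v3[VLEN]) {
        for (int i = 0; i < VLEN; i++) {
            v0[i] = v1[i] + v2[i];   /* vector add, 1 element per cycle */
            v2[i] = v0[i] * v3[i];   /* chained multiply, fed by the adder */
        }
    }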
As a 'practical' example, I was able to write an N-body simulator of Jupiter and 63 of its moons (using the vector registers) orbiting one another in only 127 total instructions!
Another key feature of these architectures is that they had a vector length register. This allowed you to write strip-mined loops that move through arbitrarily sized vectors in units of the hardware vector length, without knowing that length until runtime. That means that, unlike with MMX/SSE, the same binary works on machines with different numbers of lanes.
This idea has been resurrected recently with RISC-V's and ARM's scalable vector instructions. The general idea there is an instruction that sets a vector-length register to the minimum of a requested element count and the hardware vector length, and sets the masking/predication appropriately when the request is smaller. This makes for a very straightforward strip-mined loop with no separate branch to check for and handle the remainder in the last iteration.
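Roughly what that looks like with the RISC-V vector C intrinsics (this is from memory of <riscv_vector.h>, so treat the exact intrinsic spellings as approximate):

    #include <stddef.h>
    #include <riscv_vector.h>

    /* Strip-mined c[i] = a[i] + b[i] over n elements. vsetvl returns
     * min(remaining elements, hardware vector length), so the last,
     * partial pass needs no separate scalar remainder loop. */
    void vec_add(double *c, const double *a, const double *b, size_t n) {
        size_t i = 0;
        while (i < n) {
            size_t vl = __riscv_vsetvl_e64m1(n - i);   /* elements this pass */
            vfloat64m1_t va = __riscv_vle64_v_f64m1(a + i, vl);
            vfloat64m1_t vb = __riscv_vle64_v_f64m1(b + i, vl);
            vfloat64m1_t vc = __riscv_vfadd_vv_f64m1(va, vb, vl);
            __riscv_vse64_v_f64m1(c + i, vc, vl);
            i += vl;                                   /* advance by the actual vl */
        }
    }

The same binary runs unchanged whether the hardware's vectors are 128 bits or 1024 bits wide.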