> RISC-V however does not work like this. The RISC-V vector registers are in a separate register file not shared with the scalar floating point registers.
Honestly... in hardware, they probably are actually in the same register file. It just means you now have two sets of architectural registers that rename to the same physical register file.
As for the rest of the article, it looks like it mostly boils down to "I'm intimidated by assembly programming" as opposed to any actual critique of the strengths and weaknesses of the vector ISAs. There are superficial complaints about the number of instructions, or about different ways to write (the same? I only know scalar ARM assembly, not any vector extensions) instructions. On a quick reread, I see a complaint that's entirely due to how ARM represents indexed load operations, which has absolutely nothing to do with the vector ISA whatsoever.
If your goal is to understand how hardware SIMD works, you're probably better off sticking to C code with intrinsics; that way you're not distracted by the extra hoops that arise just from translating C into assembly.
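For example, a plain C loop with NEON intrinsics already makes the SIMD structure visible without the assembly bookkeeping (a minimal sketch, assuming AArch64 and that n is a multiple of two; the function name is just for illustration):

    #include <arm_neon.h>
    #include <stddef.h>

    // y[i] += a * x[i], two doubles per 128-bit NEON register
    void daxpy_neon(size_t n, double a, const double *x, double *y) {
        float64x2_t va = vdupq_n_f64(a);        // broadcast the scalar
        for (size_t i = 0; i < n; i += 2) {
            float64x2_t vx = vld1q_f64(x + i);  // load two x values
            float64x2_t vy = vld1q_f64(y + i);  // load two y values
            vy = vfmaq_f64(vy, vx, va);         // vy + vx * va (fused multiply-add)
            vst1q_f64(y + i, vy);               // store back
        }
    }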
>> The RISC-V vector registers are in a separate register file not shared with the scalar floating point registers.
> Honestly... in hardware, they probably are actually in the same register file. It just means you now have two sets of architectural registers that rename to the same physical register file.
You could have a single unified pool of physical registers that can be handed out to any architectural register, but there's only some advantage in doing so and a lot of advantages in keeping them separate. Either way, that's a micro-architectural detail that designers are free to implement (or not).
From the software's point of view, there are a lot of advantages in keeping different architectural registers separate.
> You could have a single unified pool of physical registers that can be handed out to any architectural register, but there's only some advantage in doing so and a lot of advantages in keeping them separate. Either way, that's a micro-architectural detail that designers are free to implement (or not).
What's the advantage of keeping them separate? If you're implementing vector instructions, then your scalar floating-point units are probably going to be the same as the vector floating-point units, with zero-extension for the unused vector slots. At that point, keeping them in separate hardware register files is detrimental: it costs you extra area, with concomitant power costs. You also need larger register files to accommodate all of the vector registers and the floating-point registers, when you're only likely to use half of them at any one time. If you're pushing the vector units at full throttle, you'll have little scalar code that needs all that renaming; if you're pushing the scalar units at full throttle, you'll similarly have little vector code.
From a software viewpoint, eh, there's not really any advantage to keeping them separate. You tend to use scalar xor vector floating point code, not both (this isn't true for integer, though), so there's little impact on actual register pressure. More architectural registers means more state to spill on context switches.
> On a quick reread, I see a complaint that's entirely due to how ARM represents indexed load operations, which has absolutely nothing to do with the vector ISA whatsoever.
Not exactly true.
If you can use fancy addressing modes in your vector loads and stores, and you have a fixed-length 32-bit opcode (as both AArch64 and RISC-V do[1]), then specifying an index register and how much to shift it by takes up an extra 7 bits of your opcode (5 for the register number, 2 for the shift amount) vs an instruction that just specifies a base pointer register.
That means one instruction takes up opcode space that could otherwise encode 128 different instructions. So either your vector ISA has fewer instructions and capabilities than it otherwise could have, or it takes up a lot more of the overall opcode space.
    loop:
        LD1D    z1.d, p0/z, [x1, x3, LSL #3]    // load x[i...]
        LD1D    z0.d, p0/z, [x2, x3, LSL #3]    // load y[i...]
        FMLA    z0.d, p0/m, z1.d, z2.d          // y += x * a (a broadcast in z2)
        ST1D    z0.d, p0, [x2, x3, LSL #3]      // store y[i...]
        INCD    x3                              // i += doubles per vector
        WHILELT p0.d, x3, x0                    // next predicate: i < n
        B.ANY   loop                            // loop while any lane is active
Yeah, it's a little more code (4 bytes in RISC-V) to increment x1 and x2 by 8 using extra instructions, but 1) that FMLA almost certainly takes 3 or 4 clock cycles, giving you plenty of spare cycles to execute the adds even on a single-issue machine, and 2) vectorized loops are likely to take up so little of your overall application that it makes no difference.
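For reference, here's roughly what the equivalent loop looks like in RVV assembly, modeled on the daxpy example in the vector spec (register assignments follow the standard calling convention; it's a sketch, not tuned code):

    # a0 = n, fa0 = a, a1 = x pointer, a2 = y pointer
    loop:
        vsetvli   t0, a0, e64, m8, ta, ma    # t0 = elements handled this iteration
        vle64.v   v0, (a1)                   # load x
        vle64.v   v8, (a2)                   # load y
        vfmacc.vf v8, fa0, v0                # y += a * x
        vse64.v   v8, (a2)                   # store y
        sub       a0, a0, t0                 # n -= elements processed
        slli      t0, t0, 3                  # element count -> bytes
        add       a1, a1, t0                 # bump x pointer
        add       a2, a2, t0                 # bump y pointer
        bnez      a0, loop

The two adds at the end are the "extra" pointer increments; in exchange, the loads and stores only ever need a base register, so there's no index-register or shift field eating encoding space.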
There's an argument to be made that it's worth the opcode space for scaled indexed addressing for integer loads and stores. Reasonable people may differ. But the case for FP and Vector loads and stores is pretty much non-existent.
It's not just a matter of "Ohhh... that instruction looks so scary."
[1] RISC-V has 16-bit opcodes for very simple and common instructions, but the vector ISA consists entirely of 32-bit opcodes.
> If your goal is to understand how hardware SIMD works, you're probably better off sticking to C code with intrinsics
Agreed, and we're also using intrinsics in time-critical places. I am confident we will be able to hide both SVE and RVV behind the same C++ interface (https://github.com/google/highway): it already works for RVV, and we've just started on SVE.
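For anyone curious what that interface looks like, the core loop in Highway is along these lines (lightly adapted from the example in the Highway README; a sketch, static dispatch only, assuming n is a multiple of the lane count):

    #include <cstddef>
    #include "hwy/highway.h"

    namespace hn = hwy::HWY_NAMESPACE;

    // y[i] += a * x[i]; Lanes(d) is whatever the compiled target provides
    // (SVE, RVV, NEON, AVX-512, ...), so the same source covers all of them.
    void MulAddLoop(const double* x, double* y, double a, size_t n) {
      const hn::ScalableTag<double> d;
      const auto va = hn::Set(d, a);
      for (size_t i = 0; i < n; i += hn::Lanes(d)) {
        const auto vx = hn::Load(d, x + i);
        const auto vy = hn::Load(d, y + i);
        hn::Store(hn::MulAdd(vx, va, vy), d, y + i);  // vx * va + vy
      }
    }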
If your vector registers are only 128 bits long, then it's probably OK to have one big pool of registers. But if your vector registers are 512 bits, 4096 bits, or 65536 bits, then it's an awful waste not to be able to use one of those just because you need another int or FP scalar value in your loop.