Using LMUL > 1 likely kills vrgather performance, so I'd always use LMUL=1.
VL doesn't affect the table AFAIK - it only affects what's written out. The table doesn't need to be more than 128 bits, but I don't think you can limit it. Though I'd be interested if setting VL<VLMAX improves performance.
Unfortunately RVV has very limited shuffle/permutation functionality, which often forces you into using vrgather. For example, there's no endian swap instruction (in base V at least).
A byte permute instruction has many uses: character matching, string processing, bit reversal, etc. On x86, Reed-Solomon coding is often limited by (V)PSHUFB throughput, for example.
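To make the bit reversal case concrete, here is a rough Python model of the usual nibble-LUT trick. The `vrgather` helper is a hypothetical stand-in for a byte gather (PSHUFB or vrgather style), not real intrinsics, and the table only needs 16 entries, i.e. 128 bits:

```python
def vrgather(table, indices):
    """Toy byte gather: out[i] = table[indices[i]], or 0 if the index is out of range."""
    return [table[i] if i < len(table) else 0 for i in indices]

# 16-entry lookup table: the bit-reversed value of each 4-bit nibble.
NIBBLE_REV = [int(f"{n:04b}"[::-1], 2) for n in range(16)]

def reverse_bits_bytes(data):
    """Reverse the bits of every byte using two 16-entry table lookups."""
    lo = vrgather(NIBBLE_REV, [b & 0x0F for b in data])  # reversed low nibbles
    hi = vrgather(NIBBLE_REV, [b >> 4 for b in data])    # reversed high nibbles
    # The reversed low nibble becomes the new high nibble and vice versa.
    return [(l << 4) | h for l, h in zip(lo, hi)]
```

On real hardware the two lookups would each be one gather over a full register of bytes, which is why the throughput of that one instruction matters so much.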
> Though I'd be interested if setting VL<VLMAX improves performance.
It does for ara but not for all other implementations I could get my hands on (which all have a VLEN of 128 or 256, so relatively small).
I've been thinking about how implementations should optimize vrgather so that software gets the most benefit.
My thinking was that as long as LMUL=1 vrgather is fast for 128 or 256 bits, you should cover most cases.
> For example, there's no endian swap instruction (in base V at least).
Yeah, but there you can just use multiple LMUL=1 vrgathers instead of higher LMUL vrgathers, because this is inherently simpler: you don't need to cross full lanes anyway.
I think most lookup table uses of vrgather fit within 128 or 256 bits; a 4 byte lut is great for character matching.
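A quick Python sketch of why the endian swap case is easy: the index vector is the same for every register and never points outside it, so each register in a group can be gathered independently. The helper names are made up for illustration, and `elem_bytes` is assumed to be a power of two:

```python
def vrgather(table, indices):
    """Toy byte gather: out[i] = table[indices[i]], or 0 if the index is out of range."""
    return [table[i] if i < len(table) else 0 for i in indices]

def bswap_indices(vlenb, elem_bytes):
    """Gather indices that reverse the bytes inside each element.

    XOR with (elem_bytes - 1) flips the byte position within an element
    and leaves the element number untouched (elem_bytes must be a power of two).
    """
    return [i ^ (elem_bytes - 1) for i in range(vlenb)]

def bswap_group(registers, elem_bytes):
    """Byte-swap a register group with one independent LMUL=1-style gather per register."""
    idx = bswap_indices(len(registers[0]), elem_bytes)
    return [vrgather(reg, idx) for reg in registers]
```

For example, with two 16-byte registers and 4-byte elements, `bswap_group` reuses one 16-entry index vector for both registers, and no index ever refers to the other register.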
> It does for ara but not for all other implementations I could get my hands on (which all have a VLEN of 128 or 256, so relatively small).
I generally would expect it to have no effect, because it's difficult to base scheduling/forwarding decisions off a register value. Though for Ara2, given your other post, it sounds like it's really just a 512-bit vector processor that pretends to be 4096-bit, in which case, it can probably adapt to VL changes.
> Yeah, but there you can just use multiple LMUL=1 vrgathers
I never considered LMUL>1 - I've always been talking about LMUL=1.
> I think most lookup table uses of vrgather fit within 128 or 256 bits; a 4 byte lut is great for character matching.
Ideally yes, but the problem is that there's no way to actually do just this under RVV. According to spec, vrgather must consider the full vector, regardless of VL. In other words, if your VLMAX is 4096b, so is your lookup table.
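As a rough illustration of that point, here is a small Python model of how I read the vrgather.vv semantics. It is a sketch, not a reference implementation: it ignores masking and models the tail-undisturbed policy only. Note that `vl` limits which destination elements are written, while any index below VLMAX is still a valid read from the source:

```python
def vrgather_vv(vd, vs2, vs1, vl):
    """Toy model of vrgather.vv for one register (no masking, tail undisturbed).

    vd  : old destination contents (list of VLMAX elements)
    vs2 : source/table register (list of VLMAX elements)
    vs1 : index register
    vl  : number of destination elements to write
    """
    vlmax = len(vs2)
    out = list(vd)
    for i in range(vl):
        idx = vs1[i]
        # Indices are compared against VLMAX, not vl: the whole source is readable.
        out[i] = vs2[idx] if idx < vlmax else 0
    return out
```

With VLMAX=8 and vl=3, an index of 7 still has to return element 7 of the source, so the hardware cannot assume the table is only `vl` elements long.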
And this is a major criticism I have about RVV in general. They tout support up to 65536-bit vectors, however the spec includes aspects which are difficult to scale to large vectors. IMO, SVE2/2.1 does a much better job.
> Though for Ara2, given your other post, it sounds like it's really just a 512-bit vector processor that pretends to be 4096-bit, in which case, it can probably adapt to VL changes.
I think they use a larger register file to reduce scheduling overhead and memory latency.
> According to spec, vrgather must consider the full vector, regardless of VL. In other words, if your VLMAX is 4096b, so is your lookup table.
Interesting, I didn't know that. I suppose large LMUL implementations would need to tag vector registers with their vl to make this work, or take more cycles if it goes outside of a certain range.
> I suppose large LMUL implementations would need to tag vector registers with their vl to make this work
Again, I'm only considering LMUL=1 here.
You'd need to also tag how tail elements are handled; tail undisturbed probably won't work, so it'd only work with tail agnostic, and if the processor always sets tail elements to -1.
It'd also be more complicated with LMUL>1, since the processor would have to check the tagged VL for all registers in the group. And having scheduling depend on multiple register values is going to be problematic, but it's probably doable if high performance isn't a major concern.
In general, it's wrong to think of VL as meaning "vector length". It's more accurate to think of it as an additional mask such that `mask = (1 << vl) - 1`.
It's easy to get confused with RVV, given all the acronyms and config options. Despite its "RISC" background, I find it much more complex and confusing than AVX and SVE.
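A tiny Python sketch of that mental model, using a made-up `masked_add` helper and tail/mask-undisturbed behaviour for simplicity. VL just becomes one more mask that gets ANDed with the regular `v0` mask:

```python
def vl_mask(vl):
    """Bit i is set iff element i is below vl."""
    return (1 << vl) - 1

def masked_add(vd, a, b, vl, v0=None):
    """Toy element-wise add: an element is written only if it is active.

    Active means bit i is set in the VL mask and, if a v0 mask is given,
    in v0 as well. Inactive elements keep their old vd value.
    """
    active = vl_mask(vl)
    if v0 is not None:
        active &= v0
    return [x + y if (active >> i) & 1 else old
            for i, (old, x, y) in enumerate(zip(vd, a, b))]
```

Seen this way, the source registers are always full width; VL only decides which results get written.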
The ISA is variable length/scalable, but this implementation uses a 4096-bit-wide register file.
They are a bit disingenuous in claiming they support RVV 1.0 while others only support a subset, as they haven't implemented vrgather or vcompress yet, but there are open pull requests for them [0].
Sadly there also seem to be a few bugs when simulating with verilator [1], so I couldn't measure all instructions, but here is `vadd.vv` and `vwaddu.vv` for the VLEN=4096, four lane configuration:
vadd.vv:
e32m1: 16 cycles
e32m2: 32 cycles
e32m4: 63 cycles
e32m8: 126 cycles
vwaddu.vv:
e32m1: 34 cycles
e32m2: 69 cycles
e32m4: 140 cycles
I don't have the time to rebuild ara now, but the 16 lane configuration should roughly be four times faster.
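For what it's worth, the `vadd.vv` numbers above can be sanity-checked with a bit of arithmetic. This Python sketch only uses the measured cycle counts from this comment; the 16 lane projection assumes perfectly linear scaling with lane count, which is a guess rather than a measurement:

```python
VLEN = 4096  # bits per vector register in this configuration
SEW = 32     # element width in bits

# Measured vadd.vv cycle counts from above, keyed by LMUL (four lane configuration).
measured_vadd = {1: 16, 2: 32, 4: 63, 8: 126}

def elements(lmul):
    """Number of SEW-bit elements in a register group."""
    return VLEN * lmul // SEW

def bits_per_cycle(lmul, cycles):
    """Average result bits produced per cycle."""
    return VLEN * lmul / cycles

def projected_cycles(cycles, lanes_from=4, lanes_to=16):
    """Naive projection to another lane count, assuming linear scaling."""
    return cycles * lanes_from / lanes_to

for lmul, cycles in measured_vadd.items():
    print(f"e32m{lmul}: {elements(lmul)} elements, "
          f"{bits_per_cycle(lmul, cycles):.0f} bits/cycle, "
          f"~{projected_cycles(cycles):.1f} cycles projected at 16 lanes")
```

At e32m1 that works out to 128 elements in 16 cycles, i.e. 256 bits per cycle or 64 bits per lane per cycle, and the larger LMULs stay at roughly the same rate.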
How well does it perform with widen/narrow and vrgather (IMO the most important instruction)?