I think you may be correct about Gracemont vs. Golden Cove. Rumors/insiders say that Intel has decided to kill off either the P-core or the E-core team, so I'd guess that the P-core team is getting laid off, because the E-core's IPC is basically the same while the E-core is massively more efficient. Even if the P-core wins, I'd expect them to adopt the 3x3 decoder, just as AMD adopted a 2x4 decoder for Zen 5.
Using a non-frozen spec is at your own risk. There's nothing comparable to stuff like SSE4a or FMA4. The custom extension issue is vastly overstated. Anybody can make extensions, but nobody will use unratified extensions unless they're in a very niche industry. The P extension is a good example here. The current proposal is a copy/paste of a proprietary extension a company is already using. There may be people in their niche using their extension, but I don't see people jumping to add support anywhere (outside their own engineers).
There's a LOT to unpack about RVV. Packed SIMD doesn't even have LMUL>1, so the comparison here is that you are usually the same as packed SIMD, but can sometimes be better, which isn't a terrible place to be.
Differing performance across different classes of hardware is to be expected when RVV must scale from tiny DSPs up to supercomputers. As you point out, old Atom cores (about the same class as the SpacemiT CPU) would have a different performance profile from a larger core. Even the larger AMD cores have their own performance quirks, with their tendency to double-pump AVX2/AVX-512 instructions (but not all of them -- just some).
In any case, it's a matter of the wrong configuration, unlike x86 where it is a matter of the wrong instruction (and at times the wrong configuration too). It seems obvious to me that the compiler will ultimately need to generate a handful of different code variants (this shouldn't be a code-bloat issue because only a tiny fraction of all code is SIMD), then dynamically choose the best variant for the processor at runtime.
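Something like this is the shape I have in mind -- a minimal sketch, with the variant names and the detection function entirely made up for illustration:

```cpp
#include <cstddef>

// Two variants of the same kernel. In practice these would be the compiler's
// differently-tuned vector code (different LMUL, unrolling, etc.); here they
// are plain scalar stand-ins so the sketch compiles anywhere.
static void saxpy_small_core(float* y, const float* x, float a, size_t n) {
  for (size_t i = 0; i < n; ++i) y[i] += a * x[i];
}
static void saxpy_big_core(float* y, const float* x, float a, size_t n) {
  for (size_t i = 0; i < n; ++i) y[i] += a * x[i];
}

using SaxpyFn = void (*)(float*, const float*, float, size_t);

// Hypothetical detection: a real implementation would probe the CPU once at
// startup (VLEN, vendor/microarchitecture id, etc.) and cache the answer.
static SaxpyFn pick_saxpy_for_this_cpu() {
  const bool looks_like_a_big_core = true;  // placeholder for a real probe
  return looks_like_a_big_core ? saxpy_big_core : saxpy_small_core;
}

// Resolved once; afterwards each call is a single indirect jump, which is
// noise next to any loop big enough to be worth vectorizing.
static const SaxpyFn saxpy = pick_saxpy_for_this_cpu();
```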
> Packed SIMD doesn't even have LMUL>1, so the comparison here is that you are usually the same as Packed SIMD, but can sometimes be better which isn't a terrible place to be.
Packed SIMD not having LMUL means that hardware can't rely on it being used for high performance; whereas some of the XTheadVector hardware (and the same could equally apply to RVV 1.0) already had VLEN=128 with 256-bit ALUs, so LMUL=2 gets twice the throughput of LMUL=1. And even above LMUL=2, various benchmarks have shown improvements.
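For anyone not steeped in RVV, this is all the difference amounts to in source code -- a minimal saxpy sketch with the standard RVV 1.0 C intrinsics, where going from LMUL=1 to LMUL=2 is just the m1 -> m2 suffix, but each instruction then covers a group of two registers:

```cpp
#include <riscv_vector.h>
#include <cstddef>

// y[i] += a * x[i] at LMUL=1: each instruction works on one vector register.
void saxpy_m1(float* y, const float* x, float a, size_t n) {
  for (size_t vl; n > 0; n -= vl, x += vl, y += vl) {
    vl = __riscv_vsetvl_e32m1(n);
    vfloat32m1_t vx = __riscv_vle32_v_f32m1(x, vl);
    vfloat32m1_t vy = __riscv_vle32_v_f32m1(y, vl);
    __riscv_vse32_v_f32m1(y, __riscv_vfmacc_vf_f32m1(vy, a, vx, vl), vl);
  }
}

// Same loop at LMUL=2: each instruction works on a group of two registers,
// so hardware with a datapath wider than VLEN can retire twice the elements
// per instruction.
void saxpy_m2(float* y, const float* x, float a, size_t n) {
  for (size_t vl; n > 0; n -= vl, x += vl, y += vl) {
    vl = __riscv_vsetvl_e32m2(n);
    vfloat32m2_t vx = __riscv_vle32_v_f32m2(x, vl);
    vfloat32m2_t vy = __riscv_vle32_v_f32m2(y, vl);
    __riscv_vse32_v_f32m2(y, __riscv_vfmacc_vf_f32m2(vy, a, vx, vl), vl);
  }
}
```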
Having a compiler output multiple versions is an interesting idea. Pretty sure it won't happen though; it'd be a rather difficult political mess of ever more "please add special-casing for my hardware" requests, and it would stop working well on hardware released after the code was compiled (unless something like glibc gains a standard set of hardware performance properties that can be updated independently of precompiled software, which would be extra hard to get through). P-cores vs E-cores would add yet another layer of mess. There might be a simpler version that just goes by VLEN, which at least is constant on a given machine, but I don't see much use in that really.
> it's a matter of the wrong configuration unlike x86 where it is a matter of the wrong instruction
+1 to dzaima's mention of vrgather. The lack of fixed-pattern shuffle instructions in RVV is absolutely a wrong-instruction issue.
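To make the point concrete: even a fully fixed pattern such as reversing 32-bit elements is a single pshufd with an immediate on SSE, whereas in RVV the pattern has to be materialized as an index vector and then fed through the general-purpose vrgather (whose cost also grows with LMUL). A rough sketch with the standard RVV intrinsics:

```cpp
#include <riscv_vector.h>
#include <cstddef>
#include <cstdint>

// Reverse the order of the active elements of a vector. The fixed pattern
// still has to be built as data in a register before the generic permute.
vuint32m1_t reverse_u32(vuint32m1_t v, size_t vl) {
  vuint32m1_t idx = __riscv_vid_v_u32m1(vl);                  // 0, 1, 2, ...
  idx = __riscv_vrsub_vx_u32m1(idx, (uint32_t)(vl - 1), vl);  // vl-1, vl-2, ..., 0
  return __riscv_vrgather_vv_u32m1(v, idx, vl);               // generic permute
}
```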
I agree with your point that multiple code variants + runtime dispatch are helpful. We do this with Highway in particular for x86. Users only write code once with portable intrinsics, and the mess of instruction selection is taken care of.
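For reference, a kernel in Highway looks roughly like this (from memory, single-target static dispatch shown; the full dynamic-dispatch setup additionally uses foreach_target.h plus the HWY_EXPORT/HWY_DYNAMIC_DISPATCH macros):

```cpp
#include <cstddef>
#include "hwy/highway.h"

namespace hn = hwy::HWY_NAMESPACE;

// y[i] += a * x[i], written once against Highway's portable intrinsics.
// The same source targets SSE4/AVX2/AVX-512, NEON, SVE or RVV depending on
// the build; ScalableTag means "whatever vector width this target has".
HWY_ATTR void Saxpy(float* HWY_RESTRICT y, const float* HWY_RESTRICT x,
                    float a, size_t n) {
  const hn::ScalableTag<float> d;
  const size_t N = hn::Lanes(d);
  const auto va = hn::Set(d, a);
  size_t i = 0;
  for (; i + N <= n; i += N) {
    const auto vx = hn::LoadU(d, x + i);
    const auto vy = hn::LoadU(d, y + i);
    hn::StoreU(hn::MulAdd(va, vx, vy), d, y + i);
  }
  for (; i < n; ++i) y[i] += a * x[i];  // scalar tail, kept simple here
}
```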
> +1 to dzaima's mention of vrgather. The lack of fixed-pattern shuffle instructions in RVV is absolutely a wrong-instruction issue.
What others would you want? Something like vzip1/2 would make sense, but that isn't much of a permutation, since the input elements are exactly next to the output elements.
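E.g. an interleave of two 32-bit vectors can already be done without a dedicated instruction via the usual widening trick (sketch from memory, untested); the alternative is vrgather with an index vector:

```cpp
#include <riscv_vector.h>
#include <cstddef>

// vzip1-style interleave of two u32 vectors: the result, viewed as u32
// elements (little-endian), is a0, b0, a1, b1, ... Each 64-bit lane is
// built as zext(a) + 2^32 * zext(b) instead of going through vrgather.
vuint32m2_t zip_u32(vuint32m1_t a, vuint32m1_t b, size_t vl) {
  vuint64m2_t wide = __riscv_vwaddu_vv_u64m2(a, b, vl);       // zext(a) + zext(b)
  wide = __riscv_vwmaccu_vx_u64m2(wide, 0xFFFFFFFFu, b, vl);  // + (2^32 - 1) * zext(b)
  return __riscv_vreinterpret_v_u64m2_u32m2(wide);            // 2*vl interleaved u32s
}
```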