I think you may be correct about Gracemont vs. Golden Cove. Rumors/insiders say that Intel has decided to kill off either the P-core or the E-core team, so I'd guess that the P-core team is getting laid off, because the E-core's IPC is basically the same while the E-core is massively more efficient. Even if the P-core wins, I'd expect them to adopt the 3x3 decoder, just as AMD adopted a 2x4 decoder for Zen 5.
Using a non-frozen spec is at your own risk. There's nothing comparable to stuff like SSE4a or FMA4. The custom extension issue is vastly overstated. Anybody can make extensions, but nobody will use unratified extensions unless they're in a very niche industry. The P extension is a good example here. The current proposal is a copy/paste of a proprietary extension a company is already using. There may be people in their niche using their extension, but I don't see people jumping to add support anywhere (outside their own engineers).
There's a LOT to unpack about RVV. Packed SIMD doesn't even have LMUL>1, so the comparison here is that you are usually the same as packed SIMD, but can sometimes be better, which isn't a terrible place to be.
Differing performance across different classes of hardware is to be expected when RVV must scale from tiny DSPs up to supercomputers. As you point out, old Atom cores (about the same class as the SpacemiT CPU) would have a different performance profile from a larger core. Even the larger AMD cores have their own performance quirks, with their tendency to double-pump AVX2/AVX-512 instructions (but not all of them -- just some).
In any case, it's a matter of the wrong configuration, unlike x86 where it is a matter of the wrong instruction (and at times the wrong configuration too). It seems obvious to me that the compiler will ultimately need to generate a handful of different code variants (this shouldn't be a code-bloat issue because only a tiny fraction of all code is SIMD), then dynamically choose the best variant for the processor at runtime.
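Something like this is the shape I have in mind -- a minimal sketch, with the variant names and the detection function entirely made up for illustration:

```cpp
#include <cstddef>

// Two variants of the same kernel. In practice these would be the compiler's
// differently-tuned vector code (different LMUL, unrolling, etc.); here they
// are plain scalar stand-ins so the sketch compiles anywhere.
static void saxpy_small_core(float* y, const float* x, float a, size_t n) {
  for (size_t i = 0; i < n; ++i) y[i] += a * x[i];
}
static void saxpy_big_core(float* y, const float* x, float a, size_t n) {
  for (size_t i = 0; i < n; ++i) y[i] += a * x[i];
}

using SaxpyFn = void (*)(float*, const float*, float, size_t);

// Hypothetical detection: a real implementation would probe the CPU once at
// startup (VLEN, vendor/microarchitecture id, etc.) and cache the answer.
static SaxpyFn pick_saxpy_for_this_cpu() {
  const bool looks_like_a_big_core = true;  // placeholder for a real probe
  return looks_like_a_big_core ? saxpy_big_core : saxpy_small_core;
}

// Resolved once; afterwards each call is a single indirect jump, which is
// noise next to any loop big enough to be worth vectorizing.
static const SaxpyFn saxpy = pick_saxpy_for_this_cpu();
```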
> Packed SIMD doesn't even have LMUL>1, so the comparison here is that you are usually the same as Packed SIMD, but can sometimes be better which isn't a terrible place to be.
Packed SIMD not having LMUL means that hardware can't rely on it being used for high performance; whereas some of the XTheadVector hardware (and the same could equally apply to RVV 1.0) already had VLEN=128 with 256-bit ALUs, so LMUL=2 gets twice the throughput of LMUL=1. And even above LMUL=2, various benchmarks have shown improvements.
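For anyone not steeped in RVV, this is all the difference amounts to in source code -- a minimal saxpy sketch with the standard RVV 1.0 C intrinsics, where going from LMUL=1 to LMUL=2 is just the m1 -> m2 suffix, but each instruction then covers a group of two registers:

```cpp
#include <riscv_vector.h>
#include <cstddef>

// y[i] += a * x[i] at LMUL=1: each instruction works on one vector register.
void saxpy_m1(float* y, const float* x, float a, size_t n) {
  for (size_t vl; n > 0; n -= vl, x += vl, y += vl) {
    vl = __riscv_vsetvl_e32m1(n);
    vfloat32m1_t vx = __riscv_vle32_v_f32m1(x, vl);
    vfloat32m1_t vy = __riscv_vle32_v_f32m1(y, vl);
    __riscv_vse32_v_f32m1(y, __riscv_vfmacc_vf_f32m1(vy, a, vx, vl), vl);
  }
}

// Same loop at LMUL=2: each instruction works on a group of two registers,
// so hardware with a datapath wider than VLEN can retire twice the elements
// per instruction.
void saxpy_m2(float* y, const float* x, float a, size_t n) {
  for (size_t vl; n > 0; n -= vl, x += vl, y += vl) {
    vl = __riscv_vsetvl_e32m2(n);
    vfloat32m2_t vx = __riscv_vle32_v_f32m2(x, vl);
    vfloat32m2_t vy = __riscv_vle32_v_f32m2(y, vl);
    __riscv_vse32_v_f32m2(y, __riscv_vfmacc_vf_f32m2(vy, a, vx, vl), vl);
  }
}
```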
Having a compiler output multiple versions is an interesting idea. Pretty sure it won't happen though; it'd be a rather difficult political mess of ever more "please add special-casing for my hardware" requests, and it would stop working well on hardware released after the code was compiled (unless something like glibc gains a standard set of hardware performance properties that can be updated independently of precompiled software, which would be extra hard to get through). P-cores vs E-cores would add yet another layer of mess. There might be a simpler version that just goes by VLEN, which at least is constant on a given machine, but I don't see much use in that really.
> it's a matter of the wrong configuration unlike x86 where it is a matter of the wrong instruction
+1 to dzaima's mention of vrgather. The lack of fixed-pattern shuffle instructions in RVV is absolutely a wrong-instruction issue.
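To make the point concrete: even a fully fixed pattern such as reversing 32-bit elements is a single pshufd with an immediate on SSE, whereas in RVV the pattern has to be materialized as an index vector and then fed through the general-purpose vrgather (whose cost also grows with LMUL). A rough sketch with the standard RVV intrinsics:

```cpp
#include <riscv_vector.h>
#include <cstddef>
#include <cstdint>

// Reverse the order of the active elements of a vector. The fixed pattern
// still has to be built as data in a register before the generic permute.
vuint32m1_t reverse_u32(vuint32m1_t v, size_t vl) {
  vuint32m1_t idx = __riscv_vid_v_u32m1(vl);                  // 0, 1, 2, ...
  idx = __riscv_vrsub_vx_u32m1(idx, (uint32_t)(vl - 1), vl);  // vl-1, vl-2, ..., 0
  return __riscv_vrgather_vv_u32m1(v, idx, vl);               // generic permute
}
```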
I agree with your point that multiple code variants + runtime dispatch are helpful. We do this with Highway in particular for x86. Users only write code once with portable intrinsics, and the mess of instruction selection is taken care of.
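For reference, a kernel in Highway looks roughly like this (from memory, single-target static dispatch shown; the full dynamic-dispatch setup additionally uses foreach_target.h plus the HWY_EXPORT/HWY_DYNAMIC_DISPATCH macros):

```cpp
#include <cstddef>
#include "hwy/highway.h"

namespace hn = hwy::HWY_NAMESPACE;

// y[i] += a * x[i], written once against Highway's portable intrinsics.
// The same source targets SSE4/AVX2/AVX-512, NEON, SVE or RVV depending on
// the build; ScalableTag means "whatever vector width this target has".
HWY_ATTR void Saxpy(float* HWY_RESTRICT y, const float* HWY_RESTRICT x,
                    float a, size_t n) {
  const hn::ScalableTag<float> d;
  const size_t N = hn::Lanes(d);
  const auto va = hn::Set(d, a);
  size_t i = 0;
  for (; i + N <= n; i += N) {
    const auto vx = hn::LoadU(d, x + i);
    const auto vy = hn::LoadU(d, y + i);
    hn::StoreU(hn::MulAdd(va, vx, vy), d, y + i);
  }
  for (; i < n; ++i) y[i] += a * x[i];  // scalar tail, kept simple here
}
```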
> +1 to dzaima's mention of vrgather. The lack of fixed-pattern shuffle instructions in RVV is absolutely a wrong-instruction issue.
What others would you want? Something like vzip1/2 would make sense, but that isn't much of a permutation, since the input elements are exactly next to the output elements.
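E.g. an interleave of two 32-bit vectors can already be done without a dedicated instruction via the usual widening trick (sketch from memory, untested); the alternative is vrgather with an index vector:

```cpp
#include <riscv_vector.h>
#include <cstddef>

// vzip1-style interleave of two u32 vectors: the result, viewed as u32
// elements (little-endian), is a0, b0, a1, b1, ... Each 64-bit lane is
// built as zext(a) + 2^32 * zext(b) instead of going through vrgather.
vuint32m2_t zip_u32(vuint32m1_t a, vuint32m1_t b, size_t vl) {
  vuint64m2_t wide = __riscv_vwaddu_vv_u64m2(a, b, vl);       // zext(a) + zext(b)
  wide = __riscv_vwmaccu_vx_u64m2(wide, 0xFFFFFFFFu, b, vl);  // + (2^32 - 1) * zext(b)
  return __riscv_vreinterpret_v_u64m2_u32m2(wide);            // 2*vl interleaved u32s
}
```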