I wonder what impact the vector extension will have on auto-vectorization. A smaller set of instructions means it's easier for the compiler to pick the best ones in the first place. Variable length means you don't have to hope your arrays happen to be a multiple of the SIMD width. And your auto-vectorized code gets faster with each new CPU generation without being recompiled.
All these factors are especially important when the developer isn't even aware of the auto-vectorization and just wants the performance increase for free.
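To make the variable-length point concrete, here's a minimal sketch of the strip-mining loop a compiler can generate, written by hand with what I believe are the RVV v1.0 C intrinsic names (these are my assumption and can differ between toolchain versions): vsetvl tells you how many elements the hardware will take this pass, so arbitrary array lengths just work with no tail loop.

    #include <riscv_vector.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Add two int32 arrays of arbitrary length n. vsetvl returns how many
       elements the hardware processes this iteration, so the tail is
       handled automatically -- no cleanup loop, no padding to a multiple
       of the SIMD width. */
    void vadd(int32_t *dst, const int32_t *a, const int32_t *b, size_t n) {
        while (n > 0) {
            size_t vl = __riscv_vsetvl_e32m8(n);          /* elements this pass */
            vint32m8_t va = __riscv_vle32_v_i32m8(a, vl); /* load vl elements   */
            vint32m8_t vb = __riscv_vle32_v_i32m8(b, vl);
            __riscv_vse32_v_i32m8(dst, __riscv_vadd_vv_i32m8(va, vb, vl), vl);
            a += vl; b += vl; dst += vl; n -= vl;
        }
    }

The same source should run unchanged on hardware with 128-bit or 1024-bit vector registers; vl just comes back larger, which is where the "free speedup on newer CPUs" comes from.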
One complication that isn't mentioned explicitly, but can be inferred from the discussion of cycle counts, is that context switching should take longer, right? In the most basic design you'd wait for the entire instruction to complete, for who knows how many tens of cycles more. You could also just cancel the instruction and retry it after servicing the interrupt, wasting some work. Or let it keep running and only stall when the interrupt code wants to access the same vector registers, which is going to happen eventually anyway if you're doing a context switch (barring some lazy state dump/restore optimization).
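For what it's worth, the lazy option looks roughly like the old lazy-FPU trick. A sketch, with every name and hook invented purely for illustration:

    #include <stddef.h>

    #define VSTATE_BYTES 4096               /* assumption: room for all vector regs */

    struct thread {
        unsigned char vstate[VSTATE_BYTES]; /* saved vector register contents */
        /* ... rest of the thread control block ... */
    };

    /* Hypothetical platform hooks -- stand-ins for whatever the port provides. */
    extern void disable_vector_unit(void);
    extern void enable_vector_unit(void);
    extern void save_vector_regs(unsigned char *buf);
    extern void restore_vector_regs(const unsigned char *buf);

    static struct thread *vstate_owner;     /* thread whose state is live in hardware */

    /* The context switch itself stays cheap: just turn the vector unit "off". */
    void on_context_switch(void) {
        disable_vector_unit();
    }

    /* The new thread's first vector instruction traps here; only now do we pay
       for the big save/restore, and only if ownership actually changed. */
    void on_vector_access_trap(struct thread *current) {
        if (vstate_owner != current) {
            if (vstate_owner)
                save_vector_regs(vstate_owner->vstate);
            restore_vector_regs(current->vstate);
            vstate_owner = current;
        }
        enable_vector_unit();               /* return and re-execute the instruction */
    }

So a thread that never touches the vector unit never pays for dumping those big registers at all.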
Maybe the vector instructions are interruptible like the "increment and repeat" instructions on the Z80 (e.g. LDIR)? LDIR could copy a block of memory up to 64 KBytes (which would take nearly 1.4 million cycles), but even though it looked and behaved like a single instruction to the programmer, it was actually looping over itself until the transfer had completed, simply by not advancing the program counter until register BC (used as the byte counter) reached zero. The downside was that each copied byte required fetching and decoding the whole 2-byte instruction again (costing 21 cycles per copied byte), but the upside was that those increment-and-repeat instructions were interruptible, since they were really just a regular loop of the same 21-cycle instruction over and over again.
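A rough C model of a single LDIR iteration (ignoring flags and exact timing) shows where the interruptibility comes from: between iterations the CPU is sitting at an ordinary instruction boundary.

    #include <stdint.h>

    /* Rough model of Z80 LDIR: one byte moves per iteration, and the CPU
       re-executes the same 2-byte opcode by stepping PC back while BC is
       nonzero, so interrupts can be taken between iterations. */
    typedef struct {
        uint16_t pc, hl, de, bc;
        uint8_t  mem[65536];
    } z80_t;

    void ldir_step(z80_t *cpu) {                /* one 21-cycle iteration */
        cpu->mem[cpu->de] = cpu->mem[cpu->hl];  /* copy (HL) -> (DE) */
        cpu->hl++;
        cpu->de++;
        cpu->bc--;
        if (cpu->bc != 0)
            cpu->pc -= 2;                       /* repeat: point PC back at ED B0 */
        /* an interrupt can be serviced here, before the next fetch/decode */
    }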
My understanding of these RISC-V vector instructions is that they operate on a vector register, which probably won't be too big, perhaps 512 bits or so. Compared to the cost of a context switch, the cost of finishing the current instruction should be negligible.
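Back-of-envelope, with numbers I'm assuming rather than taking from the article: 512 bits is 16 32-bit lanes, so even at a couple of cycles per lane you'd wait maybe a few tens of cycles for the instruction to drain, while a context switch plus its cache and TLB fallout is usually counted in thousands of cycles.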