I wonder if qemu-system-riscv64 results are interesting? On the one hand they mostly tell you about the x86 hardware you're running it on and the quality of TCG. On the other hand we're quite interested in how qemu performs compared to actual hardware because we (Fedora) may wish to use qemu until hardware capabilities catch up. I have qemu on AMD 7950X and Genoa if interested.
QEMU's RVV support was all ahead-of-time-compiled scalar helper loops when I looked at it a couple of months ago, so the results would be very bad (worse than the scalar equivalents). Even x86-64 SSE2 on an x86-64 host goes through ahead-of-time-compiled scalar helpers.
I don't know how rdcycle works on qemu. This benchmark is more meant for developers to figure out how to vectorize algorithms effectively, as in which instructions to choose.
That's a good question! I had to look it up myself ...
Obviously qemu TCG isn't a cycle-accurate emulation. Using RDCYCLE{,H} / reading the corresponding CSRs eventually calls https://gitlab.com/qemu-project/qemu/-/blob/69680740eafa1838... which calls cpu_get_host_ticks, which is basically an arch-independent wrapper around RDTSC, assuming you are running qemu on an x86 host. I think icount_enabled is false in normal qemu builds.
Therefore it would measure the time taken by qemu to emulate the vector instruction, in host TSC ticks. Which I guess is what you would want (maybe?).
> This benchmark is more meant for developers to figure out how to vectorize algorithms effectively, as in which instructions to choose.
Absolutely, I'm not saying the qemu results would say anything very deep, but they're kind of interesting from the point of view of either optimizing qemu or if you have to use qemu because the hardware you want isn't available / isn't cheap enough.
I bought into the Milk-V Pioneer board, hope to get it around Christmas and build a home server around it. I'm curious to see how I can get my application (Rust-based, database & VM-ish) to perform on it and what power draw will be like.
Really cool, but you probably won't see any advantage from the vector extension, because the board uses the 0.7.1 pre-standardization version of the extension. llvm and gcc don't support that version anymore; I needed to use an older gcc branch to compile the assembly code.
Otherwise it performs pretty well, though.
That is definitely unfortunate, given that (as I understand it) the 0.7.1 pre-standard has shipped very widely. It seems a bad decision for LLVM not to at least provide an option to use intrinsics for it.
That said, my code as it stands isn't doing many array/vector-type operations. There are likely places where some of my data structures could be optimized to use them, but not right now.
I'm most interested in just taking advantage of high parallelism on the 64 cores. We shall see how that works out.
I find it amusing that René is known for two hacks:
- enabling RDNA2-based Radeon GPU cards to work. The hack: saving and restoring FPU state in the kernel so that the kernel (more precisely: the GPU driver) can use floating point instructions.
- enabling the use of RVV instructions on a standard Linux distro. The hack: primarily saving and restoring the Vector state in the kernel.
It's essentially the same hack.
I note also that the OS images recommended by the SBC vendors already have the Vector extension enabled, e.g. the OS preinstalled on the Lichee Pi 4A, or Tina for the Nezha (and other D1 boards).
The patches to do so are readily available. It's no big trick to apply them to upstream kernel sources, the problem is getting the result of doing so upstreamed.
It wasn't supposed to happen like this. The burden of supporting 0.7.1 should be on those who pushed it into general availability, instead of expecting the rest of the community to develop a code generator + optimiser for a superseded draft and maintain package repos etc. targeting only one vendor's CPU core.
That is fair. Hopefully the # of pre-1.0 V units out there gets dwarfed quickly by a wave of new machines in the next couple years.
Do you happen to know if the Pioneer board's CPU is socketed? I can't tell from the pictures, but it seems like it might just be soldered on, which is a bummer. The hope would be that it could be upgraded later if/when SOPHON brings out a new chip, but I suspect that isn't going to be possible.
Socketed is a lot more expensive. There's also no point as the SoC is worth vastly more than the motherboard, so you'd probably pay 80% as much for a new CPU as for a new motherboard with CPU. The RAM is socketed, so you can reuse that.
Agreed that THead and/or their customers should have been much quicker off the mark with toolchain support (they're doing it now), but there is another equally vital aspect: the upstream software maintainers have to agree in principle to accept patches. This agreement has been either withheld or slow to come, with such weak reasons as "we don't support draft specs", where "draft specs" is intended to mean specs that were never implemented in hardware.
Mass production of chips with a "draft spec" makes it not a draft at all in this sense -- it's effectively a custom extension, which patches SHOULD be accepted for.
It's just a custom extension that happens to be 95% source code and binary compatible with the official extension.
For a start, not every SoC containing a C906 or C910 core has a core implementing the V extension: THead's V extension implementations are optional (and extra cost) and are not part of the open-sourced cores on github.
I bought the same board, and am somewhat disappointed by this news. Do you happen to have the name of the gcc branch on hand, so I can get things going when it arrives?
Together with that, there are some assembly macros that allow writing in a subset of RVV 1.0 that also works on RVV 0.7.1 (source compatible, but not binary compatible). [0]
There is also the XuanTie toolchain, which does support intrinsics and some sort of autovectorization, I think. But I haven't used that one, and it won't be as good as upstream gcc/llvm. [1]
Other than the CanMV-K230 (Kendryte K230, single 1.6 GHz C908 core implementing RVV) which just started shipping in the last two weeks, every RISC-V board with RVV has either C906 or C910 cores which implement draft 0.7.1.
Those CPU cores were announced in mid 2019 (when RVV 0.7.1 was the current draft) and boards using them started arriving in mid to late 2021.
RVV 1.0 boards will start arriving in force next year, probably starting with the StarFive JH8110 SoC, and (apparently, though I'm not sure I believe it) an update of the SG2042 in the Pioneer, and also the 16 core (but faster cores) SG2380.
> Do you happen to have the name of the gcc branch
The branch has been deleted from the official repo. I have a snapshot on my github:
Note that it is primarily binutils which understands RVV 0.7.1. GCC understands it only to the extent of accepting "v" in "-march" and passing the right flags to the assembler. This enables using the gcc driver to build .s files and inline RVV asm in C. There is no support for RVV intrinsic functions or auto-vectorisation.
It's also a somewhat old gcc. I use it to build .o files from assembly language, and then link them with C compiled by a newer gcc or llvm. Or not, most of the time gcc 9 is fine.
THead have RVV 0.7.1 support in newer gcc, but I haven't been tracking that closely.
https://github.com/riscv/sail-riscv/commit/c90cf2e6eff5fa4ef...
I expect this will help stabilise compiler support, and also help other RISC-V extensions, such as RISC-V Vector Crypto, to progress.