I wonder if qemu-system-riscv64 results are interesting? On the one hand they mostly tell you about the x86 hardware you're running it on and the quality of TCG. On the other hand we're quite interested in how qemu performs compared to actual hardware because we (Fedora) may wish to use qemu until hardware capabilities catch up. I have qemu on AMD 7950X and Genoa if interested.
QEMU's RVV support was all ahead-of-time-compiled scalar helper loops when I looked at it a couple of months ago, so the results would be very bad (worse than the scalar equivalents). Even x86-64 SSE2 on an x86-64 host goes through ahead-of-time-compiled scalar helpers.
I don't know how rdcycle works on qemu. This benchmark is more meant for developers to figure out how to vectorize algorithms effectively, as in which instructions to choose.
That's a good question! I had to look it up myself ...
Obviously qemu TCG isn't a cycle-accurate emulation. Using RDCYCLE{,H} / reading the corresponding CSRs eventually calls https://gitlab.com/qemu-project/qemu/-/blob/69680740eafa1838... which calls cpu_get_host_ticks, which is basically an arch-independent wrapper around RDTSC, assuming you are running qemu on an x86 host. I think icount_enabled is false in normal qemu builds.
Therefore it would measure the time taken by qemu to emulate the vector instruction, in host TSC ticks. Which I guess is what you would want (maybe?).
> This benchmark is more meant for developers to figure out how to vectorize algorithms effectively, as in which instructions to choose.
Absolutely, I'm not saying the qemu results would say anything very deep, but they're kind of interesting from the point of view of either optimizing qemu or if you have to use qemu because the hardware you want isn't available / isn't cheap enough.
I bought into the Milk-V Pioneer board, hope to get it around Christmas and build a home server around it. I'm curious to see how I can get my application (Rust-based, database & VM-ish) to perform on it and what power draw will be like.
Really cool, but you probably won't see any advantage from the vector extension, because the board uses the 0.7.1 pre-standardization version of the extension. llvm and gcc don't support that version anymore; I needed to use an older gcc branch to compile the assembly code.
Otherwise it performs pretty well, though.
That is definitely unfortunate, given that (as I understand it) the 0.7.1 pre-standard has shipped very widely. It seems a bad decision for LLVM not to at least provide an option to use intrinsics for it.
That said, my code as it stands isn't doing many array/vector-type operations. There are likely places where some of my data structures could be optimized to use them, but not right now.
I'm most interested in just taking advantage of high parallelism on the 64 cores. We shall see how that works out.
I find it amusing that René is known for two hacks:
- enabling RDNA2-based Radeon GPU cards to work. The hack: saving and restoring FPU state in the kernel so that the kernel (more precisely: the GPU driver) can use floating point instructions.
- enabling the use of RVV instructions on a standard Linux distro. The hack: primarily saving and restoring the Vector state in the kernel.
It's essentially the same hack.
I note also that the OS images recommended by the SBC vendors already have the Vector extension enabled, e.g. the OS preinstalled on the Lichee Pi 4A, or Tina for the Nezha (and other D1 boards).
The patches to do so are readily available. It's no big trick to apply them to upstream kernel sources, the problem is getting the result of doing so upstreamed.
It wasn't supposed to happen like this. The burden of supporting 0.7.1 should be on those who pushed it into general availability, instead of expecting the rest of the community to develop a code generator + optimiser for a superseded draft and maintain package repos etc. targeting only one vendor's CPU core.
That is fair. Hopefully the # of pre-1.0 V units out there gets dwarfed quickly by a wave of new machines in the next couple years.
Do you happen to know if the Pioneer board's CPU is socketed? I can't tell from the pictures, but it seems like it might just be soldered on, which is a bummer. The hope would be that it could be upgraded later if/when SOPHON brings out a new chip, but I suspect that isn't going to be possible.
Socketed is a lot more expensive. There's also no point as the SoC is worth vastly more than the motherboard, so you'd probably pay 80% as much for a new CPU as for a new motherboard with CPU. The RAM is socketed, so you can reuse that.
Agreed that THead and/or their customers should have been much quicker off the mark with toolchain support (they're doing it now), but there is another equally vital aspect: the upstream software maintainers have to agree in principle to accept patches. This agreement has been either withheld or slow to come, with such weak reasons as "we don't support draft specs", where "draft specs" is intended to mean specs that were never implemented in hardware.
Mass production of chips with a "draft spec" makes it not a draft at all in this sense -- it's effectively a custom extension, which patches SHOULD be accepted for.
It's just a custom extension that happens to be 95% source code and binary compatible with the official extension.
For a start, not every SoC containing a C906 or C910 core has a core implementing the V extension: THead's V extension implementations are optional (and extra cost) and are not part of the open-sourced cores on github.
I bought the same board, and am somewhat disappointed by this news. Do you happen to have the name of the gcc branch on hand, so I can get things going when it arrives?
Together with that, there are some assembly macros that allow writing in a subset of RVV 1.0 that also works on RVV 0.7.1 (source compatible, but not binary compatible). [0]
There is also the XuanTie toolchain, which does support intrinsics and some sort of autovectorization, I think. But I haven't used that one, and it won't be as good as upstream gcc/llvm. [1]
Other than the CanMV-K230 (Kendryte K230, single 1.6 GHz C908 core implementing RVV) which just started shipping in the last two weeks, every RISC-V board with RVV has either C906 or C910 cores which implement draft 0.7.1.
Those CPU cores were announced in mid 2019 (when RVV 0.7.1 was the current draft) and boards using them started arriving in mid to late 2021.
RVV 1.0 boards will start arriving in force next year, probably starting with the StarFive JH8110 SoC, and (apparently, though I'm not sure I believe it) an update of the SG2042 in the Pioneer, and also the 16 core (but faster cores) SG2380.
> Do you happen to have the name of the gcc branch
The branch has been deleted from the official repo. I have a snapshot on my github:
Note that it is primarily binutils which understands RVV 0.7.1. GCC understands it only to the extent of accepting "v" in "-march" and passing the right flags to the assembler. This enables using the gcc driver to build .s files and inline RVV asm in C. There is no support for RVV intrinsic functions or auto-vectorisation.
It's also a somewhat old gcc. I use it to build .o files from assembly language, and then link them with C compiled by a newer gcc or llvm. Or not, most of the time gcc 9 is fine.
THead have RVV 0.7.1 support in newer gcc, but I haven't been tracking that closely.
https://github.com/riscv/sail-riscv/commit/c90cf2e6eff5fa4ef...
I expect this will help stabilise compiler support, and also help other RISC-V extensions, such as RISC-V Vector Crypto, to progress.