
Programming with RISC-V Vector Instructions - pantalaimon
https://gms.tf/riscv-vector.html
======
filereaper
Wow, this RISC-V ISA truly is quite brilliant.

Having worked on PPC, which has a RISC instruction set, the difference between
128-bit and 256-bit instructions really does eat up the limited opcode space
for fairly trivial differences.

Also, having written, say, one fast version of vector array copy, it can now
just be reused across different vector lengths; there's no need to write
different versions to exploit expanded width, and the same goes for many
vector compiler optimizations.

How does this work with physical registers as opposed to architectural ones?
Typically in PPC the 128-bit and 256-bit ones were architecturally and
physically non-overlapping, so you did get extra registers when you went from
128 to 256 or 512. I don't know if that's the case for RISC-V here.

But yeah, brilliant looking forward to more!

~~~
brigade
The 32 variable length registers don’t overlap; the register grouping is
orthogonal to the length of each register.

The grouping... sounds “interesting” to implement in an OoOE design. Most
obvious would be to have the instruction decoder emit one uop per register in
the grouping... but that means vsetvli would have to stall decoding until it’s
resolved. But that also seems to be how element size is set, so that would
kill the performance of mixing precision in the same kernel...

Well I guess it could assume the grouping doesn’t change and flush the
pipeline if it did. But you still don’t want to be mixing kernels with
different groupings...

~~~
brucehoult
It's absolutely essential to be able to change vsetvli in the middle of a
kernel.

It is expected that in a future expanded Vector instruction set with 48-bit or
64-bit opcodes, the vtype will be explicitly encoded in every instruction and
can change with every instruction.

Right now the vsetvli is setting (slightly) persistent state that affects
following instructions. You're allowed to put one before every vector
instruction if you want, without significant execution penalty -- there will
be a little, from extra instruction fetch and decode -- but similar to doing,
say, an integer add between each vector instruction.

The natural implementation even now is to have each vector instruction pick up
the current vtype when it is decoded and carry it along with it through the
pipeline as a few extra bits of opcode.

You certainly don't want to have any stalls or pipeline flushes just because
the vtype changes.

------
lukeplato
I highly recommend Lex Fridman's recent podcast with David Patterson [1] for
anyone interested in learning about the history of RISC, computer
architecture, and also interesting predictions re: Moore's Law.

[1]
[https://www.youtube.com/watch?v=naed4C4hfAg](https://www.youtube.com/watch?v=naed4C4hfAg)

~~~
asdajsdh
Like a year ago I only learned about Patterson, because I got a (random) book
out of my uni library about computer architectures. His books are fascinating.
He is fascinating.

------
twic
Irrelevant i know, but:

> For the purpose of our example, the exercise is to write vector code that
> efficiently converts a BCD string such as { 0x12, 0x34, ..., 0xcd, 0xef } to
> a corresponding ASCII string (e.g. { '1', '2', '3', '4', ..., 'c', 'd', 'e',
> 'f' }). On a high-level, a solution involves separating the nibbles into
> single bytes and then converting each byte to the matching ASCII value.

If your BCD string has 0xcd or 0xef in it, it's not BCD, is it? It's "binary
coded hexadecimal", or as we usually call it, "binary".

This code converts a byte string to its hex representation. It has nothing to
do with BCD, right?

~~~
microcolonel
> _This code converts a byte string to its hex representation. It has nothing
> to do with BCD, right?_

BCD to ASCII is a strict subset of bin to hex ASCII; and in this case there is
no runtime cost to supporting both. This also covers nybble-coded octal.

~~~
ghusbands
Describing something that produces a hex string from binary as "BCD to ASCII"
is unusual and misleading. And claiming that 0xcd and 0xef are BCD is simply
incorrect.

------
brucehoult
This is a reasonably good example (except it's binary to hex, not just BCD to
hex), but it's from the start of the year and based on the already out of date
version 0.8 draft spec.

A couple of things need to be changed to bring it up to date:

    
    
      -vlbu.v v16, (a1)
      +vle8.v v16, (a1)
      +vzext.vf2 v16, v16
    
      -vsb.v v24, (a0)
      +vse8.v v24, (a0)
    

I'd also probably make (or at least compare the speed of) one more change:

    
    
      -vrgather.vv v24, v8, v16
      +vmsgtu.vi v0, v16, 9 # set mask-bit if >9
      +vadd.vi v16, v16, '0' # add '0' to each element
      +vadd.vi v16, v16, 'a'-0xA-'0', v0.t # masked add to correct A..F
    

That's basically the same code as he used to create the lookup table in v8 for
the vrgather. I think it might run faster on many machines, and also the
vrgather would fail on the smallest machines (with only 32 bits in each vector
register) if anyone modified the code to not use m8 (LMUL=8).

For more explanation see my post at
[https://www.reddit.com/r/RISCV/comments/i5alno/programming_w...](https://www.reddit.com/r/RISCV/comments/i5alno/programming_with_riscv_vector_instructions/g0p6rz0/)

[NB slight cheat -- those vadd.vi immediates are too big to fit .. you'd
actually need to put them in integer registers and use vadd.vx, which can be
set up outside the loop]

------
castratikron
Is there a big penalty for context switching to another process that uses a
different vector length?

~~~
vlmutolo
It’s really funny that you asked this because I clicked on the RISC-V “V”
extension spec on GitHub, and literally the only thing I read was “you can’t
context switch to another CPU with different vector lengths”.

EDIT: Non-layman wording

> Thread contexts with active vector state cannot be migrated during execution
> between harts that have any difference in VLEN or ELEN parameters.

[https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#2-implementation-defined-constant-parameters](https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#2-implementation-defined-constant-parameters)

~~~
castratikron
Very cool, thanks.

------
lachlan-sneff
Whoa, that's a well designed ISA.

~~~
renox
The vector extension, yes; the C (compressed) extension is unusual: you can
have 32-bit instructions aligned on 16 bits. While nice for code density, this
means that the implementation is much more complex than the Thumb/MIPS16
extensions..

~~~
rwmj
RISC-V has a variable-length instruction encoding. It's just that unlike x86
you can easily tell from parsing a few bits the length of every instruction in
the stream, and like MIPS etc most "ordinary" instructions are 32 bit.

BTW if unaligned 32 bit instructions are a concern there is a Compressed NOP
(C.NOP == addi x0, x0, 0 but without RAW hazards).
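
The "few bits" are the low bits of the first 16-bit parcel; a rough Python sketch of the standard length-determination rule, simplified to the 16/32-bit cases (the function name is mine):

```python
def insn_length(parcel):
    """Determine a RISC-V instruction's length from the low bits of
    its first 16-bit parcel (standard encodings; the spec reserves
    further patterns for 48-bit and longer instructions)."""
    if parcel & 0b11 != 0b11:
        return 16          # compressed (C extension) instruction
    if parcel & 0b11100 != 0b11100:
        return 32          # standard 32-bit encoding
    raise ValueError("48-bit or longer encoding")

assert insn_length(0x0001) == 16   # c.nop
assert insn_length(0x0013) == 32   # addi x0, x0, 0
```

No marker bytes or full decode are needed, which is what makes the stream easy to parse compared to x86.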

~~~
renox
C.NOP isn't a solution: if you implement a compliant RISC-V processor with the
C extension you need to handle all the possible cases, for example an
instruction straddling two pages.

------
guerrilla
Would something like this be possible for FPUs as well? I see there are
currently three separate extensions for floating point instructions varying by
register widths.

~~~
Taniwha
The vector extension includes FP instructions too

------
rhn_mk1
What's the deal with the group sizes here?

    vsetvli t0, a6, e8, m8  # switch to 8 bit element size,
                            # i.e. 4 groups of 8 registers

    vmsgtu.vi v0, v8, 9     # set mask-bit if greater than unsigned immediate
                            # --> v0 = | 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 |

If vsetvli results in groups of 8 registers, then surely vmsgtu.vi only
affects v0 which is the first 8 registers? The following 8 are in v8 if I
understood the previous writing correctly.

~~~
saagarjha
As I understand it, it affects the whole “group” at v0, and it’s only storing
16 8-bit elements which likely fit in the full 8 registers you have grouped
together. (That is, it’s not storing one element per register, I’m not sure
exactly how the layout is but it’s probably packing them somewhere along the
lines of, if the register size was 64 bit, v8 would be 0x0706050403020100, v9
would be 0x0f0e0d0c0b0a0908, and then v0 would be 0x0101010101010101 and v1
would be 0x0000000000000101.)

(What I personally don’t understand is the point of these register groupings;
they seem a bit extraneous and error-prone, as you can already set the element
size and you have to guess at the minimum vector register size…and why
wouldn’t you always just set them to 8? I think ARM’s SVE does something
similar but it essentially fixes the size for you.)

~~~
nybble41
AIUI the group size is a trade-off between the number of independently
addressable vectors and the vector size. If the group size were always set to
eight then you could only have four distinct vectors, whereas a group size of
one would give you 32 vectors. You want to use the largest group size you can
to take full advantage of the hardware, but you're limited by the number of
vectors required by your algorithm. This is orthogonal to the size of an
element within each vector.

The current RISC-V "V" draft standard requires the vector registers to be at
least 32 bits wide (VLEN ≥ SLEN ≥ 32)[1], so a 128-bit vector with 16 8-bit
elements may require up to four registers. Setting the group size to eight is
a bit extravagant, but since the total number of elements was limited to 16
any extra registers in the group will not be affected. With a smaller group
size you could potentially use those registers for something else.
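
The arithmetic here is just ceiling division over the total data size; a small Python sketch (the function name is mine):

```python
def regs_needed(n_elems, elem_bits, vlen_bits):
    """Minimum number of vector registers a group must span to hold
    n_elems elements of elem_bits each, given VLEN bits per register."""
    total_bits = n_elems * elem_bits
    return -(-total_bits // vlen_bits)   # ceiling division

# 16 x 8-bit elements on a minimal VLEN=32 implementation:
assert regs_needed(16, 8, 32) == 4
# The same data on a VLEN=128 machine fits in a single register:
assert regs_needed(16, 8, 128) == 1
```

So on the minimal machine a group size of four would already suffice for this kernel; eight merely leaves the upper registers of the group untouched.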

P.S. I think you have the mask bits reversed. The instruction is "set if
greater than" so v1 should be 0x0101010101010000 and v0 should be
0x0000000000000000 (corresponding to the |1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0|
state shown in the article for the combined v1:v0 register group).

[1] [https://github.com/riscv/riscv-v-spec/releases/tag/0.9](https://github.com/riscv/riscv-v-spec/releases/tag/0.9)

------
anonymousDan
Pretty cool. For RISC-V experts out there, can someone explain to me the
purpose of the proxy kernel? I can't seem to wrap my head around it. Why not
just run a normal kernel (e.g. Linux) on top of the emulator? What
advantages/disadvantages does the proxy kernel have?

~~~
seldridge
It gives you a hardware testing environment with system calls for reasonable
amounts of wall clock time.

If you're simulating actual hardware, e.g., a Verilog description of a RISC-V
microprocessor compiled to a cycle-accurate simulation with Verilator, your
simulation rate is going to be ~10KHz. You can write useful tests with the
Proxy Kernel (or something like it) that run in ~1 million instructions
(minutes of wall clock time) while still getting full system calls like
printf. However, booting Linux is out of the question (days of wall clock
time). Running bare metal is useful, too, but you don't have system calls
there.

If you're doing fast RISC-V virtualization, like on QEMU, or doing emulation
on an FPGA, you're running at >1MHz and running a "normal kernel" like Linux
is totally tractable. However, it would be foolhardy to expect to jump from
hardware design to immediately booting Linux.
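
A rough back-of-the-envelope in Python (the Linux instruction count is an illustrative assumption, not a measured figure):

```python
def sim_wall_clock(instructions, sim_hz, cpi=1.0):
    """Rough wall-clock seconds to simulate a workload at a given
    simulation rate, assuming cpi simulated cycles per instruction."""
    return instructions * cpi / sim_hz

# ~1M-instruction proxy-kernel test at a ~10 kHz Verilator rate:
minutes = sim_wall_clock(1e6, 10e3) / 60       # about 1.7 minutes
# A Linux boot (order of billions of instructions) at the same rate:
days = sim_wall_clock(5e9, 10e3) / 86400       # several days
```

This is why the proxy kernel's niche is exactly the cycle-accurate-simulation regime: real system calls, but workloads small enough to finish.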

~~~
fulafel
Nitpick: printf is a libc call.

~~~
brucehoult
... and that libc code (usually NewLib[nano]) formats a buffer then calls
write(2), which pk provides.

