Examples of RISC-V Assembly Programs (utk.edu)
92 points by azhenley 8 months ago | hide | past | favorite | 23 comments

Neat, though they look unoptimized at a glance. For example, I'd expect the strlen loop to be rotated so that there is only a single backwards branch instruction per iteration. Basically, you want the loop structure to be:

        count = 0
        goto check
 body:  count++, ptr++
 check: if *ptr != 0 goto body

(The initial jump could be a copy of the check instead)

  > count++, ptr++

Also, you should avoid this; it adds an extra addition per iteration for no real benefit. Try:

        p = ptr - 1  # starting at ptr-1 also avoids branching[0] over the p++
 body:  p++
        if *p != 0 goto body
        return p - ptr

0: I assume a single addition is cheaper than an unconditional branch, which is probable, but not nearly as big a difference as with conditional branches.

GCC did something like this to one of my loops last month and it took me about half an hour to figure out what was going on.

Simplified, the loop was something like this:

    for (int j = 0; j != 5000; j++) {
        foo();
        if (bar()) {
        } else {
            quux();
        }
        corge();
    }
The straightforward way to compile this is as follows:

        j = 0
    1:  if (j == 5000) goto 2
        foo()
        t = bar()
        if (!t) goto 3
        goto 4
    3:  quux()
    4:  corge()
        j++
        goto 1
    2:
Imagine my surprise when I found that GCC had instead emitted the equivalent of the following:

        j = 0
        goto 6
    3:  quux()
    4:  corge()
        j++
        if (j == 5000) goto 2
    6:  foo()
        t = bar()
        if (!t) goto 3
        goto 4
    2:
It took me a while to understand what was going on, though you can work out that the two are equivalent. Note that the straightforward version runs 9 “instructions” each iteration when bar() is true and 8 when bar() is false. But the optimized version runs 8 when bar() is true and 7 when bar() is false.

By rotating the loop to put the two-branched if-else conditional at the end, as you explain, you get the jump back to the top for free! (The extra jump is still there, but it's executed once, to enter the loop, instead of on every iteration.) And in this case this might matter significantly because the whole inner loop was only 12 instructions, so it might be speeding the whole program up by a measurable percentage.

That said, it's probably better to start by teaching people the straightforward version.

(The program was http://canonical.org/~kragen/sw/dev3/kmregion.h, a simple arena allocator about 10× faster for small allocations than malloc/free, if you're curious.)

They’re for teaching sophomores, so I don’t think optimizations would help.

Reading the source code of these essential functions like `strlen`, `strcpy`, etc. is like a magician showing you how the trick is performed.

And it feels utterly devastating, like I have been cheated, given the banality of how it works and how simple it really is...

If that's too simple for you, you may have fun with the strlen implementation in glibc:


Then look at the strlen that is probably running on your computer: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86...

Needs a static assert instead of abort().

Excellent reply! Thanks! ..and also thanks for pointing out the obvious *magic in the code!

I'm not really sure what you were expecting?

Someone could have a go at adding RISC-V assembler to http://www.rosettacode.org/ .

I see there are several 3-argument instructions; is that common among modern architectures? I only wrote some M68K assembly ages ago, so it seems rather unusual to me.

It's absolutely normal with RISC ISAs that use 4 byte instructions. There is plenty of room to specify three 5-bit register numbers, and it often prevents needing extra MOV instructions.

RISC-V also has instructions that are two bytes long. The assembler automatically uses a 2 byte instruction instead of the corresponding four byte instruction if it can.

Examples of available two byte instructions:

Using any of the 32 registers:

- move or add one register to another

- shift left by a constant 0..63

- add a constant -32..+31 to a register

- load or store a register within 0..63 register sizes above the stack pointer

Using only the eight "most popular" registers (s0..s1, a0..a5):

- subtract, AND, OR, XOR two registers

- AND with a constant -32..+31

- logical or arithmetic shift right by a constant 1..63

- load or store a 32 or 64 bit value at an offset of 0..31 operand widths above the address in a register.

The programmer (or even compiler) doesn't need to remember which operations and registers are supported by 2-byte instructions; the assembler just selects them automatically when possible.

Of course you can decrease your code size if you do know what can use a 2 byte instruction and organise your code to maximise it.

Yeah, it's really common for RISCs in particular to use three-address instructions. Two-address means a read-modify-write of one of the registers, which can make it more difficult to split dependencies; that's one example of how it complicates higher-performance designs.

Not an expert, but I don't think that's quite right. No three-address ISA (that I know of) does anything to forbid using one of the sources as a destination, I'm not even aware of a microarch that punishes you for doing so, and the three-address design goes back to the original RISC/MIPS designs, where any sort of advanced dependency analysis in the pipeline wasn't even a consideration (it was maybe even considered somewhat un-RISC-y).

I think the motivation was something more like: there's room in the instruction encoding and it's more flexible. Like you can do things like move a value from one register to another using an `add` instruction instead of a dedicated `move` by using `dest = source + zero`. Internally I think some CPUs were doing that sort of thing anyway to simplify the datapath, and the basic point of RISC was to expose the microarch in the instruction set.

I gave it as an example of one way it can complicate a higher perf design, and why it's stuck on new ISAs. There's a lot of choices old school RISC made that happen to still be valid at much higher gate counts, but for different reasons. Three address is one of those. "Why do we still have it?" is the more interesting question compared to "why did we do it in the first place" IMO.

You can see CISC archs hacking in the "xor REG" instruction as 'erase dependency in OoO hardware' to get around the dependency issues I talked about. Three address doesn't have that root issue because you can always specify a destination that wasn't a source.

Additionally, it's not a case of exposing the CISC internal datapath as RISC instructions. CISC vertical microinstructions tend to be two address as well. The horizontal microinstructions can't really be said to have clear source and destination enough to be either two or three address.

it's very common - RISC-V also has a compressed subset of 2-byte instructions

Note: "jal zero, 1b" can also be written as "j 1b", and "jalr zero, 0(ra)" can be written as "ret"

> it's very common - RISC-V also has a compressed subset of 2-byte instructions

> Note: "jal zero, 1b" can also be written as "j 1b", and "jalr zero, 0(ra)" can be written as "ret"

`j` and `ret` are so-called "pseudo instructions" [1], not compressed instructions.

Pseudo instructions are just shortcuts used in assembly language to pretend that some common operations really "exist" without the need to type (or display) the corresponding more complex (but actually existing) instructions. `nop` is a common pseudo instruction. RISC-V has no real `nop` instruction, but, instead, the "do nothing instruction" is canonically encoded as `addi x0, x0, 0`. The programmer can write a more understandable `nop`, and the assembler will write instead the binary code equivalent to `addi x0, x0, 0`.

The compressed instruction set (a.k.a. "extension C"), instead, is a subset of the full [2] instruction set in which only restricted combinations of operands are possible. The assembly (human-readable) code of the compressed instruction set looks similar to that of the full instruction set (including pseudo instructions), but the instructions are encoded as completely different binary sequences.

[1] https://github.com/riscv/riscv-asm-manual/blob/master/riscv-...

[2] https://riscv.org/wp-content/uploads/2019/06/riscv-spec.pdf#...

> modern

X86 has them too (because of course it does)

There is an interesting point about example code for beginners (or the lazy), vs the actual production code in the standard library.

For example the strcpy:

    void stringcopy(char *dst, const char *src) {
        char c;
        do {
            c = *src++;
            *dst++ = c;
        } while (c != '\0');
    }
With asm code (I've changed it to use the normal pseudo-ops):

    .section .text
    .global stringcopy
    stringcopy:
        # a0 = destination
        # a1 = source
    1:  lb      t0, 0(a1)  # Load a char from the src
        sb      t0, 0(a0)  # Store the value of the src
        beqz    t0, 1f     # Check if it's 0
        addi    a0, a0, 1
        addi    a1, a1, 1
        j       1b
    1:  ret
For some reason they've made the assembly language not actually a direct translation of the C code. Ironically, this has actually slowed it down. On a typical single-issue in-order core (which everything in RISC-V land is so far, except the SiFive U74 in the upcoming HiFive Unmatched and Beagle-V) this will take 7 clock cycles per byte copied, as the sb will stall for 1 cycle waiting for the lb.

If they'd at least put the src-increment addi between the lb and sb, that would save a cycle. But keeping it organized the same as the C code would reduce it to 5 clock cycles per byte:

    .section .text
    .global stringcopy
    stringcopy:
        # a0 = destination
        # a1 = source
    1:  lb      t0, 0(a1)  # Load a char from the src
        addi    a1, a1, 1
        sb      t0, 0(a0)  # Store the value of the src
        addi    a0, a0, 1
        bnez    t0, 1b     # Repeat if not 0
        ret
That's shorter and faster for any string with at least one character before the NULL. (I'm ignoring branch prediction here as it will affect both equally)

So there you have 40% faster code just by sticking more closely to the C code.

The author has called this function stringcopy not strcpy which is probably a good thing because it doesn't meet the contract for strcpy -- the return value for strcpy is the start of the destination buffer i.e. return with a0 unchanged from how you found it. The code should copy a0 to somewhere else ... anything from a2..a7 or t1..t6 (since t0 is already used) and then work with that register instead of a0.

Real strcpy code in libc is much more complex because it tries to copy a whole register (8 bytes) each loop, which means you want to initially get the src and/or dst pointers aligned to a multiple of 8 and then also do some shifting and masking each iteration if the src and dst are not aligned the same as each other. And you also have the problem of detecting a zero byte in the middle of a register. It's also important if the string is near the end of a memory page not to try to read a few bytes of the next page, as you might not have access rights for it.

You quickly find you have hundreds of bytes of code for an optimised strcpy.

The current RISC-V glibc code simplifies the problem by calling strlen first, which depends only on the src, and then using optimised memcpy for the actual copy and ends up running at about 1.5 clock cycles per byte copied on long strings. Which is better than 5 or 7.

ARM and x86 strcpy improve on this by using NEON / SSE / AVX to copy more at a time, but they still need rather long and complex code to deal with alignment issues, and scalar code to deal with odd-sized tails.

The new RISC-V Vector extension gives a huge improvement for all these issues.

Version 1.0 of the Vector extension is not ratified yet (will probably happen in June or July) and there are no chips out using it, but Allwinner now have an SoC called "D1" using the C906 core from Alibaba/T-Head, and it has a Vector unit implementing the 0.7.1 draft version of the RISC-V Vector extension.

In some ways this is unfortunate, as the 1.0 spec is not in general compatible with the 0.7.1 spec. Some simple code is binary compatible between them, and the structure of how you write loops etc is the same, but some instruction semantics and opcodes have changed (for the better).

I currently have ssh access to a EVB (EValuation Board) from Allwinner in Beijing and expect to have my own board here in New Zealand early next month. Sipeed and Pine64 will have mass-production boards in a couple of months. Sipeed have promised a price of $12.50 for at least one version (probably with 256 or 512 MB of RAM I think) and Pine64 have said "under $10". The clock speed of this Allwinner D1 is 1.0 GHz.

Here is vectorized strcpy code I've tested on the board:

    # char *strcpy(char *dst, const char *src)
    strcpy:
        mv a2, a0             # Copy dst
    1:  vsetvli x0, x0, e8,m4 # Vectors of bytes
        vlbuff.v v4, (a1)     # Get src bytes
        csrr t1, vl           # Get number of bytes fetched
        vmseq.vi v0, v4, 0    # Flag zero bytes
        vmfirst.m a3, v0      # Zero found?
        vmsif.m v0, v0        # Set mask up to and including zero byte
        add a1, a1, t1        # Bump pointer
        vsb.v v4, (a2), v0.t  # Write out bytes
        add a2, a2, t1        # Bump pointer
        bltz a3, 1b           # Zero byte not found, so loop
        ret
This relatively simple code (not as simple as memcpy, obviously) copies 64 bytes (512 bits) in each loop iteration on this chip that has 128 bit vector registers, used in groups of 4 (the m4 in the vsetvli). It correctly handles all the problems:

- unaligned src or dst works fine, and doesn't significantly affect the speed

- if the vlbuff.v load instruction attempts to read into a memory page you don't have access rights to, it automatically shortens the vector length to the number of bytes it could actually read. vlbuff.v only causes an exception if the first byte can not be read (the ff means "Fault on First")

- the vsb.v store instruction uses a mask v0.t to ensure it doesn't disturb any bytes past where the terminating null is written. It will correctly copy a string into the middle of existing data.

On the Allwinner D1 (a low end SoC being marketed against ARM Cortex A7 or A35) this strcpy code runs at 43.75 clock cycles per 64 bytes copied.

That's 10.24x faster than the example code presented in this article, 7.3x faster than my improved version (matching the C code), and 2.2x faster than the current (non-vector) glibc code.

That's pretty good, especially considering that the code is barely more complex than the naive C byte-at-a-time loop.

Benchmark results on the Allwinner D1, and the glibc code can be found here: http://hoult.org/d1_strcpy.txt

And the same for memcpy here: http://hoult.org/d1_memcpy.txt

ARM SVE should allow fairly similar code, but I believe general consumer availability of chips with SVE is probably a year or more away still.

Here I was thinking that we should all be ashamed of the existence of strlen & strcpy etc as they are simply security holes masquerading as something else. Is that wrong? Do they have a place?

    size_t strnlen(const char *s, size_t maxlen);

    char *strncpy(char *dest, const char *src, size_t n);

Is this a silly concern in this context? Seems a lot of people still don't know the old school libc functions are a disaster and should not be used. (And you have to be careful with the ones with a maximum size as well).

Here's how to do something you should never do..? Thoughts?

You're right, modern code definitely shouldn't use them. But there's a lot of code that isn't remotely modern so we still have to support them.

Also, nobody ever bothered to add a better string library to the standard C library, so lots of people who write C just take the lazy route and use `strlen` etc.

You probably shouldn't be blindly using the suggestions you provided either.
