LLVM's Libc Gets Much Faster memcpy For RISC-V (phoronix.com)
44 points by kristianp on May 21, 2023 | 18 comments



Actual patch: https://reviews.llvm.org/rGf4a35492504d. It seems like this changes memcpy to use aligned accesses because some processors may implement unaligned loads very slowly?
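
For illustration, the classic way to avoid unaligned accesses in a memcpy looks something like this sketch (not LLVM's actual code, which is templated C++):

    #include <stddef.h>
    #include <stdint.h>

    /* Sketch only: byte-copy until dst is word-aligned, then (if src is
       co-aligned) copy word-sized chunks with naturally aligned accesses,
       then byte-copy the tail. Real implementations also handle the
       mixed-alignment case instead of falling back to bytes. */
    void *aligned_memcpy(void *restrict dst, const void *restrict src, size_t n) {
        unsigned char *d = dst;
        const unsigned char *s = src;

        while (n && ((uintptr_t)d % sizeof(uintptr_t)) != 0) {  /* head */
            *d++ = *s++;
            n--;
        }
        if (((uintptr_t)s % sizeof(uintptr_t)) == 0) {          /* bulk */
            while (n >= sizeof(uintptr_t)) {
                *(uintptr_t *)d = *(const uintptr_t *)s;        /* aligned */
                d += sizeof(uintptr_t);
                s += sizeof(uintptr_t);
                n -= sizeof(uintptr_t);
            }
        }
        while (n--)                                             /* tail */
            *d++ = *s++;
        return dst;
    }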


What I've heard is that riscv architecturally guarantees that unaligned accesses are legal, but many processors don't implement them natively, so they are trapped and emulated. This seems like a serious flaw—if the architecture designers did not believe the hardware designers would be able to commit to supporting fast unaligned accesses, they should not have mandated that unaligned accesses be available everywhere.


Yeah I’ve generally wondered how this will work on that ecosystem. On other platforms there are different rules that range from “unaligned accesses will trap” to “don’t bother they’re too slow” and “they’re ok actually” but I think there are ways to identify the last case that people typically use. What’s the equivalent for RISC-V? Will there be one?


I suspect small microcontrollers that optimize for cost will have terribly slow emulated unaligned accesses, but in the other direction if you're going through the trouble of writing a big OoO application core with vectors, virtualization, and the whole shebang, teaching your memory pipeline to make unaligned accesses fast is well within reason.

Hardware designers have added much more complex features for low single digit percent gains.


Then make it an optional feature software has to check for. God knows riscv is fragmented enough already...


On one hand, it seems fine to let the CPU run anything built for the family without requiring the compiler to emit up to 2^n code paths for each routine that uses this kind of "sometimes implemented fast" feature just to keep the binary portable across the family. On the other hand, it seems useful to let the compiler choose when such optimizations might be worthwhile. Maybe, rather than requiring software to always check, it'd be good enough to have a CPU flag which hints at the implementation, which the compiler can optionally look at if it thinks it'd be worthwhile.
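
Linux's riscv_hwprobe syscall (merged for kernel 6.4) is one concrete form such a hint has taken: it reports whether misaligned accesses on the running cores are fast, slow, or emulated. A minimal sketch, assuming the <asm/hwprobe.h> interface:

    #include <asm/hwprobe.h>    /* struct riscv_hwprobe, RISCV_HWPROBE_* */
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <stdio.h>

    /* Sketch: ask the kernel how misaligned scalar accesses perform
       on all CPUs (cpu_count = 0, cpus = NULL means "everywhere"). */
    int main(void) {
        struct riscv_hwprobe pair = { .key = RISCV_HWPROBE_KEY_CPUPERF_0 };
        if (syscall(__NR_riscv_hwprobe, &pair, 1, 0, NULL, 0) != 0)
            return 1;  /* pre-6.4 kernel: assume the worst */
        if ((pair.value & RISCV_HWPROBE_MISALIGNED_MASK)
                == RISCV_HWPROBE_MISALIGNED_FAST)
            puts("unaligned accesses are fast; take the unaligned path");
        else
            puts("stick to aligned accesses");
        return 0;
    }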


My only beef with LLVM's libc is that it does not permit specialization of things (like memcpy) with assembly implementations. Of course -- I get the goal, we want to be able to use sanitizers and the like to validate the implementation. I wish there were a way to have it both ways, somehow.


Of course assembler specializations are an anti-pattern, because the optimizer should be fixed to do it much better. Better C code is often 2x faster than hand-optimized assembler.

E.g. my C memcpy, inlined and vectorized by clang, beats glibc's or gcc's assembler memcpy easily. https://github.com/rurban/safeclib/blob/master/tests/perf_me...
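
For a flavor of what "let the compiler do it" means here (a generic sketch, not the linked safeclib code):

    #include <stddef.h>

    /* Build with: clang -O3 -march=native -fno-builtin-memcpy -S
       (-fno-builtin-memcpy keeps clang from recognizing the loop and
       emitting a call to libc memcpy; instead it auto-vectorizes it
       into wide SIMD loads/stores). `restrict` tells the vectorizer
       the buffers cannot overlap. */
    void copy_bytes(char *restrict dst, const char *restrict src, size_t n) {
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i];
    }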


> Better C code is often 2x faster than hand-optimized assembler

Only if you suck at optimising assembly.

> my C memcpy, inlined and vectorized by clang, beats glibc's or gcc's assembler memcpy easily

1. For what workload?

2. glibc string functions are ok, but not particularly good, ime

The advantage of compilers is economy. They cannot, at least not at present, beat a good human.


It is quite easy to suck at optimizing Assembly unless we are talking about classical 8- and 16-bit CPUs.

First of all, if one isn't using something like VTune to profile the microcode of the written Assembly, they are doing it wrong.

Second, better keep up with all those opcodes.


What is this "microcode" you speak of? "all those opcodes"? This post is about RISC-V.

Also, in the near future, this is going to be the only optimised `memcpy()` you need on Applications-class (and many embedded too) RISC-V processors:

    memcpy:
        mv a3,a0
    1:  vsetvli a4, a2, e8,m4 # Vectors of bytes
        vle.v v0, (a1)  # Load bytes
        add a1, a1, a4  # Bump pointer
        sub a2, a2, a4  # Decrement count
        vse.v v0, (a3)  # Store bytes
        add a3, a3, a4  # Bump pointer
        bnez a2, 1b  # Any more?
        ret


It can be VHDL or Verilog for the ISA, if it makes you happier.

So have you validated that the Assembly example works the same way across all RISC-V vendors and memory controllers, with regard to performance?


All that exist at the moment! i.e.

- THead C906 in-order single-issue core in e.g. AllWinner D1 and Bouffalo BL808 SoCs

- THead C910 3-wide OoO core in THead TH1520 and Sophon SG2042 SoCs

LMUL=4 groups four vector registers per operation; with the 128-bit VLEN on these cores that is 64 bytes per loop iteration, which is optimal for all the above. It is unlikely to be bad on anything.

I look forward to the first shipping RVV 1.0 implementations, maybe sometime in 2024. The same code is binary compatible between RVV draft 0.7.1 (which the above implement) and the ratified RVV 1.0, and source compatible with the addition of ",ta,ma" to the `vsetvli` (sadly, a decision was made not to allow that to default in asm).

As can be seen at...

https://hoult.org/d1_memcpy.txt

... a test I ran on the Allwinner D1 two years ago. Compared to glibc memcpy() at the time, the vector version is 2-3 times faster for sizes up to 1k, then gradually slides back to equality at sizes in L2 cache and RAM where normal integer code can saturate the bus.

That's a core (C906) which today is available on for example the $6 Pine64 Ox64 board.


> 2. glibc string functions are ok, but not particularly good, ime

Do you mean for RISC-V, or in general? (and in particular, what about x64?) What problems do they have?

[Also what are in your experience better implementations?]


I only have experience with their amd64 code.

> What problems do they have?

Nothing in particular, just not particularly amazing performance. They work fine. One thing they have going for them is that they typically have separate versions for every interesting architecture feature level/set, whereas e.g. bionic only has SSE code. I guess I can point at my own implementations of memset and memcmp (https://github.com/moon-chilled/fancy-memset and https://github.com/moon-chilled/fancy-memcmp), both of which employ novel techniques not used by glibc; but I've not yet gotten around to doing proper benchmarks on either.


Thank you! I'll take a look.


Can you please explain the trade off in more detail?


Sure. Some of the C library functions are red-hot code, a bottleneck for many programs (memcpy is a striking example, but sometimes the strcpy variants are too). So popular C libraries like musl, glibc, bionic, etc. sometimes provide dedicated assembly implementations of these frequently called functions. This requires more care than a C implementation would (or C++, in LLVM's case), and it adds the burden of maintaining both a C implementation and an assembly one for each supported architecture.

AFAIK the LLVM libc maintainers recognize the value of assembly implementations but specify them as an anti-goal, because they conflict with the primary goal of having a portable C implementation that can be instrumented by the compiler. That instrumentation can help detect defects like buffer overruns -- even in production, in some cases. Note that the intent here is to detect defects in the C library itself, so enabling it in production (with HWASan, e.g.) is maybe not a use case they focus on.

Address Sanitizer is a feature for "C-style" languages (a term I am using to cover many languages), but it cannot be used on low-level assembly. So assembly was ruled out by design to preserve this capability.
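
A concrete illustration of the trade-off (hypothetical code, not LLVM's): with a C body, the sanitizer catches an off-by-one the first time the test suite runs; the same bug in a .S file is invisible to it, because ASan only instruments compiler-generated memory accesses.

    #include <stddef.h>

    /* Hypothetical buggy memcpy: the `<=` copies n+1 bytes. Built with
       -fsanitize=address, the first call that runs past the end of an
       instrumented buffer aborts with a report pointing at this line.
       Hand-written assembly gets no such checks. */
    void *buggy_memcpy(void *dst, const void *src, size_t n) {
        unsigned char *d = dst;
        const unsigned char *s = src;
        for (size_t i = 0; i <= n; i++)  /* BUG: off-by-one */
            d[i] = s[i];
        return dst;
    }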



