Actual patch: https://reviews.llvm.org/rGf4a35492504d. It looks like this changes memcpy to prefer aligned accesses because some processors may implement unaligned loads very slowly?
What I've heard is that RISC-V architecturally guarantees that unaligned accesses are legal, but many processors don't implement them natively, so they are trapped and emulated. This seems like a serious flaw: if the architecture designers did not believe the hardware designers would commit to supporting fast unaligned accesses, they should not have mandated that unaligned accesses be available everywhere.
Yeah I’ve generally wondered how this will work on that ecosystem. On other platforms there are different rules that range from “unaligned accesses will trap” to “don’t bother they’re too slow” and “they’re ok actually” but I think there are ways to identify the last case that people typically use. What’s the equivalent for RISC-V? Will there be one?
I suspect small microcontrollers that optimize for cost will have terribly slow emulated unaligned accesses, but in the other direction, if you're going to the trouble of writing a big OoO application core with vectors, virtualization, and the whole shebang, teaching your memory pipeline to make unaligned accesses fast is well within reason.
Hardware designers have added much more complex features for low single digit percent gains.
On one hand, it seems fine to let the CPU run anything built for the family without requiring the compiler to generate up to 2^n code paths for each routine that uses this kind of "sometimes implemented fast" feature just so the binary stays portable across the family. On the other hand, it seems useful to let the compiler choose when such optimizations might be worthwhile. Maybe, rather than requiring software to always check, it would be good enough to have a CPU flag hinting at the implementation, which the compiler can optionally consult if it thinks it would be worthwhile.
My only beef with LLVM's libc is that it does not permit specialization of things (like memcpy) with assembly implementations. Of course -- I get the goal, we want to be able to use sanitizers and the like to validate the implementation. I wish there were a way to have it both ways, somehow.
Of course assembler specializations are an anti-pattern, because the optimizer should be fixed to do it much better. Better C code is often 2x faster than hand-optimized assembler.
- THead C906 in-order single-issue core in e.g. Allwinner D1 and Bouffalo BL808 SoCs
- THead C910 3-wide OoO core in THead TH1520 and Sophon SG2042 SoCs
LMUL=4 (64 bytes per loop) is optimal for all the above. It is unlikely to be bad on anything.
I look forward to the first shipping RVV 1.0 implementations, maybe sometime in 2024. The same code is binary compatible between RVV draft 0.7.1 (which the above implement) and the ratified RVV 1.0, and source compatible with the addition of ",ta,ma" to the `vsetvli` (sadly, a decision was made not to allow that to default in asm)
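To make the source-compatibility point concrete, the difference is literally the `vsetvli` operands (a sketch; the register choices here are arbitrary):

```
    # RVV draft 0.7.1: tail/mask policy is implicit
    vsetvli t0, a2, e8, m4
    # RVV 1.0: the agnostic policies must be spelled out
    vsetvli t0, a2, e8, m4, ta, ma
```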
... a test I ran on the Allwinner D1 two years ago. Compared to glibc memcpy() at the time, the vector version is 2-3 times faster for sizes up to 1k, then gradually slides back to equality at sizes in L2 cache and RAM where normal integer code can saturate the bus.
That's a core (the C906) which today is available on, for example, the $6 Pine64 Ox64 board.
Nothing in particular, just not particularly amazing performance. They work fine. One thing they have going for them is that they typically have separate versions for every interesting architecture feature level/set, whereas e.g. bionic only has SSE code. I guess I can point at my own implementations of memset and memcmp (https://github.com/moon-chilled/fancy-memset, https://github.com/moon-chilled/fancy-memcmp), both of which employ novel techniques not used by glibc; but I've not yet gotten around to doing proper benchmarks on either.
Sure. Some C library functions are red-hot code that is a bottleneck for many programs (memcpy is a striking example, but sometimes the strcpy variants are too). So popular C libraries like musl, glibc, and bionic sometimes provide dedicated assembly implementations of these frequently called functions. This requires more care than a C implementation would (or C++, in LLVM's case), and it carries the additional burden of maintaining both a C implementation and assembly for each supported architecture.
AFAIK the LLVM libc maintainers recognize the value of assembly implementations but declare them an anti-goal because they conflict with the primary goal of a portable C implementation that can be instrumented by the compiler. Compiler instrumentation can help detect defects like buffer overruns, even in production in some cases. Note that the intent here is to detect defects in the C library itself, so enabling it in production (with HWASan, e.g.) may not be a use case they focus on.
Address Sanitizer is a feature for "C-style" languages (a term I'm using loosely to cover many languages), but it cannot be used for low-level languages like assembly. So assembly was ruled out by design to preserve this capability.