RISC-V J extension – Instructions for JITs (github.com/riscv)
137 points by frankpf on March 11, 2022 | 54 comments



For tagged values, I loved the POWER rlwinm: Rotate Left Word Immediate aNd with Mask (and its companion rlimi). Pretty much any sane tagging scheme could be converted to the unboxed value with that single instruction; even somewhat exotic tagging schemes like mixing high-bit and low-bit tagging could be handled by it.
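In C-ish terms (just an illustration of the low-bit-tag case, not any particular runtime's code), the one-instruction untag is a rotate plus a mask:

    #include <stdint.h>

    /* rotate-left helper; rlwinm does the rotate and the AND-with-mask in one go */
    static uint32_t rotl32(uint32_t x, unsigned n) {
        return (x << n) | (x >> (32 - n));
    }

    /* tagged = (payload << 1) | 1: rotating left by 31 moves the tag bit to the
       top, and the mask clears it, leaving the payload (same as tagged >> 1) */
    uint32_t untag(uint32_t tagged) {
        return rotl32(tagged, 31) & 0x7fffffffu;
    }

High-bit or mixed schemes are the same trick with a different rotate amount and mask.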

Of course in modern architectures being able to do something in one instruction is only tenuously related to being able to do something quickly, but it was a super handy instruction back in the day.


Most tagged arithmetic can be converted to one or two regular instructions. For OCaml, which tags the bottom bit, I wrote about it here: https://web.archive.org/web/20090810001400/https://caml.inri... and here (scroll down to bottom): https://rwmj.wordpress.com/2009/08/04/ocaml-internals/
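A minimal sketch of what that looks like, modelled on OCaml's Val_int/Int_val encoding (int n stored as 2n+1):

    typedef long value;                       /* word-sized tagged value */
    #define Val_int(n) ((((value)(n)) << 1) | 1)
    #define Int_val(v) ((v) >> 1)

    /* (2a+1) + (2b+1) - 1 == 2(a+b) + 1, so tagged addition is an add and a
       subtract, which an x86 compiler can fold into a single lea */
    value tagged_add(value a, value b) {
        return a + b - 1;
    }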


I heard someone use those instructions once as examples of something compilers could do better than humans writing assembly -- Apple's MPW C compilers for PowerPC were capable of peephole optimizations that would produce them where a human might not think of them. (At least, that was the argument.)


That depends on whether you mean a human who knows the instructions exist, or a human who hasn't yet worked out how to use shifts to multiply/divide integers by 2.


The proper argument was always that optimizing compilers generate better assembly than 90% of the people using them could generate, and in a fraction of the time.

However these things often get turned into stronger (or different) arguments as they pass from mouth to ear repeatedly.

Sometimes they change completely, as in "the plural of anecdote is data"


I wanted to write a memcpy() routine for a microcontroller. I wrote a naive version where I copied from src to dst one byte at a time. You can find algorithms which are more efficient than this, which will typically copy 32-bit words at a time.

The interesting thing is, I turned on compiler optimisations. When I examined the assembled output (even though my knowledge of assembly is poor), I discovered that it had made the optimisations that you would find in a more complex C implementation. The compiler obviously thought to itself "I see what you're doing here", and put in a better version.

So the moral of the story is: your compiler is likely to be able to figure out a lot.
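For reference, the naive loop is essentially this; with optimisations on, GCC and Clang will typically vectorise it or recognise the idiom and call the library memcpy outright:

    #include <stddef.h>

    void *copy_bytes(void *dst, const void *src, size_t n) {
        unsigned char *d = dst;
        const unsigned char *s = src;
        while (n--)
            *d++ = *s++;            /* one byte at a time */
        return dst;
    }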


Even ignoring the usual optimizations like using SIMD and loop unrolling to find parallelism in a memcpy, the compiler has techniques for spotting certain loop idioms, so it can replace the loop with a call to the library memcpy if it deems that profitable (e.g. tell it N is likely to be large and it'll go for the library call).


There are additional optimizations along these lines: call C's printf without any extra arguments and the compiler will replace it with a call to puts, which skips the formatting code. You can see this in Compiler Explorer.

https://godbolt.org/z/dvdzE4M6T
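Minimal example of the transformation (the format string has no conversion specifiers and ends in a newline, which is what makes the substitution legal):

    #include <stdio.h>

    int main(void) {
        printf("Hello, world\n");   /* GCC/Clang emit: call puts("Hello, world") */
        return 0;
    }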


Quite often, that doesn't end up very efficient, because without "restrict", the result has to be identical to what it would be if it was copied byte by byte, for all possible overlaps of the two inputs.


Lots of memcpy() implementations are still more efficient than a dumb byte-by-byte copy. They'll copy the (unaligned) head and the tail in bytes, but the bulk of the data using whatever data type and method is fastest.
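Something like this sketch, ignoring the fancier SIMD and overlap tricks real libraries use:

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    void *copy_wordwise(void *dst, const void *src, size_t n) {
        unsigned char *d = dst;
        const unsigned char *s = src;

        /* head: byte copies until dst is word-aligned */
        while (n && ((uintptr_t)d & (sizeof(uintptr_t) - 1))) {
            *d++ = *s++; n--;
        }
        /* bulk: one word at a time (the small memcpy calls sidestep alignment UB
           on src and compile to plain loads/stores) */
        while (n >= sizeof(uintptr_t)) {
            uintptr_t w;
            memcpy(&w, s, sizeof w);
            memcpy(d, &w, sizeof w);
            d += sizeof w; s += sizeof w; n -= sizeof w;
        }
        /* tail: leftover bytes */
        while (n--)
            *d++ = *s++;
        return dst;
    }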


> "the plural of anecdote is data"

I love that line!


I didn’t say it was a good argument.


Me, every day, even discussing my own points.


It's worth noting that on systems with real cache coherency (MOESI, for example), writing data to an address A causes a cache-line shootdown in the icache as part of fetching the line into the dcache in 'exclusive/modified' state. In that world EXPORT.I is essentially a no-op, because what it requires the icache to implement (shootdown of icache lines) has already happened naturally.

Equally on such a system the only thing left for FENCE.I to do is to flush any (potentially now bogus) subsequent instructions that are in the execution pipe that might have been prefetched before the writes occurred. In such a system FENCE.I and IMPORT.I are identical.
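For reference, the software-visible step today looks like this regardless of how the spec shakes out; __builtin___clear_cache is the portable GCC/Clang hook, and on RISC-V it boils down to a FENCE.I (plus, under Linux, a syscall so other harts get flushed too, as I understand it):

    #include <stddef.h>
    #include <string.h>

    typedef int (*jit_fn)(void);

    jit_fn publish_code(unsigned char *buf, const unsigned char *code, size_t len) {
        memcpy(buf, code, len);                  /* new instructions land via the dcache */
        __builtin___clear_cache((char *)buf, (char *)buf + len);
        return (jit_fn)buf;                      /* only now is it safe to jump here */
    }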

Hopefully the people writing this spec are listening ... please make sure your spec understands high end systems like this and doesn't add stuff that requires special cases in systems that do ubiquitous coherency right.


This organization of functionality is intentional. It provides support for code modification orthogonal to instruction cache coherency support. The range of types of implementations of RISC-V is broad enough that imposing instruction cache coherency on all of them wouldn't be optimal. The I/D consistency proposal provides SW control now, while not requiring particular implementations.

Particular RISC-V Platform specs may end up requiring I/D coherency, like Arm is recommending in SBSA Level 6, but that's left for later, if ever.


Right, I think it's OK as written; I'm just encouraging people to make general specs rather than ones with special cases that are important for one end but slow everything else down.


Counting down to someone pointing at the annoyingly named ARM FJCVTZS instruction. The naming is obviously more about legal problems than reality, but so it goes.

To be very very clear: FJCVTZS does not do anything amazing, clever, or special. The problem it solves is very simple: the behaviour of double->int conversion in JS is the default x86 behaviour. Getting that behaviour on any non-x86 platform is expensive. So a more accurate name would be FXCVTZS. The implementation of FJCVTZS in a CPU is also not expensive, it simply requires passing a specific rounding mode to the FPU for the integer conversion (overriding the default/current global mode), and matching the x86 OOB result.

(Also I really wish people would stop posting to GitHub repos unless the repos have the actual readable spec available or linked, rather than the unbuilt markup version. It just makes reading them annoying.)


There's a document in there about pointer masking: https://github.com/riscv/riscv-j-extension/blob/master/point...

It seems like the objective of this is to implement different access privileges... but why do you need specialized instructions for this? This is typically done by the OS and memory protection. The pointer masking extension would be to have multiple levels of privilege within a single process? I'm assuming that this is to protect the JIT from a JITted program? Except it's not completely safe, because there might still be bugs in the JIT that could allow messing with the pointer tags. Struggling to think of a real use case.
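(For concreteness, the masking in question is the sort of thing runtimes already do by hand: stash a tag in the unused high bits of a pointer and strip it before every dereference. A made-up sketch, assuming only the low 56 bits are a real address:)

    #include <stdint.h>

    #define TAG_SHIFT 56
    #define ADDR_MASK ((((uintptr_t)1) << TAG_SHIFT) - 1)

    static void *tag_ptr(void *p, uint8_t tag) {
        return (void *)(((uintptr_t)p & ADDR_MASK) | ((uintptr_t)tag << TAG_SHIFT));
    }

    static void *strip_tag(void *p) {
        /* with hardware pointer masking the load/store unit ignores the high
           bits itself, so this strip step disappears */
        return (void *)((uintptr_t)p & ADDR_MASK);
    }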


For fixing C, hardware memory tagging is the ultimate mitigation strategy for pointer tricks.

Already used successfully for decades on Solaris SPARC; iOS/macOS and Android are increasingly pushing for it on ARM CPUs; Pluton on Azure Sphere OS; ...


I found this post on ARM MTE which was helpful in understanding the concept: https://www.anandtech.com/show/16759/sponsored-post-keep-you...

Seems to me this will have an execution overhead though, and that the best way to improve security would be to finally move beyond C. Most modern languages make buffer overflows impossible.


Except all those fine people writing UNIX clones and embedded stuff will never do it, so here we are.

It was already known since the early days how bad C was versus the competition.

UNIX made it famous, UNIX won the server room wars, UNIX will keep it going.


Memory tagging isn't a privilege level thing, it's an anti-compromise mechanism similar to PAC (in the sense that the goal is to make it harder for an attacker to compromise code; functionally they are completely different).

The basic idea is that you often want finer than page-level granularity on memory access rights. An example ARM gives in the documentation covering MTE is an allocator: with memory tagging you can make it so that unallocated memory inside the allocator is not accessible.

Essentially every piece of memory gets a tag, and you can only access a piece of memory through a pointer that carries the matching tag. To illustrate, imagine an allocator (the example from ARM's MTE documentation).

Your allocator has a bunch of memory, all of it set to be tagless (uncolored in ARM terminology IIRC):

    |bbbbbbbbbb|
When your allocator allocates a byte it does the following:

1. Find a free block
2. Choose a tag (randomly if it wants)
3. Set the tag on that memory to the tag chosen in (2)
4. Return a pointer to that memory, tagged with the tag from (2)

So we get something like:

    |1bbbbbbbbb|

    p = (1,0) // pointer with a tag of 1 and the address 0
Now any access to the memory at address 0 must be via a pointer with the tag 1, and any memory accessed via that pointer must be tagged with 1.

So imagine you have a bunch of allocations

    |13251bbbbb|
You can see we've re-used a tag, because there is a finite amount of space for tags in a pointer; so while our original allocation was a 1-byte allocation at 0, we can do p[4] and the access will work. However, if we're choosing tags randomly, an attacker is in theory unlikely to luck out and get the correct tag, so your process crashes (it's super important for these mechanisms that any failure results in an unstoppable crash, e.g. no signal handlers or anything). Another thing your allocator does is revert memory to being untagged (or I guess tagged distinctly) on free, so a use-after-free also cannot work.
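In (very) rough code, with the mte_* helpers as hypothetical stand-ins for whatever the hardware or intrinsics actually provide:

    #include <stdint.h>

    /* hypothetical primitives, not a real API */
    uint8_t mte_random_tag(void);
    void    mte_colour_granule(void *granule, uint8_t tag);
    void   *mte_tag_pointer(void *granule, uint8_t tag);

    void *tagged_alloc(void *free_granule) {        /* 1. caller found a free block */
        uint8_t tag = mte_random_tag();             /* 2. choose a tag              */
        mte_colour_granule(free_granule, tag);      /* 3. colour the memory         */
        return mte_tag_pointer(free_granule, tag);  /* 4. return a matching pointer */
    }

    void tagged_free(void *granule) {
        mte_colour_granule(granule, 0);             /* recolour: use-after-free now traps */
    }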

In reality the tagging is not per byte, because that would be insane: MTE already means a significant increase in the physical RAM requirements for a system. If you have an N-bit tag, you need N extra bits of physical RAM for every granule. I don't know what sort of granule sizes people are looking at, but the multiplier on physical RAM requirements is literally (granule size in bits + bits for tag) / (granule size in bits), so you can see how significant this is.

Unlike PAC, my understanding is there is no cryptographic logic linking the tag to the pointer, so pointer arithmetic continues to work without overhead, whereas in a PAC model p += 1, say, would be: temp = AUTH(p); temp = temp + 1; p = SIGN(temp).

The purpose of PAC is not to protect the memory, but rather the pointer itself. For example, imagine you have a C++ object; the basic layout is essentially:

    struct {
        void* vtable;
        /* data fields */
    };
For those unfamiliar, a vtable is essentially just a list of function pointers to support polymorphism. In this case the vtable pointer is tagged with the appropriate tag for wherever the vtable is. Because the vtable itself is stored in tagged memory it can't be modified by the attacker (in reality vtables are all in read-only memory, but pretend they're not for this example). But if the attacker can get some random, correctly tagged pointer, what they can do is build their own vtable in that memory, and then simply overwrite the vtable pointer with their correctly tagged pointer to the malicious vtable. Of course you can just have the memory holding the object itself also be tagged, so they need the correct pointer tagging for that :D

In the PAC model the pointer is signed by a secret key (it's literally inaccessible to the process) and a nonce (on Mac + iOS this nonce includes the address of the vtable pointer itself). For an attacker to create a valid pointer they need to be able to generate the correct signature over the bits in the pointer and the nonce. Because different nonces are used for pointers in different uses, they can't just get (for example) one object to overwrite another. If the nonce includes the address of the pointer they can't even just copy a validly signed pointer from another location in memory.
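The p += 1 example from above, spelled out with hypothetical pac_sign/pac_auth helpers standing in for the signing/authentication instructions, and the storage address of the pointer as the nonce:

    #include <stdint.h>

    /* hypothetical stand-ins, not a real API */
    uintptr_t pac_sign(uintptr_t p, uintptr_t nonce);
    uintptr_t pac_auth(uintptr_t p, uintptr_t nonce);   /* poisons/traps on a bad signature */

    void bump_signed_pointer(uintptr_t *slot) {
        uintptr_t p = pac_auth(*slot, (uintptr_t)slot);  /* AUTH: verify and strip       */
        p += 1;                                          /* plain pointer arithmetic     */
        *slot = pac_sign(p, (uintptr_t)slot);            /* SIGN: re-sign before storing */
    }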

I really do like the PAC model a lot, but to me the MTE mechanism seems to be a much stronger protection mechanism, albeit a very expensive one (PAC doesn't require additional ram for the signed pointers).


Arm MTE uses a 4-bit tag for each 16-byte region.


Which would eat a little more than 3% of the physical memory in a device.
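Spelling the arithmetic out:

    4 tag bits per 16-byte granule = 4 / (16 * 8) = 4/128 ≈ 3.125% extra DRAM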

Does ARM allow any freedom in tag size, or is it strictly 4 bits?

I realize I may not have been clear for people unfamiliar with MTE*: the tagging is device level, so you can't (for example) put the tags in a separate mapping and just increase your usage of existing memory by 3% (obviously a software implementation could do that, but the perf would probably be suboptimal :D). You literally need X% more DRAM cells.

* Not saying @my123 doesn't understand, just I can't edit my original comment and I figure contextually this is reasonable :D


Strictly 4 bits. For the Morello prototype architecture with full CHERI, it's 1 bit for each 16-byte region (the capability valid bit).


Of course, CHERI faces very different challenges than MTE does ;)


Wasn’t this tried with Jazelle and Java? I wonder how they will overcome the shortcomings of that attempt


This has been tried plenty of times, ARM just decided something else because reasons.

Also note that all hardware vendors are adopting hardware memory tagging as the only way to fix C.

Intel messed up with MPX, but I definitely see them coming up with an alternative, as I bet they won't like to be seen as the only vendor left without such capabilities.


I'm honestly not sure why we haven't just admitted C isn't fixable.


Because that requires throwing away UNIX and many people feel quite strongly about it, given that it has won the data center wars.

> C Language. Dialect ISO C. ISO C source programs invoking the services of this Product Standard must be supported by the registered product.

-- http://get.posixcertified.ieee.org/docs/si-2016.html

I should also note that many attempts to add safer types to C have been tried, WG14 just doesn't care about them.


ntoskrnl.exe is C too


Not since Vista.

https://docs.microsoft.com/en-us/cpp/build/reference/kernel-...

> Creates a binary that can be executed in the Windows kernel. The code in the current project gets compiled and linked by using a simplified set of C++ language features that are specific to code that runs in kernel mode.

And then there is WIL, https://github.com/microsoft/wil

https://community.osr.com/discussion/291326/the-new-wil-libr...

> First off, let me point out that this library is used to implement large parts of the OS. There are hundreds of developers here who use it. So unlike, uh, some other things that get tossed onto github, this project is not likely to wither and die tomorrow.

> There are, however, only a handful of kernel developers working on the library, so the kernel support has been coming along much slower. I'd like to expand the existing kernel features in depth ....


can you explain how the existence of a compiler flag that allows third parties to compile C++ such that it can run in the kernel means that the kernel has been rewritten in C++?

the fact LLVM allows javascript to be transpiled to C doesn't mean the Linux kernel has been rewritten in Javascript


Apparently someone is lacking reading skills in how WIL is used on the kernel.


okay, so there exists a library that allows people to write C++ code that can be loaded into the kernel

this doesn't mean the ntoskrnl.exe is written in C++

the fact nvidia's linux loadable kernel blob is written in C++ doesn't suddenly mean linux is written in C++

"grasping at straws" would seem to sum up your position


Why do you think Microsoft decided to drop C support beyond C89 and only caved in due to the pressure of FOSS projects?

A kernel without drivers, only produces heat.

> "grasping at straws" would seem to sum up your position

Fits exactly the position of someone that desperately wants to assert ntoskrnl.exe is written just like when NT 3.51 got released into the world.

"Kernel proper - This is mostly written in C. Things like the memory manager, object manager, etc. are mostly written in C. The boot loaders are written in ASM, but set up a C environment rather quickly.

Drivers - that said, a lot of newer kernel mode drivers are actually written in C++ (however, its style is more akin to "C with classes". Lower level code has been much slower to adopting anything past C++98)

User land - Mostly C++ with varying levels of quality and version compliance. If it's a pre-Windows 8.0 component, it was written against mostly C++98. More recent features are C++14 and better."

-- https://www.reddit.com/r/cpp/comments/4oruo1/windows_10_code...

Bye, have fun with C.


> "Kernel proper - This is mostly written in C.

thank you

only took 17 hours to get there, but we finally got there


Zig seems like a good candidate. It interfaces really well with C and works as a drop-in replacement, with better type checking, error handling, memory management, etc.


Nice to see Wasm popping up in proposals like this one :)


I suspect it won't be long before RISC-V becomes not-so-RISC. Even ARM added FJCVTZS.


RISC these days really refers mostly to uniformity with a bit of simplicity bolted on the side. Big instruction sets aren't really avoidable in practice, but the advantage AArch64 and RV64 have over X86 in theory is that they aren't totally insane (e.g. AArch64 is fixed width) and reliant on lots of trickery to preserve a machine model from the 70s.

RISC-V basically eliminates a lot of microarchitectural state (flags), whereas AArch64 updates that state conditionally. We will find out which approach is superior soon.


Successful architectures seem to need a certain degree of pragmatism. ARM isn't exactly the RISCiest RISC, nor is AMD64 as baroque as the outer limits of CISC like iAPX 432.

FJCVTZS is an example of pragmatism: the JavaScript spec says float-to-int should be done the way that x86 does it, the original ARM FCVTZS (no J) didn't do it the same way, but JavaScript is so important you have to add a special case.

I hope I'm not mischaracterising the RISC-V side, but I seem to recall their argument against things like FJCVTZS was that there should be some standard set of instructions that compilers should emit for that special case, and the instruction decoder on high end CPUs should be magic enough to detect the sequence and do optimal things (fused instructions?). Which kinda felt like "we must keep the instruction set as simple as possible, even if it makes the implementation of high performance CPUs complex". See also the "compressed instructions" stuff, which feels again like passing the buck for complexity onto the CPU implementation side (unless it's just a Thumb-like 16-bit-wide instruction set thing given a misleading name).


So, with RISC-V the design pretty deliberately enables a combination of compressed instructions and macro op fusion.

The compressed instructions are quite lightweight. It's generally an assembly level thing, and the decoder on the cpu side is apparently ~400 gates.

The compressed instructions are indeed a 16 bit wide thing, but fixing some of the flaws in Thumb. Generally they have more implicit operands or operands range over a subset of registers to fit in 16 bits.

But the hat trick is these two dovetail into each other, such that a sequence of compressed instructions can decompress into a fuse-able pair/tuple, which then decodes into a single internal micro op. This creates a way to handle common idioms and special cases without introducing an ever growing number of instructions. Or at least that's the basic claim by the RISC-V folks. I think they've done enough homework on this to not be trivially wrong, so it'll be interesting to see how things go.
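One commonly cited example of such an idiom (illustrative only, not a claim about what any shipping core does): zero-extending the low 32 bits of a register on RV64 is typically emitted as a slli/srli pair, both of which have compressed forms, and a fusing decoder can collapse the pair into a single internal zero-extend op.

    #include <stdint.h>

    uint64_t zext32(uint64_t x) {
        /* RV64 without Zba/Zbb: typically  c.slli a0,32 ; c.srli a0,32  */
        return x & 0xffffffffu;
    }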


To be honest I kind of understand this “passing the buck”. In computing in general you never trust the guy up the stack to give you good input. Query engines do filter reordering because they don’t trust the optimizer to get it right. Compilers do optimizations because they don’t trust the programmer to get the order of operations right (rightfully). CPUs do OOO because they don’t trust compilers to get the order of instructions right. The way I see it is there are 2 variants: 1) make a specific instruction (clutters the instruction set, makes processors who don’t care implement it), 2) rely on processors who care to implement instruction fusion, and those who don’t will do it the slow way. Either way, it gets implemented in hardware, and processors who care need to make a change in the front end.


> CPUs do OOO because they don’t trust compilers to get the order of instructions right.

Not really. CPUs do out-of-order because cache hits are unpredictable and it is crucial for single-threaded performance to make progress on dependent operations as soon as a loaded value is available.

There may be other, lower order, factors, but variable memory latency is the real reason.


To defend ARM (what? A RISC-V guy defending ARM?) there is absolutely nothing un-RISC about FJCVTZS. Every instruction set with floating point has some way to convert an FP value to an integer. FJCVTZS is no more complex than the existing FCVTZS -- it simply uses a different rounding mode and different behaviour if the value is too big.

I don't know what you think RISC-V "compressed instruction" means. It's precisely equivalent to ARM Thumb2 -- there are 16-bit opcodes and 32-bit opcodes and you can tell which you have by looking at 2 bits (RISC-V) or 3 bits (Thumb2) in the first 16 bits of the instruction.

I don't believe there is any practical "magical" sequence of instructions that could be easily recognised to implement Javascript conversion from float to int. If that is in fact as important as ARM apparently think it is (I have my doubts) then an equivalent of FJCVTZS should be added to RISC-V as an extension.

As for "making the implementation of high performance CPUs complex" … high end CPUs are unavoidably complex. A little bit more is not a big deal. On the other hand, adding complexity to low end CPUs can easily be a complete deal-killer. Splitting an instruction into µops might be a little simpler than combining instructions into macro-ops, but it's not as simple as not having to do it.

Ironically, the people who criticise RISC-V for talking about macro-op fusion seem to be ignorant of the fact that no currently shipping RISC-V SoC does macro-op fusion [1], while every current higher end ARM and X86 does do macro-op fusion of compare (and maybe other ALU) instructions with a following conditional branch instruction.

[1] SiFive U74 can tie together a forward conditional branch over a single integer ALU instruction with that following instruction. They pass down the two execution pipes in parallel (occupying both i.e. they are still two instructions, not a macro-op). The ALU instruction executes regardless, but the conditional branch controls whether the result is written back. i.e. it effectively converts a branch into predication
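The shape being described is a short forward conditional branch over a single ALU instruction, e.g. (a sketch; whether a given compiler emits exactly this shape depends on flags and target):

    long clamp_negative_to_zero(long x) {
        if (x < 0)      /* bgez over the next instruction ...                  */
            x = 0;      /* ... a single "li" the U74 can effectively predicate */
        return x;
    }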


> I don't believe there is any practical "magical" sequence of instructions that could be easily recognised to implement Javascript conversion from float to int. If that is in fact as important as ARM apparently think it is (I have my doubts) then an equivalent of FJCVTZS should be added to RISC-V as an extension.

They claim 2%, but only in JS code. I'd guess static analysis of the v8/JSC/SM JIT output from the top 100 websites would give a very accurate estimation of the savings. One of the most fundamental performance boosters is using 31-bit ints instead of doubles, but every single time the user needs to access a number for output, it must be converted to a double to keep the JS spec contract.
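For context, the 31-bit trick is the usual small-integer ("Smi") tagging; details differ between engines, but the shape is roughly:

    #include <stdint.h>

    /* low bit 0 = small integer, low bit 1 = heap pointer (V8-style) */
    static int      is_smi(intptr_t v)      { return (v & 1) == 0; }
    static intptr_t smi_encode(int32_t n)   { return (intptr_t)n * 2; }
    static int32_t  smi_decode(intptr_t v)  { return (int32_t)(v / 2); }

    /* the conversion the parent mentions: the spec-visible Number is a double */
    double smi_to_number(intptr_t v) { return (double)smi_decode(v); }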

All that said, I think only Apple's last 4-6 chips and ARM's most recent generation of chips actually implement the instruction and people have been fine without it. I'd guess we'll not be seeing this in RISC-V until much lower-hanging fruits have been picked.


> that there should be some standard set of instructions that compilers should emit for that special case, and the instruction decoder on high end CPUs should be magic enough to detect the sequence and do optimal things (fused instructions?)

Detecting a long fixed sequence of instructions and "compressing" them into one internal operation seems like it would require a lot of fetch bandwidth and/or a really wide decoder. x86 has had macro-fusion since Core Solo/Duo.


Those downsides would be real, depending on how awkward the set of instructions is, but on the plus side RISC-V should be able to handle a lot more instructions per cycle in a given power/area budget.


Ok, I said this elsewhere: FJCVTZS is not special, and while JS may have been a motivating factor, the actual behavior is "emulate the x86 double->int conversion"

There is nothing magic about it.

A more correct name for FJCVTZS would be FXCVTZS. What FJCVTZS does is override the default FPU rounding and signaling results for double to integer conversion to match the x86 behaviour. There is no special logic needed in the FPU, all that happens is instead of the instruction passing the current thread FPU rounding and clamping flags, it passes the flags that exactly match x86 behaviour.

That's it.

Because the JS label is inaccurate, everyone believes it to be useless outside of JS, when in reality it's useful to anything that needs x86 behavior for double->int conversion, so any x86 emulators on ARM (QEMU, presumably the translation runtimes, etc).

God I hate that they named it that.


I think the vector operations feel very RISC. One set of operations for the different vector sizes. Another thing to remember is that most of this stuff is an optional part of the ISA.

A good comparison is R7RS with Scheme. The vast majority of it is optional RFCs that exist for the sake of consistency and aren't implemented by most Schemes. The "mandatory" parts are specified via R7RS-small and work is being done on R7RS-large, though even that won't contain every RFC.

I could see us ending up with an equivalent for RISC-V where a common group of extensions get grouped together as a standard (likely including stuff like virtualization support but excluding vector operations).


The fastest open source RISC-V core is already technically generating micro-ops (branch to cmov).


Have you seen how tiny rv64 is compared to x86 and ARM?


I wonder how many future projects will not use RISC-V because middle management will stop reading proposals after the word RISC.





