RISC-V: The Last ISA? (thechipletter.substack.com)
56 points by klelatti on Nov 16, 2022 | 98 comments



The primary issue is the subsets of possible extensions to the core.

Can it replace ARM? Maybe, in some roles that only require core features.

Will it become the popular standard? Probably not, as the groups failed to learn from what ARM did wrong. Most popular ARM cores have many advanced features disabled in the compilers for a few reasons:

1. Some advanced ARM ASIC features would fault on some chips, or were simply unavailable on some silicon revisions.

2. Deploying chip-specific compilers is commercially unsustainable when your competitor can port in billions of dollars of work with gcc (as bad as it is).

3. A complete RISC-V implementation with floating-point and DSP extensions should have been the minimum core. Without these features, the whole first generation of chips dropped into the low-end MCU domain: a saturated market that has already fragmented.

Are there possible solutions?

Complete a fully open 64-bit core with an MMU, FPU, and DSP. Then give the design to companies that have had proprietary hardware issues for decades, like Microchip.

Allowing companies to make proprietary design variants was a mistake.

Happy computing =)


Standardization at the MCU level doesn't matter very much. I can't just write code for an ARM chip with an M0 from company A and expect it to work with an M0 from company B. Likewise, having an M3 chip doesn't guarantee it will have floats or SIMD (IIRC, it doesn't guarantee SIMD width either). And even going between M-series chips can change things like which Thumb 1/2 instructions are available. You're always a slave to whatever company's compiler and garbage Eclipse clone.

RISC-V actually has a better story here. You basically have RV32IC (compressed will always be there outside of extremely niche chips because 400 gates for 25-35% smaller code saves more in RAM/ROM than it costs). Optionally, you may have Zmmul for only hardware multiplication or M for both multiplication and division (shouldn't make a difference outside of the size of the software routines if they are needed).

On the desktop side, there are only two profiles that would actually be used: RVA20S64 and RVA22S64. By the time actual desktop processors are finished, RVA24S64 will likely be around and make a couple of optional extensions required.

https://github.com/riscv/riscv-profiles/blob/main/profiles.a...

Ultra-tiny cores using the still not finalized RV32E may exist in the future, but a company choosing them would know full-well the tradeoffs they were making.

The only poorly-specified area is DSPs, but that's a problem for the entire DSP industry. There are a lot of requirements from the barely-DSP up through units doing a lot of signal processing. On the whole, I'd prefer to look at a handful of letters than have to dig through hundreds of pages of esoteric manuals.


DSP and quaternion operations in the core ASIC would greatly simplify a lot of code (as would pipelined media codecs).

The problem with such instructions has always been how to get compilers to leverage such features automatically (see 30 years of GPU kludges). Perhaps some sort of standardized static library template could create/emulate a set of abstractions for the special opcodes within regular compiler contexts (i.e. still cross-platform compatible like OpenCL, and hiding the ugly inline assembly during binary optimization).
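As a rough illustration of that "hide the inline assembly behind a portable abstraction" idea (everything here is hypothetical: the qmul16 operation, the custom-0 encoding, and the HAVE_CUSTOM_DSP switch are invented for the sketch, not part of any standard):

    /* dsp_ops.h -- portable wrapper around a hypothetical custom DSP instruction.
     * Callers only ever see dsp_qmul16(); the inline assembly is an
     * implementation detail selected at build time. */
    #include <stdint.h>

    #if defined(HAVE_CUSTOM_DSP) && defined(__riscv)
    /* Hypothetical saturating Q15 multiply exposed through the custom-0 opcode
     * space; ".insn r" lets GNU as emit an R-type instruction it has no mnemonic for. */
    static inline int16_t dsp_qmul16(int16_t a, int16_t b)
    {
        int32_t rd;
        __asm__ ("\t.insn r 0x0b, 0x0, 0x00, %0, %1, %2"
                 : "=r"(rd) : "r"((int32_t)a), "r"((int32_t)b));
        return (int16_t)rd;
    }
    #else
    /* Plain-C fallback with the semantics the custom opcode is assumed to have:
     * a saturating Q15 multiply. */
    static inline int16_t dsp_qmul16(int16_t a, int16_t b)
    {
        int32_t p = ((int32_t)a * (int32_t)b) >> 15;
        if (p >  32767) p =  32767;
        if (p < -32768) p = -32768;
        return (int16_t)p;
    }
    #endif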

"RISC-V actually has a better story here", Time will tell, but we have all seen a lot of "better" cores disappear over the years. ;)


You could do this with a custom extension. No particular solution has proven to be definitely better than the rest, so I'd rather they hold off on officially endorsing one method than get it wrong and be stuck with that forever.


Traditionally on CISC the process was:

1. Identify the greatest common sequences of high-level language compiler output in binary image code regions

2. Check if those sequences profile in the top CPU usage of generic processes

3. Optimize those sequences in the languages

4. Verify improved performance

5. Repeat until you reach the no-op delay loops

6. Add the optimization to the compiler

7. Verify improved performance with the sub-optimal original language samples

8. Add the optimization to the ASIC

9. Replace the optimizations in the compiler with the new opcodes

10. Verify improved hardware performance with the sub-optimal original language samples

With RISC, many argued such processes were unnecessary, as every complex op can be emulated in simple code with a faster CPU. This ignores the fact that Moore's law is effectively dead.


That's not too different from how the B and K extensions for RISC-V were designed...


Quantifying the original assertion: model variants/fragmentation being a problem for long-term standardization as an architecture. =)


> This ignores that fact Moore's law is effectively dead.

It wasn't ignoring anything. Moore's law wasn't dead at the time, and awareness of it very much informed the development of RISC.


My point was that the same reasoning does not hold for RISC-V replacing ARM or amd64 architectures. Remember when the Transputer was proposed as a replacement for RISC/busy-register CPUs... fail...

Enhance your calm. =)


I was responding to what you actually said. If you meant something else you should have worded it better.

Have a great day. You're the best. Peace and love.


> 2. Deploying chip specific compilers is unsustainable commercially when your competitor can port in billions of dollars of work with gcc (as bad as it is)

Is it really that prohibitive to have your own custom compiler with LLVM (and CLANG)?


There are very good reasons LLVM is not used in real-time systems.

When running in multitasking environments it doesn't really matter. =)


Not coming from a Real-Time background, I'm quite curious as to why? Do you have some sources to point me to?


The simplest explanation is its code-motion optimization behavior; even regular compilers must turn this off in some circumstances. Much of what makes LLVM performant can also make execution times and latency less predictable.
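A tiny illustration of the code-motion point (the device register and function names are made up): unless something pins it, the optimizer is free to move ordinary computation into or out of a timing-critical window, which is why real-time code leans on compiler barriers like the empty asm below.

    #include <stdint.h>

    /* Hypothetical memory-mapped GPIO register used to bit-bang a strobe pulse. */
    #define GPIO_OUT (*(volatile uint32_t *)0x40001000u)

    static inline void compiler_barrier(void)
    {
        /* Emits no instructions, but stops the optimizer from moving
           memory accesses across this point. */
        __asm__ volatile ("" ::: "memory");
    }

    void strobe(const uint32_t *table, int n)
    {
        uint32_t checksum = 0;

        GPIO_OUT = 1;           /* start of the timing-critical window          */
        compiler_barrier();     /* without the barriers, the loop below may be  */
                                /* scheduled partly before or after the pulse   */
        for (int i = 0; i < n; i++)
            checksum += table[i];
        compiler_barrier();
        GPIO_OUT = 0;           /* end of the window */

        (void)checksum;
    }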

There is a deep rabbit hole here that touches on clocks/hardware/schedulers (RTLinux, RTOSes) and compilers. I am the wrong person to teach you this subject. ;)


Even if Mill Computing can't get off the ground, eventually the patents will expire and someone will pick up on the ideas.


I'm pretty sure Mill computing will never get off the ground. If they were intent on producing something they would've started by now. The only thing they've worked on from the beginning is patents. And since now the focus is on RISC-V due to its openness, Mill will likely not see success unless they go the same route, because people don't want a locked down processor.


Mill is not an alternative to RISC-V; their insn set is a lot closer to hardware-specific microcode (much like VLIW architectures in general) than to a truly general-purpose ISA. RISC-V admits of extensions, but aside from that it's fully general, and most use of RISC-V will target standard ISA configs.


Even if you intend to have a completely open project patents aren’t necessarily a bad thing.

Patents prevent others from patenting it; even if prior art exists, someone can still get a patent, and the legal battle to revoke it can be costly if you don't have large corporate backing.

So getting patents and offering them under an open FRAND license is a good option; some of the most common standards we have work this way.

Ofc if the only reason you get patents is to commercialize them later it’s going to be quite hard to get people to adopt your stuff.


It's a shame they haven't got anything to market. It's such a clever re-think of how instructions operate. If anything, they are probably too far ahead of their time.

Reminds me a bit of Transmeta. "Code morphing" sounds very much like the instruction reordering that's standard now.


Transmeta was an ISA emulator + JIT, and it did ship (it's used by Tegra's "Project Denver" CPU cores).

Doing an emulator/JIT at this level has some severe problems & a very sketchy security story. It's unlikely that anyone else wants to do this, and "AOT" solutions like Rosetta 2 are almost certainly the better approach to ISA flexibility.


Mill's approach is pretty similar to Rosetta in this regard - binaries are distributed in a relatively high-level abstract format that's a bit like LLVM IR, and code generation for a specific chip's concrete ISA is done at/before installation time.


I watched Godard's lectures a few years ago, and he talked about different price points for the Mill -- 'tiers' of processors that differed mainly (IIRC) by the size of the belt. He made some claims about porting binaries from one tier to another without compilation from source, and it struck me as a very optimistic claim, for the reasons you hinted at.

My bread and butter work isn't very close to the metal, but you sound more experienced with this sort of thing. Are you familiar with the mill? Do you think they have a chance of avoiding the weeds that Transmeta got stuck in?


I'm not familiar with Mill at all, but the problem of doing a JIT at this level is where do you store the result and how does the JIT actually run? The CPU can't exactly call mmap to ask the kernel for a read/write buffer, fill it with the JIT result, and then re-map it r+x. So you get things like carveouts where the CPU just has a reserved area to stash JIT results, which is then globally readable/writable by the CPU. Better hope that JIT doesn't have an exploit that lets malicious app A clobber the JIT'd output of kernel routine B when app A gets run through the JIT! It's also not like the kernel is aware of the JIT happening, either, as the CPU can't launch a new thread to do the JIT work. So as far as all profiling tools are concerned it looks like random functions just suddenly take way longer for no apparent reason. Good luck optimizing anything with that behavior. And the CPU then also can't cache these results to disk obviously, so you're JIT'ing these unchanging binaries every time they're run? That's fun.

Maybe all of that is solvable in some way, sure. CPUs can communicate with the kernel through various mechanisms of course, you can build it such that the JIT is a different task somehow that's observable & preemptable, etc... But it's complicated & messy. And very complex for what a CPU microcode typically is tasked with dealing with, for a benefit that seems quite questionable. It's not like there's any reason a CPU doing the JIT is going to be more optimal than the kernel/userspace doing the JIT - it's trivial (and common) to expose performance counters from the CPU, after all.

That doesn't mean a CPU designed for a JIT is inherently bad, it just means doing this at the microcode level like Transmeta was doing is a bad idea.


Probably the patents are part of the reason it hasn't gotten off the ground?


I was visiting a space technology fair quite recently, and I was quite surprised to see how many companies are betting on RISC-V in that space (pun intended).

On another note: does anyone have experience with a RISC-V board you can actually order right now?


I have a LicheePi and a VisionFive 1. The LicheePi is really tiny and really cute, and it's no more than $30. However, it's essentially a Raspberry Pi Zero and so is super slow. The VF1 is like the LicheePi but hugely upgraded, as it's got a whopping TWO cores and a full 8GB of RAM. Both of these boards lack GPUs and so are painful to use as desktops because the CPU is doing all the rendering. Overall, they work shockingly well if you want a working RISC-V device.


Recently announced are the BL808, which is supposed to be under $10, and the VisionFive 2, which is 4 cores and up to 8 GB RAM.

Unfortunately neither features the vector extension.

https://wiki.pine64.org/wiki/Ox64

https://liliputing.com/visionfive-2-risc-v-single-board-comp...


VisionFive 2 has a GPU IIRC


I already know because I was one of the first to back it on Kickstarter.


I got one coming in the mail. Almost as much fun as a Commander x16


I have been waiting for the PINE64 Star64 [1] to be released, planned to be next month.

[1] https://wiki.pine64.org/wiki/STAR64


Clockwork's uConsole R-01 [0] can be ordered now, and ships in a few months. It looks like a fun option at $139.

[0]https://www.clockworkpi.com/product-page/uconsole-kit-r-01


I have a Clockwork Pi DevTerm R-01 which uses the Allwinner D1.

It's fun to mess around with.


My (inexpert) view is that it might be the last ISA because a general-purpose ISA will become less important. I see the CPU core as being more like glue that connects together specialist hardware, so you have a wide SIMD number cruncher (like AVX) but probably off-core, likewise off-core compression (perhaps), encryption, graphics (already there really), probably matrix multiplication, all the fancy hardware for AI that's coming out now, possibly an embedded prolog/logical-deduction device, etc.

So the main ISA will be little more than a distribution device for allocating tasks to subunits, and no-one will care how it's put together.

But I'm no expert.


The primary issue here is that it is expensive to move data between processing units, and having them share memory has all the common problems with race conditions and cache invalidation.

So while we can continue to squeeze more onto a smaller die, providing more capabilities on the same processor looks like it will win out for some time yet.


> The primary issue here is that it is expensive to move data between processing units

I'm thinking of them all on a chip (or even same core) with external DRAM shared, and cpu cache shared.

> and having them share memory has all the common problems with race conditions and cache invalidation

That's pretty well a solved problem, or modern OS's wouldn't work.

> providing more capabilities on the same processor looks like it will win out for some time yet

I agree, for now, tomorrow's cpu will look like today's. But further out, I expect my view might perhaps start coming to pass.


You are arguing both sides here. You argue that moving data between discrete processors is bad, then argue that sharing memory among processors is also bad. One of the two certainly MUST happen.

GPUs seem to be on the right path here, as they trend toward a thread manager sending data to various CPU/ALU/SIMD/matrix/whatever units.


I suggest that eventually it would make sense to move from octet-based ISAs to 64-bit-word-based ISAs, such that sizeof(int, long, float, double, char) is 1.


What would be the benefit? Modern processors handle (what appear as) octet-based memory (despite having 64-bit words) all the time with no issues. In fact, your processor may only work on, say, 128-bit (16 byte cache line) memory accesses at a time.

But if you insist, TI has some DSPs where CHAR_BIT is 16. They're supposedly a massive pain to use because you have to mask off the upper octet every so often.


You mean sizeof({int, long, float, double, char}) == 8, right? [edit: no, I guess not. sizeof returns char-sized units, not bytes. TIL.]

But why?

I don't anticipate we'll ever stop caring about using RAM efficiently, so I assert we need a way to operate on smaller types. The compiler could provide the traditional API by masking/shifting, without instructions specifically for smaller types, but (a) then there's a lot of extra masking and such to do anything, and (b) it would be bad for thread safety. IIUC Alpha was a bit like this in that read-write operations done from different threads on adjacent bytes within a word might clobber each other. Not great...
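A sketch of what a byte store would have to compile to on such a word-only machine (illustrative C standing in for compiler-generated code; the word_t type and little-endian byte numbering are assumptions):

    #include <stdint.h>

    typedef uint64_t word_t;   /* the only size the machine can load or store */

    /* Store one 8-bit value into byte `index` (0..7) of the word at *w.
     * This is roughly what a "char" store would become if the ISA had no byte stores. */
    static void store_byte(word_t *w, unsigned index, uint8_t value)
    {
        unsigned shift  = index * 8;                  /* little-endian byte layout     */
        word_t   mask   = (word_t)0xFF << shift;

        word_t   old    = *w;                         /* 1. load the whole word        */
        word_t   merged = (old & ~mask)               /* 2. clear the target byte      */
                        | ((word_t)value << shift);   /* 3. insert the new byte        */
        *w = merged;                                  /* 4. store the whole word back  */

        /* If another thread does the same dance on a *different* byte of the same
           word between steps 1 and 4, one of the two updates is lost -- the early
           Alpha problem mentioned above. */
    }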

And as colejohnson66 said, there are DSPs like this, but they are explicitly not designed for general-purpose programs.


If you're searching for a way to snap your data to a power-of-two byte grid, you're already not using RAM efficiently.


That seems pretty absolute! Bitstream encoding of everything would introduce a lot of complexity. There are places where it's called for, but I think in many many places the status quo of rounding up to 1/2/4 bytes is a better solution than either bitstream encoding or rounding up to 8 bytes.


RISC-V is a variable-length instruction ISA, where instructions are a multiple of 16 bits.


Only if they port Windows 10 to it!


RV is an open-ended collection of ISAs. Do not doubt that the number of instructions will grow as large as is found in x86 and ARM, making a mockery of the "R".

I am ready for the successor, RISC-6, already. It can retain most details of RISC-V while fixing its more glaring deficiencies. It is not a mistake to optimize an ISA for use in undergraduate courses, but it probably is one to use the identical ISA industrially.

I don't know if the worst deficiency is the need to fuse up to 5 consecutive instructions to get useful operations per clock cycle to a competitive level.

While fixing serious problems, numerous smaller incidentals could be patched, such as having chosen the architectural representation of "true" to be 1 instead of -1.


> Do not doubt that the number of instructions will grow as large as is found in x86 and ARM, making a mockery of the "R".

https://danluu.com/risc-definition/

Number of instructions isn't what is "Reduced" in RISC, but rather the complexity of instructions: fixed-width opcodes with very few addressing modes, such that pipelining is easy and taking exceptions is not as painful as it is when your opcodes have lots of ALU state, like with indirect addressing and opcodes that mix addressing modes with their own ALU operations.

RISC is also somewhat nebulous, with some ISAs being "more RISC" (RISCier?) than others, and some RISC ideas (architectural delay slots, integer multiplication and divide opcodes requiring their own registers) eventually getting dropped entirely as chip technology moves on.

> It is not a mistake to optimize an ISA for use in undergraduate courses, but probably is one to use identically the same ISA industrially.

I don't know that a good ISA needs to be optimized for undergraduate courses to be used in them: MIPS is famously used in H&P and it has quite a following to this day, just not on the desktop or in servers. Was RISC-V optimized for use as a "model organism" like that?


Let us not forget SPARC's register windows.

I don't know if you could say RISC-V was optimized, as such, for undergraduate courses, but historically that was its design target. That said, there was a lot more attention to extensibility via regularized instruction set extensions than the undergraduate target demanded.

RISC-V's permanently-zero register is akin to other RISCs' special multiplication registers.


The number of instructions is already becoming quite large if you consider the V and P extensions. Here are the vector instructions: https://github.com/riscv/riscv-v-spec/blob/master/inst-table...

The set of RISC-V features that each CPU supports is going to be so widely varied that you will need dozens of compiler switches to make full use of any processor, and since most software will be compiled for the lowest common denominator, some capabilities will go largely unused. This is already the case with x86 with AVX-512: the full capabilities are only leveraged in specific workloads.

Either that or developers are going to rethink the way they distribute software. No longer as compiled binaries, but as an IR which is then compiled on each processor, since the best way to learn what features are supported by any CPU is to query them on it.


Or OS authors are going to have to rethink how they handle illegal instruction exceptions, by patching the calling program to call into an emulated version when running on hardware that lacks support; this would allow program authors to choose whether to favor the least common denominator (and give up performance on more capable processors) or to target best possible performance (but take a hit on more limited machines). An ABI that reserves a few registers for this sort of emulation, or a couple of reserved architectural registers, could help a lot.


> Or OS authors are going to have to rethink how they handle illegal instruction exceptions, by patching the calling program to call into an emulated version when running on hardware that lacks support; […]

It can be done at the OS kernel level, but it does not have to be. The usual place to do it is in libc – by virtue of weak symbols in dynamic libraries.

At link time, ld links the executable against a generic function, e.g. __mul512_xyz, which is a weak symbol that points to a generic fallback implementation of the 512-bit-wide operation in libc.so. Then, at process startup time, the libc initialisation code looks up the CPU revision, attempts to locate a CPU-revision-specific, optimised version, e.g. __cpu_family_abc_123_mul512_xyz, and links the optimised version of the function in. I have made the names up but, fundamentally, that is the idea.
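A minimal sketch of that pattern as it looks with GCC/glibc's ifunc mechanism (the names and the cpu_has_mul512() probe are invented here; a real libc would key the resolver off hwcap bits or a CPU-ID query):

    #include <stddef.h>

    /* Two implementations of the same 512-bit multiply primitive. */
    static void mul512_generic(void *dst, const void *a, const void *b)
    {
        /* portable C implementation would go here */
        (void)dst; (void)a; (void)b;
    }
    static void mul512_widemul(void *dst, const void *a, const void *b)
    {
        /* version using the hypothetical wide-multiply extension */
        (void)dst; (void)a; (void)b;
    }

    /* Invented probe: a real libc would read hwcap / a CPU-ID register once. */
    static int cpu_has_mul512(void) { return 0; }

    /* The resolver runs once, when the dynamic linker binds the symbol. */
    static void (*resolve_mul512(void))(void *, const void *, const void *)
    {
        return cpu_has_mul512() ? mul512_widemul : mul512_generic;
    }

    /* Callers link against __mul512_xyz; the loader substitutes the chosen version. */
    void __mul512_xyz(void *dst, const void *a, const void *b)
        __attribute__((ifunc("resolve_mul512")));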

> […] this would allow program authors to choose whether to favor the least common denominator

… and this is where I perceive a problem with RISC-V – based on my current knowledge of the ISA in question. Being an open-source and free ISA, RISC-V places no obligation on two (or three, or more) distinct CPU vendors to implement the same set of ISA extensions (other than the foundational ones). Therefore, effectively we are talking about every possible permutation of all current and future extra ISA extensions having to be supported in libc, which will make dynamic re-linking at runtime either difficult or impossible (not easy, anyway), and the RISC-V-specific optimisation efforts will go out the window.


The foundation releases profiles.

For example, RVA22S64 means it's a 64-bit application processor with supervisor mode complying with the 2022 recommendations. It has a bunch of specific requirements about RAM layout, cache instructions, being little-endian, and so on, all specified so code is portable. Likewise, M (mul/div), A (atomics), F (32-bit float), D (64-bit float), C (16-bit instructions), and some miscellaneous bit-manipulation and other instructions MUST be implemented.

At the same time, it specifies optional extensions like V (vector), Zfh (16-bit floats), scalar crypto, H (hypervisor), and some optional supervisor stuff.

If you want to claim to be a RISC-V chip, you MUST implement the specs exactly (else be infringing on their trademark).

If you want to enter the desktop ecosystem that is being started, you'd better support the extensions they specify in RVA22S64 (or newer standards) otherwise people aren't going to buy your chip. You can implement proprietary stuff provided it doesn't conflict with the standard, but normal software isn't going to use those features, so you are probably just wasting transistors.

All the work on unifying stuff like UEFI is because the companies involved are serious about a compatible standard moving forward. It's in all the biggest players' best interests to stick to the spec, which is all that matters for end users as it guarantees a consistent experience.

https://github.com/riscv/riscv-profiles/blob/main/profiles.a...


> At the same time, it specifies optional extensions like V (vector), Zfh (16-bit floats), scalar crypto, H (hypervisor), and some optional supervisor stuff.

It is the optional extensions that concern me the most, not the foundational ones. Cryptographic functions, for example, can be optimised using a vector instruction set, a crypto-accelerating instruction set, or a combination of both. If something better comes along (hypothetically), it will become another, new instruction extension in the future. The libc runtime now has to do significant guesswork to support a permutation of all available instruction extensions (or fall back to a generic, unoptimised version when none are available), not to mention the libc blowing out quite substantially for RISC-V – in size and in complexity.

> If you want to claim to be a RISC-V chip, you MUST implement the specs exactly (else be infringing on their trademark).

I admire your sense of optimism, but I can nearly guarantee you that a less diligent Chinese RISC-V vendor will not give a flying fig about the trademark and will even cut those extensions out that form the foundational set – if it makes the chip cheaper to design and manufacture. Which has happened in the past with some ARMv7 (or ARMv8?) designs, and which we are nearly guaranteed to see in IoT devices (at the very least). I will not be surprised if a CPU vendor chooses to toss the D extension out and stick with the F extension only for niche case solutions, and it will not be possible to enforce it.


> I admire your sense of optimism, but I can nearly guarantee you that a less diligent Chinese RISC-V vendor will not give a flying fig about the trademark and will even cut those extensions out that form the foundational set – if it makes the chip cheaper to design and manufacture

Who cares? Is HP going to buy a cheap Chinese chip that causes user software to randomly crash?

If that company is going to compete, maybe they want a 5nm chip. They design the chip and get past sanctions, but now have to pay a half-billion to get everything ready for fabrication and then another 2B for that 100k wafer discount.

Now they have huge quantities of chips that are too expensive for the Chinese market and too broken/incompatible for the American/European market. Sticking to the spec -- even if it's just microcoded -- is way better than the alternative.

> It is the optional extensions that concern me the most, not the foundational ones.

Crypto is optional because they believe scalar crypto probably isn't the way forward. They are working on vector crypto, but it's not finalized. Likewise, the vector extension was released too close to the 2022 deadline to force compliance. Look at web standards. Where backward compatibility matters, it's WAY better to be conservative than to mess up forever.

RISC-V chips with decent performance likely won't be out until 2024. By that time, these extensions will be well-established. Companies will have had time to implement them and they will become required.

> I will not be surprised if a CPU vendor chooses to toss the D extension out and stick with the F extension only for niche case solutions, and it will not be possible to enforce it.

This is a particularly unlikely outcome because JS support requires double-precision. Even tiny ARM cores in phones over a decade ago had support for 64-bit floats. If you want to save area, cutting cache or wider SIMD gets you way more bang for your buck.


This won't work, because the emulation will be too slow to provide a usable fallback.

This kind of approach has been used in the past, but the problem is that the overhead of trapping invalid instructions is massive compared to almost anything you do in an instruction, to the point that if you actually use such instructions, your program will be completely unusable on platforms that don't provide them.


The suggestion isn't to implement, say, Vector Add in an illegal instruction handler; instead, it's to use the illegal instruction handler to replace the Vector Add instruction with a jump to a user-space emulation routine. The emulation routine may /also/ be too slow to be usable, but the trap overhead is a once-through cost, not a per-execution cost.


Doable, but you end up having to generate a unique sequence for every single instruction in the program, because for something like 'mul r1,r2,r3' you're going to have to move the information about which registers are being multiplied somewhere else, as well as encode the return address, and as you mentioned the register spill/etc. is going to be costly, or you have to reserve a couple of registers for it.

There are a lot of problems doing it that way too, particularly if you're on a small microcontroller with limited RAM.


It is a good way to turn optimizations into gross pessimizations.


Or you know, you mandate most of the stuff in the core arch document with the assumption that on tiny cores that lack... say, multiply instructions, it's handled by... wait for it... microcode.

Which, actually, given the toy CPUs I've implemented, is pretty easy to do even if you're talking about something that fits on a small/medium-sized FPGA.

Illegal instruction traps are a lot worse/harder to get right than just cracking an unsupported opcode into a string of micro-ops, particularly if the arch has simple addressing modes that themselves won't ever need to be part of multiple micro-ops. AKA you won't have to deal with fault handling in the middle of a sequence. And even worse, the optimal sequence is likely different for every tiny microarch change, or even a change in memory subsystem characteristics. So it needs to be tied to the core and associated memory subsystem, which on a popular arch implemented across the range of platforms where you find x86/ARM is going to be unachievable without lots of suboptimal code being used.


Microcode is an option here, but (having also implemented small microcoded cores on an FPGA) it's easy to be misled about ASIC costs here. ROM on an FPGA is very cheap proportional to logic, because it's implemented in various high-density (compared to the LUTs) memory blocks. A very small RISC-V ASIC implementation can easily be smaller than the ROM that would be necessary to microcode a complex ISA extension, ignoring the cost of the sequencer, while the same design in an FPGA might have zero change in area for the ROM.


On RISC-V it doesn't have to be true microcode. You could trap and emulate in M mode, transparently to the running OS.


Nobody is going to be patching calling programs like this. There may be OS emulation in machine mode (saving and restoring the register file) to pretend to be a "generic" RISC-V.


> Nobody is going to be patching calling programs like this.

Oh the things I've shipped...


I don't think he is suggesting patching individual programs. PPC, for example, simply faults and traps to the OS for any number of illegal-instruction or invalid-alignment situations (which may vary based on core implementation). Then the kernel looks at the faulting instruction, emulates it, and resumes the program. It's absolutely terrible for perf because all the internal state has to be serialized to the architecturally visible state in order to update it and resume.


Nope, I was very much suggesting the former.

The PPC approach works, but it has significant performance costs over and above instruction emulation. I've shipped PPC software that you may have used that uses both approaches, as appropriate.


Eek, at that point you might just as well try and ship an IR with a JIT/AOT type pass during install, first run, etc.

It's possible we know each other IRL... lol


There are only 2 major versions of RISC-V for desktop processors (64-bit with privileged mode) with a third likely coming in 2023 or 2024. That's definitely better than the competition.

x86 has about 35 new extensions in the last 20 years excluding AVX-512 which adds another 20 or so.

It's also no different from ARM where you have (currently in widespread production) ARMv6, ARMv7, ARMv8 32-bit, ARMv8.0, 8.1, 8.2, 8.3, 8.4, 8.5, 8.6, 8.7, 8.8, ARMv9.0, 9.1, 9.2, 9.3, and some ISA extensions that don't fit neatly into ANY of those. Then ARM also announced (no doubt due to pressure from RISC-V) that custom instructions are also allowed.

This is effectively a solved problem for ARM and x86. It will be even less of an issue for RISC-V.


There's far more than 2.

Even just taking RV32I, you have the M, A, F, D standard extensions, which are all optional. A CPU could include any combination of these, so you have 2^4 = 16 possible instruction sets off the start line. But let's assume we're working with G (which means all of IMAFD).

Many PC-grade processors will support some or all of extensions B (bitmanip), T (transactional memory), P (packed SIMD), V (vectors), C (compressed instructions), RV64I, RV128I, Q (quad floats), not to mention all the non-standard extensions.

See: https://groups.google.com/a/groups.riscv.org/g/isa-dev/searc... for non-standard extensions.

You also have vendor specific extensions such as those from Huawei: https://github.com/riscv/riscv-code-size-reduction/tree/main...

As an application developer, finding the right set of features for your users will mean having to distribute many different binaries unless you stick to plain old RV32G, which has the capabilities of a ~20 year old x86 processor.

If you assume all standard extensions, it only takes a processor to lack 1 of them for the program to not work.

You can test for feature sets before running any of these instructions, but this is more commonly done at compile time via the preprocessor. This is what I mean about developers changing their perspective. It will be necessary to move all of those #ifdef __FEATURE__ checks to runtime, unless you recompile on each processor.


>As an application developer, finding the right set of features for your users will mean having to distribute many different binaries unless you stick to plain old RV32G, which has the capabilities of a ~20 year old x86 processor.

Do you mean RV64G? Linux doesn't really support 32-bit anymore. Even then, the VisionFive 2 supports RV64GC and the whole board sells for ~50 USD. The D1 supports RV64GCV (admittedly v0.7) and sells for even less. I don't think any Linux CPUs will skimp on C, and most will have V and B in my opinion.


The set of RISC-V processors is going to be far more diverse than you are expecting. There's already dozens of open cores, both 32-bit and 64-bit, with varying support of standard extensions.

Assume RV64GBCV; do we skip P, T, Q? What about J (extensions for dynamic language runtimes)? Do we just ignore those capabilities and leave applications to not get the best out of the machine?

If we stick to the current status quo of having #ifdef __FEATURE__ scattered through our code, we would want a source-based distribution like Gentoo to make sure we get the most out of our machines, so we can check which features are present and set all the required compiler flags.

Otherwise, we could have runtime checks for features, but this will require changing the way applications are commonly written, because we don't want to be checking for the presence of features everywhere in the code. We only want to check once, at application startup.
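A sketch of that check-once-at-startup style in plain C (the cpu_has_rvv() probe is a stand-in for something like a getauxval(AT_HWCAP) query on Linux, and the vector kernel body is omitted):

    #include <stddef.h>

    /* Two builds of the same hot loop: one plain, one meant to use the V extension.
     * Only the dispatch pattern matters here. */
    static void saxpy_scalar(float *y, const float *x, float a, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            y[i] += a * x[i];
    }
    static void saxpy_vector(float *y, const float *x, float a, size_t n)
    {
        /* ... RVV intrinsics or asm would go here ... */
        saxpy_scalar(y, x, a, n);   /* placeholder so the sketch still runs */
    }

    /* Probed once; stand-in for an hwcap / hwprobe query. */
    static int cpu_has_rvv(void) { return 0; }

    /* Function pointer chosen a single time, at startup. */
    static void (*saxpy)(float *, const float *, float, size_t);

    __attribute__((constructor))
    static void init_dispatch(void)
    {
        saxpy = cpu_has_rvv() ? saxpy_vector : saxpy_scalar;
    }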


It seems like you didn't even bother to read the linked spec. The result is a bad faith argument on your part where you are choosing to ignore the clearly stated facts. You also ignore the reality of x86 and ARM extension fragmentation.

If you want people to use your chip in an interchangeable way (how ALL desktop processors want to work), then you'll implement RVA22S64 which automatically requires you to support a bunch of extensions.

If you want proprietary or cutting-edge extensions in your chip, feel free, but software developers will be using the spec and your extras will just be ignored and drive up your cost for no real value.

If you violate the specs you claim to support, you are liable for trademark infringement as use of the trademark requires adherence to any standards you claim to be supporting. At most you could claim "RISC-V compatible", but that will just become code for "doesn't work with normal software" and nobody will buy your chips and you'll have wasted a huge sum of R&D money.

This is better than x86 or ARM, which you conveniently ignore. x86, for example, has some 55 extensions released in the last 20 years. Some, like SGX or FMA4, only have support from a small subset of chips. Some have disappeared. Many aren't supported by older chips still commonly in use. ARMv8 has TEN different versions PLUS more optional extensions PLUS the ability to add custom instructions. There are also FOUR versions of ARMv9 too (leaving aside the still very popular ARMv6 and v7).

If fragmentation were such a huge problem, then certainly nobody would be using these ISAs. Of course, people use them everywhere which completely disproves your argument.


That is how mainframes and micro-computers have been doing it since forever.

I guess it is time for another great idea for VCs, I know, let's create a WASM runtime for RISC-V.


>Do not doubt that the number of instructions will grow as large as is found in x86 and ARM

x86 has at least ~1,000 instructions,[0][1] maybe more depending on how you count. I can't believe RISC-V will approach that number. I couldn't easily find how many instructions arm64 has. But you may be right about RISC-6, only I think any "RISC-6" will just be RISC-V with the less useful instructions deprecated.

[0]https://www.unomaha.edu/college-of-information-science-and-t... [1]https://fgiesen.wordpress.com/2016/08/25/how-many-x86-instru...


That's 1k instructions ONLY in their most RISC-like form. It also generally excludes stuff like MOV effectively translating into countless instructions when you start looking at all its variants based on source/destination type, data types, concatenated instructions tacked on, etc. (MOV alone is technically Turing complete[0]).

All that stuff is spelled out in the instruction and while it may have the same mnemonic, the actual binary output is different. In these terms, the actual number of instructions is likely in at least the tens of thousands.

In comparison, RISC-V splits off loads and stores into separate instructions. It further splits off each size of load/store into even more instructions. Start clumping all of this stuff together and RISC-V instruction count would also plummet.

[0] https://stackoverflow.com/questions/61048788/why-is-mov-turi...


Number of distinct mnemonics doesn't really determine ISA complexity. As an extreme example, how many AArch64/RISC-V instructions are there corresponding to an x86 MOV?


If you exclude AVX-512, the number is probably less than half that.

And since Intel locked AVX-512 out of Alder Lake processors, if you want a consumer-grade processor that supports it you would buy a Zen 4 from AMD.


>If you exclude AVX-512, the number is probably less than half that.

This paper seems to say there are thousands, am I reading it wrong?

>There are two primary barriers to producing a formal specification of x86-64. First, the instruction set is huge: x86-64 contains 981 unique mnemonics and a total of 3,684 instruction variants.[0]

[0]https://dl.acm.org/doi/10.1145/29809



> While fixing serious problems, numerous smaller incidentals could be patched, such as having chosen the architectural representation of "true" to be 1 instead of -1.

What issues does that cause? Finding it difficult to web search for…


The only one I can recall off the top of my head is that it makes `first & bool | second & ~bool` faster, since you don't need a `dec r0` instruction to turn {0,1} into a bitmask. Anyone else have a better guess?


"that & bool", yielding either zero or "that", is common enough, by itself, to matter.


Huh, good point, actually; I'd overlooked that, eg in `sum += filter[i] & elem[i]`.
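For the curious, a small generic-C illustration of the two conventions (not tied to any particular ISA):

    #include <stdint.h>

    /* true == -1 (all bits set): the value is usable as a mask directly. */
    static int64_t select_allones(int64_t first, int64_t second, int64_t cond_mask)
    {
        return (first & cond_mask) | (second & ~cond_mask);
    }

    /* true == 1: one extra negation turns {0,1} into {0,-1} before masking --
     * the extra instruction the parent comment would like to avoid. */
    static int64_t select_zero_one(int64_t first, int64_t second, int64_t cond01)
    {
        int64_t mask = -cond01;
        return (first & mask) | (second & ~mask);
    }

    /* The filtered-sum pattern above, written with all-ones masks. */
    static int64_t filtered_sum(const int64_t *elem, const int64_t *filter_mask, int n)
    {
        int64_t sum = 0;
        for (int i = 0; i < n; i++)
            sum += elem[i] & filter_mask[i];
        return sum;
    }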


I always pronounce RISC-V in my head as "RISC VEE". Even "RISC OR" is easier for me. If they call the successor RISC-VI then I'll probably mentally autocorrect it to "RISC-VIM" anyway, it's a lost cause.


I pronounce it "riskive", myself. Had there been a RISC-IV it would have been pronounced the same, so it is lucky there wasn't one.


>First it’s likely to be a sign that CPU ISAs are, to some extent, done.

I think this is not only correct but in a way almost obvious. Now that RISC has decidedly won out over CISC, well, what does RISC imply? To reduce the number of instructions to the minimum (or, more accurately, to min/max the number). What we are left with are ISAs with few total instructions. This basically limits how much room there is for improvement, and the core instructions do not really change across modern RISC ISAs. As a result, the best feature of RISC-V is not any technical feature but its licensing. If it gains mass adoption, nobody will ever switch to anything else, because the ISA isn't the bottleneck of CPU performance anymore. A potential new ISA would have to target a different market such as GPU or ML to succeed.


> Now that RISC has decidedly won out over CISC

Has it? The modern "RISC" designs out there in the wild include a lot of pretty specialized instructions.

I would agree load/store has won over instructions with memory side effects common to many CISC setups. Keeping side effects to a minimum to improve out of order and speculative execution has been a key goal of most modern ISAs, but keeping instruction count to a bare minimum has not. CPUs have landed in a space between RISC and CISC where simplified and reduced instruction count is certainly a goal, but the ISA will happily toss in a specialized instruction where significant speedup can be achieved.


Judging from the number of instructions in all its necessary extensions, RISC-V is RISC in name only. ARM ("Acorn RISC Machine") instructions are as absurdly many as x86's, which does not even pretend to be RISC. So it is silly to say RISC has won.


RISC has nothing to do with instruction counts. Think of it as a "set of reduced instructions", not a "reduced set of instructions". It's the fact that every instruction only does something "simple" which you can directly implement in hardware that matters here.

In case it's still not obvious: PowerPC is obviously a RISC ISA, but it has just as many instructions as some CISC ISAs out there. Meanwhile PDP-8 has 8 instructions, but it's clearly a CISC ISA.


I think "do something simple" is going to get phased out and replaced by "do something useful" over time.

Consider if you want to load a 64-bit immediate into a register under the current RISC-V spec. It takes so many instructions that it's easier to just load the constant from memory, despite memory access being much slower.

    lui  a0, 0x01234
    addi a0, a0, 0x568      # upper word + 1, to absorb the borrow from the low word
    slli a0, a0, 32
    lui  a1, 0x89ABD
    addi a1, a1, -0x211     # 0xFFFFFFFF89ABCDEF (lui/addi immediates are sign-extended)
    add  a0, a0, a1         # a0 = 0x0123456789ABCDEF
24 bytes, 6 instructions, and 2 registers.

In x86_64:

    movabs rax, 0x0123456789ABCDEF
10 bytes, 1 instruction cycle, 1 register.


This is a red herring; considering that RISC-V binaries are generally the same size as or smaller than x86's, there are certainly code sequences where RISC-V is much smaller for the same code.

Another interesting metric is that x86 average instruction length is 4.25 bytes[0] which further implies that x86 instructions are only getting longer as time goes by. Meanwhile, RISC-V should continue to get more dense as new instructions are added.

[0] https://aakshintala.com/papers/instrpop-systor19.pdf


How many uOps is that instruction in x86_64?

Note that while the performance effect of L1I cache misses is negligible, the complex x86-64 decoder may not be. Some SPEC CPU 2017 benchmarks have front-end problems due to the x86-64 decoder.


`MOV Rq, Iq` takes one uop with a throughput of up to four(!) per cycle on some microarchitectures: https://uops.info/html-instr/MOV_R64_I64.html


That's only for a fully general 64-bit value though. Most constants in practical use can be loaded/recomputed a lot more easily.


Practical uses are double-precision floating-point constants, preprocessor-generated bit flags, high-bit pointer tagging in dynamic languages, etc. Having to read them from memory or expend multiple cycles to materialize them into a register defeats the point of even having them as constants.
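As a concrete (and entirely made-up) example of the pointer-tagging case: a NaN-boxing style runtime compares every value against full-width constants on its hottest path, so how those constants get materialized matters.

    #include <stdint.h>

    /* Hypothetical NaN-boxing style layout: the top 16 bits carry the type tag. */
    #define TAG_MASK    0xFFFF000000000000ULL
    #define TAG_POINTER 0xFFFC000000000000ULL

    static inline int is_pointer(uint64_t v)
    {
        /* TAG_MASK and TAG_POINTER are arbitrary 64-bit constants the compiler
           must materialize (or load) on every dynamically-typed value access. */
        return (v & TAG_MASK) == TAG_POINTER;
    }

    static inline void *payload(uint64_t v)
    {
        return (void *)(uintptr_t)(v & ~TAG_MASK);
    }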


When you see RISC replace it in your head with "load/store architecture".

Like when you look at the SIMD implementations on x86 "CISC" machines they're all load/store based instead of register/memory based like on the scalar and x87 implementations.

This is what people mean when they say RISC has won.


The practical definition that came through from reading Computer Architecture: A Quantitative Approach by the two people who did most to make RISC happen (Hennessy at Stanford and Patterson at Berkeley) had to do with this formula.

work/second = work/instruction * instructions/cycle * cycles/second

The "reduction" in RISC was in work (and complexity) per instruction, enabling even greater gains in the other two factors. I'm well aware that this isn't the official definition, but IMO it's the perspective that has held up best through all the intervening years. Load/store is certainly part of that. So are orthogonal instruction and register sets, exposed pipelines, software TLBs, etc. We've gotten away from some of those things nowadays, but in the early days they were necessary to drive performance despite lower work per instruction.


> exposed pipelines

Oh boy. Good old branch delay slots on the earlier MIPS architectures.


Branch and load delays on the MC88K. I was working at Encore then, doing things like writing exception handlers and optimizing bcopy/memcpy. I actually enjoyed the puzzle-like aspect of it, but I totally got why others weren't into it.


>ARM ("Acorn RISC Machine") instructions are as absurdly many as x86's

Can you give the number of x86_64, ARM64 and RISC-V instructions? I didn't think ARM64 had 1000s of instructions.



