
The quote in there mentions “rate of iteration”, which, given the noise Qualcomm were making, makes me speculate that Qualcomm are planning on forking RISC-V to be more arm64ish, and Google would not want to be maintaining both that and vanilla RISC-V in this until the dust settles. It's either something like that, or they really are removing it.



I suspected it was not good news for RISC-V when Qualcomm got involved. They are going to split the RISC-V ecosystem.


> They are going to split the RISC-V ecosystem.

Part of the DNA of RISC-V is to provide a basis from which a million flowers can bloom. Instead of homebrewing, you can reuse RISC-V and make tweaks as you need... if you really do need a custom variant. Think of it as English, a common substratum from which there are lots of mostly interoperable dialects. This is what happened with Unix.

Now if we want a mainstream consumer platform then we will need a dominant strain or two. The RISC-V foundation is doing that with the RVA23 profiles etc., but even if there are a few major ones, that should be navigable. Linux had support for PC-98 [1], which was a Japanese alternative x86 platform.

The changes I've seen proposed by Qualcomm don't seem drastically different [2], and could be incorporated in the same binaries by sniffing the supported features. The semantics are what matter, and those aren't different at all. They could even be supported with trap-and-emulate.

[1] https://en.wikipedia.org/wiki/PC-98 [2] https://lists.riscv.org/g/tech-profiles/attachment/332/0/cod...


The code-size-reduction instructions are an extension that will go through and will eventually be supported by everyone; they are not the bone of contention here. They are designed to be "brown-field" instructions, that is, to fit into the unimplemented holes in the current spec.

The reason the spec is going to split is not them, but the fact that Qualcomm also wants to remove the C extension, and put something else in the encoding space it frees.


Hmm, that seems like a mistake, because C allows for instruction compression at low decode cost, which is perfect for embedded use, and embedded is a big part of RISC-V usage now.

That said, if they implemented C and then made their replacement toggleable with a CSR, that would still be backwards (albeit not forwards) compatible, so it'd only be an issue if Qualcomm RISC-V binaries became dominant. But I don't think binaries are going to be that dominant outside of firmware going forward, and any that are will come from vendors that will multi-target.


>> Hmm, that seems like a mistake, because C allows for instruction compression at low decode cost, which is perfect for embedded use, and embedded is a big part of RISC-V usage now.

It may be low cost to decode a compressed instruction, but having them means regular 32-bit instructions can cross cache lines and page boundaries.

My own thought is that there should be a "next" version or RISC-VI that is mostly assembler-level compatible but changes all the instruction encodings to be more sane. What that means exactly is still a bit fuzzy, but I am a fan of immediate data being stored after the opcode.


> My own thought is that there should be a "next" version or RISC-VI that is mostly assembler-level compatible but changes all the instruction encodings to be more sane.

I feel like that is really a case of Chesterton's fence. It was done by people who literally wrote the book on processor design (David Patterson, author of "Computer Architecture: A Quantitative Approach", "The Case for RISC", "A Case for RAID", ...). I have heard a talk giving the rationale behind where the bits are placed to simplify low-end implementations.

> What that means exactly is still a bit fuzzy, but I am a fan of immediate data being stored after the opcode.

As a hobbyist, I get it... but except for when you are reading binary dumps directly, which happens so rarely these days, when is that ever relevant? That is just OCD. I think of this video when I get the same itch and temptation. https://www.youtube.com/watch?v=GPcIrNnsruc

Also, let's not forget that RISC-V is already a thing with millions of embedded units already shipped.


>> I feel like that is really a case of Chesterton's fence. It was done by people who literally wrote the book on processor design

It was originally intended for use in education where students could design their own processors and fixed instruction sizes made that easier. I'm not saying "therefore it's suboptimal", just that there were objectives that might conflict with an optimal design.

>> > What that means exactly is still a bit fuzzy, but I am a fan of immediate data being stored after the opcode.

>> As a hobbyist, I get it... but except for when you are reading binary dumps directly, which happens so rarely these days, when is that ever relevant?

How about in a linker, where addresses have to be filled in by automated tools? Sure, once the weirdness is dealt with in code it's "done", but it's still an unnecessarily complex operation. Also, IIRC there is no way to encode a 64-bit constant; it has to be read from memory.

Maybe I'm wrong, maybe it's a near-optimal instruction encoding. I'd like to see some people try. Oh, and Qualcomm seems to disagree with it, but for reasons that may not be as important as they think (I'm not qualified to say).


> Also, IIRC there is no way to encode a 64-bit constant; it has to be read from memory.

There never is; you can never set a constant as wide as the word length. Instead you must "build" it. You can either load the high bits as low, shift the value, and then add the low bits, or sometimes, as SPARC has it ('sethi'), there is an instruction that combines the two for you.

https://en.wikibooks.org/wiki/SPARC_Assembly/Control_Flow#Ju...

https://en.wikipedia.org/wiki/SPARC#Large_constants
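As a rough, RISC-V-flavoured illustration of that "build it in pieces" idea (a sketch, not a real toolchain: the function name and example value are mine): this is the standard LUI + ADDI split that the `li` pseudo-instruction expands to for a 32-bit constant; a full 64-bit constant needs a longer lui/addi/slli chain or a load from a constant pool.

```c
#include <stdint.h>
#include <stdio.h>

/* Sketch: split a 32-bit constant into the LUI (upper 20 bits) + ADDI
 * (low 12 bits) pair used on RISC-V. ADDI sign-extends its immediate,
 * so the upper part is rounded up when bit 11 of the constant is set. */
static void split_const32(int32_t value) {
    int32_t hi = (int32_t)(((int64_t)value + 0x800) >> 12); /* for LUI  */
    int32_t lo = value - (hi << 12);                         /* for ADDI */
    printf("lui  rd, 0x%x\n", (uint32_t)hi & 0xFFFFFu);
    printf("addi rd, rd, %d\n", lo);
}

int main(void) {
    split_const32(0x12345FFF); /* bit 11 set, so LUI gets rounded up */
    return 0;
}
```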


> millions of embedded units already shipped

10+ billion. With billions added every year.

T-Head says they've shipped several billion C906 and C910 cores, and those are 64-bit Linux application cores, almost all of them with draft 0.7.1 of the Vector extension. The number of 32-bit microcontroller cores will be far higher (as it is with Arm).


Yes, I was curious why the compressed format didn't require:

1. Non-compressed instructions are always 4-byte aligned (pad with a 2-byte NOP if necessary, or use an uncompressed 4-byte instruction to fix the sizing).

2. Jump targets are always 4-byte aligned (a requirement which exists without C, but which C relaxes).

This avoids cache-line issues and avoids jumps landing inside an instruction. You can consider each pair of compressed instructions as a single 4-byte instruction.

It's a bit redundant to encode the C prefix twice, so there's room to make use of that (take up less encoding space, at least, by having the prefix be twice as long), but that's not important.


I completely agree. Not that everything has to be relaxed, but at least the things that made it impossible to decode RISC-V when C is enabled. The amount of code needed to detect when and how instructions are laid out is much larger than it should be.


"impossible"?

It's a little easier than ARMv7, and orders of magnitude easier than x86, which doesn't seem to be preventing high-performance x86 CPUs (at an energy-use and surely silicon-size penalty, admittedly).

Everyone else in the RISC-V community except Qualcomm thinks the "C" extension is a good trade-off, and the reason Qualcomm don't is very likely because they're adapting Nuvia's Arm core to run RISC-V code instead, and of course that was designed for 4-byte instructions only.


That is a trade-off towards density that seems not worth it, when all it would take is a 16-bit NOP to pad, and a few more bytes of memory, to save on the transistors of the implementation.

Maybe they did the actual math and figured it's still cheaper? Might be worth it.


SiFive slides: https://s3-us-west-1.amazonaws.com/groupsioattachments/10389...

Their argument is that since eventually there'll be 8-byte instructions, those will have the same cache-line issues (though that could be addressed by requiring 8-byte instructions to be 8-byte aligned).


Check your link? It isn't working for me.



C is good for high-performance instruction sets too. Funny how every company that starts with green-field RISC-V never mentions it as a problem, and yet the one company that wants to leverage its ARM investment thinks it's a huge problem that will literally break the currently established standard.


The solution to this is not to split, but just to follow Qualcomm. Their vision for the future of the ISA is simply much better than SiFive's.

Right now, most devices on the market do not support the C extension, and any code that tries to be compatible does not use it. Qualcomm wants to remove it because it is actively harmful for fast implementations, and burns 75% of the entire encoding space, which is already extremely tight. SiFive really wants to keep it. The solution to fragmentation is to just disable the C extension everywhere, but SiFive doesn't want to hear that.


> Right now, most devices on the market do not support the C extension

This is not true, and that is easily verifiable.

The C extension is de facto required; the only cores that don't support it are special-purpose soft cores.

The C extension in the smallest available IP core: https://github.com/olofk/serv?tab=readme-ov-file

Supports the M and C extensions: https://github.com/YosysHQ/picorv32

Another size-optimized core with C extension support: https://github.com/lowrisc/ibex

The C extension in the 10-cent microcontroller: https://www.wch-ic.com/products/CH32V003.html

This one should get your goat; it implements as much as it can using only compressed instructions: https://github.com/gsmecher/minimax


The expansion of a 16-bit C insn to 32-bit isn't the problem. That part is trivial. The problem (and it is significant) is for a highly speculative superscalar machine that fetches 16+ instructions at a time but cannot tell the boundary of instructions until they are all decoded. Sure, it can be done, but that doesn't mean that it doesn't cost you in mispredict penalties (AKA IPC) and design/verification complexities that could have gone to performance.

It is also true that burning up the encoding space for C means pain elsewhere. Example: branch and jump offsets are painfully small. So small that all non-toy code needs to use a two-instruction sequence for every call (and sometimes more).

These problems don't show up on embedded processors and workloads. They matter for high performance.


> fetches 16+ instructions at a time but cannot tell the boundary of instructions until they are all decoded

Not fully decoded though, since it's enough to look at the lower bits to determine instruction size.
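For the curious, that rule is simple enough to sketch. This is my reading of the length encoding in the base spec (reserved longer formats omitted); the function name and test values are just for illustration:

```c
#include <stdint.h>
#include <stdio.h>

/* Sketch of the length rule the parent refers to: the low bits of the
 * first 16-bit parcel determine the instruction length. */
static int insn_length(uint16_t parcel) {
    if ((parcel & 0x03) != 0x03) return 2;  /* compressed (C)      */
    if ((parcel & 0x1C) != 0x1C) return 4;  /* standard 32-bit     */
    if ((parcel & 0x3F) == 0x1F) return 6;  /* 48-bit format       */
    if ((parcel & 0x7F) == 0x3F) return 8;  /* 64-bit format       */
    return 0;                               /* >=80-bit / reserved */
}

int main(void) {
    printf("%d\n", insn_length(0x4501));  /* c.li a0, 0 -> 2 */
    printf("%d\n", insn_length(0x0013));  /* addi (nop) -> 4 */
    return 0;
}
```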

> Sure, it can be done, but that doesn't mean that it doesn't cost you in mispredict penalties

What does decoding have to do with mispredict penalties?

> Example: branch and jump offsets are painfully small

Yes, that's what the 48-bit instruction encoding is for. See e.g. what the scalar efficiency SIG is currently working on: https://docs.google.com/spreadsheets/u/0/d/1dQYU7QQ-SnIoXp9v...


> Not fully decoded though, since it's enough to look at the lower bits to determine instruction size.

It is not about decoding, which happens later; it is about 32-bit instructions crossing a cache-line boundary in the L1-i cache, which happens first.

Instructions are fetched from the L1-i cache in bundles (i.e. cache lines), and the size of the bundle is fixed for a specific CPU model. In all RISC CPUs, the size of a cache line is a multiple of the instruction size (mostly 32 bits). The RISC-V C extension breaks this alignment, which incurs a performance penalty for high-performance CPU implementations, but is less significant for smaller, low-power implementations where performance is not a concern.

If a 32-bit instruction crosses a cache-line boundary, another cache line must be fetched from the L1-i cache before the instruction can be decoded. The performance penalty in such a scenario is prohibitive for a very fast CPU core.

P.S. Even worse if the instruction crosses a page boundary, and the page is not resident in memory.
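To make the straddling case concrete (a toy check; the 64-byte line and 4 KiB page sizes are assumed typical values, not anything the ISA mandates): with C enabled, a 32-bit instruction may start at any 2-byte offset, so one starting 2 bytes before the end of a line or page spans the boundary.

```c
#include <stdint.h>
#include <stdio.h>

/* Toy illustration: does an instruction of `len` bytes starting at
 * `addr` straddle a boundary of size `boundary`? */
static int crosses(uint64_t addr, unsigned len, unsigned boundary) {
    return (addr % boundary) + len > boundary;
}

int main(void) {
    printf("crosses 64B line: %d\n", crosses(0x1000 + 62, 4, 64));  /* 1 */
    printf("crosses 4K page:  %d\n", crosses(0x2000 - 2, 4, 4096)); /* 1 */
    return 0;
}
```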


I don't think crossing cache lines is particularly much of a concern? You'll necessarily be fetching the next cache line in the next cycle anyway to decode further instructions (not even an unconditional branch could stop this I'd think), at which point you can just "prepend" the chopped tail of the preceding bundle (and you'd want some inter-bundle communication for fusion regardless).

This does of course delay decoding this one instruction by a cycle, but you already have that for instructions which are fully in the next line anyways (and aligning branch targets at compile time improves both, even if just to a fixed 4 or 8 bytes).


> I don't think crossing cache lines is particularly much of a concern?

It is a concern if a branch prediction fails and the current cache line has to be discarded, or has been invalidated. If the instruction crosses a cache-line boundary, both lines have to be discarded. For a high-performance CPU core, that is a significant and, most importantly, unnecessary performance penalty. It is not a concern for a microcontroller or a low-power design, though.


Why does an instruction crossing cache lines have anything to do with invalidation/discarding? RISC-V doesn't require instruction cache coherency, so the core has little restriction on its behavior if the line was modified; the only real constraints come from explicit synchronization instructions. And if you have multiple instructions in the pipeline, you'll likely already have instructions from multiple cache lines anyway. I don't understand what "current cache line" even means in the context of a misprediction, where the entire nature of the problem is that you did not know where to run code from, and thus shouldn't know of any related cache lines.


Mispredict penalties == latency of the pipeline. Needing to delay decoding/expansion until after figuring out where instructions actually start will necessarily add a delay of some number of gates (whether or not this ends up increasing the mispredict penalty by any cycles of course depends on many things).

That said, the alternative of instruction fission (i.e. what RISC-V avoids requiring) would add some delay too. (I have no clue how these compare though, as I'm not a hardware engineer. RISC-V does benefit from instruction fusion, which can similarly add latency, and whose requirement other architectures could decide to try to avoid - though it'd be harder to keep avoiding it as hardware potential improves while old compiled binary blobs stay unchanged - so it's complicated.)


Ah, that makes sense, thanks. I think in the end it all boils down to both the ARM and the RISC-V approaches being fine, with slightly different tradeoffs.


> Qualcomm wants to remove it because it is actively harmful for fast implementations

Qualcomm's "fast implementation" reportedly started out life as an ARM core and has had it's decoders replaced. That explanation makes their very different use of the instruction space make much more sense to me than any other. They did the minimum to adapt an existing design. Not the stuff of lasting engineering.


> Right now, most devices on the market do not support the C extension

That's outright false.

And outside of the actual devices, the whole software ecosystem very much uses the C extension.

Qualcomm simply wants to break the standard to make money; that's literally all it is.

> Qualcomm wants to remove it because it is actively harmful for fast implementations

Funny how not a single company other than Qualcomm argues this. Not Ventana, not SiFive, not Esperanto, not Tenstorrent, none of the companies from China.

It's almost, almost as if it's not that big of a deal and Qualcomm simply wants to save money and reuse ARM IP.

> The solution to fragmentation is to just disable the C extension everywhere, but SiFive doesn't want to hear that.

Literally nobody except Qualcomm wants to hear it. It wasn't even a discussion before Qualcomm came along. All the other companies had plenty of opportunity to bring up issues in all the working groups, and nobody did. Literally not a single company gave a talk about how the C extension was holding them back. In fact most of them were saying the opposite.


There is a lot of stuff behind the scenes that you don't know. Your statement about "other companies" is completely wrong.


These things are supposed to be discussed in the open; it's an open standards process. So please link me to the official statements by these companies saying that they are unhappy.

I have watched many discussions and updates by the work-groups, and nobody came forward.

And why are they afraid to come forward? If this is so important then shouldn't there be an effort to convince the community?

So sorry, but until I see something other than claims like 'there is a shadow cabal of unhappy companies planning a takeover', I'm not going to buy that this is a widespread movement.


>I have watched many discussions and updates by the work-groups, and nobody came forward.

Rivos came forward. They (kindly) told Qualcomm not to put words in their mouth.

Rivos is, of course, totally fine with C.


Yup. Rivos said basically "Please don't interpret our willingness to look at your data, once you provide it, as supporting your claims"


pray tell


> Right now, most devices on the market do not support the C extension, and any code that tries to be compatible does not use it.

I don't know of ANY commercially-sold RISC-V chips that don't implement the C extension. Even the 10 cent CH32V003 implements C (RV32EC).

> burns 75% of the entire encoding space

In ARMv7 the 2-byte original Thumb instructions burn 87.5% of the 4-byte encoding space (28 out of 32 combinations of the 5 MSBs).


> most devices on the market do not support the C extension

Name one that doesn't, it's exactly the opposite (for 64-bit).


Maybe you've found the solution: RV32 must have the C extension.

RV64 and RV128 must NOT have the C extension.

Problem solved?


No, I meant that for 64-bit CPUs virtually every available one supports the C extension.


> to be more arm64ish

I have a hard time picturing what this means. There's so much flexibility in implementing a core; what would they want to change in the ISA to make them like it better in a non-negligible way?



Mainly, richer addressing modes.

SiFive designed RISC-V to have braindead-level simple addressing modes, with the idea that you use 2-4 normal ALU ops to do addressing instead of a single op with a more complicated addressing mode. Then, to reduce the horrible impact this has on code size, they introduced the C extension, which burns 75% of the encoding space of 32-bit instructions on 16-bit instructions, but this is still only a band-aid and a much weaker solution than having better addressing modes in the first place.


RISC-V already has an extension for simplifying address calculations, Zba, required for RVA23, for doing x*2+y, x*4+y, and x*8+y in a single instruction (sh1add/sh2add/sh3add; these don't have compressed variants, so always 4 bytes). Combined with the immediate offset in load/store instructions, that's two instructions (6 or 8 bytes depending on whether the load/store can be compressed) for any x86 mov (when the immediate offset fits in 12 bits, at least; compressed load/store has a 5-bit unsigned immediate, multiplied by width).
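As a concrete (hedged) example of that pattern, the plain indexed load below is the sort of thing being discussed; the function name is mine and the noted instruction sequences are only what compilers typically emit, as actual codegen depends on compiler and flags.

```c
#include <stdint.h>

/* The indexed-load pattern discussed above. On RV64 with Zba this is
 * typically two instructions, roughly:
 *     sh3add a0, a1, a0   # a0 = base + (i << 3)
 *     ld     a0, 0(a0)
 * while x86-64 can do it in one mov with a scaled-index addressing mode. */
int64_t load_elem(const int64_t *base, int64_t i) {
    return base[i];
}
```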

Also, SiFive didn't design this - in 2011 there's already "Given the code size and energy savings of a compressed format, we wanted to build in support for a compressed format to the base ISA rather than adding this as an afterthought" in the manual [0], while SiFive was founded in 2015.

[0]: https://www2.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-...


It's not as black and white as you portray it.

Both sides of the argument agree that high-performance implementations are doable and not hindered much by either choice; you can watch the debate in the Profiles SIG Zoom meetings.

I think the real dispute is about how the opcode space and code density will evolve as more and more extensions need to be added. Will more 32-bit opcode space and aligned 64-bit instructions, but no 16-bit and no 48-bit instructions, be a better choice in the long term than less 32-bit opcode space, but with 16/48/64-bit instructions?

Qualcomm is also currently involved in the Combined Instructions SIG, and has proposed a few instructions in that vein: https://docs.google.com/spreadsheets/d/1dQYU7QQ-SnIoXp9vVvVj....

Notice that these are currently very mild combined instructions, like BEQI or CLO, which are unlikely to be cracked into uops, compared to more complex addressing modes (e.g. Apple silicon needs to crack pre/post-increment loads/stores).

BTW, this article is actually about removing Qualcomm-specific stuff from Android, see: https://lists.riscv.org/g/sig-android/topic/105816077#msg389


It's the other way around. Support for Qualcomm's incompatible platform is being removed[0].

0. https://news.ycombinator.com/item?id=40212173



