Hacker News
“Risc V greatly underperforms” (gmplib.org)
310 points by oxxoxoxooo 52 days ago | 348 comments



I don't think they even tried to read the ISA spec documents. If they had, they would have found that the rationale for most of these decisions is solid: evidence was considered, all the factors were weighed, and decisions were made accordingly.

But ultimately, the gist of their argument is this:

>Any task will require more Risc V instructions than any contemporary instruction set.

Which is easy to verify as utter nonsense. There's not even a need to look at the research, which shows RISC-V as the clear winner in code density. It is enough to grab any Linux distribution that supports RISC-V and look at the size of the binaries across architectures.


I am sorry, but saying that RISC-V is a winner in code density is beyond ridiculous.

I am familiar with many tens of instruction sets, from the first vacuum-tube computers through all the important instruction sets still in use, and there is no doubt that RISC-V requires more instructions and a larger code size than almost all of them, for any task.

Even the hard-to-believe "research" results published by RISC-V developers have always shown worse code density than ARM; the so-called better results were for the compressed extension, not for the normal encoding.

Moreover, the results for RISC-V are hugely influenced by the programming language and the compiler options that are chosen. RISC-V has an acceptable code size only for unsafe code; if the programming language or the compiler options require run-time checks to ensure safe behavior, then the RISC-V code size increases enormously, while for other CPUs it barely changes.

The RISC-V ISA has only 1 good feature for code size, the combined compare-and-branch instructions. Because there typically is 1 branch for every 6 to 8 instructions, using 1 instruction instead of 2 saves a lot.

Except for this good feature, the rest of the ISA is full of bad features, which frequently require at least 2 instructions instead of 1 instruction in any other CPU, e.g. the lack of indexed addressing, which is needed in any loop that must access some aggregate data structure, in order to be able to implement the loop with a minimum number of instructions.


You seem to be making your whole argument around some facts which you got wrong. The central points of your argument are often used in FUD, thus they are definitely worth tackling here.

>Even the hard-to-believe "research" results published by RISC-V developers have always showed worse code density than ARM

The code size advantage of RISC-V is not artificial academic bullshit. It is real, it is huge, and it is trivial to verify. Just build any non-trivial application from source with a common compiler (such as GCC or LLVM's clang) and compare the sizes you get. Or look at the sizes of binaries in Linux distributions.

>the so-called better results were for the compressed extension, not for the normal encoding.

The C extension can be used anywhere, as long as the CPU supports the extension; most RISC-V profiles require it. This is in stark contrast with ARMv7's Thumb, which was a literal separate CPU mode. Effort was put into making this very cheap for the decoder.

The common patterns where the number of instructions is larger are made irrelevant by fusion. RISC-V has been thoroughly designed with fusion in mind, and is unique in this regard. It is within its rights in calling itself the 5th-generation RISC ISA because of this, even if everything else is ignored.

Fusion will turn most of these "2 instructions instead of one" into actually one instruction from the execution unit's perspective. There are opportunities everywhere for fusion; the patterns are designed in. The cost of fusion on RISC-V is also very low, often quoted as 400 gates, allowing even simpler microarchitectures to implement it.


I was surprised to find the top gOggle hits for "RISC-V fusion" (because I don't know WTF it even is) point to HN threads. Is this not discussed prominently elsewhere on the 'net?

https://news.ycombinator.com/item?id=25554865

https://news.ycombinator.com/item?id=25554779

Is the Googrilla search engine really starting to suck more and more, or is there something else going on in this case?

The threads read more like an incomplete explanation with a polarized view than anything useful for understanding what fusion means in this context.

Overall I give the ranking a score of D-.


> ... is there something else going on in this case?

There is: the term comes from general CPU design terminology and is not specific to RISC-V, although Google does find some RISC-V-specific materials for me given your query[1–3]. Look for micro- and macro-op fusion in Agner Fog’s manuals[4] or Wikichip[5], for example.

[1]: https://riscv.org/wp-content/uploads/2016/07/Tue1130celio-fu...

[2]: https://reviews.llvm.org/D73643

[3]: https://erik-engheim.medium.com/the-genius-of-risc-v-micropr...

[4]: https://www.agner.org/optimize/#manuals

[5]: https://en.wikichip.org/wiki/macro-operation_fusion


Try "RISC-V instruction fusion" or "RISC-V macro op fusion". Not that hard, really. It is a well-known subject.

I hope people could stop whining every time they mess up a search query.


>I hope people could stop whining every time they mess up a search query.

I don't know. I have never seen anyone use the term "fusion" by itself; maybe it is specific to the RISC-V crowd? It is always "macro-op fusion". So your parent's search query isn't out of order for someone who knows very little about hardware. And HN is full of web developers, so abstracted up the stack that they know practically zero about hardware.

And to be frank, the GP's point about fusion had me confused for a sec as well.


> This is in stark contrast with ARMv7's thumb, which was a literal separate CPU mode.

This is disingenuous. arm32's Thumb-2 (which has been around since 2003) supports both 16-bit and 32-bit instructions in a single mode, making it directly comparable to RV32C.


Your statement does not run counter to mine quoted.

Thumb-2 is better designed than Thumb was, but it is still a separate CPU mode.

And it got far less use than it deserved, because of this. It doesn't do everything, and switching has a significant cost. This cost is in contrast with RISC-V's C extension.


Comparing RISC-V's "C" extension to classic Thumb when Thumb-2 is 17 years old is like comparing RISC-V's "V" extension to classic SSE when AVX-512 and SVE2 are already available. It's an insidious form of straw-man attack that preys on the reader's ignorance.

> [Thumb-2] doesn't do everything, and switching has a significant cost.

Technically true, but irrelevant. Cortex-M is thumb-only and can't switch. Cortex-A processors that support both Thumb and ARM instructions almost never actually switch at all.


> Cortex-A processors that support both Thumb and ARM instructions almost never actually switch at all.

That is not correct. At least before ARMv8, most processors that could run both Thumb and ARM switched very frequently, up to the point that some libraries could be Thumb while others were ARM (i.e. within the same task!). A lot (but not all) of Android for ARMv7 is actually Thumb(-2). This is why "interworking" is such a hot topic.

Also, contrary to what the above poster says, switching does not have a "high cost", it is rather similar to the cost of a function call.


> it is rather similar to the cost of a function call.

It literally is a function call, most of the time.

And yeah, thumb-2 was the preferred encoding for 32b iOS and Android, and the only supported encoding for Windows phone, so it was used on billions of devices.


The ARMv8-M profile is Thumb-only, so on ARM microcontroller platforms there is no switching at all, and it does do everything, or at least everything you might want to do on a microcontroller, and has of course gotten a very large amount of use, considering how widely deployed those cores are.


Is thumb-only particularly good for density, compared to being able to mix instruction sizes?


Thumb has both 16-bit and 32-bit instructions.


Oh, you meant thumb and thumb-2.


"thumb-2" isn't really a thing. It's just an informal name from when more instructions were being added to thumb. it's still just thumb.


Thumb2 is a thing. Thumb is purely 16 bit instructions. Thumb2 is a mix of 16 bit and 32 bit instructions.

As an illustrative example, in Thumb when the programmer writes "BL <offset>" or "BX <offset>" the assembler creates two undocumented 16 bit instructions next to each other which together have the desired effect. If you create those instructions yourself using e.g. .half directives (or if you're writing a JIT or compiler) then you can actually put other instructions between them, as long as you don't disturb the link register.

In Thumb2 the bit patterns for BL and BX are the same, but they are an ACTUAL 32 bit instruction which can't be split up like it can in Thumb.


The main distinction is that the 16-bit RISC-V C ISA maps exactly to existing 32-bit RISC-V instructions; its implementation only occurs in the decode pipe stage.


The C extension is that, an extension. A RISC-V core with the C extension should still support the long encoding as well. There is no 16-bit variant specified, only 32, 64 and 128.

There is an E version of the ISA with a reduced register set, but this is a separate thing.


You are mixing up integer register size and instruction length.

RISC-V has variants with 32 bit, 64 bit, or (not yet fully specified or implemented) 128 bit registers.

RISC-V has instructions of length 32 bits and, optionally but almost universally, 16 bit length.


Ah yes, I misunderstood the original comment as implying that RISC-V C had 16 bit register length, rather than opcode length.


The E version only has half as many registers


So there really are more instructions in RAM, right?

And then they get combined in the CPU, right?

Won't those instructions need to be fetched / occupy cache?


> the so-called better results were for the compressed extension, not for the normal encoding.

Ignoring RISC-V’s compressed encoding seems a rather artificial restriction.


Indeed.

The "C" extension is technically optional, but I'm not aware of anyone who has made or sold a production chip without it -- generally only student projects or tiny cores for FPGAs running very simple programs don't have it.

My estimate is if you have even 200 to 300 instructions in your code it's cheaper to implement "C" than to build the extra SRAM/cache to hold the bigger code without it.


The compressed encoding has good code density, but low speed.

The compressed RISC-V encoding must be compared with the ARMv8-M encoding not with the ARMv8-A.

The base 32-bit RISC-V encoding may be compared with the ARMv8-A, because only it can have comparable performance.

All the comparisons where RISC-V has better code density compare the compressed encoding with the 32-bit ARMv8-A. This is a classical example of apples-to-oranges, because the compressed encoding will never have performance in the same league as ARMv8-A.

When the comparisons are matched, 16-bit RISC-V encoding with 16-bit ARMv8-M and 32-bit RISC-V with 32-bit ARMv8-A, RISC-V always loses in code density in both comparisons, because only the RISC-V branch instructions are frequently shorter than those of ARM, while all the other instructions are frequently longer.

There are good reasons to use RISC-V for various purposes, where either the lack of royalties or the easy customization of the instruction set is important, but claiming that it should be chosen not because it is cheaper, but because it is supposedly better, sounds like a case of sour grapes.

The value of RISC-V is not in its instruction set, because there are thousands of people who could design better ISAs in a week of work.

What is valuable about RISC-V is the set of software tools, compilers, binutils, debuggers etc. While a better ISA can be done in a week, recreating the complete software environment would need years of work.


> The compressed encoding has good code density, but low speed.

That's 100% nonsense. They have the same performance and in fact, some pipelines can get better performance because they fetch a fixed number of bytes and with compressed instructions, that means more instructions fetched.

The rest of the argument falls apart resting on this fallacy.


They have the same performance only in low performance CPUs intended for embedded applications.

If you want to use a RISC-V at a performance level good enough for being used in something like a mobile phone or a personal computer, you need to simultaneously decode at least 8 instructions per clock cycle and preferably much more, because to match 8 instructions of other CPUs you need at least 10 to 12 RISC-V instructions and sometimes much more.

Nobody has succeeded in simultaneously decoding a significant number of compressed RISC-V instructions, and it is unlikely that anyone would attempt this, because the cost in area and power of a decoder able to do this is much larger than the cost of a decoder for simultaneous decoding of fixed-length instructions.

This is also the reason why ARM uses a compressed encoding in their -M CPUs for embedded applications but a 32-bit fixed-length encoding in their -A CPUs, for applications where more than 1 watt per core is available and high performance is needed.


You're just making stuff up.

ARM doesn't have any cores that do 8 wide decode. Neither do Intel or AMD. Apple has, but Apple is not ARM and doesn't share their designs with ARM or ARM customers.

The Cortex-X1 and X2 have 5-wide decode. The Cortex-A78 and Neoverse N1 have 4-wide decode.

ARM uses compressed encoding in their 32 bit A-series CPUs, for example the Cortex A7, A15 and so on. The A15 is pretty fast, running at up to 2.5 GHz. It was used in phones such as the Galaxy S4 and Note 3 back before 64 bit became a selling point.

Several organisations are making wide RISC-V implementations. Most of them aren't disclosing what they are doing, but one has actually published details of how its 4-8 wide RISC-V decoder works -- they decode 16 bytes of code at a time, which is 4 instructions if they are all 32-bit instructions, 8 instructions if they are all 16-bit instructions, somewhere in between for a mix.

https://github.com/MoonbaseOtago/vroom

Everything is there, in the open, including the GPL licensed SystemVerilog source code. It's not complex. The decode scheme is modular and extensible to as wide as you want, with no increase in complexity, just slightly longer latency.

There are practical limits to how wide is useful not because you can't build it, but because most code has a branch every 5 or 6 instructions on average. You can build a 20-wide machine if you want -- it just won't be any faster because it doesn't fit most of the code you'll be executing.


Doesn't decompression imply that there is some extra latency?


No, it is just part of the regular instruction decoding. It is not like it is zip compressed. It is just 400 logic gates added to the decoder… which is nothing.


All the implementations I know of do the same thing: they expand the compressed instruction into a non-compressed instruction. For all (most?), this requires an additional stage in the decoder. So in that sense, supporting C means a slight increase in the branch-mispredict penalty, but the instruction itself takes the same path with the same latency regardless of whether it is compressed.

Completely aside, compressed instructions hurt in a completely different way: as specified, RISC-V happily allows instructions to be split across two cache lines, which could even be from two different pages. THIS is a royal pain in the ass and rules out certain implementation tricks. Also, the variable-length instructions mean more stages before you can act on the stream, including, for example, renaming. However, a key point here is that it isn't a per-instruction penalty; it's a penalty paid for all instructions if the pipeline supports any variable-length instructions.


Yes, but the logic signal needs to ripple through those gates, which takes time.


It potentially adds latency, but doesn't drop throughput.


> The RISC-V ISA has only 1 good feature for code size, the combined compare-and-branch instructions. Because there typically is 1 branch for every 6 to 8 instructions, using 1 instruction instead of 2 saves a lot.

Which isn't really a big advantage, because ARM and x86 macro-op fuse those instructions together. (That is, those 2-instructions are decoded and executed as 1x macro-op in practice).

cmp /jnz on x86 is like, 4-bytes as well. So 4-bytes on x86 vs 4-bytes on RISC-V. 1-macro-op on x86 vs 1-instruction on RISC-V.

So they're equal in practice.

-----

ARM is 8-bytes, but macro-op decoded. So 1-macro op on ARM but 8-bytes used up.


The fusion influences only the speed, not the code size and the discussion was about the code size.

For x86, cmp/jnz must be 5 bytes for short loops or 9 bytes for long loops, because the REX prefix is normally needed. x86 does not have address modes with auto-update, like ARM or POWER, so for a minimum number of instructions the loop counter must also be used as an index register, to eliminate the instructions for updating the indices.

Because of that, the loop counter must use the full 64-bit register even if it is certain that the loop count would fit in 32-bit. That needs the REX prefix, so the fused instruction pair needs either 5 bytes (for 7-bit branch offsets) or 9 bytes, in comparison with 4 bytes for RISC-V.

So RISC-V gains about 1 byte for every 20 bytes on the branch instructions, i.e. about 5%, but then it loses more than this on other instructions, so it ends up with a code size larger than Intel/AMD by between 10% and 50%.


By combining instruction compression and macro-op fusion you get the net effect of looking like you have a bunch of extra higher level opcodes in your ISA.

Compress a shift and load into a 32-bit word and macro-op fuse those and you have in effect an index based load instruction, without sucking up ISA encoding space for it.


I agree that this is about the code size, but you seem to be doing your back-of-the-envelope estimates based on RISC-V uncompressed instructions, which is a mistake and explains why your estimates came out with nonsense results like "code size larger than Intel/AMD".


> For x86, cmp/jnz must be 5 bytes for short loops or 9 bytes for long loops, because the REX prefix is normally needed.

It's a shame the x32 architecture didn't catch on. https://en.wikipedia.org/wiki/X32_ABI


ARM64 has cbz/tbz compare-and-branch instructions that cover many common cases in a single 4-byte instruction as well.


There are cases when cbz/tbz are very useful, but for loops they do not help at all.

All the ARMv8 loops need 2 instructions, i.e. 8 bytes, instead of the single compare-and-branch of RISC-V.

There are 2 ways to do simple loops in ARM, you can either use an addition that stores the flags, then a conditional branch, or you can use an addition that does not store the flags, then a CBNZ (which tests whether the loop counter is null). Both ways need a pair of instructions.

Nevertheless, ARM has an unused opcode space equal in size to the space used by CBNZ/CBZ/TBNZ/TBZ (bits 29 to 31 equal to 3 or 7 instead of 1 or 5).

In that unused opcode space, 4 pairs of compare-and-branch instructions could be encoded (3 pairs corresponding to those of RISC-V plus 1 pair of test-under-mask, corresponding to the TEST instruction of x86; each pair being for a condition and its negation).

All 4 pairs of compare-and-branch would have 14-bit offsets, like TBZ/TBNZ, i.e. a range larger than that of the RISC-V branches.

This addition to the ARM ISA would decrease the code size by 4 bytes for each 25 to 30 bytes, so a 10% to 15% improvement.


> I am sorry but saying that RISC-V is a winner in code density is beyond ridiculous.

You have no idea what you're talking about. I've worked on designs with both ARM and RISC-V cores. The RISC-V core outperforms the ARM core, with a smaller gate count, and has similar or higher code density in real-world code, depending on the extensions supported. The only way you get much lower code density is without the C extension, but I haven't seen it left out of a real-world commercial core, and where it was, I'm sure it was because of a benefit (FPGAs sometimes use ultra-simple cores for some tasks, and don't always care about instruction throughput or density).

It should be said that my experience is in embedded, so yes, it's unsafe code. But the embedded use-case is also the most mature. I wouldn't be surprised if extensions that help with safer programming languages were added for desktop/server class CPUs, if they haven't been already (I haven't followed the development of the spec that closely recently).


>> RISC-V has an acceptable code size only for unsafe code

> You have no idea what you're talking about.

> It should be said that my experience is in embedded, so yes, it's unsafe code.

Just going based off your reply it certainly sounds like they had at least some idea what they were talking about? In which case omitting that sentence would probably help.


Textbook example of the kind of hostility and close-mindedness that is creeping into our beloved site. Why are we dick measuring? why are we comparing experience like this? so much "I" "I" "I"...

I have no horse in the technical race here, but I certainly am put off from reading what should be an intellectually stimulating discussion by the nature of replies like this.


It was likely instigated by its parent trying to inflate themselves by citing credentials, to give their voice more weight.

All of it is pretty sad, but I believe we should focus on the technical arguments and try to put everything else aside, in order to steer the discussion somewhere more useful.


Somehow I find it refreshing to see flamewars about ISAs, for I hadn't seen any in the last 15 years (at least). It makes me feel young again. :-)


Oh no. We don’t maintain this site with these types of comments no matter your feelings. It’s the internet. Don’t get heated!


> Be kind. Don't be snarky. Have curious conversation; don't cross-examine. Please don't fulminate. Please don't sneer, including at the rest of the community.

> Comments should get more thoughtful and substantive, not less, as a topic gets more divisive.

> When disagreeing, please reply to the argument instead of calling names. "That is idiotic; 1 + 1 is 2, not 3" can be shortened to "1 + 1 is 2, not 3."

https://news.ycombinator.com/newsguidelines.html


Memory-safe programming does not need any special ISA extension compared to traditional, C-like unsafe code. Even the practical overhead of bounds- and overflow checking is all about how it impedes the applicability of optimizations, not about the checks themselves.


It doesn't need them, but having hardware checks is generally more performant than having software ones.


RISC-V doesn't hinder safe code; that was an incorrect claim. Bounds checks are done with one instruction - bltu for slices, bgeu for indexes. On an Intel processor you need a cmp+jb pair for this.

The linked message is about the carry-propagation pattern used in GMP. As I understand it, optimized bignum algorithms accumulate carry bits and propagate them in bulk, and don't benefit from an optimal one-bit-at-a-time carry-propagation pattern.


> Except for this good feature, the rest of the ISA is full of bad features

What are your thoughts on the way RISC V handled the compressed instructions subset?


It's not too surprising. Load, store, move, add, subtract, shift, branch, jump. These are definitely the most common instructions used.

Put it side-by-side with Thumb and it also looks pretty similar (thumb has a multiply instruction IIRC).

Put it side-by-side with short x86 instructions accounting for the outdated ones and the list is pretty similar (down to having 8 registers).

All in all, when old and new instruction sets are taking the same approach, you can be reasonably sure it's not the absolute worst choice.


It was more a question of the way it was handled (i.e. it's not a different mode and can be mixed) than what the opcode list looked like.


Mode switching bloats the instruction count by shifting in and out. RISC-V does well here.

If there's a criticism, it's that the two bytes on 32-bit instructions mean the total instruction range is MUCH smaller overall until you switch to 48-bit instructions which are then much bigger.


It only addresses a subset of the available registers. Small revisions in a function which change the number of live variables will suddenly and dramatically change the compressibility of the instructions.

Higher-level languages rely heavily on inlining to reduce their abstraction penalty. Profiles which were taken from the Linux kernel and (checks notes...) Dhrystone are not representative of code from higher-level languages.

3/4 of the available prefix instruction space was consumed by the 16-bit extension. There have been a couple of proposals showing that even better density could be achieved using only 1/2 the space instead of 3/4, but they were struck down in order to maintain backwards compatibility.


This is just rubbish.

Small revisions to a function that increase the number of live variables to more than the set that is covered by the C extension mean that references to THAT VARIABLE ONLY have to use a full-size instruction. There is nothing sudden or dramatic.

Note that a number of C instructions can in fact use all 32 registers. This includes stack-pointer-relative loads and stores, load immediate ({-32..+31}), load upper immediate (4096 * {-32..+31}), add immediate and add immediate word ({-32..+31}), shift left logical immediate, register-to-register add, and register move.

It's certainly possible that another compressed encoding might do better using fewer opcodes, and I've seen the suggestions. The main thing wrong with the standard one in my opinion is that it gives too much prominence to floating-point code, having been developed to optimise for SPEC, including SPECFP (no, not the Linux kernel or Dhrystone ... I have no idea where you got that from).

But anyway it does well, and the opcode space used is not excessive. If anything it's TOO SMALL. Thumb2 gets marginally better code size while using 7/8ths of the opcode space for the 16 bit instructions instead of RISC-V's 3/4.


> The main thing wrong with the standard one in my opinion is that it gives too much prominence to floating point code, having been developed to optimise for SPEC including SPECFP (no, not the Linux kernel or Dhrystone ... I have no idea where you got that from).

The RISC-V Compressed Spec v1.9 documented the benchmarks which were used for the optimization. RV32 was optimized with Dhrystone, Coremark, and SPEC-2006. RV64GC was optimized using SPEC-2006 and the Linux kernel.


> It's certainly possible that another compressed encoding might do better using fewer opcode, and I've seen the suggestions. The main thing wrong with the standard one in my opinion is that it gives too much prominence to floating point code, having been developed to optimise for SPEC including SPECFP (no, not the Linux kernel or Dhrystone ... I have no idea where you got that from).

Javascript is limited to 64-bit floats as is lua and a couple other languages.

Sure, you can optimize to 31/32-bit ints, but not always and not before the JIT warms up.


The compressed instruction encoding is very good and it is mandatory for any use of RISC-V in embedded computers.

With this extension, RISC-V can be competitive with ARM Cortex-M.

On the other hand, the compressed instruction encoding is useless for general-purpose computers intended as personal computers or as servers, because it limits the achievable performance to much lower levels than for ARMv8-A or Intel/AMD.


This is wrong. RISC-V's instruction length encoding is designed in such a way that compressed instructions can be decoded very quickly. It doesn't pose the same kind of performance problem amd64's variable instruction length encoding does. Even if it did, it would be obvious nonsense to claim that this made it impossible for a RISC-V implementation to achieve amd64-like performance; if it were true, it would also make it impossible for amd64 implementations to achieve amd64-like performance, which is clearly a contradiction. And ILP and instruction decode speed are also improved by other aspects of the RISC-V ISA: the fused test-and-branch instructions you mentioned, but also the MIPS-like absence of status flags.


This is of course utter nonsense. There's nothing different about the performance of compressed instructions.


For competitive performance in 2021 with CPUs that can be used at performance levels at least as high as those required for mobile phones, it is necessary to decode simultaneously at least 8 instructions per clock cycle (actually more for RISC-V, because its instructions do less than those of other CPUs).

The cost in area and power of a decoder for variable-length instructions increases faster with the number of simultaneously-decoded instructions than the cost of a decoder for fixed-length instructions.

This makes the compressed instruction encoding incompatible with high-performance RISC V CPUs.

For the lower performance required in microcontrollers, the compressed encoding is certainly needed for adequate code density.

The goals of minimum code size and of maximum execution speed are contradictory and the right compromise is different for an embedded computer and for a personal computer.

That is why ARM has different ISAs for the 2 domains and why also RISC-V designs must use different sets of extensions, depending on the application intended for them.


RISC-V compressed instructions cannot be compared to CISC variable-length instructions. The instruction boundaries are easy to determine in parallel for multiple decoders, something which is hard for e.g. x86. Compressed instructions don't have arbitrary length: they are 16-bit encodings, two of which fit in a 32-bit word.

Decompression is part of the instruction decoding itself. It only requires a minuscule 400 logical gates to do.

In fact RISC-V is very well designed for doing out-of-order execution of multiple instructions as instructions have been specifically designed to share as little state as possible. No status registers or conditional execution bits. Thus most instructions can run in separate pipelines without influencing each other.


8-wide is the absolute state of the art. Last I checked, AMD’s fastest core was 4-wide, and Intel’s was 6-wide. I only know of Apple doing 8-wide, and not anyone else. So branding this as the minimum necessary for mobile devices when even most desktops do not achieve it is silly.


> I don't think they even tried to read the ISA spec documents. If they did, they would have found that the rationale for most of these decisions is solid: Evidence was considered, all the factors were weighted, and decisions were made accordingly.

It's perfectly possible to have read the spec and disagree with the rationale provided. RISC-V is in fact the outlier among ISAs in many of these design decisions, so there's a heavy burden of proof to demonstrate that making the contrary decisions in many cases was the right call.

> Which is easy to verify as utter nonsense. There's not even a need to look at the research, which shows RISC-V as the clear winner in code density. It is enough to grab any Linux distribution that supports RISC-V and look at the size of the binaries across architectures.

This doesn't seem to be true when you actually do an apples-to-apples comparison.

Taking as an example the build of Bash in Debian Sid (https://packages.debian.org/sid/shells/bash). I chose this because I'm pretty confident there's no functional or build-dependency difference that will be relevant here. Other examples like the Linux kernel are harder to compare because the code in question is different across architectures. I saw the same trend in the GCC package, so it's not an isolated example.

    riscv64 installed size: 6,157.0 kB
    amd64 installed size:   6,450.0 kB
    arm64 installed size:   6,497.0 kB
    armhf installed size:   6,041.0 kB

RV64 is outperforming the other 64-bit architectures, but under-performing 32-bit ARM. This is consistent with expectations: amd64 has a size penalty due to REX bytes, arm64 got rid of compressed instructions to enable higher performance, and armhf (32-bit) has smaller constants embedded in the binary.

Compressed instructions definitely do work for making code smaller, and that's part of why arm32 has been very successful in the embedded space, and why that space hasn't been rushing to adopt arm64. For arm32, however, compressed instructions proved to be a limiting factor on high performance implementation, and arm64 moved away from them because of it. Maybe that's due to some particular limitations of arm32's compressed instructions that RISC-V compressed instructions won't suffer from, but that remains to be proven.


The size of the files can be very misleading, because a large part of the files can be filled with various tables with additional information, with strings, with debugging information, with empty spaces left for alignment to page boundaries and so on. So the size of the installed files is not necessarily correlated with the code size.

To compare the code sizes, you need tools like "size", "readelf" etc. and the data given by the tools should still be studied, to see how much of the code sections really contain code.

I have never yet seen a program where the RISC-V variant is smaller than the ARMv8 or Intel/AMD variant, and I doubt very much that such a program can exist. Except for branches, where RISC-V frequently needs only 4 bytes instead of 5 bytes for Intel/AMD or 8 bytes for ARMv8, it is very common for RISC-V to need 8 bytes where ARMv8 needs 4.

Moreover, choosing compiler options like -fsanitize for RISC-V increases the number of instructions dramatically, because there is no hardware support for things like overflow detection.


So (i) the research on RISC-V that shows it has dense code is bunk, and (ii) the fact that it compiles to a smaller binary is irrelevant, and (iii) it sounds like you're saying in advance that it might also have smaller code section sizes within the binary but that's irrelevant too.

And yet you're quite confident that RISC-V has poor code density. So you clearly have a source of knowledge that others don't. If it's a blog/article/research, could you share a link? If it's personal experimentation, you should write a blog post, I would totally read that.


Re-read what was written. He is saying exactly that the RISC-V code size is larger, but to see it you need the right tool used the right way, to actually look at the code, not debug info, constant sections, etc.


I would hope there's something more to his reasoning.

There are tables showing values for just the code, with RISC-V beating aarch64 and x86-64 by an ample margin, in this very discussion.


Once again, you’re comparing the compressed instructions without telling anyone that that’s what you’re doing, because you are convinced that the compressed instructions will never be performant on a mobile or desktop core. The foundation disagrees, and every manufacturer competing in mobile-class RISC-V chips disagrees. Even if you think that they are wrong, that is the plan they are going forward with. Real world RISC-V chips targeted at the mobile space support the compressed instruction. Even if you think that they are wrong to do so, the support is there. So what you are doing is refusing to compile for the instruction set the chips actually support, instead targeting what you think they should be doing instead, and then declaring that the code density is worse. This is nonsense.


> RISC-V is in fact the outlier among ISAs in many of these design decisions, so there's a heavy burden of proof to demonstrate that making the contrary decisions in many cases was the right call.

Genuinely asking, why? Do we think RISC-V should, or even could, try to compete against the AMD/Intel/ARM behemoths on their playing field? Obviously ISAs are a low level detail and far removed from the end product, but it feels like the architectural decisions we are "stuck with" today are inextricably intertwined with their contemporary market conditions and historical happenstance. It feels like all the experimental architectures that lost to x86/ARM (including Intel's own) were simply too much too soon, before ubiquitous internet and the open source culture could establish itself. We've now got companies using genetic algorithms to optimize ICs and people making their own semiconductors in the 100s of microns range in their garages - maybe it's time to rethink some things!

(EE in a past life but little experience designing ICs so I feel like I'm talking out of my rear end)


> Genuinely asking, why? Do we think RISC-V should, or even could, try to compete against the AMD/Intel/ARM behemoths on their playing field?

Well, it's exactly what many RISC-V folks are trying to do. There's news about a new high performance RISC-V core on the HN front page right now!

> but it feels like the architectural decisions we are "stuck with" today are inextricably intertwined with their contemporary market conditions and historical happenstance. It feels like all the experimental architectures that lost to x86/ARM (including Intel's own) were simply too much too soon,

I just want to note that ARM64 was a mostly clean break from prior versions of ARM. Basically a clean slate design started in the late 2000s. It's a modern design built with the same hindsight and approximate market conditions available to the designers of RISC-V.


I agree that arm64 and RISC-V designers had basically the same hindsight and market conditions to refer to -- and they made a lot of very similar decisions.

But arm64 does seem constrained by compatibility with arm32 -- at least in that they until now (ten years later) usually have to share an execution pipeline and register set.

Is it really conceivable that the arm64 designers had free rein to make the choice whether to use condition codes or not on a purely technical basis? I don't think so. Even if they thought -- as all other designers of ISAs intended for high performance since 1990 have (Alpha, Itanium, RISC-V) -- that it's better not to use condition codes, I don't think they would have been free to make that choice.

The same goes for whether to expose instructions using the "free" shift on the 2nd ALU input. It's not really free -- it's paid for with a longer clock cycle or an extra pipeline stage or splitting instructions into uops. And since it was there for 32 bit they might as well use it in 64 bit as well. And the same for the complex addressing modes.


Unfortunately, it seems that, at least for gmp, the shared objects balloon in comparison to all other architectures. It is about three times bigger (6000 instead of 2000kB): https://packages.debian.org/sid/libgmp-dev. I am hopeful that this may improve with extensions, though I know little about the details.


       text    data     bss     dec     hex filename
     311218    2284      36  313538   4c8c2 arm-linux-gnueabihf/libgmp.so.10
     374878    4328      56  379262   5c97e riscv64-linux-gnu/libgmp.so.10
     480289    4624      56  484969   76669 aarch64-linux-gnu/libgmp.so.10
     511604    4720      72  516396   7e12c x86_64-linux-gnu/libgmp.so.10
Strange, that's not what I see.


Probably because, like most applications, that one does not have a lot of wide multiplications. It is hard not to turn this point into an insult at the OP.


Perhaps Thumb2 makes an 8-wide decode much harder. Plus, then you can't have 32 registers instead of 16.


A company creating embedded RISC-V CPUs also has some extra instruction set extensions that conflict with the floating-point instructions, though.


Yeah, I'm not sure he takes into consideration compressed instructions, which can be used anywhere, rather than being a separate mode like Thumb on ARM.

Fusing instructions isn't just theoretical either. I'm pretty sure it is or will be a common optimisation for CPUs aiming for high performance. How exactly are two easily-fused 16-bit instructions worse than one 32-bit one? Is there really a practical difference other than the name of the instruction(s)?

At the same time, the reduced transistor count you get from a simpler instruction set is not a benefit to be just dismissed either. I'm starting to see RISC-V cores being put all over the place in complex microcontrollers, because they're so damn cheap, yet have very decent performance. I know a guy developing a RISC-V core. He was involved with the proposal for a couple of instructions that would put the code density above Thumb for most code, and the performance of his core was better than Cortex-M0 at a similar or smaller gate count. I'm not sure if the instructions were added to the standard or not, though.

Even for high performance CPUs, there's a case to be made for requiring fewer transistors for the base implementation. It makes it easier to make low-power low-leakage cores for the heterogeneous architecture (big.little, M1, etc.) which is becoming so popular.


>> But ultimately, the gist of their argument is this...

Funny, I thought the whole thing was bitching that RISC V has no carry flag which obviously causes multi word arithmetic to take more instructions. The obvious workaround is to use half-words and use the upper half for carry. There may be better solutions, but at twice the number of instructions this "dumb" method is better than what the author did.
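The half-word workaround mentioned above can be sketched in portable C (a hedged illustration, not GMP's actual code; `add_halfword` is a name I made up): keeping 32-bit limbs in 64-bit arithmetic makes each carry appear in the upper half of the accumulator, so no flags register is needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch of the half-word workaround: 32-bit limbs held in
   64-bit arithmetic, so each addition's carry lands in the upper half
   instead of a flags register.  Returns the final carry-out. */
static uint32_t add_halfword(uint32_t *r, const uint32_t *a,
                             const uint32_t *b, size_t n) {
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += (uint64_t)a[i] + b[i];  /* at most 33 bits are ever set */
        r[i] = (uint32_t)acc;          /* low half is the result limb  */
        acc >>= 32;                    /* high half is the carry       */
    }
    return (uint32_t)acc;
}
```

The cost of this approach is exactly the "twice the number of instructions" mentioned: every 64-bit limb now takes two 32-bit additions.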

Flags were removed because they cause a lot of unwanted dependencies and contention in hardware designs and they aren't even part of any high level language.

I still think instead of compare-and-branch they should have made an "if" which would execute the following instruction only if true. But that's just an opinion. I also hate the immediate constants (12 bits?) inside the instruction. Nothing wrong with 16-, 32- or 64-bit immediate data after the opcode.

I hope RISC 6 will come along down the road (not soon) and fix a few things. But I like the lack of flags...


So if I'm understanding this particular tussle correctly, carry flags are problematic for optimization because they create implicit mutable shared global state, which isn't necessarily reflected in the machine code.

Risc-v basically says "lets make the implicit, explicit" and you have to essentially use registers to store the carry information when operating on bigints. Which for the current impl means chaining more instructions.

Is that correct?

That sounds like what the FP crowd is always talking about - eschewing shared state so it's easier to reason about, optimize, parallelize, etc.
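That explicit-carry chaining can be sketched in C (a hedged sketch, not GMP's real inner loop; `add_with_carry` is a hypothetical name). The unsigned comparisons mirror RISC-V's `sltu` instruction: a carry occurred exactly when the wrapped sum is smaller than one of its addends.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical n-limb addition the way RISC-V forces it: the carry lives
   in an ordinary register and is recomputed with unsigned compares
   (what `sltu` does) rather than read from a flags register. */
static uint64_t add_with_carry(uint64_t *r, const uint64_t *a,
                               const uint64_t *b, size_t n) {
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t s = a[i] + carry;
        uint64_t c1 = s < carry;   /* sltu: carry out of a[i] + carry   */
        s += b[i];
        uint64_t c2 = s < b[i];    /* sltu: carry out of the second add */
        r[i] = s;
        carry = c1 | c2;
    }
    return carry;
}
```

This is why the per-limb instruction count is higher than on ISAs with an add-with-carry instruction: the carry chain is spelled out as explicit data flow through registers.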


There are more options than just a global carry flag, like an (n+1)th bit on each arithmetic register. Actually, that is how I assumed they would implement it.


> I don't think they even tried to read the ISA spec documents. If they did, they would have found that the rationale for most of these decisions is solid: Evidence was considered, all the factors were weighted, and decisions were made accordingly.

Nevertheless, the ISA speaks for itself. The goal of a technical project is to produce a technical artifact, not to generate good feelings having followed a "solid" process.

If the process you followed brought you to this, of what use was the process?


> It is enough to grab any Linux distribution that supports RISC-V and look at the size of the binaries across architectures.

Also, the godbolt.org compiler explorer has Risc-V support: useful for someone interested in comparing specific snippets of code.


So how would you suggest re-writing their example in less than 6 instructions for RISC-V? X86/arm both have instructions that include the carry operation for long additions, and only require 2 instructions.


I don't even see the issue. RISC-V is supposed to be a RISC-type ISA. It's in the very name. That it takes more instructions when compared to a CISC-type ISA like x86 is completely normal.

https://en.wikipedia.org/wiki/Reduced_instruction_set_comput...


The argument for RISC instructions (in high performance architectures) is that the faster decode makes up for the increase in instruction count. The problem is that a faster decode has a practical ceiling on how much faster it's going to make your processor, and it's much lower than 3x. If your workload is bottlenecked on an inner loop that got 3x larger in instruction count, no 15% improvement in decode performance is going to save you.


The number of instructions matters much less if they can be fused into more complex operations before execution.

RISC-V was designed with hindsight on fusion, thus it has more opportunities for doing it, and doing it at a lower cost.

And, due to the very high code density RISC-V has, the decoder can do its job while not having to look at a huge window.


Ok, but where is the chip that can fuse these?


I don't know what the design goals of RISC-V were, but I would guess performance is not the key goal or at least not the only goal. It makes more sense that ease of implementation is a more important goal, if they want to make adoption easy. That's another argument for favoring RISC over CISC.


If that's the case, you can always stick a uop cache in after the decoder.


To everyone who's saying "But macro-fusion!" in response, see my comment here: https://news.ycombinator.com/item?id=29421107


Larger caches won't help much either; there's an old article I remember that compares the efficiency of various ARM, x86, and one MIPS CPU, and while x86 and ARM were neck-and-neck, the MIPS was dead last in all the comparisons despite having more cache than the others. RISC-V is very similar to MIPS.


Larger caches, as seen in Apple's M1 L1, are one of many tools to deal with bad code density.

RISC-V might, at first glance, look similar to MIPS, but it leads in code density among the 64 bit architectures.


> [RISC-V] leads in code density among the 64 bit architectures.

You keep baldly asserting this in virtually all of your very many replies here, with a vague appeal to your own authority, but you haven't shown anything. Given that the submission is precisely an example of bad code density, if you're really here in the service of intellectual curiosity then please show instead of just telling.


No argument from authority is needed.

Anyone is free to download the disk images for a large body of software such as the same versions of Ubuntu or Fedora, and compare the binary sizes -- using the "text" output from "size" command, not raw disk files as there are also things such as debugging info in there.

Here's an example, using (ironically) the GMP library itself.

https://news.ycombinator.com/item?id=29423324

Here we see riscv64 significantly smaller than the other 64-bit ISAs, aarch64 (by 28.1%) and x86_64 (by 36.5%), and beaten by a smaller margin by 32-bit ARM Thumb2 (17.0%).

Anyone can check the sizes of bash, perl, emacs ... whatever they want ... themselves, without relying on the word of anyone here.


That could be confounded by gcc not enabling -funroll-loops, alignment of branch targets, etc for some architectures.


Conceivably, different ISAs could have the code compiled with different size/performance tradeoffs. One assumes that the Ubuntu and Fedora etc. people would make sensible choices, consistent across ISAs. I don't know any reason why they would both make the same bad choices (where by "bad" I mean small code at the expense of speed).

Gcc certainly supports unrolling, alignment of labels / loops / functions on all targets I'm aware of.


I don't think there is anything preventing the processor to fuse those instructions into a single operation once they are decoded.


Instruction fusion is the magical rescue invoked by all those who believe that the RISC-V ISA is well designed.

Instruction fusion has no effect on code size, but only on execution speed.

For example RISC-V has combined compare-and-branch instructions, while the Intel/AMD ISA does not have such instructions, but all Intel & AMD CPUs fuse the compare and branch instruction pairs.

So there is no speed difference, but the separate compare and branch instructions of Intel/AMD remain longer at 5 bytes, instead of the 4 bytes of RISC-V.

Unfortunately for RISC-V, this is the only example favorable for it, because for a large number of ARM or Intel/AMD instructions RISC-V needs a pair of instructions or even more instructions.

Fusing instructions will not help RISC-V with the code density, but it is the only way available for RISC-V to match the speed of other CPUs.

Even if instruction fusion can enable an adequate speed, implementing such decoders is more expensive than implementing decoders for an ISA that does not need instruction fusion for the same performance.


>Unfortunately for RISC-V, this is the only example favorable for it, because for a large number of ARM or Intel/AMD instructions RISC-V needs a pair of instructions or even more instructions.

Yet, as many pointed out to you already, RISC-V has the highest code density of all contemporary 64bit architectures. And aarch64, which you seem to like, is beyond bad.

>but it is the only way available for RISC-V to match the speed of other CPUs.

Higher code density and the lack of flags help the decoder a great deal. This means it is far cheaper for RISC-V to keep execution units well fed. It also enables smaller caches and, in turn, higher clock speeds. It's great for performance.

This, if anything, makes RISC-V the better ISA.

>Even if instruction fusion can enable an adequate speed, implementing such decoders is more expensive than implementing decoders for an ISA that does not need instruction fusion for the same performance

Grasping at straws. RISC-V has been designed for fusion, from the get-go. The cost of doing fusion with it has been quoted to be as low as 400 gates. This is something you've been told elsewhere in the discussion, but that you chose to ignore, for reasons unknown.


I see that you are pretty active here in debunking anti-RISC-V attacks, thanks for that! There are a bunch of poor criticisms about RISC-V.

> This is something you've been told elsewhere in the discussion, but that you chose to ignore, for reasons unknown.

I would call it RISC-V bashing.

Everyone loves to hate RISC-V, probably because it's new and heavily hyped.

It is really common to see irrelevant and uninformed criticism about RISC-V. The article, which seems to be enjoyed by the HN audience, literally says: "I believe that an average computer science student could come up with a better instruction set that Risc V in a single term project". How can anyone say such a thing about a collaborative project of more than 10 years, fed by many scientific works and projects and many companies in the industry?

I do not mean that RISC-V is perfect; there are some points which are sources of debate (e.g. favoring a vector extension rather than classic SIMD is a source of interesting discussion). But I would appreciate reading better analysis and more interesting discussions on HN.


The more I think about CPU implementations the more I think that what RISC-V is doing isn't as bad as many people think. Everyone is going "more instructions = worse".

But the truth is that if you can build a CPU that fetches an infinite number of instructions per cycle, your biggest bottleneck isn't going to be the number of instructions; it's going to be unpredicted branches, jumps and function calls, because fetching the entire function plus everything after it doesn't help if you're going somewhere else. But the opposite is also true: adding more instructions than you need doesn't hurt as much as many people seem to think.

In practice the code density of RISC-V is not significantly worse than other architectures. So we don't even have to imagine an infinitely large fetcher, a finite fetcher that is bigger than what x86 CPUs have is good enough.


> But the truth is that if you can build a CPU that fetches an infinite number of instructions per cycle, your biggest bottleneck isn't going to be the number of instructions, it's going to be unpredicted branches

Note that within superscalar processors, a group of instructions decoded in the same cycle is called a decoding group.

Branching is a problem, but branch predictors do an excellent job (especially for function returns, which are very well predicted by the RAS [Return Address Stack]). But the biggest bottleneck to fetching a large instruction group is decoding.

Especially instruction size decoding. An ISA like RISC-V or ARM that drastically reduces the possible instruction sizes has a big advantage when decoding large instruction groups.

And dependencies between instructions within the decoding group is also a concern. For example, register renaming will quickly require several cycles (several stages) when the decoding group scales up. RISC-V also addresses this since the register indexes are easily decoded earlier and the number of registers used can also be quickly decoded.

And you're right, these are topics that are rarely addressed by RISC-V detractors.


> How can anyone say such a thing about a collaborative project of more than 10 years, fed by many scientific works and projects and many companies in the industry?

Well, the statement you quoted might be exaggerating things quite a bit, but you're also just handwaving. The base ISA isn't the result of 10 years of industry experts doing their best; it's an academic project, and a large proportion of it was carried out by students:

> Krste Asanović at the University of California, Berkeley, had a research requirement for an open-source computer system, and in 2010, he decided to develop and publish one in a "short, three-month project over the summer" with several of his graduate students. [..] At this stage, students provided initial software, simulations, and CPU designs

The ISA specification was released in 2011! In one year! Of course there's been revisions since then, the most substantial being 2.0 in 2014 (I think). But if you look at the changes and skip anything that isn't just renaming / reordering / clarifying things, it always fits on half a page. It's by and large the same ISA, with some nice finetuning.

And here's the thing, a lot of people who originally read the spec felt like it is what it looks like, a "textbook isa", very much the kind of thing a group of students might come up with (I wonder why?).. just taken to completion. And what I remember from the spec (I read it long ago) is that cost of implementation was almost always a primary concern (and that high performance implementations would have to work harder but shrug itsnobigdealright?): it smelled like a small academic ISA tuned specifically for cheap microcontrollers. Not an ISA designed by industry experts for high performance cores. But the hype party is trying to sell it as a solution for all your computing needs, and almost seems to claim that no tradeoffs have been made against performance. And this is on a submission about performance, which is of course a subject a lot of people find interesting...

So I think there's very much reason to be critical of and discuss the ISA. Some critique may come from wrong assumptions, but being critical is not just bashing, and calling (attempts at) technical criticism uninformed & irrelevant and handwaving it away with 10 years of hype isn't helping the discussion. Better contribute that better analysis you refer to. (Unfortunately it seems like mostly everyone is just arguing without posting technical analysis)


> Some critique may come from wrong assumptions, but being critical is not just bashing

Focusing criticism only on this architecture, producing criticism without any solid argument, putting aside all the positive aspects, criticizing when you clearly lack expertise, sending similar criticisms in several conversations when they have already been debunked on multiple occasions... This is really systematic in discussions about RISC-V. That's what I call bashing; I cannot call it legitimate, correct or constructive criticism.

> The ISA specification was released in 2011! In one year! [...] But if you look at the changes and skip anything that isn't just renaming / reordering / clarifying things, it always fits on half a page.

I encourage anyone to open the original document published in 2011 [1] and compare it with current RISC-V specification documents [2].

There is very little left from the 3-month student work; what mainly remains is the philosophy, which is probably highly influenced by the project supervisors. Moreover, it mainly consists of the basic RISC-V ISA, which is indeed designed to be simple and minimalist, whereas the current RISC-V full spec consists of a multitude of extensions.

At this stage your statement and the statement from the email is not just exaggerated, it is pure misinformation.

> Not an ISA designed by industry experts for high performance cores

Okay, the industry wasn't involved as much as today in the beginning. But RISC-V is really the product of experts in the field of high-performance architectures.

[1] https://www2.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-...

[2] https://riscv.org/technical/specifications/


> I encourage anyone to open the original document published in 2011 [1] and compare it with current RISC-V specification documents [2].

And for extra credit, look how much it changed with respect to the thing that's being discussed around this submission. Integer arithmetic and carry or overflows in base ISA. Carry flag, add? Sltu? Oh right, the base ISA hasn't changed so much after all.

> Moreover, it mainly consists of the basic RISC-V ISA which is indeed designed to be simple and minimalist, whereas the current RISC-V full spec. consists of a multitude of extensions.

Hand waving about 10 years of extensions and shuffling things around does little to address criticism concerning the base ISA. Hilariously so when these extensions fail to address the criticism at all.

Still no analysis either. I'm not going to call out misinformation in a post free of information.


> Even if instruction fusion can enable an adequate speed, implementing such decoders is more expensive than implementing decoders for an ISA that does not need instruction fusion for the same performance

I'm very skeptical that a RISC-V decoder would be much more complex than an x86 one, even with instruction fusion. For the simpler fusion pairs, decoding the fused instructions wouldn't be more complex than matching some of the crazy instruction encodings in x86.

For ARM I'm not so sure, but RISC-V does have very significant instruction decoding benefits over ARM too, so my guess would be that they'd be similar enough.


You ignore compressed instructions on RISC-V which 64-bit ARM does not have.

And if you compare 32-bit CPUs then RISC-V has twice as many registers reducing the number of instructions needed to read and write from memory.

RISC-V branching takes less space and so do vector instructions. There are many cases like that, which add up to RISC-V having the densest ISA in all studies when using compressed instructions.


> Even if instruction fusion can enable an adequate speed, implementing such decoders is more expensive than implementing decoders for an ISA that does not need instruction fusion for the same performance

On the other hand just splitting up x86 instructions is very expensive, and decoding in general takes a lot of work before you even start to do fancy tricks.


How does the instruction fusion work? It seems to be mentioned in the article and by a couple of other commenters.


The CPU executes the two (or more) dependent instructions "as if" they were one, e.g., in 1 cycle.

The CPU has a frontend, which has a decoder, which is the part that "reads" the program instructions. When it "sees" a certain pattern, like "instruction x to register r followed by instruction y consuming r", it can treat this "as if" it were a single instruction if the CPU has hardware for executing that single instruction (even if the ISA doesn't have a name for that instruction).

This allows the people that build the CPU to choose whether this is something they want to add hardware for. If they don't, this runs in e.g. 2 cycles, but if they do then it runs in 1. A server CPU might want to pay the cost of running it in 1 cycle, but a micro controller CPU might not.
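As a toy illustration of the pattern matching described above (my own simplified model, not any real core's decoder), here is how a front end might spot one commonly cited fusable pair, `lui rd, hi20` followed by `addi rd, rd, lo12`, which together materialize a 32-bit constant:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of two already-decoded instructions (fields simplified;
   a real decoder works on raw encodings). */
typedef struct {
    uint8_t opcode;
    uint8_t rd, rs1;
    int32_t imm;
} insn_t;

enum { OP_LUI = 1, OP_ADDI = 2 };

/* The pair is fusable when the addi reads and writes the same register
   the lui just wrote: the two can then execute as one
   "load 32-bit constant" operation. */
static bool fusable_lui_addi(insn_t a, insn_t b) {
    return a.opcode == OP_LUI && b.opcode == OP_ADDI &&
           b.rd == a.rd && b.rs1 == a.rd;
}
```

The check is just a few comparisons on fields the decoder has already extracted, which is why fusion detection is often described as cheap.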


Do RISC-V specs document which instruction combinations they recommend be fused? Sounds like the fused instructions are an implementation detail that must be well-documented for compiler writers to know to emit the magic instruction combinations.


For this specific case, yes, the RISC-V ISA document recommends instruction sequences for checking for overflow -- sequences that are both amenable to fusion and relatively high-performing on implementations that don't fuse.

Section 2.4, https://riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2...
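In C, the spec's recommended test for signed addition overflow amounts to the following (a sketch of the Section 2.4 pattern; the function name is mine): overflow occurred iff `(b < 0) != (sum < a)`, which maps onto `slti`/`slt` plus a branch on RISC-V.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the spec's flag-free overflow test for signed addition:
   the sign of b says which way the sum should move relative to a;
   overflow happened exactly when it moved the other way. */
static int add_overflows(int64_t a, int64_t b, int64_t *sum) {
    /* Wraparound add via unsigned arithmetic (the cast back is
       implementation-defined in ISO C but wraps on common compilers). */
    int64_t s = (int64_t)((uint64_t)a + (uint64_t)b);
    *sum = s;
    return (b < 0) != (s < a);
}
```

Note the test needs only the operands and the result, which is what makes it expressible without a flags register in the first place.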


It generally goes the other way around -- programmers and compilers settle on a few idiomatic ways to do something, and new cores are built to execute those quickly. Because RISC-V is RISC, it seems likely that those few ways would be less idiomatic and more 'the only real way to do x', which would aid in the applicability of the fusions.


Any != All. There is a difference between synthetic benchmarks and real world test cases.


(RISC-V fan here) This is a real-world use case. GMP is a library for handling huge integers, and adding two huge integers is one of the operations it performs, and the way to do that is to add-with-carry one long sequence of word-sized integers into another. It's not synthetic; it's extremely specialized, but real.


So this person found a pathological case for the RISC-V instruction set?


This is not a pathological case, it is normal operation.

A computer is supposed to compute, but the RISC-V ISA does not provide everything that is needed for all the kinds of computations that exist.

The 2 most annoying missing features are the lack of support for multi-word operations, which are needed to compute with numbers larger than 64 bits, but also the lack of support for detecting overflow in the operations with standard-size integers.

If you either want larger integers or safe computations with normal integers, the number of RISC-V instructions needed for implementation is very large compared to any other ISA.

While there are people who do a lot of computations with large numbers, even the other users need such operations every day. Large number computations are needed at the establishment of any Internet connection, for the key exchange. For software developers, many compilers, e.g. gcc (which uses precisely libgmp), do computations with large numbers during compilation, for various kind of optimizations related to the handling of constants in the code, e.g. for sub-expression extraction or for operation complexity lowering.

So libgmp or an equivalent large-number library might be used every time some project is compiled, and also every time you click on a new link in a browser.

So this case is not at all pathological, except in the vision of the RISC-V designers who omitted support for this case.

That was a good decision for an ISA intended only for teaching or for embedded computers, but it becomes a bad decision when someone wants to use RISC-V outside those domains, e.g. for general-purpose personal computers.


> A computer is supposed to compute, but the RISC-V ISA does not provide everything that is needed for all the kinds of computations that exist.

This is nonsense. You can still do everything you need. It's just that in some cases the code size is a bit bigger or smaller.

And with compressed instructions the difference is not nearly as big; if you add fusion, the difference is marginal.

So really it's not a pathological case, it's a "slightly worse" case, and even that is hard to prove in the real world given the other benefits RISC-V brings that compensate.

And we could find "slightly worse" cases in the opposite direction if we went looking for them.

If you gave 2 equally skilled teams 100M and told them to make the best possible personal computer chip, I would bet on the RISC-V team winning 90 times out of 100.


The assembly code in the email is trivial. You don't seem to understand that the carry bit dependency exists regardless of the architecture. So ultimately, just fetching more instructions is enough to achieve optimal performance. As others said, code density of RISC-V is very reasonable on average. It's not significantly worse than x86 across an entire binary.


I don't think you're supposed to. The compiler handles that stuff, ideally RISC-V is just another compilation target.


Did you misunderstand the issue entirely?

The context here is the implementation of one of the inner loops of a high-performance infinite-precision arithmetic library (GMP); in RISC-V the loop has 3x the instruction count it has in competing architectures.

“The compiler” is not relevant, this is by design stuff that the compiler is not supposed to touch because it’s unlikely to have the necessary understanding to get it as tight and efficient as possible.


An actual arbitrary-precision library would have a lot of loops, with loop control and loads and stores. Those aren't shown here. Those will dilute the effect of a few extra integer ALU instructions in RISC-V.

Also, a high-performance arbitrary-precision library would not fully propagate carries in every addition. Anywhere that a number of additions are being done in a row, e.g. summing an array or series, or parts of a multiplication, you would want to use carry-save format for the intermediate results and fully propagate the carries only at the final step.
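A minimal sketch of the deferred-carry idea, assuming 2-limb numbers and my own function names (real libraries are more elaborate):

```c
#include <stdint.h>

/* Deferred carry propagation: when summing many operands, accumulate
   the carries out of the low limb separately and fold them in once at
   the end, instead of rippling a carry on every addition. */
static void sum_deferred(uint64_t xs[][2], int n, uint64_t r[2])
{
    uint64_t lo = 0, hi = 0, carries = 0;
    for (int i = 0; i < n; i++) {
        uint64_t t = lo + xs[i][0];
        carries += t < lo;        /* count low-limb wraparounds */
        lo = t;
        hi += xs[i][1];
    }
    r[0] = lo;
    r[1] = hi + carries;          /* single final propagation step */
}
```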


>There's not even a need to look at the research, which shows RISC-V as the clear winner in code density. It is enough to grab any Linux distribution that supports RISC-V and look at the size of the binaries across architectures.

Wait, if we are talking about actual ISA instructions, why is it hard to believe that RISC-V would have more of them? The argument in favor of RISC is to simplify the frontend, because even for a complex ISA like x86 the instructions get converted to many micro-ops. In terms of actual ISA instructions, it seems quite reasonable that x86 would have fewer of them (at the cost of frontend complexity).


Code density is about requiring the least amount of bytes of code to get something done, on average.

Doing it using a small pool of instructions, too (as RISC-V does), is the cherry on top.


RISC-V is an opinionated architecture, and that is always going to get some people fired up. Any technology that aims for simplicity has to make hard choices and trade-offs. It isn't hard to complain about missing instructions when there are fewer than 100 of them. Meanwhile nobody will complain about ARM64 missing instructions, because it has about 1000 of them.

Therein lies the problem. Nobody ever goes out guns blazing complaining about too many instructions despite the fact that complexity has its own downsides.

RISC-V has been designed aggressively to have a minimal ISA, to leave plenty of room to grow and to require a minimal number of transistors for a minimal solution.

Should this be a showstopper down the road, there will be plenty of space to add an extension that fixes this problem. Meanwhile, embedded systems paying a premium for transistors will not have to pay for these extra instructions, as only 47 instructions have to be implemented in a minimal solution.


It reminds me of nutrition advice. In the 70s they said X was evil or bad. Then we discovered X doesn't matter.

I think in 10-20 years everyone will agree that all the "bad" RISC-V decisions don't matter. The same way x86 (CISC) was supposed to be bad because of legacy/backwards compatibility.


I remember Jim Keller saying that for every ISA just 7 or 8 instructions are important and they are used for like 99% of the code. Load, store and the like.


That doesn't mean we should eliminate everything else. Many operations that would require 3 or more simple instructions are easy enough to implement in hardware, so that they take no more time than one of these simple instructions. This can be a big win even if they are only used 1% of the time.


Fair point but Arm is not just Arm64 - if you want a simple low cost ISA then there is Cortex-M.


So this is one tiny corner of the ISA, not something that makes ALL instruction sequences longer. Essentially, RISC-V has no condition codes (they're a bit of an architectural nightmare for anyone building any more than the simplest CPUs: they make every instruction potentially have dependencies or anti-dependencies with every other).

It's a trade-off, and the one that's been made here makes ALL instructions a little faster at the expense of one particular case that isn't used much. That's how you do computer architecture: you look at the whole, not just one particular case.

RISC-V also specifies a 128-bit variant that is of course FASTER than these examples.


RISC-V designers optimized for C and found overflow flag isn't used much and got rid of it. It was the wrong choice: overflow flag is used a lot for JavaScript and any language with arbitrary precision integer (including GMP, the topic of OP).


Over just the time I've been aware of things, there's been a constant positive feedback loop of "checked overflow isn't used by software, so CPU designers make it less performant" followed by "Checked overflow is less performant so software uses it less."

I wish there was a way out.

Language features are also often implemented at least partly because they can be done efficiently on the premiere hardware for the language. Then new hardware can make such features hard to implement.

WASM implemented return values in a way that was different from register hardware, and it makes efficient codegen of Common Lisp more challenging. This was brought to the attention of the committee while WASM was still in flux, and they (perhaps rightfully) decided CL was insufficiently important to change things.

I'm sure that people brought up the overflow situation to the RISC-V designers, and it was similarly dismissed. It's just unfortunate that legacy software is such a big driver of CPU features as that's a race towards lowest-common-denominator hardware.


There's a blog page somewhere that's a rant for implementing saturating and other arithmetic modes. Would be a really good idea.

Main one is interrupt on overflow.


I agree. A lot of software only wants protection against overflows but does not depend on them for functionality. If something wants to read out the carry bit, it should be explicit and although it is unfortunate, indicating that requires a full instruction.


> WASM implemented return values in a way that was different from register hardware, and it makes efficient codegen of Common Lisp more challenging. This was brought to the attention of the committee while WASM was still in flux, and they (perhaps rightfully) decided CL was insufficiently important to change things.

Can you refresh my memory here? What exactly is different about Wasm return values than any other function-oriented language?


Common Lisp will often have auxiliary return values that are often not needed (e.g. the mathematical floor function returns the remainder as a second value). Unused extra values are silently discarded.

So you can do, for example (+ (floor x y) z) without worrying about the second value to floor.

A lisp compiler for a register machine will usually return the first N values in registers (for some small value of N), so the common case of using only the primary return value generates exactly the same code regardless of how many values are actually returned.

I don't remember the details, but however wasm happens to implement multiple return values, you can't just pretend that a function that returns 2 values only returns one.


Return values are pushed onto the Wasm operand stack in left-to-right order. Engines use registers for params/returns up to a point in optimized code and then spill to the stack. So at the call site you can just drop the return values you don't want and should end up with machine code that does exactly what you want.


wasm functions must have a fixed number of results[1]. Lisp functions may have variadic results. A wasm function call must explicitly pop every result off of the stack[2].

These rules add significant overhead to implementing:

  (foo (bar))
Which calls foo with just the first result of bar, and the number of results yielded by bar is possibly unknown.

In pseudo-assembly for a register machine, implementing this is roughly:

  CALL bar
  MOVE ResultReg1 -> ArgumentReg1
  CALL foo
And this works regardless of how many values the function bar happens to return[3].

Any implementation that is correct for any number of results of "bar" will be slow for the common case of bar yielding one or two results.

1: https://webassembly.github.io/spec/core/syntax/types.html#sy...

2: https://webassembly.github.io/spec/core/exec/instructions.ht...

3: Showing only the caller side of things, when the caller doesn't care about extra values hides some complexity of implementation of returning values because you also need to be able to detect at run-time how many values were actually returned. e.g. SBCL on x86 uses a condition flag to indicate if there are multiple values, because branching on a condition flag lets you handle the only-1 value case efficiently.


Ok, I see. Yes, there is some overhead if you want not just multiple returns, but variadic returns. Most virtual machines and programming languages make this fairly inefficient because of fixed arity of returns. AFAICT, targeting anything other than machine code is going to cost you in this case.


Common Lisp supports multiple value returns, which C doesn't. Go sort of does, though.


Wasm supports multiple value returns.


Also Rust applications are increasingly going to be built with integer overflow checking enabled, e.g. Android's Rust components are going to ship with integer overflow checking. And unlike say GMP, that poses a potential code density problem because we're not talking about inner loops that can be effectively cached, it's code bloat smeared across the entire binary.


> it's code bloat smeared across the entire binary.

That's probably not true in the usual case. Most archs are 64-bit nowadays. If you are working on something that isn't 64-bit you are doing embedded stuff, and different rules and coding standards apply (like using embedded assembler rather than pure C or Rust). In 64-bit environments only pointers are 64 bits by default; almost all integers remain 32-bit. Checking for a 32-bit overflow on a 64-bit RISC-V machine takes the same number of instructions as everywhere else. Also, in C integers are very common because they are used as iterators (i.e., stepping along things in for loops). But in Rust, iterators replace integers for this sort of thing. There still is an integer under the hood of course, and perhaps it will be bounds checked. But that is bounds checked, not overflow checked. 2^32 is far larger than most data structures in use. Which means that while there may be some code bloat, the scarcity of full 64-bit integers in your average Rust program means it's going to be pretty rare.
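For what it's worth, the 32-bit case really is cheap to check on a 64-bit machine with no flags at all; a sketch (hypothetical helper name, not any particular library's API):

```c
#include <stdint.h>

/* Do the add in 64 bits and check that the result still fits in 32:
   no carry/overflow flag needed. Returns 1 on success, 0 on overflow. */
static int add32_checked(int32_t a, int32_t b, int32_t *out)
{
    int64_t wide = (int64_t)a + (int64_t)b;
    if (wide != (int32_t)wide)
        return 0;                 /* out of 32-bit range: overflow */
    *out = (int32_t)wide;
    return 1;
}
```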

Since I'm here, I'll comment on the article. It's true the lack of carry will make adds a little more difficult for multi-precision libraries. But I've written a multi-precision library, and the adds are the least of your problems. Adds just generate 1 bit of carry. Multiplies generate an entire word of carry, and they are almost as common as adds. Divides are not so common, fortunately, but the execution time of just one divide will make all the overhead caused by a lack of carry look like insignificant noise.

I'm no CPU architect, but I gather the lack of carry and overflow bits makes life a little easier for just about every instruction other than adc and jo. If that's true, I'd be very surprised if the cumulative effect of those little gains didn't completely overwhelm the wins adc and jo gets from having them. Have a look at the code generated by a compiler some time. You will have a hard time spotting the adc's and jo's because there are bugger all of them.


It's a good observation that checking for 32-bit overflow is not too bad, but...

> In 64 bit ... almost all integers remain 32 bit.

I don't believe this is true for Rust, even if we exclude index integers in iterators because we think those overflow checks can be optimized out. Certainly in my Rust project (Pernosco) there is a lot of manipulation of 64-bit integer data.


Yeah but the code required for an overflow check is just one extra instruction (3 rather than 2)


For generalised signed addition, the overhead is 3 instructions per addition. It can be one in specific contexts where more is known about the operands (e.g. addition of immediates).

It’s always 1 in x64/ARM64 as they have built-in support for overflow.


you have to include the branch instruction too in any comparison


In x86 it is still one instruction: jc or jo after an addition.


And in RISC-V it's a "BLTU sum,summand" after an unsigned addition, and a single branch instruction after a signed addition if you know the sign of one of the operands, e.g. adding or subtracting a constant.
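In C terms, what that single branch is checking (a sketch; compilers typically turn this pattern into exactly an add plus one comparison/branch):

```c
#include <stdint.h>

/* Unsigned addition wraps iff the result is smaller than an operand. */
static int uadd_overflows(uint64_t a, uint64_t b)
{
    return a + b < a;
}
```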


Built-in support for overflow is not really built in. There's only adc, but the C library has no such function, and compilers only got intrinsics for it a few years ago. Almost nobody uses them, as they introduce a dependency. Most work around that in much slower ways, worse than RISC-V.


It kind of chafed when I excitedly read the ISA docs and found that overflow testing was cumbersome.

That said, I think it's less of an issue these days for JS implementors in particular. It might have mattered more back in the day when pure JS carried a lot of numeric compute load and there weren't other options. These days it's better to stow that compute code in wasm and get predictable reliable performance and move on.

The big pain points in perf optimization for JS is objects and their representation, functions and their various type-specializations.

Another factor is that JS impls use int32s as their internal integer representation, so there should be some relatively straightforward approach involving lifting to int64s and testing the high half for overflow.

Still kind of cumbersome.

There are similar issues in existing ISAs. NaN-boxing for example uses high bits to store type info for boxed values. Unboxing boxed values on amd64 involves loading an 8-byte constant into a free register and then using that to mask out the type. The register usage is mandatory because you can't use 64-bit values as immediates.

I remember trying to reduce code size and improve perf (and save a scratch register) by turning that into a left-shift right-shift sequence involving no constants, but that led to the code executing measurably slower as it introduced data dependencies.
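For the curious, a sketch of the two unboxing variants described above (the 48-bit payload layout here is assumed for illustration; real engines differ in tag placement):

```c
#include <stdint.h>

#define PAYLOAD_MASK 0x0000FFFFFFFFFFFFull /* 8-byte constant: needs its own register on amd64 */

/* Variant 1: mask with a materialized 64-bit constant. */
static uint64_t unbox_mask(uint64_t boxed)
{
    return boxed & PAYLOAD_MASK;
}

/* Variant 2: clear the tag bits with shifts; no constant or scratch
   register, but the two shifts form a serial data dependency. */
static uint64_t unbox_shifts(uint64_t boxed)
{
    return (boxed << 16) >> 16;
}
```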


> It kind of chafed when I excitedly read the ISA docs and found that overflow testing was cumbersome.

It just feels backwards to me to increase the cost of these checks in a time where we have realized that unchecked arithmetic is not a good idea in general.


I think I agree it was a mistake/wart. I can understand the frustration of the GMP dev in the text - they get hit hard by this. The omission feels arbitary and capricious and a bit ideologically motivated. The defense of the choice seems like post-hoc rationalization.

RiscV looks like a nice ISA otherwise.

I wouldn't be surprised if it was eventually extended to add a set of variant instructions that wrote flags to an explicitly specified register.


Shouldn't settle for less than interrupt on overflow tbh.


That would probably be the easiest to add, wouldn't it?


It should be, but they never do. It's really important for HPC workloads.


Which is exactly the right trade-off for embedded CPUs, where RISC-V is most popular right now.

If desktop/server-class RISC-V CPUs become more common, it's not unreasonable to think they'll add an extension that covers the needs of managed/higher-level languages on RISC-V.

Even for server-class CPUs you could argue that you absolutely want this extension to be optional, as you can design more efficient CPUs for datacenters/supercomputers where you know what kind of code you'll be running.


They provide recommended insn sequences for overflow checking as commentary to the ISA specification, and this enables efficient implementation in hardware.


Any hardware adder provides almost for free the overflow detection output (at less than the cost of an extra bit, so less than 1/64 of a 64-bit adder).

So anyone who thinks about an efficient hardware implementation would expose the overflow bit to the software.

A hardware implementation that requires multiple additions to provide the complete result of a single addition can be called in many ways, but certainly not "efficient".


> Any hardware adder provides almost for free the overflow detection output (at less than the cost of an extra bit, so less than 1/64 of a 64-bit adder). So anyone who thinks about an efficient hardware implementation would expose the overflow bit to the software.

Ah, but where you do put that bit that you got for free?

A condition-codes register, global to the processor/core state? That worked terrifically for single-issue microcontrollers back in the 1980s. Now you need register renaming, and all the expensive logic around that, to track which overflow bit follows which previous add operation. That's what's being done now for old ISAs, and it is generally disliked for several reasons (complexity being chief among them).

Well, you could stuff that bit into another general-purpose register, but then you kind of want to specify 4 registers for the add command. Now, where are the bits to encode a fourth register in a new instruction format? RISC-V has room to grow for extensions, but another 5 bits for another register is a big ask.


I have no clue about flags but why not just store the flags with the register? Each register would have 32+r bits where r is the number of flags.


> I have no clue about flags but why not just store the flags with the register? Each register would have 32+r bits where r is the number of flags.

That sort of design can be done, but that just pushes the problem around. Let's look at the original ARM 64 bit code:

    adds  x12, x6, x10
    adcs  x13, x7, x11
The second add with carry uses the global carry bit, it isn't passed as an argument to the adcs instruction. So if you store the carry bit with the x12 register, you would then need to specify x12 in the adcs instruction on the next line. So you need a new instruction format for adcs that can specify four registers.

You could change the semantics, where the add instructions use one register as the source and the destination, like on x86-64, but that's a whole 'nother discussion on why that is and isn't done on various architectures.


> A hardware implementation that requires multiple additions to provide the complete result of a single addition can be called in many ways, but certainly not "efficient".

1. There are no multiple additions in the recommended sequences. Unsigned is add, bltu; signed with one known sign is add, blt; signed in general is add, slt, slti, bne.

2. These instruction sequences are specified so that an instruction decoder can treat these sequences following the add as a "very wide" instruction specifying to check an overflow flag, if a hardware implementation so chooses.
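In C, my reading of the general signed sequence (a sketch, not the spec's wording):

```c
#include <stdint.h>

/* Signed addition overflows iff the wrapped sum's ordering against one
   operand disagrees with the sign of the other operand; this is the
   check the add/slt/slti/bne sequence performs. */
static int sadd_overflows(int64_t a, int64_t b)
{
    int64_t sum = (int64_t)((uint64_t)a + (uint64_t)b); /* wrap, no UB */
    return (sum < a) != (b < 0);
}
```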


Even if a dedicated comparator can be a little cheaper than a full adder, all CPUs already have full adders/subtractors, so all the comparisons are done by subtraction in the same adder/subtractor.

So your recommended sequence has 4 additions done in the adder/subtractor of the ALU, because all comparisons, including the compare-and-branch instructions, count as additions, from the point-of-view of the energy consumption and execution time.


> So your recommended sequence has 4 additions done in the adder/subtractor of the ALU, because all comparisons, including the compare-and-branch instructions, count as additions, from the point-of-view of the energy consumption and execution time.

Only for the signed overflow case with a compile-time-unknown value added, and only if the processor doesn't do anything fancy to elide the recommended sequence (fusion or substitution of operation). Not for the other two cases, and not with fusion.


> They provide recommended insn sequences for overflow checking as commentary to the ISA specification, and this enables efficient implementation in hardware.

I would like to see some benchmarks of this efficient implementation in hardware, even simulated hardware, compared against conventional architectures.

Even for C, it's a recurring source of bugs and vulnerabilities that int overflow goes undetected. What we really need is an overflow trap like the one in IEEE floating point. RISC-V went the opposite direction.


Is it worth the encoding space to define new ALU instructions with overflow flag semantics? Or could there be an executive format that implies a different mode?


This isn't an isolated case. RISC-V makes the same basic tradeoff (simplicity above all else) across the board. You can see this in the (lack of) addressing modes, compare-and-branch, etc.

Where this really bites you is in workloads dominated by tight loops (image processing, cryptography, HPC, etc). While a microarchitecture may be more efficient thanks to simpler instructions (ignoring the added complexity of compressed instructions and macro-fusion, the usual suggested fixes...), it's not going to be 2-3x faster, so it's never going to compensate for a 2-3x larger inner loop.


I'm not an expert on ISAs and CPU internals, but an x86 instruction is not just "an instruction" anymore. Afaik, since the P6 arch Intel has been using a fancy decoder to translate x86/-64 CISC into an internal RISC ISA (up to 4 µops per CISC instruction), and that internal ISA could be quite close to the RISC-V ISA for all I know.

Instruction decoding and memory ordering can be a bit of nightmare on CISC ISAs and fewer macro-instructions are not automatically a win. I guess we'll eventually see in benchmarks.

Even though Intel has had decades to refine their CPUs I'm quite excited to see where RISC-V is going.


As someone who is an expert on ISA and CPU internals, this meme of "X86 has an internal RISC" is an over-simplification that obscures reality. Yes, it decodes instructions into micro-ops. No, micro-ops are not "quite close to the RISC-V ISA".

Macro fusion definitely has a place in microarchitecture performance, especially when you have to deal with a legacy ISA. RISC-V makes the very unusual choice of depending on it for performance, when most ISAs prefer to fix the problem upstream.


Indeed. Also not an expert, but relying on macro-op fusion in hardware is tricky IIRC since different implementors will (likely) choose different macro-ops, resulting in strange performance differences between otherwise-identical chips.

Of course, you could start documenting "official" macro-ops that implementations should support, but at that point you're pretty much inventing a new ISA...


RISC-V does document "official" macro-ops that implementations are encouraged to support.


> "X86 has an internal RISC" is an over-simplification that obscures reality

Is it misleading though? I don't mind simplifications unless they are misleading. Would like to hear your criticisms of this meme.


X86 processors could have an internal VLIW for all we know. The instruction length would be very small by Itanium standards but still. It could be anything.


Seems to me that the parent comment was claiming to know.


Most commonly used x86_64 instructions decode to only 1 or 2 µops, thus often also just as "complex" as the original instructions.


> but an X86 instruction is not just "an instruction" anymore.

This is technically true but not really. Decoding into many instructions is mainly used for compatibility with the crufty parts of the x86 spec. In general, for anything other than rmw or locking a competent compiler or assembly writer will only very rarely emit instructions that compile to more than one uop. The way the frontend works, microcoded instructions are extraordinarily slow on real cpus.

Modern x86 is basically a risc with a very complex decode, few extra useful complex operations tacked on, and piles and piles of old moldy cruft that no-one should ever touch.


X86 doesn't have to go to microcode to have multiple uOPs for an instruction. Most uarchs can spit out three or four uOPs per instruction before having to resort to the microcode ROM. Basically instructions that would only need one microcode ROM row in a purely microcoded design can be spit out of the fast decoders.


> it's not going to be 2-3x faster, so it's never going to compensate for a 2-3x larger inner loop.

As someone else who replied said, I'm not a CPU architect, just software that works close to the metal. That means I pay attention to compiler output.

What you say is true of the very early days: compilers did indeed use the x86's addressing modes in all sorts of odd ways to squeeze as many calculations as possible into as few bytes as possible. Then it went in the reverse direction: you started seeing compilers emit long series of simple instructions instead, seemingly deliberately avoiding those complex addressing modes. And now it's swung back again; compilers using addressing modes to do a shift plus a couple of adds in one instruction is common again. I presume all these shifts were driven by the speed of the resulting code.

I have no idea why one method was faster than the other - but clearly there is no hard and fast rule operating here. For some internal x86's implementations using complex addressing modes was a win. On some, for exactly the same instruction set, it wasn't. There is no cut and dried "best" way of doing it, rather it varies as the transistor and power budget changes.

One thing we do know about RISC-V is that it is intended to cover a _lot_ of transistor and power budgets. Where it's used now (low power / low transistor count), it turns out their design decisions have worked out _very_ well, far better than x86.

More fascinatingly to me, today the biggest speed-ups compilers get on superscalar archs have nothing to do with the addressing modes so much attention is being focused on here. They come from avoiding conditional jumps. The compilers will often emit code that evaluates both paths of the computation (thus burning 50% more ALU time on calculating a result that will never be used), then choose the result they want with a cmov. In extreme cases, I've seen that sort of thing gain them a factor of 10, which is far more than playing tiddlywinks with addressing modes will get you.
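A sketch of the branchless pattern being described (illustrative only; a compiler may or may not choose a cmov for it):

```c
/* Evaluate both arms, then select one with bit masks instead of a
   conditional jump the branch predictor could miss. */
static long select_max(long a, long b)
{
    long mask = -(long)(a < b);       /* all ones if a < b, else 0 */
    return (b & mask) | (a & ~mask);  /* b when a < b, otherwise a */
}
```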

I have no idea how that will pan out for RISC-V. I don't think anyone has done a superscalar implementation of it yet(?). But in the non-superscalar implementations the RISC-V instruction set choices have worked out very well so far. And when someone does do a superscalar implementation (and I'm sure there will be a lot of different implementations over time), it seems very possible x86's learnings on addressing-mode use will be yesterday's news.


For those use cases you typically have specialised hardware or an FPGA.


So when h266 or whatever comes out you can't watch video anymore because your cpu can't decode it in software even if it tried?


An FPGA can be reprogrammed, and we really do this for standards with better longevity than video standards (e.g. cryptographic ones like AES and SHA). For standards like video codecs, we just use GPUs instead, which I assume is what OP had in mind for "specialized hardware" (specialization can still be pretty general :-)).


Hardware video decoding is done by a single-purpose chip on the graphics card (or dedicated hardware inside the GPU), not via software running on the GPU. Adding support for a new video codec requires buying a new video card which supports that codec.


SystemC bloat will require you to upgrade to a bigger FPGA!


So do what Power does: most instructions that update the condition flags can do so optionally (except for instructions like stdcx. or cmpd where they're meaningless without it, and corner oddballs like andi.). For that matter, Power treats things like overflow and carry as separate from the condition register (they go in a special purpose register), so you can issue an instruction like addco. or just a regular add with no flags, and the condition register actually is divided into eight, so you can operate on separate fields without dependencies.


IIRC few Power(PC) cores really split the condition register nibbles into 8 renamable registers, and while Power(PC) includes everything (including at least two spare kitchen sinks), only a few instructions can pick which condition register nibble to update. Most integer instructions can only update cr0, and floating-point instructions cr1. On the other hand you can do nice hacks with the cornucopia of available bitwise operations on condition register bits, and it's one of the architectures where (floating-point) comparisons return the full set of results (less, equal, greater, unordered).


On POWER, all the comparison instruction can store their result in any of the 8 sets of flags. The conditional branches can use any flag from any set.

The arithmetic instructions, e.g. addition or multiplication, do not encode a field for where to store the flags, so they use, like you said, an implicit destination, which is still different for integer and floating-point.

In large out-of-order CPUs, with flag register renaming, this is no longer so important, but in 1990, when POWER was introduced, the multiple sets of flags were a great advance, because they enabled the parallel execution of many instructions even in CPUs much simpler than today.

Besides POWER, the 64-bit ARMv8 also provides most of the 14 predicates that exist for a partial order relation. For some weird reason, the IEEE FP standard requires only 12 of the 14 predicates, so ARM implemented just those 12, even if they have 14 encodings, by using duplicate encodings for a pair of predicates.

I consider this stupid, because there would not have been any additional cost to gate correctly the missing predicate pair, even if it is indeed one that is only seldom needed (distinguishing between less-or-greater and equal-or-unordered).


ARM also does something similar: many instructions have a flag bit specifying whether the flags should be updated or not. It doesn't have the multiple flag registers of POWER though.


Which, at least back in the day, neither the IBM compilers nor GCC 2.x - 4.x made much use of. I've seen only a few hand-optimized assembler routines get decent use out of them. Easy-to-fuse pairs are probably a good compromise for carry calculation, e.g. an add plus a carry instruction. That would get rid of one of the additional dependencies, but it would take a three-operand addition or fusing two additions to get rid of the second RISC-V specific dependency. And while GMP isn't unimportant, it is still a niche use case that's probably not worth throwing that much hardware at to fix the ISA limitations in the uarch.


Which feature are you saying was not used much? Addition with carry; or addition without carry; or multiple flag registers?


> they're a bit of an architectural nightmare for everyone doing any more than the simplest CPUs, they make every instruction potentially have dependencies or anti-dependencies with every other

It doesn't have to be _that_ bad. As long as condition flags are all written at once (or are essentially banked, like on PowerPC), the dependency issue can go away because they're renamed and their results aren't dependent on previous data.

Now, of course, instructions that only update some condition flags and preserve others are the devil.


It's not a tiny corner. People do arithmetic with carry all the time. Arbitrary precision arithmetic is more common than you think. Congratulations, RISC-V, you've not only slowed down every bignum implementation in existence, all those extra instructions to compute carry will blow the I$ faster, potentially slowing down any code that relies on a bignum implementation as well.


  > all those extra instructions to compute carry will blow the I$ faster
i think the idea is, as others have mentioned, that the add/comp instructions are fused internally into a single instruction, so it's probably not as bad for the i$ as we might think?


> RISCV also specifies a 128-bit variant that is of course FASTER than these examples

Is it actually implemented on any hardware?


No. Mentioning it is only meant to distract.


Is there a semi competitive Risc-V core implemented anywhere?

It all seems hypothetical to me now; fast cores would fuse the instructions together, so instruction count alone isn't adequate for the original evaluation of the ISA. Now I'm not sure that there are any that really do that.


BOOM cores fuse ops already, so the cores don't have to be all that fast to start to see wins from it.


SiFive have a new thing which is roughly as fast as a low-power CPU from Intel or Apple (e.g. IceStorm).

Obviously not really competitive, but I think they are still mostly making hardware so there is hardware at-all stage.


You can opt in to generating and propagating conditions and rename the predicates as well.


> I believe that an average computer science student could come up with a better instruction set that Risc V in a single term project.

When you hear the "<person / group> could make a better <implementation> in <short time period>" - call them out. Do it. The world will not shun a better open license ISA. We even have some pretty awesome FPGA boards these days that would allow you to prototype your own ISA at home.

In terms of the market - now is an exceptionally great time to go back to the design room. It's not as if anybody will be manufacturing much during the next year, with all of the fabs unable to make existing chips to meet demand. There is a window of opportunity here.

> It is, more-or-less a watered down version of the 30 year old Alpha ISA after all. (Alpha made sense at its time, with the transistor budget available at the time.)

As I see it, lower numbers of transistors could also be a good thing. It seems blatantly obvious at this point that multi-core software is not only here to stay, but is the future. Lower numbers of transistors means squeezing more cores onto the same silicon, or implementing larger caches, etc.

I also really like the Unix philosophy of doing one simple thing well. Sure, it could have some special instruction that does exactly your use case in one cycle using all the registers, but that's not what has created such advances in general purpose computing.

> Sure, it is "clean" but just to make it clean, there was no reason to be naive.

I would much rather we build upon a conceptually clean instruction set than try to cobble together hacks on top of fundamentally flawed designs - even at the cost of performance. It's exactly these cobbled-together conceptual hacks that have led to the likes of the Spectre and Meltdown vulnerabilities, when instruction sets become so complicated that they cannot be easily tested.


Yeah. I don't have a dog in this fight... I don't have strong opinions either way, and this is one of those arguments that will be settled by reality after some time goes by.

But the author making an argument like that...

> I believe that an average computer science student could come up with a better instruction set that Risc V in a single term project.

Pretty much blew their credibility. It's obviously wrong, and a sensible, fair person wouldn't write it.


This student gets an A because he added the "adcs" instruction to his ISA. Everyone else gets an F because they didn't.


> Do it. The world will not shun a better open license ISA.

When you use a CPU architecture you don't just get an ISA.

You also get compilers and debuggers. Ready-to-run Linux images. JIT compilers for JavaScript and Java. Debian repos and Python wheels with binaries.

And you get CPUs with all the most complex features. Instruction re-ordering, branch prediction, multiple cores, multi-level caches, dynamic frequency and voltage control. You want an onboard GPU, with hardware 4k h264 encoding and decoding? No problem.

And you get a wealth of community knowledge - there are forum posts and StackOverflow questions where people might have encountered your problems before. If you're hiring, there are loads of engineers who've done a bit of stuff with that architecture before. And of course vendors actually making the silicon!

I've seen ISAs documented with a single sheet of A4 paper. The difficult part in having a successful CPU architecture is all the other stuff :)


>As I see it, lower numbers of transistors could also be a good thing. It seems blatantly obvious at this point that multi-core software is not only here to stay, but is the future. Lower numbers of transistors means squeezing more cores onto the same silicon, or implementing larger caches, etc.

How about some 32 way SMT GPUs... No more divergence!


The idea is to use the compressed instruction extension. Then two adjacent instructions can be handled like a single “fat” instruction with a special case implementation.

That allows more flexibility for CPU designs to optimize transistor count vs speed vs energy consumption.

This guy clearly did not look at the stated rationale for the design decisions of RISC-V.


Compressed instructions and macro-fusion aren't magical solutions. It's not always possible to convince the compiler to generate the magical sequence required, and it actually makes high-performance implementations (wide superscalar) more difficult thanks to the variable width decoding.

Beyond that, compressed instructions are not a 1:1 substitute for more complex instructions, because a pair of compressed instructions cannot have any fields that cross the 16-bit boundary. This means you can't recover things like larger load/store offsets.

Additionally, you can't discard architectural state changes due to the first instruction. If you want to fuse an address computation with a load, you still have to write the new address to the register destination of the address computation. If you want to perform clever fusion for carry propagation, you still have to perform all of the GPR writes. This is work that a more complex instruction simply wouldn't have to perform, and again it complicates a high performance implementation.


Part of the idea is to create standard ways to do certain things and then hope compiler writers generate code accordingly. That will allow more chip designers to take advantage of those patterns if they want to.

They spent a lot of time and effort on making sure the decoding is pretty good and useful for high performance implementations.

RISC-V is designed for very small and very large systems. At some point some tradeoffs need to be made, but these are very reasonable and most of the time not a huge problem.

For the really specialized cases where you simply can't live with those extra instructions, those will be added to the standard, and then some profiles will include them and others not. If those instructions are really as vital as those that want them claim, they will find their way into many profiles.

Saying RISC-V is 'terrible' because of those choices is not a fair way of evaluating it.


> RISC-V is designed for very small and very large system

That's exactly the problem --- there is no one-size-fits-all when it comes to instruction set design.


There is a trade-off, but there is overall far more value in having it be unified.

The trade-offs are mostly very small or nonexistent once you consider the standard extensions that different use cases will have.

Overall, having a unified open instruction set is far better than hand-designing many different instruction sets just to get marginal improvements. Some really extreme applications might require that, but for the most part the whole industry could do just fine with RISC-V. Both on the low and on the high end, and in fact better than most of the alternatives, all things considered.

If integer checking is really the be-all end-all and RISC-V cannot be successful without it, it will be added and it will be pulled into all the profiles. If it is not actually that relevant, then it won't. If it is very useful for some verticals and not others, it will be in those profiles and not in others.


>overall far more value in having it be unified.

>[...]

>If integer checking is really the be-all end-all and RISC-V cannot be successful without it, it will be added and it will be pulled into all the profiles. If it is not actually that relevant, then it won't. If it is very useful for some verticals and not others, it will be in those profiles and not in others.

So which is it? Unified or something else?


The goal is that there is a unified core that runs the majority of code. The majority of the ecosystem and tooling works off a common base. Lots of code can be used in a way that is very universal.

Some verticals, like deep embedded, will likely be different enough that they end up slightly different, but they still profit from all the work going into the overall ecosystem.

RISC-V allows 'the market' to decide between uniformity and specialty in an orthogonal way. My bet is that this will actually lead to a lot of uniformity in most verticals.


Again, I ask: Unified, or something else?

Having a "majority" isn't really different from what we have now, and it includes the downside of a more robust monoculture.

And to be clear here, I'm not anti RISC-V. I am very highly skeptical, given how frequently we see things like code size criticisms responded to with 'but it will be faster', which isn't really an answer.

The same thing happened here. Hand-wavy "the market" and "orthogonal way" doesn't communicate anything meaningful, nor is it worth some clear downsides at present.

Finally, "the goal" is really "a goal or vision" as expressed by proponents. It's not really one in the objective sense.


Not having the definition controlled by a company, being an open specification, and having the whole ecosystem and tooling designed from the ground up around a fully modular instruction set with use-case specific profiles that can evolve independently is very different than what we have now.

> Finally, "the goal" is really "a goal or vision" as expressed by proponents. It's not really one in the objective sense.

No idea what that means.

I don't know if it will be faster; frankly, I don't care if it's slightly slower, even if I don't really think that will be the case. An open ISA allows for real open chips and finally making real progress in that direction in terms of openness.


I agree with the latter portion, and that's the part I like about RISC-V the most.

Just wish some of the choices were made differently. Open could be open and amazing. We are likely to get open with a bit of hobbling built in early on, and that makes me wonder.

This discussion is much improved over the initial, "unified, but..." one we started with.


I think overall it is amazing. I think considering everything they had to do and everything they wanted to achieve, I think they made really, really good overall choices.

There are some things here and there one can argue about, depending on what one considers the main use case. Overall I think starting with a very small base makes a lot of sense.

In fact they actually went too large in some places, and that's why they have been working on a profile that is considerably smaller, with fewer registers and floats kept in integer registers.

Given that this was made by a small group of students and one professor, I think it's remarkable and innovative.


I do not think much of the choice on how to do math and not have flags.

We will see over time.

My preference is a hybrid. There are ops we know will repeat a lot. Targeting those with efficient, small instructions will be a win over this every time.

I expect to see that done.

Seeing the tool chains all come up is a good thing though. We're heading into interesting times.


In the context of gmp, people write architecture-specific assembly for the inner loop anyway.

Besides that, you raise good points on sources of complexity. I’m waiting for the benchmarks once such developments have been incorporated. Everything else is guesswork.


If they didn't implement those benchmarks (at least in simulation, like they benchmarked everything else) before releasing the spec, then they have nothing but handwaving and wishful thinking in saying this issue can be solved by op fusion. The reality is that they optimized for 1980s-style C programming without noticing that this isn't the 1980s any more.


I don’t see why offsets larger than 16-bit are important. Are you implying that most fusion candidate pairs would need this? In tight inner loops why would you need large offsets?

Of course you discard architectural state changes in fusion. If I have a bunch of instructions which end up reading from memory into register x10, then I can fuse with all previous instructions which wrote into x10, as their results get clobbered anyway.

Disclaimer: I may have misunderstood the point you made. However you don’t seem to make it clear how fusion is bad for performance.

What performance tricks are you giving up by doing fusion?


> and it actually makes high-performance implementations (wide superscalar) more difficult thanks to the variable width decoding.

More difficult than x86? We're talking about a damn simple variable width decoding here.

I could imagine RISC-V with C extension being more tricky than 64-bit ARM. Maybe.

> and again it complicates a high performance implementation.

But so much of the rationale behind the design of RISC-V is to simplify high performance implementation in other ways. So the big question is what the net effect is.

The other big question is if extensions will be added to optimise for desktop/server workloads by the time RISC-V CPUs penetrate that market significantly.


Let's assume you are right. In 5 years the organization behind RISC-V apologizes and introduces a "bignum" extension. That doesn't sound too bad.


He literally addressed this, albeit obliquely, in the message

> I have heard that Risc V proponents say that these problems are known and could be fixed by having the hardware fuse dependent instructions. Perhaps that could lessen the instruction set shortcomings, but will it fix the 3x worse performance for cases like the one outlined here?

Macro-fusion can to some extent offset the weak instruction set, but you're never going to get a multiple integer multiplier speedup out of it given the complexity of inter-op architectural state changes that have to be preserved, and instruction boundary limitations involved; it's never going to offset a 3x blowup in instruction count in a tight loop.


Fusing 3 instructions is not unusual, and those could also have been compressed. Thus you have no more microcode to execute, and only 50% more cache usage rather than 300%.


Sweet spot seems to be 16-bit instructions with 32/64-bit registers. With 64-bit registers you need some clever way to load your immediates, e.g., like the shift/offset in ARM instructions.


Even if you do so, the program size is still bigger, and it consumes more disk, RAM and, most importantly, cache space. Wasting cache by needing multiple instructions where another architecture needs only one doesn't make particular sense to me.

Also, it's said that x86 is bad because the instructions are reorganized and translated inside the CPU. But it seems that you are proposing the same: a CPU that preprocesses the instructions and fuses some into a single one (the opposite of what x86 does). At that point, it seems to me that what x86 does makes more sense: have a ton of instructions (and thus smaller programs, and thus more code that can fit in cache) and split them, rather than having a ton of instructions (and wasted cache space) for the CPU to then combine them into a single one (a thing that a compiler could also do).


x86 also does macro fusion. The difference is that RISC-V was designed for compressed instructions and fusion from the get-go; x86 bolted this on.

Anyway, what you gain from this is a very simple ISA, which helps tool writers and those who implement hardware, as well as academia for teaching and research.

How does the insanely complex x86 instructions help anyone?


How many cache misses are for program instructions, versus data misses?


IME icache misses are a frequent bottleneck. There's plenty of code where all the time is spent in one tight inner loop and thus the icache is not a constraint, but there are also a lot of cases with a much flatter profile, where icache misses suddenly become a serious constraint.


Obtaining the carry bit does not involve branches though. Overflow checking probably does.


Depends on the application. But even if they are few, that's not a good reason to add them just to have a nice instruction set, which, if you are not writing assembly by hand (and almost nobody does these days), doesn't give you any benefit.

Also, don't reason only with the desktop or server use case in mind, where you have TBs of disk and code size doesn't matter. RISC-V is meant to be used also for embedded systems (in fact their use nowadays is only in these systems), where code size usually matters more than performance (i.e. you typically compile with -Os). In these situations more instructions means more flash space wasted, meaning you can fit less code.


> that if you are not writing assembly by hand (and nobody does these days) doesn't give you any benefit.

An elegant architecture is easier to reason about. Compilers will make fewer wrong decisions, fewer bugs will be introduced, and fewer workarounds will need to be implemented. An architecture that's simple to reason about is an invaluable asset.


I think talking about ISAs as better or worse than one another is often a bad idea for the same reason that arguing about whether C or Python is better is a bad idea. Different ISAs are used for different purposes. We can point to some specific things as almost always being bad in the modern world like branch delay slots or the way the C preprocessor works but even then for widely employed languages or ISAs there was a point to it when it was created.

RISC-V has a number of places where it's employed that make an excellent fit. First of all, academia. For an undergrad building the netlist for their first processor, or a grad student doing their first out-of-order processor, RISC-V's simplicity is great for the pedagogical purpose. For a researcher trying to experiment with better branch prediction techniques, having a standard high-ish performance open source design they can take and modify with their ideas is immensely helpful. And many companies in the real world with their eyes on the bottom line like having an ISA where you can add instructions that happen to accelerate your own particular workload, where you can use a standard compiler framework outside your special assembly inner loops, and where you don't have to spend transistors on features you don't need.

I'm not optimistic about RISC-V's widescale adoption as an application processor. If I were going to start designing an open source processor in that space I'd probably start with IBM's now open Power ISA. But there are so many more niches in the world than just that and RISC-V is already a success in some of them.


Branch delay slots are an artifact of a simple pipeline without speculation. There's nothing inherently "bad" about them.


If you're designing a single CPU that definitely has a simple pipeline, branch delay slots are maybe justifiable. If you're designing an architecture which you hope will eventually be used by many CPU designs which might have a variety of design approaches, then delay slots are pretty bad because every future CPU that isn't a simple non-speculating pipeline will have to do extra work to fake up the behaviour. This is an example of a general principle, which is that it's usually a mistake to let microarchitectural details leak into the architecture -- they quickly go stale and then both hw and sw have to carry the burden of them.


> My conclusion is that Risc V is a terrible architecture.

Kinda stopped reading here. It's a pretty arrogant hot take. I don't know this guy, maybe he's some sort of ISA expert. But it strains credulity that after all this time and work put into it, RISC-V is a "terrible architecture".

My expectation here is that RISC-V requires some inefficient instruction sequences in some corners somewhere (and one of these corners happens to be OP's pet use case), but by and large things are fine.

And even then, I don't think that's clear. You're not going to determine performance just by looking at a stream of instructions on modern CPUs. Hell, it's really hard to compare streams of instructions from different ISAs.


> Kinda stopped reading here. It's a pretty arrogant hot take. I don't know this guy, maybe he's some sort of ISA expert. But it strains credulity that after all this time and work put into it, RISC-V is a "terrible architecture".

Seems quite balanced with all the other replies here which claim it's the best architecture ever whenever anyone says anything about it.

I don't think its vector extensions would be good for video codecs because they seem designed around large vectors. (and the article the designers wrote about it was quite insulting to regular SIMD)


> Seems quite balanced with all the other replies here which claim it's the best architecture ever whenever anyone says anything about it.

RISC-V is pretty good. Probably slightly better for some things than ARM, and slightly worse for others. It's open, which is awesome, and the instruction set lends itself to extensions which is nice (but possibly risks the ecosystem fragmenting). Building really high performance RISC-V designs looks like it's going to rely on slightly smarter instruction decoders than we've seen in the past for RISCs, but it doesn't look insurmountable.


Calling it terrible is definitely something from the book of Linus T.

Bad? Quite possible, it was meant as a teaching ISA initially IIRC, but terrible? Who knows.


That's the difficulty here, people are already arguing past each other because nobody seems to agree what the ISA is for.

If you look at the early history of RISC-V, it does indeed look like something built for teaching. But I don't think that use case warrants all the hype around it.

So how did all the hype form, and why is it that there are people seemingly hyping it as the next-gen dream-come-true super elegant open developed-with-hindsight ISA that will eventually displace crufty old x86 and proprietary ARM while offering better performance and better everything? Of course that just baits you into arguing about its potential performance. And don't worry if it doesn't have all the instructions you need for performance yet, we'll just slap it with another extension and it totally won't turn into a clusterfuck with a stench of legacy and numerous attempts at fixing it (coz' remember, hindsight)!

And then if you question its potential, you'll get someone else arguing that no no, it's not a high performance ISA for general use in desktops / servers, it's just an extensible ISA that companies can customize for their special sauce microcontrollers or whatever.

Of course it's all armchair speculation because there are no high performance real world implementations and there aren't enough experts you can trust.


Godbolt:

  typedef __int128_t int128_t;

  int128_t add(int128_t left, int128_t right)
  {
    return left + right;
  }
GCC 10, -O2, RISC-V:

  add(__int128, __int128):
        mv      a5,a0
        add     a0,a0,a2
        sltu    a5,a0,a5
        add     a1,a1,a3
        add     a1,a5,a1
        ret
ARM64:

  add(__int128, __int128):
        adds    x0, x0, x2
        adc     x1, x1, x3
        ret

This issue hurts the wider types that are compiler built-ins.

Even though C has a programming model that is devoid of any carry flag concept, canned types like a 128 bit integer can take advantage of it.

Portable C code to simulate a 128 bit integer will probably emit bad code across the board. The code will explicitly calculate the carry as an additional operand and pull it into the result. The RISC-V won't look any worse, then, in all likelihood.

(The above RISC-V instruction set sequence is shorter than the mailing list post author's 7 line sequence because it doesn't calculate a carry out: the result is truncated. You'd need a carry out to continue a wider addition.)


Hmmm... I think this argument is solid. It's biased toward GMP's perspective, but bignums are used all the time in RSA / ECC, and probably other common tasks, so maybe it's important enough to analyze at this level.

2 instructions to work with 64 bits, maybe 1 more instruction / macro-op for the compare-and-jump back up to a loop, and 1 more instruction for a loop counter of some kind?

So we're looking at ~4 instructions for 64-bits on ARM/x86, but ~9-instructions on RISC-V.

The loop will be performed in parallel in practice however due to Out-of-order / superscalar execution, so the discussion inside the post (2 instruction on x86 vs 7-instructions on RISC-V) probably is the closest to the truth.

----------

Question: is ~2 clock ticks per 64 bits really the ideal? I don't think so. It seems to me that bignum arithmetic is easily SIMD-able. Carries are NOT accounted for in x86 AVX or ARM NEON instructions, so x86, ARM, and RISC-V will probably be on equal footing there.

I don't know exactly how to write a bignum addition loop in AVX off the top of my head. But I'd assume it'd be similar to the 7-instructions listed here, except... using 256-bit AVX-registers or 512-bit AVX512 registers.

So 7-instructions to perform 512-bits of bignum addition is 73-bits-per-clock cycle, far superior in speed to the 32-bits-per-clock cycle from add + adc (the 64-bit code with implicit condition codes).

AVX512 is uncommon, but AVX (256-bit) is common on x86 at least: leading to ~36-bits-per-clock tick.

----------

ARM has SVE, which is ambiguous (sometimes 128-bits, sometimes 512-bits). RISC-V has a bunch of competing vector instructions.

..........

Ultimately, I'm not convinced that the add + adc methodology here is best anymore for bignums. With a wide-enough vector, it seems more important to bring forth big 256-bit or 512-bit vector instructions for this use case?

EDIT: How many bits is the typical bignum? I think add+adc probably is best for 128, 256, or maybe even 512-bits. But moving up to 1024, 2048, or 4096 bits, SIMD might win out (hard to say without me writing code, but just a hunch).

2048-bit RSA is the common bignum, right? Any other bignums that are commonly used? EDIT2: Now that I think of it, addition isn't the common operation in RSA, but instead multiplication (and division which is based on multiplication).


> RISC-V has a bunch of competing vector instructions.

There is only one standard V extension. Alibaba made a chip with a prerelease version of that V extension, which is thus incompatible with the final version, but in practice that just means the vector unit on that chip is not used because it is incompatible, not that there are now competing standards.


GMP is basically a worst-case example since it uses a lot of overflow. The RISC-V architecture has been extensively studied, and for most cases it's a little more dense than (say) ARM when compared like-for-like.


> So 7-instructions to perform 512-bits of bignum addition is 73-bits-per-clock cycle, far superior in speed to the 32-bits-per-clock cycle from add + adc (the 64-bit code with implicit condition codes).

add+adc should still be 64 bits per cycle. adc doesn't just add the carry bit, it's an add instruction which includes the usual operands, plus the carry bit from the previous add or adc.


Can you treat the whole vector register as a single bignum on x86? If so, I totally missed that.


No.

Which is why I'm sure add / adc will still win at 128-bits, or 256-bits.

The main issue is that the vector-add instructions are missing carry-out entirely, so recreating the carry will be expensive. But with a big enough number, that carry propagation is parallelizable in log2(n) steps, so a big enough bignum (like maybe 1024 bits) will probably be more efficient with SIMD.


even AVX512 dies

