Linus Torvalds says RISC-V will make the same mistakes as Arm and x86 (tomshardware.com)
103 points by kaycebasques 9 months ago | 144 comments



Bad headline: he prefaces the statement with "I fear", so `says` could've been written as `fears`

It's realistic to expect RISC-V's execution won't be (& hasn't been) flawless

The context is re: Spectre


Torvalds has experience of sufficient breadth and depth such that it is not unreasonable to take his concerns about reconciling hardware and software as being the concerns of an expert. And of course these days Torvalds tends to be more tactful in his criticisms than when he was younger.


Exactly this. A lot of people forgot or didn't know he used to work in CPU hardware (Transmeta) too.

>And of course these days Torvalds tends to be more tactful in his criticisms than when he was younger.

And that is why he has been very careful about Rust and RISC-V.

And just on the parent's note: > It's realistic to expect RISC-V's execution won't be (& hasn't been) flawless

Except most of the RISC-V supporters have been claiming every RISC-V execution is flawless.


> Except most of the RISC-V supporters have been claiming every RISC-V execution is flawless.

Know-nothings, perhaps.

RISC-V cores are being designed by probably a couple of dozen companies, and chips made by many more (SiFive for example has 100+ customers for their cores). All of them have a list of more or less serious errata, just like every Arm and x86 chip.

The strength is in the diversity, in the freedom to experiment and innovate, in the shared software ecosystem, in the security of knowing that if your chip vendor goes out of business for any reason (bankruptcy, change of plans, being acquired...) there will always be another vendor you can turn to.

Unlike the people who were depending on Itanium, DEC Alpha, Motorola 68000, Z80 and dozens of others.


This.

I keep hearing from Amiga people how the Amiga "coulda been a contenda" in this day and age of computing if only Commodore weren't so mismanaged. If only Commodore hadn't bought Amiga Inc. in the first place and the brilliant, revolutionary Jay Miner were allowed to market his designs on his own terms without being beholden to some evil corporate overlord who wanted to squeeze them for profit. If only. If only.

But it was clear, from the late 1980s, that the PC was going to win because it was not under the control of a single corporation. Indeed, it was winning despite IBM beginning to fumble the ball. All corporations, in the long run, tend toward mismanagement that can totally sink a platform if it's under their control. It happened to Commodore, Atari, Symbolics, SGI, it almost happened to Apple in the 90s, and it looks like it might happen to Raspberry Pi in the near future. But when there's an open standard on which companies can compete even as they interoperate, the platform thrives, and things unimaginable to its originators can be built upon it.

RISC-V brings this quality to the chip architecture and ISA, the same way Linux brought it to operating systems and the combination of x86, ISA/PCI bus, and BIOS/UEFI interface brought it to the general system design. It's not perfect, and no implementation of it is going to be perfect, but RISC-V really is gonna change everything.


I regret I have only one upvote to give.


> The strength is in the diversity, in the freedom to experiment and innovate, in the shared software ecosystem

Does RISC-V have a standardized device discovery mechanism? One critique of ARM (?) I've heard is that for every maker of chips one has to basically re-invent the wheel on bus discovery and then on-bus device discovery.


Device tree seems to be standard fare, just like ARM.


I haven't read anything, but if he said "I fear that" rather than "I think that", that's the difference between cautioning against and predicting. The headline makes it seem like he was predicting.


Linus continues to be wrong about CPU hardware now, just as he was back then. See for example in the past his continued insistence that ISA does not matter at all for performance. One need only look at the huge single-threaded performance and efficiency of Apple M cores (10-wide instruction fetch) relative to any x86 to see that that statement is not true. I don't think I would take his hardware assertions as gospel.


> See for example in the past his continued insistence that ISA does not matter at all for performance. One need only look at the huge single-threaded performance and efficiency of Apple M cores (10-wide instruction fetch) relative to any x86 to see that that statement is not true.

Your argument is that ISA impacts performance... and your example is ARM, which went from worse performance than x86 across the board to having an implementation that sometimes beats x86 because one company made a better implementation than everyone else? That seems like a great example showing that ISAs aren't the primary factor in performance.


Apple's chips utterly dominate in performance per watt versus x86. Arm64's simple fixed-length instruction set is absolutely central in this. Qualcomm now exceeding Intel in their mobile chips is further evidence ISA is critically important. In the data center this is also happening with Graviton 4 and eventually Grace Hopper.

Anyway, my point isn't that implementation does not affect performance. Of course it does. My point was that Linus has asserted ISA is irrelevant to performance, and that only implementation matters. That assertion does not hold up.


> Arm64's simple fixed-length instruction set is absolutely central in this.

Note that doing variable instruction length is not an issue per se, as proven by the denser, yet still easy to decode, RISC-V.

The way x86 does it, however, is very costly to decode and overall awful.


Yeah, there's a lot of nuance there. Not all variable-length ISAs are a problem. x86 definitely is.


TLDR: RISC-V 2-byte and 4-byte dual instruction lengths have no significant decoding scaling / latency problem at any practical CPU width.

Two instruction widths give big benefits in reduced code size, better cache utilisation, less instruction fetch bandwidth, etc., at very low cost.

RISC-V instruction length decoding requires you to look at only 2 bits per 2 bytes of code (an AND gate on those two bits reduces them to 1 bit). So then decoding, say, 32 bytes of code -- which will contain between 8 and 16 instructions with an average around 11 -- requires looking at only 16 signals. That's close to lookup table territory.
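
In software terms, the length rule is just this (a toy model in C, ignoring the reserved longer-than-32-bit encodings, which no ratified extension uses):

  #include <stdint.h>

  /* RV32/RV64 with the C extension: an instruction is 32 bits long iff the
     low two bits of its first 16-bit parcel are both 1; otherwise it is a
     16-bit compressed instruction.  One AND gate per 2 bytes of code. */
  static int insn_len_bytes(uint16_t first_parcel)
  {
      return ((first_parcel & 0x3) == 0x3) ? 4 : 2;
  }

  /* Count the instructions starting inside a 32-byte fetch group that
     begins on an instruction boundary: between 8 and 16 of them. */
  static int count_insns(const uint16_t parcels[16])
  {
      int n = 0;
      for (int off = 0; off < 16; off += insn_len_bytes(parcels[off]) / 2)
          n++;
      return n;
  }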

In fact the way to do it (at least what I came up with) is to make a decoder for each group of 4 aligned bytes, plus optionally the previous 2 bytes (so 6 bytes in total).

Each block can decode either one instruction (an aligned 4-byte instruction, or an unaligned 4-byte instruction starting in the previous block, followed by a second unaligned 4-byte instruction which is left for the next decode) or two instructions (two 2-byte instructions, or an unaligned 4-byte instruction started in the previous block, followed by a 2-byte instruction).

Each block receives a signal saying whether or not the previous block consumed the last two bytes, and outputs a similar signal to the next decoder block. So you can daisy-chain these, with 8 such decoder blocks parsing 32 bytes of code.

That's not a HUGE delay.

But you can do better.

Each block generates two signals, one saying whether the last two bytes of the group are consumed if the previous decoder block consumes all of its bytes, and a second signal saying whether the last two bytes of the group are consumed if the previous decoder block does NOT consume its last two bytes of code.

These signals can be thought of as "generates" and "propagates" signals, analogous to a full-adder's carry output.

These two signals can be generated in parallel, and INDEPENDENT OF all previous blocks. You can even decode the one or two complete instructions both possible ways, using three decoders: a 4-byte only decoder starting at -2 bytes, a 2-byte only decoder starting at +2 bytes, and a decoder that can do both 2-byte and 4-byte instructions starting at +0. ALL INDEPENDENT of any previous instruction decoder blocks.

And then you can use the same technique as in a carry-lookahead adder to select which decoded output (one or two instructions) you use from each decoder block, with the same scaling properties as in an adder.

Except that if you're decoding only 32 bytes of code per cycle then you're looking at the analogue of an 8 bit adder, where carry-lookahead is barely worth it. The technique comes into its own in 16, 32, 64, 128 bit adders. That's equivalent to decoding 64, 128, 256, 512 bytes of code PER CYCLE, which is far beyond the bounds of a practical CPU because of basic block lengths and predicting multiple branches.
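
Here's a toy C model of just those boundary signals (not the instruction decoders themselves), resolving them with a ripple loop where the hardware would use the lookahead network:

  #include <stdint.h>

  static int is32(uint16_t parcel) { return (parcel & 0x3) == 0x3; }

  /* One 4-byte block = two 16-bit parcels p0, p1.  clean_start = 1 means
     the previous block ended on an instruction boundary, so decoding
     starts at p0; clean_start = 0 means p0 is the tail of a 4-byte
     instruction begun in the previous block, so decoding resumes at p1.
     Returns 1 if this block also ends on an instruction boundary. */
  static int block_boundary_out(uint16_t p0, uint16_t p1, int clean_start)
  {
      if (clean_start && is32(p0))
          return 1;                /* a 4-byte instruction fills the block */
      return is32(p1) ? 0 : 1;     /* p1 either spills over or is 2 bytes  */
  }

  /* 32-byte fetch group: every block computes both possible answers
     independently (the generate/propagate step); a final pass -- a ripple
     here, a lookahead network in hardware -- selects the real ones. */
  static void mark_boundaries(const uint16_t parcels[16], int boundary[8])
  {
      int if_clean[8], if_spill[8];
      for (int b = 0; b < 8; b++) {                    /* all in parallel */
          if_clean[b] = block_boundary_out(parcels[2*b], parcels[2*b+1], 1);
          if_spill[b] = block_boundary_out(parcels[2*b], parcels[2*b+1], 0);
      }
      int clean = 1;                        /* fetch group starts aligned */
      for (int b = 0; b < 8; b++) {
          clean = clean ? if_clean[b] : if_spill[b];
          boundary[b] = clean;
      }
  }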

The cost is that each 4 bytes of code needs three decoders while producing at most two instructions, though the 3rd decoder only has to deal with the limited 2-byte C extension instruction set (and only one of the other two decoders needs a C decoder).

Note: you can save on decoders, at the expense of a little latency, by having only one full-strength (2-byte and 4-byte) decoder and one 2-byte only decoder, and using the "carry" input to MUX either bytes -2..1 or bytes 0..3 into the full-strength decoder.

Yes, it's more complex than an AArch64 decoder of the same width, but it's completely manageable and massively simpler than x86.


Yeah, it's definitely manageable when the instructions don't vary in size from 1 to 15 bytes. Interesting details.


I'm not sure he's talking about the cores themselves. What's more likely IMO is the ecosystem choices like preboot environment.


None of Torvalds' direct quotes use the word "fear" at all

Torvalds is quoted as saying “I suspect…”

The article author qualified such with the use of “fear”

We need a quiz before comments are posted: did you read the article? And ask for written context about a random segment.


I watched the whole discussion

https://m.youtube.com/watch?v=cPvRIWXNgaM&t=450

> My fear is that...

Get off my lawn


> RISC-V is an open-standard ISA for processors that is slowly gaining traction, especially in China, where some tech companies are using it to bypass America’s sanctions on the country.

I really don't like that narrative, it doesn't even make sense as MIPS already exists and is more than sufficient for the applications RISC-V is being used in. The same could be said about Linux but it's obvious how ridiculous it would be to bring that up.


>> […] where some tech companies are using it to bypass America’s sanctions on the country.

> I really don't like that narrative, it doesn't even make sense as MIPS already exists and is more than sufficient for the applications RISC-V is being used in.

Is it a "narrative" if some Chinese tech companies are actually using it for such a purpose?

And even MIPS-the-company is moving to RISC-V:

> Computer architecture courses in universities and technical schools often study the MIPS architecture.[9] The architecture greatly influenced later RISC architectures such as Alpha. In March 2021, MIPS announced that the development of the MIPS architecture had ended as the company is making the transition to RISC-V.[10]

* https://en.wikipedia.org/wiki/MIPS_architecture

> In May 2022, MIPS previewed its first RISC-V CPU IP cores, the eVocore P8700 and I8500 multiprocessors.[9] In December 2022, MIPS announced availability of the P8700.[10]

* https://en.wikipedia.org/wiki/MIPS_Technologies


> Is it a "narrative" if some Chinese tech companies are actually using it for such a purpose?

I'm not saying that the narrative is wrong, it's just absurd to bring it up when introducing RISC-V. The same could be said about Linux but it's very clear how WRONG it would be to introduce Linux by saying "Linux is being used to bypass US sanctions".

> And even MIPS-the-company is moving to RISC-V:

I didn't know that, but what I meant to say is that MIPS ISA is open and could be used instead. There's nothing special about RISC-V.


>> I didn't know that, but what I meant to say is that MIPS ISA is open and could be used instead. There's nothing special about RISC-V.

Market share. RISC-V is getting attention around the world where MIPS is not. The software ecosystem around RISC-V is getting contributions from everywhere while MIPS (even worse, modified China MIPS) is really just in China. They will benefit from using something that's more widespread. So while it may not be technically special, it is "better" in some sense.

As an example, both folks in China and at Google have made efforts to bring Android to RISC-V while the same can not be said for MIPS.


> what I meant to say is that MIPS ISA is open and could be used instead. There's nothing special about RISC-V.

It is well documented that the researchers at Berkeley who invented RISC-V wanted to use the MIPS ISA but were told they would have to pay several million dollars for the right, wouldn't even get RTL for a CPU core for that, and would not be allowed to publish anything they implemented themselves.

So, no, people "can't just use the MIPS ISA".

With RISC-V you don't have to ask anyone's permission, and don't have to pay anyone anything unless THEY developed a core and you choose to license it from them rather than develop your own or use a free one.

There do happen to be some technical advantages to RISC-V, but they are by far overshadowed by the freedom advantage.


It feels as if it is giving a negative slant on the technology when it shouldn't be seen as such.


I’m not sure it’s a negative slant so much as one of the primary reasons it’s seeing adoption.

Even if it has little to do with the benefits of the architecture itself.


When XuanTie released their cores, a dozen manufacturers started putting them in designs (for better or worse). I don't recall any such design being released under such permissive terms in the MIPS world.


Imagine how much money and risk could be averted if a ton of companies came together to work on open core designs. We could fully commoditize the CPU market and lower costs across the board just like Linux did for server OSes or Android did for phones.


One thing that I learned when becoming a senior engineer is that if the layer above you is profitable enough you should build substandard implementations in your layer that match the abstractions of the above layer. Overall the system will perform much better than if you optimize for whatever the bottleneck on your layer is.


There sounds like a decent amount of insight here, but it's a little vague for me. Could you give a bit more detail, possibly a worked example?


Sure.

Assume you're writing an OS from scratch that will be used by people, rather than by servers in a data center.

Building the system to be as responsive as possible at the lowest possible latency is what you should do. Even if you're burning half your CPU cycles on busy work waiting for user input. Ignore how much more throughput you could have with a traditional OS. Users get a lot less annoyed by a job that takes 20 minutes instead of 10 than by a window which takes 3 seconds to resize.

This goes against every instinct that engineers have, because you're not optimizing for the scarce resource on your layer (CPU cycles, in the example above).

The way I got around that brain damage was to find what metric matters to the ultimate user and trace it back to the layers I had control over, then optimize the hell out of it.


> Even if you're burning half your CPU cycles on busy work waiting for user input.

Of course then you slam into a wall when people want it to run on battery-powered devices.


You can't be everything for everyone. I'd imagine you'd still be using less power than the average JS hot mess today though.


So, optimize for user experience rather than efficient use of resources?


More like optimize towards the quality of the whole product rather than just the quality of your isolated component


Your user might be another application whose bottleneck is much more expensive than your own bottleneck.


Thanks, this comment, and comment 40986327 from afiori cracked it for me.


Apple added hardware instructions to M1 chips just for the Obj-C reference counting use case. Claimed up to 15% boost. Technically this is a violation of separation of concerns, but if it delivers results then it delivers results


If those instructions act like a macro that is faster than the expanded instructions, then it is not really a violation, just a tradeoff.

If the instruction were to have some magical one-off effect it would be a violation.



Any offers on how long it will be before we're talking about Windows RISC-V machines?

I just upgraded my 6-year-old Lenovo X1 and was pretty close to pulling the trigger on an ARM (Snapdragon) machine, but it felt too unproven and I knew there would be software issues for some of the things I needed to do, so I'm sticking with x86 (AMD).

I can easily see the decision next time going the other way, or even RISC-V if the software is there to match.


We know Microsoft is working on it, but that's about it.

All bets are off as for when, but I would expect it to show up alongside capable hardware.


>> Any offers on how long it will be before we're talking about Windows RISC-V machines?

I'm betting Android will be there first.


Which mistakes is he referring to concretely?


Not exactly Nostradamus to predict things won't go perfectly. But that's the thing about history: it gives you a chance not to make the same mistakes. Will RISC-V have learnt from history?


It sounds to me like complex instructions are unavoidable unless the use case is restricted to microcontrollers. A modern ISA should have CPU feature flags that allow implementers to disable complex and expensive instruction sets.

Would it be fair to say that ARM, for example, is a CISC/RISC hybrid, where ARMv8-A is full-on CISC but ARMv8-M is RISC proper? But that's what I'm suggesting anyways.


You are conflating complex with niche.

RISC embraces niche while avoiding complex.

iAPX432 has examples of complex with stuff like OOP, garbage collection, and even data structures baked into the ISA. These instructions are very complex and take massive numbers of cycles to complete.

In contrast, a bunch of bit manipulation instructions are niche, but any one of them can complete in just 1 or maybe 2 cycles.

Same with something like AES, where even x86 has a RISC-like approach: you call a couple of simple instructions to set up, then you call a simple instruction (1-2 cycles) for each round, then call some simple finalization instructions. Each instruction is niche, but not complex.
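
Roughly what that looks like with the x86 intrinsics (a sketch that assumes the 11 AES-128 round keys are already expanded; the key expansion itself is a handful of similarly simple AESKEYGENASSIST steps):

  #include <wmmintrin.h>   /* x86 AES-NI intrinsics; build with -maes */

  /* AES-128 block encryption: one simple AESENC per round. */
  static __m128i aes128_encrypt_block(__m128i block, const __m128i rk[11])
  {
      block = _mm_xor_si128(block, rk[0]);              /* initial whitening */
      for (int round = 1; round < 10; round++)
          block = _mm_aesenc_si128(block, rk[round]);   /* rounds 1..9       */
      return _mm_aesenclast_si128(block, rk[10]);       /* final round       */
  }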


Maybe I was conflating the two, but my point was cost of implementation. Niche or not, minimizing hardware cost is important to consumers and vendors alike. On the other hand, for general-purpose computing, complex instructions can provide a performance boost, as do "niche" instruction sets that may not necessarily be complex, as you put it. Since ARM doesn't actually make the hardware, it should let vendors select which feature sets they want to implement based on their requirements; ARM programs can then test the value of these static registers before attempting conditional execution.


Did you mean the other way around? -m is intended for simple(r) microcontrollers whereas -a is intended for application processors.


sorry, yeah I meant the other way around.


the "futuristic world if <X> was true" meme fully applies to the core idea that sw guys don't get the hw (and vice versa). even more true for the business people that leave no room for bridging the gap.

abstractions exist for a reason but it is refreshing to see some examples where researchers leverage the underlying hw architecture (see mamba architecture that works around GPU design - https://news.ycombinator.com/item?id=38932350). but we can do so much more instead of pushing the new version of the existing process node.


1. Want a simpler ISA

2. Build it

3. Realize adding a complex instruction that's weird can really boost performance in key use cases

4. Repeat 3 until someone thinks your ISA is overcomplicated and makes a new one


And not only for Instruction Set Architectures. I feel this might be the case for software too.

1. Want a simpler application

2. Write it

3. Realize adding complex code that's weird can really boost performance

4. Repeat 3 until someone thinks your application is overcomplicated and makes a new one

I guess the moral here is that computers are complicated and trying to avoid complexity is hard or infeasible.


Except it's not performance usually, it's just features. And then it gets bloated. And someone thinks they don't need all that crap.

But turns out they did.


I feel like there are massive industries that exist just because of this


Everyone uses about 5% of the features of their word processors, but it is a different subset for everyone, so all features are needed by someone and most get equal usage.


I do not think that is true. There are definitely more and less commonly used features. Everyone uses basic formatting, but only a small minority of users use index or bibliography features. The features might matter a lot to that small minority, but they are not anything like equally used.


There is a small minority of features that everyone uses. It quickly falls off.


Ehh, the issue is features tacked on without regard to existing ones. Lotta apps like that end up with multiple ways to do the exact same thing but with very slightly different use cases


Jira has entered the chat...


Or programming language design. However, I think during the process, things are still distilled so that new common patterns are incorporated. I am sure null-terminated strings were not a particularly bad idea in the 70s given the constraints developers faced at the time. It's just that later we have different constraints and have gained more experience, thus finally realizing that it is an unsafe design.


I expect it's basically always been understood that null-terminated strings were unsafe (after all, strncpy has existed since the 70s [1]), more just that the various costs of the alternatives (performance, complexity, later on interop) weren't seen as worth it until more recent times. And it's not like they didn't get tried— Pascal has always had length-prefixed strings as its default, and it's a contemporary of C.

[1]: https://softwareengineering.stackexchange.com/a/450802


We are coming full circle back to Pascal strings, just that now we don't mind using 32 or 64 bits for the length prefix. And in cases where we do mind we are now willing to spend a couple of instructions on variable-length integers.
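
A minimal sketch of that layout in C (purely illustrative):

  #include <stddef.h>

  /* Length-prefixed ("Pascal-style") string: the length travels with the
     data instead of being rediscovered by scanning for a NUL byte. */
  struct pstring {
      size_t len;     /* the 32- or 64-bit length prefix                  */
      char   data[];  /* flexible array member, not necessarily NUL-ended */
  };

Modern slice types (Rust's &str, C++'s std::string_view) are the pointer-plus-length variant of the same idea.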

But in the bigger picture the wheel of programming languages is a positive example of reinvention. We do get better at language design. Not just because the requirements become more relaxed due to better hardware and better compilers, but also because we have gained many decades of experience of which patterns and language features are desirable. And of course the evolution of IDEs plays a huge role: good indentation handling is crucial for languages like Python, and since LSP made intelligent auto-complete ubiquitous, strong typing is a lot more popular. And while old languages do try to incorporate new developments, many have designed themselves into corners. That's where new languages can gain the real advantage, by using the newest insights and possibilities from the get-go, depending on them in standard library design and leaving out old patterns that fell out of favor.

No modern language in its right mind would still introduce functions called strstr, atoi or snwprint. Autocomplete and large screens make the idea of such function names antipatterns. But C can't easily get rid of them.


I think saying "software" is much too broad, and you have to narrow the comparison to a small subset of software development for it to make sense. With software, typically you're dealing with vague and changing requirements, and the hope is that if you build five simple applications, four will be basically adequate as written, needing only incremental feature enhancements, and only the fifth needs significant work to rise to the emerging complexity of the problem. (The ratio can be adjusted according to the domain.)

In this case they're creating a new solution to a problem where all previous solutions have ended up extremely complex, and the existing range of software currently running on x86 and ARM gives them a concrete set of examples of the types of software they need to make fast, so they're dealing with orders of magnitude more information about the requirements than almost any software project.

The closest software development equivalent I can think of would be building a new web browser. All existing web browsers are extremely complex, and you have millions of existing web pages to test against.


Yeah, definitely this. We think that the specs are pretty much baked after a year or two of shipping, so the focus is now on making things faster, which requires very complex algorithms, but don't worry, once we get it done, it won't have to change anymore! Right? ... then new use cases come in. New feature requests come in. We want to adapt the current code to the new cases while still covering the existing ones and also maintaining all the performance boost we gained along the way. But the code is such a mess that it is just not feasible to do so without starting from scratch.


Your understanding of programming is superficial to the point of being unfixable by explaining why :(


Explain why please


Exempli gratia, the last statement is approximately equivalent to "I guess computer science is infeasible."

But it is not; the fact that Debian UNIX systems running programming languages of the sort that are used today exist proves that we have already managed a significant amount of complexity; compare that to an infinite-size ENIAC for programming. You would certainly rather use the Debian system than be stuck with the latter, given that you magically did NOT have to implement the existing systems for it; you only had to write your whole program from scratch, either on the tape or on the Debian system.

There are many other layers of issues and problems with the comment, but understand that no matter how intellectually honest and kind you are, you cannot reply to every person who is wrong saying so & telling them WHY. Often the "you are wrong" may be more valuable, mutually. I do not want to argue about this though.


...or you can cheat and write the weird code as a separate, child app.


Though I believe in RISC-V's case what will happen is that every vendor will have that realization at the same time, not tell anyone, and make an extension, and now there are five different incompatible encodings for the same operation.


And that doesn't matter, because:

- Such custom extensions live in custom extension space.

- The software ecosystem that must work across vendors will use none of them.

- If these extensions actually do something useful, the experience from them will be leveraged to make a standard extension, which will at some point make it into a standard profile, and thus adopted by the software ecosystem.


Open source fundamentally changes that situation. All you need is a maintained version of GCC/LLVM that supports your processor and you'll have a distro that supports your needs. Especially if it's just about some performance-boosting instructions. It's not going to be an issue, we really aren't in a binary world anymore for the most part.


RISC-V had a fantastic opportunity to be the one true microcontroller ISA, but the attempts to become an "everything, everywhere, all at once" ISA have been fascinating to see throughout the whole process.


RISC-V is not an ISA, but a family of ISAs. That is why it has a small core and plenty of extensions. So instead of having a dozen completely different ISAs you now have a dozen almost-the-same ISAs. It was meant to be forked and customized for needs.

Ideally, your needs can be met by selecting from the basket of already ratified extensions. So what if no one else has the exact set of extensions you do? A lot easier to support if you share a common core with already supported chips.

Look at how many programming languages look like C and how much easier it is to pick them up than, say, APL. Think of RISC-V as the C of the future.


It seems it's very hard to avoid trying to become everything everywhere. It looks easy from our point of view, but when you're at the wheel, it looks like the most reasonable course of action.


RISC-V will ALWAYS be simpler than x86 because the core of the ISA isn't a disaster. Around 90% of all x86 instructions are just 12 major instructions. x86 gets those instructions very wrong while RISC-V gets them very right. This guarantees that RISC-V will ALWAYS get 90% of code right.

The main benefit of variable-length encoding is code density, but x86 screwed that up too. An old ADD instruction was 2 bytes. Moving to AMD64 made that 3 bytes and APX will make it 4 bytes. RISC-V can code most ADDs in just 2 bytes. In fact, average x86 instruction length is 4.25 bytes vs 4 bytes for ARM64 and just 3 bytes on average for RISC-V.

x86 made massive mistakes early on like parity flag (there today because Intel was trying to win a terminal contract with the 8008), x87, or nearly 30 incompatible SIMD extensions. RISC-V avoided every one of these issues to create actually good designs.

Lessons from previous RISC ISAs were also learned so bad ideas like register windows, massive "fill X registers from the stack", or branch delay slots aren't going to happen.

I hear the claim that "it'll become overcomplicated too", but I think the real question is "What's left to screw up?"

You have to get VERY far down the long tail of niche instructions before RISC-V doesn't have examples to learn from and those marginal instructions aren't used very much making them easier to fix if needed anyway. This is in sharp contrast with x86 where even the most basic stuff is screwed up.


> x86 gets those instructions very wrong while RISC-V gets them very right

That is heavy over-exaggeration.

> An old ADD instruction was 2 bytes.

By "old" you mean.. 32-bit adds with 8 registers to choose from. Which are gonna be the majority of 'int' adds. Unfortunately, 64-bit, and thus pointer, 'add's do indeed require 3 bytes, but then you get a selection of 16 registers, for which RISC-V will need 4-byte instructions, and on x86 a good number of such 'add's can be "inlined" into the consumer ModR/M. (ok, for 'add' specifically RISC-V does have a special-cased compressed instruction with full 5-bit register fields (albeit destination still same as a source), but the only other such instruction is 'mv'. At least x86 doesn't blatantly special-case the encoding of regular 'add' :) )

> nearly 30 incompatible SIMD extensions

30, perhaps, but incompatible? They're all compatible, most depend on the previous (exceptions being AMD's one-off attempts at making their own extensions, and AVX-512 though Intel is solving this with AVX10) and routinely used together (with the note of potentially getting bad performance when mixing legacy-prefix and VEX-prefix instruction forms, but that's trivial to do, so much so that an assembler can upgrade the instructions for you).

RISC-V isn't cheaping out on vector extensions either - besides the dozen subsets/supersets of 'v' with different minimum VLEN and supported element types (..and also 'p', which I've still heard about being supported on some hardware despite being approximately dead), it has a dozen vector extensions already: Zvfh, Zvfhmin, Zvfbfwma, Zvfbfmin, Zvbb, Zvkb, Zvbc, Zvkg, Zvkned, Zvknhb, Zvknha, Zvksed, Zvksh

> massive "fill X registers from the stack"

https://github.com/riscv/riscv-isa-manual/blob/176d10ada5d8c...

Granted, that's optional (though, within a rounding error, all of RISC-V is) and meant primarily for embedded, but it still is a specified extension in the main RISC-V spec document.

Now, all that said, I'd still say that RISC-V is pretty meaningfully better than x86, but not anywhere near the amount you're saying.


I'd love to see a ground-up ISA that takes a CISC-style approach without all the memory hazards (and other faults) baked into it. Decode logic is "cheap" these days relative to other things on the CPU so why not lean into it?


Maintaining separation of concerns is the complete opposite of a CISC-style approach. If you are implementing garbage collection or OOP in hardware, you are bound to conflate things in weird ways just like this becomes unavoidable in software too.


This requirement is captured by MASKMOVDQU. It's clearly an initialism.


Reminds me of the XKCD about standards.

1. There are 5 different standards.

2. Some well-meaning person says "this is ridiculous, this should be standardized"

3. There are 6 different standards.


Time/date being the best example of this. I think I've seen every possible variation by now, except maybe putting the year in between the month and day. Drives me absolutely crazy.


A man finds a shiny bottle on the beach.

He rubs it, and a genie emerges.

  Genie: "I shall grant you one wish, mortal."

  Man: "I wish to bring my grandmother back to life."

  Genie: "Apologies, but reviving the dead is beyond even my vast powers. Perhaps you'd like to make a more... achievable wish?"

  Man: "Alright then, I wish for worldwide adoption of the ISO 8601 date format."

  Genie: "...Uh, where did you say your grandmother was buried?"


Please, at least ask for RFC 3339.


They had one job: make it similar to ARMv8-A but not too similar to give ARM any ideas. Look at SSE4/AVX2 and NEON and make sufficiently similar vector extensions to that as well.

So far RISC-V has failed miserably at both. "Just solve it in decoder", "just fuse common idioms", "just make vsetvli fast". Sure. Not like designers of ARMv8 or X86S had decades of prior experience to make better decisions.


"Just solve it in decoder" is a perfectly valid approach for now, and if this turns out to be unrealistic then they can go ahead and add a conditional move extension or whatever it is.

It’s far easier to add things to interfaces than remove them.

One mistake they have made that may bite them later is their compressed instructions extension. It takes up far too much of the instruction set space for the amount of utility it provides, even on microcontrollers in my opinion. It also introduces lots of edge cases around word alignment across cache and memory protection boundaries.


Point of order!

The RISC-V C extension takes up 75% of the opcode space.

The 16 bit Thumb1 instructions take up 87.5% of the ARMv7 opcode space.

Not to mention predication taking up 92.75% of the classic ARMv1-ARMv6 instruction set.

RISC-V has comparatively twice as much "32 bit" opcode space as ARMv7, and 3.45x more than classic ARM A32.


Point of order! The core of the RISC-V ISA was already designed by the time ARMv8-A was announced on 27 October 2011. I don't think anyone anticipated it -- certainly I remember it being a complete surprise to me.

Here is the RISC-V design as at 13 May 2011:

https://people.eecs.berkeley.edu/~krste/papers/EECS-2011-62....

Compare that to the design eventually ratified in 2019 and you'll find some changes in details, but not in concept.

- instructions encodings changed to e.g. put rd on the right next to the opcode and the MSBs of literals on the left.

- J and JAL are now combined

- JALR had three versions (in func3) to distinguish call / return / others. This is now done as a convention which registers are used in rd and rs1. (It matters only for a return address prediction stack, an advanced feature)

- RDNPC (get address of next instruction) later turned into AUIPC (which includes adding an offset to the PC)

- on the other hand, it already has the very RISC-V feature of loads using rd and rs1 while stores use rs1 and rs2 (and the offset split up differently). Most other ISAs (including all Arm ISAs) use the same register fields for loads and stores and the offset in the same place, resulting in either loads using rs2 as the destination or stores using rd as a source!

Seriously, other than the changes mentioned above, this document looks just like the final RV32G/RV64G spec from ratification in 2019. (I didn't check the floating point instructions in detail).

As for SIMD -- RISC-V was designed as the control processor for an advanced length agnostic vector processor. F*ck fixed length SIMD.


> F*ck fixed length SIMD.

There is the issue. It goes completely against how the current landscape of SIMD support looks and what everything is optimized around (it's fixed-length SIMD; variable length has huge issues getting mapped onto anything that isn't auto-vectorization).

Now, I'm not a fan of RISC-V so, if anything, I encourage their way of doing RVV. Makes it more likely to die.


I don't see how it will die when it can imitate fixed length SIMD very easily. Not to mention that Arm too is pushing length-agnostic vector computing -- not very hard so far, admittedly, and they haven't yet departed from the 128 bit NEON vector register length. The very first SVE implementation, by Fujitsu, used 512 bit vector registers. And was the fastest supercomputer in the world for a while, so the concept is not obviously rubbish.


RVV can imitate NEON, but will it perform anywhere near reasonably on hardware? With NEON, you can be sure that the hardware will be optimized around 128-bit operations and thus comfortably use it for such. Whereas RISC-V hardware with 512-bit vectors might only get quarter throughput if you only need the low 128 bits (and that could end up slower than doing scalar code! whereas on ARM you can be pretty damn sure that 1×128-bit NEON won't be slower than 2×64-bit scalar or some SWAR).

Much as I like RVV, fixed-width use-cases still exist and are pretty important for CPU SIMD; scalable vectors work well for very-many-iteration embarrassingly parallel loops, but those are also things most suited for being moved to a GPU. Where CPUs have the most potential is in things with some dependency chain or small loops, for which scalable vectors largely just add questionability of performance.


Obviously using 128 bit vector length on a machine with 512 bit vector registers is not using all the performance you paid for, but it's not going to be slower than using a machine with 128 bit vector registers! Parallel operation on multiple data items is parallel operation on multiple data items.

The developers of the dav1d software AV1 decoder found that RVV on a Kendryte K230 [1] performed no worse than NEON on an A53 on their small fixed-size transforms on video CODEC blocks.

https://www.youtube.com/watch?v=asRnBcn5VKs&t=9m40s

[1] THead C908 core, very similar to the old C906 core found in e.g. the $3 Milk-V Duo, but dual-issue and with the RVV updated from draft 0.7 to ratified 1.0


> but it's not going to be slower than using a machine with 128 bit vector registers!

It very well could; for example, RVV hardware with VLEN=512, one 512-bit vector ALU, and three scalar ALUs (float or integer), would have 512 bits/cycle with vector and 192 bits/cycle for scalar (i.e. vector at VLMAX is beneficial!), but using only 128 bits of the vectors would end up with just 128b/cycle. And this is just regular ALU instructions, vrgather can get significantly worse, and it is extremely important for most fixed-width stuff.

This of course wouldn't ever happen with actually-128-bit hardware as it wouldn't ever make sense to have such vector vs scalar distribution for its guaranteed-128-bit workloads. And, though perhaps my previous example is a tad extreme, with higher VLEN it gets more and more reasonable to have a larger gap between scalar and vector ALU/port counts, and maintaining 'vector_alu_count*2 ≥ scalar_alu_count' gets less and less reasonable.

K230 is irrelevant here - it has VLEN=128, which is the optimal thing for being used as a fixed-width system / imitating NEON. And, more generally, looking at just one RVV implementation cannot give any insight on how well the "scalable" part of RVV works, as that only applies when running the same code across multiple different implementations with different VLEN.

Could high-VLEN hardware still be made such that 128-bit usage is never strictly worse than scalar? Perhaps. But there exist incentives under which it might not make sense to have such, which have no possible equivalent on NEON/x86, and apply specifically when the "scalable" part of RVV is actually used for its intended purpose in hardware.


We can at least say that the more general RVV ISA doesn't slow down at least one low end RVV implementation vs NEON. Anything more is just FUD. Anyone building superscalar scalar CPUs will very likely be superscalar on the vector side also -- as even the THead C910 from 2019 is.

As for the longer vectors, no way to say for sure until we have such hardware, and even then different vendors will have different goals and very likely different quality of implementation.

At the time they ported that code the K230 was the only RVV 1.0 chip available.

The SpacemiT K1/M1 with 256 bit vector registers has now been available in the Banana Pi BPI-F3 for a few weeks, and is about to ship (this month they say) in the Sipeed Lichee Pi 3A, Milk-V Jupiter, SpacemiT Muse Pi. So no doubt the dav1d people (and others) will be getting busy with that.

Around the end of the year we'll have SG2380 machines with SiFive X280 cores with 512 bit vector registers. That should be very interesting, as SiFive have been designing those since at least 2018 and the quality of implementation should be very good.


I mean, we are on a post that's talking about potential future mistakes, with "fear" literally in the title; indeed no one can know if such will materialize, but it's well worth noting and discussing such.

> and even then different vendors will have different goals and very likely different quality of implementation

And that's, like, the entire problem. RVV's design allows for otherwise-reasonable designs where fixed-width usages suffer (whereas there's literally no reason to do equivalent things on NEON/x86). And given that I doubt that the designers of such chips would be making legal mandates to not run general-purpose software on such, were such hardware to become real, general-purpose software could easily end up being expected to work well on it.

Yes, said RVV design does in fact allow having those different goals, which is good. ..For those goals. Pretending no one's gonna try to use hardware for anything other than its strictly intended purpose is nothing more than wishful thinking.

SG2380 / X280 are already known to have a very horrible vrgather, which is gonna be extremely awful for fixed-width stuff, and even some scalable stuff too. And SG2380 is rather explicitly for general-purpose computing.


You're arguing that a world in which some companies make Ferraris and some companies make Bambinos is a bad world and "someone" should make sure only things between a Corolla and a Camry exist.

Which is, basically, what Intel or Arm do, at any given time.

I prefer the possibility that some people make amazing products, and some people make underperforming (but possibly much cheaper) products, and the market decides which ones to buy. The most important thing is that they can all use the same roads (software), even if they should stick to different lanes.

> SG2380 / X280 are already known to have a very horrible vrgather, which is gonna be extremely awful for fixed-width stuff, and even some scalable stuff too.

I don't believe that to be the case. IIRC X280 does vrgather in ~1 cycle for vector lengths up to the datapath width (256 bits, 32 bytes), so anything corresponding to your NEON or AVX2 cases is going to be just fine. As I'm sure you know the computational complexity of vrgather is proportional to vl^2 so no one is going to do single-cycle vrgather at LMUL=8 => 4096 bit sizes, or even probably at 512 bit.
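
(For reference, a minimal model of the vrgather.vv semantics being discussed, for 8-bit elements and ignoring masking and tail policy: each of the vl destination elements can select any of the VLMAX table elements, which is where the quadratic mux cost comes from.)

  #include <stddef.h>
  #include <stdint.h>

  /* vd[i] = vs2[vs1[i]] if the index falls inside the register group,
     else 0; the table operand vs2 is readable up to VLMAX regardless
     of vl. */
  static void vrgather_vv_e8(uint8_t *vd, const uint8_t *vs2 /* table */,
                             const uint8_t *vs1 /* indices */,
                             size_t vl, size_t vlmax)
  {
      for (size_t i = 0; i < vl; i++)
          vd[i] = (vs1[i] < vlmax) ? vs2[vs1[i]] : 0;
  }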

You could use a 256 bit datapath to implement 512 bit vrgather in 4 cycles, and LMUL=8 (4096 bit) in 256 cycles but that's not the intended use for the X280 so they didn't spend those transistors.

At longer sizes X280 does 1 element per cycle, which is still a significant speedup over what you could do on the dual-issue in-order scalar CPU.

> And SG2380 is rather explicitly for general-purpose computing.

The *P670* cores on the SG2380 are for general-purpose computing. The X280 cores are for media processing and similar tasks that don't typically use large gather operations.


My understanding is that the X280 performs at one cycle per element, and that there is supposed to be a faster gather for LMUL=1/2. As you mentioned the data-path width.

If that is true, then their higher LMUL implementations are basically a design mistake. They could've used the LMUL=1/2 vrgather to implement LMUL=1 and above by calling it repeatedly (LMUL*2)^2 times, that should add almost no extra die area. This is what the C908 and X60 seem to do.


To be fair, 1 cycle/element isn't that bad at, say, e64,m8 - 64 cycles - whereas extending mf2 to m8 would be 16^2=256 shuffles (and there's still some hardware cost involved from having to merge together the shuffles based on the index bits, and of course scheduling it all). But yeah it's pretty bad for LMUL=1, being slower at just vl>4, and still has worse worst-case timings (e8,m8 still being 256 shuffles, where VLMAX=512).


Yeah, C908 and X60 have a 256 cycle LMUL=8 vrgather for any SEW.

I've written what I expect sane implementations to do here: https://gitlab.com/riseproject/riscv-optimization-guide/-/is...

This didn't take into account higher SEW, but I could imagine an implementation that has a native LMUL=2 implementation for e32/e64.

The number of connections for a specific SEW and VLEN should be (VLEN/SEW)(SEW/8); however, the regularity and distance probably also impact the final implementation cost.


There is the better, but much more complex, option of a variable-latency vrgather, where, for each DLEN-sized output segment, it counts the number of distinct DLEN-sized chunks of the table the current indices require, and spends, say, ceil(that_count/2) uops (each uop being able to take two DLEN-sized table chunks, the indices, and outputting DLEN data; could also have only one DLEN-sized table chunk input at the obvious cost; also at some point there needs to be the necessary blending, idk where that fits in; probably could just be a temporary directly in the shuffle silicon).

This would result in k*VL/DLEN uops for simple index patterns such as zip/unzip/reverse/broadcast with k=1 or k=2, and worst-case (e.g. transpose) is as bad as your option 2 with N=1 or N=2, and requires just DLEN^2 connections in the main shuffle, as much of the heavy lifting is done by the already-necessary silicon for getting DLEN-sized chunks of operands (maybe there might be complications with having those not being requested sequentially though; I'm not a hardware person).


I'm also not a hardware person, but I think there is an easier/hackier variant of your approach, that should be relatively simple in an in-order design, but probably harder to do in an ooo design, because you either need to variably add or remove uops based on the input.

Maybe this is what you had in mind, but I must admit that I didn't fully understand how you described the implementation:

For LMUL>1 vrgather:

Create one uop for every LMUL=1 register:

* Look at first index (n=idx[0]/(VLEN/SEW)), to select vector register to read from

* do the LMUL=1 vrgather from that register, and check if all are in range

* if they weren't, emit LMUL=1 vrgather uops corresponding to the other LMUL=1 register sources

This would give you LMUL cycles for all permutations that read all values for a LMUL=1 register destination from a single LMUL=1 source register, or the proportional equivalent when DLEN<VLEN. So this would cover the LUT/zip/unzip/reverse/broadcast cases.
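
A rough C model of that splitting (my own names, not from any spec or real implementation; e8 elements, VLEN=128, every index assumed in range, masking/tail handling ignored):

  #include <stdint.h>

  #define VLENB 16   /* VLEN = 128, e8: 16 elements per LMUL=1 register */

  /* One LMUL=1 gather uop: fill dst from a single source register and
     report whether every index actually hit that register. */
  static int gather_uop(uint8_t *dst, const uint8_t *src_reg, int src_no,
                        const uint8_t *idx)
  {
      int all_hit = 1;
      for (int i = 0; i < VLENB; i++) {
          if (idx[i] / VLENB == src_no)
              dst[i] = src_reg[idx[i] % VLENB];
          else
              all_hit = 0;
      }
      return all_hit;
  }

  /* LMUL>1 vrgather: guess the source register from the first index of
     each destination slice; if some indices missed it, issue the
     remaining LMUL=1 uops.  Returns the uop count: LMUL for the
     LUT/zip/unzip/reverse/broadcast patterns, up to LMUL*LMUL otherwise. */
  static int vrgather_split(uint8_t *vd, const uint8_t *vs2,
                            const uint8_t *vs1, int lmul)
  {
      int uops = 0;
      for (int d = 0; d < lmul; d++) {
          uint8_t *dst = vd + d * VLENB;
          const uint8_t *idx = vs1 + d * VLENB;
          int guess = idx[0] / VLENB;            /* pick source from idx[0] */
          uops++;
          if (!gather_uop(dst, vs2 + guess * VLENB, guess, idx)) {
              for (int s = 0; s < lmul; s++) {   /* slow path: the rest     */
                  if (s == guess)
                      continue;
                  gather_uop(dst, vs2 + s * VLENB, s, idx);
                  uops++;
              }
          }
      }
      return uops;
  }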

As I mentioned I don't think this is a good fit for ooo designs, but in-order ones, with larger vector lengths could/should probably implement this.


That is exactly my design (or, rather, a design I remember reading in the #riscv IRC room), taking the "could also have only one DLEN-sized table chunk input at the obvious cost" path; i.e. to extend your specification to my double-DLEN-chunk method, you'd have it find two different indices that each select a separate register, and blend those shuffled together all in one uop.

https://github.com/llvm/llvm-project/issues/79196 also mentions this in "we want the elements to come from as few DLENs of the source vector as possible"; and down the comments there is me having had too much fun writing code to fill in "I don't care" index elements such that they don't add any more required chunks, in a completely-DLEN-agnostic way.

I don't think this is necessarily that bad for ooo; it's really "just" variable latency, which would already exist to an extent on cores which dynamically scale based on VL (though, granted, not all would be such, and the exact latency would be found much later down the pipeline even then).


Ah, great, I just wasn't sure what you meant with ceil(count/2).

It's not worth implementing when VLEN>=DLEN, because then you can replace all of the uses where you'd always get an advantage with unrolled LMUL=1 instructions. So I don't think it will be common in ooo designs, since they usually have smaller vectors.


It's only not worth it if software indeed consistently does the LMUL=1 unrolling. Which I suppose most will, and vrgather with LMUL>1 will be roughly dead-on-arrival, but who knows.


I'm arguing that it should be possible for city planners, while designing their road networks, to assume all cars will be capable of driving through their roads reasonably near the speed limit where possible (with exceptions like hardware bugs patched in microcode, but those are to be hated and not the expectation). (I suppose there are specialty vehicles not meant for general travel, but I'd assume there are actual restrictions around such where reasonable (I know nothing about cars), whereas, with hardware, if it exists, people will use it for whatever they want regardless of its intent)

> As I'm sure you know the computational complexity of vrgather is proportional to vl^2

But that sucks for anything wanting just a fixed 16-byte table (which is a pretty frequent thing in fixed-width SIMD), and low VL will not help with that as the table argument of vrgather is always VLMAX; you necessarily need low LMUL (so a more correct average complexity is LMUL*VL even for hardware dynamically scaling resource usage based on VL). But, if I wanted a 32-byte table, I'd have to jump to LMUL=2 for portable code, at which point VLEN=512 hardware would be forced to consider each result element potentially selecting from 128 bytes of data.

Granted, here at least it can be reasonably expected that general-purpose hardware would make LMUL=1 not horribly slow for typical use-cases (be it via spamming silicon at it, not having high VLEN, or specializing for small local variance of indices at runtime (which, while neat, is silicon that could've been spent on actually meaningful things instead of reconstructing what the developer already knew; and even then it probably would result in higher latency)).


Fortunately 16-byte tables are the good case that always fits into a LMUL=1 register under the standard V extension.

I do agree however, this would result in bad performance for VLEN>128 and VLEN<DLEN implementations.

Adding a gather instruction limited to 16 elements might be a solution; however, it could also result in vendors feeling more free to have a slower LMUL=1 vrgather that still needs to perform reasonably. The LMUL=1 and 16-element vrgathers are the most common cases.

IMO the software ecosystem needs to steer the hardware here, in place of ARM for NEON and Intel/AMD for AVX, stopping people from making implementations with disproportionately slow permutes. People can't expect to put an ML core, like the X280, as the main CPU on a regular desktop-class processor and expect general-purpose software to be optimized on it.

I suppose this could be added to the RVA profile as well, but they don't want to specify microarchitectural details, so what does it mean for an instruction to be fast? Maybe there could be a non-normative clause that software is expected to use LMUL=1 vrgather for LUTs and other shuffles inside of hot loops. I suppose creating an issue regarding this can't hurt, so I'll look into it.


If you have code that is tuned so critically on LMUL then you can always determine the correct LMUL value at program startup, for the hardware you find yourself running on, whether by looking at the CPU model or running a few test cases or whatever.

Unlike with choosing between SSE, AVX, AVX512 etc, with RVV you don't have to duplicate your code to do so.


But for SSE/AVX/AVX512, I don't have to duplicate code to get reasonable performance - code written for SSE2 will still run very reasonably on AVX-512 hardware. And if all I need is primarily 16-byte, there may not be any reason to even bother with AVX2/AVX-512 (biggest benefit might very well be just the extra registers & separate destination, which don't require source-level changes anyway).

Whereas with RVV you might have to dynamically select LMUL to not get unreasonable perf (a very low bar!).

And.. a 16- or 32-byte shuffle should not be considered as some "critically tuned" thing, this is basic fixed-width SIMD stuff.


And we're back to you wanting all RVV implementors to make the same performance vs size/cost trade-offs when the whole point of it is that different implementors have the freedom to make the trade-offs that make sense for their target markets -- and you have the freedom to buy them or not buy them, depending on whether they fit your needs. There is not a single vendor who has to address all markets and use-cases; there will be dozens of them, all using a common ISA, and all (sensibly written) software will run everywhere, at varying levels of performance and cost.


It's not so much that I want that, more that it's a basic requirement for fixed-width usages, and is easily attained by aarch64 & x86. Whereas RVV's potential of tradeoffs clearly interferes with that.

And if hardware making bad tradeoffs for fixed-width usages ever makes it to general-purpose usage to any significance, RVV as a whole would end up being required to be considered as not fit for fixed-width stuff (outside of extremely important & well-funded things where people can afford to add manual tuning for each separate funky CPU, but it's a pretty safe bet that most developers won't have every single piece of RISC-V hardware.. especially that which they would hate).

Without reasonable performance guarantees, "RVV can imitate NEON" is as useful of a statement as "base RV64G without any extensions at all can imitate NEON". It's just stating that it's a turing-complete system.


And, once again, I point out that X280 is not intended for general-purpose usage. It's for special purpose stuff, and I'm sure fits what NASA wants in their space PIC64 chip very well. In the specific case of the SG2380, the 16 P670 cores are there for general-purpose use, with the 4 X280s for specialised uses.

And, again, even at 1 element per cycle, using vrgather is significantly faster than not using it and using the dual-issue scalar side instead.

AND at short NEON or AVX vector lengths the X280 is single-cycle for vrgather *anyway*.


If what camel-cdr in another comment here said is correct, the single-cycle vrgather wouldn't apply to anything not written specifically for X280 as I doubt anything anywhere is attempting to use LMUL=1/2 for 128-bit or 256-bit vrgather.

But, sure, X280 might not be intended for general-purpose usage. Still pretty sure that won't stop people from trying. And it's far from impossible, and, to some extent even reasonable, for future hardware to be intended for general-purpose usage while still disadvantaging fixed-width.

And, for a good number of fixed-width SIMD usages, the alternative might not necessarily be "do everything exactly as with SIMD but in scalar registers with one instruction per would-be-SIMD-element"; namely, SWAR could be utilized, some shifts/rotates/bswap/multiplies used in place of shuffles, entirely different algorithms used, and you wouldn't need to duplicate computation for something like a cumulative sum/xor that on SIMD would've been implemented as log2(n) slides. I've had a good number of cases where some fixed-width x86 SIMD thing is only marginally faster than the scalar baseline.

I'm not saying this is some utter massive disaster that'll kill RVV; but vrgather, as-is, is definitely pretty inadequate for many of the things it could otherwise do, purely because both operands are forced to the same LMUL and that is the only cap on the table size; and it is unquestionably possible (though not guaranteed, and maybe even unlikely) that in the future software will be expected to handle funky-tradeoff hardware, which would suck for software developers.


I will agree with you on one thing. Having `vrgather` use the same `LMUL` for both operands is not ideal. It's natural, of course, for permutations, where the actual data is in the lookup table and the indexes are the permutation, but not for other uses where the indexes are the data and the table is a table. Having the table ignore `vl` is also ugly.

The proper solution would, I think, be to have duplicate `vtype` and `vl` CSRs that are set up by a different `vsetvl2` instruction. It would be seldom-enough used to dispense with the opcode-expensive immediate form and only take the type from a register, making it a cheap I-type instruction.

This would also solve the problem of the table and indexes having different element sizes, which prompted the late creation of the `vrgatherei16` instruction, the imposition of the artificial 65536-bit limit on `VLEN` in RVV 1.0, and the foreshadowing of a potential future `vrgatherei32`.


That's an interesting solution. It would also not be unreasonable for, say, indexed loads/stores, though there, fully getting rid of the element type built into the instruction would have its problems.

While vrgatherei32 is certainly definable, it not existing does have the benefit of not burdening small impls with yet more connections for indices. Also, vrgatherei64, spooky.


Can someone please create an instruction set where the entire instruction set is implemented (even if in microcode) on every processor that uses that instruction set?

Have you looked at RISC-V or x64? They are messes of instruction set extensions, most of which reuse the same opcodes.

We have terabytes of disk space and gigabytes of RAM. We can afford a few extra bits in the opcode.


RISC-V has specs to suit everything from microcontrollers through high performance multi-core server level machines. It's not realistic or a good idea to use identical features for such different applications.

For example a microcontroller can't be made economically with 64-bits, SIMD, quad-precision floating point, transactional memory and so on. So it makes sense to provide flexibility rather than locking in a single high-end feature set.


Note that the $2.99 Milk-V Duo has two 64-bit CPUs, including one 1 GHz Linux-capable core with MMU, FPU, and a 128-bit length-agnostic vector unit that supports 32- and 64-bit FP and 8-, 16-, 32-, and 64-bit integer. It also has 64 MB RAM.

If you're talking microcontrollers at the level of the $0.10 CH32V003 then sure.

https://arace.tech/products/milk-v-duo


The CV1800B in the Milk-V Duo looks very cool - but it's not an MCU.

According to the product page it actually even has an auxiliary MCU in it, in addition to its two high-performance processors.


It's a full-featured applications processor at MCU prices.

Just compare it to the $23.80 Teensy 4.0 which has 1 MB RAM (vs 64 MB) and runs a Cortex M7 at 600 MHz. That is definitely an MCU. And, when it came out, very good value for money -- I love mine. But the $3 Duo blows it away in pretty much every way.

If you want to talk about just the chip, the MIMXRT1062DVL6B in the Teensy goes for $14.48 qty 1, $9.17 qty 960.

https://www.digikey.com/en/products/detail/nxp-usa-inc/MIMXR...

The CV1800B goes for $18 for 5 chips, $3.60 each.

https://arace.tech/products/sophon-cv1800b-5pcs

"can't be made economically with 64 bits [etc]". Yeah it can.


Is that really valuable? Stuff like clz implemented as slow emulation is arguably worse than useless, because an application-specific fallback can easily be faster than the emulation of the generic opcode.

If the operation is slow you still have to have detection and fallback. Unless you don't care about the performance, in which case you could just never use the fancy instruction at all.


As a more modern example, pdep & pext (BMI2) on Zen 1 and Zen 2 are extremely slow (they're microcoded), even though they're technically supported. Thus any code wanting to actually use those for anything is essentially required to add a "But actually! If I'm running on Zen 1/2, pretend pdep/pext do not exist".
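The resulting pattern looks roughly like this (a hedged sketch using GCC/Clang builtins; the CPU-name checks and the software fallback are illustrative, not canonical):

    #include <stdint.h>
    #include <immintrin.h>

    /* software fallback: deposit the low bits of src into the set bits of mask */
    static uint64_t pdep_soft(uint64_t src, uint64_t mask) {
        uint64_t out = 0;
        for (uint64_t bit = 1; mask; bit <<= 1) {
            uint64_t lowest = mask & -mask;  /* lowest set bit of mask */
            if (src & bit)
                out |= lowest;
            mask &= mask - 1;                /* clear that bit */
        }
        return out;
    }

    __attribute__((target("bmi2")))
    static uint64_t pdep_hw(uint64_t src, uint64_t mask) {
        return _pdep_u64(src, mask);
    }

    uint64_t pdep64(uint64_t src, uint64_t mask) {
        /* BMI2 is "supported" on Zen 1/2, but pdep is microcoded and very
           slow there, so treat it as absent on those parts */
        if (__builtin_cpu_supports("bmi2") &&
            !__builtin_cpu_is("znver1") && !__builtin_cpu_is("znver2"))
            return pdep_hw(src, mask);
        return pdep_soft(src, mask);
    }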

If you want every CPU to support every instruction without caring about perf, you could just have the OS emulate it on hitting an illegal-instruction trap; perf isn't gonna be that different, given that nothing would be using any of the potentially-slow instructions anyway, so the ISA might as well not have them in the first place. Technically supporting everything solves absolutely nothing.


"If you want every CPU to support every instruction..,"

That does indeed seem nice. As long as there's an easy way to enquire about whether an instruction is slow or not, it means that all the software runs, and the software that knows about issues with some instructions runs as fast as it would without support for those instructions. The only downside is the effort put into writing emulations of the instructions, which seems small in context.


There are two kinds of instructions that you might have or not have. The first is things such as vector instructions, crypto, string, and blockmove, which can usually be hidden away in library functions: you just set a function pointer to the correct version at program startup. But others such as ... just to pick a couple ... "ANDN" (rs1 & ~rs2) or "SH3ADD" ((rs1<<3) + rs2) are just naturally mixed in with potentially all of your code. The savings they provide relative to the base ISA make it simply not worth wrapping them in a library.
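To make that concrete, a small sketch (helper names are mine; the instruction sequences in the comments are what a Zba/Zbb-enabled RISC-V compiler would typically emit):

    #include <stdint.h>

    /* x & ~y: one ANDN with Zbb, a not + and pair on the base ISA;
       far too small a win to hide behind a library call */
    static inline uint64_t clear_bits(uint64_t x, uint64_t y) {
        return x & ~y;
    }

    /* plain array indexing: with Zba the address is a single
       "sh3add t0, i, base" ((i << 3) + base), versus slli + add without */
    static inline double get(const double *base, long i) {
        return base[i];
    }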


You're basically asking for the end of backwards compatibility, with every processor having an obscure, unknown custom ISA that nobody can ever replicate, since you want the instruction set designed so that no future processor can ever add even a single new instruction to an existing instruction set. The only way that is feasible is if there is a new instruction set every time a new processor comes out. Yikes, what a dumb idea.


That's how things were before the IBM PC era: every computer either had an incompatible system architecture, or a wholly different CPU.


This would fragment compiler optimization, because instead of compiler developers (collectively) focusing on 1-3 ISAs, they'd be split across way more.


Compiler developers lately seem focused on emitting LLVM IR.


Yeah, but there are still those who need to do IR -> ASM.


Hardly dumb. That's exactly how GPUs work. It's how mobile phones worked for a long time (J2ME). It's how Azul's old Java server worked too.

Having software ship in a high-level abstract form, with the hardware coming with an integrated OS/compiler combo, is pretty nice from an architectural perspective. It definitely frees up the CPU designers to do things they otherwise couldn't do easily due to backwards compatibility constraints, as well as allowing new hardware features to be used immediately as long as the compiler can recognize patterns that benefit.


> That's exactly how GPUs work.

(NVIDIA) GPUs don't have one common instruction set - they are just shipped with one blessed suite of tooling that hides the differences between a very limited set of architecture families. Many different PTX instructions have radically different performance characteristics between families (because they aren't actually implemented in hardware on some of them).


The x86_64 ISA is already fragmented like this.

There's a pile of different SIMD ("SSE") instruction sets, and it's pretty much anybody's guess what the system your program is running on is likely to have.

cat /proc/cpuinfo sometime and just look at all the stuff under 'flags'


> There's a pile of different SIMD ("SSE") instruction sets, and it's pretty much anybody's guess what the system your program is running on is likely to have.

It is nowhere near that bad. There are three lineages of chips--Intel Atoms (/E-cores), Xeons (/P-cores), and AMD's chips--and within each lineage, a newer chip is a strict superset of the features of the previous chip. The exceptions to this rule are the features that are so bad that no one uses them (most notably MPX), and AVX-512, which landed in the desktop P-cores before big-little happened and was then turned off because the E-cores don't support it.

If your computer isn't over a decade old, it's not a guess which SSE instruction sets your hardware has--it has all of the SSE instruction sets.


Where can I find FMA4 in modern AMD CPUs? What about 3dNow? XOP?

Even SSE isn't guaranteed as Intel chips never implemented SSE4a.

AVX-512 is especially weird. Cannon Lake implemented VBMI, but Cooper and Cascade Lake did not, only for it to reappear on Ice, Tiger, and Rocket Lake, then disappear on Alder Lake and reappear on Sapphire Rapids. BF16 was in Cooper Lake, only to disappear for Ice, Tiger, Rocket, and Alder Lake before reappearing on Sapphire Rapids.


>and AVX-512 which landed in the desktop P-cores before big-little happened, where they were turned off because the E-cores don't support AVX-512.

x86-64-v4 requires AVX-512.

In short, they utterly fucked up.


RISC-V looked at that and said "Watch me, I can do even more".

Also, it's a safe assumption that everyone is on x86-64-v2 now, which includes up to SSE4.2, and almost everyone is on x86-64-v3, which includes AVX2. In practice there are about three main "bubbles" of supported extensions. Intel did try to throw a wrench into that with AVX-512, but luckily AMD took a saner route and everything is converging on more or less the same set of supported AVX-512 subsets.
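You can also test for those bubbles directly at runtime; a minimal sketch (the x86-64-v* names in __builtin_cpu_supports need a reasonably recent compiler, GCC 12-ish if I recall, and -march=x86-64-v2/-v3 does the same thing at compile time):

    #include <stdio.h>

    int main(void) {
        /* the three main "bubbles": baseline, v2 (up to SSE4.2), v3 (AVX2 etc.) */
        if (__builtin_cpu_supports("x86-64-v3"))
            puts("AVX2-class machine (x86-64-v3)");
        else if (__builtin_cpu_supports("x86-64-v2"))
            puts("SSE4.2-class machine (x86-64-v2)");
        else
            puts("baseline x86-64");
        return 0;
    }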


Nah, RISC-V looked at that and decided to formalize the concept of extensions properly and put it through a proper standards process, instead of letting it be dominated by the whims of a manufacturer's product and marketing pipeline.


And then formalize enough of them that there are now over 1 quintillion valid ways to create a standards-conforming RISC-V core.


Not a concern, as the common ecosystem of binary software follows RVA profiles.

Microcontrollers / embedded can go wild, as the vendor controls the whole stack and can build its firmware images to suit the system.


The big promise of RISC-V for embedded was that vendors would not have to maintain toolchains like that. There's no real benefit over using a proprietary core if you have to maintain your toolchain.

That's a big reason why they use Arm cores.


Absolutely, that's why the toolchains (gcc and llvm) support specifying which extensions to use.

Most people are gonna build for a profile, but embedded can build for whatever custom set of extensions the core they are using has, which does not have to be a profile.
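For example (illustrative invocations only; the exact ISA strings depend on the core and the toolchain version): an application build targets the common baseline, while an embedded build can name the precise extension mix of its core.

    # application-class build: the usual rv64gc baseline (or an RVA
    # profile name on toolchains new enough to accept one)
    riscv64-unknown-linux-gnu-gcc -march=rv64gc -mabi=lp64d -O2 app.c

    # bare-metal build: hand-picked extensions for the specific core
    riscv64-unknown-elf-gcc -march=rv64imac_zicsr_zba_zbb -mabi=lp64 -O2 fw.c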


+1. It's a weird thing to complain about, considering that building for ARM already involves a bunch of board-specific compiler arguments regarding the FPU, MMU, etc.

If anything, choosing the extension set is cleaner.


> We have terabytes of disk space and gigabytes of RAM. We can afford a few extra bits in the opcode.

Except instructions need to be fetched, which takes time.


And L1 I-cache size has what seem to be very hard limits due to complexity and latency.


There are plenty that do.

Pretty universally that's because they weren't interesting enough to invest in iterations.

And the extensions don't really reuse opcodes. With very few exceptions the extensions aren't mutually incompatible. X86 would be a lot denser if they were.



