A lot of nonsense in this article, but I'll try to address his points:
(1) There are several royalty free implementations, ranging from the minimal (PicoRV32) to widely-used embedded (Rocket) to reasonably high-end (BOOMv*). There are also proprietary implementations, and this is how it should be.
(2) RISC-V has defined various platform specifications which means in practice ISA fragmentation is not going to be a thing. The platform will define the minimum set of extensions, and the rest will be probed at runtime (which is exactly the way x86 works).
(3) About economics - the author ought to read some Clayton Christensen, or maybe take a look at how companies like Intel, AMD and Arm came to prominence.
(4) Yes, open spec != open implementation, but see point (1).
(5) RISC-V has been designed by experts in the field and adopted by some very large hardware companies with experience and who ship in very large volumes (millions and billions of chips). It may well not be perfect, but it doesn't need to be in order to be successful (see also x86).
High-end means building and optimizing the microarchitecture for a specific node and process technology. Typically it costs $100-200 million to do, and the result is no longer high-end after about 3 years.
It has some comparisons so you can judge for yourself. Of course, since this is a project driven by only 2-3 grad students it isn't completely fleshed out. However, you would assume that if the ISA just wasn't suitable for high performance implementations that a project like BOOM would have uncovered that by now.
Right, every so often BOOM's Github page gets reposted as if it was something new, and some of the leadership of RISC-V will talk about this architectural feature or that architectural feature that might support a high performance RISC-V chip, but the proof of a high-performance chip is a high-performance chip.
(The experience of Intel's Itanium, IBM's Cell processor and many others shows that it's not enough to have a few good ideas but you have to have ZERO bad ideas that slow you down to get a high performance design.)
You have to have zero bad ideas that set a ceiling on performance... i.e. you have to clear ALL the bottlenecks out of the way.
Just doing something about the one potential bottleneck that you feel like doing something about doesn't necessarily get you a gain in performance at all.
The idea of scheduling parallelism in the compiler (VLIW) doesn't work for mainstream workloads because the time it takes for data to come back from the DRAM is highly variable.
A super-scalar processor can possibly run some instructions at the hardware level while others block waiting for data to get back from DRAM.
A VLIW processor packs N instructions together (say N=3) and if one of them is blocked by DRAM, they all block. (If one of them is blocked by Optane they all block for a very long time...)
It looks obvious in retrospect but it's amazing how most of the RISC workstation vendors missed it and put themselves out of business by getting on the Itanic train.
(VLIW is successful for DSP and GPU, but that's because workloads like that can have completely predictable fetches)
I don't know what the problem w/ Cell was exactly, but it was the same in that it couldn't pull data from DRAM fast enough to keep the silicon busy.
With Itanium, they had advanced loads to decouple the pipeline from unknown memory latencies.
Also Itanium was a heavily superscalar design; I think you meant out of order.
Cell just plain didn't really let you directly address main RAM, so you never saw unknown DRAM accesses stalling the cores. Access to the local memory was always single cycle.
In neither case were unknown RAM latencies an issue with the design.
That's probably the main bear argument concerning RISC-V--at least as it applies to the West. ARM has a big ecosystem, it licenses at reasonable rates, so if you want to design a chip, cost-savings associated with RISC-V are in the noise. (And it's at least a bit unclear what other benefits associated with open source software come into play.)
RISC-V proponents said that ARM's licence rate might be reasonable, but the negotiation time is (or was, back when RISC-V was just a concept) not reasonable at all.
And time to market is quite important in this area.
At some point if it catches on, being open and royalty free, I should be able to call Global Foundries, TSMC, or Samsung and say "I want 300,000 of your RISC-V chips in 64 bit with X list of extensions built on your Y nm process".
Your 3rd point also jumped out at me as I read the post. Seemed like a classic case of someone arguing that a coming disruptor was too low end to be a competitive threat.
My understanding was that RISC-V was created as a teaching tool at Berkeley so that students and researchers could have a completely open and modern architecture to study. I assume part of the reason it's a (very) simple core with a lot of small extensions is to encourage incremental development by students. Not needing multiplication right away is a feature here.
I don't think there was ever a commercial goal in mind.
There are other reasons. Let’s say you’re building a Larrabee-style GPU. You want a simple core with a large set of SIMD units (along with some custom extensions for the hardcoded parts of the pipeline). Adding a multiplier and divider to your core would kill its usability, either by becoming non-standard or by wasting huge amounts of space and time.
Multiplications and divisions happen out of band on most chips (that is, on the side) because they take so many clock cycles to complete. All the extra synchronization takes a lot of extra work.
The RISCV ISA is seeing an uptick in the embedded space where it can be very efficient in core size and thus cost. Here too, requiring those units would hurt the market. Meanwhile, it can reasonably be assumed that any desktop-grade CPU will have those extensions just like it’s assumed that desktop x86 chips all have SSE.
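To make the "not needing multiplication right away" point concrete: on an RV32I-only core the compiler lowers a C multiply to a library call (a __mulsi3-style libgcc helper), while with -march=rv32im it just emits a single mul instruction. Here's a rough illustrative shift-and-add version of such a helper, my own sketch rather than the actual libgcc code:

    /* Illustrative software multiply of the kind a __mulsi3-style
     * helper provides when compiling for rv32i (no M extension).
     * With -march=rv32im the compiler emits a single mul instead. */
    unsigned int soft_mul(unsigned int a, unsigned int b) {
        unsigned int result = 0;
        while (b != 0) {
            if (b & 1)       /* low bit set: add shifted multiplicand */
                result += a;
            a <<= 1;
            b >>= 1;
        }
        return result;       /* a * b modulo 2^32 */
    }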
Speaking of that, there are over twenty x86 extensions released since 2000. There are 8 or so in just the last 5 years. Despite this, life goes on because the tools that make handling this easy have existed for decades.
> RISC-V seems like it hasn’t learned anything from CPUs designed after 1991.
Never mind what RISC-V says about itself. Here is Dave Jaggar, a key ARM architect, endorsing RISC-V as state of the art in 2019: https://youtu.be/_6sh097Dk5k?t=51m10s.
Thank you for this, and more specifically for adding the relevant timestamp - and, more personally, for Jaggar's answer to the subsequent question regarding complexity in hardware and software stacks, which as a layperson is something I often think about.
I'd love to see the result of a concerted ground-up creation of a hardware and software system that starts from scratch and leaves unresolvable legacy considerations to virtual machines or whatever.
Anyway, totally Off Topic, but thanks for what I'm assuming will be my watching that entire video!
> I'd love to see the result of a concerted ground-up creation of a hardware and software system that starts from scratch and leaves unresolvable legacy considerations to virtual machines or whatever.
This has been tried often (Thinking Machines and Multiflow, e.g., not to mention Itanium). I would love to see this too; I worked for a startup that tried to do this (VLIW, no specex) and taped out a working chip after 6 months. Maybe the startup didn't have the right sales team, but they didn't manage to make any meaningful sales; the culture in the buyer's market is too conservative.
In the case of the company I worked for, I suspect that part of the problem is that conservative buyers will look for excuses to say no instead of excuses to say yes. One such example is that nobody would accept that you could move specex to the compiler, trotting out the old "sufficiently advanced compiler" joke, despite the fact that they could prove that LLVM was "advanced enough".
> I'd love to see the result of a concerted ground-up creation of a hardware and software system that starts from scratch and leaves unresolvable legacy considerations to virtual machines or whatever.
I'm not a specialist in CPU design, but to me RISC-V seems like an incremental evolution, nothing really ground-breaking; compared to alternatives the main benefit is the openness/licensing/cost. But at the same time, consider that you can get a $5 Raspberry Pi and embedded ARM chips cost pennies.
The entire effort of supporting it and building the ecosystem will be hard to justify economically at these cost figures.
Mill on the other hand has a lot of interesting ground-breaking ideas that can potentially offer a 10x performance/watt improvement.
Many people have recently been in awe of the roughly 40% ARM performance improvements offered by the Apple silicon and AWS Graviton CPUs. Just imagine something offering 10x based on a Mill design; I'd love to see it built one day and available on the market.
40% on existing software & even better performance when emulating old software. Mill is extremely alien, with no indication they're able to get any of the same gains. Regardless of their brilliance, they seem not to know how to run an engineering business, and their choices show that (our new CPU needs a completely new compiler vs adding a new backend, our new CPU needs a new OS, etc.). Their efforts are basically "boil the ocean" levels of engineering without any sales to justify that they're on a path to success.
Not really. He answered a question about learning about CPU design, and he said that RISC-V is a good thing to look at. Yes, he said "state of the art" but it was in the context of learning about CPU design, so I think the meaning here is not as strong.
Your post is interesting. I agree yet disagree. You got the context correct, but he explicitly said 'it's the standard now for 32 bit general purpose instruction sets and it's got the 16 bit compressed stuff. Learning from the best still'.
Standard != state of the art, but in this context he's stating that learning from the best (state of the art CPU designs) means learning RISC-V.
One doesn't call something a standard and state of the art in one breath while somehow believing it's outdated or built from 1991's knowledge base.
I think your interpretation of his message would need further explicit statements from him to support your claim (that 'state of the art' somehow applies only to learning CPU design rather than to CPU design itself).
The listener would be better suited to follow @seedless-sensat's (and my) understanding because there is no evidence to the contrary of what we're stating. The speaker intended both (but if you can find evidence to support your different thesis, I look forward to investigating it further).
Which could be a good thing if it causes developers to write more portable code. Or a web-like chaos where you have to add little hacks for every other platform.
Anyway - Extensions based on a minimal subset allow for fragmentation. Yes. Which is probably intended? This can become a problem. Yes. Therefore it should be possible to declare a use case like "all purpose computing" (personal computer) and define a required superset with specific requirements.
e.g. RISC-V "Personal Computing Rev01" shall support the following extensions {list}
And please tell me why a closed source, patented, licensed ISA would be better instead of just saying it's open and therefore fragmented and bad.
PS: Insert here the usual whataboutism regarding Intel's notorious CPU extensions or ARM's compatibility track record.
I'm really surprised by these two-line blog posts that gloss over the complexity of ISA design and ecosystems and conclude the RISC-V creators know nothing about anything.
Maybe I'm biased because I learned computer architecture at my university reading Computer Architecture: A Quantitative Approach (in 1995, yes I'm old).
This is a long and very detailed book with lots and lots of data about tradeoffs in ISA design.
The authors of this book created MIPS and then RISC-V so it's not just academia-only data.
I would love to hear from people who have really done their homework and can say: with this different RISC-V design, over a few thousand tests of various Linux/BSD tools and software, we gain X% on Y metric. With an open source simulator and tooling provided to verify the claim.
But I think it's mostly a case of talk is cheap (and cheap money from competitor interest...).
Plenty of people consider ARMv8 a better ISA than RISC-V; the addressing modes seem to be a real concern. Those same people mostly consider MIPS to be too simple for its own good; obviously this simplicity doesn't completely doom the architecture but there are some long-standing differences of opinion here.
There is plenty of evidence (rather than opinion) that with a decent compiler and hardware implementation the more complex addressing modes don't gain you anything worth the bother in either code size or performance.
RISC-V is extensible. If someone really believes that more complex addressing modes are worth it then they can add them as a custom extension. If their chips work out to have better performance/price, performance/area, performance/energy then everyone else will want to copy them and those more complex addressing modes will become a standard extension for RISC-V.
There is no sign of the RISC-V community (as opposed to ARM or x86 fans who aren't using RISC-V anyway) moving in that direction.
The promise of RISC-V (as I see it) is that it allows for innovation in chip design with a relatively lower level of fragmentation than would otherwise be true. Without an open and extensible ISA companies would either have to afford the ARM architecture license, or build their own ISA, with all of the tooling and infrastructure cost that implies. This level of innovation implies some fragmentation, but by building on a common base, the pain of that fragmentation can be minimized.
"Overall, RISC-V will lead in a revolution for nationalist vanity CPUs (think Loongson; no one will run them but for show and perhaps a niche of radical ideologues)."
With the US-China tech war that has obviously changed.
The US has demonstrated an ability and a willingness to prevent China from using high-end fabs and ARM, thus forcing China to invest massively in chip production.
That is why RISC-V with its free ISA has its moment now.
Not because it is novel from a technical perspective.
> Not because it is novel from a technical perspective.
If I understand correctly, it wasn't really trying to be technically novel. RISC-V wasn't meant to be a research ISA experimenting with interesting new ideas; it was meant to be a decent Free and Open ISA using modern but established design principles.
The problems with MIPS are not technical, or about ecosystem, or about being battle-tested or not. The problem is that it is proprietary, and since sometime between when SGI spun it out (expecting the rise of Itanium) and the purchase by Imagination Technologies, it has been run by a series of incompetent people and/or people who just want to extract as much money as possible and then get out. They've had some great technical people throughout who have revamped and modernized it, but in the end it's the proprietary nature and the mismanagement that have mattered more.
The use cases like what Western Digital is doing seem like a reasonable driver as well. If they demonstrate some notable cost savings versus ARM, that should drive other companies in that direction.
I'm going to take a contrarian point - what is so bad about vanity projects? I'm not going to touch on the connection between nationalism and these vanity projects, but the fact that teams can be sponsored to work on their own OSs or their own CPUs is valuable in itself. Having more people be able to learn about complex systems that would otherwise require working at one of the specialized places for those - the Linux Kernel Team, Microsoft, AMD, Intel, etc, seems like a good thing. It is a way for a new set of people to get practical experience without the constraints of being inside an existing organization.
RISC-V is great for platforms where low cost, simple design, existing toolchain is important. In these areas it probably is the future.
If your goal is maximum performance and general purpose platforms, then it's not very interesting at all. You don't want RISC-V in your PC, but you do want it in your toaster.
RISC-V deserves hype, but that hype should be confined to what RISC-V is good at, not convince you that RISC-V will kill every other ISA.
I think this is a great point. Products succeed by doing well in a specific market and branching out from there to cover more use cases. RISC-V is not going to succeed in the Desktop / Laptop / Server CPU markets because the designs are nowhere near as specialized as what AMD/Intel (along with Apple I guess..) are putting out.
Where RISC-V can show its value is in the IoT / Embedded markets. There are no high requirements for speed or efficiency there; cheapness and ease of integration are going to win out. If RISC-V can win in those markets, the designs made using it can improve over time to take on markets with higher requirements.
ARM64 and RISCV64 are so similar ISAs that whatever implementation you can do with one you can do with the other, given the same level of engineering skill and effort.
Apple probably spent billions to make M1. If someone made the same investment with RISC-V they could get the same results -- the main company currently pushing higher performance RISC-V (SiFive) has total funding of under $200m so they clearly don't have billions to spend. They've got an announced CPU (tested in FPGA) with ARM A72 level performance. That's only about five years of very incremental performance improvements behind (non-Apple) ARM.
Because RISC-V wasn't ready yet, and still isn't ready. Ecosystems matter. But the ecosystem sure seems primed for massive growth, and there's no architectural reason why RISC-V can't perform as high as ARM and beyond.
Only sort of. If you care about power consumption (like a thin and light laptop, a tablet or a phone) then the M1 looks pretty good. But if you want optimal performance, for gaming, engineering, ML, and other workstation tasks, then you want a design that is only optimized for performance, and can make use of a much larger power envelope.
Pardon my ignorance on this subject, but is there something inherently better in RISC-V than traditional PICs or other uCs? I have a RISC-V Pinecil and I haven't really figured out what to do with it yet besides the cool factor.
I haven't closely followed RISC-V but I think the deal is it's scalable. You can build low end and high end processor cores with it.
A PIC or AVR etc. can really never be a high end machine. And something like a modern x86 would really never be low end either. A small aside: legacy processor cores on modern CMOS can do fast bit twiddling while consuming trivial amounts of power and die area.
pretty much the same thing one does with any other exotic hardware- install a niche unix variant, then proceed to stare at 'top' and 'show registers' output in the debugger while fantasizing about how cool it is and fixing programs that don't build on it yet :)
What he claims are design flaws is highly contested. It’s claimed that the flaws lead to lower code density, but this is simply false when you consider compressed instructions, which are supported by most implementations. RISC-V with compressed instructions is better than 64-bit ARM.
The design of the compressed instructions is also much more elegant than ARM's Thumb instructions, which were reasonably dropped for 64-bit ARM.
Compressed instructions really hurt the parallelism of CPUs. See M1 for an example of what happens when you design your decoders to fetch independently of each other because the offset of their instruction is deterministic. I think a better model is for code to be compressed with an efficient encoder at cache line size (eg zstd with the dictionary defined to be your instruction set)
I'm not talking about parallel decode. I'm talking about the fetchers feeding the decoders. How does the fetcher know where the instruction boundaries are if you have variable-length instructions? My understanding is that x86 is the worst here - while x86 gets broken down into RISC-like uops so the overall x86 hit isn't large, the fetcher is extremely complex & slow to try to get back some parallelism that x86 encoding loses. The compressed version of these instructions seems similarly complicated since the length of the instruction is dependent on the data operands to the instruction.
Can you help me correct my understanding of the situation if that's where I'm wrong or what additional details I may be missing? I'm not a CPU architecture expert by any means - just an enthusiast.
The issue with x86 is that instructions can be any length, and that their length is a nontrivial function of their encoding. This means there's an unavoidable sequential scan over, effectively, every byte of the input when doing decode. (This isn't implemented as a sequential scan like it would be in software.)
With RISC-V, compressed instructions are still 16 bit, and all instructions are 16 bit aligned. Further, the instruction length is fully determined by the first two bits in the instruction (11₂ for 32 bits).
For an 8-wide decoder, this means routing is fully determined by 15 bits, after a single AND reduction, this can be expanded early during a fetch pass, the worst case is an 8-way MUX for the last instruction, and even doing decode fully in parallel at every possible offset would at worst require interleaving eight 32 bit decoders with eight (much simpler) 16 bit decoders. None of that really matters a great deal since RISC-V is so trivial to decode.
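For anyone who wants to see how trivial that length rule is in practice, here's a rough C sketch (mine, not from the spec) of a fetcher marking instruction boundaries in a fetch group using only the low two bits of each 16-bit parcel; it assumes only the 16/32-bit encodings and ignores the reserved longer formats:

    #include <stdint.h>
    #include <stddef.h>

    /* Walk a fetch group of 16-bit parcels and record where each
     * instruction starts: an instruction is 32 bits iff its low two
     * bits are 11, otherwise it is a 16-bit compressed instruction. */
    size_t mark_boundaries(const uint16_t *parcels, size_t n_parcels,
                           size_t *starts /* out: parcel indices */) {
        size_t count = 0, i = 0;
        while (i < n_parcels) {
            starts[count++] = i;
            i += ((parcels[i] & 0x3) == 0x3) ? 2 : 1;  /* 11 -> 32-bit */
        }
        return count;        /* number of instructions found */
    }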
Early on the focus was on the hypothetical CISC advantage, hence wanting ‘compressed’ variable-length instructions, even though the ISA hasn't ended up especially space efficient now that the actual complex stuff is legacy. But as to why their encoding is so difficult, put that down to lack of foresight. x86 was designed before parallel decode became a thing, and the complexity accumulated over time.
Intel have had quite a few ISAs and all the famous ones are bad, so there doesn't have to be a good reason beyond that.
x86_64 instructions can be anything from 1 byte to 15 bytes long. You have to decode the entire instruction, except any trailing constant/offset, to even know how long the instruction is.
RISC-V instructions can be exactly two lengths: 2 bytes or 4 bytes. You can tell how long an instruction is by looking at the first 2 bits: 11 -> the instruction is 4 bytes long, anything else (00, 01, 10) -> the instruction is 2 bytes long.
These things are so different that you can't just say "oh, they're both variable length".
Also, the 2 byte instruction in RISC-V are completely optional. They each exactly duplicate some 4 byte instruction. If you want to build an 8-wide machine like the M1 and you think having two instruction lengths will make that too hard -- you can just use the fixed length RISC-V base instruction set instead. The only downside is your programs will be a bit bigger, and you'll need to compile your own software (as Apple, for example, does).
My understanding is that the typical way to do parallel decode of a variable length ISA is to speculatively decode from every valid offset, then throw away those which turned out to be wrong.
So x86 implementations start to decode at [n, n+1, n+2, n+3, etc.] bytes. Then if it turns out that, say, the instruction at offset n was 2 bytes, they throw away whatever work was done for decoding at n+1 and continue decoding at n+2, etc.
For RISC-V with the C (compressed) extensions, instruction boundaries are 16-bit aligned, so you can do a similar strategy like above for x86 except you speculatively decode at offsets [n, n+2, n+4, etc.]. See e.g. the SonicBOOM paper.
Of course, speculatively decoding instructions only to throw them away costs power. I have no hard numbers, but I do wonder why RVC didn't specify that an "instruction bundle" is 32-bit aligned and consists of either a 32-bit instruction or a pair of 16-bit instructions? That would have made decoding simpler, though at a cost of worse density.
The problem with this approach is that you get more and more such wasted decoding effort when you try to make a wider decoder. Then again, uop caches exist for a reason and generally work well, they are an excellent choice if you have a hard to decode ISA.
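To make the "decode everywhere, discard the wrong ones" strategy a bit more concrete, here's a crude software analogy in C. The decoded_insn type and names are made up for illustration, and real hardware does the first phase with parallel decoder slots rather than a loop:

    #include <stdint.h>
    #include <stddef.h>
    #include <stdbool.h>

    typedef struct { bool is_32bit; /* ... other decoded fields ... */ } decoded_insn;

    /* Speculatively "decode" at every 16-bit offset, then walk the real
     * boundary chain and keep only the decodes that start on an actual
     * instruction; everything in between is wasted work. */
    size_t decode_group(const uint16_t *parcels, size_t n, decoded_insn *kept) {
        decoded_insn all[64];
        if (n > 64) n = 64;                            /* keep the sketch bounded */

        for (size_t i = 0; i < n; i++)                 /* speculative phase */
            all[i].is_32bit = (parcels[i] & 0x3) == 0x3;

        size_t count = 0, i = 0;                       /* selection phase */
        while (i < n) {
            kept[count++] = all[i];                    /* on the chain: keep it */
            i += all[i].is_32bit ? 2 : 1;              /* skipped slots are discarded */
        }
        return count;
    }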
I do think that RISC-V will open the door to open source implementations as well. It will take a long time, but we might reach a day when people are 3d printing their own chips. Why does everyone always put so much pressure on new technologies to change everything over night? Maybe RISC-V's best days are 10-20 years from now.
The sizes of parts of modern transistors are measured in the number of layers of atoms (monolayers) that comprise them. Getting that stuff built requires some pretty beefy, expensive technology, and is not anywhere near the realm of 3d printers. There's a reason you can count the number of cutting-edge semiconductor fabrication companies on one hand (and the suppliers of some of the machines in the fab on one digit).
I love that reading is a constructive process: you can get something out that the author maybe didn't put in.
I remember reading a book about organic farming that made a great point:
(paraphrasing)
As a home organic gardener you might think that you can't compete with a commercial farmer, but...
Say you want to make a tomato dish. You want the most succulent tomato with the softest skin.
The commercial farmer needs to choose a variety that has a skin tough enough to be picked mechanically or by rough hands.
At home you can go as soft as you want to.
You can pick it and cook it immediately.
The commercial farmer needs to use a variety that can be picked green and ripened.
Etc.
Different scale, different constraints, different opportunities.
It seems that bonding atoms in 3D using an STM is very difficult, if not impossible. We'll probably have self-assembling molecular processors before then.
I imagine you could put down a "wire" of metal atoms without too much trouble. Something like a capacitor shouldn't be too difficult if you could put down a dielectric.
Not sure about how to do a transistor - maybe there could be an alternative "switch".
Yeah, I'm hand-waving. And maybe the metal substrate an STM uses is a non-starter.
But still, atomic precision seems to be within hobbyists' reach, and I imagine SOME approach to low-volume tiny circuits is possible that would not be economically viable in terms of R&D and fab for "big silicon".
Okay. It's probably possible. Hobbyist tooling isn't there yet though. I'm actually developing a DIY atomic force microscope, for other use-cases though.
STMs are very limited though. The kinds of atoms you can pick up are pretty limited afaik.
Give it a couple of years, maybe 5-10, and there will probably be hobbyists moving atoms around. The industry and academia will, of course, probably be further along.
I don't know why it would make sense, even (apart from having fun doing it). We do not print books at home even though we all have printers, we use print shops for that.
Maybe you don't print books at home, but home publishing does exist, mostly for small businesses who need to give each customer a slightly different book (if only because of updates). If you will sell 1000 books you should get a publisher to print them, but if you are looking at only a couple per month, print-on-demand at home makes sense, as you can ensure everyone gets the latest corrections.
The above is moving to electronic, but paper still has advantages over electronics so it isn't dead.
We do not print books, but we do print plenty of photos and documents and newsletters that it'd take a professional photo lab or large-scale offset printers to print just a few decades ago.
The market for people home-printing photos and documents and newsletters is probably 1 million times bigger than the one for people home-printing CPUs, though.
Yes, and yet plenty of tools and products target that market anyway. Especially since while a good chunk of it is hobbyist use, another large chunk is small-run electronics where relatively high price tags can easily be justified. But even in the hobbyist space, while the people who would actually use the tools are few in numbers, many hobbyist projects get built in the 100's or even thousands - whether by end-users themselves or by people selling them to end users. FPGAs are often a good chunk of the cost of projects like that.
Something that could allow people to print chips competitive with stuff you can run on present-day FPGAs would already be transformative (and yes, I recognise 3d printing is far off achieving anything like that as well, but the point is it doesn't need to compete with the state of the art to be highly useful for many use-cases).
While much smaller in volume, there's also a whole sub-culture of people opposed to "emulation", a subset of whom see FPGAs as borderline "cheating", and/or who would otherwise love to print pin-compatible replacements for old chips.
Heck, even being capable of printing replacements for 6581/8580 (SID - sound chip from Commodore 64) or 6526 (IO chips used in the C64 and others that were notorious for burning out) would get a whole lot of retro-enthusiasts excited, as there are a bunch of FPGA reproductions of the SID which are way overkill.
Doing that would be possible by matching processes that were getting dated by 1980...
Right now it looks like POSIX and some good minix equivalents. It's waiting for a Linux to change the game.
As for manufacturing, I think it's harder but not crazy harder. Small batch fabs making small vendors' customized implementations seems reasonable. It's really about reducing requirements for big scale. ISA is one.
I don’t have much knowledge on the low-level intricacies of CPUs, could someone please answer a question of mine?
There is an older, famous post titled something like “C is not a low-level language”, and relatedly one that assembly itself is not all that low-level today.
I know almost nothing about the state of today’s CPUs, only that they are extensively complex with long pipelines and that branch prediction is a thing. Also, several layers of caches. x86 on top is almost only an API, which frees the hardware engineers from backward compatibility, since these intricacies are programmed by microcode that is not really accessible to general software.
My question is, wouldn’t a “dumber” processor be better with a more expressive “API”, with instructions like prefetch, some form of parallelism (so that pipelines can be created with machine code and not just heuristics), since it is probably easier and better to branch predict in software than in hardware? In a way current processors sort of JIT compile assembly to some hardware-native form, but if those hardware-level tools were accessible to software, software could do much much better optimizations and bugs could be patched.
> "...some form of parallelism (so that pipelines can be created with machine code and not just heuristics)..."
Not sure exactly what is meant by this. Either the answer is the hardware already automatically maximizes parallelization (https://en.wikipedia.org/wiki/Superscalar_processor) or the answer is that there's no way to physically rewire a chip on the fly.
> "...since it is probably easier and better to branch predict in software than in hardware..."
Been tried, doesn't seem to have been that successful IIRC. A cursory Google shows that x86 had (and dropped) branch hint prefixes for instructions and the Cell's SPUs had branch hint instructions as well. Various languages also allow exposing branch prediction hints to the compiler, e.g.: https://en.cppreference.com/w/cpp/language/attributes/likely
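For reference, this is roughly what those software hints look like from C today using the GCC/Clang builtins (C++20 spells the branch hint as the [[likely]]/[[unlikely]] attributes linked above). They are only hints; the hardware prefetcher and branch predictor are free to ignore them:

    /* Software prefetch and branch-prediction hints (GCC/Clang builtins). */
    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    long sum_sparse(const long *data, const int *idx, int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            if (likely(i + 8 < n))
                __builtin_prefetch(&data[idx[i + 8]]); /* ask for the line early   */
            if (unlikely(idx[i] < 0))                  /* tell the compiler: rare  */
                continue;
            sum += data[idx[i]];
        }
        return sum;
    }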
This is kinda what motivated Intel's Itanium architecture. You should look into that to see why what you're proposing makes sense, but hasn't gotten traction.
I'm no expert on such things but your proposal sounds similar (or at least maybe adjacent to) VLIW in the sense of pushing a lot of the smarts to the compiler. It seems to have had some isolated successes but not nearly as much as it was hyped to in the 1990s. https://en.wikipedia.org/wiki/Very_long_instruction_word
> My question is, wouldn’t a “dumber” processor be better with a more expressive “API”, with instructions like prefetch, some form of parallelism (so that pipelines can be created with machine code and not just heuristics), since it is probably easier and better to branch predict in software than in hardware?
No. I'll list the reasons I know, but a real hardware designer will know even more.
There are many different ways to design hardware. The right thing to prefetch changes with different implementations. Unless you think everyone is going to ship source code?
Not necessarily source code, but some intermediate representation (perhaps even x86, then going through a software compiler that converts it to this lower-level language of the hardware, inserting cache fetches where necessary). But probably a higher-level representation; for example, JVM byte code could be optimised better.
Not me, but a compiler backend would perhaps have better "materials" for optimization this way. There is no reason existing higher-level code couldn't be transformed to this "microcode".
A compiler backend would have to do it statically, and would not be able to do it based on the workload that's being run.
The argument for exposing this sort of control is that domain knowledge would allow the prefetching to be tuned manually.
I pretty much agree on all points except design flaws.
As an instruction set RISC-V is not significantly worse than ARM. RISC-V trades off performance for simplicity of implementation. I think this makes it a good choice for embedded: IoT and security chips need to be cheap and simple, and price and size are more important than performance.
For example, the stuff that lowRISC makes is never going to be high performance, but it can be very important from a security point of view. https://www.lowrisc.org/our-work/
Re ISA fragmentation: Linux has various well established mechanisms to deal with varying CPU capabilities, these are necessary and used on many other platforms.
Re Economics: binary compatibility is not very relevant for folks using open source software, most of the time source code is pretty portable, Debian riscv64 seems to be above 98% built:
What happens if your RISC-V implementation doesn't meet the debian requirements? Or for that matter, provides some sweet instructions that speed up memcpy, or whatever, 100x?
Binary compatibility matters on anything that isn't compiled specifically for the machine.
Yocto/Gentoo might have been better examples if you want to argue binary compatibility doesn't matter. Particularly if you need to compile the bootstrap image just to get the machine to boot.
It just has to meet the standardised baseline that is already defined and all Linux distros use. For all the extensions, Linux has various well established mechanisms to detect and utilise varying CPU capabilities, these are necessary and used on many other platforms.
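On Linux that detection typically looks something like the sketch below: read the ELF auxiliary vector and test the bit for an extension letter. getauxval and AT_HWCAP are real glibc/kernel interfaces; the letter-indexed bit layout matches how the RISC-V port reports single-letter extensions, but treat the details as an assumption and check the kernel docs for newer interfaces:

    #include <stdio.h>
    #include <sys/auxv.h>   /* getauxval, AT_HWCAP */

    /* On Linux/RISC-V, single-letter ISA extensions are reported as bits
     * in AT_HWCAP, indexed by letter ('a' = bit 0, 'c' = bit 2, ...).
     * Anything beyond the platform's required baseline gets probed like
     * this at runtime, much as x86 software probes CPUID. */
    static int has_ext(char letter) {
        unsigned long hwcap = getauxval(AT_HWCAP);
        return (hwcap >> (letter - 'a')) & 1;
    }

    int main(void) {
        printf("C (compressed): %s\n", has_ext('c') ? "yes" : "no");
        printf("V (vector):     %s\n", has_ext('v') ? "yes" : "no");
        return 0;
    }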
As I'm sure you're aware, those mechanisms aren't ideal. There are performance, maintenance, and various other problems with them. They exist because distros are compiled to the lowest acceptable common denominator and then libraries/etc. are swapped in as needed. For something like x86, this is an acceptable tradeoff because there is an expectation that a distro boots on a 25 year old computer as well as on the latest AMD/Intel offerings with a ton of new instructions.
That said, if you rebuild many apps with -march=native, -flto, etc. there are frequently large performance benefits, due to the compiler being able to selectively use things like AVX512/whatever for random code sequences where calling out to a library function, or checking for feature existence at runtime, would wipe out much of the perf advantage.
A large part of the advantage of a new architecture would be avoiding all this crap. If it comes baked in from the start that isn't a good start, considering what it will look like in a decade or two.
What's the boundary between microcode and a VM w/ JIT?
Some tidbits I've read, IIRC: Apple Silicon M1 does a surprisingly good job of running an x86 emulator. Some chips now use microcode to support novel ISA extensions. Some chips are adaptive, optimizing hot-path ISA instructions. There were special purpose chips for LISP and the JVM and probably others. Old school mainframes used to emulate other instruction sets.
Knowing a little bit about JVM, it's JIT, the Java Memory Model...
These (amazing) hardware advances kinda sound like a VM w/ JIT.
So at what point are these chips designed in conjunction with the JIT?
Isn't that kinda what Transmeta was trying to do?
--
What can these chips and systems do to improve security? Stuff like tagged data descriptors and buffer overflow protection.
Thanks, but I myself will be the judge of what is interesting.
After all the speculative execution bugs I think it's pretty clear that Intel has reached a complexity threshold where it gets too difficult to reason about what the processor is actually doing. In that regard I think the "reduced" in RISC could enable us to write fast and safe programs. And we chuck out tons of legacy too, you can probably still run MS-DOS on the latest Intel chips.
I'm not too worried about fragmentation – just look at any two x64 CPUs and they won't support all the same features; this is an annoying but mostly solved problem. And ARM does pretty well, even though every board essentially needs a custom bootloader and a unique description of where all the hardware is.
The ISA fragmentation issue is the worst thing I've consistently heard about RISC-V.
Fragmentation is in general horrible. It was a major problem with ARM32, and that was much more strictly defined with less variation. ARM64 tried to define things more rigidly and has mostly banished the problem. The ARM ecosystem continues to be held back though by fragmentation around bootstrapping, hardware enumeration, and other aspects of the system that surround the core. Every ARM board tends to be a special snowflake.
I can't imagine RISC-V beyond niche applications unless someone publishes a more strictly specified version of it that provides a unified platform.
> I can't imagine RISC-V beyond niche applications unless someone publishes a more strictly specified version of it that provides a unified platform.
They're working on this right now. Niche applications can still do their thing, but there will be standard profiles for e.g. a "Linux class" application processor, or an "ARM Cortex-M*" equivalent micro-controller.
Did a quick search on this, and I believe the Linux portion of this is the responsibility of the "UNIX-Class Platform Specification Task Group" [1]. They seem to be quite active, which I'm reading as a sign things are progressing.
I think RISC-V is converging on a standard “this is what a ‘big’ core looks like”, but with the possibility that you stick a tiny RISC-V core on an IC somewhere.
Hardware enumeration IMO can be harder than dealing with ISA variations. (And to my knowledge, RISC-V hasn’t solved any hardware enumeration problems.)
It's also not true, unless you also consider that x86 is fragmented. RISC-V Intl has defined two platform specifications with apparently more to follow. Those will define a set of extensions which are required, and the rest will be probed at runtime, which is exactly how x86 works.
I see so many comments here about RISC-V fragmentation. This is not my expertise, so correct me if I am wrong, but if this means a particular RISC-V CPU may not be able to execute some instruction, my understanding is that the RISC-V designers expect missing instructions to be implemented in software.
So if the CPU runs into an instruction that it doesn't understand, this will trigger a missing instruction trap and software can implement the instruction. Fragmentation problem solved, no?
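Conceptually yes, that's classic trap-and-emulate. A heavily simplified sketch of what such a handler does is below; the trap_frame layout here is made up for illustration, and a real handler lives in firmware or the kernel and has to deal with CSRs, privilege levels, and compressed encodings:

    #include <stdint.h>

    /* Hypothetical saved CPU state handed to an illegal-instruction
     * trap handler (a real M-mode/kernel handler reads mepc, mtval...). */
    struct trap_frame {
        uint64_t regs[32];   /* x0..x31, with x0 hardwired to zero */
        uint64_t pc;         /* address of the faulting instruction */
    };

    /* Emulate MUL (from the M extension) on a core that lacks it.
     * MUL encoding: opcode=0x33, funct3=0, funct7=1. */
    int emulate_illegal(struct trap_frame *tf, uint32_t insn) {
        uint32_t opcode = insn & 0x7f;
        uint32_t rd     = (insn >> 7)  & 0x1f;
        uint32_t funct3 = (insn >> 12) & 0x7;
        uint32_t rs1    = (insn >> 15) & 0x1f;
        uint32_t rs2    = (insn >> 20) & 0x1f;
        uint32_t funct7 = (insn >> 25) & 0x7f;

        if (opcode == 0x33 && funct3 == 0 && funct7 == 1) {
            if (rd != 0)     /* writes to x0 are discarded */
                tf->regs[rd] = tf->regs[rs1] * tf->regs[rs2];
            tf->pc += 4;     /* step past the emulated instruction */
            return 0;
        }
        return -1;           /* genuinely unknown instruction */
    }

The catch is that a trap costs orders of magnitude more than a native instruction, so this solves fragmentation for correctness, not for performance.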
The main flaw in the emerging hardware ecosystem is the lack of reciprocal/cooperative licensing.
The RISC-V ISA and some of the open cores allow (or even encourage) being bundled with proprietary hardware.
Even locked-down, user-hostile hardware with patents and cryptographic protections is allowed.
If you expect RISC-V to be the "Linux of hardware", think again. Your next phone will not be "free" as in "free-from-backdoors" or "free-from-patents".
Yet, compared to the status quo, it's a step in the right direction.
I don't know anything about the design flaws, so I can't comment on that, but as for the rest, the strategy of RISC-V worked out pretty well for everyone with the advent of the PC[1]. I'm not sure why it wouldn't work again. What would be so different between how things were done then with the PC and this now?
See, the confusing miasma surrounding USB and its various overloaded capabilities.
Or PCIe -- if you want to use ROCm on AMD GPUs, your PCIe complex needs to support "PCIe atomics", which, uh, none of my computers do. AMD's boards do, though!
That's nitpicking though. I don't think the PC's open architecture is really comparable; the truly open bits like PCI are much less "open to interpretation" than RISC-V, while bits like the CPU socket, chipsets, and even the keyboard port were either proprietary or clones of proprietary tech.
No board support is required, this is on the SoC ("CPU") side, unless you're trying to connect the GPU via a chipset-side low-bandwidth slot.
Intel supports it since Haswell on desktop (and earlier in Xeons apparently), Marvell ThunderX2 and Mellanox BlueField support it, I'm pretty sure Ampere does too.
Yeah, it's the only reason I've thought about upgrading my Polaris card to something newer.
In my case, I tried connecting both via thunderbolt on some skylake/coffeelake laptops and on power9. Neither Intel's TB3 controller nor power9 support the necessary extension.
Not really a big deal, but a nitpicky example of how "fragmentation" sometimes hurts.
RISC-V is just an instruction set specification. It says absolutely nothing about the peripherals that need to be attached to a RISC-V core to produce a usable system.
Whereas the IBM PC was delivered as a complete system (including all the peripherals and glue logic) and to achieve 100% compatibility the clones were exactly that: they copied absolutely everything about the PC, down to level of the individual logic gates in the "glue" that interfaced the CPU and memory and peripheral controllers.
When I hear RISC-V hype, I can only think about ARM and the extremely sorry state of cross-compatibility there (see: Linux device trees and various vendor BSP junk), and how RISC-V is likely just going to be the same except worse. So no, I have no high hopes for RISC-V.
Device trees are fine and what you would find deeper in an x86 system if you looked.
BSP junk is annoying, but you don't have to use it. Just buy hardware that works on mainline Linux. That might mean you have to buy last gen hardware, but so what? A quad core A53 is still fairly fast.
Can someone with domain expertise comment specifically on the criticisms regarding the mistakes in the core spec, like wasted 32bit space and missing important instructions, detailed in [1] and [2] ?
In my opinion those complaints are generally correct. I work on low level OS and toolchain components (static and dynamic linkers) and interact with CPU architects. When I looked at RISC V's addressing modes I was dumbstruck, they are completely inadequate for modern high performance desktop or mobile cores. Just compare the design of PC relative branches on the two architectures:
ARM64 (b/bl): These instructions reserve 26 bits of the 32 bit instruction for a branch target (with an implicit 2 bit shift since instructions are 4 byte aligned), resulting in the ability to directly jump ±128MB (AKA ±2^25 instructions).
RISCV (jal): This instruction reserves 20 bits of the 32 bit instruction for displacement (there is an implicit shift of 1 bit since all RISCV instructions are 2 byte aligned). This results in the ability to jump ±1MB (AKA ±2^19 instructions).
This is often a non-issue for small embedded cores because the code running on them is fairly compact and can be tuned for specific cores. It is a nightmare for large desktop and UI stacks (or web browsers), which often have many linked images and are much larger than 2MB. You can make it all work, but you need to add extra address calculation instructions or branch islands to do it, and those waste a bunch of space (what arm64 can do in a single instruction requires 2 or 3 on RISCV). Now you have all those extra instructions in your I-cache, extra jumps to branch islands wasting predictor slots, etc. You can try to solve this in hardware by adding special predictors to recognize branch islands, or using op fusion to recognize idiomatic jump calculations, but that makes the chips more complex and still does not solve the code density issue (you can try to overcome some of that with trace caches, but that is again more complexity).
There is no simple way to fix these issues in RISCV, because all the prime encoding space is gone. The best you can do is add better addressing modes in the 48 bit opcode space, but that introduces significant code bloat (if you just make every unresolved target in a .o file use 48 bit jump instructions), buys you very little (if you continue using 32 bit instructions by default and only use the newer instructions in linker generated branch islands), or requires complex software, binary format, and tooling changes that are never likely to happen in order to dynamically relax function bodies and have the linker choose the size of the instructions (and the real kicker is that any improvements made to toolchains to accomplish this would still not overcome arm64's better instruction design BUT would provide some improvements to arm64 binaries).
I could go into a similar analysis of pc relative load instructions and how adrp is much better than auipc for large codebases. RISCV just wastes tons of bits in prime places in the encoding space. JAL blows 5 bits on encoding a return register. Technically that is more generic and orthogonal than having an architecturally specified return register that the instruction uses implicitly, but those 5 bits are incredibly valuable; they would have increased the displacement from ±1MB to ±32MB. Yes, specifying the register lets them play fun tricks in their calling conventions to simplify their prologue and epilogue code, but that really cannot justify the loss of branch reach. What is so infuriating is that they had an instruction like that (J), but they removed it because they did not want any instructions to use implicit registers (and I believe it cannot be added back because the encoding space has been reused). I understand the desire for architectural purity, but by doing so they doomed every high performance implementation to micro-architectural chaos.
While it might be tempting to think I have just rabbit holed on a single issue, it really is a big deal. Something like 5-10% of generated instructions are PC-relative jumps, so getting them wrong has significant impacts... I would estimate this one issue alone will result in a ~5-10% code size increase (but only once you start having large binaries; it does not have an impact for anything less than ~1-2MB in size). It might not matter for small embedded controllers or in order cores, but it makes implementing high performance out of order cores much more complex. It is certainly possible to overcome these issues, but it means that for out of order RISCV cores to achieve similar performance to ARM64 cores on large codebases they will need larger, more complex branch predictors, larger caches, extra decode logic, and potentially trace caches. This is not an isolated issue, it is just one where I have domain expertise; I have heard similar criticisms of other parts of the instruction set from people who work in other parts of the stack.
Just to be clear, I don't think RISCV is terrible. I think it will be great for people doing custom cores with custom toolchains to ship bespoke silicon in small devices where the general purpose compute requirements are low. IOW, it is great if you just need some sort of CPU core but that is not what is really special about your silicon. On the other hand, I simply do not see it ever being a competitor to arm64 in high end mobile devices, desktops, workstations, or servers. In order to fix the issues with it they would need to reclaim a bunch of allocated instruction encodings (maybe RISC-6?).
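To put numbers on the reach argument above, the arithmetic is just immediate bits plus the implicit alignment shift; a quick self-contained check in C, using the field widths quoted above:

    #include <stdio.h>

    /* Reach of a PC-relative jump: a signed immediate of imm_bits bits,
     * shifted left by `shift` because targets are 2- or 4-byte aligned,
     * gives +/- 2^(imm_bits - 1 + shift) bytes. */
    static long long reach_bytes(int imm_bits, int shift) {
        return 1LL << (imm_bits - 1 + shift);
    }

    int main(void) {
        /* RISC-V JAL: 20 immediate bits, 2-byte alignment -> +/- 1 MiB   */
        printf("jal : +/- %lld MiB\n", reach_bytes(20, 1) >> 20);
        /* ARM64 B/BL: 26 immediate bits, 4-byte alignment -> +/- 128 MiB */
        printf("b/bl: +/- %lld MiB\n", reach_bytes(26, 2) >> 20);
        return 0;
    }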
What I'm hearing in this post is "Where RISC-V uses a smaller field size for something than ARM64 they have under-provisioned and will need extra instructions, and where RISC-V uses a larger field size for something than ARM64 it will never be used and is wasted".
In other words, ARM's architects chose every parameter correctly, and RISC-V's chose every parameter badly.
It might be true, but you kind of have to prove it, not just assert it.
Take the J/JAL vs B/BL range for example. It's not just embedded. Look through your Linux distro's binaries and you'll find very few with TEXT size over 2 MB.
One of the few on my (x86_64, but it doesn't really matter) system is /opt/google/chrome/chrome with 159,885,517 bytes.
That exceeds ARM64's single-instruction BL range, as well as RISC-V's.
A quick analysis shows that of the 373500 callq instructions in the binary, statically 55.53% fall within the RISC-V single-instruction range, 100% fall within the ARM64 range. Dynamically I don't know, but I suspect the percentage that fall in the RISC-V range would be a lot higher.
Anyway .. code size. The extra AUIPC instructions needed in the RISC-V program will make it around 648 KB or 0.4% larger than the ARM64 program.
That's a pretty far cry from the "I would estimate this one issue alone will result in a ~5-10% code size increase" you state. I mean -- that's a factor of 12x to 25x different from what you state.
But wait there's more.
One reason that RISC-V JAL offsets are limited is that 2 bits out of 32 are taken up by indicating whether the current instruction is 4 bytes or 2 bytes in size.
So that's a waste right?
It would be if you didn't use it. But RISC-V does use it. In a typical RISC-V program, around 50% to 60% of all instructions use a 2 byte opcode, giving a 25% to 30% reduction in code size.
On the same 152 MB program (Chrome), where not having long JAL offsets costs 0.6 MB of code size, the C extension will probably save around 40 to 45 MB of code size.
That seems like a pretty good trade to me.
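Reproducing that back-of-the-envelope math (the instruction count and the 55.53% are the figures measured above; the rest follows from 4 bytes per extra AUIPC and the quoted 25-30% compression ratio, which lands in the same ballpark as the 40 to 45 MB estimate):

    #include <stdio.h>

    int main(void) {
        double text_bytes = 159885517.0;  /* chrome TEXT size from above */
        double calls      = 373500.0;     /* callq instructions counted  */
        double in_range   = 0.5553;       /* fraction within +/- 1 MiB   */

        /* Each out-of-range call needs one extra 4-byte AUIPC. */
        double extra = calls * (1.0 - in_range) * 4.0;
        printf("extra AUIPCs: ~%.0f KiB (~%.2f%% of text)\n",
               extra / 1024.0, extra / text_bytes * 100.0);

        /* The C extension shrinks 50-60% of instructions to 2 bytes,
         * i.e. roughly 25-30% of total code size. */
        printf("C-ext saving: ~%.0f to %.0f MB\n",
               text_bytes * 0.25 / 1e6, text_bytes * 0.30 / 1e6);
        return 0;
    }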
What will the speed effect of those extra AUIPC instructions be? I don't know. I'd have to instrument Chrome and run it at a fraction of normal speed to find out.
That's definitely something that should be done before making a pronouncement that one ISA is definitely better and the other one made all the wrong trade-offs.
However, my experience of analyzing smaller programs is that the dynamic penalty (execution speed) is typically much less than the static penalty (code size). At a wild guess, I'd go with four times less, or 0.1%.
That's in the noise.
Might a RISC-V core be enough simpler than a comparable ARM64 core to clock 0.1% faster? Could well be. Might it be enough simpler to be 0.1% smaller and thus cost 0.1% less in die space -- or allow you to put 0.1% more cores on the same chip? Could well be.
Even the detractors don't argue that RISC-V isn't simpler. "It's too simple" they say, takes purity and orthogonality too far.
Maybe, but you need to prove it, not just assert it.
You are correct, I am just making an assertion, but I don't have to prove it; I will be satisfied to wait and watch things play out. There is a lot of money and many industry players working on RISC-V, so eventually the market should provide evidence to prove or invalidate my thesis.
I don't think the ARM architects did everything correctly and the RISC-V architects did everything wrong; I just chose an example I felt was an issue. On the other hand, I think that RISC-V supporting variable length instructions encoded such that they can be easily decoded in parallel was a very good use of encoding space. What frustrates me about RISC-V is that it feels like it ignored the last 20 years of industry experience and made a lot of unforced errors.
I think you are correct that I misestimated the branch density in normal linux binaries (the system I work on is a bit different), so I will take back my claim about the code size increase, but I also think large binaries like Chrome are more significant than you seem to imply, especially once you start looking at desktop and mobile platforms. We can argue about people writing bloated code, but the fact is that apps like Twitter and Facebook ship mobile apps with over 100MB of executable code. And these things are not getting smaller over time. As code sizes increase that 2MB jump window is going to look very small.
It is going to be interesting to see how this plays out over the next few years.
It's incredible how dismissive people are, when they clearly haven't thought about it that much.
> ISA fragmentation ... binary distributions harder.
An overly simple, one-sided analysis. You have to look at both the costs and the benefits, not simply say 'look, there is a cost, therefore it is bad'.
The modular architecture also allows RISC-V to be optimal in many different industries and form factors, from literally the smallest CPU ever created to massive multi-core servers.
The system is designed so that as much software as possible can run on the minimal spec and thus work does not have to be repeated between different industries.
Just because people use the custom extension feature a lot in embedded does not imply that the same thing happens for desktop/server. The constraints are different, the market is different, the industry is different. So far we have seen all Linux distributions built against the same profile.
> Economics. RISC-V has actively courted embedded, which makes sense as a niche.
Actually wrong. The first profile that was standardized was the one for full linux.
Embedded simply got more interest because more people could get it to market faster, and many universities work on embedded.
This is open source; people are just going to use it for what they want, and embedded had the most need and the lowest barrier to entry.
> Openness doesn’t tickle down. The openness of an ISA doesn’t have much impact on the implementation. A design with restricted signing keys is completely acceptable under their licensing
The point of RISC-V is to make it POSSIBLE to create an open chip and use it commercially.
People for many reasons wanted to use open implementations, and RISC-V makes this possible.
And thanks to this philosophy there actually are nice open chips, including open chips with commercial appeal.
Saying "well, not everything is open, therefore it's bad" is just such a wrongheaded attitude, it's unbelievable.
> Design flaws. RISC-V seems like it hasn’t learned anything from CPUs designed after 1991.
If you actually read the spec, they literally have design considerations for every instruction, with all the learning from the past and why they did what they did. Simply because you don't agree with all their conclusions doesn't mean they didn't learn anything.
And when you draw conclusions by only pointing out the issues on one side, without considering the other, you are just not doing serious analysis.
The goal is to be just close enough to RISC-V to get the benefit of work on it, but leave behind the design errors, so that formal verification and compiler targeting can be ported with minimal effort. Credible lists of RISC-V design errors have been published.
RISC-V is targeting embedded customized chips and research and does quite well in that area.
In that area, sometimes having an easier design is worth much more than putting certain optimizations into the ISA spec.
But from what _I_ know, RISC-V is neither targeting nor suited for general purpose computing, like general purpose servers, desktops, and anything but low-end phones/tablets.
The reasons for this are manifold, including many things I can't assess, and things like the fact that for such targets you likely want a bunch of changes, e.g. requiring support for at least one load between an LR/SC pair, which might be required to be based on the LR.
But then, if it had targeted general purpose computing platforms, I believe it would have failed.
The fact that RISC-V allows extremely simple implementations is what makes it successful I guess.
So they just need to bring out a tweaked RISC-V spec which requires more ISA extensions, more support for LR/SC and similar, and maybe also makes some ISA changes, and voila, it can move into that space, likely reusing a good amount of tooling.
RISC-V seems to be the equivalent of Go in so many ways, stylistic simplicity at the expense of performance and actual simplicity. That doesn't make it bad, just flawed like everything else.
It won't take over user<->device centric computing (desktop/laptop/tablet/high end mobile). It might become the ubiquitous IoT processor, but the divergence of chip capabilities (and subsequent OS & ABI fragmentation) means nothing will be interoperable, even at the source code level. When I compare that to the extreme interoperability of RPi + Pico Pi devices (even with all their painful flaws), I know which one I would prefer.
There's really a lot of severely misinformed opinion there.
"Much of the hype of RISC-V is hoping for laptop/desktop/server class silicon."
Really? RISC-V is pitched mostly against the likes of ARM, which seems to be doing fine without laptop/desktop/server class silicon.
Several RISC-V vendors are already shipping Cortex A53/A55 class CPUs, which are used as the LITTLE cores in mobile devices -- and even still the main cores in lower end mobile. Several RISC-V vendors have formally announced A72 (SiFive U84) or A73 (Alibaba C910) class cores. Alibaba is apparently using these internally already, and boards for general sale are expected this year. SiFive's U84 will probably be shipping in around 12 months from now.
ARM is a few steps ahead with A75 and A76, but those are just incremental developments and RISC-V is catching up fast.
It seems the article was written before the HiFive Unmatched was announced -- but I don't think it was written before the U74 CPU cores in the Unmatched were announced in October 2018, so I guess the author either wasn't paying attention, or else doesn't understand the standard 2 to 2.5 years from announcement of a core to shipping products using an SoC with that core. There is nothing surprising about the HiFive Unmatched.
Was the article also written before Apple announced its switch to ARM64 architecture and their M1 chip?
The M1 uses ARM's 64 bit instruction set, but is far ahead of anything from ARM or its other licensees in performance.
If someone made the same level of investment in a RISC-V core and SoC as in the M1 then that RISC-V product would perform basically the same as the M1. That's a several billion dollar investment. Apple has that kind of money, but the best known RISC-V vendors such as SiFive (total funding to date under $200 million) don't.
That's an economic problem to solve, not a problem with the RISC-V ISA.
Alibaba or Huawei might well make that kind of investment in RISC-V. They are definitely both very interested in it.
The article concludes with the same old links to uninformed people making uninformed criticisms of RISC-V. I don't know who erincandescent is, other than having written this rather famous post. Apparently the credibility of the post lies in them being an ARM engineer. Could be. ARM has thousands of engineers.
Here's the opinion of probably THE most important ARM engineer of the 1990s and 2000s, Dave Jaggar who developed the ARM7TDMI, Thumb, Thumb2.
Check at 51:30 where he says "I would Google RISC-V and find out all about it. They've done a fine instruction set, a fine job [...] it's the state of the art now for 32-bit general purpose instruction sets. And it's got the 16-bit compressed stuff. So, yeah, learning about that, you're learning from the best."