RISC-V cores are about 4 years behind ARM, so at the moment they are most suitable for embedded systems and single-board computers.
At the lower end they have a big advantage: they can reach similar performance to Arm with half the silicon, which translates into roughly 4x lower price at a similar production volume.
But for high-end chips you will not see this difference.
SiFive has been catching up by about 2 years each year, so I suspect it will take no more than 3-4 years before RISC-V can compete with Arm desktop computers. Competing with Apple will take much longer.
But such perspectives kind of miss the point. The beauty of RISC-V is that it can be used in all sorts of specialized hardware such as accelerators. You may be able to program AI engines, or maybe even some kind of graphics card, with RISC-V instructions in the future.
If anything, a more streamlined frontend matters more at the high end, because it enables more scalable decode, letting you extract more parallelism out of the code. Better use of your frontend complexity budget also makes it easier to use things like fused micro-instruction sequences, or the RISC-V compressed instruction set, which brings code density on par with x86 (an outstanding result for such a clean design) for both RV32 and RV64. This is pretty much what we see going from x86 to ARM, and RISC-V just goes further in the same direction.
Decode doesn't take up much space in a high-end processor. Even if your architecture's decoder doesn't "scale" as well, you can just throw a bunch more transistors at it to get performance. That said, I doubt there's many CPU designers unhappy about a simpler ISA...
Hmm, I'm no expert on the subject, but I had heard that CMP/Jcc and TEST/Jcc macro-op fusion has been critical to i386 and amd64 performance, for both Intel and AMD processors, for about 15 years?
Those are "simple" fusions that processors do perform productively. The thing with RISC-V is that a lot of the design rides on "oh, we're just going to combine all these very RISC instructions into macro-ops to catch up to what ARM is doing", except nobody is really doing that kind of fusion right now, so it's not clear whether it's even feasible.
Hmm, not even the Allwinner D1 or the SiFive P650? In https://news.ycombinator.com/item?id=29423006 Bruce Hoult says SiFive has been doing macro-op fusion in one case since the 7-series, and I think he's the one that implemented it.
Certainly I didn't implement it, that was Andrew Waterman. I'm not a hardware person.
I think macro-op fusion (with a few exceptions) is useful only on a narrow set of mid-range cores. High end cores want to split everything up (RISC-V comes pre-split), while low end cores want to use as little hardware as possible.
There are a few exceptions, for example combining a LUI or AUIPC with the following instruction(s) to form a full 32-bit constant. The data path is already a full 32 bits wide, so that has zero effect on the back end (other than removing an instruction). SLLI followed by SRLI or SRAI is probably another.
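To make the LUI fusion target concrete, here's a small sketch (my own illustration, not from any real toolchain) of how a 32-bit constant gets split across a LUI/ADDI pair. The wrinkle is that ADDI sign-extends its 12-bit immediate, so the upper immediate sometimes has to be bumped by one:

```python
def split_const(value: int):
    """Split a 32-bit constant into (lui_imm20, addi_imm12).

    ADDI sign-extends its 12-bit immediate, so if bit 11 of the low
    part is set, the low part acts as a negative number and we must
    bump the upper immediate by one to compensate.
    """
    value &= 0xFFFFFFFF
    lo = value & 0xFFF
    hi = value >> 12
    if lo & 0x800:           # low part will be sign-extended negative
        lo -= 0x1000         # e.g. 0xFFF becomes -1
        hi = (hi + 1) & 0xFFFFF
    return hi, lo

def materialize(hi: int, lo: int) -> int:
    """What the LUI/ADDI pair computes: (hi << 12) + sign-extended lo."""
    return ((hi << 12) + lo) & 0xFFFFFFFF

hi, lo = split_const(0xDEADBEEF)
assert materialize(hi, lo) == 0xDEADBEEF
```

A fusing frontend can recognize this adjacent pair and hand the back end a single ready-made 32-bit constant, which is why it costs nothing in the data path.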
> High end cores want to split everything up (RISC-V comes pre-split)
Yes, but a high-end core does not necessarily split instructions the same way the RISC-V ISA does.
For instance, take an indexed store:
- the RISC-V ISA splits it into an ADD+STORE pattern, which may be recognized by macro-op fusion;
- while a high-end core splits it into a store_address uop (which computes the effective address and updates the address field of the store-buffer entry) and a store_data uop (which updates the data field of the store-buffer entry, and is "address agnostic").
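The ADD+STORE side of this can be sketched as a toy peephole matcher (entirely illustrative; real decoders work on encoded bits, and would also have to check that the temporary address register is dead afterwards):

```python
# Instructions as tuples: ("add", rd, rs1, rs2) and ("sd", data, offset, base).
def fuse_indexed_store(insns):
    """Replace an adjacent (add rd, rs1, rs2) + (sd data, 0(rd)) pair
    with a single fused indexed-store macro-op."""
    out, i = [], 0
    while i < len(insns):
        cur = insns[i]
        nxt = insns[i + 1] if i + 1 < len(insns) else None
        if (nxt is not None
                and cur[0] == "add" and nxt[0] == "sd"
                and nxt[3] == cur[1]      # store base reg == add dest
                and nxt[2] == 0):         # zero offset
            # ("sd.idx", data, base, index): one macro-op, no temp register
            out.append(("sd.idx", nxt[1], cur[2], cur[3]))
            i += 2
        else:
            out.append(cur)
            i += 1
    return out

prog = [("add", "t0", "a0", "a1"), ("sd", "a2", 0, "t0")]
assert fuse_indexed_store(prog) == [("sd.idx", "a2", "a0", "a1")]
```

The point of the comparison above stands: the fused `sd.idx` macro-op still has to be cracked into store_address and store_data uops downstream, so the ISA-level split and the microarchitectural split are different axes.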
Macro-op fusion does not only reduce the number of uops; it can also reduce physical register usage, as intermediate results are not stored in physical registers.
For instance, the indexed load pattern (ADD+LOAD) without macro-op fusion needs:
- 2 physical register reads;
- 2 physical register writes (one per macro-op), meaning that 2 physical registers will be allocated.
While the macro-op fused version needs:
- 2 physical register reads;
- 1 physical register write, meaning that a single physical register will be allocated.
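That count can be checked with a back-of-the-envelope sketch (hypothetical uop representation of my own, just to make the rename-allocation rule explicit):

```python
# Each uop allocates one physical register per architectural destination
# it writes (one allocation per rename); uops with no destination
# (e.g. stores) allocate none.
def allocated_regs(uops):
    return sum(1 for u in uops if u.get("dest") is not None)

# Indexed load, unfused: add t0, a0, a1 ; ld t1, 0(t0)
unfused = [{"op": "add", "dest": "t0"}, {"op": "ld", "dest": "t1"}]
# Fused indexed load: ld.idx t1, (a0 + a1) -- the temp t0 never exists
fused = [{"op": "ld.idx", "dest": "t1"}]

assert allocated_regs(unfused) == 2
assert allocated_regs(fused) == 1
```

The saved register matters because the physical register file is one of the scaling bottlenecks in a wide out-of-order core.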
I'm talking about feasibility, not about a design win in a high-performance circuit, since there is no real high-performance RISC-V core today, and since it's the feasibility that was questioned by the parent.
I can't say whether macro-op fusion will be enough to beat ARM & x86, but I can tell you it's feasible.
Clearly macro-op fusion is a computable transformation of the instruction stream, so macro-op fusion is certainly feasible. I think what Jha was questioning was whether catching up to ARM is feasible if your main weapon is macro-op fusion. The Celio/Dabbelt/Patterson/Asanović tech report I linked yesterday shows that it's promising, but it's been six years and I'd like to know if that's actually being tried and how well it works.
Bruce Hoult's comment from a few months ago suggests that it wasn't really being tried at SiFive, but he left SiFive a couple years ago and might be out of date, and the Honey Badger designs are potentially a whole new ballgame.
> I think what Jha was questioning was whether catching up to ARM is feasible if your main weapon is macro-op fusion.
I don't think so. There are indeed macro-op fusion patterns that can reasonably be described as "unfeasible" or "unrealistic".
But the macro-op fusion patterns suggested in the RISC-V spec, in papers about macro-op fusion on RISC-V, and in the SiFive patent on macro-op fusion are realistic and feasible.
> Bruce Hoult's comment from a few months ago suggests that it wasn't really being tried at SiFive
Commits from SiFive people in GCC [1] suggest that short-forward-branch macro-op fusion patterns are recognized by SiFive hardware.
I don't know if they recognize other macro-op fusion patterns.
But SiFive has a bunch of stuff to improve before diving into macro-op fusion.