RISC-V cores are about 4 years behind ARM, so at the moment they are most suitable for embedded systems and single-board computers.
At the lower end they have a big advantage: they can reach similar performance to Arm with half the silicon, which translates into roughly 4x lower price at a similar production volume.
But for high-end chips you will not see this difference.
SiFive has been catching up by about 2 years each year, so I suspect it will take no more than 3-4 years before RISC-V can compete with Arm desktop computers. Competing with Apple will take much longer.
But such perspectives kind of miss the point. The beauty of RISC-V is that it can be used in all sorts of specialized hardware such as accelerators. You may be able to program AI engines, or maybe even some kind of graphics card, with RISC-V instructions in the future.
If anything, a more streamlined frontend matters more at the high end, because it enables more scalable decode, letting you extract more parallelism out of the code. Better use of your frontend complexity budget also makes it easier to use things like fused micro-instruction sequences, or the RISC-V compressed instruction set, which brings code density on par with x86 (an outstanding result for such a clean design) for both RV32 and RV64. This is pretty much what we see going from x86 to ARM, and RISC-V just goes further in the same direction.
Decode doesn't take up much space in a high-end processor. Even if your architecture's decoder doesn't "scale" as well, you can just throw a bunch more transistors at it to get performance. That said, I doubt there's many CPU designers unhappy about a simpler ISA...
Hmm, I'm no expert on the subject, but I had heard that CMP/Jcc and TEST/Jcc macro-op fusion has been critical to i386 and amd64 performance, for both Intel and AMD processors, for about 15 years?
Those are "simple" fusions that processors do perform productively. The thing with RISC-V is that a lot of the design rides on "oh, we're just going to combine all these very RISC instructions into macro-ops to catch up to what ARM is doing", except nobody is really doing that kind of fusion right now, so it's not clear whether it's even feasible.
Hmm, not even the Allwinner D1 or the SiFive P650? In https://news.ycombinator.com/item?id=29423006 Bruce Hoult says SiFive has been doing macro-op fusion in one case since the 7-series, and I think he's the one that implemented it.
Certainly I didn't implement it, that was Andrew Waterman. I'm not a hardware person.
I think macro-op fusion (with a few exceptions) is useful only on a narrow set of mid-range cores. High end cores want to split everything up (RISC-V comes pre-split), while low end cores want to use as little hardware as possible.
There are a few exceptions, for example combining a LUI or AUIPC with the following instruction(s) to form a full 32-bit constant. The data path is already a full 32 bits wide, so that has zero effect on the back end (other than removing an instruction). SLLI followed by SRLI or SRAI is probably another.
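To make the LUI fusion target concrete, here's a small sketch (my own illustration, not from any real toolchain) of how a 32-bit constant gets split across a LUI/ADDI pair. The wrinkle is that ADDI sign-extends its 12-bit immediate, so the upper immediate sometimes has to be bumped by one:

```python
def split_const(value: int):
    """Split a 32-bit constant into (lui_imm20, addi_imm12).

    ADDI sign-extends its 12-bit immediate, so if bit 11 of the low
    part is set, the low part acts as a negative number and we must
    bump the upper immediate by one to compensate.
    """
    value &= 0xFFFFFFFF
    lo = value & 0xFFF
    hi = value >> 12
    if lo & 0x800:           # low part will be sign-extended negative
        lo -= 0x1000         # e.g. 0xFFF becomes -1
        hi = (hi + 1) & 0xFFFFF
    return hi, lo

def materialize(hi: int, lo: int) -> int:
    """What the LUI/ADDI pair computes: (hi << 12) + sign-extended lo."""
    return ((hi << 12) + lo) & 0xFFFFFFFF

hi, lo = split_const(0xDEADBEEF)
assert materialize(hi, lo) == 0xDEADBEEF
```

A fusing frontend can recognize this adjacent pair and hand the back end a single ready-made 32-bit constant, which is why it costs nothing in the data path.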
> High end cores want to split everything up (RISC-V comes pre-split)
Yes, but a high-end core does not necessarily split instructions the same way the RISC-V ISA does.
For instance, take an indexed store:
- the RISC-V ISA splits it into an ADD+STORE pattern, which may be recognized by macro-op fusion;
- while a high-end core splits it into a store_address uop (which computes the effective address and updates the address field of the store-buffer entry) and a store_data uop (which updates the data field of the store-buffer entry, and is "address agnostic").
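The ADD+STORE side of this can be sketched as a toy peephole matcher (entirely illustrative; real decoders work on encoded bits, and would also have to check that the temporary address register is dead afterwards):

```python
# Instructions as tuples: ("add", rd, rs1, rs2) and ("sd", data, offset, base).
def fuse_indexed_store(insns):
    """Replace an adjacent (add rd, rs1, rs2) + (sd data, 0(rd)) pair
    with a single fused indexed-store macro-op."""
    out, i = [], 0
    while i < len(insns):
        cur = insns[i]
        nxt = insns[i + 1] if i + 1 < len(insns) else None
        if (nxt is not None
                and cur[0] == "add" and nxt[0] == "sd"
                and nxt[3] == cur[1]      # store base reg == add dest
                and nxt[2] == 0):         # zero offset
            # ("sd.idx", data, base, index): one macro-op, no temp register
            out.append(("sd.idx", nxt[1], cur[2], cur[3]))
            i += 2
        else:
            out.append(cur)
            i += 1
    return out

prog = [("add", "t0", "a0", "a1"), ("sd", "a2", 0, "t0")]
assert fuse_indexed_store(prog) == [("sd.idx", "a2", "a0", "a1")]
```

The point of the comparison above stands: the fused `sd.idx` macro-op still has to be cracked into store_address and store_data uops downstream, so the ISA-level split and the microarchitectural split are different axes.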
Macro-op fusion does not only reduce the number of uops; it can also reduce physical register usage, as intermediate results are not stored in physical registers.
For instance, the indexed load pattern (ADD+LOAD) without macro-op fusion needs:
- 2 physical register reads;
- 2 physical register writes (one per macro-op), meaning that 2 physical registers will be allocated.
While the macro-op fused version needs:
- 2 physical register reads;
- 1 physical register write, meaning that a single physical register will be allocated.
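That count can be checked with a back-of-the-envelope sketch (hypothetical uop representation of my own, just to make the rename-allocation rule explicit):

```python
# Each uop allocates one physical register per architectural destination
# it writes (one allocation per rename); uops with no destination
# (e.g. stores) allocate none.
def allocated_regs(uops):
    return sum(1 for u in uops if u.get("dest") is not None)

# Indexed load, unfused: add t0, a0, a1 ; ld t1, 0(t0)
unfused = [{"op": "add", "dest": "t0"}, {"op": "ld", "dest": "t1"}]
# Fused indexed load: ld.idx t1, (a0 + a1) -- the temp t0 never exists
fused = [{"op": "ld.idx", "dest": "t1"}]

assert allocated_regs(unfused) == 2
assert allocated_regs(fused) == 1
```

The saved register matters because the physical register file is one of the scaling bottlenecks in a wide out-of-order core.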
I'm talking about feasibility, not about a design win in a high-performance circuit, since there is no real high-performance RISC-V core today, and since it's the feasibility that was questioned by the parent.
I can't say whether macro-op fusion will be enough to beat ARM & x86, but I can tell you it's feasible.
Clearly macro-op fusion is a computable transformation of the instruction stream, so macro-op fusion is certainly feasible. I think what Jha was questioning was whether catching up to ARM is feasible if your main weapon is macro-op fusion. The Celio/Dabbelt/Patterson/Asanović tech report I linked yesterday shows that it's promising, but it's been six years and I'd like to know if that's actually being tried and how well it works.
Bruce Hoult's comment from a few months ago suggests that it wasn't really being tried at SiFive, but he left SiFive a couple years ago and might be out of date, and the Honey Badger designs are potentially a whole new ballgame.
> I think what Jha was questioning was whether catching up to ARM is feasible if your main weapon is macro-op fusion.
I don't think so. There are indeed macro-op fusion patterns that can reasonably be described as "unfeasible" or "unrealistic".
But the macro-op fusion patterns suggested in the RISC-V spec, in papers about macro-op fusion on RISC-V, and in the SiFive patent on macro-op fusion are realistic and feasible.
> Bruce Hoult's comment from a few months ago suggests that it wasn't really being tried at SiFive
Commits from SiFive people in GCC [1] suggest that short-forward-branch macro-op fusion patterns are recognized by SiFive hardware.
I don't know if they recognize other macro-op fusion patterns.
But SiFive has a bunch of stuff to improve before diving into macro-op fusion.