FYI, CoreMark fits entirely within an L1 cache, so performance is significantly affected by load-to-use and load-to-load latency. The "skewed" pipeline design of SweRV helps it really shine here! OoO cores typically have worse load-to-use latency, and the OoO advantage typically shows up once you start missing in the L1 cache.
The other significant factor in CoreMark performance is a good branch predictor, and on that front I'm very impressed that their gshare predictor can completely learn CoreMark.
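For anyone unfamiliar with gshare: it XORs the branch PC with a global history register to index a table of 2-bit saturating counters, which is what lets it learn history-dependent patterns like loop exits. Here's a minimal sketch; the table size and history length are made-up parameters, not SweRV's actual configuration:

```python
class Gshare:
    """Toy gshare branch predictor (illustrative parameters only)."""

    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.history = 0  # global branch history register
        # 2-bit saturating counters, initialized weakly-not-taken (1)
        self.counters = [1] * (1 << index_bits)

    def _index(self, pc):
        # XOR branch PC with global history to pick a counter
        return (pc ^ self.history) & self.mask

    def predict(self, pc):
        # Counter values 2 and 3 mean "predict taken"
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # shift the actual outcome into the history register
        self.history = ((self.history << 1) | int(taken)) & self.mask
```

Because the history register is part of the index, the same branch trains different counters depending on the path leading up to it, which is how it can nail the "taken N-1 times, then not taken" pattern of a fixed-count loop.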
I chose the A15 because it appeared on the WD performance slide with a score close to the SweRV's.
The target use cases of the SweRV are obviously very different from the A15 (lack of MMU, lack of Dcache, lack of floating point will indeed do that).
But I didn't expect that adding those features would have a significant impact on the peak performance of a benchmark that fits completely in cache.
My comment about complexity is entirely about the core execution pipeline: OoO vs in-order, number of pipes, number of ALUs. Even there, the A15 is significantly more complex than the SweRV, yet it performs more or less the same under best-case conditions. I expected at least some benefit. Since that's not the case, I assume this complexity helps for non-ideal use cases (which the SweRV will probably never experience).
Edit: as _chris_ points out: OoO helps once you start missing in the L1 cache.
If your point of reference is the layout of an Intel CPU, don't forget that those are on the order of 100 mm², whereas this layout is around 0.1 mm².
That's how it used to be 25 years ago. It's probably still a factor in the cost function, but the biggest part is timing. Of course, timing and length are closely related.
> It seems counter-intuitive that a random placement would be superior to a more regular one.
It's not necessarily superior if you want to have optimal timing for all paths between all cells. But you don't need that: only a few percent of all paths are actually timing critical. Those determine the maximum clock speed. The others have enough slack such that it doesn't matter that they are placed tens of microns too far.
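The "slack" argument above can be made concrete with a toy static-timing example (all path names and delay numbers here are invented for illustration):

```python
# Slack = clock period minus the signal's arrival time at the capturing
# flip-flop. Only the worst-slack paths limit the clock frequency; paths
# with plenty of slack can absorb extra wire delay from a placement that
# looks "random" to the eye.

clock_period = 1.0  # ns, hypothetical target

# (path name -> arrival time in ns) -- made-up example paths
arrival = {
    "alu_carry_chain": 0.98,  # nearly uses the whole cycle
    "regfile_read":    0.60,
    "decode_ctrl":     0.45,
}

slack = {name: clock_period - t for name, t in arrival.items()}

# Paths with less than 50 ps of slack are the timing-critical few
critical = [name for name, s in slack.items() if s < 0.05]
```

Here only `alu_carry_chain` is critical; the placer must keep its cells close together, while `decode_ctrl` could pick up, say, 0.1 ns of extra wire delay from being placed tens of microns away and still meet timing comfortably.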
Random placement (at an even lower level of detail) is also better for avoiding crosstalk. If you placed a bunch of driver cells in a nicely aligned stack and routed their wires, all parallel, to a nicely aligned stack of receiving flip-flops, you'd get the mother of all crosstalk problems.