FYI, CoreMark fits entirely within an L1 cache, so performance is significantly affected by load-to-use and load-to-load latency. The "skewed" pipeline design of SweRV helps it really shine here! OoO cores typically have worse load-to-use latency, and the OoO advantage typically shows up once you start missing in the L1 cache.
The other significant factor in CoreMark performance is a good branch predictor, and on that front I'm very impressed that their gshare predictor can completely learn CoreMark.
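For anyone unfamiliar with gshare: it XORs the branch PC with a global history register to index a table of 2-bit saturating counters, which is what lets it learn history-dependent patterns like loop exits. Here's a minimal sketch; the table size and history length are made-up parameters, not SweRV's actual configuration:

```python
class Gshare:
    """Toy gshare branch predictor (illustrative parameters only)."""

    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.history = 0  # global branch history register
        # 2-bit saturating counters, initialized weakly-not-taken (1)
        self.counters = [1] * (1 << index_bits)

    def _index(self, pc):
        # XOR branch PC with global history to pick a counter
        return (pc ^ self.history) & self.mask

    def predict(self, pc):
        # Counter values 2 and 3 mean "predict taken"
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # shift the actual outcome into the history register
        self.history = ((self.history << 1) | int(taken)) & self.mask
```

Because the history register is part of the index, the same branch trains different counters depending on the path leading up to it, which is how it can nail the "taken N-1 times, then not taken" pattern of a fixed-count loop.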
I chose the A15 because it appeared on the WD performance slide with a score close to the SweRV's.
The target use cases of the SweRV are obviously very different from the A15 (lack of MMU, lack of Dcache, lack of floating point will indeed do that).
But I didn't expect that adding those features would have a significant impact on the peak performance of a benchmark that fits completely in cache.
My comment about complexity is entirely about the core execution pipeline: OoO vs in-order, number of pipes, number of ALUs. Even there, the A15 is significantly more complex than the SweRV, yet it performs more or less the same under best-case conditions. I expected at least some benefit. Since that's not the case, I assume this complexity helps for non-ideal use cases (which the SweRV will probably never experience).
Edit: as _chris_ points out: OoO helps once you start missing in the L1 cache.
If your point of reference is the layout of an Intel CPU, don't forget that those are on the order of 100 mm², whereas this layout is around 0.1 mm².
That's how it used to be 25 years ago. It's probably still a factor in the cost function, but the biggest part is timing. Of course, timing and length are closely related.
> It seems counter-intuitive that a random placement would be superior to a more regular one.
It's not necessarily superior if you want to have optimal timing for all paths between all cells. But you don't need that: only a few percent of all paths are actually timing critical. Those determine the maximum clock speed. The others have enough slack such that it doesn't matter that they are placed tens of microns too far.
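The "slack" argument above can be made concrete with a toy static-timing example (all path names and delay numbers here are invented for illustration):

```python
# Slack = clock period minus the signal's arrival time at the capturing
# flip-flop. Only the worst-slack paths limit the clock frequency; paths
# with plenty of slack can absorb extra wire delay from a placement that
# looks "random" to the eye.

clock_period = 1.0  # ns, hypothetical target

# (path name -> arrival time in ns) -- made-up example paths
arrival = {
    "alu_carry_chain": 0.98,  # nearly uses the whole cycle
    "regfile_read":    0.60,
    "decode_ctrl":     0.45,
}

slack = {name: clock_period - t for name, t in arrival.items()}

# Paths with less than 50 ps of slack are the timing-critical few
critical = [name for name, s in slack.items() if s < 0.05]
```

Here only `alu_carry_chain` is critical; the placer must keep its cells close together, while `decode_ctrl` could pick up, say, 0.1 ns of extra wire delay from being placed tens of microns away and still meet timing comfortably.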
Random placement (at an even lower level of detail) is also better for avoiding crosstalk. If you placed a bunch of driver cells in a nicely aligned stack and routed their wires, all parallel, to a nicely aligned stack of receiving flip-flops, you'd get the mother of all crosstalk problems.