Looks like an SRAM load on AVR takes 3 cycles and an EEPROM load 4 cycles[0], with 1 cycle subtracted for consecutive reads. An SRAM store is 1-2 cycles. FMUL (fixed-point multiply) is 2 cycles. The CPU is neither pipelined nor cached.
3 cycles for the load + 2 for the multiply + 1 for the store = 6 clocks to multiply a fixed-point coefficient against an array element in program ROM. I couldn't find the corresponding document for the CH32V003/QingKe V2A/RV32EC, but some of the PDFs mention pipelines, so I suppose users aren't expected to count clock cycles and it's just vastly more efficient. That could be it.
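For concreteness, here's a minimal C sketch of the per-element operation being counted above: a Q1.7 fixed-point multiply roughly matching what AVR's FMULS does in hardware (multiply two signed 1.7 fractions, shift the result left one bit to a 1.15 fraction), applied across an array. The function names and the Q1.7 format choice are my own illustration, not from the article.

```c
#include <stdint.h>

/* Q1.7 x Q1.7 -> Q1.15, roughly FMULS semantics (the -1.0 * -1.0
   corner case overflows, as it does in the hardware instruction). */
static int16_t q7_mul(int8_t a, int8_t b) {
    return (int16_t)(((int16_t)a * (int16_t)b) * 2);
}

/* Scale each element by one coefficient: per iteration this is the
   load + multiply + store sequence whose cycles are tallied above. */
void q7_scale_array(const int8_t *in, int16_t *out, int n, int8_t coeff) {
    for (int i = 0; i < n; i++) {
        out[i] = q7_mul(in[i], coeff);
    }
}
```

E.g. with coeff = 64 (0.5 in Q1.7), an input of 64 (0.5) yields 8192, which is 0.25 in Q1.15.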
On the CH32V003, a load should be two cycles if the code executes from SRAM; there are additional wait states for loads from flash. The V2A only caches a single 32-bit instruction word, so there is basically no cache.
This publication seems to describe the Arduino implementation in more detail:
It appears that the code even uses floats in some implementations, which have to be emulated in software on the AVR. So I'd wager that discrepancies at both the algorithmic level (QAT-NN) and the implementation level lead to the better performance on the CH32V003.
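To illustrate the gap, here's a hedged sketch (my own code, not the article's) contrasting the same dot product written with floats, which an FPU-less AVR must emulate in software, against the int8 form a quantization-aware-trained network can use, which maps onto a single cheap integer multiply per element:

```c
#include <stdint.h>

/* Float version: on AVR every multiply and add here is a software
   library call, typically tens to hundreds of cycles each. */
float dot_f32(const float *w, const float *x, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; i++) acc += w[i] * x[i];
    return acc;
}

/* int8 QAT-style version: one hardware MUL-class instruction per
   element, accumulated in a 32-bit register. */
int32_t dot_i8(const int8_t *w, const int8_t *x, int n) {
    int32_t acc = 0;
    for (int i = 0; i < n; i++) acc += (int16_t)w[i] * x[i];
    return acc;
}
```

Both compute the same dot product; the difference is purely in what the target CPU can do natively, which is where I'd expect most of the discrepancy to come from.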
0: pp. 70-, https://ww1.microchip.com/downloads/en/devicedoc/atmel-0856-...