Looks like an SRAM load on AVR takes 3 cycles and an EEPROM load 4 cycles[0], with 1 cycle subtracted for consecutive reads. An SRAM store is 1-2 cycles. FMUL (fixed-point multiply) is 2 cycles. The CPU is neither pipelined nor cached.
3 cycles for the load + 2 for the multiply + 1 for the store = 6 clocks to multiply a fixed-point coefficient against an array element in program ROM. I couldn't find the corresponding document for the CH32V003/QingKe V2A/RV32EC, but some of the PDFs mention pipelines, so I suppose users aren't expected to count clock cycles and it's just vastly more efficient. That could be it.
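For concreteness, here's a minimal C sketch of the per-element operation being counted above: a Q1.7 fixed-point multiply roughly matching what AVR's FMULS does in hardware (multiply two signed 1.7 fractions, shift the result left one bit to a 1.15 fraction), applied across an array. The function names and the Q1.7 format choice are my own illustration, not from the article.

```c
#include <stdint.h>

/* Q1.7 x Q1.7 -> Q1.15, roughly FMULS semantics (the -1.0 * -1.0
   corner case overflows, as it does in the hardware instruction). */
static int16_t q7_mul(int8_t a, int8_t b) {
    return (int16_t)(((int16_t)a * (int16_t)b) * 2);
}

/* Scale each element by one coefficient: per iteration this is the
   load + multiply + store sequence whose cycles are tallied above. */
void q7_scale_array(const int8_t *in, int16_t *out, int n, int8_t coeff) {
    for (int i = 0; i < n; i++) {
        out[i] = q7_mul(in[i], coeff);
    }
}
```

E.g. with coeff = 64 (0.5 in Q1.7), an input of 64 (0.5) yields 8192, which is 0.25 in Q1.15.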
On the CH32V003, a load should be two cycles if the code executes from SRAM; there are additional wait states for loads from flash. The V2A only caches a single 32-bit instruction word, so there is basically no cache.
This publication seems to describe the Arduino implementation in more detail:
It appears that the code even uses floats in some implementations, which have to be emulated in software on the AVR. So I'd wager that discrepancies at both the algorithmic level (QAT-NN) and the implementation level lead to the better performance on the CH32V003.
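To illustrate the gap, here's a hedged sketch (my own code, not the article's) contrasting the same dot product written with floats, which an FPU-less AVR must emulate in software, against the int8 form a quantization-aware-trained network can use, which maps onto a single cheap integer multiply per element:

```c
#include <stdint.h>

/* Float version: on AVR every multiply and add here is a software
   library call, typically tens to hundreds of cycles each. */
float dot_f32(const float *w, const float *x, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; i++) acc += w[i] * x[i];
    return acc;
}

/* int8 QAT-style version: one hardware MUL-class instruction per
   element, accumulated in a 32-bit register. */
int32_t dot_i8(const int8_t *w, const int8_t *x, int n) {
    int32_t acc = 0;
    for (int i = 0; i < n; i++) acc += (int16_t)w[i] * x[i];
    return acc;
}
```

Both compute the same dot product; the difference is purely in what the target CPU can do natively, which is where I'd expect most of the discrepancy to come from.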
0: pp. 70-, https://ww1.microchip.com/downloads/en/devicedoc/atmel-0856-...