Hacker News

Thanks, that explains the accuracy. But it doesn’t explain why it took 7 seconds to run inference on the Arduino, and milliseconds in this project…



The paper [0] regarding the Arduino implementation mentions their MCU runs at 16 MHz, and they are also running inference on 28x28 images.

This project's MCU runs at 48 MHz and is inferring on 16x16 images.

So, ~3x fewer pixels (784 vs. 256), at 3x the clock (48 vs. 16 MHz). That's a ~9x expected factor: 7000 / 9 ≈ 780 ms. Versus 13.5 ms, still a ~58x speed increase. Does seem high, depending maybe on differences between the AVR and RISC-V hardware and ISA. (E.g., might there be a RAM bottleneck on the AVR chip?)
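The back-of-envelope scaling above can be written out explicitly (the 7000 ms and 13.5 ms figures are the timings quoted in this thread):

```python
# Back-of-envelope: how much faster *should* the CH32V003 be, scaling
# only by pixel count and clock frequency?
pixels_avr = 28 * 28       # 784 pixels in the Arduino paper
pixels_ch32 = 16 * 16      # 256 pixels in this project
clock_avr_mhz = 16
clock_ch32_mhz = 48

scale = (pixels_avr / pixels_ch32) * (clock_ch32_mhz / clock_avr_mhz)
expected_ms = 7000 / scale           # time the naive scaling predicts
actual_ms = 13.5                     # measured time from this project
gap = expected_ms / actual_ms        # unexplained remaining speedup

print(f"scale {scale:.1f}x, expected {expected_ms:.0f} ms, gap {gap:.0f}x")
```

So even after accounting for fewer pixels and a faster clock, roughly a 56x gap remains to be explained by ISA, memory, or implementation differences.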

[0] https://arxiv.org/ftp/arxiv/papers/2105/2105.02953.pdf


Looks like an SRAM load on AVR takes 3 cycles, EEPROM 4 cycles [0], with 1 cycle subtracted for consecutive reads. An SRAM store is 1-2 cycles. FMUL (fixed-point multiply) is 2 cycles. The CPU is neither pipelined nor cached.

3 cycles for the load + 2 for the multiplication + 1 for the store = 6 clocks for multiplying a value against an array in program ROM. I just couldn't find a corresponding document for the CH32V003/QingKe V2A/RV32EC, but some of the PDFs mention pipelines, so I suppose users aren't expected to count clock cycles and it's just vastly more efficient. That could be it.
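Taking those cycle counts at face value, you can bound how many multiply-accumulates the AVR could have done in 7 seconds (the 6-cycles-per-MAC figure is the estimate above, not a measured number):

```python
# Rough per-MAC cost on the AVR, using the cycle estimate above:
# load (3) + FMUL (2) + store (1) = 6 cycles, no pipelining.
cycles_per_mac = 3 + 2 + 1
f_cpu_hz = 16_000_000                 # 16 MHz AVR clock

t_mac_s = cycles_per_mac / f_cpu_hz   # seconds per multiply-accumulate
macs_in_7s = 7.0 / t_mac_s            # MAC budget in the observed 7 s

print(f"{t_mac_s * 1e9:.0f} ns per MAC, ~{macs_in_7s / 1e6:.1f}M MACs in 7 s")
```

That budget (~18M MACs) is far more than a small 28x28 MNIST network needs, which suggests the 7 s is not explained by raw memory latency alone.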

[0] pp. 70 ff., https://ww1.microchip.com/downloads/en/devicedoc/atmel-0856-...


On the CH32V003, a load should take two cycles when the code is executed from SRAM; there are additional wait states for loads from flash. The V2A only caches a single 32-bit instruction word, so there is effectively no cache.

This publication seems to describe the Arduino implementation in more detail:

https://arxiv.org/abs/2105.02953

It appears that the code is even using floats in some places, which have to be emulated in software. So I'd wager there are discrepancies at both the algorithmic level (QAT-NN) and the implementation level that lead to the better performance on the CH32V003.
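The float-emulation point is worth spelling out: neither the AVR nor the CH32V003 has an FPU, so every float operation compiles to a softfloat library call, while a quantized network uses plain integer instructions. A minimal sketch of the two inner loops (illustrative only, not the code from either project):

```c
#include <stdint.h>

/* Float dot product: on an FPU-less target, each '*' and '+' below
   becomes a library call (e.g. __mulsf3 / __addsf3 in libgcc),
   dozens of instructions per operation. */
float dot_f32(const float *a, const float *b, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}

/* Quantized (int8) dot product, as a QAT-trained network would use:
   each step is a handful of single-cycle-ish integer instructions.
   Requantization/scaling of the accumulator is omitted for brevity. */
int32_t dot_q8(const int8_t *a, const int8_t *b, int n) {
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += (int32_t)a[i] * b[i];
    return acc;
}
```

The 10-50x cost of emulated float math per operation could plausibly account for much of the remaining gap between the two implementations.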



