
Impressive numbers compared with the linked Arduino project. Makes me wonder, what’s the difference in approach?



The difference is the use of quantization-aware training (QAT), where the quantization of the weights is already simulated during training. This helps restructure the network in a way where it can optimally store information in the allotted number of bits per weight.

When the NN is quantized only after training, a lot of information is lost, or you have to use less aggressive quantization, which leaves a lot of redundancy.
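For illustration, here's a minimal sketch (in C) of the "fake quantization" step that QAT inserts into the forward pass. The symmetric n-bit scheme and the names here are my assumptions, not necessarily what this project does:

    #include <math.h>

    /* Fake quantization as used in QAT forward passes: snap a
       full-precision weight onto an n-bit grid, then map it back
       to float so the rest of the network is trained against the
       rounding error it will see after deployment. */
    float fake_quantize(float w, float scale, int bits)
    {
        int  qmax = (1 << (bits - 1)) - 1;    /* e.g. 7 for 4 bits */
        long q    = lroundf(w / scale);       /* round to the grid */
        if (q >  qmax)     q =  qmax;         /* saturate          */
        if (q < -qmax - 1) q = -qmax - 1;
        return (float)q * scale;              /* dequantize        */
    }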


Does that mean you're training a lower bit-rate(?) network, or training a full network that 'knows' it will eventually be running under quantization?

I'd imagine there's differences in the two approaches?


The latter. The network is trained in full precision (this is required for the gradient calculation), but the weights are nudged towards the quantized values.
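Something like this, reusing fake_quantize() from the sketch above. The straight-through estimator is the usual way to do that nudging, though I don't know the exact scheme used here:

    /* Straight-through estimator (STE): the forward pass uses
       fake_quantize(w, ...), but the gradient of the loss w.r.t.
       the quantized weight is applied to the full-precision
       "shadow" weight as if quantization were the identity. */
    float ste_update(float w, float dloss_dwq, float lr)
    {
        return w - lr * dloss_dwq;   /* over training, w drifts to
                                        values that round well at
                                        low precision */
    }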


Thanks, that explains the accuracy. But it doesn’t explain why it took 7 seconds to run inference on the Arduino, and milliseconds in this project…


The paper [0] regarding the Arduino implementation mentions their MCU runs at 16 MHz, and they are also running the inference on 28x28 images.

This project's MCU runs at 48 MHz and is inferring on 16x16 images.

So, roughly 3x fewer pixels at 3x the clock (48 MHz vs 16 MHz): 7000 ms / 9 ≈ 780 ms expected. Versus 13.5 ms actual, a ~58x speed increase beyond naive scaling. Does seem high, depending maybe on differences between the AVR and RISC-V hardware and ISA. (E.g., might there be a RAM bottleneck on the AVR chip?)

[0] https://arxiv.org/ftp/arxiv/papers/2105/2105.02953.pdf


Looks like an SRAM load on AVR takes 3 cycles and an EEPROM load 4 cycles [0], with 1 cycle subtracted for consecutive reads. An SRAM store is 1-2 cycles. FMUL (fixed-point multiply) is 2 cycles. The CPU is neither pipelined nor cached.

3 cycles for the load + 2 for the multiply + 1 for the store = 6 clocks to multiply one value against an array in program ROM (see the sketch below the footnote). I just couldn't find a corresponding document for the CH32V003 / QingKe V2A / RV32EC, but some of the PDFs mention pipelines, so I suppose users aren't meant to count clock cycles and it's just vastly more efficient. That could just be it.

[0] pp. 70: https://ww1.microchip.com/downloads/en/devicedoc/atmel-0856-...
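To make the cycle counting concrete, here's a rough sketch of the kind of inner loop being costed, assuming int8 weights in flash read via LPM (not the paper's actual code):

    #include <stdint.h>
    #include <avr/pgmspace.h>

    /* Inner loop of one neuron: multiply-accumulate an activation
       vector against a weight row kept in program flash. Each
       iteration pays ~3 cycles for the flash read (LPM) and 2 for
       the signed multiply (MULS), plus loop overhead, with no
       pipeline or cache to hide any of it. */
    int32_t dot_row(const int8_t *w_flash, const int8_t *act, uint16_t n)
    {
        int32_t acc = 0;
        for (uint16_t i = 0; i < n; i++) {
            int8_t w = (int8_t)pgm_read_byte(&w_flash[i]); /* LPM  */
            acc += (int16_t)w * act[i];                    /* MULS */
        }
        return acc;
    }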


On the CH32V003, a load should be two cycles if the code is executed from SRAM; there are additional wait states for loads from flash. The V2A only caches a single 32-bit instruction word, so there is basically no cache.

This publication seems to describe more details of the Arduino implementation:

https://arxiv.org/abs/2105.02953

It appears that the code even uses floats in some implementations, which have to be emulated in software. So I'd wager that both at the algorithmic level (QAT-NN) and at the implementation level there are discrepancies that lead to the better performance on the CH32V003.
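For contrast, a quantized net can stay integer-only between layers and avoid soft-float entirely. A sketch assuming power-of-two scales, which is one common trick (not necessarily what this project uses):

    #include <stdint.h>

    /* Requantize a layer's 32-bit accumulator back to int8 with a
       shift instead of a float multiply, so no float emulation is
       ever needed on a core without an FPU. */
    int8_t requantize(int32_t acc, uint8_t shift)
    {
        acc >>= shift;               /* power-of-two scale factor */
        if (acc < 0)   acc = 0;      /* fused ReLU                */
        if (acc > 127) acc = 127;    /* saturate to int8 range    */
        return (int8_t)acc;
    }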


I believe the Arduino project used only a single hidden layer, whereas the author's quantization scheme allowed them to use multiple.



