
Clover: 4-Bit Quantized Linear Algebra - ArtWomb
https://astojanov.github.io/projects/clover/
======
0xfaded

      Once the data exceed L3 cache, 4-bit is fastest since the dot product is
      memory bound: data movement from RAM becomes the bottleneck. The speed-up
      is up to 6x over the 32-bit version.

Large linalg operations are memory bound.

I learned this the hard way hand-coding a 5x5 Gaussian blur on ARM32. When
benchmarking my first version, which used two passes of the separable 5x1 and
1x5 filters, I had a facepalm moment when I realized I was really only
measuring two round trips to main memory.

Operating on 16x8 pixel blocks, which fit within the register file, meant most
pixels only had to be fetched once, and it almost doubled performance.
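The fusion can be modelled in NumPy. This is a sketch of the idea, not the ARM
code: the block size and zero-padding choices here are arbitrary, and Python
can only show that the fused blocked version computes the same result as the
naive two-pass blur, not the memory-traffic win itself:

```python
import numpy as np

K = np.array([1., 4., 6., 4., 1.]) / 16.0  # separable 5-tap binomial kernel

def conv_rows(a):
    # horizontal 1x5 convolution of every row, zero-padded at the edges
    return np.array([np.convolve(r, K, mode='same') for r in a])

def blur_two_pass(img):
    # pass 1: horizontal; pass 2: vertical (via transpose).
    # Each pass walks the whole image: two round trips through memory.
    return conv_rows(conv_rows(img).T).T

def blur_blocked(img, block=8):
    # Fused version: both passes run on one block of rows (plus a 2-row
    # halo) before moving on, so each input row is fetched only once.
    h, _ = img.shape
    padded = np.pad(img, ((2, 2), (0, 0)))  # zero rows for the vertical taps
    out = np.empty_like(img)
    for r0 in range(0, h, block):
        r1 = min(r0 + block, h)
        slab = conv_rows(padded[r0:r1 + 4])                 # horizontal pass
        out[r0:r1] = conv_rows(slab.T).T[2:2 + (r1 - r0)]   # vertical, drop halo
    return out
```

Both functions agree exactly; in the real kernel the slab would live in the
register file rather than in a temporary array.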

~~~
vanderZwan
Obligatory link to Halide, a language that is all about making hand-written
scheduling easier:

[http://halide-lang.org/](http://halide-lang.org/)

~~~
0xfaded
Halide is cool, but when I last looked at it, it was best at optimizing for
cache usage. For my example there would still have been two round trips, just
to L1 instead of somewhere farther out.

I've compared GCC intrinsics against what I would have written by hand. GCC
doesn't really get instruction scheduling right: things like using the
pipelines efficiently by grouping, say, multiplication instructions, and
avoiding stalls by ensuring the right number of cycles between them.

I have some specialised block matrix multiplication kernels that basically
never stall the processor. We're talking 10x speedups over optimized but
general C++ code.
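The loop structure of such a blocked multiply can be sketched as follows (pure
Python for illustration: it shows the i/j/k tiling, but not the instruction
scheduling or register allocation the speedup actually comes from, and the
block size of 4 is an arbitrary stand-in):

```python
def matmul_blocked(A, B, bs=4):
    """Blocked matrix multiply: the i/j/k loops are tiled so each bs x bs
    tile of A, B, and C stays hot (in cache/registers) while it is reused.
    For brevity, dimensions are assumed to be multiples of bs."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i0 in range(0, n, bs):
        for j0 in range(0, p, bs):
            for k0 in range(0, m, bs):
                # micro-kernel: accumulate one bs x bs tile of C
                for i in range(i0, i0 + bs):
                    Ci, Ai = C[i], A[i]
                    for k in range(k0, k0 + bs):
                        a, Bk = Ai[k], B[k]
                        for j in range(j0, j0 + bs):
                            Ci[j] += a * Bk[j]
    return C
```

In a hand-tuned version the micro-kernel body is where the scheduling work
happens: the tile of C lives entirely in registers, and independent
multiply-accumulates are interleaved to keep the pipelines full.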

------
natch
This is beautifully presented.

Addressing the author(s): I don’t know much about the topic area, but you made
it interesting, and it is a trove of cool tricks and techniques (sure, bit
manipulation is basic, but it’s fun to see how you put it all together to
solve the problems), as is how you analyze the performance. I look forward to
coming back to this after I get more up to speed on linear algebra.

------
mchahn
Many years ago (in the 1970s) I used a very expensive but very accurate HP
spectrum analyzer. It would take in a signal and present its spectrum on a
screen. I studied the manual and found that it used a 1-bit(!) A/D converter.
When averaging over many samples, the resolution would build up.
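The averaging trick can be illustrated with a first-order delta-sigma-style
loop (a toy model, not the HP instrument's actual circuit): quantize to one
bit, feed the quantization error back, and the long-run mean of the bitstream
converges to the input.

```python
def delta_sigma(x, n):
    """Emit n one-bit samples (+1/-1) for a constant input x in [-1, 1].
    The accumulator carries the quantization error forward, so the
    running average of the bits converges to x."""
    acc, bits = 0.0, []
    for _ in range(n):
        acc += x                        # integrate the input
        bit = 1.0 if acc >= 0 else -1.0
        acc -= bit                      # subtract what was actually emitted
        bits.append(bit)
    return bits

# 1000 one-bit samples recover a constant input to about 3 decimal places
mean = sum(delta_sigma(0.3, 1000)) / 1000
```

Because the accumulator stays bounded, the error of the n-sample average
shrinks like 1/n for a constant input.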

~~~
kragen
ΔΣ converters ("1-bit ADCs") have been commonplace since the 1970s; typically
they use some feedback tricks to get better signal-to-noise ratios than you
would calculate straightforwardly. 1-bit DACs were common in CD players
throughout the 80s, and the SACD format about 20 years back was a raw delta-
sigma bitstream. Unfortunately the Wikipedia article is not very good:
[https://en.wikipedia.org/wiki/Delta-sigma_modulation](https://en.wikipedia.org/wiki/Delta-sigma_modulation)

Fundamentally, the reason they're especially accurate is that most of the
usual sources of error in analog-to-digital conversion just don't exist in a
1-bit ADC. INL (integral nonlinearity)? Zero. DNL (differential nonlinearity)?
Zero. Nonmonotonicity? Please.

Clover, by contrast, is interesting to me for a different reason: a lot of ANN
inference stuff seems to work fine at 8 bits of precision. It'll be
interesting to see if those results can extend to 4 bits.
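A minimal sketch of what a 4-bit quantized dot product involves (generic
symmetric quantization with one scale per vector; the function names and the
packing layout here are illustrative assumptions, not Clover's actual format):

```python
import numpy as np

def quantize4(v):
    # symmetric 4-bit quantization: signed values in [-7, 7] plus one scale
    scale = max(float(np.abs(v).max()), 1e-12) / 7.0
    q = np.clip(np.round(v / scale), -7, 7).astype(np.int8)
    return q, scale

def pack_nibbles(q):
    # two signed 4-bit values per byte (low nibble first); len(q) must be even
    u = (q & 0x0F).astype(np.uint8)
    return u[0::2] | (u[1::2] << 4)

def unpack_nibbles(p):
    lo = (p & 0x0F).astype(np.int8)
    hi = (p >> 4).astype(np.int8)
    lo = np.where(lo > 7, lo - 16, lo)   # sign-extend the 4-bit values
    hi = np.where(hi > 7, hi - 16, hi)
    out = np.empty(2 * len(p), dtype=np.int8)
    out[0::2], out[1::2] = lo, hi
    return out

def dot4(a, b):
    # quantize, round-trip through the packed form, then one float rescale
    qa, sa = quantize4(a)
    qb, sb = quantize4(b)
    ia = unpack_nibbles(pack_nibbles(qa)).astype(np.int32)
    ib = unpack_nibbles(pack_nibbles(qb)).astype(np.int32)
    return sa * sb * int(np.dot(ia, ib))
```

The packed vectors are an eighth the size of float32, which is where a
memory-bound speedup like the quoted benchmark's comes from; the inner product
stays integer until the single rescale at the end.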

~~~
mchahn
Thanks for the info on ΔΣ converters. It connects several things I knew about.

Now I need to study up on the area of computation that Clover lives in.

~~~
dragontamer
Just to note how common delta-sigma converters are: Arduinos have got them for
voltage sensing.

I dare say that the most common A/D converters are simply delta-sigma
converters, especially if 1 MHz conversion is fast enough for your task.

To get into 100 MHz-or-above speeds (i.e. oscilloscopes, software-defined
radios, etc.) you need to use other techniques.

~~~
mchahn
> To get into the 100MHz or above speed

That gets back to my original point. The spectrum analyzer went into the
gigahertz range. I don't think it was delta-sigma; it was some multiple-sample
technique that only worked in the frequency domain.

~~~
gugagore
It might have gone into the GHz range, but I'm guessing with a narrow
bandwidth. It's the bandwidth that determines the sampling rate needed.

