

Lab-Tools FPGA Binning Co-processor for APL (2012) [video] - chesterfield
https://www.youtube.com/watch?v=75BLv5PFEvo

======
sparky
Aliasing conflict!

This is about doing histogramming on an FPGA, not a tool to group FPGA dice
into speed grades :)

Would be interesting to see how an FPGA stacks up against a GPU on this
problem. GPUs are very fast at parallel histogramming, but hardwiring the
number-to-bin-index computation for a particular problem instance might buy
you quite a bit of energy efficiency, or even possibly a bit of performance if
the computation involves a divide. If the bins are of non-uniform size and the
indexing computation involves a binary search on a lookup table, it seems like
the much higher clock speeds on a GPU would win out.
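For uniform bins over a unit range, the number-to-bin-index step really is just a multiply; it's a general range (or non-uniform bins) that drags in a divide or a search. A minimal sketch of the two cases (hypothetical function names, not from the talk):

```c
/* Sketch: bin-index computation for uniform bins.
   Over [0,1) the index is a single multiply and truncation;
   over a general range [lo,hi) a divide sneaks in, which is the
   kind of thing hardwiring on an FPGA could eliminate. */
int bin_index_uniform(double x, int nbins) {
    return (int)(x * nbins);                     /* assumes 0 <= x < 1 */
}

int bin_index_range(double x, double lo, double hi, int nbins) {
    return (int)((x - lo) / (hi - lo) * nbins);  /* one divide per value */
}
```

(In practice the divide can be replaced by a multiply with a precomputed reciprocal of `hi - lo`.)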

~~~
sitkack
GPU clocks aren't really that high. FPGAs win when it comes to combinatorial
logic; if-statements kill GPUs. My money is on the FPGA by a wide margin.

~~~
TD-Linux
Most FPGAs can't handle very high clocks either (~400MHz max; 100MHz is more
realistic for the Xilinx Artix chips). A huge fanout with a ton of
comparators is not going to run that fast.

I imagine binning on an ISA with conditional execution wouldn't pay much of a
jump penalty. Even with jumps, as long as they are predicted correctly it's
fine.
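A hedged sketch of what that conditional execution buys (hypothetical code, not from the video): the inner search step can be written branch-free, so the compiler can emit a conditional move (or predicated instruction) instead of a hard-to-predict jump.

```c
/* Branch-free binary-search binning step. bounds[] holds the nbins
   bin lower boundaries, sorted ascending. With a conditional select
   there is no jump to mispredict; the cost is a fixed ~log2(nbins)
   iterations per value. */
int bin_index(double val, const double *bounds, int nbins) {
    int lo = 0, n = nbins;
    while (n > 1) {
        int half = n / 2;
        /* conditional select instead of if/else */
        lo = (val >= bounds[lo + half]) ? lo + half : lo;
        n -= half;
    }
    return lo;
}
```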

The big limitation here is probably the USB 2.0 speed - 40MB/s is not a lot
compared to a CPU's bandwidth to main memory (at 40MB/s, a million doubles is
8MB, i.e. about 200ms just to move the data).

~~~
sitkack
GPUs just recently hit 800+ MHz; they were at 600MHz for a long while. As for
the USB bottleneck, I am speaking generically and assuming an equal footing
(memory bandwidth, bus, etc); the specific device here (binning on an FPGA as
a co-processor for a computer system) doesn't really make sense as a speed
advantage. I saw this more as a proof of concept for augmenting APL with an
FPGA. Of course, APL itself should just get compiled down to the GPU.

It does make for a great benchmark. Eventually FPGA fabric and GPU compute
will merge into an oatmeal-with-raisins consistency. I'd argue that with
embedded multipliers, FPGAs are already there. Altera has an OpenCL SDK.

------
pflanze
I don't understand why binning 1 million values would take more than a second
on the CPU. I've tested the following in C:

    
    
      int len = 1000000;
      int numbuckets = 1000;
      double* vec = ...; // malloc(sizeof(double)*len), filled with values in [0,1)
      unsigned long* res = calloc(numbuckets, sizeof(unsigned long));
      int i;
      for (i = 0; i < len; i++) {
          int ii = vec[i] * numbuckets; // uniform bins: index is a single multiply
          res[ii]++;
      }
    

This runs in 8 ms on my elderly laptop.

It would be slower if the individual bins weren't of equal size, but the
examples shown in the video seem to suggest that they are, and even with
about a 10x slowdown for a binary search it would still be faster than the
USB 2.0 transfer. That's without using a GPU, multiple threads, or SIMD.

~~~
thesz
Make more buckets, and make bins with different bounds (not all equal to
1/numbuckets). All of a sudden your algorithm will do much more work due to
cache misses and the need to perform conditional jumps.

~~~
pflanze
That does not explain it. Yes, it will be slower with more buckets, but by
how much?

    
    
      numbuckets uint64  uint32
      1000         6 ms    5 ms
      10000        6 ms    5 ms
      100000       6 ms    7 ms
      1000000     14 ms    8 ms
      10000000    38 ms   32 ms
      20000000            38 ms
      40000000            40 ms
      80000000            44 ms
    

40 ms is still >50 times faster than the "software algorithm" in the video.
And he _didn't_ use 40 million buckets there, as can be seen from the graphs.

> make bins with different bounds

You'd have to implement this _and_ use a huge number of buckets _and_ likely
still use an older CPU than he did to get down to his software timings. Or
use a slow language implementation (interpreted, or boxing arithmetic
results). Or implement a slow algorithm like a linear search over the buckets
(which would explain the roughly factor-of-2 difference between his two
algorithms, the faster one stopping once the bucket is found). In any of
these cases, pending a better explanation, he either doesn't know how to
write fast code, or deliberately left the software version way slower than
possible to make the FPGA version shine.

~~~
pflanze
I've now implemented the binary search version. I was right, the factor of 10
was about correct: even with 20 million buckets the algorithm is still faster
than the FPGA, and more than 3 times faster than his faster software
algorithm, even though he's got a newer CPU (I'm running it single-threaded
on a Core 2 Duo at 2.5 GHz). And again, he's really using something between
about 1000 and 100000 buckets, which would make my program on my machine
between 17 and 32 times faster than his:

    
    
      buckets   gcc -O1 gcc -O3
      1000       75 ms   72 ms
      10000     105 ms  102 ms
      100000    135 ms  136 ms
      1000000   243 ms  240 ms
      10000000  580 ms
      20000000  693 ms
    

(Results in my previous post were with gcc -O1.)

All results so far were with random numbers with an uneven distribution:
vec[i]= ((double)i / len) * random_between(0.,1.). Here are the results for
the binary search with vec[i]= random_between(0.,1.):

    
    
      buckets  gcc -O1
      1000       82 ms
      10000     115 ms
      100000    148 ms
      1000000   360 ms
      10000000  735 ms
      20000000  860 ms
    

(He's using a Gaussian distribution, not a uniform one.)
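For reference, the two test distributions above could be filled in like this (a sketch; random_between is the post's pseudocode, implemented here with rand() as a stand-in):

```c
#include <stdlib.h>

/* Stand-in for the post's random_between pseudocode. */
static double random_between(double lo, double hi) {
    return lo + (hi - lo) * ((double)rand() / RAND_MAX);
}

/* Uneven distribution: values crowd toward 0, upper bound grows with i. */
void fill_uneven(double *vec, int len) {
    for (int i = 0; i < len; i++)
        vec[i] = ((double)i / len) * random_between(0., 1.);
}

/* Even distribution: plain uniform values in [0,1]. */
void fill_even(double *vec, int len) {
    for (int i = 0; i < len; i++)
        vec[i] = random_between(0., 1.);
}
```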

Here's the code:

    
    
      int nvals= ..;
      double* vals= ..;
      int nbuckets= ..;
      double* buckets= ..; // bucket boundaries, sorted ascending
      unsigned int* res= ..;

      int i;
      for (i=0; i<nvals; i++) {
          double val= vals[i];
          int bi;
          {
              // binary search: find the first boundary greater than val
              int lo=0, hi=nbuckets;
              while (lo < hi) {
                  int mid= (lo+hi) / 2;
                  if (val < buckets[mid]) {
                      hi= mid;
                  } else {
                      lo= mid+1;
                  }
              }
              bi= lo;
          }
          res[bi]++;
      }
    

(Embedded within a Scheme driver program:
http://christianjaeger.ch/scratch/binning.scm)

