
Bitset Decoding on Apple’s A12 - ingve
https://lemire.me/blog/2019/05/15/bitset-decoding-on-apples-a12/
======
TickleSteve
"Apple A12 in my iPhone is limited to 2.5 GHz, so I multiply the result by 2.5
to get the number of cycles."

I can guarantee that unless he was going out of his way to force it, the
processor was not running at 2.5 GHz... it would have been scaled way back, as
mobile processors are very aggressively tuned for power saving.

Unless he can show how he managed to disable frequency scaling on the iPhone,
the results are kind of meaningless.

If he wants to get the cycle count of the instructions, he doesn't have to run
them; he needs the datasheet from the manufacturer (which may admittedly be
difficult to source).

~~~
jcranmer
> If he wants to get the cycle count of the instructions, he doesn't have to
> run them; he needs the datasheet from the manufacturer (which may admittedly
> be difficult to source).

A12 is an out-of-order superscalar core. Working out the exact cycle count of
a sequence of instructions by hand for such an architecture is basically a
fool's errand, especially once you start running into issues such as
exhausting register-file ports or reservation stations.

~~~
zeusk
I'm quite sure Apple has a tool for profiling iOS applications. ARM has a PMU
(Performance Monitoring Unit) as part of the architecture, for exactly this
sort of testing.

~~~
saagarjha
Instruments probably has this available; I know they do for Intel processors.

Edit: Just checked, and my A10 shows a small but standard set of PMCs, many of
which show up here:
[https://opensource.apple.com/source/xnu/xnu-4903.221.2/osfmk...](https://opensource.apple.com/source/xnu/xnu-4903.221.2/osfmk/arm64/kpc.c)

------
nwallin
> On recent x64 processors, we find that it is beneficial to decode in bulk:
> e.g., assume that there are at least 4 set bits, decode them without
> checking whether there are four of them. The code might look as follows:
> [...]

I find this assertion surprising, and so I decided to test it. I wrote a naive
algorithm, and found the generated assembly even more surprising.
[https://godbolt.org/z/ah3msM](https://godbolt.org/z/ah3msM)

    
    
        int bits(unsigned long n, char* b) {
            int retval = 0;
            while(n) {
                *b++ = __builtin_ctzll(n);  /* position of the lowest set bit */
                n &= n-1;                   /* clear the lowest set bit */
                retval++;
            }
            return retval;
        }
    

First of all, the business with incrementing retval is discarded entirely; the
compiler simply performs popcnt on n and returns that. Second, it performs the
calculation in two sections: first the naive, simple loop runs until the number
of remaining set bits is a multiple of 8, then the loop is unrolled into groups
of 8. It unrolled the loop even with just -O2 set. In essence, the compiler
does precisely what his code does, but with groups of 8 instead of 4.
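
To make that concrete, here is a rough C sketch of the shape I read out of the
Clang output (the function name and the prologue condition are mine, not the
literal codegen):

    
    
        int bits_unrolled_sketch(unsigned long n, char* b) {
            const int total = __builtin_popcountll(n);  /* return value comes straight from popcnt */
            int remaining = total;
            /* peeled prologue: plain loop until the number of set bits
               still to decode is a multiple of 8 */
            while (remaining % 8) {
                *b++ = __builtin_ctzll(n);
                n &= n - 1;
                remaining--;
            }
            /* main body, unrolled by 8 (the compiler emits this straight-line) */
            while (n) {
                for (int i = 0; i < 8; i++) {
                    b[i] = __builtin_ctzll(n);
                    n &= n - 1;
                }
                b += 8;
            }
            return total;
        }
    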

Clang seems to have changed behavior between versions 6 and 7; versions 6 and
below use the naive loop, as does GCC, while Clang 7 and 8 unroll it. I'm
curious whether this had any effect on his results; his A12 results presumably
use Clang, while his Linux box presumably uses GCC.

There's not really a whole lot of room for improvement in that unless you go
whole hog and unroll the entire thing and table jump into it:
[https://godbolt.org/z/qWzfHl](https://godbolt.org/z/qWzfHl)

    
    
        int bits(unsigned long n, char* b) {
            const int retval = __builtin_popcountll(n);
            switch(retval) {
                default: for(char i = 64; i--;) b[i] = i; return 64;
                case 63: b[62] = __builtin_ctzll(n); n &= n-1;
                case 62: b[61] = __builtin_ctzll(n); n &= n-1;
                case 61: b[60] = __builtin_ctzll(n); n &= n-1;
                // etc
                case 2: b[1] = __builtin_ctzll(n); n &= n-1;
                case 1: b[0] = __builtin_ctzll(n);
                case 0: return retval;
            }
        }
    

Which has its own macabre beauty to it.

Interesting distinction between Intel and AMD CPUs: the blsr instruction has a
latency of 1 cycle on Intel and 2 on AMD, while tzcnt has a latency of 3 on
Intel and 2 on AMD. But the loop-carried dependency runs through blsr (the
n &= n-1 step), with tzcnt merely reading off that chain, so Intel processes
1 bit per cycle while AMD can only manage 1 bit every two cycles.

~~~
nwallin
I looked into it a little more, and coded up some benchmarks, and it turns out
I was wrong. His solution is like this:

    
    
        /* decode in blocks of 4 without checking how many set bits remain;
           b needs a few bytes of slack for the overshoot */
        int bits(unsigned long n, char* b) {
            int retval = __builtin_popcountll(n);
            while(n) {
                b[0] = __builtin_ctzll(n); n &= n-1;
                b[1] = __builtin_ctzll(n); n &= n-1;
                b[2] = __builtin_ctzll(n); n &= n-1;
                b[3] = __builtin_ctzll(n); n &= n-1;
                b += 4;
            }
            return retval;
        }
    

Unrolled four times, this is faster than the naive code unrolled by the
compiler, but slower than the jump table. Unrolled eight times, it is ~10%
faster than the jump table, which really surprised me. How can a loop be
faster than non-branching code? The crucial difference is that his version
begins filling the pipeline immediately. The compiler-unrolled version needs
to wait on the popcnt before it can begin working, and the jump table needs to
wait on both the popcnt and the subsequent jump. Even though the jump table is
100% branchless after the initial jump, its speed advantage down the stretch
can't make up for the author's early start, _not even_ with the author's
eventual branch misprediction and subsequent pipeline stall counted against
him. Very strange. Cool, but strange.

I think the lesson here is to not jump to conclusions and post your faulty
misconceptions about performance on the internet before you benchmark it.
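
For anyone who wants to poke at it, a minimal harness along these lines would
do (just a sketch, not my exact benchmark; the xorshift generator and the
iteration count are arbitrary, and any of the bits() variants above can be
dropped in):

    
    
        #include <stdint.h>
        #include <stdio.h>
        #include <time.h>
        
        int bits(unsigned long n, char* b);  /* any of the variants above */
        
        int main(void) {
            char out[64 + 8];                /* slack for the bulk-decoding variants */
            uint64_t x = 0x9e3779b97f4a7c15ULL;
            long total = 0;
            const long iters = 10000000;
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (long i = 0; i < iters; i++) {
                /* xorshift64: cheap pseudo-random input words */
                x ^= x << 13; x ^= x >> 7; x ^= x << 17;
                total += bits(x, out);
            }
            clock_gettime(CLOCK_MONOTONIC, &t1);
            double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
            printf("%.2f ns/word, %ld bits decoded\n", ns / iters, total);
            return 0;
        }
    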

------
kazinator
What sort of application shows a noticeable speedup if we manage to make
bitset decoding, say, ten times faster?

~~~
pierrebai
In this case, Mr Lemire and his collaborators previously designed a fast JSON
parser (GB/s) that requires exactly this kind of bitset decoding. You can find
the earlier papers on his blog.

~~~
booblik
Prof Lemire

------
tveita
Could a lookup table be faster even for fairly sparse bits? I'm imagining
something like

    
    
      uint8_t *result;
      uint8_t numpositions[];
      uint64_t positions0[];
      uint64_t positions1[];
      ...
    
      *(uint64_t*) result = positions0[byte0]
      result += numpositions[byte0]
      *(uint64_t*) result = positions1[byte1]
      result += numpositions[byte1]
      ...
    

Or probably accumulating results in a register until you get 8.
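
Something like this is what I have in mind, as a sketch (the table layout,
names, and sizes are just illustrative; it assumes a little-endian target and
a result buffer with ~7 bytes of slack, since each 8-byte store can spill a
few junk bytes past the live outputs):

    
    
        #include <stdint.h>
        #include <string.h>
        
        /* positions[k][byte]: the bit positions of 'byte' (offset by 8*k),
           packed one per output byte; counts[byte] = popcount(byte) */
        static uint64_t positions[8][256];
        static uint8_t counts[256];
        
        static void init_tables(void) {
            for (int k = 0; k < 8; k++) {
                for (int byte = 0; byte < 256; byte++) {
                    uint64_t packed = 0;
                    int out = 0;
                    for (int bit = 0; bit < 8; bit++)
                        if (byte & (1 << bit))
                            packed |= (uint64_t)(8 * k + bit) << (8 * out++);
                    positions[k][byte] = packed;
                    if (k == 0) counts[byte] = (uint8_t)out;
                }
            }
        }
        
        int bits_table(uint64_t n, uint8_t* b) {
            uint8_t* start = b;
            for (int k = 0; k < 8; k++) {
                unsigned byte = (unsigned)(n >> (8 * k)) & 0xff;
                memcpy(b, &positions[k][byte], 8);  /* one unaligned 8-byte store per input byte */
                b += counts[byte];
            }
            return (int)(b - start);
        }
    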

~~~
mcbain
Even if those arrays are entirely in cache, they require more cycles per lookup
than the computation from the article does.

That's been true of lookup tables for a long while now.

