
AVX2 faster than native popcnt instruction on Haswell/Skylake - ingve
http://0x80.pl/articles/sse-popcount.html
======
nkurz
As the author states in another comment here
([https://news.ycombinator.com/item?id=11278640](https://news.ycombinator.com/item?id=11278640))
the charts he has in the article understate the potential advantage of the
AVX2 approach over POPCNT. On current processors, POPCNT runs on a single
execution port and counts the bits in 8B per cycle. A fast unrolled solution
thus gets down very close to .125 cycles/byte (1/8).

The AVX2 implementation requires more instructions, but spreads the work out
across more ports. My fastest unrolled version takes 2.5 cycles per 32B
vector, or .078 cycles/byte (2.5/32). This is 1.6x faster (4 cycles per 32B
/2.5 cycles per 32B) than scalar popcnt().

Whether this is worth doing depends on the rest of the workload. If the rest
of the work is being done on scalar 64-bit registers, those other scalar
operations can often fit between the popcnts() "for free", and the cost of
moving data back and forth between vector and scalar registers usually
overwhelms the advantage.

But if you are already dealing with vector registers, the speedup can be even
greater than a microbenchmark would imply. We've been using Wojciech's
approach for operations on compressed bitsets, and find it fantastic:
[https://github.com/RoaringBitmap/CRoaring/blob/master/src/co...](https://github.com/RoaringBitmap/CRoaring/blob/master/src/containers/bitset.c)

And since wider registers are almost certainly coming, the vectorized approach
seems likely to pull even farther ahead in the future. AVX512, the 64B
successor to AVX2, is currently scheduled for Skylake Xeon. Unless a second
execution port is added for popcnt() (not likely) the vectorized version will
probably end up about 3x as fast in a couple years.

~~~
pcwalton
> And since wider registers are almost certainly coming, the vectorized
> approach seems likely to pull even farther ahead in the future. AVX512, the
> 64B successor to AVX2, is currently scheduled for Skylake Xeon.

This is fantastic news. I hope this makes it down to consumer CPUs in time as
well.

The current trend of ridiculously wide SIMD vectors is a pretty exciting
development and is (IMO) a reason to be optimistic in spite of the current
climate of "Moore's Law is over and the multicore experiment has failed, we're
doomed" handwringing. If you can express your problem as a tight parallel
numeric loop, things look good for getting better and better performance in
the future.

~~~
boulos
<rant> Intel has decided to make _Xeon laptop parts_. For Skylake, "Xeon"
means it has AVX512. Amazingly, it no longer means what it historically
meant ("server part").

I don't like that this means most developer laptops won't have AVX512 (I'm
hoping for a new Skylake "Xeon" mbp), but Lenovo and others have already
released some.

</rant>

~~~
ploxiln
On the other hand: this means that it is finally possible to get a laptop with
an intel cpu and ECC DRAM!

(And I'm guessing, really I hope, that I won't have to look up every xeon
laptop part in intel ark to see if it supports vt-x and vt-d - I hope xeon
means "no need for further market segmentation, all features that make sense
are enabled")

~~~
Nullabillity
I thought Skylake finally made VT-d a baseline feature (like VT-x has been for
a while).

------
userbinator
It should be kept in mind that these are microbenchmarks; I'd guess the _much_
bigger AVX2 instruction sequence will have visible cache effects in
macrobenchmarks (i.e. combined with a mix of other instructions) and lose its
small lead --- on Haswell AVX2 is ~17% faster, on Skylake only 6%.

POPCNT is a single instruction and should definitely be inlined; I doubt that
would be such a good idea for these longer sequences, but then putting it in a
function means the call+return overhead could also become significant. It's
hard to quantify without doing actual measurements, but with the figures given
I'd be inclined to stick with POPCNT.

~~~
stephencanon
The article is really targeting popcount for large bitvectors; in this
scenario, call overhead is a non-issue (constant cost of <5-10 cycles vs O(n)
work to be performed). I$ effects will also be negligible (the AVX2
instruction sequence is only a few cachelines long, and you wouldn't inline
it).

~~~
wmu
I did some tests and it seems that the AVX2 code is faster than popcnt when
the input size is 512 bytes or more.

------
jbapple
See also [http://danluu.com/assembly-intrinsics/](http://danluu.com/assembly-intrinsics/),
in which hand-tuned assembly popcnt is 3x faster than using the compiler
intrinsic.

------
samfisher83
This is an awesome way to do it in software:

    
    int BitCount(unsigned int u) {
        unsigned int uCount;
    
        uCount = u - ((u >> 1) & 033333333333) - ((u >> 2) & 011111111111);
        return ((uCount + (uCount >> 3)) & 030707070707) % 63;
    }
    

O(1) solution

This explains why it works.

[https://blogs.msdn.microsoft.com/jeuge/2005/06/08/bit-
fiddli...](https://blogs.msdn.microsoft.com/jeuge/2005/06/08/bit-fiddling-3/)

~~~
eloff
That % 63 at the end is an integer division, which is horribly slow.
Literally any other sensible approach will beat it, including a naive LUT.

~~~
versteegen
Optimising compilers are smart enough to get rid of unsigned divisions by
constants. See [1] (part one of a series; also [2] is neat). To quote [2],
"libdivide's scalar code is up to 16 times faster for powers of 2, 10 times
faster for non-powers of 2, compared to naive hardware division." In this case
GCC (even with -O0) replaces % 63 with a complicated sequence of one integer
multiplication, three subtractions, an addition, and three shifts, which I
couldn't be bothered decompiling.

[1] [http://ridiculousfish.com/blog/posts/labor-of-division-
episo...](http://ridiculousfish.com/blog/posts/labor-of-division-
episode-i.html)

[2] [http://libdivide.com/](http://libdivide.com/)

~~~
eloff
You're right. I read somewhere that no compiler optimizes modulus by a
constant. That's clearly very outdated. I tested with clang 3.2, gcc 4.5, and
icc 13 (all on a 32bit platform) and they all quite happily replace the
division with other instructions.

------
JoshTriplett
Very interesting result!

However, this workload involved a loop summing the popcounts of a series of
values, with intermediate results not needed. Beating popcount on that
workload with AVX2 instructions seems plausible; splitting the work across
several instructions gives the CPU more to work with simultaneously, and
batching the last sum makes it effectively free when amortized across enough
iterations.

However, if the popcounts in your workload remain independent, and you
immediately need the values from them, I don't think this approach will help.

~~~
nkurz
_However, if the popcounts in your workload remain independent, and you
immediately need the values from them, I don't think this approach will
help._

This can be true, but it depends on whether the rest of the workload is
vectorized. If you already have the data in vector registers, or can make use
of it there afterward, this approach turns out to be even more beneficial than
the microbenchmark numbers imply.

We've been working with Wojciech's approach for a while now, and thus can even
point to real-world code:
[https://github.com/RoaringBitmap/CRoaring/blob/master/src/co...](https://github.com/RoaringBitmap/CRoaring/blob/master/src/containers/bitset.c)

~~~
JoshTriplett
That looks like another case of a bitset, where you want to accumulate the
popcounts over the whole bitset. That seems quite similar to the case in this
article. (I still find it surprising that popcount doesn't have enough
internal optimization to win, but the numbers certainly prove that.)

The case I'm talking about is when you need a single popcount value in
isolation as part of some larger computation that doesn't otherwise involve
popcounts.

For example, I've worked with libraries that manipulate variable-length data
structures where each bit in a flags field indicates the presence of a chunk
of data. To compute the size of the data, you need the popcount of the flags
field. So you get the popcount (using __builtin_popcount when available), and
then immediately use that value.

------
fiatmoney
My favorite use of popcnt, and now this alternate technique, is shaming
programming interviewers out of their reliance on explicit bit-twiddling
solutions.

------
colejohnson66
I'm actually surprised by this. I would think the AVX instructions have
complex microcode that would take longer than a simple bit shifter with an AND
to check and add the final bit. I assume `popcnt' isn't used that frequently,
so maybe Intel spent a lot of time fine-tuning the microcode for AVX but not
`popcnt'.

~~~
ajross
AVX instructions (at least the ones we're talking about here) are fully
pipelined and none have microcoded implementations that I'm aware of.

And recognize that the news here is "on Haswell". On earlier SSE/AVX
implementations POPCNT is indeed faster. Something changed with Haswell that
allows this code to run faster than the purpose-designed instruction.

Most likely those two adjacent PSHUFB instructions were limited on earlier
architectures to running on a single execution port (with a "PSHUFB engine"),
and on Haswell it got insantiated on another ALU and they can now be executed
in parallel.

~~~
nkurz
_Something changed with Haswell that allows this code to run faster than the
purpose-designed instruction._

The change is primarily that AVX2 (which Haswell is the first generation to
support) extended binary and integer vector operations to 32B registers, while
AVX only supported 32B floating point operations. Instructions that previously
operated on 16B doubled their throughput by handling 32B with (usually) the
same latency:
[https://software.intel.com/sites/landingpage/IntrinsicsGuide...](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX,AVX2)

~~~
thrownaway2424
I thought Agner had shown that the Haswell AVX2 stuff didn't necessarily
bother turning on the entire execution unit unless it really seemed warranted,
preferring instead to issue the 128-bit operation twice and combine. For
example see the later comments on
[http://www.agner.org/optimize/blog/read.php?i=142](http://www.agner.org/optimize/blog/read.php?i=142)

~~~
nkurz
There's a real effect here, but in practice I haven't found it to be an issue.
As long as your instruction mix has at least one 256-bit operation in the last
few million instructions, the slowdown doesn't happen. I'm sure you could
construct a case where it would be a problem, but throwing in an occasional
unused VPXOR solves it easily enough.

One thing that can be an issue is alignment. Unless your reads are 32B
aligned, you will be limited to reading 40B per cycle. While a single
unaligned vector read per cycle isn't a problem, full utilization of the
increased throughput that 'stephencanon' mentions in the sibling is only
possible if both vectors are 32B aligned:
[http://www.agner.org/optimize/blog/read.php?i=415#423](http://www.agner.org/optimize/blog/read.php?i=415#423)

------
nickpeterson
Slightly off topic, are there any high level languages apart from c, c++, and
Fortran that deal well with simd? I've looked at Java and C# and they seem to
have very partial support. Anything else out there with a better simd story?

~~~
mmozeiko
Intel SPMD Program Compiler:
[https://ispc.github.io/](https://ispc.github.io/)

------
ndesaulniers
What's the best book folks recommend for learning SIMD programming?

~~~
matt_d
Some of the resources I've run across (assuming x86 context):

* Brief intros in Agner Fog's manuals:

\- Chapter 12, Using vector operations in
[http://www.agner.org/optimize/optimizing_cpp.pdf](http://www.agner.org/optimize/optimizing_cpp.pdf)

\- Chapter 13, Vector programming in
[http://www.agner.org/optimize/optimizing_assembly.pdf](http://www.agner.org/optimize/optimizing_assembly.pdf)

* [http://www.whatsacreel.net76.net/asmtutes.html](http://www.whatsacreel.net76.net/asmtutes.html) / [https://www.youtube.com/playlist?list=PL0C5C980A28FEE68D](https://www.youtube.com/playlist?list=PL0C5C980A28FEE68D)

* Kusswurm (2014) "Modern X86 Assembly Language Programming - 32-bit, 64-bit, SSE, and AVX"

* Hughes (2015) "Single-Instruction Multiple-Data Execution"

* The author's website :-) [http://wm.ite.pl/articles/#assembler-x86-fpu-mmx-sse-sse2-ss...](http://wm.ite.pl/articles/#assembler-x86-fpu-mmx-sse-sse2-sse3-ssse3-sse4-avx-avx2-avx512)

References:

* [https://chessprogramming.wikispaces.com/x86-64](https://chessprogramming.wikispaces.com/x86-64)

* [https://software.intel.com/sites/landingpage/IntrinsicsGuide...](https://software.intel.com/sites/landingpage/IntrinsicsGuide/)

* x86 intrinsics cheat sheet (link at the very bottom): [http://db.in.tum.de/~finis/](http://db.in.tum.de/~finis/)

