AVX2 faster than native popcnt instruction on Haswell/Skylake (0x80.pl)



As the author states in another comment here (https://news.ycombinator.com/item?id=11278640), the charts in the article understate the potential advantage of the AVX2 approach over POPCNT. On current processors, POPCNT runs on a single execution port and counts the bits in an 8B (64-bit) register per cycle. A fast unrolled solution thus gets down very close to .125 cycles/byte (1/8).
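
For reference, a minimal sketch of such an unrolled scalar loop (my own illustration of the technique, not the exact code benchmarked) might look like this, assuming the length is a multiple of four words:

    #include <immintrin.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Unrolled scalar popcount; separate accumulators keep the adds from
       serializing on one register. Compile with -mpopcnt (or -msse4.2). */
    uint64_t popcnt_scalar(const uint64_t *data, size_t n)
    {
        uint64_t c0 = 0, c1 = 0, c2 = 0, c3 = 0;
        for (size_t i = 0; i < n; i += 4) {
            c0 += (uint64_t)_mm_popcnt_u64(data[i + 0]);
            c1 += (uint64_t)_mm_popcnt_u64(data[i + 1]);
            c2 += (uint64_t)_mm_popcnt_u64(data[i + 2]);
            c3 += (uint64_t)_mm_popcnt_u64(data[i + 3]);
        }
        return c0 + c1 + c2 + c3;
    }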

The AVX2 implementation requires more instructions, but spreads the work out across more ports. My fastest unrolled version takes 2.5 cycles per 32B vector, or .078 cycles/byte (2.5/32). This is 1.6x faster (4 cycles per 32B / 2.5 cycles per 32B) than scalar popcnt().
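
A minimal, non-unrolled sketch of the nibble-lookup idea behind the vector kernel (my paraphrase, not the author's tuned code):

    #include <immintrin.h>

    /* Count bits per byte with two 4-bit table lookups (PSHUFB), then sum
       the 32 byte counts into four 64-bit lanes with PSADBW. */
    static inline __m256i popcount_avx2(__m256i v)
    {
        const __m256i lookup = _mm256_setr_epi8(
            0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
            0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
        const __m256i low_mask = _mm256_set1_epi8(0x0f);
        __m256i lo  = _mm256_and_si256(v, low_mask);
        __m256i hi  = _mm256_and_si256(_mm256_srli_epi16(v, 4), low_mask);
        __m256i cnt = _mm256_add_epi8(_mm256_shuffle_epi8(lookup, lo),
                                      _mm256_shuffle_epi8(lookup, hi));
        return _mm256_sad_epu8(cnt, _mm256_setzero_si256());
    }

The caller keeps accumulating those 64-bit partial sums across the input and only reduces them to a scalar once at the end, which is where the batching pays off.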

Whether this is worth doing depends on the rest of the workload. If the rest of the work is being done on scalar 64-bit registers, those other scalar operations can often fit between the popcnts() "for free", and the cost of moving data back and forth between vector and scalar registers usually overwhelms the advantage.

But if you are already dealing with vector registers, the speedup can be even greater than a microbenchmark would imply. We've been using Wojciech's approach for operations on compressed bitsets, and find it fantastic: https://github.com/RoaringBitmap/CRoaring/blob/master/src/co...

And since wider registers are almost certainly coming, the vectorized approach seems likely to pull even farther ahead in the future. AVX512, the 64B successor to AVX2, is currently scheduled for Skylake Xeon. Unless a second execution port is added for popcnt() (not likely), the vectorized version will probably end up about 3x as fast in a couple of years.


> And since wider registers are almost certainly coming, the vectorized approach seems likely to pull even farther ahead in the future. AVX512, the 64B successor to AVX2, is currently scheduled for Skylake Xeon.

This is fantastic news. I hope this makes it down to consumer CPUs in time as well.

The current trend of ridiculously wide SIMD vectors is a pretty exciting development and is (IMO) a reason to be optimistic in spite of the current climate of "Moore's Law is over and the multicore experiment has failed, we're doomed" handwringing. If you can express your problem as a tight parallel numeric loop, things look good for getting better and better performance in the future.


<rant> Intel has decided to make Xeon laptop parts. Xeon for Skylake means it has AVX512. Amazingly, it no longer means what it historically meant ("server part").

I don't like that this means most developer laptops won't have AVX512 (I'm hoping for a new Skylake "Xeon" mbp), but Lenovo and others have already released some.

</rant>


On the other hand: this means that it is finally possible to get a laptop with an Intel CPU and ECC DRAM!

(And I'm guessing, or really hoping, that I won't have to look up every Xeon laptop part in Intel ARK to see if it supports VT-x and VT-d; I hope Xeon means "no need for further market segmentation, all features that make sense are enabled".)


I thought Skylake finally made VT-d a baseline feature (like VT-x has been for a while).


AFAIK, no Skylake (Xeon or other) chips scheduled for 2016 are going to support AVX-512.


This matches my belief as well. It's currently scheduled for 2017, but since it's slipped once already and since Intel isn't facing a lot of high end competition, who knows what will actually happen.

The naming scheme is terrible, because although "Skylake Xeon" is what many sources are referring to, this turns out not to mean all chips from the Skylake generation that are marketed as Xeon. Rather, it means something like "Skylake that's actually designed for servers" rather than "current Skylake for which we didn't disable ECC".

So despite what others are reporting, I probably should have said that AVX512 is expected to come with "Skylake Purley", which is expected to be marketed as "Skylake Xeon" for the server market: http://wccftech.com/intel-skylake-purley-platform-upto-28-co...


It should be kept in mind that these are microbenchmarks; I'd guess the much bigger AVX2 instruction sequence will have visible cache effects in macrobenchmarks (i.e. combined with a mix of other instructions) and lose its small lead --- on Haswell AVX2 is ~17% faster, on Skylake only 6%.

POPCNT is a single instruction and should definitely be inlined; I doubt that would be such a good idea for these longer sequences, but then putting it in a function means the call+return overhead could also become significant. It's hard to quantify without doing actual measurements, but with the figures given I'd be inclined to stick with POPCNT.


Hi, author here. You're perfectly right, microbenchmarks have flaws. Procedures should be run for different data sizes (as I did years ago) and code should be compiled with different compilers. And of course the method described in the text is designed to deal with large data. Using it for counting bits in a single 64-bit value would be... not wise. :)

AVX2's speedup over POPCNT was not that big, but it seems that was due to my indolence/stupidity. I've just merged a pull request by Simon Lindholm (https://github.com/WojciechMula/sse-popcount/pull/2) with manually unrolled loops, and it made the AVX2 code 40% faster than POPCNT.


The article is really targeting popcount for large bitvectors; in this scenario, call overhead is a non-issue (constant cost of <5-10 cycles vs O(n) work to be performed). I$ effects will also be negligible (the AVX2 instruction sequence is only a few cachelines long, and you wouldn't inline it).


I did some tests and it seems that the AVX2 code is faster than popcnt when the input size is 512 bytes or more.


See also http://danluu.com/assembly-intrinsics/ , in which hand-tuned assembly popcnt is 3x faster than using the compiler intrinsic.


This is an awesome way to do it in software:

    int BitCount(unsigned int u)
    {
        unsigned int uCount;

        uCount = u - ((u >> 1) & 033333333333) - ((u >> 2) & 011111111111);
        return ((uCount + (uCount >> 3)) & 030707070707) % 63;
    }
O(1) solution

This explains why it works.

https://blogs.msdn.microsoft.com/jeuge/2005/06/08/bit-fiddli...


It's also a slow way. Years ago I did a comparison of the many different ways to compute popcount. My use case computes millions of popcounts, and popcount is the limiting factor.

The HAKMEM 169 version you mentioned had half the performance of a simple 8-bit lookup table. http://www.dalkescientific.com/writings/diary/archive/2011/1... . (Note: this assumes the LUT is in cache, which is true for my code.)
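
(For anyone unfamiliar, the 8-bit table version is just something along these lines:)

    #include <stdint.h>

    /* Simple 8-bit lookup-table popcount; the 256-byte table stays in L1. */
    static uint8_t lut[256];

    static void init_lut(void)
    {
        for (int i = 0; i < 256; i++) {
            int c = 0;
            for (int b = i; b; b >>= 1)
                c += b & 1;
            lut[i] = (uint8_t)c;
        }
    }

    static unsigned popcount_lut32(uint32_t v)
    {
        return lut[v & 0xff] + lut[(v >> 8) & 0xff]
             + lut[(v >> 16) & 0xff] + lut[v >> 24];
    }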

HAKMEM 169 was also 1/3 the performance of the "popcount_3" implementation from https://en.wikipedia.org/wiki/Hamming_weight . (Note: that works on 64 bit integers, not 32 bit ints as yours does.)
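
For reference, the 64-bit SWAR version on that Wikipedia page is essentially:

    #include <stdint.h>

    /* "popcount_3"-style SWAR: pairwise sums, then nibble sums, then one
       multiply to fold the byte counts into the top byte. */
    static int popcount_3(uint64_t x)
    {
        x -= (x >> 1) & 0x5555555555555555ULL;
        x  = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
        x  = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
        return (int)((x * 0x0101010101010101ULL) >> 56);
    }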

When I wrote that blog post, I did not have a processor which supported the POPCNT instruction. The fastest SSSE3 version, using the same source as the SSSE3 version referenced in this link, was 8x faster than HAKMEM 169.

If you want to compute the popcount of dozens of words, then a bitslice method by Lauradoux is nearly as fast as one which uses intrinsics. http://www.dalkescientific.com/writings/diary/archive/2008/0...


Any solution with a bounded loop is O(1) too, as would be the case for a loop on the bits of a 32- or 64-bit uint.

What makes this interesting is that it's loopless and compiles to only a few instructions.


That % 63 at the end is an integer division, which is horribly slow. Literally any other sensible approach will beat it, including a naive LUT.


Optimising compilers are smart enough to get rid of unsigned divisions by constants. See [1] (part one of a series; also [2] is neat). To quote [2], "libdivide's scalar code is up to 16 times faster for powers of 2, 10 times faster for non-powers of 2, compared to naive hardware division." In this case GCC (even with -O0) replaces % 63 with a complicated sequence of one integer multiplication, three subtractions, an addition, and three shifts, which I couldn't be bothered decompiling.

[1] http://ridiculousfish.com/blog/posts/labor-of-division-episo...

[2] http://libdivide.com/
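
As an illustration of the underlying trick (with a magic constant I computed for this example, not necessarily the sequence GCC emits), the % 63 reduces to one widening multiply, a shift, and a subtraction:

    #include <stdint.h>
    #include <assert.h>

    /* x % 63 without a divide: 4363141381 = ceil(2^38 / 63), and the
       multiply-shift gives an exact x / 63 for every 32-bit x. */
    static uint32_t mod63(uint32_t x)
    {
        uint32_t q = (uint32_t)(((uint64_t)x * 4363141381ULL) >> 38);
        return x - q * 63;
    }

    int main(void)
    {
        for (uint64_t x = 0; x <= 0xffffffffULL; x++)   /* exhaustive check */
            assert(mod63((uint32_t)x) == (uint32_t)x % 63);
        return 0;
    }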


You're right. I read somewhere that no compiler optimizes modulus by a constant. That's clearly very outdated. I tested with clang 3.2, gcc 4.5, and icc 13 (all on a 32bit platform) and they all quite happily replace the division with other instructions.


If you're into that kind of thing you might be pleased to know that Haswell CPUs have "bit manipulation instructions" designed to accelerate the kinds of C expressions you find in Hacker's Delight.
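
For example, a few of the classic Hacker's Delight one-liners typically compile to single BMI1 instructions when built with -mbmi (a quick sketch):

    #include <stdint.h>

    /* These usually become BLSR, BLSI, and TZCNT respectively. */
    static inline uint64_t clear_lowest_set(uint64_t x)   { return x & (x - 1); }
    static inline uint64_t isolate_lowest_set(uint64_t x) { return x & (0 - x); }
    static inline unsigned trailing_zeros(uint64_t x)     { return (unsigned)__builtin_ctzll(x); } /* x != 0 */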


Very interesting result!

However, this workload involved a loop summing the popcounts of a series of values, with intermediate results not needed. Beating popcount on that workload with AVX2 instructions seems plausible; splitting the work across several instructions gives the CPU more to work with simultaneously, and batching the last sum makes it effectively free when amortized across enough iterations.

However, if the popcounts in your workload remain independent, and you immediately need the values from them, I don't think this approach will help.


> However, if the popcounts in your workload remain independent, and you immediately need the values from them, I don't think this approach will help.

This can be true, but it depends on whether the rest of the workload is vectorized. If you already have the data in vector registers, or can make use of it there afterward, this approach turns out to be even more beneficial than the microbenchmark numbers imply.

We've been working with Wojciech's approach for a while now, and thus can even point to real-world code: https://github.com/RoaringBitmap/CRoaring/blob/master/src/co...


That looks like another case of a bitset, where you want to accumulate the popcounts over the whole bitset. That seems quite similar to the case in this article. (I still find it surprising that popcount doesn't have enough internal optimization to win, but the numbers certainly prove that.)

The case I'm talking about is when you need a single popcount value in isolation as part of some larger computation that doesn't otherwise involve popcounts.

For example, I've worked with libraries that manipulate variable-length data structures where each bit in a flags field indicates the presence of a chunk of data. To compute the size of the data, you need the popcount of the flags field. So you get the popcount (using __builtin_popcount when available), and then immediately use that value.
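
A toy version of that pattern, with made-up names and a fixed chunk size:

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical variable-length record: each set bit in 'flags' says one
       fixed-size chunk is present, so the size is one popcount away. */
    #define CHUNK_SIZE 8

    static size_t payload_size(uint32_t flags)
    {
        return (size_t)__builtin_popcount(flags) * CHUNK_SIZE;
    }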


My favorite use of popcnt, and now this alternate technique, is shaming programming interviewers out of their reliance on explicit bit-twiddling solutions.


I'm actually surprised by this. I would think the AVX instructions have complex microcode that would take longer than a simple bit shifter with an AND to check and add the final bit. I assume `popcnt' isn't used that frequently, so maybe Intel spent a lot of time fine-tuning the microcode for AVX but not `popcnt'.


There is also the effect of pipelining: AVX might be slower in terms of execution latency per bit, but if this algorithm can keep the entire chain full, then its speed in terms of throughput can still be greater.

For example: say popcnt takes 5 cycles from start to finish per word, and uses 1 execution unit (we will assume there is 1 of each type of EU, which isn't true). Then it can process at most 1/5th a word per cycle. The AVX implementation might take 20 cycles from start to finish per word, but it uses 5 execution units in serial (each taking 4 cycles). Then when the first word passes the first stage (4 cycles later) the second word can be passed to stage 0. Then the throughput for larger strings becomes 1 word/4 cycles = 1/4th word per cycle. So 25% more throughput even though it's 4 times as slow when measured for a single word.

In real-life situations this can become quite a nightmare (or a fun little puzzle :]) if you need to intersperse this "popcnt" with multiple other operations.


AVX instructions (at least the ones we're talking about here) are fully pipelined and none have microcoded implementations that I'm aware of.

And recognize that the news here is "on Haswell". On earlier SSE/AVX implementations POPCNT is indeed faster. Something changed with Haswell that allows this code to run faster than the purpose-designed instruction.

Most likely, those two adjacent PSHUFB instructions were limited on earlier architectures to running on a single execution port (with a "PSHUFB engine"), and on Haswell it got instantiated on another ALU, so they can now be executed in parallel.


There's no need to speculate, this stuff is all in Intel's manuals. All shuffles are port 5 on Haswell, and they're single-cycle latency / single-cycle throughput. Actually, Nehalem - Ivy Bridge could execute two PSHUFBs per cycle; Haswell reduced the throughput while adding a 256b-wide version, so the total work per cycle remains constant if you adopt the new instruction.

The SSSE3 popcount implementation was never bottlenecked on PSHUFB[1]. The speedup is because Haswell is a physically wider machine (it has more execution ports) and can execute more uops each cycle.

[1] Except on Merom, where PSHUFB was cracked to 4 or 5 uops IIRC, but that's a ten year old part now.


Too late to edit, but I mangled the last sentence of this comment; it should instead be something like "The speedup is because Haswell has wider vector instructions and more execution ports (it can execute more uops each cycle)."


> Something changed with Haswell that allows this code to run faster than the purpose-designed instruction.

The change is primarily that AVX2 (which Haswell is the first generation to support) extended binary and integer vector operations to 32B registers, while AVX only supported 32B floating point operations. Instructions that previously operated on 16B doubled their throughput by handling 32B with (usually) the same latency: https://software.intel.com/sites/landingpage/IntrinsicsGuide...


I thought Agner had shown that the Haswell AVX2 stuff didn't necessarily bother turning on the entire execution unit unless it really seemed warranted, preferring instead to issue the 128-bit operation twice and combine. For example see the later comments on http://www.agner.org/optimize/blog/read.php?i=142


There's a real effect here, but in practice I haven't found it to be an issue. As long as your instruction mix has at least 1 256-bit operation in the last few million instructions, the slowdown doesn't happen. I'm sure you could construct a case where it would be a problem, but throwing in an occasional unused VPXOR solves it easily enough.

One thing that can be an issue is alignment. Unless your reads are 32B aligned, you will be limited to reading 40B per cycle. While a single unaligned vector read per cycle isn't a problem, full utilization of the increased throughput that 'stephencanon' mentions in the sibling is only possible if both vectors are 32B aligned: http://www.agner.org/optimize/blog/read.php?i=415#423


The other critical piece was that Haswell doubled load/store throughput to L1 cache, not just register widths. (I know that you know this, just want to make it explicit).


popcnt is used very frequently IMO.

One example is Bloom filters, which are used extensively in many domains.

Another example is Tanimoto distance, which has to be super fast for computational chemistry: http://www.dalkescientific.com/writings/diary/archive/2011/1...
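
For a single pair of 64-bit fingerprint words, that computation is just (a sketch):

    #include <stdint.h>

    /* Tanimoto (Jaccard) similarity of two bit-vector words:
       popcount(a & b) / popcount(a | b). */
    static double tanimoto(uint64_t a, uint64_t b)
    {
        int both = __builtin_popcountll(a & b);
        int any  = __builtin_popcountll(a | b);
        return any ? (double)both / (double)any : 1.0;
    }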


The vast majority of AVX instructions are single uop. Even the complex instructions that are made up of a couple uops aren't traditional "microcode".


Slightly off topic: are there any high-level languages apart from C, C++, and Fortran that deal well with SIMD? I've looked at Java and C#, and they seem to have very partial support. Anything else out there with a better SIMD story?


Intel SPMD Program Compiler: https://ispc.github.io/


What's the best book folks recommend for learning SIMD programming?


Some of the resources I've run across (assuming an x86 context):

* Brief intros in Agner Fog's manuals:

- Chapter 12, Using vector operations in http://www.agner.org/optimize/optimizing_cpp.pdf

- Chapter 13, Vector programming in http://www.agner.org/optimize/optimizing_assembly.pdf

* http://www.whatsacreel.net76.net/asmtutes.html / https://www.youtube.com/playlist?list=PL0C5C980A28FEE68D

* Kusswurm (2014) "Modern X86 Assembly Language Programming - 32-bit, 64-bit, SSE, and AVX"

* Hughes (2015) "Single-Instruction Multiple-Data Execution"

* The author's website :-) http://wm.ite.pl/articles/#assembler-x86-fpu-mmx-sse-sse2-ss...

References:

* https://chessprogramming.wikispaces.com/x86-64

* https://software.intel.com/sites/landingpage/IntrinsicsGuide...

* x86 intrinsics cheat sheet (link at the very bottom): http://db.in.tum.de/~finis/



