The AVX2 implementation requires more instructions, but spreads the work across more execution ports. My fastest unrolled version takes 2.5 cycles per 32B vector, or 0.078 cycles/byte (2.5/32). That makes it 1.6x faster (4 cycles per 32B vs. 2.5) than scalar popcnt().
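For the curious, the kernel is built on a PSHUFB nibble lookup; here is a minimal sketch of that idea (my reconstruction of Wojciech's approach, not the exact unrolled code measured above):

    #include <immintrin.h>

    /* Per-64-bit-lane popcount of a 32B vector: split each byte into
       nibbles, look up their popcounts in a 16-entry table via VPSHUFB,
       then let VPSADBW sum the per-byte counts into four 64-bit lanes. */
    static inline __m256i popcount_avx2(__m256i v)
    {
        const __m256i lookup = _mm256_setr_epi8(
            0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
            0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
        const __m256i low_mask = _mm256_set1_epi8(0x0f);
        __m256i lo  = _mm256_and_si256(v, low_mask);
        __m256i hi  = _mm256_and_si256(_mm256_srli_epi16(v, 4), low_mask);
        __m256i cnt = _mm256_add_epi8(_mm256_shuffle_epi8(lookup, lo),
                                      _mm256_shuffle_epi8(lookup, hi));
        return _mm256_sad_epu8(cnt, _mm256_setzero_si256());
    }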
Whether this is worth doing depends on the rest of the workload. If the rest of the work is being done on scalar 64-bit registers, those other scalar operations can often fit in between the popcnt()s "for free", and the cost of moving data back and forth between vector and scalar registers usually overwhelms the advantage.
But if you are already dealing with vector registers, the speedup can be even greater than a microbenchmark would imply. We've been using Wojciech's approach for operations on compressed bitsets, and find it fantastic: https://github.com/RoaringBitmap/CRoaring/blob/master/src/co...
And since wider registers are almost certainly coming, the vectorized approach seems likely to pull even farther ahead in the future. AVX512, the 64B successor to AVX2, is currently scheduled for Skylake Xeon. Unless a second execution port is added for popcnt() (not likely) the vectorized version will probably end up about 3x as fast in a couple years.
This is fantastic news. I hope this makes it down to consumer CPUs in time as well.
The current trend of ridiculously wide SIMD vectors is a pretty exciting development and is (IMO) a reason to be optimistic in spite of the current climate of "Moore's Law is over and the multicore experiment has failed, we're doomed" handwringing. If you can express your problem as a tight parallel numeric loop, things look good for getting better and better performance in the future.
I don't like that this means most developer laptops won't have AVX512 (I'm hoping for a new Skylake "Xeon" MacBook Pro), but Lenovo and others have already released some Xeon laptops.
(And I'm guessing, really hoping, that I won't have to look up every Xeon laptop part in Intel ARK to see if it supports VT-x and VT-d. I hope "Xeon" means "no need for further market segmentation; all features that make sense are enabled".)
The naming scheme is terrible, because although "Skylake Xeon" is what many sources are referring to, this turns out not to mean all chips from the Skylake generation that are marketed as Xeon. Rather, it means something like "Skylake that's actually designed for servers" rather than "current Skylake for which we didn't disable ECC".
So despite what others are reporting, I probably should have said that AVX512 is expected to come with "Skylake Purley", which is expected to be marketed as "Skylake Xeon" for the server market: http://wccftech.com/intel-skylake-purley-platform-upto-28-co...
POPCNT is a single instruction and should definitely be inlined; I doubt that would be such a good idea for these longer sequences, but then putting it in a function means the call+return overhead could also become significant. It's hard to quantify without doing actual measurements, but with the figures given I'd be inclined to stick with POPCNT.
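For what it's worth, when the target supports it, the POPCNT intrinsic (or __builtin_popcountll on GCC/Clang) compiles straight to the instruction, so there's no call+return overhead to worry about; e.g.:

    #include <nmmintrin.h>  /* _mm_popcnt_u64; build with -msse4.2 or -mpopcnt */
    #include <stdint.h>

    /* Inlines to a single POPCNT instruction, like any other arithmetic op. */
    static inline uint64_t popcnt64(uint64_t x)
    {
        return (uint64_t)_mm_popcnt_u64(x);
    }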
AVX2's speedup over POPCNT is not big, but it seems that's due to my indolence/stupidity. I've just merged a pull request by Simon Lindholm (https://github.com/WojciechMula/sse-popcount/pull/2) with manually unrolled loops, and it made the AVX2 code 40% faster than POPCNT.
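The unrolled loop looks roughly like this (a sketch, not the PR itself; popcount_avx2 stands for a per-vector nibble-lookup kernel like the one sketched upthread):

    #include <immintrin.h>
    #include <stdint.h>
    #include <stddef.h>

    __m256i popcount_avx2(__m256i v);  /* per-vector kernel, as upthread */

    /* Four independent 32B popcounts per iteration give the scheduler more
       parallel work; 64-bit partial sums stay in a vector accumulator.
       Assumes n is a multiple of 128 for brevity. */
    uint64_t popcount_unrolled(const uint8_t *data, size_t n)
    {
        __m256i acc = _mm256_setzero_si256();
        for (size_t i = 0; i < n; i += 128) {
            const __m256i *p = (const __m256i *)(data + i);
            acc = _mm256_add_epi64(acc, popcount_avx2(_mm256_loadu_si256(p + 0)));
            acc = _mm256_add_epi64(acc, popcount_avx2(_mm256_loadu_si256(p + 1)));
            acc = _mm256_add_epi64(acc, popcount_avx2(_mm256_loadu_si256(p + 2)));
            acc = _mm256_add_epi64(acc, popcount_avx2(_mm256_loadu_si256(p + 3)));
        }
        uint64_t sums[4];
        _mm256_storeu_si256((__m256i *)sums, acc);
        return sums[0] + sums[1] + sums[2] + sums[3];
    }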
    int BitCount(unsigned int u)
    {
        /* HAKMEM 169: octal masks make each 3-bit group hold its own count */
        unsigned int uCount = u - ((u >> 1) & 033333333333)
                                - ((u >> 2) & 011111111111);
        /* sum adjacent groups, then mod 63 adds up the base-64 digits */
        return ((uCount + (uCount >> 3)) & 030707070707) % 63;
    }
This explains why it works.
The HAKMEM 169 version you mentioned had half the performance of a simple 8-bit lookup table: http://www.dalkescientific.com/writings/diary/archive/2011/1... (Note: this assumes the LUT is in cache, which is true for my code.)
HAKMEM 169 also had 1/3 the performance of the "popcount_3" implementation from https://en.wikipedia.org/wiki/Hamming_weight (Note: that one works on 64-bit integers, not 32-bit ints as yours does.)
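For reference, the 64-bit version reads roughly like this (from memory; check the page for the exact popcount_3):

    #include <stdint.h>

    /* SWAR popcount: 2-bit sums, then 4-bit sums, then byte sums, then one
       multiply to add all eight bytes into the top byte. */
    static int popcount_3(uint64_t x)
    {
        x -= (x >> 1) & 0x5555555555555555ULL;
        x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
        x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
        return (int)((x * 0x0101010101010101ULL) >> 56);
    }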
When I wrote that blog post, I did not have a processor which supported the POPCNT instruction. The fastest SSSE3 version, using the same source as the SSSE3 version referenced in this link, was 8x faster than HAKMEM 169.
If you want to compute the popcount of dozens of words, then a bitslice method by Lauradoux is nearly as fast as one which uses intrinsics. http://www.dalkescientific.com/writings/diary/archive/2008/0...
What makes this interesting is that it's loopless and compiles to only a few instructions.
However, this workload involved a loop summing the popcounts of a series of values, with intermediate results not needed. Beating popcount on that workload with AVX2 instructions seems plausible; splitting the work across several instructions gives the CPU more to work with simultaneously, and batching the last sum makes it effectively free when amortized across enough iterations.
However, if the popcounts in your workload remain independent, and you immediately need the values from them, I don't think this approach will help.
This can be true, but it depends on whether the rest of the workload is vectorized. If you already have the data in vector registers, or can make use of it there afterward, this approach turns out to be even more beneficial than the microbenchmark numbers imply.
We've been working with Wojciech's approach for a while now, and thus can even point to real-world code: https://github.com/RoaringBitmap/CRoaring/blob/master/src/co...
The case I'm talking about is when you need a single popcount value in isolation as part of some larger computation that doesn't otherwise involve popcounts.
For example, I've worked with libraries that manipulate variable-length data structures where each bit in a flags field indicates the presence of a chunk of data. To compute the size of the data, you need the popcount of the flags field. So you get the popcount (using __builtin_popcount when available), and then immediately use that value.
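Schematically, something like this (names and sizes made up for illustration):

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical layout: each set bit in `flags` marks one 8-byte chunk
       as present. One popcount, and the result is needed immediately. */
    static size_t data_size(uint32_t flags)
    {
        return (size_t)__builtin_popcount(flags) * 8;
    }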
For example: say popcnt takes 5 cycles from start to finish per word and uses 1 execution unit (we'll assume there is 1 of each type of EU, which isn't true in practice). Then it can process at most 1/5 of a word per cycle.
The AVX implementation might take 20 cycles from start to finish per word, but it uses 5 execution units in serial (each taking 4 cycles). Then when the first word passes the first stage (4 cycles later) the second word can be passed to stage 0. Then the throughput for larger strings becomes 1 word/4 cycles = 1/4th word per cycle. So 25% more throughput even though it's 4 times as slow when measured for a single word.
In real-life situations this can become quite a nightmare (or a fun little puzzle :]) if you need to intersperse this "popcnt" with multiple other operations.
And recognize that the news here is "on Haswell". On earlier SSE/AVX implementations POPCNT is indeed faster. Something changed with Haswell that allows this code to run faster than the purpose-designed instruction.
Most likely those two adjacent PSHUFB instructions were limited on earlier architectures to running on a single execution port (with a "PSHUFB engine"), and on Haswell it got instantiated on another ALU, so they can now execute in parallel.
The SSSE3 popcount implementation was never bottlenecked on PSHUFB. The speedup is because Haswell is a physically wider machine (it has more execution ports) and can execute more uops each cycle.
Except on Merom, where PSHUFB was cracked into 4 or 5 uops, IIRC, but that's a ten-year-old part now.
The change is primarily that AVX2 (which Haswell is the first generation to support) extended binary and integer vector operations to 32B registers, while AVX only supported 32B floating point operations. Instructions that previously operated on 16B doubled their throughput by handling 32B with (usually) the same latency: https://software.intel.com/sites/landingpage/IntrinsicsGuide...
One thing that can be an issue is alignment. Unless your reads are 32B aligned, you will be limited to reading 40B per cycle. While a single unaligned vector read per cycle isn't a problem, full utilization of the increased throughput that 'stephencanon' mentions in the sibling comment is only possible if both loads in a cycle are 32B aligned: http://www.agner.org/optimize/blog/read.php?i=415#423
One example is the Bloom filter, which is used extensively in many domains.
Another example is the Tanimoto distance, which has to be super fast for computational chemistry.
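The Tanimoto inner loop is essentially nothing but popcounts over fingerprint words; a minimal scalar sketch (GCC/Clang builtins assumed):

    #include <stddef.h>
    #include <stdint.h>

    /* Tanimoto coefficient of two bit fingerprints: |A & B| / |A | B|.
       This loop is exactly the popcount-bound pattern the vectorized
       kernels discussed above are aimed at. */
    double tanimoto(const uint64_t *a, const uint64_t *b, size_t nwords)
    {
        uint64_t common = 0, any = 0;
        for (size_t i = 0; i < nwords; i++) {
            common += (uint64_t)__builtin_popcountll(a[i] & b[i]);
            any    += (uint64_t)__builtin_popcountll(a[i] | b[i]);
        }
        return any ? (double)common / (double)any : 1.0;
    }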
* Brief intros in Agner Fog's manuals:
- Chapter 12, Using vector operations in http://www.agner.org/optimize/optimizing_cpp.pdf
- Chapter 13, Vector programming in http://www.agner.org/optimize/optimizing_assembly.pdf
* http://www.whatsacreel.net76.net/asmtutes.html / https://www.youtube.com/playlist?list=PL0C5C980A28FEE68D
* Kusswurm (2014) "Modern X86 Assembly Language Programming - 32-bit, 64-bit, SSE, and AVX"
* Hughes (2015) "Single-Instruction Multiple-Data Execution"
* The author's website :-) http://wm.ite.pl/articles/#assembler-x86-fpu-mmx-sse-sse2-ss...
* x86 intrinsics cheat sheet (link at the very bottom): http://db.in.tum.de/~finis/