OK, I got nerd-sniped here. You can actually construct the indices for the shuffle fairly easily with PEXT. Basically, you have 6 64-bit masks, each corresponding to a different bit of the index of each byte in the 64-byte vector. So for mask 0, a bit is set in the mask if its index has bit (1 << 0) set; mask 1 is the same but for bit (1 << 1), etc. The masks have a simple pattern that alternates between runs of 1 and 0 bits every (1 << i) bits. So for 3 bits the masks would be: 10101010, 11001100, 11110000.

These masks are then extracted with PEXT for all the non-whitespace bytes. What this does is build up, bit by bit, the byte index of every non-whitespace byte, compressed down to the least-significant end, without the indices of whitespace bytes.
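
As a tiny worked example (my own illustration, scaled down to 8 bytes): suppose the non-whitespace bytes are at positions 1, 4 and 6, so the keep-mask is 0b01010010. PEXT of each of the three 8-bit index masks with that keep-mask then reassembles each output slot's source index bit by bit:

    // Scaled-down check of the index construction (illustration only, needs BMI2).
    #include <stdio.h>
    #include <stdint.h>
    #include <immintrin.h>   // _pext_u64

    int main(void) {
        uint64_t keep = 0x52;                       // 0b01010010: keep positions 1, 4, 6
        uint64_t masks[3] = { 0xaa, 0xcc, 0xf0 };   // 10101010, 11001100, 11110000
        for (int slot = 0; slot < 3; slot++) {      // one slot per kept byte
            unsigned idx = 0;
            for (int b = 0; b < 3; b++)
                idx |= ((_pext_u64(masks[b], keep) >> slot) & 1u) << b;
            printf("slot %d -> source index %u\n", slot, idx);   // prints 1, 4, 6
        }
        return 0;
    }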

I wasn't actually able to run this code, since I don't have an AVX-512 machine, but I'm pretty sure it should be faster. I put the code on github if anyone wants to try: https://github.com/zwegner/toys/blob/master/avx512-remove-sp...

    const uint64_t index_masks[6] = {
        0xaaaaaaaaaaaaaaaa, // bit 0 of each byte's position
        0xcccccccccccccccc, // bit 1
        0xf0f0f0f0f0f0f0f0, // bit 2
        0xff00ff00ff00ff00, // bit 3
        0xffff0000ffff0000, // bit 4
        0xffffffff00000000, // bit 5
    };
    const __m512i index_bits[6] = {
        _mm512_set1_epi8(1),
        _mm512_set1_epi8(2),
        _mm512_set1_epi8(4),
        _mm512_set1_epi8(8),
        _mm512_set1_epi8(16),
        _mm512_set1_epi8(32),
    };

  ...later, inside the loop:

    mask = ~mask;                               // set bits now mark non-whitespace bytes
    __m512i indices = _mm512_set1_epi8(0);
    for (size_t index = 0; index < 6; index++) {
        // gather bit `index` of every kept byte's position, packed toward the LSB
        uint64_t m = _pext_u64(index_masks[index], mask);
        // add that bit into the shuffle index of each corresponding output lane
        indices = _mm512_mask_add_epi8(indices, m, indices, index_bits[index]);
    }
    output = _mm512_permutexvar_epi8(indices, input);



I'll run it for you on the same machine Wojciech is using and report back shortly. I'm getting a compiler error for your call to _pext_u64(), but I think it's just a matter of adjusting the compiler flags. OK, adding "-march=native" works (was only -mavx512vbmi). Oops, now ./unittest fails:

  [nate@nuc avx512-remove-spaces]$ ./unittest
  test 1 gap
  FAILED; len_ref=63, len=0
   input: [ bcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789#@]
   ref: [bcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789#@]
  result: []
Did you maybe flip the args to _pext_u64() like I often do? I'll check back in a few minutes and see if you've committed a fix.

Edit: Getting to my bedtime here. Send me email (in profile) if you'd like me to follow up with it tomorrow.


Ah, just figured it out. I accidentally deleted this line:

            len = 64 - __builtin_popcountll(mask);

So the destination pointer never got updated. Mental debugging is kind of annoying :)

Thanks!


OK, I'm still up and just saw your fix. Yes, that makes the ./unittest work. I have numbers for ./benchmark; let me grab and compile his original for comparison. Do you know which result you'd expect to see the best difference for?

Results: Yes, if I'm reading things right (and assuming the unittest caught any issues) you are much faster at large cardinalities! Your approach seems to be a constant .7 cycle/op, while Wojciech's varies from about .4 for card=1 to 4.0 for card=64. The breakeven point is card=7, with your approach faster at all higher numbers. Nice improvement!

Send me email if you'd like to pursue this further tomorrow. Also, you probably want to read dragontamer's post here closely and mentally consider whether his approach might make even more sense. I haven't thought about it yet, but if he thinks he knows the best way, he may well be right. Good night!

Edit: Just noticed that my numbers seem slightly faster than the ones Wojciech reported for large cardinalities. My guess would be that this is because I added "-march=native" to both compilations, although maybe there's something more subtle happening too. He's in Europe, so probably asleep now, but will likely be by to offer his thoughts in a few hours.


Keep in mind that the microbenchmark uses exactly fixed cardinalities, with no randomness, so the branches in the original algorithm will be perfectly predicted. That tends to inflate the reported performance relative to a case where the number of spaces has the same expected cardinality but enough variance that the loop mispredicts once for every block.

So I suspect the cutover point where the pext approach wins is lower than the microbenchmark suggests for typical real-world inputs.

It would also be interesting to see a lookup-based approach, using multiple lookups (probably with gather) to build the shuffle control for a single block. 0.7 cycles per element means ~45 cycles per 64-byte block, so that's still quite a large budget per block, and the load ports are under little pressure.


Awesome, thanks! I'll play around a bit with the algorithms and see if I come up with anything neat. I'll email you if I do.


Just favorited this thread for algorithmic awesomeness. Thanks guys.


Hi, author here. That's amazing! I totally love this approach, it's so neat. Wow :) Would you mind if I include your solution in my repository and the article?

BTW you can always check the validity of your code with the Intel Software Development Emulator (https://software.intel.com/en-us/articles/intel-software-dev...). I use this program constantly, as the AVX512 machines I have access to are remote ones (literally remote).
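
For example (assuming sde64 is on your PATH; the -cnl flag selects Cannon Lake emulation, the first chip with AVX512-VBMI):

  sde64 -cnl -- ./unittest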


Hi there! Thanks for the nice writeup and successfully nerd-sniping me! :) You can absolutely use the code.

Good tip on the SDE too. I had heard of it but didn't realize it was so lightweight, dynamically patching the binary to emulate only the needed instructions. Cool!


Your approach is incredibly fast. I love the code, it's so short. :)

Please take a look at https://github.com/WojciechMula/toys/tree/master/avx512-remo..., I put some raw numbers there; will update the article later.

For English texts your algorithm is more than 20 times faster (twenty, not a typo); for CSVs the speed-up is "only" 9x. However, I'd love to compare it with well-tuned scalar code.
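
For reference, a plain branchless scalar version would be something along these lines (just a quick sketch of mine, by no means the tuned code I have in mind):

    // Branchless scalar remove-spaces sketch: always store, advance only on kept bytes.
    #include <stddef.h>

    size_t remove_spaces_scalar(const char *src, size_t len, char *dst) {
        char *out = dst;
        for (size_t i = 0; i < len; i++) {
            char c = src[i];
            *out = c;                                     // unconditional store
            out += (c != ' ' && c != '\n' && c != '\r');  // branchless pointer bump
        }
        return (size_t)(out - dst);                       // number of bytes kept
    }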

One important takeaway for me is that I underestimated pext; it's a powerful instruction. (It's terribly slow on AMD, though.)


Wow, not bad! I'm looking forward to the writeup.

I'm still thinking about this one. It seems that ~.5 cycles/byte is slower than it should be. There are 6 pext->kmovq->vpaddb chains, where the pexts and kmovqs are all independent and can be pipelined. One chain should be 3+2+1 cycles, so all 6 should be 11 cycles. Agner Fog's site doesn't have information for vpermb yet, but a Stack Overflow comment says it's 1 cycle (which is rather surprising: that's 64 64-to-1 8-bit muxes--lots of silicon!). The latency on most other instructions can be hidden.

I updated my code with a couple of minor optimizations that at least lead to some nicer assembly output:

Getting rid of the if (mask) branch--it should almost never be false in real code (and never is in the microbenchmarks).

Inverting the mask during the comparison:

        uint64_t mask = _mm512_cmpneq_epi8_mask(input, spaces)
                      & _mm512_cmpneq_epi8_mask(input, NL)
                      & _mm512_cmpneq_epi8_mask(input, CR);
...then the mask computation gets a lot cleaner, since there are now just three chained vpcmps (though this might lose a couple of cycles because of the dependent 3-cycle latencies; not sure how to trick gcc into issuing the vpcmps in parallel with the ANDs done on GPRs...). But we also save the mask = ~mask and the subtraction from 64 of the popcount.
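
Putting those two changes together, the inner loop would look roughly like this (just a sketch of what I mean; dst is an assumed name for the output pointer, and spaces/NL/CR are the set1 comparison vectors):

    uint64_t mask = _mm512_cmpneq_epi8_mask(input, spaces)
                  & _mm512_cmpneq_epi8_mask(input, NL)
                  & _mm512_cmpneq_epi8_mask(input, CR);
    __m512i indices = _mm512_set1_epi8(0);
    for (size_t index = 0; index < 6; index++) {
        // mask already marks the kept bytes, so no ~mask needed
        uint64_t m = _pext_u64(index_masks[index], mask);
        indices = _mm512_mask_add_epi8(indices, m, indices, index_bits[index]);
    }
    output = _mm512_permutexvar_epi8(indices, input);
    _mm512_storeu_si512(dst, output);
    dst += __builtin_popcountll(mask);   // kept-byte count, no "64 - popcount"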

All in all, the speed-of-light for this code should be 3*3+2+11+1 = 23 cycles per iteration (if I'm counting everything properly), and probably even faster with pipelining between iterations. Hopefully it'll be a bit faster than the 32-ish cycles the old version was taking.


Perhaps nobody will see this, as the thread is already dead, but... :) I managed to include a description of your algorithm and updated results, including procedures by aqrit: http://0x80.pl/notesen/2019-01-05-avx512vbmi-remove-spaces.h...

BTW: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88798


vpermb is 3L1T, i.e. three cycles latency, one per cycle throughput. A full AIDA64 timing dump for Cannon Lake is on the instlatx64 site.


Since there are only 8 bits per input character, would it be possible to just move their bits directly using pext?


Hmm, interesting idea.

The main problem I see is that PEXT can only move bits from within a 64-bit operand, so we can't operate on the 512-bit vectors anymore, AFAICT. We'd have to split the loop up into 8-byte chunks instead, and expand the mask from 1 to 8 bits per comparison, so PEXT pulls out all the bits of each character (e.g. a 4-bit mask 1011 becomes 0xFF00FFFF). This mask computation could still be done in 512-bit chunks, but we'd have to split the original 1-bit mask up into 8-bit slices that each get popcounted individually (either with a 256-byte table lookup or with the popcount instruction). That's just so we know how much to increment the pointer between each 8-byte chunk.

All in all, I think the 8 popcounts + 8 PEXTs would make it slower than the full 512-bit version (there's only one PEXT execution unit on Skylake X chips). It might be competitive on machines without AVX-512, though. I'll try out some code--I should actually be able to run this one :)
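
Roughly something along these lines, maybe (an untested sketch; the per-byte keep-mask loop here just stands in for the SIMD compares, and the pdep-then-multiply trick is only one possible way to widen the mask):

    // Per-8-byte-chunk idea: widen the keep-mask to 8 bits per byte, then let one
    // PEXT compact the characters themselves. Needs BMI2; tail bytes left out.
    #include <stdint.h>
    #include <string.h>
    #include <immintrin.h>   // _pdep_u64, _pext_u64

    size_t remove_spaces_pext8(const char *src, size_t len, char *dst) {
        char *out = dst;
        for (size_t i = 0; i + 8 <= len; i += 8) {
            uint64_t chunk;
            memcpy(&chunk, src + i, 8);
            uint64_t keep = 0;                        // 1 bit per non-whitespace byte
            for (int j = 0; j < 8; j++) {
                char c = src[i + j];
                if (c != ' ' && c != '\n' && c != '\r')
                    keep |= 1ull << j;
            }
            // widen 1 bit/byte to 8 bits/byte, e.g. 0b1011 -> 0xFF00FFFF
            uint64_t wide = _pdep_u64(keep, 0x0101010101010101ull) * 0xFF;
            uint64_t packed = _pext_u64(chunk, wide);
            memcpy(out, &packed, 8);                  // writes 8 bytes; extras overwritten later
            out += __builtin_popcountll(keep);        // the per-chunk popcount mentioned above
        }
        return (size_t)(out - dst);
    }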


If you could avoid actually using POPCNT for the popcount, and the limiting factor was the pext, the "speed limit" of this approach would be around 0.125 cycles per input byte since pext handles 8 bytes and you can do 1 per cycle.

Of course that assumes that you can fit 100% of the remaining work for every 8-byte chunk into what's left of that cycle (i.e. into no more than 3 uops). Since you have at least a store and a load per word, that's probably going to be impossible! It's not totally unlikely that you could get to around 0.25 cycles though.

A good approach would be to vectorize the rest of the work and store the resulting mask values and increments in 64-byte chunks, to be re-read by the scalar part (since there is no really efficient way to get stuff from vector to GP regs). So the scalar part would have 3 loads (mask, increment, input data), a store, and the pointer increment. I guess you could load-op the addition, so it seems possible you might be limited by the 3 loads (taking 1.5 cycles).



