
AVX512 VBMI – remove spaces from text - pedro84
http://0x80.pl/notesen/2019-01-05-avx512vbmi-remove-spaces.html
======
zwegner
OK, I got nerd-sniped here. You can actually construct the indices for the
shuffle fairly easily with PEXT. Basically, you have 6 64-bit masks, each
corresponding to a different bit of the index of each byte in the 64-byte
vector. So for mask 0, a bit is set in the mask if its index has bit (1 << 0)
set, mask 1 has the same but for bit (1 << 1), etc. The masks follow a simple
pattern: mask i alternates between runs of 0 bits and 1 bits every (1 << i)
bits. So for 3 bits the masks would be: 10101010, 11001100, 11110000.

These masks are then extracted with PEXT for all the non-whitespace bytes.
What this does is build up, bit by bit, the byte index of every non-whitespace
byte, compressed down to the least-significant end, without the indices of
whitespace bytes.

I wasn't actually able to run this code, since I don't have an AVX-512
machine, but I'm pretty sure it should be faster. I put the code on github if
anyone wants to try:
[https://github.com/zwegner/toys/blob/master/avx512-remove-sp...](https://github.com/zwegner/toys/blob/master/avx512-remove-spaces/avx512vbmi.cpp)

    
    
        const uint64_t index_masks[6] = {
            0xaaaaaaaaaaaaaaaa,
            0xcccccccccccccccc,
            0xf0f0f0f0f0f0f0f0,
            0xff00ff00ff00ff00,
            0xffff0000ffff0000,
            0xffffffff00000000,
        };
        const __m512i index_bits[6] = {
            _mm512_set1_epi8(1),
            _mm512_set1_epi8(2),
            _mm512_set1_epi8(4),
            _mm512_set1_epi8(8),
            _mm512_set1_epi8(16),
            _mm512_set1_epi8(32),
        };
    
      ...later, inside the loop:
    
        mask = ~mask;
        __m512i indices = _mm512_set1_epi8(0);
        for (size_t index = 0; index < 6; index++) {
            uint64_t m = _pext_u64(index_masks[index], mask);
            indices = _mm512_mask_add_epi8(indices, m, indices, index_bits[index]);
        }
        output = _mm512_permutexvar_epi8(indices, input);

~~~
nkurz
I'll run it for you on the same machine Wojciech is using and report back
shortly. I'm getting a compiler error for your call to _pext_u64(), but I
think it's just a matter of adjusting the compiler flags. OK, adding
"-march=native" works (was only -mavx512vbmi). Oops, now ./unittest fails:

    
    
      [nate@nuc avx512-remove-spaces]$ ./unittest
      test 1 gap
      FAILED; len_ref=63, len=0
       input: [ bcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789#@]
       ref: [bcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789#@]
      result: []
    

Did you maybe flip the args to _pext_u64() like I often do? I'll check back in
a few minutes and see if you've committed a fix.

Edit: Getting to my bedtime here. Send me email (in profile) if you'd like me
to follow up with it tomorrow.

~~~
zwegner
Ah, just figured it out. I accidentally deleted this line:

    
    
                len = 64 - __builtin_popcountll(mask);
    

So the destination pointer never got updated. Mental debugging is kind of
annoying :)

Thanks!

~~~
nkurz
OK, I'm still up and just saw your fix. Yes, that makes the ./unittest work. I
have numbers for ./benchmark; let me grab and compile his original for
comparison. Do you know which result you'd expect to see the best difference
for?

Results: Yes, if I'm reading things right (and assuming the unittest caught
any issues) you are much faster at large cardinalities! Your approach seems to
be a constant 0.7 cycles/op, while Wojciech's varies from about 0.4 for card=1
to 4.0 for card=64. The breakeven point is card=7, with your approach faster at
all higher numbers. Nice improvement!

Send me email if you'd like to pursue this further tomorrow. Also, you
probably want to read dragontamer's post here closely and mentally consider
whether his approach might make even more sense. I haven't thought about it
yet, but if he thinks he knows the best way, he may well be right. Good night!

Edit: Just noticed that my numbers seem slightly faster than the ones Wojciech
reported for large cardinalities. My guess would be that this is because I
added "-march=native" to both compilations, although maybe there's something
more subtle happening too. He's in Europe, so probably asleep now, but will
likely be by to offer his thoughts in a few hours.

~~~
zwegner
Awesome, thanks! I'll play around a bit with the algorithms and see if I come
up with anything neat. I'll email you if I do.

~~~
wjnc
Just favorited this thread for algorithmic awesomeness. Thanks guys.

------
jchw
I was excited for AVX512 long ago but I've since heard that if you are jamming
AVX512 instructions through every core you get a forcibly lower clock rate. In
practice this suggests that using an AVX512 algorithm could actually make the
workload slower overall even when the algorithm itself is faster. If that's the
case, I wonder what kind of performance gain you'd have to hit to beat a scalar
or SSE-based vectorized algorithm.

~~~
shereadsthenews
You also get a lower clock rate from running scalar code on many cores at
once. Use of the 512-bit unit only makes the coefficient different.

That said, the biggest mystery in this article is why you'd ever want to
remove all whitespace from text. Why is that useful?

~~~
jasonzemos
If one could remove spaces from JSON to create a so-called "Canonical JSON"
one could obtain the same digest hash from the same data (even in combination
with the hardware accelerated hashing offered by AVX!). Admittedly this is a
strange case but I've run into it.

~~~
jandrese
Isn't this a problem if you're fed these two inputs:

{mystring:"Hello World!"}

{mystring:"Hello World !"}

They aren't the same JSON.

~~~
marvy
also: {mystring:"HelloWorld!"}

------
dragontamer
I've been nerdsniped as well. I can't say I'm going to go ahead and try and
solve it, but the methodology presented in the post seems suboptimal.

The method I personally think would work best is the "compaction algorithm"
documented here: [http://www.davidespataro.it/cuda-stream-compaction-efficient...](http://www.davidespataro.it/cuda-stream-compaction-efficient-implementation/)

True, that's a CUDA implementation, but AVX512 programming is closely related
to GPU programming. Effectively, you calculate the prefix sum of the "matches".

The paper the above code is based on is very clear on how this works:
[http://www.cse.chalmers.se/~uffe/streamcompaction.pdf](http://www.cse.chalmers.se/~uffe/streamcompaction.pdf)

Pay close attention to "figure 1" on page 2. That's the crux of the algorithm.
Assuming 8-bit characters, you can generate a prefix sum in just 6 steps (each
step is a constant, pre-defined byte-shift plus add). A prefix sum is best
described by the following picture:
[https://en.wikipedia.org/wiki/Prefix_sum#/media/File:Hillis-...](https://en.wikipedia.org/wiki/Prefix_sum#/media/File:Hillis-Steele_Prefix_Sum.svg)

Full Wikipedia page on Prefix Sum:
[https://en.wikipedia.org/wiki/Prefix_sum](https://en.wikipedia.org/wiki/Prefix_sum)

Prefix Sum is just 6 steps for an AVX512 register of 8-bit ints. That generates
the full AVX512-space permute (ie: if the prefix sum is 5 for an element, that
means that element belongs in index #5). But AVX512 has "in lane" permutes
only. I don't know how many steps you'd need to turn an "in lane" permute into
a "cross lane" permute... but it doesn't seem too difficult a problem (and
IIRC I read a blog post about how to convert the in-lane AVX512 permutes into
a cross-lane one).

I bet that the above sketch of the AVX512 algorithm can be implemented in less
than 30 assembly instructions for the full AVX512 / 64-byte space, maybe less
than 20. That should definitely run faster than the scalar version.

\-------

EDIT: Herp-derp. It doesn't seem like VPERMB is affected by AVX Lanes (!!).
[https://www.felixcloutier.com/x86/vpermb](https://www.felixcloutier.com/x86/vpermb)

So I guess you can just run VPERMB at the end on the calculated prefix-sum.
The end.

\-------

The Stream Compaction algorithm is a very important one-dimensional work-
balancing paradigm in the GPU programming world. It is used to select which
rays are still active in a raytracing scenario (so that all SIMD registers
have something to do).

~~~
zwegner
Interesting, I started out thinking along these lines, but once I figured out
I could use PEXT, I just went with that.

I think this approach needs some tweaks, though. Mainly that the vpermb at the
end is the inverse of what we want--the bytes at dense indices get spread out
to the sparse indices (it works analogously to gather, but we want scatter). I
can't think of a way around this right now...

That said, it's an interesting approach. I think the PEXTs would be the
bottleneck in my code (looks like there's only one execution unit for them,
whereas there's two for the VPADDs), and finding a way to parallelize all the
VPADDs could lead to a nice speedup.

~~~
dragontamer
You're right.

I did a brief look through the AVX512 instructions to look for a solution, and
unfortunately, it seems we both may have been overthinking this.

vpcompressb more or less does the job in one instruction. Agner Fog doesn't
have a latency listed for it, however.

\---------

My search methodology was basically this:
[https://software.intel.com/sites/landingpage/IntrinsicsGuide...](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#cats=Swizzle&expand=1219,1227,1229&text=__m512i)

Search for __m512i (integer-based ZMM registers), with the category "swizzle"
(which includes permute, insert, and other such instructions). I figure any
potential AVX512 instruction would be a "Swizzle" style instruction.

\-------

Note: I originally responded to the wrong location in this thread. I
copy/pasted my text to here, which is where I originally intended to respond.

~~~
nkurz
Do you recall which machines VPCOMPRESSB works on? I think it's next
generation Icelake? Or is it there already on Cannonlake? And along the same
lines, is there a good general way of looking this up?

Coincidentally, searching for this, I found Geoff Langdale's blog post where
in addition to describing VPCOMPRESSB as 'dynamiting the trout stream', he
also describes something very close to zwegner's PEXT approach:
[https://branchfree.org/2018/05/22/bits-to-indexes-in-bmi2-an...](https://branchfree.org/2018/05/22/bits-to-indexes-in-bmi2-and-avx-512/)

~~~
BeeOnRope
It's not in Cannonlake (nor the W variants). The D and Q versions are in SKX
though and they are 4L2T IIRC.

You need a CPU with VBMI2 for the B variant, can't remember off the top of my
head if Icelake has that.

------
Const-me
Interesting, but his scalar code is slow. When you care about performance, it's
better to implement such an algorithm so it reads bytes one by one, but moves
blocks with memmove when switching from the write state to the skip state.

The pathological case (skipping every other character) is slightly slower, but
on real data it’s much faster overall.

Unfortunately I don’t have AVX512 hardware so I can’t test.

~~~
wmu
I was wondering if you'd like to contribute better scalar code? I'll be happy
to include your (or anybody else's) code and then compare different approaches.

~~~
aqrit
Try some of these :p
[https://gist.github.com/aqrit/6e73ca6ff52f72a2b121d584745f89...](https://gist.github.com/aqrit/6e73ca6ff52f72a2b121d584745f89f3)

~~~
wmu
Thank you again! Take a look:
[https://github.com/WojciechMula/toys/tree/master/avx512-remo...](https://github.com/WojciechMula/toys/tree/master/avx512-remove-spaces)

Your AVX2 is better in benchmarks than Zach's AVX512VBMI, wow. I have to admit
that I looked at the code but got lost. :) I will need more time to digest it.

However, in despacing English texts and CSV, Zach's variant is still faster.

------
CountHackulus
This is really neat, I wonder if there's a way to keep a "remainder" around,
kind of like Bresenham's algorithm, so that you can always do aligned reads
from memory.

The speedup on English text is really good, and I love the exploration into
the AVX intrinsics.

~~~
BeeOnRope
Unless I'm missing something it's the stores that are variably sized and hence
become misaligned, not the reads?

------
lostmsu
Instead of generating the shuffle at runtime, couldn't a table be used for
shuffling the lower and higher parts of the register separately, then merging
the result?

Also, for uncommon patterns, the register could be split further to make the
shuffle table fit into L1.

Also, I am not sure the compiler can optimize that continue statement. The
non-AVX version might be improved by removing the if altogether, and replacing
_dst++_ with _dst =_ followed by _dst += (src[i] == ' ' || src[i] == '\r' ||
src[i] == '\n') ? 0 : 1_

~~~
jsd1982
The ternary operator a?b:c is still a conditional operation and may result in
a branch instruction, same as an if statement. I wouldn't think it would
matter much how exactly the branch condition was spelled in C code. It'll get
translated to low level IR and optimized at that level.

~~~
kccqzy
In this case, the GP is having code of the form

    
    
        a ? 0 : 1
    

which can usually be rewritten as

    
    
        !a
    

Any worthwhile compiler should compile that into one instruction or two, like
`setne` or something else.

~~~
lostmsu
I was hoping for a direct translation to CMOV. Anyway, the assembly has to be
analyzed to be sure.

~~~
Karliss
No need to speculate.
[https://godbolt.org/z/AC2NKY](https://godbolt.org/z/AC2NKY). Results
vary significantly between compilers. In almost all cases at least one jump is
generated, due to the way the comparison with the 3 values less than 32 is
done. If the number is less than or equal to 32, the compiler does a lookup in
a bitmask indicating which values to skip. Clang and MSVC often generated a
second jump.

~~~
lostmsu
None of the compilers generated a branch-free loop. Looks like _or_ short-
circuiting is the reason.

Now I wonder if getting rid of branches at assembly level actually gives any
benefit. That can be achieved by using _|_ instead of _||_.

------
terrycody
what is the usage of these things?

------
pkaye
What if you are working with unicode?

~~~
loeg
This removes ascii spaces from ascii or ascii-compatible encodings (e.g.,
utf-8 and others). It doesn't handle the full complexity of unicode spaces,
obviously.

