Filtering a vector with SIMD instructions (AVX-2 and AVX-512) (quickwit.io)
104 points by francoismassot on Sept 1, 2022 | 47 comments



Whenever you need a lookup table of vectors, remember that instructions like _mm256_cvtepu8_epi32 or _mm256_cvtepi8_epi64 can zero-extend or sign-extend integer lanes, and they can load their source directly from memory. The OP needs a vector with int32 lanes in the [0..7] range, so the table can use 1 byte per lane and expand to uint32_t while loading with _mm256_cvtepu8_epi32. That reduces the table size by a factor of 4, at the cost of a single instruction.
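
Something like this (a rough sketch, not the article's code; the table name and contents are illustrative, they'd hold the same lane indices the article computes):

    // Shrink the AVX2 permutation lookup table from 8 x int32 per entry to
    // 8 x uint8 per entry, widening the bytes while loading.
    #include <immintrin.h>
    #include <stdint.h>

    // 256 possible 8-bit masks, 8 byte-sized lane indices each: 2 KiB instead of 8 KiB.
    extern const uint8_t permutation_table[256][8];

    static inline __m256i load_permutation(uint32_t mask8)
    {
        // 8-byte load; compilers fold this into vpmovzxbd with a memory operand.
        __m128i bytes = _mm_loadl_epi64((const __m128i*)permutation_table[mask8]);
        return _mm256_cvtepu8_epi32(bytes); // zero-extend u8 lanes to u32
    }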

Another thing, it’s possible to produce the desired vectors without RAM loads, with a couple of BMI2 instructions. Example in C++:

    // Compress int32 lanes according to the mask
    // The mask is expected to contain either 0 or UINT_MAX values
    __m256i compress_epi32( __m256i val, __m256i mask )
    {
        uint32_t i32 = (uint32_t)_mm256_movemask_epi8( mask );
        // Compress, using 4 bit numbers for lane indices
        constexpr uint32_t identity = 0x76543210;
        i32 = _pext_u32( identity, i32 );
        // Expand 4-bit numbers into bytes
        constexpr uint64_t expand = 0x0F0F0F0F0F0F0F0Full;
        uint64_t i64 = _pdep_u64( (uint64_t)i32, expand );
        // Move to SSE vector
        __m128i vec = _mm_cvtsi64_si128( (int64_t)i64 );
        // Expand bytes to uint32_t
        __m256i perm = _mm256_cvtepu8_epi32( vec );
        // Shuffle these elements
        return _mm256_permutevar8x32_epi32( val, perm );
    }
One downside, these BMI2 instructions are rather slow on Zen 2 and earlier AMD processors. AMD has fixed the performance in Zen 3.


This takes me back to some blissful days spent optimizing an integer compression scheme in AVX2. It's really a difficult instruction set to work with. The author's comment about a set of oddly shaped Legos is very apt. It often doesn't have instructions that cross lanes and often lacks the thing you need, so you find yourself hacking around it with bit operations and lookup tables. AVX512 is a huge improvement, but it still hasn't caught on in a big way over seven years later.


Play with Intel ISPC or CUDA or OpenCL.

I think what programmers are missing is the "mental model" that scales well to large problems. Once you have such a model, working your way back down to SIMD-assembly is way easier.

---------

The general mental model is "perform work independently. Then perform prefix/scan to synchronize and coordinate between threads", combined with thread-barriers.

There are other models of course, but this paradigm works for a surprisingly huge number of problems.

EDIT: Compact, the solution from this blogpost, is itself a prefix-scan operation. See http://www.cse.chalmers.se/~uffe/streamcompaction.pdf , in case people weren't aware of how prefix sums underlie so many of today's parallel algorithms in practice. Intel decided to make a specific assembly instruction for stream compaction, but it's just the prefix-sum steps we've known and loved for ages.
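
A minimal scalar sketch of that equivalence (mine, not from the paper or the post): the exclusive prefix sum of the keep-flags is exactly the output position of each kept element, and parallel versions just compute the same scan cooperatively.

    #include <stddef.h>
    #include <stdint.h>

    // Compact: keep in[i] wherever keep[i] != 0, returning the number kept.
    size_t compact_u32(const uint32_t* in, const uint8_t* keep, size_t n, uint32_t* out)
    {
        size_t running = 0;              // exclusive prefix sum, computed on the fly
        for (size_t i = 0; i < n; ++i) {
            size_t offset = running;     // scan value for element i
            running += keep[i] ? 1u : 0u;
            if (keep[i])
                out[offset] = in[i];     // scatter to the scanned position
        }
        return running;
    }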


100% the mental model I've always used. Few notes:

- instead of "assign object X to bin Y by placing a pointer to X in binY.objectVector", treat that operation as equivalent to "sort + prefix-scan". Build an array of [0..N] (thrust counting_iterator is golden) and a parallel "values" array of [binX, ..., binZ], then sort the index array using the value array as the key (see the sketch after this list).

- Once you've sorted the bins, the prefix scan finds the "start" of each sub-array and the length is start[x+1] - start[x]. If you have 0 size, that sub-array is empty.

- if the data is sparse, you can filter before you launch - so you'd build an array containing the (sorted) indexes of any Y that has data.

- all of these concepts become much easier working with "offset from an array" rather than a pointer directly

- structure-of-arrays rather than array-of-structs - to keep memory access aligned and maximized within a warp

- if you are looking for "each object outputs between 0 and N items as a result", that's basically prefix-scan within your warp/grid. Have the 0th thread do an atomic-CAS/atomic-increment to increment a global counter that offsets the whole warp/grid with the appropriate number of items, then every thread with an item writes to arr[GLOBAL_COUNT + myPrefixSum]. Again, in many cases you will probably write TWO parallel arrays at this time (index of the item, and the actual output value), or more.

- Allocate your arrays at program startup and allocate them as big as you think they'll need to be. GPUs don't do memory fragmentation, you'll want big contiguous allocations.

- Thrust Framework makes all this super easy with zip-iterators, counting-iterators, sort, prefix-scan, etc. If needed you can get pointers down to the actual arrays and intermingle all this with conventional low-level CUDA __kernel__ or __device__ functions. Check out their quickstart example at this link.

https://thrust.github.io/

(radix/postal) sorting and prefix-sum is, by far, the most successful way I've seen to port "conventional" logic flows to GPUs, those instructions run VERY fast due to being parallel and aligned. Your code is really mostly just "glue" to build those values for either warp-wide or device-wide sorts/prefix sums - the GPU is performing the heavy lifting of "re-aligning" the data and it does it very efficiently.
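
Here's a rough Thrust sketch of that binning pattern (illustrative names only, not code from any project; I use thrust::lower_bound on the sorted keys to recover the bin starts, which gives the same offsets as the histogram + prefix-scan formulation):

    #include <thrust/binary_search.h>
    #include <thrust/device_vector.h>
    #include <thrust/iterator/counting_iterator.h>
    #include <thrust/sequence.h>
    #include <thrust/sort.h>

    // Group object indices by bin: sort indices using the bin id as the key,
    // then find where each bin starts in the sorted key array.
    void bin_by_sort(thrust::device_vector<int>& bin_ids, // bin id per object (sorted in place)
                     thrust::device_vector<int>& order,   // out: object indices grouped by bin
                     thrust::device_vector<int>& start,   // out: first sorted position of each bin
                     int num_bins)
    {
        order.resize(bin_ids.size());
        thrust::sequence(order.begin(), order.end());        // [0, 1, ..., N-1]
        thrust::sort_by_key(bin_ids.begin(), bin_ids.end(),  // sort the index array by bin id
                            order.begin());
        start.resize(num_bins);
        thrust::lower_bound(bin_ids.begin(), bin_ids.end(),  // start[b] = first position with key >= b
                            thrust::counting_iterator<int>(0),
                            thrust::counting_iterator<int>(num_bins),
                            start.begin());
        // Bin b is order[start[b] .. start[b+1]) with an implicit start[num_bins] == N;
        // an empty range means bin b has no objects.
    }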


> - Thrust Framework makes all this super easy with zip-iterators, counting-iterators, sort, prefix-scan, etc. If needed you can get pointers down to the actual arrays and intermingle all this with conventional low-level CUDA __kernel__ or __device__ functions.

I find that CUDA's cub library is better if you're doing prefix-sums within a kernel. "Thrust" is more of a quick-and-dirty prototype kind of code, which has weaker performance than people expect.

You gotta get down and dirty with the __kernel__ functions. And cub is the library for that (not Thrust).

Thrust is great for GPU-prototypes / general scan/reduce prototyping IMO. Probably good enough for a lot of problems, but it's a bit slow in practice. Thrust has the mental model down pat, but it just doesn't have enough performance.

https://docs.nvidia.com/cuda/cub/index.html


> I find that CUDA's cub library is better if you're doing prefix-sums within a kernel.

Thrust doesn't have a __device__ prefix-sum iirc, just the global call /laugh

> "Thrust" is more of a quick-and-dirty prototype kind of code, which has weaker performance than people expect.

Yes, absolutely, they're complementary and in many cases CUB does things slightly better, or does things that Thrust doesn't support.

But Thrust is fantastic for "I want to allocate some GPU arrays, set up some data, and run sort+prefix sum, then hand it off to something else to run the actual algorithm." It's glue that helps you get started (eg see those quickstarts - those are very short even by CUDA standards, let alone OpenCL) and figure out if your idea is going to work. And there's very little penalty to keeping the "global steps" inside thrust, eg if you're just doing "fill this index-array with 0..N and then sort(arr1,arr2)" that is not much slower than doing everything raw, or writing one big function that tries to do everything without intermediate computations. It's also easy to get Thrust containers to give you a real pointer, and at that point you can call CUB or real kernels or do whatever else you want.

As far as performance... eh, CUB is a little faster but not like incredibly much so, maybe 10% or so from what I remember, it wasn't huge. Thrust algorithms are usually not in-place so CUB can provide slightly higher problem size in most situations (since you don't have to allocate a scratch buffer). I actually found the CUB in-place sort was slower than Thrust non-inplace though (understandable, that's a common penalty, and CUB non-inplace might be even faster).

More fundamentally, Thrust really works at the level of iterators and not kernels/grids, so you can't really do warp-level operations at all using global "sort this shit" type commands. Thrust doesn't expose the grid information to you and doesn't make guarantees about what grid topology will be executed (there is an OpenMP backend!).

But if there is some general "per-item" function in your algorithm, you can call it using the map-iterator (can't remember what it's called but like, pass this object to this function) and either pass the object to work on, or have the value passed be an index of a work-item and your function loads it (store a pointer to the array start in the map-iterator). And in that case you inherit some of the occupancy auto-tuning that Thrust does, which is nice just as a basic thing to get off the ground - it'll try to use as wide a grid as is feasible given the occupancy/utilization.

I seem to remember that I did find a way to kinda work around it somehow, like what I was iterating was grid launches instead of work-items, and obviously those can use warp-collective calls etc, but yeah at some point you'll have to make the hop to a proper kernel launch, Thrust just lets you push it off a bit. I was just seeing if I could do it to leverage Thrust's occupancy auto-tuning.

Maybe it was that I'd stride the object space (eg launch an iterator for every 32 items) and do a kernel launch on each chunk, or something like that.
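
For reference, a rough sketch of that per-item pattern (made-up names; the iterator I couldn't remember may be transform_iterator, but for_each over a counting_iterator gets you the same "pass an index, load your own work-item" shape):

    #include <thrust/device_vector.h>
    #include <thrust/for_each.h>
    #include <thrust/iterator/counting_iterator.h>

    struct ProcessItem
    {
        const float* items;   // pointer to the array start, stored in the functor
        float* results;

        __host__ __device__ void operator()(int i) const
        {
            results[i] = items[i] * 2.0f;   // placeholder per-item work
        }
    };

    void run_per_item(thrust::device_vector<float>& items,
                      thrust::device_vector<float>& results)
    {
        ProcessItem f{ thrust::raw_pointer_cast(items.data()),
                       thrust::raw_pointer_cast(results.data()) };
        thrust::for_each(thrust::counting_iterator<int>(0),
                         thrust::counting_iterator<int>((int)items.size()),
                         f);
    }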


> And there's very little penalty to keeping the "global steps" inside thrust, eg if you're just doing "fill this index-array with 0..N and then sort(arr1,arr2)" that is not much slower than doing everything raw, or writing one big function that tries to do everything without intermediate computations.

At a large granularity, yes if that's what you're doing.

But if you need to exit the kernel / device side just to push/pop from a queue, or to allocate data to/from a stack (prefix-sum(sizes) -> allocate the top sum-of-sizes space from the stack) for a SIMD-stack push/pop operation, things will be quite slow.

SIMD-stack push/pop should be done at the block level and coordinated/synchronized between other blocks by using atomics (atomic_add(stack_head) / atomic_subtract(stack_head)). Especially if you don't know how many times a particular routine will push to the top of the stack.

Note: simd-stack is safe as long as all threads are pushing together, or popping together. If you can split your algorithm into the "push-only kernel", and then the "pop-only kernel" steps, you can have a surprising level of flexibility.

-------

Anyway, using a Thrust-level prefix sum will spin up an entire grid log(n) times each time you want to add/remove things from that shared stack. So you're really spawning too many grids IMO.

Instead, a CUB-level block-level prefix sum will atomic_add() / push onto the stack efficiently before exiting. So you have far fewer kernel calls.
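
A rough CUDA sketch of that block-level push (the generic pattern with illustrative names, not code from any of the projects mentioned): cub::BlockScan gives each thread its offset within the block, thread 0 reserves space with a single atomicAdd, and every thread then writes its own items.

    #include <cub/block/block_scan.cuh>

    constexpr int kBlockThreads = 256;  // launch with this many threads per block

    __global__ void push_to_stack(const int* counts,  // items each thread wants to push
                                  int num_items,
                                  int* stack,         // global output stack
                                  int* stack_head)    // global stack pointer
    {
        using BlockScan = cub::BlockScan<int, kBlockThreads>;
        __shared__ typename BlockScan::TempStorage temp;
        __shared__ int block_base;

        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        int n = (tid < num_items) ? counts[tid] : 0;

        int offset, block_total;
        BlockScan(temp).ExclusiveSum(n, offset, block_total);  // block-level prefix sum

        if (threadIdx.x == 0)
            block_base = atomicAdd(stack_head, block_total);   // one atomic per block
        __syncthreads();

        for (int k = 0; k < n; ++k)
            stack[block_base + offset + k] = tid;              // placeholder payload
    }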


> all of these concepts become much easier working with "offset from an array" rather than a pointer directly

This reminds me of this article [1] that summarizes very nicely several data structures that represent sparse tensors. I highly recommend it if you need to work with data in two dimensions (like a list of vectors) or more.

In particular, the Compressed Sparse Fiber is a kind of generalization of the "offset from an array" approach, in several dimensions. That can be handy when working with data in 3D for example.

[1] https://arxiv.org/abs/1804.10112


The AVX2 version is using an 8 KiB table...? Even the range check is inefficient. I'd bet that the AVX2 version could be 50% faster.


Why do the 255-bitmask? You're going to look it up in a table anyway. I'm pretty sure I've implemented the compact operation before in AVX2 (256-bit) without needing a massive table, and I don't think you need a gather either, but that was at a previous job. You can do a lot with PEXT / PDEP in combination with multiplies to make masks.


My Rust is not strong, but in the AVX-512 solution, I don't fully understand how they're incrementing the input by a whole AVX-512 word (16xu32) by only doing input = input.offset(1); ? I'd assume that will increment their input array by only 1 single u32.

With the approach used here, it also looks like you'll write some garbage into output after output_end, which isn't a problem unless output was sized exactly equal to the expected output and you attempted to write past the end of it via _mm512_mask_compressstoreu_epi32.


The code snippets appear to be full of mistakes.

E.g. filter_vec_avx2 doesn't declare the loop variable i and stores input elements into the output instead of their indices. Or from_u32x8 has a DataType instead of __m256i and [u32; __m256i] instead of [u32; 8].


Amazing, manually optimizing algorithms like this can indeed boost performance dramatically. I wonder whether Intel provides AVX-512-optimized libraries for basic algorithms like sorting and searching, or for matrix multiplication? Without those libraries, and with auto-vectorization still in research, we will have to hand-craft algorithms with intrinsics, which is time consuming.


Indeed a nice speedup!

As to libraries: I'm the main author of github.com/google/highway which provides 'portable intrinsics' so you only have to write your code once for all platforms.

It includes ready to use sorting and searching algorithms in hwy/contrib.


"Supported targets: scalar, S-SSE3, SSE4, AVX2, AVX-512, AVX3_DL (~Icelake, requires opt-in by defining HWY_WANT_AVX3_DL), NEON (ARMv7 and v8), SVE, SVE2, WASM SIMD, RISC-V V."

Wow, this looks very interesting. I noticed the release passed the 1.0 milestone. How did you unify all those intrinsics? I'm particularly interested in RVV 1.0.


:) We (several engineers) put together a table [1] of instructions that several architectures natively support. There was enough overlap that I thought a wrapper would make sense.

With a lot of the JPEG XL code written, when SVE and RVV were being introduced/discussed, I realized that pretty much the same code would work there, too: we just needed to replace a constexpr VecClass::kLanes with Lanes(), which is what Highway now does.

Beyond the initial set of efficient-everywhere ops, we've added some (such as AbsDiff) that help on say Arm without hurting other platforms, nor can user code do better itself. There are also a few (ReorderWidenMulAccumulate) that define a non-obvious relaxation of the interface which is more efficient for all platforms than holding to any one platform's interface.

For RVV, one remaining concern is the VSETVLI - the revised intrinsics now require an avl argument, which Highway creates in each function. It's not yet clear whether the compiler will be smart enough to (peephole-?)optimize away the duplicates.

1: https://github.com/google/highway/blob/master/g3doc/instruct...


There are already libraries that make it easier to do vectorization, which are equivalent to compilers that take generic instructions and convert them into SIMD equivalents (though it's somewhat more nuanced than this). The issue is not so much the compilers, but rather having a standard language for describing how to do the maths on vectors.

It's a lot like the difference between writing C or assembly. Even today, sometimes you need asm to make the code go real quick because what the compiler spits out isn't always optimal (but for the general case, it's quite good).


The thing is that Intel removed AVX512 in most CPUs. Only the server ones have or will have it.

https://www.tomshardware.com/news/intel-nukes-alder-lake-avx...


Removal from Alder Lake was indeed regrettable but AVX-512 is still included in several recent client (non-server) CPUs: Icelake, Rocket Lake, Tiger Lake.


At the cost of reduced frequency if I'm not mistaken.


The frequency scaling problems most prevalent in Skylake are significantly better as of e.g. Ice Lake. It's complex to summarize but basically those instructions request a higher "power license" for higher power delivery when used, broadly speaking. For single/dual core workloads on my laptop there is no frequency scaling loss IIRC, and even at full bore on all 4 cores I think the drop is from 3.6 to 3.5 nominal max clock. Ice Lake Client only has 1x FMA unit however, I don't know of any benchmarks of Ice Lake SP where there's 2x FMA units. You can reliably handle full datapath AVX-512 usage in client workloads these days, on Ice Lake or later, IMO. (Of course when you're on a laptop, the difference of 3.5GHz vs 3.6GHz on draining your battery doesn't feel too different...)

Wide-datapath designs will generally have some tradeoffs like needing to ramp up power delivery, so there will be things like initial instruction latency for wide-datapath instructions, etc. That's kind of inevitable; I suspect the same will be true of the new fancy "variable length" vector ISAs if the underlying implementation and vector usage is wide enough, too.

Also: you don't need to use wide 512-bit vectors with AVX-512! You can use the instruction set with 128- or 256-bit vectors just fine.
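
For example, here's a rough sketch of the article's filter idea on 256-bit vectors with AVX-512VL (mine, not the article's code; the less-than predicate and names are just for illustration):

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    // Keep every element less than `threshold`; returns the number of elements kept.
    // Requires AVX-512F + AVX-512VL, but only ever touches 256-bit registers.
    size_t filter_lt_256(const uint32_t* in, size_t n, uint32_t threshold, uint32_t* out)
    {
        uint32_t* out_ptr = out;
        const __m256i thresh = _mm256_set1_epi32((int)threshold);
        size_t i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256i v = _mm256_loadu_si256((const __m256i*)(in + i));
            __mmask8 m = _mm256_cmplt_epu32_mask(v, thresh);   // unsigned v < threshold
            _mm256_mask_compressstoreu_epi32(out_ptr, m, v);   // store only the selected lanes
            out_ptr += _mm_popcnt_u32(m);
        }
        for (; i < n; ++i)                                     // scalar tail
            if (in[i] < threshold)
                *out_ptr++ = in[i];
        return (size_t)(out_ptr - out);
    }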


The laptop chips have never had "downclocking problems" in the sense that Skylake-SP did. AFAIK they are actually some of the top performers for ps3 emulation/etc.

https://travisdowns.github.io/blog/2020/08/19/icl-avx512-fre...

Actually, supposedly Skylake-X/Xeon-W had much lower AVX downclocking than Skylake-SP too... InstLatX64 made a tweet at one point showing this was 10-20% for workstation vs 30-40% for server iirc. The tweet has been removed unfortunately.

Intel really throttled it down on server chips, for whatever reason. Probably didn't want datacenter chips to run the Unlimited Voltage that was necessary for full-clock dual-unit AVX-512 on 14nm.


It depends on the CPU type (bronze/silver < gold/platinum). My workstation saw 1.4-1.6x end to end application speedups, including throttling (even on multiple cores) for JPEG XL decoding and vqsort.

[ General observation, not directed at parent comment: ]

Frequency throttling, even on the most affected Skylakes, has always been a non-issue if you run say 1ms worth of continuous SIMD instructions. How could a 10-40% drop negate speedups from 2x vector width plus double the registers and a much more capable instruction set?

It is time we buried this myth :)


You can configure throttling in the bios if you have Skylake-X. You can set it to whatever you want. The caveat being it won't necessarily be stable (depending on voltage) nor will your cooling necessarily be able to handle it.

My 10980XE runs AVX2 at 4.2 GHz all core, and AVX512 at 4GHz.


You are mistaken.


Not entirely. With Skylake, there was a clock speed penalty for "heavy" 256b instructions and "light" 512b instructions, and a steep clock speed penalty for "heavy" 512b instructions. With Ice Lake, there is a very small single-core clock speed penalty for 512b instructions. There is no clock speed penalty after Ice Lake. (Which, for non-server CPUs, is currently a list containing one generation: Rocket Lake.)


Skylakes were not being discussed.


OK? That doesn't really change any of the facts I posted.


If you're using Julia and want something like this, you can try `LoopVectorization.vfilter`. It lowers to LLVM intrinsics that should do the right thing given AVX512, but performance is less than stellar without it IIRC. May be worth taking another look at. I'm not licking this cookie; I welcome anyone else wanting to get their hands dirty.


I do like how after all that talk of "idiomatic rust" we still end up with what is substantially assembly code with extra steps to get any real speed


Well, the idea is to use the specifics of the machine you are running your code on to solve a given problem. It's not supposed to be portable, although a god-like compiler might be able to implement those optimizations at some point.


Please refrain from calling interns compilers.


Are any of these instruction sets available on M1? If not, what's the alternative there?


No, AVX and AVX2 are amd64 things (Intel and AMD). The equivalent SIMD instruction set for arm64 (including the Apple M1) is called NEON. See https://developer.arm.com/documentation/ihi0073/latest/


There is SVE from ARM, which will replace NEON.


Thanks!


No, but an equivalent set of SIMD instructions called NEON is available. IMHO, having programmed both, the ARM ones seem to have less baggage and are better designed. Not sure about the performance of wider lane variants (512) on x86 though; they may be significantly faster.


Common x86 impls have 2 or 3 vector pipes (recent intel is 2x avx512 or 3x avx2); apple arm has 4, so the difference in throughput, though definitely there, is less than you might expect.

There is sve, which has variable-width vectors. Haven't played with it--that gets you your width, but there are obviously compromises there.

The obvious longstanding omission from neon is movemask. One nice thing they have, though, is a permute with four source operands (!) for a 64-byte lookup table.
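
For what it's worth, the usual movemask substitute on NEON is the narrowing-shift trick (rough sketch, illustrative name; it yields 4 bits per byte lane instead of 1, which is usually good enough):

    #include <arm_neon.h>
    #include <stdint.h>

    // byte_mask lanes are expected to be 0x00 or 0xFF (comparison results).
    // Returns a 64-bit value with one nibble (0x0 or 0xF) per byte lane.
    static inline uint64_t neon_movemask_nibbles(uint8x16_t byte_mask)
    {
        uint8x8_t narrowed = vshrn_n_u16(vreinterpretq_u16_u8(byte_mask), 4);
        return vget_lane_u64(vreinterpret_u64_u8(narrowed), 0);
        // e.g. __builtin_ctzll(result) / 4 gives the first set lane (when nonzero).
    }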


We have data from vqsort showing 3.2 GHz M1 to be half as fast as 3 GHz Skylake. I'd be interested in other direct comparisons of x86 vs NEON code :)

SVE will be interesting especially when vectors get wider. For the Arm V1, there are some unfortunate bottlenecks. See Table 3.41 instruction throughput in https://documentation-service.arm.com/static/6088343f85368c4...

For OP's application, the equivalents of compressstore aren't great. COMPACT is decent (1x256-bit/cycle) but CNTP occupies the bottleneck M0 pipeline.

Still, the V1 is probably fine for applications that do not often modify predicates.


I see. Sort is an interesting case because it has very tight dependencies (IOW, I expect there are many interesting workloads for which this result doesn't generalise). When you halve the vector width, you either double the length of the dependency chain or halve your effective work unit (which amounts to the same thing, along another axis--the performance impact of which is variable, but considering the m1 is very wide, it's probably not a great idea). I wonder if there's any room for a parallel sum scan, but my intuition is that, on a superscalar, at such small sizes, it's wasted work.

While I haven't looked closely at vqsort yet, I was looking at vxsort not long ago. I suggested to its author the possibility of handling multiple partitions in parallel--handling only one architectural vector per partition at a time--and he said that, while that seemed like a potentially promising approach, he was more immediately concerned with bandwidth. Considering the m1's great width, memory bandwidth, and cache size (on top of the small vector size), this seems like an especially promising approach there.


I agree this is one specific use case.

Decreasing the vector size does have some upside: it makes the O(n log^2 n) sorting network cheaper.

For M1, performance is not terrible, it seems mostly limited by the NEON instruction set (expensive to do masked stores and popcnt() comparison results).

How do you mean multiple partitions in parallel? One challenge is that subarrays vary in size. It could be interesting to do multiway partition (larger base for the logn, fewer passes over memory), but that seems hard to reconcile with the vectorized in-place approach. ips4o effectively does a 64 or 256-way partition, but that's also nontrivial to vectorize.


> makes the O(n log^2 n) sorting network cheaper

Good point.

> subarrays vary in size

Hmm. Haven't thought about this too much yet, but it seems to me that you'd probably still even win if you had to branch once you run out the common prefix to find a new subarray. Or maybe you try to find subarrays with similar sizes and mask out the last n stores (if your subarrays are too mismatched in general, then you hit the worst-case of quicksort _anyway_, so).

> ips4o effectively does a 64 or 256-way partition, but that's also nontrivial to vectorize

This is interesting. 64 and 256 are annoyingly large--mainly because you have to scatter to make any sense, and scatter-store is slow by comparison where it's even supported--but just 4 seems potentially feasible.


Agree that scatter is expensive. We did actually have hopes for 4-way partition. Even ignoring the difficulties with doing that in-place, SIMD partition still seems to require one compressstore per partition. A quick prototype turned out to be half as fast as 2-way partition, which negates the gains from doing half as many passes (this depends on the ratio of compute to bandwidth, though).


This page crashes mobile Chrome (Android), anyone else see that?


Can't reproduce on android 11 with either chrome or firefox.


I also couldn't open it with Chrome on Android.


Yes, I had to restart my phone.



