Hacker News
Sorting with SIMD (tweedegolf.nl)
172 points by lukastyrychtr on Dec 17, 2022 | hide | past | favorite | 63 comments



If you are exclusively moving numbers around in an array, SIMD sorting sounds great.

But when you want to sort actual objects, it's different: you have an array of objects, each of which has one member, a key, that you sort the objects on.

In order to create a sorted permutation of the list, you either need to rearrange the list of object pointers (fast), rearrange the memory of the objects (slow), or you simply output a list of array indexes that represents what the sort order should be (fast).

Code that doesn't physically move the object memory around creates indirection. Indirection makes any SIMD sorting depend on scatter-gather in order to get data in. Scatter-Gather causes random memory accesses which don't cache as well as sequential access.
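A sketch of that indirection in Rust (the article's language); `Item` and its fields are made up for illustration, and the point is that every comparison must gather a key through the index:

```rust
// Hypothetical object type: a 4-byte key plus a payload we never want
// to move during sorting.
struct Item {
    key: u32,
    payload: [u8; 64],
}

// Produce a sorted permutation without touching the objects themselves.
// Every comparison is an indirect load (index -> object -> key), which
// is exactly the gather pattern that hurts SIMD implementations.
fn argsort(items: &[Item]) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..items.len()).collect();
    idx.sort_by_key(|&i| items[i].key);
    idx
}

fn main() {
    let items: Vec<Item> = [3u32, 1, 2]
        .iter()
        .map(|&k| Item { key: k, payload: [0; 64] })
        .collect();
    // The objects stay put; only the index permutation is produced.
    assert_eq!(argsort(&items), vec![1, 2, 0]);
}
```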


It might be easier for SIMD to work by extracting the key value from each object into its own array.

Form an array with fixed element size: [ {index0 | key0}, {index1 | key1}, ...], where indices are the element index to the original array of objects and the keys are fixed size. The fat element {index | key} is moved as a whole unit to carry the original index along with the swap during sorting. SIMD can deal with fixed size units very well.
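A minimal sketch of that fat-element idea in Rust, assuming 32-bit keys and indices: pack the key into the high half of a u64 and the original index into the low half, sort the u64s as whole units, then read the indices back out.

```rust
// Pack {key | index} into one fixed-size unit. Sorting the packed
// values orders by key first (high bits) and breaks ties by original
// index (low bits), which also makes the sort stable.
fn pack(key: u32, index: u32) -> u64 {
    ((key as u64) << 32) | index as u64
}

// Returns the original indices in sorted-by-key order.
fn sorted_order(keys: &[u32]) -> Vec<u32> {
    let mut fat: Vec<u64> = keys
        .iter()
        .enumerate()
        .map(|(i, &k)| pack(k, i as u32))
        .collect();
    fat.sort_unstable(); // a SIMD sort would move these same fixed-size units
    fat.iter().map(|&f| (f & 0xFFFF_FFFF) as u32).collect()
}

fn main() {
    assert_eq!(sorted_order(&[30, 10, 20]), vec![1, 2, 0]);
    // Equal keys keep their original relative order (stability for free).
    assert_eq!(sorted_order(&[5, 5, 1]), vec![2, 0, 1]);
}
```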


That would be called a gather operation, which is itself slow, exactly as the OP comment said.


Yeah, I find the thing you generally want, but which many languages don't support well, is a struct-of-arrays or data-frame layout rather than an array of objects. Then you can do one of:

Sort one array and rearrange other arrays the same way

Compute the array of indices into another array such that array[index[i]] < array[index[i+1]], i.e. the sorted version of the array is then [array[i] for i in index]. If you have a vectorised index operation (e.g. APL, R, …), getting the sorted array is one line of code.

I think with simd you’d want to sort your array and simultaneously permute either a second array or the array of indices 1, 2, …, n so you can use that to do accesses to corresponding elements in your other associated columns.
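In Rust terms (names illustrative), that two-step version might look like this: sort one key column once, then gather every associated column through the resulting index vector.

```rust
// Compute the index vector such that keys[idx[i]] <= keys[idx[i + 1]].
fn sort_indices(keys: &[i32]) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..keys.len()).collect();
    idx.sort_by_key(|&i| keys[i]);
    idx
}

// The "vectorised index operation": gather a column through the indices.
fn gather<T: Copy>(col: &[T], idx: &[usize]) -> Vec<T> {
    idx.iter().map(|&i| col[i]).collect()
}

fn main() {
    // Two columns of a struct-of-arrays "data frame".
    let keys = [5, 1, 3];
    let names = ["e", "a", "c"];
    let idx = sort_indices(&keys);
    // One index vector reorders every column consistently.
    assert_eq!(gather(&keys, &idx), vec![1, 3, 5]);
    assert_eq!(gather(&names, &idx), vec!["a", "c", "e"]);
}
```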


Maybe Google's new "Rune" language (https://github.com/google/rune), which supports SoA, will become prevalent.


What's SoA?


Struct of Arrays


Thanks


Last I checked, Jai was going this direction (a type modifier that allowed one to do either AOS or SOA) but abandoned it (or at least, abandoned the particular syntax JB originally prototyped).


I think they didn't really abandon it, but were planning to make the metaprogramming facilities powerful enough to implement it in plain code instead of creating a separate language feature. (Zig has already taken this direction: https://zig.news/kristoff/struct-of-arrays-soa-in-zig-easy-i...)


I've been playing with Intel's "Implicitly Parallel SIMD Compiler" (ISPC) [0].

The ISPC User Guide was easy for me to work through, and there's a section on "Structure of Arrays" (SOA) programming [1].

This approach to vector-parallel programming was new to me, but got me to try some C code for the kernel transformation. My simple applications are probably fast enough in Python, but I thought it quite worthwhile to look at tiny kernels and the code the compiler generates.

- [0] https://ispc.github.io/index.html

- [1] https://ispc.github.io/ispc.html#structure-of-array-types


Yeah. Whether this is faster than normal sort is a big "depends" based on how the sorted data is accessed.

SIMD sort with indirection would be nice if you only need to access the top 25 items, or 1 in 255 items in the array, or something like that. Faster sorting helps pay off the cost of cache misses when accessing the other fields.

If you need to sort and then process every item with associated data, I wouldn't be surprised if quicksort on the structs themselves is faster. The cache misses would add up.


Would using a structure-of-arrays (instead of array-of-structures) make this less of a problem?


Okay, let's suppose we are defining the problem like this:

• ObjectArray, an Array of objects, will be treated as immutable

• IndexArray, an array of ints, indexes into the first array. Initialize to [0...N). After sort finishes, this contains an ordering into the first array.

Now we want to run the sort.

Every time we do a comparison, we read from IndexArray, and use that index into ObjectArray. Every time we do a swap, we only mutate IndexArray.

There is guaranteed to be at least one level of indirection, because you need to read from IndexArray to determine which object is actually there. So you get a scatter-gather requirement here for SIMD code.

Now for the different ways to represent the array:

• Plain pointers to objects (like C# reference types)

• One object after the other in a contiguous memory block (Array of structures)

• All Keys contiguous in memory (Structure of arrays)

If you go with Plain Pointers to objects (reference types in C#), you get an extra read to some object in memory that was allocated elsewhere. Can be a cache miss.

If you go with Big Memory Block with objects inside (Array of structures), you avoid the read of the plain pointer value. You calculate the address of a key with a Multiply (by size of object) and an Add (base address). The data that's loaded into the processor cache may be unnecessary for sorting.

If you go with All Keys Contiguous in memory (Structure of arrays), you calculate the address of a key with a Bitshift (by the size of the key) and an Add (base address). Keeping the keys contiguous also helps with caching, since you load only keys into the cache at that time.
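The three representations can be written out as Rust types (names and sizes illustrative) to make the per-comparison key-access cost visible:

```rust
// An object with a 4-byte key and a payload that pads it to 64 bytes.
struct Obj {
    key: u32,
    data: [u8; 60],
}

// 1. Plain pointers: reading a key chases a pointer into a separate
//    heap allocation, which may miss the cache.
fn key_boxed(objs: &[Box<Obj>], i: usize) -> u32 {
    objs[i].key
}

// 2. Array of structures: address = base + i * size_of::<Obj>(); the
//    60 payload bytes ride along into the cache uselessly.
fn key_aos(objs: &[Obj], i: usize) -> u32 {
    objs[i].key
}

// 3. Structure of arrays: address = base + (i << 2); a 64-byte cache
//    line holds 16 keys and nothing else.
fn key_soa(keys: &[u32], i: usize) -> u32 {
    keys[i]
}

fn main() {
    let objs: Vec<Obj> = (0..4u32).map(|k| Obj { key: 10 - k, data: [0; 60] }).collect();
    let boxed: Vec<Box<Obj>> = (0..4u32).map(|k| Box::new(Obj { key: 10 - k, data: [0; 60] })).collect();
    let keys: Vec<u32> = objs.iter().map(|o| o.key).collect();
    // All three layouts yield the same key; they differ only in cost.
    assert_eq!(key_boxed(&boxed, 2), 8);
    assert_eq!(key_aos(&objs, 2), 8);
    assert_eq!(key_soa(&keys, 2), 8);
}
```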


The original (AFAICT) work on SIMD quicksort, also mentioned in the Google post, also implemented a pointer sort by loading the pointed-to key using gather instructions, and the method can be used for an array of structs. https://github.com/vkrasnov/avx_qsort/blob/master/qsort_AVX2...


I wonder why no one mentions bitonic sort? If you want to do anything in SIMD you better avoid branching as much as possible... and ideally altogether. Here is an implementation I co-authored some 10 years ago: https://github.com/zerovm/zerovm-samples/blob/master/disort/...

Sorting networks, which were already mentioned, seem similar but a bit too abstract.

My code above doesn't contain values, but those are easy to add, I think. Of course, it is better to permute fixed-size pointers/offsets rather than the entire blobs, which can be of variable size and would complicate everything beyond what is feasible for SIMD.
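For reference, a scalar sketch of a bitonic sorting network in Rust: every compare-exchange is a min/max with no branch on the data being sorted, which is what lets the network map onto SIMD min/max instructions (this toy version handles one element at a time; a real SIMD version would process whole vectors per step).

```rust
// Bitonic sort for power-of-two lengths. The i & k test picks the
// direction of each sub-sequence; the compare-exchange itself is pure
// min/max, with no data-dependent branch.
fn bitonic_sort(a: &mut [u32]) {
    let n = a.len();
    assert!(n.is_power_of_two(), "bitonic networks want power-of-two sizes");
    let mut k = 2;
    while k <= n {
        let mut j = k / 2;
        while j > 0 {
            for i in 0..n {
                let l = i ^ j;
                if l > i {
                    let (lo, hi) = (a[i].min(a[l]), a[i].max(a[l]));
                    if i & k == 0 {
                        a[i] = lo; // ascending run
                        a[l] = hi;
                    } else {
                        a[i] = hi; // descending run
                        a[l] = lo;
                    }
                }
            }
            j /= 2;
        }
        k *= 2;
    }
}

fn main() {
    let mut v = [5u32, 3, 8, 1, 9, 2, 7, 4];
    bitonic_sort(&mut v);
    assert_eq!(v, [1, 2, 3, 4, 5, 7, 8, 9]);
}
```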


SIMD is a fantastic tool. My first try with AVX2 (256-bit), building an "overlay with transparency" for two YUVA420P frames, yielded speed better than 1 pixel per CPU cycle! And I didn't even try optimising it all that much.


There is an interesting algorithm for sorting that is highly amenable to parallelization and which I thought this article was going to be about: https://en.wikipedia.org/wiki/Sorting_network


Thought about that as well. A few years ago I implemented a SIMD sort using those and mergesort.


Has anyone been working on using SIMD on sorting networks? For the 16-input network, in each stage of the first half of the pipeline, the comparisons have no dependencies on each other and seem to be a good candidate for SIMD.


Yes indeed. IIRC std::sort's sorting network is only 3-7 elements wide. With SIMD we can handle 256. The trick is to minimize shuffling by reshaping 1D input to 2D and first sorting the columns. Previously one would then transpose (expensive) and then again sort columns. We instead fuse those two steps, see https://arxiv.org/abs/2205.05982.


I played a bit with sorting 8 floats in 2 SSE registers with a sorting network. I had a bit of trouble shuffling the registers so the values would be in the right places, so I ended up brute-forcing it:

https://github.com/0xf00ff00f/short-simd-sorter

I don't know if you could use the same idea to sort 16 values in 2 AVX registers. There's probably a better way.


That's pretty cool.

The compare_and_swap ops of the sorting network can be done in SIMD using _mm_min_epi32, _mm_max_epi32, and _mm_blend_epi32. For the 8-value network, they can operate on 4 pairs of the values and 8 pairs for 16-value network, as long as the pairs have no dependency with each other.
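A sketch of that idea in Rust via `std::arch`, using the SSE4.1 min/max intrinsics with a scalar fallback; the blend is only needed when the pairs are interleaved within one register, so this minimal version skips it.

```rust
// SSE4.1 path: four mins and four maxes, one instruction each.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse4.1")]
unsafe fn min_max_sse41(a: [i32; 4], b: [i32; 4]) -> ([i32; 4], [i32; 4]) {
    use std::arch::x86_64::*;
    let va = _mm_loadu_si128(a.as_ptr() as *const __m128i);
    let vb = _mm_loadu_si128(b.as_ptr() as *const __m128i);
    let lo = _mm_min_epi32(va, vb);
    let hi = _mm_max_epi32(va, vb);
    let (mut out_lo, mut out_hi) = ([0i32; 4], [0i32; 4]);
    _mm_storeu_si128(out_lo.as_mut_ptr() as *mut __m128i, lo);
    _mm_storeu_si128(out_hi.as_mut_ptr() as *mut __m128i, hi);
    (out_lo, out_hi)
}

// One compare_and_swap stage on four independent pairs, with a scalar
// fallback so the sketch runs anywhere.
fn compare_exchange_4(a: [i32; 4], b: [i32; 4]) -> ([i32; 4], [i32; 4]) {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("sse4.1") {
            return unsafe { min_max_sse41(a, b) };
        }
    }
    let mut lo = [0i32; 4];
    let mut hi = [0i32; 4];
    for i in 0..4 {
        lo[i] = a[i].min(b[i]);
        hi[i] = a[i].max(b[i]);
    }
    (lo, hi)
}

fn main() {
    let (lo, hi) = compare_exchange_4([5, 1, 7, 2], [3, 4, 6, 8]);
    assert_eq!(lo, [3, 1, 6, 2]);
    assert_eq!(hi, [5, 4, 7, 8]);
}
```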


I recently tried to do that as well, but failed. Specifically, I have implemented AA sort [1] but for my use case the performance was about the same as std::sort in C++, for the substantial code complexity cost. I reverted to std::sort. The code is on github [2]

Still, in that particular project the vectors being sorted are relatively small, typically under 100 KB, so I have only implemented their "inner" algorithm, which works on a single CPU core. The complete AA sort algorithm was apparently designed for large vectors, and uses both SIMD and multithreading. It might still be useful for very long vectors.

[1] https://ieeexplore.ieee.org/document/4336211

[2] https://github.com/Const-me/fTetWild/blob/master/MeshRepair/...


We also experimented with 4x4 networks but it's very helpful to use 256 or 512-bit SIMD. You might give our vqsort a try, it's quite a bit faster than std::sort: https://github.com/google/highway/blob/master/hwy/contrib/so...


This is a good explanation. Writing the masks in big-endian somewhat obscures things.

As an aside, while recent developments in quicksort are quite good, it seems like MSB radix sort performs comparably to (or slightly better than!) these algorithms on random inputs.


Random inputs for which you don’t know the bounds?


Yes. If the input is bounded in a narrow range compared to the length of the input, it seems like it should actually be a bit slower than normal, since the first few passes may do nothing and the later passes will need to be run on more of the input than normal. We're using it for arrays of doubles that have fairly small exponents though, and this causes us to have lots of empty buckets in the first pass, and it performs well enough.
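For concreteness, a minimal recursive MSB radix sort on u32 keys (a simplified sketch, not the linked implementations): each pass buckets by one byte starting from the top, so a narrow-range input leaves most first-pass buckets empty, as described above.

```rust
// Minimal recursive MSB radix sort for u32 keys: bucket by the most
// significant byte, then recurse into each bucket on the next byte.
fn msb_radix_sort(v: &mut [u32]) {
    sort_by_byte(v, 3);
}

fn sort_by_byte(v: &mut [u32], byte: u32) {
    // Small buckets fall back to a comparison sort, as real
    // implementations do.
    if v.len() <= 32 {
        v.sort_unstable();
        return;
    }
    let shift = byte * 8;
    let mut counts = [0usize; 256];
    for &x in v.iter() {
        counts[((x >> shift) & 0xFF) as usize] += 1;
    }
    // Prefix sums give each bucket's start offset.
    let mut starts = [0usize; 256];
    let mut sum = 0;
    for i in 0..256 {
        starts[i] = sum;
        sum += counts[i];
    }
    // Out-of-place scatter pass (simplest correct variant; in-place
    // variants like American Flag sort avoid the scratch buffer).
    let mut scratch = vec![0u32; v.len()];
    let mut pos = starts;
    for &x in v.iter() {
        let b = ((x >> shift) & 0xFF) as usize;
        scratch[pos[b]] = x;
        pos[b] += 1;
    }
    v.copy_from_slice(&scratch);
    if byte == 0 {
        return;
    }
    for i in 0..256 {
        sort_by_byte(&mut v[starts[i]..starts[i] + counts[i]], byte - 1);
    }
}

fn main() {
    // Pseudo-random input; result must match the standard sort.
    let mut v: Vec<u32> = (0..1000u32).map(|i| i.wrapping_mul(2654435761) ^ (i << 7)).collect();
    let mut expected = v.clone();
    expected.sort_unstable();
    msb_radix_sort(&mut v);
    assert_eq!(v, expected);
}
```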


Cool! I should probably not have commented, I haven’t really kept up with advances in sorting, haha. I assumed it was basically done to death 20 years ago. This will spur some reading I think! Thanks.


Please do comment in situations like this.

It's tough to find a good overview of this topic that includes recent work or that is aimed at empirical performance. That's part of why TFA is great!

A version of MSB radix sort for sorting ints, doubles, or floats, or sorting structs by a single key of those types is here[0] and a version for sorting strings is here[1]. These are based on the MSB radix sort from this repository[2] which is associated with a technical report about parallel string sorts. I was only interested in sequential sorts, but this repository turned out to be a great resource anyway.

[0]: https://github.com/alichraghi/zort/blob/main/src/radix.zig

[1]: https://github.com/dendibakh/perf-challenge6/blob/Solution_R...

[2]: https://github.com/bingmann/parallel-string-sorting


This is a well-written SIMD example. I find these are the hardest topics to explain clearly because of how much is happening. Anyone got other nice links? I've been doing a lot of vector processing, so I need to learn!


Someone previously shared this YouTube playlist on HN, and I have found it to be very helpful [1].

[1] https://www.youtube.com/playlist?list=PLYCMvilhmuPEM8DUvY6Wg...


You probably know this, but Daniel Lemire's blog often has simple examples using avx-512 and/or branchless algorithms:

https://news.ycombinator.com/from?site=lemire.me


128-bit SIMD on x86 is called SSE, not AVX; AVX is 256-bit. SSE (first introduced with the Pentium III) had many extensions that added instructions: SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, and some others.


For most 128-bit SSE instructions, there are equivalent 128-bit AVX instructions, which differ from them by allowing 3 register addresses instead of 2 register addresses, and by not causing execution stalls when they are mixed with 256-bit AVX instructions.

For most AVX instructions, you may choose between 128-bit and 256-bit instructions, which is especially useful on those CPUs where 256-bit instructions cause down-clocking.

There are also 128-bit AVX-512 instructions and 256-bit AVX-512 instructions.

So the same SIMD algorithm may be implemented for Intel CPUs using either 128-bit SSE instructions or 128-bit AVX instructions or 128-bit AVX-512 instructions or 256-bit AVX instructions or 256-bit AVX-512 instructions or 512-bit AVX-512 instructions.


That's not entirely true. AVX from the beginning (introduced with Sandy Bridge), not just AVX2 (introduced with Haswell), contains 128-bit instructions as well, and these are what's being used in the article (e.g. vpermilps).


Nice article, but there are many errors or typos that make it hard to follow (is it a typo, or do I just not understand that part?). For example, the use of 'unsafe': why are functions that don't take pointers declared unsafe? Also, taking a pointer to an array shouldn't need unsafe. Is 4 > 8? There's a missing closing ']'. And those are just the ones I see.

I haven't tried compiling the code examples, but I don't think they compile.


When would you use SIMD versus a GPU (e.g. a Vulkan compute shader)? Is it that it's easier to write for the CPU, but you bring in a GPU shader when doing massively parallel ops rather than just a few? I've skimmed a Rust lib that uses CPU SIMD (glam), and it used a different syntax from normal, so I'm not positive it would be easier if you're familiar with the GPU process.


It depends on how much data there is, and where and when you'd like it to be sent. Most discrete GPUs cannot access CPU memory directly, so you need to make a copy across the PCIe bus, which can be slow.

If you have small chunks of data (mining), or your destination is the screen (video game), it might make sense to use the GPU. If you need high-throughput, low latency, or your destination is something like a sound card (DAW), then SIMD might be a better choice.

Fast integrated GPUs like Apple's allow for directly accessing the main memory without copy, making the GPU more viable for general purposes.


> Fast integrated GPUs like Apple's allow for directly accessing the main memory without copy, making the GPU more viable for general purposes.

My understanding (and I would be very happy for any clarifications/corrections!) is that you must use Metal Buffers (MTLBuffer) with the Apple Silicon GPU. If your data isn't already in a Metal Buffer (why would it be?), you have to do a copy into a buffer (but it will be extremely fast). Metal Buffers use raw bytes, so they work well with C/C++ arrays, but if you are using something like a Swift array, that is not such an easy pairing to use - you can only get a Swift UnsafeMutableRawPointer to the data.

Further complicating things is that for the best GPU compute performance on Apple Silicon, you want to use Metal Buffers that are private to the GPU, forcing you to use a "blit" operation to copy data to/from CPU-visible memory.

I've been exploring general purpose computing with both CUDA and Apple Silicon and am having a lot better luck using SIMD intrinsics instead. While the GPU compute is incredibly fast, there is just so much time spent synchronizing data. There really seems to be some sweet spot where the GPU wins out, but in a lot of general purpose uses you won't ever get there. But I would really like to be shown that I'm wrong!!!

PS - the very basic ARM Neon stuff that the M1 supports is insanely fast and gets even better when you use multiple threads.


> My understanding (and I would be very happy for any clarifications/corrections!) is that you must use Metal Buffers (MTLBuffer) with the Apple Silicon GPU. If your data isn't already in a Metal Buffer (why would it be?), you have to do a copy into a buffer (but it will be extremely fast).

You have to use Metal Buffers, but you don't have to copy, as long as the data is properly page-aligned [0]. The YUV data from the cameras is, for example.

> PS - the very basic ARM Neon stuff that the M1 supports is insanely fast and gets even better when you use multiple threads.

It is, these chips have so much to give!

[0]: https://developer.apple.com/documentation/metal/mtldevice/14...


I have just been nerd sniped.

EDIT - any hints on how to get a plain-old Swift array to be allocated in a conforming alignment (4096 bytes on Apple Silicon)?


It's not supported, unfortunately. But you can call posix_memalign and work with it as an UnsafeBufferPointer with a similar API.


> I've been exploring general purpose computing with both CUDA and Apple Silicon and am having a lot better luck using SIMD intrinsics instead

What sort of programs are you trying this with?


I am writing some stuff that will hopefully be published next year, so I don't want to get into it too much now.

But it basically started with this sequence of text from "Is Parallel Programming Hard, And, If So, What Can You Do About It?" [0]:

> Parallel programming has earned a reputation as one of the most difficult areas a hacker can tackle.

> However, new technologies that are difficult to use at introduction invariably become easier over time.

> Therefore, if you wish to argue that parallel programming will remain as difficult as it is currently perceived by many to be, it is you who bears the burden of proof

We are now in an era in which it is nearly impossible for the average consumer to buy a computing product that doesn't have multiple cores, SIMD, and a GPGPU. So I felt that it was time to explore how to do it with everyday basic general computing tasks. I was starting with nearly zero experience and writing up what I've been learning :)

By the way, even the Raspberry Pi 4 does very nicely with ARM Neon, though it doesn't improve with multiple threads concurrently executing SIMD code like Apple Silicon does!

[0] https://mirrors.edge.kernel.org/pub/linux/kernel/people/paul...


That makes sense; thanks! And for passing to GPU, you need to serialize the data as byte arrays with specific alignment requirements that are not intuitive.


You use SIMD whenever you want to operate directly on system RAM and want to mix scalar and vector instructions.

You use GPUs when you have a big batch of work that you can transfer once into VRAM and only transfer back the result.


Most cutting-edge OLAP engines use SIMD extensively, though how, where, and how much is always left to the imagination (confirmed with Snowflake and Databricks). Practically speaking, I've sometimes been amazed at how these engines are almost an order of magnitude faster for some computations than similar computations I have done in C and Java, where I assumed I knew the best possible performance for a typical CPU.


There are some implementation details about Databricks’ Photon engine in this paper: https://cs.stanford.edu/~matei/papers/2022/sigmod_photon.pdf


Is there a particular reason to use direct intrinsics over portable simd?


What is portable simd?


Since the example was using Rust, I'm guessing the std::simd module (https://doc.rust-lang.org/std/simd/index.html), which is currently unstable but would be cool to see put to use here.


This is why with redact.photo I randomly shuffle the pixels before blurring so it looks like you can reverse engineer it but you’ll just get scrambled pixels.


The threads are right next to each other, might you have meant to reply here? https://news.ycombinator.com/item?id=34031568 "Unredacter: Never use pixelation [for] redaction"


I thought that blurring a photo would set each pixel to the average color value of all its neighbors. Since information is lost, how could you reverse engineer a blurred photo?


If there are a limited number of input permutations then you can test all of them to see which is plausible.
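A toy illustration of that brute-force idea (not how any real de-pixelation tool works): "redact" a 2x2 block of black/white pixels by averaging it, then enumerate all 16 possible blocks and keep the ones consistent with the observed average.

```rust
// Enumerate every 2x2 black/white block whose integer average equals
// the observed pixelated value. With so few permutations, exhaustive
// search trivially recovers the plausible originals.
fn candidates(observed_avg: u32) -> Vec<[u32; 4]> {
    let mut out = Vec::new();
    for bits in 0..16u32 {
        let block = [
            (bits & 1) * 255,
            ((bits >> 1) & 1) * 255,
            ((bits >> 2) & 1) * 255,
            ((bits >> 3) & 1) * 255,
        ];
        if block.iter().sum::<u32>() / 4 == observed_avg {
            out.push(block);
        }
    }
    out
}

fn main() {
    // An all-black or all-white block is recovered uniquely...
    assert_eq!(candidates(0), vec![[0, 0, 0, 0]]);
    assert_eq!(candidates(255), vec![[255, 255, 255, 255]]);
    // ...and intermediate averages still narrow the field to a few options.
    assert_eq!(candidates(63).len(), 4);
}
```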


Hah.. but by shuffling you preserve the number of white and black pixels, which may make it possible to reverse ;)


>You should probably not use inline assembly in production code

What are the alternatives here? Write the assembly in a separate file and provide a FFI?


Popular compilers support popular SIMD architectures through “intrinsic” functions. They look and act like regular functions, but they are built in to the compiler and usually compile to a single specific assembly instruction. In the article, _mm_set_epi32 is an intrinsic function that compiles to the instruction of the same name.

This is a sharp contrast to inline assembly, into which the compiler has practically zero visibility. Inline assembly can't be scheduled alongside other work by the compiler. And the compiler has to fall back on the super-conservative assumption that the inline assembly might have done god-knows-what behind its back.

AFAICT, the last holdouts for hand-written assembly are people working on media codecs. Even AAA game engines use intrinsic functions rarely and assembly nearly never.


Isn’t the reason they had to use inline assembly there because the compiler they’re using doesn’t have that particular instruction bound as an intrinsic?

What do you do in that case? I’m genuinely curious as it’s something I’ve run up against: the vector extensions for the LX7 processor in the ESP32-S3 don’t have intrinsics for them.


There are intrinsics for a wide range of ARM and PowerPC SIMD instructions, a huge range of Intel SIMD instructions and several useful instructions like ByteSwap or FindFirstSetBit on several architectures.

But, there is not an instrinsic for every instruction nor for useful instructions on every architecture. In those cases, you might be lucky to have the compiler recognize very specific patterns in C (compilers are great at recognizing C implementations of byteswap, for example). But, otherwise you’ll have to write inline assembly if you want to utilize those features.


You continue doing what you are doing.

Intrinsics in many languages are just a file full of inline ASM somewhere.


There are SIMD abstraction libraries floating around. And many so-called "Math" libraries will use SIMD instructions to speed things up, I believe. So the work is to cast the problem to the language of the library(ies) and do some profiling.



