That linked quicksort paper is pretty cool: https://arxiv.org/abs/2205.05982 - for sorting floats/integers the speedup against std::qsort is around 10x.
Considering climate change more focus should be put on making fast and efficient software. Every watt not wasted on bad algorithms/code is good.
It is a nice paper, but for practical applications on fixed-size numbers up to 64 bits, as well as for tuples of fixed-size numbers, and for partial sorting, it does not beat radix sort (on moderately sized arrays).
I have tested it in ClickHouse but ended up with this:
PS. It's strange that Google's paper dismisses djbsort: https://sorting.cr.yp.to/ - also in the class of sorting networks, but introduced a few years ago.
I can't tell from the header file, but does your MSB radix sort use threading? I have a pure C implementation that speeds up radix sort by about 5-6x on my M1 Pro using all 10 cores. I'm struggling to find a way to use SIMD intrinsics to improve upon it, but will be working on Metal and CUDA versions (for my x86 system) to see if that helps.
FYI - all I'm doing is changing the algorithm attributed to H. H. Seward in TAOCP 5.2.5 to be MSB-first, counting the number of elements in each bucket and allocating a roughly equal number of elements (in units of whole buckets) to each thread, which then recurse in the normal manner as described.
Defining "not wasted" is carrying a lot of work there. Yes, we should make things as efficient as we can. And we shouldn't reach for slow solutions, if we can avoid it. But... that is, essentially, the famous Knuth quote. Right?
> "Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%."
Also, that quote dates back to the 1970s and was in reference to loop unrolling or hand-crafting assembly. That is quite different from not knowing what you are doing and writing O(n^2) algorithms for O(n) or O(log n) problems due to a lack of understanding of the problem, which is the problem we have at the moment (not saying I'm above that; been there, done that ;).
If Google can speed up std::qsort tenfold using SIMD instructions (and likely other optimizations), we should add this code everywhere, imho.
The referenced presentation is an unusually good talk on optimization in DBMS. I recommend it to anyone interested in the topic. Among other tricks ClickHouse uses adaptive optimization to switch approaches during query execution. [0] It's interesting to contrast this to the costs of upfront optimization.
Nice slides! Of course, I should have asked whether it's on the Pareto frontier. By the way, the author of pdqsort has a new sorting algorithm: https://www.youtube.com/watch?v=2y3IK1l6PI4
I’m not convinced that’ll bring about any significant change. Any power savings from switching from x86 to a RISC ISA come from simplifying the instruction decoder, which seems to be worth about 15-20% if we compare the Ampere Altra to a comparable AMD chip. That’s not an order of magnitude.
On the other hand, on the order of 80% of a chip’s power is spent on OOO execution. If you want the order of magnitude improvement in power efficiency, you need to dump superscalar/OOO in favor of smart compilers and VLIW. Cheap DSPs have been doing it for years, but compilers aren’t good enough yet for general purpose processing.
Agree that OoO is the big cost. But we can also mitigate that without VLIW: SIMD/vector reduces the instruction count by ~5x, and energy by a similar factor.
And a portable API such as Highway also helps us move the same code from x86 to Arm or RISC-V with just a recompile :D
It's not clear that even a super smart compiler can do this. The best schedule depends on the latency of instructions. This is a problem because we can't know statically whether a particular memory load is in L1/L2/L3/DRAM/etc., as this can vary for different executions of the same load instruction.
According to [1], 88% of the speedup given by OOO processors is due to speculation, and reordering in the case of cache misses accounted for only around 10% of the speedup. If OOO in general gives around a 50% speedup compared to in-order designs, reordering in the face of cache misses gives only around a 5% speedup. If you use a good static schedule with speculation, you'll get the bulk of the speedup. The rest can be recovered by increasing the clock rate by 5%, since you'll have gotten rid of so much silicon.
If you actually look at x86 microcode much, you start to realize that the backend is actually very RISC-like. To me, the distinction between CISC and RISC is more about frontend instruction selection.
Looks cool. Are you familiar with Halide? In terms of target support or implementation complexity, how would you compare using Highway vs using Halide for similar projects? (I work at Google and use Halide.)
Thanks! Yes indeed. Halide has good support for stencils and convolutions, especially taking care of the border handling, which is a pain with explicit SIMD if you are unable to 'define that away' by padding etc. Boilerplate is a bit more (~100 lines vs ~50 for Highway) but that's probably worthwhile.
Target support is good, though AFAIK non-constexpr-size vectors (SVE/RVV) are not supported.
The biggest concern I'd have is predictability. A thin wrapper over intrinsics makes it easy to experiment with codegen and see what's going on. We've seen some compiler bugs, but they are usually narrow in scope and relatively easy to reproduce.
By contrast, Halide involves extensive compiler transforms. When that breaks, it's much harder to move forward. There is also a risk of performance cliffs because the language is very flexible and AFAIK lacks 'this is going to be slow' guardrails.
It's great for prototyping and image processing, which is probably what you're using it for? But I also would not expect peak performance; at one point I reimplemented parts of RAISR with explicit SIMD and saw a surprisingly large speedup.
I've been having trouble getting good SIMD performance from WASM. Are there any benchmarks posted anywhere comparing Highway's performance on various native targets vs WASM?
Thanks!
> Do you think it would be possible to port this to Rust?
Yes, I think so. The main tricks are redefining macros based on the current compilation target, and compiling the same code multiple times. I believe Rust can do both, but have only surface-level familiarity.
One stumbling block is finding the Rust equivalent of C++'s #pragma target, but I think Rust has that too.
There's been other interest in a Rust port, feel free to open an issue to perhaps start a discussion.