Efficient and performance-portable vector software (github.com/google)
98 points by signa11 on Jan 11, 2023 | hide | past | favorite | 30 comments



That linked quicksort paper is pretty cool: https://arxiv.org/abs/2205.05982 - for sorting floats/integers the speedup against std::qsort is around 10x.

Considering climate change, more focus should be put on making fast and efficient software. Every watt not wasted on bad algorithms/code is good.


It is a nice paper, but for practical applications on fixed-size numbers up to 64 bits, as well as for tuples of fixed-size numbers and for partial sorting, it does not beat Radix Sort (on moderate-sized arrays).

I have tested it in ClickHouse, but ended up with this:

https://github.com/ClickHouse/ClickHouse/blob/master/src/Com...

PS. It's strange that Google's paper dismisses djbsort: https://sorting.cr.yp.to/ - it is also in the class of sorting networks, and was introduced a few years earlier.


I can't tell from the header file, but does your MSB radix sort use threading? I have a pure C implementation that speeds up radix sort by about 5-6x on my M1 Pro using all 10 cores. I'm struggling to find a way to use SIMD intrinsics to improve upon it, but will be working on Metal and CUDA versions (for my x86 system) to see if that helps.

FYI - all I'm doing is changing the algorithm attributed to H. H. Seward in TAOCP 5.2.5 to be MSB-first: count the number of elements in each bucket, allocate an equal(ish) number of elements (via entire buckets) to each thread, and let each thread recurse in the normal manner as described.
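For anyone curious, the single-threaded core of that MSB-first counting scheme looks roughly like this (a sketch with invented names, not the commenter's actual code; it uses a temporary buffer rather than in-place cycling for simplicity):

```cpp
// Sketch of MSB-first radix sort: count elements per top-byte bucket,
// scatter into bucket positions, then recurse on each bucket with the
// next byte. A threaded variant would hand groups of buckets (balanced
// by the counts) to worker threads at the recursion step.
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

void MsbRadixSort(std::vector<uint32_t>& v, size_t lo, size_t hi, int shift) {
  if (hi - lo < 2 || shift < 0) return;
  // Counting pass: how many elements fall into each of the 256 buckets.
  size_t count[256] = {0};
  for (size_t i = lo; i < hi; ++i) ++count[(v[i] >> shift) & 0xFF];
  // Prefix sums give each bucket's starting offset within [lo, hi).
  size_t start[257];
  start[0] = lo;
  for (int b = 0; b < 256; ++b) start[b + 1] = start[b] + count[b];
  // Stable scatter via a temporary copy of the range.
  std::vector<uint32_t> tmp(v.begin() + lo, v.begin() + hi);
  size_t next[256];
  for (int b = 0; b < 256; ++b) next[b] = start[b];
  for (uint32_t x : tmp) v[next[(x >> shift) & 0xFF]++] = x;
  // Recurse per bucket on the next-lower byte.
  for (int b = 0; b < 256; ++b)
    MsbRadixSort(v, start[b], start[b + 1], shift - 8);
}
```

Called with `shift = 24` for 32-bit keys; each level of recursion consumes one byte.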


Defining "not wasted" is doing a lot of work there. Yes, we should make things as efficient as we can. And we shouldn't reach for slow solutions if we can avoid it. But... that is, essentially, the famous Knuth quote. Right?


The full quote is:

> "Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%."

Also, this was written back in 1974, and it was in reference to micro-optimizations like loop unrolling or hand-crafting assembly. That is quite different from not knowing what you are doing and writing O(n^2) code for O(n) or O(log(n)) problems - due to a lack of understanding of the problem - which is the problem we have at the moment (not saying I'm above that, been there done that ;).
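A toy example of the trap (hypothetical, just to illustrate): the same "any duplicates?" question answered two ways.

```cpp
// Illustrative only: an accidentally-quadratic solution vs. a linear one.
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// O(n^2): compares every pair; fine for tiny inputs, painful at scale.
bool HasDuplicateQuadratic(const std::vector<int>& v) {
  for (size_t i = 0; i < v.size(); ++i)
    for (size_t j = i + 1; j < v.size(); ++j)
      if (v[i] == v[j]) return true;
  return false;
}

// O(n) expected: a hash set remembers what we have already seen.
bool HasDuplicateLinear(const std::vector<int>& v) {
  std::unordered_set<int> seen;
  for (int x : v)
    if (!seen.insert(x).second) return true;  // insert failed => duplicate
  return false;
}
```

Both are correct; only the second survives contact with a large input.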

If Google can speed up std::qsort tenfold using SIMD instructions and likely other tricks, we should adopt this code everywhere imho.


You mean the widely misrepresented one?


I do, indeed, mean that one. :D

If it looks like I misrepresented it again, please correct me!


Great paper. Is that the fastest sort right now?


There is no such thing as the fastest sorting algorithm.

Sorting is a landscape of hundreds of algorithms, with various details and assumptions.

Even the definition of the sorting task can be different - depending on what, how, and why you are going to sort.

I've covered it in my presentation here: https://presentations.clickhouse.com/bdtc_2019/#20


The referenced presentation is an unusually good talk on optimization in DBMS. I recommend it to anyone interested in the topic. Among other tricks ClickHouse uses adaptive optimization to switch approaches during query execution. [0] It's interesting to contrast this to the costs of upfront optimization.

[0] https://presentations.clickhouse.com/bdtc_2019/#34


Nice slides! Of course, I should have asked whether it's on the Pareto frontier. By the way, the author of pdqsort has a new sorting algorithm: https://www.youtube.com/watch?v=2y3IK1l6PI4


If climate change is a focus, we collectively should stop buying x86 chips and focus on RISC-V and ARM chips.


I’m not convinced that’ll bring about any significant change. Any power savings from switching from x86 to a RISC come from simplifying the instruction decoder, which seems to be worth about 15-20% if we compare the Ampere Altra to a comparable AMD chip. That’s not an order of magnitude.

On the other hand, on the order of 80% of a chip’s power is spent on OOO execution. If you want the order of magnitude improvement in power efficiency, you need to dump superscalar/OOO in favor of smart compilers and VLIW. Cheap DSPs have been doing it for years, but compilers aren’t good enough yet for general purpose processing.


Agree that OoO is the big cost. But we can also mitigate that without VLIW: SIMD/vector reduces the instruction count by ~5x, and energy by a similar factor.

And a portable API such as Highway also helps us move the same code from x86 to Arm or RISC-V with just a recompile :D


It's not clear that even a super smart compiler can do this. The best schedule depends on the latency of instructions. This is a problem because we can't know statically whether a particular memory load is in L1/L2/L3/DRAM/etc., as this can vary for different executions of the same load instruction.


According to [1], 88% of the speedup given by OOO processors is due to speculation, and reordering in the case of cache misses accounted for around 10% of the speedup. If OOO in general gives around a 50% speedup compared to in-order designs, reordering in the face of cache misses gives only around a 5% speedup. If you use a good static schedule with speculation, you'll get the bulk of the speedup. The rest can be recovered by increasing the clock rate by 5%, since you'll have gotten rid of so much silicon.
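The back-of-envelope arithmetic behind that 5% figure (my framing, using the percentages quoted above):

```cpp
// If OOO gives a ~50% overall speedup and reordering around cache
// misses contributes ~10% of that benefit, the reordering alone is
// worth roughly 0.50 * 0.10 = 5%.
#include <cassert>
#include <cmath>

double ReorderingGain(double ooo_speedup, double miss_reorder_fraction) {
  return ooo_speedup * miss_reorder_fraction;
}
```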

[1] https://doi.org/10.1145/2451116.2451143


Nice paper!


> compilers aren’t good enough yet

So we need to go back to coding in assembler to save the planet? Sign me up!


If you look at x86 microcode much, you start to realize that the backend is actually very RISC-like. The distinction between CISC and RISC, to me, is more of a frontend instruction-selection one.


Author here, happy to discuss.


Looks cool. Are you familiar with Halide? In terms of target support or implementation complexity, how would you compare using Highway vs using Halide for similar projects? (I work at Google, use Halide)


Thanks! Yes indeed. Halide has good support for stencils and convolutions, especially taking care of the border handling, which is a pain with explicit SIMD if you are unable to 'define that away' by padding etc. Boilerplate is a bit more (~100 lines vs ~50 for Highway) but that's probably worthwhile.

Target support is good, though AFAIK non-constexpr-size vectors (SVE/RVV) are not supported.

The biggest concern I'd have is predictability. A thin wrapper over intrinsics makes it easy to experiment with codegen and see what's going on. We've seen some compiler bugs, but they are usually of narrow scope and relatively easy to reproduce. By contrast, Halide involves extensive compiler transforms. When that breaks, it's much harder to move forward. There is also a risk of performance cliffs, because the language is very flexible and AFAIK lacks 'this is going to be slow' guardrails.

It's great for prototyping and image processing - this is probably what you're using it for? But I would not expect peak performance; at one point I reimplemented parts of RAISR with explicit SIMD and saw a surprisingly large speedup.


Thanks for the response! Your point about Halide not being as predictable and having few guardrails is well taken.

Yes I work on image processing, same pipeline as RAISR actually.


Looks very interesting!

I've been having trouble getting good SIMD performance from WASM. Are there any benchmarks posted anywhere comparing Highway's performance on various native targets vs WASM?


:) There are some wasm vs native benchmarks in the context of JPEG XL (for example https://github.com/niutech/jxl.js#benchmark) Could also be interesting for you to run the vqsort benchmark: https://github.com/google/highway/blob/master/hwy/contrib/so...


This is awesome work.

Do you think it would be possible to port this to Rust?


Thanks!

> Do you think it would be possible to port this to Rust?

Yes, I think so. The main tricks are redefining macros based on the current compilation target, and compiling the same code multiple times. I believe Rust can do both, but I have only surface-level familiarity. One stumbling block is getting the C++ equivalent of #pragma target, but I think Rust also has that.

There's been other interest in a Rust port, feel free to open an issue to perhaps start a discussion.
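To make the trick concrete, here is a minimal C++ caricature of it (invented names, not Highway's actual macros - Highway re-includes a header once per target via foreach_target.h): the same function body is compiled once per "target" under different macro settings, and a runtime check picks the best version.

```cpp
// Hypothetical miniature of per-target recompilation + dispatch.
#include <cassert>
#include <cstddef>

// "Compile" the same body for a scalar target...
#define SIMD_LANES 1
namespace target_scalar {
size_t Lanes() { return SIMD_LANES; }
}  // namespace target_scalar
#undef SIMD_LANES

// ...and again for a wider target, just by changing the macro.
#define SIMD_LANES 8
namespace target_avx2 {
size_t Lanes() { return SIMD_LANES; }
}  // namespace target_avx2
#undef SIMD_LANES

// Runtime dispatch; real code would query CPU features (e.g. CPUID)
// instead of taking a flag.
size_t BestLanes(bool have_avx2) {
  return have_avx2 ? target_avx2::Lanes() : target_scalar::Lanes();
}
```

A Rust port would need an equivalent mechanism, e.g. building the shared body once per target feature set and dispatching at runtime.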


Is there anything similar to this in C?


If you're on an Apple platform, their simd.h header works with the clang vector attribute to provide a nice way to do very similar stuff.
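For example, the underlying clang vector attribute looks like this (a sketch of the extension itself, which GCC also supports, rather than of simd.h's wrappers):

```cpp
// A 4-lane float vector via the clang/GCC vector extension; arithmetic
// on it is elementwise and compiles to SIMD instructions where available.
#include <cassert>

typedef float float4 __attribute__((vector_size(16)));

float4 MulAdd4(float4 a, float4 b, float4 c) {
  return a * b + c;  // elementwise multiply-add across all four lanes
}
```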


SLEEF comes with an abstraction layer (mostly for math). Instead of C++ function overloading, it uses type suffixes: https://github.com/shibatch/sleef/blob/master/src/arch/helpe...



