This was the best part of the article for me: > Intrinsic vmulq_f32(x, c0, 0) mu...

ack_complete · 2024-02-06T02:43:20

That should be vmulq_lane_f32(), but yeah, lane broadcast is free on a number of NEON operations. Many operations also have built-in narrowing, widening, saturation, and rounding. One of the more ridiculous ones is vqrdmlah_lane_s16(), which translates to: signed saturating rounding doubling multiply accumulate returning high half (with a lane broadcast).
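To unpack that name, here is a rough scalar model in portable C of what each 16-bit element goes through. It is a sketch for illustration, not the intrinsic itself: the helper names (`sat_s16`, `floor_div_65536`, `qrdmlah_s16_model`, `qrdmlah_lane_s16_model`) are made up for this example, and the argument order (accumulator, multiplicand, lane source, lane index) is assumed to mirror the intrinsic.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: clamp a wide intermediate to int16_t ("saturating"). */
static int16_t sat_s16(int64_t x) {
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t)x;
}

/* Hypothetical helper: floor(x / 65536) without relying on >> of negatives. */
static int64_t floor_div_65536(int64_t x) {
    int64_t q = x / 65536;
    if (x % 65536 < 0) q -= 1;
    return q;
}

/* One element: accumulate a with the rounded high half of the doubled product. */
static int16_t qrdmlah_s16_model(int16_t a, int16_t b, int16_t c) {
    int64_t acc = (int64_t)a * 65536      /* accumulator aligned to the high half */
                + 2 * (int64_t)b * c      /* "doubling multiply" */
                + 32768;                  /* "rounding": add half an output LSB */
    return sat_s16(floor_div_65536(acc)); /* "returning high half", saturated */
}

/* Four elements, with every b[i] multiplied by the same v[lane] ("lane broadcast"). */
static void qrdmlah_lane_s16_model(int16_t out[4], const int16_t a[4],
                                   const int16_t b[4], const int16_t v[4],
                                   int lane) {
    assert(lane >= 0 && lane < 4);
    for (int i = 0; i < 4; i++) {
        out[i] = qrdmlah_s16_model(a[i], b[i], v[lane]);
    }
}
```

Reading the inputs as Q15 fixed point, 0.5 * 0.5 (16384 * 16384) comes out as 8192, i.e. 0.25, and -1.0 * -1.0 saturates to 32767 instead of wrapping around.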

The downside is that the latencies can be a bit high sometimes compared to other CPUs. 128-bit vector integer adds, for instance, have 2c latency even on an Apple M1.

Another thing to watch out for is that some NEON guides are outdated and only cover ARMv7 features, missing some goodies added in ARMv8 like horizontal operations (vaddv) and conversions with rounding modes other than truncation.
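As a small illustration of the horizontal add, here is a sketch that sums a float array four elements at a time. The function name `hsum_f32` and the multiple-of-4 length requirement are assumptions made for this example. On AArch64 it uses `vaddvq_f32()`; elsewhere it falls back to a plain scalar sum so the snippet still builds.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__aarch64__) || defined(_M_ARM64)
#include <arm_neon.h>
#endif

/* Hypothetical helper: sum n floats, where n must be a multiple of 4. */
static float hsum_f32(const float *x, size_t n) {
    assert(n % 4 == 0);
    float total = 0.0f;
    for (size_t i = 0; i < n; i += 4) {
#if defined(__aarch64__) || defined(_M_ARM64)
        /* ARMv8: load 4 lanes, then add across the vector in one intrinsic. */
        total += vaddvq_f32(vld1q_f32(x + i));
#else
        /* Scalar fallback for non-ARM builds. */
        total += (x[i] + x[i + 1]) + (x[i + 2] + x[i + 3]);
#endif
    }
    return total;
}
```

An ARMv7-only guide would typically build the same reduction out of pairwise adds (vpadd) on the two vector halves.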

adrian_b · 2024-02-06T06:37:21

On Intel/AMD you need AVX-512 support to get the instructions with broadcast (and many other goodies that are missing from SSE/AVX/AVX2).

Intel had such instructions many years before ARM (i.e. since Larrabee), but chose to provide them only in its high-end CPUs, annoying programmers who want high performance but do not want the burden of developing for a fragmented instruction set.