Any FizzBuzz example almost feels too silly to deserve a serious comment. But here goes.
As you noticed, the FizzBuzz pattern has a period of 15. So if you want to go fast, just go directly to a 15x vectorized unroll with unaligned stores.
Assuming 32-byte vectors like with AVX2, you initialize two vector registers v0, v1 to hold the pattern of 15 4-byte values (the 16th value doesn't matter) corresponding to 0 through 15. You also have corresponding delta vector registers d0, d1 whose entries are +15 for the "pass-through" values and 0 for the -1/-2/-3 sentinel values. The loop kernel is just this:
You could remove the unaligned stores with a monstrous lcm(15, 16) = 240x unroll (16 shifted instances of the above kernel) but you wouldn't be able to keep all the necessary loop state in registers.
There's a small bug in your code. The loop kernel writes out 16 entries despite only advancing by 15 entries, so your upper bound calculation needs adjustment. I'm also worried the JVM's compiler won't be able to eliminate the array bounds checks because of how you calculate the upper bound with the modulus. Writing it like this should fix both issues:
// We need i + 15 < length to keep the last written entry in bounds.
for (; i < length - 15; i += 15)
Notably: "Faults occur only due to mask-bit required memory accesses that caused the faults. Faults will not occur due to referencing any memory location if the corresponding mask bit for that memory location is 0. For example, no faults will be detected if the mask bits are all zero."
A nitpick: there's no such thing as AVX-256, you probably meant AVX-512.
var mask = VectorMask.fromLong(SPECIES, 0b0111_1111);
var i = 0;
var upperBound = length - length % 15;
for(; i < upperBound; i += 15) {
v1.intoArray(result, i);
v2.intoArray(result, i + 8, mask);
...
Yeah, that looks really neat, thanks for sharing that idea and even putting it together in Java here! I'll add it to the repo (https://github.com/gunnarmorling/simd-fizzbuzz) in a bit, and run the JMH benchmark for it, for the fun of it.
I have used FizzBuzz as an introduction to wheels. Any repeating sequence will do, but everyone knows FizzBuzz. It is a great example of that because you replace a simple algorithm with one that is also simple and about twice as fast in my language of choice (using circular, singly linked lists for the wheel).
The article does a similar thing, but using masks to achieve the speedup.
What is "wheel" in this context? The main one I can think of in programming is Python packages. I even considered you were talking about literal wheels! But it sounds like you're talking about some sort of data structure, like a ring buffer?
It doesn't seem to be a standard term but since you went to the trouble of posting a comment surely you were hoping it would be understood. Are you talking about modulo arithmetic? Or hierarchical timing wheels (the top hit for "wheel data structure")?
Any kind of circular data that repeats itself. In the case of wheel factorisation, it is a wheel "rolling" over the number line.
Any data structure will do. Either a circular one or one with O(1) access time. It is the concept used in the article with a vector mask repeating every 15 elements.
It is a technique. I have just always used a circular list for it :)
And the Link I posted explains it just fine. Anyway, I actually wrote some code to generate wheels not too long ago: https://git.sr.ht/~bjoli/wheel-utils
OK, thanks. I think your concept of a wheel could be summarised as "use of a ring buffer for modulo arithmetic". (Of course, like you said, a "ring buffer" doesn't have to be a linked list with start and end links connected - an array that you just index carefully works too.)
As you noticed, the FizzBuzz pattern has a period of 15. So if you want to go fast, just go directly to a 15x vectorized unroll with unaligned stores.
Assuming 32-byte vectors like with AVX2, you initialize two vector registers v0, v1 to hold the pattern of 15 4-byte values (the 16th value doesn't matter) corresponding to 0 through 15. You also have corresponding delta vector registers d0, d1 whose entries are +15 for the "pass-through" values and 0 for the -1/-2/-3 sentinel values. The loop kernel is just this:
And of course you still need the remainder loop.You could remove the unaligned stores with a monstrous lcm(15, 16) = 240x unroll (16 shifted instances of the above kernel) but you wouldn't be able to keep all the necessary loop state in registers.