Hacker News new | past | comments | ask | show | jobs | submit login

Any FizzBuzz example almost feels too silly to deserve a serious comment. But here goes.

As you noticed, the FizzBuzz pattern has a period of 15. So if you want to go fast, just go directly to a 15x vectorized unroll with unaligned stores.

Assuming 32-byte vectors like with AVX2, you initialize two vector registers v0, v1 to hold the pattern of 15 4-byte values (the 16th value doesn't matter) corresponding to 0 through 15. You also have corresponding delta vector registers d0, d1 whose entries are +15 for the "pass-through" values and 0 for the -1/-2/-3 sentinel values. The loop kernel is just this:

    store_unaligned(p, v0)
    store_unaligned(p + 8, v1)
    v0 += d0
    v1 += d1
    p += 15
And of course you still need the remainder loop.

You could remove the unaligned stores with a monstrous lcm(15, 16) = 240x unroll (16 shifted instances of the above kernel) but you wouldn't be able to keep all the necessary loop state in registers.




Your version written in Java

  private static final int[] VALUES = range(0, 16).map(FizzBuzz::fizzbuzz).toArray();
  private static final int[] DELTAS = stream(VALUES).map(v -> v < 0? 0: 15).toArray();

  private static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_256;

  private static int fizzbuzz(int i) {
    return i % 15 == 0? FIZZ_BUZZ: i % 5 == 0? BUZZ: i % 3 == 0? FIZZ: i;
  }

  public static int[] simdFizzBuzz(int length) {
    var result = new int[length];

    var v1 = IntVector.fromArray(SPECIES, VALUES, 0);
    var v2 = IntVector.fromArray(SPECIES, VALUES, 8);
    var d1 = IntVector.fromArray(SPECIES, DELTAS, 0);
    var d2 = IntVector.fromArray(SPECIES, DELTAS, 8);

    var i = 0;
    var upperBound = length - length % 15;
    for(; i < upperBound; i += 15) {
      v1.intoArray(result, i);
      v2.intoArray(result, i + 8);
      v1 = v1.add(d1);
      v2 = v2.add(d2);
    }
    for(; i < length; i++) {
      result[i] = fizzbuzz(i);
    }
    return result;
  }
quite cool indeed


There's a small bug in your code. The loop kernel writes out 16 entries despite only advancing by 15 entries, so your upper bound calculation needs adjustment. I'm also worried the JVM's compiler won't be able to eliminate the array bounds checks because of how you calculate the upper bound with the modulus. Writing it like this should fix both issues:

    // We need i + 15 < length to keep the last written entry in bounds.
    for (; i < length - 15; i += 15)


yes, right, if length is a multiple of 15, there is an issue.

And there is no primitive for masking on AVX2, those are only available on AVX-256.

if we want to minimize the iterations of the post loop, it should be

  var upperBound = length % 15 == 0? length - 15: length - length % 15;
but as you said, i'm not sure the VM will eliminate the bound checks in that case so your solution seems to be the best

  var upperBound = length - 15


> And there is no primitive for masking on AVX2, those are only available on AVX-256.

Sure there is a masking MOV in the original AVX (thus including AVX2), VMASKMOV. Works for both masked loads and stores.

https://www.felixcloutier.com/x86/vmaskmov

Notably: "Faults occur only due to mask-bit required memory accesses that caused the faults. Faults will not occur due to referencing any memory location if the corresponding mask bit for that memory location is 0. For example, no faults will be detected if the mask bits are all zero."

A nitpick: there's no such thing as AVX-256, you probably meant AVX-512.


Thanks, i've overlooked that

    var mask = VectorMask.fromLong(SPECIES, 0b0111_1111);

    var i = 0;
    var upperBound = length - length % 15;
    for(; i < upperBound; i += 15) {
      v1.intoArray(result, i);
      v2.intoArray(result, i + 8, mask);
      ...


Yeah, that looks really neat, thanks for sharing that idea and even putting it together in Java here! I'll add it to the repo (https://github.com/gunnarmorling/simd-fizzbuzz) in a bit, and run the JMH benchmark for it, for the fun of it.


I guess you can also prefill result[i]=i and then apply fizzbuzz mask to avoid addition altogether


Oh, wait, the original code receives input array while above loops from zero to n which is something else.


I actually have one use for FizzBuzz!

I have used FizzBuzz as an introduction to wheels. Any repeating sequence will do, but everyone knows FizzBuzz. It is a great example of that because you replace a simple algorithm with one that is also simple and about twice as fast in my language of choice (using circular, singly linked lists for the wheel).

The article does a similar thing, but using masks to achieve the speedup.


What is "wheel" in this context? The main one I can think of in programming is Python packages. I even considered you were talking about literal wheels! But it sounds like you're talking about some sort of data structure, like a ring buffer?


I call it a wheel because I was first exposed to them in wheel factorisation: https://en.m.wikipedia.org/wiki/Wheel_factorization


> I call it a wheel

Call what a wheel? What is the "it" here?

It doesn't seem to be a standard term but since you went to the trouble of posting a comment surely you were hoping it would be understood. Are you talking about modulo arithmetic? Or hierarchical timing wheels (the top hit for "wheel data structure")?


Any kind of circular data that repeats itself. In the case of wheel factorisation, it is a wheel "rolling" over the number line.

Any data structure will do. Either a circular one or one with O(1) access time. It is the concept used in the article with a vector mask repeating every 15 elements.

It is a technique. I have just always used a circular list for it :)


And the Link I posted explains it just fine. Anyway, I actually wrote some code to generate wheels not too long ago: https://git.sr.ht/~bjoli/wheel-utils


OK, thanks. I think your concept of a wheel could be summarised as "use of a ring buffer for modulo arithmetic". (Of course, like you said, a "ring buffer" doesn't have to be a linked list with start and end links connected - an array that you just index carefully works too.)




Consider applying for YC's Fall 2025 batch! Applications are open till Aug 4

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: