Others have hinted at this, but to be clear: This algorithm is slow, even in the...

Others have hinted at this, but to be clear: This algorithm is slow, even in the optimal case where the tables are in cache. On new X86 CPUs it is theoretically limited to less than 2 bytes per cycle. Probably somewhere around 1.5 for an implementation that loads 8 bytes of input at once and shift though them in order to limit load on the load slots.

Even without getting into SIMD algorithms you could load 8 bytes at a time and pretty easily go faster than that, possibly while using the multiplication instruction for mixing.

This of course ignores that we are not looking up values from a hash table in a vacuum. Other code will also be competing for the cache, and that generally means that everything runs slower because of more cache misses.