Others have hinted at this, but to be clear: This algorithm is slow, even in the optimal case where the tables are in cache. On new X86 CPUs it is theoretically limited to less than 2 bytes per cycle. Probably somewhere around 1.5 for an implementation that loads 8 bytes of input at once and shift though them in order to limit load on the load slots.
Even without getting into SIMD algorithms you could load 8 bytes at a time and pretty easily go faster than that, possibly while using the multiplication instruction for mixing.
This of course ignores that we are not looking up values from a hash table in a vacuum. Other code will also be competing for the cache, and that generally means that everything runs slower because of more cache misses.
Even without getting into SIMD algorithms you could load 8 bytes at a time and pretty easily go faster than that, possibly while using the multiplication instruction for mixing.
This of course ignores that we are not looking up values from a hash table in a vacuum. Other code will also be competing for the cache, and that generally means that everything runs slower because of more cache misses.