Nice, but patterned_register_blocked_simd is not just 22x faster, it is also about 6x less precise (FPR goes from 1% to ~6%).
What size difference is needed to get these to align again, and how would that size difference impact the original's performance if _k_ was adjusted accordingly for iso-FPR?
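
For a rough baseline, here is a minimal sketch of the iso-FPR arithmetic using the textbook approximation for a classic (non-blocked) Bloom filter, FPR ≈ (1 - e^(-k/b))^k with b bits per key. The function names are mine, and the formula is an assumption here: it does not model the blocked or patterned variants, which generally need more bits than this to reach the same FPR, so treat the numbers as a lower bound rather than an answer.

```python
import math

def classic_fpr(bits_per_key: float, k: int) -> float:
    """Approximate FPR of a classic Bloom filter with k hash functions."""
    return (1.0 - math.exp(-k / bits_per_key)) ** k

def bits_per_key_for(target_fpr: float) -> float:
    """Bits per key needed to hit target_fpr when k is chosen optimally."""
    return -math.log(target_fpr) / (math.log(2) ** 2)

def optimal_k(bits_per_key: float) -> int:
    """Nearest integer to the optimal k = b * ln 2 (at least 1)."""
    return max(1, round(bits_per_key * math.log(2)))

if __name__ == "__main__":
    for p in (0.01, 0.06):
        b = bits_per_key_for(p)
        k = optimal_k(b)
        print(f"target FPR {p:.0%}: {b:.2f} bits/key, k={k}, "
              f"achieved ~{classic_fpr(b, k):.2%}")
```

Under this model, 1% FPR costs about 9.6 bits per key and 6% about 5.9, so the fair comparison would be either the original shrunk to roughly 60% of its size with a smaller _k_, or the new variant grown until it measures 1% again.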