There is an Intel PEXT instruction, but I don't know the cycle count and I think it's limited to 64 bits.
With PEXT the whole prefix match is four instructions.
PEXT is pretty cheap... but, I still don't think I understand how what you're suggesting could optimally apply to prefix matching against all 16 strings at once, or why it would offer a substantial speed-up to the approach I've used? Can you elaborate?
I can't see how this avoids needing to do a comparison against the string in the slot once you've actually found the index though. I think I need to see this in C or asm to grok what the benefits are.