
One thing I learned about pcmpxstrx is that it's surprisingly slow: according to Agner's tables, latency is 10-11 cycles and reciprocal throughput is 3-5 cycles on Haswell, depending on the precise instruction variant. The instructions are also limited in the ALU ports they can use. Since AVX2 has made SIMD on x86 fairly flexible, the string comparison instructions are sometimes not worth using: even a slightly longer sequence of simpler SIMD instructions can beat a single string compare.

The SSE 4.2 string comparison instructions still have their uses, but it's always worth testing alternate instruction sequences when optimizing code that might use them.
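
To make "a slightly longer sequence of simpler SIMD instructions" concrete, here is a rough sketch of one case where the string instructions are tempting: finding the first byte that matches either of two characters. It uses only SSE2 compares, an OR, and a movemask. The function name `find_first_of2` is made up for illustration, `__builtin_ctz` assumes GCC or Clang, and whether this actually beats the string instruction on a given CPU is something you would have to measure.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics, baseline on x86-64 */
#include <stddef.h>

/* Hypothetical helper: returns the index of the first byte in s[0..len)
 * that equals a or b, or len if neither occurs. */
static size_t find_first_of2(const char *s, size_t len, char a, char b)
{
    assert(s != NULL);

    /* Broadcast each needle byte to all 16 lanes. */
    const __m128i va = _mm_set1_epi8(a);
    const __m128i vb = _mm_set1_epi8(b);
    size_t i = 0;

    /* Process 16 bytes per iteration with an unaligned load. */
    for (; i + 16 <= len; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(s + i));
        /* Lanes become 0xFF where the byte equals a or b. */
        __m128i hit = _mm_or_si128(_mm_cmpeq_epi8(chunk, va),
                                   _mm_cmpeq_epi8(chunk, vb));
        /* Collapse the 16 lanes into a 16-bit mask, one bit per byte. */
        unsigned mask = (unsigned)_mm_movemask_epi8(hit);
        if (mask != 0)
            return i + (size_t)__builtin_ctz(mask);  /* lowest set bit = first match */
    }

    /* Scalar tail for the remaining 0-15 bytes. */
    for (; i < len; i++)
        if (s[i] == a || s[i] == b)
            return i;

    return len;
}
```

The per-chunk work is two compares, an OR, a movemask, and a branch, all cheap single-uop-style operations that spread across ports more freely than the string instructions. The cost grows with the number of needle bytes, so for larger character sets the trade-off may shift.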




I have the same experience. I've tried using pcmpestr in substring search a few times, and it has always turned out not to be worth it. I have, however, never tried it in comparison functions, so I can't speak to that, but I wouldn't be surprised if the latency of the instruction made an impact there too.



