Hacker News new | past | comments | ask | show | jobs | submit login

Why isn't it the default everywhere?



As I tried to indicate, the microbenchmark-guided unrolled loops look better than the simple `rep` instruction because the benchmarks aren't realistic. They don't suffer from mispredicted branches, they don't experience icache pressure. Basically the microarchitectural reality is absent in a small benchmark.

There's also the fact that `rep` startup cost was higher in the past than it is now. I think it started to get really fast around Ivy Bridge.

This is all discussed in Intel's optimization manual, by the way.


Neat. I was unaware of this manual. For the lazy but curious:

https://software.intel.com/content/www/us/en/develop/downloa...

And if you want ALL THE THINGS:

https://software.intel.com/content/www/us/en/develop/article...




Join us for AI Startup School this June 16-17 in San Francisco!

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: