Hacker News new | past | comments | ask | show | jobs | submit login

The compiler will generate different code if it knew the rates at which branches were taken.

If a branch is almost always taken or almost never taken, a compiler will want to emit a jump. The frontend will be able to predict the jump with high probability, and a successfully-predicted jump is "free." The cost of a misprediction is paid for by the near-zero cost of the many successful predictions.

If a branch is hard to predict (and taking versus not taking it would load a different value into a register/memory), the compiler wants to emit a conditional move (cmov). A conditional move is slightly "more expensive" in the backend because the CPU has to wait for the condition to resolve before it can execute instructions dependent on the output. However, it is much cheaper than many mispredicted branches (mispredicts around half of the time).

FDO (feedback-directed optimization) or PGO (profile-guided optimization) means "run the code on some sample input and profile how often branches are taken/not taken." It gives the compiler more information to generate better code.

The problem with the blog post is that the compiler has no idea what the function's input data will look like. It (arbitrarily) chose to generate branches instead of cmovs. However, if the benchmark input is better suited for cmovs, then the benchmark will (wrongly) show that the compiler generates "slow" assembly. But that's not a fair test, because with PGO/FDO the compiler would generate equivalent assembly to the "fast" assembly (actually, probably faster). Finally, the human (OP) is using their knowledge of the benchmark data "unfairly" to write better assembly than the compiler.

The takeaway is: most of the time, one can't optimize code/assembly in a vacuum. You also need to know what the input data and access patterns look like. FDO/PGO gives the compiler more data to understand what the input data/access patterns look like.




Thank you this is an amazingly comprehensive answer! Now I wonder what would be the workflow for using these compiler features. Like if I am a normal or bad C programmer and I write my program and use valgrind to check that it doesn't have obvious problems and I compile it with -march native or whatever, then I can add some step to the workflow to somehow re-compile it in a way that uses the branching or access patterns of some examples that I let it process for that purpose?


Yes, Clang supports FDO, but it might be hard to set up (I've never set up FDO myself). You could check out https://github.com/google/autofdo and https://clang.llvm.org/docs/UsersManual.html#profile-guided-....

(People within Google say "FDO", basically everyone else says "PGO".)




Consider applying for YC's W25 batch! Applications are open till Nov 12.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: