Note that this is about slow compilation, not execution. IMHO if the output also becomes much faster, the tradeoff is reasonable. It would be interesting to know how much runtime performance changed, and in what direction, by not inlining.
On the other hand, the huge difference in compile times suggests there may be a hidden quadratic or higher complexity algorithm somewhere in the code path of the compiler when inlining is performed.
The function call overhead for an SVD-based matrix inverse should be tiny. Maybe if they are inverting 4x4 matrices, it's measurable. Inlining can give huge speedups but it should not be applied to every function without thought.
Right. Of course turning off forced inlining will speed up compilation. The article is completely silent about what it does to run time.
This is an unusual case, in that it involves code with SIMD instructions. Depending on the calling convention, out-of-line calls usually mean the arguments have to be put on the stack, then loaded into the SIMD registers. Inlining opens up the optimization possibility of not doing that.
It's not clear what he's writing, but from the math it sounds like a physics engine for animation or games.
> IMHO if the output also becomes much faster, the tradeoff is reasonable.
The output may be faster in terms of CPU cycles, but the executable will have a larger memory footprint, which can cause performance problems when it comes to swapping, eviction of file-backed pages, etc.
Aside from compilation time, inlining doesn't mean your program will run faster, because now you're filling your icache with the inlined instructions rather than a single call instruction, and those instructions might be flushed away on a branch misprediction.
True, but less so now than a decade ago. If it's a hot path (which anything `__forceinline` presumably is, assuming the author has a reasonable understanding of optimization), inlining is usually going to be a performance win, and often a cache/code size win as well (optimizing the inlined code, avoiding register spills).
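For anyone who wants to measure this rather than argue about it, here is a rough sketch of a build-time switch between forced inlining and forced out-of-line calls, so the same source can be timed both ways (compile time and run time). `HOT_INLINE`, `NO_FORCE_INLINE`, `Vec4`, `dot` and `sum_of_dots` are made-up names for illustration, not anything from the article, and plain floats stand in for the SIMD types.

```cpp
#include <cstddef>

// Hypothetical macro, not from the article. Build with -DNO_FORCE_INLINE
// (or /DNO_FORCE_INLINE on MSVC) to force out-of-line calls instead.
#if defined(NO_FORCE_INLINE)
#  if defined(_MSC_VER)
#    define HOT_INLINE __declspec(noinline)
#  else
#    define HOT_INLINE __attribute__((noinline))
#  endif
#elif defined(_MSC_VER)
#  define HOT_INLINE __forceinline
#else
#  define HOT_INLINE inline __attribute__((always_inline))
#endif

// Stand-in for a small math type; a real engine would likely wrap __m128.
struct Vec4 { float x, y, z, w; };

// Small, non-recursive helper: the kind of function where inlining
// removes call overhead and lets the optimizer keep values in registers.
static HOT_INLINE float dot(const Vec4& a, const Vec4& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}

// Hot loop that calls the helper once per element.
float sum_of_dots(const Vec4* a, const Vec4* b, std::size_t n) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        acc += dot(a[i], b[i]);
    }
    return acc;
}
```

Building the same file with and without `NO_FORCE_INLINE` and comparing both the build time and a benchmark of `sum_of_dots` would show which way the tradeoff goes for a given compiler. A toy this small won't reproduce the compile-time blowup from the article, though.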
True, and the functions in which inlining is most effective are small, non-recursive functions that don't lose any debug info from omitting the frame pointer.