It actually is not the extra function call that is the big hit: if you think about it, objc_msgSend also does two calls (the call to objc_msgSend itself, which then tail-calls the IMP at the end). The dynamic instruction count is also roughly the same.
In fact objc_msgLookup actually ends up being faster in some microbenchmarks, since it plays a lot better with modern CPU branch predictors: objc_msgSend defeats them by making every call site jump to the same dispatch function, which then makes a completely unpredictable jump to the IMP. By using objc_msgLookup you essentially decouple the branch source from the lookup, which greatly improves predictability. Also, with a “sufficiently smart” compiler it can be a win because it allows you to do things like hoist the lookup out of loops, etc (essentially really clever automated IMP-caching tricks).
There are also a number of minor regressions, like now doing some of the work on a stack frame (which might require spilling if you need a register, versus avoiding spills entirely by using only non-preserved registers in an assembly function that tail-calls). In the end what kills it is that the profiles of most Objective-C code are large, flat sections that do not really benefit from the compiler tricks or the improved prediction, while the added call-site instructions increase binary size and hurt the CPU's i-cache.
Interesting! Making two separate calls at the call site would have some extra overhead compared to what objc_msgSend does. The caller needs to load the self and _cmd arguments twice, for example, and stash the IMP somewhere convenient between the two calls. If objc_msg_lookup has a standard prologue and epilogue, then you end up running two of those each time, and you push and pop two return addresses on the stack rather than just one.
However, I'll happily accept that these are probably pretty small costs, especially since so much of it is just register moves, which probably turn into cost-free renames in the hardware. It makes sense that the i-cache impact is more important.