In fact objc_msgLookup actually ends up being faster in a some micro benches since it plays a lot better with modern CPU branch predictors: objc_msgSend defeats them by making every call site jump to the same dispatch function, which then makes a completely unpredictable jump to the imp. By using msgLookup you essentially decouple the branch source from the lookup which greatly improves predictably. Also, with a “sufficiently smart” compiler it can be win because it allows you to do things like hoist the lookup out of loops, etc (essentially really clever automated IMP caching tricks).
There are also a number of minor regressions, like now you are doing some of the work on a stack frame (which might require spilling if you need a register, vs avoiding spills by using exclusively non-preserved registers in an assembly function that tail calls). In the end what kills it is that the profiles of most objC is large flat sections that do not really benefit from the compiler tricks or the improved prediction, and the added call site instructions end up in increased binary sizes and negative CPU i-cache impacts.
However, I'll happily accept that these are probably pretty small costs, especially since so much of it is just register gains which probably result in cost-free renamings in the hardware. It makes sense that the i-cache impact is more important.