This reminds me of Bryan Cantrill's remarks in a talk (sorry, can't remember which one) about Solaris and DTrace. For the first time ever, DTrace gave them the ability to look into the live performance of low-level OS code, and they found a lot of pathological worst-case behavior that no one suspected beforehand. No matter how good your team is, it's really hard to accurately predict how a system will behave; measurement is key.
However, their premise of "just wake the idle cores as soon as there are threads to run" can actually be harmful in some real-life scenarios, especially because of the additional power use and the slowdown from moving a thread off a warm cache onto a cold one.
Intuitively, since they demonstrate the "speedup" on some specific programs using some specific patterns, I'd expect that other specific programs and patterns exist that don't necessarily benefit from every change they present.
And yes, that proof probably will be hairy, but that's precisely the reason it has to happen.
Also, in the likely case that the proof only holds for a limited domain ("at most 1000 CPUs", "level 1 cache shared between the same CPUs as the level 2 cache"), I think the scheduler should check those assumptions at boot time and strongly consider panicking the kernel if they don't hold.
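Something along these lines (a hypothetical sketch in Linux-ish C; MAX_PROVEN_CPUS and topology_l1_l2_match() are made-up names standing in for whatever assumptions the proof actually makes, while num_possible_cpus() and panic() are the real kernel helpers):

    /* Hypothetical boot-time check: refuse to run outside the domain
     * the scheduler proof covers. */
    #define MAX_PROVEN_CPUS 1000

    static void __init sched_check_proof_assumptions(void)
    {
            if (num_possible_cpus() > MAX_PROVEN_CPUS)
                    panic("scheduler: correctness proof only covers %d CPUs, found %u",
                          MAX_PROVEN_CPUS, num_possible_cpus());
            if (!topology_l1_l2_match())
                    panic("scheduler: proof assumes L1 and L2 caches share the same CPU sets");
    }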
Unidirectional interprocess communication forces this issue to come up frequently. The pattern "send on pipe A, wait for pipe B" to a process that is waiting on pipe A and will reply on pipe B implies that, for a very short period between the send and the wait, there are two processes ready to run. Starting one of them on another CPU is suboptimal, and waking up a sleeping CPU is even worse. You're really doing an interprocess subroutine call, but the OS does not know this. (As I point out occasionally, QNX, with MsgSend/MsgReceive and a scheduler that understands them, does this much better.)
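Concretely, the pattern I mean looks something like this (a minimal POSIX-pipe sketch, not the QNX API; ipc_call is just an illustrative name):

    #include <unistd.h>

    /* Client side of the "send on A, wait on B" round trip: effectively an
     * interprocess subroutine call, though the scheduler only sees a write
     * followed by a blocking read. */
    ssize_t ipc_call(int pipe_a_wr, int pipe_b_rd,
                     const void *req, size_t req_len,
                     void *reply, size_t reply_max)
    {
        if (write(pipe_a_wr, req, req_len) < 0)   /* server becomes runnable here */
            return -1;
        return read(pipe_b_rd, reply, reply_max); /* ...and the caller blocks here */
    }

The whole window in which both sides are runnable is the gap between those two lines, which is exactly the moment a wake-another-core scheduler does work for nothing.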