Writing the code in a way that doesn't depend on the most advanced hardware in existence is the better choice, unless you target only that specific hardware or the solution is of the "use once, then throw away" kind.
That would be true if we were talking about using this optimisation instead of that one.
We're not. Here, they're saying that on this particular architecture, a number of optimisations are effectively useless. They're saying that naive code is almost as fast. They're saying that some "optimisations" have even become counter-productive.
The lesson I get from this? Unless I know a fair bit about the target platforms, the naive approach is better, because any performance model I have in mind might be invalidated anyway.
They actually benchmark the Python interpreter. They show that the gain from using the goto labels is low only on Haswell. For me, that's not an argument for never using goto labels, and especially not for removing them from the Python implementation, which certainly has to run on many CPUs other than Haswell.
Edit: therefore I don't agree with your "unless I know a fair bit about the target platforms, the naive approach is better." I'd agree, however, with "if the speed (or the battery use) actually doesn't matter, the naive approach is better." Sure.
Good thing they don't advocate dropping goto labels, nor removing them from the Python implementation. Heck, they're not even saying we should stop using goto labels on new projects.
Besides, goto labels are really low-hanging fruit: they hardly complicate your interpreter and have no negative impact. They discuss heavier optimisations, some of which are slower on Haswell. Those might not be worth their while any more.
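For anyone who hasn't seen the construct: here's a minimal sketch of the two dispatch styles for a toy three-opcode machine (purely illustrative, nothing to do with CPython's real bytecode). The "goto label" version relies on the GCC/Clang labels-as-values extension; the entire change is a jump table plus one goto at the end of every opcode body, which is why it barely complicates the interpreter, yet it gives the branch predictor one indirect jump per opcode instead of one shared jump for the whole loop.

    /* Toy sketch: switch dispatch vs. "goto label" (computed goto) dispatch. */
    #include <stdio.h>

    enum { OP_INC, OP_DEC, OP_HALT };

    /* Plain switch: every opcode goes through one shared indirect jump. */
    static int run_switch(const unsigned char *code) {
        int acc = 0;
        for (;;) {
            switch (*code++) {
            case OP_INC:  acc++; break;
            case OP_DEC:  acc--; break;
            case OP_HALT: return acc;
            }
        }
    }

    /* Computed goto: each opcode body ends with its own indirect jump
     * through a table of label addresses (GCC/Clang &&label extension). */
    static int run_goto(const unsigned char *code) {
        static void *dispatch[] = { &&op_inc, &&op_dec, &&op_halt };
        int acc = 0;
        goto *dispatch[*code++];
    op_inc:  acc++; goto *dispatch[*code++];
    op_dec:  acc--; goto *dispatch[*code++];
    op_halt: return acc;
    }

    int main(void) {
        const unsigned char prog[] = { OP_INC, OP_INC, OP_DEC, OP_HALT };
        printf("%d %d\n", run_switch(prog), run_goto(prog));  /* prints 1 1 */
        return 0;
    }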
Also, "only on Haswell" won't apply for long. I give AMD 2 years to keep up with Intel's mighty branch prediction, if they haven't done so already. Mobile platform may be different given the energy requirements, though. I don't know.
So do you still write code for the 386 and 486 generations? There is a point where previous CPUs drop off as a concern. All bets are off if you do cutting-edge CUDA work: then you only care about the GPU you are running on today, because the one you were using yesterday is already outdated anyway.
Would I use the goto labels instead of the plain switch if I were writing an interpreter in C (that's the topic of the article!) for a language that I know will be executed on ARM, AMD, Atom, you name it? The answer is: you bet I'd use the goto labels. That one particular CPU generation doesn't get much speedup from the construct doesn't mean I want to make all the others significantly slower. It's not an excuse, and I'm not that lazy under these circumstances. If I didn't care that much about performance, I wouldn't even be using C.
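And you don't even have to pick one dispatch style per platform. Here's a hedged sketch of how an interpreter can keep both (the macro names are made up for illustration, not CPython's actual ones): compilers without the labels-as-values extension simply build the very same opcode bodies as an ordinary switch.

    #include <stdio.h>

    enum { OP_INC, OP_DEC, OP_HALT };

    /* Assumed convention: use computed goto only where the compiler
     * supports the extension, otherwise fall back to a plain switch. */
    #if defined(__GNUC__) || defined(__clang__)
    #  define USE_COMPUTED_GOTO 1
    #else
    #  define USE_COMPUTED_GOTO 0
    #endif

    static int run(const unsigned char *pc) {
        int acc = 0;
    #if USE_COMPUTED_GOTO
        static void *table[] = { &&L_OP_INC, &&L_OP_DEC, &&L_OP_HALT };
    #  define TARGET(op)  L_##op
    #  define DISPATCH()  goto *table[*pc++]
        DISPATCH();
    #else
    #  define TARGET(op)  case op
    #  define DISPATCH()  break
        for (;;) switch (*pc++) {
    #endif
        TARGET(OP_INC):  acc++; DISPATCH();
        TARGET(OP_DEC):  acc--; DISPATCH();
        TARGET(OP_HALT): return acc;
    #if !USE_COMPUTED_GOTO
        }
    #endif
    }

    int main(void) {
        const unsigned char prog[] = { OP_INC, OP_DEC, OP_INC, OP_INC, OP_HALT };
        printf("%d\n", run(prog));  /* prints 2 */
        return 0;
    }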
My point was that eventually the technique will fall out of use if newer generations don't benefit from it. So say X does no good on brand-new hardware today: we still do X because not all hardware is new, but the value of doing X will definitely diminish over time.
What is really annoying is when X is done to support an out-of-date CPU like a 386, to the detriment of CPUs that are actually currently in use. It does happen...
There's one caveat: advanced branch prediction is probably not free. It probably costs a bit of silicon, even some energy. It may or may not actually matter (I'm no hardware designer), but if it does, we may have to make some hard choices, like a few more cores vs badass branch prediction.
It may not matter on x86 specifically, where out-of-order stuff dominates anyway. But it might matter for something like the Mill CPU architecture, for which energy efficiency and not wasting silicon are very important.
I wouldn't be surprised if, in this particular case, it's even more about the patents and the possibility of having an advantage over the competition than about the silicon. Does anybody know if there is a patent? If there is, who holds it? And how much more would a CPU that implements that method cost?
If someone identifies an optimum (for performance, efficiency), all competitors are forced to go there to stay competitive. We already have more cores than we know what to do with, and better single-core performance is always appreciated, so badass branch prediction probably wins out IF they find that it is useful at all (these days, meaningful gains on a single core are difficult to eke out).
Is for you the "brand new hardware" your desktop which spends hundred of watts per hour or your mobile phone which has to survive the whole day without recharging his 5 Wh battery but still let you surf the web with all the stuff you take as given? Making optimizations that work good with lower wattage CPUs is a good thing, unless you limit yourself only to the non-battery devices.
Mobile architectures are typically only a tock away from desktop architectures, which aren't really guzzling energy with these optimizations (the deep Pentium 4 pipelines are behind us!). So as technology marches forward, there isn't some mobile version that is frozen in time just because of energy efficiency (indeed, good branch prediction also saves energy). There are no 386 or 486 CPUs to care about now, and P5 cores are limited to that Xeon Phi HPC product.