For what it's worth, this statistic is a little outdated, a little wrong, and not necessarily the one you care about.
A little outdated: It was done in 2013, and in the past couple of years the touch panels on most flagship Android devices have gotten significantly better. Even the linked article was comparing Apple's latest device to older flagship models. The Nexus 5, released a week after Touchmarks published their numbers, consistently gets about 70ms of latency, for example. The M8 reportedly gets around 50ms of latency, which is pretty astounding.
A little wrong: There were multiple issues with the Touchmarks benchmark. They reportedly "discovered an optimization in our iOS test app that was not present in our Android or Windows Phone test apps", they had known race conditions that could introduce additional delay on Android that were never fixed, etc.
Not necessarily the one you care about: That statistic measures the time from physical touch to visible response, but you only actually care about the time until the application receives the event, because that's the point at which you can kick off the network activity. Considering that the display side of it is ~48ms of latency, that's a fairly significant difference.
I'll edit the article when I get a chance.
For what it's worth, I agree with pretty much the rest of your post. Too often I see people start to complain about "premature optimization", but when you're trying to do something like hit a smooth 60fps animation, a lot of these things really matter. Profiling is great when you have hotspots, but too often these things are plagued by death by a thousand cuts.
I know many high-performance programmers and all of them profile because profiling is how you test your mental model against reality. Yes, as the author says, having a mental model of machine performance is important. But you need to test that against reality or you are guaranteed to be surprised in a big way, eventually.
Example: How does he even know that his div optimization matters? If he is reading through even one pointer in that time, he is probably taking a cache miss on that read, whose latency is going to completely hide an integer divide. The author doesn't seem to understand this, since he spends most of his time talking about instruction counts. Performance on modern processors is mostly determined by memory access patterns, and you can have all kinds of extra instructions in there and they mostly don't matter.
Which this guy would know if he profiled his code.
Sorry if I wasn't clear - I love profilers! CodeAnalyst in particular is my go-to choice for "quick, I need a sample histogram across my functions".
You're right that an example involving the memory subsystem would have been a good idea.
My two points are:
* It's possible to know something is on the latency critical path (e.g. div has ~20-cycle latency, but you can run ~2 in parallel) without needing a profiler. Just look at the data flow through your algorithm.
* When you begin an application, you should know your performance goals and approximately how you plan to hit them. If you end up building an application where you round-trip to the network six times to build your UI, you've just limited your best possible load time in Australia to over a second.
That's all. :)
p.s. I've never used that div optimization, though I think it's interesting.
I have done things "knowing" what the outcome would be only to be surprised, and I never would have known if I hadn't measured.
In the specific example of buffer-builder, I have already built up a (reasonably accurate) mental model of modern CPUs, and I knew what I wanted the generated code to look like.
Once I made the generated code look like I wanted, then I was not surprised to find that it outperformed existing libraries by 5x. :)
I suspect the alternative approach, "profile the existing libraries and optimize hot spots" would have taken a lot more time.
I too feel comfortable working with modern CPUs but after performance sensitive projects I benchmark and/or profile to identify what I didn't know. How else can you learn (after listening to all the experts and reading all the documentation)?
As for your feeling that it would have taken longer with the "alternative approach", I must again ask for numbers. How do you know which approach would take longer without taking measurements? Is that with you taking that approach, or an expert in that approach taking it? Are you an expert in that approach, yet humbly avoided saying so in the blog post?
I don't really see them as alternatives. Using all the knowledge you have up front is simply a good design strategy, but once that knowledge is exhausted you can get more through testing empirically.
If you're doing audio mixing, for instance, you probably have a thread that has to respond with samples within 1ms. Missing the window means catastrophic audio glitches. (it sounds terrible)
It's a mistake to write this in Ruby and expect the profiler to tell you something you don't already know.
My day job is basically performance optimization. Using a profiler is neither simple nor easy. In fact, profilers often don't tell you anything useful at all. They'll tell you sort of where you're spending your time, but they don't help at all in telling you why you're spending your time there.
Most of the major performance wins I find are from optimizing the architecture, not from optimizing a hot loop. I can't even think of the last time I've seen any measurable difference from optimizing a hot loop. Heck, I can't even remember the last time I found a hot loop.
Then again I work on things more like game engines and web browsers: huge systems with tens of thousands of lines of code on the hot path. Profilers don't help all that much here, and I can optimize in advance in less time than it takes to set up the profiler and "justify" the change.
* While throughput numbers increase over time, latency has only inched downwards. Thus, on most typical programs, you’re likely to find yourself latency-bound before being throughput-bound.
* A profiler is not needed to achieve the desired performance characteristics. An understanding of the problem, an understanding of the constraints, and careful attention to the generated code is all you need.
Yeah, I'm running WP Super Cache, which means the site is still up.
Do you have a recommended static site builder?
I'd be convinced only if you showed a benchmark with and without the trick. I still suspect it doesn't matter in the end. But the only way we'd know is if the author ran a benchmark. Which he refuses to do, because he is so sure of himself.
(1<<i)&3 for i in [0, 2]
The Haskell one is kind of a long story. In short, we love Haskell but a particular inner loop was destroying our performance. We were getting PHP-level performance in Haskell, where normally we can expect Java-level performance. So we took this inner loop (JSON encoding and URL encoding) and built BufferBuilder to solve it once and for all. Now JSON encoding and URL encoding are barely visible in the timings.
The div -> shift trick is worth knowing about, though I haven't had a reason to use it yet.
If a quick obfuscation in the name of performance can be explained away with a single comment, there's no need to jump over the FFI.
However, if you use an external library, or you are interfacing with C++ overloaded methods or classes, you have to write an interface inside an `extern "C"` wrapper which handles calling out to the C++ code.
Of course, there is always the option of running something like swig to create the C bindings for you.
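A minimal sketch of the wrapper pattern being described (the names here are hypothetical): C callers can't see overloaded C++ functions because each overload gets a mangled symbol, so you export one unmangled shim per overload.

```cpp
// Existing C++ API with overloads (illustrative, not a real library).
int area(int side) { return side * side; }
int area(int w, int h) { return w * h; }

// C-callable shims: extern "C" suppresses name mangling, so each
// overload needs its own distinct, unmangled symbol name.
extern "C" int area_square(int side) { return area(side); }
extern "C" int area_rect(int w, int h) { return area(w, h); }
```

A C (or FFI) caller then declares `int area_square(int);` and `int area_rect(int, int);` and links against the C++ object file; this is essentially what swig generates for you.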
We constrained our use of C++ to exclude the C++ standard library. It's basically C with stricter pointer conversion rules.
When I read this post, I suddenly realized why one's chosen programming language (or more precisely, the compilation approach and runtime environment) has such an impact on application startup time, though he only briefly touched on that. Think of the logic required to locate and load each module or class as it's first needed (e.g. Python or a typical Java or .NET runtime) versus just mapping the executable into memory and jumping to main (AOT-compiled native code, best if statically linked). Good luck if your application uses the former approach and it typically starts when a user's computer starts up, on a computer with a spinning disk.
There are so many layers of abstraction and so much to understand in modern computing that it is a Bad Idea(tm) to tell engineers to not profile.