

Thinking About Performance - GarethX
http://chadaustin.me/2015/04/thinking-about-performance/

======
mdwrigh2
> Well, good luck, because Android devices have touchscreen latency of 100+
> ms.

For what it's worth, this statistic is a little outdated, a little wrong, and
not necessarily the one you care about.

A little outdated: It was done in 2013, and in the past couple years the touch
panels on most flagship Android devices have gotten significantly better. Even
the linked article was comparing Apple's latest device to older flagship
models. The Nexus 5, released a week after Touchmarks published their numbers,
consistently has about 70ms of latency, for example. The M8 reportedly[1] gets
around 50ms of latency, which is pretty astounding.

A little wrong: There were multiple issues with the Touchmarks benchmark. They
reportedly "discovered an optimization in our iOS test app that was not
present in our Android or Windows Phone test apps", they had known race
conditions that could introduce additional delay on Android that were never
fixed, etc.

Not necessarily the one you care about: That statistic measures physical touch
down to visible response, but you only care about the time until the
application receives the event, because that's the point at which you can
actually kick off the network activity. Considering the display latency side
of it is ~48ms, that's a fairly significant difference.

[1]: [http://www.phonearena.com/news/Funky-metrics-HTC-One-M8-has-the-fastest-46ms-phone-display-touch-response-time-so-far_id54887](http://www.phonearena.com/news/Funky-metrics-HTC-One-M8-has-the-fastest-46ms-phone-display-touch-response-time-so-far_id54887)

~~~
chadaustin
Thanks! I searched for more recent numbers, but couldn't find them. Much
appreciated.

I'll edit the article when I get a chance.

~~~
mdwrigh2
Thanks for caring to fix it!

For what it's worth, I agree with pretty much the rest of your post. Too often
I see people start to complain about "premature optimization", but when you're
trying to do something like hit a smooth 60fps animation, a lot of these
things really matter. Profiling is great when you have hotspots, but too often
these things suffer a death by a thousand cuts.

------
jblow
This article seems goofy and weird. He spends a LOT of time randomly talking,
in order to justify not using a profiler, when profiling is such a simple and
easy thing.

I know many high-performance programmers and all of them profile because
profiling is how you test your mental model against reality. Yes, as the
author says, having a mental model of machine performance is important. But
you need to test that against reality or you are guaranteed to be surprised in
a big way, eventually.

Example: How does he even know that his div optimization matters? If he is
even reading through one pointer in that time, he is probably taking a cache
miss on that read, the latency of which is going to completely hide an integer
divide. The author seems generally to not understand this, since he spends
most of his time talking about instruction counts. Performance on modern
processors is mostly determined by memory patterns, and you can have all kinds
of extra instructions in there and they mostly don't matter.

Which this guy would know if he profiled his code.

~~~
chadaustin
Hi Jon. I'm certainly familiar with caches and memory optimizations. I also
know when I'm compute-bound (as in, the prefetcher is running ahead).

Sorry if I wasn't clear - I love profilers! CodeAnalyst in particular is my
go-to choice for "quick, I need a sample histogram across my functions".

You're right that an example involving the memory subsystem would have been a
good idea.

My two points are:

* It's possible to know something is on the latency critical path (e.g. div is ~20 cycle latency, but you run ~2 in parallel) without needing a profiler. Just look at the data flow through your algorithm.

* When you begin an application, you should know your performance goals and approximately how you plan to hit them. If you end up building an application where you round-trip to the network six times to build your UI, you've just limited your best possible load time in Australia to over a second.

That's all. :)

p.s. I've never used that div optimization, though I think it's interesting.

~~~
sqeaky
How do you know you are right if you didn't measure?

I have done things "knowing" what the outcome would be only to be surprised,
and I never would have known if I hadn't measured.

~~~
chadaustin
You have to measure at some point to build up your mental model. I run
experiments all the time.

In the specific example of buffer-builder, I have already built up a
(reasonably accurate) mental model of modern CPUs, and I knew what I wanted
the generated code to look like.

Once I made the generated code look like I wanted, then I was not surprised to
find that it outperformed existing libraries by 5x. :)

I suspect the alternative approach, "profile the existing libraries and
optimize hot spots," would have taken a lot more time.

~~~
sqeaky
With results that good it is surprising that you didn't document your
benchmark procedures and include them in your article.

I too feel comfortable working with modern CPUs but after performance
sensitive projects I benchmark and/or profile to identify what I didn't know.
How else can you learn (after listening to all the experts and reading all the
documentation)?

As for your feeling that the "alternative approach" would have taken longer, I
must again ask for numbers. How do you know which approach would take longer
without measuring it? Is that with you taking that approach, or with an expert
taking it? Are you an expert in that approach who humbly avoided saying so in
the blog post?

I don't really see them as alternatives. Using all the knowledge you have up
front is simply a good design strategy, but once that knowledge is exhausted
you can get more through testing empirically.

------
mkesper
TLDR:

* To hit your performance goals, you first need to define your goals. Consider what you’re trying to accomplish.

* While throughput numbers increase over time, latency has only inched downwards. Thus, on most typical programs, you’re likely to find yourself latency-bound before being throughput-bound.

* A profiler is not needed to achieve the desired performance characteristics. An understanding of the problem, an understanding of the constraints, and careful attention to the generated code is all you need.

------
BruceM
Since y'all killed his blog again:

[http://webcache.googleusercontent.com/search?q=cache:wjD83Ex...](http://webcache.googleusercontent.com/search?q=cache:wjD83ExB22YJ:chadaustin.me/2015/04/thinking-about-performance/&hl=en&gl=th&strip=1)

~~~
M8
It's not HN's fault, the blog is not performant :).

~~~
angersock
Should've picked a webscale language like Perl, Ruby, or JavaScript.

------
alain94040
_And then we got to "you have to run a profiler to know it matters." I contend
it’s possible to use your eyes and brain to see that a div is on the critical
path in an inner loop without fancy tools. You just have to have a rough sense
of operation cost and how out-of-order CPUs work_

I'd be convinced only if you showed a benchmark with and without the trick. I
still suspect it doesn't matter in the end. But the only way we'd know is if
the author ran a benchmark. Which he refuses to do, because he is so sure of
himself.

------
smitherfield
Interesting, but I have to wonder what the point of his project to reproduce
the performance of C++ with Haskell is if he's doing things like replacing

    
    
      (i+1)%3
    

with

    
    
      (1<<i)&3 for i in [0, 2]
    

It seems to me like it would be far more human-readable and human-maintainable
to simply write it in C++ in the first place.

~~~
falcolas
Any time you cross the FFI, you start losing things. Specific to Haskell, if
you hit the FFI into C++, you have to go through two translation layers
(Haskell to C, C to C++) in each direction, losing the ability to quickly
identify identify code paths. Not to mention many debuggers and profilers tend
to shrug their shoulders in ignorance when the FFI is crossed.

If a quick obfuscation in the name of performance can be explained away with a
single comment, there's no need to jump over the FFI.

~~~
GFK_of_xmaspast
I don't know haskell, but I would have expected the c++ end to have been
compiled with C linkage, why do you have to have the second jump?

~~~
falcolas
It would depend entirely upon whether you wrote all of the C++ code and thus
provided yourself with C wrappers inside an `extern "C"` block. In which case,
yes, it would be simpler (though still with a translation layer going from C
to C++).

However, if you use an external library, or you are interfacing with
overloaded C++ methods or classes, you have to write an interface inside the
`extern "C"` wrapper which handles calling out to the C++ code.

Of course, there is always the option of running something like SWIG to
generate the C bindings for you.

------
mwcampbell
Another insightful post from Chad on performance is this one:

[http://chadaustin.me/2009/02/logic-vs-array-
processing/](http://chadaustin.me/2009/02/logic-vs-array-processing/)

When I read this post, I suddenly realized why one's chosen programming
language (or more precisely, the compilation approach and runtime environment)
has such an impact on application startup time, though he only briefly touched
on that. Think of the logic required to locate and load each module or class
as it's first needed (e.g. Python or a typical Java or .NET runtime) versus
just mapping the executable into memory and jumping to main (AOT-compiled
native code, best if statically linked). Good luck if your application uses
the former approach _and_ it typically starts when a user's computer starts
up, on a computer with a spinning disk.

------
jconley
Modern CPUs are so complex that I'm not convinced one can reason about the
performance implications of micro-optimizations such as the div trick. Perhaps
the 0.001% of developers who specialize in and understand how said CPUs work
might be able to, but for the rest of us, we have profilers.

There are so many layers of abstraction and so much to understand in modern
computing that it is a Bad Idea(tm) to tell engineers to not profile.

