

Traversing a linearized tree - joaquintides
http://bannalia.blogspot.com/2015/06/traversing-linearized-tree.html

======
nkurz
I like the article, and I like that the author gives specific details
regarding the compiler and target machine, but trying to analyze performance
at this level based only on C++ source code seems like a hopeless task. It's
much easier if you remove the uncertainty of the compiler and analyze the
assembly the computer is actually running, and then determine the specific
factors actually governing performance.

Here's my strained analogy:

Assume that you wanted to write poetry in English with the goal of getting
great reviews from English speaking critics. But unfortunately, you don't
speak any English. So instead you write poems in your native language, and
then feed them through Google Translate. The critics read the machine
translation, and give their review as a single score. Based on that single
data point, you try to determine how best to modify your native language poem
to improve the score. Once you can no longer make progress, you publish your
native language poem and encourage others to test the results with different
critics and different machine translators.

Learning assembly is much easier than learning English. You still have to deal
with the differences between processors, but removing the layer of translation
makes optimization a much easier task. You can still write the final algorithm
in a higher level language, but with a target in mind for the output you want,
and with knowledge of the level of performance that is possible.

~~~
joaquintides
Hi, I totally agree with you that analyzing the assembly produced can provide
lots of insight. For the particular case of understanding cache friendliness,
though, I'm not so sure looking at the assembly can help that much, since
caching manifests itself only at run time. One has to learn about it in
indirect ways via measuring.

~~~
nkurz
_For the particular case of understanding cache friendliness, though, I'm not
so sure looking at the assembly can help that much_

While the assembly won't tell you the cache behavior, it gives you the
information to correctly reason about it. By contrast, the compiler is free to
substitute (and frequently will substitute!) a different algorithm than you
wrote. Or needlessly spill variables to the stack, or decide that undefined
behavior allows it to optimize out the lookup.

For example, if you write a loop within a loop that strides over data in a
particular pattern, modern compilers feel free to reverse the direction your
iterator progresses, or even interchange the two loops to stride in the manner
they think best. They will even detect common patterns and substitute a
completely different algorithm. Sometimes this means the actual instructions
and timing for your test cases will be identical even though the cache
behavior of your algorithm as intended should be different. In practice, this
optimization often works in your favor, but under analysis, it will lead you
down many false paths.
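As a concrete illustration of the kind of transformation being described (my
sketch, not code from the article): the two functions below compute the same
sum with different source-level stride patterns, and an optimizer that proves
them equivalent is free to interchange the loops of the second into the shape
of the first, erasing the cache difference you meant to measure.

```cpp
#include <cstddef>

// Unit-stride traversal of a row-major matrix: cache-friendly as written.
long sum_row_major(const long* m, std::size_t rows, std::size_t cols) {
    long s = 0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            s += m[r * cols + c];   // consecutive addresses
    return s;
}

// Column-first traversal: strides `cols` elements per access as written,
// but a compiler may interchange the loops and emit the same code as above.
long sum_col_major(const long* m, std::size_t rows, std::size_t cols) {
    long s = 0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            s += m[r * cols + c];   // addresses `cols` elements apart
    return s;
}
```

Only the generated assembly (or a counter measurement) tells you which shape
actually ran.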

 _since caching manifests itself only at run time. One has to learn about it in
indirect ways via measuring._

Or even better: modern processors have hardware counters that allow cache
hits and misses to be measured directly. On Linux, 'perf', Intel's 'VTune',
and 'likwid' are helpful for this. I don't know what the parallel tools are
for Windows, but presume they must exist.

But the part I'm stressing is that even with exact measurement of the number
of cache hits and misses, you should not (must not) assume that the code you
wrote in C++ is representative of the assembly that is actually being
executed. Compiling with -O0 can improve the alignment between source and
assembly, but at the cost of making measurements useless by dramatically
increasing function overhead and making excessive use of the stack.

I stand by the statement that if you want to benchmark the merits of an
implementation (rather than just its incidental performance on a particular
processor with a particular compiler with particular options), you should be
looking at the assembly rather than the source code. It's not that looking at
the source will always lead you astray, or that it gives you no insight, but
that reasoning about the source will mislead you much more frequently than
looking at the actual assembly being executed.

------
koverstreet
bcache uses this technique, for building up lookup tables for searching btree
nodes.

One thing I didn't see mentioned in this treatment is that it's possible to
directly compute, given the size of the tree and a position in the tree (i.e.
index into the array), the position of that node in an inorder traversal - and
it's fast. That can be quite useful (bcache uses it for pointer compression).

[http://evilpiepirate.org/git/linux-bcache.git/tree/drivers/md/bcache/bset.c#n278](http://evilpiepirate.org/git/linux-bcache.git/tree/drivers/md/bcache/bset.c#n278)

~~~
joaquintides
Thanks for the link. The inorder_next you refer to is equivalent to my
increment (even the structure of the code is the same), with the difference
that inorder_next relies on 1-based indices and uses the ffz intrinsic.
Definitely worth a try.

