
How We Made Our Face Recognizer 25x Faster - nickb
http://lbrandy.com/blog/2008/10/how-we-made-our-face-recognizer-25-times-faster/
======
alecco
Summary: cache miss problems. His approach is, first, to move decision data up
the tree to avoid walking into every child node, and second, to reorganize
nested loops to improve locality.

His loops could probably be improved further, unless the bits he leaves out (he
only gives pseudo-code) are the limiting factor.

There's a mention of OProfile, something I didn't know existed. It looks like a
very nice complement to Valgrind (Cachegrind in this scenario). Yay!

------
markessien
What 'cache' is he talking about there? The hardware memory cache, or did he
implement some type of local cache? And if he is talking about the hardware
caches, is there really enough memory there to cache the information needed
for face comparisons?

~~~
scott_s
The cache in the processor - whichever one is the final cache before going to
memory, which is the L2 or the L3, depending on the processor.

There almost certainly is not enough memory to cache _all_ of the information.
But hardware cache friendly algorithms have good locality. This means that
once they pull data into the cache, they do everything to it that they need to
before it gets kicked out. Cache unfriendly algorithms continually kick out
and pull in the same data.

The simplest example of this is iterating over a two-dimensional array in C.
Since C stores arrays in row-major format (the rows are contiguous in memory),
you want to iterate over the rows first, then the columns:

    
    
      int i, j;
      for (i = 0; i < ROW; ++i)
        for (j = 0; j < COL; ++j)
          matrix[i][j] = 0;
    

The opposite will result in many more cache misses:

    
    
      int i, j;
      for (i = 0; i < COL; ++i)
        for (j = 0; j < ROW; ++j)
          matrix[j][i] = 0;
    

In the second case, the inner loop fixes the column, and it's the row that is
changing. Since a two-dimensional matrix in C has contiguous rows, this means
that each load in the innermost loop will probably result in a cache miss. In
the first case, since the inner loop fixes the row and iterates over the
columns, the memory being accessed is contiguous, so the loop will consume all
of the data on a cache line before it gets evicted.
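You can measure the difference yourself with a pair of fill functions (sizes
here are arbitrary, and actual timings depend on your machine and compiler
flags):

```c
#include <time.h>

/* Illustrative benchmark: fill the same matrix in row-major and
 * column-major order and time each pass. */
#define ROW 4096
#define COL 4096

static int matrix[ROW][COL];

/* Row-major order: walks memory contiguously, one cache line at a time. */
static double fill_by_rows(void) {
    clock_t t0 = clock();
    for (int i = 0; i < ROW; ++i)
        for (int j = 0; j < COL; ++j)
            matrix[i][j] = 1;
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

/* Column-major order: strides COL * sizeof(int) bytes between accesses,
 * touching a different cache line on nearly every store. */
static double fill_by_cols(void) {
    clock_t t0 = clock();
    for (int i = 0; i < COL; ++i)
        for (int j = 0; j < ROW; ++j)
            matrix[j][i] = 2;
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

On typical hardware the column-order fill runs several times slower, though an
aggressively optimizing compiler may narrow the gap by interchanging the loops
itself.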

~~~
markessien
But he's using 8 cores, which are all addressing memory in parallel. Which of
the items being processed gets stored in this cache? When one processor takes
control of the memory bus, it's going to need data from a completely different
part of memory, no?

~~~
scott_s
Honestly, I'm not sure what you're getting at. So I'll try to explain what I
think might not be clear.

Hardware caches are stupid. They are a way to avoid hitting main memory every
time a load is issued; the idea is that if you use a bit of memory once,
you're likely to use it, and memory close to it, again. Cache lines are
usually replaced on a least-recently-used basis. I say they're "stupid"
because they have no knowledge of the algorithm being executed, and can't be
told "keep this in the cache." They only react to memory access patterns. If
you want something to stay in the cache, _use it_ , and don't use anything
else until you're done with it.

And just to be clear, at the hardware level abstractions like "items" don't
exist. It's all just bytes - and if you're dealing with caches, then it's
going to be cache lines, which are often something like 64 or 128 bytes. If I
access a memory location, the entire cache line it's on gets pulled into the
cache with it.
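One practical consequence (the layout below is hypothetical, not from the
article): if your "items" are structs and you only need one field, the other
fields ride along on the same cache lines. A struct-of-arrays layout packs the
field you actually want densely:

```c
/* Hypothetical layouts illustrating cache-line ride-along. */
#define N 1024

struct face {            /* array-of-structs: ids are 260 bytes apart */
    int  id;
    char pixels[256];
};

struct face_table {      /* struct-of-arrays: ids are 4 bytes apart */
    int  id[N];
    char pixels[N][256];
};

static long sum_ids_aos(const struct face *f) {
    long s = 0;
    for (int i = 0; i < N; ++i)
        s += f[i].id;    /* each id sits on a different cache line */
    return s;
}

static long sum_ids_soa(const struct face_table *t) {
    long s = 0;
    for (int i = 0; i < N; ++i)
        s += t->id[i];   /* roughly 16 ids per 64-byte line */
    return s;
}
```

Both functions compute the same sum; only the number of cache lines they
touch differs.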

Looking at the article, his iteration order was face1, face2, features. This
means that he was fixing the faces being compared, and changing which features
he was looking at. So face1 and face2 were (probably) staying in the cache,
while the features were not.

In his case, the features are stored in a tree - which means _feature n_ and
_feature n+1_ are not contiguous in memory. Going from one node in the tree to
another node will almost certainly result in cache misses. The first algorithm
he presented iterated over the entire tree for each face pair.
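A minimal sketch of why a tree walk is cache-hostile (all names here are
hypothetical, not from the article): nodes are typically separate allocations,
so following child pointers jumps around the heap:

```c
#include <stddef.h>

/* Hypothetical decision-tree node. Separately allocated nodes land at
 * scattered addresses, so each pointer chase below risks a cache miss. */
struct node {
    int feature;      /* which element of the face vector to test */
    int threshold;    /* split point; reused as the output value at a leaf */
    struct node *left, *right;   /* NULL at leaves */
};

static int descend(const struct node *n, const int *face) {
    while (n->left != NULL)      /* pointer chase down to a leaf */
        n = (face[n->feature] < n->threshold) ? n->left : n->right;
    return n->threshold;
}
```

An array laid out in traversal order would avoid most of those jumps, but
pointer-linked trees are the natural structure when nodes are built
incrementally.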

The second algorithm fixed the feature being looked at, and iterated over the
faces. So, for every node in the tree, the algorithm stopped and did all of
its work before moving on. So you have one tree traversal instead of N * M,
and significantly fewer cache misses.
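In C-like terms, the restructuring is a classic loop interchange. The names
below are hypothetical, not from the article; `evaluate()` stands in for one
descent of a feature tree:

```c
/* Hypothetical sketch of the loop interchange described above. */
#define N_FACES    4
#define N_FEATURES 3

static int score[N_FACES][N_FACES];

/* Stub standing in for descending one feature tree for a face pair. */
static int evaluate(int feature, int face_a, int face_b) {
    return feature + face_a + face_b;
}

/* Cache-unfriendly order: every face pair re-walks all the trees. */
static void score_pair_major(void) {
    for (int a = 0; a < N_FACES; ++a)
        for (int b = 0; b < N_FACES; ++b)
            for (int f = 0; f < N_FEATURES; ++f)
                score[a][b] += evaluate(f, a, b);
}

/* Cache-friendly order: each tree stays hot while every pair uses it. */
static void score_feature_major(void) {
    for (int f = 0; f < N_FEATURES; ++f)
        for (int a = 0; a < N_FACES; ++a)
            for (int b = 0; b < N_FACES; ++b)
                score[a][b] += evaluate(f, a, b);
}
```

Both orders compute identical scores; only the memory access pattern differs,
which is why the speedup comes for free algorithmically.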

Note that his solution doesn't necessarily generalize; the most cache friendly
approach depends on the relative sizes of the data. If the faces were
significantly larger, then it might be _less_ expensive to navigate through
the tree every time, keeping the faces in the cache. Optimizing algorithms for
good cache usage requires knowing how your data is laid out in memory, how
much of it you're dealing with, and how you're accessing it.

~~~
lbrandy
Author here. You almost have it completely right.

The features are not stored in a tree. They are an ARRAY of trees. So we hold
face1 and face2 constant, load up the tree for feature 1, descend it, record
the value. Then we load up the tree for feature 2, descend it, record the
value. Etc. If we have 1000 each of faceA and faceB, and 5000 features, that
means we'd do 5000 descents of -different- trees, then iterate to the next
face pair, a million times.

You are absolutely correct about the data sizes mattering in that the reason
this fails is because you cannot, over the course of a million matches, cache
all 5000 trees (given their size).

The new algorithm, however, held the tree for feature 1 fixed, and iterated
over the faces, which, in this case are simple vectors. This results in using
the -same- tree for 1 million consecutive comparisons, then switching trees
for feature 2, and so on.

In spite of this correction, pretty much everything you've said is accurate.
The primary difference is that this scenario is much, much worse on the cache
than the one you've described.

~~~
scott_s
Thanks for the clarification. I admit I was having to make assumptions because
I didn't know your data structures.

------
wastedbrains
Very cool write-up. I always find it interesting how small changes to code
can result in significant improvements, even when it leads to slightly less
natural code (losing the compare_faces(face1, face2) function).

------
st3fan
I like how the comments on that blog entry are more about the ethics of face
recognition.

