
Have You Tried Using a Nearest Neighbor Search? - gjstein
http://www.cachestocaches.com/2016/3/have-you-tried-nearest-neighbor/
======
GrantS
Very good to be reminded of this. The funny thing is that there was a brief
period, right before deep learning took over computer vision research but
after people realized the implications of big data, when it looked like the
future could be really simple: nearest neighbor search backed by enormous
data sets. See, for example, the famous "80 million tiny images" paper by
Torralba, Fergus, and Freeman from 2008.

PDF:

[http://people.csail.mit.edu/torralba/publications/80millionI...](http://people.csail.mit.edu/torralba/publications/80millionImages.pdf)

Web:

[http://groups.csail.mit.edu/vision/TinyImages/](http://groups.csail.mit.edu/vision/TinyImages/)

------
itodd
I'm currently trying to find similar strings (max hamming of 2) between two
very large (1e9) sets. Approximate Nearest Neighbor search makes this just
barely manageable. Annoy[0] from Spotify deserves a look for anyone who is
outgrowing traditional nearest neighbor algorithms.

[0] [https://github.com/spotify/annoy](https://github.com/spotify/annoy)
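
For anyone unfamiliar with it, building and querying an Annoy index looks
roughly like this (a minimal sketch with toy data; the 'hamming' metric
assumes a reasonably recent release and 0/1-valued vectors):

    from annoy import AnnoyIndex
    import random

    dim = 640                          # bits per fixed-length encoded string
    index = AnnoyIndex(dim, 'hamming')
    for i in range(1000):              # toy data; the real case adds ~1e9 items
        index.add_item(i, [random.randint(0, 1) for _ in range(dim)])
    index.build(50)                    # more trees -> better recall, bigger index

    query = [random.randint(0, 1) for _ in range(dim)]
    ids, dists = index.get_nns_by_vector(query, 10, include_distances=True)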

~~~
dalke
Assuming your strings are a lot larger than length 2, you might look at
locality-sensitive hashing.

If you are using fixed-length _bit_ strings, let me know, as there are other
ways to get sublinear search time for NxM searches. Handwave: order your
strings in M by popcount. If string n has popcount p, then you only need to
compare it to the strings m in M which have between p-2 and p+2 bits set.

You can subdivide the search space even further. And with newer processors,
the POPCNT instruction is your friend.
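
A minimal sketch of that popcount-binning idea, assuming each bit string is
stored as a Python int (the names here are mine, not dalke's):

    from collections import defaultdict

    def build_bins(M):
        """Group reference bit strings by popcount."""
        bins = defaultdict(list)
        for m in M:
            bins[bin(m).count("1")].append(m)   # int.bit_count() on Python 3.10+
        return bins

    def neighbors_within_2(n, bins):
        """Return members of M within Hamming distance 2 of n.

        If hamming(n, m) <= 2 then |popcount(n) - popcount(m)| <= 2,
        so only the bins p-2 .. p+2 need to be scanned.
        """
        p = bin(n).count("1")
        out = []
        for q in range(p - 2, p + 3):
            for m in bins.get(q, ()):
                if bin(n ^ m).count("1") <= 2:
                    out.append(m)
        return out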

~~~
itodd
They are between about 5 and 32 characters long. I have converted them to
bitstrings (one-hot encoded). They are grouped by length, so within each group
they are fixed-length. If you have more information on the algorithm you
describe, I would love to hear it. There are not many duplicates in a 1e9
sampling. This is a protein library with huge diversity, for what it's worth.
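
To make the encoding concrete, here is a rough sketch assuming the standard
20-letter amino acid alphabet (names are my own). Note that one-hot encoding
doubles distances: a character Hamming distance of 2 becomes a bit Hamming
distance of 4, which matters when applying bit-level tricks like the popcount
filter above.

    AMINO = "ACDEFGHIKLMNPQRSTVWY"
    POS = {a: i for i, a in enumerate(AMINO)}

    def one_hot_bits(seq):
        """Encode a protein string as an int with 20 bits per residue."""
        bits = 0
        for i, ch in enumerate(seq):
            bits |= 1 << (i * 20 + POS[ch])
        return bits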

~~~
Someone
Huh? I cannot reconcile that with what you wrote earlier: _" I'm currently
trying to find similar strings (max hamming of 2) between two very large (1e9)
sets."_

'Hamming' I only know as Hamming distance
([https://en.m.wikipedia.org/wiki/Hamming_distance](https://en.m.wikipedia.org/wiki/Hamming_distance)).

If that is what you are referring to, the first thing I would do is to divide
and conquer by splitting the sets of strings by length (Hamming distance isn't
defined for strings of unequal length)

After that, I would handle the larger strings by deriving 3 roughly
equal-sized string keys from them. For example, split strings of length 14
into parts of lengths 5, 5, and 4. If two of the strings have Hamming
distance <= 2, one of those derived keys must be equal in both strings by the
pigeonhole principle. For length 14, and assuming an ACGT alphabet (or am I
showing my ignorance here?), that divides and conquers by a factor of 256 or
1024 (4^4 or 4^5). Since the naive "compare all pairs" algorithm is O(nm)
comparisons for sets of size n and m, respectively, that should get you a
speed-up of at least 4 orders of magnitude.
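
A quick sketch of that bucketing idea, assuming two sets A and B of
equal-length character strings (identifier names are mine):

    from collections import defaultdict

    def pairs_within_2(A, B):
        """Yield (a, b) pairs with Hamming distance <= 2.

        Split each string into 3 roughly equal parts; by the pigeonhole
        principle, any pair within distance 2 agrees exactly on at least
        one part, so only strings sharing a part need to be compared.
        """
        L = len(next(iter(A)))
        cuts = [0, L // 3, 2 * L // 3, L]
        emitted = set()
        for k in range(3):
            lo, hi = cuts[k], cuts[k + 1]
            buckets = defaultdict(list)
            for b in B:
                buckets[b[lo:hi]].append(b)
            for a in A:
                for b in buckets.get(a[lo:hi], ()):
                    if (a, b) not in emitted and sum(x != y for x, y in zip(a, b)) <= 2:
                        emitted.add((a, b))
                        yield a, b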

Another approach might be to consider all possible pairs of replacements, and
check whether they could make the strings equal. For example, assume the
errors are that one A got replaced by a C and one T by an A. Then, do a
replace all in all strings for those replacements, look for duplicates in the
results, and then check the Hamming distances between the original strings
that became equal after making those replacements (I would guess it isn't
necessary to physically do the string replacements; adjusting the comparison
function during sorting may be faster)

Again assuming a small alphabet, that should be doable, as _all possible pairs
of replacements_ isn't that many pairs.

And of course, the two can be combined.

(It would be a bit more messy to implement and would give a bit less of a
speed up, but I think the gist of the above would work fine if you meant edit
distance instead of Hamming distance)

Edit: an alternative could be to split each string into trigrams ("abcde" =>
{"abc", "bcd", "cde"}), index them with some text search engine such as
Lucene, using no tokenizer, and then do a billion searches in that set.
Anything with few changes will end up high in the search results, and it is
fairly easy to determine hard bounds. At 10ms per search and 1E9 searches,
that would 'only' take 1E7 seconds, or about four months. It is easily
parallelized and distributed, though.
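
A toy stand-in for that Lucene setup, using a positional trigram inverted
index (my own sketch; for pure substitutions, 2 changed characters damage at
most 6 trigrams, so strings of length >= 9 always share at least one intact
positional trigram with their near-matches):

    from collections import defaultdict

    def build_index(corpus):
        """Map (position, trigram) -> ids of strings containing it."""
        index = defaultdict(set)
        for sid, s in enumerate(corpus):
            for i in range(len(s) - 2):
                index[(i, s[i:i + 3])].add(sid)
        return index

    def candidates(index, q):
        """Ids of strings sharing at least one positional trigram with q."""
        out = set()
        for i in range(len(q) - 2):
            out |= index.get((i, q[i:i + 3]), set())
        return out   # verify each candidate with an exact Hamming check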

~~~
nkurz
_Since the naive "compare all pairs" algorithm is O(nm) for sets of size n and
m, respectively, in comparisons, that should get you a speed up of a factor of
at least 4 orders of magnitude._

Could you flesh out that algorithm? Your "pigeonhole" insight is correct, but
I don't see it leading to a speedup, much less a 10000x speedup. In practice,
the limiting factor is likely memory access, and the length of the strings is
irrelevant in the size ranges we're discussing. But likely I'm not
understanding your approach.

~~~
Someone
Let's say you have two sets of 1E8 strings of length 9. The naive algorithm
compares all pairs, so it does 1E16 "check if Hamming distance <= 2" calls.

If you group them (first by "first 3 characters equal", then by "middle 3
characters equal", then by "last 3 characters equal"), then with a 20-letter
alphabet you will, in each case, get on the order of 1E3 groups with on the
order of 1E5 strings each. Comparing pairs within each group takes 1E10
"check if Hamming distance <= 2" calls, for a total of 1E13 such calls.

You have to do this 3 times, so this will give you about a factor of 300,
assuming the grouping is free (it isn't, of course, but sorting is O(n log n),
and this is a tiny bit simpler)

For longer strings, the speed up increases, though. Length 12 gives you 1E16
vs 10000 groups of 1E4 strings each, for 3 times 1E12 "check if Hamming
distance <= 2" calls. That gets you close to that 4 orders of magnitude.

Tangential: you have to implement your "check if Hamming distance <= 2" call
correctly. It shouldn't compute the full distance and then test
_" Hamming(s,t) <= 2"_, but bail out early. Also, you will call it at times
when you know the first/middle/last third of the characters to be equal, so
it should skip those characters (ideally, those shouldn't waste cache line
memory at all, but getting there may be too costly in setup time)
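
One possible shape for that check, as a small sketch (the skip range marks the
third already known to be equal from the grouping step, so it isn't
re-scanned):

    def hamming_leq_2(s, t, skip_lo, skip_hi):
        """True iff Hamming(s, t) <= 2, bailing out as soon as it can't be."""
        d = 0
        for i in range(len(s)):
            if skip_lo <= i < skip_hi:
                continue   # this block was the grouping key, already equal
            if s[i] != t[i]:
                d += 1
                if d > 2:
                    return False   # bail out early
        return True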

New idea: store both sets of strings in tries, iterate over all length-3
prefixes in the first trie, then walk the second one, pruning paths as soon as
you hit branches where you know the Hamming distance will be larger than two,
and peeking into the first trie to check whether pruning is possible. I think
that can be made to effectively do what the 'split it in 3 parts' trick does,
but smarter, and it may be faster. Implementing it will require some thinking,
though, and that thinking may lead to the conclusion that the idea doesn't
work.

------
leecarraher
I think your results are pretty strongly predicated on the fact that we have
immensely powerful computers these days, such that brute forcing is a viable
option in many casual real-world situations (detecting objects in a picture).
In fact, exact NN algorithms for high-dimensional embeddings converge to
linear-search performance as the dimensionality increases, so brute-force NN
is often a good choice. But that doesn't mean there aren't other problems
where the datasets are truly large and require fast query responses
(bioinformatics, GIS, anomaly detection...), which is where approximate
nearest neighbor search comes in. The Haar transform is useful if you don't
expect your object to ever rotate or significantly change size, but for more
robust image recognition I would suggest David Lowe's SIFT, which actually
spurred much of the approximate nearest neighbor work, since you didn't
necessarily need the closest match, just something that was near. Currently
the optimal approximate nearest neighbor search algorithm has complexity
O(n^rho + d log n), where rho is typically less than 0.37.

------
bpires
I don't think YOLO [0], the object detector he talked about, requires a
massive amount of data as he claimed. If you want to learn how to classify
1000 different categories like on ImageNet, then yes, you need a lot of data.
But if you're taking a pretrained network like YOLO (it was pretrained on
ImageNet and trained on Pascal), you don't need a lot of images. I've
retrained it with the KITTI dataset [1] and had no issues at all. They're
only 7k images. By the way, KITTI actually has a vehicles dataset that might
be helpful for your case. Also, you don't even need to retrain YOLO with your
vehicle dataset. It was trained on Pascal VOC [2], a dataset of 20 categories,
one of which is car. So YOLO already knows how to detect cars; it just might
not be ideal for your dataset, but that doesn't matter since you only want a
baseline to compare against. This would probably have been even less work
than training the cascade classifier you used, and would have achieved better
results.

[0]: [http://pjreddie.com/darknet/yolo/](http://pjreddie.com/darknet/yolo/)

[1]:
[http://www.cvlibs.net/datasets/kitti/](http://www.cvlibs.net/datasets/kitti/)

[2]:
[http://host.robots.ox.ac.uk/pascal/VOC/](http://host.robots.ox.ac.uk/pascal/VOC/)
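
Running a pretrained darknet YOLO model as an off-the-shelf car detector, the
way the comment above describes, looks roughly like this with OpenCV's DNN
module (a sketch only: the file names are placeholders for whatever
cfg/weights/names files you download from [0], it assumes a YOLOv3-style
output layout of [cx, cy, w, h, objectness, class scores...], and it needs a
reasonably recent OpenCV build):

    import cv2
    import numpy as np

    net = cv2.dnn.readNetFromDarknet("yolo.cfg", "yolo.weights")
    names = open("classes.names").read().split()
    car_id = names.index("car")

    img = cv2.imread("street.jpg")
    h, w = img.shape[:2]
    blob = cv2.dnn.blobFromImage(img, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)

    for out in net.forward(net.getUnconnectedOutLayersNames()):
        for det in out:
            scores = det[5:]
            cls = int(np.argmax(scores))
            if cls == car_id and scores[cls] > 0.5:
                # box coordinates are relative to the image size
                cx, cy, bw, bh = det[0] * w, det[1] * h, det[2] * w, det[3] * h
                print("car:", int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh))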

~~~
romaniv
_> They're only 7k images._

I feel like all those deep learning papers distorted people's perception of
scale. If you need to take those 7k images by hand because your application
domain is obscure and they aren't available in an existing dataset, that's
_way_ beyond what's feasible.

~~~
WiseWeasel
You could generate 7k images at a resolution of 4096x2160 pixels by walking
around the vehicle for just under four minutes while shooting 4k video at 30
FPS, something of which modern phones are capable.

~~~
romaniv
Yes, but how different would those 7K frames really be? Same lighting,
same background, same surrounding objects, the exact same condition of the
vehicle's interior and exterior, same quirks of the camera's color profile,
etc, etc. It would be an interesting experiment to actually try this, but I
have a feeling the results wouldn't be all that good. Point being, you
probably wouldn't get most of the benefits of deep learning and you might as
well use the same approach the author used.

~~~
WiseWeasel
As you walk around the vehicle and change the angle from low to high, the
lighting and background should vary a good amount.

~~~
fredophile
No, they won't. All of the pictures will share the lighting and weather
conditions of the time and place they were taken, and the same problem
applies to the background. If I want my neural network to identify the make
and model of cars, but every picture I have of a Mazda3 was taken at noon on
a sunny day in suburbia, then it is reasonably likely to train on the wrong
features and either identify trucks on sunny days in suburbia as Mazda3s or
fail to recognize a Mazda3 photographed on a rainy night.

~~~
WiseWeasel
A human might have difficulty recognizing a Mazda 3 on a rainy night as well.
You can adjust color temperature and white balance in post-processing, or film
a couple minutes at night too. Point is, generating 7k images is not
insurmountable, especially in this case with the criteria that it only has to
recognize a particular car.

------
liviu-
> I quickly discovered that such a system was overkill for me, and resorted to
> using an open source implementation of a simpler algorithm [1].

Maybe worth pointing out that the "simpler algorithm" they used seems to be a
cascading ensemble of adaptive boosting classifiers, a technique similar to
the ones used on Kaggle to win the big prizes. Maybe simpler than some neural
nets in some ways, but nothing close to the simplicity of nearest neighbour
search.

[1]
[http://docs.opencv.org/2.4/doc/tutorials/objdetect/cascade_c...](http://docs.opencv.org/2.4/doc/tutorials/objdetect/cascade_classifier/cascade_classifier.html)
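
For reference, using such a cascade from OpenCV looks roughly like this (a
sketch; "car_cascade.xml" is a placeholder for whatever cascade the author
trained, and the file names are mine):

    import cv2

    cascade = cv2.CascadeClassifier("car_cascade.xml")
    img = cv2.imread("frame.jpg")
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Each detection is (x, y, w, h); scaleFactor and minNeighbors trade
    # recall against false positives.
    for (x, y, w, h) in cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=4):
        cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imwrite("out.jpg", img)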

------
wodenokoto
In the Statistical Learning course, available online as a MOOC from Stanford,
they take it a step further:

"If it wasn't for the course of dimensionality we would only use nearest
neighbor search"

~~~
infodroid
> the course of dimensionality

dimensionality is surely a curse :)

