Hacker News
Have You Tried Using a Nearest Neighbor Search? (cachestocaches.com)
87 points by gjstein on April 13, 2016 | hide | past | favorite | 32 comments

Very good to be reminded of this. The funny thing is that there was a brief time, right before deep learning took over computer vision research but after people realized the implications of big data, when it looked like the future could be really simple: nearest neighbor search supported by enormous data sets. For example, the famous "80 million tiny images" paper by Torralba, Fergus, and Freeman from 2008.





I'm currently trying to find similar strings (max hamming of 2) between two very large (1e9) sets. Approximate Nearest Neighbor search makes this just barely manageable. Annoy[0] from Spotify deserves a look for anyone who is outgrowing traditional nearest neighbor algorithms.

[0] https://github.com/spotify/annoy

Assuming your strings are a lot larger than length 2, you might look at locality-sensitive hashing.

If you are using fixed length bit strings, let me know as there are other ways to get sublinear search time for NxM searches. Handwave: order your strings in M by popcount. If string n has popcount p then you only need to compare to the m in M which have between p-2 and p+2 bits set.
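The handwave above can be made concrete with a toy sketch (Python for clarity; the real thing would be much lower-level, and all names here are mine, not from any library). Bit strings are stored as ints; sorting M by popcount lets each query scan only the band where the popcounts differ by at most 2:

```python
# Toy sketch of popcount-band pruning for Hamming distance <= 2.
from bisect import bisect_left, bisect_right

def hamming(a, b):
    return bin(a ^ b).count("1")  # XOR, then count differing bits

def find_close_pairs(N, M, max_dist=2):
    # Sort M once by popcount so each query scans a narrow band.
    M_sorted = sorted(M, key=lambda m: bin(m).count("1"))
    pops = [bin(m).count("1") for m in M_sorted]
    pairs = []
    for n in N:
        p = bin(n).count("1")
        # Only m with popcount in [p - max_dist, p + max_dist] can match,
        # since popcount difference lower-bounds Hamming distance.
        lo = bisect_left(pops, p - max_dist)
        hi = bisect_right(pops, p + max_dist)
        for m in M_sorted[lo:hi]:
            if hamming(n, m) <= max_dist:
                pairs.append((n, m))
    return pairs
```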

You can subdivide the search space even further. And with newer processors, the POPCNT instruction is your friend.

> And with newer processors, the POPCNT instruction is your friend.

You probably are aware, but on newer processors with AVX2, using SIMD to count bits can be almost twice as fast as using builtin POPCNT. And if the move to AVX-512 ever happens, this advantage will probably grow. Wojciech has numbers here, although I think he may be slightly understating the advantage: http://wm.ite.pl/articles/sse-popcount.html

Yes. I saw that going around HN recently. I still have it open in a tab. But to me, that new hardware is still a dream.

I thought popcnt had a bug? I am just going off memory of reading some very in-depth blog posts on how it slowed down run times rather than speeding them up. I could have misremembered, though. Before I even finished this post I went and found it. Looks like it might only affect Haswell and Sandy/Ivy Bridge: http://stackoverflow.com/questions/25078285/replacing-a-32-b...

There is a bug that significantly reduces performance if certain register combinations are used, but the current generation of compilers have been patched to avoid triggering it. So it's something you should be on the lookout for, but if you are compiling your own code (or working directly in assembly) it's not a reason to avoid POPCNT.

If you are distributing high-performance code that might be compiled with unpatched compilers, you probably should take steps to mitigate. In this case using a snippet of inline assembly where you have control of the registers is probably the best solution, since even if you avoid explicitly using __builtin_popcount() the compiler might "optimize" your algorithm to use it anyway.

My own code is in assembly. Plus, when that came out I tested the different alternatives, and found that my hardware didn't have the problem. Perhaps it was too old.

I also did extensive comparisons between the POPCNT version and the nibble-based version using PSHUFB, and showed that the POPCNT version was always faster.

They are between about 5 and 32 characters. I have converted them to bit strings (one-hot encoded). They are grouped by length, so within a group they are fixed-length. If you have more information on the algorithm you describe, I would love to hear it. There are not many duplicates in a 1e9 sampling. This is a protein library with huge diversity, for what it's worth.

Swamidass and Baldi - http://www.igb.uci.edu/~pfbaldi/download/download.inc.php?pi... . With a followup in Nasr, Hirschberg and Baldi - Hashing Algorithms and Data Structures for Rapid Searches of Fingerprint Vectors http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2926297/ .

Both emphasize the Jaccard-Tanimoto similarity, which is a bit more complicated than the Hamming. The ideas are easily ported to a Hamming search.

The first paper uses the pre-computed popcount, like I mentioned. I don't recall if it also suggests sorting by popcount, but that's what I do. If both {N} and {M} are pre-sorted then you'll improve cache coherency, because the queries for |n|=i all search the same subrange of i-2<=|m|<=i+2. This is also easy to parallelize with OpenMP.

The second paper observes that you can precompute additional information. For example, consider the subset of M where |m| = 100. You could also compute the popcount of just the first half of each m, or the popcount of the even-numbered bits. This must also be within +/-2 of the popcount of the same subset of your query.

That is, if the first half of your query has 40 bits set (so the remainder has 60 bits set), then you can only match those {m} whose first half has between 38 and 42 bits set.

The downside of more subdivisions like this is the increased amount of index overhead and the increased number of if/branches. At some point it's faster to just do a brute force linear search.

On my laptop, with 1024-bit strings, I can do 1M x 1M tests (with a high threshold like what you suggest) within a few hours, given only the pruning from the first paper.

Send me an email (if my HN account name is ${U} then I'm ${U}@${U}scientific.com ) and with some sample data and I'll see if my code works for what you need, and give you a pointer to the code.

Huh? I cannot reconcile that with what you wrote earlier: "I'm currently trying to find similar strings (max hamming of 2) between two very large (1e9) sets."

'Hamming' I only know as Hamming distance (https://en.m.wikipedia.org/wiki/Hamming_distance).

If that is what you are referring to, the first thing I would do is to divide and conquer by splitting the sets of strings by length (Hamming distance isn't defined for strings of unequal length)

After that, I would handle the larger strings by deriving 3 about-equal-sized string keys from them. For example, split strings of length 14 into parts of lengths 5, 5, and 4. If two of the strings have Hamming distance <= 2, one of those derived keys must be equal in both strings by the pigeonhole principle. For length 14, and assuming an ACGT alphabet (or am I showing my ignorance here?), that divides and conquers by a factor of 256 or 1024. Since the naive "compare all pairs" algorithm is O(nm) for sets of size n and m, respectively, in comparisons, that should get you a speed up of a factor of at least 4 orders of magnitude.
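A toy sketch of that pigeonhole split (Python; all names are illustrative, and it assumes the strings have already been grouped by length as suggested above). Three parts suffice because at most 2 mismatches cannot touch all 3 parts:

```python
# Pigeonhole grouping: pairs within Hamming distance 2 must agree exactly
# on at least one of 3 parts, so only bucket-mates need a full check.
from collections import defaultdict

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def close_pairs_pigeonhole(A, B, max_dist=2):
    L = len(A[0])                      # all strings assumed the same length
    cuts = [0, L // 3, 2 * L // 3, L]  # 3 about-equal parts
    found = set()
    for i in range(3):
        lo, hi = cuts[i], cuts[i + 1]
        buckets = defaultdict(list)
        for b in B:
            buckets[b[lo:hi]].append(b)
        for a in A:
            # Only strings sharing this exact part are candidates.
            for b in buckets.get(a[lo:hi], ()):
                if hamming(a, b) <= max_dist:
                    found.add((a, b))
    return found
```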

Another approach might be to consider all possible pairs of replacements, and check whether they could make the strings equal. For example, assume the errors are that one A got replaced by a C and one T by an A. Then, do a replace all in all strings for those replacements, look for duplicates in the results, and then check the Hamming distances between the original strings that became equal after making those replacements (I would guess it isn't necessary to physically do the string replacements; adjusting the comparison function during sorting may be faster)

Again assuming a small alphabet, that should be doable, as all possible pairs of replacements isn't that many pairs.

And of course, the two can be combined.

(It would be a bit more messy to implement and would give a bit less of a speed up, but I think the gist of the above would work fine if you meant edit distance instead of Hamming distance)

Edit: an alternative could be to split each string into trigrams ("abcde" => {"abc", "bcd", "cde"}), index them with some text search engine such as Lucene, using no tokenizer, and then do a billion searches against that set. Anything with few changes will end up high in the search results, and it is fairly easy to determine hard bounds there. At 10ms per search and 1E9 searches, that would 'only' take 1E7 seconds, or about four months. It is easily parallelized and distributed, though.
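A pure-Python stand-in for that trigram idea (Lucene swapped out for a plain dict; all names are made up). The hard bound: each substitution destroys at most 3 trigrams, so a string within Hamming distance d of a length-L query still shares at least (L - 2) - 3d of the query's trigrams:

```python
# Toy trigram inverted index with a hard candidate bound, then exact verify.
from collections import defaultdict, Counter

def trigrams(s):
    return [s[i:i + 3] for i in range(len(s) - 2)]

def build_index(strings):
    index = defaultdict(set)
    for s in strings:
        for g in trigrams(s):
            index[g].add(s)
    return index

def query(index, q, max_dist=2):
    # Count shared trigrams per candidate.
    counts = Counter()
    for g in trigrams(q):
        for s in index[g]:
            counts[s] += 1
    # A true match shares at least len(q) - 2 - 3*max_dist trigrams.
    threshold = max(1, len(q) - 2 - 3 * max_dist)
    return [s for s, c in counts.items()
            if c >= threshold and len(s) == len(q)
            and sum(a != b for a, b in zip(q, s)) <= max_dist]
```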

> Since the naive "compare all pairs" algorithm is O(nm) for sets of size n and m, respectively, in comparisons, that should get you a speed up of a factor of at least 4 orders of magnitude.

Could you flesh out that algorithm? Your "pigeonhole" insight is correct, but I don't see it leading to a speedup, much less a 10000x speedup. In practice, the limiting factor is likely memory access, and the length of the strings is irrelevant in the size ranges we're discussing. But likely I'm not understanding your approach.

Let's say you have two sets of 1E8 strings of length 9. The naive algorithm compares all pairs, so it does 1E16 "check if Hamming distance <= 2" calls.

If you group them, first by "first 3 characters equal", then by "middle 3 characters equal", then by "last three characters equal", with a 20-letter alphabet you will, in each case, get on the order of 20^3 = 8E3 groups, each with on the order of 1E4 strings. Comparing pairs within the matching groups takes about 1E12 "check if Hamming distance <= 2" calls per grouping.

You have to do this 3 times, for roughly 3E12 calls in total, so this gives you a factor of a few thousand, assuming the grouping to be free (it isn't, of course, but sorting is O(n log n), and this is a tiny bit simpler)

For longer strings, the speed up increases, though. Length 12 gives you 1E16 naive calls vs about 20^4 = 1.6E5 groups of roughly 6E2 strings each, for 3 times about 6E10 "check if Hamming distance <= 2" calls. That gets you past 4 orders of magnitude.

Tangential: you have to implement your "check if Hamming distance <= 2" call carefully. It shouldn't compute Hamming(s,t) in full and then test "<= 2", but bail out as soon as the count exceeds 2. Also, you will call it at times when you know the first/middle/last third of characters to be equal, so it should skip those characters (ideally, those shouldn't waste cache line memory at all, but getting there may be too costly in setup time)
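One way that bail-out-early, skip-the-known-equal-part check could look (illustrative Python; parameter names are mine):

```python
# Early-exit Hamming check that skips a slice already known to be equal.
def within_dist(s, t, max_dist, skip_lo, skip_hi):
    # s[skip_lo:skip_hi] == t[skip_lo:skip_hi] is guaranteed by the grouping,
    # so those positions are never compared.
    d = 0
    for i in range(len(s)):
        if skip_lo <= i < skip_hi:
            continue
        if s[i] != t[i]:
            d += 1
            if d > max_dist:
                return False  # bail out early
    return True
```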

New idea: store both sets of strings in tries, iterate over all length 3 prefixes in the first trie, then walk the second one, pruning paths as soon as you hit branches where you know Hamming distance will be larger than two, and peeking in the first trie to check whether pruning is possible. I think that can be made to effectively do what the 'split it 3 parts trick does', but smarter, and may be faster. Implementing it will require some thinking, though, and that thinking may lead to the conclusion that the idea doesn't work.

I did mean Hamming. I group by length, as insertions and deletions are not expected. I convert to a one-hot vector so I can XOR to get the differences.

Thanks for the suggestions. I hadn't thought about splitting the strings into 3 roughly equal sizes though; that would definitely help narrow the search pool. FWIW, the strings use the protein alphabet (20 values instead of 4).

Do you have a suggestion about the case where the hamming distance of the nearest neighbor is unknown? My first thought is to start at the "distance-of-p" block and continue with the "p+i" and "p-i" block in a loop.
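That first thought could be sketched like this (Python, bit strings as ints, blocks keyed by popcount; everything here is hypothetical). The trick that makes it safe to stop is that the popcount difference lower-bounds the Hamming distance:

```python
# Expanding-ring nearest-neighbor search over popcount blocks.
def nearest_by_popcount(query, blocks):
    """blocks: dict mapping popcount -> list of bit strings stored as ints."""
    q = bin(query).count("1")
    best, best_d = None, float("inf")
    max_off = max(abs(q - p) for p in blocks)
    for delta in range(max_off + 1):
        if delta >= best_d:
            break  # no block this far out can hold anything closer
        for p in ({q} if delta == 0 else (q - delta, q + delta)):
            for m in blocks.get(p, ()):
                d = bin(query ^ m).count("1")
                if d < best_d:
                    best, best_d = m, d
    return best, best_d
```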

Yes, that is what my code does.


I applied a technique for quickly finding exact k nearest neighbors, or nearest neighbors within a specific Hamming distance, using research[0] which includes a library + source code. The largest dataset that I tested was 10M 256-bit codes, but the technique scales well with larger datasets. You can look at the performance figures I achieved in my paper at [1].

[0] http://www.cs.toronto.edu/~norouzi/research/mih/

[1] https://github.com/putterson/undergrad-thesis/raw/master/mai...

What are you running this on and how fast do you need it to be? You can brute force nearest-neighbours on a GPU pretty quickly (many times faster than approximate on a CPU).

I was able to do nearest-neighbour from each pixel in an image O(1e6) to a list of keypoint locations O(1e4) in about 50ms on an old GPU compared to minutes of CPU time.

It's a much smaller dataset, but still a simple way of getting a huge speedup over CPU alone. Also unlike Annoy or Flann, the results are guaranteed to be exact. (It's also incredibly simple to implement).

See for example:


EDIT: I saw below that you can split the strings by length which will shorten the time considerably.

Have you looked at the simhashing/minhashing techniques for that?

I think your results are pretty strongly predicated on the fact that we have immensely powerful computers these days, such that brute forcing is a viable option in many casual real-world situations (detect objects in a picture). In fact, exact NN algorithms for high-dimensional embeddings converge to linear-search performance as the dimensionality increases, so a brute-force NN is often a good choice. But that doesn't mean there aren't other problems where the datasets are truly large and require fast query responses (bioinformatics, GIS, anomaly detection...), which is where you get approximate nearest neighbor search. The Haar transform is useful if you don't expect your object to ever rotate or significantly change size, but for more robust image recognition I would suggest David Lowe's SIFT. That actually spurred much of the approximate nearest neighbor search space, since you didn't necessarily need the closest match, rather just something that was near. Currently the optimal approximate nearest neighbor search algorithm has complexity O(n^rho + d log n), where rho is typically less than 0.37.

I don't think YOLO [0], the object detector he talked about, requires a massive amount of data as he claimed. Yes, if you want to learn how to classify 1000 different categories like on ImageNet, you need a lot of data. But if you're taking a pretrained network like YOLO (it was pretrained on ImageNet and trained on Pascal), you don't need a lot of images. I've retrained it with the KITTI dataset [1] and had no issues at all. They're only 7k images. By the way, KITTI actually has a vehicles dataset that might be helpful for your case. Also, you don't even need to retrain YOLO with your vehicle dataset. It was trained on Pascal VOC [2], a dataset of 20 categories, one of which is car. So YOLO already knows how to detect cars; it just might not be ideal for your dataset, but you don't care anyway since you just want any solution to compare to as a baseline. This would probably have been even less work than training the cascade classifier you used, and would have achieved better results.

[0]: http://pjreddie.com/darknet/yolo/

[1]: http://www.cvlibs.net/datasets/kitti/

[2]: http://host.robots.ox.ac.uk/pascal/VOC/

> They're only 7k images.

I feel like all those deep learning papers distorted people's perception of scale. If you need to take those 7k images by hand because your application domain is obscure and they aren't available in an existing dataset, that's way beyond feasible.

You could generate 7k images at a resolution of 4096x2160 pixels by walking around the vehicle for just under four minutes while shooting 4k video at 30 FPS, something of which modern phones are capable.

Yes, but how different would those 7K frames really be? Same lighting, same background, same surrounding objects, the exact same condition of the vehicle's interior and exterior, same quirks of the camera's color profile, etc, etc. It would be an interesting experiment to actually try this, but I have a feeling the results wouldn't be all that good. Point being, you probably wouldn't get most of the benefits of deep learning and you might as well use the same approach the author used.

As you walk around the vehicle, and change the angle from low to high, the lighting and background should have a good variance.

No they won't. All of the pictures will have the lighting and weather conditions from the time and place the pictures were taken. The same problems will happen for the background. If I want my neural network to identify the make and model of cars, but every picture I have of a Mazda3 was taken at noon on a sunny day in suburbia, then it is reasonably likely to train on the wrong features and either identify trucks on sunny days in suburbia as Mazda3s or fail to recognize a Mazda3 photographed on a rainy night.

A human might have difficulty recognizing a Mazda 3 on a rainy night as well. You can adjust color temperature and white balance in post-processing, or film a couple minutes at night too. Point is, generating 7k images is not insurmountable, especially in this case with the criteria that it only has to recognize a particular car.

> I quickly discovered that such a system was overkill for me, and resorted to using an open source implementation of a simpler algorithm [1].

Maybe worth pointing out that the "simpler algorithm" they used seems to be a cascading ensemble of adaptive boosting algorithms, a technique similar to the ones used on Kaggle to win the big prizes. Maybe simpler than some neural nets in some ways, but nothing close to the simplicity of nearest neighbour search.

[1] http://docs.opencv.org/2.4/doc/tutorials/objdetect/cascade_c...

In the statistical learning course, available online as a MOOC at Stanford, they take it a step further:

"If it wasn't for the course of dimensionality we would only use nearest neighbor search"

> the course of dimensionality

dimensionality is surely a curse :)

In my field you could state that if it was not for the curse of dimensionality, we would use anything but MCMC.
