
Interesting data structures: the BK-tree - rudi-rau
http://signal-to-noise.xyz/post/bk-tree/
======
fake-name
I have a C++ + Cython implementation of a BK-tree that I use for large-scale
image deduplication on github: [https://github.com/fake-
name/IntraArchiveDeduplicator/tree/m...](https://github.com/fake-
name/IntraArchiveDeduplicator/tree/master/deduplicator)

The core is a single, header-only BK-tree implementation here:
[https://github.com/fake-
name/IntraArchiveDeduplicator/blob/m...](https://github.com/fake-
name/IntraArchiveDeduplicator/blob/master/deduplicator/bktree.hpp).

There are also relatively comprehensive unit-tests for the whole thing:
[https://github.com/fake-
name/IntraArchiveDeduplicator/tree/m...](https://github.com/fake-
name/IntraArchiveDeduplicator/tree/master/Tests)

I use it for image deduplication across a ~28M image dataset. It works quite
well, though it does take a lot of RAM to hold the entire dataset in memory
(~18 GB).

~~~
pokoleo
Out of curiosity: why not store hashes of the images instead of the entire
image?

Or even better: a bloom filter to find likely duplicates?

~~~
fake-name
The 18 GB is just for the hashes. The dataset itself is several terabytes.

A bloom filter is useless here, as it doesn't let you search within an edit
distance. Basically, I use a BK-tree of perceptual hashes
([https://en.wikipedia.org/wiki/Perceptual_hashing](https://en.wikipedia.org/wiki/Perceptual_hashing)).
Then I can find similar images by searching within a certain edit distance.

The end result is a system that functions very similarly to how Google's
reverse image search works.
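To make the mechanism concrete, here is a minimal sketch of a BK-tree keyed on hashes, with Hamming distance as the metric (an illustrative toy, not the linked C++ implementation; the hash values are made up):

```python
# Minimal BK-tree over small integer hashes (illustrative sketch).
# The metric is Hamming distance, as used for perceptual hashes.

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

class BKTree:
    def __init__(self):
        self.root = None  # node = [hash, {edge_distance: child_node}]

    def add(self, h):
        if self.root is None:
            self.root = [h, {}]
            return
        node = self.root
        while True:
            d = hamming(h, node[0])
            child = node[1].get(d)
            if child is None:
                node[1][d] = [h, {}]
                return
            node = child

    def search(self, h, radius):
        """All stored hashes within `radius` bit-flips of `h`."""
        results, stack = [], [self.root] if self.root else []
        while stack:
            node = stack.pop()
            d = hamming(h, node[0])
            if d <= radius:
                results.append(node[0])
            # Triangle inequality: only children whose edge distance
            # lies in [d - radius, d + radius] can contain matches.
            for edge, child in node[1].items():
                if d - radius <= edge <= d + radius:
                    stack.append(child)
        return results

tree = BKTree()
for h in [0b1010, 0b1011, 0b0000, 0b1111]:
    tree.add(h)
print(sorted(tree.search(0b1010, 1)))  # [10, 11], i.e. 0b1010 and 0b1011
```

The same structure scales to 64-bit hashes unchanged; only the stored values and the radius differ.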

~~~
jamra
I used a google paper's approach in a mosaic competition. I didn't end up
submitting it, but I got it working.

It uses locality-sensitive hashing to hash vectors into buckets, where each
bucket holds only a small subset of the total set of items. It worked on
similar small images when I used it, but I didn't have many images.

In my case, the vectors were just the RGB values of the down-sampled image.
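A sketch of the random-projection flavor of LSH applied to such RGB vectors (the hyperplane-sign variant here is an assumption on my part; the paper's exact scheme may differ, and the dimensions and vectors below are made up):

```python
# Random-projection LSH sketch (hyperplane-sign variant, an assumed
# scheme for illustration). Each random hyperplane contributes one
# bit to the bucket key, based on which side the vector falls on.
import random

random.seed(0)
DIM = 12          # e.g. a 2x2 down-sampled RGB image = 12 values
NUM_PLANES = 8    # more planes -> smaller, more selective buckets

planes = [[random.gauss(0, 1) for _ in range(DIM)]
          for _ in range(NUM_PLANES)]

def lsh_bucket(vec):
    """Hash a vector to a bucket key: one bit per hyperplane side."""
    key = 0
    for plane in planes:
        dot = sum(p * v for p, v in zip(plane, vec))
        key = (key << 1) | (dot >= 0)
    return key

# Near-duplicate vectors usually land in the same bucket, so only
# items within a bucket need pairwise comparison.
a = [100, 110, 90] * 4
b = [101, 111, 91] * 4
print(lsh_bucket(a), lsh_bucket(b))
```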

------
charles-salvia
There are many metric-based data structures, such as the M-tree, Vantage Point
Tree, etc. They all suffer from the same major problem: they rely on the
triangle inequality to speed up search by pruning impossible paths, but this
never yields an easily measurable complexity guarantee (like, say, the
guaranteed O(log N) of a balanced binary search tree). So it's very common for
certain datasets to cause severe performance degradation, particularly when
the data is high-dimensional.

Metric-based data structures face the peculiar problem that each data point is
not itself an actual "key", in the way that a data point in a hash table or
binary tree is a key. Rather, each data point in a metric tree is only
"indexed" in relation to its distance from other keys. This makes it hard to
get easily measured performance guarantees.

~~~
mshenfield
This issue also gets mentioned in the sklearn documentation [0]. The "curse of
dimensionality" describes how distances between high-dimensional points tend
to concentrate, making efficient indexing on a distance metric difficult.

I feel like the poster child for the frequency illusion. There was an article
on 538 a few weeks ago that caught my interest, which used cosine similarity
(a non-metric distance function). I dove in and started wondering if there
were any ways to find nearest neighbors efficiently using cosine similarity. A
week later, Facebook's Faiss implements something of a general-purpose toolkit
for indexing metric and non-metric spaces while minimizing storage. And now
this. _It's like the world is trying to tell me something!_

[0] [http://scikit-learn.org/stable/modules/neighbors.html#k-d-
tr...](http://scikit-learn.org/stable/modules/neighbors.html#k-d-tree)

~~~
yorwba
If you want to use the triangle inequality with cosine similarity, just use
angles instead:
[https://en.wikipedia.org/wiki/Cosine_similarity#Angular_dist...](https://en.wikipedia.org/wiki/Cosine_similarity#Angular_distance_and_similarity)

EDIT: Deleted a calculation about approximating arccos(z) <= arccos(x) +
arccos(y) by cos(arccos(z)) <= cos(arccos(x) + arccos(y)) = xy -
sqrt(1-x^2) sqrt(1-y^2), completely disregarding that cos is a decreasing
function in the relevant interval.
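A quick numerical sanity check that angular distance, i.e. arccos of the cosine similarity, does satisfy the triangle inequality (random vectors; the dimensions are made up):

```python
# Cosine similarity itself is not a metric, but the angle
# arccos(similarity) is, so triangle-inequality pruning applies to it.
import math
import random

random.seed(1)

def angular_distance(u, v):
    """Angle between two vectors, in radians."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = (math.sqrt(sum(a * a for a in u))
            * math.sqrt(sum(b * b for b in v)))
    # Clamp for floating-point safety before arccos.
    return math.acos(max(-1.0, min(1.0, dot / norm)))

# Check the triangle inequality on random triples of 5-dim vectors.
for _ in range(1000):
    x, y, z = ([random.gauss(0, 1) for _ in range(5)] for _ in range(3))
    assert (angular_distance(x, z)
            <= angular_distance(x, y) + angular_distance(y, z) + 1e-9)
print("triangle inequality held on 1000 random triples")
```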

------
benhoyt
Nice! I released a Python BK-tree module on PyPI recently, along with a write-
up about how I'm using it for duplicate image detection:

* [https://pypi.python.org/pypi/pybktree/1.0](https://pypi.python.org/pypi/pybktree/1.0)

* [http://tech.jetsetter.com/2017/03/21/duplicate-image-detecti...](http://tech.jetsetter.com/2017/03/21/duplicate-image-detection/)

------
mack73
Very interesting model, though it suffers from a couple of problems that other
data structures solve better, IMO. One thing that I don't understand (I guess
I have to go back and read the article again) is how the distance from any
node X to the arbitrarily selected root M tells me anything about the distance
between X and the query term Q. If I live in London and the distance from me
to Paris is 50,000 km, and you live in Rome, 70,000 km from Paris, that does
not give me enough to calculate the distance between Rome and London.

IMO there is no data structure better suited for a search tree than the
doubly-chained character trie [0], as it has the following features:

\- small in memory (every word is stored at most once, and because words share
prefixes, often less than once)

\- traversing and doing exact matching allows you to prune a large part of the
tree at every step

\- requires you to calculate the distance only at query time, not when
constructing the tree

\- apart from exact and fuzzy matches, it also supports prefix matches
(traverse to the end of the prefix path, then return all descendants that are
EndOfWords)

\- can be represented on disk in a way that makes reading only insignificantly
slower than reading from memory

\- if nodes are lexically sorted when the tree is constructed, you have also
pretty much solved range queries

[0]
[https://github.com/kreeben/resin/blob/master/src/Resin/IO/Lc...](https://github.com/kreeben/resin/blob/master/src/Resin/IO/LcrsTrie.cs)
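A minimal sketch of a character trie with exact and prefix matching (a plain dict-based trie, not the doubly-chained LCRS layout of the linked Resin code; the words are made up):

```python
# Minimal character trie: exact matching plus prefix matching
# ("traverse to the prefix, return all EndOfWord descendants").

class Trie:
    def __init__(self):
        self.children = {}
        self.end_of_word = False

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.end_of_word = True

    def contains(self, word):
        """Exact match: word must end on an EndOfWord node."""
        node = self._walk(word)
        return node is not None and node.end_of_word

    def with_prefix(self, prefix):
        """All stored words starting with `prefix`."""
        node = self._walk(prefix)
        if node is None:
            return []
        out, stack = [], [(node, prefix)]
        while stack:
            n, word = stack.pop()
            if n.end_of_word:
                out.append(word)
            for ch, child in n.children.items():
                stack.append((child, word + ch))
        return sorted(out)

    def _walk(self, s):
        node = self
        for ch in s:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

t = Trie()
for w in ["car", "card", "care", "dog"]:
    t.insert(w)
print(t.contains("car"))     # True
print(t.with_prefix("car"))  # ['car', 'card', 'care']
```

Note how "car", "card", and "care" share a single "car" path, which is where the "often less than once" storage comes from.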

The trie is the perfect tool for when you find yourself combating big data.
What you eventually find out about all of the cloud offerings from the big
players, which claim they will solve both storing and map/reducing over your
data, is that you can do the same thing on your laptop, for cheap, as long as
the data is represented in a compressed form that still allows for querying.
The trie is a zip file and a search engine in one.

Edit: formatting and clarification

~~~
hexane360
>If I live in London and the distance from me to Paris is 50,000 km, and you
live in Rome, 70,000 km from Paris, that does not give me enough to calculate
the distance between Rome and London.

Yes, but you do know that the distance from London to Rome is at most 120,000
km (and at least 20,000 km), because of the triangle inequality. So if the
search radius excludes that whole range, the search path is discarded.

Then we just start the process over again, by comparing each subtree back to
the original location.

~~~
mack73
Thanks, the idea behind the BK-tree is ingenious.

I'm struggling to find a use case for the data structure, though. Why would
you construct a BK-tree that only becomes powerful once it contains millions
of words (at which point representing that much data in memory becomes a
nuisance, making it not so fast anymore), when you could represent the same
data in a compressed form with the same, or even an extended, set of querying
capabilities?

Perhaps BK-trees are for big machines with powerful CPUs? I'm sure there is a
setup that would make the tree in fact better than any other tree.

~~~
rocho
I don't think the best use case for BK-trees is spell-checking and words. The
area in which they are used most successfully is image deduplication. In that
case the metric you're going to use is some form of perceptual hashing.

------
adrianratnapala
The article explains why the search algorithm is valid. But it doesn't explain
why the tree should be built up in the way described.

Is there some rationale? And can it be extended to non-integer metrics?

~~~
mshenfield
[https://en.wikipedia.org/wiki/BK-tree](https://en.wikipedia.org/wiki/BK-tree)

    A BK-tree is... specifically adapted to discrete metric spaces.

There are others with a similar principle for real-valued metrics:

• Ball Tree (related to the VP-Tree; implemented in sklearn)

• GH-Tree (rough description in this large PDF:
[https://github.com/searchivarius/nmslib/raw/master/docs/manu...](https://github.com/searchivarius/nmslib/raw/master/docs/manual.pdf))

• Multi-Vantage Point Tree

As for the rationale, I think grouping points by their distance to the node is
an efficient way to exploit the triangle inequality in each subtree.

------
scentoni
[https://en.wikipedia.org/wiki/BK-tree](https://en.wikipedia.org/wiki/BK-tree)

------
UhUhUhUh
I like the reference to a metric, which I always felt could be a useful
approach to many problems, including in AI. Re spell-checking, I wonder
whether there is an algorithm that uses keyboard topography (i.e., the
physical vicinity of keys) to assist with the correction...

~~~
charles-salvia
There is... a crude spell checker could be implemented by dumping a dictionary
of words into a metric tree and using Levenshtein distance (or
Damerau-Levenshtein distance) as the distance function. The spell check would
then consist of doing a k-nearest-neighbor lookup.
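A minimal sketch of that crude approach, with a plain linear scan standing in for the metric-tree lookup (the dictionary words below are made up):

```python
# Crude spell checker: Levenshtein distance as the metric, with a
# linear scan standing in for the metric-tree k-NN lookup.

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def suggest(word, dictionary, k=3):
    """The k nearest dictionary words by edit distance."""
    return sorted(dictionary, key=lambda w: levenshtein(word, w))[:k]

dictionary = ["hello", "help", "held", "world", "would", "word"]
print(suggest("helo", dictionary, k=2))
# ['hello', 'help'] -- both at distance 1; ties keep dictionary order
```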

But it doesn't work as well as you might expect. A metric tree requires a
_true_ metric as the distance function, which means the distance function must
satisfy both symmetry and the triangle inequality. Levenshtein distance
satisfies this, but unfortunately that usually isn't enough for a robust
spell-checker that just "does the right thing". For that, you usually need
something like a weighted Levenshtein distance (e.g., an incorrect first
character should weigh more than an incorrect Nth character, or an incorrect
consonant should weigh more than an incorrect vowel). But it is much harder to
guarantee that such a distance function is a true metric.

~~~
contravariant
Well, for a spell checker it's not too unreasonable to require the triangle
inequality: if there's some series of possible typos that gets from word A to
word B, then the distance between A and B shouldn't be more than the total
'edit distance' of those typos, however you measure it.

Making things symmetrical might not make as much sense though; omitting a
letter might be far more likely than adding an additional letter, for
instance. Assuming the function satisfies all the other axioms, you should be
able to turn it into a true metric by just measuring the distance both ways
and taking the average.

Then again it looks like the BK-tree might not need the symmetry axiom, at
least not in an obvious way.
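A small illustration of that averaging trick, using a deliberately asymmetric weighted edit distance (the costs are made up for demonstration):

```python
# A weighted edit distance where deletions cost 1 but insertions
# cost 2 is asymmetric; averaging d(a, b) and d(b, a) restores
# symmetry without breaking the triangle inequality.

def weighted_edit(a, b, ins=2, dele=1, sub=1):
    """Edit distance from a to b with asymmetric operation costs."""
    prev = [j * ins for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        cur = [i * dele]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + dele,
                           cur[j - 1] + ins,
                           prev[j - 1] + (sub if ca != cb else 0)))
        prev = cur
    return prev[-1]

def symmetric(a, b):
    """Symmetrized distance: the average of both directions."""
    return (weighted_edit(a, b) + weighted_edit(b, a)) / 2

print(weighted_edit("abc", "ab"), weighted_edit("ab", "abc"))  # 1 2
print(symmetric("abc", "ab") == symmetric("ab", "abc"))        # True
```

Averaging preserves the triangle inequality because it holds for each direction separately and the average of two valid bounds is still a valid bound.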

------
mack73
The idea of representing a large corpus in a compressed form and learning both
concepts and semantics about that corpus is fascinating to me. A tree like the
one described in the article measures semantic distance, much like a character
trie does. What data structure supports storing or calculating conceptual
distances? I've read about word2vec and the mechanics behind the training
process, but never how that looks in memory or when serialized to disk.

Also, is there a link between the semantic and conceptual distance between two
terms, does anyone know?

It would surprise me if there were such a correlation, because the concepts of
"near" are different in the semantic and conceptual worlds. Then again, it
would be fascinating if there were.

------
jwatte
Minor quibble: It says std::deque is a doubly linked list, but that's what
std::list is.

FWIW, std::deque is actually more like a vector.

~~~
rocho
Mhm, maybe in C++ a deque is a vector of vectors (in libstdc++ it's just a
double pointer to T, IIRC). In Python it's a doubly-linked list of blocks:
[https://github.com/python/cpython/blob/master/Modules/_colle...](https://github.com/python/cpython/blob/master/Modules/_collectionsmodule.c#L24)

------
DroidX86
Very nicely written post. It's always great to know there are data structures
for solving most problems efficiently.

~~~
socmag
Yes, interesting stuff. Seems similar in nature to "Vantage Point" trees.

[http://stevehanov.ca/blog/index.php?id=130](http://stevehanov.ca/blog/index.php?id=130)

~~~
fizwhiz
All the images under the "How searching a VP-Tree works" are broken :/

~~~
socmag
oh bummer, it was working six months ago...

Wayback machine to the rescue!

[https://web.archive.org/web/20160629190654/http://stevehanov...](https://web.archive.org/web/20160629190654/http://stevehanov.ca/blog/index.php?id=130)

------
aryehof
Data structures and algorithms were in general use, a textbook staple, a few
years ago. Why the current-day renaissance and emphasis?

~~~
mack73
Because today we're wondering a whole lot about how to represent big data in
memory: over many machines, if your problem is divisible that way, or on one
machine if possible. What we want to do is draw conclusions from that data by
querying it. If we find a way to store big data cheaply, in a form where it is
still queryable, then we have solved a problem most businesses grapple with
today: what do I do with all this data?

Businesses are getting crushed under the weight of their data. They "solve" it
by stashing it onto Hadoop. Unfortunately, no one then knows how to map/reduce
over that data, so they're none the wiser, even though they kept the data that
could have made their analysts smarter than their competitors'. This is of
course totally awesome if you're a Hadoop consultant.

Well, that's at least why I dwell on data structures.

Edit: spelin

~~~
aryehof
I suspected so, but couldn't have stated it nearly so well.

------
x4nt
I had a similar idea last week in the car. Cool to see it's actually a thing.

------
wlesieutre
Why did this get [flagged]? Is that solely user reports? It seems like a good
link.

~~~
Retric
At work I got: Cyber Security - Web URL Blocked

Which might be related?

~~~
rocho
I'm the author of the post. Could you help me figure out what is triggering
the warning?

~~~
Retric
I don't get a lot of info, but the URL (*.xyz) is kind of odd and might be the
issue?

~~~
rocho
Weird, I don't see why a whole TLD would be blocked. Even Alphabet uses it!

I chose it because .io was way too expensive for my student budget and xyz
looked cool.

~~~
flyinghamster
I hate to say this, but on the server I manage at my day job, I've never, ever
seen legitimate email traffic from an .xyz domain, or a .club, .rocks, .work,
or any other "gold rush" TLD. Nothing but snowshoe spam/phishing/malware, to
the point that my policy is "block on sight."

