
ScaNN: Efficient Vector Similarity Search - polm23
https://ai.googleblog.com/2020/07/announcing-scann-efficient-vector.html
======
syllogism
A practical tip: try hard not to need this sort of thing, and you'd be
surprised how often you can avoid it.

Before you start looking at approximate nearest neighbors, ask yourself the
following:

1. Do I know all of the vectors ahead of time? (i.e. query vectors are not
constructed at runtime; it's a fixed inventory of items being embedded)

2. Do I need to return an unlimited number of neighbors per query? (i.e. do
you ever need more than a few thousand neighbors?)

If the answer to the first is "yes" and to the second "no", then you should
probably just precompute the neighbors exactly on one or more GPUs. It's
feasible for millions of terms, and may actually be faster than the
index-building step of many ANN algorithms. Example code:
[https://github.com/explosion/sense2vec/blob/master/scripts/06_precompute_cache.py](https://github.com/explosion/sense2vec/blob/master/scripts/06_precompute_cache.py)
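
The idea in miniature, as a hedged numpy sketch (not the linked script; for
millions of rows you'd swap the matrix multiply onto a GPU library like CuPy
or PyTorch):

```python
import numpy as np

def precompute_neighbors(vectors, k=150, batch_size=1024):
    """Exact top-k cosine neighbors for every row, computed batch by batch."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    n = len(unit)
    ids = np.empty((n, k), dtype=np.int32)       # 32-bit neighbor ids
    scores = np.empty((n, k), dtype=np.float32)  # 32-bit similarity scores
    for s in range(0, n, batch_size):
        sims = unit[s:s + batch_size] @ unit.T   # exact similarities for this batch
        top = np.argpartition(-sims, k, axis=1)[:, :k + 1]  # top k+1, unordered
        for r, cand in enumerate(top):
            cand = cand[cand != s + r][:k]       # drop the self-match
            order = np.argsort(-sims[r, cand])   # sort the survivors
            ids[s + r] = cand[order]
            scores[s + r] = sims[r, cand[order]]
    return ids, scores
```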

A cache of k neighbors for vectors of dimension d will be k/d the size of the
original vector data, because a 32-bit integer neighbor id occupies the same
space as one 32-bit float component. So if your vectors are 300 dimensions,
you can save 150 neighbors and serve the results with half the space of the
original data. The responses are then instant and exact, and you'll never have
problems with load management because the runtime is so cheap. Example demo:
[https://explosion.ai/demos/sense2vec](https://explosion.ai/demos/sense2vec)

Again, this doesn't work for queries you have to construct on the fly: for
instance, it won't work if you need to construct a vector that incorporates
information from the user's session. It only works if constructing the vectors
is already a batch job, e.g. embedding the user's history. It also won't work
well if you have to page through arbitrary amounts of results, as storing more
than a few thousand neighbors per query would get impractical. Generally,
though, similarity accuracy degrades to uselessness long before that point
(especially if you're using an approximation technique).

------
jamiedamien
How is this similar to / different from the Johnson-Lindenstrauss lemma?
([https://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_lemma](https://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_lemma))

The principle seems very similar: find a simpler representation that preserves
distance.

~~~
jointpdf
The JL lemma is essentially an error bound, which reassures us that
dimensionality reduction (compressing big vectors into small vectors) is not a
lost cause. More specifically, it says that even if you perform a random
projection (the most naive DR method) to compress a high-dimensional data
matrix to a low-dimensional one (i.e. same number of rows, but many fewer
columns), it will only distort the pairwise Euclidean distances between points
(row vectors of the data matrix) by _$this_much_.

Example: if you crudely crunch MNIST from a [60000 x 784] matrix down to a
[60000 x 15] matrix (15 is arbitrary here; anything smaller than the original
784 works) using random projection, the JL lemma tells you how badly you
mangled the similarity structure of your dataset in the worst case.
It turns out it’s not so bad, so we can use DR to compress our data matrix and
then confidently do useful things with the vectors it contains (like search
for similar ones).
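
You can watch the bound in action with a quick empirical sketch (the target
dimension of 64 and the Gaussian projection are arbitrary choices here, not
part of the lemma's statement):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 784))              # stand-in for a data matrix
R = rng.normal(size=(784, 64)) / np.sqrt(64)  # random projection down to 64 dims
Y = X @ R

i, j = rng.integers(0, 1000, size=(2, 500))   # random pairs of rows
m = i != j                                    # avoid zero-distance self-pairs
orig = np.linalg.norm(X[i[m]] - X[j[m]], axis=1)
proj = np.linalg.norm(Y[i[m]] - Y[j[m]], axis=1)
print((proj / orig).min(), (proj / orig).max())  # ratios stay in a band around 1
```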

ScaNN is an _approximate nearest neighbor (ANN) search_ algorithm. There is an
entire zoo of them, but they all have the same goal: given a query vector _q_
and a data matrix _X_ (where _q_ has the same number of elements as a row
vector of _X_), find the _k_ vectors in _X_ that are "most similar" to _q_,
without exhaustively checking all the possible matches. "Most similar" could
mean any similarity or distance measure, but in this case it's the simple
inner product.

To do this more quickly, the algorithm uses a type of vector quantization
(i.e. it sorts the high-dimensional vectors into buckets), which is similar to
k-means. (K-means can itself be construed as a dimensionality reduction
technique; you are reducing each vector to a single categorical dimension, its
cluster id.) That way you're only looking in buckets where there's hope of
finding close matches.

This works great in real life with socks—if you’re trying to find a match for
a stray black dress sock, then 1) make sure you presort your socks so that
they are approximately binned by similarity and 2) look in the bin that
contains socks most similar to your “query sock” (i.e. don’t look in the
colorful athletic sock or plain white dad-sock bins). I'm not actually
kidding: if you have a billion unmatched socks in a box, try this and you'll
see.

Anyways... as for this particular algorithm, someone else may be able to
explain the details better. But my coarse understanding is that the loss
function (which measures how good the bucketing is, i.e. how purely you
pre-binned your socks) is defined in a way that accounts for the density of
the region in question: a neighbor in NYC means something different than a
neighbor in Wyoming, so you need more granular buckets for denser regions.
However, I thought this was essentially how almost every ANN algorithm worked,
so how this one blows all the others out of the water (faster for a given
recall value) is mystifying to me.

[PS—someone please correct me if any aspect of my loosey goosey explanation is
inaccurate]

------
contravariant
In case anyone else was confused about why they don't just project onto q
first if they want <q,x'> as close as possible to the original <q,x>:
apparently q is a random query vector, and their research shows how to define
an alternative quantization loss that weighs the quantization error more
heavily when |<q,x>| is high [1].

It's worth noting that without this additional weighting you end up with the
Euclidean distance, under the assumption that q is distributed symmetrically.

[1]: [https://arxiv.org/abs/1908.10396](https://arxiv.org/abs/1908.10396)
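
For the curious, here is my reading of the loss in that paper, as a hedged
sketch (the weights h_par and h_orth are illustrative constants here; in the
paper they fall out of the assumed query distribution):

```python
import numpy as np

def anisotropic_loss(x, x_quantized, h_par=10.0, h_orth=1.0):
    """Penalize quantization error parallel to x more heavily, since that is
    the component that perturbs <q, x> for queries that score highly against x."""
    r = x - x_quantized               # quantization residual
    x_unit = x / np.linalg.norm(x)
    r_par = (r @ x_unit) * x_unit     # residual component along x
    r_orth = r - r_par                # residual component orthogonal to x
    return h_par * (r_par @ r_par) + h_orth * (r_orth @ r_orth)
```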

------
biophetik
I'll definitely have to try this out; from the benchmarks the results look
impressive. In the documentation I noticed a dependency on TensorFlow. I get
why it's needed, but it would have been nice to find a way to do without it.
As an Annoy user, I can say Annoy is very easy to use, from install to the
API. Looking forward to the continued discoveries in approximate nearest
neighbor tech.

------
PaulHoule
An interesting result in an area where there are many obvious things you'd
think would help but don't.

~~~
wokwokwok
Unrelated, but I feel strangely sad reading this, agreeing, then going...

Wait, that’s a very arbitrary and general statement. Is this a bot?

(I know (hope, given I’ve read your comments before... :) you’re not, but
still, a sign of the times I guess...)

------
nseggert
This is definitely great foundational work, but if you click through to
[http://ann-benchmarks.com/](http://ann-benchmarks.com/), you can see that
they show dominant performance on the glove-100 benchmark, but decidedly
middling performance on the other benchmarks for which they provide results
(glove-25 and lastfm-64). I don't see this doing anything to unseat HNSW as my
go-to algorithm for its strong performance across a wide variety of
benchmarks.

------
thomasahle
Personally I'm waiting for this to get added to
[http://ann-benchmarks.com/](http://ann-benchmarks.com/) to see how it
measures up across datasets.

Edit: Looks like it's already there for some datasets. Interestingly, it is
primarily strong on the glove-100 dataset. I wonder if there are parts of
ScaNN that (accidentally) exploit structure specific to word vectors.

~~~
jointpdf
Look again, it’s in there.

------
Immortal333
This is great. It's already better than FLANN and Annoy (by Spotify):
[https://github.com/erikbern/ann-benchmarks#evaluated](https://github.com/erikbern/ann-benchmarks#evaluated)

Still, we are very far from creating a reverse video search engine. Also,
most of these techniques are in-memory, not on-disk.

I am really interested in building a reverse video search engine.

~~~
natemwilson
How would a reverse video search engine work? You give it a clip and it shows
you which longer clips it is in?

~~~
Immortal333
Yes, exactly as you described.

~~~
jointpdf
Could you take (embeddings of) frames from your query video and measure the
pairwise similarity to frames from the reference videos, then rank by some
summary function of that list of similarity scores?

What are the bottlenecks besides performance?
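
Something like this, maybe, assuming you already have unit-norm frame
embeddings (the max-then-mean summary is just one arbitrary choice of that
summary function):

```python
import numpy as np

def clip_score(query_frames, ref_frames):
    """query_frames: [m x d], ref_frames: [n x d]; rows are unit-norm embeddings."""
    sims = query_frames @ ref_frames.T  # all pairwise frame similarities
    return sims.max(axis=1).mean()      # best match per query frame, averaged

# rank candidate videos against the query clip:
# ranked = sorted(refs.items(), key=lambda kv: clip_score(q, kv[1]), reverse=True)
```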

------
logicslave
This is really powerful; there are many profitable AI startups that could be
built on this technology. I'm almost mad that it's on the HN front page.

~~~
scottlocklin
The basic idea of product quantization is 10 years old now and there are many
examples out there [0][1][2][etc]. This looks like a high quality library
though.

[0] [https://github.com/yahoo/lopq](https://github.com/yahoo/lopq)

[1] [https://github.com/cgtuebingen/Product-Quantization-Tree](https://github.com/cgtuebingen/Product-Quantization-Tree)

[2] [https://github.com/technicolor-research/pq-fast-scan](https://github.com/technicolor-research/pq-fast-scan)

[3] [https://github.com/matsui528/rii](https://github.com/matsui528/rii)

------
graycat
I thought that k-d trees were good for that problem. With some backtracking in
the tree and constructed cutting planes, you can get exact nearest neighbor
search.

~~~
jonas21
kd-trees don't work well for high-dimensional vectors, so when working with
embeddings, you typically have to use something else.

~~~
CyberDildonics
This hasn't been true for decades at this point. SIFT matching was originally
predicated on best-bin-first search through kd-trees to find approximate
nearest neighbors for its 128-dimensional descriptor vectors.

If you prioritize the bins you search by distance to the query, you are more
likely to find closer points sooner. If you are collecting a fixed number of
closest points, you quickly establish a distance that new points must beat to
enter the set. If that distance is shorter than the distance to a bin, you can
skip the bin entirely.
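
A compressed sketch of that best-bin-first strategy (illustrative, not FLANN's
code): keep a priority queue of tree cells ordered by a lower bound on their
distance to the query, and stop once no remaining cell can beat the current
k-th best.

```python
import heapq
import numpy as np

class KDNode:
    def __init__(self, X, idxs, depth=0, leaf_size=16):
        self.axis = depth % X.shape[1]
        self.leaf = len(idxs) <= leaf_size
        if self.leaf:
            self.idxs = idxs
            return
        idxs = idxs[np.argsort(X[idxs, self.axis])]  # split at the median
        mid = len(idxs) // 2
        self.split = X[idxs[mid], self.axis]
        self.left = KDNode(X, idxs[:mid], depth + 1, leaf_size)
        self.right = KDNode(X, idxs[mid:], depth + 1, leaf_size)

def best_bin_first(root, X, q, k=5, max_leaves=32):
    best = []                # max-heap of (-dist, idx): the k closest seen so far
    todo = [(0.0, 0, root)]  # (lower bound on distance to cell, tiebreak, node)
    tick, leaves = 1, 0
    while todo and leaves < max_leaves:
        bound, _, node = heapq.heappop(todo)
        if len(best) == k and bound >= -best[0][0]:
            break            # no remaining cell can beat the current k-th best
        if node.leaf:
            leaves += 1
            for i in node.idxs:
                d = float(np.linalg.norm(X[i] - q))
                if len(best) < k:
                    heapq.heappush(best, (-d, i))
                elif d < -best[0][0]:
                    heapq.heapreplace(best, (-d, i))
            continue
        diff = q[node.axis] - node.split
        near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
        heapq.heappush(todo, (bound, tick, near)); tick += 1
        heapq.heappush(todo, (max(bound, abs(diff)), tick, far)); tick += 1
    return sorted((-nd, i) for nd, i in best)

# usage: tree = KDNode(X, np.arange(len(X))); best_bin_first(tree, X, q, k=5)
```

The max_leaves cap is what makes it approximate; raise it and you converge on
the exact answer.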

~~~
thomasahle
If you have heuristics fast enough to make kd-trees competitive, it would be
interesting to have them added to
[http://ann-benchmarks.com/](http://ann-benchmarks.com/)

~~~
CyberDildonics
What do you mean by "heuristics"? What I described is how FLANN works.

It is not only on those benchmarks but is one of the reasons they exist in the
first place, since it instigated a lot of the research into approximate
nearest neighbors.

------
the_optimist
Previously discussed here:

[https://news.ycombinator.com/item?id=23737338](https://news.ycombinator.com/item?id=23737338)

