
Scann: Scalable Nearest Neighbors - blopeur
https://github.com/google-research/google-research/tree/master/scann
======
greenyoda
If someone is looking for the referenced paper ( _Accelerating Large-Scale
Inference with Anisotropic Vector Quantization_ ), it can be found here:
[https://arxiv.org/abs/1908.10396](https://arxiv.org/abs/1908.10396)

(It's also linked in the repository's docs/algorithms.md, to which there's a
link at the bottom of the main page.)

~~~
thomasahle
The paper is a bit odd. It feels like a lot of things are missing:

- How are the quantized vectors used to do NNS? Do we have to brute-force
compare against each codeword? (A sketch of the usual codebook-scan approach
is below.)

- In that case, how many codewords are there? How big is k compared to n? How
does changing k affect performance?

- What architecture is used to train the embeddings? We are only given the
loss function.

- How fast is the preprocessing step? If we need to do Lloyd's-algorithm-style
iterations combined with training neural nets, this is potentially a lot
slower than Faiss, which just picks random landmark points.
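
For what it's worth, here is a minimal sketch of the usual codebook-scan
answer to the first question, in the style of Faiss's IVF indexes rather than
anything the paper confirms: cluster with k-means, brute-force compare the
query against the k codewords, then compute exact distances only within the
nearest cells. All names here are illustrative.

    import numpy as np

    def build_index(data, k, iters=10):
        """Lloyd's-style k-means; returns codewords plus per-cell point indices."""
        rng = np.random.default_rng(0)
        codewords = data[rng.choice(len(data), k, replace=False)]
        for _ in range(iters):
            # Assign each point to its nearest codeword.
            assign = np.argmin(((data[:, None] - codewords[None]) ** 2).sum(-1), axis=1)
            for c in range(k):
                members = data[assign == c]
                if len(members):
                    codewords[c] = members.mean(axis=0)
        cells = {c: np.flatnonzero(assign == c) for c in range(k)}
        return codewords, cells

    def search(query, data, codewords, cells, n_probe=2, top=10):
        # Brute-force over the k codewords: cheap as long as k << n.
        order = np.argsort(((codewords - query) ** 2).sum(-1))
        cand = np.concatenate([cells[c] for c in order[:n_probe]])
        # Exact distances only within the probed cells.
        d = ((data[cand] - query) ** 2).sum(-1)
        return cand[np.argsort(d)[:top]]

Changing k trades the cost of the codeword scan against the size of each cell
scan, which is presumably what the second and fourth questions are getting at.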

------
softwaredoug
Right now ANN is a huge (and fascinating) area of active data-structures and
algorithms research.

I do sort of wonder if we will reach a point where, instead of optimizing pure
recall per unit of speed, we'll have a family of algorithms with trade-offs
tailored to the use case, where we evaluate an ANN approach on domain-specific
metrics instead of just its ability to reproduce the exact nearest-neighbor
set.

For example, one thing I see assumed is that recall at N is a good metric, but
does high recall also mean the "ranking" within that top N is good? I'd rather
not have to manually compute exact nearest neighbors over that top N if I can
avoid it.
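
To make concrete what that manual step looks like, here is a hedged sketch of
exact re-ranking over an ANN candidate set; `ann_top_n` stands in for whatever
index produced the candidates and is not from any particular library:

    import numpy as np

    def rerank(query, data, ann_top_n, final_k=10):
        """Re-sort approximate candidates by exact squared L2 distance.

        ann_top_n: integer array of candidate indices from the ANN index.
        """
        exact = ((data[ann_top_n] - query) ** 2).sum(axis=-1)
        return ann_top_n[np.argsort(exact)[:final_k]]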

And are there specific vector spaces where one ANN is preferred? Or will there
be some universal approach that just works for everything?

I realize it's too early to tell, but these questions always percolate in my
mind when we hit a new benchmark in recall-versus-speed, especially since I
still see people doing more naive things that seem to work perfectly fine for
their use case (like k-d trees, random projections, or LSH).

~~~
thomasahle
> And are there specific vector spaces where one ANN is preferred? Or will
> there be some universal approach that just works for everything?

This is a great question for research right now. Basically, Euclidean (and
inner-product) search is quite well understood, but for distance measures such
as edit distance we still have no idea what the best approach is.

The most general result is
[https://ieeexplore.ieee.org/abstract/document/8555102](https://ieeexplore.ieee.org/abstract/document/8555102),
which works for all normed spaces.

However it is possible that people will try to just map everything to inner
products using neural networks.

------
ricksharp
Ok, so I am just trying to understand the basic concepts in the paper and put
it in my own words:

It seems that the primary idea is that quantization precision is more
important where there is a high density of neighbors.

I.e., at the edges the quantized sections (buckets) can be large, since there
are few items there, but in high-density areas the buckets should be much
smaller, in order to keep the distribution of objects per bucket as even as
possible.

Therefore, the overall effectiveness of a quantization loss function should
not be evaluated as a plain sum of squared errors (which treats all regions of
the vector space as equally important), but should instead take the density of
the vector space into account and use it to weight the errors in different
regions.

To me it seems analogous to a hash set, where the goal would be to have even
distribution (same number of items in every bucket).

We want to quantize the space so that every region has about the same number
of items.
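
A toy version of the weighted loss described above, following my reading here
(density-weighted squared error, with a kernel density estimate as an assumed
stand-in for "density") rather than claiming to be the paper's exact
anisotropic loss:

    import numpy as np
    from sklearn.neighbors import KernelDensity

    def weighted_quantization_loss(data, codewords, weights):
        """Sum of per-point squared quantization errors, weighted per point."""
        assign = np.argmin(((data[:, None] - codewords[None]) ** 2).sum(-1), axis=1)
        err = ((data - codewords[assign]) ** 2).sum(-1)
        return np.sum(weights * err)

    data = np.random.rand(1000, 8)
    # Weight each point by a kernel density estimate of its neighborhood.
    weights = np.exp(KernelDensity(bandwidth=0.5).fit(data).score_samples(data))
    codewords = data[np.random.choice(len(data), 16, replace=False)]
    print(weighted_quantization_loss(data, codewords, weights))

Minimizing this pulls codewords toward dense regions, i.e. finer buckets where
the points are.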

~~~
ur-whale
> should instead take the density of the vector space into account and use it
> to weight the errors in different regions.

Sounds like the n-dimensional version of an octree.

------
cs702
Wow, this looks impressively fast at very reasonable recall rates in the ANN-
benchmarks. It seems to leave faiss and nmslib in the dust. Pulling up the
arXiv paper as we speak to figure out what these guys are doing to achieve
such impressive results.
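
For anyone curious what using it looks like: going from memory of the repo's
README, the Python wrapper is built roughly as sketched below. Treat the exact
builder and parameter names as assumptions that may differ between versions.

    import numpy as np
    import scann

    dataset = np.random.rand(100000, 128).astype(np.float32)
    # Normalizing makes dot-product search equivalent to cosine similarity.
    dataset /= np.linalg.norm(dataset, axis=1, keepdims=True)

    searcher = (scann.scann_ops_pybind.builder(dataset, 10, "dot_product")
                .tree(num_leaves=1000, num_leaves_to_search=100,
                      training_sample_size=25000)
                .score_ah(2, anisotropic_quantization_threshold=0.2)
                .with_reordering(100)
                .build())

    neighbors, distances = searcher.search_batched(dataset[:5])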

------
hoseja
[https://github.com/google-research/google-research/commit/40...](https://github.com/google-research/google-research/commit/406566cfafc83bcc4d54f82efa43fd3819039905#diff-9f09552fb2b5917f8532e55facc3734b)

~~~
scribu
Are you pointing out the removal of the Apache License header, or what?

------
eximius
I'm surprised that I don't see DBSCAN, HDBSCAN, Spectral, etc. I don't even
_recognize_ these methods. Am I missing something or have the methods I'm
familiar with become obsolete that fast?

~~~
cs702
DBSCAN, HDBSCAN, Spectral, and many other classical algorithms are not
scalable. You cannot use them to search for approximate nearest neighbors,
say, among several billion embeddings in a 1024-dimensional vector space,
which is the sort of thing you _can_ do with this library.

~~~
ghj
I don't know anything about this field. Are these distributed systems? A
billion 1024 dimensional vectors sounds like a whole lot of memory (8 TB if
each dimension is a double?).
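
Back-of-the-envelope check of that estimate (the one-byte-per-dimension line
is an assumption about how aggressively a quantizer might compress):

    vectors, dims = 10**9, 1024
    print(vectors * dims * 8 / 1e12)  # float64: ~8.2 TB, as estimated above
    print(vectors * dims * 4 / 1e12)  # float32, the more common choice: ~4.1 TB
    print(vectors * dims * 1 / 1e12)  # quantized to ~1 byte per dim: ~1 TB

So yes, it's a lot of memory, which is presumably part of why these libraries
lean so heavily on quantization.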

~~~
creato
The link is from google research. I'd imagine google has at least a few
datasets of this scale.

------
jszymborski
Would this algorithm suit the case of wanting to find neighbors within a set
radius of a point? Does anyone know of an approximate method for doing this?

~~~
scribu
sklearn's BallTree has a query_radius method [1], but it's exhaustive, rather
than approximate.

[1] [https://scikit-learn.org/stable/modules/generated/sklearn.ne...](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.BallTree.html#sklearn.neighbors.BallTree.query_radius)
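
Minimal usage of that method (exhaustive, as noted). One common way to
approximate radius search, if that's good enough, is to take an ANN index's
top-k for the query and keep only candidates whose exact distance is within
the radius.

    import numpy as np
    from sklearn.neighbors import BallTree

    data = np.random.rand(10000, 32)
    tree = BallTree(data)
    # One index array per query: all points within radius 0.5 of each query.
    ind = tree.query_radius(data[:3], r=0.5)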

------
phenkdo
How does this compare to milvus?

~~~
cs702
Not comparable.

milvus integrates and wraps libraries like faiss and nmslib, which are
alternatives to Scann. (If anything, I expect the milvus developers will soon
be looking into integrating and wrapping Scann.)

