
Faiss: A library for efficient similarity search - sidcool
https://code.facebook.com/posts/1373769912645926/faiss-a-library-for-efficient-similarity-search/
======
garysieling
This looks really interesting. Full-text search engines rank documents by the
similarity of a query to documents in the index, but it seems like there are a
lot of opportunities to do more with similarities between content.

One thing I'm interested in is the ability to find the most dissimilar
document to a query or to other documents (perhaps within some constraints). I
think this would potentially be really helpful for discovery. For context, on
[https://www.findlectures.com](https://www.findlectures.com), I'd like to show
talks on subjects I know the least about, i.e., those most dissimilar to the
talks I've previously watched.

~~~
jhj
Faiss queries, at least for IVFPQ (the main emphasis of the library), take
real-valued vectors as input and compare them against quantized database
vectors; this is the so-called “ADC” (asymmetric distance computation). So you
don't have to query with a database vector.
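
As a rough illustration of the asymmetric idea in plain NumPy (this is not the
Faiss API; all names and parameter values here are made up): database vectors
are stored only as small PQ codes, while the query stays a real vector and
distances are read out of per-subspace lookup tables.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, ks, n = 8, 2, 16, 200          # dim, sub-quantizers, centroids each, db size
sub = d // m

xb = rng.standard_normal((n, d)).astype('float32')

# "train" each sub-quantizer: here we just sample ks database sub-vectors as centroids
codebooks = [xb[rng.choice(n, ks, replace=False), i*sub:(i+1)*sub] for i in range(m)]

# encode the database: per subspace, the index of the nearest centroid (the PQ code)
codes = np.stack([
    np.argmin(((xb[:, i*sub:(i+1)*sub][:, None] - codebooks[i][None]) ** 2).sum(-1), axis=1)
    for i in range(m)
], axis=1)                            # shape (n, m): a few bytes per vector

def adc_search(q, k=5):
    # ADC: distance tables from the *real* (unquantized) query to every centroid
    tables = [((q[i*sub:(i+1)*sub] - codebooks[i]) ** 2).sum(-1) for i in range(m)]
    # approximate distance to each database vector = sum of table lookups via its codes
    approx = sum(tables[i][codes[:, i]] for i in range(m))
    return np.argsort(approx)[:k]

q = rng.standard_normal(d).astype('float32')
top = adc_search(q)
```

The query is never quantized, which is why the distances are "asymmetric":
real query vs. coded database.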

If the data is normalized to lie on a hypersphere, you could find nearest
neighbors to a query vector on the opposite point of the hypersphere, for
instance.
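
A minimal sketch of that trick in plain NumPy (not the Faiss API): on the unit
sphere, the most dissimilar database vector to a query q is exactly the
nearest neighbor of -q under inner-product similarity.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 8
xb = rng.standard_normal((n, d)).astype('float32')
xb /= np.linalg.norm(xb, axis=1, keepdims=True)   # project database onto the unit sphere

q = rng.standard_normal(d).astype('float32')
q /= np.linalg.norm(q)

most_dissimilar = int(np.argmin(xb @ q))          # smallest cosine similarity to q
nearest_to_neg_q = int(np.argmax(xb @ -q))        # nearest neighbor of the antipode
assert most_dissimilar == nearest_to_neg_q        # same point: just query with -q
```

With Faiss itself, the same effect would come from searching an inner-product
index (e.g. IndexFlatIP) with the negated query.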

But, as we sometimes find with, say, negative mining, what you are interested
in is often not the “hardest” negative, i.e., the most dissimilar object,
since those don’t provide much utility relative to your current object; you
might want something that is reasonably close in some ways but quite different
otherwise, which makes for a more useful negative example. That is more
difficult.

~~~
garysieling
Thanks for the comment; that is a good way to articulate the problem I'm
interested in. For instance, surfacing novel topics while retaining some form
of quality: "low quality" would also be an opposite result, but clearly isn't
the right answer.

------
ethanwillis
This looks great. I'm actually giving a talk on this type of search for my
Master's project this coming Friday. I was looking at providing semantic
similarity search across a collection of research papers to build a "semantic
research network".

Currently I'm doing a lot of pre-processing on the backend before putting all
the results in Mongo for end-users, but it looks like I could get closer to
real time with a modest GPU cluster. My only complaint is that I wish GPU
support weren't CUDA-only...

~~~
jhj
Hi, I’m the author of the Faiss GPU code.

The GPU implementation depends upon a lot of particularities in CUDA (warp
shuffles and warp ballots, tricks with using register memory), and gets much
of its speed in this way.

Much of the code would be much slower if it were forced to use shared memory
instead of the register file (the k-selection most notably). Some of the
needed functionality is available in OpenCL 2.0; e.g.,

[https://www.khronos.org/registry/OpenCL/sdk/2.0/docs/man/xht...](https://www.khronos.org/registry/OpenCL/sdk/2.0/docs/man/xhtml/cl_khr_subgroups.html)

but these are still extensions or implementation-specific, and I wouldn’t hold
my breath for them to work well.

OpenCL tends to be a lowest common denominator, and bleeding-edge GPU
programming won’t map well onto it. I don’t think shuffles are supported at
all in OpenCL (?). They are available directly in AMD’s GPU ISA, but with
slightly different semantics, and at that point you're using something just as
proprietary as CUDA. Not to mention trying to support wavefront size 64 vs.
warp size 32, etc.:

[http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/](http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/)

~~~
visarga
Great work! As a demo, could you make a Reddit comment or thread search engine
based on document embeddings? It would be interesting to search through the
treasure trove with word vectors. The comment dataset is already available in
the wild.

------
gbrits
Could this be used for record deduplication? Say you have products from
various data sources. A product has various features/facts that almost never
line up 100%. Could this library be used to match such records? One of the
more specific constraints is that no two records from dataset A may match the
same record from dataset B, and vice versa.

------
hossbeast
Badly in need of a rename

------
carlodelmundo
How do you handle incremental reindexing?

Content creation has a high velocity and new multimedia is being generated
every second. The methods in the paper assume that the dataset is static.

~~~
jhj
Remove is supported for some indices.

The indexing scheme (IVFPQ, IMI, etc.) is based on the statistics
(distribution) of the underlying data. If the statistics of the new data are
not changing much, then it is reasonable to just remove the old vectors and
add new ones.

If the distribution changes significantly over time, then rebuilding the index
by reclustering would be required. For large enough datasets, one would
probably have multiple shards of data that are assumed to be reasonably
independent. Individual shards can be periodically reindexed to replace the
old shard, but this demands that the original vector data (e.g., 128-dim
vectors * sizeof(float) = 512 bytes per vector) be kept around rather than the
compressed form (4-64 bytes per vector).

As with anything, there is a tradeoff between improving query time versus
query accuracy versus preprocessing/index build time versus data storage:

- no build time, high query time, high storage, exact accuracy: Faiss IndexFlat

- low build time, medium query time, high storage, high accuracy: Faiss IndexIVFFlat

- medium build time, low query time, low-to-medium storage, medium-to-high accuracy: Faiss IndexIVFPQ

- very high build time, low query time, low-to-high storage (depending on whether it is stored as a k-NN graph or raw data), high accuracy: NN-Descent by Dong et al. (e.g., nmslib)

IndexIVFPQ, perhaps with IMI, is typically what we concentrate on; it seems to
be a reasonable sweet spot for billion-scale datasets.

~~~
carlodelmundo
Thanks for the response. Great work here.

