
Indexing Billions of Text Vectors - martinlaz
https://0x65.dev/blog/2019-12-07/indexing-billions-of-text-vectors.html
======
Der_Einzige
I love this and the previous two posts!

This is extremely pertinent to me, as I have written my own "semantic search
engine" and extractive summarizer using word vectors; it is available here:

[https://www.nlpsearch.io](https://www.nlpsearch.io)

So far it's using a 100D GloVe model (very simple, and not all that good
compared to BERT et al.) for performance reasons, and because I'm hosting it
for free on AWS.

A much more fleshed-out version of my extractive summarizer, with a CLI
(supporting all major language models like GPT-2), is also available here:

[https://github.com/Hellisotherpeople/CX_DB8](https://github.com/Hellisotherpeople/CX_DB8)

I love reading about how 0x65.dev is dealing with issues I've run into. Trying
to index large numbers of text vectors is a headache for me, and at the moment
I saw this, I was writing code to get the popular approximate nearest
neighbors library "Annoy" to work with my search engine.

As for the issue other HN users noted ("Isn't average pooling not that
great?"), the answer is "Yeah, we know, but there's literally nothing
better". The authors mentioned using a weighted average of vectors, which I
assume means weighting the word vectors by their tf-idf scores. I've read
several papers saying that doing this is not necessarily better than plain
averaging. I've found some success with concatenating average-, max-, and
min-pooled vectors together before cosine similarity searches - but it's
certainly not a good fix, and it's much slower for not a lot of gain.
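
For concreteness, here's a minimal sketch of that concatenated pooling (the
function names are mine, not from the site):

    import numpy as np

    def pooled_doc_vector(word_vecs):
        # word_vecs: (num_words, dim) array of word embeddings
        avg = word_vecs.mean(axis=0)
        mx = word_vecs.max(axis=0)
        mn = word_vecs.min(axis=0)
        # one 3*dim document vector instead of dim, hence the slowdown
        return np.concatenate([avg, mx, mn])

    def cosine_sim(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))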

If anyone is a total nerd about word/document vectors and wants to chat /
brainstorm about applying text vectors to search / extractive summarization -
please contact me! I'm desperate to work with others on this problem!

~~~
visarga
I tried a weighted average, but in my case the weights were computed by taking
the dot product between all vectors in the phrase (an m x m matrix), then
taking the average over rows for each word and normalising. Kind of like a
poor man's Transformer: it boosts words that are supported by other similar
words in the same phrase.

~~~
Der_Einzige
Can you provide pseudocode or code for this?

I'd post a snippet of my implementation but I'm too dumb to figure out the
format.

But wow, this gives much better results! Now to figure out how to make it
fast!

~~~
visarga
Something like this, including thresholding for pairs of words that have low
dot product:

    
    
    import numpy as np

    def inner_product_rank(vecs, threshold=0.5):
        # pairwise dot products between all word vectors (m x m)
        sims_mat = np.dot(vecs, vecs.T)
        # drop pairs whose dot product falls below the threshold
        sims_mat = np.maximum(sims_mat - threshold, 0.0)
        # a word's rank is the total support it gets from the other words
        ranks = np.sum(sims_mat, axis=0)
        return ranks
    

Then you can take a weighted sum of the vecs, or use the ranks to select the
most related words. It is also possible to run spectral clustering on sims_mat
to get the main topics of the text; it works quite well.
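
The weighted-sum step could look like this (a sketch; the exact normalisation
is my choice):

    # normalise the ranks into weights, then pool the word vectors
    ranks = inner_product_rank(vecs)
    weights = ranks / (ranks.sum() + 1e-12)
    phrase_vec = weights @ vecs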

------
markpapadakis
What does each of those 200 dimensions represent/encode?

This is a very interesting and pragmatic design. I tried a couple of queries
and the results were great for most of them, but not so much for esoteric
ones, or for queries involving names, which makes sense given the
implementation specifics outlined in their blog posts.

I had never heard of this company before, and I am very interested in IR
technologies ([https://github.com/phaistos-networks/Trinity](https://github.com/phaistos-networks/Trinity)).
I am glad they are trying, and it looks like they have a chance at something
great here.

Ahrefs.com is also working on a large-scale web search engine, but other than
their CEO announcing it on Twitter, there haven't been any updates since.

~~~
PeterisP
By default, there's no attempt to connect a particular semantic meaning with a
particular dimension. It's worth noting that all the popular methods of
calculating word vectors can (and will) give results that differ by an
arbitrary linear transformation, so even if they do encode some very
particular "semantic bit", it's still likely to be "smeared" across all 200
dimensions. You could apply a linear (unambiguous, reversible) transformation
to a _different_ 200-dimension space where that particular factor is isolated
in a separate dimension, but you would have to explicitly try to do that.

So the default situation is that each individual dimension means "nothing and
everything". If you had some specific factors which you know beforehand and
want to determine, you could calculate a transformation that projects all the
vectors into a different vector space where dimension #1 means thing A, #2
means thing B, etc. For example, there are some nice experiments with 'face'
vectors that separate out age/masculinity/hair length/happiness/etc. from the
initial vectors coming out of an image-analysis neural network, where the
meaning of each separate dimension is initially unclear.
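
You can see the arbitrariness directly: an orthogonal change of basis
scrambles what every dimension "means" while leaving all similarities
untouched. A small numpy check (my own illustration, not from the post):

    import numpy as np

    rng = np.random.default_rng(0)
    vecs = rng.normal(size=(5, 200))      # toy "word vectors"

    # random orthogonal matrix: an arbitrary lossless change of basis
    q, _ = np.linalg.qr(rng.normal(size=(200, 200)))
    rotated = vecs @ q

    # pairwise dot products are identical in both bases, even though
    # every individual dimension now encodes a different mixture
    print(np.allclose(vecs @ vecs.T, rotated @ rotated.T))  # True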

~~~
finnh
Even without a priori semantic goals, I think you can also transform/rotate
your dimensions to maximize their interpretability.

Simplified example: if your two-dimensional system gives you two points:

    -0.5, 0.5
     0.5, 0.5

Then you losslessly rotate to:

    0, 1.0
    1.0, 0

With the idea that the latter is simpler for humans to assign semantics to.
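
A quick numpy check of this example (the sqrt(2) rescale is my addition; the
pure 45-degree rotation alone lands on (0, 0.707) and (0.707, 0)):

    import numpy as np

    pts = np.array([[-0.5, 0.5],
                    [ 0.5, 0.5]])

    theta = -np.pi / 4                        # rotate by -45 degrees
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])

    # rotation plus a uniform sqrt(2) rescale hits the axis-aligned points
    print(np.sqrt(2) * pts @ rot.T)           # [[0, 1], [1, 0]]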

~~~
jointpdf
This is the idea behind non-negative matrix formulation (NMF). As the name
implies, it forces the entries of the embedding matrices (for both the reduced
document and term matrices) to be nonnegative, which results in a more
interpretable "sum of parts" representation. You can really see the difference
(compared to LSA/SVD/PCA, which do not have this constraint) when it's
applied to images of faces. Also, NMF has been shown to be equivalent to
word2vec. The classic paper is here:
[http://www.cs.columbia.edu/~blei/fogm/2019F/readings/LeeSeun...](http://www.cs.columbia.edu/~blei/fogm/2019F/readings/LeeSeung1999.pdf)

PS—There should be a negative sign on the (2,2) entry of the first matrix.
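
A tiny sklearn sketch of the nonnegativity constraint in action (toy data,
not from the paper):

    import numpy as np
    from sklearn.decomposition import NMF

    # toy term-document count matrix (documents x terms), all nonnegative
    X = np.array([[2, 1, 0, 0],
                  [1, 3, 0, 1],
                  [0, 0, 4, 2],
                  [0, 1, 3, 3]], dtype=float)

    model = NMF(n_components=2, init='nndsvd')
    W = model.fit_transform(X)   # document-topic weights, all >= 0
    H = model.components_        # topic-term weights, all >= 0

    # the data is reconstructed as a purely additive "sum of parts"
    print(np.round(W @ H, 1))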

~~~
raverbashing
> non-negative matrix formulation (NMF)

*factorization ;)

PCA follows a similar idea (rotating vectors), but it's usually done in a much
lower-dimensional space.

~~~
jointpdf
Ugh, that one was auto-correct, I swear. I have no idea what’s going on at
Apple’s NLP department.

------
ricardobeat
I'm really curious about the trade-off between memory usage and latency. Can
the work be parallelized, or does it mean you now need 5x+ the number of
servers for the same amount of traffic? That would quickly counter (though
not entirely) the reduced memory requirements.

~~~
gravadlax
Hi, I am one of the authors of this post.

As you mention, it is a trade-off. In this case we are trying to balance CPU
and memory usage. First of all, if you just use the original vectors (or the
quantized ones mentioned in the post), it might very well be the case that
there is no server where you can even place the index.

So let us consider the example in the post, and suppose we have a server with
enough memory. We put the index (1 TB) in RAM and assume that one request
takes 1-2 ms; then a single CPU core can handle 500-1000 requests per second.
So now we are using _a lot_ of RAM and very little CPU. Depending on your use
case, this can very well be better than what is proposed in the post, but for
some it might be preferable to use servers with more balanced resources.

And just to clarify, the memory is shared between CPUs. So you pay the memory
price once and you can then scale it to all the CPUs.

------
jhj
Nice read!

I'm curious as to what settings you used for product quantization, and what
general setup you tried. I find that there can be a smooth quality degradation
from raw float32, to float16, through 8/6/4-bit scalar quantization, down to
product quantization with a very small number of bits per vector, if tuned
properly. The quantization method is independent of the search method used
(brute force, cell probe with inverted lists (IVF), cell probe with IMI
inverted lists, HNSW), though both affect the quality of results.

FYI, the Faiss library (used across Facebook and Instagram) has a CPU HNSW
index implementation as well, with a GPU one possibly to come at some point.

Generally we do not use HNSW for our largest indices due to the memory
overheads and restrictions on vector removal, but HNSW is useful as a top-
level index for inverted list searching on these large indices (an index
sitting on top of sub-indices), and indeed is possibly required for the
largest indices, e.g.
[https://github.com/facebookresearch/faiss/wiki/Indexing-1T-v...](https://github.com/facebookresearch/faiss/wiki/Indexing-1T-vectors)
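
Not the settings from the post, but for readers who want to experiment, a
minimal Faiss sketch of IVF+PQ alongside a flat HNSW index (all parameters
here are illustrative):

    import numpy as np
    import faiss  # pip install faiss-cpu

    d = 200                                # vector dimensionality
    xb = np.random.rand(100_000, d).astype('float32')

    # IVF + product quantization: 1024 cells, 25 sub-vectors x 8 bits
    coarse = faiss.IndexFlatL2(d)
    index = faiss.IndexIVFPQ(coarse, d, 1024, 25, 8)
    index.train(xb)                        # learn cells and PQ codebooks
    index.add(xb)
    index.nprobe = 32                      # cells to visit per query
    D, I = index.search(xb[:5], 10)        # 10 nearest neighbours each

    # flat HNSW for comparison: no compression, more memory per vector
    hnsw = faiss.IndexHNSWFlat(d, 32)      # 32 links per node
    hnsw.add(xb)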

~~~
gravadlax
Hi, I am one of the authors of the post.

We tried using LOPQ, but I don't know the exact parameters used. We probably
did not put enough effort into it, though.

Your approach seems interesting. Maybe we will try something similar in the
future.

------
hos234
Hmm... where does this 4 billion queries number come from?

Looking at Stack Exchange, most of the sites have fewer than a million
(possible?) queries (questions). Stack Overflow, probably the largest, has 18
million questions. Wikipedia's article count is about 5 million.

Is the assumption that people come up with 1000 different ways to query every
possible result endpoint?

~~~
gravadlax
Hi, I am one of the authors of this post.

The number 4 billion is made up. It was chosen to illustrate the scale of the
problem we are trying to solve. It also happens to be just below the maximum
value that fits in an unsigned 4-byte integer (2^32 ≈ 4.29 billion), which is
of course beneficial when trying to build an index with a small footprint.
With more queries, we would have to consider using long integers (8 bytes) or
some custom type of e.g. 5 bytes.
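
A quick numpy illustration of why the integer width matters at this scale
(my example, not from the post):

    import numpy as np

    # 4 billion ids just fit in an unsigned 32-bit integer
    print(np.iinfo(np.uint32).max)       # 4294967295

    # at 4e9 ids, 4-byte ids cost ~16 GB; 8-byte ids cost ~32 GB
    ids32 = np.arange(10, dtype=np.uint32)
    ids64 = np.arange(10, dtype=np.uint64)
    print(ids32.nbytes, ids64.nbytes)    # 40 vs 80 bytes for 10 ids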

If you're interested in where the queries come from, you might find these
previous posts interesting:
[https://0x65.dev/blog/2019-12-05/a-new-search-engine.html](https://0x65.dev/blog/2019-12-05/a-new-search-engine.html)
[https://0x65.dev/blog/2019-12-06/building-a-search-engine-fr...](https://0x65.dev/blog/2019-12-06/building-a-search-engine-from-scratch.html)

~~~
elcritch
A thought regarding 4-byte vs. 8-byte: both are highly amenable to
custom/manual vectorization, e.g. with modern AVX-512 instructions. Likely
someone on your team has already thought of that, but on the off chance you
haven't looked into it yet, it might prove very useful!

Thanks for the posts, they're fascinating, and more alternative search
engines would be great. Best of luck!

------
FpUser
Very nice and useful set of articles. Makes for good reading. Kudos

------
freediver
What do you use this similarity search for? Is it for data on what previous
users clicked on?

------
aldoushuxley001
Lots of blog posts since their launch: seven pretty in-depth posts in the
seven days since Dec 1st.

~~~
Zhyl
They're gaming the system by producing intellectually stimulating quality
material aimed at the hackernews demographic!

~~~
deugtniet
As the average HN consumer, I can get behind this.

~~~
aldoushuxley001
hm damn, where's the median HN consumer when you need them?

