Indexing Billions of Text Vectors (0x65.dev)
252 points by martinlaz on Dec 7, 2019 | 40 comments



I love this and the previous two posts!

This is extremely pertinent to me as I have written my own "semantic search engine" and extractive summarizer using word vectors and it is available here:

https://www.nlpsearch.io

So far, it's using a 100D GloVe model (very simple and not all that good compared to BERT et al.) for performance reasons, and because I'm hosting it for free on AWS.

A much more fleshed out version of my extractive summarizer with a CLI (supporting all major language models like GPT-2) is also available here:

https://github.com/Hellisotherpeople/CX_DB8

I love reading about how 0x65.dev is dealing with issues I've run into. Trying to index large numbers of text vectors is a headache for me, and at the moment I saw this, I was writing code to get the popular approximate nearest neighbors library "Annoy" to work with my search engine.

As for the issue other HN users noted ("Isn't average pooling not that great?"), the answer is "Yeah, we know, but there's literally nothing better." The authors mentioned using a weighted average of vectors, which I assume means weighting the word vectors by tf-idf. I've read several papers which say that doing this is not necessarily better than plain averaging. I've found some success with concatenating average-, max-, and min-pooled vectors together before cosine similarity searches - but it's certainly not a good fix, and it's much slower for not a lot of gain.
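
For the tf-idf weighting, something like this is what I have in mind (a rough sketch - `word_vecs` and `idf` here are just placeholder lookup tables, nothing from the post):

    import numpy as np

    def tfidf_weighted_embedding(tokens, word_vecs, idf, default_idf=1.0):
        # word_vecs: token -> vector lookup, idf: token -> idf weight (both placeholders)
        vecs, weights = [], []
        for tok in tokens:
            if tok in word_vecs:
                vecs.append(word_vecs[tok])
                weights.append(idf.get(tok, default_idf))
        if not vecs:
            return None
        # weighted average of the word vectors, weights proportional to idf
        return np.average(np.array(vecs), axis=0, weights=weights)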

If anyone is a total nerd about word/document vectors and wants to chat / brainstorm about applying text vectors to search / extractive summarization - please contact me! I'm desperate to work with others on this problem!


I tried a weighted average, but in my case the weights were computed by taking the dot product between all vectors in the phrase (m x m), then averaging over the rows for each word and normalising. Kind of like a poor man's Transformer: it will boost words that are supported by other similar words in the same phrase.


Can you provide pseudocode or code for this?

I'd post a snippet of my implementation but I'm too dumb to figure out the format.

But wow this is much better for results! Now to figure out how to make it fast!


Something like this, including thresholding for pairs of words that have low dot product:

    import numpy as np

    def inner_product_rank(vecs, threshold=0.5):
        # pairwise dot products between all word vectors in the phrase (m x m)
        sims_mat = np.dot(vecs, vecs.transpose())
        # discard weak pairs: shift by the threshold and clip negatives to zero
        sims_mat = sims_mat - threshold
        sims_mat[sims_mat < 0] = 0
        # a word's rank is the sum of its (thresholded) similarities to the rest
        ranks = np.sum(sims_mat, axis=0)
        return ranks
Then you can take a weighted sum of the vecs or use the ranks to select the most related words. It is also possible to run spectral clustering on sims_mat to get the main topics of the text; that works quite well.
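
And the weighted sum itself is just (assuming vecs is the m x d matrix of word vectors):

    ranks = inner_product_rank(vecs)
    weights = ranks / (ranks.sum() + 1e-9)   # normalise so the weights sum to 1
    phrase_vec = weights @ vecs              # weighted average of the word vectors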


Hey - how can I contact you?


gedboy2112@gmail.com


Maybe delete your email from this post when the guy contacts you


What does each of those 200 dimensions represent/encode?

This is a very interesting and pragmatic design. I tried a couple of queries and the results were great for most of them, but not so much for esoteric ones, or for queries involving names, which makes sense according to the implementation specifics outlined in their blog posts.

I never heard of this company before, and I am very interested in IR technologies (https://github.com/phaistos-networks/Trinity). I am glad they are trying, and it looks like they have a chance for something great there.

Ahrefs.com is also working on a large scale web search engine, but other than their CEO announcing it on Twitter, there hasn’t been any other update since.


By default, there's no attempt to connect a particular semantic meaning with a particular dimension. It's worth noting that all the popular methods of calculating word vectors can give results that differ by an arbitrary linear transformation, so even if they do capture a very particular "semantic bit", it is likely to be "smeared" across all 200 dimensions. You could apply a linear (unambiguous, reversible) transformation into a different 200-dimensional space where that particular factor is isolated in a separate dimension, but you would have to explicitly try to do that.

So the default situation is that each individual dimension means "nothing and everything". If you have some specific factors which you know beforehand and want to determine, you can calculate a transformation that projects all the vectors into a different vector space where dimension #1 means thing A, #2 means thing B, etc. For example, there are some nice experiments with 'face' vectors that separate out age/masculinity/hair length/happiness/etc. from the initial vectors coming out of an image-analysis neural network, where each separate dimension has no clear meaning.


Even without a priori semantic goals, I think you can also transform/rotate your dimensions to maximize their interpretability.

Simplified example: if your two-dimensional system gives you two points:

    -0.5, 0.5
    0.5, 0.5
Then you losslessly rotate to

    0, 1.0
    1.0, 0
With the idea that the latter is simpler for humans to assign semantics to
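
For example, with numpy, a 45-degree (clockwise) rotation does exactly this - the values come out as 1/sqrt(2) rather than 1.0, since a pure rotation preserves lengths:

    import numpy as np

    pts = np.array([[-0.5, 0.5],
                    [ 0.5, 0.5]])

    theta = -np.pi / 4                            # rotate 45 degrees clockwise
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])

    rotated = pts @ R.T                           # apply the same rotation to every point
    # rotated is roughly [[0, 0.707], [0.707, 0]] - each point now sits on one axis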


This is the idea behind non-negative matrix formulation (NMF). As the name implies, it forces the entries of the embedding matrices (for both the reduced document and term matrix) to be nonnegative, which results in a more interpretable “sum of parts” representation. You can really see the difference (compared to LSA/SVD/PCA, which does not have this constraint) when it’s applied to images of faces. Also, NMF has been shown to be equivalent to word2vec. The classic paper is here: http://www.cs.columbia.edu/~blei/fogm/2019F/readings/LeeSeun...

PS—There should be a negative sign on the (2,2) entry of the first matrix.
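
A quick sketch with scikit-learn's NMF on a made-up document-term count matrix, just to show the non-negative "sum of parts" factors (not a real corpus):

    import numpy as np
    from sklearn.decomposition import NMF

    # toy document-term count matrix (4 documents x 4 terms), all entries non-negative
    X = np.array([[3, 0, 1, 0],
                  [2, 0, 0, 1],
                  [0, 4, 0, 2],
                  [0, 3, 1, 3]], dtype=float)

    model = NMF(n_components=2, init="nndsvd", random_state=0)
    W = model.fit_transform(X)   # document embeddings, constrained to be non-negative
    H = model.components_        # topic-to-term loadings, also non-negative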


> non-negative matrix formulation (NMF)

*factorization ;)

Also, PCA follows a similar idea as well (I mean, rotating vectors), but it's usually done in a much lower-dimensional space


Ugh, that one was auto-correct, I swear. I have no idea what’s going on at Apple’s NLP department.


Is this the same intent that a 'variational autoencoder' would perform?

Also, is it possible in non-variational implementations (like this one) that some of the dimensions represent multiple groups? For example, not just 0.5 and -0.5 groups, but also a 0.0 group in the middle. Then your rotation wouldn't be sufficient, you would need to increase the dimensionality to cleanly separate the groups.


In general, the individual coordinates of a word embedding (and hence a sentence embedding) have no semantic meaning, at least not in any "normal" sense.

The n-dimensional space in which the embeddings are represented is mainly just a space where similarity between two vectors/embeddings has some semantic meaning, e.g. query or word similarity.

A few more details are here: https://www.quora.com/Do-the-dimensions-of-Word2Vec-have-an-...


My favorite paper for introducing the idea is this oldie but goodie: http://lsa.colorado.edu/papers/dp1.LSAintro.pdf

The math behind Word2Vec and GloVe is different. Most word embedding models have been shown to be equivalent to some version of matrix factorization, though, and, at least if you're comfortable with linear algebra, the SVD formulation makes it relatively easy to get an intuitive grasp on what the dimensions in an embedding really "mean".
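
For instance, a toy LSA-style example (made-up counts; the truncated SVD gives the low-dimensional term and document vectors being discussed):

    import numpy as np

    # toy term-document count matrix (4 terms x 4 documents), purely illustrative
    X = np.array([[2, 0, 1, 0],
                  [1, 0, 2, 0],
                  [0, 3, 0, 1],
                  [0, 1, 0, 2]], dtype=float)

    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    k = 2                               # keep the top-k singular directions
    term_vecs = U[:, :k] * S[:k]        # k-dimensional term embeddings
    doc_vecs = Vt[:k].T * S[:k]         # k-dimensional document embeddings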


Word embeddings are unsupervised learning, so the features are not chosen, only the number of features. The model then learns the scalars for each feature as a single vector, depending on the algorithm/architecture.

When using CBOW, for instance, with a set window size N, the features learned for a single term are based on the N surrounding context terms.

This will result in similar vectors for terms appearing in the same context. It has its pros and cons though - a great example being “the reservation is confirmed” vs “the reservation is cancelled” - where confirmed/cancelled will have similar features.
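
A rough sketch of that setup with gensim (assuming 4.x parameter names; this toy corpus is far too small to learn anything meaningful, it just shows where the window and CBOW options go):

    from gensim.models import Word2Vec

    sentences = [["the", "reservation", "is", "confirmed"],
                 ["the", "reservation", "is", "cancelled"]]

    # sg=0 selects CBOW; window controls how many surrounding terms form the context
    model = Word2Vec(sentences, vector_size=50, window=2, sg=0, min_count=1, epochs=50)

    # with a real corpus, terms sharing contexts ("confirmed"/"cancelled") end up close together
    print(model.wv.similarity("confirmed", "cancelled"))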


It seems odd to me that they ended up computing the average of the word vectors, which is way too simple to represent meaning. A similarity metric over such a semantic vector is not accurate.


Just a couple of things:

- We are actually using a weighted average.

- This is just one of the techniques used for recall. The aim at this point is to be super fast and scalable, and not super accurate. Once we reach the precision stage of the ranking (as defined in https://0x65.dev/blog/2019-12-06/building-a-search-engine-fr...), we can afford to do fancier matching.

/ One of the authors


Weighted based on what? Do you keep IDF values of some sort?

Even then it's hard to imagine how this performs great if these are plain word2vec vectors. Saying it's just the recall step is a bit hand-wavy, as this step actually selects the documents you will be performing additional scoring on, and it may very well end up excluding a multitude of great results.

In any case, once more these are very interesting to read, and as a search nerd I can't help but wonder about all the alternatives considered.


We do have our own custom word (piece) embeddings that we have trained on <good query, bad query>-pairs. There are a few more details about it in https://0x65.dev/blog/2019-12-06/building-a-search-engine-fr....


I'm really curious about the trade-off between memory usage and latency. Can the work be parallelized, or does it mean you now need 5x+ the number of servers for the same amount of traffic? That would quickly counter (though not entirely) the reduced memory requirements.


Hi, I am one of the authors of this post.

As you mention, it is a trade-off. In this case we are trying to balance CPU and memory usage. First of all, if you just use the original vectors (or the quantized ones mentioned in the post), it might very well be the case that there is no server where you can even place the index.

So let us consider the example in the post and assume that we have a server with enough memory. We put the index (1 TB) in RAM and assume that one request takes 1-2 ms. Then a single CPU core can handle 500-1000 requests per second. So now we are using a lot of RAM and very little CPU. Depending on your use case, this can very well be better than what is proposed in the post, but for some it might be preferable to use servers with more balanced resources.

And just to clarify, the memory is shared between CPUs. So you pay the memory price once and you can then scale it to all the CPUs.


Nice read!

I'm curious as to what settings you used for product quantization, and the general setup that you tried. I find that there can generally be a smooth quality degradation from raw float32, to float16, through 8/6/4 bit scalar quantization, to product quantization down to a very small number of bits per vector if tuned properly. The quantization method is independent of the search method used (brute force, cell probe with inverted lists (IVF), cell probe with IMI inverted lists, HNSW), though both affect quality of results.

FYI, the Faiss library (used across Facebook and Instagram) has a CPU HNSW index implementation as well, with a GPU one possibly to come at some point.

Generally we do not use HNSW for our largest indices due to the memory overheads and restrictions on vector removal, but HNSW is useful as a top-level index for inverted list searching on these large indices (an index sitting on top of sub-indices), and indeed is possibly required for the largest indices, e.g. https://github.com/facebookresearch/faiss/wiki/Indexing-1T-v...
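
For anyone curious, the basic IVF+PQ setup in Faiss looks roughly like this (parameters here are illustrative, not tuned, and the data is random just to keep it self-contained):

    import faiss
    import numpy as np

    d = 200                                             # vector dimensionality, as in the post
    xb = np.random.rand(100000, d).astype("float32")    # stand-in for real query vectors

    quantizer = faiss.IndexFlatL2(d)                    # coarse quantizer for the inverted lists
    index = faiss.IndexIVFPQ(quantizer, d, 1024, 25, 8) # 1024 lists, 25 sub-quantizers, 8 bits each
    index.train(xb)
    index.add(xb)

    index.nprobe = 16                                   # inverted lists visited per query
    D, I = index.search(xb[:5], 10)                     # top-10 neighbours for 5 queries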


Hi, I am one of the authors of the post.

We tried using LOPQ, but I don't know the exact parameters used. We probably did not put enough effort into it, though.

Your approach seems interesting. Maybe we will try something similar in the future.


hmm...this 4 Billion queries number comes from where?

Looking at Stack Exchange, most of the sites have less than a million (possible?) queries (questions). Stack Overflow, probably the largest, has 18 million questions. Wikipedia's article count is about 5 million.

Is the assumption that people are coming up with 1000 different ways to query every possible result endpoint?


Hi, I am one of the authors of this post.

The number 4 billion is made up. It was chosen to illustrate the scale of the problem we are trying to solve, and it also happens to be just below the maximum value you can store in an unsigned 4-byte integer (2^32 ≈ 4.29 billion). This is of course beneficial when trying to build an index with a small footprint. With more queries, we would have to consider using long integers (8 bytes) or some custom type with e.g. 5 bytes.

If you're interested in where the queries come from, you might find these previous posts interesting: https://0x65.dev/blog/2019-12-05/a-new-search-engine.html https://0x65.dev/blog/2019-12-06/building-a-search-engine-fr...


A thought regarding the 4-byte vs 8-byte question: both of those will be highly optimizable with custom/manual vector instructions, like the modern AVX-512 instructions. Likely someone on your team has already thought of that or does it, but on the off chance you haven't looked into it yet, it might prove very useful!

Thanks for the posts, they’re fascinating and the idea of more alternative search engines would be great. Best of luck!


Thanks for these writeups. This is a very inspiring project.


Why would Stackoverflow and Wikipedia cover all queries for a web search?


Very nice and useful set of articles. Makes for good reading. Kudos


What do you use this similar search for? Is it for data on what previous users clicked on?


Lots of blog posts since their launch. 7 pretty in-depth blog posts in 7 days since Dec 1st.


They're gaming the system by producing intellectually stimulating quality material aimed at the hackernews demographic!


I thought the same after seeing the 2nd article out of the corner of my eye - I didn't know they had done more (and likely have more in the backlog). Fantastic PR with genuine value.


As the average HN consumer, I can get behind this.


hm damn, where's the median HN consumer when you need them?


I wonder how big their backlog of posts is. A single, short blog post takes me several hours to write and I need to plan each one in the back of my mind for about a week beforehand.


Seems like they have 24/25

> each day until this Christmas, we will explain one core piece.

https://0x65.dev/blog/2019-12-01/the-world-needs-cliqz-the-w...


According to their comment on yesterday's post, they are doing an "advent calendar" of blog posts with one post each day.



