This is extremely pertinent to me, as I have written my own "semantic search engine" and extractive summarizer using word vectors, and it is available here:
https://www.nlpsearch.io
So far, it's using a 100D GloVe model (very simple and not all that good compared to BERT et al.) for performance reasons, and because I'm hosting it for free on AWS.
A much more fleshed-out version of my extractive summarizer with a CLI (supporting all major language models like GPT-2) is also available here:
https://github.com/Hellisotherpeople/CX_DB8
I love reading about how 0x65.dev is dealing with issues I've run into. Trying to index large numbers of text vectors is a headache for me, and at the moment I saw this, I was writing code to get the popular approximate nearest neighbors library "Annoy" to work with my search engine.
As far as the issues that other HN users noted ("Isn't average pooling not that great?"), the answer is "Yeah, we know, but there's literally nothing better". The authors mentioned using a weighted average of vectors, which I assume means weighting the word vectors by tf-idf. I've read several papers which say that doing this is not necessarily better than plain averaging. I've found some success with concatenating average-, max-, and min-pooled vectors together before cosine similarity searches - but it's certainly not a good fix, and it's much slower for not a lot of gain.
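For reference, the concat-pooling trick is only a few lines of numpy. This is my own sketch with random toy vectors standing in for real GloVe embeddings:

```python
import numpy as np

def pool_concat(word_vecs):
    """Concatenate mean-, max- and min-pooled vectors (rows = words)."""
    return np.concatenate([word_vecs.mean(axis=0),
                           word_vecs.max(axis=0),
                           word_vecs.min(axis=0)])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two toy "documents" of 4 words each, 100-dim GloVe-style vectors
rng = np.random.default_rng(0)
doc_a = rng.normal(size=(4, 100))
doc_b = rng.normal(size=(4, 100))

# Each document becomes a single 300-dim vector before the cosine search
sim = cosine(pool_concat(doc_a), pool_concat(doc_b))
```

The 3x dimensionality is exactly where the "much slower" part comes from.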
If anyone is a total nerd about word/document vectors and wants to chat / brainstorm about applying text vectors to search / extractive summarization - please contact me! I'm desperate to work with others on this problem!
I tried a weighted average, but in my case the weights were computed by taking dot products between all vectors in the phrase (an m x m similarity matrix, sims_mat), then taking the average over rows for each word and normalising. Kind of like a poor man's Transformer: it will boost words that are supported by other similar words in the same phrase.
Then you can take a weighted sum of the vecs or use the ranks to select the most related words. It is also possible to run spectral clustering on sims_mat to get the main topics of the text, it works quite well.
What does each of those 200 dimensions represent/encode?
This is a very interesting and pragmatic design. I tried a couple of queries and the results were great for most of them, but not so much for esoteric ones, or for queries involving names, which makes sense according to the implementation specifics outlined in their blog posts.
I had never heard of this company before, and I am very interested in IR technologies (https://github.com/phaistos-networks/Trinity). I am glad they are trying, and it looks like they have a chance at something great here.
Ahrefs.com is also working on a large scale web search engine, but other than their CEO announcing it on Twitter, there hasn’t been any other update since.
By default, there's no attempt to connect a particular semantic meaning with a particular dimension. It's worth noting that all the popular methods of calculating word vectors can/will give results that differ by an arbitrary linear transformation, so even if they do encode a very particular "semantic bit", it's still likely to be "smeared" across all 200 dimensions. You could apply a linear (unambiguous, reversible) transformation to a different 200-dimensional space where that particular factor is isolated in a separate dimension, but you would have to explicitly try to do that.
So the default situation is that each individual dimension means "nothing and everything". If you had some specific factors which you know beforehand and want to determine, then you could calculate a transformation to project all the vectors into a different vector space where #1 means thing A, #2 means thing B, etc. For example, there are some nice experiments with 'face' vectors that can separate out age/masculinity/hair length/happiness/etc. from the initial vectors coming out of an image-analysis neural network, where the meaning of each separate dimension is unclear.
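To make the "isolate a known factor" idea concrete, here's a small numpy sketch. It assumes you already have a semantic direction in hand (finding one, e.g. from averaged vector differences, is the hard part); the rotation then makes that direction coordinate #1:

```python
import numpy as np

def isolate_direction(vectors, direction):
    """Rotate the embedding space so `direction` becomes axis 0.

    `direction` is a known semantic axis (assumed given). The rotation is
    reversible and preserves all distances and similarities.
    """
    d = direction / np.linalg.norm(direction)
    # Build an orthonormal basis whose first vector is d (via QR of a
    # matrix whose first column is d).
    m = np.eye(len(d))
    m[:, 0] = d
    q, _ = np.linalg.qr(m)
    if q[:, 0] @ d < 0:        # QR may flip the sign; align with d
        q[:, 0] = -q[:, 0]
    return vectors @ q         # coordinate 0 now carries the chosen factor

rng = np.random.default_rng(2)
vecs = rng.normal(size=(10, 8))    # 10 vectors in an 8-dim toy space
axis = rng.normal(size=8)          # pretend this is a known semantic axis
rotated = isolate_direction(vecs, axis)
# rotated[:, 0] is each vector's projection onto the (unit-normed) axis
```

Everything else about the space is untouched, which is why the dimensions are "smeared" by default: nothing in training prefers any particular rotation.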
This is the idea behind non-negative matrix factorization (NMF). As the name implies, it forces the entries of the embedding matrices (for both the reduced document and term matrix) to be nonnegative, which results in a more interpretable "sum of parts" representation. You can really see the difference (compared to LSA/SVD/PCA, which do not have this constraint) when it's applied to images of faces. Also, NMF has been shown to be equivalent to word2vec. The classic paper is here: http://www.cs.columbia.edu/~blei/fogm/2019F/readings/LeeSeun...
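For illustration, the classic Lee-Seung multiplicative updates fit in a few lines of numpy. This is a sketch on made-up term-document counts, not a production NMF:

```python
import numpy as np

def nmf(X, k, iters=500, eps=1e-9):
    """Lee-Seung multiplicative updates: X ~= W @ H with W, H >= 0."""
    rng = np.random.default_rng(0)
    n, m = X.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + eps)   # update stays nonnegative
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Tiny nonnegative "term-document" count matrix with two obvious topics
X = np.array([[3., 1., 0., 0.],
              [6., 2., 0., 0.],
              [0., 0., 4., 2.],
              [0., 0., 2., 1.]])
W, H = nmf(X, k=2)
# W and H are nonnegative, so W @ H reads as an additive "sum of parts"
```

Because no subtraction is allowed, each factor has to be a part that is only ever added in, which is where the interpretability comes from.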
PS—There should be a negative sign on the (2,2) entry of the first matrix.
Is this the same intent that a 'variational autoencoder' would pursue?
Also, is it possible in non-variational implementations (like this one) that some of the dimensions represent multiple groups? For example, not just 0.5 and -0.5 groups, but also a 0.0 group in the middle. Then your rotation wouldn't be sufficient, you would need to increase the dimensionality to cleanly separate the groups.
In general, the individual coordinates of a word embedding (and hence a sentence embedding) have no semantic meaning, at least not in any "normal" sense.
The n-dimensional space into which the embeddings have been represented is mainly just a space such that similarity between two vectors/embeddings in this space has some semantic meaning, e.g. query or word similarity.
The math behind Word2Vec and GloVe is different. Most word embedding models have been shown to be equivalent to some version of matrix factorization, though, and, at least if you're comfortable with linear algebra, the SVD formulation makes it relatively easy to get an intuitive grasp on what the dimensions in an embedding really "mean".
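A toy end-to-end example of that SVD view: build a PPMI matrix from co-occurrence counts and factor it, and the low-dimensional vectors already group related words. The counts below are made up, purely for illustration:

```python
import numpy as np

# Toy word-word co-occurrence counts (symmetric):
# indices 0-2 are "cat", "dog", "pet"; 3-5 are "stock", "bond", "market"
C = np.array([[0, 8, 6, 0, 0, 1],
              [8, 0, 7, 0, 1, 0],
              [6, 7, 0, 1, 0, 1],
              [0, 0, 1, 0, 9, 7],
              [0, 1, 0, 9, 0, 8],
              [1, 0, 1, 7, 8, 0]], dtype=float)

# Positive PMI: max(0, log P(i,j) / (P(i) P(j)))
total = C.sum()
p_ij = C / total
p_i = C.sum(axis=1) / total
with np.errstate(divide="ignore"):
    ppmi = np.maximum(np.log(p_ij / np.outer(p_i, p_i)), 0.0)

# Truncated SVD: keep the top-k singular directions as embedding dimensions
U, S, _ = np.linalg.svd(ppmi)
k = 2
emb = U[:, :k] * np.sqrt(S[:k])   # k-dim word embeddings

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
# cos(emb[0], emb[1]) ("cat"/"dog") comes out higher than
# cos(emb[0], emb[4]) ("cat"/"bond")
```

Each dimension here is a singular direction of the PPMI matrix: a weighted combination of contexts, which is why single coordinates rarely map to one clean concept.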
Word embeddings are unsupervised learning, so the features are not chosen, only the number of features. The model then learns a scalar for each feature, giving a single vector per term, depending on the algorithm/architecture.
When using CBOW, for instance, with a set window size N, the features learned for a single term are based on the surrounding N context terms (order within the window is ignored - it's a bag of words).
This will result in similar vectors for terms appearing in the same context. It has its pros and cons though - a great example being “the reservation is confirmed” vs “the reservation is cancelled” - where confirmed/cancelled will have similar features.
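A toy illustration of that con: if you just count shared context words (a crude stand-in for what a context-window model learns from), "confirmed" and "cancelled" look alike because they occur after the same words:

```python
from collections import Counter

sentences = [
    "the reservation is confirmed".split(),
    "the reservation is cancelled".split(),
    "the payment is pending".split(),
]

def context_counts(word, window=2):
    """Count the words appearing within `window` positions of `word`."""
    ctx = Counter()
    for sent in sentences:
        for i, w in enumerate(sent):
            if w == word:
                lo, hi = max(0, i - window), min(len(sent), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        ctx[sent[j]] += 1
    return ctx

def shared_context(a, b):
    """Overlap of the two words' context-count vectors."""
    return sum((context_counts(a) & context_counts(b)).values())

# "confirmed" and "cancelled" share the context "reservation is", so they
# score as more similar to each other than either is to "pending" -
# even though they are opposites in meaning.
```

Distributional similarity captures "used in the same situations", not "means the same thing".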
It seems odd to me that they ended up computing the average of the word vectors, which is way too simple to represent meaning. A similarity metric over this semantic vector is not accurate.
This is just one of the techniques used for recall. The aim at this point is to be super fast and scalable, not super accurate. Once we reach the precision stage of the ranking (as defined in https://0x65.dev/blog/2019-12-06/building-a-search-engine-fr...), we can afford to do fancier matching.
Weighted based on what, do you keep IDF values of some sort?
Even then, it's hard to imagine how this performs well if these are plain word2vec vectors. Saying it's just the recall step is a bit hand-wavy, as this step actually selects the documents you will perform additional scoring on, and it may very well end up excluding a multitude of great results.
In any case, once more these are very interesting to read, and as a search nerd I can't help but wonder about all the alternatives considered.
I'm really curious about the trade-off between memory usage and latency. Can the work be parallelized, or does it mean you now need 5x+ the number of servers for the same amount of traffic? That would quickly counter (though not entirely) the reduced memory requirements.
As you mention, it is a trade-off. In this case we are trying to balance CPU and memory usage. First of all, if you just use the original vectors (or the quantized ones mentioned in the post), it might very well be the case that there is no server where you can even place the index.
So let us consider the example in the post and assume we have a server with enough memory. We put the index (1 TB) in RAM and assume that one request takes 1-2 ms. Then a single CPU core can handle 500-1000 requests per second. So now we are using a lot of RAM and very little CPU. Depending on your use case, this can very well be better than what is proposed in the post, but for some it might be preferable to use servers with more balanced resources.
And just to clarify, the memory is shared between CPUs. So you pay the memory price once and you can then scale it to all the CPUs.
I'm curious as to what settings you used for product quantization, and the general setup that you tried. I find that there can generally be a smooth quality degradation from raw float32, to float16, through 8/6/4 bit scalar quantization, to product quantization down to a very small number of bits per vector if tuned properly. The quantization method is independent of the search method used (brute force, cell probe with inverted lists (IVF), cell probe with IMI inverted lists, HNSW), though both affect quality of results.
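As a concrete example of one rung on that ladder, 8-bit scalar quantization per dimension is only a few lines. This is a numpy sketch, not the Faiss implementation:

```python
import numpy as np

def quantize_uint8(X):
    """Per-dimension uniform scalar quantization of float vectors to 8 bits."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    scale = (hi - lo) / 255.0
    scale[scale == 0] = 1.0                 # guard constant dimensions
    codes = np.round((X - lo) / scale).astype(np.uint8)
    return codes, lo, scale

def dequantize(codes, lo, scale):
    return codes.astype(np.float32) * scale + lo

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 64)).astype(np.float32)
codes, lo, scale = quantize_uint8(X)        # 4x smaller than float32
Xr = dequantize(codes, lo, scale)
max_err = np.abs(X - Xr).max()              # at most half a quantization step
```

Product quantization goes further by jointly coding groups of dimensions with learned codebooks, which is where the tuning matters.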
FYI, the Faiss library (used across Facebook and Instagram) has a CPU HNSW index implementation as well, with a GPU one possibly to come at some point.
Generally we do not use HNSW for our largest indices due to the memory overheads and restrictions on vector removal, but HNSW is useful as a top-level index for inverted list searching on these large indices (an index sitting on top of sub-indices), and indeed is possibly required for the largest indices, e.g. https://github.com/facebookresearch/faiss/wiki/Indexing-1T-v...
hmm...this 4 Billion queries number comes from where?
Looking at Stack Exchange, most of the sites have less than a million (possible?) queries (questions). Stack Overflow, probably the largest, has 18 million questions. Wikipedia's article count is about 5 million.
Is the assumption that people are coming up with 1000 different ways to query every possible result endpoint?
The number 4 billion is made up. It was chosen to illustrate the scale of the problem we are trying to solve. It also happens to be just below the maximum value that can be stored in an unsigned 4-byte integer, which is of course beneficial when trying to build an index with a small footprint. With more queries, we would have to consider using long integers (8 bytes) or some custom type with e.g. 5 bytes.
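A quick sketch of what such a 5-byte custom type could look like in practice (illustrative Python, not their actual code):

```python
# Pack query ids into 5 bytes instead of 8: a little decode work
# in exchange for a ~37% smaller id column.
def pack_id5(n: int) -> bytes:
    assert 0 <= n < 2**40          # 5 bytes cover ~1.1 trillion ids
    return n.to_bytes(5, "little")

def unpack_id5(b: bytes) -> int:
    return int.from_bytes(b, "little")

# Ids beyond 2**32 no longer fit in 4 bytes, but 5 bytes still suffice
ids = [0, 2**32 + 5, 2**40 - 1]
packed = b"".join(pack_id5(i) for i in ids)   # 3 ids in 15 bytes
```

Fixed-width records like this also keep the index trivially seekable by id position.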
A thought regarding the 4-byte vs 8-byte choice: both will be highly optimizable with custom/manual vector instructions, like the modern AVX-512 instructions. Likely someone on your team has already thought of that, but on the off chance you haven't looked into it yet, it might prove very useful!
Thanks for the posts, they’re fascinating and the idea of more alternative search engines would be great. Best of luck!
I thought the same after catching the 2nd article out of the corner of my eye - I didn't know they had done more (and likely have more in the backlog). Fantastic PR with genuine value.
I wonder how big their backlog of posts is. A single, short blog post takes me several hours to write and I need to plan each one in the back of my mind for about a week beforehand.