This is extremely pertinent to me, as I have written my own "semantic search engine" and extractive summarizer using word vectors. It is available here:
https://www.nlpsearch.io
So far it uses a 100D GloVe model (very simple, and not all that good compared to BERT et al.) for performance reasons, and because I'm hosting it for free on AWS.
A much more fleshed-out version of my extractive summarizer, with a CLI supporting all the major language models (like GPT-2), is also available here:
https://github.com/Hellisotherpeople/CX_DB8
I love reading about how 0x65.dev is dealing with issues I've run into. Trying to index large numbers of text vectors is a headache for me, and at the moment I saw this, I was writing code to get the popular approximate-nearest-neighbors library "Annoy" working with my search engine.
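For context, here's a minimal sketch of the exact cosine lookup that a library like Annoy approximates; the function name and shapes are my own invention, not from Annoy or my project (pure NumPy brute force, which is what the approximate index replaces at scale):

```python
import numpy as np

def top_k_cosine(query, doc_matrix, k=5):
    """Exact top-k cosine search over all document vectors.

    This is the brute-force lookup that Annoy trades for an
    approximate (but much faster) tree-based index.
    """
    docs = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    sims = docs @ q                 # cosine similarity to every document
    return np.argsort(-sims)[:k]    # indices of the k most similar docs
```

The brute-force version is O(n) per query, which is exactly why an approximate index becomes necessary once the corpus gets large.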
As for the issue other HN users noted ("Isn't average pooling not that great?"), the answer is "Yeah, we know, but there's practically nothing better." The authors mentioned using a weighted average of vectors, which I assume means weighting the word vectors by tf-idf. I've read several papers suggesting this is not necessarily better than plain averaging. I've found some success with concatenating average-, max-, and min-pooled vectors together before cosine similarity searches, but it's certainly not a real fix and is much slower for not a lot of gain.
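A minimal sketch of that mean/max/min concatenation trick, assuming NumPy and word vectors already looked up from the embedding model (the function names are mine):

```python
import numpy as np

def pool_embed(word_vecs):
    """Concatenate mean-, max-, and min-pooled word vectors (d -> 3d)."""
    m = np.asarray(word_vecs)
    return np.concatenate([m.mean(axis=0), m.max(axis=0), m.min(axis=0)])

def cosine(a, b):
    """Cosine similarity between two pooled document vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

The tripled dimensionality is part of why it's slower: every similarity comparison now runs over 3d components instead of d.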
If anyone is a total nerd about word/document vectors and wants to chat / brainstorm about applying text vectors to search / extractive summarization - please contact me! I'm desperate to work with others on this problem!
I tried a weighted average too, but in my case the weights were computed by taking the dot product between every pair of word vectors in the phrase (an m x m similarity matrix), then averaging over the rows for each word and normalising. Kind of like a poor man's Transformer: it boosts words that are supported by other similar words in the same phrase.
Then you can take a weighted sum of the vectors, or use the ranks to select the most related words. It's also possible to run spectral clustering on the similarity matrix (sims_mat) to pull out the main topics of the text; that works quite well.
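A rough sketch of that weighting scheme, assuming NumPy with one row vector per word; the name sims_mat follows the description above, but the exact normalisation step is my guess:

```python
import numpy as np

def self_sim_weights(word_vecs):
    """Weight each word by how strongly the rest of the phrase supports it."""
    m = np.asarray(word_vecs, dtype=float)
    sims_mat = m @ m.T                  # m x m pairwise dot products
    weights = sims_mat.mean(axis=1)     # average support for each word
    weights = weights / weights.sum()   # normalise (assumes mostly positive sims)
    doc_vec = weights @ m               # weighted sum of the word vectors
    return weights, doc_vec
```

The same sims_mat would be the input to the spectral-clustering step for topic extraction.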