
I love this and the previous two posts!

This is extremely pertinent to me as I have written my own "semantic search engine" and extractive summarizer using word vectors and it is available here:

https://www.nlpsearch.io

So far, it's using a 100D GloVe model (very simple, and not all that good compared to BERT et al.) for performance reasons, and because I'm hosting it for free on AWS.
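For the curious, the core technique is just average pooling over GloVe vectors plus cosine similarity. A minimal sketch of that pipeline (the file path and helper names are placeholders for illustration, not the site's actual code):

    import numpy as np

    def load_glove(path="glove.6B.100d.txt"):
        # Parse the standard GloVe text format: "word v1 v2 ... v100" per line
        vectors = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                parts = line.rstrip().split(" ")
                vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
        return vectors

    def embed(text, vectors):
        # Average-pool the word vectors of the known words (the simple baseline)
        words = [w for w in text.lower().split() if w in vectors]
        if not words:
            return np.zeros(100, dtype=np.float32)  # 100D to match the model
        return np.mean([vectors[w] for w in words], axis=0)

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

Ranking is then just cosine(embed(query, vectors), embed(doc, vectors)) over the corpus.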

A much more fleshed-out version of my extractive summarizer, with a CLI (supporting all major language models, like GPT-2), is also available here:

https://github.com/Hellisotherpeople/CX_DB8

I love reading about how 0x65.dev is dealing with issues I've run into. Trying to index large numbers of text vectors is a headache for me, and at the very moment I saw this, I was writing code to get the popular approximate nearest neighbors library "Annoy" working with my search engine.
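If it helps anyone, Annoy's API is tiny; the indexing code looks roughly like this (doc_vectors and query_vec are stand-ins for your own data):

    from annoy import AnnoyIndex

    DIM = 100  # matches the 100D GloVe vectors
    index = AnnoyIndex(DIM, "angular")  # angular distance ~ cosine similarity

    # doc_vectors: any iterable of 100D vectors (placeholder for your corpus)
    for i, vec in enumerate(doc_vectors):
        index.add_item(i, vec)

    index.build(50)         # number of trees; more trees = better recall, bigger index
    index.save("docs.ann")  # memory-mapped on load, so it's cheap to share

    # IDs of the 10 approximate nearest neighbors of a query vector
    neighbour_ids = index.get_nns_by_vector(query_vec, 10)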

As for the issue other HN users noted ("Isn't average pooling not that great?"), the answer is "Yeah, we know, but there's literally nothing better." The authors mentioned using a weighted average of vectors, which I assume means weighting the word vectors by tf-idf. I've read several papers saying that doing this is not necessarily better than plain averaging. I've found some success with concatenating average-, max-, and min-pooled vectors together before the cosine similarity search (sketched below), but it's certainly not a good fix, and it's much slower for not a lot of gain.
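Concretely, the concatenated pooling I mean is just this (a sketch over a (num_words, dim) matrix of word vectors):

    import numpy as np

    def concat_pool(word_vecs):
        # word_vecs: (num_words, dim) array of word vectors for one document.
        # Returns a 3*dim embedding: average, element-wise max, element-wise min.
        return np.concatenate([
            word_vecs.mean(axis=0),
            word_vecs.max(axis=0),
            word_vecs.min(axis=0),
        ])

The tripled dimensionality is also why the similarity search gets so much slower.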

If anyone is a total nerd about word/document vectors and wants to chat / brainstorm about applying text vectors to search / extractive summarization - please contact me! I'm desperate to work with others on this problem!




I tried a weighted average, but in my case the weights were computed by taking the dot product between all pairs of vectors in the phrase (an m x m matrix), then averaging over the rows for each word and normalising. Kind of like a poor man's Transformer: it boosts words that are supported by other similar words in the same phrase.


Can you provide pseudocode or code for this?

I'd post a snippet of my implementation but I'm too dumb to figure out the format.

But wow, this gives much better results! Now to figure out how to make it fast!


Something like this, including thresholding for pairs of words that have low dot product:

    import numpy as np

    def inner_product_rank(vecs, threshold=0.5):
        # Pairwise dot products between every pair of word vectors (m x m)
        sims_mat = np.dot(vecs, vecs.T)
        # Drop weakly related pairs: shift by the threshold and clip at zero
        sims_mat = np.maximum(sims_mat - threshold, 0.0)
        # A word's rank is the total support it gets from the rest of the phrase
        ranks = np.sum(sims_mat, axis=0)
        return ranks
Then you can take a weighted sum of the vecs, or use the ranks to select the most related words. It is also possible to run spectral clustering on sims_mat to get the main topics of the text; it works quite well.
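In code, the weighted sum looks like this (assuming the word vectors are roughly unit-norm, so the dot products act like cosine similarities and the 0.5 threshold is meaningful):

    # vecs: (m, dim) matrix of the phrase's word vectors, rows ~ unit-norm
    ranks = inner_product_rank(vecs, threshold=0.5)
    weights = ranks / (ranks.sum() + 1e-9)  # normalise; epsilon guards all-zero ranks
    phrase_vec = weights @ vecs             # rank-weighted average of the word vectors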


Hey - how can I contact you?


gedboy2112@gmail.com


Maybe delete your email from this post when the guy contacts you



