
Ask HN: Tf-idf search index vs. brute-force Doc2vec and query comparison - pacavaca
Suppose that, on one hand, we have a classic search index with a tf-idf-based scoring algorithm (a well-configured Elasticsearch, let's say), and on the other hand, a list of document vectors, one per document, generated by some modern, semantics-aware Doc2vec algorithm (a la word2vec).
Now, setting speed concerns aside entirely, suppose that in the second case we simply iterate over every document, compute the distance from its vector to the query vector, and pick the N closest as the search result. Is it safe to assume that these results will definitely be more relevant to the query than those obtained from a regular search engine? Or will the improvement be only marginal, or none at all?
Can anyone point me to an existing experiment with some real numbers available?

Also, am I right that the second approach should handle matches like "health insurance" to "employee benefits" and "SF taxi" to "California transportation" more or less out of the box, assuming the Doc2vec model is well trained and produces rather large vectors for every document (or let's even assume we only work with document titles and hence rely on "Sentence2vec")?

I would be really grateful if someone could shed some light on this area of information retrieval for me. Thanks!
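The brute-force pass described in the question can be sketched in a few lines of plain Python. This is only a sketch under the question's own assumptions: `brute_force_search` and the toy vectors are made-up names, and the vectors are assumed to come from whatever Doc2vec model you have already trained.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length dense vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def brute_force_search(query_vec, doc_vecs, n=10):
    # Score every document against the query vector and keep the N closest.
    # doc_vecs: {doc_id: vector}; no index structure, just a linear scan.
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:n]]

# Toy usage: the query vector is closest to document "a".
top = brute_force_search([1.0, 0.0], {"a": [1.0, 0.1], "b": [0.0, 1.0]}, n=1)
```

With real embeddings this linear scan is O(number of documents) per query, which is exactly the speed concern the question sets aside; approximate nearest-neighbor indexes exist to avoid it, but they don't change the relevance question being asked here.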
======
PaulHoule
With some kind of "doc2vec" you can get improved results for "more like this"
queries, where the user supplies a document and the system finds more like it.

This leads to "relevance feedback" that really works.

I worked with a "doc2vec" system for patent search; it did a great job in the
scenario where somebody writes a paragraph describing an invention. For short
queries we fell back on something closer to tf*idf.
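The fallback described above might be wired up as a simple router; this is purely a guess at one possible shape, with `route_query` and the length threshold being invented for illustration, not anything from the actual patent-search system.

```python
def route_query(query_tokens, vector_search, keyword_search, min_len=20):
    # Hypothetical router: a long, paragraph-like query goes to the vector
    # ("doc2vec") path; a short keyword query falls back to tf*idf-style search.
    # vector_search / keyword_search are whatever callables implement each path.
    if len(query_tokens) >= min_len:
        return vector_search(query_tokens)
    return keyword_search(query_tokens)
```

The threshold would need tuning; the point is only that the two retrieval strategies can coexist behind one entry point.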

Click on my HN profile link and I can tell you more.

------
lovefromatx
word2vec is useful when one wants to build a machine-learning-based system. It
lets you get away with a really small matrix [number of documents, ~25-1000],
which makes ML feasible. Another advantage is that it preserves context: the
vectors for "car" and "vehicle" end up closely aligned.

The problem with implementing a vector-based search system is that your recall
is going to be really high: you will potentially get a lot of marginally
related results for your query.

My recommendation would be to implement a tf-idf-based system. You could
enhance your queries by enriching them with synonyms. To find synonyms you
could use something like LDA: build a topic model and add the words from the
relevant topic to the query.
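A minimal stdlib-only sketch of that recommendation follows: a tiny tf-idf index plus query expansion. The `synonyms` dict here stands in for the topic words an LDA model would produce (a real system would use a library such as gensim for the topic model); all function and variable names are made up for illustration.

```python
import math
from collections import Counter

def tfidf_index(docs):
    # docs: {doc_id: list of tokens}. Returns per-document term counts
    # and an idf table: idf(t) = log(N / df(t)).
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    n = len(docs)
    idf = {t: math.log(n / df[t]) for t in df}
    tfs = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}
    return tfs, idf

def search(query_tokens, tfs, idf, synonyms=None, n=10):
    # Optionally expand the query with synonyms (e.g. words drawn from an
    # LDA topic), then score each document as a sum of tf * idf over the
    # expanded query terms.
    expanded = list(query_tokens)
    if synonyms:
        for t in query_tokens:
            expanded.extend(synonyms.get(t, []))
    scores = {doc_id: sum(tf[t] * idf.get(t, 0.0) for t in expanded)
              for doc_id, tf in tfs.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n]
```

Without expansion, a query for "vehicle" matches nothing in a corpus that only says "car"; with the synonym table supplying "car", the right document surfaces, which is the whole point of the enrichment step.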

------
pacavaca
Thank you!

