StarSpace can compute 6 types of entity embeddings, of which word embeddings are just one type. It's a whole family of algorithms.
I mean king = queen - woman + man
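For anyone who hasn't seen that arithmetic spelled out, here is a toy sketch. The 2-d vectors are made up by hand (one axis loosely "royalty", the other "gender"), not taken from any trained model, and `analogy` is a hypothetical helper name.

```python
import math

# Hypothetical 2-d "embeddings": (royalty, gender). Values invented for illustration.
vectors = {
    "king":  (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man":   (0.0, 1.0),
    "woman": (0.0, -1.0),
    "apple": (-1.0, 0.0),
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vocab):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = tuple(x - y + z for x, y, z in zip(vocab[a], vocab[b], vocab[c]))
    candidates = (w for w in vocab if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vocab[w], target))

# queen - woman + man
result = analogy("queen", "woman", "man", vectors)
print(result)
```

Real embeddings have hundreds of dimensions and the match is only approximate, which is why the input words are usually excluded from the candidates.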
That's the kind of thing we have ontologies for.
This article mentions that word embeddings are useful inside translators, but from the viewpoint of somebody who wants to extract meaning from text, what use is something that doesn't handle polysemy and phrases?
It's not the job of word embeddings to handle phrases - but nearly all modern phrase embedding algorithms sit on top of word embeddings. They often create a weighted average of embeddings by using an attention model, or they can use a more complex model such as an LSTM with attention (e.g. CoVe - https://arxiv.org/abs/1708.00107 ).
Word embeddings can handle polysemy - high-dimensional vectors can (and do) hold information of various types that is used in different ways in different contexts. Some approaches deal with this more directly (e.g. including the part of speech as part of the vocab item), and that can sometimes help a bit.
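To make the "weighted average via attention" idea concrete, here is a minimal sketch of the simplest variant I can think of: dot-product scores against a single query vector, softmaxed into weights. The `query` vector would be learned in a real model; here it and the word vectors are made-up numbers, and `attention_pool` is a hypothetical name.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(word_vectors, query):
    """Weighted average of word vectors; weights come from dot products with `query`."""
    scores = [sum(w * q for w, q in zip(vec, query)) for vec in word_vectors]
    weights = softmax(scores)
    dim = len(query)
    pooled = [
        sum(weight * vec[i] for weight, vec in zip(weights, word_vectors))
        for i in range(dim)
    ]
    return pooled, weights

# Three toy word vectors for one phrase, plus an assumed (normally learned) query.
phrase = [[0.1, 0.0], [2.0, 1.0], [0.0, 0.2]]
query = [1.0, 1.0]
pooled, weights = attention_pool(phrase, query)
print(weights, pooled)
```

The second word scores highest against the query, so it dominates the pooled phrase vector. Models like CoVe are far more involved than this; the sketch only shows the pooling step.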
For instance, the random result for IMDb in Table 2 is 88.4 and the best one is 92.1; that's really not a lot of lift. I could see the TREC-6 and TREC-50 results being good enough to let off the leash, but I still have a hard time picturing this being useful in the real world.
To see an example, type "fuel" in the search input on this page: https://openvoyce.com//products/quuu
You'll see many relevant results, none of them using the word "fuel". This is done purely with Postgres, sorting by L2 distance - no Elasticsearch.
What is the shape of the database? Do you normalize each document into a single vector which is compared, or are you keeping per-word vectors? I'm imagining you probably don't have a database with a row for every word, but maybe you do?
How do you pre-filter the list of documents to compute L2 for? If no pre-filtering, can this approach scale into millions of documents?
I have a separate service that holds the word embeddings, generated with word2vec. The idea is to generate an embedding for the document by taking a weighted average of the embeddings of the words it contains, each word's coefficient being based on its rarity (so a rarer word has more weight than a stop word).
When saving a document, OpenVoyce contacts this API and asks it to generate an embedding for the document, then saves only that embedding in its own database (as a "cube" vector of 200 dimensions).
From there, searching for something new is just a matter of requesting an embedding for the search terms and using `cube_distance()` as the sort function. It does not require pre-filtering, since stop words are already weighted down (although there is some filtering in the API, as it ignores words it doesn't know).
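Roughly, the pipeline described above looks like the sketch below. Everything in it is a stand-in: the 2-d vectors replace the real 200-d word2vec ones, the document frequencies are invented, and the IDF-style `rarity_weight` is an assumption (the exact rarity formula isn't given above). The in-memory sort plays the role of ordering by `cube_distance()` in Postgres.

```python
import math

# Hypothetical toy vectors; a real service would load trained word2vec embeddings.
embeddings = {
    "fuel":   [0.9, 0.1],
    "petrol": [0.8, 0.2],
    "gas":    [0.85, 0.15],
    "price":  [0.5, 0.5],
    "cat":    [0.0, 1.0],
    "the":    [0.5, 0.5],
}
# Hypothetical fraction of documents containing each word.
doc_freq = {"fuel": 0.01, "petrol": 0.01, "gas": 0.02,
            "price": 0.05, "cat": 0.02, "the": 0.95}

def rarity_weight(word):
    # Assumed IDF-style weight: rare words get large weights, stop words near zero.
    return -math.log(doc_freq[word])

def embed(text):
    """Rarity-weighted average of the known words' vectors, or None if none are known."""
    words = [w for w in text.lower().split() if w in embeddings]  # unknown words ignored
    if not words:
        return None
    total = sum(rarity_weight(w) for w in words)
    dim = len(embeddings[words[0]])
    return [sum(rarity_weight(w) * embeddings[w][i] for w in words) / total
            for i in range(dim)]

def l2(a, b):
    """Euclidean distance, the same quantity cube_distance() computes."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def search(query, documents):
    """Return documents sorted by L2 distance to the query embedding."""
    q = embed(query)
    if q is None:
        return []
    scored = []
    for doc in documents:
        vec = embed(doc)  # in practice this is computed once, at save time
        if vec is not None:
            scored.append((l2(q, vec), doc))
    return [doc for _, doc in sorted(scored)]

docs = ["the petrol price", "the cat", "gas price"]
ranked = search("fuel", docs)
print(ranked)
```

None of the toy documents contains "fuel", yet the two fuel-related ones rank ahead of "the cat", and "the" barely moves any document vector because its weight is close to zero.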
It would still help to be able to define user-specific stop words, though. For example, on Quuu's OpenVoyce, most suggestions are about adding new categories, so "category" should be considered a stop word. That's something I plan to implement.
I can't tell yet how it scales to millions of records because we're very far from there for now (there are 4500+ suggestions and comments on OpenVoyce at present). My bet is that if the amount of data becomes a problem, it can be fixed by reducing the number of dimensions of the vectors.
Oh, there's also something to know: the cube extension for Postgres doesn't allow more than 100 dimensions. This is configurable, but only by editing a header file in the extension (that's the author's recommended method). I've detailed the problem and solution in my pg350d repo
Have you done any evaluation to show that this is better than a more conventional search engine? The lesson of TREC is that 95% of the things that will "obviously" improve search results will not.
Previously one used TF-IDF to represent a document, but now one uses Word2Vec and will usually get better results.
For example, "From Word Embeddings To Document Distances" shows how to use a new distance measure (Word Mover's Distance) in classification tasks. This leads to state-of-the-art performance on 6 out of 8 classic text classification datasets (and very close on the other 2).
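To give a feel for what that distance measures, here is a sketch of the relaxed lower bound on Word Mover's Distance, not the exact distance: each word simply sends all of its weight to the nearest word in the other document, and the larger of the two directions is kept. The exact version needs an optimal-transport solver. The toy 2-d vectors and the name `relaxed_wmd` are assumptions for illustration, and every document is assumed to contain at least one known word.

```python
import math
from collections import Counter

# Hypothetical 2-d vectors; real use would plug in trained word embeddings.
embeddings = {
    "obama":     [1.0, 0.0],
    "president": [0.9, 0.1],
    "media":     [0.1, 0.9],
    "press":     [0.0, 1.0],
    "banana":    [-1.0, -1.0],
}

def nbow(words):
    """Normalized bag of words over the known vocabulary."""
    counts = Counter(w for w in words if w in embeddings)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def one_sided(src, dst):
    """Cost when every source word moves all its weight to its nearest target word."""
    return sum(
        weight * min(dist(embeddings[w], embeddings[v]) for v in dst)
        for w, weight in src.items()
    )

def relaxed_wmd(doc_a, doc_b):
    """Lower bound on Word Mover's Distance: the larger of the two one-sided costs."""
    a, b = nbow(doc_a), nbow(doc_b)
    return max(one_sided(a, b), one_sided(b, a))

d_close = relaxed_wmd(["obama", "media"], ["president", "press"])
d_far = relaxed_wmd(["obama", "media"], ["banana", "banana"])
print(d_close, d_far)
```

The two documents that share no words but talk about the same thing come out much closer than the unrelated pair, which is the property TF-IDF overlap cannot capture.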
In most other practical NLP tasks you find similar results: replacing an older representation with a word embedding almost always improves performance.
Somewhat unrelated, but the old joke goes: give one problem to two ontology experts, and they will come up with three different ontologies.
Selecting context words differently is also an option for improvement. The fact that using dependency structures to "filter" the context window seems to work better than "filtering" by subsampling frequent words illustrates that there is room for it. We may see other ways of selecting context words in the future, since word embeddings are used as a building block, especially lately with the StarSpace hype advocating the idea of general-purpose, task-agnostic embeddings.
Or we can consider that the expected improvements are insignificant compared with those coming from the model learned on top of those embeddings for a downstream task, which may update the embeddings specifically for that task...
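For reference, the subsampling "filter" mentioned here is, in the word2vec paper's formulation, a per-token coin flip with discard probability 1 - sqrt(t / f(w)), where f(w) is the word's relative frequency and t is a small threshold (around 1e-5 in the paper). A minimal sketch, with hypothetical function names and made-up frequencies:

```python
import math
import random

def discard_probability(freq, t=1e-5):
    """Probability of dropping a token whose relative corpus frequency is `freq`."""
    return max(0.0, 1.0 - math.sqrt(t / freq))

def subsample(tokens, freqs, t=1e-5, rng=random):
    """Keep each token with probability 1 - discard_probability(freq of that token)."""
    return [tok for tok in tokens
            if rng.random() >= discard_probability(freqs[tok], t)]

# Made-up relative frequencies: "the" is very common, "embedding" is rare.
freqs = {"the": 0.05, "embedding": 0.00001}
p_the = discard_probability(freqs["the"])
p_rare = discard_probability(freqs["embedding"])
print(p_the, p_rare)

kept = subsample(["the", "embedding"] * 1000, freqs, rng=random.Random(0))
print(kept.count("the"), kept.count("embedding"))
```

Frequent words are dropped almost all the time and rare words never, regardless of their syntactic role, which is the contrast with dependency-based context selection.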
(Disclaimer: I am a co-author.)
Keep up the great work!
You will note that negative sampling improved by leveraging information on word pairs from dictionary entries (we called it "controlled negative sampling") does help, though not much. It really depends on the rate of rare words (see section 5.4; the improvement ranges from 0.7% up to 10%). But I guess it is already an interesting, somewhat counter-intuitive observation.
Another very interesting observation is that you can also just take a general-purpose dataset and expand it with external contextual information (meaning not using it for supervision, but simply appending it in raw form to the end of the training corpus [^]). In our case, we call those corpora:
- corpus A: plain old Wikipedia dump
- corpus B: plain old Wikipedia dump + dictionary text appended at the end of it.
It sounds a bit naive: the latter part of the training corpus is really small w.r.t. the full Wikipedia dump.
Nonetheless, it has a significant impact on word similarity (see the improvement in Table 2 to see how those training corpora influence the representations learnt by word2vec, fasttext and dict2vec).
(Related to the effect of the training corpus : https://arxiv.org/pdf/1507.05523v1.pdf)
I mention this effect of training corpus content here since it sounds like useful information for working natural language processing practitioners (get a mid-size general training corpus, add as many contextual corpora as possible => may yield useful embeddings...).
[^] To be entirely fair, this was suggested to us by an anonymous reviewer; many thanks to him/her for pointing it out. I found the results surprising.