
LexVec, a word embedding model written in Go that outperforms word2vec - atrudeau
https://github.com/alexandres/lexvec
======
rspeer
As pre-built word vectors go, Conceptnet Numberbatch [1], introduced less
flippantly as the ConceptNet Vector Ensemble [2], already outperforms this on
all the measures evaluated in its paper: Rare Words, MEN-3000, and
WordSim-353.

This fact is hard to publicize because somehow the luminaries of the field
decided that they didn't care about these evaluations anymore, back when RW
performance was around 0.4. I have had reviewers dismiss improving Rare Words
from 0.4 to 0.6, and bringing MEN-3000 up to a high estimate of
inter-annotator agreement, as "incremental improvements."

It is possible to do much, much better than Google News skip-grams
("word2vec"), and one thing that helps get there is lexical knowledge of the
kind that's in ConceptNet.

[1] [https://blog.conceptnet.io/2016/05/25/conceptnet-
numberbatch...](https://blog.conceptnet.io/2016/05/25/conceptnet-numberbatch-
a-new-name-for-the-best-word-embeddings-you-can-download/)

[2] [https://blog.luminoso.com/2016/04/06/an-introduction-to-
the-...](https://blog.luminoso.com/2016/04/06/an-introduction-to-the-
conceptnet-vector-ensemble/)

~~~
rspeer
That said: LexVec gives quite good results on word-relatedness considering
that it uses only distributional knowledge, and only from Wikipedia at that.
Adding ConceptNet might give something that is more likely to be
state-of-the-art.

~~~
glup
...And just distributional knowledge makes it easy to train new models on
domain-specific corpora, or new languages. Is it possible to do the same with
ConceptNet?

I generally find that expert-derived ontologies suffer from poor coverage of
low-frequency items and rigidly discrete relationships, and are usually
limited to a single language. That said, they're vastly better than nothing
for a lot of tasks (the same goes for WordNet).

~~~
rspeer
You can retrain your distributional knowledge _and_ keep your lexical
knowledge. Moving to a new domain shouldn't mean you have to forget everything
about what words mean and hope you manage to learn it again.

The whole idea of Numberbatch is that a combination of distributional and
lexical knowledge is much better than either one alone.

BTW, ConceptNet is only partially expert-derived (much of it is crowd-
sourced), aims not to be as rigid as WordNet, and covers a whole lot of
languages.

"Retraining" ConceptNet itself is a bit of a chore, but you can do it. That
is, you can get the source [1], add or remove sources of data, and rebuild it.
Meanwhile, if you wanted to retrain word2vec's Google News skip-gram vectors,
you would have to get a machine learning job at Google.

[1]
[https://github.com/commonsense/conceptnet5](https://github.com/commonsense/conceptnet5)

------
herrkanin
It feels weird how "word embedding model" has come to refer to both the
underlying model and the implementation. word2vec is the implementation of
two models, Mikolov's continuous bag-of-words and skip-gram models, while
LexVec implements a version of the PPMI-weighted count matrix as referenced
in the README file. But the papers also discuss implementation details of
LexVec that have no bearing on the final accuracy. I feel like we should make
more effort to keep the models and reference implementations separate.

~~~
tdj
Aren't skip-grams (with negative sampling) equivalent to implicitly
factorizing a shifted PMI matrix?

[https://papers.nips.cc/paper/5477-neural-word-embedding-
as-i...](https://papers.nips.cc/paper/5477-neural-word-embedding-as-implicit-
matrix-factorization.pdf).
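
For concreteness, here is a minimal sketch of the explicit route that paper
compares against: build a shifted PPMI matrix from co-occurrence counts and
factorize it with truncated SVD. The toy count matrix and the shift value are
made up for illustration.

    import numpy as np

    def shifted_ppmi(counts, k=2.0):
        # counts[i, j] = how often word i co-occurs with context j
        total = counts.sum()
        word = counts.sum(axis=1, keepdims=True)   # word marginals
        ctx = counts.sum(axis=0, keepdims=True)    # context marginals
        with np.errstate(divide="ignore"):
            pmi = np.log(counts * total / (word * ctx))
        return np.maximum(pmi - np.log(k), 0.0)    # shift by log k, clip negatives

    # Toy co-occurrence counts (no all-zero rows or columns)
    counts = np.array([[8., 2., 1., 0.],
                       [2., 6., 0., 1.],
                       [1., 0., 5., 3.],
                       [0., 1., 3., 7.]])

    M = shifted_ppmi(counts, k=2.0)
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    word_vectors = U[:, :2] * np.sqrt(S[:2])       # rank-2 word embeddings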

------
loudmax
If anyone else is wondering what the heck "word embedding" means, it's a
natural language processing technique.

Here's a nice blog post about it: [http://sebastianruder.com/word-
embeddings-1/](http://sebastianruder.com/word-embeddings-1/)

It can process something like this: king - man + woman = queen

Neat-o.
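
If you want to try that yourself, a rough sketch with gensim follows; any
word2vec-format vectors will do, and the file name below is just a
placeholder.

    from gensim.models import KeyedVectors

    # Placeholder path; substitute whatever pre-trained vectors you have
    wv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

    # "king - man + woman" should come out close to "queen"
    print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))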

~~~
lukasb
Post starts off kind of dense. Finally we get to the section "Word embedding
models" and I say "ah ha! here we'll get a concise definition." Cut to ...

 _Naturally, every feed-forward neural network that takes words from a
vocabulary as input and embeds them as vectors into a lower dimensional space,
which it then fine-tunes through back-propagation, necessarily yields word
embeddings as the weights of the first layer, which is usually referred to as
Embedding Layer._

Naturally.

(Thanks, I believe that it is a great blog post, but I might look elsewhere
for an intro ... :)
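
For what it's worth, the idea buried in that sentence fits in a toy numpy
sketch: feeding a one-hot word vector into the first layer just selects a row
of the weight matrix, so those rows are the word embeddings.

    import numpy as np

    vocab_size, dim = 10, 4
    W = np.random.randn(vocab_size, dim)   # first-layer ("embedding") weights

    word_id = 3
    one_hot = np.zeros(vocab_size)
    one_hot[word_id] = 1.0

    # One-hot input times W picks out row `word_id`, so after training
    # the rows of W are the learned word vectors.
    assert np.allclose(one_hot @ W, W[word_id])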

~~~
oelmekki
Would love such an intro too, especially since I'm working on a product that
could greatly benefit from NLP and neural networks.

Could a kind person provide a good reference for something we could learn
from?

Or (as I fear), are we already past that brief window at the beginning of a
technique/science (I'm thinking of computing here) when you can still learn it
without going through formal academic study?

~~~
aweinstock
This is a fairly readable (high-level) post on word embeddings:
[http://colah.github.io/posts/2014-07-NLP-RNNs-
Representation...](http://colah.github.io/posts/2014-07-NLP-RNNs-
Representations/)

~~~
iraphael
And this is one that explains everything OP needs to understand the confusing
sentence from their intro:

[http://mccormickml.com/2016/04/19/word2vec-tutorial-the-
skip...](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-
model/)

~~~
oelmekki
Thanks :) I feel like some key concepts are still a bit beyond my reach, but
there are few enough of them that this makes a good starting point for
googling around. Thanks for your help!

------
rpedela
Slightly off-topic, but I thought this would be a good place to ask.

Are there any word embedding tools which take a Lucene/Solr/ES index as input
and output a synonyms file which can be used to improve search recall?

~~~
gtani
There are a few projects that use ES/Lucene as a backend/datastore once the
feature engineering is done, but I don't see models operating on the native
indexes directly. Maybe the format is too different from one-hot encodings
(after turning off stemming, stopword removal, and other info-losing steps).

[http://lucene.472066.n3.nabble.com/Where-Search-Meets-
Machin...](http://lucene.472066.n3.nabble.com/Where-Search-Meets-Machine-
Learning-td4203127.html)

[https://news.ycombinator.com/item?id=11876542](https://news.ycombinator.com/item?id=11876542)
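
That said, an indirect route does work in practice: dump the indexed text,
train embeddings on it, and write a Solr/ES-style synonyms file from nearest
neighbours. Rough sketch with gensim; the file names, similarity threshold,
and the assumption that the export is already tokenised are all illustrative.

    from gensim.models import Word2Vec

    # Hypothetical export of the indexed text, one document per line,
    # tokenised the same way your analyzer would tokenise queries.
    with open("exported_docs.txt") as f:
        sentences = [line.lower().split() for line in f]

    model = Word2Vec(sentences, vector_size=100, window=5, min_count=5, workers=4)

    # Solr/ES-style synonyms file: term => neighbour1, neighbour2, ...
    with open("synonyms.txt", "w") as out:
        for term in model.wv.index_to_key:
            neighbours = [w for w, sim in model.wv.most_similar(term, topn=3)
                          if sim > 0.7]
            if neighbours:
                out.write("%s => %s\n" % (term, ", ".join(neighbours)))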

------
IshKebab
Has anyone done any work on handling words that have overloaded meanings?
Something like 'lead' has two really distinct uses. It's really multiple words
that happen to be spelt the same.

~~~
riyadparvez
Well, there is Sense2Vec: [https://github.com/spacy-
io/sense2vec](https://github.com/spacy-io/sense2vec)
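
The trick, roughly, is to tag each token with its part of speech before
training, so 'lead|NOUN' and 'lead|VERB' get separate vectors. A sketch of
that preprocessing step using spaCy (assumes the en_core_web_sm model is
installed):

    import spacy

    nlp = spacy.load("en_core_web_sm")

    def sense_tokens(text):
        # Rewrite a sentence so each token carries its POS tag,
        # then train word2vec on these composite tokens.
        return [f"{tok.text.lower()}|{tok.pos_}" for tok in nlp(text)
                if not tok.is_punct]

    print(sense_tokens("The lead pipe was heavy."))   # ... 'lead|NOUN' ...
    print(sense_tokens("She will lead the team."))    # ... 'lead|VERB' ...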

~~~
eva1984
Sense2Vec can solve this one, but what if both meanings of the word have the
same POS tag?

------
ianbertolacci
Reminds me of Chord[1], a word2vec implementation written in Chapel.

[1] [https://github.com/briangu/chord](https://github.com/briangu/chord)

------
mooneater
Are there IP considerations? Word2vec is patented.

~~~
meeper16
System and method for generating a relationship network - K Franks, CA Myers,
RM Podowski - US Patent 7,987,191, 2011 -
[http://www.google.com/patents/US7987191](http://www.google.com/patents/US7987191)

~~~
glup
Would this really be usable in court? It seems super general to me, using a
lot of common techniques. Silly question: is it infringement to use any part
of the patent?

~~~
cschmidt
It is only infringement if you do something matching every part of some claim.
There may be lots of stuff in the description, and that doesn't matter. That
is, if a claim is a system comprising A, B, C, and D, and you do just A, B,
and C, then you're fine.

------
ris
Well done, that's probably the _least_ relevant use of "written in go" in a HN
headline I've seen. And there's some stiff competition for that title.

------
PaulHoule
From the viewpoint of commercial applications I find this profoundly
depressing.

When the state of the art for accuracy is 0.6 on some task, you are always
going to be a bridesmaid and never a bride, but hey, you can get bragging
rights because you did well on Kaggle.

~~~
computerex
That depends, to be honest. Depending on the task, 60% accuracy can be _far_
better than guessing at random, and human performance may not be that great
either. Combined with controls, heuristics, and validation, these "weak"
models can still be of great use in commercial settings.

