
Sense2vec – A Fast and Accurate Method for Word Sense Disambiguation - williamtrask
http://arxiv.org/abs/1511.06388
======
syllogism
Like a lot of work in NLP at the moment, this is a reasonably straightforward
mash-up of existing techniques. This particular idea is pretty obvious. What
wasn't obvious was whether it would work well, which was why nobody else got
around to trying it yet. The experiments are nicely conducted, with strong
baselines across multiple evaluations. The authors also include experiments
with variations of the idea, to further validate the approach.

To understand the technique, first understand word2vec:

[http://rare-technologies.com/word2vec-tutorial/](http://rare-technologies.com/word2vec-tutorial/)

[http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/](http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/)

Now understand part-of-speech tagging:

[http://spacy.io/blog/part-of-speech-POS-tagger-in-python/](http://spacy.io/blog/part-of-speech-POS-tagger-in-python/)

By default word2vec gives you clusters for each word; this paper gives you
clusters for each word_POS pair, e.g. The_DT Apple_NNP employee_NN is_VBZ
eating_VBG an_DT apple_NN. The same trick is done with named entity labels as
well.
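The preprocessing trick can be sketched in a few lines of Python (a minimal
illustration; the tags here are hand-written to match the example above, where
a real pipeline would get them from a POS tagger):

```python
# Sketch of sense2vec's preprocessing step: append each token's POS tag so
# that "Apple" the proper noun and "apple" the common noun become distinct
# vocabulary items before word2vec training.

def merge_pos(tagged_sentence):
    """Turn (word, tag) pairs into word_TAG tokens."""
    return ["%s_%s" % (word, tag) for word, tag in tagged_sentence]

sentence = [("The", "DT"), ("Apple", "NNP"), ("employee", "NN"),
            ("is", "VBZ"), ("eating", "VBG"), ("an", "DT"), ("apple", "NN")]

tokens = merge_pos(sentence)
print(tokens)
# The resulting token lists are then fed to any word2vec implementation
# in place of the raw words.
```

The two occurrences of "apple" come out as `Apple_NNP` and `apple_NN`, so the
embedding model learns a separate vector for each.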

The following papers explain how the new word vectors are used in a dependency
parser:

Collobert and Weston (2011):
[http://arxiv.org/pdf/1103.0398.pdf](http://arxiv.org/pdf/1103.0398.pdf)

Chen and Manning (2014):
[http://cs.stanford.edu/~danqi/papers/emnlp2014.pdf](http://cs.stanford.edu/~danqi/papers/emnlp2014.pdf)

Yoav Goldberg (2015):
[http://u.cs.biu.ac.il/~yogo/nnlp.pdf](http://u.cs.biu.ac.il/~yogo/nnlp.pdf)
Survey/review, aimed at grad students

~~~
abtinf
"This particular idea is pretty obvious. What wasn't obvious was whether it
would work well, which was why nobody else got around to trying it yet."

Ideas are often obvious after you hear them for the first time.

~~~
bhickey
This one is actually kind-of, sort-of obvious. For NaNoGenMo this year I
smushed together word2vec and a POS tagger. What the authors have done here is
really cool and goes miles beyond my hacks, but the kernel of the idea should
be obvious to anyone familiar with word2vec.

------
j_jochem
> We demonstrate that these embeddings can disambiguate both contrastive
> senses such as nominal and verbal senses as well as nuanced senses such as
> sarcasm.

If this is true, that's astonishing. Also, this might allow us to build
assistive technologies for people who are unable to perceive sarcasm.

------
habitue
Supervised disambiguation? Isn't the entire reason word2vec is exciting
because it's unsupervised?

~~~
ya3r
I have seen people claim many times that word2vec is unsupervised, but I think
that is inaccurate.

word2vec uses a very weak form of supervision: the order in which words appear
in a meaningful sentence. And I think it is fascinating to use this kind of
weak supervision to build distributed embeddings for words.

~~~
fnl
Skip-grams only take (focus word, context word) pairs, so those pairs do not
take order into account.
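The pair generation being described can be sketched as follows (a minimal
illustration; the window size and sentence are made up):

```python
# Minimal sketch of skip-gram training-pair generation: for each focus word,
# emit a (focus, context) pair for every word within the window. Note that
# nothing in the pair records whether the context word came before or after
# the focus word.

def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, focus in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((focus, tokens[j]))
    return pairs

pairs = skipgram_pairs(["the", "cat", "sat"])
print(pairs)
# "cat" pairs with "the" and with "sat" alike, regardless of direction.
```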

~~~
ya3r
Being in the same context is some kind of order.

~~~
fnl
That's still a bit far-fetched. "Weakly supervised" refers to using a small
amount of labeled data, which is not the case for word2vec and similar
embedding methods. The method presented here, sense2vec, would in fact
qualify, as it is indeed weakly supervised.

~~~
ya3r
Using a "small amount of labeled data" is what's called "semi-supervised".

------
andrewtbham
I think this paper is interesting, especially the section on sarcasm. They got
good results distinguishing apple the noun (which is similar to apples, pear,
peach, blueberry) from Apple the proper noun (which is similar to Microsoft,
iphone, ipad, samsung).

While they got good results telling bank the noun from bank the verb... they
didn't differentiate bank the noun (financial) from bank the noun (the side
of a river).

Or even more complicated... look at all the uses of bank on WordNet.

[http://wordnetweb.princeton.edu/perl/webwn?s=bank](http://wordnetweb.princeton.edu/perl/webwn?s=bank)

I'm not sure I can grasp the implications of disambiguating some words but
not others. For some applications it might make sense, but... whether to
disambiguate words or leave them as single vectors is still, in my mind, an
open research question.

~~~
williamtrask
Great point. In my opinion, for improving the quality of syntactic tasks such
as POS / dependency parse, the difference between disambiguating and not
disambiguating riverbank and financial bank will be minimal. However, for
semantic tasks (perhaps NER, information extraction, question answering) the
difference would be more profound. This paper is primarily focused on a much
more efficient method to do the former.

------
jilebedev
For a layman's introduction to how (pardon the hyperbole) soul-crushingly
difficult this problem is, have a look at this amateur attempt to process
language inputted by players into a video game:
[https://www.youtube.com/watch?v=Ff6V1yFafW4](https://www.youtube.com/watch?v=Ff6V1yFafW4)

------
fnl
What always worries me with all WSD approaches is the performance tradeoff:
how much performance is gained from more complex per-sense word vector
designs vs. "standard" word embeddings? Setup complexity can increase
significantly for these models and training times are much longer, while the
gains from these approaches are not terribly clear to date.

~~~
williamtrask
In this case, there is no performance tradeoff except that of running your
core NLP pipeline... for which there are several very fast options.

~~~
fnl
So your results demonstrate that directed dependency labeling works better
with vectors learned from PoS-tagged words than with PoS-tagged vectors
(learned from untagged words)? And if so, why are you sure you are not
overfitting on the corpus or that the "unseen" (in your case: label +) word
(pairs) issues will in the end do more harm than what you gain when using this
approach on truly independent data/text?

EDIT: Sorry, this question above is probably too convoluted to understand. As
I understand, the evaluation of the UAS in the paper was made by letting the
parser use the gold PoS labels from the UD treebank (plus either set of word
embeddings). But what would happen if the PoS labels for evaluating the
dependency parser came from a PoS tagger, as would be the case when working on
unseen data? I might imagine that "plain" embeddings could maybe produce a
better UAS in that case, because they are not as "overfitted" as the
"enriched" embeddings (as those are derived from the PoS-tagger labeled words
in the first place).

~~~
williamtrask
Perhaps, but seeing as POS taggers are ~97% accurate (at least in English),
I'd expect this to be minimal. Furthermore, the baseline neural network also
has access to the gold standard POS tags, so the comparison of adding POS
disambiguated embeddings is pretty clean. It's the difference between "words +
pos tags" as features and "pos-disambiguated word + pos tags".

~~~
fnl
Using accuracy to measure PoS taggers makes the results look good, but is
misleading due to their huge bias: tagging every word with the majority tag
found during training, and everything else as either NNP or NNPS (with suffix
-s), already puts the statistical baseline well beyond 90% accuracy. However,
my point was that from the results shown it's not clear to me if the gains in
attachment score you saw when using Gold Standard PoS tags would be lost in a
"real-world" usage when you have to rely on the tagger's own PoS tags. In such
a case, it could be that your embeddings contribute much less "new knowledge"
than what you see in your results, using independent (Gold) PoS tags. This
might be mitigated by using two independently trained and set up PoS taggers,
however. But this finally gets us back to my initial concern: How much
performance gain really is in there from all this added complexity and is that
"worth the effort"?

~~~
williamtrask
Generally, the industry benchmarks dependency parsing using gold standard POS
tags. However, your point is well taken. Personally, I have little doubt that
it would still yield the same level of improvement, but fortunately a bit of
experimentation can settle it for sure :)

Perhaps also relevant to this conversation, the disambiguation for pre-
training did in fact use "real-world" tags (not gold standard). Thus,
sense2vec as an algorithm was able to sort through the noise generated by
mistakes in the part-of-speech tagger to still generate meaningful embeddings.

------
n0us
Anyone know of an open source implementation? I've only had a chance to scan
the document but it appears to only go into the theory.

~~~
syllogism
We'll have this implemented in spaCy before too long. It's actually super easy
to do --- all you need is to merge the part of speech tags or entity labels
onto the tokens before feeding the text to Gensim or another word2vec
implementation.

I've wanted to do this for a while, so it's nice to see that it works well.

~~~
danieldk
Indeed, the implementation is very simple. We had considered this idea a few
times as well but, like you, didn't get around to implementing/evaluating it.
So it's good to hear that it works.

One thing notably absent from the paper is a discussion of the trade-off
between augmenting tokens with annotations in this way for sense
disambiguation vs. data sparseness. Their approach may make the embeddings
for frequent senses better, but the difficulty in WSD is typically in
low-frequency senses. I think that particularly in disambiguation using
part-of-speech tags, there is still a high semantic relatedness between
senses, especially in languages with frequent nominalization or verbalization.

I can imagine that a model that predicts a target (or context) in decomposed
form (token and label) might improve embeddings for low-frequency senses.

~~~
syllogism
This isn't _really_ WSD though, or at least, only very weakly.

Rare words are usually pretty unambiguous for part-of-speech. I would guess
this mostly has an effect on the top 5,000 items of the vocabulary, and most
of the rest of the lexicon only has a single "sense".

~~~
danieldk
_This isn't really WSD though, or at least, only very weakly._

Sure. I was pointing to real WSD, where sparseness becomes an even stronger
problem than when your definition of sense is restricted to a part-of-speech
tag or sentiment.

 _Rare words are usually pretty unambiguous for part-of-speech._

I was talking about (possibly) frequent words where some parts of speech are
infrequent, not about rare words. To take five more or less random examples
from the Brown corpus (yes, we train on large corpora, but I think similar
distributions could hold for less frequent forms in languages with e.g.
frequent nominalization; not everyone speaks English!):

    
    
       mother    NN  173  VB   1
       code      NN   20  VB   1
       hanging   NN    1  VBG  20
       level     JJ   14  NN  172  VB 2
       services  NNS 115  VBZ  1
    

If your learning method is as coarse-grained as simply throwing the token plus
part-of-speech into word2vec or wang2vec, some will be below the frequency
cut-offs (or will be too sparse to learn good embeddings), while other 'senses'
may in reality be semantically similar.
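The cutoff effect can be sketched like this (a toy illustration; the counts
loosely follow the Brown corpus figures above, and MIN_COUNT stands in for a
word2vec-style frequency threshold):

```python
from collections import Counter

# Sketch of the sparseness worry: once tokens are split by POS tag, the rare
# senses of otherwise frequent words can fall below a min-count cutoff and
# get no embedding at all.

counts = Counter({"mother_NN": 173, "mother_VB": 1,
                  "hanging_NN": 1, "hanging_VBG": 20,
                  "services_NNS": 115, "services_VBZ": 1})

MIN_COUNT = 5  # illustrative; word2vec implementations default to similar values
kept = {tok for tok, c in counts.items() if c >= MIN_COUNT}
dropped = set(counts) - kept

print(sorted(dropped))
# The rare senses are discarded even though the surface forms "mother",
# "hanging" and "services" are all common.
```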

~~~
syllogism
Thanks for the explanation. I see what you're saying now.

