

GloVe: Global vectors for word representation - vkhuc
http://www.socher.org/index.php/Main/GloveGlobalVectorsForWordRepresentation

======
teraflop
The "download word vectors" links are broken. Actual data is here:
[http://www-nlp.stanford.edu/data/](http://www-nlp.stanford.edu/data/)

------
languagehacker
This is pretty badass. I'm assuming unseen words are really what's left to
work on. If you ensemble this model with one that uses the same ideas but
generalizes outside of specific terms, you might be able to get there. For
instance, generate a matrix that represents n-gram word sequences as each
word's part of speech and semantic category. When making predictions on unseen
words only, you then can use those values to help guide your prediction. You
could use cues in phonology and morphology to predict the unseen word's
semantic category. You could build off that value with cues from morphology
and word ordering to predict the part of speech of the word. Once you have
that, and the information for adjacent, existing words, you might be able to
make a more reliable prediction on even hapax legomena.
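
To make the idea a bit more concrete, here is a minimal sketch of backing off to
morphology and context for an unseen word. Everything in it is an assumption for
illustration: the toy vectors are invented, the helper names (`suffix_neighbours`,
`guess_vector`) are hypothetical, and a shared suffix stands in for the richer
phonological and morphological cues described above. It has nothing to do with
how GloVe itself is trained.

```python
# Toy embeddings; the numbers are invented for illustration only.
VECTORS = {
    "walking": [0.9, 0.1, 0.0],
    "running": [0.8, 0.2, 0.0],
    "quickly": [0.1, 0.9, 0.0],
    "slowly":  [0.0, 1.0, 0.1],
    "dog":     [0.0, 0.1, 0.9],
}


def mean(vecs):
    """Component-wise average of a non-empty list of equal-length vectors."""
    n = len(vecs)
    return [sum(col) / n for col in zip(*vecs)]


def suffix_neighbours(word, vectors, min_len=2, max_len=4):
    """Known words sharing the longest suffix (max_len down to min_len) with `word`."""
    for k in range(max_len, min_len - 1, -1):
        if len(word) <= k:
            continue  # skip suffixes that would cover the whole word
        suffix = word[-k:]
        hits = [w for w in vectors if w != word and w.endswith(suffix)]
        if hits:
            return hits
    return []


def guess_vector(word, context, vectors, morph_weight=0.5):
    """Blend a morphology-based guess with the mean of known context words.

    Returns None when neither cue is available.
    """
    morph = [vectors[w] for w in suffix_neighbours(word, vectors)]
    ctx = [vectors[w] for w in context if w in vectors]
    if morph and ctx:
        m, c = mean(morph), mean(ctx)
        return [morph_weight * a + (1 - morph_weight) * b for a, b in zip(m, c)]
    if morph:
        return mean(morph)
    if ctx:
        return mean(ctx)
    return None


# "blorging" is unseen; it shares "-ing" with "walking" and "running".
guess = guess_vector("blorging", ["dog", "quickly"], VECTORS)
```

A real system would presumably learn the weighting and use part-of-speech and
semantic-category features rather than a fixed 50/50 blend.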

~~~
bravura
Yes, the method you propose for inducing a representation for unseen words is
sound.

However, once you can train on almost one trillion tokens, unknown words will
not come up very often. That is, what's really left to work on is inducing
higher-quality representations of observed words. The goal
would be that a simple model could inject these representations and perform
well on, say, the word analogies task (or any other pure lexical semantics
task).
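
For anyone unfamiliar with the word analogies task, here is a rough sketch of
the usual vector-offset evaluation ("a is to b as c is to ?"): take b - a + c and
pick the nearest remaining word by cosine similarity. The toy vectors and the
`analogy` helper are made up for illustration; real evaluations run over a full
vocabulary of pretrained vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors):
    """Return the word closest to (b - a + c), excluding the three query words."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Invented 3-d vectors: the second dimension loosely encodes gender,
# the third loosely encodes royalty.
TOY = {
    "man":   [1.0, 0.0, 0.2],
    "woman": [1.0, 1.0, 0.2],
    "king":  [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [-1.0, 0.2, 0.1],
}

answer = analogy("man", "woman", "king", TOY)  # "queen" on this toy data
```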

What's interesting about Pennington et al.'s work for me is how they found a
really fast training method, and thus could train on 840B tokens from Common
Crawl. I've spent a lot of time thinking about this problem, and this approach
is quite elegant.

------
LisaG
So excited to see Common Crawl data be useful for such fascinating work!

I work at Common Crawl :)

------
heyalexej
This looks very interesting. Couldn't find anything about a licence though.

