
Exploring Word2Vec - sujayskumar
http://sujayskumar.blogspot.com/2017/03/exploring-word2vec_3.html
======
patelajay285
We recently open sourced a library that makes it easy to quickly get started
with pre-trained word vector models like word2vec from Google, Facebook, and
Stanford. It adds a ton of extra functionality like fast similarity indexing
with Annoy and fairly robust out-of-vocabulary word lookups (handling
misspellings) out of the box. If you want to quickly get started with pre-
built models, it might be worth checking out along with Gensim:

[https://github.com/plasticityai/magnitude/](https://github.com/plasticityai/magnitude/)
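A minimal sketch of the kind of usage it supports (the model file name here is
just an example; any converted .magnitude file works the same way):

    from pymagnitude import Magnitude

    # Load a pre-converted model file (name here is illustrative).
    vectors = Magnitude("GoogleNews-vectors-negative300.magnitude")

    print(vectors.dim)                  # dimensionality, e.g. 300
    print(vectors.query("king"))        # vector for an in-vocabulary word
    print(vectors.query("kingg"))       # misspelling still gets a sensible vector
    print(vectors.most_similar("king", topn=5))  # Annoy-backed similarity search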

~~~
neoncontrails
Are document vectors supported analogously to word vectors? It's a bit
peculiar to me that gensim withholds useful word2vec methods from the Doc2Vec
class, like most_similar. I'd really enjoy having a common interface to
both vector spaces.

~~~
patelajay285
Currently, it doesn't support document vectors, but it should be possible to
add them while maintaining the same interface. We're open to accepting a pull
request if you're interested in building that feature.

------
thomasahle
> One important thing to note is that Word2Vec does not consider the
> positional variable of the context words.

There is a very interesting new approach that learns Gaussian mixture
representations of words, which addresses this problem:
[https://github.com/benathi/word2gm](https://github.com/benathi/word2gm)

It also solves another interesting problem, which is that Word2Vec doesn't
always put "categories" close to their actual contents.

E.g. I wrote a simple AI for playing Codenames with word vectors:
[https://github.com/thomasahle/codenames](https://github.com/thomasahle/codenames)
But where a human would use the clue "Countries 2" for "Germany" and
"England", the GloVe vectors (at least in my implementation) much prefer the
clue "France 2", which is pretty confusing at first.

~~~
yorwba
The Gaussian mixture representation is quite interesting, so thanks for the
link. But it seems like they do not make use of the relative position of words
in the text either. They simply maximize the margin of the expected likelihood
for words that occur in the same context over those that do not.

However, in most cases it's probably actually better to ignore word order,
since that captures semantic relatedness rather than the syntactic features
that influence word order. It also lets you translate between
languages that do not necessarily have similar grammar: "Word Translation
Without Parallel Data"
[https://arxiv.org/abs/1710.04087](https://arxiv.org/abs/1710.04087)
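If I'm reading the word2gm paper right, the objective is a max-margin ranking
loss over an energy given by the log expected likelihood kernel between the
two mixtures, roughly:

    L_\theta(w, c, c') = \max\bigl(0,\ m - \log E_\theta(w, c) + \log E_\theta(w, c')\bigr),
    \qquad E_\theta(f, g) = \int f(x)\, g(x)\, dx

where c is an observed context word of w and c' is a negative sample.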

~~~
Cybiote
Whether it's okay to ignore word order really depends on your goals. In my
work, tracking word order to build a sentence representation is necessary to
get composition rules right. You can get away with ignoring order
surprisingly often, but when it's needed, it's vital. Aggregated across an
entire corpus, order ends up mattering a lot in an absolute sense (something
that looking only at accuracy numbers will hide).

Compare the following:

1) You must state the benefit.

2) You must benefit the state.

3) Feed fish vs. Fish feed.

Sometimes ignoring order limits inference:

1) A loves B.

2) Cats eat fish.

3) Cats are mammals.

Sometimes the sequence preceding or following will be important:

1) Problems with learning

2) Learning with problems

3) It has a _learning problem_

4) A _learning problem_ for it.

Or consider an example from:

[https://www.researchgate.net/publication/2335962_How_Well_Ca...](https://www.researchgate.net/publication/2335962_How_Well_Can_Passage_Meaning_be_Derived_without_Using_Word_Order_A_Comparison_of_Latent_Semantic_Analysis_and_Humans)

1) It was not the sales manager who hit the bottle that day, but the office
worker with the serious drinking problem.

2) That day the office manager, who was drinking, hit the problem sales worker
with the bottle, but it was not serious.

Order ends up mattering when you want to capture nuance or power higher order
inference.
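
To make the first pair concrete: under a plain bag-of-words average of word
vectors, the two sentences are literally indistinguishable (a sketch; the
model choice is illustrative):

    import numpy as np
    import gensim.downloader as api

    vectors = api.load("glove-wiki-gigaword-100")  # illustrative model choice

    def avg_vector(sentence):
        # Order-blind representation: mean of the word vectors.
        return np.mean([vectors[w] for w in sentence.lower().split()], axis=0)

    a = avg_vector("you must state the benefit")
    b = avg_vector("you must benefit the state")
    print(np.allclose(a, b))  # True: same bag of words, same vector, opposite meanings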

~~~
yorwba
Right, what I meant was not that order doesn't matter at all, but that order
mostly doesn't matter for the meaning of individual words (except for some
homonyms which can be disambiguated using syntactical information).

When you want to handle whole sentences, obviously order becomes much more
relevant, but for that you'd feed the word vectors into an LSTM or similar to
handle the order-dependence.
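
Something like this, in PyTorch terms (dimensions are illustrative, and the
embedding matrix could be initialized from word2vec/GloVe weights):

    import torch
    import torch.nn as nn

    vocab_size, embed_dim, hidden_dim = 10000, 100, 128

    embedding = nn.Embedding(vocab_size, embed_dim)  # load pre-trained vectors here
    lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    token_ids = torch.randint(0, vocab_size, (1, 7))  # one sentence, 7 token ids
    outputs, (h_n, c_n) = lstm(embedding(token_ids))
    sentence_vector = h_n[-1]      # order-aware sentence representation
    print(sentence_vector.shape)   # torch.Size([1, 128])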

------
stochastic_monk
I enjoyed this writeup, as well as the canonical word2vec explanation linked.

I will say that he's wrong that NFL never co-occurs with ML. I've had
discussions that refer to the No Free Lunch theorem by its initials.

~~~
sujayskumar
Valid point. I should have made it clearer that NFL stands for National
Football League. I was trying to demonstrate the difference in domain, one
being sports and the other being computer science.

~~~
stochastic_monk
I understand, it’s just a funny coincidence.

------
Jhsto
Can someone explain why the word 'the' isn't considered part of the
vocabulary in the blog post?

~~~
swyx
usual procedure in NLP:
[https://en.wikipedia.org/wiki/Stop_words](https://en.wikipedia.org/wiki/Stop_words)

~~~
neoncontrails
Word2Vec actually employs a default strategy of down-sampling the most
frequent words in the corpus (the `sample` threshold, 1e-3 by default in the
original C tool). My understanding is that this keeps the network from
training on an unnatural concentrate of high-frequency tokens, which is
supposedly beneficial.

Interestingly, there's some controversy over whether this benefit extends to
punctuation. Most W2V tutorials instruct the programmer to apply tokenizing
functions that leave only space-separated word characters, annihilating
everything else, but I've actually observed higher accuracy scores from models
that tokenize non-word characters individually, i.e. as strings of length 1.
I'm not sure exactly why this is the case, but there is a common-sense
interpretation that might explain it. Output from the tokenizer is often
transduced by word2phrase, an optional bigram recognizer. Mikolov strongly
recommends using it, but I'm sure it struggles to infer pointwise mutual
information from a text that's missing ~70% of its contents.
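
For reference, the subsampling rule from the Mikolov et al. paper, as I
understand it (the released C code uses a slightly different keep
probability): each occurrence of a word with relative frequency f is
discarded with probability 1 - sqrt(t / f). A quick sketch, which also shows
why a word like "the" mostly vanishes from the training stream:

    import math
    import random

    def keep_word(freq, t=1e-5):
        # Keep an occurrence with probability sqrt(t / f(w));
        # f(w) is the word's relative frequency, t the `sample` threshold.
        if freq <= t:
            return True  # rare words are always kept
        return random.random() < math.sqrt(t / freq)

    # If "the" is ~5% of all tokens, it survives only ~1.4% of the time:
    print(sum(keep_word(0.05) for _ in range(100000)) / 100000)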

