
word2vec in yhat: Word vector similarity - dfrodriguez143
http://danielfrg.github.io/blog/2013/09/21/word2vec-yhat/
======
Radim
For people interested in a cleaned-up, commented and de-obfuscated word2vec, I
recently ported the original C code to Python [1].

My HN submission of this endeavour received no love, but I think it's
worthwhile nevertheless: the Python code is not only more concise, readable
and extensible, but training is actually faster too [2]. (A quick usage
sketch follows the links below.)

[1] https://github.com/piskvorky/gensim/blob/develop/gensim/models/word2vec.py

[2] http://radimrehurek.com/2013/09/word2vec-in-python-part-two-optimizing/
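
A minimal usage sketch of the port (assuming the gensim API as of late 2013;
parameter names such as `size` and `workers` may differ in other versions):

    # Train word vectors on a toy corpus and query them.
    from gensim.models import Word2Vec

    # Each sentence is a list of tokens; any iterable of token lists works.
    sentences = [
        ['human', 'interface', 'computer'],
        ['graph', 'minors', 'survey'],
        # a real corpus would have many more sentences
    ]

    model = Word2Vec(sentences, size=100, window=5, min_count=1, workers=4)

    print(model.most_similar(positive=['computer'], topn=3))  # nearest neighbors
    print(model.similarity('human', 'interface'))             # cosine similarity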

~~~
dfrodriguez143
Your submission received no love but my one-afternoon hack did... oh the
humanity... lol

That is some amazing work, thanks!

------
judk
Word2vec seemed intuitively obvious to me, but I really have a hard time
believing that it works in only 1000 dimensions, generating results beyond
cherry-picked demo examples.

Are there really only 1000 independent concepts in the English language?

~~~
gojomo
I'm still impressed it only takes 26 letters, in words of average size around
5! By comparison, 1000 continuous dimensions seem positively resplendent with
expressiveness.

FWIW, all words of up to 5 letters number fewer than 2^24 (26^5 is about
1.2 x 10^7), so even a 1000-bit binary vector has an expressive space roughly
2^976 times larger.
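
A quick back-of-the-envelope check of that arithmetic in Python:

    import math

    words = sum(26 ** k for k in range(1, 6))  # all words of 1..5 letters
    print(math.log2(words))         # ~23.6, so fewer than 2^24
    print(1000 - math.log2(words))  # ~976.4: 2^1000 is ~2^976 times larger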

~~~
judk
Yes, but there are exponentially more concepts than words. The words we have
are a sparse set of labels for particularly relevant combinations.

But yeah, the continuous dimensions can hide many more binary dimensions.

For example, 4-D RGBA can be smashed into one continuous (or 64-bit)
dimension, but that feels a bit like cheating.

So "1000 64-bit dimensions" feels like a tricky name for what are really
64,000 1-bit dimensions.
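
For instance, smashing RGBA into a single integer is just bit-packing (a
sketch assuming 8 bits per channel):

    # Pack four 8-bit channels into one 32-bit integer, and unpack them again.
    def pack_rgba(r, g, b, a):
        return (r << 24) | (g << 16) | (b << 8) | a

    def unpack_rgba(p):
        return (p >> 24) & 0xFF, (p >> 16) & 0xFF, (p >> 8) & 0xFF, p & 0xFF

    assert unpack_rgba(pack_rgba(255, 128, 0, 64)) == (255, 128, 0, 64)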

------
3JPLW
Very cool. I missed the original word2vec software discussion back in August:
https://news.ycombinator.com/item?id=6216044

And the paper itself is a very worthwhile read:
http://arxiv.org/abs/1301.3781

------
dhammack
The vectors learned by word2vec are pretty amazing. A few days after the
tool was released I wrote a script that uses the vector representations to
figure out which word in a list isn't like the others [1] (the gist is
sketched below). Things like:

->math shopping reading science

I think shopping doesn't belong in this list!

->rain snow sleet sun

I think sun doesn't belong in this list!

etc.

[1] https://github.com/dhammack/Word2VecExample
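
The gist can be sketched in a few lines: score each word by cosine similarity
to the mean of the group's vectors and flag the least similar one (here
`vectors` is a hypothetical dict mapping words to unit-normalized numpy
arrays, not dhammack's exact code):

    import numpy as np

    def doesnt_belong(words, vectors):
        # Stack the word vectors and compute the (renormalized) group mean.
        vecs = np.array([vectors[w] for w in words])
        mean = vecs.mean(axis=0)
        mean /= np.linalg.norm(mean)
        # Cosine similarity to the mean; the outlier is the least similar word.
        sims = vecs.dot(mean)
        return words[int(np.argmin(sims))]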

------
gojomo
Eventually computers will be talking about us behind our backs in these high-
dimensional vectors, only occasionally translating down to English
approximations, to humor us. "Goo goo, gah gah, human?"

~~~
seiji
Have you read the [Message Contains No Recognizable Symbols] series? It's
pretty great:
http://www.ssec.wisc.edu/~billh/g/mcnrs.html

~~~
gojomo
Haven't but will check it out, thanks!

------
gojomo
Cool web demo powered by word2vec, by Christopher Moody:

http://thisplusthat.me/
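
The "this plus that" in the name is literal word-vector arithmetic. With the
gensim port mentioned above, a similar query would look something like this
(vectors.bin is a hypothetical file trained with the original C tool):

    from gensim.models import Word2Vec

    # Load vectors stored in the original C tool's binary format.
    model = Word2Vec.load_word2vec_format('vectors.bin', binary=True)

    # king - man + woman ~= queen, given suitable training data
    print(model.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))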

