Hacker News
word2vec in yhat: Word vector similarity (danielfrg.github.io)
59 points by dfrodriguez143 on Sept 30, 2013 | 15 comments

For people interested in a cleaned-up, commented and de-obfuscated word2vec, I recently ported the original C code to Python [1].

My HN submission of this endeavour received no love, but I think it's worthwhile nevertheless as the Python code is not only more concise, readable and extendable, but the training's actually faster too [2].

[1] https://github.com/piskvorky/gensim/blob/develop/gensim/mode...

[2] http://radimrehurek.com/2013/09/word2vec-in-python-part-two-...

Your submission receives no love but my one-afternoon hack does... oh the humanity... lol

That is some amazing work, thanks!

It's sad you didn't get the love on your submission; your changes are very neat, and having word2vec inside gensim feels like a really awesome feature.

Well done!

Mikolov said that he hoped word2vec would "significantly advance the state of the art" of NLP, but really the state of the art can only advance when people can understand and manipulate the code. You're making that possible. Thank you.

Word2vec seemed intuitively obvious to me, but I really have a hard time believing that it works in only 1000 dimensions, generating results beyond cherry-picked demo examples.

Are there really only 1000 independent concepts in the English language?

No, but with n binary dimensions (with value 0 or 1) you can encode 2^n unique identifiers.

So with 1000 continuous dimensions (typically values between -1 and 1 coded on 32 bit floats) you can encode quite a bunch of concepts and their nuances.
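The counting argument above can be sketched quickly. The dimension counts and the 4-level quantization below are illustrative assumptions, not anything word2vec actually does:

```python
# Back-of-the-envelope capacity of a vector space.

def binary_capacity(n_dims: int) -> int:
    """Number of unique identifiers encodable with n binary dimensions."""
    return 2 ** n_dims

def quantized_capacity(n_dims: int, levels_per_dim: int) -> int:
    """Capacity if each continuous dimension is coarsely quantized
    into a handful of distinguishable levels."""
    return levels_per_dim ** n_dims

print(binary_capacity(10))         # 1024 identifiers from just 10 bits
print(quantized_capacity(100, 4))  # 4^100 = 2^200, already astronomically large
```

Even treating each of 100 continuous dimensions as carrying only 2 bits of information gives a space of 2^200 points, so "number of dimensions" is a poor proxy for "number of expressible concepts."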

Note: the default dimensionality of word2vec is 100, not 1000. Apparently you can get better results with dim=300 and a very large training corpus. To leverage higher dimensions you need more CPU time to reach convergence and a lot more data to fill the added model capacity.

I'm still impressed English only takes 26 letters, in words of average length around 5! By comparison, 1000 continuous dimensions seems positively resplendent with expressiveness.

FWIW, 26^5 is just under 2^24, so a 1000-bit binary vector has an expressive space more than 2^976 times larger than 26^5 (all possible 5-letter words).
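Checking those numbers is pure integer arithmetic, nothing word2vec-specific:

```python
# All 5-letter strings over a 26-letter alphabet, vs. a 1000-bit binary space.
words_5 = 26 ** 5
print(words_5)                 # 11881376, a bit under 2**24 (16777216)

ratio = 2 ** 1000 // words_5
print(ratio > 2 ** 976)        # True: the 1000-bit space is > 2^976 times larger
```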

Yes, but there are exponentially more concepts than words. The words we have are a sparse set of labels for particularly relevant combinations.

But yeah, the continuous dimensions can hide many more binary dimensions.

For example, 4-D rgba can be smashed into 1 continuous (or 64-bit) dimension, but that feels a bit like cheating.
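The RGBA packing mentioned above looks something like this standard bit-packing trick (the function names here are made up for illustration):

```python
# Pack 4-D RGBA (one byte per channel) into a single integer "dimension",
# and recover the channels afterwards.

def pack_rgba(r: int, g: int, b: int, a: int) -> int:
    """Pack four 0-255 channel values into one 32-bit integer."""
    return (r << 24) | (g << 16) | (b << 8) | a

def unpack_rgba(x: int) -> tuple:
    """Split a packed 32-bit integer back into (r, g, b, a)."""
    return (x >> 24) & 0xFF, (x >> 16) & 0xFF, (x >> 8) & 0xFF, x & 0xFF

color = pack_rgba(255, 128, 0, 64)
print(unpack_rgba(color))  # (255, 128, 0, 64)
```

No information is lost, which is exactly why it feels like cheating: the four "dimensions" were really just 32 bits wearing different clothes.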

So it sort of feels like "1000 64-bit dimensions" is a tricky name for 64000 1-bit dimensions.

I wouldn't be surprised if you could cover most basic English with 1000 concepts. That would give a lot of combinations.

Very cool. I missed the original word2vec software discussion back in August: https://news.ycombinator.com/item?id=6216044

And the paper itself is a very worthwhile read: http://arxiv.org/abs/1301.3781

The vectors learned from word2vec are pretty amazing. A few days after the tool was released I wrote a script which uses the vector representations to figure out which word in a list isn't like the others [1]. Things like:

->math shopping reading science

I think shopping doesnt belong in this list!

->rain snow sleet sun

I think sun doesnt belong in this list!


[1] https://github.com/dhammack/Word2VecExample
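A minimal sketch of that odd-one-out idea, using tiny hand-made 2-D vectors in place of real learned embeddings (the toy numbers are pure assumptions for illustration; the linked script uses actual word2vec vectors):

```python
# Find the word whose vector is least similar, on average, to the others.
import math

# Toy "embeddings" (illustrative assumptions, not learned vectors).
vectors = {
    "rain":  [0.90, 0.10],
    "snow":  [0.80, 0.20],
    "sleet": [0.85, 0.15],
    "sun":   [0.10, 0.90],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def odd_one_out(words):
    """The outlier is the word with the lowest mean similarity to the rest."""
    def mean_sim(w):
        others = [o for o in words if o != w]
        return sum(cosine(vectors[w], vectors[o]) for o in others) / len(others)
    return min(words, key=mean_sim)

print(odd_one_out(["rain", "snow", "sleet", "sun"]))  # sun
```

For what it's worth, gensim's trained models expose a `doesnt_match` method in the same spirit, run against real learned vectors.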

Eventually computers will be talking about us behind our backs in these high-dimensional vectors, only occasionally translating down to English approximations, to humor us. "Goo goo, gah gah, human?"

Have you read the [Message Contains No Recognizable Symbols] series? It's pretty great: http://www.ssec.wisc.edu/~billh/g/mcnrs.html

Haven't but will check it out, thanks!

Cool web demo powered by word2vec, by Christopher Moody:


