A Word Is Worth a Thousand Vectors (stitchfix.com)
127 points by astrobiased on Mar 11, 2015 | 21 comments

As a native Chinese speaker, this comes so naturally.

pork = pig + meat

So the year 2015 is the year of the ram/sheep/goat; in Chinese, those words literally mean

Ram = male ∪ caprinae

sheep = wool ∪ caprinae

goat = mountain ∪ caprinae

Basically, word composition is pretty common in analytic languages like Chinese, but kind of a new idea in fusional languages like English.
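To make the "pork = pig + meat" idea concrete, here is a minimal sketch of composition by vector addition followed by a nearest-neighbor lookup. The three-dimensional vectors are hand-made toy values, not output from word2vec or any trained model, and the helper names (`cosine`, `add`, `nearest`) are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]

# Toy vectors invented for illustration only (assumption, not trained data).
toy_vectors = {
    "pig": [1.0, 0.0, 0.1],
    "meat": [0.0, 1.0, 0.1],
    "pork": [0.9, 0.9, 0.2],
    "wool": [0.0, 0.0, 1.0],
    "mountain": [0.1, 0.0, 0.9],
}

def nearest(query_vec, vectors, exclude=()):
    """Return the word whose vector has the highest cosine similarity to query_vec."""
    candidates = {w: v for w, v in vectors.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine(query_vec, candidates[w]))

composed = add(toy_vectors["pig"], toy_vectors["meat"])
# The input words are excluded so the lookup cannot simply return them.
best = nearest(composed, toy_vectors, exclude=("pig", "meat"))
print(best)  # prints "pork" for these toy values
```

With real embeddings the result depends on the training corpus, so the composition is not guaranteed to land on the expected word.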

This is super-cool. I am trying to start learning Chinese on Coursera https://www.coursera.org/learn/learn-chinese

Learning a language that is based on different principles is an enormous brain exercise. Also, remembering the characters is a challenge.

I've always wondered about doing this in non-flat spaces. Like if I add the "7100 miles west" vector to the "California" point, I get Turkmenistan. If I add "7100 miles west" again, I get back near "California." Similarly, adding the "not" vector twice might get you back where you started in a word embedding. Anyone know if anyone is working on this? It could be tricky because "7100 miles west" lives in the tangent space to the space "California" lives in, but that in itself could be an interesting thing to study in the context of words.
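A minimal sketch of the wrap-around idea, assuming the simplest closed space: a circle where positions are angles in degrees. The numbers are arbitrary toy values and this is not a model of real geography or of any actual word embedding. It only shows that applying the same displacement twice returns to the start on a circle, while in flat space it keeps moving away.

```python
def move_on_circle(angle_deg, step_deg):
    """Apply a displacement on a circle; positions wrap around at 360 degrees."""
    return (angle_deg + step_deg) % 360.0

start = 240.0      # arbitrary starting position (assumption)
half_turn = 180.0  # a displacement of half the circle

once = move_on_circle(start, half_turn)
twice = move_on_circle(once, half_turn)

# The same two steps in flat (non-wrapping) space never come back.
flat_twice = start + 2 * half_turn

print(once, twice, flat_twice)  # 60.0 240.0 600.0
```

An embedding with an operation like "not" behaving this way would need a similar periodic structure, which ordinary vector addition in flat space does not have.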

Take a look at some of the compositional models here under publications: http://www.socher.org/.

Here is the demo webpage for the sentiment analysis system: http://nlp.stanford.edu/sentiment/

Wondering how this differs from the SemanticVectors package? Will have to look into word2vec further.

Word2vec is usually the standard neural word embeddings implementation. There are other algorithms as well, such as GloVe[1], document embeddings[2] and backpropagation-based methods[3]. Facebook recently came out with a paper that beat word2vec as well[4]. Neural word embeddings are a neat way of representing concepts. I see a great future for automated feature engineering with text (joining audio and images) in deep learning.

[1]: http://nlp.stanford.edu/pubs/glove.pdf

[2]: http://cs.stanford.edu/~quocle/paragraph_vector.pdf


[4]: http://arxiv.org/abs/1502.01710

It's my first time seeing the package, but looking over the docs it looks like it implements LSA. The major difference here is that word2vec dramatically outperforms LSA in a variety of tasks (http://datascience.stackexchange.com/questions/678/what-are-...). My experience has been that the vector representations in LSA can be underwhelming and perform poorly. I can't comment on the Random Projection and Reflective Random Indexing techniques SemanticVectors implements.

This link is about document distances but still compares other techniques nicely: http://datascience.stackexchange.com/questions/678/what-are-...

Sorry, I should have specifically mentioned how it differs from random indexing/projection. I was immediately reminded of a similar inference example using random indexing/projection.


Very exciting stuff. I love how you can take simple building blocks and create something elegant and fun with them.

However, why are there words more similar to "vacation" than "vacation"?

Thanks! The word 'vacation' is just removed from the list since it's exactly what we're looking for.

It's not removed from the list -- it is second from the bottom. madsravn's question is a good one.

The one in the list includes a period after it, so I believe it is just a case of slightly dirty data.

Good observation -- I missed that (obviously). They seem to be using data from the word2vec project, so I would guess that it is intentional rather than a lack of cleaning.

But how come it is less similar to itself than other words?
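For context on the question above: under cosine similarity a vector is always maximally similar to itself (similarity 1), so a list entry that looks like the query word but ranks lower must be a different token with its own vector, such as "vacation." with a trailing period. A minimal sketch with made-up toy vectors (not the word2vec data the article uses; `most_similar` here is a hypothetical helper, not a library call):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors invented for illustration; "vacation." is a separate token.
vectors = {
    "vacation": [0.9, 0.1, 0.3],
    "vacation.": [0.6, 0.5, 0.2],
    "holiday": [0.8, 0.2, 0.3],
}

def most_similar(query, vectors, exclude_query=True):
    """Rank words by cosine similarity to the query word, highest first."""
    q = vectors[query]
    scored = [
        (word, cosine(q, vec))
        for word, vec in vectors.items()
        if not (exclude_query and word == query)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

self_similarity = cosine(vectors["vacation"], vectors["vacation"])
ranking = most_similar("vacation", vectors)
print(self_similarity)
print([word for word, _ in ranking])
```

In this toy setup the exact query word is dropped from the ranking, and "vacation." can rank below "holiday" because it is just another vector.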

I don't understand how the item matching is working. Do they have textual descriptions of each item (including colors and patterns), or are they somehow building vectors for the images and then doing cross-modal vector calculations?

If it's the first option, then generating those descriptions seems an important thing to mention.

If it's the second, then it's a pretty significant result! I've seen some papers that indicate some possibilities in that area, but never anything working as well as this.

The item vectors are generated from the text: customers and stylists write text about the stripes, or maternity, and word2vec associates this with the item in question. How that happens is treated in the next section about summarizing documents (a 'document' here is the collection of all text about an item).
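As a rough illustration of turning the text about an item into a single vector, one simple and common baseline is to average the vectors of the words in that text. This is an assumption for illustration only, not necessarily the method the article's authors use. The word vectors and the `average_vectors` helper below are hypothetical toy values.

```python
def average_vectors(tokens, word_vectors):
    """Average the vectors of all known tokens; return None if none are known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

# Toy two-dimensional word vectors invented for illustration.
word_vectors = {
    "stripes": [1.0, 0.0],
    "maternity": [0.0, 1.0],
    "navy": [0.5, 0.5],
}

# All text written about one item, already split into tokens.
# "comfy" has no vector here, so it is skipped.
item_tokens = ["stripes", "navy", "comfy"]
item_vector = average_vectors(item_tokens, word_vectors)
print(item_vector)  # [0.75, 0.25]
```

Averaging ignores word order and weights every word equally, which is why dedicated document-embedding methods exist.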

So we don't do any fancy deep learning from the images themselves, although this is on the horizon :)

Colors would be trivial to do, at least... "The Dress" notwithstanding.

Am I wrong that the title seems to imply something negative about word vectors? But the article is super pumped about them!

It's just a play on the phrase 'a picture is worth a thousand words' :) The word vectors themselves contain sophisticated relationships that seem almost miraculous. We're definitely not negative on them.

(Speaking as one of the authors of the post)

By simple algebra, we can now prove that a picture is worth 1,000,000 vectors.

> If the gender axis is more positive, then it's more feminine; more negative, more masculine.

Reminds me of my Java textbook: the example was to model some employee[1] and its gender was a `boolean`: `false` for man; `true` for woman. Of course that was just an intermediate example before they showed off the `enum` solution.

[1] because hey, Java OOP + CRUD business application example == match made in heaven as an example, apparently.
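Unlike a boolean field, the "gender axis" in the quote is a continuous score. One way to read it, as a sketch and not the article's confirmed method, is as the projection of a word vector onto a direction such as woman minus man. The two-dimensional vectors and the `axis_score` helper below are hypothetical toy values.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def axis_score(vec, positive, negative):
    """Project vec onto the unit direction pointing from negative to positive."""
    axis = [p - n for p, n in zip(positive, negative)]
    norm = math.sqrt(dot(axis, axis))
    return dot(vec, axis) / norm

# Toy vectors invented for illustration only.
toy = {
    "woman": [0.2, 1.0],
    "man": [0.2, -1.0],
    "queen": [0.9, 0.8],
    "king": [0.9, -0.8],
}

queen_score = axis_score(toy["queen"], toy["woman"], toy["man"])
king_score = axis_score(toy["king"], toy["woman"], toy["man"])
print(queen_score, king_score)  # positive for "queen", negative for "king"
```

The sign convention (positive means more feminine) follows the quote; which end is positive is arbitrary and depends on how the axis is defined.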
