The "deep" in deep learning refers to hierarchical layers of representations (to note: you can do "deep learning" without neural networks).
Word embeddings using skipgram or CBOW are a shallow method (single-layer representation). Remarkably, in order to stay interpretable, word embeddings have to be shallow. If you distributed the predictive task (eg. skip-gram) over several layers, the resulting geometric spaces would be much less interpretable.
So: this is not deep learning, and this not being deep learning is in fact the core feature.
I'm uncertain whether you mean the algorithm or the output, but the question is interesting for both.
The most common method for producing word vectors, skip-grams with negative sampling, has been shown to be equivalent to implicitly factorizing a word-context matrix[1]. A related algorithm, GloVe, only uses a word-word co-occurrence matrix to achieve a similar result[2].
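For intuition, here is a minimal sketch of that count-based view, assuming a toy corpus and plain numpy (the window size, smoothing constant, and corpus are my own illustration, not anything from [1] or [2]): build a word-context PPMI matrix and factorize it with truncated SVD to get low-dimensional word vectors. The point of [1] is that SGNS is implicitly doing a (shifted) version of this factorization.

    import numpy as np

    # Toy corpus; in practice this would be a large text dump.
    corpus = [
        "the king rules the country".split(),
        "the queen rules the country".split(),
        "the man walks the dog".split(),
        "the woman walks the dog".split(),
    ]

    window = 2
    vocab = sorted({w for sent in corpus for w in sent})
    idx = {w: i for i, w in enumerate(vocab)}

    # Count word-context co-occurrences within a symmetric window.
    counts = np.zeros((len(vocab), len(vocab)))
    for sent in corpus:
        for i, w in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if i != j:
                    counts[idx[w], idx[sent[j]]] += 1

    # Positive pointwise mutual information, then truncated SVD for the embedding.
    total = counts.sum()
    pw = counts.sum(axis=1, keepdims=True) / total
    pc = counts.sum(axis=0, keepdims=True) / total
    ppmi = np.maximum(np.log((counts / total) / (pw * pc) + 1e-12), 0)

    u, s, _ = np.linalg.svd(ppmi)
    word_vectors = u[:, :2] * s[:2]   # 2 dimensions is plenty for a toy vocabulary
    print({w: word_vectors[idx[w]].round(2) for w in vocab})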
You can also view the output as an embedding in a high-dimensional space (hence the name word vectors), but more surprisingly you can learn a linear mapping between the vector spaces of two languages, which makes it immediately useful in translation. From [3]: "Despite its simplicity, our method is surprisingly effective: we can achieve almost 90% precision@5 for translation of words between English and Spanish".
[3]: "Exploiting Similarities among Languages for Machine Translation" - page 2 has an intuitive 2D graphical representation http://arxiv.org/pdf/1309.4168.pdf
I think people see it as "deep" (even though it's not!) due to the representation learning component. Word2vec is often used as PART of a deep neural network for feature engineering.
- using layers of random forests (trained successively rather than end-to-end; see the sketch after this list). Random forests are commonly used for feature engineering in a stack of learners.
- unsupervised deep learning with modular-hierarchical matrix factorization, over matrices of mutual information of the variables in the previous layers (something I've personally worked on; I'd be happy to share more details if you're interested).
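For the first bullet, here is a minimal sketch of what "layers of random forests trained successively" can look like. This is generic stacking with scikit-learn, my own illustration rather than the exact system described above:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # Layer 1: a random forest whose predicted class probabilities become extra features.
    layer1 = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
    f_tr = np.hstack([X_tr, layer1.predict_proba(X_tr)])
    f_te = np.hstack([X_te, layer1.predict_proba(X_te)])

    # Layer 2: trained on the original features plus layer 1's outputs.
    layer2 = RandomForestClassifier(n_estimators=100, random_state=0).fit(f_tr, y_tr)
    print("stacked accuracy:", layer2.score(f_te, y_te))

(In practice you'd generate the layer-1 features from out-of-fold predictions to avoid leaking training labels into layer 2.)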
Not particularly. The appeal of neural networks here is the non-linear transforms you can apply to the data. There's definitely some appeal and things to try here, though. Gradient-boosted trees and other attempts to augment random forests are pretty mainstream.
Nit: gradient boosting isn't an 'augmentation' of random forests - if anything, it's the other way round. AdaBoost is from 1995, the GBM paper was 1999, and Breiman's random forest paper in 2001 explicitly couches it as an enhancement to AdaBoost.
> word2vec is a Deep Learning technique first described by Tomas Mikolov only 2 years ago but due to its simplicity of algorithm and yet surprising robustness of the results, it has been widely implemented and adopted.
How about a single-layer neural net trained with dropout? Not deep learning, because there's only one layer, but the technique is fairly new and used in deep learning, popularized by some of the people who popularized deep learning, and usually mentioned in conversations about deep learning. It's really a shallow neural model, though. Word2vec is similarly related, but you're right that it is not a perceptron-based neural network with multiple layers (i.e. a "deep neural network").
Still, interesting blog post, worth reading and googling for more information. FWIW I just use the deep learning definition of "neural net or representation learning research since 2006" and find it fits better.
> the technique is fairly new and used in deep learning, popularized by some of the guys that popularized deep learning, usually mentioned in conversations about deep learning
Logistic regression with regularization is fairly new? 'Pioneered' by the same people as deep convolutional neural networks? Are you certain about this?
I think you need to define regularization. L1/L2? Yes those are old.
Dropout itself IS fairly new [1], as is its newer cousin, DropConnect.
I agree neural nets themselves are basically just a crazier parametric model. Many of the things we do to modify the gradient are applicable to logistic and other simpler regression techniques as well.
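For concreteness, (inverted) dropout just means randomly masking units at training time and scaling the survivors, with no mask at test time. A toy sketch of it applied to a single-layer, logistic-regression-style model in numpy (my own illustration, not code from the dropout paper):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 20))
    y = (X @ rng.normal(size=20) > 0).astype(float)

    w, b, lr, p_keep = np.zeros(20), 0.0, 0.1, 0.8

    for epoch in range(200):
        # Inverted dropout: zero random inputs, scale the survivors by 1/p_keep.
        mask = (rng.random(X.shape) < p_keep) / p_keep
        Xd = X * mask
        p = 1.0 / (1.0 + np.exp(-(Xd @ w + b)))    # sigmoid
        w -= lr * Xd.T @ (p - y) / len(y)          # logistic-loss gradient step
        b -= lr * np.mean(p - y)

    # At test time no mask is applied; the 1/p_keep scaling already compensated for it.
    preds = (1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5).astype(float)
    print("train accuracy:", (preds == y).mean())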
I've played with word2vec, and it took nothing to get really interesting combinations.
The first thing I tried was computer : server :: phone : ?
I didn't really have a great answer for that in my head before I ran it. Word2vec decided the closest match was "voicemail". It breaks down when you feed it total nonsense, but what would you expect it to do? :P
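If anyone wants to poke at this themselves, the query above is a one-liner with gensim once the vectors are loaded (a sketch; the pretrained GoogleNews vectors file is an assumption, and the top answer depends on which model you load):

    from gensim.models import KeyedVectors

    # Any word2vec-format vectors work; this particular file is just the usual pretrained set.
    vectors = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

    # computer : server :: phone : ?
    print(vectors.most_similar(positive=["server", "phone"], negative=["computer"], topn=5))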
I'm constantly super impressed by properties of the vectors.
Mikolov, et al. 2013 [1] do a proper evaluation of this. E.g. they found that the skip-gram model has 50.0% accuracy for semantic analogy queries and 55.9% accuracy for syntactic queries.
word2vec comes with a data set that you can use to evaluate language models.
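Concretely, questions-words.txt is the analogy set shipped with word2vec, and recent gensim can score a model against it directly (a sketch; the vectors file and file paths are assumptions):

    from gensim.models import KeyedVectors

    vectors = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

    # questions-words.txt is the analogy test set distributed with word2vec.
    score, sections = vectors.evaluate_word_analogies("questions-words.txt")
    print("overall accuracy:", score)
    for section in sections:
        total = len(section["correct"]) + len(section["incorrect"])
        if total:
            print(section["section"], len(section["correct"]) / total)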
I would insist on a better dataset before really calling these "semantic analogies" (and don't just take my word for it: Chris Manning complained about exactly this in his recent NAACL talk).
The only semantics that it tests are "can you flip a gendered word to the other gender", which is so embedded in language that it's nearly syntax; and "can you remember factoids from Wikipedia infoboxes", a problem that you could solve exactly using DBPedia. Every single semantic analogy in the dataset is one of those two types.
and "can you remember factoids from Wikipedia infoboxes",
That's a simplification. E.g. I have trained vectors on Wikipedia dumps without infoboxes, and queries such as Berlin - Deutschland + Frankreich work fine.
Of course, even the remainder of Wikipedia is nice text in that it will contain sentences such as 'Berlin is the capital of Germany'. So, indeed, it makes doing typical factoid analogies easier.
That said -- I am more interested in the syntactic properties :).
I didn't mean that you have to learn the data from Wikipedia infoboxes, just that that's a prominent place to find factoids.
It's a data source that you could consult to pass 99% of the "semantic analogy" evaluation with no machine learning at all, which is an indication that a stronger evaluation is needed.
I am not getting the "Obama + Russia - USA = Putin" piece nor the "King + Woman - Man" bit either. Nothing particularly meaningful came up on a search for the latter. Could someone explain?
If I understand your question, the idea is that one can do “arithmetic” on concepts. Essentially the first equation asks “Obama : USA :: ? : Russia”. Similarly, “King : Man :: ? : Woman”.
The way the corpus “talks about” Obama in relation to the USA is similar to how the corpus talks about Putin in relation to Russia. That the system can reveal this is amazing to me.
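Mechanically, the "arithmetic" is just adding and subtracting vectors and then taking the nearest remaining word by cosine similarity. A bare-bones sketch with made-up 3-dimensional vectors, purely to show the operation (real embeddings have hundreds of dimensions):

    import numpy as np

    vecs = {  # toy vectors invented for illustration
        "king":  np.array([0.9, 0.8, 0.1]),
        "queen": np.array([0.9, 0.1, 0.8]),
        "man":   np.array([0.1, 0.9, 0.1]),
        "woman": np.array([0.1, 0.1, 0.9]),
        "apple": np.array([0.5, 0.5, 0.5]),
    }

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    query = vecs["king"] - vecs["man"] + vecs["woman"]
    candidates = {w: cosine(query, v) for w, v in vecs.items()
                  if w not in ("king", "man", "woman")}
    print(max(candidates, key=candidates.get))  # "queen" for these toy vectors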
When I see things like this, it makes me wonder how much data forms each of these vectors; if a single article were to say things about Obama, or humans and animals, would it produce these results?
The Korean Hangeul alphabet is an interesting compromise. It's an alphabet, but multiple letters are grouped together into syllabic characters when written. Those syllables in many cases map back to Chinese Han characters by way of sound values (a lot of Korean vocabulary is Chinese in origin, even though the language has a very distinct grammar), which means the boundaries between morphemes often match the character boundaries. This is reflected in the orthography: when there are multiple options for how to distribute letters over characters, the option that keeps the same morpheme spelled consistently across different words is preferred. So while you can write phonetically as in Latin script, the written language retains a high level of morphological information and things feel very Lego-like.
There seems to be a pattern in Asian languages of trying to make written symbols resemble the physical things they represent. I recall a friend saying the Korean word/character for balance looked like a person holding jugs of water on each shoulder.
Some Han characters are ideographic in nature, but not all of them. Korean used to be written with Han characters (using sets of very complicated rules for how to apply them to the language) prior to the invention of Hangeul, but other than a handful of them they aren't in widespread use any more outside specialized or educational contexts. However, some of the Hangeul letters are featural in design, e.g. the velar consonant ㄱ (g/k) is meant to be a side view of the tongue when producing its sound.
There is some nifty work on trying to structure an actual grammar based on the geometry of the embedding space, but "look at this cool thing we found" is a totally worthwhile thing to publish.
(Embeddings themselves are useful for a whole lot more than that, though.)
If these are the most interesting results, then it's not. Most of the work seems to be in the (human provided) search for happy combinations like these. Are they typical? Is the whole space like this? Is there a consistent metric for deciding that these are in fact clever and it's not just the human doing the work in making it make sense in a funny way?
As pointed out elsewhere, the results are pretty accurate, not nearly perfect but enough to open up new applications. http://arxiv.org/pdf/1301.3781.pdf
Word2vec is exciting because it has a usable amount of accuracy for a relatively small amount of cleverness on the part of the human. You can just throw a huge corpus at it and it performs reasonably well.
What does "accurate" mean in this context? If there's a standard meaning of it, I'd be glad to judge it by that.
But the metric the author offered, as their defining results, was some cute things with Putin and Obama.
So yeah, if the results are accurate relative to some standard metric, I wasn't disputing that. I was questioning the reliability of the results the author presented, about cute Obama/Putin connections.
> Somewhat surprisingly, it was found that similarity of word representations goes beyond simple syntactic regularities. Using a word offset technique where simple algebraic operations are performed on the word vectors, it was shown for example that vector("King") - vector("Man") + vector("Woman") results in a vector that is closest to the vector representation of the word Queen. In this paper, we try to maximize accuracy of these vector operations by developing new model architectures that preserve the linear regularities among words. We design a new comprehensive test set for measuring both syntactic and semantic regularities, and show that many such regularities can be learned with high accuracy.
And if the author highlighted those as "the results", I would be impressed. As it stands, they were citing more "moonshot" kind of results like the ability to get abstract concept composition right, which is really hard to judge, because you don't know what set to sample over, or whether the equations are really insightful vs "I guess that might make sense...".
I'm much more confused over how the author characterized the results ("look at this Animal/Human/Ethics thing!"). Again, if there are standard metrics by which it did well, great! But why isn't he citing them rather than these more abstract, hard-to-judge victories?
This is the exciting part - where it feels like the computer understands you. It's not a test, it's just a conversation. You can ask the computer a question and get a really meaningful answer.
More significant, I guess, is how excited the author manages to be over these coincidences. Maybe a word of caution is needed: overly generous interpretation is how things like Nostradamus or the hidden code of the Bible retain their credibility with parts of the population.
And frankly, if you think that stock market \approx thermometer is insightful, you should probably be kept away from positions of responsibility.
Scroll down and you have some text boxes where you can try some yourself. Here are some I tried (filtering out repeats and plural forms of the input words; those artifacts seem to happen a lot and would be easy to ignore):
cat:dog::bread:butter eh? I guess.
sword:shield::attack:protect okay that works.
up:down::left:leaving Eh, not great. I guess if you think they're analogous in terms of tense it kind of works. Disappointing, word2vec. (to be fair, "right" was third highest)
drive:car::dick:? Whatever it was, it made me giggle immaturely.
I bet you could run a brute-force search over the possibility space for equations that can be made from simple vector addition/subtraction operations within some margin of error (since nothing perfectly overlaps) and look for interesting candidates manually post-hoc.
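Something like this, assuming a gensim KeyedVectors model already loaded as `vectors` (gensim 4.x attribute names; the vocabulary cut-off and similarity threshold are arbitrary choices of mine):

    from itertools import permutations

    # Keep the vocabulary slice tiny; this loop makes O(n^3) most_similar calls and gets slow fast.
    words = vectors.index_to_key[:50]
    threshold = 0.65  # arbitrary cosine-similarity cutoff for "interesting"

    hits = []
    for a, b, c in permutations(words, 3):
        best, sim = vectors.most_similar(positive=[b, c], negative=[a], topn=1)[0]
        if sim > threshold and best not in (a, b, c):
            hits.append((f"{a}:{b}::{c}:{best}", sim))

    # Inspect the top candidates manually, post-hoc, as suggested above.
    for analogy, sim in sorted(hits, key=lambda t: -t[1])[:20]:
        print(round(sim, 3), analogy)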
Really, there's a lot of \approx going on here -- you have thresholds for distance inside your vector space. Not only that, it shows that he cherry-picked "great" examples.