Word embeddings using skip-gram or CBOW are a shallow method (a single-layer representation). Remarkably, word embeddings have to be shallow in order to stay interpretable: if you distributed the predictive task (e.g. skip-gram) over several layers, the resulting geometric spaces would be much less interpretable.
So: this is not deep learning, and this not being deep learning is in fact the core feature.
The most common method for producing word vectors, skip-grams with negative sampling, has been shown to be equivalent to implicitly factorizing a word-context matrix. A related algorithm, GloVe, only uses a word-word co-occurrence matrix to achieve a similar result.
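For anyone curious what that equivalence looks like in practice, here is a minimal numpy sketch of the factorization view (not the actual SGNS training loop): build a shifted positive PMI matrix from raw co-occurrence counts and take a truncated SVD. The `cooc` matrix, the dimensionality and the shift constant are all placeholders for illustration.

    import numpy as np

    def sppmi_svd(cooc, dim=100, shift_k=5):
        # Factorize a shifted positive PMI matrix (the Levy & Goldberg view of SGNS).
        total = cooc.sum()
        p_w = cooc.sum(axis=1, keepdims=True) / total   # P(word)
        p_c = cooc.sum(axis=0, keepdims=True) / total   # P(context)
        p_wc = cooc / total                              # P(word, context)
        with np.errstate(divide="ignore", invalid="ignore"):
            pmi = np.log(p_wc / (p_w * p_c))
        # Subtract log k (the negative-sampling constant) and clip at zero.
        sppmi = np.maximum(pmi - np.log(shift_k), 0.0)
        sppmi[~np.isfinite(sppmi)] = 0.0
        # Truncated SVD yields low-dimensional word vectors analogous to SGNS embeddings.
        u, s, _ = np.linalg.svd(sppmi, full_matrices=False)
        return u[:, :dim] * np.sqrt(s[:dim])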
You can also view the output as an embedding in a high-dimensional space (hence the name word vectors), but more surprisingly you can learn a linear mapping between the vector spaces of two languages, which makes it immediately useful for translation. From [2]: "Despite its simplicity, our method is surprisingly effective: we can achieve almost 90% precision@5 for translation of words between English and Spanish".
: "Neural Word Embedding
as Implicit Matrix Factorization" http://papers.nips.cc/paper/5477-neural-word-embedding-as-im...
: "Exploiting Similarities among Languages for Machine Translation" - page 2 has an intuitive 2D graphical representation http://arxiv.org/pdf/1309.4168.pdf
Curious. Can you share a few examples and applications please?
- using layers of random forests (trained successively rather than end-to-end); random forests are commonly used for feature engineering in a stack of learners (see the rough sketch after this list).
- unsupervised deep learning with modular-hierarchical matrix factorization, over matrices of mutual information of the variables in the previous layers (something I've personally worked on; I'd be happy to share more details if you're interested).
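Since "layers of random forests trained successively" may not be obvious mechanically, here is a rough, hypothetical sketch of the general pattern: each layer is a forest trained on the raw features concatenated with the previous layer's predicted probabilities. This is only my illustration, not the exact method described above; a real version would use out-of-fold predictions to avoid leaking training labels into later layers.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def fit_forest_stack(X, y, n_layers=3):
        layers, features = [], X
        for _ in range(n_layers):
            rf = RandomForestClassifier(n_estimators=200).fit(features, y)
            layers.append(rf)
            # Feed this layer's class probabilities forward as extra features.
            features = np.hstack([X, rf.predict_proba(features)])
        return layers

    def predict_forest_stack(layers, X):
        features = X
        for rf in layers[:-1]:
            features = np.hstack([X, rf.predict_proba(features)])
        return layers[-1].predict(features)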
Are these methods mainstream? Especially the layered RF: how well does it do compared to regular random forests?
… And patented http://www.freepatentsonline.com/9037464.html
I don't think there will be any problems here.
Still, interesting blog post, worth reading and googling for more information. FWIW I just use the deep learning definition of "neural net or representation learning research since 2006" and find it fits better.
Logistic regression with regularization is fairly new? 'Pioneered' by the same people as deep convolutional neural networks? Are you certain about this?
Dropout itself IS fairly new, as is its newer cousin DropConnect.
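For anyone unfamiliar with the distinction, a toy numpy illustration: dropout zeroes activations, DropConnect zeroes individual weights. The shapes and drop probability below are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=(32, 100))   # a batch of activations
    W = rng.normal(size=(100, 50))   # a weight matrix
    p = 0.5                          # drop probability

    # Dropout: mask the units, rescale to keep the expected activation unchanged.
    dropout_out = (x * (rng.random(x.shape) > p)) @ W / (1 - p)

    # DropConnect: mask the individual weights instead.
    dropconnect_out = x @ (W * (rng.random(W.shape) > p)) / (1 - p)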
I agree neural nets themselves are basically just a crazier parametric model. Many of the things we do to modify the gradient are applicable to logistic and other simpler regression techniques as well.
The first thing I tried was computer : server :: phone : ?
I didn't really have a great answer for that in my head before I ran it. Word2vec decided the closest match was "voicemail". It breaks down when you feed it total nonsense, but what would you expect it to do? :P
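(If anyone wants to reproduce this locally rather than through a demo page, the query is a couple of lines with gensim; the GoogleNews file below is the commonly distributed pretrained model, so treat the path as a placeholder.)

    from gensim.models import KeyedVectors

    wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

    # computer : server :: phone : ?
    print(wv.most_similar(positive=["server", "phone"], negative=["computer"], topn=5))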
I'm constantly super impressed by properties of the vectors.
word2vec comes with a data set that you can use to evaluate language models.
The only semantics that it tests are "can you flip a gendered word to the other gender", which is so embedded in language that it's nearly syntax; and "can you remember factoids from Wikipedia infoboxes", a problem that you could solve exactly using DBPedia. Every single semantic analogy in the dataset is one of those two types.
The syntactic analogies are quite solid, though.
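That breakdown is easy to check yourself: recent gensim versions ship the bundled questions-words.txt analogy set and can score a model section by section (the exact API may differ slightly across versions).

    from gensim.models import KeyedVectors
    from gensim.test.utils import datapath

    wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
    score, sections = wv.evaluate_word_analogies(datapath("questions-words.txt"))
    print(score)   # overall accuracy
    for s in sections:
        print(s["section"], len(s["correct"]), len(s["incorrect"]))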
That's a simplification. E.g. I have trained vectors on Wikipedia dumps without infoboxes, and queries such as Berlin - Deutschland + Frankreich work fine.
Of course, even the remainder of Wikipedia is nice text in that it will contain sentences such as 'Berlin is the capital of Germany'. So, indeed, it makes doing typical factoid analogies easier.
That said -- I am more interested in the syntactic properties :).
It's a data source that you could consult to pass 99% of the "semantic analogy" evaluation with no machine learning at all, which is an indication that a stronger evaluation is needed.
The way the corpus “talks about” Obama in relation to the USA is similar to how the corpus talks about Putin in relation to Russia. That the system can reveal this is amazing to me.
I also recommend using Gensim for word embeddings.
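It's only a few lines to train your own vectors. A minimal sketch (the toy corpus is a stand-in, and the vector_size parameter is called size in gensim versions before 4.0):

    from gensim.models import Word2Vec

    sentences = [["berlin", "is", "the", "capital", "of", "germany"],
                 ["paris", "is", "the", "capital", "of", "france"]]
    model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)  # sg=1 -> skip-gram
    print(model.wv.most_similar("berlin", topn=3))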
Like Obama + 2017 = 1981 + Carter or Obama + 2017 = Bush + 2009
Do you have some more results to share coming from your model?
There is some nifty work on trying to structure an actual grammar based on the geometry of the embedding space, but "look at this cool thing we found" is a totally worthwhile thing to publish.
(Embeddings themselves are useful for a whole lot more than that, though.)
Word2vec is exciting because it has a useable amount of accuracy for a relatively small amount of cleverness on the part of the human. You can just throw a huge corpus at it and it performs reasonably well.
But the metric the author offered, as their defining results, was some cute things with Putin and Obama.
So yeah, if the results are accurate relative to some standard metric, I wasn't disputing that. I was questioning the reliability of the results the author presented, about cute Obama/Putin connections.
Somewhat surprisingly, it was found that similarity of word representations goes beyond simple syntactic regularities. Using a word offset technique where simple algebraic operations are performed on the word vectors, it was shown for example that vector("King") - vector("Man") + vector("Woman") results in a vector that is closest to the vector representation of the word "Queen".
I'm much more confused over how the author characterized the results ("look at this Animal/Human/Ethics thing!"). Again, if there are standard metrics by which it did well, great! But why isn't he citing them rather than these more abstract, hard-to-judge victories?
And frankly, if you think that stock market \approx thermometer is insightful, you should probably be kept away from positions of responsibility.
Scroll down and there are some text boxes where you can try some yourself. Here are some I tried (filtering out repeats and plural forms of input words; those artifacts seem to happen a lot and would be easy to ignore):
cat:dog::bread:butter eh? I guess.
sword:shield::attack:protect okay that works.
up:down::left:leaving Eh, not great. I guess if you think they're analogous in terms of tense it kind of works. Disappointing, word2vec. (to be fair, "right" was third highest)
drive:car::dick:? Whatever it was, it made me giggle immaturely.
Really, there's a lot of \approx going on here -- you have thresholds for distance inside your vector space. Not only that, this shows that he cherrypicked "great" examples.