
Deep Learning at x.ai - dfkoz
https://x.ai/deep-learning-at-x-ai/
======
jknz
From my understanding, they ran word2vec [1] on their email dataset. Anyone
can run word2vec on any dataset with a single desktop machine. What I don't
get is why word2vec isn't mentioned.

Edit: the mentioned algorithm is t-SNE [2] -- which seems to be another
algorithm for dimension reduction. I don't know how it compares to word2vec.

[1] for instance,
[https://www.tensorflow.org/versions/r0.7/tutorials/word2vec/...](https://www.tensorflow.org/versions/r0.7/tutorials/word2vec/index.html)

[2] [https://lvdmaaten.github.io/tsne/](https://lvdmaaten.github.io/tsne/)

~~~
cjauvin
word2vec is an algorithm to produce meaningful "word embeddings", which is a
vector representation in a usually high-dimensional space. t-SNE is a
dimensionality-reduction algorithm. Both can be used together, as they serve
different purposes.
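To make the "different purposes" point concrete, here's a minimal sketch using scikit-learn's t-SNE; the high-dimensional "word vectors" are random stand-ins for trained word2vec output, and the vocabulary is invented:

```python
# Sketch: embeddings and t-SNE serve different roles. Toy 100-d "word
# vectors" (random, standing in for trained word2vec output) are
# reduced to 2-D with t-SNE for plotting.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
vocab = ["king", "queen", "man", "woman", "apple", "pear",
         "run", "walk", "paris", "london"]
embeddings = rng.normal(size=(len(vocab), 100))  # 100-d vectors

# perplexity must be smaller than the number of points
coords = TSNE(n_components=2, perplexity=5, random_state=0,
              init="random").fit_transform(embeddings)
print(coords.shape)  # (10, 2): one 2-D point per word
```

With real word2vec vectors in place of the random matrix, the resulting 2-D points are what gets scatter-plotted.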

~~~
jknz
One could argue that word embeddings are also a dimensionality-reduction
technique: words live in an infinite-dimensional space, and an embedding is a
finite-dimensional projection of that infinite-dimensional space.

~~~
mrdrozdov
I think of word vectors in the opposite light. A word stored in a dictionary
has one dimension (its index), which makes comparisons between words more or
less meaningless. Word vectors augment the information you have about a word
by repeatedly examining the contexts in which the word appears in a corpus of
text.
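The contrast can be sketched in a few lines; the 3-d vectors below are invented for illustration, not learned from a corpus:

```python
# Sketch: a dictionary index carries no meaning, while a dense vector
# supports similarity comparisons. Toy 3-d vectors, hand-picked so that
# "cat" and "dog" point in similar directions.
import numpy as np

index = {"cat": 0, "dog": 1, "car": 2}
# |index["cat"] - index["dog"]| says nothing about meaning.

vecs = {
    "cat": np.array([0.9, 0.8, 0.1]),
    "dog": np.array([0.8, 0.9, 0.2]),
    "car": np.array([0.1, 0.2, 0.9]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# cat is closer to dog than to car under cosine similarity
print(cosine(vecs["cat"], vecs["dog"]) > cosine(vecs["cat"], vecs["car"]))  # True
```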

------
mrdrozdov
> A RNN makes predictions based on sequential data. When a RNN is trained on
> sequences of words, it learns to represent each word as a high dimensional
> vector which encodes the model’s understanding of that word. By projecting
> these high dimensional vectors into a two dimensional space, it’s possible
> to visualize their relationships and glean insights into the concepts that
> the model has learned.

It sounds like what's being visualized are the probability vectors that the
model creates, which usually contain one value per possible class (noun, verb,
etc. in this case). If that's the case, I don't see how the t-SNE
visualization is much more useful than a confusion matrix. Typically, prior to
training, words are translated from dictionary indexes into word embeddings
(high-dimensional vectors, whose dimension is much larger than the number of
classes) that let you compare them and do vector algebra like "king - man +
woman ≈ queen". You can visualize the word embeddings and color-code them by
class after training to see whether any patterns emerge in your word
embeddings.
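The vector-algebra trick can be demonstrated on hand-built vectors; the two axes (royalty, gender) are invented here, whereas real embeddings learn such directions from data rather than having them assigned:

```python
# Sketch of analogy arithmetic: king - man + woman should land near
# queen. Axes are hand-assigned: [royalty, gender].
import numpy as np

vecs = {
    "king":  np.array([1.0,  1.0]),   # royal, male
    "queen": np.array([1.0, -1.0]),   # royal, female
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([0.0,  0.0]),   # unrelated filler word
}

target = vecs["king"] - vecs["man"] + vecs["woman"]  # -> [1, -1]

# nearest word to the target, excluding the three query words
nearest = min(
    (w for w in vecs if w not in {"king", "man", "woman"}),
    key=lambda w: np.linalg.norm(vecs[w] - target),
)
print(nearest)  # queen
```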

> The RNN learned all of this semantic understanding without a human ever
> having to code a definition of concepts like nouns, verbs, universities,
> cities, meetings, or social media. This is the power of deep learning
> algorithms.

Was this an unsupervised approach? If so, that seems a little unusual for Part
of Speech Tagging (POS tagging). I suppose the author could mean that the
model was used to label Out of Vocabulary (OOV) words, i.e. words that never
appeared in the training set. Labeling OOV data points is the general benefit
of machine learning, though, and I'm not sure it can be attributed solely to
Deep Learning. The main benefit I've gleaned from Deep Learning is that it
automates the feature-engineering phase of the machine learning pipeline.

There are lots of good resources for RNNs, LSTMs, Word Embeddings and t-sne
out there from Stanford, NYU, Theano, TensorFlow, and the like. Here's a blog
post that gives some background if you're interested:
[http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/](http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/)

~~~
adamklec
Hello. This is Adam. I trained the model and made the visualization. Thanks
for your comments.

This model is not a POS tagger. The model was trained to predict the next word
in the email given the preceding words. So in that sense, it's similar to the
word2vec models discussed in the link you shared. However for this work I used
a recurrent neural network to learn a language model of the emails in our
database.
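The training objective described here (predict the next word from the preceding words) amounts to building examples like the following; the toy "email" and the whitespace tokenizer are stand-ins for the real preprocessing:

```python
# Sketch: a language model is fit on (preceding words -> next word)
# pairs extracted from running text.
text = "let us schedule a meeting for tuesday"
tokens = text.split()

examples = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
print(examples[2])  # (['let', 'us', 'schedule'], 'a')
```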

After training, I extracted the learned word vectors from the model (they are
the weights that connect the input layer which uses a one-hot-encoding of
vocab words to the embedding layer). I then used the t-SNE algorithm to reduce
the dimensionality of the learned word vectors and then plotted them in 2
dimensions. The colors representing the parts of speech were added after the
fact to show that the model had learned to distinguish between nouns, verbs,
etc.
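The extraction step described above has a neat property worth spelling out: with a one-hot input layer, the weight matrix into the embedding layer *is* the table of word vectors, since multiplying by a one-hot vector just selects a row. A sketch with invented shapes:

```python
# Sketch: extracting learned word vectors from a one-hot -> embedding
# layer. Shapes and weights are toy values, not a trained model.
import numpy as np

vocab_size, embed_dim = 5000, 128
rng = np.random.default_rng(0)
W_embed = rng.normal(size=(vocab_size, embed_dim))  # input->embedding weights

word_id = 42
one_hot = np.zeros(vocab_size)
one_hot[word_id] = 1.0

# The "learned word vector" for word 42, two equivalent ways:
vec_via_matmul = one_hot @ W_embed   # what the network computes
vec_via_lookup = W_embed[word_id]    # what you extract after training
print(np.allclose(vec_via_matmul, vec_via_lookup))  # True
```

Each row of `W_embed` is then a word vector, and the full matrix is what gets fed to t-SNE.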

~~~
jdonaldson
How does the recurrent neural network technique compare to the CBOW technique
in word2vec? CBOW would've been the first thing I tried.
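For reference, the CBOW objective differs mainly in how the context is summarized: it averages the context-word embeddings (order-free) and predicts the center word, whereas an RNN conditions on the ordered sequence of preceding words. A rough sketch of a CBOW forward pass with toy dimensions and untrained weights:

```python
# Sketch of a CBOW forward pass: average the context embeddings, then
# softmax over the vocabulary to predict the center word. Weights are
# random toy values, not a trained model.
import numpy as np

vocab_size, embed_dim = 10, 4
rng = np.random.default_rng(0)
W_in = rng.normal(size=(vocab_size, embed_dim))    # input embeddings
W_out = rng.normal(size=(embed_dim, vocab_size))   # output projection

context_ids = [2, 5, 7, 9]          # words surrounding the center word
h = W_in[context_ids].mean(axis=0)  # CBOW: order-free average of context

logits = h @ W_out
probs = np.exp(logits - logits.max())
probs /= probs.sum()                # softmax over the vocabulary
print(round(float(probs.sum()), 6))  # 1.0
```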

~~~
adamklec
I agree that's an interesting comparison to make but I'm not sure of the
answer. The original purpose of this work was not to generate word vectors but
rather to evaluate whether we have enough data to start using deep learning
algorithms. That an RNN trained on our data was able to learn word vectors
with a significant amount of structure seems like a positive sign. But I don't
know how the quality of these word vectors would compare to vectors generated
by more standard word2vec algorithms.

~~~
nicklo
There are tons of ways to evaluate word vector quality! Word analogy tasks,
word similarity tasks, contextual prediction tasks, etc.
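A minimal sketch of the word-similarity style of evaluation: rank word pairs by cosine similarity and correlate the ranking with human ratings. The vectors and ratings below are invented; real benchmarks such as WordSim-353 work the same way with collected human judgments:

```python
# Sketch of a word-similarity evaluation: Spearman correlation between
# (invented) human similarity ratings and model cosine similarities.
import numpy as np
from scipy.stats import spearmanr

vecs = {
    "cup":   np.array([0.9, 0.1, 0.0]),
    "mug":   np.array([0.8, 0.2, 0.1]),
    "cat":   np.array([0.1, 0.9, 0.1]),
    "dog":   np.array([0.2, 0.8, 0.2]),
    "stone": np.array([0.1, 0.1, 0.9]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

pairs = [("cup", "mug"), ("cat", "dog"), ("cup", "stone"), ("cat", "stone")]
human = [9.0, 8.5, 1.0, 2.0]  # made-up similarity ratings, scale 0-10
model = [cosine(vecs[a], vecs[b]) for a, b in pairs]

rho, _ = spearmanr(human, model)
print(rho)  # high rho: the model ranks pairs like the humans do
```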

This link contains a bunch of relevant evaluation datasets and benchmark
results obtained with word2vec, GloVe, etc. You can evaluate your RNN-learned
vectors and compare them to traditionally trained word2vec vectors. Link
here:
[http://www.bigdatalab.ac.cn/benchmark/bm/Domain?domain=Word%...](http://www.bigdatalab.ac.cn/benchmark/bm/Domain?domain=Word%20Representation)

For more background on evaluating word vectors check out these pretty great
lecture notes from Socher's NLP class:
[http://cs224d.stanford.edu/lecture_notes/LectureNotes2.pdf](http://cs224d.stanford.edu/lecture_notes/LectureNotes2.pdf)

Also, here are the original papers from a few years ago that introduced many
of these datasets and evaluation standards:

[https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)

[http://www.cs.cmu.edu/~mfaruqui/papers/acl14-vecdemo.pdf](http://www.cs.cmu.edu/~mfaruqui/papers/acl14-vecdemo.pdf)

------
corndoge
This article is the ML equivalent of Hello World.

~~~
diskcat
Hello World is not trivial if you are writing it in Java.

~~~
minimaxir
Not a great analogy. It's still one line of code; just a very stupid one line
of code: System.out.println("Hello World");

~~~
yummyfajitas
It's not even a stupid line of code - it's just a line of code which happens
to namespace things which most other languages don't.

System is a global object that handles various pieces of the runtime system.
One field on this object is `out` (of type PrintStream), which represents
stdout. The `println` method of PrintStream actually does the work.

I get why this seems overly verbose in comparison to Python and similar
languages, but I don't think it's stupid. It's just emphasizing explicitness
and uniformity over brevity.

------
uday_nandam
This seems like a poor attempt at getting free press.

~~~
BinaryIdiot
If that's the case then it worked, and I applaud their effort. I would have
liked to hear about how they're using neural networks in their system. I wrote
a good-enough system that can handle similar things with scheduling meetings,
but it's an expert system, not a neural net, so I'm curious how they use them,
if at all.

~~~
marjimbel
Marcos here, ds at x.ai ... happy to answer that question in person. If you
are around NYC, feel free to pass by our offices for a coffee/chat :-)

~~~
BinaryIdiot
Ha, I'd love to but unlikely I'll be around NYC any time soon.

