
Organizing My Emails with a Neural Net - andreyk
http://www.andreykurenkov.com/writing/organizing-my-emails-with-a-neural-net/
======
tyingq
Popfile is worth looking at if this was interesting to you.

It is a general-purpose naïve Bayesian email classifier that you can integrate
with almost any email system.

They took some of the concepts in the article mentioned here and expanded on
them a bit.

For example, they have the idea of "pseudowords"[1] so that you're working
with more than just the words in the email. Take html:td, for example: it
expands to the number of HTML table cells in an email, which might help with
choosing a bucket.

[1][http://getpopfile.org/docs/faq:pseudowords](http://getpopfile.org/docs/faq:pseudowords)
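
Roughly (a toy sketch, not POPFile's actual code; the class and function names here are made up for illustration), a pseudoword like html:td could be generated along these lines:

    from html.parser import HTMLParser

    class TdCounter(HTMLParser):
        """Counts <td> tags in an email's HTML body."""
        def __init__(self):
            super().__init__()
            self.td_count = 0

        def handle_starttag(self, tag, attrs):
            if tag == "td":
                self.td_count += 1

    def pseudoword_tokens(email_html):
        parser = TdCounter()
        parser.feed(email_html)
        # Emit the pseudoword once per table cell so the classifier
        # weights it like an ordinary word.
        return ["html:td"] * parser.td_count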

~~~
andreyk
Nice, thanks for the link. I did find a few research papers and the like on
this topic, but none of those seemed to have very interesting
approaches/results. But this looks quite interesting. It would be fun to
compare Popfile to my simple approach on a publicly available dataset (such as
Enron), perhaps.

~~~
Esau
CRM114 is an interesting mail-sorting tool, although I don't think it has been
updated in years:

[http://crm114.sourceforge.net/](http://crm114.sourceforge.net/)

~~~
jorangreef
Orthogonal sparse bigrams are another interesting tokenization method, the
authors of this paper tested it out against CRM114 with good results:

[http://www.siefkes.net/ie/winnow-spam.pdf](http://www.siefkes.net/ie/winnow-spam.pdf)

~~~
Lanzaa
Orthogonal sparse bigrams (OSBs) have been implemented in CRM114 since the
publication of that paper.

------
grinich
This is really awesome-- I work at Nylas and would love to turn this into a
plugin for N1 ([https://nylas.com/n1](https://nylas.com/n1)). If the author's
hanging out in this thread, please email me :)

~~~
burkesquires
As a Nylas user...please do!

~~~
andreyk
Done! :)

------
thibauts
Is it common practice to use the most frequent words as features? It looks
like they don't carry much information, by definition. As a first naive
approach, I'd rank the words by the inverse of how many categories they appear
in, factor in the overall frequency with a weight or something more clever,
and then take the top N.

Well, thinking more about it leads me to tf-idf and naive Bayes (of course),
at which point you pretty much already have a classifier. So it seems feature
selection _is_ learning in itself and defines the maximum accuracy you'll be
able to reach? This is borderline philosophical, but I'd love to read more
about these matters. Pointers welcome!
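
To make the idea concrete, here's a rough (untested) sketch of the ranking I have in mind; `emails_by_category` is a hypothetical dict mapping each category to its list of tokenized emails:

    from collections import Counter

    def rank_words(emails_by_category, weight=2.0, top_n=1000):
        overall = Counter()         # total frequency of each word
        category_count = Counter()  # number of categories a word appears in
        for category, emails in emails_by_category.items():
            seen = set()
            for tokens in emails:
                overall.update(tokens)
                seen.update(tokens)
            for word in seen:
                category_count[word] += 1
        # Favor frequent words, penalized by how many categories they span.
        score = {w: overall[w] / category_count[w] ** weight for w in overall}
        return sorted(score, key=score.get, reverse=True)[:top_n]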

~~~
fchollet
> Is it common practice to use the most frequent words as features? It looks
> like they don't carry much information, by definition.

The common practice with a small-ish dataset is to use e.g. the top 10k or 20k
most frequent words, but filter out the top 50-100 or so most frequent words,
as those indeed do not carry much information. A commonly used weighting
scheme is TF-IDF
([https://en.wikipedia.org/wiki/Tf%E2%80%93idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)),
which comes included in Keras.
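
For reference, a minimal sketch of how that looks in Keras (`texts` here stands in for the list of email bodies):

    from keras.preprocessing.text import Tokenizer

    # Keep the 20k most frequent words and weight them with TF-IDF.
    tokenizer = Tokenizer(num_words=20000)
    tokenizer.fit_on_texts(texts)
    features = tokenizer.texts_to_matrix(texts, mode='tfidf')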

Anyway, this is a cool ML starter project. Keras makes it really easy to do
this sort of fast experimentation with a range of different neural networks
models.

~~~
andreyk
As a side note, early on (in the first version of this project) I tried doing
something significantly fancier: taking the most frequent words in each
category of the training set that also accounted for a large fraction of that
word's occurrences across all the emails (so words with high counts, of which
at least 10-25% came from a single category). It seems like something like this
should be better than just taking the most common words, but either the
parameters I chose or my code itself was no good. Just taking a lot of the
most frequent words and doing feature selection probably results in fairly
similar features, though...
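
For the curious, the idea was roughly this (a simplified sketch of what I attempted, with made-up names; `emails_by_category` maps each category to its tokenized emails):

    from collections import Counter

    def category_features(emails_by_category, min_fraction=0.1, per_category=200):
        overall = Counter()
        per_cat = {c: Counter() for c in emails_by_category}
        for category, emails in emails_by_category.items():
            for tokens in emails:
                per_cat[category].update(tokens)
                overall.update(tokens)
        features = set()
        for category, counts in per_cat.items():
            # Frequent words where this category accounts for at least
            # min_fraction of all occurrences of the word.
            keep = [w for w, c in counts.most_common()
                    if c / overall[w] >= min_fraction]
            features.update(keep[:per_category])
        return features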

------
cJ0th
This is very impressive in itself, but what is the practical use? I mean, you
probably could get (almost) the same result with much less work. For instance,
"Academic" is every mail sent from a .edu domain, "financial" is everything
that comes from your bank, "personal" mail comes from a sender that is in your
personal address book, and so on.

In short, there are "good enough" rules that require much less processing.
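
Something like this sketch, say (the domains are made up, of course):

    def categorize(sender, address_book):
        domain = sender.split('@')[-1].lower()
        if domain.endswith('.edu'):
            return 'Academic'
        if domain in ('mybank.com', 'paypal.com'):
            return 'Financial'
        if sender in address_book:
            return 'Personal'
        return 'Unsorted'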

------
amelius
> I would really like it if gmail indeed had such a machine-learned approach
> to suggesting a category for each email for one-click email organizing

Yes, but I guess you need to train it first. And if you have been bad at
categorizing in the first place, you will start with bad training data.

------
bunderbunder
That confusion matrix makes Edward Tufte so very, very sad.

How on earth am I to tell from that visualization how often it mislabels
financial emails as personal? Eyedropper the colors and hope the values in the
shades of blue follow a linear scale?

~~~
rprospero
As someone who regularly encounters these types of diagrams, I guess that I'm
too close to the issue, because I'm not seeing the problem. To answer your
question, financial e-mails are more likely to be mislabelled as personal than
as professional, but not as often as it labels them correctly.

A cursory glance at the central diagonal tells me that Finance,
Personal/Programming, Professional/EPFL, and Group work are the categories of
e-mail that are most likely to be categorized incorrectly by the software.
Looking at the columns tells me that Academic and Personal are going to
have the most misfiled messages in them.

~~~
jamessb
> To answer your question, financial e-mails are more likely to be mislabelled
> as personal than as professional, but not as often as it labels them
> correctly.

Yes, but how likely is each mislabelling? There is no scale to indicate how
the colors map onto probabilities.

As well as providing a scale, it would be helpful to make the heatmap an
_annotated heatmap_, in which each square is labelled with the corresponding
value (perhaps with values below some threshold omitted to reduce clutter).

Example:
[https://web.stanford.edu/~mwaskom/software/seaborn/_images/s...](https://web.stanford.edu/~mwaskom/software/seaborn/_images/seaborn-heatmap-5.png)

(from the Seaborn docs:
[https://web.stanford.edu/~mwaskom/software/seaborn/generated...](https://web.stanford.edu/~mwaskom/software/seaborn/generated/seaborn.heatmap.html))
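
Something along these lines (a minimal sketch, assuming `cm` is the confusion matrix as a 2D array of counts with rows as true labels, and `labels` is the list of category names):

    import seaborn as sns
    import matplotlib.pyplot as plt

    ax = sns.heatmap(cm, annot=True, fmt='d',
                     xticklabels=labels, yticklabels=labels)
    ax.set_xlabel('Predicted label')
    ax.set_ylabel('True label')
    plt.show()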

Edit: Consider another question - _If an email has the true label 'Financial',
is it more likely to be mislabelled or correctly labelled?_ I can guess, but
without more knowledge of the scale I can't be certain.

~~~
vidarh
If this was a formal paper, and if the code hadn't been freely available, I'd
have agreed with you.

As it is, the specifics of how it mislabels his e-mail are not all that
interesting to me, as I don't have access to his e-mail, and so the specific
numbers are pretty much irrelevant.

------
ryanmonroe
> I would really like it if gmail indeed had such a machine-learned approach to
> suggesting a category for each email for one-click email organizing

I don't think you can create custom categories, but Google's inbox.google.com
does this.

------
syllogism
Why not just use all the words?

~~~
vidarh
There are two main reasons to try to pare it down: the processing time, and
that _if_ you manage to pare it down to the words that are actually relevant,
then you reduce the chance of over-fitting to specific features that are
actually irrelevant (e.g. different frequencies of words like "is" are quite
likely to be irrelevant; but of course you do this kind of filtering at your
peril - what might seem irrelevant could also turn out to be highly
significant in context, so it's hard to get right).

~~~
syllogism
In my experience it's usually best to start with all the words. If you use a
decent implementation that supports sparse vectors it's no problem, certainly
not for these sorts of data sizes.

Usually you'll end up with a frequency threshold, but it's usually best to
trim at the very low end --- like, words occurring 5 or fewer times. Further
over-fitting can be controlled with regularization and parameter averaging.
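
For example, with scikit-learn (just a sketch; `texts` and `labels` are placeholders for the email bodies and their categories):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    # Keep every word that appears in at least 6 emails; the matrix stays sparse.
    vectorizer = CountVectorizer(min_df=6)
    X = vectorizer.fit_transform(texts)
    clf = LogisticRegression(C=1.0)  # regularization keeps over-fitting in check
    clf.fit(X, labels)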

------
binalpatel
Glad you came to the conclusion that I always seem to come to: it's all about
the features. Engineer great features and you'll get great results; engineer
bad features and you'll get bad results.

------
misiti3780
I see his conclusion is that deep learning doesn't work here - but it might be
interesting to incorporate word2vec features into this and see if the
performance doesn't increase (or at least hover around 94%).

~~~
andreyk
I ran one experiment with an Embedding layer (the second one in the "Deep
Learning Is No Good Here" section), which I assumed to be analogous to
word2vec. I hoped it would help as well, so I was rather disappointed that it
did not seem to be the case. Perhaps I should have tried a few more
configurations, though.

~~~
nicklo
An Embedding layer by itself will try to learn word vectors from the data you
train it on (the error gradient propagates back to the embedding layer and
updates the vector weights). Word2vec and word vectors are only really useful
with tons of training data to learn good embeddings. I believe OP is referring
to using Google's pre-trained word vectors (trained on a massive amount of
text [100 billion words]). They can be found here:
[https://code.google.com/p/word2vec/](https://code.google.com/p/word2vec/)

This is pretty straightforward to implement in Keras; you just need to supply
pre-trained word-vector weights to your embedding layer.

    
    
      # Embedding layer initialized with pre-trained word2vec weights
      Embedding(vocab_size, 300, weights=[word2vec_weights])
    

Where 'word2vec_weights' is a numpy matrix with shape (vocab_size, 300).
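
One way to build that matrix (a sketch using gensim; `word_index` is assumed to be your vocabulary's word-to-id mapping):

    import numpy as np
    from gensim.models import KeyedVectors

    # Load Google's pre-trained vectors (file name as distributed by Google).
    w2v = KeyedVectors.load_word2vec_format(
        'GoogleNews-vectors-negative300.bin', binary=True)

    word2vec_weights = np.zeros((vocab_size, 300))
    for word, i in word_index.items():
        if word in w2v:
            word2vec_weights[i] = w2v[word]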

~~~
andreyk
Ah, I honestly assumed it would use an existing set of weights such as from
word2vec. Though I now realize that does not make sense, since you can also
embed images or really any real-valued inputs. I will give this a try.

~~~
nicklo
Cool! Let me know if you run into any issues.

------
artagnon
Cool project. Another experiment that'd be useful is auto-filling responses to
emails based on historical data (Inbox on my phone does an okay job of this).
I think it's a similar challenge.

------
itamarwe
Have you tried random forests?

