
Open sourcing our email signature parsing library - orliesaurus
http://blog.mailgun.com/open-sourcing-our-email-signature-parsing-library/
======
chatmasta
This is cool. It's great to see so many small and medium-sized projects
implementing machine learning algorithms. It wasn't that long ago that this
was far beyond the reach of an average programmer. Libraries like
numpy/scikit, pyml, and their counterparts in other languages have made it far
more accessible.

I recently helped my girlfriend (bioinformatics PhD! /brag) implement an SVM
in Python for feature selection of 20k genes, to determine which could be used
to classify a tumorous cell. I was amazed that with less than 20 lines of
Python, she had a fast, functional SVM that could classify with ~80% accuracy.
That's damn impressive.

In general the scientific community is not filled with the highest quality
programmers, so it's great that we are seeing development in easily accessible
ML toolkits, because they enable sub-par scientific programmers to employ
modern ML algorithms without needing to know the details of how to implement
them.

Tl;dr Cool project, glad to see machine learning libraries taking off, will
lead to curing cancer :)

~~~
mqsiuser
> In general the scientific community is not filled with the highest quality
> programmers

Some will think differently. Peer groups tend to think they are the greatest
(academia, doing a PhD... I know of some with very high self-esteem).

------
wiml
The example with dashes made me chuckle, because there actually is a standard
for setting off a signature --- two dashes with one trailing space. Searching
for this works tolerably well in my experience; if it's in a message and not
more than X lines from the end, it's a pretty reliable indicator.
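
A minimal sketch of that heuristic, assuming plain-text bodies. The function name `split_signature` and the default cutoff of 10 lines are made up for illustration; "X" is whatever works for your mail.

```python
SIG_DELIMITER = "-- "  # dash, dash, space (RFC 3676, section 4.3)


def split_signature(message, max_sig_lines=10):
    """Split message into (body, signature) at the last "-- " line.

    max_sig_lines is a hypothetical cutoff standing in for "X lines
    from the end"; a delimiter further up than that is ignored.
    """
    lines = message.split("\n")
    # Only look at the tail of the message, scanning from the bottom up.
    earliest = max(0, len(lines) - max_sig_lines - 1)
    for i in range(len(lines) - 1, earliest - 1, -1):
        # Strip only a trailing CR so the significant space survives.
        if lines[i].rstrip("\r") == SIG_DELIMITER:
            return "\n".join(lines[:i]), "\n".join(lines[i + 1:])
    # No delimiter near the end: treat the whole message as body.
    return message, ""
```

Scanning from the bottom means a stray "-- " line early in a long message is not mistaken for the signature.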

Where Mailgun's library looks really useful to me is in parsing HTML mail
(ironic that a supposed semantic markup language makes this problem _harder_
to solve, but that's the net for you...)

~~~
michaelmior
How is this a standard? I agree that it's fairly common, but I see a lot of
different variations just in my own inbox.

~~~
LukeShu
It is literally a "proposed standard":
[http://tools.ietf.org/html/rfc3676#section-4.3](http://tools.ietf.org/html/rfc3676#section-4.3)

~~~
e12e
Indeed. Checking for a signature in email is pretty easy[1]: look for
eol-dash-dash-space-eol ("^-- $"), if found, cut there. If not found, the
sender and/or email client can't be trusted to compose proper email -- don't
attempt automatic signature stripping, and forward the whole (probably
top-posted) mess.

I'm only half-joking.

[1] Because, if "proper" quoting is used _and_ the client can't properly strip
signatures -- at least those included signatures will be "^>\+ -- $", not
"^-- $". Now assuming a proper email client/user, those sigs should've been
stripped anyway... But such an assumption is likely to lead to tears and
unhappiness anyway...

It even works for "proper" top-posters (as if there was such a thing) --
because the last reply will come first, followed by a dash-dash-space
delimited signature, followed by all the stuff you'd typically strip out in a
reply.
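
Roughly what that looks like with Python's `re` module, as a sketch only. The names `SIG_RE`, `QUOTED_SIG_RE` and `strip_signature` are hypothetical, and it assumes the body has already been normalized to LF line endings.

```python
import re

# An unquoted delimiter: a line that is exactly dash-dash-space.
SIG_RE = re.compile(r"^-- $", re.MULTILINE)
# A delimiter inside quoted text, e.g. "> -- " or ">> -- ".
QUOTED_SIG_RE = re.compile(r"^>+ -- $", re.MULTILINE)


def strip_signature(body):
    """Cut at the first unquoted "-- " line; return body unchanged if none."""
    match = SIG_RE.search(body)
    if match is None:
        # No delimiter found: don't guess, pass the whole message along.
        return body
    return body[:match.start()]
```

For a top-posted reply, the first unquoted delimiter sits right after the new text, so the sender's signature and the quoted thread below it are dropped together. A delimiter that only appears inside quoted lines is left alone.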

~~~
Brad2earth
My team has parsed over a billion emails in the past 3 years while
auto-updating our clients' address books.

The "-- " is indeed the standard and most common ever since Usenet in '94, but
of course we've built a ton of variation into our algorithms to handle
everything else you might see.

Feel free to check out our infographic on what you'll find in the average
professional's email signature:
[http://www.evercontact.com/blog/infographic-the-anatomy-of-a...](http://www.evercontact.com/blog/infographic-the-anatomy-of-an-email-signature.1.html)

~~~
8_hours_ago
FYI: the infographic shows the delimiter as "--" instead of "-- "

~~~
e12e
As I mentioned elsewhere, that's more than a small nitpick, because something
like:

    
    
        XI
        --
        Chapter XI...
    

Is perfectly fine in text-email -- so adding the space on the end is very
useful -- as you'd rarely need to escape "-- " on a line by itself -- not so
for just "--".
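
To make the difference concrete, here is a small sketch comparing a strict check against a loose one on the text above. Both helper names are hypothetical.

```python
def is_sig_delimiter(line):
    """Strict check: the line is exactly dash-dash-space."""
    return line == "-- "


def is_bare_dashes(line):
    """Loose check: two dashes, trailing whitespace ignored."""
    return line.rstrip() == "--"


text = "XI\n--\nChapter XI..."
lines = text.split("\n")

# Indices of lines each check would treat as the start of a signature.
strict_hits = [i for i, line in enumerate(lines) if is_sig_delimiter(line)]
loose_hits = [i for i, line in enumerate(lines) if is_bare_dashes(line)]
```

The strict check finds nothing here, while the loose one flags the underline at index 1 and would cut the chapter text.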

------
byoung2
It would be nice if email just had a separate signature field (e.g. subject,
body, signature)

------
orliesaurus
A little gem to solve a royal pain in the back, that anyone working with email
data can relate to

------
wyred
Should we be concerned that they are looking at our email contents in order to
build this system?

    
    
      We did a lot of research, looked at all the variations of
      email that passes through Mailgun

~~~
codezero
Probably not. There are enough public mailing lists to use as training sets, I
think. They could have used company mail to train on as well.

------
djyaz1200
We use Mailgun and their parsing tool @ sendsmart.com and we LOVE IT! Thanks
Mailgun!!

------
aantix
It's fantastic that they are open sourcing this. I've tried stripping out
email quotations in the past and it's definitely a hard problem.

------
ape4
I guess there is the "standard" of two dashes before the sig

~~~
e12e
To clarify, the nice thing about dash-dash-space, rather than simply dash-dash
("-- " vs "--") is that the former will very rarely need escaping (because
you _could_ conceivably delimit parts of a message by two dashes, like:

    
    
        IX
        --
        Chapter IX.
    

Since that "--" has no trailing space, everything below your "sub-header"
won't be cut).

