Open sourcing our email signature parsing library (mailgun.com)
130 points by orliesaurus 708 days ago | 20 comments

This is cool. It's great to see so many small-to-medium-size projects implementing machine learning algorithms. It wasn't that long ago that this was far beyond the reach of an average programmer. Libraries like numpy/scikit, pyml, and their counterparts in other languages have made it far more common.

I recently helped my girlfriend (bioinformatics PhD! /brag) implement an SVM in Python for feature selection of 20k genes, to determine which could be used to classify a tumorous cell. I was amazed that with less than 20 lines of Python, she had a fast, functional SVM that could classify with ~80% accuracy. That's damn impressive.

In general the scientific community is not filled with the highest quality programmers, so it's great that we are seeing development in easily accessible ML toolkits, because they enable sub-par scientific programmers to employ modern ML algorithms without needing to know the details of how to implement them.

TL;DR: Cool project, glad to see machine learning libraries taking off, will lead to curing cancer :)

> In general the scientific community is not filled with the highest quality programmers

Some will think differently. Peer groups tend to think they are the greatest (academia, doing a PhD... I know of some with very good self-esteem)

The example with dashes made me chuckle, because there actually is a standard for setting off a signature --- two dashes with one trailing space. Searching for this works tolerably well in my experience; if it's in a message and not more than X lines from the end, it's a pretty reliable indicator.
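A minimal sketch of that heuristic, for anyone who wants to try it. The function name `split_signature` and the default of 10 for the "X lines from the end" cutoff are my own assumptions, not anything from Mailgun's library:

```python
def split_signature(body, max_sig_lines=10):
    """Split a plain-text body at the last "-- " line near the end.

    Returns (text, signature). The signature is None when no delimiter
    is found within max_sig_lines of the end (hypothetical cutoff).
    """
    lines = body.splitlines()
    # Lowest index the delimiter may sit at so that no more than
    # max_sig_lines lines follow it.
    start = max(0, len(lines) - 1 - max_sig_lines)
    # Walk backwards from the end so the last delimiter wins.
    for i in range(len(lines) - 1, start - 1, -1):
        if lines[i] == "-- ":  # dash, dash, one trailing space
            return "\n".join(lines[:i]), "\n".join(lines[i + 1:])
    return body, None
```

Calling `split_signature(message_text)` on a body that ends with a "-- " line followed by a few lines of contact details returns the text above the delimiter and the signature below it; anything else comes back untouched with `None` as the signature.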

Where Mailgun's library looks really useful to me is in parsing HTML mail (ironic that a supposed semantic markup language makes this problem harder to solve, but that's the net for you...)

How is this a standard? I agree that it's fairly common, but I see a lot of different variations just in my own inbox.

It is literally a "proposed standard": http://tools.ietf.org/html/rfc3676#section-4.3

Indeed. Checking for a signature in email is pretty easy[1]: look for eol-dash-dash-space-eol ("^-- $"), if found, cut there. If not found, the sender and/or email client can't be trusted to compose proper email -- don't attempt automatic signature stripping, and forward the whole (probably top-posted) mess.

I'm only half-joking.

[1] Because, if "proper" quoting is used and the client can't properly strip signatures -- at least those included signatures will be "^>+ -- $", not "^-- $". Now assuming a proper email client/user, those sigs should've been stripped anyway... But such an assumption is likely to lead to tears and unhappiness anyway...
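Roughly what that looks like with Python's `re` module, as a sketch only. The names `SIG_DELIM`, `QUOTED_SIG_DELIM` and `strip_signature` are made up for illustration, and it assumes the body uses plain "\n" line endings:

```python
import re

# With re.MULTILINE, ^ and $ match at the start and end of every line.
SIG_DELIM = re.compile(r"^-- $", re.MULTILINE)
# A signature delimiter inside quoted text, e.g. "> -- " or ">> -- ".
QUOTED_SIG_DELIM = re.compile(r"^>+ -- $", re.MULTILINE)

def strip_signature(body):
    """Cut the body at the first unquoted "-- " line, if there is one."""
    match = SIG_DELIM.search(body)
    if match is None:
        # No delimiter: don't guess, pass the whole message along.
        return body
    return body[:match.start()]
```

Because a quoted delimiter line starts with ">", `SIG_DELIM` never matches it, so a reply that quotes an old signature is only cut at the sender's own "-- " line.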

It even works for "proper" top-posters (as if there was such a thing) -- because the last reply will come first, followed by a dash-dash-space delimited signature, followed by all the stuff you'd typically strip out in a reply.

my team has parsed over a billion emails in the past 3 years auto-updating our clients' address books.

the "-- " is indeed the standard and the most common delimiter ever since Usenet in '94, but of course we've built a ton of variations into our algorithms to handle everything else you might see.

Feel free to check out our infographic on what you'll find in the average professional's email signature: http://www.evercontact.com/blog/infographic-the-anatomy-of-a...

FYI: the infographic shows the delimiter as "--" instead of "-- "

As I mentioned elsewhere, that's more than a small nitpick, because something like:

    Chapter XI...
Is perfectly fine in text-email -- so adding the space on the end is very useful -- as you'd rarely need to escape "-- " on a line by itself -- not so for just "--".

figure dash, em dash, en dash or horizontal bar?

Hyphen/minus. ASCII 0x2D or EBCDIC 0x60.

It would be nice if email just had a separate signature field (e.g. subject, body, signature)

A little gem to solve a royal pain in the back, that anyone working with email data can relate to

Should we be concerned that they are looking at our email contents in order to build this system?

  We did a lot of research, looked at all the variations of
  email that passes through Mailgun

Probably not. There are enough public mailing lists to use as training sets, I think. They could have used company mail to train on as well.

We use Mailgun and their parsing tool @ sendsmart.com and we LOVE IT! Thanks Mailgun!!

It's fantastic that they are open sourcing this. I've tried stripping out email quotations in the past and it's definitely a hard problem.

I guess there is the "standard" of two dashes before the sig

To clarify, the nice thing about dash-dash-space, rather than simply dash-dash ("-- " vs "--") is that the former will very rarely need escaping (because you could conceivably delimit parts of a message by two dashes, like:

    Chapter IX.
Because that separator line has no trailing space, nothing below your "sub-header" gets cut).
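A quick way to see the difference, as a sketch. The helper `cut_at` and the sample `message` are hypothetical, invented only to illustrate the point:

```python
def cut_at(body, delimiter):
    """Return the text above the first line equal to delimiter, else the whole body."""
    lines = body.split("\n")
    if delimiter in lines:
        return "\n".join(lines[:lines.index(delimiter)])
    return body

# Hypothetical message: a "--" separator under a sub-header,
# then a real "-- " signature delimiter further down.
message = "Chapter IX.\n--\nThe story continues.\n-- \nSent by Alice"

loose = cut_at(message, "--")    # cuts at the sub-header separator
strict = cut_at(message, "-- ")  # cuts only at the signature
```

Here `loose` keeps only "Chapter IX." and throws the rest of the message away, while `strict` keeps everything down to the actual signature.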

two dashes and a space before your ~/.signature :)
