

How Mailinator searches for the word "pen1s" in 185 emails every second - zinxq
http://mailinator.blogspot.com/2008/01/how-to-search-for-word-pen1s-in-185.html

======
tlrobinson
_But instead, let's say we start off by hopping ahead 5 positions and instead
look for the "s" at the end of "pen1s". Of course we don't find it, but then
instead of checking the next spot, we again skip ahead 5 and look for "s". In
fact, because we'll get zero matches, we can skip ahead 5 each time.

    
    
        XXXXXXXXXXXXXXXXXXXXXXX
            ^    ^    ^  ....
            s?   s?   s?  ...
    

if we don't find the 's' in any of those positions, that immediately rules out
the possibility that "pen1s" precedes it.

So instead of 1000 comparisons - we only do 200 !_

Uhhh, no... consider the string to be searched:

    
    
        Xpen1sXXXXXXXXXXXXXXXXX
            ^    ^    ^  ....
            s?   s?   s?  ...
    

Obviously you can't just skip 5 characters just because there's no "s" at
position 5. The Boyer-Moore algorithm says you can skip those 5 characters
_because there's no X in the target string_.
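
To make the distinction concrete, here is a minimal sketch of the Horspool simplification of Boyer-Moore (a swapped-in variant for brevity, not necessarily what Mailinator runs). The name `horspool_search` and the comparison counter are illustrative assumptions. The point is that the skip distance is looked up from the text character aligned with the end of the window, so a "1" there allows a shift of only one position, while an "X" allows the full five.

```python
def horspool_search(text, pattern):
    """Return (index of first match or -1, number of character comparisons)."""
    m = len(pattern)
    if m == 0:
        return 0, 0
    # For every pattern character except the last: how far the window may
    # shift when that character sits under the window's last position.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    comparisons = 0
    pos = 0
    while pos + m <= len(text):
        # Compare the window against the pattern from right to left.
        j = m - 1
        while j >= 0:
            comparisons += 1
            if text[pos + j] != pattern[j]:
                break
            j -= 1
        if j < 0:
            return pos, comparisons
        # Characters that never occur in pattern[:-1] allow a full skip of m.
        pos += shift.get(text[pos + m - 1], m)
    return -1, comparisons


print(horspool_search("X" * 1000, "pen1s"))
print(horspool_search("Xpen1sXXXXXXXXXXXXXXXXX", "pen1s"))
```

On 1000 "X" characters this does make only 200 comparisons, but on the counterexample string the "1" under the window's end limits the shift to one position, so the match at index 1 is found rather than skipped.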

~~~
henryw
"The algorithm has more complexity to it than I've illustrated (like what
happens if you encounter an 'e' in one of the hops?) ..."

~~~
tlrobinson
His statement is still incorrect.

~~~
mojuba
Agreed, I, too, had the feeling that he didn't get the algorithm. In fact, you
have to compare every 5th position with all 5 letters of the pattern - p, e,
n, 1 and s, (edit) or the other way around, like you said, which can be done
nicely by using character sets like in Pascal or bitmaps in C++.

Which brings up a question whether this method is any better than, for
example, the Intel x86 instruction for linearly scanning strings (the SCAS
family, if I'm not mistaken).
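
A rough sketch of the "check every 5th position against a character set" idea follows, using a Python `set` in place of a Pascal set or a C++ bitmap. The function name `sampled_search` is hypothetical. It relies on the fact that any occurrence of an m-character pattern must cover exactly one of the sampled positions.

```python
def sampled_search(text, pattern):
    """Return the index of the first match of pattern in text, or -1."""
    m = len(pattern)
    if m == 0:
        return 0
    chars = set(pattern)
    # Probe positions m-1, 2m-1, 3m-1, ...; every match covers one of them.
    for probe in range(m - 1, len(text), m):
        if text[probe] in chars:
            # A match covering this probe must start in [probe-m+1, probe].
            for start in range(max(0, probe - m + 1), probe + 1):
                if text.startswith(pattern, start):
                    return start
    return -1


print(sampled_search("Xpen1sXXXXXXXXXXXXXXXXX", "pen1s"))
```

This says nothing about whether it beats a linear hardware scan; it only shows that the sampled test has to consider all of the pattern's characters, not just the last one.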

------
daniel-cussen
You could mark as ugly every word with a non-alphabetic symbol in the middle,
weird punctuation, or weird capitalization. Sure, there's like a quadrillion
ways of spelling "peπi$", but how often do those symbols appear in the
middle of the words in a ham email? You never tell your friends to mEeT yOu at
th3 &4r? Srsly, why even give words with numbers and symbols in them the
benefit of the doubt?

There'd be a few exceptions, including:

  * email addresses (email@site.top-level domain)
  * Myspace names that are annoyingly decorated with symbols (although...)
  * serial codes, tokens, passwords, technical names (i386, ISO9660), etc.

They might be few enough to make into a dictionary and even allow for common
misspellings.

So, first you take all the emails Mailinator gets, you put the ones with weird
words in custody, and you let the ones with safe words go by.

The remaining emails will mostly be spam. You can further refine the process
by getting the computer to find and replace numbers and symbols with letters.

    1 -> l, i, I
    3 -> e, E
    4 -> a, A
    $ -> s, S
    π -> n, r

etc.

You then get some ambiguity. To take Paul's example, "pen1s" can become
"penls" or "penis". You can safely assume it's "penis", or use Bayesian
functions to mark it down, or whatever. If you have two weird symbols, (say
you have p3n1s) you might get four results. But if the list of possibilities
includes bad words, it increases the likelihood the email is spam. Season to
taste with Bayesian filtering.
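
As a minimal sketch of the substitution step, assuming a lowercase-only mapping and a one-word watchlist (both `SUBSTITUTIONS` and `WATCHLIST` are made up for illustration), the candidate spellings can be enumerated like this:

```python
from itertools import product

# Hypothetical mapping from look-alike symbols to the letters they may stand for.
SUBSTITUTIONS = {"1": "li", "3": "e", "4": "a", "$": "s", "π": "nr"}

# Hypothetical watchlist; a real filter would feed the candidates to a
# Bayesian classifier instead of doing a hard lookup.
WATCHLIST = {"penis"}


def candidate_spellings(word):
    """Return every spelling obtained by replacing symbols with letters."""
    options = [SUBSTITUTIONS.get(ch, ch) for ch in word.lower()]
    return {"".join(combo) for combo in product(*options)}


def looks_suspicious(word):
    """True if any candidate spelling is on the watchlist."""
    return bool(candidate_spellings(word) & WATCHLIST)


print(sorted(candidate_spellings("pen1s")))
print(looks_suspicious("p3n1s"), looks_suspicious("i386"))
```

The number of candidates grows multiplicatively with each ambiguous symbol, which is why the result is better treated as evidence for a classifier than as a verdict.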

You weight words according to the company they keep, so pretty soon latex gets
on the list of words to watch out for, along with "princess" and "offshore."

Finally, you can pluck emails out of the inboxes of users if you find out
they're spam after the fact. You would have to figure out how to reconcile ex
post facto filtering with user privacy first. In any case, suppose 1,000,000
Mailinator users receive the same email from a porn site, and you notice that
the first 1000 users who read it delete it as soon as they see it in their
inbox. If the response from the users who see the site goes above a certain
threshold of marking it as spam, you pluck it out of the inboxes of users who
haven't seen it yet. You can then tell them about it and let them see it if
they like, etc. But everyone would benefit from having a few people delete it
from the inbox.

Then, you could mark as questionable any email that has more than a certain
number of recipients. These are generally spam. Even if they're from people
you know, they're generally spam. I hate it when people put me on their
mailing list. You can put these emails in a special part of the inbox, put
them in a memo folder instead of the inbox, or something along those lines.

You can then treat email that you're not sure of as second-class email. You
can set up the inbox so it's sorted by rank (like a news aggregator) instead
of by date. You might combine the two and plot them using two dimensions, like
vertical position and color. So a very recent memo goes to the bottom and has
a bright color, and an old, personalized email from your friend goes on top,
in a darker color.

~~~
jgrahamc
One thing that's become clear (at least in my experience with Bayesian
filtering) is that less is more. You can actually waste a lot of time trying
to reverse spammer trickery like misspelling. Dumb beats out smart very often.

In POPFile we do an enormous amount of work to detect spammer trickery and a
couple of years back I did an analysis (published at Virus Bulletin) of the
effectiveness of this extra work.

Quick summary of the highlights:

1. Looking at words in the headers was more important than the body.
2. Of those, the Subject line is the most important (most important words
   were free, what, only, huge, judge, visa and soft).
3. Most useful metadata was image height and width, font color and size,
   Euclidean distance between foreground and background color (bucketized),
   and the background color itself.
4. The only trick worth spending the time on was Invisible Ink (same
   background and foreground color). Camouflage was close to Invisible Ink.
5. The charset is interesting.

In my commercial spam filter I do as little intelligent work as possible.
Everything else is done by the Bayesian magic.
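
For readers unfamiliar with the "Euclidean distance between foreground and background color (bucketized)" feature, here is one way it could be computed. This is a sketch under assumptions: the RGB tuples, the function name `color_distance_bucket`, and the bucket width of 32 are all illustrative, and POPFile's actual implementation is not shown in this thread.

```python
import math


def color_distance_bucket(fg, bg, bucket_size=32):
    """Bucket the Euclidean distance between two (r, g, b) colors.

    bucket_size is an assumed width; bucket 0 means the colors are
    (nearly) identical, i.e. text that is effectively invisible.
    """
    distance = math.dist(fg, bg)
    return int(distance // bucket_size)


white = (255, 255, 255)
black = (0, 0, 0)
print(color_distance_bucket(white, white))  # same color: bucket 0
print(color_distance_bucket(black, white))
```

Feeding the bucket number to the classifier as a token lets it learn that small distances (Invisible Ink, Camouflage) correlate with spam, without any hand-written rule.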

------
antirez
185 emails / second seems very slow...

~~~
imsteve
I'm pretty sure grep could do 185000 if the data is pre-cached...

------
derefr
"It's not about the content (although unlawful or hate stuff can stay away
thanks)"

So does this mean that mailinator filters for "unlawful or hate stuff"? I'd
like to receive my daily insults, thanks.

------
curi
so you build his graph thing at the end. and if you go part way down, then
dead end (no match), then don't you have to go back to the 2nd char of the
non-match and try to match a word from there? and thus do a lot more
comparisons than the number of bytes.

the reason is when you put 100+ words in the tree, they'll share some
substrings.
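
The concern can be illustrated with a deliberately naive trie scan that restarts from every position after a dead end. The helper names (`build_trie`, `naive_scan`) and the toy word list are assumptions for the sketch. Aho-Corasick automata avoid the restart by adding failure links; whether the article's graph does so is not shown in this thread.

```python
def build_trie(words):
    """Build a nested-dict trie; the "$end" key marks a complete word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$end"] = word
    return root


def naive_scan(text, trie):
    """Restart at every position; return (matches, number of trie lookups)."""
    matches = []
    lookups = 0
    for start in range(len(text)):
        node = trie
        for i in range(start, len(text)):
            lookups += 1
            node = node.get(text[i])
            if node is None:
                break  # dead end: go back and try from start + 1
            if "$end" in node:
                matches.append((start, node["$end"]))
    return matches, lookups


trie = build_trie(["abcd", "bce"])
print(naive_scan("abce", trie))
```

With the shared substring "bc", the scan walks three characters down "abcd", dead-ends on "e", and then re-reads "bce" from the second character, so the four-character text costs more than four lookups.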

------
gscott
That's almost as fast as one of those guys looking for action at a bathhouse

