

The War Against Spam: A report from the front line - siddhant
http://www.google.com/buzz/goog.research.buzz/DvRkTRUBSys/This-paper-is-short-but-sweet-and-quite-accessible

======
barmstrong
Wow - I think I underestimated the complexity of what Google is working on
here. I had always assumed it was a pretty normal (perhaps distributed) well
trained Bayesian algorithm. The article does a good job of highlighting some
of the other issues they are facing. Makes me more tolerant of the few spam
messages that do get through....

~~~
larsberg
I understand the difficulty of the problem, but I still have trouble not
getting irritated. Between 5 and 10 affiliate marketing messages have made it
through to my gmail inbox every day since I moved over in September, despite
diligently marking them as spam each time. Are these guys really that smart? I
don't read the contents of the body, but it's easy enough for a human to tell
just from the sender, subject, and first line that it's spam...

~~~
cdavid
The problem with spam filtering is that you do not want many false rejections
(it is much better to let some spam through than to lose valid emails).

In your example, you say it is easy to see from the sender/subject/first line
that it is spam, but only because you "know" it is spam. Could you guarantee
with 100% confidence that the email is spam?
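
The trade-off described above can be sketched with a few made-up scores. This
is only an illustration, not how Gmail works: the scores, the labels, and the
names `precision_recall` and `SAMPLE` are all assumptions. Raising the
threshold loses fewer valid emails (higher precision) but lets more spam
through (lower recall).

```python
def precision_recall(scored, threshold):
    """scored: list of (spam_score, is_spam) pairs.

    Messages with spam_score >= threshold are sent to the spam folder.
    """
    tp = sum(1 for s, spam in scored if s >= threshold and spam)
    fp = sum(1 for s, spam in scored if s >= threshold and not spam)
    fn = sum(1 for s, spam in scored if s < threshold and spam)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Invented data: one legitimate message (0.90) happens to look very spammy.
SAMPLE = [
    (0.99, True), (0.95, True), (0.90, False), (0.80, True),
    (0.60, True), (0.30, False), (0.10, False),
]

print(precision_recall(SAMPLE, 0.5))   # catches all spam, loses one valid email
print(precision_recall(SAMPLE, 0.93))  # loses no valid email, misses half the spam
```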

------
pwpwp
"the rarity with which users feel the need to check their spam box for false
positives demonstrates a high precision of classification"

I find this a curious claim. How does people not checking their spam boxes
demonstrate that there are, in fact, few false positives?

~~~
mike-cardwell
If there were lots of false positives, people would check their spam folder a
lot more. They'd expect to see non-spam email in there.

~~~
pwpwp
No. Missing emails from friends and acquaintances would get noticed (and
therefore people would look in their spam boxes). Missing emails from
strangers could be disappearing regularly without anyone noticing.

~~~
mike-cardwell
People look in their spam folder. They see an email which shouldn't have been
in there. They thus increase their frequency of checking.

That is what Google is claiming, and what I am agreeing with them about.

~~~
qjz
_People look in their spam folder._

Are you sure? I'd be willing to bet that the majority of users either don't
know the spam folder exists or trust it to be a one-way black hole dustbin
they never feel compelled to dig around in.

False positives are more likely to be discovered when reported by the sender
directly to the recipient via other means than email. Even then, I'd be
surprised if half of them followed up by trying to locate the message in the
spam folder. Frequency of checking seems like an unreliable metric when most
users live in the INBOX, especially when Google virtually organizes the
messages into "conversations" for them.

------
tedunangst
I have a simple technique gmail could use that would cut the email delivered
to my inbox by over 90%. Observe that I don't correspond with anyone in
Spanish, French, Russian, Italian, Chinese, or any other of a long list of
languages, and put those messages in the spam folder.

~~~
dchest
This case is already covered by Bayesian analysis, no need to make special
rules for it.

Also, learn languages! :-)
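
For what it's worth, here is a toy sketch of why a Bayesian filter would pick
this up without a special rule: words that only ever appear in messages marked
as spam push the score toward spam. This is a bare-bones naive Bayes with
add-one smoothing, not anything Gmail is known to run; the function names and
the training messages are made up, and whitespace tokenization would not work
for Chinese.

```python
import math
from collections import Counter


def train(messages):
    """messages: iterable of (text, is_spam). Returns token and message counts."""
    counts = {True: Counter(), False: Counter()}
    totals = {True: 0, False: 0}
    for text, is_spam in messages:
        counts[is_spam].update(text.lower().split())
        totals[is_spam] += 1
    return counts, totals


def spam_probability(text, counts, totals):
    """Naive Bayes estimate of P(spam | text); needs both classes in training."""
    vocab = set(counts[True]) | set(counts[False])
    n_spam = sum(counts[True].values())
    n_ham = sum(counts[False].values())
    log_odds = math.log(totals[True] / totals[False])
    for tok in text.lower().split():
        p_spam = (counts[True][tok] + 1) / (n_spam + len(vocab))
        p_ham = (counts[False][tok] + 1) / (n_ham + len(vocab))
        log_odds += math.log(p_spam / p_ham)
    return 1 / (1 + math.exp(-log_odds))


# Invented training set: the user has only ever marked Spanish mail as spam.
TRAINING = [
    ("compre ahora oferta", True),
    ("oferta gratis ahora", True),
    ("lunch tomorrow at noon", False),
    ("meeting notes attached", False),
]
counts, totals = train(TRAINING)
print(spam_probability("oferta ahora", counts, totals))
print(spam_probability("lunch tomorrow", counts, totals))
```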

~~~
tedunangst
If 5 years and tens of thousands of messages have failed to convince gmail
that messages to me in Chinese are 100% spam, I have little faith that more
training is going to fix the problem.

~~~
patio11
Filter on 了 or 的 in the body of the message. That should catch most natural
Chinese text of non-trivial length. (You'll lose most Japanese ham, too, but
I'm guessing that's not a problem for you.)
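
A minimal sketch of that check, in case you want to run it outside gmail's own
filter settings. The helper name `looks_chinese` and the `min_hits` parameter
are made up for illustration; only the two characters come from the suggestion
above.

```python
# The two marker characters suggested above.
CHINESE_MARKERS = ("了", "的")


def looks_chinese(body: str, min_hits: int = 1) -> bool:
    """Return True if the body contains at least min_hits marker characters."""
    hits = sum(body.count(ch) for ch in CHINESE_MARKERS)
    return hits >= min_hits


print(looks_chinese("我的朋友来了"))             # Chinese text with both markers
print(looks_chinese("Hello, see you tomorrow"))  # plain English
print(looks_chinese("目的地に着きました"))        # Japanese containing 的
```

The last call shows the Japanese caveat: 的 also turns up in Japanese words, so
those messages get caught as well.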

------
racecar789
Spam tip: Have your personal domain redirect all emails to gmail. Then when
signing up on a website, enter "name_of_website.com@yourdomain.com" as your
email. Helps to see which websites sell your email address to spammers. Also
provides protection against websites stealing your email password.
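
A rough sketch of how the leak check could be automated once mail arrives.
The function names are made up, "yourdomain.com" is the placeholder from the
tip, and comparing against the sender's domain is just one simple heuristic
(it will flag a site that legitimately sends from a different domain).

```python
def signup_site(recipient: str, my_domain: str = "yourdomain.com"):
    """Return the site name encoded in the local part, or None if not ours."""
    local, _, domain = recipient.rpartition("@")
    if domain.lower() != my_domain.lower() or not local:
        return None
    return local.lower()


def possibly_leaked(recipient: str, sender_domain: str) -> bool:
    """True if mail to a per-site address comes from some other domain."""
    site = signup_site(recipient)
    if site is None:
        return False
    sender = sender_domain.lower()
    # Accept the site itself and its subdomains (e.g. mail.example.com).
    return not (sender == site or sender.endswith("." + site))


addr = "name_of_website.com@yourdomain.com"
print(signup_site(addr))
print(possibly_leaked(addr, "mail.name_of_website.com"))
print(possibly_leaked(addr, "spammer.example"))
```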

~~~
aw3c2
You do not need gmail for that. Just set up "catch-all" mail for your domain.

------
mmaunder
Great $self->patOnBack() but web spam is a huge unsolved problem that hurts
Google's core business and is possibly the biggest threat to their revenue and
market share. It's also the biggest opportunity for innovators in the search
space right now.

~~~
mootothemax
_Great $self->patOnBack() but web spam is a huge unsolved problem_

And if you'd read the linked papers (or even the summary page for that matter)
you'd have read that this has nothing to do with web spam - only the
traditional, meaty email kind.

