
Ask HN: Why is it still so hard to detect spam? - nkkollaw
I'm using macOS Mail as a secondary client alongside Google Inbox (which works
great), and it's amazing how bad the spam filter is.

I've been training it to detect spam by marking emails manually for about 2
weeks, and it's still having a hard time recognising most spam emails.

I know that spammers get clever trying to trick spam filters, but some emails
are so obviously spam that it should be a no-brainer: for instance, I'm getting
hundreds of emails in Chinese, and I don't even have a Chinese font installed
(which is information I assume macOS could potentially have access to). What
are the chances it's *not* spam?
======
BjoernKW
I'm not sure, but it might be that providers of client-side email software
that includes spam filters have just given up.

Personally, I have been relying exclusively on server-side spam filters (Gmail
in my case) for years now, and that works well enough. Server-side spam
filters, on the other hand, cannot know for sure that you're not expecting
emails in Chinese. I suppose that's what these spammers are capitalising on.

Spam filtering on the client side is expensive, both in terms of the effort
required to create the software and in terms of CPU cycles on your local
machine. Server-side spam filters like Google's use simple voting algorithms
that scale very well. On the client side you need language models tuned
specifically to your email usage in order to avoid false positives. These
models and the training they require are quite complex. Plain Bayesian filters
just aren't good enough anymore to beat today's sophisticated spam creation
algorithms.
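
For reference, the kind of "simple Bayesian filter" mentioned above can be
sketched in a few lines. This is a toy multinomial naive Bayes classifier with
add-one smoothing, not the filter any particular mail client ships; the class
name `NaiveBayesSpamFilter`, the tokenizer, and the "spam"/"ham" labels are
assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayesSpamFilter:
    """Toy multinomial naive Bayes filter (hypothetical, for illustration)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record one manually labelled message ("spam" or "ham")."""
        self.word_counts[label].update(tokenize(text))
        self.message_counts[label] += 1

    def _log_score(self, tokens, label):
        vocabulary = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_words = sum(self.word_counts[label].values())
        total_messages = sum(self.message_counts.values())
        # Smoothed log prior, so a class with no training data is not -inf.
        score = math.log((self.message_counts[label] + 1) / (total_messages + 2))
        for token in tokens:
            count = self.word_counts[label][token]
            # Add-one smoothing; the extra +1 reserves mass for unseen words.
            score += math.log((count + 1) / (total_words + len(vocabulary) + 1))
        return score

    def is_spam(self, text):
        """Return True if the spam score beats the ham score."""
        tokens = tokenize(text)
        return self._log_score(tokens, "spam") > self._log_score(tokens, "ham")
```

A filter like this only sees word frequencies, which is part of why spam that
imitates ordinary prose can slip past it until it has seen many labelled
examples.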

It's a bit like an inverse P vs. NP problem: spam that looks like natural
language on the surface is easy to create, but verifying that it's spam is
very hard, on a local level at least.

~~~
nkkollaw
Definitely. Google Inbox has been a godsend. I'd assume it works using
feedback from millions of users.

Still, I think Mail could mark emails written in a language you don't speak as
spam pretty easily.
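
A minimal sketch of that idea, assuming the client knows whether the user
reads Chinese: measure what fraction of a message's letters are CJK
ideographs and flag the message when that fraction is high. The function
names, the `user_reads_cjk` flag, and the 0.5 threshold are all hypothetical,
and the code point ranges also match Japanese kanji, so this is a rough
heuristic rather than real language detection.

```python
def cjk_ratio(text):
    """Return the fraction of letters in `text` that are CJK ideographs."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    cjk = sum(
        1
        for ch in letters
        # CJK Unified Ideographs and CJK Unified Ideographs Extension A.
        if 0x4E00 <= ord(ch) <= 0x9FFF or 0x3400 <= ord(ch) <= 0x4DBF
    )
    return cjk / len(letters)


def looks_like_unread_language(text, user_reads_cjk=False, threshold=0.5):
    """Flag mostly-CJK messages for users who do not read CJK scripts."""
    return not user_reads_cjk and cjk_ratio(text) > threshold
```

A signal like this would more safely feed into a spam score than decide on its
own, since a one-off legitimate message in another language is always
possible.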

------
beagle3
I've been using Thunderbird for years, and its spam detector works better than
Gmail's for me - I haven't had either a false negative or a false positive
from Thunderbird in over a year, and I do have false positives (but not
negatives) from Google.

My mail server does do some gray and red filtering, so Thunderbird starts with
a relatively clean slate (an average of 3 spam messages per day across 20
addresses that have been public for at least 10 years), so that might be part
of it.
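
If "gray filtering" here refers to greylisting (an assumption; the comment
does not say), the server-side idea can be sketched as follows: the first
delivery attempt from an unknown (client IP, sender, recipient) triplet gets a
temporary failure, and the message is accepted only when the sending server
retries after a minimum delay. The `Greylist` class, its return values, and
the 300-second delay are hypothetical choices for illustration.

```python
import time


class Greylist:
    """Toy greylisting policy keyed on (client IP, sender, recipient)."""

    def __init__(self, min_delay_seconds=300, clock=time.time):
        self.min_delay_seconds = min_delay_seconds
        self.clock = clock  # injectable so the policy can be tested
        self.first_seen = {}

    def check(self, client_ip, sender, recipient):
        """Return "accept" or "tempfail" for one delivery attempt."""
        triplet = (client_ip, sender, recipient)
        now = self.clock()
        # Remember when this triplet was first seen; keep the earliest time.
        first = self.first_seen.setdefault(triplet, now)
        if now - first >= self.min_delay_seconds:
            return "accept"
        return "tempfail"
```

Legitimate mail servers queue and retry after a temporary failure, while much
bulk-sending software does not, which is why this cheap check can remove a lot
of spam before any content filter runs.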

