
SpamAssassin is back - l2dy
https://lwn.net/Articles/769917/
======
ThJ
I have a mail server for my personal email that I've been maintaining for the
better part of a decade. It started out as Qmail/Courier, changed to
Postfix/Courier and finally Postfix/Dovecot, always with SpamAssassin as the
spam filter.

For a while, it was configured to discard spam. Later, I configured it to put
the spam in a special IMAP folder. I have the same system today, but on top of
that, I have a background shell script that watches my mail folders and runs
sa-learn on them. Anything that appears in the spam folder is learned as spam.
Anything that appears elsewhere is learned as ham. Most of the time, this will
be email that has already been classified correctly, but if I move an email,
it will reclassify it.
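
For anyone who wants to try the same thing, here is a rough Python sketch of the folder-to-sa-learn mapping rather than the actual shell script. It assumes a Maildir++ layout (inbox at the root, dot-prefixed subfolders) and a spam folder named `.Spam`; both are assumptions, and the function name is made up for illustration. It only builds the command lines, using sa-learn's `--spam` and `--ham` options.

```python
from pathlib import Path

SPAM_FOLDER = ".Spam"  # assumed folder name; adjust to your IMAP layout


def sa_learn_commands(maildir_root, spam_folder=SPAM_FOLDER):
    """Return one sa-learn command (as an argv list) per mail folder."""
    root = Path(maildir_root)
    # Maildir++ convention: the inbox is the root itself,
    # and every subfolder is a dot-prefixed directory next to it.
    folders = [root] + sorted(
        p for p in root.iterdir() if p.is_dir() and p.name.startswith(".")
    )
    commands = []
    for folder in folders:
        cur = folder / "cur"
        if not cur.is_dir():
            continue  # not a mail folder, nothing to learn from
        # The spam folder is learned as spam, everything else as ham.
        mode = "--spam" if folder.name == spam_folder else "--ham"
        commands.append(["sa-learn", mode, str(cur)])
    return commands
```

Each returned list can be handed to `subprocess.run` from a cron job or a loop. Messages that have been moved between folders get relearned with the new label on the next pass.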

Some time ago, it began leaking spam into the inbox. It turned out that I
needed to make the Bayes scores more aggressive, and after I did that, it's
been more or less perfect. More or less zero false positives and maybe a
couple of false negatives per week.

I find that the good old Bayes classifier is still the most powerful tool in
my spam filtering toolbox. You just have to be persistent and consistent in
how you train it and tune it. For example, you shouldn't train it to classify
legitimate newsletters as spam even if they are undesirable. Instead, you
should unsubscribe and put those in the trash folder. I find that this
considerably reduces the false positives.

~~~
baud147258
Cool setup! How do you check false positives? You take a look at the spam
folder for legit emails?

~~~
systemfreund
Not the OP, but I have a similar setup, except that instead of Postfix I am
using OpenSMTPD, which I found much easier to configure.

Checking for false positives is a manual process. I've never had legitimate
mail marked as spam though, only the other way round.

My experience with Gmail's filter is not as nice btw. At work, I had a couple
of situations where important mails from a customer were flagged as Spam,
which went unnoticed for two weeks.

------
perlgod
I used to run Postfix + Dovecot + Rspamd with all the bells and whistles
enabled [1], but I recently switched to OpenSMTPD + spamd on OpenBSD.

Unlike rspamd, which has pluggable modules for everything under the sun (RBLs,
word filters, Bayesian filtering + learning), spamd uses plain ol' graylisting
with some PF integration to throttle spammers' connections to 1
character/second for maximum annoyance.
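
For those who haven't seen graylisting before, the core idea fits in a few lines. This is a simplified Python sketch of the general technique, not spamd's actual implementation; the class name, the delay and expiry values, and the return strings are all assumptions made for illustration.

```python
class Greylist:
    """Temporarily reject the first attempt from an unknown (ip, sender, recipient)."""

    def __init__(self, min_delay=25 * 60, expiry=4 * 60 * 60):
        self.min_delay = min_delay  # seconds a sender must wait before retrying
        self.expiry = expiry        # seconds after which an unanswered entry is forgotten
        self.first_seen = {}        # triple -> timestamp of first attempt
        self.passed = set()         # triples that have retried correctly

    def check(self, ip, sender, recipient, now):
        key = (ip, sender, recipient)
        if key in self.passed:
            return "accept"
        first = self.first_seen.get(key)
        if first is None or now - first > self.expiry:
            # Unknown (or stale) triple: remember it and ask the sender to retry.
            self.first_seen[key] = now
            return "tempfail"
        if now - first < self.min_delay:
            return "tempfail"  # retried too quickly
        # A real MTA retries after a delay; most spam cannons never do.
        self.passed.add(key)
        return "accept"
```

Because the key includes the connecting IP, a retry that arrives from a different address starts over as a brand-new triple.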

With Rspamd, I never got _any_ spam in my Inbox. With spamd, I get maybe 1
spam mail every two weeks. To me, spamd's ridiculous simplicity is worth the
tradeoff.

You do have to be careful with graylisting large mailers like Gmail, since
they rarely retry the mail from the same IP address. For this, OpenBSD's
smtpctl now has the spfwalk [2] command to whitelist the big guys. That's what
I use in my current setup [3], which was linked here a few days ago.
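
The idea of walking SPF records to collect a big mailer's sending networks can be sketched in Python like this. It is only an illustration of the concept, not how spfwalk itself is implemented: `lookup_txt` is a hypothetical callable you supply that returns the TXT records for a domain, and the sketch only follows `ip4:`, `ip6:` and `include:` terms.

```python
def spf_walk(domain, lookup_txt, seen=None):
    """Collect ip4/ip6 networks from a domain's SPF record, following includes."""
    seen = set() if seen is None else seen
    if domain in seen:
        return []  # guard against include loops
    seen.add(domain)
    networks = []
    for record in lookup_txt(domain):
        terms = record.split()
        if not terms or terms[0].lower() != "v=spf1":
            continue  # some other TXT record, not SPF
        for term in terms[1:]:
            term = term.lstrip("+")  # "+" is the default qualifier
            if term.startswith(("ip4:", "ip6:")):
                networks.append(term.split(":", 1)[1])
            elif term.startswith("include:"):
                included = term.split(":", 1)[1]
                networks.extend(spf_walk(included, lookup_txt, seen))
    return networks
```

The resulting list of networks is what you would feed into the graylisting whitelist, so that retries from any of the listed addresses are let straight through.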

[1] [https://www.c0ffee.net/blog/mail-server-guide/](https://www.c0ffee.net/blog/mail-server-guide/)

[2]
[https://poolp.org/posts/2018-01-08/spfwalk/](https://poolp.org/posts/2018-01-08/spfwalk/)

[3] [https://github.com/cullum/dank-selfhosted](https://github.com/cullum/dank-selfhosted)

~~~
cristoperb
For greylisting nowadays I use Postfix's builtin postscreen[1] along with a
few DNS-based whitelists and postscreen_dnsbl_whitelist_threshold=-1 to make
sure gmail etc don't get delayed. No extra software required. Though it would
be nice if it had builtin ability to find IP addresses to whitelist via SPF.

I recently set up a small postfix+dovecot system, and postscreen with DNS
blacklists alone seems quite effective. But I do plan to add spamassassin or
rspamd or spamd at some point.

1:
[http://www.postfix.org/POSTSCREEN_README.html](http://www.postfix.org/POSTSCREEN_README.html)

------
codetrotter
> The current code is getting old, and there is interest in applying deep-
> learning techniques to the spam-detection problem.

Yes please! I archive all my mail, both desired and spam mail, with the
intention of using this data to train a neural network that will be able to
classify mail as spam, desired newsletter or desired personal mail.

------
decasteve
> some sites are using it to detect spam submitted in web forms, for example.

I actually have the web server send the form contents via email because my
email server runs SpamAssassin and it has done a great job catching form spam.

------
tyingq
I let Google filter my spam these days, but back when I did it myself I liked
popfile better because it could handle general classification versus just spam
or not-spam.

So I could train it to move mail into a "High Priority" bucket, or "Not Spam,
But Promotional" bucket, etc.

Curious if SpamAssassin can do that now.

I was also curious about the "freemail antiforge" feature mentioned in the
article, but couldn't find much about it.

~~~
ryanmercer
>I let Google filter my spam these days,

I do this but I have to check my spam filter every day and scan titles because
it regularly grabs things I'm subscribed to, and have marked 'not spam' (in
some cases dozens of times), and filters them as spam. I've also found some
friends randomly ending up in spam, in one instance a thread that we'd already
had a dozen or so exchanges on.

It mostly seems to catch things that organizations send via Mailchimp, but it
will throw other stuff in there from time to time too. No matter what I do, it
ALWAYS puts valid email from Vanguard in there.

~~~
tyingq
Interesting. I assume you've tried a filter with the "never send to spam"
option? Like this: [https://imgur.com/a/q4gxipf](https://imgur.com/a/q4gxipf)

~~~
ryanmercer
Never looked for that, I'm on an awful lot of email lists for businesses and
podcasts and it randomly gobbles them up into spam. Probably easier to just
check my spam every day than to try and find them all and manually add all of
those From addresses.

Will work for vanguard though!

------
breakall
Spam detection on my personal email server took a huge leap forward when I got
Spamassassin correctly configured to use the DNS blocklists (URIBL, etc.). I
had to set up a local DNS server because the blocklists weren't responding to
requests coming via my hosting provider's DNS, and the scores had to be
tweaked over a few weeks, but now it's working great. The content rules like
"hi my dear", "request for money", and "all caps" are still in there, but the
blocklists do the heavy lifting.

------
z3t4
It can be exciting when your filter blocks a ton of spam, but then you get a
few false positives, and of course they happened to be very important mails
... I think the spam filter needs to be plugged into the MTA, so that when a
message is flagged as spam, the MTA can tell the sender that hey, this message
got flagged as spam.

~~~
viraptor
The problem with that is that the sender can then iterate on the message until
they don't get a "flagged as spam" notification. This applies both to the real
sender and to the spammer.

~~~
lifeisstillgood
Just don't block anything that has an email address you have mailed _to_
before. It's highly unlikely an important email will come from someone I have
not mailed.
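
Roughly what I have in mind, as a Python sketch using only the standard library. The function names are made up for illustration, and it assumes you can already pull the To/Cc headers out of your sent mail by whatever means. Keep in mind that a From header can be forged, so this is a convenience rule and not a security check.

```python
from email.utils import getaddresses, parseaddr


def known_correspondents(recipient_headers):
    """Build a set of addresses from the To/Cc headers of mail you have sent."""
    return {addr.lower() for _, addr in getaddresses(recipient_headers) if addr}


def should_skip_spam_filter(from_header, known):
    """True if the sender is someone you have mailed before."""
    _, addr = parseaddr(from_header)
    return addr.lower() in known
```

With `known = known_correspondents(["Alice <alice@example.com>"])`, a later message with `From: Alice <ALICE@example.com>` bypasses the filter, while mail from an address you have never written to still goes through it.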

I am playing with different gmail api fixes to do things like this - with
12,000 unread mails in my inbox I need some automated help.

And if the lottery commission wants to tell me I have won, they can get my
number easily.

~~~
tinus_hn
If you never sign up for services on the web and are not part of mailing
lists.

------
LinuxBender
SA is good for small orgs. For my own server, I just use regex rules in
postfix called S25R [1]. It has been sufficient for me and keeps the run queue
very low.

I implemented S25R regex rules in a medium sized company that had a very bad
spam problem due to old outlook setups replying to all the spammers with OOO
replies. They were using SA and tried to keep up on rule tuning, but it was a
losing battle. The OS run queue on the 6 inbound servers averaged 6 and would
sometimes peak at 14+. I switched to S25R regex rules and the run queue
dropped to 0.2. I did reject some "valid" emails from folks that left the
company and were running their own business from their home cable modems, but
eventually whitelisted some of them. The employees were very happy with the
change. They went from receiving 2k+ spam msgs per day (each) to less than a
dozen.

[1] - [http://www.gabacho-net.jp/en/anti-spam/anti-spam-system.html](http://www.gabacho-net.jp/en/anti-spam/anti-spam-system.html)

------
linsomniac
One thing I really liked about SpamAssassin, and about how prevalent it was,
was that I could put hashcash stamps on my outbound e-mails and basically
eliminate the chances of my e-mail being caught as spam. This was probably a
decade ago (I've since moved to Gmail), but it would take 10-60 seconds per
recipient to generate the hashcash, and SpamAssassin would weight those
messages heavily in my favor.

I'm kind of surprised that it never caught on.

~~~
gst
Is hashcash still feasible today? With botnets it shouldn't be a major problem
to generate a large number of hashcash stamps. In addition, the current
hashcash approach uses SHA-1 which would be trivial to speed up with GPUs or
FPGAs (similar to how Bitcoin mining works). At the same time it wouldn't be
possible to increase the required "price" of hashcash stamps as this would
lock out users who can only use CPUs to calculate the stamp.
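
To make the mechanism concrete, here is a minimal Python sketch of minting and checking a stamp. It assumes the hashcash version 1 layout (`ver:bits:date:resource:ext:rand:counter`) and simplifies several things: the date and random field are fixed placeholder values, the counter is written in hex, and verification skips the date, recipient and double-spend checks a real verifier would do.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest):
    """Count how many zero bits the digest starts with."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def verify(stamp):
    """Check that the stamp's SHA-1 has at least the claimed number of zero bits."""
    fields = stamp.split(":")
    if len(fields) != 7 or fields[0] != "1" or not fields[1].isdigit():
        return False
    digest = hashlib.sha1(stamp.encode("ascii")).digest()
    return leading_zero_bits(digest) >= int(fields[1])


def mint(resource, bits=12, date="181101", rand="c2FsdA"):
    """Brute-force a counter until the stamp verifies (about 2**bits tries)."""
    prefix = f"1:{bits}:{date}:{resource}::{rand}:"
    for counter in count():
        stamp = prefix + format(counter, "x")
        if verify(stamp):
            return stamp
```

Minting takes about 2**bits hashes on average while verifying takes one, and that asymmetry is the whole "price". It is also why hardware that hashes SHA-1 faster lowers the price for whoever owns it.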

~~~
linsomniac
That's a good question. I assumed that the hashcash benefit in SA had been
tuned over the years. I had picked the largest hashcash at the time for the
10-60 seconds per recipient. Not sure how much of a speedup a GPU can bring:
10x? 100x?

Yes, botnets or GPUs can be used to accelerate hashcash. But the point of
hashcash isn't that it can't be accelerated, it is that it increases the cost
of sending each message dramatically. Most spam seems to be sent because it is
just so damn cheap to send a million messages. Adding a second to each
recipient is a significant cost.
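
As a back-of-the-envelope check in Python (the helper name is just for illustration, and the per-stamp times are the rough figures from this thread, not measurements):

```python
SECONDS_PER_DAY = 86_400


def cpu_days(messages, seconds_per_stamp):
    """Total compute time, in days, to stamp every message once."""
    return messages * seconds_per_stamp / SECONDS_PER_DAY
```

`cpu_days(1_000_000, 1)` comes to about 11.6 days of compute for a million recipients at one second each, and at 10 seconds per recipient it is about 116 days.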

Of course, as we saw with car alarms and immobilizers, there are unintended
consequences with things. In cars, we moved from people stealing cars in the
middle of the night, to people getting confronted at gunpoint to get the keys.
I wonder if increasing the cost of sending spam might just promote the highly
targeted methods like catfishing or whatnot.

------
defanor
Not directly related to SA being back, but since spam combat stories and
approaches are shared here: I used to run SA with postfix, but it was quite a
memory hog, occasionally leading to OOM killer raging. Half a year ago I
disabled it and revised the regular postfix (+ postscreen) settings, consulting
a couple of helpful articles [1,2], and haven't gotten any "real" spam (i.e.,
not counting Bitbucket) since.

Likely the incoming spam varies for different people and their mailboxes, but
tweaking those standard settings can be surprisingly efficient.

[1]
[http://rob0.nodns4.us/postscreen.html](http://rob0.nodns4.us/postscreen.html)

[2] [http://jimsun.linxnet.com/misc/postfix-anti-UCE.txt](http://jimsun.linxnet.com/misc/postfix-anti-UCE.txt)

------
creeble
Hmm. Maybe it's time to give SA a try again.

I had very little luck in keeping SA performing well over time. It would go
for a few months doing a decent job, then just seem to fall off the tracks for
a new particular reign of spam, even with graylisting on.

I finally gave up and have been using... A Barracuda box. I hate to say it
(because they're stupid expensive for all but the smallest box), but it has
worked incredibly well, with zero overhead. I've gone for over a year without
so much as logging in to the box.

Ultimately, I think the 'cuda is just running SA with Barracuda's collective
filtering added on, but it sure works for me and my dozen or so users.

Maybe Barracuda doesn't sell that many spam boxes (they have quite a few other
products), but I can say that it has sure worked for me.

If I could keep SA tuned well with minimal effort, it would be worth the
savings however.

~~~
tushar-r
>Maybe Barracuda doesn't sell that many spam boxes

Honestly, I feel we're still primarily known for our anti-spam solutions, and
we're investing quite a bit into AI based detection of spam/phishing etc.
Depending on what you need, you may want to look at our cloud-based solutions
over a physical box, in case the pricing is better.

(I work for Cuda in case that wasn't clear. Work on one of the "quite a few
other products" :-D)

------
linker3000
A big shout out to Julian Field et al. at MailScanner - I don't manage any
mail servers these days, but going back around 15 years, it was at the heart of
several setups.

[https://www.mailscanner.info/](https://www.mailscanner.info/)

And, yes, it uses Spamassassin.

------
jrnichols
Between SpamAssassin, amavis, and postscreen (mainly postscreen) I rarely see
any spam anymore.

But I do see that development has been kind of slow, and not just with SA -
most everything email related. It truly frightens me just how many people gave
up and handed all of their email over to Google and Gmail, who continue to
prefer their way of doing things.

I'm looking at rebuilding a mail server on a new VPS pretty soon, and I'll
happily set up SpamAssassin once again. It's great to see that it's still
going strong.

------
aduitsis
Extremely glad to see a large project written in Perl pick up steam again.

~~~
ttul
I find it frightening that a big Perl project is still being maintained. The
roster of folks who can parse that code is getting smaller and smaller with
time. It would be nice to see someone spark up a neural net based spam
filtering project that leverages a more modern language like Rust or Go.

------
leowinterde
I'm running rspamd on some servers; it filters fine, and the DMARC reports are
an awesome feature. A comparison would be interesting in the future.

~~~
ibotty
There is [https://lwn.net/Articles/732570/](https://lwn.net/Articles/732570/)

------
dev_dull
> _Just like Gmail, SpamAssassin isn't the perfect filter for everybody right
> out of the box; it's really a framework that can be used to create that
> filter._

I find gmail to be about 99% accurate “out of the box”. With today’s
technology and processing power I don’t see why that can’t also be true for
other spam offerings.

------
kokey
When I was involved with the mechanics of an ESP delivering bulk mail for
clients, I found SpamAssassin to be very useful for scanning e-mails even
before they get sent in order to detect possible abuse.

------
INTPenis
Spamassassin is and has been perfectly adequate as long as you train your
filter regularly with new spam.

And the more users you filter for, the more spam you should be feeding your
filter, and regularly.

When the person says their gmail account is full of spam I take that with a
grain of salt because I've been using gmail since it started, when it was like
5 invites only, and I use it as my throwaway account everywhere. I know for a
fact that gmail does an excellent job at filtering spam.

~~~
tjoff
Gmail is terrible.

False positives are a nightmare and if you have to check the spam folder then
the spam filter does more harm than good.

That is also my experience with SA.

~~~
ryandrake
Additionally, I found Gmail will sometimes deny legitimate E-mail as "spam"
_as it comes in_, before it even reaches your Spam folder. I tested this for
a week or so by forwarding my personal E-mail from my mail server to my
address at Gmail and checking my server logs. This was one of the things that
finally pushed me over to self-hosting. Mail does not get lost with
Spamassassin if it's set up to deliver spam into a separate folder. As long as
the mail passes the trivial SPF, DKIM, etc. checks, it will at least get to my
junk folder.

------
stcredzero
I'm picturing a new mascot. Blocky in the way Spongebob is, but ninja themed
and made out of canned meat product.

------
krupan
Wouldn't it be awesome if you could filter your Facebook/Twitter/whatever feed
through SpamAssassin too?

------
appleflaxen
I miss blue frog.

