SpamAssassin is back (lwn.net)
324 points by l2dy 5 months ago | 68 comments

I have a mail server for my personal email that I've been maintaining for the better part of a decade. It started out as Qmail/Courier, changed to Postfix/Courier and finally Postfix/Dovecot, always with Spamassassin as the spam filter.

For a while, it was configured to discard spam. Later, I configured it to put the spam in a special IMAP folder. I have the same system today, but on top of that, I have a background shell script that watches my mail folders and runs sa-learn on them. Anything that appears in the spam folder is learned as spam. Anything that appears elsewhere is learned as ham. Most of the time, this will be email that has already been classified correctly, but if I move an email, it will reclassify it.
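A minimal sketch of the kind of background job described above, written in Python rather than shell. The Maildir++ layout, the `.Spam` folder name and the helper names are assumptions for illustration, not the poster's actual script; `sa-learn --spam` and `sa-learn --ham` are the real training switches.

```python
from pathlib import Path
import subprocess

SPAM_FOLDER = ".Spam"  # assumed name of the IMAP spam folder (Maildir++ layout)


def learn_command(folder: Path) -> list:
    """Build the sa-learn invocation for one mail folder.

    Anything in the spam folder is learned as spam, everything else as ham.
    """
    kind = "--spam" if folder.name == SPAM_FOLDER else "--ham"
    return ["sa-learn", kind, str(folder / "cur")]


def mail_folders(maildir: Path) -> list:
    """Return the inbox plus every Maildir++ subfolder (directories starting with a dot)."""
    subfolders = sorted(
        p for p in maildir.iterdir() if p.is_dir() and p.name.startswith(".")
    )
    return [maildir] + subfolders


def train(maildir: Path, dry_run: bool = True) -> list:
    """Collect one sa-learn command per folder and, unless dry_run, execute them."""
    commands = [learn_command(folder) for folder in mail_folders(maildir)]
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)
    return commands
```

Calling `train(Path.home() / "Maildir", dry_run=False)` from cron or a loop would approximate the "watches my mail folders" part; a real watcher would more likely react to filesystem events.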

Some time ago, it began leaking spam into the inbox. It turned out that I needed to make the Bayes scores more aggressive, and after I did that, it's been more or less perfect. More or less zero false positives and maybe a couple of false negatives per week.

I find that the good old Bayes classifier is still the most powerful tool in my spam filtering toolbox. You just have to be persistent and consistent in how you train it and tune it. For example, you shouldn't train it to classify legitimate newsletters as spam even if they are undesirable. Instead, you should unsubscribe and put those in the trash folder. I find that this considerably reduces the false positives.
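To make the newsletter point concrete, here is a toy word-level naive Bayes filter. It is a sketch under simplifying assumptions (whitespace tokens, add-one smoothing), not SpamAssassin's actual Bayes implementation, and the sample messages are invented.

```python
import math
from collections import Counter


class TinyBayes:
    """Toy naive Bayes spam filter with add-one smoothing."""

    def __init__(self) -> None:
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def learn(self, label: str, text: str) -> None:
        self.counts[label].update(text.lower().split())
        self.messages[label] += 1

    def spam_probability(self, text: str) -> float:
        # Vocabulary size plus one slot for never-seen words.
        vocab = len(set(self.counts["spam"]) | set(self.counts["ham"])) + 1
        spam_total = sum(self.counts["spam"].values())
        ham_total = sum(self.counts["ham"].values())
        # Start from the (smoothed) prior odds, then add per-word evidence.
        log_odds = math.log((self.messages["spam"] + 1) / (self.messages["ham"] + 1))
        for word in text.lower().split():
            p_spam = (self.counts["spam"][word] + 1) / (spam_total + vocab)
            p_ham = (self.counts["ham"][word] + 1) / (ham_total + vocab)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))


def demo() -> tuple:
    """Score the same notification before and after mis-training a newsletter as spam."""
    notification = "your order has shipped view your account"
    clean, skewed = TinyBayes(), TinyBayes()
    for model in (clean, skewed):
        model.learn("spam", "cheap pills buy now")
        model.learn("ham", "your invoice for this month is attached")
    # A legitimate but unwanted newsletter dropped into the spam folder:
    skewed.learn("spam", "weekly newsletter view your account for new offers")
    return clean.spam_probability(notification), skewed.spam_probability(notification)
```

In `demo()`, the second score is higher than the first: words the newsletter shares with ordinary notifications ("view", "account") now count as spam evidence, which is the false-positive skew described above.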

My use followed a similar arc.

One tip: look at your mail server logs for the top 100 bad addresses. For me they were a combination of incorrect first names (doug@, joe@) and incorrect roles (sales@, support@). Deliver all of that mail directly to a spam trainer. That gives me good odds the Bayesian filter will have repeatedly seen novel spam before they try my account.

I also subaddress [1] any address I give to a vendor. So when those leak to spammers, I route those to sa-learn as well. It's a real pleasure to use the key spammer advantage -- indiscriminate volume -- against them.

[1] https://tools.ietf.org/html/rfc5233
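A rough sketch of the routing idea from the two tips above. The `+` separator, the function names, and the example tag and bait-address sets are all assumptions; a real setup would do this in the MTA or a Sieve script rather than in Python.

```python
from typing import Optional, Set, Tuple


def split_subaddress(address: str, separator: str = "+") -> Tuple[str, Optional[str], str]:
    """Split 'user+detail@domain' into (user, detail, domain); detail is None if absent."""
    local, _, domain = address.rpartition("@")
    user, sep, detail = local.partition(separator)
    return user, (detail if sep else None), domain.lower()


def route(address: str, leaked_tags: Set[str], bad_locals: Set[str]) -> str:
    """Return 'spam-trainer' for bait addresses and leaked vendor tags, else 'inbox'."""
    user, detail, _ = split_subaddress(address)
    if user.lower() in bad_locals:
        # Addresses that never existed (doug@, sales@, ...) only ever receive spam.
        return "spam-trainer"
    if detail is not None and detail.lower() in leaked_tags:
        # A vendor-specific tag that has leaked to spammers.
        return "spam-trainer"
    return "inbox"
```

For example, `route("me+shop@example.org", {"shop"}, {"sales", "doug"})` returns `"spam-trainer"`, while an untagged or still-trusted address goes to the inbox.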

I use Gmail because I know I'm not going to be able to keep an email server secure over time. But I still have to deal with shitty newsletters and other junk. After putting up with it for too long I decided to just domain-filter people that send me newsletters. The only thing that caused me trouble was LinkedIn, where people would occasionally message me expecting a fast response. So I solved that problem by deleting my LinkedIn account. If I reset my, say, Twitter password I know where to find it.

All other email from them and countless others is useless.

I'm no l33t hax0r or anything, but when I see "I've hosted a personal mail server for 10 years", I think, "I could probably own that person."

Almost certainly false (I couldn't even think of a more technical word than "own"), but I just dunno why people take that risk. I get that the threat profile of Joe Schmoe isn't the same as Car Company, Inc., but still...

I find the "anything I don't like goes into spam" is among the biggest problems for classifying spam using rules shared across users, and has been forever. We support training the classifier in our products, via a couple of different mechanisms, but don't really make it too evident how to enable it and use it, because the result is often a lot of confusion as users break their classifier over time by putting legitimate (but no longer wanted) mail into the spam folder.

Unsurprisingly, GMail also exhibits this problem. I have to check my spam folder regularly because it gets a ton of false positives these days. Nearly always newsletters and such I've signed up for. Another category of common false positive, which is admittedly harder to cope with, is moderation notifications from my WordPress sites...it's often actually a spam post, but it seems to learn the shape of those messages and start to treat them all as spam, even the legitimate ones.

Spam filtering has always been, and continues to be, hard. It seemed, for a brief shining moment a few years ago, that spam was on the run...but, it's come roaring back in recent years, I guess as cloud-based mail services made it feasible for spammers to bounce from one provider to another, to keep sending through "known trustworthy" senders. There probably ought to be some way to punish those providers that look the other way when used for spam, but blocking them completely is likely to impact non-spammers (but might be a necessary pain to reduce spam).

I forget at times that I originally started using Gmail because of the spam filtering. My main problem wasn't getting spam, it was getting alerts on my damn phone that I got spam. I don't mind checking the spam folder, as it's like having a bunch of people at the front door waiting for me to come outside. If they all knocked, I'd never go outside.

I've more recently switched to rspamd from spamassassin. I've been using SA for more than a decade on my own server, but have been getting frustrated with its spam matching more recently. I've tweaked it and tweaked it (including doing things like telling it to trust pyzor more) and still obvious stuff is leaking through. That's even with a cron job running to train itself on email I move to a specific folder. There are certain patterns of spam that just won't go away.

CPU usage was noticeable for it, too. I started looking for compiled alternatives, and rspamd fit that need. It's faster than SA, uses fewer resources, and seems to catch spam better than SA did.

I'm a happy SpamAssassin user too, just wanted to add that I've found using sa-learn to teach it about ham/spam makes the spam detection a lot better. When spam gets through, I move it to a junk folder, and from time to time, use that folder to train SA on. The level of false positives is tiny, perhaps 1-2 emails a year.

Did the same: went from Qmail/Courier to Postfix/Dovecot. But I'm not using Spamassassin. Instead, I filter out all TLDs except those that I have whitelisted, then filter by reverse IP, DMARC and SPF checks. Lastly, I block domains from which I have received a new mail and marked it as spam from the client side within 24 hours.

Conventional wisdom is to filter unwanted newsletters and never click the unsubscribe links. Clicking those links confirms that your email address is valid and being monitored, which can make it more valuable when the company shares and sells your email address to others. Also, attackers can forge newsletters from respectable companies or just make up something new entirely, and the unsubscribe links can be used to send you to malicious sites that can open the door to malware infections and phishing. When someone sends you trash, you should assume it is toxic and avoid interacting with it as much as possible.

When I said legit newsletters, I meant the sort you get from sites you've already signed up for. Newsletters that arrive out of the blue are not legit and you should of course never interact with those, but that's self-evident for the reasons you stated, and the whole point of me mentioning legit newsletters specifically was to point out that these shouldn't go in your spam folder because they tend to look too similar to legitimate notification emails, and you only end up skewing the filter toward false positives.

I encountered the same thing, malicious spam, and logged it at https://gitlab.com/davchana/gmail-indian-spam-domains/blob/m...

I use a similar setup. Spam is simply discarded, whereas mails from all whitelisted users will force autolearn. Any mails thrown into the spam folder will be passed on to sa-learn as spam (and submitted to SpamCop and Razor2).

So far, no users have complained about false positives being discarded, and moving mails to the spam folder usually results in similar mails being blocked in the future.

The Bayes filter is very, very effective on its own, but as you mentioned it's not perfect. I've found that using Razor2 as a complement is very effective.

I have the same postfix dovecot setup and would love to set up Spamassassin correctly. Do you know of any tutorials which outline how?

I used to operate a mail server for my company, but it was just too much extra work to get sa working company-wide. It just got pinched out of the list of things I was willing to spend any time on because it is such a nuisance.

You can also use the same technique to sort your e-mail into more categories than just spam and not spam. For example, I have a category for "commercial" (e.g. newsletters and offers), kinda like Gmail started with some time ago. But I also have more categories, like a category for each project.

I wish I had the patience to train Bayes properly. The problem in my case is coming up with a good method of feeding it ham (not just for my own mailbox, but for every user in the company, who obviously don't all correspond with the same people or utilize their mailboxes in the same way). I end up doing a LOT of manual configuration, my local.cf is currently 178 KB (which doesn't sound that big, but it's a lot of rules to run against each and every message).

I use Roundcube for webmail, which auto-collects outgoing addresses in an address book (and you can obviously manually add addresses as well). I have a script that periodically exports the address book for each user and uses that as a post-Bayes filter; anything from a known address is automatically whitelisted, and if it would have been marked as spam by Bayes then it is also trained as ham. It works pretty well.
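The decision logic of that post-Bayes step might look roughly like this. The names are hypothetical and the address book is assumed to be a set of lowercased addresses already exported from webmail; the Roundcube export itself is not shown.

```python
from dataclasses import dataclass
from typing import Set


@dataclass(frozen=True)
class Verdict:
    deliver_to: str      # "inbox" or "spam"
    train_as_ham: bool   # whether to feed the message back to Bayes as ham


def post_bayes_filter(sender: str, bayes_says_spam: bool, address_book: Set[str]) -> Verdict:
    """Override the Bayes result for senders found in the user's address book."""
    known = sender.strip().lower() in address_book
    if known:
        # Known correspondents are always delivered; if Bayes got it wrong,
        # the message is also used as a ham training example.
        return Verdict("inbox", train_as_ham=bayes_says_spam)
    return Verdict("spam" if bayes_says_spam else "inbox", train_as_ham=False)
```

Only the misclassified messages from known senders are fed back, so the ham corpus grows from real corrections rather than from every delivered mail.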

Hey - could you share some of your SA rules/config that are working well for you?

I tend to find that when I manually categorize spam/ham and sa-learn, spam goes to 0 for a couple of weeks, then creeps upward as trends change (?!) or as they learn (?!?!).

Cool setup! How do you check false positives? You take a look at the spam folder for legit emails?

Not the OP, but I have a similar setup, however instead of Postfix I am using OpenSMTPd, which I found much easier to configure.

Checking for false positives is a manual process. I've never had legitimate mail marked as spam though, only the other way round.

My experience with Gmail's filter is not as nice btw. At work, I had a couple of situations where important mails from a customer were flagged as Spam, which went unnoticed for two weeks.

Pretty much. I just check it occasionally. Did the same thing back when I was a Gmail user, because Gmail has false positives too from time to time.

I used to run Postfix + Dovecot + Rspamd with all the bells and whistles enabled [1], but I recently switched to OpenSMTPD + spamd on OpenBSD.

Unlike rspamd, which has pluggable modules for everything under the sun (RBLs, word filters, Bayesian filtering + learning), spamd uses plain ol' graylisting with some PF integration to throttle spammers' connections to 1 character/second for maximum annoyance.

With Rspamd, I never got any spam in my Inbox. With spamd, I get maybe 1 spam mail every two weeks. To me, spamd's ridiculous simplicity is worth the tradeoff.

You do have to be careful with graylisting large mailers like Gmail, since they rarely retry the mail from the same IP address. For this, OpenBSD's smtpctl now has the spfwalk [2] command to whitelist the big guys. That's what I use in my current setup [3], which was linked here a few days ago.

[1] https://www.c0ffee.net/blog/mail-server-guide/

[2] https://poolp.org/posts/2018-01-08/spfwalk/

[3] https://github.com/cullum/dank-selfhosted
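For readers unfamiliar with graylisting, a minimal sketch of the idea follows. This is not spamd's implementation; the class, the 300-second delay and the allowlist parameter are assumptions chosen for illustration.

```python
from typing import Dict, FrozenSet, Tuple

Triplet = Tuple[str, str, str]  # (client IP, envelope sender, envelope recipient)


class Greylist:
    """Temp-fail the first delivery attempt of each triplet; accept later retries."""

    def __init__(self, min_delay: float = 300.0,
                 allowlisted_ips: FrozenSet[str] = frozenset()) -> None:
        self.min_delay = min_delay              # assumed minimum wait before a retry passes
        self.allowlisted_ips = allowlisted_ips  # e.g. addresses collected from SPF records
        self.first_seen: Dict[Triplet, float] = {}

    def check(self, ip: str, sender: str, recipient: str, now: float) -> str:
        if ip in self.allowlisted_ips:
            return "accept"
        key = (ip, sender.lower(), recipient.lower())
        first = self.first_seen.setdefault(key, now)
        if now - first >= self.min_delay:
            return "accept"
        return "tempfail"  # a real MTA would answer with a 4xx reply here
```

Because the client IP is part of the key, a large mailer that retries from a different address starts over as a new triplet, which is why allowlisting the big providers matters.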

For greylisting nowadays I use Postfix's builtin postscreen[1] along with a few DNS-based whitelists and postscreen_dnsbl_whitelist_threshold=-1 to make sure gmail etc don't get delayed. No extra software required. Though it would be nice if it had builtin ability to find IP addresses to whitelist via SPF.

I recently set up a small postfix+dovecot system, and postscreen with DNS blacklists alone seems quite effective. But I do plan to add spamassassin or rspamd or spamd at some point.

1: http://www.postfix.org/POSTSCREEN_README.html

> The current code is getting old, and there is interest in applying deep-learning techniques to the spam-detection problem.

Yes please! I archive all my mail, both desired and spam mail, with the intention of using this data to train a neural network that will be able to classify mail as spam, desired newsletter or desired personal mail.

> some sites are using it to detect spam submitted in web forms, for example.

I actually have the web server send the form contents via email because my email server runs SpamAssassin and it has done a great job catching form spam.

I let Google filter my spam these days, but back when I did it myself I liked popfile better because it could handle general classification versus just spam or not-spam.

So I could train it to move mail into a "High Priority" bucket, or "Not Spam, But Promotional" bucket, etc.

Curious if SpamAssassin can do that now.

I was also curious about the "freemail antiforge" feature mentioned in the article, but couldn't find much about it.

>I let Google filter my spam these days,

I do this but I have to check my spam filter every day and scan titles because it regularly grabs things I'm subscribed to, and have marked 'not spam' (in some cases dozens of times), and filters them as spam. I've also found some friends randomly ending up in spam, in one instance a thread that we'd already had a dozen or so exchanges on.

It mostly seems to throw stuff entities send via mailchimp in there but it will throw other stuff in there from time to time too. No matter what I do, it ALWAYS puts valid email from vanguard in there.

Interesting. I assume you've tried a filter with the "never send to spam" option? Like this: https://imgur.com/a/q4gxipf

Never looked for that, I'm on an awful lot of email lists for businesses and podcasts and it randomly gobbles them up into spam. Probably easier to just check my spam every day than to try and find them all and manually add all of those from addresses.

Will work for vanguard though!

Thanks for mentioning POPFile. That's my baby.

Ahh. Thank you. Also a happy Cloudflare user.

Spam detection on my personal email server took a huge leap forward when I got Spamassassin correctly configured to use the DNS blocklists (URIBL, etc.). I had to set up a local DNS server because the blocklists weren't responding to requests coming via my hosting provider's DNS, and the scores had to be tweaked over a few weeks, but now it's working great. The content rules like "hi my dear", "request for money", and "all caps" are still in there, but the blocklists do the heavy lifting.
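As background on what those blocklist lookups are, here is a sketch of how the DNS query names are typically formed. The zone `dnsbl.example.org` is a placeholder, not a real list, and the helper names are made up; SpamAssassin does all of this internally.

```python
import ipaddress


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the name to look up for an IPv4 address in an IP-based blocklist.

    The octets are reversed and the list's zone is appended, so
    192.0.2.99 in zone dnsbl.example.org becomes 99.2.0.192.dnsbl.example.org.
    """
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def uri_query_name(domain: str, zone: str) -> str:
    """Build the name to look up for a domain found in a message body (URI lists)."""
    return f"{domain.lower().rstrip('.')}.{zone}"
```

An A record in 127.0.0.0/8 for that name conventionally means "listed", and NXDOMAIN means "not listed". The lookup goes through whatever resolver the server uses, which is where the local DNS server mentioned above comes in.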

It can be exciting when your filter blocks a ton of spam, but then you get a few false positives, and of course they happened to be very important mails ... I think the spam filter needs to be plugged into the MTA, so that when a message is flagged as spam, the MTA can tell the sender that hey, this message got flagged as spam.

The problem with that is that the sender can then iterate on the message until they don't get a "flagged as spam" notification. This applies both to the real sender and to the spammer.

Just don't block anything that comes from an email address you have mailed before. It's highly unlikely an important email will come from someone I have not mailed.

I am playing with different gmail api fixes to do things like this - with 12,000 unread mails in my inbox I need some automated help.

And if the lottery commission wants to tell me I have won, they can get my number easily.

If you never sign up for services on the web and are not part of mailing lists.

It has mostly been invoices. However, it's as if they want the invoice to get lost so they can send a reminder with $50 added to the $5 bill.

Otoh if the sender could reliably know if the message was received/read, then we could get rid of the Victorian age idea of mailing bits of dead trees around for official notices.

Spammers use different (hacked) servers, so by the time someone has acted on the abuse letter and blocked outgoing SMTP, the spammer has switched servers 10 times. As for the spammer iterating until the message gets through: it probably wouldn't be worth the time, or the spammer doesn't have the technical ability to, or he/she would be able to get a real job that pays 1000x more than what those spams generate. Getting a spam notification would also push lazy major e-mail providers to use existing standards such as SPF, because their customers would complain.

Spammers do tend to use botnets these days but ISPs would love it if you sent reports to their abuse teams. Each ISP can help clean up bots on their network and while it is basically an endless game of whac-a-mole it does have an impact on the spam volumes leaving respectable networks. If you can automate the process of sending off a notice to the abuse team in addition to any other filter/drops it would be a win/win for everyone.

ARF is great, but as long as the message includes the full message headers, it should be possible for the ISP to track down the infected user responsible.

I'm not sure what this response has to do with me telling you about an existing standard for what you propose?

I interpret the article you linked as saying that an abuse mail would be sent, quote: "from users to some anti-spam center". What I propose is that the mail transfer agent (MTA) on the receiving end tells the sending MTA whether or not an e-mail was (automatically) flagged as spam.

    Sending MTA: HELO I would like to send an e-mail
    Receiving MTA: Great, send it to me
    Sending MTA: Sending ...
    Receiving MTA: Analyzing mail, please wait ...
    Receiving MTA: Email accepted, but it was marked as spam
    Sending MTA: OK, goodbye

It would then be up to the sending MTA to notify the user, like it does now when an e-mail is undelivered.

The fundamental problem is that SMTP mandates accepting multiple recipients per transaction, but only a single status code when completing the transaction. Transaction level rejection is incompatible with per-user rules that filter the message body.

Ideally what's needed is a mandatory, pervasively supported SMTP extension that permits per-recipient status codes, such as LMTP provides. More feasible is permitting MTAs to accept only a single recipient per transaction, but even that is difficult given the huge installed base of MTAs and libraries.
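The mismatch can be sketched in a few lines. The function names are hypothetical and the reply codes are simply the usual "OK" (250) and a permanent rejection (550); this models the bookkeeping only, not a real SMTP or LMTP server.

```python
from typing import Dict, Optional


def end_of_data_reply(verdicts: Dict[str, bool]) -> Optional[int]:
    """Pick the single SMTP reply for a message after DATA.

    verdicts maps each recipient to True if that recipient's own filter
    flags the message as spam. Returns None when recipients disagree,
    because one status code cannot express a mixed outcome.
    """
    outcomes = set(verdicts.values())
    if outcomes == {False}:
        return 250
    if outcomes == {True}:
        return 550
    return None


def lmtp_style_replies(verdicts: Dict[str, bool]) -> Dict[str, int]:
    """With one reply per recipient (as LMTP gives), every outcome is expressible."""
    return {rcpt: (550 if spam else 250) for rcpt, spam in verdicts.items()}
```

With `{"a@example.org": True, "b@example.org": False}`, the single-reply function has no honest answer, while the per-recipient version returns a code for each address.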

There's an "e.g." in front of the quote, and the example shown is for reporting to the source domain, which can then decide how to handle it (e.g. inform the user if it was a false positive, or take measures against the user if it was actual spam).

SA is good for small orgs. For my own server, I just use regex rules in postfix called S25R [1]. It has been sufficient for me and keeps the run queue very low.

I implemented S25R regex rules in a medium sized company that had a very bad spam problem due to old outlook setups replying to all the spammers with OOO replies. They were using SA and tried to keep up on rule tuning, but it was a losing battle. The OS run queue on the 6 inbound servers averaged 6 and would sometimes peak at 14+. I switched to S25R regex rules and the run queue dropped to 0.2. I did reject some "valid" emails from folks that left the company and were running their own business from their home cable modems, but eventually whitelisted some of them. The employees were very happy with the change. They went from receiving 2k+ spam msgs per day (each) to less than a dozen.

[1] - http://www.gabacho-net.jp/en/anti-spam/anti-spam-system.html
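The general idea, judging a connecting client by what its reverse DNS name looks like, can be sketched as below. These two patterns are simplified illustrations invented for this example and are NOT the published S25R rule set; see the link above for the real rules.

```python
import re
from typing import Optional

# Illustrative patterns only, not the actual S25R rules.
END_USER_PATTERNS = [
    # An IP address spelled out at the start of the name, e.g. 203-0-113-7.cable.example.net
    re.compile(r"^[^.]*\d+[-.]\d+[-.]\d+[-.]\d+"),
    # Access-line keywords followed by a digit or separator, e.g. dsl-12.example.net
    re.compile(r"^(dhcp|dsl|cable|ppp|dyn|pool)[-\d.]", re.IGNORECASE),
]


def looks_like_end_user(reverse_name: Optional[str]) -> bool:
    """Return True if the client has no reverse DNS or its name resembles an end-user line."""
    if not reverse_name:
        return True
    return any(pattern.search(reverse_name) for pattern in END_USER_PATTERNS)
```

A check like this is a couple of regex matches per connection, which fits the low run queue reported above; the cost is rejecting legitimate senders on home connections until they are whitelisted.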

One thing I really liked about SpamAssassin and how prevalent it was was that I could set hashcash on my outbound e-mails and basically eliminate the chances of my e-mail being caught as spam. This was probably a decade ago (I've since moved to gmail), but it would take 10-60 seconds per recipient to generate the hashcash, and SpamAssassin would give those messages a large positive weight.

I'm kind of surprised that it never caught on.
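For anyone who hasn't seen hashcash, a simplified sketch of minting and checking a version 1 stamp follows. It keeps the `ver:bits:date:resource:ext:rand:counter` layout and SHA-1, but the hex counter and the missing date and double-spend checks are simplifications, so treat it as an illustration rather than a compatible implementation.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Count how many zero bits the digest starts with."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def mint(resource: str, bits: int, date: str, rand: str) -> str:
    """Search for a counter that gives the stamp's SHA-1 hash `bits` leading zero bits."""
    prefix = f"1:{bits}:{date}:{resource}::{rand}:"
    for counter in count():
        stamp = f"{prefix}{counter:x}"
        if leading_zero_bits(hashlib.sha1(stamp.encode()).digest()) >= bits:
            return stamp


def verify(stamp: str, resource: str) -> bool:
    """Check a stamp with a single hash: right recipient and enough leading zero bits."""
    fields = stamp.split(":")
    if len(fields) != 7 or fields[0] != "1" or fields[3] != resource:
        return False
    if not fields[1].isdigit():
        return False
    return leading_zero_bits(hashlib.sha1(stamp.encode()).digest()) >= int(fields[1])
```

Minting needs about 2^bits hash attempts on average while verifying needs one, so each extra bit doubles the sender's cost. The stamp is bound to one recipient, which is why the work has to be repeated per recipient.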

Is hashcash still feasible today? With botnets it shouldn't be a major problem to generate a large number of hashcash stamps. In addition, the current hashcash approach uses SHA-1 which would be trivial to speed up with GPUs or FPGAs (similar to how Bitcoin mining works). At the same time it wouldn't be possible to increase the required "price" of hashcash stamps as this would lock out users who can only use CPUs to calculate the stamp.

That's a good question. I assumed that the hashcash benefit in SA had been tuned over the years. I had picked the largest hashcash at the time for the 10-60 seconds per recipient. Not sure what the benefit the GPU can bring is, 10x? 100x?

Yes, botnets or GPUs can be used to accelerate hashcash. But the point of hashcash isn't that it can't be accelerated, it is that it increases the cost of sending each message dramatically. Most spam seems to be sent because it is just so damn cheap to send a million messages. Adding a second to each recipient is a significant cost.

Of course, as we saw with car alarms and immobilizers, there are unintended consequences with things. In cars, we moved from people stealing cars in the middle of the night, to people getting confronted at gunpoint to get the keys. I wonder if increasing the cost of sending spam might just promote the highly targeted methods like catfishing or whatnot.

These days it would probably make more sense to just require a small cryptocurrency payment in place of hashcash. There are some obvious adoption and scalability issues with that but it may be workable as a prioritization mechanism.

Not directly related to SA being back, but since spam combat stories and approaches are shared here: I used to run SA with postfix, but it was quite a memory hog, occasionally leading to OOM killer raging. Half a year ago I disabled it and revised regular postfix (+ postscreen) settings, consulting a couple of helpful articles [1,2], and haven't gotten any "real" spam (i.e., not counting Bitbucket) since.

Likely the incoming spam varies for different people and their mailboxes, but tweaking those standard settings can be surprisingly efficient.

[1] http://rob0.nodns4.us/postscreen.html

[2] http://jimsun.linxnet.com/misc/postfix-anti-UCE.txt

Hmm. Maybe it's time to give SA a try again.

I had very little luck in keeping SA performing well over time. It would go for a few months doing a decent job, then just seem to fall off the tracks for a new particular reign of spam, even with graylisting on.

I finally gave up and have been using... A Barracuda box. I hate to say it (because they're stupid expensive for all but the smallest box), but it has worked incredibly well, with zero overhead. I've gone for over a year without so much as logging in to the box.

Ultimately, I think the 'cuda is just running SA with Barracuda's collective filtering added on, but it sure works for me and my dozen or so users.

Maybe Barracuda doesn't sell that many spam boxes (they have quite a few other products), but I can say that it has sure worked for me.

If I could keep SA tuned well with minimal effort, it would be worth the savings however.

>Maybe Barracuda doesn't sell that many spam boxes

Honestly, I feel we're still primarily known for our anti-spam solutions, and we're investing quite a bit into AI based detection of spam/phishing etc. Depending on what you need, you may want to look at our cloud-based solutions over a physical box, in case the pricing is better.

(I work for Cuda in case that wasn't clear. Work on one of the "quite a few other products" :-D)

A big shout out to Julian Field et al. at MailScanner - I don't manage any mail servers these days, but going back around 15 years, it was at the heart of several setups.


And, yes, it uses Spamassassin.

Between SpamAssassin, amavis, and postscreen (mainly postscreen) I rarely see any spam anymore.

But I do see that development has been kind of slow, and not just with SA - most everything email related. It truly frightens me just how many people gave up and handed all of their email over to Google and Gmail, who continue to prefer their way of doing things.

I'm looking at rebuilding a mail server on a new VPS pretty soon, and I'll happily set up spamassassin once again. It's great to see that it's still going strong.

Extremely glad to see a large project written in Perl pick up steam again.

I find it frightening that a big Perl project is still being maintained. The roster of folks who can parse that code is getting smaller and smaller with time. It would be nice to see someone spark up a neural net based spam filtering project that leverages a more modern language like Rust or Go.

Running rspamd on some servers which filters fine and the dmarc reports are an awesome feature. A comparison would be interesting in the future.

> Just like Gmail, SpamAssassin isn't the perfect filter for everybody right out of the box; it's really a framework that can be used to create that filter.

I find gmail to be about 99% accurate “out of the box”. With today’s technology and processing power I don’t see why that can’t also be true for other spam offerings.

When I was involved with the mechanics of an ESP delivering bulk mail for clients, I found SpamAssassin to be very useful for scanning e-mails even before they get sent in order to detect possible abuse.

Spamassassin is and has been perfectly adequate as long as you train your filter regularly with new spam.

And the more users you filter for, the more spam you should be feeding your filter, and regularly.

When the person says their gmail account is full of spam I take that with a grain of salt because I've been using gmail since it started, when it was like 5 invites only, and I use it as my throwaway account everywhere. I know for a fact that gmail does an excellent job at filtering spam.

Gmail is terrible.

False positives are a nightmare and if you have to check the spam folder then the spam filter does more harm than good.

That is also my experience with SA.

Additionally, I found Gmail will sometimes deny legitimate E-mail as "spam" as it comes in, before it even reaches your Spam folder. I tested this for a week or so by forwarding my personal E-mail from my mail server to my address at Gmail and checking my server logs. This was one of the things that finally pushed me over to self-hosting. Mail does not get lost with Spamassassin if it's set up to deliver spam into a separate folder. As long as the mail passes the trivial SPF, DKIM, etc. checks, it will at least get to my junk folder.

I'm picturing a new mascot. Blocky in the way Spongebob is, but ninja themed and made out of canned meat product.

Wouldn't it be awesome if you could filter your facebook/twitter/whatever feed through SpamAssassin too?

I miss blue frog.
