For a while, it was configured to discard spam. Later, I configured it to put the spam in a special IMAP folder. I have the same system today, but on top of that, I have a background shell script that watches my mail folders and runs sa-learn on them. Anything that appears in the spam folder is learned as spam. Anything that appears elsewhere is learned as ham. Most of the time, this will be email that has already been classified correctly, but if I move an email to a different folder, the script relearns it with its new classification.
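A rough sketch of that kind of training pass, in Python rather than shell. It assumes a Maildir++ layout (inbox at the root, subfolders as dot-directories, messages under `cur/`) and a spam folder named `.Spam`; both are assumptions, not details from the comment. The only external pieces used are `sa-learn --spam` and `sa-learn --ham`. It does a single pass, so a real watcher would run it from cron or a loop.

```python
import subprocess
from pathlib import Path

SPAM_FOLDER = ".Spam"  # assumed folder name; adjust to your IMAP layout


def learn_flag(folder_name):
    """Spam folder is learned as spam; every other folder is learned as ham."""
    return "--spam" if folder_name == SPAM_FOLDER else "--ham"


def learn_all(maildir, dry_run=True):
    """Build (and optionally run) one sa-learn command per mail folder."""
    maildir = Path(maildir)
    subfolders = sorted(
        p for p in maildir.iterdir() if p.is_dir() and p.name.startswith(".")
    )
    commands = []
    for folder in [maildir] + subfolders:
        cur = folder / "cur"
        if not cur.is_dir():
            continue  # not a mail folder
        cmd = ["sa-learn", learn_flag(folder.name), str(cur)]
        commands.append(cmd)
        if not dry_run:
            subprocess.run(cmd, check=True)
    return commands
```

With `dry_run=True` the function only returns the commands it would run, which is a cheap way to check the folder mapping before training anything.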
Some time ago, it began leaking spam into the inbox. It turned out that I needed to make the Bayes scores more aggressive, and after I did that, it's been more or less perfect. More or less zero false positives and maybe a couple of false negatives per week.
I find that the good old Bayes classifier is still the most powerful tool in my spam filtering toolbox. You just have to be persistent and consistent in how you train it and tune it. For example, you shouldn't train it to classify legitimate newsletters as spam even if they are undesirable. Instead, you should unsubscribe and put those in the trash folder. I find that this considerably reduces the false positives.
One tip: look at your mail server logs for the top 100 bad addresses. For me they were a combination of incorrect first names (doug@, joe@) and incorrect roles (sales@, support@). Deliver all of that mail directly to a spam trainer. That gives me good odds the Bayesian filter will have repeatedly seen novel spam before they try my account.
I also subaddress any address I give to a vendor, so when those leak to spammers, I route them to sa-learn as well. It's a real pleasure to use the key spammer advantage -- indiscriminate volume -- against them.
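The routing decision described in the last two comments could look something like the sketch below. The trap names come from the comment above; the leaked vendor tag `acmeshop` and the `+` separator are made up for illustration (the separator is configurable on most servers).

```python
# Local parts that never existed on this domain (examples from the comment).
TRAP_LOCAL_PARTS = {"doug", "joe", "sales", "support"}

# Vendor tags known to have leaked to spammers (hypothetical example).
LEAKED_TAGS = {"acmeshop"}


def split_subaddress(address):
    """Split 'user+tag@domain' into (user, tag or None, domain)."""
    local, _, domain = address.rpartition("@")
    base, sep, tag = local.partition("+")
    return base.lower(), (tag.lower() if sep else None), domain.lower()


def is_spam_trap(address):
    """True if mail to this recipient should go straight to the spam trainer."""
    base, tag, _ = split_subaddress(address)
    if base in TRAP_LOCAL_PARTS:
        return True
    return tag is not None and tag in LEAKED_TAGS
```

Mail for which `is_spam_trap` returns true would be handed to the trainer instead of a mailbox; everything else is delivered normally.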
All other email from them and countless others is useless.
Almost certainly false (I couldn't even think of a more technical word than "own"), but I just dunno why people take that risk. I get that the threat profile of Joe Schmoe isn't the same as Car Company, Inc., but still...
Unsurprisingly, Gmail also exhibits this problem. I have to check my spam folder regularly because it gets a ton of false positives these days, nearly always newsletters and such that I've signed up for. Another category of common false positive, which is admittedly harder to cope with, is moderation notifications from my WordPress sites. The notification is often about an actual spam post, but Gmail seems to learn the shape of those messages and starts to treat them all as spam, even the legitimate ones.
Spam filtering has always been, and continues to be, hard. It seemed, for a brief shining moment a few years ago, that spam was on the run...but, it's come roaring back in recent years, I guess as cloud-based mail services made it feasible for spammers to bounce from one provider to another, to keep sending through "known trustworthy" senders. There probably ought to be some way to punish those providers that look the other way when used for spam, but blocking them completely is likely to impact non-spammers (but might be a necessary pain to reduce spam).
CPU usage was noticeable for it, too. I started looking for compiled alternatives, and rspamd fit that need. It's faster than SA, uses fewer resources, and seems to catch more spam than SA's filtering did.
So far, no users have complained about false positives being discarded, and moving mails to the spam folder usually results in similar mails being blocked in the future.
The Bayes filter is very, very effective on its own, but as you mentioned it's not perfect. I've found that using Razor2 as a complement is very effective.
I tend to find that when I manually categorize spam/ham and sa-learn, spam goes to 0 for a couple of weeks, then creeps upward as trends change (?!) or as they learn (?!?!).
Checking for false positives is a manual process. I've never had legitimate mail marked as spam though, only the other way round.
My experience with Gmail's filter is not as nice btw. At work, I had a couple of situations where important mails from a customer were flagged as Spam, which went unnoticed for two weeks.
Unlike rspamd, which has pluggable modules for everything under the sun (RBLs, word filters, Bayesian filtering + learning), spamd uses plain ol' graylisting with some PF integration to throttle spammers' connections to 1 character/second for maximum annoyance.
With Rspamd, I never got any spam in my Inbox. With spamd, I get maybe 1 spam mail every two weeks. To me, spamd's ridiculous simplicity is worth the tradeoff.
You do have to be careful with graylisting large mailers like Gmail, since they rarely retry the mail from the same IP address. For this, OpenBSD's smtpctl now has the spfwalk command to whitelist the big guys. That's what I use in my current setup, which was linked here a few days ago.
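For anyone unfamiliar with graylisting, here is a minimal sketch of the idea, not spamd's actual implementation. The first delivery attempt for a (client IP, sender, recipient) triplet gets a temporary failure, and a retry after a minimum delay is accepted. The 300-second delay and the class and method names are arbitrary choices for the example; the whitelist stands in for the kind of network list a tool like spfwalk produces.

```python
import ipaddress


class Greylist:
    """Toy greylist keyed on (client IP, envelope sender, envelope recipient)."""

    def __init__(self, min_delay=300.0, whitelist=()):
        self.min_delay = min_delay  # seconds a sender must wait before retrying
        self.whitelist = [ipaddress.ip_network(n) for n in whitelist]
        self.first_seen = {}        # triplet -> time of first attempt

    def check(self, client_ip, sender, recipient, now):
        """Return 'accept' or 'tempfail' for one delivery attempt."""
        addr = ipaddress.ip_address(client_ip)
        if any(addr in net for net in self.whitelist):
            return "accept"  # known big mailer, skip greylisting
        key = (client_ip, sender.lower(), recipient.lower())
        first = self.first_seen.setdefault(key, now)
        if now - first >= self.min_delay:
            return "accept"
        return "tempfail"
```

Because the key includes the client IP, a retry from a different address starts the clock again, which is exactly the problem with large mailers that rotate outbound IPs.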
I recently set up a small postfix+dovecot system, and postscreen with DNS blacklists alone seems quite effective. But I do plan to add spamassassin or rspamd or spamd at some point.
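As background on what a DNS blacklist check does under the hood: the client's IPv4 octets are reversed and looked up as a hostname under the list's zone, and an answer means "listed". The sketch below shows that lookup; `dnsbl.example.org` is a placeholder zone, not a real list, and this is not how postscreen itself is implemented or configured.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNS name to query for an IPv4 address in a DNSBL zone."""
    addr = ipaddress.ip_address(ip)
    if addr.version != 4:
        raise ValueError("this sketch only handles IPv4")
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip, zone):
    """True if the DNSBL returns an A record for the address.

    Any lookup failure is treated as 'not listed', which also hides
    resolver errors; a production check would tell those apart.
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False
```

For example, `dnsbl_query_name("192.0.2.99", "dnsbl.example.org")` gives `99.2.0.192.dnsbl.example.org`.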
Yes please! I archive all my mail, both desired and spam mail, with the intention of using this data to train a neural network that will be able to classify mail as spam, desired newsletter or desired personal mail.
I actually have the web server send the form contents via email because my email server runs SpamAssassin and it has done a great job catching form spam.
So I could train it to move mail into a "High Priority" bucket, or "Not Spam, But Promotional" bucket, etc.
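To make the multi-bucket idea concrete, here is a toy multinomial naive Bayes classifier with add-one smoothing that sorts mail into any number of named buckets. It is not how SpamAssassin's Bayes module works (that one is spam/ham only); the class name, bucket names, and whitespace tokenizer are all invented for the example.

```python
import math
from collections import Counter, defaultdict


class BucketClassifier:
    """Toy naive Bayes over whitespace-separated tokens."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # bucket -> token -> count
        self.doc_counts = Counter()               # bucket -> messages seen
        self.vocabulary = set()

    def train(self, bucket, text):
        tokens = text.lower().split()
        self.token_counts[bucket].update(tokens)
        self.doc_counts[bucket] += 1
        self.vocabulary.update(tokens)

    def classify(self, text):
        """Return the bucket with the highest log-probability, or None if untrained."""
        tokens = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_bucket, best_score = None, -math.inf
        for bucket, docs in self.doc_counts.items():
            counts = self.token_counts[bucket]
            denom = sum(counts.values()) + len(self.vocabulary)
            score = math.log(docs / total_docs)  # prior
            for token in tokens:
                score += math.log((counts[token] + 1) / denom)  # add-one smoothing
            if score > best_score:
                best_bucket, best_score = bucket, score
        return best_bucket
```

Training it means calling `train("promotional", text)` and so on each time a message is filed by hand, much like the folder-based sa-learn setups described elsewhere in the thread.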
Curious if SpamAssassin can do that now.
I was also curious about the "freemail antiforge" feature mentioned in the article, but couldn't find much about it.
I do this but I have to check my spam filter every day and scan titles because it regularly grabs things I'm subscribed to, and have marked 'not spam' (in some cases dozens of times), and filters them as spam. I've also found some friends randomly ending up in spam, in one instance a thread that we'd already had a dozen or so exchanges on.
It mostly seems to throw stuff entities send via mailchimp in there but it will throw other stuff in there from time to time too. No matter what I do, it ALWAYS puts valid email from vanguard in there.
Will work for vanguard though!
I am playing with different gmail api fixes to do things like this - with 12,000 unread mails in my inbox I need some automated help.
And if the lottery commission wants to tell me I have won, they can get my number easily.
ARF is great, but as long as the message includes the full message headers it should be possible for the ISP to track down the infected user responsible.
Sending MTA: HELO I would like to send an e-mail
Receiving MTA: Great, send it to me
Sending MTA: Sending ...
Receiving MTA: Analyzing mail, please wait ...
Receiving MTA: Email accepted, but it was marked as spam
Sending MTA: OK, goodbye
Ideally what's needed is a mandatory, pervasively supported SMTP extension that permits per-recipient status codes, such as LMTP provides. More feasible is permitting MTAs to accept only a single recipient per transaction, but even that is difficult given the huge installed base of MTAs and libraries.
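The difference between the two protocols can be shown with a small sketch. After the message body is sent, plain SMTP returns one reply for the whole transaction, while LMTP returns one reply per accepted recipient. The function names, verdict strings, and reply texts below are made up for illustration, and the "accept if anyone wants it" policy for mixed verdicts is just one possible choice.

```python
def smtp_reply_after_data(verdicts):
    """Plain SMTP: a single reply covers every recipient in the transaction.

    With mixed verdicts the server has to pick one answer; this sketch
    accepts the message and leaves per-user spam handling to delivery.
    """
    if all(v == "reject" for v in verdicts.values()):
        return "550 message rejected"
    return "250 OK"


def lmtp_replies_after_data(verdicts):
    """LMTP: one reply per recipient, so each mailbox can refuse on its own."""
    return {
        rcpt: ("250 OK" if verdict == "accept" else "550 message rejected")
        for rcpt, verdict in verdicts.items()
    }
```

With one recipient who wants the message and one whose filter rejects it, the SMTP function can only say `250 OK`, so the sender never learns about the rejection; the LMTP function reports both outcomes.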
I implemented S25R regex rules in a medium-sized company that had a very bad spam problem due to old Outlook setups replying to all the spammers with OOO replies. They were using SA and tried to keep up on rule tuning, but it was a losing battle. The OS run queue on the 6 inbound servers averaged 6 and would sometimes peak at 14+. I switched to S25R regex rules and the run queue dropped to 0.2. I did reject some "valid" emails from folks that left the company and were running their own business from their home cable modems, but eventually whitelisted some of them. The employees were very happy with the change. They went from receiving 2k+ spam msgs per day (each) to less than a dozen.
 - http://www.gabacho-net.jp/en/anti-spam/anti-spam-system.html
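The general idea behind this kind of rule set is to match the connecting client's reverse DNS name against patterns that are typical of end-user lines (cable, DSL, dial-up) rather than mail servers. The sketch below illustrates that idea only: the three patterns are simplified stand-ins written for this example and are not the published S25R rules, and treating a missing reverse DNS name as suspicious is likewise an assumption of the sketch.

```python
import re

# Simplified, illustrative patterns; NOT the actual S25R rule set.
GENERIC_PTR_PATTERNS = [
    # Several numeric groups in a row, e.g. 203-0-113-7.cable.example.net
    re.compile(r"^[^.]*\d+[-.]\d+[-.]\d+"),
    # Five or more consecutive digits in the first label
    re.compile(r"^[^.]*\d{5}"),
    # Access-line keyword followed by a number, e.g. dhcp-42.example.net
    re.compile(r"^(dhcp|dialup|ppp|dsl|cable)[^.]*\d", re.IGNORECASE),
]


def looks_like_end_user_host(ptr_name):
    """True if the reverse DNS name resembles a dynamic or end-user line."""
    if ptr_name is None:
        return True  # no reverse DNS at all (assumption: treat as suspicious)
    return any(p.search(ptr_name) for p in GENERIC_PTR_PATTERNS)
```

Names like `mx1.example.com` pass, while `203-0-113-7.cable.example.net` is flagged. As the comment notes, legitimate senders on home connections get caught too, so a whitelist is part of the deal.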
I'm kind of surprised that it never caught on.
Yes, botnets or GPUs can be used to accelerate hashcash. But the point of hashcash isn't that it can't be accelerated, it is that it increases the cost of sending each message dramatically. Most spam seems to be sent because it is just so damn cheap to send a million messages. Adding a second to each recipient is a significant cost.
Of course, as we saw with car alarms and immobilizers, there are unintended consequences with things. In cars, we moved from people stealing cars in the middle of the night, to people getting confronted at gunpoint to get the keys. I wonder if increasing the cost of sending spam might just promote the highly targeted methods like catfishing or whatnot.
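The cost argument is easier to see with a toy proof-of-work. The sender searches for a counter that makes the SHA-1 hash of the stamp start with a given number of zero bits, and the receiver verifies with a single hash. The stamp format here (`recipient:counter`) is a simplification invented for this sketch, not the real hashcash stamp format, which also carries a version, date, and random field.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest):
    """Count the zero bits at the start of a byte string."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def mint(resource, bits):
    """Search for a stamp whose SHA-1 has at least `bits` leading zero bits."""
    for counter in count():
        stamp = f"{resource}:{counter}"
        if leading_zero_bits(hashlib.sha1(stamp.encode()).digest()) >= bits:
            return stamp


def verify(stamp, resource, bits):
    """Cheap check on the receiving side: one hash, no search."""
    if not stamp.startswith(resource + ":"):
        return False
    return leading_zero_bits(hashlib.sha1(stamp.encode()).digest()) >= bits
```

Each extra bit doubles the expected minting work while verification stays a single hash. Since the stamp names the recipient, it cannot be reused for another address, so a million recipients cost a million searches.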
Likely the incoming spam varies for different people and their mailboxes, but tweaking those standard settings can be surprisingly efficient.
I had very little luck in keeping SA performing well over time. It would go for a few months doing a decent job, then just seem to fall off the tracks for a new particular reign of spam, even with graylisting on.
I finally gave up and have been using... A Barracuda box. I hate to say it (because they're stupid expensive for all but the smallest box), but it has worked incredibly well, with zero overhead. I've gone for over a year without so much as logging in to the box.
Ultimately, I think the 'cuda is just running SA with Barracuda's collective filtering added on, but it sure works for me and my dozen or so users.
Maybe Barracuda doesn't sell that many spam boxes (they have quite a few other products), but I can say that it has sure worked for me.
If I could keep SA tuned well with minimal effort, it would be worth the savings however.
Honestly, I feel we're still primarily known for our anti-spam solutions, and we're investing quite a bit into AI based detection of spam/phishing etc. Depending on what you need, you may want to look at our cloud-based solutions over a physical box, in case the pricing is better.
(I work for Cuda in case that wasn't clear. Work on one of the "quite a few other products" :-D)
And, yes, it uses SpamAssassin.
but I do see that development has been kind of slow, and not just with SA - most everything email related.
It truly frightens me just how many people gave up and handed all of their email over to Google and Gmail, who continue to prefer their way of doing things.
I'm looking at rebuilding a mail server on a new VPS pretty soon, and I'll happily set up SpamAssassin once again. It's great to see that it's still going strong.
I find gmail to be about 99% accurate “out of the box”. With today’s technology and processing power I don’t see why that can’t also be true for other spam offerings.
And the more users you filter for, the more spam you should be feeding your filter, and regularly.
When the person says their gmail account is full of spam I take that with a grain of salt because I've been using gmail since it started, when it was like 5 invites only, and I use it as my throwaway account everywhere. I know for a fact that gmail does an excellent job at filtering spam.
False positives are a nightmare and if you have to check the spam folder then the spam filter does more harm than good.
That is also my experience with SA.