A Plan for Spam (2002) (paulgraham.com)
43 points by vinnyglennon 31 days ago | 34 comments

Candidate for best blog post of all time? Quickly made a big impact on a real problem in the computing world.

I remember reading this at the time, installing one of the implementations that immediately popped up (SpamAssassin), and finally having my spam separated from my ham.

Edit: I had not considered until now the possibility that the "IKEA effect" might have made me overestimate the quality of filtering because of the effort that I put into training the classifier!

Where's the Gmail option to let me turn a label into a unique bayesian filter?

In the past year I have gotten >1,750 mails from recruiters, most of them from unique addresses. I don't want to mark them as "Spam" because this is a type of "Spam" that I want to keep so I can refer to it later. I'd also like to un-train "not-so-Spammy" messages so I can see the jobs I'd be interested in, but I'm afraid of false positives making these harder to retain.
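For what it's worth, a per-label filter with both training and un-training is not much code. Below is a rough sketch, assuming a plain two-class naive Bayes with add-one smoothing. It is not the exact scoring from the article, and `LabelFilter`, `train`, `untrain` and `probability` are made-up names, not any Gmail feature.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[A-Za-z0-9$'-]+")


def tokenize(text):
    """Lowercased word-like tokens; a deliberately crude tokenizer."""
    return [t.lower() for t in TOKEN_RE.findall(text)]


class LabelFilter:
    """Hypothetical two-class naive Bayes: 'belongs to the label' vs 'does not'."""

    def __init__(self):
        # Per class: number of messages containing each token.
        self.counts = {True: Counter(), False: Counter()}
        # Per class: number of messages trained.
        self.docs = {True: 0, False: 0}

    def train(self, text, in_label):
        self.counts[in_label].update(set(tokenize(text)))
        self.docs[in_label] += 1

    def untrain(self, text, in_label):
        # Reverses an earlier train() call made with the same message and class.
        self.counts[in_label].subtract(set(tokenize(text)))
        self.docs[in_label] -= 1

    def probability(self, text):
        """Estimated probability that the message belongs to the label."""
        log_odds = math.log((self.docs[True] + 1) / (self.docs[False] + 1))
        for tok in set(tokenize(text)):
            p_in = (self.counts[True][tok] + 1) / (self.docs[True] + 2)
            p_out = (self.counts[False][tok] + 1) / (self.docs[False] + 2)
            log_odds += math.log(p_in / p_out)
        # Clamp so math.exp cannot overflow on very long messages.
        log_odds = max(-50.0, min(50.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))
```

You would call `train(msg, True)` when applying the label, `train(msg, False)` for ordinary mail, and `untrain` with the same arguments to take a message back out. Picking a high threshold on `probability` is one way to trade missed recruiter mails for fewer false positives.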

Personally I disabled Gmail’s spam filter a long time ago[0]. (False positives, plus I have some weak morbid curiosity about what comes in as spam.) Not getting too much junk so far.

Recently I started to give a different email to each service so that I can see if any get compromised, but so far I don’t think I’ve noticed anything like that.

[0] I borrowed someone’s workaround where you create a filter that excludes emails matching a specific random UUID, which no email would match, with action “never send to spam”. Perhaps there’s now a more straightforward option.
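If anyone wants to reproduce the workaround, the random token can be generated like this. A tiny sketch; `make_never_match_token` is a made-up helper name, and the Gmail field names mentioned afterwards are from memory, so treat them as assumptions.

```python
import uuid


def make_never_match_token():
    """Return a random UUID string that no real email should contain."""
    # uuid4() is random, so a collision with actual mail content is negligible.
    return str(uuid.uuid4())


if __name__ == "__main__":
    print(make_never_match_token())
```

The printed value would go into the filter's "Doesn't have" field, with the "never send to spam" action selected, so that every incoming message matches the filter.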

There's no way to disable spam filtering in Gmail because most of the filtering happens long before user filters are checked.

If that weren't true, spam folders would contain thousands of messages.

My spam folder receives no more than a dozen messages per month, and I know that there are tens of thousands of attempts to send me spam.

I agree that there may be earlier filtering stages, though in my experience even if you send unauthenticated messages from a not-really-configured postfix on Ubuntu they would still be viewable in the recipient’s spam folder, or at least that was the case a few years back.

I’m pretty sure the recruiters’ emails peterwwillis mentioned would not be so bad as to be eliminated before reaching the spam folder.

They used to have smart labels in the gmail labs config section, but dropped it. It appeared to be a Bayesian classifier. You couldn't add your own labels, but it did work well. No idea why they dropped it.

You could use popfile, it supports moving mail to named folders, which isn't the same as just applying a label, but works. http://getpopfile.org

There is a "filter similar messages" option available (it is hidden in the same menu as the reply / forward / print / etc. options).

This really was the golden age of blogging. That unpretentious style of writing made me engage with so many ideas I never would have, reading professional writing.

This still exists, you just need to know where to look.

You can't just write a comment like that. links please. ;)

When I first read this article it felt terribly old. An article from 10 years ago? That's ancient! Now it's almost 20 years old and for some reason it feels a lot fresher today than it did even back then. (Maybe history starts to compact the older you get.)

Funny thing is that even to this day, most AI companies employ algorithms no more sophisticated than the ones in this almost 20-year-old article.

This post blew my mind at the time, it was my first exposure to ML techniques.

I remember reading pg's post, it was a classic.

Anyone here working on anti-spam?

Where are things at today?

From my perspective (~500 employee mail server), greylisting had a much larger impact at the time, thanks to the spambots/viruses attempting direct connection to mail servers. Extremely effective, zero false positives, much lighter on resources. I did use both, of course, so that I could keep a record of how effective the systems were.
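For anyone who hasn't run it: greylisting temporarily rejects the first delivery attempt from an unknown (client IP, sender, recipient) triplet and only accepts a retry after some delay, which fire-and-forget spambots never make. A minimal sketch of that decision, assuming an in-memory table and a 5-minute delay; the `Greylist` class and its parameters are illustrative, not Postgrey's actual implementation.

```python
import time


class Greylist:
    """Hypothetical in-memory greylist keyed on (client IP, sender, recipient)."""

    def __init__(self, min_delay=300, clock=time.time):
        self.min_delay = min_delay  # seconds a new triplet must wait (assumed value)
        self.clock = clock          # injectable so the logic can be tested
        self.first_seen = {}        # triplet -> time of first delivery attempt

    def check(self, client_ip, sender, recipient):
        """Return an SMTP-style reply code: 451 to defer, 250 to accept."""
        key = (client_ip, sender.lower(), recipient.lower())
        now = self.clock()
        first = self.first_seen.setdefault(key, now)
        if now - first < self.min_delay:
            return 451  # temporary failure: a real MTA will queue and retry
        return 250      # retry arrived after the delay: let it through
```

A production setup would also expire old entries and whitelist known senders, which this sketch leaves out.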

Today the situation has flipped. Most of the spam we get is coming from authoritative servers (ie: gmail, yahoo, etc), making stuff like SPF/DKIM/etc next to worthless from a spam perspective (it's still marginally useful for forgeries), while bayes (or in general, trainable) filters are essentially the only thing that can differentiate it reliably.

With a modern setup, you can basically get next to zero spam and no false positives. In fact, honest email marketing (i.e. mailing lists you've actually subscribed to) is in my experience the only thing that throws these filters off.

Thanks, one thing we also found is that spammers tend to be poor at following RFC standards: their messages are obviously broken, but in ways that Gmail etc. will have no problem accepting.

For example, we use our own https://github.com/ronomon/mime to detect and reject email which has missing multi-parts (no terminating boundary delimiter). All of this has been spam so far, and we are yet to see a false positive. I don't think SpamAssassin has a rule for this (yet)?
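To make the check concrete, here is a rough stand-alone sketch of the idea using only the Python standard library. It is not the ronomon/mime API, it only looks at the top-level multipart, and `missing_closing_boundary` is a made-up name.

```python
import email


def missing_closing_boundary(raw: bytes) -> bool:
    """True if a top-level multipart message lacks its closing delimiter line."""
    msg = email.message_from_bytes(raw)
    if msg.get_content_maintype() != "multipart":
        return False  # nothing to check for non-multipart mail
    boundary = msg.get_boundary()
    if boundary is None:
        return True  # multipart without a boundary parameter is broken too
    # The closing delimiter is "--" + boundary + "--" on a line of its own.
    closing = b"--" + boundary.encode("ascii", "replace") + b"--"
    for line in raw.splitlines():
        # Trailing whitespace after the delimiter is tolerated.
        if line.rstrip(b" \t") == closing:
            return False
    return True
```

Nested multiparts would need the same check applied per part, which a real parser handles while walking the structure.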

Another example is illegal header characters, which are almost always spam, with a handful of false positives (usually machine-generated).
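A sketch of what such a header check could look like, assuming "illegal" means anything outside printable ASCII, space and tab in the raw header block. The function name is made up, and this strict reading also flags raw UTF-8 headers, which some legitimate senders use.

```python
import re


def illegal_header_bytes(raw: bytes):
    """Return (line number, byte value) pairs for suspicious bytes in the headers."""
    # The header block ends at the first empty line; accept CRLF or bare LF.
    head = re.split(rb"\r?\n\r?\n", raw, maxsplit=1)[0]
    found = []
    for lineno, line in enumerate(head.split(b"\n"), start=1):
        line = line.rstrip(b"\r")
        for byte in line:
            # Allowed here: horizontal tab and printable ASCII including space.
            if byte == 9 or 32 <= byte <= 126:
                continue
            found.append((lineno, byte))
    return found
```

Feeding the result into a score rather than rejecting outright would be one way to live with the occasional machine-generated false positive.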

That is an interesting approach. Care to let us know how you go from https://github.com/ronomon/mime to some kind of SMTP server plugin (like for postfix for example)?

Thanks, you might find Haraka to be easiest since it's already Javascript.

Postfix may require a process callout, or you might need to write a milter.

I agree that greylisting was the cat's meow back then. I set up a VM running CentOS with Postfix and Postgrey as an MTA for our "work" email server and the result was a massive reduction in spam.

I was running my own email server round about this time, and gave it up mid-2000s, in great part because of spam. Both inbound filtering and outbound deliverability into other people's filters.

It seems that centralisation has been highly effective in spam-fighting. Google must have a huge corpus of spam and ham, and they also have the benefit of being able to spot patterns of incoming mail "live" across a very large number of accounts.

I'm not directly working on spam, but it seems like there's a heavy reliance on IPv4 address reputation these days, which is probably one reason mailservers won't use IPv6 any time soon.

Address reputation was always big, even back then.

In the last few years though, I cannot really recommend using any of the DNSBLs anymore. I've encountered more cases where legitimate servers were blocked due to netblock vicinity or indeed previous ownership than actual spam issues.

Greylisting will still catch dynamic allocations almost as effectively, while you won't reject legitimate mail due to server and/or DNSBL issues.

Have you tried using a quorum of DNSBL (e.g. barracudacentral.org, cbl.abuseat.org, truncate.gbudb.net) to reduce false positives?

In other words, if at least two DNSBL queries agree, then reject, or feed this information to the rest of the spam pipeline?
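Roughly what I have in mind, as a sketch: a DNSBL lookup reverses the IPv4 octets, appends the list's zone and resolves the name, and any answer is treated as "listed". The function names and the injectable `resolve` parameter are illustrative. Real lists return codes with list-specific meanings and their actual query zones may differ from the domains above, both of which this ignores.

```python
import ipaddress
import socket


def dnsbl_name(ip, zone):
    """Build the DNSBL query name, e.g. 192.0.2.1 + zone -> 1.2.0.192.zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip, zone, resolve=socket.gethostbyname):
    """True if the list returns any answer for this IP (simplifying assumption)."""
    try:
        resolve(dnsbl_name(ip, zone))
        return True
    except OSError:
        # Lookup failures, including "no such name", count as not listed.
        return False


def quorum_listed(ip, zones, quorum=2, resolve=socket.gethostbyname):
    """Return (decision, hits): decision is True when at least `quorum` lists agree."""
    hits = [zone for zone in zones if is_listed(ip, zone, resolve)]
    return len(hits) >= quorum, hits
```

The `hits` list could be passed on to the rest of the pipeline as a feature instead of being used for an outright reject.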

This is pretty common, and systems like SA do this for you by batching responses and calculating a score.

I found this to be pretty much worthless if you already have greylisting, even for high-quality curated lists such as spamhaus SBL/XBL.

Not "working" on anti-spam, but for my family's email server, we run rspamd and it works really well.

It sounds good but ...

...it did not work out.

17 years later and spam is still annoying. I use Thunderbird and tag every spam message as spam.

Still, some spam messages get through.

And because there are false positives sometimes, I always look through my spam folder before I empty it.

> Still, some spam messages get through.

I believe spammers actively test their messages against things like SpamAssassin to sneak them through.

>17 years later and spam is still annoying. I use Thunderbird and tag every spam message as spam.

With Gmail you hardly see any spam.

>Still, some spam messages get through.

Some? Back in the day it was in the hundreds every day if you were active online...

Gmail also dumps a lot of non-spam emails into its spam folders.

The ones that drive me nuts (and have led me to having to sign up to fastmail for my business emails, rather than sending from my own shared server) are the ones that are valid responses to emails sent from GMail in the first place. Excellent logic (or lack thereof), Google.

I don't actually believe so, but I have often wondered whether GMail has such an overly fastidious spam filter simply to encourage people into its fold...

Gmail can do that because they can see into the inboxes of many millions of people. So if a mail is flagged as spam somewhere, it can use that info as a signal to classify it as spam in my inbox.

That is not the content-based plan for spam PG describes.

And it comes at a cost. Complete loss of control over your inbox.

Switched from gmail because I was receiving too much spam. Part of the reason seems to be people using my generic name as a fake email address when they sign up for stuff, probably credit card promotions and the like. Wonder what it would be like if I used a custom domain with gmail. But not going back...

Ironically I run my own mail-server, and the majority of incoming spam I receive is routed through gmail.

I have a very different experience, I have only a couple of spams that don't end up in the spam folder. I do get promotional emails but when you unsubscribe they don't come back.

I use Office 365 and Gmail. Both do a good job.
