Spam countermeasures have led to some fascinating new technologies. In addition to the one cited in the article, there was Hashcash, which was developed a few years earlier: http://www.hashcash.org/papers/hashcash.pdf
Hashcash requires the sender to expend a small quantity of computational work, and attach a proof of this to the email before a recipient even opens it. The underlying assumption is that the burden would be insignificant for a real email sender, but onerous for a spammer.
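A minimal sketch of the mechanism in Python (the stamp layout follows the hashcash v1 format `ver:bits:date:resource:ext:rand:counter`, but the date and rand fields below are just placeholders):

```python
import hashlib
from itertools import count

def mint(resource: str, bits: int = 20) -> str:
    """Brute-force a stamp whose SHA-1 digest starts with `bits` zero bits.
    Expected cost for the sender: about 2**bits hash operations."""
    for counter in count():
        # Placeholder date and rand fields, for illustration only.
        stamp = f"1:{bits}:060408:{resource}::deadbeef:{counter}"
        digest = int.from_bytes(hashlib.sha1(stamp.encode()).digest(), "big")
        if digest >> (160 - bits) == 0:  # top `bits` of 160 bits are zero
            return stamp

def verify(stamp: str) -> bool:
    """The recipient's cost is a single hash, regardless of `bits`."""
    bits = int(stamp.split(":")[1])
    digest = int.from_bytes(hashlib.sha1(stamp.encode()).digest(), "big")
    return digest >> (160 - bits) == 0
```

The asymmetry is the whole trick: minting a 20-bit stamp takes on the order of a million hashes, while verifying it takes exactly one.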
The approach inspired the powerful anti-spam system at the center of Bitcoin.
It might be interesting to catalog all of the most innovative early approaches to combating spam, and the unrelated technologies that later arose from them.
I recently came across an idea to prevent friend-request spam on social media using the Lightning network. The idea is that the person sending you the friend request must attach a small amount of money (in the form of a special Lightning payment). If the friend request is legitimate, you accept it (and don't claim the money). But if it's spam, you reject the friend request and take the money instead. This differs from Hashcash in that it imposes a real monetary cost on the attacker: instead of compute being burned on a proof-of-work puzzle, the money actually ends up in your pocket.
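Something like this could be built on Lightning's hold invoices, where a payment stays locked in-flight until the recipient either settles or cancels it. A sketch of the decision flow, with everything below being a hypothetical stand-in for a real node API, not an actual interface:

```python
# All node methods and names here are hypothetical -- a sketch of the
# hold-invoice flow, not a real Lightning client API.
MIN_DEPOSIT_SATS = 100  # arbitrary threshold, for illustration

def handle_friend_request(request, node, user_accepts: bool) -> str:
    """The sender's deposit is locked but unclaimed until we decide."""
    payment = node.lookup_held_payment(request.payment_hash)
    if payment is None or payment.amount_sats < MIN_DEPOSIT_SATS:
        return "ignored"  # no deposit attached, not worth attention
    if user_accepts:
        node.cancel(request.payment_hash)  # legitimate: refund the sender
        return "accepted"
    node.settle(request.payment_hash)      # spam: pocket the deposit
    return "rejected"
```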
Except that the cost of phone calls, text messages, and even postal mail is already low enough that spam arrives across all three media in ever larger quantities than before.
I don't think this would work in real life. When Bill Gates was proposing something similar, one of the engineers on the anti-spam team built an FPGA that would let a spammer solve such puzzles quickly enough to keep spamming economically. His observation to me was that even though such hardware would work, botnets had already eliminated the need for specialized hardware.
How does that work for free email newsletters? They're typically already paying an email service provider. Having to pay even more could be onerous for large, double opt-in lists, even though they're not spamming.
I'm guessing that 99% of newsletters are sending vastly fewer messages than the people who send me half a dozen "Welcome to Adulte Sex " messages a day.
Maybe, but spammers are also more likely to be sending from rented botnets that can handle the increase in CPU usage spread over a large number of other people's computers.
The article has the merit of significantly reducing the size of the original essay, while IMO still retaining two strong messages:
1. The original (is it?) method used to prevent spam, and
2. The 'seed' factor, which is expected to make spammers work harder. At mid-page I was thinking "meh, spammers will just have to improve their writing then", but this may not be sufficient thanks to the user-specific seed.
[edit: I didn't realize the original article was from 2002. I agree the article is a bit obsolete at that point.]
Modern-day spam is typically generative, and modelling the distribution of "natural e-mail messages" is sadly too naive today. Human beings also understand text through vision, not through bits -- so 1oca1host is just me corrupting the word localhost, but making that inference requires a visual understanding of words. That also gave rise to what is probably a more common spam variant today: the text-embedded-as-an-image type. I've long been of the impression that the only proper way to do text analysis is by vision, a more end-to-end solution as it were.
> 1oca1host is just me corrupting the word localhost, but making that inference requires a visual understanding of words
No it doesn't, and pg explains why in his essay. (I don't know if the article states this too; since I'd already read the essay, I didn't bother to read a summarizing article about it. The essay is really excellent, though.)
Quote from the essay:
> I'm more hopeful about Bayesian filters, because they evolve with the spam. So as spammers start using "c0ck" instead of "cock" to evade simple-minded spam filters based on individual words, Bayesian filters automatically notice. Indeed, "c0ck" is far more damning evidence than "cock", and Bayesian filters know precisely how much more.
> [...]
> To beat Bayesian filters, it would not be enough for spammers to make their emails unique or to stop using individual naughty words. They'd have to make their mails indistinguishable from your ordinary mail. And this I think would severely constrain them. Spam is mostly sales pitches, so unless your regular mail is all sales pitches, spams will inevitably have a different character. And the spammers would also, of course, have to change (and keep changing) their whole infrastructure, because otherwise the headers would look as bad to the Bayesian filters as ever, no matter what they did to the message body. I don't know enough about the infrastructure that spammers use to know how hard it would be to make the headers look innocent, but my guess is that it would be even harder than making the message look innocent.
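To make the mechanics concrete, here is a minimal sketch of the combining step from the essay, assuming you have already built a per-token probability table from your own spam and ham corpora. The 0.4 prior for never-seen tokens and the choice of the 15 "most interesting" tokens both come from the essay:

```python
from math import prod

def spam_probability(tokens, p_spam: dict, n: int = 15) -> float:
    """Combine the n token probabilities furthest from neutral (0.5)
    with the naive-Bayes formula abc / (abc + (1-a)(1-b)(1-c)).
    Table values are assumed clipped away from 0 and 1 (the essay
    uses [0.01, 0.99]), so the denominator can't vanish."""
    interesting = sorted(
        (p_spam.get(t, 0.4) for t in set(tokens)),  # 0.4: unseen-token prior
        key=lambda p: abs(p - 0.5),
        reverse=True,
    )[:n]
    s = prod(interesting)
    h = prod(1 - p for p in interesting)
    return s / (s + h)
```

A token like "c0ck" simply becomes its own entry in the table, which is why the evasion backfires: it only ever appears in spam, so its probability saturates near the upper clip.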
He directly inspired SpamBayes, a Python-based plugin for Outlook. That thing was like pure black magic back in the day. Worked amazingly well.
Also, Yahoo! Mail implemented at least partly Bayesian filtering not long after the original PG essay.
I'm guessing Gmail is using some kind of statistical filter not only for spam but for general categorization, though as others have pointed out it's pretty aggressive in filtering. I also suspect Gmail uses collaborative data, aggregating across user behavior. Many users deal with signups they're tired of by marking them as spam, and sharing that behavior across the userbase would cause things you willingly signed up for to be filtered out.
Meanwhile, the spam filter at Gmail (business account) regularly marks messages that I send from our own systems (say, disk-space reports) to our own accounts (and nowhere else) as spam. Also sales-lead forms. So they come in, we mark them 'not spam', and the filter is not smart enough to understand that the next, very similar email is also not spam. A human could quite easily see from the format and layout alone (in addition to the numerous 'not spam' markings) that it was not spam.
The technique still works, and works exceptionally well.
I only use a bayes filter (I use bogofilter) and out of some 15k spam e-mails and 150k ham e-mails I received in the last 9 months or so, I've seen like ~10 false negatives and around the same number of false positives. That's way better than gmail can ever dream of, where I always had issues with automated messages ending up in spam. With bogofilter, the training is always explicit, and otherwise the model doesn't change, so I can be sure that when my automated messages pass, they will always pass, and don't randomly stop passing "just because".
I started with a big archive (around 100k msgs each) of SPAM and HAM for the initial training. I learned from the early days of using e-mail to archive my spam, for the eventual training, since my first Linux job involved setting up a DSPAM installation.
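For anyone curious what that explicit workflow looks like, the train/classify loop is easy to drive from a script. A sketch in Python, assuming bogofilter is on the PATH and using the exit codes its man page documents (0 = spam, 1 = ham, 2 = unsure):

```python
import subprocess

def train(raw_msg: bytes, is_spam: bool) -> None:
    """-s registers a message as spam, -n as ham; nothing else ever
    touches the wordlist, so the model only moves when you tell it to."""
    subprocess.run(["bogofilter", "-s" if is_spam else "-n"],
                   input=raw_msg, check=True)

def classify(raw_msg: bytes) -> str:
    """With no flags, bogofilter reads one message on stdin and
    reports its verdict via the exit status."""
    result = subprocess.run(["bogofilter"], input=raw_msg)
    return {0: "spam", 1: "ham", 2: "unsure"}.get(result.returncode, "error")
```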
I use the word "unsubscribe" to filter out a lot of emails. There are occasional false positives but after combining the above filter with a whitelist of senders it has proven to be very effective.
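The whole rule fits in a few lines; a sketch using Python's standard email API (the whitelist addresses are made up, and the exact-match check on From is deliberately naive):

```python
from email.message import Message

WHITELIST = {"friend@example.com", "newsletter@example.org"}  # hypothetical

def keep(msg: Message) -> bool:
    """Whitelisted senders always pass; everything else is dropped
    if the body contains the bulk-mail tell 'unsubscribe'."""
    if msg.get("From", "") in WHITELIST:
        return True
    body = msg.get_payload(decode=True) or b""  # None for multipart parts
    return b"unsubscribe" not in body.lower()
```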
I prefer to filter on the word "hurry", among other things. That seems to catch all the mail lists I don't want to hear from while keeping those I do available. I don't hurry.
I only filter them into a pre-spam folder and batch spamflag later, that way I don't miss anything hit by a false positive (e.g. a Signal v. Noise digest had the phrase "Why the hurry?" at one point)
Using this method I don't think I had a single black friday email hit my inbox last year :)
Oh and of course "Greeting(s?) of the day" has never been a legitimate email. You can safely auto-spam that phrase.
The whole point of these Bayesian filters is that you don't have to spend mental cycles thinking of clever words to filter on. Just mark 'em spam and not spam, the algorithms are plenty good at finding which words are spammy or legitimate without human help.
I believe this raises the issue of short-term solutions (such as filtering on 'click'), for which spammers will eventually find workarounds, versus long-term solutions. But do we really need long-term solutions if updating the filtering terms every year is enough to keep pace? Also, does a long-term solution against spamming even exist? Since the Graham solution from 2002 doesn't seem to have been implemented yet, is the answer no?
Another issue is the carbon/energy cost of spamming. I'm no expert, but given the probably high figures, should this be escalated to a higher level, e.g. an international agreement with heavy penalties for anyone caught running a large spamming farm? Compared to genuinely difficult issues like drug trafficking, I don't see how there could not be a consensus against spam.