Paul Graham provides answer to spam emails (2002) (infoworld.com)
61 points by tosh on May 3, 2019 | hide | past | favorite | 33 comments



Spam countermeasures have led to some fascinating new technologies. In addition to the one cited in the article was Hashcash, which was developed a few years earlier:

http://www.hashcash.org/papers/hashcash.pdf

Hashcash requires the sender to expend a small quantity of computational work, and attach a proof of this to the email before a recipient even opens it. The underlying assumption is that the burden would be insignificant for a real email sender, but onerous for a spammer.
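The core of Hashcash is a partial hash preimage: grind a counter until the stamp's digest starts with enough zero bits. A minimal sketch in Python (simplified: real version-1 Hashcash stamps also carry a date, extension, and random salt fields):

```python
import hashlib
from itertools import count

def mint(resource: str, bits: int = 20) -> str:
    """Find a counter such that SHA-1(stamp) has `bits` leading zero bits.

    Minting costs ~2^bits hash evaluations on average; verifying costs one.
    """
    for counter in count():
        stamp = f"1:{bits}:{resource}:{counter}"
        digest = hashlib.sha1(stamp.encode()).digest()
        # Interpret the 160-bit digest as an integer; `bits` leading
        # zero bits means the top `bits` bits are all zero.
        if int.from_bytes(digest, "big") >> (160 - bits) == 0:
            return stamp

def verify(stamp: str, bits: int = 20) -> bool:
    digest = hashlib.sha1(stamp.encode()).digest()
    return int.from_bytes(digest, "big") >> (160 - bits) == 0

stamp = mint("recipient@example.com", bits=12)  # low difficulty for the demo
assert verify(stamp, bits=12)
```

The asymmetry is the point: a legitimate sender pays the minting cost once per recipient, while a spammer must pay it millions of times.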

The approach inspired the powerful anti-spam system at the center of Bitcoin.

It might be interesting to catalog all of the most innovative early approaches to combating spam, and the unrelated technologies that later arose from them.


I recently came across this idea to prevent friend request spam on social media, using the Lightning network. The idea is that the person sending you the friend request must attach a small amount of money (in the form of a special lightning payment). If the friend request is legitimate, you can accept it (and not claim the money). But if it's spam you reject the friend request but take the money instead. This is different than Hashcash in that it incurs real monetary costs to the attacker. Instead of using compute resources (for the mining algorithm), the money would actually end up in your pocket.
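The accept/reject economics of that scheme can be sketched as follows. This is a hypothetical model of the idea described above, not a real Lightning API; in practice the deposit would be a hold invoice that is either released (accept) or settled (reject):

```python
from dataclasses import dataclass

@dataclass
class FriendRequest:
    sender: str
    deposit_sats: int   # held in escrow (e.g. a hold invoice) until you decide
    claimed: bool = False

def decide(req: FriendRequest, accept: bool) -> str:
    if accept:
        # Legitimate request: release the hold; the sender keeps their money.
        return f"accepted {req.sender}; deposit of {req.deposit_sats} sats refunded"
    # Spam: settle the hold and keep the deposit, so the attacker pays you.
    req.claimed = True
    return f"rejected {req.sender}; kept {req.deposit_sats} sats"
```

Unlike proof-of-work, the attacker's cost here is a direct transfer to the victim, so spamming gets more expensive the more often it fails.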


A much older example: the high cost of sending text messages basically eliminated text message spam.


Except that the cost of phone calls, text messages, and even postal mail is now sufficiently low to produce spam across all three mediums, in ever larger quantities than before.


As he said, it is a much older example. As with everything, the cost decreases over time to the point where it no longer outweighs the benefit.

Put another way, this is the unfortunate flip side of free unlimited talk and text.


I don't think this would work in real life. When Bill Gates was proposing something similar, one of the engineers on the anti-spam team built an FPGA that would allow a spammer to solve such puzzles quickly enough to prevent spamming. His observation to me was that even though such a technique would work, botnets had eliminated the need for specialized hardware.


Not to mention that spam is frequently sent by pwnd boxen or botnets. Why would the attacker care that it racks up someone else's electricity bill?


How does that work for free email newsletters? They're typically already paying an email service provider. Having to pay even more could be onerous for large, double opt-in lists, even though they're not spamming.


I'm guessing that 99% of newsletters are sending vastly fewer messages than the people who send me half a dozen "Welcome to Adulte Sex " messages a day.


Maybe, but spammers are also more likely to be sending from rented botnets that can handle the increase in CPU usage spread over a large number of other people's computers.


I think this article doesn't provide much value over the original essay [1]. And of course, the excitement in the article is now largely obsolete.

[1] http://www.paulgraham.com/spam.html


The article has the merit of significantly reducing the size of the original essay, while IMO still retaining two strong messages: 1. The original (is it?) method used to prevent spam, and 2. The 'seed' factor, which is expected to make spammers work harder. At mid-page I was thinking "meh, spammers will just have to improve their writing then", but this may not be sufficient thanks to the user-specific seed.

[edit: I didn't realize the original article was from 2002. I agree the article is a bit obsolete at that point.]


Modern-day spam is typically generative, and modelling the distribution of "natural e-mail messages" is sadly too naive today. Human beings also understand text through vision, not through bits -- so 1oca1host is just me corrupting the word localhost, but making that inference requires a visual understanding of words. That also gave rise to what is probably a more common spam variant today: the text-embedded-as-an-image type. I've long been of the impression that the only proper way to do text analysis is by vision, a more end-to-end solution as it were.
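One cheap counter to substitutions like "1oca1host" is to fold common visual look-alikes back to their base letters before the filter tokenizes the text. A minimal sketch; the mapping below is illustrative, not a complete confusables table (Unicode TS #39 maintains real ones):

```python
# Map digits and symbols that visually resemble letters back to those letters.
HOMOGLYPHS = str.maketrans({
    "0": "o", "1": "l", "3": "e", "4": "a",
    "5": "s", "7": "t", "@": "a", "$": "s",
})

def normalize(text: str) -> str:
    """Lowercase and fold look-alike characters before tokenizing."""
    return text.lower().translate(HOMOGLYPHS)

assert normalize("1oca1host") == "localhost"
assert normalize("c0ck") == "cock"
```

This only handles substitutions the table anticipates, which is the parent's point: an end-to-end visual model wouldn't need an enumerated mapping.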


> 1oca1host is just me corrupting the word localhost, but making that inference requires a visual understanding of words

No it doesn’t, and pg explains why in his essay. (Don’t know if the article states this too as since I’ve already read the essay before I didn’t bother to read a summarizing article about it. The essay is really excellent though.)

Quote from the essay:

> I'm more hopeful about Bayesian filters, because they evolve with the spam. So as spammers start using "c0ck" instead of "cock" to evade simple-minded spam filters based on individual words, Bayesian filters automatically notice. Indeed, "c0ck" is far more damning evidence than "cock", and Bayesian filters know precisely how much more.

> [...]

> To beat Bayesian filters, it would not be enough for spammers to make their emails unique or to stop using individual naughty words. They'd have to make their mails indistinguishable from your ordinary mail. And this I think would severely constrain them. Spam is mostly sales pitches, so unless your regular mail is all sales pitches, spams will inevitably have a different character. And the spammers would also, of course, have to change (and keep changing) their whole infrastructure, because otherwise the headers would look as bad to the Bayesian filters as ever, no matter what they did to the message body. I don't know enough about the infrastructure that spammers use to know how hard it would be to make the headers look innocent, but my guess is that it would be even harder than making the message look innocent.

http://www.paulgraham.com/spam.html

And as for your point about text in images, I don't know of any email client today that defaults to showing images from unknown senders.

I receive a lot of spam and it is all very distinct in nature and the Bayesian approach is still the way to go for fighting it I think.
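The scoring behavior pg describes can be shown with his combining rule from the essay: given per-token spam probabilities (estimated from training counts, not shown here), the message score is prod(p_i) / (prod(p_i) + prod(1 - p_i)). A sketch, computed in log space to avoid underflow:

```python
import math

def spam_probability(token_probs):
    """Graham-style combining of per-token spam probabilities.

    Each p in token_probs is the estimated probability that a message
    containing that token is spam, derived from training corpora.
    """
    log_p = sum(math.log(p) for p in token_probs)          # log prod(p_i)
    log_q = sum(math.log(1 - p) for p in token_probs)      # log prod(1 - p_i)
    return 1 / (1 + math.exp(log_q - log_p))

# "c0ck" (rarer in ham, so a higher per-token probability, say 0.99)
# is more damning than "cock" (say 0.9), alongside a neutral token:
assert spam_probability([0.99, 0.5]) > spam_probability([0.9, 0.5])
```

The per-token probabilities of 0.99 and 0.9 are made-up illustrations; in a real filter they come from counting token occurrences in spam and ham corpora.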


You may very well be right, but then it's a pity that a 2019 article on a 2002 method didn't mention it.


Wait, did Paul Graham inspire SpamAssassin (the original 'big deal' Bayesian spam filter)?

Cool circa-2002 game for Unix nerds: you have 60 seconds to telnet to port 25 and generate the highest spamassassin score.


SpamAssassin predates the article. Not sure it had Bayesian filtering from the start tho.

Paul Graham however definitely championed and popularized the idea of Bayesian filtering.


SA preexisted Bayesian filters by a year. Bogofilter, a Bayesian classifier, was inspired by Graham.

http://www.catb.org/~esr/bogofilter/bogofilter.html

https://en.m.wikipedia.org/wiki/Apache_SpamAssassin

Spamassassin's Bayesian classifier credits Graham.

https://metacpan.org/release/Mail-SpamAssassin/source/lib/Ma...


He directly inspired SpamBayes, a Python based plugin for Outlook. That thing was like pure black magic back in the day. Worked amazingly well.

Also, Yahoo! Mail implemented at least partly Bayesian based filtering not long after the original PG essay.

I'm guessing GMail is using some kind of statistical filter for not only spam but general categorization, though as others have pointed out GMail is pretty aggressive in filtering. I think GMail might be using collaborative data, aggregating across user behavior. Many users deal with signups they're tired of by marking them as spam; sharing that behavior across the userbase would cause things you willingly signed up for to be filtered out.


Meanwhile, the spam filter at GMail (business account) regularly marks email messages that I send from our own systems (say, disk space reports) to our own accounts (and nowhere else) as spam. Also sales lead forms. So they come in, we mark them 'not spam', and the filter is not smart enough to understand that the next, very similar email is also not spam. A human could quite easily see from the format and layout alone (in addition to the numerous 'not spam' markings) that it was not spam.


Yeah, I also noticed this.

And I'm pretty sure that some years ago gmail was much better at this.


This article is 17 years old ... I must be missing something. Why is it relevant right now?


The technique still works, and works exceptionally well.

I only use a bayes filter (I use bogofilter) and out of some 15k spam e-mails and 150k ham e-mails I received in the last 9 months or so, I've seen like ~10 false negatives and around the same number of false positives. That's way better than gmail can ever dream of, where I always had issues with automated messages ending up in spam. With bogofilter, the training is always explicit, and otherwise the model doesn't change, so I can be sure that when my automated messages pass, they will always pass, and don't randomly stop passing "just because".

I started with a big archive (around 100k msgs each) of SPAM and HAM for the initial training. I learned early in my e-mail days to archive my spam for eventual training, since my first Linux job involved setting up a DSPAM installation.


HN has an obsession about Paul Graham.


I mean, this is literally Paul Graham's site.


I use the word "unsubscribe" to filter out a lot of emails. There are occasional false positives but after combining the above filter with a whitelist of senders it has proven to be very effective.


I prefer to filter on the word "hurry", among other things. That seems to catch all the mail lists I don't want to hear from while keeping those I do available. I don't hurry.

I only filter them into a pre-spam folder and batch spamflag later, that way I don't miss anything hit by a false positive (e.g. a Signal v. Noise digest had the phrase "Why the hurry?" at one point)

Using this method I don't think I had a single black friday email hit my inbox last year :)

Oh and of course "Greeting(s?) of the day" has never been a legitimate email. You can safely auto-spam that phrase.
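The whitelist-plus-trigger-word routing described in this thread amounts to a few lines. A sketch; the trigger phrases and addresses are examples, not the commenters' actual lists:

```python
# Illustrative trigger phrases and whitelist; substitute your own.
TRIGGERS = {"unsubscribe", "hurry", "greetings of the day", "greeting of the day"}
WHITELIST = {"friend@example.com", "boss@example.com"}

def route(sender: str, body: str) -> str:
    """Whitelisted senders always reach the inbox; trigger phrases
    divert everything else to a pre-spam folder for batch review."""
    if sender in WHITELIST:
        return "inbox"
    text = body.lower()
    if any(trigger in text for trigger in TRIGGERS):
        return "pre-spam"   # reviewed later, so false positives aren't lost
    return "inbox"

assert route("friend@example.com", "Why the hurry?") == "inbox"
assert route("shop@example.net", "Hurry, sale ends tonight!") == "pre-spam"
```

Routing to a pre-spam folder rather than deleting is what makes the occasional false positive (like the "Why the hurry?" digest mentioned above) recoverable.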


That's why it's now "if you don't wish to receive these emails, click here".


Then, as mentioned in the article, you could filter on the word "click" too.


The whole point of these Bayesian filters is that you don't have to spend mental cycles thinking of clever words to filter on. Just mark 'em spam and not spam, the algorithms are plenty good at finding which words are spammy or legitimate without human help.


I believe this raises the issue of short-term solutions (such as filtering on 'click'), for which spammers will eventually find workarounds, vs long-term solutions. But do we really need long-term solutions if updating the filtering terms every year is enough to keep pace? Also, does a long-term solution against spamming even exist? Since the Graham solution from 2002 doesn't seem to have been implemented yet, the answer is no?

Another issue is the carbon/energy cost of spamming. I'm no expert, but given the probably high figures, should this issue be escalated to a higher level, e.g. an international agreement with heavy penalties for anyone caught running a large spamming farm? Compared to difficult issues like drug trafficking, I don't see how there could not be a consensus against spam.


If only we would've reacted and saved Usenet in time. A lot of the current web is a bad proprietary replacement of it.


Yeah. I suspect a few bayes-based techniques could have auto-moderated most noise out of most Usenet groups pretty effectively.

But can AI stop politicians like Cuomo from censoring unpalatable speech (Usenet/Reddit/Voat) to death?



