
How a blog spammer got past Akismet's filters - Sam_Odio
http://www.bomega.com/2007/05/11/wordpress-matt-mullenweg-and-spam/
======
sethjohn
About 5 years ago I heard Bill Gates give a talk proposing a system that would
charge 1/10th of a penny for anyone to send you an e-mail. After the e-mail is
accepted, you can excuse the charge for anyone sending legitimate e-mail...but
spammers would wind up with many thousands of dollars in bills.

Seems like a genius idea, wonder why I haven't heard anything more about it?
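
For a sense of scale, here is a rough sketch of the arithmetic behind that proposal. The function name `bill_dollars` and the volumes are made up for illustration; only the 1/10th-of-a-penny rate and the idea of recipients waiving the charge come from the comment above.

```python
# Hypothetical illustration: 1/10th of a penny per message means
# 1000 charged messages cost one dollar.
MESSAGES_PER_DOLLAR = 1000

def bill_dollars(sent: int, waived: int) -> float:
    """Total owed after recipients excuse the charge on `waived` messages."""
    if not 0 <= waived <= sent:
        raise ValueError("waived must be between 0 and sent")
    return (sent - waived) / MESSAGES_PER_DOLLAR

# A legitimate sender whose recipients waive every charge pays nothing.
print(bill_dollars(sent=200, waived=200))
# A bulk sender with ten million messages and no waivers.
print(bill_dollars(sent=10_000_000, waived=0))
```

Under these assumed volumes the ordinary sender owes nothing, while the bulk sender owes ten thousand dollars.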

~~~
randallsquared
There are a number of problems with it. First, it won't actually affect
spammers unless nearly everyone is using it... what could force them to accept
the charges? Second, it's turned out that even the email systems that ask
people to confirm they sent something (which spambots can't understand in
general) are too much for many people to bother with, so asking them to join a
system in order to pay a fraction of a cent for each email they write would
likely be too much trouble.

Lastly, spam filters actually work really well. I get roughly a thousand spams
a day to my primary address which has been continuously in use for ~7 years,
and have to "mark as junk" maybe two spams a day -- the rest go straight to my
junk folder.

------
photomatt
Akismet doesn't use Bayesian filtering.

~~~
dood
Do you mean they don't use a naive Bayesian classifier, or that they don't use
Bayesian probability at all? I'd have a hard time believing the latter.

------
imp
Thanks for the information, very useful.

------
mojuba
For how long will people believe Bayesian filtering stops spam? It's just a
cute theorem that in no way can compete with human intelligence. Or do you
think some probability calculations can outsmart a spammer?

~~~
byrneseyeview
The nice thing about Bayesian filtering is that it makes spam possible _if and
only if_ it's worth reading anyway. If spammers realize that the only way to
get past a filter is to make their message statistically indistinguishable
from something useful, they may decide that the best way to do that is to make
it something useful.

~~~
mojuba
Spam was never absolutely useless (except those special messages that check
your address for availability), and some people do buy fake Rolex watches or
cheap viagra pills. These offers may or may not be legal, and that's a
different kind of problem. Unsolicited mail can be useful to someone, or
otherwise it wouldn't exist at all.

The real problem with spam, again, is that you will always receive it unless
you clearly define what's important for you, or maybe what's not important -
either way. Is that possible?

~~~
byrneseyeview
Are you confusing gross and net use? Spammers value responses, and they can
maximize those by either a) sending more spam, or b) getting more readers per
message. I suspect that a) is finally becoming prohibitive because every new
message trains all the Bayesian filters out there -- so it might work better
for them to have fewer, weirder spams. And why just inject randomness into
your message when you can inject usable order?

The kind of people who read spam now are the kind of people who mass-forward
jokes, so I can imagine a savvy spammer _hiring comedians to write topical
jokes, and appending a Viagra ad to the end_. They can 'beat' Bayesian filters by
wrapping their spam in a message people will actually read and possibly
forward.

~~~
mojuba
I'm not talking about fighting spam. I'm trying to understand what spam is,
because unless we understand that we won't be able to stop it.

Let's say you got a message that offers SEO for your web site,
and you are not an SEO expert. The message has a decent business writing style,
there's a signature with decent-looking contact information: a postal
address, email and everything. And the message invites you to click on a link.
Is it spam or not?

~~~
byrneseyeview
Let's not make it binary. There are unsolicited commercial messages that we
like; there are unsolicited noncommercial messages that we hate, and there are
messages we wish we hadn't solicited -- I think filters should help us exclude
email that isn't worth our time to read, and let us read the email that is (in
fact, rather than a filter I'd hope we could some day have a 'prioritizer'
that gave us interesting, critical stuff first and that deleted less
meaningful messages if we ignored them long enough).

Your SEO example illustrates what I'm talking about: it may be one of those
offers that is worth the time it takes to consider it, so you'd hope that a
spam filter wouldn't bounce it. On the other hand, I dislike SEO and worry
that the users I'd get from it are lower-quality, so I wouldn't want to
bother. This is why Bayesian filters can (and should) be somewhat
personalized. An invitation to a conference might look like spam to 99.9% of
people, but that other 0.1% shouldn't be punished for having strong but
unconventional preferences.
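
A minimal sketch of what "personalized" could mean here, assuming a plain naive Bayes classifier with add-one smoothing over whitespace-separated words. The class name `PersonalFilter` and the training messages are invented for illustration, and nothing here reflects how Akismet or any real filter works.

```python
import math
from collections import Counter

class PersonalFilter:
    """Toy per-user naive Bayes filter (illustrative, not production code)."""

    def __init__(self):
        self.words = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, text: str, label: str) -> None:
        # label must be "spam" or "ham"
        self.docs[label] += 1
        self.words[label].update(text.lower().split())

    def spam_probability(self, text: str) -> float:
        vocab = len(set(self.words["spam"]) | set(self.words["ham"]))
        total_docs = self.docs["spam"] + self.docs["ham"]
        log_score = {}
        for label in ("spam", "ham"):
            # Smoothed class prior.
            score = math.log((self.docs[label] + 1) / (total_docs + 2))
            n = sum(self.words[label].values())
            for w in text.lower().split():
                # Add-one smoothing; the extra +1 leaves room for unseen words.
                score += math.log((self.words[label][w] + 1) / (n + vocab + 1))
            log_score[label] = score
        # Turn the log-odds into a probability, clamped to avoid overflow.
        diff = max(-700.0, min(700.0, log_score["ham"] - log_score["spam"]))
        return 1.0 / (1.0 + math.exp(diff))

# Two users train on the same SEO offer with opposite labels.
dislikes_seo = PersonalFilter()
dislikes_seo.train("seo offer optimize your site", "spam")
dislikes_seo.train("lunch meeting tomorrow", "ham")

wants_seo = PersonalFilter()
wants_seo.train("seo offer optimize your site", "ham")
wants_seo.train("cheap rolex watches", "spam")

print(dislikes_seo.spam_probability("seo offer"))
print(wants_seo.spam_probability("seo offer"))
```

The same message scores above 0.5 for the first user and below 0.5 for the second, because each filter has only seen that user's own labels.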

~~~
mojuba
Ok, so you hope to teach your filter that you are not interested in SEO
(because you know that "decent" companies rarely advertise themselves by
sending spam). What if one day someone you know writes to you personally and
offers, just to be friendly, to optimize your site because she thinks it doesn't show
up in Google results? And let's say there will be some link, too. There's a
chance this message will hit the thresholds and you won't see it.

Another possibility could be that we are discussing this by email and we use
plenty of those "bad" words. Now get it: Viagra, Rolex watches, SEO, efficient
email advertising ;)

I'm arguing that the only way to stop some really shitty mail is to force
senders to pay some price. Could be money or maybe some computational resources
(say, MD5). The _recipient_ (updated) would be able to see exactly how much
time or other resources were spent on the given message. The more resources
the more important the message. 1000 iterations of MD5? Ok, that must be
someone I know personally. 10 iterations? Looks like an ad sent to many
recipients. That's it.

But computers become more and more powerful. Fine, in 5 years you will be able
to raise your thresholds and require, say, 10,000 MD5 iterations to catch your
attention.
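
One way the computational price could be sketched. This swaps in a hashcash-style stamp rather than literally iterating MD5 N times: with plain iteration the recipient would have to redo all the work to check it, while here the sender searches for a nonce and the recipient verifies with a single hash. The function names, the `:` separator and the bit thresholds are all assumptions made for illustration.

```python
import hashlib
from itertools import count

def stamp_bits(message: bytes, nonce: int) -> int:
    """Leading zero bits of MD5(message:nonce); more bits means more work spent."""
    digest = hashlib.md5(message + b":" + str(nonce).encode()).digest()
    return 128 - int.from_bytes(digest, "big").bit_length()

def mint(message: bytes, bits: int) -> int:
    """Sender side: try nonces until the stamp is good enough.

    Expected cost is about 2**bits MD5 calls.
    """
    for nonce in count():
        if stamp_bits(message, nonce) >= bits:
            return nonce

def classify(message: bytes, nonce: int, personal_bits: int = 16) -> str:
    """Recipient side: one hash tells how much work the sender spent."""
    return "personal" if stamp_bits(message, nonce) >= personal_bits else "bulk"

msg = b"Want to grab lunch on Friday?"
nonce = mint(msg, bits=12)          # roughly 4096 hashes on average
print(stamp_bits(msg, nonce))       # at least 12
print(classify(msg, nonce, personal_bits=12))
```

Raising the threshold as hardware gets faster is then just a matter of the recipient increasing `personal_bits`; each extra bit doubles the sender's expected work.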

