
How to beat comment spam - dendory
http://dendory.net/blog.php?id=5078058e
======
cmer
I was in the business of fighting web spam for over 5 years (Defensio) and
while these techniques help, they're not the definitive answer.

Spam bots are now extremely sophisticated and have been able to execute
Javascript and "read" and understand web pages for many years. They'll also
post bogus comments that are somewhat related to your article but sneak in a
fishy URL in there. We had many false-positive reports that were actually real
spam. It's just really hard for a human to detect. Of course, JavaScript-based
techniques will eliminate some easy-to-catch spam, but nothing a 3rd party
service couldn't catch.

Another huge problem is that people are paid next to nothing in China and
India to manually spam websites and break captchas. The number of human
spammers keeps increasing. When I left last year, it was becoming a huge
problem. Definitely the biggest headache for us in ~5 years.

In my experience, the best protection against web spam is still
Akismet/Mollom/Defensio. And for the record, I know we didn't like it when people
used other mechanisms to stop some spam before it got to us because we didn't
get to see the full corpus, which was invaluable to us in helping all our
users fight spam.

~~~
Kesty
I think the kind of defense you need to use depends on what kind of website
you have.

Based on my experience if you have a small/medium website you won't find bots
that execute javascript, understand a web page or use human spammers.

Those are reserved for the big ones; for all the others it's mostly
general-purpose bots that try every form they can find on the internet. Where
speed is more important than accuracy, spammers won't use the "heavy" bots.

~~~
cmer
Actually, the sophisticated bots typically target platforms, not websites. So
if your website runs Wordpress, you're much more likely to be spammed hard
than if you custom-built a comment form.

------
TomGullen
We got hit with a huge wave recently, that sent over 40,000 visits a day to
our site and nearly ground it to a halt.

The number 1 effective thing we have found to do is to not allow hyperlinks to
be posted if they are not trusted (not enough rep/point/score whatever)

Overnight it basically stopped the spam wave. You're removing the one thing of
value for them, a hyperlink. I'm a big fan of accessibility and this works
well with it. The only other technique we use is honeypot form fields which do
catch a fair few, but nowadays a lot of spam I suspect is paid human spam.
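
A minimal sketch of the link gate described above, in Python. The threshold
`MIN_REP_FOR_LINKS`, the regex, and the function names are assumptions for
illustration; the comment does not say how trust is scored or how links are
detected.

```python
import re

# Hypothetical threshold; the comment only says "not enough rep/point/score".
MIN_REP_FOR_LINKS = 50

# Deliberately loose: anything that looks like a URL or a bare "www." host.
LINK_PATTERN = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)


def contains_link(text: str) -> bool:
    """True if the text appears to contain a hyperlink."""
    return LINK_PATTERN.search(text) is not None


def may_post(text: str, reputation: int) -> bool:
    """Reject posts with hyperlinks unless the author is trusted enough."""
    if contains_link(text) and reputation < MIN_REP_FOR_LINKS:
        return False
    return True
```

A loose pattern will occasionally block a harmless post from a new user, which
is probably the cheaper mistake here since the link is what the spammer wants.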

~~~
ryanwanger
Blocking any spam that contains a link is helpful, for sure, but doesn't get
everything. Every few months I see waves of comments like: "Really graet
article. We need more people like you in the world."

Each comment has exactly one pair of transposed letters. There is no product
being pitched, and no url (we don't display or link to email address either).
It's baffling.

~~~
eli
Some blog platforms whitelist comments from people who have had previous
comments approved. I'm pretty sure these meaningless (but positive) comments
are an attempt to get on that list.

~~~
zalew
yes. also some platforms (mostly forums, less of blogs) allow editing of
posts, so forum spammers sometimes post meaningless crap only to replace it
later with spam.

------
Kudos
You accept comment submission via GET requests?

I may not have reverse engineered it fully, but something like this will allow
me to post images around the internet that actually create comments on your
site by the IP of the visitor.

<img src="http://dendory.net/blog.php?id=5078058e&cn=Kudos&cp=Spam%20urls" />

~~~
fooyc
You can do that with POST too if the site doesn't have CSRF protection ;)

~~~
Kudos
You cannot do it in drive-by format like you can with a GET. It's passive and
can be posted almost anywhere, message boards, email, Facebook.

~~~
driverdan
You can with simple JS posting a hidden iframe. This obviously won't work on
3rd party sites but works just fine from sites you control.

------
tomwalsham
My personal favourite quick-fix (which doesn't stand up to targeted attacks,
but is a very effective band-aid) is to put the following in the form:
<input type='text' name='website' style='display:none'>

Then disallow any form submissions server-side which contain a value for
'website'. Automated bots can't resist filling out that field.
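
The server-side half of that honeypot could look roughly like this in Python.
It assumes the submitted form arrives as a plain mapping of field names to
strings; `is_probable_bot` is a made-up helper name, not part of any framework.

```python
from typing import Mapping

HONEYPOT_FIELD = "website"  # same name as the hidden input above


def is_probable_bot(form: Mapping[str, str]) -> bool:
    """True when the hidden honeypot field came back with a value.

    A human never sees the field, so it should be empty or absent.
    """
    return bool(form.get(HONEYPOT_FIELD, "").strip())
```

Usage would be along the lines of `if is_probable_bot(request_form): reject()`,
where `request_form` and `reject` stand in for whatever the site already uses.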

~~~
shrub
This happened to me recently with a WP blog. It happened quite by accident,
however, since the client just didn't want the website field. When comments
still came in with a URL, the client was concerned that I had screwed up - but
it clicked right away for me that these must be bots. It might have been a
little disheartening for the client, since a number of these spam messages
were along the lines of "I have never read such a great article. I have
bookmarked your blog and will come back every day to read more of your
insightful posts." What unaware blog owner wouldn't want that on their
comments? Crafty spammers.

------
reasonattlm
I've used this sort of Javascript-based approach for years, with great
success:

<http://www.exratione.com/2010/12/how-to-block-999-of-all-movable-type-comment-spam/>

It works very well unless you're big enough to merit individual attention from
a spammer. It's not rocket science - it just raises the bar a little above the
level of effort that people who spam everything, everywhere are willing to put
in.

That might change.

The real merit of Javascript used this way is that there are so many different
possible approaches and ways to write the code that parsing has to be done on
a site-by-site basis. It should even be possible to write something that
auto-generates and mixes various combinations to make it annoying and costly
for an individual to keep working at breaking the protection, thus increasing
the size of community/site you could protect this way.

~~~
theoj
There is a Wordpress plugin called Spam Free Wordpress that implements a
variation of this and has effectively cut spam on my sites to zero.

The plugin improves on the method described by randomly generating the value
of the additional token parameter, and keeping a list of all generated tokens.
If the server receives a comment post request which does not contain one of
the generated tokens, then that comment is guaranteed to be automated spam.
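
A rough Python sketch of the token-list idea, not the plugin's actual code. The
`TokenStore` class is hypothetical, it keeps tokens in memory only, and making
each token single-use is an extra assumption beyond what is described above.

```python
import secrets


class TokenStore:
    """In-memory set of issued tokens; a real site would persist and expire them."""

    def __init__(self) -> None:
        self._issued = set()

    def issue(self) -> str:
        """Generate a random token to embed in the comment form."""
        token = secrets.token_urlsafe(16)
        self._issued.add(token)
        return token

    def redeem(self, token: str) -> bool:
        """Accept a known token once; unknown or reused tokens are rejected."""
        if token in self._issued:
            self._issued.remove(token)
            return True
        return False
```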

------
obituary_latte
I recently set up a WP site and forum for a product my brothers are trying to
sell.

We're not allowing commenting on WP, but obviously have to allow people to
post on the forum. The forum software offered a couple of (unofficial) anti-
spam plugins, but they were not effective at all.

Decided to try re-captcha, but found that to be equally ineffective (hadn't
read about just how broken re-captcha is until this incident).

So I spent 10 minutes writing a little script that checks for mouse movement
and clears a pre-populated field. If the field isn't empty, bot it is.

Wasn't sure it'd work, but so far, so good. I know it's not ideal and will be
a problem for people without js enabled, but the site and product are
targeting a demographic in which that's likely to be a rare occurrence so the
benefit > risk.

~~~
fuzzix
"So I spent 10 minutes writing a little script that checks for mouse movement
and clears a pre-populated field. If the field isn't empty, bot it is"

Nice idea. I tend not to use the mouse a whole lot once the 'reply' link has
been clicked, have you had any complaints of legitimate posts being lost?

I'm wondering if adding a check for key down/up events would mitigate this
potential issue since a spam bot is not likely to generate those either.

~~~
artursapek
I think he probably starts checking for the mouse movement as soon as the page
loads.

~~~
obituary_latte
Exactly right, and there is a threshold set. Though it's not used when people
try to post but rather when they try to register, I'd imagine it'd work
similarly well on an "open" comment page. For a while at least.

------
jiggy2011
From my experience running popular open source applications seems to pretty
much guarantee spam.

For example, we built a website with a forum some years back and used phpBB.
Within days massive amounts of explicit porn had been posted all over it and
we had a client threatening to sue.

We tried everything we could to get rid of it, stopping images/hyperlinks from
being posted, adding captchas, anti-spam plugins and doing stuff like adding
sneaky hidden form fields.

At one point we even deleted the signup form and required administrators to
create accounts by hand on request for users, yet the bots still somehow
managed to create their own accounts on the forum.

None of it worked for more than a month at a time.

In the end I just built a super simple php forum by hand in a few hours with
very rudimentary anti-spam since it was a small forum and we weren't using
many phpBB features anyway.

Took over a year for the bots to come back and at that point switching the
HTML around and changing the form field names seems to have kept them away
thus far.

------
daveid
Another technique I find to be working really well is the "honeypot"
technique. I create a CSS-hidden input field with a delicious, attractive name
"url" and then validate it to be empty.

~~~
Kesty
I use both the hidden honeypot and a random javascript injection that has to
be matched server-side. Both have to pass.

The "problem" with these kinds of tricks is that they work for small/medium
websites, and only if they are not adopted as part of a big library that
everyone uses.

They are not that hard to beat if you want to spam someone intentionally or if
they are implemented by a well known plugin for (wordpress/joomla/etc..)

------
toadburglar
I sense a lot of naivety in this post. For starters, using GET just means that
once one spammer creates a rule for your site, they can spam it until you
change the variable names in the query string. Using JS to submit a form
should be fine, but I STILL encounter people without JS, and personally I
think shipping it without a fallback for non-JS users is just bad coding.

A simple honeypot with some CSRF tokens would reduce spam. If you want to beat
spam altogether, then invest some time in a captcha, but expect it to come at
the user's expense.

------
JimWestergren
With 5 lines of PHP I was able to block 94.94% of the spam on a WordPress
blog. I simply checked how long it took to read my article and write and
submit a comment. Less than 10 seconds = block with a friendly message. Code
and more details here:
<http://www.jimwestergren.com/a-new-approach-to-block-web-spam/>

~~~
kanzure
> I simply checked how long time it took for reading my article, writing and
> submitting a comment. Less than 10 seconds = block with a friendly message

Some of the bots simulate mouse movements, some of them even inject
letters/words into textarea elements as if someone is typing. It's not that
hard to make it look like someone is correcting typos.

------
hashtree
One thing that really helps:

Server-side, encrypt a token which, in addition to identifying the unique form
instance, contains a tick count, and set a hidden input's value to it. Now,
ensure that each form instance cannot be submitted more than once AND that the
delta between the current tick count and the form's tick count is greater than
or equal to the amount of time that would be needed for a human to fill out the
form.

You MUST ensure client-side error detection is superb (as you want to catch
all errors prior to submitting), handle for back button usage properly
(browser caching directives, http status codes, etc), and ensure you handle
for browsers which may auto fill information in for the user.

You would be surprised just how many bots come in and either use a cached
form or immediately submit it. Assuming they are smart enough to bypass both
of these, you just reduced the number of times they could potentially spam you
dramatically.

The tick count figure needs to be done on a form by form basis, as each one
likely has a different minimum.
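
Here is one way the token scheme might be sketched in Python. Note the swap: it
signs the token with HMAC instead of encrypting it, which hides nothing but
still prevents tampering. `SECRET_KEY`, the token layout, the in-memory set of
used form IDs, and both function names are assumptions made for the example.

```python
import hashlib
import hmac
import secrets
import time

SECRET_KEY = secrets.token_bytes(32)  # hypothetical per-deployment secret
_used_form_ids = set()  # in-memory only; a real site needs shared storage with expiry


def _sign(payload: str) -> str:
    return hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()


def issue_form_token(now=None) -> str:
    """Build 'form_id:issued_at:signature' to place in a hidden input."""
    issued_at = int(time.time() if now is None else now)
    payload = f"{secrets.token_hex(8)}:{issued_at}"
    return f"{payload}:{_sign(payload)}"


def check_form_token(token: str, min_seconds: int, now=None) -> bool:
    """Accept a token only if it is authentic, old enough, and not yet used."""
    try:
        form_id, issued_at, signature = token.split(":")
        issued = int(issued_at)
    except ValueError:
        return False
    expected = _sign(f"{form_id}:{issued_at}")
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return False
    current = time.time() if now is None else now
    if current - issued < min_seconds:
        return False  # submitted faster than a human plausibly could
    if form_id in _used_form_ids:
        return False  # each form instance may be submitted only once
    _used_form_ids.add(form_id)
    return True
```

`min_seconds` is passed per call because, as noted above, each form likely has
a different minimum. A too-fast submission is rejected without burning the
form ID, so a legitimate user can retry the same form a moment later.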

~~~
necro
I added something similar to our framework where we do the encryption server
side when a form is generated. In our token we encrypt a form generation time
and captcha question and answer variables. This allows us to easily render on
the form a textual or graphical captcha and pass the answer encrypted. The
form processing simply decrypts the data and decides one, if a form is too
fast or stale based on the difference of the form generation and submit time
and two, it compares the captcha answer to that which was passed in the
encrypted token.

------
kenkam
This is a cat and mouse game. If enough websites out there use this technique
then there will be bots that can circumvent this, although some non-trivial
amount of work is needed to parse the javascript.

~~~
phpnode
Not really hard at all, you can automate a browser using phantomjs and it's
really fast too.

------
dendory
Just to respond to some of the comments I've seen. Basically, yes it's true
that my sites aren't very high profile, and if someone were to target them
directly it would be trivial to bypass this system. The point was more that
the current, well used bots that send spam randomly, do not work against them.

Interestingly enough one of you, someone who saw the story here, decided to
actually write one such bot and start spamming my blog post, but again they
were pretty stupid and it was trivial to block. Still, pretty sad that someone
would go to this length and actually try and send hundreds of spam posts just
for the kick of it.

Also a lot of people mentioned captcha, and yes I guess I should have
mentioned that, but the reason I never used one is because I didn't get any
spam in the first place.

~~~
krapp
Look at it this way -- now you get to find another new way to easily beat
comment spam. And then another. And then another...

------
mlitwiniuk
The idea is good for massive/popular spam bots. But... well... I've been there
and I know that spam bots evolve as long as there is someone who can tune the
bot a little. So changing a bot from looking for a submit button to just
submitting the form is quite easy. Also, technology moves forward and it's no
big problem today to write a bot that understands JavaScript. And since I have
mentioned this: the best solution I've found to fight spam bots is to create a
hidden (or visible, what the hell) field with an initial value that's later
changed by JavaScript. Checking on the server for the value you expect works
like a charm. But as long as you say that this solution works, it's worth
mentioning and remembering.

------
mikeash
It's important to note that it's extremely easy to "beat" comment spam if you
have a relatively low-traffic site and some programming time to spend on a
custom solution.

The per-message payoff for spam is horrendously low. Spammers only do it
because they can post a huge number of messages. The big threats are
necessarily automated, and that automation isn't going to bother with special
cases for any site that isn't worth their while.

For the longest time, the anti-spam measure on my blog's comments was a field
that literally said:

    
    
        Type the word "elbow": _____
    

And it only accepted the comment if you typed the word "elbow". It wasn't even
a dynamic word. It was literally hardcoded to be the word "elbow". This
stopped almost all spam for years.

Somebody finally added this to their bot, so I modified it slightly, to:

    
    
        Type the word "humour", but with American spelling: _____
    

Once again, this stopped almost all spam for years.

A few months ago, more for fun and curiosity than because I really needed it,
I replaced that anti-spam field with a JavaScript hashcash-based solution.
Basically, when the user wants to make a comment, the page fetches a problem
from the server whose solution is difficult to compute but easy to verify. The
page then computes the solution on the commenter's computer, and posts it
along with the comment. I tuned it to take about 20-30 seconds on modern
hardware/browsers.

For the curious, the problem I chose is a standard one you'll find if you
search for "hashcash". The quick version is that the server generates some
random data and gives it to the client. The client then searches for a salt
that, when added to the data, produces a SHA-1 hash with a given number of
leading zero bits. The number of leading zero bits required can be easily
tuned, with each additional bit roughly doubling the amount of time it takes
to find a solution. The client's solution can easily and quickly be verified
by just combining the client's solution with the generated data and counting
the number of leading zeroes in the SHA-1 hash.
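
The scheme in the paragraph above can be sketched in a few lines of Python.
This is not the code from my blog: the way the salt is encoded (a decimal
counter appended to the challenge bytes) and all function names are assumptions
chosen for the example.

```python
import hashlib
import itertools
import secrets


def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the start of a hash digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def make_challenge() -> bytes:
    """Server side: random data handed to the client."""
    return secrets.token_bytes(16)


def verify(challenge: bytes, salt: int, difficulty: int) -> bool:
    """Server side: one SHA-1 call checks the client's claimed solution."""
    digest = hashlib.sha1(challenge + str(salt).encode()).digest()
    return leading_zero_bits(digest) >= difficulty


def solve(challenge: bytes, difficulty: int) -> int:
    """Client side: brute-force salts until the hash has enough zero bits."""
    for salt in itertools.count():
        if verify(challenge, salt, difficulty):
            return salt
```

With `difficulty=12` the search needs about 4096 hashes on average, and each
extra bit roughly doubles that, so the cost is easy to tune.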

Now, this would not stand up to a concerted effort. My JavaScript
implementation is pretty slow, which means that the 30-second work required by
my page could be reduced to <1s of CPU time for a program optimized to break
my protection. But it doesn't matter, because it's not worth anybody's time to
do this.

I occasionally get spam, still. From looking at the logs, I'm about 99.9% sure
that these spam comments are being posted by actual human beings sitting at a
browser. I have no idea how it could possibly be cost effective to do this,
but the quantity is low enough that it's not a real problem.

My crazy hashcash solution has an additional benefit, which some might see as
a liability. I only start the work when the user clicks on the comment form,
in order not to burn up their battery unnecessarily if they don't plan to
leave a comment. The user then has to wait until the proof of work is
completed, typically 20-30 seconds, before they can post a comment. This
strongly discourages short, off-the-cuff comments, which are almost invariably
worthless anyway.

In short: spam prevention is easy if your site is small and you have the time
to invest in a custom solution. _Any_ custom solution will do. As long as it
doesn't match whatever patterns spambots possess, it doesn't much matter what
you do, as long as it's unusual.

Once your site gets big enough, you'll no doubt need more. But cutesy stuff
like changing your form variable names won't save you then anyway. If you're
at the level where the linked solution works, you're at a level where nearly
_anything_ custom-made will work.

~~~
pbhjpbhj
I use a dummy field on one site - called something like "Last Name" - the
contents of which are hidden and must not be changed. The field contents are
clear they must not be changed - "Do not alter this field!" - so that it still
works for a wanted user if CSS has been tampered with.

No spam yet. But it's quite a small site; this is probably over only about
6 million hits.

For all I know it's just because it's a hand-coded site. Trying this on a WP
site is on my todo list.

~~~
vinc
I used this solution on a network of WP blogs with moderate traffic (maybe
somewhere around 100 to 500k+ visits per month at best) but after a while some
spammers took the time to script their way into the comments.

------
Brajeshwar
Like many other bloggers, I've been a victim of Blog Comment spam for quite a
while. On a few occasions, I've totally disabled comments on my blog.

However, isn't that something of the past?

I've totally outsourced my blog comments to Disqus (there are other
alternatives) and I'd like to say, I'm very happy with my decision. Some
manual spam still leaks through, but it's so minuscule that I don't really
fret over it any more.

~~~
krapp
> However, isn't that something of the past?

Not even remotely. Adobe Business Catalyst users have been getting hammered
with comment spam for months now, it shows up in waves on livejournal and I
catch it regularly in my akismet queue in wordpress. I see it everywhere,
still. If there's a form, something will try to post a link in it.

> I've totally outsourced my blog comments to Disqus

That's all well and good until someone writes a bot designed to target Disqus
users because of the size of its userbase.

------
jccc
May we just let comments = referrer links? Comment on your own blog, twitter
feed, etc. and traffic from those sources list automatically under the
content.

Fighting these kinds of problems makes for interesting mental challenges, but
a technical solution isn't necessarily the best one. Shouldn't the price of
having space on my site to comment be that you do so from some kind of online
identity of your own?

------
kalleboo
I just thought of this method: randomize the input names on each form load,
and include the key to the hash in a hidden field. This way the bot would
have to be smart enough to go off field order instead of name (you could even
randomize field order using some clever CSS). Or are they already smart enough
to deal with that?
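
A possible Python sketch of the idea, untested against real bots. It assumes a
per-form nonce goes in the hidden field and that field names are derived from
it with an HMAC under a server secret; `SECRET_KEY`, `LOGICAL_FIELDS`, and the
function names are all invented for the example.

```python
import hashlib
import hmac
import secrets

SECRET_KEY = secrets.token_bytes(32)  # hypothetical server-side secret
LOGICAL_FIELDS = ("name", "email", "comment")


def field_name(logical: str, nonce: str) -> str:
    """Derive the randomized input name for one logical field."""
    mac = hmac.new(SECRET_KEY, f"{nonce}:{logical}".encode(), hashlib.sha256)
    return "f_" + mac.hexdigest()[:16]


def new_form() -> dict:
    """Return the nonce plus the logical -> randomized name map for rendering."""
    nonce = secrets.token_hex(8)
    names = {f: field_name(f, nonce) for f in LOGICAL_FIELDS}
    return {"nonce": nonce, "names": names}


def decode_submission(posted: dict) -> dict:
    """Map randomized names back to logical ones; missing fields become ''."""
    nonce = posted.get("nonce", "")
    return {f: posted.get(field_name(f, nonce), "") for f in LOGICAL_FIELDS}
```

A bot that posts to fixed names like `comment` ends up with empty logical
fields after decoding, so the submission can be dropped.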

------
jasondc
Wait until you get the spam bots targeting your payment forms to validate
stolen credit cards...whole different set of challenges. We had 800 payments
in one day from this type of attack.

------
sheraz
You might want to look at project honeypot
(<https://www.projecthoneypot.org/>)

They are an open and distributed service that uses various signatures (ip, tar
pits, etc) to block spammers and bots on your site.

I put it up on one of my sites and saw an immediate drop to almost nil. I went
from 100+ spammy messages a day to fewer than 20 in the last 3 months.

------
ck2
There are many bots now that can handle javascript.

What is working for you is that it's custom code.

Anything that doesn't match standard templates is helpful.

------
Tichy
Why not use that headless chrome project for spamming (momentarily forgot the
name)? That should foil the JavaScript evasion methods.

Downside is of course having to download the sites you want to spam, whereas
apparently traditionally spammers just send post requests.

------
borski
For forms that you don't want to annoy the user with a crazy hard to read
CAPTCHA, you might want to check out Negative Captchas:
<https://github.com/subwindow/negative-captcha>

------
pkulak
Yup. You just have to make your system a bit unlike everyone else's. I just
hid my normal comment field with css, and made a new, visible one with a
different name. Any comment that came in with the old parameter name was
chucked. Done and done.

------
jarofgreen
I tried to comment on the original blog post in both FireFox and Chrome and it
just said "Comment not sent!".

Anyone else?

Might this code be shutting out legitimate users? (Apart from the fact that if
you have JS turned off you can't comment, that is.)

------
obsession
I don't think Javascript tricks work very well against motivated spammers. It
is trivial to use a headless WebKit client to execute Javascript and ajax
requests.

~~~
threedaymonk
I'm reading this thread whilst running a full-stack test suite against my app
- using a headless WebKit client. I expect spammers will do the same if and
when the JavaScript-unaware methods stop yielding an acceptable return, but
given their low costs that threshold may be a long way off.

I use something similar in my own site: a field in which the commenter is
asked to fill a specific value. If they're running JavaScript, I fill it in
for them and hide the element. So far, it works perfectly.

As other commenters have pointed out, however, this kind of defence only works
against generic attacks, and defending against a targeted spam attack will
always be difficult. But for the generic case, there will continue to be
simple things you can do to thwart naive attacks. One that springs to mind is
to introduce a scripted timing element. A spam bot won't wait a minute before
submitting, but a user should at least have read the post they're commenting
on.

~~~
gambler
Progressive enhancement for bot detection... I like your idea. This is much,
much better than simply stopping anyone without JS enabled from using the
form.

------
borplk
this is amazingly simple and brilliant. thanks for sharing. it is going to be
too difficult for the bots to learn to get around that for quite some time, so
now it's the time to enjoy this defense mechanism. once it becomes the
ordinary thing, the bots will evolve too for sure but we're not quite there
yet.

------
ahabman
Why don't bots just fill in the form then trigger a click on the anchor?

~~~
kami8845
Too much overhead to actually fully simulate JS execution when you can still
spam 99% of sites using GET/POST

------
alexyoung
And thus began the era of spam bots written with jsdom/PhantomJS.

------
munyukim
It's clearly a clever trick; I'm impressed.

------
gobengo
Authentication helps.

------
whyhellothere
Unfortunately (at least in the UK) this technique cannot be used on
consumer-facing sites as it breaks the accessibility of the form for some disabled
users.

For personal sites it really comes down to your preferences. Personally I
would prefer that everyone was able to comment, however if it stops you having
to wade through thousands of spam messages every day I can see the point of
using it.

~~~
vidarh
Why would this be an accessibility problem? I don't see why screen readers
would have a problem dealing with it - for them the form in the users browser
will appear just the same as it otherwise would.

~~~
Isofarro
1.) Screenreaders have different modes of operation for different aspects of
web content. For dealing with Forms they have Forms mode, in which only form
elements are announced. A link isn't a form element, so they won't see the
submit button.

2.) Screenreader users have a shortcut key to submit the form - typically when
under-qualified web developers create forms without submit buttons. This fires
the form submit event, which without a JavaScript preventDefault will get the
form contents sent to the URL mentioned in the action attribute on the form.
So the screen reader user's comment is treated as spam.

