How To Stop Forum Spam (swapped.tumblr.com)
30 points by latitude on Nov 20, 2012 | 39 comments


The main way to stop spam on your website is to be ahead of the herd. At the moment the general herd of websites does very little, so a couple of honeypots and IP bans are all you need most of the time.

We were swamped with spam recently when our page rank hit 6 (I guess spammers target sites with higher PR). For this we had to block users with fewer than n posts from posting URLs. That basically eradicated that wave of spam.

It's amazing how smart and adaptive some of them are though. It only took them 12 hours to realise that if they posted anything and then edited it, they could edit links in. We fixed that one promptly and they haven't been back since.
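
In case it's useful to anyone, the check itself is trivial; the important part is running it on every write path. A rough TypeScript sketch (the threshold and names are made up, not our actual code):

    // Block URLs from low-post-count users. The threshold and types are
    // illustrative, not our actual code.
    const MIN_POSTS_FOR_LINKS = 10;
    const URL_PATTERN = /https?:\/\/|www\./i;

    interface User {
      postCount: number;
    }

    // Run this on *every* write path - new posts AND edits - otherwise
    // spammers will post clean text and edit the links in afterwards.
    function validateBody(user: User, body: string): void {
      if (user.postCount < MIN_POSTS_FOR_LINKS && URL_PATTERN.test(body)) {
        throw new Error("Posting links is disabled until you have more posts.");
      }
    }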

So for that reason your suggestion would probably only be a temporary fix. I'd also be concerned about a lot of false positives.

I also believe a lot of spam is human. If you have people willing to work for a few dollars a day spamming websites that automated processes have difficulty reaching, it could well be a cost-effective way for spammers to reach their targets.


> We were swamped with spam recently when our page rank hit 6 (I guess spammers target sites with higher PR)

We have really good search engine rankings, and if you post a topic on our forum it'll show up in all the search engine results within minutes. The funniest thing I've ever experienced was discovering that when a spammer submits a post about a live stream for a football match 15 minutes before kick-off, we'll see ~3,000 people browsing the forum, all from Google because they searched "watch barcelona vs real madrid" and our ranking is being abused by spam bots.

Prior to realising this happened I'd always viewed spam on forums as a sort of "hope people click the links" thing, but after discovering just how much traffic posts like the one in my example can drive... these spammers must be making a lot of money. The forums are used as their quick way into search results for a term they know is about to blow up. I think it's pretty clever, and obvious in hindsight.

http://rmbls.net/post/21518498093/barcelona-vs-real-madrid


We have that as well, and rank very highly for some long-tail terms, all relating to live sports events.

However, it is our opinion that the traffic these links generate is fake (and we have some evidence for it as well). I think its aim is to trick Google into thinking it's a good-quality result. Alternatively, the linked pages are often very heavy in adverts, so fake traffic routed via Google, your website and finally the spammer's website could be used to generate lots of fake ad clicks in a semi-realistic way. That's our conspiracy theory anyway; I very rarely see real-world examples of content on the web being ranked so well so quickly and driving so much traffic in such a short space of time.

As a small, unscientific test, when a seemingly popular spam sports-event thread was posted, I deleted it and created a duplicate under my account. It didn't get any traffic at all. This in some way helps validate my suspicion that the traffic being driven is not a result of Google's process, but is artificial.

I was going to write a blog post about it but never got round to it!


Hm, that sounds like a plausible theory. I've never investigated beyond looking at the locations via Google Analytics, and they always seemed to be European visitors (Spain was the majority), which I assumed meant the traffic was legitimate because the football matches of interest match up. Next time it happens I'll look up some of the IP addresses used and see if they match any known spammers, although I suspect that if they are doing some sort of click fraud they'd be using botnets?


yup, once your site becomes popular you get more determined attackers. filtering robots is only the beginning of the game...


The easiest way to do it in my experience is to write your own simple forum software. This should only be a few hours' work in a framework, and has the added bonus of being more specialised to your needs and probably simpler to use.

If it's a small forum you probably don't need any actual moderation features (stuff you do need you can do through a scaffolding interface).

Bots seem to be attracted to phpBBs like flies to shit, so not being fingerprintable is a huge bonus.

It also helps to do weird stuff: instead of having an <input type="submit" /> for a form submit, use an IMG tag with an onclick that reads data from the form fields in the most roundabout way possible.
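
Roughly this sort of thing (a sketch; the element ids and the endpoint are invented):

    // Instead of <input type="submit">, an image that assembles the
    // payload by hand and POSTs it. The "submit-img"/"msg" ids and the
    // /post endpoint are invented for this example.
    const img = document.getElementById("submit-img") as HTMLImageElement;
    img.addEventListener("click", async () => {
      // Read the field "the roundabout way", so a generic form-scraper
      // that looks for <form> plus a submit button never fires.
      const body = (document.getElementById("msg") as HTMLTextAreaElement).value;
      await fetch("/post", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ body }),
      });
    });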

Also make the URL of each thread change whenever there is a new post, so it's harder for the bots to tell if their posts are working or not. Better yet, don't display the post on the site itself for 30 seconds or so, but use JS to make it appear to the submitting user.
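
Server-side, the delay is just a timestamp. A sketch (all names invented):

    // Delayed-display trick: posts carry a visibleAt timestamp, and
    // everyone except the author sees them only after the delay.
    const POST_DELAY_MS = 30_000;

    interface Post {
      authorId: number;
      body: string;
      visibleAt: number;
    }

    function createPost(authorId: number, body: string): Post {
      return { authorId, body, visibleAt: Date.now() + POST_DELAY_MS };
    }

    // The thread view filters by visibleAt. The author's own client
    // appends the post locally with JS, so only a bot concludes that
    // "nothing happened".
    function visiblePosts(posts: Post[], viewerId: number): Post[] {
      return posts.filter(p => p.visibleAt <= Date.now() || p.authorId === viewerId);
    }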

Some of those tricks can of course hurt usability and accessibility though, but I managed to reduce spam to 0 on one forum.


The sort of forums you're talking about won't be what spammers really care about. If they can hit a small forum with no extra work they may as well, but the targets they'll work for are the high-traffic forums that can't sacrifice usability, SEO or accessibility in any meaningful way to stop spam.

If you want to stop spam on a small forum you don't need to break usability because nobody is going to spend time looking at how you're doing things to get around your spam prevention.


I suppose it depends upon how sophisticated the spam bots you are dealing with are and whether they have human assistance.

We originally used phpBB and applied various anti-spam plugins, including some modifications made manually to the PHP code (such as honeypot fields hidden with CSS), but the bots kept spamming it regardless.

I guess there is a sense of "we know this is a phpBB, therefore we must be able to spam it, so keep trying" vs "I don't know what this is", "can I POST this form and see instant results?", "no?", "give up then".

If it had been a larger forum then I'm sure these tactics would not have been effective regardless, since we would have suddenly become worthy of having a custom spam bot written just for us.

Those tactics are somewhat extreme; in many cases just giving form fields weird names gets good results. Of course there is the problem of "what is bad for spam bots can be bad for screen readers".


"Some of those tricks can of course hurt usability and accessability though but I managed to reduce spam to 0 on one forum."

Did the number of non-spam posts also get to 0?

</snark> Sorry, had to.


Nope, it increased because people actually started to use the forum :)


I've tried having forms generated by JavaScript over a honeypot, and hashing the field names to make them impossible to memorize. I don't know yet how effective those will be, but I'm sure people are going to complain about the JS forms.
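
The hashing is just a per-session HMAC over the real field name. A sketch in TypeScript on Node (secret handling simplified):

    // Per-session field-name hashing. The real name never appears in
    // the HTML, so a bot can't memorise "username"/"email"/"body".
    import { createHmac } from "crypto";

    function fieldName(realName: string, sessionSecret: string): string {
      return "f_" + createHmac("sha256", sessionSecret)
        .update(realName)
        .digest("hex")
        .slice(0, 16);
    }

    // On submit, recompute the hashes with the same session secret and
    // map the obfuscated names back to the real ones.
    function decodeFields(form: Record<string, string>, sessionSecret: string,
                          realNames: string[]): Record<string, string> {
      const out: Record<string, string> = {};
      for (const name of realNames) {
        const hashed = fieldName(name, sessionSecret);
        if (hashed in form) out[name] = form[hashed];
      }
      return out;
    }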


Depends on your audience; the HN crowd definitely would. Doing weird stuff like <IMG /> instead of submit buttons will hurt people who use keyboard shortcuts, so this is an area where having a less tech-savvy audience can be an advantage, since they will use the mouse for everything.


It's already bitten me, because I can't use a password manager to log in to my own admin panel. Still, there is something to be said for your forms not having a static fingerprint, maybe.

Nothing can be done about targeted attacks, I think, besides not being worth the effort, or having moderators and constantly adapting. I wonder how much randomness you can introduce into the posting process, though, before it's no longer worth the effort. Add a CSRF value to the form action? Rearrange the order of the fields? Add a captcha based on an average of the edit distance of an IP's previous posts, with a random threshold?


CSRF won't usually work unless the spam bot is dumb enough to try and change the token. They usually won't touch hidden fields; the thing they like most is <textarea />.

I doubt re-ordering would make much difference either, since they probably key on field names and tags.

CAPTCHAs can work, but you need a difficult one, which will also be difficult for humans. Plus spammers can outsource captcha breaking to humans by injecting them into free porn/torrent/movie sites (solve the captcha to get the content) or just by paying someone in India a few cents per solve.


I wonder whether the problem of captcha farming could be mitigated by using branded captchas (which obviously don't belong to the site they might be injected into) and checking referers, and also by making them complicated enough that they're not worth whatever the typical per-captcha rate is. Something visual, like a puzzle captcha, instead of typing out text.

Then again, that captcha might end up being too complicated for regular users to be willing to fill out as well.


99% of spammers can be stopped simply by IP and email bans.

For which I find this one of the best providers of up-to-date lists... oh, and they have an API I use at registration time: http://www.stopforumspam.com/
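
The registration-time check is a single GET. Something like this sketch (I'm writing the parameter names from memory, so check their API docs):

    // Query stopforumspam at registration time. Parameter names and
    // response shape are from memory - consult their API docs.
    interface SfsResult {
      appears?: number;
      frequency?: number;
    }
    interface SfsResponse {
      success: number;
      ip?: SfsResult;
      email?: SfsResult;
    }

    async function looksLikeSpammer(ip: string, email: string): Promise<boolean> {
      const url = "http://api.stopforumspam.org/api?f=json" +
        "&ip=" + encodeURIComponent(ip) +
        "&email=" + encodeURIComponent(email);
      const res = await fetch(url);
      const data = (await res.json()) as SfsResponse;
      return Boolean(data.ip?.appears || data.email?.appears);
    }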

The last 1%, I let my other users flag, and then I ban them and add their data to the site above.

This isn't a big problem anymore; it just sounds like the author hasn't integrated his forum with the site above.


The principal difference is that Stopforumspam is a blacklisting method, and what I described is a whitelisting one.

Theirs is an effective approach, and it works great for dedicated forum sites. But I think it's overkill for simpler setups, for two reasons. First, it creates an obvious dependency on an external service. Second, if I want to allow unregistered posting, it leaves me with only an IP address as a data point, and there I wouldn't bet that their by-IP detection is all that accurate.

It's a simple, self-contained, virtually maintenance-free way to detect humans trying to post. I don't argue it's superior to other methods, but it is simple and it helps simplify the user experience for real visitors.


And yet, if they are using Google over https, searched for the issue there and found your forum... the lack of a referrer header on your http site, plus no other activity for the user, would have you falsely identify the user as a spammer.

The cost of a false positive is great... you may lose a customer.

By adding an email field to your form you'd get to use the blacklist and avoid false positives and the risk of offending an already aggravated customer.


Ah, no. Of course he's not identified as a spammer. Instead, he is identified as someone who needs to go through additional checks when posting. This can be combined with the stopforumspam check, and that would yield 3 categories of visitors: whitelisted, greylisted and blacklisted.

It's just that in my own setups I haven't come across the need to blacklist people.

PS. If they searched for the issue, they were presumably having it with my software, so it's quite likely they were on the site before and I can recognize them. That's not to say that direct hits into an appropriate thread from https://google aren't possible. But they aren't that common either.


I've been developing this 'invisible captcha' thing for a while now, which is basically meant to block brute-force attacks and spammers alike from posting comments, registering accounts, etc. The visual difference to the end user is - as said - invisible, which obviously should be a good thing. Now that I read your comment, I'm starting to wonder whether anyone would really even use my service, since their site's spam blocking would then depend on me. Are there other people who see it this way?


FWIW I'm pretty sure that my views on external dependencies aren't very common or popular. I think you'd have no problem finding users for your service.


IMO this depends - I'm sure quite a few people factor this into their decision to use it or not, but in the end it comes down to your reputation for being stable/online 99.XX% of the time versus being flaky...


Unless you're Google, I would much prefer to have something I can read the source code of and install myself.


We use stopforumspam on a few large sites.

It misses about 50% of the spammers, there are too many fresh IPs for it to know them all.

It also produces "false positives", in that spammers are now using email addresses from legit users, so an email match alone is no longer good enough (if you don't do email verification).

The cookie method described in the post also won't work in the real world - many times people drop into the very article they're looking for directly from a search engine, read it on the same page, and then leave a comment, so there's no crawl history.


This is not really about how to stop spam, as much as it is a bite-sized statement that we should analyze behavior to tell bots apart from real users.


I've found the best solution is to use a variety of strategies. We recently switched a forum from phpBB to MyBB in the hope that a slightly less popular platform would get less spam... we had a spam user sign up and post within 60 seconds of finishing our conversion.

However, we set up a StopForumSpam plug-in, a time-based plug-in (if you complete the registration page in < X seconds, you're probably a bot), and a custom-questions plug-in, and we haven't seen any spam since. If they start figuring out how to break through those walls, we'll try a few more... (On the other hand, it's a small enough forum that I don't know if we're missing false positives.)
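
A generic version of the time-based check, if your platform doesn't have a plug-in for it (the threshold and the signed-timestamp scheme here are illustrative):

    // Time-based registration check: embed a signed timestamp in the
    // form, reject anything filled out implausibly fast. The 5-second
    // threshold and the token scheme are illustrative.
    import { createHmac } from "crypto";

    const MIN_FILL_MS = 5_000;
    const SECRET = "replace-with-a-real-secret"; // placeholder

    // Emitted as a hidden field when the registration page is rendered.
    function formToken(): string {
      const ts = Date.now().toString();
      const sig = createHmac("sha256", SECRET).update(ts).digest("hex");
      return ts + "." + sig;
    }

    // Checked on submit; forged or too-fast tokens are rejected.
    function filledTooFast(token: string): boolean {
      const [ts, sig] = token.split(".");
      const expected = createHmac("sha256", SECRET).update(ts ?? "").digest("hex");
      if (sig !== expected) return true; // forged or mangled token
      return Date.now() - Number(ts) < MIN_FILL_MS;
    }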

Yes, it's a cat-and-mouse game, but it's one you can generally stay on top of with minimal configuration once you understand what the spammers want and how to use your platform. And I have to admit there's some thrill involved in successfully thwarting their plans...


Switching to another "mainstream" forum software is probably in vain. We have our own software with unusual URLs, navigation, registration (requiring confirmation via e-mail) and still get a couple of spambots trying to post their shortened URLs every day, as well as obviously human spammers who try to participate in discussions for 4-5 posts, then post a few spam URLs.


Also, I'm pretty certain that it's a mixture of bots and humans by now. I'm currently using random questions, and whenever I change them the number of registered bots goes down for 1-2 days. Then it slowly climbs back up over the week, until I think up new questions.

So I guess there's probably someone going over unknown questions regularly and adding new answers.


Nice idea until spammers just make bots that idle on your site and follow a few links before and after signing up.


They said the same about greylisting [1]: that the spammers would just write retry logic into their sending daemons. That was several years ago, back when I enabled it on my mail server, and I have yet to see a single spam that didn't come from Hotmail or Gmail.
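
The whole trick fits in a few lines: temporarily reject the first delivery attempt for an unseen (IP, from, to) triplet and let legitimate MTAs retry. A sketch (the delay is a typical value, not a spec):

    // Greylisting in miniature: 4xx-reject the first attempt for an
    // unseen (ip, from, to) triplet; real MTAs retry, most spamware
    // gives up immediately.
    const RETRY_AFTER_MS = 5 * 60 * 1000; // a typical minimum delay

    const firstSeen = new Map<string, number>();

    function shouldTempReject(ip: string, from: string, to: string): boolean {
      const key = ip + "|" + from + "|" + to;
      const seen = firstSeen.get(key);
      if (seen === undefined) {
        firstSeen.set(key, Date.now());
        return true; // respond "451, try again later"
      }
      return Date.now() - seen < RETRY_AFTER_MS;
    }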

[1] http://en.wikipedia.org/wiki/Greylisting


Sadly, greylisting is an absolute nightmare for server admins on the sending end. We send out time-sensitive emails to our clients, and Yahoo's greylisting could sometimes delay accepting the mail for up to an hour. This causes extreme frustration for our customers, and some even leave as a result. It's even worse because customers are never aware of greylisting, and even if they are, Yahoo won't let them control it. Asking someone to change their main mail provider doesn't end well either. This can also happen with confirmation emails when they first sign up to your site. Spam does indeed cost businesses a lot, directly or indirectly.


I had this problem recently in the custom blog/comment functionality I had written for my company's app.

I've taken a few steps which have completely rid me of spam (for now):

- IP blacklisting: an obvious one, but well worth the few minutes it takes to set up; most of our spam was coming from China.

- Link blocking: our comments don't really require links to work, as the comments are generally short and to the point, and the blog is not used by terribly tech-savvy users.

- Hidden checkbox: add a hidden checkbox to the form. If it comes through as checked you know a human didn't submit the form.
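
That last one is the classic honeypot. A minimal sketch (markup and field name invented):

    // Honeypot checkbox, hidden from humans with CSS, e.g.:
    //   <input type="checkbox" name="contact_me" style="display:none">
    // Bots that tick every box give themselves away; humans never see it.
    function isBotSubmission(form: Record<string, string>): boolean {
      return form["contact_me"] === "on"; // checked => not a human
    }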

Analysing a visitor's progression through the site is a neat idea though - if spam becomes an issue again then I may use this approach (since I'm already gathering this data for custom analytics).


Has Bayesian filtering been tried on forum posts? (It can work well for email.)

If the filter is unsure, the post can be referred to an administrator for moderation (and will subsequently be added to either the spam or ham corpus, training the filter for similar posts).

In the case of false negatives (spam gets through), a discreet "report spam" button will allow a moderator to add it to the spam corpus (again training the filter against similar occurrences).

It might even be possible to use the filter score to reduce "report spam" abuse, i.e. if the filter is fairly certain it's ham, require a larger number of users to report it as spam before bothering an admin with it.
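
For what it's worth, the core of such a filter is small enough to sketch - a toy naive Bayes over word counts (the smoothing and the decision thresholds are arbitrary):

    // Toy naive Bayes spam filter over word counts. The +1 smoothing
    // and any thresholds are arbitrary; a real filter needs a corpus.
    const counts = { spam: new Map<string, number>(), ham: new Map<string, number>() };
    const totals = { spam: 0, ham: 0 };

    function tokenize(text: string): string[] {
      return text.toLowerCase().split(/\W+/).filter(Boolean);
    }

    // Called with moderator decisions and "report spam" outcomes.
    function train(kind: "spam" | "ham", text: string): void {
      for (const w of tokenize(text)) {
        counts[kind].set(w, (counts[kind].get(w) ?? 0) + 1);
        totals[kind]++;
      }
    }

    // Returns P(spam | text), assuming equal priors. Posts scoring
    // near 0.5 go to the moderation queue.
    function spamProbability(text: string): number {
      let logSpam = 0, logHam = 0;
      for (const w of tokenize(text)) {
        logSpam += Math.log(((counts.spam.get(w) ?? 0) + 1) / (totals.spam + 2));
        logHam += Math.log(((counts.ham.get(w) ?? 0) + 1) / (totals.ham + 2));
      }
      return 1 / (1 + Math.exp(logHam - logSpam));
    }

The "report spam" weighting from the last paragraph then just compares spamProbability() against a per-report threshold before bothering an admin.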


I've mentioned this before in previous posts:

  * http://news.ycombinator.com/item?id=4646710 (How to beat comment spam)
in short: https://www.projecthoneypot.org/


Doesn't this solution break down as soon as people come directly to the forum from a search engine? And asking bots to maintain a cookie jar is hardly challenging.


I could just as easily craft my bot to appear more like a user. "Reading" posts, checking PMs, etc. etc.


I'll give it a week until the bots adapt.


See huhtenberg's comment above. It was my experience too - the spammers are just too lazy/busy/cheap to chase the fractions of a percent.


Maybe they're just efficient. Little work for max profit is more efficient than lots of work for max profit. (This is exactly what you said with busy and chasing the fractions of a percent.)



