
How to block semalt.com referrer traffic using .htaccess - taigeair
http://logorrhoea.net/2014/01/how-to-block-semalt-com-referrer-traffic-using-htaccess/
======
antr
We saw this referral in our stats a few months back, and I was naive enough to
sign up for the service to check it out... it was total rubbish, and I deleted
my account immediately.

Clicky recently tweeted about banning them from the stats
[https://twitter.com/clicky/status/445704464750501890](https://twitter.com/clicky/status/445704464750501890)

------
jonahx
Would someone mind explaining what semalt actually does?

As usual I cannot penetrate the marketing-ese explanation provided by their
homepage.

~~~
qwerty_asdf
Something like... "we can tell you what google probably does with your site
and content as compared to your direct rival, without actually being google,
without involving google, and probably... sort of accurate... and... cheap
enough... based on the relative quality of our sales pitch, all while
satisfying your boss with a ' _report_ '."

------
smacktoward
The normal way to exclude a crawler is to use a robots.txt file:
[http://www.robotstxt.org/robotstxt.html](http://www.robotstxt.org/robotstxt.html)

Any ethical crawler will respect exclusions defined in that file. The article
doesn't mention trying this method before jumping to mod_rewrite, though.

So -- does Semalt's crawler look for and follow robots.txt? As long as it
does, they're doing what they should be doing to let site owners opt out of
crawling.

EDIT: Found a page on their web site where you can enter a domain to opt it
out of crawling:
[http://semalt.com/project_crawler.php](http://semalt.com/project_crawler.php)

No instructions for how to opt out via robots.txt, though. That's a big
omission. Anyone who's going to do mass crawling needs to support robots.txt.
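
For illustration, here is a rough sketch of what "supporting robots.txt" means on the crawler side, using Python's standard `urllib.robotparser`. The user-agent token `ExampleBot` and the rules in `ROBOTS_TXT` are made up for the example; Semalt's real user-agent string is not given in this thread.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a site owner might publish.
# "ExampleBot" is an invented user-agent token, not Semalt's real one.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt allows user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; a real crawler would download
    # /robots.txt from the target host first.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(may_fetch("ExampleBot", "http://example.com/index.html"))
    print(may_fetch("OtherBot", "http://example.com/index.html"))
```

A polite crawler would run a check like `may_fetch` before every request and skip the URL when it returns `False`. Nothing forces a crawler to do this, which is why server-side blocking comes up at all.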

~~~
RyJones
The first comment on the OP says they ignore robots.txt.

------
eli
Logs are going to be full of junk from bots. Unless it's actually breaking
something, IMHO it's better to just accept that and move on.

(If you're using Apache and _do_ want to start fighting bots, I'd suggest
taking a look at mod_security. Very powerful, but beware that the default
rules can be touchy.)

------
bowlofpetunias
I wonder why we still don't have a simple, standard way of dealing with abusive
HTTP traffic, like RBLs for mail.

Way too often I have seen systems getting overrun with rogue traffic, usually
spam bots and vulnerability scanners, and lots of small sites get into serious
trouble because of this.

------
dazc
I have been plagued by semalt for the past 2 months. This is working so far:

SetEnvIfNoCase Referer crawler.semalt.com spammer=yes
SetEnvIfNoCase Referer semalt.com spammer=yes

Order allow,deny
Allow from all
Deny from env=spammer
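
For anyone not on Apache, the same idea (flag the request when the Referer header contains the spam domain, then deny it) can be sketched at the application level. This is only a minimal approximation in Python as WSGI middleware; the names `is_spam_referer` and `block_spam_referers` are invented for the example, and it uses a plain case-insensitive substring check rather than the regex matching that `SetEnvIfNoCase` performs.

```python
# Substrings to look for in the Referer header. "semalt.com" also
# covers "crawler.semalt.com", so one entry is enough here.
BLOCKED_REFERER_PARTS = ("semalt.com",)


def is_spam_referer(referer, blocked=BLOCKED_REFERER_PARTS):
    """Case-insensitive substring check against the blocked list."""
    if not referer:
        return False
    lowered = referer.lower()
    return any(part in lowered for part in blocked)


def block_spam_referers(app):
    """Wrap a WSGI app so flagged referers get a 403 instead of a page."""
    def middleware(environ, start_response):
        # WSGI exposes the Referer request header as HTTP_REFERER.
        if is_spam_referer(environ.get("HTTP_REFERER")):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware
```

Wrapping an existing app is then `app = block_spam_referers(app)`. Doing it in the web server is cheaper, since the request never reaches the application, so treat this as a fallback.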

~~~
z92
Line break is mixed up.

------
davexunit
I've noticed semalt.com in my Piwik analytics a few times now. Going to try
this rule out in my web root and see if they ever show up again.

------
rglover
I've been getting a bunch of hits from this and couldn't figure out why! Glad
someone did some more research on it.

------
BorisMelnik
what is semalt

[http://www.v4technical.com/what-is-semalt-and-how-to-block-i...](http://www.v4technical.com/what-is-semalt-and-how-to-block-it-from-visiting-your-site/)

------
jgroszko
Anybody know how to do this in nginx?

~~~
r12e
[http://wiki.nginx.org/HttpRefererModule](http://wiki.nginx.org/HttpRefererModule)

------
taigeair
interesting new way of spamming

