
How Referral Spam Like lifehacker.com Gets into Google Analytics - NonMathGirl
http://kraftblick.com/blog/referral-spam-in-google-analytics/
======
shortformblog
This has been written about in a few places, including Motherboard (where, full
disclosure, I contribute):

[https://motherboard.vice.com/read/this-pro-trump-russian-is-...](https://motherboard.vice.com/read/this-pro-trump-russian-is-spamming-google-analytics)

Not long after doing the interview with Vitaly, they had to write a follow-up
because the original article ended up being used as GA spam:

[http://motherboard.vice.com/read/spammer-now-spamming-google...](http://motherboard.vice.com/read/spammer-now-spamming-google-analytics-with-motherboard-article-on-spam)

That cycle has repeated itself a few times (The Next Web ran into something
similar a month prior).

To be honest, the real problem here, as far as I can tell, is that Google hasn't
figured out a way to make whitelisting traffic easy in Analytics.

Most people don't use a tenth of GA's overall power, and as a result have no
idea where to look when it comes to building out filters and such. Yes, those
who seriously do analytics for a living know how to handle this, but the
problem is, there are a lot of people who use GA because it "just works," and
it doesn't do the job if someone's just throwing junk into the referrers.

This problem is widespread and it depreciates the value of a fundamental
Google product. Why is a company that's so good at filtering spam apparently
ignoring the problem in this case?

~~~
thewhitetulip
Exactly, I started using GA a few months back, have no clue how to filter this
out.

~~~
marktangotango
Add a filter that only allows data from your domain. I found this on a spammy
content marketing site somewhere. I added this and haven't seen any referral
spam since. I'm not sure what the pros and cons are, maybe someone here can
elucidate?

In GA: Admin -> All Filters -> Add Filter -> Predefined Filter Type -> Include
Only -> Traffic to the hostname -> that contains -> Enter your hostname in the
text box, i.e. mysite.com.
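
For anyone wondering what that Include filter does conceptually, here is a minimal Python sketch. It is not GA's implementation: the hit dictionaries, the `hostname` key, and the example hostnames are made up for illustration. It also hints at one possible con, namely that a "contains" match lets through any hostname that merely includes the string.

```python
import re

# Hypothetical hits. Ghost spam often carries a missing or fake hostname.
hits = [
    {"hostname": "mysite.com", "referrer": "news.ycombinator.com"},
    {"hostname": "www.mysite.com", "referrer": "(direct)"},
    {"hostname": "(not set)", "referrer": "lifehac\u043aer.com"},
    {"hostname": "mysite.com.spam.example", "referrer": "o-o-8-o-o.com"},
]

def include_contains(hits, needle):
    """Keep hits whose hostname contains `needle` (like the predefined filter)."""
    return [h for h in hits if needle in h["hostname"]]

def include_exact(hits, pattern):
    """Keep hits whose whole hostname matches an anchored regex."""
    rx = re.compile(pattern)
    return [h for h in hits if rx.fullmatch(h["hostname"])]

loose = include_contains(hits, "mysite.com")           # keeps 3 hits
strict = include_exact(hits, r"(www\.)?mysite\.com")   # keeps 2 hits
```

The stricter variant corresponds to a custom Include filter with an anchored pattern, at the cost of having to list every hostname you legitimately track.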

~~~
shortformblog
The fact that there are eight steps to get to your suggested solution only
underlines my point that Google needs to create a simple solution to this
problem. It should be three steps, tops.

That said, there are some weaknesses to this solution. Here's why: For larger
businesses, they may have multiple websites, and those websites may need to
work together. Additionally, they may be using GA to track actions through an
app, for example, or steps in a marketing process. This solution might
generally work for individual blogs, but it gets hairy as your needs get more
complex.

And, as it turns out, the spam issues become more frustrating the more complex
your setup gets.

~~~
corobo
It should be zero steps. They run Gmail, surely they have someone on staff
that knows how to block spam? Even if it's an after-the-fact block and stats
tidyup.

Agreed, Google Analytics is basically worthless now unless you run a single
site and have the time to set up filters every time some jackass gets by the
ones already in place.

------
soared
I work at a digital marketing agency and recently had to deal with this. (I
also wrote two blog posts on it, currently unpublished). I have access to ~50
properties in google analytics from work and personal accounts. About 35 of
them were affected. I couldn't discern a pattern on size or type of website.
Very interesting how widespread this was, but very very easy to filter out. It
takes 2 seconds to filter this type of spam from appearing.

A side note though: the author is incorrect on point #3 about how he expects
future spam to behave. This spam is not sent by bots clicking on links and
acting like users. Spammers use the Measurement Protocol [1] to post data to
semi-randomly generated Universal Analytics IDs. They take the referral/UTM
data and keep posting it to random UA-IDs until one succeeds, then hit that ID
much harder.

Previously you could get around this by making your first property (GA
structure is a tree, account > property > view, where you can have multiple
properties and views) a shell. You'd never use the first property and only the
second or third because then your UA-ID goes from 5575393 to 5575393-2. That
"-2" previously wasn't attacked.

[1]
[https://developers.google.com/analytics/devguides/collection...](https://developers.google.com/analytics/devguides/collection/protocol/v1/reference)
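
To make the mechanism concrete, here is a rough Python sketch of what a Measurement Protocol hit body looks like once parsed. The parameter names (`v`, `tid`, `cid`, `t`, `dh`, `dp`, `dr`, `ul`) come from the v1 reference linked above; the values, including the property ID, are placeholders, and `parse_hit` is a name invented here. The point is that every field is supplied by whoever sends the request.

```python
from urllib.parse import parse_qs

# Example form-encoded hit body; all values are placeholders.
body = (
    "v=1&tid=UA-XXXXX-Y&cid=555&t=pageview"
    "&dp=%2F&dr=http%3A%2F%2Fspam.example%2F&ul=Vote%20for%20spam"
)

def parse_hit(body):
    """Flatten a form-encoded hit into a plain dict of single values."""
    return {key: values[0] for key, values in parse_qs(body).items()}

hit = parse_hit(body)
# This body carries no "dh" (document hostname) parameter, so a hostname
# include filter would find nothing to match against.
has_hostname = "dh" in hit
```

Nothing in the payload proves the hit came from a page on your site, which is why the referrer (`dr`) and language (`ul`) fields can carry arbitrary text.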

------
snowwrestler
I have been able to thoroughly block referrer spam by simply setting a custom
segment that only passes traffic that exactly matches my site's hostname. Does
this not work for everyone?

EDIT: I see some comments with advice to set a regex filter. BE CAREFUL WITH
GA FILTERS. They discard traffic permanently! If you misconfigure one, you
could lose real traffic data and never get it back.

This is why I generally prefer to use custom segments, which hide but do not
discard traffic. However, I have to remember to set the segment(s) manually.

If you want to use filters, create a new view and set the filter there. Keep
your original view as a backup.

Best practice with GA is to create two views when you first set it up. Call
one "All Data" and don't touch it. Call the other something like "Reporting"
and set filters etc there. That way you always have a "backup" set of data.
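
The filter-versus-segment distinction can be sketched in a few lines of Python. This is a toy model, not how GA stores data: `View`, `record`, `segment`, and `own_host` are names invented here. A filter runs at collection time, so what it rejects is never stored; a segment is applied when reading and leaves the stored hits alone.

```python
class View:
    """Toy model of a GA view: filters run once, at collection time."""

    def __init__(self, filters=()):
        self.filters = list(filters)
        self.stored = []

    def record(self, hit):
        # A hit rejected by any filter is never stored and cannot be recovered.
        if all(keep(hit) for keep in self.filters):
            self.stored.append(hit)

    def segment(self, predicate):
        # A segment only hides hits in the report; self.stored is unchanged.
        return [hit for hit in self.stored if predicate(hit)]

def own_host(hit):
    return hit["hostname"] == "mysite.com"

all_data = View()                      # the untouched "All Data" view
reporting = View(filters=[own_host])   # the filtered "Reporting" view

for hit in [{"hostname": "mysite.com"}, {"hostname": "(not set)"}]:
    all_data.record(hit)
    reporting.record(hit)
```

If `own_host` were misconfigured, `reporting` would silently lose real hits, while `all_data` would still hold everything and could be re-segmented later.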

------
nodesocket
I had also gotten absolutely swamped in Google Analytics with things like:

    
    
      Language:
        Secret.ɢoogle.com You are invited! Enter only with this ticket URL. Copy it. Vote for Trump!
        o-o-8-o-o.com search shell is much better than google!
        Vitaly rules google *:｡゜ﾟ･*ヽ(^ᴗ^)ﾉ*･゜ﾟ｡:* ¯\_(ツ)_/¯(ಠ益ಠ)(ಥ‿ಥ)(ʘ‿ʘ)ლ(ಠ_ಠლ)( ͡° ͜ʖ ͡°)ヽ(ﾟДﾟ)ﾉʕ•̫͡•ʔᶘ ᵒᴥᵒᶅ(=^ ^=)oO
        Google officially recommends o-o-8-o-o.com search shell!
    
    

However, I recently applied the following account-level filter and it seems to
have helped a ton. I can't guarantee this will 100% work, but it has certainly
helped.

[http://imgur.com/a/z9gSg](http://imgur.com/a/z9gSg)

EDIT: Here is the regex so you can copy and paste it:

    
    
      \s[^\s]*\s|.{15,}|\.|,
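
If you want to sanity-check that pattern before trusting it with a filter, a quick Python test helps. This assumes Python's `re` behaves closely enough to GA's regex engine for this pattern, and that the filter matches anywhere in the value; the sample language values are taken from the spam above plus a few normal ones, and `is_spam_language` is just an illustrative name.

```python
import re

# The language-spam pattern from above.
SPAM_LANGUAGE = re.compile(r"\s[^\s]*\s|.{15,}|\.|,")

def is_spam_language(value):
    """True if the pattern matches anywhere in the language value."""
    return SPAM_LANGUAGE.search(value) is not None

samples = [
    "en-us",
    "zh-cn",
    "(not set)",
    "o-o-8-o-o.com search shell is much better than google!",
    "Vitaly rules google",
]
flagged = [s for s in samples if is_spam_language(s)]
```

Only the last two samples are flagged: normal language codes are short and contain no dots, commas, or pairs of whitespace characters.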

~~~
chinathrow
I see the same spam. At this point I think the filtering by the average GA
user base costs Google way more CPU cycles than an overall firewall ruleset.

~~~
nodesocket
Wait a day. The filter only affects traffic after it was applied.

------
faitswulff
Note that the "k" is a Cyrillic "к" and the trick is to get analytics users to
click referral spam:

> Referral (or ghost) spam wasn’t that innocent. Curious marketers and web
> analysts checked domains they supposedly got traffic from. Referrers got
> them transferred to trashy websites with ads, viruses or porn.

> ...[Vitaly] needed software to get his websites into people’s analytics
> reports. His words were “if only one out of 1000 people click the link to
> see who these referrers are, I’ll gain profit.”

I just installed Google Analytics on my site and saw these weird referrers,
and my site is quite obscure. Seems like this could be a widespread problem.

There's also a fix in the article from Georgi Georgiev:
[http://blog.analytics-toolkit.com/2016/language-spam-latest-...](http://blog.analytics-toolkit.com/2016/language-spam-latest-google-analytics-spam/)

> Log in to your Google Analytics account and navigate to Admin -> Filters
> area. Add a new filter with the following settings. Make sure you have the
> “Edit” access at the “Account” level in Google Analytics. Remember that the
> filter will eliminate traffic hits where the language dimension contains 15
> or more symbols.

> Filter Pattern is: .{15,}|\s[^\s]*\s|\.|,|\!|\/

~~~
soared
Any guesses on the value of the "google.com" with the Cyrillic G? I'd say
$1MM.

~~~
notatoad
i thought that when icann allowed non-latin characters in URLs, they were
going to limit it to only the ccTLD where those characters were in common
usage - so ԍooԍle.ru would be valid, but ԍooԍle.com wouldn't be registerable.

am i misremembering, or did that plan change? this just seems like a phishing
nightmare.

~~~
086421357909764
I actually just ran a query on a domain using Google Domains and was offered a
.com variant of an existing domain, the only difference being the letter c
with a Hacek (Ĉ). Phishing nightmare indeed.

------
alxmdev
> _Lifehacкer.com mimicked lifehacker.com with the only difference in Cyrillic
> letter ‘к’ instead of Latin ‘k’ used in the original traffic source. The
> substitution was obscure._

The Wikipedia article "IDN homograph attack" describes various ways in which
web browsers and ICANN try to protect users against this sort of shenanigan:
[https://en.wikipedia.org/wiki/IDN_homograph_attack#Defending...](https://en.wikipedia.org/wiki/IDN_homograph_attack#Defending_against_the_attack)
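
A quick way to see the substitution for yourself is to list a domain's non-ASCII characters with their Unicode names, and to look at the punycode form of the name. Below is a small Python sketch using only the standard library; `suspicious_chars` is a name made up here, and the `idna` codec is Python's built-in IDNA 2003 implementation, which may differ from what registries and browsers apply.

```python
import unicodedata

def suspicious_chars(domain):
    """Return (char, Unicode name) for every non-ASCII character in `domain`."""
    return [(ch, unicodedata.name(ch, "UNKNOWN")) for ch in domain if ord(ch) > 127]

fake = "lifehac\u043aer.com"   # Cyrillic ka in place of Latin k
real = "lifehacker.com"

found = suspicious_chars(fake)     # one entry, named as a Cyrillic letter
ascii_form = fake.encode("idna")   # punycode form, begins with b"xn--"
```

Checking for non-ASCII characters rather than for a specific script is deliberate: a lookalike such as ɢ (U+0262) is named LATIN LETTER SMALL CAPITAL G, so a "mixed script" test alone would not catch it.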

------
ericdykstra
I use Clicky as my default analytics tool for a new site. It filters out
referral spam automatically which, by itself, justifies the $10/mo for the
basic plan (which covers multiple sites).

[http://clicky.com](http://clicky.com) or
[http://clicky.com/100950546](http://clicky.com/100950546) if you want to use
my referral link.

Referral spam is just a waste of time, and if a site like Clicky can fight it
effectively, I don't see why my Google Analytics is constantly littered with
it. Gmail's spam filter is amazing, but their referral spam filter seems non-
existent.

------
soci
I found a guide to get rid of all referral and language spam in GAnalytics. So
far, works perfect for me. It filters out past spammy data and also adds a new
View to start tracking data without all the spam.
[https://www.ohow.co/ultimate-guide-to-removing-irrelevant-tr...](https://www.ohow.co/ultimate-guide-to-removing-irrelevant-traffic-in-google-analytics/)

------
jffry
I was reading along, about four paragraphs in, and then I was accosted by a
modal dialog: "Get Your Copy of "AdWords for B2B""

Maybe your article is useful, maybe not, I don't know because I closed the
tab, but screw you and your asshole design[1]

edit: No seriously, take a step back and look at your website in a mirror [2].
Popping up _another_ nag as I keep scrolling down? Lord almighty, you've
motivated me to blackhole your domain in /etc/hosts

[1]
[https://www.reddit.com/r/assholedesign/](https://www.reddit.com/r/assholedesign/)
[2] [http://imgur.com/a/Hgq9S](http://imgur.com/a/Hgq9S)

~~~
herbst
She is also spamming reddit badly and then answers with things like "this is
not spam" "i am a girl btw". I blacklisted the site.

~~~
NonMathGirl
It looks like you have troubles with me. I apologize if I made you feel
insecure and victimized. Sorry again, dude.

------
ifrins
I also receive all sorts of spam hits on Google Analytics accounts. My
suggestion is to create a filter that only includes the traffic directed to
your hostname. It filters pretty much everything. Mobile app tracking is
another problem, though. I've yet to find a solution on mobile. Any suggestions?

------
aakarpost
If you want to clean up your historical data in Analytics, you can create a
segment and block the language spam.

\s[^\s]*\s|.{15,}|\.|,

Go to Audience Overview > Add Segment > New Segment > Conditions > Select
Language > Select Matches Regex & enter the above regex. Select Session &
Exclude, give it a name and hit save. Congratulations, now your historical
data is clean.

------
cypherpunks01
Why was the HN title moderated (presumably) from lifehacкer.com to
lifehacker.com?

~~~
Sir_Cmpwn
The title is normalized automatically, I think, to prevent phishing or
something.

------
bhouston
Can't Google just figure out that this is analytics spam?

I suspect a filter similar to the one that detects spam email could be used?
Basically, a bunch of rules would look for similar invalid referrals popping up
all over the place.
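
As a rough illustration of that idea, here is a Python sketch that flags referrers showing up across an unusually large share of unrelated properties. The data, the `flag_widespread_referrers` name, and the 0.5 threshold are all invented for the example; a real system would need a far better allowlist for legitimately popular referrers such as search engines.

```python
from collections import defaultdict

def flag_widespread_referrers(hits, min_share=0.5, allowlist=()):
    """Flag referrers seen on at least `min_share` of all properties."""
    properties = {prop for prop, _ in hits}
    seen_on = defaultdict(set)
    for prop, referrer in hits:
        seen_on[referrer].add(prop)
    return sorted(
        ref for ref, props in seen_on.items()
        if ref not in allowlist and len(props) / len(properties) >= min_share
    )

# (property ID, referrer) pairs; all values are made up.
hits = [
    ("UA-1-1", "google.com"), ("UA-2-1", "google.com"), ("UA-3-1", "google.com"),
    ("UA-1-1", "o-o-8-o-o.com"), ("UA-2-1", "o-o-8-o-o.com"), ("UA-3-1", "o-o-8-o-o.com"),
    ("UA-2-1", "smallblog.example"),
]
flagged = flag_widespread_referrers(hits, allowlist={"google.com"})
```

Here only `o-o-8-o-o.com` is flagged, since it appears on every property and is not on the allowlist.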

------
bhartzer
Sure, we can filter it out, but the real question is, “when will Google take
care of referral spam?”. Referral spam has been plaguing Google Analytics for
a few years now.

For some websites (with less than 100 visitors a day) it’s gotten so bad that
it’s almost making Google Analytics completely useless now.

------
chinathrow
I deal with this Vitaly on a daily basis.

Google failed to block him for months - it would be so easy though. He hits /
on my sites but we never ever serve Google Analytics in this URL, only from
/<lang>. Hence all IPs logging pageviews on / are 100% spam. No false
positives.
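
That rule is simple enough to sketch in Python. This assumes, as described above, that the tracking snippet is only ever served under a language prefix; the prefixes `en` and `de`, the sample paths, and the `is_ghost_hit` name are examples, not the real configuration.

```python
import re

# Paths that actually serve the GA snippet: /<lang> or /<lang>/...
TRACKED_PATH = re.compile(r"^/(en|de)(/|$)")

def is_ghost_hit(page_path):
    """A pageview for a path that never serves the snippet cannot be real."""
    return TRACKED_PATH.match(page_path) is None

paths = ["/", "/en", "/en/pricing", "/de/", "/endless"]
ghosts = [p for p in paths if is_ghost_hit(p)]
```

With these samples, `/` and `/endless` are treated as ghost hits, while anything under `/en` or `/de` passes.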

~~~
thewhitetulip
Yes, the last time I got his spam was Dec 23, and it changes the text: the
o-o-o-h shell, Vitaly, etc. Google's failure to block the spam is huge,
considering that all the money they make is via adverts and that having GA is
critical to their $$$.

------
faitswulff
"Error establishing a database connection"

Looks like HN traffic took it down. Here's Google's cached version (Archive.org
didn't have it):
[https://webcache.googleusercontent.com/search?q=cache:cC67C6...](https://webcache.googleusercontent.com/search?q=cache:cC67C6lKYBYJ:kraftblick.com/blog/referral-spam-in-google-analytics/+&cd=1&hl=en&ct=clnk&gl=us)

~~~
zhego
Their website is live again:
[http://kraftblick.com/blog/referral-spam-in-google-analytics...](http://kraftblick.com/blog/referral-spam-in-google-analytics/)

------
akerro
They don't get into Piwik. [https://piwik.org/](https://piwik.org/)

------
driverdan
> Poor user behavior metrics in Google Analytics negatively affected website
> ranking in Google organic search results.

Is this actually true? Does anyone have a primary source for this?

~~~
angry-hacker
No it's not true because this referral spam never touches your server. Google
does measure bounce rate when you come from Google search results, but this is
entirely different scenario.

------
kevin_thibedeau
I've dropped GA and all other Google services from my sites specifically
because of their inaction in fixing this problem. They know what the spam
domains are. They should be filtered automatically.

------
herbst
Oh god. Not only are you spamming reddit but also here.

(Sorry guys for the unfriendly comment, but the same name on reddit is
basically a "spam my articles" account)

~~~
NonMathGirl
Your persistence in leaving comments here and there for my humble blog post is
incredible!

------
t3ra
Isn't Cloudflare blocking this spam? Because I was managing a WordPress site
for a client behind Cloudflare and they were still able to spam using this
technique.

~~~
hbcondo714
No, the referral spam never hits the website. It instead goes straight to GA.

~~~
soared
hbcondo714 is correct, nothing ever actually touches your website. It's sent
with the Measurement Protocol, which allows you to POST data straight to GA.

[https://developers.google.com/analytics/devguides/collection...](https://developers.google.com/analytics/devguides/collection/protocol/v1/reference)

------
jgalt212
Unicode domain names are just a bad idea.

------
codedokode
I do not understand how domains like ɢoogle (with some unicode G-like
character) are registered? I remember when IDN were introduced there were
concerns about spoofing similar URLs and there were some restrictions on what
characters are allowed.

So now the owner of ɢoogle can add a signup form and start phishing for Google
Accounts?

And I am sure Mr. Trump has no relation to this spam. Obviously it is done to
turn people against him.

------
revelation
So someone is sending fake referrers that get into people's favorite hidden
data collection tool, and these people have summarily complained here. _And
I'm supposed to care?_

Who knew, vacuuming up data from your visitors may not be all it's made out to
be. Someone should hand this guy a medal. I should look and figure out if
browser extensions can manipulate the referrer yet; spamming all those
trackers and analytics services with fake data seems like a wonderful idea.

