How Referral Spam Like lifehacker.com Gets into Google Analytics (kraftblick.com)
185 points by NonMathGirl on Dec 26, 2016 | 50 comments

This has been written about a few places, including Motherboard (where, full disclosure, I contribute):


Not long after doing the interview with Vitaly, they had to write a follow-up because the original article ended up being used as GA spam:


That cycle has repeated itself a few times (The Next Web ran into something similar a month prior).

To be honest, the real problem here, as far as I can tell, is that Google hasn't figured out a way to make whitelisting traffic easy in Analytics.

Most people don't use a tenth of GA's overall power, and as a result have no idea where to look when it comes to building out filters and such. Yes, those who seriously do analytics for a living know how to handle this, but the problem is, there are a lot of people who use GA because it "just works," and it doesn't do the job if someone's just throwing junk into the referrers.

This problem is widespread and it depreciates the value of a fundamental Google product. Why is a company that's so good at filtering spam apparently ignoring the problem in this case?

Exactly, I started using GA a few months back, have no clue how to filter this out.

Add a filter that only allows data from your domain. I found this on a spammy content marketing site somewhere. I added this and haven't seen any referral spam since. I'm not sure what the pros and cons are, maybe someone here can elucidate?

In GA: Admin -> All Filters -> Add Filter -> Predefined Filter Type -> Include Only -> Traffic to the hostname -> that contains -> Enter your hostname in the text box, e.g. mysite.com.
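For anyone wondering what that include filter does to the data, here is a minimal Python sketch of the same logic applied to a list of hits. The hit dictionaries, their field names, and the `include_only_hostname` helper are made up for illustration; GA applies the real filter server-side and does not expose anything like this.

```python
from typing import Iterable, List


def include_only_hostname(hits: Iterable[dict], hostname: str) -> List[dict]:
    """Keep only hits whose reported hostname contains `hostname`.

    Mirrors an "Include Only -> Traffic to the hostname -> that contains"
    filter: every hit that fails the test is dropped.
    """
    needle = hostname.lower()
    return [h for h in hits if needle in h.get("hostname", "").lower()]


# Hypothetical hits: two real ones, two ghost hits with a bogus hostname.
hits = [
    {"hostname": "mysite.com", "referrer": "news.ycombinator.com"},
    {"hostname": "www.mysite.com", "referrer": "google.com"},
    {"hostname": "(not set)", "referrer": "o-o-8-o-o.com"},
    {"hostname": "lifehacкer.com", "referrer": "lifehacкer.com"},
]

kept = include_only_hostname(hits, "mysite.com")
for hit in kept:
    print(hit["hostname"], "<-", hit["referrer"])
```

Two caveats follow from the "contains" match: a hostname like notmysite.com would also pass, and a spammer who sends your real hostname gets through untouched.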

The fact that there are eight steps to get to your suggested solution only underlines my point that Google needs to create a simple solution to this problem. It should be three steps, tops.

That said, there are some weaknesses to this solution. Here's why: For larger businesses, they may have multiple websites, and those websites may need to work together. Additionally, they may be using GA to track actions through an app, for example, or steps in a marketing process. This solution might generally work for individual blogs, but it gets hairy as your needs get more complex.

And, as it turns out, the spam issues become more frustrating the more complex your setup gets.

It should be zero steps. They run Gmail, surely they have someone on staff that knows how to block spam? Even if it's an after-the-fact block and stats tidyup.

Agreed, Google Analytics is basically worthless now unless you run a single site and have the time to set up filters every time some jackass gets by the ones already in place.

Google Analytics already filters out a fair amount of bot traffic (compare your server log files with GA traffic and you'll see what I mean). Why they haven't tackled this problem is a mystery.

The traffic mismatch is not due to any filtering by Google. Most people using some sort of ad-blocking technology have Google Analytics blocked by default, preventing any traffic info from being reported.

I work at a digital marketing agency and recently had to deal with this. (I also wrote two blog posts on it, currently unpublished). I have access to ~50 properties in google analytics from work and personal accounts. About 35 of them were affected. I couldn't discern a pattern on size or type of website. Very interesting how widespread this was, but very very easy to filter out. It takes 2 seconds to filter this type of spam from appearing.

A side note though: the author is incorrect on point #3 about how he expects future spam to behave. This spam is not sent by bots clicking on links and acting like users. Spammers use the Measurement Protocol [1] to post data to semi-randomly generated Universal Analytics IDs. They take that referral/UTM data and keep posting it to random UA-IDs until a hit succeeds, then hit that ID much harder.

Previously you could get around this by making your first property (GA structure is a tree, account > property > view, where you can have multiple properties and views) a shell. You'd never use the first property and only the second or third because then your UA-ID goes from 5575393 to 5575393-2. That "-2" previously wasn't attacked.

[1] https://developers.google.com/analytics/devguides/collection...
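To make the mechanism concrete: a Universal Analytics Measurement Protocol hit is just a set of form-encoded parameters, all of them supplied by the sender. Below is a rough Python sketch that only builds such a payload and sends nothing. The parameter names (`v`, `tid`, `cid`, `t`, `dh`, `dp`, `dr`, `ul`) are the documented ones; the tracking ID, hostname, and the `build_pageview_payload` helper are placeholders of mine.

```python
import uuid
from urllib.parse import parse_qs, urlencode


def build_pageview_payload(tracking_id, hostname, path, referrer, language):
    """Return the form-encoded body of a single pageview hit.

    Nothing here is verified by the receiving end: hostname, referrer,
    and language are whatever the sender chooses to put in them.
    """
    params = {
        "v": "1",                  # protocol version
        "tid": tracking_id,        # property ID, e.g. UA-XXXXX-Y
        "cid": str(uuid.uuid4()),  # anonymous client ID
        "t": "pageview",           # hit type
        "dh": hostname,            # document hostname
        "dp": path,                # document path
        "dr": referrer,            # document referrer
        "ul": language,            # user language
    }
    return urlencode(params)


payload = build_pageview_payload(
    "UA-XXXXX-Y", "mysite.com", "/", "http://example.com/", "en-us"
)
fields = parse_qs(payload)
print(payload)
```

Since the referrer and language fields are free text, that is where the spam messages end up, and since the sender usually doesn't know which hostname belongs to a guessed ID, a hostname filter catches most of it.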

I have been able to thoroughly block referrer spam by simply setting a custom segment that only passes traffic that exactly matches my site's hostname. Does this not work for everyone?

EDIT: I see some comments with advice to set a regex filter. BE CAREFUL WITH GA FILTERS. They discard traffic permanently! If you misconfigure one, you could lose real traffic data and never get it back.

This is why I generally prefer to use custom segments, which hide but do not discard traffic. However, I have to remember to set the segment(s) manually.

If you want to use filters, create a new view and set the filter there. Keep your original view as a backup.

Best practice with GA is to create two views when you first set it up. Call one "All Data" and don't touch it. Call the other something like "Reporting" and set filters etc there. That way you always have a "backup" set of data.

I had also gotten absolutely swamped in Google Analytics with things like:

    Secret.ɢoogle.com You are invited! Enter only with this ticket URL. Copy it. Vote for Trump!
    o-o-8-o-o.com search shell is much better than google!
    Vitaly rules google *:。゜゚・*ヽ(^ᴗ^)ノ*・゜゚。:* ¯\_(ツ)_/¯(ಠ益ಠ)(ಥ‿ಥ)(ʘ‿ʘ)ლ(ಠ_ಠლ)( ͡° ͜ʖ ͡°)ヽ(゚Д゚)ノʕ•̫͡•ʔᶘ ᵒᴥᵒᶅ(=^ ^=)oO
    Google officially recommends o-o-8-o-o.com search shell!

However, I recently applied the following account-level filter and it seems to have helped a ton. I can't guarantee this will 100% work, but it has certainly helped.


EDIT: Here is the regex so you can copy and paste it:


I see the same spam. At this point I think the filtering by the average GA user base costs Google way more CPU cycles than an overall firewall ruleset.

Wait a day. The filter only affects traffic after it was applied.

Note that the "k" is a Cyrillic "к" and the trick is to get analytics users to click referral spam:

> Referral (or ghost) spam wasn’t that innocent. Curious marketers and web analysts checked domains they supposedly got traffic from. Referrers got them transferred to trashy websites with ads, viruses or porn.

> ...[Vitaly] needed software to get his websites into people’s analytics reports. His words were “if only one out of 1000 people click the link to see who these referrers are, I’ll gain profit.”

I just installed Google Analytics on my site and saw these weird referrers, and my site is quite obscure. Seems like this could be a widespread problem.

There's also a fix in the article from Georgi Georgiev: http://blog.analytics-toolkit.com/2016/language-spam-latest-...

> Log in to your Google Analytics account and navigate to Admin -> Filters area. Add a new filter with the following settings. Make sure you have the “Edit” access at the “Account” level in Google Analytics. Remember that the filter will eliminate traffic hits where the language dimension contains 15 or more symbols.

> Filter Pattern is .{15,}|\s[^\s]*\s|\.|,|\!|\/.
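Before applying a pattern like this to real data, it is worth checking it against a few language values. Here is a small Python sketch, assuming the trailing period in the quote is sentence punctuation and not part of the pattern, and assuming partial matching (hence `re.search`). Python's `re` is not the engine GA uses, so treat this as a sanity check only.

```python
import re

# Pattern from the quoted fix, minus the final sentence-ending period (assumption).
LANGUAGE_SPAM = re.compile(r".{15,}|\s[^\s]*\s|\.|,|\!|\/")


def is_language_spam(language: str) -> bool:
    """True if the language value would be caught by the exclude pattern."""
    return LANGUAGE_SPAM.search(language) is not None


samples = [
    "en-us",
    "pt-br",
    "zh-cn",
    "Vitaly rules google",
    "o-o-8-o-o.com search shell is much better than google!",
    "Secret.ɢoogle.com You are invited! Enter only with this ticket URL. Copy it. Vote for Trump!",
]

results = {s: is_language_spam(s) for s in samples}
for value, spam in results.items():
    print("SPAM" if spam else "ok  ", value)
```

Normal language codes are short and contain no spaces, dots, commas, exclamation marks, or slashes, so they pass; the spam strings trip at least one alternative.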

Any guesses on the value of the "google.com" with the Cyrillic G? I'd say $1MM.

i thought that when icann allowed non-latin characters in URLs, they were going to limit it to only the ccTLD where those characters were in common usage - so ԍooԍle.ru would be valid, but ԍooԍle.com wouldn't be registerable.

am i misremembering, or did that plan change? this just seems like a phishing nightmare.

I actually just ran a query on a domain using Google Domains and was offered a .com variant of an existing domain with the only difference, the letter c with a Hacek (Ĉ). Phishing nightmare indeed.

Probably < $0. It's a deceptive domain of a strongly contested trademark - all the Unicode variants are probably just burner domains.

There's also Cyrillic o / о and e / е. And l can be substituted for the capital I or the digit 1. We're looking at quite a few different fake "google" domains. I doubt any of them will be worth a million bucks; they are not that scarce unless someone registers all of them at once to squeeze the market.

Probably not much cause it'd look like гоогle.com or something else silly. The Cyrillic g character would stand out and most Russians would go to yandex before Google so accidental Russian traffic would be barely worth the effort. The print form of cyrillic that would be used in browsers just doesn't have a character to make the domain worth anything.

The vowels on the other hand look the same. ОO, eе, аa.

> Lifehacкer.com mimicked lifehacker.com with the only difference in Cyrillic letter ‘к’ instead of Latin ‘k’ used in the original traffic source. The substitution was obscure.

The Wikipedia article "IDN homograph attack" describes various ways in which web browsers and ICANN try to protect users against this sort of shenanigan: https://en.wikipedia.org/wiki/IDN_homograph_attack#Defending...
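If you want to check a suspicious referrer yourself, a few lines of Python will show which characters are not plain ASCII. This is a crude sketch of my own, not how browsers or registries do it: it flags every non-ASCII character, so legitimate internationalized domains get flagged too. The `to_ascii` helper relies on Python's built-in `idna` codec, which implements the older IDNA 2003 rules.

```python
import unicodedata


def non_ascii_chars(domain: str):
    """Return (character, Unicode name) pairs for every non-ASCII character."""
    return [
        (ch, unicodedata.name(ch, "UNKNOWN"))
        for ch in domain
        if ord(ch) > 127
    ]


def to_ascii(domain: str) -> str:
    """Punycode form of the domain, as it would appear on the wire."""
    return domain.encode("idna").decode("ascii")


for candidate in ["lifehacker.com", "lifehacкer.com"]:
    suspicious = non_ascii_chars(candidate)
    print(candidate, "->", to_ascii(candidate), suspicious)
```

The lookalike comes out with an `xn--` prefix once encoded, which is the form some browsers fall back to displaying when a label mixes scripts.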

I use Clicky as my default analytics tool for a new site. It filters out referral spam automatically which, by itself, justifies the $10/mo for the basic plan (which covers multiple sites).

http://clicky.com or http://clicky.com/100950546 if you want to use my referral link.

Referral spam is just a waste of time, and if a site like Clicky can fight it effectively, I don't see why my Google Analytics is constantly littered with it. Gmail's spam filter is amazing, but their referral spam filter seems non-existent.

I found a guide to get rid of all referral and language spam in GAnalytics. So far, works perfect for me. It filters out past spammy data and also adds a new View to start tracking data without all the spam. https://www.ohow.co/ultimate-guide-to-removing-irrelevant-tr...

I was reading along, about four paragraphs in, and then I was accosted by a modal dialog: "Get Your Copy of "AdWords for B2B""

Maybe your article is useful, maybe not, I don't know because I closed the tab, but screw you and your asshole design[1]

edit: No seriously, take a step back and look at your website in a mirror [2]. Popping up _another_ nag as I keep scrolling down? Lord almighty, you've motivated me to blackhole your domain in /etc/hosts

[1] https://www.reddit.com/r/assholedesign/ [2] http://imgur.com/a/Hgq9S

She is also spamming reddit badly and then answers with things like "this is not spam" "i am a girl btw". I blacklisted the site.

It looks like you have troubles with me. I apologize if I made you feel insecure and victimized. Sorry again, dude.

I also receive all sorts of spam hits on Google Analytics accounts. My suggestion is to create a filter that only includes the traffic directed to your hostname. It filters pretty much everything. But it's another problem in mobile apps tracking. I've yet to find a solution on mobile. Any suggestions?

If you want to clean up your historical data in Analytics, you can create a segment and block the language spam.


Go to Audience Overview > Add Segment > New Segment > Conditions > Select Language > Select Matches Regex & enter the above regex. Select Session & Exclude, give it a name and hit save. Congratulations, now your historical data is clean.

Why was the HN title moderated (presumably) from lifehacкer.com to lifehacker.com?

The title is normalized automatically, I think, to prevent phishing or something.

Can't Google just figure out this is analytics spam?

I suspect a filter similar to the one that detects spam email could be used? Basically a bunch of rules would look for similar, invalid referrals popping up all over the place.
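As a toy illustration of that kind of rule, here is a Python sketch that flags a referrer once it shows up on several unrelated properties with a hostname that doesn't belong to them. Everything in it is hypothetical: the hit format, the `flag_spam_referrers` helper, and the threshold. Nothing here reflects how Google actually processes hits.

```python
from collections import defaultdict


def flag_spam_referrers(hits, known_hostnames, min_properties=3):
    """Flag referrers seen with an invalid hostname on many properties.

    hits: iterable of dicts with "property", "hostname", "referrer".
    known_hostnames: maps a property ID to the set of hostnames it serves.
    """
    seen_on = defaultdict(set)
    for hit in hits:
        expected = known_hostnames.get(hit["property"], set())
        if hit["hostname"] not in expected:
            seen_on[hit["referrer"]].add(hit["property"])
    return {ref for ref, props in seen_on.items() if len(props) >= min_properties}


known_hostnames = {
    "UA-1-1": {"alpha.example"},
    "UA-2-1": {"beta.example"},
    "UA-3-1": {"gamma.example"},
}

hits = [
    # A legitimate referrer that appears everywhere, always with a valid hostname.
    {"property": "UA-1-1", "hostname": "alpha.example", "referrer": "google.com"},
    {"property": "UA-2-1", "hostname": "beta.example", "referrer": "google.com"},
    {"property": "UA-3-1", "hostname": "gamma.example", "referrer": "google.com"},
    # The same junk referrer sprayed across all three properties.
    {"property": "UA-1-1", "hostname": "(not set)", "referrer": "o-o-8-o-o.com"},
    {"property": "UA-2-1", "hostname": "(not set)", "referrer": "o-o-8-o-o.com"},
    {"property": "UA-3-1", "hostname": "(not set)", "referrer": "o-o-8-o-o.com"},
    # A one-off oddity that stays below the threshold.
    {"property": "UA-1-1", "hostname": "staging.alpha.example", "referrer": "intranet.example"},
]

flagged = flag_spam_referrers(hits, known_hostnames)
print(flagged)
```

The hostname check is what keeps popular legitimate referrers from being flagged just because they appear on every property.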

Sure, we can filter it out, but the real question is, “when will Google take care of referral spam?”. Referral spam has been plaguing Google Analytics for a few years now.

For some websites (with less than 100 visitors a day) it’s gotten so bad that it’s almost making Google Analytics completely useless now.

I deal with this Vitaly on a daily basis.

Google failed to block him for months - it would be so easy though. He hits / on my sites but we never ever serve Google Analytics in this URL, only from /<lang>. Hence all IPs logging pageviews on / are 100% spam. No false positives.
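The rule described here is simple enough to sketch in Python. I'm assuming `/<lang>` means a two-letter language prefix such as /en or /de; the regex and helper names are made up for the example, and this only applies to a site that really never serves the tracking snippet on `/`.

```python
import re

# Assumption: tracked pages live under a two-letter language prefix.
LANG_PREFIX = re.compile(r"^/[a-z]{2}(?:/|$)")


def is_tracked_path(path: str) -> bool:
    """True if the page at `path` could have served the analytics snippet."""
    return LANG_PREFIX.match(path) is not None


def split_hits(paths):
    """Separate plausible pageviews from hits that cannot be real."""
    real = [p for p in paths if is_tracked_path(p)]
    spam = [p for p in paths if not is_tracked_path(p)]
    return real, spam


real, spam = split_hits(["/", "/en/pricing", "/de", "/", "/fr/blog/post-1"])
print("real:", real)
print("spam:", spam)
```

Any pageview reported for a path that never loads the tracker must have been posted directly to GA, which is why this particular rule has no false positives on that setup.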

Yes, the last time I got his spam was Dec 23, and it changes its text: the o-o-o-h shell, Vitaly, etc. Google's failure to block the spam is huge, considering that all the money they make is via adverts and GA is critical to their $$$.

"Error establishing a database connection"

Looks like HN traffic took it down. Here's Google's cached version (Archive.org didn't have it): https://webcache.googleusercontent.com/search?q=cache:cC67C6...

They don't get into Piwik. https://piwik.org/

> Poor user behavior metrics in Google Analytics negatively affected website ranking in Google organic search results.

Is this actually true? Does anyone have a primary source for this?

No it's not true because this referral spam never touches your server. Google does measure bounce rate when you come from Google search results, but this is entirely different scenario.

I've dropped GA and all other Google services from my sites specifically because of their inaction in fixing this problem. They know what the spam domains are. They should be filtered automatically.

Oh god. Not only are you spamming reddit but also here.

(Sorry guys for the unfriendly comment but the same name on reddit is basically a spam my articles account)

Your persistence in leaving comments here and there for my humble blog post is incredible!

Isn't Cloudflare blocking this spam? Because I was managing a WordPress site for a client behind Cloudflare and they were still able to spam using this technique.

No, the referral spam never hits the website. It instead goes straight to GA.

hbcondo714 is correct, nothing ever actually touches your website. It's sent with the Measurement Protocol, which allows you to POST data straight to GA.


This is the detail that the article seemed to miss. Or did I miss it?

Cloudflare does block some spammers, and marks them as "bad browser." But it doesn't catch all types of spam attacks.


Unicode domain names are just a bad idea.

I do not understand how domains like ɢoogle (with some unicode G-like character) are registered? I remember when IDN were introduced there were concerns about spoofing similar URLs and there were some restrictions on what characters are allowed.

So now the owner of ɢoogle can add a signup form and start phishing for Google Accounts?

And I am sure Mr. Trump has no relation to this spam. Obviously it is done to turn people against him.

So someone is sending fake referrers that get into people's favorite hidden data collection tool, and these people have summarily complained here. And I'm supposed to care?

Who knew, vacuuming up data from your visitors may not be all it's made out to be. Someone should hand this guy a medal. I should look and figure out if browser extensions can manipulate the referrer yet; spamming all those trackers and analytics services with fake data seems like a wonderful idea.
