Hacker News new | past | comments | ask | show | jobs | submit login
The Google Analytics Setup I Use on Every Site I Build (philipwalton.com)
493 points by uptown on Feb 13, 2017 | hide | past | web | favorite | 99 comments

It's pretty annoying that I have to create spam filters for Google Analytics to be useful. Every site I've installed GA on has required me to filter out spam. I don't understand why something isn't done about it at an engineering level. If site owners can set up filters against spammers, is it really that hard for Google to do it? Especially since they can see it across their accounts. Seems like it's the same type of issue that plagues email, yet Google seems to have that under control.

You can get around this with some fairly simple hacks. Write some JavaScript that evals a part of your page or something crazy like loading part of itself from Rot13 text file. Have this js generate an ID you can identify as 'real' or 'fake'. Filter your analytics by this ID. If you want to be extra funny make real and fake IDs look indistinguishable to human eyes.

99.9% of spammers are too lazy to spend any time figuring this out for a single site, and their tools won't even tell them their spam isn't working. I've gotten away with adding a simple static ID to everything, and except for really large, juicy targets, spammers don't even waste time on this.
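Rough sketch of the idea (the checksum scheme and field names here are made up for illustration, not what I actually ship):

```javascript
// Real tokens satisfy a hidden checksum, fake ones are off by one;
// both look identical to a human skimming the reports.
function makeToken(real) {
  // 15 random hex chars plus one checksum char.
  var body = '';
  for (var i = 0; i < 15; i++) {
    body += Math.floor(Math.random() * 16).toString(16);
  }
  var sum = 0;
  for (var j = 0; j < body.length; j++) {
    sum += parseInt(body[j], 16);
  }
  var check = real ? sum % 16 : (sum + 1) % 16;
  return body + check.toString(16);
}

function isRealToken(token) {
  var body = token.slice(0, -1);
  var sum = 0;
  for (var j = 0; j < body.length; j++) {
    sum += parseInt(body[j], 16);
  }
  return (sum % 16).toString(16) === token.slice(-1);
}
```

You'd send the token along as a custom dimension (e.g. `ga('set', 'dimension1', makeToken(true))`) and create a view filter that only keeps sessions whose token passes the checksum; spam hits either lack the dimension or fail it.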

All of my sites get zero spam with this filter

Can you elaborate on that? How do you make the real and fake ids?

The spammer bots are so dumb that using anything besides default "pageview" events seems to work

I assume he means something like the real ids being divisible by 2, while the fake ids are not.

Can you share any details on this implementation? Sounds really interesting.

I love me a good honeypot.

Can you elaborate as to what sort of spam you are referring to? Do you mean bots viewing your pages? Or is it something else?

Referral spam. Pretty common nowadays. Plenty of articles talking about it and showing how to filter it.


He is probably talking about the infamous "vitaly referrer spam". It shows up on the referrer tab of GA dashboard with a really long domain like "merry.christmas.go.to.this.link.to.get.a.price.vitaly.com".

As others have said, referer (sic) spam. Bots often show up in your logs but not Google Analytics if they don't execute JS.

Echoing what others are saying, I much prefer Google Tag Manager. Many clients use a CMS, which makes injecting dynamic variables into a page a bit of a pain if it's not done via rules at runtime.

The Next Web has open-sourced its Google Tag Manager setup (https://github.com/thenextweb/gtm), which has things like Scroll Tracking, Engagement Tracking (riveted.js), Outbound Link Tracking and lots of other things that are not in the default GA setup. They have recently added support for AMP.

In my experience it allows clients to get up and running with a useful GA setup in a couple of hours and means that you as a developer don't get bothered to make trivial changes.

  Scroll Tracking, Engagement Tracking (riveted.js), Outbound Link Tracking and lots of other things that are not in the default GA setup.
I understand why a site owner would want those things, but as a user it is terrifying! This is why I run an ad blocker.

Tracking & analytics are great, but respect for user privacy should be greater. The current state of analytics is too intrusive to user privacy. Hence we decided to get rid of all analytics from our website (https://ipapi.co). </Shameless plug>

Understandable - and I think people don't know enough of what Analytics solutions are doing behind the scenes for sure.

In GA's case, none of this is personally identifiable (Google actively strips info which could be PII out of listings), and the way Scroll Tracking is implemented could only ever be used in aggregate. So that is something.

Google Tag Manager is not very practical in cases where you want a secure website. It renders CSP largely useless because it requires loosened directives like 'unsafe-inline' on scripts and styles. It's good to have on the marketing webpage where no user content is displayed, though.

Also worth noting--depending on your setup, GTM essentially allows arbitrary execution of JS for anyone with access to edit and publish containers.

I'm a more technically inclined marketer, but I make damn sure to check with an engineer before trying anything fancy with JS, and I make sure to test with QA.

But having stuff in GTM more or less means you have separate workflows for code outside of your existing repo. Yes, you can dump a JSON export of the container and commit that, but it definitely can cause some headaches when engineers not super familiar with GTM or how it is set up have to touch things that impact it (or vice versa).

This is a really important point. It can be really helpful to decouple analytics / tracking from the application -- especially in large companies. However you need to fully understand the security implications (don't trust vendor code!), have someone that is technically inclined using it, make it easy for the product teams to debug around it, take ownership of QA, etc.

Can you elaborate on this? I recently had my marketing team ask me to replace GA with Google Tag Manager, and I'm not all that familiar with either, as running the web site is just a minor thing I do on the side of my product work... but this sounds like something I should know more about.

In the simplest terms, Google Tag Manager (GTM) is a script you load on your site that then loads additional (arbitrary) code based on conditions (which are executed client side). In the case of something like GA, let's say you want to track a new website activity or maybe want to pass data to a new custom dimension. Without GTM, you would update your GA javascript file. With GTM, you would make the update within their interface and that new code would get injected into your site. This enables (for right or wrong) the analytics team to manage what / how data gets collected in GA.

Within GTM you have the ability to inject html, javascript, css OR in many cases they have a wizard where someone less technical can put in some info and the code is generated. You can also use conditions to determine which code should load. For example, if page URL matches "orderConfirm" then execute this. Moreover, it gives you the option of when the code should run (immediate, DOM ready, onLoad).
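To make that handoff concrete, here's a hedged sketch of the page-side half (the event and variable names are hypothetical; the tag and trigger that consume them are configured in the GTM web UI):

```javascript
// The page pushes structured data into the dataLayer; a GTM
// custom-event trigger matches `event`, and Data Layer Variables
// read the other fields. (Guarded so the sketch also runs outside
// a browser.)
var dataLayer =
  (typeof window !== 'undefined' &&
    (window.dataLayer = window.dataLayer || [])) || [];

dataLayer.push({
  event: 'orderConfirmed',   // matched by a GTM custom-event trigger
  orderValue: 49.99,         // exposed as a Data Layer Variable
  customerType: 'returning'
});
```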

Aside from GA, most companies have a shit load of 3rd party scripts that run on their pages, so GTM provides a central place to manage everything.

One problem is that anyone with editor access to GTM can inject just about anything (unless you block custom scripts within GTM which has other implications) and those deployments are real-time and done directly within the GTM web interface. I'm a developer on the marketing team, so I'm a fan, but it's risky in the wrong hands.

Thanks, that makes sense, and while I do like to trust my team, that sounds like something I want to keep an eye on. Appreciate the info...

On top of what peter_mcrae said, CSP[1] is a way to sandbox the page in modern browsers. But because GTM basically requires "root" access on the page it makes that sandboxing model useless.

[1]: https://en.wikipedia.org/wiki/Content_Security_Policy
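To illustrate the trade-off, here's a sketch of the kind of policy loosening GTM forces (the exact hosts depend on which tags you run; this is not an exhaustive allowlist):

```javascript
// Returns a Content-Security-Policy header value; with GTM in play,
// inline scripts and third-party tag hosts have to be allowed, which
// largely defeats the sandboxing CSP is meant to provide.
function cspHeader(allowGtm) {
  if (!allowGtm) {
    // Strict policy: same-origin scripts only, no inline code.
    return "default-src 'self'; script-src 'self'";
  }
  return "default-src 'self'; script-src 'self' 'unsafe-inline' " +
         "https://www.googletagmanager.com https://www.google-analytics.com";
}
```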

The only thing I have against Tag Manager is that the documentation is sort of confusing. That is, manually setting up GTM is pretty painful. Luckily most CMSes have plugins.

+1 - As a product person, being able to quickly spin up click tracking on a feature to measure its stickiness is awesome.

With the surge in Ad Blocking recently, part of me wonders how accurate the Google Analytics JavaScript tracker is today, and how accurate it will be in 5 years. I wonder if we'll see a trend back to server-side analytics soon.

Honestly, I prefer the Google Analytics setup.

People who block my tracking scripts don't want to be tracked, so I won't track them.

I use that info to see how people use a product, how they interact with it and what I can do to improve it. Where my time and money will be best spent.

If people want to block them, that's fine, I'm not going to try and get around them, but their "voice" is also muted here. I'm no longer factoring in their usage patterns, their usage at all.

I think the question is: Do you know what percentage of your visitors that applies to?

I don't disagree with your mindset at all, but could you be missing out on a large percentage of users and not know about it?

I very well could, but I'm not going to find out. All it takes is one blogger somewhere to see that I collect stats like "70% of my visitors block tracking" and then the smear pieces start with wonderful headlines like "X tracks users that explicitly disabled tracking" and I lose a significant portion of my user base.

I've had it happen before; I'm not going to make the same mistake again.

It's kind of a hamfisted approach, but if you block tracking, I'm going to treat you like you don't exist. It's not that I'm trying to punish anyone; it's just that analytics seems to be the ONLY reliable source of information about what people use.

Surveys only reach an extremely small subset, or nobody at all (I've had them get a 0% response rate while there were tens of thousands of daily users). Unsolicited feedback is almost 100% negative, and a large percentage of it is nonsensical (things like "I hate the new update"... What am I supposed to do with that?). Requests for feedback on new features or changes might get a few good responses, but I have no way of comparing those responses with actual usage (especially when one gets linked on reddit and suddenly gets 10X the number of responses simply because it was linked on reddit).

And then if you decide to go against what your 3 responses to a problem requested, you'll get more blog posts like "x asks for feedback, does whatever they want anyway". I'm sorry, but I'm done with that. I go by usage numbers and patterns only now.

if you block tracking, I'm going to treat you like you don't exist

Are you saying that you ignore server logs?

I still use them for other purposes, but I'm not going to analyze them for any kind of tracking.

I run a service called Blockmetry [0] that measures exactly that, directly from pageviews. Some numbers get published regularly [1]. The latest public number, from August, is that 5.2% of non-bot, JS-enabled pageviews did not fire the analytics tag.

The short answer is that it's significant on an aggregate level worldwide, but the reality is that it varies _massively_ by country, device, day of the week [2], and even different sections on the same site. Additionally, there is a small percentage of pageviews with JS disabled that you have to account for. An analysis on HN earlier today [4] said 0.2% of pageviews worldwide have JS disabled, but, again, with huge variation (notably Tor, but elsewhere too).

Q4 numbers are not released yet, but the trend is generally up, with some notable drops. Get in touch if you want more info or to set it up on your site [5].

[0] https://blockmetry.com/ [1] https://blockmetry.com/weather [2] https://blockmetry.com/blog/weekday [4] https://blockmetry.com/blog/javascript-disabled [5] https://blockmetry.com/contact

Tbh, Google Analytics samples at scale and people who are blocking like that aren't affecting the results much as a result. Well, unless they have truly "unique" patterns of using the UI specific to that demographic.

I think JavaScript analytics is more or less here to stay. A broader move to server-side analytics depends on what you're going to use the data for. When I want clean(er) data for important metrics, like revenue/conversion rate for eCommerce sites, I implement a hybrid JS/back-end solution where I send important data to the GA API or Mixpanel via some back-end service [1]. I've found that revenue data in GA, compared to the database on a number of sites I have consulted for, can be off quite a bit, sometimes by +/- 25%, depending on how the JavaScript has been implemented.

With larger businesses, you'll probably see more server-side implementations as they have the budgets to ensure the data they're collecting is accurate. For a blogger or a small publisher without a dedicated tech team, there's nothing easier than dropping in a script tag and watching the data roll in.
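As a sketch of what the back-end half of such a hybrid setup can look like, here's a rough hit builder for GA's v1 Measurement Protocol (the `tid` tracking ID and `cid` client ID below are placeholders):

```javascript
// Build a Measurement Protocol (v1) hit body server-side; you would
// POST this to https://www.google-analytics.com/collect so the numbers
// come from the order-processing code, where they are authoritative.
function buildHit(params) {
  var base = { v: '1', tid: 'UA-XXXXX-Y', cid: '555', t: 'event' };
  for (var k in params) { base[k] = params[k]; }
  return Object.keys(base)
    .map(function (k) { return k + '=' + encodeURIComponent(base[k]); })
    .join('&');
}

// e.g. record a purchase event with its value in cents:
var hit = buildHit({ ec: 'ecommerce', ea: 'purchase', ev: 2599 });
```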


On technical sites I'm seeing traffic cut at least in half by GA against traffic measured server side (excluding bots), which isn't too surprising. On others by maybe a third. It is probably gradually becoming more and more inaccurate as users install blockers. I doubt many GA users are aware of this though. Anyone else seeing this?

Does every adblocker block GA by default? uBlock sure does, but a few years ago AdBlock (Plus?) did not.

I wonder if anybody's done a mashup to compare/correlate GA figures with server-side ones from logs.

> see a trend back to server-side analytics soon.

We can only hope.

The percentage of visitors with an ad blocker depends on your site's audience. Outside of computer geeks and gamers, almost no one uses ad blockers. I wouldn't buy into the hype that the whole world is installing ad blockers.

We're seeing 30-50% of visitors blocking Google Analytics client side tracking at 80000hours.org. Target audience is university educated 18-30 year olds, ~100K unique visitors / month.

The folks at Segment.io warn their users to expect ~20%, with the caveat that blocking rates vary wildly between demographics [1].

[1] https://community.segment.com/t/1889n1/how-common-is-client-...

I encourage friends and family to install ad blockers. I've been asked numerous times over the years to look at a computer that has either died or just isn't working well anymore. I always offer to install an adblocker after the actual problem is addressed and about everyone leaves with an adblocker.

It's not technically difficult at all. It takes a few clicks to install, and one click to disable on the minority of sites that don't work with ad blockers.

I highly doubt I'm entirely atypical.

Actually, the number of people using ad blockers more than doubled between 2015 and 2016, according to the PageFair report [1].

[1] https://pagefair.com/blog/2016/mobile-adblocking-report/

"More than doubled" doesn't mean much if it's 0.0000001% to 0.0000002%. It certainly wouldn't refute what I said.

You're right, but the number of ad-block users is significant now: around 20%, according to the same PageFair report.

I have also personally seen around 30% users use ad-blockers, for a site with around 100,000 visitors a day. However, most of the audience for that site is people in twenties, so it's not surprising to see higher than average ad-blocker usage.

Reports are from 20-30% of desktop and mobile users make use of some sort of adblocker/privacy tool. That is a very large set of users, and in some industries and domains the numbers go much higher.

Purely for self-protection/anti-aggravation I absolutely recommend it to every casual user I advise.

I don't believe that figure at all for the general populace.

Why not? Try sharing a link to uBlock Origin on Facebook, with text to the effect that installing the add-on from the link will remove 90% of the annoying clutter from the web, make everything load faster, and protect their computer from malware.

Realize that the link, in either the Chrome or Firefox case, will be to the official add-on site, and in the mind of the user it's safe, doubly so since it came from someone trusted.

What percentage of people who see it will spend the 3 clicks to install it?

* Note: I know you have no incentive to actually do this; it's hypothetical. But many people are encouraging non-techies to install blockers, and have been for a very long time, so the percentage who are aware of ad blocking keeps increasing.

It's 50 percent on my work site and we are general entertainment for the 20 to 35 year old demographic.

Do you have any sort of citation or is this just opinion? If you have evidence, please share.

Is there any data on the number of people using ad blockers? I've personally seen a massive increase in the number of non-techy friends using them. A lot of people for example stream sports (illegally) and those sites are barely functional without a blocker.

I don't have those numbers. I know there have been some publications lately that suggest there is an increase, which makes sense since I doubt the number would decrease, but like browser usage stats, those numbers are heavily dependent on the audience they were collected from.

My non-technical friends have never mentioned ads to me before in the context of the web. I doubt that means they appreciate ads on sites but I don't think it occurs to them that they need to find a way to remove them. I think they appreciate that Hulu lets you pay to remove them, or that Netflix doesn't have ads, but I never hear "this website sucks because of ads." They just assume it's the way of things.

This is correct. The vast vast majority of people are not using adblock.

Some time ago I read an interesting interview [1] with the Economist's deputy editor, Tom Standage, saying things like:

> The other thing about ads is that 41 percent of millennials are using ad block. My daughter has ad block and she goes around infecting every machine she gets to. She puts it on everything.

> But the other thing is that she lives in incognito mode. She’s a total nightmare for advertisers, because she’s not leaving any cookies and she’s not seeing any ads.

Digital privacy is an undeniable rising trend. Just stating the vast majority of people are not using adblock is, at minimum, shortsighted.

[1]: http://www.niemanlab.org/2015/04/the-economists-tom-standage...

> But the other thing is that she lives in incognito mode. She’s a total nightmare for advertisers, because she’s not leaving any cookies and she’s not seeing any ads.

Seems like she has the right idea to be honest.

> She’s a total nightmare for advertisers

Poor things, back to not knowing which 50% of the budget they're wasting, like it's the 20th century.

The horror.

Do you have a source for this? Your statement doesn't carry any weight if you can't back it up. Anecdotally, I have seen a rise in non-tech-savvy people using ad blockers, but I'd be interested to see some hard data.

Reading a couple of slides in, this report applies to Asia-Pacific demographics.

It provides global statistics too.

Please don't contribute to Google's tracking dominance over the web. How insane is it that one company runs their javascript on 90% of the web?

What are the best alternatives?

Heap Analytics -- https://heapanalytics.com

And I run everything through Segment rather than embedding individual trackers directly. Segment either relays data to other services or loads required JS on page load.

The long-term solution is usually building an in-house analytics service, but that's neither easy nor cheap. We're also developing a generic analytics platform that can be installed with one click on your favorite cloud provider, so you can store your event data in your own database and build an internal analytics service without coding. Check out: https://github.com/rakam-io/rakam

https://piwik.org isn't bad, but last time I checked it was blocked by uBlock Origin just like GA.

https://goaccess.io — simply awesome

It is actually easy to implement GA-like functionality in JS, send the event data to the server, and process it e.g. daily. You can start slowly, e.g. track only clicks (make it sync), visit time and scroll (both async).

Of course my homegrown analytics reporting is far from Google's, but at least I have found a great balance between getting useful usage data on my sites, and at the same time respecting the visitors' privacy.
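A minimal sketch of that kind of homegrown collection, assuming a first-party `/collect` endpoint on your own server (the endpoint and field names are made up):

```javascript
// Build a small event record; pure, so it can be reasoned about and
// tested independently of the browser.
function buildEvent(type, extra) {
  var evt = {
    type: type,  // 'pageview', 'click', 'scroll', ...
    page: typeof location !== 'undefined' ? location.pathname : '',
    ts: Date.now()
  };
  for (var k in extra) { evt[k] = extra[k]; }
  return evt;
}

// Ship it to your own server: sendBeacon survives page unload;
// fall back to fetch with keepalive where it's unavailable.
function sendEvent(evt) {
  var body = JSON.stringify(evt);
  if (typeof navigator !== 'undefined' && navigator.sendBeacon) {
    navigator.sendBeacon('/collect', body);
  } else if (typeof fetch !== 'undefined') {
    fetch('/collect', { method: 'POST', body: body, keepalive: true });
  }
}
```

Server-side you can then aggregate daily, which keeps the raw data on your own infrastructure.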

Keen IO is a great alternative. Full disclosure: I'm a product manager at Keen. That said, we do help some of our customers move off of GA to gain more control and flexibility in their analytics.

I last tried keen a few years ago, and really liked it.

The one thing I didn't like, and the reason I stopped using it, was the pricing, which it looks like has now been updated to be more realistic.

Definitely checking it out again now.

Thanks for the comment Josh. We just completely revamped our pricing. I'd love to get your feedback on what we've done, especially given your previous experience

https://piwik.org seems to be the go-to open source solution.

piwik or just grep your nginx logs.

Grep? Is that something I need a sysadmin for? /s

Quick plug for a log analyser: https://goaccess.io which I installed after realising it did all the things I actually wanted from GA (I don't actually need detailed user tracking)

Grep is unfortunately close to useless. It gives little to no insights on how long/deep a user stayed on a page, or where they went after that. Or what form fields are getting in the way of users from completing whatever you'd like them to do.

For commercial sites a solution like Adobe marketing cloud (nee Omniture) is an option


It's a tough call, especially if your revenue model is ad-based. The ad networks only trust third-party analytics.

Remember that it's mandatory to disclose to visitors that your site uses Google Analytics, per Google's terms https://www.google.com/analytics/terms/us.html (section 7, 'Privacy'). I don't see a privacy policy on this Google employee's page, but perhaps they have a special exemption?

Anyhow, for many websites you'll get more accurate traffic data with GoAccess parsing your logs and showing you page views and basic demographic data. Use it alongside Google Analytics if you must, to see the exact difference between what Google tells you your page views were versus what your server tells you.

> for many websites you'll get more accurate traffic data with GoAccess parsing your logs and showing you page views and basic demographic data

Yes but remember that bot traffic may be more of an issue when analysing server side logs (a lot of bots still don't execute JavaScript).

It's hard to know how effective the bot filtering features in GoAccess are compared with those of Google Analytics.

> a lot of bots still don't execute JavaScript

I operate a service that measures this (see another comment on this discussion), and all I'll say is you'll be very surprised how many bots actually execute JS, especially stealth bots. You have to be careful either way.

> all I'll say is you'll be very surprised how many bots actually execute JS

Interesting. Do you have any numbers you can share?

I don't have access to the raw log files from the customers, so can't give you a percentage. All I'll say confidently is that my service processes a lot of bot traffic that needs to be filtered out before reporting.

BTW, are you the same Peter Hartree on this Segment thread? https://community.segment.com/t/1889n1/how-common-is-client-... It would appear we've crossed paths before on this topic. Please do email me if you want to talk properly. That Segment thread has my email.

It's very easy to navigate sites programmatically, including script execution.


There is an --ignore-crawlers option that works well for me. Next thing I'll try to get working is to have the --ignore-referer=<referer> option parse Piwik's referer spam blacklist https://github.com/piwik/referrer-spam-blacklist

Not many people know about this feature of GA, but add the following to anonymize your users' IP addresses before sending the information to Google.

> ga('set', 'anonymizeIp', true);
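For context on ordering: the setting has to come after `create` and before the first hit. The command-queue stub below mirrors the one the official snippet installs before analytics.js loads, so the sequence can be sketched without the library (the tracking ID is a placeholder):

```javascript
// Minimal analytics.js command-queue stub: commands pile up in ga.q
// until the real library loads and replays them in order.
var ga = function () { (ga.q = ga.q || []).push(arguments); };

ga('create', 'UA-XXXXX-Y', 'auto');
ga('set', 'anonymizeIp', true);  // must precede the first hit
ga('send', 'pageview');          // this hit now carries &aip=1
```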

> anonymize your users IP addresses before sending the information to Google

That's a nice placebo that does almost nothing. Even if the packet body doesn't contain the IP address, it's still available in the IP header's Source Address field.

However, even if we assume Google - in a reversal of their general focus on gathering as much data as possible - doesn't recover the address from the IP header, their own documentation[1] for analytics collection URLs with the &aip=1 parameter (which should be present when 'anonymizeIp' is true) says:

    "... the last octet of the user IP address
     is set to zero ..."
Zeroing the least interesting 8 bits of the address doesn't make it anonymous. They still get to record the ASN, and they are recording at least 8 bits of fingerprintable data from other sources. It should be trivial to recover mostly-unique users, and calling this "anonymization" is at best naive and, for Google, an obvious lie.

Their documentation even betrays their intentions:

    "This feature is designed to help site owners comply
     with their own privacy policies or, in some countries,
     recommendations from local data protection authorities,
     which may prevent the storage of full
     IP address information."
Actually making the data anonymous isn't the goal. They just want a rubber-stamp feature that lets them comply with the letter of the law.

[1] https://support.google.com/analytics/answer/2763052?hl=en

It looks like navigator.sendBeacon is not very well supported across browsers. [1]

Is this really a good idea?

1: https://developer.mozilla.org/en-US/docs/Web/API/Navigator/s...

analytics.js falls back to the older methods in browsers that don't support it.
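The feature detection involved is simple; a hedged sketch of the equivalent logic (analytics.js's own `transport: 'beacon'` setting degrades the same way):

```javascript
// Prefer navigator.sendBeacon where available (it survives page
// unload); otherwise fall back to the classic image/XHR transport.
function chooseTransport(nav) {
  return (nav && typeof nav.sendBeacon === 'function') ? 'beacon' : 'image';
}
```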

Alternative title: The Spyware I Use on Every Site I Build

Tag Manager is definitely preferable in my experience if you want to empower non technical people such as marketing to make their own changes on the fly without having to bother developers.

Yup, GTM is great until the folks in marketing add a script to the site without first testing in preprod that causes the UI for the app to not render.

You clearly have a more skilled marketing team than the one I tried to work with using GTM. I ended up dropping it because rather than implementing tracking Javascript in a text editor I was having to do it in an obtuse GUI instead - marketing wouldn't go near it.

It's not so safe: if non-developers just copy-paste code given by a third party, they might inject insecure JS.

Don't feed Google with your visitors' data, respect their privacy, use open source Piwik instead.

What's the deal with stats delayed over 24 hours? Man, I hate that.

Beyond this info, I'd add my own suggestions from having spent a good portion of my career digging around in GA...

- If you have multiple domains, sub domains, etc. make sure to spend plenty of time reviewing the cross-domain setup documentation and test it thoroughly.

- If you have high volume, frequently do deep segmentations, use lots of custom dimensions, etc., make sure you have a clear understanding of how sampling in GA works, how to tell if you are being sampled, and find ways to avoid it by pulling reports in different ways. Otherwise you can end up in a situation where you are making decisions off of .3% of your traffic and while Google's sampling algorithm thinks it is fine, comparison against other data sources often shows it is not.

- Make sure any reporting you do across things like GA vs. AdWords is done with a clear understanding of how they each report on paid search. GA reports on it by default on a last non-direct click basis. AdWords just counts everything AdWords touches. This means that AdWords can give you a good sense of where you are gaining traction, whereas GA can help you understand how it works in conjunction with other touch points, and perhaps how you might change the way you weight things and measure success.

- GTM is powerful and free, but with great power comes great responsibility. Also, it can be a real PITA sometimes.

- Annotations are a highly underutilized tool in GA and can save you a lot of headaches. I just wish there was a way to bulk import/export them via spreadsheet or API.

- You can't currently create goal funnels from event-based conversions (please Google add this!), but the workaround for the time being is to push virtual page views at the same time as the event fires, and then create funnels off of those.

- User stitching sounds awesome, but is actually much more limited than you'd think from reading the overview. You need a separate view (which means the main GA view you use can't segment for the stitched sessions for comparison; only the new view contains the stitched users). And there's a 90-day rolling data retention window, so you need some sort of export process if you care about that data. Unfortunately, this is pretty important data if you have lots of cross-device tracking issues.

- Depending on your volume, you can reach the hit limits of the free tier pretty quickly if you start tracking a ton of events (since they all count as hits). Here's a good overview [1] of what these limits are, how they work, and what they mean for you. When I got the scary notification, Google was exceptionally unhelpful in working with me to resolve the problem, despite considerable ad spend. After reducing them to what we thought would be fine, they were unable to assure me that our data would not be nuked, and basically couldn't give me any real info beyond "this is the policy." Super frustrating.

- If you have good logging of events that tracks both server and client-side, it is healthy to compare for variances monthly or quarterly. You'd expect client-side tracking to break more often than server-side, but it is important to see how much that can alter your numbers.

[1] https://www.e-nor.com/blog/general/hit-count-in-google-analy...
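The virtual-pageview workaround from the list above can be sketched roughly like this (the paths and names are hypothetical; the stub stands in for the loaded analytics.js):

```javascript
// Command-queue stub, as installed by the official snippet.
var ga = function () { (ga.q = ga.q || []).push(arguments); };

// Fire a synthetic pageview alongside each funnel event so a goal
// funnel can be built from the virtual URLs, which event-based
// conversions alone don't allow.
function trackFunnelStep(category, action, virtualPath) {
  ga('send', 'event', category, action);
  ga('send', 'pageview', virtualPath);  // e.g. '/virtual/signup/step-2'
}
```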

Filtering out GA sessions with the language of "C" (versus actual languages like en-us, fr, etc.) goes a long way in filtering out GA spam.

This language code is 99% of the time associated with bots. I had one site where 20% of all the sessions in a given month were such fake traffic!
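If you want to pre-filter client-side rather than rely only on a GA view filter, a rough sketch of the predicate (the regex is deliberately loose, not a full BCP 47 validator):

```javascript
// Real browsers report a BCP 47-style tag (en-us, fr, pt-br, ...);
// spam bots often report the POSIX default locale "C" or nothing.
var LANG_RE = /^[a-z]{2,3}(-[a-z0-9]{2,})?$/i;

function looksLikeBotLanguage(lang) {
  return !lang || !LANG_RE.test(lang);
}
```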

Isn't that a relatively easy thing for a spammer to change? Also, I'm seeing some valid traffic coming in with that language (conversions and everything).

Most won't bother

Sure, but even still, if I'm seeing valid traffic with this, then on its own it isn't sufficient to use as a filter.
