80% of my traffic is excluded from Google Analytics (twitter.com/eric_khun)
181 points by eric_khun 22 days ago | 143 comments



I came across the same issue recently and found the answer in the Cloudflare help. Unique visitors on CloudFlare are different from unique visitors of Google Analytics. Google Analytics expects the client to execute JS (most bots don’t do that) and Google Analytics excludes known bots from the unique visitor count. The relevant Cloudflare article: https://support.cloudflare.com/hc/en-us/articles/36003768411...


If the difference between GA and Cloudflare is only bots, then the problem is pretty terrible.

"80% of my traffic is bots"


I’d guess adblocker cuts off Google Analytics. I know mine does.


Having close to 4 out of 5 visitors to an engineering blog running an ad blocker seems about right to me.


If it's an engineering blog, the visitors might not even have JS enabled.


This is one of the reasons I abandoned Adwords. I sell to a technical audience and as far as I could tell most of my customers run an adblocker.


What do you use in its stead?


Not the person you replied to but try https://ethicalads.io from Read The Docs. It's very non-intrusive and I suspect most people will whitelist it if you ask. (I whitelist it anyway)


It also includes non-Chrome browsers which have tracking protection enabled by default, such as Firefox or Brave, and anyone who had installed an ad blocker or has common hosts blocked at the DNS level (which is a checkbox on a growing amount of consumer network hardware).


Firefox doesn't block analytics by default. To test:

1. Open devtools networking

2. Visit any site with analytics

3. See requests to https://www.google-analytics.com/j/collect?... (for GA), https://plausible.io/api/event (for Plausible) etc

EDIT: expanded this into a post https://jefftk.com/p/firefox-does-not-block-analytics-by-def...


Ah, you’re right: it prominently suggests enabling ETP but doesn’t do so without user opt-in.


I develop on FF, and I have to set an IP address exclusion rule in GA so that my local development testing doesn't inflate GA numbers.

Google is a Mozilla donor and I don't think they'd like it if the browser blocked one of their major services. Ublock Origin will block GA and GTM (Google Tag Manager) however.


I've worked on several large websites and that doesn't sound terribly high. Even of just clients that execute JS often 20% of the traffic was also bots.


The main difference is the execution of JS, less so the bots.


Very few non-bot visitors refuse to execute JS.


Javascript in general? Sure.

Telemetry-specific Javascript on the other hand is prevented from executing by many ad blockers, for good reasons.


They don't block JavaScript, they block the loading of resources deemed to be for advertising. These are typically ads, but they can also be images, CSS, or other resources hosted by advertisers. Such blocking will also block any other kind of tracking done by the external domain.

On the other hand, JavaScript inline in the page is almost never blocked by users.


According to this random first result on Google 42.7% of internet users use ad blockers (which is what would usually block the js), and the number increases the techier the audience is.

https://backlinko.com/ad-blockers-users


Browser addons, RSS readers, CLI tools like youtube-dl, light clients like chat thumbnailers, reader modes and so on don't either. This can add up depending on your site, and has a different significance than scrapers.

Ad blockers also often cut GA off.


I block all 3rd party scripts on my browser. The FSF has a patched version of Firefox called Icecat that's set up to do this easily.


There are also users who bounce before JS has time to execute, which could easily be 10-20% of users. (GA is usually loaded via GTAG manager, so it is not executed at page load.)

Even among JS analytics tools, there are differences. In a past job, the main question customers were asking was "why isn't the number of pageviews the same as in GA?".

Everyone detects bots differently and blacklists different IPs, JS execution order differs, and some scripts are blocked by adblockers while others are not.


I'm surprised that this isn't obvious to the tweet author (and everyone else). GA uses JS, Cloudflare is at the DNS level.

As for relevancy - GA can still be useful as it can give you more detailed information which you can't get without JS.


> GA can still be useful as it can give you more detailed information which you can't get without JS.

Detailed information about users who allow the site to execute JS, that is.

I'm not sure how useful such skewed information is.


Depends on the usecase. For example, if you profit via ads then the people you see on GA are the main ones you care about anyway.


> Detailed information about users who allow the site to execute JS, that is.

I think you are massively overestimating the number of people outside the HN crowd that have NoScript or an equivalent turned on.


For most people that is the user base that makes them the most money, so I think it would be skewed in the right direction


> Google Analytics expects the client to execute JS

This isn't applicable to internal analytics, i.e. using the Google Measurement Protocol [1], which is a good way to avoid the cookie-tracking mess altogether and to gain more control over the analytics.

[1] https://developers.google.com/analytics/devguides/collection...


Or is that 80% of HN uses adblock?

> Is GA still relevant today?

Of course it is. While a large part of HN audience might use adblock, very few regular users do and almost no mobile users.

Looking at my logs, however, I get a similar proportion. ~25k in GA, ~100k in logs.


This is the right answer - HN is absolutely not representative of web users at large. If we were, I'm sure there would be much stricter anti-adblock/ad protection systems on large sites.


Twitch and other services have already started down the road of embedding ads in the video stream rather than loading separately, etc. I use an ad blocker and am torn sometimes because I know my favorite content creator isn’t getting my view and I’m contributing to the coming wave of even more invasive ad tech.


I think paying to get rid of ads is fair. I pay for YT Premium solely for that reason, it comes out to about $5/month/person if you use the family subscription.

It’s expensive but I find it’s worth it to not have my partner and our parents watching ads.

I watched the Euro 2020 games, and even the 5-second soundbite of “the games are sponsored by so-and-so” rubs me the wrong way. Why should I be told the name of some company when I’m watching football? I digress.


The problem with pay-to-skip ads is that, as we learned with cable and airlines, companies don't like to leave money on the table. You'll pay, and still get ads. In fact, the fact that you can pay to remove ads makes you an even more lucrative target for ads. There is no winning against the targeted advertisement industry.


I get the hate for increasingly invasive ads (screw the little car that carries the ball), but I chuckled at the last sentence: we've had sponsors on football jersey for 50 years :)


We also had tobacco for many many decades as normal accepted part of life. It took two generations of concentrated efforts, but we're better off without it.

Just because something has been common for long time doesn't make it quite right.


My point wasn’t that it was invasive, but that even 5-second namedrops annoy me. The prominently placed logos on clothing and all over stadiums, even in stadium names lately, also annoy me. However, then I can just focus on the game and ignore the ads. With the “sponsored by” and actual ads, you just can’t ignore it. It is force fed. I think that’s an important difference, and also why forcing me to say some brand name because they renamed the stadium to “COCA COLA ARENA” or something similarly obnoxious is so grating.


It would be interesting to let consumers have access to adtech tools. Instead of a subscription, I could set bidding limits on the ads that would have been shown to me.

I appreciate there are commercial reasons why this won’t happen, but it would be interesting. I’d also personally never trust any of the ad brokers to be honest with individuals without an army of lawyers.


Google tried something along these lines with Contributor (https://en.wikipedia.org/wiki/Google_Contributor), which in its first iteration would bid on your own ad impressions. It wasn't popular, partly because it didn't cover all ads (not every ad goes through a public auction) and partly because it fundamentally cost you money. They later tried another version, which only worked with specific partners but did exclude all ads; it wasn't popular either.

(Disclosure: I work on ads at Google, speaking only for myself)


All Google ads. Not all ads.


Are you talking about the second version? That one did exclude all ads on participating publishers, because to participate you had to set it up that way


But YT Premium only blocks YT ads, it doesn't block Sponsored Content put there by the content creator. Maybe that's okay, since by definition these ads aren't of the tracking variety?

So now you don't need AdBlock, but you still need SponsorBlock. You might be able to get a sponsorless version through their Patreon, but now YT Premium is pointless.


Unfortunately the people with disposable income to pay for premium services is exactly who marketers want to advertise to.


I too think blocking ads should be worth some money. After all everyone saves a lot of time.

I suggest a small machine with pi-hole :)


If folks were using the pay version instead of the Adblock route they might spend less effort on it tbf

Also this way your favourite content creator gets paid as if you saw the ads

https://www.twitch.tv/turbo


See that’s funny, because I’m on twitch everyday and I subscribe to half a dozen streamers and always get a little annoyed when I go to other streams where I don’t subscribe and have to sit through the pre-roll ad.

I had no idea twitch turbo existed.


It's kinda weird they never really mention it. Not great for selling ads if all the people with disposable income don't see any ads I guess


Down the line it could go the way of the cable - you will see ads even if you pay.

Premium users are a perfect segment. They have already shown they are willing to spend money.


Yes, I suspect the people who can afford to pay for these services are exactly the people the advertisers are targeting. Without them the free part of the model would fall apart.


Down the line unsubscribe from Turbo and go back to using Adblock then I guess


I just recently heard about ads on Twitch. Pretty sure my pi-hole is blocking it.


HN is in the vanguard, though. The crowd here is not representative of the state of the global online population at any given time, but one could say that global online sentiment tends to trail the HN sentiment by a number of years. But it's not just HN. Not by any means.

The audience for our website (gambling) tends to be both technologically sophisticated and very conscious of their privacy. Their behaviour has also been quite reliably an early indicator of a wider change. As it stands, more than 80% of page requests block analytics, trackers and other external crapware.

The players in other industries have to prepare for the coming wave, because it will hit them soon. About time, too.


I think for your site that's true but not for HN. Your users are advanced but regular users. HN, on the other hand, are not regular users but professionals.

Car enthusiasts vs mechanics in a way.


If HN was universally the vanguard, a lot more normal people would be using VS code right now.


And most of them would have transitioned from Vim or Emacs.


After iOS added Adblock capability I fully expected them to either add one enabled by default, or ask the user a yes/no question of whether they wanted one. This would make Adblock on mobile easily 25% very quickly, and Android would be forced to follow too.


I don’t think they would add a default adblocker to Safari. You can’t use any other browser engines, and a system wide ad blocker would seem pretty anti-competitive.


> a system wide ad blocker would seem pretty anti-competitive.

Only to advertisement agencies. And they can go cry up a river.


You make it sound like Apple doesn't have a search and mobile advertising arm.

https://searchads.apple.com/


Asking the user yes/no on first start would be easy. Then it’s opt-in.

They stopped tracking by default in safari which is similar.


By your same logic, a secure operating system would be anti-competitive against malware authors, gas and/or electric heating would be anti-competitive against chimney sweeps, and electric cars would be anti-competitive against oil companies.


How so? Apple isn't in the ad business, are they?

Now if Google were to do something like add a default adblocker that blocked all non-Google ads... yeah, that'd be anti-competitive.

But if Apple isn't even competing, how can it be anti-competitive?


I know Apple used to have an ad network, I don't know if it's still running.


iAd is still alive in Apple products, but it’s not an ad network anymore.

https://developer.apple.com/support/iad/

Did I miss some other product?


But do you know how many of your log hits are bots with a spoofed UA that GA filtered out?


I was just looking at one URL that hit the front page a day or two after publication so I don't think there were many bots.


Referral sources usually give away whether the hit is organic or a bot.


Hmm. So what's the other three quarters, is it really all bots or just misattributed?


The other three quarters are HN users with adblock so they don't register in GA but register in Cloudflare (and my logs) because that cannot be blocked.


Firefox blocks GA by default


GA uses first-party cookies so Firefox does not block the data. Only third-party cookies mentioned here: https://www.mozilla.org/en-US/firefox/browsers/compare/chrom...

I just double-checked my real-time analytics and looked back in the last 30 days to double-check. It's all coming through GA while I am in Firefox or other users are using Firefox.

The first vs. third-party cookies are confusing as hell.


No it doesn't. Go to nytimes.com and look in the developer tools for 'gtm.js'. That's Google Tag Manager, which is a container for GA.


I think increasingly more browsers (except Chrome and Edge) come with strong privacy centred defaults nowadays so it doesn’t require a tech savvy person to stay private or anonymous.

I wish Google Analytics had an option to run a small proxy in a Docker container so I could run it on my server and collect all traffic without being blocked.


You can exfiltrate the data via an http request or websocket and have your server submit the pageview hit directly to GA's servers. For me, this makes GA unique visitors numbers about 89% of Cloudflare's.

For this specific application, it's a game (digdig.io), so these 11% might likely be browsers that didn't fully connect to the game, or bots that don't fully support the apis I'm using.
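
A minimal sketch of the server-side submission described above, assuming the Universal Analytics version of the Measurement Protocol (form-encoded `v`, `tid`, `cid`, `t`, `dh`, `dp` parameters). The function name, the placeholder property ID, and the example host and path are illustrative assumptions, not anything taken from the game's actual code:

```python
from urllib.parse import urlencode
import uuid

# Universal Analytics Measurement Protocol collection endpoint.
GA_ENDPOINT = "https://www.google-analytics.com/collect"


def build_pageview_payload(tracking_id: str, client_id: str, host: str, path: str) -> str:
    """Return the form-encoded body for a single pageview hit."""
    params = {
        "v": "1",            # protocol version
        "tid": tracking_id,  # property ID, e.g. "UA-XXXXX-Y" (placeholder)
        "cid": client_id,    # anonymous client ID chosen by your own server
        "t": "pageview",     # hit type
        "dh": host,          # document host name
        "dp": path,          # document path
    }
    return urlencode(params)


# Build (but do not send) a hit for a hypothetical visitor.
example_body = build_pageview_payload("UA-XXXXX-Y", str(uuid.uuid4()), "example.com", "/play")
```

The body would then be POSTed from the server to `GA_ENDPOINT`, for example with `urllib.request.urlopen(GA_ENDPOINT, data=example_body.encode())`, so the visitor's browser never contacts Google directly.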


> For this specific application, it's a game (digdig.io), so these 11% might likely be browsers that didn't fully connect to the game, or bots that don't fully support the apis I'm using.

...or people that simply refuse to correctly resolve google analytics' domains, or people who simply block google analytics' IP addresses at the firewall level...

And you know why people do that? Probably because there are developers out there that think it is a good idea to "exfiltrate the data via an http request or websocket and have your server submit the pageview hit directly to GA's servers".


I don't think you read properly. Simply blocking GA domains won't do anything. Open the website, there isn't a single request to GA. The data is collected when the game starts and sent to the game server which sends it to GA. The 11% is not caused by client side blocking.

Also, I don't care if you don't want some VERY basic info about your browser collected. If you're connecting to my servers, you already gave it to me through the headers, I'm giving it to GA too.


I think it's about developing better log analysis tools. GoAccess does a pretty good job but it's missing a couple of features that would make me entirely stop using GA. Namely: stats over time, and better source grouping.

Even events can be replaced by triggering calls to your own endpoints.


I had the same thing when my game https://termsandconditions.game got to the #1 spot on HN.

I was using https://plausible.io which I presume has a higher success rate than GA, but I still saw actual usage of resources (requests to the CDN) about 4 times higher than reported.

I guess you couldn't pick more ad-block-heavy audience than HN though right?


I use plausible on my site (and tested out a few other similar options), and ublock origin and brave browser both blocked all of them.

So I guess it's not too surprising that it's still the same ratio.


I wish we could just go back to "analytics" being a moderately complicated awk invocation on whatever logs your webserver produces.
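
For illustration, here is roughly what that kind of log crunching looks like, written in Python rather than awk. It is only a sketch: it assumes the server writes Common/Combined Log Format lines, and the function name and regex are my own, not from any particular tool:

```python
import re
from collections import defaultdict

# Matches the start of a Common/Combined Log Format line:
# client IP, two ignored fields, then "[day/Mon/year:" of the timestamp.
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[(\d{2}/\w{3}/\d{4}):')


def unique_ips_per_day(lines):
    """Map each date string to the number of distinct client IPs seen that day."""
    seen = defaultdict(set)
    for line in lines:
        match = LOG_LINE.match(line)
        if match:  # silently skip lines that are not in the expected format
            ip, day = match.groups()
            seen[day].add(ip)
    return {day: len(ips) for day, ips in seen.items()}
```

Calling `unique_ips_per_day(open("access.log"))` on a log file (the file name is a placeholder) gives a rough daily count. Distinct IPs are only a proxy for people, and bots are still included.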


If I need accurate traffic numbers (for performance or server sizing for instance) I use GoAccess (https://goaccess.io/)


It doesn't work if the website is hosted on GitHub Pages or Netlify. Having a CDN such as CloudFlare will make log files on your server useless too.


Netlify offers an add-on for server-side analytics: https://www.netlify.com/products/analytics/


The frustrating thing is that it's cheaper at that point to run a Linode or DigitalOcean instance for $5/mo where you get full logs than to pay $9/mo for Netlify analytics.

It's not a perfect comparison but I am still on the lookout for the cheapest way to host a static site while retaining access to access logs.


How so? Isn't that just for pictures and other media?


Hosting a (static) website on GitHub pages means you put all your resources including html files into a repo, then the whole website will be served from GitHub's server. You don't even have a server to view your access log. Netlify is pretty similar.

As for CloudFlare, all the requests will go through CloudFlare's server for caching, so some requests won't reach your server.


> As for CloudFlare, all the requests will go through CloudFlare's server for caching, so some request won't reach your server.

Only for static files, they don't cache HTML and dynamic contents. So you can still make sense of unique visitors to your site.


That really depends on your setup. They cache whatever you tell them to cache. HTML and dynamic content are also cacheable with good results. For example, you can vary the cache by the user ID cookie. Or for dynamic content you can in most cases do caching for 5 minutes or so.


Goatcounter works on netlify, it's pretty neat.


Turning impression logs into people counts turns out to be a pretty difficult problem. Even with cookies it's still very complicated. Facebook has publicly acknowledged screwing it up several times.

We could go back to the days of talking about "hits" on a website, but for most things where you'd care about these types of metrics, it's a pretty crap metric.


This doesn't work too well nowadays since a lot of visits don't hit my server as Cloudflare itself is serving them the page.


In this case you'd get your logs from CF instead of your origin server.


Why can’t we?


What percent of visits are bots and how do you detect them serverside? That’s the only part I’m unsure of


Can you detect bots client side? I sorta assumed they used headless Chrome and looked like real users nowadays, although I guess with effort you can probably fingerprint?


The default build of chromedriver is pretty easy to detect client side because it injects js variables with known names. Most bots don't really care if they're detected and excluded from analytics.

Unfortunately a few sites unfairly throttled me to 1 request/hour despite the cost of my requests being fuck-all. So a couple of years ago I had to randomize those variable names, distribute requests over 64 IPs, and screw up all their analytics numbers in the process. I hope it was worth saving $0.01 for them.


Most bots are pretty stupid, so it's easy to detect them. Bot detection need not be 100% accurate. It's an arms race with the goal to make writing bots financially unattractive.


For one of the sites I work with, headless Chrome definitely inflates the GA user numbers. We can spot this because they tend to come from big cloud data centre locations when doing IP-to-geo mapping.


> We can spot this because they tend to come from big cloud data centre locations when doing ip to geo mapping.

Of course, you can do that with server side logs, so in this case client side is less accurate:)
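
A rough sketch of that server-side check, assuming you have a list of data centre CIDR ranges from somewhere. The two ranges below are RFC 5737 documentation blocks used purely as placeholders, not real cloud provider ranges, and the function name is made up:

```python
import ipaddress

# Placeholder ranges (RFC 5737 documentation blocks), NOT real cloud ranges.
# In practice this list would come from the ranges cloud providers publish.
DATACENTER_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_datacenter_ip(address: str) -> bool:
    """Return True if the address falls inside any listed data centre range."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in DATACENTER_NETWORKS)
```

Running every client IP in the access log through `is_datacenter_ip` gives a crude "probably automated" flag without any client-side script.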


You can if you track all of the user's online activity and use machine learning to determine if the behavior seems genuine. That's how invisible captcha works, I'd be surprised if google analytics didn't do something similar.

Of course, that does mean you have to track all the user's online activity....


What about GDPR potential violation?


GDPR has exceptions for fraud detection.


A bot does not immediately imply fraud.

If you're processing personal data to detect bots to prevent let's say card fraud, fair enough. If you're processing personal data to detect bots for analytics purposes, this might not fly.

Of course, this assumes someone actually gives a shit about enforcing the GDPR, which is currently not the case.


I have literally worked on bot detection and the high powered company lawyers told me that it's okay for my code to collect and analyze PII for these purposes because GDPR has these exceptions. I'm not a lawyer myself though, so that might have been wrong.


> What percent of visits are bots

It doesn't matter.

> how do you detect them serverside?

Instead of trying to detect whether something isn't human you should be trying to detect whether something is malicious.



And then a user sends a support request with a screenshot showing an in browser error and demands you fix it but your server logs show nothing because this was a JS issue.

While people who use frontend logging/tracking just search the user ID and see what went wrong.


To me, one doesn't have to pre-emptively record every user session on the off chance that they run into a problem large enough to bother contacting support for. Having a 'copy error' button and/or logging the errors to a server seems sufficient here, no need to hire the biggest brother out there to track your users for that.


That has nothing to do with external visitor analytics, though. I've implemented log sending from frontend to backend multiple times. First party, same endpoints as other requests. Not getting blocked.


In Google Analytics?


We see around the same thing and our audience is totally NOT from HN or hacker like. Mixpanel and Cloudflare both report around 886k uniques (mixpanel being 5% less) for yesterday's Friday traffic... while GA reports 129k... that's a solid 80%+ too.
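
Spelling out the arithmetic behind that figure, as a small sketch (the function name is just for illustration): the gap is the share of the Cloudflare/Mixpanel count that GA did not report.

```python
def share_missing(reference_count: int, ga_count: int) -> float:
    """Fraction of the reference count that GA did not report."""
    return 1 - ga_count / reference_count


# 886k uniques in Cloudflare/Mixpanel vs 129k in GA.
gap = share_missing(886_000, 129_000)
print(f"{gap:.1%}")  # prints 85.4%
```

The same calculation on the ~25k GA vs ~100k log figures mentioned elsewhere in the thread gives 75%.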


It's not just Google Analytics, but any other alternative be it plausible, Matomo, or any other JS analytics solution.

I would also say Cloudflare doesn't count unique visitors accurately. Also, recently there has been a big uptick in bots, which server logs show, but which are not real users.


Are bots not unique visitors?


No, of course not. On a small website, the bots are such a huge part of the logs that the whole exercise becomes an attempt at extracting understanding from random noise. It's why those visitor count widgets died out years ago.


True


Dealt with this recently. Something like half of Americans are using some sort of privacy blocker on desktop [1]. It's not as high on mobile but still sizable. My guess is it's higher with the HN crowd.

I dealt with something similar so I had my back-end fire server-side events to compare to Google Analytics and other client-side reported data. Sure enough, 50% loss on desktop and 30% on mobile.

1. https://www.forbes.com/sites/tjmccue/2019/03/19/47-percent-o...


Agreed. I used to support my hosting costs via google adsense. Over recent years the impressions and clicks dropped off a cliff.

In the end I realized I was just as guilty, after all I run adblock too - so I removed them all.


How do you support your hosting costs now?


Out of pocket expenses, mostly. Though I did shuffle things around so that they cost me less these days.


I think that GA samples the data anyway (unless you pay?), so it might not be a reliable "hit counter" these days.

I guess the value comes from seeing "trends" in your site content, i.e. how people move around within the site, which sites are sending traffic, etc., rather than seeing absolute 100% accurate counts.

Log analysis will always be a much more accurate counter if you just need a hit counter.


> <s>80% of my traffic is excluded from Google Analytics</s> Unique visitor count on Google Analytics is 80% lower than the one on Cloudflare.

FTFY.

Cloudflare’s unique visitor count always felt inflated to me, compared to both GA and non-GA analytics solutions.


If you want real numbers just parse your server log. I stopped using analytics years ago because on average it would only show half of my actual traffic. GoAccess works great.


How do you filter bot traffic from your server logs?

If all you are looking for is a 'hit counter', then GA is overkill. Its value is not in providing page traffic data, but in tracking things like click events, ecommerce funnels, marketing campaigns, A/B testing etc.


> How do you filter bot traffic from your server logs?

That's an option for goaccess: --ignore-crawlers

Honestly I am not doing a lot of tracking on my users, the few campaigns I run I simply track within my application. That way I also don't miss any relevant Events.

Also a lot of my traffic is from tech people, so using anything like GA would also only show a small part of the audience I actually mostly care about.
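
If you are parsing logs yourself instead, the crudest version of crawler filtering is a substring match on the User-Agent header. This is only a sketch with a made-up marker list; it is not GoAccess's actual crawler list, and it only catches bots that identify themselves:

```python
# Hypothetical marker list for illustration; real lists are much longer.
CRAWLER_MARKERS = ("bot", "crawler", "spider", "slurp")


def looks_like_crawler(user_agent: str) -> bool:
    """Return True if the User-Agent contains a well-known crawler marker."""
    ua = user_agent.lower()
    return any(marker in ua for marker in CRAWLER_MARKERS)
```

Bots that spoof a normal browser User-Agent pass straight through a check like this.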


Ironically this is how it was done before Google Analytics.


I even remember many refusing to use Google analytics for privacy reasons for a few years until everyone got lazy I guess


You can actually use cloudflare workers to mitigate this. You use a worker to act as a proxy so that most blockers won't block the tracker. More details:

https://blog.garble.org/using-cloudflare-to-increase-analyti...

I don't believe it would scale well to a big site though ($$)


You can do this but please don’t! Respect that users have signalled their intent to not be tracked in this way instead.


Plausible has docs on how to set up proxying: https://plausible.io/docs/proxy/introduction


It definitely depends on the audience you have and the adblocker adoption for each country. "Mainstream" websites still show a 5-10% gap. For marketing purposes that's still OK.


What creeps me out even more is the ads and "automatic Bing searches" that are cropping up in the Windows start menu, and the weather (location) tracker with ads in the taskbar. All this is enabled by default after a Windows update and active while the user is not explicitly browsing something on the Internet. What metrics is Microsoft collecting?


Also worth noting that recent versions of Firefox block GA by default


It seems to allow it here. Fully up-to-date, default settings.


I should clarify it isn't blocked in the traditional sense.

It loads a benign script that pretends to be GA but doesn't touch Google servers at all, to avoid breaking sites that expect GA to be available.


Safari also blocks GA by default, also plausible have some versus articles about Cloudflare analytics (both server and client side)

https://plausible.io/vs-cloudflare-web-analytics https://plausible.io/blog/server-log-analysis


Firefox has had built-in tracking protection for a while and would block Google Analytics. Similarly, Adblock extensions would also block GA (usually behind a strict setting).

Considering we're talking HN visitors, this wouldn't be representative of visitors in general.


Firefox doesn't block analytics in its default configuration; see https://news.ycombinator.com/item?id=27792881 above


I don't know the answer but with my smartphone I'd rather connect to a website through a VPN running on my own server (with pihole and dns sinkhole), than actually browsing any website directly.


So we might be optimising for the wrong things due to survivor bias?


Same, and the strongest reason why I stopped using GA altogether.

People don't talk much about that, I suppose, but to me it's clear GA is completely dead as a product.


This isn't true at all. If you get outside the niche tech field, you'll see that GA is blocked by 6% - 10% of unique visitors... less on mobile, which is growing.


I made a similar observation with Plausible and Cloudflare and wrote about it: https://rugpullindex.com/blog#HowPlausibleareOurWebsiteAnaly...


This link doesn’t seem to work correctly on mobile. It showed the right post for about a second, then scrolled to something from June 2nd and May 27th.


Good.


Hey guys, I’m the founder of Darwin (free analytics) https://www.darwin.so

There are a lot of ways to measure traffic depending if you care about all visits, https requests or just real users. For example, in our research we've found that around 56% of internet traffic is actually headless bots (bots pretending to be human).

If you not only count headless bots but also count http requests as users then you'll have 5/1 ratio as Eric mentioned here on twitter.

- Cloudflare uses cache hits and so on to measure traffic and therefore counts HTTP requests, most of which come from data centers. This means their numbers are highly inflated.

- GA allows HTTP requests as well, but this traffic can be easily filtered. They allow most/all bots and provide no tools to fight this.

- Darwin: we exclude everything we can to try to ensure the best measure of "real" users. This, we believe, is more helpful to marketers.

tl;dr

your real traffic is 10% of what you see in cloudflare,

30% of what you see in unfiltered GA.

With Darwin around 85% of traffic is real.



