I came across the same issue recently and found the answer in the Cloudflare help. Unique visitors on Cloudflare are different from unique visitors in Google Analytics. Google Analytics expects the client to execute JS (most bots don’t do that) and Google Analytics excludes known bots from the unique visitor count. The relevant Cloudflare article:
Not the person you replied to but try https://ethicalads.io from Read The Docs. It's very non-intrusive and I suspect most people will whitelist it if you ask. (I whitelist it anyway)
It also includes non-Chrome browsers which have tracking protection enabled by default, such as Firefox or Brave, and anyone who has installed an ad blocker or has common hosts blocked at the DNS level (which is a checkbox on a growing number of consumer network devices).
I develop on FF, and I have to set an IP address exclusion rule in GA so that my local development testing doesn't inflate GA numbers.
Google is a Mozilla donor and I don't think they'd like it if the browser blocked one of their major services. uBlock Origin will block GA and GTM (Google Tag Manager) however.
I've worked on several large websites and that doesn't sound terribly high. Even among just the clients that execute JS, often 20% of the traffic was bots.
They don't block JavaScript, they block the loading of resources deemed to be for advertising. These are typically ads, but they can also be images, CSS, or other resources hosted by advertisers. Such blocking will also block any other kind of tracking done by the external domain.
On the other hand, JavaScript inline in the page is almost never blocked by users.
According to this random first result on Google 42.7% of internet users use ad blockers (which is what would usually block the js), and the number increases the techier the audience is.
Browser addons, RSS readers, CLI tools like youtube-dl, light clients like chat thumbnailers, reader modes and so on don't either. This can add up depending on your site, and it has a different significance than scrapers.
There are also users who bounce before the JS has time to execute, which could easily be 10-20% of users. (GA is usually loaded via Google Tag Manager, so it is not executed at page load.)
Even among JS analytics tools, there are differences. In a past job, the main question customers were asking was "why isn't the number of pageviews the same as in GA?".
Everyone detects bots differently and blacklists different IPs, the JS execution order differs, and some scripts are blocked by adblockers while others are not.
> Google Analytics expects the client to execute JS
This isn't applicable to internal analytics, i.e. using the Google Measurement Protocol, which is a good way to avoid the cookie-tracking mess altogether and to gain more control over the analytics.
This is the right answer - HN is absolutely not representative of web users at large. If we were, I'm sure there would be much stricter anti-adblock/ad protection systems on large sites.
Twitch and other services have already started down the road of embedding ads in the video stream rather than loading separately, etc. I use an ad blocker and am torn sometimes because I know my favorite content creator isn’t getting my view and I’m contributing to the coming wave of even more invasive ad tech.
I think paying to get rid of ads is fair. I pay for YT Premium solely for that reason, it comes out to about $5/month/person if you use the family subscription.
It’s expensive but I find it’s worth it to not have my partner and our parents watching ads.
I watched the Euro 2020 games, and even the 5-second soundbite of “the games are sponsored by so-and-so” rubs me the wrong way. Why should I be told the name of some company when I’m watching football? I digress.
The problem with pay-to-skip ads is that, as we learned with cable and airlines, companies don't like to leave money on the table. You'll pay, and still get ads. In fact, the fact that you can pay to remove ads makes you an even more lucrative target for ads. There is no winning against the targeted advertisement industry.
I get the hate for increasingly invasive ads (screw the little car that carries the ball), but I chuckled at the last sentence: we've had sponsors on football jerseys for 50 years :)
We also had tobacco for many, many decades as a normal, accepted part of life. It took two generations of concentrated effort, but we're better off without it.
Just because something has been common for a long time doesn't make it right.
My point wasn’t that it was invasive, but that even 5-second namedrops annoy me. The prominently placed logos on clothing and all over stadiums, even in stadium names lately, also annoy me. However, then I can just focus on the game and ignore the ads. With the “sponsored by” and actual ads, you just can’t ignore it. It is force fed. I think that’s an important difference, and also why forcing me to say some brand name because they renamed the stadium to “COCA COLA ARENA” or something similarly obnoxious is so grating.
It would be interesting to let consumers have access to adtech tools. Instead of a subscription, I could set bidding limits on the ads that would have been shown to me.
I appreciate there are commercial reasons why this won’t happen, but it would be interesting. I’d also personally never trust any of the ad brokers to be honest with individuals without an army of lawyers.
Google tried something along these lines with Contributor (https://en.wikipedia.org/wiki/Google_Contributor), which in its first iteration would bid on your own ad impressions. It wasn't popular, partly because it didn't cover all ads (not every ad goes through a public auction) and partly because it fundamentally cost you money. They later tried another version, which only worked with specific partners but did exclude all ads; it wasn't popular either.
(Disclosure: I work on ads at Google, speaking only for myself)
Are you talking about the second version? That one did exclude all ads on participating publishers, because to participate you had to set it up that way
But YT Premium only blocks YT ads, it doesn't block Sponsored Content put there by the content creator. Maybe that's okay, since by definition these ads aren't of the tracking variety?
So now you don't need AdBlock, but you still need SponsorBlock. You might be able to get a sponsorless version through their Patreon, but now YT Premium is pointless.
See, that’s funny, because I’m on Twitch every day and I subscribe to half a dozen streamers and always get a little annoyed when I go to other streams where I don’t subscribe and have to sit through the pre-roll ad.
Yes, I suspect the people who can afford to pay for these services are exactly the people the advertisers are targeting. Without them the free part of the model would fall apart.
HN is in the vanguard, though. The crowd here is not representative of the state of the global online population at any given time, but one could say that global online sentiment tends to trail the HN sentiment by a number of years. But it's not just HN. Not by any means.
The audience for our website (gambling) tends to be both technologically sophisticated and very conscious of their privacy. Their behaviour has also been quite reliably an early indicator of a wider change. As it stands, more than 80% of page requests block analytics, trackers and other external crapware.
The players in other industries have to prepare for the coming wave, because it will hit them soon. About time, too.
I think for your site that's true but not for HN. Your users are advanced but regular users. HN, on the other hand, are not regular users but professionals.
After iOS added Adblock capability I fully expected them to either add one enabled by default, or ask the user a yes/no question of whether they wanted one. This would make Adblock on mobile easily 25% very quickly, and Android would be forced to follow too.
I don’t think they would add a default adblocker to Safari. You can’t use any other browser engines, and a system wide ad blocker would seem pretty anti-competitive.
By your same logic, a secure operating system would be anti-competitive against malware authors, gas and/or electric heating would be anti-competitive against chimney sweeps, and electric cars would be anti-competitive against oil companies.
The other three quarters are HN users with adblock so they don't register in GA but register in Cloudflare (and my logs) because that cannot be blocked.
I just checked my real-time analytics and looked back over the last 30 days to double-check. It's all coming through GA while I am in Firefox or other users are using Firefox.
The first vs. third-party cookies are confusing as hell.
I think more and more browsers (except Chrome and Edge) come with strong privacy-centred defaults nowadays, so it doesn’t require a tech-savvy person to stay private or anonymous.
I wish Google Analytics had an option to run a small proxy in a Docker container so I could run it on my server and collect all traffic without being blocked.
You can exfiltrate the data via an http request or websocket and have your server submit the pageview hit directly to GA's servers. For me, this makes GA unique visitors numbers about 89% of Cloudflare's.
For this specific application, it's a game (digdig.io), so these 11% might likely be browsers that didn't fully connect to the game, or bots that don't fully support the apis I'm using.
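A rough sketch of the "server submits the hit" half of this, in Python. It assumes the current GA4 Measurement Protocol (the setup described above may well have used the older Universal Analytics one), and the measurement ID, API secret, and client ID are placeholders. The function only builds the request; nothing is sent unless you call `urlopen` yourself.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the GA4 Measurement Protocol collection endpoint.
GA_ENDPOINT = "https://www.google-analytics.com/mp/collect"


def build_pageview_request(measurement_id, api_secret, client_id, page_location):
    """Build (but do not send) a server-side page_view hit."""
    query = urllib.parse.urlencode(
        {"measurement_id": measurement_id, "api_secret": api_secret}
    )
    body = {
        # A first-party visitor ID that your own server assigns.
        "client_id": client_id,
        "events": [
            {"name": "page_view", "params": {"page_location": page_location}}
        ],
    }
    return urllib.request.Request(
        f"{GA_ENDPOINT}?{query}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values for illustration only.
req = build_pageview_request(
    "G-XXXXXXX", "hypothetical-secret", "555.1234", "https://example.com/"
)
# urllib.request.urlopen(req)  # uncomment to actually submit the hit
```

Because the browser only ever talks to your own server, client-side blocklists never see a request to a GA domain.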
> For this specific application, it's a game (digdig.io), so these 11% might likely be browsers that didn't fully connect to the game, or bots that don't fully support the apis I'm using.
...or people that simply refuse to correctly resolve google analytics' domains, or people who simply block google analytics' IP addresses at the firewall level...
And you know why people do that? Probably because there are developers out there that think it is a good idea to "exfiltrate the data via an http request or websocket and have your server submit the pageview hit directly to GA's servers".
I don't think you read properly. Simply blocking GA domains won't do anything. Open the website, there isn't a single request to GA. The data is collected when the game starts and sent to the game server which sends it to GA. The 11% is not caused by client side blocking.
Also, I don't care if you don't want some VERY basic info about your browser collected. If you're connecting to my servers, you already gave it to me through the headers, I'm giving it to GA too.
I think it's about developing better log analysis tools. GoAccess does a pretty good job but it's missing a couple of features that would make me entirely stop using GA. Namely: stats over time, and better source grouping.
Even events can be replaced by triggering calls to your own endpoints.
I had the same thing when my game https://termsandconditions.game got to the #1 spot on HN.
I was using https://plausible.io which I presume has a higher success rate than GA, but I still saw actual usage of resources (requests to the CDN) about 4 times higher than reported.
I guess you couldn't pick a more ad-block-heavy audience than HN though, right?
The frustrating thing is that it's cheaper at that point to run a Linode or DO instance for $5/mo where you get full logs than to pay $9/mo for Netlify analytics.
It's not a perfect comparison but I am still on the lookout for the cheapest way to host a static site while retaining access to access logs.
Hosting a (static) website on GitHub pages means you put all your resources including html files into a repo, then the whole website will be served from GitHub's server. You don't even have a server to view your access log. Netlify is pretty similar.
As for Cloudflare, all the requests will go through Cloudflare's servers for caching, so some requests won't reach your server.
That really depends on your setup. They cache whatever you tell them to cache. HTML and dynamic content are also cacheable with good results. For example, you can vary the cache by the user ID cookie. Or for dynamic content you can in most cases cache for 5 minutes or so.
Turning impression logs into people counts turns out to be a pretty difficult problem. Even with cookies it's still very complicated. Facebook has publicly acknowledged screwing it up several times.
We could go back to the days of talking about "hits" on a website, but for most things where you'd care about these types of metrics, it's a pretty crap metric.
Can you detect bots client side? I sorta assumed they used headless Chrome and looked like real users nowadays, although I guess with effort you can probably fingerprint?
The default build of chromedriver is pretty easy to detect client side because it injects js variables with known names. Most bots don't really care if they're detected and excluded from analytics.
Unfortunately a few sites unfairly throttled me to 1 request/hour despite the cost of my requests being fuck-all. So a couple of years ago I had to randomize those variable names, distribute requests over 64 IPs, and screw up all their analytics numbers in the process. I hope it was worth saving $0.01 for them.
Most bots are pretty stupid, so it's easy to detect them. Bot detection need not be 100% accurate. It's an arms race with the goal to make writing bots financially unattractive.
For one of the sites I work with, headless Chrome definitely inflates the GA user numbers. We can spot this because the hits tend to come from big cloud data centre locations when doing IP-to-geo mapping.
You can if you track all of the user's online activity and use machine learning to determine if the behavior seems genuine. That's how invisible captcha works; I'd be surprised if Google Analytics didn't do something similar.
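A minimal sketch of that kind of check, assuming you already have a list of data-centre CIDR ranges from somewhere (the two ranges below are documentation-only placeholders, not real cloud provider ranges, and the function names are made up for illustration):

```python
import ipaddress

# Hypothetical stand-ins for a real cloud-provider range list.
# These are RFC 5737 documentation networks, used here only as examples.
DATACENTER_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_datacenter_ip(ip: str) -> bool:
    """Return True if the address falls inside any listed range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in DATACENTER_RANGES)


def split_hits(ips):
    """Separate hits into (other, data-centre) lists, preserving order."""
    other, datacenter = [], []
    for ip in ips:
        (datacenter if is_datacenter_ip(ip) else other).append(ip)
    return other, datacenter


other, datacenter = split_hits(["192.0.2.17", "203.0.113.5", "198.51.100.200"])
```

This only flags where a request came from; a real user on a VPN hosted in a data centre would be flagged too, so it is a hint rather than proof.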
Of course, that does mean you have to track all the user's online activity....
If you're processing personal data to detect bots to prevent let's say card fraud, fair enough. If you're processing personal data to detect bots for analytics purposes, this might not fly.
Of course, this assumes someone actually gives a shit about enforcing the GDPR, which is currently not the case.
I have literally worked on bot detection and the high powered company lawyers told me that it's okay for my code to collect and analyze PII for these purposes because GDPR has these exceptions. I'm not a lawyer myself though, so that might have been wrong.
And then a user sends a support request with a screenshot showing an in-browser error and demands you fix it, but your server logs show nothing because this was a JS issue.
While people who use frontend logging/tracking just search the user ID and see what went wrong.
To me, one doesn't have to pre-emptively record every user session on the off chance that they run into a problem large enough to bother contacting support for. Having a 'copy error' button and/or logging the errors to a server seems sufficient here, no need to hire the biggest brother out there to track your users for that.
That has nothing to do with external visitor analytics, though. I've implemented log sending from frontend to backend multiple times. First party, same endpoints as other requests. Not getting blocked.
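For what it's worth, the receiving side of that can be tiny. Below is a sketch of a first-party error endpoint as a plain WSGI app; the `/client-errors` path, the `message`/`url` fields, and the in-memory `ERROR_LOG` list are all assumptions made up for illustration, not anything from a particular framework.

```python
import json

ERROR_LOG = []  # stand-in for a real log file or database


def client_error_app(environ, start_response):
    """Minimal WSGI app: accept POST /client-errors with a JSON object body."""
    if (environ.get("REQUEST_METHOD") != "POST"
            or environ.get("PATH_INFO") != "/client-errors"):
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"not found"]

    length = int(environ.get("CONTENT_LENGTH") or 0)
    try:
        report = json.loads(environ["wsgi.input"].read(length))
    except ValueError:
        report = None
    if not isinstance(report, dict):
        start_response("400 Bad Request", [("Content-Type", "text/plain")])
        return [b"expected a JSON object"]

    # Keep only the fields we care about; nothing here identifies the user.
    ERROR_LOG.append({"message": report.get("message"), "url": report.get("url")})
    start_response("204 No Content", [])
    return []
```

Since it lives on the same origin as the rest of the site, it is just another request as far as blockers are concerned. It can be served locally with `wsgiref.simple_server.make_server` for a quick try.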
We see around the same thing and our audience is totally NOT from HN or hacker-like. Mixpanel and Cloudflare both report around 886k uniques (Mixpanel being 5% less) for yesterday's Friday traffic... while GA reports 129k... that's a solid 80%+ too.
It's not just Google Analytics, but any alternative, be it Plausible, Matomo, or any other JS analytics solution.
I would also say Cloudflare doesn't count unique visitors accurately. Also, recently there has been a big uptick in bots, which show up in server logs but are not real users.
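Spelling out the arithmetic behind that "80%+" with the two figures quoted (the function name is just for illustration):

```python
def missing_share(server_side_uniques: int, client_side_uniques: int) -> float:
    """Fraction of server-side uniques that never show up client-side."""
    return 1 - client_side_uniques / server_side_uniques


# 886k uniques (Cloudflare / Mixpanel) versus 129k (GA), as quoted above.
gap = missing_share(886_000, 129_000)
print(f"{gap:.1%}")  # 85.4%
```

Note the number lumps together everything GA doesn't see: blocked scripts, bots that never run JS, and visitors who leave before the script loads.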
No, of course not. On a small website, the bots are such a huge part of the logs that the whole exercise becomes an attempt at extracting understanding from random noise. It's why those visitor count widgets died out years ago.
Dealt with this recently. Something like half of Americans are using some sort of privacy blocker on desktop [1]. It's not as high on mobile but still sizable. My guess is it's higher with the HN crowd.
I dealt with something similar so I had my back-end fire server-side events to compare to Google Analytics and other client-side reported data. Sure enough, 50% loss on desktop and 30% on mobile.
I think that GA samples the data anyway (unless you pay?), so it might not be a reliable "hit counter" these days.
I guess the value comes from seeing "trends" in your site content, i.e. how people move around within the site, which sites are sending traffic, etc., rather than seeing absolute 100% accurate counts.
Log analysis will always be a much more accurate counter if you just need a hit counter.
If you want real numbers just parse your server log. I stopped using analytics years ago because on average it would only show half of my actual traffic. GoAccess works great.
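If you'd rather roll it by hand, here is a rough Python sketch of counting unique visitors from an access log. It assumes the common "combined" log format and treats an (IP, user agent) pair as one visitor, which is a crude approximation; the crawler marker list is a made-up short list, far shorter than what real tools use.

```python
import re

# Assumption: "combined" access log format (common nginx/Apache default).
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Crude, hypothetical crawler markers matched against the user agent.
CRAWLER_MARKERS = ("bot", "crawler", "spider", "curl", "python-requests")


def count_unique_visitors(lines):
    """Count distinct (ip, user agent) pairs, skipping obvious crawlers."""
    visitors = set()
    for line in lines:
        m = LOG_LINE.match(line)
        if m is None:
            continue  # skip lines that are not in the expected format
        agent = m.group("agent").lower()
        if any(marker in agent for marker in CRAWLER_MARKERS):
            continue
        visitors.add((m.group("ip"), agent))
    return len(visitors)


sample = [
    '203.0.113.5 - - [10/Jul/2021:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11; Linux x86_64) Firefox/89.0"',
    '203.0.113.5 - - [10/Jul/2021:10:00:05 +0000] "GET /about HTTP/1.1" 200 300 "-" "Mozilla/5.0 (X11; Linux x86_64) Firefox/89.0"',
    '198.51.100.7 - - [10/Jul/2021:10:01:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
]
print(count_unique_visitors(sample))  # 1
```

Bots that send a normal browser user agent sail straight through this, so the result is an upper bound on real visitors rather than an exact count.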
How do you filter bot traffic from your server logs?
If all you are looking for is a 'hit counter', then GA is overkill. Its value is not in providing page traffic data, but in tracking things like click events, ecommerce funnels, marketing campaigns, A/B testing etc.
> How do you filter bot traffic from your server logs?
That's an option for goaccess: --ignore-crawlers
Honestly I am not doing a lot of tracking on my users; the few campaigns I run I simply track within my application. That way I also don't miss any relevant events.
Also a lot of my traffic is from tech people, so using anything like GA would also only show a small part of the audience I actually mostly care about.
You can actually use Cloudflare Workers to mitigate this. You use a worker to act as a proxy so that most blockers won't block the tracker. More details:
It definitely depends on the audience you have and the adblocker adoption for each country.
"Mainstream" websites still show a 5-10% gap. For marketing purposes that's still OK.
What creeps me out even more is the ads and "automatic bing searches" that are cropping up in the Windows start menu and weather (location) tracker with ads in the taskbar. All this enabled by default after a Windows update and active while the user is not explicitly browsing something on the Internet. What metrics is Microsoft collecting?
Firefox has had built-in tracking protection for a while and would block Google Analytics. Similarly, adblock extensions would also block GA (usually behind a strict setting).
Considering we're talking HN visitors, this wouldn't be representative of visitors in general.
I don't know the answer but with my smartphone I'd rather connect to a website through a VPN running on my own server (with pihole and dns sinkhole), than actually browsing any website directly.
This isn't true at all. If you get outside the niche tech field, you'll see that GA is blocked by 6% - 10% of unique visitors... less on mobile, which is growing.
This link doesn’t seem to work correctly on mobile. It showed the right post for about a second, then scrolled to something from June 2nd and May 27th.
There are a lot of ways to measure traffic, depending on whether you care about all visits, HTTP requests or just real users. For example, in our research we've found that around 56% of internet traffic is actually headless bots (bots pretending to be human).
If you count not only headless bots but also HTTP requests as users, then you'll get the 5:1 ratio Eric mentioned here on Twitter.
- Cloudflare uses cache hits and so on to measure traffic and therefore counts HTTP requests, most of which come from data centers. This means their numbers are highly inflated.
- GA allows HTTP requests as well, but this traffic can be easily filtered. They allow most/all bots and provide no tools to fight this.
- Darwin: we exclude everything we can to try to ensure the best measure of "real" users. We believe this is more helpful to marketers.
your real traffic is 10% of what you see in cloudflare,