"80% of my traffic is bots"
1. Open the devtools Network tab
2. Visit any site with analytics
3. See requests to https://www.google-analytics.com/j/collect?... (for GA), https://plausible.io/api/event (for Plausible) etc
EDIT: expanded this into a post https://jefftk.com/p/firefox-does-not-block-analytics-by-def...
Google is a Mozilla donor and I don't think they'd like it if the browser blocked one of their major services. uBlock Origin will block GA and GTM (Google Tag Manager), however.
Ad blockers also often cut GA off.
Even among JS analytics tools, there are differences. At a past job, the main question customers were asking was "why isn't the number of pageviews the same as in GA?".
Everyone detects bots differently, blacklists different IPs, JS execution order differs, and some scripts are blocked by adblockers while others aren't.
As for relevancy - GA can still be useful as it can give you more detailed information which you can't get without JS.
Detailed information about users who allow the site to execute JS, that is.
I'm not sure how useful such skewed information is.
I think you are massively overestimating the number of people outside the HN crowd that have NoScript or an equivalent turned on.
This isn't applicable to internal analytics, i.e. using the Google Measurement Protocol, which is a good way to avoid the cookie-tracking mess altogether and to gain more control over the analytics.
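For reference, a Measurement Protocol v1 hit is just an urlencoded POST to GA's collect endpoint. This is a minimal sketch; the property ID and client ID below are placeholders, and only the payload construction is exercised (sending is left as a separate call):

```python
from urllib.parse import urlencode
import urllib.request

GA_ENDPOINT = "https://www.google-analytics.com/collect"  # Measurement Protocol v1

def build_pageview_hit(tracking_id, client_id, page_path, user_agent=None):
    """Build an urlencoded Measurement Protocol payload for a server-side pageview."""
    params = {
        "v": "1",            # protocol version
        "tid": tracking_id,  # UA property ID, e.g. "UA-12345-6" (placeholder)
        "cid": client_id,    # anonymous client ID you assign server-side (e.g. a UUID)
        "t": "pageview",     # hit type
        "dp": page_path,     # document path
    }
    if user_agent:
        params["ua"] = user_agent  # forward the visitor's UA so GA can classify it
    return urlencode(params)

def send_hit(payload):
    # GA returns 200 regardless of payload validity; use the /debug/collect
    # endpoint during development to have hits validated.
    req = urllib.request.Request(GA_ENDPOINT, data=payload.encode("utf-8"))
    urllib.request.urlopen(req, timeout=5)

payload = build_pageview_hit(
    "UA-12345-6", "35009a79-1a05-49d7-b876-2b884d0f825b", "/pricing"
)
```

Since the hit originates from your server, no client-side script or cookie is involved; the trade-off is that you have to generate and persist the `cid` yourself.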
> Is GA still relevant today?
Of course it is. While a large part of the HN audience might use an adblocker, very few regular users do, and almost no mobile users.
Looking at my logs, however, I get a similar proportion. ~25k in GA, ~100k in logs.
It’s expensive but I find it’s worth it to not have my partner and our parents watching ads.
I watched the Euro 2020 games, and even the 5-second soundbite of “the games are sponsored by so-and-so” rubs me the wrong way. Why should I be told the name of some company when I’m watching football? I digress.
Just because something has been common for a long time doesn't make it right.
I appreciate there are commercial reasons why this won’t happen, but it would be interesting. I’d also personally never trust any of the ad brokers to be honest with individuals without an army of lawyers.
(Disclosure: I work on ads at Google, speaking only for myself)
So now you don't need AdBlock, but you still need SponsorBlock. You might be able to get a sponsorless version through their Patreon, but now YT Premium is pointless.
I suggest a small machine with pi-hole :)
Also this way your favourite content creator gets paid as if you saw the ads
I had no idea twitch turbo existed.
Premium users are a perfect segment: they've already shown they're willing to spend money.
The audience for our website (gambling) tends to be both technologically sophisticated and very conscious of their privacy. Their behaviour has also been quite reliably an early indicator of a wider change. As it stands, more than 80% of page requests block analytics, trackers and other external crapware.
The players in other industries have to prepare for the coming wave, because it will hit them soon. About time, too.
Car enthusiasts vs mechanics in a way.
Only to advertisement agencies. And they can go cry up a river.
They stopped tracking by default in Safari, which is similar.
Now if Google were to do something like add a default adblocker that blocked all non-Google ads... yeah, that'd be anti-competitive.
But if Apple isn't even competing, how can it be anti-competitive?
Did I miss some other product?
I just checked my real-time analytics and looked back over the last 30 days to double-check. It's all coming through to GA while I'm in Firefox or other users are using Firefox.
The first vs. third-party cookies are confusing as hell.
I wish Google Analytics had an option to run a small proxy in a Docker container so I could run it on my server and collect all traffic without being blocked.
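A rough sketch of the URL rewriting such a first-party proxy would do, assuming a hypothetical `/ga` path prefix on your own domain (a real proxy would also forward the POST body and could append the visitor's real IP via the Measurement Protocol's `uip` parameter):

```python
from urllib.parse import urlsplit, urlunsplit

UPSTREAM = "www.google-analytics.com"
PREFIX = "/ga"  # assumed first-party path the proxy is served under

def upstream_url(first_party_url):
    """Map a first-party request URL onto the real GA endpoint, e.g.
    https://example.com/ga/collect?... -> https://www.google-analytics.com/collect?..."""
    parts = urlsplit(first_party_url)
    if not parts.path.startswith(PREFIX + "/"):
        raise ValueError("not a proxied analytics path")
    # Strip the first-party prefix and re-point the request at Google.
    return urlunsplit(("https", UPSTREAM, parts.path[len(PREFIX):], parts.query, ""))
```

Because the page only ever talks to your own domain, domain-based blocklists never see google-analytics.com; filter-list rules that match URL paths can still catch it.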
For this specific application, it's a game (digdig.io), so these 11% are likely browsers that didn't fully connect to the game, or bots that don't fully support the APIs I'm using.
...or people that simply refuse to correctly resolve google analytics' domains, or people who simply block google analytics' IP addresses at the firewall level...
And you know why people do that? Probably because there are developers out there who think it is a good idea to "exfiltrate the data via an http request or websocket and have your server submit the pageview hit directly to GA's servers".
Also, I don't care if you don't want some VERY basic info about your browser collected. If you're connecting to my servers, you already gave it to me through the headers, I'm giving it to GA too.
Even events can be replaced by triggering calls to your own endpoints.
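As a sketch of that idea (the endpoint name and payload fields here are made up), the page POSTs a small JSON blob to your own `/api/event` handler instead of firing a GA event:

```python
import json
import time

EVENTS = []  # stand-in for a real datastore or log file

def record_event(payload_json, client_ip):
    """Server handler for a hypothetical first-party /api/event endpoint.
    The page calls this instead of a third-party analytics script."""
    data = json.loads(payload_json)
    EVENTS.append({
        "name": data.get("name", "unknown"),  # event name chosen by the page
        "path": data.get("path", "/"),        # page the event happened on
        "ip": client_ip,                      # taken from the connection, not the payload
        "ts": time.time(),
    })
    return {"status": "ok"}

resp = record_event('{"name": "signup_click", "path": "/pricing"}', "203.0.113.7")
```

Since the request goes to your own origin, adblockers have no third-party domain to match against, and you keep the raw event data.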
I was using https://plausible.io which I presume has a higher success rate than GA, but I still saw actual usage of resources (requests to the CDN) about 4 times higher than reported.
I guess you couldn't pick a more ad-block-heavy audience than HN though, right?
So I guess it's not too surprising that it's still the same ratio.
It's not a perfect comparison but I am still on the lookout for the cheapest way to host a static site while retaining access to access logs.
As for Cloudflare, all requests go through Cloudflare's servers for caching, so some requests won't reach your server.
Only for static files; they don't cache HTML and dynamic content. So you can still make sense of unique visitors to your site.
We could go back to the days of talking about "hits" on a website, but for most things where you'd care about these types of metrics, it's a pretty crap metric.
Unfortunately a few sites unfairly throttled me to 1 request/hour despite the cost of my requests being fuck-all. So a couple of years ago I had to randomize those variable names, distribute requests over 64 IPs, and screw up all their analytics numbers in the process. I hope it was worth saving $0.01 for them.
Of course, you can do that with server side logs, so in this case client side is less accurate:)
Of course, that does mean you have to track all the user's online activity....
If you're processing personal data to detect bots to prevent let's say card fraud, fair enough. If you're processing personal data to detect bots for analytics purposes, this might not fly.
Of course, this assumes someone actually gives a shit about enforcing the GDPR, which is currently not the case.
It doesn't matter.
> how do you detect them serverside?
Instead of trying to detect whether something isn't human you should be trying to detect whether something is malicious.
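A toy sketch of that behaviour-based approach, with made-up thresholds: per-client rate bookkeeping flags abusive request patterns rather than guessing "human vs. bot" from headers:

```python
from collections import defaultdict, deque
import time

WINDOW = 60.0    # sliding window in seconds
MAX_HITS = 120   # sustained 2 req/s looks abusive (threshold is an assumption)

_hits = defaultdict(deque)  # client -> timestamps of recent requests

def is_abusive(client_ip, now=None):
    """Record one request and report whether this client exceeds the rate
    threshold within the sliding window. Behaviour, not identity, is judged."""
    now = time.monotonic() if now is None else now
    q = _hits[client_ip]
    q.append(now)
    # Drop timestamps that have fallen out of the window.
    while q and now - q[0] > WINDOW:
        q.popleft()
    return len(q) > MAX_HITS
```

A well-behaved crawler never trips this, and a malicious human does, which is exactly the point: you act on what the client does, not on what it claims to be.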
While people who use frontend logging/tracking just search the user ID and see what went wrong.
I would also say Cloudflare doesn't count unique visitors accurately. Also, there's recently been a big uptick in bots that show up in server logs but are not real users.
I dealt with something similar so I had my back-end fire server-side events to compare to Google Analytics and other client-side reported data. Sure enough, 50% loss on desktop and 30% on mobile.
In the end I realized I was just as guilty, after all I run adblock too - so I removed them all.
I guess the value comes from seeing "trends" in your site content, i.e. how people move around within the site, which sites are sending traffic, etc., rather than seeing 100% accurate absolute counts.
Log analysis will always be a much more accurate counter if you just need a hit counter.
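A minimal sketch of that, assuming combined-format access logs and a crude user-agent crawler filter:

```python
import re

# Combined Log Format: ip ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d+ \S+ "[^"]*" "(?P<ua>[^"]*)"'
)
CRAWLER_HINTS = ("bot", "crawler", "spider")  # crude filter, in the spirit of goaccess --ignore-crawlers

def count_hits(lines):
    """Return (total_hits, unique_ips) from access log lines, skipping obvious crawlers."""
    total, ips = 0, set()
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue  # malformed line
        if any(h in m.group("ua").lower() for h in CRAWLER_HINTS):
            continue  # self-identified crawler
        total += 1
        ips.add(m.group("ip"))
    return total, len(ips)

sample = [
    '1.2.3.4 - - [10/Oct/2023:13:55:36 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '1.2.3.4 - - [10/Oct/2023:13:55:40 +0000] "GET /a HTTP/1.1" 200 311 "-" "Mozilla/5.0"',
    '5.6.7.8 - - [10/Oct/2023:13:56:01 +0000] "GET / HTTP/1.1" 200 512 "-" "Googlebot/2.1"',
]
hits, uniques = count_hits(sample)
```

The obvious caveat: this only catches bots honest enough to identify themselves; headless bots with a browser UA sail straight through, which is why log counts overstate real users.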
Cloudflare’s unique visitor count always felt inflated to me, compared to both GA and non-GA analytics solutions.
If all you are looking for is a 'hit counter', then GA is overkill. Its value is not in providing page traffic data, but in tracking things like click events, ecommerce funnels, marketing campaigns, A/B testing, etc.
That's an option for goaccess: --ignore-crawlers
Honestly I am not doing a lot of tracking on my users; the few campaigns I run, I simply track within my application. That way I also don't miss any relevant events.
Also a lot of my traffic is from tech people, so using anything like GA would also only show a small part of the audience I actually mostly care about.
I don't believe it would scale well to a big site though ($$)
It loads a benign script that pretends to be GA but doesn't touch Google servers at all, to avoid breaking sites that expect GA to be available.
Considering we're talking HN visitors, this wouldn't be representative of visitors in general.
People don't talk much about that, I suppose, but to me it's clear GA is completely dead as a product.
There are a lot of ways to measure traffic depending if you care about all visits, https requests or just real users. For example, in our research we've found that around 56% of internet traffic is actually headless bots (bots pretending to be human).
If you count not only headless bots but also raw HTTP requests as users, then you'll get the 5:1 ratio Eric mentioned here on Twitter.
- Cloudflare uses cache hits and so on to measure traffic, and therefore counts HTTP requests, most of which come from data centers. This means their numbers are highly inflated.
- GA accepts raw HTTP requests as well, but this traffic can be easily filtered. They allow most/all bots and provide no tools to fight this.
- Darwin: we exclude everything we can to try and ensure the best measure of "real" users. We believe this is more helpful to marketers.
Your real traffic is ~10% of what you see in Cloudflare, and ~30% of what you see in unfiltered GA.
With Darwin, around 85% of traffic is real.