
Show HN: I replaced Google Analytics with simple log-based analytics - benhoyt
https://benhoyt.com/writings/replacing-google-analytics/
======
AdriaanvRossum
Love what you are doing here. A few points:

1\. I would rename the pixel.png to something like image.png. Never call your
script anything like tracking, analytics, or pixel when you don't want to be
blocked by ad blockers. We use hello.js and hello.gif. [1]

> Log file parsing is an old-skool but effective way of measuring the traffic
> to your site.

2\. By using a pixel image you can bypass caching. When using server logs
without the image you only get the non cached requests. So an image like you
use is a better approach.

3\. Your image is being cached. So if somebody revisits your website your
image will not be loaded and you will not find anything in your logs. Just
disable your ETag and add an expiry date in the past.
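
As a rough illustration of point 3, here is a minimal sketch of a pixel
endpoint that sends no ETag and an expiry date in the past. It uses only the
Python standard library; the handler name, the port, and the exact set of
cache headers are assumptions for illustration, not what the post or Simple
Analytics actually runs.

```python
import base64
from email.utils import formatdate
from http.server import BaseHTTPRequestHandler, HTTPServer

# A commonly used 1x1 transparent GIF, decoded from base64.
PIXEL = base64.b64decode(
    "R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"
)


def no_cache_headers():
    """Headers asking browsers and proxies not to reuse a cached copy."""
    return [
        ("Cache-Control", "no-store, no-cache, must-revalidate, max-age=0"),
        ("Pragma", "no-cache"),
        # An expiry date in the past (the Unix epoch).
        ("Expires", formatdate(0, usegmt=True)),
    ]


class PixelHandler(BaseHTTPRequestHandler):
    """Serves the pixel on every GET; no ETag header is ever sent."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "image/gif")
        self.send_header("Content-Length", str(len(PIXEL)))
        for name, value in no_cache_headers():
            self.send_header(name, value)
        self.end_headers()
        self.wfile.write(PIXEL)


def serve(port=8000):
    """Run the pixel server until interrupted (hypothetical port)."""
    HTTPServer(("127.0.0.1", port), PixelHandler).serve_forever()
```

Calling `serve()` starts the server locally. On a static host or CDN the same
effect would come from the equivalent response-header settings rather than
from code like this.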

Awesome that more and more people are replacing Google Analytics with simpler
tools. GA is overly complicated and does not have the best privacy mindset. I
built Simple Analytics [2] as a privacy friendly alternative.

[1]
[https://docs.simpleanalytics.com/script](https://docs.simpleanalytics.com/script)

[2] [https://simpleanalytics.com](https://simpleanalytics.com)

~~~
xiaq
> Never call your script anything like tracking, analytics, or pixel when you
> don't want to be blocked by ad blockers.

But this is a tracking script, no? If I were you I'd keep the name that way so
that if people don't want to be tracked, they don't.

~~~
harianus
If your business is to track people, yes. If it's to gather statistics without
tracking people (like page views), I think it's perfectly fine to bypass ad
blockers. We even have a dedicated feature for bypassing ad blockers [1]
because we think page views are not privacy invasive. We drop IP addresses
from every request so there is no personal data in our database or logs.

If you really want to block you can enable the Do Not Track setting. Although
I think this should only be used when you are actually tracking people (we
don't). So this feature might be removed in the future. It's already removed
by Safari because it is another parameter to fingerprint a browser.

[1] [https://docs.simpleanalytics.com/bypass-ad-
blockers](https://docs.simpleanalytics.com/bypass-ad-blockers)

~~~
xiaq
The privacy game is about power, not about who is doing what right now. People
shun Google's data collection because of what Google _can_ do with the data,
not what it _has_ done or _is_ doing; it only takes a single case of data
misuse to reveal the power dynamics even if nothing has happened to them
personally.

You don't have to play the privacy game -- there is a lot of space between
really respecting users' privacy and breaking privacy laws. But if you do, you
should put the power back in the users' hands.

(Disclaimer: I work for Google.)

~~~
kerng
There have been countless cases for me personally where Google's tracking is
creepy. Like Google Maps recommendations, YouTube videos I should watch,
misleading ads that send the user to malware based on interests Google has
about them etc, as well as exposure of unauthenticated Google+ APIs that
allowed access to sensitive data to name a few.

I think saying nothing bad has happened is disingenuous. As soon as Google
gets similar exposure as FB right now, internal whistleblowers might come
forward also with more stories.

Also, a couple of years ago Google was thoroughly compromised by at least one
foreign government, ever wonder how much data was stolen?

------
octref
> Obviously with log parsing you don’t get as much information as a
> JavaScript-heavy, Google Analytics-style system. There’s no screen sizes, no
> time-on-page metrics, etc. But that’s okay for me!

This also bypasses Ad blocker. If a large percentage of your audience is
technical (and presumably has an ad blocker installed), this log can be way
more accurate than GA.

However this still requires setting up the image on a third-party server. I
would really love it if GitHub pages or Netlify can provide some simple
server-side tracking. It doesn't match GA but in some cases that's all I need.

~~~
highace
> This also bypasses Ad blocker

GA does a good job at extrapolating your data to account for users with ad
block. Obviously not perfect, but good enough for most cases.

~~~
hjnilsson
How is this presented in the UI? I have never noticed this extrapolation.

~~~
kingo55
That's because extrapolation does not exist. He may be confused with sampling.

------
acidburnNSA
For those of you hosting on your own server, this is a PSA that awstats is
still working. It reads server logs and makes little graphs via a cron job.
It's fun to have analytics going back to 2002 in the same format.

[https://awstats.sourceforge.io/](https://awstats.sourceforge.io/)

~~~
avian
Yes, I've also been using awstats for many years. It definitely works for a
small site, has a lot of useful features and occasional releases still keep up
with new user agents, etc. that pop up over time. Each time I want to switch I
fail to find a suitable replacement.

That said, behind the curtain, awstats has plenty of problems and shows its
age. Most of it is a single ~20 kline script with hundreds of global
variables, so it's very challenging to debug. There are no tests. Over time it
also had plenty of security issues [1]. I wouldn't recommend running it in any
other mode than for generating static HTML reports from an unprivileged
cronjob.

I've made my own test suite and I'm using a slightly patched version with ~20
commits on top of the latest release that fix problems I found and that
upstream didn't merge (still from the times of SourceForge - since they
switched to GitHub they do seem to be a bit better in accepting pull
requests). However, it doesn't encourage submitting patches when, for example,
concerns regarding GDPR compliance are met with responses like [2].

[1] [https://security-tracker.debian.org/tracker/source-
package/a...](https://security-tracker.debian.org/tracker/source-
package/awstats)

[2]
[https://github.com/eldy/awstats/issues/110](https://github.com/eldy/awstats/issues/110)

~~~
xfitm3
I read issue [2] and I think it's reasonable to do this at the LogFormat level.
awstats doesn't need to do everything, that is probably how it got to be
20,000 lines.

I always use awstats to generate static pages, and that's what any
security-conscious operator should do.

~~~
avian
You see a reasonable technical response.

I see users with a valid issue (even quoting relevant laws) being called names
and told in a patronizing tone that widely accepted interpretations of said
laws are wrong.

------
telaelit
I would also suggest looking into Matomo
([https://matomo.org/](https://matomo.org/)) if you want an open source
analytics service to replace Google Analytics.

~~~
addandsubtract
FYI, Matomo used to be called Piwik, for anyone more familiar with their old
name.

[https://matomo.org/blog/2018/01/piwik-is-now-
matomo/](https://matomo.org/blog/2018/01/piwik-is-now-matomo/)

------
Neil44
I like GoAccess because it works well with multiple vhosts, if you have a lot
of sites and want to see relative busyness. Also you can see if any particular
site or individual resource is consuming too much bandwidth.

------
cromulent
Urchin (the software that became Google Analytics) was a log parsing
analytical tool.

[https://en.wikipedia.org/wiki/Urchin_(software)](https://en.wikipedia.org/wiki/Urchin_\(software\))

~~~
blakesterz
Oh how I miss Urchin! It was the best, by far, for this type of work.

------
abhin4v
Why not use a self-hosted analytics software like
[https://matomo.org/](https://matomo.org/) ?

~~~
Zolomon
I wish they had a simple Docker image available with everything set up and
configurable from a single file. The installation process is complex and
under-documented, IMHO. But the software is pretty nice!

~~~
ifcho
Why do you need Docker for a simple PHP & MySQL script? To install it you just
need to set up a MySQL database and enter the credentials in the setup script.
Finally, add the tracking code to your site, but that's the same for every
analytics engine.

~~~
dewey
And you also need to set up a bunch of things in your web server, add the cron
jobs, run the update script and hope everything still works afterwards etc.

------
interfixus
If you take the trouble to host everything on a stack you actually control, raw
server logs and GoAccess are a highly capable and for most cases sufficient
monitoring tool. I use nothing but.

~~~
helij
Same here. Nginx log + GoAccess. I actually log in to the server and generate
GoAccess html to check the traffic occasionally. I also anonymize ip on Nginx
- that way I don't have to deal with GDPR. No cookies, js or third-party
images, etc. Really see no need for anything else for a purely content based
website. Then again the website is purely non-commercial for now [1].

[1] [https://artlists.org/privacy-policy/](https://artlists.org/privacy-
policy/)

------
eljimmy
If you’re not a fan of big brother you probably should be logging to bare
metal instead of Amazon S3.

~~~
julsimon
[https://aws.amazon.com/compliance/data-privacy-
faq/](https://aws.amazon.com/compliance/data-privacy-faq/)

"We do not access or use your content for any purpose without your consent. We
never use your content or derive information from it for marketing or
advertising"

~~~
luckylion
With privacy, I personally prefer not to have to rely on trust. Encryption is
better than trusting other people not to invade your privacy.

------
redm
This looks like a solution in need of a problem. Previously I had tried some
similar things at scale, mainly because I wanted upsampled reports and GA
charges $100,000 for premium services. What I found is that raw logs are not
reliably accurate. The volume of traffic in certain countries was accurate but
not reliably “real” users even after accounting for known bots, search engines
etc. GA has a way of accounting for these and giving you a better overall
picture. The second thing is that GA has improved its service so you get
upsampled reports now, even at scale for tier 1 reports. At low volume, they're
upsampled already.

I’m not sure why anyone would want to waste time with this.

~~~
RA_Fisher
I prefer the raw logs. One reason is that Google performs "statistical
improvement" like "sessionizing." This act destroys the value of point process
data. It wouldn't be found in a stats textbook because the operation destroys
valuable information. Also, the concept of Visitors and Users that GA uses
isn't transparent to me.

Another fun fact. Since the tracking happens on the client side, there's
potentially a ton of truncated data that GA simply misses. Backend server
instruments don't suffer the same way.

~~~
redm
I think those are fair points; mileage will depend on your end-goals. We want
to know how our traffic relates to real-world ad deliverability, real users in
our funnel, etc. I'd agree with "Also, the concept of Visitors and Users that
GA uses isn't transparent to me." but I'd add that whatever they do, is more
accurate than what we could get from raw logs, as it relates to relatable
business metrics.

------
eruci
I removed GA completely on [https://geocode.xyz/](https://geocode.xyz/)

Cloudflare provides all the basic analytics I need and I can parse the log
files in the command line if I need more.
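
For anyone wondering what "parse the log files" can look like in practice,
here is a minimal sketch that counts successful page views per path. It
assumes the logs are in the common Apache/Nginx "combined" format; the regex,
the function name, and the sample lines are all made up for illustration.

```python
import re
from collections import Counter

# One line of the "combined" access log format (assumed layout).
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def count_page_views(lines):
    """Count successful GET requests per path, skipping unparseable lines."""
    views = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # not in the expected format
        if match["method"] == "GET" and match["status"] == "200":
            views[match["path"]] += 1
    return views


# Hypothetical sample data using documentation-reserved IP addresses.
SAMPLE = [
    '203.0.113.7 - - [10/Apr/2019:13:55:36 +0000] "GET /about/ HTTP/1.1" '
    '200 5120 "https://news.ycombinator.com/" "Mozilla/5.0"',
    '198.51.100.2 - - [10/Apr/2019:13:56:01 +0000] "GET /about/ HTTP/1.1" '
    '200 5120 "-" "Mozilla/5.0"',
    '198.51.100.2 - - [10/Apr/2019:13:56:09 +0000] "GET /missing HTTP/1.1" '
    '404 312 "-" "Mozilla/5.0"',
    'this line is not a log entry',
]

if __name__ == "__main__":
    for path, hits in count_page_views(SAMPLE).most_common():
        print(hits, path)
```

In real use you would pass an open log file instead of `SAMPLE`, and probably
filter out known bots by user agent first.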

~~~
Zenbit_UX
I'm also doing cloudflare only but they don't give you certain important
metrics like unique sessions.

Can you tell me where and how you are parsing cloudflare logs?

~~~
eruci
The logs are on my server. I use the cloudflare module to restore original
session data.

------
achairapart
Another very simple log viewer that runs on any LAMP/LEMP stack is Pimp My
Log[0].

[0]: [http://pimpmylog.com/](http://pimpmylog.com/)

------
nnx
Has anyone researched whether dropping Google Analytics impacts Google Search
rankings?

~~~
spongeb00b
Taking out the tracking code snippet increases your Google PageSpeed score.
Which makes sense, but I was always amused that Google basically was taking
points away for using their analytics on your site.

------
pmlnr
I never moved from awstats. It still works.

------
TomK32
I'm using ahoy
[https://github.com/ankane/ahoy](https://github.com/ankane/ahoy) with my Rails
applications and I'm very happy with it. Geocoder using the IP is included and
the thing that matters for me is being able to set an additional conversion
flag on the visit.

------
stevenicr
I used to spend a lot of time looking at logs and trying to make pages better
and more content based upon the stats. Then google stopped giving keywords and
searched-phrases - I've since found stats of little use most of the time.

I wish google would make search-keyword-hiding opt-in for users, and perhaps
auto-opted-in if using incognito mode. I am sure most of my visitors would be
glad to provide the search phrases knowing that it helps us make more things
and make things better. But google does not let them opt in to sharing them,
they are all basically opted out.

~~~
tinus_hn
Even if Google wanted, browsers like Safari are going to strip referrer
headers of query terms as you navigate between sites.

~~~
stevenicr
I think it's a great ability - and a great option. I'd like this to be
optional with sites and browsers. Give users the ability to change these
settings.

Give web sites a way to say - thanks for visiting, we noticed you are using a
browser that.. or search portal that strips info from us.. would you please
click to enable sharing this small bit of info.. more about how we use and
what info here..

Something like this could help sites and users. I'd like to toggle it myself.
I like how startpage scrambles url queries, but I would turn it off for some
sites, whitelist them like some are with ublock etc. I also don't like how
p-hub and some others keep queries in the url, and would like an option to
scramble, with the site, via browser settings, proxies, whatever it takes.. to
give more options, more choice.

------
nathan-io
Very nice!

On several projects, we've had success with a custom tracker that records IP,
URL, referrer, display resolution, OS, and user agent to a local db.

To filter out bot traffic, we used Crawler-Detect [1].

The whole thing is just a few lines of PHP and JS, doesn't even require a
tracking pixel (we grab most of the data from the user session).

A cron job moves entries older than x from the production db to an archive db.

[1] [https://github.com/JayBizzle/Crawler-
Detect](https://github.com/JayBizzle/Crawler-Detect)

------
qrbLPHiKpiux
I feel old now. The 1x1 hidden pixel is so early 90's old school.

------
cartofu
Another alternative to more privacy in GA is to proxy all requests to GA via
your own simple proxy server analytics.yoursite.com and drop the last bytes
from the visitor IP when proxying.
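
A minimal sketch of the "drop the last bytes" step, using Python's standard
`ipaddress` module. The function name and the prefix lengths (/24 for IPv4,
/48 for IPv6) are assumptions chosen for illustration; whether a given amount
of truncation is enough for your legal situation is a separate question.

```python
import ipaddress


def anonymize_ip(address, v4_prefix=24, v6_prefix=48):
    """Zero the low-order bits of an IP address before forwarding or logging.

    With the default prefixes this drops the last byte of an IPv4 address
    and the last 80 bits of an IPv6 address.
    """
    ip = ipaddress.ip_address(address)
    prefix = v4_prefix if ip.version == 4 else v6_prefix
    # strict=False lets us build the network from a host address.
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network.network_address)


if __name__ == "__main__":
    print(anonymize_ip("203.0.113.77"))           # 203.0.113.0
    print(anonymize_ip("2001:db8:abcd:1234::1"))  # 2001:db8:abcd::
```

The proxy would apply this to the visitor address before passing the request
on, so the full IP never reaches the analytics backend.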

------
mejakethomas
(data engineer here)

Nice post! It's always fun reading about people being creative and challenging
the analytics status quo (aka GA). Besides the joy of doing it yourself,
you've accomplished a couple other things worth mentioning:

1\. You'll never be sampled. GA samples historical data pretty heavily, and
you have to pay for 360 to retain unsampled event data (to the tune of $160k+
per year).

2\. You have full access to all generated data.

I'd highly recommend using Snowplow's javascript tracker
([https://github.com/snowplow/snowplow-javascript-
tracker](https://github.com/snowplow/snowplow-javascript-tracker)) in a very
similar manner to what you've outlined here. You'll get a ton of extra
functionality out of the box, which would add yet another level of insight.
With snowplow, you get the following for free:

1\. Sessionization, which is consistent with google analytics' definition -
effectively a 30 minute window of activity.

2\. User identification - the tracker drops a persistent cookie (just like
GA), so you can see returning visitors.

3\. Tools for splitting requests

4\. A variety of event types, out of the box:
[https://github.com/snowplow/snowplow/wiki/2-Specific-
event-t...](https://github.com/snowplow/snowplow/wiki/2-Specific-event-
tracking-with-the-Javascript-tracker)

5\. Ability to respect Do Not Track

6\. Time on page, browser width/height, etc

7\. Ability to make your event tracking 100% first-party
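
To make the sessionization in item 1 concrete, here is a minimal sketch of
grouping events into sessions by a 30-minute inactivity window. This is not
Snowplow's or GA's actual code; the function name and the event shape (a
visitor id plus a timestamp) are assumptions, and real definitions include
extra rules that are left out here.

```python
from datetime import datetime, timedelta

# A new session starts after this much inactivity.
SESSION_TIMEOUT = timedelta(minutes=30)


def sessionize(events):
    """Group (visitor_id, timestamp) events into per-visitor sessions.

    Returns a dict mapping each visitor id to a list of sessions, where each
    session is a list of timestamps in ascending order.
    """
    sessions = {}
    for visitor, ts in sorted(events, key=lambda e: (e[0], e[1])):
        visitor_sessions = sessions.setdefault(visitor, [])
        last_seen = visitor_sessions[-1][-1] if visitor_sessions else None
        if last_seen is not None and ts - last_seen <= SESSION_TIMEOUT:
            visitor_sessions[-1].append(ts)  # still the same session
        else:
            visitor_sessions.append([ts])    # gap too long: new session
    return sessions


if __name__ == "__main__":
    events = [
        ("a", datetime(2019, 4, 10, 10, 0)),
        ("a", datetime(2019, 4, 10, 10, 20)),
        ("a", datetime(2019, 4, 10, 11, 0)),
        ("b", datetime(2019, 4, 10, 10, 5)),
    ]
    for visitor, visits in sessionize(events).items():
        print(visitor, len(visits), "session(s)")
```

The visitor id would typically come from the persistent cookie mentioned in
item 2.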

(Disclaimer: I don't work for them, but I've seen the system work very well a
number of times.)

I'm running a similar setup on my blog, and it costs well under $1 per month:
[https://bostata.com/client-side-instrumentation-for-under-
on...](https://bostata.com/client-side-instrumentation-for-under-one-dollar/).
I'm doing the same exact thing with Cloudfront log forwarding and have several
lambdas that process the files in S3. From there, I visualize traffic stats
with AWS Athena (but retain a ton of flexibility, since they are all
structured log files).

------
amanzi
If you're using JavaScript to add the pixel code, why not also include other
metrics you can get easily with JS like screen resolution?

~~~
benhoyt
Yeah, good question. I wanted to do that, but GoAccess is a web server log
parser and doesn't support custom fields (you don't get screen resolution via
web logs, so it kinda makes sense). See:
[https://goaccess.io/man](https://goaccess.io/man)

I could probably hack it and overload different HTTP status codes to mean
different screen sizes or something, but I didn't consider device size to be
important for me. GoAccess does break down the User-Agent into OS, so I can
see mobile usage via the "iOS" and "Android" OS usage. Breakdown for my site:
Windows 24%, iOS 22%, macOS 19%, Android 20%, Linux 11%, other 4%. So mobile
usage is probably about 45%.

~~~
jon-wood
If you’re using JavaScript you could make a request in the background with a
bunch of HTTP headers added like X-Screen-Resolution and then have your web
server log them.

------
finchisko
I was thinking about switching to log based user tracking some time ago. Not
because of big brother issues, but rather my intention was to remove the
cookie banner nonsense required by the EU. No cookies, no banner required,
right? I mean there are sure some downsides, but at the current stage of our
analytics, logs should hold enough information for the analytics we need.

~~~
mejakethomas
You'll still have to anonymize IP addresses, since those are classified as
personal data in the EU.

------
cphoover
I would use a json based logger like [https://github.com/trentm/node-
bunyan](https://github.com/trentm/node-bunyan) or
[https://github.com/pinojs/pino](https://github.com/pinojs/pino) and use the
elk stack which can parse JSON.

------
gorkemcetin
It is always possible to overcome ad blocker banning with js trackers if you
are self hosting and have the option to modify strings in the js sdk - there
are several ways to do this. You can also get more data with Countly, Matomo
or Fathom w/o using server logs directly.

------
pknerd
I could not find organic traffic details like keywords that bring visitors to
the site?

~~~
wwweston
Once upon a time you could get this information from the HTTP_REFERER header,
since search terms were essentially encoded in the query string portion of the
URL the search results showed up on.

Not sure how that holds up these days.
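
For reference, the old technique amounted to something like the sketch below:
pull the search terms out of the referrer URL's query string. The function
name and the `q` parameter are assumptions for illustration, and it only
works when the referrer still carries the query at all.

```python
from urllib.parse import parse_qs, urlparse


def search_terms(referrer, param="q"):
    """Return the search phrase from a referrer URL, or None if absent."""
    query = parse_qs(urlparse(referrer).query)
    values = query.get(param)
    return values[0] if values else None


if __name__ == "__main__":
    print(search_terms("https://www.google.com/search?q=log+based+analytics&hl=en"))
    print(search_terms("https://www.google.com/"))
```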

~~~
harianus
It's removed by Google, all search result clicks go through an intermediate
URL that removes the keywords. You can have some information from the Google
URL [1].

[1]
[https://webmasters.stackexchange.com/a/107179](https://webmasters.stackexchange.com/a/107179)

------
pawurb
You could go fancier with setting up ELK stack for visualizing those logs
[https://abot.app/blog/elk-nginx-logs-setup](https://abot.app/blog/elk-nginx-
logs-setup)

------
gscott
I use [https://statcounter.com/](https://statcounter.com/) which is
essentially like the old log based website stats.

Google analytics has too many layers of UI.

------
louismerlin
This is awesome ! It makes me want to build my own minimal analytics tool.

------
tdhz77
“Most tracking systems, including Google Analytics, don’t work at all without
JavaScript.”

I have found GA, Matomo, and Fathom all to have image-based solutions that you
can use.

~~~
mobjack
99.9% of those on the web have JavaScript enabled.

The audience that disables it is incredibly small and doesn't want to be
tracked anyway.

It shouldn't be a factor in your analytics solution unless you want to track
bots too.

------
qwerty456127
I have always wondered why everybody doesn't do this, and instead insists on
using Google Analytics and other 3rd-party trackers.

------
bryanrasmussen
making your own beacon based tracking system is pretty simple, and then you
have the screen sizes, time on page metrics etc.

~~~
speedplane
The real power of Google Analytics is not in the tracking code, it's with the
front-end interface that any non-technical product manager or marketer can
use. Tracking users is easy. Providing non-technical folks with easy-to-use
analysis tools is much harder.

~~~
johnchristopher
The ever changing google analytics dashboard is not that easy to use for non
technical folks.

And then there's the whole "data analysis" thing.

~~~
justkez
The (lack of) speed and complexity of GA astounds me every time I go in (use
it over a portfolio of businesses). Using it on a 100Mb connection is still
like pulling teeth.

I've researched building out a desktop app that pulls GA data over the API in
the background so you can get key stats out much quicker, but it's quite an
investment of time to be beholden to Google's platform.

Now doing some dogfooding on a web analytics service I've been evolving that
tries to answer the "why" of change in traffic/behaviour over time ("traffic's
up today....not sure why?"). Google does this with their GA mobile app
("Insights") but what they show you, and when, doesn't seem to be too
predictable.

------
arjunbanker
I’ve heard Snowplow is good for this

~~~
mejakethomas
Totally agree! I'm running a minimalistic version of snowplow's collection/etl
infra for under $1 per month, and it works great:

[https://bostata.com/client-side-instrumentation-for-under-
on...](https://bostata.com/client-side-instrumentation-for-under-one-dollar/)

------
bilater
Nice. Another (more) simplistic effort:
[https://medium.com/datadriveninvestor/a-very-simple-way-
to-a...](https://medium.com/datadriveninvestor/a-very-simple-way-to-add-
analytics-to-your-website-b25916d281bd)

