Using GoAccess for self-hosted web analytics (sablun.org)
110 points by bjoko on Dec 27, 2019 | 28 comments


While I do like GoAccess and I like owning my own data, there's still a huge difference between what tools like GoAccess offer and what the elephant in the room, Google Analytics, offers.

It's not just the data available by default, it's also the very advanced user interface allowing even somewhat non-technical people to produce their own reports and dashboards.

Yes, you only co-own the data together with Google, but at least you have the data. A home-grown solution will either not give your product managers what they need, or it will consume all of your life as you slowly re-implement your own Google Analytics, which will never be as good or as featureful as the original.

Of course, if it's just about your personal blog and if you're willing to spend some time on the tooling itself, then yes, tools like these don't have the privacy issues otherwise accompanying third-party analytics.


For me, the problem with GA (or Matomo or any JS analytics) is that it doesn't accurately count the people visiting your website.

You lose all the ones with a blocker, and all the ones with JS turned off.

Of course it’s not the biggest percentage, but still.

So you’re right, it depends on what the website is. It’s easier to follow individuals with a JS solution.


This is the sort of thing that the W3C committee and open-source web-companies should be researching/solving instead of "HTTPS" and "SSL pinning", "new HTML5 tags", "web notification", "location apis", and "DNS over HTTPS" type of stuff (although they are important on some level). We shouldn't have to rely on JS in order to do these sorts of website analytics. It could arguably be part of the browser stack and within the users' realm of control.

Instead we now have to disable GA or all JS in order to not be tracked. GA is a huge privacy issue, and I have no idea why no one is looking at this more closely than at some of the other things, which I see as trivial.

Google controls most of the web's search. It controls a huge chunk of the browser market. And a huge chunk of the websites out there use GA if not google-ads. Between Google and Facebook, we are constantly being tracked and monitored for maximum benefit/profit.

Case in point: Even without GA, I recently browsed a blog that had a Disqus comments section embedded on it. The embedded Disqus script did a POST to "https://accounts.google.com/o/oauth2/iframe" that had a Referer header telling Google exactly which page I was on, and it included a few other potentially-tracking tokens. Bad? You decide. But this is not what we should be allowing the players on the web to do, and we shouldn't feel like conspiracy theorists for being bothered by it.


GA does support using Google Tag Manager and a NOSCRIPT iframe for recording visits from people with JavaScript disabled.

You can also either proxy GA through your own domain, or use server side APIs to track those with blockers. Both support passing the browser IP as a parameter.

Also, fwiw, just noting both are technically possible. I'm not speaking to whether either is okay ethically.


I think a big problem with Google Analytics is accuracy, especially with the now so popular adblockers. Log analysis such as GoAccess should be able to track these down fine since it works at the server level.

I believe tracking visitors at the client level deflates the actual number of visitors. On the other hand, server-side tracking gives you a more accurate number at the cost of not knowing for sure if the client is a human behind a browser.
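To make the trade-off concrete, here is a minimal sketch of server-side counting. It assumes logs in the common Combined Log Format and treats requests with the same IP, date, and user agent as one visitor. The `BOT_MARKERS` list is a made-up, deliberately tiny stand-in for real crawler detection, which is exactly the part that stays uncertain.

```python
import re

# Combined Log Format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Hypothetical marker list for illustration only; real crawler lists are far longer.
BOT_MARKERS = ("bot", "crawler", "spider")


def count_visitors(lines):
    """Return (unique_visitors, likely_bots) for an iterable of log lines."""
    visitors = set()
    bots = set()
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # skip lines that are not in the expected format
        day = m["time"].split(":", 1)[0]  # e.g. "27/Dec/2019"
        key = (m["ip"], day, m["agent"])
        visitors.add(key)
        if any(marker in m["agent"].lower() for marker in BOT_MARKERS):
            bots.add(key)
    return len(visitors), len(bots)


SAMPLE = [
    '203.0.113.5 - - [27/Dec/2019:10:00:00 +0000] "GET /post-1 HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '203.0.113.5 - - [27/Dec/2019:10:05:00 +0000] "GET /post-2 HTTP/1.1" 200 734 "-" "Mozilla/5.0"',
    '198.51.100.7 - - [27/Dec/2019:11:00:00 +0000] "GET /post-1 HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
]

if __name__ == "__main__":
    print(count_visitors(SAMPLE))
```

Every request shows up here regardless of blockers, but a client that sends a browser-like user agent would be counted as a human, so the number is an upper bound rather than a precise figure.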


You can programmatically send hits to GA from your server, and get the benefit of the nice UI.
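For anyone curious what that looks like, here is a rough sketch that builds a pageview hit for the Universal Analytics Measurement Protocol as it existed around the time of this thread. The function name and the example values are made up for illustration; the sketch only builds the form-encoded payload and does not send anything.

```python
from urllib.parse import urlencode

# Collection endpoint of the Universal Analytics Measurement Protocol.
GA_COLLECT_URL = "https://www.google-analytics.com/collect"


def build_pageview_payload(tracking_id, client_id, path, client_ip, user_agent):
    """Return the form-encoded body for a single server-side pageview hit."""
    params = {
        "v": "1",            # protocol version
        "tid": tracking_id,  # property ID, e.g. "UA-XXXXX-Y"
        "cid": client_id,    # anonymous client identifier chosen by your server
        "t": "pageview",     # hit type
        "dp": path,          # document path that was requested
        "uip": client_ip,    # IP override: the visitor's IP, not your server's
        "ua": user_agent,    # user agent override: the visitor's browser
    }
    return urlencode(params)


if __name__ == "__main__":
    body = build_pageview_payload(
        "UA-XXXXX-Y", "555", "/post-1", "203.0.113.5", "Mozilla/5.0"
    )
    print(body)
    # To actually send it, POST the body to GA_COLLECT_URL, for example:
    # urllib.request.urlopen(GA_COLLECT_URL, data=body.encode("ascii"))
```

Passing the visitor's IP and user agent through is what makes the hit look like it came from the browser; whether that is appropriate is a separate question.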


FYI: Paranoid adblockers / privacy protectors will block iframe to remote domains. For example: uMatrix by gorhill (same author as uBlock Origin) does this.


That's correct but there is also an alternative approach. You can track your user event data into your database using an open-source solution such as Rakam API (https://github.com/rakam-io/rakam) or Snowplow (https://github.com/snowplow/snowplow) and use a product analytics tool such as Rakam (https://rakam.io/product) that provides a set of prebuilt reports & ad-hoc reporting interface for the data on top of your database.

That way, you won't be sharing your data with a third-party company, have full control over your data, be able to access an interface similar to Google Analytics with prebuilt reports and also have the ability to model your data for your custom needs and do custom reporting.

Shameless plug: I'm the maintainer of the Rakam API and working for the company behind that open-source product.


Google Analytics keeps track of visitors using cookies, so if a browser has cookies or JavaScript disabled, then it won't keep track of it. This includes the now so popular adblockers and bots as well. Log analysis such as GoAccess should be able to track these down fine since it works at the server level.

I believe tracking visitors at the client level deflates the actual number of visitors, as pointed out above. On the other hand, server-side tracking gives you a more accurate number at the cost of not knowing for sure if the client is a human behind a browser.


Have you tried Matomo?


> produce their own reports and dashboards.

A good open source project would likely be to dump all of that into a DB and expose it as OData. Although PowerBI is non-free, it would cover what GA can do and a great deal more.


When did you last use Google Analytics? It became useless about two years ago when they refused to block referral spam.


GoAccess does not deal with this either.

Also, Google Analytics (and other analytical solutions) give much more information about the visitors than can be inferred from access logs.

Not that I recommend using GA - I dropped all analytics tools from my sites a long time ago.


I did the same for a few years. Recently I installed a central Matomo instance for all my sites to hit. It's pretty good!


If you're hosting a static site on shared infrastructure (e.g. github pages, S3) where access logs are hard to come by, you can achieve some level of statistics gathering using a 1x1 pixel GIF and putting a CDN in front of it, then putting the CDN logs through GoAccess.

EDIT: this was a blog I loosely followed https://benhoyt.com/writings/replacing-google-analytics/ - one detail I forgot to mention is you have to do some post processing on the log files to get them into a format amenable for use by GoAccess etc, but it's not too difficult
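The post-processing step could look something like the sketch below. The tab-separated input layout (date, time, client IP, status, user agent, query string) and the `u`/`r` query parameter names on the pixel are assumptions made up for illustration, not the real log format of any particular CDN and not necessarily what the linked post uses. Timestamps are assumed to be UTC.

```python
from datetime import datetime
from urllib.parse import parse_qs


def pixel_record_to_combined(record):
    """Turn one hypothetical CDN pixel-log record into a Combined Log Format line.

    The page that was viewed is taken from the pixel's `u` parameter and the
    original referrer from its `r` parameter (both names are assumptions).
    """
    date, time, ip, status, user_agent, query = record.rstrip("\n").split("\t")
    params = parse_qs(query)
    path = params.get("u", ["/"])[0]
    referrer = params.get("r", ["-"])[0]
    stamp = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S")
    when = stamp.strftime("%d/%b/%Y:%H:%M:%S +0000")  # assumes UTC input
    # The response size is unknown for the real page, so it is written as 0.
    return f'{ip} - - [{when}] "GET {path} HTTP/1.1" {status} 0 "{referrer}" "{user_agent}"'


if __name__ == "__main__":
    sample = (
        "2019-12-27\t10:00:00\t203.0.113.5\t200\tMozilla/5.0\t"
        "u=%2Fpost-1&r=https%3A%2F%2Fnews.ycombinator.com%2F"
    )
    print(pixel_record_to_combined(sample))
```

The rewritten lines can then be fed to a log analyzer that understands the combined format, so the report shows the pages people viewed rather than one very popular GIF.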


yeah, sharing logs for shared hosting is a small but interesting & unsolved problem

it's still weird to me that we rely on the frontend to manage access logging


I personally run Fathom[0] for my very simple needs and it's pretty good as well. Not quite as simple to deploy as GoAccess, but it's got a nice UI as well.

I used to use Matomo[1] (which is a lot closer to a full analytics suite) but stopped using it since it felt heavier than what I needed.

[0]: https://github.com/usefathom/fathom

[1]: https://matomo.org/


Like the other solutions I've seen, this assumes you have access to your web-server's logfiles. For many (most?) low-end web-hosting this is likely not true afaik. I suspect that this is precisely the window-of-opportunity that enabled the dominance of Google Analytics and similar solutions.


GoAccess is great and all, but static log analyzation =/= Google Analytics. You can only get so much data via log analysis, and if you have a SPA, or even just a lot of client-side stuff, static log analyzation just cannot provide you with the same type of data you get from the client side.

If you really want to replace Google Analytics and have the same level of features and tracking you need a client side system - Matomo is the closest I've seen to that.

All that being said, I've used GoAccess, I like it, but I haven't quite mastered the log format to enable my more robust AWStats logs with GoAccess. I have a bunch of subdomains all as VHosts, and I loved the feature in GoAccess that rolls it all up into a VHost table/chart. However I haven't figured out the log parser settings to get both AWstats and GoAccess to like it with the %v_host field in the logs. Any thoughts/help there?


Google Analytics keeps track of visitors using cookies, so if a browser has cookies or JavaScript disabled, then it won't keep track of it. This includes the now so popular adblockers and bots as well. Log analysis such as GoAccess should be able to track these down fine since it works at the server level.

I believe tracking visitors at the client level deflates the actual number of visitors (for the reasons given above). On the other hand, server-side tracking gives you a more accurate number at the cost of not knowing for sure if the client is a human behind a browser.


> GoAccess is great and all, but static log analyzation =/= Google Analytics

True, but this has some advantages as well - GoAccess sees _all_ requests, even when visitors disable JavaScript and/or use ad blockers. For my blog, I don't care about extensive tracking capabilities. I just want to know which posts are most popular and roughly how much traffic I get. YMMV of course, especially for web applications.


Does anyone else work on a relatively large internal web project in a global corp where GA for obvious reasons is forbidden to use internally? How do you do statistics? Which statistics do you capture and why?


Yeah, good luck integrating this with AdWords, YouTube, Facebook, etc. I understand why this is good for personal projects, but if you are spending lots of money on advertising (which probably goes mostly toward Google anyway), Google Analytics is the way to go.


Sadly.

It would be nice if there was a way to do conversion attribution without the tracking cookies.

One idea I've been thinking about lately:

* in server-side access logs, include visitor's fingerprint as hash(client_ip + user_agent + maybe_salt)

* Look for access log entries with ad campaign specific Referrer values. Call these "Set A".

* Look for access log entries with the conversion event (user signs up, user upgrades to paid plan, user makes purchase, ...). Call these "Set B"

If any given visitor's fingerprint appears in both Set A and Set B, we can assume they came from an ad campaign and converted.

Of course, a user's IP can change, and many users can share the same IP. So this would be imprecise. But it could be better than nothing. And, IMO at least, better than having the tracking cookies, cookie warnings, consent screens, extra sections in privacy policies etc.
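A minimal sketch of that idea, assuming the access log has already been parsed into dicts with `ip`, `user_agent`, `referrer` and `path` keys. Those key names, the `campaign=winter` marker and the `/signup` path are all hypothetical. SHA-256 is just one possible choice for the hash.

```python
import hashlib


def fingerprint(client_ip, user_agent, salt=""):
    """hash(client_ip + user_agent + maybe_salt), here using SHA-256."""
    raw = f"{salt}|{client_ip}|{user_agent}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


def attributed_conversions(entries, campaign_marker, conversion_path, salt=""):
    """Return fingerprints seen both arriving from the campaign and converting."""
    set_a = set()  # visitors whose referrer carries the campaign marker
    set_b = set()  # visitors who hit the conversion path
    for entry in entries:
        fp = fingerprint(entry["ip"], entry["user_agent"], salt)
        if campaign_marker in entry["referrer"]:
            set_a.add(fp)
        if entry["path"] == conversion_path:
            set_b.add(fp)
    return set_a & set_b


ENTRIES = [
    {"ip": "203.0.113.5", "user_agent": "Mozilla/5.0",
     "referrer": "https://ads.example/?campaign=winter", "path": "/"},
    {"ip": "203.0.113.5", "user_agent": "Mozilla/5.0",
     "referrer": "-", "path": "/signup"},
    {"ip": "198.51.100.7", "user_agent": "Mozilla/5.0",
     "referrer": "-", "path": "/signup"},
]

if __name__ == "__main__":
    converted = attributed_conversions(ENTRIES, "campaign=winter", "/signup")
    print(len(converted))
```

Note that this sketch ignores ordering, so a visitor who converted before clicking the ad would still be counted. Comparing timestamps within each fingerprint would tighten that up, at the cost of a bit more bookkeeping.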


This looks awesome for a personal/portfolio site. I use Matomo for mine and it’s sort of overkill. You have to set up your own LAMP/LEMP stack just to get analytics, and I use maybe 1% of the features.


Matomo, unfortunately, is unnecessarily difficult to set up for reading web-server access-log files.


Isn’t the primary goal of Matomo to be a replacement for Google Analytics?


Yes, I think so, which means comparing it with GoAccess is a bit apples and oranges.



