I've found (after setting up similar systems at 15+ companies of varying scale) that where a system like this breaks down is when you want to warehouse event data and tie it to other critical business metrics (Stripe, Salesforce, database tables that underpin the application, etc.). Another point where it starts to break down is when you need low-latency data access. At that point it makes more and more sense to run data into a stream (Kinesis/Kafka/etc.) and have "low latency" (a couple hundred ms or less) and "high latency" (minutes/hours/etc.) points of centralization.
Using multi-AZ/replicated stream-based infrastructure (like Snowplow's Scala stuff) has been completely transformational at numerous companies I've set it up at. A single source of truth for both low-latency and med/high-latency client-side event data is absolutely massive. And being able to tie many sources of data together (by warehousing into Redshift or Snowflake) is eye-opening every single time. I've recently been running ~300k+ requests/minute through Snowplow's stream-based infrastructure and it's rock-solid.
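If it helps to picture the "run data into a stream" part, here's a rough sketch of the producer side with the AWS SDK. The stream name and event shape are placeholders, not a prescription:

```typescript
// Minimal sketch: push a tracked event into a Kinesis stream.
// "analytics-events" and the TrackedEvent shape are hypothetical.
import { KinesisClient, PutRecordCommand } from "@aws-sdk/client-kinesis";

const kinesis = new KinesisClient({ region: "us-east-1" });

interface TrackedEvent {
  userId: string;
  name: string;                      // e.g. "page_view", "checkout_started"
  properties: Record<string, unknown>;
  ts: string;                        // ISO timestamp
}

export async function publishEvent(event: TrackedEvent): Promise<void> {
  await kinesis.send(
    new PutRecordCommand({
      StreamName: "analytics-events",
      PartitionKey: event.userId,    // keeps one user's events on the same shard, in order
      Data: new TextEncoder().encode(JSON.stringify(event)),
    })
  );
}
```

The "low latency" consumers read straight off the stream; the "high latency" path is just a batch/firehose job landing the same records in the warehouse.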
Again, nice post! It's awesome to see people doing similar things. :)
As a frontend engineer having seen under the hood of data pipelines at scale, I wanted to reverse engineer the parts of it that I care about (product analytics via event logging), and package it up for my little side projects.
It's awesome that this is inspiring to people. If people get anything from what I wrote, it'd be this: While large companies all roll their own data pipelines, it's _not that difficult_ for a startup / smaller co / individual to do product analytics on a level that makes sense for them, without just automatically reaching for GA or whatever.
I find it has 80% of the features GA has, it's GPL v3+, and it only takes a few minutes to set up.
> Avoid ad-blockers - My goal with analytics is to learn how people use my site so I can improve it and serve them better. I'm not using ad-tech so there's no point in getting blocked by 25% of visitors with an ad-blocker. That means doing 1st-party analytics, without using a 3rd-party tracking snippet—even self-hosted!*
> *Some ad-blockers already block self-hosted Matomo/Piwik tracking snippets.
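For a concrete picture of what "1st-party, no 3rd-party snippet" can look like, here's a minimal sketch: the page posts to an endpoint on its own domain. The `/collect` path and payload shape here are made up for illustration:

```typescript
// Minimal first-party pageview beacon (sketch only).
// "/collect" is a hypothetical endpoint on the same domain as the site,
// so there's no 3rd-party script or cross-site request involved.
function trackPageview(): void {
  const payload = JSON.stringify({
    path: location.pathname,
    referrer: document.referrer || null,
    ts: Date.now(),
  });
  // sendBeacon survives page unloads; fall back to fetch if it's unavailable.
  if (!navigator.sendBeacon("/collect", payload)) {
    fetch("/collect", { method: "POST", body: payload, keepalive: true });
  }
}

window.addEventListener("load", trackPageview);
```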
Part of respecting user privacy is accepting the fact my tracking scripts will be blocked by most privacy extensions.
I feel Matomo does privacy correctly. By default it continues to use the well-known piwik.js filename that extensions block, and it also respects the DNT (Do Not Track) signal from browsers.
Which users DO want to be tracked?
It's not that some of his users want to be tracked and some don't; it's more that some know how to ask not to be tracked and some don't.
Thankfully analytics can be blocked if someone so desires.
The beauty of Matomo is that you can configure it to read the data from the server logs. No need for me to bother with ad-blockers, tracking scripts, or how it all affects my website's performance.
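Matomo ships its own log importer for this; just to illustrate the general "analytics from server logs" idea (this is not Matomo's importer, and the log path and regex are illustrative), a toy sketch:

```typescript
// Toy sketch: count pageviews per path from a combined-format access log.
import { readFileSync } from "node:fs";

const LINE = /^(\S+) \S+ \S+ \[[^\]]+\] "GET ([^ ]+) HTTP\/[^"]+" (\d{3})/;

const counts = new Map<string, number>();
for (const line of readFileSync("access.log", "utf8").split("\n")) {
  const m = LINE.exec(line);
  if (!m || m[3] !== "200") continue;                  // only successful GETs
  const path = m[2].split("?")[0];                     // drop query strings
  if (/\.(css|js|png|jpg|svg|ico|woff2?)$/.test(path)) continue; // skip assets
  counts.set(path, (counts.get(path) ?? 0) + 1);
}

// Top 20 pages by hits.
console.table([...counts.entries()].sort((a, b) => b[1] - a[1]).slice(0, 20));
```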
The usual trap. That's why we have managed DBs.
The full list of trackers that will be blocked by default is substantial. 
I have a very limited set of domains (manually) allowed to save cookie data in Firefox past when a session ends, with everything else supposed to be auto-deleted after the session ends ("Keep until: 'I close Firefox'").
Invariably though, after a week or so there are quite a few cookies saved for other domains anyway, which Firefox has decided, for some unknown reason, to keep regardless, without any explanation of why it's done so. Grrrr.
While manually clearing those out works, it doesn't seem like the current code base is working as intended.
I'm all for not using Google Analytics- I don't use it on my site. But why would you then use Google Sheets to hold the info? That's ridiculous. That's like telling people how to make an apple pie without sugar, and the last step before baking is dumping in a pound of sugar!
If you want an easy data store, use something like SQLite. There are plenty of options that are easy to self host, and a lot of libraries have been written to make it easier for you.
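For instance, a sketch along those lines, assuming the better-sqlite3 package (the table and column names are made up):

```typescript
// Sketch of an "easy data store" for events using SQLite via better-sqlite3.
import Database from "better-sqlite3";

const db = new Database("analytics.db");
db.exec(`
  CREATE TABLE IF NOT EXISTS events (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    path TEXT,
    session_id TEXT,
    created_at TEXT DEFAULT (datetime('now'))
  )
`);

const insert = db.prepare(
  "INSERT INTO events (name, path, session_id) VALUES (?, ?, ?)"
);

export function logEvent(name: string, path: string, sessionId: string): void {
  insert.run(name, path, sessionId);
}

// e.g. daily pageview counts:
// db.prepare("SELECT date(created_at) AS d, count(*) AS n FROM events GROUP BY d").all()
```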
Using Google Sheets doesn't add 3rd party tracking back in. Even if you don't trust Google to not take a look at your spreadsheet, it's unlikely anybody or anything at Google would treat your spreadsheet like analytics data.
All that being said, it is a little funny seeing them funnel the data right back into Google. Personally, I would also prefer a different storage method.
It's not really a hard requirement for me to get off Google though, just to avoid using their ad-tech. For privacy concerns, using Google Sheets isn't all that different from using MySQL on Google Cloud, etc.
On that thought, you could potentially hook it up to Google Sheets or editors that integrate with remote git repos. Skipping 3rd party hosts, you could host your own git server for cheap too.
Awesome project nonetheless! I'm a hypocrite still using GA but want to move off of it this year, so reading your solution has me thinking about my own.
If you're looking for a first-party system, Snowplow is an amazing setup.
How hard is it to set up? Does it come with a UI to easily view, sort, and filter these events into graphs like Mixpanel or Amplitude?
A pretty common move is to drop this data into redshift/snowflake and query it with Mode/Looker/Tableau/whatever. Athena is a viable option as well, until you get into higher data volumes and don't want to pay for each scan.
Context: I'm a tech lead (data engineering) @ a public company, have set this system up 15+ times @ numerous other companies, and could not live without it at this point. Current co's snowplow systems process 250M+ events per day peaking @ 300k+ reqs/min, on very cost-efficient infra.
Try Indicative (https://www.indicative.com). (I'm the CEO.) It's a customer analytics platform, similar to the ones mentioned, but with a one-click integration for Snowplow-based data warehouses. It's designed for product and marketing teams to easily perform customer behavioral analysis without needing data teams or coding skills.
If it's a fun side project, do it! If you actually want it to work, don't do it. If you want to /really/ use it (checking more than sessions), don't do it.
I don't think rolling your own auth or encryption is an apples to apples comparison.
Part of the new hotness in marketing is micro-funnels. You have to have customized metrics built into the product for that to work.
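To make that concrete: "customized metrics built into the product" mostly means emitting your own named steps and counting how far users get. A rough sketch, with invented step names and event shape:

```typescript
// Rough sketch of a micro-funnel computed from raw events.
interface Event { userId: string; name: string; ts: number; }

const FUNNEL = ["viewed_pricing", "started_trial", "added_teammate", "upgraded"];

function funnelCounts(events: Event[]): Record<string, number> {
  const ordered = [...events].sort((a, b) => a.ts - b.ts);
  const progress = new Map<string, number>(); // userId -> furthest step reached
  for (const e of ordered) {
    const next = progress.get(e.userId) ?? 0;
    if (e.name === FUNNEL[next]) progress.set(e.userId, next + 1);
  }
  const counts: Record<string, number> = Object.fromEntries(
    FUNNEL.map((s): [string, number] => [s, 0])
  );
  for (const reached of progress.values()) {
    for (let i = 0; i < reached; i++) counts[FUNNEL[i]]++;
  }
  return counts; // how many users completed each step, in order
}
```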
This could easily make the whole thing worthwhile. Ultimately, the number one ingredient for useful analytics is figuring out what you want to know, and how you are going to use that knowledge. It's usually a missing ingredient and GA encourages a much more passive approach.
Modern science often has the opposite issue. Economists (for example) have a lot of data. They can automate regression analysis to find correlations, then fit the theory (in economics, this is often an intricate model of an economic system) to the "result."
The problem is, 99% certainty doesn't mean anything if you effectively test 1000 hypotheses: about 10 spurious correlations will occur purely by chance.
More data is strictly better.
But when you have more data, biases become more of a problem. p-hacking becomes easier, and the easier something is to do, the more likely it is to happen.
I would frame it this way: the signal/noise ratio of data decreases as the size increases.* The overall value increases, but at a slower rate over time.
* The caveat - once you get above a certain volume of data, new processing techniques become available that aren't available at smaller volumes.
You can do science in both directions, but even most scientists can't reliably navigate all the gotchas of the data -> hypothesis direction, while nearly everybody can do hypothesis -> data.
Just let us buy the software and run it. Like in the good old days.
Analytics in this context is a kind of self-awareness. I'm not spying on you the user, I'm listening to our conversation, so that I can improve where I make mistakes, and help serve you, the customer/user/visitor better.
Now if I stick a cookie in your pocket that keeps listening after you've left, or I let 3rd parties listen in to our conversation, then absolutely, that's spying.
As with physical venues that employ analytics, you can easily just not visit those sites that want to know a bit more about how people consume content than seeing "GET /page.html" in HTTP logs.
I'm writing this as a huge free-software proponent, so I'm not a "corporate shill" when I say analytics are really useful even to the most privacy-respecting pieces of software. They let you spend resources much more effectively than making blind decisions about what users want. The vocal users opening GH issues are the 0.01%, and those shouldn't be the only people that folks building webpages and services base their decisions on.
It is not accurate to say that the average person who goes to a web site is even in the slightest aware of all the different ways in which they are being used.
Developers creating software should see that 80% of people can't finish filling in a form or that 80% of people click on something that isn't supposed to be clickable. That way they can identify UX bugs and improve the product.
Sure, asking users to opt-in to telemetry for improving the product is a best practice.
Are developers trying to improve the product, or are you just collecting data to sell to 3rd parties? Big difference.
Problem is, there is no way to know or properly police this.
1. You can build the analyzers however you want, relating different disparate events to each other and getting a report of that more easily than you can with a 3rd-party solution (if the 3rd-party solution can even do what you want).
2. You can do real-time querying of the data relating to what the user is interacting with at the moment (rough sketch after this list).
Downsides: stuff like gender/age analysis will be outside of your grasp, at least at the beginning.
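For point 2, a sketch of what "query what the user is doing right now" can look like against your own store. It assumes the same illustrative SQLite events table sketched earlier in the thread:

```typescript
// Sketch: last 20 events for the session currently on the site, newest first.
import Database from "better-sqlite3";

const db = new Database("analytics.db");

const recent = db.prepare(`
  SELECT name, path, created_at
  FROM events
  WHERE session_id = ?
  ORDER BY created_at DESC
  LIMIT 20
`);

export function whatIsThisUserDoing(sessionId: string) {
  return recent.all(sessionId);
}
```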
Instead of Google Sheets, throw it into BigQuery. Or reprocess hourly/daily into Parquet, then scan with Athena.
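For reference, streaming the collected rows into BigQuery is only a few lines with the official Node client. The dataset/table names are placeholders, and this assumes default credentials are set up:

```typescript
// Sketch: send collected events to BigQuery instead of a spreadsheet.
import { BigQuery } from "@google-cloud/bigquery";

const bigquery = new BigQuery();

export async function insertEvents(rows: object[]): Promise<void> {
  await bigquery.dataset("analytics").table("events").insert(rows);
}

// e.g. insertEvents([{ name: "page_view", path: "/", ts: new Date().toISOString() }]);
```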
For sure, Google Sheets is just a free, easy, temporary stand-in for a real data store.
Since I posted it 2 hrs ago:
* logged 1200 sessions (this doesn't include folks who bounce before the page loads)
* 8 "lock acquisition" errors from Google Apps Script, which basically means it timed out trying to get a slot because of too many concurrent requests
* 10 minutes of lambda runtime (Netlify gives 100hrs / month free)
* 590 ms average network latency
I'll update the post with full details on performance once the dust settles.
thanks for the tip
I sadly rely on reddit/hn to get updates/news now.
That being said, I do love seeing "roll your own" examples - they help foster creativity, remind us that the big silos are not the only ones that can actually produce helpful utilities and platforms, and further de-centralize stuff on the web (indieweb, baby!). Kudos @pcmaffey!
I decided to practice what I preach by rolling my own basic analytics for my site. I had a different set of requirements than this person but I am comfortable with the results.
The only flaw I have found is that it has revealed that most of my blog posts get pathetically few hits.
One issue though. I've reached the end and had no idea what to do from there (aka you don't have a scrollbar OMG). First time I've used Page Up to go back up on a site :)))
Still googly, but fewer scalability issues.
Roll-your-own is really nice when you have small data and when you have simple numbers to query (such as "get number of active players").
Or for now, track how many people moved their mouse to where they expected to see and use a scroll bar, noticed there was none and then closed the page.
I appreciate the joke, but don't give them ideas. Since one can't normally track interactions with the scrollbar, I wouldn't be surprised if they were to implement their own "scrollbar" from DIVs to be able to track if the user had dragged it.
Just put back the built-in scrollbar. Period.