Nice article! I did something very similar to this for my blog, but used Snowplow's javascript tracker (https://github.com/snowplow/snowplow-javascript-tracker), a CloudFront distribution with S3 log forwarding, a couple of Lambda functions (with S3 "put" triggers), S3 as the post-processed storage layer, and AWS Athena as the query layer. The system costs under $1 per month, is very scalable, and is producing amazingly good/structured data with mid-level latency. I've written about it here:
https://bostata.com/post/client-side-instrumentation-for-und...
By using the snowplow javascript tracker, you get a ton of functionality out of the box when it comes to respecting "do not track", structured event formatting, additional browser contexts, etc. If you want to see how the blog site is functionally instrumented, filter network requests by "stm" (sent time) and you'll see what's being collected.
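For reference, the Lambda "put"-trigger step described above can be just a handful of lines. A rough Python sketch (the bucket names are made up, and it assumes CloudFront's standard gzipped, tab-separated access logs with a "#Fields:" header):

    # Hypothetical S3 "put"-triggered Lambda: gunzip a CloudFront access log,
    # turn it into newline-delimited JSON, and write it to a processed bucket
    # that Athena can query. Bucket names below are made up.
    import gzip
    import json
    import urllib.parse

    import boto3

    s3 = boto3.client("s3")
    DEST_BUCKET = "my-analytics-processed"  # hypothetical

    def handler(event, context):
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
            raw = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            lines = gzip.decompress(raw).decode("utf-8").splitlines()

            fields, rows = [], []
            for line in lines:
                if line.startswith("#Fields:"):
                    fields = line.split()[1:]
                elif not line.startswith("#"):
                    rows.append(dict(zip(fields, line.split("\t"))))

            out = "\n".join(json.dumps(r) for r in rows)
            s3.put_object(Bucket=DEST_BUCKET,
                          Key=key.replace(".gz", ".json"),
                          Body=out.encode("utf-8"))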
I've found (after setting up similar systems for 15+ companies of varying scale) that where a system like this breaks down is when you want to warehouse event data and tie it to other critical business metrics (Stripe, Salesforce, database tables that underpin the application, etc.). It also starts to break down when you need low-latency data access. At that point it makes more and more sense to run the data into a stream (Kinesis/Kafka/etc.) and have "low latency" (a couple hundred ms or less) and "high latency" (minutes/hours/etc.) points of centralization.
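For the low-latency path, producing into the stream is the easy part. A hedged boto3/Kinesis sketch (the stream name and event shape are made up):

    # Hypothetical low-latency path: push each tracked event onto a Kinesis
    # stream so downstream consumers can read it within a few hundred ms.
    import json

    import boto3

    kinesis = boto3.client("kinesis")

    def publish_event(event: dict):
        kinesis.put_record(
            StreamName="analytics-events",          # hypothetical stream
            Data=json.dumps(event).encode("utf-8"),
            PartitionKey=event.get("user_id", "anonymous"),
        )

    publish_event({"event": "page_view", "path": "/blog/post", "user_id": "u123"})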
Using multi-az/replicated stream-based infrastructure (like snowplow's scala stuff) has been completely transformational to numerous companies I've set it up at. A single source of truth when it comes to both low-latency and med/high-latency client side event data is absolutely massive. Secondly, being able to tie many sources of data together (via warehousing into redshift or snowflake) is eye-opening every single time. I've recently been running ~300k+ requests/minute through snowplow's stream-based infrastructure and it's rock-solid.
Again, nice post! It's awesome to see people doing similar things. :)
Thanks! Not a data engineer, but I used to work at a data engineering company, and can attest to the complexity and rawness of the industry. Your setup looks solid!
As a frontend engineer having seen under the hood of data pipelines at scale, I wanted to reverse engineer the parts of it that I care about (product analytics via event logging), and package it up for my little side projects.
It's awesome that this is inspiring to people. If people get anything from what I wrote, it'd be this: While large companies all roll their own data pipelines, it's _not that difficult_ for a startup / smaller co / individual to do product analytics on a level that makes sense for them, without just automatically reaching for GA or whatever.
Very cool. For those that want to de-Google-ify their website analytics in a more practical sense, consider using a self-hosted Matomo (formerly Piwik) instance: https://matomo.org/
I find it has 80% of the features GA has, it's GPL v3+, and it only takes a few minutes to set up.
Depending on what kind of analytics you want, Fathom is quite nice too. I set it up on my servers about a month ago and it's been working well.
It’s open source as well, with a public Trello board that speaks to future dev goals.
https://github.com/usefathom/fathom
> Avoid ad-blockers - My goal with analytics is to learn how people use my site so I can improve it and serve them better. I'm not using ad-tech so there's no point in getting blocked by 25% of visitors with an ad-blocker. That means doing 1st-party analytics, without using a 3rd-party tracking snippet—even self-hosted!*
I'm okay with this. If a user agent is attempting to block my tracking code (piwik.js), it's likely that the user doesn't want to be tracked.
Part of respecting user privacy is accepting the fact my tracking scripts will be blocked by most privacy extensions.
I feel Matomo does privacy correctly: by default it keeps the well-known piwik.js filename that extensions block, and it respects the DNT (Do Not Track) signal from browsers.
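For a hand-rolled first-party collector, respecting DNT comes down to a single header check. A rough sketch using Flask (the endpoint and store_event() are hypothetical):

    # Sketch of a first-party collector endpoint that honors the DNT header
    # before recording anything. Framework and endpoint are just examples.
    from flask import Flask, request

    app = Flask(__name__)

    @app.route("/collect", methods=["POST"])
    def collect():
        if request.headers.get("DNT") == "1":
            return "", 204  # user asked not to be tracked; record nothing
        store_event(request.get_json())  # store_event() is hypothetical
        return "", 204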
Stop being so hysterical. If you want to avoid first-party analytics, then don't use the service; if someone reads my blog, I have every right to find out how long my articles are read. Protesting against first-party tracking is like protesting against visitor counters in shops and security cameras; it's Luddism.
I'm not being hysterical, I'm just pointing out that the parent poster was kind of offering a false worldview.
It's not that some of his users want to be tracked and some don't, it's more that some are aware of how to ask to not be tracked and some aren't aware.
Calling someone who cares about his own privacy "hysterical" and rehashing the same tired old arguments (which inevitably boil down to "if you don't want me to do this, just make sure you don't touch my website even once!") doesn't do your position, whatever it is, a whole lot of justice.
Specifically, it's the EasyPrivacy list that blocks it. That list is one of the default ones in uBlock Origin.
The beauty of Matomo is that you can configure it to read the data from the server logs. No need for me to bother with adblockers, tracking scripts, and how it all affects my website performance.
It looks as if, in the next several months, Firefox will become all but invisible to most 3rd-party trackers, as Firefox "will strip cookies and block storage access from third-party tracking content, based on lists of tracking domains by Disconnect." [1]
The full list of trackers that will be blocked by default is substantial. [2]
I have a very limited set of domains that are (manually) allowed to save cookie data in Firefox past the end of a session. Everything else is supposed to be auto-deleted when the session ends ("Keep until: 'I close Firefox'").
Invariably though, after a week or so there are quite a few cookies saved for other domains anyway, which Firefox has decided for some unknown reason it's going to keep regardless. Without any explanation of why it's done so. Grrrr.
While manually clearing those out works, it doesn't seem like the current code base is working as intended.
I'm all for not using Google Analytics; I don't use it on my site. But why would you then use Google Sheets to hold the info? That's ridiculous. That's like telling people how to make an apple pie without sugar, and the last step before baking is dumping in a pound of sugar!
If you want an easy data store, use something like SQLite. There are plenty of options that are easy to self host, and a lot of libraries have been written to make it easier for you.
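For example, a minimal SQLite event store is only a few lines of Python (the schema here is just illustrative):

    # Tiny self-hosted page-view store using the standard-library sqlite3 module.
    import sqlite3
    from datetime import datetime, timezone

    conn = sqlite3.connect("analytics.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS pageviews (
            ts TEXT, path TEXT, referrer TEXT, session_id TEXT
        )
    """)

    def record_pageview(path, referrer, session_id):
        conn.execute(
            "INSERT INTO pageviews VALUES (?, ?, ?, ?)",
            (datetime.now(timezone.utc).isoformat(), path, referrer, session_id),
        )
        conn.commit()

    record_pageview("/blog/hello", "https://news.ycombinator.com", "abc123")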
The article isn't about removing all traces of Google from your site/stack. It's about removing 3rd party tracking from your site. Specifically, it focuses on removing Google Analytics.
Using Google Sheets doesn't add 3rd party tracking back in. Even if you don't trust Google to not take a look at your spreadsheet, it's unlikely anybody or anything at Google would treat your spreadsheet like analytics data.
All that being said, it is a little funny seeing them funnel the data right back into Google. Personally, I would also prefer a different storage method.
For sure, there are a bunch of reasons not to use Google Sheets. For this, it's just an easy, free, and temporary way to get a new site up without needing to set up a backend.
It's not really a hard requirement for me to get off Google though, just to avoid using their ad-tech. For privacy concerns, using Google Sheets isn't all that different from using MySQL on Google Cloud, etc.
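For anyone curious what the Sheets path looks like in code, appending rows is a few lines with a client like gspread (this isn't necessarily what the article uses; the sheet name and credentials file are made up):

    # Hypothetical sketch: push one event row into a Google Sheet with gspread
    # (recent versions). Sheet name and service-account file are made up.
    import gspread

    gc = gspread.service_account(filename="service-account.json")
    sheet = gc.open("site-analytics").sheet1

    def log_event(ts, path, referrer):
        sheet.append_row([ts, path, referrer])

    log_event("2019-01-15T12:00:00Z", "/blog/hello", "https://news.ycombinator.com")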
When looking at solutions, did you consider using a private git repo?
On that thought, you could potentially hook it up to Google Sheets or editors that integrate with remote git repos. Skipping 3rd party hosts, you could host your own git server for cheap too.
Awesome project nonetheless! I'm a hypocrite still using GA but want to move off of it this year, so reading your solution has me thinking about my own.
At the past 3 places I've worked, we've set up Snowplow Analytics (https://snowplowanalytics.com/) and would strongly recommend it over GA, Segment, and other third-party systems.
If you're looking for a first-party system, Snowplow is an amazing setup.
Snowplow is great and cost-effective. AWS CloudFront, Lambda, and Athena make it possible to create a fully serverless setup on top of it. There's an open source example of such a setup: https://statsbotco.github.io/cubejs/event-analytics/
A pretty common move is to drop this data into redshift/snowflake and query it with Mode/Looker/Tableau/whatever. Athena is a viable option as well, until you get into higher data volumes and don't want to pay for each scan.
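If you do go the Athena route, kicking queries off programmatically is a couple of boto3 calls. A sketch with a hypothetical database, table, and results bucket:

    # Hypothetical Athena query against a post-processed events table.
    import boto3

    athena = boto3.client("athena")

    resp = athena.start_query_execution(
        QueryString="SELECT path, count(*) AS views FROM events "
                    "GROUP BY path ORDER BY views DESC",
        QueryExecutionContext={"Database": "analytics"},                    # hypothetical
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # hypothetical
    )
    print(resp["QueryExecutionId"])  # poll get_query_execution() for completion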
Context: I'm a tech lead (data engineering) @ a public company, have set this system up 15+ times @ numerous other companies, and could not live without it at this point. Current co's snowplow systems process 250M+ events per day peaking @ 300k+ reqs/min, on very cost-efficient infra.
Try Indicative (https://www.indicative.com). I'm the CEO. It's a customer analytics platform, similar to the ones mentioned, but with a one-click integration for Snowplow-based data warehouses. It's designed for product and marketing teams to easily perform customer behavioral analysis without the direct need for data teams or coding skills.
This is like a marketer suggesting an engineer roll his own security.
If it's a fun side project, do it! If you actually want it to work, don't do it. If you want to /really/ use it (checking more than sessions), don't do it.
Props for diving into such a hot topic though. This is a really big deal for companies of many sizes (probably _all_ sizes), and bringing creativity to this conversation is +1
Awesome work! I had the same issues with GA, but decided to go with Matomo (Piwik). It's an Open Source, self-hosted alternative to Analytics. So far I couldn't be more impressed!
There's also GoAccess [1], which can monitor and parse your nginx or apache logs to show you realtime stats in an HTML report or even directly in your terminal. The numbers don't fully match Google Analytics numbers, but I love how it looks.
Working backwards, I start by defining what data I care about tracking. This is actually one of the biggest ancillary benefits of rolling your own analytics. I get to perfectly fit my analytics to my application.
This could easily make the whole thing worthwhile. Ultimately, the number one ingredient for useful analytics is figuring out what you want to know, and how you are going to use that knowledge. It's usually a missing ingredient and GA encourages a much more passive approach.
It's not very scientific though. You're starting from a point of assuming you already know what you want, instead of collecting data and analyzing it to see what correlates with what.
I would argue it's more scientific (even though the goal generally isn't science). Hypothesize, design measurements, test, conclude.
Modern science often has the opposite issue. Economists (for example) have a lot of data. They can automate regression analysis to find correlations, then fit the theory (in economics, this is often an intricate model of an economic system) to the "result."
The problem is, 99% certainty doesn't mean anything if you effectively test 1,000 hypotheses: around 10 of them will come up as "significant" correlations purely by chance.
More data does not make you a worse data scientist. You can run a bad study with any amount of data. The only difference is that a new hypothesis will have historical data to look at. More data does not require you to create some overfitted model.
I disagree with "More data is strictly better". if you said that everything else equal, more data is better, I would be closer to agreement.
But when you have more data, biases become more of a problem. p-hacking becomes easier, and the easier something is to do, the more likely it is to happen.
I would frame it this way: the signal/noise ratio of data decreases as the size increases.* The overall value increases, but at a slower rate over time.
* The caveat - once you get above a certain volume of data, new processing techniques become available that aren't available at smaller volumes.
Yeah, you start with a hypothesis, and then go test it.
You can do science on both directions, but even most scientists can't reliably navigate all the gotchas of the data -> hypothesis direction, while nearly everybody can do hypothesis -> data.
If you want to decommission Google all the way, you can have a look at Countly (https://github.com/countly/countly-server). It is also available on Digital Marketplace so it is quite straightforward to deploy on a cheap instance. It has an API so you can query your data, too.
We removed all tracking scripts from the front-end framework we use (vuepress) and are only checking nginx logs with geoip and org extensions enabled. This is more than enough for us as we're in the non-consumer software business with a relatively low volume of page views by human visitors. We see the org, the country, the city, and the page flow. Good enough.
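For anyone without the nginx geoip module compiled in, roughly the same enrichment can be done offline against the access logs. A sketch with MaxMind's geoip2 reader (the database path is hypothetical):

    # Offline geo enrichment of log IPs with MaxMind's geoip2 reader, roughly
    # what the nginx geoip module gives you inline. DB path is hypothetical.
    import geoip2.database

    reader = geoip2.database.Reader("GeoLite2-City.mmdb")

    def lookup(ip):
        r = reader.city(ip)
        return {"country": r.country.iso_code, "city": r.city.name}

    print(lookup("8.8.8.8"))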
This is an important sentiment that I want to respond to. Because on one level, I absolutely agree. But the nature of the web is different. HTTP is not software that you "buy and run". It's a back and forth communication, request / response.
Analytics in this context is a kind of self-awareness. I'm not spying on you the user, I'm listening to our conversation, so that I can improve where I make mistakes, and help serve you, the customer/user/visitor better.
Now if I stick a cookie in your pocket that keeps listening after you've left, or I let 3rd parties listen in to our conversation, then absolutely, that's spying.
Where is the distinction between spying on a user through a piece of traditional software and through a web site? At the end of the day, you are spying on users to better yourself, and sometimes at the cost of the user, who mostly is not aware of the manipulation that goes on. Sorry for the cynicism, but that is the reality.
First-party analytics is like blocking people in face masks from entering a store, using security cameras, or counting daily visitors. Your "cynicism" is more like Luddism.
As with physical venues that employ analytics, you can easily just not visit those sites that want to know a bit more about how people consume content than seeing "GET /page.html" in HTTP logs.
I'm writing this as a huge free-software proponent, so I'm not a "corporate shill" when I say that analytics are really useful even for the most privacy-respecting pieces of software. They let you spend resources much more effectively than making blind decisions about what users want. The vocal users opening GitHub issues are the 0.01%, and they shouldn't be the only ones that people building webpages and services base decisions on.
When any normal person visits a store they can see the security cameras watching, and they can likely imagine that they may be counted.
It is not accurate to say that the average person who goes to a web site is even in the slightest aware of all the different ways in which they are being used.
Just like entering a store, visiting a webpage already implies that the thing you're visiting does not belong to you, that the content there is not yours, that the door might be counting visitors, and that there most likely are CCTV cameras to dissuade thieves from stealing. If you do not agree with that, don't enter the store or visit the web page; it's that simple. Someone else's ignorance alone is not a reason to avoid doing something that in the end will benefit both you and maybe other customers.
We need a more nuanced conversation than "spying tools".
Developers creating software should see that 80% of people can't finish filling in a form or that 80% of people click on something that isn't supposed to be clickable. That way they can identify UX bugs and improve the product.
Sure, asking users to opt-in to telemetry for improving the product is a best practice.
Are developers trying to improve the product, or are you just collecting data to sell to 3rd parties? Big difference.
There are of course other benefits to rolling your own analytics:
1. You can build the analyzers how you want, relate disparate events to each other, and get a report of that more easily than you can with a 3rd-party solution (if the 3rd-party solution can even do what you want).
2. You can do real-time querying of the data relating to what the user is interacting with at the moment.
Downsides: stuff like gender/age analysis will be outside your grasp, at least at the beginning.
Another option is to use a cloudfront hosted pixel, and have a cloudwatch job to schedule processing of that data every X minutes. This gives you ultra fast edge response time on your tracker, instead of slow lambda + api gateway, but you have to back-process the data. This should also cost less if you have a lot of traffic.
Instead of Google sheets, throw it to bigquery. Or reprocess hourly/daily into parquet then scan with athena.
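The hourly/daily reprocessing job can be tiny. A sketch with pandas (the paths are hypothetical; assumes s3fs and pyarrow are installed):

    # Hypothetical hourly job: read raw newline-delimited JSON events and
    # rewrite them as Parquet so Athena scans far less data per query.
    import pandas as pd

    df = pd.read_json("s3://my-raw-events/2019/01/15/13.json", lines=True)
    df.to_parquet("s3://my-processed-events/2019/01/15/13.parquet", index=False)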
OT, what do you use to follow RSS feeds with? I've tried a few different feed readers since google reader was shut down, but never settled on one for more than a couple days.
I sadly rely on reddit/hn to get updates/news now.
I've used The Old Reader since Google Reader shut down and haven't been disappointed. Does exactly what it replaced used to do and nothing more. Totally worth the few bucks a year.
Not the poster you were replying to, but I use Thunderbird's inbuilt rss reader. I get news and entertainment through it and have no complaints about the UI or experience (The interface is the same as the email interface, so it is easy to master).
This is quite timely! I'm in the middle of transitioning away from Google Analytics for my apps/sites (to de-google-ify as much as possible). However, my plan was to implement self-hosted Matomo/Piwik. In fact, I was considering only implementing the server log-reading feature (and not the javascript tracker script), primarily to lessen webpage bloat but also because my analytics needs are quite basic and minimal. (I'd like to say that I'm some awesome digital marketer and "need all the data things", but honestly, I really can't justify all that for my small apps/sites. ;-)
That being said, I do love seeing "roll your own" examples, which help foster creativity, remind us that the big silos are not the only ones that can actually produce helpful utilities and platforms, and further de-centralize stuff on the web (indieweb baby!). Kudos @pcmaffey!
I encourage efforts like this. Far too many sites pour their data into Google Analytics, giving Google far too much information about our browsing. In my opinion, even logging the IP addresses of visitors is risky with laws like the GDPR coming into force.
I decided to practice what I preach by rolling my own basic analytics for my site. I had a different set of requirements[0] than this person but I am comfortable with the results.
The only flaw I have found is that it has revealed that most of my blog posts get pathetically few hits[1].
This is great! I’m actually building my own analytics site for similar reasons[1] and to learn Rails. It's still in early development, but I invite anyone to join and test it out in the meantime.
Nice! Good job! A bit too long to read everything, but liked the intro.
One issue though. I've reached the end and had no idea what to do from there (aka you don't have a scrollbar OMG). First time I've used Page Up to go back up on a site :)))
I love the idea of rolling your own solutions, it's fun and challenging and you'll learn a lot especially when it comes to scaling. For actual deployment for a startup though, I struggle to see any benefit whatsoever.
This is awesome! I'm still an analytics noob with a tiny bit of GA experience. I have my own site I'm launching but it's totally DIY, for the DIY punk/hardcore scene. Some of the first party suggestions in this thread have been great!
I'd like to do the same for a mobile app I'm building, but I find Firebase has a lot of features that would be hard to build out myself, e.g. user session length and install attribution.
I really like the inline links on this page. Not having to open a new tab for a one-line explanation is great. I wonder if this should be a browser-level implementation one day.
It seems like losing sessions because people closed their browser or the exit events never fired is potentially the weakest point of the setup. Still, very nice and user friendly!
Can anyone post a good guide as to what analytics I should be collecting? I want to implement a system of my own like this, but would like a quick primer on the subject.
Your site has no visible scroll bar. What was your design reason for hiding it? It's probably one of the most useful controls to use on a site. You should track its usage with your own analytics.
Or for now, track how many people moved their mouse to where they expected to see and use a scroll bar, noticed there was none and then closed the page.
>You should track its usage with your own analytics.
I appreciate the joke, but don't give them ideas. Since one can't normally track interactions with the scrollbar, I wouldn't be surprised if they were to implement their own "scrollbar" from DIVs to be able to track if the user had dragged it.