Roll Your Own Analytics (pcmaffey.com)
284 points by pcmaffey on March 14, 2019 | 104 comments

(Data engineer here)

Nice article! I did something very similar to this for my blog but used Snowplow's javascript tracker (https://github.com/snowplow/snowplow-javascript-tracker), a cloudfront distribution with s3 log forwarding, a couple lambda functions (with s3 "put" triggers), S3 as the post-processed storage layer, and AWS athena as the query layer. The system costs under $1 per month, is very scalable, and is producing amazingly good/structured data with mid-level latency. I've written about it here:


By using the snowplow javascript tracker, you get a ton of functionality out of the box when it comes to respecting "do not track", structured event formatting, additional browser contexts, etc. If you want to see how the blog site is functionally instrumented, filter network requests by "stm" (sent time) and you'll see what's being collected.
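For readers curious what the log post-processing step in a setup like this might look like, here is a minimal sketch, not the commenter's actual Lambda code. It assumes CloudFront standard access logs (tab-separated rows preceded by a `#Fields:` header) and only handles parsing; fetching the object from S3 and writing results back are left out. The function name `parse_cloudfront_log` and the shortened field list in the sample are illustrative assumptions, and real logs carry many more columns.

```python
from urllib.parse import parse_qs


def parse_cloudfront_log(text):
    """Parse CloudFront standard-log text into a list of event dicts.

    Column names are read from the '#Fields:' header, so the order of
    columns is not hardcoded.
    """
    fields = []
    events = []
    for line in text.splitlines():
        if line.startswith("#Fields:"):
            fields = line[len("#Fields:"):].split()
            continue
        if not line or line.startswith("#"):
            continue  # skip '#Version' and blank lines
        row = dict(zip(fields, line.split("\t")))
        query = row.get("cs-uri-query", "-")
        # CloudFront writes "-" for an empty field.
        params = {} if query == "-" else {k: v[0] for k, v in parse_qs(query).items()}
        events.append({
            "date": row.get("date"),
            "time": row.get("time"),
            "path": row.get("cs-uri-stem"),
            "params": params,
        })
    return events


# Illustrative sample with a shortened field list (an assumption for brevity).
sample = (
    "#Version: 1.0\n"
    "#Fields: date time c-ip cs-method cs-uri-stem sc-status cs-uri-query\n"
    "2019-03-14\t12:00:01\t203.0.113.7\tGET\t/i\t200\te=pv&stm=1552564801000\n"
)
events = parse_cloudfront_log(sample)
```

Each tracker request becomes one dict whose `params` hold the query-string payload (including `stm`), which could then be written out as JSON lines or Parquet for a query layer such as Athena.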

I've found (after setting up similar systems for 15+ companies of varying scale) that where a system like this breaks down is when you want to warehouse event data and tie it to other critical business metrics (stripe, salesforce, database tables that underpin the application, etc). Another point where it starts to break down is when you need low-latency data access. At that point it makes more and more sense to run data into a stream (kinesis/kafka/etc) and have "low latency" (couple hundred ms or less) and "high latency" (minutes/hours/etc) points of centralization.

Using multi-az/replicated stream-based infrastructure (like snowplow's scala stuff) has been completely transformational to numerous companies I've set it up at. A single source of truth when it comes to both low-latency and med/high-latency client side event data is absolutely massive. Secondly, being able to tie many sources of data together (via warehousing into redshift or snowflake) is eye-opening every single time. I've recently been running ~300k+ requests/minute through snowplow's stream-based infrastructure and it's rock-solid.

Again, nice post! It's awesome to see people doing similar things. :)

Thanks! Not a data engineer, but I used to work at a data engineering company, and can attest to the complexity and rawness of the industry. Your setup looks solid!

As a frontend engineer having seen under the hood of data pipelines at scale, I wanted to reverse engineer the parts of it that I care about (product analytics via event logging), and package it up for my little side projects.

It's awesome that this is inspiring to people. If people get anything from what I wrote, it'd be this: While large companies all roll their own data pipelines, it's _not that difficult_ for a startup / smaller co / individual to do product analytics on a level that makes sense for them, without just automatically reaching for GA or whatever.

I totally agree

And also, huge props for the creativity! You hit a nerve here, and it's a topic being discussed at pretty much any/all companies.

Your post is fantastic. Is there any chance you could share the lambda function code you use for annotating the log entries?

Could you share the lambda function? It'd be great to be able to see this end-to-end without getting into implementation.

This is a great idea using cloudfront logs as the "raw data store". Well done.

Is there a way to gild a comment here?

Very cool. For those that want to de-Google-ify their website analytics in a more practical sense, consider using a self-hosted Matomo (formerly Piwik) instance: https://matomo.org/

I find it has 80% of the features GA has, it's GPL v3+ and only takes a few minutes to setup.

Depending on what kind of analytics you want, Fathom is quite nice too. I recently set it up on my servers (about a month ago) and it's been working well. It’s open source as well, with a public Trello board that speaks to future dev goals. https://github.com/usefathom/fathom

And here's my write up of how you can quickly host your own instance of Fathom using dokku: https://www.kn8.lt/blog/hosting-your-own-analytics-with-fath...

I can attest to the quality of Fathom. I switched over browserless.io from Google Analytics to fathom. Been very nice.

Fathom is quite nice. Very easy to deploy in Docker or Kubernetes too.

Also mentioned in the article:

> Avoid ad-blockers - My goal with analytics is to learn how people use my site so I can improve it and serve them better. I'm not using ad-tech so there's no point in getting blocked by 25% of visitors with an ad-blocker. That means doing 1st-party analytics, without using a 3rd-party tracking snippet, even self-hosted!*

> *Some ad-blockers already block self-hosted Matomo/Piwik tracking snippets.

I'm okay with this. If a user agent is attempting to block my tracking code (piwik.js), it's likely that the user doesn't want to be tracked.

Part of respecting user privacy is accepting the fact my tracking scripts will be blocked by most privacy extensions.

I feel Matomo does privacy correctly. By default it continues to use the well known piwik.js filename that extensions block and also respects the DNT (do not track) signal from browsers.
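For anyone rolling their own collector instead, honoring the DNT signal server-side can be a very small check. This is a minimal sketch, assuming request headers arrive as a plain dict; `should_track` is a hypothetical helper name and this is not Matomo's implementation.

```python
def should_track(headers):
    """Return False when the request carries 'DNT: 1', True otherwise.

    Header names are matched case-insensitively, as HTTP requires.
    """
    normalized = {name.lower(): value.strip() for name, value in headers.items()}
    return normalized.get("dnt") != "1"


# Example: skip logging the event when the visitor opted out.
opted_out = not should_track({"DNT": "1", "User-Agent": "Mozilla/5.0"})
```

A missing header or `DNT: 0` is treated as "no preference expressed", which is a design choice; a stricter collector could require an explicit opt-in instead.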

> I'm okay with this. If a user agent is attempting to block my tracking code (piwik.js), it's likely that the user doesn't want to be tracked.

What users DO want to be tracked????

Stop being so hysteric. If you want to avoid first-party analytics, then don't use the service; if someone reads my blog, I have every right to find out how long my articles are read. Protesting against first-party tracking is like protesting against visitor counters in shops and security cameras: Luddism.

I'm not being hysteric, I'm just pointing out that the parent poster was kind of offering a false worldview.

It's not that some of his users want to be tracked and some don't, it's more that some are aware of how to ask to not be tracked and some aren't aware.

But you're leaving out those people who actually do not care and those who find it okay to a certain extent, that was my point.

Calling someone who cares about his own privacy "hysteric" and rehashing the same tired old arguments (which inevitably boil down to "if you don't want me to do this, just make sure you don't touch my website even once!") doesn't do your position, whatever it is, a whole lot of justice.

By that logic even Apache access logs are a privacy attack. Don't fall into fear mongering either, come on.

Actually there are a lot of things the law says about security cameras in my country...

Thankfully analytics can be blocked if someone so desires.

There is quite a bit that EU laws say about analytics as well... What's your point?

It's not that they want to be tracked, it's that they don't realize or don't care, sometimes both.

I imagine it depends a lot on how you ask the question. How many people opt in to "share my crash data with Apple" ?

Specifically, it's the EasyPrivacy list that blocks it. That list is one of the default ones in uBlock Origin.

The beauty of Matomo is that you can configure it to read the data from the server logs. No need for me to bother with adblockers, tracking scripts, and how it all affects my website performance.

>and only takes a few minutes to setup.

The usual trap. That's why we have managed DBs.

Matomo offers a hosted version of their product as well.

Wouldn't that defeat the purpose of keeping your analytics local to your site/service?

It looks as if in the next several months Firefox will become all but invisible to most 3rd party trackers, as Firefox “will strip cookies and block storage access from third-party tracking content, based on lists of tracking domains by Disconnect.” [1]

The full list of trackers that will be blocked by default is substantial. [2]

1 https://blog.mozilla.org/futurereleases/2019/02/20/enhanced-...

2 https://github.com/disconnectme/disconnect-tracking-protecti...

Hopefully this actually works.

I have a very limited set of domains (manually) allowed to keep cookie data in Firefox past the end of a session. Everything else is supposed to be auto-deleted after the session ends ("Keep until: 'I close Firefox'").

Invariably though, after a week or so there are quite a few cookies saved for other domains anyway, which Firefox has decided for some unknown reason it's going to keep regardless. Without any explanation of why it's done so. Grrrr.

While manually clearing those out works, it doesn't seem like the current code base is working as intended.

I like the idea.... but this doesn't make sense.

I'm all for not using Google Analytics- I don't use it on my site. But why would you then use Google Sheets to hold the info? That's ridiculous. That's like telling people how to make an apple pie without sugar, and the last step before baking is dumping in a pound of sugar!

If you want an easy data store, use something like SQLite. There are plenty of options that are easy to self host, and a lot of libraries have been written to make it easier for you.

The article isn't about removing all traces of Google from your site/stack. It's about removing 3rd party tracking from your site. Specifically, it focuses on removing Google Analytics.

Using Google Sheets doesn't add 3rd party tracking back in. Even if you don't trust Google to not take a look at your spreadsheet, it's unlikely anybody or anything at Google would treat your spreadsheet like analytics data.

All that being said, it is a little funny seeing them funnel the data right back into Google. Personally, I would also prefer a different storage method.

I actually liked it. It seems like a good place on the spectrum between ideologically obsessed and practical. Google sheets is convenient.

For sure, there are a bunch of reasons not to use Google Sheets. For this, it's just an easy, free, and temp way to get a new site up without needing to setup a backend.

It's not really a hard requirement for me to get off Google though, just to avoid using their ad-tech. For privacy concerns, using Google Sheets isn't all that different from using MySQL on Google Cloud, etc.

When looking at solutions, did you consider using a private git repo?

On that thought, you could potentially hook it up to Google Sheets or editors that integrate with remote git repos. Skipping 3rd party hosts, you could host your own git server for cheap too.

Awesome project nonetheless! I'm a hypocrite still using GA but want to move off of it this year, so reading your solution has me thinking about my own.

Lots of ways to skin this cat ;). Figuring out the best way that works for your setup is half the fun. Thanks and good luck!

I think it's a sound, practical decision.

At the past 3 places I've worked, we've set up Snowplow Analytics (https://snowplowanalytics.com/), and I would strongly recommend it over GA, Segment, and other third-party systems.

If you're looking for a first-party system, Snowplow is an amazing setup.

Snowplow is great and cost effective. AWS Cloudfront, Lambda and Athena make it possible to create a fully serverless setup on top of it. There's an open source example of such a setup: https://statsbotco.github.io/cubejs/event-analytics/

I looked into Snowplow but was intimidated by their Trackers/Collectors/Enrich/Storage/Data Modeling/Analytics flow.

How hard is it to set up? Does it come with a UI to easily view, sort and filter these events into graphs like Mixpanel or Amplitude?

Snowplow is awesome - it doesn't come with a UI but here's a sample of what data is included:


A pretty common move is to drop this data into redshift/snowflake and query it with Mode/Looker/Tableau/whatever. Athena is a viable option as well, until you get into higher data volumes and don't want to pay for each scan.

Context: I'm a tech lead (data engineering) @ a public company, have set this system up 15+ times @ numerous other companies, and could not live without it at this point. Current co's snowplow systems process 250M+ events per day peaking @ 300k+ reqs/min, on very cost-efficient infra.


Try Indicative https://www.indicative.com. (I’m the CEO) Our platform is a customer analytics platform, similar to the ones mentioned, but has a one-click integration for Snowplow based data warehouses. It is designed for product and marketing teams, to easily perform customer behavioral analysis without the direct need for data teams or coding skills.

This is like a marketer suggesting an engineer roll his own security.

If it's a fun side project, do it! If you actually want it to work, don't do it. If you want to /really/ use it (checking more than sessions), don't do it.

Many places I have worked end up using N solutions plus at least one from-scratch BI platform.

I don't think rolling your own auth or encryption is an apples to apples comparison.

One of the new hotness in marketing is micro-funnels. You have to have customized metrics built into the product for that to work.

Props for diving into such a hot topic though. This is a really big deal for companies of many sizes (probably _all_ sizes), and bringing creativity to this conversation is +1

Awesome work! I had the same issues with GA, but decided to go with Matomo (Piwik). It's an Open Source, self-hosted alternative to Analytics. So far I couldn't be more impressed!

There's also GoAccess [1] which can monitor and parse your nginx or apache logs to show you realtime stats in a html report or even directly in your terminal. The numbers do not fully match google analytics numbers, but I love how it looks.

[1] https://goaccess.io/

Working backwards, I start by defining what data I care about tracking. This is actually one of the biggest ancillary benefits of rolling your own analytics. I get to perfectly fit my analytics to my application.

This could easily make the whole thing worthwhile. Ultimately, the number one ingredient for useful analytics is figuring out what you want to know, and how you are going to use that knowledge. It's usually a missing ingredient and GA encourages a much more passive approach.

It's not very scientific though. You're starting from a point of assuming you already know what you want instead of collecting data and analyzing it to see what correlates to what.

I would argue it's more scientific (even though the goal generally isn't science). Hypothesize, design measurements, test, conclude.

Modern science often has the opposite issue. Economists (for example) have a lot of data. They can automate regression analysis to find correlations, then fit the theory (in economics, this is often an intricate model of an economic system) to the "result."

The problem is, 99% certainty doesn't mean anything if you effectively test 1000 hypotheses. About 10 correlations will occur randomly.
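The arithmetic behind that claim can be sketched as follows. It assumes the 1000 tests are independent and that none of the tested effects is real, which is the worst case the comment is describing; the function names are illustrative.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of spurious 'significant' results when every null is true."""
    return n_tests * alpha


def prob_at_least_one(n_tests, alpha):
    """Chance of at least one spurious result across independent tests."""
    return 1 - (1 - alpha) ** n_tests


alpha = 0.01      # "99% certainty" per test
n_tests = 1000    # hypotheses effectively tested

spurious = expected_false_positives(n_tests, alpha)
any_spurious = prob_at_least_one(n_tests, alpha)

# A Bonferroni-style correction tightens the per-test threshold instead.
corrected_alpha = alpha / n_tests
```

With these numbers the expected count of chance correlations is 10, and seeing at least one is close to certain, which is why a per-test threshold alone says little when many hypotheses are screened automatically.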

More data does not make you a worse data scientist. You can run a bad study at any amount of data. The only difference is a new hypothesis will have historical data to look at. More data does not require you to create some overfitted model.

More data is strictly better.

I disagree with "More data is strictly better". If you had said that, everything else being equal, more data is better, I would be closer to agreement.

But when you have more data, biases become more of a problem. p-hacking becomes easier, and the easier something is to do, the more likely it is to happen.

I would frame it this way: the signal/noise ratio of data decreases as the size increases.* The overall value increases, but at a slower rate over time.

* The caveat - once you get above a certain volume of data, new processing techniques become available that aren't available at smaller volumes.

Yeah, you start with a hypothesis, and then go test it.

You can do science in both directions, but even most scientists can't reliably navigate all the gotchas of the data -> hypothesis direction, while nearly everybody can do hypothesis -> data.

If you want to decommission Google all the way, you can have a look at Countly (https://github.com/countly/countly-server). It is also available on Digital Marketplace so it is quite straightforward to deploy on a cheap instance. It has an API so you can query your data, too.

We removed all tracking scripts from the front-end framework we use (vuepress) and are only checking nginx logs with geoip and org extensions enabled. This is more than enough for us as we're in the non-consumer software business with a relatively low volume of page views by human visitors. We see the org, the country, the city, and the page flow. Good enough.
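A log-only approach like this can get surprisingly far with a few lines of parsing. The following is a minimal sketch, assuming nginx's default "combined" log format; it reconstructs per-visitor page flow only and leaves out the geoip and org lookups mentioned above. `page_flows` is a hypothetical name, and grouping by IP address alone is a simplification.

```python
import re
from collections import defaultdict

# Matches the start of an nginx "combined" log line:
# ip - user [time] "METHOD path protocol" status bytes ...
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)


def page_flows(lines):
    """Group successful GET requests by client IP, in log order."""
    flows = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group("method") == "GET" and match.group("status") == "200":
            flows[match.group("ip")].append(match.group("path"))
    return dict(flows)


# Illustrative log lines using documentation-reserved IP ranges.
log_lines = [
    '203.0.113.7 - - [14/Mar/2019:12:00:01 +0000] "GET /docs/ HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '203.0.113.7 - - [14/Mar/2019:12:00:09 +0000] "GET /docs/install.html HTTP/1.1" 200 734 "https://example.com/docs/" "Mozilla/5.0"',
    '198.51.100.2 - - [14/Mar/2019:12:01:00 +0000] "GET /missing HTTP/1.1" 404 0 "-" "curl/7.64"',
]
flows = page_flows(log_lines)
```

In practice you would also filter out bots by user agent and asset requests by path before reading anything into the flows.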

I hate all these user spying tools, whether home grown or bought in. Please, if you write software, don't do it.

Just let us buy the software and run it. Like in the good old days.

This is an important sentiment that I want to respond to. Because on one level, I absolutely agree. But the nature of the web is different. HTTP is not software that you "buy and run". It's a back and forth communication, request / response.

Analytics in this context is a kind of self-awareness. I'm not spying on you the user, I'm listening to our conversation, so that I can improve where I make mistakes, and help serve you, the customer/user/visitor better.

Now if I stick a cookie in your pocket that keeps listening after you've left, or I let 3rd parties listen in to our conversation, then absolutely, that's spying.

I don't see anything wrong with keeping track of how users interact with your site in aggregate. It helps you improve it.

Where is the distinction between spying on a user using a piece of traditional software and a web site? At the end of the day, you are spying on users to better yourself and sometimes at the cost of the user, who mostly is not aware of the manipulation that goes on. Sorry for the cynicism, but that is the reality.

First-party analytics is like blocking people in face masks entering a store, using security cameras or counting daily visitors. Your "cynicism" is more Luddism.

As with physical venues that employ analytics, you can easily just not visit those sites that want to know a bit more about how people consume content than seeing "GET /page.html" in HTTP logs.

I'm writing this text to you as a huge free-software proponent, so I'm not a "corporate shill" when I say analytics are really useful even to the most privacy-respecting pieces of software. It lets you spend resources much more effectively than making blind decisions about what users want. The vocal users opening GH issues about things are the 0.01%, and those people shouldn't be the ones that people building web pages and services base their decisions on.

When any normal person visits a store they can see the security cameras watching, and they can likely imagine that they may be counted.

It is not accurate to say that the average person who goes to a web site is even in the slightest aware of all the different ways in which they are being used.

Just like entering a store, visiting a webpage already implies that the thing you're visiting does not belong to you and that the content there is not yours. The door might be counting visitors, and there are most likely CCTV cameras to dissuade thieves from stealing. If you do not agree with that, don't enter the store or visit the web page, it's that simple. It is not accurate to say that someone else's ignorance is reason alone not to do something that in the end will benefit both you and maybe other customers.

We need a more nuanced conversation than "spying tools".

Developers creating software should see that 80% of people can't finish filling in a form or that 80% of people click on something that isn't supposed to be clickable. That way they can identify UX bugs and improve the product.

Sure, asking users to opt-in to telemetry for improving the product is a best practice.

Are developers trying to improve the product, or are you just collecting data to sell to 3rd parties? Big difference.

"Are developers trying to improve the product, or are you just collecting data to sell to 3rd parties?"

Problem is, there is no way to know or properly police this.

There are of course other benefits to rolling your own analytics:

1. You can build the analyzers how you want, to relate disparate events to each other and get a report of that more easily than you can with a 3rd-party solution (if the 3rd-party solution can even do what you want).

2. You can do real time querying of the data relating to what the user is interacting with at the moment.

Downsides: stuff like gender/age analysis will be outside of your grasp, at least at the beginning.

Another option is to use a cloudfront hosted pixel, and have a cloudwatch job to schedule processing of that data every X minutes. This gives you ultra fast edge response time on your tracker, instead of slow lambda + api gateway, but you have to back-process the data. This should also cost less if you have a lot of traffic.

Instead of Google sheets, throw it to bigquery. Or reprocess hourly/daily into parquet then scan with athena.

I can't POST with a pixel though, and I care about event logging.

For sure, Google Sheets is just a free, easy and temp stand-in for a real data store.

Hi HN, if anyone's wondering how this is performing with the 1st page crush... tldr; trial by fire, but it works :)

Since I posted it 2 hrs ago:

* logged 1200 sessions (this doesn't include folks who bounce before the page loads.)

* 8 "lock acquisition" errors from Google App script, which basically means it timed out trying to get a slot from too many concurrents

* 10 minutes of lambda runtime (Netlify gives 100hrs / month free)

* 590 ms average network latency

I'll update the post with full details on performance once the dust settles.

I think you're more likely to hit Netlify's limit of 125k requests per month vs the 100hr limit.
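A quick back-of-the-envelope check of that, using only the figures quoted in this thread (1200 sessions, 10 minutes of lambda runtime, 100 hrs and 125k requests per month). It assumes runtime scales linearly with sessions and that each session makes at least one function request, so treat it as a rough sketch rather than a statement of Netlify's actual billing.

```python
sessions = 1200                 # logged in the first 2 hours
lambda_seconds = 10 * 60        # 10 minutes of runtime for those sessions

free_runtime_seconds = 100 * 3600   # 100 hrs / month, as quoted above
free_requests = 125_000             # requests / month, as quoted above

seconds_per_session = lambda_seconds / sessions
sessions_until_runtime_cap = free_runtime_seconds / seconds_per_session

# Assumption: at least one request per session, so the request cap is
# reached at or before this many sessions.
sessions_until_request_cap = free_requests

request_cap_hits_first = sessions_until_request_cap < sessions_until_runtime_cap
```

At roughly half a second of runtime per session, the runtime allowance would cover several hundred thousand sessions, so under these assumptions the request limit is the one that binds first; sessions that send several events each would hit it even sooner.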

I wish pages like this one had RSS feed. I would like to follow future articles, but I do not want it to end up in my email inbox.

Ok here it is: https://www.pcmaffey.com/rss.xml

thanks for the tip

Cheers, that was quick!

OT, what do you use to follow RSS feeds with? I've tried a few different feed readers since google reader was shut down, but never settled on one for more than a couple days.

I sadly rely on reddit/hn to get updates/news now.

I have recently switched to feedbro after trying feedly and inoreader. Recommended.


I've used The Old Reader since Google Reader shut down and haven't been disappointed. Does exactly what it replaced used to do and nothing more. Totally worth the few bucks a year.


I'm a fan of [InoReader][1]. It's the best replacement for Google Reader I've found. Using it for more than ~5 years now and couldn't be happier.

[1]: https://inoreader.com

Not the poster you were replying to, but I use Thunderbird's inbuilt rss reader. I get news and entertainment through it and have no complaints about the UI or experience (The interface is the same as the email interface, so it is easy to master).

This is quite timely! I'm in the middle of transitioning away from google analytics for my apps/sites (to de-google-ify as much as possible). However, my plan was to implement self-hosted matomo/piwik. In fact, i was considering only implementing the server log-reading feature (and not the javascript tracker script) primarily to lessen webpage bloat but also because my analytics needs are quite basic and minimal. (I'd like to say that i'm some awesome digital marketer, and "need all the data things", but honestly, i really can't justify all that for my small apps/sites. ;-)

That being said, i do love seeing "roll your own" examples - which help foster creativity, remind us that the big silos are not the only ones that can actually produce helpful utilities and platforms, and to further de-centralize stuff on the web (indieweb baby!) Kudos @pcmaffey!


I encourage efforts like this. Far too many sites pour their data into Google Analytics, giving Google far too much information about our browsing. In my opinion, even logging the IP addresses of visitors is risky with laws like the GDPR coming into force.

I decided to practice what I preach by rolling my own basic analytics for my site. I had a different set of requirements[0] than this person but I am comfortable with the results.

The only flaw I have found is that it has revealed that most of my blog posts get pathetically few hits[1].

[0] https://sheep.horse/2019/1/the_world%27s_worst_web_analytics...

[1] https://sheep.horse/visitor_statistics.html

This is great! I’m actually building my own analytics site for similar reasons[1] and to learn Rails. It's still in early development, but I invite anyone to join and test it out in the meantime.

[1] https://analytics.zjm.me

Nice! Good job! A bit too long to read everything, but liked the intro.

One issue though. I've reached the end and had no idea what to do from there (aka you don't have a scrollbar OMG). First time I've used Page Up to go back up on a site :)))

I love the idea of rolling your own solutions, it's fun and challenging and you'll learn a lot especially when it comes to scaling. For actual deployment for a startup though, I struggle to see any benefit whatsoever.

Same DIY analytics with Google Storage + BigQuery: https://github.com/fedia/bigpapa

Still googly, but fewer scalability issues.

I did something similar a couple years ago for a game, where I simply wanted to track active player count over time: https://devlog.disco.zone/2016/09/02/google-sheets-analytics...

Roll-your-own is really nice when you have small data and when you have simple numbers to query (such as "get number of active players").

This is awesome! I'm still an analytics noob with a tiny bit of GA experience. I have my own site I'm launching but it's totally DIY, for the DIY punk/hardcore scene. Some of the first party suggestions in this thread have been great!

nice. heads up though; your link to netlify goes to neltify which doesn't exist.

Cool, thanks

I'd like to do the same for a mobile app I'm building, but I find Firebase has a lot of features that would be hard to build out myself, e.g. user session length and install attribution.

I really like the in line links on this page. Not having to open a new tab for a one line explanation is great. I wonder if this should be a browser level implementation one day.

It seems like losing sessions because people closed their browser or unchecked exit events is potentially the weakest point of the setup. Still, very nice and user friendly!

Can anyone post a good guide as to what analytics I should be collecting? I want to implement a system of my own like this, but would like a quick primer on the subject.

All the effort to avoid Google Analytics, only to turn around and use Google Sheets?

When I read this title I imagined a person getting out a chest of D&D dice. Change my mind.

I’ve been checking the server logs for over 13 years now with awstats and that has always covered my needs. javascript based logging makes no sense.

Your site has no visible scroll bar. What was your design reason for hiding it? It's probably one of the most useful controls to use on a site. You should track its usage with your own analytics.

Or for now, track how many people moved their mouse to where they expected to see and use a scroll bar, noticed there was none and then closed the page.

>You should track its usage with your own analytics.

I appreciate the joke, but don't give them ideas. Since one can't normally track interactions with the scrollbar, I wouldn't be surprised if they were to implement their own "scrollbar" from DIVs to be able to track if the user had dragged it.

Just put back the built-in scrollbar. Period.

You don't track the scrollbar itself, just scroll depth by quartile
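The bucketing itself is simple arithmetic. Here is a minimal sketch, written in Python for illustration even though the same calculation would normally run in browser JavaScript; the function name and its three inputs (scroll offset, viewport height, total document height, all in pixels) are assumptions, not the article's code.

```python
import math


def scroll_quartile(scroll_top, viewport_height, document_height):
    """Return the deepest quartile reached (0, 25, 50, 75 or 100).

    Depth is measured at the bottom edge of the viewport.
    """
    if document_height <= 0:
        return 0
    fraction = min((scroll_top + viewport_height) / document_height, 1.0)
    return math.floor(fraction * 4) * 25


# A 2000px page viewed in a 500px window, scrolled 600px down.
depth = scroll_quartile(600, 500, 2000)
```

Logging only the highest quartile seen per page view keeps the event count small compared with logging every scroll event.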


This can’t differentiate between scrolling via the scrollbar and other means (keyboard, trackpad, mouse wheel).

Likely also the same reason he disabled outlines on all input. Because people think they are ugly without any thought devoted to accessibility.

Woops, not intentional, scrollbar is back.

I see a scroll bar there, on Firefox.
