
Roll Your Own Analytics - pcmaffey
https://www.pcmaffey.com/roll-your-own-analytics/
======
mejakethomas
(Data engineer here)

Nice article! I did something very similar to this for my blog but used
Snowplow's javascript tracker ([https://github.com/snowplow/snowplow-
javascript-tracker](https://github.com/snowplow/snowplow-javascript-tracker)),
a cloudfront distribution with s3 log forwarding, a couple lambda functions
(with s3 "put" triggers), S3 as the post-processed storage layer, and AWS
athena as the query layer. The system costs under $1 per month, is very
scalable, and is producing amazingly good/structured data with mid-level
latency. I've written about it here:

[https://bostata.com/post/client-side-instrumentation-for-
und...](https://bostata.com/post/client-side-instrumentation-for-under-one-
dollar/)
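To give a rough idea of the lambda piece (a simplified sketch, not the exact functions from that write-up; bucket names, key prefixes, and the field handling are illustrative):

```javascript
// Sketch of an S3 "put"-triggered Lambda that post-processes gzipped
// CloudFront access logs into JSON lines (Node.js, AWS SDK v3).
// Bucket names, key prefixes, and the field list are illustrative only.
const { S3Client, GetObjectCommand, PutObjectCommand } = require("@aws-sdk/client-s3");
const zlib = require("zlib");

const s3 = new S3Client({});

exports.handler = async (event) => {
  for (const record of event.Records) {
    const bucket = record.s3.bucket.name;
    const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, " "));

    // CloudFront delivers gzipped, tab-separated log files.
    const obj = await s3.send(new GetObjectCommand({ Bucket: bucket, Key: key }));
    const chunks = [];
    for await (const chunk of obj.Body) chunks.push(chunk);
    const raw = zlib.gunzipSync(Buffer.concat(chunks)).toString("utf8");

    const rows = raw
      .split("\n")
      .filter((line) => line && !line.startsWith("#")) // drop the #Version/#Fields headers
      .map((line) => {
        const f = line.split("\t");
        // A few of the default CloudFront fields, picked by position.
        return { date: f[0], time: f[1], edge: f[2], ip: f[4], uri: f[7], status: f[8] };
      });

    // Write the cleaned-up partition somewhere Athena can query it.
    await s3.send(new PutObjectCommand({
      Bucket: "my-processed-events", // hypothetical bucket
      Key: `processed/${key}.json`,
      Body: rows.map((r) => JSON.stringify(r)).join("\n"),
    }));
  }
};
```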

By using the snowplow javascript tracker, you get a ton of functionality out
of the box when it comes to respecting "do not track", structured event
formatting, additional browser contexts, etc. If you want to see how the blog
site is functionally instrumented, filter network requests by "stm" (sent
time) and you'll see what's being collected.
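For reference, the tracker setup ends up looking roughly like this (the collector hostname and appId are placeholders, and the standard sp.js loader snippet is assumed to already be on the page):

```javascript
// Rough sketch of the tracker config described above (assumes the standard
// sp.js loader snippet has already defined window.snowplow; the collector
// hostname and appId are placeholders).
window.snowplow('newTracker', 'cf', 'collector.example.com', {
  appId: 'my-blog',
  respectDoNotTrack: true,   // drop tracking when the browser sends DNT
  contexts: {
    webPage: true,           // attach a page-view id to every event
    performanceTiming: true  // attach browser timing data
  }
});

window.snowplow('trackPageView');

// Structured events follow the category / action / label / property / value shape.
window.snowplow('trackStructEvent', 'blog', 'share-click', 'twitter');
```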

I've found (after setting up similar systems for 15+ companies of varying scale)
that where a system like this breaks down is when you want to warehouse event
data and tie it to other critical business metrics (Stripe, Salesforce, the
database tables that underpin the application, etc.). It also starts to break
down when you need low-latency data access. At that point it makes more and
more sense to run data into a stream (Kinesis/Kafka/etc.) and have both
"low latency" (a couple hundred ms or less) and "high latency"
(minutes/hours/etc.) points of centralization.
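For illustration, the producer side of "run data into a stream" is tiny with the AWS SDK (the stream name is made up, and a real Snowplow deployment uses its own collector rather than hand-rolled puts):

```javascript
// Illustrative only: pushing an event onto a Kinesis stream (AWS SDK v3).
const { KinesisClient, PutRecordCommand } = require("@aws-sdk/client-kinesis");

const kinesis = new KinesisClient({});

async function emit(event) {
  await kinesis.send(new PutRecordCommand({
    StreamName: "raw-events",                  // hypothetical stream
    PartitionKey: event.userId || "anonymous", // keeps one user's events ordered
    Data: Buffer.from(JSON.stringify(event)),
  }));
}

// A "low latency" consumer (e.g. a Lambda on the stream) and a batch job for the
// "high latency" path can then both read from the same stream.
emit({ userId: "u123", type: "page_view", path: "/roll-your-own-analytics/" })
  .catch(console.error);
```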

Using multi-az/replicated stream-based infrastructure (like snowplow's scala
stuff) has been completely transformational to numerous companies I've set it
up at. A single source of truth when it comes to both low-latency and
med/high-latency client side event data is absolutely massive. Secondly, being
able to tie many sources of data together (via warehousing into redshift or
snowflake) is eye-opening every single time. I've recently been running ~300k+
requests/minute through snowplow's stream-based infrastructure and it's rock-
solid.

Again, nice post! It's awesome to see people doing similar things. :)

~~~
pcmaffey
Thanks! Not a data engineer, but I used to work at a data engineering company,
and can attest to the complexity and rawness of the industry. Your setup looks
solid!

As a frontend engineer having seen under the hood of data pipelines at scale,
I wanted to reverse engineer the parts of it that I care about (product
analytics via event logging), and package it up for my little side projects.

It's awesome that this is inspiring to people. If people get anything from
what I wrote, it'd be this: While large companies all roll their own data
pipelines, it's _not that difficult_ for a startup / smaller co / individual
to do product analytics on a level that makes sense for them, without just
automatically reaching for GA or whatever.

~~~
mejakethomas
I totally agree

------
fbelzile
Very cool. For those that want to de-Google-ify their website analytics in a
more practical sense, consider using a self-hosted Matomo (formerly Piwik)
instance: [https://matomo.org/](https://matomo.org/)

I find it has 80% of the features GA has, it's GPL v3+ and only takes a few
minutes to set up.

~~~
amartya916
Depending on what kind of analytics you want, Fathom is quite nice too. I
recently set it up on my servers (about a month ago) and it's been working
well. It’s open source as well, with a public Trello board that speaks to
future dev goals.
[https://github.com/usefathom/fathom](https://github.com/usefathom/fathom)

~~~
kn8
And here's my write up of how you can quickly host your own instance of Fathom
using dokku: [https://www.kn8.lt/blog/hosting-your-own-analytics-with-
fath...](https://www.kn8.lt/blog/hosting-your-own-analytics-with-fathom/)

------
elbac
It looks as if in the next several months Firefox will become all but
invisible to most 3rd-party trackers, as Firefox "will strip cookies and
block storage access from third-party tracking content, based on lists of
tracking domains by Disconnect." [1]

The full list of trackers that will be blocked by default is substantial. [2]

1
[https://blog.mozilla.org/futurereleases/2019/02/20/enhanced-...](https://blog.mozilla.org/futurereleases/2019/02/20/enhanced-
tracking-protection-testing-update/)

2 [https://github.com/disconnectme/disconnect-tracking-
protecti...](https://github.com/disconnectme/disconnect-tracking-protection)

~~~
justinclift
Hopefully this actually works.

I have a very limited set of domains (manually) allowed to save cookie data in
Firefox past when a session ends, with everything else _supposed_ to be auto-deleted
after the session ends ("Keep until: 'I close Firefox'").

Invariably though, after a week or so there are quite a few cookies saved for
other domains anyway, which Firefox has decided, for some unknown reason, to
keep regardless, without any explanation of why. Grrrr.

While manually clearing those out works, it doesn't seem like the current code
base is working as intended.

------
kotrunga
I like the idea.... but this doesn't make sense.

I'm all for not using Google Analytics; I don't use it on my site. But why
would you then use Google Sheets to hold the info? That's ridiculous. That's
like telling people how to make an apple pie without sugar, and the last step
before baking is dumping in a pound of sugar!

If you want an easy data store, use something like SQLite. There are plenty of
options that are easy to self host, and a lot of libraries have been written
to make it easier for you.
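For example, with Node and better-sqlite3 an event store is only a handful of lines (the schema here is just an illustration):

```javascript
// Minimal self-hosted event store using better-sqlite3 (schema is just an example).
const Database = require("better-sqlite3");
const db = new Database("analytics.db");

db.exec(`CREATE TABLE IF NOT EXISTS events (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  session_id TEXT NOT NULL,
  name TEXT NOT NULL,
  path TEXT,
  created_at TEXT DEFAULT (datetime('now'))
)`);

const insert = db.prepare(
  "INSERT INTO events (session_id, name, path) VALUES (?, ?, ?)"
);

// Call this from whatever endpoint receives the tracker's POSTs.
function logEvent(sessionId, name, path) {
  insert.run(sessionId, name, path);
}

logEvent("abc123", "page_view", "/roll-your-own-analytics/");
```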

~~~
pcmaffey
For sure, there are a bunch of reasons not to use Google Sheets. For this,
it's just an easy, free, and temporary way to get a new site up without
needing to set up a backend.

It's not really a hard requirement for me to get off Google though, just to
avoid using their ad-tech. For privacy concerns, using Google Sheets isn't all
that different from using MySQL on Google Cloud, etc.

~~~
troymcclure
When looking at solutions, did you consider using a private git repo?

On that thought, you could potentially hook it up to Google Sheets or editors
that integrate with remote git repos. Skipping 3rd party hosts, you could host
your own git server for cheap too.

Awesome project nonetheless! I'm a hypocrite still using GA but want to move
off of it this year, so reading your solution has me thinking about my own.

~~~
pcmaffey
Lots of ways to skin this cat ;). Figuring out the best way that works for
your setup is half the fun. Thanks and good luck!

------
lukethomas
At the past 3 places I've worked, we've set up Snowplow Analytics
([https://snowplowanalytics.com/](https://snowplowanalytics.com/)) and would
strongly recommend it over GA, Segment, and other third-party systems.

If you're looking for a first-party system, Snowplow is an amazing setup.

~~~
derekdahmer
I looked into Snowplow but was intimidated by their
Trackers/Collectors/Enrich/Storage/Data Modeling/Analytics flow.

How hard is it to setup? Does it come with a UI to easily view, sort and
filter these events into graphs like Mixpanel or Amplitude?

~~~
mejakethomas
Snowplow is awesome - it doesn't come with a UI but here's a sample of what
data is included:

[https://github.com/snowplow/snowplow/wiki/canonical-event-
mo...](https://github.com/snowplow/snowplow/wiki/canonical-event-model)

A pretty common move is to drop this data into redshift/snowflake and query it
with Mode/Looker/Tableau/whatever. Athena is a viable option as well, until
you get into higher data volumes and don't want to pay for each scan.
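For a sense of what "query it" looks like, here's a hypothetical Athena query kicked off from Node (the database, table, and results bucket are placeholders):

```javascript
// Hypothetical example: kicking off an Athena query over warehoused events (AWS SDK v3).
const { AthenaClient, StartQueryExecutionCommand } = require("@aws-sdk/client-athena");

const athena = new AthenaClient({});

async function dailyPageviews() {
  const { QueryExecutionId } = await athena.send(new StartQueryExecutionCommand({
    QueryString: `
      SELECT date_trunc('day', collector_tstamp) AS day, count(*) AS pageviews
      FROM analytics.events            -- placeholder database.table
      WHERE event = 'page_view'
      GROUP BY 1
      ORDER BY 1`,
    ResultConfiguration: { OutputLocation: "s3://my-athena-results/" }, // placeholder bucket
  }));
  // Poll GetQueryExecution / GetQueryResults with this id to fetch the output.
  return QueryExecutionId;
}
```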

Context: I'm a tech lead (data engineering) @ a public company, have set this
system up 15+ times @ numerous other companies, and could not live without it
at this point. Current co's snowplow systems process 250M+ events per day
peaking @ 300k+ reqs/min, on very cost-efficient infra.

------
soared
This is like a marketer suggesting an engineer roll his own security.

If it's a fun side project, do it! If you actually want it to work, don't do
it. If you want to /really/ use it (checking more than sessions), don't do it.

~~~
ozten
Many places I have worked end up using N solutions plus at least one
from-scratch BI platform.

I don't think rolling your own auth or encryption is an apples to apples
comparison.

The new hotness in marketing is micro-funnels. You have to have
customized metrics built into the product for that to work.

------
dirktheman
Awesome work! I had the same issues with GA, but decided to go with Matomo
(Piwik). It's an Open Source, self-hosted alternative to Analytics. So far I
couldn't be more impressed!

------
zubspace
There's also GoAccess [1] which can monitor and parse your nginx or apache
logs to show you realtime stats in an HTML report or even directly in your
terminal. The numbers do not fully match google analytics numbers, but I love
how it looks.

[1] [https://goaccess.io/](https://goaccess.io/)

------
dalbasal
_Working backwards, I start by defining what data I care about tracking. This
is actually one of the biggest ancillary benefits of rolling your own
analytics. I get to perfectly fit my analytics to my application._

This could easily make the whole thing worthwhile. Ultimately, the number one
ingredient for useful analytics is figuring out what you want to know, and how
you are going to use that knowledge. It's usually a missing ingredient and GA
encourages a much more passive approach.

~~~
jayd16
It's not very scientific though. You're starting from a point of assuming you
already know what you want instead of collecting data and analyzing it to see
what correlates to what.

~~~
dalbasal
I would argue it's more scientific (even though the goal generally isn't
science). Hypothesize, design measurements, test, conclude.

Modern science often has the opposite issue. Economists (for example) have a
lot of data. They can automate regression analysis to find correlations, then
fit the theory (in economics, this is often an intricate model of an economic
system) to the "result."

The problem is, 99% certainty doesn't mean much if you effectively test 1,000
hypotheses: at that threshold, roughly 10 "significant" correlations will occur
purely by chance (0.01 × 1,000).

~~~
jayd16
More data does not make you a worse data scientist. You can run a bad study at
any amount of data. The only difference is that a new hypothesis will have
historical data to look at. More data does not require you to create some
overfitted model.

More data is strictly better.

~~~
edmundsauto
I disagree with "more data is strictly better". If you had said that,
everything else being equal, more data is better, I would be closer to agreement.

But when you have more data, biases become more of a problem. p-hacking
becomes easier, and the easier something is to do, the more likely it is to
happen.

I would frame it this way: the signal/noise ratio of data decreases as the
size increases.* The overall value increases, but at a slower rate over time.

* The caveat - once you get above a certain volume of data, new processing techniques become available that aren't available at smaller volumes.

------
gorkemcetin
If you want to decommission Google all the way, you can have a look at Countly
([https://github.com/countly/countly-
server](https://github.com/countly/countly-server)). It is also available on
Digital Marketplace so it is quite straightforward to deploy on a cheap
instance. It has an API so you can query your data, too.

------
rodionos
We removed all tracking scripts from the front-end framework we use (vuepress)
and are only checking nginx logs with geoip and org extensions enabled. This
is more than enough for us as we're in the non-consumer software business with
a relatively low volume of page views by human visitors. We see the org, the
country, the city, and the page flow. Good enough.

------
laythea
I hate all these user spying tools, whether home grown or bought in. Please,
if you write software, don't do it.

Just let us buy the software and run it. Like in the good old days.

~~~
jimmaswell
I don't see anything wrong with keeping track of how users interact with your
site in aggregate. It helps you improve it.

~~~
laythea
Where is the distinction between spying on a user of a piece of traditional
software and spying on a user of a web site? At the end of the day, you are
spying on users to better yourself, sometimes at the cost of the user, who is
mostly unaware of the manipulation that goes on. Sorry for the cynicism, but
that is the reality.

~~~
dcbadacd
First-party analytics is like a store turning away people in face masks,
using security cameras, or counting daily visitors. Your "cynicism" is more
like Luddism.

As with physical venues that employ analytics, you can easily just _not_
visit those sites that want to know a bit more about how people consume
content than seeing "GET /page.html" in HTTP logs.

I'm writing this as a huge free-software proponent, so I'm not a
"corporate shill" when I say that analytics are really useful even for the
most privacy-respecting pieces of software. They let you spend resources much
more effectively than making blind decisions about what users want; the vocal
users opening GH issues are the 0.01%, and they shouldn't be the only ones
that people building webpages and services make decisions upon.

~~~
laythea
When any normal person visits a store, they can see the security cameras
watching, and they can likely imagine that they may be counted.

It is not accurate to say that it is like that online: the average person who
goes to a web site is not even slightly aware of all the different ways in
which they are being used.

~~~
dcbadacd
Just like entering a store, visiting a webpage already implies that the thing
you're visiting does not belong to you, that the content there is not yours,
that the door might be counting visitors, and that there are most likely CCTV
cameras to dissuade thieves from stealing. If you do not agree with that,
don't enter the store or visit the web page; it's that simple. Someone else's
ignorance is not, on its own, a reason to avoid doing something that in the
end will benefit both you and maybe other customers.

------
bryanrasmussen
There are of course other benefits to rolling your own analytics:

1. You can build the analyzers how you want, to relate different disparate
events to each other and get a report of that more easily than you can with a
3rd party solution (if the 3rd party solution can even do what you want).

2. You can do real-time querying of the data relating to what the user is
interacting with at the moment.

Downsides: stuff like gender/age analysis will be outside of your grasp, at
least at the beginning.

------
jimmy_ruska
Another option is to use a cloudfront hosted pixel, and have a cloudwatch job
to schedule processing of that data every X minutes. This gives you ultra fast
edge response time on your tracker, instead of slow lambda + api gateway, but
you have to back-process the data. This should also cost less if you have a
lot of traffic.

Instead of Google Sheets, throw it into BigQuery, or reprocess hourly/daily
into Parquet and then scan it with Athena.
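If it helps, streaming rows into BigQuery from Node is roughly this (the dataset and table names are invented):

```javascript
// Rough sketch of streaming processed events into BigQuery instead of Sheets
// (@google-cloud/bigquery; the dataset and table names are made up).
const { BigQuery } = require("@google-cloud/bigquery");
const bigquery = new BigQuery();

async function insertEvents(rows) {
  // rows: [{ session_id: "abc", name: "page_view", path: "/", ts: new Date() }, ...]
  await bigquery.dataset("analytics").table("events").insert(rows);
}
```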

~~~
pcmaffey
I can't POST with a pixel though, and I care about event logging.

For sure, Google Sheets is just a free, easy, and temporary stand-in for a
real data store.
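Roughly the shape of what the client side does (a sketch, not the exact code from the post; the function path is a placeholder):

```javascript
// Sketch of the client side of "event logging via POST" (not the post's exact code;
// the Netlify function path is a placeholder). A pixel can only smuggle data into a
// GET query string, whereas this sends a JSON body.
function logEvent(name, data) {
  const body = JSON.stringify({ name, data, ts: Date.now() });

  // sendBeacon queues a POST that survives the page being closed...
  if (navigator.sendBeacon && navigator.sendBeacon("/.netlify/functions/track", body)) {
    return;
  }

  // ...otherwise fall back to fetch with keepalive.
  fetch("/.netlify/functions/track", {
    method: "POST",
    keepalive: true,
    headers: { "Content-Type": "application/json" },
    body,
  });
}

logEvent("page_view", { path: location.pathname, referrer: document.referrer });
```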

------
pcmaffey
Hi HN, if anyone's wondering how this is performing with the 1st page crush...
tldr; trial by fire, but it works :)

Since I posted it 2 hrs ago:

* logged 1200 sessions (this doesn't include folks who bounce before the page loads.)

* 8 "lock acquisition" errors from Google App script, which basically means it timed out trying to get a slot from too many concurrents

* 10 minutes of lambda runtime (Netlify gives 100hrs / month free)

* 590 ms average network latency

I'll update the post with full details on performance once the dust settles.
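For the curious, the lock errors come from the Apps Script side; the write path is roughly this shape (a simplified sketch, not the exact script, and the sheet id is a placeholder):

```javascript
// Simplified sketch of an Apps Script web-app endpoint writing to a Sheet
// (not the exact script; SHEET_ID is a placeholder). The "lock acquisition"
// errors above come from waitLock() timing out under concurrent load.
function doPost(e) {
  const lock = LockService.getScriptLock();
  try {
    lock.waitLock(10 * 1000); // throws if the lock can't be acquired in 10s
  } catch (err) {
    return ContentService.createTextOutput("lock acquisition failed");
  }

  try {
    const event = JSON.parse(e.postData.contents);
    SpreadsheetApp.openById("SHEET_ID")
      .getSheetByName("events")
      .appendRow([new Date(), event.session, event.name, event.path]);
  } finally {
    lock.releaseLock();
  }

  return ContentService.createTextOutput("ok");
}
```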

~~~
twalling
I think you're more likely to hit Netlify's limit of 125k requests per month
vs the 100hr limit.

------
cyrusmg
I wish pages like this one had an RSS feed. I would like to follow future
articles, but I do not want them to end up in my email inbox.

~~~
johntash
OT, what do you use to follow RSS feeds with? I've tried a few different feed
readers since google reader was shut down, but never settled on one for more
than a couple days.

I sadly rely on reddit/hn to get updates/news now.

~~~
dodgyb
I have recently switched to feedbro after trying feedly and inoreader.
Recommended.

[https://nodetics.com/feedbro/](https://nodetics.com/feedbro/)

------
mxuribe
This is quite timely! I'm in the middle of transitioning away from Google
Analytics for my apps/sites (to de-google-ify as much as possible). However,
my plan was to implement self-hosted Matomo/Piwik. In fact, I was considering
only implementing the server log-reading feature (and not the javascript
tracker script), primarily to lessen webpage bloat but also because my
analytics needs are quite basic and minimal. I'd like to say that I'm some
awesome digital marketer who "needs all the data things", but honestly, I
really can't justify all that for my small apps/sites. ;-)

That being said, I do love seeing "roll your own" examples, which help foster
creativity, remind us that the big silos are not the only ones that can
actually produce helpful utilities and platforms, and further de-centralize
stuff on the web (indieweb, baby!). Kudos @pcmaffey!

~~~
mejakethomas
Agreed!

------
AndrewStephens
I encourage efforts like this. Far too many sites pour their data into Google
Analytics, giving Google far too much information about our browsing. In my
opinion, even logging the IP addresses of visitors is risky with laws like the
GDPR coming into force.

I decided to practice what I preach by rolling my own basic analytics for my
site. I had a different set of requirements[0] than this person but I am
comfortable with the results.

The only flaw I have found is that it has revealed that most of my blog posts
get pathetically few hits[1].

[0]
[https://sheep.horse/2019/1/the_world%27s_worst_web_analytics...](https://sheep.horse/2019/1/the_world%27s_worst_web_analytics.html)

[1]
[https://sheep.horse/visitor_statistics.html](https://sheep.horse/visitor_statistics.html)

------
news_to_me
This is great! I’m actually building my own analytics site for similar
reasons[1] and to learn Rails. It's still in early development, but I invite
anyone to join and test it out in the meantime.

[1] [https://analytics.zjm.me](https://analytics.zjm.me)

------
Eek
Nice! Good job! A bit too long to read everything, but liked the intro.

One issue though. I've reached the end and had no idea what to do from there
(aka you don't have a scrollbar OMG). First time I've used Page Up to go back
up on a site :)))

------
Scirra_Tom
I love the idea of rolling your own solutions, it's fun and challenging and
you'll learn a lot especially when it comes to scaling. For actual deployment
for a startup though, I struggle to see any benefit whatsoever.

------
sheepy
Same DIY analytics with Google Storage + BigQuery:
[https://github.com/fedia/bigpapa](https://github.com/fedia/bigpapa)

Still googly, but with fewer scalability issues.

------
avolcano
I did something similar a couple years ago for a game, where I simply wanted
to track active player count over time:
[https://devlog.disco.zone/2016/09/02/google-sheets-
analytics...](https://devlog.disco.zone/2016/09/02/google-sheets-analytics/)

Roll-your-own is really nice when you have small data and when you have simple
numbers to query (such as "get number of active players").

------
droobles
This is awesome! I'm still an analytics noob with a tiny bit of GA experience.
I have my own site I'm launching but it's totally DIY, for the DIY
punk/hardcore scene. Some of the first party suggestions in this thread have
been great!

------
xgess
Nice. Heads up though: your link to Netlify goes to "neltify", which doesn't
exist.

~~~
pcmaffey
Cool, thanks

------
fuddle
I'd like to do the same for a mobile app I'm building, but I find Firebase has
a lot of features that would be hard to build out myself, e.g. user session
length and install attribution.

------
albertgoeswoof
I really like the inline links on this page. Not having to open a new tab for
a one line explanation is great. I wonder if this should be a browser level
implementation one day.

------
taurath
It seems like losing sessions because people close their browser or exit
events never fire is potentially the weakest point of the setup. Still, very
nice and user friendly!

------
fuball63
Can anyone post a good guide as to what analytics I should be collecting? I
want to implement a system of my own like this, but would like a quick primer
on the subject.

~~~
mejakethomas
I've found the documentation here to be very comprehensive if you want to
start learning why/what/how:

[https://github.com/snowplow/snowplow/wiki/javascript-
tracker](https://github.com/snowplow/snowplow/wiki/javascript-tracker)
[https://github.com/snowplow/snowplow/wiki/canonical-event-
mo...](https://github.com/snowplow/snowplow/wiki/canonical-event-model)
[https://developer.matomo.org/api-reference/tracking-
api](https://developer.matomo.org/api-reference/tracking-api)

------
rdiddly
All the effort to avoid Google Analytics, only to turn around and use Google
Sheets?

------
ngvrnd
When I read this title I imagined a person getting out a chest of D&D dice.
Change my mind.

------
gjs278
I’ve been checking the server logs for over 13 years now with awstats and that
has always covered my needs. javascript based logging makes no sense.

------
nickjj
Your site has no visible scroll bar. What was your design reason for hiding
it? It's probably one of the most useful controls to use on a site. You should
track its usage with your own analytics.

Or for now, track how many people moved their mouse to where they expected to
see and use a scroll bar, noticed there was none and then closed the page.

~~~
aaaaaaaaaaab
>You should track its usage with your own analytics.

I appreciate the joke, but don't give them ideas. Since one can't normally
track interactions with the scrollbar, I wouldn't be surprised if they were to
implement their own "scrollbar" from DIVs to be able to track if the user had
dragged it.

Just put back the built-in scrollbar. Period.

~~~
soared
You don't track the scrollbar itself, just scroll depth by quartile

[https://www.simoahava.com/analytics/scroll-depth-trigger-
goo...](https://www.simoahava.com/analytics/scroll-depth-trigger-google-tag-
manager/)
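Without GTM, a plain-JS version of the same idea is only a few lines (the endpoint and event shape here are made up):

```javascript
// Plain-JS sketch of a quartile scroll-depth tracker (not the GTM setup from the link;
// the /track endpoint is a placeholder).
const fired = new Set();

window.addEventListener("scroll", () => {
  const scrollable = document.documentElement.scrollHeight - window.innerHeight;
  if (scrollable <= 0) return;
  const depth = (window.scrollY / scrollable) * 100;

  for (const quartile of [25, 50, 75, 100]) {
    if (depth >= quartile && !fired.has(quartile)) {
      fired.add(quartile);
      // What gets reported is the quartile reached, not any scrollbar interaction.
      navigator.sendBeacon("/track", JSON.stringify({ name: "scroll_depth", value: quartile }));
    }
  }
}, { passive: true });
```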

~~~
aaaaaaaaaaab
This can’t differentiate between scrolling via the scrollbar and other means
(keyboard, trackpad, mouse wheel).

