If you are thinking of migrating away from GA, I highly recommend you move to a data warehouse based solution, where you store each event permanently in a data warehouse like Snowflake or BigQuery. There are two client-side pixels that you can self-host: snowplow.js and Segment. It’s hard to find instructions for self-hosting Segment, but I’ve made an example at https://github.com/fivetran/self-hosted-analytics-js
The advantage of doing it this way is you preserve the event level data forever and you can write arbitrary SQL against it. You will need a BI tool to visualize; there are several excellent ones with free or cheap tiers for small companies.
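To make the "arbitrary SQL over event-level data" point concrete, here is a minimal sketch. It assumes a hypothetical `events` table with `user_id`, `event_name` and `occurred_at` columns, and uses an in-memory SQLite database as a stand-in for a warehouse like Snowflake or BigQuery; the rows are made up.

```python
import sqlite3

# In-memory SQLite stands in for the warehouse (an assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE events (
        user_id     TEXT NOT NULL,
        event_name  TEXT NOT NULL,
        occurred_at TEXT NOT NULL  -- ISO 8601 timestamp
    )
    """
)

# Hypothetical raw events: one row per event, nothing aggregated on write.
rows = [
    ("u1", "page_view", "2019-03-01T09:00:00"),
    ("u1", "signup",    "2019-03-01T09:05:00"),
    ("u2", "page_view", "2019-03-01T10:00:00"),
    ("u2", "page_view", "2019-03-02T11:00:00"),
    ("u3", "page_view", "2019-03-02T12:00:00"),
]
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)


def daily_active_users(conn):
    """Count distinct users per calendar day, straight from the raw rows."""
    query = """
        SELECT substr(occurred_at, 1, 10) AS day,
               COUNT(DISTINCT user_id)    AS active_users
        FROM events
        GROUP BY day
        ORDER BY day
    """
    return conn.execute(query).fetchall()


daily = daily_active_users(conn)
print(daily)
```

Because the raw rows are kept, a metric nobody thought of at tracking time (weekly actives, signups per referrer, and so on) is just another query over the same table.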
That's the path many data-driven companies take lately. Most companies start with plug-and-play solutions such as GA or Mixpanel but as they start to dig into their customer data, they either hire data engineers or use ETL solutions to collect their raw event data into a data warehouse and BI tools in order to track their own metrics.
That way, you will have more control and be able to ask any question you want. We have been working on a tool that connects to your Segment or Snowplow data and runs ad-hoc analysis similar to Mixpanel and GA, so that you don't need to adopt generic BI tools and write SQL every time you create a new report. I was going to create a Show HN post but since your comment is quite relevant to the topic, I wanted to share the product analytics tool that we have been working on, https://rakam.io. The announcement blog post is also here: https://blog.rakam.io/update-we-built-rakam-ui-from-scratch-...
P.S.: I also genuinely appreciate the work the people at Countly have done. It's often not easy to use ETL tools to set up your own data pipeline and create your own metrics, so they're a great alternative if you don't want to get stuck with GA or third-party SaaS alternatives.
Thanks for the suggestion Thomas. We're rebuilding the landing page at the moment and we will be updating it next week. In the meantime, here is a getting started video: https://www.youtube.com/watch?v=G8Cm5jzYUPw&t=1s
Elastic is great when it comes to storing the log data but since it's not ANSI SQL compliant, it's hard to run complex queries that involve JOINs, WINDOW operations, and subqueries.
Depending on the data volume, you can use one of the SQL-based data warehouse solutions such as Redshift, Presto, BigQuery or Snowflake (even PG works if you have less than 100M events per month) and run ad-hoc queries on your raw customer event data easily. It's a bit tricky to run behavioral analytics queries such as funnel and retention, but we provide non-technical friendly user interfaces that don't require you to write any SQL.
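To give a sense of why funnel queries are "a bit tricky", here is a rough sketch of a three-step funnel in plain SQL, run through SQLite from Python. The `events` table, the step names (`view_product`, `add_to_cart`, `purchase`) and the sample rows are all hypothetical; this is one common way to write it, not how any particular product does it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, event_name TEXT, occurred_at TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("u1", "view_product", "2019-03-01T09:00:00"),
        ("u1", "add_to_cart",  "2019-03-01T09:02:00"),
        ("u1", "purchase",     "2019-03-01T09:10:00"),
        ("u2", "view_product", "2019-03-01T10:00:00"),
        ("u2", "add_to_cart",  "2019-03-01T10:01:00"),
        ("u3", "view_product", "2019-03-01T11:00:00"),
        # u4 has a purchase but no recorded view, so never enters the funnel.
        ("u4", "purchase",     "2019-03-01T12:00:00"),
    ],
)

# Each step keeps only users who completed the previous step earlier in time.
FUNNEL_SQL = """
WITH step1 AS (
    SELECT user_id, MIN(occurred_at) AS t1
    FROM events
    WHERE event_name = 'view_product'
    GROUP BY user_id
),
step2 AS (
    SELECT s.user_id, MIN(e.occurred_at) AS t2
    FROM step1 s
    JOIN events e ON e.user_id = s.user_id
                 AND e.event_name = 'add_to_cart'
                 AND e.occurred_at > s.t1
    GROUP BY s.user_id
),
step3 AS (
    SELECT s.user_id, MIN(e.occurred_at) AS t3
    FROM step2 s
    JOIN events e ON e.user_id = s.user_id
                 AND e.event_name = 'purchase'
                 AND e.occurred_at > s.t2
    GROUP BY s.user_id
)
SELECT (SELECT COUNT(*) FROM step1),
       (SELECT COUNT(*) FROM step2),
       (SELECT COUNT(*) FROM step3)
"""

viewed, carted, purchased = conn.execute(FUNNEL_SQL).fetchone()
print(viewed, carted, purchased)
```

Even this small version needs one self-join per step and careful handling of ordering, which is the part a point-and-click funnel UI hides.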
I would love to talk about your use-case, feel free to shoot an email to emre {at} rakam.io
I do both: use ELK for deep dive on analytics and have a traditional data warehouse for SQL.
Elastic is really good for data exploration. Kibana's discovery page makes it trivial to start diving into your events and gain insights. This is most valuable for large, complex applications where you don't necessarily control the analytics tagging (thus devs can create new tags willy-nilly). It's also easier for non-technical users to get into, since they don't need any sort of SQL experience to build reports.
Elastic is really bad at performing aggregations. You can do it, but they are incredibly memory-intensive, and using them puts the cluster at risk of failure due to out of memory exceptions. It doesn't take many concurrent users for this to be an issue.
Traditional data warehouses are still the best place to get counts though. Assuming your users can write SQL, they can get everything they would want and it will be reliable. Newer databases support JSON structures, so you can even host arbitrary-sized dynamic event data in them like you can with Elastic.
This also depends on why you are choosing this technology - is it part of your existing stack or do you have in-house experts in this tech?
While Lucene syntax is very powerful, it is not SQL, as pointed out by the OP. If you have a lot of spare developer time and people skilled in this, it will likely work well for a while (potentially a really long while).
Going with something like BigQuery or Redshift enables you to utilize other tech to supplement or accelerate skill sets, such as paying a SaaS for visualization/analysis tools (Looker, Mode, etc).
In particular, separation of storage and processing helps avoid costs when you are not using the data. With Elastic, I believe you need the cluster whether you are using it or not. BigQuery will only charge you for the compute you use and pennies for storage. Same thing for Athena/Redshift Spectrum. Snowflake is similar, but I believe for enterprise contracts it's a bit more complex, with minimums and such.
Structuring data is another really important consideration - will you need to normalize and will you need to update labels/dimensions? That’s just the start.
> is it part of your existing stack or do you have in-house experts in this tech?
Nope and nope. That's definitely the scary / unknown part!
> If you have a lot of spare developer time and people skilled in this, it will likely work well for a while (potentially a really long while).
We don't have a lot of spare developer time and people. However, the Elastic docs make our fairly straightforward use case - dump lots of events in and filter them by date and about 4 different properties - seem not too daunting.
The way it's sold on the Elastic site these days seems like a very "batteries included" kind of approach - am I interpreting their marketing a little too positively? Are we kidding ourselves? Would love to hear more about your thoughts!
> Structuring data is another really important consideration - will you need to normalize and will you need to update labels/dimensions? That’s just the start.
We don't expect much.
Thanks for mentioning the other technologies, too! The way we're structuring it allows us to either use Elastic, SQL, or something else entirely, so thankfully we're fairly agnostic on the data store. If Elastic turns out to be a time sink, we'll have no hesitation trying something else out.
Remember that if you’re storing each event you need to anonymise user identifying data out of it for GDPR (assuming you have users who don’t consent), and you need to be able to remove (or at least anonymise) the data if consent is withdrawn.
With Google the TOS states that you’re not allowed to store personal information in the first place at least.
The best way to solve this is to have a normalized schema. You assign each user an ID, and then you sync all the identifying information to its own table. So you get a schema that looks like:
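A minimal sketch of such a schema, assuming hypothetical `users` and `events` tables (the table names, columns and sample rows are illustrative, not the commenter's), using SQLite from Python:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- All identifying fields live here and nowhere else.
    CREATE TABLE users (
        user_id INTEGER PRIMARY KEY,
        name    TEXT,
        email   TEXT
    );
    -- Events carry only the opaque user_id.
    CREATE TABLE events (
        event_id    INTEGER PRIMARY KEY,
        user_id     INTEGER NOT NULL,
        event_name  TEXT NOT NULL,
        occurred_at TEXT NOT NULL
    );
    INSERT INTO users VALUES
        (1, 'Ada', 'ada@example.com'),
        (2, 'Bob', 'bob@example.com');
    INSERT INTO events (user_id, event_name, occurred_at) VALUES
        (1, 'page_view', '2019-03-01T09:00:00'),
        (1, 'purchase',  '2019-03-01T09:10:00'),
        (2, 'page_view', '2019-03-01T10:00:00');
    """
)


def forget_user(conn, user_id):
    """Delete the single row that holds a user's identifying information."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM users WHERE user_id = ?", (user_id,))


forget_user(conn, 1)

remaining_users = conn.execute("SELECT user_id FROM users").fetchall()
events_for_1 = conn.execute(
    "SELECT COUNT(*) FROM events WHERE user_id = 1"
).fetchone()[0]
print(remaining_users, events_for_1)
```

In this sketch the event rows deliberately stay in place: they are still countable, but keyed by an ID that no longer maps to a name or email.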
If a user requests to be deleted from your database, or if you have a policy of deleting users when they cancel their subscriptions, you can accomplish that by deleting one row.
This isn't going to make you GDPR compliant by itself but it's a start.
As long as events don't have name/email/IP/... there should be no problem, as far as I understand it. The regulation targets only personally identifiable information.
PII is a concept in US law. The GDPR uses personal data, defined as:
"any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person"
Pseudonymization (separating unique identifiers from the rest of the datapoints) can and should be used as a safeguard, but doesn't remove the need to protect the data, particularly if you keep a link between the two.
If you were a bookshop and your events log someone anonymous as buying a book on niche topics A, B and C, I reckon that could easily globally uniquely identify me as there aren't many people in the world interested in all three. Then you could also see me buying a book on embarrassing topic D and blackmail me with it. No personal identifying information, but it's not anonymous.
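A small sketch of that bookshop scenario, with made-up data: the log holds only opaque customer IDs and book topics, yet knowing a few of someone's interests can narrow the candidates down to one. The customer IDs, topics and the `customers_matching` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical "anonymous" purchase log: opaque customer IDs and topics only.
purchases = [
    ("c1", "knitting"), ("c1", "cooking"),
    ("c2", "knitting"), ("c2", "beekeeping"),
    ("c3", "knitting"), ("c3", "beekeeping"), ("c3", "lock picking"),
    ("c3", "embarrassing topic"),
    ("c4", "cooking"),
]


def customers_matching(purchases, known_topics):
    """Return the IDs of customers who bought every topic in known_topics."""
    topics_by_customer = defaultdict(set)
    for customer_id, topic in purchases:
        topics_by_customer[customer_id].add(topic)
    wanted = set(known_topics)
    return sorted(
        cid for cid, topics in topics_by_customer.items() if wanted <= topics
    )


# One known interest leaves several candidates...
one_topic = customers_matching(purchases, ["knitting"])
# ...but three known interests single out one record.
three_topics = customers_matching(
    purchases, ["knitting", "beekeeping", "lock picking"]
)
print(one_topic, three_topics)
```

Once the combination matches a single ID, everything else attached to that ID (here, the embarrassing purchase) is exposed along with it.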
Fingerprinting is not PII. And the people blackmailing you because you're interested in both finger painting and the orificial applications of pine cones would not be entirely sure it was you. If they had an email address, name, or IP address with a timestamp, then they could be quite certain that it was you.
I didn’t say it was. The original conversation wasn’t about PII; it was about anonymising. Fingerprints defeat anonymisation.
I reckon in practice, with for example an Amazon sales log with names removed and my public Twitter feed, you could de-anonymise me with a high degree of certainty.
No argument from me, there are things like IPs that are very hard to find every copy of and eliminate. Putting the users in their own table is just a simple best-practice that has a lot of benefits, so it’s a great first step.
You will not and will not assist or permit any third party to, pass information to Google that Google could use or recognize as personally identifiable information.
100% recommend the BigQuery and Segment combo. It’s so easy to set up and querying is just SQL. A few annoying issues though: Metabase is brilliant, but doesn’t support BigQuery datasets properly, so we can’t really use it. We’re in the process of switching to Tableau for our BQ queries, which allows powerful joins across all sorts of different data sets.
I was thinking of using Elasticsearch and Kibana, mostly because I know the stack already. Do you think I should rather look into BigQuery and Segment?
I'm not an expert, but I looked at both BQ and ELK and went with BQ just for ease of ops and the fact we can use SQL, which makes it more accessible to other team members who know SQL.
We're a small startup though, so last thing I wanted was even the slightest ops load from a new system.
We've actually been trying to figure out the best path for this. We have most of our data in Periscope (we're on the Redshift version of it) and we also use GA and it would be great to be able to have everything in one place.
Seems like using Segment here is probably the least effort solution for future events, although it doesn't address ETLing past GA data into the warehouse. Curious if anybody else has had to deal with a similar situation.
We made that exact transition, and the limitation you describe is real - you can’t get anything out of GA, particularly at the granularity you’ll have with an alternative.
So it’s best to start early, because it’s a slow process.
It’s even slower if you don’t instrument the right events in the beginning, because you’ll wait a while to gather enough event data to perform an analysis, find you want to track some new event, and then need to wait even longer to have enough data for analysis. It can be frustrating, and a surprising amount of work.
If you want to build out your BI infrastructure in the meantime, you could always ETL e.g. your server logs into the data warehouse as a stopgap. It’s definitely not the same as the customer-centric setup that you get with Segment, though, unless you have some advanced logging already set up.
Are you a paid GA user? If that's the case, you can ingest your data into BigQuery. Otherwise, you can't export the raw data; the only option is to export the summary data (number of unique users per hour, day, week etc.) from GA to your Redshift cluster via an ETL tool such as Segment.
We are not a paid user. I believe that's around 200k a year? That's a good point, I haven't looked into paid vs unpaid functionality when it comes to data export yet.
I run data science in a company that publishes email newsletters. We store every delivery and event record as a row in our database. This gives us so much flexibility.
I've used Redash [1] and Saiku [2] with good success. Redash is a Swiss Army knife approach to dashboarding and can run/schedule queries against a plethora of data sources and visualise them in different ways.
Saiku uses Mondrian (OLAP) tech to give a slice/dice drill-down view of data. Unfortunately the docs for Saiku seem very disjointed these days - there's a Docker image for the CE (Community Edition) that I recently set up okay. There's quite a learning curve on setting up Mondrian schemas, but it's worth it when it's done.