Ask HN: Good tech talks on how analytics systems are implemented?
193 points by psankar on June 12, 2019 | 81 comments
I am building a new sub-system for analytics, which I can design and implement from scratch. I get a bunch of unique users (say a few thousand). Now I need to track each of these users and do some analytics in my API server (Which places are they logging in from? How long do their sessions last on average? What are the usual links/endpoints that they visit? etc.). I have a few thousand active users and about two dozen parameters that I want to track and analyze.

I have never implemented such an analytics-type system. I want to learn from people who have already implemented similar systems. Are there any good tech talks, engineering blog posts, video courses, etc. that highlight the design/technology/architecture choices and their benefits?

Speaking as an analytics architect ...

You'll be a lot better off spending your mental energy thinking about the outcomes you want to achieve (user engagement, upselling, growth, etc) and the types of analysis you'll need to understand what changes you need to make to produce those outcomes. Protip: this is actually really hard, and people underestimate it by orders of magnitude. A blog post by Roger Peng (with indirect commentary from John Tukey) ... https://simplystatistics.org/2019/04/17/tukey-design-thinkin...

One other immediate tip is to start thinking about correlating your telemetry with user surveys - again, strongly focusing on outcomes and the controllable aspects of those outcomes.

Don't let the data lead the discussion; decide on the question you're asking, and the implications of all of the possible answers to that question (clearly yes, clearly no, mixed, etc.) before you ask it.

Then engineer the lightest weight system possible to ingest, process, store, analyze, and visualize that data.

For me, that would just be:

1. Log data in whatever logging tool you like. Persist the raw stuff forever in a cheap data lake.

2. Batch at some fixed interval into a staging area of a relational DB.

3. Transform it with stored procedures for now (while you figure out what the right transforms are) into a flat fact table.

4. Visualize in Superset or PowerBI or even plain old Excel.
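A minimal sketch of steps 2-4 above, using Python's built-in sqlite3 as a stand-in for the relational DB (in Postgres you'd do the transform server-side with a stored procedure; the event fields and the `fact_events` name here are illustrative, not anyone's real schema):

```python
import json
import sqlite3

# Hypothetical raw events, standing in for what a logging tool would emit.
raw_events = [
    {"user": "u1", "event": "login", "ts": "2019-06-12T10:00:00"},
    {"user": "u1", "event": "pageview", "ts": "2019-06-12T10:01:00"},
    {"user": "u2", "event": "login", "ts": "2019-06-12T10:05:00"},
]

db = sqlite3.connect(":memory:")  # stand-in for your relational DB

# Step 2: batch the raw JSON into a staging area, unparsed.
db.execute("CREATE TABLE staging (payload TEXT)")
db.executemany("INSERT INTO staging VALUES (?)",
               [(json.dumps(e),) for e in raw_events])

# Step 3: transform staging rows into a flat fact table.
db.execute("CREATE TABLE fact_events (user_id TEXT, event_type TEXT, event_ts TEXT)")
for (payload,) in db.execute("SELECT payload FROM staging").fetchall():
    e = json.loads(payload)
    db.execute("INSERT INTO fact_events VALUES (?, ?, ?)",
               (e["user"], e["event"], e["ts"]))

# Step 4: the kind of query a BI tool would issue against the fact table.
rows = db.execute(
    "SELECT event_type, COUNT(*) FROM fact_events GROUP BY event_type"
).fetchall()
```

The point of the staging/fact split is that you can rewrite the transform as often as you like without touching the raw data.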

Once you've got the patterns of analysis at least fundamentally right you can consider stream processing (Flink or Kafka Streams are fine) to replace 2 and 3.

One small point about "persist the raw stuff forever": ...maybe don't? I wouldn't go so far as to call user data a toxic asset, but I'd definitely refer to it in the same breath as "technical debt". Sometimes you have to take on some to get anything done, but you need to recognize that it's a potential liability, as well.

Persist it as long as you need it to obtain value from it, but after that, you're just holding on to something you can lose some day in a data breach, or building up enough of it that people start to get mistrustful (Facebook, Google, etc.). Data expiration policies should definitely be something you think about and have justification for, even if you do decide that your expiration is "Never".

Yes! Getting and transforming your data these days is easy, and it's cheap to store. Use existing tooling for that (e.g. Snowplow). The hard part is producing actionable analysis.

Seriously, just put everything in Postgres. You have so little data, you shouldn't even be thinking about an "analytics system".

I have seen so many developers over-engineer this exact problem. Forget about Kafka, Kinesis, Redshift, Airflow, Storm, Spark, Cassandra etc. You don't need them, not even close. Unless you want to add a bunch of expensive distributed systems and operational overhead for fun/resume building, they're going to waste your time and hurt your stability.

Wouldn't writing events from the analytics sub-system into a SQL database put load on the DB even when it is not warranted? My point is that if you are using a SQL db and most of the database is for business transactions, wouldn't logging events which are not mission-critical unnecessarily consume db resources?

Also, assuming that SQL is being used for storing analytics events, would you not cache events in a queue and then flush them to DB in a batch?

Typically you have a separate database just for analytics, exactly for the reason you described. You don't want to hold up actual business logic.

And yes typically you also have a job queue for the same reason. So e.g. API calls don't take longer due to database event writes.

If your needs are similar to the original questioner (still starting out, few thousand users), I'd say even the separate database isn't necessary, just the job queue. Heck even the job queue is probably optional at that scale, but most hosting platforms make job queues easy enough to integrate so I say why not.
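The job-queue-plus-batch-flush pattern described above can be sketched like this, with Python's stdlib `queue` standing in for whatever queue your hosting platform provides (Sidekiq, SQS, etc.) and sqlite3 standing in for the DB; names are illustrative:

```python
import queue
import sqlite3

db = sqlite3.connect(":memory:")  # stand-in for the analytics DB
db.execute("CREATE TABLE events (user_id TEXT, name TEXT)")

q = queue.Queue()   # stand-in for a hosted job queue
BATCH_SIZE = 100

def track(user_id, name):
    """Called from the request path: just enqueue, never touch the DB,
    so API calls don't take longer due to event writes."""
    q.put((user_id, name))

def flush():
    """Run by a background worker: drain the queue and write one batch."""
    batch = []
    while not q.empty() and len(batch) < BATCH_SIZE:
        batch.append(q.get())
    if batch:
        db.executemany("INSERT INTO events VALUES (?, ?)", batch)
        db.commit()
    return len(batch)

for i in range(10):
    track(f"user-{i}", "pageview")
written = flush()  # all 10 rows written in a single batched insert
```

One batched insert per interval keeps the write load on the DB negligible compared to a write per request.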

We put unwarranted load on computers all the time, and nobody notices. They're good at that. Now, if you think the load we are talking about will actually be so much that humans will be affected by it, then it makes sense to reconsider. But at 1000 DAU, you could probably run the database (analytics logging included) on a Linux server from the 1990s and its CPU and disk would not be the transaction bottleneck. (With proper indexing ofc.)

On top of that I would say that many specialized analytics systems are SQL-based, so even if you start out with a normal RDBMS you can later upgrade to something that is super-efficient for your workload.

I would be suspicious of most tech talks on this. If someone is giving a tech talk on their analytics systems, they are either working at enormous scale (Facebook, Google), selling something (Splunk), or over engineering their system (many startups).

I second advice elsewhere in this thread. Log it into PostgreSQL. If you start overloading that, look into sampling your data before you look into a fancier system. Make sure you have identifiers in each row for each entity a row is part of: user, session, web request. If you're not building replicated PostgreSQL (which you probably won't need for this), log to files first, and build another little process that tails the files and loads the rows into PostgreSQL. That log then load advice is hard learned experience from working at Splunk.
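A rough sketch of that log-then-load pattern, with sqlite3 standing in for PostgreSQL (the field names and file layout are made up for illustration). The loader tracks a byte offset into the log file, so a crash means a re-read from the last committed offset rather than lost data:

```python
import json
import os
import sqlite3
import tempfile

log_path = os.path.join(tempfile.mkdtemp(), "events.log")

def log_event(event):
    # The request path only appends a JSON line to a local file.
    with open(log_path, "a") as f:
        f.write(json.dumps(event) + "\n")

def load(db, offset):
    """Separate loader process: tail the file from the last offset
    and insert the new rows, then report the new offset to persist."""
    with open(log_path) as f:
        f.seek(offset)
        for line in f.readlines():
            e = json.loads(line)
            db.execute("INSERT INTO events VALUES (?, ?, ?)",
                       (e["user"], e["session"], e["path"]))
        offset = f.tell()
    db.commit()
    return offset

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (user TEXT, session TEXT, path TEXT)")
log_event({"user": "u1", "session": "s1", "path": "/home"})
log_event({"user": "u1", "session": "s1", "path": "/pricing"})
offset = load(db, 0)
```

If the database is lost or mis-loaded, you just reset the offset and replay the files; the log also absorbs backpressure when the DB can't keep up.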

That's actually what we do at Rakam. PostgreSQL fits many analytics workloads with partitioned tables, parallel queries, and BRIN indexes. The only limitation is that since it's not horizontally scalable, your data must fit on one server. It "just works" up to ~10M events per month.

The SDKs provide ways to send the event data with the following format:

rakam.logEvent('pageview', {url: 'https://test.com'})

The event types and attributes are all dynamic. The API server automatically infers the attribute types from the JSON blob and creates the tables which correspond to event types and the columns which correspond to event attributes and inserts the data into that table. It also enriches the events with visitor information such as user agent, location, referrer, etc. The users just run the following SQL query:

SELECT url, count(*) from pageview where _city = 'New York' group by 1
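The schema-inference step described above could be sketched roughly like this (this is a simplification, not Rakam's actual code; sqlite3 stands in for Postgres, and the type mapping is deliberately crude):

```python
import sqlite3

def sql_type(value):
    # Crude type inference from a JSON value.
    if isinstance(value, bool):
        return "BOOLEAN"
    if isinstance(value, int):
        return "INTEGER"
    if isinstance(value, float):
        return "REAL"
    return "TEXT"

def log_event(db, event_type, attrs):
    # Event type -> table, attributes -> columns, inferred on the fly.
    cols = ", ".join(f"{k} {sql_type(v)}" for k, v in attrs.items())
    db.execute(f"CREATE TABLE IF NOT EXISTS {event_type} ({cols})")
    placeholders = ", ".join("?" for _ in attrs)
    db.execute(
        f"INSERT INTO {event_type} ({', '.join(attrs)}) VALUES ({placeholders})",
        list(attrs.values()))

db = sqlite3.connect(":memory:")
log_event(db, "pageview", {"url": "https://test.com", "_city": "New York"})
rows = db.execute(
    "SELECT url, count(*) FROM pageview WHERE _city = 'New York' GROUP BY 1"
).fetchall()
```

A real implementation would also handle previously unseen attributes on an existing event type (ALTER TABLE) and sanitize names before interpolating them into SQL.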

The whole project is open-source: https://github.com/rakam-io/rakam. Would love to get some contributions!

To scale a PG db horizontally, you may want to look at https://www.citusdata.com/ (they were recently bought by Microsoft but I don't expect any change on the Open Source part).

This also makes a lot of things really easy. Want to join against your products table to see what product categories are most popular for certain customer segments? It’s a single query or Tableau drag-and-drop away. You don’t know what you’ll need to access fast to answer business questions, so use a system designed for flexibility until you can’t.

Dumping out to a text file quickly and having some asynchronous queue insert that into your database is one solution, but you have to watch unique identifiers, reconciliation, and handling failed inserts so you can fix and reinject any failed records.

Can you elaborate on the log then load advice? Specifically the problems it solves or issues it prevents?

If you mess up the database you can just reread the log files. It also helps manage backpressure during activity spikes where your db can't keep up.

I was working for a startup implementing analytics tools. In my opinion, our setup was over-engineered, but I wasn't there at the beginning, so I might be wrong. Also, requirements changed a couple of times, so this could also explain why something that looked necessary for scaling and speed, ended up being this over-engineered mess. This is how it worked: After javascript tracker fired, we got log files, passed them through Kafka, then parsed the log files and performed calculations through Storm (Java). For storage, we used Cassandra. The system also had other parts, but I don't remember why they were there, tbh.

My thought process for solving your problem would be the following. First, you need to understand what's good for you and for your company might not be the same. You want the challenge, you want to implement something that could scale and you want to use exotic tools for achieving this. It's interesting and looks good in your CV. Your company might just want the results. You need to decide which is more important.

If we prioritize your company's needs over keeping you entertained, I'd follow this thought process:

Can't you just use Google Analytics? You can also connect it to BigQuery and do lots of customizations. Maybe time would be better spent learning GA. It's powerful, but most of us cannot use it well.

Second question: if for some reason, you don't want to use Google Analytics, can you use another, possibly open-source and/or self-hosted analytics solution? Only because you can design it from scratch, it doesn't mean you should.

Third: Alright, you want to implement something from scratch. For this scale, you can probably just log and store events in an SQL database, write the queries, and display it in a dashboard.

Then, if you really want to go further, there are many tools that are designed to scale well and perform analytics, "big data". By looking for talks about these tools, you will get a better understanding of how things work. There are various open-source projects you should read more about: Cassandra, Scylla, Spark, Storm, Flink, Hadoop, Kafka, Hadoop, Parquet, just to name a few.

The analytics are not just limited to web clients; there would be API clients too. The deployment will be in a private enterprise VPN, so talking to external services may not be an option.

I am aware of tools like Cassandra/Flink/Spark/Kafka etc. But I am more curious about the best tools and architectural patterns that work well with each other.

>>I'm more curious about the best tools and architectural patterns that work well with each other.

Well you can go with:

Fancy: HDFS (distributed file system) as storage - Oozie as workflow scheduler for your load (Python/Hive/Scala/Spark) - Tableau for visualization (your business users will love it).

Mid-range: SQL Server as storage - Informatica for your workload - Power BI / SSRS for visuals.

Open/low budget: PostgreSQL / Cassandra for storage - make your own scheduler (there was a post about open-source ETL yesterday that might help) - for visuals you can build from scratch, but hire a good designer!

This is based on my experience in industries like gambling, banking, and telecom.

Tableau has a terrible UI and looks like it hasn't been updated since 1995. Go for Looker instead.

Absolutely what? It's the best in the industry. In that case, could you suggest a better tool, and also, separately, which tool looks better?

There has to be something somewhat out of the box for this, no? It seems insane to have to build this for any given project.

I'd imagine you could use Segment (or sources that plug into it) to accomplish much of this.

I worked on a $15m/year revenue product for 3 years. Our analytics system was screwed by cookie messages and now GDPR. Marketers wanted to serve analytics through Google Tag Manager, which helped customers block our analytics launcher, meaning zero data for most of the visits.

Relying solely on client-side logs gets less and less reliable each year.

We never relied only on client side.

That's not my experience using GA via GTM for a major brand.

We got blocked by a majority of users using uBlock, AdBlock, etc.

My go-to v0 solution is JSON (simple, with no nested objects or lists) written to s3 (partitioned by date, see Hive date partitioning) and AWS Athena (serverless Presto) to do SQL queries on those JSONs. You can build the system in less than an hour, you don't have to manage any VMs and it's relatively easy to extend to a more serious solution (e.g. if you need major scale or Spark-like analytic jobs).
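The key piece of that v0 is Hive-style date partitioning of the S3 keys, which lets Athena prune whole directories when a query filters on the date. A sketch, using a local temp directory as a stand-in for the S3 bucket (key layout is illustrative):

```python
import json
import os
import tempfile
from datetime import datetime, timezone

root = tempfile.mkdtemp()  # stand-in for s3://my-bucket/events/

def partition_path(ts):
    # Hive-style partition key: Athena reads only the dt= prefixes
    # a query actually filters on.
    return f"dt={ts:%Y-%m-%d}"

def write_event(event, ts):
    d = os.path.join(root, partition_path(ts))
    os.makedirs(d, exist_ok=True)
    with open(os.path.join(d, "events.json"), "a") as f:
        # Simple flat JSON, one object per line, as suggested above.
        f.write(json.dumps(event) + "\n")

ts = datetime(2019, 6, 12, tzinfo=timezone.utc)
write_event({"user": "u1", "event": "login"}, ts)
```

On the Athena side you'd define an external table over `s3://my-bucket/events/` partitioned by `dt`, and a query like `... WHERE dt = '2019-06-12'` only scans that day's files.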

Relational databases, like some are suggesting, are fine, but you have to manage them, unlike s3 + Athena, and they tend to make you design around relational database concepts, which can make it difficult to migrate to a full-blown analytics solution that often abandons relational guarantees.

This solution also lets you be flexible in your raw data schema, unlike relational databases, where you have to have a well-defined schema or resort to hacks like saving the raw JSON as a string.

When you need to evolve your data schema (you will as you learn what things you want to measure), a relational database requires you to be thoughtful about how to do this (e.g. you can't have your data producer writing a new schema before the table has been changed). Often this requires you to add some sort of queue between data producers and database so that you can make changes to the table without stopping the data producers. With s3 + Athena, you can just upgrade your data producer, it will start saving the new format to s3 and then you upgrade your Athena table definition whenever you want to start querying the new data (because in relational databases, the schema defines how data is stored, but in s3+Athena world, the schema just tells the SQL engine how to read whatever data exists on s3).

This is an interesting modern solution. I agree with the JSON flat files on S3 for event storage to start. Athena is cool, but I've also always felt that the pricing model is weird, being based on the amount of data scanned (not the data returned by the query). That said, I don't have much experience partitioning or bucketing for Athena to optimize this, such as the ideas mentioned in [1]. And with the whole per-query thing, the more you query your data, the more you pay.

It would be fair to argue that the pricing for small amounts of data is low enough (similar to Lambda pricing) that it's not worth worrying about, or that one can optimize costs by using a columnar format like Apache Parquet over JSON. But it's just a concern to be cognizant of vs. e.g. Redshift, where you're paying a flat rate instead of per query. Overall, I do think you're right with the "serverless analytics" approach, even if not many companies have adopted a "stack" like this yet.

I would be curious to hear how you would evolve your v0 solution into a v1 solution as it grows.

[1]: https://aws.amazon.com/blogs/big-data/top-10-performance-tun...

I'm not the OP, but I can report on how we evolved our BI analytics at Appcues.

We started out writing events into a Postgres table, but BI queries were slow and Postgres was an expensive place to put the events.

So then we started writing the events into S3 in batches of 10,000. That number was chosen semi-arbitrarily, intending that any Lambda function would be able to process an entire batch within the 5 minute execution limit. We started also keeping aggregate stats on this data by updating counters and HyperLogLog estimators in a Redis store, updated as each batch hits S3. Athena became our BI tool.
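The aggregate-stats-per-batch idea can be sketched like this. Here plain in-memory structures stand in for Redis; in particular, the `set` stands in for a Redis HyperLogLog (PFADD/PFCOUNT), which gives an approximate distinct count in constant memory rather than this exact one:

```python
from collections import Counter

BATCH_SIZE = 10_000  # one S3 object per batch, as described above

counters = Counter()   # stand-in for Redis counters (INCRBY)
unique_users = set()   # stand-in for a Redis HyperLogLog

def on_batch_landed(batch):
    """Invoked (e.g. by a Lambda) each time a batch object hits S3:
    update the running aggregates without rescanning history."""
    for event in batch:
        counters[event["name"]] += 1
        unique_users.add(event["user"])

# Hypothetical batch: 9 pageviews spread across 3 users.
batch = [{"name": "pageview", "user": f"u{i % 3}"} for i in range(9)]
on_batch_landed(batch)
```

The raw batches on S3 remain the source of truth; the Redis aggregates are a disposable cache you can rebuild by replaying batches.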

About a year after that, we'd evolved the system so that we were splitting events by customer (instead of being an arbitrary time-slice of a day's traffic). This became a suitable backend for a customer CSV export system, as well as making BI cheaper by allowing us to zero in on the customer(s) we were curious about.

And in the last year, we've begun to use the batched event data in S3 to feed a Snowflake DB (via Snowpipe), which we use for both offline BI and online analytics as part of our product. Snowflake is not free and requires some sophistication, but it supports the leading analytics tools and visualizers, and it's part of a direct evolution from keeping JSON files on S3.

My experience with Athena was that cost was negligible compared to alternative options, but YMMV. As in all things cloud pricing related, you should experiment and measure once price becomes worth thinking about.

Partitioning by date gets you really far because the vast majority of queries are interested in a specific timeframe. If your data is growing, but the ingestion rate is not, v0 might work forever.

v0.5 is simply to add another level of partitioning. This can get you an order of magnitude performance gain in many situations. I've often not needed to go past this.

v1 is a different beast. The nice thing about the simplicity of v0 is that it gives you time to learn the requirements for v1 (including whether you even need a v1) and the data is JSON on S3 so it's never hard to migrate it somewhere else. v1 is usually driven by very specific, known requirements, so there's no generic answer. For many requirements, it's useful to put a queue like Kafka (powerful) or Kinesis (convenient/cheap) in front of S3 and have a Spark job performing actions on that queue.

If speed of a single query is the issue, too many small files could be the root problem. Then, a good option is to read batches from the queue and write them to s3 as a single Parquet or ORC file. This could also be a good option if the issue is the cost of s3 API calls (when you're looking at 10k JSONs per second, PUT calls get very pricey), but at large enough scale, you might be better off using Redshift.

With Postgres or Redshift, you can run into contested resource problems when the number of users querying the data scales. I haven't found that to be a problem with Athena due to separated compute and storage, but if it is, you could maybe replicate data to another AWS account. Or, depending on the query patterns, you can build a higher-level table (e.g. given a series of events, tell me the most current state for every entity) from the raw data, either using ETL or by writing a new Spark job that reads from the Kafka queue, figures out the current state and writes the higher-level data a new s3+Athena table.

If the issue is time-to-answer (e.g. if there is a problem, how quickly will that be apparent in your analytics since batched writing introduces delay), streaming analytics could be the solution. You can also use the lambda architecture, but that's always looked too hard to implement and maintain to me.

If the issue is auth/data visibility in an enterprise context, I would lean towards using a fully featured enterprise solution, preferably one that uses a well-documented format on s3 as the underlying datastore with separated compute and storage. Alternatively, you could build or buy a query layer that sits on top of Athena that handles the permissions (e.g. Looker).

At a certain point, you should consider just adopting Snowflake, as they have solved many of these problems already. And is building an analytics solution really your core competency?

I work on an infrastructure that's 4+ years old, with a full featured analytics DB in place (Snowflake). I still use JSON + S3 + Athena for some BI queries. It's not what you want for an online query source, but for doing occasional offline BI work on medium-large data (let's say under a petabyte), it doesn't get simpler.

This is the setup I normally use, though I would suggest storing the data in Parquet or ORC format if possible. Kinesis Firehose will perform this conversion for you, and the resulting data files are much smaller. If you need a flexible schema, though, stick with JSON.

1) Design the reports you want. Pay special attention to interactive elements like filters and drilldowns. List all dimensions and metrics you need. Think about privacy.

2) Find your visualisation tool of choice. This is more important than any architecture choice for the tracking because this makes your data useable. [1]

3) Select your main data storage that is compatible with your visualisation tool, data size, budget, servers, security, ... SQL is always better because it has a schema the vis tools can work with. For a low amount of data you might just want to use your existing database (if you have one) and not build up new infrastructure that has to be maintained.

4) If you need higher availability on the data ingress than your db can provide use a high availability streaming ingress [2] to buffer the data.

5) Design a schema to connect your db to the visualisation tool. Also think about how you will evolve this schema in the future. (Simplest thing in sql is: Add colunms.)
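The "add columns" evolution in step 5 is backwards-compatible because existing rows simply read the new column as NULL. A tiny illustration with sqlite3 standing in for your SQL database (table and column names are made up):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (user_id TEXT, name TEXT)")
db.execute("INSERT INTO events VALUES ('u1', 'login')")

# Later, a new dimension is needed: adding a nullable column doesn't
# break old rows or old queries.
db.execute("ALTER TABLE events ADD COLUMN country TEXT")
db.execute("INSERT INTO events VALUES ('u2', 'login', 'DE')")

rows = db.execute(
    "SELECT user_id, country FROM events ORDER BY user_id").fetchall()
```

Dropping or renaming columns is the disruptive direction; as long as you only add, the visualisation tool's queries keep working.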

I hope this helps. Once you have selected some tools, it is fairly easy to search for blog posts and tech talks. But don't think too big (data). "A few thousand users" and "two dozen parameters" may be handled with Postgres and Metabase. Also, in most enterprise environments there already exists a data analytics / data science stack that is covered by SLAs and accepted by privacy officers. Ask around.

[1] https://github.com/onurakpolat/awesome-bigdata#business-inte... [2] https://github.com/onurakpolat/awesome-bigdata#data-ingestio...

Couple bits (good overall): "1) Design the reports you want. Pay special attention to interactive elements like filters and drilldowns. List all dimensions and metrics you need. Think about privacy." I think what you're getting at here is: figure out what information you want to get, then work backwards to figure out whether you have the data. A couple of minor changes I'd make:

A) Don't just figure out a report; figure out what actions you'd want to see. If something gets above or below a threshold, who should be doing what? (Reports for the sake of reports is generally bad.)

B) Are you trying to build things that will push for operational, tactical, or strategic change? The manifestations of those are often very different. Operational bits are often dashboards / KPIs, whereas for strategic changes we often want to present something more akin to a story.

C) Privacy - think of GDPR / PII _now_. Look at each metric / dimension and understand its data classification.

2 and 3 are tied to each other. You could have visualization drive storage or vice versa. Just understand the tradeoffs.

As for "who's done it before" talks etc., there are a ton out there. There are the chats from the FAANGs and various groups in the valley (Lyft, etc.). Tons of blog posts there. Vendors have builds (largely predisposed towards their own products). Finally, the talks/slides at Data Council and Strata often contain lots of more... "pointed" information. The high-level bible lots of folks would say to look at is Kleppmann's "Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems".

Good additions. For report design I found the book "Information Dashboard Design" by Stephen Few valuable. It talks about actionable data and has many examples.

I hadn't heard of that book before. Thank you.

I have designed one for 500 dashboard users and various other requirements. My advice would be to get a cheap SQL-compliant database that does not require a lot of maintenance (if you can afford it, buy a cloud one). Then, for the analysis part, the quickest thing to do is use Jupyter + SQLAlchemy. You can also use a dashboarding tool (there are many) to connect to the database, but I think with Jupyter you can ask more interesting questions that require more blending or transformations. That's it; you'll grow from here in the coming months and years, but if you over-engineer analytics at the beginning you'll most likely get tired of it and stop doing it at some point.

IMO, your requirements are too basic to need a serious system. Either log interaction to a file or a database, parse the output and query it with SQL to produce your basic metrics, or just write to Google Analytics.

When this starts creaking at the seams it'll mean that you have bigger analysis and/or scalability requirements, and it'll be much clearer what you need to look for.

GA may not be possible, as no third-party service can be accessed in the environment where this would be deployed.

Piwik then perhaps

Thanks. I will check.

I made one for our company using AWS Kinesis Firehose, which I thought was really good, having used GA, Mixpanel and Segment before. Shame we haven't been able to put it into wider use. Extremely simple and very robust: to deploy it you just run the CloudFormation stacks with Sceptre in a single command and then add the client library with some event listeners for clicks, pageviews, et cetera. I'd love to be able to open-source it, but I'd have to think through the benefits and disadvantages with my CEO. We probably couldn't get customers to pay for an expensive custom analytics platform if it were open-source.

Having spent some time on this, I'll just say: don't overthink it. Over-engineering such a system is way too easy, while the actual benefits might not be that great. Sure, if you're receiving a lot of data there might be some pitfalls to be aware of, e.g. using proper bucket partitioning with Athena for queries.

Go with off the shelf. You'll get something far better than you can build yourself, and if you need something custom, you'll have a much clearer idea of what your analysis is missing.

Writing to Google Analytics, Amplitude, Mixpanel (all of which have free tiers) or equivalent all should handle your case well.

Talking to third-party services may not be possible. The deployment scenario would be a private enterprise VPN kind of setup.

Note: I build and maintain such systems for a living.

There's a lot of context that's missing from your post, some questions that can help us guide you in the right direction:

1) Can your website call out to external services, or are you limited to operating behind a company network?

2) Is this more of an ad-hoc analysis or do you want to invest in a framework to be able to track such metrics systematically over time?

3) How important is data accuracy? Adblock can easily mess with client-side metrics.

4) How real-time do metrics need to be? The big trade-off here is speed vs accuracy.

5) How long do you intend to keep this data? This is a pretty big concern with regards to privacy and storage costs.

If you'd rather not share some of these answers on a public forum, feel free to shoot me an email.

1) The website cannot call to external services, which is the primary reason why we thought about implementing it from scratch.

2) We want to invest in building a good framework to track such metrics systematically over time

3) We have some non-web API clients too. Adblock is not a problem.

4) Accuracy is better. Speed is not that critical and could even be a few minutes delayed.

5) The data will be kept for a few months at least, if not years. The storage costs are not a big problem. This is for an enterprise solution, where the deployment may happen in a private network.

I would say what you need to consider more than anything is making sure you have the right data, and that the data can be combined.

This is the hard part of analytics for an app that is more back office oriented, understanding what will be needed to get truly accurate and useful information to support the reports people will want in 2 years, 5 years, whatever time frame is long enough for things to really change in your environment.

Try to think in an adversarial way, what question could someone come up with that I can't answer. The user who has seen the most errors? Usage trends by department?

One place this might lead is wanting to put a way to link request/error logs back to an application level user account (in a way that respects security and privacy), this can become great debugging info too.

The standard setup nowadays is something like this: http://bit.ly/2MFAAt9.

You can use different technologies based on your use case, but you probably need all the pieces outlined above. As someone else has mentioned, if you're looking for trade-offs between different technologies, I'd recommend "Designing Data-Intensive Applications" by Kleppmann.

How does Snowplow compare to this?

Start by adopting https://github.com/snowplow/snowplow then grow as and where you feel restricted.

Seconded! We're implementing it at GitLab and while the documentation could use some work, it's been pretty great. Coupled with some off the shelf modeling packages[0] you can get good analytics pretty quick. We even started on CloudSQL (Postgres flavor) before migrating to Snowflake.

[0] https://github.com/fishtown-analytics/snowplow/

Seconded on snowplow. Piwik/Matomo are showing their age, you don't really control your data with Adobe/GA/MixP/etc, and building from scratch seems easy... til it's not. Either use snowplow as is, or at least learn from it to see how they solved issues so you can avoid some basic mistakes (their data QA on input from the trackers, for example, is very well handled).

I built a web analytics system from scratch for my current employer. I give the same advice to anyone that asks.

You can just crunch your data with SQL/service layer code in a background worker and store it in redis. Then you can use the objects from redis to render charts, build dashboards, etc...

Structure your code so you crunch your historical data once, store in redis, and then new data gets shoved in the redis cache as your time dimensions on your metrics progress based on business logic.

Until your data is at enterprise volume, you really don't need an OLAP system.
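The crunch-once-then-update pattern described above might look roughly like this, with sqlite3 standing in for the main database and a plain dict standing in for Redis (the DAU metric and key format are illustrative):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (user_id TEXT, day TEXT)")
db.executemany("INSERT INTO events VALUES (?, ?)", [
    ("u1", "2019-06-10"), ("u2", "2019-06-10"), ("u1", "2019-06-11"),
])

cache = {}  # stand-in for Redis: key -> precomputed metric

def crunch_history():
    # One-off backfill: aggregate all historical data with SQL.
    for day, dau in db.execute(
            "SELECT day, COUNT(DISTINCT user_id) FROM events GROUP BY day"):
        cache[f"dau:{day}"] = dau

def record(user_id, day):
    # New events refresh only the affected time bucket.
    db.execute("INSERT INTO events VALUES (?, ?)", (user_id, day))
    dau = db.execute(
        "SELECT COUNT(DISTINCT user_id) FROM events WHERE day = ?",
        (day,)).fetchone()[0]
    cache[f"dau:{day}"] = dau

crunch_history()
record("u3", "2019-06-11")
```

Dashboards then read only from the cache, so rendering a chart never touches the event table.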

I have an open-source project that collects the customer events via SDKs and stores it in a data warehouse.

It's a distributed system, the mobile and web SDKs batch the user events on their devices and push it to our API in JSON format. The API servers enrich & sanitize the data, validate the schema, convert it to a serialized AVRO binary, and push it to a commit-log system such as Kinesis or Kafka (It's pluggable).

We have another project that fetches data from Kafka & Kinesis in small batches, converts the data into columnar format and stores it in an S3 bucket / Google Cloud Storage. Then, we integrate the customer's preferred data warehouse with their distributed filesystem. That way they have all their raw data in their infrastructure for other systems such as fraud detection, recommendation, etc., but they have SQL access to their data as well.

That being said, this architecture is for >100M events per month. If you don't have that much data, you can ingest it into an RDBMS and it just works fine. We support PostgreSQL at Rakam; all you need in that case is the API server and a PostgreSQL instance. Our open-source version supports PostgreSQL, so you can look into the source code here: https://github.com/rakam-io/rakam. Would love to get some contribution. :)

For the analysis part, all these metrics can be created using just SQL; the modern data-warehouse solutions (BigQuery and Snowflake) also support JavaScript, and it's relatively easy to build funnel & retention queries that way. It requires more work, but you have more control & flexibility over your data.
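As an example of the kind of funnel query that plain SQL can express over a flat event table, here's a sketch run against sqlite3 (the step names and schema are invented; a warehouse version would look similar but use its own date/window functions):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (user_id TEXT, name TEXT, ts INTEGER)")
db.executemany("INSERT INTO events VALUES (?, ?, ?)", [
    ("u1", "visit", 1), ("u1", "signup", 2), ("u1", "purchase", 3),
    ("u2", "visit", 1), ("u2", "signup", 2),
    ("u3", "visit", 1),
])

# Count users reaching each funnel step, requiring steps in order.
funnel = db.execute("""
    SELECT
      COUNT(DISTINCT v.user_id) AS visited,
      COUNT(DISTINCT s.user_id) AS signed_up,
      COUNT(DISTINCT p.user_id) AS purchased
    FROM (SELECT user_id, MIN(ts) ts FROM events
          WHERE name = 'visit' GROUP BY user_id) v
    LEFT JOIN (SELECT user_id, MIN(ts) ts FROM events
               WHERE name = 'signup' GROUP BY user_id) s
      ON s.user_id = v.user_id AND s.ts > v.ts
    LEFT JOIN (SELECT user_id, MIN(ts) ts FROM events
               WHERE name = 'purchase' GROUP BY user_id) p
      ON p.user_id = s.user_id AND p.ts > s.ts
""").fetchone()
```

With the sample data, three users visit, two of them go on to sign up, and one completes a purchase.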

This sounds like a classic case of build vs. buy. If analytics are not your core product, inventing a new solution is going to cost you more than buying an existing analytics solution. There are dozens, a few of which have even been in the news the last few days due to acquisitions.

I'm not going to endorse any of them over the others, but I will say you'll be better off using a 3rd party than coding this yourself.

The quickest and easiest thing to do would be to hook up Segment or a similar system (Heap Analytics, Google Analytics, etc.). Given the choice, though, I would stay away from GA: it's free, but Google won't give your own data back to you without an enterprise agreement, which runs six figures minimum. For open source there's Snowplow, which I haven't used but many in the data community do.

If your analytics are merely a 'nice to have' and losing a day or two of results would be acceptable in a crisis, I'd log everything to Redis and then run a daily report to pull aggregated values into another database system. I would avoid clogging your main database system up with analytics-related queries on a day-to-day basis, for sure.

Definitely not a "good talk/blog post/video course" type of thing, but if you are interested in how we built Countly from the ground up, together with the technology stack behind it, you can check our source code here:


While we used MongoDB, Node.js, and Linux as the underlying platforms, there are several other options you might check.

Note that some (if not most) of the effort would go into SDK development, and to tell you the truth, SDK development is no easy task as it requires knowledge of how different platforms work.

The point is (and take it as a warning): you will never be satisfied with what you have. After you are done with the vital data, there are custom events, raw data, user profiles, and online users, and then you will start eating your own dog food as it becomes your part-time job.

Erlang Factory 2014 -- Step By Step Guide to Building an Application Analytics System



For such a small scale, you can use a simple event-tracking schema in your client-side and server-side code, with a simple stream processor to join these events and save them to a single events table in a SQL database. The DB you choose should be suitable for OLAP workloads; at your scale, PostgreSQL or MySQL will work just fine. When your data grows, you can look at more distributed systems like Vertica, MemSQL, or ClickHouse.
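A sketch of what that events table might look like, using SQLite for demonstration. The column names are illustrative; one common trick at this scale is to put the long tail of tracked parameters in a JSON blob so the schema stays stable as tracking needs grow.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE events (
        event_id   TEXT PRIMARY KEY,
        user_id    TEXT NOT NULL,
        name       TEXT NOT NULL,     -- e.g. 'login', 'page_view'
        ts         INTEGER NOT NULL,  -- unix epoch seconds
        country    TEXT,
        path       TEXT,
        properties TEXT               -- JSON for less-common params
    )
""")
db.execute("INSERT INTO events VALUES (?,?,?,?,?,?,?)",
           ("e1", "u1", "page_view", 1560300000, "US", "/pricing",
            '{"referrer": "news.ycombinator.com"}'))
print(db.execute("SELECT COUNT(*) FROM events").fetchone()[0])  # 1
```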

In this architecture, most of your brain cycles will go into designing the queries that generate aggregates at regular intervals from the raw events table and store them in various aggregate tables. You should be familiar with fact and dimension tables as understood in the data-warehouse context.
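The periodic rollup described above can be as simple as one SQL statement run on a schedule (cron, etc.). This sketch uses SQLite and a made-up daily-active-users metric; using REPLACE (or an upsert) makes reruns idempotent.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (user_id TEXT, name TEXT, day TEXT)")
db.execute("""CREATE TABLE daily_active_users
              (day TEXT PRIMARY KEY, dau INTEGER)""")
db.executemany("INSERT INTO events VALUES (?, ?, ?)", [
    ("u1", "login", "2019-06-10"), ("u1", "click", "2019-06-10"),
    ("u2", "login", "2019-06-10"), ("u1", "login", "2019-06-11"),
])
# The scheduled rollup: raw events -> aggregate table.
db.execute("""
    REPLACE INTO daily_active_users
    SELECT day, COUNT(DISTINCT user_id) FROM events GROUP BY day
""")
print(db.execute(
    "SELECT dau FROM daily_active_users WHERE day = '2019-06-10'"
).fetchone()[0])  # 2
```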

I’ve worked with a GA implementation before which I don’t recommend if you want to own your data or if you want unsampled, detailed logs. I’ve also seen a full end to end implementation that uses server log shipping to s3, log parsing and complicated ETL processes which I also don’t recommend due to the sheer effort it would take to build.

I’d say go with something like Matomo (formerly Piwik) https://matomo.org. If you wanted to build your own, I’d suggest keeping it simple. Look at Matomo’s architecture and replicate https://github.com/matomo-org/matomo.

Designing Data-Intensive Applications, M. Kleppmann

Thanks. But it is a bit theoretical. I was looking more on the lines of people's practical experience, with architectural choices, tools that they used, etc.

I found this article quite interesting, it's about implementing an analytics system at stackoverflow: https://jasonpunyon.com/blog/2015/02/12/providence-failure-i...

If it's only a few thousand enterprise users, I'd actually say log to a bog-standard relational database like MySQL or Postgres from the server side. Think through the table schemas for everything you're going to log, and make sure the primary keys and nomenclature line up across tables. Virtually any analytics platform or tool talks to standard databases. Record as much as you can, because analytics use cases typically emerge after the first round of data collection and analysis and are iterative, so you also want to build in flexibility.

Suggestion: send the events to an elastic queue (like SQS) first, and have a pair of ingestion processes do the actual INSERTs when the db is available.

Then you can take the db offline for maintenance and upgrades and not lose data.

Alternately, log to disk and have another process ingest and write to the database. Appending to files is an amazing persistent queue.
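A sketch of the file-as-persistent-queue idea: the producer appends one JSON line per event and fsyncs, and a separate loader process reads the spool and ingests it. File name and event fields are made up for illustration.

```python
import json
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "events.log")

def produce(event):
    """Append one JSON line per event; fsync so it survives a crash."""
    with open(path, "a") as f:
        f.write(json.dumps(event) + "\n")
        f.flush()
        os.fsync(f.fileno())

def consume():
    """The loader side: read every spooled event back."""
    with open(path) as f:
        return [json.loads(line) for line in f]

produce({"user": "u1", "event": "login"})
produce({"user": "u2", "event": "login"})
print(len(consume()))  # 2
```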

How does your writer process deal with failures, track the state of what it has written, and prune the file when it no longer needs old data? You don't have to use SQS, but this problem has tons of available options; I wouldn't resort to rolling your own (if anything, just pick up something that works off a LevelDB/SQLite/etc. file and has already implemented all this boring stuff for you).

Pruning the file is done the same way as logrotate, or even with logrotate. You don't track state: when the loader starts up, you spool through the whole file and use INSERT IGNORE (MySQL) or ON CONFLICT DO NOTHING (Postgres) in your INSERT statement so replayed rows are skipped.
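A small demonstration of why the replay-the-whole-file approach is safe when inserts are idempotent. SQLite's INSERT OR IGNORE plays the role of MySQL's INSERT IGNORE / Postgres's ON CONFLICT DO NOTHING here; the event ids are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (event_id TEXT PRIMARY KEY, name TEXT)")

spool = [("e1", "login"), ("e2", "click")]

def replay(rows):
    """Idempotent load: duplicate event_ids are silently skipped."""
    db.executemany("INSERT OR IGNORE INTO events VALUES (?, ?)", rows)
    db.commit()

replay(spool)
replay(spool)  # loader restarted and re-read the file: no duplicates
print(db.execute("SELECT COUNT(*) FROM events").fetchone()[0])  # 2
```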

The nice thing about the queue is that you can do fanout (eg multiple SQS subscribers to the SNS topic or similar) and have multiple parallel ingestors or change out the backend consumer/db injector without the producers knowing anything about it.

You also get HA for free; your solution depends on local node disk persistence.

> your solution depends on local node disk persistence.

We have reached a point where people think that local node disk persistence is an exotic property.

The Chaos Monkey is real. I don’t like losing data unnecessarily. Queues are cheap and are usually run by Other People. It’s basically free reliability. It’s also nice not having to specifically test the mode where the db is offline; this gives free maintenance windows with literally no alternate procedure on producers (and no additional testing).

If we’re gonna sign our lives over to Amazon and Google anyway, we might as well get some reliability out of the exchange.

I have a slightly adjusted question on any good talks / online training programs that touch on digital measurement across channels (media pixel, web analytics, 3rd party data etc). Any pointers?

Some questions for you:

- Who will be viewing these reports when they are done? Who do you want to have a view of the data eventually?

- How fresh do you need the data to be? Is 24 hours, 4 hours, or 4 seconds okay to wait?

- Do you need to be alerted of anomalies in the data?

- How long do you intend to store the raw data? Aggregated data?

- Does your data need to contain anything that could personally identify a user in order to make a useful analysis? Do you serve customers in the EU?

I'll check back later today and see if I can provide any insights based on your response.

Ahoy https://github.com/ankane/ahoy is an interesting tool that we use to replace Google Analytics in most projects now.

It covers all the basic needs, and even if you're not using Rails, I think you can draw inspiration from it!

Since no one is answering the question, this talk by Sonos engineers at AWS re:Invent 2015 is really good:


Here you go:

AWS re:Invent 2017: Building Serverless ETL Pipelines with AWS Glue (ABD315)


AWS: SFTP -> S3 -> Glue -> Redshift -> Power BI / Tableau

I helped build an analytics platform that served millions of events. Nothing about it is really difficult until you scale heavily.

The tl;dr - It's not worth your time / energy to build from scratch at this scale. Leveraging an existing standard like the analytics.js spec [1] makes web analytics very easy and quick to get started. With this approach you just have to add one JS snippet to your website and never need to update it in your code. If you are interested in the internals, you might enjoy digging deeper into the spec to understand why it was designed this way.

Two services that implement this spec are Segment [2] and MetaRouter [3] [full disclosure: I helped build the product that became MetaRouter at a previous job]. They have different target audiences and pricing models but both are worth a look.

You can think of these types of services as a meta analytics service that routes your events to destination analytics services and data stores of your choice. The great thing about using the standard is you can benefit from all of the many integrations that have already been created with various analytics services, databases, data warehouses, etc [4]. These destination catalogs can also help you decide what services to explore and try next as you need more advanced features.

To get started with a meta analytics service, in the management dashboard, just add your API keys and config values for each service. For a simple service like Google Analytics this is literally just one simple key to copy and paste.

As far as adding custom event monitoring to your site, within the analytics.js spec you mainly want to be concerned with the Track call [5], which is a way to say: for an arbitrary event, e.g. ProductAddedToCart, I would like to attach this JSON object of properties, e.g. a product name and price.

And finally, user info like name, email, IP, etc. is handled by Identify [6]. You can add custom fields too (traits on an Identify are roughly equivalent to properties on a Track event, but less transient).
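Roughly, the two payload shapes look like this (shown as Python dicts for brevity). The field values are made up; see the spec links [5] and [6] for the authoritative field lists.

```python
import json

# Approximate shape of an analytics.js Track payload.
track = {
    "type": "track",
    "userId": "u123",
    "event": "Product Added To Cart",
    "properties": {"name": "T-shirt", "price": 19.99},
}

# Approximate shape of an Identify payload: who this user is.
identify = {
    "type": "identify",
    "userId": "u123",
    "traits": {"email": "user@example.com", "plan": "pro"},
}

print(json.dumps(track, indent=2))
```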

Going with an existing standard and SaaS-based approach will save a ton of time and engineering effort.

[1]: https://segment.com/docs/spec/

[2]: https://segment.com/

[3]: https://www.metarouter.io/

[4]: https://segment.com/docs/destinations/

[5]: https://segment.com/docs/spec/track/#example

[6]: https://segment.com/docs/spec/identify/#example
