
Ask HN: Good tech talks on how analytics systems are implemented? - psankar
I am building a new sub-system, for analytics, which I can design / implement from scratch. I get a bunch of unique users (say a few thousand). Now I need to track each of these users and do some analytics in my API server (Which places are they logging in from? How long do their sessions last on average? What are the usual links/endpoints that they visit? etc.). I have a few thousand active users and about two dozen parameters that I want to track and analyze.

I have never implemented such an analytics-type system. I want to learn from people who have already implemented similar systems. Are there any good tech talks, engineering blog posts, video courses, etc. that highlight the design/technology/architecture choices and their benefits?
======
kthejoker2
Speaking as an analytics architect ...

You'll be a lot better off spending your mental energy thinking about the
outcomes you want to achieve (user engagement, upselling, growth, etc) and the
types of analysis you'll need to understand what changes you need to make to
produce those outcomes. Protip: this is actually really hard, and people
underestimate it by orders of magnitude. A blog post by Roger Peng (with
indirect commentary from John Tukey) ...
[https://simplystatistics.org/2019/04/17/tukey-design-thinkin...](https://simplystatistics.org/2019/04/17/tukey-design-thinking-and-better-questions/)

One other immediate tip is to start thinking about correlating your telemetry
with user surveys - again, strongly focusing on outcomes and the controllable
aspects of those outcomes.

Don't let the data lead the discussion; decide on the question you're asking,
and the implications of all of the possible answers to that question (clearly
yes, clearly no, mixed, etc) before you ask it.

Then engineer the lightest weight system possible to ingest, process, store,
analyze, and visualize that data.

For me, that would just be:

1. Log data in whatever logging tool you like. Persist the raw stuff forever in a cheap data lake.

2. Batch at some fixed interval into a staging area of a relational DB.

3. Transform it with stored procedures for now (while you figure out what the right transforms are) into a flat fact table (sketched below).

4. Visualize in Superset or PowerBI or even plain old Excel.
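
To make steps 2 and 3 concrete, here is a minimal sketch, assuming a newline-delimited JSON log file and Postgres table/column names invented purely for illustration (not anything prescribed above):

    # Step 2: batch raw JSON log lines into a staging table; step 3: flatten
    # into a fact table with plain SQL. File, table, and field names are made up.
    import json
    import psycopg2
    from psycopg2.extras import Json

    conn = psycopg2.connect("dbname=analytics")   # illustrative connection string
    cur = conn.cursor()

    cur.execute("""
        CREATE TABLE IF NOT EXISTS staging_events (
            received_at timestamptz DEFAULT now(),
            payload     jsonb
        )
    """)
    cur.execute("""
        CREATE TABLE IF NOT EXISTS fact_pageviews (
            user_id    text,
            url        text,
            event_time timestamptz
        )
    """)

    # Step 2: load the raw log into the staging area.
    with open("events.ndjson") as f:               # hypothetical raw log file
        for line in f:
            cur.execute("INSERT INTO staging_events (payload) VALUES (%s)",
                        (Json(json.loads(line)),))

    # Step 3: a throwaway transform; a stored procedure would do the same job
    # once the shape of the transform settles down.
    cur.execute("""
        INSERT INTO fact_pageviews (user_id, url, event_time)
        SELECT payload->>'user_id',
               payload->>'url',
               (payload->>'ts')::timestamptz
        FROM staging_events
    """)
    conn.commit()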

Once you've got the patterns of analysis at least fundamentally right you can
consider stream processing (Flink or Kafka Streams are fine) to replace 2 and
3.

~~~
DanHulton
One small point about "persist the raw stuff forever": ...maybe don't? I
wouldn't go so far as to call user data a toxic asset, but I'd definitely
refer to it in the same breath as "technical debt". Sometimes you have to take
on some to get anything done, but you need to recognize that it's a potential
liability, as well.

Persist it as long as you need it to obtain value from it, but after that, you're just holding on to something you can lose some day in a data breach, or accumulate enough of it that people start to get mistrustful (Facebook, Google, etc.). Data expiration policies should definitely be something you think about and have justification for, even if you do decide that your expiration is "Never".

------
sfkdjf9j3j
Seriously, just put everything in Postgres. You have so little data, you
shouldn't even be _thinking_ about an "analytics system".

I have seen so many developers over-engineer this exact problem. Forget about
Kafka, Kinesis, Redshift, Airflow, Storm, Spark, Cassandra etc. You don't need
them, not even close. Unless you want to add a bunch of expensive distributed
systems and operational overhead for fun/resume building, they're going to
waste your time and hurt your stability.

~~~
manishsharan
Wouldn't writing events from the analytics sub-system into a SQL database put load on the DB even when it is not warranted? My point is that if you are using a SQL db and most of the database is for business transactions, wouldn't logging events which are not mission-critical unnecessarily consume db resources?

Also, assuming that SQL is being used for storing analytics events, would you not cache events in a queue and then flush them to the DB in a batch?

~~~
dougk16
Typically you have a separate database just for analytics, exactly for the
reason you described. You don't want to hold up actual business logic.

And yes typically you also have a job queue for the same reason. So e.g. API
calls don't take longer due to database event writes.

If your needs are similar to the original questioner (still starting out, few
thousand users), I'd say even the separate database isn't necessary, just the
job queue. Heck even the job queue is probably optional at that scale, but
most hosting platforms make job queues easy enough to integrate so I say why
not.
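
As a rough sketch of that split (an in-process queue standing in for a real job queue; table and column names are invented here), the request path only enqueues and a worker flushes batches:

    # The API handler calls track() and returns immediately; a background
    # thread drains the queue and writes batches to the analytics DB.
    import json
    import queue
    import threading
    import time
    import psycopg2

    events = queue.Queue()

    def track(event_name, properties):
        """Called from the request path; never touches the database."""
        events.put({"event": event_name, "properties": properties, "ts": time.time()})

    def flush_worker(batch_size=500, interval=5.0):
        conn = psycopg2.connect("dbname=analytics")        # illustrative DSN
        while True:
            batch, deadline = [], time.time() + interval
            while len(batch) < batch_size and time.time() < deadline:
                try:
                    batch.append(events.get(timeout=0.5))
                except queue.Empty:
                    pass
            if batch:
                with conn, conn.cursor() as cur:           # commits on success
                    cur.executemany(
                        "INSERT INTO raw_events (event, payload, ts)"
                        " VALUES (%s, %s, to_timestamp(%s))",
                        [(e["event"], json.dumps(e["properties"]), e["ts"])
                         for e in batch])

    threading.Thread(target=flush_worker, daemon=True).start()
    track("pageview", {"url": "/pricing"})                  # example call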

------
madhadron
I would be suspicious of most tech talks on this. If someone is giving a tech
talk on their analytics systems, they are either working at enormous scale
(Facebook, Google), selling something (Splunk), or over engineering their
system (many startups).

I second advice elsewhere in this thread. Log it into PostgreSQL. If you start
overloading that, look into sampling your data before you look into a fancier
system. Make sure you have identifiers in each row for each entity a row is
part of: user, session, web request. If you're not building replicated
PostgreSQL (which you probably won't need for this), log to files first, and
build another little process that tails the files and loads the rows into
PostgreSQL. That log-then-load advice is hard-learned experience from working at Splunk.
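
For the log-to-files-then-load pattern, a bare-bones loader might look like this (file path, table name, and offset handling are all simplified for illustration):

    # Tails a newline-delimited JSON log and inserts each new line into Postgres.
    # A real loader would persist its file offset so it can resume after a crash.
    import json
    import time
    import psycopg2
    from psycopg2.extras import Json

    conn = psycopg2.connect("dbname=analytics")     # illustrative DSN
    conn.autocommit = True

    with open("/var/log/myapp/events.log") as f:    # hypothetical log file
        f.seek(0, 2)                                # start at the end, like `tail -f`
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.5)                     # nothing new yet
                continue
            with conn.cursor() as cur:
                cur.execute("INSERT INTO raw_events (payload) VALUES (%s)",
                            (Json(json.loads(line)),))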

~~~
buremba
That's actually what we do at Rakam. Postgresql fits many analytics workloads with partitioned tables, parallel queries, and BRIN indexes. The only limitation is that since it's not horizontally scalable, your data must fit in one server. `it just works` up to ~10M events per month.

The SDKs provide ways to send the event data with the following format:

    rakam.logEvent('pageview', {url: 'https://test.com'})

The event types and attributes are all dynamic. The API server automatically
infers the attribute types from the JSON blob and creates the tables which
correspond to event types and the columns which correspond to event attributes
and inserts the data into that table. It also enriches the events with visitor
information such as user agent, location, referrer, etc. The users just run
the following SQL query:

    SELECT url, count(*) from pageview where _city = 'New York' group by 1

The whole project is open-source: [https://github.com/rakam-io/rakam](https://github.com/rakam-io/rakam). Would love to get some contributions!
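
This isn't Rakam's code, just a rough sketch of how that kind of dynamic schema inference can work: one table per event type, one column per attribute, columns added lazily (identifiers would need validation in real code):

    # Infers column types from a JSON event and upserts the schema before inserting.
    import psycopg2

    PG_TYPES = {str: "text", int: "bigint", float: "double precision", bool: "boolean"}

    def ingest(conn, event_type, attributes):
        # WARNING: identifiers are interpolated directly for brevity; real code
        # must whitelist/escape event and attribute names.
        with conn, conn.cursor() as cur:
            cur.execute(f'CREATE TABLE IF NOT EXISTS "{event_type}" ()')
            for name, value in attributes.items():
                col_type = PG_TYPES.get(type(value), "text")
                cur.execute(f'ALTER TABLE "{event_type}" '
                            f'ADD COLUMN IF NOT EXISTS "{name}" {col_type}')
            cols = ", ".join(f'"{c}"' for c in attributes)
            params = ", ".join(["%s"] * len(attributes))
            cur.execute(f'INSERT INTO "{event_type}" ({cols}) VALUES ({params})',
                        list(attributes.values()))

    conn = psycopg2.connect("dbname=analytics")     # illustrative DSN
    ingest(conn, "pageview", {"url": "https://test.com", "_city": "New York"})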

~~~
mrchlhblng
To scale a PG db horizontally, you may want to look at
[https://www.citusdata.com/](https://www.citusdata.com/) (they were recently
bought by Microsoft but I don't expect any change on the Open Source part).

------
serial_dev
I was working for a startup implementing analytics tools. In my opinion, our
setup was over-engineered, but I wasn't there at the beginning, so I might be
wrong. Also, requirements changed a couple of times, so this could also
explain why something that looked necessary for scaling and speed, ended up
being this over-engineered mess. This is how it worked: after the JavaScript tracker fired, we got log files, passed them through Kafka, then parsed the log files and performed calculations with Storm (Java). For storage, we used Cassandra. The system also had other parts, but I don't remember why they were there, tbh.

My thought process for solving your problem would be the following. First, you
need to understand what's good for you and for your company might not be the
same. You want the challenge, you want to implement something that could scale
and you want to use exotic tools for achieving this. It's interesting and
looks good in your CV. Your company might just want the results. You need to
decide which is more important.

If we prioritize your company's needs over keeping you entertained, I'd follow
this thought process:

Can't you just use Google Analytics? You can also connect it to BigQuery and
do lots of customizations. Maybe time would be better spent learning GA. It's
powerful, but most of us cannot use it well.

Second question: if for some reason you don't want to use Google Analytics, can you use another, possibly open-source and/or self-hosted, analytics solution? Just because you _can_ design it from scratch doesn't mean you should.

Third: Alright, you want to implement something from scratch. For this scale,
you can probably just log and store events in an SQL database, write the
queries, and display it in a dashboard.

Then, if you really want to go further, there are many tools that are designed
to scale well and perform analytics, "big data". By looking for talks about
these tools, you will get a better understanding of how things work. There are
various open-source projects you should read more about: Cassandra, Scylla,
Spark, Storm, Flink, Hadoop, Kafka, Parquet, just to name a few.

~~~
psankar
The analytics are not just limited to web clients. There would be API clients too. The deployment will be in a private enterprise VPN, so talking to external services may not be an option.

I am aware of tools like Cassandra/Flink/Spark/Kafka etc. But I am more curious about the best tools and architectural patterns that work well with each other.

~~~
lelima
>>I'm more curious about the best tools and architectural patterns that work well with each other.

Well you can go with:

Fancy: HDFS (distributed file system) as storage - Oozie as workflow scheduler for your load (Python/Hive/Scala/Spark) - Tableau for visualization (your business users will love it).

Mid range: SQL Server as storage - Informatica for your workload - Power BI / SSRS for visuals.

Open/low budget: PostgreSQL / Cassandra for storage - make your own scheduler (there was a post about open-source ETL yesterday that might help) - for visuals you can build it from scratch, but hire a good designer!

This is based on my experience in industries like Gambling, Banking and Telecom.

~~~
dclausen
Tableau has a terrible UI and looks like it hasn't been updated since 1995. Go
for Looker instead.

~~~
unixhero
Absolutely what? It's the best in the industry. In that case, could you suggest a better tool, and also maybe, separately, which tool looks better?

------
solidasparagus
My go-to v0 solution is JSON (simple, with no nested objects or lists) written
to s3 (partitioned by date, see Hive date partitioning) and AWS Athena
(serverless Presto) to do SQL queries on those JSONs. You can build the system
in less than an hour, you don't have to manage any VMs and it's relatively
easy to extend to a more serious solution (e.g. if you need major scale or
Spark-like analytic jobs).

Relational databases, as some are suggesting, are fine, but you have to manage them, unlike S3 + Athena, and they tend to make you design around relational database concepts, which can make it difficult to migrate to a full-blown analytics solution that often abandons relational guarantees.

This solution also lets you be flexible in your raw data schema, unlike relational databases where you have to have a well-defined schema or hacks like saving the raw JSON as a string.

When you need to evolve your data schema (you will as you learn what things
you want to measure), a relational database requires you to be thoughtful
about how to do this (e.g. you can't have your data producer writing a new
schema before the table has been changed). Often this requires you to add some
sort of queue between data producers and database so that you can make changes
to the table without stopping the data producers. With s3 + Athena, you can
just upgrade your data producer, it will start saving the new format to s3 and
then you upgrade your Athena table definition whenever you want to start
querying the new data (because in relational databases, the schema defines how
data is stored, but in s3+Athena world, the schema just tells the SQL engine
how to read whatever data exists on s3).
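
A minimal sketch of that v0, assuming boto3 and a bucket/prefix made up for the example; the matching Athena table would be declared with `PARTITIONED BY (dt string)` and new partitions registered with `ALTER TABLE ... ADD PARTITION`:

    # Writes one flat JSON event per object under a Hive-style date partition,
    # so Athena can prune by dt=... in the WHERE clause.
    import datetime
    import json
    import uuid
    import boto3

    s3 = boto3.client("s3")

    def write_event(event, bucket="my-analytics-bucket"):   # bucket name is made up
        now = datetime.datetime.utcnow()
        key = f"events/dt={now:%Y-%m-%d}/{now:%H%M%S}-{uuid.uuid4()}.json"
        s3.put_object(Bucket=bucket, Key=key,
                      Body=json.dumps(event).encode("utf-8"))

    write_event({"user_id": "u_123", "event": "pageview", "url": "/pricing"})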

~~~
tedmiston
This is an interesting modern solution. I agree with the JSON flat files on S3 for event storage to start. Athena is cool but I've also always felt that the pricing model is weird, being based on the amount of data _scanned_ (not data returned by the query). That said, I don't have much experience partitioning or bucketing for Athena to optimize this, such as the ideas mentioned in [1]. And with the whole per-query model, the more you query your data, the more you pay.

It would be fair to argue that the pricing for small amounts of data is low enough, similar to Lambda pricing, that it's not worth worrying about, or that one can optimize costs by using a columnar format like Apache Parquet over JSON. But it's just a concern to be cognizant of vs e.g., Redshift, where you're paying a flat rate instead of per query. Overall, I do think you're right with the "serverless analytics" approach even if not many companies have adopted a "stack" like this yet.

I would be curious to hear how you would evolve your v0 solution into a v1
solution as it grows.

[1]: [https://aws.amazon.com/blogs/big-data/top-10-performance-tun...](https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/)

~~~
gamache
I'm not the OP, but I can report on how we evolved our BI analytics at
Appcues.

We started out writing events into a Postgres table, but BI queries were slow
and Postgres was an expensive place to put the events.

So then we started writing the events into S3 in batches of 10,000. That
number was chosen semi-arbitrarily, intending that any Lambda function would
be able to process an entire batch within the 5 minute execution limit. We
started also keeping aggregate stats on this data by updating counters and
HyperLogLog estimators in a Redis store, updated as each batch hits S3. Athena
became our BI tool.

About a year after that, we'd evolved the system so that we were splitting
events by customer (instead of being an arbitrary time-slice of a day's
traffic). This became a suitable backend for a customer CSV export system, as
well as making BI cheaper by allowing us to zero in on the customer(s) we were
curious about.

And in the last year, we've begun to use the batched event data in S3 to feed
a Snowflake DB (via Snowpipe), which we use for both offline BI and online
analytics as part of our product. Snowflake is not free and requires some
sophistication, but it supports the leading analytics tools and visualizers,
and it's part of a direct evolution from keeping JSON files on S3.
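
The counters-plus-HyperLogLog part can be sketched roughly like this with Redis (key names are invented for the example; this is not Appcues' actual code):

    # For each batch that lands in S3, bump exact per-event counters and an
    # approximate distinct-user count via a HyperLogLog.
    import redis

    r = redis.Redis()

    def update_aggregates(batch, day):
        pipe = r.pipeline()
        for event in batch:
            pipe.incr(f"events:{day}:{event['event']}")      # exact event count
            pipe.pfadd(f"uniques:{day}", event["user_id"])   # approximate uniques
        pipe.execute()

    update_aggregates([{"event": "pageview", "user_id": "u_1"},
                       {"event": "pageview", "user_id": "u_2"}], day="2019-06-18")
    print(r.pfcount("uniques:2019-06-18"))   # estimated distinct users that day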

------
srrr
1) Design the reports you want. Pay special attention to interactive elements
like filters and drilldowns. List all dimensions and metrics you need. Think
about privacy.

2) Find your visualisation tool of choice. This is more important than any
architecture choice for the tracking because this makes your data useable. [1]

3) Select your main data storage that is compatible with your visualisation
tool, data size, budget, servers, security, ... SQL is always better because
it has a schema the vis tools can work with. For a low amount of data you
might just want to use your existing database (if you have one) and not build
up new infrastructure that has to be maintained.

4) If you need higher availability on the data ingress than your db can provide, use a high-availability streaming ingress [2] to buffer the data.

5) Design a schema to connect your db to the visualisation tool. Also think
about how you will evolve this schema in the future. (Simplest thing in sql
is: Add columns.)

I hope this helps. If you have selected some tools it is fairly easy to search for blog posts and tech talks. But don't think too big (data). "A few thousand users" and "two dozen parameters" may be handled with Postgres and Metabase. Also, in most enterprise environments there already exists a data analytics / data science stack that is covered by SLAs and accepted by privacy officers. Ask around.

[1] [https://github.com/onurakpolat/awesome-bigdata#business-inte...](https://github.com/onurakpolat/awesome-bigdata#business-intelligence)
[2] [https://github.com/onurakpolat/awesome-bigdata#data-ingestio...](https://github.com/onurakpolat/awesome-bigdata#data-ingestion)

~~~
ironchef
Couple bits (good overall): "1) Design the reports you want. Pay special attention to interactive elements like filters and drilldowns. List all dimensions and metrics you need. Think about privacy." I think what you're getting at here is: figure out what information you want to get, and then work backwards to figure out if you have the data. A couple minor changes I'd make:

A) Don't just figure out a report, figure out what actions you'd want to see. If something gets above or below a threshold, who should be doing what? (Reports for the sake of reports are generally bad.)

B) Are you trying to build things that will push for operational, tactical, or strategic change? The manifestations of those are often very different. Operational bits are often dashboards / KPIs, whereas with strategic changes we often would want to present something more akin to a story.

C) Privacy - Think of GDPR / PII _now_. Look at each metric / dimension and understand the data classification of it.

2 and 3 are tied to each other. You could have visualization drive storage or vice versa. Just understand the tradeoffs.

As for who's done it before, talks, etc., there are a ton out there. There are the chats from the FAANGs and various groups in the valley (Lyft, etc.). Tons of blog posts there. Vendors have builds too (largely predisposed towards their own products). Finally, the talks/slides at DataCouncil and Strata often contain lots of more .. "pointed" information. The high-level bible that lots of folks would say to look at is Kleppmann's "Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems".

~~~
srrr
Good additions. For report design I found the book "Information Dashboard
Design" by Stephen Few valuable. It talks about actionable data and has many
examples.

~~~
ironchef
I hadn't heard of that book before. Thank you.

------
kfk
I have designed one for 500 dashboard users and various other requirements. My advice would be to get a cheap SQL-compliant database that does not require a lot of maintenance (if you can afford it, buy a cloud one). Then for the analysis part, the quickest thing to do is use Jupyter + SQLAlchemy. You can also use a dashboarding tool, there are many, to connect to the database, but I think with Jupyter you can ask more interesting questions that require more blending or transformations. That's it, you'll grow from here in the coming months and years, but if you over-engineer analytics at the beginning you'll most likely get tired of it and stop doing it at some point.

------
usgroup
IMO, your requirements are too basic to need a serious system. Either log interactions to a file or a database, parse the output and query it with SQL to produce your basic metrics, or just write to Google Analytics.

When this starts creaking at the seams it'll mean that you have bigger analysis and/or scalability requirements, and it'll be much clearer what you need to look for.

~~~
quickthrower2
Piwik then perhaps

~~~
psankar
Thanks. I will check.

------
tekkk
I made one for our company using AWS Kinesis Firehose, which I thought was really good having used GA, Mixpanel and Segment before. Shame we haven't been able to put it into wider use. Extremely simple and very robust: to deploy it you just run the CloudFormation stacks with Sceptre in a single command and then add the client library with some event listeners for clicks, pageviews et cetera. I'd love to be able to open-source it, but I don't know; I should think through the benefits and disadvantages with my CEO. Probably couldn't get customers to pay for an expensive custom analytics platform if it was open-source.

Having spent some time on this, I'll just say don't overthink it. Over-engineering such a system is way too easy, while the actual benefits might not be that great. Sure, if you're receiving a lot of data there might be some pitfalls to be aware of, e.g. using proper bucket partitioning with Athena for queries.

------
acidbaseextract
Go with off the shelf. You'll get something far better than you could build yourself, and if you need something custom, you'll have a much clearer idea of what your analysis is missing.

Writing to Google Analytics, Amplitude, Mixpanel (all of which have free
tiers) or equivalent all should handle your case well.

~~~
psankar
Talking to third-party services may not be possible. The deployment scenario would be a private enterprise VPN kind of setup.

------
karanke
Note: I build and maintain such systems for a living.

There's a lot of context that's missing from your post, some questions that
can help us guide you in the right direction:

1) Can your website call out to external services, or are you limited to
operating behind a company network?

2) Is this more of an ad-hoc analysis or do you want to invest in a framework
to be able to track such metrics systematically over time?

3) How important is data accuracy? Adblock can easily mess with client-side
metrics.

4) How real-time do metrics need to be? The big trade-off here is speed vs
accuracy.

5) How long do you intend to keep this data? This is a pretty big concern with
regards to privacy and storage costs.

If you'd rather not share some of these answers on a public forum, feel free
to shoot me an email.

~~~
psankar
1) The website cannot call out to external services, which is the primary reason why we thought about implementing it from scratch.

2) We want to invest in building a good framework to track such metrics
systematically over time

3) We have some non-web API clients too. Adblock is not a problem.

4) Accuracy is more important. Speed is not that critical; the data could even be a few minutes delayed.

5) The data will be kept for a few months at least, if not years. The storage costs are not a big problem. This is for an enterprise solution, where the deployment may happen in a private network.

~~~
karanke
The standard setup nowadays is something like this:
[http://bit.ly/2MFAAt9](http://bit.ly/2MFAAt9).

You can use different technologies based on your use case, but you probably
need all the pieces outlined above. As someone else has mentioned, if you're
looking for trade-offs between different technologies, I'd recommend
"Designing Data-Intensive Applications" by Kleppmann.

~~~
namanyayg
How does Snowplow compare to this?

------
princeverma
Start by adopting
[https://github.com/snowplow/snowplow](https://github.com/snowplow/snowplow)
then grow as and where you feel restricted.

~~~
veritas3241
Seconded! We're implementing it at GitLab, and while the documentation could use some work, it's been pretty great. Coupled with some off-the-shelf modeling packages[0] you can get good analytics pretty quickly. We even started on CloudSQL (Postgres flavor) before migrating to Snowflake.

[0] [https://github.com/fishtown-analytics/snowplow/](https://github.com/fishtown-analytics/snowplow/)

------
gt565k
You can just crunch your data with SQL/service-layer code in a background worker and store it in Redis. Then you can use the objects from Redis to render charts, build dashboards, etc...

Structure your code so you crunch your historical data once, store it in Redis, and then new data gets shoved into the Redis cache as the time dimensions on your metrics progress, based on business logic.

Until your data is at enterprise volume, you really don't need an OLAP system.
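
A minimal sketch of that pattern, with table, metric, and key names invented for the example: a background job runs the SQL once, and the dashboard only ever reads the cached result from Redis:

    # Computes one day's session metrics with SQL and caches the result in Redis.
    import json
    import psycopg2
    import redis

    def refresh_daily_sessions(day):
        conn = psycopg2.connect("dbname=analytics")          # illustrative DSN
        with conn, conn.cursor() as cur:
            cur.execute("""
                SELECT count(*), avg(duration_seconds)
                FROM sessions
                WHERE started_at::date = %s
            """, (day,))
            total, avg_duration = cur.fetchone()
        redis.Redis().set(
            f"metrics:sessions:{day}",
            json.dumps({"total": total, "avg_duration": float(avg_duration or 0)}))

    refresh_daily_sessions("2019-06-18")   # run from cron or a background worker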

------
buremba
I have an open-source project that collects customer events via SDKs and stores them in a data warehouse.

It's a distributed system: the mobile and web SDKs batch the user events on the device and push them to our API in JSON format. The API servers enrich & sanitize the data, validate the schema, convert it to a serialized AVRO binary, and push it to a commit-log system such as Kinesis or Kafka (it's pluggable).

We have another project that fetches data from Kafka & Kinesis in small
batches, converts the data into columnar format and stores it in an S3 bucket
/ Google Cloud Storage. Then, we integrate their preferred data-warehouse into
their distributed filesystem. That way they have all their raw data in their
infrastructure for other systems such as fraud detection, recommendation, etc.
but they have SQL access to their data as well.

That being said, this architecture is for >100M events per month. If your data is not that much, you can actually ingest your data into an RDBMS and it just works fine. We support Postgresql at Rakam, and all you need in that case is the API server and a Postgresql instance. Our open-source version supports Postgresql so you can look into the source code here: [https://github.com/rakam-io/rakam](https://github.com/rakam-io/rakam) Would love to get some contributions. :)

For the analysis part, all these metrics can be created using just SQL; the modern data-warehouse solutions (BigQuery and Snowflake) also support JavaScript, and it's relatively easy to build funnel & retention queries that way. It requires more work, but now you have more control & flexibility over your data.

------
codingdave
This sounds like a classic case of build vs. buy. If analytics are not your
core product, inventing a new solution is going to cost you more than buying
an existing analytics solution. There are dozens, a few of which have even
been in the news the last few days due to acquisitions.

I'm not going to endorse any of them over the others, but I will say you'll be
better off using a 3rd party than coding this yourself.

------
drunkpotato
The quickest and easiest thing to do would be to hook up Segment or a similar system (Heap Analytics, Google Analytics, etc). I would stay away from GA given my own choice though. It's free, but Google won't give your own data back to you without an enterprise agreement, which runs 6 figures minimum. For open source there's Snowplow, which I haven't used but many in the data community do.

------
petercooper
If your analytics are merely a 'nice to have' and losing a day or two of results would be acceptable in a crisis, I'd log everything to Redis and then run a daily report to drag aggregated values into another database system. I would avoid clogging your main database system up with analytics-related queries on a day-to-day basis, for sure.

------
gorkemcetin
Definitely not a "good talk/blog post/video course" type of thing, however if
you are interested on how we built Countly from ground up, together with the
technology stack behind, you can check our source code here:

[https://github.com/countly/countly-server](https://github.com/countly/countly-server)

While we have used MongoDB, Node.js and Linux as the underlying platforms, there are several options out there you may check.

Note that some (if not most) of the effort would go into SDK development, and
to tell you the truth, SDK development is no easy task as it requires
knowledge of how different platforms work.

The point is (and take it as a warning): you will never be satisfied with what you have - after you are done with the vital data, there are custom events, raw data, user profiles, online users, and then you will start eating your own dog food as it becomes your part-time job.

------
erlanganalytics
Erlang Factory 2014 -- Step By Step Guide to Building an Application Analytics System

    https://www.youtube.com/watch?v=XBuQg1ZElao

    http://www.erlang-factory.com/static/upload/media/1395920416434704ef2014_anton_lavrik.pdf

------
vinay_ys
For such a small scale, you can use a simple event tracking schema from your
client-side and server-side code and have a simple stream processor to join
these events and then save it to a simple event table in a SQL database. The
DB tech you choose should be something suitable for OLAP workloads. For your
scale, PostgreSQL or MySQL would just work fine. When your data grows you can
look at more distributed systems like Vertica or Memsql or Clickhouse etc.

In this architecture, most of your brain cycles will go into designing the queries that generate aggregates at regular intervals from the raw events table and store them in various aggregate tables. You must be familiar with fact and dimension tables as understood in a data warehouse context.
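
For the aggregate-at-intervals part, here is a sketch of one such rollup (table and column names are invented; this is just the shape of the job):

    # Rolls the raw event table up into a daily aggregate table.
    import psycopg2

    ROLLUP_SQL = """
        INSERT INTO agg_daily_endpoint_hits (day, endpoint, users, hits)
        SELECT date_trunc('day', event_time) AS day,
               endpoint,
               count(DISTINCT user_id)       AS users,
               count(*)                      AS hits
        FROM raw_events
        WHERE event_time >= %s AND event_time < %s
        GROUP BY 1, 2
    """

    def rollup(day_start, day_end):
        conn = psycopg2.connect("dbname=analytics")   # illustrative DSN
        with conn, conn.cursor() as cur:              # commits on success
            cur.execute(ROLLUP_SQL, (day_start, day_end))

    rollup("2019-06-18", "2019-06-19")                # typically scheduled from cron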

------
ergest
I’ve worked with a GA implementation before which I don’t recommend if you
want to own your data or if you want unsampled, detailed logs. I’ve also seen
a full end to end implementation that uses server log shipping to s3, log
parsing and complicated ETL processes which I also don’t recommend due to the
sheer effort it would take to build.

I’d say go with something like Matomo (formerly Piwik)
[https://matomo.org](https://matomo.org). If you wanted to build your own, I’d
suggest keeping it simple. Look at Matomo’s architecture and replicate
[https://github.com/matomo-org/matomo](https://github.com/matomo-org/matomo).

------
anonymousDan
Designing data intensive systems, M. Kleppmann

~~~
psankar
Thanks. But it is a bit theoretical. I was looking more along the lines of people's practical experience, with architectural choices, tools that they used, etc.

~~~
pcsanwald
I found this article quite interesting, it's about implementing an analytics system at stackoverflow: [https://jasonpunyon.com/blog/2015/02/12/providence-failure-i...](https://jasonpunyon.com/blog/2015/02/12/providence-failure-is-always-an-option/)

------
sfifs
If it's only a few thousand enterprise users, I'd actually say log to a bog-standard relational database from the server side, like MySQL or Postgres. Think through table schemas for everything you're going to log and make sure primary keys and nomenclature for everything talk to each other. Virtually any analytics platform or software talks to standard databases. Record as much as you can, because analytics use cases typically get generated following first data collection and analysis, and are iterative - so you also want to build in flexibility.

~~~
sneak
Suggestion: send the events to an elastic queue (like SQS) first, and have a
pair of ingestion processes do the actual INSERTs when the db is available.

Then you can take the db offline for maintenance and upgrades and not lose
data.
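
Roughly, with SQS and boto3 (queue URL and table names are placeholders; a sketch of the idea, not a hardened ingester):

    # Producers enqueue; a separate ingester drains the queue and does the INSERTs,
    # so a DB outage just leaves messages on the queue to be retried later.
    import json
    import boto3
    import psycopg2
    from psycopg2.extras import Json

    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/analytics-events"

    def send_event(event):
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(event))

    def ingest_forever():
        conn = psycopg2.connect("dbname=analytics")   # illustrative DSN
        while True:
            resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10,
                                       WaitTimeSeconds=20)
            for msg in resp.get("Messages", []):
                with conn, conn.cursor() as cur:      # commits before delete
                    cur.execute("INSERT INTO raw_events (payload) VALUES (%s)",
                                (Json(json.loads(msg["Body"])),))
                sqs.delete_message(QueueUrl=QUEUE_URL,
                                   ReceiptHandle=msg["ReceiptHandle"])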

~~~
madhadron
Alternately, log to disk and have another process ingest and write to the
database. Appending to files is an amazing persistent queue.

~~~
awinder
How does your writer process deal with failures, track the state of what it has written, and prune the file when it doesn't need old data anymore? You don't have to use SQS, but this problem has tons of available options; I wouldn't resort to rolling your own (if anything, just pick up something that works off a leveldb/sqlite/etc file and has already implemented all this boring stuff for you).

~~~
madhadron
Pruning the file is done the same way as logrotate, or even with logrotate. You don't track state. When the loader starts up, you spool through the whole file and use ON CONFLICT DO NOTHING (or the MySQL equivalent, INSERT IGNORE) in your INSERT statement.
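
A sketch of that idempotent replay (assuming each event already carries a unique id; names here are placeholders):

    # Replaying the whole file is safe because the primary key makes the
    # INSERT a no-op for rows that were already loaded.
    import json
    import psycopg2
    from psycopg2.extras import Json

    conn = psycopg2.connect("dbname=analytics")        # illustrative DSN
    with conn, conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE IF NOT EXISTS raw_events (
                event_id uuid PRIMARY KEY,
                payload  jsonb
            )
        """)
        with open("/var/log/myapp/events.log") as f:   # hypothetical log file
            for line in f:
                event = json.loads(line)
                cur.execute("""
                    INSERT INTO raw_events (event_id, payload)
                    VALUES (%s, %s)
                    ON CONFLICT (event_id) DO NOTHING
                """, (event["id"], Json(event)))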

~~~
sneak
The nice thing about the queue is that you can do fanout (eg multiple SQS
subscribers to the SNS topic or similar) and have multiple parallel ingestors
or change out the backend consumer/db injector without the producers knowing
anything about it.

You also get HA for free; your solution depends on local node disk
persistence.

~~~
madhadron
> your solution depends on local node disk persistence.

We have reached a point where people think that local node disk persistence is
an exotic property.

~~~
sneak
The Chaos Monkey is real. I don’t like losing data unnecessarily. Queues are
cheap and are usually run by Other People. It’s basically free reliability.
It’s also nice not having to specifically test the mode where the db is
offline; this gives free maintenance windows with literally no alternate
procedure on producers (and no additional testing).

If we’re gonna sign our lives over to Amazon and Google anyway, we might as
well get some reliability out of the exchange.

------
norejisace
I have a slightly adjusted question: any good talks / online training programs that touch on digital measurement across channels (media pixels, web analytics, 3rd-party data, etc.)? Any pointers?

------
reilly3000
Some questions for you:

- Who will be viewing these reports when they are done? Who do you want to have a view of the data eventually?

- How fresh do you need the data to be? Is 24 hours, 4 hours, or 4 seconds okay to wait?

- Do you need to be alerted of anomalies in the data?

- How long do you intend to store the raw data? Aggregated data?

- Does your data need to contain anything that could personally identify a user in order to make a useful analysis? Do you serve customers in the EU?

I'll check back later today and see if I can provide any insights based on
your response.

------
Spone
Ahoy [https://github.com/ankane/ahoy](https://github.com/ankane/ahoy) is an
interesting tool that we use to replace Google Analytics in most projects now.

It covers all the basic needs, and even if you're not using Rails, I think you
can draw inspiration from it!

------
nwsm
Since no one is answering the question, this talk by Sonos engineers at AWS
re:Invent 2015 is really good:

[https://www.youtube.com/watch?v=-70wNNrxf6Q](https://www.youtube.com/watch?v=-70wNNrxf6Q)

------
unixhero
Here you go:

AWS re:Invent 2017: Building Serverless ETL Pipelines with AWS Glue (ABD315)

[https://www.youtube.com/watch?v=eQBHIINW8VY&t=2692s](https://www.youtube.com/watch?v=eQBHIINW8VY&t=2692s)

------
ssvss
[https://druid.apache.org/](https://druid.apache.org/)

------
unixhero
AWS ...SFTP-> S3 -> GLUE -> REDSHIFT ... PowerBI, Tableau

------
tedmiston
I helped build an analytics platform that served many millions of events. Nothing about it is really difficult _until you scale heavily_.

The tl;dr - It's not worth your time / energy to build from scratch at this
scale. Leveraging an existing standard like the analytics.js spec [1] makes
web analytics very easy and quick to get started. With this approach you just
have to add one JS snippet to your website and never need to update it in your
code. If you are interested in the internals, you might enjoy digging deeper
into the spec to understand why it was designed this way.

Two services that implement this spec are Segment [2] and MetaRouter [3] [full
disclosure: I helped build the product that became MetaRouter at a previous
job]. They have different target audiences and pricing models but both are
worth a look.

You can think of these types of services as a meta analytics service that
routes your events to destination analytics services and data stores of your
choice. The great thing about using the standard is you can benefit from all
of the many integrations that have already been created with various analytics
services, databases, data warehouses, etc [4]. These destination catalogs can
also help you decide what services to explore and try next as you need more
advanced features.

To get started with a meta analytics service, in the management dashboard,
just add your API keys and config values for each service. For a simple
service like Google Analytics this is literally just one simple key to copy
and paste.

As far as adding custom event monitoring to your site within the analytics.js spec, first, you mainly want to be concerned with the Track call [5], which is a way to say: for an arbitrary event, e.g. ProductAddedToCart, I would like to attach this JSON object of properties, e.g. a product name and price.

And finally, user info like name, email, IP, etc. is handled by Identify [6]. You can add custom fields too (traits on an identify are roughly equivalent to properties on a track event, but less transient).
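
Not a full client, but to give a feel for the spec, here is roughly what the two core messages look like as plain payloads; the field names (type, userId, event, properties, traits) come from the spec in [5] and [6], while the values are invented:

    # A Track message: "this user did this event, with these properties".
    track_message = {
        "type": "track",
        "userId": "u_123",
        "event": "ProductAddedToCart",
        "properties": {"name": "Blue T-Shirt", "price": 19.99},
    }

    # An Identify message: "this user has these (less transient) traits".
    identify_message = {
        "type": "identify",
        "userId": "u_123",
        "traits": {"name": "Ada Lovelace", "email": "ada@example.com"},
    }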

Going with an existing standard and SaaS-based approach will save a ton of
time and engineering effort.

[1]: [https://segment.com/docs/spec/](https://segment.com/docs/spec/)

[2]: [https://segment.com/](https://segment.com/)

[3]: [https://www.metarouter.io/](https://www.metarouter.io/)

[4]:
[https://segment.com/docs/destinations/](https://segment.com/docs/destinations/)

[5]:
[https://segment.com/docs/spec/track/#example](https://segment.com/docs/spec/track/#example)

[6]:
[https://segment.com/docs/spec/identify/#example](https://segment.com/docs/spec/identify/#example)

