
Analytics data integrity is tough - pragmacoders
https://pragmacoders.com/analytics-data-integrity/
======
teej
I'm surprised that there isn't a mention of Google Dataflow aka Apache Beam.
The Beam programming model is specifically designed to solve nearly all of the
problems this post is addressing.

> It is likely that one day, I'll need to shard the data and distribute the
> processing amongst multiple servers. But no company I've used this with
> currently has enough data flowing through its analytics system, or intense
> amounts of real-time processing, to warrant such a complexity.

This solution is so over-engineered for low data volume. You could capture all
of the business value for 10% of the engineering effort by just dumping this
data into a database meant for analytics. And then you'd at least have an
answer for things like fixing broken data, making full-history business logic
changes, merging events, etc.

If you're in AWS, sending your events from Snowplow into S3 and then into
Redshift/Athena/Presto/PostgreSQL is the way to go.
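
For a sketch of the load step - assuming the events land in S3 as JSON (the
bucket and IAM role below are placeholders, and Snowplow's enriched format is
TSV by default, so the format options would differ):

    -- Bulk-load event files from S3 into a Redshift table (names are placeholders).
    COPY events
    FROM 's3://my-bucket/snowplow/enriched/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    JSON 'auto'
    TIMEFORMAT 'auto';

From there, Athena or Presto can also query the same S3 files directly.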

~~~
pragmacoders
I am aggregating events at the user level and the session level to gain a
higher-level understanding of user behavior.

Like: \- "What is the median time that users spend on a given page?"

\- "How many events do users generally fire in their first 3 sessions?"

-"Who visited Page A, then Page B, THEN Page C?"

-"How many users who used this feature also used this feature within the same day?"

Questions like these are difficult to answer with purely event-level metrics:
the queries take a really long time and aren't suited to exploration.

We're already using a data warehouse (AWS Redshift) to analyze event-level
metrics. But more complex queries are much easier, and much faster, when the
data is stored in ways better suited to such calculations.
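
As a sketch of what I mean (the schema below is illustrative, not the actual
tables from the article), a session-level rollup plus one of the questions it
answers might look like:

    -- One row per session, updated as events stream in (illustrative names).
    CREATE TABLE session_rollups (
        user_id        varchar,
        session_id     varchar,
        session_number integer,   -- 1 for the user's first session, and so on
        started_at     timestamp,
        event_count    integer,
        page_path      varchar    -- ordered pages, e.g. 'A>B>C', for path questions
    );

    -- "How many events do users generally fire in their first 3 sessions?"
    SELECT MEDIAN(events_in_first_3) AS typical_events
      FROM (SELECT user_id
                 , SUM(event_count) AS events_in_first_3
              FROM session_rollups
             WHERE session_number <= 3
             GROUP BY user_id) u;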

So, that's what this article is hopefully helping people to do!

For a related topic, "Entity-Centric Indexing" is the Elasticsearch equivalent
of what I'm working on!

------
sturgill
I’d recommend the event producer send a UUID that becomes the primary key on
the events table. The producer should also send the timestamp at which the
event occurred.

I could be missing something, but that seems to solve both the duplicate event
firing (an upsert command based on the UUID makes duplicate event writing a
non-issue) and the timing issues.
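
A minimal sketch in PostgreSQL syntax (the table and columns are hypothetical;
note that Redshift doesn’t enforce primary keys or support ON CONFLICT, so
there you’d de-duplicate via a staging table instead):

    -- Producer-generated UUID as the primary key (hypothetical table).
    CREATE TABLE events (
        event_id    uuid PRIMARY KEY,       -- assigned by the producer
        event_type  text NOT NULL,
        occurred_at timestamptz NOT NULL    -- when it happened, not when it arrived
    );

    -- Replaying the same event is a no-op: the duplicate write is ignored.
    INSERT INTO events (event_id, event_type, occurred_at)
    VALUES ('7d9f3a60-1c2b-4e5d-9a8f-123456789abc', 'page_view', '2018-03-01 12:00:00+00')
    ON CONFLICT (event_id) DO NOTHING;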

Though I’m still incredibly skeptical of “real-time analytics.” The number of
business cases that require actual real-time analysis is pretty limited. High-
frequency trading and...?

~~~
pragmacoders
Let's say we'd like to know the median time users spend on a given page.

If we leave these at an event level, in a database, we'd need to:

- Loop through all events

- Group them by user

- Figure out which ones represent the beginning of a page view

- Figure out which ones represent the end of a page view

- Subtract the two timestamps

- Aggregate these for all users in the database

- Find the median

This can take a prohibitive amount of time for a pretty simple question. It'd
be much faster if the data were already stored in a manner suited to such
exploration.
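
For instance (names here are illustrative, not an actual schema), if the
pipeline writes one row per completed page view, the whole list above collapses
into a single aggregate:

    -- One row per completed page view, written by the pipeline as events arrive.
    CREATE TABLE page_view_durations (
        user_id      varchar,
        session_id   varchar,
        page_type    varchar,
        started_at   timestamp,
        duration_sec integer   -- end event minus start event, computed at write time
    );

    SELECT page_type
         , MEDIAN(duration_sec) AS median_sec_on_page
      FROM page_view_durations
     GROUP BY page_type;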

As questions become more complex, so do the queries. They become hard to think
about, and the time it takes to process them explodes.

A related concept is "Entity-Centric Indexing", which discusses this problem
in terms of Elasticsearch.

Anywho - that's a long-winded reason why I'm taking this approach!

~~~
teej
I answer these types of questions literally every day on Redshift. My data set
is >1B events and more than doubling every year. You are underestimating the
ability of a database to do its job.

Here's 18 lines of SQL to save you a week of writing a data pipeline in Go.

       WITH page_timings AS (
         SELECT page_type
              , event_timestamp_utc
              , LEAD(event_timestamp_utc, 1)
                  OVER(PARTITION BY session_id
                           ORDER BY event_timestamp_utc)
                AS next_pageview
           FROM events
          WHERE event_type = 'page_view'
        )
         SELECT page_type
              , MEDIAN(
                  DATEDIFF(sec,
                           event_timestamp_utc,
                           next_pageview)
                ) AS median_sec_on_page
           FROM page_timings
          GROUP BY page_type
    ;

~~~
meow1032
It's been a while since I've done database stuff, so maybe this is a naive
question, but how reusable do you find a primarily SQL/Redshift approach for
answering these sorts of questions?

I remember getting these sorts of questions and having to write the kind of
query you just wrote, but then having to write ten other ones to answer
related questions with subtle differences (such as the event_type changing).
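
To make the concern concrete, here's the kind of factoring I'd hope for,
sketched with the column names from your query (the PARTITION BY includes
event_type so each filtered variant keeps the same next-event-of-that-type
semantics):

    -- Factor the windowing into a view once.
    CREATE VIEW session_event_timings AS
    SELECT event_type
         , page_type
         , event_timestamp_utc
         , LEAD(event_timestamp_utc) OVER (
             PARTITION BY session_id, event_type
             ORDER BY event_timestamp_utc
           ) AS next_event_ts
      FROM events;

    -- Each "subtly different" question is then just a different filter.
    SELECT page_type
         , MEDIAN(DATEDIFF(sec, event_timestamp_utc, next_event_ts)) AS median_sec
      FROM session_event_timings
     WHERE event_type = 'page_view'   -- swap for another event_type as needed
     GROUP BY page_type;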

~~~
aceregen
Hello, you may like to try out what we've been working on at Holistics
(www.holistics.io)! We make it easy for you to retrieve and share your SQL
data.

We help with two use cases that address the "duplicative" overhead of writing
the same SQL query.

1. Where you just need to adjust specific parameters/values in your query.
With Holistics, you can insert custom variables into your SQL query to make it
dynamic, addressing the subtle differences without repeating yourself. See an
example here: [https://docs.holistics.io/guides/adding-filters/](https://docs.holistics.io/guides/adding-filters/)

2. Where you have chunks of text that are reused across multiple questions.
These can be long chunks of `CASE-IF`s or CTEs/sub-queries that you use across
multiple reports. You can insert a template snippet (chunks of SQL syntax, for
example) that can be reused across multiple questions. You can find more
information here: [https://docs.holistics.io/query-templates/](https://docs.holistics.io/query-templates/)

~~~
meow1032
Thanks for the response. I'm more on the research side of things now, so I
don't have a current use case, but I'm thinking about moving back to industry,
and this is something I remember having trouble with when I was last there.

------
slap_shot
I have a lot of issues with this project, and I’m on my phone so I can’t
outline them all. At a quick glance, a few comments have already addressed
some of these concerns.

But my biggest question is WHY this person felt it necessary to do this
project. At first glance, there is no way Crystal is producing the traffic
required to justify rolling this solution. There are dozens of companies that
can solve this problem for a few hundred dollars a month, and they have
handled all the problems discussed in this article at serious scale for their
customers.

The most irritating part is that the developer states in the beginning why he
did this: because it’s fun.

Disclosure: I’m a founder of a company whose core product is a real-time
analytics platform for web and mobile. Of course I’m going to recommend “Buy”
for a small company like this in a Build vs. Buy analysis. But when
“professional engineers” say they’re building projects “because they are fun”,
a lot of people suffer.

~~~
pragmacoders
I am confused as to how your engineers learned to manage such a system, if
your take is that engineers should forgo a learning opportunity and instead
take the opportunity to give you their money.

I'm confused as to why you're angry at me for exploring a subject and giving
my learnings away for free to anyone who is interested.

I am also confused as to why you assume you know how I use the things created
in this article, or the technical needs of the projects I work on, when I have
described neither those needs nor the data I work with.

Ah - you own a company that sells the thing I'm trying to help people do for
free with open-source tools. Never mind. Not confused.

In all seriousness - if I'm leading people in the wrong direction, I
definitely would like to know a better direction to lead them. But that
direction won't be to deposit money into your wallet. It will be to help them
learn (hopefully) useful techniques for dealing with their problems.

