Hacker News new | past | comments | ask | show | jobs | submit login
Everything is a funnel, but SQL doesn't get it (motifanalytics.medium.com)
34 points by mikpanko on June 23, 2022 | hide | past | favorite | 12 comments



Not to play Code Golf here but the premise that retention takes 50+ lines of SQL is plain wrong.

Here's an implementation (T-SQL) in less than 10 lines. Table setup script is here: https://pastebin.com/9DajyzLp . There are probably even shorter ways of doing it. Anyone who objects to hardcoded dates, just swap out with a relative DATEADD.

;WITH lastweek AS ( SELECT USERID, TS, [ACTION] FROM Users WHERE TS BETWEEN '2022-06-13' AND '2022-06-19' )

SELECT ROUND(CAST(f1.[c] AS FLOAT) / CAST(f2.[c] AS FLOAT),2) FROM

( SELECT COUNT(*) [c] FROM lastweek w INNER JOIN Users u ON w.USERID = u.USERID AND w.[ACTION] = 'A' AND u.TS BETWEEN '2022-06-20' AND '2022-06-26' ) f1,

( SELECT COUNT(*) [c] FROM Users u WHERE u.[ACTION] = 'A' ) f2


Yes, the title could be "author does not get SQL"...

The cumbersome SQL example also produces a much different report format than the simple ratio of the second query sketch. If we are charitable and assume a DBA could define what a "login" is just like the second query form assumes, we might have a view already and not need a clunky CTE to confuse things.

SELECT

(SELECT COUNT(DISTINCT user_id) FROM logins w1 JOIN logins w2 ON (w1.user_id = w2.user_id AND (EXTRACT(WEEK, w2.timestamp) - EXTRACT(WEEK, w1.timestamp))::int = 1)

/

(SELECT COUNT(DISTINCT user_id) FROM logins);


Moreover, their sequence example for the relational database is just plain the wrong way you would go about that.

In the relational example they give[1], that is how you'd capture "have they done X, Y, and Z" whereas the sequence is explicitly capturing the order those things were done in. If you care about that information, you would design your system for this.

The point could have been that this is actually simpler with sequences, not that it's impossible in a relational db. We already have loads of examples of how to capture this data and put it in order. But I suppose a table with a User, Action, and Timestamp column would really invalidate the point the author was trying to make. Then it's just a select and order by to get the order things happened in.

[1]: https://miro.medium.com/max/1026/0*NbWYw7bbEflgG-5m


Not to play Code Golf here but the premise that retention takes 50+ lines of SQL is plain wrong.

Yeah, where I stopped reading when when you drill down into that 50-line SQL query and find all these lines about unpacking JSON structs and so forth. Obviously a gratuitously bloated "see-this-is-the-alternative" example so author can pad whatever argument they were trying to make.

But even sillier is the article's premise:

Everything Is a Funnel,

No -- "Some things are funnels, more often then you might think." But that's not clickbait-y enough ("conversion-friendly" as they say in the industry), the author must have thought. And so they went with the current title that brought us all here.

Don't get me wrong -- I really like finding out about use cases where SQL doesn't do so well -- but the intellectual sloppiness (and sheer guile) of this piece (obviously an ad) is a major turn-off.


As a DE by trade, working in marketing and customer analytics, I agreed with with the premise that analyzing data as they said was what happens most often; however, like you, I disagree that it’s as complicated as this post claims.

As a selling tactic, I think they’re trying to market to those who think the data engineering side of things is a big drain on time and resources and another expensive tool will help.

This is the same snake oil that’s been delivered in many different ways over the entirely of my career.


Yeah I'm somewhat confused as well. I even somewhat agree that using only vanilla SQL might make it tricky to answer some complex questions (not limited to questions about sequences), but why would that be your goal?

The idea of a DSL to analyse sequences of events is interesting, but I'm not convinced by their examples (I mean if you want the order of events, why store them as columns of booleans saying X happened?). They also claim without proof that tables of events are somehow 'arcane' and hard to deal with, but the examples they give aren't exactly complicated even using just SQL (the best I can give them is that they may not be solvable without writing a query, but a new query language doesn't solve that problem).


Process Mining is a relatively new field of data analytics, focusing on exactly these types of questions using workflow or activity sequence data. Celonis, one of the biggest players in the field have even developed a query languages called PQL for querying these kind of information from “event logs”


The second example (with the sequences v. the table of boolean columns) seems like it'd be immediately resolved by replacing those booleans with timestamps.


Of course. It's a joke of an example. More likely you'd have 3 separate tables corresponding to each of the actions, that are joined together (keyed on the customer id) including the timestamps



Check out MATCH_RECOGNIZE. It is a new SQL feature that does just what the post describes.


The Arthur has no clue about SQL. He does not seem to have heard of a basic SQL capability - CTEs (Common Table Expressions) These make it very simple to sequentially process/query data.




Join us for AI Startup School this June 16-17 in San Francisco!

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: