A really interesting proposal to unify stream processing and relational querying and to extend SQL with a semantics over time-varying relations.
In particular, the authors (from the Apache Beam, Flink, and Calcite projects) stress the need to:
* blur the boundary between streams and tables by using the single concept of a time-varying relation;
* make explicit the difference between event time and processing time, using monotonic watermarks to tie the two;
* control when these continuously varying relations are materialized, avoiding unnecessary work and producing only timely outcomes.
These ideas are not new per se, but here they are pushed further and nicely combined.
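The first idea can be sketched in a few lines of Python (the data and names are illustrative, not from the paper's formalism): a time-varying relation is simply a classic relation at each point in processing time, and a stream is the sequence of changes between those snapshots.

```python
# Illustrative change log of a relation: each entry records the
# processing time at which a row became visible.
events = [
    (1, ("alice", 3)),
    (2, ("bob", 5)),
    (4, ("alice", 7)),
]

def snapshot(at):
    """The relation's contents as of processing time `at`.

    A table query evaluates one snapshot; a streaming query
    continuously re-evaluates over successive snapshots.
    """
    return [row for t, row in events if t <= at]

assert snapshot(3) == [("alice", 3), ("bob", 5)]
```

The point of the unification is that the same relational query applies unchanged to either view; only the materialization strategy differs.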
However, I'm wondering whether SQL is the right tool for the task. For instance, Listing 2 seems complex compared to the query expressed in plain English. In particular, I disagree that "anyone who understands enough SQL to solve a problem in a non-streaming context still has the knowledge required to solve the problem in a streaming context as well".
A must-read paper for anyone keen on stream processing: Spark, Flink, Beam, Kafka Streams...
> However, I'm wondering whether SQL is the right tool for the task. For instance, Listing 2 seems complex compared to the query expressed in plain English
For me, Listing 2 is more complex and less intuitive than Listing 1 (CQL). Yet this is a general problem when we try to adapt relational (and SQL) concepts to such tasks (windowing, grouping, aggregation, etc.). One solution is to switch to a kind of column algebra rather than relational algebra, as described in [1] and [2] (it has also been applied to stream processing).
I agree that Listing 2 is a little verbose, but once you understand the syntax (not much different from vanilla SQL), it's pretty powerful. I think writing queries this way will ultimately allow for more expressiveness than something more concise. That isn't to say that this particular SQL is necessarily the best, but I think it works well for the problem at hand.
Sure, SQL is powerful, and I agree that any alternative would bring its own complexity. However, my concern is more about encapsulation than expressiveness or conciseness. Even if a watermark is used the same way in different queries, I have to repeat the same query clause again and again, as I already do with join conditions. Not a fundamental issue, but something that makes it harder to focus on the kernel of a query.
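One way to regain some encapsulation today is to factor the shared clause out at query-construction time. A hypothetical Python helper (the clause text is illustrative, not the paper's exact EMIT grammar):

```python
def emit_after_watermark(query: str) -> str:
    # Hypothetical helper: append the emission clause shared by many
    # queries, so it is written once instead of in every query body.
    # (Clause text is illustrative, not the paper's exact grammar.)
    return query.rstrip().rstrip(";") + "\nEMIT AFTER WATERMARK;"

q = emit_after_watermark(
    "SELECT team, SUM(score) FROM Scores GROUP BY team"
)
```

Of course, this pushes the encapsulation outside SQL itself, which is exactly the commenter's complaint: the language offers no native way to name and reuse such a clause.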
> However, I'm wondering if SQL is the right tool for the task.
In my anecdotal experience, yep, it's a data analysis language everyone in the industry knows - BAs, product managers, developers, and data scientists.
Having a standard SQL extension definitely eases adoption and migration from one tool to another. But rather than piling feature upon feature onto SQL, maybe we need a simpler way to extend and compose features.
On that point, I see a contradiction in the future-work section of the paper, where it is said that "Experience has also shown that pre-built solutions are never sufficient for all use cases; ultimately, users should be able to utilize the power of SQL to describe their own custom-windowing TVFs". But how can a SQL user add their own custom operator?
It seems that the main proposal is to make SQL watermark-aware. This additional "knowledge" allows groups and windows to be produced while taking late and out-of-order records into account.
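A minimal Python sketch of that idea (window size, data, and the watermark source are assumptions for illustration; real systems derive watermarks from the source, not from observed event times): a tumbling window is emitted only once the watermark passes its end, so out-of-order records arriving before that point are still included.

```python
from collections import defaultdict

WINDOW = 10  # tumbling window size in event-time units (illustrative)

def tumbling_sum(records):
    """records: iterable of (event_time, key, value), possibly out of
    order. Windows [start, start + WINDOW) are emitted once the
    watermark passes their end."""
    windows = defaultdict(int)   # (key, window_start) -> running sum
    watermark = float("-inf")    # crude watermark: max event time seen
    emitted = []
    for event_time, key, value in records:
        watermark = max(watermark, event_time)  # monotonic by construction
        start = (event_time // WINDOW) * WINDOW
        windows[(key, start)] += value
        # Emit every window whose end the watermark has now passed.
        for k_s in sorted(windows):
            if k_s[1] + WINDOW <= watermark:
                emitted.append((k_s[0], k_s[1], windows.pop(k_s)))
    return emitted

# The record with event time 5 arrives after the one with event time 7,
# yet is still folded into window [0, 10) before it is emitted.
assert tumbling_sum([(1, "a", 2), (7, "a", 3), (5, "a", 4),
                     (12, "a", 1)]) == [("a", 0, 9)]
```

In this sketch a record arriving after its window was already emitted would open the window again and yield an extra partial result; real systems instead choose explicitly whether to drop such records or emit a refinement, which is precisely what the paper's materialization controls are for.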