
One SQL to Rule Them All: Management of Streams and Tables - dwenzek
https://arxiv.org/abs/1905.12133
======
truth_seeker
Great write-up. SQL is the most underrated language.

I am doing most of the stuff mentioned in that paper right now in my project
with PipelineDB extension.

PostgreSQL 11 +

CitusDB extension (horizontal scaling and sharding) +

cstore_fdw extension (columnar data storage) +

file_fdw (to treat file as table) +

PipelineDB extension (streaming computations and materialized views) +

TimescaleDB extension (timeseries data storage)
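
The PipelineDB part of that stack can be sketched in plain DDL. This is a rough, hypothetical example assuming the PipelineDB 1.0 extension syntax (where streams are foreign tables on the `pipelinedb` server and continuous views are views with `action=materialize`); names and schema are made up:

```sql
-- Enable the extension (assumes PipelineDB 1.0+ packaged as a PG extension)
CREATE EXTENSION pipelinedb;

-- A stream is declared as a foreign table on the pipelinedb server
CREATE FOREIGN TABLE events (ts timestamptz, user_id bigint, value numeric)
    SERVER pipelinedb;

-- A continuous view is updated incrementally as rows arrive on the stream
CREATE VIEW hourly_totals WITH (action = materialize) AS
    SELECT date_trunc('hour', ts) AS hour, count(*) AS n, sum(value) AS total
    FROM events
    GROUP BY hour;

-- Feed the stream with ordinary INSERTs (or COPY, or any PG client)
INSERT INTO events VALUES (now(), 42, 3.14);
```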

Also, the PG 12 release will bring many optimizations for table partitioning, query
planning, and parallel query execution.

~~~
rkwasny
One thing that's missing is GPU acceleration, like MapD.

Being able to group by a column and do a count(*) in 100ms is mind-blowing. I
would pay for a MapD-style backend for PostgreSQL.

~~~
sitkack
While a GPU has excellent memory bandwidth and lots of parallel processors,
its connection to the rest of the system makes it not ideal for database work.
If you can have bit set indexes on the GPU, maybe. But data transfer and
kernel launch will dominate in the timing.

But dual-socket EPYC systems have ~300 GB/s of bandwidth to main
memory, and 1-4 TB machines are available right now.
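
This can be sanity-checked with a back-of-envelope model. The bandwidth figures below are rough ballpark assumptions for illustration (PCIe 3.0 x16, HBM-class GPU memory, the ~300 GB/s EPYC figure above), not benchmarks:

```python
# Rough model: a single-pass aggregation (e.g. GROUP BY + count(*))
# touches each byte exactly once, so the comparison reduces to bandwidth.
# All numbers are ballpark assumptions, not measurements.

GB = 1e9
data_bytes = 10 * GB     # 10 GB column scan
pcie3_x16 = 16 * GB      # ~16 GB/s host-to-GPU transfer
gpu_hbm = 900 * GB       # ~900 GB/s on-device bandwidth
cpu_mem = 300 * GB       # ~300 GB/s dual-socket main memory

gpu_time = data_bytes / pcie3_x16 + data_bytes / gpu_hbm  # transfer + scan
cpu_time = data_bytes / cpu_mem                           # scan in place

print(f"GPU (incl. transfer): {gpu_time * 1000:.0f} ms")
print(f"CPU (in memory):      {cpu_time * 1000:.0f} ms")
```

Under these assumptions the PCIe transfer alone dominates: the GPU path takes hundreds of milliseconds while the in-memory CPU scan finishes in tens, unless the data is already resident on the device.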

~~~
scott_s
I wrote a paper on pretty much that problem: "Evaluation of Streaming
Aggregation on Parallel Hardware Architectures", [https://www.scott-
a-s.com/files/debs2010.pdf](https://www.scott-a-s.com/files/debs2010.pdf)

Conclusion in brief: in order to "win" when trying to use an off-board
accelerator, you need to look at each byte transferred more than once. If you
only look at a byte once to compute your result, then it's going to be faster
to just do the computation on the CPU.

The reason: the up-front cost of transferring that byte is high. The win comes
from reusing that byte many times in the massively parallel architecture on
the accelerator.
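
The "look at each byte more than once" rule can be phrased as a break-even calculation: how many passes over the data does the accelerator need before the one-time transfer cost is amortized? A small sketch, with bandwidth numbers that are assumptions for illustration only:

```python
# Break-even reuse factor for an off-board accelerator.
# Per byte: accelerator time = 1/transfer_bw + r/device_bw (one transfer,
# r passes), CPU time = r/cpu_bw (r passes over main memory). Find the
# smallest integer r where the accelerator wins.

def breakeven_passes(transfer_bw, device_bw, cpu_bw):
    """Smallest r such that 1/transfer_bw + r/device_bw < r/cpu_bw."""
    r = 1
    while 1 / transfer_bw + r / device_bw >= r / cpu_bw:
        r += 1
    return r

# e.g. ~16 GB/s transfer, ~900 GB/s device memory, ~100 GB/s CPU memory
print(breakeven_passes(16, 900, 100))
```

With these made-up numbers the accelerator only wins once each byte is reused several times, which matches the paper's conclusion: single-pass computations stay on the CPU.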

------
dwenzek
A really interesting proposal to unify stream processing and relational
querying and to extend SQL with a semantics over time-varying relations.

In particular, the authors (from the Apache Beam/Flink/Calcite projects)
stress the need to:

* smooth the frontier between streams and tables by using a single concept of time-varying relations;

* make explicit the difference between the event and processing times, using monotonic watermarks to tie the two;

* control when these continuously varying relations are materialized, avoiding unnecessary work and producing only timely outcomes.
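
Paraphrasing those three points in the spirit of the paper's proposed extensions, windowing becomes a table-valued function and emission is controlled by an explicit clause. This is a hedged sketch, not the paper's exact surface syntax (identifiers and clause spellings here are illustrative):

```sql
-- Tumble assigns each row of the time-varying relation Scores to a fixed
-- one-hour event-time window; EMIT controls when the continuously varying
-- result is materialized (here: only once the watermark says the window
-- is complete).
SELECT team, window_start, SUM(score) AS total
FROM Tumble(Scores, DESCRIPTOR(event_time), INTERVAL '1' HOUR)
GROUP BY team, window_start
EMIT AFTER WATERMARK;
```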

These ideas are not new per se, but here they are pushed further and nicely
combined.

However, I'm wondering if SQL is the right tool for the task. For instance,
Listing 2 seems complex when compared to the query expressed in plain
English. In particular, I disagree that "anyone who understands enough SQL to
solve a problem in a non-streaming context still has the knowledge required to
solve the problem in a streaming context as well."

A must-read paper for anyone keen on stream processing: Spark, Flink, Beam,
Kafka Streams ...

~~~
asavinov
> However, I'm wondering if SQL is the right tool for the task. For instance,
> Listing 2 seems complex when compared to the query expressed in plain
> English

For me, Listing 2 is more complex and less intuitive than Listing 1 (CQL). Yet
it is a general problem when we try to adapt relational (and SQL) concepts to
such kinds of tasks (windowing, grouping, aggregation, etc.). One
solution is to switch to some kind of column algebra rather than relational
algebra, as described in [1] and [2] (it has also been applied to stream
processing).

[1] Joins vs. Links or Relational Join Considered Harmful:
[https://www.researchgate.net/publication/301764816_Joins_vs_...](https://www.researchgate.net/publication/301764816_Joins_vs_Links_or_Relational_Join_Considered_Harmful)

[2] From Group-By to Accumulation: Data Aggregation Revisited:
[https://www.researchgate.net/publication/316551218_From_Grou...](https://www.researchgate.net/publication/316551218_From_Group-by_to_Accumulation_Data_Aggregation_Revisited)

~~~
dwenzek
Thanks for the two links. I will have a look.

------
d__k
_> A time-varying relation is just what it says: a relation that changes over
time ... Critically, the full suite of existing SQL operators remain valid on
time-varying relations_

Any relation is by definition supposed to change over time. So why do we need
to introduce a special kind of relation? The authors essentially confirm that
there is no difference between classical relations and "time-varying"
relations.

Also, it is not clear how this relates to EPL (in CEP), which is also based
on SQL.

~~~
drunkpotato
According to relational theory, relations are immutable sets of tuples. The
way you “change” a relation is by assigning a new relation to the name, e.g.:

R1 <- R1 union R2

is the relational algebra way of “inserting” R2 into R1. Of course, most
RDBMSs (and the SQL family of languages) are only reasonably faithful
implementations of relational theory under some practical constraints.
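
That rebinding-not-mutation idea can be modeled directly. A toy illustration (relations as frozensets of tuples; not any real algebra library):

```python
# Toy model: a relation is an immutable set of tuples; "insert" is
# rebinding the name to the union, never mutating the old value.
r1 = frozenset({("alice", 1), ("bob", 2)})
r2 = frozenset({("carol", 3)})

old_r1 = r1
r1 = r1 | r2  # R1 <- R1 union R2

assert ("carol", 3) in r1
assert ("carol", 3) not in old_r1  # the original relation is unchanged
```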

But those differences are important, and I think the authors are proposing a
new relational model (though I don’t think this is a totally new idea; I’m not
a database researcher): one that incorporates time and the history of a
relation as first-class citizens of the theory, and that allows SQL to operate
natively on both traditional relational-theoretic relations and streams of
data. Pretty cool! Can anyone who knows more than me comment on how this
relates to Kafka tables, which seem to me a similar concept?

------
iblaine
Is the point of this paper that SQL should be used in streaming data
pipelines? Aren’t people doing this already?

------
Nano2rad
Not yet time to combine SQL and NoSQL.

~~~
chrisjc
What does this article/paper have to do with "NoSQL"?

------
SmokeGS
Commenting to read again.

~~~
grzm
You can use the _favorite_ link under the submission title to effectively
bookmark it. Your favorites are available in your profile.

~~~
scott_s
Or an upvote.

