
KSQL: Open Source Streaming SQL for Apache Kafka - uptown
https://www.confluent.io/blog/ksql-open-source-streaming-sql-for-apache-kafka/
======
daddykotex
I've had the question for a while so I'll ask it here, maybe someone can help
me.

Suppose you modeled your domain with events and your stack is built on top of
it. As stuff happens in your application, events are generated and appended to
the stream. The stream is consumed by any number of consumers and awesome
stuff is produced with it. The stream is persisted and you have all events
starting from day 1.

Over time, things have changed and you have evolved your events to include
some fields and deprecate others. You could do this without any downtime
whatsoever by changing your events in a backward-compatible way.

What is a good approach to what I'd call a `replay`?

When you want to replay all events, the version of your apps that will consume
the events may not know about the fields that were in the events on day one.

~~~
stingraycharles
As always in these types of scenarios, the answer is: it depends. It depends
on the amount of data you have. It depends on how big the divergence from the
original schema is. Etcetera.

My personal philosophy is to always leave event data at rest alone: data is
immutable, you don't convert it, and you treat it like a historical artifact.
You version each event, but never convert it into a new version in the actual
event store. Any version upgrades that should be applied are done when the
event is read; this requires automated procedures to convert any event version
N to another version N + 1, but having these kinds of procedures in place is
good practice anyway. Some might argue that doing this every time an event is
read is a waste of CPU cycles, but in my experience that cost is far
outweighed by the downside of losing the actual event stored at that time in
the past,
and this type of data is accessed far less frequently than new event data.
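
This read-on-demand upgrade can be sketched as a chain of upcasters, one per
version step. A minimal Python sketch; the event shapes and field changes
below are hypothetical, just to illustrate the mechanism:

```python
# Each upcaster converts an event from version N to version N + 1.
# The chain is applied lazily, at read time, never in the event store.

def upcast_v1_to_v2(event):
    # Hypothetical change: v2 renamed "user" to "user_id".
    e = dict(event, version=2)
    e["user_id"] = e.pop("user")
    return e

def upcast_v2_to_v3(event):
    # Hypothetical change: v3 added a "source" field with a default.
    e = dict(event, version=3)
    e.setdefault("source", "unknown")
    return e

UPCASTERS = {1: upcast_v1_to_v2, 2: upcast_v2_to_v3}
LATEST = 3

def read_event(stored):
    """Upgrade a stored event to the latest version, one step at a time."""
    event = dict(stored)
    while event["version"] < LATEST:
        event = UPCASTERS[event["version"]](event)
    return event
```

The stored event is never touched; a day-one event read today simply passes
through every upcaster on its way out.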

~~~
dkersten
I suppose you can always trade those CPU cycles off against storage and cache
the N+1 version (in a separate Kafka topic or elsewhere), so now reading the
latest-version data is fast, yet you still retain the original data intact, at
the expense of more storage. This does complicate the storage though, as you
now have multiple data sources, but nothing that can't be solved.

~~~
raulk
Bingo. If you equate this technique to a DB migration, you could have "up" and
"down" directions for the translations from version N <=> N+1.

Then if you have 90% confidence you'll only ever need to replay the upgraded
stream, you can upgrade it and destroy the previous version.

If at some point (the remaining 10%) you need to rescue the old stream, you
can run the "down" direction and rehydrate the old version of the stream.
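
A toy sketch of such reversible migrations, assuming each step is lossless so
the "down" direction can reconstruct the old version exactly (the `name` split
is a made-up example):

```python
# One (up, down) pair per version step, like DB migrations.

def up_1_to_2(event):
    # Hypothetical lossless change: split "name" into first/last.
    first, _, last = event["name"].partition(" ")
    return {"version": 2, "first": first, "last": last}

def down_2_to_1(event):
    # Reverse direction: reassemble the original field.
    return {"version": 1, "name": f"{event['first']} {event['last']}"}

MIGRATIONS = {1: (up_1_to_2, down_2_to_1)}

def migrate(event, target):
    """Walk the migration chain up or down to the target version."""
    e = dict(event)
    while e["version"] < target:
        e = MIGRATIONS[e["version"]][0](e)
    while e["version"] > target:
        e = MIGRATIONS[e["version"] - 1][1](e)
    return e
```

Upgrading and then downgrading round-trips back to the original event, which
is exactly the property a "rescue the old stream" replay relies on.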

~~~
tedmiston
It sounds good in theory. In practice, I haven't heard much around running
backwards migrations on a data warehouse / massive collection of events but
I'm sure some out there already do it.

~~~
dkersten
I suppose one needs to take care that migrations are never lossy so that the
full information for upgrading or downgrading a version is available.

~~~
tedmiston
Yeah, that's the challenge. For instance, how do you handle when a column was
one data type but then down the road was changed to another type when the two
aren't cross compatible or could potentially break?

~~~
raulk
You could retain this info in a meta field of flexible type. For a DB, it
could be a JSON column. For messages, it could be an extra _meta field on the
message that the systems themselves ignore.

------
CSDude
Apache Flink is also a good alternative, and works very well. We have used it
in production for a while for generating live reports. I made a simple example
[1]; have a look at the docs if you are interested in more [2]. I'm definitely
going to try Kafka's version; its take on stream processing [3] is also
interesting.

[1] [https://medium.com/@mustafaakin/flink-streaming-sql-example-...](https://medium.com/@mustafaakin/flink-streaming-sql-example-6076c1bc91c1)

[2] [https://ci.apache.org/projects/flink/flink-docs-release-1.3/...](https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/table/sql.html)

[3]
[http://docs.confluent.io/current/streams/index.html](http://docs.confluent.io/current/streams/index.html)

~~~
nehanarkhede
I'm one of the authors of Kafka. I've outlined some differences between
Flink's support for streaming SQL and KSQL in this Twitter thread -
[https://twitter.com/juliusvolz/status/902283513382051840](https://twitter.com/juliusvolz/status/902283513382051840)

Here's a summary:

\- KSQL has a completely interactive SQL interface, so you don't have to
switch between DSL code and SQL.

\- KSQL supports local, distributed, and embedded modes. It is tightly
integrated with Kafka's Streams API and Kafka itself and doesn't reinvent the
wheel, so it is simple to use and deploy.

\- KSQL doesn't have external dependencies for orchestration, deployment, etc.

\- KSQL has native support for Kafka's exactly-once processing semantics and
supports stream-table joins.

~~~
sandGorgon
@neha - where do you think Kafka is going to evolve in the world of data
processing?

I'm very bullish on kafka. Today we have Spark for batch data computation and
have already switched some of our streaming stuff to Kafka.

Do you see yourselves entering the batch processing space anytime soon?
Google has officially said that Flink is "compelling" because of its
compatibility with the Beam model.

If I can step on thin ice... is it easier for Flink to commandeer Kafka, or
for Kafka to win over batch processing?

~~~
blain_the_train
What do you mean by "batch processing"? Personally, I find that term
confusing.

I believe that if Kafka can do streaming then it can effectively do batch,
since batch is a subset of streaming.
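
One way to see it: a bounded "batch" input is just a stream that happens to
end, so the same per-event fold serves both. A toy sketch (my own
illustration, not Kafka's API):

```python
# A per-event fold: emits an updated count after every record.
# Feed it a finite list and the last emission is the "batch" answer;
# feed it an unbounded generator and the emissions are the stream view.

def running_count(stream):
    counts = {}
    for event in stream:
        counts[event] = counts.get(event, 0) + 1
        yield dict(counts)  # snapshot of the state after this event
```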

------
filereaper
I really wish frameworks offering SQL would say upfront what level of SQL
compatibility they have: is it SQL:2011, is it Postgres, etc.

Anyways, if anyone's wondering, here's the GitHub page. [1]

Also from FAQ [2]:

Is KSQL fully compliant to ANSI SQL?

KSQL is a dialect inspired by ANSI SQL. It has some differences because it is
geared at processing streaming data. For example, ANSI SQL has no notion of
“windowing” for use cases such as performing aggregations on data grouped into
5-minute windows, which is a commonly required functionality in the streaming
world.

[1]
[https://github.com/confluentinc/ksql/blob/0.1.x/docs/syntax-...](https://github.com/confluentinc/ksql/blob/0.1.x/docs/syntax-reference.md#syntax-reference)

[2]
[https://github.com/confluentinc/ksql/blob/0.1.x/docs/faq.md#...](https://github.com/confluentinc/ksql/blob/0.1.x/docs/faq.md#frequently-asked-questions)

~~~
joaodlf
I'm assuming that most "SQL" implementations in distributed systems don't
actually follow the SQL standard. They just offer a SQL-like syntax to solve
common problems. I could be wrong, but that's the impression I get working
with tools like Cassandra/Spark.

------
lima
Combine this with Debezium[1] and you get real-time SQL queries on your
MySQL/PostgreSQL/MongoDB database!

RethinkDB, as far as I understand, does the same thing - their changefeed
mechanism would consume the DB log and run the queries against the log.

[1]: [http://debezium.io](http://debezium.io)

~~~
nehanarkhede
Yes, exactly. We plan to add support for various Kafka connectors to KSQL so
you can ingest/export streams from external systems, like databases, and then
do stream processing all using KSQL queries.

The purpose is to bring Streaming ETL and Stream Processing together to make
the user's life easier.

------
guidoism
That name is already taken by the San Carlos airport near Oracle:
[https://en.m.wikipedia.org/wiki/San_Carlos_Airport_(Californ...](https://en.m.wikipedia.org/wiki/San_Carlos_Airport_\(California\))

~~~
jmccarthy
Isn't it great that the name is entirely coincidental? Love that anecdote.

------
andmarios
At Landoop we submitted a proposal to present our KSQL at the Kafka Summit,
but we were rejected. Ah, if only we knew... :)

~~~
nehanarkhede
An independent program committee makes these calls and had to pick from 166
submissions. The program committee picks talks that are insightful and
technically challenging.

~~~
andrewstevenson
I think a bit of clarification is required. KCQL, the Kafka Connect Query
Language, was jointly developed with Landoop when Confluent commissioned
DataMountaineer to write the JDBC sink. The reason we added it was to simplify
the configuration, but it also enabled us to filter and support various
options of the many data sources/sinks we have connectors for. Confluent
removed the KCQL from the sink and reverted to a Flume-style configuration we
were trying to avoid. It's good to see Confluent plans to support their KSQL
in Connect, following DataMountaineer's lead. We hope this is optional. We can
look at migrating if it supports all the options we need that KCQL provides.

Regarding Landoop's proposal: this was not about KCQL but about SQL on
streams, also named KSQL, which is integrated with their new product Kafka
Lenses. We'll look at Confluent's SQL and see which one to go forward with,
maybe both, but we have happy customers using our version.

But congratulations on your KSQL, very nice! We (Landoop and DM staff) have
proved many ex-colleagues in the investment banking world wrong about Kafka,
and this cements our decision to use it. Thanks.

------
buremba
I love continuous queries over append-only datasets, and it's great that
Kafka added support for them.

It looks like there is one main contributor
([https://github.com/confluentinc/ksql/graphs/contributors](https://github.com/confluentinc/ksql/graphs/contributors))
though; it seems that the other contributors either wrote the documentation
or helped with the packaging. That's not a great sign considering how big
this project is (there are competitors, such as PipelineDB, that do only
this); hopefully you can create a team just for KSQL.

~~~
miguno
To add to what nehanarkhede said: I'd rather see it as a testament to the
powerful building blocks of Kafka and its Streams API that you can actually
implement a project such as KSQL with very few engineers.

Also, the commits in KSQL reflect only part of the work -- they don't include
design discussions, code reviews, etc.

Lastly, keep in mind that the git repository was cleaned (think: squashed
commits) prior to publication, see the very first commit in the repository's
timeline. So you don't see the prior work/commits of other Confluent engineers
that went into KSQL.

------
BMarkmann
I understand the underlying architecture is fundamentally different, but the
end result seems kind of akin to some of the "complex event processing" (CEP)
tools out there like Esper (i.e. feeding incoming data into established
queries instead of executing queries against at-rest data). Would this fit
similar
sorts of use cases / fit into the CEP market?

~~~
nehanarkhede
Yes, KSQL is ideal for use cases similar to those that CEP was initially
targeted at, and more - from real-time anomaly detection, monitoring, and
analytics to application development and Streaming ETL. As you alluded to, the
big difference is that KSQL is designed as a distributed Streaming SQL engine
that can run at Kafka scale.

~~~
edan0
If you are interested in performing ultra low latency (<1ms) CEP via SQL, take
a close look at SQLstream Blaze
([http://www.sqlstream.com](http://www.sqlstream.com)) and its full
implementation of Allen's Interval Algebra
([https://en.wikipedia.org/wiki/Allen%27s_interval_algebra](https://en.wikipedia.org/wiki/Allen%27s_interval_algebra))
via SQL temporal predicates:
[http://sqlstream.com/docs/sqlrf_planned_feature_temporal_pre...](http://sqlstream.com/docs/sqlrf_planned_feature_temporal_predi.html?zoom_highlightsub=leads).
This stuff runs at 1,000,000 events per second per core and can scale out in
conjunction with Kafka across any number of servers.

------
cbsmith
This is the Kafka answer to Pipelinedb. I love it.

~~~
cmollis
PipelineDB is great, and the guys are super helpful.

InfluxDB has something similar, and now so does AWS Kinesis (with their
Kinesis Analytics product).

~~~
Fergi
PipelineDB will become a standard, open-source PostgreSQL extension this year,
enabling anybody using PostgreSQL to leverage continuous SQL queries in
combination with all other PostgreSQL functionality.

[https://www.pipelinedb.com/blog/pipelinedb-0-9-8-postgresql-...](https://www.pipelinedb.com/blog/pipelinedb-0-9-8-postgresql-extension-refactor-nearing-completion)

~~~
_pmf_
> PipelineDB will become a standard, open-source PostgreSQL extension

What does this imply?

------
coolio222
Does anyone know details about the KSQL engine that computes the queries?
According to their GitHub repo, there can be multiple KSQL engines in a
client-server configuration. Is the workload for one query distributed? Is
the SQL translated into a program using the Kafka Streams API?

~~~
hojjat
Yes, queries are translated into the Kafka Streams API. In client-server
(cluster) mode, each query will run on every instance of the engine, the same
way Kafka Streams apps run on multiple instances.

------
largote
Fun fact: KSQL is the code of the airport closest to Oracle HQ (San Carlos
airport).

~~~
iaabtpbtpnn
Came here to post this... do you know if the code is meant to pay homage to
Oracle, or were they just trying to make it sound like San QarLos?

Edit: Oh, the wiki page explains that it predates Oracle. So probably the
latter.

------
stuartaxelowen
The many streaming SQL options are cool - but how do you test streaming SQL as
it comes to represent a larger part of your application?

------
beagle3
ksql was the SQL dialect understood by Kx Systems' k2 and k3 versions. It
included nice things like foreign-key chasing.

------
chandanrai
This link is broken in the article: [step through the demo
yourself](https://github.com/confluentinc/ksql/blob/0.1.x/docs/demo.md#demo)

------
tycho01
Now I'm wondering if something like Mondrian could help stream MDX queries
over this...

------
jnordwick
This seems like a painfully slow and complex way to do event processing.

~~~
twic
What would be faster and less complex ways? Can you think of any advantages
that KSQL has over them?

------
jasonrhaas
Seems like a somewhat similar idea to AWS Aurora.

~~~
chicagobuss
Hmm. I don't see that at all. Aurora is a way to query data in S3. This is a
way to query Kafka streams in real time.

~~~
elvinyung
I think you mean Athena :P

This seems to be basically similar to Spark, which lets you perform full SQL
queries on Kafka streams.

~~~
nehanarkhede
Spark SQL is different from KSQL in the following ways:

\- Spark SQL is not an interactive Streaming SQL interface. To do stream
processing, you have to switch between writing code using Java/Scala/Python
and SQL statements. KSQL, on the other hand, is a completely interactive
Streaming SQL engine. You can do sophisticated stream processing operations
interactively using SQL statements alone.

\- KSQL is a true event-at-a-time Streaming SQL engine. Spark SQL is
micro-batch.
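
A toy sketch of the distinction (my own illustration, not either engine's
implementation): event-at-a-time emits an updated result per record, while
micro-batching buffers records and emits once per batch:

```python
# Running sum, two ways.

def event_at_a_time(events):
    total = 0
    for e in events:
        total += e
        yield total  # one output per input record

def micro_batch(events, batch_size):
    total = 0
    batch = []
    for e in events:
        batch.append(e)
        if len(batch) == batch_size:
            total += sum(batch)
            batch.clear()
            yield total  # one output per completed batch
    if batch:  # flush the final partial batch
        total += sum(batch)
        yield total
```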

~~~
rxin
What you said about Spark was not true. One can totally interactively query a
Kafka stream using Spark, in SQL.

