Suppose you modeled your domain with events and your stack is built on top of it. As stuff happens in your application, events are generated and appended to the stream. The stream is consumed by any number of consumers and awesome stuff is produced with it. The stream is persisted and you have all events starting from day 1.
Over time, things have changed and you have evolved your events to include some fields and deprecate others. You could do this without any downtime whatsoever by changing your events in a backward-compatible way.
What is a good approach to what I'd call a `replay`?
When you want to replay all events, the version of your apps that will consume the events may not know about the fields that were in the events from day one.
My personal philosophy is to always leave event data at rest alone: data is immutable, you don't convert it, and you treat it like a historical artifact. You version each event, but never convert it into a new version in the actual event store. Any version upgrades that need to be applied are done when the event is read; this requires automated procedures to convert any event of version N to version N + 1, but having these kinds of procedures in place is good practice anyway. Some might argue that doing this every time an event is read is a waste of CPU cycles, but in my experience that cost is far outweighed by the downside of losing the actual event as it was stored at that point in the past, and this type of data is accessed far less frequently than new event data.
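To make that concrete, here is a minimal sketch of what upgrade-on-read can look like when the events happen to sit in a relational store with JSON payloads. The `events` table, its `version`/`payload` columns, and the defaulted `currency` field are all hypothetical, and PostgreSQL's `jsonb_set` is just one possible converter:

```sql
-- Sketch: stored rows stay untouched; the view presents every event in the v2 shape.
CREATE VIEW events_as_v2 AS
SELECT event_id,
       2 AS version,
       CASE
         -- hypothetical upgrade: v1 events lacked a "currency" field, default it on read
         WHEN version = 1 THEN jsonb_set(payload, '{currency}', '"EUR"')
         ELSE payload
       END AS payload
FROM events;
```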
Then if you have 90% confidence you'll only ever need to replay the upgraded stream, you can upgrade it and destroy the previous version.
If at some point (the remaining 10%) you need to rescue the old stream, you can run the "down" direction and rehydrate the old version of the stream.
Note that this "good practice" already has a name: it is usually called "migrations".
Migrations may be as simple as SQL UPDATE / ALTER TABLE statements. But they may also be transformations of JSON/XML/... structures, or any other complex calculation.
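For example, a write-back migration in the SQL sense could look like the sketch below (table and column names are invented for illustration), with an "up" step that backfills the new field and a matching "down" step that undoes it:

```sql
-- Hypothetical "up" migration: add the new field and backfill existing rows
ALTER TABLE orders ADD COLUMN currency VARCHAR(3);
UPDATE orders SET currency = 'EUR' WHERE currency IS NULL;

-- Matching "down" migration, should the old shape ever need to be rescued:
-- ALTER TABLE orders DROP COLUMN currency;
```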
It may not be the best term, as "migrations" usually imply that the result is written back to the data store. But apart from that, I don't see any problem using this term here.
It matches what some others have been saying as well. Thanks
The text is written by Greg Young - the lead on the EventStore project.
In data warehousing, particularly the Kimball methodology, if descriptive attributes are missing from dimensions, for example, it is common to represent them using some standard value, like, "Not Applicable" or "Unknown" for string values. For integers, one might use -1 or similar. For dates it might be a specially chosen token date which means "Unknown Date" or "Missing Date".
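A small, hypothetical example of that convention (column names invented), substituting sentinel values for missing dimension attributes at query time:

```sql
SELECT
  COALESCE(region,       'Unknown')          AS region,        -- missing string
  COALESCE(store_number, -1)                 AS store_number,  -- missing integer
  COALESCE(opened_date,  DATE '1900-01-01')  AS opened_date    -- token "Unknown Date"
FROM dim_store;
```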
It doesn't solve the problem of truly unknown missing information, but it at least gives a standard practice for labeling it consistently. Think of trying to do analytics on something where the value is unknown: not easy, but at least it is all in one bucket.
Certainly, if past values can be derived, even if they were not created at the time the data was originally created, that is one way of "deriving" the past when it was missing. But, otherwise, I don't think there is any other way to make up for the lack of data/information.
Build a view against these schema-change events as a table of schema version by timestamp, to allow parsing any arbitrary event.
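One way to read that suggestion, sketched with invented names: derive a version-validity table from the schema-change events themselves, then join any event to the version that was current when it was written.

```sql
-- Each SchemaChanged event opens a validity interval that lasts until the next one
CREATE VIEW schema_versions AS
SELECT schema_version,
       occurred_at                                   AS valid_from,
       LEAD(occurred_at) OVER (ORDER BY occurred_at) AS valid_to
FROM events
WHERE event_type = 'SchemaChanged';

-- Any event can now be parsed with the schema version in force at its timestamp
SELECT e.*, v.schema_version
FROM events e
JOIN schema_versions v
  ON e.occurred_at >= v.valid_from
 AND (e.occurred_at < v.valid_to OR v.valid_to IS NULL);
```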
Building apps using event sourcing, CQRS and microservices can easily become hell if the data models are not thought through.
If you create a STREAM or TABLE using KSQL, it makes sure that they are kept updated with every single event that arrives on the source Kafka topics.
That's what you'd expect in a true event-at-a-time Streaming SQL engine, which is what KSQL is.
Here's a summary:
- KSQL has a completely interactive SQL interface, so you don't have to switch between DSL code and SQL.
- KSQL supports local, distributed and embedded modes. It is tightly integrated with Kafka's Streams API and Kafka itself and doesn't reinvent the wheel, so it is simple to use and deploy.
- KSQL doesn't have external dependencies for orchestration, deployment, etc.
- KSQL has native support for Kafka's exactly-once processing semantics and supports stream-table joins.
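As a rough illustration of the points above (topic and column names are made up, and the exact syntax may differ between KSQL versions): a STREAM over one topic, a TABLE over another, and a stream-table join, all kept continuously up to date as events arrive.

```sql
CREATE STREAM pageviews (viewtime BIGINT, userid VARCHAR, pageid VARCHAR)
  WITH (KAFKA_TOPIC='pageviews', VALUE_FORMAT='JSON');

CREATE TABLE users (userid VARCHAR, region VARCHAR)
  WITH (KAFKA_TOPIC='users', VALUE_FORMAT='JSON', KEY='userid');

-- Stream-table join: enrich each pageview with the user's region
CREATE STREAM pageviews_enriched AS
  SELECT pageviews.userid, pageviews.pageid, users.region
  FROM pageviews
  LEFT JOIN users ON pageviews.userid = users.userid;
```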
While Flink has in fact no direct SQL entry point right now (and many users simply wrap the API entry points themselves to form a SQL entry point), the other statements are actually not quite right.
- Flink as a whole (and SQL sits just on top of the DataStream API) works in local, distributed and embedded modes as well.
- Flink does not have any external dependencies, not even Kafka/ZooKeeper; it is self-contained. One can even just receive a data stream via a socket if that works for the use case.
- Flink itself has always had exactly-once semantics, and works also exactly-once with Kafka.
I'm very bullish on Kafka. Today we have Spark for batch data computation and have already switched some of our streaming stuff to Kafka.
Do you see yourselves entering the batch processing space anytime soon? Google has officially said that Flink is "compelling" because of its compatibility with the Beam model.
If I can step on thin ice... is it easier for Flink to commandeer Kafka or for Kafka to win over batch processing?
I believe if Kafka can do streaming then it effectively can do batch as batch is a subset of streaming.
Anyways, if anyone's wondering, here's the GitHub page.
Also from the FAQ:
Is KSQL fully compliant to ANSI SQL?
KSQL is a dialect inspired by ANSI SQL. It has some differences because it is geared at processing streaming data. For example, ANSI SQL has no notion of “windowing” for use cases such as performing aggregations on data grouped into 5-minute windows, which is a commonly required functionality in the streaming world.
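For instance, the 5-minute-window case from the FAQ answer would look roughly like this in KSQL's dialect (stream and column names are made up, and syntax may vary by version):

```sql
-- Count events per page over tumbling 5-minute windows
SELECT pageid, COUNT(*)
FROM pageviews
WINDOW TUMBLING (SIZE 5 MINUTES)
GROUP BY pageid;
```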
There is a way of interpreting streams and tables to make ANSI SQL meaningful in the presence of streams, which we follow.
The big advantage is (besides not having to learn another syntax and retaining compatibility with SQL tools and dashboards) that this seamlessly handles the batch (bounded/static input) and streaming (unbounded/continuous) use cases with literally the same SQL statement.
There is a good explanation of ANSI SQL standard windowing at http://sqlstream.com/docs/conc_applicationdesign.html?zoom_h.... The document describes both tumbling and rolling windows as well as how they are achieved with standard syntax.
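Roughly what that standard-syntax windowing looks like (a sketch with invented stream and column names): a rolling one-minute aggregate expressed as an ordinary SQL window frame over a timestamp column.

```sql
SELECT rowtime,
       ticker,
       SUM(amount) OVER (
         PARTITION BY ticker
         ORDER BY rowtime
         RANGE INTERVAL '1' MINUTE PRECEDING   -- rolling window via SQL:2011 framing
       ) AS amount_last_minute
FROM trades;
```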
SQLstream could not agree with you more that an up-front explanation of SQL compatibility level is necessary to evaluate a product. SQLstream Blaze is SQL:2011 compliant.
RethinkDB, as far as I understand, does the same thing - their changefeed mechanism would consume the DB log and run the queries against the log.
The purpose is to bring Streaming ETL and Stream Processing together to make the user's life easy.
Regarding Landoop's proposal: this was not about KCQL but about SQL on streams, also named KSQL, which is integrated with their new product Kafka Lenses. We'll look at Confluent's KSQL and see which one to go forward with, maybe both, but we have happy customers using our version.
But congratulations on your KSQL, very nice! We (Landoop and DM staff) have proved many ex-colleagues in the investment banking world wrong about Kafka, and this cements our decision to use it. Thanks.
1. We don't support just streams. You can throw SQL at a Kafka topic as easily as SELECT * FROM `topic` [WHERE ]
2. We support selecting or filtering on the record metadata: offset/timestamp/partition (we haven't seen something similar in Confluent KSQL).
3. We integrate with Schema Registry for Avro. We hope to support Hortonworks schema registry soon as well as protobuf.
4. We allow for injecting fields in the Kafka Key part. For example: SELECT _offset as `_key.offset`, field2 * field3 - abs(field4.field5) as total FROM `magic-topic`
5. Just quickly looking at the Confluent KSQL "abs" function, I see it accepts Double only. Presumably everything is converted to Double before it hits the method and then converted back (too short a time to understand the whole implementation).
6. Filters: this is related to point 2. We allow filtering on message metadata. For example: SELECT * FROM topicA WHERE (a.d.e + b) / c = 100 and _offset > 2000 and partition in
Also not sure if they have customers yet using it. We do.
- KSQL is a full-fledged Streaming SQL engine for all kinds of stream processing operations, from windowed aggregations to stream-table joins, sessionization and much more. So it does more powerful stream processing on Kafka than what Landoop's product supports, which is simple projections and filters.
- KSQL can do that because it supports streams and tables as first-class constructs and tightly integrates with Kafka's Streams API and the Kafka log itself. We are not aware of any other products that do that today, including Landoop's tool.
- We will add support for Kafka connectors so you can stream data from different systems into Kafka through KSQL. This will cover what Landoop intended with KCQL (Kafka Connect Query Language).
- Confluent works with several very large enterprises and many of the companies that have adopted Kafka. We worked with those customers to learn what would solve real business problems and used that feedback to build KSQL. So it is modeled on real-world customer feedback.
- We'd love to hear feedback. Here's the repository https://github.com/confluentinc/ksql and here's the Slack Channel slackpass.io/confluentcommunity - #ksql
Hope that helps!
It looks like there is one main contributor (https://github.com/confluentinc/ksql/graphs/contributors), though; it seems that the other contributors either wrote the documentation or helped with the packaging. Not a great sign considering how big this project is (there are competitors that do only this, such as PipelineDB); hopefully you can create a team just for KSQL.
Also, the commits in KSQL reflect only parts of the work -- it doesn't include design discussions, code reviews, etc.
Lastly, keep in mind that the git repository was cleaned (think: squashed commits) prior to publication, see the very first commit in the repository's timeline. So you don't see the prior work/commits of other Confluent engineers that went into KSQL.
InfluxDB has something similar, and now so does AWS Kinesis (with their Kinesis Analytics product).
What does this imply?
PipelineDB also offers a clustering extension for large workloads (see: http://enterprise.pipelinedb.com/docs/)
But in terms of ad hoc, exploratory analytics workloads, yes - the scaling limitations would be the same, since for ad hoc, exploratory analytics PipelineDB and PostgreSQL are the same. But with that said, the processed, aggregated data that gets stored is generally much smaller than large volumes of granular data, so there is much less data to comb through with PipelineDB.
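For context, the pattern being described maps to PipelineDB's continuous views, which store only the rolled-up aggregates rather than the raw granular rows. A sketch with invented stream and column names (the stream itself is assumed to already exist):

```sql
CREATE CONTINUOUS VIEW pageviews_per_minute AS
  SELECT date_trunc('minute', arrival_timestamp) AS minute,
         COUNT(*) AS views
  FROM pageviews_stream
  GROUP BY minute;
```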
Edit: Oh, the wiki page explains that it predates Oracle. So probably the latter.
This seems to be basically similar to Spark, which lets you perform full SQL queries on Kafka streams.
- Spark SQL is not an interactive Streaming SQL interface. To do stream processing, you have to switch between writing code using Java/Scala/Python and SQL statements. KSQL, on the other hand, is a completely interactive Streaming SQL engine. You can do sophisticated stream processing operations interactively using SQL statements alone.
- KSQL is a true event-at-a-time Streaming SQL engine. Spark SQL is micro-batch.