
Announcing Kafka Streams: Stream Processing Made Simple - nehanarkhede
http://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple
======
ameyamk
This looks great. Some open questions -

1\. How does it scale horizontally? If I am consuming from, say, a topic with
100 partitions, do I have to deploy this library on multiple nodes? Or is the
expectation that we embed this inside Spark Streaming/Storm/Samza/some other
real-time processing framework?

2\. Is the RocksDB instance local to your consumer/thread/node, or can state be
stored across nodes? What happens when joins are needed across partitions?

I have been doing real-time processing for a long time now, and this does
capture most of the typical things you end up doing with Kafka plus Spark/
Storm etc., so it does look like a significant step forward.

I am just trying to understand where exactly it fits.

~~~
miguno
Disclaimer: I work for Confluent (www.confluent.io). That said, let me clarify
that Kafka Streams is a component of the Apache Kafka open source project.
There's nothing proprietary about it. The reason the links below point to
docs.confluent.io is that the Apache Kafka docs for Kafka Streams are not
ready yet; they will be once Kafka 0.10 is released, which will be the first
official Kafka release that includes Kafka Streams.

> 1\. How does it scale horizontally?

See
[http://docs.confluent.io/2.1.0-alpha1/streams/architecture.h...](http://docs.confluent.io/2.1.0-alpha1/streams/architecture.html#parallelism-
model).

> Do I have to deploy this library on multiple nodes?

Kafka Streams really is a normal Java library. You include it in "your" Java
application just like you'd include a library such as Apache Commons, Google
Guava, or JUnit. There's no "deploying" of this library -- i.e. no need to
pre-ship it to machines; also, you don't install a cluster or anything like
that for Kafka Streams. It's a purely client-side library.
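To illustrate the "just a library" point, here is a minimal sketch of an embedded Kafka Streams app. The topic names, broker address, and class name are placeholders, and the code is written against the current Kafka Streams API (StreamsBuilder), which post-dates the 0.10-era API discussed here:

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class WordLengthApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Every instance started with the same application.id joins the
        // same logical app; partitions of the input topics are divided
        // among the running instances automatically.
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "word-length-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> lines = builder.stream("input-topic");
        lines.mapValues(line -> String.valueOf(line.length()))
             .to("output-topic");

        // start() just spins up processing threads inside this JVM --
        // nothing is submitted to an external processing cluster.
        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

You scale out simply by launching more copies of this same program (on the same or different machines); since they share an `application.id`, Kafka Streams rebalances the input partitions across them.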

Interestingly, many folks I have talked to seem to be confused initially about
the "library" aspect because I suppose, in the big data world, one has simply
become used to (and pigeonholed by?) frameworks like Spark, Storm, etc. that
you must deploy and operate separately (and which then dictate how they want
you to write your own apps against them).

Does that make any sense?

> 2\. Is RocksDB instance local to your consumer/ thread/ node?

Yes, it is local. But local state stores (RocksDB is but one choice for
implementing these) are also replicated for fault tolerance. Details at
[http://docs.confluent.io/2.1.0-alpha1/streams/architecture.h...](http://docs.confluent.io/2.1.0-alpha1/streams/architecture.html#state)
and
[http://docs.confluent.io/2.1.0-alpha1/streams/architecture.h...](http://docs.confluent.io/2.1.0-alpha1/streams/architecture.html#fault-
tolerance).
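As a rough sketch of what the local-plus-replicated state looks like in practice (hypothetical topic and store names; current Kafka Streams API), a simple count keeps its running totals in a local store, RocksDB by default, while every update is also mirrored into a compacted changelog topic so the store can be rebuilt on another node after a failure:

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;

public class ClickCountTopology {
    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> clicks = builder.stream("clicks");

        // The count's state lives in a local store (RocksDB by default).
        // Kafka Streams writes each update to a compacted changelog topic
        // as well, which is how the store survives instance failures.
        KTable<String, Long> clicksPerUser = clicks
                .groupByKey()
                .count(Materialized.as("clicks-per-user-store"));

        clicksPerUser.toStream().to("clicks-per-user");
        return builder.build();
    }
}
```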

Regarding joins: We have documented some information at
[http://docs.confluent.io/2.1.0-alpha1/streams/developer-
guid...](http://docs.confluent.io/2.1.0-alpha1/streams/developer-
guide.html#joining-streams).
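For a feel of what a join looks like in the DSL (hypothetical topic names; current Kafka Streams API), here is a sketch that enriches a stream of orders with a table of customers. Joins are computed partition-locally, so the two inputs must be co-partitioned: keyed the same way, with the same number of partitions.

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public class OrderEnrichmentTopology {
    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();
        // Both topics are assumed to be keyed by customer id; the
        // KStream-KTable join below is evaluated partition-locally,
        // hence the co-partitioning requirement.
        KStream<String, String> orders = builder.stream("orders");
        KTable<String, String> customers = builder.table("customers");

        KStream<String, String> enriched = orders.join(
                customers,
                (order, customer) -> order + " placed by " + customer);
        enriched.to("enriched-orders");
        return builder.build();
    }
}
```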

I hope this helps! If these pointers above aren't sufficient, please drop us a
note to dev@kafka.apache.org
([http://kafka.apache.org/contact.html](http://kafka.apache.org/contact.html)).
We really appreciate any such feedback (code, docs, whatever) to ensure people
have an easy and fun time getting started with Kafka as well as its upcoming
Kafka Streams library.

------
rollulus
Great job, Confluent! I'm really impressed by the helpful documentation and
blog posts!

