

Infrastructure for Data Streams - tadasv
http://vilkeliskis.com/blog/2014/11/10/infrastructure_for_data_streams.html

======
eikenberry
From everything I've read, Kafka is a really bad fit for AWS. It is not
tolerant of network partitions; they stated this in their own design document,
where they present it as a CA system. In his Jepsen post on Kafka, Kyle backed
this up with more data.

Given this, why do people deploy it to AWS? It seems like an invitation to
disaster.

~~~
noelwelsh
I think you are vastly overstating the issue. The issue in the Jepsen post
([http://aphyr.com/posts/293-call-me-maybe-kafka](http://aphyr.com/posts/293-call-me-maybe-kafka))
requires a complex failure case to cause data loss. For most people using
Kafka, a little data loss very infrequently, while still accepting writes, is
acceptable. The alternative is to refuse all writes till the system is back.
This is unacceptable for typical use cases. (See
[http://blog.empathybox.com/post/62279088548/a-few-notes-on-kafka-and-jepsen](http://blog.empathybox.com/post/62279088548/a-few-notes-on-kafka-and-jepsen))

~~~
justinsb
It's a trade-off: C vs A. If you can't lose data, Kafka's current
configuration is unacceptable. There's an open issue, but it isn't fixed yet,
and this means Kafka is (currently) unsuitable for use as, e.g., a transaction
log. Kafka is great if your data is lower value (e.g. web analytics).

It's a pity, because it would be great to have an open-source component with
the Kafka architecture but where data loss is unacceptable. Hopefully this
will be fixed in Kafka itself; I've also been working on something based
around Raft.

------
nostrademons
Curious whether Cap'n Proto or another zero-copy serialization format might've
been a better choice than protobufs? Protobufs still need to parse the
message; it's just that the code to do so is automatically generated for you.
With Cap'n Proto you can read messages directly off the wire and save them, or
mmap a file full of them and access them in place.

Most of the downsides of Cap'n Proto also don't apply here. Compressing with
Snappy will elide all the zero-valued padding bytes. The format of an HTTP
message is relatively stable, so you don't get a lot of churn in the message
layout. HTTP doesn't have a lot of optional fields, so that's another
potential source of Cap'n Proto bloat that doesn't apply to your use case.
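
For intuition, the zero-copy idea fits in a few lines: a fixed-layout format
lets you read one field straight out of the buffer without deserializing the
rest. This is a hand-rolled toy layout, not Cap'n Proto's actual wire format:

```python
import struct

# Hypothetical fixed-layout record: status code (u16), timestamp (u64),
# body length (u32), packed little-endian at known offsets.
RECORD = struct.Struct("<HQI")

def encode(status, ts, body_len):
    # Writing is a single pack into a flat buffer; no object graph to walk.
    return RECORD.pack(status, ts, body_len)

def read_status(buf):
    # Read just the two bytes of the field we want; nothing else in the
    # buffer is parsed. This is the win a zero-copy format generalizes.
    return struct.unpack_from("<H", buf, 0)[0]

wire = encode(200, 1415577600, 512)
assert read_status(wire) == 200
```

The same `read_status` works whether `wire` came off a socket or out of an
mmap'd file, which is the point of storing messages in wire format.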

------
felipesabino
My lazy self always wonders how nice it would be if these infrastructure
designs were accompanied by a docker/fig configuration example, to be used as
a starting point/proof of concept for people looking for similar solutions.

It obviously happens sometimes [1] [2], but it should be more common...

[1] [http://alvinhenrick.com/2014/08/18/apache-storm-and-kafka-cluster-with-docker/](http://alvinhenrick.com/2014/08/18/apache-storm-and-kafka-cluster-with-docker/)

[2]
[https://registry.hub.docker.com/u/ches/kafka/](https://registry.hub.docker.com/u/ches/kafka/)

~~~
erichmond
As you pointed out, this is already pretty common. As someone who's tried this
before, it's hard to find the right balance. Setting up a dead-simple Kafka
config is about 10 lines of shell; so simple it almost doesn't make sense to
do.

A complex Kafka setup is pretty involved (relatively speaking) and becomes
domain-specific pretty quickly, at which point it probably becomes less
usable/understandable to someone who is trying to learn or get interested.

On the whole, I agree with you. Some of the open-source software that exists
now is truly amazing, and I think lots of people are defaulting to less-than-
optimal solutions because they just don't have exposure to the latest and
greatest.

------
zerop
We use Netty for transport in a similar scenario. We haven't stress-tested it
at the limits mentioned, but wouldn't a write-behind cache be able to write a
large volume of data? Of course there will be a delay, but it is not hard to
implement.
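
For what it's worth, the write-behind idea really is only a few lines in any
language; a toy Python sketch (illustrative, nothing to do with Netty) with an
in-memory queue and a background flusher:

```python
import queue
import threading

class WriteBehindCache:
    """Toy write-behind: writes land in an in-memory queue immediately and
    a background thread flushes them to the backing store asynchronously."""

    def __init__(self, store):
        self.store = store            # anything with an append(item) method
        self.q = queue.Queue()
        self.worker = threading.Thread(target=self._flush, daemon=True)
        self.worker.start()

    def write(self, item):
        self.q.put(item)              # returns immediately; flush is deferred

    def _flush(self):
        while True:
            item = self.q.get()
            if item is None:          # sentinel: drain finished, shut down
                break
            self.store.append(item)   # the delayed, "behind" write

    def close(self):
        self.q.put(None)
        self.worker.join()

store = []
cache = WriteBehindCache(store)
for i in range(1000):
    cache.write(i)
cache.close()
assert store == list(range(1000))
```

The delay mentioned above is exactly the gap between `write` returning and
the worker draining the queue; batching the flushes is the obvious next step.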

------
eva1984
Just curious, how does Kafka handle data retention? Can it be easily
configured, or do you need to build something from scratch?

~~~
erichmond
Answer: Very well.

Kafka supports replication and fault-tolerance, runs on cheap, commodity
hardware, and is glad to store many TBs of data per machine. So, retaining
large amounts of data is a perfectly natural and economical thing to do and
won’t hurt performance. LinkedIn keeps more than a petabyte of Kafka storage
online, and a number of applications make good use of this long retention
pattern for exactly this purpose.

From [http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html](http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html)
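
Concretely, retention is just broker (or per-topic) configuration; nothing
needs to be built from scratch. A sketch of the usual broker-side knobs, with
illustrative values:

```properties
# Time-based retention: delete log segments older than 7 days (the default).
log.retention.hours=168

# Size-based retention per partition; -1 (the default) means unbounded.
log.retention.bytes=-1

# Roll a new segment file once the active one reaches this size.
log.segment.bytes=1073741824

# "delete" drops old segments; "compact" keeps the latest value per key.
log.cleanup.policy=delete
```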

------
hbz
I was hoping he'd post the http-to-kafka adapter but I'm guessing that's
ChartBeat IP.

~~~
tadasv
I cannot post that; it is IP and very domain-specific code. But it is not that
difficult to write an HTTP server with libevent and add librdkafka to it.

~~~
hbz
No worries, thanks for taking the time to write all this up!

------
suchitpuri
One thing which is not clear about Kafka or Kinesis: when you have multiple
consumers for the same topic, how will they get the data and in what order,
and what happens when consumers die? How do you handle consumers in your data
pipeline?

~~~
easytiger
On the contrary, ordering is something that is VERY clear in Kafka:

> By having a notion of parallelism—the partition—within the topics, Kafka is
> able to provide both ordering guarantees and load balancing over a pool of
> consumer processes. This is achieved by assigning the partitions in the
> topic to the consumers in the consumer group so that each partition is
> consumed by exactly one consumer in the group. By doing this we ensure that
> the consumer is the only reader of that partition and consumes the data in
> order. Since there are many partitions this still balances the load over
> many consumer instances. Note however that there cannot be more consumer
> instances than partitions.

[http://kafka.apache.org/documentation.html](http://kafka.apache.org/documentation.html)
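
A toy model of that rule (round-robin here stands in for Kafka's real
rebalancing protocol): each partition maps to exactly one consumer in the
group, so per-partition order is preserved while load spreads out:

```python
def assign(partitions, consumers):
    # Each partition goes to exactly one consumer in the group; a consumer
    # may own several partitions, so order within a partition is preserved.
    if not consumers:
        raise ValueError("consumer group is empty")
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 4 partitions over 2 consumers: both get work, each partition has a
# single reader.
print(assign(["p0", "p1", "p2", "p3"], ["c1", "c2"]))
# More consumers than partitions: the extra instance sits idle, which is
# the "cannot be more consumer instances than partitions" note above.
print(assign(["p0", "p1"], ["c1", "c2", "c3"]))
```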

~~~
ewencp
In particular,
[http://kafka.apache.org/documentation.html#intro_consumers](http://kafka.apache.org/documentation.html#intro_consumers)
addresses the concept of consumer groups and what ordering is guaranteed. One
thing that might be worth noting for the grandparent is that Kafka consumers
have an offset commit API that gives some control over how failures are
handled. If a consumer dies before committing an offset but after reading data
from the broker, a fresh consumer that joins the consumer group can see the
same data once the system determines the original has died; that ensures all
data will be processed, even in the event of consumer failures.
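
That crash-and-rejoin behavior can be modeled in a few lines. The names below
are illustrative, not the Kafka client API; the point is that committing the
offset only *after* processing gives at-least-once delivery:

```python
class Partition:
    def __init__(self, records):
        self.records = records
        self.committed = 0            # last committed offset for the group

def consume(part, crash_before_commit_at=None):
    """Read from the last committed offset, committing after each record.
    Optionally 'crash' after reading a record but before committing it."""
    seen = []
    offset = part.committed
    while offset < len(part.records):
        seen.append(part.records[offset])
        offset += 1
        if offset == crash_before_commit_at:
            return seen               # died before this commit landed
        part.committed = offset       # commit only after processing
    return seen

part = Partition(["a", "b", "c", "d"])
first = consume(part, crash_before_commit_at=3)   # dies after reading "c"
second = consume(part)                            # fresh group member resumes
assert first == ["a", "b", "c"]
assert second == ["c", "d"]   # "c" delivered twice: at-least-once
```

Committing *before* processing instead would flip this to at-most-once, where
the record read just before the crash is lost rather than duplicated.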

Kinesis provides the same ordering guarantees. They use different terminology
(Kafka topics == Kinesis streams; Kafka partitions == Kinesis shards) but have
the same system interface. The details of the APIs used for consumption
differ, but they provide the same basic functionality of Kafka's "consumer
groups".

