
Apache Kafka – Publish-subscribe messaging rethought as a distributed commit log - fintler
https://kafka.apache.org/
======
helper
@Aphyr's "Call me maybe: Kafka"[1] blog post is a great reference for what
Kafka is and what distributed guarantees it tries to provide.

[1]: [http://aphyr.com/posts/293-call-me-maybe-kafka](http://aphyr.com/posts/293-call-me-maybe-kafka)

~~~
rdtsc
Good point.

TL;DR: "Kafka’s replication claimed to be CA, but in the presence of a
partition, threw away an arbitrarily large volume of committed writes. It
claimed tolerance to F-1 failures, but a single node could cause catastrophe."
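
A rough way to picture the failure mode in that quote is the toy model below. It is only a sketch under simplifying assumptions, not Kafka's actual replication protocol: the `ToyCluster` class and its methods are hypothetical names invented for illustration. It assumes a write is acknowledged once every replica in the in-sync set has it, and that the in-sync set can shrink to the leader alone.

```python
# Toy model only: class and method names are hypothetical, not Kafka's API.
class Replica:
    def __init__(self, name):
        self.name = name
        self.log = []


class ToyCluster:
    def __init__(self, names):
        self.replicas = {n: Replica(n) for n in names}
        self.leader = names[0]
        self.isr = set(names)   # in-sync replica set
        self.acked = []         # writes the client was told are committed

    def partition_followers(self):
        # Followers become unreachable, so the leader shrinks the ISR to itself.
        self.isr = {self.leader}

    def write(self, value):
        # A write is acknowledged once every replica in the ISR has it.
        for name in self.isr:
            self.replicas[name].log.append(value)
        self.acked.append(value)

    def fail_leader_and_elect(self, new_leader):
        # The old leader is lost; a stale follower takes over with its own log.
        del self.replicas[self.leader]
        self.leader = new_leader
        self.isr = {new_leader}

    def lost_writes(self):
        # Acknowledged writes that the current leader's log does not contain.
        current = self.replicas[self.leader].log
        return [v for v in self.acked if v not in current]


cluster = ToyCluster(["n1", "n2", "n3"])
cluster.write("a")               # replicated to all three nodes
cluster.partition_followers()    # ISR shrinks to {"n1"}
cluster.write("b")
cluster.write("c")               # acknowledged, but only n1 has them
cluster.fail_leader_and_elect("n2")
print(cluster.lost_writes())     # ['b', 'c']
```

In this model the number of lost writes is bounded only by how long the leader keeps accepting writes alone, which is the "arbitrarily large volume" part of the quote.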

~~~
jessaustin
I think I see what he's trying to say, but there's a letter that isn't
included in the adjective "CA", and that letter is "P". How could it be
otherwise? Indeed, from TFA: _"Kafka can do this because LinkedIn’s brokers
run in a datacenter, where partitions are rare."_

~~~
Nacraile
You're correct, the system does not advertise partition tolerance, and @Aphyr
does indeed confirm that the system is not partition tolerant.

Broadly, I think sacrificing P is pretty deeply misguided. If you run a
distributed system at any scale for any period of time, you're going to
experience network partitions, even within a single DC. (I speak from
experience: I work on a distributed system which runs a whole bunch of
machines in a whole bunch of datacentres worldwide. We see a reasonable number
of non-trivial within-DC network partitions every year.)

A much more detailed argument of the above:
[http://codahale.com/you-cant-sacrifice-partition-tolerance/](http://codahale.com/you-cant-sacrifice-partition-tolerance/)

You really do need to think about what happens to your system when it gets
partitioned. This paper has much of interest to say on the topic:
[http://cs-www.cs.yale.edu/homes/dna/papers/abadi-pacelc.pdf](http://cs-www.cs.yale.edu/homes/dna/papers/abadi-pacelc.pdf)

~~~
derefr
Isn't "a partition-intolerant system" really just another way to say "a non-
distributed system"?

~~~
aaronblohowiak
No, it is just a system that can lose data in the face of a partition.

------
ot
Kafka's author wrote a very well motivated and interesting introduction to
Kafka-based architectures a few months ago on LinkedIn's blog:

[http://engineering.linkedin.com/distributed-systems/log-what...](http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying)

Previous HN discussion:
[https://news.ycombinator.com/item?id=6916557](https://news.ycombinator.com/item?id=6916557)

------
burke
We're using Kafka at Shopify and we're pretty satisfied with it. We wrote a
client library for Go:
[https://github.com/Shopify/sarama](https://github.com/Shopify/sarama)

------
nofinator
I like the hat tip to Franz Kafka, although I would be wary of submitting An
Imperial Message [1] to the queue. It may never make its way out.

[1] [http://www.kafka-online.info/an-imperial-message.html](http://www.kafka-online.info/an-imperial-message.html)

------
crawdog
If you are interested in a sample project I have one available here for
tailing a file into a queue:
[https://github.com/rickcrawford/kafka_file_tailer](https://github.com/rickcrawford/kafka_file_tailer)
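
For anyone who just wants the general shape of "tail a file into a queue", here is a minimal sketch. It is not taken from the linked project: `tail_into_queue` is a hypothetical helper, and a standard-library `queue.Queue` stands in for a Kafka producer. The idea is to remember a byte offset, read only complete lines past it, and enqueue each one.

```python
import os
import queue
import tempfile


def tail_into_queue(path, out_queue, offset=0):
    """Enqueue complete lines found after `offset`; return the new offset."""
    with open(path, "rb") as f:
        f.seek(offset)
        while True:
            line = f.readline()
            # Stop at end of file or at a partial line that is still being written.
            if not line.endswith(b"\n"):
                break
            out_queue.put(line.rstrip(b"\n").decode("utf-8"))
            offset = f.tell()
    return offset


# Demo: write two full lines and one partial line, then finish the partial line.
q = queue.Queue()
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"first\nsecond\npart")
    path = tmp.name

pos = tail_into_queue(path, q)        # picks up "first" and "second" only
with open(path, "ab") as f:
    f.write(b"ial\n")
pos = tail_into_queue(path, q, pos)   # picks up "partial"

lines = [q.get_nowait() for _ in range(q.qsize())]
os.remove(path)
print(lines)                          # ['first', 'second', 'partial']
```

A real tailer would call this in a loop (or use file-change notifications), persist the offset so it can resume after a restart, and deal with log rotation.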

~~~
res0nat0r
Can Kafka be thought of as a non-hosted version of Kinesis? I thought Kinesis
would be a good solution to dump logfile data to for processing and ingestion.
Could you explain some technical reasons to use Kafka vs. Kinesis? Thanks.

~~~
ihsw
One main advantage is that Kinesis is elastic -- it scales automatically based
on load. With Kinesis available, you don't have to manage a Kafka cluster
yourself, which saves quite a bit of headache.

~~~
nullspace
Ehh - this is just my two cents from working with Kafka. When they say it's
high performance, they really, really mean it. I have gotten very high
throughputs on just 2 medium machines.

If you process that much data, Kafka is one of the last things which you'll
need to scale out.

------
MagicWishMonkey
Sorry for the dumb question, but can anyone explain how this compares to
something like RabbitMQ?

------
geoffroy
How does it compare to beanstalkd?

------
shmerl
Using it with C++ is not really on par with using it with Java. Implementing
the system itself in Java also looks very questionable, since supposedly
performance should be very critical there.

~~~
t0mas88
Depending on the task, Java is actually faster than C++ at times. Garbage
Collection, when done right, can for example be faster than manual memory
management. And a JIT compiler can do optimizations that a normal C++ compiler
could only do with help from the developer.

There is a reason many of these kinds of systems are written in JVM-based
languages. Examples: Hadoop and all of its siblings, Cassandra, Storm, Kafka.
So either all of those people in successful projects make "questionable
decisions" or your knowledge of Java/JVM performance has been outdated by the
new developments of the past few years...

~~~
hatred
I don't particularly agree with the reasons you mention. Most of the
existing systems are JVM-based due to the excellent tooling and supporting
libraries around Java and its ease of use. This enables you to focus more on
building the system rather than on micro-optimizations around C++.

