
Kafka in a Nutshell - soofaloofa
http://sookocheff.com/post/kafka/kafka-in-a-nutshell/
======
vectorpush
Are there any alternatives to kafka that are also modeled after the "message
queue as a log" concept? In particular, I'd like to be able to reconsume
arbitrary ranges of 'already processed' log/event/message data as well as
trust the log as the ultimate deterministic source of truth for the state of
the system.

The reason I don't just use kafka is because the "quorum" style scaling is
overkill for our needs since the app will be for internal use only and will
likely never exceed more than 1000 simultaneous users. Also, I've heard that
zookeeper (required to use kafka) has its own technical overhead that I'd like
to avoid dealing with if possible.
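The "message queue as a log" idea being asked about is simple enough to sketch in plain Python. This is an illustrative toy, not any real broker's API: an append-only list where every message keeps a permanent offset, so any already-consumed range can be re-read deterministically.

```python
class Log:
    """Toy append-only log: every message keeps a permanent offset."""

    def __init__(self):
        self._entries = []

    def append(self, message):
        """Append a message and return its offset (its permanent index)."""
        self._entries.append(message)
        return len(self._entries) - 1

    def read(self, start, end=None):
        """Re-read any range of already-written messages, any number of times."""
        return self._entries[start:end]


log = Log()
for event in ["user:create", "user:rename", "user:delete"]:
    log.append(event)

# Reconsume an arbitrary range of 'already processed' messages:
assert log.read(1, 3) == ["user:rename", "user:delete"]
# Replaying from offset 0 deterministically rebuilds state, which is
# what makes the log usable as the source of truth for the system:
assert log.read(0) == ["user:create", "user:rename", "user:delete"]
```

Because reads never mutate the log, replay is idempotent; any system built this way (Kafka, Kinesis, or a homegrown one) gets determinism from that property.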

~~~
jganetsk
Google Cloud Pub/Sub perhaps
[https://cloud.google.com/pubsub/](https://cloud.google.com/pubsub/)

------
uptown
Conceptually, how do these types of pub/sub messaging systems work at scale?
How does the number of subscribers impact the efficiency of updates being
distributed to the subscribers? Is the server pushing these messages to them
all simultaneously, or is there some logic that might result in one subscriber
receiving an update faster than another? Is the publishing server opening up a
ton of ports to handle the communication, or from a networking/ports
perspective how is this handled?

~~~
soofaloofa
With Kafka specifically the consumers of messages use standard TCP to receive
messages by pulling them from Kafka. Kafka is not a push system. Consumers can
come and go in an ad-hoc fashion.
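The pull model described above can be sketched in a few lines (hypothetical names, not the real Kafka wire protocol): the broker just holds the log, and each consumer tracks its own offset, which is why consumers can attach and detach without the broker keeping per-consumer delivery state.

```python
# Toy sketch of pull-based consumption. The "broker" is just a list;
# the consumer owns its offset and fetches batches at its own pace.
broker_log = ["m0", "m1", "m2", "m3"]

def poll(log, offset, max_records=2):
    """Return up to max_records messages starting at offset, plus the new offset."""
    batch = log[offset:offset + max_records]
    return batch, offset + len(batch)

offset = 0
received = []
while True:
    batch, offset = poll(broker_log, offset)
    if not batch:
        break
    received.extend(batch)

assert received == broker_log
# A consumer that disconnects and reconnects later simply resumes
# polling from its last saved offset; the broker is none the wiser.
```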

~~~
uptown
So I guess to address user volume, you just scale horizontally, increasing the
number of server instances in your cluster?

~~~
Joeri
Linkedin blogged about how kafka is their data backbone.

[https://engineering.linkedin.com/kafka/running-kafka-scale](https://engineering.linkedin.com/kafka/running-kafka-scale)

When combined, the Kafka ecosystem at LinkedIn is sent over 800 billion
messages per day which amounts to over 175 terabytes of data. Over 650
terabytes of messages are then consumed daily, which is why the ability of
Kafka to handle multiple producers and multiple consumers for each topic is
important. At the busiest times of day, we are receiving over 13 million
messages per second, or 2.75 gigabytes of data per second. To handle all these
messages, LinkedIn runs over 1100 Kafka brokers organized into more than 60
clusters.

------
wcdolphin
Great post, but one thing to note is that the number of partitions DOES NOT
need to equal the number of consumers. A consumer can reasonably consume
multiple partitions, as most do. On the other hand, the partition count does
set an upper bound on the number of consumers in one group, since the smallest
unit a consumer can consume is a single partition.
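The partition-to-consumer relationship above is easy to demonstrate. This is an illustrative round-robin assignment, not Kafka's actual assignor implementation:

```python
def assign(partitions, consumers):
    """Round-robin partition assignment: each partition goes to exactly
    one consumer in the group; a consumer may own several partitions."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 6 partitions, 2 consumers: each consumer reads 3 partitions.
print(assign(range(6), ["c0", "c1"]))
# {'c0': [0, 2, 4], 'c1': [1, 3, 5]}

# 2 partitions, 3 consumers: the third consumer sits idle, which is
# why the partition count is an upper bound on useful group size.
print(assign(range(2), ["c0", "c1", "c2"]))
# {'c0': [0], 'c1': [1], 'c2': []}
```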

~~~
justifier
so do you drop the idea of partitions completely because you have shown its
structure imposes a soft, reasonable upper bound?

implying overworking

what do you replace it with?

what problem was 'use the same number of partitions as consumers' trying to
fix?

~~~
justifier
or if you keep it, how do you measure the optimal partition count?

------
hakann
Is AWS SNS/SQS equivalent to Kafka? And what are the differences,
advantages/disadvantages of either one?

~~~
Cieplak
Kinesis is the AWS alternative to Kafka, minus some features.

~~~
power
Kinesis is for streaming computations. It's closer to Apache Storm (which
typically uses Kafka) than to Kafka itself.

~~~
fizx
You might be thinking of Simple Workflow Service, which is closer to Storm.
Grandparent had it right.

