
Kafka is not more mature, just more hyped. I just wish Aphyr's Jepsen tests would also cover more scenarios, like:

- what happens to your data if X+1 servers permanently fail in a cluster with a replication factor of X
- what happens if a single partition's data size or request rate becomes 90% of the cluster capacity
- what happens in a multi-tenancy scenario to other users' throughput and latency when one user tries to use all the capacity of the cluster
- ...



It's way more mature. I just spent a week evaluating Pulsar vs Kafka for a client, and the fact that Kafka has been open source for 10 years vs. Pulsar's 1.5 really shows in documentation, community support, etc.

> what happens to your data if X+1 servers permanently fail in a cluster with a replication factor of X

It depends on how many in-sync replica sets existed entirely within those X+1 servers. Their partitions will go offline, other ISRs will have under-replicated partitions, and the alerting you've set up as a good engineer will have told you this was happening.
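
For reference, that check is roughly this with the Java AdminClient (an untested sketch; the broker address is a placeholder). It walks topic metadata and flags partitions with no leader or with an ISR smaller than the replica set, which is essentially what an under-replicated-partitions alert watches:

    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.TopicDescription;
    import org.apache.kafka.common.TopicPartitionInfo;

    import java.util.Map;
    import java.util.Properties;
    import java.util.Set;

    public class UnderReplicatedCheck {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Placeholder bootstrap server
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
            try (Admin admin = Admin.create(props)) {
                Set<String> topics = admin.listTopics().names().get();
                Map<String, TopicDescription> descs = admin.describeTopics(topics).all().get();
                for (TopicDescription desc : descs.values()) {
                    for (TopicPartitionInfo p : desc.partitions()) {
                        if (p.leader() == null || p.leader().isEmpty()) {
                            // No live leader: the partition is offline
                            System.out.printf("OFFLINE  %s-%d%n", desc.name(), p.partition());
                        } else if (p.isr().size() < p.replicas().size()) {
                            // ISR has shrunk below the assigned replica count
                            System.out.printf("UNDER-REPLICATED  %s-%d isr=%d/%d%n",
                                    desc.name(), p.partition(), p.isr().size(), p.replicas().size());
                        }
                    }
                }
            }
        }
    }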

> what happens in a multi-tenancy scenario to other users' throughput and latency when one user tries to use all the capacity of the cluster

Nothing, because you're using ACLs and have configured quotas appropriately (sketched below).

Bad things otherwise.
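
Something like this with the Admin API (Kafka 2.6+, KIP-546); the tenant name and byte rates are made-up examples, and the same thing can be done with the kafka-configs tool. Quotas are keyed on the authenticated principal, which is part of why ACLs/authentication matter in multi-tenant clusters:

    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.common.quota.ClientQuotaAlteration;
    import org.apache.kafka.common.quota.ClientQuotaEntity;

    import java.util.Collections;
    import java.util.List;
    import java.util.Properties;

    public class TenantQuota {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Placeholder bootstrap server
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
            try (Admin admin = Admin.create(props)) {
                // Quota applies to the authenticated user "noisy-tenant" (made-up name);
                // without authentication everyone shows up as ANONYMOUS.
                ClientQuotaEntity tenant = new ClientQuotaEntity(
                        Collections.singletonMap(ClientQuotaEntity.USER, "noisy-tenant"));
                ClientQuotaAlteration alteration = new ClientQuotaAlteration(tenant, List.of(
                        new ClientQuotaAlteration.Op("producer_byte_rate", 10.0 * 1024 * 1024), // ~10 MB/s in
                        new ClientQuotaAlteration.Op("consumer_byte_rate", 20.0 * 1024 * 1024), // ~20 MB/s out
                        new ClientQuotaAlteration.Op("request_percentage", 50.0)));             // broker thread share
                admin.alterClientQuotas(Collections.singleton(alteration)).all().get();
            }
        }
    }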

PS, also been running Kafka since 0.8.


If the replication factor is 3 and 3 servers go down in the span of 1 or 2 hours, no alert will save you.


Yes, but this is true of any system offering N - 1 safety, e.g., HDFS, Vertica, Pulsar. It's not specific to Kafka.

And you can switch to your warm replicated cluster in this scenario, if you have one; MirrorMaker 2 supports replicated offsets, so consumers can switch without losing state.
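
Roughly what that failover looks like in code, assuming MM2 checkpointing is enabled (a sketch, not a production failover procedure; the cluster alias, group id and addresses are made up). MM2's checkpoints let you translate the group's committed offsets from the primary into offsets on the backup cluster and commit them there:

    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.connect.mirror.RemoteClusterUtils;

    import java.time.Duration;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Properties;

    public class Failover {
        public static void main(String[] args) throws Exception {
            // Connection properties for the backup cluster (where the checkpoints were replicated to)
            Map<String, Object> backup = new HashMap<>();
            backup.put("bootstrap.servers", "backup-broker:9092"); // placeholder address

            // "primary" is the source cluster alias from the MM2 config (assumed name)
            Map<TopicPartition, OffsetAndMetadata> translated =
                    RemoteClusterUtils.translateOffsets(backup, "primary", "my-group", Duration.ofSeconds(30));

            Properties props = new Properties();
            props.put("bootstrap.servers", "backup-broker:9092");
            props.put("group.id", "my-group");
            props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
            try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
                // Commit the translated offsets so the group resumes on the backup cluster from here
                consumer.commitSync(translated);
            }
        }
    }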

But what you're describing is going to shaft any replicated system.


Not true for HDFS, Cassandra, Pulsar, and most distributed file systems.

As soon as a segment is under-replicated, its replication factor is restored in under 2 minutes by selecting a new machine as a replica.

Kafka tries to do this with Kafka Cruise Control, but adding a replica to the in-sync replica list takes several hours if partitions are 300GB and the servers are already busy handling regular live traffic.


> adding a replica to the in-sync replica list takes several hours if partitions are 300GB

I'd be curious to hear more about this, because I run several topics with similar partition sizes and haven't seen a single replica take several hours; I've routinely shifted 350GB partition replicas as part of routine maintenance.

I have encountered 2 hours to restore a broker that was shut down improperly, but yeah, assuming your replica fetchers aren't throttled to shit and your brokers aren't overloaded (what's the request handler average idle? 20% or lower means it's time to add another broker; 10% means add one right now), that's really extreme.
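
For what it's worth, "shifting a replica" here means a partition reassignment; a minimal sketch with the Admin API (Kafka 2.4+, KIP-455), with topic name, partition and broker ids made up. How fast the new replica catches up and joins the ISR is bounded by the replication throttle (leader/follower.replication.throttled.rate) and broker headroom, which is the point above:

    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewPartitionReassignment;
    import org.apache.kafka.common.TopicPartition;

    import java.util.Arrays;
    import java.util.Collections;
    import java.util.Optional;
    import java.util.Properties;

    public class MovePartition {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Placeholder bootstrap server
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
            try (Admin admin = Admin.create(props)) {
                // Move partition 7 of topic "events" onto brokers 4, 5, 6 (all made-up ids)
                TopicPartition tp = new TopicPartition("events", 7);
                NewPartitionReassignment target =
                        new NewPartitionReassignment(Arrays.asList(4, 5, 6));
                admin.alterPartitionReassignments(
                        Collections.singletonMap(tp, Optional.of(target))).all().get();

                // The reassignment is done once it no longer shows up here,
                // i.e. the new replicas have caught up and joined the ISR.
                System.out.println(admin.listPartitionReassignments().reassignments().get());
            }
        }
    }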



