Side note: https://pulsar.apache.org also seems to be gaining traction and has a much better story around performance, pub/sub, multi-tenancy, and cross-dc replication. Will be interesting to see the evolution of both going forward.
The biggest difference is that Pulsar splits compute from storage. Pulsar brokers are stateless services that manage topics and handle clients (publishers and consumers) while reading/writing data from a lower tier of "bookies" -- nodes from the Apache BookKeeper project, built for low-latency real-time storage workloads.
This gives better performance and easier scaling compared to Kafka's partition rebalancing. Pulsar also natively supports data center replication and multiple clusters working together globally. Topic addressing is property/cluster/namespace/topic, so multi-tenant isolation is built in.
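For a sense of the addressing model, here's a minimal sketch with the pulsar-client Python library; the service URL and all names are placeholders, and note that newer Pulsar releases drop the cluster segment (tenant/namespace/topic):

```python
import pulsar  # pip install pulsar-client

# Service URL and all names below are placeholders.
client = pulsar.Client('pulsar://localhost:6650')

# Fully-qualified topic: persistent://<property>/<cluster>/<namespace>/<topic>
producer = client.create_producer(
    'persistent://my-property/us-west/my-namespace/clickstream')
producer.send(b'hello')

client.close()
```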
Anywhere you use Kafka, you can use Pulsar, and you can also consolidate pub/sub-only systems that need lower latency than Kafka offers.
- Rebalancing: BookKeeper (Pulsar's storage backend) manages rebalancing far better than Kafka.
- Read scaling: BookKeeper uses consensus to store log records, which means you can read from any replica you want, whereas in Kafka a reader has to connect to the partition leader.
The specific concern here is that Kafka's ISR strategy can result in a corrupt leader partition and truncated messages when recovering from a broker machine failure. The unclean.leader.election.enable broker setting is relevant here.
Also "better" is subjective depending on your configuration, requirements, and storage backends.
It's also been around for years but was only recently open-sourced, so the community is smaller than Kafka's; it's growing quickly, though, along with the usual ecosystem of drivers, extensions, and services. Unless you're already running Kafka, I would look at Pulsar first for new projects.
A great overview of Pulsar history with plenty of links is here: https://streaml.io/blog/messaging-storage-or-both/
And Pulsar does not provide exactly-once either, and that is something very important.
“Kafka has exactly-once delivery but messaging system x/y/z doesn’t provide it” is also confusing and misleading. Exactly-once is technically effectively-once: “at-least-once” delivery plus making the processing of the messages idempotent or “de-duplicated”. This has been done in the industry for decades in many mature stream processing engines like Heron and Flink; it isn’t really a new thing. And many messaging systems like Pulsar already provide those primitives (e.g., idempotent producing, at-least-once delivery) so processing jobs can achieve effectively-once very easily. The Streamlio folks did a great job of explaining exactly-once and effectively-once. This blog post is worth checking out -- https://streaml.io/blog/exactly-once/
I think Pulsar itself, as a distributed messaging system, does provide all three delivery semantics: at-most-once, at-least-once, and effectively-once. It is very easy for people to use and integrate. I don’t think it is technically difficult to make Kafka Streams run with Pulsar. The question is more: is there value in doing that, do the Kafka folks want to do that, and can the collaboration happen in the ASF?
that's just my two cents.
Pulsar doesn't have any client overhead since it's all tracked on the broker, so there's no real need for a separate library. Read, process, and push messages using any code you want, one at a time or in batches.
There's no realistic "exactly once" either; it's idempotency or some local cache of processed events used to dedup. You can read more here: https://streaml.io/blog/exactly-once/
- Pulsar provides a failover subscription mode, which seems to be the equivalent of consumer-group partition rebalancing in Kafka.
- It has partitioned topics as well.
- It supports idempotent producing and has effectively-once delivery semantics.
It seems to have all the primitives Kafka Streams needs.
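For example, a failover subscription with the pulsar-client Python library looks roughly like this (all names are placeholders):

```python
import pulsar  # pip install pulsar-client

client = pulsar.Client('pulsar://localhost:6650')

# Failover mode: one active consumer per partition, with the others standing
# by to take over -- roughly what a Kafka consumer group rebalance gives you.
consumer = client.subscribe(
    'persistent://my-property/us-west/my-namespace/events',
    subscription_name='my-subscription',
    consumer_type=pulsar.ConsumerType.Failover,
)

while True:
    msg = consumer.receive()
    # ... process msg.data() ...
    consumer.acknowledge(msg)
```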
Is this architecture common? Well, I suspect it is overkill for most smaller organizations due to increased infrastructure complexity. I wouldn't do this just for the sake of doing it -- you may find yourself saddled with an increased maintenance workload just keeping the infrastructure running.
But if you truly have this use case, this is a well-known method for syncing data across various types of datastores (i.e., so-called polyglot persistence).
If we need to replay something, we just move the group of files in and out of the folder on S3 (bucket/logs -> bucket/temp -> bucket/logs), which triggers a new objectCreated event and fires the Lambda.
We execute the two move operations with the AWS CLI tool.
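For illustration, the same two-step move can be scripted with boto3 (bucket and prefixes here are placeholders; S3 has no rename, so a "move" is copy + delete per object):

```python
import boto3

s3 = boto3.client('s3')
BUCKET = 'my-bucket'  # placeholder

def move_prefix(src_prefix, dst_prefix):
    """S3 has no rename, so a 'move' is copy + delete per object."""
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=BUCKET, Prefix=src_prefix):
        for obj in page.get('Contents', []):
            src_key = obj['Key']
            dst_key = dst_prefix + src_key[len(src_prefix):]
            s3.copy_object(Bucket=BUCKET, Key=dst_key,
                           CopySource={'Bucket': BUCKET, 'Key': src_key})
            s3.delete_object(Bucket=BUCKET, Key=src_key)

# Replay: logs -> temp, then temp -> logs, re-firing objectCreated on the way back.
move_prefix('logs/', 'temp/')
move_prefix('temp/', 'logs/')
```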
The one other thing we do is keep a config file at bucket/config.json that says event type X should go to Redshift, PG, ES, wherever. So we can tweak that before a move to send data to additional data stores.
We follow ELT instead of ETL for streaming into data stores.
That’s just one system. Our Nginx access logs alone are about 250GB per day (they use that system).
I'd imagine that 99% or more of that 400GB/day gets thrown away before ending up in Redshift.
We usually only replay when we want to restore data for querying stuff outside the pruning range.
We keep all that data forever.
We use replay capability to fix bugs, add new features to existing data, and to load qa environments.
If you run Kafka yourself, you can set infinite retention and keep all the messages. You can also use the compaction feature to keep the latest value for every key, which reduces some of the bloat if that applies to your data. But if you don't need the data instantly accessible, it's probably worth archiving to cheaper storage.
- If your Kafka retention period is long enough, then the data will still be sitting in Kafka. You don't need to replay; just have the new consumer start from the beginning (see the sketch after this list).
- If the retention period is not long enough, then you'll have to replay data. There's no provision for that in Kafka itself -- it's really just a somewhat-smart logfile. But whatever you're using to write into Kafka may have some way to do it. This is similar to most message queuing or eventing systems, e.g., RabbitMQ.
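A minimal kafka-python sketch of the first case -- a fresh consumer group reading from the earliest retained offset (broker, topic, and group names are placeholders):

```python
from kafka import KafkaConsumer  # pip install kafka-python

# A consumer group with no committed offsets plus auto_offset_reset='earliest'
# starts from the oldest retained message -- a replay, if retention allows.
consumer = KafkaConsumer(
    'my-topic',                          # placeholder topic
    bootstrap_servers='localhost:9092',
    group_id='reprocess-run-2',          # fresh group id => no committed offsets
    auto_offset_reset='earliest',
)

for record in consumer:
    print(record.offset, record.value)
```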
Kinesis has a real auth story too, plus you can trigger Lambda functions off streams.
It's multi-tenant, but interaction is nearly identical to interacting with a dedicated kafka cluster -- i.e. you can use any regular kafka client library.
Check out the docs and the launch blog post. Happy to answer any questions here or through email (contact info in profile).
I read this quite often, but we run a relatively small kafka cluster on GCP and it's pretty much hassle-free. We also run some ad-hoc clusters in kubernetes from time to time which also works well.
What exactly have you found complex about running Kafka?
I run a small 2-node Kafka cluster that processes up to 10 million messages/hr -- not large at all -- and it's been very stable for almost a year now. What was complex, however, was:
* Setup. We wanted it managed by Mesos/Marathon, and having to figure out BROKER_IDs took a couple hours of trial and error.
* Operations. Adding queues and checking on consumers isn't amazing to do from the command line.
* Monitoring. It took a while before I settled on a decent monitoring solution that could give insight into kafka's own consumer paradigm. Even still there are more metrics I would like to have about our cluster that I don't care to put the time in to retrieve.
In other words, I could probably get everything up and running (especially with the various Kafka-in-Docker projects I found), but what happens if (when) something goes wrong?
Would you be willing to write a bit (or point to a post with) more about this? What do you find useful?
Secondly, due to the difference in speed between the consumer and producer, we typically have an offset lag of around 10MM, and it's important for us to monitor this lag because if it gets too high, it means we are falling behind (our consumers scale up and down through the day to mitigate this).
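A rough sketch of that lag calculation with kafka-python (broker, topic, and group names are placeholders; it assumes the group has committed offsets to measure against):

```python
from kafka import KafkaConsumer, TopicPartition  # pip install kafka-python

# lag = log end offset minus the group's current position, summed over
# all partitions of the topic.
consumer = KafkaConsumer(bootstrap_servers='localhost:9092',
                         group_id='my-group', enable_auto_commit=False)

partitions = [TopicPartition('my-topic', p)
              for p in consumer.partitions_for_topic('my-topic')]
consumer.assign(partitions)

end_offsets = consumer.end_offsets(partitions)
total_lag = sum(end_offsets[tp] - consumer.position(tp) for tp in partitions)
print('total lag:', total_lag)
```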
Next, we use Go, which is not an official language supported by the project but has a library written by Shopify called Sarama. Sarama's consumer support had been in beta for a while, and in the past it caused some issues where not every partition of a topic was being consumed.
Lastly, at the time we thought creating new topics would be a semi-regular event and that we might have dozens of them (this didn't pan out), but we also thought a simple overview of the health of all our topics and consumers would be good to have.
We found Yahoo's Kafka Manager, which has ended up being really useful for us in standing up and managing the cluster without resorting to the command line. It's been great, but it wasn't exactly super obvious for me to find at the time.
Currently the only metrics I don't have are plottable things like processed/incoming msgs/sec (per topic), lag over time, and disk usage. I'm sure these are now easily ingested into Grafana; I just haven't had the time to do it.
All of this information is great to have, but it requires some setup, tuning, and elbow grease that is probably batteries-included in a managed service. At the same time, this is something you get almost out of the box with RabbitMQ's management plugin.
That does not match my experience at all. Of all the distributed message queues I've tried, Kafka has been - by far - the easiest to operate.
It works well out-of-the box and even setting it up with ZooKeeper is relatively simple.
If that isn't you and you are still intent on using it, I'd second the opinion to go with "let someone else (e.g. 'cloud provider') administer it". It's probably not really worth the effort to get configuration settings and installation details right unless you genuinely have a serious interest in it.
We had a mix of realtime (continuous) consumers and batch consumers that consumed directly from Kafka, but also archived that data to S3 for future batch processing.
Operations were quite simple for us, but we were also used to running Zookeeper before, so YMMV. If you can get away with Kinesis, it's simpler operationally, but Kafka isn't difficult compared to most systems.
If it was a clean failure, no big deal. If the disk started having issues and taking a long time for IO to come back, all partitions with replicas or leaders on that node were slow. That's kind of the state of the world with most distributed things though: a bad node is worse than a dead node.
"no leader for partition" is something I see frequently.
In Kafka's defence a better managed cluster might do better. In nsq's defence: nobody has to manage it, it just runs.
Kafka is a distributed, partitioned log. Messages published to a partition in a given order will be stored in that order, replicated in that order, and consumed in that same order. Kafka writes all of your data to disk and waits for replicas to acknowledge it (depending on your producer configuration). Kafka does considerably more work than nsq.
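That acknowledgement behavior is driven by the producer's acks setting; a minimal kafka-python sketch (broker and topic are placeholders):

```python
from kafka import KafkaProducer  # pip install kafka-python

# acks='all' makes the leader wait for the in-sync replicas to acknowledge
# each write; pair it with the broker-side min.insync.replicas setting for
# a real durability floor.
producer = KafkaProducer(bootstrap_servers='localhost:9092',
                         acks='all', retries=5)

future = producer.send('my-topic', b'payload')  # placeholder topic
metadata = future.get(timeout=10)               # blocks until acknowledged
producer.flush()
```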
I like nsq, it works well, but they aren't at all designed to solve the same problems.
I mentioned the issues above because they'd happened, but in years of running Kafka it was never Kafka's fault.
> "no leader for partition" is something I see frequently.
Something is wrong with your cluster if it's constantly switching leadership. Leadership should be stable -- and furthermore, partitions have preferred leaders, so they should stay with specific nodes until those nodes go down. Kafka has a process that shifts leadership back to preferred leaders automatically (auto.leader.rebalance.enable, unless it's turned off). Check your logs.
I can't remember if it's an issue for change capture since in theory you have some insert timestamp column anyway.
The best way to get the power, throughput, latency of Kafka without the operations is to use a hosted service. The one created and supported by the Kafka team is Confluent Cloud https://www.confluent.io/confluent-cloud/
Any time it seems like I have to establish some sort of relationship and get an enterprise tailored solution, I no longer feel like I'm part of a public and elastic market.
From what they said, multi-tenant is about a month away and self-service cloud early next year.
Definitely not for a startup whose monthly Kinesis bill is a few hundred dollars, which is a shame because Kafka is better in several ways.
Interesting to know about if I want to introduce Kafka in a bigger project, though.
I'd look at a hosted Kafka solution before going with Kinesis.
I'd still run it myself over Kinesis, even if I was a single-person startup.
Yes. Queuing semantics are usually about ordered delivery, acknowledgements, consumer groups, retries, priorities, etc. RabbitMQ does that, NATS only has queue groups at most.
Imagine how you would build a persistent logging app on top of NATS, and that's what NATS Streaming is.
- kafka for firehose data
- nats for realtime, routed data
The name doesn't say what it's used for, and it does generate a lot of confusion.
> Exactly-once Semantics are Possible: Here’s How Kafka Does it
> Now, I know what some of you are thinking. Exactly-once delivery is impossible, it comes at too high a price to put it to practical use, or that I’m getting all this entirely wrong! You’re not alone in thinking that.
blah blah blah ... of course it's at-least-once, with the "idempotent producer" so only one resulting message is published back to another kafka stream. Big surprise.
Now many people think "Kafka has exactly-once delivery, that's what I want, I don't want to have to deal with this at-least-once stuff" when really it's the same thing. Others have been doing idempotent operations of various kinds for years, and the user still has to figure out how to do their desired thing (which might not be sending one more Kafka message) in a mostly idempotent way.
> Is this Magical Pixie Dust I can sprinkle on my application?
> No, not quite. Exactly-once processing is an end-to-end guarantee and the application has to be designed to not violate the property as well. If you are using the consumer API, this means ensuring that you commit changes to your application state concordant with your offsets as described here.
I think that is a pretty clear statement that end-to-end exactly once semantics doesn't come for free. It states that there needs to be additional application logic to achieve this and also specifies what must be done.
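A sketch of that pattern, assuming Postgres via psycopg2 and kafka-python with manual partition assignment (the table and column names are hypothetical): keep the offset in the same database transaction as the state change, so a crash can never commit one without the other.

```python
import psycopg2
from kafka import KafkaConsumer, TopicPartition

conn = psycopg2.connect('dbname=app')  # hypothetical database
tp = TopicPartition('events', 0)       # placeholder topic/partition

consumer = KafkaConsumer(bootstrap_servers='localhost:9092',
                         enable_auto_commit=False)
consumer.assign([tp])

# Resume from the offset we stored ourselves, not from Kafka's own commits.
with conn, conn.cursor() as cur:
    cur.execute("SELECT next_offset FROM kafka_offsets WHERE topic=%s AND part=%s",
                (tp.topic, tp.partition))
    row = cur.fetchone()
    consumer.seek(tp, row[0] if row else 0)

for record in consumer:
    with conn, conn.cursor() as cur:  # one transaction per message
        cur.execute("UPDATE counters SET n = n + 1 WHERE id = %s", (record.key,))
        cur.execute("UPDATE kafka_offsets SET next_offset = %s "
                    "WHERE topic = %s AND part = %s",
                    (record.offset + 1, tp.topic, tp.partition))
```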
What you could do, though, is keep a "set" data structure in a replicated Redis or some other database, where each member of the set is the id of a message that incremented the count. Duplicates are naturally ignored when adding to the set. The count is the size of the set.
The message might also have a "time-generated" timestamp embedded in it. You could group by hour and, after some number of hours, assume that no more messages will be re-delivered from that long ago and consolidate that set down to a single number.
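A sketch of that idea with redis-py (the key names are made up):

```python
import redis  # assumes a replicated Redis is reachable

r = redis.Redis()

def count_once(msg_id, hour):
    """SADD ignores duplicates, so redelivered messages can't inflate the
    count. 'hour' is the message's time-generated bucket, e.g. '2017-10-05T14'."""
    r.sadd('seen:' + hour, msg_id)

def hourly_count(hour):
    return r.scard('seen:' + hour)

def consolidate(hour):
    """Once the redelivery window for this hour has safely passed, collapse
    the set down to a single integer to reclaim memory."""
    r.set('count:' + hour, r.scard('seen:' + hour))
    r.delete('seen:' + hour)
```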
Or maybe that's too much overhead because there are millions of messages per hour. Maybe in that case you can afford the occasional double-count and don't need all this. Trade-offs will be unique for each situation, and I don't think Kafka will completely take care of this kind of thing for you.
Relying on another distributed system to ensure the first distributed system isn't duplicating messages sounds like a headache I wouldn't wish on anyone. (I am a Confluent employee).
But you still need to understand how this works to do "Kafka transactions", and you still need some other scheme for effects/actions outside of Kafka. (And you'll probably get people doing dumb stuff, saying "I was told I get exactly-once delivery".)
"Exactly once" seems to refer to exactly-once writes to Kafka, and to processing in Kafka Streams that writes back to Kafka and doesn't do things like make external HTTP requests; for those you'll need your own at-most-once or at-least-once logic to deliver results to wherever you're sending the HTTP requests.
Confluent's blog posts do seem pretty misleading in terms of hiding these issues.
I just think they are opaque about what they really mean by exactly-once, and they definitely hide the fact that they do not offer exactly-once delivery.
Check the sequence number map to determine if the PID is already present:
- If the PID is not present, check the sequence number:
  - If it is 0, insert the PID, epoch, and sequence number into the PID mapping table and proceed to append the entries.
  - If it is not 0, return the InvalidSequenceNumber error code.
- If the PID is present in the mapping, check the epoch:
  - If the epoch is older than the current one, return the InvalidProducerEpoch error code.
  - If the epoch matches the current epoch, check the sequence number:
    - If it matches the current sequence number, allow the append.
    - If it is older than the current sequence number, return DuplicateSequenceNumber.
    - If it is newer, return InvalidSequenceNumber.
  - If the epoch is newer, check the sequence number:
    - If it is 0, update the mapping and allow the append.
    - If it is not 0, return InvalidSequenceNumber.
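A toy Python model of the check above, for illustration only; the data structures and return values are simplified, and "current sequence number" is taken to mean the next sequence the broker expects:

```python
# pid_map: PID -> (epoch, expected_seq), where expected_seq is the next
# sequence number the broker will accept ("current" in the text above).
pid_map = {}

def check_and_append(pid, epoch, seq):
    if pid not in pid_map:
        if seq == 0:
            pid_map[pid] = (epoch, 1)
            return 'APPEND'
        return 'InvalidSequenceNumber'

    cur_epoch, expected = pid_map[pid]
    if epoch < cur_epoch:
        return 'InvalidProducerEpoch'
    if epoch == cur_epoch:
        if seq == expected:
            pid_map[pid] = (epoch, expected + 1)
            return 'APPEND'
        if seq < expected:
            return 'DuplicateSequenceNumber'
        return 'InvalidSequenceNumber'   # a gap ahead of what we expect
    # Newer epoch: the producer was re-initialized, so sequences restart at 0.
    if seq == 0:
        pid_map[pid] = (epoch, 1)
        return 'APPEND'
    return 'InvalidSequenceNumber'
```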
Also, "low-volume" on modern servers can easily mean 10k+ transactions per second, so you have plenty of performance headroom.
Don't bother with Kafka and similar systems unless you really have the scale of 100k+ events per second or otherwise need to start splitting your app into several layers. That recent NYTimes article about using Kafka to store their entire dataset of 100GB is exactly the wrong way to use it.
If you want to use Kafka and need disaster recovery capabilities we typically recommend using Kafka Connect or other similar tools to replicate the data to another cluster or persistent storage system such as S3.
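For illustration, here's roughly what registering such a replication job could look like via the Kafka Connect REST API, assuming Confluent's S3 sink connector is installed on the Connect workers (the connector name, topic, and bucket are all placeholders):

```python
import json
import requests

connector = {
    'name': 'dr-archive-s3',
    'config': {
        'connector.class': 'io.confluent.connect.s3.S3SinkConnector',
        'topics': 'my-topic',
        's3.bucket.name': 'my-dr-bucket',
        's3.region': 'us-east-1',
        'flush.size': '10000',
        'storage.class': 'io.confluent.connect.s3.storage.S3Storage',
        'format.class': 'io.confluent.connect.s3.format.json.JsonFormat',
    },
}

resp = requests.post('http://localhost:8083/connectors',  # Connect REST API
                     headers={'Content-Type': 'application/json'},
                     data=json.dumps(connector))
resp.raise_for_status()
```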
For example, for credit card receipts, simply by the nature of the transaction type, you're unlikely to be processing enough of them to put pressure on a SQL database. One $1 transaction per second is roughly 31.5 million transactions a year, meaning you're grossing north of $30M, which is easy to handle in even an unoptimized schema. Citus reckon you can get thousands to tens of thousands of writes per second on Postgres, which would be grossing tens or hundreds of billions of dollars; this tech stack is suitable even when "low volume" becomes quite significant.
Of course, Kafka is designed for situations where you need to process millions of writes per second, which is into "GDP of the whole world" territory if those writes are credit card receipts, so I'd contend you're unlikely to ever need something like Kafka scale for your credit card payment handling components.
1. RabbitMQ saves on some of the operational complexity, especially at low volume. I've used it for similarly durability-critical applications with a combination of:
- Publisher Confirms (see the sketch after this list).
- Durable queue and message metadata.
- Multiple nodes with HA (synchronous replication) in pause-minority mode (sacrifice availability when there's a partition).
- Fanout-duplicating messages into two queues, hooking up main consumers to one, and a periodic backup that drains the other to a backup location (a separate, non-replicated RabbitMQ via federation, or a text/db file, or whatnot). This deals with RMQ's Achilles' heel, which is its synchronous replication system that can fail in interesting ways. Restoring a backup taken like this isn't automatic, but I've found that even adding a second decoupled RMQ instance into the mix significantly mitigates data loss due to disasters or RMQ failures/bugs.
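A minimal pika sketch of the first two bullets (publisher confirms plus durable queue/messages); the host and queue names are placeholders:

```python
import pika  # pip install pika

conn = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = conn.channel()

channel.queue_declare(queue='payments', durable=True)  # queue survives restarts
channel.confirm_delivery()  # broker must confirm every publish

channel.basic_publish(
    exchange='',
    routing_key='payments',
    body=b'{"charge_id": "abc123"}',
    properties=pika.BasicProperties(delivery_mode=2),  # persist the message
)
conn.close()
```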
All of those things will slow RMQ down a bit from its "crazy high throughput" use case, but your volume requirements are low, and the slowdown will not be significant (at worst network time plus tens of milliseconds rather than single milliseconds).
The configuration of each of those can be done mostly via idempotent messaging operations from your clients, with a bit of interaction (programmatic or manual) via the very user-friendly management HTTP API/site.
For even more delivery/execution confirmation, you can use more advanced messaging flows with AMQP transactions to get better assurance of whether or not a particular operation was completed or needs to be restored from backup.
2. Use a relational database. The reliability/transaction guarantees of something like Postgres are incredibly nice to have, simple to use, and incredibly widely supported even compared to things as popular as Kafka/JMS/NSQ/RMQ. At low volume, using a properly-tuned, replicated, dedicated database as a message bus or queue (via polling, triggers, or other Postgres features) tends to be almost as easy to set up as a real queue that prioritizes user-friendliness (like RabbitMQ), much easier to use and reason about in client code, and much more reliable.
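As one illustration of the polling approach (my sketch, not from the comment above): Postgres 9.5+'s FOR UPDATE SKIP LOCKED lets competing workers pop jobs without double-delivery. The table and column names here are hypothetical.

```python
import psycopg2

conn = psycopg2.connect('dbname=app')  # hypothetical database

def pop_job():
    """SKIP LOCKED makes competing workers each grab a different row,
    so no job is delivered twice."""
    with conn, conn.cursor() as cur:  # commit (releasing the lock) on exit
        cur.execute("""
            DELETE FROM jobs
            WHERE id = (
                SELECT id FROM jobs
                ORDER BY enqueued_at
                LIMIT 1
                FOR UPDATE SKIP LOCKED
            )
            RETURNING id, payload
        """)
        return cur.fetchone()  # None when the queue is empty
```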
Edit: syntax, missed words.
> the “it just works” experience of using Kafka.
Blog post says how it just works, then shows a crazy diagram stringing together 8 different pieces of technology (which themselves are insanely deep). Is there any easy way to understand WTF this crap does without knowing about RDS, hadoop, monitoring, real time BS, streaming etc? Does it "just work" once you've spent weeks and months understanding wtf it is?
After that, you can choose whichever implementation you want, of which Kafka is just one option. And yes it does require some work to setup and run, as most software does, but it's no more or less than any other distributed system at this point.
I came across some projects that claim to be full or partial alternatives to Kafka with better scalability and less operational overhead.
Has anyone tried them in their projects?
SMF is a prototype RPC system and libraries based on the C++ Seastar framework, not a messaging system in itself. MapR-ES is part of the MapR hadoop/data platform and also not standalone.
If you want alternatives, then look at NATS + NATS Streaming or Apache Pulsar.
Yes. MapR-ES comes with its own file system and tool chain, but it's a closer alternative to Kafka than NATS, as it follows the Kafka 0.9 API and can act as a drop-in replacement. It does not support exactly-once semantics yet, though.
MapR is an entire data platform and suite of services, one of which is the ES (event streams) offering. It's not drop-in for Kafka unless you're already running the MapR platform which is much more involved than just running Kafka.
Edit: the crate I was using and enjoying (which brings in librdkafka): https://github.com/fede1024/rust-rdkafka