Hacker News new | past | comments | ask | show | jobs | submit login
Apache Kafka Goes 1.0 (confluent.io)
392 points by pradeepchhetri on Nov 1, 2017 | hide | past | web | favorite | 134 comments

Congrats to the kafka/confluent team.

Side note: https://pulsar.apache.org also seems to be gaining traction and has a much better story around performance, pub/sub, multi-tenancy, and cross-dc replication. Will be interesting to see the evolution of both going forward.

Has somebody compared pulsar with pravega(http://pravega.io)?

another side note, pulsar supports kafka api since 1.20.0 - https://github.com/apache/incubator-pulsar/releases/tag/v1.2...

Interesting. Can someone give a quick/high level list what makes it better than Kafka and in what use cases?

I recommend reading the Pulsar architecture page: https://pulsar.apache.org/docs/latest/getting-started/Concep...

Biggest difference is that Pulsar splits compute from storage. Pulsar brokers are stateless services that manage topics and handle clients (publishers and consumers) while reading/writing data from a lower tier of bookies, which is the Apache BookKeeper project for low-latency real-time storage workloads.

Better performance and easier scaling compared to Kafka's partition rebalancing. It also natively supports data center replication and multiple clusters working together globally. Pulsar addressing is property/cluster/namespace/topic so multi-tenant isolation is built-in.

Anywhere you use Kafka can use Pulsar and also consolidate pub/sub only systems that need lower latency than Kafka.

I think it is beyond partition rebalancing. There is a fact that people didn't realize of making message broker `stateless`. It is actually much better on reacting to failures or shifting traffic, which is critical when running a messaging bus for online services. because it doesn't have to wait for copying the data of a whole partition when error occurs.

Agreed, the architecture page and other blog posts do a more thorough job of explaining the details. Having a stateless broker layer on top of a focused data layer makes all the operations much easier. I expect BookKeeper to further integrate with the various cloud storage APIs so it can also start to become stateless cache.

The biggest differences:

- rebalancing: Bookkeeper (which is the storage backend of Pulsar) manages rebalancing way better than kafka.

- Read scaling: Bookkeeper uses consensus to store log records, which means you can read from whatever shard you want. Whereas in kafka, reader has to connect to partition master.

To clarify the use of the term "rebalancing" here for people who are familiar with Kafka's terminology: this is (I believe) referring to partition _storage_ rebalancing (redistribution of topics replicated across brokers). It is _not_ referring to Kafka's notion of rebalancing across a consumer group subscribed to a set of topic partitions.

The specific concern here is the possibility that Kafka's ISR strategy can potentially result in a corrupt leader partition and truncate messages to recover from a broker machine failure. The unclean leader election configuration setting for Kafka brokers is relevant here.

Also "better" is subjective depending on your configuration, requirements, and storage backends.

another point added to 'rebalancing' -- when kafka rebalances the partitions, it has to copy all the data for the partitions that are moved around. it might not be a big problem when retention is small. however it is pretty worse when retention period is longer, rebalancing is going to exhaust all the bandwidth (both network and I/O) in the cluster. people don't realize the fact until they want to grow the cluster (adding more brokers) to support increased traffic.

Would you say that pulsar is strictly better than kafka, or that there are cases where kafka is preferred?

Yes, right now Pulsar does everything Kafka does but easier, faster, and with more features. It's especially great in combining message queuing with low-latency pub/sub.

It's also been around for years but recently open-sourced so community is smaller than Kafka, however it's growing quickly along with the usual ecosystem of drivers, extensions, and services. Unless you're already running Kafka, I would look at Pulsar first for new projects.

A great overview of Pulsar history with plenty of links is here: https://streaml.io/blog/messaging-storage-or-both/

Unfortunately pulsar does not have a system such kafka stream. In pulsar we need to use spark to do the same (https://pulsar.incubator.apache.org/docs/latest/adaptors/Pul...).

And pulsar does not provide exactly once neither, that is something very important.

Kafka streams is great. But it is a processing library. It isn’t an apple-to-apple comparison to a messaging system like Pulsar. First there are already many mature stream processing engines and libraries (spark, flink, heron, storm). They have been in production for years and it made no sense to write a new one without a good reason. Second, kafka stream could have done a better job, not just tight with kafka. Technically kafka streams isn’t really a very specific implementation or design to kafka. If confluent has done a better job on abstraction, I would expect it is very easy to plugin different messaging systems or log storages to run "kafka streams". Although I am not sure Confluent want to see that happen.

“Kafka has exactly-once delivery but messaging system x/y/z doesn’t provide it” is also confusing and misleading. Exactly-once is technically effectively-once: “at-least-once” and make the processing of the messages idempotent or “de-duplicated”. This has already been done in the industry for many decades in many mature stream processing engines like heron, flink. It isn’t a really new thing. And many messaging systems like pulsar already provides those primitives (e.g. idempotent producing, at-least-once delivery) for processing jobs to achieve effectively-once very easily. Streamlio folks did a great job about explaining exactly-once and effectively-once. It is worth checking this blog post out -- https://streaml.io/blog/exactly-once/

I think Pulsar itself as a distributed messaging system does provides all the three delivery semantics: at-most-once, at-least-once and effectively-once. It is very easy for people to use and integrate. I don’t think it is difficult to make kafka streams run with pulsar technically. The question is more is there a value to do that, do kafka folks wanna to do that, can the collaboration happen in the ASF?

that's just my two cents.

Kafka Streams is just a small wrapper library to make it easy to work with Kafka partitions, along with some utilities for reading, transforming, and writing to another stream in a single step.

Pulsar doesn't have any client overhead since it's all tracked on the broker so there's no real need for a separate library. Read, process, and push messages using any code you want, 1 at a time or in batches.

There's no realistic "exactly once" either, it's idempotency or some local cache of processed events used to dedup, you can read more here: https://streaml.io/blog/exactly-once/

Is it possible to use Kafka Streams wtihout Kafka by listening and processing events from another source?

I would say no, because kafka stream use the internal kafka method of partitions to be able to scale.

I do think it is technically possible base on my understanding of pulsar

- pulsar provides an failover subscription mode, which seems to be the equivalent of partition rebalancing of consumer group in kafka. https://pulsar.incubator.apache.org/docs/latest/getting-star...

- it has partitioned topics as well.

- it supports idempotent producing and have effectively-once delivery semantic.

It seems to have all the kind of primitives for kafka streams. to use.

For the folks using kafka or kinesis or other products with streaming event architectures - do you have a replay capability? For example, I am using kinesis and I have a lambda that processes events and writes to a postgres database. so say 3 months from now I want to also create a lambda that firehoses to redshift. How do I get the first 3 months of my data into redshift? Right now I have a lambda that writes all events to S3, so I can replay them for other stream consumers (or even for debugging existing consumers). This seems like a reasonable (if naive) solution, but I don't see a lot of talk out there in the first place about ensuring this capability exists, and definitely haven't seen any more robust or thoughtful options discussed. Are people just not doing this generally?

This replayability is discussed in Designing Data Intensive Applications (DDIA), a book by Martin Kleppmann. Essentially you can use the Change Data Capture (CDC) information in your primary Postgres database, and pipe it through Kafka and replay it on any other data store.[0] This is also the basis of traditional database replication technology, where the change logs are replayed on other databases.

Is this architecture common? Well, I suspect it is overkill for most smaller organizations due to increased infrastructure complexity. I wouldn't do this just for the sake of doing it -- you may find yourself saddled with an increased maintenance workload just keeping the infrastructure running.

But if you truly have this use case, this is a well-known method for syncing data across various types of datastores (ie. so-called polyglot persistence).

[0] https://www.confluent.io/blog/bottled-water-real-time-integr...

We have something overly dumb and simple. We generate about 400GB of logs a day through kinesis to lambda to s3. And a second lambda that reads of s3 and forwards the stuff to wherever using an s3:objectCreated event ...

If we need to replay something we just move the group of files in and out of the folder on s3 (bucket/logs -> bucket/temp -> bucket/logs) and it triggers a new objectCreated and fires the lambda.

We execute the 2 move operations with the AWS cli tool.

The one other thing we do is we have a config file in bucket/config.json that’s says event type X should go to redshift, pg, es, where ever. So we can tweak that before a move to send data to additional data stores.

We follow ELT instead of ETL for streaming into data stores.

What are you doing that makes a half-terabyte of logs per day? That seems kinda nuts.

Ads ads ads!

That’s just one system. Our Nginx access logs alone are about 250GB per day (they use that system).

Do you find that you have to replay data often?

I'd imagine that 99% or more of that 400GB/day gets thrown away before ending up in Redshift.

Rarely. We visualize a lot of it in Kibana and our kibana indexes are pruned after N days (configured for each event type).

We usually only replay when we want to restore data for querying stuff outside the pruning range.

We keep all that data forever.

I journal everything to s3 after batching and compressing the batches. I have minute-resolution batches per machine (so you read the prefix in S3 for ROLE/YYYY/MM/DD/HH/MM/* to get all of the machines who wrote for that time period). I have replay capability as well as a utility library built up for my company's use cases to find specific events. Fundamentally I use an architecture similar to s3-journal[0]. There are some stream systems that can handle writing to s3 and reading from s3 natively such as Onyx[1][2] as well as checkpointing stateful operations against s3 [3].

We use replay capability to fix bugs, add new features to existing data, and to load qa environments.

[0] https://github.com/Factual/s3-journal

[1] http://www.onyxplatform.org/

[2] https://github.com/onyx-platform/onyx-amazon-s3

[3] http://www.onyxplatform.org/docs/api/0.10.x/onyx.storage.s3....

I have almost exactly your use-case and I solved the problem very similarly. I use a lambda to write the events to S3. I have another lambda that can read the files in S3 and produce them to replay. It's designed so I can replay many years of events if necessary, which is way beyond what I would want Kafka's retention period to be.

There's nothing special about replay when it comes to a log. The only issue is retention and many managed services have limited retention based on data size or time frame. S3 archiving is common to keep the operational dataset small so reading it back and refilling Kafka for another consumer is actually pretty normal.

If you run Kafka yourself, you can set an infinite retention and keep all the messages, and also use the compaction feature to keep the latest value for every key to reduce some of the bloat if that applies to your data, but if you don't need it instantly accessible then it's probably worth archiving to cheaper storage.

IDK about Kinesis, but for Kafka there are two major cases here:

- If your Kafka retention period is long enough, then the data will still be sitting in Kafka. You don't need to replay, just have the new consumer start from the beginning.

- If the retention period is not long enough, then you'll have to replay data. There's not any provision for that in Kafka itself - it's really just a somewhat-smart logfile. But whatever you're using to write into Kafka may have some way to do it. This is similar to most message queuing or eventing systems. eg, RabbitMQ.

Do you go kinesis-->lambda-->S3? How do you write multiple events at once into S3? Lambdas are stateless. Where would I save the last saved event?

For all folks that forward kinesis messages to s3 via lambda. Kinesis stream can be forwarded to a kinesis firehose that will persist messages to s3

Google cloud solved this by offering a pubsub to GCS and GCS to pubsub pipeline. Letting you save and replay events.


This is like the whole point of kakfa. You make all your workers idempotent and retain N days/weeks/months/years of messages. The design of kafka follows, clients track their offset, messages aren't "delivered". The docs are rly simple and if you spent 5 minutes reading the overview it would answer all your questions.

I'm currently comparing using Kinesis vs running a small scale Kafka cluster on AWS. The ecosystem around Kafka is great, especially Kafka connect's stuff like Debezium. But I don't know if it's worth the trouble to deal with the extra operational complexity. Any opinions on administrating Kafka at small scale?

At small scale just go with Kinesis. The base semantics are pretty much the same between the two, and Kafka is terribly complex to run. The hosted Kafka solutions are too expensive for small scale.

Kinesis has a real auth story too, plus you can trigger Lambda functions off streams.

Disclosure: I work for Heroku. Heroku launched a cheaper managed Kafka 1.5 months ago. It starts at $100/month (pro-rated to second). That's ~$3.33/day. Great if you want to play, learn, or test out a proof-of-concept.

It's multi-tenant, but interaction is nearly identical to interacting with a dedicated kafka cluster -- i.e. you can use any regular kafka client library.

Check out docs[1] and launch blog post[2]. Happy to answer any questions here or through email (contact info in profile).

[1]: https://devcenter.heroku.com/articles/multi-tenant-kafka-on-...

[2]: https://blog.heroku.com/kafka-on-heroku-new-plans

> Kafka is terribly complex to run

I read this quite often, but we run a relatively small kafka cluster on GCP and it's pretty much hassle-free. We also run some ad-hoc clusters in kubernetes from time to time which also works well.

What exactly have you found complex about running Kafka?

>What exactly have you found complex about running Kafka?

I run small 2-node kafka cluster that processes to 10 million messages/hr - not large at all - it's very stable, for almost a year now. However what was complex was:

* Setup. We wanted it managed it by mesos/marathon, and having to figure out BROKER_IDs took a couple hours of trial and error.

* Operations. Adding queues and checking on consumers isn't amazing to do from the command line.

* Monitoring. It took a while before I settled on a decent monitoring solution that could give insight into kafka's own consumer paradigm. Even still there are more metrics I would like to have about our cluster that I don't care to put the time in to retrieve.

Another thing I found "complex" was the Java/Scala knowledge requirement. I wanted Kafka-like functionality for a Node.js project, but my limited Java and Scala knowledge made me concerned about my ability to deal with any problems I might run into.

In other words, I could probably get everything up and running (especially with the various Kafka-in-Docker projects I found), but what happens if (when) something goes wrong?

What do you mean by "Java/Scala knowledge requirement"? I don't know much c/c++ but I use postgres just fine. There is a bunch of stuff in software ecosystem in a bunch of languages that if I had to know it all I wouldn't progress much.

I've never had to dive into any Java or Scala to maintain our Kafka cluster

> Monitoring. It took a while before I settled on a decent monitoring solution that could give insight into kafka's own consumer paradigm.

Would you be willing to write a bit (or point to a post with) more about this? What do you find useful?

Like I mentioned our Kafka setup is relatively small - we moved from RabbitMQ to Kafka because of the sheer size (as in byte size) of the messages we needed to process (~10 million/hr), where each message could be 512-1024kb which caused RabbitMQ to blowup unpredictably.

Secondly, due to the difference in speed in the consumer and producer, we typically have an offset lag of around 10MM, and its important to monitor this lag for us because if it gets too high, then it means we are falling behind (our consumers scale up and down through the day to mitigate this).

Next, we use Go, which is not an official language supported by the project but has a library written by Shopify called Sarama. Sarama's consumer support had been in beta mode in a while, and in the past had caused some issues were every partition of a topic wasn't being consumed.

Lastly, at the time we thought creating new topics would be a semi-regular event, and that we might have dozens of them (this didn't pan out), but having a simple overview of the health of all of our topics and consumers was thought to be good too.

We found Yahoo's Kafka Manager[1], which has ended up being really useful for us in standing up and managing the cluster without resorting to the command line. It's been great, but it wasn't exactly super obvious for me to find at the time.

Currently the only metrics I don't have are things plottable things like processed/incoming msg/sec (per topic), lag over time and disk usage. I'm sure these are now easily ingested into grafana, I just haven't had the time to do it.

All of this information is great to have, but requires some setup, tuning, and elbow grease that is probably batteries included in a managed service. At the same time however, this is something you get almost out of the box with RabbitMQ's management plugin.

[1] https://github.com/yahoo/kafka-manager

Yes, I do agree with these (except mesos is not a requirement for us). Is any of this significantly better for hosted Kafka or kinesis though? I have no experience with either

Yes, not having to worry about any of that is primary reason for managed services.

What is the point of 2 node cluster?

Topic sharding. The messages were pretty large and at the time we set this up we were on DO-like platform where the only way to get more disk space was to buy a larger instance. We didn't need the extra cpu power, but needed extra disk space, and it was cheaper to opt to two nodes instead of upgrading to n+2.

Running Kafka is just fine, the issues arise when a node fails, when you need to add data and re-partition a topic. However, it is not that hard once you know what to do, but Kinesis is simpler but it is expensive as shit.

At small scale Kinesis is far less expensive. There is definitely a point where Kinesis becomes more expensive, especially if you consider the operational and human costs involved.

I get shit for free daily!

I don't understand how Kafka is a complex project to run. It's dead simple to install alongside kafka manager and we have dedicated no time to it since installation - just runs and does it's job.

> Kafka is terribly complex to run

That does not match my experience at all. Of all the distributed message queues I've tried, Kafka has been - by far - the easiest to operate.

It works well out-of-the box and even setting it up with ZooKeeper is relatively simple.

Kafka can do client side certificates, no? That would be a real auth story.

It can, and yes, it works well.

Define "small scale". Keep in mind that Kafka was developed by LinkedIn to handle trillions of bytes of message ingest/transfer per day in real time (that's at least tens of MB per second across all boundaries, and probably much more during "bursty" periods). Is that you? If not, unless you just want to use Kafka as a learning experience/resume builder it's (probably) not necessary to even use it.

If that isn't you and you are still intent on using it, I'd second the opinion to go with "let someone else (e.g. 'cloud provider') administer it". It's probably not really worth the effort to get configuration settings and installation details right unless you genuinely have a serious interest in it.

I'm looking for a small scale solution (several orders smaller than linkedin) for change data capture into an ordered queue that Flink or Spark can consume continuously. The solution needs to be reliable and highly available, latency needs to be reasonable and relatively consistent to keep aggregates relatively fresh. Do you have a recommendation?

We "only" had 20K messages/s at my last job, and Kafka worked incredibly well for us. We ran 3 brokers and always had headroom for jobs which read the last N days of the log.

We had a mix of realtime (continuous) consumers and batch consumers that consumed directly from Kafka, but also archived that data to S3 for future batch processing.

Operations were quite simple for us, but we were also used to running Zookeeper before, so YMMV. If you can get away with Kinesis, it's simpler operationally, but Kafka isn't difficult compared to most systems.

Thanks for sharing, what was your experience with node failure at that scale?

Short unavailability (10-20s) of the partitions whose leadership were assigned to that broker, followed by otherwise uninterrupted service.

If it was a clean failure, no big deal. If the disk started having issues and taking a long time for IO to come back, all partitions with replicas or leaders on that node were slow. That's kind of the state of the world with most distributed things though: a bad node is worse than a dead node.

Another experience report. I work work a team that handles gigabytes per second and one stage of the pipeline involves sending data from a nsq to Kafka. The nsq nodes take exponentially less resources and empirically have far greater reliability. Odd because in terms of functionality I care about, both systems do the same thing.

"no leader for partition" is something I see frequently.

In Kafka's defence a better managed cluster might do better. In nsq's defence: nobody has to manage it, it just runs.

I mean, they're kind of apples and oranges. nsq is a primarily in-memory (minus the fallback to disk if too many messages are in flight) pubsub service fundamentally designed for spreading load and availability. There's no delivery or ordering guarantees, no replication.

Kafka is a distributed, partitioned log. Messages published to a partition in a given order will be stored in that order and replicated in that order, and consumed in that same order. Kafka writes all of your data to disk, and waits for replicas to acknowledge it (depending on your producer configuration). Kafka does considerably more work than nsq.

I like nsq, it works well, but they aren't at all designed to solve the same problems.

I mentioned the issues above because they'd happened, but in years of running Kafka it was never Kafka's fault.

> "no leader for partition" is something I see frequently.

Something is wrong with your cluster if it's constantly switching leadership. Leadership should be stable -- and furthermore, partitions have preferred leaders, so they should stay with specific nodes until that node goes down. Kafka has a process which shifts leadership back to preferred leaders automatically (unless it's off). Check your logs.

How are nsq & Kafka remotely comparible? One is a durable event store with strong consistency constraints the other is a standard message broker system without order or durability promises.

It's ok to compare two different things, after all that's kind of the point of a comparison. In fact I believe you managed to do just that in your comment.

Remember that both Kafka and Kinesis have "at least once" delivery and from experience it's a common scenario. You will receive multiple messages. Two options: (1) store hashes of each row and then deduplicate, (2) use a database.

I can't remember if it's an issue for change capture since in theory you have some insert timestamp column anyway.

Kafka has "exactly once" delivery guarantees since 0.11, I believe this will be the default in Kafka 1.0 (I work for Confluent).

I've enjoyed using Aiven's hosted Kafka. Can't comment on the other services that offer hosted solutions. I was originally going to go with Kinesis, but there were certain latency issues that didn't make sense for our application. I don't remember the specifics, but I seem to recall something about monster latency occasionally cropping up in Kinesis.

Thanks for mentioning us! We're currently running the Kafka 1.0.0 final release packages through our QA systems and hope to make it available for all Aiven (https://aiven.io) users next week.

Disclaimer: I'm one of the creators of Kafka and founders of Confluent

The best way to get the power, throughput, latency of Kafka without the operations is to use a hosted service. The one created and supported by the Kafka team is Confluent Cloud https://www.confluent.io/confluent-cloud/

I love companies developing a product also offering a managed implementation as a service, but I cannot find the pricing.

Any time it seems like I have to establish some sort of relationship and get an enterprise tailored solution, I no longer feel like I'm part of a public and elastic market.

Talked with one of the sales reps, starts at $1500/month.

From what they said multi-tenant is about a month a way and self service cloud early next year.

I beleive the phrase is "If you have to ask you can't afford it”.

Hehe, that only applies to luxury goods, which are decooupled from supply/demand rules.

I'm familiar, I emailed over a week ago with my usecase (change data capture with debezium and postgres at small scale) and asked for some basic pricing information in comparison to Kinesis. I never received a response so I assumed you were looking for bigger enterprise customers right now.

I talked to one of Confluent's sales reps, who said this service starts at around $1,500/month on the low end.

Definitely not for a startup whose monthly Kinesis bill is a few hundred dollars, which is a shame because Kakfa is better in several ways.

When you don't show any information about pricing at all on your page, you're obviously not meant for small-scale use.

Interesting to know about if I want to introduce Kafka in a bigger project, though.

I guess no pricing means I cannot afford it. :(

Is this ready now? We talked to the team during early access but never got any followup.

I recommend Kinesis. I was in this same position a nearly couple years ago and Kafka is seriously complicated to manage and run. Kinesis is super simple, 100% managed, and has about the same semantics as Kafka. Also the AWS ecosystem around Kinesis (lambda triggers, firehose, etc) is awesome.

The number of consumers is also important. Kinesis is optimized (and priced) for the many producers / few consumers scenario. If you have many consumers, Kinesis gets expensive quickly.

I am not sure if you had seen comparisons like this [1] or not but it has some important points when comparing these two. But I would say it also depends on how you are gonna use the streams. For example, if like us, you are going to digest the streams using Spark I would say Spark's API is more mature with Kafka (unless you have switched to Spark 2.2 and higher).

[1] http://go.datapipe.com/whitepaper-kafka-vs-kinesis-download

If you're in AWS land you can use Kinesis firehose to S3, which is perfect for Spark. Way more straightforward than any Kafka solution.

Kafka needs additional infrastructure (ZK cluster) if you're going to handle it yourselves. Kinesis wins that category, but it has atrocious write performance compared to Kafka, and limited data retention time, if that's important for your project.

I'd look at a hosted Kafka solution before going with Kinesis.

Go with a hosted solution: CloudKarafka, Aiven, Heroku Kafka, Eventador are all available. First one has a free plan too. Disclaimer: I work with CloudKarafka.

It's really not that big a deal to administer Kafka at small scale. If you have to, you can run it on 3 machines, with ZK on the same nodes.

I'd still run it myself over Kinesis, even if I was a single-person startup.

For workloads that aren't as important durability wise, has anyone considered nats? https://nats.io/

I use NATS in production. Kafka and NATS are almost complete opposites. One is a durable, partitioned queue; the other is a no-persistence message queue.

Neither are message queues. NATS is a pub/sub system and NATS Streaming is a client that connects to NATS and stores all messages on certain channels for replay.

Indeed, NATS is a pub/sub system. I should have said "message broker", but defaulted to "queue". Is a zero-depth queue a queue?

> Is a zero-depth queue a queue?

Yes. Queuing semantics are usually about ordered delivery, acknowledgements, consumer groups, retries, priorities, etc. RabbitMQ does that, NATS only has queue groups at most.

Aren't they fixing that with NATS Streaming Server?

There's nothing to "fix" - NATS is a distributed and fast pub/sub system. NATS Streaming is actually a client, it uses NATS to communicate and basically just subscribes to some topics, saves messages it receives, and will then replay them back to you if asked.

Imagine how you would build a persistent logging app on top of NATS, and that's what NATS Streaming is.

We built a metrics relay using it (nats) and are pushing > 500k metrics per second in bursts just fine. I've been pretty blown away with the perf of it actually. That said, it isn't durable, so normal here be dragons warning apply.

They did. Partially. I use nats-streaming in Production. It still does not have real “clustering” solution, but for my small scale I’m able to manage ~7-10k messages per second. Which is fine for many companies.

We use both:

- kafka for firehose data

- nats for realtime, routed data

I really hope they would stop with that exactly once thing.

It does not mean what they use it for, and it does generate a lot of confusion.

Ya I wrote a blog about why that label is misleading:


How so?

"exactly once" is not possible, but you can do "at least once" and make the processing of the message idempotent or "de-duplicated". This is true of many messaging systems.


> Exactly-once Semantics are Possible: Here’s How Kafka Does it

> Now, I know what some of you are thinking. Exactly-once delivery is impossible, it comes at too high a price to put it to practical use, or that I’m getting all this entirely wrong! You’re not alone in thinking that.

blah blah blah ... of course it's at-least-once, with the "idempotent producer" so only one resulting message is published back to another kafka stream. Big surprise.

Now many people think "kafka has exactly-once delivery, that's what I want, I don't want to have to deal with this at-least-once stuff" when really it's the same thing, and others have been doing idempotent operations of various kinds for years, and the user still has to figure out how to do their desired thing (which might not be sending one more kafka message) in a mostly idempotent way.

From the very same blog post:

> Is this Magical Pixie Dust I can sprinkle on my application?

> No, not quite. Exactly-once processing is an end-to-end guarantee and the application has to be designed to not violate the property as well. If you are using the consumer API, this means ensuring that you commit changes to your application state concordant with your offsets as described here.

I think that is a pretty clear statement that end-to-end exactly once semantics doesn't come for free. It states that there needs to be additional application logic to achieve this and also specifies what must be done.

Right. But that's at the end of a separate article, while the post which this HN discussion is about throws around the words "exactly once" a lot more casually. The argument is over the use of the words "exactly once". They should just refer to the feature as "transactions" or "idempotent producer".

How do you handle idempotence for at least once delivery? Say I have an event counter of some sort. How would you track whether or not an event was handled in a way where you wouldn't increment the counter multiple times if the event was delivered multiple times?

kafka doesn't give you a counter like that either, I don't think.

What you could do though, is a keep a "set" data structure in a replicated redis or some other database, and each member of the set is the id of the message which incremented that count. Duplicates are naturally ignored when adding to the set. The count is the size of the set.

The message might also have a "time-generated" embedded in it. You could group by hour, and after some number of hours assume that no more messages will be re-delivered from that long ago, and consolidate that set down to a single number.

Or maybe that's too much overhead because there are millions of messages per hour. Maybe in that case you can afford the occasional double-count and don't need all this. Trade-offs will be unique for each situation, and I don't think kafka will completely take care of this kind of thing for you.

Kafka does give you a counter, it's in Neha's blog post. Kafka's idempotent producer registers itself with the brokers with a unique producer id and includes a sequence number with each message it sends. The brokers simply keep track of producer id + the highest sequence number they've seen, and return an ack, if the ack is lost the producer retries and the broker knows to deduplicate the message, and send the ack. In the event counter example this would ensure that each event would be incremented once, even in the face of multiple failures.

Relying on another distributed system to ensure the first distributed system isn't duplicating messages sounds like a headache I wouldn't wish on anyone. (I am a Confluent employee).

OK, so you're right, it seems "brokers" are little databases (maybe just for counters?). In this case the broker acts as the separate de-duplication system I described. I'm much more familiar with a system that does not provide order guarantees (and as a result doesn't need "partitioning" or "re-partitioning" for multiple consumers), but with Kafka where order is guaranteed, a simpler mechanism is possible - keep the count and the message sequence number together, sometimes update both the count and the seq, sometimes just the seq, only update if the new seq is prev+1. And this is built into the broker.

But you still need to understand how this works to do "kafka transactions" and you still need some other scheme to get effects/actions outside of kafka. (And you'll probably get people doing dumb stuff saying "I was told I get exactly-once delivery")

If you have a consumer on some other node that does nothing more that a println statement, you will see that there will be more than once delivery.

Exactly once seems to refer to exactly once write to kafka and processing in kafka streams that write back to kafka and don't do stuff like make external http requests, in which case you'll need your logic of at most or at least once to delivery results to wherever you're sending the http requests.

Confluence blogs do seem pretty misleading in terms of hiding these issues.

I think the term they meant to use was “idempotent”, but that doesn’t get attention the same way that claiming to violate the FLP result does.

Not sure that I agree with that.

I just think that they are opaque to what they really mean by exactly-once and definitely hide the fact that they do not offer exactly-once delivery.

We can look at the ProduceRequest logic in their design doc - (https://docs.google.com/document/d/11Jqy_GjUGtdXJK94XGsEIK7C...)

  Check the sequence number map to determine if the PID is already present. 
    If it is not, check sequence number. 
      If it is 0, insert the PID, epoch, and sequence number into the PID mapping table and proceed to append the entries.
      If the sequence is not zero, return the InvalidSequenceNumber error code.
    If the PID is present in the mapping, check the epoch. 
      If the epoch is older than the current one, return InvalidProducerEpoch error code.
    If the epoch matches the current epoch, check the sequence number:
      If it matches the current sequence number, allow the append. 
      If the sequence number is older than the current sequence number, return DuplicateSequenceNumber. 
      If it is newer, return InvalidSequenceNumber.
    If the epoch is newer, check the sequence number.
      If it is 0, update the mapping and allow the append.
      If it is not 0, return InvalidSequenceNumber.
That's a pretty straightforward approach of enforcing idempotency; checking producer and sequence IDs and throwing them out if they aren't correctly ordered. The other big addition to the code base was cross-partition transactions, which, while pretty cool, is a far cry from "Exactly once delivery".

Do people use kafka for situations where data loss is not tolerable(eg: accept credit card receipts).

Yes, but the default configuration is not suitable to that situation. You will need to make some adjustments if you cannot tolerate data loss.

What are the alternatives people are using instead of kafka in these situations. Low volume but high reliability.

Low-volume = relational database. They remain the gold standard in usability and tooling. Backups and high-availability and consistency are well solved.

Also low-volume on modern servers can easily get into 10k+ transactions per second so you have plenty of performance potential.

Don't bother with Kafka and other systems unless you really have the scale of 100k+ events per second or otherwise need to start splitting up your app into several layers. That recent NYTimes article about Kafka for storing their entire dataset of 100gb is exactly the wrong way to use it.

For low volume my recommendation would be to just use a traditional relational database configured for high availability.

If you want to use Kafka and need disaster recovery capabilities we typically recommend using Kafka Connect or other similar tools to replicate the data to another cluster or persistent storage system such as S3.

+1. For use-cases which impose strict data durability requirements (either for business or regulatory reasons), I think it's unwise to use anything fancy like Kafka unless you've maxed out performance of your SQL database.

For example, for credit card receipts, simply by the nature of the type of transaction, you're unlikely to be processing enough of these to put pressure on a SQL database. One $1/transaction per second means you're grossing north of $30m, which is easy to handle in even an unoptimized schema. Citus reckon you can get thousands to tens of thousands of writes per second[1] on Postgres, which would be grossing tens or hundreds of billions of dollars; this tech stack is suitable even when "low volume" becomes quite significant.

Of course, Kafka is designed for situations where you need to process millions of writes per second, which is into "GDP of the whole world" territory if those writes are credit card receipts, so I'd contend you're unlikely to ever need something like Kafka scale for your credit card payment handling components.

[1]: https://www.citusdata.com/blog/2017/09/29/what-performance-c...

Two suggestions:

1. RabbitMQ saves on some of the operational complexity, especially at low volume. I've used it for similarly durability-critical applications with a combination of:

- Publisher Confirms.

- Durable queue and message metadata.

- Multiple nodes with HA (synchronous replication) in pause-minority mode (sacrifice availability when there's a partition).

- Fanout-duplicating messages into two queues, hooking up main consumers to one, and a periodic backup that drains the other to a backup location (a separate, non-replicated RabbitMQ via federation, or a text/db file or whatnot). This deals with RMQ's achilles heel, which is its synchronous replication system that can fail in interesting ways. Restoring a backup taken like this isn't automatic, but I've found that even adding a second decoupled RMQ instance into the mix is sufficient to significantly mitigate data loss due to disasters or RMQ failures/bugs.

All of those things will slow RMQ down a bit from its "crazy high throughput" use case, but your volume requirements are low, and the slowness will not be significant (at worst network time + 10s of ms rather than 1s of ms).

The configuration of each of those can be done mostly via idempotent messaging operations from your clients, with a bit of interaction (programmatic or manual) via the very user-friendly management HTTP API/site.

For even more delivery/execution confirmation, you can use more advanced messaging flows with AMQP transactions to get better assurance of whether or not a particular operation was completed or needs to be restored from backup.

2. Use a relational database. The reliability/transaction guarantees of something like Postgres are incredibly nice to have, simple to use, and incredibly widely supported even compared to things as popular as Kafka/JMS/NSQ/RMQ. At low volume, using a properly-tuned, replicated, dedicated database as a message bus or queue (via polling, triggers, or other Postgres features) tends to be almost as easy to set up as a real queue that prioritizes user-friendliness (like RabbitMQ), much easier to use and reason about in client code, and much more reliable.

Edit: syntax, missed words.

you might want to checkout https://pulsar.apache.org/ a durable low latency pub/sub system. it also has a kafka api client.

Kafka works fine for no-data-loss scenarios, you just need to configure it properly.

Write to a replicated database?

Absolutely, we used store-and-forward semantics when I used Kafka at a previous company to guarantee zero message loss (for data getting into kafka). Now that kafka streams provides exactly once semantics, you're in an even better spot to achieve this.

You probably already know this, but obligatory reminder for others: Kafka's exactly-once delivery is not the same as exactly-once execution of an arbitrary workload. E1 mode is incredibly convenient, but it doesn't solve the tricky parts of, for example, hitting an external ACH to process a credit card payment. What if the external request times out? Fails silently? Fails loudly? Exactly-once delivery can help develop solutions in this area, but it doesn't eliminate this problem domain.

Of course, the key to any successful implementation of end-to-end exactly-once processing lies in the fact that you USE the kstreams APIs - including all semantics around "consumption is completed up to offset X"

ugh... i dispise this crap..

" the “it just works” experience of using Kafka. "

Blog post says how it just works, then shows a crazy diagram stringing together 8 different pieces of technology (which themselves are insanely deep). Is there any easy way to understand WTF this crap does without knowing about RDS, hadoop, monitoring, real time BS, streaming etc? Does it "just work" once you've spent weeks and months understanding wtf it is?

Nothing in that diagram has anything to do with Kafka, it's just showing that you can connect different systems together using a single message bus. If you are new to that concept then I recommend reading the seminal blog post about using logs as the data backbone: https://engineering.linkedin.com/distributed-systems/log-wha...

After that, you can choose whichever implementation you want, of which Kafka is just one option. And yes it does require some work to setup and run, as most software does, but it's no more or less than any other distributed system at this point.

Kafka has come a long way. Kudos to confluent team.

I came across some projects which claim to be full or partial alternative to Kafka with better scalability and less operational overhead.

Jocko - https://github.com/travisjeffery/jocko

SMF - https://senior7515.github.io/smf/

MapR-ES - https://mapr.com/blog/kafka-vs-mapr-streams-why-mapr/

Has anyone tried it in their project ?

None of those are alternatives. Jocko's creator now works for Confluent and it's not production ready or supported, more of a hobby project to recreate kafka in go without zookeeper.

SMF is a prototype RPC system and libraries based on the C++ Seastar framework, not a messaging system in itself. MapR-ES is part of the MapR hadoop/data platform and also not standalone.

If you want alternatives then look at NATS + NATS Streaming [1] or Apache Pulsar [2].

1. http://nats.io

2. https://pulsar.apache.org/

Thanks for sharing the Jocko's creators story.

Yes. MapR-ES comes with its own File system and tool chain but its more closer alternative to Kafka than NATS as it follows Kafka 0.9 API and can act as drop in replacement. Although it does not support exactly once semantics yet.

Usually alternatives are about providing similar functionality, not the exact interface. If you just want a Kafka API then there are multiple adapters available for most other messaging systems, including pulsar.

MapR is an entire data platform and suite of services, one of which is the ES (event streams) offering. It's not drop-in for Kafka unless you're already running the MapR platform which is much more involved than just running Kafka.

actually, pulsar just supports kafka 0.10 api in the recent apache release. it might be worth checking it out as a drop in replacement. https://github.com/apache/incubator-pulsar/releases/tag/v1.2...

maybe another alternative here https://rocketmq.apache.org/

So can it be used comfortably now with anything else than Java?

I've been using it in Rust, but librdkafka had (has?) a bug that's been fixed in tree that cut the read throughput to about 10MB/s using loopback (edit: cuts throughput to 10MB/s in many settings, which hurts especially in loopback). With that fixed, it seemed pretty pleasant for my use cases (framed bytestreams at line rates, nothing fancy or web-scale).

Edit: crate I was using, and enjoy (which brings in librdkafka): https://github.com/fede1024/rust-rdkafka

Is Kafka good for processing hundreads of GB size messages per second? Can also a Kafka Stream (output) be input to another Kafka stream dynamically?

Kafka expects messages to be under 1MiB. If you need larger messages, they recommend you store the payload somewhere else and use Kafka to pass around pointers to the payload. So e.g., pass around an S3 URL.

kafka transactional producer not working on windows 7, this bug not yet fixed.https://issues.apache.org/jira/browse/KAFKA-6052

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact