Hacker News new | past | comments | ask | show | jobs | submit login
Issues we've encountered while building a Kafka based data processing pipeline (sixfold.medium.com)
158 points by poolik 43 days ago | hide | past | favorite | 93 comments



I ran a few dozen kafka clusters at MegaCorp in a previous life.

My answer to anyone who asks for kafka: Show me that you can't do what you need with a beefy Postgres.


This is exactly how I feel about it.

A while back I was on team building a non-critical, low volume application. It just involves people sending a message basically. (there is more too it).

The consultants said he had to use Kafka because the messages could come in really fast.

I said we should stick with Postgres.

No, they said, we really need Kakfa to be able to handle this.

Then I went and spun up Postgres on my work laptop (nothing special), and got a loaner to act as a client. I simulated about 300% more traffic than we had any chance of getting. It worked fine. (did tax my poor work laptop).

No, we could not risk it, when we use Kafka we are safe.

Took it to management, Kafka won since Buzzword.

Now of course we have to write a process to feed the data into Postgres. After all its what everything else depends on


People tend to not realize how big "at scale" problems really are. Instead anything at a scale at the edge of their experience is "at scale" and they reach for the tools they've read one is supposed to use in those situations. It makes sense, people don't know what they don't know.

And thus we have a world where people have business needs that could be powered by my low end laptop but solutions inspired by the megacorps.


The GHz aspect of Moore's law died over a decade ago, and I suppose it's fair to say most other stuff has also slowed down, but if you've got a job that is embarrassingly parallel, which a lot of these "big data" jobs are, people badly underestimate how much progress there has been in the server space even so in the last 10 years if they're not paying attention. What was "big data" in 2011 can easily be "spin up a single 32-core instance with 1TB RAM for a couple of hours" in 2021. Even beyond the "big data" that my laptop comfortably handles.

I'm slowly wandering into a data science role lately, and I've been dealing with teams who are all kinds of concerned about whether or not we can handle the sheer, overwhelming volume of their (summary) data. "Well, let's see, how much data are we talking about?" "Oh, gosh, we could generate 200 or 300 megabytes a day." (Of uncompressed JSON.) Well, you know, if I have to bust out a second Raspberry Pi I'll be sure to charge it to your team.

The funny thing is that some of these teams have the experience that they ought to know better. They are legitimately running cloud services with dozens of large nodes continually running at high utilization and chewing through gigabytes of whatever per second. In their own worlds they would absolutely know that a couple hundred megabytes is nothing. They'll often have known places in their stack where they burn through a few hundred megabytes in internal API calls or something unnecessarily, and it will barely rise to the level of a P3 bug, quite legitimately so. But when they start thinking in terms of (someone else's) databases it's like they haven't updated their sense of size since 2005.


You are not thinking enterprisey enough. Everything is Big Scale if you add enough layers, because all these overheads add up, to which the solution is of course more layers.


There are various Kafka to Postgres adaptors. Of course, now you're running 3 bits of software, with 3 bottlenecks, instead of just 1.


Were you advocating an endpoint + DB or different apps directly writing into a shared DB? The latter is not really a good idea for numerous reasons. Kafka a - potentially overkill - replacement for REST or whatever, not for your DB.


But can you stream data from lets say MS SQL directly into Postgres? Easiest way I found is Kafka, I would love some simple python script instead


Stream? Unless you have really hight traffic, put that in a python script while loop that regularly check for new rows, and it will be fine.

If you want to get fancy, db now have pub/sub.

There are use case for stream replication, but you need way more data than 99% of biz have.


You don’t want stream replication for scale, you want it for availability and durability. The number of times replication has saved my ass is too damn high.

And you want Kafka (or something like it) the moment you need two processes handling updates which again you probably want for availability reasons so a crash doesn’t stop the stream.

You also don’t catch deletes or updates with this setup. But there’s a million db to Kafka connectors that read the binlog or pretend to be a replica to handle this.


Sure, if you need high availability + replication cross db, go for Kafka, but that's the thread point: it's not a common use case.

Most people don't need high availability, and most replication use the same db which comes with tooling for that.


>My answer to anyone who asks for kafka: Show me that you can't do what you need with a beefy Postgres.

Sorry thats just a clickbait-y statement. I love Postgres, try handling 100-500k rps of data coming in from various sources reading and writing to it. You are going to get bottlenecked on how many connections you can handle, you will end up throwing pgBouncers on top of it.

Eventually you will run out of disk, start throwing more in.

Then end up in VACCUUM hell all while having a single point of failure.

While I agree Kafka has its own issues, it is an amazing tool to a real scale problem.


I think anotherhue would agree that half a million write requests per second counts as a valid answer to "you can't do what you need with a beefy Postgres," but that is also a minority of situations.


It's just hard to know what people mean when they say "most people don't need to do this." I was sitting wondering about a similar scale (200-1000 rps), where I've had issues with scaling rabbitmq, and have been thinking about whether kafka might help.

Without context provided, you might think: "oh, here's somebody with kafka and postgres experience, saying that postgres has some other super powers I hadn't learned about yet. Maybe I need to go learn me some more postgres and see how it's possible."

It would be helpful for folks to provide generalized measures of scale. "Right tool for the job," sure, but in the case of postgres, it often feels like there are a lot of incredible capabilities lurking.

I don't know what's normal for day-to-day software engineers anymore. Was the parent comment describing 100-500 rps really "a minority of situations?" I'm sure it is for most businesses. But is it "the minority of situations" that software engineers are actively trying to solve in 2021? I have no clue.


Note superyesh was talking about 100 to 500 thousand requests per second. Your overall question stands, but the scale superyesh was talking about is very different and I am quite confident superyesh's scale is definitely in the minority.


Oops, yes, was omitting the intended "k", totally skewing the scale of infrastructure my comment was intending to describe. Very funny, ironic. Unfortunately I can no longer edit that comment.


I’m not sure if you’re omitting the k in your numbers, or missed it in the other comment? Do you mean 100-500 and 200-1000, or 100 000-500 000 and 200 000-1 000 000?


Yes, quite ironically, I accidentally forgot the "k" in my numbers. Oops.


that seems like an awfully low number to be running into issues with RabbitMQ ?


at least for me, every system has its place.

i love posgresql, but i would not use it to replace a rabbitmq instance -- one is an RDBMS, the other is a queue/event system.

"oh but psql can pretend to be kafka/rabbitmq!" -- sure, but then you need to add tooling to it, create libraries to handle it, and handle all the edge cases.

with rmq/kafka, there already a bunch of tools to handle the exact case of a queue/event system.


I think having ad hoc query capabilities on your job queue/log (not possible with rabbit and only possible by running extra software like KQL with Kafka, and even then at a full-table-scan cost equivalent) is a benefit to using postgres that should not be overlooked. For monitoring and debugging message broker behavior SQL is a very powerful tool.

I say this as someone squarely in the "bigger than can work with an RDBMS" message processing space, too: until you are at that level, you need less tooling (e.g. read replicas, backups, very powerful why-is-it-on-fire diagnostics informed by decades of experience), and get generally higher reliability, with postgres as a broker.


Yeah, once you do have to scale a relational database you're in for a world of pain. Band-aid after band-aid... I very much prefer to just start with Kafka already. At the very least you'll have a buffer to help you gain some time when the database struggles.


Indeed! That sounds absolutely like it requires a real pub/sub. 'Actually needing it' is the exception in my experience though.


I have done both patterns.

Kafka has its use case. Databases have theirs. You can make a DB do what kafka does. But you also add in the programming overhead of getting the DB semantics correct to make an event system. When I see people saying 'lets put the DB into kafka' I make the exact same argument. You will spend more time making kafka act like a database and getting the semantics right. Kafka is more of a data/event transportation system. A DB is an at rest data store that lets you manipulate the data. Use them to their strengths or get crushed by weird edge cases.


I'd argue that having the event system semantics layered on top of a sql database is a big benefit when you have an immature product, since you have an incredibly powerful escape hatch to jump in and fire off a few queries to fix problems. Kafka's visibility for debugging is pretty poor in my experience.


My issues typically with layering an event system on top of a db is replication and ownership of that event. Kafka makes some very nice guarantees about giving best attempt to make sure only one process works on something at a time inside of a consumer group. You have to build a system in the db using locks and different things that are poor substitutes.

If you are having trouble debugging kafka you could use a connector to put the data into the database/file to also debug, or a db streamer. You can also use the built in cli tools to scroll along. I have had very good luck with using both of those to find out what is going wrong. Also kafka will basically by default keep all the messages for the past 7 days so you can play it back if you need to by moving the consumer offsets. IF you are trying to use kafka like a db and change messages on the fly you will have a bad time of it. Kafka is meant to be a here is something that happened and some data. Changing that data after the fact would in the kafka world be another event. In some types of systems that is a very desirable property (such as audit heavy cultures, banks, medical, etc). Now also kafka can be a real pain if you are debugging and messup a schema or produce a message that does not fit into the consumer. Getting rid of that takes some poor cli trickery when in a db it is a delete call.

Also kafka is meant for a distributed system event based worker systems (typically some sort of microservice style system). If you are early on you more than likely not building that yet. Just dumping something into a list in a table and polling on that list is a very effective way for something that is early on or maybe even forever. But once you add in replication and/or multiple clients looking at that same task list table you will start to see the issues quickly.

Using an event system like a db system and yes it will feel broken. Also vice versa. You can do it. But like you are finding out those edge cases are a pain and make you feel like 'bah its easier to do it this way'. In some cases yes. In your case you have a bad state in your event data. You are cleaning it up with some db calls. But what if instead you had an event that did that for you?


Well, if you want events, you have lighter alternative. Redis pub/sub, crossbar... Even rabbitMQ is lighter.

If you need queuing, you have libs like celery.

You don't need to go full kafka


I do not disagree. Kafka is kind of a monster to fully configure correctly. Once they get rid of zookeeper it may be nicer to spin up and 'set it and forget it' sort of thing like the others.


Redis - it's probably in your stack already and fills all the use-cases.


Agreed. Redis pub/sub is underrated.


Just because you can do it with Postgres doesn't mean it is the best tool for the job.

Sometimes the restrictions placed on the user are as important. Kafka presents a specific interface to the user that causes users to build their applications in certain way.

While you can replicate almost all functionality of Kafka with Postgres (except for performance, but hardly anybody needs as much of it), we all know what we end up with when we set up Postgres and use it to integrate applications with each other.

If developers had discipline they could of course crate tables with appendable logs of data, marked with a partition, that consumers could process from with basically same guarantees as with Kafka.

But that is not how it works in reality.


We work with a client who has requested a Kafka cluster. They can’t really say why, or what they need it for, but the now have a very large cluster which doesn’t do much. I know why they want it, same reason why they “need” Kubernetes. So far they use it as a sort of message bus.

It’s not that there’s anything wrong with Kafka, it’s a very good product and extremely robust. Same with Kubernetes, it has it uses and I can’t fault anyone for having it as a consideration.

My problem is when people ignore how capable modern servers are, and when developers don’t see the risks in building these highly complex systems, if something much simpler would solve the same problem, only cheaper and safer.


I've also had much more success with "queues in postgres" than "queues in kafka". I'm sure the use cases exist where kafka works better, but you have to architect it so carefully, and there are so many ways to trip up. Whereas most of your team probably already understands a RDMS and you're probably already running one reliably.


What velocity have you achieved with postgres, in terms of # messages/sec where messages could range between 1KB-100KB in size?


Yup. That's what Kafka excels at and where it scales to throughput way beyond Postgres.

But there are many many projects where Kafka is used for low value event sourcing stuff where a SQL DB could be easier.


I had a great time with Kafka for prototyping. Being able to push data from a number of places, have mulitple consumers able to connect, go back and forth though time, add and remove independent consumer groups. Ran in pre-production very reliably too, for years.

But for a production-grade version of the system I'm going with SQL and, where needed, IaC-defined SQS.


> Show me that you can't do what you need with a beefy Postgres.

I've found this question very useful when pitched any esoteric database.


I see posts like this a lot, and it makes me wonder what the heck you were using Kafka for that Postgres could handle, yet you had dozens of clusters? I question if you actually ever used Kafka or just operated it? Sure anyone can follow the "build a queue on a database pattern" but it falls over at the throughputs that justify Kafka. If you have a bunch of trivial 10tps workloads, of course a distributed system is overkill.


They didn’t say that the Kafka clusters they personally ran could have been handled with Postgres instead.

They first gave their credentials by mentioning their experience.

Then they basically said ”given what I know about Kafka, with my experience, I require other people who ask for it to show me that they really need it before I accommodate them - often a beefy Postgres is enough”.


A major ecommerce site. We had hundreds of thousands of messages/s but for most use cases YAGNI.


Why Postgres? Why not redis, rabbitMQ or even Kafka itself?


Not those other tools because you can't achieve Postgres functionality with those other tools generally.

Postgres can pretend to be Redis, RabbitMQ, and Kafka, but redis, RabbitMQ, and Kafka would have a hard time pretending to be Postgres.

Postgres has the best database query language man has invented so far (AFAIK), well reasoned persistence and semantics, and as of recently partitioning features to boot and lots of addons to support different usecases. Despite all this postgres is mostly Boring Technology (tm) and easily available as well as very actively developed in the open, with a enterprise base that does consulting first and usually upstreams improvements after some time (2nd Quadrant, EDB, Citus, TimescaleDB).

The other tools win on simplicity for some (I'd take managing a PostgreSQL cluster over RMQ or Kafka any day), but for other things especially feature wise Postgres (and it's amalgamation of mostly-good-enough to great features) wins IMO.


Your comment on Boring Technology (tm) made me smile... At Timescale as you'll see in this blog we are happy to celebrate the boring, there's an awful lot to be said for it :) https://blog.timescale.com/blog/when-boring-is-awesome-build...

For transparency: I'm Timescale's Community Manager


I agree so much I wrote an article about that https://link.medium.com/eQqVhuvRtkb

"You really can replace Kafka with a database."


> My answer to anyone who asks for kafka: Show me that you can't do what you need with a beefy Postgres.

...

You could probably hack any database to perform any task, but why would you? Use the right tool for the right task, not one tool for all tasks.

If the right tool is a relational database, then use Postgres/$OTHER-DATABASE

If the right tool is distributed, partitioned and replicated commit log service, then use Kafka/$OTHER-COMMIT-LOG

Not sure why people get so emotional about the technologies the know the best. Sure you could hack Postgres to be a replicated commit log, but I'm sure it'll be easier to just throw in Kafka instead.


Something I find peculiar about this truth is that Postgres itself is based on the write ahead log structure for sequencing statements, replication, audit logging.

It feels like two different lenses onto the same reality.


Can you kinda high level the setup/processes for making Postgres a replacement for Kafka? I've not attempted such a thing before, and wonder about things like expiration/autodeletion, etc. Does it need to be vacuumed often, and is that a problem?


Honest question, how do you expire content in Postgres? Every time I start to use it for ephemeral data I start to wonder if I should have used TimescaleDB or if I should be using something else...


You likely already know that TimescaleDB is an extension to PostgreSQL, so you get everything you'd get with PostgreSQL plus the added goodies of TimescaleDB. All that said, you can drop (or detach) data partitions in PostgreSQL (however you decide to partition...) Does that not do the trick for your use case, though? https://www.postgresql.org/docs/10/ddl-partitioning.html

Transparency: I work for Timescale


Maybe something like https://github.com/citusdata/pg_cron or if it's not a rolling time window, just trigger functions that get called whenever some expression in SQL is true.


Record an event log and reliably connect it to a variety of 3rd party sinks using off-the-shelf services


using a sql db for push/pop semantic feels like using a hammer to squash a bug.. How would you model queues & partitions with ordering guarantees with pg ?


With transactions, and stored procedures if that helps ;). Redis also seems well suited to the use cases I've seen for Kafka. Kafka must have capabilities beyond those use cases, and I've sometimes wondered what they are.


sql has many conveniences for doing so, it wouldn't be much work.

> using a hammer to squash a bug..

Agreed - but Kafka is a much much bigger hammer. SES/Az Queues are also good choices.


Read 500m messages in less than an hour?


It depends. Are your reads also writes (e.g. simulating a job queue with a "claimed" state update or a "select for update ... skip locked")? If not, 500m should be no sweat, even on crappy hardware with minimal tuning. If so, things get a little trickier, and you may have to introduce other tools or a clustering solution to get that throughput with postgres.


If you're in the Go ecosystem, Gazette [0] offers transactional integrations [1] with remote DB's for stateful processing pipelines, as well as local stores for embedded in-process state management.

It also natively stores data as files in cloud storage. Brokers are ephemeral, you don't need to migrate data between them, and you're not constrained by their disk size. Gazette defaults to exactly-once semantics, and has stronger replication guarantees (your R factor is your R factor, period -- no "in sync replicas").

Estuary Flow [2] is building on Gazette as an implementation detail to offer end-to-end integrations with external SaaS & DB's for building real-time dataflows, as a managed service.

[0]: https://github.com/gazette/core [1]: https://gazette.readthedocs.io/en/latest/consumers-concepts.... [2]: https://github.com/estuary/flow


Small suggestion: If Gazette is ready for a wider adoption, it may be useful to bump it up to 1.0 as a signal of confidence.


The issue of running a transaction that spans multiple heterogeneous systems is usually solved with a 2 phase commit. The "jobs" abstraction from the article looks similar to the "coordinator" in 2PC. The article does not talk about how they achieve fault tolerance in case the "job" crashes inbetween the two transactions. Postgres supports the XA standard, which might help with this. Kafka does not support it.


I can't speak to their solution, but when solving an equivalent problem within Gazette, where you desire a distributed transaction that includes both a) published downstream messages, and b) state mutations in a DB, the solution is to 1) write downstream messages marked as pending a future ACK, and 2) encode the ACK you _intend_ to write into the checkpoint itself.

Commit the checkpoint alongside state mutations in a single store transaction. Only then do you publish ACKs to all of the downstream streams.

Of course, you can fail immediately after commit but before you get around to publishing all of those ACKS. So, on recovery, the first thing a task assignment does is publish (or re-publish) the ACKs encoded in the recovered checkpoint. This will either 1) provide a first notification that a commit occurred, or 2) be an effective no-op because the ACK was already observed, or 3) roll-back pending messages of a partial, failed transaction.

More details: https://gazette.readthedocs.io/en/latest/architecture-exactl...


IIRC ~2 decades ago we were dequeueing from JMS, updating RDBMS, and then enqueuing all under the cover of JTA (Java Transaction API) for atomic ops.

https://docs.oracle.com/en/middleware/fusion-middleware/12.2...

Using a very broad definition of ‘noSQL’ approach that would include solutions like Kafka, the issue becomes clear: A 2PC or ‘distributed transaction manager’ approach ala JTA comes with a performance/scalability cost — arguably a non-issue for most companies who don’t operate at LinkedIn scale (where Kafka was created).


And MySQL didn’t yet support transactions!


Actually, Innodb, which supports transactions, has been bundled with MySQL since 2001, but existed before then. It became the default storage engine in 2010.


> IIRC ~2 decades ago


The solution I've seen is to write the message you want to send to the DB along with the transaction, and have some separate thread that tries to send the messages to Kafka.

Although, from various code bases I've seen, a lot of people just don't seem to worry about the possibility of data loss.


Solving both these problems is the best hidden feature of Vitess - Messaging. You can ack a message, do data work, and add to a destination queue all in a single transaction. You can do selective requeuing, and infinite parallelization, since it isn't sequential processing. It's easy to introspect, debug, and have metrics for since it's just MySQL. Vitess lets you shard horizontally, so it can handle any QPS, and has at YouTube. It also supports native message priority. All of that, plus your infrastructure is simplified because you don't have to maintain a separate data store from your main RDBMS. Highly recommended. https://vitess.io/docs/reference/features/messaging/


What you want is called Apache Pulsar. Log when you need it, work queue when you need that instead.

As for ensuring transactional consistency that isn't so bad, you can use an table to track offset inserts making sure you verify from that before you update consumer offsets (or Pulsar subscription if you go that route).


(1) seems best solved by having a 'on heavy task, publish to a secondary topic'. This is good if you have flaky messages that need to be retried in the background, without blocking all of your 'good' messages.

(2) this problem should be avoided in general by just having idempotent services. Just as a hard restriction, forever, build services to be idempotent. It should be the exception to have a non-idempotent service, and it should be carefully understood.

That said, if you have (1) as a consistent issue, like if every message is flaky, kafka isn't the right solution. Postgres-based queues are perfect for this because you can examine the table as a whole, making more informed decisions about what you want to process (or not process).


This seems like a much simpler solution. The "heavy task queue" would process tasks that could simply be retried until they're done.

Maybe I'm misunderstanding the article, but having "Job tasks" both insert another Job to run as well as updating DB state, and then having the executor pick up the previously inserted Job (whos only purpose is to send a kafka message) seems overly complex. I'm having trouble seeing why this is needed.


I didn't quite follow their explanation for why producing to Kafka first didn't/wouldn't work for them (db state potentially being out of sync requiring continuous messaging until fixed).


it's a chicken and egg problem

you can either send a kafka message but potentially not commit the db transaction (i.e. an event is published for which the action did not actually occur) or commit the db transaction and potentially not send the kafka message

it sounds like they implemented something like the Transactional Outbox pattern https://microservices.io/patterns/data/transactional-outbox....

i.e. you use the db transaction to also commit a record of your intent to send a kafka message - you can then move the actual event sending to a separate process and implement at-least-once semantics

This is the job queueing system they described in the article


Their solution seems like a "produce to Kafka first" but with extra steps.

Regarding:

When we produce first and the database update fails (because of incorrect state) it means in the worst case we enter a loop of continuously sending out duplicate messages until the issue is resolved

I don't understand where either 1) the incorrect state or 2) the need to continuously send duplicate messages come from.

Regarding:

The Job might still fail during execution, in which case it’s retried with exponential backoff, but at least no updates are lost. While the issue persists, further state change messages will be queued up also as Jobs (with same group value). Once the (transient) issue resolves, and we can again produce messages to Kafka, the updates would go out in logical order for the rest of the system and eventually everyone would be in sync.

This is the part that is equivalent to Kafka-first, except with all the extra steps of a job scheduling, grouping, tracking, and execution framework on top of it.


When we produce first and the database update fails (because of incorrect state) it means in the worst case we enter a loop of continuously sending out duplicate messages until the issue is resolved

the article does not explain things very clearly, but I think this is describing the problem rather than their solution

Our high level idea was:

- Insert “work” into a table that acts like a queue

- “Executor” takes “work” from DB and runs it

...

A Job is an abstraction for a scheduled DB backed async activity

...

How did we solve the #2 state problem?

By recording Jobs in the service database we can do the state update within the same transaction as inserting a new Job. Combining this with a Job that produces the actual Kafka message, allows us to make the whole operation transactional. If either of the parts fails, updating the data or scheduling the job, both get rolled back and neither happens.

I think this is describing basically a Transactional Outbox

i.e. "jobs" are recorded in the postgres db as part of the same db transaction as the business logic actions

the difference from Kafka-first is that if the app decides to rollback the business logic then the message hasn't already sent


Very interested to hear how people here overcome the limits of kafka for ordered events delivery in real world, and what those were.


I feel as if you're using Kafka and expect guaranteed ordering, then you're using the wrong tool. At best you have guaranteed ordering per partition but then you've tied your ordering/keying strategy to the amount of partitions you've enabled ... which may not ideal.

But, that's speaking from my light experience with it. I'm also curious if there's a better way :-)


Not for Kafka, but we are building Flow [1] to offer deterministic ordering, even across multiple logical and physical partitions, and regardless of whether you're back-filling over history or processing in real-time.

This ends up being required for one of our architectural goals, which is fully repeatable transformations: You must be able to model a transactional decision as a Flow derivation (like "does account X have funds to transfer Y to Z ?", and if you create a _copy_ of that derivation months later, get the exact same result.

Under the hood (and simplifying a bit) Flow always does a streaming shuffled read to map events from partitions to task shards, and each shard maintains a min-heap to process events in their ~wall-time order.

This also avoids the common "Tyranny of Partitioning", where your upstream partitioning parallelism N also locks you into that same task shard parallelism -- a big problem if tasks manage a lot of state. With a read-time shuffle, you can scale them independently.

[1]: https://github.com/estuary/flow


It depends on what the events are, how they are structured.

You get guaranteed ordering at the partition level.

Items are partitioned by key so you also get guaranteed ordering for a key.

If you have guaranteed ordering for a key you can’t get total ordering across all keys but you can get eventual consistency across the keys.

Ultimately if you want ordering you have to design around being eventually consistent.

I don’t read a lot of papers but Leslie Lamports Time, Clocks, and the Ordering of Events in a Distributed System gave me a lot of insight in to the constraints. https://lamport.azurewebsites.net/pubs/time-clocks.pdf


For kafka the default is round robin in each partition. A hash key can let you direct the work to particular partitions. Each partition is guaranteed ordering. Also only one consumer in a consumer group can remove an item from a partition at a time. No two consumers in a consumer group will get the same message.


It's round robin if no key specified otherwise it uses murmur2 hash of the key so the partition for a key is always deterministic.

Just checking the docs it appears the round robin is no longer true after Confluent Platform 5.4. After 5.4 it looks like if no key specified the partition is assigned based on the batch being processed.

> If the key is provided, the partitioner will hash the key with murmur2 algorithm and divide it by the number of partitions. The result is that the same key is always assigned to the same partition. If a key is not provided, behavior is Confluent Platform version-dependent:...

https://docs.confluent.io/platform/current/clients/producer....


That is new! Guess I missed that. Thank you for the heads up. Wonder why the do not have a flag to set it to the old way, maybe they do I will have to dig through the docs. I could see some processes that could have that built in as a dependency and now that would change.


Yep, IIRC around Kafka 2.6? the default partitioner changed from RoundRobin to preferring existing open record batches.


At lower data volumes (<10,000 events per minute) it’s perfectly feasible to just use single partition topics and then ordered event delivery is no problem at all. If a consuming service has processing times that means horizontal scaling is necessary then the topic can be repartitioned into a new topic with multiple partitions and the processing application can handle sorting the outputted data to some SLA.


put a timestamp in the message. use a conflict free replicated data type


If you need a guaranteed ordering, timestamps and distributed systems are not friends. See logical / vector clocks.


temporal.io provides much higher level abstraction for building asynchronous microservices. It allows one to model async invocations as synchronous blocking calls of any duration (months for example). And the state updates and queueing are transactional out of the box.

Here is an example using Typescript SDK:

   async function main(userId, intervals){ 
     // Send reminder emails, e.g. after 1, 7, and 30 days
   
     for (const interval of intervals) {
       await sleep(interval * DAYS);
       // can take hours if the downstream service is down
       await activities.sendEmail(interval, userId); 
     } 
     // Easily cancelled when user unsubscribes
   } 

Disclaimer: I'm one of the creators of the project.


Couldn't the "state" issue be solved simply by enclosing the database save and kafka message send in the same database transaction block and only doing the kafka send if it reaches that part of the code?


Should you not consider every article as a divine insight. Sixfold is nowhere close from being a classy tech comp.


The only time I used Kafka, it was involuntarily (included for the sake of fashion in some complicated IBM product, where it hid among WebSphere, DB2 and other bigger elephants) and it ran my server out of disk space because due to a bug ridiculously massive temporary files weren't erased. Needless to say, I wasn't impressed: just one more hazard to worry about.


due to a bug

Data retention time is Kafka config 101. Are you sure it was a bug?


Considering how half-assed Kafka is in general, that it needs all clients code changes when Kafka servers are upgraded. It is very likely that user hit Kafka bug.


> it needs all clients code changes when Kafka servers are upgraded

It absolutely doesn't. Message formats predating Kafka 0.11 are only just being deprecated as of Kafka 3.0, and won't be dropped until Kafka 4.0.

Now, if you want to use new shiny features (like cooperative sticky assignors to minimise consumer group stop the world rebalance pauses), then yes, you might need to upgrade clients.

But otherwise, you can still happily use 0.8 clients with your upgraded brokers.

https://cwiki.apache.org/confluence/display/KAFKA/KIP-724%3A...


Citation needed.

New server versions are protocol backwards compatible so I'm not sure what you're referring to.

Ofc, if you downgraded a server without changing the client, that may cause problems, but tbh that's hardly Kafka's fault.


What was the bug, out of curiosity?




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: