Issues we've encountered while building a Kafka based data processing pipeline

anotherhue · on Oct 18, 2021

I ran a few dozen kafka clusters at MegaCorp in a previous life.

My answer to anyone who asks for kafka: Show me that you can't do what you need with a beefy Postgres.

ThinkBeat · on Oct 18, 2021

This is exactly how I feel about it.

A while back I was on team building a non-critical, low volume application. It just involves people sending a message basically. (there is more too it).

The consultants said he had to use Kafka because the messages could come in really fast.

I said we should stick with Postgres.

No, they said, we really need Kakfa to be able to handle this.

Then I went and spun up Postgres on my work laptop (nothing special), and got a loaner to act as a client. I simulated about 300% more traffic than we had any chance of getting. It worked fine. (did tax my poor work laptop).

No, we could not risk it, when we use Kafka we are safe.

Took it to management, Kafka won since Buzzword.

Now of course we have to write a process to feed the data into Postgres. After all its what everything else depends on

jghn · on Oct 18, 2021

People tend to not realize how big "at scale" problems really are. Instead anything at a scale at the edge of their experience is "at scale" and they reach for the tools they've read one is supposed to use in those situations. It makes sense, people don't know what they don't know.

And thus we have a world where people have business needs that could be powered by my low end laptop but solutions inspired by the megacorps.

jerf · on Oct 18, 2021

The GHz aspect of Moore's law died over a decade ago, and I suppose it's fair to say most other stuff has also slowed down, but if you've got a job that is embarrassingly parallel, which a lot of these "big data" jobs are, people badly underestimate how much progress there has been in the server space even so in the last 10 years if they're not paying attention. What was "big data" in 2011 can easily be "spin up a single 32-core instance with 1TB RAM for a couple of hours" in 2021. Even beyond the "big data" that my laptop comfortably handles.

I'm slowly wandering into a data science role lately, and I've been dealing with teams who are all kinds of concerned about whether or not we can handle the sheer, overwhelming volume of their (summary) data. "Well, let's see, how much data are we talking about?" "Oh, gosh, we could generate 200 or 300 megabytes a day." (Of uncompressed JSON.) Well, you know, if I have to bust out a second Raspberry Pi I'll be sure to charge it to your team.

The funny thing is that some of these teams have the experience that they ought to know better. They are legitimately running cloud services with dozens of large nodes continually running at high utilization and chewing through gigabytes of whatever per second. In their own worlds they would absolutely know that a couple hundred megabytes is nothing. They'll often have known places in their stack where they burn through a few hundred megabytes in internal API calls or something unnecessarily, and it will barely rise to the level of a P3 bug, quite legitimately so. But when they start thinking in terms of (someone else's) databases it's like they haven't updated their sense of size since 2005.

yongjik · on Oct 18, 2021

You are not thinking enterprisey enough. Everything is Big Scale if you add enough layers, because all these overheads add up, to which the solution is of course more layers.

sparsely · on Oct 18, 2021

There are various Kafka to Postgres adaptors. Of course, now you're running 3 bits of software, with 3 bottlenecks, instead of just 1.

dtech · on Oct 18, 2021

Were you advocating an endpoint + DB or different apps directly writing into a shared DB? The latter is not really a good idea for numerous reasons. Kafka a - potentially overkill - replacement for REST or whatever, not for your DB.

i_like_waiting · on Oct 18, 2021

But can you stream data from lets say MS SQL directly into Postgres? Easiest way I found is Kafka, I would love some simple python script instead

BiteCode_dev · on Oct 18, 2021

Stream? Unless you have really hight traffic, put that in a python script while loop that regularly check for new rows, and it will be fine.

If you want to get fancy, db now have pub/sub.

There are use case for stream replication, but you need way more data than 99% of biz have.

Spivak · on Oct 19, 2021

You don’t want stream replication for scale, you want it for availability and durability. The number of times replication has saved my ass is too damn high.

And you want Kafka (or something like it) the moment you need two processes handling updates which again you probably want for availability reasons so a crash doesn’t stop the stream.

You also don’t catch deletes or updates with this setup. But there’s a million db to Kafka connectors that read the binlog or pretend to be a replica to handle this.

BiteCode_dev · on Oct 19, 2021

Sure, if you need high availability + replication cross db, go for Kafka, but that's the thread point: it's not a common use case.

Most people don't need high availability, and most replication use the same db which comes with tooling for that.

superyesh · on Oct 18, 2021

>My answer to anyone who asks for kafka: Show me that you can't do what you need with a beefy Postgres.

Sorry thats just a clickbait-y statement. I love Postgres, try handling 100-500k rps of data coming in from various sources reading and writing to it. You are going to get bottlenecked on how many connections you can handle, you will end up throwing pgBouncers on top of it.

Eventually you will run out of disk, start throwing more in.

Then end up in VACCUUM hell all while having a single point of failure.

While I agree Kafka has its own issues, it is an amazing tool to a real scale problem.

dwohnitmok · on Oct 18, 2021

I think anotherhue would agree that half a million write requests per second counts as a valid answer to "you can't do what you need with a beefy Postgres," but that is also a minority of situations.

IggleSniggle · on Oct 18, 2021

It's just hard to know what people mean when they say "most people don't need to do this." I was sitting wondering about a similar scale (200-1000 rps), where I've had issues with scaling rabbitmq, and have been thinking about whether kafka might help.

Without context provided, you might think: "oh, here's somebody with kafka and postgres experience, saying that postgres has some other super powers I hadn't learned about yet. Maybe I need to go learn me some more postgres and see how it's possible."

It would be helpful for folks to provide generalized measures of scale. "Right tool for the job," sure, but in the case of postgres, it often feels like there are a lot of incredible capabilities lurking.

I don't know what's normal for day-to-day software engineers anymore. Was the parent comment describing 100-500 rps really "a minority of situations?" I'm sure it is for most businesses. But is it "the minority of situations" that software engineers are actively trying to solve in 2021? I have no clue.

dwohnitmok · on Oct 18, 2021

Note superyesh was talking about 100 to 500 thousand requests per second. Your overall question stands, but the scale superyesh was talking about is very different and I am quite confident superyesh's scale is definitely in the minority.

IggleSniggle · on Oct 19, 2021

Oops, yes, was omitting the intended "k", totally skewing the scale of infrastructure my comment was intending to describe. Very funny, ironic. Unfortunately I can no longer edit that comment.

Cederfjard · on Oct 18, 2021

I’m not sure if you’re omitting the k in your numbers, or missed it in the other comment? Do you mean 100-500 and 200-1000, or 100 000-500 000 and 200 000-1 000 000?

IggleSniggle · on Oct 19, 2021

Yes, quite ironically, I accidentally forgot the "k" in my numbers. Oops.

evntdrvn · on Oct 18, 2021

that seems like an awfully low number to be running into issues with RabbitMQ ?

fernandotakai · on Oct 18, 2021

at least for me, every system has its place.

i love posgresql, but i would not use it to replace a rabbitmq instance -- one is an RDBMS, the other is a queue/event system.

"oh but psql can pretend to be kafka/rabbitmq!" -- sure, but then you need to add tooling to it, create libraries to handle it, and handle all the edge cases.

with rmq/kafka, there already a bunch of tools to handle the exact case of a queue/event system.

zbentley · on Oct 20, 2021

I think having ad hoc query capabilities on your job queue/log (not possible with rabbit and only possible by running extra software like KQL with Kafka, and even then at a full-table-scan cost equivalent) is a benefit to using postgres that should not be overlooked. For monitoring and debugging message broker behavior SQL is a very powerful tool.

I say this as someone squarely in the "bigger than can work with an RDBMS" message processing space, too: until you are at that level, you need less tooling (e.g. read replicas, backups, very powerful why-is-it-on-fire diagnostics informed by decades of experience), and get generally higher reliability, with postgres as a broker.

doliveira · on Oct 18, 2021

Yeah, once you do have to scale a relational database you're in for a world of pain. Band-aid after band-aid... I very much prefer to just start with Kafka already. At the very least you'll have a buffer to help you gain some time when the database struggles.

anotherhue · on Oct 18, 2021

Indeed! That sounds absolutely like it requires a real pub/sub. 'Actually needing it' is the exception in my experience though.

sumtechguy · on Oct 18, 2021

I have done both patterns.

Kafka has its use case. Databases have theirs. You can make a DB do what kafka does. But you also add in the programming overhead of getting the DB semantics correct to make an event system. When I see people saying 'lets put the DB into kafka' I make the exact same argument. You will spend more time making kafka act like a database and getting the semantics right. Kafka is more of a data/event transportation system. A DB is an at rest data store that lets you manipulate the data. Use them to their strengths or get crushed by weird edge cases.

tengbretson · on Oct 18, 2021

I'd argue that having the event system semantics layered on top of a sql database is a big benefit when you have an immature product, since you have an incredibly powerful escape hatch to jump in and fire off a few queries to fix problems. Kafka's visibility for debugging is pretty poor in my experience.

sumtechguy · on Oct 18, 2021

My issues typically with layering an event system on top of a db is replication and ownership of that event. Kafka makes some very nice guarantees about giving best attempt to make sure only one process works on something at a time inside of a consumer group. You have to build a system in the db using locks and different things that are poor substitutes.

If you are having trouble debugging kafka you could use a connector to put the data into the database/file to also debug, or a db streamer. You can also use the built in cli tools to scroll along. I have had very good luck with using both of those to find out what is going wrong. Also kafka will basically by default keep all the messages for the past 7 days so you can play it back if you need to by moving the consumer offsets. IF you are trying to use kafka like a db and change messages on the fly you will have a bad time of it. Kafka is meant to be a here is something that happened and some data. Changing that data after the fact would in the kafka world be another event. In some types of systems that is a very desirable property (such as audit heavy cultures, banks, medical, etc). Now also kafka can be a real pain if you are debugging and messup a schema or produce a message that does not fit into the consumer. Getting rid of that takes some poor cli trickery when in a db it is a delete call.

Also kafka is meant for a distributed system event based worker systems (typically some sort of microservice style system). If you are early on you more than likely not building that yet. Just dumping something into a list in a table and polling on that list is a very effective way for something that is early on or maybe even forever. But once you add in replication and/or multiple clients looking at that same task list table you will start to see the issues quickly.

Using an event system like a db system and yes it will feel broken. Also vice versa. You can do it. But like you are finding out those edge cases are a pain and make you feel like 'bah its easier to do it this way'. In some cases yes. In your case you have a bad state in your event data. You are cleaning it up with some db calls. But what if instead you had an event that did that for you?

BiteCode_dev · on Oct 18, 2021

Well, if you want events, you have lighter alternative. Redis pub/sub, crossbar... Even rabbitMQ is lighter.

If you need queuing, you have libs like celery.

You don't need to go full kafka

sumtechguy · on Oct 19, 2021

I do not disagree. Kafka is kind of a monster to fully configure correctly. Once they get rid of zookeeper it may be nicer to spin up and 'set it and forget it' sort of thing like the others.

emerongi · on Oct 18, 2021

Redis - it's probably in your stack already and fills all the use-cases.

anotherhue · on Oct 18, 2021

Agreed. Redis pub/sub is underrated.

lmilcin · on Oct 18, 2021

Just because you can do it with Postgres doesn't mean it is the best tool for the job.

Sometimes the restrictions placed on the user are as important. Kafka presents a specific interface to the user that causes users to build their applications in certain way.

While you can replicate almost all functionality of Kafka with Postgres (except for performance, but hardly anybody needs as much of it), we all know what we end up with when we set up Postgres and use it to integrate applications with each other.

If developers had discipline they could of course crate tables with appendable logs of data, marked with a partition, that consumers could process from with basically same guarantees as with Kafka.

But that is not how it works in reality.

mrweasel · on Oct 18, 2021

We work with a client who has requested a Kafka cluster. They can’t really say why, or what they need it for, but the now have a very large cluster which doesn’t do much. I know why they want it, same reason why they “need” Kubernetes. So far they use it as a sort of message bus.

It’s not that there’s anything wrong with Kafka, it’s a very good product and extremely robust. Same with Kubernetes, it has it uses and I can’t fault anyone for having it as a consideration.

My problem is when people ignore how capable modern servers are, and when developers don’t see the risks in building these highly complex systems, if something much simpler would solve the same problem, only cheaper and safer.

sparsely · on Oct 18, 2021

I've also had much more success with "queues in postgres" than "queues in kafka". I'm sure the use cases exist where kafka works better, but you have to architect it so carefully, and there are so many ways to trip up. Whereas most of your team probably already understands a RDMS and you're probably already running one reliably.

gilbetron · on Oct 18, 2021

What velocity have you achieved with postgres, in terms of # messages/sec where messages could range between 1KB-100KB in size?

lima · on Oct 18, 2021

Yup. That's what Kafka excels at and where it scales to throughput way beyond Postgres.

But there are many many projects where Kafka is used for low value event sourcing stuff where a SQL DB could be easier.

afandian · on Oct 18, 2021

I had a great time with Kafka for prototyping. Being able to push data from a number of places, have mulitple consumers able to connect, go back and forth though time, add and remove independent consumer groups. Ran in pre-production very reliably too, for years.

But for a production-grade version of the system I'm going with SQL and, where needed, IaC-defined SQS.

aqme28 · on Oct 18, 2021

> Show me that you can't do what you need with a beefy Postgres.

I've found this question very useful when pitched any esoteric database.

merarischroeder · on Oct 19, 2021

I agree so much I wrote an article about that https://link.medium.com/eQqVhuvRtkb

"You really can replace Kafka with a database."

ceencee · on Oct 18, 2021

I see posts like this a lot, and it makes me wonder what the heck you were using Kafka for that Postgres could handle, yet you had dozens of clusters? I question if you actually ever used Kafka or just operated it? Sure anyone can follow the "build a queue on a database pattern" but it falls over at the throughputs that justify Kafka. If you have a bunch of trivial 10tps workloads, of course a distributed system is overkill.

Cederfjard · on Oct 18, 2021

They didn’t say that the Kafka clusters they personally ran could have been handled with Postgres instead.

They first gave their credentials by mentioning their experience.

Then they basically said ”given what I know about Kafka, with my experience, I require other people who ask for it to show me that they really need it before I accommodate them - often a beefy Postgres is enough”.

anotherhue · on Oct 18, 2021

A major ecommerce site. We had hundreds of thousands of messages/s but for most use cases YAGNI.

alephu5 · on Oct 18, 2021

Why Postgres? Why not redis, rabbitMQ or even Kafka itself?

hardwaresofton · on Oct 18, 2021

Not those other tools because you can't achieve Postgres functionality with those other tools generally.

Postgres can pretend to be Redis, RabbitMQ, and Kafka, but redis, RabbitMQ, and Kafka would have a hard time pretending to be Postgres.

Postgres has the best database query language man has invented so far (AFAIK), well reasoned persistence and semantics, and as of recently partitioning features to boot and lots of addons to support different usecases. Despite all this postgres is mostly Boring Technology (tm) and easily available as well as very actively developed in the open, with a enterprise base that does consulting first and usually upstreams improvements after some time (2nd Quadrant, EDB, Citus, TimescaleDB).

The other tools win on simplicity for some (I'd take managing a PostgreSQL cluster over RMQ or Kafka any day), but for other things especially feature wise Postgres (and it's amalgamation of mostly-good-enough to great features) wins IMO.

LoriP · on Oct 18, 2021

Your comment on Boring Technology (tm) made me smile... At Timescale as you'll see in this blog we are happy to celebrate the boring, there's an awful lot to be said for it :) https://blog.timescale.com/blog/when-boring-is-awesome-build...

For transparency: I'm Timescale's Community Manager

capableweb · on Oct 18, 2021

> My answer to anyone who asks for kafka: Show me that you can't do what you need with a beefy Postgres.

...

You could probably hack any database to perform any task, but why would you? Use the right tool for the right task, not one tool for all tasks.

If the right tool is a relational database, then use Postgres/$OTHER-DATABASE

If the right tool is distributed, partitioned and replicated commit log service, then use Kafka/$OTHER-COMMIT-LOG

Not sure why people get so emotional about the technologies the know the best. Sure you could hack Postgres to be a replicated commit log, but I'm sure it'll be easier to just throw in Kafka instead.

chartpath · on Oct 21, 2021

Something I find peculiar about this truth is that Postgres itself is based on the write ahead log structure for sequencing statements, replication, audit logging.

It feels like two different lenses onto the same reality.

silisili · on Oct 18, 2021

Can you kinda high level the setup/processes for making Postgres a replacement for Kafka? I've not attempted such a thing before, and wonder about things like expiration/autodeletion, etc. Does it need to be vacuumed often, and is that a problem?

dionian · on Oct 18, 2021

Honest question, how do you expire content in Postgres? Every time I start to use it for ephemeral data I start to wonder if I should have used TimescaleDB or if I should be using something else...

LoriP · on Oct 18, 2021

You likely already know that TimescaleDB is an extension to PostgreSQL, so you get everything you'd get with PostgreSQL plus the added goodies of TimescaleDB. All that said, you can drop (or detach) data partitions in PostgreSQL (however you decide to partition...) Does that not do the trick for your use case, though? https://www.postgresql.org/docs/10/ddl-partitioning.html

Transparency: I work for Timescale

chartpath · on Oct 21, 2021

Maybe something like https://github.com/citusdata/pg_cron or if it's not a rolling time window, just trigger functions that get called whenever some expression in SQL is true.

mvc · on Oct 18, 2021

Record an event log and reliably connect it to a variety of 3rd party sinks using off-the-shelf services

bsaul · on Oct 18, 2021

using a sql db for push/pop semantic feels like using a hammer to squash a bug.. How would you model queues & partitions with ordering guarantees with pg ?

throwaway81523 · on Oct 18, 2021

With transactions, and stored procedures if that helps ;). Redis also seems well suited to the use cases I've seen for Kafka. Kafka must have capabilities beyond those use cases, and I've sometimes wondered what they are.

anotherhue · on Oct 18, 2021

sql has many conveniences for doing so, it wouldn't be much work.

> using a hammer to squash a bug..

Agreed - but Kafka is a much much bigger hammer. SES/Az Queues are also good choices.

SergeAx · on Oct 19, 2021

Read 500m messages in less than an hour?

zbentley · on Oct 20, 2021

It depends. Are your reads also writes (e.g. simulating a job queue with a "claimed" state update or a "select for update ... skip locked")? If not, 500m should be no sweat, even on crappy hardware with minimal tuning. If so, things get a little trickier, and you may have to introduce other tools or a clustering solution to get that throughput with postgres.

jgraettinger1 · on Oct 18, 2021

If you're in the Go ecosystem, Gazette [0] offers transactional integrations [1] with remote DB's for stateful processing pipelines, as well as local stores for embedded in-process state management.

It also natively stores data as files in cloud storage. Brokers are ephemeral, you don't need to migrate data between them, and you're not constrained by their disk size. Gazette defaults to exactly-once semantics, and has stronger replication guarantees (your R factor is your R factor, period -- no "in sync replicas").

Estuary Flow [2] is building on Gazette as an implementation detail to offer end-to-end integrations with external SaaS & DB's for building real-time dataflows, as a managed service.

[0]: https://github.com/gazette/core [1]: https://gazette.readthedocs.io/en/latest/consumers-concepts.... [2]: https://github.com/estuary/flow

ivanr · on Oct 18, 2021

Small suggestion: If Gazette is ready for a wider adoption, it may be useful to bump it up to 1.0 as a signal of confidence.

fafle · on Oct 18, 2021

The issue of running a transaction that spans multiple heterogeneous systems is usually solved with a 2 phase commit. The "jobs" abstraction from the article looks similar to the "coordinator" in 2PC. The article does not talk about how they achieve fault tolerance in case the "job" crashes inbetween the two transactions. Postgres supports the XA standard, which might help with this. Kafka does not support it.

jgraettinger1 · on Oct 18, 2021

I can't speak to their solution, but when solving an equivalent problem within Gazette, where you desire a distributed transaction that includes both a) published downstream messages, and b) state mutations in a DB, the solution is to 1) write downstream messages marked as pending a future ACK, and 2) encode the ACK you _intend_ to write into the checkpoint itself.

Commit the checkpoint alongside state mutations in a single store transaction. Only then do you publish ACKs to all of the downstream streams.

Of course, you can fail immediately after commit but before you get around to publishing all of those ACKS. So, on recovery, the first thing a task assignment does is publish (or re-publish) the ACKs encoded in the recovered checkpoint. This will either 1) provide a first notification that a commit occurred, or 2) be an effective no-op because the ACK was already observed, or 3) roll-back pending messages of a partial, failed transaction.

More details: https://gazette.readthedocs.io/en/latest/architecture-exactl...

eternalban · on Oct 18, 2021

IIRC ~2 decades ago we were dequeueing from JMS, updating RDBMS, and then enqueuing all under the cover of JTA (Java Transaction API) for atomic ops.

https://docs.oracle.com/en/middleware/fusion-middleware/12.2...

Using a very broad definition of ‘noSQL’ approach that would include solutions like Kafka, the issue becomes clear: A 2PC or ‘distributed transaction manager’ approach ala JTA comes with a performance/scalability cost — arguably a non-issue for most companies who don’t operate at LinkedIn scale (where Kafka was created).

zeckalpha · on Oct 18, 2021

And MySQL didn’t yet support transactions!

redis_mlc · on Oct 19, 2021

Actually, Innodb, which supports transactions, has been bundled with MySQL since 2001, but existed before then. It became the default storage engine in 2010.

zeckalpha · on Oct 19, 2021

> IIRC ~2 decades ago

nitwit005 · on Oct 18, 2021

The solution I've seen is to write the message you want to send to the DB along with the transaction, and have some separate thread that tries to send the messages to Kafka.

Although, from various code bases I've seen, a lot of people just don't seem to worry about the possibility of data loss.

derekperkins · on Oct 19, 2021

Solving both these problems is the best hidden feature of Vitess - Messaging. You can ack a message, do data work, and add to a destination queue all in a single transaction. You can do selective requeuing, and infinite parallelization, since it isn't sequential processing. It's easy to introspect, debug, and have metrics for since it's just MySQL. Vitess lets you shard horizontally, so it can handle any QPS, and has at YouTube. It also supports native message priority. All of that, plus your infrastructure is simplified because you don't have to maintain a separate data store from your main RDBMS. Highly recommended. https://vitess.io/docs/reference/features/messaging/

jpgvm · on Oct 18, 2021

What you want is called Apache Pulsar. Log when you need it, work queue when you need that instead.

As for ensuring transactional consistency that isn't so bad, you can use an table to track offset inserts making sure you verify from that before you update consumer offsets (or Pulsar subscription if you go that route).

staticassertion · on Oct 18, 2021

(1) seems best solved by having a 'on heavy task, publish to a secondary topic'. This is good if you have flaky messages that need to be retried in the background, without blocking all of your 'good' messages.

(2) this problem should be avoided in general by just having idempotent services. Just as a hard restriction, forever, build services to be idempotent. It should be the exception to have a non-idempotent service, and it should be carefully understood.

That said, if you have (1) as a consistent issue, like if every message is flaky, kafka isn't the right solution. Postgres-based queues are perfect for this because you can examine the table as a whole, making more informed decisions about what you want to process (or not process).

rajin444 · on Oct 19, 2021

This seems like a much simpler solution. The "heavy task queue" would process tasks that could simply be retried until they're done.

Maybe I'm misunderstanding the article, but having "Job tasks" both insert another Job to run as well as updating DB state, and then having the executor pick up the previously inserted Job (whos only purpose is to send a kafka message) seems overly complex. I'm having trouble seeing why this is needed.

LgWoodenBadger · on Oct 18, 2021

I didn't quite follow their explanation for why producing to Kafka first didn't/wouldn't work for them (db state potentially being out of sync requiring continuous messaging until fixed).

anentropic · on Oct 18, 2021

it's a chicken and egg problem

you can either send a kafka message but potentially not commit the db transaction (i.e. an event is published for which the action did not actually occur) or commit the db transaction and potentially not send the kafka message

it sounds like they implemented something like the Transactional Outbox pattern https://microservices.io/patterns/data/transactional-outbox....

i.e. you use the db transaction to also commit a record of your intent to send a kafka message - you can then move the actual event sending to a separate process and implement at-least-once semantics

This is the job queueing system they described in the article

LgWoodenBadger · on Oct 18, 2021

Their solution seems like a "produce to Kafka first" but with extra steps.

Regarding:

When we produce first and the database update fails (because of incorrect state) it means in the worst case we enter a loop of continuously sending out duplicate messages until the issue is resolved

I don't understand where either 1) the incorrect state or 2) the need to continuously send duplicate messages come from.

Regarding:

The Job might still fail during execution, in which case it’s retried with exponential backoff, but at least no updates are lost. While the issue persists, further state change messages will be queued up also as Jobs (with same group value). Once the (transient) issue resolves, and we can again produce messages to Kafka, the updates would go out in logical order for the rest of the system and eventually everyone would be in sync.

This is the part that is equivalent to Kafka-first, except with all the extra steps of a job scheduling, grouping, tracking, and execution framework on top of it.

anentropic · on Oct 19, 2021

When we produce first and the database update fails (because of incorrect state) it means in the worst case we enter a loop of continuously sending out duplicate messages until the issue is resolved

the article does not explain things very clearly, but I think this is describing the problem rather than their solution

Our high level idea was:

- Insert “work” into a table that acts like a queue

- “Executor” takes “work” from DB and runs it

...

A Job is an abstraction for a scheduled DB backed async activity

...

How did we solve the #2 state problem?

By recording Jobs in the service database we can do the state update within the same transaction as inserting a new Job. Combining this with a Job that produces the actual Kafka message, allows us to make the whole operation transactional. If either of the parts fails, updating the data or scheduling the job, both get rolled back and neither happens.

I think this is describing basically a Transactional Outbox

i.e. "jobs" are recorded in the postgres db as part of the same db transaction as the business logic actions

the difference from Kafka-first is that if the app decides to rollback the business logic then the message hasn't already sent

bsaul · on Oct 18, 2021

Very interested to hear how people here overcome the limits of kafka for ordered events delivery in real world, and what those were.

luxurytent · on Oct 18, 2021

I feel as if you're using Kafka and expect guaranteed ordering, then you're using the wrong tool. At best you have guaranteed ordering per partition but then you've tied your ordering/keying strategy to the amount of partitions you've enabled ... which may not ideal.

But, that's speaking from my light experience with it. I'm also curious if there's a better way :-)

jgraettinger1 · on Oct 18, 2021

Not for Kafka, but we are building Flow [1] to offer deterministic ordering, even across multiple logical and physical partitions, and regardless of whether you're back-filling over history or processing in real-time.

This ends up being required for one of our architectural goals, which is fully repeatable transformations: You must be able to model a transactional decision as a Flow derivation (like "does account X have funds to transfer Y to Z ?", and if you create a _copy_ of that derivation months later, get the exact same result.

Under the hood (and simplifying a bit) Flow always does a streaming shuffled read to map events from partitions to task shards, and each shard maintains a min-heap to process events in their ~wall-time order.

This also avoids the common "Tyranny of Partitioning", where your upstream partitioning parallelism N also locks you into that same task shard parallelism -- a big problem if tasks manage a lot of state. With a read-time shuffle, you can scale them independently.

[1]: https://github.com/estuary/flow

BFLpL0QNek · on Oct 18, 2021

It depends on what the events are, how they are structured.

You get guaranteed ordering at the partition level.

Items are partitioned by key so you also get guaranteed ordering for a key.

If you have guaranteed ordering for a key you can’t get total ordering across all keys but you can get eventual consistency across the keys.

Ultimately if you want ordering you have to design around being eventually consistent.

I don’t read a lot of papers but Leslie Lamports Time, Clocks, and the Ordering of Events in a Distributed System gave me a lot of insight in to the constraints. https://lamport.azurewebsites.net/pubs/time-clocks.pdf

sumtechguy · on Oct 18, 2021

For kafka the default is round robin in each partition. A hash key can let you direct the work to particular partitions. Each partition is guaranteed ordering. Also only one consumer in a consumer group can remove an item from a partition at a time. No two consumers in a consumer group will get the same message.

tiew9Vii · on Oct 18, 2021

It's round robin if no key specified otherwise it uses murmur2 hash of the key so the partition for a key is always deterministic.

Just checking the docs it appears the round robin is no longer true after Confluent Platform 5.4. After 5.4 it looks like if no key specified the partition is assigned based on the batch being processed.

> If the key is provided, the partitioner will hash the key with murmur2 algorithm and divide it by the number of partitions. The result is that the same key is always assigned to the same partition. If a key is not provided, behavior is Confluent Platform version-dependent:...

https://docs.confluent.io/platform/current/clients/producer....

sumtechguy · on Oct 19, 2021

That is new! Guess I missed that. Thank you for the heads up. Wonder why the do not have a flag to set it to the old way, maybe they do I will have to dig through the docs. I could see some processes that could have that built in as a dependency and now that would change.

EdwardDiego · on Oct 19, 2021

Yep, IIRC around Kafka 2.6? the default partitioner changed from RoundRobin to preferring existing open record batches.

orobinson · on Oct 18, 2021

At lower data volumes (<10,000 events per minute) it’s perfectly feasible to just use single partition topics and then ordered event delivery is no problem at all. If a consuming service has processing times that means horizontal scaling is necessary then the topic can be repartitioned into a new topic with multiple partitions and the processing application can handle sorting the outputted data to some SLA.

csours · on Oct 18, 2021

put a timestamp in the message. use a conflict free replicated data type

sethammons · on Oct 18, 2021

If you need a guaranteed ordering, timestamps and distributed systems are not friends. See logical / vector clocks.

mfateev · on Oct 18, 2021

temporal.io provides much higher level abstraction for building asynchronous microservices. It allows one to model async invocations as synchronous blocking calls of any duration (months for example). And the state updates and queueing are transactional out of the box.

Here is an example using Typescript SDK:

   async function main(userId, intervals){ 
     // Send reminder emails, e.g. after 1, 7, and 30 days
   
     for (const interval of intervals) {
       await sleep(interval * DAYS);
       // can take hours if the downstream service is down
       await activities.sendEmail(interval, userId); 
     } 
     // Easily cancelled when user unsubscribes
   }

Disclaimer: I'm one of the creators of the project.

sam0x17 · on Oct 18, 2021

Couldn't the "state" issue be solved simply by enclosing the database save and kafka message send in the same database transaction block and only doing the kafka send if it reaches that part of the code?

at0mic22 · on Oct 18, 2021

Should you not consider every article as a divine insight. Sixfold is nowhere close from being a classy tech comp.

HelloNurse · on Oct 18, 2021

The only time I used Kafka, it was involuntarily (included for the sake of fashion in some complicated IBM product, where it hid among WebSphere, DB2 and other bigger elephants) and it ran my server out of disk space because due to a bug ridiculously massive temporary files weren't erased. Needless to say, I wasn't impressed: just one more hazard to worry about.

kitd · on Oct 18, 2021

due to a bug

Data retention time is Kafka config 101. Are you sure it was a bug?

geodel · on Oct 18, 2021

Considering how half-assed Kafka is in general, that it needs all clients code changes when Kafka servers are upgraded. It is very likely that user hit Kafka bug.

EdwardDiego · on Oct 19, 2021

> it needs all clients code changes when Kafka servers are upgraded

It absolutely doesn't. Message formats predating Kafka 0.11 are only just being deprecated as of Kafka 3.0, and won't be dropped until Kafka 4.0.

Now, if you want to use new shiny features (like cooperative sticky assignors to minimise consumer group stop the world rebalance pauses), then yes, you might need to upgrade clients.

But otherwise, you can still happily use 0.8 clients with your upgraded brokers.

https://cwiki.apache.org/confluence/display/KAFKA/KIP-724%3A...

kitd · on Oct 18, 2021

Citation needed.

New server versions are protocol backwards compatible so I'm not sure what you're referring to.

Ofc, if you downgraded a server without changing the client, that may cause problems, but tbh that's hardly Kafka's fault.

EdwardDiego · on Oct 19, 2021

What was the bug, out of curiosity?