Kafka as an Antipattern (joshaustin.tech)
98 points by joshaustintech 8 months ago | 98 comments



8000 messages a day, tops? That’s 5 a minute. Does that warrant “infrastructure”? I think a gameboy’s Z80 could handle that load.

I don’t want to be dismissive, but I often see these big numbers being posted, like “14M messages” or “thousands of messages”, with “per year” or similar tacked on, which brings it down to toy-level load.

Even the first “serious” example is about “thousands of messages” per minute. Say 5K a minute. That’s 83 per second, call it 100. That seems… not that interesting?

Am I being too dismissive? I think I am. I am not seeing something right. Can anybody say something to widen my perspective?


FTA's conclusion: "If you are handling thousands of messages a day, a simple database-driven queue might be better than Kafka."

They're not trying to say "thousands of messages a day" is a lot; quite the opposite. Or at the very least, they're saying that at that scale, the load is not significant enough to merit the complexity they were dealing with.


I think you're probably being slightly dismissive. It's not necessarily about load but about various other concerns like durability, delivery latency, and how failures are handled. There's a big difference between messaging and reliable messaging. I have messaging systems that take 10-20 messages a day but must deliver those messages and do it on a deadline. For that you do need infrastructure (and no, that isn't a queue inside a SQL database).


you don't need "systems" for 10-20 messages a day. it all could be replaced with S3 buckets and aws-cli with even better durability and delivery latency and error handling than anything you would be able to engineer yourself


I am completely dumbfounded by this reply.

You're suggesting we engineer something on S3 and aws-cli, while complaining about engineering something ourselves when AWS offers a perfectly good queue service that requires no engineering?

Uff. I'm going to buy a hut in the woods and live in it.


I used S3 just as an example of a service with a very good track record of availability at a very low cost - a perfectly reliable, available service can be created with bash scripts, aws-cli, and a free-tier AWS account.

perfectly fine with using SQS, it just has worse availability guarantees than S3 - people should understand the tradeoffs


I don't know but I think they might be suggesting that the answer to "don't need a huge complex system" is not "use someone else's huge complex system".


my understanding is that you should not roll your own expensive, complex, huge system; just use AWS S3, which provides eleven nines of durability at $22/TB/mo.

It is really hard to beat the cost/benefit ratio of S3.

a lot of mediocre engineers cannot swallow the pill that all their expensive work, with hundreds of hours of overengineering, could be replaced by a couple of AWS managed serverless services stitched together with a few mouse clicks or a single .yml file


Why does AWS even enter the conversation?


same reason any vendor enters the conversation. the argument that you can get more value out of the vendor than you would have out of a full time employee, and often for less money.

sometimes it really does make sense just to pay someone else to solve the problem. not always, but not never.


Then you have a dependency on a specific proprietary API/technology available from a single company. Doesn't look like a good trade-off.


The S3 API has become a lingua franca and is supported by open source implementations (MinIO), storage vendors (QNAP), as well as plugins that translate S3 API calls into the APIs of competing cloud providers (s3proxy).

all this is done because S3 provides unmatched durability and reliability at a dirt-cheap cost of $22/terabyte/month of storage (the rate for the first 50 TB/mo tier).

Try to beat those reliability guarantees with whatever you hand-rolled; I bet you will never beat the cost of S3, or even match its durability, reliability, and availability guarantees at any reasonable cost.

from https://aws.amazon.com/s3/storage-classes/:

  Key Features:
    Low latency and high throughput performance
    Designed for durability of 99.999999999% of objects across multiple Availability Zones
have you ever built anything with 11 nines? (as in eleven nines)


S3 is fine until you want your data to leave AWS.

Then it costs $92 / TB to get it out again.

Also, S3 has durability guarantees, but it's very difficult to do a durable transactional write to S3. Try it a few million times and see. The API is a de facto shitty standard.

These two facts are rather interesting when it comes to doing a restore from your supposed backup, or when wondering why consistency guarantees between external metadata services (DB) and what is in S3 don't always line up.


and why would you ever take raw data out of AWS?

if it is for migration: it is a one-time cost that anyone can swallow easily if they decide to leave AWS for something else.

If your data is worth < $94/tb - it is really not worth pulling it out of AWS. Just let it sit there.

or just use CloudFront to download your data ($85/TB)?


On top of that, if it's a big enough deal, most salespeople at other cloud providers (GCP, Oracle etc.) will gladly pay you to migrate. They'd probably even throw one of their Solutions Engineers at the problem and do it for you for free.


Selling your soul to a different crack dealer. Hmm.


crackheads will move mountains for a score.

this checks out.


> The S3 API has become a lingua franca

S3 API support sounds great until your customer builds a system with an "S3 compatible object storage" product. Soon you discover that many "S3 compatible" solutions aren't actually that compatible when pushed.


get-object and put-object are really all you need. everything else is nice to have
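
For the dead-simple case, those two calls are essentially the whole "queue". A minimal sketch, assuming boto3 and a made-up bucket name:

  import json
  import uuid

  import boto3

  s3 = boto3.client("s3")
  BUCKET = "my-queue"  # hypothetical bucket

  def enqueue(message: dict) -> str:
      # put-object: write the message under a fresh key
      key = f"pending/{uuid.uuid4()}"
      s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(message).encode())
      return key

  def dequeue(key: str) -> dict:
      # get-object: read it back
      obj = s3.get_object(Bucket=BUCKET, Key=key)
      return json.loads(obj["Body"].read())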


Your point is good, but that stack wouldn't win any latency awards. Many of the people I know using kafka need latencies in the millisecond range.


but kafka isn't fast. Most things are backed by real files, so when you hit limits or something gets ejected from cache, it gets slow real fast.

Kafka isn't the right choice for most things.

SQS, MQTT, NATS, or rabbit (if you want a lot of admin) are all better (plus the crap that azure and google make)


kafka is not for latency, it is for high throughput. By design kafka shines in high-throughput workloads thanks to consumer and producer concurrency (consumer groups) and broker concurrency (multiple nodes and partitions).

for latency-sensitive work you will probably need redis pub/sub or something in-memory
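
To make the consumer-group concurrency concrete, a sketch of a worker using the confluent-kafka Python client (broker address, topic, group, and handler are all made up); run N copies of this process and Kafka spreads the partitions across them:

  from confluent_kafka import Consumer

  consumer = Consumer({
      "bootstrap.servers": "localhost:9092",
      "group.id": "order-processors",  # processes sharing this id share the partitions
      "auto.offset.reset": "earliest",
  })
  consumer.subscribe(["orders"])

  while True:
      msg = consumer.poll(1.0)
      if msg is None or msg.error():
          continue
      handle(msg.value())  # hypothetical handler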


In all the times I've been forced to use Kafka, I have never seen single digit millisecond latency.

If you need fast response Kafka is a bad choice.

If you are okay with tens of milliseconds or more, then there are simpler solutions.

The only reason to use Kafka is the ability to guarantee order. For everything else it's second place at best.


what if you want at-least-once processing and durability?


Like AWS SQS? Others provide at-least-once processing as well.

Kafka has been the slowest of the ones I've used, and definitely more complex to use.


SQS has durability for up to 14 days and you can only have one consumer group.

It’s also proprietary.


Then use Rabbit.

It's still harder to get wrong than Kafka.


> 8000 messages a day, tops? That’s 5 a minute. Does that warrant “infrastructure”? I think a gameboy’s Z80 could handle that load.

The blog post is quite clear in stating that their pain points had nothing to do with scaling or throughput. The author explicitly mentions idempotency, custom headers, and authentication.

I think you're ranting about a strawman you put up.


As someone who helps administer several kafka clusters which carry millions of messages per second, and who works with both DBs and DBAs, the shallowness of the author's complaints seemed to betray a lack of experience and understanding of either technology. If they gave even a small amount of detail, it might have been useful:

"We couldn’t create a Kafka source connector from the database holding the messages" – Why couldn't they?

"we had extreme difficulty troubleshooting when their system refused our best understanding of said custom headers" – What made this so extremely difficult for them?

"Authentication with Kafka, and authentication to encrypt confidential data in Kafka, and getting the Avro schema from the schema registry were all separate and painful steps" – These are 3 totally unrelated features, so of course they each have settings. What made the steps so painful for them?

"We would never tolerate this level of complexity with a database" the author says of configuring kafka features, apparently missing that a database setup with all those features (external authentication, encryption, custom serialization) also requires "separate" configuration for each one.


It seems to me that Kafka was the strawman that the article put up to take down as an antipattern. There was no part of the premise that made sense.

> While my client was in a situation where we had no choice but to use this monstrosity of Kafka, Avro, and custom message headers, I would never recommend this usage of Kafka if I had the option.

The antipattern isn't Kafka, it's the client situation.


The author talked about their pain points, but they didn't really outline the downsides of this approach in a comprehensive way. What I've found with event-based systems is that they are much worse if your operational maturity/excellence is low. To make it worse, operational excellence can degrade over the years, so you can't simply base the decision on how well you do it today.


I think you might be right. I’ll take the hit.

Still, I think talking about Kafka without load feels like Christmas without a tree. I thought that was the main point of it: handling massive loads (by distributing them).


I think that the author just leveraged the existing messaging infrastructure they were already using for other services, and also reused their know-how. Their new microservice will barely register in the overall traffic volume, but they don't need to either deploy dedicated infrastructure or onboard onto yet another technology just to have message listeners.


Yeah that’s nothing.

My busy discussion forum built in PHP running on a toaster of a server was handling way more load than that 15 years ago.


It seems a lot of the complaints weren't about kafka itself, but rather seemed to stem from internal communication problems. Custom kafka message headers could very well be custom http headers, and the problem is the same. Kafka is just coincidental.

Looking at the volume though, kafka is overkill. They most likely could have just used the database and reaped the benefits of doing everything in a single transaction, with easier row-level locking. The post acknowledges this.
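
As a sketch of what that database approach can look like (assuming Postgres, psycopg2, and a made-up jobs table): a worker claims, processes, and deletes a message in one transaction, and SKIP LOCKED lets concurrent pollers grab different rows instead of blocking on each other:

  import psycopg2

  conn = psycopg2.connect("dbname=app")  # hypothetical DSN

  def dequeue_one():
      with conn:  # one transaction: claim + process + delete
          with conn.cursor() as cur:
              cur.execute(
                  "SELECT id, payload FROM jobs "
                  "ORDER BY id LIMIT 1 FOR UPDATE SKIP LOCKED"
              )
              row = cur.fetchone()
              if row is None:
                  return None  # queue empty (or all rows claimed)
              job_id, payload = row
              process(payload)  # hypothetical handler
              cur.execute("DELETE FROM jobs WHERE id = %s", (job_id,))
              return job_id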

I do think it highlights the need for a small scale kafka, though. It's conceptually great to have everything work off of logs, but kafka does add a non trivial operational burden.


We've had "small-scale" Kafka for a long time. It's an append-only log, and there are a number of ways to implement it, but it's essentially that.

The thing that makes Kafka interesting is the technique of operating out of the Linux page cache (sequential log writes, zero-copy reads). That's the trick that makes it fast and scale to huge volumes. But if you don't have the scale, you can stand up a table, or RabbitMQ, or anything that manages append-only ordered log entries. There doesn't need to be a new thing... Kafka was the new thing.


Yes, there's nothing novel about an append-only log. What's missing (or unknown to me) is a library or small server that provides a good general-purpose implementation.

It's not just a matter of "write to log, done!". Ensuring persistence, keeping track of consumer offsets, transparent compression, waking up consumers on new message availability, support for transactions...

It's not just an append-only log that's wanted, it's a system for managing append-only logs, without complications like leader election, replication, partitioning, etc.
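
As a sketch, the core state such a system has to manage is small - roughly two tables (assuming Postgres; all names are made up). Everything listed above (wake-ups, compression, transactions) layers on top, which is where the non-trivial work hides:

  SETUP = """
  CREATE TABLE log (
      topic    text   NOT NULL,
      position bigserial,                  -- append-only offset
      payload  bytea  NOT NULL,
      PRIMARY KEY (topic, position)
  );
  CREATE TABLE consumer_offsets (
      consumer text   NOT NULL,
      topic    text   NOT NULL,
      position bigint NOT NULL DEFAULT 0,  -- last offset this consumer acked
      PRIMARY KEY (consumer, topic)
  );
  """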


> Looking at the volume though, kafka is overkill.

Overkill in what sense? The blog post seems to suggest Kafka was already pervasive in their organization, and that they leveraged the existing infrastructure and simply added a couple of topics. How is this overkill?


>small scale kafka, though. It's conceptually great to have everything work off of logs, but kafka does add a non trivial operational burden.

does something like that exist ???


I think it'd be very easy to write your own. I used Postgres's built-in LISTEN/NOTIFY combined with a database table to get a distributed message system.

Writing a distributed, scalable system is really hard, and beyond the API, that is the real value of kafka
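
A sketch of that wake-up loop with psycopg2 (channel and function names are made up):

  import select

  import psycopg2
  import psycopg2.extensions

  conn = psycopg2.connect("dbname=app")
  conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
  conn.cursor().execute("LISTEN new_message")

  while True:
      # Sleep until Postgres signals activity (or 30s passes), then drain.
      if select.select([conn], [], [], 30) != ([], [], []):
          conn.poll()
          conn.notifies.clear()
      drain_queue()  # hypothetical: e.g. a SKIP LOCKED poll over the table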


>I used Postgres's built-in LISTEN/NOTIFY combined with a database table to get a distributed message system.

Every single person I know who's done this says it was a fantastic decision, and the "eventually I'll have to migrate to X" never came.


how do you deal with connection limits? you can't LISTEN through a pooled connection.


It's relatively easy -- removing any networking requirements drastically simplifies the problem. There's still some non-trivial bits that vary depending on granularity for concurrency.

It's a weekend project to demonstrate the concept, maybe a few weeks to really flesh it out and iron out quirks. I imagine if you're willing to use sqlite as a backend for persistence, it gets a bit easier.


You may be right. However, consider the directorial perspective.

You have employees - you try to get and retain the best talent you can. However, every human has strengths and weaknesses, and these may not all be fully visible to you.

Rolling your own vs buying off the shelf is a gamble on future outages.

Will a third-party support and fix the issue, or have a strong community that can help you work through the issue?

If your best engineer builds something that works for long enough to become entrenched, but then carks it, will your best engineering talent be able to resolve the issue? If your rockstar quits, does the team have to pick through the halls of Cthulhu? Does your organisational ignorance of kernel networking suddenly become painfully apparent?

Remember, you need to be twice as clever to debug your code as you were to write it...


Depending on the meaning of "small-scale kafka", both RabbitMQ and redis do support streams.
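
For example, Redis Streams give you the append + consumer-group shape via redis-py; a sketch with made-up stream/group names:

  import redis

  r = redis.Redis()
  try:
      r.xgroup_create("events", "workers", id="0", mkstream=True)
  except redis.ResponseError:
      pass  # group already exists

  r.xadd("events", {"type": "signup", "user": "42"})  # append to the log

  # Block up to 5s for up to 10 new entries for this consumer.
  for stream, messages in r.xreadgroup("workers", "worker-1",
                                       {"events": ">"}, count=10, block=5000):
      for msg_id, fields in messages:
          r.xack("events", "workers", msg_id)  # mark processed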


One of my desires would be for it to be persistent. Hopefully with the option of different storage tiers, so that as logs became older they could be moved to a less costly medium and transparently fetched when requested.

Having an event sourced system doesn't make much sense unless you maintain messages from the start of the system. You can snapshot state and resume in order to quickly rebuild from a known good state. That doesn't help if there was a logic error corrupting every state from the start, and a full rebuild is required.

I'm unsure how redis streams behave with regard to cache eviction, nor am I familiar enough with rabbitmq to comment on its behavior. It's been 10 years since I used either, and at the time neither were good solutions for a log-based system.


Redis doesn't do key eviction by default. Everything lives forever, unless you tell it otherwise. Of course, it is recommended to turn on some form of persistence (and configure backups) so that things don't disappear if the server restarts.
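
A sketch of checking/flipping those knobs from redis-py (connection details omitted):

  import redis

  r = redis.Redis()
  r.config_get("maxmemory-policy")   # 'noeviction' by default: writes error rather than evict
  r.config_set("appendonly", "yes")  # enable append-only-file persistence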


Not that I'm aware of. I've been very tempted to write my own.


I used bash and netcat for a queue like this once. I stashed the messages on disk if the database was down and read them back once the far end came back.

The thing Kafka really brings to the table is trusting the pipe -- in a case where writes to disk queues would occur often enough, I would run into reliability issues building my own system; Kafka handles that type of indexing instead.


The main problem I have with Kafka is that their sales team is too good: at my previous employer the CIO was convinced we needed Kafka and bought a contract for several 100k. But we already had all our events in a postgres database.

Admittedly, that database had some complicated queries with lots of business logic to get a useful view of the data. At first I hoped Kafka would somehow make this easier, but of course our particular use case - low event volume (hundreds per day), high latency tolerance (next-day reporting was considered good enough), highly complex business logic (various computations that required knowledge of what was done previously) - made Kafka just about the least suitable tool for the job.

Of course the contract was already signed (I was naturally never consulted up front), so this resulted in a lot of solution-looking-for-a-problem. No suitable problem was found, so I ended up leaving the enterprise world for a scale-up, and the CIO is still doing whatever he wants for god knows why


I've worked at two large companies now with mature managed Kafka offerings. The 'platform' engineering team handles all of the engineering, implementation, security and compliance, upgrades, observability etc., and has self-service onboarding with lots of recipes and sample integrations. My team moves about 5B messages a day through two topics and we're not putting a dent in the overall volume. It just enables us to move so much more quickly than we would if we had to deal with all of that ourselves.

So in our case it's clearly not an anti-pattern, but the right tool for the job.


The fact that you had an entire team operating it is the key, I think. I've seen various big buzzwordy techs used at various shops I've worked at and whether it was nice to work with totally came down to whether there was a team behind it operating it for us.

K8s with a k8s team running it - fabulous. Without a team and everyone kind of just needs to know enough to get by, except for the one guy who set it up? Dreadful.

Airflow when there's a team running it? Great. Luigi when it's just you, another dude, and the one guy who set it up? Not great.

Even RDS is like, eh. We still had RDS tip over, and we had to do manual vacuuming or something to compact tables full of dead tuples. Annoying when we were paying for RDS ostensibly to not have to do this.


Agreed -- have used it in a bank. It was very suited for that.


The way I think about this kind of problem is to remember that tools built to deal with huge scaling problems are generally dealing with a very complex set of variables. The tool is going to be designed to let you choose between all of those variables. There's no magic - just configuration whose complexity better matches that of your problem.

That being said, if you are not yet in a situation as complex as the one your tool is designed to deal with, there is a very good chance you will waste some time starting to use such a tool "early." You might get that time back later when you scale, you might have the right people to set up the complex tool the right way for your simple situation, but you are taking a bit of a risk. As long as you go into the situation with your eyes open I think most people end up ok. The horror stories almost always come from people who are working to fulfill needs they do not have and don't understand why their work isn't giving good ROI.


    Immediately starts to doubt OP's assumption/implementation of the "where it works great"
Jokes aside, I agree with others. For the 7500/day, I would just push these into an S3/minio folder, and then dequeue 100 (or N) once every 10/30/60 (or T) seconds; play around for the right N and T.
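
A sketch of that poller, assuming boto3 and made-up bucket/prefix/handler names (pairing with an enqueue that writes one object per message):

  import time

  import boto3

  s3 = boto3.client("s3")
  BUCKET, PREFIX, N, T = "my-queue", "pending/", 100, 30

  while True:
      page = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX, MaxKeys=N)
      for obj in page.get("Contents", []):
          body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
          process(body)  # hypothetical handler
          s3.delete_object(Bucket=BUCKET, Key=obj["Key"])  # "ack"
      time.sleep(T)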

Then again, I am sure there may be reasons/context/constraints unknown to us.

Eg - the ingestion is spiky, with the possibility of all 7500 arriving in a few secs/minutes; you would want to first make sure that the http traffic can scale before getting to the point where it can actually connect and push to the queue

Possible reason #2 - an intern who just finished up their first Kafka task and just got freed up

#3 - or this was the only infra available and a choice had to be made with the time available at hand

#4 - or this was an experiment to see for yourself

A majority may not agree with your view, but none of us really are in your shoes. So I applaud you for sharing your thoughts anyway.


Seems like a lot of what I read about Kafka really makes it sound like using it is quite, well, Kafkaesque

Why do so many engineers end up struggling with an event sourcing system while the system itself remains highly popular? I don’t know. I theorize the following:

- It’s flexible enough to do things like receiving events (messages) and sending downstream events derived from those received

- It can ingest events fast. A well-tuned instance is very fast and can handle a lot of volume

- It often allows a middleware log point (or other work types) for things happening throughout your whole system

Perhaps all of these things (and more) are hard to attain using a different technology


The simple approach mentioned in the article gets annoying if you have microservices that don't share a DB. You could add a shared DB or a NoSQL DB, but then you may as well just add Kafka. Of course, the key question then shouldn't be Kafka or not-Kafka, but whether you over-engineered on microservices.


> Why do so many engineers end up struggling with an event sourcing system while the system itself remains highly popular?

It's the mental model that's simple: some services write down what happened, other services 'do their own thing' with that information.

I can write the core of the system in 2022, with events like 'Joe wants to buy a bike', 'Joe owes us $200', 'Joe paid us $200', 'Send Joe the bike', etc.

In 2023 I want to build the book-keeping service, in 2024 I want to build the inventory management system, and in 2025 I want to hook it up to a CRM and see if we can try to sell some bike parts to Joe.

Why couldn't I just use REST for that? Because the recipients of the REST calls didn't exist yet.

The bad part of Kafka (in my opinion) is how opinionated the consumer logic is (oftentimes by necessity, because of the whole distributed-system thing). Sometimes I just wanna ask "what offset are you up to?", but I end up in API hell, unable to do it.
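
For what it's worth, the question is answerable, just awkward. A sketch with the confluent-kafka client (group, topic, and partition count are made up):

  from confluent_kafka import Consumer, TopicPartition

  c = Consumer({"bootstrap.servers": "localhost:9092", "group.id": "bikes"})
  parts = [TopicPartition("orders", p) for p in range(3)]

  for tp in c.committed(parts, timeout=10):  # last committed offset per partition
      low, high = c.get_watermark_offsets(TopicPartition(tp.topic, tp.partition))
      print(tp.partition, tp.offset, "of", high)  # lag = high - committed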


- developer tries to get those jobs paying $50k/yr more: “what kafka, event sourcing and microservices experience do you have?”


This is what I've been starting to think about on a more abstract level: Introducing a new technology, a new system into a design isn't like putting a piece into a jigsaw puzzle, or even worse, trying to mold and force the system to fit whatever hole your design has. Many more specialized systems - and Kafka is one of them - should solve some problem, but they should also change your mental model of the system and you should look for the easiest way to introduce these heavy hitters.

For example, if you use Kafka or streaming solutions like Flink or Spark, you should change your mental model to (possibly large), (possibly replayable) streams of events and look for simple ways to get these event streams going and good ways to consume them. And then you need to let the design push you where it wants you to go.

Like, at work, we recently had a discussion about how it was so storage-expensive for a project to store all events of a day, and how the query to count all of these events per tenant was taking so long. While they are using a streaming event processor in front of it. Like, what the hell - think in streams, tally up these events on the fly and persist that every hour?


Anything is deceptively deep if your understanding never goes beyond skin deep.

Also Avro is great but like Kafka you were probably holding it wrong.

I do prefer Protobuf in these particular scenarios, as Protobuf's features more closely align with svc <-> svc RPC-style communication patterns, while Avro shines in longer-lived scenarios where messages need to be archived and you don't want to come up with your own framing for your Protobufs.

This is because Avro has the Avro Object Container Format, a simple block-based file format that allows for relatively efficient seeking, block-based compression, etc. Protobuf unfortunately doesn't define any standard file format or even wire-protocol framing. If you need to do more than simply store and scan/read in bulk, you might want to use Parquet instead, though.

Reading this blog post was probably a waste of my time; hopefully this comment actually helps someone, though.


I'm a big Avro fan. It isn't easy to be good at, but if you follow its rules it is so much easier to evolve than plain JSON.


The anti-pattern here isn't Kafka, it's using Kafka for 7,500 messages/day. Making your whole system asynchronous for that level of load is the textbook definition of over-engineering.


It's not necessarily over-engineering, because scaling/performance isn't the only reason to want to do things asynchronously. In the example given, they may want to decouple consumers and producers from each other.

For example, they could have one service using CDC (his DB source connector) to propagate state out to a bunch of other systems and not know which systems are subscribing for changes.

An organization could have many such systems propagating state out to many other systems, using a single distributed log system.


Yes. For perspective, that's about one message every ten seconds.


Sorry for the naivety/not obvious from the comments: is that too much or too little? (I've used RabbitMQ much more than Kafka.)


Kafka is designed to maximize scalability, millions of messages a second. It's a pain in the neck to manage if you don't need it


At 7500 records a day, I'd even question RabbitMQ in the design.

If I were to judge that at work, my first thought would be that one of our busier postgres clusters with 2 read replicas is chugging through some 2-3k transactions per second without really needing much tuning or particularly specialized hardware. The more ETL-oriented clusters are capable of processing some 100M - 200M rows per second when chugging through large queries, and these are just simple 4-core VMs, again on not really specialized hardware. And postgres would parallelize these queries more if you gave it more cores and the queries aren't horrible.

At 7500 records a day or 3 million a year, you wouldn't be able to generate enough data to make one of these databases sweat over many years.

Hate me as a DBA, but write some good queries for whatever you're doing and run those in a cronjob at that scale.


I once worked on a chat-based system that handled load like this, and it was initially built with kafka. I worked out that the cost per message was several cents, haha. I replaced it with a redis queue, which was all I knew at the time, and it ran on a digitalocean droplet for CAD $5 per month for around 18 months before they scaled it up. It was handling ~100 messages per second peak at the end, which is still very low. The cause for concern was that the droplet silently failed due to memory issues on a particularly busy day, so it seemed reasonable to jump to the next tier to avoid that issue for a while.

For what it's worth I never intended for the 1CPU/1GB VM to go to production, but I was a consultant and they just ran with it. And it worked!

They swore off of kafka forever after the pains they had with it. Another consultant built that system for them, so it wasn't an internal decision exactly and they had no idea what they were getting into. I've heard of similar experiences since. I've sometimes hoped to land on a project where kafka was well suited to the problem, though; I learned a lot about it back then and it seemed incredibly cool. I was kind of envious of all these projects fully utilizing it!


I've used kafka to process data on the order of 50 MB to 5000 MB/second incoming. It has complexities that are worth eating for that type of use case.

For a message every 10 seconds, its use is ... hmm. It wouldn't be in my top 50 choices.

edit: and to be clear, I'm a huge fan of kafka: it sat there and silently just worked. It was great!


Way too little to justify event-driven architecture (let alone Kafka specifically), unless you have some specialized need, like very slow event processing where you need to display a "message received" notification to the user before the processing happens. Or you really need the retry functionality and can't handle it some other way.

Most businesses have no (hue hue) business doing event driven architecture. There is way too much overhead for local testing and overall complexity, especially when you want to properly handle errors.

"But, every developer should be able to set up their local." Yea, great, explain to the manual QA who may be amazing but just started 3 months ago.


I think we need to separate load from event-driven architecture: what if you want to reload data on a client's view, even if that data only gets refreshed once a month? The load is very low (1 message/month), but it still requires an event to be pushed to the client to refresh its data.


You probably want to be closer to 7,500 messages/second before Kafka becomes worthwhile.


It's almost nothing.

I'm currently maintaining a system that uses Redis as a message broker and at a rate of ~10 messages/second it marginally makes sense.


It is like using a dump truck to sugar your coffee.


Too little to need Kafka.


too little


Sometimes you do want it to be asynchronous anyway; it just doesn't make sense to make it asynchronous with a horizontally scalable distributed streaming platform...


You need something, but the premise of adding Kafka for this load is pretty weak. Use something you already have, because nearly anything will work.

Ex:

  - Your filesystem.
  - Someone else's filesystem (e.g. s3).
  - Your database.
  - Redis.


Exactly - I've handled way higher loads than that with a basic table and a poller.


I agree that Kafka is a lot of machinery for a fairly limited gain. but I really dislike this whole notion of 'antipattern', as if we can look at a thing and assign a decontextualized thumbs up or down - as if building systems is just a matter of assembling the right patterns and avoiding the antipatterns.


Sometimes a technology or pattern can be poisonous in a very specific way that warrants a label: when it's most alluring to those least equipped to leverage it.

Just like microservices, you cross the activation energy to want Kafka very easily, because it's appealing on résumés, sounds like a hedge against scale, etc.

But there's a huge asymmetry in understanding the drawbacks to them. When you spin up these systems, the drawbacks don't hit you immediately, it feels like they're solving the problem you had, and it's not until you've invested immense amounts of sweat capital (and literal capital) that you discover how badly you screwed up.

You need some way to match that low-effort value prop with a low-friction warning: this is not a panacea for your problems. It only seems simple; it's not simple, and it will hurt you unless you know it will hurt you and simply have the resources and scale to play through that hurt.

To me that warning is what's implied by "antipattern": it's not "never use this", it's "never use this unless you know why you should never use this".


I agree completely. Kafka, and event based architectures, are highly overused, but are also the right thing sometimes. It's FAR easier to manage an architecture where systems call into the source of truth for a given piece of data via APIs, and you should only switch to an event model if you truly need to.


This was not explicitly addressed in the post, but the big "Kafka antipattern" out there is building "microservice infrastructure" and using a stateful message broker between services where you should be using RPC/look-aside load balancing with deadlines and retries.

Some morons even write books and blog posts about this. The funny thing is this sort of shit is done in the name of scale, but the big folks never operate this way. Large scale infrastructures actively disdain keeping buffers and state in the middle of the request flow. They cannot afford the cost and latency of such systems. They do it the sane way[1].

[1] https://www.usenix.org/conference/osdi23/presentation/saokar


Whilst Kafka isn’t a good fit for everything, it genuinely sounds like these issues were organisational, not technical, and any stack would have suffered the same frustrations.


The main struggle I have with Kafka is managing partitioning to avoid hotspots.

I have a scenario where we have hundreds of installs of an old and shitty RDBMS on customer sites, and we need to replicate changes to data to a central store. We had to come up with a bespoke event system that would capture insert, update, and delete events and ship them to a REST endpoint, which would then throw them into Kafka to be processed into the central store. Kafka’s ordered message log made it ideal for this scenario, as we can’t play events out of order (although because of poor design in the old databases, it sometimes happens, and we built a retry system using additional Kafka topics; nonetheless, avoiding out-of-order messages is critical to keep consumer lag under control).

This works mostly ok, but we have a problem when individual customers have big bursts of traffic. Ultimately, we need records to be processed in the order they happen, per customer. Naively, we could partition by customer ID, but arbitrarily adding new partitions as we add customers is not practical over time, and regardless, bulk inserts, updates, etc. could cause large amounts of latency for a customer. So, we’re doing a balancing act of trying to partition using customer ID + a “bundle name” of related tables (the net effect being activity to dependent tables for the same customer always go to the same partition and thus process in order). We’re also looking at using additional topics to create high, medium, and low priority queues, but while that may smooth out some of the problems, it really only breaks the original problem into three smaller versions of the same problem, effectively kicking the can down the road.
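
For the curious: that keying needs no custom partitioner - Kafka's default partitioner hashes the message key, so a composite key pins each customer+bundle to one partition and therefore keeps it ordered. A sketch with the confluent-kafka client; all names are made up:

  from confluent_kafka import Producer

  producer = Producer({"bootstrap.servers": "localhost:9092"})

  def ship(customer_id: str, bundle: str, event: bytes):
      # same customer+bundle -> same key -> same partition -> ordered
      key = f"{customer_id}:{bundle}".encode()
      producer.produce("changes", key=key, value=event)

  producer.flush()  # wait for delivery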

Ultimately, the best solution would be to get rid of the crappy RDBMS and replace with something that we can binlog or otherwise sync transactionally rather than record by record. We are working on this, but it’s slow going. In the meantime, we continue to wrestle with Kafka partitioning woes.

As an aside, we also got rid of Avro. It just didn’t have any benefits that outweighed the challenges of getting it and keeping it working over time. Much easier to just use plain JSON, a common message class library shared between consumers and producers, and a fast, traditional JSON library. I’ll fully admit that perhaps the Avro woes are more an issue of inexperience, but I seem to find more people who have the same experience as me than not. Either way, plain JSON has not caused us any problems.


Deceptively deep topic, and yet such a superficial post.


Kafka will be trash when used as a message bus, which is exactly what the author experienced. It can work, but it’s designed for stream processing, not messaging, so it will always be inferior when used this way


For those who came here for Franz Kafka, this article and posts are about Apache's "distributed event store and stream-processing platform".

https://en.wikipedia.org/wiki/Apache_Kafka


Why did you think anyone thought this was about Franz Kafka?


Amazing reply


Curious what language the OP and their team were using to integrate with Avro. Binary serialization can be a bit awkward, but Avro is a very stable API and it isn't difficult to find and/or build abstractions to work with it. Perhaps I am spoiled coming from a Clojure perspective?


Oh, I’ve seen much worse than this. I truly believe system design interviews and Confluent marketing/sales have made Kafka a midwit trap:

1. You cannot just use Kafka for free. It will take dev time to set up, dev time to code sources and sinks, and dev time to handle commonly glossed-over but utterly important details like idempotency, retries, duplicate messages, consumed-but-not-committed messages (or whatever the term is in Kafka world), and network interruptions or restarts of consumers.

2. You cannot just continue to use Kafka for free. Running it has an operational cost; this is mitigated by using it as a PaaS but not fully solved, as you’ll still need to twiddle configurations and scramble to deal with things like “We had no idea we’d need to handle idempotency or use dead-letter queues; fixed it, but now we need to deal with old data from before our fix”.

3. There are many ways you can implement async producer:consumer patterns that are less complex and less costly than Kafka. For example, you can write data to S3. Or you can store records in a regular RDBMS. Kafka isn’t worth it unless you really need “real-time” ingestion but can’t/won’t/shouldn’t implement an actually-real time (synchronous) system instead, like if you get large spikes and are ok with ingestion going from O(seconds) to O(minutes) when that happens.

4. There’s a good chance you don’t need an async queue at all. If consumers can horizontally scale quickly (like with Lambda), why not synchronously invoke them over HTTP/RPC and only use async queues (or a file, etc.) for messages that fail multiple retries? Since external users usually aren’t directly writing to your Kafka topic, and thus you have a degree of control over your ingestion and consumption, why not just combine the two services? (You can experience data loss between the external world and your ingestion service anyway - in fact, this is a pretty likely source of failures - so your queue may not even be solving the problem you think it is.) If you don’t need ordering or partitioning and are using queues for e.g. config update propagation, why not just synchronously update consumers or implement basic polling in your consumers?

5. A lot of Kafka/Confluent “features” like retries or logging you can get with so many other tools and services but for some reason these can be the actual selling point more so than the fact it’s an async queue (that also has these features).

Yes, in a FAANG design interview where it only costs you 5 seconds to say “and we’ll use an async queue like Kafka between these components to handle variable load and partially consumed data”, it’s a great tool that saves you a lot of time. And when you pretend integration and maintenance costs don’t exist, don’t even know what idempotency means, and are sitting across from some slick Confluent salesperson telling you Kafka can be THE database of everything your company does, with all these nice features, it sounds great to midwit managers and has-been architecture astronauts. In reality? The dumb, unsexy alternatives probably solve your actual problem more simply


from your reply it seems like you have not worked with the high-throughput workloads that FAANG deals with daily.

your suggestion of a single-node RDBMS as a replacement for kafka suggests you don't have experience with workloads that cannot be served by a single machine yet still need a single architecture.

agree that Confluent took a great product that works for high-load use cases, and then tries to shove it into each and every average Fortune 1000 enterprise use case with 100 users and traffic that could be well served by SQLite/Postgres on a single machine


Not only have I worked on some of the highest throughput workloads in FAANG, the ones I’ve worked on involve consuming from async queues (including Kafka) as a first class feature and use additional queueing under the hood to make the system work.

Kafka is overkill for the vast majority of users, including many in FAANG, and at the FAANG companies I’ve been at, usually either an internal async queue or a bespoke solution to the problem is used instead. But more importantly, async queueing via Kafka-like technology is itself way over-applied. Polling a single-node RDBMS with a horizontally scaling cache in front of it is a fine way to propagate e.g. config changes at massive scale. And many of the “real-time ingestion” use cases that people love using Kafka for are much better off simply being synchronous, with engineering effort instead focused on the ability to rapidly scale consumers up and down.


On my laptop, even a demo with 5000 records a second seems like almost too little for Kafka.


ok, pretty embarrassing: I thought of Kafka the author first and wondered why he would have been an antipattern...


The company I worked for when GDPR came into effect used kafka in every configuration possible. I've seen some pretty decent uses but also some horrific ones.

I agree with the author to an extent - kafka is overly complex for low traffic systems. It can however be a godsend in specific cases, but what those cases are can be hard to pinpoint.



