Hacker News
Kafka vs. Redpanda performance – do the claims add up? (jack-vanlightly.com)
213 points by itunpredictable on May 15, 2023 | 141 comments

alex here, original author of redpanda

it's hard to respond to a 6-part blog series - released all at once - in an HN thread.

- what we can deterministically show is data loss on apache kafka with no fsync() [shouldn't be a surprise to anyone] - stay tuned for an update here.

- the kafka partition model of one segment per partition could be optimized in both architectures

- the benefit for all of us, is that all of these things will be committed to the OMB (open messaging benchmark) and will be on git for anyone interested in running it themselves.

- we welcome all confluent customers (since the post is from the field cto office) to benchmark against us and choose the best platform. this is how engineering is done. In fact, we will run it for you at no cost. Your hardware, your workload, head-to-head. We'll help you set it up with both... but let's keep the rest of the thread technical.

- log.flush.interval.messages=1 - this is a stance we took a long time ago, back in 2019. As someone who has personally talked to hundreds of enterprises, most workloads in the world should err on the side of safety and flush to disk (fsync()). Hardware is very good today and you no longer have to choose between safety and reasonable performance. This isn't the high latency you used to see on spinning disks.
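for reference, a sketch of the relevant broker-side settings (values here are illustrative; check your kafka version's docs for the exact defaults):

```properties
# fsync every message batch before acknowledging (safest, slowest)
log.flush.interval.messages=1

# alternatively, bound flush latency by time (illustrative value)
# log.flush.interval.ms=1000

# with neither set, Kafka leaves flushing to the OS page cache,
# relying on replication + recovery for durability
```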

It's a common misconception about Kafka and fsyncs. But the Kafka replication protocol has a recovery mechanism, much in the same way that Viewstamped Replication Revisited does (except it's safer due to the page cache), which allows Kafka to write to disk asynchronously. The trade-off is that we need fault domains (AZs in the cloud), but if we care about durability and availability, we should be deploying across AZs anyway. We've seen plenty of full region outages, but zero power loss events in multiple AZs in six years.

Kafka and fsyncs: https://jack-vanlightly.com/blog/2023/4/24/why-apache-kafka-...

As far as I read the blog post, I understand that it assumes the scenario where "a replica dies (and loses its log prefix due to no fsync) and comes back instantaneously (before another replica catches up to the leader)".

Then, in Kafka, what if the leader dies from a power failure and comes back instantaneously?

i.e.: Let's say there are 3 replicas A(L), B(F), C(F) (L = leader, F = follower)

- 1) append a message to A

- 2) B, C replicate the message. The message is committed

- 3) A dies and comes back instantaneously before zk.session.timeout elapses (i.e. no leadership failover happens), having lost its log prefix due to no fsync

Then B, C truncate the log and the committed message could be lost? Or is there any additional safety mechanism for this scenario?
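To make the worry concrete, here is a toy model of that sequence (illustrative Python, not real Kafka code; it simply assumes followers truncate to the returned leader's log end offset, per the scenario above):

```python
# Toy model of the scenario above -- not real Kafka behavior, just the
# failure sequence the question describes.
# A(L) appends m2; B and C replicate it, so m2 is committed.
leader = ["m1", "m2"]
followers = [["m1", "m2"], ["m1", "m2"]]

# Power failure: A loses its un-fsynced entries (say, m2) and comes back
# before zk.session.timeout elapses, so no leader election happens.
leader = leader[:1]

# B and C resume fetching and truncate to the leader's log end offset.
followers = [f[:len(leader)] for f in followers]

print(leader, followers)  # m2, though committed, is gone from every replica
```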

I love this question. Would be great to hear back from Confluent about this.

One safety mechanism I can think of is that the replicas will detect the leader is down and trigger leader election themselves. Or that upon restart the leader realizes it restarted and triggers leader election in a way that B ends up as the leader. (not sure either is being done)

As I think about it more, even if there’s a solution I think I’ll stick to running Redpanda or running Kafka with fsync.

The solution seems to be fsync. It’s what it’s for. It’s very appealing to wave it away because it’s expensive.

The situation above may be just one example of data loss, but it seems there could be others when we gamble on hoping servers restart quickly enough, and don’t crash at the same time, etc.

Two comments here.

1) What about Kafka + KRaft, doesn't that suffer the same problem you point out in Redpanda? If so, recommending to your customers to run KRaft without fsync would be like recommending running with a Zookeeper that sometimes doesn't work. Or do I fundamentally misunderstand KRaft?

2) You mention simultaneous power failures deep into the fsync blog post. I think this should be more visible in your benchmark blog post, where you write about turning off fsyncs.

1) KRaft is only for metadata replication; data replication is still ISR-based even with KRaft, so it doesn't change the conclusion

Nah, I dug deeper into this. Right conclusion, wrong reasoning.

The reason KRaft turns out to be fine is because the KRaft topic does fsync! Source: https://cwiki.apache.org/confluence/display/KAFKA/KIP-595%3A...

KRaft is used for metadata replication in the same way that Zookeeper is used for metadata. I.e., in a very meaningful way.

repeating things does not make them true. I read the post. You can only control some failures, but happy for us to write our thoughts in blog form.

> what we can deterministically show is data loss on apache kafka with no fsync() [shouldn't be a surprise to anyone] - stay tuned for an update here.

Confluent themselves can show this; the part I'm curious about is whether you can show data loss outside of the known documented failure modes. Because I, like anyone, can show data loss by running a cluster without fsync and simultaneously pulling the plug on every server.

> Because I, like anyone, can show data loss by running a cluster without fsync and simultaneously pulling the plug on every server.

Woah, yeah that's a serious problem. Data loss under that scenario is nothing to sneeze at.

Then enable fsync. I don't really see a way around requiring synchronization to persistent disk if you want persistence cross power outages, right?

Yes, that is what the linked benchmarks discuss...

It's not a serious problem for most deployments though.

You should be running Kafka in multiple DCs/AZs for high availability and scalability.

And in that scenario fsync is nice but not necessary.

I suppose, but that's the trade-off for performance. You have to design your system so that can't happen. If you're in the cloud, that means deploying multi-AZ; if you're colocating, it means paying for racks with separate power and/or having a battery so you have time to fsync and shut down; and if you're fully on-prem, you don't need my advice.

Or I suppose just pay for a managed service from someone who does that for you.

I am not entirely sure what the reason is to make Kafka transactional. The original goal was to have a message queue that holds statistical data, where data loss cannot significantly alter the outcome of the analytics performed on the (often incomplete) data. Why are we in this argument about fsync and such now? Did something change?

If you need reliable data storage do not use Kafka or similar technologies.

Do you have a source/link for that original goal? I wasn’t aware of this, and as such expect that I can rely on kafka for my events. Also, if this is really the case it should be mentioned on the homepage of Kafka.

Just checked kafka’s homepage, it mentions mission critical, durable, fault tolerant, stores data safely, zero message loss, trusted… Seems they’ve moved on from their original goal.

"The original use case for Kafka was to be able to rebuild a user activity tracking pipeline as a set of real-time publish-subscribe feeds. This means site activity (page views, searches, or other actions users may take) is published to central topics with one topic per activity type."


Kafka is used widely as a persistent event store, and its development features reflect that.

Why would I not just turn on fsync or deploy in a distributed pattern for reliability so I can just continue using it instead of ripping it out, benchmarking something new, teaching the entire org something new, potentially negotiating a new contract, and then executing a huge migration?

Just like heroin is widely used as a recreational drug. We live in a free world and you can use Kafka as a persistent reliable store, even use it transactionally.

Instead of reading the marketing claims I like to read what @aphyr has to say about data storage systems.


Are you sure performance would be acceptable if you just turned on fsync on every message?

well it obviously depends on your usage patterns.

But at a certain point any technology is going to reach the limits of what current hardware and operating system primitives can do.

fsync vs. distributed consensus vs. other tradeoffs w.r.t reliability and consistency are not inherent to "Kafka or similar technologies". It's inherent to anything that runs on a computer in the real world.

Generally unless your scale is mind-bogglingly big, the ROI on tuning what you already have is going to be way way bigger than just ripping it out because you read a benchmarking article.

>The original goal was to have a message queue that holds statistical data

I suppose that might have been the original goal, but the current tag line includes "data integration" and "mission-critical".

"Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications."

I guess you can add any feature to anything. I think this whole investor driven development is just sad.

Following up - https://redpanda.com/blog/why-fsync-is-needed-for-data-safet...

Try this on your laptop and see global data loss - hint: multi-az is not enough

Regardless of the replication mechanism you must fsync() your data to prevent global data loss in non-Byzantine protocols.

Can you turn fsync off and rely on recovery with Redpanda?

no, because it is built into the raft protocol itself. with acks=-1 we only acknowledge to the producer once the data has:

1. been written to a majority 2. the majority has done an fsync()

i can see in the future giving people opt-out options here tho.
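a sketch of that ack rule (illustrative python, not redpanda source; the function and field names are made up):

```python
# Illustrative sketch of the acks=-1 rule described above: acknowledge the
# producer only once a majority of replicas have both written AND fsynced.
def can_ack(replica_states, cluster_size):
    majority = cluster_size // 2 + 1
    durable = [r for r in replica_states if r["written"] and r["fsynced"]]
    return len(durable) >= majority

states = [
    {"written": True, "fsynced": True},
    {"written": True, "fsynced": True},
    {"written": True, "fsynced": False},  # one replica's fsync still pending
]
print(can_ack(states, 3))  # True: a majority (2 of 3) is durable
```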

This was a nice read! There are a few issues on both sides, some that others have mentioned and some that I have not seen yet:

For Redpanda:

1. I don't like that they did not include full-disk performance; not sure if that was intentional, but it feels like it... It seems like an obvious gap in their testing. Perhaps most of their workloads have records time out rather than get pushed out by bytes first, not sure.

2. Their benchmark was def selective, sure, but IIUC they sell via proof of performance for tested workloads, not via their posted benchmarks. The posted benchmarks just get them into the proof stage of a sales pipeline.

For Kafka (and Confluent, and this test):

1. Don't turn off fsync for Kafka if you leave it on with Redpanda, that's certainly not a fair test.

Batching should be done on the client side anyway, as most packages already do by default. If you are worried about too many fsyncs degrading performance, batch harder on your clients. It's the better way to batch anyway.

2. If Confluent Cloud is using Java 11, then I don't like that Java 17 is used for this either. It's not a fair comparison, seeing that most people will want it managed anyway, so it gives unrealistic expectations of what they can get

3. Confluent charges a stupid amount of money

4. The author works for Confluent, so I'm not convinced that this test would have been posted if they saw Redpanda greatly outperform Kafka

With Both:

1. Exactly once delivery is total marketing BS. At least Redpanda mentions you need idempotency, but you get exactly-once behavior with full idempotency anyway. What you build should be prepared for this, not the infra you use, IMO; all you need is one external system to break this promise for the whole system to lose it

I prefer Redpanda as I find it easier to run, and Redpanda actually cares about their users whether they are paid or not. Confluent won't talk to you unless you have a monthly budget of at least $10k; Redpanda has extremely helpful people in their slack just waiting to talk to you.

Ultimately you don't just buy into the software, you buy into the team backing it, and I'd pick Redpanda easily, knowing that they can actually help me and care without needing to give them $10k.

> Batching should be done on the client side anyway, as most packages already do by default. If you are worried about too many fsyncs degrading performance, batch harder on your clients. It's the better way to batch anyway.

This is of course why performance suffers with 50 producers and 288 partitions: not because there is any inherent scale issue in supporting 50 clients (Redpanda supports 1000s of clients), but because a 500 MiB/s load spread out among 50 producers and 288 partitions is only ~36 KiB/s per partition-client pair, which is where batching happens. With a linger of 1 ms (the time you'd wait for a batch to form) that's only 36 bytes per linger period, so this test is designed to ensure there is no batching at all, to maximize the cost of fsyncs and put Redpanda in a bad light.
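A quick sanity check of that arithmetic (500 MiB/s, 50 producers, and 288 partitions are the workload numbers quoted above):

```python
# Per partition-per-producer throughput for the benchmark workload.
total_bps = 500 * 1024 * 1024      # 500 MiB/s aggregate load
pairs = 50 * 288                   # producer x partition pairs
rate_per_pair = total_bps / pairs  # bytes/sec for each pair

print(rate_per_pair / 1024)        # ~35.6 KiB/s per pair
print(rate_per_pair * 0.001)       # ~36 bytes per 1 ms linger window
```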

A second problem is that most benchmarks, including the one used here, use uniform timings for everything. E.g., when you set the OpenMessaging benchmark to send 1000 messages per second, it schedules a send of one message every 1 millisecond, exactly: i.e., there is no variance in the inter-message timing.

In the real world, message timing is often likely to be much more random, especially when the messages come from external events, like a user click or market event (these are likely to follow a Poisson distribution).

This actually ends up mattering a lot, because message batching will in general be worse under perfect uniformity. E.g., if you have a linger time of 1 ms, a rate of say 900 messages/sec will get no batching (other than forced batching), because each message arrives ~1.1 ms after the last, missing the linger period. If the arrival times were instead random, or especially if they were bursty, you’d get a fair amount of batching just due to randomness, even though the average inter-message time would still be 1.1 ms.
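This is easy to see in a small simulation (a sketch; the greedy linger model here is a simplification of real producer batching): at 900 msgs/sec with a 1 ms linger, perfectly uniform arrivals never batch, while Poisson arrivals batch substantially.

```python
import random

def simulate_batching(gaps, linger_s=0.001):
    """Greedy linger model: a batch opens on the first message and closes
    linger_s later; messages arriving inside the window join the batch."""
    batches, t, batch_close, current = [], 0.0, None, 0
    for gap in gaps:
        t += gap
        if batch_close is None or t > batch_close:
            if current:
                batches.append(current)
            current, batch_close = 1, t + linger_s
        else:
            current += 1
    if current:
        batches.append(current)
    return batches

rate, n = 900.0, 100_000    # 900 msgs/sec -> mean gap ~1.1 ms
uniform = [1.0 / rate] * n  # perfectly even spacing
random.seed(1)
poisson = [random.expovariate(rate) for _ in range(n)]  # random arrivals

print(n / len(simulate_batching(uniform)))  # 1.0: every message its own batch
print(n / len(simulate_batching(poisson)))  # ~1.9: real batching appears
```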

Disclosure: I work at Redpanda.

Of course, having your producers linger is just another potential source of data loss if the client node dies before it can actually produce.

This is not data loss in the sense we talk about for Kafka or other queues, however, since the messages have not been acked: the state of unacked messages is completely unknown and no guarantees are made about them.

Again, if we’re talking about a full failure across all AZs, this feels like a distinction without a difference.

We weren't necessarily talking about that at all but whether data "lost" because a client crashed before it received acknowledgement of a durable write from the server is somehow the same as losing data that has been acknowledged by the backend.

I argue they are not at all the same: it is, for example, the difference between getting an error when you try to place an online order and getting a successful confirmation but the order is then silently lost.

My most recent Confluent Kafka upgrade put Java 17 on the system, so I'd say doing the benchmark under Java 17 is reasonable despite Confluent Cloud running 11. My upgrade was done using the Confluent Ansible playbooks.

>Issue #1 is that Kafka's server.properties file has the line log.flush.interval.messages=1, which forces Kafka to fsync on each message batch. So all tests, even those where this is not configured in the workload file, will get this fsync behavior. I have previously blogged about how Kafka uses recovery instead of fsync for safety.

Respect to the Kafka team as Kafka is an incredible piece of software, but the Mongo guys got torched for eternity for pulling the same shenanigans.

Kafka, unlike Mongo DB, relies on recovery/replication instead of fsync:


Kafka has never tried to hide that fact and it does not, in any way, make Kafka unsafe.

I don't think Kafka eschewing fsyncs is a bad thing; I'm aware of the risks. What I'm pointing out, and what got Mongo killed in the court of public opinion, was saying "our database is blazing fast because we turned off fsyncs".

Benchmarking a system that fsyncs every write against one that doesn't isn't an apples-to-apples comparison. You are free to make the argument that you might not need fsyncs, but if you are benchmarking systems and one of them fsyncs by default, that is the level of durability I'm going to expect; otherwise I can assume the other guy will be just as fast if he turns off fsyncs as well.

Is durability preserved when you lose replica connectivity around the same time as power to your CPU? As tends to happen.

Exactly. I will never ever try MongoDB because of that. A database that does not fsync should not be called a database.

MongoDB moved on from mmap at version ~3.6. WiredTiger can be configured to fsync every commit. Enjoy trying MongoDB!

PS: I really miss working with mongodb. It's been almost 7 years since I last used it. I'm surprised I don't see it mentioned very often anymore.

Last I heard of MongoDB it was getting utterly buried by the Jepsen guy, and for anyone that follows distributed systems at some technical level, that is damning. He finds stuff wrong with everything, but that one was particularly damning.

MongoDB has always seemed to place write consistency secondary to other priorities (mostly sales / read / features) which is frankly a crap way to do a database, much less a distributed one. And I am so sick of MongoDB basically saying "no it's fixed in the new version" which is always a major red flag.

Right now it's getting its lunch eaten by Postgres's document interface from what I can tell.

a) Every distributed database has had serious issues with Jepsen.

b) MongoDB has been growing revenue ~40% year on year for the last few years.

c) PostgreSQL is only a serious competitor for MongoDB if you have small datasets. After all these years PostgreSQL is still ridiculously poor when it comes to clustering, replication, etc. Everyone's solution of "just buy a bigger instance" is just laughable.

Growing revenue of the owning company as an argument for a database? We have an Oracle fan here.

Jepsen does find stuff with everything. Thus you have to know what is being discussed is serious and blatantly bad, or just the usual "wow distributed is hard".

Which is why his papers are so great.

But the MongoDB one was "wow this is bad".

Every distributed database has been "wow this is bad".

I assume you have an example of one that wasn't ?

"MongoDB’s default level of write concern was (and remains) acknowledgement by a single node, which means MongoDB may lose data by default."

Cassandra doesn't do that, consistency level is fundamental to the documentation and user guide. That is AWFUL.

"Curiously, MongoDB omitted any mention of these findings in their MongoDB and Jepsen page. Instead, that page discusses only passing results, makes no mention of read or write concern, buries the actual report in a footnote, and goes on to claim:

    MongoDB offers among the strongest data consistency, correctness, and safety guarantees of any database available today.

That is fraud. That is a clown show. Enjoy your increasing revenue.

The default write concern for the last 2 years has been majority.

And single node is a perfectly fine default for most use cases.

After all Cassandra's default consistency level is 1.

Most users of ScyllaDB/Cassandra use quorum or local_quorum

And most users of MongoDB use majority write concern.

Hence my point there is no difference in defaults between MongoDB and Cassandra.

Although it was some time ago and I may be misremembering, I seem to recall reading the Jepsen article on Redpanda and thinking that it (and the PostgreSQL one) were among the better reports.

Certainly, not all Jepsen reports are that bad, and tbh I'm at least as interested in the way the vendors respond (some of which have been terrible).

Kafka doesn't do any stupid tricks, but uses the underlying platform to its full potential: https://kafka.apache.org/documentation/#linuxflush

With the usual recommended settings, XFS filesystem, 3 replicas, 2 "in-sync" replicas, etc., it is rather safe. You can also tune background flush to your liking.

The above tradeoffs are very reasonable and Kafka runs very fast on slow disks (magnetic or in the cloud), and even faster on SSD/NVMe disks.

Kafka is not a database....

Maybe you could say that if it acted like redis pub/sub and nothing was stored.

MongoDB has been doing fsync by default for over a decade now.

And those who actually tried it were aware that every client enabled fsync out of the box. So in fact the entire situation was seriously overblown.

But sure let irrational ideology affect your technology decisions. That will work out well.

Avoiding a database that has a proven historical record of disregarding data consistency and resorting to marketing gimmicks is "irrational ideology"?

Not everyone has time to review every single line of code in their tech stacks. Past reputation is important, and your replies here don't seem to be of much help to MongoDB's reputation as far as I can tell.

> Issue #1 is that Kafka's server.properties file has the line log.flush.interval.messages=1, which forces Kafka to fsync on each message batch. So all tests, even those where this is not configured in the workload file, will get this fsync behavior. I have previously blogged about how Kafka uses recovery instead of fsync for safety.

And then in this article it's explained how Kafka is actually unsafe:

> Kafka may handle simultaneous broker crashes but simultaneous power failure is a problem.

i.e., it's safe just against simultaneous node crashes (whole VM/machine), not power failures.

I mean - sure in practice running in different AZs, etc. will probably be good enough, but technically...

You can't eliminate the risk of data loss, only control for it. fsync is one such control. Empirically, having separate power failure domains strongly controls for the power loss risk.

In the tail there are all kinds of things that will lose you data. I've actually seen systems lose data with the fsync every message strategy on simultaneous power loss. There was latent corruption of the filesystem due to a kernel bug. After power cycling a majority of nodes had unrecoverable filesystems.

In my experience, even on modern flash the cost of fsync is non-trivial. It pessimizes I/O. You can try to account for this with group commit / batching, but generally the batch window needs to be large relative to network RTT to be effective.

fsync is much more necessary on single primary systems.

I only remember losing one etcd cluster, and it was due to something along these lines. Data center at the customer site lost power, and we were called when they couldn't recover our software. All the etcd volumes were corrupted, and after volume recovery by the customer IT department, we found all our etcd files corrupted.

My best guess is their volume systems simply lied about the fsync, which I've heard of a few times about different vendors.

If your workload demands it, then by all means set log.flush.interval.messages=1 or find an alternative solution that is a better match for your requirements.

Kafka has never pretended that ack'd messages have been persisted to disk, only that they've been replicated per your requested acks.

Yep, kafka by default is set up to lose data; many people don't know or don't care, it seems…

Well, that just isn't accurate really. Kafka would need simultaneous VM failure across all AZs. That just doesn't happen in the real world often enough to worry about. It has never happened in Confluent Cloud. RP have a similar issue. Single AZ deployments with local NVMe drives. AZ loses power, a majority of brokers could lose all their data. Then there's data corruption. Fsyncs alone don't save you. The next step would be to implement Protocol Aware Recovery (https://www.usenix.org/conference/fast18/presentation/alagap...) like TigerBeetle has. Does a system that has implemented anti-corruption in the storage layer now get to lambast Redpanda, Pulsar, ZooKeeper etc because they didn't implement that?

I vouched for this comment (can we please not, folks?). Sure, but many people don't run across AZs because it costs a ton of money. Fsync alone doesn't save you, but it sure makes data loss less likely.

> Does a system that has implemented anti-corruption in the storage layer now get to lambast Redpanda, Pulsar, ZooKeeper etc because they didn't implement that?

Sure, why not? I think zk doesn’t do fsync too btw

My gut feeling is that if your only AZ goes down (or all your AZs simultaneously), you're going to lose data period because your producers are now all stuck, your APIs are unavailable, etc. Whether the data loss begins at the exact moment power failed or a couple minutes before doesn't matter, vs. the additional cost to fsync constantly.

I mean it's good to know all the failure modes, but at the end of the day it's also good to know how much handling them will cost, and it's often not worth it.

This is a very practical way of looking at the problem and is true for the majority of systems, but anyone serious enough about keeping their data, and not just pretending, has some kind of back-pressure mechanism built in, so the messages will stop flowing if they can't be processed.

Right, and best case that’s going to come back as 503s or 429s, and if that continues for any length of time your customers are going to view it as morally equivalent to data loss (or maybe worse, if the response has no reason for them to be tied to some event stream).

producers stuck != data loss (if you use transactional commits at least). If you run in multiple regions you don't need multi-AZ in a lot of architectures

I don't mean because of some misfeature in the Kafka protocol, I mean because events are still coming in but have nowhere to go. Unless you built a spill as wide as your Kafka cluster. Which isn't worth it, so no one does it.

Who is running single az deployments who also cares about data loss and availability? Seriously? I’ve personally supported 1000s of kafka deploys and this isn’t a thing in the cloud at least. There is no call for wanting fsync per message, it is an anti pattern and isn’t done because it isn’t necessary. Data loss in kafka isn't a real problem that hurts real world users at all.

I was grabbing beer with a buddy who has run some large - petabytes per month - Kafka deployments, and his experience was very much that Kafka will lose acked writes if not very carefully configured. He had direct experience with data loss from JVM GC creating terrible flaky cluster conditions and, separately, from running out of disk on one cluster machine

> There is no call for wanting fsync per message, it is an anti pattern and isn’t done because it isn’t necessary

1. Don't have to do it by message

2. It's used by many distributed db engines, kafka and (i think) zk are the outliers here, not the other way around

Kafka is not a "db engine". zk is a "db engine" in the same way 'DNS' is a "db engine".

Oh, DNS is definitely a database engine [1] ;)

[1]: https://dyna53.io

Ah yes, the semantic argument. Fyi - pulsar and etcd do use fsync

No one is arguing with you. You were making an argument based on a misinformed software category assertion and the error was pointed out. So r/fyi/til maybe?

I can't name the "unserious" people who aren't running multi-AZ, but this is the approach to durability that MongoDB took ~15 years ago and they have never lived it down.

It may just be that data reliability isn't a huge concern for messaging queues, so it's less of an issue, but pretending the risk isn't there doesn't help anyone.

Zookeeper absolutely does fsync, you can't disable it (without libeatmydata). It will log a warning if fsync is too slow as well.

Exactly because we read the documentation and we use it for things where losing data is acceptable.

Just like using HyperLogLog is acceptable in many scenarios, using Kafka is also acceptable. I am quite baffled by how widespread the misuse of technology is.

Need reliable data storage? Use a database.

the opposite should be true tho. opt-in for unsafe. you are the minority if you read the docs, let's be real :) most ppl never read the full docs. of the ppl i chat w/, it's more like 5%

> Redpanda end-to-end latency of their 1 GB/s benchmark increased by a large amount once the brokers reached their data retention limit and started deleting segment files. Current benchmarks are based on empty drive performance.

Using empty-drive performance seems really disingenuous, since anyone who cares about performance is going to care about continuous use.

It's pretty ironic considering they blame JVM garbage collection for bad latency, but ignore their own disk garbage collection that also seems to cause some pretty bad latency.

The disk thrashes because of fsyncs (Kafka doesn't perform any fsyncs). But you can provision more disk space to mitigate this problem. And it looks like the test was set up this way to make Redpanda look worse.

You have to provision your disk space accordingly. NVMe needs some free space to have good performance. In this case I think that in Redpanda's benchmarks the disk space was available, and in the benchmarks done by the Confluent guy the system was provisioned to use all disk space.

With the page cache it's OK, because the FTL layer of the drive will work with 32MiB blocks, but in the case of Redpanda the drive will struggle because the FTL mappings are complex and GC has more work. If Kafka were doing fsyncs the behaviour would be the same.

Overall, this looks like a smear campaign against Redpanda. The guy who wrote this article works for Confluent, and he published it on his own domain to look more neutral. The benchmarks are not fair because one of the systems is doing fsyncs and the other is not. Most differences could be explained by this fact alone.

I'd like to see these on OpenJDK 11, since that's what Confluent is running on and the author makes a point of switching to 17 even though he works for Confluent.

In either case, Confluent Platform is ridiculously expensive; licensing alone approached the cost of our entire cloud spend. I'd love to see more run-on-k8s alternatives to CFK.

There's really no reason for Confluent to be so expensive; the pricing depends on so many factors that it's easy to fuck up and receive stellar bills. There's also the fact that they release so many components with restrictive licenses, and that they postponed the Kafka tiered storage feature (which allows you to unload some topic data to S3 instead of expensive SSD disks) so that they could squeeze more money from their customers.

For long-term storage I agree too. The reason we invented our byoc was so that (1) you own your storage and (2) we only charge you for value add

I really dislike the way Confluent has treated the Kafka ecosystem. It feels like they went out of their way to make OSS Kafka kludgy and then priced their enterprise offering completely out of the reach of anyone but Fortune 500.

I have been using Pulsar for new projects not because of performance or anything but because all the features you expect to be built-in are. Georeplication, shared-subscription w/selective ACK, schema registry etc.

Also it's wildly more pluggable; the authn/authz plugin infrastructure in particular is great. I was even able to write a custom Pulsar segment compactor to do GDPR deletions without giving up offloaded segment longevity.

The segment offload is actually huge especially because tools like Secor for Kafka are dead now and you are stuck on the Kafka Connect ecosystem which personally I really find distasteful.

I agree with Confluent pricing, we had the same experience. We switched to pub/sub and Azure Event hubs.

I don't even understand why Confluent should price their offering so high. It's not like real-time is an exclusive service that other platforms don't have.

Because they have stock they need to pump.

I've found talking to Confluent about anything is a complete waste of time unless it's a very specific technical issue. They're always pushing their cloud as the solution, and it's very aggressive.

Really? We had the opposite experience. We got the impression that sales loves to sell against their cloud. Probably commission related.

At my last org, we spent hundreds on Confluent and then they did a pricing adjustment and our bill went up 4x. No exaggeration. We moved from Kinesis to Confluent because it was cheaper. After that, we moved back to save money.

There are serious limitations with Azure Event Hubs though, especially the max number of topics.

We ended up migrating to Aiven after finding Confluent pricing unreasonable.

Have you checked Strimzi for Kafka on k8s? it’s super good

Strimzi is really great. Creating a Debezium Change Data Capture system and seeing all topics and users as Kubernetes CRDs is just tidy and magical. The only downside is that Redpanda isn't yet supported in Strimzi, but when I met them at KubeCon last month they mentioned it might be supported in the future :)

We really wanted to try Redpanda, but operationally it does not appear to be very k8s native and in fact looks like it takes a lot of one-off hand holding to get working properly.

Hopefully that can get ironed out in the future. Until then we will stick with the Strimzi operator and kafka.

Also Confluent is absolutely pricing themselves out of the market. We looked at their self hosted confluent operator and they wanted something like $9k per node, when they do nothing but provide an operator. Insanity.

our real storage is s3 - local disk is for the staging/raft layer. how is that not cloud native. if by cloud native you mean k8s, it is true that our k8s operator was built mostly for our cloud, but we released it... the good news is a new interface (same code) w/ more user-friendly defaults is about to be released. you can track it all on github tho.

Oh I see the operator now, looks decent. Before, the deployment documentation I had found was very manual and full of a lot of pod exec commands.

Worked with many operators in the wild, and anything that gives you more control through CRDs/automation and less manual pod intervention is a huge win; it lets us bake it into our already existing pipelines for deployments and releases too. The Confluent($$$$)/Strimzi operators do well on that front. I'm super excited to have competition in this space!

I'll keep an eye out for the new release!

totally. we built a new team focused on the dev experience of k8s alone. 90 seconds to prod (on a working eks cluster) with TLS, external certs, etc. That's the benchmark we're trying to hit :)

MongoDB Enterprise is 8-20k per node and they just provide management software and support.

Curious which version you tried and what k8s environment did you explore?

I'd like to see a baseline of fio and iperf3 for these same instances so we know how much raw performance is available for disk, network alone and together.

Cloud instances have their own performance pathologies, esp in the use of remote disks.

As for RP and Kafka performance, I'd love to see a parameter sweep over both configuration dimensions as well as workload. I know this is a large space, but it needs to be done to characterize the available capacity, latency and bandwidth.

These instances can manage disk throughput of up to 2 GB/s (400K IOPS) and network throughput of 25 Gbps, or ~3.1 GB/s.
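As a quick sanity check on the unit conversion (the numbers above are quoted instance limits, used here purely as illustrative inputs):

```python
# Sanity-check the quoted instance limits: 25 Gbps network vs 2 GB/s disk.
network_gbps = 25                   # network line rate, gigabits per second
network_gb_per_s = network_gbps / 8 # bits -> bytes: 3.125 GB/s
disk_gb_per_s = 2.0                 # quoted max disk throughput, GB/s

print(f"network: {network_gb_per_s:.3f} GB/s, disk: {disk_gb_per_s} GB/s")
print(f"network headroom over disk: {network_gb_per_s / disk_gb_per_s:.2f}x")
```

Which is where the ~3.1 GB/s figure and the "network has 50% more bandwidth than disk" observation downthread come from.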

There are so many dimensions, with configurations, CPU architecture, hardware resources plus all the workloads and the client configs. It gets kind of crazy. I like to use a dimension testing approach where I fix everything but vary one or possibly two dimensions at a time and plot the relationships to performance.
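That one-dimension-at-a-time approach can be sketched as follows (the parameter names and values here are made up for illustration, not real OMB or Kafka option names):

```python
def one_factor_sweeps(baseline, variations):
    """Yield (name, value, config) triples where at most one parameter
    deviates from the fixed baseline at a time."""
    for name, values in variations.items():
        for value in values:
            config = dict(baseline)  # everything else stays fixed
            config[name] = value
            yield name, value, config

# Hypothetical benchmark dimensions.
baseline = {"producers": 4, "batch_kb": 64, "partitions": 144}
variations = {"batch_kb": [16, 64, 256], "partitions": [36, 144, 576]}

runs = list(one_factor_sweeps(baseline, variations))
print(len(runs))  # 6 runs instead of the 9 a full 3x3 grid would need
```

Plotting each sweep's results against the varied dimension then gives a per-dimension performance curve without the combinatorial blowup of a full grid.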

I agree, sounds like a good approach.

Can the instance do 2 GB/s to disk at the same time it is doing 3.1 GB/s across the network? Is that bidirectional capacity or a single direction? How many threads does it take to achieve those numbers?

That is kind of a nice property, that the network has 50% more bandwidth than the disk. 2x would be even nicer, but that turns out to be 1.5 and 3 GB/s, i.e. a slight reduction in disk throughput.

Are you able to run a single RP or Kafka node and blast data into it over loopback? That would isolate the network and show how much of the available disk bandwidth a single node can achieve over different payload configurations before moving on to a distributed disk+network test. If it can only hit 1 GB/s on a single node, you know there is room to improve in the write path to disk.

The other thing that people might be looking for when using RP over AK is less jitter due to GC activity. For latency sensitive applications this can be way more important than raw throughput. I'd use or borrow some techniques from wrk2 that make sure to account for coordinated omission.
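The coordinated-omission correction that wrk2 popularized can be sketched like this: latency is measured from each request's *intended* send time on a fixed-rate schedule, so one server stall inflates the reported latency of every request queued behind it. All numbers below are synthetic:

```python
def latencies(service_times, interval):
    """Compare naive latency (per-request service time) with
    coordinated-omission-corrected latency: time from each request's
    *scheduled* start on a fixed-rate schedule to its completion."""
    naive, corrected = [], []
    finish = 0.0
    for i, svc in enumerate(service_times):
        intended = i * interval        # fixed-rate schedule
        start = max(intended, finish)  # requests queue behind any stall
        finish = start + svc
        naive.append(svc)
        corrected.append(finish - intended)
    return naive, corrected

# One 100 ms stall in an otherwise 1 ms service-time stream, 10 ms interval.
naive, corrected = latencies([0.001] * 3 + [0.100] + [0.001] * 3, 0.010)
# The requests after the stall look fine naively (1 ms each), but the
# corrected numbers expose the queueing delay the stall caused.
print(naive[4], corrected[4])
```

A naive load generator that only starts timing when it actually sends effectively omits exactly the samples that coordinated with the stall, which is why GC-pause jitter can disappear from badly measured benchmarks.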

"I hope you come away with a new appreciation that trade-offs exist, there is no free lunch despite the implementation language or algorithms used. Optimizations exist, but you can’t optimize for everything. In distributed systems you won’t find companies or projects that state that they optimized for CAP in the CAP theorem. Equally, we can’t optimize for high throughput, low latency, low cost, high availability and high durability all at the same time. As system builders we have to choose our trade-offs, that single silver-bullet architecture is still out there, we haven’t found it yet."

> In distributed systems you won’t find companies or projects that state that they optimized for CAP in the CAP theorem.

This is absolutely rich from the company that keeps promising "exactly once delivery" (with reams of fine print about what "exactly" and "once" and "delivery" mean).

The author says "Redpanda incorrectly claim Kafka is unsafe because it doesn’t fsync - it is not true".

If you don't fsync the batch, it's possible the server sends a response to the client saying the data was written successfully while the batch is still just in memory; the server could then lose power and never write it to disk.

Maybe the author has a different definition of unsafe, but to me if it's not ACID it's unsafe!

Kafka won't ack to the producer in default conf until the replicas have acked to the leader.
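For concreteness, the knobs being debated here are standard Kafka options; a broker and producer configured for synchronous, fsync-on-every-message durability might look like this (the specific values are illustrative):

```properties
# Broker (server.properties): fsync after every message -- the
# flush-to-disk stance discussed in this thread. Kafka's default is to
# leave flushing to the OS page cache and rely on replication instead.
log.flush.interval.messages=1

# Minimum number of in-sync replicas required for an acks=all write
# to succeed (typically 2 with replication.factor=3).
min.insync.replicas=2

# Producer: wait for all in-sync replicas to ack before considering a
# send successful. This is the producer default in recent Kafka clients.
acks=all
```

The replication-protocol argument above is that with `acks=all` across fault domains, the `log.flush.interval.messages=1` fsync is not required for durability; the counter-argument is that without it, an acked batch can still be lost to correlated power failures.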

A topic partition can lose some messages without compromising the correctness of the data replication protocol itself.

But I don't think anyone would call a configuration where you can lose messages a safe configuration.

"...all this is really just benchmarketing, but as I stated before, if no-one actually tests this stuff out and writes about it, people will just start believing it. We need a reality check."

Well said

The biggest point of contention here seems to be over whether kafka can still be considered durable/safe when fsync is disabled.

Seems like it'd be valuable to have a trusted third party like https://jepsen.io/ test it out! (not related, just a fan of their work)

A tangent but how do distributed robustness properties in face of communication hiccups compare between Redpanda and Kafka? Eg with Raft apparently you can still fail in presence of asymmetric network failures (like in https://blog.cloudflare.com/a-byzantine-failure-in-the-real-...)

I wonder if there's an embedded equivalent for such systems? Something like fasterlog but more mature?

You could try nats jetstream https://docs.nats.io/nats-concepts/jetstream

I've found nats to be very lightweight, and it can bridge (bidirectional) to kafka.

Edit: Oh it also supports websockets to the browser

When you add compaction, indexing, recovery, tiered storage, etc. some things become harder to reason about wrt systems resources if you are embedded.

TLDR: "I work at Confluent, the owners of Kafka, and I have determined through my tests that Redpanda's performance is greatly exaggerated."

I don't think we can get a less reliable or trustworthy set of performance tests than when someone's paycheck depends on the outcome of those tests. If Redpanda's performance were found to be better, would he really publish the test results?

I mean, the other benchmarks we have are from RedPanda, so we're comparing one biased set of benchmarks to another biased set of benchmarks. Ultimately it's a matter of the reader understanding the methodology and drawing their own conclusions based on their own experience. I appreciate that the author explains the changes they've made, the impact of those changes, and why they think the changes are reasonable (ex: disabling fsync).

Personally I'm happy to see companies competing on performance like this. If one company puts out benchmarks I want to see their competition come in with their own benchmarks. Ideally we'll see improvements to both products, and a refined benchmarking suite and philosophy.

Disabling fsync is dubious.

I do find it interesting that Confluent feels the need to respond to RP given the disparities in size, install base, etc.

I've been watching Redpanda for a couple years primarily because I'm interested in their wasm data transformations. In the past 3 months I've heard it mentioned several dozen times by other teams in our company, vs. maybe 2-3 times in the >1y prior. So something seems in the air, and presumably Confluent has noticed.

I'm not sure why, Kafka per se doesn't seem to have really dropped any significant balls lately (and we're self-hosted so Confluent isn't very relevant).

Everyone is thinking about their cloud costs right now, so something that offers higher perf and lower ops is going to be more relevant today.

We’re about to release a revamped wasm and a new SDK with previous lessons learned. Should be cool

Any sign of JSON schema in the registry? That would be great if so!

I actually enjoy these kinds of benchmarks. They're both incentivized to show their own platforms running in the most optimal setups and they're also incentivized to call out any BS from the other party. In the end users get to see the good and the bad of both platforms.

For this particular post I like that they explained each settings change they're making and why. In many of these benchmarks people will make some change and either not mention it or won't explain why they made the change and users are left trying to figure it out.

I don't think who does the benchmark, any benchmark, matters as long as they're open about how it was done, what properties were set, ideally why they were set, and what their results were. The big picture goal is to ostensibly be able to reproduce such benchmarks.

But I've found through industry that most benchmarks, especially for infrastructure software, are performed by the vendors. The burden for standing up the system(s) to pull off the benchmark is usually high enough that independents are rarely going to take up that banner and do it themselves.

Also, notably closed source systems, some vendors don't license their software to allow public benchmarks.

So, transparency is all we can really hope for.

I remember the halcyon days of the database wars with the vendors publishing new benchmarks seemingly ever month. Fun to watch "Lies, damn lies, and statistics" rear up on its hind legs and roar. And some of the monster clusters of hardware these folks put together were legion.

Similarly I enjoyed when Sun was publishing JEE benchmarks on cheap hardware running Glassfish against MySQL. At least they were publishing on these smaller systems more akin to what many companies may run internally in contrast to these million dollar cluster benchmarks BEA and Oracle were publishing.

Finally, just to throw this out, modern hardware is just extraordinary. Hard to appreciate how fast modern machines are if you didn't live with them in the old days.

We're in the glory days where we, most of us, simply don't care. Off the shelf hardware running untuned servers with reasonable algorithms have so much bandwidth and capability, just gets harder and harder to saturate today.

> Off the shelf hardware running untuned servers with reasonable algorithms have so much bandwidth and capability, just gets harder and harder to saturate today.

Interestingly that's not necessarily the case in the public cloud. I'm messing around with AWS storage for an upcoming talk. You definitely can saturate storage on AWS, and it's sometimes hard to tell why.

Author here. Anyone can run these tests. It's available for anyone to run and check my results.

Confluent doesn't own Kafka. Apache Kafka is an Apache project, with its own governance structure. Some of the project management committee is employed by Confluent, but not all: e.g., the current PMC chair is Mickael Maison, employed by Red Hat. See https://projects.apache.org/committee.html?kafka

The Kafka PMC is utterly dominated by current or former Confluent employees. Everything Kafka does has been and always will be with Confluent's best interest first and foremost. The idea that Kafka isn't completely controlled by Confluent would be disingenuous at best. I don't have anything against Kafka or Confluent, but people should call a spade a spade here when it's blatantly obvious.

This is a very dim view, which doesn't match what I personally have seen, but to each his own.

Confluent don't own Kafka :)

Apache Software Foundation owns Kafka.

Meh. It's obvious Confluent exploited the status of being an Apache open source project in order to say they were open-source. But look at the make up of the PMC of Kafka and it's completely dominated by Confluent employees or former employees. Nothing gets done without Confluent's approval or best interest at heart.

Well, they did write most of it, and the PMC composition is changing.

> Nothing gets done without Confluent's approval or best interest at heart.

I disagree. This explicitly competes against the tiered storage in Confluent's enterprisey Kafka flavour: https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A...

True. I wasn't trying to suggest there wasn't a bias here or minimize Confluent's involvement in the project.

EDIT: Thank you for clarification. It is a fair 3 node vs 3 node benchmark.

Does this benchmark compare a 3-node Kafka cluster against a 3-node Redpanda cluster? It's unclear.

I had to go to the repo to see the benchmarking setup to get an answer. Looks like 3 Kafka nodes.


Author here. Yes both Kafka and Redpanda were deployed on identical hardware: three i3en.6xlarge.

I'm not sure which of us read it correctly.

I -think- they are saying the original benchmark results done by RedPanda show a 9 node Kafka cluster being beaten by a 3 node RP cluster.

This new benchmark I assume is being done on identical hardware (i.e 3 nodes for both) just with less terrible Kafka settings.

I'm confused because it's never clarified, and it starts with:

> According to their Redpanda vs Kafka benchmark and their Total Cost of Ownership analysis, if you have a 1 GB/s workload, you only need three i3en.6xlarge instances with Redpanda, whereas Apache Kafka needs nine and still has poor performance.

but scanning it again, I think they are in fact doing 3 node vs 3 node benchmark. It's just a bit unclear.

Surely nobody runs 3 Kafka nodes in production, no?

I know of several production clusters at 50m+ ARR companies with three nodes.
