jvanlightly's comments

Author here: Regarding point 3, it was my understanding that inserts are synchronous but updates may be asynchronous. Is that the case?


No, updates are also synchronous.

A table in ClickHouse is represented by a set of immutable data parts. A SELECT query acquires a snapshot of the currently active data parts and uses it for reading. An INSERT query creates a new data part (or several) and links it into the set. Background merge operations (aka compactions) create new data parts from existing ones and remove the old ones. The old data parts can still be used in a snapshot by some queries; they are ref-counted and will be removed later by a GC process. Updates are heavy (they are named "mutations"). Mutations create new data parts with modified data and remove the existing data parts, similar to merges. There are some optimizations, though: instead of a full copy of the data, a mutation can link the existing data and only add a small patch on top of it, be it the row numbers of deleted records plus the added records, or a lazy expression to apply.

To summarize:

- every data part is immutable;

- inserts, selects, merges, and mutations work concurrently with no long locks;

- it resembles a lock-free data structure, and we could almost say it is lock-free, but... there are mutexes inside for short locks.
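The lifecycle above can be sketched as a toy model (illustrative Python, not ClickHouse's actual code): parts are immutable, a snapshot pins the parts it reads, and a merge swaps in a new part while pinned old parts survive until the reader releases them.

```python
# Toy model of the immutable-part lifecycle described above
# (illustrative only, not ClickHouse's actual implementation).
class Part:
    def __init__(self, rows):
        self.rows = rows
        self.refcount = 1          # held by the active set

active_parts = []

def insert(rows):
    # An INSERT creates a new immutable part and links it into the set.
    active_parts.append(Part(rows))

def snapshot():
    # A SELECT pins the currently active parts; they cannot be GC'd
    # while the query holds them.
    for p in active_parts:
        p.refcount += 1
    return list(active_parts)

def release(snap):
    for p in snap:
        p.refcount -= 1

def merge():
    # A background merge writes one new part from the existing ones and
    # unlinks the old parts; readers holding a snapshot keep them alive.
    global active_parts
    merged = Part([r for p in active_parts for r in p.rows])
    for p in active_parts:
        p.refcount -= 1            # unlink from the active set
    active_parts = [merged]

insert([1, 2]); insert([3])
snap = snapshot()                  # old parts pinned by a reader
merge()                            # reader's view is unaffected
assert [r for p in snap for r in p.rows] == [1, 2, 3]
release(snap)                      # now the old parts are GC-able
```

The mutation case would look the same as `merge()`: build a new part (or a patch layered on the old one) and unlink the originals, so no reader ever observes a half-modified part.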


Protocol Aware Recovery is really needed if you want Raft to tolerate disk corruption. https://www.usenix.org/conference/fast18/presentation/alagap...


> Protocol Aware Recovery is really needed if you want Raft to tolerate disk corruption

I believe OP does not make any claims about arbitrary log corruption. Neither Raft nor the Kafka protocol can handle it. It is about losing the tail of the log due to fsync failures, or rather the lack of fsyncs.


The blog post mentions that it applies to any protocol that is non-BFT.


These instances can manage disk throughput of up to 2 GB/s (400K IOPS) and network throughput of 25 Gbps, or ~3.1 GB/s.
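The ~3.1 GB/s figure is just the bits-to-bytes conversion; a quick sanity check:

```python
# Sanity check of the quoted figures: 25 Gbit/s expressed in GB/s,
# and how it compares to the 2 GB/s disk throughput.
net_gbit_s = 25
net_gbyte_s = net_gbit_s / 8        # 8 bits per byte
disk_gbyte_s = 2.0
print(f"network: {net_gbyte_s:.3f} GB/s "
      f"({net_gbyte_s / disk_gbyte_s:.2f}x the disk throughput)")
# → network: 3.125 GB/s (1.56x the disk throughput)
```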

There are so many dimensions: configurations, CPU architecture, hardware resources, plus all the workloads and the client configs. It gets kind of crazy. I like to use a dimension testing approach where I fix everything else and vary one or possibly two dimensions at a time, then plot the relationships to performance.


I agree, sounds like a good approach.

Can the instance do 2 GB/s to disk at the same time it is doing 3.1 GB/s across the network? Is that bidirectional capacity or in a single direction? How many threads does it take to achieve those numbers?

That is kind of a nice property, that the network has 50% more bandwidth than the disk. 2x would be even nicer, but that would mean 1.5 GB/s disk and 3 GB/s network, so a slight reduction in disk throughput.

Are you able to run a single RP Kafka node and blast data into it over loopback? That would isolate the network and show how much of the available disk bandwidth a single node can achieve over different payload configurations before moving on to a distributed disk+network test. If it can only hit 1 GB/s on a single node, you know there is room to improve in the write path to disk.

The other thing that people might be looking for when using RP over AK is less jitter due to GC activity. For latency-sensitive applications this can be way more important than raw throughput. I'd use or borrow some techniques from wrk2, which makes sure to account for coordinated omission.

https://github.com/giltene/wrk2

https://github.com/fede1024/kafka-benchmark

https://github.com/fede1024/rust-rdkafka


It's a common misconception about Kafka and fsyncs. But the Kafka replication protocol has a recovery mechanism, much in the same way that Viewstamped Replication Revisited does (except it's safer due to the page cache), which allows Kafka to write to disk asynchronously. The trade-off is that we need fault domains (AZs in the cloud), but if we care about durability and availability, we should be deploying across AZs anyway. We've seen plenty of full region outages, but zero power loss events in multiple AZs in six years.

Kafka and fsyncs: https://jack-vanlightly.com/blog/2023/4/24/why-apache-kafka-...


As far as I read the blog post, I understand that it assumes the scenario that "a replica dies (losing the unsynced suffix of its log due to no fsync) and comes back instantaneously (before another replica catches up to the leader)".

Then, in Kafka, what if the leader dies from a power failure and comes back instantaneously?

i.e.: Let's say there are 3 replicas A(L), B(F), C(F) (L = leader, F = follower)

- 1) append a message to A

- 2) B and C replicate the message. The message is committed

- 3) A dies and comes back instantaneously before zk.session.timeout elapses (i.e. no leadership failover happens), losing the unsynced suffix of its log due to no fsync

Then B and C truncate their logs, and the committed message could be lost? Or is there any additional safety mechanism for this scenario?
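The scenario can be condensed into a toy simulation (hypothetical Python, not Kafka's actual implementation) showing how, under those assumptions, the committed record would vanish from every replica:

```python
# Toy model of the scenario above: a committed record is lost when an
# unsynced leader restarts and keeps its leadership.
def simulate():
    leader = ["m1", "m2"]                     # A's log; "m2" is only in the page cache
    followers = [["m1", "m2"], ["m1", "m2"]]  # B and C have replicated m2
    committed = "m2"                          # acked: the full ISR has it

    # A crashes before fsyncing and restarts within the session timeout,
    # so no leader election happens and A keeps the leadership.
    leader = ["m1"]                           # the unsynced suffix is gone

    # B and C fetch from the leader and truncate to its log end offset.
    followers = [log[:len(leader)] for log in followers]
    return leader, followers, committed

leader, followers, committed = simulate()
assert all(committed not in log for log in [leader] + followers)
print(f"committed record {committed!r} is lost on every replica")
```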


I love this question. Would be great to hear back from Confluent about this.

One safety mechanism I can think of is that the replicas will detect the leader is down and trigger leader election themselves. Or that upon restart the leader realizes it restarted and triggers leader election in a way that B ends up as the leader. (not sure either is being done)

As I think about it more, even if there’s a solution I think I’ll stick to running Redpanda or running Kafka with fsync.


The solution seems to be fsync. It’s what it’s for. It’s very appealing to wave it away because it’s expensive.

The situation above may be just one example of data loss, but it seems there could be others when we gamble on hoping servers restart quickly enough, and don’t crash at the same time, etc.


Two comments here.

1) What about Kafka + KRaft; doesn't that suffer the same problem you point out in Redpanda? If so, recommending that your customers run KRaft without fsync would be like recommending running with a ZooKeeper that sometimes doesn't work. Or do I fundamentally misunderstand KRaft?

2) You mention simultaneous power failures deep into the fsync blog post. I think this should be more visible in your benchmark blog post, where you write about turning off fsyncs.


1) KRaft is only for metadata replication; data replication is still ISR-based even with KRaft, so it doesn't change the conclusion


Nah, I dug deeper into this. Right conclusion, wrong reasoning.

The reason KRaft turns out to be fine is because the KRaft topic does fsync! Source: https://cwiki.apache.org/confluence/display/KAFKA/KIP-595%3A...

KRaft is used for metadata replication in the same way that Zookeeper is used for metadata. I.e., in a very meaningful way.


Repeating things does not make them true. I read the post. You can only control some failures, but happy for us to write our thoughts up in blog form.


Well, that just isn't accurate really. Kafka would need simultaneous VM failure across all AZs. That just doesn't happen in the real world often enough to worry about. It has never happened in Confluent Cloud. RP has a similar issue: single-AZ deployments with local NVMe drives. If the AZ loses power, a majority of brokers could lose all their data. Then there's data corruption. Fsyncs alone don't save you. The next step would be to implement Protocol Aware Recovery (https://www.usenix.org/conference/fast18/presentation/alagap...) like TigerBeetle has. Does a system that has implemented anti-corruption in the storage layer now get to lambast Redpanda, Pulsar, ZooKeeper etc. because they didn't implement that?


I vouched for this comment (can we please not, folks?). Sure, but many people don't run across AZs because it costs a ton of money. Fsync alone doesn't save you, but it sure makes data loss less likely.

> Does a system that has implemented anti-corruption in the storage layer now get to lambast Redpanda, Pulsar, ZooKeeper etc because they didn't implement that?

Sure, why not? I think zk doesn't do fsync either, btw


My gut feeling is that if your only AZ goes down (or all your AZs simultaneously), you're going to lose data period because your producers are now all stuck, your APIs are unavailable, etc. Whether the data loss begins at the exact moment power failed or a couple minutes before doesn't matter, vs. the additional cost to fsync constantly.

I mean it's good to know all the failure modes, but at the end of the day it's also good to know how much handling them will cost, and it's often not worth it.


This is a very practical way of looking at the problem, and it is true for the majority of systems, but anyone serious enough about keeping their data, and not just pretending, has some kind of back pressure mechanism built in, so the messages will stop flowing if they can't be processed.
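As a sketch of what such a back pressure mechanism could look like (an illustrative assumption, not any particular product's API): a bounded buffer that makes the producer block or get rejected instead of silently dropping messages.

```python
# Minimal back pressure sketch: a bounded in-flight buffer. When the
# downstream can't drain it, producers are rejected rather than messages
# being silently lost.
import queue

buf = queue.Queue(maxsize=3)           # bounded in-flight buffer

def produce(msg, timeout=0.05):
    try:
        buf.put(msg, timeout=timeout)  # blocks while the buffer is full
        return True
    except queue.Full:
        return False                   # caller must retry or shed load

for i in range(5):
    ok = produce(f"event-{i}")
    print(f"event-{i}: {'accepted' if ok else 'rejected (back pressure)'}")
```

With nothing consuming, the first three events are accepted and the rest are rejected; in a real pipeline the consumer's drain rate determines how much pressure reaches the producer.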


Right, and best case that’s going to come back as 503s or 429s, and if that continues for any length of time your customers are going to view it as morally equivalent to data loss (or maybe worse, if the response gives them no reason to tie it to some event stream).


producers stuck != data loss (if you use transactional commits, at least). If you run in multiple regions, you don't need multi-AZ in a lot of architectures


I don't mean because of some misfeature in the Kafka protocol, I mean because events are still coming in but have nowhere to go. Unless you built a spill as wide as your Kafka cluster. Which isn't worth it, so no one does it.


Who is running single-AZ deployments who also cares about data loss and availability? Seriously? I’ve personally supported thousands of Kafka deployments and this isn’t a thing, in the cloud at least. There is no call for wanting fsync per message; it is an anti-pattern and isn’t done because it isn’t necessary. Data loss in Kafka isn't a real problem that hurts real-world users at all.


I was grabbing beer with a buddy who has run some large (petabytes per month) Kafka deployments, and his experience was very much that Kafka will lose acked writes if not very carefully configured. He had direct experience with data loss from JVM GC creating terrible flaky cluster conditions and, separately, from running out of disk on one cluster machine.


> There is no call for wanting fsync per message, it is an anti pattern and isn’t done because it isn’t necessary

1. Don't have to do it by message

2. It's used by many distributed db engines; Kafka and (I think) zk are the outliers here, not the other way around


Kafka is not a "db engine". zk is a "db engine" in the same way 'DNS' is a "db engine".


Oh, DNS is definitely a database engine [1] ;)

[1]: https://dyna53.io


Ah yes, the semantic argument. Fyi - pulsar and etcd do use fsync


No one is arguing with you. You were making an argument based on a misinformed software category assertion and the error was pointed out. So r/fyi/til maybe?


I can't name the "unserious" people who aren't running multi-AZ, but this is the approach to durability that MongoDB took ~15 years ago and they have never lived it down.

It may just be that data reliability isn't a huge concern for messaging queues, so it's less of an issue, but pretending the risk isn't there doesn't help anyone.


Zookeeper absolutely does fsync, you can't disable it (without libeatmydata). It will log a warning if fsync is too slow as well.


Author here. The benchmark is available for anyone to run and check my results.


Author here. Yes, both Kafka and Redpanda were deployed on identical hardware: three i3en.6xlarge instances.

