Thank you! 1) All data is partitioned based on the "instanceId" of events (see `...

klaussilveira · 2024-10-29T13:58:35 1730210315

How do you guarantee ACID with Kafka being responsible for actually INSERT'ing into ClickHouse? Wouldn't it be less error prone to just use ClickHouse directly and their async inserts?

https://clickhouse.com/blog/asynchronous-data-inserts-in-cli...

pancomplex · 2024-10-30T00:16:35 1730247395

I am thinking about setting this up as as a configuration for the type of traffic that doesn't require Kafka.

That being said, Kafka has in my experience come in super handy again and again, simply because it adds an incredible extra layer of fault tolerance when running at scale, including the ability to replay events, replicate, fail over, etc. I'd be nervous about letting the amount of throughput we receive directly interface to ClickHouse (though I'd be excited to run an experiment with this).

bosky101 · 2024-10-29T03:29:27 1730172567

Not sure of the CH Kafka engine but generally I think you should partition by userId.

Because the next step would be trying to run some cron for a user or event based trigger based on the events.

And the only way to avoid multiple machines doing the same work / sending the same comms - would be to push all users events to a partition. This way with multiple workers you don't have the risk of duplicate processing.

svilen_dobrev · 2024-10-31T18:21:46 1730398906

check "partial ordering" concept. What is the minimum independent "thing"? Probably user?

example over user+invoices: i.e. there are things that have to come in exact order (e.g. activity on certain invoice), and there are things that can move around (i.e. processing those, timewise), being independent from one another (different invoices' activities, wholesale). But when same user acts on different invoices, then whole one-user-activity should be in exact order.. not just invoice-activity