1) All data is partitioned based on the "instanceId" of events (see `instanceId` here: https://docs.trench.dev/api-reference/events-create). Instance IDs are typically a logically meaningful way of separating users (such as by company/team/etc.) that allows for sharding the data across nodes.
2) Yes, this is the number one thing on our roadmap right now (if anyone is interested in helping build this, please reach out!)
3) We're using the Kafka engine in ClickHouse for throttling the ingestion of events. It's partitioned by instanceId (see #1) for scaling/fast queries over similar events.
4) My benchmarks in production showed a single EC2 instance (16 cores / 32 GB RAM) barely breaking a sweat at 1,000+ inserts/second with roughly the same number of queries per second; load averages were 0.91, 0.89, 0.90. This was in stark contrast to our AWS Postgres cluster, which kept hitting 90%+ CPU and running low on memory at 80 ACUs before we finished the migration to Trench.
5) We seemed to solve this by running individual Node processes on every core (16 in parallel; see the sketch after this list). Was the limit you saw caused by ClickHouse's inbound HTTP interface?
6) Right now the system uses just a default MergeTree ordered by instanceId, userId, timestamp (see the schema sketch after this list). This works really well for queries across the same user or instance, especially when generating time-series graphs.
7) I am still trying to figure out the best Kafka partitioning scheme. userId seems to be the best for avoiding hot partitions. Curious if you have any experience with this?
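Re #5, for the curious: a minimal sketch (not Trench's actual entry point) of the one-process-per-core setup using Node's built-in cluster module; the HTTP handler here is just a placeholder for the real ingest endpoint.

```typescript
import cluster from 'node:cluster';
import os from 'node:os';
import http from 'node:http';

if (cluster.isPrimary) {
  // Fork one worker per core; each gets its own event loop, and all of
  // them share the same listening socket for the ingest port.
  for (let i = 0; i < os.cpus().length; i++) cluster.fork();
} else {
  // Placeholder handler -- the real app would validate the event and
  // hand it off to the Kafka producer.
  http.createServer((req, res) => res.end('ok')).listen(4000);
}
```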
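Re #3 and #6, roughly how those two pieces fit together. This is only a sketch, not Trench's actual schema: the table names, columns, and Kafka settings are made up for illustration, and it assumes the official @clickhouse/client Node package.

```typescript
import { createClient } from '@clickhouse/client';

const clickhouse = createClient({ url: 'http://localhost:8123' });

// Kafka engine table: ClickHouse consumes the topic itself, which is what
// throttles ingestion instead of the app issuing direct INSERTs.
await clickhouse.command({
  query: `
    CREATE TABLE IF NOT EXISTS events_queue (
      uuid       UUID,
      instanceId String,
      userId     String,
      event      String,
      timestamp  DateTime64(3)
    ) ENGINE = Kafka
    SETTINGS kafka_broker_list = 'kafka:9092',
             kafka_topic_list  = 'events',
             kafka_group_name  = 'clickhouse-ingest',
             kafka_format      = 'JSONEachRow'`,
});

// Destination table: a plain MergeTree ordered by (instanceId, userId, timestamp),
// so one instance's (or one user's) events sit physically close together.
await clickhouse.command({
  query: `
    CREATE TABLE IF NOT EXISTS events (
      uuid       UUID,
      instanceId String,
      userId     String,
      event      String,
      timestamp  DateTime64(3)
    ) ENGINE = MergeTree
    ORDER BY (instanceId, userId, timestamp)`,
});

// Materialized view that moves rows from the Kafka consumer table into the MergeTree.
await clickhouse.command({
  query: `CREATE MATERIALIZED VIEW IF NOT EXISTS events_mv TO events AS
          SELECT * FROM events_queue`,
});
```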
Let us know how the migration goes and feel free to connect with me (christian@trench.dev).
How do you guarantee ACID with Kafka being responsible for actually INSERT'ing into ClickHouse? Wouldn't it be less error prone to just use ClickHouse directly and their async inserts?
I am thinking about setting this up as a configuration option for the type of traffic that doesn't require Kafka.
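As a sketch of what that configuration could look like, assuming the @clickhouse/client Node package and ClickHouse's built-in async_insert setting (the table and columns are the same hypothetical ones as in the schema sketch above):

```typescript
import { createClient } from '@clickhouse/client';

const clickhouse = createClient({ url: 'http://localhost:8123' });

// No Kafka in the path: ClickHouse buffers and batches rows server-side.
// wait_for_async_insert: 0 acknowledges as soon as the row is buffered,
// trading durability for latency -- which is exactly the gap Kafka
// otherwise covers (replay, failover, etc.).
export async function trackEventDirect(event: {
  uuid: string;
  instanceId: string;
  userId: string;
  event: string;
  timestamp: string;
}) {
  await clickhouse.insert({
    table: 'events',
    values: [event],
    format: 'JSONEachRow',
    clickhouse_settings: { async_insert: 1, wait_for_async_insert: 0 },
  });
}
```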
That being said, Kafka has in my experience come in super handy again and again, simply because it adds an incredible extra layer of fault tolerance when running at scale, including the ability to replay events, replicate, fail over, etc. I'd be nervous about sending the volume of traffic we receive straight to ClickHouse (though I'd be excited to run an experiment with this).
Not sure about the ClickHouse Kafka engine, but generally I think you should partition by userId.
Because the next step would be trying to run some cron job or event-based trigger per user, driven by the events.
And the only way to avoid multiple machines doing the same work / sending the same comms would be to push all of a user's events to a single partition. That way, even with multiple workers, you don't risk duplicate processing.
Check out the "partial ordering" concept. What is the minimum independent "thing"? Probably the user?
Example over users + invoices:
i.e. there are things that have to come in exact order (e.g. activity on a given invoice), and there are things that can move around time-wise, being independent from one another (different invoices' activities, by and large). But when the same user acts on different invoices, then the whole of that one user's activity should be in exact order, not just the per-invoice activity.
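A sketch of what keying by userId looks like on the producer side (using kafkajs; the topic name and broker are placeholders): every event for a given user hashes to the same partition, so that user's events keep their exact order, while different users' events can land on different partitions and be processed independently.

```typescript
import { Kafka } from 'kafkajs';

const kafka = new Kafka({ clientId: 'ingest', brokers: ['localhost:9092'] });
const producer = kafka.producer();
await producer.connect();

export async function publishEvent(evt: {
  userId: string;
  invoiceId?: string;
  type: string;
}) {
  await producer.send({
    topic: 'events',
    messages: [
      {
        // Key by userId, not invoiceId: that gives a total order per user
        // (the "minimum independent thing") and only a partial order globally.
        key: evt.userId,
        value: JSON.stringify(evt),
      },
    ],
  });
}
```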