Sharding high-throughput Redis without downtime

rumblertumbler · 2024-07-25T13:37:36.000000Z

> we have robust monitoring and observability infrastructure in place to spot trends ahead of time

can you share the tools you use for monitoring and debugging performance issues in a sharded Redis architecture?

brunoscheufler · 2024-07-25T14:22:54.000000Z

Bruno from Inngest here, thanks for asking!

In general, we use OpenTelemetry[1] to instrument our services in production, collecting metrics and logs for important events. Specifically, we have set up:

- multiple dashboards informing us about current system usage (events received, processed) including e2e latency distributions, compute resource usage for different deployments, and top operations

- metrics on critical systems (data stores including Redis, messaging infrastructure, connection poolers for Postgres, etc.) to gauge current resource utilization and typical load patterns

- alerting on unexpected deviations in KPIs (a subset of the metrics above) to help us spot and react to issues quickly

- forecasting on product usage and compute resource utilization patterns for planning medium to long-term infrastructure work

Hope this helps!

[1]: https://opentelemetry.io/

h1fra · 2024-07-25T13:55:50.000000Z

> data only ever resides on a single cluster node and its replica

So a shard is not replicated across multiple nodes? If this node is down or in maintenance, you can't read from or write to this shard at all?

brunoscheufler · 2024-07-25T14:13:30.000000Z

Hey, Bruno from Inngest here, great question! Due to how Redis Clustering works, each individual hash slot (a slice of the keyspace) is assigned to a single shard. A shard is a primary/read replica group rather than a single node, so data remains available when one node goes down.
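
To make the slot-to-shard mapping concrete: per the Redis Cluster specification, every key maps to one of 16384 hash slots via CRC16 of the key modulo 16384, and each slot is owned by exactly one shard. Below is a minimal Python sketch of that calculation, including the hash tag rule. The function name `key_slot` is illustrative only, and this is not code from the setup described in this thread.

```python
import binascii

HASH_SLOTS = 16384  # fixed number of hash slots in Redis Cluster


def key_slot(key: bytes) -> int:
    """Return the Redis Cluster hash slot for a key (illustrative helper)."""
    # Hash tag rule: if the key contains "{...}" with at least one character
    # between the first "{" and the first "}" after it, only that substring
    # is hashed. This lets related keys be pinned to the same slot.
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        if end != -1 and end != start + 1:
            key = key[start + 1:end]
    # binascii.crc_hqx with an initial value of 0 computes the CRC16
    # (XMODEM) variant that Redis Cluster uses.
    return binascii.crc_hqx(key, 0) % HASH_SLOTS
```

Keys that share a hash tag, such as `{user1000}.following` and `{user1000}.followers`, land in the same slot and therefore on the same shard.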

More specifically, in production we're running multiple cluster shards in a primary/read replica configuration with support for automatic failover. Individual nodes are distributed across multiple availability zones (AZs) to prevent issues in one data center from impacting the entire cluster.

In case of downtime or maintenance of the primary node, the respective read replica is automatically promoted to primary, and writes can continue within a few seconds.
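
On the client side, a write that hits the few-second promotion window can be retried with backoff. The following is a rough Python sketch under stated assumptions: `retry_during_failover` and its parameters are hypothetical names, the operation is assumed to raise `ConnectionError` while the primary is unreachable, and this is not a description of how Inngest's clients actually behave.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_during_failover(
    op: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run op, retrying on ConnectionError with exponential backoff.

    Hypothetical helper: the assumption is that op raises ConnectionError
    while the old primary is down and succeeds once a replica is promoted.
    """
    for attempt in range(attempts):
        try:
            return op()
        except ConnectionError:
            # Give up after the final attempt and surface the error.
            if attempt == attempts - 1:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next try.
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

With the defaults, the helper waits 0.5 + 1 + 2 + 4 = 7.5 seconds in total before giving up, which is in the same range as a failover that completes within a few seconds. The `sleep` parameter exists only so the delay can be swapped out in tests.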

Hope this helps!