In general, we use OpenTelemetry[1] to instrument our services in production, collecting metrics and logs for important events. Specifically, we have set up:
- multiple dashboards informing us about current system usage (events received, processed) including e2e latency distributions, compute resource usage for different deployments, and top operations
- metrics on critical systems (data stores including Redis, messaging infrastructure, connection poolers for Postgres, etc.) to gauge current resource utilization and typical load patterns
- alerting on unexpected deviations in KPIs (a subset of the metrics above) to help us spot and react to issues quickly
- forecasting on product usage and compute resource utilization patterns for planning medium to long-term infrastructure work
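To make the alerting item above a bit more concrete, here is a minimal sketch of a deviation check against a rolling baseline. This is not our actual alerting setup: the `DeviationAlert` class, the window size, and the z-score threshold are all assumptions chosen for illustration, and in practice this kind of rule usually lives in the monitoring backend rather than in application code.

```python
from collections import deque
from statistics import mean, stdev


class DeviationAlert:
    """Hypothetical helper: flags a KPI sample that strays from a rolling baseline."""

    def __init__(self, window=60, threshold=3.0, min_samples=10):
        self.samples = deque(maxlen=window)  # most recent KPI samples
        self.threshold = threshold           # allowed distance in standard deviations
        self.min_samples = min_samples       # stay silent until a baseline exists

    def observe(self, value):
        """Return True if value deviates from the baseline, then record it."""
        alert = False
        if len(self.samples) >= self.min_samples:
            baseline = mean(self.samples)
            spread = stdev(self.samples)
            if spread > 0:
                alert = abs(value - baseline) / spread > self.threshold
            else:
                # A perfectly flat baseline: any change counts as a deviation.
                alert = value != baseline
        self.samples.append(value)
        return alert
```

Feeding it a steady series such as 100, 102, 98, 101, 99, 100 stays quiet, while a sudden 500 trips the check.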
Hey, Bruno from Inngest here, great question! Redis Cluster assigns each individual hash slot (a slice of the keyspace) to a single shard (a primary/read replica group), so the data in a slot remains available as long as one node in its shard is up.
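As a rough illustration of the slot-to-shard mapping described above, this sketch computes a key's hash slot the way the Redis Cluster specification defines it (CRC16/XMODEM of the key, modulo 16384, honoring `{hash tags}`). The three-shard layout in `SHARD_STARTS` is a made-up example, not our production topology, and `shard_for` is a hypothetical helper.

```python
import binascii
from bisect import bisect_right

NUM_SLOTS = 16384

# Hypothetical layout: first slot owned by each of three shards.
SHARD_STARTS = [0, 5461, 10923]


def key_slot(key):
    """Hash slot for a key: CRC16 (XMODEM variant) modulo 16384."""
    data = key.encode() if isinstance(key, str) else key
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        # Only a non-empty {tag} restricts hashing to the tag contents.
        if end != -1 and end != start + 1:
            data = data[start + 1:end]
    return binascii.crc_hqx(data, 0) % NUM_SLOTS


def shard_for(key):
    """Index of the (hypothetical) shard that owns the key's slot."""
    return bisect_right(SHARD_STARTS, key_slot(key)) - 1


print(key_slot("foo"))    # 12182
print(shard_for("foo"))   # 2
print(shard_for("hello")) # 0
```

Keys that share a hash tag, such as `{user1000}.following` and `{user1000}.followers`, land in the same slot and therefore on the same shard.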
More specifically, in production we're running multiple cluster shards in a primary/read replica configuration with support for automatic failover. Individual nodes are distributed across multiple availability zones (AZs) to prevent issues in one data center from impacting the entire cluster.
If a primary node goes down or undergoes maintenance, its read replica is automatically promoted to primary, and writes can resume within a few seconds.
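From the client's point of view, a failover window of a few seconds is typically bridged by retrying writes with a short backoff. The sketch below is an assumption about how a caller might do that, not a description of our client code: `PrimaryUnavailable` and `write_with_retry` are hypothetical names, and the delays (0.2 s doubling up to 1.6 s, about 3 s in total) are illustrative.

```python
import time


class PrimaryUnavailable(Exception):
    """Hypothetical error raised while no primary accepts writes for a slot."""


def write_with_retry(write, attempts=5, base_delay=0.2, sleep=time.sleep):
    """Call write() and retry with exponential backoff while a failover completes."""
    for attempt in range(attempts):
        try:
            return write()
        except PrimaryUnavailable:
            if attempt == attempts - 1:
                raise  # failover took longer than the retry budget
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter is injectable so the backoff schedule can be tested without actually waiting.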
Can you share the tools you use for monitoring and debugging performance issues in a sharded Redis architecture?