Netflix has 100+ production clusters running CockroachDB

pritambarhate · on Nov 2, 2022

I am quite interested in both CockroachDB and YugaByte, the new-age scale-out SQL databases. Has anyone here used either of these in production at scale? If yes, how was your experience? What problems did you face? How does the TCO compare to something like AWS Aurora?

sean0- · on Nov 2, 2022

Yes. AMA.

metadat · on Nov 2, 2022

What is it like operating 100 of these CockroachDB clusters?

That's quite a few, but if production incidents and issues are rare, it might not be too big of a headache.

If high QoS or many nines of uptime are required, the upfront investment in thoroughly and effectively monitoring any new service is a significant engineering endeavor.

sean0- · on Nov 9, 2022

At this point, especially with the upcoming 22.2 release, nearly all failure modes are handled on a machine timescale. There used to be some unbounded failure scenarios where a human would have to kick a node, but to the best of my knowledge they have all been resolved, and unplanned failures are now bounded to about 10s.

One of the bigger burdens from large fleets like this is fleet management and maintenance, specifically managing upgrades. If you have a story around that, you're in good shape.

Most of the excitement comes from moving workloads onto crdb from PG, where it's not uncommon to have workloads with right-leaning indexes or access patterns. This class of problem is solvable, but it will catch some people by surprise.
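
To make the right-leaning index problem concrete, here is a toy Python sketch. It is not how crdb is implemented; the split points, the bucket count, and the function names are all made up for illustration. It assumes a keyspace cut into contiguous ranges and shows that monotonically increasing ids all land in the last range, while prefixing the key with a hash bucket (roughly the idea behind hash-sharded indexes) spreads the same writes out.

```python
import bisect
import hashlib
from collections import Counter

NUM_BUCKETS = 8  # hypothetical shard count, chosen only for this example


def range_for_sequential(key, split_points):
    """Return the index of the contiguous range that owns `key`."""
    return bisect.bisect_right(split_points, key)


def bucket_for(key, num_buckets=NUM_BUCKETS):
    """Return a stable hash bucket for `key`, so adjacent ids scatter."""
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def write_distribution(keys, place):
    """Count how many writes each range or bucket receives."""
    return Counter(place(k) for k in keys)


# Hypothetical existing range boundaries and a burst of new sequential ids.
split_points = [250_000, 500_000, 750_000, 1_000_000]
new_ids = range(1_000_000, 1_001_000)

# Sequential keys: every write goes to the right-most range (a hotspot).
sequential = write_distribution(
    new_ids, lambda k: range_for_sequential(k, split_points)
)

# Hash-prefixed keys: the same writes are spread over several buckets.
sharded = write_distribution(new_ids, bucket_for)

print("sequential:", dict(sequential))
print("sharded:   ", dict(sorted(sharded.items())))
```

On a single PG node the right-most leaf being hot is mostly fine; in a range-partitioned system like this model, it means one range takes all the insert traffic regardless of how many nodes you add.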

At the end of the day, having a database that can absorb punishment from as little as a few dozen QPS up to millions of QPS is a big reliability win, especially with the self-healing characteristics of the database technology.

metadat · on Nov 9, 2022

What kinds of workloads and traffic have you been running on it so far?

How does it compare to the overhead and experience of DIY Postgres?