MySQL Cluster Auto-Recovery

fipar · on May 5, 2019

It's worth pointing out that "MySQL Cluster" is actually a different product[1] and not what is used in the article (InnoDB Cluster or Group Replication).

[1]: https://dev.mysql.com/doc/refman/8.0/en/mysql-cluster-overvi...

tyingq · on May 5, 2019

"To cut the story short, I had to accept that there is nothing I can do with the networking issues and there are likely to be significant “bursts” of dropped packets that will cause disconnection of our MySQL8 cluster nodes.

So we know there is very likely an infrastructure problem. The question is, can we mitigate it on the application layer?"

That seems unwise. It's on DigitalOcean, so they could have migrated to different VPS instances and checked to see if the problem followed.

dc352 · on May 5, 2019

We are still looking into it and I’m in touch with DO and I hope to learn more from their support. Unfortunately, their response time is currently around 2 days.

I guess my point here was that our database should be resilient to this kind of infra issue and ideally self-heal if these are transient events.

tyingq · on May 5, 2019

Yes, the work to get the cluster resilient is terrific. Just suggesting moving away from the problematic VPS if a root cause can't be found.

420codebro · on May 5, 2019

Also, he could consider adding in some fancy redundancy (SD-WAN, BGP, IP-SLA, whatever) to ensure that a single point of network I/O failure does not cascade to a split-brain scenario.

Might reduce bandwidth or increase latency, but a total outage is pretty painful when you're talking about distributed/replicated databases.

lefred · on May 6, 2019

MySQL evangelist here ;-). For any cluster/HA solution relying on multiple machines linked by a network, if you can't rely on the network, it's very complicated to rely on the solution. This is valid for MySQL InnoDB Cluster but also for any other quorum-based solution. The third point of the requirements is Network Performance (https://dev.mysql.com/doc/refman/8.0/en/group-replication-re...); maybe we should add "and Reliability". Of course, as you may have noticed, 8.0.16 already brings new features that help with a flaky network, and even though we will never encourage people to deploy MySQL InnoDB Cluster (Group Replication) on a bad network, we are constantly working on improving the user experience even in such environments.
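
To illustrate why any quorum-based solution is so sensitive to the network, here is a minimal Python sketch of the majority rule. The function names `has_quorum` and `tolerated_failures` are made up for this example and are not part of any MySQL API; the sketch only shows the arithmetic, not how Group Replication implements it.

```python
def has_quorum(reachable: int, group_size: int) -> bool:
    """Return True if `reachable` members form a strict majority of the group."""
    return reachable > group_size // 2


def tolerated_failures(group_size: int) -> int:
    """Largest number of members that can be lost while a majority remains."""
    return (group_size - 1) // 2


if __name__ == "__main__":
    # Hypothetical three-node group: losing one node keeps a majority,
    # losing two (or being cut off alone by a network partition) does not.
    for reachable in (3, 2, 1):
        print(reachable, "of 3 reachable ->", has_quorum(reachable, 3))
```

Under this rule, a burst of dropped packets that isolates a node from the other two puts it in the minority, which is why a flaky network shows up as nodes dropping out of the group.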

bratao · on May 5, 2019

I recommend looking at TiDB for anyone using a cluster. I haven't migrated to it yet, but from my tests, things are looking great.

fipar · on May 5, 2019

I really like TiDB but I don't think it's a general-purpose replacement for any other MySQL deployment.

For what is described in this article specifically, TiDB would make you change the storage engine (from InnoDB to RocksDB), and if they have all their data in a single Group Replication cluster they probably don't need sharding either.

c4pt0r · on May 6, 2019

Right. TiDB developer here. TiDB is more suitable for scenarios where the data cannot fit on one machine. I would say sharding may bring extra maintenance cost and code refactoring, and sometimes the storage change is almost transparent to the application layer (well, that depends on the workload), but if the application can live with that, TiDB would be a good alternative for a large sharded MySQL cluster.