Node and Cluster Recovery
This article describes how ClusterControl manages node and cluster recovery.

Node recovery is used when one or more nodes fail but the cluster is still up (at least one node is in the Synced state).
ClusterControl will wait for the failed nodes to time out before attempting node recovery.
ClusterControl will connect to the failed node, set the appropriate wsrep_cluster_address, and restart it.
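For illustration, the restarted node's wsrep_cluster_address typically lists the other cluster members, as in this minimal my.cnf sketch (hostnames and addresses are illustrative, not from the original article):

```
# my.cnf (illustrative addresses)
[mysqld]
wsrep_cluster_address=gcomm://10.0.0.1,10.0.0.2,10.0.0.3
```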
The recovering node then rejoins the cluster in one of two ways:
- IST (Incremental State Transfer) - fast recovery that transfers only the missed write-sets. IST is used whenever possible; if it is not possible, SST is used.
- SST (State Snapshot Transfer) - full recovery after a hard stop. The node receives a complete copy of the data using e.g. rsync or xtrabackup.
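The state transfer method used for SST is configurable on the donor and joiner nodes. A minimal my.cnf sketch (the value shown is one common choice, not necessarily what ClusterControl configures by default):

```
# my.cnf (illustrative)
[mysqld]
wsrep_sst_method=xtrabackup-v2   # alternatives include rsync, mysqldump
```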
When the failure affects the whole cluster, cluster recovery is used. ClusterControl implements multiple protocols that can be run in this mode:
Cluster Recovery due to split brain
If at least one node is in the Initialized state and reports a Non-Primary component, and this was caused by network partitioning (split brain), ClusterControl will make this node the Primary component.
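Promoting a surviving Non-Primary node to a Primary component is done in Galera via the pc.bootstrap provider option. The following is the standard Galera mechanism for this, shown as a hedged sketch rather than ClusterControl's exact internal call:

```sql
-- Run on the node chosen to become the Primary component:
SET GLOBAL wsrep_provider_options='pc.bootstrap=YES';
```

After this, the node forms a new Primary component and the remaining nodes can rejoin it.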
Cluster Recovery due to all nodes failed
If all nodes have failed, ClusterControl will try to find one node to bring up to Synced, and then run node recovery to recover the other nodes.
It first runs a protocol to determine the most suitable node to recover into Synced. Should this protocol fail, it falls back to picking a node and creating a new cluster on it.
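The article does not specify the selection protocol, but in Galera clusters the usual criterion is the last committed sequence number (seqno) recorded in each node's grastate.dat: the node with the highest seqno has the most recent data and is the best bootstrap candidate. A minimal sketch of that idea, with hypothetical node names and seqno values:

```python
# Hypothetical sketch of seqno-based candidate selection for cluster
# recovery. Each value is the seqno read from a node's grastate.dat;
# -1 means the node stopped uncleanly and its position is unknown.

def pick_bootstrap_candidate(grastate):
    """Return the node name with the highest known seqno.

    Nodes with seqno -1 are skipped if any node has a clean seqno;
    otherwise we fall back to the unclean nodes.
    """
    clean = {node: seqno for node, seqno in grastate.items() if seqno >= 0}
    candidates = clean or grastate
    return max(candidates, key=candidates.get)

# Example: node2 committed the most transactions before the outage.
states = {"node1": 1045, "node2": 1051, "node3": -1}
print(pick_bootstrap_candidate(states))  # node2
```

This is only an illustration of the selection criterion; ClusterControl's actual protocol may take additional state into account.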
For issues please file a support request.