Node and Cluster Recovery
This article describes how ClusterControl manages node and cluster recovery.

Node recovery is used when one or more nodes fail but the cluster is still up (at least one node is in the Synced state).
ClusterControl will wait for the failed nodes to time out before attempting node recovery.
ClusterControl will connect to the failed node, set the appropriate wsrep_cluster_address, and restart it.
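For illustration, the restarted node's wsrep_cluster_address typically lists the other cluster members, as in this minimal my.cnf sketch (hostnames and addresses are illustrative, not from the original article):

```
# my.cnf (illustrative addresses)
[mysqld]
wsrep_cluster_address=gcomm://10.0.0.1,10.0.0.2,10.0.0.3
```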
The recovering node then rejoins the cluster in one of two ways:
- IST (Incremental State Transfer) - fast recovery that transfers only the missed write-sets. IST is used whenever possible; if it is not possible, SST is used.
- SST (State Snapshot Transfer) - full recovery after a hard stop. The node receives a complete copy of the data using e.g. rsync or xtrabackup.
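The state transfer method used for SST is configurable on the donor and joiner nodes. A minimal my.cnf sketch (the value shown is one common choice, not necessarily what ClusterControl configures by default):

```
# my.cnf (illustrative)
[mysqld]
wsrep_sst_method=xtrabackup-v2   # alternatives include rsync, mysqldump
```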
When the failure affects the whole cluster, cluster recovery is used. ClusterControl implements multiple protocols that can be run in this mode:
Cluster Recovery due to split brain
If at least one node is in the Initialized state and reports a Non-Primary component, and this was caused by network partitioning (split brain), ClusterControl will make this node the Primary component.
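Promoting a surviving Non-Primary node to a Primary component is done in Galera via the pc.bootstrap provider option. The following is the standard Galera mechanism for this, shown as a hedged sketch rather than ClusterControl's exact internal call:

```sql
-- Run on the node chosen to become the Primary component:
SET GLOBAL wsrep_provider_options='pc.bootstrap=YES';
```

After this, the node forms a new Primary component and the remaining nodes can rejoin it.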
Cluster Recovery due to all nodes failed
If all nodes have failed, ClusterControl will try to find one node to bring up to Synced, and then run node recovery to recover the other nodes.
It first runs a protocol to determine the most suitable node to recover into Synced. Should this protocol fail, it falls back to picking a node and creating a new cluster on it.
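The article does not specify the selection protocol, but in Galera clusters the usual criterion is the last committed sequence number (seqno) recorded in each node's grastate.dat: the node with the highest seqno has the most recent data and is the best bootstrap candidate. A minimal sketch of that idea, with hypothetical node names and seqno values:

```python
# Hypothetical sketch of seqno-based candidate selection for cluster
# recovery. Each value is the seqno read from a node's grastate.dat;
# -1 means the node stopped uncleanly and its position is unknown.

def pick_bootstrap_candidate(grastate):
    """Return the node name with the highest known seqno.

    Nodes with seqno -1 are skipped if any node has a clean seqno;
    otherwise we fall back to the unclean nodes.
    """
    clean = {node: seqno for node, seqno in grastate.items() if seqno >= 0}
    candidates = clean or grastate
    return max(candidates, key=candidates.get)

# Example: node2 committed the most transactions before the outage.
states = {"node1": 1045, "node2": 1051, "node3": -1}
print(pick_bootstrap_candidate(states))  # node2
```

This is only an illustration of the selection criterion; ClusterControl's actual protocol may take additional state into account.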
For issues please file a support request.