Failure handling with ClusterControl™ and Galera
In order to keep the database cluster stable and running, it is important for the system to be resilient to failures. Failures are caused by either software bugs or hardware problems, and can happen at any time. In case a server goes down, failure handling, failover and reconfiguration of the Galera cluster needs to be automatic, so as to minimize downtime.
In case of a node failing, applications connected to that node can connect to another node and continue to do database requests. Keepalive messages are sent between nodes in order to detect failures, in which case the failed node is excluded from the cluster.
ClusterControl™ will restart the failed database process, and point it to one of the existing nodes (a ‘donor’) to resynchronize. The resynchronization process is handled by Galera.
While the failed node is resynchronizing, any new transactions (writesets) coming from the existing nodes will be cached in a slave queue. Once the node has caught up, it will be considered as SYNCED and is ready to accept client connections.
Please sign in to leave a comment.