Losing database connectivity during Veeam snapshot/backup
I have a 3-node Galera cluster with keepalived and HAProxy. There is a virtual IP that normally stays on one node but floats to one of the other nodes should the primary become unavailable. I have noticed that when Veeam performs a backup of the virtual machines hosting the cluster, I lose database connectivity from my client applications (Zabbix and FreeRADIUS).
I've tried to work around this by having Veeam back up the primary node 30 minutes after it has finished the other two, on the theory that the IP would move to another machine when the primary becomes unavailable during the snapshot period. This, however, does not work; at backup time my log files fill with entries like this from the Zabbix server:
13073:20181126:033533.943 [Z3005] query failed:  Lost connection to MySQL server during query...
Looking through the logs on the primary node, I believe that since the VM is in a frozen state, it cannot tell one of the other nodes to pick up the IP address, leading to the DB connection failures.
Is there a way to run a pre-freeze (and post-thaw) script to manually move the IP to a different node in a cluster-aware way? I've tried taking down keepalived, but ClusterControl automatically restarts it and moves the IP back to the machine (as expected).
Have you considered setting up asynchronous replication and running backups from there, i.e. a node that is not part of the cluster but has the same data? You can easily set that up with ClusterControl, or by simply enabling the binary log on one of the Galera cluster nodes and configuring master/slave replication. You will need one more VM; however, it can also help you when a restore is required, or for more complex recovery scenarios like delayed replication.
Thanks for the reply. The database backups are working ok - although I am troubleshooting some stuff there too, unrelated - I'm more talking about the machine backups (i.e. snapshots of the entire VM). When Veeam runs, it grabs a VMware snapshot, then backs that up. It's during this period of the machine being snapshotted that I start seeing database connection errors, because the virtual IP never moves to one of the other 2 available nodes.
Typically, as with an MS-SQL server, a pre-freeze script can be set up within Veeam to run on the guest OS and perform some action prior to the snapshot. The script can be a bash script, and Veeam waits for it to finish before taking the snapshot. It could, for example, tell ClusterControl to move the IP address to another machine to keep the database from becoming unavailable. The post-thaw script would then allow the machine to go back to being available.
As I was writing this, I tried placing the primary node into maintenance mode, which did not move the IP. I also tried to stop the node (from the Node Actions pull-down), and surprisingly that didn't move the IP either. I was hoping to go down the path of invoking one of these two options in my pre-freeze, but now I'm curious how to get the IP to move at all, besides just manually rebooting the node.
I see. This can probably be solved at multiple layers; however, I would consider desyncing the Galera node during the backup.
You can take a node out of the cluster with:
SET GLOBAL wsrep_desync = ON; SET GLOBAL wsrep_on = OFF;
If your proxy is set up in a way that it can check the status of the Galera nodes, then new connections should be rerouted to the other nodes. You can find more details about it here: wsrep_desync. HAProxy, with the help of a cluster check script, can pick up the status of the nodes; however, a dedicated DB load balancer like ProxySQL may be a better option here.
Thanks again Bart,
Yesterday I tried your suggestion, but I don't think my HAProxy is set up to handle it (see below). The primary node stayed with the virtual IP address. Also, to note, 'SET GLOBAL wsrep_on = OFF;' complained that it wasn't a GLOBAL variable. I took GLOBAL out of the statement and it worked, but I'm not sure if that was what I needed to do.
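(For reference, dropping GLOBAL was likely correct: `wsrep_on` is a session-scoped variable in many Galera builds. A sketch of the full desync/resync sequence; verify the variable scopes against your own version:)

```sql
-- Stop participating in flow control; the node state becomes
-- Donor/Desynced, which a cluster-aware health check can flag.
SET GLOBAL wsrep_desync = ON;
-- Keep this session's own writes local (session-scoped in many versions).
SET wsrep_on = OFF;

-- ... snapshot/backup runs here ...

SET wsrep_on = ON;
SET GLOBAL wsrep_desync = OFF;

-- Confirm the node has caught up before sending traffic back to it.
SHOW STATUS LIKE 'wsrep_local_state_comment';
```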
Here's my HAProxy config:
timeout client 10800s
timeout server 10800s
# Added by ben to check status of MySQL, need to research more to
# determine proper implementation
# See http://galeracluster.com/documentation-webpages/haproxy.html
option mysql-check user haproxy
# option allbackups
default-server port 9200 inter 2s downinter 5s rise 3 fall 2 slowstart 60s maxconn 64 maxqueue 128 weight 100
server node3 node3:3306 check # (node1/2/3 are mapped in /etc/hosts)
server node1 node1:3306 check
server node2 node2:3306 check
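One thing stands out in the config above: `option mysql-check` speaks the MySQL protocol, but `default-server port 9200` points the health check at port 9200, where a clustercheck-style HTTP service would normally listen. Mixing the two means the check may never reflect actual Galera state. A sketch of the usual clustercheck-based setup (assuming the Percona `clustercheck` script is served on port 9200 via xinetd on each node; node names and timings taken from your existing config):

```
listen galera
    bind *:3306
    mode tcp
    balance leastconn
    # clustercheck answers HTTP 200 only when the node is Synced and
    # 503 when it is Desynced/Donor, so SET GLOBAL wsrep_desync = ON
    # automatically pulls the node out of rotation.
    option httpchk
    default-server port 9200 inter 2s downinter 5s rise 3 fall 2
    server node1 node1:3306 check
    server node2 node2:3306 check
    server node3 node3:3306 check
```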
I tried another method: placing the node in maintenance mode and then stopping keepalived using pre-freeze scripts within Veeam. This seemed to achieve the desired effect momentarily, but the IP returns to node1 after a few seconds. I have an executable node1_pre-freeze.sh saved on the cluster controller:
#!/bin/bash

# Place Node 1 into maintenance mode
s9s maintenance --create \
--start="$(date -d 'now' '+%Y-%m-%d %H:%M:%S')" \
--end="$(date -d 'now + 30 minutes' '+%Y-%m-%d %H:%M:%S')" \
--reason="Veeam Backup" > /dev/null 2>&1
# Pause the script for 15 sec to make sure we're in maintenance mode
sleep 15

# Stop keepalived to transfer away the floating IP
sudo -u cluster ssh node1 'sudo systemctl stop keepalived'
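A matching post-thaw script (hypothetical; the name `node1_post-thaw.sh` and the DRY_RUN guard are my own) would restart keepalived once the snapshot is released; the maintenance window created above simply expires on its own. The dry-run wrapper just echoes the command so the sequence can be checked without touching the cluster:

```shell
#!/bin/bash
# node1_post-thaw.sh -- hypothetical counterpart to node1_pre-freeze.sh.
# DRY_RUN defaults to 1 (print only); set DRY_RUN=0 for real execution.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Restart keepalived so node1 rejoins VRRP negotiation
run sudo -u cluster ssh node1 'sudo systemctl start keepalived'
```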
And on Veeam I'm invoking the pre-freeze script with a wrapper saved on the Veeam machine, which gets uploaded to node1's /tmp prior to being executed over SSH:
su -c "ssh controller -i \
However, keepalived starts back up every time. I've tried both systemctl and '/etc/init.d/keepalived stop'. I realize this is the intent of high availability, but I'd really like it to stop automatically restarting while in maintenance mode, so I can get a backup in without the services complaining that they can't communicate with the database.
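Two things may be working against you here. First, ClusterControl's process recovery is what restarts keepalived; maintenance mode mainly suppresses alarms, so check whether your version lets you disable node auto-recovery for the duration of the window (see `s9s cluster --help`). Second, if node1's keepalived runs with `state MASTER` or a higher priority, VRRP preemption will pull the IP back the moment the daemon restarts; running every node as `state BACKUP` with `nopreempt` makes the IP stay wherever it currently is. A hedged sketch of one instance (interface, router id, priority, and VIP are placeholders, and note that ClusterControl deployed this config, so manual edits may be overwritten):

```
vrrp_instance VI_1 {
    state BACKUP          # all nodes BACKUP + nopreempt: no automatic failback
    nopreempt             # only honoured when state is BACKUP
    interface eth0
    virtual_router_id 51
    priority 100          # keep priorities distinct across the three nodes
    virtual_ipaddress {
        192.168.10.100/24
    }
}
```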