Losing database connectivity during Veeam snapshot/backup

Comments

  • Bart Oles

    Hello Ben,

    Did you consider setting up asynchronous replication and running backups from the replica, a node that is not part of the cluster but holds the same data? You can easily set that up with ClusterControl, or by simply enabling the binary log on one of the Galera cluster nodes and configuring master/slave replication from it. You will need one more VM; however, it can help when a restore is required, or when you want more complex recovery scenarios such as delayed replication.

    Kind regards,
    Bart

  • Ben M

    Hello Bart,

    Thanks for the reply. The database backups are working OK (although I am troubleshooting some unrelated stuff there too). I'm talking more about the machine backups (i.e. a snapshot of the entire VM). When Veeam runs, it grabs a VMware snapshot, then backs that up. It's during this period of the machine being "snapshotted" that I start seeing database connection errors, because the virtual IP never moves to one of the other 2 available nodes.

    Typically, as with an MS-SQL server, a pre-freeze script can be set up within Veeam to run on the guest OS and perform some action prior to the snapshot. The script can be in bash form, and Veeam would wait for it to finish before performing the backup. It would, for example, tell ClusterControl to move the IP address to another machine so that the database does not become unavailable. The post-thaw script would then allow the machine to go back to being available.

    As I was writing this, I tried placing the primary node into maintenance mode, which did not move the IP. I also tried to stop the node (from the Node Actions pull-down), and surprisingly that didn't move the IP either. I was hoping to go down the path of invoking one of these two options in my pre-freeze, but now I'm curious how to get the IP to move at all, besides just manually rebooting the node.

  • Bart Oles

    I see. This can probably be solved on multiple layers; however, I would consider a Galera desync during the backup.

    You can take a node out of the cluster with:

    SET GLOBAL wsrep_desync = ON; SET GLOBAL wsrep_on = OFF; 

    If your proxy is set up so that it checks the status of the Galera nodes, new connections should be rerouted to the other nodes. You can find more details about it here: wsrep_desync. HAProxy, with the help of a cluster check script, can track the status of the nodes; however, a dedicated DB load balancer like ProxySQL may be a better option here.
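
    As a rough illustration, the desync step could be wrapped in a single helper that Veeam calls as both the pre-freeze and the post-thaw script. This is only a sketch: the script name and the 'freeze'/'thaw' arguments are made up for the example, it assumes the mysql client on the node can log in without a password prompt (for instance via ~/.my.cnf), and it only toggles wsrep_desync, not wsrep_on.

```bash
#!/bin/bash
# Sketch of a combined Veeam pre-freeze/post-thaw helper for one Galera node.
# Assumption: the mysql client can log in without a prompt (e.g. ~/.my.cnf).
# The "freeze"/"thaw" argument names are hypothetical.

# Print the SQL statement for the requested phase; fail on anything else.
desync_sql() {
    case "$1" in
        freeze) echo "SET GLOBAL wsrep_desync = ON;" ;;
        thaw)   echo "SET GLOBAL wsrep_desync = OFF;" ;;
        *)      return 1 ;;
    esac
}

# Act only when a phase is given, e.g. "galera_desync.sh freeze".
if [ "$#" -gt 0 ]; then
    sql="$(desync_sql "$1")" || { echo "usage: $0 freeze|thaw" >&2; exit 1; }
    mysql -e "$sql"
fi
```

    While the node is desynced, its wsrep_local_state status should read 2 (Donor/Desynced) instead of 4 (Synced), which is the value a cluster check script would normally look at.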

  • Ben M

    Thanks again Bart,

    Yesterday I tried your suggestion, but I don't think my HAProxy is set up to handle it (see below). The primary node stayed with the virtual IP address. Also, to note, 'SET GLOBAL WSREP_ON = OFF;' complained that it wasn't a GLOBAL variable. I took out GLOBAL from the query and it worked, but I'm not sure if that was what I needed to do.

    Here's my HAProxy config:

    listen haproxy_node1_3307
    bind *:3307
    mode tcp
    timeout client 10800s
    timeout server 10800s
    balance leastconn
    option httpchk

    # Added by ben to check status of MySQL, need to research more to
    # determine proper implementation
    # See http://galeracluster.com/documentation-webpages/haproxy.html
    option tcpka
    option mysql-check user haproxy

    # option allbackups
    default-server port 9200 inter 2s downinter 5s rise 3 fall 2 slowstart 60s maxconn 64 maxqueue 128 weight 100
    server node3 node3:3306 check  # (node1/2/3 are mapped in /etc/hosts)
    server node1 node1:3306 check
    server node2 node2:3306 check
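
    For HAProxy to react to a desynced node, the health check has to look at the Galera state rather than at whether MySQL answers. As far as I understand HAProxy, 'option mysql-check' only confirms that the server completes a MySQL handshake/login, and it takes the place of 'option httpchk' when both appear in the same section. A minimal sketch of a clustercheck-style probe is below. The function and script names are hypothetical, it assumes the mysql client can log in without a prompt, and serving it on port 9200 (e.g. through xinetd) is not shown.

```bash
#!/bin/bash
# Sketch of a clustercheck-style probe for one Galera node.
# Assumption: the mysql client can log in without a prompt (e.g. ~/.my.cnf).
# Exposing this on port 9200 (e.g. via xinetd) is left out.

# Map a wsrep_local_state value to an HTTP status line (4 means Synced).
http_status_for_state() {
    if [ "$1" = "4" ]; then
        echo "HTTP/1.1 200 OK"
    else
        echo "HTTP/1.1 503 Service Unavailable"
    fi
}

# Query the node only when asked to, e.g. "galera_probe.sh check".
if [ "${1:-}" = "check" ]; then
    state="$(mysql -N -B -e "SHOW STATUS LIKE 'wsrep_local_state'" | awk '{print $2}')"
    printf '%s\r\n\r\n' "$(http_status_for_state "$state")"
fi
```

    With a probe like this answering on port 9200, keeping 'option httpchk' and dropping 'option mysql-check' would let HAProxy mark a desynced node as down.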

    ==========

    I tried another method of placing the node in maintenance mode and then stopping keepalived using pre-freeze scripts within Veeam, and this seemed to achieve the desired effect momentarily, but the IP returns to node1 after a few seconds. I have an executable node1_pre-freeze.sh saved on the cluster controller:

    #!/bin/bash

    # Place Node 1 into maintenance mode
    s9s maintenance --create \
    --nodes=node1 \
    --start="$(date -d 'now' '+%Y-%m-%d %H:%M:%S')" \
    --end="$(date -d 'now + 30 minutes' '+%Y-%m-%d %H:%M:%S')" \
    --reason="Veeam Backup" > /dev/null 2>&1

    # Pause the script for 15 sec to make sure we're in maintenance mode
    sleep 15s

    # Stop keepalived to transfer away the floating IP
    sudo -u cluster ssh node1 'sudo systemctl stop keepalived'
    exit 0

    And on Veeam I'm invoking it with a script saved on the Veeam machine, which gets uploaded to node1's /tmp prior to executing over SSH:

    #!/bin/bash

    su -c "ssh controller -i \
    /home/cluster/.ssh/id_rsa_controller \
    '/home/cluster/ClusterControl/node1_pre-freeze.sh'" cluster

    exit 0

    However, keepalived starts back up every time. I've tried with systemctl and '/etc/init.d/keepalived stop'. I realize this is the intention of High Availability, but I'd really like for it to stop automatically starting while in maintenance mode so that I can get a backup in without the services complaining that they can't communicate with the database.
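
    One alternative that might avoid fighting the automatic restart is to leave keepalived running and have it give up the IP on request, using a keepalived track script that fails while a flag file exists. This is an untested sketch: the flag path, script name and vrrp_script name are hypothetical, and the keepalived.conf fragment in the comments would need to be merged into the existing configuration.

```bash
#!/bin/bash
# Sketch of a keepalived track script. The flag path is a hypothetical choice.
# Assumed keepalived.conf wiring (adapt names to the existing config):
#   vrrp_script chk_backup_flag {
#       script "/usr/local/bin/chk_backup_flag.sh run"
#       interval 2
#   }
#   ...and inside the vrrp_instance block:
#   track_script {
#       chk_backup_flag
#   }

FLAG_FILE="${FLAG_FILE:-/var/run/veeam_backup.flag}"

# Succeed (node may hold the VIP) unless the flag file exists.
vip_allowed() {
    [ ! -e "$1" ]
}

# keepalived calls the script with "run"; the exit status is the verdict.
if [ "${1:-}" = "run" ]; then
    vip_allowed "$FLAG_FILE"
    exit $?
fi
```

    The pre-freeze script would then 'touch' the flag file on node1 and the post-thaw script would 'rm' it, so the VRRP instance should drop to a fault state during the snapshot and recover afterwards.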

     
