Galera Cluster - Different DB size on each node / Resync Node

Comments

8 comments

  • Ashraf Sharif

    Hi,

    Regarding GRAheader for 5.7, you may try this one: https://www.dropbox.com/s/xeokdy4ok2cyc3o/GRA_header57.tar.gz?dl=0

    Regarding mysqlcheck:

    AFAIK, the mysqlcheck command is not replicated in Galera, and it uses explicit LOCK/UNLOCK TABLES statements during the operation, which are not supported. Details on this can be found at https://mariadb.com/kb/en/library/mariadb-galera-cluster-known-limitations/ . The error log that you've shown relates to an error when MySQL was starting up, not the actual reason why the node got kicked out of the cluster in the first place. If you could provide the whole MySQL error log, that would be helpful.

    Regarding node resync:

    Yes, that is the best way to fully resync a node in Galera: wipe out the /var/lib/mysql content (or simply delete grastate.dat). Depending on the DB size, it can have a performance impact on the donor node, so handle it with care.
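
    For reference, a minimal sketch of that procedure (assuming a systemd service named mysql and the default datadir; adjust to your setup):

    # on the node to be resynced (hypothetical service name and paths)
    systemctl stop mysql
    rm -f /var/lib/mysql/grastate.dat   # or wipe the whole datadir to force a full SST
    systemctl start mysql               # the node rejoins and requests SST from a donor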

    Regarding adding users to new nodes:

    When a node rejoins and resyncs via SST, the new node should have all the users from the node it synced from. If you are adding a user via CREATE/GRANT statements, the user will be synced by Galera automatically. I'm not sure what you really mean by manually adding users to the new node; one use case I can think of is adding a user via an 'INSERT INTO' statement on the mysql.user table, which requires FLUSH PRIVILEGES and won't be synced by Galera because of its MyISAM storage engine.
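
    As a rough illustration (user name, host, and grant are hypothetical placeholders):

    # replicated cluster-wide by Galera (DDL/DCL statements)
    mysql -e "CREATE USER 'app'@'10.0.104.%' IDENTIFIED BY 'secret';"
    mysql -e "GRANT SELECT, INSERT ON db3.* TO 'app'@'10.0.104.%';"
    # NOT replicated: a direct INSERT into the MyISAM mysql.user table followed by
    # FLUSH PRIVILEGES only takes effect on the node where it was run.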

    Regards,
    Ashraf

  • M Mueller

    Hi Ashraf,

     

    thanks for your help.

     

    Yesterday the same node got kicked out again. We do a full mysqldump on that one every hour (command: /usr/bin/mysqldump --defaults-file=/etc/mysql/my.cnf --opt --single-transaction --skip-lock-tables --skip-add-locks --set-gtid-purged=OFF --routines --triggers --all-databases > destination.sql ).

     

    Is that command right?

     

    Here are the errors on that node, before it got kicked out:

     

    2019-01-24T11:01:02.875232Z 8 [ERROR] Slave SQL: Could not execute Write_rows event on table db3.UserEvent; Cannot add or update a child row: a foreign key constraint fails (`db3`.`UserEvent`, CONSTRAINT `UserEvent_ibfk_1` FOREIGN KEY (`eid`) REFERENCES `Event` (`uuid`) ON DELETE CASCADE), Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW; the event's master log FIRST, end_log_pos 241, Error_code: 1452

    2019-01-24T11:01:02.875247Z 8 [Warning] WSREP: RBR event 3 Write_rows apply warning: 151, 12370423

    2019-01-24T11:01:02.875331Z 8 [Note] WSREP: Applier statement rollback needed

    2019-01-24T11:01:02.875394Z 8 [Warning] WSREP: Failed to apply app buffer: seqno: 12370423, status: 1

    at galera/src/trx_handle.cpp:apply():353

    Retrying 2th time

    2019-01-24T11:01:02.875537Z 8 [ERROR] Slave SQL: Could not execute Write_rows event on table db3.UserEvent; Cannot add or update a child row: a foreign key constraint fails (`db3`.`UserEvent`, CONSTRAINT `UserEvent_ibfk_1` FOREIGN KEY (`eid`) REFERENCES `Event` (`uuid`) ON DELETE CASCADE), Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW; the event's master log FIRST, end_log_pos 241, Error_code: 1452

    2019-01-24T11:01:02.875544Z 8 [Warning] WSREP: RBR event 3 Write_rows apply warning: 151, 12370423

    2019-01-24T11:01:02.875608Z 8 [Note] WSREP: Applier statement rollback needed

    2019-01-24T11:01:02.875636Z 8 [Warning] WSREP: Failed to apply app buffer: seqno: 12370423, status: 1

    at galera/src/trx_handle.cpp:apply():353

    Retrying 3th time

    2019-01-24T11:01:02.875753Z 8 [ERROR] Slave SQL: Could not execute Write_rows event on table db3.UserEvent; Cannot add or update a child row: a foreign key constraint fails (`db3`.`UserEvent`, CONSTRAINT `UserEvent_ibfk_1` FOREIGN KEY (`eid`) REFERENCES `Event` (`uuid`) ON DELETE CASCADE), Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW; the event's master log FIRST, end_log_pos 241, Error_code: 1452

    2019-01-24T11:01:02.875758Z 8 [Warning] WSREP: RBR event 3 Write_rows apply warning: 151, 12370423

    2019-01-24T11:01:02.875917Z 8 [Note] WSREP: Applier statement rollback needed

    2019-01-24T11:01:02.875951Z 8 [Warning] WSREP: Failed to apply app buffer: seqno: 12370423, status: 1

    at galera/src/trx_handle.cpp:apply():353

    Retrying 4th time

    2019-01-24T11:01:02.876072Z 8 [ERROR] Slave SQL: Could not execute Write_rows event on table db3.UserEvent; Cannot add or update a child row: a foreign key constraint fails (`db3`.`UserEvent`, CONSTRAINT `UserEvent_ibfk_1` FOREIGN KEY (`eid`) REFERENCES `Event` (`uuid`) ON DELETE CASCADE), Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW; the event's master log FIRST, end_log_pos 241, Error_code: 1452

    2019-01-24T11:01:02.876081Z 8 [Warning] WSREP: RBR event 3 Write_rows apply warning: 151, 12370423

    2019-01-24T11:01:02.876135Z 8 [Note] WSREP: Applier statement rollback needed

    2019-01-24T11:01:02.889001Z 8 [ERROR] WSREP: Failed to apply trx: source: 4827da4a-0223-11e9-be7f-87eaa7558c60 version: 4 local: 0 state: APPLYING flags: 1 conn_id: 41296663 trx_id: 5250016608 seqnos (l: 10841157, g: 12370423, s: 12370422, d: 12370320, ts: 3259055077710382)

    2019-01-24T11:01:02.889019Z 8 [ERROR] WSREP: Failed to apply trx 12370423 4 times

    2019-01-24T11:01:02.889023Z 8 [ERROR] WSREP: Node consistency compromised, aborting...

    2019-01-24T11:01:02.895259Z 8 [Note] WSREP: turning isolation on

    2019-01-24T11:01:02.896259Z 8 [Note] WSREP: Closing send monitor...

    2019-01-24T11:01:02.896269Z 8 [Note] WSREP: Closed send monitor.

    2019-01-24T11:01:02.896276Z 8 [Note] WSREP: gcomm: terminating thread

    2019-01-24T11:01:02.896287Z 8 [Note] WSREP: gcomm: joining thread

    2019-01-24T11:01:02.898108Z 8 [Note] WSREP: gcomm: closing backend

    2019-01-24T11:01:03.197365Z 8 [Note] WSREP: (d41e767f, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://10.0.104.13:4567 tcp://10.0.104.15:4567

    2019-01-24T11:01:07.898246Z 8 [Note] WSREP: declaring node with index 0 suspected, timeout PT5S (evs.suspect_timeout)

    2019-01-24T11:01:07.898264Z 8 [Note] WSREP: declaring node with index 1 suspected, timeout PT5S (evs.suspect_timeout)

    2019-01-24T11:01:07.898272Z 8 [Note] WSREP: evs::proto(d41e767f, LEAVING, view_id(REG,2e409302,41)) suspecting node: 2e409302

    2019-01-24T11:01:07.898278Z 8 [Note] WSREP: evs::proto(d41e767f, LEAVING, view_id(REG,2e409302,41)) suspected node without join message, declaring inactive

    2019-01-24T11:01:07.898283Z 8 [Note] WSREP: evs::proto(d41e767f, LEAVING, view_id(REG,2e409302,41)) suspecting node: 4827da4a

    2019-01-24T11:01:07.898287Z 8 [Note] WSREP: evs::proto(d41e767f, LEAVING, view_id(REG,2e409302,41)) suspected node without join message, declaring inactive

    2019-01-24T11:01:07.898313Z 8 [Note] WSREP: Current view of cluster as seen by this node

    view (view_id(NON_PRIM,2e409302,41)

    memb {

    d41e767f,0

    }

    joined {

    }

    left {

    }

    partitioned {

    2e409302,0

    4827da4a,0

    }

    )

    2019-01-24T11:01:07.898341Z 8 [Note] WSREP: Current view of cluster as seen by this node

    view ((empty))

    2019-01-24T11:01:07.899501Z 8 [Note] WSREP: gcomm: closed

    2019-01-24T11:01:07.899570Z 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1

    2019-01-24T11:01:07.899853Z 0 [Note] WSREP: Flow-control interval: [100, 100]

    2019-01-24T11:01:07.899859Z 0 [Note] WSREP: Trying to continue unpaused monitor

    2019-01-24T11:01:07.899863Z 0 [Note] WSREP: Received NON-PRIMARY.

    2019-01-24T11:01:07.899866Z 0 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 12370428)

    2019-01-24T11:01:07.899881Z 0 [Note] WSREP: Received self-leave message.

    2019-01-24T11:01:07.899887Z 0 [Note] WSREP: Flow-control interval: [0, 0]

    2019-01-24T11:01:07.899890Z 0 [Note] WSREP: Trying to continue unpaused monitor

    2019-01-24T11:01:07.899894Z 0 [Note] WSREP: Received SELF-LEAVE. Closing connection.

    2019-01-24T11:01:07.899897Z 0 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 12370428)

    2019-01-24T11:01:07.900345Z 0 [Note] WSREP: RECV thread exiting 0: Success

    2019-01-24T11:01:07.901703Z 8 [Note] WSREP: recv_thread() joined.

    2019-01-24T11:01:07.901717Z 8 [Note] WSREP: Closing replication queue.

    2019-01-24T11:01:07.901721Z 8 [Note] WSREP: Closing slave action queue.

    2019-01-24T11:01:07.902133Z 8 [Note] WSREP: /usr/sbin/mysqld: Terminated.

     

     

    We are only sending traffic to the first node; this is the second one. I only see errors on the second and third nodes. The third node is always used as the donor in such cases. Do you have an idea what the problem is and how we can solve it? Does this happen very often on other clusters?

     

    With a normal MySQL installation we never had problems.

  • M Mueller

    We got the same error again some time after a new resync, but this time the node keeps working just fine.

  • Ashraf Sharif

    Hi,

    Can you check if you have any tables without a primary key defined? I don't know which version you are running, but your issue seems related to this bug: https://jira.percona.com/browse/PXC-1566 .
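
    In case it helps, a query along these lines (plain information_schema, nothing version-specific) lists tables without a primary key:

    mysql -e "SELECT t.table_schema, t.table_name
              FROM information_schema.tables t
              LEFT JOIN information_schema.table_constraints c
                     ON c.table_schema = t.table_schema
                    AND c.table_name = t.table_name
                    AND c.constraint_type = 'PRIMARY KEY'
              WHERE t.table_type = 'BASE TABLE'
                AND c.constraint_name IS NULL
                AND t.table_schema NOT IN ('mysql','sys','information_schema','performance_schema');"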

    Since you mentioned that only the first node is receiving all the traffic, that node is the best candidate for a donor if you want to resync everything again.
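
    If you want to pin the donor explicitly, the joining node can request a specific donor via wsrep_sst_donor in my.cnf (the value must match the donor's wsrep_node_name; 'node1' below is just a placeholder):

    [mysqld]
    # prefer node1 as donor; the trailing comma allows falling back to any other available node
    wsrep_sst_donor = node1,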

    Regards,
    Ashraf

  • M Mueller

    No, we do not have any tables without primary keys. Our version is:

     

    Percona-XtraDB-Cluster-server 5.7.23-23-57

    Percona XtraDB Cluster (GPL), Release rel23, Revision f5578f0, WSREP version 31.31, wsrep_31.31
     

     

    So I think it is fairly up to date.

     

    How can I choose which node to use as the donor? That node is running with binary logging enabled and is shown as Master in ClusterControl. Would I have to shut down nodes 3 and 2 (so there is only node 1 left) and then let node 2 resync?

     

  • M Mueller

    Any idea?

  • Paul Namuag

    Hi M Mueller,

    Your issue is mainly inconsistency. Galera does have data inconsistency detection while a transaction is being processed. You can read the FAQ here: http://galeracluster.com/documentation-webpages/faq.html. Possible scenarios are a split brain (see https://severalnines.com/blog/how-recover-galera-cluster-or-mysql-replication-split-brain-syndrome) or, for instance, wsrep_on having been set to OFF on a node.
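
    As a quick sanity check on each node (standard wsrep status variables):

    mysql -e "SHOW GLOBAL VARIABLES LIKE 'wsrep_on';"
    mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';"
    mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';"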

     

    I suggest you find the most consistent node and fix or sync the affected table. You can use mysqldump in this case, or use the pt-table-checksum tool from Percona (see this post: https://www.percona.com/blog/2012/10/15/using-pt-table-checksum-with-percona-xtradb-cluster/). Once you are able to identify the most correct and consistent node, you can sync from that node to the affected node(s).
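
    A rough sketch of the checksum run described in that post (host, credentials, and schema are placeholders; needs a reasonably recent Percona Toolkit):

    # run on one node; the 'cluster' recursion method auto-discovers the other Galera nodes
    pt-table-checksum h=10.0.104.11,u=checksum_user,p=secret --recursion-method=cluster --databases db3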

  • M Mueller

    Just wanted to post an update: the reason was the foreign key constraint combined with the mysqldump I run every hour.

     

    As soon as I set wsrep_slave_threads = 1, it worked without any problem.
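
    For anyone finding this later: the variable is dynamic, so it can be changed at runtime and then persisted in my.cnf (sketch):

    mysql -e "SET GLOBAL wsrep_slave_threads = 1;"
    # and under [mysqld] in my.cnf so it survives a restart:
    #   wsrep_slave_threads = 1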

     

    But thanks for your help.

