Failed to rejoin Galera cluster
Dear all,
we have 4node Percona XtraDB Cluster v5.5.x and until yesterday all worked fine. Yesterday we had to temporarily disconnect one node due to maintenance and now we are not able to rejoin the cluster again.
Each time we start the node it fails after SST. I tried to completely deleted all files inside the MySQL datadir but the result is the same.
Could you please someone help?
Thanks in advance!
mysql_error.log
============
160405 22:06:36 [Note] WSREP: Read nil XID from storage engines, skipping position init
160405 22:06:36 [Note] WSREP: wsrep_load(): loading provider library 'none'
160405 22:06:36 [Note] WSREP: Service disconnected.
160405 22:06:37 [Note] WSREP: Some threads may fail to exit.
Error in my_thread_global_end(): 1 threads didn't exit
160405 22:06:42 [Note] WSREP: Read nil XID from storage engines, skipping position init
160405 22:06:42 [Note] WSREP: wsrep_load(): loading provider library 'none'
160405 22:06:43 [Note] WSREP: Service disconnected.
160405 22:06:44 [Note] WSREP: Some threads may fail to exit.
Error in my_thread_global_end(): 1 threads didn't exit
160405 22:06:49 mysqld_safe Starting mysqld daemon with databases from /galera
160405 22:06:49 mysqld_safe Skipping wsrep-recover for empty datadir: /galera
160405 22:06:49 mysqld_safe Assigning 00000000-0000-0000-0000-000000000000:-1 to wsrep_start_position
160405 22:06:49 [Note] WSREP: wsrep_start_position var submitted: '00000000-0000-0000-0000-000000000000:-1'
160405 22:06:49 [Note] WSREP: Read nil XID from storage engines, skipping position init
160405 22:06:49 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/libgalera_smm.so'
160405 22:06:49 [Note] WSREP: wsrep_load(): Galera 2.12(r318911d) by Codership Oy <info@codership.com> loaded successfully.
160405 22:06:49 [Warning] WSREP: Could not open saved state file for reading: /galera//grastate.dat
160405 22:06:49 [Note] WSREP: Found saved state: 00000000-0000-0000-0000-000000000000:-1
160405 22:06:49 [Note] WSREP: Preallocating 5368710440/5368710440 bytes in '/galera//galera.cache'...
160405 22:06:49 [Note] WSREP: Passing config to GCS: base_host = 10.1.1.4; base_port = 4567; cert.log_conflicts = no; debug = no; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /galera/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /galera//galera.cache; gcache.page_size = 128M; gcache.size = 5G; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.version = 0; pc.announce_timeout = PT3S; pc.checksum = false; pc.ignore_quorum = false; pc.ignore_sb = false; pc.npvo = false; pc.version = 0; pc.wait_prim = true; pc.wait_prim_timeout = P30S; pc.weight = 1; protonet.backend = asio; protonet.version
160405 22:06:50 [Note] WSREP: Assign initial position for certification: -1, protocol version: -1
160405 22:06:50 [Note] WSREP: wsrep_sst_grab()
160405 22:06:50 [Note] WSREP: Start replication
160405 22:06:50 [Note] WSREP: Setting initial position to 00000000-0000-0000-0000-000000000000:-1
160405 22:06:50 [Note] WSREP: protonet asio version 0
160405 22:06:50 [Note] WSREP: backend: asio
160405 22:06:50 [Note] WSREP: GMCast version 0
160405 22:06:50 [Note] WSREP: (eebb1260, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
160405 22:06:50 [Note] WSREP: (eebb1260, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
160405 22:06:50 [Note] WSREP: EVS version 0
160405 22:06:50 [Note] WSREP: PC version 0
160405 22:06:50 [Note] WSREP: gcomm: connecting to group 'My_Galera_Cluster', peer '10.1.1.1:,10.1.1.2:,10.1.1.3:'
160405 22:06:50 [Note] WSREP: (eebb1260, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers:
160405 22:06:50 [Note] WSREP: declaring 7c575c0d at tcp://10.1.1.1:4567 stable
160405 22:06:50 [Note] WSREP: declaring eed4f866 at tcp://10.1.1.2:4567 stable
160405 22:06:50 [Note] WSREP: declaring f203440b at tcp://10.1.1.3:4567 stable
160405 22:06:50 [Note] WSREP: Node 7c575c0d state prim
160405 22:06:50 [Note] WSREP: gcomm: connected
160405 22:06:50 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
160405 22:06:50 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
160405 22:06:50 [Note] WSREP: Opened channel 'Jablotron_Galera_Cluster'
160405 22:06:50 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 4
160405 22:06:50 [Note] WSREP: Waiting for SST to complete.
160405 22:06:50 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
160405 22:06:50 [Note] WSREP: STATE EXCHANGE: sent state msg: eebc36bc-fb69-11e5-93d0-d34c86a21acd
160405 22:06:50 [Note] WSREP: STATE EXCHANGE: got state msg: eebc36bc-fb69-11e5-93d0-d34c86a21acd from 0 (db3)
160405 22:06:50 [Note] WSREP: STATE EXCHANGE: got state msg: eebc36bc-fb69-11e5-93d0-d34c86a21acd from 2 (db1)
160405 22:06:50 [Note] WSREP: STATE EXCHANGE: got state msg: eebc36bc-fb69-11e5-93d0-d34c86a21acd from 3 (db2)
160405 22:06:50 [Note] WSREP: STATE EXCHANGE: got state msg: eebc36bc-fb69-11e5-93d0-d34c86a21acd from 1 (db4)
160405 22:06:50 [Note] WSREP: Quorum results:
version = 2,
component = PRIMARY,
conf_id = 178,
members = 3/4 (joined/total),
act_id = 247014377,
last_appl. = -1,
protocols = 0/4/3 (gcs/repl/appl),
group UUID = 04f450ca-def9-11e5-95f1-b33677ec468f
160405 22:06:50 [Note] WSREP: Flow-control interval: [32, 32]
160405 22:06:50 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 247014377)
160405 22:06:50 [Note] WSREP: State transfer required:
Group state: 04f450ca-def9-11e5-95f1-b33677ec468f:247014377
Local state: 00000000-0000-0000-0000-000000000000:-1
160405 22:06:50 [Note] WSREP: New cluster view: global state: 04f450ca-def9-11e5-95f1-b33677ec468f:247014377, view# 179: Primary, number of nodes: 4, my index: 1, protocol version 3
160405 22:06:50 [Warning] WSREP: Gap in state sequence. Need state transfer.
160405 22:06:50 [Note] WSREP: Running: 'wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.1.1.4:4444' --auth 'xbuser:acRHRlm5aBX' --datadir '/galera/' --defaults-file '/etc/my.cnf' --parent '21480''
WSREP_SST: [INFO] Streaming with xbstream (20160405 22:06:50.622)
WSREP_SST: [INFO] Using socat as streamer (20160405 22:06:50.623)
WSREP_SST: [INFO] Evaluating timeout -k 110 100 socat -u TCP-LISTEN:4444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20160405 22:06:50.645)
160405 22:06:50 [Note] WSREP: Prepared SST request: xtrabackup-v2|10.1.1.4:4444/xtrabackup_sst
160405 22:06:50 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
160405 22:06:50 [Note] WSREP: Assign initial position for certification: 247014377, protocol version: 2
160405 22:06:50 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (04f450ca-def9-11e5-95f1-b33677ec468f): 1 (Operation not permitted)
at galera/src/replicator_str.cpprepare_for_IST():447. IST will be unavailable.
160405 22:06:50 [Note] WSREP: Node 1 (db4) requested state transfer from '*any*'. Selected 0 (db3)(SYNCED) as donor.
160405 22:06:50 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 247014456)
160405 22:06:50 [Note] WSREP: Requesting state transfer: success, donor: 0
WSREP_SST: [INFO] Evaluating socat -u TCP-LISTEN:4444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20160405 22:06:51.061)
WSREP_SST: [INFO] Proceeding with SST (20160405 22:06:51.076)
WSREP_SST: [INFO] Cleaning the existing datadir and innodb-data/log directories (20160405 22:06:51.078)
removed ‘/galera/mysql/db.frm’
removed ‘/galera/mysql/db.MYI’
removed ‘/galera/mysql/db.MYD’
removed ‘/galera/mysql/host.frm’
removed ‘/galera/mysql/host.MYI’
removed ‘/galera/mysql/host.MYD’
removed ‘/galera/mysql/user.frm’
removed ‘/galera/mysql/user.MYI’
removed ‘/galera/mysql/user.MYD’
removed ‘/galera/mysql/func.frm’
removed ‘/galera/mysql/func.MYI’
removed ‘/galera/mysql/func.MYD’
removed ‘/galera/mysql/plugin.frm’
removed ‘/galera/mysql/plugin.MYI’
removed ‘/galera/mysql/plugin.MYD’
removed ‘/galera/mysql/servers.frm’
removed ‘/galera/mysql/servers.MYI’
removed ‘/galera/mysql/servers.MYD’
removed ‘/galera/mysql/tables_priv.frm’
removed ‘/galera/mysql/tables_priv.MYI’
removed ‘/galera/mysql/tables_priv.MYD’
removed ‘/galera/mysql/columns_priv.frm’
removed ‘/galera/mysql/columns_priv.MYI’
removed ‘/galera/mysql/columns_priv.MYD’
etc.
WSREP_SST: [INFO] Waiting for SST streaming to complete! (20160405 22:06:51.089)
160405 22:06:53 [Note] WSREP: (eebb1260, 'tcp://0.0.0.0:4567') turning message relay requesting off
160406 0:13:09 [Note] WSREP: 0 (db3): State transfer to 1 (db4) complete.
WSREP_SST: [INFO] Preparing the backup at /galera//.sst (20160406 00:13:09.207)
WSREP_SST: [INFO] Evaluating innobackupex --no-version-check --apply-log $rebuildcmd ${DATA} &>${DATA}/innobackup.prepare.log (20160406 00:13:09.209)
160406 0:13:09 [Note] WSREP: Member 0 (db3) synced with group.
Terminated
WSREP_SST: [ERROR] Removing /galera//.sst/xtrabackup_galera_info file due to signal (20160406 00:36:32.563)
WSREP_SST: [ERROR] Cleanup after exit with status:143 (20160406 00:36:32.566)
160406 0:36:32 [ERROR] WSREP: Process was aborted.
160406 0:36:32 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.134.203.213:4444' --auth 'xbuser:8JS5PiGacRHRlm5aBXOZviaU' --datadir '/galera/' --defaults-file '/etc/my.cnf' --parent '21480': 10 (No child processes)
160406 0:36:32 [ERROR] WSREP: Failed to read uuid:seqno from joiner script.
160406 0:36:32 [ERROR] WSREP: SST failed: 10 (No child processes)
160406 0:36:32 [ERROR] Aborting
-
Hi,
Do you have ClusterControl auto recovery feature enabled? If you are performing manual maintenance, you have to disable this feature so ClusterControl won't try to recover the failed node. When recovering a failed node, ClusterControl will ensure all related processes are terminated before firing them up again.
If auto recovery was disabled, please send us the MySQL error log and innobackup.*.log from db3 and db4. As a reminder, this is a public forum. You can submit a support ticket if you would like to maintain privacy with the logs.
Regards,
Ashraf
Please sign in to leave a comment.
Comments
1 comment