Nodes can't sync with cluster
Hi folks,
I have two very old MariaDB Galera Clusters (MariaDB: 5.5.46). Both are working but show status degraded. On each cluster one node is not able to rejoin the cluster. One is showing "Galera recovery blocked. The other says "Recovery in progress" but this is since 5 days and there are only a few small dbs.
"Galera Node Recovery in progress mysql error.log
170418 08:37:02 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
170418 08:37:02 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.aGTIFm' --pid-file='/var/lib/mysql/gedadvl21.a.space.corp-recover.pid'
170418 8:37:02 [Note] /usr/sbin/mysqld (mysqld 5.5.46-MariaDB-wsrep) starting as process 10852 ...
170418 08:37:05 mysqld_safe WSREP: Recovered position df0cea37-c175-11e3-b8f2-5ecb11682bc9:2248142
170418 8:37:05 [Note] WSREP: wsrep_start_position var submitted: 'df0cea37-c175-11e3-b8f2-5ecb11682bc9:2248142'
170418 8:37:05 [Note] /usr/sbin/mysqld (mysqld 5.5.46-MariaDB-wsrep) starting as process 10947 ...
170418 8:37:05 [Note] WSREP: Setting wsrep_ready to 0
170418 8:37:05 [Note] WSREP: Read nil XID from storage engines, skipping position init
170418 8:37:05 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/galera/libgalera_smm.so'
170418 8:37:05 [Note] WSREP: wsrep_load(): Galera 25.3.9(r3385) by Codership Oy <info@codership.com> loaded successfully.
170418 8:37:05 [Note] WSREP: CRC-32C: using hardware acceleration.
170418 8:37:05 [Note] WSREP: Found saved state: 00000000-0000-0000-0000-000000000000:-1
170418 8:37:05 [Note] WSREP: Passing config to GCS: base_host = 172.28.22.21; base_port = 4567; cert.log_conflicts = no; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.size = 128M; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S; pc.checksum = false; pc.ignore_quorum = false; pc.ignore_sb = false; pc.npvo = false; pc.recover
170418 8:37:05 [Note] WSREP: Service thread queue flushed.
170418 8:37:05 [Note] WSREP: Assign initial position for certification: -1, protocol version: -1
170418 8:37:05 [Note] WSREP: wsrep_sst_grab()
170418 8:37:05 [Note] WSREP: Start replication
170418 8:37:05 [Note] WSREP: Setting initial position to 00000000-0000-0000-0000-000000000000:-1
170418 8:37:05 [Note] WSREP: protonet asio version 0
170418 8:37:05 [Note] WSREP: Using CRC-32C for message checksums.
170418 8:37:05 [Note] WSREP: backend: asio
170418 8:37:05 [Warning] WSREP: access file(gvwstate.dat) failed(No such file or directory)
170418 8:37:05 [Note] WSREP: restore pc from disk failed
170418 8:37:05 [Note] WSREP: GMCast version 0
170418 8:37:05 [Note] WSREP: (339e2a04, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
170418 8:37:05 [Note] WSREP: (339e2a04, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
170418 8:37:05 [Note] WSREP: EVS version 0
170418 8:37:05 [Note] WSREP: gcomm: connecting to group 'my_wsrep_cluster', peer '172.28.22.20:4567'
170418 8:37:05 [Note] WSREP: (339e2a04, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://172.28.22.4:4567
170418 8:37:05 [Note] WSREP: declaring 3e72c11e at tcp://172.28.22.4:4567 stable
170418 8:37:05 [Note] WSREP: declaring 65ce0cd9 at tcp://172.28.22.20:4567 stable
170418 8:37:05 [Note] WSREP: Node 3e72c11e state prim
170418 8:37:05 [Note] WSREP: view(view_id(PRIM,339e2a04,351861) memb {
339e2a04,0
3e72c11e,0
65ce0cd9,0
} joined {
} left {
} partitioned {
})
170418 8:37:05 [Note] WSREP: save pc into disk
170418 8:37:05 [Note] WSREP: gcomm: connected
170418 8:37:05 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
170418 8:37:05 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
170418 8:37:05 [Note] WSREP: Opened channel 'my_wsrep_cluster'
170418 8:37:05 [Note] WSREP: Waiting for SST to complete.
170418 8:37:05 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 3
170418 8:37:05 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 33eb1da6-2412-11e7-b2f7-1ad02597951e
170418 8:37:05 [Note] WSREP: STATE EXCHANGE: sent state msg: 33eb1da6-2412-11e7-b2f7-1ad02597951e
170418 8:37:05 [Note] WSREP: STATE EXCHANGE: got state msg: 33eb1da6-2412-11e7-b2f7-1ad02597951e from 0 (gedadvl21.a.space.corp)
170418 8:37:05 [Note] WSREP: STATE EXCHANGE: got state msg: 33eb1da6-2412-11e7-b2f7-1ad02597951e from 1 (gedadvl22.a.space.corp)
170418 8:37:05 [Note] WSREP: STATE EXCHANGE: got state msg: 33eb1da6-2412-11e7-b2f7-1ad02597951e from 2 (gedadvl20.a.space.corp)
170418 8:37:05 [Note] WSREP: Quorum results:
version = 3,
component = PRIMARY,
conf_id = 207,
members = 2/3 (joined/total),
act_id = 2248147,
last_appl. = -1,
protocols = 0/7/3 (gcs/repl/appl),
group UUID = df0cea37-c175-11e3-b8f2-5ecb11682bc9
170418 8:37:05 [Note] WSREP: Flow-control interval: [28, 28]
170418 8:37:05 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 2248147)
170418 8:37:05 [Note] WSREP: State transfer required:
Group state: df0cea37-c175-11e3-b8f2-5ecb11682bc9:2248147
Local state: 00000000-0000-0000-0000-000000000000:-1
170418 8:37:05 [Note] WSREP: New cluster view: global state: df0cea37-c175-11e3-b8f2-5ecb11682bc9:2248147, view# 208: Primary, number of nodes: 3, my index: 0, protocol version 3
170418 8:37:05 [Warning] WSREP: Gap in state sequence. Need state transfer.
170418 8:37:05 [Note] WSREP: Setting wsrep_ready to 0
170418 8:37:05 [Note] WSREP: Running: 'wsrep_sst_rsync --role 'joiner' --address '172.28.22.21' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --parent '10947''
170418 8:37:05 [Note] WSREP: Prepared SST request: rsync|172.28.22.21:4444/rsync_sst
170418 8:37:05 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
170418 8:37:05 [Note] WSREP: REPL Protocols: 7 (3, 2)
170418 8:37:05 [Note] WSREP: Service thread queue flushed.
170418 8:37:05 [Note] WSREP: Assign initial position for certification: 2248147, protocol version: 3
170418 8:37:05 [Note] WSREP: Service thread queue flushed.
170418 8:37:05 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (df0cea37-c175-11e3-b8f2-5ecb11682bc9): 1 (Operation not permitted)
at galera/src/replicator_str.cpp:prepare_for_IST():456. IST will be unavailable.
170418 8:37:05 [Note] WSREP: Member 0.0 (gedadvl21.a.space.corp) requested state transfer from '*any*'. Selected 1.0 (gedadvl22.a.space.corp)(SYNCED) as donor.
170418 8:37:05 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 2248147)
170418 8:37:05 [Note] WSREP: Requesting state transfer: success, donor: 1
170418 8:37:08 [Note] WSREP: (339e2a04, 'tcp://0.0.0.0:4567') turning message relay requesting off
170418 8:37:39 [Warning] WSREP: last inactive check more than PT1.5S ago (PT3.25929S), skipping check
170418 8:37:39 [Note] WSREP: (339e2a04, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://172.28.22.20:4567 tcp://172.28.22.4:4567
170418 8:37:39 [Warning] WSREP: last inactive check more than PT1.5S ago (PT3.25929S), skipping check
170418 8:37:39 [Note] WSREP: (339e2a04, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://172.28.22.20:4567 tcp://172.28.22.4:4567
170418 8:37:40 [Note] WSREP: (339e2a04, 'tcp://0.0.0.0:4567') reconnecting to 65ce0cd9 (tcp://172.28.22.20:4567), attempt 0
170418 8:37:41 [Warning] WSREP: last inactive check more than PT1.5S ago (PT1.65275S), skipping check
170418 8:37:44 [Note] WSREP: (339e2a04, 'tcp://0.0.0.0:4567') turning message relay requesting off
170418 8:37:46 [Note] WSREP: view(view_id(NON_PRIM,339e2a04,351861) memb {
339e2a04,0
} joined {
} left {
} partitioned {
3e72c11e,0
65ce0cd9,0
})
170418 8:37:46 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
170418 8:37:46 [Note] WSREP: view(view_id(NON_PRIM,339e2a04,351862) memb {
339e2a04,0
} joined {
} left {
} partitioned {
3e72c11e,0
65ce0cd9,0
})
170418 8:37:46 [Note] WSREP: Flow-control interval: [16, 16]
170418 8:37:46 [Note] WSREP: Received NON-PRIMARY.
170418 8:37:46 [Note] WSREP: Shifting JOINER -> OPEN (TO: 2248147)
170418 8:37:46 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
170418 8:37:46 [Note] WSREP: Flow-control interval: [16, 16]
170418 8:37:46 [Note] WSREP: Received NON-PRIMARY.
170418 8:37:46 [Note] WSREP: declaring 3e72c11e at tcp://172.28.22.4:4567 stable
170418 8:37:46 [Note] WSREP: declaring 65ce0cd9 at tcp://172.28.22.20:4567 stable
170418 8:37:46 [Note] WSREP: re-bootstrapping prim from partitioned components
170418 8:37:46 [Note] WSREP: view(view_id(PRIM,339e2a04,351864) memb {
339e2a04,0
3e72c11e,0
65ce0cd9,0
} joined {
} left {
} partitioned {
})
170418 8:37:46 [Note] WSREP: save pc into disk
170418 8:37:46 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 3
170418 8:37:46 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 4c03987b-2412-11e7-a95e-4351f6b2c576
170418 8:37:46 [Note] WSREP: STATE EXCHANGE: sent state msg: 4c03987b-2412-11e7-a95e-4351f6b2c576
170418 8:37:46 [Note] WSREP: STATE EXCHANGE: got state msg: 4c03987b-2412-11e7-a95e-4351f6b2c576 from 0 (gedadvl21.a.space.corp)
170418 8:37:46 [Note] WSREP: STATE EXCHANGE: got state msg: 4c03987b-2412-11e7-a95e-4351f6b2c576 from 1 (gedadvl22.a.space.corp)
170418 8:37:46 [Note] WSREP: STATE EXCHANGE: got state msg: 4c03987b-2412-11e7-a95e-4351f6b2c576 from 2 (gedadvl20.a.space.corp)
170418 8:37:46 [Warning] WSREP: Quorum: No node with complete state:
Version : 3
Flags : 0x1
Protocols : 0 / 7 / 3
State : NON-PRIMARY
Prim state : JOINER
Prim UUID : 33eb1da6-2412-11e7-b2f7-1ad02597951e
Prim seqno : 208
First seqno : -1
Last seqno : 2248147
Prim JOINED : 2
State UUID : 4c03987b-2412-11e7-a95e-4351f6b2c576
Group UUID : df0cea37-c175-11e3-b8f2-5ecb11682bc9
Name : 'gedadvl21.a.space.corp'
Incoming addr: '172.28.22.21:3306'
Version : 3
Flags : 0x2
Protocols : 0 / 7 / 3
State : NON-PRIMARY
Prim state : DONOR
Prim UUID : 33eb1da6-2412-11e7-b2f7-1ad02597951e
Prim seqno : 208
First seqno : 2244503
Last seqno : 2248147
Prim JOINED : 2
State UUID : 4c03987b-2412-11e7-a95e-4351f6b2c576
Group UUID : df0cea37-c175-11e3-b8f2-5ecb11682bc9
Name : 'gedadvl22.a.space.corp'
Incoming addr: '172.28.22.4:3306'
Version : 3
Flags : 0x2
Protocols : 0 / 7 / 3
State : NON-PRIMARY
Prim state : SYNCED
Prim UUID : 33eb1da6-2412-11e7-b2f7-1ad02597951e
Prim seqno : 208
First seqno : 2232515
Last seqno : 2248147
Prim JOINED : 2
State UUID : 4c03987b-2412-11e7-a95e-4351f6b2c576
Group UUID : df0cea37-c175-11e3-b8f2-5ecb11682bc9
Name : 'gedadvl20.a.space.corp'
Incoming addr: '172.28.22.20:3306'
170418 8:37:46 [Note] WSREP: Full re-merge of primary 33eb1da6-2412-11e7-b2f7-1ad02597951e found: 2 of 2.
170418 8:37:46 [Note] WSREP: Quorum results:
version = 3,
component = PRIMARY,
conf_id = 208,
members = 2/3 (joined/total),
act_id = 2248147,
last_appl. = -1,
protocols = 0/7/3 (gcs/repl/appl),
group UUID = df0cea37-c175-11e3-b8f2-5ecb11682bc9
170418 8:37:46 [Note] WSREP: Flow-control interval: [28, 28]
170418 8:37:46 [Note] WSREP: Restored state OPEN -> JOINER (2248147)
170418 8:37:51 [Warning] WSREP: Could not find peer:
170418 8:37:51 [Warning] WSREP: 1.0 (gedadvl22.a.space.corp): State transfer to -1.-1 (left the group) failed: -255 (Unknown error 255)
170418 8:37:51 [Note] WSREP: Member 1.0 (gedadvl22.a.space.corp) synced with group.
"Galera Recovery blocked mysql error.log"
170418 06:54:34 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
170418 06:54:34 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.nI7gQW' --pid-file='/var/lib/mysql/gedasvl17.a.space.corp-recover.pid'
170418 6:54:34 [Note] /usr/sbin/mysqld (mysqld 5.5.46-MariaDB-wsrep) starting as process 31856 ...
170418 06:54:38 mysqld_safe WSREP: Recovered position d50b5d83-c0a8-11e3-b31d-a7fbdb0acab2:8138017
170418 6:54:38 [Note] WSREP: wsrep_start_position var submitted: 'd50b5d83-c0a8-11e3-b31d-a7fbdb0acab2:8138017'
170418 6:54:38 [Note] /usr/sbin/mysqld (mysqld 5.5.46-MariaDB-wsrep) starting as process 32077 ...
170418 6:54:38 [Note] WSREP: Setting wsrep_ready to 0
170418 6:54:38 [Note] WSREP: Read nil XID from storage engines, skipping position init
170418 6:54:38 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/galera/libgalera_smm.so'
170418 6:54:38 [Note] WSREP: wsrep_load(): Galera 25.3.9(r3385) by Codership Oy <info@codership.com> loaded successfully.
170418 6:54:38 [Note] WSREP: CRC-32C: using hardware acceleration.
170418 6:54:38 [Note] WSREP: Found saved state: 00000000-0000-0000-0000-000000000000:-1
170418 6:54:38 [Note] WSREP: Passing config to GCS: base_host = 172.28.7.174; base_port = 4567; cert.log_conflicts = no; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.size = 16384M; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S; pc.checksum = false; pc.ignore_quorum = false; pc.ignore_sb = false; pc.npvo = false; pc.recov
170418 6:54:38 [Note] WSREP: Service thread queue flushed.
170418 6:54:38 [Note] WSREP: Assign initial position for certification: -1, protocol version: -1
170418 6:54:38 InnoDB: The InnoDB memory heap is disabled
170418 6:54:38 InnoDB: Mutexes and rw_locks use GCC atomic builtins
170418 6:54:38 InnoDB: Compressed tables use zlib 1.2.3
170418 6:54:38 InnoDB: Using Linux native AIO
170418 6:54:38 InnoDB: Initializing buffer pool, size = 5.5G
170418 6:54:38 InnoDB: Completed initialization of buffer pool
170418 6:54:38 InnoDB: highest supported file format is Barracuda.
170418 6:54:39 InnoDB: Waiting for the background threads to start
170418 6:54:40 Percona XtraDB (http://www.percona.com) 5.5.45-MariaDB-37.4 started; log sequence number 185622560680
170418 6:54:40 [Note] Plugin 'FEEDBACK' is disabled.
170418 6:54:40 [Note] Server socket created on IP: '0.0.0.0'.
170418 6:54:40 [Note] Event Scheduler: Loaded 0 events
170418 6:54:40 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.5.46-MariaDB-wsrep' socket: '/var/lib/mysql/mysql.sock' port: 3306 MariaDB Server, wsrep_25.12.r4f81026
-
Hi Oliver,
It would be helpful if you could provide MySQL error logs on all DB nodes. Use our error reporting tool to generate it as described in this page:
https://severalnines.com/docs/troubleshooting.html#error-reportingThen, submit a support ticket so we can take a look at the debugging data privately (take note that this thread is posted in public forum).
Regards,
Ashraf
Please sign in to leave a comment.
Comments
1 comment