Whole cluster crashing again and again
Hi, our cluster keeps crashing, seemingly at random.
We run an S9 MariaDB cluster on CentOS 7:
[root@gedadvl80 ~]# rpm -qa | grep -i control
clustercontrol-clud-1.9.2-332.x86_64
clustercontrol-ssh-1.9.2-118.x86_64
clustercontrol-cloud-1.9.2-332.x86_64
clustercontrol-1.9.2-8244.x86_64
clustercontrol-controller-1.9.2-5242.x86_64
clustercontrol-notifications-1.9.2-306.x86_64
[root@gedadvl80 ~]# cat /etc/centos-release
CentOS Linux release 7.9.2009 (Core)
According to ClusterControl, the cluster crashed on 2022-03-13 at 21:29 CET. Our systems are set to UTC, so the logs show the problem at 20:29.
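For anyone matching the ClusterControl timestamp against the node logs, the offset can be checked on the command line. This is a minimal sketch assuming GNU date (as on CentOS 7) and a fixed +0100 offset, since CET was one hour ahead of UTC on that date; the function name `cet_to_utc` is only illustrative.

```shell
# Convert a CET wall-clock time to UTC.
# Assumes GNU date; the fixed +0100 offset is CET without daylight saving.
cet_to_utc() {
    # $1 is "YYYY-MM-DD HH:MM" in CET
    date -u -d "$1 +0100" '+%Y-%m-%d %H:%M'
}

cet_to_utc '2022-03-13 21:29'   # prints 2022-03-13 20:29
```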
I checked all logs below /var/log and found this:
/var/log/messages:Mar 13 20:29:01 gedadvl81 kernel: mariadbd[1545]: segfault at 0 ip 000055c86ba9f0e6 sp 00007f9f15a2aa30 error 6 in mariadbd[55c86aaf8000+1664000]
mysqld log:
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<std::system_error> >'
what(): remote_endpoint: Transport endpoint is not connected
220313 20:29:01 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Server version: 10.5.15-MariaDB-log
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=5
max_threads=502
thread_count=12
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1236187 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x49000
??:0(my_print_stacktrace)[0x55c86baa1e3e]
??:0(handle_fatal_signal)[0x55c86b4939c7]
sigaction.c:0(__restore_rt)[0x7f9f2232e630]
:0(__GI_raise)[0x7f9f21779387]
:0(__GI_abort)[0x7f9f2177aa78]
??:0(__gnu_cxx::__verbose_terminate_handler())[0x7f9f21e73a95]
??:0(std::rethrow_exception(std::__exception_ptr::exception_ptr))[0x7f9f21e71a06]
??:0(std::terminate())[0x7f9f21e71a33]
??:0(__cxa_throw)[0x7f9f21e71c53]
4.8.2/stdexcept:112(runtime_error)[0x7f9f16c97f44]
impl/throw_error.ipp:49(asio::detail::do_throw_error(std::error_code const&, char const*))[0x7f9f16c97fec]
asio/error.hpp:228(get_system_category)[0x7f9f16caa11b]
src/gu_asio_stream_react.cpp:549(gu::AsioStreamReact::assign_addresses())[0x7f9f16ca3e8f]
src/gu_asio_stream_react.cpp:890(gu::AsioAcceptorReact::accept_handler(std::shared_ptr<gu::AsioStreamReact> const&, std::shared_ptr<gu::AsioAcceptorHandler> const&, std::error_code const&))[0x7f9f16ca6780]
detail/gcc_x86_fenced_block.hpp:80(asio::detail::gcc_x86_fenced_block::sbarrier())[0x7f9f16caadd3]
impl/task_io_service.ipp:373(asio::detail::task_io_service::do_run_one(asio::detail::scoped_lock<asio::detail::posix_mutex>&, asio::detail::task_io_service_thread_info&, std::error_code const&))[0x7f9f16c973eb]
impl/task_io_service.ipp:148(asio::detail::task_io_service::run(std::error_code&))[0x7f9f16c93281]
src/asio_protonet.cpp:104(gcomm::AsioProtonet::event_loop(gu::datetime::Period const&))[0x7f9f16bb92e9]
src/gu_threads.h:187(gu_mutex_lock_SYS)[0x7f9f16ba00d6]
src/gu_threads.h:105(gu_thread_exit)[0x7f9f16ba04a6]
pthread_create.c:0(start_thread)[0x7f9f22326ea5]
??:0(__clone)[0x7f9f21841b0d]
The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
information that should help you find out what is causing the crash.
Writing a core file...
Working directory at /var/lib/mysql
Resource Limits:
Fatal signal 11 while backtracing
2022-03-13 20:29:08 0 [Note] WSREP: Loading provider /usr/lib64/galera-4/libgalera_smm.so initial position: 08d993a0-1adc-11ec-a16e-1fb4d5c71ffa:25375
2022-03-13 20:29:08 0 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/galera-4/libgalera_smm.so'
2022-03-13 20:29:08 0 [Note] WSREP: wsrep_load(): Galera 26.4.11(r67341d0) by Codership Oy <info@codership.com> loaded successfully.
2022-03-13 20:29:08 0 [Note] WSREP: CRC-32C: using 64-bit x86 acceleration.
2022-03-13 20:29:08 0 [Note] WSREP: Found saved state: 08d993a0-1adc-11ec-a16e-1fb4d5c71ffa:-1, safe_to_bootstrap: 0
2022-03-13 20:29:08 0 [Note] WSREP: GCache DEBUG: opened preamble:
Version: 2
UUID: 08d993a0-1adc-11ec-a16e-1fb4d5c71ffa
Seqno: -1 - -1
Offset: -1
Synced: 0
2022-03-13 20:29:08 0 [Note] WSREP: Recovering GCache ring buffer: version: 2, UUID: 08d993a0-1adc-11ec-a16e-1fb4d5c71ffa, offset: -1
2022-03-13 20:29:08 0 [Note] WSREP: GCache::RingBuffer initial scan... 0.0% ( 0/1073741848 bytes) complete.
2022-03-13 20:29:09 0 [Note] WSREP: GCache::RingBuffer initial scan...100.0% (1073741848/1073741848 bytes) complete.
2022-03-13 20:29:09 0 [Note] WSREP: Recovering GCache ring buffer: found gapless sequence 24866-25375
2022-03-13 20:29:09 0 [Note] WSREP: GCache::RingBuffer unused buffers scan... 0.0% ( 0/252128 bytes) complete.
2022-03-13 20:29:09 0 [Note] WSREP: GCache DEBUG: RingBuffer::recover(): found 4/514 locked buffers
2022-03-13 20:29:09 0 [Note] WSREP: GCache DEBUG: RingBuffer::recover(): free space: 1073490592/1073741824
2022-03-13 20:29:09 0 [Note] WSREP: GCache::RingBuffer unused buffers scan...100.0% (252128/252128 bytes) complete.
2022-03-13 20:29:09 0 [Warning] WSREP: Option 'gcs.fc_master_slave' is deprecated and will be removed in the future versions, please use 'gcs.fc_single_primary' instead.
2022-03-13 20:29:09 0 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = 172.28.22.48; base_port = 4567; cert.log_conflicts = no; cert.optimistic_pa = yes; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.keep_plaintext_size = 128M; gcache.mem_size = 0; gcache.name = galera.cache; gcache.page_size = 128M; gcache.recover = yes; gcache.size = 1024M; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.fc_single_primary = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0
2022-03-13 20:29:09 0 [Note] WSREP: Service thread queue flushed.
2022-03-13 20:29:09 0 [Note] WSREP: ####### Assign initial position for certification: 08d993a0-1adc-11ec-a16e-1fb4d5c71ffa:25375, protocol version: -1
2022-03-13 20:29:09 0 [Note] WSREP: Start replication
2022-03-13 20:29:09 0 [Note] WSREP: Connecting with bootstrap option: 0
2022-03-13 20:29:09 0 [Note] WSREP: Setting GCS initial position to 08d993a0-1adc-11ec-a16e-1fb4d5c71ffa:25375
2022-03-13 20:29:09 0 [Note] WSREP: protonet asio version 0
2022-03-13 20:29:09 0 [Note] WSREP: Using CRC-32C for message checksums.
2022-03-13 20:29:09 0 [Note] WSREP: backend: asio
2022-03-13 20:29:09 0 [Note] WSREP: gcomm thread scheduling priority set to other:0
2022-03-13 20:29:09 0 [Note] WSREP: restore pc from disk successfully
2022-03-13 20:29:09 0 [Note] WSREP: GMCast version 0
2022-03-13 20:29:09 0 [Note] WSREP: (4477203e-a55f, 'ssl://0.0.0.0:4567') listening at ssl://0.0.0.0:4567
2022-03-13 20:29:09 0 [Note] WSREP: (4477203e-a55f, 'ssl://0.0.0.0:4567') multicast: , ttl: 1
2022-03-13 20:29:09 0 [Note] WSREP: EVS version 1
2022-03-13 20:29:09 0 [Note] WSREP: gcomm: connecting to group 'DMZ', peer '172.28.22.48:,172.28.22.195:,172.28.22.147:'
2022-03-13 20:29:09 0 [Note] WSREP: (4477203e-a55f, 'ssl://0.0.0.0:4567') Found matching local endpoint for a connection, blacklisting address ssl://172.28.22.48:4567
2022-03-13 20:29:12 0 [Note] WSREP: (4477203e-a55f, 'ssl://0.0.0.0:4567') connection to peer 00000000-0000 with addr ssl://172.28.22.147:4567 timed out, no messages seen in PT3S, socket stats: rtt: 160 rttvar: 85 rto: 201000 lost: 0 last_data_recv: 299484683 cwnd: 10 last_queued_since: 299784682986172 last_delivered_since: 299784682986172 send_queue_length: 0 send_queue_bytes: 0
2022-03-13 20:29:12 0 [Note] WSREP: EVS version upgrade 0 -> 1
2022-03-13 20:29:12 0 [Note] WSREP: PC protocol upgrade 0 -> 1
2022-03-13 20:29:12 0 [Warning] WSREP: no nodes coming from prim view, prim not possible
2022-03-13 20:29:12 0 [Note] WSREP: view(view_id(NON_PRIM,4477203e-a55f,32) memb {
4477203e-a55f,0
} joined {
} left {
} partitioned {
})
2022-03-13 20:29:12 0 [Note] WSREP: gcomm: connected
2022-03-13 20:29:12 0 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
2022-03-13 20:29:12 0 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
2022-03-13 20:29:12 0 [Note] WSREP: Opened channel 'DMZ'
2022-03-13 20:29:12 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2022-03-13 20:29:12 0 [Note] WSREP: Flow-control interval: [16, 16]
2022-03-13 20:29:12 0 [Note] WSREP: Received NON-PRIMARY.
2022-03-13 20:29:12 1 [Note] WSREP: Starting rollbacker thread 1
2022-03-13 20:29:12 2 [Note] WSREP: Starting applier thread 2
2022-03-13 20:29:12 2 [Note] WSREP: ================================================
View:
id: 08d993a0-1adc-11ec-a16e-1fb4d5c71ffa:25375
status: non-primary
protocol_version: -1
capabilities:
final: no
own_index: 0
members(1):
0: 4477203e-a052-11ec-a55f-9fc08735e142, gedadvl81.a.space.corp
=================================================
2022-03-13 20:29:12 2 [Note] WSREP: Non-primary view
2022-03-13 20:29:12 2 [Note] WSREP: Server status change disconnected -> connected
2022-03-13 20:29:12 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2022-03-13 20:29:12 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2022-03-13 20:29:13 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT3.50176S), skipping check
2022-03-13 20:29:21 0 [Note] WSREP: (4477203e-a55f, 'ssl://0.0.0.0:4567') connection established to 879bc99e-9b9f ssl://172.28.22.195:4567
2022-03-13 20:29:21 0 [Note] WSREP: (4477203e-a55f, 'ssl://0.0.0.0:4567') turning message relay requesting on, nonlive peers:
2022-03-13 20:29:21 0 [Note] WSREP: (4477203e-a55f, 'ssl://0.0.0.0:4567') connection established to 879bc99e-9b9f ssl://172.28.22.195:4567
2022-03-13 20:29:21 0 [Note] WSREP: declaring 879bc99e-9b9f at ssl://172.28.22.195:4567 stable
2022-03-13 20:29:21 0 [Warning] WSREP: no nodes coming from prim view, prim not possible
2022-03-13 20:29:21 0 [Note] WSREP: view(view_id(NON_PRIM,4477203e-a55f,33) memb {
4477203e-a55f,0
879bc99e-9b9f,0
} joined {
} left {
} partitioned {
})
2022-03-13 20:29:21 0 [Warning] WSREP: node uuid: 879bc99e-9b9f last_prim(type: 2, uuid: 807b1f4a-9563) is inconsistent to restored view(type: V_NON_PRIM, uuid: 4477203e-a55e
2022-03-13 20:29:21 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 2
2022-03-13 20:29:21 0 [Note] WSREP: Flow-control interval: [23, 23]
2022-03-13 20:29:21 0 [Note] WSREP: Received NON-PRIMARY.
2022-03-13 20:29:21 2 [Note] WSREP: ================================================
View:
id: 08d993a0-1adc-11ec-a16e-1fb4d5c71ffa:25375
protocol_version: -1
capabilities:
final: no
own_index: 0
members(2):
0: 4477203e-a052-11ec-a55f-9fc08735e142, gedadvl81.a.space.corp
1: 879bc99e-a053-11ec-9b9f-7738826a1738, unspecified
=================================================
2022-03-13 20:29:21 2 [Note] WSREP: Non-primary view
2022-03-13 20:29:21 2 [Note] WSREP: Server status change connected -> connected
2022-03-13 20:29:21 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2022-03-13 20:29:21 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2022-03-13 20:29:23 0 [Note] WSREP: (4477203e-a55f, 'ssl://0.0.0.0:4567') connection established to 807b1f4a-9564 ssl://172.28.22.147:4567
2022-03-13 20:29:23 0 [Note] WSREP: declaring 807b1f4a-9564 at ssl://172.28.22.147:4567 stable
2022-03-13 20:29:23 0 [Note] WSREP: declaring 879bc99e-9b9f at ssl://172.28.22.195:4567 stable
2022-03-13 20:29:23 0 [Warning] WSREP: no nodes coming from prim view, prim not possible
2022-03-13 20:29:23 0 [Note] WSREP: view(view_id(NON_PRIM,4477203e-a55f,34) memb {
4477203e-a55f,0
807b1f4a-9564,0
879bc99e-9b9f,0
} joined {
} left {
} partitioned {
})
2022-03-13 20:29:23 0 [Warning] WSREP: node uuid: 807b1f4a-9564 last_prim(type: 2, uuid: 807b1f4a-9563) is inconsistent to restored view(type: V_NON_PRIM, uuid: 4477203e-a55e
2022-03-13 20:29:23 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 3
2022-03-13 20:29:23 0 [Note] WSREP: Flow-control interval: [28, 28]
2022-03-13 20:29:23 0 [Note] WSREP: Received NON-PRIMARY.
2022-03-13 20:29:23 2 [Note] WSREP: ================================================
View:
id: 08d993a0-1adc-11ec-a16e-1fb4d5c71ffa:25375
status: non-primary
protocol_version: -1
capabilities:
final: no
own_index: 0
members(3):
0: 4477203e-a052-11ec-a55f-9fc08735e142, gedadvl81.a.space.corp
1: 807b1f4a-a052-11ec-9564-ff2f93ba9bc1, unspecified
2: 879bc99e-a053-11ec-9b9f-7738826a1738, unspecified
=================================================
2022-03-13 20:29:23 2 [Note] WSREP: Non-primary view
2022-03-13 20:29:23 2 [Note] WSREP: Server status change connected -> connected
2022-03-13 20:29:23 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2022-03-13 20:29:23 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2022-03-13 20:29:26 0 [Note] WSREP: (4477203e-a55f, 'ssl://0.0.0.0:4567') turning message relay requesting off
2022-03-13 20:29:36 0 [Warning] WSREP: Handshake failed: no shared cipher
2022-03-13 20:29:36 0 [Warning] WSREP: Handshake failed: peer did not return a certificate
2022-03-13 20:29:37 0 [Warning] WSREP: Handshake failed: peer did not return a certificate
2022-03-13 20:29:38 0 [Warning] WSREP: Handshake failed: no shared cipher
2022-03-13 20:29:38 0 [Warning] WSREP: Handshake failed: no shared cipher
2022-03-13 20:29:38 0 [Warning] WSREP: Handshake failed: no shared cipher
2022-03-13 20:29:38 0 [Warning] WSREP: Handshake failed: no shared cipher
2022-03-13 20:29:43 0 [Warning] WSREP: Handshake failed: no shared cipher
2022-03-13 20:29:44 0 [Warning] WSREP: Handshake failed: no shared cipher
2022-03-13 20:29:50 0 [Warning] WSREP: Handshake failed: no shared cipher
2022-03-13 20:29:52 0 [Warning] WSREP: Handshake failed: no shared cipher
2022-03-13 20:30:16 0 [Warning] WSREP: Handshake failed: peer did not return a certificate
status: non-primary
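The handshake warnings at the end of the log are easier to judge when counted by reason. The sketch below assumes standard grep, sed, sort and uniq; the function name `handshake_summary` and the log path passed to it are placeholders, not taken from this setup.

```shell
# Count WSREP TLS handshake failures per reason in a MariaDB error log.
# $1 is the path to the error log (placeholder; use the node's actual log file).
handshake_summary() {
    grep -F 'WSREP: Handshake failed:' "$1" \
        | sed 's/.*Handshake failed: //' \
        | sort | uniq -c | sort -rn
}

# Example call (hypothetical path):
# handshake_summary /path/to/mysqld-error.log
```

Run against the excerpt above, this would report 9 occurrences of "no shared cipher" and 3 of "peer did not return a certificate".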
The cluster had been running fine for over a year. I really don't know what is going on.
Any help is highly appreciated.
-
Hi Duncan,
thanks for pointing me in that direction. I find it hard to believe that I'm affected by a bug that was reported in 2020. I'm running the following version of MariaDB:
$ rpm -qa | grep -i maria
MariaDB-compat-10.5.15-1.el7.centos.x86_64
MariaDB-common-10.5.15-1.el7.centos.x86_64
MariaDB-client-10.5.15-1.el7.centos.x86_64
MariaDB-server-10.5.15-1.el7.centos.x86_64
MariaDB-backup-10.5.15-1.el7.centos.x86_64
The solution in the bug report seems to be installing 10.5.9. As my version is higher, this is either a new bug or a bug that has been reintroduced. :) I will raise a new bug report.
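The "my version is higher" comparison is easy to get wrong with a plain string comparison, where "10.5.15" sorts before "10.5.9". A small sketch using GNU sort's version ordering (`sort -V`, available on CentOS 7); the helper name `version_gt` is only illustrative.

```shell
# Succeed if version $1 is strictly newer than version $2.
# Relies on GNU sort -V, which compares dotted version numbers numerically.
version_gt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n' "$1" "$2" | sort -V | tail -n1)" = "$1" ]
}

if version_gt 10.5.15 10.5.9; then
    echo "10.5.15 is newer than 10.5.9"
fi
```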
Thanks and Best Regards,
Oliver
-
Hi,
I just realized that MariaDB 10.6 and even newer releases are already available. I'm running CC 1.9.2, which supports MariaDB 10.6. Is there an upgrade guide available, or can I simply use the one from mariadb.com?
https://mariadb.com/kb/en/upgrading-from-mariadb-105-to-mariadb-106/
Best Regards,
Oliver