Problem with cluster and can't recover it

CoolMan

January 28, 2025 18:32

Hello! I have a cluster with 4 nodes. It was working well, but today it got stuck with many INSERT queries that were waiting on some process. I tried to kill all processes on each node of the cluster, but their status just changed to Killed.

I didn't know how to solve it, so I stopped all nodes. After that I started a cluster recovery by bootstrapping from one node.

But it stopped on this job: https://www.evernote.com/shard/s115/sh/a4feddb5-0f21-41d7-a79a-faee1c4e7ed6/XGTMT1Ub4Bbep06_9SlF44DgbzkPcojRuA1F3aX-MxYXwa-PtqujI8fNzw

[20:02:52]:

192.168.50.17: Stopping the service mariadb.

[20:02:50]:

192.168.50.17: Stopping MySQL service.

[20:02:49]:

192.168.50.17:3306: Stopping mysqld (timeout=120, force stop after timeout=true).

[20:02:49]:

192.168.50.17:3306: Stopping node.

[20:02:49]:

192.168.50.56:3306: Stopped node.

[20:02:45]:

192.168.50.56: All processes stopped.

[20:02:40]:

192.168.50.56: Stopping the service mariadb.

[20:02:38]:

192.168.50.56: Stopping MySQL service.

[20:02:37]:

192.168.50.56:3306: Stopping mysqld (timeout=120, force stop after timeout=true).

[20:02:37]:

192.168.50.56:3306: Stopping node.

[20:02:37]:

Ensuring cluster is stopped.

[20:02:37]:

Setting cluster_autorecovery: false, node_autorecovery: false settings.

[20:02:37]:

Saving cluster_autorecovery: true, node_autorecovery: true settings.

[20:02:37]:

The keyfile is '/root/.ssh/id_rsa'.

[20:02:37]:

The username is 'root'.

[20:02:37]:

The creds name is 'ssh_cred_cluster_3_6245'.

[20:02:37]:

Cluster ID is 3.

[20:02:37]:

Using SSH credentials from cluster.

[20:02:37]:

CMON version 2.3.0.11519.

Can somebody help resolve this problem?

Comments

4 comments

Official comment
Krzysztof Ksiazek

February 04, 2025 11:17
Hello,

It is hard to tell what might be causing this issue. If you encounter any kind of problem, the best way to get help is to create a ticket via our support portal (support.severalnines.com). All trial environments are entitled to support. When opening a ticket, please make sure that you include an error report. You can generate it by running
```
s9s-error-reporter -i 0 
```
on the ClusterControl instance.

Thanks,
Krzysztof
Talha Zulfiqar

June 20, 2025 06:54
It looks like your cluster got stuck due to heavy insert queries and the recovery process didn’t complete after stopping all nodes. Since the bootstrap is hanging, it's likely a metadata or lock issue. Try the following:

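Manually clean up stale temporary or lock files on all nodes. Be careful with /var/lib/mysql/grastate.dat: it is Galera's saved state file rather than a lock file, and removing it (e.g., rm -f /var/lib/mysql/grastate.dat) forces that node to rejoin through a full SST, so do not remove it on the node you bootstrap from.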

Ensure only one node boots with --wsrep-new-cluster, others start normally.

Check logs (/var/log/mysql/error.log) on the bootstrap node for blocking errors.

If it is still stuck, restore from a recent backup or contact support for your cluster management tool.
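For the log check above, a small filter can help narrow a long error log down to the lines worth reading first. This is only a minimal Python sketch, not a ClusterControl or MariaDB tool: the function name `error_lines` and the default marker `[ERROR]` are assumptions made for illustration, so adjust the marker to whatever severity tag your server actually writes.

```python
from typing import Iterable, List, Tuple


def error_lines(lines: Iterable[str], marker: str = "[ERROR]") -> List[Tuple[int, str]]:
    """Return (line_number, text) for every log line containing `marker`.

    `marker` is an assumption about the log's severity tag; change it to
    match the format your server writes.
    """
    hits: List[Tuple[int, str]] = []
    for number, line in enumerate(lines, start=1):
        if marker in line:
            # Keep the 1-based line number so the hit is easy to find in the file.
            hits.append((number, line.rstrip("\n")))
    return hits
```

On the bootstrap node it could be fed the file directly, for example `error_lines(open("/var/log/mysql/error.log", errors="replace"))`, assuming the log lives at the path mentioned above.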

Louis Barlow

November 14, 2025 05:27
It sounds like a really stressful situation - when a cluster locks up during heavy insert activity, it can be tricky to recover cleanly. A few things you might want to double-check:

Make sure the node you used for the bootstrap was truly the most up-to-date one. If the cluster was stuck mid-transaction, an out-of-sync node can cause further issues during recovery.

Check grastate.dat (Galera's saved state file) and the MySQL error logs on all nodes. Together they usually give a good hint about what caused the initial stall.

If the transactions were waiting on a specific lock or long-running query, clearing that and then doing a graceful shutdown often prevents this kind of situation.

After the bootstrap, verify that each node joins through a completed state transfer (SST or IST) and reports no errors during the Galera sync stages.

Hopefully with a clean bootstrap and proper state transfer you can bring everything back online smoothly. If you can share any specific log messages, people here might be able to point to the exact cause.
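To make the "most up-to-date node" check concrete, here is a minimal Python sketch that compares the `seqno` recorded in each node's grastate.dat. The function names `parse_grastate` and `pick_bootstrap_node` are hypothetical, and the sketch assumes you have already copied the text of each node's file into a dict keyed by host. A `seqno` of -1 means the node did not shut down cleanly, so its position would have to be recovered first (for example with `mysqld --wsrep-recover`) before it can be compared.

```python
from typing import Dict, Optional


def parse_grastate(text: str) -> Dict[str, str]:
    """Parse the 'key: value' lines of a grastate.dat file into a dict."""
    state: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the comment header
        key, _, value = line.partition(":")
        state[key.strip()] = value.strip()
    return state


def pick_bootstrap_node(states: Dict[str, str]) -> Optional[str]:
    """Return the host with the highest recorded seqno.

    Returns None when no node has a usable seqno (all are -1 or missing),
    in which case the positions must be recovered before choosing.
    """
    best_host: Optional[str] = None
    best_seqno = -1
    for host, text in states.items():
        seqno = int(parse_grastate(text).get("seqno", "-1"))
        if seqno > best_seqno:
            best_host, best_seqno = host, seqno
    return best_host
```

The parsed dict also exposes `safe_to_bootstrap`, which is worth reading alongside the result rather than relying on `seqno` alone.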

Bertie Liam

December 05, 2025 06:53
Hey CoolMan, this sounds like a classic case of stuck queries holding locks. Bootstrapping from a single node is fine, but make sure that node has the latest data and bring the other nodes up one at a time. Also, check the logs for any conflicts before fully restarting the cluster.
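The start order described here can be written down as a simple plan before touching any node. The following is a rough Python sketch under stated assumptions: `start_plan` is a hypothetical helper, the nodes run MariaDB under systemd with the service name `mariadb` (as the job log in the question suggests), and the bootstrap is done with MariaDB's `galera_new_cluster` script. It only builds the list of commands; it does not run anything.

```python
from typing import List, Tuple


def start_plan(nodes: List[str], bootstrap_node: str) -> List[Tuple[str, str]]:
    """Return (host, command) pairs: the bootstrap node first, then the rest.

    The commands are assumptions for a systemd-managed MariaDB Galera node;
    the remaining nodes keep the order in which they were passed in.
    """
    if bootstrap_node not in nodes:
        raise ValueError(f"{bootstrap_node} is not in the node list")
    plan = [(bootstrap_node, "galera_new_cluster")]
    plan += [(host, "systemctl start mariadb") for host in nodes if host != bootstrap_node]
    return plan
```

Between steps it makes sense to wait until the node that was just started reports `wsrep_local_state_comment` as `Synced` before starting the next one.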