Problem with cluster, can't recover it
Hello! I have a cluster with 4 nodes. It was working well, but today it got stuck with many INSERT queries waiting on some process. I tried to kill all processes on each node of the cluster, but their status just changed to "killed".
I didn't know how to solve it, so I stopped all nodes. After that I started recovering the cluster by bootstrapping from one node.
But it stopped on this job: https://www.evernote.com/shard/s115/sh/a4feddb5-0f21-41d7-a79a-faee1c4e7ed6/XGTMT1Ub4Bbep06_9SlF44DgbzkPcojRuA1F3aX-MxYXwa-PtqujI8fNzw
-
Official comment
Hello,
It is hard to tell what might be causing this issue. If you encounter any kind of problem, the best way to get help is to create a ticket via our support portal (support.severalnines.com). All trial environments are entitled to support. When opening a ticket, please make sure that you include an error report. You can generate it by running
s9s-error-reporter -i 0
on the ClusterControl instance.
Thanks,
Krzysztof
-
It looks like your cluster got stuck due to heavy insert queries and the recovery process didn’t complete after stopping all nodes. Since the bootstrap is hanging, it's likely a metadata or lock issue. Try the following:
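Manually clean up stale temporary or lock files on all nodes. Be careful with /var/lib/mysql/grastate.dat: it is Galera's saved state rather than a lock file, and removing it (rm -f /var/lib/mysql/grastate.dat) forces that node to take a full SST when it rejoins, so keep it on the node you bootstrap from.
Ensure only one node boots with --wsrep-new-cluster; the others should start normally.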
Check logs (/var/log/mysql/error.log) on the bootstrap node for blocking errors.
If it is still stuck, restore from a recent backup or contact support for your cluster management tool.
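For the log check above, a rough Python sketch like the following can narrow a large error log down to the lines worth reading first. The patterns it searches for ([ERROR] and WSREP) and the function name are my own assumptions for illustration, not an exhaustive or official list.

```python
import re

# Markers that usually flag fatal or Galera-related lines in a MySQL error log.
# This pattern is an assumption for illustration; extend it as needed.
PATTERN = re.compile(r"\[ERROR\]|WSREP", re.IGNORECASE)


def suspicious_lines(log_text, limit=20):
    """Return the last `limit` lines that mention [ERROR] or WSREP."""
    hits = [line for line in log_text.splitlines() if PATTERN.search(line)]
    return hits[-limit:]
```

You would read /var/log/mysql/error.log (the path differs between distributions) on the bootstrap node, pass its contents to suspicious_lines(), and start from the last few hits.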
-
It sounds like a really stressful situation - when a cluster locks up during heavy insert activity, it can be tricky to recover cleanly. A few things you might want to double-check:
Make sure the node you used for the bootstrap was truly the most up-to-date one. If the cluster was stuck mid-transaction, an out-of-sync node can cause further issues during recovery.
Check grastate.dat and the MySQL error logs on all nodes. The error logs usually give a good hint about what caused the initial stall, and grastate.dat shows each node's last saved position.
If the transactions were waiting on a specific lock or long-running query, clearing that and then doing a graceful shutdown often prevents this kind of situation.
After the bootstrap, verify that each node joins via SST or IST and reports no errors during the Galera sync stages.
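To make the "most up-to-date node" check concrete, here is a minimal Python sketch that compares the seqno values from each node's grastate.dat. It assumes you have already copied the file contents from every node; the function names and node names are hypothetical. It deliberately refuses to pick a node if any seqno is -1, because after an unclean shutdown the saved position is unknown and has to be recovered first (for example with mysqld --wsrep-recover).

```python
def parse_grastate(text):
    """Parse the 'key: value' lines of a grastate.dat file into a dict."""
    state = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header comment
        key, _, value = line.partition(":")
        state[key.strip()] = value.strip()
    return state


def pick_bootstrap_candidate(states):
    """Given {node_name: grastate.dat text}, return the node with the highest seqno.

    Returns None if any node has an unknown position (seqno -1 or missing),
    since that node could be ahead of the others.
    """
    seqnos = {}
    for node, text in states.items():
        seqno = int(parse_grastate(text).get("seqno", "-1"))
        if seqno < 0:
            return None  # recover the real position before comparing
        seqnos[node] = seqno
    return max(seqnos, key=seqnos.get) if seqnos else None
```

This only compares seqno; it does not check that the cluster uuid matches across nodes, which is worth confirming by eye.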
Hopefully with a clean bootstrap and proper state transfer you can bring everything back online smoothly. If you can share any specific log messages, people here might be able to point to the exact cause.
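Once the nodes are back, a quick way to confirm that everything really is online is to look at the wsrep status variables on each node (SHOW GLOBAL STATUS LIKE 'wsrep_%'). The Python sketch below assumes you have collected that output into a dict per node; how you fetch it is up to you, and the function name is hypothetical.

```python
def node_is_healthy(status, expected_size):
    """Check a dict of wsrep_* status values for a fully joined node.

    `status` maps variable names to their string values as returned by
    SHOW GLOBAL STATUS; `expected_size` is the number of nodes in the cluster.
    """
    return (
        status.get("wsrep_cluster_status") == "Primary"
        and status.get("wsrep_local_state_comment") == "Synced"
        and status.get("wsrep_ready") == "ON"
        and int(status.get("wsrep_cluster_size", 0)) == expected_size
    )
```

For the 4-node cluster in the original post you would call node_is_healthy(status, 4) for every node; a node that is still a donor or joiner, or that sees fewer members, comes back as False.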