Problem with cluster and can't recover it

Comments


  • Official comment
    Krzysztof Ksiazek

    Hello,

    It is hard to tell what might be causing this issue. If you encounter any problems, the best way to get help is to create a ticket via our support portal (support.severalnines.com). All trial environments are entitled to support. When you open a ticket, please make sure to include an error report. You can generate it by running

    s9s-error-reporter -i 0 

    on the ClusterControl instance.

    Thanks,
    Krzysztof

  • Talha Zulfiqar

    It looks like your cluster got stuck due to heavy insert queries and the recovery process didn’t complete after stopping all nodes. Since the bootstrap is hanging, it's likely a metadata or lock issue. Try the following:

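    Check each node for stale temporary or lock files. Do not treat /var/lib/mysql/grastate.dat as one of them: it is Galera's saved state file, and removing it forces that node to rejoin through a full SST, so never delete it on the node you plan to bootstrap from.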

    Ensure only one node boots with --wsrep-new-cluster; the others should start normally.

    Check logs (/var/log/mysql/error.log) on the bootstrap node for blocking errors.

    If it is still stuck, restore from a recent backup or contact support for your cluster management tool.
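
    As a rough aid for the log check mentioned above, the sketch below filters an error log down to Galera-related errors and warnings. The function name wsrep_problems is made up for illustration, and the matching assumes that lines carry a "WSREP" marker and an "[ERROR]" or "[Warning]" tag; the exact tags vary between MySQL, MariaDB and Percona versions, so adjust them to what your log actually contains.

```python
from typing import Iterable, List


def wsrep_problems(lines: Iterable[str]) -> List[str]:
    """Return lines that mention WSREP and are tagged as errors or warnings.

    Assumption: the log uses "[ERROR]" / "[Warning]" tags and a "WSREP"
    marker. Other versions may tag Galera messages differently.
    """
    hits: List[str] = []
    for line in lines:
        is_galera = "WSREP" in line
        is_problem = "[ERROR]" in line or "[Warning]" in line
        if is_galera and is_problem:
            hits.append(line.rstrip("\n"))
    return hits
```

    To use it, open the error log on the bootstrap node (for example /var/log/mysql/error.log), pass the file object to wsrep_problems, and print the returned lines.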

  • Louis Barlow
    Louis Barlow

    It sounds like a really stressful situation - when a cluster locks up during heavy insert activity, it can be tricky to recover cleanly. A few things you might want to double-check:

    Make sure the node you used for the bootstrap was truly the most up-to-date one. If the cluster was stuck mid-transaction, an out-of-sync node can cause further issues during recovery.

    Check the MySQL error logs and the grastate.dat state file on all nodes; they usually give a good hint about what caused the initial stall.

    If the transactions were waiting on a specific lock or long-running query, clearing that and then doing a graceful shutdown often prevents this kind of situation.

    After the bootstrap, verify that each node joins via SST or IST and reports no errors during the Galera sync stages.

    Hopefully with a clean bootstrap and proper state transfer you can bring everything back online smoothly. If you can share any specific log messages, people here might be able to point to the exact cause.
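
    One way to judge which node is the most up-to-date is to compare the seqno values saved in each node's grastate.dat. The sketch below assumes you have copied the contents of each file into a string; the helper names parse_grastate and pick_bootstrap_node are hypothetical. A seqno of -1 means the node did not record its position on shutdown, and in that case the sketch refuses to choose, because the position first has to be recovered (for example with mysqld --wsrep-recover).

```python
from typing import Dict, Optional


def parse_grastate(text: str) -> Dict[str, str]:
    """Parse the 'key: value' lines of a grastate.dat file into a dict."""
    state: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header comment
        key, _, value = line.partition(":")
        state[key.strip()] = value.strip()
    return state


def pick_bootstrap_node(states: Dict[str, Dict[str, str]]) -> Optional[str]:
    """Return the node name with the highest seqno.

    Returns None when there are no nodes or when any node reports a
    negative (unknown) seqno, since the comparison is not reliable then.
    """
    seqnos = {node: int(s.get("seqno", "-1")) for node, s in states.items()}
    if not seqnos or any(n < 0 for n in seqnos.values()):
        return None
    return max(seqnos, key=lambda node: seqnos[node])
```

    This only compares positions; it does not check that all nodes share the same cluster uuid, which is worth confirming by eye before bootstrapping.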

  • Bertie Liam
    Bertie Liam

    Hey CoolMan, this sounds like a classic case of stuck queries holding locks. Bootstrapping from a single node is fine, but make sure that node has the latest data and bring the other nodes up one at a time. Also, check the logs for any conflicts before fully restarting the cluster.

