Cluster Control stability
Hi,
I'm currently wondering whether it is worth our while to keep ClusterControl, given its poor stability. At the moment I'm a bit disappointed with the product. I know it is open source, offered free of charge and freely available to everyone, and I know this always takes a big effort on the dev side!
Here are some more details to explain my opinion:
I'm using cmon 1.1.30 at the moment, and I have tried to upgrade at every new version:
1 - On every node, I continually get these messages in /var/log/cmon.log:
Jun 20 09:38:11 : (INFO) Checking if there is a MySQL Server running @ 127.0.0.1
Jun 20 09:38:20 : (WARNING) Query select p.pidfile, p.exec_cmd, p.process, p.id as pid, p.hid, p.pgrep_expr from processes p where p.hid=2 and p.active=1 and p.cid=1 failed: Error: Unknown column 'p.pgrep_expr' in 'field list'
I don't know why I get this warning in the first place, and I can't see why this INFO message is there: it is useless IMO, it fills the log, and it is the only thing I see.
2 - Very often, in the nodes' logs and even in the cmon.log of the server hosting the ClusterControl MySQL server and web interface, we have:
Jun 20 09:42:45 : (WARNING) Could not open /proc/573/stat file
Jun 20 09:42:45 : (WARNING) Could not open /proc/647/stat file
Jun 20 09:42:45 : (WARNING) Could not open /proc/883/stat file
Jun 20 09:42:45 : (WARNING) Could not open /proc/936/stat file
In my case, these messages have been there since we started using the tool (so probably from 1.1.16 or 1.1.18, if I remember correctly). Why? I don't know.
3 - Very often, too, I find that the cmon agent on the nodes or on the ClusterControl server is dead: the process is no longer running, and there are no log entries, which makes it impossible to know the cause of this 'crash'.
4 - Graphs/RRD problems:
I have lost count of the times my graphs have stopped working and/or the cron job that generates them has logged weird entries. Below is a sample:
ERROR: /var/lib/cmon//cluster_1_mysql_192.168.0.1|3306_stats.rrd: found extra data on update argument: 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:3009621:0:0:0:0:18:3:21:1:86492:86492:0:0:0:0:93:357:33:3:0:0:0:0:0:0:0:0:0:826393:4:11:38029:0:0:0:0:0:0:4:0:0:4:0:0:2:0:434863:223124655:38134:37751134
ERROR: /var/lib/cmon//cluster_1_mysql_192.168.0.5|3306_stats.rrd: found extra data on update argument: 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:2485179:0:0:0:0:22:3:25:1:88011:88011:0:0:0:0:93:357:33:3:0:0:0:0:0:0:0:0:0:826393:0:0:36750:2:0:0:0:0:0:4:0:0:4:0:0:2:0:443950:229454199:36848:35546852
ERROR: /var/lib/cmon//cluster_1_mysql_192.168.0.10|3306_stats.rrd: found extra data on update argument: 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:8874516:0:0:0:0:31:5:101:0:155268:155268:0:0:0:0:93:357:33:3:0:0:0:0:0:0:0:0:0:826393:1:1:750726:1:0:0:1:0:0:4:0:0:4:0:0:2:0:76965:73500161:751278:380746069
/usr/bin/
ERROR: /var/lib/cmon//cluster_1_stats.rrd: expected 9 data source readings (got 1) from N
5 - Would it be possible to have (or does there already exist) a detailed ChangeLog of what has been changed/upgraded/added/deleted/fixed in every new version, please?
One more question: is there a way to stop ClusterControl from being authorised to stop/start MySQL/Galera on the nodes, or to reduce the set of commands it is authorised to run on remote hosts? I have no visibility into what ClusterControl can do or is doing on the nodes, and I find the product too buggy at the moment, so I would like it to be able only to collect data to feed the web interface, the graphs and the MySQL database, and not to stop/start services on remote nodes.
I'm at your disposal to discuss these points if you need more information, with the aim of improving the product.
Regards,
Laurent
-
Hi Laurent,
1. Regarding:
- Jun 20 09:38:11 : (INFO) Checking if there is a MySQL Server running @ 127.0.0.1
- Jun 20 09:38:20 : (WARNING) Query select p.pidfile, p.exec_cmd, p.process, p.id as pid, p.hid, p.pgrep_expr from processes p where p.hid=2 and p.active=1 and p.cid=1 failed: Error: Unknown column 'p.pgrep_expr' in 'field list'
The warning is strange. Did you upgrade from an earlier version or roll out 1.1.30 from the Configurator?
If you upgraded, did you apply the cmon_db-1.1.30.sql schema file as a part of the upgrade?
Download:
wget -O cmon_db-1.1.30.sql http://www.severalnines.com/downloads/cmon/cmon_db-1.1.30.sql
and do:
mysql -ucmon -p -h127.0.0.1 cmon < cmon_db-1.1.30.sql
As for the (INFO) log message, sure, that is a bit redundant.
2. Regarding Jun 20 09:42:45 : (WARNING) Could not open /proc/573/stat file
The agents and the controllers build up a list of which pids are running and then iterate through the list. By that point, a short-lived process may have died. We can remove this printout, as it really does not add anything.
3. We are aware of some crashing bugs, and fixes are on the way. We will also be adding more traceability.
4. Phew, RRD is problematic and we are looking at replacing it.
I don't recognize the problem you describe here; this is the first time I have seen it.
If you do
rm -rf /var/lib/cmon/*
so that the rrd databases will be initialized again, do you still see the same printouts?
5. Agreed, we need to improve on this.
6. Regarding disabling Galera recovery in ClusterControl, we can look at adding this option.
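To illustrate point 2 above: a pid list built from /proc can go stale before it is walked, so a missing stat file need not be an error. Below is a minimal shell sketch of reading such a list while tolerating that race. The function name `read_stats` and the `PROC_ROOT` variable are made up for this example and are not part of cmon.

```shell
# Hypothetical helper, not cmon code: print the stat line of every pid
# given as an argument, skipping pids that have exited in the meantime.
# PROC_ROOT is an assumption added so the sketch can be pointed at a test directory.
PROC_ROOT=${PROC_ROOT:-/proc}

read_stats() {
    for pid in "$@"; do
        # A short-lived process may have died since the list was built;
        # treat that as normal and say nothing.
        [ -r "$PROC_ROOT/$pid/stat" ] || continue
        # The file can still vanish between the check and the read.
        cat "$PROC_ROOT/$pid/stat" 2>/dev/null || true
    done
}
```

Called as `read_stats 573 647 883 936`, it prints one line per pid that is still alive and nothing for the others.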
- At your disposal to discuss about these points if needed for more informations in the aim to improve the product.
We appreciate all the great feedback.
Regarding #1 and #4 above, please feel free to create a support ticket if you want to upload any files or additional information.
We have had a great response regarding Galera, and the adoption has gone very fast considering v1.0 came out in October 2011.
Since people use our tools in pretty much any Linux environment, with different machine and network setups, behind any type of firewall, etc., there are naturally situations where new problems occur that need to be investigated and resolved by our engineering team. We do want to resolve as many cases as we can, so that new users will not experience the same problems. Thanks for helping us find these issues.
Best regards,
Johan
-
Hi Johan,
First, thanks a lot for your fast answer, as usual.
For point 1, you're right: I had not imported this SQL script. After doing so, everything seems to be OK and the warning about the failed query is gone.
I don't think I am using the 'proper' upgrade method. Could you please tell me the recommended way to upgrade from version to version? At the moment, I just untar http://www.severalnines.com/downloads/cmon/cmon-1.1.30-32bit-glibc23-mc70.tar.gz, stop cmon, and repoint the symbolic link from the currently installed version to the new one.
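The symlink switch described above can be sketched roughly as follows. The directory layout, the `current` link name and the function name `switch_version` are assumptions for illustration only. Stopping and starting cmon, and importing the matching cmon_db schema file mentioned earlier in this thread, are deliberately left out.

```shell
# Hypothetical sketch: repoint a "current" symlink at a newly unpacked version.
# Arguments: $1 = directory holding the unpacked versions,
#            $2 = name of the new version's directory inside it.
switch_version() {
    root=$1
    new=$2
    # Refuse to switch to a version that has not been unpacked.
    [ -d "$root/$new" ] || { echo "missing $root/$new" >&2; return 1; }
    # -s symbolic, -f replace the existing link,
    # -n do not descend into the old link if it points at a directory.
    ln -sfn "$new" "$root/current"
}
```

For example, `switch_version /usr/local cmon-1.1.30-32bit-glibc23-mc70`, where /usr/local is only a placeholder for wherever the tarball was unpacked.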
By the way, I noticed that the SQL file you mention is not listed at http://www.severalnines.com/downloads/cmon/, so for newer versions, how will I know whether this script is available?
For points 2 and 3, it is good to know that you are already aware of these problems and are working on them.
About point 4, I removed all files under /var/lib/cmon/ and waited for the next run of the cron job. The error "found extra data on update argument" seems to be gone, but the log still says:
ERROR: /var/lib/cmon//cluster_1_stats.rrd: expected 9 data source readings (got 1) from N
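For reference, both RRD messages read like a mismatch between the number of values handed to `rrdtool update` and the number of data sources defined in the .rrd file: too many in the first case, too few ("got 1") in the second. That reading is an assumption based on the wording of the errors, not something confirmed in this thread. A small sketch that counts the values in an update argument of the usual `timestamp:value:value...` form (`count_readings` is a name made up here):

```shell
# Hypothetical helper: count the readings in an rrdtool-style update argument,
# i.e. the colon-separated fields that follow the leading timestamp ("N" or epoch).
count_readings() {
    case "$1" in
        *:*) ;;                      # at least one field after the timestamp
        *)   echo 0; return 0 ;;     # timestamp only, no readings
    esac
    # Drop everything up to and including the first colon (the timestamp).
    values=${1#*:}
    # Count the remaining colon-separated fields.
    printf '%s\n' "$values" | awk -F: '{ print NF }'
}
```

`count_readings "N:0:0:18"` prints 3. The result can be compared with the data sources that `rrdtool info` lists for the same file.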
At the moment, I do not need to retain history for the graphs because I am still in testing mode, but having to delete all the files, and so lose the history, could be a problem.
Regarding point 6, I think this could be a great improvement.
Again, thanks for your reply, time and support. People often speak up when there are problems or things are not working properly, but rarely report the things that are OK and that they are happy with. So I have to say that I am very satisfied with your reactivity, responsiveness and the quality of your answers, all the more so for open source software. IMO it needed to be said!
Regards,
Laurent
-
Hi Laurent,
We will be including upgrade instructions with the release changelog, so please use those to upgrade.
We believe we have fixed all the issues reported above in 1.1.32; please see http://support.severalnines.com/entries/21633407-released-clustercontrol-v1-1-32. It would be great to hear whether this works better for you.
Thanks again for your feedback, and let us know if there is anything else.
Best regards,
Johan
-
Hi Laurent,
32-bit versions are now up:
http://www.severalnines.com/downloads/cmon/
Are you planning on upgrading to 64-bit arch anytime soon?
Thank you,
Johan