Alex Reid
asked this on November 16, 2011 18:51
So, this has happened several times now.
My cluster layout is like this:
1 control - m1.large
4 cluster db - m1.xlarge
I started with the Galera Cluster Configurator and the stock Ubuntu 11.04 Natty 64-bit AMI on these nodes. Got them set up after some fiddling, everything looked great.. then I went to import my existing MySQL dataset. Sometime in the middle of the night (Monday night), the MySQLd process on the node I was importing data on crashed underneath me. When I logged into it again with a MySQL client, the wsrep status was OFF, and the cluster as a whole had fallen apart.
Since I was not using this in production yet, I figured I'd just start over with CentOS. I had a few little hitches with the Galera Cluster Configurator on Ubuntu, and thought it was likely more mature on CentOS.. plus, the Ubuntu SysV init scripts for MySQL were a bit hinky. So, stock CentOS 5.6 64Bit, the Rightscale provided image, same server layout / sizes..
And the same thing happened. Here's my description in a little more detail that I sent to severalnines support:
"While doing a large import on my #2 node in a screen session, the mysqld server crashed. When it automatically was restarted by mysqld_safe, it was unreachable by a mysql client on the localhost .. the client just hung, without even reaching authentication. After a service restart, I am able to connect. I did a show status and got this for the wsrep, but then the mysqld server immediately crashed again, i.e. show status worked, then I got "MySQL server has gone away""
That is:
But, because I had split up my data this time before importing, I noted that
1) It was my largest (in rowcount) table that it was in the midst of importing while it crashed (157 Million rows)
2) This table is partitioned. I'll post the schema here: (Table and column names changed for privacy's sake)
delimiter $$
CREATE TABLE `dgs` (
`d_id` bigint(20) unsigned NOT NULL,
`g_id` int(10) unsigned NOT NULL,
`state` tinyint(1) unsigned NOT NULL,
`p_id` int(10) unsigned DEFAULT NULL,
`c_id` int(10) unsigned DEFAULT NULL,
`o` tinyint(1) unsigned DEFAULT NULL,
`type` enum('b','c','i') DEFAULT NULL,
`ca_id` int(10) unsigned DEFAULT NULL,
`a_type` tinyint(3) unsigned DEFAULT NULL,
`timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`d_id`,`g_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
/*!50100 PARTITION BY KEY (d_id)
PARTITIONS 10 */$$
So now I have one node in the state above, while my other 3 are not being good Galera cluster nodes either..
Node 1
| wsrep_local_state | 0 |
| wsrep_local_state_comment | Initialized (0) |
| wsrep_connected | OFF |
| wsrep_local_index | 18446744073709551615 |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy <info@codership.com> |
| wsrep_provider_version | 21.1.0(r86) |
| wsrep_ready | OFF |
+------------------------------------------+--------------------------------------+
Node 3
| wsrep_local_state | 4 |
| wsrep_local_state_comment | Synced (6) |
| wsrep_cert_index_size | 0 |
| wsrep_cluster_conf_id | 14 |
| wsrep_cluster_size | 2 |
| wsrep_cluster_state_uuid | b6b0ac9a-0fd7-11e1-0800-02efcec769b8 |
| wsrep_cluster_status | Primary |
| wsrep_connected | ON |
| wsrep_local_index | 1 |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy <info@codership.com> |
| wsrep_provider_version | 21.1.0(r86) |
| wsrep_ready | ON |
+------------------------------------------+--------------------------------------+
Node 4
| wsrep_local_state | 2 |
| wsrep_local_state_comment | Donor (+) |
| wsrep_cert_index_size | 0 |
| wsrep_cluster_conf_id | 12 |
| wsrep_cluster_size | 4 |
| wsrep_cluster_state_uuid | b6b0ac9a-0fd7-11e1-0800-02efcec769b8 |
| wsrep_cluster_status | Primary |
| wsrep_connected | ON |
| wsrep_local_index | 2 |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy <info@codership.com> |
| wsrep_provider_version | 21.1.0(r86) |
| wsrep_ready | ON |
+------------------------------------------+--------------------------------------+
I'm also attaching the entire MySQL log for the node that fell over.
I really want to move forward with Galera as my MySQL HA solution, but I can't move forward without this data intact.
Comments
Hm. Actually it looks like it only made it half-way through importing this table, not the one I posted above, at least according to the only node that I can still query (because it considers itself synced)
delimiter $$
CREATE TABLE `dvs` (
`d_id` bigint(20) unsigned NOT NULL,
`d_token` binary(20) NOT NULL,
PRIMARY KEY (`d_id`),
UNIQUE KEY `d_token` (`d_token`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8$$
Hi,
Looking at your logs, I could not help noticing:
111115 17:17:31 InnoDB: Initializing buffer pool, size = 10.5G
...
111115 17:17:32 InnoDB: Log file ./ib_logfile0 did not exist: new to be created
InnoDB: Setting log file ./ib_logfile0 size to 5 MB
I think this is a serious misconfiguration: total size of ib_logfiles should be around 20-25% of the buffer pool size and definitely exceed the biggest transaction size, which in this case could easily exceed 5M.
Before continuing, I'd strongly suggest doing the following on all of the nodes:
After that you most likely will need to rejoin the nodes into cluster:
Log in as root into one of the nodes and do
mysql> SET GLOBAL wsrep_cluster_address='gcomm://';
- this shall bootstrap the cluster.
The on the remaining nodes do:
mysql> SET GLOBAL wsrep_cluster_address='gcomm://<first node IP>';
Regards,
Alex
Yeah, since I didn't really know what I was getting into with Galera, I did not apply my normal InnoDB configuration settings to the my.cnf that was generated by the SeveralNines configurator.
I did as you suggested on the first node in the cluster. When executing "SET GLOBAL wsrep_cluster_address='gcomm://';", the mysqld process underneath my mysql client session crashed again.
I think I'm going to re-bootstrap the cluster with the new my.cnf and try the offending data import again. In the meanwhile, I've attached the error log -- this is from node 1, which was not the node I was performing the import on, just the primary.
On a last gasp attempt before I re-bootstrap, I attempted this once more. I'll post back if the gaffe in the innodb log file size fixes things.
Is there an appropriate setting for the Galera cache, or anything else like that?
mysql> SET GLOBAL wsrep_cluster_address='gcomm://';
Query OK, 0 rows affected (2.51 sec)
mysql> show status;
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...
ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 0
ERROR:
Can't connect to the server
mysql> show status;
No connection. Trying to reconnect...
Alright, with a re-bootstrap of the cluster, it's made it through the table it crashed on the import of previously and is continuing to import data smoothly.
Thanks,
Alex
Ok,about "crashing" - I was assuming that you have empty wsrep_cluster_address on startup (I'm not fully fluent with what ClusterControl is doing there). So it was 'gcomm://' already and "SET GLOBAL..." was unnecessary. And there is a bug/feature that does not allow doing that second time. Guess it is a bug.
About GCache. Majority of Galera parameters are set via wsrep_provider_options variable. You can see a list here: http://www.codership.com/wiki/doku.php?id=galera_parameters
The is no hard requirement on GCache size (it grows as it needs). In the future though it may make sense to make gcache.size comparable to the size of the database.
Regards,
Alex
Ok, so we got much further with more sensible innodb log sizes, but a deadlock happened, killing the import. When I attempted to clean one of the tables and start a fresh import, I think the same issue happened again.
I'm not certain as to why it was unable to connect to the primary node (ip 10.176.215.163). That was the only node that thought its Galera status was ON/OK.
Looking at the log, this particular node lost connectivity to the rest of the cluster at around 2:42:00. Anything specific happened at that point? Also it looks like 10.176.215.163 was restarted then, which might be the reason for connectivity loss. It would also be helpful to see the logs from 10.176.215.163 and the rest of the nodes. My suspicion is that they have crashed simultaneously at 2:42 due to a bug.
Were you doing the import on this node and the rest were "slaves" to it?
So the import was being done to the #2 node, the #1 node is the primary. I have the full set of logs from the most recent crash (I nuked the cluster and started over again last night) here, as well as a my.cnf (they're identical).
The ips and hostnames line up like this:
10.176.215.163: db1
10.176.247.97: db2
10.174.57.115: db3
10.174.65.69: db4
Thanks for the logs, they tell a story of a tragic beauty:
Everything went well until at around 14:40-14:50 slaves (db1, db3, db4) started to experience some performance problems, most likely IO storms (and that must be swapping - is it possible to check memory usage history in Cluster Control? If the nodes are still alive, could you check memory usage?). The nodes lost responsiveness and cluster was falling apart, and then mysqld_safe nailed it by restarting db1.
Looking at how synchronously slaves started to loose it suggests a bug in the slave code. We'll try to reproduce it. I'm not sure if there is a workaround for it.
So, I set swappiness to 10 on these database servers. The CPU use was pretty steady -- about 30%, and no swap use. What I found was super interesting was this network spike, which was tied to an insert spike..
Which I find also curious, as this is how I was moving the data, and I'd presume this was regular insert behavior:
mysqldump -hOLD_DB_SERVER --extended-insert --quick client --tables dgs d d_i | mysql -pXXXXXXXX client
I've attached the graphs and the latest log -- this time I was in fact doing the data import into the "master node". That part does not seem to matter so much. We're adding some more detailed monitoring on this end so that we can capture more detail than Cluster Control provides -- as far as I can find, it does not provide detailed memory usage graphs, at least in the dashboard.
Out of curiosity, any idea why all these temp tables are being created in memory? And whether Galera has anything to do with creating them?
Hello! We have identified this issue as related to temp table usage. There are over 60mm temp table creates noted in cluster control stats, however, none of these are disk based temp tables.
The source of the failure is then OOM killer taking out mysqld.
Two questions:
- What is Galera using so many temp tables for? (or is this a misleading statistic?)
- Can we force temp table usage to go to disk? Will this impact Galera negatively?
Thanks!
Hi,
That's strange.
So with that in mind, I can suspect that there is some bug related to either processing of Cluster Control queries or processing of specific tables in the dump. Can I suggest to retry the import with control node turned off (shut down)? And if that fails, could you provide imported db schema?
And one more, I noticed that you have query cache enabled in my.cnf. It is at the moment not supported and may cause adverse effects. Could you please disable it too?
Regards,
Alex
Sorry for the delay (holidays and all)
I turned off the Query cache, and disabled cmon and my other monitoring. It seemed that temp table creation was at a minimum (single digits), but the import still crashed the cluster.
I'm importing 3 (big) tables, but currently it seems that it's been crashing during just one of them (column names have been edited to protect the, er, innocent) :
Table 1 (the biggest in rowcount)
CREATE TABLE `dgs` (
`d_id` bigint(20) unsigned NOT NULL,
`g_id` int(10) unsigned NOT NULL,
`state` tinyint(1) unsigned NOT NULL,
`p_id` int(10) unsigned DEFAULT NULL,
`co_id` int(10) unsigned DEFAULT NULL,
`o` tinyint(1) unsigned DEFAULT NULL,
`i_type` enum('one','two','three') DEFAULT NULL,
`ca_id` int(10) unsigned DEFAULT NULL,
`a_type` tinyint(3) unsigned DEFAULT NULL,
`timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`d_id`,`g_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
/*!50100 PARTITION BY KEY (d_id)
PARTITIONS 10 */
Table 2
CREATE TABLE `d` (
`d_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`d_type` enum('one','two') NOT NULL DEFAULT 'one',
PRIMARY KEY (`d_id`,`d_type`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
Table 3 (the one that it appears causes crashes?)
CREATE TABLE `devices_i` (
`d_id` bigint(20) unsigned NOT NULL,
`d_token` binary(20) NOT NULL,
PRIMARY KEY (`d_id`),
UNIQUE KEY `d_token` (`d_token`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
Is there anything else worth trying? I seem to have a case that will reliably crash a clean Galera cluster every time..
So, which table crashes the cluster, devices_i? Or any three of them can?
I ran into the same issue yesterday. The trouble I encountered is that Galera holds all transactions in memory before committing them. When you attempt a really large transaction, you run out of memory. Really large transactions include really large single statements.
I guess you are using mysqldump to move your data over to Galera, and you're using --extended-insert/-e. Make sure that is disabled with --skip-extended-insert. This will create an insert statement for each record in one table, not one insert statement for all records in one table. Running millions of inserts will take significantly longer, but it worked in my case.
Or you could be using the --single-transaction switch which may also cause the same trouble.
Or you might have wsrep_convert_LOCK_to_trx enabled in Galera. That combined with --add-locks, --lock-tables, or --lock-all-tables could cause trouble. Note that --opt enables --add-locks and --lock-tables.
Note that setting gcs.recv_q_hard_limit won't fix the problem. It only checks the size of the cache _after_ the transaction has been inputted, but you've already ran out of memory before then. I ran into that, too.
It's possible that adding more swap will let the import complete. You can always create a temporary swap file. I'm sure you know how, but for others stumbling across this: dd if=/dev/zero of=/swap_file bs=1G count=16 && mkswap /swap_file && swapon /swap_file.
I hope this helps!
-Mark
Hi everybody.
I'd like to let you know that we have identified at least one source of trouble to be certification index being purged insufficiently frequently and implementing a fix for it to be included in the 1.1 release (few days time).
It is true that not using --single-transaction and --extended-insert might relieve the situation, but naturally it will affect the speed of the dump.
Allegation that transactions are stored in RAM is not correct, they are stored in mmapped disk buffer, and get into RAM only at the moment of applying, so the size of the slave queue does not affect memory usage and it is the size of slave queue that is limited by gcs.recv_q_hard_limit. The nature of limiting is not regulatory, as Mark inferred: if slave queue exceeds the limit, the node will abort as if system runs out of memory. So you don't want to set it to less than the disk space you can give to slave queue allocation.
Regards,
Alex
Alex,
Thank you for the clarification.
I can confirm the issue I was experiencing personally involved the OOM. I had two SSH sessions open to the server, one doing the import, the other watching in top. At the point MySQL got killed, my top SSH session was also killed by the OOM. In my case, I am running Galera on three AWS Micro instances (640 MB RAM). So is it possible that the RAM limits of the instance were exceeded when attempting to apply the too-large transaction? The top session showed there was next to no free RAM available at the time the OOM killer ran. Either way, it's important to note large transactions need lots of memory at some point during their execution.
-Mark
Mark,
I'm quite confident that OOM that you have had was due to certification index buildup. We have identified a test case where certification index could grow up to 1.2Gb. So it certainly can be an issue on machines which are tight on memory. It is true that large transactions lead to this, but only indirectly and only through the number of rows, not through actual amount of data. You can have a 1Gb transaction which RAM footprint in Galera can be 1K. Since transactions are stored in mmapped files, they would not require much RAM even when applying: the processed pages can simply be paged out. (you can see it as a dedicated swap file for transactions)
This certification index buildup is not a leak, but simply a result of cleanup not happening often enough. Experiments show that increasing cleanup frequency solves the issue.
Regards,
Alex
Alex,
Awesome. That makes sense. The tables I was having trouble had small rows, but over a million rows.
I assume the fix will also help when deleting a million rows from a table, too? I just discovered that probably related issue now.
-Mark
Mark,
It depends on what you mean by "helping". Indeed, deleting a million rows from a table is a good example of a key-only transaction that would take a lot of RAM in Galera while containing no data. So timely cert index cleanup can help to handle 10 such transactions if you can handle one. But you probably can't. Not on micro AWS instance. If we accept such transaction, at some point we'll need all those keys in memory, and that could be hundreds of megabytes. So currently transaction size is limited to 128K (wsrep_max_ws_rows) simply out of practical considerations. Dealing with million row transactions requires totally different approach and might be a subject of 1.2 Galera release.
Regards,
Alex
Alex,
Deleting a million rows is not something I'd typically do. The first time I tried it, it crashed Galera (OOM). I ended up deleting 250,000 rows at a time. However, I ran into the certification index issue again (I think). I had to restart all three of the Galera nodes between batches of 250,000 rows to avoid running out of memory (the memory usage doesn't seem to go down). The primary keys were 64 bit ints, if that's relevant.
-Mark
Mark, this is exactly the case of certification index buildup. This is fixed in trunk.
Hi everybody,
We have released new MySQL/Galera versions which should fix that OOM bug (certification index is now cleaned up as soon as it exceeds certain size)
Although the bug is in Galera library, wsrep API has changed so both MySQL and Galera must be updated.
Thanks,
Alex
Sorry, links:
Galera: https://launchpad.net/galera/1.x/22.1.1
MySQL 5.1: https://launchpad.net/codership-mysql/5.1/5.1.59-22.2
MySQL 5.5: https://launchpad.net/codership-mysql/5.5/5.5.17-22.3