Having trouble importing data into a Galera cluster
So, this has happened several times now.
My cluster layout is like this:
1 control - m1.large
4 cluster db - m1.xlarge
I started with the Galera Cluster Configurator and the stock Ubuntu 11.04 Natty 64-bit AMI on these nodes. Got them set up after some fiddling, and everything looked great... then I went to import my existing MySQL dataset. Sometime in the middle of the night (Monday night), the mysqld process on the node I was importing data on crashed underneath me. When I logged into it again with a MySQL client, the wsrep status was OFF, and the cluster as a whole had fallen apart.
Since I wasn't using this in production yet, I figured I'd just start over with CentOS. I had a few little hitches with the Galera Cluster Configurator on Ubuntu, and thought it was likely more mature on CentOS... plus, the Ubuntu SysV init scripts for MySQL were a bit hinky. So: stock CentOS 5.6 64-bit, the RightScale-provided image, same server layout and sizes.
And the same thing happened. Here's my description in a little more detail that I sent to severalnines support:
"While doing a large import on my #2 node in a screen session, the mysqld server crashed. When it automatically was restarted by mysqld_safe, it was unreachable by a mysql client on the localhost .. the client just hung, without even reaching authentication. After a service restart, I am able to connect. I did a show status and got this for the wsrep, but then the mysqld server immediately crashed again, i.e. show status worked, then I got "MySQL server has gone away""
That is:
| wsrep_local_state_uuid | 00000000-0000-0000-0000-000000000000 |
But because I had split up my data this time before importing, I noted that:
1) It was my largest table (by row count, 157 million rows) that it was in the middle of importing when it crashed
2) This table is partitioned. I'll post the schema here: (Table and column names changed for privacy's sake)
delimiter $$
CREATE TABLE `dgs` (
`d_id` bigint(20) unsigned NOT NULL,
`g_id` int(10) unsigned NOT NULL,
`state` tinyint(1) unsigned NOT NULL,
`p_id` int(10) unsigned DEFAULT NULL,
`c_id` int(10) unsigned DEFAULT NULL,
`o` tinyint(1) unsigned DEFAULT NULL,
`type` enum('b','c','i') DEFAULT NULL,
`ca_id` int(10) unsigned DEFAULT NULL,
`a_type` tinyint(3) unsigned DEFAULT NULL,
`timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`d_id`,`g_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
/*!50100 PARTITION BY KEY (d_id)
PARTITIONS 10 */$$
So now I have one node in the state above, while my other three are not being good Galera cluster nodes either:
Node 1
| wsrep_local_state | 0 |
| wsrep_local_state_comment | Initialized (0) |
| wsrep_connected | OFF |
| wsrep_local_index | 18446744073709551615 |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy <info@codership.com> |
| wsrep_provider_version | 21.1.0(r86) |
| wsrep_ready | OFF |
+------------------------------------------+--------------------------------------+
Node 3
| wsrep_local_state | 4 |
| wsrep_local_state_comment | Synced (6) |
| wsrep_cert_index_size | 0 |
| wsrep_cluster_conf_id | 14 |
| wsrep_cluster_size | 2 |
| wsrep_cluster_state_uuid | b6b0ac9a-0fd7-11e1-0800-02efcec769b8 |
| wsrep_cluster_status | Primary |
| wsrep_connected | ON |
| wsrep_local_index | 1 |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy <info@codership.com> |
| wsrep_provider_version | 21.1.0(r86) |
| wsrep_ready | ON |
+------------------------------------------+--------------------------------------+
Node 4
| wsrep_local_state | 2 |
| wsrep_local_state_comment | Donor (+) |
| wsrep_cert_index_size | 0 |
| wsrep_cluster_conf_id | 12 |
| wsrep_cluster_size | 4 |
| wsrep_cluster_state_uuid | b6b0ac9a-0fd7-11e1-0800-02efcec769b8 |
| wsrep_cluster_status | Primary |
| wsrep_connected | ON |
| wsrep_local_index | 2 |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy <info@codership.com> |
| wsrep_provider_version | 21.1.0(r86) |
| wsrep_ready | ON |
+------------------------------------------+--------------------------------------+
I'm also attaching the entire MySQL log for the node that fell over.
I really want to move forward with Galera as my MySQL HA solution, but I can't move forward without this data intact.
error.log
-
Hm. Actually, it looks like it only made it halfway through importing this table, not the one I posted above, at least according to the only node I can still query (because it considers itself synced):
delimiter $$
CREATE TABLE `dvs` (
`d_id` bigint(20) unsigned NOT NULL,
`d_token` binary(20) NOT NULL,
PRIMARY KEY (`d_id`),
UNIQUE KEY `d_token` (`d_token`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8$$
-
Hi,
Looking at your logs, I could not help noticing:
111115 17:17:31 InnoDB: Initializing buffer pool, size = 10.5G
...
111115 17:17:32 InnoDB: Log file ./ib_logfile0 did not exist: new to be created
InnoDB: Setting log file ./ib_logfile0 size to 5 MB
I think this is a serious misconfiguration: the total size of the ib_logfiles should be around 20-25% of the buffer pool size, and should definitely exceed the biggest transaction size, which in this case could easily exceed 5 MB.
Before continuing, I'd strongly suggest doing the following on all of the nodes (a consolidated sketch follows this list):
- shut down mysqld
- rm -f /mnt/data/mysql/ib_logfile*
- set innodb_log_file_size=1G in my.cnf
- restart mysqld
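Roughly, on each node that might look like the following (a sketch only; the data directory path is the one from above, and the exact service name may differ on your setup):
# stop mysqld, remove the undersized InnoDB log files, then restart with the new size
service mysql stop
rm -f /mnt/data/mysql/ib_logfile*
# in my.cnf, under [mysqld]:
#   innodb_log_file_size = 1G
service mysql start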
After that you will most likely need to rejoin the nodes into the cluster:
Log in as root on one of the nodes and do
mysql> SET GLOBAL wsrep_cluster_address='gcomm://';
This will bootstrap the cluster.
Then, on the remaining nodes, do:
mysql> SET GLOBAL wsrep_cluster_address='gcomm://<first node IP>';
Regards,
Alex
-
Yeah, since I didn't really know what I was getting into with Galera, I did not apply my normal InnoDB configuration settings to the my.cnf that was generated by the SeveralNines configurator.
I did as you suggested on the first node in the cluster. When executing "SET GLOBAL wsrep_cluster_address='gcomm://';", the mysqld process underneath my mysql client session crashed again.
I think I'm going to re-bootstrap the cluster with the new my.cnf and try the offending data import again. In the meantime, I've attached the error log -- this is from node 1, which was not the node I was performing the import on, just the primary.
-
As a last-gasp attempt before re-bootstrapping, I tried this once more. I'll post back on whether correcting the InnoDB log file size gaffe fixes things.
Is there an appropriate setting for the Galera cache, or anything else like that?
mysql> SET GLOBAL wsrep_cluster_address='gcomm://';
Query OK, 0 rows affected (2.51 sec)
mysql> show status;
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...
ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 0
ERROR:
Can't connect to the server
mysql> show status;
No connection. Trying to reconnect...
-
OK, about the "crashing": I was assuming that you had an empty wsrep_cluster_address on startup (I'm not fully fluent in what ClusterControl is doing there). So it was already 'gcomm://' and the "SET GLOBAL..." was unnecessary. And there is a bug/feature that does not allow doing that a second time. I guess it is a bug.
About GCache: the majority of Galera parameters are set via the wsrep_provider_options variable. You can see a list here: http://www.codership.com/wiki/doku.php?id=galera_parameters
There is no hard requirement on the GCache size (it grows as needed). In the future, though, it may make sense to make gcache.size comparable to the size of the database.
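As an illustration only (the 10G figure below is an assumption for the example, not a recommendation from this thread), gcache.size could be set in my.cnf roughly like this:
[mysqld]
# hypothetical sizing: make the write-set cache comparable to the dataset
wsrep_provider_options="gcache.size=10G"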
Regards,
Alex
-
Ok, so we got much further with more sensible innodb log sizes, but a deadlock happened, killing the import. When I attempted to clean one of the tables and start a fresh import, I think the same issue happened again.
I'm not certain as to why it was unable to connect to the primary node (ip 10.176.215.163). That was the only node that thought its Galera status was ON/OK.
-
Looking at the log, this particular node lost connectivity to the rest of the cluster at around 2:42:00. Did anything specific happen at that point? It also looks like 10.176.215.163 was restarted then, which might be the reason for the connectivity loss. It would also be helpful to see the logs from 10.176.215.163 and the rest of the nodes. My suspicion is that they crashed simultaneously at 2:42 due to a bug.
Were you doing the import on this node and the rest were "slaves" to it?
-
So the import was being done on the #2 node; the #1 node is the primary. I have the full set of logs from the most recent crash (I nuked the cluster and started over again last night) here, as well as a my.cnf (they're identical).
The ips and hostnames line up like this:
10.176.215.163: db1
10.176.247.97: db2
10.174.57.115: db3
10.174.65.69: db4
-
Thanks for the logs, they tell a story of a tragic beauty:
Everything went well until around 14:40-14:50, when the slaves (db1, db3, db4) started to experience performance problems, most likely IO storms (and that must be swapping - is it possible to check memory usage history in Cluster Control? If the nodes are still alive, could you check memory usage?). The nodes lost responsiveness, the cluster was falling apart, and then mysqld_safe finished it off by restarting db1.
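For the memory question, if you have shell access to the nodes, even a quick check like the following would be enough (standard Linux tools, nothing Galera-specific):
free -m      # current memory and swap usage
vmstat 5 5   # five samples of swapping activity, five seconds apart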
Looking at how synchronously the slaves started to lose it suggests a bug in the slave code. We'll try to reproduce it. I'm not sure if there is a workaround for it.
-
So, I set swappiness to 10 on these database servers. CPU use was pretty steady at about 30%, with no swap use. What I found super interesting was a network spike, which was tied to an insert spike.
Which I also find curious, since this is how I was moving the data, and I'd presume this is regular insert behavior:
mysqldump -hOLD_DB_SERVER --extended-insert --quick client --tables dgs d d_i | mysql -pXXXXXXXX client
I've attached the graphs and the latest log -- this time I was in fact doing the data import into the "master node". That part does not seem to matter so much. We're adding some more detailed monitoring on this end so that we can capture more detail than Cluster Control provides -- as far as I can tell, it does not provide detailed memory usage graphs, at least in the dashboard.
-
Hello! We have identified this issue as related to temp table usage. There are over 60mm temp table creates noted in the Cluster Control stats; however, none of these are disk-based temp tables.
The source of the failure is then the OOM killer taking out mysqld.
Two questions:
- What is Galera using so many temp tables for? (or is this a misleading statistic?)
- Can we force temp table usage to go to disk? Will this impact Galera negatively?
Thanks!
-
Hi,
That's strange.
- Galera, or the MySQL-wsrep patch for that matter, does not create any temporary tables, at least not intentionally.
- MySQL itself can in some circumstances, but this should not be happening under slave load. Some SELECTs and other reads, like SHOW TABLES, may generate temp tables.
- There is some SELECT load during the import, quite noticeable on the query_statistics graph. Where is it coming from? Maybe it is Cluster Control queries?
- I guess it is now an established fact that the system runs out of memory.
- I don't think the temporary tables graph is 100% reliable info. What does 60m stand for? 60 million? This looks like a CC bug. Could you verify it with SHOW STATUS LIKE 'created_tmp_tables';?
- I tried to reproduce any sort of memory leak (with or without temporary tables) by loading big dbt2 dumps in my "isolated" environment and could not.
So with that in mind, I suspect that there is some bug related to either the processing of Cluster Control queries or the processing of specific tables in the dump. Can I suggest retrying the import with the control node turned off (shut down)? And if that fails, could you provide the imported db schema?
And one more thing: I noticed that you have the query cache enabled in my.cnf. It is not supported at the moment and may cause adverse effects. Could you please disable it too?
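For example, something along these lines under [mysqld] (a sketch only, using the standard MySQL query cache settings):
# disable the query cache, which is not supported with Galera as noted above
query_cache_type = 0
query_cache_size = 0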
Regards,
Alex
-
Sorry for the delay (holidays and all).
I turned off the query cache and disabled cmon and my other monitoring. It seemed that temp table creation was at a minimum (single digits), but the import still crashed the cluster.
I'm importing 3 (big) tables, but currently it seems that it's been crashing during just one of them (column names have been edited to protect the, er, innocent):
Table 1 (the biggest in rowcount)
CREATE TABLE `dgs` (
`d_id` bigint(20) unsigned NOT NULL,
`g_id` int(10) unsigned NOT NULL,
`state` tinyint(1) unsigned NOT NULL,
`p_id` int(10) unsigned DEFAULT NULL,
`co_id` int(10) unsigned DEFAULT NULL,
`o` tinyint(1) unsigned DEFAULT NULL,
`i_type` enum('one','two','three') DEFAULT NULL,
`ca_id` int(10) unsigned DEFAULT NULL,
`a_type` tinyint(3) unsigned DEFAULT NULL,
`timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`d_id`,`g_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
/*!50100 PARTITION BY KEY (d_id)
PARTITIONS 10 */
Table 2
CREATE TABLE `d` (
`d_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`d_type` enum('one','two') NOT NULL DEFAULT 'one',
PRIMARY KEY (`d_id`,`d_type`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
Table 3 (the one that appears to cause the crashes?)
CREATE TABLE `devices_i` (
`d_id` bigint(20) unsigned NOT NULL,
`d_token` binary(20) NOT NULL,
PRIMARY KEY (`d_id`),
UNIQUE KEY `d_token` (`d_token`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
Is there anything else worth trying? I seem to have a case that will reliably crash a clean Galera cluster every time.
-
I ran into the same issue yesterday. The trouble I encountered is that Galera holds all transactions in memory before committing them. When you attempt a really large transaction, you run out of memory. Really large transactions include really large single statements.
I guess you are using mysqldump to move your data over to Galera, and you're using --extended-insert/-e. Make sure that is disabled with --skip-extended-insert. This will create one insert statement per record rather than batching many records into a single insert statement. Running millions of inserts will take significantly longer, but it worked in my case.
Or you could be using the --single-transaction switch which may also cause the same trouble.
Or you might have wsrep_convert_LOCK_to_trx enabled in Galera. That combined with --add-locks, --lock-tables, or --lock-all-tables could cause trouble. Note that --opt enables --add-locks and --lock-tables.
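Putting this together, a rough sketch of how the earlier mysqldump pipeline might look with those options disabled (host, password placeholder, and table names taken from the command posted above):
mysqldump -hOLD_DB_SERVER --skip-extended-insert --skip-add-locks --skip-lock-tables --quick client --tables dgs d d_i | mysql -pXXXXXXXX client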
Note that setting gcs.recv_q_hard_limit won't fix the problem. It only checks the size of the cache _after_ the transaction has been taken in, but you've already run out of memory before then. I ran into that, too.
It's possible that adding more swap will let the import complete. You can always create a temporary swap file. I'm sure you know how, but for others stumbling across this: dd if=/dev/zero of=/swap_file bs=1G count=16 && mkswap /swap_file && swapon /swap_file.
I hope this helps!
-Mark
-
Hi everybody.
I'd like to let you know that we have identified at least one source of trouble: the certification index is being purged insufficiently frequently. We are implementing a fix for it, to be included in the 1.1 release (in a few days' time).
It is true that not using --single-transaction and --extended-insert might relieve the situation, but naturally it will affect the speed of the dump.
The allegation that transactions are stored in RAM is not correct: they are stored in an mmapped disk buffer and get into RAM only at the moment of applying, so the size of the slave queue does not affect memory usage, and it is the size of the slave queue that is limited by gcs.recv_q_hard_limit. The nature of the limiting is not regulatory, as Mark inferred: if the slave queue exceeds the limit, the node will abort as if the system had run out of memory. So you don't want to set it to less than the disk space you can give to slave queue allocation.
Regards,
Alex
-
Alex,
Thank you for the clarification.
I can confirm that the issue I was experiencing personally involved the OOM killer. I had two SSH sessions open to the server, one doing the import, the other watching in top. At the point MySQL got killed, my top SSH session was also killed by the OOM killer. In my case, I am running Galera on three AWS Micro instances (640 MB RAM). So is it possible that the RAM limits of the instance were exceeded when attempting to apply the too-large transaction? The top session showed there was next to no free RAM available at the time the OOM killer ran. Either way, it's important to note that large transactions need lots of memory at some point during their execution.
-Mark
-
Mark,
I'm quite confident that the OOM you had was due to certification index buildup. We have identified a test case where the certification index could grow up to 1.2 GB. So it certainly can be an issue on machines that are tight on memory. It is true that large transactions lead to this, but only indirectly and only through the number of rows, not through the actual amount of data. You can have a 1 GB transaction whose RAM footprint in Galera is 1 KB. Since transactions are stored in mmapped files, they would not require much RAM even when applying: the processed pages can simply be paged out. (You can see it as a dedicated swap file for transactions.)
This certification index buildup is not a leak, but simply a result of cleanup not happening often enough. Experiments show that increasing cleanup frequency solves the issue.
Regards,
Alex
-
Mark,
It depends on what you mean by "helping". Indeed, deleting a million rows from a table is a good example of a key-only transaction that would take a lot of RAM in Galera while containing no data. So timely cert index cleanup can help to handle 10 such transactions if you can handle one. But you probably can't, not on a micro AWS instance. If we accept such a transaction, at some point we'll need all those keys in memory, and that could be hundreds of megabytes. So currently the transaction size is limited to 128K rows (wsrep_max_ws_rows) simply out of practical considerations. Dealing with million-row transactions requires a totally different approach and might be a subject of the 1.2 Galera release.
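For reference, the current limit on a node can be checked with something like:
mysql> SHOW GLOBAL VARIABLES LIKE 'wsrep_max_ws_rows';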
Regards,
Alex
-
Alex,
Deleting a million rows is not something I'd typically do. The first time I tried it, it crashed Galera (OOM). I ended up deleting 250,000 rows at a time. However, I ran into the certification index issue again (I think): I had to restart all three Galera nodes between batches of 250,000 rows to avoid running out of memory (the memory usage doesn't seem to go down). The primary keys were 64-bit ints, if that's relevant.
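For anyone else stumbling on this, a rough sketch of the batching I mean (the table and column names here are placeholders, not my actual schema):
-- repeat until the statement reports 0 rows affected
DELETE FROM my_big_table WHERE created_at < '2011-01-01' LIMIT 250000;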
-Mark
-
Sorry, links:
Galera: https://launchpad.net/galera/1.x/22.1.1
MySQL 5.1: https://launchpad.net/codership-mysql/5.1/5.1.59-22.2
MySQL 5.5: https://launchpad.net/codership-mysql/5.5/5.5.17-22.3