Forums/Community Help

Answered

Having trouble importing data into a Galera cluster

Alex Reid
asked this on November 16, 2011 18:51

So, this has happened several times now. 

My cluster layout is like this:

1 control - m1.large

4 cluster db - m1.xlarge

 

I started with the Galera Cluster Configurator and the stock Ubuntu 11.04 Natty 64-bit AMI on these nodes. Got them set up after some fiddling, everything looked great.. then I went to import my existing MySQL dataset. Sometime in the middle of the night (Monday night), the MySQLd process on the node I was importing data on crashed underneath me. When I logged into it again with a MySQL client, the wsrep status was OFF, and the cluster as a whole had fallen apart.

Since I was not using this in production yet, I figured I'd just start over with CentOS. I had a few little hitches with the Galera Cluster Configurator on Ubuntu, and thought it was likely more mature on CentOS.. plus, the Ubuntu SysV init scripts for MySQL were a bit hinky. So, stock CentOS 5.6 64Bit, the Rightscale provided image, same server layout / sizes.. 

And the same thing happened. Here's my description in a little more detail that I sent to severalnines support:

"While doing a large import on my #2 node in a screen session, the mysqld server crashed. When it automatically was restarted by mysqld_safe, it was unreachable by a mysql client on the localhost .. the client just hung, without even reaching authentication. After a service restart, I am able to connect. I did a show status and got this for the wsrep, but then the mysqld server immediately crashed again, i.e. show status worked, then I got "MySQL server has gone away""

That is: 

mysql> show status;
....

| wsrep_local_state_uuid                   | 00000000-0000-0000-0000-000000000000 |
| wsrep_protocol_version                   | 1                                    |
| wsrep_last_committed                     | 18446744073709551615                 |
| wsrep_replicated                         | 0                                    |
| wsrep_replicated_bytes                   | 0                                    |
| wsrep_received                           | 1                                    |
| wsrep_received_bytes                     | 356                                  |
| wsrep_local_commits                      | 0                                    |
| wsrep_local_cert_failures                | 0                                    |
| wsrep_local_bf_aborts                    | 0                                    |
| wsrep_local_replays                      | 0                                    |
| wsrep_local_send_queue                   | 0                                    |
| wsrep_local_send_queue_avg               | 0.000000                             |
| wsrep_local_recv_queue                   | 0                                    |
| wsrep_local_recv_queue_avg               | 0.000000                             |
| wsrep_flow_control_paused                | 0.000000                             |
| wsrep_flow_control_sent                  | 0                                    |
| wsrep_flow_control_recv                  | 0                                    |
| wsrep_cert_deps_distance                 | 0.000000                             |
| wsrep_apply_oooe                         | 0.000000                             |
| wsrep_apply_oool                         | 0.000000                             |
| wsrep_apply_window                       | 0.000000                             |
| wsrep_commit_oooe                        | 0.000000                             |
| wsrep_commit_oool                        | 0.000000                             |
| wsrep_commit_window                      | 0.000000                             |
| wsrep_local_state                        | 1                                    |
| wsrep_local_state_comment                | Waiting for SST (4)                  |
| wsrep_cert_index_size                    | 0                                    |
| wsrep_cluster_conf_id                    | 12                                   |
| wsrep_cluster_size                       | 4                                    |
| wsrep_cluster_state_uuid                 | b6b0ac9a-0fd7-11e1-0800-02efcec769b8 |
| wsrep_cluster_status                     | Primary                              |
| wsrep_connected                          | ON                                   |
| wsrep_local_index                        | 1                                    |
| wsrep_provider_name                      | Galera                               |
| wsrep_provider_vendor                    | Codership Oy <info@codership.com>    |
| wsrep_provider_version                   | 21.1.0(r86)                          |
| wsrep_ready                              | OFF                                  |
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...

 

 

But, because I had split up my data this time before importing, I noted that 

1) It was my largest (in rowcount) table that it was in the midst of importing while it crashed (157 Million rows)

2) This table is partitioned. I'll post the schema here: (Table and column names changed for privacy's sake)

delimiter $$

CREATE TABLE `dgs` (
`d_id` bigint(20) unsigned NOT NULL,
`g_id` int(10) unsigned NOT NULL,
`state` tinyint(1) unsigned NOT NULL,
`p_id` int(10) unsigned DEFAULT NULL,
`c_id` int(10) unsigned DEFAULT NULL,
`o` tinyint(1) unsigned DEFAULT NULL,
`type` enum('b','c','i') DEFAULT NULL,
`ca_id` int(10) unsigned DEFAULT NULL,
`a_type` tinyint(3) unsigned DEFAULT NULL,
`timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`d_id`,`g_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
/*!50100 PARTITION BY KEY (d_id)
PARTITIONS 10 */$$

 

So now I have one node in the state above, while my other 3 are not being good Galera cluster nodes either..

Node 1

| wsrep_local_state | 0 |
| wsrep_local_state_comment | Initialized (0) |
| wsrep_connected | OFF |
| wsrep_local_index | 18446744073709551615 |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy <info@codership.com> |
| wsrep_provider_version | 21.1.0(r86) |
| wsrep_ready | OFF |
+------------------------------------------+--------------------------------------+

 

Node 3

| wsrep_local_state | 4 |
| wsrep_local_state_comment | Synced (6) |
| wsrep_cert_index_size | 0 |
| wsrep_cluster_conf_id | 14 |
| wsrep_cluster_size | 2 |
| wsrep_cluster_state_uuid | b6b0ac9a-0fd7-11e1-0800-02efcec769b8 |
| wsrep_cluster_status | Primary |
| wsrep_connected | ON |
| wsrep_local_index | 1 |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy <info@codership.com> |
| wsrep_provider_version | 21.1.0(r86) |
| wsrep_ready | ON |
+------------------------------------------+--------------------------------------+

Node 4

| wsrep_local_state | 2 |
| wsrep_local_state_comment | Donor (+) |
| wsrep_cert_index_size | 0 |
| wsrep_cluster_conf_id | 12 |
| wsrep_cluster_size | 4 |
| wsrep_cluster_state_uuid | b6b0ac9a-0fd7-11e1-0800-02efcec769b8 |
| wsrep_cluster_status | Primary |
| wsrep_connected | ON |
| wsrep_local_index | 2 |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy <info@codership.com> |
| wsrep_provider_version | 21.1.0(r86) |
| wsrep_ready | ON |
+------------------------------------------+--------------------------------------+

 

I'm also attaching the entire MySQL log for the node that fell over. 

I really want to move forward with Galera as my MySQL HA solution, but I can't move forward without this data intact. 

 

 

Comments

User photo
Alex Reid

Hm. Actually it looks like it only made it half-way through importing this table, not the one I posted above, at least according to the only node that I can still query (because it considers itself synced)

 

delimiter $$

CREATE TABLE `dvs` (
`d_id` bigint(20) unsigned NOT NULL,
`d_token` binary(20) NOT NULL,
PRIMARY KEY (`d_id`),
UNIQUE KEY `d_token` (`d_token`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8$$

 

November 16, 2011 19:02
User photo
Alex

Hi,

Looking at your logs, I could not help noticing:

111115 17:17:31 InnoDB: Initializing buffer pool, size = 10.5G
...
111115 17:17:32  InnoDB: Log file ./ib_logfile0 did not exist: new to be created
InnoDB: Setting log file ./ib_logfile0 size to 5 MB

I think this is a serious misconfiguration: total size of ib_logfiles should be around 20-25% of the buffer pool size and definitely exceed the biggest transaction size, which in this case could easily exceed 5M.

Before continuing, I'd strongly suggest doing the following on all of the nodes:

  1. shut down mysqld
  2. rm -f /mnt/data/mysql/ib_logfile*
  3. set innodb_log_file_size=1G in my.cnf
  4. restart mysqld

After that you most likely will need to rejoin the nodes into cluster:

Log in as root into one of the nodes and do

mysql> SET GLOBAL wsrep_cluster_address='gcomm://';

- this shall bootstrap the cluster.

The on the remaining nodes do:

mysql> SET GLOBAL wsrep_cluster_address='gcomm://<first node IP>';

 

Regards,

Alex

November 16, 2011 22:51
User photo
Alex Reid

Yeah, since I didn't really know what I was getting into with Galera, I did not apply my normal InnoDB configuration settings to the my.cnf that was generated by the SeveralNines configurator.

I did as you suggested on the first node in the cluster. When executing "SET GLOBAL wsrep_cluster_address='gcomm://';", the mysqld process underneath my mysql client session crashed again.

 

I think I'm going to re-bootstrap the cluster with the new my.cnf and try the offending data import again. In the meanwhile, I've attached the error log -- this is from node 1, which was not the node I was performing the import on, just the primary.

November 16, 2011 23:25
User photo
Alex Reid

On a last gasp attempt before I re-bootstrap, I attempted this once more. I'll post back if the gaffe in the innodb log file size fixes things.

 

Is there an appropriate setting for the Galera cache, or anything else like that?

 

 

mysql> SET GLOBAL wsrep_cluster_address='gcomm://';
Query OK, 0 rows affected (2.51 sec)

mysql> show status;
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...
ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 0
ERROR:
Can't connect to the server

mysql> show status;
No connection. Trying to reconnect...

November 16, 2011 23:30
User photo
Alex Reid

Alright, with a re-bootstrap of the cluster, it's made it through the table it crashed on the import of previously and is continuing to import data smoothly. 

 

Thanks,

Alex

November 17, 2011 01:52
User photo
Alex

Ok,about "crashing" - I was assuming that you have empty wsrep_cluster_address on startup (I'm not fully fluent with what ClusterControl is doing there). So it was 'gcomm://' already and "SET GLOBAL..." was unnecessary. And there is a bug/feature that does not allow doing that second time. Guess it is a bug.

About GCache. Majority of Galera parameters are set via wsrep_provider_options variable. You can see a list here: http://www.codership.com/wiki/doku.php?id=galera_parameters

The is no hard requirement on GCache size (it grows as it needs). In the future though it may make sense to make gcache.size comparable to the size of the database.

Regards,

Alex

November 17, 2011 02:26
User photo
Alex Reid

Ok, so we got much further with more sensible innodb log sizes, but a deadlock happened, killing the import. When I attempted to clean one of the tables and start a fresh import, I think the same issue happened again.

I'm not certain as to why it was unable to connect to the primary node (ip 10.176.215.163). That was the only node that thought its Galera status was ON/OK.

November 17, 2011 09:00
User photo
Alex

Looking at the log, this particular node lost connectivity to the rest of the cluster at around 2:42:00. Anything specific happened at that point? Also it looks like 10.176.215.163 was restarted then, which might be the reason for connectivity loss. It would also be helpful to see the logs from 10.176.215.163 and the rest of the nodes. My suspicion is that they have crashed simultaneously at 2:42 due to a bug.

Were you doing the import on this node and the rest were "slaves" to it?

November 17, 2011 11:22
User photo
Alex Reid

So the import was being done to the #2 node, the #1 node is the primary. I have the full set of logs from the most recent crash (I nuked the cluster and started over again last night) here, as well as a my.cnf (they're identical).

 

The ips and hostnames line up like this:

10.176.215.163: db1

10.176.247.97: db2

10.174.57.115: db3

10.174.65.69: db4

 

November 17, 2011 23:38
User photo
Alex

Thanks for the logs, they tell a story of a tragic beauty:

Everything went well until at around 14:40-14:50 slaves (db1, db3, db4) started to experience some performance problems, most likely IO storms (and that must be swapping - is it possible to check memory usage history in Cluster Control? If the nodes are still alive, could you check memory usage?). The nodes lost responsiveness and cluster was falling apart, and then mysqld_safe nailed it by restarting db1.

Looking at how synchronously slaves started to loose it suggests a bug in the slave code. We'll try to reproduce it. I'm not sure if there is a workaround for it.

November 18, 2011 18:58
User photo
Alex Reid

So, I set swappiness to 10 on these database servers. The CPU use was pretty steady -- about 30%, and no swap use. What I found was super interesting was this network spike, which was tied to an insert spike.. 

Which I find also curious, as this is how I was moving the data, and I'd presume this was regular insert behavior:

mysqldump -hOLD_DB_SERVER  --extended-insert --quick client --tables dgs d d_i | mysql -pXXXXXXXX client


I've attached the graphs and the latest log -- this time I was in fact doing the data import into the "master node". That part does not seem to matter so much. We're adding some more detailed monitoring on this end so that we can capture more detail than Cluster Control provides -- as far as I can find, it does not provide detailed memory usage graphs, at least in the dashboard.

November 18, 2011 23:21
User photo
Alex Reid

Out of curiosity, any idea why all these temp tables are being created in memory? And whether Galera has anything to do with creating them?

 

November 18, 2011 23:27
User photo
Martin Peck

Hello! We have identified this issue as related to temp table usage. There are over 60mm temp table creates noted in cluster control stats, however, none of these are disk based temp tables.

The source of the failure is then OOM killer taking out mysqld.

Two questions:

- What is Galera using so many temp tables for? (or is this a misleading statistic?)

- Can we force temp table usage to go to disk? Will this impact Galera negatively?

Thanks!

November 18, 2011 23:48
User photo
Alex

Hi,

That's strange.

  1. Galera, or MySQL-wsrep patch for that matter does not create any temporary tables, at least intentionally.
  2. MySQL itself can in some circumstances, but this should not be happening on slave load. Some SELECTs and other reads, like SHOW TABLES, may generate temp tables.
  3. There is some SELECT load during import, quite noticeable on the query_statistics graph. What is it coming from? Maybe it is Cluster Control queries?
  4. I guess it is now an established fact that the system runs out of memory.
  5. I don't think that temporary tables graph is 100% reliable info. What does 60 m stand for? 60 million? This looks like a CC bug. Could you verify it with SHOW STATUS LIKE 'created_tmp_tables';?
  6. I tried to reproduce any sort of memleak (with or without temporary tables) by loading big dbt2 dumps in my "isolated" environment and could not.

So with that in mind, I can suspect that there is some bug related to either processing of Cluster Control queries or processing of specific tables in the dump. Can I suggest to retry the import with control node turned off (shut down)? And if that fails, could you provide imported db schema?

And one more, I noticed that you have query cache enabled in my.cnf. It is at the moment not supported and may cause adverse effects. Could you please disable it too?

Regards,

Alex

November 19, 2011 22:25
User photo
Alex Reid

Sorry for the delay (holidays and all)

 

I turned off the Query cache, and disabled cmon and my other monitoring. It seemed that temp table creation was at a minimum (single digits), but the import still crashed the cluster.

I'm importing 3 (big) tables, but currently it seems that it's been crashing during just one of them (column names have been edited to protect the, er, innocent) :

 

Table 1 (the biggest in rowcount)

CREATE TABLE `dgs` (
`d_id` bigint(20) unsigned NOT NULL,
`g_id` int(10) unsigned NOT NULL,
`state` tinyint(1) unsigned NOT NULL,
`p_id` int(10) unsigned DEFAULT NULL,
`co_id` int(10) unsigned DEFAULT NULL,
`o` tinyint(1) unsigned DEFAULT NULL,
`i_type` enum('one','two','three') DEFAULT NULL,
`ca_id` int(10) unsigned DEFAULT NULL,
`a_type` tinyint(3) unsigned DEFAULT NULL,
`timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`d_id`,`g_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
/*!50100 PARTITION BY KEY (d_id)
PARTITIONS 10 */

Table 2 

CREATE TABLE `d` (
`d_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`d_type` enum('one','two') NOT NULL DEFAULT 'one',
PRIMARY KEY (`d_id`,`d_type`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

Table 3 (the one that it appears causes crashes?)

CREATE TABLE `devices_i` (
`d_id` bigint(20) unsigned NOT NULL,
`d_token` binary(20) NOT NULL,
PRIMARY KEY (`d_id`),
UNIQUE KEY `d_token` (`d_token`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

 

Is there anything else worth trying? I seem to have a case that will reliably crash a clean Galera cluster every time..

November 30, 2011 17:11
User photo
Alex

So, which table crashes the cluster, devices_i? Or any three of them can?

December 01, 2011 12:13
User photo
Mark Rose

I ran into the same issue yesterday. The trouble I encountered is that Galera holds all transactions in memory before committing them. When you attempt a really large transaction, you run out of memory. Really large transactions include really large single statements.

I guess you are using mysqldump to move your data over to Galera, and you're using --extended-insert/-e. Make sure that is disabled with --skip-extended-insert. This will create an insert statement for each record in one table, not one insert statement for all records in one table. Running millions of inserts will take significantly longer, but it worked in my case.

Or you could be using the --single-transaction switch which may also cause the same trouble. 

Or you might have wsrep_convert_LOCK_to_trx enabled in Galera. That combined with --add-locks, --lock-tables, or --lock-all-tables could cause trouble. Note that --opt enables --add-locks and --lock-tables.

Note that setting gcs.recv_q_hard_limit won't fix the problem. It only checks the size of the cache _after_ the transaction has been inputted, but you've already ran out of memory before then. I ran into that, too.

It's possible that adding more swap will let the import complete. You can always create a temporary swap file. I'm sure you know how, but for others stumbling across this: dd if=/dev/zero of=/swap_file bs=1G count=16 && mkswap /swap_file && swapon /swap_file.

I hope this helps!

-Mark

December 02, 2011 17:28
User photo
Alex

Hi everybody.

I'd like to let you know that we have identified at least one source of trouble to be certification index being purged insufficiently frequently and implementing a fix for it to be included in the 1.1 release (few days time).

It is true that not using --single-transaction and --extended-insert might relieve the situation, but naturally it will affect the speed of the dump.

Allegation that transactions are stored in RAM is not correct, they are stored in mmapped disk buffer, and get into RAM only at the moment of applying, so the size of the slave queue does not affect memory usage and it is the size of slave queue that is limited by gcs.recv_q_hard_limit. The nature of limiting is not regulatory, as Mark inferred: if slave queue exceeds the limit, the node will abort as if system runs out of memory. So you don't want to set it to less than the disk space you can give to slave queue allocation.

Regards,

Alex

December 02, 2011 18:09
User photo
Mark Rose

Alex,

Thank you for the clarification.

I can confirm the issue I was experiencing personally involved the OOM. I had two SSH sessions open to the server, one doing the import, the other watching in top. At the point MySQL got killed, my top SSH session was also killed by the OOM. In my case, I am running Galera on three AWS Micro instances (640 MB RAM). So is it possible that the RAM limits of the instance were exceeded when attempting to apply the too-large transaction? The top session showed there was next to no free RAM available at the time the OOM killer ran. Either way, it's important to note large transactions need lots of memory at some point during their execution.

-Mark

December 02, 2011 19:28
User photo
Alex

Mark,

I'm quite confident that OOM that you have had was due to certification index buildup. We have identified a test case where certification index could grow up to 1.2Gb. So it certainly can be an issue on machines which are tight on memory. It is true that large transactions lead to this, but only indirectly and only through the number of rows, not through actual amount of data. You can have a 1Gb transaction which RAM footprint in Galera can be 1K. Since transactions are stored in mmapped files, they would not require much RAM even when applying: the processed pages can simply be paged out. (you can see it as a dedicated swap file for transactions)

This certification index buildup is not a leak, but simply a result of cleanup not happening often enough. Experiments show that increasing cleanup frequency solves the issue.

Regards,

Alex

December 02, 2011 20:27
User photo
Mark Rose

Alex,

Awesome. That makes sense. The tables I was having trouble had small rows, but over a million rows.

I assume the fix will also help when deleting a million rows from a table, too? I just discovered that probably related issue now.

-Mark

December 02, 2011 20:57
User photo
Alex

Mark,

It depends on what you mean by "helping". Indeed, deleting a million rows from a table is a good example of a key-only transaction that would take a lot of RAM in Galera while containing no data. So timely cert index cleanup can help to handle 10 such transactions if you can handle one. But you probably can't. Not on micro AWS instance. If we accept such transaction, at some point we'll need all those keys in memory, and that could be hundreds of megabytes. So currently transaction size is limited to 128K (wsrep_max_ws_rows) simply out of practical considerations. Dealing with million row transactions requires totally different approach and might be a subject of 1.2 Galera release.

Regards,

Alex

December 03, 2011 03:35
User photo
Mark Rose

Alex,

Deleting a million rows is not something I'd typically do. The first time I tried it, it crashed Galera (OOM). I ended up deleting 250,000 rows at a time. However, I ran into the certification index issue again (I think). I had to restart all three of the Galera nodes between batches of 250,000 rows to avoid running out of memory (the memory usage doesn't seem to go down). The primary keys were 64 bit ints, if that's relevant.

-Mark

December 04, 2011 06:12
User photo
Alex

Mark, this is exactly the case of certification index buildup. This is fixed in trunk.

December 04, 2011 11:57
User photo
Alex

Hi everybody,

We have released new MySQL/Galera versions which should fix that OOM bug (certification index is now cleaned up as soon as it exceeds certain size)

Although the bug is in Galera library, wsrep API has changed so both MySQL and Galera must be updated.

Thanks,

Alex

December 11, 2011 03:05
User photo
Alex
December 11, 2011 03:08