Having trouble importing data into a Galera cluster

Comments


  • Alex Reid

    Hm. Actually it looks like it only made it halfway through importing this table, not the one I posted above, at least according to the only node that I can still query (because it considers itself synced).

     

    delimiter $$

    CREATE TABLE `dvs` (
    `d_id` bigint(20) unsigned NOT NULL,
    `d_token` binary(20) NOT NULL,
    PRIMARY KEY (`d_id`),
    UNIQUE KEY `d_token` (`d_token`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8$$

     

  • Alex

    Hi,

    Looking at your logs, I could not help noticing:

    111115 17:17:31 InnoDB: Initializing buffer pool, size = 10.5G
    ...
    111115 17:17:32  InnoDB: Log file ./ib_logfile0 did not exist: new to be created
    InnoDB: Setting log file ./ib_logfile0 size to 5 MB

    I think this is a serious misconfiguration: the total size of the ib_logfiles should be around 20-25% of the buffer pool size and should definitely exceed the biggest transaction size, which in this case could easily exceed 5 MB.

    Before continuing, I'd strongly suggest doing the following on all of the nodes:

    1. shut down mysqld
    2. rm -f /mnt/data/mysql/ib_logfile*
    3. set innodb_log_file_size=1G in my.cnf (see the sketch below)
    4. restart mysqld
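
    As a rough sketch, the relevant my.cnf lines could look like this; only innodb_log_file_size comes from step 3 above, the other values are assumptions based on the buffer pool size seen in your log:

    [mysqld]
    # total redo log (size x number of files) should be roughly 20-25% of the buffer pool
    innodb_buffer_pool_size   = 10G
    innodb_log_file_size      = 1G
    innodb_log_files_in_group = 2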

    After that you will most likely need to rejoin the nodes into the cluster:

    Log in as root on one of the nodes and run:

    mysql> SET GLOBAL wsrep_cluster_address='gcomm://';

    This will bootstrap the cluster.

    Then, on the remaining nodes, run:

    mysql> SET GLOBAL wsrep_cluster_address='gcomm://<first node IP>';

     

    Regards,

    Alex

  • Alex Reid

    Yeah, since I didn't really know what I was getting into with Galera, I did not apply my normal InnoDB configuration settings to the my.cnf that was generated by the SeveralNines configurator.

    I did as you suggested on the first node in the cluster. When executing "SET GLOBAL wsrep_cluster_address='gcomm://';", the mysqld process underneath my mysql client session crashed again.

     

    I think I'm going to re-bootstrap the cluster with the new my.cnf and try the offending data import again. In the meanwhile, I've attached the error log -- this is from node 1, which was not the node I was performing the import on, just the primary.

  • Alex Reid

    In a last-gasp attempt before re-bootstrapping, I tried this once more. I'll post back on whether fixing the gaffe in the innodb log file size sorts things out.

     

    Is there an appropriate setting for the Galera cache, or anything else like that?

     

     

    mysql> SET GLOBAL wsrep_cluster_address='gcomm://';
    Query OK, 0 rows affected (2.51 sec)

    mysql> show status;
    ERROR 2006 (HY000): MySQL server has gone away
    No connection. Trying to reconnect...
    ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 0
    ERROR:
    Can't connect to the server

    mysql> show status;
    No connection. Trying to reconnect...

  • Alex Reid

    Alright, after re-bootstrapping the cluster, it has made it through the table whose import crashed it previously and is continuing to import data smoothly.

     

    Thanks,

    Alex

  • Alex

    Ok, about the "crashing": I was assuming that you had an empty wsrep_cluster_address on startup (I'm not fully up to speed on what ClusterControl is doing there). So it was already 'gcomm://' and the "SET GLOBAL..." was unnecessary. And there is a bug/feature that does not allow doing that a second time. I guess it is a bug.

    About GCache: the majority of Galera parameters are set via the wsrep_provider_options variable. You can see a list here: http://www.codership.com/wiki/doku.php?id=galera_parameters

    There is no hard requirement on GCache size (it grows as needed). In the future, though, it may make sense to make gcache.size comparable to the size of the database.
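
    For illustration, a GCache setting would go through that same variable in my.cnf; this is only a sketch, and the 10G figure is a placeholder rather than a recommendation from this thread:

    [mysqld]
    # Galera provider options are passed as a single string of key=value pairs
    wsrep_provider_options = "gcache.size=10G"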

    Regards,

    Alex

  • Alex Reid

    Ok, so we got much further with more sensible innodb log sizes, but a deadlock happened, killing the import. When I attempted to clean one of the tables and start a fresh import, I think the same issue happened again.

    I'm not certain why it was unable to connect to the primary node (IP 10.176.215.163). That was the only node that thought its Galera status was ON/OK.

  • Alex

    Looking at the log, this particular node lost connectivity to the rest of the cluster at around 2:42:00. Did anything specific happen at that point? Also, it looks like 10.176.215.163 was restarted then, which might be the reason for the connectivity loss. It would also be helpful to see the logs from 10.176.215.163 and the rest of the nodes. My suspicion is that they crashed simultaneously at 2:42 due to a bug.

    Were you doing the import on this node and the rest were "slaves" to it?

  • Alex Reid

    So the import was being done on the #2 node; the #1 node is the primary. I have the full set of logs from the most recent crash (I nuked the cluster and started over again last night) here, as well as a my.cnf (it's identical on all nodes).

     

    The ips and hostnames line up like this:

    10.176.215.163: db1

    10.176.247.97: db2

    10.174.57.115: db3

    10.174.65.69: db4

     

  • Alex

    Thanks for the logs, they tell a story of a tragic beauty:

    Everything went well until around 14:40-14:50, when the slaves (db1, db3, db4) started to experience performance problems, most likely IO storms (and that must be swapping -- is it possible to check memory usage history in Cluster Control? If the nodes are still alive, could you check memory usage?). The nodes lost responsiveness, the cluster was falling apart, and then mysqld_safe finished it off by restarting db1.

    The way the slaves started to lose it almost in sync suggests a bug in the slave code. We'll try to reproduce it. I'm not sure if there is a workaround for it.

  • Alex Reid

    So, I set swappiness to 10 on these database servers. The CPU use was pretty steady -- about 30% -- and there was no swap use. What I found super interesting was this network spike, which was tied to an insert spike.

    Which I also find curious, as this is how I was moving the data, and I'd presume this is regular insert behavior:

    mysqldump -hOLD_DB_SERVER --extended-insert --quick client --tables dgs d d_i | mysql -pXXXXXXXX client


    I've attached the graphs and the latest log -- this time I was in fact doing the data import into the "master node". That part does not seem to matter so much. We're adding some more detailed monitoring on this end so that we can capture more detail than Cluster Control provides -- as far as I can find, it does not provide detailed memory usage graphs, at least in the dashboard.

  • Alex Reid

    Out of curiosity, any idea why all these temp tables are being created in memory? And whether Galera has anything to do with creating them?

     

  • Martin Peck

    Hello! We have identified this issue as related to temp table usage. There are over 60mm temp table creates noted in the cluster control stats; however, none of these are disk-based temp tables.

    The source of the failure is then the OOM killer taking out mysqld.

    Two questions:

    - What is Galera using so many temp tables for? (or is this a misleading statistic?)

    - Can we force temp table usage to go to disk? Will this impact Galera negatively?

    Thanks!

  • Alex

    Hi,

    That's strange.

    1. Galera, or the MySQL-wsrep patch for that matter, does not create any temporary tables, at least not intentionally.
    2. MySQL itself can in some circumstances, but this should not be happening under slave load. Some SELECTs and other reads, like SHOW TABLES, may generate temp tables.
    3. There is some SELECT load during import, quite noticeable on the query_statistics graph. Where is it coming from? Maybe it is Cluster Control queries?
    4. I guess it is now an established fact that the system runs out of memory.
    5. I don't think that temporary tables graph is 100% reliable info. What does 60mm stand for? 60 million? This looks like a CC bug. Could you verify it with SHOW STATUS LIKE 'created_tmp_tables'; (see the example below)?
    6. I tried to reproduce any sort of memleak (with or without temporary tables) by loading big dbt2 dumps in my "isolated" environment and could not.
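
    A sketch of the check suggested in point 5 (these are standard MySQL status counters; the second one shows how many temp tables went to disk):

    mysql> SHOW STATUS LIKE 'Created_tmp_tables';
    mysql> SHOW STATUS LIKE 'Created_tmp_disk_tables';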

    So with that in mind, I suspect there is some bug related either to the processing of Cluster Control queries or to the processing of specific tables in the dump. Can I suggest retrying the import with the control node turned off (shut down)? And if that fails, could you provide the imported db schema?

    And one more thing: I noticed that you have the query cache enabled in my.cnf. It is not supported at the moment and may cause adverse effects. Could you please disable it too?
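
    For reference, disabling the query cache in my.cnf would look something like this (these are standard MySQL settings, not copied from the attached config):

    [mysqld]
    query_cache_type = 0
    query_cache_size = 0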

    Regards,

    Alex

  • Alex Reid

    Sorry for the delay (holidays and all)

     

    I turned off the query cache and disabled cmon and my other monitoring. Temp table creation seemed to be at a minimum (single digits), but the import still crashed the cluster.

    I'm importing 3 (big) tables, but currently it seems that it's been crashing during just one of them (column names have been edited to protect the, er, innocent):

     

    Table 1 (the biggest in rowcount)

    CREATE TABLE `dgs` (
    `d_id` bigint(20) unsigned NOT NULL,
    `g_id` int(10) unsigned NOT NULL,
    `state` tinyint(1) unsigned NOT NULL,
    `p_id` int(10) unsigned DEFAULT NULL,
    `co_id` int(10) unsigned DEFAULT NULL,
    `o` tinyint(1) unsigned DEFAULT NULL,
    `i_type` enum('one','two','three') DEFAULT NULL,
    `ca_id` int(10) unsigned DEFAULT NULL,
    `a_type` tinyint(3) unsigned DEFAULT NULL,
    `timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    PRIMARY KEY (`d_id`,`g_id`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8
    /*!50100 PARTITION BY KEY (d_id)
    PARTITIONS 10 */

    Table 2 

    CREATE TABLE `d` (
    `d_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
    `d_type` enum('one','two') NOT NULL DEFAULT 'one',
    PRIMARY KEY (`d_id`,`d_type`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8

    Table 3 (the one that appears to cause the crashes?)

    CREATE TABLE `devices_i` (
    `d_id` bigint(20) unsigned NOT NULL,
    `d_token` binary(20) NOT NULL,
    PRIMARY KEY (`d_id`),
    UNIQUE KEY `d_token` (`d_token`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8

     

    Is there anything else worth trying? I seem to have a case that will reliably crash a clean Galera cluster every time..

  • Alex

    So, which table crashes the cluster, devices_i? Or can any of the three?

  • Mark Rose

    I ran into the same issue yesterday. The trouble I encountered is that Galera holds all transactions in memory before committing them. When you attempt a really large transaction, you run out of memory. Really large transactions include really large single statements.

    I guess you are using mysqldump to move your data over to Galera, and you're using --extended-insert/-e. Make sure that is disabled with --skip-extended-insert. This will create one insert statement for each record in a table, instead of one insert statement for all records in a table. Running millions of inserts will take significantly longer, but it worked in my case.
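
    For example, the dump command posted earlier in the thread would become something like this (host, credentials and table names are the same placeholders as before):

    mysqldump -hOLD_DB_SERVER --skip-extended-insert --quick client --tables dgs d d_i | mysql -pXXXXXXXX client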

    Or you could be using the --single-transaction switch which may also cause the same trouble. 

    Or you might have wsrep_convert_LOCK_to_trx enabled in Galera. That combined with --add-locks, --lock-tables, or --lock-all-tables could cause trouble. Note that --opt enables --add-locks and --lock-tables.

    Note that setting gcs.recv_q_hard_limit won't fix the problem. It only checks the size of the cache _after_ the transaction has been received, but you've already run out of memory before then. I ran into that, too.

    It's possible that adding more swap will let the import complete. You can always create a temporary swap file. I'm sure you know how, but for others stumbling across this: dd if=/dev/zero of=/swap_file bs=1G count=16 && mkswap /swap_file && swapon /swap_file.

    I hope this helps!

    -Mark

  • Alex

    Hi everybody.

    I'd like to let you know that we have identified at least one source of trouble: the certification index being purged insufficiently frequently. We are implementing a fix for it, to be included in the 1.1 release (in a few days' time).

    It is true that not using --single-transaction and --extended-insert might relieve the situation, but naturally it will affect the speed of the dump.

    The allegation that transactions are stored in RAM is not correct: they are stored in an mmapped disk buffer and get into RAM only at the moment of applying, so the size of the slave queue does not affect memory usage, and it is the size of the slave queue that is limited by gcs.recv_q_hard_limit. The nature of the limiting is not regulatory, as Mark inferred: if the slave queue exceeds the limit, the node will abort as if the system had run out of memory. So you don't want to set it to less than the disk space you can give to slave queue allocation.
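
    For illustration only, that limit is set through the same provider options string; the 16G value below is an arbitrary placeholder, and per the above it should not be smaller than the disk space you can give to the slave queue:

    [mysqld]
    wsrep_provider_options = "gcs.recv_q_hard_limit=16G"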

    Regards,

    Alex

  • Mark Rose

    Alex,

    Thank you for the clarification.

    I can confirm that the issue I was experiencing personally involved the OOM killer. I had two SSH sessions open to the server, one doing the import, the other watching in top. At the point MySQL got killed, my top SSH session was also killed by the OOM killer. In my case, I am running Galera on three AWS Micro instances (640 MB RAM). So is it possible that the RAM limits of the instance were exceeded when attempting to apply the too-large transaction? The top session showed there was next to no free RAM available at the time the OOM killer ran. Either way, it's important to note that large transactions need lots of memory at some point during their execution.

    -Mark

  • Alex

    Mark,

    I'm quite confident that the OOM you had was due to certification index buildup. We have identified a test case where the certification index could grow up to 1.2 GB. So it certainly can be an issue on machines that are tight on memory. It is true that large transactions lead to this, but only indirectly and only through the number of rows, not through the actual amount of data. You can have a 1 GB transaction whose RAM footprint in Galera is 1 KB. Since transactions are stored in mmapped files, they would not require much RAM even when applying: the processed pages can simply be paged out (you can see it as a dedicated swap file for transactions).

    This certification index buildup is not a leak, but simply a result of cleanup not happening often enough. Experiments show that increasing cleanup frequency solves the issue.

    Regards,

    Alex

  • Mark Rose

    Alex,

    Awesome. That makes sense. The tables I was having trouble with had small rows, but over a million of them.

    I assume the fix will also help when deleting a million rows from a table? I just ran into that (probably related) issue now.

    -Mark

  • Alex

    Mark,

    It depends on what you mean by "helping". Indeed, deleting a million rows from a table is a good example of a key-only transaction that would take a lot of RAM in Galera while containing no data. So timely cert index cleanup can help you handle 10 such transactions if you can handle one. But you probably can't, not on a micro AWS instance. If we accept such a transaction, at some point we'll need all those keys in memory, and that could be hundreds of megabytes. So currently transaction size is limited to 128K rows (wsrep_max_ws_rows) simply out of practical considerations. Dealing with million-row transactions requires a totally different approach and might be a subject of the 1.2 Galera release.

    Regards,

    Alex

  • Mark Rose

    Alex,

    Deleting a million rows is not something I'd typically do. The first time I tried it, it crashed Galera (OOM). I ended up deleting 250,000 rows at a time. However, I ran into the certification index issue again (I think). I had to restart all three of the Galera nodes between batches of 250,000 rows to avoid running out of memory (the memory usage doesn't seem to go down). The primary keys were 64 bit ints, if that's relevant.
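
    For others hitting this, a rough sketch of the batching approach described above (the table name and condition are placeholders, not from this thread); repeat the statement until it affects zero rows, restarting nodes between batches if memory keeps growing:

    DELETE FROM some_big_table WHERE <condition> LIMIT 250000;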

    -Mark

  • Alex

    Mark, this is exactly the case of certification index buildup. This is fixed in trunk.

  • Alex

    Hi everybody,

    We have released new MySQL/Galera versions which should fix that OOM bug (the certification index is now cleaned up as soon as it exceeds a certain size).

    Although the bug is in the Galera library, the wsrep API has changed, so both MySQL and Galera must be updated.

    Thanks,

    Alex

