
How to Overcome Accidental Data Deletion in MySQL & MariaDB


Someone accidentally deleted part of the database. Someone forgot to include a WHERE clause in a DELETE query, or they dropped the wrong table. Things like that can and will happen; it is inevitable and human. But the impact can be disastrous. What can you do to guard yourself against such situations, and how can you recover your data? In this blog post, we will cover some of the most typical cases of data loss, and how you can prepare yourself so you can recover from them.

Preparations

There are things you should do in order to ensure a smooth recovery. Let’s go through them. Please keep in mind that it’s not a “pick one” situation - ideally you will implement all of the measures we are going to discuss below.

Backup

You have to have a backup; there is no getting away from it. You should also have your backups tested - unless you test your backups, you cannot be sure if they are any good and if you will ever be able to restore them. For disaster recovery you should keep a copy of your backup somewhere outside of your datacenter - just in case the whole datacenter becomes unavailable. To speed up the recovery, it’s very useful to keep a copy of the backup also on the database nodes. If your dataset is large, copying it over the network from a backup server to the database node which you want to restore may take significant time. Keeping the latest backup locally may significantly improve recovery times.

Logical Backup

Your first backup, most likely, will be a physical backup. For MySQL or MariaDB, it will be either something like xtrabackup or some sort of filesystem snapshot. Such backups are great for restoring a whole dataset or for provisioning new nodes. However, in case of deletion of a subset of data, they suffer from significant overhead. First of all, you cannot simply restore all of the data, or you would overwrite all changes that happened after the backup was created. What you are looking for is the ability to restore just a subset of data - only the rows which were accidentally removed. To do that with a physical backup, you would have to restore it on a separate host, locate the removed rows, dump them and then restore them on the production cluster. Copying and restoring hundreds of gigabytes of data just to recover a handful of rows is something we would definitely call significant overhead. To avoid it you can use logical backups - instead of storing physical data, such backups store data in a text format. This makes it easier to locate the exact data which was removed, which can then be restored directly on the production cluster. To make it even easier, you can also split such a logical backup into parts and back up each table to a separate file. If your dataset is large, it will make sense to split one huge text file as much as possible. This will make the backup inconsistent, but for the majority of cases this is not an issue - if you need to restore the whole dataset to a consistent state, you will use the physical backup, which is much faster in this regard. If you need to restore just a subset of data, the requirements for consistency are less stringent.
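For example, a minimal sketch of a per-table logical backup could look like the script below. The schema name, credentials handling and target directory are assumptions for illustration - adjust them to your environment:

#!/bin/bash
# Dump every table of the 'sbtest' schema into its own compressed file.
# Credentials are assumed to come from ~/.my.cnf; paths and schema name are placeholders.
DB=sbtest
OUTDIR=/backups/logical/$(date +%F)
mkdir -p "$OUTDIR"
for TABLE in $(mysql -BN -e "SHOW TABLES FROM $DB"); do
    mysqldump --single-transaction "$DB" "$TABLE" | gzip > "$OUTDIR/${DB}.${TABLE}.sql.gz"
done

Note that dumping table by table like this gives you per-table consistency only, which, as discussed above, is usually acceptable for partial restores.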

Point-In-Time Recovery

A backup is just the beginning - you will be able to restore your data to the point at which the backup was taken, but most likely data was removed after that time. Just by restoring missing data from the latest backup, you may lose any data that was changed after the backup. To avoid that you should implement Point-In-Time Recovery. For MySQL it basically means you will have to use binary logs to replay all the changes which happened between the moment of the backup and the data loss event. The below screenshot shows how ClusterControl can help with that.

What you will have to do is restore this backup up to the moment just before the data loss. You will have to restore it on a separate host in order not to make changes on the production cluster. Once you have the backup restored, you can log into that host, find the missing data, dump it and restore it on the production cluster.
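If you were doing this manually rather than through ClusterControl, the replay step on the separate host could look roughly like this; the binlog file names and the stop time are placeholders for this example:

# On the restore host, after restoring the latest backup: replay the binary logs
# created since that backup, stopping just before the data loss event.
mysqlbinlog --stop-datetime='2018-04-23 13:00:00' \
    /var/lib/mysql/binlog.000015 /var/lib/mysql/binlog.000016 | mysql -uroot -p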

Delayed Slave

All of the methods we discussed above have one common pain point - it takes time to restore the data. It may take longer when you have to restore all of the data and then dump only the interesting part. It may take less time if you have a logical backup and can quickly drill down to the data you want to restore, but it is by no means a quick task. You still have to find a couple of rows in a large text file. The larger it is, the more complicated the task gets - sometimes the sheer size of the file slows down all actions. One method to avoid those problems is to have a delayed slave. Slaves typically try to stay up to date with the master, but it is also possible to configure them so that they keep a distance from their master. In the below screenshot, you can see how to use ClusterControl to deploy such a slave:

In short, we have here an option to add a replication slave to the database setup and configure it to be delayed. In the screenshot above, the slave will be delayed by 3600 seconds, which is one hour. This lets you use that slave to recover removed data for up to one hour after the deletion. You will not have to restore a backup; it will be enough to run mysqldump or SELECT ... INTO OUTFILE for the missing data and you will get the data to restore on your production cluster.
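If you prefer to configure the delay by hand instead of through ClusterControl, a minimal sketch on the slave (MASTER_DELAY is available in MySQL 5.6+ and MariaDB 10.2+) would be:

# Configure an existing slave to lag one hour behind its master.
mysql -e "STOP SLAVE; CHANGE MASTER TO MASTER_DELAY = 3600; START SLAVE;"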


Restoring Data

In this section, we will go through a couple of examples of accidental data deletion and how you can recover from them. We will walk through recovery from a full data loss, and we will also show how to recover from a partial data loss when using physical and logical backups. Finally, we will show you how to restore accidentally deleted rows if you have a delayed slave in your setup.

Full Data Loss

An accidental “rm -rf” or “DROP SCHEMA myonlyschema;” has been executed and you ended up with no data at all. If you happened to also remove files other than those in the MySQL data directory, you may need to reprovision the host. To keep things simpler we will assume that only MySQL has been impacted. Let’s consider two cases, with a delayed slave and without one.

No Delayed Slave

In this case the only thing we can do is to restore the last physical backup. As all of our data has been removed, we don’t need to be worried about activity which happened after the data loss, because with no data, there is no activity. We should be worried about the activity which happened after the backup took place. This means we have to do a Point-in-Time restore. Of course, it will take longer than just restoring data from the backup. If bringing your database up quickly is more crucial than having all of the data restored, you can as well just restore the backup and be fine with it.

First of all, if you still have access to the binary logs on the server you want to restore, you can use them for PITR. We want to convert the relevant part of the binary logs to a text file for further investigation. We know that the data loss happened after 13:00:00, so let’s check which binlog file we should investigate:

root@vagrant:~# ls -alh /var/lib/mysql/binlog.*
-rw-r----- 1 mysql mysql 1.1G Apr 23 10:32 /var/lib/mysql/binlog.000001
-rw-r----- 1 mysql mysql 1.1G Apr 23 10:33 /var/lib/mysql/binlog.000002
-rw-r----- 1 mysql mysql 1.1G Apr 23 10:35 /var/lib/mysql/binlog.000003
-rw-r----- 1 mysql mysql 1.1G Apr 23 10:38 /var/lib/mysql/binlog.000004
-rw-r----- 1 mysql mysql 1.1G Apr 23 10:39 /var/lib/mysql/binlog.000005
-rw-r----- 1 mysql mysql 1.1G Apr 23 10:41 /var/lib/mysql/binlog.000006
-rw-r----- 1 mysql mysql 1.1G Apr 23 10:43 /var/lib/mysql/binlog.000007
-rw-r----- 1 mysql mysql 1.1G Apr 23 10:45 /var/lib/mysql/binlog.000008
-rw-r----- 1 mysql mysql 1.1G Apr 23 10:47 /var/lib/mysql/binlog.000009
-rw-r----- 1 mysql mysql 1.1G Apr 23 10:49 /var/lib/mysql/binlog.000010
-rw-r----- 1 mysql mysql 1.1G Apr 23 10:51 /var/lib/mysql/binlog.000011
-rw-r----- 1 mysql mysql 1.1G Apr 23 10:53 /var/lib/mysql/binlog.000012
-rw-r----- 1 mysql mysql 1.1G Apr 23 10:55 /var/lib/mysql/binlog.000013
-rw-r----- 1 mysql mysql 1.1G Apr 23 10:57 /var/lib/mysql/binlog.000014
-rw-r----- 1 mysql mysql 1.1G Apr 23 10:59 /var/lib/mysql/binlog.000015
-rw-r----- 1 mysql mysql 306M Apr 23 13:18 /var/lib/mysql/binlog.000016

As can be seen, we are interested in the last binlog file.

root@vagrant:~# mysqlbinlog --start-datetime='2018-04-23 13:00:00' --verbose /var/lib/mysql/binlog.000016 > sql.out

Once done, let’s take a look at the contents of this file. We will search for ‘drop schema’ in vim. Here’s a relevant part of the file:

# at 320358785
#180423 13:18:58 server id 1  end_log_pos 320358850 CRC32 0x0893ac86    GTID    last_committed=307804   sequence_number=307805  rbr_only=no
SET @@SESSION.GTID_NEXT= '52d08e9d-46d2-11e8-aa17-080027e8bf1b:443415'/*!*/;
# at 320358850
#180423 13:18:58 server id 1  end_log_pos 320358946 CRC32 0x487ab38e    Query   thread_id=55    exec_time=1     error_code=0
SET TIMESTAMP=1524489538/*!*/;
/*!\C utf8 *//*!*/;
SET @@session.character_set_client=33,@@session.collation_connection=33,@@session.collation_server=8/*!*/;
drop schema sbtest
/*!*/;

As we can see, we want to restore up to position 320358785. We can pass this data to the ClusterControl UI:

Delayed Slave

If we have a delayed slave and that host is enough to handle all of the traffic, we can use it and promote it to master. First though, we have to make sure it has caught up with the old master up to the point of the data loss. We will use some CLI here to make it happen. First, we need to figure out at which position the data loss happened. Then we will stop the slave and let it run up to the data loss event. We showed how to get the correct position in the previous section - by examining binary logs. We can either use that position (binlog.000016, position 320358785) or, if we use a multithreaded slave, we should use the GTID of the data loss event (52d08e9d-46d2-11e8-aa17-080027e8bf1b:443415) and replay queries up to that GTID.

First, let’s stop the slave and disable delay:

mysql> STOP SLAVE;
Query OK, 0 rows affected (0.01 sec)
mysql> CHANGE MASTER TO MASTER_DELAY = 0;
Query OK, 0 rows affected (0.02 sec)

Then we can start it up to a given binary log position.

mysql> START SLAVE UNTIL MASTER_LOG_FILE='binlog.000016', MASTER_LOG_POS=320358785;
Query OK, 0 rows affected (0.01 sec)

If we’d like to use GTID, the command will look different:

mysql> START SLAVE UNTIL SQL_BEFORE_GTIDS = '52d08e9d-46d2-11e8-aa17-080027e8bf1b:443415';
Query OK, 0 rows affected (0.01 sec)

Once the replication has stopped (meaning all of the events we asked for have been executed), we should verify that the host contains the missing data. If so, you can promote it to master and then rebuild the other hosts using the new master as the source of data.

This is not always the best option. It all depends on how delayed your slave is - if it is delayed by a couple of hours, it may not make sense to wait for it to catch up, especially if write traffic is heavy in your environment. In such a case, it’s most likely faster to rebuild hosts using the physical backup. On the other hand, if you have a rather small volume of traffic, this could be a nice way to quickly fix the issue, promote the new master and get on with serving traffic, while the rest of the nodes are rebuilt in the background.

Partial Data Loss - Physical Backup

In case of a partial data loss, physical backups can be inefficient, but as those are the most common type of backup, it’s very important to know how to use them for a partial restore. The first step will always be to restore the backup up to a point in time before the data loss event. It’s also very important to restore it on a separate host. ClusterControl uses xtrabackup for physical backups, so we will show how to use it. Let’s assume we ran the following incorrect query:

DELETE FROM sbtest1 WHERE id < 23146;

We wanted to delete just a single row (‘=’ in the WHERE clause); instead, we deleted a bunch of them (‘<’ in the WHERE clause). Let’s take a look at the binary logs to find at which position the issue happened. We will use that position to restore the backup to.

mysqlbinlog --verbose /var/lib/mysql/binlog.000003 > bin.out

Now, let’s look at the output file and see what we can find there. We are using row-based replication, therefore we will not see the exact SQL that was executed. Instead (as long as we use the --verbose flag with mysqlbinlog) we will see events like below:

### DELETE FROM `sbtest`.`sbtest1`
### WHERE
###   @1=999296
###   @2=1009782
###   @3='96260841950-70557543083-97211136584-70982238821-52320653831-03705501677-77169427072-31113899105-45148058587-70555151875'
###   @4='84527471555-75554439500-82168020167-12926542460-82869925404'

As can be seen, MySQL identifies the rows to delete using a very precise WHERE condition. The mysterious markers in the human-readable comments, “@1”, “@2”, mean “first column”, “second column”. We know that the first column is ‘id’, which is what we are interested in. We need to find a large DELETE event on the ‘sbtest1’ table. The comments which follow should mention an id of 1, then 2, then 3 and so on - all the way up to an id of 23145. All should be executed in a single transaction (a single event in the binary log). After analysing the output using ‘less’, we found:

### DELETE FROM `sbtest`.`sbtest1`
### WHERE
###   @1=1
###   @2=1006036
###   @3='123'
###   @4='43683718329-48150560094-43449649167-51455516141-06448225399'
### DELETE FROM `sbtest`.`sbtest1`
### WHERE
###   @1=2
###   @2=1008980
###   @3='123'
###   @4='05603373460-16140454933-50476449060-04937808333-32421752305'

The event to which those comments are attached started at:

#180427  8:09:21 server id 1  end_log_pos 29600687 CRC32 0x8cfdd6ae     Xid = 307686
COMMIT/*!*/;
# at 29600687
#180427  8:09:21 server id 1  end_log_pos 29600752 CRC32 0xb5aa18ba     GTID    last_committed=42844    sequence_number=42845   rbr_only=yes
/*!50718 SET TRANSACTION ISOLATION LEVEL READ COMMITTED*//*!*/;
SET @@SESSION.GTID_NEXT= '0c695e13-4931-11e8-9f2f-080027e8bf1b:55893'/*!*/;
# at 29600752
#180427  8:09:21 server id 1  end_log_pos 29600826 CRC32 0xc7b71da5     Query   thread_id=44    exec_time=0     error_code=0
SET TIMESTAMP=1524816561/*!*/;
/*!\C utf8 *//*!*/;
SET @@session.character_set_client=33,@@session.collation_connection=33,@@session.collation_server=8/*!*/;
BEGIN
/*!*/;
# at 29600826

So, we want to restore the backup up to the previous commit at position 29600687. Let’s do that now. We’ll use an external server for that. We will restore the backup up to that position and we will keep the restore server up and running so we can later extract the missing data.
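For reference, if you were doing this step by hand instead of through ClusterControl, the rough sequence on the restore host could look like the sketch below. The backup directory, the binlog location and the start position (which you would take from xtrabackup_binlog_info) are assumptions for this example:

# 1. Prepare the xtrabackup backup and copy it into an empty datadir.
xtrabackup --prepare --target-dir=/restore/backup-dir
xtrabackup --copy-back --target-dir=/restore/backup-dir --datadir=/var/lib/mysql
chown -R mysql:mysql /var/lib/mysql
systemctl start mysql

# 2. Replay the binary logs created after the backup, stopping right before the bad DELETE.
#    The start position (154 here is just an assumed value) should come from
#    xtrabackup_binlog_info, so events already contained in the backup are not applied twice.
mysqlbinlog --start-position=154 --stop-position=29600687 \
    /var/lib/mysql/binlog.000003 | mysql -uroot -p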

Once restore is completed, let’s make sure our data has been recovered:

mysql> SELECT COUNT(*) FROM sbtest.sbtest1 WHERE id < 23146;
+----------+
| COUNT(*) |
+----------+
|    23145 |
+----------+
1 row in set (0.03 sec)

Looks good. Now we can extract this data into a file which we will load back on the master.

mysql> SELECT * FROM sbtest.sbtest1 WHERE id < 23146 INTO OUTFILE 'missing.sql';
ERROR 1290 (HY000): The MySQL server is running with the --secure-file-priv option so it cannot execute this statement

Something is not right - this is because the server is configured to write files only to a particular location. It’s all about security: we don’t want to let users save contents anywhere they like. Let’s check where we can save our file:

mysql> SHOW VARIABLES LIKE "secure_file_priv";
+------------------+-----------------------+
| Variable_name    | Value                 |
+------------------+-----------------------+
| secure_file_priv | /var/lib/mysql-files/ |
+------------------+-----------------------+
1 row in set (0.13 sec)

Ok, let’s try one more time:

mysql> SELECT * FROM sbtest.sbtest1 WHERE id < 23146 INTO OUTFILE '/var/lib/mysql-files/missing.sql';
Query OK, 23145 rows affected (0.05 sec)

Now it looks much better. Let’s copy the data to the master:

root@vagrant:~# scp /var/lib/mysql-files/missing.sql 10.0.0.101:/var/lib/mysql-files/
missing.sql                                                                                                                                                                      100% 1744KB   1.7MB/s   00:00

Now it’s time to load the missing rows on the master and test if it succeeded:

mysql> LOAD DATA INFILE '/var/lib/mysql-files/missing.sql' INTO TABLE sbtest.sbtest1;
Query OK, 23145 rows affected (2.22 sec)
Records: 23145  Deleted: 0  Skipped: 0  Warnings: 0

mysql> SELECT COUNT(*) FROM sbtest.sbtest1 WHERE id < 23146;
+----------+
| COUNT(*) |
+----------+
|    23145 |
+----------+
1 row in set (0.00 sec)

That’s all, we restored our missing data.

Partial Data Loss - Logical Backup

In the previous section, we restored lost data using a physical backup and an external server. What if we had a logical backup created? Let’s take a look. First, let’s verify that we do have a logical backup:

root@vagrant:~# ls -alh /root/backups/BACKUP-13/
total 5.8G
drwx------ 2 root root 4.0K Apr 27 07:35 .
drwxr-x--- 5 root root 4.0K Apr 27 07:14 ..
-rw-r--r-- 1 root root 2.4K Apr 27 07:35 cmon_backup.metadata
-rw------- 1 root root 5.8G Apr 27 07:35 mysqldump_2018-04-27_071434_complete.sql.gz

Yes, it’s there. Now, it’s time to decompress it.

root@vagrant:~# mkdir /root/restore
root@vagrant:~# zcat /root/backups/BACKUP-13/mysqldump_2018-04-27_071434_complete.sql.gz > /root/restore/backup.sql

When you look into it, you will see that the data is stored in multi-value INSERT format. For example:

INSERT INTO `sbtest1` VALUES (1,1006036,'18034632456-32298647298-82351096178-60420120042-90070228681-93395382793-96740777141-18710455882-88896678134-41810932745','43683718329-48150560094-43449649167-51455516141-06448225399'),(2,1008980,'69708345057-48265944193-91002879830-11554672482-35576538285-03657113365-90301319612-18462263634-56608104414-27254248188','05603373460-16140454933-50476449060-04937808333-32421752305')

All we need to do now is to pinpoint where our table is located and then where the rows of interest to us are stored. First, knowing mysqldump patterns (drop table, create a new one, disable indexes, insert data), let’s figure out which line contains the CREATE TABLE statement for the ‘sbtest1’ table:

root@vagrant:~/restore# grep -n "CREATE TABLE \`sbtest1\`" backup.sql > out
root@vagrant:~/restore# cat out
971:CREATE TABLE `sbtest1` (

Now, using a method of trial and error, we need to figure out where to look for our rows. We’ll show you the final command we came up with. The whole trick is to try and print different ranges of lines using sed and then check if the last line contains rows close to, but later than, what we are searching for. In the command below we look for lines between 971 (CREATE TABLE) and 993. We also ask sed to quit once it reaches line 994, as the rest of the file is of no interest to us:

root@vagrant:~/restore# sed -n '971,993p; 994q' backup.sql > 1.sql
root@vagrant:~/restore# tail -n 1 1.sql  | less

The output looks like below:

INSERT INTO `sbtest1` VALUES (31351,1007187,'23938390896-69688180281-37975364313-05234865797-89299459691-74476188805-03642252162-40036598389-45190639324-97494758464','60596247401-06173974673-08009930825-94560626453-54686757363'),

This means that our row range (up to the row with an id of 23145) is close. Next, it’s all about manually cleaning up the file. We want it to start with the first row we need to restore:

INSERT INTO `sbtest1` VALUES (1,1006036,'18034632456-32298647298-82351096178-60420120042-90070228681-93395382793-96740777141-18710455882-88896678134-41810932745','43683718329-48150560094-43449649167-51455516141-06448225399')

And end up with the last row to restore:

(23145,1001595,'37250617862-83193638873-99290491872-89366212365-12327992016-32030298805-08821519929-92162259650-88126148247-75122945670','60801103752-29862888956-47063830789-71811451101-27773551230');

We had to trim some of the unneeded data (it is a multi-value insert), but after all of this we have a file which we can load back on the master.

root@vagrant:~/restore# cat 1.sql | mysql -usbtest -psbtest -h10.0.0.101 sbtest
mysql: [Warning] Using a password on the command line interface can be insecure.

Finally, last check:

mysql> SELECT COUNT(*) FROM sbtest.sbtest1 WHERE id < 23146;
+----------+
| COUNT(*) |
+----------+
|    23145 |
+----------+
1 row in set (0.00 sec)

All is good, data has been restored.

Partial Data Loss, Delayed Slave

In this case, we will not go through the whole process. We already described how to identify the position of a data loss event in the binary logs. We also described how to stop a delayed slave and start the replication again, up to a point before the data loss event. We also explained how to use SELECT INTO OUTFILE and LOAD DATA INFILE to export data from an external server and load it on the master. That’s all you need. As long as the data is still on the delayed slave, you have to stop it. Then you need to locate the position just before the data loss event, start the slave up to that point and, once this is done, use the delayed slave to extract the data which was deleted, copy the file to the master and load it to restore the data.
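Put together, a condensed sketch of the whole flow, reusing the commands and example positions shown earlier in this post, could look like this:

# On the delayed slave: stop it, remove the delay and replay events up to just before the bad query.
mysql -e "STOP SLAVE; CHANGE MASTER TO MASTER_DELAY = 0;"
mysql -e "START SLAVE UNTIL MASTER_LOG_FILE='binlog.000016', MASTER_LOG_POS=320358785;"

# Once the slave stops at that position, export the rows that were deleted on the master...
mysql -e "SELECT * FROM sbtest.sbtest1 WHERE id < 23146 INTO OUTFILE '/var/lib/mysql-files/missing.sql'"

# ...copy the file to the master and load it back.
scp /var/lib/mysql-files/missing.sql 10.0.0.101:/var/lib/mysql-files/
mysql -h10.0.0.101 -e "LOAD DATA INFILE '/var/lib/mysql-files/missing.sql' INTO TABLE sbtest.sbtest1"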

Conclusion

Restoring lost data is not fun, but if you follow the steps we went through in this blog, you will have a good chance of recovering what you lost.


New Webinar on How to Migrate to Galera Cluster for MySQL & MariaDB


Join us on Tuesday May 29th for this new webinar with Severalnines Support Engineer Bart Oles, who will walk you through what you need to know in order to migrate from standalone or a master-slave MySQL/MariaDB setup to Galera Cluster.

When considering such a migration, plenty of questions typically come up, such as: how do we migrate? Does the schema or application change? What are the limitations? Can a migration be done online, without service interruption? What are the potential risks?

Galera Cluster has become a mainstream option for high availability MySQL and MariaDB. And though it is now known as a credible replacement for traditional MySQL master-slave architectures, it is not a drop-in replacement.

It has some characteristics that make it unsuitable for certain use cases; however, most applications can still be adapted to run on it.

The benefits are clear: multi-master InnoDB setup with built-in failover and read scalability.

Join us on May 29th for this walk-through on how to migrate to Galera Cluster for MySQL and MariaDB.

Sign up below!

Date, Time & Registration

Europe/MEA/APAC

Tuesday, May 29th at 09:00 BST / 10:00 CEST (Germany, France, Sweden)

Register Now

North America/LatAm

Tuesday, May 29th at 09:00 PDT (US) / 12:00 EDT (US)

Register Now

Agenda

  • Application use cases for Galera
  • Schema design
  • Events and Triggers
  • Query design
  • Migrating the schema
  • Load balancer and VIP
  • Loading initial data into the cluster
  • Limitations:
    • Cluster technology
    • Application vendor support
  • Performing Online Migration to Galera
  • Operational management checklist
  • Belts and suspenders: Plan B
  • Demo

Speaker

Bartlomiej Oles is a MySQL and Oracle DBA with over 15 years of experience in managing highly available production systems at IBM, Nordea Bank, Acxiom, Lufthansa, and other Fortune 500 companies. In the past five years, his focus has been on building and applying automation tools to manage multi-datacenter database environments.

We look forward to “seeing” you there and to insightful discussions!

How to Recover Galera Cluster or MySQL Replication from Split Brain Syndrome


You may have heard about the term “split brain”. What is it? How does it affect your clusters? In this blog post we will discuss what exactly it is, what danger it may pose to your database, how we can prevent it, and if everything goes wrong, how to recover from it.

Long gone are the days of single instances; nowadays almost all databases run in replication groups or clusters. This is great for high availability and scalability, but a distributed database introduces new dangers and limitations. One case which can be deadly is a network split. Imagine a cluster of multiple nodes which, due to network issues, was split into two parts. For obvious reasons (data consistency), both parts shouldn’t handle traffic at the same time, as they are isolated from each other and data cannot be transferred between them. It is also wrong from the application point of view, even if there would eventually be a way to sync the data (although reconciliation of two datasets is not trivial). For a while, part of the application would be unaware of the changes made by other application hosts which access the other part of the database cluster. This can lead to serious problems.

The condition in which the cluster has been divided in two or more parts that are willing to accept writes is called “split brain”.

The biggest problem with split brain is data drift, as writes happen on both parts of the cluster. None of the MySQL flavors provides automated means of merging datasets that have diverged. You will not find such a feature in MySQL replication, Group Replication or Galera. Once the data has diverged, the only option is to use one of the parts of the cluster as the source of truth and discard changes executed on the other part - unless we can follow some manual process in order to merge the data.

This is why we will start with how to prevent split brain from happening. This is so much easier than having to fix any data discrepancy.

How to prevent split brain

The exact solution depends on the type of the database and the setup of the environment. We will take a look at some of the most common cases for Galera Cluster and MySQL Replication.

Galera cluster

Galera has a built-in “circuit breaker” to handle split brain: it relies on a quorum mechanism. If a majority (50% + 1) of the nodes is available in the cluster, Galera will operate normally. If there is no majority, Galera will stop serving traffic and switch to a so-called “non-Primary” state. This is pretty much all you need to deal with a split brain situation while using Galera. Sure, there are manual methods to force Galera into the “Primary” state even if there’s no majority. The thing is, unless you do that, you should be safe.
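A quick way to check which side of a split your node ended up on is to look at the wsrep status counters; anything other than “Primary” in wsrep_cluster_status means the node is refusing traffic:

# Check whether this node belongs to the primary component and how many nodes it currently sees.
mysql -e "SHOW GLOBAL STATUS WHERE Variable_name IN ('wsrep_cluster_status','wsrep_cluster_size','wsrep_ready')"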

The way quorum is calculated has important repercussions - at a single datacenter level, you want to have an odd number of nodes. Three nodes give you a tolerance for failure of one node (2 nodes match the requirement of more than 50% of the nodes in the cluster being available). Five nodes will give you a tolerance for failure of two nodes (5 - 2 = 3, which is more than 50% of 5 nodes). On the other hand, using four nodes will not improve your tolerance over a three-node cluster. It would still handle only a failure of one node (4 - 1 = 3, more than 50% of 4), while a failure of two nodes will render the cluster unusable (4 - 2 = 2, just 50%, not more).

While deploying Galera cluster in a single datacenter, please keep in mind that, ideally, you would like to distribute nodes across multiple availability zones (separate power source, network, etc.) - as long as they do exist in your datacenter, that is. A simple setup may look like below:

At the multi-datacenter level, those considerations are also applicable. If you want the Galera cluster to automatically handle datacenter failures, you should use an odd number of datacenters. To reduce costs, you can use a Galera arbitrator in one of them instead of a database node. The Galera arbitrator (garbd) is a process which takes part in the quorum calculation but does not contain any data. This makes it possible to run it even on very small instances, as it is not resource-intensive - although the network connectivity has to be good, as it ‘sees’ all the replication traffic. An example setup may look like the diagram below:

MySQL Replication

With MySQL replication, the biggest issue is that there is no quorum mechanism built in, as there is in Galera cluster. Therefore more steps are required to ensure that your setup will not be affected by a split brain.

One method is to avoid cross-datacenter automated failovers. You can configure your failover solution (be it ClusterControl, MHA or Orchestrator) to fail over only within a single datacenter. If there was a full datacenter outage, it would be up to the admin to decide how to fail over and how to ensure that the servers in the failed datacenter will not be used.

There are options to make it more automated. You can use Consul to store data about the nodes in the replication setup, including which one of them is the master. Then it will be up to the admin (or some scripting) to update this entry and move writes to the second datacenter. You can also benefit from an Orchestrator/Raft setup, where Orchestrator nodes can be distributed across multiple datacenters and detect split brain. Based on this, you could take different actions like, as we mentioned previously, updating entries in Consul or etcd. The point is that this is a much more complex environment to set up and automate than Galera cluster. Below you can find an example of a multi-datacenter setup for MySQL replication.

Please keep in mind that you still have to create scripts to make it work, i.e. monitor the Orchestrator nodes for a split brain and take the necessary actions to implement STONITH and ensure that the master in datacenter A will not be used once the network converges and connectivity is restored.

Split brain happened - what to do next?

The worst case scenario happened and we have data drift. We will try to give you some hints on what can be done here. Unfortunately, the exact steps will depend mostly on your schema design, so it will not be possible to write a precise how-to guide.

What you have to keep in mind is that the ultimate goal will be to copy data from one master to the other and recreate all relations between tables.

First of all, you have to identify which node will continue serving data as master. This is the dataset into which you will merge the data stored on the other “master” instance. Once that’s done, you have to identify the data from the old master which is missing on the current master. This will be manual work. If you have timestamps in your tables, you can leverage them to pinpoint the missing data. Ultimately, the binary logs will contain all data modifications, so you can rely on them. You may also have to rely on your knowledge of the data structure and relations between tables. If your data is normalized, one record in one table could be related to records in other tables. For example, your application may insert data into a “user” table which is related to an “address” table using user_id. You will have to find all related rows and extract them.
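As an illustration, if your tables carry a modification timestamp, a first pass at pinpointing the drifted rows on the old master could be as simple as the queries below. The schema name, the updated_at column and the exact cut-off time are hypothetical:

# On the old master: list rows changed after the moment the split brain started,
# together with their related rows in child tables.
mysql -e "SELECT * FROM user WHERE updated_at >= '2018-04-23 13:00:00'" myschema
mysql -e "SELECT a.* FROM address a JOIN user u ON u.id = a.user_id WHERE u.updated_at >= '2018-04-23 13:00:00'" myschema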

The next step will be to load this data into the new master. Here comes the tricky part - if you prepared your setups beforehand, this could be simply a matter of running a couple of inserts. If not, this may be rather complex. It’s all about primary key and unique index values. If your primary key values are generated as unique on each server, using some sort of UUID generator or using the auto_increment_increment and auto_increment_offset settings in MySQL, you can be sure that the data from the old master you have to insert won’t cause primary key or unique key conflicts with the data on the new master. Otherwise, you may have to manually modify the data from the old master to ensure it can be inserted correctly. It sounds complex, so let’s take a look at an example.

Let’s imagine we insert rows using auto_increment on node A, which is a master. For the sake of simplicity, we will focus on a single row only. There are columns ‘id’ and ‘value’.

If we insert it without any particular setup, we’ll see entries like below:

1000, ‘some value0’
1001, ‘some value1’
1002, ‘some value2’
1003, ‘some value3’

Those will replicate to the slave (B). If the split brain happens and writes are executed on both the old and the new master, we will end up with the following situation:

A

1000, ‘some value0’
1001, ‘some value1’
1002, ‘some value2’
1003, ‘some value3’
1004, ‘some value4’
1005, ‘some value5’
1006, ‘some value7’

B

1000, ‘some value0’
1001, ‘some value1’
1002, ‘some value2’
1003, ‘some value3’
1004, ‘some value6’
1005, ‘some value8’
1006, ‘some value9’

As you can see, there’s no way to simply dump records with ids of 1004, 1005 and 1006 from node A and store them on node B, because we would end up with duplicate primary key entries. What needs to be done is to change the values of the id column in the rows that will be inserted to a value larger than the maximum value of the id column in the table. This is all that’s needed for single rows. For more complex relations, where multiple tables are involved, you may have to make the changes in multiple locations.
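A minimal sketch of such a remap, assuming the example table is called mytable and the rows exported from node A have been loaded into a staging table mytable_from_a on node B (the table, schema and staging names are hypothetical):

# Shift the conflicting ids above the current maximum on node B, then insert the rows.
mysql -e "
  SET @offset := (SELECT MAX(id) FROM mytable);
  INSERT INTO mytable (id, value)
  SELECT id + @offset, value FROM mytable_from_a;" mydb

Remember that, for related tables, the same offset would have to be applied to the foreign key columns as well.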

On the other hand, if we had anticipated this potential problem and configured our nodes to store odd ids on node A and even ids on node B, the problem would have been much easier to solve.

Node A was configured with auto_increment_offset = 1 and auto_increment_increment = 2

Node B was configured with auto_increment_offset = 2 and auto_increment_increment = 2
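One way to apply these settings (a sketch only - you would also persist the same values in my.cnf so they survive a restart):

# On node A:
mysql -e "SET GLOBAL auto_increment_offset = 1; SET GLOBAL auto_increment_increment = 2;"
# On node B:
mysql -e "SET GLOBAL auto_increment_offset = 2; SET GLOBAL auto_increment_increment = 2;"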

This is how the data would look on node A before the split brain:

1001, ‘some value0’
1003, ‘some value1’
1005, ‘some value2’
1007, ‘some value3’

When the split brain happens, it will look like below.

Node A:

1001, ‘some value0’
1003, ‘some value1’
1005, ‘some value2’
1007, ‘some value3’
1009, ‘some value4’
1011, ‘some value5’
1013, ‘some value7’

Node B:

1001, ‘some value0’
1003, ‘some value1’
1005, ‘some value2’
1007, ‘some value3’
1008, ‘some value6’
1010, ‘some value8’
1012, ‘some value9’

Now we can easily copy missing data from node A:

1009, ‘some value4’
1011, ‘some value5’
1013, ‘some value7’

And load it to node B ending up with following data set:

1001, ‘some value0’
1003, ‘some value1’
1005, ‘some value2’
1007, ‘some value3’
1008, ‘some value6’
1009, ‘some value4’
1010, ‘some value8’
1011, ‘some value5’
1012, ‘some value9’
1013, ‘some value7’

Sure, rows are not in the original order, but this should be ok. In the worst case scenario you will have to order by ‘value’ column in queries and maybe add an index on it to make the sorting fast.

Now, imagine hundreds or thousands of rows and a highly normalized table structure - to restore one row may mean you will have to restore several of them in additional tables. With a need to change id’s (because you didn’t have protective settings in place) across all related rows and all of this being manual work, you can imagine that this is not the best situation to be in. It takes time to recover and it is an error-prone process. Luckily, as we discussed at the beginning, there are means to minimize chances that split brain will impact your system or to reduce the work that needs to be done to sync back your nodes. Make sure you use them and stay prepared.

MySQL on Docker: Running a MariaDB Galera Cluster without Container Orchestration Tools - Part 1


Container orchestration tools simplify the running of a distributed system, by deploying and redeploying containers and handling any failures that occur. One might need to move applications around, e.g., to handle updates, scaling, or underlying host failures. While this sounds great, it does not always work well with a strongly consistent database cluster like Galera. You can’t just move database nodes around, they are not stateless applications. Also, the order in which you perform operations on a cluster has high significance. For instance, restarting a Galera cluster has to start from the most advanced node, or else you will lose data. Therefore, we’ll show you how to run Galera Cluster on Docker without a container orchestration tool, so you have total control.

In this blog post, we are going to look into how to run a MariaDB Galera Cluster on Docker containers using the standard Docker image on multiple Docker hosts, without the help of orchestration tools like Swarm or Kubernetes. This approach is similar to running a Galera Cluster on standard hosts, but the process management is configured through Docker.

Before we jump further into details, we assume you have installed Docker, disabled SElinux/AppArmor and cleared up the rules inside iptables, firewalld or ufw (whichever you are using). The following are three dedicated Docker hosts for our database cluster:

  • host1.local - 192.168.55.161
  • host2.local - 192.168.55.162
  • host3.local - 192.168.55.163

Multi-host Networking

First of all, the default Docker networking is bound to the local host. Docker Swarm introduces another networking layer called overlay network, which extends the container internetworking to multiple Docker hosts in a cluster called Swarm. Long before this integration came into place, there were many network plugins developed to support this - Flannel, Calico, Weave are some of them.

Here, we are going to use Weave as the Docker network plugin for multi-host networking. This is mainly due to how simple it is to get it installed and running, and its support for a DNS resolver (containers running under this network can resolve each other's hostname). There are two ways to get Weave running - via systemd or through Docker. We are going to install it as a systemd unit, so it's independent of the Docker daemon (otherwise, we would have to start Docker first before Weave gets activated).

  1. Download and install Weave:

    $ curl -L git.io/weave -o /usr/local/bin/weave
    $ chmod a+x /usr/local/bin/weave
  2. Create a systemd unit file for Weave:

    $ cat > /etc/systemd/system/weave.service << EOF
    [Unit]
    Description=Weave Network
    Documentation=http://docs.weave.works/weave/latest_release/
    Requires=docker.service
    After=docker.service
    [Service]
    EnvironmentFile=-/etc/sysconfig/weave
    ExecStartPre=/usr/local/bin/weave launch --no-restart $PEERS
    ExecStart=/usr/bin/docker attach weave
    ExecStop=/usr/local/bin/weave stop
    [Install]
    WantedBy=multi-user.target
    EOF
  3. Define IP addresses or hostname of the peers inside /etc/sysconfig/weave:

    $ echo 'PEERS="192.168.55.161 192.168.55.162 192.168.55.163"'> /etc/sysconfig/weave
  4. Start and enable Weave on boot:

    $ systemctl start weave
    $ systemctl enable weave

Repeat the above 4 steps on all Docker hosts. Verify with the following command once done:

$ weave status

The number of peers is what we are looking for. It should be 3:

          ...
          Peers: 3 (with 6 established connections)
          ...

Running a Galera Cluster

Now the network is ready, it's time to fire our database containers and form a cluster. The basic rules are:

  • Container must be created under --net=weave to have multi-host connectivity.
  • Container ports that need to be published are 3306, 4444, 4567, 4568.
  • The Docker image must support Galera. If you'd like to use Oracle MySQL, then get the Codership version. If you'd like Percona's, use this image instead. In this blog post, we are using MariaDB's.

The reasons we chose MariaDB as the Galera cluster vendor are:

  • Galera is embedded into MariaDB, starting from MariaDB 10.1.
  • The MariaDB image is maintained by the Docker and MariaDB teams.
  • One of the most popular Docker images out there.

Bootstrapping a Galera Cluster has to be performed in sequence. Firstly, the most up-to-date node must be started with "wsrep_cluster_address=gcomm://". Then, start the remaining nodes with a full address consisting of all nodes in the cluster, e.g., "wsrep_cluster_address=gcomm://node1,node2,node3". To accomplish these steps using containers, we have to take some extra steps to ensure all containers are running homogeneously. So the plan is:

  1. We would need to start with 4 containers in this order - mariadb0 (bootstrap), mariadb2, mariadb3, mariadb1.
  2. Container mariadb0 will be using the same datadir and configdir as mariadb1.
  3. Use mariadb0 on host1 for the first bootstrap, then start mariadb2 on host2, mariadb3 on host3.
  4. Remove mariadb0 on host1 to give way for mariadb1.
  5. Lastly, start mariadb1 on host1.

At the end of the day, you would have a three-node Galera Cluster (mariadb1, mariadb2, mariadb3). The first container (mariadb0) is a transient container for bootstrapping purposes only, using the cluster address "gcomm://". It shares the same datadir and configdir as mariadb1 and will be removed once the cluster is formed (mariadb2 and mariadb3 are up) and the nodes are synced.

By default, Galera is turned off in MariaDB and needs to be enabled with a flag called wsrep_on (set to ON) and wsrep_provider (set to the Galera library path) plus a number of Galera-related parameters. Thus, we need to define a custom configuration file for the container to configure Galera correctly.

Let's start with the first container, mariadb0. Create a file under /containers/mariadb0/conf.d/my.cnf and add the following lines:

$ mkdir -p /containers/mariadb0/conf.d
$ cat /containers/mariadb0/conf.d/my.cnf
[mysqld]

default_storage_engine          = InnoDB
binlog_format                   = ROW

innodb_flush_log_at_trx_commit  = 0
innodb_flush_method             = O_DIRECT
innodb_file_per_table           = 1
innodb_autoinc_lock_mode        = 2
innodb_lock_schedule_algorithm  = FCFS # MariaDB >10.1.19 and >10.2.3 only

wsrep_on                        = ON
wsrep_provider                  = /usr/lib/galera/libgalera_smm.so
wsrep_sst_method                = xtrabackup-v2

Since the image doesn't come with MariaDB Backup (which is the preferred SST method for MariaDB 10.1 and MariaDB 10.2), we are going to stick with xtrabackup-v2 for the time being.

To perform the first bootstrap for the cluster, run the bootstrap container (mariadb0) on host1:

$ docker run -d \
        --name mariadb0 \
        --hostname mariadb0.weave.local \
        --net weave \
        --publish "3306" \
        --publish "4444" \
        --publish "4567" \
        --publish "4568" \
        $(weave dns-args) \
        --env MYSQL_ROOT_PASSWORD="PM7%cB43$sd@^1" \
        --env MYSQL_USER=proxysql \
        --env MYSQL_PASSWORD=proxysqlpassword \
        --volume /containers/mariadb1/datadir:/var/lib/mysql \
        --volume /containers/mariadb1/conf.d:/etc/mysql/mariadb.conf.d \
        mariadb:10.2.15 \
        --wsrep_cluster_address=gcomm:// \
        --wsrep_sst_auth="root:PM7%cB43$sd@^1" \
        --wsrep_node_address=mariadb0.weave.local

The parameters used in the above command are:

  • --name, creates the container named "mariadb0",
  • --hostname, assigns the container a hostname "mariadb0.weave.local",
  • --net, places the container in the weave network for multi-host networking support,
  • --publish, exposes ports 3306, 4444, 4567, 4568 on the container to the host,
  • $(weave dns-args), configures DNS resolver for this container. This command can be translated into Docker run as "--dns=172.17.0.1 --dns-search=weave.local.",
  • --env MYSQL_ROOT_PASSWORD, the MySQL root password,
  • --env MYSQL_USER, creates "proxysql" user to be used later with ProxySQL for database routing,
  • --env MYSQL_PASSWORD, the "proxysql" user password,
  • --volume /containers/mariadb1/datadir:/var/lib/mysql, creates /containers/mariadb1/datadir if it does not exist and maps it to /var/lib/mysql (the MySQL datadir) of the container (for the bootstrap node, this could be skipped),
  • --volume /containers/mariadb1/conf.d:/etc/mysql/mariadb.conf.d, mounts the files under directory /containers/mariadb1/conf.d of the Docker host, into the container at /etc/mysql/mariadb.conf.d.
  • mariadb:10.2.15, uses MariaDB 10.2.15 image from here,
  • --wsrep_cluster_address, Galera connection string for the cluster. "gcomm://" means bootstrap. For the rest of the containers, we are going to use a full address instead.
  • --wsrep_sst_auth, authentication string for SST user. Use the same user as root,
  • --wsrep_node_address, the node hostname, in this case we are going to use the FQDN provided by Weave.

The bootstrap container contains several key things:

  • The name, hostname and wsrep_node_address are mariadb0, but it uses the volumes of mariadb1.
  • The cluster address is "gcomm://"
  • There are two additional --env parameters - MYSQL_USER and MYSQL_PASSWORD. These parameters will create an additional user for our ProxySQL monitoring purposes.

Verify with the following command:

$ docker ps
$ docker logs -f mariadb0

Once you see the following line, it indicates the bootstrap process is completed and Galera is active:

2018-05-30 23:19:30 139816524539648 [Note] WSREP: Synchronized with group, ready for connections

Create the directory to load our custom configuration file in the remaining hosts:

$ mkdir -p /containers/mariadb2/conf.d # on host2
$ mkdir -p /containers/mariadb3/conf.d # on host3

Then, copy the my.cnf that we've created for mariadb0 and mariadb1 to mariadb2 and mariadb3 respectively:

$ scp /containers/mariadb1/conf.d/my.cnf /containers/mariadb2/conf.d/ # on host1
$ scp /containers/mariadb1/conf.d/my.cnf /containers/mariadb3/conf.d/ # on host1

Next, create another 2 database containers (mariadb2 and mariadb3) on host2 and host3 respectively:

$ docker run -d \
        --name ${NAME} \
        --hostname ${NAME}.weave.local \
        --net weave \
        --publish "3306:3306" \
        --publish "4444" \
        --publish "4567" \
        --publish "4568" \
        $(weave dns-args) \
        --env MYSQL_ROOT_PASSWORD="PM7%cB43$sd@^1" \
        --volume /containers/${NAME}/datadir:/var/lib/mysql \
        --volume /containers/${NAME}/conf.d:/etc/mysql/mariadb.conf.d \
        mariadb:10.2.15 \
        --wsrep_cluster_address=gcomm://mariadb0.weave.local,mariadb1.weave.local,mariadb2.weave.local,mariadb3.weave.local \
        --wsrep_sst_auth="root:PM7%cB43$sd@^1" \
        --wsrep_node_address=${NAME}.weave.local

** Replace ${NAME} with mariadb2 or mariadb3 respectively.

However, there is a catch. The entrypoint script checks the mysqld service in the background after database initialization by using the MySQL root user without a password. Since Galera automatically performs synchronization through SST or IST when starting up, the MySQL root user password will change, mirroring the bootstrapped node. Thus, you would see the following error during the first start up:

2018-05-30 23:27:13 140003794790144 [Warning] Access denied for user 'root'@'localhost' (using password: NO)
MySQL init process in progress…
MySQL init process failed.

The trick is to restart the failed containers once more, because this time, the MySQL datadir would have been created (in the first run attempt) and it would skip the database initialization part:

$ docker start mariadb2 # on host2
$ docker start mariadb3 # on host3

Once started, verify by looking at the following line:

$ docker logs -f mariadb2
…
2018-05-30 23:28:39 139808069601024 [Note] WSREP: Synchronized with group, ready for connections

At this point, there are 3 containers running: mariadb0, mariadb2 and mariadb3. Take note that mariadb0 is started using the bootstrap command (gcomm://), which means that if the container is automatically restarted by Docker in the future, it could potentially become disjointed from the primary component. Thus, we need to remove this container and replace it with mariadb1, using the same Galera connection string as the rest and the same datadir and configdir as mariadb0.

First, stop mariadb0 by sending SIGTERM (to ensure the node is shut down gracefully):

$ docker kill -s 15 mariadb0

Then, start mariadb1 on host1 using a similar command to mariadb2 or mariadb3:

$ docker run -d \
        --name mariadb1 \
        --hostname mariadb1.weave.local \
        --net weave \
        --publish "3306:3306" \
        --publish "4444" \
        --publish "4567" \
        --publish "4568" \
        $(weave dns-args) \
        --env MYSQL_ROOT_PASSWORD="PM7%cB43$sd@^1" \
        --volume /containers/mariadb1/datadir:/var/lib/mysql \
        --volume /containers/mariadb1/conf.d:/etc/mysql/mariadb.conf.d \
        mariadb:10.2.15 \
        --wsrep_cluster_address=gcomm://mariadb0.weave.local,mariadb1.weave.local,mariadb2.weave.local,mariadb3.weave.local \
        --wsrep_sst_auth="root:PM7%cB43$sd@^1" \
        --wsrep_node_address=mariadb1.weave.local

This time, you don't need to do the restart trick because the MySQL datadir already exists (created by mariadb0). Once the container is started, verify that the cluster size is 3, the cluster status is Primary and the local state is Synced:

$ docker exec -it mariadb3 mysql -uroot "-pPM7%cB43$sd@^1" -e 'select variable_name, variable_value from information_schema.global_status where variable_name in ("wsrep_cluster_size", "wsrep_local_state_comment", "wsrep_cluster_status", "wsrep_incoming_addresses")'
+---------------------------+-------------------------------------------------------------------------------+
| variable_name             | variable_value                                                                |
+---------------------------+-------------------------------------------------------------------------------+
| WSREP_CLUSTER_SIZE        | 3                                                                             |
| WSREP_CLUSTER_STATUS      | Primary                                                                       |
| WSREP_INCOMING_ADDRESSES  | mariadb1.weave.local:3306,mariadb3.weave.local:3306,mariadb2.weave.local:3306 |
| WSREP_LOCAL_STATE_COMMENT | Synced                                                                        |
+---------------------------+-------------------------------------------------------------------------------+

At this point, our architecture is looking something like this:

Although the run command is pretty long, it describes the container's characteristics well. It's probably a good idea to wrap the command in a script to simplify the execution steps, or use a compose file instead.

Database Routing with ProxySQL

Now we have three database containers running. The only way to access the cluster now is via the individual Docker host’s published MySQL port, 3306 (mapped to port 3306 of the container). So what happens if one of the database containers fails? You have to manually fail over the client's connection to the next available node. Depending on the application connector, you could also specify a list of nodes and let the connector do the failover and query routing for you (Connector/J, PHP mysqlnd). Otherwise, it would be a good idea to unify the database resources into a single resource that can be called a service.

This is where ProxySQL comes into the picture. ProxySQL can act as the query router, load balancing the database connections similar to what "Service" in Swarm or Kubernetes world can do. We have built a ProxySQL Docker image for this purpose and will maintain the image for every new version with our best effort.

Before we run the ProxySQL container, we have to prepare the configuration file. The following is what we have configured for proxysql1. We create a custom configuration file under /containers/proxysql1/proxysql.cnf on host1:

$ cat /containers/proxysql1/proxysql.cnf
datadir="/var/lib/proxysql"
admin_variables=
{
        admin_credentials="admin:admin"
        mysql_ifaces="0.0.0.0:6032"
        refresh_interval=2000
}
mysql_variables=
{
        threads=4
        max_connections=2048
        default_query_delay=0
        default_query_timeout=36000000
        have_compress=true
        poll_timeout=2000
        interfaces="0.0.0.0:6033;/tmp/proxysql.sock"
        default_schema="information_schema"
        stacksize=1048576
        server_version="5.1.30"
        connect_timeout_server=10000
        monitor_history=60000
        monitor_connect_interval=200000
        monitor_ping_interval=200000
        ping_interval_server=10000
        ping_timeout_server=200
        commands_stats=true
        sessions_sort=true
        monitor_username="proxysql"
        monitor_password="proxysqlpassword"
}
mysql_servers =
(
        { address="mariadb1.weave.local" , port=3306 , hostgroup=10, max_connections=100 },
        { address="mariadb2.weave.local" , port=3306 , hostgroup=10, max_connections=100 },
        { address="mariadb3.weave.local" , port=3306 , hostgroup=10, max_connections=100 },
        { address="mariadb1.weave.local" , port=3306 , hostgroup=20, max_connections=100 },
        { address="mariadb2.weave.local" , port=3306 , hostgroup=20, max_connections=100 },
        { address="mariadb3.weave.local" , port=3306 , hostgroup=20, max_connections=100 }
)
mysql_users =
(
        { username = "sbtest" , password = "password" , default_hostgroup = 10 , active = 1 }
)
mysql_query_rules =
(
        {
                rule_id=100
                active=1
                match_pattern="^SELECT .* FOR UPDATE"
                destination_hostgroup=10
                apply=1
        },
        {
                rule_id=200
                active=1
                match_pattern="^SELECT .*"
                destination_hostgroup=20
                apply=1
        },
        {
                rule_id=300
                active=1
                match_pattern=".*"
                destination_hostgroup=10
                apply=1
        }
)
scheduler =
(
        {
                id = 1
                filename = "/usr/share/proxysql/tools/proxysql_galera_checker.sh"
                active = 1
                interval_ms = 2000
                arg1 = "10"
                arg2 = "20"
                arg3 = "1"
                arg4 = "1"
                arg5 = "/var/lib/proxysql/proxysql_galera_checker.log"
        }
)

The above configuration will:

  • configure two host groups, the single-writer and multi-writer group, as defined under "mysql_servers" section,
  • send reads to all Galera nodes (hostgroup 20) while write operations will go to a single Galera server (hostgroup 10),
  • schedule the proxysql_galera_checker.sh,
  • use monitor_username and monitor_password as the monitoring credentials created when we first bootstrapped the cluster (mariadb0).

Copy the configuration file to host2, for ProxySQL redundancy:

$ mkdir -p /containers/proxysql2/ # on host2
$ scp /containers/proxysql1/proxysql.cnf /container/proxysql2/ # on host1

Then, run the ProxySQL containers on host1 and host2 respectively:

$ docker run -d \
        --name=${NAME} \
        --publish 6033 \
        --publish 6032 \
        --restart always \
        --net=weave \
        $(weave dns-args) \
        --hostname ${NAME}.weave.local \
        -v /containers/${NAME}/proxysql.cnf:/etc/proxysql.cnf \
        -v /containers/${NAME}/data:/var/lib/proxysql \
        severalnines/proxysql

** Replace ${NAME} with proxysql1 or proxysql2 respectively.

We specified --restart=always to make it always available regardless of the exit status, as well as to start automatically when the Docker daemon starts. This will make sure the ProxySQL containers act like a daemon.

Verify the MySQL servers status monitored by both ProxySQL instances (OFFLINE_SOFT is expected for the single-writer host group):

$ docker exec -it proxysql1 mysql -uadmin -padmin -h127.0.0.1 -P6032 -e 'select hostgroup_id,hostname,status from mysql_servers'
+--------------+----------------------+--------------+
| hostgroup_id | hostname             | status       |
+--------------+----------------------+--------------+
| 10           | mariadb1.weave.local | ONLINE       |
| 10           | mariadb2.weave.local | OFFLINE_SOFT |
| 10           | mariadb3.weave.local | OFFLINE_SOFT |
| 20           | mariadb1.weave.local | ONLINE       |
| 20           | mariadb2.weave.local | ONLINE       |
| 20           | mariadb3.weave.local | ONLINE       |
+--------------+----------------------+--------------+

At this point, our architecture is looking something like this:

All connections coming in on port 6033 (whether from host1, host2 or the container network) will be load balanced to the backend database containers using ProxySQL. If you would like to access an individual database server, use port 3306 of the physical host instead. There is no virtual IP address as a single endpoint configured for the ProxySQL service, but we could have that by using Keepalived, which is explained in the next section.
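As a quick sanity check, you can connect through the ProxySQL listener with the application user defined in the mysql_users section above and see which backend answers:

# Connect via ProxySQL on host1 (port 6033) and check which Galera node served the query.
mysql -usbtest -ppassword -h192.168.55.161 -P6033 -e "SELECT @@hostname"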

Virtual IP Address with Keepalived

Since we configured ProxySQL containers to be running on host1 and host2, we are going to use Keepalived containers to tie these hosts together and provide virtual IP address via the host network. This allows a single endpoint for applications or clients to connect to the load balancing layer backed by ProxySQL.

As usual, create a custom configuration file for our Keepalived service. Here is the content of /containers/keepalived1/keepalived.conf:

vrrp_instance VI_DOCKER {
   interface ens33               # interface to monitor
   state MASTER
   virtual_router_id 52          # Assign one ID for this route
   priority 101
   unicast_src_ip 192.168.55.161
   unicast_peer {
      192.168.55.162
   }
   virtual_ipaddress {
      192.168.55.160             # the virtual IP
   }
}

Copy the configuration file to host2 for the second instance:

$ mkdir -p /containers/keepalived2/ # on host2
$ scp /containers/keepalived1/keepalived.conf host2:/containers/keepalived2/ # on host1

Change the priority from 101 to 100 inside the copied configuration file on host2:

$ sed -i 's/101/100/g' /containers/keepalived2/keepalived.conf

** The higher priority instance will hold the virtual IP address (in this case, host1) until the VRRP communication is interrupted (for example, when host1 goes down).

Then, run the following command on host1 and host2 respectively:

$ docker run -d \
        --name=${NAME} \
        --cap-add=NET_ADMIN \
        --net=host \
        --restart=always \
        --volume /containers/${NAME}/keepalived.conf:/usr/local/etc/keepalived/keepalived.conf \
        osixia/keepalived:1.4.4

** Replace ${NAME} with keepalived1 and keepalived2.

The run command tells Docker to:

  • --name, create a container with the given name,
  • --cap-add=NET_ADMIN, add Linux capabilities for network admin scope,
  • --net=host, attach the container to the host network. This will provide the virtual IP address on the host interface, ens33,
  • --restart=always, always keep the container running,
  • --volume=/containers/${NAME}/keepalived.conf:/usr/local/etc/keepalived/keepalived.conf, map the custom configuration file for container's usage.

After both containers are started, verify the virtual IP address existence by looking at the physical network interface of the MASTER node:

$ ip a | grep ens33
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    inet 192.168.55.161/24 brd 192.168.55.255 scope global ens33
    inet 192.168.55.160/32 scope global ens33

The clients and applications may now use the virtual IP address, 192.168.55.160 to access the database service. This virtual IP address exists on host1 at this moment. If host1 goes down, keepalived2 will take over the IP address and bring it up on host2. Take note that the configuration for this keepalived does not monitor the ProxySQL containers. It only monitors the VRRP advertisement of the Keepalived peers.
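
To verify the failover behaviour without touching the database tier, stop the Keepalived container on the MASTER node and watch the virtual IP address move over (a quick sanity check rather than a full failover test):

$ docker stop keepalived1    # on host1
$ ip a | grep ens33          # on host2, 192.168.55.160 should now be listed
$ docker start keepalived1   # on host1, the VIP fails back to the higher priority node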

At this point, our architecture is looking something like this:

Summary

So, now we have a MariaDB Galera Cluster fronted by a highly available ProxySQL service, all running on Docker containers.

In part two, we are going to look into how to manage this setup. We’ll look at how to perform operations like graceful shutdown, bootstrapping, detecting the most advanced node, failover, recovery, scaling up/down, upgrades, backup and so on. We will also discuss the pros and cons of having this setup for our clustered database service.

Happy containerizing!

How to Benchmark Performance of MySQL & MariaDB using SysBench

What is SysBench? If you work with MySQL on a regular basis, then you most probably have heard of it. SysBench has been in the MySQL ecosystem for a long time. It was originally written by Peter Zaitsev, back in 2004. Its purpose was to provide a tool to run synthetic benchmarks of MySQL and the hardware it runs on. It was designed to run CPU, memory and I/O tests. It also had an option to execute OLTP workload on a MySQL database. OLTP stands for online transaction processing, a typical workload for online applications like e-commerce, order entry or financial transaction systems.

In this blog post, we will focus on the SQL benchmark feature, but keep in mind that hardware benchmarks can also be very useful in identifying issues on database servers. For example, the I/O benchmark was intended to simulate InnoDB I/O workload, while the CPU tests involve simulation of a highly concurrent, multi-threaded environment along with tests for mutex contention - something which also resembles a database type of workload.

SysBench history and architecture

As mentioned, SysBench was originally created in 2004 by Peter Zaitsev. Soon after, Alexey Kopytov took over its development. It reached version 0.4.12 and then development halted. After a long break, Alexey started to work on SysBench again in 2016. Soon version 0.5 was released, with the OLTP benchmark rewritten to use LUA-based scripts. Then, in 2017, SysBench 1.0 was released. This was like day and night compared to the old 0.4.12 version. First and foremost, instead of hardcoded scripts, we now have the ability to customize benchmarks using LUA. For instance, Percona created a TPCC-like benchmark which can be executed using SysBench. Let’s take a quick look at the current SysBench architecture.

SysBench is a C binary which uses LUA scripts to execute benchmarks. Those scripts have to:

  1. Handle input from command line parameters
  2. Define all of the modes which the benchmark is supposed to use (prepare, run, cleanup)
  3. Prepare all of the data
  4. Define how the benchmark will be executed (what queries will look like etc)

Scripts can utilize multiple connections to the database, and they can also process results should you want to create complex benchmarks where queries depend on the result set of previous queries. With SysBench 1.0 it is possible to create latency histograms. It is also possible for the LUA scripts to catch and handle errors through error hooks. There’s support for parallelization in the LUA scripts: multiple queries can be executed in parallel, making, for example, provisioning much faster. Last but not least, multiple output formats are now supported. Previously, SysBench generated only human-readable output. Now it is possible to generate it as CSV or JSON, making it much easier to do post-processing and generate graphs using, for example, gnuplot, or to feed the data into Prometheus, Graphite or a similar datastore.

Why SysBench?

The main reason why SysBench became popular is the fact that it is simple to use. Someone without prior knowledge can start to use it within minutes. It also provides, by default, benchmarks which cover most of the cases - OLTP workloads, read-only or read-write, primary key lookups and primary key updates - all of which caused most of the issues for MySQL up to MySQL 8.0. This was also a reason why SysBench was so popular in different benchmarks and comparisons published on the Internet. Those posts helped to promote this tool and made it into the go-to synthetic benchmark for MySQL.

Another good thing about SysBench is that, since version 0.5 and the incorporation of LUA, anyone can prepare any kind of benchmark. We already mentioned the TPCC-like benchmark, but anyone can craft something which resembles their production workload. We are not saying it is simple - it will most likely be a time-consuming process - but having this ability is beneficial if you need to prepare a custom benchmark.

Being a synthetic benchmark, SysBench is not a tool which you can use to tune the configuration of your MySQL servers (unless you prepared LUA scripts with a custom workload, or your workload happens to be very similar to the benchmark workloads that SysBench comes with). What it is great for is comparing the performance of different hardware. You can easily compare the performance of, let’s say, different types of nodes offered by your cloud provider and the maximum QPS (queries per second) they offer. Knowing that metric and knowing what you pay for a given node, you can then calculate an even more important metric - QP$ (queries per dollar). This will allow you to identify what node type to use when building a cost-efficient environment. Of course, SysBench can also be used for initial tuning and assessing the feasibility of a given design. Let’s say we build a Galera cluster spanning across the globe - North America, EU, Asia. How many inserts per second can such a setup handle? What would be the commit latency? Does it even make sense to do a proof of concept, or is the network latency high enough that even a simple workload does not work as you would expect it to?
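
As an illustration of the QP$ idea, here is a quick back-of-the-envelope calculation with made-up numbers (the instance types, QPS figures and prices below are purely hypothetical):

# hypothetical figures, for illustration only
# instance type A:  8,000 QPS at $0.50/hour ->  8000 / 0.50 = 16,000 queries per dollar-hour
# instance type B: 12,000 QPS at $1.00/hour -> 12000 / 1.00 = 12,000 queries per dollar-hour
# type B wins on raw QPS, but type A is the more cost-efficient building block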

What about stress-testing? Not everyone has moved to the cloud; there are still companies preferring to build their own infrastructure. Every new server acquired should go through a warm-up period during which you will stress it to pinpoint potential hardware defects. In this case SysBench can also help, either by executing an OLTP workload which overloads the server, or with the dedicated benchmarks for CPU, disk and memory.

As you can see, there are many cases in which even a simple, synthetic benchmark can be very useful. In the next paragraph we will look at what we can do with SysBench.

What can SysBench do for you?

What tests can you run?

As mentioned at the beginning, we will focus on OLTP benchmarks and just as a reminder we’ll repeat that SysBench can also be used to perform I/O, CPU and memory tests. Let’s take a look at the benchmarks that SysBench 1.0 comes with (we removed some helper LUA files and non-database LUA scripts from this list).

-rwxr-xr-x 1 root root 1.5K May 30 07:46 bulk_insert.lua
-rwxr-xr-x 1 root root 1.3K May 30 07:46 oltp_delete.lua
-rwxr-xr-x 1 root root 2.4K May 30 07:46 oltp_insert.lua
-rwxr-xr-x 1 root root 1.3K May 30 07:46 oltp_point_select.lua
-rwxr-xr-x 1 root root 1.7K May 30 07:46 oltp_read_only.lua
-rwxr-xr-x 1 root root 1.8K May 30 07:46 oltp_read_write.lua
-rwxr-xr-x 1 root root 1.1K May 30 07:46 oltp_update_index.lua
-rwxr-xr-x 1 root root 1.2K May 30 07:46 oltp_update_non_index.lua
-rwxr-xr-x 1 root root 1.5K May 30 07:46 oltp_write_only.lua
-rwxr-xr-x 1 root root 1.9K May 30 07:46 select_random_points.lua
-rwxr-xr-x 1 root root 2.1K May 30 07:46 select_random_ranges.lua

Let’s go through them one by one.

First, bulk_insert.lua. This test can be used to benchmark the ability of MySQL to perform multi-row inserts. This can be quite useful when checking, for example, the performance of replication or a Galera cluster. In the first case, it can help you answer the question: “how fast can I insert before replication lag kicks in?”. In the latter case, it will tell you how fast data can be inserted into a Galera cluster given the current network latency.
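
As a rough sketch, a bulk insert benchmark could be prepared, executed and cleaned up as below. The script path and connection settings mirror the examples later in this post and are placeholders for your own environment:

$ sysbench /root/sysbench/src/lua/bulk_insert.lua --threads=4 --mysql-host=10.0.0.126 --mysql-user=sbtest --mysql-password=pass prepare
$ sysbench /root/sysbench/src/lua/bulk_insert.lua --threads=4 --time=300 --report-interval=1 --mysql-host=10.0.0.126 --mysql-user=sbtest --mysql-password=pass run
$ sysbench /root/sysbench/src/lua/bulk_insert.lua --threads=4 --mysql-host=10.0.0.126 --mysql-user=sbtest --mysql-password=pass cleanup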

All oltp_* scripts share a common table structure. The first two of them (oltp_delete.lua and oltp_insert.lua) execute single DELETE and INSERT statements. Again, this could be a test for either replication or a Galera cluster - push it to the limits and see what amount of inserting or purging it can handle. We also have other benchmarks focused on particular functionality - oltp_point_select, oltp_update_index and oltp_update_non_index. These will execute a subset of queries - primary key-based selects, index-based updates and non-index-based updates. If you want to test some of these functionalities, the tests are there. We also have more complex benchmarks which are based on OLTP workloads: oltp_read_only, oltp_read_write and oltp_write_only. You can run a read-only workload, which will consist of different types of SELECT queries, you can run only writes (a mix of DELETE, INSERT and UPDATE), or you can run a mix of those two. Finally, using select_random_points and select_random_ranges you can run some random SELECTs, either using random points in an IN() list or random ranges using BETWEEN.

How can you configure a benchmark?

What is also important is that benchmarks are configurable - you can run different workload patterns using the same benchmark. Let’s take a look at the two most common benchmarks to execute: we’ll have a deep dive into the OLTP read_only and OLTP read_write benchmarks. First of all, SysBench has some general configuration options. We will discuss here only the most important ones; you can check all of them by running:

sysbench --help

Let’s take a look at them.

  --threads=N                     number of threads to use [1]

You can define what kind of concurrency you’d like SysBench to generate. MySQL, like every piece of software, has some scalability limitations and its performance will peak at some level of concurrency. This setting helps to simulate different concurrency levels for a given workload and check if it has already passed the sweet spot.

  --events=N                      limit for total number of events [0]
  --time=N                        limit for total execution time in seconds [10]

Those two settings govern how long SysBench should keep running. It can either execute some number of queries or it can keep running for a predefined time.

  --warmup-time=N                 execute events for this many seconds with statistics disabled before the actual benchmark run with statistics enabled [0]

This is self-explanatory. SysBench generates statistical results from the tests and those results may be affected if MySQL is in a cold state. Warmup helps to identify “regular” throughput by executing the benchmark for a predefined time, allowing the caches, buffer pools etc. to warm up.

  --rate=N                        average transactions rate. 0 for unlimited rate [0]

By default SysBench will attempt to execute queries as fast as possible. To simulate slower traffic this option may be used. You can define here how many transactions should be executed per second.

  --report-interval=N             periodically report intermediate statistics with a specified interval in seconds. 0 disables intermediate reports [0]

By default SysBench generates a report after it completes its run and no progress is reported while the benchmark is running. Using this option you can make SysBench more verbose while the benchmark still runs.

  --rand-type=STRING   random numbers distribution {uniform, gaussian, special, pareto, zipfian} to use by default [special]

SysBench gives you the ability to generate different types of data distribution. All of them may have their own purposes. The default option, ‘special’, defines several (it is configurable) hot-spots in the data, something which is quite common in web applications. You can also use other distributions if your data behaves in a different way. By making a different choice here you can also change the way your database is stressed. For example, uniform distribution, where all of the rows have the same likelihood of being accessed, is a much more memory-intensive operation. It will use more of the buffer pool to store all of the data and it will be much more disk-intensive if your data set won’t fit in memory. On the other hand, special distribution with a couple of hot-spots will put less stress on the disk, as hot rows are more likely to be kept in the buffer pool and access to rows stored on disk is much less likely. For some of the data distribution types, SysBench gives you more tweaks. You can find this info in the ‘sysbench --help’ output.

  --db-ps-mode=STRING prepared statements usage mode {auto, disable} [auto]

Using this setting you can decide if SysBench should use prepared statements (as long as they are available in the given datastore - for MySQL it means PS will be enabled by default) or not. This may make a difference while working with proxies like ProxySQL or MaxScale - they have to treat prepared statements in a special way, routing all of them to one host, which makes it impossible to test the scalability of the proxy.

In addition to the general configuration options, each of the tests may have its own configuration. You can check what is possible by running:

root@vagrant:~# sysbench ./sysbench/src/lua/oltp_read_write.lua  help
sysbench 1.1.0-2e6b7d5 (using bundled LuaJIT 2.1.0-beta3)

oltp_read_only.lua options:
  --distinct_ranges=N           Number of SELECT DISTINCT queries per transaction [1]
  --sum_ranges=N                Number of SELECT SUM() queries per transaction [1]
  --skip_trx[=on|off]           Don't start explicit transactions and execute all queries in the AUTOCOMMIT mode [off]
  --secondary[=on|off]          Use a secondary index in place of the PRIMARY KEY [off]
  --create_secondary[=on|off]   Create a secondary index in addition to the PRIMARY KEY [on]
  --index_updates=N             Number of UPDATE index queries per transaction [1]
  --range_size=N                Range size for range SELECT queries [100]
  --auto_inc[=on|off]           Use AUTO_INCREMENT column as Primary Key (for MySQL), or its alternatives in other DBMS. When disabled, use client-generated IDs [on]
  --delete_inserts=N            Number of DELETE/INSERT combinations per transaction [1]
  --tables=N                    Number of tables [1]
  --mysql_storage_engine=STRING Storage engine, if MySQL is used [innodb]
  --non_index_updates=N         Number of UPDATE non-index queries per transaction [1]
  --table_size=N                Number of rows per table [10000]
  --pgsql_variant=STRING        Use this PostgreSQL variant when running with the PostgreSQL driver. The only currently supported variant is 'redshift'. When enabled, create_secondary is automatically disabled, and delete_inserts is set to 0
  --simple_ranges=N             Number of simple range SELECT queries per transaction [1]
  --order_ranges=N              Number of SELECT ORDER BY queries per transaction [1]
  --range_selects[=on|off]      Enable/disable all range SELECT queries [on]
  --point_selects=N             Number of point SELECT queries per transaction [10]

Again, we will discuss the most important options from here. First of all, you have control over how exactly a transaction will look. Generally speaking, it consists of different types of queries - INSERT, DELETE, different types of SELECT (point lookup, range, aggregation) and UPDATE (indexed, non-indexed). Using variables like:

  --distinct_ranges=N           Number of SELECT DISTINCT queries per transaction [1]
  --sum_ranges=N                Number of SELECT SUM() queries per transaction [1]
  --index_updates=N             Number of UPDATE index queries per transaction [1]
  --delete_inserts=N            Number of DELETE/INSERT combinations per transaction [1]
  --non_index_updates=N         Number of UPDATE non-index queries per transaction [1]
  --simple_ranges=N             Number of simple range SELECT queries per transaction [1]
  --order_ranges=N              Number of SELECT ORDER BY queries per transaction [1]
  --point_selects=N             Number of point SELECT queries per transaction [10]
  --range_selects[=on|off]      Enable/disable all range SELECT queries [on]

you can define what a transaction should look like. As you can see by looking at the default values, the majority of queries are SELECTs - mainly point selects, but also different types of range SELECTs (you can disable all of them by setting range_selects to off). You can tweak the workload towards a more write-heavy workload by increasing the number of updates or INSERT/DELETE queries. It is also possible to tweak settings related to secondary indexes and auto increment, as well as the data set size (the number of tables and how many rows each of them should hold). This lets you customize your workload quite nicely.

  --skip_trx[=on|off]           Don't start explicit transactions and execute all queries in the AUTOCOMMIT mode [off]

This is another setting, quite important when working with proxies. By default, SysBench will attempt to execute queries in explicit transactions. This way the dataset stays consistent and is not affected: SysBench will, for example, execute INSERT and DELETE on the same row, making sure the data set will not grow (which would impact your ability to reproduce results). However, proxies will treat explicit transactions differently - all queries executed within a transaction should be executed on the same host, thus removing the ability to scale the workload. Please keep in mind that disabling transactions will result in the data set diverging from the initial point. It may also trigger some issues like duplicate key errors or similar. To be able to disable transactions you may also want to look into:

  --mysql-ignore-errors=[LIST,...] list of errors to ignore, or "all" [1213,1020,1205]

This setting allows you to specify error codes from MySQL which SysBench should ignore (and not kill the connection). For example, to ignore errors like: error 1062 (Duplicate entry '6' for key 'PRIMARY') you should pass this error code: --mysql-ignore-errors=1062
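
For example, a run that disables explicit transactions (useful when benchmarking through a proxy) while ignoring duplicate key and deadlock errors could look like the sketch below, using the same placeholder connection settings as in the examples further down:

$ sysbench /root/sysbench/src/lua/oltp_read_write.lua --threads=16 --time=300 --mysql-host=10.0.0.126 --mysql-user=sbtest --mysql-password=pass --tables=10 --table-size=1000000 --skip_trx=on --mysql-ignore-errors=1062,1213 --report-interval=1 run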

What is also important is that each benchmark provides a way to provision a data set for the tests, run them and then clean up after the tests complete. This is done using the ‘prepare’, ‘run’ and ‘cleanup’ commands. We will show how this is done in the next section.

Examples

In this section we’ll go through some examples of what SysBench can be used for. As mentioned earlier, we’ll focus on the two most popular benchmarks - OLTP read only and OLTP read/write. Sometimes it may make sense to use other benchmarks, but at least we’ll be able to show you how those two can be customized.

Primary Key lookups

First of all, we have to decide which benchmark we will run, read-only or read-write. Technically speaking it does not make a difference, as we can remove writes from the R/W benchmark. Let’s focus on the read-only one.

As a first step, we have to prepare a data set. We need to decide how big it should be. For this particular benchmark, using default settings (so, with secondary indexes created), 1 million rows will result in ~240 MB of data. Ten tables, 1,000,000 rows each, equals 2.4GB:

root@vagrant:~# du -sh /var/lib/mysql/sbtest/
2.4G    /var/lib/mysql/sbtest/
root@vagrant:~# ls -alh /var/lib/mysql/sbtest/
total 2.4G
drwxr-x--- 2 mysql mysql 4.0K Jun  1 12:12 .
drwxr-xr-x 6 mysql mysql 4.0K Jun  1 12:10 ..
-rw-r----- 1 mysql mysql   65 Jun  1 12:08 db.opt
-rw-r----- 1 mysql mysql 8.5K Jun  1 12:12 sbtest10.frm
-rw-r----- 1 mysql mysql 240M Jun  1 12:12 sbtest10.ibd
-rw-r----- 1 mysql mysql 8.5K Jun  1 12:10 sbtest1.frm
-rw-r----- 1 mysql mysql 240M Jun  1 12:10 sbtest1.ibd
-rw-r----- 1 mysql mysql 8.5K Jun  1 12:10 sbtest2.frm
-rw-r----- 1 mysql mysql 240M Jun  1 12:10 sbtest2.ibd
-rw-r----- 1 mysql mysql 8.5K Jun  1 12:10 sbtest3.frm
-rw-r----- 1 mysql mysql 240M Jun  1 12:10 sbtest3.ibd
-rw-r----- 1 mysql mysql 8.5K Jun  1 12:10 sbtest4.frm
-rw-r----- 1 mysql mysql 240M Jun  1 12:10 sbtest4.ibd
-rw-r----- 1 mysql mysql 8.5K Jun  1 12:11 sbtest5.frm
-rw-r----- 1 mysql mysql 240M Jun  1 12:11 sbtest5.ibd
-rw-r----- 1 mysql mysql 8.5K Jun  1 12:11 sbtest6.frm
-rw-r----- 1 mysql mysql 240M Jun  1 12:11 sbtest6.ibd
-rw-r----- 1 mysql mysql 8.5K Jun  1 12:11 sbtest7.frm
-rw-r----- 1 mysql mysql 240M Jun  1 12:11 sbtest7.ibd
-rw-r----- 1 mysql mysql 8.5K Jun  1 12:11 sbtest8.frm
-rw-r----- 1 mysql mysql 240M Jun  1 12:11 sbtest8.ibd
-rw-r----- 1 mysql mysql 8.5K Jun  1 12:12 sbtest9.frm
-rw-r----- 1 mysql mysql 240M Jun  1 12:12 sbtest9.ibd

This should give you an idea of how many tables you want and how big they should be. Let’s say we want to test an in-memory workload, so we want to create tables which will fit into the InnoDB buffer pool. On the other hand, we also want to make sure there are enough tables not to become a bottleneck (or that the amount of tables matches what you would expect in your production setup). Let’s prepare our dataset. Please keep in mind that, by default, SysBench looks for the ‘sbtest’ schema which has to exist before you prepare the data set. You may have to create it manually.
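
Creating the schema is a one-liner; the host and credentials below are the same placeholders used in the prepare command that follows, and the sbtest user is assumed to have the CREATE privilege:

$ mysql -h10.0.0.126 -usbtest -ppass -e "CREATE DATABASE IF NOT EXISTS sbtest"

With the schema in place, we can run the prepare step: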

root@vagrant:~# sysbench /root/sysbench/src/lua/oltp_read_only.lua --threads=4 --mysql-host=10.0.0.126 --mysql-user=sbtest --mysql-password=pass --mysql-port=3306 --tables=10 --table-size=1000000 prepare
sysbench 1.1.0-2e6b7d5 (using bundled LuaJIT 2.1.0-beta3)

Initializing worker threads...

Creating table 'sbtest2'...
Creating table 'sbtest3'...
Creating table 'sbtest4'...
Creating table 'sbtest1'...
Inserting 1000000 records into 'sbtest2'
Inserting 1000000 records into 'sbtest4'
Inserting 1000000 records into 'sbtest3'
Inserting 1000000 records into 'sbtest1'
Creating a secondary index on 'sbtest2'...
Creating a secondary index on 'sbtest3'...
Creating a secondary index on 'sbtest1'...
Creating a secondary index on 'sbtest4'...
Creating table 'sbtest6'...
Inserting 1000000 records into 'sbtest6'
Creating table 'sbtest7'...
Inserting 1000000 records into 'sbtest7'
Creating table 'sbtest5'...
Inserting 1000000 records into 'sbtest5'
Creating table 'sbtest8'...
Inserting 1000000 records into 'sbtest8'
Creating a secondary index on 'sbtest6'...
Creating a secondary index on 'sbtest7'...
Creating a secondary index on 'sbtest5'...
Creating a secondary index on 'sbtest8'...
Creating table 'sbtest10'...
Inserting 1000000 records into 'sbtest10'
Creating table 'sbtest9'...
Inserting 1000000 records into 'sbtest9'
Creating a secondary index on 'sbtest10'...
Creating a secondary index on 'sbtest9'...

Once we have our data, let’s prepare a command to run the test. We want to test Primary Key lookups therefore we will disable all other types of SELECT. We will also disable prepared statements as we want to test regular queries. We will test low concurrency, let’s say 16 threads. Our command may look like below:

sysbench /root/sysbench/src/lua/oltp_read_only.lua --threads=16 --events=0 --time=300 --mysql-host=10.0.0.126 --mysql-user=sbtest --mysql-password=pass --mysql-port=3306 --tables=10 --table-size=1000000 --range_selects=off --db-ps-mode=disable --report-interval=1 run

What did we do here? We set the number of threads to 16. We decided that we want our benchmark to run for 300 seconds, without a limit on executed queries. We defined connectivity to the database, the number of tables and their size. We also disabled all range SELECTs and prepared statements. Finally, we set the report interval to one second. This is what a sample output may look like:

[ 297s ] thds: 16 tps: 97.21 qps: 1127.43 (r/w/o: 935.01/0.00/192.41) lat (ms,95%): 253.35 err/s: 0.00 reconn/s: 0.00
[ 298s ] thds: 16 tps: 195.32 qps: 2378.77 (r/w/o: 1985.13/0.00/393.64) lat (ms,95%): 189.93 err/s: 0.00 reconn/s: 0.00
[ 299s ] thds: 16 tps: 178.02 qps: 2115.22 (r/w/o: 1762.18/0.00/353.04) lat (ms,95%): 155.80 err/s: 0.00 reconn/s: 0.00
[ 300s ] thds: 16 tps: 217.82 qps: 2640.92 (r/w/o: 2202.27/0.00/438.65) lat (ms,95%): 125.52 err/s: 0.00 reconn/s: 0.00

Every second we see a snapshot of the workload stats. This is quite useful to track and plot - the final report will give you averages only. Intermediate results make it possible to track the performance on a second-by-second basis. The final report may look like below:

SQL statistics:
    queries performed:
        read:                            614660
        write:                           0
        other:                           122932
        total:                           737592
    transactions:                        61466  (204.84 per sec.)
    queries:                             737592 (2458.08 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)

Throughput:
    events/s (eps):                      204.8403
    time elapsed:                        300.0679s
    total number of events:              61466

Latency (ms):
         min:                                   24.91
         avg:                                   78.10
         max:                                  331.91
         95th percentile:                      137.35
         sum:                              4800234.60

Threads fairness:
    events (avg/stddev):           3841.6250/20.87
    execution time (avg/stddev):   300.0147/0.02

Here you will find information about the executed queries and ‘other’ (BEGIN/COMMIT) statements. You’ll learn how many transactions were executed, how many errors happened, and what the throughput and total elapsed time were. You can also check latency metrics and the query distribution across threads.

If we were interested in the latency distribution, we could also pass the ‘--histogram’ argument to SysBench. This results in additional output like below:

Latency histogram (values are in milliseconds)
       value  ------------- distribution ------------- count
      29.194 |******                                   1
      30.815 |******                                   1
      31.945 |***********                              2
      33.718 |******                                   1
      34.954 |***********                              2
      35.589 |******                                   1
      37.565 |***********************                  4
      38.247 |******                                   1
      38.942 |******                                   1
      39.650 |***********                              2
      40.370 |***********                              2
      41.104 |*****************                        3
      41.851 |*****************************            5
      42.611 |*****************                        3
      43.385 |*****************                        3
      44.173 |***********                              2
      44.976 |**************************************** 7
      45.793 |***********************                  4
      46.625 |***********                              2
      47.472 |*****************************            5
      48.335 |**************************************** 7
      49.213 |***********                              2
      50.107 |**********************************       6
      51.018 |***********************                  4
      51.945 |**************************************** 7
      52.889 |*****************                        3
      53.850 |*****************                        3
      54.828 |***********************                  4
      55.824 |***********                              2
      57.871 |***********                              2
      58.923 |***********                              2
      59.993 |******                                   1
      61.083 |******                                   1
      63.323 |***********                              2
      66.838 |******                                   1
      71.830 |******                                   1

Once we are good with our results, we can clean up the data:

sysbench /root/sysbench/src/lua/oltp_read_only.lua --threads=16 --events=0 --time=300 --mysql-host=10.0.0.126 --mysql-user=sbtest --mysql-password=pass --mysql-port=3306 --tables=10 --table-size=1000000 --range_selects=off --db-ps-mode=disable --report-interval=1 cleanup

Write-heavy traffic

Let’s imagine here that we want to execute a write-heavy (but not write-only) workload and, for example, test the I/O subsystem’s performance. First of all, we have to decide how big the dataset should be. We’ll assume ~48GB of data (20 tables, 10,000,000 rows each). We need to prepare it. This time we will use the read-write benchmark.

root@vagrant:~# sysbench /root/sysbench/src/lua/oltp_read_write.lua --threads=4 --mysql-host=10.0.0.126 --mysql-user=sbtest --mysql-password=pass --mysql-port=3306 --tables=20 --table-size=10000000 prepare

Once this is done, we can tweak the defaults to force more writes into the query mix:

root@vagrant:~# sysbench /root/sysbench/src/lua/oltp_read_write.lua --threads=16 --events=0 --time=300 --mysql-host=10.0.0.126 --mysql-user=sbtest --mysql-password=pass --mysql-port=3306 --tables=20 --delete_inserts=10 --index_updates=10 --non_index_updates=10 --table-size=10000000 --db-ps-mode=disable --report-interval=1 run

As you can see from the intermediate results, transactions are now on a write-heavy side:

[ 5s ] thds: 16 tps: 16.99 qps: 946.31 (r/w/o: 231.83/680.50/33.98) lat (ms,95%): 1258.08 err/s: 0.00 reconn/s: 0.00
[ 6s ] thds: 16 tps: 17.01 qps: 955.81 (r/w/o: 223.19/698.59/34.03) lat (ms,95%): 1032.01 err/s: 0.00 reconn/s: 0.00
[ 7s ] thds: 16 tps: 12.00 qps: 698.91 (r/w/o: 191.97/482.93/24.00) lat (ms,95%): 1235.62 err/s: 0.00 reconn/s: 0.00
[ 8s ] thds: 16 tps: 14.01 qps: 683.43 (r/w/o: 195.12/460.29/28.02) lat (ms,95%): 1533.66 err/s: 0.00 reconn/s: 0.00

Understanding the results

As we showed above, SysBench is a great tool which can help to pinpoint some of the performance issues of MySQL or MariaDB. It can also be used for initial tuning of your database configuration. Of course, you have to keep in mind that, to get the best out of your benchmarks, you have to understand why the results look like they do. This requires insight into the MySQL internal metrics using monitoring tools, for instance ClusterControl. This is quite important to remember - if you don’t understand why the performance was what it was, you may draw incorrect conclusions from the benchmarks. There is always a bottleneck, and SysBench can help surface the performance issues, which you then have to identify.

MySQL on Docker: Running a MariaDB Galera Cluster without Orchestration Tools - DB Container Management - Part 2

As we saw in the first part of this blog, a strongly consistent database cluster like Galera does not play well with container orchestration tools like Kubernetes or Swarm. We showed you how to deploy Galera and configure process management for Docker, so you retain full control of the behaviour. This blog post is the continuation of that; we are going to look into the operation and maintenance of the cluster.

To recap some of the main points from part 1 of this blog, we deployed a three-node Galera cluster across three different Docker hosts, together with ProxySQL and Keepalived, where all MariaDB instances run as Docker containers. The following diagram illustrates the final deployment:

Graceful Shutdown

To perform a graceful MySQL shutdown, the best way is to send SIGTERM (signal 15) to the container:

$ docker kill -s 15 {db_container_name}

If you would like to shut down the cluster, repeat the above command on all database containers, one node at a time. The above is similar to performing "systemctl stop mysql" on a systemd service for MariaDB. Using the "docker stop" command is pretty risky for a database service because it waits for a 10-second timeout and Docker will send SIGKILL if this duration is exceeded (unless you use a proper --time value).
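
If you still prefer "docker stop", extend the timeout well beyond the default 10 seconds so MariaDB has enough time to flush and shut down cleanly (the value below is arbitrary):

$ docker stop --time=120 {db_container_name}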

The last node that shuts down gracefully will have a seqno not equal to -1 and the safe_to_bootstrap flag set to 1 in /{datadir volume}/grastate.dat on the Docker host, for example on host2:

$ cat /containers/mariadb2/datadir/grastate.dat
# GALERA saved state
version: 2.1
uuid:    e70b7437-645f-11e8-9f44-5b204e58220b
seqno:   7099
safe_to_bootstrap: 1

Detecting the Most Advanced Node

If the cluster didn't shut down gracefully, or the node that you are trying to bootstrap wasn't the last node to leave the cluster, you probably won't be able to bootstrap one of the Galera nodes and might encounter the following error:

2016-11-07 01:49:19 5572 [ERROR] WSREP: It may not be safe to bootstrap the cluster from this node.
It was not the last one to leave the cluster and may not contain all the updates.
To force cluster bootstrap with this node, edit the grastate.dat file manually and set safe_to_bootstrap to 1 .

Galera honours the node that has the safe_to_bootstrap flag set to 1 as the first reference node. This is the safest way to avoid data loss and ensure the correct node always gets bootstrapped.

If you get this error, you have to find out which is the most advanced node before picking it as the first node to be bootstrapped. Create a transient container (with the --rm flag), map it to the same datadir and configuration directory as the actual database container, and pass two MySQL command flags, --wsrep_recover and --wsrep_cluster_address. For example, if we want to know the last committed number of mariadb1, we need to run:

$ docker run --rm --name mariadb-recover \
        --env MYSQL_ROOT_PASSWORD="PM7%cB43$sd@^1" \
        --volume /containers/mariadb1/datadir:/var/lib/mysql \
        --volume /containers/mariadb1/conf.d:/etc/mysql/conf.d \
        mariadb:10.2.15 \
        --wsrep_recover \
        --wsrep_cluster_address=gcomm://
2018-06-12  4:46:35 139993094592384 [Note] mysqld (mysqld 10.2.15-MariaDB-10.2.15+maria~jessie) starting as process 1 ...
2018-06-12  4:46:35 139993094592384 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
...
2018-06-12  4:46:35 139993094592384 [Note] Plugin 'FEEDBACK' is disabled.
2018-06-12  4:46:35 139993094592384 [Note] Server socket created on IP: '::'.
2018-06-12  4:46:35 139993094592384 [Note] WSREP: Recovered position: e70b7437-645f-11e8-9f44-5b204e58220b:7099

The last line is what we are looking for. MariaDB prints out the cluster UUID and the sequence number of the most recently committed transaction. The node which holds the highest number is deemed the most advanced node. Since we specified --rm, the container will be removed automatically once it exits. Repeat the above step on every Docker host, replacing the --volume paths with the respective database container volumes.
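
To make the values easier to compare, you can filter the output down to the recovered position only; a variant of the same command (mysqld logs to stderr, hence the redirection):

$ docker run --rm --name mariadb-recover \
        --env MYSQL_ROOT_PASSWORD="PM7%cB43$sd@^1" \
        --volume /containers/mariadb1/datadir:/var/lib/mysql \
        --volume /containers/mariadb1/conf.d:/etc/mysql/conf.d \
        mariadb:10.2.15 \
        --wsrep_recover \
        --wsrep_cluster_address=gcomm:// 2>&1 | grep 'WSREP: Recovered position'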

Once you have compared the values reported by all database containers and decided which container is the most up-to-date node, change the safe_to_bootstrap flag to 1 inside /{datadir volume}/grastate.dat manually. Let's say all nodes are reporting the same exact sequence number; we can then just pick mariadb3 to be bootstrapped by changing its safe_to_bootstrap value to 1:

$ vim /containers/mariadb3/datadir/grastate.dat
...
safe_to_bootstrap: 1

Save the file and start bootstrapping the cluster from that node, as described in the next section.

Bootstrapping the Cluster

Bootstrapping the cluster is similar to the first docker run command we used when starting up the cluster for the first time. If mariadb1 is the chosen bootstrap node, we can simply re-run the created bootstrap container:

$ docker start mariadb0 # on host1

Otherwise, if the bootstrap container does not exist on the chosen node, let's say on host2, run the bootstrap container command and map the existing mariadb2's volumes. We are using mariadb0 as the container name on host2 to indicate it is a bootstrap container:

$ docker run -d \
        --name mariadb0 \
        --hostname mariadb0.weave.local \
        --net weave \
        --publish "3306" \
        --publish "4444" \
        --publish "4567" \
        --publish "4568" \
        $(weave dns-args) \
        --env MYSQL_ROOT_PASSWORD="PM7%cB43$sd@^1" \
        --volume /containers/mariadb2/datadir:/var/lib/mysql \
        --volume /containers/mariadb2/conf.d:/etc/mysql/mariadb.conf.d \
        mariadb:10.2.15 \
        --wsrep_cluster_address=gcomm:// \
        --wsrep_sst_auth="root:PM7%cB43$sd@^1" \
        --wsrep_node_address=mariadb0.weave.local

You may notice that this command is slightly shorter as compared to the previous bootstrap command described in this guide. Since we already have the proxysql user created in our first bootstrap command, we may skip these two environment variables:

  • --env MYSQL_USER=proxysql
  • --env MYSQL_PASSWORD=proxysqlpassword

Then, start the remaining MariaDB containers, remove the bootstrap container and start the existing MariaDB container on the bootstrapped host. Basically the order of commands would be:

$ docker start mariadb1 # on host1
$ docker start mariadb3 # on host3
$ docker stop mariadb0 # on host2
$ docker start mariadb2 # on host2

At this point, the cluster is started and is running at full capacity.

Resource Control

Memory is a very important resource in MySQL. This is where the buffers and caches are stored, and it's critical for MySQL to reduce the impact of hitting the disk too often. On the other hand, swapping is bad for MySQL performance. By default, there are no resource constraints on running containers. Containers use as much of a given resource as the host's kernel will allow. Another important thing is the file descriptor limit. You can increase the limit of open file descriptors, or "nofile", to something higher to cater for the number of files the MySQL server can open simultaneously. Setting this to a high value won't hurt.

To cap memory allocation and increase the file descriptor limit to our database container, one would append --memory, --memory-swap and --ulimit parameters into the "docker run" command:

$ docker kill -s 15 mariadb1
$ docker rm -f mariadb1
$ docker run -d \
        --name mariadb1 \
        --hostname mariadb1.weave.local \
        --net weave \
        --publish "3306:3306" \
        --publish "4444" \
        --publish "4567" \
        --publish "4568" \
        $(weave dns-args) \
        --memory 16g \
        --memory-swap 16g \
        --ulimit nofile:16000:16000 \
        --env MYSQL_ROOT_PASSWORD="PM7%cB43$sd@^1" \
        --volume /containers/mariadb1/datadir:/var/lib/mysql \
        --volume /containers/mariadb1/conf.d:/etc/mysql/mariadb.conf.d \
        mariadb:10.2.15 \
        --wsrep_cluster_address=gcomm://mariadb0.weave.local,mariadb1.weave.local,mariadb2.weave.local,mariadb3.weave.local \
        --wsrep_sst_auth="root:PM7%cB43$sd@^1" \
        --wsrep_node_address=mariadb1.weave.local

Take note that if --memory-swap is set to the same value as --memory, and --memory is set to a positive integer, the container will not have access to swap. If --memory-swap is not set, container swap will default to twice the value of --memory. This is because --memory-swap is the amount of combined memory and swap that can be used, while --memory is only the amount of physical memory that can be used.
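
You can confirm the limits actually applied to a running container with "docker inspect", which reports the values in bytes (17179869184 bytes equals the 16GB set above):

$ docker inspect --format '{{ .HostConfig.Memory }} {{ .HostConfig.MemorySwap }}' mariadb1
17179869184 17179869184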

Some of the container resources like memory and CPU can be controlled dynamically through "docker update" command, as shown in the following example to upgrade the memory of container mariadb1 to 32G on-the-fly:

$ docker update \
    --memory 32g \
    --memory-swap 32g \
    mariadb1

Do not forget to tune the my.cnf accordingly to suit the new specs. Configuration management is explained in the next section.

Configuration Management

Most of the MySQL/MariaDB configuration parameters can be changed during runtime, which means you don't need to restart to apply the changes. Check out the MariaDB documentation page for details. A parameter listed with "Dynamic: Yes" means the variable can be changed immediately at runtime, without the need to restart the MariaDB server. Otherwise, set the parameters inside the custom configuration file on the Docker host. For example, on mariadb3, make the changes to the following file:

$ vim /containers/mariadb3/conf.d/my.cnf

And then restart the database container to apply the change:

$ docker restart mariadb3

Verify that the container starts up the process by looking at the docker logs. Perform this operation on one node at a time if you would like to make cluster-wide changes.
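
For a dynamic variable, you can apply the change immediately and then persist it in the custom my.cnf so it survives the next restart; a minimal sketch, using max_connections purely as an example:

$ docker exec -it mariadb3 mysql -uroot -p -e 'SET GLOBAL max_connections = 500'
$ docker logs --tail 20 mariadb3 # check the MariaDB log for errors after the change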

Backup

Taking a logical backup is pretty straightforward because the MariaDB image also comes with mysqldump binary. You simply use the "docker exec" command to run the mysqldump and send the output to a file relative to the host path. The following command performs mysqldump backup on mariadb2 and saves it to /backups/mariadb2 inside host2:

$ docker exec -it mariadb2 mysqldump -uroot -p --single-transaction > /backups/mariadb2/dump.sql
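
A variant that skips the interactive password prompt and compresses the dump on the fly could look like below; note the -t flag is dropped so the TTY does not mangle the piped output, and the inline password is shown only for illustration:

$ docker exec -i mariadb2 mysqldump -uroot -p"PM7%cB43$sd@^1" --single-transaction --all-databases | gzip > /backups/mariadb2/dump.sql.gz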

Binary backup tools like Percona Xtrabackup or MariaDB Backup require the process to access the MariaDB data directory directly. You have to either install this tool inside the container, or on the machine host, or use a dedicated image for this purpose, like the "perconalab/percona-xtrabackup" image, to create the backup and store it inside /tmp/backup on the Docker host:

$ docker run --rm -it \
    -v /containers/mariadb2/datadir:/var/lib/mysql \
    -v /tmp/backup:/xtrabackup_backupfiles \
    perconalab/percona-xtrabackup \
    --backup --host=mariadb2 --user=root --password=mypassword

You can also stop the container with innodb_fast_shutdown set to 0 and copy over the datadir volume to another location in the physical host:

$ docker exec -it mariadb2 mysql -uroot -p -e 'SET GLOBAL innodb_fast_shutdown = 0'
$ docker kill -s 15 mariadb2
$ cp -Rf /containers/mariadb2/datadir /backups/mariadb2/datadir_copied
$ docker start mariadb2

Restore

Restoring is pretty straightforward for mysqldump. You can simply redirect the stdin into the container from the physical host:

$ docker exec -it mariadb2 mysql -uroot -p < /backups/mariadb2/dump.sql

You can also use the standard mysql client command line remotely with proper hostname and port value instead of using this "docker exec" command:

$ mysql -uroot -p -h127.0.0.1 -P3306 < /backups/mariadb2/dump.sql

For Percona Xtrabackup and MariaDB Backup, we have to prepare the backup beforehand. This will roll forward the backup to the time when the backup was finished. Let's say our Xtrabackup files are located under /tmp/backup of the Docker host, to prepare it, simply:

$ docker run --rm -it \
    -v mysql-datadir:/var/lib/mysql \
    -v /tmp/backup:/xtrabackup_backupfiles \
    perconalab/percona-xtrabackup \
    --prepare --target-dir /xtrabackup_backupfiles

The prepared backup under /tmp/backup of the Docker host then can be used as the MariaDB datadir for a new container or cluster. Let's say we just want to verify restoration on a standalone MariaDB container, we would run:

$ docker run -d \
    --name mariadb-restored \
    --env MYSQL_ROOT_PASSWORD="PM7%cB43$sd@^1" \
    -v /tmp/backup:/var/lib/mysql \
    mariadb:10.2.15

If you performed a backup using the stop and copy approach, you can simply duplicate the datadir and use the duplicated directory as a volume mapped to the MariaDB datadir to run in another container. Let's say the backup was copied over under /backups/mariadb2/datadir_copied; we can run a new container by running:

$ mkdir -p /containers/mariadb-restored/datadir
$ cp -Rf /backups/mariadb2/datadir_copied /containers/mariadb-restored/datadir
$ docker run -d \
    --name mariadb-restored \
    --env MYSQL_ROOT_PASSWORD="PM7%cB43$sd@^1" \
    -v /containers/mariadb-restored/datadir:/var/lib/mysql \
    mariadb:10.2.15

The MYSQL_ROOT_PASSWORD must match the actual root password for that particular backup.

Database Version Upgrade

There are two types of upgrade - in-place upgrade or logical upgrade.

In-place upgrade involves shutting down the MariaDB server, replacing the old binaries with the new binaries and then starting the server on the old data directory. Once started, you have to run mysql_upgrade script to check and upgrade all system tables and also to check the user tables.

The logical upgrade involves exporting SQL from the current version using a logical backup utility such as mysqldump, running the new container with the upgraded version binaries, and then applying the SQL to the new MySQL/MariaDB version. It is similar to backup and restore approach described in the previous section.

Nevertheless, it's a good approach to always back up your database before performing any destructive operations. The following steps are required when upgrading from the current image, MariaDB 10.1.33, to another major version, MariaDB 10.2.15, on mariadb3 which resides on host3:

  1. Back up the database. It doesn't matter whether it's a physical or logical backup, but the latter using mysqldump is recommended.

  2. Download the latest image that we would like to upgrade to:

    $ docker pull mariadb:10.2.15
  3. Set innodb_fast_shutdown to 0 for our database container:

    $ docker exec -it mariadb3 mysql -uroot -p -e 'SET GLOBAL innodb_fast_shutdown = 0'
  4. Graceful shut down the database container:

    $ docker kill --signal=TERM mariadb3
  5. Create a new container with the new image for our database container. Keep the rest of the parameters intact except using the new container name (otherwise it would conflict):

    $ docker run -d \
            --name mariadb3-new \
            --hostname mariadb3.weave.local \
            --net weave \
            --publish "3306:3306" \
            --publish "4444" \
            --publish "4567" \
            --publish "4568" \
            $(weave dns-args) \
            --env MYSQL_ROOT_PASSWORD="PM7%cB43$sd@^1" \
            --volume /containers/mariadb3/datadir:/var/lib/mysql \
            --volume /containers/mariadb3/conf.d:/etc/mysql/mariadb.conf.d \
            mariadb:10.2.15 \
            --wsrep_cluster_address=gcomm://mariadb0.weave.local,mariadb1.weave.local,mariadb2.weave.local,mariadb3.weave.local \
            --wsrep_sst_auth="root:PM7%cB43$sd@^1" \
            --wsrep_node_address=mariadb3.weave.local
  6. Run mysql_upgrade script:

    $ docker exec -it mariadb3-new mysql_upgrade -uroot -p
  7. If no errors occurred, remove the old container, mariadb3 (the new one is mariadb3-new):

    $ docker rm -f mariadb3
  8. Otherwise, if the upgrade process fails in between, we can fall back to the previous container:

    $ docker stop mariadb3-new
    $ docker start mariadb3

Major version upgrade can be performed similarly to the minor version upgrade, except you have to keep in mind that MySQL/MariaDB only supports major upgrade from the previous version. If you are on MariaDB 10.0 and would like to upgrade to 10.2, you have to upgrade to MariaDB 10.1 first, followed by another upgrade step to MariaDB 10.2.

Take note of the configuration changes introduced and deprecated between the major versions.

Failover

In Galera, all nodes are masters and hold the same role. With ProxySQL in the picture, connections that pass through this gateway will be failed over automatically as long as there is a primary component running for Galera Cluster (that is, a majority of nodes are up). The application won't notice any difference if one database node goes down because ProxySQL will simply redirect the connections to the other available nodes.

If the application connects directly to MariaDB, bypassing ProxySQL, failover has to be performed on the application side by pointing to the next available node, provided the database node meets the following conditions:

  • Status wsrep_local_state_comment is Synced (The state "Desynced/Donor" is also possible, only if wsrep_sst_method is xtrabackup, xtrabackup-v2 or mariabackup).
  • Status wsrep_cluster_status is Primary.

In Galera, an available node is not necessarily a healthy node until the above statuses are verified.
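
Both statuses can be checked quickly from the Docker host, for example against mariadb1:

$ docker exec -it mariadb1 mysql -uroot -p -e "SHOW STATUS LIKE 'wsrep_local_state_comment'; SHOW STATUS LIKE 'wsrep_cluster_status'"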

Scaling Out

To scale out, we can create a new container in the same network and use the same custom configuration file for the existing container on that particular host. For example, let's say we want to add the fourth MariaDB container on host3, we can use the same configuration file mounted for mariadb3, as illustrated in the following diagram:

Run the following command on host3 to scale out:

$ docker run -d \
        --name mariadb4 \
        --hostname mariadb4.weave.local \
        --net weave \
        --publish "3306:3307" \
        --publish "4444" \
        --publish "4567" \
        --publish "4568" \
        $(weave dns-args) \
        --env MYSQL_ROOT_PASSWORD="PM7%cB43$sd@^1" \
        --volume /containers/mariadb4/datadir:/var/lib/mysql \
        --volume /containers/mariadb3/conf.d:/etc/mysql/mariadb.conf.d \
        mariadb:10.2.15 \
        --wsrep_cluster_address=gcomm://mariadb1.weave.local,mariadb2.weave.local,mariadb3.weave.local,mariadb4.weave.local \
        --wsrep_sst_auth="root:PM7%cB43$sd@^1" \
        --wsrep_node_address=mariadb4.weave.local

Once the container is created, it will join the cluster and perform SST. It can be accessed on port 3307 of the physical host (outside of the Weave network), or on port 3306 from within the Weave network. It's no longer necessary to include mariadb0.weave.local in the cluster address. Once the cluster is scaled out, we need to add the new MariaDB container into the ProxySQL load balancing set via the admin console:

$ docker exec -it proxysql1 mysql -uadmin -padmin -P6032
mysql> INSERT INTO mysql_servers(hostgroup_id,hostname,port) VALUES (10,'mariadb4.weave.local',3306);
mysql> INSERT INTO mysql_servers(hostgroup_id,hostname,port) VALUES (20,'mariadb4.weave.local',3306);
mysql> LOAD MYSQL SERVERS TO RUNTIME;
mysql> SAVE MYSQL SERVERS TO DISK;

Repeat the above commands on the second ProxySQL instance.

Finally, for the last step (you may skip this part if you already ran the "SAVE .. TO DISK" statement in ProxySQL), add the following lines into proxysql.cnf to make it persistent across container restarts on host1 and host2:

$ vim /containers/proxysql1/proxysql.cnf # host1
$ vim /containers/proxysql2/proxysql.cnf # host2

And append the mariadb4-related lines under the mysql_servers directive:

mysql_servers =
(
        { address="mariadb1.weave.local" , port=3306 , hostgroup=10, max_connections=100 },
        { address="mariadb2.weave.local" , port=3306 , hostgroup=10, max_connections=100 },
        { address="mariadb3.weave.local" , port=3306 , hostgroup=10, max_connections=100 },
        { address="mariadb4.weave.local" , port=3306 , hostgroup=10, max_connections=100 },
        { address="mariadb1.weave.local" , port=3306 , hostgroup=20, max_connections=100 },
        { address="mariadb2.weave.local" , port=3306 , hostgroup=20, max_connections=100 },
        { address="mariadb3.weave.local" , port=3306 , hostgroup=20, max_connections=100 },
        { address="mariadb4.weave.local" , port=3306 , hostgroup=20, max_connections=100 }
)

Save the file and we should be good on the next container restart.

Scaling Down

To scale down, simply shut down the container gracefully. The best way would be:

$ docker kill -s 15 mariadb4
$ docker rm -f mariadb4

Remember that if a database node leaves the cluster ungracefully, it is not considered a scale-down and will affect the quorum calculation.

To remove the container from ProxySQL, run the following commands on both ProxySQL containers. For example, on proxysql1:

$ docker exec -it proxysql1 mysql -uadmin -padmin -P6032
mysql> DELETE FROM mysql_servers WHERE hostname="mariadb4.weave.local";
mysql> LOAD MYSQL SERVERS TO RUNTIME;
mysql> SAVE MYSQL SERVERS TO DISK;

You can then either remove the corresponding entry inside proxysql.cnf or just leave it like that. It will be detected as OFFLINE from ProxySQL's point of view anyway.

Summary

With Docker, things get a bit different from the conventional way on handling MySQL or MariaDB servers. Handling stateful services like Galera Cluster is not as easy as stateless applications, and requires proper testing and planning.

In our next blog on this topic, we will evaluate the pros and cons of running Galera Cluster on Docker without any orchestration tools.

How to Improve Performance of Galera Cluster for MySQL or MariaDB

Galera Cluster comes with many notable features that are not available in standard MySQL replication (or Group Replication): automatic node provisioning, true multi-master with conflict resolution, and automatic failover. There are also a number of limitations that could potentially impact cluster performance. Luckily, if you are aware of these, there are workarounds. And if you do it right, you can minimize the impact of these limitations and improve overall performance.

We have previously covered many tips and tricks related to Galera Cluster, including running Galera on AWS Cloud. This blog post distinctly dives into the performance aspects, with examples on how to get the most out of Galera.

Replication Payload

A bit of introduction - Galera replicates writesets during the commit stage, transferring writesets from the originator node to the receiver nodes synchronously through the wsrep replication plugin. This plugin will also certify writesets on the receiver nodes. If the certification process passes, OK is returned to the client on the originator node and the writeset will be applied on the receiver nodes asynchronously at a later time. Otherwise, the transaction is rolled back on the originator node (returning an error to the client) and the writesets that have been transferred to the receiver nodes are discarded.

A writeset consists of the write operations inside a transaction that change the database state. In Galera Cluster, autocommit defaults to 1 (enabled). Effectively, any SQL statement executed in Galera Cluster will be enclosed in a transaction, unless you explicitly start one with BEGIN, START TRANSACTION or SET autocommit=0. The following diagram illustrates the encapsulation of a single DML statement into a writeset:

For DML (INSERT, UPDATE, DELETE..), the writeset payload consists of the binary log events for a particular transaction, while for DDLs (ALTER, GRANT, CREATE..), the writeset payload is the DDL statement itself. For DMLs, the writeset has to be certified against conflicts on the receiver node, while for DDLs (depending on wsrep_OSU_method, which defaults to TOI), the cluster runs the DDL statement on all nodes in the same total order sequence, blocking other transactions from committing while the DDL is in progress (see also RSU). In simple words, Galera Cluster handles DDL and DML replication differently.
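
For illustration, here is a minimal, hedged sketch of running a DDL under RSU instead of TOI for the current session only. Under RSU the statement is not replicated, so it has to be repeated manually on every node; the table and column names below are purely illustrative:

mysql> SET SESSION wsrep_OSU_method = 'RSU'; -- desync this node and apply the DDL locally only
mysql> ALTER TABLE mydb.settings ADD COLUMN note VARCHAR(255);
mysql> SET SESSION wsrep_OSU_method = 'TOI'; -- switch back to the default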

Round Trip Time

Generally, the following factors determine how fast Galera can replicate a writeset from an originator node to all receiver nodes:

  • Round trip time (RTT) to the farthest node in the cluster from the originator node.
  • The size of a writeset to be transferred and certified for conflict on the receiver node.

For example, if we have a three-node Galera Cluster and one of the nodes is located 10 milliseconds away (0.01 second), it's very unlikely you will be able to write to the same row more than 100 times per second without conflicts. There is a popular quote from Mark Callaghan which describes this behaviour pretty well:

"[In a Galera cluster] a given row can’t be modified more than once per RTT"

To measure RTT value, simply perform ping on the originator node to the farthest node in the cluster:

$ ping 192.168.55.173 # the farthest node

Wait for a couple of seconds (or minutes) and terminate the command. The last line of the ping statistic section is what we are looking for:

--- 192.168.55.173 ping statistics ---
65 packets transmitted, 65 received, 0% packet loss, time 64019ms
rtt min/avg/max/mdev = 0.111/0.431/1.340/0.240 ms

The max value is 1.340 ms (0.00134s) and we should take this value when estimating the minimum transactions per second (tps) for this cluster. The average value is 0.431 ms (0.000431s) and we can use it to estimate the average tps, while the min value is 0.111 ms (0.000111s), which we can use to estimate the maximum tps. The mdev value indicates how widely the RTT samples are spread around the average; a lower value means a more stable RTT.

Hence, transactions per second can be estimated by dividing 1 second by the RTT in seconds, i.e. tps = 1 / RTT. This results in:

  • Minimum tps: 1 / 0.00134 (max RTT) = 746.26 ~ 746 tps
  • Average tps: 1 / 0.000431 (avg RTT) = 2320.19 ~ 2320 tps
  • Maximum tps: 1 / 0.000111 (min RTT) = 9009.01 ~ 9009 tps
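
If you want to script this estimation, a minimal sketch like the one below (assuming Linux iputils ping output, the bc calculator being installed, and an example target IP) extracts the average RTT and converts it into an estimated average tps:

$ RTT_AVG=$(ping -c 60 -q 192.168.55.173 | awk -F'/' '/rtt/ {print $5}')  # average RTT in ms
$ echo "Estimated average tps: $(echo "1000 / $RTT_AVG" | bc)"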

Note that this is just an estimation to anticipate replication performance. There is not much we can do to improve this on the database side once everything is deployed and running, other than moving or migrating the database servers closer to each other to improve the RTT between nodes, or upgrading the network peripherals or infrastructure. This would require a maintenance window and proper planning.

Chunk Up Big Transactions

Another factor is the transaction size. After the writeset is transferred, there will be a certification process. Certification is a process to determine whether or not the node can apply the writeset. Galera generates MD5 checksum pseudo keys from every full row. The cost of certification depends on the size of the writeset, which translates into a number of unique key lookups into the certification index (a hash table). If you update 500,000 rows in a single transaction, for example:

# a 500,000 rows table
mysql> UPDATE mydb.settings SET success = 1;

The above generates a single writeset with 500,000 binary log events in it. This huge writeset does not exceed wsrep_max_ws_size (default 2GB), so it will be transferred by the Galera replication plugin to all nodes in the cluster, and these 500,000 rows will be certified on the receiver nodes against any conflicting transactions that are still in the slave queue. Finally, the certification status is returned to the replication plugin on the originator node. The bigger the transaction, the higher the risk that it will conflict with transactions coming from other masters. Conflicting transactions waste server resources, plus cause a huge rollback on the originator node. Note that a rollback operation in MySQL is much slower and less optimized than a commit operation.

The above SQL statement can be re-written into a more Galera-friendly statement with the help of simple loop, like the example below:

(bash)$ for i in {1..500}; do \
mysql -uuser -ppassword -e "UPDATE mydb.settings SET success = 1 WHERE success != 1 LIMIT 1000"; \
sleep 2; \
done

The above shell command updates 1000 rows per transaction, 500 times, and waits 2 seconds between executions. You could also use a stored procedure or other means to achieve a similar result, as sketched below. If rewriting the SQL query is not an option, simply instruct the application to execute the big transaction during a maintenance window to reduce the risk of conflicts.
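
As a rough, hedged sketch of the stored procedure approach, assuming the same illustrative mydb.settings table and that autocommit stays enabled so each UPDATE commits as its own small writeset:

DELIMITER $$
CREATE PROCEDURE mydb.chunked_update()
BEGIN
  DECLARE affected INT DEFAULT 1;
  WHILE affected > 0 DO
    -- each iteration produces a small, Galera-friendly writeset
    UPDATE mydb.settings SET success = 1 WHERE success != 1 LIMIT 1000;
    SET affected = ROW_COUNT();
    DO SLEEP(2);
  END WHILE;
END$$
DELIMITER ;

CALL mydb.chunked_update();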

For huge deletes, consider using pt-archiver from the Percona Toolkit - a low-impact, forward-only job to nibble old data out of the table without impacting OLTP queries much.
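
A hedged example invocation is shown below; the host, schema, table, column and credentials are purely illustrative, and the options shown are common pt-archiver flags, so verify them against your Percona Toolkit version:

$ pt-archiver \
  --source h=192.168.55.171,D=mydb,t=logs,u=user,p=password \
  --where "created_at < NOW() - INTERVAL 90 DAY" \
  --purge --limit 1000 --commit-each --sleep 1 --progress 10000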

Parallel Slave Threads

In Galera, the applier is a multithreaded process. An applier is a thread running within Galera that applies incoming write-sets from another node. This means all receiver nodes can execute multiple DML operations coming from the originator (master) node simultaneously. Galera parallel replication is only applied to transactions when it is safe to do so. This improves the probability of the node keeping in sync with the originator node. However, the replication speed is still limited by the RTT and writeset size.

To get the best out of this, we need to know two things:

  • The number of cores the server has.
  • The value of wsrep_cert_deps_distance status.

The wsrep_cert_deps_distance status tells us the potential degree of parallelization. It is the average distance between the highest and lowest seqno values that can possibly be applied in parallel. You can use the wsrep_cert_deps_distance status variable to determine the maximum number of slave threads possible. Take note that this is an average value across time. Hence, in order to get a good value, you have to hit the cluster with write operations through a test workload or benchmark until you see a stable value coming out.

To get the number of cores, you can simply use the following command:

$ grep -c processor /proc/cpuinfo
4

Ideally, 2, 3 or 4 slave applier threads per CPU core is a good start. Thus, a reasonable starting value for the slave threads is 4 x the number of CPU cores, and it must not exceed the wsrep_cert_deps_distance value:

MariaDB [(none)]> SHOW STATUS LIKE 'wsrep_cert_deps_distance';
+--------------------------+----------+
| Variable_name            | Value    |
+--------------------------+----------+
| wsrep_cert_deps_distance | 48.16667 |
+--------------------------+----------+

You can control the number of slave applier threads using the wsrep_slave_threads variable. Even though this is a dynamic variable, only increasing the value has an immediate effect. If you reduce the value dynamically, it will take some time until the extra applier threads exit after finishing what they are applying. A recommended value is anywhere between 16 and 48:

mysql> SET GLOBAL wsrep_slave_threads = 48;

Take note that in order for parallel slave threads to work, the following must be set (which is usually pre-configured for Galera Cluster):

innodb_autoinc_lock_mode=2
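
To make both settings persistent, they can also go into the MySQL configuration file. A minimal sketch (the thread count is illustrative and should be tuned against your core count and wsrep_cert_deps_distance):

[mysqld]
wsrep_slave_threads = 16
innodb_autoinc_lock_mode = 2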

Galera Cache (gcache)

Galera uses a preallocated file with a specific size called gcache, where a Galera node keeps a copy of writesets in circular buffer style. By default, its size is 128MB, which is rather small. Incremental State Transfer (IST) is a method to prepare a joiner by sending only the missing writesets available in the donor’s gcache. IST is faster than state snapshot transfer (SST), it is non-blocking and has no significant performance impact on the donor. It should be the preferred option whenever possible.

IST can only be achieved if all changes missed by the joiner are still in the gcache file of the donor. The recommended setting for this is to make it as big as the whole MySQL dataset. If disk space is limited or costly, determining the right gcache size is crucial, as it can influence the data synchronization performance between Galera nodes.

The below statement will give us an idea of the amount of data replicated by Galera. Run the following statement on one of the Galera nodes during peak hours (tested on MariaDB >10.0 and PXC >5.6, galera >3.x):

mysql> SET @start := (SELECT SUM(VARIABLE_VALUE/1024/1024) FROM information_schema.global_status WHERE VARIABLE_NAME LIKE 'WSREP%bytes'); do sleep(60); SET @end := (SELECT SUM(VARIABLE_VALUE/1024/1024) FROM information_schema.global_status WHERE VARIABLE_NAME LIKE 'WSREP%bytes'); SET @gcache := (SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(@@GLOBAL.wsrep_provider_options,'gcache.size = ',-1), 'M', 1)); SELECT ROUND((@end - @start),2) AS `MB/min`, ROUND((@end - @start),2) * 60 as `MB/hour`, @gcache as `gcache Size(MB)`, ROUND(@gcache/round((@end - @start),2),2) as `Time to full(minutes)`;
+--------+---------+-----------------+-----------------------+
| MB/min | MB/hour | gcache Size(MB) | Time to full(minutes) |
+--------+---------+-----------------+-----------------------+
|   7.95 |  477.00 |  128            |                 16.10 |
+--------+---------+-----------------+-----------------------+

We can estimate that the Galera node can have approximately 16 minutes of downtime without requiring SST to join (unless Galera cannot determine the joiner state). If this is too short a time and you have enough disk space on your nodes, you can change wsrep_provider_options="gcache.size=<value>" to a more appropriate value. In this example workload, setting gcache.size=1G allows us to have 2 hours of node downtime with a high probability of IST when the node rejoins.

It's also recommended to use gcache.recover=yes in wsrep_provider_options (Galera >3.19), where Galera will attempt to recover the gcache file to a usable state on startup rather than delete it, thus preserving the ability to have IST and avoiding SST as much as possible. Codership and Percona have covered this in detail in their blogs. IST is always the best method to sync up after a node rejoins the cluster. It is 50% faster than xtrabackup or mariabackup and 5x faster than mysqldump.
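
A minimal configuration sketch combining both settings (the 1GB value is simply the size estimated from the example workload above):

[mysqld]
# merge with any other provider options you already set
wsrep_provider_options="gcache.size=1G;gcache.recover=yes"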

Asynchronous Slave

Galera nodes are tightly-coupled, where the replication performance is as fast as the slowest node. Galera uses a flow control mechanism to control the replication flow among members and eliminate slave lag. The replication can be all fast or all slow on every node and is adjusted automatically by Galera. If you want to know more about flow control, read this blog post by Jay Janssen from Percona.

In most cases, heavy operations like long-running analytics (read-intensive) and backups (read-intensive, locking) are often inevitable, and these could potentially degrade the cluster performance. The best way to execute this type of query is by sending it to a loosely-coupled replica server, for instance, an asynchronous slave.

An asynchronous slave replicates from a Galera node using the standard MySQL asynchronous replication protocol. There is no limit on the number of slaves that can be connected to one Galera node, and chaining it out with an intermediate master is also possible. MySQL operations that execute on this server won't impact the cluster performance, apart from the initial syncing phase where a full backup must be taken on the Galera node to stage the slave before establishing the replication link (although ClusterControl allows you to build the async slave from an existing backup first, before connecting it to the cluster).

GTID (Global Transaction Identifier) provides better transaction mapping across nodes, and is supported in MySQL 5.6 and MariaDB 10.0. With GTID, the failover operation on a slave to another master (another Galera node) is simplified, without the need to figure out the exact log file and position. Galera also comes with its own GTID implementation, but these two are independent of each other.

Scaling out an asynchronous slave is one click away if you are using the ClusterControl -> Add Replication Slave feature:

Take note that binary logs must be enabled on the master (the chosen Galera node) before we can proceed with this setup. We have also covered the manual way in this previous post.

The following screenshot from ClusterControl shows the cluster topology, it illustrates our Galera Cluster architecture with an asynchronous slave:

ClusterControl automatically discovers the topology and generates the super cool diagram like above. You can also perform administration tasks directly from this page by clicking on the top-right gear icon of each box.

SQL-aware Reverse Proxy

ProxySQL and MariaDB MaxScale are intelligent reverse-proxies that understand the MySQL protocol and are capable of acting as a gateway, router, load balancer and firewall in front of your Galera nodes. With the help of a virtual IP address provider like LVS or Keepalived, combined with Galera's multi-master replication technology, we can have a highly available database service, eliminating all possible single points of failure (SPOF) from the application point of view. This will surely improve the availability and reliability of the architecture as a whole.

Another advantage of this approach is that you will have the ability to monitor, rewrite or re-route the incoming SQL queries based on a set of rules before they hit the actual database server, minimizing the changes on the application or client side and routing queries to a more suitable node for optimal performance. Risky queries for Galera like LOCK TABLES and FLUSH TABLES WITH READ LOCK can be blocked well before they cause havoc on the system, while impacting queries like "hotspot" queries (rows that different queries want to access at the same time) can be rewritten or redirected to a single Galera node to reduce the risk of transaction conflicts. Heavy read-only queries like OLAP or backups can be routed to an asynchronous slave if you have one.
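
As a hedged illustration of such routing with ProxySQL's admin interface (assuming hostgroup 10 is the writer group and 20 the reader group, matching the earlier proxysql.cnf snippet; the rule_id values and patterns are arbitrary examples):

mysql> INSERT INTO mysql_query_rules (rule_id, active, match_digest, destination_hostgroup, apply)
       VALUES (100, 1, '^SELECT .* FOR UPDATE', 10, 1), (101, 1, '^SELECT .*', 20, 1);
mysql> LOAD MYSQL QUERY RULES TO RUNTIME;
mysql> SAVE MYSQL QUERY RULES TO DISK;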

The reverse proxy also monitors the database state, queries and variables to understand topology changes and produce accurate routing decisions to the backend servers. Indirectly, it centralizes node monitoring and the cluster overview, without the need to check each and every Galera node regularly. The following screenshot shows the ProxySQL monitoring dashboard in ClusterControl:

There are also many other benefits that a load balancer can bring to improve Galera Cluster significantly, as covered in details in this blog post, Become a ClusterControl DBA: Making your DB components HA via Load Balancers.

Final Thoughts

With good understanding on how Galera Cluster internally works, we can work around some of the limitations and improve the database service. Happy clustering!

How to Recover MySQL Galera Cluster from an Asynchronous Slave?


Introduction

When running Galera Cluster, it is a common practice to add one or more asynchronous slaves in the same or in a different datacenter. This provides us with a contingency plan with low RTO, and with a low operating cost. In the case of an unrecoverable problem in our cluster, we can quickly failover to it so applications can continue to have access to data.

In such a scenario, we cannot simply rebuild our cluster from a previous backup. Since the async slave is now the new source of truth, we need to rebuild the cluster from it.

This does not mean that we only have one way to do it, maybe there is even a better way! Feel free to give us your suggestions in the comments section at the end of this post.

Topology

ClusterControl Topology View Online

Above, we can see a sample topology with Galera Cluster and an asynchronous replica/slave.

Database Diagram 1

Next we will see how we can recreate our cluster, starting from the slave, in the case of finding something like this:

Database Diagram 2
ClusterControl Topology View Offline

If we look at the previous image, we can see our 3 Galera nodes are down. Our slave is not able to connect to the Galera master, but it is in an "Up and running" state.

Promote slave

As our slave is working properly, we can promote it to master and point our applications to it. For this, we must disable the read-only parameter in our slave and reset the slave configuration.

In our slave (mysql1):

mysql> SET GLOBAL read_only=0;
Query OK, 0 rows affected (0.00 sec)
mysql> STOP SLAVE;
Query OK, 0 rows affected (0.00 sec)
mysql> RESET SLAVE;
Query OK, 0 rows affected (0.18 sec)

Create new cluster

Next, to start recovery of our failed cluster, we will create a new Galera Cluster. This can be easily done through ClusterControl; please scroll further down in this blog to see how.

Once we have deployed our new Galera cluster, we would have something like the following:

Database Diagram 3

Replication

We must ensure that we have the replication parameters configured.

For Galera nodes (galera1, galera2, galera3):

server_id=<ID>         # Different value in each node
binlog_format=ROW
log_bin = /var/lib/mysql-binlog/binlog
log_slave_updates = ON
gtid_mode = ON
enforce_gtid_consistency = true
relay_log = relay-bin
expire_logs_days = 7

For Master node (mysql1):

server_id=<ID>         # Different value in each node
binlog_format=ROW
log_bin=binlog
log_slave_updates=1
gtid_mode=ON
enforce_gtid_consistency=1
relay_log=relay-bin
expire_logs_days=7
read_only=ON
sync_binlog=1
report_host=<HOSTNAME or IP>    # Local server

In order for our new slave (galera1) to connect with our new master (mysql1), we must create a user with replication permissions in our master.

In our new master (mysql1):

mysql> GRANT REPLICATION SLAVE ON *.* TO 'slave_user'@'%' IDENTIFIED BY 'slave_password';

Note: We can replace the "%" with the IP of the Galera Cluster node that will be our slave, in our example, galera1.

Backup

If we do not have it, we must create a consistent backup of our master (mysql1) and load it in our new Galera Cluster. For this, we can use XtraBackup tool or mysqldump. Let’s see both options.

In our example we use the sakila database available for testing.

XtraBackup tool

We generate the backup in the new master (mysql1). In our case we send it to the local directory /root/backup:

$ innobackupex /root/backup/

We must get the message:

180705 22:08:14 completed OK!

We compress the backup and send it to the node that will be our slave (galera1):

$ cd /root/backup
$ tar zcvf 2018-07-05_22-08-07.tar.gz 2018-07-05_22-08-07
$ scp /root/backup/2018-07-05_22-08-07.tar.gz galera1:/root/backup/

In galera1, extract the backup:

$ tar zxvf /root/backup/2018-07-05_22-08-07.tar.gz

We stop the cluster (if it is started). For this we stop the mysql services of the 3 nodes:

$ service mysql stop

In galera1, we rename the data directory of mysql and load the backup:

$ mv /var/lib/mysql /var/lib/mysql.bak
$ innobackupex --copy-back /root/backup/2018-07-05_22-08-07

We must get the message:

180705 23:00:01 completed OK!

We assign the correct permissions on the data directory:

$ chown -R mysql.mysql /var/lib/mysql

Then we must initialize (bootstrap) the cluster, starting from galera1.
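
How exactly the bootstrap is performed depends on the distribution and init system; a minimal sketch, assuming a systemd-based MariaDB installation on galera1, is:

$ galera_new_cluster

On setups without this helper script, the equivalent is to start mysqld on the first node with the --wsrep-new-cluster option.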

Once the first node is initialized, we must start the MySQL service on the remaining nodes, removing any previous copy of the grastate.dat file, and then verify that our data is up to date.

$ rm /var/lib/mysql/grastate.dat
$ service mysql start

Note: Verify that the user used by XtraBackup is created in our initialized node, and is the same in each node.

mysqldump

In general, we do not recommend doing this with mysqldump, because it can be quite slow with a large volume of data. But it is a valid alternative for performing the task.

We generate the backup in the new master (mysql1):

$ mysqldump -uroot -p --single-transaction --skip-add-locks --triggers --routines --events --databases sakila > /root/backup/sakila_dump.sql

We compress it and send it to our slave node (galera1):

$ gzip /root/backup/sakila_dump.sql
$ scp /root/backup/sakila_dump.sql.gz galera1:/root/backup/

We load the dump into galera1.

$ gunzip /root/backup/sakila_dump.sql.gz
$ mysql -p < /root/backup/sakila_dump.sql

When the dump is loaded in galera1, we must restart the MySQL service on the remaining nodes, removing the file grastate.dat, and verify that we have our data updated.

$ rm /var/lib/mysql/grastate.dat
$ service mysql start

Start replication slave

Regardless of which option we chose, XtraBackup or mysqldump, if everything went well we can now enable replication on the node that will be our slave (galera1).

mysql> CHANGE MASTER TO MASTER_HOST = 'mysql1', MASTER_PORT = 3306, MASTER_USER = 'slave_user', MASTER_PASSWORD = 'slave_password', MASTER_AUTO_POSITION = 1;
mysql> START SLAVE;

We verify that the slave is working:

mysql> SHOW SLAVE STATUS\G
       Slave_IO_Running: Yes
       Slave_SQL_Running: Yes

At this point, we have something like the following:

Database Diagram 4

After NewGalera1 is up to date, we can re-point the application to our new Galera cluster and reconfigure the asynchronous replication.

ClusterControl

As we mentioned earlier, with ClusterControl we can do several of the tasks mentioned above in a few simple clicks. It also has automatic recovery options, for both the nodes and the cluster. Let's see some tasks that it can assist with.

ClusterControl Deployment 1

To perform a deployment, simply select the option “Deploy Database Cluster” and follow the instructions that appear.

ClusterControl Deployment 2

We can choose between different kinds of technologies and vendors. We must specify User, Key or Password and port to connect by SSH to our servers. We also need the name for our new cluster and if we want ClusterControl to install the corresponding software and configurations for us.

ClusterControl Deployment 3

After setting up the SSH access information, we must define the nodes in our cluster. We can also specify which repository to use. We need to add our servers to the cluster that we are going to create.

We can monitor the status of the creation of our new cluster from the ClusterControl activity monitor.

Also, we can do an import of our current cluster or database following the same steps. In this case, ClusterControl won’t install the database software, because there is already a database running.

ClusterControl Add Replication Slave

To add a replication slave, you need to click on Cluster Actions, select Add Replication Slave, and add the SSH access information of the new server. ClusterControl will connect to the server to make the necessary configurations for this action.

ClusterControl Enable Binary Logging

To turn one or more Galera nodes into master servers (in the sense of producing binary logs), you can go to Node Actions and select Enable Binary Logging.

ClusterControl Backups

Backups can be configured with XtraBackup (full or incremental) and mysqldump, and you have other options like uploading the backup to the cloud, encryption, compression, scheduling and more.

ClusterControl Restore

To restore a backup, go to the Backup tab, choose the Restore option, and then select which server you want to restore it to.

ClusterControl Change Replication Master

If you have a slave and you want to change the master, or rebuild the replication, you can go to Node Actions and select the option.

Conclusion

As we have seen, there are several ways to achieve our goal, some more complex, others more user friendly, but with any of them you can recreate a cluster from an asynchronous slave. Xtrabackup restores faster for larger data volumes. To guard against operator error (e.g., an erroneous DROP TABLE), you could also use a delayed slave so you hopefully have time to stop the statement from propagating.

We hope that this information is useful, and that you never have to use it in production ;)


Galera Cluster Recovery 101 - A Deep Dive into Network Partitioning


One of the cool features in Galera is automatic node provisioning and membership control. If a node fails or loses communication, it will be automatically evicted from the cluster and remain non-operational. As long as the majority of nodes are still communicating (Galera calls this PC - primary component), there is a very high chance the failed node will be able to automatically rejoin, resync and resume replication once connectivity is back.

Generally, all Galera nodes are equal. They hold the same data set and the same role as masters, capable of handling reads and writes simultaneously, thanks to Galera's group communication and certification-based replication plugin. Therefore, there is actually no failover from the database point of view due to this equilibrium. Failover is only needed on the application side, to skip the non-operational nodes while the cluster is partitioned.

In this blog post, we are going to look into how Galera Cluster performs node and cluster recovery in case a network partition happens. Just as a side note, we have covered a similar topic in this blog post some time back. Codership has explained Galera's recovery concept in great detail in the documentation page, Node Failure and Recovery.

Node Failure and Eviction

In order to understand recovery, we first have to understand how Galera detects node failure and handles the eviction process. Let's put this into a controlled test scenario so we can understand the eviction process better. Suppose we have a three-node Galera Cluster as illustrated below:

The following command can be used to retrieve our Galera provider options:

mysql> SHOW VARIABLES LIKE 'wsrep_provider_options'\G

It's a long list, but we just need to focus on some of the parameters to explain the process:

evs.inactive_check_period = PT0.5S; 
evs.inactive_timeout = PT15S; 
evs.keepalive_period = PT1S; 
evs.suspect_timeout = PT5S; 
evs.view_forget_timeout = P1D;
gmcast.peer_timeout = PT3S;

First of all, Galera follows ISO 8601 formatting to represent duration. P1D means the duration is one day, while PT15S means the duration is 15 seconds (note the time designator, T, that precedes the time value). For example if one wanted to increase evs.view_forget_timeout to 1 day and a half, one would set P1DT12H, or PT36H.

Assuming no firewall rules have been configured on any of the hosts, we use the following script, called block_galera.sh, on galera2 to simulate a network failure to/from this node:

#!/bin/bash
# block_galera.sh
# galera2, 192.168.55.172

iptables -I INPUT -m tcp -p tcp --dport 4567 -j REJECT
iptables -I INPUT -m tcp -p tcp --dport 3306 -j REJECT
iptables -I OUTPUT -m tcp -p tcp --dport 4567 -j REJECT
iptables -I OUTPUT -m tcp -p tcp --dport 3306 -j REJECT
# print timestamp
date

By executing the script, we get the following output:

$ ./block_galera.sh
Wed Jul  4 16:46:02 UTC 2018

The reported timestamp can be considered as the start of the cluster partitioning, where we lose galera2, while galera1 and galera3 are still online and accessible. At this point, our Galera Cluster architecture is looking something like this:

From Partitioned Node Perspective

On galera2, you will see some printouts inside the MySQL error log. Let's break them out into several parts. The downtime was started around 16:46:02 UTC time and after gmcast.peer_timeout=PT3S, the following appears:

2018-07-04 16:46:05 140454904243968 [Note] WSREP: (62116b35, 'tcp://0.0.0.0:4567') connection to peer 8b2041d6 with addr tcp://192.168.55.173:4567 timed out, no messages seen in PT3S
2018-07-04 16:46:05 140454904243968 [Note] WSREP: (62116b35, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://192.168.55.173:4567
2018-07-04 16:46:06 140454904243968 [Note] WSREP: (62116b35, 'tcp://0.0.0.0:4567') connection to peer 737422d6 with addr tcp://192.168.55.171:4567 timed out, no messages seen in PT3S
2018-07-04 16:46:06 140454904243968 [Note] WSREP: (62116b35, 'tcp://0.0.0.0:4567') reconnecting to 8b2041d6 (tcp://192.168.55.173:4567), attempt 0

As it passed evs.suspect_timeout = PT5S, both nodes galera1 and galera3 are suspected as dead by galera2:

2018-07-04 16:46:07 140454904243968 [Note] WSREP: evs::proto(62116b35, OPERATIONAL, view_id(REG,62116b35,54)) suspecting node: 8b2041d6
2018-07-04 16:46:07 140454904243968 [Note] WSREP: evs::proto(62116b35, OPERATIONAL, view_id(REG,62116b35,54)) suspected node without join message, declaring inactive
2018-07-04 16:46:07 140454904243968 [Note] WSREP: (62116b35, 'tcp://0.0.0.0:4567') reconnecting to 737422d6 (tcp://192.168.55.171:4567), attempt 0
2018-07-04 16:46:08 140454904243968 [Note] WSREP: evs::proto(62116b35, GATHER, view_id(REG,62116b35,54)) suspecting node: 737422d6
2018-07-04 16:46:08 140454904243968 [Note] WSREP: evs::proto(62116b35, GATHER, view_id(REG,62116b35,54)) suspected node without join message, declaring inactive

Then, Galera will revise the current cluster view and the position of this node:

2018-07-04 16:46:09 140454904243968 [Note] WSREP: view(view_id(NON_PRIM,62116b35,54) memb {
        62116b35,0
} joined {
} left {
} partitioned {
        737422d6,0
        8b2041d6,0
})
2018-07-04 16:46:09 140454904243968 [Note] WSREP: view(view_id(NON_PRIM,62116b35,55) memb {
        62116b35,0
} joined {
} left {
} partitioned {
        737422d6,0
        8b2041d6,0
})

With the new cluster view, Galera will perform quorum calculation to decide whether this node is part of the primary component. If the new component sees "primary = no", Galera will demote the local node state from SYNCED to OPEN:

2018-07-04 16:46:09 140454288942848 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2018-07-04 16:46:09 140454288942848 [Note] WSREP: Flow-control interval: [16, 16]
2018-07-04 16:46:09 140454288942848 [Note] WSREP: Trying to continue unpaused monitor
2018-07-04 16:46:09 140454288942848 [Note] WSREP: Received NON-PRIMARY.
2018-07-04 16:46:09 140454288942848 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 2753699)

With the latest change on the cluster view and node state, Galera returns the post-eviction cluster view and global state as below:

2018-07-04 16:46:09 140454222194432 [Note] WSREP: New cluster view: global state: 55238f52-41ee-11e8-852f-3316bdb654bc:2753699, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 3
2018-07-04 16:46:09 140454222194432 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.

You can see the following global status of galera2 have changed during this period:

mysql> SELECT * FROM information_schema.global_status WHERE variable_name IN ('WSREP_CLUSTER_STATUS','WSREP_LOCAL_STATE_COMMENT','WSREP_CLUSTER_SIZE','WSREP_EVS_DELAYED','WSREP_READY');
+---------------------------+-----------------------------------------------------------------------------------------------------------------------------------+
| VARIABLE_NAME             | VARIABLE_VALUE                                                                                                                    |
+---------------------------+-----------------------------------------------------------------------------------------------------------------------------------+
| WSREP_CLUSTER_SIZE        | 1                                                                                                                                 |
| WSREP_CLUSTER_STATUS      | non-Primary                                                                                                                       |
| WSREP_EVS_DELAYED         | 737422d6-7db3-11e8-a2a2-bbe98913baf0:tcp://192.168.55.171:4567:1,8b2041d6-7f62-11e8-87d5-12a76678131f:tcp://192.168.55.173:4567:2 |
| WSREP_LOCAL_STATE_COMMENT | Initialized                                                                                                                       |
| WSREP_READY               | OFF                                                                                                                               |
+---------------------------+-----------------------------------------------------------------------------------------------------------------------------------+

At this point, MySQL/MariaDB server on galera2 is still accessible (database is listening on 3306 and Galera on 4567) and you can query the mysql system tables and list out the databases and tables. However when you jump into the non-system tables and make a simple query like this:

mysql> SELECT * FROM sbtest1;
ERROR 1047 (08S01): WSREP has not yet prepared node for application use

You will immediately get an error indicating that WSREP is loaded but not yet ready to be used by this node, as reported by the wsrep_ready status. This is due to the node losing its connection to the Primary Component, causing it to enter a non-operational state (the local node status was changed from SYNCED to OPEN). Data reads from nodes in a non-operational state are considered stale, unless you set wsrep_dirty_reads=ON to permit them, although Galera still rejects any command that modifies or updates the database.
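
For example, to explicitly allow such stale reads for the current session on the partitioned node (a hedged sketch; wsrep_dirty_reads is available in recent MariaDB and Percona XtraDB Cluster versions, and the table name comes from the earlier example):

mysql> SET SESSION wsrep_dirty_reads = ON;
mysql> SELECT COUNT(*) FROM sbtest1;

Writes are still rejected as long as the node stays outside the Primary Component.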

Finally, Galera will keep on listening and reconnecting to other members in the background infinitely:

2018-07-04 16:47:12 140454904243968 [Note] WSREP: (62116b35, 'tcp://0.0.0.0:4567') reconnecting to 8b2041d6 (tcp://192.168.55.173:4567), attempt 30
2018-07-04 16:47:13 140454904243968 [Note] WSREP: (62116b35, 'tcp://0.0.0.0:4567') reconnecting to 737422d6 (tcp://192.168.55.171:4567), attempt 30
2018-07-04 16:48:20 140454904243968 [Note] WSREP: (62116b35, 'tcp://0.0.0.0:4567') reconnecting to 8b2041d6 (tcp://192.168.55.173:4567), attempt 60
2018-07-04 16:48:22 140454904243968 [Note] WSREP: (62116b35, 'tcp://0.0.0.0:4567') reconnecting to 737422d6 (tcp://192.168.55.171:4567), attempt 60

The eviction process flow by Galera group communication for the partitioned node during network issue can be summarized as below:

  1. Disconnects from the cluster after gmcast.peer_timeout.
  2. Suspects other nodes after evs.suspect_timeout.
  3. Retrieves the new cluster view.
  4. Performs quorum calculation to determine the node's state.
  5. Demotes the node from SYNCED to OPEN.
  6. Attempts to reconnect to the primary component (other Galera nodes) in the background.

From Primary Component Perspective

On galera1 and galera3 respectively, after gmcast.peer_timeout=PT3S, the following appears in the MySQL error log:

2018-07-04 16:46:05 139955510687488 [Note] WSREP: (8b2041d6, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://192.168.55.172:4567
2018-07-04 16:46:06 139955510687488 [Note] WSREP: (8b2041d6, 'tcp://0.0.0.0:4567') reconnecting to 62116b35 (tcp://192.168.55.172:4567), attempt 0

After it passed evs.suspect_timeout = PT5S, galera2 is suspected as dead by galera3 (and galera1):

2018-07-04 16:46:10 139955510687488 [Note] WSREP: evs::proto(8b2041d6, OPERATIONAL, view_id(REG,62116b35,54)) suspecting node: 62116b35
2018-07-04 16:46:10 139955510687488 [Note] WSREP: evs::proto(8b2041d6, OPERATIONAL, view_id(REG,62116b35,54)) suspected node without join message, declaring inactive

Galera then checks whether the other nodes respond to group communication; on galera3, it finds galera1 in a primary and stable state:

2018-07-04 16:46:11 139955510687488 [Note] WSREP: declaring 737422d6 at tcp://192.168.55.171:4567 stable
2018-07-04 16:46:11 139955510687488 [Note] WSREP: Node 737422d6 state prim

Galera revises the cluster view of this node (galera3):

2018-07-04 16:46:11 139955510687488 [Note] WSREP: view(view_id(PRIM,737422d6,55) memb {
        737422d6,0
        8b2041d6,0
} joined {
} left {
} partitioned {
        62116b35,0
})
2018-07-04 16:46:11 139955510687488 [Note] WSREP: save pc into disk

Galera then removes the partitioned node from the Primary Component:

2018-07-04 16:46:11 139955510687488 [Note] WSREP: forgetting 62116b35 (tcp://192.168.55.172:4567)

The new Primary Component now consists of two nodes, galera1 and galera3:

2018-07-04 16:46:11 139955502294784 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2

The Primary Component members will exchange state with each other to agree on the new cluster view and global state:

2018-07-04 16:46:11 139955502294784 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
2018-07-04 16:46:11 139955510687488 [Note] WSREP: (8b2041d6, 'tcp://0.0.0.0:4567') turning message relay requesting off
2018-07-04 16:46:11 139955502294784 [Note] WSREP: STATE EXCHANGE: sent state msg: b3d38100-7f66-11e8-8e70-8e3bf680c993
2018-07-04 16:46:11 139955502294784 [Note] WSREP: STATE EXCHANGE: got state msg: b3d38100-7f66-11e8-8e70-8e3bf680c993 from 0 (192.168.55.171)
2018-07-04 16:46:11 139955502294784 [Note] WSREP: STATE EXCHANGE: got state msg: b3d38100-7f66-11e8-8e70-8e3bf680c993 from 1 (192.168.55.173)

Galera calculates and verifies the quorum of the state exchange between online members:

2018-07-04 16:46:11 139955502294784 [Note] WSREP: Quorum results:
        version    = 4,
        component  = PRIMARY,
        conf_id    = 27,
        members    = 2/2 (joined/total),
        act_id     = 2753703,
        last_appl. = 2753606,
        protocols  = 0/8/3 (gcs/repl/appl),
        group UUID = 55238f52-41ee-11e8-852f-3316bdb654bc
2018-07-04 16:46:11 139955502294784 [Note] WSREP: Flow-control interval: [23, 23]
2018-07-04 16:46:11 139955502294784 [Note] WSREP: Trying to continue unpaused monitor

Galera updates the new cluster view and global state after galera2 eviction:

2018-07-04 16:46:11 139955214169856 [Note] WSREP: New cluster view: global state: 55238f52-41ee-11e8-852f-3316bdb654bc:2753703, view# 28: Primary, number of nodes: 2, my index: 1, protocol version 3
2018-07-04 16:46:11 139955214169856 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2018-07-04 16:46:11 139955214169856 [Note] WSREP: REPL Protocols: 8 (3, 2)
2018-07-04 16:46:11 139955214169856 [Note] WSREP: Assign initial position for certification: 2753703, protocol version: 3
2018-07-04 16:46:11 139956691814144 [Note] WSREP: Service thread queue flushed.

Galera then cleans up the partitioned node (galera2) from the active list:

2018-07-04 16:46:14 139955510687488 [Note] WSREP: cleaning up 62116b35 (tcp://192.168.55.172:4567)

At this point, both galera1 and galera3 will be reporting similar global status:

mysql> SELECT * FROM information_schema.global_status WHERE variable_name IN ('WSREP_CLUSTER_STATUS','WSREP_LOCAL_STATE_COMMENT','WSREP_CLUSTER_SIZE','WSREP_EVS_DELAYED','WSREP_READY');
+---------------------------+------------------------------------------------------------------+
| VARIABLE_NAME             | VARIABLE_VALUE                                                   |
+---------------------------+------------------------------------------------------------------+
| WSREP_CLUSTER_SIZE        | 2                                                                |
| WSREP_CLUSTER_STATUS      | Primary                                                          |
| WSREP_EVS_DELAYED         | 1491abd9-7f6d-11e8-8930-e269b03673d8:tcp://192.168.55.172:4567:1 |
| WSREP_LOCAL_STATE_COMMENT | Synced                                                           |
| WSREP_READY               | ON                                                               |
+---------------------------+------------------------------------------------------------------+

They list out the problematic member in the wsrep_evs_delayed status. Since the local state is "Synced", these nodes are operational and you can redirect the client connections from galera2 to any of them. If this step is inconvenient, consider using a load balancer sitting in front of the database to simplify the connection endpoint from the clients.

Node Recovery and Joining

A partitioned Galera node will keep on attempting to establish connection with the Primary Component infinitely. Let's flush the iptables rules on galera2 to let it connect with the remaining nodes:

# on galera2
$ iptables -F

Once the node is capable of connecting to one of the nodes, Galera will start re-establishing the group communication automatically:

2018-07-09 10:46:34 140075962705664 [Note] WSREP: (1491abd9, 'tcp://0.0.0.0:4567') connection established to 8b2041d6 tcp://192.168.55.173:4567
2018-07-09 10:46:34 140075962705664 [Note] WSREP: (1491abd9, 'tcp://0.0.0.0:4567') connection established to 737422d6 tcp://192.168.55.171:4567
2018-07-09 10:46:34 140075962705664 [Note] WSREP: declaring 737422d6 at tcp://192.168.55.171:4567 stable
2018-07-09 10:46:34 140075962705664 [Note] WSREP: declaring 8b2041d6 at tcp://192.168.55.173:4567 stable

Node galera2 will then connect to one of the Primary Component members (in this case galera1, node ID 737422d6) to get the current cluster view and node states:

2018-07-09 10:46:34 140075962705664 [Note] WSREP: Node 737422d6 state prim
2018-07-09 10:46:34 140075962705664 [Note] WSREP: view(view_id(PRIM,1491abd9,142) memb {
        1491abd9,0
        737422d6,0
        8b2041d6,0
} joined {
} left {
} partitioned {
})
2018-07-09 10:46:34 140075962705664 [Note] WSREP: save pc into disk

Galera will then perform state exchange with the rest of the members that can form the Primary Component:

2018-07-09 10:46:34 140075954312960 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 3
2018-07-09 10:46:34 140075954312960 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 4b23eaa0-8322-11e8-a87e-fe4e0fce2a5f
2018-07-09 10:46:34 140075954312960 [Note] WSREP: STATE EXCHANGE: sent state msg: 4b23eaa0-8322-11e8-a87e-fe4e0fce2a5f
2018-07-09 10:46:34 140075954312960 [Note] WSREP: STATE EXCHANGE: got state msg: 4b23eaa0-8322-11e8-a87e-fe4e0fce2a5f from 0 (192.168.55.172)
2018-07-09 10:46:34 140075954312960 [Note] WSREP: STATE EXCHANGE: got state msg: 4b23eaa0-8322-11e8-a87e-fe4e0fce2a5f from 1 (192.168.55.171)
2018-07-09 10:46:34 140075954312960 [Note] WSREP: STATE EXCHANGE: got state msg: 4b23eaa0-8322-11e8-a87e-fe4e0fce2a5f from 2 (192.168.55.173)

The state exchange allows galera2 to calculate the quorum and produce the following result:

2018-07-09 10:46:34 140075954312960 [Note] WSREP: Quorum results:
        version    = 4,
        component  = PRIMARY,
        conf_id    = 71,
        members    = 2/3 (joined/total),
        act_id     = 2836958,
        last_appl. = 0,
        protocols  = 0/8/3 (gcs/repl/appl),
        group UUID = 55238f52-41ee-11e8-852f-3316bdb654bc

Galera will then promote the local node state from OPEN to PRIMARY, to start and establish the node connection to the Primary Component:

2018-07-09 10:46:34 140075954312960 [Note] WSREP: Flow-control interval: [28, 28]
2018-07-09 10:46:34 140075954312960 [Note] WSREP: Trying to continue unpaused monitor
2018-07-09 10:46:34 140075954312960 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 2836958)

Galera then calculates the gap, i.e. how far the node is behind the cluster. This node requires a state transfer to catch up from writeset number 2761994 to 2836958:

2018-07-09 10:46:34 140075929970432 [Note] WSREP: State transfer required:
        Group state: 55238f52-41ee-11e8-852f-3316bdb654bc:2836958
        Local state: 55238f52-41ee-11e8-852f-3316bdb654bc:2761994
2018-07-09 10:46:34 140075929970432 [Note] WSREP: New cluster view: global state: 55238f52-41ee-11e8-852f-3316bdb654bc:2836958, view# 72: Primary, number of nodes:
3, my index: 0, protocol version 3
2018-07-09 10:46:34 140075929970432 [Warning] WSREP: Gap in state sequence. Need state transfer.
2018-07-09 10:46:34 140075929970432 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2018-07-09 10:46:34 140075929970432 [Note] WSREP: REPL Protocols: 8 (3, 2)
2018-07-09 10:46:34 140075929970432 [Note] WSREP: Assign initial position for certification: 2836958, protocol version: 3

Galera prepares the IST listener on port 4568 on this node and asks any Synced node in the cluster to become a donor. In this case, Galera automatically picks galera3 (192.168.55.173), or it could also pick a donor from the list under wsrep_sst_donor (if defined) for the syncing operation:

2018-07-09 10:46:34 140075996276480 [Note] WSREP: Service thread queue flushed.
2018-07-09 10:46:34 140075929970432 [Note] WSREP: IST receiver addr using tcp://192.168.55.172:4568
2018-07-09 10:46:34 140075929970432 [Note] WSREP: Prepared IST receiver, listening at: tcp://192.168.55.172:4568
2018-07-09 10:46:34 140075954312960 [Note] WSREP: Member 0.0 (192.168.55.172) requested state transfer from '*any*'. Selected 2.0 (192.168.55.173)(SYNCED) as donor.

It will then change the local node state from PRIMARY to JOINER. At this stage, galera2's state transfer request has been granted and it starts to cache write-sets:

2018-07-09 10:46:34 140075954312960 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 2836958)
2018-07-09 10:46:34 140075929970432 [Note] WSREP: Requesting state transfer: success, donor: 2
2018-07-09 10:46:34 140075929970432 [Note] WSREP: GCache history reset: 55238f52-41ee-11e8-852f-3316bdb654bc:2761994 -> 55238f52-41ee-11e8-852f-3316bdb654bc:2836958
2018-07-09 10:46:34 140075929970432 [Note] WSREP: GCache DEBUG: RingBuffer::seqno_reset(): full reset

Node galera2 starts receiving the missing writesets from the selected donor's gcache (galera3):

2018-07-09 10:46:34 140075954312960 [Note] WSREP: 2.0 (192.168.55.173): State transfer to 0.0 (192.168.55.172) complete.
2018-07-09 10:46:34 140075929970432 [Note] WSREP: Receiving IST: 74964 writesets, seqnos 2761994-2836958
2018-07-09 10:46:34 140075593627392 [Note] WSREP: Receiving IST...  0.0% (    0/74964 events) complete.
2018-07-09 10:46:34 140075954312960 [Note] WSREP: Member 2.0 (192.168.55.173) synced with group.
2018-07-09 10:46:34 140075962705664 [Note] WSREP: (1491abd9, 'tcp://0.0.0.0:4567') connection established to 737422d6 tcp://192.168.55.171:4567
2018-07-09 10:46:41 140075962705664 [Note] WSREP: (1491abd9, 'tcp://0.0.0.0:4567') turning message relay requesting off
2018-07-09 10:46:44 140075593627392 [Note] WSREP: Receiving IST... 36.0% (27008/74964 events) complete.
2018-07-09 10:46:54 140075593627392 [Note] WSREP: Receiving IST... 71.6% (53696/74964 events) complete.
2018-07-09 10:47:02 140075593627392 [Note] WSREP: Receiving IST...100.0% (74964/74964 events) complete.
2018-07-09 10:47:02 140075929970432 [Note] WSREP: IST received: 55238f52-41ee-11e8-852f-3316bdb654bc:2836958
2018-07-09 10:47:02 140075954312960 [Note] WSREP: 0.0 (192.168.55.172): State transfer from 2.0 (192.168.55.173) complete.

Once all the missing writesets are received and applied, Galera will promote galera2 to JOINED, up to seqno 2837012:

2018-07-09 10:47:02 140075954312960 [Note] WSREP: Shifting JOINER -> JOINED (TO: 2837012)
2018-07-09 10:47:02 140075954312960 [Note] WSREP: Member 0.0 (192.168.55.172) synced with group.

The node applies any cached writesets in its slave queue and finishes catching up with the cluster. Its slave queue is now empty. Galera will promote galera2 to SYNCED, indicating the node is now operational and ready to serve clients:

2018-07-09 10:47:02 140075954312960 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 2837012)
2018-07-09 10:47:02 140076605892352 [Note] WSREP: Synchronized with group, ready for connections

At this point, all nodes are back operational. You can verify by using the following statements on galera2:

mysql> SELECT * FROM information_schema.global_status WHERE variable_name IN ('WSREP_CLUSTER_STATUS','WSREP_LOCAL_STATE_COMMENT','WSREP_CLUSTER_SIZE','WSREP_EVS_DELAYED','WSREP_READY');
+---------------------------+----------------+
| VARIABLE_NAME             | VARIABLE_VALUE |
+---------------------------+----------------+
| WSREP_CLUSTER_SIZE        | 3              |
| WSREP_CLUSTER_STATUS      | Primary        |
| WSREP_EVS_DELAYED         |                |
| WSREP_LOCAL_STATE_COMMENT | Synced         |
| WSREP_READY               | ON             |
+---------------------------+----------------+

The wsrep_cluster_size reported as 3 and the cluster status is Primary, indicating galera2 is part of the Primary Component. The wsrep_evs_delayed has also been cleared and the local state is now Synced.

The recovery process flow for the partitioned node during network issue can be summarized as below:

  1. Re-establishes group communication to other nodes.
  2. Retrieves the cluster view from one of the Primary Component.
  3. Performs state exchange with the Primary Component and calculates the quorum.
  4. Changes the local node state from OPEN to PRIMARY.
  5. Calculates the gap between local node and the cluster.
  6. Changes the local node state from PRIMARY to JOINER.
  7. Prepares IST listener/receiver on port 4568.
  8. Requests state transfer via IST and picks a donor.
  9. Starts receiving and applying the missing writeset from chosen donor's gcache.
  10. Changes the local node state from JOINER to JOINED.
  11. Catches up with the cluster by applying the cached writesets in the slave queue.
  12. Changes the local node state from JOINED to SYNCED.

Cluster Failure

A Galera Cluster is considered failed if no primary component (PC) is available. Consider a similar three-node Galera Cluster as depicted in the diagram below:

A cluster is considered operational if all nodes, or a majority of the nodes, are online. Online means they are able to see each other through Galera's replication traffic or group communication. If no traffic is coming in and out of a node, the cluster sends a heartbeat beacon for the node to respond to in a timely manner. Otherwise, the node is put on the delayed or suspected list, depending on how it responds.

If a node goes down, let's say node C, the cluster will remain operational because nodes A and B are still in quorum, with 2 votes out of 3, to form a primary component. You should get the following cluster state on A and B:

mysql> SHOW STATUS LIKE 'wsrep_cluster_status';
+----------------------+---------+
| Variable_name        | Value   |
+----------------------+---------+
| wsrep_cluster_status | Primary |
+----------------------+---------+

Now, let's say the primary network switch goes kaput, as illustrated in the following diagram:

At this point, every single node loses communication with the others, and the cluster state will be reported as non-Primary on all nodes (as happened to galera2 in the previous case). Every node calculates the quorum and finds out that it is in the minority (1 vote out of 3), thus losing quorum, which means no Primary Component is formed and consequently all nodes refuse to serve any data. This is deemed a cluster failure.

Once the network issue is resolved, Galera will automatically re-establish the communication between members, exchange node's states and determine the possibility of reforming the primary component by comparing node state, UUIDs and seqnos. If the probability is there, Galera will merge the primary components as shown in the following lines:

2018-06-27  0:16:57 140203784476416 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 2, memb_num = 3
2018-06-27  0:16:57 140203784476416 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
2018-06-27  0:16:57 140203784476416 [Note] WSREP: STATE EXCHANGE: sent state msg: 5885911b-795c-11e8-8683-931c85442c7e
2018-06-27  0:16:57 140203784476416 [Note] WSREP: STATE EXCHANGE: got state msg: 5885911b-795c-11e8-8683-931c85442c7e from 0 (192.168.55.171)
2018-06-27  0:16:57 140203784476416 [Note] WSREP: STATE EXCHANGE: got state msg: 5885911b-795c-11e8-8683-931c85442c7e from 1 (192.168.55.172)
2018-06-27  0:16:57 140203784476416 [Note] WSREP: STATE EXCHANGE: got state msg: 5885911b-795c-11e8-8683-931c85442c7e from 2 (192.168.55.173)
2018-06-27  0:16:57 140203784476416 [Warning] WSREP: Quorum: No node with complete state:

        Version      : 4
        Flags        : 0x3
        Protocols    : 0 / 8 / 3
        State        : NON-PRIMARY
        Desync count : 0
        Prim state   : SYNCED
        Prim UUID    : 5224a024-791b-11e8-a0ac-8bc6118b0f96
        Prim  seqno  : 5
        First seqno  : 112714
        Last  seqno  : 112725
        Prim JOINED  : 3
        State UUID   : 5885911b-795c-11e8-8683-931c85442c7e
        Group UUID   : 55238f52-41ee-11e8-852f-3316bdb654bc
        Name         : '192.168.55.171'
        Incoming addr: '192.168.55.171:3306'

        Version      : 4
        Flags        : 0x2
        Protocols    : 0 / 8 / 3
        State        : NON-PRIMARY
        Desync count : 0
        Prim state   : SYNCED
        Prim UUID    : 5224a024-791b-11e8-a0ac-8bc6118b0f96
        Prim  seqno  : 5
        First seqno  : 112714
        Last  seqno  : 112725
        Prim JOINED  : 3
        State UUID   : 5885911b-795c-11e8-8683-931c85442c7e
        Group UUID   : 55238f52-41ee-11e8-852f-3316bdb654bc
        Name         : '192.168.55.172'
        Incoming addr: '192.168.55.172:3306'

        Version      : 4
        Flags        : 0x2
        Protocols    : 0 / 8 / 3
        State        : NON-PRIMARY
        Desync count : 0
        Prim state   : SYNCED
        Prim UUID    : 5224a024-791b-11e8-a0ac-8bc6118b0f96
        Prim  seqno  : 5
        First seqno  : 112714
        Last  seqno  : 112725
        Prim JOINED  : 3
        State UUID   : 5885911b-795c-11e8-8683-931c85442c7e
        Group UUID   : 55238f52-41ee-11e8-852f-3316bdb654bc
        Name         : '192.168.55.173'
        Incoming addr: '192.168.55.173:3306'

2018-06-27  0:16:57 140203784476416 [Note] WSREP: Full re-merge of primary 5224a024-791b-11e8-a0ac-8bc6118b0f96 found: 3 of 3.
2018-06-27  0:16:57 140203784476416 [Note] WSREP: Quorum results:
        version    = 4,
        component  = PRIMARY,
        conf_id    = 5,
        members    = 3/3 (joined/total),
        act_id     = 112725,
        last_appl. = 112722,
        protocols  = 0/8/3 (gcs/repl/appl),
        group UUID = 55238f52-41ee-11e8-852f-3316bdb654bc
2018-06-27  0:16:57 140203784476416 [Note] WSREP: Flow-control interval: [28, 28]
2018-06-27  0:16:57 140203784476416 [Note] WSREP: Trying to continue unpaused monitor
2018-06-27  0:16:57 140203784476416 [Note] WSREP: Restored state OPEN -> SYNCED (112725)
2018-06-27  0:16:57 140202564110080 [Note] WSREP: New cluster view: global state: 55238f52-41ee-11e8-852f-3316bdb654bc:112725, view# 6: Primary, number of nodes: 3, my index: 2, protocol version 3

A good indicator to know if the re-bootstrapping process is OK is by looking at the following line in the error log:

[Note] WSREP: Synchronized with group, ready for connections
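
If, for whatever reason, the Primary Component does not re-form automatically after connectivity is restored, a commonly used last resort (to be run with care, and only on the most advanced node) is to bootstrap it manually through the provider options:

mysql> SET GLOBAL wsrep_provider_options = 'pc.bootstrap=YES';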

ClusterControl Auto Recovery

ClusterControl comes with node and cluster automatic recovery features, because it oversees and understands the state of all nodes in the cluster. Automatic recovery is enabled by default if the cluster is deployed using ClusterControl. To enable or disable it, simply click on the power icon in the summary bar as shown below:

A green icon means automatic recovery is turned on, while red means the opposite. You can monitor the recovery progress from the Activity -> Jobs dialog; in this case, galera2 was totally inaccessible due to the firewall blocking, forcing ClusterControl to report the following:

The recovery process only commences after a grace period (30 seconds) to give the Galera node a chance to recover by itself first. If ClusterControl fails to recover a node or cluster, it will first pull all MySQL error logs from all accessible nodes and will raise the necessary alarms to notify the user via email or by pushing critical events to third-party integration modules like PagerDuty, VictorOps or Slack. Manual intervention is then required. For Galera Cluster, ClusterControl will keep on trying to recover the failure until you mark the node as under maintenance, or disable the automatic recovery feature.

ClusterControl's automatic recovery is one of the features most appreciated by our users. It helps you take the necessary actions quickly, with a complete report on what has been attempted and recommended steps to troubleshoot the issue further. Users with a support subscription can look for extra hands by escalating the issue to our technical support team for assistance.

Conclusion

Galera automatic node recovery and membership control are neat features that simplify cluster management, improve database reliability and reduce the risk of human error, issues that commonly haunt other open-source database replication technologies like MySQL Replication, Group Replication and PostgreSQL Streaming/Logical Replication.

How to perform Schema Changes in MySQL & MariaDB in a Safe Way


Before you attempt to perform any schema changes on your production databases, you should make sure that you have a rock solid rollback plan, and that your change procedure has been successfully tested and validated in a separate environment. At the same time, it's your responsibility to make sure that the change causes no impact, or the smallest impact acceptable to the business. It's definitely not an easy task.

In this article, we will take a look at how to perform database changes on MySQL and MariaDB in a controlled way. We will talk about some good habits in your day-to-day DBA work. We'll focus on prerequisites, tasks during the actual operations, and problems that you may face when dealing with database schema changes. We will also talk about open source tools that may help you in the process.

Test and rollback scenarios

Backup

There are many ways to lose your data. Schema upgrade failure is one of them. Unlike application code, you can’t drop a bundle of files and declare that a new version has been successfully deployed. You also can’t just put back an older set of files to rollback your changes. Of course, you can run another SQL script to change the database again, but there are cases when the only accurate way to roll back changes is by restoring the entire database from backup.

However, what if you can’t afford to rollback your database to the latest backup, or your maintenance window is not big enough (considering system performance), so you can’t perform a full database backup before the change?

One may have a sophisticated, redundant environment, but as long as data is modified in both primary and standby locations, there is not much to do about it. Many scripts can just be run once, or the changes are impossible to undo. Most of the SQL change code falls into two groups:

  • Run once – you can’t add the same column to the table twice.
  • Impossible to undo – once you’ve dropped that column, it’s gone. You could undoubtedly restore your database, but that’s not precisely an undo.

You can tackle this problem in at least two possible ways. One would be to enable the binary log and take a backup which is compatible with PITR. Such a backup has to be full, complete and consistent. For xtrabackup, as long as it contains a full dataset, it will be PITR-compatible. For mysqldump, there is an option to make it PITR-compatible too. For smaller changes, a variation of the mysqldump backup would be to take only the subset of data you are about to change. This can be done with the --where option. The backup should be part of the planned maintenance.

mysqldump -u<username> -p --lock-all-tables --where="employee_id=100" mydb employees > backup_table_tmp_change_07132018.sql

Another possibility is to use CREATE TABLE AS SELECT.

You can store data or simple structure changes in the form of a fixed temporary table. With this approach you will have a source to restore from if you need to rollback your changes. It may be quite handy if you don't change much data. The rollback can be done by taking the data back out of it. If any failure occurs while copying the data to the table, the table is automatically dropped and not created, so make sure that your statement creates the copy you need.

Obviously, there are some limitations too.

Because the ordering of the rows in the underlying SELECT statements cannot always be determined, CREATE TABLE ... IGNORE SELECT and CREATE TABLE ... REPLACE SELECT are flagged as unsafe for statement-based replication. Such statements produce a warning in the error log when using statement-based mode and are written to the binary log using the row-based format when using MIXED mode.

A very simple example of such a method could be:

CREATE TABLE tmp_employees_change_07132018 AS SELECT * FROM employees where employee_id=100;
UPDATE employees SET salary=120000 WHERE employee_id=100;
COMMIT;

Another interesting option may be MariaDB flashback database. When a wrong update or delete happens, and you would like to revert to a state of the database (or just a table) at a certain point in time, you may use the flashback feature.

Point-in-time rollback enables DBAs to recover data faster by rolling back transactions to a previous point in time rather than performing a restore from a backup. Based on ROW-based DML events, flashback can transform the binary log and reverse the operations it contains. That means it can help undo given row changes fast. For instance, it can change DELETE events to INSERTs and vice versa, and it will swap the WHERE and SET parts of UPDATE events. This simple idea can dramatically speed up recovery from certain types of mistakes or disasters. For those who are familiar with the Oracle database, it's a well known feature. The limitation of MariaDB flashback is the lack of DDL support.
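As an illustration, here is a minimal sketch of generating a flashback with MariaDB's mysqlbinlog; the binary log file name and the time window are hypothetical, and the generated SQL should always be reviewed before it is applied:

mysqlbinlog --flashback --start-datetime="2018-07-13 09:00:00" --stop-datetime="2018-07-13 09:05:00" /var/lib/mysql/binlog.000010 > flashback.sql
mysql -uroot -p < flashback.sql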

Create a delayed replication slave

Since version 5.6, MySQL supports delayed replication. A slave server can lag behind the master by at least a specified amount of time. The default delay is 0 seconds. Use the MASTER_DELAY option for CHANGE MASTER TO to set the delay to N seconds:

CHANGE MASTER TO MASTER_DELAY = N;

It would be a good option if you didn't have time to prepare a proper recovery scenario. You need enough delay to be able to notice the problematic change. The advantage of this approach is that you don't need to restore your database to take out the data needed to fix your change. The standby DB is up and running, ready to pick up the data, which minimizes the time needed.
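As a sketch, the following keeps a slave one hour behind its master (the one-hour delay is just an example value) and then shows the configured delay in the SQL_Delay field of the slave status:

mysql> STOP SLAVE;
mysql> CHANGE MASTER TO MASTER_DELAY = 3600;
mysql> START SLAVE;
mysql> SHOW SLAVE STATUS\G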

Create an asynchronous slave which is not part of the cluster

When it comes to Galera cluster, testing changes is not easy. All nodes run the same data, and heavy load can harm flow control. So you not only need to check if changes applied successfully, but also what the impact to the cluster state was. To make your test procedure as close as possible to the production workload, you may want to add an asynchronous slave to your cluster and run your test there. The test will not impact synchronization between cluster nodes, because technically it’s not part of the cluster, but you will have an option to check it with real data. Such slave can be easily added from ClusterControl.

ClusterControl add asynchronous slave
ClusterControl add asynchronous slave

As shown in the above screenshot, ClusterControl can automate the process of adding an asynchronous slave in a few ways. You can add the node to the cluster and, optionally, delay the slave. To reduce the impact on the master, you can use an existing backup instead of the master as the data source when building the slave.

Clone database and measure time

A good test should be as close as possible to the production change. The best way to do this is to clone your existing environment.

ClusterControl Clone Cluster for test
ClusterControl Clone Cluster for test

Perform changes via replication

To have better control over your changes, you can apply them on a slave server ahead of time and then do the switchover. For statement-based replication this works fine, but for row-based replication it only works up to a certain degree. Row-based replication allows extra columns to exist at the end of the table, so as long as it can write the first columns, it will be fine. First apply the change to all slaves, then failover to one of the slaves, and then implement the change on the old master and attach it as a slave. If your modification involves inserting or removing a column in the middle of the table, it will not work with row-based replication.

Operation

During the maintenance window, we do not want to have application traffic on the database. Sometimes it is hard to shut down all applications spread over the whole company. Alternatively, we want to allow only some specific hosts to access MySQL remotely (for example, the monitoring system or the backup server). For this purpose, we can use Linux packet filtering. To see what packet filtering rules are in place, we can run the following command:

iptables -L INPUT -v

To close the MySQL port on all interfaces we use:

iptables -A INPUT -p tcp --dport mysql -j DROP

and to open the MySQL port again after the maintenance window:

iptables -D INPUT -p tcp --dport mysql -j DROP
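If certain hosts, such as the monitoring or backup server, still need access while the port is closed, an ACCEPT rule can be inserted ahead of the DROP rule; a sketch, assuming a hypothetical monitoring host at 192.168.55.100:

iptables -I INPUT -p tcp -s 192.168.55.100 --dport mysql -j ACCEPT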

For those without root access, you can set max_connections to 1 or enable skip_networking.

Logging

To get the logging process started, use the tee command at the MySQL client prompt, like this:

mysql> tee /tmp/my.out;

That command tells MySQL to log both the input and output of your current MySQL session to a file named /tmp/my.out. Then execute your script file with the source command.
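For example (the script file name is hypothetical; notee stops the logging when you are done):

mysql> source /root/schema_change_07132018.sql;
mysql> notee;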

To get a better idea of your execution times, you can combine it with the profiler feature. Start the profiler with

SET profiling = 1;

Then execute your query. With

SHOW PROFILES;

you see a list of queries the profiler has statistics for. Finally, you choose which query to examine with

SHOW PROFILE FOR QUERY 1;

Schema migration tools

Many times, a straight ALTER on the master is not possible - in most cases it causes lag on the slave, and this may not be acceptable to the applications. What can be done, though, is to execute the change in a rolling mode. You can start with the slaves and, once the change is applied to them, promote one of the slaves to be the new master, demote the old master to a slave and execute the change on it.

A tool that may help with such a task is Percona’s pt-online-schema-change. Pt-online-schema-change is straightforward - it creates a temporary table with the desired new schema (for instance, if we added an index, or removed a column from a table). Then, it creates triggers on the old table. Those triggers are there to mirror changes that happen on the original table to the new table. Changes are mirrored during the schema change process. If a row is added to the original table, it is also added to the new one. It emulates the way that MySQL alters tables internally, but it works on a copy of the table you wish to alter. It means that the original table is not locked, and clients may continue to read and change data in it.

Likewise, if a row is modified or deleted on the old table, it is also applied in the new table. Then, a background process of copying data (using LOW_PRIORITY INSERT) between old and new table begins. Once data has been copied, RENAME TABLE is executed.
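A sketch of how such a change could be executed with pt-online-schema-change; the database, table and column names are hypothetical, and running with --dry-run first is a good habit before switching to --execute:

pt-online-schema-change --alter "ADD COLUMN last_login DATETIME" D=mydb,t=employees --dry-run
pt-online-schema-change --alter "ADD COLUMN last_login DATETIME" D=mydb,t=employees --execute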

Another interesting tool is gh-ost. Gh-ost creates a temporary table with the altered schema, just like pt-online-schema-change does, and copies data from the old table into it using INSERT queries. Nevertheless, it does not use triggers. Unfortunately, triggers may be the source of many limitations. Instead, gh-ost uses the binary log stream to capture table changes and asynchronously applies them onto the ghost table. Once we have verified that gh-ost can execute our schema change correctly, it's time to actually execute it. Keep in mind that you may need to manually drop old tables that were created by gh-ost during the process of testing the migration. You can also use the --initially-drop-ghost-table and --initially-drop-old-table flags to ask gh-ost to do it for you. The final command to execute is exactly the same as the one we used to test our change, we just add --execute to it.
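A sketch of a gh-ost invocation; the host, credentials, database, table and column are hypothetical, and running the same command without --execute first performs a dry run:

gh-ost \
--host=<master_host> \
--user=ghost \
--password=<password> \
--database=mydb \
--table=employees \
--alter="ADD COLUMN last_login DATETIME" \
--allow-on-master \
--initially-drop-ghost-table \
--initially-drop-old-table \
--execute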

pt-online-schema-change and gh-ost are very popular among Galera users. Nevertheless, Galera has some additional options. The two methods, Total Order Isolation (TOI) and Rolling Schema Upgrade (RSU), both have their pros and cons.

TOI - This is the default DDL replication method. The node that originates the writeset detects DDL at parsing time and sends out a replication event for the SQL statement before even starting the DDL processing. Schema upgrades run on all cluster nodes in the same total order sequence, preventing other transactions from committing for the duration of the operation. This method is good when you want your online schema upgrades to replicate through the cluster and don’t mind locking the entire table (similar to how default schema changes happened in MySQL).

SET GLOBAL wsrep_OSU_method='TOI';

RSU - perform the schema upgrades locally. In this method, your changes affect only the node on which they are run; they do not replicate to the rest of the cluster. This method is good for non-conflicting operations and it will not slow down the cluster.

SET GLOBAL wsrep_OSU_method='RSU';

While the node processes the schema upgrade, it desynchronizes with the cluster. When it finishes processing the schema upgrade, it applies delayed replication events and synchronizes itself with the cluster. This could be a good option to run heavy index creations.
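A sketch of building a heavy index under RSU; the table and index names are hypothetical, and the setting is applied at the session level so it only affects this connection. Remember that with RSU the same statement has to be executed on every node, one node at a time:

mysql> SET SESSION wsrep_OSU_method='RSU';
mysql> ALTER TABLE employees ADD INDEX idx_last_login (last_login);
mysql> SET SESSION wsrep_OSU_method='TOI';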

Conclusion

We presented here several different methods that may help you with planning your schema changes. Of course it all depends on your application and business requirements. You can design your change plan, perform necessary tests, but there is still a small chance that something will go wrong. According to Murphy’s law - “things will go wrong in any given situation, if you give them a chance”. So make sure you try out different ways of performing these changes, and pick the one that you are the most comfortable with.

ClusterControl Release 1.6.2: New Backup Management and Security Features for MySQL & PostgreSQL


We are excited to announce the 1.6.2 release of ClusterControl - the all-inclusive database management system that lets you easily automate and manage highly available open source databases in any environment: on-premise or in the cloud.

ClusterControl 1.6.2 introduces new exciting Backup Management as well as Security & Compliance features for MySQL & PostgreSQL, support for MongoDB v 3.6 … and more!

Release Highlights

Backup Management

  • Continuous Archiving and Point-in-Time Recovery (PITR) for PostgreSQL
  • Rebuild a node from a backup with MySQL Galera clusters to avoid SST

Security & Compliance

  • New, consolidated Security section

Additional Highlights

  • Support for MongoDB v 3.6

View the ClusterControl ChangeLog for all the details!


View Release Details and Resources

Release Details

Backup Management

One of the issues with MySQL and PostgreSQL is that there aren't really any out-of-the-box tools for users to simply pick a restore time in a GUI: a number of operations need to be performed to do that, such as finding the full backup, restoring it and manually applying any changes that happened after the backup was taken.

ClusterControl provides a single process to restore data to point in time with no extra actions needed.

With the same system, users can verify their backups (in the case of MySQL, for instance, ClusterControl will do the installation, set up the cluster, do a restore and, if the backup is sound, mark it as valid - which, as one can imagine, involves a lot of steps).

With ClusterControl, users can not only go back to a point in time, but also pick the exact transaction that happened; and, with surgical precision, restore their data to just before disaster strikes.

New for PostgreSQL

Continuous Archiving and Point-in-Time Recovery (PITR) for PostgreSQL: ClusterControl automates that process now and enables continuous WAL archiving as well as a PITR with backups.

New for MySQL Galera Cluster

Rebuild a node from a backup with MySQL Galera clusters to avoid SST: ClusterControl reduces the time it takes to recover a node by avoiding streaming a full dataset over the network from another node.

Security & Compliance

The new Security section in ClusterControl lets users easily check which security features they have enabled (or disabled) for their clusters, thus simplifying the process of taking the relevant security measures for their setups.

Additional New Functionalities

View the ClusterControl ChangeLog for all the details!

 

Download ClusterControl today!

Happy Clustering!

6 Common Failure Scenarios for MySQL & MariaDB, and How to Fix Them


In this blog post, we will analyze 6 different failure scenarios in production database systems, ranging from single-server issues to multi-datacenter failover plans. We will walk you through recovery and failover procedures for the respective scenario. Hopefully, this will give you a good understanding of the risks you might face and things to consider when designing your infrastructure.

Database schema corrupted

Let's start with a single node installation - a database setup in its simplest form. Easy to implement, at the lowest cost. In this scenario, you run multiple applications on a single server where each database schema belongs to a different application. The approach for recovery of a single schema would depend on several factors.

  • Do I have any backup?
  • Do I have a backup and how fast can I restore it?
  • What kind of storage engine is in use?
  • Do I have a PITR-compatible (point in time recovery) backup?

Data corruption can be identified by mysqlcheck.

mysqlcheck -uroot -p <DATABASE>

Replace DATABASE with the name of the database, and replace TABLE with the name of the table that you want to check:

mysqlcheck -uroot -p <DATABASE> <TABLE>

Mysqlcheck checks the specified database and tables. If a table passes the check, mysqlcheck displays OK for the table. In the example below, we can see that the table salaries requires recovery.

employees.departments                              OK
employees.dept_emp                                 OK
employees.dept_manager                             OK
employees.employees                                OK
Employees.salaries
Warning  : Tablespace is missing for table 'employees/salaries'
Error    : Table 'employees.salaries' doesn't exist in engine
status   : Operation failed
employees.titles                                   OK

For a single node installation with no additional DR servers, the primary approach would be to restore data from backup. But this is not the only thing you need to consider. Having multiple database schemas under the same instance causes an issue when you have to bring your server down to restore data. Another question is whether you can afford to rollback all of your databases to the last backup. In most cases, that would not be possible.

There are some exceptions here. It is possible to restore a single table or database from the last backup when point in time recovery is not needed. Such a process is more complicated. If you have a mysqldump backup, you can extract your database from it. If you run binary backups with xtrabackup or mariabackup and have enabled file-per-table tablespaces, then it is also possible.
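A rough sketch of slicing a single table out of a mysqldump file with standard text tools; the table name salaries and the file name full_dump.sql are hypothetical, and the resulting file should be reviewed before restoring it:

sed -n '/^-- Table structure for table `salaries`/,/^-- Table structure for table/p' full_dump.sql > salaries_only.sql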

Here is how to check whether the file-per-table option is enabled:

mysql> SHOW VARIABLES LIKE 'innodb_file_per_table';

If it is not, you can enable it (this only affects tables created after the change):

mysql> SET GLOBAL innodb_file_per_table=1;

With innodb_file_per_table enabled, InnoDB stores each table in its own tbl_name.ibd file. Unlike the MyISAM storage engine, with its separate tbl_name.MYD and tbl_name.MYI files for data and indexes, InnoDB stores the data and the indexes together in a single .ibd file. To check your storage engine, you need to run:

mysql> select table_name,engine from information_schema.tables where table_name='table_name' and table_schema='database_name';

or directly from the console:

[root@master ~]# mysql -u<username> -p -D<database_name> -e "show table status\G"
Enter password: 
*************************** 1. row ***************************
           Name: test1
         Engine: InnoDB
        Version: 10
     Row_format: Dynamic
           Rows: 12
 Avg_row_length: 1365
    Data_length: 16384
Max_data_length: 0
   Index_length: 0
      Data_free: 0
 Auto_increment: NULL
    Create_time: 2018-05-24 17:54:33
    Update_time: NULL
     Check_time: NULL
      Collation: latin1_swedish_ci
       Checksum: NULL
 Create_options: 
        Comment: 

To restore tables from xtrabackup, you need to go through an export process. The backup needs to be prepared before it can be restored. Exporting is done in the preparation stage. Once a full backup is created, run the standard prepare procedure with the additional --export flag:

innobackupex --apply-log --export /u01/backup

This will create additional export files which you will use later on in the import phase. To import a table to another server, first create a new table with the same structure as the one that will be imported at that server:

mysql> CREATE TABLE mydatabase.mytable (...) ENGINE=InnoDB;

discard the tablespace:

mysql> ALTER TABLE mydatabase.mytable DISCARD TABLESPACE;

Then copy the mytable.ibd and mytable.exp files into the database's directory, and import its tablespace:

mysql> ALTER TABLE mydatabase.mytable IMPORT TABLESPACE;

However, to do this in a more controlled way, the recommendation would be to restore the database backup on another instance/server and copy what is needed back to the main system. To do so, you need to run an installation of a separate MySQL instance. This could even be done on the same machine, but it requires more effort to configure both instances to run side by side - for example, they would need different communication settings such as port and socket.

You can combine both tasks, restore and installation, using ClusterControl.

ClusterControl will walk you through the available backups on-prem or in the cloud, let you choose exact time for a restore or the precise log position, and install a new database instance if needed.

ClusterControl point in time recovery
ClusterControl point in time recovery
ClusterControl restore and verify on a standalone host
ClusterControl restore and verify on a standalone host
ClusterControl restore and verify on a standalone host. Installation options.
ClusterControl restore and verify on a standalone host. Installation options.

You can find more information about data recovery in the blog post My MySQL Database is Corrupted... What Do I Do Now?

Database instance corrupted on the dedicated server

Defects in the underlying platform are often the cause for database corruption. Your MySQL instance relies on a number of things to store and retrieve data - disk subsystem, controllers, communication channels, drivers and firmware. A crash can affect parts of your data, mysql binaries or even backup files that you store on the system. To separate different applications, you can place them on dedicated servers.

Having different application schemas on separate systems is a good idea if you can afford it. One may say that this is a waste of resources, but there is a chance that the business impact will be smaller if only one of them goes down. But even then, you need to protect your database from data loss. Storing a backup on the same server is not a bad idea as long as you have a copy somewhere else. These days, cloud storage is an excellent alternative to tape backup.

ClusterControl enables you to keep a copy of your backup in the cloud. It supports uploading to the top 3 cloud providers - Amazon AWS, Google Cloud, and Microsoft Azure.

When you have your full backup restored, you may want to roll it forward to a certain point in time. Point-in-time recovery will bring the server up to a more recent time than when the full backup was taken. To do so, you need to have binary logs enabled. You can check the available binary logs with:

mysql> SHOW BINARY LOGS;

And current log file with:

mysql> SHOW MASTER STATUS;

Then you can capture the incremental data by dumping the binary logs into an SQL file. Missing operations can then be re-executed.

mysqlbinlog --start-position='14231131' --verbose /var/lib/mysql/binlog.000010 /var/lib/mysql/binlog.000011 /var/lib/mysql/binlog.000012 /var/lib/mysql/binlog.000013 /var/lib/mysql/binlog.000015 > binlog.out
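After reviewing binlog.out and removing any statements you do not want to re-apply, the remaining events can be replayed against the restored server, for example:

mysql -uroot -p < binlog.out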

The same can be done in ClusterControl.

ClusterControl cloud backup
ClusterControl cloud backup
ClusterControl cloud backup
ClusterControl cloud backup

Database slave goes down

Ok, so you have your database running on a dedicated server. You created a sophisticated backup schedule with a combination of full and incremental backups, upload them to the cloud and store the latest backup on local disks for fast recovery. You have different backup retention policies - shorter for backups stored on local disk drives and longer for your cloud backups.

It sounds like you are well prepared for a disaster scenario. But when it comes to the restore time, it may not satisfy your business needs.

You need a quick failover option: a server that is up and running, applying binary logs from the master where writes happen. Master/slave replication starts a new chapter in the failover scenario. It's a fast method to bring your application back to life if your master goes down.

But there are a few things to consider in the failover scenario. One is to set up a delayed replication slave, so you can react to fat finger commands that were triggered on the master server. A slave server can lag behind the master by at least a specified amount of time. The default delay is 0 seconds. Use the MASTER_DELAY option for CHANGE MASTER TO to set the delay to N seconds:

CHANGE MASTER TO MASTER_DELAY = N;

The second is to enable automated failover. There are many automated failover solutions on the market. You can set up automatic failover with command line tools like MHA, MRM and mysqlfailover, or with GUI tools like Orchestrator and ClusterControl. When set up correctly, it can significantly reduce your outage.

ClusterControl supports automated failover for MySQL, PostgreSQL and MongoDB replications as well as multi-master cluster solutions Galera and NDB.

ClusterControl replication topology view
ClusterControl replication topology view

When a slave node crashes and the server is severely lagging behind, you may want to rebuild your slave server. The slave rebuild process is similar to restoring from backup.

ClusterControl rebuild slave
ClusterControl rebuild slave

Database multi-master server goes down

Now that you have a slave server acting as a DR node, and your failover process is well automated and tested, your DBA life becomes more comfortable. That's true, but there are a few more puzzles to solve. Computing power is not free, and your business team may ask you to better utilize your hardware. You may want to use your slave server not only as a passive server, but also to serve write operations.

You may then want to investigate a multi-master replication solution. Galera Cluster has become a mainstream option for high availability MySQL and MariaDB. And though it is now known as a credible replacement for traditional MySQL master-slave architectures, it is not a drop-in replacement.

Galera Cluster has a shared-nothing architecture. Instead of shared disks, Galera uses certification-based replication with group communication and transaction ordering to achieve synchronous replication. A database cluster should be able to survive the loss of a node, although it's achieved in different ways. In the case of Galera, the critical aspect is the number of nodes. Galera requires a quorum to stay operational. A three node cluster can survive the crash of one node. With more nodes in your cluster, you can survive more failures.

The recovery process is automated, so you don't need to perform any failover operations. However, the good practice would be to kill nodes and see how fast you can bring them back. In order to make this operation more efficient, you can modify the Galera cache size. If the Galera cache size is not properly planned, your next booting node will have to take a full state transfer instead of only the missing write-sets from the cache.

The failover scenario is as simple as starting the instance. Based on the data in the Galera cache, the booting node will perform SST (restore from full backup) or IST (apply missing write-sets). However, this is often linked to human intervention. If you want to automate the entire failover process, you can use ClusterControl's autorecovery functionality (node and cluster level).

ClusterControl cluster autorecovery
ClusterControl cluster autorecovery

Estimate the Galera cache size:

MariaDB [(none)]> SET @start := (SELECT SUM(VARIABLE_VALUE/1024/1024) FROM information_schema.global_status WHERE VARIABLE_NAME LIKE 'WSREP%bytes'); do sleep(60); SET @end := (SELECT SUM(VARIABLE_VALUE/1024/1024) FROM information_schema.global_status WHERE VARIABLE_NAME LIKE 'WSREP%bytes'); SET @gcache := (SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(@@GLOBAL.wsrep_provider_options,'gcache.size = ',-1), 'M', 1)); SELECT ROUND((@end - @start),2) AS `MB/min`, ROUND((@end - @start),2) * 60 as `MB/hour`, @gcache as `gcache Size(MB)`, ROUND(@gcache/round((@end - @start),2),2) as `Time to full(minutes)`;

To make failover more consistent, you should enable gcache.recover=yes in my.cnf. This option will revive the Galera cache on restart. This means the node can act as a DONOR and serve missing write-sets (facilitating IST instead of SST).
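A minimal my.cnf sketch; the 2G cache size is only an example and should be derived from the estimation query above:

[mysqld]
wsrep_provider_options="gcache.size=2G;gcache.recover=yes"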

2018-07-20  8:59:44 139656049956608 [Note] WSREP: Quorum results:
    version    = 4,
    component  = PRIMARY,
    conf_id    = 2,
    members    = 2/3 (joined/total),
    act_id     = 12810,
    last_appl. = 0,
    protocols  = 0/7/3 (gcs/repl/appl),
    group UUID = 49eca8f8-0e3a-11e8-be4a-e7e3fe48cb69
2018-07-20  8:59:44 139656049956608 [Note] WSREP: Flow-control interval: [28, 28]
2018-07-20  8:59:44 139656049956608 [Note] WSREP: Trying to continue unpaused monitor
2018-07-20  8:59:44 139657311033088 [Note] WSREP: New cluster view: global state: 49eca8f8-0e3a-11e8-be4a-e7e3fe48cb69:12810, view# 3: Primary, number of nodes: 3, my index: 1, protocol version 3

ProxySQL node goes down

If you have a virtual IP set up, all you have to do is point your application to the virtual IP address and everything should be correct connection-wise. It's not enough to have your database instances spanning multiple datacenters - you still need your applications to access them. Assuming you have scaled out the number of read replicas, you might want to implement virtual IPs for each of those read replicas as well, for maintenance or availability reasons. It might become a cumbersome pool of virtual IPs that you have to manage. If one of those read replicas crashes, you need to re-assign the virtual IP to a different host, or else your application will connect to a host that is down or, in the worst case, to a lagging server with stale data.

ClusterControl HA load balancers topology view
ClusterControl HA load balancers topology view

Crashes are not frequent, but more probable than servers going down. If for whatever reason, a slave goes down, something like ProxySQL will redirect all of the traffic to the master, with the risk of overloading it. When the slave recovers, traffic will be redirected back to it. Usually, such downtime shouldn’t take more than a couple of minutes, so the overall severity is medium, even though the probability is also medium.

To make your load balancer components redundant, you can use Keepalived.

ClusterControl: Deploy keepalived for ProxySQL load balancer
ClusterControl: Deploy keepalived for ProxySQL load balancer
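Outside of ClusterControl, a minimal keepalived.conf sketch for one of the ProxySQL hosts could look like the following; the interface name, virtual_router_id, priority and VIP are hypothetical and must match your environment:

vrrp_instance VIP_PROXYSQL {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 101
    virtual_ipaddress {
        192.168.55.200
    }
}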

Datacenter goes down

The main problem with replication is that there is no majority mechanism to detect a datacenter failure and promote a new master. One of the solutions is to use Orchestrator/Raft. Orchestrator is a topology supervisor that can control failovers. When used along with Raft, Orchestrator becomes quorum-aware. One of the Orchestrator instances is elected as a leader and executes recovery tasks. The connections between Orchestrator nodes do not correlate with transactional database commits and are sparse.

Orchestrator/Raft can use extra instances which perform monitoring of the topology. In the case of network partitioning, the partitioned Orchestrator instances won't take any action. The part of the Orchestrator cluster which has the quorum will elect a new master and make the necessary topology changes.

ClusterControl is used for management, scaling and, what's most important, node recovery - Orchestrator would handle failovers, but if a slave crashes, ClusterControl will make sure it is recovered. Orchestrator and ClusterControl would be located in the same availability zone, separated from the MySQL nodes, to make sure their activity won't be affected by network splits between availability zones in the data center.

Asynchronous Replication Between MySQL Galera Clusters - Failover and Failback


Galera Cluster enforces strong data consistency, where all nodes in the cluster are tightly coupled. Although network segmentation is supported, replication performance is still bound by two factors:

  • Round trip time (RTT) to the farthest node in the cluster from the originator node.
  • The size of a writeset to be transferred and certified for conflict on the receiver node.

While there are ways to boost the performance of Galera, it is not possible to work around these 2 limiting factors.

Luckily, Galera Cluster was built on top of MySQL, which also comes with its built-in replication feature (duh!). Both Galera replication and MySQL replication exist in the same server software independently. We can make use of these technologies to work together, where all replication within a datacenter will be on Galera while inter-datacenter replication will be on standard MySQL Replication. The slave site can act as a hot-standby site, ready to serve data once the applications are redirected to the backup site. We covered this in a previous blog on MySQL architectures for disaster recovery.

In this blog post, we’ll see how straightforward it is to set up replication between two Galera Clusters (PXC 5.7). Then we’ll look at the more challenging part, that is, handling failures at both node and cluster levels. Failover and failback operations are crucial in order to preserve data integrity across the system.

Cluster Deployment

For the sake of our example, we'll need at least two clusters and two sites - one for the primary and another one for the secondary. It works similarly to traditional MySQL master-slave replication, but on a bigger scale with three nodes in each site. With ClusterControl, you would achieve this by deploying two separate clusters, one on each site. Then, you would configure asynchronous replication between designated nodes from each cluster.

The following diagram illustrates our default architecture:

We have 6 nodes in total, 3 on the primary site and another 3 on the disaster recovery site. To simplify the node representation, we will use the following notations:

  • Primary site: galera1-P, galera2-P, galera3-P (master)
  • Disaster recovery site: galera1-DR, galera2-DR (slave), galera3-DR

Once the Galera Clusters are deployed, simply pick one node on each site to set up the asynchronous replication link. Take note that ALL Galera nodes must be configured with binary logging and log_slave_updates enabled. Enabling GTID is highly recommended, although not compulsory. On all nodes, configure the following parameters inside my.cnf:

server_id=40 # this number must be different on every node.
binlog_format=ROW
log_bin = /var/lib/mysql-binlog/binlog
log_slave_updates = ON
gtid_mode = ON
enforce_gtid_consistency = true
expire_logs_days = 7

If you are using ClusterControl, from the web interface, pick Nodes -> the chosen Galera node -> Enable Binary Logging. You might then have to change the server-id on the DR site manually to make sure every node is holding a distinct server-id value.

Then on galera3-P, create the replication user:

mysql> GRANT REPLICATION SLAVE ON *.* to slave@'%' IDENTIFIED BY 'slavepassword';

On galera2-DR, point the slave to the current master, galera3-P:

mysql> CHANGE MASTER TO MASTER_HOST = 'galera3-primary', MASTER_USER = 'slave', MASTER_PASSWORD = 'slavepassword' , MASTER_AUTO_POSITION=1;

Start the replication slave on galera2-DR:

mysql> START SLAVE;

From the ClusterControl dashboard, once the replication is established, you should see that the DR site has a slave and the Primary Site has 3 masters (nodes that produce binary logs):

The deployment is now complete. Applications should send writes to the Primary Site only, as the replication direction goes from Primary Site to DR site. Reads can be sent to both sites, although the DR site might be lagging behind. Assuming that writes only reach the Primary Site, it should not be necessary to set the DR site to read-only (although it can be a good precaution).
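If you do want that precaution, a simple sketch to run on the DR nodes; the asynchronous replication applier is not blocked by these settings:

mysql> SET GLOBAL read_only = ON;
mysql> SET GLOBAL super_read_only = ON;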

This setup will make the primary and disaster recovery site independent of each other, loosely connected with asynchronous replication. One of the Galera nodes in the DR site will be a slave, that replicates from one of the Galera nodes (master) in the primary site. Ensure that both sites are producing binary logs with GTID, and that log_slave_updates is enabled - the updates that come from the asynchronous replication stream will be applied to the other nodes in the cluster.

We now have a system where a cluster failure on the primary site will not affect the backup site. Performance-wise, WAN latency will not impact updates on the active cluster. These are shipped asynchronously to the backup site.

As a side note, it’s also possible to have a dedicated slave instance as replication relay, instead of using one of the Galera nodes as slave.

Node Failover Procedure

In case the current master (galera3-P) fails and the remaining nodes in the Primary Site are still up, the slave on the Disaster Recovery site (galera2-DR) should be directed to any available masters, as shown in the following diagram:

With GTID-based replication, this is a piece of cake. Simply run the following on galera2-DR:

mysql> STOP SLAVE;
mysql> CHANGE MASTER TO MASTER_HOST = 'galera1-P', MASTER_AUTO_POSITION=1;
mysql> START SLAVE;

Verify the slave status with:

mysql> SHOW SLAVE STATUS\G
...
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
...            
           Retrieved_Gtid_Set: f66a6152-74d6-ee17-62b0-6117d2e643de:2043395-2047231
            Executed_Gtid_Set: d551839f-74d6-ee17-78f8-c14bd8b1e4ed:1-4,
f66a6152-74d6-ee17-62b0-6117d2e643de:1-2047231
...

Ensure the above values are reporting correctly. The executed GTID set should have 2 sets of GTID at this point, one from all transactions executed on the old master and the other for the new master.

Cluster Failover Procedure

If the primary cluster goes down, crashes, or simply loses connectivity from the application standpoint, the application can be directed to the DR site instantly. No database failover is necessary to continue the operation. If the application has connected to the DR site and starts to write, it is important to ensure no other writes are happening on the Primary Site once the DR site is activated.

The following diagram shows our architecture after application is failed over to the DR site:

Assuming the Primary Site is still down, at this point, there is no replication between sites until we re-configure one of the nodes in the Primary Site once it comes back up.

For clean up purposes, the slave process on the DR site has to be stopped. On galera2-DR, stop replication slave:

mysql> STOP SLAVE;

The failover to the DR site is now considered complete.

Cluster Failback Procedure

To failback to the Primary Site, one of the Galera nodes must become a slave to catch up on changes that happened on the DR site. The procedure would be something like the following:

  1. Pick one node to become a slave in the Primary Site (galera3-P).
  2. Stop all nodes other than the chosen one (galera1-P and galera2-P). Keep the chosen slave up (galera3-P).
  3. Create a backup from the new master (galera2-DR) in the DR site and transfer it over to the chosen slave (galera3-P).
  4. Restore the backup.
  5. Start the replication slave.
  6. Start the remaining Galera nodes in the Primary Site, with grastate.dat removed.

The below steps can then be performed to fail back the system to its original architecture - Primary is the master and DR is the slave.

1) Shut down all nodes other than the chosen slave:

$ systemctl stop mysql # galera1-P
$ systemctl stop mysql # galera2-P

Or from the ClusterControl interface, simply pick the node from the UI and click "Shutdown Node".

2) Pick a node in the DR site to be the new master (galera2-DR). Create a mysqldump backup with the necessary parameters (for PXC, pxc_strict_mode has to be set to something other than ENFORCING):

$ mysql -uroot -p -e 'set global pxc_strict_mode = "PERMISSIVE"'
$ mysqldump -uroot -p --all-databases --triggers --routines --events > dump.sql
$ mysql -uroot -p -e 'set global pxc_strict_mode = "ENFORCING"'

3) Transfer the backup to the chosen slave, galera3-P via your preferred remote copy tool:

$ scp dump.sql galera3-primary:~

4) In order to perform RESET MASTER on a Galera node, the Galera replication plugin must be turned off. On galera3-P, disable Galera write-set replication temporarily and then restore the dump file in the very same session:

mysql> SET GLOBAL pxc_strict_mode = 'PERMISSIVE';
mysql> SET wsrep_on=OFF;
mysql> RESET MASTER;
mysql> SOURCE /root/dump.sql;

The wsrep_on variable has session scope. Therefore, we have to perform the restore operation within the same session using the SOURCE statement. Otherwise, restoring with the standard mysql client would require setting wsrep_on=OFF globally, or commenting out wsrep_provider inside my.cnf before MySQL startup.

5) Start the replication thread on the chosen slave, galera3-P:

mysql> CHANGE MASTER TO MASTER_HOST = 'galera2-DR', MASTER_USER = 'slave', MASTER_PASSWORD = 'slavepassword', MASTER_AUTO_POSITION = 1;
mysql> START SLAVE;
mysql> SHOW SLAVE STATUS\G

6) Start the remaining nodes in the cluster (one node at a time), and force an SST by removing grastate.dat beforehand:

$ rm -Rf /var/lib/mysql/grastate.dat
$ systemctl start mysql

Or from ClusterControl, simply pick the node -> Start Node -> check "Perform an Initial Start".

The above will force other Galera nodes to re-sync with galera3-P through SST and get the most up-to-date data. At this point, the replication direction has switched, from DR to Primary. Write operations are coming to the DR site and the Primary Site has become the replicating site:

From the ClusterControl dashboard, you would notice the Primary Site has a slave configured while the DR site nodes are all masters. In ClusterControl, the MASTER indicator means all Galera nodes are generating binary logs:

7) Optionally, we can clean up the slave's entries on galera2-DR since it has already become a master:

mysql> RESET SLAVE ALL;

8) Once the Primary Site catches up, we may switch the application's database traffic back to the primary cluster:

At this point, all writes must go to the Primary Site only. The replication link should be stopped as described under the "Cluster Failover Procedure" section above.

The above mentioned failback steps should be applied when staging back the DR site from the Primary Site:

  • Stop replication between primary site and DR site.
  • Re-slave one of the Galera nodes on the DR site to replicate from the Primary Site.
  • Start replication between both sites.

Once done, the replication direction has gone back to its original configuration, from Primary to DR. Write operations are coming to the Primary Site and the DR site is now the replicating site:

Finally, perform some clean ups on the newly promoted master by running "RESET SLAVE ALL".

Advantages

Cluster-to-cluster asynchronous replication comes with a number of advantages:

  • Minimal downtime during the database failover operation. Basically, you can redirect writes almost instantly to the slave site, if and only if you can protect writes from reaching the master site (as these writes would not be replicated, and would probably be overwritten when re-syncing from the DR site).
  • No performance impact on the primary site since it is independent from the backup (DR) site. Replication from master to slave is performed asynchronously. The master site generates binary logs, the slave site replicates the events and applies the events at some later time.
  • Disaster recovery site can be used for other purposes, e.g., database backup, binary logs backup and reporting or heavy analytical queries (OLAP). Both sites can be used simultaneously, with exceptions on the replication lag and read-only operations on the slave side.
  • The DR cluster could potentially run on smaller instances in a public cloud environment, as long as they can keep up with the primary cluster. The instances can be upgraded if needed. In certain scenarios, it can save you some costs.
  • You only need one extra site for Disaster Recovery, as opposed to active-active Galera multi-site replication setup, which requires at least 3 sites to operate correctly.

Disadvantages

There are also drawbacks having this setup:

  • There is a chance of missing some data during failover if the slave was behind, since replication is asynchronous. This could be improved with semi-synchronous replication and multi-threaded slaves, although there will be another set of challenges waiting (network overhead, replication gap, etc).
  • Despite the failover operation being fairly simple, the failback operation can be tricky and prone to human error. It requires some expertise on switching master/slave role back to the primary site. It's recommended to keep the procedures documented, rehearse the failover/failback operation regularly and use accurate reporting and monitoring tools.
  • There is no built-in failure detection and notification process. You may need to automate and send out notifications to the relevant person once an unwanted event occurs. One good way is to regularly check the slave's status from the point of view of other available nodes on the master's site before raising an alarm for the operations team.
  • Pretty costly, as you have to setup a similar number of nodes on the disaster recovery site. This is not black and white, as the cost justification usually comes from the requirements of your business. With some planning, it is possible to maximize usage of database resources at both sites, regardless of the database roles.

MySQL on Docker: How to Monitor MySQL Containers with Prometheus - Part 1 - Deployment on Standalone and Swarm


Monitoring is a concern for containers, as the infrastructure is dynamic. Containers can be routinely created and destroyed, and are ephemeral. So how do you keep track of your MySQL instances running on Docker?

As with any software component, there are many options out there that can be used. We'll look at Prometheus, a solution built for distributed infrastructure that works very well with Docker.

This is a two-part blog. In this part 1 blog, we are going to cover the deployment aspect of our MySQL containers with Prometheus and its components, running as standalone Docker containers and Docker Swarm services. In part 2, we will look at the important metrics to monitor from our MySQL containers, as well as integration with the paging and notification systems.

Introduction to Prometheus

Prometheus is a full monitoring and trending system that includes built-in and active scraping, storing, querying, graphing, and alerting based on time series data. Prometheus collects metrics through a pull mechanism from configured targets at given intervals, evaluates rule expressions, displays the results, and can trigger alerts if some condition is observed to be true. It supports all the target metrics we want to measure when running MySQL as Docker containers. Those metrics include physical host metrics, Docker container metrics and MySQL server metrics.

Take a look at the following diagram which illustrates Prometheus architecture (taken from Prometheus official documentation):

We are going to deploy some MySQL containers (standalone and Docker Swarm) complete with a Prometheus server, MySQL exporter (i.e., a Prometheus agent to expose MySQL metrics, that can then be scraped by the Prometheus server) and also Alertmanager to handle alerts based on the collected metrics.

For more details check out the Prometheus documentation. In this example, we are going to use the official Docker images provided by the Prometheus team.

Standalone Docker

Deploying MySQL Containers

Let's run two standalone MySQL servers on Docker to simplify our deployment walkthrough. One container will be running the latest MySQL 8.0 and the other one MySQL 5.7. Both containers are in the same Docker network called "db_network":

$ docker network create db_network
$ docker run -d \
--name mysql80 \
--publish 3306 \
--network db_network \
--restart unless-stopped \
--env MYSQL_ROOT_PASSWORD=mypassword \
--volume mysql80-datadir:/var/lib/mysql \
mysql:8 \
--default-authentication-plugin=mysql_native_password

MySQL 8 defaults to a new authentication plugin called caching_sha2_password. For compatibility with the Prometheus MySQL exporter container, let's use the widely-used mysql_native_password plugin whenever we create a new MySQL user on this server.
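If a server is not started with --default-authentication-plugin=mysql_native_password, the plugin can also be specified explicitly when creating the user; a sketch with a hypothetical user and password:

mysql> CREATE USER 'exporter'@'%' IDENTIFIED WITH mysql_native_password BY 'exporterpassword';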

For the second MySQL container running 5.7, we execute the following:

$ docker run -d \
--name mysql57 \
--publish 3306 \
--network db_network \
--restart unless-stopped \
--env MYSQL_ROOT_PASSWORD=mypassword \
--volume mysql57-datadir:/var/lib/mysql \
mysql:5.7

Verify that our MySQL servers are running OK:

[root@docker1 mysql]# docker ps | grep mysql
cc3cd3c4022a        mysql:5.7           "docker-entrypoint.s…"   12 minutes ago      Up 12 minutes       0.0.0.0:32770->3306/tcp   mysql57
9b7857c5b6a1        mysql:8             "docker-entrypoint.s…"   14 minutes ago      Up 14 minutes       0.0.0.0:32769->3306/tcp   mysql80

At this point, our architecture is looking something like this:

Let's get started to monitor them.

Exposing Docker Metrics to Prometheus

Docker has built-in support as a Prometheus target, which we can use to monitor the Docker engine statistics. We can simply enable it by creating a text file called "daemon.json" on the Docker host:

$ vim /etc/docker/daemon.json

And add the following lines:

{
  "metrics-addr" : "12.168.55.161:9323",
  "experimental" : true
}

Where 192.168.55.161 is the Docker host primary IP address. Then, restart Docker daemon to load the change:

$ systemctl restart docker

Since we have defined --restart=unless-stopped in our MySQL containers' run command, the containers will be automatically started after Docker is running.
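To verify that the Docker engine is now exposing metrics, you can query the endpoint directly from the Docker host:

$ curl http://192.168.55.161:9323/metrics | head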

Deploying MySQL Exporter

Before we move further, the mysqld exporter requires a MySQL user to be used for monitoring purposes. On our MySQL containers, create the monitoring user:

$ docker exec -it mysql80 mysql -uroot -p
Enter password:
mysql> CREATE USER 'exporter'@'%' IDENTIFIED BY 'exporterpassword' WITH MAX_USER_CONNECTIONS 3;
mysql> GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'%';

Take note that it is recommended to set a maximum connection limit for the user to avoid overloading the server with monitoring scrapes under heavy load. Repeat the above statements on the second container, mysql57:

$ docker exec -it mysql57 mysql -uroot -p
Enter password:
mysql> CREATE USER 'exporter'@'%' IDENTIFIED BY 'exporterpassword' WITH MAX_USER_CONNECTIONS 3;
mysql> GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'%';

Let's run the mysqld exporter container called "mysql80-exporter" to expose the metrics for our MySQL 8.0 instance as below:

$ docker run -d \
--name mysql80-exporter \
--publish 9104 \
--network db_network \
--restart always \
--env DATA_SOURCE_NAME="exporter:exporterpassword@(mysql80:3306)/" \
prom/mysqld-exporter:latest \
--collect.info_schema.processlist \
--collect.info_schema.innodb_metrics \
--collect.info_schema.tablestats \
--collect.info_schema.tables \
--collect.info_schema.userstats \
--collect.engine_innodb_status

And also another exporter container for our MySQL 5.7 instance:

$ docker run -d \
--name mysql57-exporter \
--publish 9104 \
--network db_network \
--restart always \
-e DATA_SOURCE_NAME="exporter:exporterpassword@(mysql57:3306)/" \
prom/mysqld-exporter:latest \
--collect.info_schema.processlist \
--collect.info_schema.innodb_metrics \
--collect.info_schema.tablestats \
--collect.info_schema.tables \
--collect.info_schema.userstats \
--collect.engine_innodb_status

We enabled a bunch of collector flags for the container to expose the MySQL metrics. You can also enable --collect.slave_status and --collect.slave_hosts if you have MySQL replication running on the containers.

We should be able to retrieve the MySQL metrics via curl from the Docker host directly (port 32771 is the published port assigned automatically by Docker for container mysql80-exporter):

$ curl 127.0.0.1:32771/metrics
...
mysql_info_schema_threads_seconds{state="waiting for lock"} 0
mysql_info_schema_threads_seconds{state="waiting for table flush"} 0
mysql_info_schema_threads_seconds{state="waiting for tables"} 0
mysql_info_schema_threads_seconds{state="waiting on cond"} 0
mysql_info_schema_threads_seconds{state="writing to net"} 0
...
process_virtual_memory_bytes 1.9390464e+07

At this point, our architecture is looking something like this:

We are now good to set up the Prometheus server.

Deploying Prometheus Server

Firstly, create Prometheus configuration file at ~/prometheus.yml and add the following lines:

$ vim ~/prometheus.yml
global:
  scrape_interval:     5s
  scrape_timeout:      3s
  evaluation_interval: 5s

# Our alerting rule files
rule_files:
  - "alert.rules"

# Scrape endpoints
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'mysql'
    static_configs:
      - targets: ['mysql57-exporter:9104','mysql80-exporter:9104']

  - job_name: 'docker'
    static_configs:
      - targets: ['192.168.55.161:9323']

From the Prometheus configuration file, we have defined three jobs - "prometheus", "mysql" and "docker". The first one is the job to monitor the Prometheus server itself. The next one is the job to monitor our MySQL containers, named "mysql". We define the endpoints of our MySQL exporters on port 9104, which expose the Prometheus-compatible metrics from the MySQL 8.0 and 5.7 instances respectively. The "alert.rules" entry is the rule file that we will include in the next blog post for alerting purposes.

We can then map the configuration with the Prometheus container. We also need to create a Docker volume for Prometheus data for persistency and also expose port 9090 publicly:

$ docker run -d \
--name prometheus-server \
--publish 9090:9090 \
--network db_network \
--restart unless-stopped \
--mount type=volume,src=prometheus-data,target=/prometheus \
--mount type=bind,src="$(pwd)"/prometheus.yml,target=/etc/prometheus/prometheus.yml \
--mount type=bind,src="$(pwd)"/alert.rules,target=/etc/prometheus/alert.rules \
prom/prometheus

Now our Prometheus server is already running and can be accessed directly on port 9090 of the Docker host. Open a web browser and go to http://192.168.55.161:9090/ to access the Prometheus web UI. Verify the target status under Status -> Targets and make sure they are all green:

At this point, our container architecture is looking something like this:

Our Prometheus monitoring system for the standalone MySQL containers is now deployed.

Docker Swarm

Deploying a 3-node Galera Cluster

Suppose we want to deploy a three-node Galera Cluster in Docker Swarm; we would have to create 3 different services, each service representing one Galera node. Using this approach, we can keep a static resolvable hostname for our Galera containers, together with the MySQL exporter containers that will accompany each of them. We will be using the MariaDB 10.2 image maintained by the Docker team to run our Galera Cluster.

Firstly, create a MySQL configuration file to be used by our Swarm service:

$ vim ~/my.cnf
[mysqld]

default_storage_engine          = InnoDB
binlog_format                   = ROW

innodb_flush_log_at_trx_commit  = 0
innodb_flush_method             = O_DIRECT
innodb_file_per_table           = 1
innodb_autoinc_lock_mode        = 2
innodb_lock_schedule_algorithm  = FCFS # MariaDB >10.1.19 and >10.2.3 only

wsrep_on                        = ON
wsrep_provider                  = /usr/lib/galera/libgalera_smm.so
wsrep_sst_method                = mariabackup

Create a dedicated database network in our Swarm called "db_swarm":

$ docker network create --driver overlay db_swarm

Import our MySQL configuration file into Docker config so we can load it into our Swarm service when we create it later:

$ cat ~/my.cnf | docker config create my-cnf -

Create the first Galera bootstrap service, named "galera0", with "gcomm://" as the cluster address. This is a transient service used for the bootstrapping process only. We will delete this service once we have the 3 other Galera services running:

$ docker service create \
--name galera0 \
--replicas 1 \
--hostname galera0 \
--network db_swarm \
--publish 3306 \
--publish 4444 \
--publish 4567 \
--publish 4568 \
--config src=my-cnf,target=/etc/mysql/mariadb.conf.d/my.cnf \
--env MYSQL_ROOT_PASSWORD=mypassword \
--mount type=volume,src=galera0-datadir,dst=/var/lib/mysql \
mariadb:10.2 \
--wsrep_cluster_address=gcomm:// \
--wsrep_sst_auth="root:mypassword" \
--wsrep_node_address=galera0

At this point, our database architecture can be illustrated as below:

Then, repeat the following command 3 times to create 3 different Galera services. Replace {name} with galera1, galera2 and galera3 respectively:

$ docker service create \
--name {name} \
--replicas 1 \
--hostname {name} \
--network db_swarm \
--publish 3306 \
--publish 4444 \
--publish 4567 \
--publish 4568 \
--config src=my-cnf,target=/etc/mysql/mariadb.conf.d/my.cnf \
--env MYSQL_ROOT_PASSWORD=mypassword \
--mount type=volume,src={name}-datadir,dst=/var/lib/mysql \
mariadb:10.2 \
--wsrep_cluster_address=gcomm://galera0,galera1,galera2,galera3 \
--wsrep_sst_auth="root:mypassword" \
--wsrep_node_address={name}

Verify our current Docker services:

$ docker service ls 
ID                  NAME                MODE                REPLICAS            IMAGE               PORTS
wpcxye3c4e9d        galera0             replicated          1/1                 mariadb:10.2        *:30022->3306/tcp, *:30023->4444/tcp, *:30024-30025->4567-4568/tcp
jsamvxw9tqpw        galera1             replicated          1/1                 mariadb:10.2        *:30026->3306/tcp, *:30027->4444/tcp, *:30028-30029->4567-4568/tcp
otbwnb3ridg0        galera2             replicated          1/1                 mariadb:10.2        *:30030->3306/tcp, *:30031->4444/tcp, *:30032-30033->4567-4568/tcp
5jp9dpv5twy3        galera3             replicated          1/1                 mariadb:10.2        *:30034->3306/tcp, *:30035->4444/tcp, *:30036-30037->4567-4568/tcp

Our architecture is now looking something like this:

We need to remove the Galera bootstrap Swarm service, galera0, to stop it from running, because if its container gets rescheduled by Docker Swarm, a new replica would be started with a fresh, empty volume. We would then run the risk of data loss, since --wsrep_cluster_address still contains "galera0" on the other Galera nodes (or Swarm services). So, let's remove it:

$ docker service rm galera0

At this point, we have our three-node Galera Cluster:
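To double-check that the remaining three nodes have formed a single cluster, you can query wsrep_cluster_size on any of them. A minimal sketch, run on the Swarm node that hosts the galera1 task (the docker ps filter and the root password follow the service definitions above; the expected value is 3):

$ docker exec -it $(docker ps -q -f name=galera1.1) \
mysql -uroot -pmypassword -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size'"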

We are now ready to deploy our MySQL exporter and Prometheus Server.

MySQL Exporter Swarm Service

Log in to one of the Galera containers (replace {galera1} in the command below with the actual container name of the galera1 task, as shown by docker ps) and create the exporter user with proper privileges:

$ docker exec -it {galera1} mysql -uroot -p
Enter password:
mysql> CREATE USER 'exporter'@'%' IDENTIFIED BY 'exporterpassword' WITH MAX_USER_CONNECTIONS 3;
mysql> GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'%';

Then, create the exporter service for each of the Galera services (replace {name} with galera1, galera2 and galera3 respectively):

$ docker service create \
--name {name}-exporter \
--network db_swarm \
--replicas 1 \
-p 9104 \
-e DATA_SOURCE_NAME="exporter:exporterpassword@({name}:3306)/" \
prom/mysqld-exporter:latest \
--collect.info_schema.processlist \
--collect.info_schema.innodb_metrics \
--collect.info_schema.tablestats \
--collect.info_schema.tables \
--collect.info_schema.userstats \
--collect.engine_innodb_status

At this point, our architecture is looking something like this with exporter services in the picture:

Prometheus Server Swarm Service

Finally, let's deploy our Prometheus server. Similar to the Galera deployment, we have to prepare the Prometheus configuration file first before importing it into Swarm using Docker config command:

$ vim ~/prometheus.yml
global:
  scrape_interval:     5s
  scrape_timeout:      3s
  evaluation_interval: 5s

# Our alerting rule files
rule_files:
  - "alert.rules"

# Scrape endpoints
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'galera'
    static_configs:
      - targets: ['galera1-exporter:9104','galera2-exporter:9104', 'galera3-exporter:9104']

In the Prometheus configuration file, we have defined two jobs - "prometheus" and "galera". The first one is the job to monitor the Prometheus server itself, while the second one monitors our MariaDB containers and is named "galera". We define the endpoints of our MySQL exporters on port 9104, which expose the Prometheus-compatible metrics from the three Galera nodes respectively. The "alert.rules" entry is the rule file that we will cover in the next blog post for alerting purposes.

Import the configuration file into Docker config to be used with Prometheus container later:

$ cat ~/prometheus.yml | docker config create prometheus-yml -

Let's run the Prometheus server container, and publish port 9090 of all Docker hosts for the Prometheus web UI service:

$ docker service create \
--name prometheus-server \
--publish 9090:9090 \
--network db_swarm \
--replicas 1 \
--config src=prometheus-yml,target=/etc/prometheus/prometheus.yml \
--mount type=volume,src=prometheus-data,dst=/prometheus \
prom/prometheus

Verify with the Docker service command that we have 3 Galera services, 3 exporter services and 1 Prometheus service:

$ docker service ls
ID                  NAME                MODE                REPLICAS            IMAGE                         PORTS
jsamvxw9tqpw        galera1             replicated          1/1                 mariadb:10.2                  *:30026->3306/tcp, *:30027->4444/tcp, *:30028-30029->4567-4568/tcp
hbh1dtljn535        galera1-exporter    replicated          1/1                 prom/mysqld-exporter:latest   *:30038->9104/tcp
otbwnb3ridg0        galera2             replicated          1/1                 mariadb:10.2                  *:30030->3306/tcp, *:30031->4444/tcp, *:30032-30033->4567-4568/tcp
jq8i77ch5oi3        galera2-exporter    replicated          1/1                 prom/mysqld-exporter:latest   *:30039->9104/tcp
5jp9dpv5twy3        galera3             replicated          1/1                 mariadb:10.2                  *:30034->3306/tcp, *:30035->4444/tcp, *:30036-30037->4567-4568/tcp
10gdkm1ypkav        galera3-exporter    replicated          1/1                 prom/mysqld-exporter:latest   *:30040->9104/tcp
gv9llxrig30e        prometheus-server   replicated          1/1                 prom/prometheus:latest        *:9090->9090/tcp

Now our Prometheus server is already running and can be accessed directly on port 9090 from any Docker node. Open a web browser and go to http://192.168.55.161:9090/ to access the Prometheus web UI. Verify the target status under Status -> Targets and make sure they are all green:

At this point, our Swarm architecture is looking something like this:

To be continued...

We now have our database and monitoring stack deployed on Docker. In part 2 of the blog, we will look into the different MySQL metrics to keep an eye on. We’ll also see how to configure alerting with Prometheus.

How to Monitor your Database Servers using ClusterControl CLI

How would you like to merge the "top" process output from all your 5 database nodes and sort it by CPU usage, with just a one-liner command? Yeah, you read that right! How about interactive graphs displayed in the terminal interface? We introduced the CLI client for ClusterControl called s9s about a year ago, and it’s been a great complement to the web interface. It’s also open source.

In this blog post, we’ll show you how you can monitor your databases using your terminal and s9s CLI.

Introduction to s9s, The ClusterControl CLI

ClusterControl CLI (or s9s or s9s CLI), is an open source project and optional package introduced with ClusterControl version 1.4.1. It is a command line tool to interact, control and manage your database infrastructure using ClusterControl. The s9s command line project is open source and can be found on GitHub.

Starting from version 1.4.1, the installer script will automatically install the package (s9s-tools) on the ClusterControl node.

Some prerequisites. In order for you to run s9s-tools CLI, the following must be true:

  • A running ClusterControl Controller (cmon).
  • s9s client, install as a separate package.
  • Port 9501 must be reachable by the s9s client.

Installing the s9s CLI is straightforward if you install it on the ClusterControl Controller host itself:

$ rm -Rf ~/.s9s
$ wget http://repo.severalnines.com/s9s-tools/install-s9s-tools.sh
$ chmod +x install-s9s-tools.sh
$ ./install-s9s-tools.sh

You can also install s9s-tools outside of the ClusterControl server (on your workstation laptop or a bastion host), as long as the ClusterControl Controller RPC (TLS) interface is exposed to the network (it defaults to 127.0.0.1:9501). You can find more details on how to configure this in the documentation page.

To verify if you can connect to ClusterControl RPC interface correctly, you should get the OK response when running the following command:

$ s9s cluster --ping
PING OK 2.000 ms

As a side note, also look at the limitations when using this tool.

Example Deployment

Our example deployment consists of 8 nodes across 3 clusters:

  • PostgreSQL Streaming Replication - 1 master, 2 slaves
  • MySQL Replication - 1 master, 1 slave
  • MongoDB Replica Set - 1 primary, 2 secondary nodes

All database clusters were deployed by ClusterControl by using "Deploy Database Cluster" deployment wizard and from the UI point-of-view, this is what we would see in the cluster dashboard:

Cluster Monitoring

We will start by listing out the clusters:

$ s9s cluster --list --long
ID STATE   TYPE              OWNER  GROUP  NAME                   COMMENT
23 STARTED postgresql_single system admins PostgreSQL 10          All nodes are operational.
24 STARTED replication       system admins Oracle 5.7 Replication All nodes are operational.
25 STARTED mongodb           system admins MongoDB 3.6            All nodes are operational.

We see the same clusters as the UI. We can get more details on the particular cluster by using the --stat flag. Multiple clusters and nodes can also be monitored this way, the command line options can even use wildcards in the node and cluster names:

$ s9s cluster --stat *Replication
Oracle 5.7 Replication                                                                                                                                                                                               Name: Oracle 5.7 Replication              Owner: system/admins
      ID: 24                                  State: STARTED
    Type: REPLICATION                        Vendor: oracle 5.7
  Status: All nodes are operational.
  Alarms:  0 crit   1 warn
    Jobs:  0 abort  0 defnd  0 dequd  0 faild  7 finsd  0 runng
  Config: '/etc/cmon.d/cmon_24.cnf'
 LogFile: '/var/log/cmon_24.log'

                                                                                HOSTNAME    CPU   MEMORY   SWAP    DISK       NICs
                                                                                10.0.0.104 1  6% 992M 120M 0B 0B 19G 13G   10K/s 54K/s
                                                                                10.0.0.168 1  6% 992M 116M 0B 0B 19G 13G   11K/s 66K/s
                                                                                10.0.0.156 2 39% 3.6G 2.4G 0B 0B 19G 3.3G 338K/s 79K/s

The output above gives a summary of our MySQL replication setup together with the cluster status, state, vendor, configuration file and so on. Further down, you can see the list of nodes that fall under this cluster ID with a summarized view of system resources for each host, like number of CPUs, total memory, memory usage, swap, disk and network interfaces. All information shown is retrieved from the CMON database, not directly from the actual nodes.

You can also get a summarized view of all databases on all clusters:

$ s9s  cluster --list-databases --long
SIZE        #TBL #ROWS     OWNER  GROUP  CLUSTER                DATABASE
  7,340,032    0         0 system admins PostgreSQL 10          postgres
  7,340,032    0         0 system admins PostgreSQL 10          template1
  7,340,032    0         0 system admins PostgreSQL 10          template0
765,460,480   24 2,399,611 system admins PostgreSQL 10          sbtest
          0  101         - system admins Oracle 5.7 Replication sys
Total: 5 databases, 789,577,728, 125 tables.

The last line summarizes that we have a total of 5 databases with 125 tables, 4 of them on our PostgreSQL cluster.

For a complete example of usage on s9s cluster command line options, check out s9s cluster documentation.

Node Monitoring

For node monitoring, the s9s CLI has features similar to the cluster option. To get a summarized view of all nodes, you can simply run:

$ s9s node --list --long
STAT VERSION    CID CLUSTER                HOST       PORT  COMMENT
coC- 1.6.2.2662  23 PostgreSQL 10          10.0.0.156  9500 Up and running
poM- 10.4        23 PostgreSQL 10          10.0.0.44   5432 Up and running
poS- 10.4        23 PostgreSQL 10          10.0.0.58   5432 Up and running
poS- 10.4        23 PostgreSQL 10          10.0.0.60   5432 Up and running
soS- 5.7.23-log  24 Oracle 5.7 Replication 10.0.0.104  3306 Up and running.
coC- 1.6.2.2662  24 Oracle 5.7 Replication 10.0.0.156  9500 Up and running
soM- 5.7.23-log  24 Oracle 5.7 Replication 10.0.0.168  3306 Up and running.
mo-- 3.2.20      25 MongoDB 3.6            10.0.0.125 27017 Up and Running
mo-- 3.2.20      25 MongoDB 3.6            10.0.0.131 27017 Up and Running
coC- 1.6.2.2662  25 MongoDB 3.6            10.0.0.156  9500 Up and running
mo-- 3.2.20      25 MongoDB 3.6            10.0.0.35  27017 Up and Running
Total: 11

The leftmost column specifies the type of the node. For this deployment, "c" represents the ClusterControl Controller, "p" stands for PostgreSQL, "m" for MongoDB, "e" for Memcached and "s" for generic MySQL nodes. The next character is the host status - "o" for online, "l" for offline, "f" for failed nodes and so on. The next one is the role of the node in the cluster: it can be "M" for master, "S" for slave, "C" for controller and "-" for everything else. The remaining columns are pretty self-explanatory.

You can get the full list by looking at the man page of this component:

$ man s9s-node

From there, we can jump into more detailed stats for all nodes with the --stat flag:

$ s9s node --stat --cluster-id=24
 10.0.0.104:3306
    Name: 10.0.0.104              Cluster: Oracle 5.7 Replication (24)
      IP: 10.0.0.104                 Port: 3306
   Alias: -                         Owner: system/admins
   Class: CmonMySqlHost              Type: mysql
  Status: CmonHostOnline             Role: slave
      OS: centos 7.0.1406 core     Access: read-only
   VM ID: -
 Version: 5.7.23-log
 Message: Up and running.
LastSeen: Just now                    SSH: 0 fail(s)
 Connect: y Maintenance: n Managed: n Recovery: n Skip DNS: y SuperReadOnly: n
     Pid: 16592  Uptime: 01:44:38
  Config: '/etc/my.cnf'
 LogFile: '/var/log/mysql/mysqld.log'
 PidFile: '/var/lib/mysql/mysql.pid'
 DataDir: '/var/lib/mysql/'
 10.0.0.168:3306
    Name: 10.0.0.168              Cluster: Oracle 5.7 Replication (24)
      IP: 10.0.0.168                 Port: 3306
   Alias: -                         Owner: system/admins
   Class: CmonMySqlHost              Type: mysql
  Status: CmonHostOnline             Role: master
      OS: centos 7.0.1406 core     Access: read-write
   VM ID: -
 Version: 5.7.23-log
 Message: Up and running.
  Slaves: 10.0.0.104:3306
LastSeen: Just now                    SSH: 0 fail(s)
 Connect: n Maintenance: n Managed: n Recovery: n Skip DNS: y SuperReadOnly: n
     Pid: 975  Uptime: 01:52:53
  Config: '/etc/my.cnf'
 LogFile: '/var/log/mysql/mysqld.log'
 PidFile: '/var/lib/mysql/mysql.pid'
 DataDir: '/var/lib/mysql/'
 10.0.0.156:9500
    Name: 10.0.0.156              Cluster: Oracle 5.7 Replication (24)
      IP: 10.0.0.156                 Port: 9500
   Alias: -                         Owner: system/admins
   Class: CmonHost                   Type: controller
  Status: CmonHostOnline             Role: controller
      OS: centos 7.0.1406 core     Access: read-write
   VM ID: -
 Version: 1.6.2.2662
 Message: Up and running
LastSeen: 28 seconds ago              SSH: 0 fail(s)
 Connect: n Maintenance: n Managed: n Recovery: n Skip DNS: n SuperReadOnly: n
     Pid: 12746  Uptime: 01:10:05
  Config: ''
 LogFile: '/var/log/cmon_24.log'
 PidFile: ''
 DataDir: ''

Printing graphs with the s9s client can also be very informative. It presents the data the controller collected in various graphs. There are almost 30 graphs supported by this tool as listed here, and s9s-node enumerates them all. The following shows a server load histogram of all nodes for cluster ID 1 as collected by CMON, right from your terminal:
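A command along the following lines produces such a graph (a sketch based on the s9s-node manual page; the exact graph names may vary between s9s versions):

$ s9s node --stat --cluster-id=1 --graph=load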

It is possible to set the start and end date and time. One can view short periods (like the last hour) or longer periods (like a week or a month). The following is an example of viewing the disk utilization for the last hour:

Using the --density option, a different view can be printed for every graph. This density graph shows not the time series, but how frequently the given values were seen (X-axis represents the density value):

If the terminal does not support Unicode characters, the --only-ascii option can switch them off:

The graphs have colors, where dangerously high values for example are shown in red. The list of nodes can be filtered with --nodes option, where you can specify the node names or use wildcards if convenient.

Process Monitoring

Another cool thing about the s9s CLI is that it provides a processlist of the entire cluster - a “top” for all nodes, with all processes merged into one view. The following command runs "top" on all database nodes for cluster ID 24, sorted by CPU consumption and refreshed continuously:

$ s9s process --top --cluster-id=24
Oracle 5.7 Replication - 04:39:17                                                                                                                                                      All nodes are operational.
3 hosts, 4 cores, 10.6 us,  4.2 sy, 84.6 id,  0.1 wa,  0.3 st,
GiB Mem : 5.5 total, 1.7 free, 2.6 used, 0.1 buffers, 1.1 cached
GiB Swap: 0 total, 0 used, 0 free,

PID   USER     HOST       PR  VIRT      RES    S   %CPU   %MEM COMMAND
12746 root     10.0.0.156 20  1359348    58976 S  25.25   1.56 cmon
 1587 apache   10.0.0.156 20   462572    21632 S   1.38   0.57 httpd
  390 root     10.0.0.156 20     4356      584 S   1.32   0.02 rngd
  975 mysql    10.0.0.168 20  1144260    71936 S   1.11   7.08 mysqld
16592 mysql    10.0.0.104 20  1144808    75976 S   1.11   7.48 mysqld
22983 root     10.0.0.104 20   127368     5308 S   0.92   0.52 sshd
22548 root     10.0.0.168 20   127368     5304 S   0.83   0.52 sshd
 1632 mysql    10.0.0.156 20  3578232  1803336 S   0.50  47.65 mysqld
  470 proxysql 10.0.0.156 20   167956    35300 S   0.44   0.93 proxysql
  338 root     10.0.0.104 20     4304      600 S   0.37   0.06 rngd
  351 root     10.0.0.168 20     4304      600 R   0.28   0.06 rngd
   24 root     10.0.0.156 20        0        0 S   0.19   0.00 rcu_sched
  785 root     10.0.0.156 20   454112    11092 S   0.13   0.29 httpd
   26 root     10.0.0.156 20        0        0 S   0.13   0.00 rcuos/1
   25 root     10.0.0.156 20        0        0 S   0.13   0.00 rcuos/0
22498 root     10.0.0.168 20   127368     5200 S   0.09   0.51 sshd
14538 root     10.0.0.104 20        0        0 S   0.09   0.00 kworker/0:1
22933 root     10.0.0.104 20   127368     5200 S   0.09   0.51 sshd
28295 root     10.0.0.156 20   127452     5016 S   0.06   0.13 sshd
 2238 root     10.0.0.156 20   197520    10444 S   0.06   0.28 vc-agent-007
  419 root     10.0.0.156 20    34764     1660 S   0.06   0.04 systemd-logind
    1 root     10.0.0.156 20    47628     3560 S   0.06   0.09 systemd
27992 proxysql 10.0.0.156 20    11688      872 S   0.00   0.02 proxysql_galera
28036 proxysql 10.0.0.156 20    11688      876 S   0.00   0.02 proxysql_galera

There is also a --list flag which returns a similar result without continuous updates (similar to the "ps" command):

$ s9s process --list --cluster-id=25

Job Monitoring

Jobs are tasks performed by the controller in the background, so that the client application does not need to wait until the entire job is finished. ClusterControl executes management tasks by assigning an ID for every task and lets the internal scheduler decide whether two or more jobs can be run in parallel. For example, more than one cluster deployment can be executed simultaneously, as well as other long running operations like backup and automatic upload of backups to cloud storage.

In any management operation, it would be helpful if we could monitor the progress and status of a specific job, e.g., scaling out a new slave for our MySQL replication. The following command adds a new slave, 10.0.0.77, to scale out our MySQL replication:

$ s9s cluster --add-node --nodes="10.0.0.77" --cluster-id=24
Job with ID 66992 registered.

We can then monitor job ID 66992 using the job option:

$ s9s job --log --job-id=66992
addNode: Verifying job parameters.
10.0.0.77:3306: Adding host to cluster.
10.0.0.77:3306: Testing SSH to host.
10.0.0.77:3306: Installing node.
10.0.0.77:3306: Setup new node (installSoftware = true).
10.0.0.77:3306: Setting SELinux in permissive mode.
10.0.0.77:3306: Disabling firewall.
10.0.0.77:3306: Setting vm.swappiness = 1
10.0.0.77:3306: Installing software.
10.0.0.77:3306: Setting up repositories.
10.0.0.77:3306: Installing helper packages.
10.0.0.77: Upgrading nss.
10.0.0.77: Upgrading ca-certificates.
10.0.0.77: Installing socat.
...
10.0.0.77: Installing pigz.
10.0.0.77: Installing bzip2.
10.0.0.77: Installing iproute2.
10.0.0.77: Installing tar.
10.0.0.77: Installing openssl.
10.0.0.77: Upgrading openssl openssl-libs.
10.0.0.77: Finished with helper packages.
10.0.0.77:3306: Verifying helper packages (checking if socat is installed successfully).
10.0.0.77:3306: Uninstalling existing MySQL packages.
10.0.0.77:3306: Installing replication software, vendor oracle, version 5.7.
10.0.0.77:3306: Installing software.
...

Or we can use the --wait flag and get a spinner with progress bar:

$ s9s job --wait --job-id=66992
Add Node to Cluster
- Job 66992 RUNNING    [         █] ---% Add New Node to Cluster

That's it for today's monitoring supplement. We hope that you’ll give the CLI a try and get value out of it. Happy clustering!


Database Security Monitoring for MySQL and MariaDB

Data protection is one of the most significant aspects of administering a database. Depending on the organizational structure, whether you are a developer, sysadmin or DBA, if you are managing the production database, you must monitor data for unauthorized access and usage. The purpose of security monitoring is twofold. One, to identify unauthorized activity on the database. And two, to check if databases and their configurations on a company-wide basis are compliant with security policies and standards.

In this article, we will divide monitoring for security in two categories. One will be related to auditing of MySQL and MariaDB databases activities. The second category will be about monitoring your instances for potential security gaps.

Query and connection policy-based monitoring

Continuous auditing is an imperative task for monitoring your database environment. By auditing your database, you can achieve accountability for actions taken or content accessed. Moreover, the audit may include some critical system components, such as the ones associated with financial data to support a precise set of regulations like SOX, or the EU GDPR regulation. Usually, it is achieved by logging information about DB operations on the database to an external log file.

By default, auditing in MySQL and MariaDB is disabled. You can achieve it by installing additional plugins or by capturing all queries with the general_log parameter. The general query log file is a general record of what MySQL is performing. The server records some information to this log when clients connect or disconnect, and it logs each SQL statement received from clients. Due to performance issues and lack of configuration options, the general_log is not a good solution for security audit purposes.
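For completeness, this is roughly how the general query log would be toggled at runtime if you still wanted to capture everything for a short while (a sketch only; the file path is an example, and the log can grow very quickly on a busy server):

$ mysql -uroot -p -e "SET GLOBAL general_log_file='/var/log/mysql/general.log'; SET GLOBAL general_log=ON;"
$ mysql -uroot -p -e "SET GLOBAL general_log=OFF;"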

If you use MySQL Enterprise, you can use the MySQL Enterprise Audit Plugin, which is an extension to the proprietary MySQL version. The MySQL Enterprise Audit Plugin is only available with MySQL Enterprise, which is a commercial offering from Oracle. Percona and MariaDB have created their own open source versions of the audit plugin. Lastly, the McAfee plugin for MySQL can also be used with various versions of MySQL. In this article, we will focus on the open source plugins, although the Enterprise version from Oracle seems to be the most robust and stable.

Characteristics of MySQL open source audit plugins

While the open source audit plugins do the same job as the Enterprise plugin from Oracle - they produce output with database query and connections - there are some major architectural differences.

MariaDB Audit Plugin – The MariaDB Audit Plugin works with MariaDB, MySQL (as of version 5.5.34 and 10.0.7) and Percona Server. MariaDB started including the Audit Plugin by default from versions 10.0.10 and 5.5.37, and it can be installed in any version from MariaDB 5.5.20. It is the only plugin that supports Oracle MySQL, Percona Server, and MariaDB. It is available on the Windows and Linux platforms. Versions starting from 1.2 are the most stable, and it may be risky to use older versions in your production environment.

McAfee MySQL Audit Plugin – This plugin does not use the MySQL audit API. It was recently updated to support MySQL 5.7. Some tests show that API-based plugins may provide better performance, but you need to verify that in your own environment.

Percona Audit Log Plugin – Percona provides an open source auditing solution that installs with Percona Server 5.5.37+ and 5.6.17+ as part of the installation process. Compared to the other open source plugins, this plugin has richer output features, as it can output to XML, JSON and syslog.

As it has some internal hooks to the server to be feature-compatible with Oracle’s plugin, it is not available as a standalone plugin for other versions of MySQL.

Plugin installation based on MariaDB audit extension

The installation of open source MySQL plugins is quite similar for MariaDB, Percona, and McAfee versions.
Percona and MariaDB ship their plugins as part of the default server binaries, so there is no need to download the plugins separately. The Percona version only officially supports its own fork of MySQL, so there is no direct download from the vendor's website (if you want to use this plugin with MySQL, you will have to obtain the plugin from a Percona Server package). If you would like to use the MariaDB plugin with other forks of MySQL, you can download it from https://downloads.mariadb.com/Audit-Plugin/MariaDB-Audit-Plugin/. The McAfee plugin is available at https://github.com/mcafee/mysql-audit/wiki/Installation.

Before you start the plugin installation, you can check if the plugin is present in the system. The plugin directory (dynamic plugins do not require an instance restart) can be checked with:

SHOW GLOBAL VARIABLES LIKE 'plugin_dir';

+---------------+--------------------------+
| Variable_name | Value                    |
+---------------+--------------------------+
| plugin_dir    | /usr/lib64/mysql/plugin/ |
+---------------+--------------------------+

Check the directory returned at the filesystem level to make sure you have a copy of the plugin library. If you do not have server_audit.so or server_audit.dll inside /usr/lib64/mysql/plugin/, then most likely your MariaDB version is not supported and you should upgrade it to the latest version.

The syntax to install the MariaDB plugin is:

INSTALL SONAME 'server_audit';

To check installed plugins you need to run:

SHOW PLUGINS;
MariaDB [(none)]> show plugins;
+-------------------------------+----------+--------------------+--------------------+---------+
| Name                          | Status   | Type               | Library            | License |
+-------------------------------+----------+--------------------+--------------------+---------+
| binlog                        | ACTIVE   | STORAGE ENGINE     | NULL               | GPL     |
| mysql_native_password         | ACTIVE   | AUTHENTICATION     | NULL               | GPL     |
| mysql_old_password            | ACTIVE   | AUTHENTICATION     | NULL               | GPL     |
| wsrep                         | ACTIVE   | STORAGE ENGINE     | NULL               | GPL     |
| MRG_MyISAM                    | ACTIVE   | STORAGE ENGINE     | NULL               | GPL     |
| MEMORY                        | ACTIVE   | STORAGE ENGINE     | NULL               | GPL     |
| CSV                           | ACTIVE   | STORAGE ENGINE     | NULL               | GPL     |
| MyISAM                        | ACTIVE   | STORAGE ENGINE     | NULL               | GPL     |
| CLIENT_STATISTICS             | ACTIVE   | INFORMATION SCHEMA | NULL               | GPL     |
| INDEX_STATISTICS              | ACTIVE   | INFORMATION SCHEMA | NULL               | GPL     |
| TABLE_STATISTICS              | ACTIVE   | INFORMATION SCHEMA | NULL               | GPL     |
| USER_STATISTICS               | ACTIVE   | INFORMATION SCHEMA | NULL               | GPL     |
| PERFORMANCE_SCHEMA            | ACTIVE   | STORAGE ENGINE     | NULL               | GPL     |
| InnoDB                        | ACTIVE   | STORAGE ENGINE     | NULL               | GPL     |
| INNODB_TRX                    | ACTIVE   | INFORMATION SCHEMA | NULL               | GPL     |
| INNODB_LOCKS                  | ACTIVE   | INFORMATION SCHEMA | NULL               | GPL     |
| INNODB_LOCK_WAITS             | ACTIVE   | INFORMATION SCHEMA | NULL               | GPL     |
| INNODB_CMP                    | ACTIVE   | INFORMATION SCHEMA | NULL               | GPL     |
...
| INNODB_MUTEXES                | ACTIVE   | INFORMATION SCHEMA | NULL               | GPL     |
| INNODB_SYS_SEMAPHORE_WAITS    | ACTIVE   | INFORMATION SCHEMA | NULL               | GPL     |
| INNODB_TABLESPACES_ENCRYPTION | ACTIVE   | INFORMATION SCHEMA | NULL               | BSD     |
| INNODB_TABLESPACES_SCRUBBING  | ACTIVE   | INFORMATION SCHEMA | NULL               | BSD     |
| Aria                          | ACTIVE   | STORAGE ENGINE     | NULL               | GPL     |
| SEQUENCE                      | ACTIVE   | STORAGE ENGINE     | NULL               | GPL     |
| user_variables                | ACTIVE   | INFORMATION SCHEMA | NULL               | GPL     |
| FEEDBACK                      | DISABLED | INFORMATION SCHEMA | NULL               | GPL     |
| partition                     | ACTIVE   | STORAGE ENGINE     | NULL               | GPL     |
| rpl_semi_sync_master          | ACTIVE   | REPLICATION        | semisync_master.so | GPL     |
| rpl_semi_sync_slave           | ACTIVE   | REPLICATION        | semisync_slave.so  | GPL     |
| SERVER_AUDIT                  | ACTIVE   | AUDIT              | server_audit.so    | GPL     |
+-------------------------------+----------+--------------------+--------------------+---------+

If you need additional information, check the PLUGINS table in the information_schema database which contains more detailed information.

Another way to install the plugin is to enable it in my.cnf and restart the instance. An example of a basic audit plugin configuration for MariaDB could be:

server_audit_events=CONNECT
server_audit_file_path=/var/log/mysql/audit.log
server_audit_file_rotate_size=1073741824
server_audit_file_rotations=8
server_audit_logging=ON
server_audit_incl_users=
server_audit_excl_users=
server_audit_output_type=FILE
server_audit_query_log_limit=1024

The above settings should be placed in my.cnf. The audit plugin will create the file /var/log/mysql/audit.log, which will be rotated at 1GB in size, with eight rotations kept before the oldest file is overwritten. The file will contain only information about connections.

Currently, there are sixteen settings which you can use to adjust the MariaDB audit plugin.

server_audit_events
server_audit_excl_users
server_audit_file_path
server_audit_file_rotate_now
server_audit_file_rotate_size
server_audit_file_rotations
server_audit_incl_users
server_audit_loc_info
server_audit_logging
server_audit_mode
server_audit_output_type
server_audit_query_log_limit
server_audit_syslog_facility
server_audit_syslog_ident
server_audit_syslog_info
server_audit_syslog_priority

Among them, you can find options to include or exclude users, set different logging events (CONNECT or QUERY) and switch between file and syslog.
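Most of these settings are also dynamic, so they can be adjusted at runtime without restarting the server. A quick sketch (excluding the "cmon" user is just an example of keeping a busy monitoring account out of the audit trail):

$ mysql -uroot -p -e "SET GLOBAL server_audit_logging=ON;"
$ mysql -uroot -p -e "SET GLOBAL server_audit_events='CONNECT,QUERY';"
$ mysql -uroot -p -e "SET GLOBAL server_audit_excl_users='cmon';"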

To make sure the plugin will be enabled upon server startup, you have to set plugin_load=server_audit=server_audit.so in your my.cnf settings. Such a configuration can additionally be protected by server_audit=FORCE_PLUS_PERMANENT, which disables the option to uninstall the plugin:

UNINSTALL PLUGIN server_audit;

ERROR 1702 (HY000):
Plugin 'server_audit' is force_plus_permanent and can not be unloaded

Here are some sample entries produced by the MariaDB audit plugin:

20180817 20:00:01,slave,cmon,cmon,31,0,DISCONNECT,information_schema,,0
20180817 20:47:01,slave,cmon,cmon,17,0,DISCONNECT,information_schema,,0
20180817 20:47:02,slave,cmon,cmon,19,0,DISCONNECT,information_schema,,0
20180817 20:47:02,slave,cmon,cmon,18,0,DISCONNECT,information_schema,,0
20180819 17:19:19,slave,cmon,cmon,12,0,CONNECT,information_schema,,0
20180819 17:19:19,slave,root,localhost,13,0,FAILED_CONNECT,,,1045
20180819 17:19:19,slave,root,localhost,13,0,DISCONNECT,,,0
20180819 17:19:20,slave,cmon,cmon,14,0,CONNECT,mysql,,0
20180819 17:19:20,slave,cmon,cmon,14,0,DISCONNECT,mysql,,0
20180819 17:19:21,slave,cmon,cmon,15,0,CONNECT,information_schema,,0
20180819 17:19:21,slave,cmon,cmon,16,0,CONNECT,information_schema,,0
20180819 19:00:01,slave,cmon,cmon,17,0,CONNECT,information_schema,,0
20180819 19:00:01,slave,cmon,cmon,17,0,DISCONNECT,information_schema,,0

Schema changes report

If you need to track only DDL changes, you can use the ClusterControl Operational Report on Schema Change. The Schema Change Detection Report shows any DDL changes in your database. This functionality requires an additional parameter in the ClusterControl configuration file. If it is not set, you will see the following information: "schema_change_detection_address is not set in /etc/cmon.d/cmon_1.cnf". Once that is in place, an example output may look like below:

It can be set up with a schedule, and the reports emailed to recipients.
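Setting the missing parameter by hand could look like the following sketch, assuming cluster ID 1 and using one of the monitored database hosts as the address (10.0.0.104 is purely an example value):

$ echo "schema_change_detection_address=10.0.0.104" >> /etc/cmon.d/cmon_1.cnf
$ systemctl restart cmon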

ClusterControl: Schedule Operational Report

MySQL Database Security Assessment

Package upgrade check

First, we will start with security checks. Being up-to-date with MySQL patches will help reduce the risks associated with known vulnerabilities present in the MySQL server. You can keep your environment up-to-date by using the vendors’ package repositories. Based on this information, you can build your own reports, or use tools like ClusterControl to verify your environment and alert you on possible updates.

The ClusterControl Upgrade Report gathers information from the operating system and compares it to the packages available in the repository. The report is divided into four sections: upgrade summary, database packages, security packages, and other packages. You can quickly compare what you have installed on your system and find a recommended upgrade or patch.

ClusterControl: Upgrade Report
ClusterControl: Upgrade Report details

To compare versions manually, you can run:

SHOW VARIABLES WHERE variable_name LIKE "version";

With security bulletins like:
https://www.oracle.com/technetwork/topics/security/alerts-086861.html
https://nvd.nist.gov/view/vuln/search-results?adv_search=true&cves=on&cpe_vendor=cpe%3a%2f%3aoracle&cpe_produ
https://www.percona.com/doc/percona-server/LATEST/release-notes/release-notes_index.html
https://downloads.mariadb.org/mariadb/+releases/
https://www.cvedetails.com/vulnerability-list/vendor_id-12010/Mariadb.html
https://www.cvedetails.com/vulnerability-list/vendor_id-13000/Percona.html

Or vendor repositories:

On Debian

sudo apt list mysql-server

On RHEL/CentOS

yum list | grep -i mariadb-server

Accounts without password

Blank passwords allow a user to log in without a password. MySQL used to come with a set of pre-created users, some of which could connect to the database without a password or, even worse, as anonymous users. Fortunately, this has changed in MySQL 5.7, which by default comes only with a root account that uses the password you choose at installation time.

For each row returned by the following audit query, set a password:

SELECT User,host
FROM mysql.user
WHERE authentication_string='';
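A sketch of fixing such an account on MySQL 5.7 / MariaDB 10.2 and later ('app'@'10.0.0.%' stands in for a hypothetical row returned by the query above; older versions would use SET PASSWORD instead):

$ mysql -uroot -p -e "ALTER USER 'app'@'10.0.0.%' IDENTIFIED BY 'A_Str0ng_Un1que_Passw0rd';"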

Additionally, you can install a password validation plugin and implement a more secure policy:

INSTALL PLUGIN validate_password SONAME 'validate_password.so';

SHOW VARIABLES LIKE 'default_password_lifetime';
SHOW VARIABLES LIKE 'validate_password%';

A good starting point can be:

plugin-load=validate_password.so
validate-password=FORCE_PLUS_PERMANENT
validate_password_length=14
validate_password_mixed_case_count=1
validate_password_number_count=1
validate_password_special_char_count=1
validate_password_policy=MEDIUM

Of course, these settings will depend on your business needs.

Remote access monitoring

Avoiding the use of wildcards within hostnames helps control the specific locations from which a given user may connect to and interact with the database.

You should make sure that every user can connect to MySQL only from specific hosts. You can always define several entries for the same user; this should help reduce the need for wildcards.

Execute the following SQL statement to assess this recommendation (make sure no rows are returned):

SELECT user, host FROM mysql.user WHERE host = '%';
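If the query does return wildcard hosts, one way to tighten them without recreating the grants is RENAME USER, sketched below ('app' and the 10.0.0.% subnet are hypothetical):

$ mysql -uroot -p -e "RENAME USER 'app'@'%' TO 'app'@'10.0.0.%';"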

Test database

The default MySQL installation comes with an unused database called test, and the test database is available to every user, especially to anonymous users. Such users can create tables and write to them. This can potentially become a problem on its own, and the writes would add some overhead and reduce database performance. It is recommended that the test database is dropped. To determine if the test database is present, run:

SHOW DATABASES LIKE 'test';

If you notice that the test database is present, it could be that the mysql_secure_installation script, which drops the test database (and performs other security-related activities), was not executed.
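A quick sketch of the cleanup, either by dropping the database directly or by re-running the interactive hardening script:

$ mysql -uroot -p -e "DROP DATABASE IF EXISTS test;"
$ mysql_secure_installation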

LOAD DATA INFILE

If both the server and the client have the ability to run LOAD DATA LOCAL INFILE, a client will be able to load data from a local file to a remote MySQL server. The local_infile parameter dictates whether files located on the MySQL client's computer can be loaded via LOAD DATA LOCAL INFILE.

This can potentially be abused to read files the client has access to - for example, on an application server, one could access any data that the HTTP server has access to. To avoid this, you need to set local-infile=0 in my.cnf.

Execute the following SQL statement and ensure the Value field is set to OFF:

SHOW VARIABLES WHERE Variable_name = 'local_infile';
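Since local_infile is a dynamic variable, you can also switch it off immediately and then persist the change under the [mysqld] section of my.cnf so it survives restarts (a sketch):

$ mysql -uroot -p -e "SET GLOBAL local_infile=0;"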

Monitor for non-encrypted tablespaces

Starting from MySQL 5.7.11, InnoDB supports data encryption for tables stored in file-per-table tablespaces. This feature provides at-rest encryption for physical tablespace data files. To examine if your tables have been encrypted run:

mysql> SELECT TABLE_SCHEMA, TABLE_NAME, CREATE_OPTIONS FROM INFORMATION_SCHEMA.TABLES
       WHERE CREATE_OPTIONS LIKE '%ENCRYPTION="Y"%';

+--------------+------------+----------------+
| TABLE_SCHEMA | TABLE_NAME | CREATE_OPTIONS |
+--------------+------------+----------------+
| test         | t1         | ENCRYPTION="Y" |
+--------------+------------+----------------+
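As a sketch, encrypting an existing file-per-table table looks like the following; it assumes a keyring plugin (such as keyring_file) is already configured on the server, and test.t1 matches the example output above:

$ mysql -uroot -p -e "ALTER TABLE test.t1 ENCRYPTION='Y';"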

As part of the encryption effort, you should also consider encrypting the binary logs, as the MySQL server writes plenty of information to them.

Encryption connection validation

In some setups, the database does not need to be accessible through the network at all, if every connection is managed locally through the Unix socket. In such cases, you can add the skip-networking variable in my.cnf. skip-networking prevents MySQL from accepting any TCP/IP connections, so only the Unix socket would be usable on Linux.

However, this is a rather rare situation, as it is common to access MySQL over the network. You then need to verify that your connections are encrypted. MySQL supports SSL as a means of encrypting traffic both between MySQL servers (replication) and between MySQL servers and clients. If you use Galera Cluster, similar features are available - both intra-cluster communication and connections with clients can be encrypted using SSL. To check if you use SSL encryption, run the following queries:

SHOW variables WHERE variable_name = 'have_ssl'; 
select ssl_verify_server_cert from mysql.slave_master_info;
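You can also check the cipher of your current client connection and enforce encryption per account, as sketched below ('app'@'10.0.0.%' is hypothetical, and older servers use GRANT ... REQUIRE SSL instead of ALTER USER):

$ mysql -uroot -p -e "SHOW SESSION STATUS LIKE 'Ssl_cipher';"
$ mysql -uroot -p -e "ALTER USER 'app'@'10.0.0.%' REQUIRE SSL;"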

That’s it for now. This is not a complete list, so do let us know if there are any other checks that you are performing today on your production databases.

How ClusterControl Monitors your Database Servers and Clusters Agentlessly

ClusterControl’s agentless approach allows sysadmins and DBAs to monitor their databases without having to install agent software on each monitored system. Monitoring is implemented using a remote data collector that uses the SSH protocol.

But first, let’s clarify the scope and meaning of monitoring within our context here. Monitoring comes after data trending - the metrics collection and storing process - which allows the monitoring system to process the collected data to produce justification for tuning, alerting, as well as displaying trending data for reporting.

Generally, ClusterControl performs its monitoring, alerting and trending duties by using the following three ways:

  • SSH - Host metrics collection (process, load balancers stats, resource usage and consumption etc) using SSH library.
  • Database client - Database metrics collection (status, queries, variables, usage etc) using the respective database client library.
  • Advisor - Mini programs written using ClusterControl DSL and running within ClusterControl itself, for monitoring, tuning and alerting purposes.

Some descriptions of the above: SSH stands for Secure Shell, a secure network protocol used by most Linux-based servers for remote administration. The ClusterControl Controller, or CMON, is the backend service performing automation, management, monitoring and scheduling tasks, and it is written in C++.

ClusterControl DSL (Domain Specific Language) allows you to extend the functionality of your ClusterControl platform by creating Advisors, Auto Tuners, or "Mini Programs". The DSL syntax is based on JavaScript, with extensions to provide access to ClusterControl internal data structures and functions. The DSL allows you to execute SQL statements, run shell commands/programs across all your cluster hosts, and retrieve results to be processed for advisors/alerts or any other actions.

Monitoring Tools

All of the prerequisite tools will be met by the installer script or will be automatically installed by ClusterControl during the database deployment stage, or if the required file/binary/package does not exist on the target server before executing a job. Generally speaking, ClusterControl monitoring duty only requires OpenSSH server package on the monitored hosts. ClusterControl uses libssh client library to collect host metrics for the monitored hosts - CPU, memory, disk, network, IO, process, etc. OpenSSH client package is required on the ClusterControl host only for setting up passwordless SSH and debugging purposes. Other SSH implementations like Dropbear and TinySSH are not supported.

When gathering the database stats and metrics, ClusterControl Controller (CMON) connects to the database server directly via database client libraries - libmysqlclient (MySQL/MariaDB and ProxySQL), libpq (PostgreSQL) and libmongocxx (MongoDB). That is why it's crucial to set up proper privileges for the ClusterControl server from the database servers' perspective. For MySQL-based clusters, ClusterControl requires the database user "cmon", while for other databases, any username can be used for monitoring, as long as it is granted super-user privileges. Most of the time, ClusterControl will set up the required privileges (or use the specified database user) automatically during the cluster import or cluster deployment stage.

For load balancers, ClusterControl requires the following tools:

  • Maxadmin on the MariaDB MaxScale server.
  • netcat and/or socat on the HAProxy server to connect to HAProxy socket file and retrieve the monitoring data.
  • ProxySQL requires mysql client on the ProxySQL server.

The following diagram illustrates both host and database monitoring processes executed by ClusterControl using libssh and database client libraries:

Although monitoring threads do not need database client packages to be installed on the monitored host, it's highly recommended to have them for management purposes. For example, MySQL client package comes with mysql, mysqldump, mysqlbinlog and mysqladmin programs which will be used by ClusterControl when performing backups and point-in-time recovery.

Monitoring Methods

For host and load balancer stats collection, ClusterControl performs this task via SSH with super-user privileges. Therefore, passwordless SSH with super-user privileges is vital, as it allows ClusterControl to run the necessary commands remotely with proper escalation. This pull approach has a couple of advantages compared to other mechanisms:

  • Agentless - There is no need for agent to be installed, configured and maintained.
  • Unifying the management and monitoring configuration - SSH can be used to pull monitoring metrics or push management jobs on the target nodes.
  • Simplify the deployment - The only requirement is proper passwordless SSH setup and that's it. SSH is also very secure and encrypted.
  • Centralized setup - One ClusterControl server can manage multiple servers and clusters, provided it has sufficient resources.

However, there are also drawbacks with the pull mechanism:

  • The monitoring data is accurate only from ClusterControl perspective. For example, if there is a network glitch and ClusterControl loses communication to the monitored host, the sampling will be skipped until the next available cycle.
  • For high-granularity monitoring, there will be network overhead due to the increased sampling rate, as ClusterControl needs to establish more connections to every target host.
  • ClusterControl will keep on attempting to re-establish connection to the target node, because it has no agent to do this on its behalf.
  • Redundant data sampling if you have more than one ClusterControl server monitoring a cluster, since each ClusterControl server has to pull the monitoring data for itself.

For MySQL query monitoring, ClusterControl monitors the queries in two different ways:

  1. Queries are retrieved from PERFORMANCE_SCHEMA, by querying the schema on the database node via SSH.
  2. If PERFORMANCE_SCHEMA is disabled or unavailable, ClusterControl will parse the content of the Slow Query Log via SSH.

If Performance Schema is enabled, ClusterControl will use it to look for the slow queries. Otherwise, ClusterControl will parse the content of the MySQL slow query log (via slow_query_log=ON dynamic variable) based on the following flow:

  1. Start slow log (during MySQL runtime).
  2. Run it for a short period of time (a second or couple of seconds).
  3. Stop log.
  4. Parse log.
  5. Truncate log (new log file).
  6. Go to 1.

The collected queries are hashed, calculated and digested (normalized, averaged, counted, sorted) and then stored in the ClusterControl CMON database. Take note that with this sampling method, there is a slight chance some queries will not be captured, especially during the “stop log, parse log, truncate log” steps. You can enable Performance Schema if missing some queries is not acceptable.
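Purely as an illustration of that cycle (ClusterControl does all of this internally), a manual equivalent in the shell might look like the sketch below; the file path and the 2-second window are arbitrary:

$ mysql -uroot -p -e "SET GLOBAL slow_query_log_file='/var/log/mysql/slow-sample.log'; SET GLOBAL slow_query_log=ON;"
$ sleep 2
$ mysql -uroot -p -e "SET GLOBAL slow_query_log=OFF;"
$ grep -c '^# Time:' /var/log/mysql/slow-sample.log
$ : > /var/log/mysql/slow-sample.log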

If you are using the Slow Query log, only queries that exceed the Long Query Time will be listed here. If the data is not populated correctly and you believe that there should be something in there, it could be:

  • ClusterControl did not collect enough queries to summarize and populate data. Try to lower the Long Query Time.
  • You have configured Slow Query Log options in the my.cnf of the MySQL server, and Override Local Query is turned off. If you really want to use the value defined inside my.cnf, you probably have to lower the long_query_time value so that ClusterControl can calculate a more accurate result.
  • You have another ClusterControl node pulling the Slow Query log as well (in case you have a standby ClusterControl server). Only allow one ClusterControl server to do this job.

For more details (including how to enable the PERFORMANCE_SCHEMA), see this blog post, How to use the ClusterControl Query Monitor for MySQL, MariaDB and Percona Server.

For PostgreSQL query monitoring, ClusterControl requires pg_stat_statements module, to track execution statistics of all SQL statements. It populates the pg_stat_statements views and functions when displaying the queries in the UI (under Query Monitor tab).

Intervals and Timeouts

ClusterControl Controller (cmon) is a multi-threaded process. By default, the ClusterControl Controller sampling thread connects to each monitored host once and maintains a persistent connection until the host drops it or disconnects when sampling host stats. It may establish more connections depending on the jobs assigned to the host, since most of the management jobs run in their own threads. For example, cluster recovery runs on the recovery thread, Advisor execution runs on a cron thread, and process monitoring runs on the process collector thread.

The ClusterControl monitoring thread performs the following sampling operations at the following intervals:

  • MySQL query/status metrics: every second
  • Process collection (/proc): every 10 seconds
  • Server detection: every 10 seconds
  • Host metrics (/proc, /sys): every 30 seconds (configurable via host_stats_collection_interval)
  • Database metrics (PostgreSQL and MongoDB only): every 30 seconds (configurable via db_stats_collection_interval)
  • Database schema metrics: every 3 hours (configurable via db_schema_stats_collection_interval)
  • Load balancer metrics: every 15 seconds (configurable via lb_stats_collection_interval)

The imperative scripts (Advisors) can make use of SSH and database client libraries that come with CMON with the following restrictions:

  • 5 seconds of hard limit for SSH execution,
  • 10 seconds of default limit for database connection, configurable via net_read_timeout, net_write_timeout, connect_timeout in CMON configuration file,
  • 60 seconds of total script execution time limit before CMON ungracefully aborts it.

Advisors can be created, compiled, tested and scheduled directly from ClusterControl UI, under Manage -> Developer Studio. The following screenshot shows an example of an Advisor to extract top 10 queries from PERFORMANCE_SCHEMA:

The execution of an advisor depends on whether it is activated and on its scheduling time, defined in cron format:

The results of the execution are displayed under Performance -> Advisors, as shown in the following screenshot:

For more information on which Advisors are provided by default, check out our Developer Studio product page.

Short-interval monitoring data like MySQL queries and status are stored directly in the CMON database, while long-interval monitoring data like weekly/monthly/yearly data points are aggregated every 60 seconds and kept in memory for 10 minutes. These behaviours are not configurable due to the architecture design.

Parameters

There is a plethora of parameters you can configure for ClusterControl to suit your monitoring and alerting policy. Most of them are configurable through ClusterControl UI -> pick a cluster -> Settings. The "Settings" tab provides many options to configure alerts, thresholds, notifications, graph layouts, database counters, query monitoring and so on. For example, warning and critical thresholds can be configured as follows:

There are also "Runtime Configuration" page, a summarized list of the active ClusterControl Controller (CMON) runtime configuration parameters:

There are more than 170 ClusterControl Controller configuration options in total, and some of the advanced settings can be configured to fine-tune your monitoring and alerting policy. To list some of them:

  • monitor_cpu_temperature
  • swap_warning
  • swap_critical
  • redobuffer_warning
  • redobuffer_critical
  • indexmemory_warning
  • indexmemory_critical
  • datamemory_warning
  • datamemory_critical
  • tablespace_warning
  • tablespace_critical
  • redolog_warning
  • redolog_critical
  • max_replication_lag
  • long_query_time
  • log_queries_not_using_indexes
  • query_monitor_use_local_settings
  • enable_query_monitor
  • enable_query_monitor_auto_purge_ps

The parameters listed in the "Runtime Configuration" page can be changed either by using the UI or in the CMON configuration file located at /etc/cmon.d/cmon_X.cnf, where X is the cluster ID. You can list all of the supported configuration options for CMON by using the following command:

$ cmon --help-config

The same output is also available in the documentation page, ClusterControl Controller Configuration Options.
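As a sketch, changing a few of the thresholds listed above for a hypothetical cluster ID 24 could look like this (the values are illustrative, and cmon needs a restart to pick up changes made directly in the file):

$ vim /etc/cmon.d/cmon_24.cnf
max_replication_lag=30
swap_warning=20
swap_critical=40

$ systemctl restart cmon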

Final Thoughts

We hope this blog has given you a good understanding of how ClusterControl monitors your database servers and clusters agentlessly. We’ll be shortly announcing some significant new features in the next version of ClusterControl so stay tuned!

How to Deploy a Production-Ready MySQL or MariaDB Galera Cluster using ClusterControl

Deploying a database cluster is not rocket science - there are many how-tos on how to do that. But how do you know whether what you just deployed is production-ready? Manual deployments can also be tedious and repetitive. Depending on the number of nodes in the cluster, the deployment steps may be time-consuming and error-prone. Configuration management tools like Puppet, Chef and Ansible are popular for deploying infrastructure, but for stateful database clusters, you need to perform significant scripting to handle deployment of the whole database HA stack. Moreover, the chosen template/module/cookbook/role has to be meticulously tested before you can trust it as part of your infrastructure automation. Version changes require the scripts to be updated and tested again.

The good news is that ClusterControl automates deployments of the entire stack - and for free as well! We’ve deployed thousands of production clusters, and take a number of precautions to ensure they are production-ready. Different topologies are supported, from master-slave replication to Galera, NDB and InnoDB Cluster, with different database proxies on top.

A high availability stack, deployed through ClusterControl, consists of three layers:

  • Database layer (e.g., Galera Cluster)
  • Reverse proxy layer (e.g., HAProxy or ProxySQL)
  • Keepalived layer, which, with use of Virtual IP, ensures high availability of the proxy layer

In this blog, we are going to show you how to deploy a production-grade Galera Cluster complete with load balancers for high availability setup. The complete setup consists of 6 hosts:

  • 1 host - ClusterControl (deployment, monitoring, management server)
  • 3 hosts - MySQL Galera Cluster
  • 2 hosts - Reverse proxies acting as load balancers in front of the cluster.

The following diagram illustrates our end result once deployment is complete:

Prerequisites

ClusterControl must reside on an independent node which is not part of the cluster. Download ClusterControl, and the page will generate a license unique to you and show the steps to install ClusterControl:

$ wget -O install-cc https://severalnines.com/scripts/install-cc
$ chmod +x install-cc
$ ./install-cc # as root or sudo user

Follow the instructions, where you will be guided through setting up the MySQL server, the MySQL root password on the ClusterControl node, the cmon password for ClusterControl usage and so on. You should get the following lines once the installation has completed:

Determining network interfaces. This may take a couple of minutes. Do NOT press any key.
Public/external IP => http://{public_IP}/clustercontrol
Installation successful. If you want to uninstall ClusterControl then run install-cc --uninstall.

Then, on the ClusterControl server, generate an SSH key which we will use to set up passwordless SSH later on. You can use any user in the system, but it must have the ability to perform super-user operations (sudoer). In this example, we picked the root user:

$ whoami
root
$ ssh-keygen -t rsa

Set up passwordless SSH to all nodes that you would like to monitor/manage via ClusterControl. In this case, we will set this up on all nodes in the stack (including the ClusterControl node itself). On the ClusterControl node, run the following commands and specify the root password when prompted:

$ ssh-copy-id root@192.168.55.160 # clustercontrol
$ ssh-copy-id root@192.168.55.161 # galera1
$ ssh-copy-id root@192.168.55.162 # galera2
$ ssh-copy-id root@192.168.55.163 # galera3
$ ssh-copy-id root@192.168.55.181 # proxy1
$ ssh-copy-id root@192.168.55.182 # proxy2

You can then verify if it's working by running the following command on ClusterControl node:

$ ssh root@192.168.55.161 "ls /root"

Make sure you are able to see the result of the command above without being prompted for a password.
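If you want to verify all hosts in one go, a simple loop over the example IP addresses above works as well; with BatchMode enabled, ssh fails instead of prompting, so every hostname printed confirms a working passwordless setup:

$ for ip in 192.168.55.160 192.168.55.161 192.168.55.162 192.168.55.163 192.168.55.181 192.168.55.182; do ssh -o BatchMode=yes root@${ip} "hostname"; done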

Deploying the Cluster

ClusterControl supports all vendors for Galera Cluster (Codership, Percona and MariaDB). There are some minor differences which may influence your decision for choosing the vendor. If you would like to learn about the differences between them, check out our previous blog post - Galera Cluster Comparison - Codership vs Percona vs MariaDB.

For production deployment, a three-node Galera Cluster is the minimum you should have. You can always scale it out later once the cluster is deployed, manually or via ClusterControl. We’ll open our ClusterControl UI at https://192.168.55.160/clustercontrol and create the first admin user. Then, go to the top menu and click Deploy -> MySQL Galera and you will be presented with the following dialog:

There are two steps, the first one being "General & SSH Settings". Here we need to configure the SSH user that ClusterControl should use to connect to the database nodes, together with the path to the SSH key (as generated under the Prerequisites section) as well as the SSH port of the database nodes. ClusterControl presumes all database nodes are configured with the same SSH user, key and port. Next, give the cluster a name, in this case "MySQL Galera Cluster 5.7". This value can be changed later on. Then select the options to instruct ClusterControl to install the required software, disable the firewall and also disable the security enhancement module on the particular Linux distribution. We recommend toggling all of these on to maximize the chance of a successful deployment.

Click Continue and you will be presented with the following dialog:

In the next step, we need to configure the database servers - vendor, version, datadir, port, etc - which are pretty self-explanatory. "Configuration Template" is the template filename under /usr/share/cmon/templates on the ClusterControl node. "Repository" is how ClusterControl should configure the repository on the database node. By default, it will use the vendor repository and install the latest version provided by that repository. However, in some cases, the user might have a pre-existing repository mirrored from the original one due to security policy restrictions. Nevertheless, ClusterControl supports most of them, as described in the user guide, under Repository.

Lastly, add the IP address or hostname (must be a valid FQDN) of the database nodes. You will see a green tick icon on the left of the node, indicating ClusterControl was able to connect to the node via passwordless SSH. You are now good to go. Click Deploy to start the deployment. This may take 15 to 20 minutes to complete. You can monitor the deployment progress under Activity (top menu) -> Jobs -> Create Cluster:

Once the deployment is completed, our architecture can be illustrated as below:

Deploying the Load Balancers

In Galera Cluster, all nodes are equal - each node holds the same role and the same dataset. Therefore, there is no failover within the cluster if a node fails. Only the application side requires failover, to skip the inoperative nodes while the cluster is partitioned. Therefore, it's highly recommended to place load balancers on top of a Galera Cluster to:

  • Unify the multiple database endpoints to a single endpoint (load balancer host or virtual IP address as the endpoint).
  • Balance the database connections between the backend database servers.
  • Perform health checks and only forward the database connections to healthy nodes.
  • Redirect/rewrite/block offending (badly written) queries before they hit the database servers.

There are three main choices of reverse proxies for Galera Cluster - HAProxy, MariaDB MaxScale or ProxySQL - and all of them can be installed and configured automatically by ClusterControl. In this deployment, we picked ProxySQL because it ticks all the boxes above and, on top of that, it understands the MySQL protocol of the backend servers.

In this architecture, we want to use two ProxySQL servers to eliminate any single point of failure (SPOF) in the database tier, tied together using a floating virtual IP address. We’ll explain this in the next section. One node will act as the active proxy and the other one as hot-standby. Whichever node holds the virtual IP address at a given time is the active node.

To deploy the first ProxySQL server, simply go to the cluster action menu (right-side of the summary bar) and click on Add Load Balancer -> ProxySQL -> Deploy ProxySQL and you will see the following:

Again, most of the fields are self-explanatory. In the "Database User" section, ProxySQL acts as a gateway through which your application connects to the database. The application authenticates against ProxySQL, therefore you have to add all of the users from all the backend MySQL nodes, along with their passwords, into ProxySQL. From ClusterControl, you can either create a new user to be used by the application - you decide on its name, password, which databases it is granted access to and what MySQL privileges it will have. Such a user will be created on both the MySQL and the ProxySQL side. The second option, more suitable for existing infrastructures, is to use existing database users. You need to pass the username and password, and such a user will be created only on ProxySQL.

The last section, "Implicit Transaction", ClusterControl will configure ProxySQL to send all of the traffic to the master if we started transaction with SET autocommit=0. Otherwise, if you use BEGIN or START TRANSACTION to create a transaction, ClusterControl will configure read/write split in the query rules. This is to ensure ProxySQL will handle transactions correctly. If you have no idea how your application does this, you can pick the latter.

Repeat the same configuration for the second ProxySQL node, except the "Server Address" value which is 192.168.55.182. Once done, both nodes will be listed under "Nodes" tab -> ProxySQL where you can monitor and manage them directly from the UI:

At this point, our architecture is now looking like this:

If you would like to learn more about ProxySQL, do check out this tutorial - Database Load Balancing for MySQL and MariaDB with ProxySQL - Tutorial.

Deploying the Virtual IP Address

The final part is the virtual IP address. Without it, our load balancers (reverse proxies) would be the weak link, as they would be a single point of failure - unless the application has the ability to automatically redirect failed database connections to another load balancer. Nevertheless, it's good practice to unify them both under a virtual IP address and simplify the connection endpoint to the database layer.

From ClusterControl UI -> Add Load Balancer -> Keepalived -> Deploy Keepalived and select the two ProxySQL hosts that we have deployed:

Also, specify the virtual IP address and the network interface to bind the IP address. The network interface must exist on both ProxySQL nodes. Once deployed, you should see the following green checks in the summary bar of the cluster:

At this point, our architecture can be illustrated as below:

Our database cluster is now ready for production usage. You can import your existing database into it or create a fresh new database. You can use the Schemas and Users Management feature if the trial license hasn't expired.

To understand how ClusterControl configures Keepalived, check out this blog post, How ClusterControl Configures Virtual IP and What to Expect During Failover.
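If you want to quickly confirm which ProxySQL node currently holds the virtual IP address (192.168.55.180 in our example), a simple check on both proxies will show it - only the active node should list the address. This is just a manual sanity check on top of what the UI already reports:

$ ssh root@192.168.55.181 "ip addr | grep 192.168.55.180"
$ ssh root@192.168.55.182 "ip addr | grep 192.168.55.180"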

Connecting to the Database Cluster

From the application and client standpoint, they need to connect to 192.168.55.180 on port 6033, which is the virtual IP address floating on top of the load balancers. For example, the WordPress database configuration will look something like this:

/** The name of the database for WordPress */
define( 'DB_NAME', 'wp_myblog' );

/** MySQL database username */
define( 'DB_USER', 'wp_myblog' );

/** MySQL database password */
define( 'DB_PASSWORD', 'mysecr3t' );

/** MySQL hostname - virtual IP address with ProxySQL load-balanced port*/
define( 'DB_HOST', '192.168.55.180:6033' );

If you would like to access the database cluster directly, bypassing the load balancer, you can just connect to port 3306 of the database hosts. This is usually required by the DBA staff for administration, management, and troubleshooting. With ClusterControl, most of these operations can be performed directly from the user interface.
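For a quick connectivity test from a client host, something along these lines should work, reusing the WordPress credentials from the example above; the variable queried is just there to prove which server answered:

$ # through the virtual IP address and the ProxySQL port
$ mysql -u wp_myblog -p -h 192.168.55.180 -P 6033 -e "SELECT @@hostname"
$ # directly to a Galera node, bypassing the load balancers
$ mysql -u wp_myblog -p -h 192.168.55.161 -P 3306 -e "SELECT @@hostname"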

Final Thoughts

As shown above, deploying a database cluster is no longer a difficult task. Once deployed, there is a full suite of free monitoring features, as well as commercial features for backup management, failover/recovery and more. Fast deployment of different types of cluster/replication topologies can be useful when evaluating high availability database solutions, and how they fit into your particular environment.

Become a ClusterControl DBA: Performance and Health Monitoring


In the previous two blog posts we covered both deploying the four types of clustering/replication (MySQL/Galera, MySQL Replication, MongoDB & PostgreSQL) and managing/monitoring your existing databases and clusters. So, after reading the first two blog posts, you were able to add your 20 existing replication setups to ClusterControl, expand them and additionally deploy two new Galera clusters while doing a ton of other things. Or maybe you deployed MongoDB and/or PostgreSQL systems. So now, how do you keep them healthy?

That’s exactly what this blog post is about: how to leverage ClusterControl performance monitoring and advisors functionality to keep your MySQL, MongoDB and/or PostgreSQL databases and clusters healthy. So how is this done in ClusterControl?

Database Cluster List

The most important information can already be found in the cluster list: as long as there are no alarms and no hosts are shown to be down, everything is functioning fine. An alarm is raised if a certain condition is met, e.g., a host is swapping, and brings the issue you should investigate to your attention. That means alarms are not only raised during an outage, but also allow you to proactively manage your databases.

Suppose you logged into ClusterControl and saw a cluster listing like this - you would definitely have something to investigate: one node is down in the Galera cluster, for example, and every cluster has various alarms:

Once you click on one of the alarms, you will go to a detailed page on all alarms of the cluster. The alarm details will explain the issue and in most cases also advise the action to resolve the issue.

You can set up your own alarms by creating custom expressions, but that has been deprecated in favor of our new Developer Studio, which allows you to write custom JavaScript and execute it as Advisors. We will get back to this topic later in this post.

Cluster Overview - Dashboards

When opening up the cluster overview, we can immediately see the most important performance metrics for the cluster in the tabs. This overview may differ per cluster type as, for instance, Galera has different performance metrics to watch than traditional MySQL, PostgreSQL or MongoDB.

Both the default overview and the pre-selected tabs are customizable. By clicking on Overview -> Dash Settings you are given a dialog that allows you to define the dashboard:

By pressing the plus sign you can add and define your own metrics to graph the dashboard. In our case we will define a new dashboard featuring the Galera specific send and receive queue average:

This new dashboard should give us good insight into the average queue length of our Galera cluster.

Once you have pressed save, the new dashboard will become available for this cluster:

Similarly you can do this for PostgreSQL as well, for example we can monitor the shared blocks hit versus blocks read:

So as you can see, it is relatively easy to customize your own (default) dashboard.

Cluster Overview - Query Monitor

The Query Monitor tab is available for both MySQL and PostgreSQL based setups and consists of three dashboards: Top Queries, Running Queries and Query Outliers.

In the Running Queries dashboard, you will find all queries that are currently running. This is basically the equivalent of the SHOW FULL PROCESSLIST statement in MySQL.

Top Queries and Query Outliers both rely on the input of the slow query log or Performance Schema. Using Performance Schema is always recommended and will be used automatically if enabled. Otherwise, ClusterControl will use the MySQL slow query log to capture the running queries. To prevent ClusterControl from being too intrusive and the slow query log from growing too large, ClusterControl samples the slow query log by turning it on and off. This loop is by default set to capture for 1 second, with long_query_time set to 0.5 seconds. If you wish to change these settings for your cluster, you can do so via Settings -> Query Monitor.
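If you want to check on a database node which source is actually in use, a quick look at the relevant settings is enough - for example (the exact values will differ per setup):

$ mysql -e "SHOW GLOBAL VARIABLES LIKE 'performance_schema'"
$ mysql -e "SHOW GLOBAL VARIABLES WHERE Variable_name IN ('slow_query_log', 'long_query_time', 'log_queries_not_using_indexes')"

And the Running Queries dashboard mentioned earlier corresponds to:

$ mysql -e "SHOW FULL PROCESSLIST"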

Top Queries will, like the name says, show the top queries that were sampled. You can sort them on various columns: for instance the frequency, average execution time, total execution time or standard deviation time:

You can get more details about a query by selecting it; this will present the query execution plan (if available) and optimization hints/advisories. Query Outliers is similar to Top Queries, but additionally allows you to filter the queries per host and compare them over time.


Cluster Overview - Operations

Similar to the PostgreSQL and MySQL systems, MongoDB clusters have an Operations overview, which is comparable to MySQL's Running Queries. This overview is the equivalent of issuing the db.currentOp() command within MongoDB.

Cluster Overview - Performance

MySQL/Galera

The performance tab is probably the best place to find the overall performance and health of your clusters. For MySQL and Galera it consists of an Overview page, the Advisors, status/variables overviews, the Schema Analyzer and the Transaction log.

The Overview page will give you a graph overview of the most important metrics in your cluster. This is, obviously, different per cluster type. Eight metrics have been set by default, but you can easily set your own - up to 20 graphs if needed:

Advisors are one of the key features of ClusterControl: they are scripted checks that can be run on demand. An advisor can evaluate almost any fact known about the host and/or cluster, give its opinion on their health, and even give advice on how to resolve issues or improve your hosts!

The best part is yet to come: you can create your own checks in the Developer Studio (ClusterControl -> Manage -> Developer Studio), run them on a regular interval and use them again in the Advisors section. We blogged about this new feature earlier this year.

We will skip the status/variables overview of MySQL and Galera, as it is useful for reference but not for this blog post: it is enough to know that it is there.

Now suppose your database is growing but you want to know how fast it grew in the past week. You can actually keep track of the growth of both data and index sizes from right within ClusterControl:

Next to the total growth on disk, it can also report the top 25 largest schemas.

Another important feature is the Schema Analyzer within ClusterControl:

ClusterControl will analyze your schemas and look for redundant indexes, MyISAM tables and tables without a primary key. Of course it is entirely up to you to keep a table without a primary key because some application might have created it this way, but at least it is great to get the advice here for free. The Schema Analyzer even recommends the necessary ALTER statement to fix the problem.
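As a purely hypothetical illustration of the kind of fix the Schema Analyzer suggests, adding a primary key to a flagged table and converting a MyISAM table might look like this (the database, table and column names are made up):

$ mysql -e "ALTER TABLE myapp.visits ADD COLUMN id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST"
$ mysql -e "ALTER TABLE myapp.legacy_log ENGINE=InnoDB"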

PostgreSQL

For PostgreSQL the Advisors, DB Status and DB Variables can be found here:

MongoDB

For MongoDB the Mongo Stats and performance overview can be found under the Performance tab. The Mongo Stats is an overview of the output of mongostat and the Performance overview gives a good graphical overview of the MongoDB opcounters:

Final Thoughts

We showed you how to keep your eyeballs on the most important monitoring and health checking features of ClusterControl. Obviously this is only the beginning of the journey, as we will soon start another blog series about the Developer Studio capabilities and how you can make the most of your own checks. Also keep in mind that our support for MongoDB and PostgreSQL is not as extensive as our MySQL toolset, but we are continuously improving on this.

You may ask yourself why we have skipped over the performance monitoring and health checks of HAProxy, ProxySQL and MaxScale. We did that deliberately as the blog series covered only deployments of clusters up till now and not the deployment of HA components. So that’s the subject we'll cover next time.

Become a ClusterControl DBA: Operational Reports for MySQL, MariaDB, PostgreSQL & MongoDB


The majority of DBAs perform health checks on their databases every now and then. Usually, this happens on a daily or weekly basis. We previously discussed why such checks are important and what they should include.

To make sure your systems are in a good shape, you’d need to go through quite a lot of information - host statistics, MySQL statistics, workload statistics, state of backups, database packages, logs and so forth. Such data should be available in every properly monitored environment, although sometimes it is scattered across multiple locations - you may have one tool to monitor MySQL state, another tool to collect system statistics, maybe a set of scripts, e.g., to check the state of your backups. This makes health checks much more time-consuming than they should be - the DBA has to put together the different pieces to understand the state of the system.

Integrated tools like ClusterControl have an advantage that all of the bits are located in the same place (or in the same application). It still does not mean they are located next to each other - they may be located in different sections of the UI and a DBA may have to spend some time clicking through the UI to reach all the interesting data.

The whole idea behind creating Operational Reports is to put all of the most important data into a single document, which can be quickly reviewed to get an understanding of the state of the databases.

Operational Reports are available from the menu Side Menu -> Operational Reports:

Once you go there, you’ll be presented with a list of reports created manually or automatically, based on a predefined schedule:

If you want to create a new report manually, you’ll use the 'Create' option. Pick the type of report, cluster name (for per-cluster report), email recipients (optional - if you want the report to be delivered to you), and you’re pretty much done:

The reports can also be scheduled to be created on a regular basis:

At this time, 5 types of reports are available:

  • Availability report - All clusters.
  • Backup report - All clusters.
  • Schema change report - MySQL/MariaDB-based cluster only.
  • Daily system report - Per cluster.
  • Package upgrade report - Per cluster.

Availability Report

The availability report focuses on, well, availability. It includes three sections. First, the availability summary:

You can see information about availability statistics of your databases, the cluster type, total uptime and downtime, current state of the cluster and when that state last changed.

Another section gives more details on availability for every cluster. The screenshot below only shows one of the database clusters:

We can see when a node switched state and what the transition was. It’s a nice place to check if there were any recent problems with the cluster. Similar data is shown in the third section of this report, where you can go through the history of changes in cluster state.

Backup Report

The second type of report covers the backups of all clusters. It contains two sections - backup summary and backup details, where the former gives you a short summary of when the last backup was created, whether it completed successfully or failed, the backup verification status, success rate and retention period:

ClusterControl also provides example backup policies if it finds any of the monitored database clusters running without any scheduled backup or delayed slave configured. Next are the backup details:

You can also check the list of backups executed on the cluster, with their state, type and size, within the specified interval. This is as close as you can get to being certain that backups work correctly without running a full recovery test. We definitely recommend that such tests are performed every now and then. The good news is that ClusterControl supports MySQL-based restore and verification on a standalone host under Backup -> Restore Backup.

Daily System Report

This type of report contains detailed information about a particular cluster. It starts with a summary of different alerts which are related to the cluster:

The next section is about the state of the nodes that are part of the cluster:

You have a list of the nodes in the cluster, their type, role (master or slave), status of the node, uptime and the OS.

Another section of the report is the backup summary, the same as we discussed above. The next one presents a summary of the top queries in the cluster:

Finally, we see a “Node status overview” in which you’ll be presented with graphs related to OS and MySQL metrics for each node.

As you can see, we have graphs covering all of the aspects of the load on the host - CPU, memory, network, disk, CPU load and disk free. This is enough to get an idea of whether anything weird happened recently. You can also see some details about the MySQL workload - how many queries were executed, of which type, and how the data was accessed (via which handler). This, in turn, should be enough to catch most of the issues on the MySQL side. What you want to look for are spikes and dips that you haven’t seen in the past. Maybe a new query has been added to the mix and, as a result, Handler_read_rnd_next skyrocketed? Maybe an increase in CPU load together with a high number of connections points to increased load on MySQL, but also to some kind of contention. Any unexpected pattern is worth investigating, so you know what is going on.
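If such a spike shows up in the report, you can cross-check the raw counters directly on the node in question, for example:

$ mysql -e "SHOW GLOBAL STATUS LIKE 'Handler_read%'"
$ mysql -e "SHOW GLOBAL STATUS LIKE 'Threads_connected'"

These are standard MySQL status counters; comparing them between two points in time gives you the same kind of delta the report graphs are built on.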

Package Upgrade Report

This report gives a summary of packages available for upgrade, as reported by the repository manager on the monitored hosts. For accurate reporting, ensure you always use stable and trusted repositories on every host. In some undesirable cases, the monitored hosts could be configured with an outdated repository after an upgrade (e.g., every MariaDB major version uses a different repository), an incomplete internal repository (e.g., partially mirrored from the upstream) or a bleeding-edge repository (commonly for unstable nightly-build packages).

The first section is the upgrade summary:

It summarizes the total number of packages available for upgrade, as well as the related managed services for the cluster, like the load balancer, virtual IP address and arbitrator. Next, ClusterControl provides a detailed package list, grouped by package type for every host:

This report provides the available versions and can greatly help us plan our maintenance windows efficiently. Critical upgrades, like security and database packages, can be prioritized over non-critical upgrades, which can be consolidated into other, lower-priority maintenance windows.
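Since the report is built from the repository manager on each monitored host, you can always cross-check it manually with the distribution's own tooling, for example:

$ yum check-update        # RHEL/CentOS-based hosts
$ apt list --upgradable   # Debian/Ubuntu-based hosts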

Schema Change Report

This report compares changes in table structure of the selected MySQL/MariaDB databases which happened between two generated reports. In older MySQL/MariaDB versions, a DDL operation is non-atomic (pre 8.0) and requires a full table copy (pre 5.6 for most operations), blocking other transactions until it completes. Schema changes can become a huge pain once your tables hold a significant amount of data, and must be carefully planned, especially in a clustered setup. In a multi-tiered development environment, we have seen many cases where developers silently modify the table structure, resulting in a significant impact on query performance.

In order for ClusterControl to produce an accurate report, special options must be configured inside CMON configuration file for the respective cluster:

  • schema_change_detection_address - Checks will be executed using SHOW TABLES/SHOW CREATE TABLE to determine if the schema has changed. The checks are executed on the address specified, which is of the format HOSTNAME:PORT. The schema_change_detection_databases option must also be set. A differential of a changed table is created (using diff).
  • schema_change_detection_databases - Comma separated list of databases to monitor for schema changes. If empty, no checks are made.

In this example, we would like to monitor schema changes for database "myapp" and "sbtest" on our MariaDB Cluster with cluster ID 27. Pick one of the database nodes as the value of schema_change_detection_address. For MySQL replication, this should be the master host, or any slave host that holds the databases (in case partial replication is active). Then, inside /etc/cmon.d/cmon_27.cnf, add the two following lines:

schema_change_detection_address=10.0.0.30:3306
schema_change_detection_databases=myapp,sbtest

Restart CMON service to load the change:

$ systemctl restart cmon

For the very first report, ClusterControl only returns the result of the metadata collection, similar to below:

With the first report as the baseline, the subsequent reports will return the output that we are expecting:

Take note that only new or changed tables are printed in the report. The first report is only for metadata collection, used for comparison in the subsequent rounds, thus we have to run it at least twice to see a difference.

With this report, you can now gather the database structure footprints and understand how your database has evolved over time.

Final Thoughts

Operational Reports are a comprehensive way to understand the state of your database infrastructure. They are built for both operational and managerial staff, and can be very useful in analyzing your database operations. The reports can be generated in-place or delivered to you via email, which makes things convenient if you have a reporting silo.

We’d love to hear your feedback on anything else you’d like to have included in the report, what’s missing and what is not needed.
