diff --git a/Documentation/Monitors/MariaDB-Monitor.md b/Documentation/Monitors/MariaDB-Monitor.md
index 8102dcbca..1847375c4 100644
--- a/Documentation/Monitors/MariaDB-Monitor.md
+++ b/Documentation/Monitors/MariaDB-Monitor.md
@@ -299,37 +299,34 @@ following:

 1. Select the most up-to-date slave of the old master to be the new master. The
 selection criteria is as follows in descending priority:
- 1. gtid_IO_pos (latest event in relay log)
- 2. gtid_current_pos (most processed events)
- 3. log_slave_updates is on
- 4. disk space is not low
+ 1. gtid_IO_pos (latest event in relay log)
+ 2. gtid_current_pos (most processed events)
+ 3. log_slave_updates is on
+ 4. disk space is not low
 2. If the new master has unprocessed relay log items, cancel and try again
 later.
 3. Prepare the new master:
- 1. Remove the slave connection the new master used to replicate from the old
+ 1. Remove the slave connection the new master used to replicate from the old
 master.
- 2. Disable the *read\_only*-flag.
- 3. Enable scheduled server events (if event handling is on).
- 4. Run the commands in `promotion_sql_file`.
- 5. Start replication from external master is one existed.
+ 2. Disable the *read\_only*-flag.
+ 3. Enable scheduled server events (if event handling is on).
+ 4. Run the commands in `promotion_sql_file`.
+ 5. Start replication from external master if one existed.
 4. Redirect all other slaves to replicate from the new master:
- 1. STOP SLAVE and RESET SLAVE
- 2. CHANGE MASTER TO
- 3. START SLAVE
+ 1. STOP SLAVE and RESET SLAVE
+ 2. CHANGE MASTER TO
+ 3. START SLAVE
 5. Check that all slaves are replicating.

-Failover may lose events if no slave managed to replicate the events before the
-master went down.
-
 **Switchover** swaps a running master with a running slave. It does the
 following:

 1. Prepare the old master for demotion:
- 1. Stop any external replication.
- 2. Enable the *read\_only*-flag to stop writes.
- 3. Disable scheduled server events (if event handling is on).
- 4. Run the commands in `demotion_sql_file`.
- 5. Flush the binary log (FLUSH LOGS) so that all events are on disk.
+ 1. Stop any external replication.
+ 2. Enable the *read\_only*-flag to stop writes.
+ 3. Disable scheduled server events (if event handling is on).
+ 4. Run the commands in `demotion_sql_file`.
+ 5. Flush the binary log (FLUSH LOGS) so that all events are on disk.
 2. Wait for the new master to catch up with the old master.
 3. Promote new master and redirect slaves as in failover steps 3 and 4. Also
 redirect the demoted old master.
@@ -353,15 +350,15 @@ cluster are out of sync while the actual data is known to be in sync.
 The operation proceeds as follows:

 1. Reset gtid:s and delete binary logs on all servers:
- 1. Stop (STOP SLAVE) and delete (RESET SLAVE ALL) all slave connections.
- 2. Enable the *read\_only*-flag.
- 3. Disable scheduled server events (if event handling is on).
- 3. Delete binary logs (RESET MASTER).
- 4. Set the sequence number of *gtid\_slave\_pos* to zero. This also affects
+ 1. Stop (STOP SLAVE) and delete (RESET SLAVE ALL) all slave connections.
+ 2. Enable the *read\_only*-flag.
+ 3. Disable scheduled server events (if event handling is on).
+ 3. Delete binary logs (RESET MASTER).
+ 4. Set the sequence number of *gtid\_slave\_pos* to zero. This also affects
 *gtid\_current\_pos*.
 2. Prepare new master:
- 1. Disable the *read\_only*-flag.
- 2. Enable scheduled server events (if event handling is on).
+ 1. Disable the *read\_only*-flag.
+ 2. Enable scheduled server events (if event handling is on).
 3. Direct other servers to replicate from the new master as in the other
 operations.

@@ -492,6 +489,11 @@ The backends must all use GTID-based replication, and the domain id should not
 change during a switchover or failover. Master and slaves must have
 well-behaving GTIDs with no extra events on slave servers.

+Failover cannot be performed if MaxScale was started only after the master
+server went down. This is because MaxScale needs reliable information on the
+gtid domain of the cluster and the replication topology in general to properly
+select the new master.
+
 Failover may lose events. If a master goes down before sending new events to at
 least one slave, those events are lost when a new master is chosen. If the old
 master comes back online, the other servers have likely moved on with a
@@ -614,18 +616,15 @@ encrypted with the same key to avoid erroneous decryption.

 #### `failover_timeout` and `switchover_timeout`

-Time limit for the cluster failover and switchover in seconds. The default values
-are 90 seconds.
+Time limit for failover and switchover operations, in seconds. The default
+values are 90 seconds for both. `switchover_timeout` is also used as the time
+limit for a rejoin operation. Rejoin should rarely time out, since it is a
+faster operation than switchover.

 If no successful failover/switchover takes place within the configured time
 period, a message is logged and automatic failover is disabled. This prevents
 further automatic modifications to the misbehaving cluster.

-`failover_timeout` also controls how long a MaxScale instance that has
-transitioned from passive to active will wait for a failover to take place after
-an apparent loss of a master server. If no new master server is detected within
-the configured time period, failover will be initiated again.
-
 #### `verify_master_failure` and `master_failure_timeout`

 Enable additional master failure verification for automatic failover.