Add documentation for switchover, failover and rejoin

This commit is contained in:
Esa Korhonen 2017-12-08 12:26:07 +02:00
parent 45ba9b3730
commit 06e16954c4

View File

@ -184,6 +184,10 @@ the master.
The formula for calculating the actual number of milliseconds before the server
is labelled as the master is `monitor_interval * failcount`.
If automatic failover is enabled (`auto_failover=true`), this setting also
controls how many times the master server must fail to respond before failover
begins.
### `allow_cluster_recovery`
Allow recovery after the cluster has dropped down to one server. This feature
@ -214,60 +218,163 @@ assigned the _Slave_ status which allows them to be used like normal slave
servers. When the option is disabled, the servers will only receive the _Slave
of External Server_ status and they will not be used.
### `failover`
## Failover, switchover and auto-rejoin
Starting with MaxScale 2.2.1, MySQL Monitor supports replication cluster
modification. The operations implemented are: _failover_ (replacing a failed
master), _switchover_ (swapping a slave with a running master) and _rejoin_
(joining a standalone server to the cluster). The features and the parameters
controlling them are presented in this section.
Both failover and switchover can be activated manually through MaxAdmin.
Failover selects the new master server automatically, switchover requires the
user to designate the new master as well as the current master. Example commands
are below:
```
call command mysqlmon failover MySQL-Monitor
call command mysqlmon switchover MySQL-Monitor SlaveServ3 MasterServ
```
Failover can also activate automatically, if `auto_failover` is on. The
activation begins when the master has been down for a number of monitor
iterations defined in `failcount`.
When `auto-rejoin` is active, the monitor will try to rejoin standalone servers
and slaves replicating from the wrong master (any server not the cluster
master). These servers are redirected to replicate from the correct master
server, forcing the replication topology to a 1-master-N-slaves configuration.
All of the three features require that the monitor user (`user`) has the SUPER
privilege. In addition, the monitor needs to know which username and password a
slave should use when starting replication. These are given in
`replication_user` and `replication_password`.
### Limitations
Switchover and failover only understand simple topologies. They will not work if
the cluster has multiple masters, relay masters, or if the topology is circular.
The server cluster is assumed to be well-behaving with no significant
replication lag and all commands that modify the cluster complete in a few
seconds (faster than `backend_read_timeout` and `backend_write_timeout`).
The backends must all use GTID-based replication, and the domain id should not
change during a switchover or failover. Master and slaves must have
well-behaving GTIDs: no extra events on slave servers.
### Configuration parameters
#### `auto_failover`
Enable automated master failover. This parameter expects a boolean value and the
default value is false.
When the failover functionality is enabled, traditional MariaDB Master-Slave
clusters will automatically elect a new master if the old master goes down. The
failover functionality will not take place when MaxScale is configured as a
passive instance. For details on how MaxScale behaves in passive mode, see the
following documentation of `failover_timeout`.
When automatic failover is enabled, traditional MariaDB Master-Slave clusters
will automatically elect a new master if the old master goes down and stays down
a number of iterations given in `failcount`. Failover will not take place when
MaxScale is configured as a passive instance. For details on how MaxScale
behaves in passive mode, see the documentation on `failover_timeout` below.
If an attempt at failover fails or multiple master servers are detected, an
error is logged and the failover functionality is disabled. If this happens, the
cluster must be fixed manually and the failover needs to be re-enabled via the
REST API or MaxAdmin.
error is logged and automatic failover is disabled. If this happens, the cluster
must be fixed manually and the failover needs to be re-enabled via the REST API
or MaxAdmin.
**Note:** The monitor user must have the SUPER privilege if the failover feature
is enabled.
The monitor user must have the SUPER privilege for failover to work.
### `failover_timeout`
#### `auto_rejoin`
The timeout for the cluster failover in seconds. The default value is 90
Enable automatic joining of server to the cluster. This parameter expects a
boolean value and the default value is false.
When enabled, the monitor will attempt to direct standalone servers and servers replicating from a relay master to the main cluster master server, enforcing a 1-master-N-slaves configuration.
For example, consider the following event series.
1. Slave A goes down
2. Master goes down and a failover is performed, promoting Slave B
3. Slave A comes back
Slave A is still trying to replicate from the downed master, since it wasn't online during failover. If `auto_rejoin` is on, Slave A will quickly be redirected to Slave B, the current master.
#### `replication_user` and `replication_password`
The username and password of the replication user. These are given as the values
for `MASTER_USER` and `MASTER_PASSWORD` whenever a `CHANGE MASTER TO` command is
executed.
Both `replication_user` and `replication_password` parameters must be defined if
a custom replication user is used. If neither of the parameters is defined, the
`CHANGE MASTER TO` command will use the monitor credentials for the replication
user.
The credentials used for replication must have the `REPLICATION SLAVE`
privilege.
#### `failover_timeout`
Time limit for the cluster failover in seconds. The default value is 90
seconds.
If no successful failover takes place within the configured time period, a
message is logged and the failover functionality is disabled.
message is logged and automatic failover is disabled.
This parameter also controls how long a MaxScale instance that has transitioned
from passive to active will wait for a failover to take place after an apparent
loss of a master server. If no new master server is detected within the
configured time period, the failover will be initiated again.
configured time period, failover will be initiated again.
### `switchover`
#### `verify_master_failure`
Enable switchover via MaxScale. This parameter expects a boolean value and
the default value is false.
Enable master failure verification for automatic failover. This parameter
expects a boolean value and the feature is enabled by default.
When the switchover functionality is enabled, a REST API endpoint will be
made available, using which switchover may be performed. The endpoint will
be available irrespective of whether MaxScale is in active or passive mode,
but switchover will only be attempted if MaxScale is in active mode and an
error logged if an attempt is made when MaxScale is in passive mode.
Switchover may also be triggered from MaxAdmin and the same rules regarding
active/passive holds.
The failure of a master can be verified by checking whether the slaves are still
connected to the master. The timeout for master failure verification is
controlled by the `master_failure_timeout` parameter.
It is safe to perform switchover even with the failover functionality
enabled, as MaxScale will disable the failover behaviour for the duration
of the switchover.
#### `master_failure_timeout`
Only if the switchover succeeds, will the failover functionality be re-enabled.
Otherwise it will remain disabled and must be turned on manually via the REST
API or MaxAdmin.
This parameter controls the period of time, in seconds, that the monitor must
wait before it can declare that the master has failed. The default value is 10
seconds. For failover to activate, the `failcount` requirement must also be met.
When switchover is iniated via the REST-API, the URL path looks as follows:
The failure of a master is verified by tracking when the last change to the
relay log was done and when the last replication heartbeat was received. If the
period of time between the last received event and the time of the check exceeds
the configured value, the slave's connection to the master is considered to be
broken.
When all slaves of a failed master are no longer connected to the master, the
master failure is verified and the failover can be safely performed.
If the slaves lose their connections to the master before the configured timeout
is exceeded, the failover is performed immediately. This allows a faster
failover when the master server crashes causing immediate disconnection of the
the network connections.
#### `switchover_timeout`
Time limit for cluster switchover in seconds. The default value is 90
seconds.
If no successful switchover takes place within the configured time period, a
message is logged and automatic failover is disabled, even if it was enabled
before the switchover attempt. This prevents further modifications to the
misbehaving cluster.
### Manual switchover and failover
Both failover and switchover can be activated manually through the REST API or
MaxAdmin. The commands are only performed when MaxScale is in active mode.
It is safe to perform switchover or failover even with `auto_failover` on, since
the automatic operation cannot happen simultaneously with the manual one.
If a switchover or failover fails, automatic failover is disabled. It can be
turned on manually via the REST API or MaxAdmin.
When switchover is iniated via the REST-API, the URL path is:
```
/v1/maxscale/mysqlmon/switchover?<monitor-instance>&<new-master>&<current-master>
```
@ -291,94 +398,12 @@ path for making `server4` the new master would be:
/v1/maxscale/mysqlmon/switchover?Cluster1&server4&server2
```
**Note:** The monitor user must have the SUPER privilege if the switchover
feature is enabled.
### `switchover_script`
*NOTE* By default, MariaDB MaxScale uses the MariaDB provided switchover
script, so `switchover_script` need not be specified.
This command will be executed when MaxScale has been told to perform a
switchover, either via MaxAdmin or the REST-API. The parameter should be an
absolute path to a command or the command should be in the executable path.
The user which is used to run MaxScale should have execution rights to the
file itself and the directory it resides in.
The REST-API path for manual failover is similar, although the `<new-master>`
and `<current-master>` fields are left out.
```
script=/home/user/myswitchover.sh current_master=$CURRENT_MASTER new_master=$NEW_MASTER
/v1/maxscale/mysqlmon/failover?Cluster1
```
In addition to the substitutions documented in
[Common Monitor Parameters](./Monitor-Common.md)
the following substitutions will be made to the parameter value:
* `$CURRENT_MASTER` will be replaced with the IP and port of the current
master. If the is no current master, the value will be `none`.
* `$NEW_MASTER` will be replaced with the IP and port of the server that
should be made into the new master.
The script should return 0 for success and a non-zero value for failure.
### `switchover_timeout`
The timeout for the cluster switchover in seconds. The default value is 90
seconds.
If no successful switchover takes place within the configured time period,
a message is logged and the failover (not switchover) functionality will not
be enabled, even if it was enabled before the switchover attempt.
### `replication_user`
The username of the replication user. This is given as the value for
`MASTER_USER` whenever a `CHANGE_MASTER_TO` command is executed.
Both `replication_user` and `replication_password` parameters must be defined if
a custom replication user is used. If neither of the parameters is defined, the
`CHANGE MASTER TO` command will use the monitor credentials for the replication
user.
The credentials used for replication must have the `REPLICATION SLAVE`
privilege.
### `replication_password`
The password of the replication user. This is given as the value for
`MASTER_USER` whenever a `CHANGE_MASTER_TO` command is executed.
See `replication_user` parameter documentation for details about the use of this
parameter.
### `verify_master_failure`
Enable master failure verification for failover. This parameter expects a
boolean value and the feature is enabled by default.
The failure of a master can be verified by checking whether the slaves are still
connected to the master. The timeout for master failure verification is
controlled by the `master_failure_timeout` parameter.
### `master_failure_timeout`
This parameter controls the period of time, in seconds, that the monitor must
wait before it can declare that the master has failed. The default value is 10
seconds.
The failure of a master is verified by tracking when the last change to the
relay log was done and when the last replication heartbeat was received. If the
period of time between the last received event and the time of the check exceeds
the configured value, the slave's connection to the master is considered to be
broken.
When all slaves of a failed master are no longer connected to the master, the
master failure is verified and the failover can be safely performed.
If the slaves lose their connections to the master before the configured timeout
is exceeded, the failover is performed immediately. This allows a faster
failover when the master server crashes causing immediate disconnection of the
the network connections.
## Using the MySQL Monitor With Binlogrouter
Since MaxScale 2.2 it's possible to detect a replication setup