Add documentation for switchover, failover and rejoin
This commit is contained in:
@ -184,6 +184,10 @@ the master.
|
|||||||
The formula for calculating the actual number of milliseconds before the server
|
The formula for calculating the actual number of milliseconds before the server
|
||||||
is labelled as the master is `monitor_interval * failcount`.
|
is labelled as the master is `monitor_interval * failcount`.
|
||||||
|
|
||||||
|
If automatic failover is enabled (`auto_failover=true`), this setting also
|
||||||
|
controls how many times the master server must fail to respond before failover
|
||||||
|
begins.
|
||||||
|
|
||||||
### `allow_cluster_recovery`
|
### `allow_cluster_recovery`
|
||||||
|
|
||||||
Allow recovery after the cluster has dropped down to one server. This feature
|
Allow recovery after the cluster has dropped down to one server. This feature
|
||||||
@ -214,60 +218,163 @@ assigned the _Slave_ status which allows them to be used like normal slave
|
|||||||
servers. When the option is disabled, the servers will only receive the _Slave
|
servers. When the option is disabled, the servers will only receive the _Slave
|
||||||
of External Server_ status and they will not be used.
|
of External Server_ status and they will not be used.
|
||||||
|
|
||||||
### `failover`
|
## Failover, switchover and auto-rejoin
|
||||||
|
|
||||||
|
Starting with MaxScale 2.2.1, MySQL Monitor supports replication cluster
|
||||||
|
modification. The operations implemented are: _failover_ (replacing a failed
|
||||||
|
master), _switchover_ (swapping a slave with a running master) and _rejoin_
|
||||||
|
(joining a standalone server to the cluster). The features and the parameters
|
||||||
|
controlling them are presented in this section.
|
||||||
|
|
||||||
|
Both failover and switchover can be activated manually through MaxAdmin.
|
||||||
|
Failover selects the new master server automatically, switchover requires the
|
||||||
|
user to designate the new master as well as the current master. Example commands
|
||||||
|
are below:
|
||||||
|
|
||||||
|
```
|
||||||
|
call command mysqlmon failover MySQL-Monitor
|
||||||
|
call command mysqlmon switchover MySQL-Monitor SlaveServ3 MasterServ
|
||||||
|
```
|
||||||
|
|
||||||
|
Failover can also activate automatically, if `auto_failover` is on. The
|
||||||
|
activation begins when the master has been down for a number of monitor
|
||||||
|
iterations defined in `failcount`.
|
||||||
|
|
||||||
|
When `auto-rejoin` is active, the monitor will try to rejoin standalone servers
|
||||||
|
and slaves replicating from the wrong master (any server not the cluster
|
||||||
|
master). These servers are redirected to replicate from the correct master
|
||||||
|
server, forcing the replication topology to a 1-master-N-slaves configuration.
|
||||||
|
|
||||||
|
All of the three features require that the monitor user (`user`) has the SUPER
|
||||||
|
privilege. In addition, the monitor needs to know which username and password a
|
||||||
|
slave should use when starting replication. These are given in
|
||||||
|
`replication_user` and `replication_password`.
|
||||||
|
|
||||||
|
### Limitations
|
||||||
|
|
||||||
|
Switchover and failover only understand simple topologies. They will not work if
|
||||||
|
the cluster has multiple masters, relay masters, or if the topology is circular.
|
||||||
|
The server cluster is assumed to be well-behaving with no significant
|
||||||
|
replication lag and all commands that modify the cluster complete in a few
|
||||||
|
seconds (faster than `backend_read_timeout` and `backend_write_timeout`).
|
||||||
|
|
||||||
|
The backends must all use GTID-based replication, and the domain id should not
|
||||||
|
change during a switchover or failover. Master and slaves must have
|
||||||
|
well-behaving GTIDs: no extra events on slave servers.
|
||||||
|
|
||||||
|
### Configuration parameters
|
||||||
|
|
||||||
|
#### `auto_failover`
|
||||||
|
|
||||||
Enable automated master failover. This parameter expects a boolean value and the
|
Enable automated master failover. This parameter expects a boolean value and the
|
||||||
default value is false.
|
default value is false.
|
||||||
|
|
||||||
When the failover functionality is enabled, traditional MariaDB Master-Slave
|
When automatic failover is enabled, traditional MariaDB Master-Slave clusters
|
||||||
clusters will automatically elect a new master if the old master goes down. The
|
will automatically elect a new master if the old master goes down and stays down
|
||||||
failover functionality will not take place when MaxScale is configured as a
|
a number of iterations given in `failcount`. Failover will not take place when
|
||||||
passive instance. For details on how MaxScale behaves in passive mode, see the
|
MaxScale is configured as a passive instance. For details on how MaxScale
|
||||||
following documentation of `failover_timeout`.
|
behaves in passive mode, see the documentation on `failover_timeout` below.
|
||||||
|
|
||||||
If an attempt at failover fails or multiple master servers are detected, an
|
If an attempt at failover fails or multiple master servers are detected, an
|
||||||
error is logged and the failover functionality is disabled. If this happens, the
|
error is logged and automatic failover is disabled. If this happens, the cluster
|
||||||
cluster must be fixed manually and the failover needs to be re-enabled via the
|
must be fixed manually and the failover needs to be re-enabled via the REST API
|
||||||
REST API or MaxAdmin.
|
or MaxAdmin.
|
||||||
|
|
||||||
**Note:** The monitor user must have the SUPER privilege if the failover feature
|
The monitor user must have the SUPER privilege for failover to work.
|
||||||
is enabled.
|
|
||||||
|
|
||||||
### `failover_timeout`
|
#### `auto_rejoin`
|
||||||
|
|
||||||
The timeout for the cluster failover in seconds. The default value is 90
|
Enable automatic joining of server to the cluster. This parameter expects a
|
||||||
|
boolean value and the default value is false.
|
||||||
|
|
||||||
|
When enabled, the monitor will attempt to direct standalone servers and servers replicating from a relay master to the main cluster master server, enforcing a 1-master-N-slaves configuration.
|
||||||
|
|
||||||
|
For example, consider the following event series.
|
||||||
|
|
||||||
|
1. Slave A goes down
|
||||||
|
2. Master goes down and a failover is performed, promoting Slave B
|
||||||
|
3. Slave A comes back
|
||||||
|
|
||||||
|
Slave A is still trying to replicate from the downed master, since it wasn't online during failover. If `auto_rejoin` is on, Slave A will quickly be redirected to Slave B, the current master.
|
||||||
|
|
||||||
|
#### `replication_user` and `replication_password`
|
||||||
|
|
||||||
|
The username and password of the replication user. These are given as the values
|
||||||
|
for `MASTER_USER` and `MASTER_PASSWORD` whenever a `CHANGE MASTER TO` command is
|
||||||
|
executed.
|
||||||
|
|
||||||
|
Both `replication_user` and `replication_password` parameters must be defined if
|
||||||
|
a custom replication user is used. If neither of the parameters is defined, the
|
||||||
|
`CHANGE MASTER TO` command will use the monitor credentials for the replication
|
||||||
|
user.
|
||||||
|
|
||||||
|
The credentials used for replication must have the `REPLICATION SLAVE`
|
||||||
|
privilege.
|
||||||
|
|
||||||
|
#### `failover_timeout`
|
||||||
|
|
||||||
|
Time limit for the cluster failover in seconds. The default value is 90
|
||||||
seconds.
|
seconds.
|
||||||
|
|
||||||
If no successful failover takes place within the configured time period, a
|
If no successful failover takes place within the configured time period, a
|
||||||
message is logged and the failover functionality is disabled.
|
message is logged and automatic failover is disabled.
|
||||||
|
|
||||||
This parameter also controls how long a MaxScale instance that has transitioned
|
This parameter also controls how long a MaxScale instance that has transitioned
|
||||||
from passive to active will wait for a failover to take place after an apparent
|
from passive to active will wait for a failover to take place after an apparent
|
||||||
loss of a master server. If no new master server is detected within the
|
loss of a master server. If no new master server is detected within the
|
||||||
configured time period, the failover will be initiated again.
|
configured time period, failover will be initiated again.
|
||||||
|
|
||||||
### `switchover`
|
#### `verify_master_failure`
|
||||||
|
|
||||||
Enable switchover via MaxScale. This parameter expects a boolean value and
|
Enable master failure verification for automatic failover. This parameter
|
||||||
the default value is false.
|
expects a boolean value and the feature is enabled by default.
|
||||||
|
|
||||||
When the switchover functionality is enabled, a REST API endpoint will be
|
The failure of a master can be verified by checking whether the slaves are still
|
||||||
made available, using which switchover may be performed. The endpoint will
|
connected to the master. The timeout for master failure verification is
|
||||||
be available irrespective of whether MaxScale is in active or passive mode,
|
controlled by the `master_failure_timeout` parameter.
|
||||||
but switchover will only be attempted if MaxScale is in active mode and an
|
|
||||||
error logged if an attempt is made when MaxScale is in passive mode.
|
|
||||||
Switchover may also be triggered from MaxAdmin and the same rules regarding
|
|
||||||
active/passive holds.
|
|
||||||
|
|
||||||
It is safe to perform switchover even with the failover functionality
|
#### `master_failure_timeout`
|
||||||
enabled, as MaxScale will disable the failover behaviour for the duration
|
|
||||||
of the switchover.
|
|
||||||
|
|
||||||
Only if the switchover succeeds, will the failover functionality be re-enabled.
|
This parameter controls the period of time, in seconds, that the monitor must
|
||||||
Otherwise it will remain disabled and must be turned on manually via the REST
|
wait before it can declare that the master has failed. The default value is 10
|
||||||
API or MaxAdmin.
|
seconds. For failover to activate, the `failcount` requirement must also be met.
|
||||||
|
|
||||||
When switchover is iniated via the REST-API, the URL path looks as follows:
|
The failure of a master is verified by tracking when the last change to the
|
||||||
|
relay log was done and when the last replication heartbeat was received. If the
|
||||||
|
period of time between the last received event and the time of the check exceeds
|
||||||
|
the configured value, the slave's connection to the master is considered to be
|
||||||
|
broken.
|
||||||
|
|
||||||
|
When all slaves of a failed master are no longer connected to the master, the
|
||||||
|
master failure is verified and the failover can be safely performed.
|
||||||
|
|
||||||
|
If the slaves lose their connections to the master before the configured timeout
|
||||||
|
is exceeded, the failover is performed immediately. This allows a faster
|
||||||
|
failover when the master server crashes causing immediate disconnection of the
|
||||||
|
the network connections.
|
||||||
|
|
||||||
|
#### `switchover_timeout`
|
||||||
|
|
||||||
|
Time limit for cluster switchover in seconds. The default value is 90
|
||||||
|
seconds.
|
||||||
|
|
||||||
|
If no successful switchover takes place within the configured time period, a
|
||||||
|
message is logged and automatic failover is disabled, even if it was enabled
|
||||||
|
before the switchover attempt. This prevents further modifications to the
|
||||||
|
misbehaving cluster.
|
||||||
|
|
||||||
|
### Manual switchover and failover
|
||||||
|
|
||||||
|
Both failover and switchover can be activated manually through the REST API or
|
||||||
|
MaxAdmin. The commands are only performed when MaxScale is in active mode.
|
||||||
|
|
||||||
|
It is safe to perform switchover or failover even with `auto_failover` on, since
|
||||||
|
the automatic operation cannot happen simultaneously with the manual one.
|
||||||
|
|
||||||
|
If a switchover or failover fails, automatic failover is disabled. It can be
|
||||||
|
turned on manually via the REST API or MaxAdmin.
|
||||||
|
|
||||||
|
When switchover is iniated via the REST-API, the URL path is:
|
||||||
```
|
```
|
||||||
/v1/maxscale/mysqlmon/switchover?<monitor-instance>&<new-master>&<current-master>
|
/v1/maxscale/mysqlmon/switchover?<monitor-instance>&<new-master>&<current-master>
|
||||||
```
|
```
|
||||||
@ -291,94 +398,12 @@ path for making `server4` the new master would be:
|
|||||||
/v1/maxscale/mysqlmon/switchover?Cluster1&server4&server2
|
/v1/maxscale/mysqlmon/switchover?Cluster1&server4&server2
|
||||||
```
|
```
|
||||||
|
|
||||||
**Note:** The monitor user must have the SUPER privilege if the switchover
|
The REST-API path for manual failover is similar, although the `<new-master>`
|
||||||
feature is enabled.
|
and `<current-master>` fields are left out.
|
||||||
|
|
||||||
### `switchover_script`
|
|
||||||
|
|
||||||
*NOTE* By default, MariaDB MaxScale uses the MariaDB provided switchover
|
|
||||||
script, so `switchover_script` need not be specified.
|
|
||||||
|
|
||||||
This command will be executed when MaxScale has been told to perform a
|
|
||||||
switchover, either via MaxAdmin or the REST-API. The parameter should be an
|
|
||||||
absolute path to a command or the command should be in the executable path.
|
|
||||||
The user which is used to run MaxScale should have execution rights to the
|
|
||||||
file itself and the directory it resides in.
|
|
||||||
|
|
||||||
```
|
```
|
||||||
script=/home/user/myswitchover.sh current_master=$CURRENT_MASTER new_master=$NEW_MASTER
|
/v1/maxscale/mysqlmon/failover?Cluster1
|
||||||
```
|
```
|
||||||
|
|
||||||
In addition to the substitutions documented in
|
|
||||||
[Common Monitor Parameters](./Monitor-Common.md)
|
|
||||||
the following substitutions will be made to the parameter value:
|
|
||||||
|
|
||||||
* `$CURRENT_MASTER` will be replaced with the IP and port of the current
|
|
||||||
master. If the is no current master, the value will be `none`.
|
|
||||||
* `$NEW_MASTER` will be replaced with the IP and port of the server that
|
|
||||||
should be made into the new master.
|
|
||||||
|
|
||||||
The script should return 0 for success and a non-zero value for failure.
|
|
||||||
|
|
||||||
### `switchover_timeout`
|
|
||||||
|
|
||||||
The timeout for the cluster switchover in seconds. The default value is 90
|
|
||||||
seconds.
|
|
||||||
|
|
||||||
If no successful switchover takes place within the configured time period,
|
|
||||||
a message is logged and the failover (not switchover) functionality will not
|
|
||||||
be enabled, even if it was enabled before the switchover attempt.
|
|
||||||
|
|
||||||
### `replication_user`
|
|
||||||
|
|
||||||
The username of the replication user. This is given as the value for
|
|
||||||
`MASTER_USER` whenever a `CHANGE_MASTER_TO` command is executed.
|
|
||||||
|
|
||||||
Both `replication_user` and `replication_password` parameters must be defined if
|
|
||||||
a custom replication user is used. If neither of the parameters is defined, the
|
|
||||||
`CHANGE MASTER TO` command will use the monitor credentials for the replication
|
|
||||||
user.
|
|
||||||
|
|
||||||
The credentials used for replication must have the `REPLICATION SLAVE`
|
|
||||||
privilege.
|
|
||||||
|
|
||||||
### `replication_password`
|
|
||||||
|
|
||||||
The password of the replication user. This is given as the value for
|
|
||||||
`MASTER_USER` whenever a `CHANGE_MASTER_TO` command is executed.
|
|
||||||
|
|
||||||
See `replication_user` parameter documentation for details about the use of this
|
|
||||||
parameter.
|
|
||||||
|
|
||||||
### `verify_master_failure`
|
|
||||||
|
|
||||||
Enable master failure verification for failover. This parameter expects a
|
|
||||||
boolean value and the feature is enabled by default.
|
|
||||||
|
|
||||||
The failure of a master can be verified by checking whether the slaves are still
|
|
||||||
connected to the master. The timeout for master failure verification is
|
|
||||||
controlled by the `master_failure_timeout` parameter.
|
|
||||||
|
|
||||||
### `master_failure_timeout`
|
|
||||||
|
|
||||||
This parameter controls the period of time, in seconds, that the monitor must
|
|
||||||
wait before it can declare that the master has failed. The default value is 10
|
|
||||||
seconds.
|
|
||||||
|
|
||||||
The failure of a master is verified by tracking when the last change to the
|
|
||||||
relay log was done and when the last replication heartbeat was received. If the
|
|
||||||
period of time between the last received event and the time of the check exceeds
|
|
||||||
the configured value, the slave's connection to the master is considered to be
|
|
||||||
broken.
|
|
||||||
|
|
||||||
When all slaves of a failed master are no longer connected to the master, the
|
|
||||||
master failure is verified and the failover can be safely performed.
|
|
||||||
|
|
||||||
If the slaves lose their connections to the master before the configured timeout
|
|
||||||
is exceeded, the failover is performed immediately. This allows a faster
|
|
||||||
failover when the master server crashes causing immediate disconnection of the
|
|
||||||
the network connections.
|
|
||||||
|
|
||||||
## Using the MySQL Monitor With Binlogrouter
|
## Using the MySQL Monitor With Binlogrouter
|
||||||
|
|
||||||
Since MaxScale 2.2 it's possible to detect a replication setup
|
Since MaxScale 2.2 it's possible to detect a replication setup
|
||||||
|
Reference in New Issue
Block a user