If a server goes down and it has the stale master bit enabled, all other
bits for the server are cleared. This allows failed masters that have been
replaced to be first detected and then reintroduced into the replication
topology.
The slave and stale slave status bits should be cleared from a master if
it still has them.
Also used the correct functions to manipulate the bits instead of directly
setting them in the monitor.
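As a rough illustration of the bit handling described above, here is a
minimal sketch. The flag names, their values and the helper functions
are hypothetical stand-ins, not the monitor's real definitions, and the
behaviour for servers without the stale master bit (only the running
bit is cleared) is an assumption.

```cpp
#include <cstdint>

// Hypothetical status bits; the real monitor defines its own values.
enum : uint32_t {
    ST_RUNNING      = 1u << 0,
    ST_MASTER       = 1u << 1,
    ST_SLAVE        = 1u << 2,
    ST_STALE_MASTER = 1u << 3,
    ST_STALE_SLAVE  = 1u << 4,
};

// Accessor-style helpers, used instead of assigning to the field directly.
inline uint32_t set_bits(uint32_t status, uint32_t bits)   { return status | bits; }
inline uint32_t clear_bits(uint32_t status, uint32_t bits) { return status & ~bits; }

// A server that goes down with the stale master bit set keeps only that bit.
inline uint32_t on_server_down(uint32_t status)
{
    if (status & ST_STALE_MASTER)
    {
        return ST_STALE_MASTER;
    }
    // Assumption: other servers simply lose the running bit.
    return clear_bits(status, ST_RUNNING);
}

// A master must not carry the slave or stale slave bits.
inline uint32_t normalize_master(uint32_t status)
{
    if (status & ST_MASTER)
    {
        status = clear_bits(status, ST_SLAVE | ST_STALE_SLAVE);
    }
    return status;
}
```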
The value of the global gtid_slave_pos is only needed during
failover, so querying it every monitor loop is unnecessary. The
value is now only requested when deciding on a new master server
or when waiting for the selected promotion target to clear its
relay logs.
Also, when waiting for the logs to clear, gtid_io_pos must stay
constant or failover is cancelled. An advancing gtid_io_pos indicates
that the server is still receiving events from the old master.
The Gtid_Slave_Pos returned by SHOW ALL SLAVES STATUS is not quite
reliable (MDEV-14182) so the variable version is used instead. Added
a convenience function for querying a single row of values.
Also, gtid_strict_mode, log_bin and log_slave_updates are now
queried during failover. The first only causes a warning message
if disabled; the last two affect new master selection.
Gtid_Slave_Pos may contain multiple triplets even with single-source
replication if the domain has changed at some point. For failover, we
only need to know the current domain values, so the GTID parsing now
accepts an optional domain parameter. The Gtid class still only stores
one triplet of values.
When parsing the Show Slave Status result, Gtid_IO_Pos is parsed first.
The resulting domain is then read from Gtid_Slave_Pos.
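The parsing order described above could look roughly like the sketch
below. The struct layout and the function names parse_triplet and
parse_gtid are illustrative assumptions, as is the use of a negative
domain to mean "take the first triplet".

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// One "domain-server_id-sequence" triplet.
struct Gtid {
    uint32_t domain = 0;
    uint32_t server_id = 0;
    uint64_t sequence = 0;
};

// Parse a single triplet such as "0-3000-12".
inline bool parse_triplet(const std::string& text, Gtid& out)
{
    Gtid g;
    char d1 = 0, d2 = 0;
    std::istringstream is(text);
    if ((is >> g.domain >> d1 >> g.server_id >> d2 >> g.sequence)
        && d1 == '-' && d2 == '-')
    {
        out = g;
        return true;
    }
    return false;
}

// Parse a comma-separated GTID list. With domain < 0 the first valid
// triplet is returned, otherwise the triplet of the requested domain.
inline bool parse_gtid(const std::string& list, Gtid& out, int64_t domain = -1)
{
    std::istringstream items(list);
    std::string item;
    while (std::getline(items, item, ','))
    {
        Gtid g;
        if (parse_triplet(item, g) && (domain < 0 || g.domain == domain))
        {
            out = g;
            return true;
        }
    }
    return false;
}
```

A caller would first parse Gtid_IO_Pos without a domain and then pass
the resulting domain when parsing Gtid_Slave_Pos.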
When selecting the new master server, Gtid_IO_Pos is checked to
select the slave with the latest event in its relay log. If there is
a tie, the slave that has processed the most events wins.
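The selection rule can be sketched as a simple comparison. The
Candidate struct and select_new_master are hypothetical names, and only
the sequence numbers of the two positions are compared here, which
assumes all candidates share the same domain.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical view of a slave for selection purposes.
struct Candidate {
    std::string name;
    uint64_t io_seq;     // sequence number of Gtid_IO_Pos (latest relay log event)
    uint64_t slave_seq;  // sequence number of gtid_slave_pos (latest processed event)
};

// Pick the slave with the latest event in its relay log; on a tie the
// slave that has processed the most events wins. Returns nullptr if
// there are no candidates.
inline const Candidate* select_new_master(const std::vector<Candidate>& slaves)
{
    const Candidate* best = nullptr;
    for (const Candidate& c : slaves)
    {
        if (!best
            || c.io_seq > best->io_seq
            || (c.io_seq == best->io_seq && c.slave_seq > best->slave_seq))
        {
            best = &c;
        }
    }
    return best;
}
```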
It's possible that the winning slave has unprocessed events. In
this case, failover waits for the slave to complete processing the
log. The maximum wait is defined by the monitor parameter
"failover_timeout", which defaults to 90 seconds. If time runs out,
failover ends in failure.
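A minimal sketch of the wait, including the check that the IO position
stays constant, might look as follows. The names Positions, WaitResult
and wait_for_relay_log are assumptions, and the query and pause steps
are injected as callbacks so the sketch does not depend on a database
connection.

```cpp
#include <chrono>
#include <cstdint>
#include <functional>

// Sequence numbers of the promotion target's two positions.
struct Positions {
    uint64_t io_seq;     // latest event received into the relay log
    uint64_t slave_seq;  // latest event processed
};

enum class WaitResult { CAUGHT_UP, IO_POS_MOVED, TIMED_OUT };

// Poll until the relay log has been processed, the IO position
// changes (failover must be cancelled) or the timeout expires.
inline WaitResult wait_for_relay_log(const std::function<Positions()>& query,
                                     std::chrono::seconds timeout,
                                     const std::function<void()>& pause)
{
    using Clock = std::chrono::steady_clock;
    const Clock::time_point deadline = Clock::now() + timeout;
    const uint64_t initial_io = query().io_seq;
    for (;;)
    {
        const Positions p = query();
        if (p.io_seq != initial_io)
        {
            return WaitResult::IO_POS_MOVED;  // still receiving events
        }
        if (p.slave_seq >= p.io_seq)
        {
            return WaitResult::CAUGHT_UP;
        }
        if (Clock::now() >= deadline)
        {
            return WaitResult::TIMED_OUT;
        }
        pause();  // e.g. sleep for a short interval
    }
}
```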
The Gtid struct was separated into its own definition to make GTIDs
easier to handle.
The SlaveStatus info is now in a separate class, although it's
still embedded in the MYSQL_SERVER_INFO class. Both classes now
use strings instead of char pointers.
The helper function provides map-like access to row values. This is used
to retrieve the values for all MariaDB 10.0+ versions as there are
differences in the returned results between 10.1 and 10.2.
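To illustrate what map-like access buys here, the following sketch
looks columns up by name, so a column that is missing or has moved in
one server version does not break the caller. The Row class is a
hypothetical stand-in and does not show the actual helper or the MySQL
client API.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: one result row addressed by column name.
class Row {
public:
    Row(const std::vector<std::string>& columns,
        const std::vector<std::string>& values)
    {
        for (std::size_t i = 0; i < columns.size() && i < values.size(); ++i)
        {
            m_values[columns[i]] = values[i];
        }
    }

    // True if the server returned this column at all.
    bool has(const std::string& column) const
    {
        return m_values.count(column) != 0;
    }

    // Value of the column, or the fallback if the column is absent.
    std::string get(const std::string& column,
                    const std::string& fallback = "") const
    {
        std::map<std::string, std::string>::const_iterator it = m_values.find(column);
        return it != m_values.end() ? it->second : fallback;
    }

private:
    std::map<std::string, std::string> m_values;
};
```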
Using timestamps to detect whether MaxScale was active or passive can
cause problems if multiple events happen at the same time. This can be
avoided by separating events into actively observed and passively observed
events. This clarifies the logic by removing the ambiguity of timestamps.
As the monitoring threads are separate from the worker threads, it is
prudent to use atomic operations to modify and read the state of
MaxScale. This imposes a happens-before relation between MaxScale
being set into passive mode and events being classified as passively
observed.
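A minimal sketch of the idea, assuming a release store paired with an
acquire load is enough for the intended ordering. PassiveFlag, Observed
and classify_event are hypothetical names, not the actual MaxScale
code.

```cpp
#include <atomic>

// How an event was seen by this MaxScale instance.
enum class Observed { ACTIVELY, PASSIVELY };

// Passive-mode flag shared between worker and monitor threads.
class PassiveFlag {
public:
    // Called from a worker thread when the mode is changed.
    void set_passive(bool value)
    {
        m_passive.store(value, std::memory_order_release);
    }

    // Called from a monitor thread; the acquire load pairs with the
    // release store above.
    bool is_passive() const
    {
        return m_passive.load(std::memory_order_acquire);
    }

private:
    std::atomic<bool> m_passive{false};
};

// Classify an event at the moment it is observed, instead of
// comparing timestamps afterwards.
inline Observed classify_event(const PassiveFlag& flag)
{
    return flag.is_passive() ? Observed::PASSIVELY : Observed::ACTIVELY;
}
```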
The master failure can now be verified by checking when the slaves
were last connected to the master. If the slaves do not receive any
events from the master, the connections are considered down after a
configurable limit.
Added two parameters for controlling whether the check is done and for how
long the monitor waits before doing the failover.
The slave heartbeat count and period are collected from the SHOW ALL
SLAVES STATUS output. This, in addition to the relay log position, is used
to calculate the point in time when a slave has last interacted with the
master.
By using this timestamp, the monitor can enforce a minimum "timeout" for
the master before a failover is performed.
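One way to picture the timestamp logic is sketched below. It treats a
change in either the heartbeat count or the relay log position as an
interaction with the master. The names are hypothetical, times are
plain seconds, and the heartbeat period is left out because the text
does not say how it enters the calculation.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical per-slave record of the last contact with the master.
struct SlaveLink {
    uint64_t heartbeats = 0;     // heartbeat count from the status output
    uint64_t relay_log_pos = 0;  // relay log position
    int64_t last_seen = 0;       // time of last observed change, in seconds
};

// Record a new sample; any change counts as an interaction.
inline void observe(SlaveLink& link, uint64_t heartbeats,
                    uint64_t relay_log_pos, int64_t now)
{
    if (heartbeats != link.heartbeats || relay_log_pos != link.relay_log_pos)
    {
        link.heartbeats = heartbeats;
        link.relay_log_pos = relay_log_pos;
        link.last_seen = now;
    }
}

// True only if every slave has been without contact for longer than
// the timeout. With no slaves the failure cannot be verified.
inline bool master_silent(const std::vector<SlaveLink>& links,
                          int64_t now, int64_t timeout)
{
    for (const SlaveLink& link : links)
    {
        if (now - link.last_seen <= timeout)
        {
            return false;
        }
    }
    return !links.empty();
}
```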
Moved mon_process_failover() from monitor.cc to mysql_mon.cc. Renamed
some functions and variables related to previous failover functionality
to avoid confusion.
The get_server_info function takes the monitor handle and a database and
returns the corresponding MYSQL_SERVER_INFO struct. This hides a part of
the actual implementation of the info struct from the monitor code,
allowing future refactoring to be done. It also makes the code a bit more
readable.
The values in the MYSQL_SERVER_INFO struct can now be updated with the
update_slave_status function.
Also moved the number of configured and running slave configurations into
the info struct. This removes the need to pass output parameters.
The MYSQL_SERVER_INFO struct is updated first and then the server status
is updated. This allows the function to be called without it affecting the
server state.
The parameter handling for monitors can now be done in a consistent manner
by establishing a rule that the monitor owns the parameter object as long
as it is running. This will allow parameters to be added and removed
safely both from outside and inside monitors.
Currently this functionality is only used by mysqlmon to disable failover
after an attempt to perform a failover has failed.
If a switchover_script parameter is given, its value will be used as
the switchover script. Otherwise the default will be used, which is
currently just echo.
The MySQL Monitor now introduces two script variables, CURRENT_MASTER
and NEW_MASTER, that contain information about the current and new
master respectively.
Switchover is performed only if switchover has been enabled and MaxScale
is *not* in passive mode.
To be able to do that, we need to get hold of the MXS_MONITORED_SERVER
corresponding to the SERVER specified as the new master.
So, instead of just returning a boolean indicating whether the server
was found or not, we return the MXS_MONITORED_SERVER pointer.
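The change in return type can be illustrated with a small lookup over
a linked list. Server and MonitoredServer below are simplified
stand-ins for SERVER and MXS_MONITORED_SERVER, and
find_monitored_server is a hypothetical name.

```cpp
#include <string>

// Simplified stand-ins for the real structs.
struct Server {
    std::string name;
};

struct MonitoredServer {
    Server* server;
    MonitoredServer* next;
};

// Return the monitored server wrapping the given server, or nullptr
// if the monitor does not monitor it. A null check still answers the
// old "was it found" question.
inline MonitoredServer* find_monitored_server(MonitoredServer* head,
                                              const Server* target)
{
    for (MonitoredServer* m = head; m != nullptr; m = m->next)
    {
        if (m->server == target)
        {
            return m;
        }
    }
    return nullptr;
}
```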
Switchover expects one or two servers as arguments: one (the new
master) if there is no master, and two (the new master and the
current master) if there currently is a master.
The procedure is as follows:
- Stop the monitor.
- Check that the provided arguments are reasonable.
  - If there is no master currently, then only one argument is
    accepted.
  - If there is a master, then it must also be specified.
    This is to prevent pathological cases where the situation has
    changed after the admin has issued the switchover command.
- Check the failover mode and disable it.
- Perform the failover.
- If it succeeded, re-enable failover if it was enabled.
- If it failed and failover was enabled, do not re-enable it and log
  an alert. If failover was not enabled, just log an error.
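The steps above can be sketched as follows. MonitorState, args_ok and
switchover are hypothetical names, the actual operation is injected as
the perform callback, and log messages are collected in a vector only
for illustration. Restarting the monitor is not part of the listed
steps, so it is left out here.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical, heavily simplified monitor state.
struct MonitorState {
    bool running = true;
    bool failover_enabled = false;
    std::string master;  // empty when there is currently no master
};

// args[0] is the new master; args[1], if present, the current master.
inline bool args_ok(const MonitorState& m, const std::vector<std::string>& args)
{
    if (m.master.empty())
    {
        return args.size() == 1;
    }
    // The current master must be named and must match what the monitor sees.
    return args.size() == 2 && args[1] == m.master;
}

inline bool switchover(MonitorState& m,
                       const std::vector<std::string>& args,
                       const std::function<bool(const std::string&)>& perform,
                       std::vector<std::string>& log)
{
    m.running = false;  // stop the monitor
    if (!args_ok(m, args))
    {
        log.push_back("error: unexpected switchover arguments");
        return false;
    }
    const bool failover_was_enabled = m.failover_enabled;
    m.failover_enabled = false;
    if (perform(args[0]))
    {
        m.master = args[0];
        m.failover_enabled = failover_was_enabled;  // restore previous mode
        return true;
    }
    log.push_back(failover_was_enabled
                  ? "alert: switchover failed, failover remains disabled"
                  : "error: switchover failed");
    return false;
}
```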