This reduces the ambiguity of server ids in the slave status contents.
If a slave connection has been seen properly connected at an earlier time,
it can be trusted to report the correct master server id. This also
fixes some wrong status assignment edge cases with the SERVER_WAS_SLAVE-bit.
The bit will be removed in a later commit.
Even this does not solve the case where MaxScale is started with
some servers down.
The function is useful outside of the monitors, as it makes executing
worker tasks much more convenient. Currently, this change only moves the
code and takes it into use: there should be no functional changes.
Uses mostly the status functions for reading the flags. Strictly
speaking, this breaks the REST API, since for some status combinations
the printed string differs from what was printed before.
The monitor can now differentiate between slaves with a running
series of slave connections to the master and slaves with broken
links. Both still get the SERVER_SLAVE-flag if 'detect_stale_slave'
is on.
Also, relay servers must be running.
If auto_failover is disabled and an alternative master exists, the
monitor will swap the master. This may break replication, but the
situation requires that the DBA has set up a cluster with multiple
masters.
MariaDBMonitor diagnostics printing is unsafe as some of the read
fields are arrays. To be on the safe side, the fields are now read
in the monitor worker thread.
Since diagnostics must work even for stopped monitors, a worker task
is used. In practice, it usually runs when the monitor is sleeping.
Because of monitor changes, the test had wrong assumptions.
Renamed the test and updated it to use MaxCtrl for some queries.
Also, changed the type of the cycle container in the monitor to an
ordered map so that results are predictable.
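The effect of the container change can be illustrated with a minimal
sketch. The names CycleMap and cycle_ids_in_order, as well as the key and
value types, are assumptions made for illustration and are not the
monitor's actual definitions.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the monitor's cycle container:
// cycle id -> names of the servers that belong to that cycle.
using CycleMap = std::map<int, std::vector<std::string>>;

// Returns the cycle ids in iteration order. With std::map this is always
// ascending key order, no matter in which order the cycles were inserted.
std::vector<int> cycle_ids_in_order(const CycleMap& cycles)
{
    std::vector<int> ids;
    for (const auto& kv : cycles)
    {
        ids.push_back(kv.first);
    }
    return ids;
}
```

With std::unordered_map the iteration order is unspecified, so a test
that compares printed output could see the cycles in a different order
between runs or platforms.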
The relay master status was assigned to a server based on the last known
replication status of the slaves that have at some point replicated from
it. This can cause false positives, where the relay master status is
assigned to servers that have never been observed to act as relay masters.
The master failure verification would not work if the slaves did not have
a state change since MaxScale had started. This can be fixed by treating
the startup of MaxScale as an event of sorts.
The master validity check now checks if the master is down. This requires
that the slave status is assigned even if no master is available.
The failover precondition is also fulfilled as long as one valid promotion
candidate is found. Previously a slave that didn't use GTID replication
appeared to prevent failover.
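The relaxed precondition can be sketched as below. The Candidate struct
and the failover_possible function are hypothetical, and GTID use is the
only validity criterion modelled here; the real check likely considers
more than this.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical, simplified view of a slave server.
struct Candidate
{
    bool uses_gtid;     // true if the slave replicates using GTID
};

// Failover may proceed as long as at least one slave is a valid promotion
// candidate. An invalid slave no longer blocks the operation by itself.
bool failover_possible(const std::vector<Candidate>& slaves)
{
    return std::any_of(slaves.begin(), slaves.end(),
                       [](const Candidate& c) {
                           return c.uses_gtid;
                       });
}
```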
The auto_failover feature is a more reliable solution and should be used
instead. Several unused parameters were removed, although they can still be
defined in the config file. Updated the relevant parts of the documentation.
Previously, the status bits were assigned only for running servers. Due
to the changes done in the monitoring algorithm, the slave and master
status bits are assigned to servers that are down. This change broke a
number of tests and deviates from previous behavior.
To keep the old behavior and to fix the test, the status bits are not
assigned to servers that are down.
The command is saved in a function object which is read by the monitor
thread. This way, manual and automatic cluster modification commands are
run in the same step of a monitor cycle.
This update required several modifications in related code.
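A minimal sketch of the hand-off, assuming a single pending-command slot
protected by a mutex. The class CommandSlot and its methods are
hypothetical names used for illustration, not the monitor's actual
interface.

```cpp
#include <functional>
#include <mutex>
#include <utility>

// Hypothetical one-slot mailbox for a manual cluster command.
class CommandSlot
{
public:
    // Called from the thread that received the manual command.
    // Returns false if an earlier command is still waiting to be run.
    bool schedule(std::function<void()> cmd)
    {
        std::lock_guard<std::mutex> guard(m_lock);
        if (m_pending)
        {
            return false;
        }
        m_pending = std::move(cmd);
        return true;
    }

    // Called by the monitor thread at the step of the cycle where cluster
    // modifications are made. Returns true if a command was executed.
    bool run_pending()
    {
        std::function<void()> cmd;
        {
            // Take the command out while holding the lock, run it outside.
            std::lock_guard<std::mutex> guard(m_lock);
            cmd.swap(m_pending);
        }
        if (!cmd)
        {
            return false;
        }
        cmd();
        return true;
    }

private:
    std::mutex            m_lock;
    std::function<void()> m_pending;    // Empty when nothing is scheduled
};
```

Because the command only ever executes inside run_pending(), it runs on
the monitor thread and cannot overlap with an automatic operation made
during the same cycle.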
The monitor now detects when a server has changed such that a replication
graph rebuild is needed and only then rebuilds the graph and detects
cycles and the master.
Also, some old code is no longer called in the monitor cycle. It will be
removed in later commits. Refactored some of the related functions.
This also applies to autoselect switchover. The disk space warning has the
lowest priority, as the other criteria could lead to replication failures.
Also, print the reason the new master was selected over the second best
candidate.
Auto-rejoin now explains more accurately if a server cannot be joined due
to conflicting GTIDs.
Also, auto-rejoin is no longer disabled if a join fails. Usually the
failure is due to the server not replying fast enough with query
completion. The query is often completed anyway. This can lead to some
log spam.
Auto-failover is no longer considered to have failed if the preconditions
are not met. An error message with the failed checks is printed once, but
the checks are repeated every loop as long as the master is down.
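The "print once, keep checking" behavior can be sketched as follows. The
class name is hypothetical, and re-arming the message after the
preconditions have passed is an assumption of this sketch, not something
the change description states.

```cpp
// Hypothetical helper that decides whether the precondition error message
// should be printed on this monitor loop.
class PreconditionLogGate
{
public:
    // Call every loop with the result of the precondition checks.
    // Returns true only on the first failed check of a failure streak.
    bool should_log(bool preconditions_ok)
    {
        if (preconditions_ok)
        {
            // Assumption: a later failure streak is reported again.
            m_warned = false;
            return false;
        }
        if (m_warned)
        {
            return false;   // Already reported, stay quiet
        }
        m_warned = true;
        return true;
    }

private:
    bool m_warned = false;
};
```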
Not yet used, as more is needed to replace the old code. The
algorithm is based on counting the total number of slave nodes
a server has, possibly in multiple layers and/or cycles.
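One way to count slave nodes across layers and cycles is a graph
traversal with a visited set, sketched below. The ReplicationGraph type
and the count_slave_nodes function are illustrative assumptions and may
differ from the code in this change.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <vector>

// Hypothetical replication graph:
// server id -> ids of the servers replicating directly from it.
using ReplicationGraph = std::map<int, std::vector<int>>;

// Counts every distinct server that replicates from 'root', either directly
// or through any number of relay layers. The visited set makes sure each
// node is counted once and that the traversal terminates on cycles.
std::size_t count_slave_nodes(const ReplicationGraph& graph, int root)
{
    std::set<int> visited{root};
    std::vector<int> stack{root};
    while (!stack.empty())
    {
        int current = stack.back();
        stack.pop_back();
        auto it = graph.find(current);
        if (it == graph.end())
        {
            continue;   // No slaves replicate from this server
        }
        for (int slave : it->second)
        {
            if (visited.insert(slave).second)
            {
                stack.push_back(slave);
            }
        }
    }
    return visited.size() - 1;  // The root itself is not its own slave
}
```

In a topology where server 1 has slaves 2 and 3, and servers 2 and 4
replicate from each other, server 1 gets a count of three.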
MonitorInstanceSimple is intended for simple monitors that
probe servers in a straightforward fashion. More complex monitors
can be derived directly from MonitorInstance.