The fix causes a regression in the failover functionality as there is a
dependency between the slave's master ID and how the failover
performs. This dependency should not exist but fixing it causes a problem
with the mysqlmon_rejoin_bad2 test.
Startup now done in a static method. Constructor initializes some values.
Config parameters loaded in a separate method. Some things still need
looking.
When the IO thread of a relay master is stopped, the knowledge that it is
not a real master but a relay master is lost. To prevent this loss of
information, the master server's server_id value should always be stored
if it is available.
Previously, if the list contained servers that were not monitored by
the monitor yet were valid servers, an error value would be returned
and the monitor failed to start.
With this update, the non-monitored servers are simply ignored when
forming the final list.
Also, added printing of the list to diagnostics.
If the master is replicating from an external master, the monitor will save the
host:port of the external server. During demotion, the old master stops the external
replication while the new master begins it. Also, any commands that would add
to gtid have to be omitted when an external master is in play.
When four servers (A, B, C and E where E and A replicate from each other
and A is the master for B and C) form a cluster and only three of them (A,
B and C) are configured into MaxScale, a failover operation from A to B
(making B the current master) and a restart of A causes B to lose its
master status.
The following diagram illustrates the state of the cluster at the end of
the process described above.
+----------------------+
| +---+ |
+------------+ B <-+ |
+-v-+ | +---+ | |
| E | | | |
+-^-+ | +---+ +-+-+ |
+------+ A | | C | |
| +---+ +---+ |
| |
+----------------------+
The external server E was not correctly ignored in the replication
topology generation causing both A and B to be seen as the lowest slave
nodes in the tree. From a theoretical point of view this is the correct
interpretation as there are two distinct trees and neither of them
contains any true masters.
In practice, MaxScale should treat any servers that replicate from an
external master as root level master nodes. Doing this guarantees that they
are labeled as masters if they have slaves replicating from them.
When detect_standalone_master is enabled, the root_master variable was not
updated after the master was changed by the standalone server detection
mechanism. This caused debug assertions to fire in addition to possibly
causing some of the ignore_external_masters logic to break.
"servers_no_promotion" is a comma-separated list of servers
which cannot be chosen when selecting a new master during failover
(auto or manual), or when automatically selecting a new master
for switchover (currently disabled).
The servers in the list are redirected normally and can be promoted
by switchover when manually selecting a new master.
The Master status now prevents Slave status from being assigned to a
server. In practice this simply means that the master will not have both
the Master and Slave status bits.
In debug mode, when scanning the server id from a string, check that resulting
number is 32bit. Also, when querying the server id, query the global version.
Now, if a super user modifies the server id the monitor will notice it.
Server id:s in gtid:s are handled similarly.
Now detects some erroneous situations before starting switchover.
Switchover can be activated without specifying current master.
In this case, the cluster master server is selected.
The monitor will now also create the database if it is missing. Since it
already creates the table, also creating the database is not a large
addition.
Cleaned up some of the related checking code and combined them into a
simple utility function.
Time elapsed is now properly tracked during a switchover. After slave
redirection, an event is added to the master. Then, the slaves are queried
repeatedly until they advance to the newest event. I/O and SQL errors are
also detected.
During switchover, MASTER_GTID_WAIT is now called on all slaves. This causes
switchover to complete slower than before but is safer if log_slave_updates
is not on on the new master server. Also, read_only is disabled on the
demoted server if waiting on slaves or promotion fails. This should
effectively cancel the failover for the old master.