This fixes some situations where MaxAdmin/MaxCtrl would block and wait
until a monitor operation or tick is complete. This also fixes a deadlock
caused by calling monitor diagnostics inside a monitor script.
Concurrency is enabled by adding one mutex per server object to protect
array-like fields from concurrent reading/writing.
The monitor now continuously updates a list of enabled server events. When
promoting a new master in failover/switchover, only events that were enabled
on the previous master are enabled on the new. This avoids enabling events
that may have been disabled on the master yet stayed in the SLAVESIDE_DISABLED-
state on the slave.
In the case of reset-replication command, events on the new master are only
enabled if the monitor had a master when the command was launched. Otherwise
all events remain disabled.
Previously, if the server had no gtid:s, the method would fail leading to
a confusing error message. This could even totally stop the monitor from working
if a recent server version (10.X) did not have any gtid events.
The main class was getting unwieldly and too general. Dividing the fields
helps adding support for other operation types.
This commit leaves most data duplicated, later commits clean up the affected code.
The removing and slave status updating is now separated to a function.
As the MariaDBServer object now contains the updated slave connections,
keeping track of removed connections is no longer required.
The two cases are now separated. In switchover, the promotion and
demotion targets can swap connections between each other without worry.
In failover, the two connection lists must be merged semi-intelligently.
The slave connections of the two servers are now saved to the operation
descriptor object at the start of the operation. This allows slave status
updating during the operation.
Several small changes:
Binlog is flushed at the end of old master demotion.
Only new master is required to catch up to old master.
Use the same replication check method as failover.
No longer writes events to the master, as this creates problems if the
promoted server was not the overall master. Instead, the slave status
output is inspected.
In progress, does not yet overwrite existing code.
The new promotion mechanism automatically retries queries which timed out. It also
handles multimaster situations correctly.
The 'reset_replication' module command deletes all slave connections and binlogs,
sets gtid to sequence 0 and restarts replication from the given master. Should be
only used if gtid:s are incompatible but the actual data is known to be in sync.
Event handling is now enabled by default. If the monitor cannot query the EVENTS-
table (most likely because of missing credentials), print an error suggesting to
turn the feature off.
When disabling events on a rejoining standalone server (likely a former master),
disable binlog event recording for the session. This prevents the ALTER EVENT
queries from generating binlog events.
Also added documentation and combined similar parts in the code.
See script directory for method. The script to run in the top level
MaxScale directory is called maxscale-uncrustify.sh, which uses
another script, list-src, from the same directory (so you need to set
your PATH). The uncrustify version was 0.66.
Since the servers are not modified before or during the wait, the waiting
can be done in the preparation method. This simplifies the actual failover
somewhat, and allows the monitor to keep running normally while waiting for
the log to clear.
When the replication status from the external master is checked, the
pending status must be used. This makes sure that the SlaveStatusArray and
the server state are sync.
Also extended the message that was logged when the external master was
lost. By adding the network address there, it makes it easier to see where
the server was replicating from if only the log file is available.
This reduces the ambiguity of server id:s in the slave status contents.
If a slave connection has been seen properly connected at an earlier time,
it can be trusted to report the correct master server id. This also
fixes some wrong status assignment edge cases with the SERVER_WAS_SLAVE-bit.
The bit will be removed in a later commit.
Even this does not solve the situation when MaxScale is started with
some servers down.
The auto_failover is a more reliable solution and should be used instead. Several
unused parameters were removed, although they can still be defined in the config
file. Updated documentation on the relevant parts.