Added a test case that does a set of sanity checks on the monitor. As the
monitor is very simple, there is not much to test without access to the
actual instances (ExeMgr failures, for example, can only be tested against
a real cluster). Currently the test always passes, as ColumnStore clusters
aren't implemented in the test framework.
Removed the tests obsoleted by the sanity_check test case. This shortens
the test time by about a minute and a half and removes about 2500 lines of code.
This should help prevent network disconnections and make the test more
stable. If the connection is lost, the automatic failover is disabled and
the test will fail.
The test doesn't work when ASAN is used, as ASAN increases the memory use
of the process. With the addition of more caches in 2.3, the test is also
more likely to fail. Since the test is of little use with ASAN, it is
better to remove it.
Now the test program will
1) Write to each node in a Galera cluster and verify that the data
ends up in the slave.
2) At the end of 1), execute STOP SLAVE and START SLAVE to check that
   replication can be stopped and started again. This won't work unless
   each node has the same server_id and the same value for
   @@log_bin_basename (see the SQL sketch after this list).
3) Block the node BLR is replicating from, and expect BLR to connect
   to the next configured master and replication to continue working.
   Do that for all nodes.
4) Stop and restart MaxScale and expect 3) to still work. That checks
   that BLR saves all the necessary information in master.ini and is
   capable of reading it back.
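As a rough illustration, steps 1) and 2) boil down to SQL along these
lines (a hedged sketch, not the actual test code; the table name and
values are made up):

    -- On each Galera node in turn: insert a row tagged with the node.
    CREATE TABLE IF NOT EXISTS test.t1 (id INT PRIMARY KEY, origin VARCHAR(32));
    INSERT INTO test.t1 VALUES (1, 'galera_node_1');

    -- On the slave replicating from BLR: verify that the row arrived.
    SELECT origin FROM test.t1 WHERE id = 1;

    -- Step 2): stop and restart replication on the slave.
    STOP SLAVE;
    START SLAVE;
    SHOW SLAVE STATUS\G -- expect Slave_IO_Running: Yes, Slave_SQL_Running: Yes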
It should be possible to execute START SLAVE and STOP SLAVE irrespective
of which Galera node updates are made to.
That will be the case if @@log_slave_updates is on and each node
in the Galera cluster has the same server id. Otherwise it will
fail with the current incarnation of BLR.
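In practice that means settings along these lines in each Galera node's
server configuration (a sketch; giving every node the same server_id is
unusual, but it is what the current BLR requires here):

    [mysqld]
    server_id         = 1            # deliberately identical on every node
    log_slave_updates = ON
    log_bin           = mariadb-bin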
If
* BLR replicates from a node in a Galera cluster,
* writes are made to all nodes in that cluster, and
* a slave of BLR is stopped after it has received an event
  originating on a node other than the one BLR is replicating from,
then the subsequent (re)starting of the slave will fail, because BLR
looks for the last event in a file whose path contains the server id
of the node where the event originated, although it should look in
the file whose path contains the server id of the node from which
BLR replicates.
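As a made-up illustration, assuming a binlog directory layout of the
form <binlogdir>/<domain_id>/<server_id>/<file>: if BLR replicates from
the node with server id 1, but the last event the slave received
originated on the node with server id 3, then

    <binlogdir>/0/3/mariadb-bin.000001   <-- where BLR looks
    <binlogdir>/0/1/mariadb-bin.000001   <-- where it should look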
The behavior of mariadbmon was changed so that it better understands
slaves attempting to replicate. Rewrote the test to accommodate the
change in behavior and took the opportunity to use newer code.
The switchover sometimes fails due to a broken connection when the STOP
SLAVE on the new master is executed. Nothing is logged on the server in
question and the error message simply states that the connection was lost
in the middle of a query.
Increasing query_retries to 1 reduced the likelihood of failure from
about 1/3 of test runs failing to roughly 1/6. Increasing it to 5
seems to remove the failures completely. We do not yet know the real
reason why this happens.
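For reference, query_retries is a global MaxScale parameter, so the
fix amounts to one line in the test's maxscale.cnf (other settings
elided):

    [maxscale]
    query_retries = 5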
Because of monitor changes, the test's assumptions were no longer
valid. Renamed the test and updated it to use MaxCtrl for some
queries. Also, changed the type of the cycle container in the monitor
to an ordered map, whose deterministic iteration order makes the
results predictable.
The test environment isn't always pristine after a test run, so for
the sake of being able to actually test what we're attempting to
test, we should ignore duplicate databases for the time being.
The long-term fix is to detect when a test doesn't clean up after
itself.