If replication is broken between the nodes, it is now fixed in
parallel on all nodes instead of one server at a time.
This reduces the time from about 120 seconds to 13 seconds. The time was
measured by running the check_backend test first with all backends broken
and then with all backends fixed, subtracting the time of the latter from
the former.
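
A rough sketch of the parallel approach; the fix_replication helper and
node count are illustrative, not the actual test-framework API:

    #include <thread>
    #include <vector>

    // Illustrative stub: repair replication on one server, e.g. by
    // running STOP SLAVE; CHANGE MASTER TO ...; START SLAVE on it.
    void fix_replication(int node)
    {
    }

    void fix_all_nodes(int node_count)
    {
        std::vector<std::thread> workers;

        // Repair every node concurrently; the total time is bounded
        // by the slowest node instead of the sum of all nodes.
        for (int i = 0; i < node_count; i++)
        {
            workers.emplace_back(fix_replication, i);
        }

        for (auto& t : workers)
        {
            t.join();
        }
    }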
As the wait_for_monitor function guarantees that the monitor notices the
state change, we can skip the replication fixing, which was somewhat
superfluous in the first place.
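
A minimal sketch of how a test can rely on this guarantee; only the
wait_for_monitor name comes from the test framework, the other helpers
are hypothetical:

    // Hypothetical helpers: cut connectivity to a backend and read the
    // state the monitor has assigned to it.
    void block_node(int node)
    {
    }

    bool server_is_down(int node)
    {
        return true;
    }

    // Stub for the actual test-framework function; it returns once the
    // monitor is guaranteed to have seen the change.
    void wait_for_monitor()
    {
    }

    void test_state_change()
    {
        block_node(0);
        wait_for_monitor();   // no sleeps or replication fixing needed

        if (!server_is_down(0))
        {
            // report test failure ...
        }
    }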
The tests were consistently unstable and, as a result, did not
provide any actionable output. In addition, these two tests were the
longest-running tests in the whole MaxScale test suite, so a redesign was
warranted.
Instead of emulating a client and a server failure, testing the
functionality directly makes for a test that is faster, more precise and
provides more actionable output. Due to the single-threadedness of the new
test, no cross-thread dependencies are present. In addition, the
superfluous log flushing was removed, as it almost always happened after
all transactions were already complete.
The estimated savings in test time alone are around 1100 seconds (roughly
18 minutes).
The test description talks about putting the master into maintenance mode,
but the test spends most of its time putting slaves into maintenance mode.
To make the test more precise (and faster), it can be reduced to blocking
the most often used slave and the master. The iteration count can also be
lowered from five to two while still getting at least two cycles of
maintenance mode.
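
A hedged sketch of the reduced loop; the maxctrl commands are the
standard way to toggle maintenance mode, but the server names are
placeholders:

    #include <cstdlib>
    #include <string>

    void set_maintenance(const std::string& server, bool on)
    {
        std::string cmd = "maxctrl ";
        cmd += on ? "set" : "clear";
        cmd += " server " + server + " maintenance";
        std::system(cmd.c_str());
    }

    int main()
    {
        // Two iterations give at least two full maintenance-mode
        // cycles for the master and the most often used slave.
        for (int i = 0; i < 2; i++)
        {
            set_maintenance("server1", true);    // master
            set_maintenance("server2", true);    // busiest slave
            // ... run queries, verify routing avoids both ...
            set_maintenance("server1", false);
            set_maintenance("server2", false);
        }
        return 0;
    }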
Removed the tests obsoleted by the sanity_check test case. This shortens
the test time by about a minute and a half and removes about 2500 lines of code.
The sanity check replaces several old regression tests and provides a
quick test mainly of the readwritesplit routing behavior. It also checks
some of the connection counts and runs queries that once caused a crash.
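
As an example of the kind of routing check it performs, a hedged sketch
using the MySQL C API; the host, port and credentials are placeholders
for a readwritesplit listener:

    #include <mysql.h>
    #include <cstdio>

    int main()
    {
        MYSQL* c = mysql_init(nullptr);

        // Placeholder connection details for the readwritesplit listener.
        if (!mysql_real_connect(c, "maxscale-host", "user", "pass",
                                "test", 4006, nullptr, 0))
        {
            fprintf(stderr, "connect: %s\n", mysql_error(c));
            return 1;
        }

        // A plain SELECT should be routed to a slave, so @@server_id
        // should not be the master's id.
        if (mysql_query(c, "SELECT @@server_id") == 0)
        {
            MYSQL_RES* res = mysql_store_result(c);
            MYSQL_ROW row = mysql_fetch_row(res);

            if (row)
            {
                printf("read routed to server_id %s\n", row[0]);
            }

            mysql_free_result(res);
        }

        mysql_close(c);
        return 0;
    }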
The set of tests that the sanity check obsoletes is:
bug422
bug469
bug448
bug507
bug509
bug634
bug694
bug669
bug711
mxs127
mxs47
mxs682_cyrillic
mxs957
mxs1786_statistics
rwsplit_read_only_trx
This should help prevent network disconnections and make the test more
stable. If the connection is lost, the automatic failover is disabled and
the test will fail.
The test doesn't work when ASAN is used, as ASAN increases the memory use
of the process. With the addition of more caches in 2.3, the test is also
more likely to fail. As the test is of little use with ASAN, it is better
to remove it.
If the password field in mysql.user is empty, it is possible that the
actual password is stored in the authentication_string field. Most of the
time this happens due to MDEV-16774, which causes the password to be
stored in the authentication_string field.
Also added a test case that reproduces the problem and verifies that it
is fixed by this commit.
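
A minimal sketch of the fallback logic described above (the field
handling is illustrative):

    #include <string>

    // If mysql.user.password is empty, the hash may be stored in
    // authentication_string instead (commonly due to MDEV-16774).
    std::string effective_password(const std::string& password,
                                   const std::string& authentication_string)
    {
        return password.empty() ? authentication_string : password;
    }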
The intention was to send the lowest backend version string automatically
to the client instead of the default handshake version. This did not work
as the service version string was used instead of the server version.
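
A hedged sketch of the intended behavior, picking the numerically lowest
version among the backends; the parsing and names are illustrative and
the list is assumed to be non-empty:

    #include <algorithm>
    #include <array>
    #include <cstdio>
    #include <string>
    #include <vector>

    // Turn "10.2.21-MariaDB" into {10, 2, 21} for comparison.
    static std::array<int, 3> version_key(const std::string& v)
    {
        std::array<int, 3> key{0, 0, 0};
        sscanf(v.c_str(), "%d.%d.%d", &key[0], &key[1], &key[2]);
        return key;
    }

    std::string lowest_backend_version(const std::vector<std::string>& versions)
    {
        // The lowest backend version, not the service's configured
        // version string, should be sent in the handshake.
        return *std::min_element(versions.begin(), versions.end(),
                                 [](const std::string& a, const std::string& b) {
                                     return version_key(a) < version_key(b);
                                 });
    }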
The table creation was not detected because the function used to extract
the table name did not return fully qualified names. Even if it had
returned a fully qualified name, it would not have been correctly
processed.
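
A sketch of the qualification step that was missing (the helper name is
illustrative):

    #include <string>

    // Qualify a table name with the current database when the
    // statement itself does not, e.g. "t1" with "db1" becomes "db1.t1".
    std::string qualify_table(const std::string& current_db,
                              const std::string& table)
    {
        if (table.find('.') != std::string::npos)
        {
            return table;   // already fully qualified
        }
        return current_db + "." + table;
    }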
The test did not use the wait_for_monitor function to sync with the
monitor. Using it speeds up the test greatly by removing unnecessary
sleeps.
Also reduced the amount of data inserted into the cluster. There's no real
need to test with large amounts of data as it is only a functional test.
If the test fails, there's no point in continuing with the load generation
as it only serves to slow things down. In a few cases the load generation
caused std::bad_alloc to be thrown, which prematurely stopped the ctest run.
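
A minimal sketch of bailing out of the load generation on failure; the
failure flag and batch loop are illustrative:

    #include <atomic>

    std::atomic<bool> test_failed{false};   // set by the verification code

    void generate_load(int batches)
    {
        // Stop as soon as the test has failed; pushing more data only
        // slows things down and in some runs ended in std::bad_alloc.
        for (int i = 0; i < batches && !test_failed.load(); i++)
        {
            // ... insert one batch of rows ...
        }
    }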
Now the test program will do the following (a hedged sketch of steps 2
and 3 follows the list):
1) Write to each node in a Galera cluster and verify that the data
ends up in the slave.
2) At the end of 1) execute STOP SLAVE and START SLAVE to check that
replication can be stopped and started again (won't work unless
each node has the same server_id and value for @@log_bin_basename).
3) Block the node BLR is replicating from and expect it to connect
to the next configured master and that replication continues to
work. Do that for all nodes.
4) Stop MaxScale and restart it and expect 3) to work. That checks
that BLR saves all necessary information in master.ini and is
capable of reading it.
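
In the sketch below the SQL is standard, but the connection handling and
the block_node helper are placeholders:

    #include <mysql.h>

    void stop_start_slave(MYSQL* blr)
    {
        // Step 2): replication must survive a stop/start cycle.
        mysql_query(blr, "STOP SLAVE");
        mysql_query(blr, "START SLAVE");
        // ... check SHOW SLAVE STATUS for Slave_IO_Running = Yes ...
    }

    // Hypothetical helper: cut connectivity to one Galera node.
    void block_node(int node)
    {
    }

    void failover_round(MYSQL* blr, int current_master)
    {
        // Step 3): after blocking the current master, BLR should
        // connect to the next configured master and keep replicating.
        block_node(current_master);
        // ... insert on the new master, verify the row arrives ...
    }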
It should be possible to run START SLAVE and STOP SLAVE irrespective
of which Galera node the updates are made to.
That will be the case if @@log_slave_updates is on and each node
in the Galera cluster has the same server_id. Otherwise it will
fail with the current incarnation of BLR.
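
A sketch of checking those preconditions on one node; the connection
details are placeholders, and the caller must also compare the
@@server_id value across all nodes:

    #include <mysql.h>
    #include <cstdio>

    bool check_node(MYSQL* node)
    {
        if (mysql_query(node, "SELECT @@log_slave_updates, @@server_id"))
        {
            return false;
        }

        MYSQL_RES* res = mysql_store_result(node);
        MYSQL_ROW row = mysql_fetch_row(res);
        bool ok = false;

        if (row && row[0] && row[1])
        {
            printf("log_slave_updates=%s server_id=%s\n", row[0], row[1]);
            ok = row[0][0] == '1';
        }

        mysql_free_result(res);
        return ok;
    }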
When running BLR locally, you need to be able to specify what
IP BLR is visible at (127.0.0.1 does not work for VM nodes)
and also to perform cleanup and other actions when needed.
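
One way to make the address configurable, as a sketch; the BLR_HOST
variable name is hypothetical, not an actual MaxScale or test-suite
setting:

    #include <cstdlib>
    #include <string>

    // Address BLR is advertised at; 127.0.0.1 only works when the
    // whole setup runs locally, not with VM nodes.
    std::string blr_host()
    {
        const char* host = std::getenv("BLR_HOST");
        return host ? host : "127.0.0.1";
    }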