The blocking of the nodes that happens before it could cause the
connections to break. This also removes the need for the fixing of the
replication which takes time.
If the test uses two MaxScales, they are automatically stopped after the
test. This prevents the second MaxScale from interfering with subsequent
tests.
By exposing a (currently undocumented) debug endpoint that lets one
monitor interval pass, we make the reuse of the monitor waiting
functionality a lot easier. With it, when MaxScale is started by the test
framework it knows that at least one monitor interval will have passed for
all monitors and that the system is ready to accept queries.
As the ssh_node_f function supports full shell syntax, all of the work can
be done with a single ssh connection. This removes the overhead that each
extra ssh connection adds.
The collection of the various artifacts generated by a test case and the
core dump detection is now done in the same SSH command. This removes the
extra overhead that it added.
There were a total of five SSH connections opened at the start of each
test. Only two of these are currently required: the SSL certificate
directory check and the actual command that restarts MaxScale. Two of the
three remaining commands, stopping of MaxScale and copying of the
configuration, can be made conditional or combined into other
commands.
The stopping of MaxScale is done to prevent it from interfering with the
cluster setup process. As MaxScale does nothing if nothing is wrong, it is
safe to make the restart conditional so that it is done only when a
problem in the cluster setup is detected.
The final SSH command, the MaxScale health check via maxadmin, can be
removed as it is redundant: the daemonization already covers this by
exiting only after MaxScale is ready.
A certain templated parameter was only substituted when the VMs were
provisioned. This needs to be handled by the test framework to allow
changes into Galera clusters configuration.
Also made the startup of the "lesser" nodes parallel so minimize the
startup time.
The galera configurations need pre-processing before they can be
used. Switched to std::endl to automatically flush the output at the end
of each line. This makes it easier to see what is happening when the tests
are ran by buildbot. Also removed the extra startup of the servers that
was done right after installing the database.
Grouped all binlogrouter and avrorouter tests so that they are executed as
the last tests. This helps prevent some side effects that result from the
"aggressive" replication modifications the tests do. Also removed some
commented out test cases.
If the replication is broken between the nodes, it is now fixed in
parallel on all nodes instead of doing it one server at a time.
This reduces the time from about 120 seconds to 13 seconds. The time was
measured by running the check_backend test first with all backends broken
and then with the fixed backends subtracting time of the latter from the
former.
As the wait_for_monitor function guarantees that the monitor notices the
state change, we can skip the replication fixing which was somewhat
superficial in the first place.
The tests were consistently unstable and as a result of this did not
provide any actionable output. In addition to this these two test were the
longest running tests in the whole MaxScale test suite so a re-design was
warranted.
Instead of emulating a client and a server failure, testing functionality
provides for a test that is faster, more precise and provides more
actionable output. Due to the single-threadedness of the new test, no
cross-thread depencies are present. In addition to this, the superfluous
log flushing was not done as it almost always happened after all
transactions were already complete.
The estimated savings in test time alone is around 1100 seconds (roughly
18 minutes).
The test description talks about putting the master into maintenance mode
but it spends most of the time putting slaves into maintenance mode. To
make the test more precise (and fast) the test can be reduced to blocking
the most often used slave and the master. The iteration count can also be
lowered from five to two to get at least two cycles of maintenance mode.
Removed the tests obsoleted by the sanity_check test case. This shortens
the test time by about a minute and a half and removes about 2500 lines of code.
The sanity check replaces several old regression tests and provides a
quick test for checking mainly the readwritesplit routing behavior. It
also checks some of the connection counts and runs queries that once
caused a crash.
The set of tests that the sanity check obsoletes is:
bug422
bug469
bug448
bug507
bug509
bug634
bug694
bug669
bug711
mxs127
mxs47
mxs682_cyrillic
mxs957
mxs1786_statistics
rwsplit_read_only_trx
This should help prevent network disconnections and make the test more
stable. If the connection is lost, the automatic failover is disabled and
the test will fail.
The test doesn't work when ASAN is used as it increases the memory use of
the process. With the addition of more caches in 2.3, the test is also
more likely to fail. Due to the test being quite useless with ASAN, it is
better to remove it.
If the password field in mysql.user is empty, it is possible that the
actual password is stored in the authentication_string field. Most of the
time this happens due to MDEV-16774 which causes the password to be stored
in the authentication_string field.
Also added a test case that verifies the problem and that it is fixed by
this commit.