The blocking of the nodes that happens before it could otherwise cause
the connections to break. This also removes the need to fix the
replication, which takes time.
If the test uses two MaxScales, they are automatically stopped after the
test. This prevents the second MaxScale from interfering with subsequent
tests.
By exposing a (currently undocumented) debug endpoint that waits for one
monitor interval to pass, the monitor waiting functionality becomes much
easier to reuse. With it, when MaxScale is started by the test framework,
the framework knows that at least one monitor interval has passed for all
monitors and that the system is ready to accept queries.
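
As a rough sketch, the test framework could trigger this wait over the
REST API. The endpoint path, HTTP method and credentials below are
assumptions for illustration, not the documented interface:

    #include <curl/curl.h>
    #include <cstdio>
    #include <stdexcept>

    // Block until every monitor has completed at least one interval.
    // The endpoint path and credentials are assumptions.
    void wait_for_monitor_interval(const char* host, int port)
    {
        CURL* curl = curl_easy_init();
        if (!curl)
        {
            throw std::runtime_error("curl_easy_init failed");
        }

        char url[256];
        snprintf(url, sizeof(url),
                 "http://%s:%d/v1/maxscale/debug/monitor_wait", host, port);
        curl_easy_setopt(curl, CURLOPT_URL, url);
        curl_easy_setopt(curl, CURLOPT_USERPWD, "admin:mariadb");

        // The request returns only after the monitors have run, so a
        // successful response means the system is ready for queries.
        CURLcode rc = curl_easy_perform(curl);
        curl_easy_cleanup(curl);

        if (rc != CURLE_OK)
        {
            throw std::runtime_error(curl_easy_strerror(rc));
        }
    }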
As the ssh_node_f function supports full shell syntax, all of the work can
be done with a single SSH connection. This removes the overhead that each
extra SSH connection adds.
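
For example, what used to take three round trips can be chained into one
call. The signature used here (node index, sudo flag, printf-style format
string) is an assumption for illustration:

    // One SSH connection instead of three.
    ssh_node_f(0, true,
               "mkdir -p /tmp/test-artifacts && "
               "cp /etc/maxscale.cnf /tmp/test-artifacts/ && "
               "systemctl restart maxscale");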
The collection of the various artifacts generated by a test case and the
core dump detection are now done in the same SSH command. This removes the
extra overhead that the separate command added.
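
A sketch of the combined command; the paths and the marker string are
illustrative only:

    // Collect the logs and detect core dumps in one SSH command. The
    // test framework can then look for the marker string in the output.
    ssh_node_f(0, true,
               "chmod -R a+r /var/log/maxscale; "
               "ls /tmp/core* > /dev/null 2>&1 && echo CORE_DUMP_FOUND");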
There were a total of five SSH connections opened at the start of each
test. Only two of these are currently required: the SSL certificate
directory check and the actual command that restarts MaxScale. Two of the
three remaining commands, stopping of MaxScale and copying of the
configuration, can be made conditional or combined into other
commands.
The stopping of MaxScale is done to prevent it from interfering with the
cluster setup process. As MaxScale does nothing if nothing is wrong, it is
safe to make the restart conditional so that it is done only when a
problem in the cluster setup is detected.
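
A minimal sketch of the conditional restart, where cluster_setup_ok()
stands in for whatever check the framework performs (the helper name is
hypothetical):

    // Restart MaxScale only when a problem in the cluster setup is found.
    if (!cluster_setup_ok())
    {
        ssh_node_f(0, true, "systemctl restart maxscale");
    }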
The final SSH command, the MaxScale health check via maxadmin, can be
removed as it is redundant: the daemonization already covers this by
exiting only after MaxScale is ready.
A certain templated parameter was only substituted when the VMs were
provisioned. This needs to be handled by the test framework to allow
changes to the Galera cluster configuration.
Also made the startup of the "lesser" nodes parallel to minimize the
startup time.
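
A sketch of the parallel startup with std::async; start_node() is an
assumed helper that blocks until the given node is up:

    #include <future>
    #include <vector>

    void start_node(int node);  // assumed helper

    void start_remaining_nodes(int node_count)
    {
        std::vector<std::future<void>> tasks;

        // Launch the startup of every "lesser" node concurrently.
        for (int i = 1; i < node_count; i++)
        {
            tasks.push_back(std::async(std::launch::async,
                                       [i] { start_node(i); }));
        }

        // Wait for all of them; get() rethrows any startup failure.
        for (auto& t : tasks)
        {
            t.get();
        }
    }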
The Galera configurations need pre-processing before they can be
used. Switched to std::endl to automatically flush the output at the end
of each line, which makes it easier to see what is happening when the
tests are run by buildbot (see the sketch below). Also removed the extra
startup of the servers that was done right after installing the database.
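
To illustrate the flushing difference: "\n" only appends a newline and
may leave the text sitting in the stream buffer, while std::endl also
flushes, so each line shows up in the buildbot log as soon as it is
written.

    #include <iostream>

    void log_progress(int node)
    {
        std::cout << "Starting node " << node << "\n";       // may stay buffered
        std::cout << "Starting node " << node << std::endl;  // flushed immediately
    }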
Grouped all binlogrouter and avrorouter tests so that they are executed as
the last tests. This helps prevent some side effects that result from the
"aggressive" replication modifications that these tests make. Also removed
some commented-out test cases.
If the replication is broken between the nodes, it is now fixed in
parallel on all nodes instead of doing it one server at a time.
This reduces the time from about 120 seconds to 13 seconds. The time was
measured by running the check_backend test first with all backends broken
and then with the backends fixed, subtracting the time of the latter from
the former.
As the wait_for_monitor function guarantees that the monitor notices the
state change, we can skip the replication fixing, which was somewhat
superfluous in the first place.
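
In practice a test can now block a node and rely only on the wait; the
member names in this sketch follow the test framework's style but are
assumptions:

    // Block the node, then wait until the monitor has seen the state
    // change. No manual replication repair is needed afterwards.
    test.repl->block_node(0);
    test.maxscale->wait_for_monitor();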
The tests were consistently unstable and, as a result, did not provide any
actionable output. In addition, these two tests were the longest-running
tests in the whole MaxScale test suite, so a redesign was warranted.
Instead of emulating a client and a server failure, testing the
functionality directly makes for a test that is faster, more precise and
provides more actionable output. As the new test is single-threaded, there
are no cross-thread dependencies. In addition, the superfluous log
flushing was dropped, as it almost always happened after all transactions
were already complete.
The estimated savings in test time alone are around 1100 seconds (roughly
18 minutes).
The test description talks about putting the master into maintenance mode,
but the test spends most of its time putting slaves into maintenance
mode. To make the test more precise (and faster), it can be reduced to
blocking the most often used slave and the master. The iteration count can
also be lowered from five to two, which still gives at least two cycles of
maintenance mode.
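
A sketch of the reduced test loop; the helper names are hypothetical:

    // Two iterations: maintenance mode on the most used slave, then on
    // the master, with queries running against MaxScale in between.
    for (int i = 0; i < 2; i++)
    {
        set_maintenance(busiest_slave());
        run_test_queries();
        clear_maintenance(busiest_slave());

        set_maintenance(master());
        run_test_queries();
        clear_maintenance(master());
    }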