Don't test failover functionality when it is not needed. The bug is only
about the extra events that appear when a master is demoted and a slave is
promoted.
The tests can now wait for a number of monitor intervals. This removes the
need to have hard-coded sleeps in the code and makes monitor tests more
robust under heavier load.
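A minimal sketch of the idea of waiting for monitor intervals instead of sleeping, assuming the test can read some counter that advances once per monitor pass. The helper name `wait_for_monitor` and the `read_ticks` callback are hypothetical and are not taken from the test framework:

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper: blocks until the monitor has completed `intervals`
// more passes, or until `timeout` expires. `read_ticks` is an assumed
// callback that returns the monitor's current pass counter.
bool wait_for_monitor(const std::function<long()>& read_ticks,
                      int intervals,
                      std::chrono::milliseconds timeout = std::chrono::seconds(30))
{
    assert(intervals > 0);
    const long target = read_ticks() + intervals;
    const auto deadline = std::chrono::steady_clock::now() + timeout;

    while (read_ticks() < target)
    {
        if (std::chrono::steady_clock::now() >= deadline)
        {
            return false;   // The monitor did not advance in time
        }
        // Short poll delay; the total wait is tied to monitor progress,
        // not to a fixed duration.
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
    return true;
}
```

Because the wait depends on the counter rather than on wall-clock time, a slow machine makes the test take longer instead of making it fail.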
Port 9003 is not open by default in the test environment. Changing it to
port 4006, which is open, will work around this restriction.
Also added the mysql_error output to the error message when the query
fails.
The `MYSQL_ROW row` variable was being overwritten by the extra query done
by the SST method detection code. Moving it into its own function prevents
this and makes the code significantly easier to comprehend.
Added a test case that reproduces the problem (MaxScale crashed) and
verifies that the patch fixes it.
The test checks that failover works even when the master of the monitored
cluster is a slave to an external master. The test also verifies that the
servers do not get unexpected status labels.
Tests that local_address is taken into account. However, at the time
of writing the maxscale VM does not have two usable IP addresses, so
we only test that explicitly specifying an IP address does not break
things.
Locally it has been confirmed that this indeed works the way it is
supposed to.
- Start 4 threads where each thread sits in a loop and performs
  20% updates and 80% selects. Each thread has a table of its own.
- The main thread executes the following in a loop.
  - Perform a switchover from the current master to the next (which is
    simply (current node + 1) % number of nodes).
- Keep on doing that for 1.5 minutes.
The expectation is that each switchover will succeed, that is, after the
operation there will be a new master.
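The round-robin choice of the next master can be sketched as follows. The function name `next_master` and the zero-based node indexing are assumptions made for illustration only:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: returns the zero-based index of the node that
// should become master after `current`, wrapping around at the end.
std::size_t next_master(std::size_t current, std::size_t node_count)
{
    assert(node_count > 0);
    return (current + 1) % node_count;
}
```

With four nodes, repeated calls starting from node 0 visit 1, 2, 3 and then wrap back to 0, so every node takes a turn as master during the run.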
- Start 4 threads where each thread sits in a loop and performs
  20% updates and 80% selects. Each thread has a table of its own.
- The main thread executes the following in a loop.
  - Take down the current master and wait a while (failover assumed
    to happen).
  - Put up the old master node and wait a while.
- Keep on doing that for 1.5 minutes.
At the end check that:
- There is one 'Master'.
- The other nodes are either
  - 'Slave' or
  - 'Running', in which case the test checks that this is because the
    node could not be rejoined.
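The end-of-test check above can be sketched roughly as follows. This is not the test's actual code: the `StatusSet` type, the `end_state_ok` function and the `rejoin_impossible` flags (assumed to be worked out separately by the test) are hypothetical names:

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// The status labels reported for one server, e.g. {"Master", "Running"}.
using StatusSet = std::set<std::string>;

// Returns true if exactly one node is 'Master' and every other node is
// either 'Slave', or merely 'Running' with a known reason why it could
// not be rejoined. rejoin_impossible[i] is an assumed per-node flag.
bool end_state_ok(const std::vector<StatusSet>& nodes,
                  const std::vector<bool>& rejoin_impossible)
{
    assert(nodes.size() == rejoin_impossible.size());
    int masters = 0;

    for (std::size_t i = 0; i < nodes.size(); ++i)
    {
        const StatusSet& status = nodes[i];

        if (status.count("Master"))
        {
            ++masters;
        }
        else if (status.count("Slave"))
        {
            // A properly replicating slave is always acceptable.
        }
        else if (status.count("Running") && rejoin_impossible[i])
        {
            // Running but not replicating, for an expected reason.
        }
        else
        {
            return false;
        }
    }

    return masters == 1;
}
```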
The tests now reset the replication state using queries and switchover
instead of calling fix_replication(). The results are checked, so these
tests now test switchover as well.
Also, reduce printing when verbose is on for any test using the
get_output() function in fail_switch_rejoin_common.cpp.
auto_failover=true
auto_rejoin=false
This test tests the following:
- Regular master-slave setup
- Create a table, insert some data
- Sync all slaves
- Stop a slave
- Insert some more data
- Sync remaining slaves
- Stop the master
- Expect the failover mechanism to pick a new master (server2)
- Bring up the slave
- Perform a switchover from server2 to server4
  - Should fail
Currently it does fail, but only due to a timeout:
[mysqlmon] MASTER_GTID_WAIT() timed out on slave 'server4'.
There should be some check that ensures the failure happens
faster than that.
This test tests the following:
- Regular master-slave setup
- Create a table, insert some data
- Sync all slaves
- Stop a slave
- Insert some more data
- Sync remaining slaves
- Stop the master
- Expect the failover mechanism to pick a new master
- Bring up the slave
- Expect the slave to be rejoined
- The test starts with the usual setup of 1 master and 3 slaves.
- Then the master is taken down and it is checked that the failover
mechanism promotes some slave to master.
- This is continued until there is a single master left (no
slaves).
The same test now has two versions. In the automatic version failover
begins automatically. In the manual version failover is started with
maxadmin. The tests are otherwise identical.