Commit Graph

491 Commits

Author SHA1 Message Date
4167e88719 MXS-1751: Fix crash with available_when_donor=true
The `MYSQL_ROW row` variable was being overwritten by the extra query done
by the SST method detection code. Moving it into its own function prevents
this and makes the code significantly easier to comprehend.

Added a test case that reproduced the problem (MaxScale crashed) and
verifies that the patch fixes the problem.
2018-03-31 20:21:07 +03:00
7209080236 MXS-1747 Improve error messages of rejoin operations
Now states which query caused the error.
2018-03-28 12:39:10 +03:00
6c32c7421b MXS-1746 Query global gtid_domain_id instead of session-specific value
The monitor queried the session-specific domain id, which does not follow the global
value while the session is alive. This caused the monitor to follow the wrong gtid
domain if the domain was changed after MaxScale was started. This patch modifies the
query to read the global value instead. Even this is not fool-proof, as existing
sessions can issue writes with the old domain, confusing the gtid-parsing.
2018-03-28 12:23:57 +03:00
bd8b6dbc6f MXS-1722 Add better error messages to switchover_demote_master()
The error messages should now be a bit more reliable.
2018-03-21 15:04:39 +02:00
2178667245 MXS-1679 Check for existence of master before continuing failover checks
Seems to fix the issue with MaxScale detecting an old master down event.
2018-03-16 11:26:58 +02:00
b982458497 MXS-1679 Add more accurate error printing
The reason for rejoin failing should now be clearer.
2018-03-12 17:16:54 +02:00
5a62adc63e MXS-1678: Detect broken replication with Last_IO_Errno
This commit introduces changes that fix the relay master detection that
was broken by the merge from 2.1 into 2.2 by commit
1ecd791887994209eb29e56e1271f8c407cd0cdf.

In 2.2, the master server ID is used to detect whether a slave is actually
replicating from a master. The value is still displayed even if the slave
is not actively replicating from a master. The commit in 2.1 causes this
value to be stored unconditionally if it is available. By checking the
value of Last_IO_Errno and comparing it to a list of known error codes, we
know whether the slave is replicating properly.

The slave detection in 2.2 correctly identifies a broken slave with a
stopped IO thread. Due to this, the test case must be modified to check
that the relay master is not a slave if the IO thread is stopped.
2018-03-12 14:55:54 +02:00
f7b284bbb7 Check IO thread status when verifying master failure
When MaxScale thinks that the master has failed, it tries to verify it by
seeing if the slave server is receiving events. There was a missing IO
thread status check in the slave_receiving_events function which caused
the failover to wait until the verification timed out.

The relay master detection logic also lacked a check for the slave SQL
thread status. The code should check the state of the SQL thread to
determine whether the server is actually a functional slave to a master.
2018-03-09 20:53:56 +02:00
d443e22d1b Merge branch '2.2.3' into 2.2 2018-03-09 20:50:01 +02:00
f4c7a4700a Disable fix to MXS-1678 in 2.2.3
The fix causes a regression in the failover functionality as there is a
dependency between the slave's master ID and how the failover
performs. This dependency should not exist but fixing it causes a problem
with the mysqlmon_rejoin_bad2 test.
2018-03-08 21:03:52 +02:00
ff9024bdfb MXS-1698: Remove false debug assertion
It is not an error if the correct GTID is not found and thus it should not
be asserted that one is found.
2018-03-07 11:55:46 +02:00
39d3c42c94 Merge branch '2.1' into 2.2 2018-03-01 17:52:42 +02:00
b67ab83486 Revert "Use dedicated header in NDBClusterMon"
This reverts commit b9d80f6061d6b536d7a15febf0367e5f6dba0e84.
2018-02-24 15:43:15 +02:00
0bbf0246f9 Revert "Compile mariadbmon.h as C++"
This reverts commit 60d57aee61d96832aeec1b8a61d36803c38ca77c.
2018-02-24 15:40:21 +02:00
236e906d88 Revert "Turn MariaDB Monitor struct to class with public fields"
This reverts commit cb6f70119d9857b277306e9af5881fe29c574a32.
2018-02-24 15:37:50 +02:00
13661ab4a6 Revert "MariaDB Monitor: Move additional classes to separate file"
This reverts commit ff55106610881d55db88eca9e2ef6a056cbc8d51.
2018-02-24 15:35:36 +02:00
e721733434 Revert "MariaDBMon: Move replication manipulation functions to a separate file"
This reverts commit 8cdd23dda2add6486abb685834def94c72a09b6c.
2018-02-24 15:35:02 +02:00
8cdd23dda2 MariaDBMon: Move replication manipulation functions to a separate file
Refactoring continues. This update moves some of the replication manipulation
functions to a separate file and turns them into class methods.
2018-02-22 10:51:52 +02:00
ff55106610 MariaDB Monitor: Move additional classes to separate file
Also use stl containers in monitor definition.
2018-02-21 12:24:24 +02:00
cb6f70119d Turn MariaDB Monitor struct to class with public fields
Allows using std::string for strings. Also, cleanup.
2018-02-21 11:00:42 +02:00
1ecd791887 MXS-1678: Store master_id even when IO thread is stopped
When the IO thread of a relay master is stopped, the knowledge that it is
not a real master but a relay master is lost. To prevent this loss of
information, the master server's server_id value should always be stored
if it is available.
2018-02-21 09:35:42 +02:00
60d57aee61 Compile mariadbmon.h as C++ 2018-02-20 11:14:21 +02:00
b9d80f6061 Use dedicated header in NDBClusterMon
NDBClusterMonitor used the MariaDBMonitor header instead of its own.
2018-02-20 11:09:04 +02:00
754d80da75 Do not auto_rejoin if maxscale is passive 2018-02-14 17:30:02 +02:00
3b2ec4ab5a Change references from MySQL to MariaDB
A few were missed when the renaming was done. Also renamed the file to
mariadbmon.cc.
2018-02-14 14:47:03 +02:00
b94f3b8792 Failover: do not check cluster stabilization if no slaves
Caused debug assert.
2018-02-12 12:37:41 +02:00
b8d3da4968 Add error tolerance to "servers_no_promotion"
Previously, if the list contained servers that were not monitored by
the monitor yet were valid servers, an error value would be returned
and the monitor failed to start.

With this update, the non-monitored servers are simply ignored when
forming the final list.

Also, added printing of the list to diagnostics.
2018-02-12 10:49:28 +02:00
faaf43ff39 Add gtid to monitor diagnostics, clean up formatting
Gtid:s are now queried every monitor loop.

dignostics() no longer prints slave related info if the server has
no slave connection.
2018-02-10 12:32:56 +02:00
a0d9c7da74 External master server support for failover/switchover
If the master is replicating from an external master, the monitor will save the
host:port of the external server. During demotion, the old master stops the external
replication while the new master begins it. Also, any commands that would add
to gtid have to be omitted when an external master is in play.
2018-02-08 18:44:08 +02:00
ed28d986e9 Fix debug assertion when no master is present
If no master is present, the debug assertion would dereference a NULL
pointer.
2018-02-08 13:33:31 +02:00
b059d78a30 Fix master loss on split cluster
When four servers (A, B, C and E where E and A replicate from each other
and A is the master for B and C) form a cluster and only three of them (A,
B and C) are configured into MaxScale, a failover operation from A to B
(making B the current master) and a restart of A causes B to lose its
master status.

The following diagram illustrates the state of the cluster at the end of
the process described above.

      +----------------------+
      |        +---+         |
  +------------+ B <-+       |
+-v-+ |        +---+ |       |
| E | |              |       |
+-^-+ |  +---+     +-+-+     |
  +------+ A |     | C |     |
      |  +---+     +---+     |
      |                      |
      +----------------------+

The external server E was not correctly ignored in the replication
topology generation causing both A and B to be seen as the lowest slave
nodes in the tree. From a theoretical point of view this is the correct
interpretation as there are two distinct trees and neither of them
contains any true masters.

In practice, MaxScale should treat any servers that replicate from an
external master as root level master nodes. Doing this guarantees that they
are labeled as masters if they have slaves replicating from them.
2018-02-08 12:48:56 +02:00
6c42221c9c MXS-1654 Prevent crash in debug mode if no master is found 2018-02-07 16:43:13 +02:00
8558ace801 Clean up ignore_external_masters code
Removed redundant checks and cleared only bits that aren't already
cleared.
2018-02-07 16:07:16 +02:00
8bf756ca56 Update root_master when using a standalone master
When detect_standalone_master is enabled, the root_master variable was not
updated after the master was changed by the standalone server detection
mechanism. This caused debug assertions to fire in addition to possibly
causing some of the ignore_external_masters logic to break.
2018-02-07 16:07:16 +02:00
1cf3de4a74 Add config parameter for excluding servers from failover
"servers_no_promotion" is a comma-separated list of servers
which cannot be chosen when selecting a new master during failover
(auto or manual), or when automatically selecting a new master
for switchover (currently disabled).

The servers in the list are redirected normally and can be promoted
by switchover when manually selecting a new master.
2018-02-07 14:07:10 +02:00
4a478d31f3 Print Gtid IO position during monitor diagnostics 2018-02-05 17:21:20 +02:00
f6afb0c6d1 MXS-1643: Make Master and Slave status mutually exclusive
The Master status now prevents Slave status from being assigned to a
server. In practice this simply means that the master will not have both
the Master and Slave status bits.
2018-02-05 16:53:43 +02:00
a83b36ca45 Use 64 bits for storing server id
In debug mode, when scanning the server id from a string, check that resulting
number is 32bit. Also, when querying the server id, query the global version.
Now, if a super user modifies the server id the monitor will notice it.

Server id:s in gtid:s are handled similarly.
2018-02-02 11:34:32 +02:00
255250652d Refactor pre-switchover, add similar checks as in failover
Now detects some erroneous situations before starting switchover.
Switchover can be activated without specifying current master.
In this case, the cluster master server is selected.
2018-01-31 10:40:09 +02:00
e455e8d43c Simplify monitor start and stop during switchover/failover
If these operations failed, the monitor could be left stopped.
2018-01-29 15:44:32 +02:00
9ded584836 Check that all slaves use gtid replication before performing failover 2018-01-29 13:33:16 +02:00
257034bf3e Clarify master failure verification
The two previous functions were somewhat overlapping.
2018-01-23 16:14:50 +02:00
753b97303a MXS-1606: Create maxscale_schema database if missing
The monitor will now also create the database if it is missing. Since it
already creates the table, also creating the database is not a large
addition.

Cleaned up some of the related checking code and combined them into a
simple utility function.
2018-01-19 11:34:25 +02:00
a4f6176ced Fix bug in printing switchover/failover module command info
The string constant passed to the register-function went invalid
once CREATE_MODULE() completed, causing random characters to be
printed.
2018-01-17 12:01:00 +02:00
23f2c3b980 Better failover timing and redirection success is tested
Works similar to switchover.
2018-01-16 18:05:12 +02:00
c2c898ee93 Fix formatting in MariaDB Monitor 2018-01-16 13:27:20 +02:00
ff2ad05d0a Add manual rejoin command to MariaDB Monitor
The rejoin command takes as parameters the monitor name and
name of server to rejoin.

This change required refactoring the rejoin code.
2018-01-16 13:20:35 +02:00
5f4db64ac7 Better timing for switchover, check slaves for IO/SQL errors
Time elapsed is now properly tracked during a switchover. After slave
redirection, an event is added to the master. Then, the slaves are queried
repeatedly until they advance to the newest event. I/O and SQL errors are
also detected.
2018-01-08 15:23:25 +02:00
047c08f577 MXS-1588: Wait on all slaves during switchover
During switchover, MASTER_GTID_WAIT is now called on all slaves. This causes
switchover to complete slower than before but is safer if log_slave_updates
is not on on the new master server. Also, read_only is disabled on the
demoted server if waiting on slaves or promotion fails. This should
effectively cancel the failover for the old master.
2018-01-03 12:52:33 +02:00
9558addbfe Update module name for mariadbmon 2018-01-02 11:01:28 +02:00