Commit Graph

466 Commits

Author SHA1 Message Date
b94f3b8792 Failover: do not check cluster stabilization if no slaves
Caused debug assert.
2018-02-12 12:37:41 +02:00
b8d3da4968 Add error tolerance to "servers_no_promotion"
Previously, if the list contained servers that were not monitored by
the monitor yet were valid servers, an error value would be returned
and the monitor failed to start.

With this update, the non-monitored servers are simply ignored when
forming the final list.

Also, added printing of the list to diagnostics.
2018-02-12 10:49:28 +02:00
faaf43ff39 Add gtid to monitor diagnostics, clean up formatting
Gtid:s are now queried every monitor loop.

dignostics() no longer prints slave related info if the server has
no slave connection.
2018-02-10 12:32:56 +02:00
a0d9c7da74 External master server support for failover/switchover
If the master is replicating from an external master, the monitor will save the
host:port of the external server. During demotion, the old master stops the external
replication while the new master begins it. Also, any commands that would add
to gtid have to be omitted when an external master is in play.
2018-02-08 18:44:08 +02:00
ed28d986e9 Fix debug assertion when no master is present
If no master is present, the debug assertion would dereference a NULL
pointer.
2018-02-08 13:33:31 +02:00
b059d78a30 Fix master loss on split cluster
When four servers (A, B, C and E where E and A replicate from each other
and A is the master for B and C) form a cluster and only three of them (A,
B and C) are configured into MaxScale, a failover operation from A to B
(making B the current master) and a restart of A causes B to lose its
master status.

The following diagram illustrates the state of the cluster at the end of
the process described above.

      +----------------------+
      |        +---+         |
  +------------+ B <-+       |
+-v-+ |        +---+ |       |
| E | |              |       |
+-^-+ |  +---+     +-+-+     |
  +------+ A |     | C |     |
      |  +---+     +---+     |
      |                      |
      +----------------------+

The external server E was not correctly ignored in the replication
topology generation causing both A and B to be seen as the lowest slave
nodes in the tree. From a theoretical point of view this is the correct
interpretation as there are two distinct trees and neither of them
contains any true masters.

In practice, MaxScale should treat any servers that replicate from an
external master as root level master nodes. Doing this guarantees that they
are labeled as masters if they have slaves replicating from them.
2018-02-08 12:48:56 +02:00
6c42221c9c MXS-1654 Prevent crash in debug mode if no master is found 2018-02-07 16:43:13 +02:00
8558ace801 Clean up ignore_external_masters code
Removed redundant checks and cleared only bits that aren't already
cleared.
2018-02-07 16:07:16 +02:00
8bf756ca56 Update root_master when using a standalone master
When detect_standalone_master is enabled, the root_master variable was not
updated after the master was changed by the standalone server detection
mechanism. This caused debug assertions to fire in addition to possibly
causing some of the ignore_external_masters logic to break.
2018-02-07 16:07:16 +02:00
1cf3de4a74 Add config parameter for excluding servers from failover
"servers_no_promotion" is a comma-separated list of servers
which cannot be chosen when selecting a new master during failover
(auto or manual), or when automatically selecting a new master
for switchover (currently disabled).

The servers in the list are redirected normally and can be promoted
by switchover when manually selecting a new master.
2018-02-07 14:07:10 +02:00
4a478d31f3 Print Gtid IO position during monitor diagnostics 2018-02-05 17:21:20 +02:00
f6afb0c6d1 MXS-1643: Make Master and Slave status mutually exclusive
The Master status now prevents Slave status from being assigned to a
server. In practice this simply means that the master will not have both
the Master and Slave status bits.
2018-02-05 16:53:43 +02:00
a83b36ca45 Use 64 bits for storing server id
In debug mode, when scanning the server id from a string, check that resulting
number is 32bit. Also, when querying the server id, query the global version.
Now, if a super user modifies the server id the monitor will notice it.

Server id:s in gtid:s are handled similarly.
2018-02-02 11:34:32 +02:00
255250652d Refactor pre-switchover, add similar checks as in failover
Now detects some erroneous situations before starting switchover.
Switchover can be activated without specifying current master.
In this case, the cluster master server is selected.
2018-01-31 10:40:09 +02:00
e455e8d43c Simplify monitor start and stop during switchover/failover
If these operations failed, the monitor could be left stopped.
2018-01-29 15:44:32 +02:00
9ded584836 Check that all slaves use gtid replication before performing failover 2018-01-29 13:33:16 +02:00
257034bf3e Clarify master failure verification
The two previous functions were somewhat overlapping.
2018-01-23 16:14:50 +02:00
753b97303a MXS-1606: Create maxscale_schema database if missing
The monitor will now also create the database if it is missing. Since it
already creates the table, also creating the database is not a large
addition.

Cleaned up some of the related checking code and combined them into a
simple utility function.
2018-01-19 11:34:25 +02:00
a4f6176ced Fix bug in printing switchover/failover module command info
The string constant passed to the register-function went invalid
once CREATE_MODULE() completed, causing random characters to be
printed.
2018-01-17 12:01:00 +02:00
23f2c3b980 Better failover timing and redirection success is tested
Works similar to switchover.
2018-01-16 18:05:12 +02:00
c2c898ee93 Fix formatting in MariaDB Monitor 2018-01-16 13:27:20 +02:00
ff2ad05d0a Add manual rejoin command to MariaDB Monitor
The rejoin command takes as parameters the monitor name and
name of server to rejoin.

This change required refactoring the rejoin code.
2018-01-16 13:20:35 +02:00
5f4db64ac7 Better timing for switchover, check slaves for IO/SQL errors
Time elapsed is now properly tracked during a switchover. After slave
redirection, an event is added to the master. Then, the slaves are queried
repeatedly until they advance to the newest event. I/O and SQL errors are
also detected.
2018-01-08 15:23:25 +02:00
047c08f577 MXS-1588: Wait on all slaves during switchover
During switchover, MASTER_GTID_WAIT is now called on all slaves. This causes
switchover to complete slower than before but is safer if log_slave_updates
is not on on the new master server. Also, read_only is disabled on the
demoted server if waiting on slaves or promotion fails. This should
effectively cancel the failover for the old master.
2018-01-03 12:52:33 +02:00
9558addbfe Update module name for mariadbmon 2018-01-02 11:01:28 +02:00
d4f9cb661f MXS-1587 Rename mysqlmon to mariadbmon
'mysqlmon' is still accepted but 'mariadbmon' is loaded instead.
This is done at runtime instead of e.g. by using a symbolic link,
so that a warning can be logged.

The warning is logged and the translation of the module name is
made by the code that loads the modules so that it's easy to do
the same thing for other modules as well.

In a subsequent commit the documentation is updated.
2017-12-27 11:22:27 +02:00
3ccd6eed28 MXS-1588 Fix switchover
Change the ordering of the two flushes such that FLUSH LOGS comes last.
This seems to make sure gtid:s are updated to newest values before
the MASTER_GTID_WAIT-call. Without this fix, switchover does complete
succesfully, but some of the slaves may not be able to replicate due to
not having same events as new master. Exact reason for this still unclear.
2017-12-22 13:35:36 +02:00
79652301d8 Fix split of mysqlmon sources
For some reason, the source code of mysqlmon was split into C and C++
sources. This caused problems by effectively discarding all changes from
2.1 that are merged into 2.2.

This commit merges the changes into the correct file that were added to
the wrong file.
2017-12-21 16:24:03 +02:00
7ed9172496 MXS-1533: Rejoin also when the old master cannot be connected
Previously, the rejoin would only be ran on servers with a connected slave io
thread. This patch runs the rejoin also on slaves which cannot connect to a
downed old master while the master hostname or port differs from the current
cluster master server.
2017-12-18 13:04:47 +02:00
b6e983c0b5 Change default value of detect_standalone_master
The default value was changed from false to true.
2017-12-13 13:18:44 +02:00
79afaa447e Merge branch '2.1' into 2.2 2017-12-12 13:23:02 +02:00
b29a8eb4f8 MXS-1513: Flush logs before tables during switchover_demote_master
In this order, the new binary log file will have 1 event instead of 0.
In total, only 1 event is added.
2017-12-08 12:28:50 +02:00
c6daf8c26b MXS-1513: More accurate error messages
The failed query is now printed.
2017-12-07 10:54:30 +02:00
b2bc087508 MXS-1491: Print failcount
Failcount is now printed in "show monitors". Also, when master goes down
but failover does not yet happen because failcount > 1, a message is
logged.
2017-12-07 10:05:45 +02:00
046ed5c93d MXS-1513: mysql_mon.cc formatting changes
Ran astyle, cut some long lines.
2017-12-04 13:53:16 +02:00
c0ab80e459 MXS-1513: Cleanup some messages, change function names
No real functionality changes.
2017-12-04 12:53:09 +02:00
45834a89b5 MXS-1533: Rename "auto_join" to "auto_rejoin" 2017-12-04 09:59:29 +02:00
508ce3a703 MXS-1491: Failover can be executed manually
Also, renamed config setting "failover" to "auto_failover". Removed
setting "switchover" as it is now always enabled.
2017-12-04 09:41:00 +02:00
90f6d78a58 MXS-1533: Add automatic join feature
When enabled, the monitor will redirect servers to replicate from the
current master. Standalone servers and servers replicating from a slave
are redirected.
2017-12-04 09:37:16 +02:00
4cb50f48ad MXS-1533: Fix relay master identification and root master detection
Relay master servers must now have a running slave. Also, fix cluster master
detection in get_replication_tree().
2017-11-30 16:16:19 +02:00
d5d41349ae MXS-1509: Add ignore_external_masters parameter
The new parameter allows ignoring of master servers that are external to
the monitor configuration. This allows sub-trees of the actual replication
tree to be used as fully fledged replication trees.
2017-11-30 12:39:00 +02:00
23cd294dad MXS-1533: Better handling of multi-domain gtids
If the gtid_domain_pos of the master is ever modified,
gtid-variables will have multiple domains. Generally, we are
only interested in the most recent domain. This is tracked in
gtid_domain_id:s and the value of the master is used for
filtering the correct domain from all gtid-values.

Also, use gtid_current_pos instead of gtid_slave_pos. The
advantage of current_pos is that the same variable works also
for master servers. The gtid-handling is now more thorough and
detects some weird situations.
2017-11-27 16:31:13 +02:00
15e330e127 MXS-1533: Save gtid_domain_id and server version to MySqlServerInfo
Cleaned the surrounding code, as it was querying server
version twice.
2017-11-27 16:30:43 +02:00
dc2286c774 MXS-1513: Obay user-given new master server
If given a readily selected master server, Switchover will use it
as the new master. If the given server is invalid, nothing will
happen and an error is returned.
2017-11-23 13:37:42 +02:00
396b81f336 Fix in-source builds
The internal header directory conflicted with in-source builds causing a
build failure. This is fixed by renaming the internal header directory to
something other than maxscale.

The renaming pointed out a few problems in a couple of source files that
appeared to include internal headers when the headers were in fact public
headers.

Fixed maxctrl in-source builds by making the copying of the sources
optional.
2017-11-22 18:40:18 +02:00
8077d97e25 MXS-1513: Work around the backend_read_timeout-setting
The setting limits the maximum time a MASTER_GTID_WAIT-function
can wait. To work around this limitation, the function is now called
in a loop such that the total timeout is approximately equal to
the requested timeout.
2017-11-20 12:31:11 +02:00
59616b5f3e MXS-1490: Cleanup error printing, add json error
Slave redirection is a special case, as there the total failure
is only known after all redirects have been attempted. In the
failure case, all errors from connections are gathered to one
message.
2017-11-20 11:33:47 +02:00
84d1ea0bff MXS-1490: Perform failover only after failcount monitor loops
The same failcount variable is used for the detect_standalone_master-
feature.
2017-11-17 10:12:33 +02:00
b63c6504a3 MXS-1513: Switchover script
First version of switchover script. Unsafe to run as it has no
timeouts for most queries. Also, removed code launching the
previous switchover_script.
2017-11-16 10:51:12 +02:00
f41111b4bd MXS-1517: Retain stale master bit even on master failure
If a server goes down and it has the stale master bit enabled, all other
bits for the server are cleared. This allows failed masters that have been
replaced to be first detected and then reintroduced into the replication
topology.
2017-11-14 16:53:09 +02:00