511 Commits

Author SHA1 Message Date
f41caae5c7 MXS-2175: Fix available_when_donor
If a Galera cluster drops down to a single node, the last node would not
be considered valid. During the failure of the second to last node, the
master would also temporarily lose the master status.

The behavior was changed to always keep the cluster UUID until the cluster
size drops down to zero. This guarantees that the same cluster is used as
long as possible.
2018-11-27 09:22:39 +02:00
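The UUID retention rule described in f41caae5c7 can be sketched roughly as follows. This is only an illustration and not the galeramon code: `ClusterTracker`, `update` and `accepts` are hypothetical names, and the sketch assumes the monitor observes one cluster UUID and one cluster size per pass.

```cpp
#include <string>

// Hypothetical illustration only; none of these names exist in galeramon.
struct ClusterTracker
{
    std::string uuid;   // UUID of the cluster currently being followed

    // Feed in the cluster UUID and cluster size observed in one monitor pass.
    void update(const std::string& seen_uuid, int cluster_size)
    {
        if (cluster_size <= 0)
        {
            // Only an empty cluster makes the monitor forget the UUID.
            uuid.clear();
        }
        else if (uuid.empty())
        {
            // No cluster is being followed yet, so adopt the one we see.
            uuid = seen_uuid;
        }
        // Otherwise keep the stored UUID, even if only one node remains.
    }

    // A node is valid only if it reports the UUID being followed.
    bool accepts(const std::string& node_uuid) const
    {
        return !uuid.empty() && node_uuid == uuid;
    }
};
```

With this shape, a cluster that shrinks from three nodes to one keeps its UUID, so the last node stays valid until the size reaches zero.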
f34ca0d473 Fix peculiar wrapping 2018-11-01 12:39:18 +02:00
e1dedfb678 Update galeramon.c (#183)
* Update galeramon.c

support wsrep_sst_method "xtrabackup-v2" for available_when_donor maxscale option

* reformat line to fit <=110 chars / support xtrabackup-v2 sst method
2018-10-31 16:00:26 +02:00
d6ce6e4289 MXS-2035: Fix available_when_donor
The parameter got broken by the previous change.
2018-09-15 01:22:39 +03:00
fa96923983 MXS-2035: Add mariabackup support to Galeramon
The mariabackup is now treated the same way as xtrabackup.
2018-09-13 13:02:32 +03:00
ad71655a36 MXS-2036 Redirect slaves with stopped SQL threads
This is somewhat questionable, as the slaves won't be able to really
replicate from the new master. However, not doing this causes the wrong
master to be selected after failover unless the new master has a majority
of slaves under it.
2018-09-11 10:27:31 +03:00
f499b22a9e MXS-2007: Check for no rows
If the query returns no rows, a NULL row is returned.
2018-08-11 23:33:48 +03:00
3243f741a0 MXS-1961 Standalone master loses master status when an alternative master emerges
Fixes the bug by requiring that only running slaves are considered when choosing a master.
2018-07-26 10:37:30 +03:00
09df017528 MXS-1886 Better auto-rejoin error description and tolerance
Auto-rejoin now explains more accurately if a server cannot be joined due
to conflicting gtid.

Also, auto-rejoin is no longer disabled if a join fails. Usually the failure
is due to the server not replying fast enough with query completion. The
query is often completed anyway. This can lead to some log spam.
2018-06-15 13:11:10 +03:00
9e68d8ec3d MXS-1886 Auto-failover error tolerance
Auto-failover is no longer considered to have failed if the preconditions
are not met. An error message with the failed checks is printed once, but
the checks are repeated every loop as long as the master is down.
2018-06-15 12:52:03 +03:00
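The "print once, retry every loop" behaviour described in 9e68d8ec3d could look something like the sketch below. `FailoverGate` and its members are hypothetical and not taken from mariadbmon. Re-arming the message once the checks pass is an assumption of this sketch, not something the commit states.

```cpp
#include <string>
#include <vector>

// Hypothetical illustration only; not the mariadbmon implementation.
struct FailoverGate
{
    bool warning_printed = false;
    std::vector<std::string> log;   // stands in for the MaxScale error log

    // Called once per monitor loop while the master is down.
    // Returns true when all preconditions hold and failover may start.
    bool check(const std::vector<std::string>& failed_checks)
    {
        if (failed_checks.empty())
        {
            warning_printed = false;   // assumption: re-arm the message
            return true;
        }
        if (!warning_printed)
        {
            for (const std::string& reason : failed_checks)
            {
                log.push_back("Failover precondition not met: " + reason);
            }
            warning_printed = true;
        }
        return false;   // not a failure, the checks simply run again next loop
    }
};
```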
f2b2951c99 Track the number of performed monitoring intervals
Tracking how many times the monitor has performed its monitoring allows
the test framework to consistently wait for an event instead of waiting
for a hard-coded time period. The MaxCtrl `api get` command can be used to
easily extract the numeric value.
2018-06-06 08:46:46 +03:00
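As a rough sketch of the idea in f2b2951c99, a monitor can expose a counter that a test polls instead of sleeping for a fixed time. `DemoMonitor`, `run_one_interval` and `intervals_passed` are hypothetical names chosen for this example. The real counter is read through the MaxCtrl `api get` command mentioned in the commit.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical illustration only; not the MaxScale monitor structure.
struct DemoMonitor
{
    std::atomic<std::uint64_t> ticks{0};

    void run_one_interval()
    {
        // ... probe servers and update their states here ...
        ticks.fetch_add(1);   // publish that one more interval has completed
    }
};

// True once at least `count` intervals have completed since `start` was read.
bool intervals_passed(const DemoMonitor& monitor, std::uint64_t start, std::uint64_t count)
{
    return monitor.ticks.load() - start >= count;
}
```

A test would read the counter, trigger an event, and then poll until `intervals_passed` reports that enough monitoring passes have happened.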
5f167abafd Add missing priority usage information to galeramon
The monitor did not print the current value of this parameter and knowing
it is helpful.
2018-05-28 10:34:06 +03:00
fb56de641a MXS-1859 Add options for enforcing read_only on slaves
If the feature is enabled (default off), at the end of a monitor loop
(once server states are known), read_only is enabled on slave servers
without it.
2018-05-18 15:29:56 +03:00
ee2c3e21c7 Fix server priority regression
Servers without priorities were chosen instead of servers with
priorities. This caused at least the server_weight test to fail.
2018-05-15 10:15:32 +03:00
97eb7d2f9e Fix deadlock in galeramon
The parameter extraction caused a recursive lock of the server
spinlock. To work around this, an unlocked version of server_get_parameter
is needed.

Ideally, a lock-free setup would be used but due to this being a bug fix,
it will have to be done later on.
2018-05-15 10:15:26 +03:00
39789c19d3 MXS-1856 Do not set read_only OFF if join_cluster() fails
This could in some cases leave read_only OFF even if the target slave
begins replication.
2018-05-08 13:56:57 +03:00
612b4e1a32 MXS-1847: Fix server_get_parameter
The function now takes an output buffer as a parameter. This prevents race
conditions by copying the parameter value into a local buffer.
2018-05-03 09:50:52 +03:00
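The output-buffer pattern from 612b4e1a32 can be illustrated with the sketch below. It is not the real `server_get_parameter` signature: `DemoServer` and `demo_get_parameter` are hypothetical, and `std::mutex` stands in for the server spinlock mentioned in the neighbouring commits.

```cpp
#include <cstddef>
#include <cstring>
#include <map>
#include <mutex>
#include <string>

// Hypothetical stand-in for a server object; not the MaxScale SERVER struct.
struct DemoServer
{
    std::mutex lock;
    std::map<std::string, std::string> parameters;
};

// Copies the value of `name` into `out` while the lock is held, so the caller
// never holds a pointer into data that another thread may modify.
// Returns false if the parameter is missing or the buffer is too small.
bool demo_get_parameter(DemoServer& server, const std::string& name,
                        char* out, std::size_t size)
{
    std::lock_guard<std::mutex> guard(server.lock);
    auto it = server.parameters.find(name);
    if (it == server.parameters.end() || it->second.size() + 1 > size)
    {
        return false;
    }
    std::memcpy(out, it->second.c_str(), it->second.size() + 1);
    return true;
}
```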
2a38902aa6 MXS-1639 Discard results when executing sql text files
This removes the limitation of not returning resultsets.
2018-04-24 13:21:44 +03:00
fa7cd9450a MXS-1639 Do not run demote_sql_file if the server already has a slave connection
In this case, the server was already a slave and is not being demoted. Also, the file may
contain queries which cannot be run while a slave connection is running.
2018-04-24 13:21:44 +03:00
739edcbe22 MXS-1639 Run user-given sql commands during promotion, demotion and rejoin
The sql queries are given in two text files, defined by options promotion_sql_file
and demotion_sql_file. The files must exist when the monitor starts. The files are read
line by line, ignoring empty lines and lines starting with '#'. All other lines
are sent to the server being promoted, demoted or rejoined. Any error in opening
a file, reading it or executing the contents will cause the entire operation to
fail.

The file defined in demotion_sql_file is also run when rejoining a server. This
is to ensure a previously failed master is "demoted" properly when it joins the
cluster.
2018-04-19 17:01:36 +03:00
4167e88719 MXS-1751: Fix crash with available_when_donor=true
The `MYSQL_ROW row` variable was being overwritten by the extra query done
by the SST method detection code. Moving it into its own function prevents
this and makes the code significantly easier to comprehend.

Added a test case that reproduced the problem (MaxScale crashed) and
verifies that the patch fixes the problem.
2018-03-31 20:21:07 +03:00
7209080236 MXS-1747 Improve error messages of rejoin operations
Now states which query caused the error.
2018-03-28 12:39:10 +03:00
6c32c7421b MXS-1746 Query global gtid_domain_id instead of session-specific value
The monitor queried the session-specific domain id, which does not follow the global
value while the session is alive. This caused the monitor to follow the wrong gtid
domain if the domain was changed after MaxScale was started. This patch modifies the
query to read the global value instead. Even this is not fool-proof, as existing
sessions can issue writes with the old domain, confusing the gtid-parsing.
2018-03-28 12:23:57 +03:00
bd8b6dbc6f MXS-1722 Add better error messages to switchover_demote_master()
The error messages should now be a bit more reliable.
2018-03-21 15:04:39 +02:00
2178667245 MXS-1679 Check for existence of master before continuing failover checks
Seems to fix the issue with MaxScale detecting an old master down event.
2018-03-16 11:26:58 +02:00
b982458497 MXS-1679 Add more accurate error printing
The reason for rejoin failing should now be clearer.
2018-03-12 17:16:54 +02:00
5a62adc63e MXS-1678: Detect broken replication with Last_IO_Errno
This commit introduces changes that fix the relay master detection that
was broken by the merge from 2.1 into 2.2 by commit
1ecd791887994209eb29e56e1271f8c407cd0cdf.

In 2.2, the master server ID is used to detect whether a slave is actually
replicating from a master. The value is still displayed even if the slave
is not actively replicating from a master. The commit in 2.1 causes this
value to be stored unconditionally if it is available. By checking the
value of Last_IO_Errno and comparing it to a list of known error codes, we
know whether the slave is replicating properly.

The slave detection in 2.2 correctly identifies a broken slave with a
stopped IO thread. Due to this, the test case must be modified to check
that the relay master is not a slave if the IO thread is stopped.
2018-03-12 14:55:54 +02:00
f7b284bbb7 Check IO thread status when verifying master failure
When MaxScale thinks that the master has failed, it tries to verify it by
seeing if the slave server is receiving events. There was a missing IO
thread status check in the slave_receiving_events function which caused
the failover to wait until the verification timed out.

The relay master detection logic also lacked a check for the slave SQL
thread status. The code should check the state of the SQL thread to
determine whether the server is actually a functional slave to a master.
2018-03-09 20:53:56 +02:00
d443e22d1b Merge branch '2.2.3' into 2.2 2018-03-09 20:50:01 +02:00
f4c7a4700a Disable fix to MXS-1678 in 2.2.3
The fix causes a regression in the failover functionality as there is a
dependency between the slave's master ID and how the failover
performs. This dependency should not exist but fixing it causes a problem
with the mysqlmon_rejoin_bad2 test.
2018-03-08 21:03:52 +02:00
ff9024bdfb MXS-1698: Remove false debug assertion
It is not an error if the correct GTID is not found and thus it should not
be asserted that one is found.
2018-03-07 11:55:46 +02:00
39d3c42c94 Merge branch '2.1' into 2.2 2018-03-01 17:52:42 +02:00
b67ab83486 Revert "Use dedicated header in NDBClusterMon"
This reverts commit b9d80f6061d6b536d7a15febf0367e5f6dba0e84.
2018-02-24 15:43:15 +02:00
0bbf0246f9 Revert "Compile mariadbmon.h as C++"
This reverts commit 60d57aee61d96832aeec1b8a61d36803c38ca77c.
2018-02-24 15:40:21 +02:00
236e906d88 Revert "Turn MariaDB Monitor struct to class with public fields"
This reverts commit cb6f70119d9857b277306e9af5881fe29c574a32.
2018-02-24 15:37:50 +02:00
13661ab4a6 Revert "MariaDB Monitor: Move additional classes to separate file"
This reverts commit ff55106610881d55db88eca9e2ef6a056cbc8d51.
2018-02-24 15:35:36 +02:00
e721733434 Revert "MariaDBMon: Move replication manipulation functions to a separate file"
This reverts commit 8cdd23dda2add6486abb685834def94c72a09b6c.
2018-02-24 15:35:02 +02:00
8cdd23dda2 MariaDBMon: Move replication manipulation functions to a separate file
Refactoring continues. This update moves some of the replication manipulation
functions to a separate file and turns them into class methods.
2018-02-22 10:51:52 +02:00
ff55106610 MariaDB Monitor: Move additional classes to separate file
Also use stl containers in monitor definition.
2018-02-21 12:24:24 +02:00
cb6f70119d Turn MariaDB Monitor struct to class with public fields
Allows using std::string for strings. Also, cleanup.
2018-02-21 11:00:42 +02:00
1ecd791887 MXS-1678: Store master_id even when IO thread is stopped
When the IO thread of a relay master is stopped, the knowledge that it is
not a real master but a relay master is lost. To prevent this loss of
information, the master server's server_id value should always be stored
if it is available.
2018-02-21 09:35:42 +02:00
60d57aee61 Compile mariadbmon.h as C++ 2018-02-20 11:14:21 +02:00
b9d80f6061 Use dedicated header in NDBClusterMon
NDBClusterMonitor used the MariaDBMonitor header instead of its own.
2018-02-20 11:09:04 +02:00
754d80da75 Do not auto_rejoin if maxscale is passive 2018-02-14 17:30:02 +02:00
3b2ec4ab5a Change references from MySQL to MariaDB
A few were missed when the renaming was done. Also renamed the file to
mariadbmon.cc.
2018-02-14 14:47:03 +02:00
b94f3b8792 Failover: do not check cluster stabilization if no slaves
Caused debug assert.
2018-02-12 12:37:41 +02:00
b8d3da4968 Add error tolerance to "servers_no_promotion"
Previously, if the list contained servers that were not monitored by
the monitor yet were valid servers, an error value would be returned
and the monitor failed to start.

With this update, the non-monitored servers are simply ignored when
forming the final list.

Also, added printing of the list to diagnostics.
2018-02-12 10:49:28 +02:00
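The tolerant handling of "servers_no_promotion" in b8d3da4968 amounts to filtering the configured names against the monitored servers instead of failing on the first unknown one. A minimal sketch, assuming servers are identified by plain name strings and using the hypothetical function name `filter_no_promotion`:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical illustration only; not the mariadbmon implementation.
// Keeps the configured names that belong to monitored servers and silently
// drops the rest instead of reporting an error.
std::vector<std::string> filter_no_promotion(const std::vector<std::string>& configured,
                                             const std::vector<std::string>& monitored)
{
    std::vector<std::string> excluded;
    for (const std::string& name : configured)
    {
        if (std::find(monitored.begin(), monitored.end(), name) != monitored.end())
        {
            excluded.push_back(name);
        }
    }
    return excluded;
}
```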
faaf43ff39 Add gtid to monitor diagnostics, clean up formatting
GTIDs are now queried every monitor loop.

diagnostics() no longer prints slave related info if the server has
no slave connection.
2018-02-10 12:32:56 +02:00
a0d9c7da74 External master server support for failover/switchover
If the master is replicating from an external master, the monitor will save the
host:port of the external server. During demotion, the old master stops the external
replication while the new master begins it. Also, any commands that would add
to gtid have to be omitted when an external master is in play.
2018-02-08 18:44:08 +02:00
ed28d986e9 Fix debug assertion when no master is present
If no master is present, the debug assertion would dereference a NULL
pointer.
2018-02-08 13:33:31 +02:00