86 Commits

Author SHA1 Message Date
Esa Korhonen
040562f718 MXS-2342 Run MariaDBMonitor diagnostics concurrent with the monitor loop
This fixes some situations where MaxAdmin/MaxCtrl would block and wait
until a monitor operation or tick is complete. This also fixes a deadlock
caused by calling monitor diagnostics inside a monitor script.

Concurrency is enabled by adding one mutex per server object to protect
array-like fields from concurrent reading/writing.
2019-03-12 10:50:16 +02:00
Esa Korhonen
4fd4b726a1 MXS-2325 Only enable events that were enabled on the master
The monitor now continuously updates a list of enabled server events. When
promoting a new master in failover/switchover, only events that were enabled
on the previous master are enabled on the new master. This avoids enabling
events that may have been disabled on the master yet stayed in the
SLAVESIDE_DISABLED state on the slave.

In the case of the reset-replication command, events on the new master are only
enabled if the monitor had a master when the command was launched. Otherwise
all events remain disabled.
2019-03-04 16:00:07 +02:00
Esa Korhonen
fb52e565fe Store capabilities of monitored server
Checking the version number in various places in the code gets confusing.
It's better to check the version in one place and record the relevant data.
2018-11-21 17:36:52 +02:00
Esa Korhonen
1a046bd453 MXS-2168 Add 'assume_unique_hostnames'-setting to MariaDBMonitor
Adds the setting and takes it into use during replication graph creation
and the most important checks.
2018-11-21 10:30:11 +02:00
Esa Korhonen
6a1cfddb43 MXS-2158 Clean up gtid updating during rejoin
Error messages from update_gtids() are now printed. can_replicate_from()
no longer updates gtids.
2018-11-16 12:56:24 +02:00
Esa Korhonen
14e38e4e08 MXS-2158 Return true if update_gtids() succeeds, even if no data is returned
Previously, if the server had no gtids, the method would fail, leading to
a confusing error message. This could even stop the monitor from working entirely
if a recent server version (10.X) did not have any gtid events.
2018-11-14 10:56:42 +02:00
Esa Korhonen
fb3ccc94d6 MXS-1944 Cleanup function parameters
The naming and ordering are now a bit more consistent between promote() and demote().
2018-11-07 12:55:59 +02:00
Esa Korhonen
e4e2235297 MXS-1944 Use time limited methods in rejoin
Uses switchover time limit, since the typical rejoin of a standalone server
is somewhat similar to a switchover.
2018-11-07 12:55:59 +02:00
Esa Korhonen
a4ce4e4613 MariaDBServer no longer uses ClusterOperation
The functions in the server class now only use the general parameters object.
2018-11-07 12:55:59 +02:00
Esa Korhonen
90e6ff078a Divide ClusterOperation into two types
The main class was getting unwieldy and too general. Dividing the fields
helps add support for other operation types.

This commit leaves most data duplicated; later commits clean up the affected code.
2018-11-07 12:55:59 +02:00
Esa Korhonen
0cf8ea43f7 Redirect slaves of promotion target
This affects situations where the promoted server is a relay or multimaster
group member.
2018-10-16 16:09:38 +03:00
Esa Korhonen
2f1512a22d Cleanup slave connection removal during promotion/demotion
Connection removal and slave status updating are now separated into a function.
As the MariaDBServer object now contains the updated slave connections,
keeping track of removed connections is no longer required.
2018-10-09 14:29:49 +03:00
Esa Korhonen
c10fab977d Cleanup slave connection copy & merge
The two cases are now separated. In switchover, the promotion and
demotion targets can swap connections between each other without worry.
In failover, the two connection lists must be merged semi-intelligently.

The slave connections of the two servers are now saved to the operation
descriptor object at the start of the operation. This allows slave status
updating during the operation.
2018-10-09 14:29:49 +03:00
Esa Korhonen
4d6f961695 Cleanup mariadbserver.hh
2018-10-09 14:29:49 +03:00
Esa Korhonen
68d65682b5 Reorganize MariaDBServer code
The server-class keeps growing, so the additional classes are moved out of
the main class file.
2018-10-09 14:29:49 +03:00
Esa Korhonen
30eb21914f MXS-1845 Switchover cleanup
Several small changes:
Binlog is flushed at the end of old master demotion.
Only new master is required to catch up to old master.
Use the same replication check method as failover.
2018-10-04 11:45:33 +03:00
Esa Korhonen
49e85d9a28 MXS-1845 Add demotion code
The master demotion in switchover now uses query retrying with
the switchover time limit.
2018-10-04 11:45:33 +03:00
Esa Korhonen
d14b9bfe43 MXS-1845 Cluster stabilization rewrite
No longer writes events to the master, as this creates problems if the
promoted server was not the overall master. Instead, the slave status
output is inspected.
2018-10-02 11:09:16 +03:00
Esa Korhonen
1ca5d02abb MXS-1845 Add redirection code
Should work with multimaster replication.
2018-10-02 11:09:16 +03:00
Esa Korhonen
6b8443aba6 MXS-1845 Complete server promotion code
Now copies slave connections from the previous master. Promotion
code taken into use.
2018-10-01 18:06:39 +03:00
Esa Korhonen
fe81b399b2 Use maxbase time and clock classes instead of std::chrono
2018-09-27 17:04:59 +03:00
Esa Korhonen
dd9ff27743 MXS-1845 Rewrite server promotion code
In progress, does not yet overwrite existing code.

The new promotion mechanism automatically retries queries which timed out. It also
handles multimaster situations correctly.
2018-09-26 13:20:29 +03:00
Esa Korhonen
c20a17238b MXS-1944 Store failover parameters in an object
Several of the parameters are passed on from function to function. Having them all
in an object cleans things up and makes adding more data easier.
2018-09-26 12:26:35 +03:00
Esa Korhonen
02ac394e38 Cleanup slave status handling
Further reduce direct indexing of the slave status array to improve compatibility
with multimaster replication.
2018-09-19 13:37:24 +03:00
Esa Korhonen
56c84541df MXS-1712 Add reset replication to MariaDB Monitor
The 'reset_replication' module command deletes all slave connections and binlogs,
sets gtid to sequence 0 and restarts replication from the given master. Should
only be used if gtids are incompatible but the actual data is known to be in sync.
2018-09-14 17:15:05 +03:00
Esa Korhonen
cb54880b99 MXS-1937 Cleanup event handling
Event handling is now enabled by default. If the monitor cannot query the EVENTS
table (most likely because of missing credentials), print an error suggesting
that the feature be turned off.

When disabling events on a rejoining standalone server (likely a former master),
disable binlog event recording for the session. This prevents the ALTER EVENT
queries from generating binlog events.

Also added documentation and combined similar parts in the code.
2018-09-14 16:54:24 +03:00
Markus Mäkelä
d11c78ad80 Format all sources with Uncrustify
Formatted all sources and manually tuned some files to make the code look
neater.
2018-09-10 13:22:49 +03:00
Niclas Antti
c447e5cf15 Uncrustify maxscale
See script directory for method. The script to run in the top level
MaxScale directory is called maxscale-uncrustify.sh, which uses
another script, list-src, from the same directory (so you need to set
your PATH). The uncrustify version was 0.66.
2018-09-09 22:26:19 +03:00
Esa Korhonen
f3fbb297a4 MXS-1937 Enable server events on new master during switchover and failover
Combined some of the shared code in enable/disable_events(). Also disables
events on a joining standalone server.
2018-09-05 15:54:08 +03:00
Esa Korhonen
7e6ce2d13f MXS-1937 Disable server events on master during switchover
The feature is behind a config setting.
2018-09-05 15:54:08 +03:00
Esa Korhonen
7cd1cfdb80 Relay log waiting is part of failover_prepare
Since the servers are not modified before or during the wait, the waiting
can be done in the preparation method. This simplifies the actual failover
somewhat, and allows the monitor to keep running normally while waiting for
the log to clear.
2018-08-30 17:07:34 +03:00
Esa Korhonen
c39177bc8d Relay log clear supports multiple slave connections
Now waits for the relay log of the correct slave connection.
2018-08-29 17:07:52 +03:00
Esa Korhonen
85d8a85cde Update master failure detection from slaves
The detection now works with multiple slave connections.
2018-08-29 17:07:52 +03:00
Esa Korhonen
9e566bc619 A master that is down with no running slaves can be replaced
This should be a more general way to detect situations where a DBA
or another MaxScale performs a failover.
2018-08-29 16:41:20 +03:00
Markus Mäkelä
91ab59530f Use pending status in external master checks
When the replication status from the external master is checked, the
pending status must be used. This makes sure that the SlaveStatusArray and
the server state are in sync.

Also extended the message that was logged when the external master was
lost. Adding the network address there makes it easier to see where
the server was replicating from if only the log file is available.
2018-08-23 15:46:45 +03:00
Esa Korhonen
61bb172033 Cleanup failover/switchover
Replication settings warnings are printed once more. Changed some
parameter names to be more consistent within the monitor.
2018-08-23 10:28:47 +03:00
Esa Korhonen
f2dfd39f79 Clean up JSON diagnostics
Now prints all slave connections.
2018-08-22 12:23:28 +03:00
Esa Korhonen
3777da96bd Miscellaneous cleanup
Removes needless status assignments and unused code. Moves and modifies some comments.
2018-08-22 12:11:33 +03:00
Niclas Antti
24ab3c099c Move the top-of-file "#pragma once" to after the following comment (swap them). If the comment is a BPL, update it to the latest one
2018-08-21 13:13:15 +03:00
Esa Korhonen
03cefcc4ac MXS-2012 Write replication lag to SERVER
Allows routers to read the value.
2018-08-21 11:51:10 +03:00
Esa Korhonen
1c508cd413 MXS-2012 Read and print Seconds_Behind_Master
Replaces the old replication lag detection.
2018-08-21 11:51:10 +03:00
Esa Korhonen
681c456bd7 Separate unknown server version from old versions
This allows better failover support detection.
2018-08-13 11:30:21 +03:00
Esa Korhonen
b7c94abb34 Keep track of previously observed slave connections
This reduces the ambiguity of server ids in the slave status contents.
If a slave connection has been seen properly connected at an earlier time,
it can be trusted to report the correct master server id. This also
fixes some wrong status assignment edge cases with the SERVER_WAS_SLAVE bit.
The bit will be removed in a later commit.

Even this does not solve the situation when MaxScale is started with
some servers down.
2018-08-09 20:39:19 +03:00
Esa Korhonen
17c84a22c7 Refactor preparations to failover
The two operations are quite similar so the code should look
similar as well and use shared functions.
2018-08-07 16:33:56 +03:00
Esa Korhonen
0a81f78442 Use unique pointer instead of auto-pointer
2018-08-06 13:24:05 +03:00
Esa Korhonen
c0bd5ca3a1 MXS-1905 Switchover if master is low on disk space
Required quite a bit of refactoring.
2018-08-06 13:24:05 +03:00
Esa Korhonen
1e33ab69f2 Rename server_is_running() to server_is_usable()
The previous name was misleading. The new server_is_running() only
checks for the running bit so that a server is always either running
or down.
2018-07-31 14:53:56 +03:00
Esa Korhonen
c9570ff616 Check failover applicability to the cluster every turn
This should give an advance warning if a user tries to activate auto_failover
on a cluster which does not support it.
2018-07-20 15:33:47 +03:00
Esa Korhonen
936bcde135 Remove old "detect_standalone_master"-feature, update documentation
The auto_failover is a more reliable solution and should be used instead. Several
unused parameters were removed, although they can still be defined in the config
file. Updated documentation on the relevant parts.
2018-07-16 15:58:16 +03:00
Markus Mäkelä
77a1417479 Replace TR1 headers with standard headers
Now that the C++11 standard is the default one, we can remove the TR1
headers and classes.
2018-07-11 14:08:46 +03:00