If code that may remove items from m_nodes_by_id (Clustrix nodes
keyed by id) succeeds, the vector of health check URLs must be
updated even in the case that code that _may_ add items to
m_nodes_by_id fails.
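A minimal sketch of the pattern, with hypothetical member and helper
names (only m_nodes_by_id is from the actual code); the point is that
the URL vector is rebuilt from the map on any change, not only on the
success path:

    #include <map>
    #include <string>
    #include <vector>

    struct ClustrixNode
    {
        std::string health_url;  // e.g. "http://10.0.0.1:3581"
    };

    // Hypothetical members mirroring the ones discussed above.
    std::map<int, ClustrixNode> m_nodes_by_id;  // nodes keyed by id
    std::vector<std::string>    m_health_urls;  // must track the map

    // Rebuild the health check URLs from the map. This must be called
    // whenever the removal step succeeded, even if the subsequent
    // adding step failed.
    void update_health_urls()
    {
        m_health_urls.clear();
        for (const auto& kv : m_nodes_by_id)
        {
            m_health_urls.push_back(kv.second.health_url);
        }
    }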
The name of the object (i.e. the section name from the configuration
file) is now stored in the configuration object for that object.
That way, more contextual and hence more user-friendly errors and
warnings can be generated.
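For illustration only, the idea is roughly the following (names invented):

    #include <string>
    #include <utility>

    class Configuration
    {
    public:
        // 'name' is the section name from the configuration file.
        explicit Configuration(std::string name)
            : m_name(std::move(name))
        {
        }

        const std::string& name() const
        {
            return m_name;
        }

    private:
        std::string m_name;
    };

An error can then identify the offending section, e.g.
"MyMonitor: the value of 'health_check_port' is not a valid port."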
If there have been any changes in the bootstrap servers specified
for the Clustrix monitor, then the persisted connection information
is not used. Otherwise, if a bootstrap server has been changed and
is inaccessible, we might connect to a different cluster than the
intended one.
Persisted information about dynamic nodes must be used only if
the bootstrap information has not been changed, as otherwise we risk
using information that is no longer valid.
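A sketch of the check, assuming hypothetical helpers for what is
stored in the sqlite database and what is currently configured:

    #include <string>
    #include <vector>

    using Servers = std::vector<std::string>;

    // Hypothetical helpers.
    Servers persisted_bootstrap_servers();   // as stored when persisting
    Servers configured_bootstrap_servers();  // as configured now
    void    load_persisted_nodes();
    void    drop_persisted_nodes();

    void maybe_use_persisted_nodes()
    {
        // Use the persisted dynamic node information only if the
        // bootstrap servers are unchanged; otherwise it might describe
        // a different cluster.
        if (persisted_bootstrap_servers() == configured_bootstrap_servers())
        {
            load_persisted_nodes();
        }
        else
        {
            drop_persisted_nodes();
        }
    }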
"Once you eliminate the impossible, whatever remains, no matter
how improbable, must be the truth." Arthur Conan Doyle
Since server objects are never destroyed, currently the only
explanation for the crash described in MXS-2446 is that a server
created at runtime could not, immediately after the creation, be
found using its name.
If the nodes change while a multi HTTP GET is in progress, the
corresponding delayed call must be cancelled. Otherwise we would
eventually end up attempting to update the state of the nodes
using the wrong result.
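The safeguard, sketched with invented names; the id of the pending
delayed call is remembered so that it can be cancelled if the node set
changes while the multi HTTP GET is in flight:

    #include <cstdint>

    // Hypothetical API: a delayed call is registered and cancelled by id.
    uint32_t register_delayed_call();        // returns a non-zero id
    void     cancel_delayed_call(uint32_t id);

    uint32_t delayed_call_id = 0;

    void on_multi_http_get_started()
    {
        delayed_call_id = register_delayed_call();  // will process the result
    }

    void on_nodes_changed()
    {
        if (delayed_call_id != 0)
        {
            // The pending result refers to the old node set; drop it.
            cancel_delayed_call(delayed_call_id);
            delayed_call_id = 0;
        }
    }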
If 'dynamic_node_detection' has been set to false, then the
Clustrix monitor will not dynamically figure out what nodes are
available, but instead use the bootstrap nodes as such.
With 'dynamic_node_detection' being false, the Clustrix monitor
will do no cluster checks, but simply ping the health port of
each server.
'dynamic_node_detection' specifies whether the Clustrix monitor
should dynamically figure out what nodes there are, or just rely
upon static information.
'health_check_port' specifies the port to be used when performing
the health check ping.
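By way of illustration, a configuration along these lines (the server
names, credentials and port value are made up, and the module name is
assumed to be clustrixmon) would make the monitor rely solely upon the
listed servers and ping their health ports:

    [Clustrix-Monitor]
    type=monitor
    module=clustrixmon
    servers=node1,node2,node3
    user=monitor_user
    password=monitor_pw
    dynamic_node_detection=false
    health_check_port=3581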
At runtime the Clustrix monitor will save information about
detected nodes to an sqlite3 database, and delete that
information if a node disappears.
At startup, if the monitor fails to connect to a bootstrap
node, it will try to connect to any of the persisted nodes and
start from there.
This means that in general it is sufficient if the Clustrix
monitor can connect to a bootstrap node at the very first
startup; thereafter it will get by even if the bootstrap node
disappears for good.
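A condensed sketch of the persistence idea using the sqlite3 C API;
the table layout and function names are invented, and error handling
is omitted:

    #include <sqlite3.h>
    #include <cstdio>
    #include <string>

    // Called when a new node is detected.
    void persist_node(sqlite3* pDb, int id, const std::string& ip, int port)
    {
        char sql[256];
        snprintf(sql, sizeof(sql),
                 "INSERT OR REPLACE INTO nodes VALUES (%d, '%s', %d)",
                 id, ip.c_str(), port);
        sqlite3_exec(pDb, sql, nullptr, nullptr, nullptr);
    }

    // Called when a node disappears from the cluster.
    void unpersist_node(sqlite3* pDb, int id)
    {
        char sql[128];
        snprintf(sql, sizeof(sql), "DELETE FROM nodes WHERE id = %d", id);
        sqlite3_exec(pDb, sql, nullptr, nullptr, nullptr);
    }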
Information about the detected Clustrix nodes is now stored in
a Clustrix monitor specific sqlite database. This will be used
for bootstrapping the Clustrix monitor in case a statically
defined bootstrap server is unavailable.
In this context master should be interpreted as "can be read
and written to".
Marking them as master requires fewer changes in RWS to make it
usable with a Clustrix cluster.
The cluster check can only be made after the monitor has been
started. If it were made when the monitor is configured, at
startup it would run before the services are available, and
hence they would not be populated with the dynamically
discovered servers.
Storing all the runtime errors makes it possible to return all of
them via the REST API. MaxAdmin will still only show the latest
error, but MaxCtrl will now show all errors if more than one error
occurs.
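Conceptually, and with invented names, the change amounts to
collecting the errors instead of overwriting a single one:

    #include <string>
    #include <utility>
    #include <vector>

    class RuntimeErrors
    {
    public:
        void add(std::string error)
        {
            m_errors.push_back(std::move(error));
        }

        // MaxAdmin: only the latest error.
        std::string latest() const
        {
            return m_errors.empty() ? "" : m_errors.back();
        }

        // REST API / MaxCtrl: all accumulated errors.
        const std::vector<std::string>& all() const
        {
            return m_errors;
        }

    private:
        std::vector<std::string> m_errors;
    };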
Previously, runtime monitor modifications could directly alter monitor fields,
which could leave the text-form parameters and reality out of sync. Also,
the configure function was not called for the entire monitor object, only
for the module implementation.
Now, all modifications go through the overridden configure function, which
calls the base class function. As most configuration changes are given in
text form, this removes the need for specific setters. The only exceptions
are the server add/remove operations, which must modify the text-form
server list.
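The flow, sketched with heavily simplified types (the actual
signatures are not shown here):

    #include <map>
    #include <string>

    using MonitorParams = std::map<std::string, std::string>;  // simplified

    class Monitor
    {
    public:
        virtual ~Monitor() = default;

        // Base class: keeps the text-form parameters in sync with reality.
        virtual bool configure(const MonitorParams& params)
        {
            m_params = params;
            return true;
        }

    protected:
        MonitorParams m_params;
    };

    class ClustrixMonitor : public Monitor
    {
    public:
        // All runtime modifications funnel through here.
        bool configure(const MonitorParams& params) override
        {
            if (!Monitor::configure(params))
            {
                return false;
            }

            return apply(params);  // module-specific application (placeholder)
        }

    private:
        bool apply(const MonitorParams&)
        {
            return true;
        }
    };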
Since the current node id can be obtained using the function gtmnid(),
the queries for finding out whether a node is in the quorum and whether
it is softfailed can be made simpler.
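For instance, the checks can be expressed directly in terms of the
current node, roughly as follows; the system table and column names
are illustrative, not taken from the actual implementation:

    // Is this node part of the quorum?
    const char ZIN_QUORUM[] =
        "SELECT COUNT(*) FROM system.membership WHERE nid = gtmnid()";

    // Has this node been softfailed?
    const char ZSOFTFAILED[] =
        "SELECT COUNT(*) FROM system.softfailed_nodes WHERE nid = gtmnid()";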
When a softfailed node is finally revoked, it will appear as the
single node in a functioning Clustrix cluster. To ensure that the
Clustrix monitor will not stick to that node, if the node that is
used as the hub is softfailed, it is immediately replaced with
another node.
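Schematically, with invented names:

    // If the node currently used as the hub has been softfailed, it is
    // about to leave the cluster; drop it and pick another node at once.
    struct Node
    {
        bool is_softfailed() const;  // reflects the SOFTFAIL status
    };

    Node* choose_hub();  // picks some other available node

    Node* m_pHub = nullptr;

    void check_hub()
    {
        if (m_pHub && m_pHub->is_softfailed())
        {
            m_pHub = choose_hub();
        }
    }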
Worker::STOPPED -> MONITOR_STATE_STOPPED
Worker::POLLING -> MONITOR_STATE_RUNNING
Worker::PROCESSING -> MONITOR_STATE_RUNNING
By defining the monitor state from the worker state there is
no risk that they will ever get out of sync. And there is one
less thing to maintain.
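A sketch of the derivation; the enum definitions here are minimal
stand-ins for the ones named above:

    enum monitor_state_t
    {
        MONITOR_STATE_STOPPED,
        MONITOR_STATE_RUNNING
    };

    struct Worker
    {
        enum state_t { STOPPED, POLLING, PROCESSING };
    };

    monitor_state_t monitor_state_from(Worker::state_t state)
    {
        switch (state)
        {
        case Worker::STOPPED:
            return MONITOR_STATE_STOPPED;

        case Worker::POLLING:
        case Worker::PROCESSING:
            return MONITOR_STATE_RUNNING;
        }

        return MONITOR_STATE_STOPPED;  // not reached
    }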
When the servers of a service are defined by a monitor, then
at startup all servers of the monitor should be added to relevant
services. Likewise, when a server is added to or removed from a
monitor at runtime, those changes should affect services as well.
However, whether that should happen or not depends upon the monitor.
In the case of the Clustrix monitor this should not happen as it
adds and removes servers depending on the runtime state of the
Clustrix cluster.
The services whose servers are defined using a monitor will
now be populated from the monitor.
Note: no consideration has yet been given to runtime changes.
The likely reason for a node being down is that some cluster-level
modifications have been performed. Consequently, a cluster check
should be triggered in that case.
When checking the node info, also include information about whether
a node is being SOFTFAILed. If it is, turn on the `Being Drained`
bit.
A node is SOFTFAILed with the intention of removing it, so it is
better not to create new connections to it, as they would later be
broken when the node is actually taken down.
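Schematically, with an illustrative bit value and helper names:

    #include <cstdint>

    // The actual bit value and name in MaxScale may differ.
    const uint64_t SERVER_BEING_DRAINED = 1 << 5;

    struct Server
    {
        uint64_t status = 0;

        void set_status(uint64_t bit)   { status |= bit; }
        void clear_status(uint64_t bit) { status &= ~bit; }
    };

    void update_softfail_status(Server& server, bool being_softfailed)
    {
        // A SOFTFAILed node is about to be removed, so routers should
        // not create new connections to it.
        if (being_softfailed)
        {
            server.set_status(SERVER_BEING_DRAINED);
        }
        else
        {
            server.clear_status(SERVER_BEING_DRAINED);
        }
    }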