MXS-2275 Implement [un]softfailing

It is now possible to [un]softfail a Clustrix node via MaxScale
using a Clustrix monitor module command.

In case a node is successfully softfailed, the `Being Drained` bit
will automatically be turned on. Similarly, if a node is successfully
unsoftfailed, the `Being Drained` bit will be cleared.
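
For illustration, using the monitor and server names from the documentation
changes below, a node is softfailed via maxctrl like:

    maxctrl call command clustrixmon softfail TheClustrixMonitor @@TheClustrixMonitor:node-1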
Johan Wikman 2019-01-30 16:12:34 +02:00
parent 2e395c4477
commit cb07687672
3 changed files with 218 additions and 5 deletions

View File

@@ -28,6 +28,26 @@ password=mypwd
```
## Dynamic Servers
The server objects the Clustrix monitor creates for each detected
Clustrix node will be named like
```
@@<name-of-clustrix-monitor>:node-<id>
```
where `<name-of-clustrix-monitor>` is the name of the Clustrix monitor
instance, as defined in the MaxScale configuration file, and `<id>` is the
id of the Clustrix node.
For instance, with the Clustrix monitor defined as above and a Clustrix
cluster consisting of 3 nodes whose ids are `1`, `2` and `3` respectively,
the names of the created server objects will be:
```
@@TheClustrixMonitor:node-1
@@TheClustrixMonitor:node-2
@@TheClustrixMonitor:node-3
```
### Grants
Note that the monitor user _must_ have `SELECT` grant on the following tables:
@@ -37,8 +57,13 @@ Note that the monitor user _must_ have `SELECT` grant on the following tables:
You can give the necessary grants using the following commands:
```
grant select on system.membership to 'myuser'@'%';
grant select on system.nodeinfo to 'myuser'@'%';
GRANT SELECT ON system.membership TO 'myuser'@'%';
GRANT SELECT ON system.nodeinfo TO 'myuser'@'%';
```
Further, if you want to be able to _softfail_ and _unsoftfail_ a node via MaxScale,
then the monitor user must have `SUPER` privileges, which can be granted like:
```
GRANT SUPER ON *.* TO 'myuser'@'%';
```
The user name must be changed to the one actually being used.
@@ -69,3 +94,45 @@ considers a particular node to be down. The default value is 2.
```
health_check_threshold=3
```
## Commands
The Clustrix monitor supports the following module commands.
### `softfail`
With the `softfail` module command, a node can be _softfailed_ via
MaxScale. The command requires as arguments the name of the Clustrix
monitor instance (as defined in the configuration file) and the name
of the node to be softfailed.
For instance, with a configuration file like
```
[TheClustrixMonitor]
type=monitor
module=clustrixmon
...
```
the node whose server name is `@@TheClustrixMonitor:node-1` can
be softfailed like:
```
$ maxctrl call command clustrixmon softfail TheClustrixMonitor @@TheClustrixMonitor:node-1
```
If a node is successfully softfailed, then the status of the corresponding
MaxScale server object will be set to `Being Drained`, which will prevent
new connections from being created to the node.
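The state change can be verified with `maxctrl`; for instance, assuming the
server name used above, the state reported for the server should now include
`Being Drained`:
```
$ maxctrl show server @@TheClustrixMonitor:node-1
```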
### `unsoftfail`
With the `unsoftfail` module command, a node can be _unsoftfailed_ via
MaxScale. The command requires as arguments the name of the Clustrix
monitor instance (as defined in the configuration file) and the name
of the node to be unsoftfailed.
With a setup similar to the `softfail` case, a node can be unsoftfailed
like:
```
$ maxctrl call command clustrixmon unsoftfail TheClustrixMonitor @@TheClustrixMonitor:node-1
```
If a node is successfully unsoftfailed, then the `Being Drained` status of
the corresponding MaxScale server object will be cleared.
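As with `softfail`, the result can be checked with `maxctrl`; after a
successful unsoftfail the server should no longer be listed with the
`Being Drained` state:
```
$ maxctrl list servers
```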

View File

@@ -14,11 +14,21 @@
#include "clustrixmonitor.hh"
#include <algorithm>
#include <set>
#include <maxscale/json_api.h>
#include "../../../core/internal/config_runtime.hh"
namespace http = mxb::http;
using namespace std;
#define LOG_JSON_ERROR(ppJson, format, ...) \
    do { \
        MXS_ERROR(format, ##__VA_ARGS__); \
        if (ppJson) \
        { \
            *ppJson = mxs_json_error_append(*ppJson, format, ##__VA_ARGS__); \
        } \
    } while (false)
namespace
{
@@ -57,13 +67,45 @@ bool ClustrixMonitor::configure(const MXS_CONFIG_PARAMETER* pParams)
bool ClustrixMonitor::softfail(SERVER* pServer, json_t** ppError)
{
    MXS_NOTICE("Should softfail %s.", pServer->address);
    bool rv = false;
    if (is_running())
    {
        call([this, pServer, ppError, &rv]() {
                 rv = perform_softfail(pServer, ppError);
             },
             EXECUTE_QUEUED);
    }
    else
    {
        LOG_JSON_ERROR(ppError,
                       "%s: The monitor is not running and hence "
                       "SOFTFAIL cannot be performed for %s.",
                       m_name, pServer->address);
    }
    return true;
}

bool ClustrixMonitor::unsoftfail(SERVER* pServer, json_t** ppError)
{
    MXS_NOTICE("Should unsoftfail %s.", pServer->address);
    bool rv = false;
    if (is_running())
    {
        call([this, pServer, ppError, &rv]() {
                 rv = perform_unsoftfail(pServer, ppError);
             },
             EXECUTE_QUEUED);
    }
    else
    {
        LOG_JSON_ERROR(ppError,
                       "%s: The monitor is not running and hence "
                       "UNSOFTFAIL cannot be performed for %s.",
                       m_name, pServer->address);
    }
    return true;
}
@@ -217,7 +259,7 @@ void ClustrixMonitor::refresh_nodes()
// '@@' ensures no clash with user created servers.
// Monitor name ensures no clash with other Clustrix monitor instances.
string name = string("@@") + m_name + ":server-" + std::to_string(id);
string name = string("@@") + m_name + ":node-" + std::to_string(id);
auto nit = m_nodes.find(id);
auto mit = memberships.find(id);
@@ -566,3 +608,93 @@ bool ClustrixMonitor::check_http(Call::action_t action)
return false;
}
bool ClustrixMonitor::perform_softfail(SERVER* pServer, json_t** ppError)
{
    return perform_operation(Operation::SOFTFAIL, pServer, ppError);
}

bool ClustrixMonitor::perform_unsoftfail(SERVER* pServer, json_t** ppError)
{
    return perform_operation(Operation::UNSOFTFAIL, pServer, ppError);
}

bool ClustrixMonitor::perform_operation(Operation operation,
                                        SERVER* pServer,
                                        json_t** ppError)
{
    bool performed = false;
    const char ZSOFTFAIL[] = "SOFTFAIL";
    const char ZUNSOFTFAIL[] = "UNSOFTFAIL";
    const char* zOperation = (operation == Operation::SOFTFAIL) ? ZSOFTFAIL : ZUNSOFTFAIL;
    if (!m_pHub_con)
    {
        check_hub_and_refresh_nodes();
    }
    if (m_pHub_con)
    {
        auto it = find_if(m_nodes.begin(), m_nodes.end(),
                          [pServer] (const std::pair<int, ClustrixNode>& element) {
                              return element.second.server() == pServer;
                          });
        if (it != m_nodes.end())
        {
            ClustrixNode& node = it->second;
            const char ZQUERY_FORMAT[] = "ALTER CLUSTER %s %d";
            int id = node.id();
            char zQuery[sizeof(ZQUERY_FORMAT) + sizeof(ZUNSOFTFAIL) + UINTLEN(id)]; // ZUNSOFTFAIL is longer
            sprintf(zQuery, ZQUERY_FORMAT, zOperation, id);
            if (mysql_query(m_pHub_con, zQuery) == 0)
            {
MXS_NOTICE("Clustrix monitor %s performed %s on node %d (%s).",
m_name, zOperation, id, pServer->address);
if (operation == Operation::SOFTFAIL)
{
MXS_NOTICE("%s: Turning on 'Being Drained' on server %s.",
m_name, pServer->address);
pServer->set_status(SERVER_BEING_DRAINED);
}
else
{
mxb_assert(operation == Operation::UNSOFTFAIL);
MXS_NOTICE("%s: Turning off 'Being Drained' on server %s.",
m_name, pServer->address);
pServer->clear_status(SERVER_BEING_DRAINED);
}
}
            else
            {
                LOG_JSON_ERROR(ppError,
                               "%s: The execution of '%s' failed: %s",
                               m_name, zQuery, mysql_error(m_pHub_con));
            }
        }
        else
        {
            LOG_JSON_ERROR(ppError,
                           "%s: The server %s is not being monitored, "
                           "cannot perform %s.",
                           m_name, pServer->address, zOperation);
        }
    }
    else
    {
        LOG_JSON_ERROR(ppError,
                       "%s: Could not connect to any Clustrix node, "
                       "cannot perform %s of %s.",
                       m_name, zOperation, pServer->address);
    }
    return performed;
}

View File

@@ -87,6 +87,20 @@ private:
    void initiate_delayed_http_check();
    bool check_http(Call::action_t action);
    bool perform_softfail(SERVER* pServer, json_t** ppError);
    bool perform_unsoftfail(SERVER* pServer, json_t** ppError);

    enum class Operation
    {
        SOFTFAIL,
        UNSOFTFAIL,
    };

    bool perform_operation(Operation operation,
                           SERVER* pServer,
                           json_t** ppError);

    static long now()
    {
        return mxb::WorkerLoad::get_time_ms();