
# Failover with Keepalived

## Introduction

[Keepalived](http://www.keepalived.org/index.html) is routing software for
load balancing and high availability. It has several applications, but for this
tutorial the goal is to set up a simple IP failover between two servers running
MaxScale. If the main server fails, the backup machine takes over and receives
any new connections. The Keepalived settings used in this tutorial follow the
example given in [Simple keepalived failover setup on Ubuntu 14.04](
https://raymii.org/s/tutorials/Keepalived-Simple-IP-failover-on-Ubuntu.html).

Two hosts and one client machine are used, all in the same LAN. Both hosts run
MaxScale and Keepalived. The backend servers may be running on one of the hosts,
e.g. in Docker containers, or on separate machines for a more realistic setup.
Clients connect to the virtual IP (VIP), which is claimed by the current master
host.

![](images/Keepalived.png)

Once configured and running, the different Keepalived nodes continuously
broadcast their status to the network and listen for each other. If a node does
not receive a status message from another node with a higher priority than
itself, it will claim the VIP, effectively becoming the master. Thus, a node can
be put online or removed by starting and stopping the Keepalived service.

If the current master node is removed (e.g. by stopping the service or pulling
the network cable) the remaining nodes will quickly elect a new master and
future traffic to the VIP will be directed to that node. Any connections to the
old master node will naturally break. If the old master comes back online, it
will again claim the VIP, breaking any connections to the backup machine.

MaxScale has no knowledge of this event happening. Both MaxScales are running
normally, monitoring the backend servers and listening for client connections.
Since clients are connecting through the VIP, only the machine claiming the VIP
will receive incoming connections. The connections between MaxScale and the
backends are using real IPs and are unaffected by the VIP.
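
The election rule described above can be illustrated with a small shell sketch.
It is a simplified model and not how Keepalived is implemented: `elect_master`
is a hypothetical helper, the host name `maxscale1` is assumed for illustration,
and ties between equal priorities are not modelled.

```bash
#!/bin/bash
# Simplified model of master election: among the nodes currently online,
# the one with the highest priority claims the VIP.
# Arguments are "name:priority" pairs; prints the name of the winning node.
elect_master() {
  printf '%s\n' "$@" | sort -t: -k2,2nr | head -n 1 | cut -d: -f1
}

# Both nodes online: the node with the higher priority wins.
elect_master maxscale1:150 maxscale2:100
# Primary removed: the remaining backup takes over.
elect_master maxscale2:100
```

The first call prints `maxscale1` and the second prints `maxscale2`, mirroring
the failover and the later return of the original master.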

## Configuration

MaxScale does not require any specific configuration to work with Keepalived in
this simple setup; it just needs to be running on both hosts. The MaxScale
configurations should be similar enough that both look identical to
connecting clients. In practice, the listening ports and related services should
be the same. Setting the service-level parameter *version_string* to different
values on the MaxScale nodes is recommended, as it is shown to connecting
clients and indicates which node they are connected to.

Keepalived requires specific setups on both machines. On the **primary host**,
the file */etc/keepalived/keepalived.conf* should be as follows.

```
vrrp_instance VI_1 {
  state MASTER
  interface eth0
  virtual_router_id 51
  priority 150
  advert_int 1
  authentication {
    auth_type PASS
    auth_pass mypass
  }
  virtual_ipaddress {
    192.168.1.123
  }
}
```

The *state* must be MASTER on both hosts. *virtual_router_id* and *auth_pass*
must be identical on all hosts. The *interface* defines the network interface
used. This depends on the system, but often the correct value is *eth0*,
*enp0s12f3* or similar. *priority* defines the voting strength between different
Keepalived instances when negotiating which one should be the master. The
instances should have different priority values. In this example, the backup
host(s) could have priority 149, 148 and so on. *advert_int* is the interval, in
seconds, at which a host “advertises” its existence to other Keepalived hosts.
One second is a reasonable value.

*virtual_ipaddress* (VIP) is the IP the different Keepalived hosts try to claim
and must be identical between the hosts. For IP negotiation to work, the VIP
must be in the local network address space and unclaimed by any other machine
in the LAN. An example *keepalived.conf*-file for a **backup host** is listed
below.

```
vrrp_instance VI_1 {
  state MASTER
  interface eth0
  virtual_router_id 51
  priority 100
  advert_int 1
  authentication {
    auth_type PASS
    auth_pass mypass
  }
  virtual_ipaddress {
    192.168.1.123
  }
}
```
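
Because some settings must match between the hosts while the priorities should
differ, it can help to compare the two files before starting the service. The
following is a minimal sketch under stated assumptions: `vrrp_setting` and
`same_vrrp_group` are hypothetical helpers, both configuration files are assumed
to be available locally, and only single-line `key value` settings are read, so
the *virtual_ipaddress* block is not compared.

```bash
#!/bin/bash
# Print the first value of a single-line setting (e.g. virtual_router_id)
# from a keepalived.conf-style file.
vrrp_setting() {
  local key="$1" file="$2"
  awk -v key="$key" '$1 == key { print $2; exit }' "$file"
}

# Succeed if the two files agree on the settings that must be identical
# and use different priorities.
same_vrrp_group() {
  local a="$1" b="$2" key
  for key in virtual_router_id auth_pass; do
    if [ "$(vrrp_setting "$key" "$a")" != "$(vrrp_setting "$key" "$b")" ]; then
      echo "Mismatch in $key"
      return 1
    fi
  done
  if [ "$(vrrp_setting priority "$a")" = "$(vrrp_setting priority "$b")" ]; then
    echo "Both hosts have the same priority"
    return 1
  fi
  echo "Settings look consistent"
}
```

For example, `same_vrrp_group primary.conf backup.conf` could be run against
copies of the two files shown above; the file names are placeholders.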

Once the Keepalived service is running, recent log entries can be printed with
the command `service keepalived status`.

```
Aug 11 10:27:59 maxscale2 Keepalived_vrrp[27369]: VRRP_Instance(VI_1) Received higher prio advert
Aug 11 10:27:59 maxscale2 Keepalived_vrrp[27369]: VRRP_Instance(VI_1) Entering BACKUP STATE
Aug 11 10:27:59 maxscale2 Keepalived_vrrp[27369]: VRRP_Instance(VI_1) removing protocol VIPs.
```
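
Besides reading the log, one way to confirm which host currently holds the VIP
is to look at the addresses assigned to its interfaces. The sketch below assumes
a Linux host with the `ip` tool from iproute2; `holds_vip` is a hypothetical
helper and the address is the example VIP from this tutorial.

```bash
#!/bin/bash
# Example VIP used in this tutorial; adjust to the local setup.
VIP="192.168.1.123"

# Reads `ip -o -4 addr show` style output on stdin and succeeds if the
# given address is assigned to some interface on this host.
holds_vip() {
  local vip="$1"
  # Each address appears as "inet <address>/<prefix length>".
  grep -qF "inet ${vip}/"
}

# Example use on a live host (commented out so the sketch has no side effects):
# if ip -o -4 addr show | holds_vip "$VIP"; then
#   echo "This host holds the VIP"
# fi
```

Running the commented-out check on both hosts should report the VIP on the
current master only.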

## MaxScale health check

So far, none of this tutorial has been MaxScale-specific and the health of the
MaxScale process has been ignored. To ensure that MaxScale is running on the
current master host, a *check script* should be set. Keepalived runs the script
regularly, and if the script returns an error value, the Keepalived node
assumes that it has failed, stops broadcasting its state and relinquishes the
VIP. This allows another node to take the master status and claim the VIP. To
define a check script, modify the configuration as follows. The example is for
the primary node. See [Keepalived Check and Notify Scripts](
https://tobrunet.ch/2013/07/keepalived-check-and-notify-scripts/) for more
information.

```
vrrp_script chk_myscript {
  script "/home/scripts/is_maxscale_running.sh"
  interval 2 # check every 2 seconds
  fall 2 # require 2 failures for KO
  rise 2 # require 2 successes for OK
}

vrrp_instance VI_1 {
  state MASTER
  interface eth0
  virtual_router_id 51
  priority 150
  advert_int 1
  authentication {
    auth_type PASS
    auth_pass mypass
  }
  virtual_ipaddress {
    192.168.1.123
  }
  track_script {
    chk_myscript
  }
}
```
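
The effect of the *fall* and *rise* settings can be illustrated with a small
simulation. This is a simplified model of the counting logic, not Keepalived's
actual code: `simulate_checks` is a hypothetical helper, and the script is
assumed to start in the OK state.

```bash
#!/bin/bash
# Arguments: fall, rise, then a sequence of check script exit codes.
# Prints the script state (OK or KO) after each check.
simulate_checks() {
  local fall="$1" rise="$2"
  shift 2
  local state="OK" failures=0 successes=0 rc
  for rc in "$@"; do
    if [ "$rc" -eq 0 ]; then
      successes=$((successes + 1))
      failures=0
      # Recover only after `rise` consecutive successes.
      if [ "$state" = "KO" ] && [ "$successes" -ge "$rise" ]; then
        state="OK"
      fi
    else
      failures=$((failures + 1))
      successes=0
      # Fail only after `fall` consecutive failures.
      if [ "$state" = "OK" ] && [ "$failures" -ge "$fall" ]; then
        state="KO"
      fi
    fi
    echo "$state"
  done
}

# A single failure is tolerated; two failures in a row mark the script KO.
simulate_checks 2 2 0 3 0 3 3 0 0
```

With `fall 2` and `rise 2`, the example sequence stays OK through the isolated
failure, turns KO on the second consecutive failure and returns to OK after two
consecutive successes.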

An example script, *is_maxscale_running.sh*, is listed below. The script uses
MaxAdmin to try to contact the locally running MaxScale and request a server
list, then checks that the list contains the expected server names. The
`timeout` command ensures that the MaxAdmin call exits in a reasonable time. The
script detects if MaxScale has crashed, is stuck, or is so overburdened that it
no longer responds to connections.

```
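#!/bin/bash
# Health check for Keepalived: exit 0 if MaxScale responds, non-zero otherwise.
# The output file is created in the current working directory.
fileName="maxadmin_output.txt"
rm -f "$fileName"  # -f: no error if the file does not exist yet
# Give MaxAdmin at most two seconds to list the servers.
timeout 2s maxadmin list servers > "$fileName"
to_result=$?
if [ "$to_result" -ne 0 ]
then
  echo "Timed out or error, timeout returned $to_result"
  exit 3
else
  echo "MaxAdmin success, rval is $to_result"
  echo "Checking maxadmin output sanity"
  # server1 and server2 are the server names expected in the list.
  if grep -q server1 "$fileName" && grep -q server2 "$fileName"
  then
    echo "All is fine"
    exit 0
  else
    echo "Something is wrong"
    exit 3
  fi
fi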
```

If the check script fails, Keepalived logs entries similar to the following and
releases the VIP.

```
Aug 11 10:51:56 maxscale2 Keepalived_vrrp[20257]: VRRP_Script(chk_myscript) failed
Aug 11 10:51:57 maxscale2 Keepalived_vrrp[20257]: VRRP_Instance(VI_1) Entering FAULT STATE
Aug 11 10:51:57 maxscale2 Keepalived_vrrp[20257]: VRRP_Instance(VI_1) removing protocol VIPs.
Aug 11 10:51:57 maxscale2 Keepalived_vrrp[20257]: VRRP_Instance(VI_1) Now in FAULT state
```