Manager Node Failover

The Yellowbrick appliance was designed with redundancy in mind and has no single point of failure. The system can sustain multiple component failures, automatically failing over to redundant hardware that is detected to be operational. Depending on which component fails, the failover process is either fully transparent, with no downtime, or requires a brief disruption to services (less than 60 seconds). Most failures are handled transparently with no disruption.

Failover Scenarios

You can inspect the system at any time to check the health of its components. You can use the System Management Console (SMC) or the Yellowbrick command line interface (ybcli) for this purpose.

Should a catastrophic event cause a failover between the manager nodes, the manager node that failed is taken out of the cluster and, if it is still responding, put into standby mode. Once the failure condition has been cleared, you can bring the manager node back into the cluster manually. This process is described in this section.

During system maintenance, you should put the cluster into maintenance mode. You can do this either via the SMC or by using the following ybcli command:
system maintenance on
This command stops the cluster from responding to changes in the environment that would otherwise trigger node failover. It also shuts down the Yellowbrick database stack. When maintenance is complete, bring the system back to normal operating mode by running the following ybcli command:
system maintenance off
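
For example, a maintenance window bracketed by these two commands looks like the following session. The prompt follows the status example later in this section; the confirmation messages shown here are illustrative and may differ between releases:

YBCLI (PRIMARY)> system maintenance on
Maintenance mode enabled. Failover monitoring is suspended and the database stack is shut down.

... perform maintenance work ...

YBCLI (PRIMARY)> system maintenance off
Maintenance mode disabled. Failover monitoring has resumed.
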
HA Cluster Status
Run the following ybcli command at any time to check the status of the HA cluster:
system status
This command returns the status of various resources in the system. All of these resources always run on the same manager node: the primary manager node. For example:
YBCLI (PRIMARY)> system status

Cluster nodes configured: 2
----------------------------
Node 1 (PRIMARY   - LOCAL NODE  ) : yb100-mgr0.yellowbrickroad.io -> ONLINE
Node 2 (SECONDARY - REMOTE NODE ) : yb100-mgr1.yellowbrickroad.io -> ONLINE

Yellowbrick database running: YES
Yellowbrick database ready  : YES
Blade parity                : Enabled
Blade parity rebuilding     : NO
Maintenance mode            : NO
Software update in-progress : NO
Floating system IP          : 10.22.110.10/24

In this example, the yb100-mgr0 node is currently the primary node and is running all database resources. The other manager node, yb100-mgr1, is also online and ready to take over.

In other words, yb100-mgr0 is the primary manager node and is serving all user requests; yb100-mgr1 is the secondary manager node and will automatically take over resources if yb100-mgr0 becomes unavailable.

Manual Failover

If it becomes necessary to manually fail over from one manager node to the other, you can use the ybcli to complete the task.

  1. Start the ybcli as the ybdadmin user on either the primary or secondary manager node.
  2. Run the system status command to verify that the secondary manager node is present.
  3. Run the system failover command to perform a manual failover.
    Note: For a manual failover, it can take up to 2 minutes before the system responds again at the floating IP address.

It is important to log in to the manager node directly, rather than through the floating IP address, when running the system failover command, because the floating IP is briefly taken down as it moves between the nodes.
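
For example, to run the failover from yb100-mgr0 without going through the floating IP, you might connect over SSH to the node's own hostname and then start the ybcli. The hostnames and the ybdadmin account follow the earlier examples; substitute your own values:

$ ssh ybdadmin@yb100-mgr0.yellowbrickroad.io
$ ybcli
YBCLI (PRIMARY)> system status
YBCLI (PRIMARY)> system failover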

Continuing with the yb100-mgr0 and yb100-mgr1 example, failing over from yb100-mgr0 to yb100-mgr1 looks like this:
Initiating system failover
Monitoring completion. This can take 2 minutes. Notifications may appear. Standby...

System failover was successful. Yellowbrick database started.
Primary manager node is now: Remote node (yb100-mgr1.yellowbrickroad.io)

WARNING: A SYSTEM NODE ROLE CHANGE WAS DETECTED
Current roles
-------------
LOCAL NODE  : SECONDARY (ACTIVE)
REMOTE NODE : PRIMARY   (ACTIVE)

Now the yb100-mgr1 node is running all resources. Once the failover is complete, you can fail back to the original node, in this case yb100-mgr0; ybcli automatically determines when it is safe to do so. If the system cannot fail over, ybcli reports an error that explains the problem. This can happen, for example, while large amounts of data are being replicated between the nodes.

Use the system status command to determine when the process is complete.
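
After the failover in this example, running system status again shows the node roles reversed. The following excerpt is illustrative; compare it with the full status output shown earlier in this section:

YBCLI (SECONDARY)> system status

Cluster nodes configured: 2
----------------------------
Node 1 (SECONDARY - LOCAL NODE  ) : yb100-mgr0.yellowbrickroad.io -> ONLINE
Node 2 (PRIMARY   - REMOTE NODE ) : yb100-mgr1.yellowbrickroad.io -> ONLINE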