System Maintenance

The Yellowbrick appliance requires very little maintenance; however, you may need to verify that various components are working as intended.

The best way to get an overview of the status of the system is to start the ybcli. On a healthy system, the ybcli returns information like this:
Yellowbrick Data CLI vx.x.x
Copyright (c) 2016 Yellowbrick Data, Inc.
All rights reserved
-----------------------------------------
Redundant manager node detected
YBCLI is currently running on the PRIMARY manager node.
Local manager node : yb100-mgr0.yellowbrick.io (PRIMARY ACTIVE)
Remote manager node: yb100-mgr1.yellowbrick.io (SECONDARY ACTIVE)

Type 'help' for a list of supported commands

YBCLI (PRIMARY)> 

The output includes detailed information about which nodes are present in the system, their roles, and their current status. Any errors, such as no primary node being found or the system being in maintenance mode, are reported during ybcli startup.
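If you monitor several systems, the node roles in the startup banner can be extracted programmatically. The sketch below is illustrative only: it assumes the banner format shown above and is not part of the ybcli itself.

```python
import re

def parse_manager_roles(banner: str) -> dict:
    """Extract manager node hostnames and their roles from the ybcli
    startup banner (line format assumed from the example above)."""
    roles = {}
    for line in banner.splitlines():
        m = re.match(r"(?:Local|Remote) manager node\s*:\s*(\S+)\s+\(([^)]+)\)", line)
        if m:
            roles[m.group(1)] = m.group(2)
    return roles

banner = """\
Local manager node : yb100-mgr0.yellowbrick.io (PRIMARY ACTIVE)
Remote manager node: yb100-mgr1.yellowbrick.io (SECONDARY ACTIVE)"""
print(parse_manager_roles(banner))
```

A monitoring script could alert when neither node reports PRIMARY ACTIVE.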

The best way to get a complete overview of system health is to run the health all and status all commands. These two commands return the status of all components, including blades, power supplies, fans, and external processors.

Each subsystem may also have one or more alerts. Alerts may describe increased thermal conditions, unstable power, or blades not operating nominally. For example, the following command retrieves any outstanding hardware alerts for the blade in bay 1:
YBCLI (PRIMARY)> health blade 1
Bay:  1 Status: ok Power: on LED: [ OFF OFF ON  ] FRU: C4-0005-01 R01 Serial: TAA160802003DE CPU: Booted (YBOS)

Retrieving blade alerts...

Blade alerts reported: 1
Bay:  1: volt_00_pvccin_cpu_mV → 2234 (max error)
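Alert lines like the one above can be parsed into structured records for logging or alerting. This is a minimal sketch that assumes the single alert-line format shown in the example; it is not an official tool or format specification.

```python
import re

def parse_blade_alerts(output: str):
    """Pull (bay, sensor, value, severity) tuples out of blade alert
    lines such as 'Bay:  1: volt_00_pvccin_cpu_mV → 2234 (max error)'."""
    alerts = []
    for line in output.splitlines():
        m = re.match(r"Bay:\s*(\d+):\s*(\S+)\s*(?:→|->)\s*(\d+)\s*\(([^)]+)\)", line)
        if m:
            alerts.append((int(m.group(1)), m.group(2), int(m.group(3)), m.group(4)))
    return alerts

sample = "Bay:  1: volt_00_pvccin_cpu_mV → 2234 (max error)"
# Each tuple could then be forwarded to whatever alerting pipeline you use.
print(parse_blade_alerts(sample))
```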

It is also useful to verify that the manager nodes are actively replicating their block devices to each other over the high-speed network. The system software ensures that only a node with an up-to-date block device can be promoted to the primary manager node. To view the status of block replication, run the health storage command on any node that is not in standby state:
...
Replicated block device mounted: OK
Data replication active: OK
Data replication in progress: NO
...

In this example, the command was run on the primary node. It shows that block replication between the manager nodes is running and healthy, and that replication is not currently in progress.
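The key/value lines in the health storage output lend themselves to simple automated checks. The sketch below assumes the "Key: VALUE" layout shown in the excerpt above; the specific keys checked are taken from that excerpt.

```python
def parse_storage_health(output: str) -> dict:
    """Turn 'Key: VALUE' lines of health storage output into a dict
    (layout assumed from the excerpt above)."""
    status = {}
    for line in output.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            status[key.strip()] = value.strip()
    return status

sample = """\
Replicated block device mounted: OK
Data replication active: OK
Data replication in progress: NO"""

health = parse_storage_health(sample)
# The node is replicating and fully in sync when replication is active
# and no resynchronization is currently in progress.
in_sync = (health.get("Data replication active") == "OK"
           and health.get("Data replication in progress") == "NO")
print(in_sync)
```

A check like this could run periodically and raise an alert whenever replication stops or a resynchronization runs for longer than expected.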

You can take a manager node out of the cluster by putting it into standby mode. This means that the node will stop participating in the cluster and will not be considered for automatic system failover. However, you can still initiate a manual system failover to the node.

Note: After a node is put into standby mode, it will not actively replicate data blocks from the primary node. When it is brought out of standby mode, the node will resync its data store. Depending on how long the node was in standby mode and how much was written to the primary node in that time frame, it could take up to 2 minutes to fully synchronize the data store. The health storage command provides accurate information about the progress of resynchronizing the block device.

Disk Health

The manager nodes each have 4 block devices: 2 are mirrored for the operating system, and 2 are mirrored for the data in the Yellowbrick database. The latter pair is also actively synchronized to the secondary manager node.

You can check the status and health of all manager node disk drives by looking at the output of the health storage command.

A drive can be removed at any time and replaced with a drive provided by Yellowbrick. When possible, put the system into maintenance mode before any planned hardware replacement.

Restarting the Database After a Power Failure

If a power failure brings down both manager nodes, use the system status command to check the overall state of the system after power is restored. If the database does not come back up on its own, start it with the database start command.