Cluster Node Maintenance

Overview

This topic is relevant for high availability clusters as well as single data node clusters.

The procedures in this topic allow you to safely shut down nodes in your cluster and restart them without causing data corruption. Once the node is shut down, you can perform whatever maintenance is needed on the node or nodes. These procedures are helpful for scheduled maintenance, such as upgrading your operating system (both TufinOS and non-TufinOS).

Single Data Node Cluster Environment

When performing maintenance work on a data node in a single data node environment, you will not be able to keep TOS running and you will experience downtime. Once you have completed the maintenance work and started TOS again, your environment should return to full functionality immediately.

Select which of the following nodes you want to perform maintenance on. Then follow the procedure.

  1. The data node

  2. A worker node

High Availability Environment

When performing maintenance on a single data node or worker node in an HA environment, you can keep TOS running. There is a small chance of a few minutes of downtime right after you power down the node. In addition, if a second data node fails while one is already down, the cluster will fail. See High Availability.

If you shut down the node without following this procedure, the cluster treats the shutdown like a node failure, which triggers high availability failover. Recovering from this state takes longer and carries a higher risk of data corruption.

If you want to avoid downtime, only do maintenance work on one data node at a time. After you complete the maintenance work, check the cluster health before starting maintenance on another node. Only perform maintenance on nodes that are part of a healthy cluster.

If you don't need to avoid downtime, you can run maintenance on all the data nodes in your cluster by powering down your entire cluster at once. See Power Down All Nodes.

Select which of the following nodes you want to perform maintenance on. Then follow the procedure.

  1. A single data node within an HA cluster

  2. A worker node

Perform Maintenance Work on a Single Data Node within an HA Deployment

  1. On the node on which you will work, run the command:

    [<ADMIN> ~]$ kubectl get node 

    A list of nodes which currently exist in the cluster appears.

  2. In the output, note the name of the data node on which you want to run maintenance.

  3. Run the command:

    [<ADMIN> ~]$ kubectl drain <NODENAME> --delete-emptydir-data --ignore-daemonsets   

    where <NODENAME> is the name of the data node on which you want to run the maintenance.

  4. Complete the desired maintenance.

  5. When the maintenance is completed, power up (if required) and log in again to the data node.

  6. Run the command:

    [<ADMIN> ~]$ kubectl uncordon <NODENAME>

  7. Check the cluster health by repeating Step 1. Wait at least one day before performing maintenance on another data node in the cluster.
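The drain and uncordon steps above can be wrapped in small shell functions so that the same flags are used every time. The following is only a sketch, not part of TOS: the function names and the DRY_RUN switch are hypothetical, introduced here for illustration. When DRY_RUN is set to 1, each command is printed instead of executed, which lets you review it before running it on a live cluster.

```shell
#!/usr/bin/env bash
# Sketch: wrap the drain / uncordon steps for one data node.
# DRY_RUN is a hypothetical switch introduced for this example; when it is
# set to 1 the kubectl command is printed instead of being executed.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

drain_node() {
    # Same flags as in Step 3 of the procedure.
    run kubectl drain "$1" --delete-emptydir-data --ignore-daemonsets
}

uncordon_node() {
    # Step 6: mark the node as schedulable again.
    run kubectl uncordon "$1"
}
```

For example, after sourcing the functions, `DRY_RUN=1 drain_node node2` prints the drain command for a node named node2 (a placeholder name) without touching the cluster.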

Perform Maintenance Work on a Single Worker Node

  1. If necessary, power the worker node down.

  2. Perform the required maintenance.

  3. If necessary, power the worker node back up.

  4. Confirm that the cluster is restored to full operation by running kubectl get node, as in Step 1 of the data node procedure. If the cluster is healthy, you can proceed to run maintenance on additional worker nodes as needed.
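One way to confirm that all nodes have returned to service is to inspect the STATUS column in the output of kubectl get node. The sketch below assumes the default column layout (NAME first, STATUS second); the function name is hypothetical. It lists every node whose status is not exactly Ready, so a node that is still cordoned (Ready,SchedulingDisabled) is reported as well. An empty result only shows that Kubernetes considers the nodes ready; it is not a full TOS health check.

```shell
# Print the names of nodes whose STATUS column is not exactly "Ready".
# Reads `kubectl get node` output on standard input and skips the header row.
not_ready_nodes() {
    awk 'NR > 1 && $2 != "Ready" { print $1 }'
}
```

Typical use would be `kubectl get node | not_ready_nodes`; no output means every listed node reports Ready.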

Perform Maintenance on All Nodes in the Cluster

  1. Power down all nodes in the cluster (see Power Down All Nodes) and complete the desired maintenance.

  2. Power the nodes back up and log in to the primary data node.

  3. Restart TOS on the primary data node:

    [<ADMIN> ~]$ sudo tos start 

  4. All TOS processes restart on all the data nodes in the cluster, and you can resume using TOS.

  5. Confirm that the cluster is restored to full operation by running kubectl get node and checking that every node in the cluster is listed with the status Ready.

/opt Partition Disk Usage

To ensure TOS functions properly, keep disk usage on the /opt partition below 70%. When the /opt partition becomes 90% full, TOS stops making automatic backups and stops core services. We recommend configuring TOS monitoring to send notifications when storage consumption gets too high.
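As a quick manual check alongside TOS monitoring, the two thresholds above can be applied to the output of df. This is a minimal sketch under stated assumptions: the function names and the OK / WARNING / CRITICAL labels are hypothetical, and only the 70% and 90% limits come from this topic.

```shell
# Classify a /opt usage percentage (integer) against the documented limits:
# below 70 is fine, 70 to 89 needs attention, 90 or more is the point at
# which backups and core services stop.
opt_usage_state() {
    if [ "$1" -ge 90 ]; then
        echo "CRITICAL"
    elif [ "$1" -ge 70 ]; then
        echo "WARNING"
    else
        echo "OK"
    fi
}

# Current usage of the filesystem holding /opt, as a bare integer.
# Uses the POSIX output format of df, where column 5 is the capacity.
opt_usage_percent() {
    df -P /opt | awk 'NR == 2 { gsub("%", "", $5); print $5 }'
}
```

On a node, `opt_usage_state "$(opt_usage_percent)"` prints the label for the current usage.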