Cluster Node Maintenance
Overview
This topic is relevant for high availability clusters as well as single data node clusters.
The procedures in this topic allow you to safely shut down nodes in your cluster and restart them without causing data corruption. Once a node is shut down, you can perform whatever maintenance it needs. These procedures are helpful for scheduled maintenance, such as upgrading your operating system (both TufinOS and non-TufinOS).
Single Data Node Cluster Environment
When performing maintenance work on a data node in a single data node environment, you will not be able to keep TOS running and you will experience downtime. Once you have completed the maintenance work and started TOS again, your environment should return to full functionality immediately.
Select which of the following nodes you want to perform maintenance on. Then follow the procedure.
High Availability Environment
When performing maintenance on a single data node or worker node in an HA environment, you will be able to keep TOS running. There is a small chance that you will experience a few minutes of downtime right after you power down the node. In addition, if another data node in your cluster fails while one is already down, the cluster will fail. See High Availability.
If you simply shut down the node without following this procedure, the cluster treats it as a node failure, which triggers high availability failover. That state takes longer to recover from and carries a higher risk of data corruption.
If you want to avoid downtime, only do maintenance work on one data node at a time. After you complete the maintenance work, check the cluster health before starting maintenance on another node. Only perform maintenance on nodes that are part of a healthy cluster.
If you don't need to avoid downtime, you can run maintenance on all the data nodes in your cluster by powering down your entire cluster at once. See Power Down All Nodes.
Select which of the following nodes you want to perform maintenance on. Then follow the procedure.
Perform Maintenance Work on a Single Data Node within an HA Deployment
- Check the cluster health.
  - On the primary data node, check the status of the k3s service.
    In the output, under the line k3s.service - Aurora Kubernetes, two lines should appear - Loaded... and Active... - similar to the example below. If they appear, continue with the next step; otherwise contact Tufin Support for assistance.
    Example output:
    [<ADMIN> ~]$ sudo systemctl status k3s
    Redirecting to /bin/systemctl status k3s.service
    ● k3s.service - Aurora Kubernetes
       Loaded: loaded (/etc/systemd/system/k3s.service; enabled; vendor preset: disabled)
       Active: active (running) since Tue 2021-08-24 17:14:38 IDT; 1 day 18h ago
         Docs: https://k3s.io
      Process: 1241 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
      Process: 1226 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
     Main PID: 1250 (k3s-server)
        Tasks: 1042
       Memory: 2.3G
  - On the same node or nodes, check the TOS status.
    In the output, if the System Status is Ok and all the items listed under Components appear as Ok, continue with the next steps. Otherwise contact Tufin Support for assistance.
    Example output:
    [<ADMIN> ~]$ sudo tos status
    Tufin Orchestration Suite 2.0
    System Status: Ok
    System Mode: Multi Node
    Nodes: 1 Master, 1 Worker. Total 2 nodes. Nodes are healthy.
    Components:
      Node: Ok
      Cassandra: Ok
      Mongodb: Ok
      Mongodb_sc: Ok
      Nats: Ok
      Neo4j: Ok
      Postgres: Ok
      Postgres_sc: Ok
- Backup your TOS data.
  If you are going to perform this procedure over multiple maintenance periods, create a new backup each time.
  - Create the backup using tos backup create.
  - You can check the backup creation status using tos backup status, which shows the status of backups in progress. Wait until completion before continuing.
  - Display the list of backups saved on the node using tos backup list.
  - Check that your backup file appears in the list, and that the status is "Completed".
  - Export the backup to a file using tos backup export.
  - If your backup files are saved locally:
    - Run sudo tos backup export to save your backup file from the TOS backup directory as a single .gzip file. If there are other backups present, they will be included as well.
    - Transfer the exported .gzip file to a safe, remote location. Make sure the location of your backups is safely documented and accessible, including any credentials needed to access them, so they are available for recovery when needed.
  - After the backup is exported, we recommend verifying that the file contents can be viewed; see the example after the output listings below.
  Example output:
  [<ADMIN> ~]$ sudo tos backup create
  [Aug 23 16:18:42] INFO Running backup
  Backup status can be monitored with "tos backup status"
  Example output:
  [<ADMIN> ~]$ sudo tos backup status
  Found active backup "23-august-2021-16-18"
  Example output:
  [<ADMIN> ~]$ sudo tos backup list
  ["23-august-2021-16-18"]
  Started: "2021-08-23 13:18:43 +0000 UTC"
  Completed: "N/A"
  Modules: "ST, SC"
  HA mode: "false"
  TOS release: "21.2 (PGA.0.0) Final"
  TOS build: "21.2.2100-210722164631509"
  Expiration Date: "2021-09-22 13:18:43 +0000 UTC"
  Status: "Completed"
  The command creates a single backup file.
  Example output:
  [<ADMIN> ~]$ sudo tos backup export
  [Aug 23 16:33:42] INFO Preparing target dir /opt/tufin/backups
  [Aug 23 16:33:42] INFO Compressing...
  [Aug 23 16:33:48] INFO Backup exported file: /opt/tufin/backups/backup-21-2-pga.0.0-final-20210823163342.tar.gzip
  [Aug 23 16:33:48] INFO Backup export has completed
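  The verification command itself is not preserved in this extract. Because the export produces a gzip-compressed tar archive (as shown in the export output above), one way to list its contents is with the standard tar utility; the path below is copied from the example output and will differ in your environment:
  [<ADMIN> ~]$ sudo tar -tzf /opt/tufin/backups/backup-21-2-pga.0.0-final-20210823163342.tar.gzip
  If tar lists the archived files without errors, the exported file is readable.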
- On the node on which you will work, run the command that lists the nodes in the cluster.
  A list of nodes which currently exist in the cluster is displayed.
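  The command itself is not shown in this extract. On a k3s-based TOS deployment, the standard kubectl call for listing cluster nodes is the following (an assumption, inferred from the kubectl usage later in this procedure):
  [<ADMIN> ~]$ kubectl get nodes
  The NAME column contains the value to use as <NODENAME> in the next steps.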
- In the output, note the name of the data node on which you want to run maintenance.
- Run the command:
  [<ADMIN> ~]$ kubectl drain <NODENAME> --delete-emptydir-data --ignore-daemonsets
  where <NODENAME> is the name of the data node on which you want to run the maintenance.
- Complete the desired maintenance.
- When the maintenance is completed, power up (if required) and log in again to the data node.
- Run the command:
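  The command itself is not preserved in this extract. After a kubectl drain, the standard kubectl call for returning a node to service (making it schedulable again) is uncordon; a sketch, assuming the same <NODENAME> used in the drain step:
  [<ADMIN> ~]$ kubectl uncordon <NODENAME>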
- Check the cluster health by repeating Step 1. Wait at least one day before performing maintenance on another data node in the cluster.
Perform Maintenance Work on a Single Worker Node
- Check the cluster health.
  - On the primary data node, check the status of the k3s service.
    In the output, under the line k3s.service - Aurora Kubernetes, two lines should appear - Loaded... and Active... - similar to the example below. If they appear, continue with the next step; otherwise contact Tufin Support for assistance.
    Example output:
    [<ADMIN> ~]$ sudo systemctl status k3s
    Redirecting to /bin/systemctl status k3s.service
    ● k3s.service - Aurora Kubernetes
       Loaded: loaded (/etc/systemd/system/k3s.service; enabled; vendor preset: disabled)
       Active: active (running) since Tue 2021-08-24 17:14:38 IDT; 1 day 18h ago
         Docs: https://k3s.io
      Process: 1241 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
      Process: 1226 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
     Main PID: 1250 (k3s-server)
        Tasks: 1042
       Memory: 2.3G
  - On the same node or nodes, check the TOS status.
    In the output, if the System Status is Ok and all the items listed under Components appear as Ok, continue with the next steps. Otherwise contact Tufin Support for assistance.
    Example output:
    [<ADMIN> ~]$ sudo tos status
    Tufin Orchestration Suite 2.0
    System Status: Ok
    System Mode: Multi Node
    Nodes: 1 Master, 1 Worker. Total 2 nodes. Nodes are healthy.
    Components:
      Node: Ok
      Cassandra: Ok
      Mongodb: Ok
      Mongodb_sc: Ok
      Nats: Ok
      Neo4j: Ok
      Postgres: Ok
      Postgres_sc: Ok
- If necessary, power the worker node down; a power-down example appears after this list.
- Perform the required maintenance.
- If necessary, power the worker node back up.
- Confirm that the cluster is restored to full operation by repeating Step 1. If the cluster is healthy, you can proceed to run maintenance on additional worker nodes as needed.
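A power-down example for the first step above, assuming you shut the node down from the operating system rather than from a hypervisor or management console (shutdown is a standard Linux utility, not a TOS command):
[<ADMIN> ~]$ sudo shutdown -h now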
Perform Maintenance on All Nodes in the Cluster
- Check the cluster health.
  - On the primary data node, check the status of the k3s service.
    In the output, under the line k3s.service - Aurora Kubernetes, two lines should appear - Loaded... and Active... - similar to the example below. If they appear, continue with the next step; otherwise contact Tufin Support for assistance.
    Example output:
    [<ADMIN> ~]$ sudo systemctl status k3s
    Redirecting to /bin/systemctl status k3s.service
    ● k3s.service - Aurora Kubernetes
       Loaded: loaded (/etc/systemd/system/k3s.service; enabled; vendor preset: disabled)
       Active: active (running) since Tue 2021-08-24 17:14:38 IDT; 1 day 18h ago
         Docs: https://k3s.io
      Process: 1241 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
      Process: 1226 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
     Main PID: 1250 (k3s-server)
        Tasks: 1042
       Memory: 2.3G
  - On the same node or nodes, check the TOS status.
    In the output, if the System Status is Ok and all the items listed under Components appear as Ok, continue with the next steps. Otherwise contact Tufin Support for assistance.
    Example output:
    [<ADMIN> ~]$ sudo tos status
    Tufin Orchestration Suite 2.0
    System Status: Ok
    System Mode: Multi Node
    Nodes: 1 Master, 1 Worker. Total 2 nodes. Nodes are healthy.
    Components:
      Node: Ok
      Cassandra: Ok
      Mongodb: Ok
      Mongodb_sc: Ok
      Nats: Ok
      Neo4j: Ok
      Postgres: Ok
      Postgres_sc: Ok
- Backup your TOS data.
  If you are going to perform this procedure over multiple maintenance periods, create a new backup each time.
  - Create the backup using tos backup create.
  - You can check the backup creation status using tos backup status, which shows the status of backups in progress. Wait until completion before continuing.
  - Display the list of backups saved on the node using tos backup list.
  - Check that your backup file appears in the list, and that the status is "Completed".
  - Export the backup to a file using tos backup export.
  - If your backup files are saved locally:
    - Run sudo tos backup export to save your backup file from the TOS backup directory as a single .gzip file. If there are other backups present, they will be included as well.
    - Transfer the exported .gzip file to a safe, remote location. Make sure the location of your backups is safely documented and accessible, including any credentials needed to access them, so they are available for recovery when needed.
  - After the backup is exported, we recommend verifying that the file contents can be viewed; see the example after the output listings below.
  Example output:
  [<ADMIN> ~]$ sudo tos backup create
  [Aug 23 16:18:42] INFO Running backup
  Backup status can be monitored with "tos backup status"
  Example output:
  [<ADMIN> ~]$ sudo tos backup status
  Found active backup "23-august-2021-16-18"
  Example output:
  [<ADMIN> ~]$ sudo tos backup list
  ["23-august-2021-16-18"]
  Started: "2021-08-23 13:18:43 +0000 UTC"
  Completed: "N/A"
  Modules: "ST, SC"
  HA mode: "false"
  TOS release: "21.2 (PGA.0.0) Final"
  TOS build: "21.2.2100-210722164631509"
  Expiration Date: "2021-09-22 13:18:43 +0000 UTC"
  Status: "Completed"
  The command creates a single backup file.
  Example output:
  [<ADMIN> ~]$ sudo tos backup export
  [Aug 23 16:33:42] INFO Preparing target dir /opt/tufin/backups
  [Aug 23 16:33:42] INFO Compressing...
  [Aug 23 16:33:48] INFO Backup exported file: /opt/tufin/backups/backup-21-2-pga.0.0-final-20210823163342.tar.gzip
  [Aug 23 16:33:48] INFO Backup export has completed
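  The verification command itself is not preserved in this extract. Because the export produces a gzip-compressed tar archive (as shown in the export output above), one way to list its contents is with the standard tar utility; the path below is copied from the example output and will differ in your environment:
  [<ADMIN> ~]$ sudo tar -tzf /opt/tufin/backups/backup-21-2-pga.0.0-final-20210823163342.tar.gzip
  If tar lists the archived files without errors, the exported file is readable.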
- Shut down TOS.
  - Run the command:
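    The stop command itself is not preserved in this extract. TOS is stopped from the tos CLI on the primary data node; the form below is an assumption based on the tos commands used elsewhere in this topic, so confirm it against the CLI reference for your TOS version:
    [<ADMIN> ~]$ sudo tos stop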
This process may take time.
- Check that all processes have been stopped successfully. Run the command:
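  The pod listing command is not preserved in this extract. On a k3s-based deployment, a standard way to list all pods across all namespaces is the following (an assumption; your deployment may scope the listing to a specific namespace):
  [<ADMIN> ~]$ kubectl get pod -A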
  A list of all pods is displayed.
- Wait until all the pods, with the exception of the service controller, ps-proxy, and reportpack pods, have disappeared from the list or reached a status of Completed. The service controller, ps-proxy, and reportpack pods can continue running.
  All TOS processes are now stopped on all the data nodes in the cluster.
- Complete the desired maintenance.
- Log in to the primary data node.
- Restart TOS on the primary data node:
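  The start command itself is not preserved in this extract. As with the stop step above, the form below is an assumption based on the tos CLI used elsewhere in this topic; confirm it against the CLI reference for your TOS version:
  [<ADMIN> ~]$ sudo tos start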
- Confirm that the cluster is restored to full operation by repeating Step 1.
All TOS processes will be restarted on all the data nodes in the cluster, and you will be able to resume using TOS.
/opt Partition Disk Usage
To ensure TOS functions properly, keep the disk usage of the /opt partition below 70%. From PHF2.0.0, when the /opt partition becomes 90% full, TOS stops making automatic backups and core services are stopped. We recommend configuring TOS monitoring to send notifications if too much storage is being consumed.
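A quick way to check how full the /opt partition is uses the standard df utility (not a TOS-specific command); the Use% column shows the percentage consumed:
[<ADMIN> ~]$ df -h /opt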