Cluster Node Maintenance
Overview
This topic is relevant for high availability clusters as well as single data node clusters.
The procedures in this topic allow you to safely shut down nodes in your cluster and restart them without causing data corruption. Once a node is shut down, you can perform whatever maintenance it needs. These procedures are helpful for scheduled maintenance, such as upgrading your operating system (both TufinOS and non-TufinOS).
Single Data Node Cluster Environment
When performing maintenance work on a data node in a single data node environment, you will not be able to keep TOS running and you will experience downtime. Once you have completed the maintenance work and started TOS again, your environment should return to full functionality immediately.
Select which of the following nodes you want to perform maintenance on. Then follow the procedure.
High Availability Environment
When performing maintenance on a single data node or worker node in an HA environment, you will be able to keep TOS running. There is a small chance that you will experience a few minutes of downtime right after you power down the node. In addition, if another data node in your cluster fails while one is already down, the cluster will fail. See High Availability.
If you simply shut down the node without following this procedure, the cluster treats it as a node failure, which triggers high availability failover. That state takes longer to recover from and carries a higher risk of data corruption.
If you want to avoid downtime, only do maintenance work on one data node at a time. After you complete the maintenance work, check the cluster health before starting maintenance on another node. Only perform maintenance on nodes that are part of a healthy cluster.
If you don't need to avoid downtime, you can run maintenance on all the data nodes in your cluster by powering down your entire cluster at once. See Power Down All Nodes.
Select which of the following nodes you want to perform maintenance on. Then follow the procedure.
Perform Maintenance Work on a Single Data Node within an HA Deployment
- Check the cluster health.
  - On the primary data node, check the status of the k3s service.
    In the output, under the line k3s.service - Aurora Kubernetes, two lines should appear - Loaded... and Active... - similar to the example below. If they appear, continue with the next step; otherwise contact Tufin Support for assistance.
    Example output:
    [<ADMIN> ~]$ sudo systemctl status k3s
    Redirecting to /bin/systemctl status k3s.service
    ● k3s.service - Aurora Kubernetes
       Loaded: loaded (/etc/systemd/system/k3s.service; enabled; vendor preset: disabled)
       Active: active (running) since Tue 2021-08-24 17:14:38 IDT; 1 day 18h ago
         Docs: https://k3s.io
      Process: 1241 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
      Process: 1226 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
     Main PID: 1250 (k3s-server)
        Tasks: 1042
       Memory: 2.3G
  - On the same node or nodes, check the TOS status.
    In the output, if the System Status is Ok and all the items listed under Components appear as Ok, continue with the next steps. Otherwise contact Tufin Support for assistance.
    Example output:
    [<ADMIN> ~]$ sudo tos status
    Tufin Orchestration Suite 2.0
    System Status: Ok
    System Mode: Multi Node
    Nodes: 1 Master, 1 Worker. Total 2 nodes. Nodes are healthy.
    Components:
      Node: Ok
      Cassandra: Ok
      Mongodb: Ok
      Mongodb_sc: Ok
      Nats: Ok
      Neo4j: Ok
      Postgres: Ok
      Postgres_sc: Ok
- Backup your TOS data.
  If you are going to perform this procedure over multiple maintenance periods, create a new backup each time.
  - Create the backup using tos backup create.
  - You can check the backup creation status using tos backup status, which shows the status of backups in progress. Wait until completion before continuing.
  - Display the list of backups saved on the node using tos backup list.
  - Check that your backup file appears in the list, and that the status is "Completed".
  - Export the backup to a file using tos backup export.
  - If your backup files are saved locally:
    - Run sudo tos backup export to save your backup file from the TOS backup directory as a single .gzip file. If there are other backups present, they will be included as well.
    - Transfer the exported .gzip file to a safe, remote location. Make sure the location of your backups is safely documented and accessible, including any credentials needed to access them, so they are available for recovery when needed.
  - After the backup is exported, we recommend verifying that the file contents can be viewed; see the example after the output listings below.
  Example output:
  [<ADMIN> ~]$ sudo tos backup create
  [Aug 23 16:18:42] INFO Running backup
  Backup status can be monitored with "tos backup status"
  Example output:
  [<ADMIN> ~]$ sudo tos backup status
  Found active backup "23-august-2021-16-18"
  Example output:
  [<ADMIN> ~]$ sudo tos backup list
  ["23-august-2021-16-18"]
  Started: "2021-08-23 13:18:43 +0000 UTC"
  Completed: "N/A"
  Modules: "ST, SC"
  HA mode: "false"
  TOS release: "21.2 (PGA.0.0) Final"
  TOS build: "21.2.2100-210722164631509"
  Expiration Date: "2021-09-22 13:18:43 +0000 UTC"
  Status: "Completed"
  The command creates a single backup file.
  Example output:
  [<ADMIN> ~]$ sudo tos backup export
  [Aug 23 16:33:42] INFO Preparing target dir /opt/tufin/backups
  [Aug 23 16:33:42] INFO Compressing...
  [Aug 23 16:33:48] INFO Backup exported file: /opt/tufin/backups/backup-21-2-pga.0.0-final-20210823163342.tar.gzip
  [Aug 23 16:33:48] INFO Backup export has completed
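  The verification command itself is not preserved in this extract. Because the export produces a gzip-compressed tar archive (as shown in the export output above), one way to list its contents is with the standard tar utility; the path below is copied from the example output and will differ in your environment:
  [<ADMIN> ~]$ sudo tar -tzf /opt/tufin/backups/backup-21-2-pga.0.0-final-20210823163342.tar.gzip
  If tar lists the archived files without errors, the exported file is readable.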
- On the node on which you will work, run the command that lists the nodes in the cluster.
  A list of nodes which currently exist in the cluster is displayed.
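  The command itself is not shown in this extract. On a k3s-based TOS deployment, the standard kubectl call for listing cluster nodes is the following (an assumption, inferred from the kubectl usage later in this procedure):
  [<ADMIN> ~]$ kubectl get nodes
  The NAME column contains the value to use as <NODENAME> in the next steps.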
- In the output, note the name of the data node on which you want to run maintenance.
- Run the command:
  [<ADMIN> ~]$ kubectl drain <NODENAME> --delete-emptydir-data --ignore-daemonsets
  where <NODENAME> is the name of the data node on which you want to run the maintenance.
- Complete the desired maintenance.
- When the maintenance is completed, power up (if required) and log in again to the data node.
- Run the command:
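  The command itself is not preserved in this extract. After a kubectl drain, the standard kubectl call for returning a node to service (making it schedulable again) is uncordon; a sketch, assuming the same <NODENAME> used in the drain step:
  [<ADMIN> ~]$ kubectl uncordon <NODENAME>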
- Check the cluster health by repeating Step 1. Wait at least one day before performing maintenance on another data node in the cluster.
Perform Maintenance Work on a Single Worker Node
- Check the cluster health.
  - On the primary data node, check the status of the k3s service.
    In the output, under the line k3s.service - Aurora Kubernetes, two lines should appear - Loaded... and Active... - similar to the example below. If they appear, continue with the next step; otherwise contact Tufin Support for assistance.
    Example output:
    [<ADMIN> ~]$ sudo systemctl status k3s
    Redirecting to /bin/systemctl status k3s.service
    ● k3s.service - Aurora Kubernetes
       Loaded: loaded (/etc/systemd/system/k3s.service; enabled; vendor preset: disabled)
       Active: active (running) since Tue 2021-08-24 17:14:38 IDT; 1 day 18h ago
         Docs: https://k3s.io
      Process: 1241 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
      Process: 1226 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
     Main PID: 1250 (k3s-server)
        Tasks: 1042
       Memory: 2.3G
  - On the same node or nodes, check the TOS status.
    In the output, if the System Status is Ok and all the items listed under Components appear as Ok, continue with the next steps. Otherwise contact Tufin Support for assistance.
    Example output:
    [<ADMIN> ~]$ sudo tos status
    Tufin Orchestration Suite 2.0
    System Status: Ok
    System Mode: Multi Node
    Nodes: 1 Master, 1 Worker. Total 2 nodes. Nodes are healthy.
    Components:
      Node: Ok
      Cassandra: Ok
      Mongodb: Ok
      Mongodb_sc: Ok
      Nats: Ok
      Neo4j: Ok
      Postgres: Ok
      Postgres_sc: Ok
- If necessary, power the worker node down; a power-down example appears after this list.
- Perform the required maintenance.
- If necessary, power the worker node back up.
- Confirm that the cluster is restored to full operation by repeating Step 1. If the cluster is healthy, you can proceed to run maintenance on additional worker nodes as needed.
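A power-down example for the first step above, assuming you shut the node down from the operating system rather than from a hypervisor or management console (shutdown is a standard Linux utility, not a TOS command):
[<ADMIN> ~]$ sudo shutdown -h now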
Perform Maintenance on All Nodes in the Cluster
- Check the cluster health.
  - On the primary data node, check the status of the k3s service.
    In the output, under the line k3s.service - Aurora Kubernetes, two lines should appear - Loaded... and Active... - similar to the example below. If they appear, continue with the next step; otherwise contact Tufin Support for assistance.
    Example output:
    [<ADMIN> ~]$ sudo systemctl status k3s
    Redirecting to /bin/systemctl status k3s.service
    ● k3s.service - Aurora Kubernetes
       Loaded: loaded (/etc/systemd/system/k3s.service; enabled; vendor preset: disabled)
       Active: active (running) since Tue 2021-08-24 17:14:38 IDT; 1 day 18h ago
         Docs: https://k3s.io
      Process: 1241 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
      Process: 1226 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
     Main PID: 1250 (k3s-server)
        Tasks: 1042
       Memory: 2.3G
  - On the same node or nodes, check the TOS status.
    In the output, if the System Status is Ok and all the items listed under Components appear as Ok, continue with the next steps. Otherwise contact Tufin Support for assistance.
    Example output:
    [<ADMIN> ~]$ sudo tos status
    Tufin Orchestration Suite 2.0
    System Status: Ok
    System Mode: Multi Node
    Nodes: 1 Master, 1 Worker. Total 2 nodes. Nodes are healthy.
    Components:
      Node: Ok
      Cassandra: Ok
      Mongodb: Ok
      Mongodb_sc: Ok
      Nats: Ok
      Neo4j: Ok
      Postgres: Ok
      Postgres_sc: Ok
- Backup your TOS data.
  If you are going to perform this procedure over multiple maintenance periods, create a new backup each time.
  - Create the backup using tos backup create.
  - You can check the backup creation status using tos backup status, which shows the status of backups in progress. Wait until completion before continuing.
  - Display the list of backups saved on the node using tos backup list.
  - Check that your backup file appears in the list, and that the status is "Completed".
  - Export the backup to a file using tos backup export.
  - If your backup files are saved locally:
    - Run sudo tos backup export to save your backup file from the TOS backup directory as a single .gzip file. If there are other backups present, they will be included as well.
    - Transfer the exported .gzip file to a safe, remote location. Make sure the location of your backups is safely documented and accessible, including any credentials needed to access them, so they are available for recovery when needed.
  - After the backup is exported, we recommend verifying that the file contents can be viewed; see the example after the output listings below.
  Example output:
  [<ADMIN> ~]$ sudo tos backup create
  [Aug 23 16:18:42] INFO Running backup
  Backup status can be monitored with "tos backup status"
  Example output:
  [<ADMIN> ~]$ sudo tos backup status
  Found active backup "23-august-2021-16-18"
  Example output:
  [<ADMIN> ~]$ sudo tos backup list
  ["23-august-2021-16-18"]
  Started: "2021-08-23 13:18:43 +0000 UTC"
  Completed: "N/A"
  Modules: "ST, SC"
  HA mode: "false"
  TOS release: "21.2 (PGA.0.0) Final"
  TOS build: "21.2.2100-210722164631509"
  Expiration Date: "2021-09-22 13:18:43 +0000 UTC"
  Status: "Completed"
  The command creates a single backup file.
  Example output:
  [<ADMIN> ~]$ sudo tos backup export
  [Aug 23 16:33:42] INFO Preparing target dir /opt/tufin/backups
  [Aug 23 16:33:42] INFO Compressing...
  [Aug 23 16:33:48] INFO Backup exported file: /opt/tufin/backups/backup-21-2-pga.0.0-final-20210823163342.tar.gzip
  [Aug 23 16:33:48] INFO Backup export has completed
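  The verification command itself is not preserved in this extract. Because the export produces a gzip-compressed tar archive (as shown in the export output above), one way to list its contents is with the standard tar utility; the path below is copied from the example output and will differ in your environment:
  [<ADMIN> ~]$ sudo tar -tzf /opt/tufin/backups/backup-21-2-pga.0.0-final-20210823163342.tar.gzip
  If tar lists the archived files without errors, the exported file is readable.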
- Shut down TOS.
  - Run the command:
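    The stop command itself is not preserved in this extract. TOS is stopped from the tos CLI on the primary data node; the form below is an assumption based on the tos commands used elsewhere in this topic, so confirm it against the CLI reference for your TOS version:
    [<ADMIN> ~]$ sudo tos stop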
This process may take time.
- Check that all processes have been stopped successfully. Run the command:
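  The pod listing command is not preserved in this extract. On a k3s-based deployment, a standard way to list all pods across all namespaces is the following (an assumption; your deployment may scope the listing to a specific namespace):
  [<ADMIN> ~]$ kubectl get pod -A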
  A list of all pods is displayed.
- Wait until all the pods, with the exception of the service controller, ps-proxy, and reportpack pods, have disappeared from the list or reached a status of Completed. The service controller, ps-proxy, and reportpack pods can continue running.
  All TOS processes are now stopped on all the data nodes in the cluster.
- Complete the desired maintenance.
- Log in to the primary data node.
- Restart TOS on the primary data node:
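  The start command itself is not preserved in this extract. As with the stop step above, the form below is an assumption based on the tos CLI used elsewhere in this topic; confirm it against the CLI reference for your TOS version:
  [<ADMIN> ~]$ sudo tos start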
- Confirm that the cluster is restored to full operation by repeating Step 1.
All TOS processes will be restarted on all the data nodes in the cluster, and you will be able to resume using TOS.
/opt Partition Disk Usage
To ensure TOS functions properly, keep the disk usage of the /opt partition below 70%. From PHF2.0.0, when the /opt partition becomes 90% full, TOS stops making automatic backups and core services are stopped. We recommend configuring TOS monitoring to send notifications if too much storage is being consumed.
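A quick way to check how full the /opt partition is uses the standard df utility (not a TOS-specific command); the Use% column shows the percentage consumed:
[<ADMIN> ~]$ df -h /opt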