High Availability Architecture

Overview

For the overall TOS architecture, see TOS Architecture.

TOS has built-in support for running in high availability (HA) mode. HA uses a redundancy mechanism in which your data is kept up to date on three nodes (servers) simultaneously. If any one of them fails, the up-to-date data remains available on the other two data nodes. A short period of downtime may occur, depending on the type of services running on the failed node, after which TOS continues running. The terms high availability and HA are used interchangeably. High availability can only be installed on a single site.

Do not use HA failover for maintenance or for periodic rebooting. Such actions must be done in a controlled manner, as described in Cluster Node Maintenance.

If two of the three data nodes are down, the entire cluster will fail.

After the additional data nodes are added and HA is enabled, TOS begins changing the cluster from a standard configuration to a high availability configuration. During this time, TOS will not be operational. The expected initial downtime depends on hardware performance and can be up to one hour.

After HA is enabled, the system begins to replicate data. During this time, TOS will be operational, but high availability will not be fully functional until all data has been replicated. The time this takes depends on the volume of data being replicated.

High availability is supported for GCP over three availability zones, giving you a higher level of resilience and availability when deploying on this cloud platform. Note that all availability zones must be in the same region.

Architecture

High availability requires three data nodes in the cluster: the primary data node and two additional data nodes. Worker nodes are not required for high availability.

An odd number of nodes is a widely accepted high availability best practice. The odd number is required because of the quorum-based election mechanism. With this mechanism, a “leader” node is automatically elected from among the three data nodes in the cluster. The leader node manages the cluster and updates the other two data nodes with any changes. To be elected, a node needs the votes of a majority of the data nodes (two of the three). This prevents a split-brain scenario, in which two different nodes are elected as leader, resulting in data corruption.
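The majority rule can be illustrated with a short sketch. This is generic quorum arithmetic, not the TOS election implementation; the function names are hypothetical and chosen only for this example. It shows that a three-node cluster tolerates one failure, and that adding a fourth node would not raise that number.

```python
def quorum_size(total_nodes: int) -> int:
    """Smallest number of votes that forms a strict majority."""
    return total_nodes // 2 + 1


def tolerated_failures(total_nodes: int) -> int:
    """How many nodes can fail while a majority can still be formed."""
    return total_nodes - quorum_size(total_nodes)


def can_elect_leader(total_nodes: int, reachable_nodes: int) -> bool:
    """A leader can be elected only if a majority of nodes is reachable."""
    return reachable_nodes >= quorum_size(total_nodes)


# Compare cluster sizes: an even-sized cluster tolerates no more
# failures than the odd-sized cluster one node smaller.
for n in (2, 3, 4, 5):
    print(f"nodes={n} quorum={quorum_size(n)} tolerated={tolerated_failures(n)}")
```

Because two disjoint groups of nodes cannot both hold a majority, at most one leader can be elected at a time, which is how the split-brain scenario is avoided.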

In addition, with two data nodes sharing the workload, failure of one node requires the second node to handle the workload of the entire cluster, so resource usage on any given node must stay below 50%. With three data nodes, the workload of a failed node is split between the two remaining nodes.
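The capacity argument can be worked through numerically. The sketch below assumes an idealized model in which the workload is spread evenly across nodes and is redistributed evenly among the survivors after a failure; real service placement in TOS depends on configuration and available resources, so treat the figures as an upper bound for illustration only. The function name is hypothetical.

```python
from fractions import Fraction


def max_utilization_per_node(total_nodes: int, failed_nodes: int = 1) -> Fraction:
    """Highest per-node utilization before a failure such that the
    surviving nodes can still absorb the whole workload.

    Assumes an even split of work before and after the failure.
    """
    survivors = total_nodes - failed_nodes
    if survivors <= 0:
        raise ValueError("at least one node must survive")
    # Total work = total_nodes * u must fit into `survivors` nodes at 100%.
    return Fraction(survivors, total_nodes)


for n in (2, 3):
    limit = max_utilization_per_node(n)
    print(f"{n} nodes: keep each node at or below {float(limit):.0%}")
```

Under these assumptions, two nodes leave a ceiling of one half per node, while three nodes raise it to two thirds.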

Data is replicated across all three data nodes. At any given time, each of them both stores data and runs data services in the cluster.

Services are automatically distributed across the different nodes (data and worker) in the cluster according to their configuration and available resources per node.

Primary Data Node

There is no difference between the primary data node and the other two data nodes in terms of high availability functionality.

However, the primary data node does have some unique roles:

You can only make changes to the cluster using CLI commands from the primary data node.
Backups are saved on the primary data node.

You can change one of the additional data nodes to the primary data node using the CLI command: tos cluster node set-primary

Multiple Sites

For high availability to work, TOS needs to be deployed on a single site.

Because three data nodes are required, deploying on two sites would result in one site having two data nodes and the other having one. If the site with two data nodes fails, the entire cluster fails, because the single data node on the other site does not have a majority. Therefore, there is no redundancy between the sites.

In addition, the cluster is very sensitive to latency, and all the nodes have to be connected to the same L2 network and share the same subnet.
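The two-site limitation described above follows from the majority requirement, as the following sketch shows. It is an illustration only: the site names and the function are hypothetical, and it models nothing beyond counting the surviving data nodes.

```python
def survives_site_failure(nodes_per_site: dict[str, int], failed_site: str) -> bool:
    """Return True if the data nodes outside the failed site still
    form a strict majority of all data nodes in the cluster."""
    total = sum(nodes_per_site.values())
    remaining = total - nodes_per_site[failed_site]
    return remaining >= total // 2 + 1


# Three data nodes placed across two hypothetical sites.
layout = {"site-a": 2, "site-b": 1}

for site in layout:
    outcome = "keeps running" if survives_site_failure(layout, site) else "fails"
    print(f"loss of {site}: cluster {outcome}")
```

Losing the site that holds two data nodes leaves a single node without a majority, so a two-site layout cannot protect against the loss of either site.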

Remote Collectors

Remote collectors cannot run in high availability mode themselves; however, they can be connected to a central cluster that is running in high availability mode.

Cloud Deployments

High availability is supported for TOS deployed on GCP, but not on Azure or AWS.

High Availability Failover

Failover occurs when one of the data nodes fails. When failover occurs, your system will continue running, as the other two data nodes will continue replicating data until the third one can be brought back online. However, some downtime should be expected. The failover impact and downtime depend on the:

Services running on the failed data node. For example, some services in TOS have a single service instance. When the data node fails, these services automatically restart on one of the other nodes. This can take around 15 minutes.
Current role of the failed data node in the cluster
Database size
Current load on the system

During failover:

If you have set up a notification for this event, it will be sent immediately.
TOS will continue to run with two data nodes.
The high availability status of the cluster will be impaired. Any additional data node failure will bring the cluster down. Therefore, if one of the data nodes fails, it is crucial that you fix the issue as soon as possible.
If the failed data node is restored to a fully operational state, high availability will automatically return to full functionality.
If the failed node cannot be restored to a fully operational state with the same network properties, it must be replaced with another as soon as possible, to restore the high availability environment.
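The failover behavior described above can be summarized as a small state model. This is a sketch for reasoning about the cluster, not a TOS API: the status names and the function are hypothetical, and the mapping simply encodes the statements in this section (three data nodes give full high availability, two keep TOS running in an impaired state, and fewer than two bring the cluster down).

```python
from enum import Enum


class HaStatus(Enum):
    FULLY_FUNCTIONAL = "fully functional"  # all three data nodes healthy
    IMPAIRED = "impaired"                  # TOS runs, but no redundancy left
    DOWN = "down"                          # majority lost, cluster fails


def ha_status(healthy_data_nodes: int) -> HaStatus:
    """Map the number of healthy data nodes (0 to 3) to a cluster status."""
    if not 0 <= healthy_data_nodes <= 3:
        raise ValueError("an HA cluster has exactly three data nodes")
    if healthy_data_nodes == 3:
        return HaStatus.FULLY_FUNCTIONAL
    if healthy_data_nodes == 2:
        return HaStatus.IMPAIRED
    return HaStatus.DOWN


for count in (3, 2, 1):
    print(f"{count} healthy data nodes: {ha_status(count).value}")
```

A monitoring script could use a mapping like this to raise the urgency of an alert as soon as the status becomes impaired, since the next data node failure would bring the cluster down.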