Restart Strategy

Docker Service Restart Strategy

When you encounter a problem that can only be solved by restarting the docker service, follow the rules below.

Management Node Docker Restart Strategy

The management node plays a special role in the cluster. In production environments, a high-availability architecture must be deployed; otherwise, restarting the docker service will inevitably interrupt the cluster management functions.

If you determine that you need to restart, do the following:

systemctl stop node
systemctl stop rbd*
systemctl stop kube*
systemctl stop calico
systemctl stop etcd
systemctl restart docker
systemctl start node
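
After the docker and node services come back up, it is worth verifying that the stack has recovered before moving on. A minimal check, assuming the systemd unit names used above and standard docker tooling (adjust the name filters to your deployment):

# Run on the management node after the restart
systemctl status docker node --no-pager       # both services should be active (running)
docker ps --format '{{.Names}}\t{{.Status}}'  # core containers (rbd-*, kube-*, calico, etcd) should be Up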

Compute Node Docker Restart Strategy

Before restarting the docker service on a compute node, first migrate the instances running on it.

# Execute on the management node; the instances on this node will be migrated automatically. Wait for the migration to complete before continuing
grctl node down <compute node UID>

# Execute on the compute node
systemctl restart docker

# After the restart, bring this node back online from the management node
grctl node up <compute node UID>
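
A quick way to confirm the node has rejoined, assuming a grctl node list subcommand is available in your version of the CLI:

# On the management node: the node should be listed as online and healthy
grctl node list
# On the compute node: confirm the core containers have come back up
docker ps --format '{{.Names}}' | sort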

Gateway Node Docker Service Can Be Restarted Directly

Execute directly:

systemctl restart docker
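
After the restart, a quick check that the gateway containers are back; the container names below follow the list in the graceful-restart section and may differ in your deployment:

docker ps --format '{{.Names}}\t{{.Status}}' | grep -E 'rbd-gateway|etcd-proxy|calico'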

Cluster Server Restart Strategy

Graceful Restart Strategy

When the Kato cluster is installed with the High Availability installation method, the entire cluster can be restarted gracefully in the following order.

Preparation Before Restart

If a business interruption is acceptable, shut down all running components on the platform first. If the business must not be interrupted, note the following:

  • Multi-instance stateless components and the clustered versions of multi-instance stateful components will not have their service interrupted during the cluster restart
  • Single-instance components will be migrated automatically after a brief service interruption

Graceful Restart of Storage Nodes

In this scenario, the cluster storage should be a distributed, multi-node deployment. The restart strategy is as follows:

  • Check the current mount information on all nodes in the cluster and confirm which storage node is currently mounted (a quick check is sketched after this list)
  • Restart the storage nodes that are not currently mounted, one by one. Confirm that each restarted storage node has returned to normal before restarting the next one
  • Finally, restart the storage node that is currently mounted
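
A minimal way to see which storage server each node is mounting, assuming the default shared-storage mount point /grdata (replace with your actual mount path):

# Run on every node in the cluster
df -hT /grdata        # the Filesystem column shows which storage server provides the mount
mount | grep grdata   # alternative view of the same mount
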
Graceful Restart of Management Nodes

In this scenario, the cluster management nodes should be deployed in an odd number, such as 3, 5, 7, and so on. The restart strategy is as follows:

  • Query the current etcd cluster leader with etcdctl member list; in the following example, the leader node is 192.168.195.1:
1cd2685eb4460830: name=etcd1 peerURLs=http://192.168.195.1:2380 clientURLs=http://192.168.195.1:2379,http://192.168.195.1:4001 isLeader=true
1cd2685e123450830: name=etcd2 peerURLs=http://192.168.195.2:2380 clientURLs=http://192.168.195.2:2379,http://192.168.195.2:4001 isLeader=false
13412385eb4460830: name=etcd3 peerURLs=http://192.168.195.3:2380 clientURLs=http://192.168.195.3:2379,http://192.168.195.3:4001 isLeader=false
  • Restart the management nodes that are not the etcd leader, one by one. Confirm that each restarted management node has returned to normal before restarting the next one (see the health check after this list)
  • Finally, restart the management node that is the etcd leader
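
A quick way to confirm the etcd cluster has recovered after each management node restart, assuming the same v2 etcdctl used for member list above:

# Run against any management node after each restart
etcdctl cluster-health      # every member should report healthy, and the cluster should be healthy
etcdctl member list         # isLeader should be true on exactly one member
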
Graceful Restart of Compute Nodes

In this scenario, the cluster compute nodes should be a distributed, multi-node deployment. The restart strategy is as follows:

  • Compute nodes should be restarted one by one
  • On the management node, take the compute node that is about to be restarted offline: grctl node down <compute node UID>
  • On that compute node, watch the remaining container instances (for example with docker ps) until only the three containers rbd-dns, calico, and etcd-proxy remain
  • Restart the compute node
  • Confirm that the compute node services are normal after startup (the three containers above are running)
  • On the management node, bring the compute node back online: grctl node up <compute node UID>
  • Restart the next compute node and repeat the steps above (a consolidated sketch follows this list)
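
The per-node cycle above as a sketch; <compute node UID> is a placeholder, and the container-name filter assumes the three containers listed above:

# On the management node: take the node offline and wait for its workloads to migrate
grctl node down <compute node UID>

# On the compute node: wait until only rbd-dns, calico and etcd-proxy are left, then restart the server
watch -n 5 "docker ps --format '{{.Names}}'"
reboot

# On the management node, once the compute node is back and its core containers are running:
grctl node up <compute node UID>
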
Graceful Restart of Gateway Nodes

In this scenario, the cluster gateway nodes should be a distributed, multi-node deployment. The restart strategy is as follows:

  • Gateway nodes should be restarted one by one
  • Confirm that the node is running normally after the restart and that the rbd-gateway, etcd-proxy, and calico containers are running (see the check after this list)
  • Restart the next gateway node
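
A quick check on the restarted gateway node; the container names follow the list above, and the port check assumes the gateway serves HTTP/HTTPS on 80/443:

docker ps --format '{{.Names}}\t{{.Status}}' | grep -E 'rbd-gateway|etcd-proxy|calico'
ss -lnt '( sport = :80 or sport = :443 )'    # the gateway should be listening again on its HTTP/HTTPS ports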

Restart Strategy Under Node Reuse

When the Kato cluster is deployed with node roles reused (one server holding multiple roles), the graceful restart criteria for storage nodes take precedence.

Data Center Restart Strategy

In some situations (such as a power failure in the data center), all servers in the Kato cluster restart at the same time. The points to note in this process are as follows:

  • Startup sequence: start the storage nodes first; once they are up, the other nodes can be started in any order
    • If there is no dedicated storage node and the NFS provided by default is used, start the first management node first, and then start the other nodes
  • Time synchronization: after the servers start, confirm that the time is synchronized across all nodes (see the check below)
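
A minimal clock check, run on each node; which command applies depends on the time service installed (chrony, ntpd, or systemd-timesyncd):

timedatectl status     # the clock-synchronized line should report yes
chronyc tracking       # if chrony is in use, shows the current offset from the reference clock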