Cluster Issues

Cluster Status Confirmation

Execute the command:

grctl cluster

Learn more about grctl

| Resource        | Used/Total        | Usage |
| CPU             | 30/48             | 63%   |
| Memory          | 133056/193288     | 68%   |
| DistributedDisk | 1461Gb/10485760Gb | 0.01% |
| Service                 | HealthyQuantity/Total | Message                                       |
| ClusterStatus           | available             |                                               |
| rbd-monitor             | 1/1                   |                                               |
| rbd-repo                | 1/1                   |                                               |
| rbd-webcli              | 1/1                   |                                               |
| calico                  | 4/4                   |                                               |
| rbd-app-ui              | 1/1                   |                                               |
| Ready                   | 4/4                   |                                               |
| rbd-dns                 | 4/4                   |                                               |
| storage                 | 4/4                   |                                               |
| kubelet                 | 3/3                   |                                               |
| rbd-eventlog            | 1/1                   |                                               |
| rbd-hub                 | 1/1                   |                                               |
| rbd-java-buildpack      | 1/1                   |                                               |
| etcd                    | 1/1                   |                                               |
| etcd-proxy              | 3/3                   |                                               |
| rbd-mq                  | 1/1                   |                                               |
| rbd-slack               | 1/1                   |                                               |
| rbd-grafana             | 1/1                   |                                               |
| kube-apiserver          | 1/1                   |                                               |
| rbd-api                 | 1/1                   |                                               |
| rbd-db                  | 1/1                   |                                               |
| rbd-gateway             | 1/1                   |                                               |
| NodeUp                  | 4/4                   |                                               |
| docker                  | 4/4                   |                                               |
| rbd-alertmanager        | 1/1                   |                                               |
| rbd-chaos               | 1/1                   |                                               |
| rbd-worker              | 1/1                   |                                               |
| KubeNodeReady           | 3/3                   |                                               |
| local-dns               | 4/4                   |                                               |
| kube-controller-manager | 1/1                   |                                               |
| kube-scheduler          | 1/1                   |                                               |

| Uid                                  | IP          | HostName  | NodeRole      | Status         |
| 959eba4b-6bbe-4ad5-ba0f-ecfad17d378d | | manage01  | manage,gateway| running        |
| b4e3a2dd-ccef-410b-93a8-95d19a18b282 | | compute01 | compute       | running        |
| 4756d361-afbc-4283-b60e-1bfdcd8e4b5e | | compute02 | compute       | running        |
| e96f51b7-5c12-4b48-a126-8a91e9df5165 | | compute03 | compute       | running        |

A healthy cluster looks like this:

  • The Message column of every service is empty
  • The Status of every node is a green running
  • The cluster contains at least one node with each of the manage, gateway, and compute roles

If executing grctl cluster does not return the cluster status information above, please refer to grctl cluster Feedback Exception.
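The "Message column is empty" check above can also be done mechanically. The helper below is a minimal sketch that extracts the Message field from one pipe-delimited service row of grctl cluster output; the row layout is assumed to match the table above:

```shell
#!/bin/sh
# row_message: print the Message column (the 4th pipe-delimited field)
# of a grctl cluster service row, with surrounding whitespace stripped.
row_message() {
  printf '%s\n' "$1" | awk -F'|' '{gsub(/^[ \t]+|[ \t]+$/, "", $4); print $4}'
}

# A healthy row has an empty Message column:
row_message '| rbd-mq | 1/1 |          |'
# An unhealthy row carries HostName:error details:
row_message '| rbd-api | 0/1 | manage01:connection refused |'
```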

Node Status Troubleshooting

Abnormal Status and Handling

A node is healthy only when its status is shown as a green running. For other statuses, see the following table:

| Node Status                               | Reason                                                          | Treatment Method                                                                |
| running (unhealthy)                       | A service on the node is abnormal                               | grctl node get                                                                  |
| running (unschedulable)                   | The node is normal but in an unschedulable state                | grctl node uncordon                                                             |
| running (unschedulable, unhealthy)        | A service on the node is abnormal and the node is unschedulable | grctl node get, handle the problem, then grctl node uncordon                    |
| offline                                   | The node is not online, or its node service is abnormal         | grctl node up, then check the node service status                               |
| unknown                                   | The node status is unknown                                      | Check time synchronization with the management node and the node service status |
| install_failed (unschedulable, unhealthy) | Node installation failed                                        | See Installation Troubleshooting                                                |
| not_installed                             | The node is not installed                                       | grctl node install                                                              |

Service Exception Handling

Identify the Problem

Kato configures a health check for each cluster-related service running on each node in the cluster. The result will be displayed in the return of the grctl cluster command.

If Kato detects that a service on a node is in an abnormal state, it will display detailed information in the Message column for that service, in the format HostName:Detailed Error Information; at the same time, the word unhealthy appears in the corresponding node's Status.

| rbd-api |  0/1 | manage01:Get dial tcp connect: connection refused/ |
| Uid                                  | IP          | HostName  | NodeRole | Status             |
| 959eba4b-6bbe-4ad5-ba0f-ecfad17d378d | | manage01  | manage   | running(unhealthy) |

The above output indicates that the rbd-api service is in an abnormal state on the manage01 node. The details show that manage01 failed to connect to local port 8443, which is the listening port of the rbd-api service.

Note that the port that fails to connect is not necessarily the listening port of the service reporting the error; it may belong to another service that this service depends on.
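Because the failing port may belong to a dependency rather than the erroring service itself, it helps to map ports back to services. The sketch below hardcodes only the two mappings this document mentions (8443 for rbd-api, 3306 for the database); extend the table from the Service Component Port Description for your installation:

```shell
#!/bin/sh
# port_to_service: map a listening port to the Kato service that owns it.
# Only ports mentioned in this document are listed; add the rest from the
# Service Component Port Description.
port_to_service() {
  case "$1" in
    8443) echo rbd-api ;;   # rbd-api listening port
    3306) echo rbd-db ;;    # database port
    *)    echo unknown ;;
  esac
}

port_to_service 8443   # → rbd-api
```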

Query the Running Status of the Service

systemctl status rbd-api


● rbd-api.service - rbd-api
  Loaded: loaded (/etc/systemd/system/rbd-api.service; enabled; vendor preset: enabled)
  Active: inactive (dead) since Tue 2019-08-06 17:17:02 CST; 13s ago
  Process: 24249 ExecStop=/bin/bash -c docker stop rbd-api (code=exited, status=0/SUCCESS)
  Main PID: 8491 (code=killed, signal=TERM)

The service is found to be in the inactive (dead) state. At this point, the problem location is complete.

For a list of all Kato service listening ports, please refer to the Service Component Port Description; the port in the error message lets you quickly locate the source of the exception.
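To avoid checking units one by one, the status query can be swept across several services at once. This is a sketch; the service list passed in below is illustrative (taken from the grctl cluster table above) and should be adjusted to the units actually installed on the node:

```shell
#!/bin/sh
# check_units: print each systemd unit together with its systemctl
# is-active state; falls back to "unknown" if the query fails.
check_units() {
  for unit in "$@"; do
    state=$(systemctl is-active "$unit" 2>/dev/null || true)
    printf '%s\t%s\n' "$unit" "${state:-unknown}"
  done
}

check_units rbd-api rbd-db rbd-mq
```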

Troubleshoot Issues Based on Service Logs

Kato’s node service automatically supervises all Kato services and restarts them after a failure. If a problem persists, the service has encountered an issue that restarting cannot solve.

Kato component logs are all managed by journald. Query a log with: journalctl -fu <service name>

Such as:

journalctl -fu rbd-api

The log will usually indicate the cause of the error. Here are some keywords that may appear in the log:

Unable to Find Image

Error response from daemon: No such container: rbd-api
Started rbd-grafana.
Unable to find image 'gridworkz/rbd-api:v5.1.5-release' locally
docker: Error response from daemon: manifest for gridworkz/rbd-api:v5.1.5-release not found.

This error indicates that the specified image does not exist locally.

  • Solution:
    • Check the configuration file to see whether the image address is wrong
    • Confirm whether the image exists on another node (in most cases, the first management node); if so, execute docker push gridworkz/rbd-api:v5.1.5-release there
    • Obtain the offline image package of the specified Kato version and extract the corresponding image (v5.1.5 corresponds to its offline image package)

error: dial tcp xx.xx.xx.xx:3306: connect: connection refused

Started rbd-api.
error: dial tcp connect: connection refused
main process exited, code=exited, status=1/FAILURE

This error indicates that the connection to the rbd-db service, or to the user-configured external database, has failed.

  • Solution:
    • Run systemctl status rbd-db to check whether the database service is normal, or check whether the user-configured external database is running normally
    • Check the current service configuration file: the MySQL connection address, user name, and password
    • Check whether the network between the service and the database is restricted
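A quick way to separate "database down" from "network restricted" is a raw TCP probe of the database port. This sketch uses bash's /dev/tcp; the host and port below are placeholders for your rbd-db or external MySQL endpoint:

```shell
#!/bin/sh
# tcp_reachable HOST PORT: report whether a TCP connection to the
# database endpoint can be opened within 2 seconds.
tcp_reachable() {
  if timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null; then
    echo open
  else
    echo closed
  fi
}

tcp_reachable 127.0.0.1 3306
```

If the port is closed, check the database service itself; if it is open but the error persists, re-check the credentials in the service configuration file.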

The services in Kato are interdependent, so a service may fail to start because another component is not providing service normally. See Mutual Dependencies Between Components.

Container Name “XXXX” is Already in Use by Container “…”

/usr/bin/docker: Error response from daemon: Conflict. The container name "/etcd-proxy" is already in use by container "d2cb3ce793ef764ae0525ccc". You have to remove (or rename) that container to be able to reuse that name.
See '/usr/bin/docker run --help'
etcd-proxy.service: main process exited, code=exited, status=125/n/a

This error means that a container with the same name already exists and was not automatically cleaned up. This usually indicates that the docker service is not working properly.

  • Solution:
    • Try to remove the container with the same name manually: docker rm -f etcd-proxy
    • If manual cleanup fails, consider restarting the docker service; see [docker service restart strategy](/docs/Troubleshoot/concrete-operations/how-to-restart/#docker service restart strategy)
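When the conflicting unit is not known in advance, the container name can be parsed straight out of the daemon error, as in this sketch (the error string below is the example from this section):

```shell
#!/bin/sh
# conflict_container: extract the conflicting container name from a
# docker "name is already in use" error message.
conflict_container() {
  printf '%s\n' "$1" | sed -n 's/.*container name "\([^"]*\)".*/\1/p'
}

err='Conflict. The container name "/etcd-proxy" is already in use by container "d2cb3ce793ef764ae0525ccc".'
name=$(conflict_container "$err")
name=${name#/}   # drop the leading slash docker prints
echo "$name"     # → etcd-proxy
# Then remove it manually:
#   docker rm -f "$name"
```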

Other Troubleshooting

grctl cluster Feedback Exception

After executing grctl cluster, the following information is returned:

The current cluster api server is not working properly.
You can query the service log for troubleshooting.
Exec Command: journalctl -fu rbd-api


The current cluster node manager is not working properly.
You can query the service log for troubleshooting.
Exec Command: journalctl -fu node

The grctl cluster command relies on two services: node and rbd-api. If either has a problem, the command cannot return normally.

  • Solution:
    • From the returned prompt, determine which of the two services caused the problem
    • Query the corresponding log: journalctl -fu node or journalctl -fu rbd-api
    • Based on the log, resolve the node or rbd-api problem first. See [Troubleshoot Issues Based on Service Logs](#Troubleshooting issues based on service logs)

Storage Service Error

If the storage service reports an error, it is generally because of a file mount problem. Follow the steps below:

  • On the node reporting the error, execute mount | grep grdata and check whether the correct mount path is returned (refer to /etc/fstab for the mount path)
  • If it is not mounted, mount it manually with mount -a
    • If the mount still fails, check the system environment according to the error message
    • If the mount succeeds, restart all services that depend on storage: rbd-app-ui, rbd-hub, rbd-api, rbd-gateway, rbd-worker
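The mount check in the steps above can be scripted. This is a sketch that looks for an exact mount point in /proc/mounts; /grdata is assumed here as the shared-storage mount point, so adjust it to match your /etc/fstab:

```shell
#!/bin/sh
# is_mounted PATH: succeed if PATH appears as a mount point in /proc/mounts.
is_mounted() {
  awk -v mp="$1" '$2 == mp { found = 1 } END { exit !found }' /proc/mounts
}

if is_mounted /grdata; then
  echo "grdata mounted"
else
  echo "grdata not mounted; try: mount -a"
fi
```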

My Cluster Server Needs to be Restarted

See Cluster server restart strategy

What Should I Do if the Cluster Node Status is Unknown

If a node’s status is “unknown”, the node has not reported its status to the management end for more than 1 minute. This may be caused by:

  • The node service on this node has failed and exited, or the node has been shut down.
  • The node’s time is inconsistent with the management node’s. The time difference between cluster nodes should not exceed 30 seconds; it is best to set up a time synchronization service.
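The 30-second limit above can be checked with a simple drift comparison of epoch timestamps. This sketch compares two timestamps in seconds; in practice the second value would come from the other node (e.g. over ssh, as in the comment):

```shell
#!/bin/sh
# drift_ok A B: compare two epoch timestamps (seconds) against the
# 30-second cluster limit; prints "ok" or "drift".
drift_ok() {
  d=$(( $1 - $2 ))
  [ "$d" -lt 0 ] && d=$(( -d ))
  if [ "$d" -le 30 ]; then echo ok; else echo drift; fi
}

# In practice, collect the remote timestamp over ssh, e.g.:
#   remote=$(ssh compute01 date +%s)
drift_ok "$(date +%s)" "$(date +%s)"   # → ok
```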

My Question is Not Covered

If you are still at a loss as to how to make your cluster work after reading this document, you can:

  • Go to GitHub to check whether there are related issues; if not, submit an issue

  • Go to Community to read the post with the prefix [Operation and Maintenance Question] to find answers to similar questions