Cluster Problem Diagnosis

This article introduces common problems in a Kato cluster and the general approach to troubleshooting them. When troubleshooting, first confirm that the root cause lies in the Kato cluster itself rather than in the application running on it.

Node Troubleshooting

  • When the cluster is faulty, the first thing to check is whether all nodes in the cluster are in the Ready state:
$ kubectl get nodes 
NAME            STATUS   ROLES    AGE    VERSION
192.168.2.146   Ready    master   4d2h   v1.17.2
  • For a problematic node, check the detailed information and events on the node:
kubectl describe node <node name>
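
If the output of kubectl describe node is too verbose, the following sketch (using a standard kubectl jsonpath expression; the node name is a placeholder) prints only the node conditions:

# Print each condition type and its current status for the node.
kubectl get node <node name> -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'

On a healthy node, Ready should be True and the pressure conditions (MemoryPressure, DiskPressure, PIDPressure) should be False.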

Component Troubleshooting

Before troubleshooting components, please first confirm that the problem lies with the Kato components rather than with Kubernetes itself.

  • View the status of all Kato components; they are all located in the rbd-system namespace:
$ kubectl get pods -n rbd-system

NAME                              READY   STATUS      RESTARTS   AGE
mysql-operator-7c858d698d-g6xvt   1/1     Running     0          3d2h
nfs-commissions-0                 1/1     Running     0          4d2h
kato-operator-0                   2/2     Running     0          3d23h
rbd-api-7db9df75bc-dbjn4          1/1     Running     1          4d2h
rbd-app-ui-75c5f47d87-p5spp       1/1     Running     0          3d5h
rbd-app-ui-migrations-6crbs       0/1     Completed   0          4d2h
rbd-chaos-nrlpl                   1/1     Running     0          3d22h
rbd-db-0                          2/2     Running     0          4d2h
rbd-etcd-0                        1/1     Running     0          4d2h
rbd-eventlog-8bd8b988-ntt6p       1/1     Running     0          4d2h
rbd-gateway-4z9x8                 1/1     Running     0          4d2h
rbd-hub-5c4b478d5b-j7zrf          1/1     Running     0          4d2h
rbd-monitor-0                     1/1     Running     0          4d2h
rbd-mq-57b4fc595b-ljsbf           1/1     Running     0          4d2h
rbd-node-tpxjj                    1/1     Running     0          4d2h
rbd-repo-0                        1/1     Running     0          4d2h
rbd-webcli-5755745bbb-kmg5t       1/1     Running     0          4d2h
rbd-worker-68c6c97ddb-p68tx       1/1     Running     3          4d2h
  • If a Pod is in a non-Running state, check the Pod log; in most cases the log alone is enough to locate the problem. A sketch for quickly listing such Pods follows the example below.

Example:

View component logs

kubectl logs -f <pod name> -n rbd-system
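
When the namespace contains many Pods, the following sketch (assuming a kubectl version that supports field selectors; the container name is a placeholder) helps narrow things down:

# List only the Kato Pods that are not in the Running phase.
# Note: Pods in the Completed state (such as rbd-app-ui-migrations above) are also listed,
# and a crash-looping Pod may still report the Running phase.
kubectl get pods -n rbd-system --field-selector=status.phase!=Running

# For multi-container Pods (for example rbd-db-0 or kato-operator-0), follow a specific container's log.
kubectl logs -f <pod name> -c <container name> -n rbd-system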

For details on Kato components, please read Platform Component Architecture

Common Abnormal Pod Statuses

  • Pending

This state usually means that the Pod has not yet been scheduled to a Node. Use the following command to view the Pod's events and determine why scheduling has not happened.

kubectl describe pod <pod name> -n rbd-system

Check the contents of the Events field and analyze the reason.
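
If the Pod's events are hard to spot in the describe output, here is a sketch that lists only the events related to the Pod (the Pod name is a placeholder):

# Show the events for this Pod, oldest first.
kubectl get events -n rbd-system --field-selector involvedObject.name=<pod name> --sort-by=.lastTimestamp

Typical causes reported as FailedScheduling events include insufficient CPU or memory on the nodes, unsatisfied node selectors or taints, and unbound PersistentVolumeClaims.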

  • Waiting or ContainerCreating

This state usually indicates that the Pod is waiting for its containers to be created. If the Pod stays in this state for a long time, use the following command to view its events.

kubectl describe pod <pod name> -n rbd-system

Check the contents of the Events field and analyze the reason.
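
One common reason for a Pod staying in ContainerCreating is a volume that cannot be attached or mounted. A minimal check, assuming the Pod uses PersistentVolumeClaims:

# Every PVC used by the Pod should be in the Bound state.
kubectl get pvc -n rbd-system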

  • ImagePullBackOff

This status usually indicates that the image pull failed. Use the following command to find out which image failed to pull, and then check whether the image exists locally.

kubectl describe pod <pod name> -n rbd-system

Check the contents of the Events field and check the image name.

Check whether the image exists locally:

docker images | grep <image name>
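
If the image is missing, pulling it manually on the node that runs the Pod usually surfaces the underlying error. This sketch assumes the node uses Docker as its container runtime, as elsewhere in this document:

# Authentication failures, missing tags, or an unreachable registry will show up here.
docker pull <image name>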
  • CrashLoopBackOff

The CrashLoopBackOff status indicates that the container started but then exited abnormally; in most cases this is a problem with the application itself. Start by checking the log of the previous container instance.

kubectl logs --previous <pod name> -n rbd-system
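
The exit code of the last terminated container can also hint at the cause. A sketch using a kubectl jsonpath expression (it assumes the first container in the Pod is the one that crashed):

# Print the exit code and reason of the last terminated container; exit code 137 usually means the container was OOM-killed.
kubectl get pod <pod name> -n rbd-system -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode} {.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'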
  • Evicted

The Evicted status most commonly means the Pod was evicted because of insufficient resources, usually memory or disk pressure on the node. You can use df -Th to check the usage of the Docker data directory; if usage exceeds 85%, clean up resources promptly, especially large files and unused Docker images.
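
A sketch for checking and reclaiming disk space, assuming the default Docker data directory /var/lib/docker:

# Check how full the filesystem holding the Docker data directory is.
df -Th /var/lib/docker

# Show how much space images, containers and volumes occupy.
docker system df

# Remove unused images to free space; -a also removes images not referenced by any container.
docker image prune -a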

Use the following command to delete Pods in the Evicted state (scoped here to the rbd-system namespace):

kubectl get pods -n rbd-system | grep Evicted | awk '{print $1}' | xargs kubectl delete pod -n rbd-system

My Question Is Not Covered

If you are still unsure how to get your cluster working after reading this document, you can:

  • Go to GitHub to check whether a related issue already exists; if not, submit a new issue
  • Go to the Community to search for your question and find answers to similar questions