Application Operation

Preliminary Investigation

When a component instance runs abnormally, first carry out the preliminary investigation described in this section.

Image Type

Note that these images cannot be run. Likewise, images that exit automatically after a one-time run cannot run on the platform.

Insufficient Memory

Kato allocates memory to components by default:

  • Components created from an image default to 512 MB
  • Components built from source code default to 1 GB

This value can be set when the component is created and changed at any time on the component scaling interface. However, not every component can run with the default memory. When the allocation is too small, the memory usage of the component instance stays too high, and the instance may fail to start at all.

The typical symptoms are:

Component Deployment Type Selection Error

Kato supports two deployment types: the stateful service type and the stateless service type.

  • The stateless service type suits web and API components
  • The stateful service type suits databases, clusters, message middleware, and data components

Some components must be deployed as a specific type. For example, components such as ZooKeeper and MySQL must use the stateful service type; otherwise they may run abnormally.

Components Require Specific Initialization Conditions

For example, the MySQL component requires one of the following three environment variables when it starts for the first time:

  • MYSQL_ROOT_PASSWORD - specifies the root user password
  • MYSQL_ALLOW_EMPTY_PASSWORD - allows the root user to have an empty password
  • MYSQL_RANDOM_ROOT_PASSWORD - generates a random password for the root user

If none of these variables is set, the component does not enter its initialization process.
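As a rough check of this condition, the rule can be sketched in shell. The function name `has_mysql_init_env` is hypothetical and not part of Kato or MySQL, and the sketch assumes that a variable counts as set only when its value is non-empty.

```shell
# Hypothetical helper, not part of Kato or MySQL.
# Reads NAME=value lines on stdin (for example the output of `env`) and
# succeeds when at least one of the three variables has a non-empty value.
# Assumption: an empty value is treated the same as an unset variable.
has_mysql_init_env() {
    grep -Eq '^(MYSQL_ROOT_PASSWORD|MYSQL_ALLOW_EMPTY_PASSWORD|MYSQL_RANDOM_ROOT_PASSWORD)=.+'
}
```

For example, running `env | has_mysql_init_env || echo "no MySQL password option set"` inside the component's container would print the message when none of the three variables is present.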

To determine what happened to the current component, you can view the log:

Health Check

Kato components are configured with a health check by default. When a component starts, Kato checks whether it has started normally, using the port configured in the health check. If the check fails, the component is shown as running abnormally.

The health check port is taken from the ports opened by the component, and port 5000 is opened by default. Therefore, make sure the port opened for the component on the platform matches the port the component actually listens on.
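One way to confirm that the two ports match is to list the listening TCP sockets inside the container and look for the port opened on the platform. The sketch below assumes that the `ss` utility is available in the container image, which is not guaranteed. The function name `is_listening` is hypothetical.

```shell
# Hypothetical helper. Reads the output of `ss -lnt` on stdin and succeeds
# when some socket listens on the given port.
# $1 = port opened for the component on the platform (5000 by default)
is_listening() {
    awk -v port="$1" '
        # Skip the header row; the 4th column is "Local Address:Port".
        NR > 1 {
            n = split($4, parts, ":")
            if (parts[n] == port) found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}
```

A possible use is `kubectl exec <PodName> -n <Namespace> -- ss -lnt | is_listening 5000 || echo "nothing listens on 5000"`. If nothing listens on the opened port, either change the port on the platform or change the port the component listens on.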

grctl msg get Command Usage

The platform provides a global component status query command, grctl msg get.

The command returns:

  • TenantName (tenant alias)/ServiceName (component alias) - used to locate the component
  • Message (exception information) - relatively specific exception information
  • Reason (exception reason) - the cause of the exception, as far as the system can determine it, corresponding to the exception information
    • Error - an error occurred in the container itself; check the container log
    • OOMKilled - the component's memory allocation is insufficient
  • Count (number of occurrences) - how many times the information has occurred
  • LastTime (last exception time) - the time when the corresponding information last occurred
  • FirstTime (first exception time) - the time when the corresponding information first occurred
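The two Reason values listed above each point to a different follow-up. As a small illustration, the mapping can be written as a shell function. The name `next_step_for_reason` is hypothetical, and only the two reasons documented here are handled; any other value falls through to the in-depth investigation.

```shell
# Hypothetical helper. Maps a Reason value reported by `grctl msg get`
# to the follow-up suggested in this document.
next_step_for_reason() {
    case "$1" in
        Error)     echo "check the container log" ;;
        OOMKilled) echo "increase the component memory on the scaling interface" ;;
        *)         echo "continue with the in-depth investigation" ;;
    esac
}
```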

In-depth Investigation

If the preliminary investigation does not reveal the cause of the abnormality, follow this section to investigate the component in depth.

Get Component Details

Switch to the scaling interface of the problem component and get the query command:

Execute this command on the management node to return component details:

For component-level troubleshooting, the key information returned by this command includes:

  • PodStatus
    • Initialized - whether the current pod has been initialized; possible values are True/False
    • Ready - whether the current pod is running normally; when troubleshooting is needed, this value is False
    • PodScheduled - whether the current pod can be scheduled, that is, whether the cluster has a compute node that can run the pod; possible values are True/False
  • PodIP - the IP address the system assigned to the pod. If this item is empty, check whether the KatoSDN (calico/flannel) component is normal.
  • PodStratTime - the recorded startup time of the current pod. Compare it with the start time in Containers.State.Running to determine whether the container has restarted automatically. If it has, refer to Container Log Query.
  • Containers
    • ID - ID of the container running on the host
    • State - the state of the container
      • Running (time) - normal state; the container is currently running
      • Waiting
      • Terminating
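The comparison between PodStratTime and the start time in Containers.State.Running can be done by eye, or sketched as follows. This assumes GNU date and RFC 3339 timestamps such as 2021-01-01T00:00:00Z; the time format actually printed by the component details command may differ. The function name `start_gap_seconds` is hypothetical.

```shell
# Hypothetical helper. Assumes GNU date and RFC 3339 timestamps.
# $1 = PodStratTime, $2 = start time shown in Containers.State.Running
# Prints how many seconds after the pod start the container started.
start_gap_seconds() {
    pod_epoch=$(date -u -d "$1" +%s) || return 1
    container_epoch=$(date -u -d "$2" +%s) || return 1
    echo $((container_epoch - pod_epoch))
}
```

A gap of a few seconds is consistent with a first start. A gap of many minutes or hours suggests that the container has restarted since the pod was created.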

PodStatus Exception Troubleshooting

Start troubleshooting with the PodStatus return values.

Ready

When the component is running abnormally, this value is False. Once the component runs normally, the value automatically becomes True.

Initialized

When Initialized is False, use POD Status Query to inspect the pod events. This state is usually caused by the init container failing to find its image. After identifying the image in the pod events, push the specified image.

PodScheduled

When PodScheduled is False, the remaining resources on every schedulable compute node are insufficient to run the pod. Check whether the compute nodes are in a schedulable state, and confirm whether the cluster resources need to be expanded.
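To see which compute nodes are not schedulable, the STATUS column of `kubectl get nodes` can be filtered. The sketch below is one possible way to do that; the function name `unschedulable_nodes` is hypothetical, and it assumes the default column layout in which NAME is the first column and STATUS the second.

```shell
# Hypothetical helper. Reads `kubectl get nodes` output on stdin and prints
# the names of nodes whose STATUS is anything other than plain "Ready",
# for example "NotReady" or "Ready,SchedulingDisabled".
unschedulable_nodes() {
    awk 'NR > 1 && $2 != "Ready" { print $1 }'
}
```

It can be used as `kubectl get nodes | unschedulable_nodes`. For nodes that are Ready but full, `kubectl describe node <NodeName>` shows the resources already allocated on that node.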

Containers Exception Troubleshooting

Start troubleshooting with the return value of Containers.State.

Waiting

When a container in the pod is in this state, the container has been scheduled but cannot run. Use POD Status Query to inspect the pod events, together with Container Log Query, to resolve the problem.

When using the Kato platform, users sometimes install various plug-ins for components. Note that some plug-ins cannot be used together, such as the Service Integrated Network Management Plugin and the Service Network Management Plugin. Using both at the same time leaves one of the containers in the Waiting state and eventually causes the component to run abnormally, because both plug-ins work at the pod's network egress.

Terminating

When a container in the pod is in this state, the pod is in the process of terminating and exiting. Use POD Status Query to inspect the pod events, together with Container Log Query, to resolve the problem.

POD Status Query

When you don’t know what happened to the pod, you can use the kubectl command to troubleshoot the problem.

kubectl describe pods <PodName> -n <Namespace>

Under normal circumstances, you can focus directly on the Events record in the returned result. It records the whole startup process of the current pod and shows any error directly. Common errors:

  • If the image fails to be pulled, go to the management node and try docker push <name of the failed image in the event information>
  • Restart BackOffContainer - the container failed to start; refer to Container Log Query to resolve it.
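When the describe output is long, it can help to keep only the event rows of type Warning. The following is a minimal sketch; the function name `warning_events` is hypothetical, and it assumes the usual Events table in which the first column is the event type.

```shell
# Hypothetical helper. Reads `kubectl describe pods` output on stdin and
# prints only the rows whose first column is "Warning".
warning_events() {
    awk '$1 == "Warning"'
}
```

It can be used as `kubectl describe pods <PodName> -n <Namespace> | warning_events`. A similar result should be obtainable with `kubectl get events -n <Namespace> --field-selector involvedObject.name=<PodName>,type=Warning`.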

Container Log Query

Kato provides component log output on the component console interface, which integrates the log output of all containers in all pods under the current component.

In most cases, you can get the current component log from the log interface of the Kato component console and use it to determine the cause of the abnormal operation. If the log is not pushed to the Kato component console, use the following methods to get it:

  • Get the current log directly
    kubectl logs <PodName> <Containers.Name> -n <Namespace>
  • If the container in the pod has been restarted, get the log from before the restart
    kubectl logs --previous <PodName> <Containers.Name> -n <Namespace>

My Question is not Covered

If you still cannot run the component after reading this document, you can:

  • Go to GitHub to check whether there is a related issue, and if not, submit one
  • Go to Community and read the posts with the prefix [Use Question] to find answers to similar questions, or post a summary of your situation.