Monitoring Alarm Items

Overview

The Kato monitoring service is provided by the rbd-monitor component. rbd-monitor uses the Sidecar design pattern to integrate the Prometheus service: it dynamically discovers the targets to monitor through ETCD, and automatically configures and manages Prometheus. It periodically scrapes metrics from each target and persists the data locally, and it supports flexible PromQL queries and RESTful API queries.

Architecture Diagram:

Access Method

rbd-monitor listens on port 9999 by default, and the default installation already creates its Service object. After obtaining the ServiceIP from the cluster, add rbd-monitor as a third-party service on the platform and open its external port to access it.

How to get ServiceIP

$ kubectl get service rbd-monitor -n rbd-system
NAME          TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)    AGE
rbd-monitor   ClusterIP   10.68.140.5   <none>        9999/TCP   7h11m
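Once the ServiceIP is known, the PromQL and RESTful queries mentioned in the overview can be issued over HTTP. The following is a minimal sketch, assuming rbd-monitor exposes the standard Prometheus HTTP API (`/api/v1/query`) on port 9999; the helper names `build_query_url`, `parse_instant_result`, and `query` are hypothetical and not part of Kato.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(service_ip, promql, port=9999):
    """Build the instant-query URL for a PromQL expression.

    Assumption: rbd-monitor serves the standard Prometheus HTTP API.
    """
    params = urllib.parse.urlencode({"query": promql})
    return f"http://{service_ip}:{port}/api/v1/query?{params}"


def parse_instant_result(payload):
    """Turn an instant-query response into a list of (labels, value) pairs."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    return [
        (item["metric"], float(item["value"][1]))
        for item in payload["data"]["result"]
    ]


def query(service_ip, promql, timeout=5.0):
    """Run a PromQL instant query against rbd-monitor (needs network access)."""
    url = build_query_url(service_ip, promql)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_instant_result(json.load(resp))
```

For example, `query("10.68.140.5", "up")` would return one entry per scraped target, using the ClusterIP from the output above.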

For the authoritative list of monitoring and alarm items, query rbd-monitor directly. The lists below are for reference only.

Monitoring Items

Node Resource Monitoring Items

| Monitoring item | Owned component | Description |
| --- | --- | --- |
| cadvisor_version_info | cadvisor | Node system information |
| machine_memory_bytes | cadvisor | Memory size of the current host |
| machine_cpu_cores | cadvisor | Number of CPUs on the current node |
| node_filesystem_size | node | |
| node_load1 | node | |
| node_load5 | node | |
| node_load15 | node | Load average over 15 minutes |
| node_memory_MemTotal | node | Total node memory |
| node_memory_MemFree | node | Free node memory |
| node_uname_info | node | Node information |

Kato Service Component Monitoring Items

| Monitoring item | Owned component | Description |
| --- | --- | --- |
| acp_mq_dequeue_number | rbd-mq | |
| acp_mq_enqueue_number | rbd-mq | |
| acp_mq_exporter_health_status | rbd-mq | |
| acp_mq_exporter_last_scrape_error | rbd-mq | |
| acp_mq_exporter_scrapes_total | rbd-mq | |
| builder_exporter_builder_task_error | rbd-chaos | Number of failed source build tasks |
| builder_exporter_builder_task_number | rbd-chaos | Number of source build tasks |
| builder_exporter_health_status | rbd-chaos | Component status; 1 means healthy |
| event_log_exporter_chan_cache_size | rbd-eventlog | |
| event_log_exporter_collector_duration_seconds | rbd-eventlog | |
| event_log_exporter_container_log_store_cache_barrel_count | rbd-eventlog | |
| event_log_exporter_container_log_store_log_count | rbd-eventlog | |
| event_log_exporter_event_store_barrel_count | rbd-eventlog | |
| event_log_exporter_event_store_cache_barrel_count | rbd-eventlog | |
| event_log_exporter_event_store_log_count | rbd-eventlog | |
| event_log_exporter_health_status | rbd-eventlog | |
| event_log_exporter_last_scrape_error | rbd-eventlog | |
| event_log_exporter_monitor_store_barrel_count | rbd-eventlog | |
| event_log_exporter_monitor_store_log_count | rbd-eventlog | |
| event_log_exporter_scrapes_total | rbd-eventlog | |
| gateway_request_duration_seconds_bucket | rbd-gateway | Number of client requests within the specified request time (bucket) |
| gateway_request_duration_seconds_count | rbd-gateway | Total number of client requests |
| gateway_request_duration_seconds_sum | rbd-gateway | Total client request time |
| gateway_request_size_bucket | rbd-gateway | Number of requests within the specified request size (bucket) |
| gateway_request_size_count | rbd-gateway | Total number of client requests |
| gateway_request_size_sum | rbd-gateway | Total size of client requests |
| gateway_requests | rbd-gateway | Number of client visits |
| gateway_response_duration_seconds_bucket | rbd-gateway | Number of responses within the specified response time (bucket) |
| gateway_response_duration_seconds_count | rbd-gateway | Total number of responses |
| gateway_response_duration_seconds_sum | rbd-gateway | Total response time |
| gateway_response_size_bucket | rbd-gateway | Number of responses within the specified response size (bucket) |
| gateway_response_size_count | rbd-gateway | Total number of responses |
| gateway_response_size_sum | rbd-gateway | Total size of responses |
| gateway_upstream_latency_seconds | rbd-gateway | Number of delays within the specified delay time (bucket) |
| gateway_upstream_latency_seconds_count | rbd-gateway | Total number of delays |
| gateway_upstream_latency_seconds_sum | rbd-gateway | Sum of delay time |
| worker_exporter_health_status | rbd-worker | |
| worker_exporter_worker_task_number | rbd-worker | |
| worker_exporter_collector_duration_seconds | rbd-worker | |
| worker_exporter_last_scrape_error | rbd-worker | |
| worker_exporter_scrapes_total | rbd-worker | |
| worker_exporter_worker_task_error | rbd-worker | |
| worker_up | rbd-worker | |
| scrape_samples_scraped | | |
| scrape_samples_post_metric_relabeling | | |
| scrape_duration_seconds | | |
| statsd_exporter_build_info | | |
| statsd_exporter_events_total | | |
| statsd_exporter_lines_total | | |
| statsd_exporter_loaded_mappings | | |
| statsd_exporter_samples_total | | |
| statsd_exporter_tag_errors_total | | |
| statsd_exporter_tags_total | | |
| statsd_exporter_tcp_connection_errors_total | | |
| statsd_exporter_tcp_connections_total | | |
| statsd_exporter_tcp_too_long_lines_total | | |
| statsd_exporter_udp_packets_total | | |
| up | | Component status |
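The gateway `_bucket`, `_sum`, and `_count` series are usually combined in PromQL, not read on their own. The sketch below builds two such expressions. It assumes these series follow the standard Prometheus histogram layout (cumulative buckets with an `le` label), which the source does not confirm; the function names and the 5-minute default window are illustrative only.

```python
def avg_request_duration(window="5m"):
    """PromQL for the mean client request duration over the given window."""
    return (
        f"rate(gateway_request_duration_seconds_sum[{window}]) / "
        f"rate(gateway_request_duration_seconds_count[{window}])"
    )


def request_duration_quantile(q, window="5m"):
    """PromQL estimating the q-quantile of request duration from the buckets.

    Assumption: the bucket series carries the conventional `le` label.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    return (
        f"histogram_quantile({q}, sum by (le) "
        f"(rate(gateway_request_duration_seconds_bucket[{window}])))"
    )
```

The resulting strings can be pasted into the Prometheus expression browser or sent to the query API; the same pattern applies to the request size and response series.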

Application-level Monitoring Items

| Monitoring item | Description |
| --- | --- |
| app_resource_appmemory | App memory, filtered by service_id and tenant_id |
| app_resource_appfs | Apps |
| app_client_request | Application |
| app_client_requesttime | App |
| app_request | Application |
| app_request_unusual | App |
| app_requestclient | Application |
| app_requesttime | App |

Typical Application-level Monitoring Items Collected by cAdvisor

| Monitoring item | Type | Description |
| --- | --- | --- |
| container_cpu_load_average_10s | gauge | Average CPU load of the container over the past 10 seconds |
| container_cpu_usage_seconds_total | counter | Cumulative CPU time used by the container on each CPU core (unit: seconds) |
| container_cpu_system_seconds_total | counter | Cumulative system CPU time (unit: seconds) |
| container_cpu_user_seconds_total | counter | Cumulative user CPU time (unit: seconds) |
| container_fs_usage_bytes | gauge | File system usage in the container (unit: bytes) |
| container_fs_limit_bytes | gauge | Total file system capacity available to the container (unit: bytes) |
| container_fs_reads_bytes_total | counter | Total amount of data read by the container (unit: bytes) |
| container_fs_writes_bytes_total | counter | Total amount of data written by the container (unit: bytes) |
| container_memory_max_usage_bytes | gauge | Maximum memory usage of the container (unit: bytes) |
| container_memory_usage_bytes | gauge | Current memory usage of the container (unit: bytes) |
| container_spec_memory_limit_bytes | gauge | Memory usage limit of the container |
| container_network_receive_bytes_total | counter | Total amount of data received by the container network (unit: bytes) |
| container_network_transmit_bytes_total | counter | Total amount of data transmitted by the container network (unit: bytes) |
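The Type column matters when reading these values: a gauge can be used as sampled, while a counter only becomes meaningful as a rate between two samples. The sketch below illustrates the difference with two hypothetical helpers. It mirrors the idea behind PromQL's `rate()` in simplified form and is not how rbd-monitor computes anything.

```python
def counter_rate(prev, curr):
    """Per-second increase of a counter between two (timestamp, value) samples.

    If the value dropped, the counter is assumed to have reset (for example
    after a container restart), and the current value is taken as the increase.
    """
    (t0, v0), (t1, v1) = prev, curr
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    delta = v1 - v0
    if delta < 0:
        delta = v1
    return delta / (t1 - t0)


def memory_utilization(usage_bytes, limit_bytes):
    """Ratio of two gauges; returns None when no positive limit is set."""
    if limit_bytes <= 0:
        return None
    return usage_bytes / limit_bytes
```

For instance, two samples of container_cpu_usage_seconds_total taken 10 seconds apart with values 10.0 and 15.0 give a rate of 0.5, which means the container used half a CPU core on average during that interval.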

Other Monitoring Items

| Monitoring item | Description |
| --- | --- |
| process_cpu_seconds_total | |
| process_max_fds | |
| process_open_fds | |
| process_virtual_memory_bytes | |
| process_start_time_seconds | |
| process_resident_memory_bytes | |

Alarm Rule Description

Component Monitoring Alarm

| Alarm item | Alarm information |
| --- | --- |
| api service offline | APIDown |
| chaos service offline | BuilderDown |
| chaos component status is abnormal | BuilderUnhealthy |
| The number of abnormal source build tasks is greater than 30 | BuilderTaskError |
| ETCD service offline | EtcdDown |
| ETCD Leader node goes offline | EtcdLoseLeader |
| ETCD cluster member abnormal | InsufficientMembers |
| ETCD cluster Leader changes | HighNumberOfLeaderChanges |
| ETCD gRPC failed requests greater than 0.05 | HighNumberOfFailedGRPCRequests |
| ETCD failed HTTP requests within 1 minute greater than 0.05 | HighNumberOfFailedHTTPRequests |
| ETCD gRPC slow queries within 1 minute exceed 0.15 | GRPCRequestsSlow |
| ETCD disk space usage exceeds 80% | DatabaseSpaceExceeded |
| eventlog component status is abnormal | EventLogUnhealthy |
| eventlog service offline | EventLogDown |
| gateway service offline | GatewayDown |
| Gateway request size exceeds 10M | RequestSizeTooMuch |
| The number of gateway requests per second exceeds 200 | RequestMany |
| The number of gateway error requests within 10s is greater than 5 | FailureRequestMany |
| mq service offline | MqDown |
| mq component status is abnormal | MqUnhealthy |
| Tasks have stayed in the mq message queue for more than 1 minute | MqMessageQueueBlock |
| webcli service offline | WebcliDown |
| webcli component status is abnormal | WebcliUnhealthy |
| webcli command execution errors exceed 5 per second | WebcliUnhealthy |
| worker service offline | WorkerDown |
| worker component status is abnormal | WorkerUnhealthy |
| The number of worker task execution errors is greater than 50 | WorkerTaskError |

Cluster Monitoring Alarm

Alarm itemAlarm information
| Kato cluster node is unhealthy | RbdNodeUnhealth |
| K8s cluster node is unhealthy | KubeNodeUnhealth |
| Collecting cluster information takes more than 10s | ClusterCollectorTimeout |
| Tenant resource usage exceeds the resource limit | InsufficientTenantResources |
| Node goes offline | NodeDown |
| Node CPU usage is greater than 70% within 5 minutes | HighCpuUsageOnNode |
| Available cluster memory is less than 2GB | InsufficientClusteMemoryResources |
| Available cluster CPU is less than 500m | InsufficientClusteCPUResources |
| Node load is greater than 5 within 5 minutes | HighLoadOnNode |
| Remaining available node inodes are less than 0.3 | InodeFreerateLow |
| Node root partition disk usage is greater than 85% | HighRootdiskUsageOnNode |
| Node Docker disk partition usage is greater than 85% | HighDockerdiskUsageOnNode |
| Node memory usage is greater than 80% | HighMemoryUsageOnNode |
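To make the node-level thresholds above concrete, the sketch below checks a set of node readings against them and returns the alarm names that would fire. This is only an illustration: in a real deployment these alarms are evaluated by Prometheus alerting rules, and the dictionary keys (`cpu_usage_percent`, `load5`, and so on) are hypothetical names chosen for this example.

```python
# Alarm name -> (hypothetical reading key, threshold taken from the table above).
NODE_THRESHOLDS = {
    "HighCpuUsageOnNode": ("cpu_usage_percent", 70.0),
    "HighLoadOnNode": ("load5", 5.0),
    "HighRootdiskUsageOnNode": ("rootdisk_usage_percent", 85.0),
    "HighDockerdiskUsageOnNode": ("dockerdisk_usage_percent", 85.0),
    "HighMemoryUsageOnNode": ("memory_usage_percent", 80.0),
}


def firing_alarms(node_readings):
    """Return the sorted alarm names whose threshold is strictly exceeded.

    Readings that are missing from node_readings are skipped, not treated as 0.
    """
    return sorted(
        name
        for name, (key, limit) in NODE_THRESHOLDS.items()
        if key in node_readings and node_readings[key] > limit
    )
```

With readings of 91% CPU, a 5-minute load of 1.2, and exactly 80% memory, only HighCpuUsageOnNode is returned, because each rule is "greater than" and not "greater than or equal to".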

For cluster monitoring alarm configuration, please refer to Monitor Alarm Deployment