Viewing alerts

The system ships with predefined rules and also allows custom rules to be configured. The "Viewing alerts" page shows the results of rule evaluation (alerts). The results are split into tabs according to the firing status:

  • Firing. The alert is firing and indicates a current violation of the configured condition or event.
  • Pending. The alert is not firing at the moment, but it may fire in the future, for example because of temporary changes.
  • Inactive. The alert has not fired for a certain period of time or has been disabled by an administrator.

On each tab, alerts are grouped by the configured expression. Each group shows the expression, a description of the alert firing conditions, a counter of events in the group, and the alerts themselves.

Each alert includes:

  • Identifier;
  • Name;
  • Date and time of firing;
  • Description;
  • Expression;
  • Severity.
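
The fields above map onto standard Prometheus alerting-rule definitions. As an illustration only, below is a minimal sketch of what a custom rule might look like if it is supplied as a Prometheus Operator PrometheusRule resource; the resource name, group name, alert name, metric, threshold and labels are assumptions for the example and are not defined by the platform.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: custom-alert-rules            # illustrative name
  namespace: monitoring
spec:
  groups:
    - name: custom.rules              # group shown on the "Viewing alerts" page
      rules:
        - alert: HighPodRestartRate   # alert name
          expr: increase(kube_pod_container_status_restarts_total{job="kube-state-metrics"}[15m]) > 5   # expression
          for: 15m                    # how long the condition must hold before the alert becomes Firing
          labels:
            severity: warning         # severity
          annotations:
            description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has restarted more than 5 times in the last 15 minutes."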

Predefined rules

This section lists all of the predefined alerting rules in a client cluster of the platform.

Group shturval-backup

VeleroBackupPartialFailures

  • Description: Velero backup {{ $labels.schedule }} has {{ $value | humanizePercentage }} partially failed backups.

Expression

velero_backup_partial_failure_total{schedule!=""} / velero_backup_attempt_total{schedule!=""} > 0.25

VeleroBackupFailures

  • Description: Velero backup {{ $labels.schedule }} has {{ $value | humanizePercentage }} failed backups.

Expression

velero_backup_failure_total{schedule!=""} / velero_backup_attempt_total{schedule!=""} > 0.25

Group alertmanager.rules

AlertmanagerFailedReload

  • Description: Configuration has failed to load for {{ $labels.namespace }}/{{ $labels.pod }}.

Expression

# Without max_over_time, failed scrapes could create false negatives, see
# https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details.
max_over_time(alertmanager_config_last_reload_successful{job="shturval-metrics-alertmanager",namespace="monitoring"}[5m]) == 0

AlertmanagerMembersInconsistent

  • Description: Alertmanager {{ $labels.namespace }}/{{ $labels.pod }} has only found {{ $value }} members of the {{ $labels.job }} cluster.

Expression

# Without max_over_time, failed scrapes could create false negatives, see
# https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details.
  max_over_time(alertmanager_cluster_members{job="shturval-metrics-alertmanager",namespace="monitoring"}[5m])
< on (namespace,service,cluster) group_left
  count by (namespace,service,cluster) (max_over_time(alertmanager_cluster_members{job="shturval-metrics-alertmanager",namespace="monitoring"}[5m]))

AlertmanagerFailedToSendAlerts

  • Description: Alertmanager {{ $labels.namespace }}/{{ $labels.pod }} failed to send {{ $value | humanizePercentage }} of notifications to {{ $labels.integration }}.

Expression

(
  rate(alertmanager_notifications_failed_total{job="shturval-metrics-alertmanager",namespace="monitoring"}[5m])
/
  ignoring (reason) group_left rate(alertmanager_notifications_total{job="shturval-metrics-alertmanager",namespace="monitoring"}[5m])
)
> 0.01

AlertmanagerClusterFailedToSendAlerts

  • Description: The minimum notification failure rate to {{ $labels.integration }} sent from any instance in the {{ $labels.job }} cluster is {{ $value | humanizePercentage }}.

Expression

min by (namespace,service, integration) (
  rate(alertmanager_notifications_failed_total{job="shturval-metrics-alertmanager",namespace="monitoring", integration=~`.*`}[5m])
/
  ignoring (reason) group_left rate(alertmanager_notifications_total{job="shturval-metrics-alertmanager",namespace="monitoring", integration=~`.*`}[5m])
)
> 0.01

AlertmanagerClusterFailedToSendAlerts

  • Description: The minimum notification failure rate to {{ $labels.integration }} sent from any instance in the {{ $labels.job }} cluster is {{ $value | humanizePercentage }}.

Expression

min by (namespace,service, integration) (
  rate(alertmanager_notifications_failed_total{job="shturval-metrics-alertmanager",namespace="monitoring", integration!~`.*`}[5m])
/
  ignoring (reason) group_left rate(alertmanager_notifications_total{job="shturval-metrics-alertmanager",namespace="monitoring", integration!~`.*`}[5m])
)
> 0.01

AlertmanagerConfigInconsistent

  • Description: Alertmanager instances within the {{ $labels.job }} cluster have different configurations.

Expression

count by (namespace,service,cluster) (
  count_values by (namespace,service,cluster) ("config_hash", alertmanager_config_hash{job="shturval-metrics-alertmanager",namespace="monitoring"})
)
!= 1

AlertmanagerClusterDown

  • Description: {{ $value | humanizePercentage }} of Alertmanager instances within the {{ $labels.job }} cluster have been up for less than half of the last 5m.

Expression

(
  count by (namespace,service,cluster) (
    avg_over_time(up{job="shturval-metrics-alertmanager",namespace="monitoring"}[5m]) < 0.5
  )
/
  count by (namespace,service,cluster) (
    up{job="shturval-metrics-alertmanager",namespace="monitoring"}
  )
)
>= 0.5

AlertmanagerClusterCrashlooping

  • Description: {{ $value | humanizePercentage }} of Alertmanager instances within the {{ $labels.job }} cluster have restarted at least 5 times in the last 10m.

Expression

(
  count by (namespace,service,cluster) (
    changes(process_start_time_seconds{job="shturval-metrics-alertmanager",namespace="monitoring"}[10m]) > 4
  )
/
  count by (namespace,service,cluster) (
    up{job="shturval-metrics-alertmanager",namespace="monitoring"}
  )
)
>= 0.5

Group config-reloaders

ConfigReloaderSidecarErrors

  • Description: Errors encountered while the {{ $labels.pod }} config-reloader sidecar attempts to sync config in {{ $labels.namespace }} namespace. As a result, configuration for service running in {{ $labels.pod }} may be stale and cannot be updated anymore.

Expression

max_over_time(reloader_last_reload_successful{namespace=~".+"}[5m]) == 0

Group etcd

etcdMembersDown

  • Description: etcd cluster "{{ $labels.job }}": members are down ({{ $value }}).

Expression

max without (endpoint) (
  sum without (instance) (up{job=~".*etcd.*"} == bool 0)
or
  count without (To) (
    sum without (instance) (rate(etcd_network_peer_sent_failures_total{job=~".*etcd.*"}[120s])) > 0.01
  )
)
> 0

etcdInsufficientMembers

  • Description: etcd cluster "{{ $labels.job }}": insufficient members ({{ $value }}).

Expression

sum(up{job=~".*etcd.*"} == bool 1) without (instance) < ((count(up{job=~".*etcd.*"}) without (instance) + 1) / 2)

etcdNoLeader

  • Description: etcd cluster "{{ $labels.job }}": member {{ $labels.instance }} has no leader.

Expression

etcd_server_has_leader{job=~".*etcd.*"} == 0

etcdHighNumberOfLeaderChanges

  • Description: etcd cluster "{{ $labels.job }}": {{ $value }} leader changes within the last 15 minutes. Frequent elections may be a sign of insufficient resources, high network latency, or disruptions by other components and should be investigated. etcd cluster has high number of leader changes.

Expression

increase((max without (instance) (etcd_server_leader_changes_seen_total{job=~".*etcd.*"}) or 0*absent(etcd_server_leader_changes_seen_total{job=~".*etcd.*"}))[15m:1m]) >= 4

etcdHighNumberOfFailedGRPCRequests

  • Description: etcd cluster "{{ $labels.job }}": {{ $value }}% of requests for {{ $labels.grpc_method }} failed on etcd instance {{ $labels.instance }}.

Expression

100 * sum(rate(grpc_server_handled_total{job=~".*etcd.*", grpc_code=~"Unknown|FailedPrecondition|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded"}[5m])) without (grpc_type, grpc_code)
  /
sum(rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) without (grpc_type, grpc_code)
  > 1

etcdHighNumberOfFailedGRPCRequests

  • Description: etcd cluster "{{ $labels.job }}": {{ $value }}% of requests for {{ $labels.grpc_method }} failed on etcd instance {{ $labels.instance }}.

Expression

100 * sum(rate(grpc_server_handled_total{job=~".*etcd.*", grpc_code=~"Unknown|FailedPrecondition|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded"}[5m])) without (grpc_type, grpc_code)
  /
sum(rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) without (grpc_type, grpc_code)
  > 5

etcdGRPCRequestsSlow

  • Description: etcd cluster "{{ $labels.job }}": 99th percentile of gRPC requests is {{ $value }}s on etcd instance {{ $labels.instance }} for {{ $labels.grpc_method }} method. etcd grpc requests are slow.

Expression

histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{job=~".*etcd.*", grpc_method!="Defragment", grpc_type="unary"}[5m])) without(grpc_type))
> 0.15

etcdMemberCommunicationSlow

  • Description: etcd cluster "{{ $labels.job }}": member communication with {{ $labels.To }} is taking {{ $value }}s on etcd instance {{ $labels.instance }}.

Expression

histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket{job=~".*etcd.*"}[5m]))
> 0.15

etcdHighNumberOfFailedProposals

  • Description: etcd cluster "{{ $labels.job }}": {{ $value }} proposal failures within the last 30 minutes on etcd instance {{ $labels.instance }}.

Expression

rate(etcd_server_proposals_failed_total{job=~".*etcd.*"}[15m]) > 5

etcdHighFsyncDurations

  • Description: etcd cluster "{{ $labels.job }}": 99th percentile fsync durations are {{ $value }}s on etcd instance {{ $labels.instance }}.

Expression

histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m]))
> 0.5

etcdHighFsyncDurations

  • Description: etcd cluster "{{ $labels.job }}": 99th percentile fsync durations are {{ $value }}s on etcd instance {{ $labels.instance }}.

Expression

histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m]))
> 1

etcdHighCommitDurations

  • Description: etcd cluster "{{ $labels.job }}": 99th percentile commit durations {{ $value }}s on etcd instance {{ $labels.instance }}.

Expression

histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket{job=~".*etcd.*"}[5m]))
> 0.25

etcdDatabaseQuotaLowSpace

  • Description: etcd cluster "{{ $labels.job }}": database size exceeds the defined quota on etcd instance {{ $labels.instance }}, please defrag or increase the quota as the writes to etcd will be disabled when it is full.

Expression

(last_over_time(etcd_mvcc_db_total_size_in_bytes{job=~".*etcd.*"}[5m]) / last_over_time(etcd_server_quota_backend_bytes{job=~".*etcd.*"}[5m]))*100 > 95

etcdExcessiveDatabaseGrowth

  • Description: etcd cluster "{{ $labels.job }}": Predicting running out of disk space in the next four hours, based on write observations within the past four hours on etcd instance {{ $labels.instance }}, please check as it might be disruptive.

Expression

predict_linear(etcd_mvcc_db_total_size_in_bytes{job=~".*etcd.*"}[4h], 4*60*60) > etcd_server_quota_backend_bytes{job=~".*etcd.*"}

etcdDatabaseHighFragmentationRatio

  • Description: etcd cluster "{{ $labels.job }}": database size in use on instance {{ $labels.instance }} is {{ $value | humanizePercentage }} of the actual allocated disk space, please run defragmentation (e.g. etcdctl defrag) to retrieve the unused fragmented disk space.

Expression

(last_over_time(etcd_mvcc_db_total_size_in_use_in_bytes{job=~".*etcd.*"}[5m]) / last_over_time(etcd_mvcc_db_total_size_in_bytes{job=~".*etcd.*"}[5m])) < 0.5 and etcd_mvcc_db_total_size_in_use_in_bytes{job=~".*etcd.*"} > 104857600

Group general.rules

TargetDown

  • Description: {{ printf "%.4g" $value }}% of the {{ $labels.job }}/{{ $labels.service }} targets in {{ $labels.namespace }} namespace are down.

Expression

100 * (count(up == 0) BY (cluster, job, namespace, service) / count(up) BY (cluster, job, namespace, service)) > 10

Watchdog

  • Description: This is an alert meant to ensure that the entire alerting pipeline is functional. This alert is always firing, therefore it should always be firing in Alertmanager and always fire against a receiver. There are integrations with various notification mechanisms that send a notification when this alert is not firing. For example the "DeadMansSnitch" integration in PagerDuty.

Expression

vector(1)

InfoInhibitor

  • Description: This is an alert that is used to inhibit info alerts. By themselves, the info-level alerts are sometimes very noisy, but they are relevant when combined with other alerts. This alert fires whenever there's a severity="info" alert, and stops firing when another alert with a severity of 'warning' or 'critical' starts firing on the same namespace. This alert should be routed to a null receiver and configured to inhibit alerts with severity="info".

Expression

ALERTS{severity = "info"} == 1 unless on (namespace) ALERTS{alertname != "InfoInhibitor", severity =~ "warning|critical", alertstate="firing"} == 1

Group kube-apiserver-slos

KubeAPIErrorBudgetBurn

  • Description: The API server is burning too much error budget.

Expression

sum(apiserver_request:burnrate1h) > (14.40 * 0.01000)
and
sum(apiserver_request:burnrate5m) > (14.40 * 0.01000)

  • Long: 1h
  • Short: 5m

KubeAPIErrorBudgetBurn

  • Description: The API server is burning too much error budget.

Expression

sum(apiserver_request:burnrate6h) > (6.00 * 0.01000)
and
sum(apiserver_request:burnrate30m) > (6.00 * 0.01000)

  • Long: 6h
  • Short: 30m

KubeAPIErrorBudgetBurn

  • Description: The API server is burning too much error budget.

Expression

sum(apiserver_request:burnrate1d) > (3.00 * 0.01000)
and
sum(apiserver_request:burnrate2h) > (3.00 * 0.01000)

  • Long: 1d
  • Short: 2h

KubeAPIErrorBudgetBurn

  • Description: The API server is burning too much error budget.

Expression

sum(apiserver_request:burnrate3d) > (1.00 * 0.01000)
and
sum(apiserver_request:burnrate6h) > (1.00 * 0.01000)

Group kube-state-metrics

KubeStateMetricsListErrors

  • Description: kube-state-metrics is experiencing errors at an elevated rate in list operations. This is likely causing it to not be able to expose metrics about Kubernetes objects correctly or at all.

Expression

(sum(rate(kube_state_metrics_list_total{job="kube-state-metrics",result="error"}[5m])) by (cluster)
  /
sum(rate(kube_state_metrics_list_total{job="kube-state-metrics"}[5m])) by (cluster))
> 0.01

KubeStateMetricsWatchErrors

  • Description: kube-state-metrics is experiencing errors at an elevated rate in watch operations. This is likely causing it to not be able to expose metrics about Kubernetes objects correctly or at all.

Expression

(sum(rate(kube_state_metrics_watch_total{job="kube-state-metrics",result="error"}[5m])) by (cluster)
  /
sum(rate(kube_state_metrics_watch_total{job="kube-state-metrics"}[5m])) by (cluster))
> 0.01

KubeStateMetricsShardingMismatch

  • Description: kube-state-metrics pods are running with different --total-shards configuration, some Kubernetes objects may be exposed multiple times or not exposed at all.

Expression

stdvar (kube_state_metrics_total_shards{job="kube-state-metrics"}) by (cluster) != 0

KubeStateMetricsShardsMissing

  • Description: kube-state-metrics shards are missing, some Kubernetes objects are not being exposed.

Expression

2^max(kube_state_metrics_total_shards{job="kube-state-metrics"}) by (cluster) - 1
  -
sum( 2 ^ max by (cluster, shard_ordinal) (kube_state_metrics_shard_ordinal{job="kube-state-metrics"}) ) by (cluster)
!= 0

Group kubernetes-apps

KubePodCrashLooping

  • Description: Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is in waiting state (reason: "CrashLoopBackOff").

Expression

max_over_time(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff", job="kube-state-metrics", namespace=~".*"}[5m]) >= 1

KubePodNotReady

  • Description: Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-ready state for longer than 15 minutes.

Expression

sum by (namespace, pod, cluster) (
  max by (namespace, pod, cluster) (
    kube_pod_status_phase{job="kube-state-metrics", namespace=~".*", phase=~"Pending|Unknown|Failed"}
  ) * on (namespace, pod, cluster) group_left(owner_kind) topk by (namespace, pod, cluster) (
    1, max by (namespace, pod, owner_kind, cluster) (kube_pod_owner{owner_kind!="Job"})
  )
) > 0

KubeDeploymentGenerationMismatch

  • Description: Deployment generation for {{ $labels.namespace }}/{{ $labels.deployment }} does not match, this indicates that the Deployment has failed but has not been rolled back.

Expression

kube_deployment_status_observed_generation{job="kube-state-metrics", namespace=~".*"}
  !=
kube_deployment_metadata_generation{job="kube-state-metrics", namespace=~".*"}

KubeDeploymentReplicasMismatch

  • Description: Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has not matched the expected number of replicas for longer than 15 minutes.

Expression

(
  kube_deployment_spec_replicas{job="kube-state-metrics", namespace=~".*"}
    >
  kube_deployment_status_replicas_available{job="kube-state-metrics", namespace=~".*"}
) and (
  changes(kube_deployment_status_replicas_updated{job="kube-state-metrics", namespace=~".*"}[10m])
    ==
  0
)

KubeDeploymentRolloutStuck

  • Description: Rollout of deployment {{ $labels.namespace }}/{{ $labels.deployment }} is not progressing for longer than 15 minutes.

Expression

kube_deployment_status_condition{condition="Progressing", status="false",job="kube-state-metrics", namespace=~".*"}
!= 0

KubeStatefulSetReplicasMismatch

  • Description: StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} has not matched the expected number of replicas for longer than 15 minutes.

Expression

(
  kube_statefulset_status_replicas_ready{job="kube-state-metrics", namespace=~".*"}
    !=
  kube_statefulset_status_replicas{job="kube-state-metrics", namespace=~".*"}
) and (
  changes(kube_statefulset_status_replicas_updated{job="kube-state-metrics", namespace=~".*"}[10m])
    ==
  0
)

KubeStatefulSetGenerationMismatch

  • Description: StatefulSet generation for {{ $labels.namespace }}/{{ $labels.statefulset }} does not match, this indicates that the StatefulSet has failed but has not been rolled back.

Expression

kube_statefulset_status_observed_generation{job="kube-state-metrics", namespace=~".*"}
  !=
kube_statefulset_metadata_generation{job="kube-state-metrics", namespace=~".*"}

KubeStatefulSetUpdateNotRolledOut

  • Description: StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} update has not been rolled out.

Expression

(
  max without (revision) (
    kube_statefulset_status_current_revision{job="kube-state-metrics", namespace=~".*"}
      unless
    kube_statefulset_status_update_revision{job="kube-state-metrics", namespace=~".*"}
  )
    *
  (
    kube_statefulset_replicas{job="kube-state-metrics", namespace=~".*"}
      !=
    kube_statefulset_status_replicas_updated{job="kube-state-metrics", namespace=~".*"}
  )
)  and (
  changes(kube_statefulset_status_replicas_updated{job="kube-state-metrics", namespace=~".*"}[5m])
    ==
  0
)

KubeDaemonSetRolloutStuck

  • Description: DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} has not finished or progressed for at least 15 minutes.

Expression

(
  (
    kube_daemonset_status_current_number_scheduled{job="kube-state-metrics", namespace=~".*"}
     !=
    kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics", namespace=~".*"}
  ) or (
    kube_daemonset_status_number_misscheduled{job="kube-state-metrics", namespace=~".*"}
     !=
    0
  ) or (
    kube_daemonset_status_updated_number_scheduled{job="kube-state-metrics", namespace=~".*"}
     !=
    kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics", namespace=~".*"}
  ) or (
    kube_daemonset_status_number_available{job="kube-state-metrics", namespace=~".*"}
     !=
    kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics", namespace=~".*"}
  )
) and (
  changes(kube_daemonset_status_updated_number_scheduled{job="kube-state-metrics", namespace=~".*"}[5m])
    ==
  0
)

KubeContainerWaiting

  • Description: pod/{{ $labels.pod }} in namespace {{ $labels.namespace }} on container {{ $labels.container }} has been in waiting state for longer than 1 hour.

Expression

sum by (namespace, pod, container, cluster) (kube_pod_container_status_waiting_reason{job="kube-state-metrics", namespace=~".*"}) > 0

KubeDaemonSetNotScheduled

  • Description: {{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are not scheduled.

Expression

kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics", namespace=~".*"}
  -
kube_daemonset_status_current_number_scheduled{job="kube-state-metrics", namespace=~".*"} > 0

KubeDaemonSetMisScheduled

  • Description: {{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are running where they are not supposed to run.

Expression

kube_daemonset_status_number_misscheduled{job="kube-state-metrics", namespace=~".*"} > 0

KubeJobNotCompleted

  • Description: Job {{ $labels.namespace }}/{{ $labels.job_name }} is taking more than {{ "43200" | humanizeDuration }} to complete.

Expression

time() - max by (namespace, job_name, cluster) (kube_job_status_start_time{job="kube-state-metrics", namespace=~".*"}
  and
kube_job_status_active{job="kube-state-metrics", namespace=~".*"} > 0) > 43200

KubeJobFailed

  • Description: Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete. Removing failed job after investigation should clear this alert.

Expression

kube_job_failed{job="kube-state-metrics", namespace=~".*"}  > 0

KubeHpaReplicasMismatch

  • Description: HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} has not matched the desired number of replicas for longer than 15 minutes.

Expression

(kube_horizontalpodautoscaler_status_desired_replicas{job="kube-state-metrics", namespace=~".*"}
  !=
kube_horizontalpodautoscaler_status_current_replicas{job="kube-state-metrics", namespace=~".*"})
  and
(kube_horizontalpodautoscaler_status_current_replicas{job="kube-state-metrics", namespace=~".*"}
  >
kube_horizontalpodautoscaler_spec_min_replicas{job="kube-state-metrics", namespace=~".*"})
  and
(kube_horizontalpodautoscaler_status_current_replicas{job="kube-state-metrics", namespace=~".*"}
  <
kube_horizontalpodautoscaler_spec_max_replicas{job="kube-state-metrics", namespace=~".*"})
  and
changes(kube_horizontalpodautoscaler_status_current_replicas{job="kube-state-metrics", namespace=~".*"}[15m]) == 0

KubeHpaMaxedOut

  • Description: HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} has been running at max replicas for longer than 15 minutes.

Expression

kube_horizontalpodautoscaler_status_current_replicas{job="kube-state-metrics", namespace=~".*"}
  ==
kube_horizontalpodautoscaler_spec_max_replicas{job="kube-state-metrics", namespace=~".*"}

Group kubernetes-resources

KubeCPUOvercommit

  • Description: Cluster {{ $labels.cluster }} has overcommitted CPU resource requests for Pods by {{ $value }} CPU shares and cannot tolerate node failure.

Expression

sum(namespace_cpu:kube_pod_container_resource_requests:sum{job="kube-state-metrics",}) by (cluster) - (sum(kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"}) by (cluster) - max(kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"}) by (cluster)) > 0
and
(sum(kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"}) by (cluster) - max(kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"}) by (cluster)) > 0

KubeMemoryOvercommit

  • Description: Cluster {{ $labels.cluster }} has overcommitted memory resource requests for Pods by {{ $value | humanize }} bytes and cannot tolerate node failure.

Expression

sum(namespace_memory:kube_pod_container_resource_requests:sum{}) by (cluster) - (sum(kube_node_status_allocatable{resource="memory", job="kube-state-metrics"}) by (cluster) - max(kube_node_status_allocatable{resource="memory", job="kube-state-metrics"}) by (cluster)) > 0
and
(sum(kube_node_status_allocatable{resource="memory", job="kube-state-metrics"}) by (cluster) - max(kube_node_status_allocatable{resource="memory", job="kube-state-metrics"}) by (cluster)) > 0

KubeCPUQuotaOvercommit

  • Description: Cluster {{ $labels.cluster }} has overcommitted CPU resource requests for Namespaces.

Expression

sum(min without(resource) (kube_resourcequota{job="kube-state-metrics", type="hard", resource=~"(cpu|requests.cpu)"})) by (cluster)
  /
sum(kube_node_status_allocatable{resource="cpu", job="kube-state-metrics"}) by (cluster)
  > 1.5

KubeMemoryQuotaOvercommit

  • Description: Cluster {{ $labels.cluster }} has overcommitted memory resource requests for Namespaces.

Expression

sum(min without(resource) (kube_resourcequota{job="kube-state-metrics", type="hard", resource=~"(memory|requests.memory)"})) by (cluster)
  /
sum(kube_node_status_allocatable{resource="memory", job="kube-state-metrics"}) by (cluster)
  > 1.5

KubeQuotaAlmostFull

  • Description: Namespace {{ $labels.namespace }} is using {{ $value | humanizePercentage }} of its {{ $labels.resource }} quota.

Expression

kube_resourcequota{job="kube-state-metrics", type="used"}
  / ignoring(instance, job, type)
(kube_resourcequota{job="kube-state-metrics", type="hard"} > 0)
  > 0.9 < 1

KubeQuotaFullyUsed

  • Description: Namespace {{ $labels.namespace }} is using {{ $value | humanizePercentage }} of its {{ $labels.resource }} quota.

Expression

kube_resourcequota{job="kube-state-metrics", type="used"}
  / ignoring(instance, job, type)
(kube_resourcequota{job="kube-state-metrics", type="hard"} > 0)
  == 1

KubeQuotaExceeded

  • Description: Namespace {{ $labels.namespace }} is using {{ $value | humanizePercentage }} of its {{ $labels.resource }} quota.

Expression

kube_resourcequota{job="kube-state-metrics", type="used"}
  / ignoring(instance, job, type)
(kube_resourcequota{job="kube-state-metrics", type="hard"} > 0)
  > 1

CPUThrottlingHigh

  • Description: {{ $value | humanizePercentage }} throttling of CPU in namespace {{ $labels.namespace }} for container {{ $labels.container }} in pod {{ $labels.pod }}.

Expression

sum(increase(container_cpu_cfs_throttled_periods_total{container!="", }[5m])) by (cluster, container, pod, namespace)
  /
sum(increase(container_cpu_cfs_periods_total{}[5m])) by (cluster, container, pod, namespace)
  > ( 25 / 100 )

Group kubernetes-storage

KubePersistentVolumeFillingUp

  • Description: The PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in Namespace {{ $labels.namespace }} on Cluster {{ $labels.cluster }} is only {{ $value | humanizePercentage }} free.

Expression

(
  kubelet_volume_stats_available_bytes{job="kubelet", namespace=~".*", metrics_path="/metrics"}
    /
  kubelet_volume_stats_capacity_bytes{job="kubelet", namespace=~".*", metrics_path="/metrics"}
) < 0.03
and
kubelet_volume_stats_used_bytes{job="kubelet", namespace=~".*", metrics_path="/metrics"} > 0
unless on (cluster, namespace, persistentvolumeclaim)
kube_persistentvolumeclaim_access_mode{ access_mode="ReadOnlyMany"} == 1
unless on (cluster, namespace, persistentvolumeclaim)
kube_persistentvolumeclaim_labels{label_excluded_from_alerts="true"} == 1

KubePersistentVolumeFillingUp

  • Description: Based on recent sampling, the PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in Namespace {{ $labels.namespace }} on Cluster {{ $labels.cluster }} is expected to fill up within four days. Currently {{ $value | humanizePercentage }} is available.

Expression

(
  kubelet_volume_stats_available_bytes{job="kubelet", namespace=~".*", metrics_path="/metrics"}
    /
  kubelet_volume_stats_capacity_bytes{job="kubelet", namespace=~".*", metrics_path="/metrics"}
) < 0.15
and
kubelet_volume_stats_used_bytes{job="kubelet", namespace=~".*", metrics_path="/metrics"} > 0
and
predict_linear(kubelet_volume_stats_available_bytes{job="kubelet", namespace=~".*", metrics_path="/metrics"}[6h], 4 * 24 * 3600) < 0
unless on (cluster, namespace, persistentvolumeclaim)
kube_persistentvolumeclaim_access_mode{ access_mode="ReadOnlyMany"} == 1
unless on (cluster, namespace, persistentvolumeclaim)
kube_persistentvolumeclaim_labels{label_excluded_from_alerts="true"} == 1

KubePersistentVolumeInodesFillingUp

  • Description: The PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in Namespace {{ $labels.namespace }} on Cluster {{ $labels.cluster }} only has {{ $value | humanizePercentage }} free inodes.

Expression

(
  kubelet_volume_stats_inodes_free{job="kubelet", namespace=~".*", metrics_path="/metrics"}
    /
  kubelet_volume_stats_inodes{job="kubelet", namespace=~".*", metrics_path="/metrics"}
) < 0.03
and
kubelet_volume_stats_inodes_used{job="kubelet", namespace=~".*", metrics_path="/metrics"} > 0
unless on (cluster, namespace, persistentvolumeclaim)
kube_persistentvolumeclaim_access_mode{ access_mode="ReadOnlyMany"} == 1
unless on (cluster, namespace, persistentvolumeclaim)
kube_persistentvolumeclaim_labels{label_excluded_from_alerts="true"} == 1

KubePersistentVolumeInodesFillingUp

  • Description: Based on recent sampling, the PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in Namespace {{ $labels.namespace }} on Cluster {{ $labels.cluster }} is expected to run out of inodes within four days. Currently {{ $value | humanizePercentage }} of its inodes are free.

Expression

(
  kubelet_volume_stats_inodes_free{job="kubelet", namespace=~".*", metrics_path="/metrics"}
    /
  kubelet_volume_stats_inodes{job="kubelet", namespace=~".*", metrics_path="/metrics"}
) < 0.15
and
kubelet_volume_stats_inodes_used{job="kubelet", namespace=~".*", metrics_path="/metrics"} > 0
and
predict_linear(kubelet_volume_stats_inodes_free{job="kubelet", namespace=~".*", metrics_path="/metrics"}[6h], 4 * 24 * 3600) < 0
unless on (cluster, namespace, persistentvolumeclaim)
kube_persistentvolumeclaim_access_mode{ access_mode="ReadOnlyMany"} == 1
unless on (cluster, namespace, persistentvolumeclaim)
kube_persistentvolumeclaim_labels{label_excluded_from_alerts="true"} == 1

KubePersistentVolumeErrors

  • Description: The persistent volume {{ $labels.persistentvolume }} on Cluster {{ $labels.cluster }} has status {{ $labels.phase }}.

Expression

kube_persistentvolume_status_phase{phase=~"Failed|Pending",job="kube-state-metrics"} > 0

Group kubernetes-system-apiserver

KubeClientCertificateExpiration

  • Description: A client certificate used to authenticate to kubernetes apiserver is expiring in less than 7.0 days.

Expression

apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and on (job) histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 604800

KubeClientCertificateExpiration

  • Description: A client certificate used to authenticate to kubernetes apiserver is expiring in less than 24.0 hours.

Expression

apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and on (job) histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 86400

KubeAggregatedAPIErrors

  • Description: Kubernetes aggregated API {{ $labels.name }}/{{ $labels.namespace }} has reported errors. It has appeared unavailable {{ $value | humanize }} times averaged over the past 10m.

Expression

sum by (name, namespace, cluster)(increase(aggregator_unavailable_apiservice_total{job="apiserver"}[10m])) > 4

KubeAggregatedAPIDown

  • Description: Kubernetes aggregated API {{ $labels.name }}/{{ $labels.namespace }} has been only {{ $value | humanize }}% available over the last 10m.

Expression

(1 - max by (name, namespace, cluster)(avg_over_time(aggregator_unavailable_apiservice{job="apiserver"}[10m]))) * 100 < 85

KubeAPIDown

  • Description: KubeAPI has disappeared from Prometheus target discovery.

Expression

absent(up{job="apiserver"} == 1)

KubeAPITerminatedRequests

  • Description: The kubernetes apiserver has terminated {{ $value | humanizePercentage }} of its incoming requests.

Expression

sum(rate(apiserver_request_terminations_total{job="apiserver"}[10m]))  / (  sum(rate(apiserver_request_total{job="apiserver"}[10m])) + sum(rate(apiserver_request_terminations_total{job="apiserver"}[10m])) ) > 0.20

Group kubernetes-system-controller-manager

KubeControllerManagerDown

  • Description: KubeControllerManager has disappeared from Prometheus target discovery.

Expression

absent(up{job="kube-controller-manager"} == 1)

Group kubernetes-system-kubelet

KubeNodeNotReady

  • Description: {{ $labels.node }} has been unready for more than 15 minutes.

Expression

kube_node_status_condition{job="kube-state-metrics",condition="Ready",status="true"} == 0

KubeNodeUnreachable

  • Description: {{ $labels.node }} is unreachable and some workloads may be rescheduled.

Expression

(kube_node_spec_taint{job="kube-state-metrics",key="node.kubernetes.io/unreachable",effect="NoSchedule"} unless ignoring(key,value) kube_node_spec_taint{job="kube-state-metrics",key=~"ToBeDeletedByClusterAutoscaler|cloud.google.com/impending-node-termination|aws-node-termination-handler/spot-itn"}) == 1

KubeletTooManyPods

  • Description: Kubelet '{{ $labels.node }}' is running at {{ $value | humanizePercentage }} of its Pod capacity.

Expression

count by (cluster, node) (
  (kube_pod_status_phase{job="kube-state-metrics",phase="Running"} == 1) * on (instance,pod,namespace,cluster) group_left(node) topk by (instance,pod,namespace,cluster) (1, kube_pod_info{job="kube-state-metrics"})
)
/
max by (cluster, node) (
  kube_node_status_capacity{job="kube-state-metrics",resource="pods"} != 1
) > 0.95

KubeNodeReadinessFlapping

  • Description: The readiness status of node {{ $labels.node }} has changed {{ $value }} times in the last 15 minutes.

Expression

sum(changes(kube_node_status_condition{job="kube-state-metrics",status="true",condition="Ready"}[15m])) by (cluster, node) > 2

KubeletPlegDurationHigh

  • Description: The Kubelet Pod Lifecycle Event Generator has a 99th percentile duration of {{ $value }} seconds on node {{ $labels.node }}.

Expression

node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile{quantile="0.99"} >= 10

KubeletPodStartUpLatencyHigh

  • Description: Kubelet Pod startup 99th percentile latency is {{ $value }} seconds on node {{ $labels.node }}.

Expression

histogram_quantile(0.99, sum(rate(kubelet_pod_worker_duration_seconds_bucket{job="kubelet", metrics_path="/metrics"}[5m])) by (cluster, instance, le)) * on (cluster, instance) group_left(node) kubelet_node_name{job="kubelet", metrics_path="/metrics"} > 60

KubeletClientCertificateExpiration

  • Description: Client certificate for Kubelet on node {{ $labels.node }} expires in {{ $value | humanizeDuration }}.

Expression

kubelet_certificate_manager_client_ttl_seconds < 604800

KubeletClientCertificateExpiration

  • Description: Client certificate for Kubelet on node {{ $labels.node }} expires in {{ $value | humanizeDuration }}.

Expression

kubelet_certificate_manager_client_ttl_seconds < 86400

KubeletServerCertificateExpiration

  • Description: Server certificate for Kubelet on node {{ $labels.node }} expires in {{ $value | humanizeDuration }}.

Expression

kubelet_certificate_manager_server_ttl_seconds < 604800

KubeletServerCertificateExpiration

  • Description: Server certificate for Kubelet on node {{ $labels.node }} expires in {{ $value | humanizeDuration }}.

Expression

kubelet_certificate_manager_server_ttl_seconds < 86400

KubeletClientCertificateRenewalErrors

  • Description: Kubelet on node {{ $labels.node }} has failed to renew its client certificate ({{ $value | humanize }} errors in the last 5 minutes).

Expression

increase(kubelet_certificate_manager_client_expiration_renew_errors[5m]) > 0

KubeletServerCertificateRenewalErrors

  • Description: Kubelet on node {{ $labels.node }} has failed to renew its server certificate ({{ $value | humanize }} errors in the last 5 minutes).

Expression

increase(kubelet_server_expiration_renew_errors[5m]) > 0

KubeletDown

  • Description: Kubelet has disappeared from Prometheus target discovery.

Expression

absent(up{job="kubelet", metrics_path="/metrics"} == 1)

Group kubernetes-system-scheduler

KubeSchedulerDown

  • Description: KubeScheduler has disappeared from Prometheus target discovery.

Expression

absent(up{job="kube-scheduler"} == 1)

Group kubernetes-system

KubeVersionMismatch

  • Description: There are {{ $value }} different semantic versions of Kubernetes components running.

Expression

count by (cluster) (count by (git_version, cluster) (label_replace(kubernetes_build_info{job!~"kube-dns|coredns"},"git_version","$1","git_version","(v[0-9]*.[0-9]*).*"))) > 1

KubeClientErrors

  • Description: Kubernetes API server client '{{ $labels.job }}/{{ $labels.instance }}' is experiencing {{ $value | humanizePercentage }} errors.

Expression

(sum(rate(rest_client_requests_total{job="apiserver",code=~"5.."}[5m])) by (cluster, instance, job, namespace)
  /
sum(rate(rest_client_requests_total{job="apiserver"}[5m])) by (cluster, instance, job, namespace))
> 0.01

Group node-exporter

NodeFilesystemSpaceFillingUp

  • Description: Filesystem on {{ $labels.device }}, mounted on {{ $labels.mountpoint }}, at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available space left and is filling up.

Expression

(
  node_filesystem_avail_bytes{job="node-exporter",fstype!="",mountpoint!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!="",mountpoint!=""} * 100 < 15
and
  predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!="",mountpoint!=""}[6h], 24*60*60) < 0
and
  node_filesystem_readonly{job="node-exporter",fstype!="",mountpoint!=""} == 0
)

NodeFilesystemSpaceFillingUp

  • Description: Filesystem on {{ $labels.device }}, mounted on {{ $labels.mountpoint }}, at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available space left and is filling up fast.

Expression

(
  node_filesystem_avail_bytes{job="node-exporter",fstype!="",mountpoint!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!="",mountpoint!=""} * 100 < 10
and
  predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!="",mountpoint!=""}[6h], 4*60*60) < 0
and
  node_filesystem_readonly{job="node-exporter",fstype!="",mountpoint!=""} == 0
)

NodeFilesystemAlmostOutOfSpace

  • Description: Filesystem on {{ $labels.device }}, mounted on {{ $labels.mountpoint }}, at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available space left.

Expression

(
  node_filesystem_avail_bytes{job="node-exporter",fstype!="",mountpoint!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!="",mountpoint!=""} * 100 < 5
and
  node_filesystem_readonly{job="node-exporter",fstype!="",mountpoint!=""} == 0
)

NodeFilesystemAlmostOutOfSpace

  • Description: Filesystem on {{ $labels.device }}, mounted on {{ $labels.mountpoint }}, at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available space left.

Expression

(
  node_filesystem_avail_bytes{job="node-exporter",fstype!="",mountpoint!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!="",mountpoint!=""} * 100 < 3
and
  node_filesystem_readonly{job="node-exporter",fstype!="",mountpoint!=""} == 0
)

NodeFilesystemFilesFillingUp

  • Description: Filesystem on {{ $labels.device }}, mounted on {{ $labels.mountpoint }}, at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available inodes left and is filling up.

Expression

(
  node_filesystem_files_free{job="node-exporter",fstype!="",mountpoint!=""} / node_filesystem_files{job="node-exporter",fstype!="",mountpoint!=""} * 100 < 40
and
  predict_linear(node_filesystem_files_free{job="node-exporter",fstype!="",mountpoint!=""}[6h], 24*60*60) < 0
and
  node_filesystem_readonly{job="node-exporter",fstype!="",mountpoint!=""} == 0
)

NodeFilesystemFilesFillingUp

  • Description: Filesystem on {{ $labels.device }}, mounted on {{ $labels.mountpoint }}, at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available inodes left and is filling up fast.

Expression

(
  node_filesystem_files_free{job="node-exporter",fstype!="",mountpoint!=""} / node_filesystem_files{job="node-exporter",fstype!="",mountpoint!=""} * 100 < 20
and
  predict_linear(node_filesystem_files_free{job="node-exporter",fstype!="",mountpoint!=""}[6h], 4*60*60) < 0
and
  node_filesystem_readonly{job="node-exporter",fstype!="",mountpoint!=""} == 0
)

NodeFilesystemAlmostOutOfFiles

  • Description: Filesystem on {{ $labels.device }}, mounted on {{ $labels.mountpoint }}, at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available inodes left.

Expression

(
  node_filesystem_files_free{job="node-exporter",fstype!="",mountpoint!=""} / node_filesystem_files{job="node-exporter",fstype!="",mountpoint!=""} * 100 < 5
and
  node_filesystem_readonly{job="node-exporter",fstype!="",mountpoint!=""} == 0
)

NodeFilesystemAlmostOutOfFiles

  • Description: Filesystem on {{ $labels.device }}, mounted on {{ $labels.mountpoint }}, at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available inodes left.

Expression

(
  node_filesystem_files_free{job="node-exporter",fstype!="",mountpoint!=""} / node_filesystem_files{job="node-exporter",fstype!="",mountpoint!=""} * 100 < 3
and
  node_filesystem_readonly{job="node-exporter",fstype!="",mountpoint!=""} == 0
)

NodeNetworkReceiveErrs

  • Description: {{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf "%.0f" $value }} receive errors in the last two minutes.

Expression

rate(node_network_receive_errs_total{job="node-exporter"}[2m]) / rate(node_network_receive_packets_total{job="node-exporter"}[2m]) > 0.01

NodeNetworkTransmitErrs

  • Description: {{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf "%.0f" $value }} transmit errors in the last two minutes.

Expression

rate(node_network_transmit_errs_total{job="node-exporter"}[2m]) / rate(node_network_transmit_packets_total{job="node-exporter"}[2m]) > 0.01

NodeHighNumberConntrackEntriesUsed

  • Description: {{ $value | humanizePercentage }} of conntrack entries are used.

Expression

(node_nf_conntrack_entries{job="node-exporter"} / node_nf_conntrack_entries_limit) > 0.75

NodeTextFileCollectorScrapeError

  • Description: Node Exporter text file collector on {{ $labels.instance }} failed to scrape.

Expression

node_textfile_scrape_error{job="node-exporter"} == 1

NodeClockSkewDetected

  • Description: Clock at {{ $labels.instance }} is out of sync by more than 0.05s. Ensure NTP is configured correctly on this host.

Expression

(
  node_timex_offset_seconds{job="node-exporter"} > 0.05
and
  deriv(node_timex_offset_seconds{job="node-exporter"}[5m]) >= 0
)
or
(
  node_timex_offset_seconds{job="node-exporter"} < -0.05
and
  deriv(node_timex_offset_seconds{job="node-exporter"}[5m]) <= 0
)

NodeClockNotSynchronising

  • Description: Clock at {{ $labels.instance }} is not synchronising. Ensure NTP is configured on this host.

Expression

min_over_time(node_timex_sync_status{job="node-exporter"}[5m]) == 0
and
node_timex_maxerror_seconds{job="node-exporter"} >= 16

NodeRAIDDegraded

  • Description: RAID array '{{ $labels.device }}' at {{ $labels.instance }} is in degraded state due to one or more disk failures. Number of spare drives is insufficient to fix the issue automatically.

Expression

node_md_disks_required{job="node-exporter",device=~"(/dev/)?(mmcblk.p.+|nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+|md.+|dasd.+)"} - ignoring (state) (node_md_disks{state="active",job="node-exporter",device=~"(/dev/)?(mmcblk.p.+|nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+|md.+|dasd.+)"}) > 0

NodeRAIDDiskFailure

  • Description: At least one device in RAID array at {{ $labels.instance }} failed. Array '{{ $labels.device }}' needs attention and possibly a disk swap.

Expression

node_md_disks{state="failed",job="node-exporter",device=~"(/dev/)?(mmcblk.p.+|nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+|md.+|dasd.+)"} > 0

NodeFileDescriptorLimit

  • Description: File descriptors limit at {{ $labels.instance }} is currently at {{ printf "%.2f" $value }}%.

Expression

(
  node_filefd_allocated{job="node-exporter"} * 100 / node_filefd_maximum{job="node-exporter"} > 70
)

NodeFileDescriptorLimit

  • Description: File descriptors limit at {{ $labels.instance }} is currently at {{ printf "%.2f" $value }}%.

Expression

(
  node_filefd_allocated{job="node-exporter"} * 100 / node_filefd_maximum{job="node-exporter"} > 90
)

NodeCPUHighUsage

  • Description: CPU usage at {{ $labels.instance }} has been above 90% for the last 15 minutes and is currently at {{ printf "%.2f" $value }}%.

Expression

sum without(mode) (avg without (cpu) (rate(node_cpu_seconds_total{job="node-exporter", mode!="idle"}[2m]))) * 100 > 90

NodeSystemSaturation

  • Description: System load per core at {{ $labels.instance }} has been above 2 for the last 15 minutes and is currently at {{ printf "%.2f" $value }}. This might indicate that the instance's resources are saturated, which can cause it to become unresponsive.

Expression

node_load1{job="node-exporter"}
/ count without (cpu, mode) (node_cpu_seconds_total{job="node-exporter", mode="idle"}) > 2

NodeMemoryMajorPagesFaults

  • Description: Memory major page faults are occurring at a very high rate at {{ $labels.instance }}: 500 major page faults per second for the last 15 minutes, currently at {{ printf "%.2f" $value }}. Please check that there is enough memory available on this instance.

Expression

rate(node_vmstat_pgmajfault{job="node-exporter"}[5m]) > 500

NodeMemoryHighUtilization

  • Description: Memory is filling up at {{ $labels.instance }}, has been above 90% for the last 15 minutes, and is currently at {{ printf "%.2f" $value }}%.

Expression

100 - (node_memory_MemAvailable_bytes{job="node-exporter"} / node_memory_MemTotal_bytes{job="node-exporter"} * 100) > 90

NodeDiskIOSaturation

  • Description: Disk IO queue (aqu-sq) is high on {{ $labels.device }} at {{ $labels.instance }}, has been above 10 for the last 15 minutes, and is currently at {{ printf "%.2f" $value }}. This symptom might indicate disk saturation.

Expression

rate(node_disk_io_time_weighted_seconds_total{job="node-exporter", device=~"(/dev/)?(mmcblk.p.+|nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+|md.+|dasd.+)"}[5m]) > 10

NodeSystemdServiceFailed

  • Description: Systemd service {{ $labels.name }} has entered failed state at {{ $labels.instance }}.

Expression

node_systemd_unit_state{job="node-exporter", state="failed"} == 1

NodeBondingDegraded

  • Description: Bonding interface {{ $labels.master }} on {{ $labels.instance }} is in degraded state due to one or more slave failures.

Expression

(node_bonding_slaves - node_bonding_active) != 0

Group node-network

NodeNetworkInterfaceFlapping

  • Description: Network interface "{{ $labels.device }}" changing its up status often on node-exporter {{ $labels.namespace }}/{{ $labels.pod }}.

Expression

changes(node_network_up{job="node-exporter",device!~"veth.+"}[2m]) > 2

Group prometheus-operator

PrometheusOperatorListErrors

  • Description: Errors while performing List operations in controller {{ $labels.controller }} in {{ $labels.namespace }} namespace.

Expression

(sum by (cluster,controller,namespace) (rate(prometheus_operator_list_operations_failed_total{job="shturval-metrics-operator",namespace="monitoring"}[10m])) / sum by (cluster,controller,namespace) (rate(prometheus_operator_list_operations_total{job="shturval-metrics-operator",namespace="monitoring"}[10m]))) > 0.4

PrometheusOperatorWatchErrors

  • Description: Errors while performing watch operations in controller {{ $labels.controller }} in {{ $labels.namespace }} namespace.

Expression

(sum by (cluster,controller,namespace) (rate(prometheus_operator_watch_operations_failed_total{job="shturval-metrics-operator",namespace="monitoring"}[5m])) / sum by (cluster,controller,namespace) (rate(prometheus_operator_watch_operations_total{job="shturval-metrics-operator",namespace="monitoring"}[5m]))) > 0.4

PrometheusOperatorSyncFailed

  • Description: Controller {{ $labels.controller }} in {{ $labels.namespace }} namespace fails to reconcile {{ $value }} objects.

Expression

min_over_time(prometheus_operator_syncs{status="failed",job="shturval-metrics-operator",namespace="monitoring"}[5m]) > 0

PrometheusOperatorReconcileErrors

  • Description: {{ $value | humanizePercentage }} of reconciling operations failed for {{ $labels.controller }} controller in {{ $labels.namespace }} namespace.

Expression

(sum by (cluster,controller,namespace) (rate(prometheus_operator_reconcile_errors_total{job="shturval-metrics-operator",namespace="monitoring"}[5m]))) / (sum by (cluster,controller,namespace) (rate(prometheus_operator_reconcile_operations_total{job="shturval-metrics-operator",namespace="monitoring"}[5m]))) > 0.1

PrometheusOperatorStatusUpdateErrors

  • Description: {{ $value | humanizePercentage }} of status update operations failed for {{ $labels.controller }} controller in {{ $labels.namespace }} namespace.

Expression

(sum by (cluster,controller,namespace) (rate(prometheus_operator_status_update_errors_total{job="shturval-metrics-operator",namespace="monitoring"}[5m]))) / (sum by (cluster,controller,namespace) (rate(prometheus_operator_status_update_operations_total{job="shturval-metrics-operator",namespace="monitoring"}[5m]))) > 0.1

PrometheusOperatorNodeLookupErrors

  • Description: Errors while reconciling Prometheus in {{ $labels.namespace }} Namespace.

Expression

rate(prometheus_operator_node_address_lookup_errors_total{job="shturval-metrics-operator",namespace="monitoring"}[5m]) > 0.1

PrometheusOperatorNotReady

  • Description: Prometheus operator in {{ $labels.namespace }} namespace isn't ready to reconcile {{ $labels.controller }} resources.

Expression

min by (cluster,controller,namespace) (max_over_time(prometheus_operator_ready{job="shturval-metrics-operator",namespace="monitoring"}[5m]) == 0)

PrometheusOperatorRejectedResources

  • Description: Prometheus operator in {{ $labels.namespace }} namespace rejected {{ printf "%0.0f" $value }} {{ $labels.controller }}/{{ $labels.resource }} resources.

Expression

min_over_time(prometheus_operator_managed_resources{state="rejected",job="shturval-metrics-operator",namespace="monitoring"}[5m]) > 0

Group prometheus

PrometheusBadConfig

  • Description: Prometheus {{ $labels.namespace }}/{{ $labels.pod }} has failed to reload its configuration.

Expression

# Without max_over_time, failed scrapes could create false negatives, see
# https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details.
max_over_time(prometheus_config_last_reload_successful{job="shturval-metrics-prometheus",namespace="monitoring"}[5m]) == 0

PrometheusSDRefreshFailure

  • Description: Prometheus {{ $labels.namespace }}/{{ $labels.pod }} has failed to refresh SD with mechanism {{ $labels.mechanism }}.

Expression

increase(prometheus_sd_refresh_failures_total{job="shturval-metrics-prometheus",namespace="monitoring"}[10m]) > 0

PrometheusNotificationQueueRunningFull

  • Description: Alert notification queue of Prometheus {{ $labels.namespace }}/{{ $labels.pod }} is running full.

Expression

# Without min_over_time, failed scrapes could create false negatives, see
# https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details.
(
  predict_linear(prometheus_notifications_queue_length{job="shturval-metrics-prometheus",namespace="monitoring"}[5m], 60 * 30)
>
  min_over_time(prometheus_notifications_queue_capacity{job="shturval-metrics-prometheus",namespace="monitoring"}[5m])
)

PrometheusErrorSendingAlertsToSomeAlertmanagers

  • Description: {{ printf "%.1f" $value }}% errors while sending alerts from Prometheus {{ $labels.namespace }}/{{ $labels.pod }} to Alertmanager {{ $labels.alertmanager }}.

Expression

(
  rate(prometheus_notifications_errors_total{job="shturval-metrics-prometheus",namespace="monitoring"}[5m])
/
  rate(prometheus_notifications_sent_total{job="shturval-metrics-prometheus",namespace="monitoring"}[5m])
)
* 100
> 1

PrometheusNotConnectedToAlertmanagers

  • Description: Prometheus {{ $labels.namespace }}/{{ $labels.pod }} is not connected to any Alertmanagers.

Expression

# Without max_over_time, failed scrapes could create false negatives, see
# https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details.
max_over_time(prometheus_notifications_alertmanagers_discovered{job="shturval-metrics-prometheus",namespace="monitoring"}[5m]) < 1

PrometheusTSDBReloadsFailing

  • Description: Prometheus {{ $labels.namespace }}/{{ $labels.pod }} has detected {{ $value | humanize }} reload failures over the last 3h.

Expression

increase(prometheus_tsdb_reloads_failures_total{job="shturval-metrics-prometheus",namespace="monitoring"}[3h]) > 0

PrometheusTSDBCompactionsFailing

  • Description: Prometheus {{ $labels.namespace }}/{{ $labels.pod }} has detected {{ $value | humanize }} compaction failures over the last 3h.

Expression

increase(prometheus_tsdb_compactions_failed_total{job="shturval-metrics-prometheus",namespace="monitoring"}[3h]) > 0

PrometheusNotIngestingSamples

  • Description: Prometheus {{ $labels.namespace }}/{{ $labels.pod }} is not ingesting samples.

Expression

(
  rate(prometheus_tsdb_head_samples_appended_total{job="shturval-metrics-prometheus",namespace="monitoring"}[5m]) <= 0
and
  (
    sum without(scrape_job) (prometheus_target_metadata_cache_entries{job="shturval-metrics-prometheus",namespace="monitoring"}) > 0
  or
    sum without(rule_group) (prometheus_rule_group_rules{job="shturval-metrics-prometheus",namespace="monitoring"}) > 0
  )
)

PrometheusDuplicateTimestamps

  • Description: Prometheus {{ $labels.namespace }}/{{ $labels.pod }} is dropping {{ printf "%.4g" $value }} samples/s with different values but duplicated timestamp.

Expression

rate(prometheus_target_scrapes_sample_duplicate_timestamp_total{job="shturval-metrics-prometheus",namespace="monitoring"}[5m]) > 0

PrometheusOutOfOrderTimestamps

  • Description: Prometheus {{ $labels.namespace }}/{{ $labels.pod }} is dropping {{ printf "%.4g" $value }} samples/s with timestamps arriving out of order.

Expression

rate(prometheus_target_scrapes_sample_out_of_order_total{job="shturval-metrics-prometheus",namespace="monitoring"}[5m]) > 0

PrometheusRemoteStorageFailures

  • Description: Prometheus {{ $labels.namespace }}/{{ $labels.pod }} failed to send {{ printf "%.1f" $value }}% of the samples to {{ $labels.remote_name }}:{{ $labels.url }}.

Expression

(
  (rate(prometheus_remote_storage_failed_samples_total{job="shturval-metrics-prometheus",namespace="monitoring"}[5m]) or rate(prometheus_remote_storage_samples_failed_total{job="shturval-metrics-prometheus",namespace="monitoring"}[5m]))
/
  (
    (rate(prometheus_remote_storage_failed_samples_total{job="shturval-metrics-prometheus",namespace="monitoring"}[5m]) or rate(prometheus_remote_storage_samples_failed_total{job="shturval-metrics-prometheus",namespace="monitoring"}[5m]))
  +
    (rate(prometheus_remote_storage_succeeded_samples_total{job="shturval-metrics-prometheus",namespace="monitoring"}[5m]) or rate(prometheus_remote_storage_samples_total{job="shturval-metrics-prometheus",namespace="monitoring"}[5m]))
  )
)
* 100
> 1

PrometheusRemoteWriteBehind

  • Description: Prometheus {{ $labels.namespace }}/{{ $labels.pod }} remote write is {{ printf "%.1f" $value }}s behind for {{ $labels.remote_name }}:{{ $labels.url }}.

Expression

# Without max_over_time, failed scrapes could create false negatives, see
# https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details.
(
  max_over_time(prometheus_remote_storage_highest_timestamp_in_seconds{job="shturval-metrics-prometheus",namespace="monitoring"}[5m])
- ignoring(remote_name, url) group_right
  max_over_time(prometheus_remote_storage_queue_highest_sent_timestamp_seconds{job="shturval-metrics-prometheus",namespace="monitoring"}[5m])
)
> 120

PrometheusRemoteWriteDesiredShards

  • Description: Prometheus {{ $labels.namespace }}/{{ $labels.pod }} remote write desired shards calculation wants to run {{ $value }} shards for queue {{ $labels.remote_name }}:{{ $labels.url }}, which is more than the max of {{ printf `prometheus_remote_storage_shards_max{instance="%s",job="shturval-metrics-prometheus",namespace="monitoring"}` $labels.instance | query | first | value }}.

Expression

# Without max_over_time, failed scrapes could create false negatives, see
# https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details.
(
  max_over_time(prometheus_remote_storage_shards_desired{job="shturval-metrics-prometheus",namespace="monitoring"}[5m])
>
  max_over_time(prometheus_remote_storage_shards_max{job="shturval-metrics-prometheus",namespace="monitoring"}[5m])
)

PrometheusRuleFailures

  • Description: Prometheus {{ $labels.namespace }}/{{ $labels.pod }} has failed to evaluate {{ printf "%.0f" $value }} rules in the last 5m.

Expression

increase(prometheus_rule_evaluation_failures_total{job="shturval-metrics-prometheus",namespace="monitoring"}[5m]) > 0

PrometheusMissingRuleEvaluations

  • Description: Prometheus {{ $labels.namespace }}/{{ $labels.pod }} has missed {{ printf "%.0f" $value }} rule group evaluations in the last 5m.

Expression

increase(prometheus_rule_group_iterations_missed_total{job="shturval-metrics-prometheus",namespace="monitoring"}[5m]) > 0

PrometheusTargetLimitHit

  • Description: Prometheus {{ $labels.namespace }}/{{ $labels.pod }} has dropped {{ printf "%.0f" $value }} targets because the number of targets exceeded the configured target_limit.

Expression

increase(prometheus_target_scrape_pool_exceeded_target_limit_total{job="shturval-metrics-prometheus",namespace="monitoring"}[5m]) > 0

PrometheusLabelLimitHit

  • Description: Prometheus {{ $labels.namespace }}/{{ $labels.pod }} has dropped {{ printf "%.0f" $value }} targets because some samples exceeded the configured label_limit, label_name_length_limit or label_value_length_limit.

Expression

increase(prometheus_target_scrape_pool_exceeded_label_limits_total{job="shturval-metrics-prometheus",namespace="monitoring"}[5m]) > 0

PrometheusScrapeBodySizeLimitHit

  • Description: Prometheus {{ $labels.namespace }}/{{ $labels.pod }} has failed {{ printf "%.0f" $value }} scrapes in the last 5m because some targets exceeded the configured body_size_limit.

Expression

increase(prometheus_target_scrapes_exceeded_body_size_limit_total{job="shturval-metrics-prometheus",namespace="monitoring"}[5m]) > 0

PrometheusScrapeSampleLimitHit

  • Description: Prometheus {{ $labels.namespace }}/{{ $labels.pod }} has failed {{ printf "%.0f" $value }} scrapes in the last 5m because some targets exceeded the configured sample_limit.

Expression

increase(prometheus_target_scrapes_exceeded_sample_limit_total{job="shturval-metrics-prometheus",namespace="monitoring"}[5m]) > 0

PrometheusTargetSyncFailure

  • Description: {{ printf "%.0f" $value }} targets in Prometheus {{ $labels.namespace }}/{{ $labels.pod }} have failed to sync because invalid configuration was supplied.

Expression

increase(prometheus_target_sync_failed_total{job="shturval-metrics-prometheus",namespace="monitoring"}[30m]) > 0

PrometheusHighQueryLoad

  • Description: Prometheus {{ $labels.namespace }}/{{ $labels.pod }} query API has less than 20% available capacity in its query engine for the last 15 minutes.

Expression

avg_over_time(prometheus_engine_queries{job="shturval-metrics-prometheus",namespace="monitoring"}[5m]) / max_over_time(prometheus_engine_queries_concurrent_max{job="shturval-metrics-prometheus",namespace="monitoring"}[5m]) > 0.8

PrometheusErrorSendingAlertsToAnyAlertmanager

  • Description: {{ printf "%.1f" $value }}% minimum errors while sending alerts from Prometheus {{ $labels.namespace }}/{{ $labels.pod }} to any Alertmanager.

Expression

min without (alertmanager) (
  rate(prometheus_notifications_errors_total{job="shturval-metrics-prometheus",namespace="monitoring",alertmanager!~``}[5m])
/
  rate(prometheus_notifications_sent_total{job="shturval-metrics-prometheus",namespace="monitoring",alertmanager!~``}[5m])
)
* 100
> 3

Group x509-certificate-exporter.rules

X509ExporterReadErrors

  • Description: Over the last 15 minutes, this x509-certificate-exporter instance has experienced errors reading certificate files or querying the Kubernetes API. This could be caused by a misconfiguration if triggered when the exporter starts.

Expression

delta(x509_read_errors[15m]) > 0

CertificateError

  • Description: Certificate could not be decoded {{ if $labels.secret_name }}in Kubernetes secret "{{ $labels.secret_namespace }}/{{ $labels.secret_name }}"{{ else }}at location "{{ $labels.filepath }}"{{ end }}

Expression

x509_cert_error > 0

CertificateRenewal

  • Description: Certificate for "{{ $labels.subject_CN }}" should be renewed {{ if $labels.secret_name }}in Kubernetes secret "{{ $labels.secret_namespace }}/{{ $labels.secret_name }}"{{ else }}at location "{{ $labels.filepath }}"{{ end }}

Expression

(x509_cert_not_after - time()) < (28 * 86400)

CertificateExpiration

  • Description: Certificate for "{{ $labels.subject_CN }}" is about to expire after {{ humanizeDuration $value }} {{ if $labels.secret_name }}in Kubernetes secret "{{ $labels.secret_namespace }}/{{ $labels.secret_name }}"{{ else }}at location "{{ $labels.filepath }}"{{ end }}

Expression

(x509_cert_not_after - time()) < (14 * 86400)
