Viewing alerts

The system ships with predefined rules and also allows custom rules to be configured. The "Viewing alerts" page shows the results of rule evaluation (alerts). The results are split into tabs according to the firing status:

  • Firing. The alert is firing and indicates a current violation of the configured condition or event.
  • Pending. The alert is not firing at the moment, but it may fire in the future, for example because of temporary changes.
  • Inactive. The alert has not fired for a certain period of time or has been disabled by an administrator.

On each tab, alerts are grouped by the configured expression. Each group shows the expression, a description of the alert firing conditions, a counter of events in the group, and the alerts themselves.

Each alert includes:

  • Identifier;
  • Name;
  • Date and time of firing;
  • Description;
  • Expression;
  • Severity.
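
The fields above map onto standard Prometheus alerting-rule definitions. As an illustration only, below is a minimal sketch of what a custom rule might look like if it is supplied as a Prometheus Operator PrometheusRule resource; the resource name, group name, alert name, metric, threshold and labels are assumptions for the example and are not defined by the platform.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: custom-alert-rules            # illustrative name
  namespace: monitoring
spec:
  groups:
    - name: custom.rules              # group shown on the "Viewing alerts" page
      rules:
        - alert: HighPodRestartRate   # alert name
          expr: increase(kube_pod_container_status_restarts_total{job="kube-state-metrics"}[15m]) > 5   # expression
          for: 15m                    # how long the condition must hold before the alert becomes Firing
          labels:
            severity: warning         # severity
          annotations:
            description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has restarted more than 5 times in the last 15 minutes."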

Predefined rules

This section lists all of the predefined alerting rules in a client cluster of the platform.

Group shturval-backup

VeleroBackupPartialFailures

  • Description: Velero backup {{ $labels.schedule }} has {{ $value | humanizePercentage }} partially failed backups.

Expression

velero_backup_partial_failure_total{schedule!=""} / velero_backup_attempt_total{schedule!=""} > 0.25

VeleroBackupFailures

  • Description: Velero backup {{ $labels.schedule }} has {{ $value | humanizePercentage }} failed backups.

Expression

velero_backup_failure_total{schedule!=""} / velero_backup_attempt_total{schedule!=""} > 0.25

Group alertmanager.rules

AlertmanagerFailedReload

  • Description: Configuration has failed to load for {{ $labels.namespace }}/{{ $labels.pod }}.

Expression

# Without max_over_time, failed scrapes could create false negatives, see
# https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details.
max_over_time(alertmanager_config_last_reload_successful{job="shturval-metrics-alertmanager",namespace="monitoring"}[5m]) == 0

AlertmanagerMembersInconsistent

  • Description: Alertmanager {{ $labels.namespace }}/{{ $labels.pod }} has only found {{ $value }} members of the {{ $labels.job }} cluster.

Expression

# Without max_over_time, failed scrapes could create false negatives, see
# https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details.
  max_over_time(alertmanager_cluster_members{job="shturval-metrics-alertmanager",namespace="monitoring"}[5m])
< on (namespace,service,cluster) group_left
  count by (namespace,service,cluster) (max_over_time(alertmanager_cluster_members{job="shturval-metrics-alertmanager",namespace="monitoring"}[5m]))

AlertmanagerFailedToSendAlerts

  • Description: Alertmanager {{ $labels.namespace }}/{{ $labels.pod }} failed to send {{ $value | humanizePercentage }} of notifications to {{ $labels.integration }}.

Expression

(
  rate(alertmanager_notifications_failed_total{job="shturval-metrics-alertmanager",namespace="monitoring"}[5m])
/
  ignoring (reason) group_left rate(alertmanager_notifications_total{job="shturval-metrics-alertmanager",namespace="monitoring"}[5m])
)
> 0.01

AlertmanagerClusterFailedToSendAlerts

  • Description: The minimum notification failure rate to {{ $labels.integration }} sent from any instance in the {{ $labels.job }} cluster is {{ $value | humanizePercentage }}.

Expression

min by (namespace,service, integration) (
  rate(alertmanager_notifications_failed_total{job="shturval-metrics-alertmanager",namespace="monitoring", integration=~`.*`}[5m])
/
  ignoring (reason) group_left rate(alertmanager_notifications_total{job="shturval-metrics-alertmanager",namespace="monitoring", integration=~`.*`}[5m])
)
> 0.01

AlertmanagerClusterFailedToSendAlerts

  • Description: The minimum notification failure rate to {{ $labels.integration }} sent from any instance in the {{ $labels.job }} cluster is {{ $value | humanizePercentage }}.

Expression

min by (namespace,service, integration) (
  rate(alertmanager_notifications_failed_total{job="shturval-metrics-alertmanager",namespace="monitoring", integration!~`.*`}[5m])
/
  ignoring (reason) group_left rate(alertmanager_notifications_total{job="shturval-metrics-alertmanager",namespace="monitoring", integration!~`.*`}[5m])
)
> 0.01

AlertmanagerConfigInconsistent

  • Description: Alertmanager instances within the {{ $labels.job }} cluster have different configurations.

Expression

count by (namespace,service,cluster) (
  count_values by (namespace,service,cluster) ("config_hash", alertmanager_config_hash{job="shturval-metrics-alertmanager",namespace="monitoring"})
)
!= 1

AlertmanagerClusterDown

  • Description: {{ $value | humanizePercentage }} of Alertmanager instances within the {{ $labels.job }} cluster have been up for less than half of the last 5m.

Expression

(
  count by (namespace,service,cluster) (
    avg_over_time(up{job="shturval-metrics-alertmanager",namespace="monitoring"}[5m]) < 0.5
  )
/
  count by (namespace,service,cluster) (
    up{job="shturval-metrics-alertmanager",namespace="monitoring"}
  )
)
>= 0.5

AlertmanagerClusterCrashlooping

  • Description: {{ $value | humanizePercentage }} of Alertmanager instances within the {{ $labels.job }} cluster have restarted at least 5 times in the last 10m.

Expression

(
  count by (namespace,service,cluster) (
    changes(process_start_time_seconds{job="shturval-metrics-alertmanager",namespace="monitoring"}[10m]) > 4
  )
/
  count by (namespace,service,cluster) (
    up{job="shturval-metrics-alertmanager",namespace="monitoring"}
  )
)
>= 0.5

Group config-reloaders

ConfigReloaderSidecarErrors

  • Description: Errors encountered while the {{ $labels.pod }} config-reloader sidecar attempts to sync config in {{ $labels.namespace }} namespace. As a result, configuration for service running in {{ $labels.pod }} may be stale and cannot be updated anymore.

Expression

max_over_time(reloader_last_reload_successful{namespace=~".+"}[5m]) == 0

Group etcd

etcdMembersDown

  • Description: etcd cluster "{{ $labels.job }}": members are down ({{ $value }}).

Expression

max without (endpoint) (
  sum without (instance) (up{job=~".*etcd.*"} == bool 0)
or
  count without (To) (
    sum without (instance) (rate(etcd_network_peer_sent_failures_total{job=~".*etcd.*"}[120s])) > 0.01
  )
)
> 0

etcdInsufficientMembers

  • Description: etcd cluster "{{ $labels.job }}": insufficient members ({{ $value }}).

Expression

sum(up{job=~".*etcd.*"} == bool 1) without (instance) < ((count(up{job=~".*etcd.*"}) without (instance) + 1) / 2)

etcdNoLeader

  • Description: etcd cluster "{{ $labels.job }}": member {{ $labels.instance }} has no leader.

Expression

etcd_server_has_leader{job=~".*etcd.*"} == 0

etcdHighNumberOfLeaderChanges

  • Description: etcd cluster "{{ $labels.job }}": {{ $value }} leader changes within the last 15 minutes. Frequent elections may be a sign of insufficient resources, high network latency, or disruptions by other components and should be investigated. etcd cluster has high number of leader changes.

Expression

increase((max without (instance) (etcd_server_leader_changes_seen_total{job=~".*etcd.*"}) or 0*absent(etcd_server_leader_changes_seen_total{job=~".*etcd.*"}))[15m:1m]) >= 4

etcdHighNumberOfFailedGRPCRequests

  • Description: etcd cluster "{{ $labels.job }}": {{ $value }}% of requests for {{ $labels.grpc_method }} failed on etcd instance {{ $labels.instance }}.

Expression

100 * sum(rate(grpc_server_handled_total{job=~".*etcd.*", grpc_code=~"Unknown|FailedPrecondition|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded"}[5m])) without (grpc_type, grpc_code)
  /
sum(rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) without (grpc_type, grpc_code)
  > 1

etcdHighNumberOfFailedGRPCRequests

  • Description: etcd cluster "{{ $labels.job }}": {{ $value }}% of requests for {{ $labels.grpc_method }} failed on etcd instance {{ $labels.instance }}.

Expression

100 * sum(rate(grpc_server_handled_total{job=~".*etcd.*", grpc_code=~"Unknown|FailedPrecondition|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded"}[5m])) without (grpc_type, grpc_code)
  /
sum(rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) without (grpc_type, grpc_code)
  > 5

etcdGRPCRequestsSlow

  • Description: etcd cluster "{{ $labels.job }}": 99th percentile of gRPC requests is {{ $value }}s on etcd instance {{ $labels.instance }} for {{ $labels.grpc_method }} method. etcd grpc requests are slow.

Expression

histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{job=~".*etcd.*", grpc_method!="Defragment", grpc_type="unary"}[5m])) without(grpc_type))
> 0.15

etcdMemberCommunicationSlow

  • Description: etcd cluster "{{ $labels.job }}": member communication with {{ $labels.To }} is taking {{ $value }}s on etcd instance {{ $labels.instance }}.

Expression

histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket{job=~".*etcd.*"}[5m]))
> 0.15

etcdHighNumberOfFailedProposals

  • Description: etcd cluster "{{ $labels.job }}": {{ $value }} proposal failures within the last 30 minutes on etcd instance {{ $labels.instance }}.

Expression

rate(etcd_server_proposals_failed_total{job=~".*etcd.*"}[15m]) > 5

etcdHighFsyncDurations

  • Description: etcd cluster "{{ $labels.job }}": 99th percentile fsync durations are {{ $value }}s on etcd instance {{ $labels.instance }}.

Expression

histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m]))
> 0.5

etcdHighFsyncDurations

  • Description: etcd cluster "{{ $labels.job }}": 99th percentile fsync durations are {{ $value }}s on etcd instance {{ $labels.instance }}.

Expression

histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m]))
> 1

etcdHighCommitDurations

  • Description: etcd cluster "{{ $labels.job }}": 99th percentile commit durations {{ $value }}s on etcd instance {{ $labels.instance }}.

Expression

histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket{job=~".*etcd.*"}[5m]))
> 0.25

etcdDatabaseQuotaLowSpace

  • Description: etcd cluster "{{ $labels.job }}": database size exceeds the defined quota on etcd instance {{ $labels.instance }}, please defrag or increase the quota as the writes to etcd will be disabled when it is full.

Expression

(last_over_time(etcd_mvcc_db_total_size_in_bytes{job=~".*etcd.*"}[5m]) / last_over_time(etcd_server_quota_backend_bytes{job=~".*etcd.*"}[5m]))*100 > 95

etcdExcessiveDatabaseGrowth

  • Description: etcd cluster "{{ $labels.job }}": Predicting running out of disk space in the next four hours, based on write observations within the past four hours on etcd instance {{ $labels.instance }}, please check as it might be disruptive.

Expression

predict_linear(etcd_mvcc_db_total_size_in_bytes{job=~".*etcd.*"}[4h], 4*60*60) > etcd_server_quota_backend_bytes{job=~".*etcd.*"}

etcdDatabaseHighFragmentationRatio

  • Description: etcd cluster "{{ $labels.job }}": database size in use on instance {{ $labels.instance }} is {{ $value | humanizePercentage }} of the actual allocated disk space, please run defragmentation (e.g. etcdctl defrag) to retrieve the unused fragmented disk space.

Expression

(last_over_time(etcd_mvcc_db_total_size_in_use_in_bytes{job=~".*etcd.*"}[5m]) / last_over_time(etcd_mvcc_db_total_size_in_bytes{job=~".*etcd.*"}[5m])) < 0.5 and etcd_mvcc_db_total_size_in_use_in_bytes{job=~".*etcd.*"} > 104857600

Group general.rules

TargetDown

  • Description: {{ printf "%.4g" $value }}% of the {{ $labels.job }}/{{ $labels.service }} targets in {{ $labels.namespace }} namespace are down.

Expression

100 * (count(up == 0) BY (cluster, job, namespace, service) / count(up) BY (cluster, job, namespace, service)) > 10

Watchdog

  • Description: This is an alert meant to ensure that the entire alerting pipeline is functional. This alert is always firing, therefore it should always be firing in Alertmanager and always fire against a receiver. There are integrations with various notification mechanisms that send a notification when this alert is not firing. For example the "DeadMansSnitch" integration in PagerDuty.

Expression

vector(1)

InfoInhibitor

  • Description: This is an alert that is used to inhibit info alerts. By themselves, the info-level alerts are sometimes very noisy, but they are relevant when combined with other alerts. This alert fires whenever there's a severity="info" alert, and stops firing when another alert with a severity of 'warning' or 'critical' starts firing on the same namespace. This alert should be routed to a null receiver and configured to inhibit alerts with severity="info".

Expression

ALERTS{severity = "info"} == 1 unless on (namespace) ALERTS{alertname != "InfoInhibitor", severity =~ "warning|critical", alertstate="firing"} == 1

Group kube-apiserver-slos

KubeAPIErrorBudgetBurn

  • Description: The API server is burning too much error budget.

Expression

sum(apiserver_request:burnrate1h) > (14.40 * 0.01000)
and
sum(apiserver_request:burnrate5m) > (14.40 * 0.01000)

  • Long: 1h
  • Short: 5m

KubeAPIErrorBudgetBurn

  • Description: The API server is burning too much error budget.

Expression

sum(apiserver_request:burnrate6h) > (6.00 * 0.01000)
and
sum(apiserver_request:burnrate30m) > (6.00 * 0.01000)

  • Long: 6h
  • Short: 30m

KubeAPIErrorBudgetBurn

  • Description: The API server is burning too much error budget.

Expression

sum(apiserver_request:burnrate1d) > (3.00 * 0.01000)
and
sum(apiserver_request:burnrate2h) > (3.00 * 0.01000)

  • Long: 1d
  • Short: 2h

KubeAPIErrorBudgetBurn

  • Description: The API server is burning too much error budget.

Expression

sum(apiserver_request:burnrate3d) > (1.00 * 0.01000)
and
sum(apiserver_request:burnrate6h) > (1.00 * 0.01000)

Group kube-state-metrics

KubeStateMetricsListErrors

  • Description: kube-state-metrics is experiencing errors at an elevated rate in list operations. This is likely causing it to not be able to expose metrics about Kubernetes objects correctly or at all.

Expression

(sum(rate(kube_state_metrics_list_total{job="kube-state-metrics",result="error"}[5m])) by (cluster)
  /
sum(rate(kube_state_metrics_list_total{job="kube-state-metrics"}[5m])) by (cluster))
> 0.01

KubeStateMetricsWatchErrors

  • Description: kube-state-metrics is experiencing errors at an elevated rate in watch operations. This is likely causing it to not be able to expose metrics about Kubernetes objects correctly or at all.

Expression

(sum(rate(kube_state_metrics_watch_total{job="kube-state-metrics",result="error"}[5m])) by (cluster)
  /
sum(rate(kube_state_metrics_watch_total{job="kube-state-metrics"}[5m])) by (cluster))
> 0.01

KubeStateMetricsShardingMismatch

  • Description: kube-state-metrics pods are running with different --total-shards configuration, some Kubernetes objects may be exposed multiple times or not exposed at all.

Expression

stdvar (kube_state_metrics_total_shards{job="kube-state-metrics"}) by (cluster) != 0

KubeStateMetricsShardsMissing

  • Description: kube-state-metrics shards are missing, some Kubernetes objects are not being exposed.

Expression

2^max(kube_state_metrics_total_shards{job="kube-state-metrics"}) by (cluster) - 1
  -
sum( 2 ^ max by (cluster, shard_ordinal) (kube_state_metrics_shard_ordinal{job="kube-state-metrics"}) ) by (cluster)
!= 0

Group kubernetes-apps

KubePodCrashLooping

  • Description: Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is in waiting state (reason: "CrashLoopBackOff").

Expression

max_over_time(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff", job="kube-state-metrics", namespace=~".*"}[5m]) >= 1

KubePodNotReady

  • Description: Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-ready state for longer than 15 minutes.

Expression

sum by (namespace, pod, cluster) (
  max by (namespace, pod, cluster) (
    kube_pod_status_phase{job="kube-state-metrics", namespace=~".*", phase=~"Pending|Unknown|Failed"}
  ) * on (namespace, pod, cluster) group_left(owner_kind) topk by (namespace, pod, cluster) (
    1, max by (namespace, pod, owner_kind, cluster) (kube_pod_owner{owner_kind!="Job"})
  )
) > 0

KubeDeploymentGenerationMismatch

  • Description: Deployment generation for {{ $labels.namespace }}/{{ $labels.deployment }} does not match, this indicates that the Deployment has failed but has not been rolled back.

Expression

kube_deployment_status_observed_generation{job="kube-state-metrics", namespace=~".*"}
  !=
kube_deployment_metadata_generation{job="kube-state-metrics", namespace=~".*"}

KubeDeploymentReplicasMismatch

  • Description: Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has not matched the expected number of replicas for longer than 15 minutes.

Expression

(
  kube_deployment_spec_replicas{job="kube-state-metrics", namespace=~".*"}
    >
  kube_deployment_status_replicas_available{job="kube-state-metrics", namespace=~".*"}
) and (
  changes(kube_deployment_status_replicas_updated{job="kube-state-metrics", namespace=~".*"}[10m])
    ==
  0
)

KubeDeploymentRolloutStuck

  • Description: Rollout of deployment {{ $labels.namespace }}/{{ $labels.deployment }} is not progressing for longer than 15 minutes.

Expression

kube_deployment_status_condition{condition="Progressing", status="false",job="kube-state-metrics", namespace=~".*"}
!= 0

KubeStatefulSetReplicasMismatch

  • Description: StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} has not matched the expected number of replicas for longer than 15 minutes.

Expression

(
  kube_statefulset_status_replicas_ready{job="kube-state-metrics", namespace=~".*"}
    !=
  kube_statefulset_status_replicas{job="kube-state-metrics", namespace=~".*"}
) and (
  changes(kube_statefulset_status_replicas_updated{job="kube-state-metrics", namespace=~".*"}[10m])
    ==
  0
)

KubeStatefulSetGenerationMismatch

  • Description: StatefulSet generation for {{ $labels.namespace }}/{{ $labels.statefulset }} does not match, this indicates that the StatefulSet has failed but has not been rolled back.

Expression

kube_statefulset_status_observed_generation{job="kube-state-metrics", namespace=~".*"}
  !=
kube_statefulset_metadata_generation{job="kube-state-metrics", namespace=~".*"}

KubeStatefulSetUpdateNotRolledOut

  • Description: StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} update has not been rolled out.

Expression

(
  max without (revision) (
    kube_statefulset_status_current_revision{job="kube-state-metrics", namespace=~".*"}
      unless
    kube_statefulset_status_update_revision{job="kube-state-metrics", namespace=~".*"}
  )
    *
  (
    kube_statefulset_replicas{job="kube-state-metrics", namespace=~".*"}
      !=
    kube_statefulset_status_replicas_updated{job="kube-state-metrics", namespace=~".*"}
  )
)  and (
  changes(kube_statefulset_status_replicas_updated{job="kube-state-metrics", namespace=~".*"}[5m])
    ==
  0
)

KubeDaemonSetRolloutStuck

  • Description: DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} has not finished or progressed for at least 15 minutes.

Expression

(
  (
    kube_daemonset_status_current_number_scheduled{job="kube-state-metrics", namespace=~".*"}
     !=
    kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics", namespace=~".*"}
  ) or (
    kube_daemonset_status_number_misscheduled{job="kube-state-metrics", namespace=~".*"}
     !=
    0
  ) or (
    kube_daemonset_status_updated_number_scheduled{job="kube-state-metrics", namespace=~".*"}
     !=
    kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics", namespace=~".*"}
  ) or (
    kube_daemonset_status_number_available{job="kube-state-metrics", namespace=~".*"}
     !=
    kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics", namespace=~".*"}
  )
) and (
  changes(kube_daemonset_status_updated_number_scheduled{job="kube-state-metrics", namespace=~".*"}[5m])
    ==
  0
)

KubeContainerWaiting

  • Description: pod/{{ $labels.pod }} in namespace {{ $labels.namespace }} on container {{ $labels.container }} has been in waiting state for longer than 1 hour.

Expression

sum by (namespace, pod, container, cluster) (kube_pod_container_status_waiting_reason{job="kube-state-metrics", namespace=~".*"}) > 0

KubeDaemonSetNotScheduled

  • Description: {{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are not scheduled.

Expression

kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics", namespace=~".*"}
  -
kube_daemonset_status_current_number_scheduled{job="kube-state-metrics", namespace=~".*"} > 0

KubeDaemonSetMisScheduled

  • Description: {{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are running where they are not supposed to run.

Expression

kube_daemonset_status_number_misscheduled{job="kube-state-metrics", namespace=~".*"} > 0

KubeJobNotCompleted

  • Description: Job {{ $labels.namespace }}/{{ $labels.job_name }} is taking more than {{ "43200" | humanizeDuration }} to complete.

Expression

time() - max by (namespace, job_name, cluster) (kube_job_status_start_time{job="kube-state-metrics", namespace=~".*"}
  and
kube_job_status_active{job="kube-state-metrics", namespace=~".*"} > 0) > 43200

KubeJobFailed

  • Description: Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete. Removing failed job after investigation should clear this alert.

Expression

kube_job_failed{job="kube-state-metrics", namespace=~".*"}  > 0

KubeHpaReplicasMismatch

  • Description: HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} has not matched the desired number of replicas for longer than 15 minutes.

Expression

(kube_horizontalpodautoscaler_status_desired_replicas{job="kube-state-metrics", namespace=~".*"}
  !=
kube_horizontalpodautoscaler_status_current_replicas{job="kube-state-metrics", namespace=~".*"})
  and
(kube_horizontalpodautoscaler_status_current_replicas{job="kube-state-metrics", namespace=~".*"}
  >
kube_horizontalpodautoscaler_spec_min_replicas{job="kube-state-metrics", namespace=~".*"})
  and
(kube_horizontalpodautoscaler_status_current_replicas{job="kube-state-metrics", namespace=~".*"}
  <
kube_horizontalpodautoscaler_spec_max_replicas{job="kube-state-metrics", namespace=~".*"})
  and
changes(kube_horizontalpodautoscaler_status_current_replicas{job="kube-state-metrics", namespace=~".*"}[15m]) == 0

KubeHpaMaxedOut

  • Description: HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} has been running at max replicas for longer than 15 minutes.

Expression

kube_horizontalpodautoscaler_status_current_replicas{job="kube-state-metrics", namespace=~".*"}
  ==
kube_horizontalpodautoscaler_spec_max_replicas{job="kube-state-metrics", namespace=~".*"}

Group kubernetes-resources

KubeCPUOvercommit

  • Description: Cluster {{ $labels.cluster }} has overcommitted CPU resource requests for Pods by {{ $value }} CPU shares and cannot tolerate node failure.

Expression

sum(namespace_cpu:kube_pod_container_resource_requests:sum{job="kube-state-metrics",}) by (cluster) - (sum(kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"}) by (cluster) - max(kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"}) by (cluster)) > 0
and
(sum(kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"}) by (cluster) - max(kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"}) by (cluster)) > 0

KubeMemoryOvercommit

  • Description: Cluster {{ $labels.cluster }} has overcommitted memory resource requests for Pods by {{ $value | humanize }} bytes and cannot tolerate node failure.

Expression

sum(namespace_memory:kube_pod_container_resource_requests:sum{}) by (cluster) - (sum(kube_node_status_allocatable{resource="memory", job="kube-state-metrics"}) by (cluster) - max(kube_node_status_allocatable{resource="memory", job="kube-state-metrics"}) by (cluster)) > 0
and
(sum(kube_node_status_allocatable{resource="memory", job="kube-state-metrics"}) by (cluster) - max(kube_node_status_allocatable{resource="memory", job="kube-state-metrics"}) by (cluster)) > 0

KubeCPUQuotaOvercommit

  • Description: Cluster {{ $labels.cluster }} has overcommitted CPU resource requests for Namespaces.

Expression

sum(min without(resource) (kube_resourcequota{job="kube-state-metrics", type="hard", resource=~"(cpu|requests.cpu)"})) by (cluster)
  /
sum(kube_node_status_allocatable{resource="cpu", job="kube-state-metrics"}) by (cluster)
  > 1.5

KubeMemoryQuotaOvercommit

  • Description: Cluster {{ $labels.cluster }} has overcommitted memory resource requests for Namespaces.

Expression

sum(min without(resource) (kube_resourcequota{job="kube-state-metrics", type="hard", resource=~"(memory|requests.memory)"})) by (cluster)
  /
sum(kube_node_status_allocatable{resource="memory", job="kube-state-metrics"}) by (cluster)
  > 1.5

KubeQuotaAlmostFull

  • Description: Namespace {{ $labels.namespace }} is using {{ $value | humanizePercentage }} of its {{ $labels.resource }} quota.

Expression

kube_resourcequota{job="kube-state-metrics", type="used"}
  / ignoring(instance, job, type)
(kube_resourcequota{job="kube-state-metrics", type="hard"} > 0)
  > 0.9 < 1

KubeQuotaFullyUsed

  • Description: Namespace {{ $labels.namespace }} is using {{ $value | humanizePercentage }} of its {{ $labels.resource }} quota.

Expression

kube_resourcequota{job="kube-state-metrics", type="used"}
  / ignoring(instance, job, type)
(kube_resourcequota{job="kube-state-metrics", type="hard"} > 0)
  == 1

KubeQuotaExceeded

  • Description: Namespace {{ $labels.namespace }} is using {{ $value | humanizePercentage }} of its {{ $labels.resource }} quota.

Expression

kube_resourcequota{job="kube-state-metrics", type="used"}
  / ignoring(instance, job, type)
(kube_resourcequota{job="kube-state-metrics", type="hard"} > 0)
  > 1

CPUThrottlingHigh

  • Description: {{ $value | humanizePercentage }} throttling of CPU in namespace {{ $labels.namespace }} for container {{ $labels.container }} in pod {{ $labels.pod }}.

Expression

sum(increase(container_cpu_cfs_throttled_periods_total{container!="", }[5m])) by (cluster, container, pod, namespace)
  /
sum(increase(container_cpu_cfs_periods_total{}[5m])) by (cluster, container, pod, namespace)
  > ( 25 / 100 )

Group kubernetes-storage

KubePersistentVolumeFillingUp

  • Description: The PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in Namespace {{ $labels.namespace }} on Cluster {{ $labels.cluster }} is only {{ $value | humanizePercentage }} free.

Expression

(
  kubelet_volume_stats_available_bytes{job="kubelet", namespace=~".*", metrics_path="/metrics"}
    /
  kubelet_volume_stats_capacity_bytes{job="kubelet", namespace=~".*", metrics_path="/metrics"}
) < 0.03
and
kubelet_volume_stats_used_bytes{job="kubelet", namespace=~".*", metrics_path="/metrics"} > 0
unless on (cluster, namespace, persistentvolumeclaim)
kube_persistentvolumeclaim_access_mode{ access_mode="ReadOnlyMany"} == 1
unless on (cluster, namespace, persistentvolumeclaim)
kube_persistentvolumeclaim_labels{label_excluded_from_alerts="true"} == 1

KubePersistentVolumeFillingUp

  • Description: Based on recent sampling, the PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in Namespace {{ $labels.namespace }} on Cluster {{ $labels.cluster }} is expected to fill up within four days. Currently {{ $value | humanizePercentage }} is available.

Expression

(
  kubelet_volume_stats_available_bytes{job="kubelet", namespace=~".*", metrics_path="/metrics"}
    /
  kubelet_volume_stats_capacity_bytes{job="kubelet", namespace=~".*", metrics_path="/metrics"}
) < 0.15
and
kubelet_volume_stats_used_bytes{job="kubelet", namespace=~".*", metrics_path="/metrics"} > 0
and
predict_linear(kubelet_volume_stats_available_bytes{job="kubelet", namespace=~".*", metrics_path="/metrics"}[6h], 4 * 24 * 3600) < 0
unless on (cluster, namespace, persistentvolumeclaim)
kube_persistentvolumeclaim_access_mode{ access_mode="ReadOnlyMany"} == 1
unless on (cluster, namespace, persistentvolumeclaim)
kube_persistentvolumeclaim_labels{label_excluded_from_alerts="true"} == 1

KubePersistentVolumeInodesFillingUp

  • Description: The PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in Namespace {{ $labels.namespace }} on Cluster {{ $labels.cluster }} only has {{ $value | humanizePercentage }} free inodes.

Expression

(
  kubelet_volume_stats_inodes_free{job="kubelet", namespace=~".*", metrics_path="/metrics"}
    /
  kubelet_volume_stats_inodes{job="kubelet", namespace=~".*", metrics_path="/metrics"}
) < 0.03
and
kubelet_volume_stats_inodes_used{job="kubelet", namespace=~".*", metrics_path="/metrics"} > 0
unless on (cluster, namespace, persistentvolumeclaim)
kube_persistentvolumeclaim_access_mode{ access_mode="ReadOnlyMany"} == 1
unless on (cluster, namespace, persistentvolumeclaim)
kube_persistentvolumeclaim_labels{label_excluded_from_alerts="true"} == 1

KubePersistentVolumeInodesFillingUp

  • Description: Based on recent sampling, the PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in Namespace {{ $labels.namespace }} on Cluster {{ $labels.cluster }} is expected to run out of inodes within four days. Currently {{ $value | humanizePercentage }} of its inodes are free.

Expression

(
  kubelet_volume_stats_inodes_free{job="kubelet", namespace=~".*", metrics_path="/metrics"}
    /
  kubelet_volume_stats_inodes{job="kubelet", namespace=~".*", metrics_path="/metrics"}
) < 0.15
and
kubelet_volume_stats_inodes_used{job="kubelet", namespace=~".*", metrics_path="/metrics"} > 0
and
predict_linear(kubelet_volume_stats_inodes_free{job="kubelet", namespace=~".*", metrics_path="/metrics"}[6h], 4 * 24 * 3600) < 0
unless on (cluster, namespace, persistentvolumeclaim)
kube_persistentvolumeclaim_access_mode{ access_mode="ReadOnlyMany"} == 1
unless on (cluster, namespace, persistentvolumeclaim)
kube_persistentvolumeclaim_labels{label_excluded_from_alerts="true"} == 1

KubePersistentVolumeErrors

  • Description: The persistent volume {{ $labels.persistentvolume }} on Cluster {{ $labels.cluster }} has status {{ $labels.phase }}.

Expression

kube_persistentvolume_status_phase{phase=~"Failed|Pending",job="kube-state-metrics"} > 0

Group kubernetes-system-apiserver

KubeClientCertificateExpiration

  • Description: A client certificate used to authenticate to kubernetes apiserver is expiring in less than 7.0 days.

Expression

apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and on (job) histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 604800

KubeClientCertificateExpiration

  • Description: A client certificate used to authenticate to kubernetes apiserver is expiring in less than 24.0 hours.

Expression

apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and on (job) histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 86400

KubeAggregatedAPIErrors

  • Description: Kubernetes aggregated API {{ $labels.name }}/{{ $labels.namespace }} has reported errors. It has appeared unavailable {{ $value | humanize }} times averaged over the past 10m.

Expression

sum by (name, namespace, cluster)(increase(aggregator_unavailable_apiservice_total{job="apiserver"}[10m])) > 4

KubeAggregatedAPIDown

  • Description: Kubernetes aggregated API {{ $labels.name }}/{{ $labels.namespace }} has been only {{ $value | humanize }}% available over the last 10m.

Expression

(1 - max by (name, namespace, cluster)(avg_over_time(aggregator_unavailable_apiservice{job="apiserver"}[10m]))) * 100 < 85

KubeAPIDown

  • Description: KubeAPI has disappeared from Prometheus target discovery.

Expression

absent(up{job="apiserver"} == 1)

KubeAPITerminatedRequests

  • Description: The kubernetes apiserver has terminated {{ $value | humanizePercentage }} of its incoming requests.

Expression

sum(rate(apiserver_request_terminations_total{job="apiserver"}[10m]))  / (  sum(rate(apiserver_request_total{job="apiserver"}[10m])) + sum(rate(apiserver_request_terminations_total{job="apiserver"}[10m])) ) > 0.20

Group kubernetes-system-controller-manager

KubeControllerManagerDown

  • Description: KubeControllerManager has disappeared from Prometheus target discovery.

Expression

absent(up{job="kube-controller-manager"} == 1)

Group kubernetes-system-kubelet

KubeNodeNotReady

  • Description: {{ $labels.node }} has been unready for more than 15 minutes.

Expression

kube_node_status_condition{job="kube-state-metrics",condition="Ready",status="true"} == 0

KubeNodeUnreachable

  • Description: {{ $labels.node }} is unreachable and some workloads may be rescheduled.

Expression

(kube_node_spec_taint{job="kube-state-metrics",key="node.kubernetes.io/unreachable",effect="NoSchedule"} unless ignoring(key,value) kube_node_spec_taint{job="kube-state-metrics",key=~"ToBeDeletedByClusterAutoscaler|cloud.google.com/impending-node-termination|aws-node-termination-handler/spot-itn"}) == 1

KubeletTooManyPods

  • Description: Kubelet '{{ $labels.node }}' is running at {{ $value | humanizePercentage }} of its Pod capacity.

Expression

count by (cluster, node) (
  (kube_pod_status_phase{job="kube-state-metrics",phase="Running"} == 1) * on (instance,pod,namespace,cluster) group_left(node) topk by (instance,pod,namespace,cluster) (1, kube_pod_info{job="kube-state-metrics"})
)
/
max by (cluster, node) (
  kube_node_status_capacity{job="kube-state-metrics",resource="pods"} != 1
) > 0.95

KubeNodeReadinessFlapping

  • Description: The readiness status of node {{ $labels.node }} has changed {{ $value }} times in the last 15 minutes.

Expression

sum(changes(kube_node_status_condition{job="kube-state-metrics",status="true",condition="Ready"}[15m])) by (cluster, node) > 2

KubeletPlegDurationHigh

  • Description: The Kubelet Pod Lifecycle Event Generator has a 99th percentile duration of {{ $value }} seconds on node {{ $labels.node }}.

Expression

node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile{quantile="0.99"} >= 10

KubeletPodStartUpLatencyHigh

  • Description: Kubelet Pod startup 99th percentile latency is {{ $value }} seconds on node {{ $labels.node }}.

Expression

histogram_quantile(0.99, sum(rate(kubelet_pod_worker_duration_seconds_bucket{job="kubelet", metrics_path="/metrics"}[5m])) by (cluster, instance, le)) * on (cluster, instance) group_left(node) kubelet_node_name{job="kubelet", metrics_path="/metrics"} > 60

KubeletClientCertificateExpiration

  • Description: Client certificate for Kubelet on node {{ $labels.node }} expires in {{ $value | humanizeDuration }}.

Expression

kubelet_certificate_manager_client_ttl_seconds < 604800

KubeletClientCertificateExpiration

  • Description: Client certificate for Kubelet on node {{ $labels.node }} expires in {{ $value | humanizeDuration }}.

Expression

kubelet_certificate_manager_client_ttl_seconds < 86400

KubeletServerCertificateExpiration

  • Description: Server certificate for Kubelet on node {{ $labels.node }} expires in {{ $value | humanizeDuration }}.

Expression

kubelet_certificate_manager_server_ttl_seconds < 604800

KubeletServerCertificateExpiration

  • Description: Server certificate for Kubelet on node {{ $labels.node }} expires in {{ $value | humanizeDuration }}.

Expression

kubelet_certificate_manager_server_ttl_seconds < 86400

KubeletClientCertificateRenewalErrors

  • Description: Kubelet on node {{ $labels.node }} has failed to renew its client certificate ({{ $value | humanize }} errors in the last 5 minutes).

Expression

increase(kubelet_certificate_manager_client_expiration_renew_errors[5m]) > 0

KubeletServerCertificateRenewalErrors

  • Description: Kubelet on node {{ $labels.node }} has failed to renew its server certificate ({{ $value | humanize }} errors in the last 5 minutes).

Expression

increase(kubelet_server_expiration_renew_errors[5m]) > 0

KubeletDown

  • Description: Kubelet has disappeared from Prometheus target discovery.

Expression

absent(up{job="kubelet", metrics_path="/metrics"} == 1)

Group kubernetes-system-scheduler

KubeSchedulerDown

  • Description: KubeScheduler has disappeared from Prometheus target discovery.

Expression

absent(up{job="kube-scheduler"} == 1)

Group kubernetes-system

KubeVersionMismatch

  • Description: There are {{ $value }} different semantic versions of Kubernetes components running.

Expression

count by (cluster) (count by (git_version, cluster) (label_replace(kubernetes_build_info{job!~"kube-dns|coredns"},"git_version","$1","git_version","(v[0-9]*.[0-9]*).*"))) > 1

KubeClientErrors

  • Description: Kubernetes API server client '{{ $labels.job }}/{{ $labels.instance }}' is experiencing {{ $value | humanizePercentage }} errors.

Expression

(sum(rate(rest_client_requests_total{job="apiserver",code=~"5.."}[5m])) by (cluster, instance, job, namespace)
  /
sum(rate(rest_client_requests_total{job="apiserver"}[5m])) by (cluster, instance, job, namespace))
> 0.01

Group node-exporter

NodeFilesystemSpaceFillingUp

  • Description: Filesystem on {{ $labels.device }}, mounted on {{ $labels.mountpoint }}, at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available space left and is filling up.

Expression

(
  node_filesystem_avail_bytes{job="node-exporter",fstype!="",mountpoint!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!="",mountpoint!=""} * 100 < 15
and
  predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!="",mountpoint!=""}[6h], 24*60*60) < 0
and
  node_filesystem_readonly{job="node-exporter",fstype!="",mountpoint!=""} == 0
)

NodeFilesystemSpaceFillingUp

  • Description: Filesystem on {{ $labels.device }}, mounted on {{ $labels.mountpoint }}, at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available space left and is filling up fast.

Expression

(
  node_filesystem_avail_bytes{job="node-exporter",fstype!="",mountpoint!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!="",mountpoint!=""} * 100 < 10
and
  predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!="",mountpoint!=""}[6h], 4*60*60) < 0
and
  node_filesystem_readonly{job="node-exporter",fstype!="",mountpoint!=""} == 0
)

NodeFilesystemAlmostOutOfSpace

  • Description: Filesystem on {{ $labels.device }}, mounted on {{ $labels.mountpoint }}, at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available space left.

Expression

(
  node_filesystem_avail_bytes{job="node-exporter",fstype!="",mountpoint!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!="",mountpoint!=""} * 100 < 5
and
  node_filesystem_readonly{job="node-exporter",fstype!="",mountpoint!=""} == 0
)

NodeFilesystemAlmostOutOfSpace

  • Description: Filesystem on {{ $labels.device }}, mounted on {{ $labels.mountpoint }}, at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available space left.

Expression

(
  node_filesystem_avail_bytes{job="node-exporter",fstype!="",mountpoint!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!="",mountpoint!=""} * 100 < 3
and
  node_filesystem_readonly{job="node-exporter",fstype!="",mountpoint!=""} == 0
)

NodeFilesystemFilesFillingUp

  • Description: Filesystem on {{ $labels.device }}, mounted on {{ $labels.mountpoint }}, at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available inodes left and is filling up.

Expression

(
  node_filesystem_files_free{job="node-exporter",fstype!="",mountpoint!=""} / node_filesystem_files{job="node-exporter",fstype!="",mountpoint!=""} * 100 < 40
and
  predict_linear(node_filesystem_files_free{job="node-exporter",fstype!="",mountpoint!=""}[6h], 24*60*60) < 0
and
  node_filesystem_readonly{job="node-exporter",fstype!="",mountpoint!=""} == 0
)

NodeFilesystemFilesFillingUp

  • Description: Filesystem on {{ $labels.device }}, mounted on {{ $labels.mountpoint }}, at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available inodes left and is filling up fast.

Expression

(
  node_filesystem_files_free{job="node-exporter",fstype!="",mountpoint!=""} / node_filesystem_files{job="node-exporter",fstype!="",mountpoint!=""} * 100 < 20
and
  predict_linear(node_filesystem_files_free{job="node-exporter",fstype!="",mountpoint!=""}[6h], 4*60*60) < 0
and
  node_filesystem_readonly{job="node-exporter",fstype!="",mountpoint!=""} == 0
)

NodeFilesystemAlmostOutOfFiles

  • Description: Filesystem on {{ $labels.device }}, mounted on {{ $labels.mountpoint }}, at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available inodes left.

Expression

(
  node_filesystem_files_free{job="node-exporter",fstype!="",mountpoint!=""} / node_filesystem_files{job="node-exporter",fstype!="",mountpoint!=""} * 100 < 5
and
  node_filesystem_readonly{job="node-exporter",fstype!="",mountpoint!=""} == 0
)

NodeFilesystemAlmostOutOfFiles

  • Description: Filesystem on {{ $labels.device }}, mounted on {{ $labels.mountpoint }}, at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available inodes left.

Expression

(
  node_filesystem_files_free{job="node-exporter",fstype!="",mountpoint!=""} / node_filesystem_files{job="node-exporter",fstype!="",mountpoint!=""} * 100 < 3
and
  node_filesystem_readonly{job="node-exporter",fstype!="",mountpoint!=""} == 0
)

NodeNetworkReceiveErrs

  • Description: {{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf "%.0f" $value }} receive errors in the last two minutes.

Expression

rate(node_network_receive_errs_total{job="node-exporter"}[2m]) / rate(node_network_receive_packets_total{job="node-exporter"}[2m]) > 0.01

NodeNetworkTransmitErrs

  • Description: {{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf "%.0f" $value }} transmit errors in the last two minutes.

Expression

rate(node_network_transmit_errs_total{job="node-exporter"}[2m]) / rate(node_network_transmit_packets_total{job="node-exporter"}[2m]) > 0.01

NodeHighNumberConntrackEntriesUsed

  • Description: {{ $value | humanizePercentage }} of conntrack entries are used.

Expression

(node_nf_conntrack_entries{job="node-exporter"} / node_nf_conntrack_entries_limit) > 0.75

NodeTextFileCollectorScrapeError

  • Description: Node Exporter text file collector on {{ $labels.instance }} failed to scrape.

Expression

node_textfile_scrape_error{job="node-exporter"} == 1

NodeClockSkewDetected

  • Description: Clock at {{ $labels.instance }} is out of sync by more than 0.05s. Ensure NTP is configured correctly on this host.

Expression

(
  node_timex_offset_seconds{job="node-exporter"} > 0.05
and
  deriv(node_timex_offset_seconds{job="node-exporter"}[5m]) >= 0
)
or
(
  node_timex_offset_seconds{job="node-exporter"} < -0.05
and
  deriv(node_timex_offset_seconds{job="node-exporter"}[5m]) <= 0
)

NodeClockNotSynchronising

  • Description: Clock at {{ $labels.instance }} is not synchronising. Ensure NTP is configured on this host.

Expression

min_over_time(node_timex_sync_status{job="node-exporter"}[5m]) == 0
and
node_timex_maxerror_seconds{job="node-exporter"} >= 16

NodeRAIDDegraded

  • Description: RAID array '{{ $labels.device }}' at {{ $labels.instance }} is in degraded state due to one or more disk failures. Number of spare drives is insufficient to fix the issue automatically.

Expression

node_md_disks_required{job="node-exporter",device=~"(/dev/)?(mmcblk.p.+|nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+|md.+|dasd.+)"} - ignoring (state) (node_md_disks{state="active",job="node-exporter",device=~"(/dev/)?(mmcblk.p.+|nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+|md.+|dasd.+)"}) > 0

NodeRAIDDiskFailure

  • Description: At least one device in RAID array at {{ $labels.instance }} failed. Array '{{ $labels.device }}' needs attention and possibly a disk swap.

Expression

node_md_disks{state="failed",job="node-exporter",device=~"(/dev/)?(mmcblk.p.+|nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+|md.+|dasd.+)"} > 0

NodeFileDescriptorLimit

  • Description: File descriptors limit at {{ $labels.instance }} is currently at {{ printf "%.2f" $value }}%.

Expression

(
  node_filefd_allocated{job="node-exporter"} * 100 / node_filefd_maximum{job="node-exporter"} > 70
)

NodeFileDescriptorLimit

  • Description: File descriptors limit at {{ $labels.instance }} is currently at {{ printf "%.2f" $value }}%.

Expression

(
  node_filefd_allocated{job="node-exporter"} * 100 / node_filefd_maximum{job="node-exporter"} > 90
)

NodeCPUHighUsage

  • Description: CPU usage at {{ $labels.instance }} has been above 90% for the last 15 minutes and is currently at {{ printf "%.2f" $value }}%.

Expression

sum without(mode) (avg without (cpu) (rate(node_cpu_seconds_total{job="node-exporter", mode!="idle"}[2m]))) * 100 > 90

NodeSystemSaturation

  • Description: System load per core at {{ $labels.instance }} has been above 2 for the last 15 minutes and is currently at {{ printf "%.2f" $value }}. This might indicate that the instance's resources are saturated, which can cause it to become unresponsive.

Expression

node_load1{job="node-exporter"}
/ count without (cpu, mode) (node_cpu_seconds_total{job="node-exporter", mode="idle"}) > 2

NodeMemoryMajorPagesFaults

  • Description: Memory major page faults are occurring at a very high rate at {{ $labels.instance }}: 500 major page faults per second for the last 15 minutes, currently at {{ printf "%.2f" $value }}. Please check that there is enough memory available on this instance.

Expression

rate(node_vmstat_pgmajfault{job="node-exporter"}[5m]) > 500

NodeMemoryHighUtilization

  • Description: Memory is filling up at {{ $labels.instance }}, has been above 90% for the last 15 minutes, and is currently at {{ printf "%.2f" $value }}%.

Expression

100 - (node_memory_MemAvailable_bytes{job="node-exporter"} / node_memory_MemTotal_bytes{job="node-exporter"} * 100) > 90

NodeDiskIOSaturation

  • Description: Disk IO queue (aqu-sq) is high on {{ $labels.device }} at {{ $labels.instance }}, has been above 10 for the last 15 minutes, and is currently at {{ printf "%.2f" $value }}. This symptom might indicate disk saturation.

Expression

rate(node_disk_io_time_weighted_seconds_total{job="node-exporter", device=~"(/dev/)?(mmcblk.p.+|nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+|md.+|dasd.+)"}[5m]) > 10

NodeSystemdServiceFailed

  • Description: Systemd service {{ $labels.name }} has entered failed state at {{ $labels.instance }}.

Expression

node_systemd_unit_state{job="node-exporter", state="failed"} == 1

NodeBondingDegraded

  • Description: Bonding interface {{ $labels.master }} on {{ $labels.instance }} is in degraded state due to one or more slave failures.

Expression

(node_bonding_slaves - node_bonding_active) != 0

Group node-network

NodeNetworkInterfaceFlapping

  • Description: Network interface "{{ $labels.device }}" changing its up status often on node-exporter {{ $labels.namespace }}/{{ $labels.pod }}.

Expression

changes(node_network_up{job="node-exporter",device!~"veth.+"}[2m]) > 2

Group prometheus-operator

PrometheusOperatorListErrors

  • Description: Errors while performing List operations in controller {{ $labels.controller }} in {{ $labels.namespace }} namespace.

Expression

(sum by (cluster,controller,namespace) (rate(prometheus_operator_list_operations_failed_total{job="shturval-metrics-operator",namespace="monitoring"}[10m])) / sum by (cluster,controller,namespace) (rate(prometheus_operator_list_operations_total{job="shturval-metrics-operator",namespace="monitoring"}[10m]))) > 0.4

PrometheusOperatorWatchErrors

  • Description: Errors while performing watch operations in controller {{ $labels.controller }} in {{ $labels.namespace }} namespace.

Expression

(sum by (cluster,controller,namespace) (rate(prometheus_operator_watch_operations_failed_total{job="shturval-metrics-operator",namespace="monitoring"}[5m])) / sum by (cluster,controller,namespace) (rate(prometheus_operator_watch_operations_total{job="shturval-metrics-operator",namespace="monitoring"}[5m]))) > 0.4

PrometheusOperatorSyncFailed

  • Description: Controller {{ $labels.controller }} in {{ $labels.namespace }} namespace fails to reconcile {{ $value }} objects.

Expression

min_over_time(prometheus_operator_syncs{status="failed",job="shturval-metrics-operator",namespace="monitoring"}[5m]) > 0

PrometheusOperatorReconcileErrors

  • Description: {{ $value | humanizePercentage }} of reconciling operations failed for {{ $labels.controller }} controller in {{ $labels.namespace }} namespace.

Expression

(sum by (cluster,controller,namespace) (rate(prometheus_operator_reconcile_errors_total{job="shturval-metrics-operator",namespace="monitoring"}[5m]))) / (sum by (cluster,controller,namespace) (rate(prometheus_operator_reconcile_operations_total{job="shturval-metrics-operator",namespace="monitoring"}[5m]))) > 0.1

PrometheusOperatorStatusUpdateErrors

  • Description: {{ $value | humanizePercentage }} of status update operations failed for {{ $labels.controller }} controller in {{ $labels.namespace }} namespace.

Expression

(sum by (cluster,controller,namespace) (rate(prometheus_operator_status_update_errors_total{job="shturval-metrics-operator",namespace="monitoring"}[5m]))) / (sum by (cluster,controller,namespace) (rate(prometheus_operator_status_update_operations_total{job="shturval-metrics-operator",namespace="monitoring"}[5m]))) > 0.1

PrometheusOperatorNodeLookupErrors

  • Description: Errors while reconciling Prometheus in {{ $labels.namespace }} Namespace.

Expression

rate(prometheus_operator_node_address_lookup_errors_total{job="shturval-metrics-operator",namespace="monitoring"}[5m]) > 0.1

PrometheusOperatorNotReady

  • Description: Prometheus operator in {{ $labels.namespace }} namespace isn't ready to reconcile {{ $labels.controller }} resources.

Expression

min by (cluster,controller,namespace) (max_over_time(prometheus_operator_ready{job="shturval-metrics-operator",namespace="monitoring"}[5m]) == 0)

PrometheusOperatorRejectedResources

  • Description: Prometheus operator in {{ $labels.namespace }} namespace rejected {{ printf "%0.0f" $value }} {{ $labels.controller }}/{{ $labels.resource }} resources.

Expression

min_over_time(prometheus_operator_managed_resources{state="rejected",job="shturval-metrics-operator",namespace="monitoring"}[5m]) > 0

Group prometheus

PrometheusBadConfig

  • Description: Prometheus {{ $labels.namespace }}/{{ $labels.pod }} has failed to reload its configuration.

Expression

# Without max_over_time, failed scrapes could create false negatives, see
# https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details.
max_over_time(prometheus_config_last_reload_successful{job="shturval-metrics-prometheus",namespace="monitoring"}[5m]) == 0

PrometheusSDRefreshFailure

  • Description: Prometheus {{ $labels.namespace }}/{{ $labels.pod }} has failed to refresh SD with mechanism {{ $labels.mechanism }}.

Expression

increase(prometheus_sd_refresh_failures_total{job="shturval-metrics-prometheus",namespace="monitoring"}[10m]) > 0

PrometheusNotificationQueueRunningFull

  • Description: Alert notification queue of Prometheus {{ $labels.namespace }}/{{ $labels.pod }} is running full.

Expression

# Without min_over_time, failed scrapes could create false negatives, see
# https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details.
(
  predict_linear(prometheus_notifications_queue_length{job="shturval-metrics-prometheus",namespace="monitoring"}[5m], 60 * 30)
>
  min_over_time(prometheus_notifications_queue_capacity{job="shturval-metrics-prometheus",namespace="monitoring"}[5m])
)

PrometheusErrorSendingAlertsToSomeAlertmanagers

  • Description: {{ printf "%.1f" $value }}% errors while sending alerts from Prometheus {{ $labels.namespace }}/{{ $labels.pod }} to Alertmanager {{ $labels.alertmanager }}.

Expression

(
  rate(prometheus_notifications_errors_total{job="shturval-metrics-prometheus",namespace="monitoring"}[5m])
/
  rate(prometheus_notifications_sent_total{job="shturval-metrics-prometheus",namespace="monitoring"}[5m])
)
* 100
> 1

PrometheusNotConnectedToAlertmanagers

  • Description: Prometheus {{ $labels.namespace }}/{{ $labels.pod }} is not connected to any Alertmanagers.

Expression

# Without max_over_time, failed scrapes could create false negatives, see
# https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details.
max_over_time(prometheus_notifications_alertmanagers_discovered{job="shturval-metrics-prometheus",namespace="monitoring"}[5m]) < 1

PrometheusTSDBReloadsFailing

  • Description: Prometheus {{ $labels.namespace }}/{{ $labels.pod }} has detected {{ $value | humanize }} reload failures over the last 3h.

Expression

increase(prometheus_tsdb_reloads_failures_total{job="shturval-metrics-prometheus",namespace="monitoring"}[3h]) > 0

PrometheusTSDBCompactionsFailing

  • Description: Prometheus {{ $labels.namespace }}/{{ $labels.pod }} has detected {{ $value | humanize }} compaction failures over the last 3h.

Expression

increase(prometheus_tsdb_compactions_failed_total{job="shturval-metrics-prometheus",namespace="monitoring"}[3h]) > 0

PrometheusNotIngestingSamples

  • Description: Prometheus {{ $labels.namespace }}/{{ $labels.pod }} is not ingesting samples.

Expression

(
  rate(prometheus_tsdb_head_samples_appended_total{job="shturval-metrics-prometheus",namespace="monitoring"}[5m]) <= 0
and
  (
    sum without(scrape_job) (prometheus_target_metadata_cache_entries{job="shturval-metrics-prometheus",namespace="monitoring"}) > 0
  or
    sum without(rule_group) (prometheus_rule_group_rules{job="shturval-metrics-prometheus",namespace="monitoring"}) > 0
  )
)

PrometheusDuplicateTimestamps

  • Description: Prometheus {{ $labels.namespace }}/{{ $labels.pod }} is dropping {{ printf "%.4g" $value }} samples/s with different values but duplicated timestamp.

Expression

rate(prometheus_target_scrapes_sample_duplicate_timestamp_total{job="shturval-metrics-prometheus",namespace="monitoring"}[5m]) > 0

PrometheusOutOfOrderTimestamps

  • Description: Prometheus {{ $labels.namespace }}/{{ $labels.pod }} is dropping {{ printf "%.4g" $value }} samples/s with timestamps arriving out of order.

Expression

rate(prometheus_target_scrapes_sample_out_of_order_total{job="shturval-metrics-prometheus",namespace="monitoring"}[5m]) > 0

PrometheusRemoteStorageFailures

  • Description: Prometheus {{ $labels.namespace }}/{{ $labels.pod }} failed to send {{ printf "%.1f" $value }}% of the samples to {{ $labels.remote_name }}:{{ $labels.url }}.

Expression

(
  (rate(prometheus_remote_storage_failed_samples_total{job="shturval-metrics-prometheus",namespace="monitoring"}[5m]) or rate(prometheus_remote_storage_samples_failed_total{job="shturval-metrics-prometheus",namespace="monitoring"}[5m]))
/
  (
    (rate(prometheus_remote_storage_failed_samples_total{job="shturval-metrics-prometheus",namespace="monitoring"}[5m]) or rate(prometheus_remote_storage_samples_failed_total{job="shturval-metrics-prometheus",namespace="monitoring"}[5m]))
  +
    (rate(prometheus_remote_storage_succeeded_samples_total{job="shturval-metrics-prometheus",namespace="monitoring"}[5m]) or rate(prometheus_remote_storage_samples_total{job="shturval-metrics-prometheus",namespace="monitoring"}[5m]))
  )
)
* 100
> 1

PrometheusRemoteWriteBehind

  • Description: Prometheus {{ $labels.namespace }}/{{ $labels.pod }} remote write is {{ printf "%.1f" $value }}s behind for {{ $labels.remote_name }}:{{ $labels.url }}.

Expression

# Without max_over_time, failed scrapes could create false negatives, see
# https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details.
(
  max_over_time(prometheus_remote_storage_highest_timestamp_in_seconds{job="shturval-metrics-prometheus",namespace="monitoring"}[5m])
- ignoring(remote_name, url) group_right
  max_over_time(prometheus_remote_storage_queue_highest_sent_timestamp_seconds{job="shturval-metrics-prometheus",namespace="monitoring"}[5m])
)
> 120

PrometheusRemoteWriteDesiredShards

  • Description: Prometheus {{ $labels.namespace }}/{{ $labels.pod }} remote write desired shards calculation wants to run {{ $value }} shards for queue {{ $labels.remote_name }}:{{ $labels.url }}, which is more than the max of {{ printf `prometheus_remote_storage_shards_max{instance="%s",job="shturval-metrics-prometheus",namespace="monitoring"}` $labels.instance | query | first | value }}.

Expression

# Without max_over_time, failed scrapes could create false negatives, see
# https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details.
(
  max_over_time(prometheus_remote_storage_shards_desired{job="shturval-metrics-prometheus",namespace="monitoring"}[5m])
>
  max_over_time(prometheus_remote_storage_shards_max{job="shturval-metrics-prometheus",namespace="monitoring"}[5m])
)

PrometheusRuleFailures

  • Description: Prometheus {{ $labels.namespace }}/{{ $labels.pod }} has failed to evaluate {{ printf "%.0f" $value }} rules in the last 5m.

Expression

increase(prometheus_rule_evaluation_failures_total{job="shturval-metrics-prometheus",namespace="monitoring"}[5m]) > 0

PrometheusMissingRuleEvaluations

  • Description: Prometheus {{ $labels.namespace }}/{{ $labels.pod }} has missed {{ printf "%.0f" $value }} rule group evaluations in the last 5m.

Expression

increase(prometheus_rule_group_iterations_missed_total{job="shturval-metrics-prometheus",namespace="monitoring"}[5m]) > 0

PrometheusTargetLimitHit

  • Description: Prometheus {{ $labels.namespace }}/{{ $labels.pod }} has dropped {{ printf "%.0f" $value }} targets because the number of targets exceeded the configured target_limit.

Expression

increase(prometheus_target_scrape_pool_exceeded_target_limit_total{job="shturval-metrics-prometheus",namespace="monitoring"}[5m]) > 0

PrometheusLabelLimitHit

  • Description: Prometheus {{ $labels.namespace }}/{{ $labels.pod }} has dropped {{ printf "%.0f" $value }} targets because some samples exceeded the configured label_limit, label_name_length_limit or label_value_length_limit.

Expression

increase(prometheus_target_scrape_pool_exceeded_label_limits_total{job="shturval-metrics-prometheus",namespace="monitoring"}[5m]) > 0

PrometheusScrapeBodySizeLimitHit

  • Description: Prometheus {{ $labels.namespace }}/{{ $labels.pod }} has failed {{ printf "%.0f" $value }} scrapes in the last 5m because some targets exceeded the configured body_size_limit.

Expression

increase(prometheus_target_scrapes_exceeded_body_size_limit_total{job="shturval-metrics-prometheus",namespace="monitoring"}[5m]) > 0

PrometheusScrapeSampleLimitHit

  • Description: Prometheus {{ $labels.namespace }}/{{ $labels.pod }} has failed {{ printf "%.0f" $value }} scrapes in the last 5m because some targets exceeded the configured sample_limit.

Expression

increase(prometheus_target_scrapes_exceeded_sample_limit_total{job="shturval-metrics-prometheus",namespace="monitoring"}[5m]) > 0

PrometheusTargetSyncFailure

  • Description: {{ printf "%.0f" $value }} targets in Prometheus {{ $labels.namespace }}/{{ $labels.pod }} have failed to sync because invalid configuration was supplied.

Expression

increase(prometheus_target_sync_failed_total{job="shturval-metrics-prometheus",namespace="monitoring"}[30m]) > 0

PrometheusHighQueryLoad

  • Description: Prometheus {{ $labels.namespace }}/{{ $labels.pod }} query API has less than 20% available capacity in its query engine for the last 15 minutes.

Expression

avg_over_time(prometheus_engine_queries{job="shturval-metrics-prometheus",namespace="monitoring"}[5m]) / max_over_time(prometheus_engine_queries_concurrent_max{job="shturval-metrics-prometheus",namespace="monitoring"}[5m]) > 0.8

PrometheusErrorSendingAlertsToAnyAlertmanager

  • Description: {{ printf "%.1f" $value }}% minimum errors while sending alerts from Prometheus {{ $labels.namespace }}/{{ $labels.pod }} to any Alertmanager.

Expression

min without (alertmanager) (
  rate(prometheus_notifications_errors_total{job="shturval-metrics-prometheus",namespace="monitoring",alertmanager!~``}[5m])
/
  rate(prometheus_notifications_sent_total{job="shturval-metrics-prometheus",namespace="monitoring",alertmanager!~``}[5m])
)
* 100
> 3

Group x509-certificate-exporter.rules

X509ExporterReadErrors

  • Description: Over the last 15 minutes, this x509-certificate-exporter instance has experienced errors reading certificate files or querying the Kubernetes API. This could be caused by a misconfiguration if triggered when the exporter starts.

Expression

delta(x509_read_errors[15m]) > 0

CertificateError

  • Description: Certificate could not be decoded {{ if $labels.secret_name }}in Kubernetes secret "{{ $labels.secret_namespace }}/{{ $labels.secret_name }}"{{ else }}at location "{{ $labels.filepath }}"{{ end }}

Expression

x509_cert_error > 0

CertificateRenewal

  • Description: Certificate for "{{ $labels.subject_CN }}" should be renewed {{ if $labels.secret_name }}in Kubernetes secret "{{ $labels.secret_namespace }}/{{ $labels.secret_name }}"{{ else }}at location "{{ $labels.filepath }}"{{ end }}

Expression

(x509_cert_not_after - time()) < (28 * 86400)

CertificateExpiration

  • Description: Certificate for "{{ $labels.subject_CN }}" is about to expire after {{ humanizeDuration $value }} {{ if $labels.secret_name }}in Kubernetes secret "{{ $labels.secret_namespace }}/{{ $labels.secret_name }}"{{ else }}at location "{{ $labels.filepath }}"{{ end }}

Expression

(x509_cert_not_after - time()) < (14 * 86400)
