Commit 45e6091

Merge pull request #957 from lsst-it/IT-6303_alertmanager_tuning
(fleet/prometheus-alerts) add pvc alert
2 parents 86b1526 + aa6aa17 commit 45e6091

4 files changed

+29
-19
lines changed

fleet/lib/kube-prometheus-stack/overlays/antu/values.yaml

Lines changed: 1 addition & 1 deletion
@@ -192,7 +192,7 @@ alertmanager:
 - site
 group_wait: 30s
 group_interval: 5m
-repeat_interval: 24h
+repeat_interval: 120h
 receiver: blackhole
 routes:
 - receiver: blackhole

fleet/lib/prometheus-alerts/README.md

Lines changed: 7 additions & 17 deletions
@@ -13,20 +13,10 @@ file with the `rules.namespace` key.
 
 ## Prometheus rule AURA standards
 
-* `summary` annotation: The `summary` annotation is used to be able to describe a
-group of alerts incomming. This annotation DOES NOT contain any templated
-variables and provides a simple single sentence summary of what the alert is
-about. For example "Disk space full in 24h". When a cluster triggers several
-alerts, it can be hany to group these alerts into a single notification, this
-is when the `summary` can be used.
-* `discription` annotation: This provides a detailed overview of the alert
-specifically to this instance of the alert. It MAY contain templated variables
-to enrich the message.
-* `receiver` label: The receiver label is used by alertmanager to decide on the
-routing of the notification for the alert. It exists out of `,` seperated list
-of receivers, pre- and suffixed with `,` to make regex matching easier in the
-alertmanager. For example: `,slack,squadcast,email,` The receivers are defined
-in the alertmanager configuration.
-Currently (20240503) the following receivers are configured:
-* `slack-test`
-* `squadcast-test`
+* `summary` annotation: This annotation MAY contain a templated variable to differentiate between hosts, pods, clusters, etc. and provides a simple single sentence summary of what the alert is about. For example, "Disk space full in acme.lsst.org". When a cluster triggers several alerts, it can be helpful to group these alerts into a single notification. A distinctive summary, it is also useful as a title for Jira tickets.
+* `description` annotation: This provides a detailed overview of the alert specifically to this instance of the alert. It MAY contain templated variables to enrich the message.
+* routing label: Rubin uses labels to route alerts. The label is used by alertmanager to determine the routing of the notification for the alert. By default, all alerts should be routed to Squadcast. The escalation and notification will be handled by Squadcast API.
+
+Currently (20250616) the following receivers are configured:
+* `gnocpush`: Requires label `gnoc: "true"`
+* `squadcast-alertmanager`: Requires label `prod: "true"`. In most cases this should be the label of the alert.
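
For illustration only, the label-based routing described in the README change above could correspond to an Alertmanager route tree along the following lines. This is a sketch under assumptions, not the configuration in this repository: the receiver names and required labels come from the README text and `blackhole` from the `values.yaml` hunk, while the ordering of the routes and the use of `continue` are hypothetical.

```yaml
# Hypothetical sketch of label-based routing; the real route tree may differ.
route:
  receiver: blackhole            # default receiver, as seen in the antu overlay
  routes:
    - receiver: squadcast-alertmanager
      matchers:
        - prod = "true"          # label required by the README for Squadcast
      continue: true             # assumption: keep evaluating later routes
    - receiver: gnocpush
      matchers:
        - gnoc = "true"          # label required by the README for gnocpush
```

With a tree like this, an alert carrying `prod: "true"` (such as `PVCLowFreeSpace` in this commit) would reach `squadcast-alertmanager`, and alerts without a matching label would fall through to `blackhole`.
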
Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
+groups:
+  - name: k8s.rules
+    rules:
+      - alert: PVCLowFreeSpace
+        annotations:
+          summary: PVC {{ $labels.persistentvolumeclaim }} is low on free space
+          description: >
+            PVC {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }}
+            has less than 20% free space.
+        expr: |
+          (kubelet_volume_stats_available_bytes{job="kubelet",metrics_path="/metrics",namespace=~".*"}
+          / kubelet_volume_stats_capacity_bytes{job="kubelet",metrics_path="/metrics",namespace=~".*"}) < 0.20
+          and kubelet_volume_stats_used_bytes{job="kubelet",metrics_path="/metrics",namespace=~".*"} > 0
+        for: 2m
+        labels:
+          prod: "true"
+          severity: warning
+          node_name: '{{ $labels.prom_cluster }}'
+          device: null
+          service_name: null

fleet/lib/prometheus-alerts/rules/prometheusrule-net.yaml

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ groups:
       - alert: HostDown
         annotations:
           summary: Host {{ $labels.instance }} is down
-          description: Host {{ $labels.instance }} is down. Maybe it is on fire??? 🗑🔥
+          description: Host {{ $labels.instance }} is down. Maybe it is on fire???
         expr: probe_success != 1
         for: 1m
         labels:
