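Adding alert rules and notification methods with Prometheus Operator
Configuring alerts
Edit /root/kube-prometheus/manifests/alertmanager-service.yaml and add type: NodePort under spec, so that the Alertmanager page can be reached from a browser
Run kubectl get svc -n monitoring to see the Alertmanager address and port; in this setup the status page is at http://172.16.0.6:31568/#/status
The Status page of Alertmanager shows the current Alertmanager configuration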
Config
global:
  resolve_timeout: 5m
  http_config: {}
  smtp_from: yunwei@hhotel.com
  smtp_hello: hhotel.com
  smtp_smarthost: smtp.qiye.aliyun.com:465
  smtp_auth_username: yunwei@hhotel.com
  smtp_auth_password: <secret>
  smtp_require_tls: true
  pagerduty_url: https://events.pagerduty.com/v2/enqueue
  hipchat_api_url: https://api.hipchat.com/
  opsgenie_api_url: https://api.opsgenie.com/
  wechat_api_url: https://qyapi.weixin.qq.com/cgi-bin/
  victorops_api_url: https://alert.victorops.com/integrations/generic/20131114/alert/
route:
... ...
This configuration actually comes from the file /root/kube-prometheus/manifests/alertmanager-secret.yaml, which defines a Secret named alertmanager-main
apiVersion: v1
data:
  alertmanager.yaml: Imdsb2JhbCI6IAogICJyZXNvbHZlX3RpbWVvdXQiOiAiNW0iCiJyZWNlaXZlcnMiOiAKLSAibmFtZSI6ICJudWxsIgoicm91dGUiOiAKICAiZ3JvdXBfYnkiOiAKICAtICJqb2IiCiAgImdyb3VwX2ludGVydmFsIjogIjVtIgogICJncm91cF93YWl0IjogIjMwcyIKICAicmVjZWl2ZXIiOiAibnVsbCIKICAicmVwZWF0X2ludGVydmFsIjogIjEyaCIKICAicm91dGVzIjogCiAgLSAibWF0Y2giOiAKICAgICAgImFsZXJ0bmFtZSI6ICJEZWFkTWFuc1N3aXRjaCIKICAgICJyZWNlaXZlciI6ICJudWxsIg==
kind: Secret
metadata:
  name: alertmanager-main
  namespace: monitoring
type: Opaque
The value stored under alertmanager.yaml can be base64-decoded:
echo Imdsb2JhbCI6IAogICJyZXNvbHZlX3RpbWVvdXQiOiAiNW0iCiJyZWNlaXZlcnMiOiAKLSAibmFtZSI6ICJudWxsIgoicm91dGUiOiAKICAiZ3JvdXBfYnkiOiAKICAtICJqb2IiCiAgImdyb3VwX2ludGVydmFsIjogIjVtIgogICJncm91cF93YWl0IjogIjMwcyIKICAicmVjZWl2ZXIiOiAibnVsbCIKICAicmVwZWF0X2ludGVydmFsIjogIjEyaCIKICAicm91dGVzIjogCiAgLSAibWF0Y2giOiAKICAgICAgImFsZXJ0bmFtZSI6ICJEZWFkTWFuc1N3aXRjaCIKICAgICJyZWNlaXZlciI6ICJudWxsIg== | base64 -d
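For readers who would rather inspect the value programmatically, here is a minimal Python sketch of the same decoding step. To stay short it decodes only the first 48 characters of the value above (base64 decodes in independent 4-character groups, so such a prefix is valid on its own); the variable names are illustrative and not part of kube-prometheus.

```python
import base64

# First 48 characters of the alertmanager.yaml value from the Secret above.
# A prefix whose length is a multiple of 4 can be decoded by itself.
encoded_prefix = "Imdsb2JhbCI6IAogICJyZXNvbHZlX3RpbWVvdXQiOiAiNW0i"

decoded = base64.b64decode(encoded_prefix).decode("utf-8")
print(decoded)

# Encoding the text again yields the same characters. This is the
# transformation Kubernetes expects for every value under `data:`.
reencoded = base64.b64encode(decoded.encode("utf-8")).decode("ascii")
print(reencoded == encoded_prefix)
```

The decoded prefix is the start of the Alertmanager configuration: the "global" key followed by "resolve_timeout": "5m".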
To customize the receivers or the message templates, regenerate the Secret named alertmanager-main from a new configuration file
vi alertmanager.yaml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.qiye.aliyun.com:465'
  smtp_from: 'yunwei@hhotel.com'
  smtp_auth_username: 'yunwei@hhotel.com'
  smtp_auth_password: 'aRXjq9W1jto^7^Zb'
  smtp_hello: 'hhotel.com'
  smtp_require_tls: true
templates:
  - "*.tmpl"
route:
  group_by: ['job', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 5m
  receiver: 'wechat'
  routes:
    - receiver: 'wechat'
      group_wait: 10s
      match:
        alertname: EtcdClusterUnavailable
receivers:
  - name: 'default'
    email_configs:
      - to: 'yunwei@hhotel.com'
        send_resolved: true
  - name: 'wechat'
    wechat_configs:
      - corp_id: 'wx02f71fb3dea46c16'
        to_party: '1'
        to_user: "renzhenxin"
        agent_id: '1'
        api_secret: 'r4OGerF_p4UrIN6QERCefJRxzpI0SquNG5gHCxGxcOM'
        send_resolved: true
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
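The inhibit_rules entry at the end of this file is the least obvious part of the configuration, so here is a small Python sketch of how such a rule is evaluated. It is a simplified model and not Alertmanager's implementation: the function names are made up, the three sample alerts are hypothetical, and only exact-match matchers are handled.

```python
def matches(alert, matchers):
    """True if the alert carries every label/value pair in `matchers`."""
    return all(alert.get(name) == value for name, value in matchers.items())


def is_inhibited(target, firing, rule):
    """Return True if some firing source alert mutes `target` under `rule`."""
    if not matches(target, rule["target_match"]):
        return False
    for source in firing:
        if source is target or not matches(source, rule["source_match"]):
            continue
        # A missing label is treated like an empty one, so a label absent
        # from both alerts (such as 'dev' here) still counts as equal.
        if all(source.get(l, "") == target.get(l, "") for l in rule["equal"]):
            return True
    return False


rule = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["alertname", "dev", "instance"],
}

# Hypothetical alerts, invented for illustration.
critical = {"alertname": "NodeDown", "instance": "10.0.0.1:9100", "severity": "critical"}
warning_same = {"alertname": "NodeDown", "instance": "10.0.0.1:9100", "severity": "warning"}
warning_other = {"alertname": "NodeDown", "instance": "10.0.0.2:9100", "severity": "warning"}

firing = [critical, warning_same, warning_other]
print(is_inhibited(warning_same, firing, rule))   # True
print(is_inhibited(warning_other, firing, rule))  # False
```

In other words, a warning is suppressed only while a critical alert with the same alertname and instance is firing; a warning on a different instance still goes out.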
Create the WeChat alert template wechat.tmpl
{{ define "wechat.default.message" }}
{{ range .Alerts }}
========start==========
Triggered at: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
Alert source: prometheus_alert
Severity: {{ .Labels.severity }}
Alert name: {{ .Labels.alertname }}
Instance: {{ .Labels.instance }}
Summary: {{ .Annotations.summary }}
Description: {{ .Annotations.description }}
========end==========
{{ end }}
{{ end }}
Delete the existing Secret, then create it again from the two files
kubectl delete secret alertmanager-main -n monitoring
kubectl create secret generic alertmanager-main --from-file=alertmanager.yaml --from-file=wechat.tmpl -n monitoring
Check the WeChat alert template inside the Alertmanager pod
kubectl exec -it alertmanager-main-0 -n monitoring -c alertmanager -- /bin/sh
ls /etc/alertmanager/config
cat /etc/alertmanager/config/wechat.tmpl
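The Config section on the Alertmanager Status page now shows the modified configuration
Configuring automatic service discovery
To have Prometheus Operator automatically discover and monitor Services that carry the prometheus.io/scrape=true annotation, Prometheus needs an additional scrape configuration, and each Service to be monitored must declare prometheus.io/scrape=true in its annotations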
vi prometheus-additional.yaml
- job_name: 'kubernetes-service-endpoints'
  kubernetes_sd_configs:
    - role: endpoints
  relabel_configs:
    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
      action: keep
      regex: true
    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
      action: replace
      target_label: __scheme__
      regex: (https?)
    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
      action: replace
      target_label: __metrics_path__
      regex: (.+)
    - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
      action: replace
      target_label: __address__
      regex: ([^:]+)(?::\d+)?;(\d+)
      replacement: $1:$2
    - action: labelmap
      regex: __meta_kubernetes_service_label_(.+)
    - source_labels: [__meta_kubernetes_namespace]
      action: replace
      target_label: kubernetes_namespace
    - source_labels: [__meta_kubernetes_service_name]
      action: replace
      target_label: kubernetes_name
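The rule that rewrites __address__ is the hardest one to read in this file, so the following Python sketch mimics what it does to a single target. It is an approximation under stated assumptions: Prometheus uses Go's RE2 engine rather than Python's re module, and rewrite_address is a made-up helper name. The sample address and port are the kube-dns values that appear later in this article.

```python
import re

# The pattern from the relabel rule. Prometheus anchors relabel regexes at
# both ends, which re.fullmatch reproduces here.
ADDRESS_RE = re.compile(r"([^:]+)(?::\d+)?;(\d+)")


def rewrite_address(address, port_annotation):
    """Return the scrape address after the `replace` rule on __address__."""
    # The values of source_labels are joined with ';' (the default separator).
    joined = f"{address};{port_annotation}"
    match = ADDRESS_RE.fullmatch(joined)
    if match is None:
        # No match: the rule does nothing and __address__ keeps its value.
        return address
    # '$1:$2' in the rule corresponds to r'\1:\2' in Python's syntax.
    return match.expand(r"\1:\2")


print(rewrite_address("192.168.73.66:53", "9153"))  # 192.168.73.66:9153
print(rewrite_address("192.168.73.66:53", ""))      # 192.168.73.66:53
```

So when a Service sets prometheus.io/port, every discovered endpoint address has its port replaced by the annotated one; without the annotation the address is left as discovered.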
Create a Secret object from this file
kubectl create secret generic additional-configs --from-file=prometheus-additional.yaml -n monitoring
kubectl get secret additional-configs -n monitoring -o yaml
Reference the new additional configuration in the Prometheus resource object by adding the following under spec
  additionalScrapeConfigs:
    name: additional-configs
    key: prometheus-additional.yaml
Full configuration: cat /root/kube-prometheus/manifests/prometheus-prometheus.yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  labels:
    prometheus: k8s
  name: k8s
  namespace: monitoring
spec:
  alerting:
    alertmanagers:
    - name: alertmanager-main
      namespace: monitoring
      port: web
  baseImage: quay.io/prometheus/prometheus
  nodeSelector:
    kubernetes.io/os: linux
  podMonitorSelector: {}
  replicas: 2
  secrets:
  - etcd-certs
  resources:
    requests:
      memory: 400Mi
  ruleSelector:
    matchLabels:
      prometheus: k8s
      role: alert-rules
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  additionalScrapeConfigs:
    name: additional-configs
    key: prometheus-additional.yaml
  serviceAccountName: prometheus-k8s
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector: {}
  version: v2.11.0
kubectl apply -f prometheus-prometheus.yaml
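After a short while, check in the Prometheus web UI that the configuration has taken effect by searching the configuration page for the keyword kubernetes-service-endpoints
kubectl logs -f prometheus-k8s-0 -c prometheus -n monitoring
The log shows many errors of the form xxx is forbidden, which points to an RBAC permission problem. The Prometheus resource object shows that Prometheus runs under a ServiceAccount named prometheus-k8s, and that ServiceAccount is bound to a ClusterRole that is also named prometheus-k8s, so this ClusterRole has to grant access to the resources that service discovery reads
Edit prometheus-clusterRole.yaml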
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-k8s
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  - services
  - endpoints
  - pods
  - nodes/proxy
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - configmaps
  - nodes/metrics
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get
kubectl apply -f prometheus-clusterRole.yaml
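To make the effect of these rules concrete, here is a rough Python model of the permission check. It is only a sketch: it copies the two resource rules from the ClusterRole above, ignores API groups, namespaces, and the nonResourceURLs rule, and the name is_allowed is invented for this example.

```python
# Resource rules copied from the ClusterRole above (core API group only;
# the nonResourceURLs rule is left out of this sketch).
RULES = [
    {
        "resources": {"nodes", "services", "endpoints", "pods", "nodes/proxy"},
        "verbs": {"get", "list", "watch"},
    },
    {
        "resources": {"configmaps", "nodes/metrics"},
        "verbs": {"get"},
    },
]


def is_allowed(verb, resource, rules=RULES):
    """True if at least one rule grants `verb` on `resource`."""
    return any(verb in r["verbs"] and resource in r["resources"] for r in rules)


# Endpoints discovery lists and watches endpoints, services, and pods.
for resource in ("endpoints", "services", "pods"):
    print(resource, is_allowed("list", resource), is_allowed("watch", resource))

print(is_allowed("list", "configmaps"))  # False: only 'get' is granted
```

A request that no rule covers is rejected by the API server, which is what produces the "is forbidden" messages in the Prometheus log.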
The Targets page of Prometheus now shows that a service on port 9153 has been discovered automatically; this is kube-dns
[root@k8s03 manifests]# kubectl describe svc kube-dns -n kube-system
Name:              kube-dns
Namespace:         kube-system
Labels:            k8s-app=kube-dns
                   kubernetes.io/cluster-service=true
                   kubernetes.io/name=KubeDNS
Annotations:       prometheus.io/port: 9153
                   prometheus.io/scrape: true
Selector:          k8s-app=kube-dns
Type:              ClusterIP
IP:                10.96.0.10
Port:              dns  53/UDP
TargetPort:        53/UDP
Endpoints:         192.168.73.66:53,192.168.73.67:53
Port:              dns-tcp  53/TCP
TargetPort:        53/TCP
Endpoints:         192.168.73.66:53,192.168.73.67:53
Port:              metrics  9153/TCP
TargetPort:        9153/TCP
Endpoints:         192.168.73.66:9153,192.168.73.67:9153
Session Affinity:  None
Events:            <none>
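The two annotations in this output are what make kube-dns eligible for scraping. As a hedged illustration, the Python sketch below shows how an annotation name turns into the meta label that the relabel rules read, and how the keep rule decides. The helper names meta_label and is_kept are invented here, and Python's re module stands in for Prometheus's RE2 engine.

```python
import re


def meta_label(annotation_name):
    """Meta label name under which a Service annotation appears."""
    # Characters that are not valid in a label name become underscores.
    sanitized = re.sub(r"[^a-zA-Z0-9_]", "_", annotation_name)
    return "__meta_kubernetes_service_annotation_" + sanitized


def is_kept(annotations):
    """Apply the `keep` rule: the scrape annotation must be exactly 'true'."""
    value = annotations.get("prometheus.io/scrape", "")
    return re.fullmatch("true", value) is not None


# Annotations of the kube-dns Service shown above.
kube_dns = {"prometheus.io/port": "9153", "prometheus.io/scrape": "true"}

for name in kube_dns:
    print(name, "->", meta_label(name))
print(is_kept(kube_dns))  # True
print(is_kept({}))        # False: Services without the annotation are dropped
```

Any other Service can opt in the same way, by carrying prometheus.io/scrape: true and, where the metrics port differs from the service port, prometheus.io/port.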
Title: Adding alert rules and notification methods with Prometheus Operator
Author: fish2018
URL: https://www.devopser.org/articles/2019/08/21/1566379859249.html