第310集:Node-Exporter+AlertManager告警系统架构师实战:邮件+钉钉通知、智能告警规则与故障自愈策略

前言

Node-Exporter负责主机指标采集,Prometheus负责指标存储与告警规则评估,AlertManager负责告警路由与通知,三者共同构成企业级监控告警系统的核心,理解其架构原理和告警策略对架构师来说至关重要。本文将深入解析这套告警系统的架构,从监控指标采集到智能告警规则,并提供完整的邮件+钉钉通知方案。

一、Node-Exporter+AlertManager架构深度解析

1.1 整体架构设计

Node-Exporter+AlertManager告警系统采用分层架构设计,包含数据采集层、数据存储层、告警处理层、通知渠道层和展示层。

核心组件介绍

# 核心组件配置
components:
  # 数据采集层
  exporters:
    - name: "node-exporter"
      port: 9100
      metrics: ["cpu", "memory", "disk", "network"]

    - name: "mysql-exporter"
      port: 9104
      metrics: ["connections", "queries", "replication"]

    - name: "redis-exporter"
      port: 9121
      metrics: ["memory", "keys", "operations"]

  # 数据存储层
  prometheus:
    port: 9090
    storage: "local"
    retention: "15d"

  # 告警处理层
  alertmanager:
    port: 9093
    config_file: "/etc/alertmanager/alertmanager.yml"

  # 通知渠道层
  notifications:
    - email
    - dingtalk
    - sms
    - webhook
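各组件就位后,可以用下面的小脚本快速确认数据采集层的各个/metrics端点是否可达。脚本中的地址和端口沿用上面配置里的默认值,并假设Exporter部署在本机,实际环境请按需调整:

#!/bin/bash
# 验证各Exporter的/metrics端点是否可达(端口取自上文配置,仅作示例)
for target in localhost:9100 localhost:9104 localhost:9121; do
    if curl -s -o /dev/null -w "%{http_code}" "http://${target}/metrics" | grep -q "200"; then
        echo "✅ ${target} 指标端点正常"
    else
        echo "❌ ${target} 指标端点不可达"
    fi
done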

1.2 Node-Exporter部署与配置

#!/bin/bash
# Node-Exporter部署脚本

# 配置参数
NODE_EXPORTER_VERSION="1.6.1"
INSTALL_DIR="/opt/node_exporter"
SERVICE_USER="node_exporter"
CONFIG_DIR="/etc/node_exporter"

# 创建用户
create_user() {
    echo "创建Node-Exporter用户..."
    useradd --no-create-home --shell /bin/false $SERVICE_USER
}

# 下载安装
install_node_exporter() {
    echo "下载Node-Exporter..."
    cd /tmp
    wget https://github.com/prometheus/node_exporter/releases/download/v${NODE_EXPORTER_VERSION}/node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz

    echo "解压安装..."
    tar xzf node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz
    mkdir -p $INSTALL_DIR
    cp node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64/node_exporter $INSTALL_DIR/
    chown $SERVICE_USER:$SERVICE_USER $INSTALL_DIR/node_exporter
    chmod +x $INSTALL_DIR/node_exporter
}

# 创建配置文件
create_config() {
    echo "创建配置文件..."
    mkdir -p $CONFIG_DIR

    cat > $CONFIG_DIR/node_exporter.yml << EOF
# Node-Exporter收集器配置(供部署参考,node_exporter本身通过systemd单元中的--collector.*命令行参数进行配置)
global:
  scrape_interval: 15s
  evaluation_interval: 15s

# 自定义指标收集
collectors:
  enabled:
    - cpu
    - memory
    - disk
    - network
    - filesystem
    - systemd
    - textfile

  disabled:
    - hwmon
    - ipvs
    - nfs
    - ntp

# 文本文件收集器配置
textfile:
  directory: /var/lib/node_exporter/textfile_collector
  pattern: "*.prom"

# 系统服务收集器配置
systemd:
  unit_whitelist: ".*\\.service$"
  unit_blacklist: ".*\\.mount$"
  enable_restarts_metrics: true
  enable_start_time_metrics: true
EOF

    chown -R $SERVICE_USER:$SERVICE_USER $CONFIG_DIR
}

# 创建systemd服务
create_service() {
    echo "创建systemd服务..."

    cat > /etc/systemd/system/node_exporter.service << EOF
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=$SERVICE_USER
Group=$SERVICE_USER
Type=simple
ExecStart=$INSTALL_DIR/node_exporter \\
    --web.listen-address=:9100 \\
    --web.telemetry-path=/metrics \\
    --log.level=info \\
    --collector.textfile.directory=/var/lib/node_exporter/textfile_collector \\
    --collector.systemd \\
    --collector.systemd.unit-include=".*\\.service$" \\
    --collector.systemd.unit-exclude=".*\\.mount$" \\
    --collector.systemd.enable-restarts-metrics \\
    --collector.systemd.enable-start-time-metrics

Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

    systemctl daemon-reload
    systemctl enable node_exporter
    systemctl start node_exporter
}

# 创建文本文件收集器目录
create_textfile_collector() {
    echo "创建文本文件收集器目录..."
    mkdir -p /var/lib/node_exporter/textfile_collector
    chown -R $SERVICE_USER:$SERVICE_USER /var/lib/node_exporter

    # 创建示例指标文件
    cat > /var/lib/node_exporter/textfile_collector/custom_metrics.prom << EOF
# HELP custom_metric_example A custom metric example
# TYPE custom_metric_example gauge
custom_metric_example{instance="example"} 1
EOF
}

# 主流程
main() {
    echo "开始部署Node-Exporter..."

    create_user
    install_node_exporter
    create_config
    # 先创建textfile目录,再启动服务,避免服务启动时目录不存在
    create_textfile_collector
    create_service

    echo "Node-Exporter部署完成"
    echo "服务状态:"
    systemctl status node_exporter
}

# 执行主流程
main "$@"
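脚本执行完成后,可以按下面的方式做一次简单验证,确认服务在运行且核心指标已经暴露(假设Node-Exporter监听默认的9100端口):

# 确认服务处于运行状态
systemctl is-active node_exporter

# 抓取一次指标,确认CPU、内存等核心指标已暴露
curl -s http://localhost:9100/metrics | grep -E '^node_(cpu_seconds_total|memory_MemAvailable_bytes)' | head -n 5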

1.3 Prometheus配置

# Prometheus配置文件
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'production'
    replica: 'prometheus-01'

# 告警规则文件
rule_files:
  - "/etc/prometheus/rules/*.yml"

# 告警管理器配置
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# 抓取配置
scrape_configs:
  # Prometheus自身监控
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node-Exporter监控
  - job_name: 'node-exporter'
    static_configs:
      - targets:
          - 'node1:9100'
          - 'node2:9100'
          - 'node3:9100'
    scrape_interval: 15s
    metrics_path: /metrics
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: '([^:]+):.*'
        replacement: '${1}'

  # MySQL-Exporter监控
  - job_name: 'mysql-exporter'
    static_configs:
      - targets:
          - 'mysql1:9104'
          - 'mysql2:9104'
    scrape_interval: 30s
    metrics_path: /metrics

  # Redis-Exporter监控
  - job_name: 'redis-exporter'
    static_configs:
      - targets:
          - 'redis1:9121'
          - 'redis2:9121'
    scrape_interval: 30s
    metrics_path: /metrics

  # 服务发现配置
  - job_name: 'consul-services'
    consul_sd_configs:
      - server: 'consul:8500'
        services: []
    relabel_configs:
      - source_labels: [__meta_consul_service]
        target_label: job
      - source_labels: [__meta_consul_service_id]
        target_label: instance
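修改Prometheus配置后,建议先用promtool做语法校验,再热加载并确认各抓取目标的健康状态。以下命令假设配置位于/etc/prometheus/prometheus.yml、Prometheus监听9090且启动参数包含--web.enable-lifecycle,未开启该参数时可改为重启服务来加载:

# 校验配置语法
promtool check config /etc/prometheus/prometheus.yml

# 热加载配置(需启用--web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload

# 查看各抓取目标的健康状态
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, instance: .labels.instance, health: .health}'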

二、AlertManager配置与告警规则

2.1 AlertManager配置

# AlertManager配置文件
global:
  smtp_smarthost: 'smtp.company.com:587'
  smtp_from: 'alerts@company.com'
  smtp_auth_username: 'alerts@company.com'
  smtp_auth_password: 'password'
  smtp_require_tls: true

# 告警路由配置
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'default-receiver'
  routes:
    # 按服务路由
    - match:
        service: 'mysql'
      receiver: 'mysql-receiver'
      group_wait: 5s
      repeat_interval: 30m

    - match:
        service: 'redis'
      receiver: 'redis-receiver'
      group_wait: 5s
      repeat_interval: 30m

    # 按告警级别路由
    - match:
        severity: 'critical'
      receiver: 'critical-receiver'
      group_wait: 0s
      repeat_interval: 5m

    - match:
        severity: 'warning'
      receiver: 'warning-receiver'
      group_wait: 30s
      repeat_interval: 1h

# 告警抑制配置
inhibit_rules:
  # 抑制相同实例的警告告警
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']

  # 抑制相同服务的多个告警
  - source_match:
      alertname: 'ServiceDown'
    target_match_re:
      alertname: '.*High.*'
    equal: ['service']

# 接收器配置
receivers:
  # 默认接收器
  - name: 'default-receiver'
    email_configs:
      - to: 'admin@company.com'
        subject: '{{ .GroupLabels.alertname }} - {{ .GroupLabels.cluster }}'
        body: |
          {{ range .Alerts }}
          告警名称: {{ .Annotations.summary }}
          告警级别: {{ .Labels.severity }}
          实例: {{ .Labels.instance }}
          服务: {{ .Labels.service }}
          时间: {{ .StartsAt }}
          {{ end }}

  # MySQL接收器
  - name: 'mysql-receiver'
    email_configs:
      - to: 'mysql-team@company.com'
        subject: 'MySQL告警 - {{ .GroupLabels.instance }}'
        body: |
          {{ range .Alerts }}
          MySQL告警: {{ .Annotations.summary }}
          实例: {{ .Labels.instance }}
          数据库: {{ .Labels.database }}
          时间: {{ .StartsAt }}
          {{ end }}

    webhook_configs:
      - url: 'http://dingtalk-webhook:8080/dingtalk'
        send_resolved: true

  # Redis接收器
  - name: 'redis-receiver'
    email_configs:
      - to: 'redis-team@company.com'
        subject: 'Redis告警 - {{ .GroupLabels.instance }}'
        body: |
          {{ range .Alerts }}
          Redis告警: {{ .Annotations.summary }}
          实例: {{ .Labels.instance }}
          时间: {{ .StartsAt }}
          {{ end }}

    webhook_configs:
      - url: 'http://dingtalk-webhook:8080/dingtalk'
        send_resolved: true

  # 严重告警接收器
  - name: 'critical-receiver'
    email_configs:
      - to: 'oncall@company.com'
        subject: '严重告警 - {{ .GroupLabels.alertname }}'
        body: |
          {{ range .Alerts }}
          严重告警: {{ .Annotations.summary }}
          描述: {{ .Annotations.description }}
          实例: {{ .Labels.instance }}
          服务: {{ .Labels.service }}
          时间: {{ .StartsAt }}
          {{ end }}

    webhook_configs:
      - url: 'http://dingtalk-webhook:8080/dingtalk'
        send_resolved: true

    # 短信通知(AlertManager原生不支持sms_configs,需通过webhook对接短信服务商实现)
    # sms_configs:
    #   - to: '+8613800138000'
    #     text: '严重告警: {{ .GroupLabels.alertname }}'

  # 警告告警接收器
  - name: 'warning-receiver'
    email_configs:
      - to: 'dev-team@company.com'
        subject: '警告告警 - {{ .GroupLabels.alertname }}'
        body: |
          {{ range .Alerts }}
          警告告警: {{ .Annotations.summary }}
          实例: {{ .Labels.instance }}
          服务: {{ .Labels.service }}
          时间: {{ .StartsAt }}
          {{ end }}
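AlertManager配置同样建议先校验再加载,并可以通过API手工注入一条测试告警来验证路由与通知链路是否打通。以下命令假设配置位于/etc/alertmanager/alertmanager.yml、AlertManager监听9093:

# 校验配置语法
amtool check-config /etc/alertmanager/alertmanager.yml

# 热加载配置
curl -X POST http://localhost:9093/-/reload

# 注入一条测试告警,验证critical路由与通知渠道
curl -X POST http://localhost:9093/api/v2/alerts \
    -H 'Content-Type: application/json' \
    -d '[{
        "labels": {"alertname": "RouteTest", "severity": "critical", "service": "system", "instance": "test"},
        "annotations": {"summary": "路由测试告警", "description": "用于验证AlertManager路由与通知配置"}
    }]'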

2.2 告警规则配置

# 系统告警规则
groups:
  # 系统资源告警
  - name: 'system-alerts'
    rules:
      # CPU使用率告警
      - alert: 'HighCPUUsage'
        expr: '100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80'
        for: '5m'
        labels:
          severity: 'warning'
          service: 'system'
        annotations:
          summary: 'CPU使用率过高'
          description: '实例 {{ $labels.instance }} CPU使用率超过80%,当前值: {{ $value }}%'

      - alert: 'CriticalCPUUsage'
        expr: '100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 95'
        for: '2m'
        labels:
          severity: 'critical'
          service: 'system'
        annotations:
          summary: 'CPU使用率严重过高'
          description: '实例 {{ $labels.instance }} CPU使用率超过95%,当前值: {{ $value }}%'

      # 内存使用率告警
      - alert: 'HighMemoryUsage'
        expr: '(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 80'
        for: '5m'
        labels:
          severity: 'warning'
          service: 'system'
        annotations:
          summary: '内存使用率过高'
          description: '实例 {{ $labels.instance }} 内存使用率超过80%,当前值: {{ $value }}%'

      - alert: 'CriticalMemoryUsage'
        expr: '(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 95'
        for: '2m'
        labels:
          severity: 'critical'
          service: 'system'
        annotations:
          summary: '内存使用率严重过高'
          description: '实例 {{ $labels.instance }} 内存使用率超过95%,当前值: {{ $value }}%'

      # 磁盘使用率告警
      - alert: 'HighDiskUsage'
        expr: '(1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100 > 80'
        for: '5m'
        labels:
          severity: 'warning'
          service: 'system'
        annotations:
          summary: '磁盘使用率过高'
          description: '实例 {{ $labels.instance }} 磁盘使用率超过80%,当前值: {{ $value }}%'

      - alert: 'CriticalDiskUsage'
        expr: '(1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100 > 95'
        for: '2m'
        labels:
          severity: 'critical'
          service: 'system'
        annotations:
          summary: '磁盘使用率严重过高'
          description: '实例 {{ $labels.instance }} 磁盘使用率超过95%,当前值: {{ $value }}%'

      # 网络流量告警
      - alert: 'HighNetworkTraffic'
        expr: 'rate(node_network_receive_bytes_total[5m]) > 100000000'
        for: '5m'
        labels:
          severity: 'warning'
          service: 'system'
        annotations:
          summary: '网络接收流量过高'
          description: '实例 {{ $labels.instance }} 网络接收流量过高,当前值: {{ $value }} bytes/s'

      # 系统负载告警
      - alert: 'HighSystemLoad'
        expr: 'node_load1 > 5'
        for: '5m'
        labels:
          severity: 'warning'
          service: 'system'
        annotations:
          summary: '系统负载过高'
          description: '实例 {{ $labels.instance }} 系统负载过高,当前值: {{ $value }}'

  # MySQL告警规则
  - name: 'mysql-alerts'
    rules:
      # MySQL连接数告警
      - alert: 'MySQLHighConnections'
        expr: 'mysql_global_status_threads_connected / mysql_global_variables_max_connections * 100 > 80'
        for: '5m'
        labels:
          severity: 'warning'
          service: 'mysql'
        annotations:
          summary: 'MySQL连接数过高'
          description: 'MySQL实例 {{ $labels.instance }} 连接数使用率超过80%,当前值: {{ $value }}%'

      - alert: 'MySQLCriticalConnections'
        expr: 'mysql_global_status_threads_connected / mysql_global_variables_max_connections * 100 > 95'
        for: '2m'
        labels:
          severity: 'critical'
          service: 'mysql'
        annotations:
          summary: 'MySQL连接数严重过高'
          description: 'MySQL实例 {{ $labels.instance }} 连接数使用率超过95%,当前值: {{ $value }}%'

      # MySQL慢查询告警
      - alert: 'MySQLSlowQueries'
        expr: 'rate(mysql_global_status_slow_queries[5m]) > 10'
        for: '5m'
        labels:
          severity: 'warning'
          service: 'mysql'
        annotations:
          summary: 'MySQL慢查询过多'
          description: 'MySQL实例 {{ $labels.instance }} 慢查询过多,当前值: {{ $value }} queries/s'

      # MySQL复制延迟告警
      - alert: 'MySQLReplicationLag'
        expr: 'mysql_slave_lag_seconds > 60'
        for: '5m'
        labels:
          severity: 'warning'
          service: 'mysql'
        annotations:
          summary: 'MySQL复制延迟过高'
          description: 'MySQL实例 {{ $labels.instance }} 复制延迟过高,当前值: {{ $value }}秒'

  # Redis告警规则
  - name: 'redis-alerts'
    rules:
      # Redis内存使用率告警
      - alert: 'RedisHighMemoryUsage'
        expr: 'redis_memory_used_bytes / redis_memory_max_bytes * 100 > 80'
        for: '5m'
        labels:
          severity: 'warning'
          service: 'redis'
        annotations:
          summary: 'Redis内存使用率过高'
          description: 'Redis实例 {{ $labels.instance }} 内存使用率超过80%,当前值: {{ $value }}%'

      - alert: 'RedisCriticalMemoryUsage'
        expr: 'redis_memory_used_bytes / redis_memory_max_bytes * 100 > 95'
        for: '2m'
        labels:
          severity: 'critical'
          service: 'redis'
        annotations:
          summary: 'Redis内存使用率严重过高'
          description: 'Redis实例 {{ $labels.instance }} 内存使用率超过95%,当前值: {{ $value }}%'

      # Redis连接数告警
      - alert: 'RedisHighConnections'
        expr: 'redis_connected_clients > 1000'
        for: '5m'
        labels:
          severity: 'warning'
          service: 'redis'
        annotations:
          summary: 'Redis连接数过高'
          description: 'Redis实例 {{ $labels.instance }} 连接数过高,当前值: {{ $value }}'

  # 服务可用性告警
  - name: 'service-alerts'
    rules:
      # 服务宕机告警
      - alert: 'ServiceDown'
        expr: 'up == 0'
        for: '1m'
        labels:
          severity: 'critical'
          service: 'monitoring'
        annotations:
          summary: '服务宕机'
          description: '服务 {{ $labels.job }} 在实例 {{ $labels.instance }} 上宕机'

      # 服务响应时间告警
      - alert: 'HighResponseTime'
        expr: 'http_request_duration_seconds{quantile="0.95"} > 1'
        for: '5m'
        labels:
          severity: 'warning'
          service: 'application'
        annotations:
          summary: '服务响应时间过长'
          description: '服务 {{ $labels.job }} 响应时间过长,当前值: {{ $value }}秒'
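告警规则文件加载前同样可以用promtool校验语法,并确认规则组已被Prometheus正确加载、是否存在正在触发的告警(假设规则文件位于/etc/prometheus/rules/目录):

# 校验所有规则文件语法
promtool check rules /etc/prometheus/rules/*.yml

# 查看已加载的规则组以及当前告警状态
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].name'
curl -s http://localhost:9090/api/v1/alerts | jq '.data.alerts[] | {alertname: .labels.alertname, state: .state}'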

三、钉钉通知集成

3.1 钉钉Webhook配置

# 钉钉Webhook服务
import base64
import hashlib
import hmac
import logging
import time
import urllib.parse
from datetime import datetime

import requests
from flask import Flask, request, jsonify

app = Flask(__name__)
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class DingTalkNotifier:
    def __init__(self, webhook_url, secret=None):
        self.webhook_url = webhook_url
        self.secret = secret

    def get_signed_url(self):
        """若配置了加签密钥,按钉钉加签规则生成带timestamp与sign参数的URL"""
        if not self.secret:
            return self.webhook_url

        timestamp = str(round(time.time() * 1000))
        string_to_sign = f'{timestamp}\n{self.secret}'
        hmac_code = hmac.new(
            self.secret.encode('utf-8'),
            string_to_sign.encode('utf-8'),
            digestmod=hashlib.sha256
        ).digest()
        sign = urllib.parse.quote_plus(base64.b64encode(hmac_code))
        return f'{self.webhook_url}&timestamp={timestamp}&sign={sign}'

    def send_alert(self, alert_data):
        """发送告警到钉钉"""
        try:
            # 构建消息内容
            message = self.build_message(alert_data)
            if message is None:
                logger.warning("告警数据为空,跳过发送")
                return False

            # 发送消息
            response = requests.post(
                self.get_signed_url(),
                json=message,
                headers={'Content-Type': 'application/json'},
                timeout=10
            )

            if response.status_code == 200:
                result = response.json()
                if result.get('errcode') == 0:
                    logger.info("钉钉消息发送成功")
                    return True
                else:
                    logger.error(f"钉钉消息发送失败: {result.get('errmsg')}")
                    return False
            else:
                logger.error(f"钉钉消息发送失败: HTTP {response.status_code}")
                return False

        except Exception as e:
            logger.error(f"发送钉钉消息异常: {str(e)}")
            return False

    def build_message(self, alert_data):
        """构建钉钉消息"""
        alerts = alert_data.get('alerts', [])

        if not alerts:
            return None

        # 获取告警信息(此处仅取第一条,多条告警会由AlertManager按分组分别回调)
        alert = alerts[0]
        status = alert.get('status', 'unknown')
        labels = alert.get('labels', {})
        annotations = alert.get('annotations', {})

        # 构建消息标题
        if status == 'firing':
            title = f"🚨 告警通知 - {labels.get('alertname', 'Unknown')}"
        else:
            title = f"✅ 告警恢复 - {labels.get('alertname', 'Unknown')}"

        # 构建消息文本
        message_text = f"""
**{title}**

**告警级别**: {labels.get('severity', 'unknown')}
**服务**: {labels.get('service', 'unknown')}
**实例**: {labels.get('instance', 'unknown')}
**时间**: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}

**描述**: {annotations.get('description', 'No description')}
**摘要**: {annotations.get('summary', 'No summary')}
"""

        # 构建钉钉消息
        message = {
            "msgtype": "markdown",
            "markdown": {
                "title": title,
                "text": message_text
            },
            "at": {
                "atMobiles": self.get_at_mobiles(labels),
                "isAtAll": False
            }
        }

        return message

    def get_at_mobiles(self, labels):
        """获取需要@的手机号"""
        # 根据告警级别和服务获取需要@的手机号
        severity = labels.get('severity', '')
        service = labels.get('service', '')

        at_mobiles = []

        if severity == 'critical':
            # 严重告警@值班人员
            at_mobiles.extend(['13800138000', '13800138001'])

        if service == 'mysql':
            # MySQL告警@DBA团队
            at_mobiles.extend(['13800138002', '13800138003'])

        return at_mobiles


# 全局钉钉通知器
dingtalk_notifier = DingTalkNotifier(
    webhook_url="https://oapi.dingtalk.com/robot/send?access_token=your_token",
    secret="your_secret"
)


@app.route('/dingtalk', methods=['POST'])
def dingtalk_webhook():
    """钉钉Webhook接口"""
    try:
        # 获取告警数据
        alert_data = request.get_json()

        if not alert_data:
            return jsonify({'error': 'No alert data'}), 400

        # 发送钉钉通知
        success = dingtalk_notifier.send_alert(alert_data)

        if success:
            return jsonify({'status': 'success'}), 200
        else:
            return jsonify({'status': 'failed'}), 500

    except Exception as e:
        logger.error(f"处理钉钉Webhook异常: {str(e)}")
        return jsonify({'error': str(e)}), 500


@app.route('/health', methods=['GET'])
def health_check():
    """健康检查接口"""
    return jsonify({'status': 'healthy'}), 200


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080, debug=False)
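这个Webhook服务可以先在本地启动,并用一条模拟的AlertManager回调做联调。以下命令假设脚本保存为/opt/scripts/dingtalk_webhook.py、监听8080端口:

# 安装依赖并启动服务
pip3 install flask requests
python3 /opt/scripts/dingtalk_webhook.py &

# 模拟AlertManager的webhook回调
curl -X POST http://localhost:8080/dingtalk \
    -H 'Content-Type: application/json' \
    -d '{"alerts": [{"status": "firing", "labels": {"alertname": "WebhookTest", "severity": "warning", "instance": "test", "service": "system"}, "annotations": {"summary": "Webhook联调测试", "description": "验证钉钉Webhook服务是否工作"}}]}'

# 健康检查
curl -s http://localhost:8080/health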

3.2 钉钉机器人配置

#!/bin/bash
# 钉钉机器人配置脚本

# 钉钉机器人配置
DINGTALK_WEBHOOK="https://oapi.dingtalk.com/robot/send?access_token=your_token"
DINGTALK_SECRET="your_secret"

# 创建钉钉通知脚本
create_dingtalk_script() {
    mkdir -p /opt/scripts

    # 外层heredoc使用独立的分隔符,避免与脚本内部的heredoc/EOF冲突
    cat > /opt/scripts/dingtalk_notify.sh << 'SCRIPT_EOF'
#!/bin/bash

# 钉钉通知脚本
DINGTALK_WEBHOOK="$1"
ALERT_DATA="$2"

# 发送钉钉消息
send_dingtalk_message() {
    local webhook_url="$1"
    local message="$2"

    curl -X POST "$webhook_url" \
        -H 'Content-Type: application/json' \
        -d "$message" \
        --connect-timeout 10 \
        --max-time 30
}

# 构建消息
build_message() {
    local alert_data="$1"

    # 解析告警数据
    local alertname=$(echo "$alert_data" | jq -r '.alerts[0].labels.alertname // "Unknown"')
    local severity=$(echo "$alert_data" | jq -r '.alerts[0].labels.severity // "unknown"')
    local instance=$(echo "$alert_data" | jq -r '.alerts[0].labels.instance // "unknown"')
    local service=$(echo "$alert_data" | jq -r '.alerts[0].labels.service // "unknown"')
    local status=$(echo "$alert_data" | jq -r '.alerts[0].status // "unknown"')
    local summary=$(echo "$alert_data" | jq -r '.alerts[0].annotations.summary // "No summary"')
    local description=$(echo "$alert_data" | jq -r '.alerts[0].annotations.description // "No description"')

    # 构建消息标题
    local title
    if [ "$status" = "firing" ]; then
        title="🚨 告警通知 - $alertname"
    else
        title="✅ 告警恢复 - $alertname"
    fi

    local message_text="**$title**

**告警级别**: $severity
**服务**: $service
**实例**: $instance
**时间**: $(date '+%Y-%m-%d %H:%M:%S')

**描述**: $description
**摘要**: $summary"

    # 用jq构建钉钉消息,保证换行等特殊字符被正确转义为合法JSON
    jq -n --arg title "$title" --arg text "$message_text" \
        '{msgtype: "markdown", markdown: {title: $title, text: $text}, at: {atMobiles: [], isAtAll: false}}'
}

# 主流程
main() {
    local webhook_url="$1"
    local alert_data="$2"

    if [ -z "$webhook_url" ] || [ -z "$alert_data" ]; then
        echo "Usage: $0 <webhook_url> <alert_data>"
        exit 1
    fi

    # 构建消息
    local message=$(build_message "$alert_data")

    # 发送消息
    send_dingtalk_message "$webhook_url" "$message"
}

# 执行主流程
main "$@"
SCRIPT_EOF

    chmod +x /opt/scripts/dingtalk_notify.sh
}

# 创建钉钉通知服务
create_dingtalk_service() {
    cat > /etc/systemd/system/dingtalk-notify.service << EOF
[Unit]
Description=DingTalk Notification Service
After=network.target

[Service]
Type=simple
User=root
# 运行3.1节的钉钉Webhook服务,需先将其保存为/opt/scripts/dingtalk_webhook.py
ExecStart=/usr/bin/python3 /opt/scripts/dingtalk_webhook.py
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

    systemctl daemon-reload
    systemctl enable dingtalk-notify
    systemctl start dingtalk-notify
}

# 主流程
main() {
    echo "开始配置钉钉通知..."

    create_dingtalk_script
    create_dingtalk_service

    echo "钉钉通知配置完成"
    echo "服务状态:"
    systemctl status dingtalk-notify
}

# 执行主流程
main "$@"
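部署完成后,可以直接调用生成的通知脚本发送一条测试消息,确认机器人Webhook可用(access_token需替换为实际值):

# 用一条最小化的告警JSON测试通知脚本
TEST_ALERT='{"alerts":[{"status":"firing","labels":{"alertname":"DingTalkTest","severity":"warning","instance":"test","service":"system"},"annotations":{"summary":"钉钉通道测试","description":"验证钉钉机器人Webhook是否可用"}}]}'

/opt/scripts/dingtalk_notify.sh "https://oapi.dingtalk.com/robot/send?access_token=your_token" "$TEST_ALERT"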

四、告警流程与抑制策略

4.1 告警流程设计

# 告警流程管理器
import logging
from datetime import datetime, timedelta
from typing import Dict


class AlertFlowManager:
    def __init__(self):
        self.logger = logging.getLogger(__name__)
        self.alert_history = {}
        self.alert_groups = {}
        self.inhibition_rules = []

    def process_alert(self, alert_data: Dict) -> bool:
        """处理告警"""
        try:
            alerts = alert_data.get('alerts', [])

            for alert in alerts:
                # 检查告警抑制
                if self.check_inhibition(alert):
                    self.logger.info(f"告警被抑制: {alert.get('labels', {}).get('alertname')}")
                    continue

                # 告警分组
                group_key = self.get_group_key(alert)
                if group_key not in self.alert_groups:
                    self.alert_groups[group_key] = []

                self.alert_groups[group_key].append(alert)

                # 记录告警历史
                self.record_alert_history(alert)

                # 发送通知
                self.send_notification(alert)

            return True

        except Exception as e:
            self.logger.error(f"处理告警异常: {str(e)}")
            return False

    def check_inhibition(self, alert: Dict) -> bool:
        """检查告警抑制"""
        try:
            for rule in self.inhibition_rules:
                if self.match_inhibition_rule(alert, rule):
                    return True

            return False

        except Exception as e:
            self.logger.error(f"检查告警抑制异常: {str(e)}")
            return False

    def match_inhibition_rule(self, alert: Dict, rule: Dict) -> bool:
        """匹配抑制规则:当前告警命中target_match,且历史中存在命中source_match、equal标签取值一致的firing告警时才抑制"""
        try:
            labels = alert.get('labels', {})

            # 当前告警必须匹配target_match
            target_match = rule.get('target_match', {})
            for key, value in target_match.items():
                if labels.get(key) != value:
                    return False

            # 在告警历史中查找匹配source_match且equal标签一致的firing告警
            source_match = rule.get('source_match', {})
            equal_labels = rule.get('equal', [])

            for records in self.alert_history.values():
                for record in records:
                    if record.get('status') != 'firing':
                        continue
                    source_labels = record.get('labels', {})
                    if any(source_labels.get(k) != v for k, v in source_match.items()):
                        continue
                    if all(source_labels.get(lbl) == labels.get(lbl) for lbl in equal_labels):
                        return True

            return False

        except Exception as e:
            self.logger.error(f"匹配抑制规则异常: {str(e)}")
            return False

    def get_group_key(self, alert: Dict) -> str:
        """获取告警分组键"""
        try:
            labels = alert.get('labels', {})

            # 按告警名称、集群、服务分组
            group_parts = [
                labels.get('alertname', 'unknown'),
                labels.get('cluster', 'unknown'),
                labels.get('service', 'unknown')
            ]

            return '|'.join(group_parts)

        except Exception as e:
            self.logger.error(f"获取告警分组键异常: {str(e)}")
            return 'unknown'

    def record_alert_history(self, alert: Dict):
        """记录告警历史"""
        try:
            labels = alert.get('labels', {})
            alertname = labels.get('alertname', 'unknown')
            instance = labels.get('instance', 'unknown')

            key = f"{alertname}:{instance}"

            if key not in self.alert_history:
                self.alert_history[key] = []

            self.alert_history[key].append({
                'timestamp': datetime.now(),
                'status': alert.get('status', 'unknown'),
                'labels': labels,
                'annotations': alert.get('annotations', {})
            })

            # 清理旧历史记录
            self.cleanup_alert_history(key)

        except Exception as e:
            self.logger.error(f"记录告警历史异常: {str(e)}")

    def cleanup_alert_history(self, key: str):
        """清理告警历史"""
        try:
            if key in self.alert_history:
                # 保留最近24小时的记录
                cutoff_time = datetime.now() - timedelta(hours=24)
                self.alert_history[key] = [
                    record for record in self.alert_history[key]
                    if record['timestamp'] > cutoff_time
                ]

        except Exception as e:
            self.logger.error(f"清理告警历史异常: {str(e)}")

    def send_notification(self, alert: Dict):
        """发送通知"""
        try:
            labels = alert.get('labels', {})
            severity = labels.get('severity', 'unknown')

            # 根据告警级别选择通知方式
            if severity == 'critical':
                self.send_critical_notification(alert)
            elif severity == 'warning':
                self.send_warning_notification(alert)
            else:
                self.send_info_notification(alert)

        except Exception as e:
            self.logger.error(f"发送通知异常: {str(e)}")

    def send_critical_notification(self, alert: Dict):
        """发送严重告警通知"""
        try:
            # 立即发送邮件
            self.send_email_notification(alert)

            # 发送钉钉通知
            self.send_dingtalk_notification(alert)

            # 发送短信通知
            self.send_sms_notification(alert)

            self.logger.info("严重告警通知已发送")

        except Exception as e:
            self.logger.error(f"发送严重告警通知异常: {str(e)}")

    def send_warning_notification(self, alert: Dict):
        """发送警告告警通知"""
        try:
            # 发送邮件通知
            self.send_email_notification(alert)

            # 发送钉钉通知
            self.send_dingtalk_notification(alert)

            self.logger.info("警告告警通知已发送")

        except Exception as e:
            self.logger.error(f"发送警告告警通知异常: {str(e)}")

    def send_info_notification(self, alert: Dict):
        """发送信息告警通知"""
        try:
            # 仅发送邮件通知
            self.send_email_notification(alert)

            self.logger.info("信息告警通知已发送")

        except Exception as e:
            self.logger.error(f"发送信息告警通知异常: {str(e)}")

    def send_email_notification(self, alert: Dict):
        """发送邮件通知(占位实现,可对接SMTP)"""
        pass

    def send_dingtalk_notification(self, alert: Dict):
        """发送钉钉通知(占位实现,可对接3.1节的Webhook服务)"""
        pass

    def send_sms_notification(self, alert: Dict):
        """发送短信通知(占位实现,可对接短信服务商)"""
        pass


# 使用示例
if __name__ == "__main__":
    flow_manager = AlertFlowManager()

    # 示例告警数据
    alert_data = {
        "alerts": [
            {
                "status": "firing",
                "labels": {
                    "alertname": "HighCPUUsage",
                    "severity": "warning",
                    "instance": "server1",
                    "service": "system"
                },
                "annotations": {
                    "summary": "CPU使用率过高",
                    "description": "服务器CPU使用率超过80%"
                }
            }
        ]
    }

    flow_manager.process_alert(alert_data)

4.2 告警抑制与聚合策略

# 告警抑制与聚合配置
inhibition_rules:
  # 时间抑制规则
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
    comment: '抑制相同实例的警告告警'

  # 服务抑制规则
  - source_match:
      alertname: 'ServiceDown'
    target_match_re:
      alertname: '.*High.*'
    equal: ['service']
    comment: '抑制相同服务的多个告警'

  # 级别抑制规则
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'info'
    equal: ['instance']
    comment: '抑制相同实例的信息告警'

# 告警聚合配置
group_by:
  - 'alertname'
  - 'cluster'
  - 'service'
  - 'severity'

group_wait: 10s
group_interval: 10s
repeat_interval: 1h

# 告警路由配置
routes:
  # 按服务路由
  - match:
      service: 'mysql'
    receiver: 'mysql-receiver'
    group_wait: 5s
    repeat_interval: 30m

  - match:
      service: 'redis'
    receiver: 'redis-receiver'
    group_wait: 5s
    repeat_interval: 30m

  # 按告警级别路由
  - match:
      severity: 'critical'
    receiver: 'critical-receiver'
    group_wait: 0s
    repeat_interval: 5m

  - match:
      severity: 'warning'
    receiver: 'warning-receiver'
    group_wait: 30s
    repeat_interval: 1h
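除了抑制与聚合,计划内维护窗口还可以用amtool对指定告警临时静默,避免干扰值班。下面是一个示例,告警标签请按实际环境调整:

# 对node1实例的CPU告警静默2小时,用于计划内变更窗口
amtool --alertmanager.url=http://localhost:9093 silence add \
    alertname=HighCPUUsage instance=node1:9100 \
    --author="ops" --duration=2h --comment="计划内变更,临时静默"

# 查看当前生效的静默规则
amtool --alertmanager.url=http://localhost:9093 silence query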

五、智能告警规则与故障自愈

5.1 智能告警规则

# 智能告警规则引擎
import logging
from datetime import datetime
from typing import Dict, List, Optional

import numpy as np
from sklearn.ensemble import IsolationForest


class IntelligentAlertEngine:
    def __init__(self):
        self.logger = logging.getLogger(__name__)
        self.metric_history = {}
        self.anomaly_detector = IsolationForest(contamination=0.1)
        self.baseline_metrics = {}
        self.alert_thresholds = {}
        self.last_anomaly_score = 0.0

    def analyze_metric_anomaly(self, metric_name: str, value: float, labels: Dict) -> bool:
        """分析指标异常"""
        try:
            # 获取指标历史数据
            history_key = f"{metric_name}:{labels.get('instance', 'unknown')}"

            if history_key not in self.metric_history:
                self.metric_history[history_key] = []

            # 添加当前值
            self.metric_history[history_key].append({
                'timestamp': datetime.now(),
                'value': value,
                'labels': labels
            })

            # 保持最近1000个数据点
            if len(self.metric_history[history_key]) > 1000:
                self.metric_history[history_key] = self.metric_history[history_key][-1000:]

            # 检查是否有足够的历史数据
            if len(self.metric_history[history_key]) < 50:
                return False

            # 基于历史数据检测当前值是否异常
            values = [record['value'] for record in self.metric_history[history_key]]
            is_anomaly = self.detect_anomaly(values, value)

            return is_anomaly

        except Exception as e:
            self.logger.error(f"分析指标异常失败: {str(e)}")
            return False

    def extract_features(self, history_key: str) -> List[float]:
        """提取统计与趋势特征(可用于扩展为多维特征的异常检测)"""
        try:
            history = self.metric_history[history_key]
            values = [record['value'] for record in history]

            # 计算统计特征
            features = [
                np.mean(values),            # 均值
                np.std(values),             # 标准差
                np.median(values),          # 中位数
                np.percentile(values, 25),  # 25分位数
                np.percentile(values, 75),  # 75分位数
                np.max(values),             # 最大值
                np.min(values),             # 最小值
            ]

            # 计算趋势特征
            if len(values) >= 10:
                # 计算斜率
                x = np.arange(len(values[-10:]))
                y = np.array(values[-10:])
                slope = np.polyfit(x, y, 1)[0]
                features.append(slope)

                # 计算变化率
                change_rate = (values[-1] - values[-10]) / values[-10] if values[-10] != 0 else 0
                features.append(change_rate)
            else:
                features.extend([0, 0])

            return features

        except Exception as e:
            self.logger.error(f"提取特征失败: {str(e)}")
            return []

    def detect_anomaly(self, history_values: List[float], current_value: float) -> bool:
        """检测异常:先用历史数据拟合IsolationForest,再对当前值打分"""
        try:
            if len(history_values) < 50:
                return False

            # IsolationForest必须先fit历史样本,否则无法predict
            train_data = np.array(history_values, dtype=float).reshape(-1, 1)
            self.anomaly_detector.fit(train_data)

            current = np.array([[current_value]], dtype=float)
            anomaly_score = self.anomaly_detector.decision_function(current)[0]
            is_anomaly = self.anomaly_detector.predict(current)[0] == -1

            # 记录异常分数,供生成告警注解时引用
            self.last_anomaly_score = anomaly_score
            self.logger.info(f"异常检测分数: {anomaly_score}, 是否异常: {is_anomaly}")

            return is_anomaly

        except Exception as e:
            self.logger.error(f"检测异常失败: {str(e)}")
            return False

    def calculate_dynamic_threshold(self, metric_name: str, labels: Dict) -> float:
        """计算动态阈值"""
        try:
            history_key = f"{metric_name}:{labels.get('instance', 'unknown')}"

            if history_key not in self.metric_history:
                return 0.8  # 默认阈值

            history = self.metric_history[history_key]
            values = [record['value'] for record in history]

            if len(values) < 10:
                return 0.8  # 默认阈值

            # 计算动态阈值
            mean_value = np.mean(values)
            std_value = np.std(values)

            # 基于3-sigma规则计算阈值
            threshold = mean_value + 3 * std_value

            # 限制阈值范围(适用于0~1的比例类指标)
            threshold = max(0.1, min(0.99, threshold))

            return threshold

        except Exception as e:
            self.logger.error(f"计算动态阈值失败: {str(e)}")
            return 0.8

    def generate_intelligent_alert(self, metric_name: str, value: float, labels: Dict) -> Optional[Dict]:
        """生成智能告警"""
        try:
            # 分析异常
            is_anomaly = self.analyze_metric_anomaly(metric_name, value, labels)

            if not is_anomaly:
                return None

            # 计算动态阈值
            threshold = self.calculate_dynamic_threshold(metric_name, labels)

            # 确定告警级别
            severity = self.determine_severity(value, threshold)

            # 生成告警
            alert = {
                'alertname': f'Intelligent{metric_name}Alert',
                'severity': severity,
                'labels': labels,
                'annotations': {
                    'summary': f'智能检测到{metric_name}异常',
                    'description': f'指标{metric_name}当前值{value}超过动态阈值{threshold}',
                    'intelligent_detection': 'true',
                    'threshold': str(threshold),
                    'anomaly_score': str(self.last_anomaly_score)
                }
            }

            return alert

        except Exception as e:
            self.logger.error(f"生成智能告警失败: {str(e)}")
            return None

    def determine_severity(self, value: float, threshold: float) -> str:
        """确定告警级别"""
        try:
            if value > threshold * 1.5:
                return 'critical'
            elif value > threshold * 1.2:
                return 'warning'
            else:
                return 'info'

        except Exception as e:
            self.logger.error(f"确定告警级别失败: {str(e)}")
            return 'warning'


# 使用示例
if __name__ == "__main__":
    engine = IntelligentAlertEngine()

    # 示例指标数据
    metric_name = "cpu_usage"
    value = 0.85
    labels = {
        "instance": "server1",
        "service": "system"
    }

    alert = engine.generate_intelligent_alert(metric_name, value, labels)
    if alert:
        print(f"生成智能告警: {alert}")

5.2 故障自愈策略

# 故障自愈管理器
import logging
import subprocess
from typing import Dict, List


class FaultSelfHealingManager:
    def __init__(self):
        self.logger = logging.getLogger(__name__)
        self.healing_actions = {}
        self.healing_history = {}
        self.max_retry_count = 3

    def register_healing_action(self, alert_name: str, action_func):
        """注册自愈动作"""
        self.healing_actions[alert_name] = action_func
        self.logger.info(f"注册自愈动作: {alert_name}")

    def execute_healing_action(self, alert: Dict) -> bool:
        """执行自愈动作"""
        try:
            labels = alert.get('labels', {})
            alertname = labels.get('alertname', '')

            # 检查是否有对应的自愈动作
            if alertname not in self.healing_actions:
                self.logger.info(f"没有找到自愈动作: {alertname}")
                return False

            # 检查重试次数
            instance = labels.get('instance', 'unknown')
            retry_key = f"{alertname}:{instance}"

            if retry_key not in self.healing_history:
                self.healing_history[retry_key] = 0

            if self.healing_history[retry_key] >= self.max_retry_count:
                self.logger.warning(f"自愈动作重试次数超限: {retry_key}")
                return False

            # 执行自愈动作
            action_func = self.healing_actions[alertname]
            success = action_func(alert)

            if success:
                self.logger.info(f"自愈动作执行成功: {alertname}")
                # 重置重试计数
                self.healing_history[retry_key] = 0
            else:
                self.logger.warning(f"自愈动作执行失败: {alertname}")
                # 增加重试计数
                self.healing_history[retry_key] += 1

            return success

        except Exception as e:
            self.logger.error(f"执行自愈动作异常: {str(e)}")
            return False

    def restart_service(self, service_name: str) -> bool:
        """重启服务"""
        try:
            result = subprocess.run(
                ['systemctl', 'restart', service_name],
                capture_output=True,
                text=True,
                timeout=30
            )

            if result.returncode == 0:
                self.logger.info(f"服务重启成功: {service_name}")
                return True
            else:
                self.logger.error(f"服务重启失败: {service_name}, {result.stderr}")
                return False

        except Exception as e:
            self.logger.error(f"重启服务异常: {str(e)}")
            return False

    def clear_cache(self, cache_type: str) -> bool:
        """清理缓存"""
        try:
            if cache_type == 'redis':
                # 清理Redis缓存
                result = subprocess.run(
                    ['redis-cli', 'FLUSHALL'],
                    capture_output=True,
                    text=True,
                    timeout=10
                )

                if result.returncode == 0:
                    self.logger.info("Redis缓存清理成功")
                    return True
                else:
                    self.logger.error(f"Redis缓存清理失败: {result.stderr}")
                    return False

            elif cache_type == 'memory':
                # 清理内存页缓存(shell=True时命令需以字符串形式传入,否则只有第一个元素会被执行)
                result = subprocess.run(
                    'sync && echo 3 > /proc/sys/vm/drop_caches',
                    shell=True,
                    capture_output=True,
                    text=True,
                    timeout=10
                )

                if result.returncode == 0:
                    self.logger.info("内存缓存清理成功")
                    return True
                else:
                    self.logger.error(f"内存缓存清理失败: {result.stderr}")
                    return False

            return False

        except Exception as e:
            self.logger.error(f"清理缓存异常: {str(e)}")
            return False

    def scale_service(self, service_name: str, scale_count: int) -> bool:
        """扩缩容服务"""
        try:
            # 使用Docker Compose扩缩容
            result = subprocess.run(
                ['docker-compose', 'up', '-d', '--scale', f'{service_name}={scale_count}'],
                capture_output=True,
                text=True,
                timeout=60
            )

            if result.returncode == 0:
                self.logger.info(f"服务扩缩容成功: {service_name} -> {scale_count}")
                return True
            else:
                self.logger.error(f"服务扩缩容失败: {service_name}, {result.stderr}")
                return False

        except Exception as e:
            self.logger.error(f"扩缩容服务异常: {str(e)}")
            return False

    def execute_custom_script(self, script_path: str, args: List[str] = None) -> bool:
        """执行自定义脚本"""
        try:
            cmd = [script_path]
            if args:
                cmd.extend(args)

            result = subprocess.run(
                cmd,
                capture_output=True,
                text=True,
                timeout=60
            )

            if result.returncode == 0:
                self.logger.info(f"自定义脚本执行成功: {script_path}")
                return True
            else:
                self.logger.error(f"自定义脚本执行失败: {script_path}, {result.stderr}")
                return False

        except Exception as e:
            self.logger.error(f"执行自定义脚本异常: {str(e)}")
            return False


# 自愈动作定义
def define_healing_actions(healing_manager: FaultSelfHealingManager):
    """定义自愈动作"""

    # MySQL连接数过高自愈
    def mysql_high_connections_healing(alert: Dict) -> bool:
        # 重启MySQL服务
        return healing_manager.restart_service('mysql')

    # Redis内存使用率过高自愈
    def redis_high_memory_healing(alert: Dict) -> bool:
        # 清理Redis缓存
        return healing_manager.clear_cache('redis')

    # 服务宕机自愈
    def service_down_healing(alert: Dict) -> bool:
        labels = alert.get('labels', {})
        service = labels.get('service', '')

        if service:
            return healing_manager.restart_service(service)

        return False

    # CPU使用率过高自愈
    def high_cpu_usage_healing(alert: Dict) -> bool:
        # 清理内存缓存
        return healing_manager.clear_cache('memory')

    # 注册自愈动作
    healing_manager.register_healing_action('MySQLHighConnections', mysql_high_connections_healing)
    healing_manager.register_healing_action('RedisHighMemoryUsage', redis_high_memory_healing)
    healing_manager.register_healing_action('ServiceDown', service_down_healing)
    healing_manager.register_healing_action('HighCPUUsage', high_cpu_usage_healing)


# 使用示例
if __name__ == "__main__":
    healing_manager = FaultSelfHealingManager()

    # 定义自愈动作
    define_healing_actions(healing_manager)

    # 示例告警
    alert = {
        'labels': {
            'alertname': 'MySQLHighConnections',
            'instance': 'mysql1',
            'service': 'mysql'
        },
        'annotations': {
            'summary': 'MySQL连接数过高',
            'description': 'MySQL连接数超过阈值'
        }
    }

    # 执行自愈动作
    success = healing_manager.execute_healing_action(alert)
    print(f"自愈动作执行结果: {success}")

六、最佳实践与运维建议

6.1 告警系统最佳实践

# 告警系统最佳实践配置
alert_best_practices:
  # 告警规则设计
  rule_design:
    - 避免告警风暴
    - 设置合理的阈值
    - 使用告警抑制
    - 定期评估告警规则

  # 通知策略
  notification_strategy:
    - 按告警级别分级通知
    - 使用多种通知渠道
    - 设置通知频率限制
    - 提供告警上下文信息

  # 告警管理
  alert_management:
    - 定期清理无效告警
    - 监控告警系统本身
    - 建立告警响应流程
    - 定期进行告警演练

  # 性能优化
  performance_optimization:
    - 优化告警规则性能
    - 使用告警分组
    - 合理设置评估间隔
    - 监控告警系统资源使用
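其中"监控告警系统本身"这一条可以落地为对Prometheus和AlertManager自身健康端点的周期性探测,下面是一个最简示例(假设二者分别监听9090与9093端口):

#!/bin/bash
# 告警链路自监控:探测Prometheus与AlertManager的健康端点
for endpoint in "http://localhost:9090/-/healthy" "http://localhost:9093/-/healthy"; do
    if curl -s -f "$endpoint" > /dev/null; then
        echo "✅ $endpoint 健康"
    else
        echo "❌ $endpoint 异常,请检查告警链路"
    fi
done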

6.2 运维建议

#!/bin/bash
# 告警系统运维脚本

# 检查告警系统状态
check_alert_system_status() {
    echo "检查告警系统状态..."

    # 检查Prometheus状态
    if systemctl is-active prometheus > /dev/null; then
        echo "✅ Prometheus服务正常"
    else
        echo "❌ Prometheus服务异常"
    fi

    # 检查AlertManager状态
    if systemctl is-active alertmanager > /dev/null; then
        echo "✅ AlertManager服务正常"
    else
        echo "❌ AlertManager服务异常"
    fi

    # 检查Node-Exporter状态
    if systemctl is-active node_exporter > /dev/null; then
        echo "✅ Node-Exporter服务正常"
    else
        echo "❌ Node-Exporter服务异常"
    fi

    # 检查钉钉通知服务状态
    if systemctl is-active dingtalk-notify > /dev/null; then
        echo "✅ 钉钉通知服务正常"
    else
        echo "❌ 钉钉通知服务异常"
    fi
}

# 测试告警功能
test_alert_functionality() {
    echo "测试告警功能..."

    # 测试Prometheus告警规则
    curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | select(.state == "firing")'

    # 测试AlertManager配置
    curl -s http://localhost:9093/api/v1/status | jq '.data.config'

    # 测试钉钉通知
    curl -X POST http://localhost:8080/dingtalk \
        -H 'Content-Type: application/json' \
        -d '{
            "alerts": [{
                "status": "firing",
                "labels": {
                    "alertname": "TestAlert",
                    "severity": "warning",
                    "instance": "test-server"
                },
                "annotations": {
                    "summary": "测试告警",
                    "description": "这是一个测试告警"
                }
            }]
        }'
}

# 清理告警历史
cleanup_alert_history() {
    echo "清理告警历史..."

    # 清理Prometheus告警历史
    find /var/lib/prometheus -name "*.log" -mtime +30 -delete

    # 清理AlertManager告警历史
    find /var/lib/alertmanager -name "*.log" -mtime +30 -delete

    echo "告警历史清理完成"
}

# 备份告警配置
backup_alert_config() {
    echo "备份告警配置..."

    backup_dir="/backup/alert-config/$(date +%Y%m%d)"
    mkdir -p "$backup_dir"

    # 备份Prometheus配置
    cp /etc/prometheus/prometheus.yml "$backup_dir/"
    cp -r /etc/prometheus/rules "$backup_dir/"

    # 备份AlertManager配置
    cp /etc/alertmanager/alertmanager.yml "$backup_dir/"

    # 备份钉钉通知配置
    cp /opt/scripts/dingtalk_notify.sh "$backup_dir/"

    echo "告警配置备份完成: $backup_dir"
}

# 主流程
main() {
    echo "开始告警系统运维检查..."

    check_alert_system_status
    test_alert_functionality
    cleanup_alert_history
    backup_alert_config

    echo "告警系统运维检查完成"
}

# 执行主流程
main "$@"
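这类巡检可以通过cron定期执行。以下示例假设脚本保存为/opt/scripts/alert_ops.sh,路径与时间可按实际情况调整:

# 每天凌晨2点执行告警系统巡检与配置备份,输出落盘便于追溯
echo "0 2 * * * root /opt/scripts/alert_ops.sh >> /var/log/alert_ops.log 2>&1" > /etc/cron.d/alert-ops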

七、总结

Node-Exporter+AlertManager告警系统是构建企业级监控告警体系的核心组件,本文深入解析了其架构原理和告警策略,提供了完整的邮件+钉钉通知方案。

关键要点:

  1. 架构设计:分层架构设计,包含数据采集、存储、处理、通知和展示层
  2. 告警规则:智能告警规则设计,支持多种告警级别和抑制策略
  3. 通知集成:完整的邮件和钉钉通知集成方案
  4. 告警流程:标准化的告警处理流程和抑制策略
  5. 智能告警:基于机器学习的智能异常检测和动态阈值
  6. 故障自愈:自动化的故障检测和自愈机制

通过本文的学习和实践,架构师可以构建高效稳定的监控告警体系,实现智能化的运维管理。


作者简介:资深架构师,专注于监控告警系统设计与优化,拥有丰富的Node-Exporter+AlertManager实战经验。

技术交流:欢迎关注我的技术博客,分享更多监控告警系统经验。