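1. Overview of Automated Operations Inspection

Operations inspection is a key practice for keeping systems stable: automated inspections detect problems early and help prevent failures. This article walks through a solution covering system, service, performance, and security inspection, to help operations staff automate their inspections efficiently.

1.1 Core Challenges

  1. System inspection: check system resource usage and health
  2. Service inspection: check service status and availability
  3. Performance inspection: check performance metrics and bottlenecks
  4. Security inspection: check for security vulnerabilities and risks
  5. Automated execution: run inspection tasks automatically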

1.2 Technical Architecture

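Inspection task → Task scheduling → Inspection execution → Result analysis → Alert notification
↓ ↓ ↓ ↓ ↓
Scheduled job → Task queue → Inspection agent → Data analysis → Notification push
↓ ↓ ↓ ↓ ↓
Inspection report → Trend analysis → Problem diagnosis → Automatic repair → Inspection record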

2. Inspection System Architecture

2.1 Maven Dependencies

<!-- pom.xml -->
<dependencies>
    <!-- Spring Boot Web -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>

    <!-- Spring Boot Data Redis -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-data-redis</artifactId>
    </dependency>

    <!-- Quartz scheduler -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-quartz</artifactId>
    </dependency>

    <!-- OSHI: system and hardware information -->
    <dependency>
        <groupId>com.github.oshi</groupId>
        <artifactId>oshi-core</artifactId>
        <version>6.4.0</version>
    </dependency>

    <!-- MyBatis-Plus -->
    <dependency>
        <groupId>com.baomidou</groupId>
        <artifactId>mybatis-plus-boot-starter</artifactId>
        <version>3.5.2</version>
    </dependency>

    <!-- Lombok (assumed here: the code below uses Lombok annotations such as @Data;
         the version is expected to come from the Spring Boot parent) -->
    <dependency>
        <groupId>org.projectlombok</groupId>
        <artifactId>lombok</artifactId>
        <optional>true</optional>
    </dependency>
</dependencies>

2.2 Application Configuration

# application.yml
server:
  port: 8080

spring:
  redis:
    host: localhost
    port: 6379
    database: 0

# Inspection settings
inspection:
  system-check-interval: 300000       # system inspection interval (ms)
  service-check-interval: 60000       # service inspection interval (ms)
  performance-check-interval: 300000  # performance inspection interval (ms)
  security-check-interval: 3600000    # security inspection interval (ms)
  report-generation-enabled: true     # enable inspection report generation
  auto-fix-enabled: true              # enable automatic repair
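
The millisecond intervals above map naturally onto fixed-rate scheduling. The following is a minimal sketch only, assuming plain java.util.concurrent rather than the Quartz scheduler used later in this article; the class name IntervalScheduler and the check names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical helper: runs one named check per configured interval (milliseconds).
class IntervalScheduler {
    private final ScheduledExecutorService executor = Executors.newScheduledThreadPool(2);

    // Schedules every named check at its own fixed rate, starting immediately.
    // Note: if a check throws, scheduleAtFixedRate suppresses its later runs.
    void start(Map<String, Long> intervalsMillis, Consumer<String> check) {
        intervalsMillis.forEach((name, millis) ->
                executor.scheduleAtFixedRate(() -> check.accept(name), 0, millis, TimeUnit.MILLISECONDS));
    }

    // Stops all scheduled checks.
    void stop() {
        executor.shutdownNow();
    }

    public static void main(String[] args) throws InterruptedException {
        Map<String, Long> intervals = new LinkedHashMap<>();
        intervals.put("service", 100L); // stands in for service-check-interval
        intervals.put("system", 300L);  // stands in for system-check-interval

        IntervalScheduler scheduler = new IntervalScheduler();
        scheduler.start(intervals, name -> System.out.println("running " + name + " check"));
        Thread.sleep(500);
        scheduler.stop();
    }
}
```

Fixed-rate scheduling suits simple intervals like these; cron expressions, as used by the task entity below, are the better fit when inspections must run at specific wall-clock times.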

3. Inspection Task Management

3.1 Inspection Task Entities

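import com.baomidou.mybatisplus.annotation.IdType;
import com.baomidou.mybatisplus.annotation.TableId;
import com.baomidou.mybatisplus.annotation.TableName;
import lombok.Data;

import java.util.Date;

/**
 * Inspection task entity.
 */
@Data
@TableName("inspection_task")
public class InspectionTask {

    @TableId(type = IdType.AUTO)
    private Long id;                // primary key

    private String taskName;        // task name

    private String taskType;        // task type

    private String taskDescription; // task description

    private String cronExpression;  // cron expression

    private String targetHosts;     // target hosts (comma-separated)

    private String checkItems;      // check items (comma-separated)

    private String taskStatus;      // task status

    private Date lastRunTime;       // last run time

    private Date nextRunTime;       // next run time

    private Date createTime;        // creation time

    private Date updateTime;        // last update time
}

/**
 * Inspection result entity.
 */
@Data
@TableName("inspection_result")
public class InspectionResult {

    @TableId(type = IdType.AUTO)
    private Long id;             // primary key

    private Long taskId;         // task ID

    private String hostname;     // host name

    private String checkItem;    // check item

    private String checkResult;  // check result (measured value)

    private String checkStatus;  // check status: OK, WARNING, CRITICAL, ERROR, or UNKNOWN

    private String checkMessage; // check message

    private String checkData;    // check data (JSON string)

    private Date checkTime;      // check time

    private Date createTime;     // creation time
}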
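
The checkStatus field holds free-form strings, and the service in the next section repeats the same "above 90 is critical, above 80 is a warning" comparison for CPU, memory, and disk. One way to keep both in one place is sketched below. This is an illustration only: the CheckStatus enum and its fromUsage and needsAlert methods are hypothetical and are not part of the original code.

```java
// Hypothetical enum that could replace the free-form checkStatus strings.
enum CheckStatus {
    OK, WARNING, CRITICAL, ERROR, UNKNOWN;

    // Maps a usage percentage to a status; values strictly above a threshold trigger it.
    static CheckStatus fromUsage(double usagePercent, double warnAt, double criticalAt) {
        if (Double.isNaN(usagePercent)) {
            return ERROR; // e.g. a division by zero upstream
        }
        if (usagePercent > criticalAt) {
            return CRITICAL;
        }
        if (usagePercent > warnAt) {
            return WARNING;
        }
        return OK;
    }

    // Mirrors the alert rule in the service: only WARNING and CRITICAL raise alerts.
    boolean needsAlert() {
        return this == WARNING || this == CRITICAL;
    }

    public static void main(String[] args) {
        System.out.println(fromUsage(85.0, 80, 90)); // prints WARNING
        System.out.println(fromUsage(95.0, 80, 90)); // prints CRITICAL
    }
}
```

Stored as a string column, the enum name (for example CheckStatus.WARNING.name()) stays compatible with the existing table layout.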

3.2 Inspection Task Service

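import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.time.Duration;
import java.util.*;

import lombok.extern.slf4j.Slf4j;
import org.quartz.*;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.stereotype.Service;
import org.springframework.util.StringUtils;
import oshi.SystemInfo;
import oshi.hardware.CentralProcessor;
import oshi.hardware.GlobalMemory;
import oshi.hardware.HardwareAbstractionLayer;
import oshi.software.os.OSFileStore;

/**
 * Inspection task service.
 * Manages and executes inspection tasks.
 */
@Slf4j
@Service
public class InspectionTaskService {

    @Autowired
    private InspectionTaskMapper inspectionTaskMapper;

    @Autowired
    private InspectionResultMapper inspectionResultMapper;

    @Autowired
    private RedisTemplate<String, Object> redisTemplate;

    @Autowired
    private Scheduler scheduler;

    // AlertService and AlertMessage are used by sendInspectionAlert below
    // but are not defined in this article
    @Autowired
    private AlertService alertService;

    /**
     * Creates an inspection task.
     */
    public void createInspectionTask(InspectionTask task) {
        try {
            // 1. Save the task to the database
            task.setCreateTime(new Date());
            task.setUpdateTime(new Date());
            task.setTaskStatus("ACTIVE");
            inspectionTaskMapper.insert(task);

            // 2. Create the scheduled job
            createScheduledTask(task);

            // 3. Update the task cache
            updateTaskCache(task);

            log.info("Inspection task created: taskName={}, taskType={}",
                    task.getTaskName(), task.getTaskType());

        } catch (Exception e) {
            log.error("Failed to create inspection task: {}", e.getMessage(), e);
        }
    }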

    /**
     * Creates the Quartz job and cron trigger for a task.
     * InspectionJob is the Quartz Job class that runs the task; it is not shown in this article.
     */
    private void createScheduledTask(InspectionTask task) throws SchedulerException {
        JobDetail jobDetail = JobBuilder.newJob(InspectionJob.class)
                .withIdentity(task.getTaskName(), "inspection")
                .usingJobData("taskId", task.getId())
                .build();

        Trigger trigger = TriggerBuilder.newTrigger()
                .withIdentity(task.getTaskName() + "_trigger", "inspection")
                .withSchedule(CronScheduleBuilder.cronSchedule(task.getCronExpression()))
                .build();

        scheduler.scheduleJob(jobDetail, trigger);
    }

    /**
     * Executes an inspection task.
     */
    public void executeInspectionTask(Long taskId) {
        try {
            // 1. Load the task
            InspectionTask task = inspectionTaskMapper.selectById(taskId);
            if (task == null) {
                log.error("Inspection task not found: taskId={}", taskId);
                return;
            }

            // 2. Parse the target hosts
            List<String> targetHosts = parseTargetHosts(task.getTargetHosts());

            // 3. Parse the check items
            List<String> checkItems = parseCheckItems(task.getCheckItems());

            // 4. Run every check item against every host
            for (String hostname : targetHosts) {
                for (String checkItem : checkItems) {
                    executeInspectionItem(task, hostname, checkItem);
                }
            }

            // 5. Update the task status
            updateTaskStatus(taskId, "COMPLETED");

            log.info("Inspection task finished: taskId={}, taskName={}", taskId, task.getTaskName());

        } catch (Exception e) {
            log.error("Inspection task failed: taskId={}, error={}", taskId, e.getMessage(), e);
            updateTaskStatus(taskId, "FAILED");
        }
    }

    /**
     * Executes a single check item on one host.
     */
    private void executeInspectionItem(InspectionTask task, String hostname, String checkItem) {
        try {
            // 1. Run the check that matches the check item type
            InspectionResult result = performInspection(task, hostname, checkItem);

            // 2. Save the result
            inspectionResultMapper.insert(result);

            // 3. Update the result cache
            updateResultCache(result);

            // 4. Raise an alert if needed
            checkInspectionAlert(result);

            log.debug("Check item finished: hostname={}, checkItem={}, status={}",
                    hostname, checkItem, result.getCheckStatus());

        } catch (Exception e) {
            log.error("Check item failed: hostname={}, checkItem={}, error={}",
                    hostname, checkItem, e.getMessage(), e);
        }
    }

    /**
     * Dispatches to the concrete check.
     */
    private InspectionResult performInspection(InspectionTask task, String hostname, String checkItem) {
        InspectionResult result;

        try {
            // Run the check that matches the check item type
            switch (checkItem) {
                case "CPU_USAGE":
                    result = checkCpuUsage(hostname);
                    break;
                case "MEMORY_USAGE":
                    result = checkMemoryUsage(hostname);
                    break;
                case "DISK_USAGE":
                    result = checkDiskUsage(hostname);
                    break;
                case "SERVICE_STATUS":
                    result = checkServiceStatus(hostname);
                    break;
                case "NETWORK_STATUS":
                    result = checkNetworkStatus(hostname);
                    break;
                case "SECURITY_SCAN":
                    result = checkSecurityScan(hostname);
                    break;
                default:
                    result = new InspectionResult();
                    result.setCheckStatus("UNKNOWN");
                    result.setCheckMessage("Unknown check item: " + checkItem);
                    break;
            }
        } catch (Exception e) {
            result = new InspectionResult();
            result.setCheckStatus("ERROR");
            result.setCheckMessage("Check failed: " + e.getMessage());
        }

        // Fill in the common fields once, whichever branch produced the result
        result.setTaskId(task.getId());
        result.setHostname(hostname);
        result.setCheckItem(checkItem);
        result.setCheckTime(new Date());
        result.setCreateTime(new Date());

        return result;
    }

    /**
     * Checks CPU usage.
     * Note: OSHI reads the machine this service runs on; the hostname parameter is not used
     * to reach a remote host.
     */
    private InspectionResult checkCpuUsage(String hostname) {
        InspectionResult result = new InspectionResult();

        try {
            SystemInfo systemInfo = new SystemInfo();
            HardwareAbstractionLayer hal = systemInfo.getHardware();
            CentralProcessor processor = hal.getProcessor();

            // CPU load is measured between two tick snapshots taken one second apart
            long[] previousTicks = processor.getSystemCpuLoadTicks();
            Thread.sleep(1000);
            double cpuUsage = processor.getSystemCpuLoadBetweenTicks(previousTicks) * 100;

            result.setCheckResult(String.valueOf(cpuUsage));
            result.setCheckData("{\"cpuUsage\":" + cpuUsage + "}");

            if (cpuUsage > 90) {
                result.setCheckStatus("CRITICAL");
                result.setCheckMessage("CPU usage is too high: " + cpuUsage + "%");
            } else if (cpuUsage > 80) {
                result.setCheckStatus("WARNING");
                result.setCheckMessage("CPU usage is high: " + cpuUsage + "%");
            } else {
                result.setCheckStatus("OK");
                result.setCheckMessage("CPU usage is normal: " + cpuUsage + "%");
            }

        } catch (Exception e) {
            result.setCheckStatus("ERROR");
            result.setCheckMessage("Failed to check CPU usage: " + e.getMessage());
        }

        return result;
    }

    /**
     * Checks memory usage (of the local machine, see checkCpuUsage).
     */
    private InspectionResult checkMemoryUsage(String hostname) {
        InspectionResult result = new InspectionResult();

        try {
            SystemInfo systemInfo = new SystemInfo();
            HardwareAbstractionLayer hal = systemInfo.getHardware();
            GlobalMemory memory = hal.getMemory();

            long totalMemory = memory.getTotal();
            long availableMemory = memory.getAvailable();
            long usedMemory = totalMemory - availableMemory;
            double memoryUsage = (double) usedMemory / totalMemory * 100;

            result.setCheckResult(String.valueOf(memoryUsage));
            result.setCheckData("{\"memoryUsage\":" + memoryUsage + ",\"totalMemory\":" + totalMemory + ",\"usedMemory\":" + usedMemory + "}");

            if (memoryUsage > 90) {
                result.setCheckStatus("CRITICAL");
                result.setCheckMessage("Memory usage is too high: " + memoryUsage + "%");
            } else if (memoryUsage > 80) {
                result.setCheckStatus("WARNING");
                result.setCheckMessage("Memory usage is high: " + memoryUsage + "%");
            } else {
                result.setCheckStatus("OK");
                result.setCheckMessage("Memory usage is normal: " + memoryUsage + "%");
            }

        } catch (Exception e) {
            result.setCheckStatus("ERROR");
            result.setCheckMessage("Failed to check memory usage: " + e.getMessage());
        }

        return result;
    }

    /**
     * Checks disk usage and reports the fullest file store.
     */
    private InspectionResult checkDiskUsage(String hostname) {
        InspectionResult result = new InspectionResult();

        try {
            // In OSHI, file stores are exposed by the operating system's file system,
            // not by the hardware abstraction layer
            SystemInfo systemInfo = new SystemInfo();
            List<OSFileStore> fileStores = systemInfo.getOperatingSystem().getFileSystem().getFileStores();

            double maxUsage = 0;
            String maxUsagePath = "";

            for (OSFileStore fileStore : fileStores) {
                long totalSpace = fileStore.getTotalSpace();
                if (totalSpace <= 0) {
                    continue; // skip pseudo file systems to avoid dividing by zero
                }
                long usableSpace = fileStore.getUsableSpace();
                long usedSpace = totalSpace - usableSpace;
                double usage = (double) usedSpace / totalSpace * 100;

                if (usage > maxUsage) {
                    maxUsage = usage;
                    maxUsagePath = fileStore.getMount();
                }
            }

            result.setCheckResult(String.valueOf(maxUsage));
            result.setCheckData("{\"maxUsage\":" + maxUsage + ",\"maxUsagePath\":\"" + maxUsagePath + "\"}");

            if (maxUsage > 90) {
                result.setCheckStatus("CRITICAL");
                result.setCheckMessage("Disk usage is too high: " + maxUsage + "% (" + maxUsagePath + ")");
            } else if (maxUsage > 80) {
                result.setCheckStatus("WARNING");
                result.setCheckMessage("Disk usage is high: " + maxUsage + "% (" + maxUsagePath + ")");
            } else {
                result.setCheckStatus("OK");
                result.setCheckMessage("Disk usage is normal: " + maxUsage + "%");
            }

        } catch (Exception e) {
            result.setCheckStatus("ERROR");
            result.setCheckMessage("Failed to check disk usage: " + e.getMessage());
        }

        return result;
    }

    /**
     * Checks the status of critical services.
     */
    private InspectionResult checkServiceStatus(String hostname) {
        InspectionResult result = new InspectionResult();

        try {
            // Each name must match a systemd unit on the host; "java" is kept from the
            // original list but is usually a process name rather than a unit
            List<String> criticalServices = Arrays.asList("nginx", "mysql", "redis", "java");
            int failedServices = 0;
            StringBuilder failedServiceList = new StringBuilder();

            for (String service : criticalServices) {
                if (!isServiceRunning(service)) {
                    failedServices++;
                    if (failedServiceList.length() > 0) {
                        failedServiceList.append(", ");
                    }
                    failedServiceList.append(service);
                }
            }

            result.setCheckResult(String.valueOf(failedServices));
            result.setCheckData("{\"failedServices\":" + failedServices + ",\"failedServiceList\":\"" + failedServiceList.toString() + "\"}");

            if (failedServices > 0) {
                result.setCheckStatus("CRITICAL");
                result.setCheckMessage("Critical services are down: " + failedServiceList.toString());
            } else {
                result.setCheckStatus("OK");
                result.setCheckMessage("All critical services are running");
            }

        } catch (Exception e) {
            result.setCheckStatus("ERROR");
            result.setCheckMessage("Failed to check service status: " + e.getMessage());
        }

        return result;
    }

    /**
     * Returns true if systemd reports the service as active (exit code 0).
     */
    private boolean isServiceRunning(String serviceName) {
        try {
            ProcessBuilder pb = new ProcessBuilder("systemctl", "is-active", serviceName);
            Process process = pb.start();
            int exitCode = process.waitFor();
            return exitCode == 0;
        } catch (Exception e) {
            return false;
        }
    }

    /**
     * Checks network connectivity.
     */
    private InspectionResult checkNetworkStatus(String hostname) {
        InspectionResult result = new InspectionResult();

        try {
            // Probe a few well-known hosts
            List<String> testHosts = Arrays.asList("8.8.8.8", "www.baidu.com", "www.google.com");
            int failedConnections = 0;
            StringBuilder failedHostList = new StringBuilder();

            for (String testHost : testHosts) {
                if (!isHostReachable(testHost)) {
                    failedConnections++;
                    if (failedHostList.length() > 0) {
                        failedHostList.append(", ");
                    }
                    failedHostList.append(testHost);
                }
            }

            result.setCheckResult(String.valueOf(failedConnections));
            result.setCheckData("{\"failedConnections\":" + failedConnections + ",\"failedHostList\":\"" + failedHostList.toString() + "\"}");

            if (failedConnections > 0) {
                result.setCheckStatus("WARNING");
                result.setCheckMessage("Network connectivity problem: " + failedHostList.toString());
            } else {
                result.setCheckStatus("OK");
                result.setCheckMessage("Network connectivity is normal");
            }

        } catch (Exception e) {
            result.setCheckStatus("ERROR");
            result.setCheckMessage("Failed to check network status: " + e.getMessage());
        }

        return result;
    }

    /**
     * Returns true if a single ping succeeds (Linux ping flags: one packet, 3-second wait).
     */
    private boolean isHostReachable(String hostname) {
        try {
            ProcessBuilder pb = new ProcessBuilder("ping", "-c", "1", "-W", "3", hostname);
            Process process = pb.start();
            int exitCode = process.waitFor();
            return exitCode == 0;
        } catch (Exception e) {
            return false;
        }
    }

    /**
     * Runs the security checks.
     */
    private InspectionResult checkSecurityScan(String hostname) {
        InspectionResult result = new InspectionResult();

        try {
            int securityIssues = 0;
            StringBuilder issueList = new StringBuilder();

            // Check open ports
            if (hasOpenPorts()) {
                securityIssues++;
                issueList.append("too many open ports, ");
            }

            // Check privileged users
            if (hasPrivilegedUsers()) {
                securityIssues++;
                issueList.append("too many privileged users, ");
            }

            // Check file permissions
            if (hasInsecurePermissions()) {
                securityIssues++;
                issueList.append("insecure file permissions, ");
            }

            result.setCheckResult(String.valueOf(securityIssues));
            result.setCheckData("{\"securityIssues\":" + securityIssues + ",\"issueList\":\"" + issueList.toString() + "\"}");

            if (securityIssues > 2) {
                result.setCheckStatus("CRITICAL");
                result.setCheckMessage("Multiple security issues found: " + issueList.toString());
            } else if (securityIssues > 0) {
                result.setCheckStatus("WARNING");
                result.setCheckMessage("Security issues found: " + issueList.toString());
            } else {
                result.setCheckStatus("OK");
                result.setCheckMessage("Security check passed");
            }

        } catch (Exception e) {
            result.setCheckStatus("ERROR");
            result.setCheckMessage("Security check failed: " + e.getMessage());
        }

        return result;
    }

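    /**
     * Returns true if too many TCP ports are listening.
     */
    private boolean hasOpenPorts() {
        try {
            Process process = new ProcessBuilder("netstat", "-tln").start();
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(process.getInputStream()))) {
                int openPorts = 0;
                String line;
                while ((line = reader.readLine()) != null) {
                    if (line.contains("LISTEN")) {
                        openPorts++;
                    }
                }
                return openPorts > 20; // more than 20 listening ports is treated as unsafe
            }
        } catch (Exception e) {
            return false;
        }
    }

    /**
     * Returns true if the sudo group has too many members.
     * The group name "sudo" is kept from the original; some distributions use "wheel" instead.
     */
    private boolean hasPrivilegedUsers() {
        try {
            // grep prints the group line, e.g. "sudo:x:27:alice,bob". Counting matching lines
            // with -c would only ever give 0 or 1, so the member list is parsed instead.
            Process process = new ProcessBuilder("grep", "^sudo:", "/etc/group").start();
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(process.getInputStream()))) {
                String line = reader.readLine();
                if (line != null) {
                    String[] fields = line.split(":", -1);
                    // The fourth field is the comma-separated member list
                    String members = fields.length > 3 ? fields[3].trim() : "";
                    int sudoUsers = members.isEmpty() ? 0 : members.split(",").length;
                    return sudoUsers > 5; // more than 5 sudo users is treated as unsafe
                }
            }
        } catch (Exception e) {
            // Ignore errors and report no issue
        }

        return false;
    }

    /**
     * Returns true if any regular file under /etc has mode 777.
     */
    private boolean hasInsecurePermissions() {
        try {
            ProcessBuilder pb = new ProcessBuilder("find", "/etc", "-type", "f", "-perm", "777");
            // ProcessBuilder does not interpret shell redirects such as 2>/dev/null,
            // so stderr is discarded here instead (requires Java 9+)
            pb.redirectError(ProcessBuilder.Redirect.DISCARD);
            Process process = pb.start();
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(process.getInputStream()))) {
                int insecureFiles = 0;
                while (reader.readLine() != null) {
                    insecureFiles++;
                }
                return insecureFiles > 0; // any file with mode 777 is treated as unsafe
            }
        } catch (Exception e) {
            return false;
        }
    }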

    /**
     * Parses the comma-separated target hosts; defaults to localhost.
     */
    private List<String> parseTargetHosts(String targetHosts) {
        if (StringUtils.isEmpty(targetHosts)) {
            return Arrays.asList("localhost");
        }

        // Tolerate spaces around the commas
        return Arrays.asList(targetHosts.trim().split("\\s*,\\s*"));
    }

    /**
     * Parses the comma-separated check items; defaults to CPU, memory, and disk.
     */
    private List<String> parseCheckItems(String checkItems) {
        if (StringUtils.isEmpty(checkItems)) {
            return Arrays.asList("CPU_USAGE", "MEMORY_USAGE", "DISK_USAGE");
        }

        // Tolerate spaces around the commas
        return Arrays.asList(checkItems.trim().split("\\s*,\\s*"));
    }

    /**
     * Updates the task status and last run time.
     */
    private void updateTaskStatus(Long taskId, String status) {
        try {
            InspectionTask task = new InspectionTask();
            task.setId(taskId);
            task.setTaskStatus(status);
            task.setLastRunTime(new Date());
            task.setUpdateTime(new Date());

            inspectionTaskMapper.updateById(task);

        } catch (Exception e) {
            log.error("Failed to update task status: taskId={}, status={}", taskId, status, e);
        }
    }

    /**
     * Caches the task in Redis for one hour.
     */
    private void updateTaskCache(InspectionTask task) {
        try {
            String cacheKey = "inspection:task:" + task.getId();
            redisTemplate.opsForValue().set(cacheKey, task, Duration.ofHours(1));

        } catch (Exception e) {
            log.warn("Failed to update task cache: {}", e.getMessage());
        }
    }

    /**
     * Caches the latest result per host and check item in Redis for one hour.
     */
    private void updateResultCache(InspectionResult result) {
        try {
            String cacheKey = "inspection:result:" + result.getHostname() + ":" + result.getCheckItem();
            redisTemplate.opsForValue().set(cacheKey, result, Duration.ofHours(1));

        } catch (Exception e) {
            log.warn("Failed to update result cache: {}", e.getMessage());
        }
    }

    /**
     * Raises an alert for CRITICAL and WARNING results.
     */
    private void checkInspectionAlert(InspectionResult result) {
        try {
            if ("CRITICAL".equals(result.getCheckStatus()) || "WARNING".equals(result.getCheckStatus())) {
                // Send the alert notification
                sendInspectionAlert(result);
            }

        } catch (Exception e) {
            log.error("Failed to evaluate inspection alert: {}", e.getMessage(), e);
        }
    }

    /**
     * Sends an inspection alert.
     */
    private void sendInspectionAlert(InspectionResult result) {
        try {
            AlertMessage alert = new AlertMessage();
            alert.setType("INSPECTION_ALERT");
            alert.setLevel(result.getCheckStatus());
            alert.setMessage(String.format("Inspection found %s: %s - %s",
                    result.getCheckStatus(), result.getCheckItem(), result.getCheckMessage()));
            alert.setTimestamp(new Date());
            alert.setHostname(result.getHostname());

            // Send the alert notification
            alertService.sendAlert(alert);

        } catch (Exception e) {
            log.error("Failed to send inspection alert: {}", e.getMessage(), e);
        }
    }
}
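
The external-command checks above (systemctl, ping) wait for the child process without a time limit, so one hung command can stall a whole inspection run. A minimal sketch of a bounded wait follows, assuming Java 9 or later; the CommandRunner class and its succeeds method are hypothetical helpers, not part of the original service.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: runs a command and reports success only if it exits with 0 in time.
class CommandRunner {

    static boolean succeeds(long timeoutSeconds, String... command) {
        Process process = null;
        try {
            process = new ProcessBuilder(command)
                    .redirectErrorStream(true)                       // merge stderr into stdout
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD) // discard output so a full pipe cannot block the child
                    .start();
            if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                return false; // timed out
            }
            return process.exitValue() == 0;
        } catch (IOException e) {
            return false; // command not found or could not be started
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        } finally {
            if (process != null && process.isAlive()) {
                process.destroyForcibly(); // do not leave a hung child behind
            }
        }
    }

    public static void main(String[] args) {
        // Same check as isServiceRunning, limited to five seconds
        System.out.println(succeeds(5, "systemctl", "is-active", "--quiet", "nginx"));
    }
}
```

A timeout is treated here like a failed check; whether it should map to CRITICAL or ERROR in the inspection result is a design decision left open.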

4. Inspection Reports

4.1 Inspection Report Service

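import java.util.*;
import java.util.stream.Collectors;

import lombok.Data;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

/**
 * Inspection report service.
 * Generates and analyzes inspection reports.
 */
@Slf4j
@Service
public class InspectionReportService {

    @Autowired
    private InspectionResultMapper inspectionResultMapper;

    @Autowired
    private InspectionTaskMapper inspectionTaskMapper;

    /**
     * Generates an inspection report for one host and time range.
     */
    public InspectionReport generateInspectionReport(String hostname, Date startTime, Date endTime) {
        try {
            // 1. Load the inspection results (selectByHostnameAndTimeRange is a custom
            //    mapper method that is not shown in this article)
            List<InspectionResult> results = inspectionResultMapper.selectByHostnameAndTimeRange(hostname, startTime, endTime);

            // 2. Analyze the results
            InspectionReport report = analyzeInspectionResults(results);

            // 3. Build the report summary
            generateReportSummary(report);

            // 4. Build the trend analysis
            generateTrendAnalysis(report, results);

            // 5. Build the recommendations
            generateRecommendations(report, results);

            return report;

        } catch (Exception e) {
            log.error("Failed to generate inspection report: {}", e.getMessage(), e);
            return new InspectionReport();
        }
    }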

    /**
     * Counts results per status and computes the health score.
     */
    private InspectionReport analyzeInspectionResults(List<InspectionResult> results) {
        InspectionReport report = new InspectionReport();

        // Count results per check status
        Map<String, Long> statusCount = results.stream()
                .collect(Collectors.groupingBy(InspectionResult::getCheckStatus, Collectors.counting()));

        // size() returns an int, which cannot be boxed to Long without a cast
        report.setTotalChecks((long) results.size());
        report.setOkChecks(statusCount.getOrDefault("OK", 0L));
        report.setWarningChecks(statusCount.getOrDefault("WARNING", 0L));
        report.setCriticalChecks(statusCount.getOrDefault("CRITICAL", 0L));
        report.setErrorChecks(statusCount.getOrDefault("ERROR", 0L));
        report.setGenerateTime(new Date());

        // Compute the health score
        double healthScore = calculateHealthScore(report);
        report.setHealthScore(healthScore);

        return report;
    }

    /**
     * Computes the health score as a weighted share of all checks, in percent.
     * Weights: OK 1.0, WARNING 0.7, CRITICAL 0.3, ERROR 0.0.
     */
    private double calculateHealthScore(InspectionReport report) {
        if (report.getTotalChecks() == 0) {
            return 0.0;
        }

        double okWeight = report.getOkChecks() * 1.0;
        double warningWeight = report.getWarningChecks() * 0.7;
        double criticalWeight = report.getCriticalChecks() * 0.3;
        double errorWeight = report.getErrorChecks() * 0.0;

        return (okWeight + warningWeight + criticalWeight + errorWeight) / report.getTotalChecks() * 100;
    }

    /**
     * Builds the report summary.
     */
    private void generateReportSummary(InspectionReport report) {
        StringBuilder summary = new StringBuilder();

        summary.append("Inspection report summary:\n");
        summary.append("Total checks: ").append(report.getTotalChecks()).append("\n");
        summary.append("OK checks: ").append(report.getOkChecks()).append("\n");
        summary.append("Warning checks: ").append(report.getWarningChecks()).append("\n");
        summary.append("Critical checks: ").append(report.getCriticalChecks()).append("\n");
        summary.append("Error checks: ").append(report.getErrorChecks()).append("\n");
        summary.append("System health score: ").append(String.format("%.2f", report.getHealthScore())).append("%\n");

        report.setSummary(summary.toString());
    }

    /**
     * Builds the trend analysis.
     */
    private void generateTrendAnalysis(InspectionReport report, List<InspectionResult> results) {
        // Group the results by check item and analyze each group
        Map<String, List<InspectionResult>> resultsByItem = results.stream()
                .collect(Collectors.groupingBy(InspectionResult::getCheckItem));

        Map<String, String> trends = new HashMap<>();

        for (Map.Entry<String, List<InspectionResult>> entry : resultsByItem.entrySet()) {
            String checkItem = entry.getKey();
            List<InspectionResult> itemResults = entry.getValue();

            // Analyze the trend for this check item
            String trend = analyzeItemTrend(checkItem, itemResults);
            trends.put(checkItem, trend);
        }

        report.setTrends(trends);
    }

    /**
     * Summarizes the status distribution of one check item.
     */
    private String analyzeItemTrend(String checkItem, List<InspectionResult> results) {
        if (results.size() < 2) {
            return "Not enough data to analyze the trend";
        }

        // Sort by check time (the status counts below do not depend on the order)
        results.sort(Comparator.comparing(InspectionResult::getCheckTime));

        // Count the results per status
        long okCount = results.stream().filter(r -> "OK".equals(r.getCheckStatus())).count();
        long warningCount = results.stream().filter(r -> "WARNING".equals(r.getCheckStatus())).count();
        long criticalCount = results.stream().filter(r -> "CRITICAL".equals(r.getCheckStatus())).count();

        if (criticalCount > 0) {
            return "Critical problems present, immediate action required";
        } else if (warningCount > okCount) {
            return "Frequent warnings, needs attention";
        } else if (warningCount > 0) {
            return "Occasional warnings, good overall";
        } else {
            return "Stable, in good condition";
        }
    }

    /**
     * Builds recommendations, one per CRITICAL or WARNING result.
     */
    private void generateRecommendations(InspectionReport report, List<InspectionResult> results) {
        List<String> recommendations = new ArrayList<>();

        for (InspectionResult result : results) {
            if ("CRITICAL".equals(result.getCheckStatus())) {
                recommendations.add(generateCriticalRecommendation(result));
            } else if ("WARNING".equals(result.getCheckStatus())) {
                recommendations.add(generateWarningRecommendation(result));
            }
        }

        report.setRecommendations(recommendations);
    }

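    /**
     * Returns the recommendation for a critical result.
     */
    private String generateCriticalRecommendation(InspectionResult result) {
        switch (result.getCheckItem()) {
            case "CPU_USAGE":
                return "CPU usage is too high: check which processes consume CPU, and consider adding CPU resources or optimizing the application";
            case "MEMORY_USAGE":
                return "Memory usage is too high: check for memory leaks, and consider adding memory or optimizing the application";
            case "DISK_USAGE":
                return "Disk usage is too high: clean up unused files, and consider expanding capacity or migrating data";
            case "SERVICE_STATUS":
                return "A critical service is down: check the service status immediately, then restart the service or fix its configuration";
            default:
                return "Critical problem found, handle immediately: " + result.getCheckMessage();
        }
    }

    /**
     * Returns the recommendation for a warning result.
     */
    private String generateWarningRecommendation(InspectionResult result) {
        switch (result.getCheckItem()) {
            case "CPU_USAGE":
                return "CPU usage is high: monitor process activity to prevent performance problems";
            case "MEMORY_USAGE":
                return "Memory usage is high: monitor memory consumption to prevent running out of memory";
            case "DISK_USAGE":
                return "Disk usage is high: clean up files regularly to prevent running out of space";
            case "NETWORK_STATUS":
                return "Network connectivity problem: check the network configuration and connection state";
            case "SECURITY_SCAN":
                return "Security issues found: strengthen security protections";
            default:
                return "Warning found, needs attention: " + result.getCheckMessage();
        }
    }
}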

/**
 * Inspection report entity.
 */
@Data
public class InspectionReport {
    private Long totalChecks;             // total number of checks
    private Long okChecks;                // number of OK checks
    private Long warningChecks;           // number of warning checks
    private Long criticalChecks;          // number of critical checks
    private Long errorChecks;             // number of error checks
    private Double healthScore;           // health score (percent)
    private String summary;               // report summary
    private Map<String, String> trends;   // trend analysis per check item
    private List<String> recommendations; // recommendations
    private Date generateTime;            // generation time
}
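
To make the health score concrete: with 8 OK checks, 1 warning, 1 critical, and no errors, the weighted sum is 8 × 1.0 + 1 × 0.7 + 1 × 0.3 = 9.0, and 9.0 / 10 × 100 gives a score of 90. The standalone sketch below reproduces the weights from calculateHealthScore; the HealthScore class is hypothetical, and unlike the service it assumes the total is the sum of the four counted statuses.

```java
// Standalone illustration of the weighting used by calculateHealthScore.
class HealthScore {

    // Weights from the service: OK 1.0, WARNING 0.7, CRITICAL 0.3, ERROR 0.0.
    static double score(long ok, long warning, long critical, long error) {
        long total = ok + warning + critical + error; // assumption: no other statuses
        if (total == 0) {
            return 0.0; // same convention as the service: no checks means a score of 0
        }
        double weighted = ok * 1.0 + warning * 0.7 + critical * 0.3 + error * 0.0;
        return weighted / total * 100;
    }

    public static void main(String[] args) {
        System.out.println(score(8, 1, 1, 0));  // roughly 90
        System.out.println(score(10, 0, 0, 0)); // all checks OK
        System.out.println(score(0, 0, 0, 4));  // only errors
    }
}
```

Because a critical result still contributes 0.3, a host on which every check is critical scores 30 rather than 0, which is worth keeping in mind when choosing an alerting threshold on the score.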

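5. Summary

This article described a solution for automated operations inspection, including:

5.1 Core Technical Points

  1. Inspection task management: creating, scheduling, and executing inspection tasks
  2. System inspection: CPU, memory, and disk usage checks
  3. Service inspection: status checks for critical services
  4. Network inspection: connectivity checks
  5. Security inspection: system security checks
  6. Report generation: generating and analyzing inspection reports

5.2 Architecture Advantages

  1. Automated execution: inspection tasks run automatically on a schedule
  2. Multi-dimensional checks: system, service, network, and security are all covered
  3. Smart alerting: alerts are driven by check results
  4. Report analysis: detailed inspection reports and trend analysis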
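
On trend analysis: analyzeItemTrend in section 4.1 counts statuses and does not use the time order of the results. If the measured values (for example the CPU percentages stored in checkResult) are available in time order, a simple direction can be derived by comparing the earlier half of the series with the later half. The sketch below is one possible approach under that assumption, not the article's method; TrendSketch, its labels, and the tolerance parameter are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: classifies a time-ordered series as rising, falling, or stable.
class TrendSketch {

    // Compares the mean of the later half with the mean of the earlier half.
    // Differences within the tolerance (same unit as the values) count as stable.
    static String trend(List<Double> valuesInTimeOrder, double tolerance) {
        int n = valuesInTimeOrder.size();
        if (n < 2) {
            return "INSUFFICIENT_DATA";
        }
        int mid = n / 2;
        double early = mean(valuesInTimeOrder.subList(0, mid));
        double late = mean(valuesInTimeOrder.subList(mid, n));
        if (late - early > tolerance) {
            return "RISING";
        }
        if (early - late > tolerance) {
            return "FALLING";
        }
        return "STABLE";
    }

    private static double mean(List<Double> values) {
        return values.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
    }

    public static void main(String[] args) {
        // CPU usage climbing from about 40% to about 80%
        System.out.println(trend(Arrays.asList(40.0, 45.0, 70.0, 80.0), 5.0)); // prints RISING
    }
}
```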

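5.3 Best Practices

  1. Inspection strategy: choose a reasonable inspection frequency and set of check items
  2. Alert strategy: grade alerts by severity and avoid alert storms
  3. Report analysis: review inspection reports regularly to uncover system problems
  4. Automation: automate both inspection and repair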
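
One common way to avoid alert storms is to suppress repeats of the same alert within a time window. The sketch below illustrates the idea under stated assumptions: AlertThrottle and shouldSend are hypothetical names, and the key of host, check item, and level is one possible choice. In the service above, such a check could sit in front of the call to sendInspectionAlert.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper: allows at most one alert per key within a suppression window.
class AlertThrottle {
    private final long windowMillis;
    private final Map<String, Long> lastSent = new ConcurrentHashMap<>();

    AlertThrottle(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    // Returns true if an alert for this host, check item, and level should be sent now.
    // The current time is passed in so the behavior is easy to test.
    boolean shouldSend(String hostname, String checkItem, String level, long nowMillis) {
        String key = hostname + ":" + checkItem + ":" + level;
        boolean[] send = {false};
        // compute() makes the read-and-update atomic per key
        lastSent.compute(key, (k, previous) -> {
            if (previous == null || nowMillis - previous >= windowMillis) {
                send[0] = true;
                return nowMillis;
            }
            return previous;
        });
        return send[0];
    }

    public static void main(String[] args) {
        AlertThrottle throttle = new AlertThrottle(60_000); // one alert per minute per key
        long now = System.currentTimeMillis();
        System.out.println(throttle.shouldSend("web-01", "CPU_USAGE", "WARNING", now));         // prints true
        System.out.println(throttle.shouldSend("web-01", "CPU_USAGE", "WARNING", now + 1_000)); // prints false
    }
}
```

Including the level in the key means an escalation from WARNING to CRITICAL is sent immediately instead of being swallowed by the window.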

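With this architecture, you can build an automated operations inspection system that continuously monitors system health and helps prevent problems.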