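1. Kubernetes Operations Monitoring Overview

As a container orchestration platform, Kubernetes needs dedicated operations monitoring and management when it runs in production. This article walks through a solution covering K8s cluster monitoring, Pod management, service discovery, and resource tuning, to help operations engineers manage Kubernetes clusters effectively.

1.1 Core Challenges

  1. Cluster monitoring: track the state of the K8s cluster and its nodes in real time
  2. Pod management: manage the Pod lifecycle and Pod resource usage
  3. Service discovery: manage service registration and load balancing
  4. Resource tuning: optimize resource allocation and scheduling policy
  5. Fault diagnosis: locate and resolve K8s problems quickly

1.2 Technical Architecture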

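K8s monitoring → Data collection → Performance analysis → Alert notification → Automatic optimization
↓ ↓ ↓ ↓ ↓
Cluster metrics → Monitoring agent → Data storage → Alert engine → Tuning scripts
↓ ↓ ↓ ↓ ↓
Pod management → Service discovery → Resource scheduling → Automatic repair → Operations log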

2. K8s Monitoring System

2.1 Maven Dependencies

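<!-- pom.xml (dependencies without a version are assumed to be managed by the Spring Boot parent POM) -->
<dependencies>
    <!-- Spring Boot Web -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>

    <!-- Spring Boot Data Redis -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-data-redis</artifactId>
    </dependency>

    <!-- Kubernetes Java client. The fluent call style used in this article,
         for example listNode().execute(), needs 20.0.0 or later; 17.0.0 does not have it. -->
    <dependency>
        <groupId>io.kubernetes</groupId>
        <artifactId>client-java</artifactId>
        <version>20.0.0</version>
    </dependency>

    <!-- Micrometer Prometheus registry -->
    <dependency>
        <groupId>io.micrometer</groupId>
        <artifactId>micrometer-registry-prometheus</artifactId>
    </dependency>

    <!-- MyBatis-Plus -->
    <dependency>
        <groupId>com.baomidou</groupId>
        <artifactId>mybatis-plus-boot-starter</artifactId>
        <version>3.5.2</version>
    </dependency>

    <!-- Lombok: supplies the @Data annotation and the log field used by the classes below -->
    <dependency>
        <groupId>org.projectlombok</groupId>
        <artifactId>lombok</artifactId>
        <optional>true</optional>
    </dependency>
</dependencies>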

2.2 Application Configuration

# application.yml
server:
  port: 8080

spring:
  redis:
    host: localhost
    port: 6379
    database: 0

# K8s monitoring settings
k8s-monitor:
  cluster-name: "production-cluster"
  namespace: "default"
  collection-interval: 10000       # collection interval (milliseconds)
  pod-alert-threshold: 80          # Pod resource alert threshold (%)
  node-alert-threshold: 85         # node resource alert threshold (%)
  service-monitor-enabled: true    # enable Service monitoring
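Settings like these are easier to trust when they are checked once at startup rather than at first use. The sketch below is a plain-Java illustration of that idea; the class name `K8sMonitorSettings` and its validation rules are assumptions made for this example and are not part of the original project, which may bind the same keys through Spring instead.

```java
/** Hypothetical holder for the k8s-monitor settings; it rejects bad values up front. */
final class K8sMonitorSettings {
    final String clusterName;
    final String namespace;
    final long collectionIntervalMs;
    final int podAlertThreshold;
    final int nodeAlertThreshold;

    K8sMonitorSettings(String clusterName, String namespace, long collectionIntervalMs,
                       int podAlertThreshold, int nodeAlertThreshold) {
        this.clusterName = requireNonBlank(clusterName, "cluster-name");
        this.namespace = requireNonBlank(namespace, "namespace");
        if (collectionIntervalMs <= 0) {
            throw new IllegalArgumentException("collection-interval must be positive");
        }
        this.collectionIntervalMs = collectionIntervalMs;
        this.podAlertThreshold = requirePercent(podAlertThreshold, "pod-alert-threshold");
        this.nodeAlertThreshold = requirePercent(nodeAlertThreshold, "node-alert-threshold");
    }

    private static String requireNonBlank(String value, String key) {
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException(key + " must not be blank");
        }
        return value;
    }

    // Thresholds are percentages, so anything outside 1..100 is a configuration mistake
    private static int requirePercent(int value, String key) {
        if (value < 1 || value > 100) {
            throw new IllegalArgumentException(key + " must be between 1 and 100, got " + value);
        }
        return value;
    }
}
```

Constructing `new K8sMonitorSettings("production-cluster", "default", 10000, 80, 85)` mirrors the values in the YAML above; a threshold of 120 would fail immediately with an `IllegalArgumentException`.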

3. K8s Monitoring Service

3.1 Monitoring Entity Classes

import com.baomidou.mybatisplus.annotation.IdType;
import com.baomidou.mybatisplus.annotation.TableId;
import com.baomidou.mybatisplus.annotation.TableName;
import lombok.Data;

import java.util.Date;

/**
 * Entity for K8s cluster monitoring records.
 */
@Data
@TableName("k8s_cluster_monitor")
public class K8sClusterMonitor {

    @TableId(type = IdType.AUTO)
    private Long id;                  // primary key

    private String clusterName;       // cluster name

    private String hostname;          // hostname of the collector

    private String ip;                // IP address of the collector

    private Integer nodeCount;        // number of nodes

    private Integer podCount;         // number of Pods

    private Integer serviceCount;     // number of Services

    private Integer namespaceCount;   // number of namespaces

    private Double cpuUsage;          // CPU usage (%)

    private Double memoryUsage;       // memory usage (%)

    private Long totalCpu;            // total CPU (millicores)

    private Long usedCpu;             // used CPU (millicores)

    private Long totalMemory;         // total memory (bytes)

    private Long usedMemory;          // used memory (bytes)

    private String clusterStatus;     // cluster status

    private Date collectTime;         // collection time

    private Date createTime;          // record creation time
}

/**
 * Entity for K8s Pod monitoring records.
 */
@Data
@TableName("k8s_pod_monitor")
public class K8sPodMonitor {

    @TableId(type = IdType.AUTO)
    private Long id;                     // primary key

    private String clusterName;          // cluster name

    private String namespace;            // namespace

    private String podName;              // Pod name

    private String nodeName;             // node the Pod runs on

    private String podStatus;            // Pod status

    private String podPhase;             // Pod phase (Pending, Running, Succeeded, Failed, Unknown)

    private Integer restartCount;        // restart count

    private Long cpuRequest;             // CPU request (millicores)

    private Long cpuLimit;               // CPU limit (millicores)

    private Long memoryRequest;          // memory request (bytes)

    private Long memoryLimit;            // memory limit (bytes)

    private Long cpuUsage;               // CPU usage (millicores)

    private Long memoryUsage;            // memory usage (bytes)

    private Double cpuUsagePercent;      // CPU usage (%)

    private Double memoryUsagePercent;   // memory usage (%)

    private Date startTime;              // Pod start time

    private Date collectTime;            // collection time

    private Date createTime;             // record creation time
}
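The entity stores both absolute usage and a percentage, so the two need a consistent conversion. A minimal sketch of one possible rule follows, assuming the percentage is measured against the Pod's limit and that a missing limit (stored as 0) yields 0%; the helper name `UsagePercent` is hypothetical and the original article does not state which denominator it intends.

```java
/** Hypothetical helper: converts absolute usage into a percentage of the configured limit. */
final class UsagePercent {
    private UsagePercent() {}

    /**
     * @param used  consumed amount (millicores or bytes)
     * @param limit configured limit in the same unit; 0 or negative means "no limit set"
     * @return usage as a percentage of the limit, or 0.0 when there is no limit to compare against
     */
    static double of(long used, long limit) {
        if (limit <= 0) {
            return 0.0; // avoids division by zero for Pods without limits
        }
        return used * 100.0 / limit;
    }
}
```

For example, a Pod using 250 millicores against a 1000-millicore limit gives `UsagePercent.of(250, 1000)`, which is 25.0.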

3.2 Monitoring Service Implementation

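import io.kubernetes.client.custom.Quantity;
import io.kubernetes.client.openapi.ApiClient;
import io.kubernetes.client.openapi.apis.CoreV1Api;
import io.kubernetes.client.openapi.models.*;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Service;

import java.net.InetAddress;
import java.net.UnknownHostException;
import java.time.Duration;
import java.util.Date;
import java.util.List;
import java.util.Map;

/**
 * K8s monitoring service.
 * Collects, stores, and analyzes K8s cluster data.
 * The mapper interfaces, AlertService, and AlertMessage are application classes not shown in this article.
 */
@Slf4j
@Service
public class K8sMonitorService {

    @Autowired
    private K8sClusterMonitorMapper k8sClusterMonitorMapper;

    @Autowired
    private K8sPodMonitorMapper k8sPodMonitorMapper;

    @Autowired
    private RedisTemplate<String, Object> redisTemplate;

    @Autowired
    private AlertService alertService;

    // Must be registered as a bean, for example one built with io.kubernetes.client.util.Config.defaultClient()
    @Autowired
    private ApiClient apiClient;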

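    /**
     * Collect K8s data on a fixed schedule: cluster, Pods, nodes, and Services.
     * Scheduling only runs if @EnableScheduling is present on a configuration class.
     */
    @Scheduled(fixedRateString = "${k8s-monitor.collection-interval:10000}") // every 10 seconds by default
    public void collectK8sData() {
        try {
            // 1. Cluster information
            collectClusterInfo();

            // 2. Pod information
            collectPodInfo();

            // 3. Node information
            collectNodeInfo();

            // 4. Service information
            collectServiceInfo();

        } catch (Exception e) {
            log.error("Failed to collect K8s data: {}", e.getMessage(), e);
        }
    }

    // The article does not show node or Service collection. These empty placeholders only
    // keep the class compiling; implement them along the lines of collectPodInfo().
    private void collectNodeInfo() {
    }

    private void collectServiceInfo() {
    }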

    /**
     * Collect cluster information.
     */
    private void collectClusterInfo() {
        try {
            // 1. Fetch cluster information
            K8sClusterInfo clusterInfo = getClusterInfo();

            // 2. Build the monitoring record
            K8sClusterMonitor monitorData = createClusterMonitorData(clusterInfo);

            // 3. Persist it
            k8sClusterMonitorMapper.insert(monitorData);

            // 4. Refresh the cache
            updateClusterCache(monitorData);

            // 5. Evaluate cluster alerts
            checkClusterAlert(monitorData);

            log.debug("Collected cluster info: clusterName={}, nodeCount={}, podCount={}",
                    monitorData.getClusterName(), monitorData.getNodeCount(), monitorData.getPodCount());

        } catch (Exception e) {
            log.error("Failed to collect cluster info: {}", e.getMessage(), e);
        }
    }

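    /**
     * Fetch cluster-level information from the API server.
     */
    private K8sClusterInfo getClusterInfo() {
        K8sClusterInfo clusterInfo = new K8sClusterInfo();
        // Hard-coded to match k8s-monitor.cluster-name in application.yml. Without a value here
        // the cluster cache and alert keys end in "null". Inject the property in real code.
        clusterInfo.setClusterName("production-cluster");

        try {
            CoreV1Api coreV1Api = new CoreV1Api(apiClient);

            // Nodes
            V1NodeList nodeList = coreV1Api.listNode().execute();
            clusterInfo.setNodeCount(nodeList.getItems().size());

            // Pods
            V1PodList podList = coreV1Api.listPodForAllNamespaces().execute();
            clusterInfo.setPodCount(podList.getItems().size());

            // Services
            V1ServiceList serviceList = coreV1Api.listServiceForAllNamespaces().execute();
            clusterInfo.setServiceCount(serviceList.getItems().size());

            // Namespaces
            V1NamespaceList namespaceList = coreV1Api.listNamespace().execute();
            clusterInfo.setNamespaceCount(namespaceList.getItems().size());

            // Resource figures
            calculateResourceUsage(clusterInfo, nodeList, podList);

            // The API server answered every call, so report the cluster as running
            clusterInfo.setClusterStatus("Running");

        } catch (Exception e) {
            log.error("Failed to fetch cluster info: {}", e.getMessage(), e);
        }

        return clusterInfo;
    }

    /**
     * Compute cluster resource figures.
     * "Total" is node capacity summed over all nodes. "Used" is the sum of container resource
     * requests of Pods that have not terminated, that is, how much of the cluster the scheduler
     * has already reserved. Node allocatable is not consumption (it is capacity minus system
     * reservations), and live usage is only available from the Metrics API (metrics-server).
     */
    private void calculateResourceUsage(K8sClusterInfo clusterInfo, V1NodeList nodeList, V1PodList podList) {
        try {
            long totalCpu = 0;      // millicores
            long usedCpu = 0;       // millicores
            long totalMemory = 0;   // bytes
            long usedMemory = 0;    // bytes

            // Total: capacity is a Map<String, Quantity>; toSuffixedString() yields "4", "500m", "16Gi", ...
            for (V1Node node : nodeList.getItems()) {
                V1NodeStatus status = node.getStatus();
                Map<String, Quantity> capacity = status != null ? status.getCapacity() : null;
                if (capacity == null) {
                    continue;
                }
                if (capacity.containsKey("cpu")) {
                    totalCpu += parseCpu(capacity.get("cpu").toSuffixedString());
                }
                if (capacity.containsKey("memory")) {
                    totalMemory += parseMemory(capacity.get("memory").toSuffixedString());
                }
            }

            // Used: requests of every container in Pods that still hold their reservation
            for (V1Pod pod : podList.getItems()) {
                String phase = pod.getStatus() != null ? pod.getStatus().getPhase() : null;
                if ("Succeeded".equals(phase) || "Failed".equals(phase) || pod.getSpec() == null) {
                    continue;
                }
                for (V1Container container : pod.getSpec().getContainers()) {
                    V1ResourceRequirements resources = container.getResources();
                    Map<String, Quantity> requests = resources != null ? resources.getRequests() : null;
                    if (requests == null) {
                        continue;
                    }
                    if (requests.containsKey("cpu")) {
                        usedCpu += parseCpu(requests.get("cpu").toSuffixedString());
                    }
                    if (requests.containsKey("memory")) {
                        usedMemory += parseMemory(requests.get("memory").toSuffixedString());
                    }
                }
            }

            clusterInfo.setTotalCpu(totalCpu);
            clusterInfo.setUsedCpu(usedCpu);
            clusterInfo.setTotalMemory(totalMemory);
            clusterInfo.setUsedMemory(usedMemory);

            // Percentages
            if (totalCpu > 0) {
                clusterInfo.setCpuUsage((double) usedCpu / totalCpu * 100);
            }
            if (totalMemory > 0) {
                clusterInfo.setMemoryUsage((double) usedMemory / totalMemory * 100);
            }

        } catch (Exception e) {
            log.error("Failed to compute resource usage: {}", e.getMessage(), e);
        }
    }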

    /**
     * Parse a Kubernetes CPU quantity such as "2", "0.5", or "500m" into millicores.
     */
    private long parseCpu(String cpuStr) {
        try {
            // Quantity.fromString understands every suffix the API server can return,
            // including fractional values that Long.parseLong would reject
            return Quantity.fromString(cpuStr).getNumber()
                    .multiply(java.math.BigDecimal.valueOf(1000))
                    .longValue();
        } catch (Exception e) {
            log.warn("Could not parse CPU quantity '{}', counting it as 0", cpuStr);
            return 0;
        }
    }

    /**
     * Parse a Kubernetes memory quantity such as "128974848", "512Mi", "4Gi", or "1G" into bytes.
     */
    private long parseMemory(String memoryStr) {
        try {
            // Covers binary (Ki, Mi, Gi, Ti, ...) and decimal (k, M, G, T, ...) suffixes
            return Quantity.fromString(memoryStr).getNumber().longValue();
        } catch (Exception e) {
            log.warn("Could not parse memory quantity '{}', counting it as 0", memoryStr);
            return 0;
        }
    }

    /**
     * Build the cluster monitoring record.
     */
    private K8sClusterMonitor createClusterMonitorData(K8sClusterInfo clusterInfo) {
        K8sClusterMonitor monitorData = new K8sClusterMonitor();

        // Basic information
        monitorData.setClusterName(clusterInfo.getClusterName());
        monitorData.setHostname(getHostname());
        monitorData.setIp(getLocalIpAddress());
        monitorData.setCollectTime(new Date());
        monitorData.setCreateTime(new Date());

        // Object counts
        monitorData.setNodeCount(clusterInfo.getNodeCount());
        monitorData.setPodCount(clusterInfo.getPodCount());
        monitorData.setServiceCount(clusterInfo.getServiceCount());
        monitorData.setNamespaceCount(clusterInfo.getNamespaceCount());

        // Resource figures
        monitorData.setCpuUsage(clusterInfo.getCpuUsage());
        monitorData.setMemoryUsage(clusterInfo.getMemoryUsage());
        monitorData.setTotalCpu(clusterInfo.getTotalCpu());
        monitorData.setUsedCpu(clusterInfo.getUsedCpu());
        monitorData.setTotalMemory(clusterInfo.getTotalMemory());
        monitorData.setUsedMemory(clusterInfo.getUsedMemory());

        // Status
        monitorData.setClusterStatus(clusterInfo.getClusterStatus());

        return monitorData;
    }

    /**
     * Collect Pod information.
     */
    private void collectPodInfo() {
        try {
            CoreV1Api coreV1Api = new CoreV1Api(apiClient);
            V1PodList podList = coreV1Api.listPodForAllNamespaces().execute();

            for (V1Pod pod : podList.getItems()) {
                try {
                    // Build the Pod monitoring record
                    K8sPodMonitor podMonitor = createPodMonitorData(pod);

                    // Persist it
                    k8sPodMonitorMapper.insert(podMonitor);

                    // Refresh the cache
                    updatePodCache(podMonitor);

                    // Evaluate Pod alerts
                    checkPodAlert(podMonitor);

                } catch (Exception e) {
                    // One bad Pod must not stop the rest of the batch
                    log.error("Failed to process Pod: podName={}, error={}",
                            pod.getMetadata().getName(), e.getMessage());
                }
            }

        } catch (Exception e) {
            log.error("Failed to collect Pod info: {}", e.getMessage(), e);
        }
    }

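    /**
     * Build the Pod monitoring record.
     */
    private K8sPodMonitor createPodMonitorData(V1Pod pod) {
        K8sPodMonitor podMonitor = new K8sPodMonitor();

        V1ObjectMeta metadata = pod.getMetadata();
        V1PodStatus status = pod.getStatus();
        V1PodSpec spec = pod.getSpec();

        // Basic information
        podMonitor.setClusterName("production-cluster");
        podMonitor.setNamespace(metadata.getNamespace());
        podMonitor.setPodName(metadata.getName());
        podMonitor.setNodeName(spec.getNodeName());
        podMonitor.setCollectTime(new Date());
        podMonitor.setCreateTime(new Date());
        if (status.getStartTime() != null) {
            podMonitor.setStartTime(Date.from(status.getStartTime().toInstant()));
        }

        // The phase is coarse (Pending, Running, Succeeded, Failed, Unknown). Reasons such as
        // CrashLoopBackOff are reported per container in the waiting state, so a waiting reason
        // replaces the phase as the status when one is present. The restart count is the sum of
        // the per-container restart counters, not the number of containers.
        String podStatus = status.getPhase();
        int restartCount = 0;
        List<V1ContainerStatus> containerStatuses = status.getContainerStatuses();
        if (containerStatuses != null) {
            for (V1ContainerStatus containerStatus : containerStatuses) {
                restartCount += containerStatus.getRestartCount();
                V1ContainerState state = containerStatus.getState();
                if (state != null && state.getWaiting() != null && state.getWaiting().getReason() != null) {
                    podStatus = state.getWaiting().getReason();
                }
            }
        }
        podMonitor.setPodStatus(podStatus);
        podMonitor.setPodPhase(status.getPhase());
        podMonitor.setRestartCount(restartCount);

        // Requests and limits
        setPodResourceInfo(podMonitor, spec);

        // Live usage
        setPodUsageInfo(podMonitor, pod);

        return podMonitor;
    }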

    /**
     * Sum the CPU and memory requests and limits over all containers of the Pod.
     */
    private void setPodResourceInfo(K8sPodMonitor podMonitor, V1PodSpec spec) {
        try {
            List<V1Container> containers = spec.getContainers();
            if (containers != null) {
                long totalCpuRequest = 0;
                long totalCpuLimit = 0;
                long totalMemoryRequest = 0;
                long totalMemoryLimit = 0;

                for (V1Container container : containers) {
                    V1ResourceRequirements resources = container.getResources();
                    if (resources != null) {
                        Map<String, Quantity> requests = resources.getRequests();
                        Map<String, Quantity> limits = resources.getLimits();

                        // Quantity.toString() is a debug representation; toSuffixedString()
                        // returns the Kubernetes notation ("500m", "256Mi") the parsers expect
                        if (requests != null) {
                            if (requests.containsKey("cpu")) {
                                totalCpuRequest += parseCpu(requests.get("cpu").toSuffixedString());
                            }
                            if (requests.containsKey("memory")) {
                                totalMemoryRequest += parseMemory(requests.get("memory").toSuffixedString());
                            }
                        }

                        if (limits != null) {
                            if (limits.containsKey("cpu")) {
                                totalCpuLimit += parseCpu(limits.get("cpu").toSuffixedString());
                            }
                            if (limits.containsKey("memory")) {
                                totalMemoryLimit += parseMemory(limits.get("memory").toSuffixedString());
                            }
                        }
                    }
                }

                podMonitor.setCpuRequest(totalCpuRequest);
                podMonitor.setCpuLimit(totalCpuLimit);
                podMonitor.setMemoryRequest(totalMemoryRequest);
                podMonitor.setMemoryLimit(totalMemoryLimit);
            }

        } catch (Exception e) {
            log.error("Failed to set Pod resource info: {}", e.getMessage(), e);
        }
    }

    /**
     * Set the Pod's live usage figures.
     */
    private void setPodUsageInfo(K8sPodMonitor podMonitor, V1Pod pod) {
        try {
            // Real figures have to come from the Metrics API (metrics-server).
            // This simplified version stores zeros as placeholders.
            podMonitor.setCpuUsage(0L);
            podMonitor.setMemoryUsage(0L);
            podMonitor.setCpuUsagePercent(0.0);
            podMonitor.setMemoryUsagePercent(0.0);

        } catch (Exception e) {
            log.error("Failed to set Pod usage info: {}", e.getMessage(), e);
        }
    }

    /**
     * Refresh the cluster cache entry (expires after 5 minutes).
     */
    private void updateClusterCache(K8sClusterMonitor monitorData) {
        try {
            String cacheKey = "k8s:cluster:" + monitorData.getClusterName();
            redisTemplate.opsForValue().set(cacheKey, monitorData, Duration.ofMinutes(5));

        } catch (Exception e) {
            log.warn("Failed to update cluster cache: {}", e.getMessage());
        }
    }

    /**
     * Refresh the Pod cache entry (expires after 5 minutes).
     */
    private void updatePodCache(K8sPodMonitor podMonitor) {
        try {
            String cacheKey = "k8s:pod:" + podMonitor.getNamespace() + ":" + podMonitor.getPodName();
            redisTemplate.opsForValue().set(cacheKey, podMonitor, Duration.ofMinutes(5));

        } catch (Exception e) {
            log.warn("Failed to update Pod cache: {}", e.getMessage());
        }
    }

    /**
     * Evaluate cluster alerts.
     * CPU and memory are checked independently so that a memory alert cannot overwrite a
     * CPU alert raised in the same run. Null percentages (collection failed) are skipped.
     */
    private void checkClusterAlert(K8sClusterMonitor monitorData) {
        try {
            // CPU usage
            Double cpuUsage = monitorData.getCpuUsage();
            if (cpuUsage != null) {
                if (cpuUsage > 90) {
                    sendClusterAlert(monitorData, "K8S_CLUSTER_CPU_HIGH", "CRITICAL",
                            String.format("Cluster CPU usage is critical: %.2f%%", cpuUsage));
                } else if (cpuUsage > 80) {
                    sendClusterAlert(monitorData, "K8S_CLUSTER_CPU_WARNING", "WARNING",
                            String.format("Cluster CPU usage is high: %.2f%%", cpuUsage));
                }
            }

            // Memory usage
            Double memoryUsage = monitorData.getMemoryUsage();
            if (memoryUsage != null) {
                if (memoryUsage > 90) {
                    sendClusterAlert(monitorData, "K8S_CLUSTER_MEMORY_HIGH", "CRITICAL",
                            String.format("Cluster memory usage is critical: %.2f%%", memoryUsage));
                } else if (memoryUsage > 80) {
                    sendClusterAlert(monitorData, "K8S_CLUSTER_MEMORY_WARNING", "WARNING",
                            String.format("Cluster memory usage is high: %.2f%%", memoryUsage));
                }
            }

        } catch (Exception e) {
            log.error("Failed to evaluate cluster alerts: {}", e.getMessage(), e);
        }
    }

    /**
     * Evaluate Pod alerts.
     * The status and restart checks are independent, so both alerts can fire in one run.
     */
    private void checkPodAlert(K8sPodMonitor podMonitor) {
        try {
            // Status alert. "CrashLoopBackOff" is a container waiting reason, not a Pod phase,
            // so this comparison only matches if podStatus carries the waiting reason.
            if ("Failed".equals(podMonitor.getPodStatus()) || "CrashLoopBackOff".equals(podMonitor.getPodStatus())) {
                sendPodAlert(podMonitor, "K8S_POD_FAILED", "CRITICAL",
                        String.format("Pod is in an abnormal state: %s/%s",
                                podMonitor.getNamespace(), podMonitor.getPodName()));
            }

            // Restart-count alert
            Integer restartCount = podMonitor.getRestartCount();
            if (restartCount != null && restartCount > 10) {
                sendPodAlert(podMonitor, "K8S_POD_RESTART_HIGH", "WARNING",
                        String.format("Pod has restarted too often: %s/%s, restarts: %d",
                                podMonitor.getNamespace(), podMonitor.getPodName(), restartCount));
            }

        } catch (Exception e) {
            log.error("Failed to evaluate Pod alerts: {}", e.getMessage(), e);
        }
    }

    /**
     * Send a cluster alert, at most once per alert type every 5 minutes.
     */
    private void sendClusterAlert(K8sClusterMonitor monitorData, String alertType, String alertLevel, String alertMessage) {
        try {
            String alertKey = "k8s:cluster:alert:" + monitorData.getClusterName() + ":" + alertType;

            // setIfAbsent checks and sets in one atomic step, so two concurrent runs cannot
            // both send; the key's 5-minute expiry bounds how often the alert repeats
            Boolean firstInWindow = redisTemplate.opsForValue().setIfAbsent(alertKey, "1", Duration.ofMinutes(5));

            if (Boolean.TRUE.equals(firstInWindow)) {
                AlertMessage alert = new AlertMessage();
                alert.setType(alertType);
                alert.setLevel(alertLevel);
                alert.setMessage(alertMessage);
                alert.setTimestamp(new Date());
                alert.setHostname(monitorData.getHostname());

                alertService.sendAlert(alert);

                log.warn("Sent cluster alert: clusterName={}, type={}, level={}",
                        monitorData.getClusterName(), alertType, alertLevel);
            }

        } catch (Exception e) {
            log.error("Failed to send cluster alert: {}", e.getMessage(), e);
        }
    }

    /**
     * Send a Pod alert, at most once per Pod and alert type every 5 minutes.
     */
    private void sendPodAlert(K8sPodMonitor podMonitor, String alertType, String alertLevel, String alertMessage) {
        try {
            String alertKey = "k8s:pod:alert:" + podMonitor.getNamespace() + ":" + podMonitor.getPodName() + ":" + alertType;

            // Same atomic check-and-set as for cluster alerts
            Boolean firstInWindow = redisTemplate.opsForValue().setIfAbsent(alertKey, "1", Duration.ofMinutes(5));

            if (Boolean.TRUE.equals(firstInWindow)) {
                AlertMessage alert = new AlertMessage();
                alert.setType(alertType);
                alert.setLevel(alertLevel);
                alert.setMessage(alertMessage);
                alert.setTimestamp(new Date());
                alert.setHostname(podMonitor.getNodeName());

                alertService.sendAlert(alert);

                log.warn("Sent Pod alert: namespace={}, podName={}, type={}, level={}",
                        podMonitor.getNamespace(), podMonitor.getPodName(), alertType, alertLevel);
            }

        } catch (Exception e) {
            log.error("Failed to send Pod alert: {}", e.getMessage(), e);
        }
    }

    /**
     * Return the latest cached cluster record, or null if none is cached.
     */
    public K8sClusterMonitor getRealTimeClusterData(String clusterName) {
        String cacheKey = "k8s:cluster:" + clusterName;
        return (K8sClusterMonitor) redisTemplate.opsForValue().get(cacheKey);
    }

    /**
     * Return the latest cached Pod record, or null if none is cached.
     */
    public K8sPodMonitor getRealTimePodData(String namespace, String podName) {
        String cacheKey = "k8s:pod:" + namespace + ":" + podName;
        return (K8sPodMonitor) redisTemplate.opsForValue().get(cacheKey);
    }

    /**
     * Hostname of the machine running this collector.
     */
    private String getHostname() {
        try {
            return InetAddress.getLocalHost().getHostName();
        } catch (UnknownHostException e) {
            return "unknown";
        }
    }

    /**
     * Local IP address of the machine running this collector.
     */
    private String getLocalIpAddress() {
        try {
            return InetAddress.getLocalHost().getHostAddress();
        } catch (UnknownHostException e) {
            return "127.0.0.1";
        }
    }
}

/**
 * In-memory snapshot of cluster information.
 */
@Data
public class K8sClusterInfo {
    private String clusterName;       // cluster name
    private Integer nodeCount;        // number of nodes
    private Integer podCount;         // number of Pods
    private Integer serviceCount;     // number of Services
    private Integer namespaceCount;   // number of namespaces
    private Double cpuUsage;          // CPU usage (%)
    private Double memoryUsage;       // memory usage (%)
    private Long totalCpu;            // total CPU (millicores)
    private Long usedCpu;             // used CPU (millicores)
    private Long totalMemory;         // total memory (bytes)
    private Long usedMemory;          // used memory (bytes)
    private String clusterStatus;     // cluster status
}
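The monitoring service depends on turning Kubernetes quantity strings such as "500m" or "4Gi" into plain numbers. The sketch below shows the conversion rules using only the JDK, so they can be read and tested without a cluster. The class name `QuantityParser` is hypothetical, and the suffix table is deliberately partial: it covers m, k, M, G, T and Ki, Mi, Gi, Ti, and leaves out the larger suffixes and exponent notation that Kubernetes also accepts.

```java
import java.math.BigDecimal;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stdlib-only parser for a subset of Kubernetes resource quantities. */
final class QuantityParser {
    private static final Map<String, BigDecimal> SUFFIXES = new LinkedHashMap<>();

    static {
        // Binary suffixes: powers of 1024
        SUFFIXES.put("Ki", BigDecimal.valueOf(1024).pow(1));
        SUFFIXES.put("Mi", BigDecimal.valueOf(1024).pow(2));
        SUFFIXES.put("Gi", BigDecimal.valueOf(1024).pow(3));
        SUFFIXES.put("Ti", BigDecimal.valueOf(1024).pow(4));
        // Decimal suffixes: powers of 1000, plus "m" for thousandths
        SUFFIXES.put("m", new BigDecimal("0.001"));
        SUFFIXES.put("k", BigDecimal.valueOf(1000).pow(1));
        SUFFIXES.put("M", BigDecimal.valueOf(1000).pow(2));
        SUFFIXES.put("G", BigDecimal.valueOf(1000).pow(3));
        SUFFIXES.put("T", BigDecimal.valueOf(1000).pow(4));
    }

    private QuantityParser() {}

    /** Value in base units (cores for CPU, bytes for memory). Throws NumberFormatException on bad input. */
    static BigDecimal toBaseUnits(String quantity) {
        String q = quantity.trim();
        for (Map.Entry<String, BigDecimal> suffix : SUFFIXES.entrySet()) {
            if (q.endsWith(suffix.getKey())) {
                String digits = q.substring(0, q.length() - suffix.getKey().length());
                return new BigDecimal(digits).multiply(suffix.getValue());
            }
        }
        return new BigDecimal(q); // no suffix: already in base units
    }

    /** CPU quantity in millicores; fractions of a millicore are truncated. */
    static long cpuMillis(String quantity) {
        return toBaseUnits(quantity).multiply(BigDecimal.valueOf(1000)).longValue();
    }

    /** Memory quantity in bytes; fractions of a byte are truncated. */
    static long memoryBytes(String quantity) {
        return toBaseUnits(quantity).longValue();
    }
}
```

With these rules, "0.5" and "500m" both come out as 500 millicores, which integer-only parsing cannot achieve because "0.5" is not a valid long.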

4. K8s Management Service

4.1 Management Service Implementation

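import io.kubernetes.client.custom.IntOrString;
import io.kubernetes.client.openapi.ApiClient;
import io.kubernetes.client.openapi.apis.AppsV1Api;
import io.kubernetes.client.openapi.apis.CoreV1Api;
import io.kubernetes.client.openapi.models.*;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

import java.util.Arrays;
import java.util.Map;

/**
 * K8s management service.
 * Provides cluster management operations for Pods, Deployments, and Services.
 */
@Slf4j
@Service
public class K8sManagementService {

    @Autowired
    private ApiClient apiClient;

    @Autowired
    private AlertService alertService;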

    /**
     * Create a single-container Pod.
     */
    public void createPod(String namespace, String podName, String image) {
        try {
            CoreV1Api coreV1Api = new CoreV1Api(apiClient);

            // Metadata. The app label is what createService() selects on; without it
            // a Service created for this Pod would match nothing.
            V1Pod pod = new V1Pod();
            V1ObjectMeta metadata = new V1ObjectMeta();
            metadata.setName(podName);
            metadata.setNamespace(namespace);
            metadata.setLabels(Map.of("app", podName));
            pod.setMetadata(metadata);

            // Spec: one container named after the Pod
            V1PodSpec spec = new V1PodSpec();
            V1Container container = new V1Container();
            container.setName(podName);
            container.setImage(image);
            spec.setContainers(Arrays.asList(container));
            pod.setSpec(spec);

            // Create the Pod
            coreV1Api.createNamespacedPod(namespace, pod).execute();

            log.info("Created Pod: namespace={}, podName={}, image={}",
                    namespace, podName, image);

        } catch (Exception e) {
            log.error("Failed to create Pod: namespace={}, podName={}, error={}",
                    namespace, podName, e.getMessage(), e);
        }
    }

    /**
     * Delete a Pod.
     */
    public void deletePod(String namespace, String podName) {
        try {
            CoreV1Api coreV1Api = new CoreV1Api(apiClient);

            // Delete the Pod
            coreV1Api.deleteNamespacedPod(podName, namespace).execute();

            log.info("Deleted Pod: namespace={}, podName={}", namespace, podName);

        } catch (Exception e) {
            log.error("Failed to delete Pod: namespace={}, podName={}, error={}",
                    namespace, podName, e.getMessage(), e);
        }
    }

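    /**
     * Restart a Pod by deleting it.
     * This only amounts to a restart when a controller (Deployment, StatefulSet, DaemonSet, ...)
     * owns the Pod and creates a replacement, which for a Deployment gets a new name.
     * A Pod created on its own, as by createPod(), is simply gone afterwards.
     */
    public void restartPod(String namespace, String podName) {
        try {
            CoreV1Api coreV1Api = new CoreV1Api(apiClient);

            // Delete the Pod
            coreV1Api.deleteNamespacedPod(podName, namespace).execute();

            // Fixed pause to give the controller time to act; it does not verify
            // that a replacement exists or is Ready
            Thread.sleep(5000);

            log.info("Deleted Pod for restart: namespace={}, podName={}", namespace, podName);

        } catch (InterruptedException e) {
            // Keep the interrupt visible to the caller instead of swallowing it
            Thread.currentThread().interrupt();
            log.warn("Interrupted while waiting for Pod restart: namespace={}, podName={}", namespace, podName);
        } catch (Exception e) {
            log.error("Failed to restart Pod: namespace={}, podName={}, error={}",
                    namespace, podName, e.getMessage(), e);
        }
    }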

    /**
     * Scale a Deployment to the given number of replicas.
     */
    public void scalePod(String namespace, String deploymentName, int replicas) {
        try {
            AppsV1Api appsV1Api = new AppsV1Api(apiClient);

            // Read the current Deployment
            V1Deployment deployment = appsV1Api.readNamespacedDeployment(deploymentName, namespace).execute();

            // Change the replica count
            V1DeploymentSpec spec = deployment.getSpec();
            spec.setReplicas(replicas);
            deployment.setSpec(spec);

            // Write the Deployment back; the call fails if the object changed
            // between the read and the replace
            appsV1Api.replaceNamespacedDeployment(deploymentName, namespace, deployment).execute();

            log.info("Scaled Deployment: namespace={}, deploymentName={}, replicas={}",
                    namespace, deploymentName, replicas);

        } catch (Exception e) {
            log.error("Failed to scale Deployment: namespace={}, deploymentName={}, replicas={}, error={}",
                    namespace, deploymentName, replicas, e.getMessage(), e);
        }
    }

    /**
     * Create a Service that routes to Pods labeled app=<podName>.
     */
    public void createService(String namespace, String serviceName, String podName, int port) {
        try {
            CoreV1Api coreV1Api = new CoreV1Api(apiClient);

            // Metadata
            V1Service service = new V1Service();
            V1ObjectMeta metadata = new V1ObjectMeta();
            metadata.setName(serviceName);
            metadata.setNamespace(namespace);
            service.setMetadata(metadata);

            // Spec: the selector matches Pod labels, not Pod names, so the target
            // Pods must carry the label app=<podName>
            V1ServiceSpec spec = new V1ServiceSpec();
            spec.setSelector(Map.of("app", podName));

            V1ServicePort servicePort = new V1ServicePort();
            servicePort.setPort(port);
            servicePort.setTargetPort(new IntOrString(port));
            spec.setPorts(Arrays.asList(servicePort));

            service.setSpec(spec);

            // Create the Service
            coreV1Api.createNamespacedService(namespace, service).execute();

            log.info("Created Service: namespace={}, serviceName={}, podName={}, port={}",
                    namespace, serviceName, podName, port);

        } catch (Exception e) {
            log.error("Failed to create Service: namespace={}, serviceName={}, error={}",
                    namespace, serviceName, e.getMessage(), e);
        }
    }

    /**
     * Delete a Service.
     */
    public void deleteService(String namespace, String serviceName) {
        try {
            CoreV1Api coreV1Api = new CoreV1Api(apiClient);

            // Delete the Service
            coreV1Api.deleteNamespacedService(serviceName, namespace).execute();

            log.info("Deleted Service: namespace={}, serviceName={}", namespace, serviceName);

        } catch (Exception e) {
            log.error("Failed to delete Service: namespace={}, serviceName={}, error={}",
                    namespace, serviceName, e.getMessage(), e);
        }
    }

    /**
     * Fetch a Pod's log. Returns an empty string on failure.
     */
    public String getPodLogs(String namespace, String podName) {
        try {
            CoreV1Api coreV1Api = new CoreV1Api(apiClient);

            // Read the Pod log
            String logs = coreV1Api.readNamespacedPodLog(podName, namespace).execute();

            log.info("Fetched Pod log: namespace={}, podName={}", namespace, podName);

            return logs;

        } catch (Exception e) {
            log.error("Failed to fetch Pod log: namespace={}, podName={}, error={}",
                    namespace, podName, e.getMessage(), e);
            return "";
        }
    }

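    /**
     * Run a shell command inside a Pod and return its standard output.
     * Returns an empty string on failure. The command is passed to "sh -c" unchanged,
     * so callers must only pass trusted input.
     */
    public String executePodCommand(String namespace, String podName, String command) {
        try {
            // The client's Exec helper opens the WebSocket exec channel; there is no
            // V1Exec model class. No stdin and no TTY are attached.
            io.kubernetes.client.Exec exec = new io.kubernetes.client.Exec(apiClient);
            Process process = exec.exec(namespace, podName, new String[] {"sh", "-c", command}, false, false);

            try {
                // Read stdout until the remote command closes the stream, then collect the exit code
                String output = new String(process.getInputStream().readAllBytes(),
                        java.nio.charset.StandardCharsets.UTF_8);
                int exitCode = process.waitFor();

                log.info("Executed Pod command: namespace={}, podName={}, command={}, exitCode={}",
                        namespace, podName, command, exitCode);

                return output;
            } finally {
                process.destroy();
            }

        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            log.warn("Interrupted while executing Pod command: namespace={}, podName={}", namespace, podName);
            return "";
        } catch (Exception e) {
            log.error("Failed to execute Pod command: namespace={}, podName={}, command={}, error={}",
                    namespace, podName, command, e.getMessage(), e);
            return "";
        }
    }
}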
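The restart operation above pauses for a fixed five seconds, which neither confirms the replacement Pod is ready nor returns early when it is. A common alternative is to poll a readiness condition until a deadline. The sketch below shows that pattern with JDK classes only; `Poller` and `waitUntil` are hypothetical names, and the condition is left abstract because the readiness check itself (for example, listing Pods by label and inspecting their conditions) depends on the workload.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical polling helper: retries a condition until it holds or a timeout elapses. */
final class Poller {
    private Poller() {}

    /**
     * @param condition  check to repeat, e.g. "a replacement Pod reports Ready"
     * @param timeoutMs  maximum total time to wait
     * @param intervalMs pause between checks
     * @return true if the condition held before the timeout, false on timeout or interruption
     */
    static boolean waitUntil(BooleanSupplier condition, long timeoutMs, long intervalMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out; the condition never held
            }
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
    }
}
```

A caller would replace the fixed sleep with something like `Poller.waitUntil(() -> replacementIsReady(), 60_000, 1_000)`, where `replacementIsReady` stands for whatever readiness check suits the application.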

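5. Summary

This article covered a solution for Kubernetes operations monitoring and management, including:

5.1 Key Technical Points

  1. K8s monitoring: real-time status of the cluster, nodes, Pods, and Services
  2. Resource management: tracking CPU, memory, and other resource usage
  3. Pod management: creating, deleting, restarting, and scaling Pods
  4. Service management: creating, deleting, and managing K8s Services
  5. Alert notification: multi-level alerts (WARNING and CRITICAL) with repeat suppression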
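The multi-level alerting in point 5 comes down to comparing a usage percentage against two thresholds. A minimal sketch of that classification follows; the class name `AlertLevels` is hypothetical, and the strict greater-than comparison matches the `> 80` and `> 90` checks used in the monitoring service.

```java
/** Hypothetical helper: maps a usage percentage to an alert level using two thresholds. */
final class AlertLevels {
    enum Level { OK, WARNING, CRITICAL }

    private AlertLevels() {}

    /**
     * @param usagePercent current usage in percent
     * @param warningAt    usage strictly above this value is at least WARNING
     * @param criticalAt   usage strictly above this value is CRITICAL
     */
    static Level classify(double usagePercent, double warningAt, double criticalAt) {
        if (warningAt > criticalAt) {
            throw new IllegalArgumentException("warning threshold must not exceed critical threshold");
        }
        // Check the more severe level first so it wins when both thresholds are exceeded
        if (usagePercent > criticalAt) {
            return Level.CRITICAL;
        }
        if (usagePercent > warningAt) {
            return Level.WARNING;
        }
        return Level.OK;
    }
}
```

With thresholds of 80 and 90, a reading of 85 is classified as WARNING, 95 as CRITICAL, and exactly 80 as OK because the comparison is strict.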

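5.2 Architecture Advantages

  1. Real-time monitoring: K8s data is collected every 10 seconds
  2. Threshold-based alerting: alerts fire when usage crosses configured thresholds
  3. Automated management: Pod and Service operations are exposed as code
  4. Multi-dimensional analysis: data is recorded at the cluster, node, and Pod level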
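With data collected every 10 seconds, a threshold that stays exceeded would otherwise raise the same alert six times a minute. The monitoring service limits this with Redis keys that expire after five minutes. The sketch below shows the same suppression idea in memory so it can be tested without Redis; `AlertSuppressor` is a hypothetical name, and the injected clock is an assumption made only to keep the example deterministic.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical in-memory alert suppressor: lets each alert key through once per time window. */
final class AlertSuppressor {
    private final ConcurrentHashMap<String, Long> expiresAt = new ConcurrentHashMap<>();
    private final long windowMs;
    private final LongSupplier clock;

    /**
     * @param windowMs suppression window in milliseconds
     * @param clock    source of the current time in milliseconds (System::currentTimeMillis in production)
     */
    AlertSuppressor(long windowMs, LongSupplier clock) {
        this.windowMs = windowMs;
        this.clock = clock;
    }

    /** Returns true if the alert should be sent now, false if it is still suppressed. */
    boolean shouldSend(String alertKey) {
        long now = clock.getAsLong();
        boolean[] send = {false};
        // compute() runs atomically per key, so two threads cannot both open the same window
        expiresAt.compute(alertKey, (key, expiry) -> {
            if (expiry == null || expiry <= now) {
                send[0] = true;
                return now + windowMs; // open a new suppression window
            }
            return expiry; // still inside the current window
        });
        return send[0];
    }
}
```

An instance created with `new AlertSuppressor(300_000, System::currentTimeMillis)` allows one alert per key every five minutes within a single process; several collector instances would still need a shared store such as Redis.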

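5.3 Best Practices

  1. Monitoring strategy: set alert thresholds that fit the cluster's normal load
  2. Management strategy: run Pod and Service operations according to business needs
  3. Resource optimization: allocate and schedule K8s resources sensibly
  4. Prevention: address cluster problems before they cause outages

With the architecture described above, you can build a Kubernetes operations monitoring system that supports effective management and optimization of K8s clusters.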