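1. Multi-Region Active-Active Deployment and Disaster Recovery Drills: Overview

Multi-region active-active deployment is a common architecture pattern for enterprise applications: several data centers in different geographic regions serve traffic at the same time, which gives the business high availability and disaster tolerance. Disaster recovery drills are how that tolerance is verified, so that recovery is fast when a real failure occurs. This article walks through an implementation covering active-active architecture design, disaster recovery strategy, the switchover mechanism, and the drill workflow.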

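1.1 Core Features

  1. Active-active architecture: data centers in multiple regions serve traffic simultaneously
  2. Data synchronization: real-time cross-region data synchronization with consistency safeguards
  3. Traffic scheduling: intelligent traffic distribution and load balancing
  4. Disaster recovery switchover: automatic failure detection and switchover
  5. Drill management: disaster recovery drill planning and execution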

1.2 Technical Architecture

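User request → Smart DNS → Load balancing → Active-active nodes
↓ ↓ ↓ ↓
Global users → Nearest-region access → Traffic distribution → Business processing
↓ ↓ ↓ ↓
Monitoring and alerting → Failure detection → Automatic switchover → Data synchronization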

2. Active-Active Architecture Configuration

2.1 Active-Active Configuration Classes

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import lombok.Data;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.boot.autoconfigure.condition.ConditionalOnProperty;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

/**
 * Active-active architecture configuration.
 */
@Configuration
public class MultiActiveConfig {

    @Value("${multi-active.regions}")
    private String regions;

    @Value("${multi-active.current-region}")
    private String currentRegion;

    @Value("${multi-active.data-sync.enabled}")
    private boolean dataSyncEnabled;

    @Value("${multi-active.traffic-split.enabled}")
    private boolean trafficSplitEnabled;

    /**
     * Active-active region properties.
     */
    @Bean
    public MultiActiveProperties multiActiveProperties() {
        MultiActiveProperties properties = new MultiActiveProperties();
        properties.setRegions(Arrays.asList(regions.split(",")));
        properties.setCurrentRegion(currentRegion);
        properties.setDataSyncEnabled(dataSyncEnabled);
        properties.setTrafficSplitEnabled(trafficSplitEnabled);
        return properties;
    }

    /**
     * Data synchronization configuration, created only when data sync is enabled.
     */
    @Bean
    @ConditionalOnProperty(name = "multi-active.data-sync.enabled", havingValue = "true")
    public DataSyncConfig dataSyncConfig() {
        return new DataSyncConfig();
    }

    /**
     * Traffic scheduling configuration, created only when traffic splitting is enabled.
     */
    @Bean
    @ConditionalOnProperty(name = "multi-active.traffic-split.enabled", havingValue = "true")
    public TrafficSplitConfig trafficSplitConfig() {
        return new TrafficSplitConfig();
    }
}

/**
 * Active-active configuration properties.
 */
@Data
public class MultiActiveProperties {
    private List<String> regions;
    private String currentRegion;
    private boolean dataSyncEnabled;
    private boolean trafficSplitEnabled;
    // NOTE: nothing in this listing populates regionConfigs. It has to be filled
    // from the per-region settings before health checks and data sync can resolve
    // a region's endpoint.
    private Map<String, RegionConfig> regionConfigs = new HashMap<>();
}

/**
 * Per-region configuration.
 */
@Data
public class RegionConfig {
    private String regionId;
    private String regionName;
    private String endpoint;
    private String status;
    private int priority;
    private Map<String, Object> metadata = new HashMap<>();
}

/**
 * Data synchronization configuration.
 * The defaults below mirror the data-sync values in application.yml.
 */
@Data
public class DataSyncConfig {
    private String syncMode = "REAL_TIME";
    private int batchSize = 1000;
    private int syncInterval = 1000;
    private boolean conflictResolution = true;
    private String conflictStrategy = "LAST_WRITE_WINS";
}
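
The `multiActiveProperties()` bean splits the `multi-active.regions` value on commas and stores the pieces as they are, so a value such as `beijing, shanghai` would keep the leading space, and nothing checks that `current-region` is one of the listed regions. A minimal sketch of a stricter parser follows; `RegionListParser` and its methods are hypothetical names introduced for illustration and are not part of the original classes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper, not part of the article's configuration classes. */
final class RegionListParser {

    private RegionListParser() {
    }

    /** Splits a comma-separated value, trimming blanks and dropping empty or repeated entries. */
    static List<String> parse(String raw) {
        if (raw == null) {
            return Collections.emptyList();
        }
        List<String> regions = new ArrayList<>();
        for (String part : raw.split(",")) {
            String region = part.trim();
            if (!region.isEmpty() && !regions.contains(region)) {
                regions.add(region);
            }
        }
        return Collections.unmodifiableList(regions);
    }

    /** Fails fast when the current region is not one of the configured regions. */
    static void requireMember(List<String> regions, String currentRegion) {
        if (!regions.contains(currentRegion)) {
            throw new IllegalArgumentException(
                    "current-region '" + currentRegion + "' is not listed in regions " + regions);
        }
    }
}
```

Calling `requireMember` once at startup would turn a misconfigured `current-region` into an immediate error instead of a health check that can never find its region.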

2.2 Application Configuration

# application.yml
multi-active:
  regions: beijing,shanghai,guangzhou
  current-region: beijing
  data-sync:
    enabled: true
    mode: REAL_TIME
    batch-size: 1000
    sync-interval: 1000
    conflict-resolution: true
    conflict-strategy: LAST_WRITE_WINS
  traffic-split:
    enabled: true
    strategy: WEIGHTED_ROUND_ROBIN
    weights:
      beijing: 50
      shanghai: 30
      guangzhou: 20
  disaster-recovery:
    enabled: true
    auto-switch: true
    switch-threshold: 0.8
    health-check-interval: 30000

# Region configuration
regions:
  beijing:
    region-id: bj
    region-name: Beijing
    endpoint: https://api-bj.example.com
    status: ACTIVE
    priority: 1
  shanghai:
    region-id: sh
    region-name: Shanghai
    endpoint: https://api-sh.example.com
    status: ACTIVE
    priority: 2
  guangzhou:
    region-id: gz
    region-name: Guangzhou
    endpoint: https://api-gz.example.com
    status: ACTIVE
    priority: 3
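
The configuration names a `WEIGHTED_ROUND_ROBIN` strategy with weights 50/30/20, but the `TrafficSplitService` that would apply it is not shown. As an illustration only, the sketch below implements smooth weighted round-robin, one common way to honor such weights while interleaving regions instead of sending long runs to the heaviest one. `WeightedRoundRobin` is a hypothetical class, and whether the original service uses this variant is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical selector; the article's TrafficSplitService is not shown. */
final class WeightedRoundRobin {

    private final Map<String, Integer> weights = new LinkedHashMap<>();
    private final Map<String, Integer> current = new LinkedHashMap<>();
    private int totalWeight;

    WeightedRoundRobin(Map<String, Integer> configuredWeights) {
        if (configuredWeights.isEmpty()) {
            throw new IllegalArgumentException("at least one region is required");
        }
        for (Map.Entry<String, Integer> e : configuredWeights.entrySet()) {
            if (e.getValue() <= 0) {
                throw new IllegalArgumentException("weight must be positive: " + e.getKey());
            }
            weights.put(e.getKey(), e.getValue());
            current.put(e.getKey(), 0);
            totalWeight += e.getValue();
        }
    }

    /** Picks the next region: raise every score by its weight, take the highest, then lower it by the total. */
    synchronized String next() {
        String best = null;
        int bestScore = Integer.MIN_VALUE;
        for (Map.Entry<String, Integer> e : weights.entrySet()) {
            int score = current.get(e.getKey()) + e.getValue();
            current.put(e.getKey(), score);
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        current.put(best, current.get(best) - totalWeight);
        return best;
    }
}
```

With the weights above, entered in the order beijing, shanghai, guangzhou, the first ten calls on a fresh selector return beijing five times, shanghai three times and guangzhou twice, spread across the sequence rather than grouped.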

3. Active-Active Architecture Services

3.1 Active-Active Management Service

import java.time.LocalDateTime;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Data;
import lombok.NoArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

/**
 * Active-active management service.
 */
@Slf4j
@Service
public class MultiActiveService {

    @Autowired
    private MultiActiveProperties multiActiveProperties;

    @Autowired
    private RegionHealthService regionHealthService;

    @Autowired
    private DataSyncService dataSyncService;

    @Autowired
    private TrafficSplitService trafficSplitService;

    /**
     * Returns the current region.
     * @return the current region
     */
    public String getCurrentRegion() {
        return multiActiveProperties.getCurrentRegion();
    }

    /**
     * Returns all configured regions.
     * @return the list of regions
     */
    public List<String> getAllRegions() {
        return multiActiveProperties.getRegions();
    }

    /**
     * Returns the regions that currently pass the health check.
     * @return the list of available regions
     */
    public List<String> getAvailableRegions() {
        return multiActiveProperties.getRegions().stream()
                .filter(region -> regionHealthService.isRegionHealthy(region))
                .collect(Collectors.toList());
    }

    /**
     * Checks the status of a region.
     * @param region the region
     * @return the region status
     */
    public RegionStatus checkRegionStatus(String region) {
        RegionStatus status = new RegionStatus();
        status.setRegion(region);
        status.setHealthy(regionHealthService.isRegionHealthy(region));
        status.setResponseTime(regionHealthService.getRegionResponseTime(region));
        status.setLastCheckTime(LocalDateTime.now());

        return status;
    }

    /**
     * Switches traffic from one region to another.
     * @param fromRegion the source region
     * @param toRegion the target region
     * @return the switch result
     */
    public RegionSwitchResult switchRegion(String fromRegion, String toRegion) {
        try {
            log.info("Starting region switch: fromRegion={}, toRegion={}", fromRegion, toRegion);

            // 1. Check the health of the target region
            if (!regionHealthService.isRegionHealthy(toRegion)) {
                throw new BusinessException("Target region is unhealthy, cannot switch");
            }

            // 2. Stop traffic to the source region
            trafficSplitService.stopTrafficToRegion(fromRegion);

            // 3. Wait for data synchronization to finish.
            // NOTE: waitForSyncCompletion only logs a warning on timeout,
            // so the switch continues even if the sync did not complete.
            if (multiActiveProperties.isDataSyncEnabled()) {
                dataSyncService.waitForSyncCompletion(fromRegion, toRegion);
            }

            // 4. Start traffic to the target region
            trafficSplitService.startTrafficToRegion(toRegion);

            // 5. Update the current region
            multiActiveProperties.setCurrentRegion(toRegion);

            log.info("Region switch completed: fromRegion={}, toRegion={}", fromRegion, toRegion);

            return RegionSwitchResult.success(fromRegion, toRegion);

        } catch (Exception e) {
            log.error("Region switch failed: fromRegion={}, toRegion={}", fromRegion, toRegion, e);
            return RegionSwitchResult.error("Region switch failed: " + e.getMessage());
        }
    }

    /**
     * Performs an automatic switch if the current region is unhealthy.
     * @return the switch result
     */
    public RegionSwitchResult autoSwitch() {
        String currentRegion = getCurrentRegion();

        // 1. Check the health of the current region
        if (regionHealthService.isRegionHealthy(currentRegion)) {
            return RegionSwitchResult.success("Current region is healthy, no switch needed");
        }

        // 2. Find the best switch target
        String targetRegion = findBestSwitchTarget(currentRegion);
        if (targetRegion == null) {
            return RegionSwitchResult.error("No switch target region available");
        }

        // 3. Perform the switch
        return switchRegion(currentRegion, targetRegion);
    }

    /**
     * Finds the best switch target: the healthy region with the lowest response time.
     * @param currentRegion the current region
     * @return the target region, or null if none is healthy
     */
    private String findBestSwitchTarget(String currentRegion) {
        return multiActiveProperties.getRegions().stream()
                .filter(region -> !region.equals(currentRegion))
                .filter(region -> regionHealthService.isRegionHealthy(region))
                .min(Comparator.comparingLong(regionHealthService::getRegionResponseTime))
                .orElse(null);
    }
}

/**
 * Region status.
 */
@Data
public class RegionStatus {
    private String region;
    private boolean healthy;
    private long responseTime;
    private LocalDateTime lastCheckTime;
    private Map<String, Object> metrics = new HashMap<>();
}

/**
 * Region switch result.
 */
@Data
@Builder
@NoArgsConstructor
@AllArgsConstructor
public class RegionSwitchResult {
    private boolean success;
    private String fromRegion;
    private String toRegion;
    private String message;
    private LocalDateTime switchTime;

    public static RegionSwitchResult success(String fromRegion, String toRegion) {
        return RegionSwitchResult.builder()
                .success(true)
                .fromRegion(fromRegion)
                .toRegion(toRegion)
                .switchTime(LocalDateTime.now())
                .build();
    }

    public static RegionSwitchResult success(String message) {
        return RegionSwitchResult.builder()
                .success(true)
                .message(message)
                .switchTime(LocalDateTime.now())
                .build();
    }

    public static RegionSwitchResult error(String message) {
        return RegionSwitchResult.builder()
                .success(false)
                .message(message)
                .switchTime(LocalDateTime.now())
                .build();
    }
}
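
`findBestSwitchTarget` ranks healthy regions by response time only, while each region in the configuration also carries a `priority` that the service never reads. The sketch below shows one possible way to combine the two, under the assumption that a lower `priority` number means a more preferred region. `SwitchTargetSelector` and `Candidate` are hypothetical stand-ins, not classes from the article.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** Hypothetical selector that prefers configured priority and uses latency as a tie-breaker. */
final class SwitchTargetSelector {

    /** Minimal stand-in combining a region's configured priority with its latest health data. */
    static final class Candidate {
        final String region;
        final int priority;        // assumption: lower value means more preferred
        final long responseTimeMs;
        final boolean healthy;

        Candidate(String region, int priority, long responseTimeMs, boolean healthy) {
            this.region = region;
            this.priority = priority;
            this.responseTimeMs = responseTimeMs;
            this.healthy = healthy;
        }
    }

    private SwitchTargetSelector() {
    }

    /** Returns the healthy region, other than the current one, with the best priority and then the lowest latency. */
    static Optional<String> select(String currentRegion, List<Candidate> candidates) {
        return candidates.stream()
                .filter(c -> !c.region.equals(currentRegion))
                .filter(c -> c.healthy)
                .min(Comparator.comparingInt((Candidate c) -> c.priority)
                        .thenComparingLong(c -> c.responseTimeMs))
                .map(c -> c.region);
    }
}
```

Returning an `Optional` instead of `null` makes the "no target available" case explicit at the call site. Whether priority should outrank latency is a policy decision the original text does not settle.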

3.2 Region Health Check Service

import java.time.LocalDateTime;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Data;
import lombok.NoArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.http.ResponseEntity;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Service;
import org.springframework.web.client.RestTemplate;

/**
 * Region health check service.
 * Requires a RestTemplate bean, and @EnableScheduling for the periodic check.
 */
@Slf4j
@Service
public class RegionHealthService {

    @Autowired
    private MultiActiveProperties multiActiveProperties;

    @Autowired
    private RestTemplate restTemplate;

    private final Map<String, RegionHealthStatus> healthStatusCache = new ConcurrentHashMap<>();

    /**
     * Checks whether a region is healthy, using the cached status unless it has expired.
     * @param region the region
     * @return true if the region is healthy
     */
    public boolean isRegionHealthy(String region) {
        RegionHealthStatus status = healthStatusCache.get(region);
        if (status == null || status.isExpired()) {
            status = performHealthCheck(region);
            healthStatusCache.put(region, status);
        }

        return status.isHealthy();
    }

    /**
     * Returns the last measured response time of a region.
     * @param region the region
     * @return response time in milliseconds, or Long.MAX_VALUE if the region has not been checked
     */
    public long getRegionResponseTime(String region) {
        RegionHealthStatus status = healthStatusCache.get(region);
        return status != null ? status.getResponseTime() : Long.MAX_VALUE;
    }

    /**
     * Performs a health check against the region's /health endpoint.
     * @param region the region
     * @return the health status
     */
    private RegionHealthStatus performHealthCheck(String region) {
        RegionConfig regionConfig = multiActiveProperties.getRegionConfigs().get(region);
        if (regionConfig == null) {
            return RegionHealthStatus.unhealthy("Region configuration not found");
        }

        try {
            String healthUrl = regionConfig.getEndpoint() + "/health";

            long startTime = System.currentTimeMillis();
            ResponseEntity<Map> response = restTemplate.getForEntity(healthUrl, Map.class);
            long responseTime = System.currentTimeMillis() - startTime;

            if (response.getStatusCode().is2xxSuccessful()) {
                Map<?, ?> body = response.getBody();
                boolean healthy = body != null && "UP".equals(body.get("status"));

                return RegionHealthStatus.builder()
                        .region(region)
                        .healthy(healthy)
                        .responseTime(responseTime)
                        .checkTime(LocalDateTime.now())
                        .message(healthy ? "healthy" : "unhealthy")
                        .build();
            } else {
                return RegionHealthStatus.unhealthy("HTTP status code: " + response.getStatusCode());
            }

        } catch (Exception e) {
            log.error("Region health check failed: region={}", region, e);
            return RegionHealthStatus.unhealthy("Health check error: " + e.getMessage());
        }
    }

    /**
     * Checks every configured region.
     * @return the health status of all regions
     */
    public Map<String, RegionHealthStatus> batchHealthCheck() {
        Map<String, RegionHealthStatus> results = new HashMap<>();

        for (String region : multiActiveProperties.getRegions()) {
            results.put(region, performHealthCheck(region));
        }

        return results;
    }

    /**
     * Periodic health check.
     */
    @Scheduled(fixedRate = 30000) // runs every 30 seconds
    public void scheduledHealthCheck() {
        try {
            Map<String, RegionHealthStatus> results = batchHealthCheck();

            // Refresh the cache
            healthStatusCache.putAll(results);

            // Check whether an automatic switch is needed
            checkAutoSwitch();

        } catch (Exception e) {
            log.error("Periodic health check failed", e);
        }
    }

    /**
     * Checks whether an automatic switch is needed.
     */
    private void checkAutoSwitch() {
        String currentRegion = multiActiveProperties.getCurrentRegion();
        RegionHealthStatus currentStatus = healthStatusCache.get(currentRegion);

        if (currentStatus != null && !currentStatus.isHealthy()) {
            log.warn("Current region is unhealthy, triggering automatic switch: region={}", currentRegion);

            // The automatic switch can be triggered here. Injecting MultiActiveService
            // directly would create a circular dependency, because that service
            // already depends on this one.
            // multiActiveService.autoSwitch();
        }
    }
}

/**
 * Region health status.
 */
@Data
@Builder
@NoArgsConstructor
@AllArgsConstructor
public class RegionHealthStatus {
    private String region;
    private boolean healthy;
    private long responseTime;
    private LocalDateTime checkTime;
    private String message;

    /** A status is considered stale one minute after it was recorded. */
    public boolean isExpired() {
        return checkTime == null || checkTime.isBefore(LocalDateTime.now().minusMinutes(1));
    }

    public static RegionHealthStatus unhealthy(String message) {
        return RegionHealthStatus.builder()
                .healthy(false)
                .message(message)
                .checkTime(LocalDateTime.now())
                .build();
    }
}
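
`checkAutoSwitch` reacts to a single failed probe, and the `switch-threshold: 0.8` setting from the configuration is not read anywhere in the code shown. One plausible reading, and it is only an assumption, is that the threshold is the fraction of recent probes that must fail before a switch is triggered. The sketch below implements that reading as a fixed-size sliding window; `FailureWindow` is a hypothetical class.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sliding window over the most recent health probe results of one region. */
final class FailureWindow {

    private final int size;
    private final double threshold;
    private final Deque<Boolean> probes = new ArrayDeque<>();
    private int failures;

    FailureWindow(int size, double threshold) {
        if (size <= 0 || threshold <= 0.0 || threshold > 1.0) {
            throw new IllegalArgumentException("size must be positive and threshold in (0, 1]");
        }
        this.size = size;
        this.threshold = threshold;
    }

    /**
     * Records one probe result and reports whether a switch should be triggered.
     * A switch is only suggested once the window is full and the share of failed
     * probes has reached the threshold.
     */
    synchronized boolean record(boolean healthy) {
        probes.addLast(healthy);
        if (!healthy) {
            failures++;
        }
        if (probes.size() > size) {
            boolean evictedWasHealthy = probes.removeFirst();
            if (!evictedWasHealthy) {
                failures--;
            }
        }
        return probes.size() == size && (double) failures / size >= threshold;
    }
}
```

With a window of five probes at the 30-second interval used above, a switch would need at least four failures within roughly two and a half minutes, which filters out a single slow response at the cost of a slower reaction.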

4. Data Synchronization Service

4.1 Data Synchronization Service

import java.time.LocalDateTime;
import java.util.List;

import com.alibaba.fastjson.JSON; // assumed: JSON.toJSONString matches the fastjson API
import com.google.common.collect.Lists;
import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Data;
import lombok.NoArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.http.HttpEntity;
import org.springframework.http.HttpHeaders;
import org.springframework.http.MediaType;
import org.springframework.http.ResponseEntity;
import org.springframework.stereotype.Service;
import org.springframework.web.client.RestTemplate;

/**
 * Data synchronization service.
 */
@Slf4j
@Service
public class DataSyncService {

    @Autowired
    private MultiActiveProperties multiActiveProperties;

    @Autowired
    private DataSyncConfig dataSyncConfig;

    @Autowired
    private RedisTemplate<String, Object> redisTemplate;

    // Reused across calls instead of creating a new client per request
    private final RestTemplate restTemplate = new RestTemplate();

    /**
     * Synchronizes data to a target region.
     * @param data the data
     * @param targetRegion the target region
     * @return the sync result
     */
    public SyncResult syncData(Object data, String targetRegion) {
        try {
            log.debug("Starting data sync to target region: targetRegion={}", targetRegion);

            // 1. Serialize the data
            String dataJson = JSON.toJSONString(data);

            // 2. Send it to the target region
            RegionConfig regionConfig = multiActiveProperties.getRegionConfigs().get(targetRegion);
            if (regionConfig == null) {
                throw new BusinessException("Target region configuration not found");
            }

            String syncUrl = regionConfig.getEndpoint() + "/api/sync/data";

            HttpHeaders headers = new HttpHeaders();
            headers.setContentType(MediaType.APPLICATION_JSON);
            headers.set("X-Source-Region", multiActiveProperties.getCurrentRegion());

            HttpEntity<String> request = new HttpEntity<>(dataJson, headers);

            ResponseEntity<String> response = restTemplate.postForEntity(syncUrl, request, String.class);

            if (response.getStatusCode().is2xxSuccessful()) {
                log.debug("Data sync succeeded: targetRegion={}", targetRegion);
                return SyncResult.success(targetRegion);
            } else {
                throw new BusinessException("Data sync failed: " + response.getStatusCode());
            }

        } catch (Exception e) {
            log.error("Data sync failed: targetRegion={}", targetRegion, e);
            return SyncResult.error("Data sync failed: " + e.getMessage());
        }
    }

    /**
     * Synchronizes a list of records in batches.
     * @param dataList the records
     * @param targetRegion the target region
     * @return the sync result
     */
    public SyncResult batchSyncData(List<Object> dataList, String targetRegion) {
        try {
            log.debug("Starting batch data sync: targetRegion={}, count={}", targetRegion, dataList.size());

            // Process in batches
            int batchSize = dataSyncConfig.getBatchSize();
            List<List<Object>> batches = Lists.partition(dataList, batchSize);

            int successCount = 0;
            int failCount = 0;

            for (List<Object> batch : batches) {
                try {
                    SyncResult result = syncData(batch, targetRegion);
                    if (result.isSuccess()) {
                        successCount += batch.size();
                    } else {
                        failCount += batch.size();
                    }
                } catch (Exception e) {
                    failCount += batch.size();
                    log.error("Batch sync failed: batchSize={}", batch.size(), e);
                }
            }

            log.debug("Batch sync finished: targetRegion={}, successCount={}, failCount={}",
                    targetRegion, successCount, failCount);

            return SyncResult.batchResult(successCount, failCount);

        } catch (Exception e) {
            log.error("Batch sync failed: targetRegion={}", targetRegion, e);
            return SyncResult.error("Batch sync failed: " + e.getMessage());
        }
    }

    /**
     * Waits until synchronization has finished or the wait times out.
     * A timeout is only logged; the method returns normally in both cases.
     * @param fromRegion the source region
     * @param toRegion the target region
     */
    public void waitForSyncCompletion(String fromRegion, String toRegion) {
        try {
            log.info("Waiting for data sync to finish: fromRegion={}, toRegion={}", fromRegion, toRegion);

            int maxWaitTime = 30000; // 30 seconds
            int checkInterval = 1000; // 1 second
            int waitedTime = 0;

            while (waitedTime < maxWaitTime) {
                if (isSyncCompleted(fromRegion, toRegion)) {
                    log.info("Data sync finished: fromRegion={}, toRegion={}", fromRegion, toRegion);
                    return;
                }

                Thread.sleep(checkInterval);
                waitedTime += checkInterval;
            }

            log.warn("Data sync timed out: fromRegion={}, toRegion={}", fromRegion, toRegion);

        } catch (InterruptedException e) {
            // Restore the interrupt flag so callers can still observe the interruption
            Thread.currentThread().interrupt();
            log.error("Interrupted while waiting for data sync: fromRegion={}, toRegion={}", fromRegion, toRegion, e);
        } catch (Exception e) {
            log.error("Waiting for data sync failed: fromRegion={}, toRegion={}", fromRegion, toRegion, e);
        }
    }

    /**
     * Checks whether synchronization has finished.
     * @param fromRegion the source region
     * @param toRegion the target region
     * @return true if synchronization is complete
     */
    private boolean isSyncCompleted(String fromRegion, String toRegion) {
        try {
            String syncKey = "sync:status:" + fromRegion + ":" + toRegion;
            Object status = redisTemplate.opsForValue().get(syncKey);
            return "COMPLETED".equals(status);
        } catch (Exception e) {
            log.error("Checking sync status failed: fromRegion={}, toRegion={}", fromRegion, toRegion, e);
            return false;
        }
    }

    /**
     * Resolves a data conflict.
     * The caller must pass the earlier write as data1 and the later write as data2;
     * this method does not compare timestamps itself.
     * @param data1 the earlier write
     * @param data2 the later write
     * @return the resolved data
     */
    public Object resolveDataConflict(Object data1, Object data2) {
        String strategy = dataSyncConfig.getConflictStrategy();

        switch (strategy) {
            case "LAST_WRITE_WINS":
                return data2; // the later write wins
            case "FIRST_WRITE_WINS":
                return data1; // the earlier write wins
            case "MERGE":
                return mergeData(data1, data2); // merge both writes
            default:
                return data2; // default: the later write wins
        }
    }

    /**
     * Merges two versions of the data.
     * @param data1 the first version
     * @param data2 the second version
     * @return the merged data
     */
    private Object mergeData(Object data1, Object data2) {
        // Placeholder: implement the concrete merge logic here,
        // for example merging Maps or Lists.
        return data2;
    }
}

/**
 * Sync result.
 */
@Data
@Builder
@NoArgsConstructor
@AllArgsConstructor
public class SyncResult {
    private boolean success;
    private String targetRegion;
    private String message;
    private int successCount;
    private int failCount;
    private LocalDateTime syncTime;

    public static SyncResult success(String targetRegion) {
        return SyncResult.builder()
                .success(true)
                .targetRegion(targetRegion)
                .syncTime(LocalDateTime.now())
                .build();
    }

    public static SyncResult batchResult(int successCount, int failCount) {
        return SyncResult.builder()
                .success(failCount == 0)
                .successCount(successCount)
                .failCount(failCount)
                .syncTime(LocalDateTime.now())
                .build();
    }

    public static SyncResult error(String message) {
        return SyncResult.builder()
                .success(false)
                .message(message)
                .syncTime(LocalDateTime.now())
                .build();
    }
}
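
`resolveDataConflict` decides the winner purely by argument position, and `mergeData` is left as a placeholder. The sketch below shows what `LAST_WRITE_WINS` and `MERGE` could look like if each record carried its own write timestamp and source region. The `Versioned` envelope and the `ConflictResolver` class are hypothetical, since the article's sync payloads carry no such metadata, and timestamp comparison assumes reasonably synchronized clocks across regions.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical resolver working on records that carry version metadata. */
final class ConflictResolver {

    /** Hypothetical envelope: field values plus when and where they were written. */
    static final class Versioned {
        final Map<String, Object> fields;
        final long updatedAtMillis;
        final String sourceRegion;

        Versioned(Map<String, Object> fields, long updatedAtMillis, String sourceRegion) {
            this.fields = fields;
            this.updatedAtMillis = updatedAtMillis;
            this.sourceRegion = sourceRegion;
        }
    }

    private ConflictResolver() {
    }

    /** Picks the record with the newer timestamp, independent of argument order. */
    static Versioned lastWriteWins(Versioned a, Versioned b) {
        if (a.updatedAtMillis != b.updatedAtMillis) {
            return a.updatedAtMillis > b.updatedAtMillis ? a : b;
        }
        // Equal timestamps: break the tie by region id so every region picks the same winner
        return a.sourceRegion.compareTo(b.sourceRegion) >= 0 ? a : b;
    }

    /** Field-level merge: union of all keys, with the newer record's value kept on shared keys. */
    static Map<String, Object> merge(Versioned a, Versioned b) {
        Versioned winner = lastWriteWins(a, b);
        Versioned loser = (winner == a) ? b : a;
        Map<String, Object> merged = new HashMap<>(loser.fields);
        merged.putAll(winner.fields);
        return merged;
    }
}
```

The deterministic tie-break matters in an active-active setup: if two regions resolved the same conflict differently, their copies would diverge permanently.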

5. Disaster Recovery Drill Service

5.1 Disaster Recovery Drill Service

import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import com.alibaba.fastjson.JSON; // assumed: JSON.toJSONString matches the fastjson API
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

/**
 * Disaster recovery drill service.
 */
@Slf4j
@Service
public class DisasterRecoveryDrillService {

    @Autowired
    private MultiActiveService multiActiveService;

    @Autowired
    private RegionHealthService regionHealthService;

    @Autowired
    private DataSyncService dataSyncService;

    @Autowired
    private DrillRecordMapper drillRecordMapper;

    /**
     * Creates a drill plan.
     * @param drillPlan the drill plan
     * @return the drill plan ID
     */
    public Long createDrillPlan(DrillPlan drillPlan) {
        try {
            drillPlan.setStatus("PLANNED");
            drillPlan.setCreateTime(LocalDateTime.now());
            drillRecordMapper.insertDrillPlan(drillPlan);

            log.info("Drill plan created: planId={}, name={}", drillPlan.getId(), drillPlan.getName());

            return drillPlan.getId();

        } catch (Exception e) {
            log.error("Creating drill plan failed: name={}", drillPlan.getName(), e);
            throw new BusinessException("Creating drill plan failed: " + e.getMessage());
        }
    }

    /**
     * Executes a drill.
     * @param planId the drill plan ID
     * @return the drill result
     */
    public DrillResult executeDrill(Long planId) {
        try {
            DrillPlan plan = drillRecordMapper.selectDrillPlanById(planId);
            if (plan == null) {
                throw new BusinessException("Drill plan not found");
            }

            log.info("Starting drill: planId={}, name={}", planId, plan.getName());

            // 1. Update the drill status
            plan.setStatus("EXECUTING");
            plan.setStartTime(LocalDateTime.now());
            drillRecordMapper.updateDrillPlan(plan);

            // 2. Execute the drill steps
            DrillResult result = executeDrillSteps(plan);
            // success is a Boolean wrapper, so Lombok generates getSuccess(), not isSuccess()
            boolean succeeded = Boolean.TRUE.equals(result.getSuccess());

            // 3. Store the drill result
            plan.setStatus(succeeded ? "COMPLETED" : "FAILED");
            plan.setEndTime(LocalDateTime.now());
            plan.setResult(JSON.toJSONString(result));
            drillRecordMapper.updateDrillPlan(plan);

            log.info("Drill finished: planId={}, success={}", planId, succeeded);

            return result;

        } catch (Exception e) {
            log.error("Drill execution failed: planId={}", planId, e);

            // Mark the drill as failed
            DrillPlan plan = drillRecordMapper.selectDrillPlanById(planId);
            if (plan != null) {
                plan.setStatus("FAILED");
                plan.setEndTime(LocalDateTime.now());
                plan.setResult("Drill execution failed: " + e.getMessage());
                drillRecordMapper.updateDrillPlan(plan);
            }

            DrillResult errorResult = new DrillResult();
            errorResult.setPlanId(planId);
            errorResult.setSuccess(false);
            errorResult.setErrorMessage("Drill execution failed: " + e.getMessage());
            return errorResult;
        }
    }

    /**
     * Executes the steps of a drill plan in list order, stopping at the first failure.
     * @param plan the drill plan
     * @return the drill result
     */
    private DrillResult executeDrillSteps(DrillPlan plan) {
        DrillResult result = new DrillResult();
        result.setPlanId(plan.getId());
        result.setStartTime(LocalDateTime.now());

        List<DrillStep> steps = plan.getSteps();
        List<DrillStepResult> stepResults = new ArrayList<>();

        for (DrillStep step : steps) {
            DrillStepResult stepResult = executeDrillStep(step);
            stepResults.add(stepResult);

            if (!Boolean.TRUE.equals(stepResult.getSuccess())) {
                result.setSuccess(false);
                result.setErrorMessage(stepResult.getErrorMessage());
                break;
            }
        }

        result.setStepResults(stepResults);
        result.setEndTime(LocalDateTime.now());

        // No step failed, so the drill as a whole succeeded
        if (result.getSuccess() == null) {
            result.setSuccess(true);
        }

        return result;
    }

    /**
     * Executes a single drill step.
     * @param step the drill step
     * @return the step result
     */
    private DrillStepResult executeDrillStep(DrillStep step) {
        try {
            log.debug("Executing drill step: stepId={}, type={}", step.getId(), step.getType());

            // Captured up front because each handler returns a fresh result object
            LocalDateTime startTime = LocalDateTime.now();
            DrillStepResult result;

            switch (step.getType()) {
                case "REGION_SWITCH":
                    result = executeRegionSwitchStep(step);
                    break;
                case "DATA_SYNC":
                    result = executeDataSyncStep(step);
                    break;
                case "HEALTH_CHECK":
                    result = executeHealthCheckStep(step);
                    break;
                case "TRAFFIC_SPLIT":
                    result = executeTrafficSplitStep(step);
                    break;
                default:
                    result = new DrillStepResult();
                    result.setStepId(step.getId());
                    result.setSuccess(false);
                    result.setErrorMessage("Unknown drill step type: " + step.getType());
            }

            result.setStartTime(startTime);
            result.setEndTime(LocalDateTime.now());
            return result;

        } catch (Exception e) {
            log.error("Drill step failed: stepId={}", step.getId(), e);

            DrillStepResult result = new DrillStepResult();
            result.setStepId(step.getId());
            result.setSuccess(false);
            result.setErrorMessage("Execution failed: " + e.getMessage());
            result.setEndTime(LocalDateTime.now());

            return result;
        }
    }

    /**
     * Executes a region switch step.
     */
    private DrillStepResult executeRegionSwitchStep(DrillStep step) {
        DrillStepResult result = new DrillStepResult();
        result.setStepId(step.getId());

        try {
            Map<String, Object> params = step.getParameters();
            String fromRegion = (String) params.get("fromRegion");
            String toRegion = (String) params.get("toRegion");

            RegionSwitchResult switchResult = multiActiveService.switchRegion(fromRegion, toRegion);

            result.setSuccess(switchResult.isSuccess());
            result.setResult(JSON.toJSONString(switchResult));

            if (!switchResult.isSuccess()) {
                result.setErrorMessage(switchResult.getMessage());
            }

        } catch (Exception e) {
            result.setSuccess(false);
            result.setErrorMessage("Region switch failed: " + e.getMessage());
        }

        return result;
    }

    /**
     * Executes a data sync step.
     */
    private DrillStepResult executeDataSyncStep(DrillStep step) {
        DrillStepResult result = new DrillStepResult();
        result.setStepId(step.getId());

        try {
            Map<String, Object> params = step.getParameters();
            String targetRegion = (String) params.get("targetRegion");

            // Simulate a data sync with a test payload
            SyncResult syncResult = dataSyncService.syncData("test_data", targetRegion);

            result.setSuccess(syncResult.isSuccess());
            result.setResult(JSON.toJSONString(syncResult));

            if (!syncResult.isSuccess()) {
                result.setErrorMessage(syncResult.getMessage());
            }

        } catch (Exception e) {
            result.setSuccess(false);
            result.setErrorMessage("Data sync failed: " + e.getMessage());
        }

        return result;
    }

    /**
     * Executes a health check step.
     */
    private DrillStepResult executeHealthCheckStep(DrillStep step) {
        DrillStepResult result = new DrillStepResult();
        result.setStepId(step.getId());

        try {
            Map<String, Object> params = step.getParameters();
            String region = (String) params.get("region");

            boolean healthy = regionHealthService.isRegionHealthy(region);

            result.setSuccess(healthy);
            result.setResult("{\"healthy\": " + healthy + "}");

            if (!healthy) {
                result.setErrorMessage("Region is unhealthy: " + region);
            }

        } catch (Exception e) {
            result.setSuccess(false);
            result.setErrorMessage("Health check failed: " + e.getMessage());
        }

        return result;
    }

    /**
     * Executes a traffic split step.
     */
    private DrillStepResult executeTrafficSplitStep(DrillStep step) {
        DrillStepResult result = new DrillStepResult();
        result.setStepId(step.getId());

        try {
            Map<String, Object> params = step.getParameters();
            String region = (String) params.get("region");
            String action = (String) params.get("action");

            // Simulated only: no traffic is actually moved in this step
            boolean success = true;
            String message = "Traffic split operation succeeded: " + action + " -> " + region;

            result.setSuccess(success);
            result.setResult("{\"message\": \"" + message + "\"}");

        } catch (Exception e) {
            result.setSuccess(false);
            result.setErrorMessage("Traffic split failed: " + e.getMessage());
        }

        return result;
    }
}

import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import com.baomidou.mybatisplus.annotation.IdType;
import com.baomidou.mybatisplus.annotation.TableId;
import com.baomidou.mybatisplus.annotation.TableName;
import lombok.Data;

/**
 * Drill plan.
 */
@Data
@TableName("drill_plans")
public class DrillPlan {
    @TableId(type = IdType.AUTO)
    private Long id;
    private String name;
    private String description;
    private String status;
    private LocalDateTime createTime;
    private LocalDateTime startTime;
    private LocalDateTime endTime;
    private String result;
    // NOTE: a List is not a plain column; persisting the steps requires a
    // type handler (for example JSON) or a separate table.
    private List<DrillStep> steps = new ArrayList<>();
}

/**
 * Drill step.
 */
@Data
public class DrillStep {
    private Long id;
    private String type;
    private String name;
    private String description;
    private Map<String, Object> parameters = new HashMap<>();
    private int order;
}

/**
 * Drill result.
 */
@Data
public class DrillResult {
    private Long planId;
    private Boolean success;
    private String errorMessage;
    private LocalDateTime startTime;
    private LocalDateTime endTime;
    private List<DrillStepResult> stepResults = new ArrayList<>();
}

/**
 * Drill step result.
 */
@Data
public class DrillStepResult {
    private Long stepId;
    private Boolean success;
    private String errorMessage;
    private String result;
    private LocalDateTime startTime;
    private LocalDateTime endTime;
}
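
`DrillStep` carries an `order` field, yet the steps run in whatever order the list holds them, and the start and end timestamps on `DrillResult` are stored but never evaluated. As a sketch under those observations, the helper below sorts steps by `order` and compares a drill's elapsed time with a recovery time objective (RTO). `DrillReport`, its `Step` stand-in, and the idea of checking against an RTO are assumptions added for illustration.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper for ordering drill steps and evaluating drill duration. */
final class DrillReport {

    /** Minimal stand-in for the article's DrillStep, reduced to the fields used here. */
    static final class Step {
        final String name;
        final int order;

        Step(String name, int order) {
            this.name = name;
            this.order = order;
        }
    }

    private DrillReport() {
    }

    /** Returns a copy of the steps sorted by their order field, leaving the input list untouched. */
    static List<Step> inExecutionOrder(List<Step> steps) {
        List<Step> sorted = new ArrayList<>(steps);
        sorted.sort(Comparator.comparingInt(s -> s.order));
        return sorted;
    }

    /** Elapsed time of a drill, taken from its recorded start and end timestamps. */
    static Duration elapsed(LocalDateTime start, LocalDateTime end) {
        return Duration.between(start, end);
    }

    /** True when the drill finished within the given recovery time objective. */
    static boolean withinRto(LocalDateTime start, LocalDateTime end, Duration rto) {
        return elapsed(start, end).compareTo(rto) <= 0;
    }
}
```

Tracking this value from drill to drill gives a concrete number to compare against the target, rather than only a pass or fail status.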

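6. Summary

With multi-region active-active deployment and disaster recovery drills in place, the result is a distributed system built for high availability and disaster tolerance. Its key characteristics are: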

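6.1 Core Advantages

  1. Active-active architecture: data centers in multiple regions serve traffic simultaneously
  2. Data synchronization: real-time cross-region data synchronization with consistency safeguards
  3. Traffic scheduling: intelligent traffic distribution and load balancing
  4. Disaster recovery switchover: automatic failure detection and switchover
  5. Drill management: disaster recovery drill planning and execution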

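6.2 Best Practices

  1. Architecture design: a sound active-active architecture and region layout
  2. Data synchronization: reliable synchronization and conflict resolution
  3. Failure handling: thorough failure detection and automatic switchover
  4. Drill management: regular disaster recovery drills and verification
  5. Monitoring and alerting: comprehensive monitoring and alerting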

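This multi-region active-active approach provides highly available service, and the disaster recovery drills verify that the system can actually tolerate a regional failure, which makes it an important piece of infrastructure for enterprise applications.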