How to Keep Service-to-Service Calls Stable: Rate Limiting, Circuit Breaking, Fallback, Timeout, Retry, and Isolation

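1. Overview

1.1 Why the Stability of Service-to-Service Calls Matters

Keeping service-to-service calls stable is one of the core problems in microservice architecture design. It directly affects system availability and user experience.

Challenges of service-to-service calls

  • Unreliable network: latency, timeouts, packet loss
  • Service failures: crashes, slow responses, errors
  • Traffic spikes: sudden bursts that overload a service
  • Cascading failures: one failing service drags down the whole system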

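1.2 Stability Mechanisms

Six protection mechanisms

  1. Rate limiting: cap the request rate so a service is not overloaded
  2. Circuit breaking: fail fast to stop cascading failures
  3. Fallback: provide a degraded result when a service is unavailable
  4. Timeout: bound how long a request may wait
  5. Retry: retry failed calls automatically to raise the success rate
  6. Isolation: partition resources so a failure cannot spread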

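1.3 Structure of This Article

This article covers the stability of service-to-service calls from the following angles:

  1. Rate limiting: algorithms, implementations, best practices
  2. Circuit breaking: mechanism, state transitions, best practices
  3. Fallback: strategies, implementations, best practices
  4. Timeout: configuration, handling, best practices
  5. Retry: strategies, implementation, best practices
  6. Isolation: thread isolation, semaphore isolation, best practices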

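2. Rate Limiting

2.1 Rate Limiting Algorithms

2.1.1 Fixed Window

Fixed window: limit the number of requests inside each fixed time window.

Characteristics

  • Simple to implement
  • Traffic can still arrive in bursts within a window
  • Boundary problem: requests clustered around a window switch can reach up to twice the limit in a short interval

Implementation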

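import java.util.concurrent.TimeUnit;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.stereotype.Service;

@Service
public class FixedWindowRateLimiter {

    @Autowired
    private RedisTemplate<String, String> redisTemplate;

    /**
     * Fixed-window rate limiting.
     */
    public boolean tryAcquire(String key, int limit, int windowSeconds) {
        String cacheKey = "rate_limit:fixed:" + key;

        // INCR is atomic, so concurrent callers cannot both pass on a stale count
        // (a separate GET followed by INCR would allow that).
        Long count = redisTemplate.opsForValue().increment(cacheKey);
        if (count == null) {
            return false; // null is only returned inside a pipeline or transaction
        }

        if (count == 1L) {
            // Start the window on the first request only. Refreshing the TTL on
            // every request would keep the window open forever under steady traffic.
            redisTemplate.expire(cacheKey, windowSeconds, TimeUnit.SECONDS);
        }

        return count <= limit; // false: over the limit
    }
}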
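
The boundary problem listed above is easier to see without Redis. The following is a minimal in-memory sketch, not the article's implementation: `FixedWindowCounter` is a hypothetical class, and the clock is passed in as a parameter so the behaviour can be replayed deterministically.

```java
/** In-memory fixed-window counter (illustrative; names are hypothetical). */
final class FixedWindowCounter {
    private final int limit;
    private final long windowMillis;
    private long windowIndex = -1; // index of the window currently being counted
    private int count;

    FixedWindowCounter(int limit, long windowMillis) {
        this.limit = limit;
        this.windowMillis = windowMillis;
    }

    synchronized boolean tryAcquire(long nowMillis) {
        long index = nowMillis / windowMillis;
        if (index != windowIndex) {
            // A new window starts: the counter resets regardless of recent traffic
            windowIndex = index;
            count = 0;
        }
        if (count >= limit) {
            return false;
        }
        count++;
        return true;
    }
}
```

With a limit of 3 per second, three requests at 990 to 999 ms and three more at 1000 to 1002 ms are all accepted: six requests within about 12 ms, twice the nominal limit.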

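2.1.2 Sliding Window

Sliding window: limit the number of requests inside a time window that moves with the current time.

Characteristics

  • Smoother limiting
  • Removes the boundary problem of the fixed window
  • Somewhat more complex to implement

Implementation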

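import java.util.UUID;
import java.util.concurrent.TimeUnit;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.stereotype.Service;

@Service
public class SlidingWindowRateLimiter {

    @Autowired
    private RedisTemplate<String, String> redisTemplate;

    /**
     * Sliding-window rate limiting backed by a sorted set of request timestamps.
     * The steps below are separate Redis commands, so concurrent callers can
     * slightly exceed the limit; run them in one Lua script for strict enforcement.
     */
    public boolean tryAcquire(String key, int limit, int windowSeconds) {
        String zsetKey = "rate_limit:sliding:" + key + ":zset";
        long now = System.currentTimeMillis();
        long windowStart = now - windowSeconds * 1000L;

        // Drop entries that have fallen out of the window
        redisTemplate.opsForZSet().removeRangeByScore(zsetKey, 0, windowStart);

        // Count the requests still inside the window
        Long count = redisTemplate.opsForZSet().count(zsetKey, windowStart, now);

        if (count != null && count >= limit) {
            return false; // over the limit
        }

        // Members must be unique: with the bare timestamp as member, two requests
        // in the same millisecond would collapse into a single entry.
        String member = now + ":" + UUID.randomUUID();
        redisTemplate.opsForZSet().add(zsetKey, member, now);
        redisTemplate.expire(zsetKey, windowSeconds, TimeUnit.SECONDS);

        return true;
    }
}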
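
For comparison, the same idea can be sketched in memory as a sliding log. This is an illustrative sketch under assumptions: `SlidingWindowLog` is a hypothetical class, and the clock is injected as a parameter so the result is reproducible.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** In-memory sliding-window log (illustrative; names are hypothetical). */
final class SlidingWindowLog {
    private final int limit;
    private final long windowMillis;
    private final Deque<Long> timestamps = new ArrayDeque<>();

    SlidingWindowLog(int limit, long windowMillis) {
        this.limit = limit;
        this.windowMillis = windowMillis;
    }

    synchronized boolean tryAcquire(long nowMillis) {
        long cutoff = nowMillis - windowMillis;
        // Evict timestamps that are no longer inside (cutoff, now]
        while (!timestamps.isEmpty() && timestamps.peekFirst() <= cutoff) {
            timestamps.pollFirst();
        }
        if (timestamps.size() >= limit) {
            return false;
        }
        timestamps.addLast(nowMillis);
        return true;
    }
}
```

With a limit of 3 per second, three requests just before the 1000 ms mark block every request just after it, because the window travels with the clock instead of resetting. The cost is memory proportional to the limit per key.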

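2.1.3 Token Bucket

Token bucket: tokens are added at a fixed rate up to the bucket capacity, and each request consumes one token.

Characteristics

  • Allows bursts up to the bucket capacity
  • Limits the long-run average rate
  • Somewhat more complex to implement

Implementation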

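import com.alibaba.fastjson.JSON;
import java.util.concurrent.TimeUnit;
import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.stereotype.Service;

@Service
public class TokenBucketRateLimiter {

    @Autowired
    private RedisTemplate<String, String> redisTemplate;

    /**
     * Token-bucket rate limiting: refillRate tokens are added every refillPeriod seconds.
     * The read-modify-write below is not atomic; under concurrency, move it into
     * a Lua script or guard it with a lock.
     */
    public boolean tryAcquire(String key, int capacity, int refillRate, int refillPeriod) {
        String cacheKey = "rate_limit:token:" + key;
        long now = System.currentTimeMillis();

        // Load the current token count and last refill time
        String bucketStr = redisTemplate.opsForValue().get(cacheKey);
        TokenBucket bucket;

        if (bucketStr == null) {
            bucket = new TokenBucket(capacity, now); // a new bucket starts full
        } else {
            bucket = JSON.parseObject(bucketStr, TokenBucket.class);

            // Compute how many whole tokens have accrued (long math avoids overflow)
            long periodMillis = refillPeriod * 1000L;
            long elapsed = now - bucket.getLastRefillTime();
            long tokensToAdd = elapsed * refillRate / periodMillis;

            if (tokensToAdd > 0) {
                bucket.setTokens((int) Math.min(capacity, bucket.getTokens() + tokensToAdd));
                // Advance only by the time that produced whole tokens, so the
                // fractional remainder is not thrown away on every refill.
                bucket.setLastRefillTime(
                        bucket.getLastRefillTime() + tokensToAdd * periodMillis / refillRate);
            }
        }

        if (bucket.getTokens() <= 0) {
            return false; // no token available
        }

        // Consume one token
        bucket.setTokens(bucket.getTokens() - 1);
        redisTemplate.opsForValue().set(cacheKey, JSON.toJSONString(bucket), 1, TimeUnit.HOURS);

        return true;
    }

    @Data
    @NoArgsConstructor  // needed so the JSON library can deserialize the bucket
    @AllArgsConstructor
    public static class TokenBucket {
        private int tokens;
        private long lastRefillTime;
    }
}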
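
The burst-then-throttle behaviour can be checked with a small in-memory sketch. This is not the Redis version above: `InMemoryTokenBucket` is a hypothetical class that keeps fractional tokens and takes the clock as a parameter.

```java
/** In-memory token bucket with fractional tokens (illustrative; names are hypothetical). */
final class InMemoryTokenBucket {
    private final double capacity;
    private final double refillPerMilli;
    private double tokens;
    private long lastRefillMillis;

    InMemoryTokenBucket(int capacity, int refillPerSecond, long nowMillis) {
        this.capacity = capacity;
        this.refillPerMilli = refillPerSecond / 1000.0;
        this.tokens = capacity; // start full so an initial burst is allowed
        this.lastRefillMillis = nowMillis;
    }

    synchronized boolean tryAcquire(long nowMillis) {
        if (nowMillis > lastRefillMillis) {
            // Lazy refill: credit the tokens accrued since the last call, capped at capacity
            double accrued = (nowMillis - lastRefillMillis) * refillPerMilli;
            tokens = Math.min(capacity, tokens + accrued);
            lastRefillMillis = nowMillis;
        }
        if (tokens < 1.0) {
            return false;
        }
        tokens -= 1.0;
        return true;
    }
}
```

With capacity 2 and one token per second, two requests at time 0 pass as a burst, a third is rejected, and the next request only succeeds once a full token has accrued.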

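2.1.4 Leaky Bucket

Leaky bucket: requests drain from the bucket at a fixed rate, and requests that arrive when the bucket is full are dropped.

Characteristics

  • Smooth output
  • Does not pass bursts downstream
  • Relatively simple to implement

Implementation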

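import com.alibaba.fastjson.JSON;
import java.util.concurrent.TimeUnit;
import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.stereotype.Service;

@Service
public class LeakyBucketRateLimiter {

    @Autowired
    private RedisTemplate<String, String> redisTemplate;

    /**
     * Leaky-bucket rate limiting: leakRate units drain every leakPeriod seconds.
     * This is the "meter" form: an admitted request proceeds immediately, so it
     * bounds the backlog but does not itself space requests out. The
     * read-modify-write is not atomic; use a Lua script or lock under concurrency.
     */
    public boolean tryAcquire(String key, int capacity, int leakRate, int leakPeriod) {
        String cacheKey = "rate_limit:leaky:" + key;
        long now = System.currentTimeMillis();

        // Load the current water level and last leak time
        String bucketStr = redisTemplate.opsForValue().get(cacheKey);
        LeakyBucket bucket;

        if (bucketStr == null) {
            bucket = new LeakyBucket(0, now); // a new bucket starts empty
        } else {
            bucket = JSON.parseObject(bucketStr, LeakyBucket.class);

            // Compute how much water has leaked out (long math avoids overflow)
            long periodMillis = leakPeriod * 1000L;
            long elapsed = now - bucket.getLastLeakTime();
            long waterToLeak = elapsed * leakRate / periodMillis;

            if (waterToLeak > 0) {
                bucket.setWater((int) Math.max(0, bucket.getWater() - waterToLeak));
                // Advance only by the time that leaked whole units
                bucket.setLastLeakTime(
                        bucket.getLastLeakTime() + waterToLeak * periodMillis / leakRate);
            }
        }

        if (bucket.getWater() >= capacity) {
            return false; // bucket is full
        }

        // Add one unit of water for this request
        bucket.setWater(bucket.getWater() + 1);
        redisTemplate.opsForValue().set(cacheKey, JSON.toJSONString(bucket), 1, TimeUnit.HOURS);

        return true;
    }

    @Data
    @NoArgsConstructor  // needed so the JSON library can deserialize the bucket
    @AllArgsConstructor
    public static class LeakyBucket {
        private int water;
        private long lastLeakTime;
    }
}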
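
To get the "smooth output" property, a leaky bucket has to space requests out rather than only count them. One way to sketch this, as an assumption and not the article's implementation, is to hand each admitted request a delay. `LeakyBucketShaper` and its parameters are hypothetical names.

```java
/** Leaky bucket used as a traffic shaper (illustrative; names are hypothetical). */
final class LeakyBucketShaper {
    private final long intervalNanos;  // spacing between two released requests
    private final long maxDelayNanos;  // longest wait the bucket will queue
    private long nextFreeNanos = Long.MIN_VALUE;

    LeakyBucketShaper(int ratePerSecond, int capacity) {
        this.intervalNanos = 1_000_000_000L / ratePerSecond;
        this.maxDelayNanos = intervalNanos * capacity;
    }

    /** Returns how long the caller must wait in nanoseconds, or -1 if the bucket is full. */
    synchronized long reserve(long nowNanos) {
        long start = Math.max(nowNanos, nextFreeNanos);
        long delay = start - nowNanos;
        if (delay > maxDelayNanos) {
            return -1; // overflow: drop the request
        }
        nextFreeNanos = start + intervalNanos; // the next request leaves one interval later
        return delay;
    }
}
```

At 10 requests per second with capacity 2, four simultaneous requests get delays of 0 ms, 100 ms, 200 ms, and a rejection. The caller is expected to sleep or schedule for the returned delay, which is what turns a burst into evenly spaced output.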

2.2 Implementing Rate Limiting

2.2.1 Gateway-Level Rate Limiting

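import java.net.InetSocketAddress;
import java.time.Duration;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.cloud.gateway.filter.GatewayFilter;
import org.springframework.cloud.gateway.filter.GatewayFilterChain;
import org.springframework.data.redis.core.ReactiveStringRedisTemplate;
import org.springframework.http.HttpStatus;
import org.springframework.http.server.reactive.ServerHttpRequest;
import org.springframework.http.server.reactive.ServerHttpResponse;
import org.springframework.stereotype.Component;
import org.springframework.web.server.ServerWebExchange;
import reactor.core.publisher.Mono;

@Component
public class RateLimitGatewayFilter implements GatewayFilter {

    // Reactive template: a blocking RedisTemplate call here would stall the
    // gateway's event-loop thread.
    @Autowired
    private ReactiveStringRedisTemplate redisTemplate;

    @Override
    public Mono<Void> filter(ServerWebExchange exchange, GatewayFilterChain chain) {
        String key = getClientIp(exchange.getRequest());

        // Limit: 100 requests per second per client IP
        return tryAcquire(key, 100, 1).flatMap(allowed -> {
            if (!allowed) {
                ServerHttpResponse response = exchange.getResponse();
                response.setStatusCode(HttpStatus.TOO_MANY_REQUESTS);
                return response.setComplete();
            }
            return chain.filter(exchange);
        });
    }

    private String getClientIp(ServerHttpRequest request) {
        // X-Forwarded-For can be set by the client; trust it only behind your own proxy.
        String forwarded = request.getHeaders().getFirst("X-Forwarded-For");
        if (forwarded != null && !forwarded.isEmpty() && !"unknown".equalsIgnoreCase(forwarded)) {
            // The header may hold "client, proxy1, proxy2"; the first entry is the client
            return forwarded.split(",")[0].trim();
        }
        InetSocketAddress remote = request.getRemoteAddress();
        return remote != null ? remote.getAddress().getHostAddress() : "unknown";
    }

    private Mono<Boolean> tryAcquire(String key, int limit, int windowSeconds) {
        // Fixed-window limiting: atomic INCR, TTL set once when the window opens
        String cacheKey = "rate_limit:gateway:" + key;
        return redisTemplate.opsForValue().increment(cacheKey)
                .flatMap(count -> count == 1L
                        ? redisTemplate.expire(cacheKey, Duration.ofSeconds(windowSeconds))
                                .thenReturn(count)
                        : Mono.just(count))
                .map(count -> count <= limit);
    }
}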

2.2.2 Service-Level Rate Limiting

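import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

@Service
public class OrderService {

    @Autowired
    private TokenBucketRateLimiter rateLimiter;

    /**
     * Create an order (protected by rate limiting).
     */
    public Order createOrder(OrderRequest request) {
        String key = "order:create:" + request.getUserId();

        // Limit: bucket of 10 tokens, refilled at 10 tokens per second, per user
        if (!rateLimiter.tryAcquire(key, 10, 10, 1)) {
            throw new BusinessException("Too many requests, please try again later");
        }

        // Run the business logic (doCreateOrder is omitted here)
        return doCreateOrder(request);
    }
}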

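2.3 Rate Limiting Best Practices

Best practices

  1. Multi-level limiting: gateway-level limiting plus service-level limiting
  2. Dynamic limiting: adjust the thresholds according to system load
  3. Degrade on limit: return a fallback result when the limit is exceeded instead of rejecting outright
  4. Monitoring: track how often limits trigger and tune the policy accordingly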

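3. Circuit Breaking

3.1 How a Circuit Breaker Works

3.1.1 Circuit Breaker States

A circuit breaker has three states

  • Closed: normal state; requests pass through
  • Open: tripped state; requests fail immediately without calling the service
  • Half-Open: probing state; a small number of requests are let through to test whether the service has recovered

3.1.2 State Transitions

Transition rules

  • Closed → Open: the failure rate, or the number of consecutive failures, exceeds its threshold
  • Open → Half-Open: happens automatically after a configured wait
  • Half-Open → Closed: the probe requests succeed, so the service is considered recovered
  • Half-Open → Open: a probe request fails, so the service is still unavailable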
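
The four transitions can be sketched as a small state machine. This is a simplified illustration, not how Resilience4j or Hystrix are implemented: it trips on consecutive failures only, and in Half-Open it does not cap the number of probe requests. `SimpleCircuitBreaker` is a hypothetical class, and the clock is passed in as a parameter.

```java
/** Minimal circuit-breaker state machine (illustrative; names are hypothetical). */
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold; // consecutive failures that trip the breaker
    private final long openMillis;      // how long to stay Open before probing
    private State state = State.CLOSED;
    private int consecutiveFailures;
    private long openedAt;

    SimpleCircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    synchronized boolean allowRequest(long nowMillis) {
        if (state == State.OPEN && nowMillis - openedAt >= openMillis) {
            state = State.HALF_OPEN; // Open -> Half-Open after the wait
        }
        return state != State.OPEN;
    }

    synchronized void onSuccess() {
        if (state != State.OPEN) { // ignore late results that arrive while Open
            consecutiveFailures = 0;
            state = State.CLOSED;  // Half-Open -> Closed
        }
    }

    synchronized void onFailure(long nowMillis) {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;    // Closed -> Open, or Half-Open -> Open
            openedAt = nowMillis;
        }
    }

    synchronized State state() {
        return state;
    }
}
```

A caller would check `allowRequest` before each call and report the outcome through `onSuccess` or `onFailure`.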

3.2 Implementing Circuit Breaking

3.2.1 With Resilience4j

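import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import java.time.Duration;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

@Slf4j
@Service
public class PaymentService {

    private final CircuitBreaker circuitBreaker;

    @Autowired
    private PaymentClient paymentClient;

    public PaymentService() {
        // Configure the circuit breaker
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
                .failureRateThreshold(50)                        // trip at a 50% failure rate
                .waitDurationInOpenState(Duration.ofSeconds(10)) // stay Open for 10 seconds
                .slidingWindowSize(10)                           // evaluate the last 10 calls
                .minimumNumberOfCalls(5)                         // need 5 calls before evaluating
                .permittedNumberOfCallsInHalfOpenState(3)        // 3 probe calls in Half-Open
                .build();

        this.circuitBreaker = CircuitBreaker.of("payment", config);
    }

    /**
     * Call the payment service (protected by the circuit breaker).
     * When the breaker is Open, executeSupplier throws CallNotPermittedException.
     */
    public PaymentResult pay(PaymentRequest request) {
        return circuitBreaker.executeSupplier(() -> {
            try {
                return paymentClient.pay(request);
            } catch (Exception e) {
                log.error("Payment service call failed", e);
                // Rethrow so the breaker records the failure
                throw new BusinessException("Payment service call failed", e);
            }
        });
    }

    /**
     * Current circuit breaker state.
     */
    public CircuitBreaker.State getCircuitBreakerState() {
        return circuitBreaker.getState();
    }

    /**
     * Circuit breaker metrics.
     */
    public CircuitBreaker.Metrics getCircuitBreakerMetrics() {
        return circuitBreaker.getMetrics();
    }
}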

3.2.2 With Hystrix

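import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;
import com.netflix.hystrix.contrib.javanica.annotation.HystrixProperty;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

// Note: Hystrix is in maintenance mode; the example is kept for existing systems.
@Slf4j
@Service
public class OrderService {

    @Autowired
    private InventoryClient inventoryClient;

    /**
     * Call the inventory service (protected by a Hystrix circuit breaker).
     */
    @HystrixCommand(
        fallbackMethod = "deductStockFallback",
        commandProperties = {
            // At least 10 requests in the rolling window before the breaker can trip
            @HystrixProperty(name = "circuitBreaker.requestVolumeThreshold", value = "10"),
            // Trip at a 50% error rate
            @HystrixProperty(name = "circuitBreaker.errorThresholdPercentage", value = "50"),
            // Stay open for 10 seconds before letting a probe request through
            @HystrixProperty(name = "circuitBreaker.sleepWindowInMilliseconds", value = "10000"),
            // Time out each call after 3 seconds
            @HystrixProperty(name = "execution.isolation.thread.timeoutInMilliseconds", value = "3000")
        }
    )
    public InventoryResult deductStock(Long skuId, Integer quantity) {
        return inventoryClient.deductStock(skuId, quantity);
    }

    /**
     * Fallback method: must have the same parameters as the protected method.
     */
    public InventoryResult deductStockFallback(Long skuId, Integer quantity) {
        log.warn("Inventory service fallback: skuId={}, quantity={}", skuId, quantity);
        // Return a degraded result
        return InventoryResult.failure("Inventory service is temporarily unavailable, please try again later");
    }
}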

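3.3 Circuit Breaking Best Practices

Best practices

  1. Set sensible thresholds: choose failure-rate and failure-count thresholds that match the business
  2. Monitor breaker state: watch the state in real time so problems surface early
  3. Provide a fallback: return a degraded result while the breaker is open instead of a bare failure
  4. Recover automatically: set a reasonable wait time so the breaker probes for recovery on its own

4. Fallback

4.1 Fallback Strategies

4.1.1 Types of Fallback

Types of fallback

  • Feature degradation: switch off non-core features
  • Data degradation: return cached or default data
  • Service degradation: call a backup service or return a default result

4.1.2 Implementing Fallback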

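import com.alibaba.fastjson.JSON;
import java.math.BigDecimal;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.stereotype.Service;

@Slf4j
@Service
public class ProductService {

    @Autowired
    private ProductClient productClient;

    @Autowired
    private RedisTemplate<String, String> redisTemplate;

    /**
     * Get product details (protected by fallback).
     */
    public Product getProduct(Long productId) {
        try {
            // Try the remote service first
            return productClient.getProduct(productId);
        } catch (Exception e) {
            log.warn("Product service call failed, using fallback: productId={}", productId, e);

            // Fallback: read from the cache
            return getProductFromCache(productId);
        }
    }

    /**
     * Fallback level 1: read from the cache.
     */
    private Product getProductFromCache(Long productId) {
        String cacheKey = "product:" + productId;
        String productJson = redisTemplate.opsForValue().get(cacheKey);

        if (productJson != null) {
            return JSON.parseObject(productJson, Product.class);
        }

        // Fallback level 2: nothing cached, return a default product
        return getDefaultProduct(productId);
    }

    /**
     * Default product. Callers must treat it as a placeholder: the zero price
     * is not a real price and must never reach checkout.
     */
    private Product getDefaultProduct(Long productId) {
        Product product = new Product();
        product.setId(productId);
        product.setName("Product information is temporarily unavailable");
        product.setPrice(BigDecimal.ZERO);
        return product;
    }
}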

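4.2 Fallback Best Practices

Best practices

  1. Multi-level fallback: cached data → default data → feature switched off
  2. Alert on fallback: raise an alert when a fallback triggers so the cause gets fixed promptly
  3. Recover from fallback: leave degraded mode automatically once the service recovers
  4. Monitor fallback: track how often fallbacks trigger and analyse why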
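
The multi-level chain in item 1 can be sketched generically: try each level in order and move on when a level fails or has nothing to offer. This is a minimal sketch under assumptions; `FallbackChain` is a hypothetical helper, not part of any library mentioned here.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Supplier;

/** Ordered fallback levels with a guaranteed last resort (illustrative; names are hypothetical). */
final class FallbackChain<T> {
    private final List<Supplier<Optional<T>>> levels;
    private final T lastResort;

    FallbackChain(List<Supplier<Optional<T>>> levels, T lastResort) {
        this.levels = levels;
        this.lastResort = lastResort;
    }

    T get() {
        for (Supplier<Optional<T>> level : levels) {
            try {
                Optional<T> value = level.get();
                if (value.isPresent()) {
                    return value.get(); // first level that produces a value wins
                }
            } catch (RuntimeException e) {
                // This level failed: fall through to the next one.
                // A real implementation would log and count the failure here.
            }
        }
        return lastResort; // every level failed or was empty
    }
}
```

Typical levels would be the remote call, then the cache, with a default object as the last resort. Keeping the order in one list makes the degradation path explicit and easy to monitor.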

5. Timeout

5.1 Setting Timeouts

5.1.1 Timeout Configuration

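import feign.Request;
import feign.Retryer;
import java.util.concurrent.TimeUnit;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class FeignConfig {

    @Bean
    public Request.Options requestOptions() {
        // Connect timeout: 5 seconds; read timeout: 10 seconds; follow redirects.
        // The (int, int) millisecond constructor is deprecated in recent Feign versions.
        return new Request.Options(5, TimeUnit.SECONDS, 10, TimeUnit.SECONDS, true);
    }

    @Bean
    public Retryer feignRetryer() {
        // Initial interval 100 ms, maximum interval 1 second, at most 3 attempts
        // in total (the first call plus up to 2 retries).
        return new Retryer.Default(100, 1000, 3);
    }
}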

5.1.2 Handling Timeouts

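import java.time.Duration;
import java.util.concurrent.TimeoutException;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;
import reactor.core.Exceptions;

@Slf4j
@Service
public class OrderService {

    // Assumed here to be a reactive client whose pay() returns Mono<PaymentResult>
    @Autowired
    private PaymentClient paymentClient;

    /**
     * Pay for an order (with timeout control).
     */
    public PaymentResult payOrder(PaymentRequest request) {
        try {
            // Fail the call if no result arrives within 3 seconds
            return paymentClient.pay(request)
                    .timeout(Duration.ofSeconds(3))
                    .block();
        } catch (RuntimeException e) {
            // block() rethrows the checked TimeoutException wrapped in a
            // RuntimeException, so catching TimeoutException directly does not compile.
            if (Exceptions.unwrap(e) instanceof TimeoutException) {
                log.warn("Payment service timeout: request={}", request, e);
                throw new BusinessException("Payment service timed out, please try again later");
            }
            log.error("Payment service call failed", e);
            throw new BusinessException("Payment service call failed", e);
        }
    }
}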

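5.2 Timeout Best Practices

Best practices

  1. Layered timeouts: gateway timeout > service timeout > database timeout, so an inner layer gives up before the layer that called it
  2. Dynamic timeouts: adjust timeouts according to observed response times
  3. Retry after timeout: a timed-out call may be retried, but cap the number of retries
  4. Monitor timeouts: track timeout rates and use them to tune the settings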
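
One way to keep layered timeouts consistent is to carry a single deadline through the request and derive each downstream timeout from the time that is left. The sketch below assumes this deadline-budget approach, which the article does not prescribe. `Deadline` is a hypothetical class, and the clock is passed in as a parameter.

```java
import java.time.Duration;

/** Request-scoped deadline from which per-call timeouts are derived (illustrative). */
final class Deadline {
    private final long deadlineNanos;

    private Deadline(long deadlineNanos) {
        this.deadlineNanos = deadlineNanos;
    }

    static Deadline after(Duration budget, long nowNanos) {
        return new Deadline(nowNanos + budget.toNanos());
    }

    /** Timeout for the next downstream call: the smaller of the remaining budget and the cap. */
    Duration timeoutFor(Duration perCallCap, long nowNanos) {
        long remaining = deadlineNanos - nowNanos;
        if (remaining <= 0) {
            // No budget left: fail fast instead of starting a call that cannot finish in time
            throw new IllegalStateException("deadline exceeded");
        }
        return Duration.ofNanos(Math.min(remaining, perCallCap.toNanos()));
    }
}
```

With a 5 second budget and a 3 second per-call cap, the first call gets 3 seconds; a call started 4 seconds in gets only the remaining 1 second, so the inner call always gives up before the outer one.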

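6. Retry

6.1 Retry Strategies

6.1.1 When to Retry

Retry conditions

  • Network errors: connect timeouts, read timeouts, and similar transient failures
  • 5xx errors: server-side errors are usually worth retrying, provided the operation is idempotent
  • Specific exceptions: for business exceptions, decide per exception type whether a retry can help

6.1.2 Implementing Retry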

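import java.net.ConnectException;
import java.util.concurrent.TimeoutException;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.retry.annotation.Backoff;
import org.springframework.retry.annotation.Recover;
import org.springframework.retry.annotation.Retryable;
import org.springframework.stereotype.Service;

// Requires @EnableRetry on a configuration class.
@Slf4j
@Service
public class OrderService {

    @Autowired
    private InventoryClient inventoryClient;

    /**
     * Deduct stock (with retry): up to 3 attempts, waiting 100 ms and then 200 ms.
     * Assumes InventoryClient.deductStock declares these checked exceptions
     * and that the operation is idempotent.
     */
    @Retryable(
        value = {TimeoutException.class, ConnectException.class},
        maxAttempts = 3,
        backoff = @Backoff(delay = 100, multiplier = 2)
    )
    public InventoryResult deductStock(Long skuId, Integer quantity)
            throws TimeoutException, ConnectException {
        // Let retryable exceptions propagate unchanged. Catching them here and
        // wrapping them in BusinessException would hide them from @Retryable,
        // so the retry would never trigger.
        return inventoryClient.deductStock(skuId, quantity);
    }

    /**
     * Called once all attempts have failed. Taking Exception as the first
     * parameter covers both retryable exception types.
     */
    @Recover
    public InventoryResult recover(Exception e, Long skuId, Integer quantity) {
        log.error("Inventory service retry failed: skuId={}", skuId, e);
        return InventoryResult.failure("Inventory service is temporarily unavailable, please try again later");
    }
}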

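6.2 Retry Best Practices

Best practices

  1. Exponential backoff: increase the interval between retries so a struggling service is not hammered
  2. Maximum attempts: cap the number of retries so they cannot go on forever
  3. Idempotency: make sure the retried operation is idempotent
  4. Monitor retries: track retry counts and analyse their causes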
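
Items 1 and 2 can be sketched in plain Java. The version below adds random jitter to the backoff, which is an addition of this sketch rather than something the list above requires; it keeps many clients from retrying in lockstep. `RetryWithBackoff` and its parameters are hypothetical names.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Predicate;

/** Retry helper with capped exponential backoff and jitter (illustrative). */
final class RetryWithBackoff {

    /** Upper bound of the wait after the given failed attempt: base, 2*base, 4*base, ... capped at max. */
    static long backoffCeilingMillis(int attempt, long baseMillis, long maxMillis) {
        long exponential = baseMillis << Math.min(attempt - 1, 20); // bounded shift avoids overflow for small bases
        return Math.min(maxMillis, exponential);
    }

    static <T> T call(Callable<T> operation, int maxAttempts, long baseMillis, long maxMillis,
                      Predicate<Exception> retryable) throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                return operation.call();
            } catch (Exception e) {
                // Give up when attempts are exhausted or the error is not worth retrying
                if (attempt >= maxAttempts || !retryable.test(e)) {
                    throw e;
                }
                long ceiling = backoffCeilingMillis(attempt, baseMillis, maxMillis);
                // Sleep a random time in [0, ceiling] so clients do not retry in sync
                Thread.sleep(ThreadLocalRandom.current().nextLong(ceiling + 1));
            }
        }
    }
}
```

With a base of 100 ms and a cap of 1000 ms, the ceilings for successive attempts are 100, 200, 400, 800, and then 1000 ms. The `retryable` predicate is where the retry conditions are encoded, and the operation itself still has to be idempotent.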

7. Isolation

7.1 Thread Isolation

7.1.1 Thread Pool Isolation

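import com.google.common.util.concurrent.ThreadFactoryBuilder;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import org.springframework.stereotype.Service;

@Service
public class OrderService {

    private final ExecutorService orderExecutor;
    private final ExecutorService paymentExecutor;

    public OrderService() {
        // Thread pool dedicated to order creation. Threads beyond the core size
        // are only started once the queue is full; when both are exhausted the
        // default policy rejects the task with RejectedExecutionException.
        this.orderExecutor = new ThreadPoolExecutor(
            10,                             // core pool size
            20,                             // maximum pool size
            60L, TimeUnit.SECONDS,          // keep-alive for idle extra threads
            new LinkedBlockingQueue<>(100), // bounded queue
            new ThreadFactoryBuilder().setNameFormat("order-pool-%d").build()
        );

        // Thread pool dedicated to payment calls
        this.paymentExecutor = new ThreadPoolExecutor(
            5,                              // core pool size
            10,                             // maximum pool size
            60L, TimeUnit.SECONDS,
            new LinkedBlockingQueue<>(50),
            new ThreadFactoryBuilder().setNameFormat("payment-pool-%d").build()
        );
    }

    /**
     * Create an order (isolated in its own thread pool).
     * doCreateOrder and doPayOrder are the business methods, omitted here.
     */
    public CompletableFuture<Order> createOrderAsync(OrderRequest request) {
        return CompletableFuture.supplyAsync(() -> doCreateOrder(request), orderExecutor);
    }

    /**
     * Pay for an order (isolated in its own thread pool).
     */
    public CompletableFuture<PaymentResult> payOrderAsync(PaymentRequest request) {
        return CompletableFuture.supplyAsync(() -> doPayOrder(request), paymentExecutor);
    }
}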

7.1.2 Thread Isolation with Hystrix

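import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;
import com.netflix.hystrix.contrib.javanica.annotation.HystrixProperty;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

@Service
public class OrderService {

    @Autowired
    private InventoryClient inventoryClient;

    /**
     * Deduct stock (Hystrix thread isolation).
     */
    @HystrixCommand(
        fallbackMethod = "deductStockFallback",
        commandProperties = {
            @HystrixProperty(name = "execution.isolation.strategy", value = "THREAD"),
            @HystrixProperty(name = "execution.isolation.thread.timeoutInMilliseconds", value = "3000")
            // maxConcurrentRequests exists only under execution.isolation.semaphore
            // and applies to the SEMAPHORE strategy, so it is not set here.
        },
        threadPoolProperties = {
            @HystrixProperty(name = "coreSize", value = "10"),
            // maximumSize only takes effect when it may diverge from coreSize
            @HystrixProperty(name = "allowMaximumSizeToDivergeFromCoreSize", value = "true"),
            @HystrixProperty(name = "maximumSize", value = "20"),
            // queueSizeRejectionThreshold is ignored while maxQueueSize is -1 (the default)
            @HystrixProperty(name = "maxQueueSize", value = "100"),
            @HystrixProperty(name = "queueSizeRejectionThreshold", value = "100")
        }
    )
    public InventoryResult deductStock(Long skuId, Integer quantity) {
        return inventoryClient.deductStock(skuId, quantity);
    }

    public InventoryResult deductStockFallback(Long skuId, Integer quantity) {
        return InventoryResult.failure("Inventory service is temporarily unavailable");
    }
}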

7.2 Semaphore Isolation

7.2.1 Implementing Semaphore Isolation

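import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

@Service
public class OrderService {

    // At most 10 concurrent calls to the inventory service
    private final Semaphore inventorySemaphore = new Semaphore(10);

    @Autowired
    private InventoryClient inventoryClient;

    /**
     * Deduct stock (semaphore isolation). The call runs on the caller's thread,
     * so the semaphore limits concurrency but cannot enforce a timeout.
     */
    public InventoryResult deductStock(Long skuId, Integer quantity) {
        try {
            // Wait up to 1 second for a permit
            if (!inventorySemaphore.tryAcquire(1, TimeUnit.SECONDS)) {
                throw new BusinessException("Inventory service is busy, please try again later");
            }

            try {
                return inventoryClient.deductStock(skuId, quantity);
            } finally {
                // Always release the permit, even if the call throws
                inventorySemaphore.release();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // restore the interrupt flag
            throw new BusinessException("Interrupted while waiting for a permit", e);
        }
    }
}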

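7.3 Isolation Best Practices

Best practices

  1. Resource isolation: give each downstream service its own thread pool or semaphore
  2. Sensible sizing: size thread pools and semaphores according to each service's characteristics
  3. Monitor isolation: track thread pool and semaphore usage
  4. Fault isolation: a failure in one service must not affect the others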
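
Item 1 can be sketched as one semaphore per downstream dependency, created on demand. This is a minimal illustration under assumptions: `Bulkheads` is a hypothetical helper, and unlike the timed `tryAcquire` shown earlier it fails fast when no permit is free.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** One semaphore per dependency so a slow dependency cannot use up shared capacity (illustrative). */
final class Bulkheads {
    private final Map<String, Semaphore> permits = new ConcurrentHashMap<>();
    private final int maxConcurrentPerDependency;

    Bulkheads(int maxConcurrentPerDependency) {
        this.maxConcurrentPerDependency = maxConcurrentPerDependency;
    }

    <T> T call(String dependency, Supplier<T> operation, Supplier<T> fallback) {
        Semaphore semaphore = permits.computeIfAbsent(
                dependency, d -> new Semaphore(maxConcurrentPerDependency));
        if (!semaphore.tryAcquire()) {
            return fallback.get(); // this dependency is saturated: fail fast
        }
        try {
            return operation.get();
        } finally {
            semaphore.release();
        }
    }
}
```

Because each dependency has its own permits, saturating the "inventory" bulkhead leaves calls to "payment" unaffected. The per-dependency limit is uniform here for brevity; in practice each dependency would get its own size.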

8. Putting It All Together

8.1 A Complete Example

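import com.google.common.util.concurrent.ThreadFactoryBuilder;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;
import reactor.core.Exceptions;

@Slf4j
@Service
public class OrderService {

    @Autowired
    private TokenBucketRateLimiter rateLimiter;

    @Autowired
    private CircuitBreaker paymentCircuitBreaker; // must be exposed as a bean elsewhere

    @Autowired
    private PaymentClient paymentClient; // assumed reactive: pay() returns a Mono

    // Used by the fallback below; the original listing did not declare them
    @Autowired
    private AlertService alertService;

    @Autowired
    private NotificationService notificationService;

    private final ExecutorService orderExecutor = new ThreadPoolExecutor(
        10, 20, 60L, TimeUnit.SECONDS,
        new LinkedBlockingQueue<>(100),
        new ThreadFactoryBuilder().setNameFormat("order-pool-%d").build()
    );

    /**
     * Create an order (with the full set of protections).
     */
    public Order createOrder(OrderRequest request) {
        // 1. Rate limiting
        String rateLimitKey = "order:create:" + request.getUserId();
        if (!rateLimiter.tryAcquire(rateLimitKey, 10, 10, 1)) {
            throw new BusinessException("Too many requests, please try again later");
        }

        // 2. Create the order
        Order order = doCreateOrder(request);

        // 3. Pay asynchronously, isolated in a dedicated thread pool
        CompletableFuture.runAsync(() -> payOrderWithProtection(order, request), orderExecutor);

        return order;
    }

    /**
     * Pay for an order (fully protected).
     */
    private void payOrderWithProtection(Order order, OrderRequest request) {
        try {
            // 1. Circuit breaker. The first call and its retries count as one
            //    call in the breaker's statistics.
            paymentCircuitBreaker.executeSupplier(() -> {
                try {
                    // 2. Timeout control: 3 seconds
                    return payOnce(order);
                } catch (RuntimeException e) {
                    // block() wraps the checked TimeoutException, so unwrap before testing
                    if (!(Exceptions.unwrap(e) instanceof TimeoutException)) {
                        throw e;
                    }
                    // 3. Retry after a timeout (at most 3 times)
                    return retryPayment(order, 3);
                }
            });

            // 4. Update the order status
            updateOrderStatus(order.getId(), OrderStatus.PAID);

        } catch (Exception e) {
            log.error("Payment failed: orderId={}", order.getId(), e);

            // 5. Fallback
            handlePaymentFallback(order, e);
        }
    }

    /** A single payment attempt with a 3-second timeout. */
    private PaymentResult payOnce(Order order) {
        return paymentClient.pay(buildPaymentRequest(order))
                .timeout(Duration.ofSeconds(3))
                .block();
    }

    /**
     * Retry the payment. The payment API must be idempotent (for example keyed
     * by order ID), because a timed-out attempt may still have succeeded.
     */
    private PaymentResult retryPayment(Order order, int maxRetries) {
        RuntimeException last = null;
        for (int i = 0; i < maxRetries; i++) {
            try {
                Thread.sleep(100L << i); // exponential backoff: 100, 200, 400 ms
                return payOnce(order);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // restore the flag and stop retrying
                throw new BusinessException("Payment retry interrupted", e);
            } catch (RuntimeException e) {
                last = e;
            }
        }
        throw new BusinessException("Payment retries exhausted", last);
    }

    /**
     * Fallback handling.
     */
    private void handlePaymentFallback(Order order, Exception e) {
        // 1. Log the fallback
        log.warn("Payment fallback: orderId={}", order.getId(), e);

        // 2. Raise an alert
        alertService.sendAlert("Payment service degraded", order.getId());

        // 3. Mark the order as pending payment
        updateOrderStatus(order.getId(), OrderStatus.PENDING);

        // 4. Notify the user
        notificationService.sendPaymentFailedNotification(order);
    }
}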

9. Monitoring and Alerting

9.1 Metrics

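import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Tags;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import org.springframework.stereotype.Component;

@Component
public class StabilityMetrics {

    private final MeterRegistry meterRegistry;

    // One gauge value per service; the map keeps a strong reference so the
    // gauge (which only holds a weak reference) keeps reporting.
    private final Map<String, AtomicInteger> breakerStates = new ConcurrentHashMap<>();

    public StabilityMetrics(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
    }

    /**
     * Record a rate-limiting decision.
     */
    public void recordRateLimit(String service, boolean allowed) {
        meterRegistry.counter("rate_limit", "service", service, "allowed", String.valueOf(allowed))
                .increment();
    }

    /**
     * Record the circuit breaker state (1 = open, 0 = otherwise).
     * The gauge is registered once per service and then updated in place;
     * registering a new gauge on every call would leave the first value stuck.
     */
    public void recordCircuitBreaker(String service, CircuitBreaker.State state) {
        AtomicInteger holder = breakerStates.computeIfAbsent(service, s ->
                meterRegistry.gauge("circuit_breaker_state", Tags.of("service", s),
                        new AtomicInteger(0)));
        holder.set(state == CircuitBreaker.State.OPEN ? 1 : 0);
    }

    /**
     * Record a fallback.
     */
    public void recordFallback(String service) {
        meterRegistry.counter("fallback", "service", service).increment();
    }

    /**
     * Record a timeout and how long the call had been running, in milliseconds.
     */
    public void recordTimeout(String service, long duration) {
        meterRegistry.timer("timeout", "service", service).record(duration, TimeUnit.MILLISECONDS);
    }

    /**
     * Record a retry. retryCount is used as a tag, so keep it small and bounded.
     */
    public void recordRetry(String service, int retryCount) {
        meterRegistry.counter("retry", "service", service, "count", String.valueOf(retryCount))
                .increment();
    }
}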

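10. Summary

10.1 Key Points

  1. Rate limiting: cap the request rate so a service is not overloaded (fixed window, sliding window, token bucket, leaky bucket)
  2. Circuit breaking: fail fast to stop cascading failures (Closed, Open, Half-Open)
  3. Fallback: provide a degraded result when a service is unavailable (feature, data, and service degradation)
  4. Timeout: bound how long a request may wait (layered timeouts, dynamic timeouts)
  5. Retry: retry failed calls automatically to raise the success rate (exponential backoff, maximum attempts)
  6. Isolation: partition resources so a failure cannot spread (thread isolation, semaphore isolation)

10.2 Key Takeaways

  1. Combine them: the six mechanisms work together to form a complete stability system
  2. Tune dynamically: adjust parameters according to system load and business characteristics
  3. Monitor and alert: watch every metric in real time so problems surface early
  4. Plan fallbacks: provide multiple fallback levels so core functionality stays available

10.3 Best Practices

  1. Rate limit at the gateway: limit traffic at the gateway to protect backend services
  2. Circuit break in the service layer: trip breakers in the service layer to fail fast
  3. Multi-level fallback: cached data → default data → feature switched off
  4. Layered timeouts: gateway timeout > service timeout > database timeout
  5. Idempotent retries: make sure every retried operation is idempotent
  6. Resource isolation: give each downstream service its own thread pool or semaphore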
