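Production QPS Capacity Evaluation and Architecture in Practice: Analyzing Real System Capacity, Capacity Evaluation Methods, and a Complete QPS Planning Approach for High-Concurrency Systems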

I. Scenario Analysis

1.1 Core Questions in Evaluating Production QPS

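Core questions an architect faces:
1. How much QPS can the system actually sustain?
- What is the current production QPS?
- What is the peak QPS?
- What is the system's maximum QPS?

2. How do you evaluate capacity accurately?
- Per-machine QPS ceiling
- Cluster QPS ceiling
- Where is the bottleneck?

3. How do you plan scale-out?
- When is scale-out needed?
- How many machines are required?
- How do you scale out smoothly?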

1.2 Methodology for Evaluating Production QPS

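Evaluation dimensions:
Live monitoring data:
- Current QPS
- Peak QPS
- Average response time
- Error rate

Capacity testing:
- Single-machine stress test
- Cluster stress test
- Limit (breaking-point) stress test

Bottleneck analysis:
- CPU usage
- Memory usage
- Thread pool state
- Database connection pool
- Network I/O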
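
The four monitoring figures listed above can all be derived from the same per-second samples. The following is a minimal sketch of that derivation in plain Java; the class `WindowMetrics`, its field names, and the sample numbers are illustrative assumptions, not part of any monitoring library.

```java
final class WindowMetrics {
    final double currentQps;    // mean requests/second over the most recent seconds
    final double peakQps;       // busiest single second in the window
    final double avgResponseMs; // total latency divided by total requests
    final double errorRate;     // failed requests divided by total requests

    private WindowMetrics(double currentQps, double peakQps,
                          double avgResponseMs, double errorRate) {
        this.currentQps = currentQps;
        this.peakQps = peakQps;
        this.avgResponseMs = avgResponseMs;
        this.errorRate = errorRate;
    }

    /**
     * Each array holds one entry per second, oldest first:
     * request count, error count, and summed latency in milliseconds.
     */
    static WindowMetrics of(long[] requests, long[] errors,
                            long[] latencySumMs, int recentSeconds) {
        int n = requests.length;
        int recent = Math.max(0, Math.min(recentSeconds, n));
        long total = 0, totalErrors = 0, totalLatency = 0, peak = 0, recentTotal = 0;
        for (int i = 0; i < n; i++) {
            total += requests[i];
            totalErrors += errors[i];
            totalLatency += latencySumMs[i];
            peak = Math.max(peak, requests[i]);
            if (i >= n - recent) {
                recentTotal += requests[i];
            }
        }
        return new WindowMetrics(
                recent == 0 ? 0 : (double) recentTotal / recent,
                peak,
                total == 0 ? 0 : (double) totalLatency / total,
                total == 0 ? 0 : (double) totalErrors / total);
    }

    public static void main(String[] args) {
        // Hypothetical four-second window.
        long[] requests = {100, 120, 300, 110};
        long[] errors = {1, 0, 6, 0};
        long[] latency = {2000, 2400, 9000, 2200};
        WindowMetrics m = of(requests, errors, latency, 2);
        System.out.printf("current=%.1f peak=%.1f avgMs=%.2f errorRate=%.4f%n",
                m.currentQps, m.peakQps, m.avgResponseMs, m.errorRate);
    }
}
```

Note that the peak is taken per second, while the current value is averaged over a short trailing window; an average over a long window hides exactly the bursts that capacity planning has to cover.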

II. Implementing Production QPS Monitoring

2.1 QPS Monitoring with Prometheus

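// QPSMonitor.java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Component;

import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

@Component
public class QPSMonitor {

    private final MeterRegistry meterRegistry;
    private final Timer responseTimer;

    // Local running total, used only for the in-process QPS estimate below.
    private final LongAdder totalRequests = new LongAdder();
    private long lastTotal;
    private long lastSampleNanos = System.nanoTime();

    public QPSMonitor(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
        this.responseTimer = Timer.builder("http.request.duration")
                .description("HTTP request duration")
                .register(meterRegistry);
    }

    /**
     * Record one request.
     */
    public void recordRequest(String path, String method, long durationMillis) {
        // Counter.increment() does not accept tags, so the tagged counter is
        // looked up (and created on first use) through the registry.
        meterRegistry.counter("http.requests.total",
                "path", path,
                "method", method,
                "status", "success").increment();
        responseTimer.record(durationMillis, TimeUnit.MILLISECONDS);
        totalRequests.increment();
    }

    /**
     * Current QPS of this instance: requests recorded since the previous call,
     * divided by the elapsed time. A counter's raw count() is a cumulative
     * total, not a rate, so it cannot be returned directly.
     */
    public synchronized double getCurrentQPS() {
        long now = System.nanoTime();
        long total = totalRequests.sum();
        double seconds = (now - lastSampleNanos) / 1_000_000_000.0;
        double qps = seconds > 0 ? (total - lastTotal) / seconds : 0.0;
        lastTotal = total;
        lastSampleNanos = now;
        return qps;
    }

    /**
     * Peak QPS over the last hour.
     */
    public double getPeakQPS() {
        // rate(http_requests_total[1h]) would give the hourly average.
        // The peak is the maximum of the 1-minute rate across the hour.
        return queryPrometheus(
                "max_over_time(sum(rate(http_requests_total[1m]))[1h:1m])");
    }

    private double queryPrometheus(String promQl) {
        // The article does not show this helper; it is expected to call the
        // Prometheus HTTP query API and return the scalar result.
        throw new UnsupportedOperationException("Prometheus client not shown");
    }
}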

2.2 Request Monitoring with Spring AOP

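// QPSAspect.java
import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.stereotype.Component;
import org.springframework.web.context.request.RequestAttributes;
import org.springframework.web.context.request.RequestContextHolder;
import org.springframework.web.context.request.ServletRequestAttributes;

import javax.servlet.http.HttpServletRequest; // jakarta.servlet.http on Spring Boot 3+
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.concurrent.TimeUnit;

@Aspect
@Component
public class QPSAspect {

    private static final DateTimeFormatter MINUTE = DateTimeFormatter.ofPattern("yyyyMMddHHmm");

    @Autowired
    private QPSMonitor qpsMonitor;

    // Counters are stored as plain strings so that INCR works and the value
    // can be parsed back into a long.
    @Autowired
    private StringRedisTemplate redisTemplate;

    /**
     * Intercept all controller requests.
     */
    @Around("execution(* com.example.controller..*(..))")
    public Object monitorRequest(ProceedingJoinPoint joinPoint) throws Throwable {
        long startNanos = System.nanoTime();
        String path = getRequestPath(joinPoint);
        String method = getRequestMethod();
        boolean success = false;

        try {
            Object result = joinPoint.proceed();
            success = true;
            return result;
        } finally {
            // finally also covers Errors and other Throwables, not just Exception.
            long duration = (System.nanoTime() - startNanos) / 1_000_000;
            try {
                if (success) {
                    qpsMonitor.recordRequest(path, method, duration);
                }
                recordToRedis(path, method, success ? "success" : "error");
            } catch (RuntimeException ignored) {
                // A monitoring failure must never fail the business request.
            }
        }
    }

    /**
     * Record to Redis for real-time QPS statistics.
     * The optional HyperLogLog de-duplication is omitted: with a random UUID
     * per request every entry is unique, so it only duplicated the counter.
     */
    private void recordToRedis(String path, String method, String status) {
        String minute = getCurrentMinute();
        // Note: raw URIs with path variables create one key per distinct URL.
        String pathKey = String.format("qps:path:%s:method:%s:%s:count", path, method, minute);
        String totalKey = "qps:minute:" + minute + ":count";

        redisTemplate.opsForValue().increment(pathKey);
        redisTemplate.opsForValue().increment(totalKey);
        // Expire the counter keys themselves; per-path data is kept for 1 hour,
        // the cluster-wide minute totals for 25 hours so a 24-hour peak can be read.
        redisTemplate.expire(pathKey, 1, TimeUnit.HOURS);
        redisTemplate.expire(totalKey, 25, TimeUnit.HOURS);

        if ("error".equals(status)) {
            String errorKey = "qps:minute:" + minute + ":error";
            redisTemplate.opsForValue().increment(errorKey);
            redisTemplate.expire(errorKey, 25, TimeUnit.HOURS);
        }
    }

    /**
     * Current QPS for one endpoint, computed in real time.
     */
    public double getRealTimeQPS(String path, String method) {
        LocalDateTime now = LocalDateTime.now();
        String key = String.format("qps:path:%s:method:%s:%s:count", path, method, now.format(MINUTE));
        String value = redisTemplate.opsForValue().get(key);
        if (value == null) {
            return 0.0;
        }
        // The current minute is still in progress, so divide by the seconds
        // elapsed so far rather than by 60.
        int elapsedSeconds = Math.max(1, now.getSecond());
        return Long.parseLong(value) / (double) elapsedSeconds;
    }

    private String getCurrentMinute() {
        return LocalDateTime.now().format(MINUTE);
    }

    private HttpServletRequest currentRequest() {
        RequestAttributes attributes = RequestContextHolder.getRequestAttributes();
        if (attributes instanceof ServletRequestAttributes) {
            return ((ServletRequestAttributes) attributes).getRequest();
        }
        return null;
    }

    private String getRequestPath(ProceedingJoinPoint joinPoint) {
        // Read the path from the current request, falling back to the method name.
        HttpServletRequest request = currentRequest();
        return request != null ? request.getRequestURI() : joinPoint.getSignature().getName();
    }

    private String getRequestMethod() {
        HttpServletRequest request = currentRequest();
        return request != null ? request.getMethod() : "UNKNOWN";
    }
}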

2.3 QPS Statistics Service

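// QPSStatisticsService.java
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.stereotype.Service;

import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

@Service
public class QPSStatisticsService {

    private static final DateTimeFormatter MINUTE = DateTimeFormatter.ofPattern("yyyyMMddHHmm");

    // Counters are read as strings and parsed; a RedisTemplate<String, Object>
    // would return Object and could not be assigned to Long.
    @Autowired
    private StringRedisTemplate redisTemplate;

    @Autowired
    private QPSMonitor qpsMonitor;

    /**
     * Get the current QPS.
     */
    public QPSMetrics getCurrentQPS() {
        // In-process estimate from QPSMonitor (this instance only).
        double currentQPS = qpsMonitor.getCurrentQPS();

        // Cluster-wide estimate from Redis.
        double realTimeQPS = getRealTimeQPSFromRedis();

        return QPSMetrics.builder()
                .currentQPS(Math.max(currentQPS, realTimeQPS))
                .timestamp(System.currentTimeMillis())
                .build();
    }

    /**
     * Get the peak QPS over the last N hours.
     * The peak is the busiest minute in the window; dividing an hourly total
     * by 3600 would only give that hour's average. The minute totals must be
     * retained in Redis for at least N hours.
     */
    public QPSMetrics getPeakQPS(int hours) {
        double peakQPS = 0;
        long peakTime = 0;

        for (QPSMetrics minute : getQPSTrend(hours * 60)) {
            if (minute.getCurrentQPS() > peakQPS) {
                peakQPS = minute.getCurrentQPS();
                peakTime = minute.getTimestamp();
            }
        }

        return QPSMetrics.builder()
                .peakQPS(peakQPS)
                .peakTime(peakTime)
                .build();
    }

    /**
     * Get the QPS trend for the last N minutes, newest first.
     * Entry 0 is the minute still in progress, so its value is understated.
     */
    public List<QPSMetrics> getQPSTrend(int minutes) {
        List<QPSMetrics> trends = new ArrayList<>();
        if (minutes <= 0) {
            return trends;
        }

        LocalDateTime now = LocalDateTime.now();
        long nowMillis = System.currentTimeMillis();
        List<String> keys = new ArrayList<>(minutes);
        for (int i = 0; i < minutes; i++) {
            keys.add(minuteCountKey(now.minusMinutes(i)));
        }

        // One MGET round trip instead of one GET per minute.
        List<String> values = redisTemplate.opsForValue().multiGet(keys);

        for (int i = 0; i < minutes; i++) {
            String value = values != null ? values.get(i) : null;
            long count = value != null ? Long.parseLong(value) : 0L;
            trends.add(QPSMetrics.builder()
                    .currentQPS(count / 60.0)
                    .timestamp(nowMillis - i * 60_000L) // long arithmetic avoids int overflow
                    .build());
        }

        return trends;
    }

    private double getRealTimeQPSFromRedis() {
        // Reads the cluster-wide total for the current minute. This assumes the
        // recorder increments qps:minute:<yyyyMMddHHmm>:count on every request,
        // which avoids the blocking KEYS scan over all per-path counters.
        LocalDateTime now = LocalDateTime.now();
        String value = redisTemplate.opsForValue().get(minuteCountKey(now));
        if (value == null) {
            return 0.0;
        }
        // The minute is still in progress: divide by the seconds elapsed so far.
        int elapsedSeconds = Math.max(1, now.getSecond());
        return Long.parseLong(value) / (double) elapsedSeconds;
    }

    private String minuteCountKey(LocalDateTime time) {
        return "qps:minute:" + time.format(MINUTE) + ":count";
    }
}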

III. System Capacity Evaluation

3.1 Single-Machine QPS Capacity Evaluation

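// SingleMachineCapacityEvaluator.java
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

@Service
public class SingleMachineCapacityEvaluator {

    // Target utilization: plan for at most 80% of each resource.
    private static final double TARGET_UTILIZATION = 0.8;

    @Autowired
    private SystemMetricsCollector metricsCollector;

    /**
     * Evaluate single-machine QPS capacity.
     */
    public CapacityResult evaluateCapacity() {
        SystemMetrics metrics = metricsCollector.collect();

        // 1. CPU-based estimate
        double cpuBasedQPS = evaluateBasedOnCPU(metrics);

        // 2. Memory-based estimate
        double memoryBasedQPS = evaluateBasedOnMemory(metrics);

        // 3. Thread-pool-based estimate
        double threadPoolBasedQPS = evaluateBasedOnThreadPool(metrics);

        // 4. Measured stress-test data
        double stressTestQPS = getStressTestQPS();

        // The tightest constraint is the actual capacity.
        double actualCapacity = Math.min(
                Math.min(cpuBasedQPS, memoryBasedQPS),
                Math.min(threadPoolBasedQPS, stressTestQPS)
        );

        // Safety factor (80%). The model-based estimates above already include
        // an 80% target, so the headroom compounds when one of them is the minimum.
        double recommendedCapacity = actualCapacity * 0.8;

        return CapacityResult.builder()
                .cpuBasedQPS(cpuBasedQPS)
                .memoryBasedQPS(memoryBasedQPS)
                .threadPoolBasedQPS(threadPoolBasedQPS)
                .stressTestQPS(stressTestQPS)
                .actualCapacity(actualCapacity)
                .recommendedCapacity(recommendedCapacity)
                .build();
    }

    /**
     * Requests per second that one concurrent slot (core or thread) completes.
     * Without latency data the dimension places no bound on capacity.
     */
    private double throughputPerSlot(double avgResponseTimeMillis) {
        return avgResponseTimeMillis > 0
                ? 1000.0 / avgResponseTimeMillis
                : Double.POSITIVE_INFINITY;
    }

    /**
     * CPU-based estimate.
     */
    private double evaluateBasedOnCPU(SystemMetrics metrics) {
        int cpuCores = metrics.getCpuCores();

        // QPS = cores * target utilization * per-core throughput
        // per-core throughput = 1000 / average response time (ms)
        // The target utilization is used, not the current CPU usage: a lightly
        // loaded machine would otherwise report a tiny capacity.
        // Response time stands in for CPU time per request, which makes the
        // estimate conservative for I/O-heavy services.
        return cpuCores * TARGET_UTILIZATION * throughputPerSlot(metrics.getAvgResponseTime());
    }

    /**
     * Memory-based estimate.
     */
    private double evaluateBasedOnMemory(SystemMetrics metrics) {
        long availableMemory = metrics.getTotalMemory() - metrics.getUsedMemory();

        // Average memory held per in-flight request (tune to the real workload).
        long memoryPerRequest = 1024 * 1024; // 1 MB

        // Keep 20% of the free memory in reserve.
        long usableMemory = (long) (availableMemory * TARGET_UTILIZATION);

        // Memory bounds the number of concurrent requests; converting that to
        // QPS requires the per-request duration (Little's law).
        double maxConcurrentRequests = usableMemory / (double) memoryPerRequest;
        return maxConcurrentRequests * throughputPerSlot(metrics.getAvgResponseTime());
    }

    /**
     * Thread-pool-based estimate.
     */
    private double evaluateBasedOnThreadPool(SystemMetrics metrics) {
        int maxThreads = metrics.getMaxThreads();

        // Thread pool QPS = usable threads * per-thread throughput
        // per-thread throughput = 1000 / average response time (ms)
        int effectiveThreads = (int) (maxThreads * TARGET_UTILIZATION);

        return effectiveThreads * throughputPerSlot(metrics.getAvgResponseTime());
    }

    /**
     * Get the stress-test QPS.
     */
    private double getStressTestQPS() {
        // Read from the stress-test report; a historical value is returned here.
        return 5000; // example value
    }
}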
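
The three model-based estimates all follow from Little's law (concurrency = throughput × latency), so it helps to run them on concrete numbers. The sketch below is a worked example under assumed inputs: 8 cores, 20 ms of CPU time per request, 200 worker threads, 50 ms average response time, 4096 MB of free memory, and 1 MB per in-flight request. The class and method names are illustrative only.

```java
final class SingleMachineEstimate {
    /** QPS the CPUs can sustain at the target utilization. */
    static double cpuBoundQps(int cores, double targetUtilization, double cpuMillisPerRequest) {
        return cores * targetUtilization * 1000.0 / cpuMillisPerRequest;
    }

    /** Little's law: throughput = concurrency / latency. */
    static double concurrencyBoundQps(double maxConcurrentRequests, double avgResponseMillis) {
        return maxConcurrentRequests * 1000.0 / avgResponseMillis;
    }

    public static void main(String[] args) {
        double cpu = cpuBoundQps(8, 0.8, 20.0);
        double threads = concurrencyBoundQps(200 * 0.8, 50.0);
        // 4096 MB free, 80% usable, 1 MB per in-flight request.
        double memory = concurrencyBoundQps(4096 * 0.8 / 1.0, 50.0);

        double capacity = Math.min(cpu, Math.min(threads, memory));
        String bottleneck = capacity == cpu ? "CPU" : capacity == threads ? "threads" : "memory";

        System.out.printf("cpu=%.0f threads=%.0f memory=%.0f -> capacity=%.0f (%s)%n",
                cpu, threads, memory, capacity, bottleneck);
    }
}
```

With these inputs the CPU estimate is by far the smallest, so adding threads or memory would not raise capacity. The model only suggests where to look; the stress test remains the authoritative number.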

3.2 Cluster QPS Capacity Evaluation

// ClusterCapacityEvaluator.java
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

import java.util.ArrayList;
import java.util.List;

@Service
public class ClusterCapacityEvaluator {

    @Autowired
    private SingleMachineCapacityEvaluator singleMachineEvaluator;

    @Autowired
    private MachineRegistry machineRegistry;

    /**
     * Evaluate cluster QPS capacity.
     */
    public ClusterCapacityResult evaluateClusterCapacity() {
        List<MachineInfo> machines = machineRegistry.getAllMachines();

        // 1. Evaluate the capacity of each machine
        List<MachineCapacity> machineCapacities = new ArrayList<>();
        double totalCapacity = 0;

        for (MachineInfo machine : machines) {
            CapacityResult capacity = evaluateMachineCapacity(machine);
            machineCapacities.add(MachineCapacity.builder()
                    .machineId(machine.getId())
                    .capacity(capacity.getRecommendedCapacity())
                    .build());
            totalCapacity += capacity.getRecommendedCapacity();
        }

        // 2. Account for load-balancing efficiency (95%)
        double effectiveCapacity = totalCapacity * 0.95;

        // 3. Account for redundancy (reserve 20% of capacity)
        double availableCapacity = effectiveCapacity * 0.8;

        // 4. Current actual QPS
        double currentQPS = getCurrentClusterQPS(machines);

        // 5. Capacity usage rate; an empty cluster is reported as 0 rather than NaN
        double usageRate = availableCapacity > 0 ? currentQPS / availableCapacity : 0.0;

        return ClusterCapacityResult.builder()
                .totalCapacity(totalCapacity)
                .effectiveCapacity(effectiveCapacity)
                .availableCapacity(availableCapacity)
                .currentQPS(currentQPS)
                .usageRate(usageRate)
                .machineCapacities(machineCapacities)
                .build();
    }

    /**
     * Evaluate the capacity of one machine.
     */
    private CapacityResult evaluateMachineCapacity(MachineInfo machine) {
        // As written this evaluates the local machine for every entry, which is
        // only valid for a homogeneous cluster. For mixed hardware, call the
        // evaluator on the target machine (for example over RPC).
        return singleMachineEvaluator.evaluateCapacity();
    }

    /**
     * Get the current cluster QPS by summing the QPS of every machine.
     */
    private double getCurrentClusterQPS(List<MachineInfo> machines) {
        double totalQPS = 0;

        for (MachineInfo machine : machines) {
            totalQPS += getMachineQPS(machine.getId());
        }

        return totalQPS;
    }

    private double getMachineQPS(String machineId) {
        // Read the machine's QPS from the monitoring system.
        return 0; // placeholder
    }
}

IV. Bottleneck Analysis and Optimization

4.1 System Bottleneck Detection

// BottleneckDetector.java
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

import java.util.ArrayList;
import java.util.List;

@Component
public class BottleneckDetector {

    @Autowired
    private SystemMetricsCollector metricsCollector;

    /**
     * Detect system bottlenecks.
     */
    public List<Bottleneck> detectBottlenecks() {
        List<Bottleneck> bottlenecks = new ArrayList<>();
        SystemMetrics metrics = metricsCollector.collect();

        // 1. CPU bottleneck
        if (metrics.getCpuUsage() > 0.8) {
            bottlenecks.add(Bottleneck.builder()
                    .type(BottleneckType.CPU)
                    .severity(calculateSeverity(metrics.getCpuUsage()))
                    .message(String.format("CPU usage too high: %.1f%%", metrics.getCpuUsage() * 100))
                    .recommendation("Add CPU cores or optimize CPU-intensive code")
                    .build());
        }

        // 2. Memory bottleneck
        double memoryUsage = metrics.getTotalMemory() > 0
                ? (double) metrics.getUsedMemory() / metrics.getTotalMemory()
                : 0.0;
        if (memoryUsage > 0.8) {
            bottlenecks.add(Bottleneck.builder()
                    .type(BottleneckType.MEMORY)
                    .severity(calculateSeverity(memoryUsage))
                    .message(String.format("Memory usage too high: %.1f%%", memoryUsage * 100))
                    .recommendation("Add memory or reduce memory consumption")
                    .build());
        }

        // 3. Thread pool bottleneck
        double threadPoolUsage = metrics.getMaxThreads() > 0
                ? (double) metrics.getActiveThreads() / metrics.getMaxThreads()
                : 0.0;
        if (threadPoolUsage > 0.8) {
            bottlenecks.add(Bottleneck.builder()
                    .type(BottleneckType.THREAD_POOL)
                    .severity(calculateSeverity(threadPoolUsage))
                    .message(String.format("Thread pool usage too high: %.1f%%", threadPoolUsage * 100))
                    .recommendation("Enlarge the thread pool or move work to asynchronous processing")
                    .build());
        }

        // 4. Database connection pool bottleneck
        if (metrics.getDbConnectionPoolUsage() > 0.8) {
            bottlenecks.add(Bottleneck.builder()
                    .type(BottleneckType.DATABASE)
                    .severity(calculateSeverity(metrics.getDbConnectionPoolUsage()))
                    .message("Database connection pool usage too high")
                    .recommendation("Enlarge the connection pool or optimize database queries")
                    .build());
        }

        // 5. Response time bottleneck (average above 1000 ms)
        if (metrics.getAvgResponseTime() > 1000) {
            bottlenecks.add(Bottleneck.builder()
                    .type(BottleneckType.RESPONSE_TIME)
                    .severity(Severity.HIGH)
                    .message("Average response time too long: " + metrics.getAvgResponseTime() + "ms")
                    .recommendation("Optimize business logic or introduce caching")
                    .build());
        }

        return bottlenecks;
    }

    // Callers only pass usage above 0.8, so LOW is a defensive default.
    private Severity calculateSeverity(double usage) {
        if (usage > 0.95) {
            return Severity.CRITICAL;
        } else if (usage > 0.85) {
            return Severity.HIGH;
        } else if (usage > 0.75) {
            return Severity.MEDIUM;
        } else {
            return Severity.LOW;
        }
    }
}

4.2 Capacity Alerting System

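// CapacityAlertService.java
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Service;

import java.util.ArrayList;
import java.util.List;

@Service
public class CapacityAlertService {

    @Autowired
    private ClusterCapacityEvaluator capacityEvaluator;

    @Autowired
    private AlertNotifier alertNotifier;

    /**
     * Capacity alert check.
     */
    @Scheduled(fixedRate = 60000) // check once per minute
    public void checkCapacity() {
        ClusterCapacityResult capacity = capacityEvaluator.evaluateClusterCapacity();

        double usageRate = capacity.getUsageRate();
        String usage = String.format("%.1f%%", usageRate * 100);

        // Alert thresholds
        if (usageRate > 0.9) {
            alertNotifier.sendAlert(AlertLevel.CRITICAL,
                    "Cluster capacity usage is critical: " + usage,
                    "Scale out immediately");
        } else if (usageRate > 0.8) {
            alertNotifier.sendAlert(AlertLevel.WARNING,
                    "Cluster capacity usage is high: " + usage,
                    "Prepare to scale out");
        } else if (usageRate > 0.7) {
            alertNotifier.sendAlert(AlertLevel.INFO,
                    "Cluster capacity usage: " + usage,
                    "Keep an eye on capacity usage");
        }
    }

    /**
     * Forecast when scale-out will be needed.
     */
    public CapacityForecast forecastCapacity() {
        ClusterCapacityResult current = capacityEvaluator.evaluateClusterCapacity();
        List<QPSMetrics> trends = getQPSTrends(24); // hourly samples for the last 24 hours, newest first

        // Estimate the daily growth rate from the first and last samples.
        // This is an endpoint comparison, not a regression.
        double growthRate = calculateGrowthRate(trends);

        // Predict QPS in 7 days with compound daily growth.
        double currentQPS = current.getCurrentQPS();
        double predictedQPS7Days = currentQPS * Math.pow(1 + growthRate, 7);

        // Days until the available capacity is reached, using the same model.
        double availableCapacity = current.getAvailableCapacity();
        int daysUntilFull = calculateDaysUntilFull(currentQPS, availableCapacity, growthRate);

        return CapacityForecast.builder()
                .currentQPS(currentQPS)
                .predictedQPS7Days(predictedQPS7Days)
                .availableCapacity(availableCapacity)
                .daysUntilFull(daysUntilFull)
                .growthRate(growthRate)
                .recommendation(generateRecommendation(daysUntilFull, growthRate))
                .build();
    }

    private int calculateDaysUntilFull(double currentQPS, double availableCapacity, double dailyGrowthRate) {
        // No traffic or no growth: capacity is never reached. Without this guard
        // a zero or negative rate divides by zero or yields negative days, which
        // the recommendation logic would read as an emergency.
        if (currentQPS <= 0) {
            return Integer.MAX_VALUE;
        }
        if (currentQPS >= availableCapacity) {
            return 0;
        }
        if (dailyGrowthRate <= 0) {
            return Integer.MAX_VALUE;
        }
        double days = Math.log(availableCapacity / currentQPS) / Math.log(1 + dailyGrowthRate);
        return (int) Math.min(days, Integer.MAX_VALUE);
    }

    private double calculateGrowthRate(List<QPSMetrics> trends) {
        if (trends.size() < 2) {
            return 0;
        }

        double firstQPS = trends.get(trends.size() - 1).getCurrentQPS(); // oldest
        double lastQPS = trends.get(0).getCurrentQPS();                   // newest

        if (firstQPS <= 0) {
            return 0;
        }

        // N samples span N - 1 hourly intervals; scale the hourly rate to a day.
        // A single day is dominated by the daily traffic cycle, so comparing
        // daily peaks over several weeks gives a more reliable rate.
        double hourlyRate = (lastQPS - firstQPS) / firstQPS / (trends.size() - 1);
        return hourlyRate * 24;
    }

    private List<QPSMetrics> getQPSTrends(int hours) {
        // Read the QPS trend from the monitoring system.
        return new ArrayList<>();
    }

    private String generateRecommendation(int daysUntilFull, double growthRate) {
        if (daysUntilFull < 3) {
            return "Urgent: add machines immediately";
        } else if (daysUntilFull < 7) {
            return "Prepare to scale out: add machines this week";
        } else if (daysUntilFull < 30) {
            return "Watch capacity: plan a scale-out this month";
        } else {
            return "Capacity is sufficient: no scale-out needed for now";
        }
    }
}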
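
How many days remain depends noticeably on whether growth is modeled as linear or compound, so the forecast and the 7-day prediction should use the same model. The following sketch compares the two on assumed numbers (6,000 QPS today, 10,000 QPS available, 5% growth per day); the class and method names are illustrative.

```java
final class GrowthForecast {
    /** Days until capacity if QPS grows by a fixed amount (current * rate) per day. */
    static double daysLinear(double current, double capacity, double dailyRate) {
        if (current <= 0 || dailyRate <= 0) return Double.POSITIVE_INFINITY;
        if (current >= capacity) return 0;
        return (capacity - current) / (current * dailyRate);
    }

    /** Days until capacity if QPS grows by a fixed percentage per day. */
    static double daysCompound(double current, double capacity, double dailyRate) {
        if (current <= 0 || dailyRate <= 0) return Double.POSITIVE_INFINITY;
        if (current >= capacity) return 0;
        // Solve current * (1 + rate)^days = capacity for days.
        return Math.log(capacity / current) / Math.log1p(dailyRate);
    }

    public static void main(String[] args) {
        double current = 6_000, capacity = 10_000, rate = 0.05;
        System.out.printf("linear:   %.1f days%n", daysLinear(current, capacity, rate));
        System.out.printf("compound: %.1f days%n", daysCompound(current, capacity, rate));
    }
}
```

Under these assumptions the linear model gives roughly 13 days and the compound model roughly 10, a gap large enough to move a scale-out from "this month" to "next week". Both functions return infinity for flat or shrinking traffic instead of a negative or undefined number of days.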

V. Production Stress Testing and Validation

5.1 Gray (Partial-Traffic) Stress Testing in Production

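// OnlineStressTest.java
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.LockSupport;

@Service
public class OnlineStressTest {

    @Autowired
    private ThreadPoolExecutor testExecutor;

    /**
     * Gray stress test: only a fraction of the nominal load is generated,
     * so the effective send rate is qps * trafficRatio.
     */
    public StressTestResult performGrayStressTest(int qps, int durationSeconds, double trafficRatio) {
        if (qps <= 0 || durationSeconds <= 0) {
            throw new IllegalArgumentException("qps and durationSeconds must be positive");
        }

        AtomicInteger successCount = new AtomicInteger(0);
        AtomicInteger failCount = new AtomicInteger(0);
        List<Long> responseTimes = Collections.synchronizedList(new ArrayList<>());
        List<Future<?>> pending = new ArrayList<>();

        long startNanos = System.nanoTime();
        long deadline = startNanos + TimeUnit.SECONDS.toNanos(durationSeconds);

        // Pace the sends in nanoseconds. An integer millisecond interval
        // (1000 / qps) becomes 0 above 1000 QPS and disables rate control.
        long intervalNanos = Math.max(1L, 1_000_000_000L / qps);
        long nextSend = startNanos;

        while (nextSend < deadline && !Thread.currentThread().isInterrupted()) {
            // Only send the configured fraction of the traffic.
            if (ThreadLocalRandom.current().nextDouble() < trafficRatio) {
                pending.add(testExecutor.submit(() -> {
                    long requestStart = System.nanoTime();
                    try {
                        sendTestRequest();
                        successCount.incrementAndGet();
                        responseTimes.add((System.nanoTime() - requestStart) / 1_000_000);
                    } catch (Exception e) {
                        failCount.incrementAndGet();
                    }
                }));
            }

            nextSend += intervalNanos;
            long sleepNanos = nextSend - System.nanoTime();
            if (sleepNanos > 0) {
                LockSupport.parkNanos(sleepNanos);
            }
        }

        // Wait for in-flight requests so the statistics cover every request sent.
        for (Future<?> future : pending) {
            try {
                future.get();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            } catch (ExecutionException e) {
                // Failures are already counted inside the task.
            }
        }

        // Aggregate the results.
        long actualDuration = (System.nanoTime() - startNanos) / 1_000_000;
        int total = successCount.get() + failCount.get();
        double actualQPS = actualDuration > 0 ? total / (actualDuration / 1000.0) : 0.0;

        List<Long> sorted;
        synchronized (responseTimes) {
            sorted = new ArrayList<>(responseTimes);
        }
        Collections.sort(sorted);

        return StressTestResult.builder()
                .totalRequests(total)
                .successCount(successCount.get())
                .failCount(failCount.get())
                .actualQPS(actualQPS)
                .avgResponseTime(sorted.stream().mapToLong(Long::longValue).average().orElse(0))
                .p50ResponseTime(percentile(sorted, 50))
                .p95ResponseTime(percentile(sorted, 95))
                .p99ResponseTime(percentile(sorted, 99))
                .duration(actualDuration)
                .build();
    }

    // Nearest-rank percentile; returns 0 when no request succeeded.
    private static long percentile(List<Long> sorted, int p) {
        if (sorted.isEmpty()) {
            return 0;
        }
        int rank = (int) Math.ceil(p * sorted.size() / 100.0);
        return sorted.get(Math.max(rank, 1) - 1);
    }

    private String sendTestRequest() {
        // Send the test request.
        return "OK";
    }
}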
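
Percentile indexing is easy to get off by one on small samples. As a hedged illustration, the sketch below applies the nearest-rank definition (rank = ceil(p/100 × n)) to ten hypothetical latencies and prints the value at index n/2 for comparison; the class name and the data are made up for the example.

```java
import java.util.Arrays;

final class Percentiles {
    /** Nearest-rank percentile of an ascending-sorted array; p is in (0, 100]. */
    static long nearestRank(long[] sorted, int p) {
        if (sorted.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        // Multiply before dividing so whole-number ranks stay exact.
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        long[] latencies = {90, 10, 50, 30, 70, 20, 100, 40, 80, 60};
        Arrays.sort(latencies);

        System.out.println(nearestRank(latencies, 50));      // 50
        System.out.println(latencies[latencies.length / 2]); // 60 (index n/2 is one rank too high)
        System.out.println(nearestRank(latencies, 95));      // 100
    }
}
```

The two approaches converge as the sample grows, but the empty-sample check matters at any size: a test step in which every request fails has no latencies to index.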

5.2 Capacity Validation Tool

// CapacityValidationTool.java
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

import java.util.ArrayList;
import java.util.List;

@Component
public class CapacityValidationTool {

    @Autowired
    private ClusterCapacityEvaluator capacityEvaluator;

    @Autowired
    private OnlineStressTest stressTest;

    /**
     * Validate whether the capacity estimate is accurate.
     */
    public ValidationResult validateCapacity() {
        // 1. Get the estimated capacity
        ClusterCapacityResult capacity = capacityEvaluator.evaluateClusterCapacity();
        double estimatedCapacity = capacity.getAvailableCapacity();

        // 2. Stress test at progressively higher QPS
        int[] testQPS = {
                (int) (estimatedCapacity * 0.5),
                (int) (estimatedCapacity * 0.7),
                (int) (estimatedCapacity * 0.9),
                (int) estimatedCapacity,
                (int) (estimatedCapacity * 1.1)
        };

        List<TestResult> results = new ArrayList<>();

        for (int qps : testQPS) {
            if (qps <= 0) {
                continue; // nothing to test when the estimate is zero
            }

            // Gray stress test (5% of traffic). The generator sends qps * 0.05,
            // so each step exercises the target load level only on the 5% slice
            // of the cluster that receives the test traffic.
            StressTestResult result = stressTest.performGrayStressTest(qps, 60, 0.05);

            double total = result.getTotalRequests();
            // A step that sent nothing is treated as failed rather than as 0/0.
            double errorRate = total > 0 ? result.getFailCount() / total : 1.0;

            TestResult testResult = TestResult.builder()
                    .targetQPS(qps)
                    .actualQPS(result.getActualQPS())
                    .avgResponseTime(result.getAvgResponseTime())
                    .p95ResponseTime(result.getP95ResponseTime())
                    .p99ResponseTime(result.getP99ResponseTime())
                    .errorRate(errorRate)
                    .success(errorRate <= 0.001 && result.getAvgResponseTime() < 1000)
                    .build();

            results.add(testResult);

            // Stop the test if the error rate is too high.
            if (testResult.getErrorRate() > 0.01) {
                break;
            }
        }

        // 3. Find the actual capacity
        double actualCapacity = findActualCapacity(results);

        // 4. Compute the accuracy of the estimate
        double accuracy = estimatedCapacity > 0
                ? 1 - Math.abs(estimatedCapacity - actualCapacity) / estimatedCapacity
                : 0.0;

        return ValidationResult.builder()
                .estimatedCapacity(estimatedCapacity)
                .actualCapacity(actualCapacity)
                .accuracy(accuracy)
                .testResults(results)
                .recommendation(generateValidationRecommendation(accuracy))
                .build();
    }

    // The actual capacity is the highest load level that still passed, not the
    // first level that failed. If every step passes, the true capacity is at
    // least the highest level tested.
    private double findActualCapacity(List<TestResult> results) {
        double lastPassing = 0;
        for (TestResult result : results) {
            if (!result.isSuccess()) {
                break;
            }
            lastPassing = result.getTargetQPS();
        }
        return lastPassing;
    }

    private String generateValidationRecommendation(double accuracy) {
        if (accuracy > 0.9) {
            return "The capacity estimate is accurate and can be trusted";
        } else if (accuracy > 0.7) {
            return "The capacity estimate is roughly accurate; validate it regularly";
        } else {
            return "The capacity estimate deviates significantly; the model needs tuning";
        }
    }
}

VI. Scale-Out Decisions and Execution

6.1 Automated Scale-Out Decisions

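// AutoScalingDecisionEngine.java
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

import java.util.List;

@Service
public class AutoScalingDecisionEngine {

    @Autowired
    private ClusterCapacityEvaluator capacityEvaluator;

    @Autowired
    private CapacityAlertService alertService;

    /**
     * Decide whether scale-out is needed.
     */
    public ScalingDecision makeDecision() {
        ClusterCapacityResult capacity = capacityEvaluator.evaluateClusterCapacity();
        CapacityForecast forecast = alertService.forecastCapacity();

        double currentUsage = capacity.getUsageRate();
        int daysUntilFull = forecast.getDaysUntilFull();

        ScalingDecision decision = ScalingDecision.builder()
                .needScaling(false)
                .scalingType(ScalingType.NONE)
                .recommendedMachineCount(0)
                .build();

        // Decision rules. Each target usage must sit below its trigger threshold,
        // otherwise the computed machine count is zero or negative.
        if (currentUsage > 0.9 || daysUntilFull < 1) {
            // Emergency scale-out
            decision.setNeedScaling(true);
            decision.setScalingType(ScalingType.EMERGENCY);
            decision.setRecommendedMachineCount(calculateMachineCount(capacity, 0.7)); // scale to 70% usage
        } else if (currentUsage > 0.8 || daysUntilFull < 3) {
            // Immediate scale-out
            decision.setNeedScaling(true);
            decision.setScalingType(ScalingType.IMMEDIATE);
            decision.setRecommendedMachineCount(calculateMachineCount(capacity, 0.75)); // scale to 75% usage
        } else if (currentUsage > 0.7 || daysUntilFull < 7) {
            // Planned scale-out. The original target of 80% is above the 70%
            // trigger, so 60% is used here as an assumed replacement.
            decision.setNeedScaling(true);
            decision.setScalingType(ScalingType.PLANNED);
            decision.setRecommendedMachineCount(calculateMachineCount(capacity, 0.6)); // scale to 60% usage
        }

        return decision;
    }

    /**
     * Compute the number of machines to add.
     */
    private int calculateMachineCount(ClusterCapacityResult capacity, double targetUsage) {
        double currentQPS = capacity.getCurrentQPS();
        double targetCapacity = currentQPS / targetUsage;
        double currentCapacity = capacity.getAvailableCapacity();
        double additionalCapacity = targetCapacity - currentCapacity;

        // Use the available capacity per machine, which already includes the
        // load-balancing and redundancy factors. Dividing by the raw per-machine
        // capacity would underestimate the number of machines.
        List<MachineCapacity> machines = capacity.getMachineCapacities();
        double machineCapacity = machines.isEmpty() ? 0 : currentCapacity / machines.size();
        if (machineCapacity <= 0) {
            return 1;
        }

        // A decision to scale always adds at least one machine, including when
        // it was triggered by the forecast while current usage is still low.
        return Math.max(1, (int) Math.ceil(additionalCapacity / machineCapacity));
    }
}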

6.2 Smooth Scale-Out Execution

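// SmoothScalingExecutor.java
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

import java.util.List;

@Service
public class SmoothScalingExecutor {

    // Assumed upper bound for the health-check wait; tune to the deployment.
    private static final long HEALTH_CHECK_TIMEOUT_MS = 5 * 60_000L;

    @Autowired
    private MachineProvisioner machineProvisioner;

    @Autowired
    private LoadBalancer loadBalancer;

    /**
     * Smooth scale-out.
     */
    public void executeScaling(ScalingDecision decision) {
        if (!decision.isNeedScaling()) {
            return;
        }

        int machineCount = decision.getRecommendedMachineCount();

        // Scale out in batches (each batch is at most 50% of the total).
        int batchSize = Math.max(1, machineCount / 2);
        int batches = (int) Math.ceil((double) machineCount / batchSize);

        for (int i = 0; i < batches; i++) {
            int currentBatchSize = Math.min(batchSize, machineCount - i * batchSize);

            // 1. Provision new machines
            List<MachineInfo> newMachines = machineProvisioner.provisionMachines(currentBatchSize);

            // 2. Deploy the application
            for (MachineInfo machine : newMachines) {
                deployApplication(machine);
            }

            // 3. Health check; never put machines into rotation if it did not pass
            if (!waitForHealthy(newMachines)) {
                return;
            }

            // 4. Register with the load balancer. This assumes addBackend does
            //    not send full traffic before the weights are set in step 5.
            for (MachineInfo machine : newMachines) {
                loadBalancer.addBackend(machine);
            }

            // 5. Warm up (increase traffic gradually)
            gradualTrafficIncrease(newMachines);

            // Wait between batches to avoid applying too much change at once.
            if (i < batches - 1) {
                try {
                    Thread.sleep(60000); // wait 1 minute
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
    }

    private void deployApplication(MachineInfo machine) {
        // Application deployment logic.
    }

    /**
     * Wait until every machine passes its health check.
     * Returns false on timeout or interruption.
     */
    private boolean waitForHealthy(List<MachineInfo> machines) {
        long deadline = System.currentTimeMillis() + HEALTH_CHECK_TIMEOUT_MS;
        for (MachineInfo machine : machines) {
            while (!isHealthy(machine)) {
                if (System.currentTimeMillis() > deadline) {
                    return false;
                }
                try {
                    Thread.sleep(5000);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return true;
    }

    private boolean isHealthy(MachineInfo machine) {
        // Health check.
        return true;
    }

    private void gradualTrafficIncrease(List<MachineInfo> machines) {
        // Increase traffic step by step: 10% -> 30% -> 50% -> 100%
        int[] trafficRatios = {10, 30, 50, 100};

        for (int ratio : trafficRatios) {
            for (MachineInfo machine : machines) {
                loadBalancer.setTrafficWeight(machine, ratio);
            }

            try {
                Thread.sleep(30000); // hold each stage for 30 seconds
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }
}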
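
The batch arithmetic in the executor is easiest to check in isolation. The sketch below pulls the same "at most half per batch" rule into a pure function and prints the resulting plan for a few machine counts; `BatchPlan` is an illustrative name, not part of the classes above.

```java
import java.util.Arrays;

final class BatchPlan {
    /** Split machineCount into batches of at most half the total (minimum 1 per batch). */
    static int[] split(int machineCount) {
        if (machineCount <= 0) {
            return new int[0];
        }
        int batchSize = Math.max(1, machineCount / 2);
        int batches = (machineCount + batchSize - 1) / batchSize; // integer ceiling
        int[] sizes = new int[batches];
        for (int i = 0; i < batches; i++) {
            sizes[i] = Math.min(batchSize, machineCount - i * batchSize);
        }
        return sizes;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split(1))); // [1]
        System.out.println(Arrays.toString(split(4))); // [2, 2]
        System.out.println(Arrays.toString(split(5))); // [2, 2, 1]
        System.out.println(Arrays.toString(split(7))); // [3, 3, 1]
    }
}
```

An odd count produces a small trailing batch. With a one-minute pause between batches plus the warm-up stages, the total rollout time grows with the number of batches, which is worth factoring into an emergency scale-out.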

VII. QPS Monitoring Dashboard

7.1 Real-Time QPS Monitoring API

// QPSDashboardController.java
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

import java.util.List;

@RestController
@RequestMapping("/api/qps")
public class QPSDashboardController {

    @Autowired
    private QPSStatisticsService qpsStatisticsService;

    @Autowired
    private ClusterCapacityEvaluator capacityEvaluator;

    @Autowired
    private BottleneckDetector bottleneckDetector;

    // Used by /forecast; the field was missing from the original listing.
    @Autowired
    private CapacityAlertService alertService;

    /**
     * Get the current QPS.
     */
    @GetMapping("/current")
    public QPSMetrics getCurrentQPS() {
        return qpsStatisticsService.getCurrentQPS();
    }

    /**
     * Get the peak QPS.
     */
    @GetMapping("/peak")
    public QPSMetrics getPeakQPS(@RequestParam(defaultValue = "24") int hours) {
        return qpsStatisticsService.getPeakQPS(hours);
    }

    /**
     * Get the QPS trend.
     */
    @GetMapping("/trend")
    public List<QPSMetrics> getQPSTrend(@RequestParam(defaultValue = "60") int minutes) {
        return qpsStatisticsService.getQPSTrend(minutes);
    }

    /**
     * Get capacity information.
     */
    @GetMapping("/capacity")
    public ClusterCapacityResult getCapacity() {
        return capacityEvaluator.evaluateClusterCapacity();
    }

    /**
     * Get bottleneck information.
     */
    @GetMapping("/bottlenecks")
    public List<Bottleneck> getBottlenecks() {
        return bottleneckDetector.detectBottlenecks();
    }

    /**
     * Get the capacity forecast.
     */
    @GetMapping("/forecast")
    public CapacityForecast getForecast() {
        return alertService.forecastCapacity();
    }
}

7.2 Monitoring Data Model

// Data model (each public class goes in its own file)
import lombok.Builder;
import lombok.Data;

import java.util.List;

@Data
@Builder
public class QPSMetrics {
    private double currentQPS;
    private double peakQPS;
    private Long timestamp;
    private Long peakTime;
}

@Data
@Builder
public class CapacityResult {
    private double cpuBasedQPS;
    private double memoryBasedQPS;
    private double threadPoolBasedQPS;
    private double stressTestQPS;
    private double actualCapacity;
    private double recommendedCapacity;
}

@Data
@Builder
public class ClusterCapacityResult {
    private double totalCapacity;
    private double effectiveCapacity;
    private double availableCapacity;
    private double currentQPS;
    private double usageRate;
    private List<MachineCapacity> machineCapacities;
}

@Data
@Builder
public class Bottleneck {
    private BottleneckType type;
    private Severity severity;
    private String message;
    private String recommendation;
}

enum BottleneckType {
    CPU, MEMORY, THREAD_POOL, DATABASE, RESPONSE_TIME, NETWORK
}

enum Severity {
    LOW, MEDIUM, HIGH, CRITICAL
}

VIII. Best Practices Summary

8.1 QPS Evaluation Workflow

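Standard QPS evaluation workflow:
1. Monitor current QPS:
- Deploy a QPS monitoring system
- Collect real-time QPS data
- Analyze the QPS trend

2. Evaluate single-machine capacity:
- CPU capacity estimate
- Memory capacity estimate
- Thread pool capacity estimate
- Stress-test verification

3. Evaluate cluster capacity:
- Sum the single-machine capacities
- Account for load-balancing efficiency
- Account for redundancy

4. Analyze bottlenecks:
- Detect system bottlenecks
- Analyze their causes
- Draw up an optimization plan

5. Plan capacity:
- Forecast future QPS growth
- Compute when scale-out will be needed
- Draw up a scale-out plan

6. Optimize continuously:
- Validate the capacity estimate regularly
- Refine the evaluation model
- Adjust the scale-out strategy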

8.2 Capacity Evaluation Formulas

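Capacity evaluation formulas (response time in milliseconds):
Single-machine QPS capacity:
CPU capacity = CPU cores * 0.8 * (1000 / average response time)
Memory capacity = (available memory * 0.8 / memory per request) * (1000 / average response time)
Thread pool capacity = max threads * 0.8 * (1000 / average response time)
Actual capacity = min(CPU capacity, memory capacity, thread pool capacity, stress-test QPS)

Cluster QPS capacity:
Theoretical capacity = single-machine capacity * number of machines
Effective capacity = theoretical capacity * 0.95 (load-balancing efficiency)
Available capacity = effective capacity * 0.8 (20% reserved for redundancy)

Scale-out decision:
Scale-out needed = current QPS / available capacity > 0.8
Target capacity = current QPS / target usage
Machines to add = ceil((target capacity - available capacity) / (single-machine capacity * 0.95 * 0.8))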
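
To make the cluster formulas concrete, here is a small worked example under assumed numbers: 10 machines at 1,000 QPS each, 6,500 QPS of current traffic, and a target usage of 70% after scale-out. The class `ClusterSizing` and its constants are illustrative; the per-machine figure is scaled by the same 0.95 and 0.8 factors as the cluster total so that both sides of the division use the same unit.

```java
final class ClusterSizing {
    static final double LB_EFFICIENCY = 0.95; // load-balancing efficiency
    static final double HEADROOM = 0.8;       // 20% reserved for redundancy

    static double availableCapacity(double perMachineQps, int machines) {
        return perMachineQps * machines * LB_EFFICIENCY * HEADROOM;
    }

    static int additionalMachines(double currentQps, double perMachineQps,
                                  int machines, double targetUsage) {
        double targetCapacity = currentQps / targetUsage;
        double shortfall = targetCapacity - availableCapacity(perMachineQps, machines);
        if (shortfall <= 0) {
            return 0; // already at or below the target usage
        }
        double perMachineAvailable = perMachineQps * LB_EFFICIENCY * HEADROOM;
        return (int) Math.ceil(shortfall / perMachineAvailable);
    }

    public static void main(String[] args) {
        double available = availableCapacity(1000, 10);
        int extra = additionalMachines(6500, 1000, 10, 0.7);
        System.out.printf("available=%.0f usage=%.3f add=%d%n",
                available, 6500 / available, extra);
        System.out.printf("usage after scale-out=%.3f%n",
                6500 / availableCapacity(1000, 10 + extra));
    }
}
```

In this example the cluster offers about 7,600 QPS of available capacity, usage is about 86%, and three more machines bring it down to roughly 66%. Dividing the shortfall by the raw 1,000 QPS per machine would suggest only two machines and leave usage above the 70% target.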

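8.3 Architect-Level Recommendations

  1. Build a complete monitoring system: monitor QPS, response time, error rate, and other key metrics in real time
  2. Evaluate capacity regularly: assess system capacity once a month and forecast future demand
  3. Set up capacity alerting: raise an alert promptly when capacity usage exceeds 80%
  4. Validate with gray stress tests: use partial-traffic stress tests to check the accuracy of the capacity estimate
  5. Scale out smoothly: add machines in batches and ramp traffic up gradually to avoid system jitter
  6. Keep refining the evaluation model: use real data to improve the accuracy of the capacity estimate over time

With the approach above, you can estimate the QPS capacity of a production system with reasonable accuracy, plan scale-out sensibly, and keep the system running stably.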