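Episode 351. Online QPS Capacity Assessment and Architecture in Practice: Real System Capacity Analysis, Capacity Evaluation Methods, and a Complete QPS Planning Solution for High-Concurrency Systems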
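1. Scenario Analysis

1.1 Core Questions in Online QPS Assessment

```text
Core questions an architect faces:

1. How much QPS can the system actually sustain?
   - What is the current production QPS?
   - What is the peak QPS?
   - What is the system's maximum QPS?
2. How can capacity be evaluated accurately?
   - Per-machine QPS ceiling
   - Cluster QPS ceiling
   - Where is the bottleneck?
3. How should scale-out be planned?
   - When is scale-out needed?
   - How many machines are needed?
   - How can machines be added smoothly?
```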
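1.2 Methodology for Online QPS Assessment

```text
Evaluation dimensions:

Production monitoring data:
  - Current QPS
  - Peak QPS
  - Average response time
  - Error rate
Capacity testing:
  - Single-machine load test
  - Cluster load test
  - Limit (breaking-point) load test
Bottleneck analysis:
  - CPU utilization
  - Memory utilization
  - Thread pool state
  - Database connection pool
  - Network I/O
```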
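2. Implementing Online QPS Monitoring

2.1 QPS Monitoring with Prometheus

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Component;

import java.util.concurrent.TimeUnit;

@Component
public class QPSMonitor {

    private static final String REQUEST_COUNTER = "http.requests.total";

    private final MeterRegistry meterRegistry;
    private final Timer responseTimer;

    // Previous sample, used to turn the cumulative counter into a rate
    private double lastCount = 0;
    private long lastSampleNanos = System.nanoTime();

    public QPSMonitor(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
        this.responseTimer = Timer.builder("http.request.duration")
                .description("HTTP request duration")
                .register(meterRegistry);
    }

    public void recordRequest(String path, String method, long duration) {
        // Tags are fixed when a meter is registered: Counter.increment() does not take tags.
        // register() returns the existing counter for the same name + tags.
        // Raw URIs containing IDs can explode tag cardinality; prefer route templates.
        Counter.builder(REQUEST_COUNTER)
                .description("Total HTTP requests")
                .tags("path", path, "method", method, "status", "success")
                .register(meterRegistry)
                .increment();
        responseTimer.record(duration, TimeUnit.MILLISECONDS);
    }

    /**
     * Average QPS since the previous call (since startup on the first call).
     * Counter.count() is cumulative, so returning it directly would report
     * the total number of requests, not QPS. Stateful: use a single caller.
     */
    public synchronized double getCurrentQPS() {
        double total = meterRegistry.find(REQUEST_COUNTER).counters().stream()
                .mapToDouble(Counter::count)
                .sum();
        long now = System.nanoTime();
        double elapsedSeconds = (now - lastSampleNanos) / 1_000_000_000.0;
        double qps = elapsedSeconds > 0 ? (total - lastCount) / elapsedSeconds : 0.0;
        lastCount = total;
        lastSampleNanos = now;
        return qps;
    }

    /**
     * Peak QPS over the last hour. rate(...[1h]) would be the hourly average;
     * the peak needs a subquery over short windows. Assumes Micrometer exports
     * the counter to Prometheus as http_requests_total.
     */
    public double getPeakQPS() {
        return queryPrometheus("max_over_time(sum(rate(http_requests_total[1m]))[1h:1m])");
    }

    // Placeholder: call the Prometheus HTTP API (GET /api/v1/query) and parse the scalar result
    private double queryPrometheus(String promQl) {
        throw new UnsupportedOperationException("Prometheus client not implemented: " + promQl);
    }
}
```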
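Micrometer's `Counter.count()` is a running total, so QPS has to be derived from it as a rate. When only an in-process figure is needed (for example, the local part of `getCurrentQPS()` without a round trip to Prometheus), a per-second ring buffer is one common approach. The sketch below is a hypothetical `SlidingWindowQpsCounter`, not part of Micrometer or the article's code. The caller passes the timestamp explicitly (normally `System.currentTimeMillis()`) so the behavior is easy to check, and plain `synchronized` methods are used for simplicity rather than a lock-free design.

```java
import java.util.Arrays;

/**
 * Hypothetical helper: counts requests in one-second buckets and reports
 * the average QPS over the last N complete seconds.
 */
final class SlidingWindowQpsCounter {
    private final int windowSeconds;
    private final long[] counts;   // requests per bucket
    private final long[] seconds;  // epoch second each bucket currently holds

    SlidingWindowQpsCounter(int windowSeconds) {
        if (windowSeconds <= 0) {
            throw new IllegalArgumentException("windowSeconds must be > 0");
        }
        this.windowSeconds = windowSeconds;
        // One extra bucket for the second that is still being filled
        this.counts = new long[windowSeconds + 1];
        this.seconds = new long[windowSeconds + 1];
        Arrays.fill(this.seconds, -1L);
    }

    /** Records one request that arrived at nowMillis (epoch milliseconds). */
    synchronized void record(long nowMillis) {
        long second = nowMillis / 1000;
        int idx = (int) (second % counts.length);
        if (seconds[idx] != second) { // bucket still holds an older second: recycle it
            seconds[idx] = second;
            counts[idx] = 0;
        }
        counts[idx]++;
    }

    /** Average QPS over the windowSeconds complete seconds before nowMillis. */
    synchronized double qps(long nowMillis) {
        long current = nowMillis / 1000;
        long total = 0;
        for (int i = 0; i < counts.length; i++) {
            if (seconds[i] >= current - windowSeconds && seconds[i] < current) {
                total += counts[i];
            }
        }
        return (double) total / windowSeconds;
    }
}
```

Only complete seconds are counted, so the second still in progress never drags the average down. A longer window gives a smoother figure; a shorter one reacts faster to spikes.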
2.2 Request Monitoring with Spring AOP

```java
import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.stereotype.Component;
import org.springframework.web.context.request.RequestAttributes;
import org.springframework.web.context.request.RequestContextHolder;
import org.springframework.web.context.request.ServletRequestAttributes;

import javax.servlet.http.HttpServletRequest; // jakarta.servlet.http on Spring Boot 3
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.concurrent.TimeUnit;

@Aspect
@Component
public class QPSAspect {

    private static final DateTimeFormatter MINUTE_FORMAT = DateTimeFormatter.ofPattern("yyyyMMddHHmm");

    @Autowired
    private QPSMonitor qpsMonitor;

    // String serializers: INCR and GET operate on plain integer strings
    @Autowired
    private StringRedisTemplate redisTemplate;

    @Around("execution(* com.example.controller..*(..))")
    public Object monitorRequest(ProceedingJoinPoint joinPoint) throws Throwable {
        long startTime = System.currentTimeMillis();
        String path = getRequestPath(joinPoint);
        String method = getRequestMethod();
        try {
            Object result = joinPoint.proceed();
            qpsMonitor.recordRequest(path, method, System.currentTimeMillis() - startTime);
            recordToRedis(path, method, "success");
            return result;
        } catch (Throwable e) {
            recordToRedis(path, method, "error");
            throw e;
        }
    }

    private void recordToRedis(String path, String method, String status) {
        String minute = LocalDateTime.now().format(MINUTE_FORMAT);
        // Per-endpoint counter for this minute
        incrementWithTtl(String.format("qps:path:%s:method:%s:%s:count", path, method, minute));
        // Cluster-wide counter for this minute, read by QPSStatisticsService
        incrementWithTtl("qps:minute:" + minute + ":count");
        if ("error".equals(status)) {
            incrementWithTtl(String.format("qps:path:%s:method:%s:%s:error", path, method, minute));
        }
    }

    // INCR, and set the TTL on the counter key itself when it is created.
    // 25 hours keeps enough minute buckets for a 24-hour peak query.
    private void incrementWithTtl(String key) {
        Long value = redisTemplate.opsForValue().increment(key);
        if (value != null && value == 1L) {
            redisTemplate.expire(key, 25, TimeUnit.HOURS);
        }
    }

    /** QPS of one endpoint, averaged over the last complete minute. */
    public double getRealTimeQPS(String path, String method) {
        String minute = LocalDateTime.now().minusMinutes(1).format(MINUTE_FORMAT);
        String count = redisTemplate.opsForValue()
                .get(String.format("qps:path:%s:method:%s:%s:count", path, method, minute));
        return count != null ? Long.parseLong(count) / 60.0 : 0.0;
    }

    private String getRequestPath(ProceedingJoinPoint joinPoint) {
        HttpServletRequest request = currentRequest();
        return request != null ? request.getRequestURI() : joinPoint.getSignature().getName();
    }

    private String getRequestMethod() {
        HttpServletRequest request = currentRequest();
        return request != null ? request.getMethod() : "UNKNOWN";
    }

    private HttpServletRequest currentRequest() {
        RequestAttributes attributes = RequestContextHolder.getRequestAttributes();
        return attributes instanceof ServletRequestAttributes
                ? ((ServletRequestAttributes) attributes).getRequest()
                : null;
    }
}
```
2.3 QPS Statistics Service

```java
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.stereotype.Service;

import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

@Service
public class QPSStatisticsService {

    private static final DateTimeFormatter MINUTE_FORMAT = DateTimeFormatter.ofPattern("yyyyMMddHHmm");

    // The counters are written by QPSAspect as plain integer strings
    @Autowired
    private StringRedisTemplate redisTemplate;

    @Autowired
    private QPSMonitor qpsMonitor;

    public QPSMetrics getCurrentQPS() {
        double localQPS = qpsMonitor.getCurrentQPS();      // this instance only
        double clusterQPS = getRealTimeQPSFromRedis();     // all instances, last complete minute
        return QPSMetrics.builder()
                .currentQPS(Math.max(localQPS, clusterQPS))
                .timestamp(System.currentTimeMillis())
                .build();
    }

    /** Peak = highest one-minute average over the last N hours; an hourly average hides spikes. */
    public QPSMetrics getPeakQPS(int hours) {
        List<Long> counts = readMinuteCounts(hours * 60);
        double peakQPS = 0;
        long peakTime = 0;
        long now = System.currentTimeMillis();
        for (int i = 0; i < counts.size(); i++) {
            double qps = counts.get(i) / 60.0;
            if (qps > peakQPS) {
                peakQPS = qps;
                peakTime = now - (i + 1) * 60_000L; // long arithmetic: no int overflow
            }
        }
        return QPSMetrics.builder().peakQPS(peakQPS).peakTime(peakTime).build();
    }

    /** One point per complete minute, newest first. */
    public List<QPSMetrics> getQPSTrend(int minutes) {
        List<Long> counts = readMinuteCounts(minutes);
        List<QPSMetrics> trends = new ArrayList<>();
        long now = System.currentTimeMillis();
        for (int i = 0; i < counts.size(); i++) {
            trends.add(QPSMetrics.builder()
                    .currentQPS(counts.get(i) / 60.0)
                    .timestamp(now - (i + 1) * 60_000L)
                    .build());
        }
        return trends;
    }

    // Reads the cluster-wide counter directly; KEYS is O(N) and blocks Redis
    private double getRealTimeQPSFromRedis() {
        return readMinuteCounts(1).get(0) / 60.0;
    }

    // Counts of the `minutes` complete minutes before now, newest first; missing keys count as 0.
    // The current minute is skipped because it is still being filled.
    private List<Long> readMinuteCounts(int minutes) {
        List<Long> counts = new ArrayList<>();
        if (minutes <= 0) {
            return counts;
        }
        LocalDateTime now = LocalDateTime.now();
        List<String> keys = new ArrayList<>(minutes);
        for (int i = 1; i <= minutes; i++) {
            keys.add("qps:minute:" + now.minusMinutes(i).format(MINUTE_FORMAT) + ":count");
        }
        List<String> values = redisTemplate.opsForValue().multiGet(keys); // one round trip
        for (int i = 0; i < minutes; i++) {
            String value = values != null ? values.get(i) : null;
            counts.add(value != null ? Long.parseLong(value) : 0L);
        }
        return counts;
    }
}
```
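Why `getPeakQPS` scans one-minute buckets instead of hourly totals is easiest to see with numbers. The hypothetical helper below works on a list of per-minute request counts (the values stored under `qps:minute:<yyyyMMddHHmm>:count`). With 59 quiet minutes of 600 requests and one spike of 6,000, the hourly average is only 11.5 QPS while the peak minute averages 100 QPS.

```java
import java.util.List;

/** Hypothetical helper: average vs. peak QPS from per-minute request counts. */
final class MinuteBucketStats {
    private MinuteBucketStats() {}

    /** Average QPS across all buckets (each bucket is 60 seconds). */
    static double averageQps(List<Long> minuteCounts) {
        if (minuteCounts.isEmpty()) {
            return 0.0;
        }
        long total = 0;
        for (long count : minuteCounts) {
            total += count;
        }
        return total / (minuteCounts.size() * 60.0);
    }

    /** Highest one-minute average QPS. */
    static double peakQps(List<Long> minuteCounts) {
        long max = 0;
        for (long count : minuteCounts) {
            max = Math.max(max, count);
        }
        return max / 60.0;
    }
}
```

Sizing against the hourly average in this example would under-provision by almost 9x for the spike minute, which is why capacity checks should compare against the short-window peak.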
3. System Capacity Evaluation

3.1 Single-Machine QPS Capacity Evaluation

```java
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

@Service
public class SingleMachineCapacityEvaluator {

    // Assumed constants; calibrate them against your own load tests
    private static final double TARGET_UTILIZATION = 0.8;              // use at most 80% of CPU / threads / memory
    private static final long MEMORY_PER_REQUEST_BYTES = 1024 * 1024;  // assumed 1 MB per in-flight request
    private static final double SAFETY_FACTOR = 0.8;                   // recommended = 80% of the bottleneck

    @Autowired
    private SystemMetricsCollector metricsCollector;

    public CapacityResult evaluateCapacity() {
        SystemMetrics metrics = metricsCollector.collect();
        double cpuBasedQPS = evaluateBasedOnCPU(metrics);
        double memoryBasedQPS = evaluateBasedOnMemory(metrics);
        double threadPoolBasedQPS = evaluateBasedOnThreadPool(metrics);
        double stressTestQPS = getStressTestQPS();
        // The tightest bound is the bottleneck
        double actualCapacity = Math.min(
                Math.min(cpuBasedQPS, memoryBasedQPS),
                Math.min(threadPoolBasedQPS, stressTestQPS));
        double recommendedCapacity = actualCapacity * SAFETY_FACTOR;
        return CapacityResult.builder()
                .cpuBasedQPS(cpuBasedQPS)
                .memoryBasedQPS(memoryBasedQPS)
                .threadPoolBasedQPS(threadPoolBasedQPS)
                .stressTestQPS(stressTestQPS)
                .actualCapacity(actualCapacity)
                .recommendedCapacity(recommendedCapacity)
                .build();
    }

    // cores * 0.8 * (1000 / RT). Treats the whole response time as CPU time, i.e. assumes
    // CPU-bound requests; for I/O-heavy requests this underestimates. Uses the fixed 0.8
    // target rather than the current CPU usage, which would shrink the estimate when idle.
    private double evaluateBasedOnCPU(SystemMetrics metrics) {
        double avgResponseTime = requirePositive(metrics.getAvgResponseTime());
        return metrics.getCpuCores() * TARGET_UTILIZATION * (1000.0 / avgResponseTime);
    }

    // Memory limits how many requests can be in flight, not QPS directly.
    // Little's Law turns that concurrency into QPS: QPS = concurrency * 1000 / RT(ms).
    private double evaluateBasedOnMemory(SystemMetrics metrics) {
        long availableMemory = metrics.getTotalMemory() - metrics.getUsedMemory();
        double usableMemory = availableMemory * TARGET_UTILIZATION;
        double maxConcurrentRequests = usableMemory / MEMORY_PER_REQUEST_BYTES;
        return maxConcurrentRequests * (1000.0 / requirePositive(metrics.getAvgResponseTime()));
    }

    // Little's Law: each busy thread completes 1000 / RT(ms) requests per second
    private double evaluateBasedOnThreadPool(SystemMetrics metrics) {
        double effectiveThreads = metrics.getMaxThreads() * TARGET_UTILIZATION;
        return effectiveThreads * (1000.0 / requirePositive(metrics.getAvgResponseTime()));
    }

    // Placeholder: return the measured limit from the most recent load test
    private double getStressTestQPS() {
        return 5000;
    }

    private static double requirePositive(double avgResponseTimeMs) {
        if (avgResponseTimeMs <= 0) {
            throw new IllegalStateException("average response time must be > 0 ms");
        }
        return avgResponseTimeMs;
    }
}
```
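The thread-pool and memory bounds above are both applications of Little's Law (L = λW): a resource that can hold at most L requests at once, each for W seconds, completes at most λ = L / W requests per second. The hypothetical `CapacityMath` helper below reproduces the three bounds as pure functions. The inputs used in the check (8 cores, 200 threads, 2 GiB usable heap, 1 MiB per in-flight request, 50 ms response time, 20 ms CPU time per request) are illustrative assumptions, not measurements. Taking CPU time per request as a separate input is a refinement over the formula in the text: if it is known (for example from profiling), it gives a tighter CPU bound than the full response time.

```java
/** Hypothetical helper: per-machine capacity bounds via Little's Law (QPS = concurrency / latency). */
final class CapacityMath {
    private CapacityMath() {}

    /** QPS a resource sustains when it holds `concurrency` requests, each for `latencyMs`. */
    static double qpsFromConcurrency(double concurrency, double latencyMs) {
        if (latencyMs <= 0) {
            throw new IllegalArgumentException("latencyMs must be > 0");
        }
        return concurrency * 1000.0 / latencyMs;
    }

    /** CPU bound: busy cores / CPU time per request (response time if CPU time is unknown). */
    static double cpuBound(int cores, double targetUtilization, double cpuMsPerRequest) {
        return qpsFromConcurrency(cores * targetUtilization, cpuMsPerRequest);
    }

    /** Thread-pool bound: busy threads / response time. */
    static double threadPoolBound(int maxThreads, double targetUtilization, double responseTimeMs) {
        return qpsFromConcurrency(maxThreads * targetUtilization, responseTimeMs);
    }

    /** Memory bound: in-flight requests that fit in memory / response time. */
    static double memoryBound(long usableBytes, long bytesPerRequest, double responseTimeMs) {
        return qpsFromConcurrency((double) usableBytes / bytesPerRequest, responseTimeMs);
    }
}
```

With these inputs the bounds come out at 320 QPS (CPU), 3,200 QPS (threads) and 40,960 QPS (memory), so CPU is the bottleneck and the recommended figure is 256 QPS after the 0.8 safety factor. Plugging the 50 ms response time in as CPU time instead would drop the CPU bound to 128 QPS, which is the underestimate the simpler formula gives for I/O-heavy requests.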
3.2 Cluster QPS Capacity Evaluation

```java
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

import java.util.ArrayList;
import java.util.List;

@Service
public class ClusterCapacityEvaluator {

    private static final double LB_EFFICIENCY = 0.95; // load-balancing efficiency
    private static final double HEADROOM = 0.8;       // keep 20% redundancy

    @Autowired
    private SingleMachineCapacityEvaluator singleMachineEvaluator;

    @Autowired
    private MachineRegistry machineRegistry;

    public ClusterCapacityResult evaluateClusterCapacity() {
        List<MachineInfo> machines = machineRegistry.getAllMachines();
        // evaluateCapacity() measures the local node only, so it is evaluated once and applied
        // to every node. This assumes a homogeneous fleet; mixed node types need per-node metrics.
        CapacityResult perMachine = singleMachineEvaluator.evaluateCapacity();
        List<MachineCapacity> machineCapacities = new ArrayList<>();
        double totalCapacity = 0;
        for (MachineInfo machine : machines) {
            // Sum the actual (not recommended) capacity so the 20% headroom is applied once, below
            double capacity = perMachine.getActualCapacity();
            machineCapacities.add(MachineCapacity.builder()
                    .machineId(machine.getId())
                    .capacity(capacity)
                    .build());
            totalCapacity += capacity;
        }
        double effectiveCapacity = totalCapacity * LB_EFFICIENCY;
        double availableCapacity = effectiveCapacity * HEADROOM;
        double currentQPS = getCurrentClusterQPS(machines);
        double usageRate = availableCapacity > 0 ? currentQPS / availableCapacity : Double.POSITIVE_INFINITY;
        return ClusterCapacityResult.builder()
                .totalCapacity(totalCapacity)
                .effectiveCapacity(effectiveCapacity)
                .availableCapacity(availableCapacity)
                .currentQPS(currentQPS)
                .usageRate(usageRate)
                .machineCapacities(machineCapacities)
                .build();
    }

    private double getCurrentClusterQPS(List<MachineInfo> machines) {
        double totalQPS = 0;
        for (MachineInfo machine : machines) {
            totalQPS += getMachineQPS(machine.getId());
        }
        return totalQPS;
    }

    // Placeholder: e.g. query Prometheus per instance, or read the Redis minute counters
    private double getMachineQPS(String machineId) {
        return 0;
    }
}
```
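The cluster formula is simple arithmetic, but it is easy to apply the headroom twice by summing per-machine figures that already include a safety factor. The hypothetical `ClusterCapacityMath` helper below mirrors the formula (sum, then x0.95 load-balancing efficiency, then x0.8 headroom) and checks it on an assumed fleet of 10 machines at 1,000 QPS each: 10,000 theoretical, 9,500 effective, 7,600 available, so 6,080 QPS of traffic is 80% usage. Feeding in the already-discounted 800 QPS per machine instead yields only 6,080 available, and the same traffic would read as 100% usage.

```java
/** Hypothetical helper mirroring the cluster formula: sum, x0.95 LB efficiency, x0.8 headroom. */
final class ClusterCapacityMath {
    static final double LB_EFFICIENCY = 0.95;
    static final double HEADROOM = 0.8;

    private ClusterCapacityMath() {}

    /** Available cluster capacity from each machine's actual (undiscounted) QPS capacity. */
    static double availableCapacity(double[] perMachineActualQps) {
        double total = 0;
        for (double qps : perMachineActualQps) {
            total += qps;
        }
        return total * LB_EFFICIENCY * HEADROOM;
    }

    /** Share of available capacity in use; infinite when there is no capacity at all. */
    static double usageRate(double currentQps, double availableCapacity) {
        return availableCapacity > 0 ? currentQps / availableCapacity : Double.POSITIVE_INFINITY;
    }
}
```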
4. Bottleneck Analysis and Optimization

4.1 System Bottleneck Detection

```java
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

import java.util.ArrayList;
import java.util.List;

@Component
public class BottleneckDetector {

    private static final double USAGE_THRESHOLD = 0.8;            // flag resources above 80% usage
    private static final double RESPONSE_TIME_THRESHOLD_MS = 1000;

    @Autowired
    private SystemMetricsCollector metricsCollector;

    public List<Bottleneck> detectBottlenecks() {
        List<Bottleneck> bottlenecks = new ArrayList<>();
        SystemMetrics metrics = metricsCollector.collect();

        checkUsage(bottlenecks, BottleneckType.CPU, metrics.getCpuUsage(),
                "CPU usage", "Add CPU cores or optimize CPU-intensive code");
        checkUsage(bottlenecks, BottleneckType.MEMORY,
                ratio(metrics.getUsedMemory(), metrics.getTotalMemory()),
                "Memory usage", "Add memory or reduce memory consumption");
        checkUsage(bottlenecks, BottleneckType.THREAD_POOL,
                ratio(metrics.getActiveThreads(), metrics.getMaxThreads()),
                "Thread pool usage", "Enlarge the thread pool or move work to asynchronous processing");
        checkUsage(bottlenecks, BottleneckType.DATABASE, metrics.getDbConnectionPoolUsage(),
                "Database connection pool usage", "Enlarge the connection pool or optimize database queries");

        if (metrics.getAvgResponseTime() > RESPONSE_TIME_THRESHOLD_MS) {
            bottlenecks.add(Bottleneck.builder()
                    .type(BottleneckType.RESPONSE_TIME)
                    .severity(Severity.HIGH)
                    .message("Average response time too high: " + metrics.getAvgResponseTime() + " ms")
                    .recommendation("Optimize business logic or introduce caching")
                    .build());
        }
        return bottlenecks;
    }

    private void checkUsage(List<Bottleneck> out, BottleneckType type, double usage,
                            String label, String recommendation) {
        if (usage > USAGE_THRESHOLD) {
            out.add(Bottleneck.builder()
                    .type(type)
                    .severity(calculateSeverity(usage))
                    .message(String.format("%s too high: %.1f%%", label, usage * 100))
                    .recommendation(recommendation)
                    .build());
        }
    }

    // Guards against a zero denominator (e.g. metrics not yet available)
    private static double ratio(double used, double max) {
        return max > 0 ? used / max : 0.0;
    }

    private Severity calculateSeverity(double usage) {
        if (usage > 0.95) {
            return Severity.CRITICAL;
        } else if (usage > 0.85) {
            return Severity.HIGH;
        } else if (usage > 0.75) {
            return Severity.MEDIUM;
        }
        return Severity.LOW; // only reached if USAGE_THRESHOLD is lowered below 0.75
    }
}
```
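`SystemMetricsCollector` is referenced throughout but not shown. As a starting point, the JDK's platform MXBeans expose several of its inputs. The sketch below is a hypothetical `JmxMetricsSnapshot`, not the article's collector. `com.sun.management.OperatingSystemMXBean` is a JDK-specific extension (present on HotSpot/OpenJDK), so it is checked with `instanceof`, and `getProcessCpuLoad()` returns a negative value when the load is unavailable. Note the limits: the heap figures describe the JVM heap, not container memory; `getThreadCount()` counts all live JVM threads rather than the web server's worker pool (those numbers come from the servlet container); and database pool usage has to come from the connection-pool library. None of those sources are covered here.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.lang.management.OperatingSystemMXBean;

/** Hypothetical snapshot of JVM-level metrics read from the platform MXBeans. */
final class JmxMetricsSnapshot {
    final int cpuCores;
    final double processCpuLoad; // 0.0-1.0 of the whole machine, negative if unavailable
    final long heapUsed;         // bytes
    final long heapMax;          // bytes, -1 if undefined
    final int liveThreads;       // all live JVM threads

    private JmxMetricsSnapshot(int cpuCores, double processCpuLoad, long heapUsed, long heapMax, int liveThreads) {
        this.cpuCores = cpuCores;
        this.processCpuLoad = processCpuLoad;
        this.heapUsed = heapUsed;
        this.heapMax = heapMax;
        this.liveThreads = liveThreads;
    }

    static JmxMetricsSnapshot collect() {
        int cores = Runtime.getRuntime().availableProcessors();
        double cpuLoad = -1.0;
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.OperatingSystemMXBean) {
            cpuLoad = ((com.sun.management.OperatingSystemMXBean) os).getProcessCpuLoad();
        }
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        int threads = ManagementFactory.getThreadMXBean().getThreadCount();
        return new JmxMetricsSnapshot(cores, cpuLoad, heap.getUsed(), heap.getMax(), threads);
    }

    /** Heap usage as a fraction of the maximum heap, NaN when the maximum is undefined. */
    double heapUsageRatio() {
        return heapMax > 0 ? (double) heapUsed / heapMax : Double.NaN;
    }
}
```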
4.2 Capacity Alerting System

```java
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Service;

import java.util.ArrayList;
import java.util.List;

@Service
public class CapacityAlertService {

    @Autowired
    private ClusterCapacityEvaluator capacityEvaluator;

    @Autowired
    private AlertNotifier alertNotifier;

    @Scheduled(fixedRate = 60000) // every minute; requires @EnableScheduling
    public void checkCapacity() {
        ClusterCapacityResult capacity = capacityEvaluator.evaluateClusterCapacity();
        double usageRate = capacity.getUsageRate();
        String usage = String.format("%.1f%%", usageRate * 100);
        if (usageRate > 0.9) {
            alertNotifier.sendAlert(AlertLevel.CRITICAL,
                    "Cluster capacity usage critical: " + usage, "Scale out immediately");
        } else if (usageRate > 0.8) {
            alertNotifier.sendAlert(AlertLevel.WARNING,
                    "Cluster capacity usage high: " + usage, "Prepare to scale out");
        } else if (usageRate > 0.7) {
            alertNotifier.sendAlert(AlertLevel.INFO,
                    "Cluster capacity usage: " + usage, "Keep an eye on capacity usage");
        }
    }

    public CapacityForecast forecastCapacity() {
        ClusterCapacityResult current = capacityEvaluator.evaluateClusterCapacity();
        // Daily peaks, oldest first: comparing whole days avoids mistaking the normal
        // intra-day traffic curve for growth
        List<Double> dailyPeaks = getDailyPeakQPS(7);
        double growthRate = calculateDailyGrowthRate(dailyPeaks);
        double currentQPS = current.getCurrentQPS();
        double availableCapacity = current.getAvailableCapacity();
        double predictedQPS7Days = currentQPS * Math.pow(1 + growthRate, 7);
        int daysUntilFull = calculateDaysUntilFull(currentQPS, availableCapacity, growthRate);
        return CapacityForecast.builder()
                .currentQPS(currentQPS)
                .predictedQPS7Days(predictedQPS7Days)
                .availableCapacity(availableCapacity)
                .daysUntilFull(daysUntilFull)
                .growthRate(growthRate)
                .recommendation(generateRecommendation(daysUntilFull, growthRate))
                .build();
    }

    // Compound daily growth rate: (last / first)^(1 / (n - 1)) - 1
    private double calculateDailyGrowthRate(List<Double> dailyPeaks) {
        if (dailyPeaks.size() < 2) {
            return 0;
        }
        double first = dailyPeaks.get(0);
        double last = dailyPeaks.get(dailyPeaks.size() - 1);
        if (first <= 0 || last <= 0) {
            return 0;
        }
        return Math.pow(last / first, 1.0 / (dailyPeaks.size() - 1)) - 1;
    }

    // Solves currentQPS * (1 + g)^d = capacity for d, consistent with the compound forecast above
    private int calculateDaysUntilFull(double currentQPS, double capacity, double growthRate) {
        if (currentQPS >= capacity) {
            return 0;
        }
        if (currentQPS <= 0 || growthRate <= 0) {
            return Integer.MAX_VALUE; // no growth: capacity is never reached
        }
        return (int) Math.floor(Math.log(capacity / currentQPS) / Math.log(1 + growthRate));
    }

    // Placeholder: return the peak QPS of each of the last N days, oldest first
    private List<Double> getDailyPeakQPS(int days) {
        return new ArrayList<>();
    }

    private String generateRecommendation(int daysUntilFull, double growthRate) {
        if (daysUntilFull < 3) {
            return "Urgent scale-out: add machines immediately";
        } else if (daysUntilFull < 7) {
            return "Prepare to scale out: add machines this week";
        } else if (daysUntilFull < 30) {
            return "Watch capacity: plan a scale-out this month";
        } else {
            return "Capacity sufficient: no scale-out needed for now";
        }
    }
}
```
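The forecast is easier to trust when the growth rate and the "days until full" figure use the same model. The hypothetical `GrowthForecast` helper below derives a compound daily rate from daily peak QPS (oldest first) and solves the same compound curve for the day capacity is reached. The example series is assumed: peaks of 1,000, 1,100 and 1,210 QPS give 10% per day, and at 1,210 QPS against 2,000 QPS of available capacity there are 5 full days left (1,210 x 1.1^5 is about 1,949, while 1.1^6 gives about 2,144).

```java
/** Hypothetical helper: compound daily growth and days until capacity is reached. */
final class GrowthForecast {
    private GrowthForecast() {}

    /** Compound daily growth rate from daily peak QPS values, oldest first. */
    static double dailyGrowthRate(double[] dailyPeaksOldestFirst) {
        int n = dailyPeaksOldestFirst.length;
        if (n < 2) {
            return 0.0;
        }
        double first = dailyPeaksOldestFirst[0];
        double last = dailyPeaksOldestFirst[n - 1];
        if (first <= 0 || last <= 0) {
            return 0.0;
        }
        return Math.pow(last / first, 1.0 / (n - 1)) - 1.0;
    }

    /** QPS after `days` days of compound growth. */
    static double predict(double currentQps, double dailyRate, int days) {
        return currentQps * Math.pow(1.0 + dailyRate, days);
    }

    /** Whole days that currentQps * (1 + r)^d stays within capacity. */
    static int daysUntilFull(double currentQps, double capacity, double dailyRate) {
        if (currentQps >= capacity) {
            return 0;
        }
        if (currentQps <= 0 || dailyRate <= 0) {
            return Integer.MAX_VALUE; // not growing: never full
        }
        return (int) Math.floor(Math.log(capacity / currentQps) / Math.log(1.0 + dailyRate));
    }
}
```

A linear estimate (remaining capacity divided by today's daily increase) would report a later date than the compound curve under steady percentage growth, which is why both numbers here come from the same exponential model.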
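5. Online Load Testing and Validation

5.1 Online Gray (Canary) Load Testing

```java
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.LockSupport;

@Service
public class OnlineStressTest {

    private static final long DRAIN_TIMEOUT_NANOS = TimeUnit.SECONDS.toNanos(30);

    @Autowired
    private ThreadPoolExecutor testExecutor;

    /**
     * Gray load test. trafficRatio is the share of the cluster under test, so only
     * qps * trafficRatio requests per second are actually sent to the gray slice.
     */
    public StressTestResult performGrayStressTest(int qps, int durationSeconds, double trafficRatio) {
        double sendRate = qps * trafficRatio;
        if (sendRate <= 0 || durationSeconds <= 0) {
            throw new IllegalArgumentException("qps * trafficRatio and durationSeconds must be > 0");
        }
        AtomicInteger successCount = new AtomicInteger();
        AtomicInteger failCount = new AtomicInteger();
        List<Long> responseTimes = Collections.synchronizedList(new ArrayList<>());
        List<Future<?>> inFlight = new ArrayList<>();

        // Fixed-schedule (open-loop) pacing in nanoseconds; whole-millisecond intervals
        // (1000 / qps) would round to 0 above 1000 QPS
        long intervalNanos = Math.max(1L, (long) (1_000_000_000L / sendRate));
        long start = System.nanoTime();
        long end = start + durationSeconds * 1_000_000_000L;
        for (long next = start; next < end && !Thread.currentThread().isInterrupted(); next += intervalNanos) {
            long wait;
            while ((wait = next - System.nanoTime()) > 0) {
                LockSupport.parkNanos(wait);
            }
            try {
                inFlight.add(testExecutor.submit(() -> {
                    long requestStart = System.nanoTime();
                    try {
                        sendTestRequest();
                        responseTimes.add((System.nanoTime() - requestStart) / 1_000_000);
                        successCount.incrementAndGet();
                    } catch (Exception e) {
                        failCount.incrementAndGet();
                    }
                }));
            } catch (RejectedExecutionException e) {
                failCount.incrementAndGet(); // executor saturated: count as a failed request
            }
        }
        long sendNanos = System.nanoTime() - start;

        // Wait for in-flight requests before reading the counters
        long drainDeadline = System.nanoTime() + DRAIN_TIMEOUT_NANOS;
        for (Future<?> future : inFlight) {
            try {
                future.get(Math.max(0, drainDeadline - System.nanoTime()), TimeUnit.NANOSECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            } catch (ExecutionException | TimeoutException e) {
                // Failures are counted inside the task; requests still running are not counted
            }
        }

        List<Long> sorted;
        synchronized (responseTimes) {
            sorted = new ArrayList<>(responseTimes);
        }
        Collections.sort(sorted);
        int total = successCount.get() + failCount.get();
        return StressTestResult.builder()
                .totalRequests(total)
                .successCount(successCount.get())
                .failCount(failCount.get())
                .actualQPS(total / (sendNanos / 1e9))
                .avgResponseTime(sorted.stream().mapToLong(Long::longValue).average().orElse(0))
                .p50ResponseTime(percentile(sorted, 0.50))
                .p95ResponseTime(percentile(sorted, 0.95))
                .p99ResponseTime(percentile(sorted, 0.99))
                .duration(sendNanos / 1_000_000)
                .build();
    }

    // Nearest-rank percentile on a sorted list; 0 when there are no samples
    private static long percentile(List<Long> sorted, double p) {
        if (sorted.isEmpty()) {
            return 0;
        }
        int index = (int) Math.ceil(p * sorted.size()) - 1;
        return sorted.get(Math.max(0, Math.min(index, sorted.size() - 1)));
    }

    // Placeholder: send one request to the gray slice (e.g. tagged as load-test traffic)
    private String sendTestRequest() {
        return "OK";
    }
}
```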
5.2 Capacity Validation Tool

```java
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

import java.util.ArrayList;
import java.util.List;

@Component
public class CapacityValidationTool {

    private static final double[] LOAD_STEPS = {0.5, 0.7, 0.9, 1.0, 1.1}; // fractions of the estimate
    private static final int STEP_DURATION_SECONDS = 60;
    private static final double GRAY_TRAFFIC_RATIO = 0.05;                 // 5% of the cluster
    private static final double MAX_ERROR_RATE = 0.001;                    // 0.1%
    private static final double MAX_AVG_RESPONSE_MS = 1000;

    @Autowired
    private ClusterCapacityEvaluator capacityEvaluator;

    @Autowired
    private OnlineStressTest stressTest;

    public ValidationResult validateCapacity() {
        ClusterCapacityResult capacity = capacityEvaluator.evaluateClusterCapacity();
        double estimatedCapacity = capacity.getAvailableCapacity();
        List<TestResult> results = new ArrayList<>();
        for (double step : LOAD_STEPS) {
            int qps = (int) (estimatedCapacity * step);
            StressTestResult run = stressTest.performGrayStressTest(qps, STEP_DURATION_SECONDS, GRAY_TRAFFIC_RATIO);
            double errorRate = run.getTotalRequests() > 0
                    ? run.getFailCount() / (double) run.getTotalRequests()
                    : 1.0; // no completed requests counts as a failure
            // One pass criterion for every step
            boolean success = errorRate <= MAX_ERROR_RATE && run.getAvgResponseTime() < MAX_AVG_RESPONSE_MS;
            results.add(TestResult.builder()
                    .targetQPS(qps)
                    .actualQPS(run.getActualQPS())
                    .avgResponseTime(run.getAvgResponseTime())
                    .p95ResponseTime(run.getP95ResponseTime())
                    .p99ResponseTime(run.getP99ResponseTime())
                    .errorRate(errorRate)
                    .success(success)
                    .build());
            if (!success) {
                break; // stop at the first failing step instead of pushing production further
            }
        }
        double actualCapacity = findActualCapacity(results);
        double accuracy = estimatedCapacity > 0
                ? 1 - Math.abs(estimatedCapacity - actualCapacity) / estimatedCapacity
                : 0;
        return ValidationResult.builder()
                .estimatedCapacity(estimatedCapacity)
                .actualCapacity(actualCapacity)
                .accuracy(accuracy)
                .testResults(results)
                .recommendation(generateValidationRecommendation(accuracy))
                .build();
    }

    // Highest step that passed: the first failing step is already beyond capacity.
    // If every step passed, this is only a lower bound (1.1x the estimate).
    private double findActualCapacity(List<TestResult> results) {
        double capacity = 0;
        for (TestResult result : results) {
            if (!result.isSuccess()) {
                break;
            }
            capacity = result.getTargetQPS();
        }
        return capacity;
    }

    private String generateValidationRecommendation(double accuracy) {
        if (accuracy > 0.9) {
            return "Capacity estimate is accurate and can be trusted";
        } else if (accuracy > 0.7) {
            return "Capacity estimate is roughly accurate; validate it regularly";
        } else {
            return "Capacity estimate deviates significantly; refine the evaluation model";
        }
    }
}
```
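The stepped run only brackets the capacity between the last passing step and the first failing one, and with steps 10% to 20% apart that gap can be wide. Because every extra probe is a full load test against production, a binary search between the two brackets narrows it in logarithmically many probes. The sketch below is a hypothetical `CapacityBisect`, not part of the article's tool. It assumes pass/fail is monotonic in load (noisy tests can violate this, so repeating probes near the boundary is prudent) and that the caller has already verified that `lo` passes and `hi` fails. In practice the predicate would wrap `performGrayStressTest` plus the pass criterion above.

```java
import java.util.function.IntPredicate;

/** Hypothetical helper: narrows [lastPassing, firstFailing] by binary search. */
final class CapacityBisect {
    private CapacityBisect() {}

    /**
     * @param passes     runs one probe at the given QPS and reports pass/fail
     * @param lo         a QPS level known to pass
     * @param hi         a QPS level known to fail
     * @param resolution stop once hi - lo is at most this many QPS
     * @return the highest level that passed
     */
    static int bisect(IntPredicate passes, int lo, int hi, int resolution) {
        if (lo >= hi || resolution < 1) {
            throw new IllegalArgumentException("need lo < hi and resolution >= 1");
        }
        while (hi - lo > resolution) {
            int mid = lo + (hi - lo) / 2;
            if (passes.test(mid)) {
                lo = mid;
            } else {
                hi = mid;
            }
        }
        return lo;
    }
}
```

Narrowing a 2,000 QPS gap to 50 QPS takes about six probes, roughly six extra minutes at the 60-second step duration used above.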
6. Scale-Out Decisions and Execution

6.1 Automated Scale-Out Decisions

```java
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

import java.util.List;

@Service
public class AutoScalingDecisionEngine {

    @Autowired
    private ClusterCapacityEvaluator capacityEvaluator;

    @Autowired
    private CapacityAlertService alertService;

    public ScalingDecision makeDecision() {
        ClusterCapacityResult capacity = capacityEvaluator.evaluateClusterCapacity();
        CapacityForecast forecast = alertService.forecastCapacity();
        double currentUsage = capacity.getUsageRate();
        int daysUntilFull = forecast.getDaysUntilFull();
        // Size for the larger of today's QPS and the 7-day forecast, so a decision triggered
        // by the forecast alone never computes zero or negative machines
        double demandQPS = Math.max(capacity.getCurrentQPS(), forecast.getPredictedQPS7Days());

        ScalingDecision decision = ScalingDecision.builder()
                .needScaling(false)
                .scalingType(ScalingType.NONE)
                .recommendedMachineCount(0)
                .build();
        if (currentUsage > 0.9 || daysUntilFull < 1) {
            decision.setNeedScaling(true);
            decision.setScalingType(ScalingType.EMERGENCY);
            decision.setRecommendedMachineCount(calculateMachineCount(capacity, demandQPS, 0.7));
        } else if (currentUsage > 0.8 || daysUntilFull < 3) {
            decision.setNeedScaling(true);
            decision.setScalingType(ScalingType.IMMEDIATE);
            decision.setRecommendedMachineCount(calculateMachineCount(capacity, demandQPS, 0.75));
        } else if (currentUsage > 0.7 || daysUntilFull < 7) {
            decision.setNeedScaling(true);
            decision.setScalingType(ScalingType.PLANNED);
            decision.setRecommendedMachineCount(calculateMachineCount(capacity, demandQPS, 0.8));
        }
        return decision;
    }

    private int calculateMachineCount(ClusterCapacityResult capacity, double demandQPS, double targetUsage) {
        List<MachineCapacity> machines = capacity.getMachineCapacities();
        if (machines == null || machines.isEmpty() || capacity.getTotalCapacity() <= 0) {
            return 1; // no baseline to size against: add one machine and re-evaluate
        }
        double targetCapacity = demandQPS / targetUsage;
        double additionalCapacity = targetCapacity - capacity.getAvailableCapacity();
        // Convert one machine's raw capacity into the same "available" units as the cluster
        // figure (i.e. apply the same load-balancing and headroom factors); assumes homogeneous nodes
        double perMachineAvailable = machines.get(0).getCapacity()
                * (capacity.getAvailableCapacity() / capacity.getTotalCapacity());
        return Math.max(1, (int) Math.ceil(additionalCapacity / perMachineAvailable));
    }
}
```
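The machine count only comes out right if the shortfall and the per-machine capacity are in the same units. The cluster's available capacity already includes the x0.95 and x0.8 factors, so a new machine contributes its raw capacity times the same factors. The hypothetical `ScaleOutMath` helper below isolates that calculation. With an assumed fleet of 10 machines at 1,000 QPS (7,600 available), 7,220 QPS of demand (95% usage) and a 70% target, about 10,314 QPS of available capacity is needed, a shortfall of about 2,714. Each new machine adds 760 available QPS, so 4 machines. Dividing by the raw 1,000 QPS would suggest 3, and 13 machines (9,880 available) would still run at about 73%, above the target.

```java
/** Hypothetical helper: machines to add, with per-machine capacity in "available" units. */
final class ScaleOutMath {
    private ScaleOutMath() {}

    /**
     * @param demandQps        QPS to size for (current or forecast)
     * @param targetUsage      desired usage of available capacity after scaling, e.g. 0.7
     * @param clusterAvailable cluster capacity after load-balancing and headroom factors
     * @param clusterTotal     raw sum of per-machine capacity
     * @param perMachineRaw    raw capacity of one (homogeneous) machine
     */
    static int additionalMachines(double demandQps, double targetUsage,
                                  double clusterAvailable, double clusterTotal, double perMachineRaw) {
        if (targetUsage <= 0 || clusterTotal <= 0 || perMachineRaw <= 0) {
            throw new IllegalArgumentException("targetUsage, clusterTotal and perMachineRaw must be > 0");
        }
        double shortfall = demandQps / targetUsage - clusterAvailable;
        if (shortfall <= 0) {
            return 0;
        }
        double perMachineAvailable = perMachineRaw * (clusterAvailable / clusterTotal);
        return (int) Math.ceil(shortfall / perMachineAvailable);
    }
}
```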
6.2 Smooth Scale-Out Execution

```java
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

import java.util.List;

@Service
public class SmoothScalingExecutor {

    private static final long HEALTH_TIMEOUT_MS = 10 * 60 * 1000L;
    private static final int[] TRAFFIC_WEIGHTS = {10, 30, 50, 100}; // relative LB weight per ramp step

    @Autowired
    private MachineProvisioner machineProvisioner;

    @Autowired
    private LoadBalancer loadBalancer;

    /** Blocking; run it from a background job, not from a request thread. */
    public void executeScaling(ScalingDecision decision) {
        if (!decision.isNeedScaling()) {
            return;
        }
        int machineCount = decision.getRecommendedMachineCount();
        int batchSize = Math.max(1, machineCount / 2);
        try {
            for (int added = 0; added < machineCount; added += batchSize) {
                int currentBatchSize = Math.min(batchSize, machineCount - added);
                List<MachineInfo> newMachines = machineProvisioner.provisionMachines(currentBatchSize);
                for (MachineInfo machine : newMachines) {
                    deployApplication(machine);
                }
                waitForHealthy(newMachines);
                for (MachineInfo machine : newMachines) {
                    loadBalancer.addBackend(machine);
                    // Drop to the lowest weight right away; if the load balancer supports an
                    // initial weight on registration, use it so a cold node never gets full traffic
                    loadBalancer.setTrafficWeight(machine, TRAFFIC_WEIGHTS[0]);
                }
                gradualTrafficIncrease(newMachines);
                if (added + currentBatchSize < machineCount) {
                    Thread.sleep(60_000); // let the cluster settle before the next batch
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // abort the whole rollout, not just one step
        }
    }

    // Placeholder: deploy the application package to the new machine
    private void deployApplication(MachineInfo machine) {
    }

    private void waitForHealthy(List<MachineInfo> machines) throws InterruptedException {
        long deadline = System.currentTimeMillis() + HEALTH_TIMEOUT_MS;
        for (MachineInfo machine : machines) {
            while (!isHealthy(machine)) {
                if (System.currentTimeMillis() > deadline) {
                    throw new IllegalStateException("Machine did not become healthy: " + machine.getId());
                }
                Thread.sleep(5000);
            }
        }
    }

    // Placeholder: call the application's health-check endpoint
    private boolean isHealthy(MachineInfo machine) {
        return true;
    }

    private void gradualTrafficIncrease(List<MachineInfo> machines) throws InterruptedException {
        for (int weight : TRAFFIC_WEIGHTS) {
            for (MachineInfo machine : machines) {
                loadBalancer.setTrafficWeight(machine, weight);
            }
            Thread.sleep(30_000); // observe each step before raising the weight
            for (MachineInfo machine : machines) {
                if (!isHealthy(machine)) {
                    throw new IllegalStateException("Machine unhealthy during ramp-up: " + machine.getId());
                }
            }
        }
    }
}
```
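Because the executor deliberately waits between steps, an "emergency" scale-out still takes minutes before the new capacity is fully in service, which is worth knowing when setting alert thresholds. The hypothetical `ScalingPlan` helper below mirrors the executor's batching rule, `max(1, n / 2)`, and its fixed pauses (four weight steps of 30 s each, 60 s between batches) to compute a lower bound on rollout time. That bound excludes provisioning, deployment and health-check time, which depend on the environment. For 5 machines the batches are 2, 2 and 1, so the rollout takes at least 480 seconds.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: batch sizes and minimum rollout time for SmoothScalingExecutor's rules. */
final class ScalingPlan {
    private ScalingPlan() {}

    /** Splits machineCount into batches of max(1, machineCount / 2). */
    static List<Integer> batches(int machineCount) {
        List<Integer> result = new ArrayList<>();
        if (machineCount <= 0) {
            return result;
        }
        int batchSize = Math.max(1, machineCount / 2);
        for (int remaining = machineCount; remaining > 0; remaining -= batchSize) {
            result.add(Math.min(batchSize, remaining));
        }
        return result;
    }

    /** Ramp-up pauses plus inter-batch pauses; excludes provisioning, deploy and health waits. */
    static long minDurationSeconds(int machineCount, int rampSteps, long stepPauseSeconds, long batchPauseSeconds) {
        int batchCount = batches(machineCount).size();
        if (batchCount == 0) {
            return 0;
        }
        return batchCount * rampSteps * stepPauseSeconds + (batchCount - 1) * batchPauseSeconds;
    }
}
```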
7. QPS Monitoring Dashboard

7.1 Real-Time QPS Monitoring API

```java
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

import java.util.List;

@RestController
@RequestMapping("/api/qps")
public class QPSDashboardController {

    @Autowired
    private QPSStatisticsService qpsStatisticsService;

    @Autowired
    private ClusterCapacityEvaluator capacityEvaluator;

    @Autowired
    private BottleneckDetector bottleneckDetector;

    // Used by getForecast()
    @Autowired
    private CapacityAlertService alertService;

    @GetMapping("/current")
    public QPSMetrics getCurrentQPS() {
        return qpsStatisticsService.getCurrentQPS();
    }

    @GetMapping("/peak")
    public QPSMetrics getPeakQPS(@RequestParam(defaultValue = "24") int hours) {
        return qpsStatisticsService.getPeakQPS(hours);
    }

    @GetMapping("/trend")
    public List<QPSMetrics> getQPSTrend(@RequestParam(defaultValue = "60") int minutes) {
        return qpsStatisticsService.getQPSTrend(minutes);
    }

    @GetMapping("/capacity")
    public ClusterCapacityResult getCapacity() {
        return capacityEvaluator.evaluateClusterCapacity();
    }

    @GetMapping("/bottlenecks")
    public List<Bottleneck> getBottlenecks() {
        return bottleneckDetector.detectBottlenecks();
    }

    @GetMapping("/forecast")
    public CapacityForecast getForecast() {
        return alertService.forecastCapacity();
    }
}
```
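7.2 Monitoring Data Model

```java
import lombok.Builder;
import lombok.Data;

import java.util.List;

// Each public type below belongs in its own .java file; they are listed together for brevity.

@Data
@Builder
public class QPSMetrics {
    private double currentQPS;
    private double peakQPS;
    private Long timestamp;
    private Long peakTime;
}

@Data
@Builder
public class CapacityResult {
    private double cpuBasedQPS;
    private double memoryBasedQPS;
    private double threadPoolBasedQPS;
    private double stressTestQPS;
    private double actualCapacity;       // tightest bound
    private double recommendedCapacity;  // actualCapacity * safety factor
}

@Data
@Builder
public class ClusterCapacityResult {
    private double totalCapacity;      // sum of per-machine capacity
    private double effectiveCapacity;  // totalCapacity * 0.95
    private double availableCapacity;  // effectiveCapacity * 0.8
    private double currentQPS;
    private double usageRate;          // currentQPS / availableCapacity
    private List<MachineCapacity> machineCapacities;
}

@Data
@Builder
public class MachineCapacity {
    private String machineId;
    private double capacity;
}

@Data
@Builder
public class Bottleneck {
    private BottleneckType type;
    private Severity severity;
    private String message;
    private String recommendation;
}

@Data
@Builder
public class CapacityForecast {
    private double currentQPS;
    private double predictedQPS7Days;
    private double availableCapacity;
    private int daysUntilFull;
    private double growthRate;
    private String recommendation;
}

@Data
@Builder
public class ScalingDecision {
    private boolean needScaling;
    private ScalingType scalingType;
    private int recommendedMachineCount;
}

enum BottleneckType { CPU, MEMORY, THREAD_POOL, DATABASE, RESPONSE_TIME, NETWORK }

enum Severity { LOW, MEDIUM, HIGH, CRITICAL }

enum ScalingType { NONE, PLANNED, IMMEDIATE, EMERGENCY }
```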
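8. Best Practices Summary

8.1 QPS Evaluation Process

```text
Standard QPS evaluation process:

1. Monitor current QPS
   - Deploy a QPS monitoring system
   - Collect real-time QPS data
   - Analyze QPS trends
2. Evaluate single-machine capacity
   - CPU-based capacity
   - Memory-based capacity
   - Thread-pool-based capacity
   - Validate with load tests
3. Evaluate cluster capacity
   - Aggregate per-machine capacity
   - Account for load-balancing efficiency
   - Account for redundancy
4. Analyze bottlenecks
   - Detect system bottlenecks
   - Identify their causes
   - Plan optimizations
5. Plan capacity
   - Forecast QPS growth
   - Work out when scale-out will be needed
   - Draw up the scale-out plan
6. Keep improving
   - Validate capacity estimates regularly
   - Refine the evaluation model
   - Adjust the scaling strategy
```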
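8.2 Capacity Evaluation Formulas

```text
Capacity formulas (response times in ms):

Single-machine QPS capacity:
  CPU capacity         = CPU cores * 0.8 * (1000 / avg response time)      (assumes CPU-bound requests)
  Memory capacity      = (usable memory / memory per in-flight request) * (1000 / avg response time)
  Thread pool capacity = max threads * 0.8 * (1000 / avg response time)
  Actual capacity      = min(CPU capacity, memory capacity, thread pool capacity, load-test limit)

Cluster QPS capacity:
  Theoretical capacity = single-machine actual capacity * machine count
  Effective capacity   = theoretical capacity * 0.95   (load-balancing efficiency)
  Available capacity   = effective capacity * 0.8      (reserve 20% headroom)

Scale-out decision:
  Scale out when       current QPS / available capacity > 0.8
  Target capacity      = demand QPS / target usage
  Machines to add      = ceil((target capacity - available capacity) / (single-machine capacity * 0.95 * 0.8))
```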
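8.3 Architect-Level Recommendations
- Build a complete monitoring system: track key metrics such as QPS, response time, and error rate in real time.
- Evaluate capacity regularly: reassess system capacity once a month and forecast future demand.
- Set up capacity alerting: alert promptly when capacity usage exceeds 80%.
- Validate with gray load tests: use gray (canary) load tests to check how accurate the capacity estimate is.
- Scale out smoothly: add machines in batches and ramp traffic up gradually to avoid system jitter.
- Keep refining the evaluation model: use real production data to keep improving the accuracy of capacity estimates.
With the approach above, you can estimate the QPS capacity of an online system, check that estimate against gray load tests, plan a sensible scale-out strategy, and keep the system running stably.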