第197集系统性能监控核心指标实战:CPU、内存、连接数、QPS监控与优化 | 字数总计: 8.5k | 阅读时长: 40分钟 | 阅读量:
1. 系统性能监控概述 系统性能监控是保障业务稳定运行的关键环节,通过实时监控CPU使用率、内存使用率、活跃连接数、QPS等核心指标,可以及时发现性能瓶颈、预防系统故障、优化资源配置。本文将详细介绍这些关键监控指标的采集方法、分析工具以及生产级监控方案。
1.1 性能监控的重要性
业务保障 : 确保系统在高负载下稳定运行
故障预防 : 提前发现性能瓶颈,预防系统崩溃
容量规划 : 为系统扩容和架构优化提供数据支持
成本优化 : 合理配置资源,降低运营成本
用户体验 : 保证应用响应速度和可用性
1.2 核心监控指标 CPU使用率监控
系统CPU使用率 : 整体CPU利用率
用户态CPU使用率 : 用户程序占用CPU比例
内核态CPU使用率 : 系统内核占用CPU比例
I/O等待CPU使用率 : I/O操作等待时间
CPU负载均衡 : 系统负载平均值
内存使用率监控
物理内存使用率 : RAM使用百分比
虚拟内存使用率 : 交换分区使用情况
缓存命中率 : 内存缓存效率
内存泄漏检测 : 内存使用异常增长
页面交换频率 : 内存页面交换统计
连接数监控
活跃连接数 : 当前建立的连接数量
连接建立速率 : 每秒新建连接数
连接断开速率 : 每秒断开连接数
连接超时率 : 连接超时比例
连接池状态 : 数据库连接池使用情况
QPS监控
请求处理速率 : 每秒处理的请求数量
响应时间分布 : 请求响应时间统计
错误率统计 : 请求失败比例
吞吐量监控 : 系统处理能力指标
并发用户数 : 同时在线用户数量
1.3 监控层次架构 1 2 3 4 5 6 7 8 9 10 11 12 13 ┌─────────────────────────────────────────┐ │ 业务监控层 │ │ (QPS、响应时间、错误率、业务指标) │ ├─────────────────────────────────────────┤ │ 应用监控层 │ │ (JVM、应用性能、数据库连接、缓存) │ ├─────────────────────────────────────────┤ │ 系统监控层 │ │ (CPU、内存、磁盘、网络、进程) │ ├─────────────────────────────────────────┤ │ 基础设施层 │ │ (服务器、网络设备、存储设备) │ └─────────────────────────────────────────┘
2. CPU使用率监控实战 2.1 CPU监控指标详解 1 2 3 4 5 6 7 8 CPU使用率 = (1 - 空闲时间/总时间) × 100% 平均CPU使用率 = Σ(各核心使用率) / 核心数 负载平均值 = 1分钟、5分钟、15分钟的平均负载
2.2 CPU监控工具使用 top命令监控 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 top -p 进程ID top -o %CPU top -a top -d 1 top - 10:30:15 up 15 days, 2:45, 3 users , load average: 0.52, 0.58, 0.59 Tasks: 156 total, 1 running, 155 sleeping, 0 stopped, 0 zombie %Cpu(s): 8.2 us, 2.1 sy, 0.0 ni, 89.4 id , 0.2 wa, 0.0 hi, 0.1 si, 0.0 st KiB Mem : 8000000 total, 2000000 free, 3000000 used, 3000000 buff/cache KiB Swap: 2000000 total, 1800000 free, 200000 used. 4500000 avail Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 1234 java 20 0 8000000 2000000 100000 S 15.2 25.0 123:45.67 java 5678 nginx 20 0 50000 20000 5000 S 2.1 0.3 5:30.12 nginx
htop命令监控 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 yum install htop -y apt-get install htop -y htop htop -s PERCENT_CPU htop -H htop -d 2 -s PERCENT_CPU
vmstat命令监控 1 2 3 4 5 6 7 8 9 10 11 vmstat 1 vmstat 1 5 procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu----- r b swpd free buff cache si so bi bo in cs us sy id wa st 1 0 0 2000000 50000 3000000 0 0 0 5 10 25 8 2 89 1 0 0 0 0 2000000 50000 3000000 0 0 0 0 15 30 5 1 94 0 0
2.3 CPU监控脚本实现 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 #!/bin/bash LOG_FILE="/var/log/cpu_monitor.log" ALERT_THRESHOLD=80 CHECK_INTERVAL=60 get_cpu_usage () { cpu_usage=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | awk -F'%' '{print $1}' ) echo $cpu_usage } get_load_average () { load_avg=$(uptime | awk -F'load average:' '{print $2}' | awk -F',' '{print $1}' | xargs) echo $load_avg } monitor_cpu () { while true ; do timestamp=$(date '+%Y-%m-%d %H:%M:%S' ) cpu_usage=$(get_cpu_usage) load_avg=$(get_load_average) echo "$timestamp - CPU使用率: ${cpu_usage} %, 负载: ${load_avg} " >> $LOG_FILE if (( $(echo "$cpu_usage > $ALERT_THRESHOLD " | bc -l) )); then echo "ALERT: CPU使用率过高 - ${cpu_usage} %" >> $LOG_FILE send_alert "CPU使用率过高" "${cpu_usage} %" fi sleep $CHECK_INTERVAL done } send_alert () { local subject="$1 " local message="$2 " echo "$message " | mail -s "$subject " admin@company.com curl -X POST "https://oapi.dingtalk.com/robot/send?access_token=YOUR_TOKEN" \ -H 'Content-Type: application/json' \ -d "{\"msgtype\":\"text\",\"text\":{\"content\":\"$subject : $message \"}}" } monitor_cpu
2.4 CPU性能优化策略 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 check_cpu_optimization () { echo "=== CPU性能优化检查 ===" cpu_cores=$(nproc ) echo "CPU核心数: $cpu_cores " cpu_freq=$(cat /proc/cpuinfo | grep "cpu MHz" | head -1 | awk '{print $4}' ) echo "CPU频率: ${cpu_freq} MHz" cpu_cache=$(cat /proc/cpuinfo | grep "cache size" | head -1 | awk '{print $4}' ) echo "CPU缓存: $cpu_cache " echo "Top 10 CPU使用进程:" ps aux --sort =-%cpu | head -11 echo "系统负载:" uptime echo "中断统计:" cat /proc/interrupts | head -10 } cpu_tuning_recommendations () { echo "=== CPU调优建议 ===" echo "1. 调整进程优先级:" echo " nice -n -10 java -jar app.jar" echo "2. 设置CPU亲和性:" echo " taskset -c 0,1 java -jar app.jar" echo "3. 启用中断均衡:" echo " echo 1 > /proc/sys/kernel/irq_balance" echo "4. 内核参数调优:" echo " echo 'kernel.sched_migration_cost_ns = 5000000' >> /etc/sysctl.conf" echo " echo 'kernel.sched_autogroup_enabled = 0' >> /etc/sysctl.conf" }
3. 内存使用率监控实战 3.1 内存监控指标详解 1 2 3 4 5 6 7 8 内存使用率 = (总内存 - 可用内存) / 总内存 × 100% 可用内存 = 空闲内存 + 缓存内存 + 缓冲区内存 内存压力 = 交换分区使用量 / 交换分区总量
3.2 内存监控工具使用 free命令监控 1 2 3 4 5 6 7 8 9 10 11 12 13 free -h free -h -s 1 free -h -w total used free shared buff/cache available Mem: 7.8G 3.2G 1.8G 200M 2.8G 4.5G Swap: 2.0G 200M 1.8G
/proc/meminfo文件监控 1 2 3 4 5 6 7 8 9 10 11 12 13 14 cat /proc/meminfogrep -E "MemTotal|MemFree|MemAvailable|Buffers|Cached|SwapTotal|SwapFree" /proc/meminfo MemTotal: 8000000 kB MemFree: 1800000 kB MemAvailable: 4500000 kB Buffers: 50000 kB Cached: 3000000 kB SwapTotal: 2000000 kB SwapFree: 1800000 kB
ps命令监控进程内存 1 2 3 4 5 6 7 8 ps aux --sort =-%mem | head -10 ps -p 1234 -o pid,ppid,cmd,%mem,%cpu,rss,vsz ps -ef | grep java | awk '{print $2}' | xargs ps -p -o pid,ppid,cmd,%mem,rss,vsz
3.3 内存监控脚本实现 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 #!/bin/bash LOG_FILE="/var/log/memory_monitor.log" ALERT_THRESHOLD=85 CHECK_INTERVAL=60 get_memory_usage () { memory_usage=$(free | grep Mem | awk '{printf "%.2f", $3/$2 * 100.0}' ) echo $memory_usage } get_swap_usage () { swap_usage=$(free | grep Swap | awk '{if($2==0) print 0; else printf "%.2f", $3/$2 * 100.0}' ) echo $swap_usage } get_cache_hit_rate () { cache_hits=$(grep "cache_hits" /proc/vmstat | awk '{print $2}' ) cache_misses=$(grep "cache_misses" /proc/vmstat | awk '{print $2}' ) if [ -n "$cache_hits " ] && [ -n "$cache_misses " ]; then total_cache=$(($cache_hits + $cache_misses )) if [ $total_cache -gt 0 ]; then hit_rate=$(echo "scale=2; $cache_hits * 100 / $total_cache " | bc) echo $hit_rate else echo "0" fi else echo "N/A" fi } detect_memory_leak () { local process_name="$1 " local threshold="$2 " ps -C "$process_name " -o pid,rss --no-headers | while read pid rss; do if [ -n "$pid " ] && [ -n "$rss " ]; then echo "$(date '+%Y-%m-%d %H:%M:%S') $pid $rss " >> "/tmp/memory_history_${pid} .log" if [ -f "/tmp/memory_history_${pid} .log" ]; then recent_memory=$(tail -10 "/tmp/memory_history_${pid} .log" | awk '{print $3}' ) first_memory=$(echo "$recent_memory " | head -1) last_memory=$(echo "$recent_memory " | tail -1) if [ -n "$first_memory " ] && [ -n "$last_memory " ]; then growth_rate=$(echo "scale=2; ($last_memory - $first_memory ) * 100 / $first_memory " | bc) if (( $(echo "$growth_rate > $threshold " | bc -l) )); then echo "ALERT: 进程 $pid 可能存在内存泄漏,增长率为 ${growth_rate} %" fi fi fi fi done } monitor_memory () { while true ; do timestamp=$(date '+%Y-%m-%d %H:%M:%S' ) memory_usage=$(get_memory_usage) swap_usage=$(get_swap_usage) cache_hit_rate=$(get_cache_hit_rate) echo "$timestamp - 内存使用率: ${memory_usage} %, 交换分区: ${swap_usage} %, 缓存命中率: ${cache_hit_rate} %" >> $LOG_FILE if (( $(echo "$memory_usage > $ALERT_THRESHOLD " | bc -l) )); then echo "ALERT: 内存使用率过高 - ${memory_usage} %" >> $LOG_FILE send_alert "内存使用率过高" "${memory_usage} %" fi if (( $(echo "$swap_usage > 50 " | bc -l) )); then echo "ALERT: 交换分区使用率过高 - ${swap_usage} %" >> $LOG_FILE send_alert "交换分区使用率过高" "${swap_usage} %" fi detect_memory_leak "java" 20 detect_memory_leak "nginx" 15 sleep $CHECK_INTERVAL done } monitor_memory
3.4 内存性能优化策略 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 check_memory_optimization () { echo "=== 内存性能优化检查 ===" echo "内存使用情况:" free -h echo "内存碎片情况:" cat /proc/buddyinfo echo "大页内存配置:" cat /proc/meminfo | grep -i huge echo "内存回收统计:" cat /proc/vmstat | grep -E "pgpgin|pgpgout|pswpin|pswpout" echo "Top 10 内存使用进程:" ps aux --sort =-%mem | head -11 echo "内存泄漏检测:" dmesg | grep -i "out of memory" } memory_tuning_recommendations () { echo "=== 内存调优建议 ===" echo "1. 调整内存回收参数:" echo " echo 'vm.swappiness = 10' >> /etc/sysctl.conf" echo " echo 'vm.vfs_cache_pressure = 50' >> /etc/sysctl.conf" echo "2. 启用大页内存:" echo " echo 'vm.nr_hugepages = 1024' >> /etc/sysctl.conf" echo "3. 调整内存分配策略:" echo " echo 'vm.overcommit_memory = 1' >> /etc/sysctl.conf" echo " echo 'vm.overcommit_ratio = 80' >> /etc/sysctl.conf" echo "4. 优化缓存参数:" echo " echo 'vm.dirty_ratio = 15' >> /etc/sysctl.conf" echo " echo 'vm.dirty_background_ratio = 5' >> /etc/sysctl.conf" }
4. 连接数监控实战 4.1 连接数监控指标详解 1 2 3 4 5 6 7 8 9 10 11 12 netstat -an | grep ESTABLISHED | wc -l netstat -an | grep :80 | grep ESTABLISHED | wc -l netstat -an | awk '/^tcp/ {++state[$NF]} END {for(key in state) print key,state[key]}'
4.2 连接数监控工具使用 netstat命令监控 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 netstat -an netstat -ant netstat -anu netstat -tlnp netstat -an | awk '/^tcp/ {++state[$NF]} END {for(key in state) print key,state[key]}' ESTABLISHED 150 TIME_WAIT 25 LISTEN 10
ss命令监控 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 ss -an ss -ant ss -tlnp ss -ant | awk 'NR>1 {++state[$1]} END {for(key in state) print key,state[key]}' ss -i ss -s
lsof命令监控 1 2 3 4 5 6 7 8 9 10 11 12 13 14 lsof -i lsof -i :80 lsof -p 1234 lsof -i tcp lsof -i udp
4.3 连接数监控脚本实现 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 #!/bin/bash LOG_FILE="/var/log/connection_monitor.log" ALERT_THRESHOLD=1000 CHECK_INTERVAL=30 get_total_connections () { netstat -an | grep ESTABLISHED | wc -l } get_tcp_connections () { netstat -ant | grep ESTABLISHED | wc -l } get_udp_connections () { netstat -anu | grep ESTABLISHED | wc -l } get_listening_ports () { netstat -tlnp | wc -l } get_connection_states () { netstat -an | awk '/^tcp/ {++state[$NF]} END {for(key in state) print key,state[key]}' } get_port_connections () { netstat -an | awk '/^tcp/ {print $4}' | awk -F: '{print $NF}' | sort | uniq -c | sort -nr | head -10 } get_connection_rate () { local current_connections=$(get_total_connections) local timestamp=$(date +%s) echo "$timestamp $current_connections " >> "/tmp/connection_history.log" tail -100 "/tmp/connection_history.log" > "/tmp/connection_history.tmp" mv "/tmp/connection_history.tmp" "/tmp/connection_history.log" if [ -f "/tmp/connection_history.log" ] && [ $(wc -l < "/tmp/connection_history.log" ) -gt 1 ]; then local first_line=$(head -1 "/tmp/connection_history.log" ) local last_line=$(tail -1 "/tmp/connection_history.log" ) local first_time=$(echo $first_line | awk '{print $1}' ) local first_conn=$(echo $first_line | awk '{print $2}' ) local last_time=$(echo $last_line | awk '{print $1}' ) local last_conn=$(echo $last_line | awk '{print $2}' ) local time_diff=$(($last_time - $first_time )) local conn_diff=$(($last_conn - $first_conn )) if [ $time_diff -gt 0 ]; then local rate=$(echo "scale=2; $conn_diff * 60 / $time_diff " | bc) echo $rate else echo "0" fi else echo "0" fi } detect_abnormal_connections () { local time_wait_count=$(netstat -an | grep TIME_WAIT | wc -l) if [ $time_wait_count -gt 500 ]; then echo "WARNING: TIME_WAIT连接数过多 - $time_wait_count " fi local syn_recv_count=$(netstat -an | grep SYN_RECV | wc -l) if [ $syn_recv_count -gt 100 ]; then echo "WARNING: SYN_RECV连接数过多 - $syn_recv_count " fi netstat -an | grep ESTABLISHED | awk '{print $5}' | awk -F: '{print $1}' | sort | uniq -c | sort -nr | head -5 | while read count ip; do if [ $count -gt 50 ]; then echo "WARNING: IP $ip 连接数过多 - $count " fi done } monitor_connections () { while true ; do timestamp=$(date '+%Y-%m-%d %H:%M:%S' ) total_connections=$(get_total_connections) tcp_connections=$(get_tcp_connections) udp_connections=$(get_udp_connections) listening_ports=$(get_listening_ports) connection_rate=$(get_connection_rate) echo "$timestamp - 总连接数: $total_connections , TCP: $tcp_connections , UDP: $udp_connections , 监听端口: $listening_ports , 连接速率: ${connection_rate} /min" >> $LOG_FILE if [ $total_connections -gt $ALERT_THRESHOLD ]; then echo "ALERT: 连接数过多 - $total_connections " >> $LOG_FILE send_alert "连接数过多" "$total_connections " fi abnormal_connections=$(detect_abnormal_connections) if [ -n "$abnormal_connections " ]; then echo "WARNING: $abnormal_connections " >> $LOG_FILE fi echo "连接状态统计:" >> $LOG_FILE get_connection_states >> $LOG_FILE echo "端口连接数统计:" >> $LOG_FILE get_port_connections >> $LOG_FILE sleep $CHECK_INTERVAL done } monitor_connections
4.4 连接数优化策略 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 check_connection_optimization () { echo "=== 连接数优化检查 ===" echo "当前连接数统计:" netstat -an | awk '/^tcp/ {++state[$NF]} END {for(key in state) print key,state[key]}' echo "监听端口:" netstat -tlnp | head -10 echo "TCP连接超时设置:" sysctl net.ipv4.tcp_keepalive_time sysctl net.ipv4.tcp_keepalive_intvl sysctl net.ipv4.tcp_keepalive_probes echo "连接队列设置:" sysctl net.core.somaxconn sysctl net.ipv4.tcp_max_syn_backlog echo "TIME_WAIT设置:" sysctl net.ipv4.tcp_tw_reuse sysctl net.ipv4.tcp_tw_recycle sysctl net.ipv4.tcp_fin_timeout } connection_tuning_recommendations () { echo "=== 连接数调优建议 ===" echo "1. 调整连接超时参数:" echo " echo 'net.ipv4.tcp_keepalive_time = 600' >> /etc/sysctl.conf" echo " echo 'net.ipv4.tcp_keepalive_intvl = 60' >> /etc/sysctl.conf" echo " echo 'net.ipv4.tcp_keepalive_probes = 3' >> /etc/sysctl.conf" echo "2. 优化连接队列:" echo " echo 'net.core.somaxconn = 65535' >> /etc/sysctl.conf" echo " echo 'net.ipv4.tcp_max_syn_backlog = 65535' >> /etc/sysctl.conf" echo "3. 优化TIME_WAIT处理:" echo " echo 'net.ipv4.tcp_tw_reuse = 1' >> /etc/sysctl.conf" echo " echo 'net.ipv4.tcp_fin_timeout = 30' >> /etc/sysctl.conf" echo "4. 调整连接跟踪:" echo " echo 'net.netfilter.nf_conntrack_max = 65535' >> /etc/sysctl.conf" echo " echo 'net.netfilter.nf_conntrack_tcp_timeout_established = 1200' >> /etc/sysctl.conf" }
5. QPS监控实战 5.1 QPS监控指标详解 1 2 3 4 5 6 7 8 9 10 11 12 13 QPS = 总请求数 / 时间间隔(秒) 平均响应时间 = 总响应时间 / 请求数 P95响应时间 = 95%的请求响应时间 P99响应时间 = 99%的请求响应时间 错误率 = 错误请求数 / 总请求数 × 100% 吞吐量 = 成功请求数 / 时间间隔(秒)
5.2 QPS监控工具使用 应用层QPS监控 1 2 3 4 5 6 7 8 9 10 tail -f /var/log/nginx/access.log | awk '{print $4}' | cut -d: -f1-2 | uniq -ctail -f /var/log/apache2/access.log | awk '{print $4}' | cut -d[ -f2 | cut -d: -f1-2 | uniq -ctail -f /var/log/nginx/access.log | awk '{print $4}' | cut -d: -f1-2 | uniq -c | while read count time; do echo "$time : $count requests" done
系统层QPS监控 1 2 3 4 5 6 7 8 iftop -i eth0 nethogs eth0 tcpdump -i eth0 -n port 80 | awk '{print $3}' | cut -d. -f1-4 | sort | uniq -c | sort -nr
5.3 QPS监控脚本实现 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 #!/bin/bash LOG_FILE="/var/log/qps_monitor.log" ACCESS_LOG="/var/log/nginx/access.log" ALERT_QPS_THRESHOLD=1000 ALERT_RESPONSE_TIME_THRESHOLD=2000 CHECK_INTERVAL=60 get_qps_stats () { local time_window="$1 " local current_time=$(date '+%d/%b/%Y:%H:%M' ) tail -1000 "$ACCESS_LOG " | awk -v current_time="$current_time " ' BEGIN { count = 0 total_response_time = 0 error_count = 0 } { # 提取时间戳 timestamp = $4 gsub(/\[/, "", timestamp) gsub(/:/, " ", timestamp) # 提取响应时间 response_time = $NF gsub(/"/, "", response_time) # 提取状态码 status_code = $9 # 统计请求数 count++ total_response_time += response_time # 统计错误数 if (status_code >= 400) { error_count++ } } END { if (count > 0) { avg_response_time = total_response_time / count error_rate = (error_count / count) * 100 printf "%d %.2f %.2f", count, avg_response_time, error_rate } else { printf "0 0 0" } }' } get_response_time_distribution () { tail -1000 "$ACCESS_LOG " | awk ' { response_time = $NF gsub(/"/, "", response_time) if (response_time < 100) { bucket_100++ } else if (response_time < 500) { bucket_500++ } else if (response_time < 1000) { bucket_1000++ } else if (response_time < 2000) { bucket_2000++ } else { bucket_2000_plus++ } } END { total = bucket_100 + bucket_500 + bucket_1000 + bucket_2000 + bucket_2000_plus if (total > 0) { printf "<100ms: %.1f%%, <500ms: %.1f%%, <1000ms: %.1f%%, <2000ms: %.1f%%, >2000ms: %.1f%%\n", (bucket_100/total)*100, (bucket_500/total)*100, (bucket_1000/total)*100, (bucket_2000/total)*100, (bucket_2000_plus/total)*100 } }' } get_top_urls () { tail -1000 "$ACCESS_LOG " | awk '{print $7}' | sort | uniq -c | sort -nr | head -10 } get_status_code_stats () { tail -1000 "$ACCESS_LOG " | awk '{print $9}' | sort | uniq -c | sort -nr } get_ip_stats () { tail -1000 "$ACCESS_LOG " | awk '{print $1}' | sort | uniq -c | sort -nr | head -10 } monitor_qps () { while true ; do timestamp=$(date '+%Y-%m-%d %H:%M:%S' ) qps_stats=$(get_qps_stats 60) qps=$(echo $qps_stats | awk '{print $1}' ) avg_response_time=$(echo $qps_stats | awk '{print $2}' ) error_rate=$(echo $qps_stats | awk '{print $3}' ) echo "$timestamp - QPS: $qps , 平均响应时间: ${avg_response_time} ms, 错误率: ${error_rate} %" >> $LOG_FILE if [ $qps -gt $ALERT_QPS_THRESHOLD ]; then echo "ALERT: QPS过高 - $qps " >> $LOG_FILE send_alert "QPS过高" "$qps " fi if (( $(echo "$avg_response_time > $ALERT_RESPONSE_TIME_THRESHOLD " | bc -l) )); then echo "ALERT: 响应时间过长 - ${avg_response_time} ms" >> $LOG_FILE send_alert "响应时间过长" "${avg_response_time} ms" fi echo "响应时间分布:" >> $LOG_FILE get_response_time_distribution >> $LOG_FILE echo "热门URL:" >> $LOG_FILE get_top_urls >> $LOG_FILE echo "状态码统计:" >> $LOG_FILE get_status_code_stats >> $LOG_FILE echo "IP统计:" >> $LOG_FILE get_ip_stats >> $LOG_FILE sleep $CHECK_INTERVAL done } monitor_qps
5.4 QPS性能优化策略 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 check_qps_optimization () { echo "=== QPS性能优化检查 ===" echo "当前QPS统计:" tail -100 "$ACCESS_LOG " | wc -l echo "响应时间分布:" get_response_time_distribution echo "错误率统计:" tail -1000 "$ACCESS_LOG " | awk '{print $9}' | sort | uniq -c | sort -nr echo "热门URL:" get_top_urls echo "系统负载:" uptime echo "网络连接:" netstat -an | grep ESTABLISHED | wc -l } qps_tuning_recommendations () { echo "=== QPS调优建议 ===" echo "1. Web服务器优化:" echo " - 启用gzip压缩" echo " - 配置缓存策略" echo " - 优化worker进程数" echo " - 调整连接超时参数" echo "2. 数据库优化:" echo " - 优化SQL查询" echo " - 添加适当索引" echo " - 配置连接池" echo " - 启用查询缓存" echo "3. 应用优化:" echo " - 启用应用缓存" echo " - 优化代码逻辑" echo " - 使用异步处理" echo " - 配置负载均衡" echo "4. 系统优化:" echo " - 调整内核参数" echo " - 优化网络配置" echo " - 配置资源限制" echo " - 启用监控告警" }
6. 综合监控方案 6.1 监控架构设计 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 monitoring_architecture: data_collection: - prometheus: "指标采集" - telegraf: "系统指标" - filebeat: "日志采集" - node_exporter: "节点指标" data_storage: - influxdb: "时序数据存储" - elasticsearch: "日志存储" - redis: "缓存存储" data_visualization: - grafana: "监控面板" - kibana: "日志分析" - custom_dashboard: "自定义面板" alerting: - alertmanager: "告警管理" - webhook: "自定义告警" - email: "邮件告警" - sms: "短信告警"
6.2 监控脚本集成 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 #!/bin/bash LOG_DIR="/var/log/monitoring" ALERT_WEBHOOK="https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK" CHECK_INTERVAL=60 mkdir -p "$LOG_DIR " monitor_cpu () { local cpu_usage=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | awk -F'%' '{print $1}' ) echo "cpu_usage:$cpu_usage " } monitor_memory () { local memory_usage=$(free | grep Mem | awk '{printf "%.2f", $3/$2 * 100.0}' ) echo "memory_usage:$memory_usage " } monitor_connections () { local connections=$(netstat -an | grep ESTABLISHED | wc -l) echo "connections:$connections " } monitor_qps () { local qps=$(tail -100 /var/log/nginx/access.log 2>/dev/null | wc -l) echo "qps:$qps " } collect_metrics () { local timestamp=$(date '+%Y-%m-%d %H:%M:%S' ) local metrics="timestamp:$timestamp " metrics="$metrics ,$(monitor_cpu) " metrics="$metrics ,$(monitor_memory) " metrics="$metrics ,$(monitor_connections) " metrics="$metrics ,$(monitor_qps) " echo "$metrics " >> "$LOG_DIR /metrics.log" echo "$metrics " } check_alerts () { local metrics="$1 " local alerts="" local cpu_usage=$(echo "$metrics " | grep -o "cpu_usage:[0-9.]*" | cut -d: -f2) if (( $(echo "$cpu_usage > 80 " | bc -l) )); then alerts="$alerts CPU使用率过高:${cpu_usage} %" fi local memory_usage=$(echo "$metrics " | grep -o "memory_usage:[0-9.]*" | cut -d: -f2) if (( $(echo "$memory_usage > 85 " | bc -l) )); then alerts="$alerts 内存使用率过高:${memory_usage} %" fi local connections=$(echo "$metrics " | grep -o "connections:[0-9]*" | cut -d: -f2) if [ $connections -gt 1000 ]; then alerts="$alerts 连接数过多:$connections " fi local qps=$(echo "$metrics " | grep -o "qps:[0-9]*" | cut -d: -f2) if [ $qps -gt 500 ]; then alerts="$alerts QPS过高:$qps " fi if [ -n "$alerts " ]; then send_alert "$alerts " fi } send_alert () { local message="$1 " local timestamp=$(date '+%Y-%m-%d %H:%M:%S' ) echo "$timestamp - ALERT: $message " >> "$LOG_DIR /alerts.log" curl -X POST "$ALERT_WEBHOOK " \ -H 'Content-Type: application/json' \ -d "{\"text\":\"🚨 系统告警: $message \"}" echo "$message " | mail -s "系统告警 - $timestamp " admin@company.com } main_monitor () { echo "启动综合监控系统..." while true ; do metrics=$(collect_metrics) check_alerts "$metrics " sleep $CHECK_INTERVAL done } main_monitor
6.3 监控面板配置 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 { "dashboard" : { "title" : "系统性能监控面板" , "panels" : [ { "title" : "CPU使用率" , "type" : "graph" , "targets" : [ { "expr" : "cpu_usage_percent" , "legendFormat" : "CPU使用率" } ] , "yAxes" : [ { "max" : 100 , "min" : 0 , "unit" : "percent" } ] } , { "title" : "内存使用率" , "type" : "graph" , "targets" : [ { "expr" : "memory_usage_percent" , "legendFormat" : "内存使用率" } ] , "yAxes" : [ { "max" : 100 , "min" : 0 , "unit" : "percent" } ] } , { "title" : "连接数" , "type" : "graph" , "targets" : [ { "expr" : "tcp_connections" , "legendFormat" : "TCP连接数" } ] } , { "title" : "QPS" , "type" : "graph" , "targets" : [ { "expr" : "requests_per_second" , "legendFormat" : "每秒请求数" } ] } ] } }
7. 告警策略设计 7.1 告警规则配置 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 alert_rules: cpu_high: condition: "cpu_usage > 80" duration: "5m" severity: "warning" message: "CPU使用率持续高于80%" cpu_critical: condition: "cpu_usage > 95" duration: "2m" severity: "critical" message: "CPU使用率持续高于95%" memory_high: condition: "memory_usage > 85" duration: "5m" severity: "warning" message: "内存使用率持续高于85%" memory_critical: condition: "memory_usage > 95" duration: "2m" severity: "critical" message: "内存使用率持续高于95%" connections_high: condition: "connections > 1000" duration: "3m" severity: "warning" message: "连接数持续高于1000" qps_high: condition: "qps > 500" duration: "3m" severity: "warning" message: "QPS持续高于500"
7.2 告警处理流程 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 #!/bin/bash handle_alert () { local alert_type="$1 " local alert_value="$2 " local severity="$3 " case "$alert_type " in "cpu_high" ) handle_cpu_alert "$alert_value " "$severity " ;; "memory_high" ) handle_memory_alert "$alert_value " "$severity " ;; "connections_high" ) handle_connections_alert "$alert_value " "$severity " ;; "qps_high" ) handle_qps_alert "$alert_value " "$severity " ;; *) echo "未知告警类型: $alert_type " ;; esac } handle_cpu_alert () { local cpu_usage="$1 " local severity="$2 " echo "处理CPU告警: $cpu_usage %" log_alert "CPU" "$cpu_usage " "$severity " send_notification "CPU告警" "CPU使用率: $cpu_usage %" if [ "$severity " = "critical" ]; then restart_high_cpu_processes fi } handle_memory_alert () { local memory_usage="$1 " local severity="$2 " echo "处理内存告警: $memory_usage %" log_alert "Memory" "$memory_usage " "$severity " send_notification "内存告警" "内存使用率: $memory_usage %" if [ "$severity " = "critical" ]; then clear_memory_cache fi } handle_connections_alert () { local connections="$1 " local severity="$2 " echo "处理连接数告警: $connections " log_alert "Connections" "$connections " "$severity " send_notification "连接数告警" "连接数: $connections " if [ "$severity " = "critical" ]; then cleanup_abnormal_connections fi } handle_qps_alert () { local qps="$1 " local severity="$2 " echo "处理QPS告警: $qps " log_alert "QPS" "$qps " "$severity " send_notification "QPS告警" "QPS: $qps " if [ "$severity " = "critical" ]; then enable_rate_limiting fi } restart_high_cpu_processes () { echo "重启高CPU进程..." } clear_memory_cache () { echo "清理内存缓存..." sync echo 3 > /proc/sys/vm/drop_caches } cleanup_abnormal_connections () { echo "清理异常连接..." } enable_rate_limiting () { echo "启用限流..." } log_alert () { local metric="$1 " local value="$2 " local severity="$3 " local timestamp=$(date '+%Y-%m-%d %H:%M:%S' ) echo "$timestamp - $severity - $metric : $value " >> "/var/log/alerts.log" } send_notification () { local title="$1 " local message="$2 " echo "$message " | mail -s "$title " admin@company.com curl -X POST "$SLACK_WEBHOOK " \ -H 'Content-Type: application/json' \ -d "{\"text\":\"🚨 $title : $message \"}" }
8. 容量规划与优化 8.1 容量规划模型 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 get_current_metrics () { local cpu_usage=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | awk -F'%' '{print $1}' ) local memory_usage=$(free | grep Mem | awk '{printf "%.2f", $3/$2 * 100.0}' ) local connections=$(netstat -an | grep ESTABLISHED | wc -l) local qps=$(tail -100 /var/log/nginx/access.log 2>/dev/null | wc -l) echo "CPU:$cpu_usage Memory:$memory_usage Connections:$connections QPS:$qps " } predict_capacity () { local current_metrics="$1 " local growth_rate="$2 " local time_horizon="$3 " local current_qps=$(echo "$current_metrics " | grep -o "QPS:[0-9]*" | cut -d: -f2) local predicted_qps=$(echo "$current_qps * (1 + $growth_rate /100)^$time_horizon " | bc) echo "预测$time_horizon 个月后QPS: $predicted_qps " local required_cpu=$(echo "$predicted_qps * 0.1" | bc) local required_memory=$(echo "$predicted_qps * 0.05" | bc) echo "所需CPU核心数: $required_cpu " echo "所需内存(GB): $required_memory " } capacity_recommendations () { local current_metrics="$1 " echo "=== 容量规划建议 ===" local cpu_usage=$(echo "$current_metrics " | grep -o "CPU:[0-9.]*" | cut -d: -f2) if (( $(echo "$cpu_usage > 70 " | bc -l) )); then echo "建议增加CPU核心数" fi local memory_usage=$(echo "$current_metrics " | grep -o "Memory:[0-9.]*" | cut -d: -f2) if (( $(echo "$memory_usage > 80 " | bc -l) )); then echo "建议增加内存容量" fi local connections=$(echo "$current_metrics " | grep -o "Connections:[0-9]*" | cut -d: -f2) if [ $connections -gt 800 ]; then echo "建议优化连接管理" fi local qps=$(echo "$current_metrics " | grep -o "QPS:[0-9]*" | cut -d: -f2) if [ $qps -gt 400 ]; then echo "建议实施负载均衡" fi } main () { echo "=== 容量规划分析 ===" current_metrics=$(get_current_metrics) echo "当前系统指标: $current_metrics " echo "=== 容量预测 ===" predict_capacity "$current_metrics " 20 6 capacity_recommendations "$current_metrics " } main
8.2 性能优化策略 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 optimize_system () { echo "=== 系统优化 ===" echo "优化内核参数..." cat >> /etc/sysctl.conf << EOF # 网络优化 net.core.somaxconn = 65535 net.ipv4.tcp_max_syn_backlog = 65535 net.ipv4.tcp_keepalive_time = 600 net.ipv4.tcp_keepalive_intvl = 60 net.ipv4.tcp_keepalive_probes = 3 net.ipv4.tcp_tw_reuse = 1 net.ipv4.tcp_fin_timeout = 30 # 内存优化 vm.swappiness = 10 vm.vfs_cache_pressure = 50 vm.dirty_ratio = 15 vm.dirty_background_ratio = 5 # 文件系统优化 fs.file-max = 65535 EOF sysctl -p echo "内核参数优化完成" } optimize_application () { echo "=== 应用优化 ===" if [ -f "/etc/nginx/nginx.conf" ]; then echo "优化Nginx配置..." cp /etc/nginx/nginx.conf /etc/nginx/nginx.conf.backup sed -i 's/worker_processes auto;/worker_processes 4;/' /etc/nginx/nginx.conf sed -i 's/worker_connections 1024;/worker_connections 4096;/' /etc/nginx/nginx.conf systemctl restart nginx echo "Nginx优化完成" fi if [ -f "/etc/mysql/my.cnf" ]; then echo "优化MySQL配置..." cp /etc/mysql/my.cnf /etc/mysql/my.cnf.backup cat >> /etc/mysql/my.cnf << EOF [mysqld] innodb_buffer_pool_size = 1G innodb_log_file_size = 256M innodb_flush_log_at_trx_commit = 2 query_cache_size = 64M query_cache_type = 1 max_connections = 500 EOF systemctl restart mysql echo "MySQL优化完成" fi } optimize_monitoring () { echo "=== 监控优化 ===" echo "安装监控工具..." yum install -y htop iotop nethogs iftop echo "配置日志轮转..." cat > /etc/logrotate.d/monitoring << EOF /var/log/monitoring/*.log { daily rotate 30 compress delaycompress missingok notifempty create 644 root root } EOF echo "监控优化完成" } main () { echo "开始性能优化..." optimize_system optimize_application optimize_monitoring echo "性能优化完成" } main
9. 总结 系统性能监控是保障业务稳定运行的关键环节,通过监控CPU使用率、内存使用率、连接数、QPS等核心指标,可以:
9.1 监控价值
预防故障 : 提前发现性能瓶颈,避免系统崩溃
优化性能 : 识别性能瓶颈,指导系统优化
容量规划 : 为系统扩容提供数据支持
成本控制 : 合理配置资源,降低运营成本
用户体验 : 保证应用响应速度和可用性
9.2 最佳实践
全面监控 : 覆盖系统、应用、业务各个层面
实时告警 : 及时发现和处理异常情况
自动化处理 : 减少人工干预,提高响应速度
持续优化 : 基于监控数据持续改进系统性能
容量规划 : 基于历史数据预测未来需求
9.3 技术要点
指标采集 : 使用多种工具采集系统指标
数据处理 : 实时处理和存储监控数据
可视化 : 通过图表展示监控数据
告警机制 : 建立完善的告警体系
自动处理 : 实现自动化的故障处理
通过本文的学习,您应该已经掌握了系统性能监控的核心技术,能够建立完善的监控体系,保障系统的稳定运行。