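1. Process Monitoring Overview

Process monitoring is a core part of system operations. Tracking each process's run state, resource usage, and performance metrics in real time makes it possible to catch abnormal processes early, tune system performance, and prevent failures. This article covers the core metrics, monitoring tools, performance analysis methods, and best practices for process monitoring.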

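1.1 Why Process Monitoring Matters

  1. Failure prevention: detect abnormal processes early and prevent service outages
  2. Performance tuning: identify process bottlenecks and adjust resource allocation
  3. Resource management: allocate and manage process resources sensibly
  4. Troubleshooting: locate and resolve process problems quickly
  5. Capacity planning: provide data to support system scaling decisions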

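1.2 Core Metrics

  • Process state: running, sleeping, stopped, zombie, and so on
  • CPU usage: CPU time consumed by the process
  • Memory usage: memory footprint and memory-leak detection
  • File descriptors: number of open files
  • Network connections: connection states and traffic
  • Process tree: parent-child relationships and process hierarchy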
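
On Linux, most of the metrics listed above can be read directly from /proc. The following is a minimal sketch, assuming a Linux host with /proc mounted; `proc_snapshot` is a hypothetical helper name, while the `State`, `VmRSS`, and `Threads` fields are the ones the kernel exposes in /proc/<pid>/status.

```shell
#!/bin/bash
# proc_snapshot PID: print one line of key metrics for a process, read from /proc.
# Hypothetical helper for illustration; Linux only.
proc_snapshot() {
    local pid="$1"
    local status_file="/proc/$pid/status"
    if [ ! -r "$status_file" ]; then
        echo "pid=$pid not found" >&2
        return 1
    fi

    local state rss_kb threads fd_count
    # "State:" looks like "S (sleeping)"; keep only the one-letter code.
    state=$(awk '/^State:/ {print $2}' "$status_file")
    # VmRSS is missing for kernel threads, so it may come back empty.
    rss_kb=$(awk '/^VmRSS:/ {print $2}' "$status_file")
    threads=$(awk '/^Threads:/ {print $2}' "$status_file")
    # Listing /proc/<pid>/fd needs permission; other users' processes report 0.
    fd_count=$(find "/proc/$pid/fd" -mindepth 1 -maxdepth 1 2>/dev/null | wc -l)

    echo "pid=$pid state=$state rss_kb=${rss_kb:-0} threads=$threads fds=$fd_count"
}
```

Calling `proc_snapshot $$` prints the metrics of the current shell; the same fields are what ps and top read internally.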

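1.3 Monitoring Levels

  1. Process level: resource usage and state of a single process
  2. Thread level: detailed monitoring of the threads inside a process
  3. System level: overall process distribution and resource usage
  4. Application level: business metrics of application processes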

2. Process Monitoring Tools

2.1 The ps Command

ps is the most basic tool for inspecting processes and exposes a wide range of per-process fields.

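# List all processes
ps aux

# Show the process tree
ps auxf

# Sort by CPU usage, highest first
ps aux --sort=-%cpu

# Sort by memory usage, highest first
ps aux --sort=-%mem

# List the processes of a specific user
ps -u username

# Show selected fields for a specific process
ps -p PID -o pid,ppid,cmd,%cpu,%mem,vsz,rss,tty,stat,start,time

# Refresh the top CPU consumers every second
watch -n 1 'ps aux --sort=-%cpu | head -20'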

2.2 The top Command

top provides a live, continuously refreshed view of processes.

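# Start top
top

# Sort by CPU usage (the default)
top

# Sort by memory usage
top -o %MEM

# Sort by process ID
top -o PID

# Show only the processes of a specific user
top -u username

# Set the refresh interval in seconds
top -d 2

# Show individual threads
top -H

# Batch mode: write one snapshot to a file
top -b -n 1 > process_info.txt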

2.3 The htop Command

htop is an enhanced alternative to top with a friendlier interface and more features.

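# Install htop
yum install htop # CentOS/RHEL
apt install htop # Ubuntu/Debian

# Start htop
htop

# Sort by CPU usage
htop -s PERCENT_CPU

# Sort by memory usage
htop -s PERCENT_MEM

# Show only the processes of a specific user
htop -u username

# Set the refresh interval (-d is in tenths of a second, so 5 means 0.5 s)
htop -d 5

# Show threads: press H inside htop to toggle user threads
# (in htop 3.x the -H command-line flag highlights process changes instead)
htop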

2.4 A Process Monitoring Script

A custom monitoring script can combine these checks:

#!/bin/bash
# process_monitor.sh - process monitoring script

# Configuration
LOG_FILE="/var/log/process_monitor.log"
ALERT_CPU_THRESHOLD=80
ALERT_MEM_THRESHOLD=80
CHECK_INTERVAL=60

# Append a timestamped message to the log file
log_message() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" >> "$LOG_FILE"
}

# Print processes whose CPU usage exceeds the threshold
check_cpu_usage() {
    ps aux --sort=-%cpu | awk -v threshold="$ALERT_CPU_THRESHOLD" '
        NR>1 && $3 > threshold {
            print "HIGH_CPU: " $2 " " $11 " " $3 "%"
        }'
}

# Print processes whose memory usage exceeds the threshold
check_mem_usage() {
    ps aux --sort=-%mem | awk -v threshold="$ALERT_MEM_THRESHOLD" '
        NR>1 && $4 > threshold {
            print "HIGH_MEM: " $2 " " $11 " " $4 "%"
        }'
}

# Log zombie processes (STAT column starts with Z)
check_zombie_processes() {
    local zombie_count
    zombie_count=$(ps aux | awk '$8 ~ /^Z/ { count++ } END { print count+0 }')
    if [ "$zombie_count" -gt 0 ]; then
        log_message "WARNING: Found $zombie_count zombie processes"
        ps aux | awk '$8 ~ /^Z/ { print "ZOMBIE: " $2 " " $11 }' |
            while read -r line; do log_message "$line"; done
    fi
}

# Log processes holding more than 1000 open file descriptors
# (reading another user's /proc/<pid>/fd requires root)
check_file_descriptors() {
    local fd_dir pid fd_count
    for fd_dir in /proc/[0-9]*/fd; do
        pid=${fd_dir#/proc/}
        pid=${pid%/fd}
        fd_count=$(ls "$fd_dir" 2>/dev/null | wc -l)
        if [ "$fd_count" -gt 1000 ]; then
            log_message "HIGH_FD: Process $pid has $fd_count file descriptors"
        fi
    done
}

# Main monitoring loop
main() {
    log_message "Process monitor started"

    while true; do
        # Check CPU usage
        check_cpu_usage | while read -r alert; do log_message "$alert"; done

        # Check memory usage
        check_mem_usage | while read -r alert; do log_message "$alert"; done

        # Check for zombie processes
        check_zombie_processes

        # Check file descriptors
        check_file_descriptors

        sleep "$CHECK_INTERVAL"
    done
}

# Start monitoring
main

3. Process Performance Analysis

3.1 CPU Performance Analysis

3.1.1 CPU Usage Analysis

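# System-wide CPU utilization trend (sar -u is not per-process)
sar -u 1 10

# CPU usage of a specific process, sampled every second, 10 times
pidstat -p PID 1 10

# One-shot CPU snapshot of a process
top -p PID -n 1 -b | grep PID

# Collect performance counters for a process over 10 seconds
perf stat -p PID sleep 10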

3.1.2 CPU Hotspot Analysis

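# Live view of the hottest functions in a process
perf top -p PID

# Generate a flame graph (-g records call stacks; stackcollapse-perf.pl and
# flamegraph.pl come from the FlameGraph scripts and must be on PATH)
perf record -g -p PID -- sleep 30
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg

# Summarize system calls (press Ctrl-C to stop and print the table)
strace -c -p PID

# Summarize library function calls
ltrace -c -p PID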

3.2 Memory Performance Analysis

3.2.1 Memory Usage Analysis

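# Show the memory usage of a process
ps -p PID -o pid,vsz,rss,pmem,comm

# Show the memory mappings of a process
cat /proc/PID/maps

# Show memory statistics of a process (VmRSS, VmSize, ...)
cat /proc/PID/status

# Track memory usage over time, sampled every second, 10 times
pidstat -r -p PID 1 10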

3.2.2 Memory Leak Detection

# Detect memory leaks with valgrind
valgrind --tool=memcheck --leak-check=full ./program

# Detect memory errors with AddressSanitizer
gcc -fsanitize=address -g program.c
./a.out

# Watch the memory of a process grow over time
while true; do
    ps -p PID -o pid,vsz,rss,pmem,comm
    sleep 10
done

3.3 I/O Performance Analysis

3.3.1 Disk I/O Analysis

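# Show disk I/O of a process
iotop -p PID

# Show per-process I/O statistics
pidstat -d -p PID 1 10

# List the files a process has open
lsof -p PID

# Trace file-name system calls (open, stat, unlink, ...)
strace -e trace=file -p PID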

3.3.2 Network I/O Analysis

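# Show the listening sockets of a process
netstat -tulpn | grep PID

# Same information from ss
ss -tulpn | grep PID

# Show bandwidth per process
# (iftop cannot filter by PID; its -p flag enables promiscuous mode)
nethogs

# Capture traffic on a port the process uses
# (tcpdump cannot filter by PID; its -p flag disables promiscuous mode)
tcpdump -i any port PORT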

4. Diagnosing Process Problems

4.1 Abnormal Process States

4.1.1 Handling Zombie Processes

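# Find zombie processes
ps aux | awk '$8 ~ /^Z/ { print $2, $11 }'

# Find zombies together with their parent PID (second column)
ps -eo pid,ppid,state,comm | awk '$3 ~ /^Z/'

# A zombie is already dead and cannot be killed; its parent must reap it.
# Ask the parent to reap first, and kill the parent only as a last resort
# (the zombies are then adopted and reaped by init).
kill -s CHLD PARENT_PID
kill -9 PARENT_PID

# Zombie watcher script
#!/bin/bash
# zombie_cleaner.sh
while true; do
    # Collect "pid ppid" pairs for every zombie
    zombies=$(ps -eo pid=,ppid=,state= | awk '$3 ~ /^Z/ { print $1, $2 }')
    if [ -n "$zombies" ]; then
        echo "Found zombie processes (pid ppid):"
        echo "$zombies"
        # SIGKILL has no effect on a zombie, so nudge each parent to call wait()
        echo "$zombies" | awk '{ print $2 }' | sort -u | while read -r ppid; do
            kill -s CHLD "$ppid" 2>/dev/null
        done
    fi
    sleep 60
done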
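
Because a zombie can only be reaped by its parent, it is often more useful to know which parent is accumulating zombies than to list the zombies themselves. The following is a small sketch of that grouping; `zombies_by_parent` is a hypothetical helper name, and it expects `pid ppid state comm` lines on standard input.

```shell
#!/bin/bash
# zombies_by_parent: read "pid ppid state comm" lines on stdin and print
# "ppid count" for every parent that owns at least one zombie (state Z).
zombies_by_parent() {
    awk '$3 ~ /^Z/ { n[$2]++ } END { for (p in n) print p, n[p] }' | sort -n
}

# Usage on a live system:
#   ps -eo pid=,ppid=,state=,comm= | zombies_by_parent
```

A parent that shows a steadily growing count is most likely not calling wait() on its children and is the process to fix or restart.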

4.1.2 Diagnosing Hung Processes

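# Check that the process exists and can be signaled (exit status 0 means yes;
# this does not prove the process is responsive)
kill -0 PID

# Attach gdb and print the stack trace
gdb -p PID
(gdb) bt
(gdb) quit

# Trace system calls to see where the process is blocked
strace -p PID

# Inspect the open file descriptors of the process
lsof -p PID

# Inspect the listening sockets of the process
netstat -tulpn | grep PID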

4.2 Diagnosing Resource Leaks

4.2.1 Memory Leaks

# Track the memory growth of a process
#!/bin/bash
# memory_leak_detector.sh
PID=$1
if [ -z "$PID" ]; then
    echo "Usage: $0 <PID>"
    exit 1
fi

echo "Monitoring memory usage for PID $PID"
echo "Time,RSS(MB),VSZ(MB)"

while true; do
    if ps -p "$PID" > /dev/null 2>&1; then
        # ps reports rss and vsz in KiB; convert to MiB
        rss=$(ps -p "$PID" -o rss= | awk '{print $1/1024}')
        vsz=$(ps -p "$PID" -o vsz= | awk '{print $1/1024}')
        echo "$(date '+%H:%M:%S'),$rss,$vsz"
    else
        echo "Process $PID not found"
        break
    fi
    sleep 10
done
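
The CSV samples printed by the script above only hint at a leak when RSS keeps rising between the first and last sample. As a rough sketch, the helper below summarizes that growth; `rss_growth` is a hypothetical name, and it assumes the `Time,RSS,VSZ` column order used above, reading the saved output on standard input.

```shell
#!/bin/bash
# rss_growth: read "time,rss_mb,vsz_mb" CSV lines on stdin and report how much
# RSS changed between the first and last numeric sample.
rss_growth() {
    awk -F, '
        $2 ~ /^[0-9.]+$/ {            # skip the header and any non-numeric lines
            if (n == 0) first = $2
            last = $2
            n++
        }
        END {
            if (n < 2) { print "not enough samples"; exit 1 }
            printf "samples=%d first=%.1f last=%.1f delta=%.1f\n", n, first, last, last - first
        }'
}

# Usage:
#   ./memory_leak_detector.sh 1234 > mem.csv   # stop with Ctrl-C after a while
#   rss_growth < mem.csv
```

A positive delta alone is not proof of a leak, since caches and warm-up also grow RSS; it is the sustained trend over a long window that matters.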

4.2.2 File Descriptor Leaks

# Track the file descriptor usage of a process
#!/bin/bash
# fd_monitor.sh
PID=$1
if [ -z "$PID" ]; then
    echo "Usage: $0 <PID>"
    exit 1
fi

echo "Monitoring file descriptors for PID $PID"
echo "Time,FD Count"

while true; do
    if ps -p "$PID" > /dev/null 2>&1; then
        fd_count=$(ls "/proc/$PID/fd" 2>/dev/null | wc -l)
        echo "$(date '+%H:%M:%S'),$fd_count"

        if [ "$fd_count" -gt 1000 ]; then
            echo "WARNING: High FD count detected!"
            lsof -p "$PID" | head -20
        fi
    else
        echo "Process $PID not found"
        break
    fi
    sleep 5
done

4.3 Diagnosing Performance Bottlenecks

4.3.1 CPU Bottlenecks

# CPU bottleneck analysis script
#!/bin/bash
# cpu_bottleneck_analyzer.sh
PID=$1
DURATION=${2:-60}

echo "Analyzing CPU bottleneck for PID $PID for $DURATION seconds"

# Record samples with call stacks (-g is needed for the flame graph)
perf record -g -p "$PID" -- sleep "$DURATION"

# Print the report
perf report --stdio

# Generate a flame graph (requires the FlameGraph scripts on PATH)
perf script | stackcollapse-perf.pl | flamegraph.pl > cpu_flame.svg
echo "CPU flame graph saved to cpu_flame.svg"

4.3.2 I/O Bottlenecks

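# I/O bottleneck analysis script
#!/bin/bash
# io_bottleneck_analyzer.sh
PID=$1
DURATION=${2:-60}

echo "Analyzing I/O bottleneck for PID $PID for $DURATION seconds"

# Trace file and descriptor system calls in the background
# (trace=file alone does not include read, write, or close)
strace -e trace=file,desc -p "$PID" -o file_access.log &
STRACE_PID=$!

# Collect I/O statistics once per second; this blocks for $DURATION seconds,
# so both tools observe the same time window
pidstat -d -p "$PID" 1 "$DURATION" > io_stats.txt

# Stop tracing once sampling has finished
kill "$STRACE_PID"
wait "$STRACE_PID" 2>/dev/null

# Print the results
echo "I/O statistics:"
cat io_stats.txt

echo "File access patterns:"
grep -E "(open|read|write|close)" file_access.log | head -20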

5. Process Optimization Strategies

5.1 CPU Optimization

5.1.1 Adjusting Process Priority

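# Show the nice value of each process
ps -eo pid,ni,comm

# Start a command with, or change a process to, a higher priority
# (negative nice values require root)
nice -n -10 command
renice -n -10 -p PID

# Set CPU affinity
taskset -c 0,1 command
taskset -cp 0,1 PID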

5.1.2 Multithreading Optimization

# List the threads of a process
ps -T -p PID

# Show CPU usage per thread
top -H -p PID

# Pin a single thread: pass its thread ID (the SPID column of ps -T)
taskset -cp CPU_LIST TID

5.2 Memory Optimization

5.2.1 Optimizing Memory Usage

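# Show detailed memory mappings of a process
cat /proc/PID/smaps

# Summarize memory usage per mapping
pmap -x PID

# Optimize memory allocation in the application:
# - use memory pools
# - avoid frequent allocation and deallocation
# - use memory-mapped files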

5.2.2 Guarding Against Memory Leaks

# Memory limit watchdog script
#!/bin/bash
# memory_leak_prevention.sh
PROCESS_NAME=$1
MAX_MEMORY=${2:-1000} # MB

while true; do
    pids=$(pgrep "$PROCESS_NAME")

    for pid in $pids; do
        # ps reports rss in KiB; convert to MiB
        memory_mb=$(ps -p "$pid" -o rss= | awk '{print $1/1024}')

        if (( $(echo "$memory_mb > $MAX_MEMORY" | bc -l) )); then
            echo "WARNING: Process $pid ($PROCESS_NAME) using ${memory_mb}MB memory"

            # Optionally restart the process or send an alert
            # kill -TERM $pid
        fi
    done

    sleep 300 # check every 5 minutes
done

5.3 I/O Optimization

5.3.1 Disk I/O Optimization

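# Set I/O priority (class 1 = realtime, level 0 = highest;
# requires root and can starve other processes)
ionice -c 1 -n 0 -p PID

# Drop the page cache, dentries, and inodes. This frees memory for testing or
# benchmarking; it does not speed up I/O, and reads are slower until the cache refills
echo 3 > /proc/sys/vm/drop_caches

# Use asynchronous I/O
# In the application, use aio_read/aio_write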

5.3.2 Network I/O Optimization

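# Raise the maximum socket buffer sizes to 16 MiB
echo 16777216 > /proc/sys/net/core/rmem_max
echo 16777216 > /proc/sys/net/core/wmem_max

# Prioritize traffic sent from source port PORT
# (tc classifies by packet fields such as the port, not by process)
tc qdisc add dev eth0 root handle 1: prio
tc filter add dev eth0 parent 1: protocol ip prio 1 u32 match ip sport PORT 0xffff flowid 1:1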

6. Process Monitoring Best Practices

6.1 Monitoring Strategy

6.1.1 Layered Monitoring

# System-level monitoring
#!/bin/bash
# system_process_monitor.sh
# "state=" prints a one-letter state without a header, which avoids counting
# the header line and handles STAT values such as "Ss" or "R+"
echo "=== System Process Overview ==="
echo "Total processes: $(ps -eo pid= | wc -l)"
echo "Running processes: $(ps -eo state= | grep -c '^R')"
echo "Sleeping processes: $(ps -eo state= | grep -c '^S')"
echo "Zombie processes: $(ps -eo state= | grep -c '^Z')"

echo -e "\n=== Top CPU Consumers ==="
ps aux --sort=-%cpu | head -10

echo -e "\n=== Top Memory Consumers ==="
ps aux --sort=-%mem | head -10

echo -e "\n=== Process States Distribution ==="
ps -eo state= | sort | uniq -c | sort -nr

6.1.2 Monitoring Critical Processes

# Critical process monitoring script
#!/bin/bash
# critical_process_monitor.sh
CRITICAL_PROCESSES=("nginx" "mysql" "redis" "java")

for process in "${CRITICAL_PROCESSES[@]}"; do
    pids=$(pgrep "$process")

    if [ -z "$pids" ]; then
        echo "CRITICAL: Process $process is not running!"
        # Send an alert
        # send_alert "Process $process is down"
        continue
    fi

    # pgrep may return several PIDs (for example nginx workers), so check each one
    for pid in $pids; do
        echo "OK: Process $process is running (PID: $pid)"

        # Check resource usage
        cpu_usage=$(ps -p "$pid" -o %cpu= | tr -d ' ')
        mem_usage=$(ps -p "$pid" -o %mem= | tr -d ' ')

        if (( $(echo "$cpu_usage > 80" | bc -l) )); then
            echo "WARNING: Process $process CPU usage is ${cpu_usage}%"
        fi

        if (( $(echo "$mem_usage > 80" | bc -l) )); then
            echo "WARNING: Process $process memory usage is ${mem_usage}%"
        fi
    done
done

6.2 Alerting

6.2.1 Alert Script

#!/bin/bash
# process_alert.sh
ALERT_EMAIL="admin@company.com"
ALERT_WEBHOOK="https://hooks.slack.com/services/xxx"

# Send an alert by email
send_email_alert() {
    local subject="$1"
    local message="$2"
    echo "$message" | mail -s "$subject" "$ALERT_EMAIL"
}

# Post an alert to a Slack webhook (the message must not contain double quotes)
send_slack_alert() {
    local message="$1"
    curl -X POST -H 'Content-type: application/json' \
        --data "{\"text\":\"$message\"}" \
        "$ALERT_WEBHOOK"
}

# Alert and return 1 if no process matches the name
check_process_status() {
    local process_name="$1"

    if ! pgrep "$process_name" > /dev/null; then
        local alert_msg="CRITICAL: Process $process_name is not running!"
        send_email_alert "Process Down Alert" "$alert_msg"
        send_slack_alert "$alert_msg"
        return 1
    fi

    return 0
}

# Alert when one PID exceeds 90% CPU or memory
check_resource_usage() {
    local pid="$1"
    local process_name="$2"

    local cpu_usage=$(ps -p "$pid" -o %cpu= | tr -d ' ')
    local mem_usage=$(ps -p "$pid" -o %mem= | tr -d ' ')
    # The process may have exited between pgrep and ps
    [ -n "$cpu_usage" ] || return 0

    if (( $(echo "$cpu_usage > 90" | bc -l) )); then
        local alert_msg="WARNING: Process $process_name CPU usage is ${cpu_usage}%"
        send_email_alert "High CPU Usage" "$alert_msg"
        send_slack_alert "$alert_msg"
    fi

    if (( $(echo "$mem_usage > 90" | bc -l) )); then
        local alert_msg="WARNING: Process $process_name memory usage is ${mem_usage}%"
        send_email_alert "High Memory Usage" "$alert_msg"
        send_slack_alert "$alert_msg"
    fi
}

# Main check logic
main() {
    local processes=("nginx" "mysql" "redis" "java")
    local process pid

    for process in "${processes[@]}"; do
        if check_process_status "$process"; then
            # pgrep may return several PIDs, so check each one
            for pid in $(pgrep "$process"); do
                check_resource_usage "$pid" "$process"
            done
        fi
    done
}

main

6.3 Automated Operations

6.3.1 Automatic Process Restart

#!/bin/bash
# process_auto_restart.sh
PROCESS_NAME="$1"
MAX_RESTARTS=${2:-3}
RESTART_INTERVAL=${3:-300} # 5 minutes

if [ -z "$PROCESS_NAME" ]; then
    echo "Usage: $0 <process_name> [max_restarts] [restart_interval]"
    exit 1
fi

# Placeholder alert hook; replace with email, webhook, or similar
send_alert() {
    echo "$(date): ALERT: $1" >&2
}

restart_count=0
last_restart=0

while true; do
    if ! pgrep "$PROCESS_NAME" > /dev/null; then
        current_time=$(date +%s)

        if [ $((current_time - last_restart)) -gt "$RESTART_INTERVAL" ]; then
            if [ "$restart_count" -lt "$MAX_RESTARTS" ]; then
                echo "$(date): Process $PROCESS_NAME not running, attempting restart..."

                # Run the restart command that matches the process type
                # (systemd unit names vary by distribution)
                case "$PROCESS_NAME" in
                    "nginx")
                        systemctl start nginx
                        ;;
                    "mysql")
                        systemctl start mysql
                        ;;
                    "redis")
                        systemctl start redis
                        ;;
                    *)
                        echo "Unknown process type: $PROCESS_NAME"
                        ;;
                esac

                restart_count=$((restart_count + 1))
                last_restart=$current_time

                sleep 10

                if pgrep "$PROCESS_NAME" > /dev/null; then
                    echo "$(date): Process $PROCESS_NAME restarted successfully"
                    restart_count=0 # reset the counter
                else
                    echo "$(date): Failed to restart process $PROCESS_NAME"
                fi
            else
                echo "$(date): Maximum restart attempts reached for $PROCESS_NAME"
                # Send an alert
                send_alert "Process $PROCESS_NAME failed to restart after $MAX_RESTARTS attempts"
            fi
        fi
    else
        restart_count=0 # process is running normally, reset the counter
    fi

    sleep 60
done
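
The script above waits a fixed RESTART_INTERVAL between attempts. A common alternative is exponential backoff, where the wait doubles after each failed attempt up to a cap. The sketch below only computes the delay; `backoff_delay` and its default values (5-second base, 300-second cap) are assumptions for illustration, not part of the original script.

```shell
#!/bin/bash
# backoff_delay ATTEMPT [BASE] [MAX]: seconds to wait before restart attempt
# number ATTEMPT (1-based). The delay doubles per attempt and is capped at MAX.
backoff_delay() {
    local attempt="$1" base="${2:-5}" max="${3:-300}"
    local delay="$base" i=1
    if [ "$delay" -gt "$max" ]; then
        delay="$max"
    fi
    while [ "$i" -lt "$attempt" ] && [ "$delay" -lt "$max" ]; do
        delay=$(( delay * 2 ))
        if [ "$delay" -gt "$max" ]; then
            delay="$max"
        fi
        i=$(( i + 1 ))
    done
    echo "$delay"
}

# Usage inside a restart loop:
#   sleep "$(backoff_delay "$restart_count")"
```

Backing off this way keeps a crash-looping service from being restarted at full speed while still recovering quickly from a one-off failure.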

6.3.2 Process Health Checks

#!/bin/bash
# process_health_check.sh
HEALTH_CHECK_INTERVAL=30
LOG_FILE="/var/log/process_health.log"

# Append a timestamped message to the health log
log_health_status() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" >> "$LOG_FILE"
}

check_process_health() {
    local pid="$1"
    local process_name="$2"

    # Check that the process exists
    if ! ps -p "$pid" > /dev/null 2>&1; then
        log_health_status "CRITICAL: Process $process_name (PID: $pid) is not running"
        return 1
    fi

    # Check the process state
    local status=$(ps -p "$pid" -o stat= | cut -c1)
    if [ "$status" = "Z" ]; then
        log_health_status "CRITICAL: Process $process_name (PID: $pid) is zombie"
        return 1
    fi

    # Check resource usage
    local cpu_usage=$(ps -p "$pid" -o %cpu= | tr -d ' ')
    local mem_usage=$(ps -p "$pid" -o %mem= | tr -d ' ')

    if (( $(echo "$cpu_usage > 95" | bc -l) )); then
        log_health_status "WARNING: Process $process_name CPU usage is ${cpu_usage}%"
    fi

    if (( $(echo "$mem_usage > 95" | bc -l) )); then
        log_health_status "WARNING: Process $process_name memory usage is ${mem_usage}%"
    fi

    # Check file descriptors
    local fd_count=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)
    if [ "$fd_count" -gt 10000 ]; then
        log_health_status "WARNING: Process $process_name has $fd_count file descriptors"
    fi

    # kill -0 only tests that the PID exists and that we may signal it;
    # it does not prove the process is responsive
    if ! kill -0 "$pid" 2>/dev/null; then
        log_health_status "WARNING: Process $process_name cannot be signaled (exited or permission denied)"
    fi

    log_health_status "OK: Process $process_name is healthy (CPU: ${cpu_usage}%, MEM: ${mem_usage}%, FD: $fd_count)"
    return 0
}

main() {
    local processes=("nginx" "mysql" "redis" "java")

    while true; do
        for process in "${processes[@]}"; do
            local pids=$(pgrep "$process")

            if [ -n "$pids" ]; then
                for pid in $pids; do
                    check_process_health "$pid" "$process"
                done
            else
                log_health_status "CRITICAL: No instances of process $process found"
            fi
        done

        sleep "$HEALTH_CHECK_INTERVAL"
    done
}

main

7. Integrating Monitoring Tools

7.1 Prometheus Integration

7.1.1 Node Exporter Configuration

# prometheus.yml (Prometheus scrape configuration for Node Exporter)
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']
    scrape_interval: 5s
    metrics_path: /metrics

7.1.2 Process Metrics

# Custom process metrics
#!/bin/bash
# process_metrics_exporter.sh
METRICS_FILE="/tmp/process_metrics.prom"

generate_metrics() {
    local tmp_file="${METRICS_FILE}.tmp"

    cat > "$tmp_file" << EOF
# HELP process_cpu_usage_percent CPU usage percentage
# TYPE process_cpu_usage_percent gauge
EOF

    ps aux | awk 'NR>1 {
        print "process_cpu_usage_percent{pid=\"" $2 "\",cmd=\"" $11 "\"} " $3
    }' >> "$tmp_file"

    cat >> "$tmp_file" << EOF

# HELP process_memory_usage_percent Memory usage percentage
# TYPE process_memory_usage_percent gauge
EOF

    ps aux | awk 'NR>1 {
        print "process_memory_usage_percent{pid=\"" $2 "\",cmd=\"" $11 "\"} " $4
    }' >> "$tmp_file"

    cat >> "$tmp_file" << EOF

# HELP process_memory_rss_bytes Resident memory size in bytes
# TYPE process_memory_rss_bytes gauge
EOF

    # ps reports RSS in KiB; convert to bytes
    ps aux | awk 'NR>1 {
        print "process_memory_rss_bytes{pid=\"" $2 "\",cmd=\"" $11 "\"} " ($6 * 1024)
    }' >> "$tmp_file"

    # Publish atomically so the HTTP server never reads a half-written file
    mv "$tmp_file" "$METRICS_FILE"
}

# Start an HTTP server that serves the metrics
start_metrics_server() {
    # Regenerate the metrics file every 10 seconds in the background
    while true; do
        generate_metrics
        sleep 10
    done &

    # Serve the file over HTTP with Python
    python3 -c "
import http.server
import socketserver

class MetricsHandler(http.server.SimpleHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/metrics':
            self.send_response(200)
            self.send_header('Content-type', 'text/plain')
            self.end_headers()
            with open('$METRICS_FILE', 'r') as f:
                self.wfile.write(f.read().encode())
        else:
            self.send_response(404)
            self.end_headers()

with socketserver.TCPServer(('', 8080), MetricsHandler) as httpd:
    httpd.serve_forever()
"
}

start_metrics_server
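
One detail the exporter above glosses over: in the Prometheus text exposition format, a backslash or double quote inside a label value must be escaped, otherwise the scrape fails to parse. The following is a minimal sketch of that escaping; `escape_label_value` and `emit_gauge` are hypothetical helper names, and newlines in values (also escaped as `\n` in the format) are not handled here.

```shell
#!/bin/bash
# escape_label_value STRING: escape backslashes and double quotes so the
# string can be used as a Prometheus label value. Newlines are not handled.
escape_label_value() {
    printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# emit_gauge NAME PID CMD VALUE: print one sample line with pid and cmd labels.
emit_gauge() {
    printf '%s{pid="%s",cmd="%s"} %s\n' "$1" "$2" "$(escape_label_value "$3")" "$4"
}

# Usage:
#   emit_gauge process_cpu_usage_percent 42 'my "app"' 1.5
```

Keep in mind that a `pid` label creates a new time series for every process that ever runs, so label cardinality can grow quickly on busy hosts.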

7.2 Grafana Dashboards

7.2.1 Process Monitoring Dashboard Configuration

{
  "dashboard": {
    "title": "Process Monitoring Dashboard",
    "panels": [
      {
        "title": "Process CPU Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "process_cpu_usage_percent",
            "legendFormat": "{{cmd}} ({{pid}})"
          }
        ]
      },
      {
        "title": "Process Memory Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "process_memory_usage_percent",
            "legendFormat": "{{cmd}} ({{pid}})"
          }
        ]
      },
      {
        "title": "Process Count",
        "type": "stat",
        "targets": [
          {
            "expr": "count(process_cpu_usage_percent)"
          }
        ]
      }
    ]
  }
}

8. Practical Examples

8.1 Monitoring Java Applications

8.1.1 JVM Process Monitoring

#!/bin/bash
# java_process_monitor.sh
JAVA_PID=$1

if [ -z "$JAVA_PID" ]; then
    echo "Usage: $0 <java_pid>"
    exit 1
fi

echo "=== Java Process Monitoring for PID $JAVA_PID ==="

# Basic process information
echo "Process Info:"
ps -p "$JAVA_PID" -o pid,ppid,cmd,%cpu,%mem,vsz,rss,tty,stat,start,time

# JVM flags and system properties
echo -e "\nJVM Info:"
jinfo "$JAVA_PID"

# Heap usage
echo -e "\nHeap Memory:"
jstat -gc "$JAVA_PID"

# Thread dump (first 50 lines)
echo -e "\nThread Info:"
jstack "$JAVA_PID" | head -50

# GC utilization
echo -e "\nGC Statistics:"
jstat -gcutil "$JAVA_PID"

# Memory mappings
echo -e "\nMemory Map:"
pmap -x "$JAVA_PID"

8.1.2 Java Application Performance Analysis

#!/bin/bash
# java_performance_analyzer.sh
JAVA_PID=$1
DURATION=${2:-60}

echo "Analyzing Java application performance for PID $JAVA_PID"

# CPU analysis
echo "=== CPU Analysis ==="
perf record -p "$JAVA_PID" -- sleep "$DURATION"
perf report --stdio

# Memory analysis (a heap dump pauses the JVM while it is written
# and can be as large as the heap)
echo "=== Memory Analysis ==="
jmap -histo "$JAVA_PID" > memory_histogram.txt
jmap -dump:format=b,file=heap_dump.hprof "$JAVA_PID"

# Thread analysis
echo "=== Thread Analysis ==="
jstack "$JAVA_PID" > thread_dump.txt

# GC analysis: one sample per second for $DURATION samples
echo "=== GC Analysis ==="
jstat -gc "$JAVA_PID" 1s "$DURATION" > gc_stats.txt

echo "Analysis complete. Check the generated files for details."

8.2 Monitoring Web Servers

8.2.1 Nginx Process Monitoring

#!/bin/bash
# nginx_process_monitor.sh
echo "=== Nginx Process Monitoring ==="

# Check that Nginx is running
nginx_pids=$(pgrep nginx)
if [ -z "$nginx_pids" ]; then
    echo "ERROR: Nginx is not running!"
    exit 1
fi

echo "Nginx PIDs:" $nginx_pids

# Master process
master_pid=$(pgrep -f "nginx: master process")
echo "Master PID: $master_pid"

# Worker processes
worker_pids=$(pgrep -f "nginx: worker process")
echo "Worker PIDs:" $worker_pids

# Process status (ps -p takes a comma-separated PID list)
echo -e "\nProcess Status:"
ps -p "$(pgrep -d, nginx)" -o pid,ppid,cmd,%cpu,%mem,vsz,rss,tty,stat,start,time

# Connection counts: match the local address column exactly,
# so that :8080 or a remote port 80 is not counted
echo -e "\nConnection Statistics:"
echo "Port 80: $(netstat -ant | awk '$4 ~ /:80$/' | wc -l)"
echo "Port 443: $(netstat -ant | awk '$4 ~ /:443$/' | wc -l)"

# File descriptor usage
echo -e "\nFile Descriptor Usage:"
for pid in $nginx_pids; do
    fd_count=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)
    echo "PID $pid: $fd_count file descriptors"
done

# Memory usage
echo -e "\nMemory Usage:"
for pid in $nginx_pids; do
    memory_mb=$(ps -p "$pid" -o rss= | awk '{print $1/1024}')
    echo "PID $pid: ${memory_mb}MB memory"
done

8.2.2 Apache Process Monitoring

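#!/bin/bash
# apache_process_monitor.sh
# The binary is httpd on RHEL/CentOS; on Debian/Ubuntu it is apache2
echo "=== Apache Process Monitoring ==="

# Check that Apache is running
apache_pids=$(pgrep httpd)
if [ -z "$apache_pids" ]; then
    echo "ERROR: Apache is not running!"
    exit 1
fi

echo "Apache PIDs:" $apache_pids

# Master process: assumed here to be the oldest httpd process, because
# workers share the parent's command line and cannot be told apart by it
master_pid=$(pgrep -o httpd)
echo "Master PID: $master_pid"

# Worker processes: the children of the master
worker_pids=$(pgrep -P "$master_pid")
echo "Worker PIDs:" $worker_pids

# Process status (ps -p takes a comma-separated PID list)
echo -e "\nProcess Status:"
ps -p "$(pgrep -d, httpd)" -o pid,ppid,cmd,%cpu,%mem,vsz,rss,tty,stat,start,time

# Connection counts: match the local address column exactly
echo -e "\nConnection Statistics:"
echo "Port 80: $(netstat -ant | awk '$4 ~ /:80$/' | wc -l)"
echo "Port 443: $(netstat -ant | awk '$4 ~ /:443$/' | wc -l)"

# Loaded modules
echo -e "\nLoaded Modules:"
httpd -M 2>/dev/null | head -20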

8.3 Monitoring Databases

8.3.1 MySQL Process Monitoring

#!/bin/bash
# mysql_process_monitor.sh
echo "=== MySQL Process Monitoring ==="

# Check that MySQL is running
mysql_pids=$(pgrep mysqld)
if [ -z "$mysql_pids" ]; then
    echo "ERROR: MySQL is not running!"
    exit 1
fi

echo "MySQL PIDs:" $mysql_pids

# Main process: assumed here to be the oldest match (this may be the
# mysqld_safe wrapper when it is in use)
master_pid=$(pgrep -o mysqld)
echo "Master PID: $master_pid"

# Process status (ps -p takes a comma-separated PID list)
echo -e "\nProcess Status:"
ps -p "$(pgrep -d, mysqld)" -o pid,ppid,cmd,%cpu,%mem,vsz,rss,tty,stat,start,time

# Connection statistics (credentials are expected in ~/.my.cnf)
echo -e "\nConnection Statistics:"
mysql -e "SHOW STATUS LIKE 'Connections';" 2>/dev/null
mysql -e "SHOW STATUS LIKE 'Threads_connected';" 2>/dev/null
mysql -e "SHOW STATUS LIKE 'Threads_running';" 2>/dev/null

# InnoDB buffer pool usage
echo -e "\nMemory Usage:"
mysql -e "SHOW STATUS LIKE 'Innodb_buffer_pool_pages_data';" 2>/dev/null
mysql -e "SHOW STATUS LIKE 'Innodb_buffer_pool_pages_total';" 2>/dev/null

# File descriptor usage
echo -e "\nFile Descriptor Usage:"
for pid in $mysql_pids; do
    fd_count=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)
    echo "PID $pid: $fd_count file descriptors"
done

8.3.2 Redis Process Monitoring

#!/bin/bash
# redis_process_monitor.sh
echo "=== Redis Process Monitoring ==="

# Check that Redis is running
redis_pids=$(pgrep redis-server)
if [ -z "$redis_pids" ]; then
    echo "ERROR: Redis is not running!"
    exit 1
fi

echo "Redis PIDs:" $redis_pids

# Process status (ps -p takes a comma-separated PID list)
echo -e "\nProcess Status:"
ps -p "$(pgrep -d, redis-server)" -o pid,ppid,cmd,%cpu,%mem,vsz,rss,tty,stat,start,time

# Redis server information
echo -e "\nRedis Info:"
redis-cli info server 2>/dev/null | head -10
redis-cli info memory 2>/dev/null | head -10
redis-cli info stats 2>/dev/null | head -10

# Connection statistics
echo -e "\nConnection Statistics:"
redis-cli info clients 2>/dev/null

# Memory usage
echo -e "\nMemory Usage:"
redis-cli info memory 2>/dev/null | grep -E "(used_memory|used_memory_peak|used_memory_rss)"

# File descriptor usage
echo -e "\nFile Descriptor Usage:"
for pid in $redis_pids; do
    fd_count=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)
    echo "PID $pid: $fd_count file descriptors"
done

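9. Summary

Process monitoring is a core operations skill. With a sound monitoring strategy and the right tools, you can:

  1. Prevent failures: detect abnormal processes early and avoid service outages
  2. Improve performance: identify bottlenecks and adjust resource allocation
  3. Increase reliability: keep critical processes running stably
  4. Reduce operating cost: automate monitoring and alerting to cut manual work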

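9.1 Key Points

  • Comprehensive coverage: monitor CPU, memory, I/O, network, and other key metrics
  • Real-time alerting: build an alerting mechanism that surfaces problems promptly
  • Automated operations: implement automatic restarts and health checks
  • Performance analysis: use dedicated tools for in-depth analysis
  • Continuous improvement: keep tuning system configuration based on monitoring data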

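9.2 Best Practices

  1. Layered monitoring: cover every level from the system down to the application
  2. Threshold tuning: set thresholds sensibly to avoid false alarms
  3. Tool integration: combine several monitoring tools into a unified view
  4. Documentation: record monitoring configuration and incident-handling procedures in detail
  5. Regular drills: run failure drills periodically to verify that monitoring works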
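
One way to reduce false alarms from a threshold (item 2 above) is to alert only after several consecutive breaches instead of on a single spike. The sketch below keeps the streak in a small state file; `breach_streak` and the state file path are assumptions for illustration.

```shell
#!/bin/bash
# breach_streak VALUE THRESHOLD STATE_FILE: update and print the number of
# consecutive samples above THRESHOLD. The streak is persisted in STATE_FILE
# and resets to 0 as soon as one sample is at or below the threshold.
breach_streak() {
    local value="$1" threshold="$2" state_file="$3" streak=""
    if [ -r "$state_file" ]; then
        streak=$(cat "$state_file")
    fi
    streak=${streak:-0}
    # awk handles fractional values such as 87.5, which [ -gt ] cannot compare.
    if awk -v v="$value" -v t="$threshold" 'BEGIN { exit !(v > t) }'; then
        streak=$(( streak + 1 ))
    else
        streak=0
    fi
    echo "$streak" > "$state_file"
    echo "$streak"
}

# Usage: alert only on the third consecutive breach
#   if [ "$(breach_streak "$cpu_usage" 80 /tmp/nginx_cpu.streak)" -eq 3 ]; then
#       echo "CPU above 80% for 3 checks in a row"
#   fi
```

Comparing with `-eq 3` rather than `-ge 3` sends one alert per incident instead of repeating it on every later check.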

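After studying and practicing the material in this article, you should have the core skills of production-grade process monitoring and be able to monitor and manage the processes in a production environment so that systems run stably and efficiently.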