Monitoring Script Run Status with Prometheus and Pushgateway

Author: 搬砖的石头 | 2025.09.25 17:12

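Summary: This article describes how to combine Prometheus with Pushgateway to monitor script run status in near real time. It covers architecture design, configuration steps, pushing and querying data, and optimization suggestions for production environments.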

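I. Background and Requirements

In distributed systems and automated operations, the execution status of script tasks (success, failure, duration) matters a great deal for system stability. Traditional approaches such as log analysis or polling checks tend to surface problems slowly and scale poorly. Prometheus, an open-source monitoring system, is good at collecting time-series data and alerting, but it is designed around a pull model in which the server scrapes long-running targets. Short-lived script tasks (scheduled backups, data cleaning) need Pushgateway as an intermediate cache for their metrics, which addresses the following pain points:

Short-lived task monitoring: a script exits when it finishes, so it cannot expose an HTTP endpoint for Prometheus to scrape.
Batch task aggregation: results from scripts on many nodes are collected in one place, without configuring complex service discovery.
Flexible alert rules: alerts can be triggered on execution results such as the exit code or a duration threshold.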

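II. Architecture Design

1. Component Roles

Script task: the Python or Shell script to be monitored; it pushes metrics to Pushgateway after it runs.
Pushgateway: an intermediate cache that receives the pushed metrics and holds them for scraping (in memory by default; it can optionally persist them to a file on disk).
Prometheus Server: periodically scrapes metrics from Pushgateway, stores them in its TSDB, and evaluates alert rules.
Alertmanager: receives alerts from Prometheus and notifies operators by email or webhook.
Grafana (optional): visualizes script execution trends and alert history.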

2. Data Flow

Script task → Pushgateway (HTTP push)
Prometheus → Pushgateway (HTTP pull, i.e. scrape)
Prometheus → Alertmanager → notification channels

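III. Implementation Steps

1. Deploy Pushgateway

# Quick deployment with Docker
docker run -d --name pushgateway -p 9091:9091 prom/pushgateway

Verify the service: curl http://localhost:9091/metrics should return HTTP 200. Before any script has pushed, the output contains only Pushgateway's own process metrics; pushed metrics appear there afterwards.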

2. Integrate Scripts with Pushgateway

Example: pushing metrics from a Python script

import textwrap
import time

import requests
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

PUSHGATEWAY = 'http://pushgateway:9091'
JOB = 'script_monitor'
# Task and instance identity; these become labels on every pushed metric
GROUPING_KEY = {'instance': 'node1', 'script_name': 'data_backup'}


def run_task():
    # Simulated script logic (replace with the actual task)
    pass


def push_duration(duration):
    # Push the duration with the client library. push_to_gateway sends a PUT,
    # which replaces every metric previously stored for this grouping key.
    registry = CollectorRegistry()
    g = Gauge('script_duration_seconds', 'Execution duration', registry=registry)
    g.set(duration)
    push_to_gateway(PUSHGATEWAY, job=JOB, grouping_key=GROUPING_KEY, registry=registry)


def push_result(exit_code):
    # Optional: push text-format metrics directly (same format a Shell script uses).
    # POST only replaces metrics with the same name, so the duration pushed above is kept.
    # The instance and script_name labels are taken from the URL path.
    metrics = textwrap.dedent(f"""\
        # HELP script_exit_code Exit code of the script (0=success)
        # TYPE script_exit_code gauge
        script_exit_code {exit_code}
        # HELP script_result Result of the script (1=success, 0=failure)
        # TYPE script_result gauge
        script_result {1 if exit_code == 0 else 0}
        """)
    labels_path = '/'.join(f'{name}/{value}' for name, value in GROUPING_KEY.items())
    url = f'{PUSHGATEWAY}/metrics/job/{JOB}/{labels_path}'
    requests.post(url, data=metrics.encode('utf-8'), timeout=10).raise_for_status()


def main():
    start_time = time.time()
    exit_code = 0  # 0 means success, non-zero means failure
    try:
        run_task()
    except Exception:
        exit_code = 1
    duration = time.time() - start_time
    # Push on both success and failure so that a failed run is visible too
    push_duration(duration)
    push_result(exit_code)
    return exit_code


if __name__ == '__main__':
    raise SystemExit(main())
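
The push URL above encodes the grouping key as alternating label name and value path segments. The following is a minimal sketch of that mapping using only the standard library. The helper names build_push_url and _segment are made up for illustration; the "@base64" suffix for values containing a slash follows the Pushgateway URL convention, and prometheus_client handles this internally when push_to_gateway is used.

```python
import base64
from urllib.parse import quote


def _segment(name, value):
    # A label value containing "/" cannot be placed in the path directly.
    # Pushgateway accepts it as URL-safe base64 when the label name
    # carries the "@base64" suffix.
    if "/" in value:
        encoded = base64.urlsafe_b64encode(value.encode("utf-8")).decode("ascii")
        return f"{name}@base64/{encoded}"
    return f"{name}/{quote(value, safe='')}"


def build_push_url(base_url, job, grouping_key):
    """Return the Pushgateway URL for a job and its grouping key labels."""
    parts = [base_url.rstrip("/"), "metrics", _segment("job", job)]
    for name, value in grouping_key.items():
        parts.append(_segment(name, value))
    return "/".join(parts)


if __name__ == "__main__":
    print(build_push_url(
        "http://pushgateway:9091",
        "script_monitor",
        {"instance": "node1", "script_name": "data_backup"},
    ))
```

Two scripts that build the same URL write to the same group, so one overwrites the other; that is why the grouping key should identify the script and the node it runs on.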

Shell script example

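#!/bin/bash
# Run the task and record its exit code and duration
start_time=$(date +%s)
/path/to/your/script.sh
exit_code=$?
end_time=$(date +%s)
duration=$((end_time - start_time))
# Generate the metrics in Prometheus text format and push them to Pushgateway.
# The instance and script_name labels come from the URL path (the grouping key).
# --data-binary @- sends the heredoc unchanged, including the final newline
# that the text format requires.
cat <<EOF | curl --silent --show-error --fail --data-binary @- \
  http://pushgateway:9091/metrics/job/script_monitor/instance/node1/script_name/cleanup
# HELP script_duration_seconds Execution duration in seconds
# TYPE script_duration_seconds gauge
script_duration_seconds $duration
# HELP script_exit_code Exit code (0=success)
# TYPE script_exit_code gauge
script_exit_code $exit_code
# HELP script_result Result of the script (1=success, 0=failure)
# TYPE script_result gauge
script_result $((exit_code == 0 ? 1 : 0))
EOF
# Propagate the task's exit code to the caller (for example cron)
exit "$exit_code"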
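
The same wrapper pattern (run a command, time it, render the metrics as text) can be sketched in Python for teams that prefer a single language. This is an illustrative sketch, not part of the original scripts: run_and_render is a hypothetical helper, and it only builds the payload, leaving the HTTP push to the caller.

```python
import subprocess
import sys
import time


def run_and_render(command):
    """Run a command and return (exit_code, metrics in Prometheus text format)."""
    start = time.monotonic()
    exit_code = subprocess.run(command).returncode
    duration = time.monotonic() - start
    lines = [
        "# HELP script_duration_seconds Execution duration in seconds",
        "# TYPE script_duration_seconds gauge",
        f"script_duration_seconds {duration:.3f}",
        "# HELP script_exit_code Exit code (0=success)",
        "# TYPE script_exit_code gauge",
        f"script_exit_code {exit_code}",
        "# HELP script_result Result of the script (1=success, 0=failure)",
        "# TYPE script_result gauge",
        f"script_result {1 if exit_code == 0 else 0}",
    ]
    # The text format requires the body to end with a newline.
    return exit_code, "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example: a child process that exits with status 3
    code, body = run_and_render([sys.executable, "-c", "import sys; sys.exit(3)"])
    print(body)
    sys.exit(code)
```

The returned body can be sent with an HTTP POST to the group URL, in the same way the Shell script pipes its heredoc to curl.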

3. Prometheus Configuration

Add a scrape job to prometheus.yml:

scrape_configs:
  - job_name: 'pushgateway'
    static_configs:
      - targets: ['pushgateway:9091']
    metrics_path: '/metrics'
    honor_labels: true  # Keep the labels pushed to Pushgateway (such as job and instance)
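
Why honor_labels matters here can be shown with a small model of how Prometheus resolves label conflicts at scrape time. The function below is a simplified illustration written for this article, not Prometheus code: when a scraped label collides with a target label, honor_labels: true keeps the scraped value, while the default (false) keeps the target value and renames the scraped one with an exported_ prefix.

```python
def resolve_labels(scraped, target, honor_labels):
    """Model how scraped labels and target labels are merged."""
    merged = dict(scraped)
    for name, value in target.items():
        if name not in scraped:
            merged[name] = value
        elif not honor_labels:
            # Conflict: the target label wins, the scraped one is renamed
            merged["exported_" + name] = scraped[name]
            merged[name] = value
        # With honor_labels=True the scraped value is kept unchanged
    return merged


if __name__ == "__main__":
    scraped = {"job": "script_monitor", "instance": "node1", "script_name": "data_backup"}
    target = {"job": "pushgateway", "instance": "pushgateway:9091"}
    print(resolve_labels(scraped, target, honor_labels=True))
    print(resolve_labels(scraped, target, honor_labels=False))
```

Without honor_labels: true, the series would carry job="pushgateway", and alert expressions that select job="script_monitor" would match nothing.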

4. Alert Rule Configuration

Define the alerts in alert.rules.yml:

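groups:
- name: script-alerts
  rules:
  - alert: ScriptFailure
    expr: script_result{job="script_monitor"} == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Script {{ $labels.script_name }} failed on {{ $labels.instance }}"
      # $value here is script_result (0), so the exit code and duration are looked up with query
      description: >-
        Exit code: {{ with printf "script_exit_code{instance='%s',script_name='%s'}" $labels.instance $labels.script_name | query }}{{ . | first | value }}{{ end }},
        duration: {{ with printf "script_duration_seconds{instance='%s',script_name='%s'}" $labels.instance $labels.script_name | query }}{{ . | first | value }}{{ end }}s
  - alert: ScriptLongRunning
    # The duration is pushed when the script finishes, so this fires on the last completed run
    expr: script_duration_seconds{job="script_monitor"} > 300
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Last run of script {{ $labels.script_name }} on {{ $labels.instance }} took longer than 300s"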
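
The effect of the for clause is easy to misread, so here is a small model of it. This is a simplified sketch written for this article (alert_state is a hypothetical helper, not a Prometheus API): the alert is pending while the condition holds and starts firing only once it has held at every evaluation for the full duration.

```python
def alert_state(samples, for_seconds, condition):
    """Return 'inactive', 'pending' or 'firing' for the latest evaluation.

    samples is a list of (timestamp, value) pairs, one per rule evaluation,
    oldest first.
    """
    active_since = None
    for timestamp, value in samples:
        if condition(value):
            if active_since is None:
                active_since = timestamp
        else:
            # The condition stopped holding, so the timer starts over
            active_since = None
    if active_since is None:
        return "inactive"
    latest = samples[-1][0]
    return "firing" if latest - active_since >= for_seconds else "pending"


if __name__ == "__main__":
    failed = [(t, 0) for t in range(0, 301, 60)]  # script_result stays 0 for 5 minutes
    print(alert_state(failed, 300, lambda v: v == 0))
    recovered = failed[:3] + [(180, 1)]  # a successful rerun after 3 minutes
    print(alert_state(recovered, 300, lambda v: v == 0))
```

Because Pushgateway keeps exposing the last pushed value, a failure stays visible until the next push. With for: 5m, a script that fails and is rerun successfully within five minutes never reaches the firing state; whether that is desirable depends on the task.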

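IV. Production Optimization Suggestions

1. Data Cleanup Strategy

Pushgateway never expires pushed data on its own, so it has to be managed in one of the following ways:

Overwrite by grouping key: have each script push with a fixed, unique grouping_key. A push made with PUT (which is what push_to_gateway uses) replaces every metric in that group, so each run overwrites the previous one. To clear a group that is no longer needed, push an empty registry to it. Do this only after Prometheus has scraped the result, otherwise the result is lost.
```
push_to_gateway(..., registry=CollectorRegistry())  # Push an empty registry to clear the group
```
API cleanup: send a DELETE request to /metrics/job/<job_name> followed by the same label path that was used for the push. The path must match the full grouping key; DELETE /metrics/job/<job_name> on its own only removes the group that has no labels other than job.
Scheduled cleanup: use Cron to delete expired groups periodically (this needs an external script).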
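
The scheduled cleanup can be sketched as follows. Pushgateway exposes a push_time_seconds metric for every group, holding the Unix time of the last successful push, so an external script can read /metrics and pick out groups that have not been pushed to for a while. The helper name stale_groups and the regex-based parsing are assumptions made for brevity; the parsing does not handle label values that contain a closing brace.

```python
import re

# Matches lines such as: push_time_seconds{instance="node1",job="x"} 1.7e+09
_SAMPLE = re.compile(r'^push_time_seconds\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def stale_groups(metrics_text, now, max_age_seconds):
    """Return the label sets of groups whose last push is older than max_age_seconds."""
    stale = []
    for line in metrics_text.splitlines():
        match = _SAMPLE.match(line.strip())
        if not match:
            continue
        pushed_at = float(match.group("value"))
        if now - pushed_at > max_age_seconds:
            stale.append(dict(_LABEL.findall(match.group("labels"))))
    return stale


if __name__ == "__main__":
    sample = (
        '# TYPE push_time_seconds gauge\n'
        'push_time_seconds{instance="node1",job="script_monitor",script_name="data_backup"} 1000\n'
        'push_time_seconds{instance="node2",job="script_monitor",script_name="cleanup"} 9000\n'
    )
    for group in stale_groups(sample, now=10000, max_age_seconds=3600):
        print("would delete:", group)
```

Each returned label set can then be removed, for example with delete_from_gateway from prometheus_client, passing the job label as job and the remaining labels as grouping_key.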

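2. High Availability

Multiple Pushgateway instances: run several Pushgateway instances behind an Nginx reverse proxy. The instances do not share state, so pushes for a given grouping key should always be routed to the same instance (for example by hashing the key) rather than to a random one; otherwise stale copies of a group remain on the other instances.
Persistent storage: start Pushgateway with --persistence.file=/data/pushgateway.data so that pushed metrics survive a restart.
Prometheus remote storage: configure Prometheus to write data to a remote backend such as Thanos or InfluxDB for long-term retention.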

3. Security Hardening

Basic authentication: put Nginx in front of Pushgateway and enable Basic Auth.

location /metrics {
    auth_basic "Prometheus Pushgateway";
    auth_basic_user_file /etc/nginx/.htpasswd;
    proxy_pass http://pushgateway:9091;
}

Network isolation: restrict Pushgateway so that it is reachable only from the internal network.
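
Once Basic Auth is enabled on the proxy, scripts have to send credentials with every push. The sketch below builds such a request with the standard library only. The helper names and the user/pass credentials are placeholders; prometheus_client users can get the same result by passing a handler based on basic_auth_handler to push_to_gateway.

```python
import base64
import urllib.request


def basic_auth_header(username, password):
    """Return the value of the Authorization header for HTTP Basic Auth."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def build_push_request(url, body, username, password):
    """Build (but do not send) an authenticated POST with a text-format payload."""
    request = urllib.request.Request(url, data=body.encode("utf-8"), method="POST")
    request.add_header("Authorization", basic_auth_header(username, password))
    return request


if __name__ == "__main__":
    req = build_push_request(
        "http://pushgateway:9091/metrics/job/script_monitor/instance/node1",
        "script_result 1\n",
        "user",  # placeholder credentials
        "pass",
    )
    print(req.get_method(), req.full_url)
    # urllib.request.urlopen(req, timeout=10) would send it
```

Basic Auth sends the credentials in a reversible encoding, so it only protects the endpoint when the proxy is served over HTTPS or the traffic stays on a trusted network.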

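V. Common Problems and Solutions

1. Duplicate Metrics

Symptom: several time series for the same script show up in Prometheus.
Cause: the script pushed with a different grouping key than before (for example a changing instance value), or an old group was never cleaned up on Pushgateway.
Fix:

Use the same grouping_key for every push of a given script, so that each push overwrites the previous one.
Delete groups that are no longer pushed to, either through the DELETE API or by pushing an empty registry (registry=CollectorRegistry(), as in the cleanup example). Do not clear a group immediately after a push, or the result disappears before Prometheus scrapes it.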

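2. Alerts Do Not Fire

Checklist:

Confirm that Prometheus has scraped the data from Pushgateway:

curl -G http://prometheus:9090/api/v1/query --data-urlencode 'query=script_result{job="script_monitor"}'

Check that the alert rule syntax is correct (for example with promtool check rules alert.rules.yml).
Verify the Alertmanager configuration and routing rules.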

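3. Pushgateway Performance Bottlenecks

Suggestions:

As a rough guideline, keep a single Pushgateway instance under about 100,000 metrics.
Scale out: deploy several Pushgateway instances and have scripts choose one by hashing, so that the load is spread and each group always goes to the same instance.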
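
Hash-based distribution can be sketched like this. The function name pick_gateway and the instance URLs are hypothetical. hashlib is used instead of the built-in hash(), because string hashes in Python are randomized per process and would send the same group to different instances on different runs.

```python
import hashlib


def pick_gateway(job, grouping_key, gateways):
    """Choose a Pushgateway instance deterministically from the job and grouping key."""
    # Sort the labels so that dict ordering does not change the result
    key = job + "|" + "|".join(f"{k}={v}" for k, v in sorted(grouping_key.items()))
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(gateways)
    return gateways[index]


if __name__ == "__main__":
    gateways = ["http://pushgateway-0:9091", "http://pushgateway-1:9091"]
    target = pick_gateway(
        "script_monitor",
        {"instance": "node1", "script_name": "cleanup"},
        gateways,
    )
    print(target)
```

With this scheme Prometheus has to scrape every instance, and changing the number of instances remaps some groups, so the copies left on the old instances need to be deleted.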

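VI. Summary and Extensions

Combining Prometheus with Pushgateway provides an efficient way to monitor script run status, and it fits the following scenarios especially well:

Script tasks distributed across hosts.
Cases that need alerts based on execution results such as the exit code or duration.
Resource-constrained environments (such as edge devices) that cannot run a long-lived exporter.

Directions for extension:

Build dynamic Grafana dashboards that show aggregate metrics such as script success rate and average duration.
Integrate with the CI/CD pipeline so that monitoring is registered automatically when a script is deployed.
Use Prometheus recording rules to precompute frequently used metrics (such as the daily success rate).

With the practices in this article, developers can quickly put together a highly available, low-latency script monitoring setup that improves operational efficiency and system stability.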
