
Deploying Prometheus with Docker Compose

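Contents

      • Deploying Prometheus with Docker Compose
        • 1. Prerequisites
        • 2. Prepare the configuration files
        • 3. Write the Docker Compose file
        • 4. Start the services
        • 5. Verify the deployment
        • 6. Common operations
        • 7. Hardening for production
        • 8. Monitoring additional targets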


1. Prerequisites
  • Install Docker (version ≥ 20.10) and Docker Compose (version ≥ 1.29).
  • Create a project directory:
    mkdir prometheus && cd prometheus
    
2. Prepare the configuration files
  • Create the Prometheus configuration file
    prometheus.yml (basic configuration):

    global:
      scrape_interval: 15s
      evaluation_interval: 15s

    scrape_configs:
      - job_name: "prometheus"
        static_configs:
          - targets: ["localhost:9090"]  # Prometheus scrapes itself

      # Example: add Node Exporter (must be deployed separately)
      # - job_name: "node"
      #   static_configs:
      #     - targets: ["node-exporter:9100"]
    
  • Create the alert rule files (optional)
    alerts.yml

    groups:
      - name: example
        rules:
          - alert: InstanceDown
            expr: up == 0
            for: 1m
            labels:
              severity: critical
            annotations:
              summary: "Instance {{ $labels.instance }} down"
    

    linux_rules.yml

    groups:
      - name: linux-system-rules
        rules:
          # CPU rules
          - alert: HighCpuLoad
            expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "High CPU load on {{ $labels.instance }}"
              description: "CPU usage is {{ $value }}% for last 10 minutes"

          # Memory rules
          - alert: HighMemoryUsage
            expr: (node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes) / node_memory_MemTotal_bytes * 100 > 5  # threshold lowered to trigger a test alert; raise it for real use
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "High memory usage on {{ $labels.instance }}"
              description: "Memory usage is {{ $value }}% for last 10 minutes"

          # Swap rules
          - alert: HighSwapUsage
            expr: (node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / node_memory_SwapTotal_bytes * 100 > 50
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: "High swap usage on {{ $labels.instance }}"
              description: "Swap usage is {{ $value }}% for last 15 minutes"

          # Disk space rules
          - alert: LowDiskSpace
            expr: (node_filesystem_avail_bytes{mountpoint!~"^(/run|/var/lib/docker).*",fstype!="tmpfs"} / node_filesystem_size_bytes * 100) < 15
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "Low disk space on {{ $labels.instance }} ({{ $labels.mountpoint }})"
              description: "Only {{ $value }}% free space left on {{ $labels.mountpoint }}"

          # Disk I/O rules
          - alert: HighDiskIoLoad
            expr: rate(node_disk_io_time_seconds_total[1m]) * 100 > 80
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "High disk I/O load on {{ $labels.instance }} ({{ $labels.device }})"
              description: "Disk I/O load is {{ $value }}% for last 10 minutes"

          # Network rules
          - alert: HighNetworkErrors
            expr: increase(node_network_receive_errs_total[5m]) > 10 or increase(node_network_transmit_errs_total[5m]) > 10
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High network errors on {{ $labels.instance }} ({{ $labels.device }})"
              description: "Network errors detected on interface {{ $labels.device }}"

          # System load rules
          # on(instance) is needed because node_load5 also carries a job label,
          # while the count by(instance) result does not.
          - alert: HighSystemLoad
            expr: node_load5 / on(instance) count by(instance)(node_cpu_seconds_total{mode="system"}) > 1.5
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: "High system load on {{ $labels.instance }}"
              description: "5-minute load average is {{ $value }} (relative to CPU count)"

          # Node down rules
          - alert: InstanceDown
            expr: up{job="node"} == 0
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Instance {{ $labels.instance }} down"
              description: "{{ $labels.instance }} has been down for more than 5 minutes"

          # File descriptor rules
          - alert: HighFileDescriptorUsage
            expr: node_filefd_allocated / node_filefd_maximum * 100 > 80
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "High file descriptor usage on {{ $labels.instance }}"
              description: "File descriptor usage is {{ $value }}% of maximum"

    windows_rules.yml

    groups:
      - name: windows-system-rules
        rules:
          # CPU rules
          - alert: HighCpuUsageWindows
            expr: 100 - (avg by(instance) (rate(windows_cpu_time_total{mode="idle"}[5m])) * 100) > 85
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "High CPU usage on {{ $labels.instance }}"
              description: "CPU usage is {{ $value }}% for last 10 minutes"

          # Memory rules
          - alert: HighMemoryUsageWindows
            expr: (windows_os_physical_memory_total_bytes - windows_os_physical_memory_free_bytes) / windows_os_physical_memory_total_bytes * 100 > 90
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "High memory usage on {{ $labels.instance }}"
              description: "Memory usage is {{ $value }}% for last 10 minutes"

          # Disk space rules
          - alert: LowDiskSpaceWindows
            expr: (windows_logical_disk_free_bytes / windows_logical_disk_size_bytes * 100) < 95  # threshold raised to trigger a test alert; lower it for real use
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "Low disk space on {{ $labels.instance }} ({{ $labels.volume }})"
              description: "Only {{ $value }}% free space left on {{ $labels.volume }}"

          # Disk I/O rules
          - alert: HighDiskIoWindows
            expr: rate(windows_logical_disk_read_seconds_total[5m]) * 100 > 80 or rate(windows_logical_disk_write_seconds_total[5m]) * 100 > 80
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "High disk I/O on {{ $labels.instance }} ({{ $labels.volume }})"
              description: "Disk I/O utilization is {{ $value }}% for last 10 minutes"

          # Service status rules
          - alert: CriticalServiceDown
            expr: windows_service_status{status!="running"} == 1
            for: 2m
            labels:
              severity: critical
            annotations:
              summary: "Critical service down on {{ $labels.instance }}"
              description: "Service {{ $labels.service }} is not running"

          # System boot time rules
          # Uptime below 300 seconds indicates a recent reboot; the original
          # comparison (> 300) would fire on every host that has been up for 5 minutes.
          - alert: SystemRebooted
            expr: time() - windows_system_system_up_time < 300
            for: 0m
            labels:
              severity: info
            annotations:
              summary: "System rebooted on {{ $labels.instance }}"
              description: "System was rebooted, uptime is {{ $value }} seconds"

          # Network rules
          - alert: HighNetworkUtilizationWindows
            expr: rate(windows_net_bytes_total[5m]) / windows_net_speed_bits * 8 * 100 > 80
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "High network utilization on {{ $labels.instance }} ({{ $labels.interface }})"
              description: "Network utilization is {{ $value }}% for last 10 minutes"

          # Process memory leak detection
          - alert: ProcessMemoryLeakWindows
            expr: predict_linear(windows_process_private_bytes[1h], 3600) / 1024 / 1024 / 1024 > 2
            for: 30m
            labels:
              severity: warning
            annotations:
              summary: "Possible memory leak in {{ $labels.process }} on {{ $labels.instance }}"
              description: "Process {{ $labels.process }} is predicted to exceed 2GB memory in 1 hour"

          # System log error rules
          - alert: SystemLogErrorsWindows
            expr: rate(windows_event_log_errors_total[5m]) > 5
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High system log errors on {{ $labels.instance }}"
              description: "{{ $value }} errors per second in system logs"

    linux_recording_rules.yml

    groups:
      - name: linux-recording-rules
        interval: 1m
        rules:
          # CPU usage (compatible with multiple Node Exporter versions)
          - record: instance:node_cpu_usage:rate5m
            expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle",job=~".*"}[5m])) * 100)

          # Memory usage (excluding cache and buffers)
          - record: instance:node_memory_usage:ratio
            expr: >
              (node_memory_MemTotal_bytes - node_memory_MemFree_bytes
              - node_memory_Buffers_bytes - node_memory_Cached_bytes)
              / node_memory_MemTotal_bytes * 100

          # Disk space usage (filtering out irrelevant mount points)
          - record: instance:node_filesystem_usage:ratio
            expr: >
              (node_filesystem_size_bytes{fstype!~"tmpfs|squashfs",mountpoint!~"/run|/snap"}
              - node_filesystem_avail_bytes{fstype!~"tmpfs|squashfs",mountpoint!~"/run|/snap"})
              / node_filesystem_size_bytes{fstype!~"tmpfs|squashfs",mountpoint!~"/run|/snap"} * 100

          # Network traffic (filtering out virtual interfaces)
          - record: instance:node_network_receive_mbps:rate5m
            expr: sum by(instance)(rate(node_network_receive_bytes_total{device!~"lo|veth.*"}[5m])) * 8 / 1048576

          # System load (normalized by CPU count; on(instance) matches the two
          # sides despite node_load5 carrying an extra job label)
          - record: instance:node_load_ratio:rate5m
            expr: node_load5 / on(instance) count by(instance)(node_cpu_seconds_total{mode="system"})
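Prometheus only evaluates rule files that prometheus.yml lists under rule_files, and the basic prometheus.yml in this section has no such key. The sketch below appends one and, when Docker and alerts.yml are both present, validates the result with the promtool binary that ships in the prom/prometheus image. It assumes it is run from the project directory, that prometheus.yml does not already contain a rule_files section, and that alerts.yml ends up at /etc/prometheus/alerts.yml inside the container; the other rule files would each need their own mount and their own entry.

```shell
#!/bin/sh
# Append a rule_files section to prometheus.yml.
# Assumption: the file has no rule_files key yet (running this twice would duplicate it).
cat >> prometheus.yml <<'EOF'

rule_files:
  - /etc/prometheus/alerts.yml   # path inside the container
EOF

# Optional check: promtool validates the config and every rule file it references.
if command -v docker >/dev/null 2>&1 && [ -f alerts.yml ]; then
  docker run --rm \
    -v "$PWD/prometheus.yml:/etc/prometheus/prometheus.yml:ro" \
    -v "$PWD/alerts.yml:/etc/prometheus/alerts.yml:ro" \
    --entrypoint promtool \
    prom/prometheus:latest check config /etc/prometheus/prometheus.yml
fi
```

If Prometheus is already running, the change takes effect after a configuration reload or a container restart.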
3. Write the Docker Compose file

docker-compose.yml

version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alerts.yml:/etc/prometheus/alerts.yml  # mount the alert rules
      - prometheus-data:/prometheus  # persist data
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.enable-lifecycle'  # allow hot-reloading the configuration
    ports:
      - "9090:9090"
    restart: unless-stopped
    networks:
      - monitor-net

  # Optional: add Grafana for visualization
  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    volumes:
      - grafana-data:/var/lib/grafana
    ports:
      - "3000:3000"
    restart: unless-stopped
    networks:
      - monitor-net

  # Optional: add Node Exporter to monitor the host
  # node-exporter:
  #   image: prom/node-exporter:latest
  #   container_name: node-exporter
  #   restart: unless-stopped
  #   network_mode: host  # requires host networking
  #   pid: host
  #   volumes:
  #     - /:/host:ro,rslave
  #   command:
  #     - '--path.rootfs=/host'

volumes:
  prometheus-data:
  grafana-data:

networks:
  monitor-net:
    driver: bridge
4. Start the services
docker-compose up -d  # start in the background
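5. Verify the deployment
  • Prometheus UI: open http://<server-IP>:9090
    • Check the targets: Status → Targets
    • Query a metric: Graph → enter up to see the state of each target
  • Grafana UI (if deployed): http://<server-IP>:3000 (default login admin/admin)
    • Add a Prometheus data source: http://prometheus:9090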
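The same checks can be scripted against Prometheus's HTTP endpoints instead of the UI. The sketch below defines a hypothetical helper, check_prometheus, that probes the /-/healthy and /-/ready endpoints and then runs the up query through /api/v1/query. The helper name, the 5-second timeout, and the default base URL http://localhost:9090 are assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical helper: returns 0 only if Prometheus at the given base URL
# answers its health and readiness endpoints and the "up" query succeeds.
check_prometheus() {
  base="${1:-http://localhost:9090}"
  curl -fsS --max-time 5 "$base/-/healthy" >/dev/null || return 1
  curl -fsS --max-time 5 "$base/-/ready"   >/dev/null || return 1
  # Same query as typing "up" in the Graph tab; prints the raw JSON result.
  curl -fsS --max-time 5 "$base/api/v1/query?query=up"
}
```

Calling check_prometheus http://<server-IP>:9090 prints the JSON query result on success and returns a non-zero status if any request fails.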
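6. Common operations
  • Reload the configuration (without restarting)
    curl -X POST http://localhost:9090/-/reload

  • View logs
    docker-compose logs -f prometheus

  • Stop the services
    docker-compose down

  • Back up data: back up the prometheus-data volume (default location: /var/lib/docker/volumes/...)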
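One way to back up the named volume without depending on its path on the host is to mount it read-only into a throwaway container and archive it from there. The sketch below assumes a hypothetical helper name, backup_volume, the alpine image, and a ./backups output directory. The volume name to pass is also an assumption: Compose normally prefixes the project name, so with the prometheus directory from section 1 it is likely prometheus_prometheus-data, which docker volume ls can confirm.

```shell
#!/bin/sh
# Hypothetical helper: archive a named Docker volume into <dest>/<volume>-<date>.tar.gz
backup_volume() {
  volume="$1"
  dest="${2:-$PWD/backups}"
  mkdir -p "$dest" || return 1
  docker run --rm \
    -v "$volume:/data:ro" \
    -v "$dest:/backup" \
    alpine tar czf "/backup/$volume-$(date +%Y%m%d).tar.gz" -C /data .
}
```

A typical sequence would be docker-compose stop prometheus, then backup_volume prometheus_prometheus-data, then docker-compose start prometheus; stopping the container first avoids archiving the TSDB while it is being written.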
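7. Hardening for production
  1. Security
    • Enable basic authentication with the Prometheus --web.config.file flag
    • Restrict the Grafana login policy
  2. Persistent storage (example of a volume with driver options; note that the Prometheus documentation lists NFS as unsupported for its local storage, so prefer local or block storage for the TSDB where possible)
    volumes:
      prometheus-data:
        driver_opts:
          type: nfs
          o: addr=<nfs_server>,rw
          device: ":/path/to/nfs"

  3. Resource limits
    prometheus:
      deploy:  # applied by Compose V2; docker-compose 1.x needs --compatibility outside Swarm
        resources:
          limits:
            cpus: '2'
            memory: 4G

  4. High availability
    • Run multiple Prometheus instances with Thanos
    • Use an Alertmanager cluster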
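For the basic-authentication item, --web.config.file points Prometheus at a small YAML file whose basic_auth_users map holds user names and bcrypt password hashes. The sketch below writes such a file. The file name web.yml, the user name admin, and the placeholder value are assumptions; the placeholder has to be replaced with a real bcrypt hash (for example one produced by htpasswd -nB admin) before use.

```shell
#!/bin/sh
# Write a minimal web config enabling HTTP basic auth.
# REPLACE_WITH_BCRYPT_HASH is a placeholder, not a working credential.
cat > web.yml <<'EOF'
basic_auth_users:
  admin: REPLACE_WITH_BCRYPT_HASH
EOF
```

To use it, the file would be mounted into the container (for example ./web.yml:/etc/prometheus/web.yml) and '--web.config.file=/etc/prometheus/web.yml' added to the command list. Once authentication is on, anything that talks to Prometheus, including its self-scrape job, the Grafana data source, and the reload request, needs the credentials as well.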
8. Monitoring additional targets

Add the following under scrape_configs in prometheus.yml:

# Monitor Docker containers
- job_name: "docker"
  static_configs:
    - targets: ["docker-host:9323"]  # requires the Docker daemon to expose metrics

# Monitor MySQL
- job_name: "mysql"
  static_configs:
    - targets: ["mysql-exporter:9104"]  # requires a deployed mysqld-exporter
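The docker job only returns data once the Docker daemon publishes its metrics endpoint, which is controlled by the metrics-addr key in the daemon configuration file (commonly /etc/docker/daemon.json on Linux). The sketch below writes the setting to a local example file rather than touching the live configuration. The file name daemon.json.example and the bind address are assumptions, and older Docker releases may additionally require "experimental": true.

```shell
#!/bin/sh
# Write an example daemon config fragment; merge it into the real daemon.json by hand.
# 0.0.0.0 exposes the endpoint on every interface; narrow it if Prometheus
# can reach the host on a specific address.
cat > daemon.json.example <<'EOF'
{
  "metrics-addr": "0.0.0.0:9323"
}
EOF
```

After merging the key and restarting the Docker daemon, curl http://localhost:9323/metrics on the host should return Prometheus-format metrics.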

Note: for the complete set of configuration options, see the official Prometheus documentation.
