
DCGM GPU Monitoring Metrics for Kubernetes Clusters

Table of Contents

  • 1. DCGM_FI_DEV_SM_CLOCK
  • 2. DCGM_FI_DEV_MEM_CLOCK
  • 3. DCGM_FI_DEV_MEMORY_TEMP
  • 4. DCGM_FI_DEV_GPU_TEMP
  • 5. DCGM_FI_DEV_POWER_USAGE
  • 6. DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION
  • 7. DCGM_FI_DEV_PCIE_REPLAY_COUNTER
  • 8. DCGM_FI_DEV_GPU_UTIL
  • 9. DCGM_FI_DEV_MEM_COPY_UTIL
  • 10. DCGM_FI_DEV_ENC_UTIL
  • 11. DCGM_FI_DEV_DEC_UTIL
  • 12. DCGM_FI_DEV_XID_ERRORS
  • 13. DCGM_FI_DEV_FB_FREE
  • 14. DCGM_FI_DEV_FB_USED
  • 15. DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL
  • 16. DCGM_FI_DEV_VGPU_LICENSE_STATUS
  • 17. DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS
  • 18. DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS
  • 19. DCGM_FI_DEV_ROW_REMAP_FAILURE

1. DCGM_FI_DEV_SM_CLOCK

  • Meaning: the clock frequency of the GPU's streaming multiprocessors (SMs), in MHz.
    • DCGM: NVIDIA Data Center GPU Manager; these metrics are exposed by dcgm-exporter.
  • Example:
DCGM_FI_DEV_SM_CLOCK{Hostname="gpu-metrics-exporter-r2dch", UUID="GPU-08496763-b93f-8f6a-4fc0-d93e1357bb10", cluster="k8s-cto-gpu-pro", device="nvidia4", gpu="4", instance="10.10.182.63:9400", job="k8s-cto-gpu-pro-gpu", modelName="NVIDIA H20"} 1980
DCGM_FI_DEV_SM_CLOCK{Hostname="gpu-metrics-exporter-r2dch", UUID="GPU-1df8afeb-04ab-4cb6-7d00-ab7dbcc4eed9", cluster="k8s-cto-gpu-pro", device="nvidia5", gpu="5", instance="10.10.182.63:9400", job="k8s-cto-gpu-pro-gpu", modelName="NVIDIA H20"} 1980
DCGM_FI_DEV_SM_CLOCK{Hostname="gpu-metrics-exporter-r2dch", UUID="GPU-4a14cb97-c86f-7e77-a74b-05575eccc227", cluster="k8s-cto-gpu-pro", device="nvidia1", gpu="1", instance="10.10.182.63:9400", job="k8s-cto-gpu-pro-gpu", modelName="NVIDIA H20"} 1980
DCGM_FI_DEV_SM_CLOCK{Hostname="gpu-metrics-exporter-r2dch", UUID="GPU-82a64530-b090-e638-c523-5f3eb4eea01a", cluster="k8s-cto-gpu-pro", device="nvidia0", gpu="0", instance="10.10.182.63:9400", job="k8s-cto-gpu-pro-gpu", modelName="NVIDIA H20"} 1980
DCGM_FI_DEV_SM_CLOCK{Hostname="gpu-metrics-exporter-r2dch", UUID="GPU-b05339cb-5930-dadd-480e-dde12630e0b1", cluster="k8s-cto-gpu-pro", device="nvidia3", gpu="3", instance="10.10.182.63:9400", job="k8s-cto-gpu-pro-gpu", modelName="NVIDIA H20"} 1980
DCGM_FI_DEV_SM_CLOCK{Hostname="gpu-metrics-exporter-r2dch", UUID="GPU-b8b369cc-8369-5891-ce29-21ca979abb00", cluster="k8s-cto-gpu-pro", device="nvidia7", gpu="7", instance="10.10.182.63:9400", job="k8s-cto-gpu-pro-gpu", modelName="NVIDIA H20"} 1980
DCGM_FI_DEV_SM_CLOCK{Hostname="gpu-metrics-exporter-r2dch", UUID="GPU-d9d28678-a25a-0d6b-7f80-d2234aa4dc0d", cluster="k8s-cto-gpu-pro", device="nvidia2", gpu="2", instance="10.10.182.63:9400", job="k8s-cto-gpu-pro-gpu", modelName="NVIDIA H20"} 1980
DCGM_FI_DEV_SM_CLOCK{Hostname="gpu-metrics-exporter-r2dch", UUID="GPU-de655b8a-1101-25e8-ef43-6bde6c315612", cluster="k8s-cto-gpu-pro", container="fssc-ocr-qa-online", device="nvidia6", gpu="6", instance="10.10.182.63:9400", job="k8s-cto-gpu-pro-gpu", modelName="NVIDIA H20", namespace="fisam", pod="a0703240723011-fssc-ocr-qa-online-67bbc98589-mqskl"} 1980
  • Note: the SM clock of each GPU here is 1980 MHz.
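The sample lines above follow the Prometheus text exposition format. As a minimal sketch, one such line can be parsed into its metric name, labels, and value with two regular expressions (a real consumer would use a Prometheus client library; the patterns assume the simple single-line format shown above):

```python
import re

# Match `NAME{labels} value`; the label body is then split into key="value" pairs.
LINE_RE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

def parse_metric_line(line: str):
    """Parse one Prometheus exposition line into (name, labels, value)."""
    m = LINE_RE.match(line.strip())
    if not m:
        raise ValueError(f"unparseable line: {line!r}")
    name, raw_labels, value = m.groups()
    labels = dict(LABEL_RE.findall(raw_labels))
    return name, labels, float(value)

sample = ('DCGM_FI_DEV_SM_CLOCK{Hostname="gpu-metrics-exporter-r2dch", '
          'gpu="4", modelName="NVIDIA H20"} 1980')
name, labels, value = parse_metric_line(sample)
print(name, labels["gpu"], value)  # DCGM_FI_DEV_SM_CLOCK 4 1980.0
```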

2. DCGM_FI_DEV_MEM_CLOCK

  • Meaning: the clock frequency of the GPU's memory, in MHz.

  • Example:

DCGM_FI_DEV_MEM_CLOCK{Hostname="gpu-metrics-exporter-r2dch", UUID="GPU-08496763-b93f-8f6a-4fc0-d93e1357bb10", cluster="k8s-cto-gpu-pro", device="nvidia4", gpu="4", instance="10.10.182.63:9400", job="k8s-cto-gpu-pro-gpu", modelName="NVIDIA H20"} 2619
DCGM_FI_DEV_MEM_CLOCK{Hostname="gpu-metrics-exporter-r2dch", UUID="GPU-1df8afeb-04ab-4cb6-7d00-ab7dbcc4eed9", cluster="k8s-cto-gpu-pro", device="nvidia5", gpu="5", instance="10.10.182.63:9400", job="k8s-cto-gpu-pro-gpu", modelName="NVIDIA H20"} 2619
DCGM_FI_DEV_MEM_CLOCK{Hostname="gpu-metrics-exporter-r2dch", UUID="GPU-4a14cb97-c86f-7e77-a74b-05575eccc227", cluster="k8s-cto-gpu-pro", device="nvidia1", gpu="1", instance="10.10.182.63:9400", job="k8s-cto-gpu-pro-gpu", modelName="NVIDIA H20"} 2619
DCGM_FI_DEV_MEM_CLOCK{Hostname="gpu-metrics-exporter-r2dch", UUID="GPU-82a64530-b090-e638-c523-5f3eb4eea01a", cluster="k8s-cto-gpu-pro", device="nvidia0", gpu="0", instance="10.10.182.63:9400", job="k8s-cto-gpu-pro-gpu", modelName="NVIDIA H20"} 2619
DCGM_FI_DEV_MEM_CLOCK{Hostname="gpu-metrics-exporter-r2dch", UUID="GPU-b05339cb-5930-dadd-480e-dde12630e0b1", cluster="k8s-cto-gpu-pro", device="nvidia3", gpu="3", instance="10.10.182.63:9400", job="k8s-cto-gpu-pro-gpu", modelName="NVIDIA H20"} 2619
DCGM_FI_DEV_MEM_CLOCK{Hostname="gpu-metrics-exporter-r2dch", UUID="GPU-b8b369cc-8369-5891-ce29-21ca979abb00", cluster="k8s-cto-gpu-pro", device="nvidia7", gpu="7", instance="10.10.182.63:9400", job="k8s-cto-gpu-pro-gpu", modelName="NVIDIA H20"} 2619
DCGM_FI_DEV_MEM_CLOCK{Hostname="gpu-metrics-exporter-r2dch", UUID="GPU-d9d28678-a25a-0d6b-7f80-d2234aa4dc0d", cluster="k8s-cto-gpu-pro", device="nvidia2", gpu="2", instance="10.10.182.63:9400", job="k8s-cto-gpu-pro-gpu", modelName="NVIDIA H20"} 2619
DCGM_FI_DEV_MEM_CLOCK{Hostname="gpu-metrics-exporter-r2dch", UUID="GPU-de655b8a-1101-25e8-ef43-6bde6c315612", cluster="k8s-cto-gpu-pro", container="fssc-ocr-qa-online", device="nvidia6", gpu="6", instance="10.10.182.63:9400", job="k8s-cto-gpu-pro-gpu", modelName="NVIDIA H20", namespace="fisam", pod="a0703240723011-fssc-ocr-qa-online-67bbc98589-mqskl"} 2619

3. DCGM_FI_DEV_MEMORY_TEMP

  • Meaning: the temperature of the GPU memory modules, in °C.
  • Example:
DCGM_FI_DEV_MEMORY_TEMP{Hostname="gpu-metrics-exporter-r2dch", UUID="GPU-08496763-b93f-8f6a-4fc0-d93e1357bb10", cluster="k8s-cto-gpu-pro", device="nvidia4", gpu="4", instance="10.10.182.63:9400", job="k8s-cto-gpu-pro-gpu", modelName="NVIDIA H20"} 36
DCGM_FI_DEV_MEMORY_TEMP{Hostname="gpu-metrics-exporter-r2dch", UUID="GPU-1df8afeb-04ab-4cb6-7d00-ab7dbcc4eed9", cluster="k8s-cto-gpu-pro", device="nvidia5", gpu="5", instance="10.10.182.63:9400", job="k8s-cto-gpu-pro-gpu", modelName="NVIDIA H20"} 40
DCGM_FI_DEV_MEMORY_TEMP{Hostname="gpu-metrics-exporter-r2dch", UUID="GPU-4a14cb97-c86f-7e77-a74b-05575eccc227", cluster="k8s-cto-gpu-pro", device="nvidia1", gpu="1", instance="10.10.182.63:9400", job="k8s-cto-gpu-pro-gpu", modelName="NVIDIA H20"} 41
DCGM_FI_DEV_MEMORY_TEMP{Hostname="gpu-metrics-exporter-r2dch", UUID="GPU-82a64530-b090-e638-c523-5f3eb4eea01a", cluster="k8s-cto-gpu-pro", device="nvidia0", gpu="0", instance="10.10.182.63:9400", job="k8s-cto-gpu-pro-gpu", modelName="NVIDIA H20"} 35
DCGM_FI_DEV_MEMORY_TEMP{Hostname="gpu-metrics-exporter-r2dch", UUID="GPU-b05339cb-5930-dadd-480e-dde12630e0b1", cluster="k8s-cto-gpu-pro", device="nvidia3", gpu="3", instance="10.10.182.63:9400", job="k8s-cto-gpu-pro-gpu", modelName="NVIDIA H20"} 41
DCGM_FI_DEV_MEMORY_TEMP{Hostname="gpu-metrics-exporter-r2dch", UUID="GPU-b8b369cc-8369-5891-ce29-21ca979abb00", cluster="k8s-cto-gpu-pro", device="nvidia7", gpu="7", instance="10.10.182.63:9400", job="k8s-cto-gpu-pro-gpu", modelName="NVIDIA H20"} 40
DCGM_FI_DEV_MEMORY_TEMP{Hostname="gpu-metrics-exporter-r2dch", UUID="GPU-d9d28678-a25a-0d6b-7f80-d2234aa4dc0d", cluster="k8s-cto-gpu-pro", device="nvidia2", gpu="2", instance="10.10.182.63:9400", job="k8s-cto-gpu-pro-gpu", modelName="NVIDIA H20"} 36
DCGM_FI_DEV_MEMORY_TEMP{Hostname="gpu-metrics-exporter-r2dch", UUID="GPU-de655b8a-1101-25e8-ef43-6bde6c315612", cluster="k8s-cto-gpu-pro", container="fssc-ocr-qa-online", device="nvidia6", gpu="6", instance="10.10.182.63:9400", job="k8s-cto-gpu-pro-gpu", modelName="NVIDIA H20", namespace="fisam", pod="a0703240723011-fssc-ocr-qa-online-67bbc98589-mqskl"} 34

4. DCGM_FI_DEV_GPU_TEMP

  • Meaning: the temperature of the GPU die itself (the chip containing the SMs), in °C.
  • Example:
DCGM_FI_DEV_GPU_TEMP{Hostname="gpu-metrics-exporter-fplml", UUID="GPU-02deb8e4-d4fc-a6c9-c189-8e776203d7c0", cluster="k8s-test", container="a0703240710024-qa", device="nvidia7", gpu="7", instance="10.10.177.64:9401", job="kubernetes-gpu-exporter", modelName="NVIDIA H20", namespace="qa", pod="a0703240710024-qa-7c576c869b-vtn4x"} 33
DCGM_FI_DEV_GPU_TEMP{Hostname="gpu-metrics-exporter-fplml", UUID="GPU-1bba11cf-d9a9-f77c-9c66-6b99d116faff", cluster="k8s-test", device="nvidia0", gpu="0", instance="10.10.177.64:9401", job="kubernetes-gpu-exporter", modelName="NVIDIA H20"} 26
DCGM_FI_DEV_GPU_TEMP{Hostname="gpu-metrics-exporter-fplml", UUID="GPU-1ccf2570-2815-3ced-8898-95648cd876d9", cluster="k8s-test", device="nvidia4", gpu="4", instance="10.10.177.64:9401", job="kubernetes-gpu-exporter", modelName="NVIDIA H20"} 27
DCGM_FI_DEV_GPU_TEMP{Hostname="gpu-metrics-exporter-fplml", UUID="GPU-419d9aa5-aa41-5e8c-830c-2217fb5d155f", cluster="k8s-test", device="nvidia3", gpu="3", instance="10.10.177.64:9401", job="kubernetes-gpu-exporter", modelName="NVIDIA H20"} 47
DCGM_FI_DEV_GPU_TEMP{Hostname="gpu-metrics-exporter-fplml", UUID="GPU-82160b93-6cae-dfe0-3665-e5a8c22c6897", cluster="k8s-test", device="nvidia2", gpu="2", instance="10.10.177.64:9401", job="kubernetes-gpu-exporter", modelName="NVIDIA H20"} 34
DCGM_FI_DEV_GPU_TEMP{Hostname="gpu-metrics-exporter-fplml", UUID="GPU-b45c6ec4-2b71-f68e-bd68-95df336318b6", cluster="k8s-test", device="nvidia1", gpu="1", instance="10.10.177.64:9401", job="kubernetes-gpu-exporter", modelName="NVIDIA H20"} 34
DCGM_FI_DEV_GPU_TEMP{Hostname="gpu-metrics-exporter-fplml", UUID="GPU-b7833685-11da-1f34-0f07-5756b22f781c", cluster="k8s-test", device="nvidia5", gpu="5", instance="10.10.177.64:9401", job="kubernetes-gpu-exporter", modelName="NVIDIA H20"} 33
DCGM_FI_DEV_GPU_TEMP{Hostname="gpu-metrics-exporter-fplml", UUID="GPU-ff64b19f-1bc9-2652-2fbb-e2df22d2e2e5", cluster="k8s-test", device="nvidia6", gpu="6", instance="10.10.177.64:9401", job="kubernetes-gpu-exporter", modelName="NVIDIA H20"} 28

5. DCGM_FI_DEV_POWER_USAGE

  • Meaning: the GPU's current power draw, in watts (W).
  • Example:
DCGM_FI_DEV_POWER_USAGE{Hostname="gpu-metrics-exporter-fplml", UUID="GPU-02deb8e4-d4fc-a6c9-c189-8e776203d7c0", cluster="k8s-test", container="a0703240710024-qa", device="nvidia7", gpu="7", instance="10.10.177.64:9401", job="kubernetes-gpu-exporter", modelName="NVIDIA H20", namespace="qa", pod="a0703240710024-qa-7c576c869b-vtn4x"} 118.396
DCGM_FI_DEV_POWER_USAGE{Hostname="gpu-metrics-exporter-fplml", UUID="GPU-1bba11cf-d9a9-f77c-9c66-6b99d116faff", cluster="k8s-test", device="nvidia0", gpu="0", instance="10.10.177.64:9401", job="kubernetes-gpu-exporter", modelName="NVIDIA H20"} 117.088
DCGM_FI_DEV_POWER_USAGE{Hostname="gpu-metrics-exporter-fplml", UUID="GPU-1ccf2570-2815-3ced-8898-95648cd876d9", cluster="k8s-test", device="nvidia4", gpu="4", instance="10.10.177.64:9401", job="kubernetes-gpu-exporter", modelName="NVIDIA H20"} 115.64
DCGM_FI_DEV_POWER_USAGE{Hostname="gpu-metrics-exporter-fplml", UUID="GPU-419d9aa5-aa41-5e8c-830c-2217fb5d155f", cluster="k8s-test", device="nvidia3", gpu="3", instance="10.10.177.64:9401", job="kubernetes-gpu-exporter", modelName="NVIDIA H20"} 118.583
DCGM_FI_DEV_POWER_USAGE{Hostname="gpu-metrics-exporter-fplml", UUID="GPU-82160b93-6cae-dfe0-3665-e5a8c22c6897", cluster="k8s-test", device="nvidia2", gpu="2", instance="10.10.177.64:9401", job="kubernetes-gpu-exporter", modelName="NVIDIA H20"} 115.182
DCGM_FI_DEV_POWER_USAGE{Hostname="gpu-metrics-exporter-fplml", UUID="GPU-b45c6ec4-2b71-f68e-bd68-95df336318b6", cluster="k8s-test", device="nvidia1", gpu="1", instance="10.10.177.64:9401", job="kubernetes-gpu-exporter", modelName="NVIDIA H20"} 121.918
DCGM_FI_DEV_POWER_USAGE{Hostname="gpu-metrics-exporter-fplml", UUID="GPU-b7833685-11da-1f34-0f07-5756b22f781c", cluster="k8s-test", device="nvidia5", gpu="5", instance="10.10.177.64:9401", job="kubernetes-gpu-exporter", modelName="NVIDIA H20"} 120.652
DCGM_FI_DEV_POWER_USAGE{Hostname="gpu-metrics-exporter-fplml", UUID="GPU-ff64b19f-1bc9-2652-2fbb-e2df22d2e2e5", cluster="k8s-test", device="nvidia6", gpu="6", instance="10.10.177.64:9401", job="kubernetes-gpu-exporter", modelName="NVIDIA H20"} 119.484

6. DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION

  • Meaning: the GPU's total energy consumption, a monotonically increasing counter in millijoules (mJ).
  • Example:
    Query: the average energy consumed per second (i.e. the average power):
rate(DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION{cluster="k8s-test"}[5m])

Query result:

{Hostname="gpu-metrics-exporter-fplml", UUID="GPU-02deb8e4-d4fc-a6c9-c189-8e776203d7c0", cluster="k8s-test", container="a0703240710024-qa", device="nvidia7", gpu="7", instance="10.10.177.64:9401", job="kubernetes-gpu-exporter", modelName="NVIDIA H20", namespace="qa", pod="a0703240710024-qa-7c576c869b-vtn4x"} 95843.91749251151
{Hostname="gpu-metrics-exporter-fplml", UUID="GPU-1bba11cf-d9a9-f77c-9c66-6b99d116faff", cluster="k8s-test", device="nvidia0", gpu="0", instance="10.10.177.64:9401", job="kubernetes-gpu-exporter", modelName="NVIDIA H20"} 95613.14993275795
{Hostname="gpu-metrics-exporter-fplml", UUID="GPU-1ccf2570-2815-3ced-8898-95648cd876d9", cluster="k8s-test", device="nvidia4", gpu="4", instance="10.10.177.64:9401", job="kubernetes-gpu-exporter", modelName="NVIDIA H20"} 93766.0105002778
{Hostname="gpu-metrics-exporter-fplml", UUID="GPU-419d9aa5-aa41-5e8c-830c-2217fb5d155f", cluster="k8s-test", device="nvidia3", gpu="3", instance="10.10.177.64:9401", job="kubernetes-gpu-exporter", modelName="NVIDIA H20"} 94915.19862014297
{Hostname="gpu-metrics-exporter-fplml", UUID="GPU-82160b93-6cae-dfe0-3665-e5a8c22c6897", cluster="k8s-test", device="nvidia2", gpu="2", instance="10.10.177.64:9401", job="kubernetes-gpu-exporter", modelName="NVIDIA H20"} 93046.31376552948
{Hostname="gpu-metrics-exporter-fplml", UUID="GPU-b45c6ec4-2b71-f68e-bd68-95df336318b6", cluster="k8s-test", device="nvidia1", gpu="1", instance="10.10.177.64:9401", job="kubernetes-gpu-exporter", modelName="NVIDIA H20"} 99465.33625232431
{Hostname="gpu-metrics-exporter-fplml", UUID="GPU-b7833685-11da-1f34-0f07-5756b22f781c", cluster="k8s-test", device="nvidia5", gpu="5", instance="10.10.177.64:9401", job="kubernetes-gpu-exporter", modelName="NVIDIA H20"} 97453.83702762033
{Hostname="gpu-metrics-exporter-fplml", UUID="GPU-ff64b19f-1bc9-2652-2fbb-e2df22d2e2e5", cluster="k8s-test", device="nvidia6", gpu="6", instance="10.10.177.64:9401", job="kubernetes-gpu-exporter", modelName="NVIDIA H20"} 97032.63242375078
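Since the counter is reported in millijoules, `rate()` yields mJ/s, i.e. milliwatts; dividing by 1000 gives the average power in watts, which is comparable in magnitude to the DCGM_FI_DEV_POWER_USAGE readings. A quick sanity check on the first sample above (a sketch, assuming the millijoule unit):

```python
def rate_mj_to_watts(rate_mj_per_s: float) -> float:
    """Convert a rate() over a millijoule counter (mJ/s = mW) to watts."""
    return rate_mj_per_s / 1000.0

# First result from the example query: ~95843.9 mJ/s -> ~95.8 W average power.
print(round(rate_mj_to_watts(95843.91749251151), 1))  # 95.8
```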

7. DCGM_FI_DEV_PCIE_REPLAY_COUNTER

  • Meaning: the number of PCIe replay (retry) events on the GPU's PCIe link; a growing count indicates link-level transmission errors.

PCIe (Peripheral Component Interconnect Express) is a high-speed serial expansion bus standard used to connect the motherboard to devices such as GPUs, SSDs, and NICs.

8. DCGM_FI_DEV_GPU_UTIL

  • Meaning: GPU utilization, i.e. the percentage of time during which one or more kernels were executing on the GPU.

9. DCGM_FI_DEV_MEM_COPY_UTIL

  • Meaning: memory copy utilization, i.e. the percentage of time during which GPU memory was being read from or written to.

10. DCGM_FI_DEV_ENC_UTIL

  • Meaning: utilization of the GPU's hardware video encoder (NVENC), in %.

11. DCGM_FI_DEV_DEC_UTIL

  • Meaning: utilization of the GPU's hardware video decoder (NVDEC), in %.

12. DCGM_FI_DEV_XID_ERRORS

  • Meaning: the most recent XID error on the GPU; the value is the XID error code.

Common XID errors include:

  • XID 31 - GPU memory page fault, usually an illegal address access by an application
  • XID 43 - GPU stopped processing after an application hit a GPU exception
  • XID 48 - double-bit ECC error (uncorrectable memory error)
  • XID 79 - GPU has fallen off the bus (no longer reachable over PCIe)
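For alert annotations, the codes above can be turned into a small lookup table; the descriptions follow NVIDIA's XID error documentation (a sketch, not an exhaustive list):

```python
# Minimal XID code -> description lookup, per NVIDIA's XID documentation.
XID_MEANINGS = {
    31: "GPU memory page fault (usually an application bug)",
    43: "GPU stopped processing (application hit a GPU exception)",
    48: "double-bit ECC error (uncorrectable memory error)",
    79: "GPU has fallen off the bus (no longer reachable over PCIe)",
}

def describe_xid(code: int) -> str:
    """Return a human-readable description for an XID code."""
    return XID_MEANINGS.get(code, f"unknown XID {code}, see NVIDIA docs")

print(describe_xid(79))  # GPU has fallen off the bus (no longer reachable over PCIe)
```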

13. DCGM_FI_DEV_FB_FREE

  • Meaning: the amount of free GPU framebuffer (video) memory, in MiB.

14. DCGM_FI_DEV_FB_USED

  • Meaning: the amount of used GPU framebuffer (video) memory, in MiB.
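Because both gauges are reported in MiB, framebuffer utilization can be derived directly from the pair (a sketch with hypothetical numbers):

```python
def fb_utilization_pct(fb_used_mib: float, fb_free_mib: float) -> float:
    """Framebuffer utilization in percent, from DCGM_FI_DEV_FB_USED and _FB_FREE."""
    total = fb_used_mib + fb_free_mib
    return 100.0 * fb_used_mib / total if total else 0.0

# e.g. 20 GiB in use out of 96 GiB total (hypothetical values):
print(round(fb_utilization_pct(20480, 77824), 1))  # 20.8
```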

15. DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL

  • Meaning: the total data bandwidth transferred over NVLink, across all lanes.

NVLink is a high-speed interconnect developed by NVIDIA for transferring data between GPUs, or between GPUs and CPUs.

DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL{Hostname="gpu-metrics-exporter-fplml", UUID="GPU-02deb8e4-d4fc-a6c9-c189-8e776203d7c0", cluster="k8s-test", device="nvidia7", gpu="7", instance="10.10.177.64:9401", job="kubernetes-cto-gpu-test-gpu", modelName="NVIDIA H20"} 0
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL{Hostname="gpu-metrics-exporter-fplml", UUID="GPU-1bba11cf-d9a9-f77c-9c66-6b99d116faff", cluster="k8s-test", device="nvidia0", gpu="0", instance="10.10.177.64:9401", job="kubernetes-cto-gpu-test-gpu", modelName="NVIDIA H20"} 0
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL{Hostname="gpu-metrics-exporter-fplml", UUID="GPU-1ccf2570-2815-3ced-8898-95648cd876d9", cluster="k8s-test", device="nvidia4", gpu="4", instance="10.10.177.64:9401", job="kubernetes-cto-gpu-test-gpu", modelName="NVIDIA H20"} 0
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL{Hostname="gpu-metrics-exporter-fplml", UUID="GPU-419d9aa5-aa41-5e8c-830c-2217fb5d155f", cluster="k8s-test", device="nvidia3", gpu="3", instance="10.10.177.64:9401", job="kubernetes-cto-gpu-test-gpu", modelName="NVIDIA H20"} 342
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL{Hostname="gpu-metrics-exporter-fplml", UUID="GPU-82160b93-6cae-dfe0-3665-e5a8c22c6897", cluster="k8s-test", container="a0703240710024-qa", device="nvidia2", gpu="2", instance="10.10.177.64:9401", job="kubernetes-cto-gpu-test-gpu", modelName="NVIDIA H20", namespace="qa", pod="a0703240710024-qa-7c576c869b-lzclw"} 343
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL{Hostname="gpu-metrics-exporter-fplml", UUID="GPU-b45c6ec4-2b71-f68e-bd68-95df336318b6", cluster="k8s-test", device="nvidia1", gpu="1", instance="10.10.177.64:9401", job="kubernetes-cto-gpu-test-gpu", modelName="NVIDIA H20"} 0
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL{Hostname="gpu-metrics-exporter-fplml", UUID="GPU-b7833685-11da-1f34-0f07-5756b22f781c", cluster="k8s-test", device="nvidia5", gpu="5", instance="10.10.177.64:9401", job="kubernetes-cto-gpu-test-gpu", modelName="NVIDIA H20"} 0
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL{Hostname="gpu-metrics-exporter-fplml", UUID="GPU-ff64b19f-1bc9-2652-2fbb-e2df22d2e2e5", cluster="k8s-test", device="nvidia6", gpu="6", instance="10.10.177.64:9401", job="kubernetes-cto-gpu-test-gpu", modelName="NVIDIA H20"} 0
  • Extra: inspecting the GPU interconnect topology on the host
root@liubei:~# nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV18    NV18    NV18    NV18    NV18    NV18    NV18    0-31,64-95      0               N/A
GPU1    NV18     X      NV18    NV18    NV18    NV18    NV18    NV18    0-31,64-95      0               N/A
GPU2    NV18    NV18     X      NV18    NV18    NV18    NV18    NV18    0-31,64-95      0               N/A
GPU3    NV18    NV18    NV18     X      NV18    NV18    NV18    NV18    0-31,64-95      0               N/A
GPU4    NV18    NV18    NV18    NV18     X      NV18    NV18    NV18    32-63,96-127    1               N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X      NV18    NV18    32-63,96-127    1               N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X      NV18    32-63,96-127    1               N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X      32-63,96-127    1               N/A

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

From the output above:

  • GPU 0-3 are attached to NUMA node 0, with CPU affinity 0-31,64-95
  • GPU 4-7 are attached to NUMA node 1, with CPU affinity 32-63,96-127
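The CPU Affinity column uses range syntax like `0-31,64-95`. A small helper to expand such a string into concrete core IDs can be handy when pinning workloads to the GPU-local NUMA node (a sketch; the parsing assumes the comma/dash format shown above):

```python
def expand_cpu_list(spec: str) -> list:
    """Expand an affinity string like '0-31,64-95' into a list of core IDs."""
    cores = []
    for part in spec.split(","):
        if "-" in part:
            lo, hi = map(int, part.split("-"))
            cores.extend(range(lo, hi + 1))
        else:
            cores.append(int(part))
    return cores

cores = expand_cpu_list("0-31,64-95")
print(len(cores), cores[0], cores[-1])  # 64 0 95
```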

16. DCGM_FI_DEV_VGPU_LICENSE_STATUS

  • Meaning: the NVIDIA vGPU license status.

17. DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS

  • Meaning: the number of GPU memory rows that have been remapped due to uncorrectable errors.

18. DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS

  • Meaning: the number of GPU memory rows that have been remapped due to correctable errors.

19. DCGM_FI_DEV_ROW_REMAP_FAILURE

  • Meaning: the number of failed memory row remapping attempts.