deviceQuery in Depth: From Installation to Reading the Output, Fully Verifying a Server's GPU Environment

ops 2025/9/3 6:03:04

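Before running GPU compute or deep learning jobs on a server, the first step is usually to confirm that the GPUs are correctly detected and configured. NVIDIA's `deviceQuery` sample program exists for exactly this purpose. This article walks through installing the CUDA Toolkit on a server, building and running `deviceQuery`, and reading its output, so you can be sure your GPU environment is ready.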

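1. What is deviceQuery?

`deviceQuery` is a small sample program from NVIDIA's CUDA samples (bundled with the CUDA Toolkit up to version 11.5, and distributed through the NVIDIA/cuda-samples repository on GitHub since 11.6). It queries and prints detailed information about every CUDA-capable GPU in the system. It tells us:

Whether any GPU was detected
The GPU model and its compute capability (Compute Capability)
The global memory size and shared memory size
The maximum number of threads per block and the maximum grid dimensions
The driver version and the CUDA runtime version
Other hardware limits and feature flags

In short, if `deviceQuery` runs successfully and recognizes your GPU, your basic CUDA environment is working.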

2. Installing the CUDA Toolkit

Before running `deviceQuery`, make sure the NVIDIA driver and the CUDA Toolkit are correctly installed on the server.

2.1 Check the NVIDIA driver

nvidia-smi

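If this command prints the GPU list, the driver version (Driver Version), and a CUDA version (for example `CUDA Version: 12.2`), the driver is installed. Note that the CUDA version shown here is the highest CUDA version the driver supports, not the version of any installed Toolkit.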
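For scripted checks it can help to pull the two version fields out of the `nvidia-smi` banner. The following is a minimal sketch, assuming the banner contains the `Driver Version:` and `CUDA Version:` labels mentioned above; the function names `smi_driver_version` and `smi_cuda_version` are hypothetical helpers, not part of any NVIDIA tool.

```shell
# Extract the "Driver Version" field from nvidia-smi banner text read on stdin.
# (Hypothetical helper; assumes the label appears as "Driver Version: <digits and dots>".)
smi_driver_version() {
    sed -n 's/.*Driver Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Extract the "CUDA Version" field, i.e. the highest CUDA version the driver supports.
smi_cuda_version() {
    sed -n 's/.*CUDA Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Usage on a machine with the driver installed:
#   nvidia-smi | smi_driver_version
#   nvidia-smi | smi_cuda_version
```

If you only need per-GPU fields, `nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader` prints them directly in CSV form.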

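2.2 Install the CUDA Toolkit (if not already installed)

On Ubuntu, for example, you can install from the official runfile or deb package. The commands below use the local deb installer for Ubuntu 22.04 (x86_64).

Add the NVIDIA package repository: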

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.2.0/local_installers/cuda-repo-ubuntu2204-12-2-local_12.2.0-535.54.03-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2204-12-2-local_12.2.0-535.54.03-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2204-12-2-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-2

After installation, add the CUDA paths to your environment variables:

echo 'export PATH=/usr/local/cuda-12.2/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.2/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
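Appending `export` lines works, but each time they are added or re-sourced the same directory is prepended again, and an initially empty `LD_LIBRARY_PATH` ends up with a trailing colon (an empty entry, which the dynamic loader treats as the current directory). A minimal sketch of an idempotent alternative follows; `path_prepend` is a hypothetical helper name, and `/usr/local/cuda-12.2` assumes the default install prefix used above.

```shell
# Print DIR prepended to a colon-separated LIST, unless DIR is already in it.
# Usage: path_prepend DIR LIST   (hypothetical helper)
path_prepend() {
    case ":$2:" in
        *":$1:"*) printf '%s\n' "$2" ;;      # already present: leave the list unchanged
        "::")     printf '%s\n' "$1" ;;      # empty list: result is just DIR, no stray colon
        *)        printf '%s\n' "$1:$2" ;;   # otherwise put DIR in front
    esac
}

# Usage in a shell startup file:
#   export PATH="$(path_prepend /usr/local/cuda-12.2/bin "$PATH")"
#   export LD_LIBRARY_PATH="$(path_prepend /usr/local/cuda-12.2/lib64 "${LD_LIBRARY_PATH:-}")"
#   command -v nvcc && nvcc --version    # confirm the compiler is now on PATH
```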

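3. Building deviceQuery

Since CUDA 11.6 the samples are no longer installed with the Toolkit (earlier releases placed them under `/usr/local/cuda-<version>/samples`). For CUDA 12.2, fetch the source from NVIDIA's cuda-samples repository on GitHub, using the tag that matches your Toolkit:

git clone --depth 1 --branch v12.2 https://github.com/NVIDIA/cuda-samples.git

Enter the sample directory and build:

cd cuda-samples/Samples/1_Utilities/deviceQuery
make

On success, the `deviceQuery` executable is created in the same directory.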

4. Running deviceQuery

Run:

./deviceQuery

Under normal conditions you will see output similar to the following:

./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 2 CUDA Capable device(s)
Device 0: "NVIDIA A100-SXM4-80GB"
  CUDA Driver Version / Runtime Version          12.2 / 12.2
  CUDA Capability Major/Minor version number:    8.0
  Total amount of global memory:                 81051 MBytes (84997303040 bytes)
  (108) Multiprocessors, (064) CUDA Cores/MP:    6912 CUDA Cores
  GPU Max Clock rate:                            1410 MHz (1.41 GHz)
  Memory Clock rate:                             1593 Mhz
  Memory Bus Width:                              5120-bit
  L2 Cache Size:                                 41943040 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        167936 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor: 2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 4 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Device 1: "NVIDIA A100-SXM4-80GB"
  ... (same as above, omitted) ...
Result = PASS
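On a multi-GPU server the full report is long, so for automated checks a one-line summary is often enough. The sketch below condenses the report to the device count, the model names, and the final verdict, and returns a non-zero status unless the verdict is PASS. `dq_summary` is a hypothetical helper name, and the patterns assume the output layout shown above.

```shell
# Summarise deviceQuery output read on stdin.
# Prints: devices=<n> names=[...] result=<PASS|FAIL|UNKNOWN>
# Exit status is 0 only when the report ends with "Result = PASS".
dq_summary() {
    awk '
        /^Detected [0-9]+ CUDA Capable device/ { count = $2 }
        /^Device [0-9]+: "/ {
            name = $0
            sub(/^Device [0-9]+: "/, "", name)    # drop the Device N: prefix and opening quote
            sub(/"[ \t\r]*$/, "", name)           # drop the closing quote
            names = names (names ? ", " : "") name
        }
        /^Result = / { result = $3 }
        END {
            printf "devices=%d names=[%s] result=%s\n", count + 0, names, (result ? result : "UNKNOWN")
            exit (result == "PASS" ? 0 : 1)
        }
    '
}

# Usage:
#   ./deviceQuery | dq_summary || echo "GPU environment check failed"
```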

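5. Reading the output

CUDA Driver Version / Runtime Version: the highest CUDA version the installed driver supports, and the version of the CUDA runtime the program was built against. As a rule of thumb, the driver's version should be at least as new as the runtime's; if the driver is too old, CUDA programs fail to start.

CUDA Capability Major/Minor version number: the compute capability (8.0, for example, corresponds to the Ampere architecture), which determines which CUDA features are available.

Total amount of global memory: the GPU's memory (VRAM) size. Deep learning frameworks typically need plenty of it.

CUDA Cores: the number of streaming processors, equal to the multiprocessor (SM) count times the cores per SM (108 × 64 = 6912 in the output above). More cores generally mean higher throughput, but the figure is only directly comparable between GPUs of the same architecture.

Maximum number of threads per block: the upper limit on threads in a single block (1024 here), which constrains how you choose kernel launch dimensions.

Concurrent copy and kernel execution: whether the GPU can overlap data transfers with kernel execution, which improves overall program efficiency.

Device supports Unified Addressing (UVA): whether host and device memory share a single virtual address space, which simplifies memory management.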
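The CUDA core total is simply the SM count multiplied by the cores per SM, and both factors appear on the `Multiprocessors` line of the report. The sketch below recomputes the product from that line so it can be compared with the printed total; `dq_cores` is a hypothetical helper name. It uses awk for the arithmetic because the zero-padded `064` would be read as an octal number by shell arithmetic such as `$((064))`.

```shell
# For each "Multiprocessors" line of deviceQuery output on stdin, print the
# SM count, the cores per SM, their product, and the total the tool itself printed.
dq_cores() {
    awk '
        /Multiprocessors,.*CUDA Cores\/MP:/ {
            line = $0
            gsub(/[^0-9]+/, " ", line)    # keep only the runs of digits
            split(line, f, " ")           # f[1]=SMs, f[2]=cores per SM, f[3]=printed total
            # awk converts "064" to decimal 64, so the product is correct.
            printf "sms=%d cores_per_sm=%d total=%d printed=%d\n", f[1], f[2], f[1] * f[2], f[3]
        }
    '
}

# Usage:
#   ./deviceQuery | dq_cores
```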

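6. Common problems and troubleshooting

Problem 1: deviceQuery cannot be found

The CUDA Toolkit may not be installed, or the path may be wrong. With CUDA 11.5 and earlier, check `/usr/local/cuda/samples` or reinstall the Toolkit; from CUDA 11.6 on, the samples are not installed with the Toolkit and must be obtained from NVIDIA's cuda-samples repository on GitHub and built.

Problem 2: the run ends with Result = FAIL

This usually means the driver is not installed correctly or the GPU does not support CUDA. Use `nvidia-smi` to check the driver status, and reinstall the driver if necessary.

Problem 3: no permission to run

Inside Docker, the container must be started with the `--gpus all` option. On the host, the cause may be insufficient permissions on `/dev/nvidia*`; run with `sudo` or adjust the udev rules.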
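To tell a permission problem apart from a missing driver, it can help to check whether the device nodes exist at all and whether the current user can open them. The following is a minimal sketch; `gpu_node_check` is a hypothetical helper name, and it only tests read/write access on files matching `nvidia*` in the given directory.

```shell
# Check NVIDIA device nodes under DIR (default: /dev).
# Returns 0 if all are readable and writable, 1 if some are not,
# and 2 if no nvidia* nodes exist (driver not loaded, or GPUs not passed to the container).
gpu_node_check() {
    dir=${1:-/dev}
    found=0
    bad=0
    for node in "$dir"/nvidia*; do
        [ -e "$node" ] || continue            # the glob matched nothing
        found=$((found + 1))
        if [ ! -r "$node" ] || [ ! -w "$node" ]; then
            echo "no access: $node"
            bad=$((bad + 1))
        fi
    done
    if [ "$found" -eq 0 ]; then
        echo "no nvidia device nodes under $dir"
        return 2
    fi
    echo "checked $found node(s), $bad inaccessible"
    [ "$bad" -eq 0 ]
}

# Usage:
#   gpu_node_check            # on the host or inside a container
#   ls -l /dev/nvidia*        # shows owner, group and mode of each node
```

For containers, a quick end-to-end test is `docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi`. The image tag is an assumption chosen to match the CUDA version used in this article, and the `--gpus` option requires the NVIDIA Container Toolkit on the host.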

Problem 4: version mismatch

A driver that is too old cannot support a newer CUDA version; upgrade the driver.
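A version mismatch can be detected from the `CUDA Driver Version / Runtime Version` line of the report. The sketch below extracts the two numbers and applies a deliberately conservative rule: the driver-supported version must be at least the runtime version. CUDA's minor-version compatibility may allow a slightly older driver within the same major release, so treat a failure here as a prompt to investigate rather than a definitive verdict. `dq_versions` and `cuda_version_ok` are hypothetical helper names.

```shell
# Print "<driver> <runtime>" taken from the deviceQuery version line on stdin.
dq_versions() {
    sed -n 's|.*CUDA Driver Version / Runtime Version *\([0-9.]*\) */ *\([0-9.]*\).*|\1 \2|p' | head -n 1
}

# Succeed when DRIVER >= RUNTIME, both given as "major.minor" (e.g. 12.2).
cuda_version_ok() {
    awk -v d="$1" -v r="$2" 'BEGIN {
        split(d, dv, "."); split(r, rv, ".")
        # Compare as major*1000 + minor so that 12.10 ranks above 12.2.
        exit ((dv[1] * 1000 + dv[2]) >= (rv[1] * 1000 + rv[2]) ? 0 : 1)
    }'
}

# Usage:
#   set -- $(./deviceQuery | dq_versions)
#   cuda_version_ok "$1" "$2" || echo "driver supports CUDA $1, runtime is $2: consider upgrading the driver"
```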

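7. Summary

With `deviceQuery` we can quickly verify that the GPUs on a server are working properly and obtain detailed hardware information, laying the groundwork for CUDA programming or for deep learning frameworks such as PyTorch and TensorFlow. Make sure `deviceQuery` prints `Result = PASS` before you start your GPU computing journey!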
