Server nvidia-smi Error: Device Not Found

Unable to determine the device handle for GPU0000:C2:00.0: Unknown Error

Posted by ThreeStones1029 on May 29, 2024

[TOC]

1. Problem

This morning, running nvidia-smi on the server returned the following error:

Unable to determine the device handle for GPU0000:C2:00.0: Unknown Error

Roughly speaking, this means the driver can no longer find this card.

2. Problem Analysis and Solution

2.1. Check the Attached GPUs

First, check whether the system can still see the attached devices:

lspci | grep -i nvidia

The output is as follows:

01:00.0 VGA compatible controller: NVIDIA Corporation Device 2684 (rev a1)
01:00.1 Audio device: NVIDIA Corporation Device 22ba (rev a1)
81:00.0 VGA compatible controller: NVIDIA Corporation Device 2684 (rev a1)
81:00.1 Audio device: NVIDIA Corporation Device 22ba (rev a1)
c1:00.0 VGA compatible controller: NVIDIA Corporation Device 2684 (rev a1)
c1:00.1 Audio device: NVIDIA Corporation Device 22ba (rev a1)
c2:00.0 VGA compatible controller: NVIDIA Corporation Device 2684 (rev ff)
c2:00.1 Audio device: NVIDIA Corporation Device 22ba (rev ff)

We can see that card 0000:C2:00.0 has dropped off the bus: its entries show rev ff, meaning the device is no longer responding.
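
As an optional cross-check (my addition, not in the original post), you can inspect that slot directly; a healthy card prints readable link and capability information, while a card that has fallen off the bus usually shows little more than unreadable 0xff config-space values:

sudo lspci -vv -s c2:00.0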

2.2. Generate a Log File

Next, generate an nvidia-bug-report log file:

sudo nvidia-bug-report.sh

This creates nvidia-bug-report.log.gz in the current directory. After decompressing it, search the log with:

grep "fallen off" nvidia-bug-report.log

The output is as follows:

6月 28 02:52:53 jjf-Super-Server kernel: NVRM: Xid (PCI:0000:c2:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
6月 28 02:52:53 jjf-Super-Server kernel: NVRM: GPU 0000:c2:00.0: GPU has fallen off the bus.
[1590188.789226] NVRM: Xid (PCI:0000:c2:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[1590188.789232] NVRM: GPU 0000:c2:00.0: GPU has fallen off the bus.
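
As a side note (my addition), the compressed report can also be searched without unpacking it, and the same Xid messages appear in the kernel log:

zgrep "fallen off" nvidia-bug-report.log.gz
sudo dmesg | grep -i xid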

The Xid error code is 79. From what I found online, this is usually caused by insufficient power or overheating. Since another card was still running a job, I wanted to check the temperatures, but nvidia-smi itself would not run.

An experienced user pointed out the likely cause:

One of the gpus is shutting down. Since it’s not always the same one, I guess they’re not damaged but either overheating or lack of power occurs. Please monitor temperatures, check PSU.
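
Following that advice, the host-side temperatures (CPU, motherboard) can also be checked with lm-sensors while the NVIDIA tools are unusable (my addition; assumes a Debian/Ubuntu system, and sensors-detect may need to be run once after installation):

sudo apt install lm-sensors
sensors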

2.3. Disable the Faulty Card and Check Temperatures

So first, drain the card that is reporting the error:

sudo nvidia-smi drain -p 0000:C2:00.0 -m 1

This command breaks down as follows:

  • drain marks the specified GPU as drained, so that no new work is scheduled on it
  • -p 0000:C2:00.0 is the PCI address of the GPU to drain (here, the card that dropped off the bus)
  • -m sets the drain state:
    • 0: disable draining (return the GPU to normal operation)
    • 1: enable draining; no new work is assigned to the GPU, but work already running on it is allowed to finish

The drain succeeds:

Successfully set GPU 00000000:C2:00.0 drain state to: draining.
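
If the card needs to be put back into service later (for example, after the power issue is fixed), the drain state can be cleared again with -m 0 as described above:

sudo nvidia-smi drain -p 0000:C2:00.0 -m 0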

Running nvidia-smi again now works, and only three cards are listed (there were originally four):

Fri Jun 28 10:05:54 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:01:00.0 Off |                  Off |
|  0%   37C    P8             13W /  450W |      12MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        Off |   00000000:81:00.0 Off |                  Off |
|  0%   40C    P8             17W /  450W |     105MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 4090        Off |   00000000:C1:00.0 Off |                  Off |
| 30%   64C    P2            386W /  450W |   22990MiB /  24564MiB |     98%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      3267      G   /usr/lib/xorg/Xorg                              4MiB |
|    1   N/A  N/A      3267      G   /usr/lib/xorg/Xorg                             81MiB |
|    1   N/A  N/A      3787      G   /usr/bin/gnome-shell                           12MiB |
|    2   N/A  N/A      3267      G   /usr/lib/xorg/Xorg                              4MiB |
|    2   N/A  N/A    569947      C   python                                      22972MiB |
+-----------------------------------------------------------------------------------------+

The temperatures look perfectly normal, so my guess is that the power supply is what caused this error.
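
To keep an eye on the remaining cards while the job runs (my addition, not part of the original troubleshooting), the temperature and power draw can be polled every few seconds:

watch -n 5 "nvidia-smi --query-gpu=index,temperature.gpu,power.draw,utilization.gpu --format=csv"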

2.4. Solution

One option is to lock the GPU core clock to the 300-1500 MHz range with the following command (which in effect limits the card's maximum power draw):

sudo nvidia-smi -lgc 300,1500

PS: I did not actually use this command, because I believe the root cause is a hardware (power supply) problem, so I simply rebooted the machine instead.
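
For reference (my addition rather than something done here), the board power limit can also be capped directly instead of locking clocks; the 300 W figure below is only an example, and the valid range for a given card should be checked first:

nvidia-smi -q -d POWER
sudo nvidia-smi -pl 300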

References

Troubleshooting a server GPU dropping out from overheating: Unable to determine the device handle for GPU 0000:01:00.0: Unknown Error

Solving the [Unable to determine the device handle for GPU…: Unknown Error] problem

Unable to determine the device handle for GPU 0000:02:00.0: Unknown Error