Huawei Cloud UCS - Creating a GPU Application: Verifying GPU Virtualization Isolation

Time: 2024-12-19 08:52:02

Verifying GPU Virtualization Isolation

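After the workload has been created, you can verify the isolation provided by GPU virtualization.
  • Log in to the container and check the total GPU memory allocated to it.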
    kubectl exec -it gpu-app -- nvidia-smi
    Expected output:
    Wed Apr 12 07:54:59 2023
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  Tesla V100-SXM2...  Off  | 00000000:21:01.0 Off |                    0 |
    | N/A   27C    P0    37W / 300W |   4792MiB /  5000MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    +-----------------------------------------------------------------------------+

    The expected output shows that the container is allocated 5000 MiB of GPU memory in total, of which 4792 MiB is in use.

  • Check GPU memory isolation on the node where the Pod runs (run this command on the node).
    export PATH=$PATH:/usr/local/nvidia/bin;nvidia-smi

    Expected output:

    Wed Apr 12 09:31:10 2023
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  Tesla V100-SXM2...  Off  | 00000000:21:01.0 Off |                    0 |
    | N/A   27C    P0    37W / 300W |   4837MiB / 16160MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |    0   N/A  N/A    760445      C   python                           4835MiB |
    +-----------------------------------------------------------------------------+

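    The expected output shows that the GPU on the node has 16160 MiB of memory in total, of which 4837 MiB is in use; the python process of the example Pod accounts for 4835 MiB of that.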
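
The comparison between the two outputs can also be scripted. The following is a minimal sketch, not part of UCS: `gpu_mem_usage` and `check_isolation` are hypothetical helper names introduced here for illustration, and the script assumes the Memory-Usage column keeps the `<used>MiB / <total>MiB` layout shown in the expected outputs. The two sample rows are copied from those outputs so the script runs without a cluster; on a live system you would pipe in the output of the two `nvidia-smi` commands instead.

```shell
#!/bin/sh
# Sketch only: gpu_mem_usage and check_isolation are hypothetical helpers,
# not part of UCS, kubectl, or nvidia-smi.

# Read nvidia-smi table output on stdin and print "used total" (in MiB)
# for each GPU row, taken from the Memory-Usage column.
gpu_mem_usage() {
    sed -n 's/.*| *\([0-9][0-9]*\)MiB *\/ *\([0-9][0-9]*\)MiB *|.*/\1 \2/p'
}

# check_isolation CONTAINER_TOTAL NODE_TOTAL
# Succeed when the container-visible total is strictly below the node total.
check_isolation() {
    if [ "$1" -lt "$2" ]; then
        echo "isolated: container sees $1 of $2 MiB ($(( $1 * 100 / $2 ))%)"
    else
        echo "not isolated: container sees $1 MiB, node has $2 MiB"
        return 1
    fi
}

# Sample rows copied from the two expected outputs. On a live cluster,
# replace them with the real output of the two nvidia-smi commands.
container_row='| N/A   27C    P0    37W / 300W |   4792MiB /  5000MiB |      0%      Default |'
node_row='| N/A   27C    P0    37W / 300W |   4837MiB / 16160MiB |      0%      Default |'

container_pair=$(printf '%s\n' "$container_row" | gpu_mem_usage)
node_pair=$(printf '%s\n' "$node_row" | gpu_mem_usage)

# Keep only the second field (the total) of each "used total" pair.
container_total=${container_pair#* }
node_total=${node_pair#* }

result=$(check_isolation "$container_total" "$node_total")
echo "$result"
```

With the sample rows, the script prints `isolated: container sees 5000 of 16160 MiB (30%)`. The percentage uses integer division, so it is rounded down.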

support.huaweicloud.com/usermanual-ucs/ucs_01_0298.html