MPS 介绍
MPS(Multi-Process Service) 是NVIDIA为CUDA设计的多进程并发执行机制,允许多个CPU进程共享同一GPU的CUDA Context,从而突破默认单进程独占GPU的限制,实现多个进程的CUDA Kernel真正并行执行。
在默认情况下,GPU采用时间片轮转调度,不同进程的CUDA任务会交替运行,并伴随高昂的上下文切换开销。而MPS通过合并多个进程的上下文,将它们的Kernel交织发射到GPU,减少切换成本并提升GPU利用率。
参考文档
配置 NVIDIA MPS 环境
配置 MPS 环境之前需要安装好 NVIDIA 驱动和 cloudpods 云平台。然后在容器主机宿主机上配置 MPS 相关组件。
安装 yunion-mps-daemon 服务
# rpm base 系统
$ yum install -y yunion-mps-daemon
# deb base 系统
$ apt-get install -y yunion-mps-daemon
修改 mps 配置
# 修改 host.conf 启用 MPS 并且指定 host_container_device.yml 配置文件
$ vi /etc/yunin/host.conf
enable_cuda_mps: true
container_device_config_file: /etc/yunion/host_container_device.yml
# 修改 host_container_device.yml 配置 MPS
# 这里的 path 为 NVIDIA GPU 对应的渲染设备
# type 修改为 NVIDIA_MPS,默认添加的 AI 节点 type 为 NVIDIA_GPU_SHARE
# virtual_number 是要使用 MPS 虚拟设备数量,单个设备可以显存为 memory.total / virtual_number
$ vi /etc/yunion/host_container_device.yml
devices:
- path: "/dev/dri/renderD128"
type: "NVIDIA_MPS"
virtual_number: 4
启用 yunion-mps-daemon 服务
$ systemctl enable --now yunion-mps-daemon
启用成功后查询 gpu compute_mode,预期为 Exclusive_Process
$ nvidia-smi --query-gpu=gpu_uuid,gpu_name,gpu_bus_id,memory.total,compute_mode --format=csv
uuid, name, pci.bus_id, memory.total [MiB], compute_mode
GPU-76aef7ff-372d-2432-b4b4-beca4d8d3400, Tesla P40, 00000000:00:06.0, 23040 MiB, Exclusive_Process
重启 host 服务
$ kubectl rollout restart ds -n onecloud default-host
# 等待 host 服务启动成功后查看 mps 设备是否注册
$ /opt/yunion/bin/climc isolated-device-list
+--------------------------------------+------------+-----------+-----------+------------------+--------------------------------------+-----------+------------------------------------------+----------------------------------------------------------------------------------------------+-------+--------------+
| ID | Dev_type | Model | Addr | Vendor_device_id | Host_id | numa_node | Device_path | PCIE_Info | Index | Device_minor |
+--------------------------------------+------------+-----------+-----------+------------------+--------------------------------------+-----------+------------------------------------------+----------------------------------------------------------------------------------------------+-------+--------------+
| 604b59e6-745b-4834-8c19-229af27c4333 | NVIDIA_MPS | Tesla P40 | 00:06.0-1 | 10de:1b38 | 197c1d98-316a-42e6-889d-68497153cc82 | -1 | GPU-76aef7ff-372d-2432-b4b4-beca4d8d3400 | {"lane_width":16,"throughput":"15.76 GB/s","transfer_rate_per_lane":"8GT/s","version":"3.0"} | -1 | -1 |
| 6736e951-c00d-4ef4-82bc-68efc0e04756 | NVIDIA_MPS | Tesla P40 | 00:06.0-2 | 10de:1b38 | 197c1d98-316a-42e6-889d-68497153cc82 | -1 | GPU-76aef7ff-372d-2432-b4b4-beca4d8d3400 | {"lane_width":16,"throughput":"15.76 GB/s","transfer_rate_per_lane":"8GT/s","version":"3.0"} | -1 | -1 |
| b8cbd75d-479a-41b8-8010-8221b09d5599 | NVIDIA_MPS | Tesla P40 | 00:06.0-3 | 10de:1b38 | 197c1d98-316a-42e6-889d-68497153cc82 | -1 | GPU-76aef7ff-372d-2432-b4b4-beca4d8d3400 | {"lane_width":16,"throughput":"15.76 GB/s","transfer_rate_per_lane":"8GT/s","version":"3.0"} | -1 | -1 |
| 2a90a0c1-7c3b-4b8f-848d-7aa596fa5cf2 | NVIDIA_MPS | Tesla P40 | 00:06.0-0 | 10de:1b38 | 197c1d98-316a-42e6-889d-68497153cc82 | -1 | GPU-76aef7ff-372d-2432-b4b4-beca4d8d3400 | {"lane_width":16,"throughput":"15.76 GB/s","transfer_rate_per_lane":"8GT/s","version":"3.0"} | -1 | -1 |
+--------------------------------------+------------+-----------+-----------+------------------+--------------------------------------+-----------+------------------------------------------+----------------------------------------------------------------------------------------------+-------+--------------+
*** Total: 4 Pages: 1 Limit: 20 Offset: 0 Page: 1 ***