AI Development Platform ModelArts - Running Inference Tasks with Snt9B on a Lite Cluster Resource Pool: Procedure

Time: 2024-11-21 19:45:02


Procedure

Pull the image. The test image is bert_pretrain_mindspore:v1, which already contains the test data and code.

docker pull swr.cn-southwest-2.myhuaweicloud.com/os-public-repo/bert_pretrain_mindspore:v1
docker tag swr.cn-southwest-2.myhuaweicloud.com/os-public-repo/bert_pretrain_mindspore:v1 bert_pretrain_mindspore:v1

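Create a config.yaml file on the host.

The config.yaml file configures the pod. This example starts the pod with the sleep command so that you can enter the pod for debugging. You can also change command to the startup command of your task (for example, "python inference.py"); the task then runs once the container starts.

The content of config.yaml is as follows: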

apiVersion: apps/v1
kind: Deployment
metadata:
  name: yourapp
  labels:
    app: infers
spec:
  replicas: 1
  selector:
    matchLabels:
      app: infers
  template:
    metadata:
      labels:
        app: infers
    spec:
      schedulerName: volcano
      nodeSelector:
        accelerator/huawei-npu: ascend-1980
      containers:
      - image: bert_pretrain_mindspore:v1                  # Inference image name
        imagePullPolicy: IfNotPresent
        name: mindspore
        command:
        - "sleep"
        - "1000000000000000000"
        resources:
          requests:
            huawei.com/ascend-1980: "1"             # Number of NPUs requested; keep the key unchanged. The maximum value is 16. You can add lines below to configure resources such as memory and CPU.
          limits:
            huawei.com/ascend-1980: "1"             # NPU limit; keep the key unchanged. The value must be consistent with that in requests.
        volumeMounts:
        - name: ascend-driver               # Driver mount; do not modify.
          mountPath: /usr/local/Ascend/driver
        - name: ascend-add-ons              # Driver mount; do not modify.
          mountPath: /usr/local/Ascend/add-ons
        - name: hccn                        # hccn configuration for the driver; do not modify.
          mountPath: /etc/hccn.conf
        - name: npu-smi                     # npu-smi tool
          mountPath: /usr/local/sbin/npu-smi
        - name: localtime                   # The container time must be the same as the host time.
          mountPath: /etc/localtime
      volumes:
      - name: ascend-driver
        hostPath:
          path: /usr/local/Ascend/driver
      - name: ascend-add-ons
        hostPath:
          path: /usr/local/Ascend/add-ons
      - name: hccn
        hostPath:
          path: /etc/hccn.conf
      - name: npu-smi
        hostPath:
          path: /usr/local/sbin/npu-smi
      - name: localtime
        hostPath:
          path: /etc/localtime
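
The comments in the resources section say that the NPU count under limits must match the count under requests. A minimal Python sketch of that check follows, assuming the resources section has already been loaded into a plain dict. The helper name `check_npu_resources` is hypothetical and is not part of any ModelArts or Kubernetes tooling.

```python
NPU_KEY = "huawei.com/ascend-1980"  # resource key taken from the config.yaml above


def check_npu_resources(resources: dict) -> int:
    """Return the requested NPU count, or raise ValueError if the section is inconsistent.

    Hypothetical helper for illustration only.
    """
    requested = resources.get("requests", {}).get(NPU_KEY)
    limit = resources.get("limits", {}).get(NPU_KEY)
    if requested is None or limit is None:
        raise ValueError(f"{NPU_KEY} must appear in both requests and limits")
    if str(requested) != str(limit):
        raise ValueError(f"requests ({requested}) and limits ({limit}) differ for {NPU_KEY}")
    count = int(requested)
    if count < 1:
        raise ValueError("at least one NPU must be requested")
    return count


# Mirrors the resources section of the example config.yaml.
example = {
    "requests": {NPU_KEY: "1"},
    "limits": {NPU_KEY: "1"},
}
print(check_npu_resources(example))
```

Running such a check before `kubectl apply` is one way to catch a mismatch early; it does not replace validation by the cluster itself.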

Create the pod from config.yaml.
```
kubectl apply -f config.yaml
```
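Run the following command to check whether the pod has started. A READY value of "1/1" with a STATUS of "Running" means the pod started successfully.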
```
kubectl get pod -A
```
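As a rough illustration of what to look for in that output, the Python sketch below parses text in the column layout printed by `kubectl get pod -A` and reports which pods are fully ready and Running. The function name `ready_pods` and the sample pod names are hypothetical; the sample text is illustrative, not captured from a real cluster.

```python
def ready_pods(kubectl_output: str) -> dict:
    """Map "namespace/name" to True when READY is n/n and STATUS is Running."""
    lines = [line for line in kubectl_output.splitlines() if line.strip()]
    if not lines:
        return {}
    header = lines[0].split()
    # Locate the columns by name instead of assuming fixed positions.
    idx = {col: header.index(col) for col in ("NAMESPACE", "NAME", "READY", "STATUS")}
    result = {}
    for line in lines[1:]:
        fields = line.split()
        ready, total = fields[idx["READY"]].split("/")
        ok = ready == total and fields[idx["STATUS"]] == "Running"
        result[f'{fields[idx["NAMESPACE"]]}/{fields[idx["NAME"]]}'] = ok
    return result


# Illustrative sample only; pod names are made up.
SAMPLE = """\
NAMESPACE   NAME                       READY   STATUS    RESTARTS   AGE
default     yourapp-5d8f7c9b6d-abcde   1/1     Running   0          2m
default     other-pod                  0/1     Pending   0          10s
"""

print(ready_pods(SAMPLE))
```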
Enter the container. Replace {pod_name} with the name of your pod (as shown by get pod) and {namespace} with your namespace (default by default).
```
kubectl exec -it {pod_name} -n {namespace} -- bash
```

Activate the conda environment.

su - ma-user               # Switch to the ma-user account
conda activate MindSpore   # Activate the MindSpore environment

Create the test code test.py.

from flask import Flask, request
import json 
app = Flask(__name__)

@app.route('/greet', methods=['POST'])
def say_hello_func():
    print("----------- in hello func ----------")
    data = json.loads(request.get_data(as_text=True))
    print(data)
    username = data['name']
    rsp_msg = 'Hello, {}!'.format(username)
    return json.dumps({"response":rsp_msg}, indent=4)

@app.route('/goodbye', methods=['GET'])
def say_goodbye_func():
    print("----------- in goodbye func ----------")
    return '\nGoodbye!\n'


@app.route('/', methods=['POST'])
def default_func():
    print("----------- in default func ----------")
    data = json.loads(request.get_data(as_text=True))
    return '\n called default func !\n {} \n'.format(str(data))

# host must be "0.0.0.0", port must be 8080
if __name__ == '__main__':
    app.run(host="0.0.0.0", port=8080)

Run the code. As shown in the figure below, this deploys an online service, and this container acts as the server.

python test.py

Figure 2 Deploying the online service

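Open a new terminal in XShell and follow steps 5 to 7 to enter the container; this container acts as the client. Run the following commands to verify the three API endpoints of the custom image. If the output matches the figure, the service was called successfully.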
```
curl -X POST -H "Content-Type: application/json" --data '{"name":"Tom"}'  127.0.0.1:8080/
curl -X POST -H "Content-Type: application/json" --data '{"name":"Tom"}' 127.0.0.1:8080/greet
curl -X GET 127.0.0.1:8080/goodbye
```
Figure 3 Accessing the online service
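
The same three calls can also be made from Python inside the client container. The sketch below uses only the standard library and assumes the test.py service is listening on 127.0.0.1:8080, as in the curl commands. The names `build_requests` and `call_all` are hypothetical helpers introduced for illustration.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8080"  # same address and port as the curl commands


def build_requests(base_url: str = BASE_URL) -> list:
    """Build the three requests: POST /, POST /greet, GET /goodbye."""
    body = json.dumps({"name": "Tom"}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return [
        urllib.request.Request(f"{base_url}/", data=body, headers=headers, method="POST"),
        urllib.request.Request(f"{base_url}/greet", data=body, headers=headers, method="POST"),
        urllib.request.Request(f"{base_url}/goodbye", method="GET"),
    ]


def call_all(base_url: str = BASE_URL) -> None:
    """Send each request and print the response body. Requires test.py to be running."""
    for req in build_requests(base_url):
        with urllib.request.urlopen(req, timeout=5) as resp:
            print(req.get_method(), req.full_url, "->", resp.read().decode("utf-8"))
```

Call `call_all()` once the server from test.py is up; if the service is not reachable, `urlopen` raises `urllib.error.URLError`.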

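Configure CPU and memory sizes under limits and requests. A single Snt9B node provides 8 Snt9B cards plus 192 vCPUs and 1536 GB of memory (192u1536g). Plan allocations accordingly so that CPU or memory limits that are too small do not prevent the task from running.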
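
One simple way to reason about sizing is to split the node's CPU and memory in proportion to the number of cards a pod requests. The Python sketch below does that arithmetic from the node figures above. The function name `proportional_share` is hypothetical, and proportional splitting is an assumption made for illustration, not a ModelArts rule. The node also needs resources for system components, so treat the results as upper bounds rather than recommended values.

```python
NODE_NPUS = 8           # Snt9B cards per node (from the text above)
NODE_CPUS = 192         # vCPUs per node
NODE_MEMORY_GB = 1536   # memory per node, in GB


def proportional_share(npus: int) -> dict:
    """Return the CPU and memory share proportional to the requested card count."""
    if not 1 <= npus <= NODE_NPUS:
        raise ValueError(f"npus must be between 1 and {NODE_NPUS}")
    return {
        "cpu": NODE_CPUS * npus // NODE_NPUS,
        "memory_gb": NODE_MEMORY_GB * npus // NODE_NPUS,
    }


print(proportional_share(1))  # one card: 192 / 8 vCPUs and 1536 / 8 GB
```

For the single-card pod in the example config.yaml, this gives a ceiling of 24 vCPUs and 192 GB.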