ModelArts AI Development Platform - Running Inference Tasks with Snt9B on a Lite Cluster Resource Pool: Procedure

Time: 2024-11-20 09:05:57

Procedure

  1. Pull the image. The test image is bert_pretrain_mindspore:v1, which already contains the test data and code.

    docker pull swr.cn-southwest-2.myhuaweicloud.com/os-public-repo/bert_pretrain_mindspore:v1
    docker tag swr.cn-southwest-2.myhuaweicloud.com/os-public-repo/bert_pretrain_mindspore:v1 bert_pretrain_mindspore:v1

  2. Create a config.yaml file on the host.

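    The config.yaml file configures the pod. This example starts the pod with the sleep command so that you can enter it for debugging. You can also change command to the startup command of your task (for example, "python inference.py"), in which case the task runs once the container has started.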

    The content of config.yaml is as follows:
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: yourapp
      labels:
        app: infers
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: infers
      template:
        metadata:
          labels:
            app: infers
        spec:
          schedulerName: volcano
          nodeSelector:
            accelerator/huawei-npu: ascend-1980
          containers:
          - image: bert_pretrain_mindspore:v1                  # Inference image name
            imagePullPolicy: IfNotPresent
            name: mindspore
            command:
            - "sleep"
            - "1000000000000000000"
            resources:
              requests:
                huawei.com/ascend-1980: "1"             # Number of NPUs requested; keep the key unchanged. The maximum value is 16. You can add lines below to configure resources such as memory and CPU.
              limits:
                huawei.com/ascend-1980: "1"             # NPU limit; keep the key unchanged. The value must be consistent with that in requests.
            volumeMounts:
            - name: ascend-driver               # Driver mount; do not change
              mountPath: /usr/local/Ascend/driver
            - name: ascend-add-ons              # Driver mount; do not change
              mountPath: /usr/local/Ascend/add-ons
            - name: hccn                        # hccn driver configuration; do not change
              mountPath: /etc/hccn.conf
            - name: npu-smi                     # npu-smi tool
              mountPath: /usr/local/sbin/npu-smi
            - name: localtime                   # The container time must be the same as the host time.
              mountPath: /etc/localtime
          volumes:
          - name: ascend-driver
            hostPath:
              path: /usr/local/Ascend/driver
          - name: ascend-add-ons
            hostPath:
              path: /usr/local/Ascend/add-ons
          - name: hccn
            hostPath:
              path: /etc/hccn.conf
          - name: npu-smi
            hostPath:
              path: /usr/local/sbin/npu-smi
          - name: localtime
            hostPath:
              path: /etc/localtime
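
The comments in config.yaml say that the NPU count in limits must match the one in requests and that the maximum is 16. As a rough illustration of those two rules, the sketch below checks a resources mapping before you apply the file. The helper name check_npu_resources and the plain-dict input are assumptions made for this example; they are not part of kubectl or of any Huawei tool.

```python
# Sketch only: check_npu_resources is a hypothetical helper for illustration.
NPU_KEY = "huawei.com/ascend-1980"  # resource key used in config.yaml
MAX_NPUS = 16                       # upper bound stated in the config.yaml comment


def check_npu_resources(resources):
    """Return a list of problems found in a container's resources mapping."""
    problems = []
    requested = resources.get("requests", {}).get(NPU_KEY)
    limited = resources.get("limits", {}).get(NPU_KEY)
    if requested is None or limited is None:
        problems.append("NPU key missing from requests or limits")
        return problems
    if requested != limited:
        problems.append("requests and limits differ")
    if not 1 <= int(requested) <= MAX_NPUS:
        problems.append("NPU count outside 1..16")
    return problems


# The values below mirror the resources section of config.yaml.
resources = {"requests": {NPU_KEY: "1"}, "limits": {NPU_KEY: "1"}}
print(check_npu_resources(resources))
```

An empty list means neither rule is violated. The check covers only the two rules written in the comments; it does not confirm that the cluster actually has that many free NPUs.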

  3. Create the pod from config.yaml.

    kubectl apply -f config.yaml

  4. Check whether the pod has started by running the following command. A pod shown as "1/1 Running" has started successfully.

    kubectl get pod -A
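
If you would rather check the status from a script than read the table, one option is to request JSON with `kubectl get pod -A -o json` and inspect each pod's status. The sketch below is a minimal example of that idea. The function name ready_pods and the sample data (including the pod name yourapp-abc123) are made up for illustration, and the "phase is Running and every container is ready" test only approximates the "1/1 Running" display.

```python
import json


def ready_pods(kubectl_json):
    """Return names of pods whose phase is Running and whose containers are all ready."""
    names = []
    for item in json.loads(kubectl_json).get("items", []):
        status = item.get("status", {})
        containers = status.get("containerStatuses", [])
        if (status.get("phase") == "Running"
                and containers
                and all(c.get("ready") for c in containers)):
            names.append(item["metadata"]["name"])
    return names


# Hypothetical, heavily trimmed stand-in for `kubectl get pod -A -o json` output.
sample = json.dumps({
    "items": [
        {"metadata": {"name": "yourapp-abc123", "namespace": "default"},
         "status": {"phase": "Running", "containerStatuses": [{"ready": True}]}},
        {"metadata": {"name": "other-pod", "namespace": "default"},
         "status": {"phase": "Pending", "containerStatuses": []}},
    ]
})
print(ready_pods(sample))
```

In practice you would pass the real command output to ready_pods instead of the sample string.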

  5. Enter the container. Replace {pod_name} with your pod name (the name shown by get pod) and {namespace} with your namespace (default unless you specified another one).

    kubectl exec -it {pod_name} -n {namespace} -- bash

  6. Activate the conda environment.

    su - ma-user                # Switch to the ma-user user
    conda activate MindSpore    # Activate the MindSpore environment

  7. Create the test code test.py.

    from flask import Flask, request
    import json 
    app = Flask(__name__)
    
    @app.route('/greet', methods=['POST'])
    def say_hello_func():
        print("----------- in hello func ----------")
        data = json.loads(request.get_data(as_text=True))
        print(data)
        username = data['name']
        rsp_msg = 'Hello, {}!'.format(username)
        return json.dumps({"response":rsp_msg}, indent=4)
    
    @app.route('/goodbye', methods=['GET'])
    def say_goodbye_func():
        print("----------- in goodbye func ----------")
        return '\nGoodbye!\n'
    
    
    @app.route('/', methods=['POST'])
    def default_func():
        print("----------- in default func ----------")
        data = json.loads(request.get_data(as_text=True))
        return '\n called default func !\n {} \n'.format(str(data))
    
    # host must be "0.0.0.0", port must be 8080
    if __name__ == '__main__':
        app.run(host="0.0.0.0", port=8080)
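
    Run the code. As shown in the following figure, this deploys an online service, and this container acts as the server.

    python test.py

    Figure 2 Deploying the online service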

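  8. Open a new terminal in XShell and enter the container by following steps 5 to 7. This container acts as the client. Run the following commands to verify the three API endpoints of the custom image. If the output matches the figure, the service was called successfully.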

    curl -X POST -H "Content-Type: application/json" --data '{"name":"Tom"}'  127.0.0.1:8080/
    curl -X POST -H "Content-Type: application/json" --data '{"name":"Tom"}' 127.0.0.1:8080/greet
    curl -X GET 127.0.0.1:8080/goodbye
    Figure 3 Accessing the online service
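
The same three calls can also be made from Python, which may be convenient if you want to script the check. The sketch below uses only the standard library and mirrors the curl commands above. The function names build_requests and call_all are chosen for this example and do not come from the original test code.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8080"  # same address as in the curl commands


def build_requests(base_url=BASE_URL):
    """Build the three requests (/, /greet, /goodbye) without sending them."""
    body = json.dumps({"name": "Tom"}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return [
        urllib.request.Request(base_url + "/", data=body, headers=headers, method="POST"),
        urllib.request.Request(base_url + "/greet", data=body, headers=headers, method="POST"),
        urllib.request.Request(base_url + "/goodbye", method="GET"),
    ]


def call_all(base_url=BASE_URL):
    """Send each request and print the response body. Requires test.py to be running."""
    for req in build_requests(base_url):
        with urllib.request.urlopen(req, timeout=5) as resp:
            print(req.get_method(), req.full_url, resp.read().decode("utf-8"))


# Listing the requests does not need the server; call_all() does.
for req in build_requests():
    print(req.get_method(), req.full_url)
```

Call call_all() from inside the client container once test.py is serving on port 8080.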

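    When configuring CPU and memory under limits/requests, note that a single Snt9B node has 8 Snt9B cards plus 192 CPU cores and 1536 GB of memory (192u1536g). Plan the values accordingly so that overly small CPU or memory limits do not prevent the task from running properly.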
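
As a starting point for that planning, one simple approach is to give each pod a share of CPU and memory proportional to the number of cards it requests. The sketch below works through that arithmetic for the node size stated above. The proportional split is an assumption made for illustration, not a documented recommendation, and it ignores whatever the node reserves for system components, so treat the results as upper bounds.

```python
NODE_NPUS = 8          # Snt9B cards per node, from the note above
NODE_CPU_CORES = 192   # CPU cores per node
NODE_MEMORY_GB = 1536  # memory per node, in GB


def proportional_share(npus):
    """Return the CPU and memory share proportional to the number of cards requested."""
    if not 1 <= npus <= NODE_NPUS:
        raise ValueError("npus must be between 1 and 8 for a single node")
    return {
        "cpu": NODE_CPU_CORES * npus // NODE_NPUS,
        "memory_gb": NODE_MEMORY_GB * npus // NODE_NPUS,
    }


print(proportional_share(1))  # {'cpu': 24, 'memory_gb': 192}
```

For the single-card pod in this example, the proportional ceiling is therefore 24 cores and 192 GB.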

support.huaweicloud.com/usermanual-cluster-modelarts/umn-cluster-modelarts-0016.html