
Using Lite Cluster Resources

AI Development Platform ModelArts - Running Inference Tasks with Snt9B on a Lite Cluster Resource Pool: Procedure

Procedure

1. Pull the image. The test image used here is bert_pretrain_mindspore:v1, which already contains the test data and code.

    docker pull swr.cn-southwest-2.myhuaweicloud.com/os-public-repo/bert_pretrain_mindspore:v1
    docker tag swr.cn-southwest-2.myhuaweicloud.com/os-public-repo/bert_pretrain_mindspore:v1 bert_pretrain_mindspore:v1

2. Create a config.yaml file on the host.

   The config.yaml file configures the pod. This example starts the pod with a sleep command so that you can enter the pod for debugging. You can also change command to the actual task start command (for example, "python inference.py"); the task then runs once the container has started.

   The content of config.yaml is as follows:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: yourapp
      labels:
        app: infers
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: infers
      template:
        metadata:
          labels:
            app: infers
        spec:
          schedulerName: volcano
          nodeSelector:
            accelerator/huawei-npu: ascend-1980
          containers:
          - image: bert_pretrain_mindspore:v1   # Inference image name
            imagePullPolicy: IfNotPresent
            name: mindspore
            command:
            - "sleep"
            - "1000000000000000000"
            resources:
              requests:
                # Number of required NPUs; keep the key unchanged. The maximum value is 16.
                # You can add lines below to configure resources such as memory and CPU.
                huawei.com/ascend-1980: "1"
              limits:
                # NPU limit; keep the key unchanged. The value must be consistent with that in requests.
                huawei.com/ascend-1980: "1"
            volumeMounts:
            - name: ascend-driver     # Driver mount; leave unchanged
              mountPath: /usr/local/Ascend/driver
            - name: ascend-add-ons    # Driver mount; leave unchanged
              mountPath: /usr/local/Ascend/add-ons
            - name: hccn              # Driver hccn configuration; leave unchanged
              mountPath: /etc/hccn.conf
            - name: npu-smi           # npu-smi
              mountPath: /usr/local/sbin/npu-smi
            - name: localtime         # The container time must be the same as the host time.
              mountPath: /etc/localtime
          volumes:
          - name: ascend-driver
            hostPath:
              path: /usr/local/Ascend/driver
          - name: ascend-add-ons
            hostPath:
              path: /usr/local/Ascend/add-ons
          - name: hccn
            hostPath:
              path: /etc/hccn.conf
          - name: npu-smi
            hostPath:
              path: /usr/local/sbin/npu-smi
          - name: localtime
            hostPath:
              path: /etc/localtime

3. Create the pod from config.yaml.

    kubectl apply -f config.yaml

4. Check that the pod has started by running the following command. If the pod shows "1/1 Running", it has started successfully.

    kubectl get pod -A

5. Enter the container. Replace {pod_name} with the name of your pod (as shown by get pod) and {namespace} with your namespace (default by default).

    kubectl exec -it {pod_name} -n {namespace} -- bash

6. Activate the conda environment.

    su - ma-user                # Switch to the ma-user account
    conda activate MindSpore    # Activate the MindSpore environment

7. Create the test code test.py.

    from flask import Flask, request
    import json

    app = Flask(__name__)


    @app.route('/greet', methods=['POST'])
    def say_hello_func():
        print("----------- in hello func ----------")
        data = json.loads(request.get_data(as_text=True))
        print(data)
        username = data['name']
        rsp_msg = 'Hello, {}!'.format(username)
        return json.dumps({"response": rsp_msg}, indent=4)


    @app.route('/goodbye', methods=['GET'])
    def say_goodbye_func():
        print("----------- in goodbye func ----------")
        return '\nGoodbye!\n'


    @app.route('/', methods=['POST'])
    def default_func():
        print("----------- in default func ----------")
        data = json.loads(request.get_data(as_text=True))
        return '\n called default func !\n {} \n'.format(str(data))


    # host must be "0.0.0.0", port must be 8080
    if __name__ == '__main__':
        app.run(host="0.0.0.0", port=8080)

8. Run the code. This deploys an online service, as shown in the following figure. This container is the server.

    python test.py

   Figure 2 Deploying the online service

9. In XShell, open a new terminal and enter the container by referring to steps 5 to 7. This container is the client. Run the following commands to verify the three API endpoints of the custom image. If the output matches the figure, the service was called successfully.

    curl -X POST -H "Content-Type: application/json" --data '{"name":"Tom"}' 127.0.0.1:8080/
    curl -X POST -H "Content-Type: application/json" --data '{"name":"Tom"}' 127.0.0.1:8080/greet
    curl -X GET 127.0.0.1:8080/goodbye

   Figure 3 Accessing the online service

Note: When you configure CPU and memory under limits/requests, keep in mind that a single Snt9B node provides 8 Snt9B cards plus 192 CPU cores and 1536 GB of memory (192U1536G). Plan the values accordingly so that CPU and memory limits are not set too low for the task to run properly.
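The sizing note above can be turned into a quick calculation. The following is a minimal sketch, assuming CPU and memory are split in proportion to the number of NPU cards a pod requests and that part of the node is held back for the operating system and cluster components. Only the node capacity (8 cards, 192 cores, 1536 GB) comes from the text; the function name `plan_pod_resources`, the `reserve_ratio` parameter and its 10% default, and the choice of the `Gi` suffix for memory are illustrative assumptions.

```python
# Node capacity stated in the note above: 8 Snt9B cards + 192U1536G.
NODE_NPUS = 8
NODE_CPU_CORES = 192
NODE_MEMORY_GB = 1536


def plan_pod_resources(npus, reserve_ratio=0.1):
    """Suggest CPU/memory values for a pod that requests `npus` cards.

    Assumption (not from the source): resources are shared in proportion
    to the card count, and `reserve_ratio` of the node is left for the
    operating system and cluster components.
    """
    if not 1 <= npus <= NODE_NPUS:
        raise ValueError(f"npus must be between 1 and {NODE_NPUS}")
    if not 0 <= reserve_ratio < 1:
        raise ValueError("reserve_ratio must be in [0, 1)")

    share = npus / NODE_NPUS * (1 - reserve_ratio)
    cpu = int(NODE_CPU_CORES * share)        # whole cores, rounded down
    memory = int(NODE_MEMORY_GB * share)     # whole units, rounded down
    return {
        "huawei.com/ascend-1980": str(npus),  # key used in config.yaml
        "cpu": str(cpu),
        # Assumption: the node's "1536g" is expressed with the Gi suffix.
        "memory": f"{memory}Gi",
    }


if __name__ == "__main__":
    print(plan_pod_resources(1))
```

With no reserve, one card maps to 192 / 8 = 24 cores and 1536 / 8 = 192 units of memory. The returned keys could be copied under both requests and limits in config.yaml, which keeps the two sections consistent as the YAML comments require.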
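The three curl checks in the verification step can also be scripted. The sketch below assumes the Flask service from test.py is listening on 127.0.0.1:8080 and builds the same three requests with Python's standard urllib module. The helper names `build_requests` and `call_all` are illustrative and not part of the original procedure.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8080"  # same address as in the curl commands


def build_requests(base_url=BASE_URL):
    """Return the three requests that mirror the curl commands."""
    body = json.dumps({"name": "Tom"}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return [
        urllib.request.Request(base_url + "/", data=body,
                               headers=headers, method="POST"),
        urllib.request.Request(base_url + "/greet", data=body,
                               headers=headers, method="POST"),
        urllib.request.Request(base_url + "/goodbye", method="GET"),
    ]


def call_all(base_url=BASE_URL, timeout=5):
    """Send each request in turn and return the decoded response bodies."""
    results = []
    for req in build_requests(base_url):
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            results.append(resp.read().decode("utf-8"))
    return results
```

Calling `call_all()` from the client terminal while test.py is running returns the three response bodies in order. It raises `urllib.error.URLError` if the service is not reachable, which makes a failed deployment easy to spot.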

AI Development Platform ModelArts - Repairing Nodes

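Repairing Nodes

The node repair function is currently in whitelist-based invitational testing. If you would like to try it, contact technical support.

If a hardware fault occurs on a resource pool node, you can view the faulty node on the "Node Management" tab of the resource pool details page. In the Operation column of that node, the "Repair" button under "More" becomes clickable. Click "Repair" to repair the node. After the repair completes, the node status changes to "Available".

Two repair methods are currently supported, "Part Replacement" and "Redeployment":
 - Part replacement: repairs the node in place by replacing hardware. The repair takes longer. For faults that do not involve local disks, local disk data can be retained.
 - Redeployment: repairs the node by replacing it with a new server. The repair takes less time, but local disk data is lost.

The instance cannot work during the repair, so make sure the services on the instance are offline. If services on the cloud server cannot be stopped, do not repair the node; contact technical support instead. If you choose redeployment, the instance is shut down immediately and migrated to a new server, and local disk data is erased. Migrate services and back up data in advance.

Figure 1 Repairing a node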
