AI Development Platform ModelArts - Preparing the Image Environment: Step 6 Writing the config.yaml File

Time: 2025-01-03 09:39:07

Step 6 Writing the config.yaml File

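Kubernetes (k8s) offers two ways to manage objects:

  • Imperative: operate on objects directly with kubectl commands.
  • Declarative: operate on objects by defining resources in YAML files.

The config.yaml template below is for single-node training and configures the pod. Before training, replace each ${} placeholder with a real value as described in the parameter description. This template uses the SFS Turbo mount scheme.
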
apiVersion: v1
kind: ConfigMap
metadata:
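  name: configmap1980-vcjob               # Keep the "configmap1980-" prefix unchanged, followed by the vcjob name
  namespace: default                      # Any namespace; must be the same namespace as the vcjob below
  labels:
    ring-controller.cce: ascend-1980      # Do not change
data:                                     # Do not change the data content; after initialization, the volcano plugin updates it automatically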
  jobstart_hccl.json: |
    {
      "status":"initializing"
    }
---
apiVersion: batch.volcano.sh/v1alpha1 
kind: Job                              
metadata:
  name: vcjob                           # Job name; must correspond to the name in the ConfigMap
  namespace: default                    # Same namespace as the ConfigMap
  labels:
    ring-controller.cce: ascend-1980   # Do not change
    fault-scheduling: "force"
spec:
  minAvailable: 1
  schedulerName: volcano                # Do not change
  policies:
    - event: PodEvicted
      action: RestartJob
  plugins:
    configmap1980:
    - --rank-table-version=v2  # Do not change; generates a v2 rank table file
    env: []
    svc:
    - --publish-not-ready-addresses=true
  maxRetry: 5
  queue: default
  tasks:
  - name: main
    replicas: 1                 
    template:
      metadata:
        name: training
        labels:
          app: ascendspeed
          ring-controller.cce: ascend-1980  # Do not change
      spec:
        affinity:
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                    - key: volcano.sh/job-name
                      operator: In
                      values:
                        - vcjob
                topologyKey: kubernetes.io/hostname
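        hostNetwork: true                       # Use the host network mode
        containers:
        - image: ${image_name}            # Image address
          imagePullPolicy: IfNotPresent  # IfNotPresent: pull only if the image is not already on the host (the default unless the tag is :latest or omitted); Always: pull the image every time a pod is created; Never: never pull the image
          name: ${container_name}
          securityContext:                           # Run as root inside the container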
            allowPrivilegeEscalation: false
            runAsUser: 0
          env:
          - name: name
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
          - name: ip                             
            valueFrom:
              fieldRef:
                fieldPath: status.hostIP
          - name: framework
            value: "PyTorch"
          command: ["/bin/sh", "-c"]
          args:
          - ${command}
          resources:
            requests:
              huawei.com/ascend-1980: "8"                 # Number of cards requested; keep the key unchanged
              memory: ${requests_memory}                  # Minimum memory requested by the container
              cpu: ${requests_cpu}                        # Minimum CPU requested by the container
            limits:
              huawei.com/ascend-1980: "8"                 # Card limit; keep the key unchanged
              memory: ${limits_memory}                    # Maximum memory the container can use
              cpu: ${limits_cpu}                          # Maximum CPU the container can use
          volumeMounts:                             # Mount paths inside the container
          - name: shared-memory-volume
            mountPath: /dev/shm
          - name: ascend-driver                     # Driver mount; do not change
            mountPath: /usr/local/Ascend/driver
          - name: ascend-add-ons                    # Driver mount; do not change
            mountPath: /usr/local/Ascend/add-ons
          - name: localtime
            mountPath: /etc/localtime
          - name: hccn                               # Driver hccn configuration; do not change
            mountPath: /etc/hccn.conf
          - name: npu-smi                             # npu-smi
            mountPath: /usr/local/sbin/npu-smi
          - name: ascend-install
            mountPath: /etc/ascend_install.info
          - name: log
            mountPath: /var/log/npu/
          - name: sfs-volume
            mountPath: /mnt/sfs_turbo
        nodeSelector:
          accelerator/huawei-npu: ascend-1980
        volumes:                                          # Paths on the physical host
        - name: shared-memory-volume                      # Shared memory
          emptyDir:
            medium: Memory
            sizeLimit: "200Gi"
        - name: ascend-driver
          hostPath:
            path: /usr/local/Ascend/driver
        - name: ascend-add-ons
          hostPath:
            path: /usr/local/Ascend/add-ons
        - name: localtime
          hostPath:
            path: /etc/localtime                     
        - name: hccn
          hostPath:
            path: /etc/hccn.conf
        - name: npu-smi
          hostPath:
            path: /usr/local/sbin/npu-smi
        - name: ascend-install
          hostPath:
            path: /etc/ascend_install.info
        - name: log
          hostPath:
            path: /usr/slog
        - name: sfs-volume         
          persistentVolumeClaim:             
            claimName: ${pvc_name}    # Name of an existing PVC
        restartPolicy: OnFailure
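
The ${} placeholders in the template above have to be replaced with real values before the file is usable. One possible way to do that, sketched here with `sed`, is shown on a three-line stand-in for the template. The image address and PVC name are hypothetical examples, and the helper function `fill` is an illustrative name, not part of any tool.

```shell
#!/bin/sh
# Sketch: fill the ${...} placeholders of a config.yaml template with sed.
# Every value below is a hypothetical example, not a value from this guide.
set -e

workdir=$(mktemp -d)
template="$workdir/config.yaml.tpl"
output="$workdir/config.yaml"

# Three lines standing in for the full template; the quoted 'EOF' keeps ${...} literal.
cat > "$template" <<'EOF'
        - image: ${image_name}
          name: ${container_name}
            claimName: ${pvc_name}
EOF

# fill NAME VALUE: replace every ${NAME} read from stdin with VALUE.
# '|' is the sed delimiter because image addresses contain '/'.
# Limitation: VALUE must not contain '|', '&', or '\'.
fill() {
  sed "s|[$]{$1}|$2|g"
}

fill image_name "swr.example.com/my-org/ascendspeed:latest" < "$template" \
  | fill container_name "ascendspeed" \
  | fill pvc_name "sfs-turbo-pvc" > "$output"

cat "$output"
```

A multi-line value such as ${command} would need a different approach (for example, editing the file by hand), since this sketch only handles single-line values. Once all placeholders are filled, the file can be submitted declaratively with `kubectl apply -f config.yaml`.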

The config.yaml template below is for two-node training and is used for two-machine distributed training.

apiVersion: v1
kind: ConfigMap
metadata:
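  name: configmap1980-vcjob     # Keep the "configmap1980-" prefix unchanged, followed by the vcjob name
  namespace: default                      # Any namespace; must be the same namespace as the vcjob below
  labels:
    ring-controller.cce: ascend-1980   # Do not change
data:                    # Do not change the data content; after initialization, the volcano plugin updates it automatically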
  jobstart_hccl.json: |
    {
      "status":"initializing"
    }
---
apiVersion: batch.volcano.sh/v1alpha1  
kind: Job                            
metadata:
  name: vcjob                           # Job name; must correspond to the name in the ConfigMap
  namespace: default                      # Same namespace as the ConfigMap
  labels:
    ring-controller.cce: ascend-1980   # Do not change
    fault-scheduling: "force"
spec:
  minAvailable: 1
  schedulerName: volcano                # Do not change
  policies:
    - event: PodEvicted
      action: RestartJob
  plugins:
    configmap1980:
    - --rank-table-version=v2  # Do not change; generates a v2 rank table file
    env: []
    svc:
    - --publish-not-ready-addresses=true
  maxRetry: 5
  queue: default
  tasks:
  - name: main
    replicas: 1                    
    template:
      metadata:
        name: training
        labels:
          app: ascendspeed
          ring-controller.cce: ascend-1980  # Do not change
      spec:
        affinity:
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                    - key: volcano.sh/job-name
                      operator: In
                      values:
                        - vcjob
                topologyKey: kubernetes.io/hostname
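        hostNetwork: true                      # Use the host network mode
        containers:
        - image: ${image_name}         # Image address
          imagePullPolicy: IfNotPresent      # IfNotPresent: pull only if the image is not already on the host (the default unless the tag is :latest or omitted); Always: pull the image every time a pod is created; Never: never pull the image
          name: ${container_name}
          securityContext:                           # Run as root inside the container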
            allowPrivilegeEscalation: false
            runAsUser: 0
          env:
          - name: name
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
          - name: ip                            
            valueFrom:
              fieldRef:
                fieldPath: status.hostIP
          - name: framework
            value: "PyTorch"
          command: ["/bin/sh", "-c"]
          args:
          - ${command}
          resources:
            requests:
              huawei.com/ascend-1980: "8"                 # Number of cards requested; keep the key unchanged
              memory: ${requests_memory}                  # Minimum memory requested by the container
              cpu: ${requests_cpu}                        # Minimum CPU requested by the container
            limits:
              huawei.com/ascend-1980: "8"                 # Card limit; keep the key unchanged
              memory: ${limits_memory}                    # Maximum memory the container can use
              cpu: ${limits_cpu}                          # Maximum CPU the container can use
          volumeMounts:                             # Mount paths inside the container
          - name: shared-memory-volume
            mountPath: /dev/shm
          - name: ascend-driver                     # Driver mount; do not change
            mountPath: /usr/local/Ascend/driver
          - name: ascend-add-ons                    # Driver mount; do not change
            mountPath: /usr/local/Ascend/add-ons
          - name: localtime
            mountPath: /etc/localtime
          - name: hccn                               # Driver hccn configuration; do not change
            mountPath: /etc/hccn.conf
          - name: npu-smi                             # npu-smi
            mountPath: /usr/local/sbin/npu-smi
          - name: ascend-install
            mountPath: /etc/ascend_install.info
          - name: log
            mountPath: /var/log/npu/
          - name: sfs-volume
            mountPath: /mnt/sfs_turbo
        nodeSelector:
          accelerator/huawei-npu: ascend-1980
        volumes:                                    # Paths on the physical host
        - name: shared-memory-volume                        # Shared memory
          emptyDir:
            medium: Memory
            sizeLimit: "200Gi"
        - name: ascend-driver
          hostPath:
            path: /usr/local/Ascend/driver
        - name: ascend-add-ons
          hostPath:
            path: /usr/local/Ascend/add-ons
        - name: localtime
          hostPath:
            path: /etc/localtime                     
        - name: hccn
          hostPath:
            path: /etc/hccn.conf
        - name: npu-smi
          hostPath:
            path: /usr/local/sbin/npu-smi
        - name: ascend-install
          hostPath:
            path: /etc/ascend_install.info
        - name: log
          hostPath:
            path: /usr/slog
        - name: sfs-volume         
          persistentVolumeClaim:             
            claimName: ${pvc_name}    # Name of an existing PVC
        restartPolicy: OnFailure
  - name: work
    replicas: 1                    
    template:
      metadata:
        name: training
        labels:
          app: ascendspeed
          ring-controller.cce: ascend-1980  # Do not change
      spec:
        affinity:
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                    - key: volcano.sh/job-name
                      operator: In
                      values:
                        - vcjob
                topologyKey: kubernetes.io/hostname
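        hostNetwork: true                      # Use the host network mode
        containers:
        - image: ${image_name}         # Image address
          imagePullPolicy: IfNotPresent       # IfNotPresent: pull only if the image is not already on the host (the default unless the tag is :latest or omitted); Always: pull the image every time a pod is created; Never: never pull the image
          name: ${container_name}
          securityContext:                           # Run as root inside the container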
            allowPrivilegeEscalation: false
            runAsUser: 0
          env:
          - name: name
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
          - name: ip                           
            valueFrom:
              fieldRef:
                fieldPath: status.hostIP
          - name: framework
            value: "PyTorch"
          command: ["/bin/sh", "-c"]
          args:
          - ${command}
          resources:
            requests:
              huawei.com/ascend-1980: "8"                 # Number of cards requested; keep the key unchanged
              memory: ${requests_memory}                  # Minimum memory requested by the container
              cpu: ${requests_cpu}                        # Minimum CPU requested by the container
            limits:
              huawei.com/ascend-1980: "8"                 # Card limit; keep the key unchanged
              memory: ${limits_memory}                    # Maximum memory the container can use
              cpu: ${limits_cpu}                          # Maximum CPU the container can use
          volumeMounts:                             # Mount paths inside the container
          - name: shared-memory-volume
            mountPath: /dev/shm
          - name: ascend-driver                     # Driver mount; do not change
            mountPath: /usr/local/Ascend/driver
          - name: ascend-add-ons                    # Driver mount; do not change
            mountPath: /usr/local/Ascend/add-ons
          - name: localtime
            mountPath: /etc/localtime
          - name: hccn                               # Driver hccn configuration; do not change
            mountPath: /etc/hccn.conf
          - name: npu-smi                             # npu-smi
            mountPath: /usr/local/sbin/npu-smi
          - name: ascend-install
            mountPath: /etc/ascend_install.info
          - name: log
            mountPath: /var/log/npu/
          - name: sfs-volume
            mountPath: /mnt/sfs_turbo
        nodeSelector:
          accelerator/huawei-npu: ascend-1980
        volumes:                                    # Paths on the physical host
        - name: shared-memory-volume                        # Shared memory
          emptyDir:
            medium: Memory
            sizeLimit: "200Gi"
        - name: ascend-driver
          hostPath:
            path: /usr/local/Ascend/driver
        - name: ascend-add-ons
          hostPath:
            path: /usr/local/Ascend/add-ons
        - name: localtime
          hostPath:
            path: /etc/localtime                     
        - name: hccn
          hostPath:
            path: /etc/hccn.conf
        - name: npu-smi
          hostPath:
            path: /usr/local/sbin/npu-smi
        - name: ascend-install
          hostPath:
            path: /etc/ascend_install.info
        - name: log
          hostPath:
            path: /usr/slog
        - name: sfs-volume         
          persistentVolumeClaim:             
            claimName: ${pvc_name}    # Name of an existing PVC
        restartPolicy: OnFailure
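
According to the comments on the `name` fields, both templates expect the ConfigMap name to be the fixed prefix `configmap1980-` followed by the Job name. The sketch below shows one way such a check might be scripted before submitting the file. It runs on a shortened stand-in for config.yaml and assumes the top-level `kind:` and two-space-indented `name:` layout used in the templates; the variable names are illustrative.

```shell
#!/bin/sh
# Sketch: verify that the ConfigMap name equals "configmap1980-" + the Job name.
# Assumes the layout of the templates above (top-level "kind:", two-space "name:").
set -e

workdir=$(mktemp -d)
config="$workdir/config.yaml"

# Shortened stand-in for the real config.yaml.
cat > "$config" <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: configmap1980-vcjob
---
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: vcjob
EOF

# For each kind, take the first metadata name that follows the "kind:" line.
configmap_name=$(awk '/^kind:/ {kind=$2} /^  name:/ && kind=="ConfigMap" {print $2; exit}' "$config")
job_name=$(awk '/^kind:/ {kind=$2} /^  name:/ && kind=="Job" {print $2; exit}' "$config")

if [ "$configmap_name" = "configmap1980-$job_name" ]; then
  echo "OK: ConfigMap $configmap_name matches Job $job_name"
else
  echo "Mismatch: ConfigMap $configmap_name, Job $job_name" >&2
  exit 1
fi
```

If the Job is renamed, the ConfigMap name needs to change with it; the namespace of the two objects also has to stay the same, which this sketch does not check.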

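Parameter description:

  • ${container_name}: the container name. You can define any name here, for example ascendspeed.
  • ${image_name}: the address of the image uploaded to SWR in Step 5 Modifying and Uploading the Image.
  • ${command}: the command that runs automatically inside the container after the pod is created from config.yaml. The replacement command is given in the section on running the training task.
  • /mnt/sfs_turbo: the default working directory where SFS Turbo is mounted on the host. It stores the code, data, and other files needed for training.
    • /mnt/sfs_turbo can likewise be used as the mount path inside the container. The host and the container use different file systems; the two paths can be identical for ease of access.
  • ${pvc_name}: the name of the PVC created in the step that associates SFS Turbo with the CCE cluster.
  • To decide how much CPU and memory to give the container, run the following command to view the CPU and memory details of the allocated node:
    kubectl describe node
    • ${requests_cpu}: the minimum number of CPU cores requested by the container. You can use the value shown under Requests, for example 2650m.
    • ${requests_memory}: the minimum amount of memory requested by the container. You can use the value shown under Requests, for example 3200Mi.
    • ${limits_cpu}: the maximum number of CPU cores the container can use, for example 192.
    • ${limits_memory}: the maximum amount of memory the container can use, for example 1500Gi after unit conversion.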
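
The unit conversion mentioned for ${limits_memory} can be done with plain shell arithmetic. The sketch below assumes that the node reports memory in Ki, which is typical of `kubectl describe node` output but should be checked against the actual output, and that CPU requests are given in millicores (the `m` suffix, as in 2650m). The input value 1572864000Ki is a made-up example, and the function names are illustrative.

```shell
#!/bin/sh
# Sketch: convert Kubernetes resource quantities with integer shell arithmetic.
set -e

# Convert a memory quantity in Ki to whole Gi, rounding down (1 Gi = 1024 * 1024 Ki).
ki_to_gi() {
  ki=${1%Ki}                          # strip the "Ki" suffix
  echo "$(( ki / 1024 / 1024 ))Gi"
}

# Convert CPU millicores to cores with three decimals (1000m = 1 core).
millicores_to_cores() {
  m=${1%m}                            # strip the "m" suffix
  printf '%d.%03d\n' "$(( m / 1000 ))" "$(( m % 1000 ))"
}

ki_to_gi 1572864000Ki        # prints 1500Gi
millicores_to_cores 2650m    # prints 2.650
```

Rounding down is a deliberate choice here, so that the resulting limit does not exceed what the node reports.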

support.huaweicloud.com/bestpractice-modelarts/modelarts_llm_train_90946.html