AI Development Platform ModelArts - Creating a Training Job Using the PyTorch Framework (New-Version Training): Procedure

Time: 2025-02-12 15:15:44

Procedure

  1. Call the authentication API to obtain a user token.
    1. Request message:

      URI format: POST https://{iam_endpoint}/v3/auth/tokens

      Request header: Content-Type → application/json

      Request body:
      {
        "auth": {
          "identity": {
            "methods": ["password"],
            "password": {
              "user": {
                "name": "user_name",
                "password": "user_password",
                "domain": {
                  "name": "domain_name"
                }
              }
            }
          },
          "scope": {
            "project": {
              "name": "cn-north-1"
            }
          }
        }
      }
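      Set the following fields to actual values:
      • iam_endpoint: the IAM endpoint.
      • user_name: the IAM user name.
      • user_password: the user's login password.
      • domain_name: the name of the account to which the user belongs.
      • cn-north-1: the project name, which indicates the region where the service is deployed.
    2. The status code "201 Created" is returned. The value of "X-Subject-Token" in the response header is the token, for example:
      x-subject-token → MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...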
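As a minimal sketch of this step, the Python below builds the token request with the standard library only. The host iam.example.com and the helper names build_token_request and fetch_token are illustrative assumptions, not names from the IAM or ModelArts documentation. fetch_token is defined but not called here, because calling it needs a real endpoint and real credentials.

```python
import json
import urllib.request


def build_token_request(iam_endpoint, user_name, user_password, domain_name, project_name):
    """Build (without sending) the POST request that asks IAM for a token."""
    body = {
        "auth": {
            "identity": {
                "methods": ["password"],
                "password": {
                    "user": {
                        "name": user_name,
                        "password": user_password,
                        "domain": {"name": domain_name},
                    }
                },
            },
            "scope": {"project": {"name": project_name}},
        }
    }
    return urllib.request.Request(
        url=f"https://{iam_endpoint}/v3/auth/tokens",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_token(request):
    """Send the request and return the X-Subject-Token response header."""
    with urllib.request.urlopen(request) as response:
        # Header lookup on the response is case-insensitive.
        return response.headers["X-Subject-Token"]


# iam.example.com is a placeholder host (assumption), not a real IAM endpoint.
token_request = build_token_request(
    "iam.example.com", "user_name", "user_password", "domain_name", "cn-north-1"
)
print(token_request.get_method(), token_request.full_url)
```

The returned token is then passed as the X-Auth-Token header in every later call.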
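  2. Call the API for obtaining the public flavors supported by training jobs to get the resource flavors available to training jobs.
    1. Request message:

      URI format: GET https://{ma_endpoint}/v2/{project_id}/training-job-flavors?flavor_type=CPU

      Request header: X-Auth-Token → MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...

      Set the following fields to actual values:

      • ma_endpoint: the ModelArts endpoint.
      • project_id: the user's project ID.
      • The value of "X-Auth-Token" is the token obtained in the previous step.
    2. The status code "200" is returned. The response body is as follows: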
      {  "total_count": 2,  "flavors": [    {      "flavor_id": "modelarts.vm.cpu.2u",      "flavor_name": "Computing CPU(2U) instance",      "flavor_type": "CPU",      "billing": {        "code": "modelarts.vm.cpu.2u",        "unit_num": 1      },      "flavor_info": {        "max_num": 1,        "cpu": {          "arch": "x86",          "core_num": 2        },        "memory": {          "size": 8,          "unit": "GB"        },        "disk": {          "size": 50,          "unit": "GB"        }      }    },    {      "flavor_id": "modelarts.vm.cpu.8u",      "flavor_name": "Computing CPU(8U) instance",      "flavor_type": "CPU",      "billing": {        "code": "modelarts.vm.cpu.8u",        "unit_num": 1      },      "flavor_info": {        "max_num": 16,        "cpu": {          "arch": "x86",          "core_num": 8        },        "memory": {          "size": 32,          "unit": "GB"        },        "disk": {          "size": 50,          "unit": "GB"        }      }    }  ]}
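      • Based on the "flavor_id" field, select and record the flavor to use when creating the training job. This section uses "modelarts.vm.cpu.8u" as an example and also records its "max_num" value, "16".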
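The flavor lookup can be sketched as follows. The sample data is an abridged copy of the response above (only the fields used here), and the host modelarts.example.com, the project ID my_project_id, and the helper names flavors_url and find_flavor are hypothetical.

```python
import json
from urllib.parse import urlencode


def flavors_url(ma_endpoint, project_id, flavor_type):
    """Return the URL that lists the public training flavors of one type."""
    query = urlencode({"flavor_type": flavor_type})
    return f"https://{ma_endpoint}/v2/{project_id}/training-job-flavors?{query}"


def find_flavor(response, flavor_id):
    """Return the flavor entry with the given flavor_id, or raise KeyError."""
    for flavor in response["flavors"]:
        if flavor["flavor_id"] == flavor_id:
            return flavor
    raise KeyError(flavor_id)


# Abridged copy of the sample response shown in this step.
sample_response = json.loads("""
{
  "total_count": 2,
  "flavors": [
    {"flavor_id": "modelarts.vm.cpu.2u", "flavor_type": "CPU",
     "flavor_info": {"max_num": 1}},
    {"flavor_id": "modelarts.vm.cpu.8u", "flavor_type": "CPU",
     "flavor_info": {"max_num": 16}}
  ]
}
""")

chosen = find_flavor(sample_response, "modelarts.vm.cpu.8u")
max_num = chosen["flavor_info"]["max_num"]
# Placeholder host and project ID (assumptions for illustration).
url = flavors_url("modelarts.example.com", "my_project_id", "CPU")
print(url)
print(chosen["flavor_id"], max_num)
```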
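  3. Call the API for obtaining the preset AI frameworks supported by training jobs to view the engine types and versions available to training jobs.
    1. Request message:

      URI format: GET https://{ma_endpoint}/v2/{project_id}/job/training-job-engines

      Request headers:

      X-Auth-Token → MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...

      Content-Type → application/json

      Replace the placeholder fields with actual values.

    2. The status code "200" is returned. The response body is as follows (only some of the many engines are shown):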
      {    "total": 28,    "items": [        ......        {            "engine_id": "mindspore_1.6.0-cann_5.0.3.6-py_3.7-euler_2.8.3-aarch64",            "engine_name": "Ascend-Powered-Engine",            "engine_version": "mindspore_1.6.0-cann_5.0.3.6-py_3.7-euler_2.8.3-aarch64",            "v1_compatible": false,            "run_user": "1000",            "image_info": {                "cpu_image_url": "",                "gpu_image_url": "atelier/mindspore_1_6_0:train",                "image_version": "mindspore_1.6.0-cann_5.0.3.6-py_3.7-euler_2.8.3-aarch64-snt9-roma-20211231193205-33131ee"            }        },......        {            "engine_id": "pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64",            "engine_name": "PyTorch",            "engine_version": "pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64",            "tags": [                {                    "key": "auto_search",                    "value": "True"                }            ],            "v1_compatible": false,            "run_user": "1102",            "image_info": {                "cpu_image_url": "aip/pytorch_1_8:train",                "gpu_image_url": "aip/pytorch_1_8:train",                "image_version": "pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64-20210912152543-1e0838d"            }        },        ......        
{            "engine_id": "tensorflow_2.1.0-cuda_10.1-py_3.7-ubuntu_18.04-x86_64",            "engine_name": "TensorFlow",            "engine_version": "tensorflow_2.1.0-cuda_10.1-py_3.7-ubuntu_18.04-x86_64",            "tags": [                {                    "key": "auto_search",                    "value": "True"                }            ],            "v1_compatible": false,            "run_user": "1102",            "image_info": {                "cpu_image_url": "aip/tensorflow_2_1:train",                "gpu_image_url": "aip/tensorflow_2_1:train",                "image_version": "tensorflow_2.1.0-cuda_10.1-py_3.7-ubuntu_18.04-x86_64-20210912152543-1e0838d"            }        },        ......    ]}

      Based on the "engine_name" and "engine_version" fields, select the engine to use when creating the training job and record both values. This section uses the PyTorch engine as an example: "engine_name" is "PyTorch" and "engine_version" is "pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64".
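Picking the engine out of the "items" list might look like the sketch below. The list is an abridged copy of the sample response, and select_engine is a hypothetical helper name. Matching on a version prefix is one possible selection rule, chosen here only for illustration.

```python
def select_engine(items, engine_name, version_prefix=""):
    """Return (engine_name, engine_version) of the first matching engine."""
    for item in items:
        if item["engine_name"] == engine_name and item["engine_version"].startswith(version_prefix):
            return item["engine_name"], item["engine_version"]
    raise LookupError(f"no engine named {engine_name!r} with version prefix {version_prefix!r}")


# Abridged copy of the "items" array from the sample response.
items = [
    {
        "engine_name": "Ascend-Powered-Engine",
        "engine_version": "mindspore_1.6.0-cann_5.0.3.6-py_3.7-euler_2.8.3-aarch64",
    },
    {
        "engine_name": "PyTorch",
        "engine_version": "pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64",
    },
    {
        "engine_name": "TensorFlow",
        "engine_version": "tensorflow_2.1.0-cuda_10.1-py_3.7-ubuntu_18.04-x86_64",
    },
]

engine_name, engine_version = select_engine(items, "PyTorch", "pytorch_1.8")
print(engine_name, engine_version)
```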

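  4. Call the API for creating an algorithm to create an algorithm, and record the algorithm ID.
    1. Request message:

      URI format: POST https://{ma_endpoint}/v2/{project_id}/algorithms

      Request headers:

      X-Auth-Token → MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...

      Content-Type → application/json

      Replace the placeholder fields with actual values.

      Request body: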

      {
        "metadata": {
          "name": "test-pytorch-cpu",
          "description": "test pytorch job in cpu in mode gloo"
        },
        "job_config": {
          "boot_file": "/cnnorth4-job-test-v2/pytorch/fast_example/code/cpu/test-pytorch.py",
          "code_dir": "/cnnorth4-job-test-v2/pytorch/fast_example/code/cpu/",
          "engine": {
            "engine_name": "PyTorch",
            "engine_version": "pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64"
          },
          "inputs": [
            {"name": "data_url", "description": "数据来源1"}
          ],
          "outputs": [
            {"name": "train_url", "description": "输出数据1"}
          ],
          "parameters": [
            {
              "name": "dist",
              "description": "",
              "value": "False",
              "constraint": {
                "editable": true,
                "required": false,
                "sensitive": false,
                "type": "Boolean",
                "valid_range": [],
                "valid_type": "None"
              }
            },
            {
              "name": "world_size",
              "description": "",
              "value": "1",
              "constraint": {
                "editable": true,
                "required": false,
                "sensitive": false,
                "type": "Integer",
                "valid_range": [],
                "valid_type": "None"
              }
            }
          ],
          "parameters_customization": true
        },
        "resource_requirements": []
      }

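      Set the following fields to actual values:

      • "name" and "description" under "metadata" are the name and description of the algorithm.
      • "code_dir" and "boot_file" under "job_config" are the code directory and the boot file of the algorithm. The code directory is the parent directory of the boot file.
      • "inputs" and "outputs" under "job_config" are the input and output channels of the algorithm. As in the example, you can name them "data_url" and "train_url"; the training code then parses the hyperparameters of the same names to get the local path of the training data and the local path for the model output.
      • "parameters_customization" under "job_config" indicates whether custom hyperparameters are supported. Set it to true here.
      • "parameters" under "job_config" lists the hyperparameters of the algorithm. "name" is the hyperparameter name (up to 64 characters; only letters, digits, underscores, and hyphens are allowed). "value" is the default value. "constraint" holds the constraints: "type" is the data type (String, Integer, Float, or Boolean; the example uses Boolean and Integer), "editable" is set to "true", "required" is set to "false", and so on.
      • "engine" under "job_config" is the engine on which the algorithm depends. Use the "engine_name" and "engine_version" recorded in step 3.
    2. The status code "200 OK" is returned, indicating that the algorithm is created. The response body is as follows: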
      {    "metadata": {        "id": "01c399ae-8593-4ef5-9e4d-085950aacde1",        "name": "test-pytorch-cpu",        "description": "test pytorch job in cpu in mode gloo",        "create_time": 1641890623262,        "workspace_id": "0",        "ai_project": "default-ai-project",        "user_name": "",        "domain_id": "0659fbf6de00109b0ff1c01fc037d240",        "source": "custom",        "api_version": "",        "is_valid": true,        "state": "",        "size": 4790,        "tags": null,        "attr_list": null,        "version_num": 0,        "update_time": 0    },    "share_info": {},    "job_config": {        "code_dir": "/cnnorth4-job-test-v2/pytorch/fast_example/code/cpu/",        "boot_file": "/cnnorth4-job-test-v2/pytorch/fast_example/code/cpu/test-pytorch.py",        "parameters": [            {                "name": "dist",                "description": "",                "i18n_description": null,                "value": "False",                "constraint": {                    "type": "Boolean",                    "editable": true,                    "required": false,                    "sensitive": false,                    "valid_type": "None",                    "valid_range": []                }            },            {                "name": "world_size",                "description": "",                "i18n_description": null,                "value": "1",                "constraint": {                    "type": "Integer",                    "editable": true,                    "required": false,                    "sensitive": false,                    "valid_type": "None",                    "valid_range": []                }            }        ],        "parameters_customization": true,        "inputs": [            {                "name": "data_url",                "description": "数据来源1"            }        ],        "outputs": [            {                "name": "train_url",                "description": "输出数据1"            
}        ],        "engine": {            "engine_id": "pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64",            "engine_name": "PyTorch",            "engine_version": "pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64",            "tags": [                {                    "key": "auto_search",                    "value": "True"                }            ],            "v1_compatible": false,            "run_user": "1102",            "image_info": {                "cpu_image_url": "aip/pytorch_1_8:train",                "gpu_image_url": "aip/pytorch_1_8:train",                "image_version": "pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64-20210912152543-1e0838d"            }        },        "code_tree": {            "name": "cpu/",            "children": [                {                    "name": "test-pytorch.py"                }            ]        }    },    "resource_requirements": [],    "advanced_config": {}}

      Record the value of "id" under "metadata" (the algorithm ID, a UUID of 32 hexadecimal digits) for use in later steps.
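The field rules for this request can be expressed as a small builder. This is a sketch under assumptions: make_parameter and build_algorithm_body are hypothetical helper names, the channel descriptions are placeholder strings, and the checks encode only the rules stated in this step (the hyperparameter name format, the four supported types, and the code directory being the parent directory of the boot file).

```python
import posixpath
import re

# Up to 64 characters: letters, digits, underscores, and hyphens.
PARAM_NAME = re.compile(r"[A-Za-z0-9_-]{1,64}")
PARAM_TYPES = ("String", "Integer", "Float", "Boolean")


def make_parameter(name, value, param_type):
    """Build one hyperparameter entry after checking its name and type."""
    if not PARAM_NAME.fullmatch(name):
        raise ValueError(f"invalid hyperparameter name: {name!r}")
    if param_type not in PARAM_TYPES:
        raise ValueError(f"unsupported hyperparameter type: {param_type!r}")
    return {
        "name": name,
        "description": "",
        "value": value,
        "constraint": {
            "editable": True,
            "required": False,
            "sensitive": False,
            "type": param_type,
            "valid_range": [],
            "valid_type": "None",
        },
    }


def build_algorithm_body(name, description, code_dir, boot_file,
                         engine_name, engine_version, parameters):
    """Assemble the create-algorithm request body."""
    # The code directory must directly contain the boot file.
    if posixpath.dirname(boot_file) != code_dir.rstrip("/"):
        raise ValueError("code_dir must be the parent directory of boot_file")
    return {
        "metadata": {"name": name, "description": description},
        "job_config": {
            "boot_file": boot_file,
            "code_dir": code_dir,
            "engine": {"engine_name": engine_name, "engine_version": engine_version},
            # Placeholder descriptions (assumption); any text can be used.
            "inputs": [{"name": "data_url", "description": "training data"}],
            "outputs": [{"name": "train_url", "description": "model output"}],
            "parameters": parameters,
            "parameters_customization": True,
        },
        "resource_requirements": [],
    }


algorithm_body = build_algorithm_body(
    name="test-pytorch-cpu",
    description="test pytorch job in cpu in mode gloo",
    code_dir="/cnnorth4-job-test-v2/pytorch/fast_example/code/cpu/",
    boot_file="/cnnorth4-job-test-v2/pytorch/fast_example/code/cpu/test-pytorch.py",
    engine_name="PyTorch",
    engine_version="pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64",
    parameters=[
        make_parameter("dist", "False", "Boolean"),
        make_parameter("world_size", "1", "Integer"),
    ],
)
print(algorithm_body["job_config"]["engine"])
```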

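  5. Call the API for creating a training job, passing the UUID of the algorithm just created, and record the training job ID.
    1. Request message:

      URI format: POST https://{ma_endpoint}/v2/{project_id}/training-jobs

      Request headers:

      • X-Auth-Token → MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...
      • Content-Type → application/json

      Replace the placeholder fields with actual values.

      Request body: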

      {
        "kind": "job",
        "metadata": {
          "name": "test-pytorch-cpu01",
          "description": "test pytorch work cpu in mode gloo"
        },
        "algorithm": {
          "id": "01c399ae-8593-4ef5-9e4d-085950aacde1",
          "parameters": [
            {"name": "dist", "value": "False"},
            {"name": "world_size", "value": "1"}
          ],
          "inputs": [
            {
              "name": "data_url",
              "remote": {
                "obs": {
                  "obs_url": "/cnnorth4-job-test-v2/pytorch/fast_example/data/"
                }
              }
            }
          ],
          "outputs": [
            {
              "name": "train_url",
              "remote": {
                "obs": {
                  "obs_url": "/cnnorth4-job-test-v2/pytorch/fast_example/outputs/"
                }
              }
            }
          ]
        },
        "spec": {
          "resource": {
            "flavor_id": "modelarts.vm.cpu.8u",
            "node_count": 1
          },
          "log_export_path": {
            "obs_url": "/cnnorth4-job-test-v2/pytorch/fast_example/log/"
          }
        }
      }

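      Set the following fields to actual values:

      • "kind" is the type of the training job. The default value is job.
      • "name" and "description" under "metadata" are the name and description of the training job.
      • "id" under "algorithm" is the algorithm ID obtained in step 4.
      • "inputs" and "outputs" under "algorithm" describe the input and output channels of the training job. In the example, "obs_url" under "inputs" and "remote" is the OBS path of the training data, selected from an OBS bucket. "obs_url" under "outputs" and "remote" is the OBS path to which the training output is uploaded.
      • "flavor_id" under "spec" is the flavor on which the training job depends. Use the flavor_id recorded in step 2. "node_count" determines whether training is multi-node (distributed); this single-node example uses the default value "1". "log_export_path" specifies the OBS directory to which the logs are uploaded.
    2. The status code "201 Created" is returned, indicating that the training job is created. The response body is as follows: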
      {    "kind": "job",    "metadata": {        "id": "66ff6991-fd66-40b6-8101-0829a46d3731",        "name": "test-pytorch-cpu01",        "description": "test pytorch work cpu in mode gloo",        "create_time": 1641892642625,        "workspace_id": "0",        "ai_project": "default-ai-project",        "user_name": "",        "annotations": {            "job_template": "Template DL",            "key_task": "worker"        }    },    "status": {        "phase": "Creating",        "secondary_phase": "Creating",        "duration": 0,        "start_time": 0,        "node_count_metrics": null,        "tasks": [            "worker-0"        ]    },    "algorithm": {        "id": "01c399ae-8593-4ef5-9e4d-085950aacde1",        "name": "test-pytorch-cpu",        "code_dir": "/cnnorth4-job-test-v2/pytorch/fast_example/code/cpu/",        "boot_file": "/cnnorth4-job-test-v2/pytorch/fast_example/code/cpu/test-pytorch.py",        "parameters": [            {                "name": "dist",                "description": "",                "i18n_description": null,                "value": "False",                "constraint": {                    "type": "Boolean",                    "editable": true,                    "required": false,                    "sensitive": false,                    "valid_type": "None",                    "valid_range": []                }            },            {                "name": "world_size",                "description": "",                "i18n_description": null,                "value": "1",                "constraint": {                    "type": "Integer",                    "editable": true,                    "required": false,                    "sensitive": false,                    "valid_type": "None",                    "valid_range": []                }            }        ],        "parameters_customization": true,        "inputs": [            {                "name": "data_url",                "description": "数据来源1",     
           "local_dir": "/home/ma-user/modelarts/inputs/data_url_0",                "remote": {                    "obs": {                        "obs_url": "/cnnorth4-job-test-v2/pytorch/fast_example/data/"                    }                }            }        ],        "outputs": [            {                "name": "train_url",                "description": "输出数据1",                "local_dir": "/home/ma-user/modelarts/outputs/train_url_0",                "remote": {                    "obs": {                        "obs_url": "/cnnorth4-job-test-v2/pytorch/fast_example/outputs/"                    }                },                "mode": "upload_periodically",                "period": 30            }        ],        "engine": {            "engine_id": "pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64",            "engine_name": "PyTorch",            "engine_version": "pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64",            "usage": "training",            "support_groups": "public",            "tags": [                {                    "key": "auto_search",                    "value": "True"                }            ],            "v1_compatible": false,            "run_user": "1102"        }    },    "spec": {        "resource": {            "flavor_id": "modelarts.vm.cpu.8u",            "flavor_name": "Computing CPU(8U) instance",            "node_count": 1,            "flavor_detail": {                "flavor_type": "CPU",                "billing": {                    "code": "modelarts.vm.cpu.8u",                    "unit_num": 1                },                "flavor_info": {                    "cpu": {                        "arch": "x86",                        "core_num": 8                    },                    "memory": {                        "size": 32,                        "unit": "GB"                    },                    "disk": {                        "size": 50,                        "unit": "GB"              
      }                }            }        },        "log_export_path": {            "obs_url": "/cnnorth4-job-test-v2/pytorch/fast_example/log/"        },        "is_hosted_log": true    }}
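      • Record the value of "id" under "metadata" (the ID of the training job) for use in later steps.
      • "phase" and "secondary_phase" under "status" indicate the status and the next status of the training job. In the example, "Creating" means that the training job is being created.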
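A sketch of assembling this request and reading the two values to record from the response is shown below. The helper names obs_channel, build_training_job_body, and job_id_and_phase are hypothetical, and the "created" dictionary is a heavily abridged stand-in for the response shown in this step.

```python
import json


def obs_channel(name, obs_url):
    """Build one input or output channel that points at an OBS path."""
    return {"name": name, "remote": {"obs": {"obs_url": obs_url}}}


def build_training_job_body(name, description, algorithm_id, parameters,
                            data_obs_url, output_obs_url, log_obs_url,
                            flavor_id, node_count=1):
    """Assemble the create-training-job request body."""
    return {
        "kind": "job",
        "metadata": {"name": name, "description": description},
        "algorithm": {
            "id": algorithm_id,
            # parameters maps each hyperparameter name to its string value.
            "parameters": [{"name": k, "value": v} for k, v in parameters.items()],
            "inputs": [obs_channel("data_url", data_obs_url)],
            "outputs": [obs_channel("train_url", output_obs_url)],
        },
        "spec": {
            "resource": {"flavor_id": flavor_id, "node_count": node_count},
            "log_export_path": {"obs_url": log_obs_url},
        },
    }


def job_id_and_phase(response_body):
    """Return the training job ID and its current phase."""
    return response_body["metadata"]["id"], response_body["status"]["phase"]


base = "/cnnorth4-job-test-v2/pytorch/fast_example"
job_body = build_training_job_body(
    name="test-pytorch-cpu01",
    description="test pytorch work cpu in mode gloo",
    algorithm_id="01c399ae-8593-4ef5-9e4d-085950aacde1",
    parameters={"dist": "False", "world_size": "1"},
    data_obs_url=f"{base}/data/",
    output_obs_url=f"{base}/outputs/",
    log_obs_url=f"{base}/log/",
    flavor_id="modelarts.vm.cpu.8u",
)
print(json.dumps(job_body, indent=2))

# Abridged stand-in for the "201 Created" response body.
created = {
    "metadata": {"id": "66ff6991-fd66-40b6-8101-0829a46d3731"},
    "status": {"phase": "Creating", "secondary_phase": "Creating"},
}
job_id, phase = job_id_and_phase(created)
print(job_id, phase)
```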
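  6. Call the API for querying training job details, passing the UUID of the training job just created, to check the job status.
    1. Request message:

      URI format: GET https://{ma_endpoint}/v2/{project_id}/training-jobs/{training_job_id}

      Request header: X-Auth-Token → MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...

      Set the following field to an actual value:

      "training_job_id" is the ID of the training job recorded in step 5.

    2. The status code "200 OK" is returned. The response body is as follows: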
      {    "kind": "job",    "metadata": {        "id": "66ff6991-fd66-40b6-8101-0829a46d3731",        "name": "test-pytorch-cpu01",        "description": "test pytorch work cpu in mode gloo",        "create_time": 1641892642625,        "workspace_id": "0",        "ai_project": "default-ai-project",        "user_name": "hwstaff_z00424192",        "annotations": {            "job_template": "Template DL",            "key_task": "worker"        }    },    "status": {        "phase": "Running",        "secondary_phase": "Running",        "duration": 268000,        "start_time": 1641892655000,        "node_count_metrics": [            [                1641892645000,                0            ],            [                1641892654000,                0            ],            [                1641892655000,                1            ],            [                1641892922000,                1            ],            [                1641892923000,                1            ]        ],        "tasks": [            "worker-0"        ]    },    "algorithm": {        "id": "01c399ae-8593-4ef5-9e4d-085950aacde1",        "name": "test-pytorch-cpu",        "code_dir": "/cnnorth4-job-test-v2/pytorch/fast_example/code/cpu/",        "boot_file": "/cnnorth4-job-test-v2/pytorch/fast_example/code/cpu/test-pytorch.py",        "parameters": [            {                "name": "dist",                "description": "",                "i18n_description": null,                "value": "False",                "constraint": {                    "type": "Boolean",                    "editable": true,                    "required": false,                    "sensitive": false,                    "valid_type": "None",                    "valid_range": []                }            },            {                "name": "world_size",                "description": "",                "i18n_description": null,                "value": "1",                "constraint": {               
     "type": "Integer",                    "editable": true,                    "required": false,                    "sensitive": false,                    "valid_type": "None",                    "valid_range": []                }            }        ],        "parameters_customization": true,        "inputs": [            {                "name": "data_url",                "description": "数据来源1",                "local_dir": "/home/ma-user/modelarts/inputs/data_url_0",                "remote": {                    "obs": {                        "obs_url": "/cnnorth4-job-test-v2/pytorch/fast_example/data/"                    }                }            }        ],        "outputs": [            {                "name": "train_url",                "description": "输出数据1",                "local_dir": "/home/ma-user/modelarts/outputs/train_url_0",                "remote": {                    "obs": {                        "obs_url": "/cnnorth4-job-test-v2/pytorch/fast_example/outputs/"                    }                },                "mode": "upload_periodically",                "period": 30            }        ],        "engine": {            "engine_id": "pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64",            "engine_name": "PyTorch",            "engine_version": "pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64",            "usage": "training",            "support_groups": "public",            "tags": [                {                    "key": "auto_search",                    "value": "True"                }            ],            "v1_compatible": false,            "run_user": "1102"        }    },    "spec": {        "resource": {            "flavor_id": "modelarts.vm.cpu.8u",            "flavor_name": "Computing CPU(8U) instance",            "node_count": 1,            "flavor_detail": {                "flavor_type": "CPU",                "billing": {                    "code": "modelarts.vm.cpu.8u",                    "unit_num": 1      
          },                "flavor_info": {                    "cpu": {                        "arch": "x86",                        "core_num": 8                    },                    "memory": {                        "size": 32,                        "unit": "GB"                    },                    "disk": {                        "size": 50,                        "unit": "GB"                    }                }            }        },        "log_export_path": {            "obs_url": "/cnnorth4-job-test-v2/pytorch/fast_example/log/"        },        "is_hosted_log": true    }}

      The response shows the details of the training job. "phase" under "status" is "Running", which means that the training job is running.
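Because a job moves through several phases, this query is typically repeated. The sketch below polls until "phase" leaves a set of in-progress values. Only "Creating" and "Running" appear in this document, so the default set is an assumption that may need more entries, and "Completed" in the demo is a hypothetical phase name. wait_for_phase_change takes any zero-argument callable, so the demo uses canned responses instead of real HTTP calls.

```python
import time


def wait_for_phase_change(fetch_job, in_progress=("Creating", "Running"),
                          interval_seconds=30.0, max_polls=120, sleep=time.sleep):
    """Poll fetch_job() until status.phase is no longer an in-progress value.

    fetch_job is any zero-argument callable that returns the parsed
    job-details response body as a dictionary.
    """
    for attempt in range(max_polls):
        phase = fetch_job()["status"]["phase"]
        if phase not in in_progress:
            return phase
        # Do not sleep after the final poll.
        if attempt < max_polls - 1:
            sleep(interval_seconds)
    raise TimeoutError(f"job still in progress after {max_polls} polls")


# Demo with canned responses. "Completed" is a hypothetical final phase.
canned = iter([
    {"status": {"phase": "Creating"}},
    {"status": {"phase": "Running"}},
    {"status": {"phase": "Completed"}},
])
slept = []  # records requested sleep durations instead of really sleeping
final_phase = wait_for_phase_change(lambda: next(canned), sleep=slept.append)
print(final_phase, slept)
```

Injecting the sleep function keeps the loop testable without waiting. In real use, fetch_job would wrap the GET request from this step.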

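  7. Call the API for querying the logs of a specified task of a training job (OBS link) to get the OBS path of the training job logs.
    1. Request message:

      URI format: GET https://{ma_endpoint}/v2/{project_id}/training-jobs/{training_job_id}/tasks/{task_id}/logs/url

      Request headers:

      X-Auth-Token → MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...

      Content-Type → text/plain

      Set the following fields to actual values:

      • "task_id" is the name of the training job task, usually worker-0.
      • Content-Type selects the kind of link that is returned: text/plain returns a temporary OBS preview link, and application/octet-stream returns a temporary OBS download link.
    2. The status code "200 OK" is returned. The response body is as follows: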
      {    "obs_url": "https://modelarts-training-log-cn-north-4.obs.cn-north-4.myhuaweicloud.com:443/66ff6991-fd66-40b6-8101-0829a46d3731/worker-0/modelarts-job-66ff6991-fd66-40b6-8101-0829a46d3731-worker-0.log?AWSAccessKeyId=GFGTBKOZENDD83QEMZMV&Expires=1641896599&Signature=BedFZHEU1oCmqlI912UL9mXlhkg%3D"}

      The returned field is the OBS path of the log. Paste it into a browser to view the log.
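The two Content-Type choices and the temporary nature of the link can be illustrated as below. The helper names log_url_headers and link_expiry are hypothetical, and reading the "Expires" query parameter as a Unix timestamp in seconds is an assumption based on the shape of the sample link, not something this document states.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlsplit


def log_url_headers(token, download=False):
    """Return request headers for a preview link or a download link."""
    content_type = "application/octet-stream" if download else "text/plain"
    return {"X-Auth-Token": token, "Content-Type": content_type}


def link_expiry(obs_url):
    """Return the link expiry as a timezone-aware UTC datetime.

    Assumes the "Expires" query parameter is a Unix timestamp in seconds.
    """
    query = parse_qs(urlsplit(obs_url).query)
    return datetime.fromtimestamp(int(query["Expires"][0]), tz=timezone.utc)


# The sample link from the response body in this step.
sample_url = (
    "https://modelarts-training-log-cn-north-4.obs.cn-north-4.myhuaweicloud.com:443/"
    "66ff6991-fd66-40b6-8101-0829a46d3731/worker-0/"
    "modelarts-job-66ff6991-fd66-40b6-8101-0829a46d3731-worker-0.log"
    "?AWSAccessKeyId=GFGTBKOZENDD83QEMZMV&Expires=1641896599"
    "&Signature=BedFZHEU1oCmqlI912UL9mXlhkg%3D"
)

preview_headers = log_url_headers("my_token")
download_headers = log_url_headers("my_token", download=True)
expires_at = link_expiry(sample_url)
print(preview_headers["Content-Type"], download_headers["Content-Type"])
print(expires_at.isoformat())
```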

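  8. Call the API for querying the running metrics of a specified task of a training job to view the metric details of the training job.
    1. Request message:

      URI format: GET https://{ma_endpoint}/v2/{project_id}/training-jobs/{training_job_id}/metrics/{task_id}

      Request header: X-Auth-Token → MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...

      Replace the placeholder fields with actual values.

    2. The status code "200 OK" is returned. The response body is as follows: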
      {    "metrics": [        {            "metric": "cpuUsage",            "value": [                -1,                -1,                28.622,                35.053,                39.988,                40.069,                40.082,                40.094            ]        },        {            "metric": "memUsage",            "value": [                -1,                -1,                0.544,                0.641,                0.736,                0.737,                0.738,                0.739            ]        },        {            "metric": "npuUtil",            "value": [                -1,                -1,                -1,                -1,                -1,                -1,                -1,                -1            ]        },        {            "metric": "npuMemUsage",            "value": [                -1,                -1,                -1,                -1,                -1,                -1,                -1,                -1            ]        },        {            "metric": "gpuUtil",            "value": [                -1,                -1,                -1,                -1,                -1,                -1,                -1,                -1            ]        },        {            "metric": "gpuMemUsage",            "value": [                -1,                -1,                -1,                -1,                -1,                -1,                -1,                -1            ]        }    ]}

      The response shows usage metrics such as CPU usage.
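One way to summarize such a response is to average each metric's samples. The sketch below treats -1 as "no sample available", which is an assumption inferred from the sample data (the GPU and NPU series of this CPU job are all -1) rather than a documented rule. average_usage is a hypothetical helper name, and the data is an abridged copy of the response above.

```python
def average_usage(metrics_body, metric_name):
    """Return the mean of one metric's samples, or None if it has none.

    Samples equal to -1 are skipped (assumed to mean "no data").
    """
    for entry in metrics_body["metrics"]:
        if entry["metric"] == metric_name:
            samples = [value for value in entry["value"] if value != -1]
            return sum(samples) / len(samples) if samples else None
    raise KeyError(metric_name)


# Abridged copy of the sample response (three of the six metrics).
metrics_body = {
    "metrics": [
        {"metric": "cpuUsage",
         "value": [-1, -1, 28.622, 35.053, 39.988, 40.069, 40.082, 40.094]},
        {"metric": "memUsage",
         "value": [-1, -1, 0.544, 0.641, 0.736, 0.737, 0.738, 0.739]},
        {"metric": "gpuUtil",
         "value": [-1, -1, -1, -1, -1, -1, -1, -1]},
    ]
}

cpu_average = average_usage(metrics_body, "cpuUsage")
gpu_average = average_usage(metrics_body, "gpuUtil")
print(round(cpu_average, 3), gpu_average)
```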

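  9. When the training job has finished or is no longer needed, call the API for deleting a training job to delete it.
    1. Request message:

      URI format: DELETE https://{ma_endpoint}/v2/{project_id}/training-jobs/{training_job_id}

      Request header: X-Auth-Token → MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...

      Replace the placeholder fields with actual values.

    2. If a response with the status code "202 No Content" is returned, the job is deleted.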
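The delete call can be sketched in the same style as the earlier requests. The host modelarts.example.com, the project ID, the token value, and the helper names delete_job_request and delete_job are placeholders chosen for illustration. delete_job is defined but not called, because it needs a real endpoint and token.

```python
import urllib.request


def delete_job_request(ma_endpoint, project_id, training_job_id, token):
    """Build (without sending) the DELETE request for one training job."""
    return urllib.request.Request(
        url=f"https://{ma_endpoint}/v2/{project_id}/training-jobs/{training_job_id}",
        headers={"X-Auth-Token": token},
        method="DELETE",
    )


def delete_job(request):
    """Send the request and report whether status code 202 was returned."""
    with urllib.request.urlopen(request) as response:
        return response.status == 202


# Placeholder host, project ID, and token (assumptions for illustration).
delete_request = delete_job_request(
    "modelarts.example.com",
    "my_project_id",
    "66ff6991-fd66-40b6-8101-0829a46d3731",
    "my_token",
)
print(delete_request.get_method(), delete_request.full_url)
```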
support.huaweicloud.com/api-modelarts/modelarts_03_0407.html