AI Development Platform ModelArts - Configuring Graceful Exit on Training Faults: Using the Feature

Time: 2024-09-02 17:41:25

Using the Feature

  1. Install the graceful-exit binary package

    Install the whl package through ma_pre_start.sh.

    echo "[ma-pre-start] Enter the input directory"
    cd /home/ma-user/modelarts/inputs/data_url_0/
    echo "[ma-pre-start] Start to install mindx-elastic 0.0.1"
    export PATH=/home/ma-user/anaconda/bin:$PATH
    pip install ./mindx_elastic-0.0.1-py3-none-any.whl
    echo "[ma-pre-start] Clean run package"
    sudo rm -rf ./script ./*.run ./run_package *.whl
    echo "[ma-pre-start] Set ENV"
    export GLOG_v=2    # Diagnostic mode currently requires the user to set the log level to INFO manually
    echo "[ma-pre-start] End"
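Once the pre-start script has run, it can help to confirm from inside the training process that the package is importable and that the log-level variable is visible. The following is a minimal sketch using only the Python standard library. The helper name `check_graceful_exit_env` is hypothetical and is not part of mindx-elastic or ModelArts. The sketch also assumes that the variable exported in ma_pre_start.sh is inherited by the training process, which the source does not state.

```python
import importlib.util
import os


def check_graceful_exit_env(env=None, module_name="mindx_elastic"):
    """Return a list of human-readable problems; an empty list means OK.

    Hypothetical helper for illustration, not part of mindx-elastic.
    """
    env = os.environ if env is None else env
    problems = []

    # find_spec returns None when a top-level module cannot be located.
    if importlib.util.find_spec(module_name) is None:
        problems.append(f"module '{module_name}' is not installed")

    # The pre-start script exports GLOG_v; report it if it is missing.
    if "GLOG_v" not in env:
        problems.append("GLOG_v is not set")

    return problems


if __name__ == "__main__":
    for problem in check_graceful_exit_env():
        print("[check]", problem)
```

Calling the helper at the top of the training script and logging the returned list is one way to surface a failed install early, before the callbacks that depend on the package are constructed.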
  2. Create a training job
    • Constraint: MindSpore 1.6.0 or later is required.
    • Modify the sample code by adding the following:
      # Import the required interface
      from mindx_elastic.terminating_message import ExceptionCheckpoint
      ...

      if args_opt.do_train:
          dataset = create_dataset()
          loss_cb = LossMonitor()
          cb = [loss_cb]
          if int(os.getenv('RANK_ID')) == 0:
              batch_num = dataset.get_dataset_size()
              # Enable checkpoint saving on graceful exit
              config_ck = CheckpointConfig(save_checkpoint_steps=batch_num,
                                           keep_checkpoint_max=35,
                                           async_save=True,
                                           append_info=[{"epoch_num": cur_epoch_num}],
                                           exception_save=True)

              ckpoint_cb = ModelCheckpoint(prefix="train_resnet_cifar10",
                                           directory=args_opt.train_url,
                                           config=config_ck)
              # Define the graceful-exit checkpoint callback
              ckpoint_exp = ExceptionCheckpoint(
                  prefix="train_resnet_cifar10",
                  directory=args_opt.train_url,
                  config=config_ck)
              # Register the graceful-exit checkpoint callbacks
              cb += [ckpoint_cb, ckpoint_exp]
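Two details in the snippet above are easy to trip over: the MindSpore 1.6.0 minimum, and `int(os.getenv('RANK_ID'))`, which raises `TypeError` when `RANK_ID` is unset. The following is a minimal sketch of defensive pre-flight helpers using only the standard library. The names `meets_min_version` and `is_rank_zero` are hypothetical. Treating a missing `RANK_ID` as rank 0 is an assumption that suits single-device runs; the source does not specify it.

```python
import os
import re


def meets_min_version(version, minimum=(1, 6, 0)):
    """Compare the leading numeric parts of a version string to a minimum.

    Suffixes such as 'rc1' are ignored, so '1.6.0rc1' counts as 1.6.0.
    """
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    # Pad short versions such as '2.0' to three components.
    parts += [0] * (3 - len(parts))
    return tuple(parts) >= minimum


def is_rank_zero(env=None):
    """True when RANK_ID is 0; a missing RANK_ID is assumed to mean rank 0."""
    env = os.environ if env is None else env
    return int(env.get("RANK_ID", "0")) == 0
```

In a training script these could be used as `meets_min_version(mindspore.__version__)` before building the callbacks, and `if is_rank_zero():` in place of the direct `int(os.getenv('RANK_ID'))` comparison.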
support.huaweicloud.com/usermanual-standard-modelarts/modelarts_trouble_0107.html