ModelArts AI Development Platform - InternVL2 Training Guide for PyTorch NPU on DevServer (6.3.910): Step 9: Start Training

Time: 2024-12-17 18:07:08

Step 9: Start Training

Single-node training

cd ${container_work_dir}/InternVL/internvl_chat
# 8B full-parameter fine-tuning
GPUS=8 PER_DEVICE_BATCH_SIZE=2 sh shell/internvl2.0/2nd_finetune/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_full.sh
# 8B LoRA fine-tuning
GPUS=8 PER_DEVICE_BATCH_SIZE=2 sh shell/internvl2.0/2nd_finetune/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora.sh
# 26B LoRA fine-tuning
GPUS=8 PER_DEVICE_BATCH_SIZE=2 sh shell/internvl2.0/2nd_finetune/internvl2_26b_internlm2_20b_dynamic_res_2nd_finetune_lora.sh
# 40B LoRA fine-tuning
GPUS=8 PER_DEVICE_BATCH_SIZE=2 sh shell/internvl2.0/2nd_finetune/internvl2_40b_hermes2_yi_34b_dynamic_res_2nd_finetune_lora.sh
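Fine-tuning runs can take a long time, so it is often convenient to launch them in the background and keep a log in case the terminal session is interrupted. The following is only a minimal sketch, not part of the official guide; the log file name and the choice of the 8B LoRA script are examples.

cd ${container_work_dir}/InternVL/internvl_chat
# Example only: run the 8B LoRA fine-tuning in the background and keep a log file
nohup env GPUS=8 PER_DEVICE_BATCH_SIZE=2 sh shell/internvl2.0/2nd_finetune/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora.sh > train_8b_lora.log 2>&1 &
# Follow the training output
tail -f train_8b_lora.log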

Multi-node training

cd ${container_work_dir}/InternVL/internvl_chat
# 8B LoRA fine-tuning
GPUS=8 PER_DEVICE_BATCH_SIZE=2 NNODES=${NODE_NUM} NODE_RANK=${NODE_RANK} MASTER_ADDR="${master_node_ip}" sh shell/internvl2.0/2nd_finetune/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora_multi.sh
# 8B full-parameter fine-tuning
GPUS=8 PER_DEVICE_BATCH_SIZE=2 NNODES=${NODE_NUM} NODE_RANK=${NODE_RANK} MASTER_ADDR="${master_node_ip}" sh shell/internvl2.0/2nd_finetune/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_full_multi.sh
# 26B LoRA fine-tuning
GPUS=8 PER_DEVICE_BATCH_SIZE=2 NNODES=${NODE_NUM} NODE_RANK=${NODE_RANK} MASTER_ADDR="${master_node_ip}" sh shell/internvl2.0/2nd_finetune/internvl2_26b_internlm2_20b_dynamic_res_2nd_finetune_lora_multi.sh
# 26B full-parameter fine-tuning
GPUS=8 PER_DEVICE_BATCH_SIZE=2 NNODES=${NODE_NUM} NODE_RANK=${NODE_RANK} MASTER_ADDR="${master_node_ip}" sh shell/internvl2.0/2nd_finetune/internvl2_26b_internlm2_20b_dynamic_res_2nd_finetune_full_multi.sh
# 40B LoRA fine-tuning
GPUS=8 PER_DEVICE_BATCH_SIZE=2 NNODES=${NODE_NUM} NODE_RANK=${NODE_RANK} MASTER_ADDR="${master_node_ip}" sh shell/internvl2.0/2nd_finetune/internvl2_40b_hermes2_yi_34b_dynamic_res_2nd_finetune_lora_multi.sh
# 40B full-parameter fine-tuning
GPUS=8 PER_DEVICE_BATCH_SIZE=2 NNODES=${NODE_NUM} NODE_RANK=${NODE_RANK} MASTER_ADDR="${master_node_ip}" sh shell/internvl2.0/2nd_finetune/internvl2_40b_hermes2_yi_34b_dynamic_res_2nd_finetune_full_multi.sh

Parameter description

  • NODE_NUM: number of machines (nodes).
  • NODE_RANK: rank of the machine; 0 on the master node, incremented by 1 on each additional node (see the two-node sketch after this list).
  • MASTER_ADDR: IP address of the master node.
  • GPUS: number of NPU cards per machine.
  • PER_DEVICE_BATCH_SIZE: batch size per card.
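As a concrete illustration of how these parameters fit together, the following is a minimal two-node sketch using the 8B LoRA multi-node script listed above. The IP address 192.168.0.10 is only a placeholder for the real master node IP; substitute your own values for NNODES and MASTER_ADDR.

# On the master node (NODE_RANK=0)
cd ${container_work_dir}/InternVL/internvl_chat
GPUS=8 PER_DEVICE_BATCH_SIZE=2 NNODES=2 NODE_RANK=0 MASTER_ADDR="192.168.0.10" sh shell/internvl2.0/2nd_finetune/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora_multi.sh

# On the second node (NODE_RANK=1)
cd ${container_work_dir}/InternVL/internvl_chat
GPUS=8 PER_DEVICE_BATCH_SIZE=2 NNODES=2 NODE_RANK=1 MASTER_ADDR="192.168.0.10" sh shell/internvl2.0/2nd_finetune/internvl2_8b_internlm2_7b_dynamic_res_2nd_finetune_lora_multi.sh

With this configuration the job spans 2 × 8 = 16 NPU cards in total.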

A successful training run is shown in the following figure.
