AI Development Platform ModelArts: NPU Card Counts and Gradient Accumulation Values by Model
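Table 1 lists the recommended training parameters and compute specifications for each model. In the "Specification and node count" column, 1*node & 4*Ascend means a single node with 4 cards, and so on.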
Model | Parameters | Training type | Sequence length (cutoff_len) | Gradient accumulation | Optimization tool (DeepSpeed) | Specification and node count |
---|---|---|---|---|---|---|
llama2 | 7B | lora/dpo | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-1 | 1*node & 1*Ascend |
llama2 | 7B | sft | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-2 | 1*node & 8*Ascend |
llama2 | 13B | lora/dpo | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-2 | 1*node & 1*Ascend |
llama2 | 13B | sft | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-3 | 1*node & 8*Ascend |
llama2 | 70B | lora/dpo | 4096 | gradient_accumulation_steps: 8 | ZeRO-3 | 2*node & 8*Ascend |
llama2 | 70B | lora/dpo | 8192 | gradient_accumulation_steps: 8 | ZeRO-3-Offload | 2*node & 8*Ascend |
llama2 | 70B | sft | 4096/8192 | gradient_accumulation_steps: 4 | ZeRO-3-Offload | 4*node & 8*Ascend |
llama3 | 70B | lora/dpo | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-3 | 2*node & 8*Ascend |
llama3 | 70B | sft | 4096/8192 | gradient_accumulation_steps: 4 | ZeRO-3-Offload | 4*node & 8*Ascend |
llama3 | 8B | lora/dpo | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-2 | 1*node & 1*Ascend |
llama3 | 8B | sft | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-2 | 1*node & 8*Ascend |
llama3.1 | 8B | lora | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-1 | 1*node & 1*Ascend |
llama3.1 | 8B | sft | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-2 | 1*node & 8*Ascend |
llama3.1 | 70B | lora/dpo | 4096 | gradient_accumulation_steps: 8 | ZeRO-3 | 2*node & 8*Ascend |
llama3.1 | 70B | lora/dpo | 8192 | gradient_accumulation_steps: 8 | ZeRO-3-Offload | 2*node & 8*Ascend |
llama3.1 | 70B | sft | 4096/8192 | gradient_accumulation_steps: 4 | ZeRO-3-Offload | 4*node & 8*Ascend |
Qwen2 | 72B | lora/dpo | 4096 | gradient_accumulation_steps: 8 | ZeRO-3 | 2*node & 8*Ascend |
Qwen2 | 72B | lora/dpo | 8192 | gradient_accumulation_steps: 8 | ZeRO-3-Offload | 2*node & 8*Ascend |
Qwen2 | 72B | sft | 4096/8192 | gradient_accumulation_steps: 4 | ZeRO-3-Offload | 4*node & 8*Ascend |
Qwen2 | 7B | lora/dpo | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-0 | 1*node & 1*Ascend |
Qwen2 | 7B | sft | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-2 | 1*node & 8*Ascend |
Qwen2 | 0.5/1.5B | lora/sft/dpo | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-0 | 1*node & 1*Ascend |
Qwen1.5 | 0.5/1.8B | lora/sft/dpo | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-0 | 1*node & 1*Ascend |
Qwen1.5 | 4B | lora/dpo | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-1 | 1*node & 1*Ascend |
Qwen1.5 | 4B | sft | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-1 | 1*node & 4*Ascend |
Qwen1.5 | 7B | lora/dpo | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-1 | 1*node & 1*Ascend |
Qwen1.5 | 7B | sft | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-2 | 1*node & 8*Ascend |
Qwen1.5 | 14B | lora/dpo | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-3 | 1*node & 1*Ascend |
Qwen1.5 | 14B | sft | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-3 | 1*node & 8*Ascend |
Qwen1.5 | 32B | lora/dpo | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-3 | 1*node & 4*Ascend |
Qwen1.5 | 32B | sft | 4096 | gradient_accumulation_steps: 8 | ZeRO-3 | 2*node & 8*Ascend |
Qwen1.5 | 32B | sft | 8192 | gradient_accumulation_steps: 4 | ZeRO-3-Offload | 2*node & 8*Ascend |
Qwen1.5 | 72B | lora/dpo | 4096 | gradient_accumulation_steps: 8 | ZeRO-3 | 2*node & 8*Ascend |
Qwen1.5 | 72B | lora | 8192 | gradient_accumulation_steps: 8 | ZeRO-3-Offload | 2*node & 8*Ascend |
Qwen1.5 | 72B | sft | 4096/8192 | gradient_accumulation_steps: 4 | ZeRO-3-Offload | 4*node & 8*Ascend |
falcon2 | 11B | lora/dpo | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-2 | 1*node & 1*Ascend |
falcon2 | 11B | sft | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-2 | 1*node & 8*Ascend |
GLM4 | 9B | lora | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-2 | 1*node & 1*Ascend |
GLM4 | 9B | sft | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-3 | 1*node & 8*Ascend |
Yi | 6B | lora/dpo | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-1 | 1*node & 1*Ascend |
Yi | 6B | sft | 4096/8192 | gradient_accumulation_steps: 8 | ZeRO-1 | 1*node & 4*Ascend |
Yi | 34B | sft | 4096 | gradient_accumulation_steps: 8 | ZeRO-3 | 2*node & 8*Ascend |
Yi | 34B | lora/dpo | 4096 | gradient_accumulation_steps: 8 | ZeRO-3 | 1*node & 2*Ascend |
Yi | 34B | sft | 8192 | gradient_accumulation_steps: 8 | ZeRO-3 | 2*node & 8*Ascend |
Yi | 34B | lora/dpo | 8192 | gradient_accumulation_steps: 8 | ZeRO-3 | 1*node & 4*Ascend |
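The parameters above apply when the NPU FlashAttention fused operator is not enabled. The values are for reference only; configure other acceleration frameworks or the ZeRO (Zero Redundancy Optimizer) optimizer, the number of NPU nodes, and other settings according to your actual requirements.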
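When adjusting the node count or the gradient accumulation value, it can help to check how the two combine into the effective global batch size. The following is a minimal sketch, not part of the platform tooling: the helper names `parse_spec` and `global_batch_size` and the `per_device_batch` parameter are hypothetical, and the calculation assumes plain data parallelism across all cards, which is the usual setup with ZeRO.

```python
import re


def parse_spec(spec: str) -> tuple[int, int]:
    """Return (nodes, cards_per_node) from a spec such as '2*node & 8*Ascend'.

    The unit after the first '*' is not interpreted, so the original
    Chinese form '2*节点 & 8*Ascend' is accepted as well.
    """
    match = re.fullmatch(r"\s*(\d+)\*\S+\s*&\s*(\d+)\*Ascend\s*", spec)
    if match is None:
        raise ValueError(f"unrecognized spec: {spec!r}")
    return int(match.group(1)), int(match.group(2))


def global_batch_size(spec: str, grad_accum_steps: int, per_device_batch: int = 1) -> int:
    """Samples consumed per optimizer step under pure data parallelism.

    per_device_batch is an assumption: the table does not list a
    per-card batch size, so it defaults to 1 here.
    """
    nodes, cards_per_node = parse_spec(spec)
    return nodes * cards_per_node * grad_accum_steps * per_device_batch


if __name__ == "__main__":
    # A 70B sft row: 4 nodes with 8 cards each, gradient_accumulation_steps: 4.
    print(global_batch_size("4*node & 8*Ascend", grad_accum_steps=4))
```

If you reduce the number of cards, raising `gradient_accumulation_steps` by the same factor keeps the global batch size unchanged, at the cost of longer steps.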
For details on using the optimization tools, see "How to Choose the Best-Performing zero-stage and offloads".
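As a rough illustration of what the labels in the "Optimization tool (DeepSpeed)" column correspond to, the sketch below turns a label such as ZeRO-3-Offload into a DeepSpeed-style configuration dictionary. The keys `zero_optimization`, `stage`, `offload_optimizer`, `offload_param`, and `gradient_accumulation_steps` are standard DeepSpeed configuration keys; the function name `zero_config` is hypothetical, and treating "Offload" as CPU offload of the optimizer state (plus parameters at stage 3) is an assumption, since the source does not show the platform's actual configuration files.

```python
import json


def zero_config(label: str, grad_accum_steps: int) -> dict:
    """Build a DeepSpeed-style config dict from a table label like 'ZeRO-2'."""
    if not label.startswith("ZeRO-"):
        raise ValueError(f"unrecognized label: {label!r}")
    offload = label.endswith("-Offload")
    # Strip the prefix and the optional suffix, leaving only the stage number.
    stage = int(label.removesuffix("-Offload").removeprefix("ZeRO-"))
    if stage not in (0, 1, 2, 3):
        raise ValueError(f"unsupported ZeRO stage: {stage}")

    zero = {"stage": stage}
    if offload:
        # Assumption: offload targets CPU memory.
        zero["offload_optimizer"] = {"device": "cpu"}
        if stage == 3:
            # Parameter offload only exists for stage 3 in DeepSpeed.
            zero["offload_param"] = {"device": "cpu"}

    return {
        "gradient_accumulation_steps": grad_accum_steps,
        "zero_optimization": zero,
    }


if __name__ == "__main__":
    # The llama2 70B sft row: ZeRO-3-Offload with gradient_accumulation_steps: 4.
    print(json.dumps(zero_config("ZeRO-3-Offload", 4), indent=2))
```

A real training configuration needs further fields (batch size, precision, and so on), so treat this only as a reading aid for the table.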