AI Development Platform ModelArts - Training Job Fails with NCCL Errors: Symptom

Time: 2024-10-22 15:11:54

Symptom

The training job status is "Failed", and the job's "Logs" contain NCCL errors, for example "NCCL timeout", "RuntimeError: NCCL communicator was aborted on rank 7", "NCCL WARN Bootstrap : no socket interface found", and "NCCL INFO Call to connect returned Connection refused, retrying".
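To confirm this symptom in a downloaded log file, a small script can flag the lines that carry these messages. The following is a minimal sketch, not a ModelArts tool: the function name find_nccl_errors and the pattern list are illustrative assumptions built only from the example messages quoted above, and real logs may contain other NCCL messages.

```python
import re
from typing import Iterable, List, Tuple

# Patterns derived from the example messages quoted in the symptom text.
# This list is an assumption for illustration; extend it for other NCCL messages.
NCCL_PATTERNS = [
    re.compile(r"NCCL timeout", re.IGNORECASE),
    re.compile(r"NCCL communicator was aborted on rank \d+"),
    re.compile(r"NCCL WARN Bootstrap : no socket interface found"),
    re.compile(r"NCCL INFO Call to connect returned Connection refused"),
]


def find_nccl_errors(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (1-based line number, stripped line) for every matching log line."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(pattern.search(line) for pattern in NCCL_PATTERNS):
            hits.append((number, line.strip()))
    return hits
```

For example, passing an open file handle for a saved log (the file name train.log here is hypothetical), as in find_nccl_errors(open("train.log", encoding="utf-8", errors="replace")), returns the matching line numbers and their text.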

support.huaweicloud.com/trouble-modelarts/modelarts_trouble_0001.html