AI Development Platform ModelArts - Viewing All ModelArts Monitoring Metrics on the AOM Console: Network Metrics

Time: 2025-01-09 16:29:25

Network Metrics

Table 3 Diagnos (IB, collected only on dedicated resource pools)

| Category | Name | Metric | Description | Unit | Value Range |
| --- | --- | --- | --- | --- | --- |
| InfiniBand or RoCE network | PortXmitData | infiniband_port_xmit_data_total | Total number of data octets, divided by 4 (counted in 32-bit double words), transmitted on all VLs from the port. | Count | Natural number |
| InfiniBand or RoCE network | PortRcvData | infiniband_port_rcv_data_total | Total number of data octets, divided by 4 (counted in 32-bit double words), received on all VLs on the port. | Count | Natural number |
| InfiniBand or RoCE network | SymbolErrorCounter | infiniband_symbol_error_counter_total | Total number of minor link errors detected on one or more physical lanes. | Count | Natural number |
| InfiniBand or RoCE network | LinkErrorRecoveryCounter | infiniband_link_error_recovery_counter_total | Total number of times the Port Training state machine has successfully completed the link error recovery process. | Count | Natural number |
| InfiniBand or RoCE network | PortRcvErrors | infiniband_port_rcv_errors_total | Total number of packets containing errors that were received on the port, including: local physical errors (ICRC, VCRC, LPCRC, and all physical errors that cause entry into the BAD PACKET or BAD PACKET DISCARD states of the packet receiver state machine); malformed data packet errors (LVer, length, VL); malformed link packet errors (operand, length, VL); and packets discarded due to buffer overrun (overflow). | Count | Natural number |
| InfiniBand or RoCE network | LocalLinkIntegrityErrors | infiniband_local_link_integrity_errors_total | Number of retries initiated by a link transfer layer receiver. | Count | Natural number |
| InfiniBand or RoCE network | PortRcvRemotePhysicalErrors | infiniband_port_rcv_remote_physical_errors_total | Total number of packets marked with the EBP delimiter received on the port. | Count | Natural number |
| InfiniBand or RoCE network | PortRcvSwitchRelayErrors | infiniband_port_rcv_switch_relay_errors_total | Total number of packets received on the port that were discarded because they could not be forwarded by the switch relay, for the following reasons: DLID mapping; VL mapping; looping (output port = input port). | Count | Natural number |
| InfiniBand or RoCE network | PortXmitWait | infiniband_port_transmit_wait_total | Number of ticks during which the port had data to transmit but no data was sent during the entire tick (because of either insufficient credits or lack of arbitration). | Count | Natural number |
| InfiniBand or RoCE network | PortXmitDiscards | infiniband_port_xmit_discards_total | Total number of outbound packets discarded by the port because the port is down or congested. | Count | Natural number |
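All metrics listed above are cumulative counters, so a single reading says little by itself; the difference between two readings is what matters. The following Python sketch illustrates one way to turn two samples into throughput and newly observed errors. It relies on the fact that the two data counters count 32-bit words, so each unit equals 4 octets. The function names (`counter_delta`, `summarize`), the dictionary-based sample format, and the handling of counter resets are assumptions made for illustration; they are not part of any ModelArts or AOM API.

```python
# Illustrative sketch only: sample format and helper names are hypothetical.

OCTETS_PER_WORD = 4  # data counters are expressed in 32-bit words (4 octets each)

# Counters from the table that indicate link or packet errors.
ERROR_METRICS = (
    "infiniband_symbol_error_counter_total",
    "infiniband_link_error_recovery_counter_total",
    "infiniband_port_rcv_errors_total",
    "infiniband_local_link_integrity_errors_total",
    "infiniband_port_rcv_remote_physical_errors_total",
    "infiniband_port_rcv_switch_relay_errors_total",
    "infiniband_port_xmit_discards_total",
)


def counter_delta(prev: int, curr: int) -> int:
    """Increase between two samples.

    Assumption: if the value dropped, the counter was reset, so the
    current value is taken as the increase since the reset.
    """
    return curr - prev if curr >= prev else curr


def summarize(prev: dict, curr: dict, interval_s: float) -> dict:
    """Compare two samples (metric name -> value) taken interval_s seconds apart."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")

    def delta(metric: str) -> int:
        return counter_delta(prev.get(metric, 0), curr.get(metric, 0))

    def bytes_per_s(metric: str) -> float:
        # Convert 32-bit words to octets, then divide by the elapsed time.
        return delta(metric) * OCTETS_PER_WORD / interval_s

    # Keep only the error counters that actually increased.
    new_errors = {m: delta(m) for m in ERROR_METRICS if delta(m) > 0}

    return {
        "tx_bytes_per_s": bytes_per_s("infiniband_port_xmit_data_total"),
        "rx_bytes_per_s": bytes_per_s("infiniband_port_rcv_data_total"),
        "xmit_wait_ticks": delta("infiniband_port_transmit_wait_total"),
        "new_errors": new_errors,
    }


# Example with made-up values, sampled 60 seconds apart.
previous = {
    "infiniband_port_xmit_data_total": 1_000_000,
    "infiniband_port_rcv_data_total": 2_000_000,
    "infiniband_symbol_error_counter_total": 3,
}
current = {
    "infiniband_port_xmit_data_total": 1_750_000,
    "infiniband_port_rcv_data_total": 2_250_000,
    "infiniband_symbol_error_counter_total": 5,
}
report = summarize(previous, current, interval_s=60)
print(report)
```

In this made-up example, the transmit counter grew by 750,000 words over 60 seconds, which corresponds to 50,000 bytes per second, and the symbol error counter grew by 2. A steadily increasing error counter or a large `xmit_wait_ticks` delta is usually more informative than the absolute counter value.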

support.huaweicloud.com/usermanual-standard-modelarts/resmgmt-modelarts_0033.html