AI Development Platform ModelArts - Viewing All ModelArts Monitoring Metrics on the AOM Console: Network Metrics
Network Metrics
Category | Name | Metric | Description | Unit | Value Range |
---|---|---|---|---|---|
InfiniBand or RoCE network | PortXmitData | infiniband_port_xmit_data_total | Total number of data octets, divided by 4 (that is, counted in 32-bit double words), transmitted on all VLs from the port. | Count | Natural number |
InfiniBand or RoCE network | PortRcvData | infiniband_port_rcv_data_total | Total number of data octets, divided by 4 (that is, counted in 32-bit double words), received on all VLs at the port. | Count | Natural number |
InfiniBand or RoCE network | SymbolErrorCounter | infiniband_symbol_error_counter_total | Total number of minor link errors detected on one or more physical lanes. | Count | Natural number |
InfiniBand or RoCE network | LinkErrorRecoveryCounter | infiniband_link_error_recovery_counter_total | Total number of times the Port Training state machine has successfully completed the link error recovery process. | Count | Natural number |
InfiniBand or RoCE network | PortRcvErrors | infiniband_port_rcv_errors_total | Total number of packets containing errors that were received on the port, including: local physical errors (ICRC, VCRC, LPCRC, and all physical errors that cause entry into the BAD PACKET or BAD PACKET DISCARD states of the packet receiver state machine); malformed data packet errors (LVer, length, VL); malformed link packet errors (operand, length, VL); and packets discarded due to buffer overrun (overflow). | Count | Natural number |
InfiniBand or RoCE network | LocalLinkIntegrityErrors | infiniband_local_link_integrity_errors_total | Number of retries initiated by a link transfer layer receiver. | Count | Natural number |
InfiniBand or RoCE network | PortRcvRemotePhysicalErrors | infiniband_port_rcv_remote_physical_errors_total | Total number of packets marked with the EBP delimiter received on the port. | Count | Natural number |
InfiniBand or RoCE network | PortRcvSwitchRelayErrors | infiniband_port_rcv_switch_relay_errors_total | Total number of packets received on the port that were discarded because they could not be forwarded by the switch relay, for any of the following reasons: DLID mapping; VL mapping; looping (output port = input port). | Count | Natural number |
InfiniBand or RoCE network | PortXmitWait | infiniband_port_transmit_wait_total | Number of ticks during which the port had data to transmit but no data was sent during the entire tick (either because of insufficient credits or because of lack of arbitration). | Count | Natural number |
InfiniBand or RoCE network | PortXmitDiscards | infiniband_port_xmit_discards_total | Total number of outbound packets discarded by the port because the port is down or congested. | Count | Natural number |
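
All of these metrics are cumulative counters, so they are usually read as the difference between two samples rather than as absolute values. The sketch below shows one way to turn two readings of `infiniband_port_xmit_data_total` or `infiniband_port_rcv_data_total` into a byte rate, and to list which error counters grew between samples. It assumes the exported value keeps the 32-bit word unit described in the table (octets divided by 4); verify this against your own data. The `Sample` type, the function names, and the reset handling are hypothetical helpers for illustration, not part of ModelArts or AOM, and fetching the samples from AOM is not shown.

```python
from dataclasses import dataclass
from typing import Dict

# Assumption: the data counters count 32-bit words (octets divided by 4),
# as stated in the metric descriptions.
OCTETS_PER_WORD = 4


@dataclass
class Sample:
    """One reading of a counter (hypothetical helper type)."""
    timestamp: float  # seconds
    value: int        # raw counter value


def counter_delta(prev: Sample, curr: Sample) -> int:
    """Increase of a cumulative counter between two samples."""
    if curr.value < prev.value:
        # Assumption: a decrease means the counter was reset,
        # so the new value is counted from zero.
        return curr.value
    return curr.value - prev.value


def throughput_bytes_per_second(prev: Sample, curr: Sample) -> float:
    """Average byte rate between two samples of a data counter."""
    elapsed = curr.timestamp - prev.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    return counter_delta(prev, curr) * OCTETS_PER_WORD / elapsed


def grown_error_counters(prev: Dict[str, int], curr: Dict[str, int]) -> Dict[str, int]:
    """Return the error counters that increased, mapped to their increase."""
    grown = {}
    for name, value in curr.items():
        increase = value - prev.get(name, 0)
        if increase > 0:
            grown[name] = increase
    return grown


if __name__ == "__main__":
    # Made-up readings taken 10 seconds apart.
    rate = throughput_bytes_per_second(Sample(0.0, 1_000_000), Sample(10.0, 3_500_000))
    print(f"transmit rate: {rate:.0f} bytes/s")

    before = {"infiniband_symbol_error_counter_total": 5,
              "infiniband_port_xmit_discards_total": 0}
    after = {"infiniband_symbol_error_counter_total": 5,
             "infiniband_port_xmit_discards_total": 12}
    print(grown_error_counters(before, after))
```

With the made-up readings above, 2,500,000 words over 10 seconds works out to 1,000,000 bytes per second, and only `infiniband_port_xmit_discards_total` is reported as having grown. A steadily increasing error counter is generally more informative than its absolute value, since the total accumulates over the lifetime of the port.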