Network¶
Perlmutter Slingshot Interconnect¶
Perlmutter uses Slingshot, a high-performance networking interconnect
technology developed by Hewlett Packard Enterprise (HPE). The hardware
component of Perlmutter's Slingshot interconnect is described in more
detail on the Perlmutter Architecture
page. Here we describe
some user-tunable features of the Slingshot host software (SHS).
In general, more details can be found in the intro_mpi man page
(man intro_mpi).
Adjusting message matching mode¶
Hardware message matching is a powerful tool for improving the performance, efficiency, and reliability of HPC network communication. It typically involves the use of specialized hardware components that are optimized for high-speed data processing. These components can perform message matching operations in real-time, enabling high-speed data transfer and processing in complex systems.
By default, SHS uses hardware message matching and only switches
over to (slower) software message matching when the hardware
message queues are full. This default corresponds to the environment
variable FI_CXI_RX_MATCH_MODE being set to hybrid. This ensures that codes
that generate many MPI messages will continue to function once they've
filled up the hardware queues on the Network Interface Cards (albeit
at a slower rate). This comes at the cost of a slight overhead in
memory usage. The exact value depends on the number of ranks and the
size of each request buffer, but will generally be less than 2% of
the existing memory on the node. If users wish to default to only
using hardware message matching, they can set
FI_CXI_RX_MATCH_MODE=hardware.
Defaulting to pure hardware mode can cause jobs to fail
Setting `FI_CXI_RX_MATCH_MODE=hardware` can cause jobs to fail
when they exhaust the hardware message queue (usually by
sending too many MPI messages). If this happens, the job will
fail with a message like `LE resources not recovered during
flow control. FI_CXI_RX_MATCH_MODE=[hybrid|software] is
required`.
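As a sketch of how this might look in practice, the variable can be exported in a Slurm batch script before launching the MPI job. The node count, time limit, and application name below are placeholders, not recommendations:

```bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --constraint=cpu
#SBATCH --time=00:30:00

# Use only hardware message matching; see the warning above about
# jobs that exhaust the hardware message queue. Leave this unset,
# or set it to "hybrid" (the default), to fall back to software
# matching when the hardware queues fill.
export FI_CXI_RX_MATCH_MODE=hardware

# ./my_mpi_app is a placeholder for your MPI application
srun ./my_mpi_app
```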
Network Performance Report¶
If desired, SHS can collect network data such as the number of network
timeouts and counter data for Network Interface Cards during an MPI
job. It is often difficult to put these reports in context for a
single job, but if you wish to enable them, you can do so by
setting the environment variable MPICH_OFI_CXI_COUNTER_REPORT to a
value between 1 and 5. See man intro_mpi for a list of the different
reporting levels available. Note that if there are no network
timeouts, no extra information will be printed.
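As an illustrative sketch, the report could be requested for a single job by exporting the variable before the launch command; the reporting level of 2 and the application name are arbitrary examples:

```bash
# Ask Cray MPICH to print a NIC counter report at the end of the job.
# Levels 1-5 give increasingly verbose output (see man intro_mpi);
# level 2 is used here only as an example.
export MPICH_OFI_CXI_COUNTER_REPORT=2

# ./my_mpi_app is a placeholder for your MPI application
srun ./my_mpi_app
```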