Performance variability¶
There are many potential sources of variability on an HPC system. NERSC has identified the following best practices to mitigate variability and improve application performance.
hugepages¶
Use of hugepages can reduce the cost of accessing memory, especially
in the case of many MPI_Alltoall operations.
- Load the hugepages module (module load craype-hugepages2M).
- Recompile your code.
- Add module load craype-hugepages2M to batch scripts.
Note
Consider adding module load craype-hugepages2M to
~/.bashrc.
For more details see the manual pages (man intro_hugepages).
Location of executables¶
Compile executables in $HOME or /tmp. Copying executables into
compute node memory at the start of a job with sbcast can greatly
improve job startup times and, in some cases, reduce run-time
variability.
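As a minimal sketch of this workflow, the snippet below writes a batch script that broadcasts an executable to /tmp on each allocated node and then launches it from there. The file name sbcast_job.sh, the program name my_program.x, and the node and rank counts are placeholders chosen for illustration; a real job would also need the usual time limit and QOS settings.

```shell
#!/bin/bash
# Sketch only: generate a batch script that stages the executable with sbcast.
# sbcast_job.sh and my_program.x are hypothetical names.
cat > sbcast_job.sh <<'EOF'
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --constraint=cpu

# Copy the executable from the submission directory to /tmp on every node.
sbcast ./my_program.x /tmp/my_program.x

# Launch the node-local copy instead of the one on the shared filesystem.
srun -n 8 -c 64 --cpu-bind=cores /tmp/my_program.x
EOF

# On the system, submit with: sbatch sbcast_job.sh
```

Writing the script to a file first makes it easy to inspect before submission; the same two commands (sbcast, then srun on the /tmp copy) can also be placed directly in an existing batch script.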
For applications with dynamic executables and many libraries (especially Python-based applications), use Shifter.
Network Congestion¶
Other communication-intensive workloads running at the same time as yours can cause the time your job spends on communication to vary. Cray MPI provides environment variables that change the strategy the system uses to route messages in your job. The Network page provides more details on these environment variables.
Affinity¶
Running with correct affinity and binding options can greatly affect variability.
- use at least 8 ranks per node (1 rank per node cannot utilize the full network bandwidth)
- read man intro_mpi for additional options
- check job script generator to get correct binding
- use check-mpi.<compiler>.pm and check-hybrid.<compiler>.pm, where <compiler> can be gnu, nvidia, or cce, to check affinity settings
elvis@perlmutter$ salloc -N 2 -C cpu -q interactive -t 10:00
salloc: Granted job allocation 9887582
salloc: Waiting for resource configuration
salloc: Nodes nid[004434,005440] are ready for job
elvis@nid004434$ srun -n 8 -c 64 --cpu-bind=cores check-mpi.gnu.pm|sort -nk 4
Hello from rank 0, on nid004434. (core affinity = 0-31,128-159)
Hello from rank 1, on nid004434. (core affinity = 64-95,192-223)
Hello from rank 2, on nid004434. (core affinity = 32-63,160-191)
Hello from rank 3, on nid004434. (core affinity = 96-127,224-255)
Hello from rank 4, on nid005440. (core affinity = 0-31,128-159)
Hello from rank 5, on nid005440. (core affinity = 64-95,192-223)
Hello from rank 6, on nid005440. (core affinity = 32-63,160-191)
Hello from rank 7, on nid005440. (core affinity = 96-127,224-255)
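To make a check like the one above repeatable, the output can be scanned automatically. The sketch below defines a hypothetical helper, find_affinity_overlaps, that reads lines in the format shown above and reports any two ranks on the same node with an identical core affinity list. It detects exact duplicates only, not partially overlapping ranges, and it assumes the output format stays as shown.

```shell
#!/bin/bash
# Sketch only: flag ranks on the same node that report the same core affinity.
# find_affinity_overlaps is a hypothetical helper, not a NERSC-provided tool.
find_affinity_overlaps() {
    awk '
        /core affinity =/ {
            node = $6;  sub(/\.$/, "", node)   # strip trailing "." from the node name
            mask = $NF; sub(/\)$/, "", mask)   # strip trailing ")" from the core list
            key = node " " mask
            if (key in seen) {
                print "duplicate affinity on " node ": " mask
                dup = 1
            }
            seen[key] = 1
        }
        END { exit dup + 0 }                   # non-zero exit if any duplicate was found
    '
}

# On the system, usage could look like:
#   srun -n 8 -c 64 --cpu-bind=cores check-mpi.gnu.pm | find_affinity_overlaps
```

For the output shown above, every rank on a given node has a distinct core list, so the helper would print nothing and return success.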
Core specialization¶
Core specialization (#SBATCH -S n or #SBATCH --core-spec=n)
moves OS functions to cores not in use by user applications, where
n is the number of cores to dedicate to the OS. The flag works only
in a batch script submitted with sbatch. It cannot be requested
with salloc for interactive jobs, since salloc is already a
wrapper script for srun.
The example reserves 1 core per node on Perlmutter CPU for the OS and leaves the
other 127 for the application. When computing the -c (or
--cpus-per-task) value using the formula provided in the
affinity page, cores for
the OS should be excluded from the numerator. With 32 tasks on 2 nodes
(16 tasks per node), the -c value
is \(2 \times \left\lfloor (128-1)/(32/2) \right\rfloor = 14\).
#SBATCH --nodes=2
#SBATCH --constraint=cpu
#SBATCH -S 1
srun -n 32 -c 14 --cpu-bind=cores /tmp/my_program.x
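The arithmetic behind the -c value can be scripted so that it is recomputed whenever the node count, rank count, or number of OS cores changes. The sketch below is one way to do that; the function name cpus_per_task and its argument order are illustrative, and it assumes 2 hardware threads per physical core, matching the factor of 2 in the formula.

```shell
#!/bin/bash
# Sketch only: compute the srun -c value when some cores are reserved for the OS.
# cpus_per_task is a hypothetical helper; assumes 2 hardware threads per core.
cpus_per_task() {
    local cores_per_node=$1   # physical cores per node (128 on Perlmutter CPU)
    local os_cores=$2         # cores reserved with -S / --core-spec
    local tasks_per_node=$3   # MPI ranks per node
    # Bash integer division truncates, which is the floor for positive values.
    echo $(( 2 * ( (cores_per_node - os_cores) / tasks_per_node ) ))
}

# 32 ranks on 2 nodes gives 16 ranks per node, with 1 core reserved for the OS.
cpus_per_task 128 1 $(( 32 / 2 ))   # prints 14
```

Without core specialization (0 cores reserved), the same call returns 16, which shows how reserving a single core lowers the per-task value once the floor is applied.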
Combined example¶
This example is for Perlmutter CPU.