GNU Parallel¶
GNU Parallel is a free, open-source tool for running shell commands and scripts in parallel and sequence on a single node.
This tool is best suited for workflows that contain many similar tasks with no execution order requirements or data dependencies. Its second primary strength is a convenient syntax for specifying patterns of file path organization and command parameters.
It is workable, but does not excel, when tasks use multiple nodes or MPI applications. A logging system is present and enables incomplete workflows to be resumed, but it is rudimentary; if a project requires that consistent workflow state be maintained across many batch job submissions, then GNU Parallel alone is not the best choice.
Strengths of GNU Parallel:¶
- No user installation or configuration
- No database or persistent server needed
- Powerful specification of task filenames, paths, and parameters
- Easily scales to a very large number of tasks
- Does not burden the cluster scheduler
Disadvantages of GNU Parallel:¶
- Doesn't easily load balance work
- Requires careful organization of input and output files by the user
- Scaling up requires awareness of system I/O performance
- Does not strongly preserve workflow state between invocations
- Modest familiarity with bash scripting recommended
Example Repository¶
Working examples of the following concepts are available on the
DOE Cross-facility Workflows Training Repository.
Obtain them by running git clone with the repository URL in your Perlmutter storage, then navigate to /DOE-HPC-workflow-training/GNU-Parallel/NERSC.
How to use GNU Parallel at NERSC¶
In this first example, seq is used to generate four lines of input, which are
piped into parallel. Those four input lines cause parallel to run four
tasks in total. Then the command each task will run is passed to parallel, in
this case, echo and its arguments. The {} in the command sets a location
where individual input line content will be substituted inside each task
command.
Basic example
The next necessary concept is how to submit substantial tasks and input files
to parallel. This example shows how input data created with sequential
file names can be passed to parallel using bash commands and pipes:
Sequentially named input files for each task
elvis@nid004258:~/work> ls
input01.dat input02.dat input03.dat input04.dat input05.dat
input06.dat input07.dat input08.dat input09.dat input10.dat
elvis@nid004258:~/work> seq -w 1 10 | parallel task_command.sh input{}.dat
Run substantial tasks inside a batch job or an interactive salloc session. Though parallel is great for
automating mundane, lightweight, and repetitive tasks such as creating many directories or
parsing lots of log files, tasks which use a substantial amount of
compute resources must be run on Slurm-allocated compute nodes
and not on shared login nodes.
A second approach places all task inputs into the same directory and uses the
find command to build the file listing all their paths:
Using find to build an input file list
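A minimal sketch of this approach; the file names mirror the earlier example, and echo stands in for the user's real task command:

```shell
# Create a demo directory of sequentially named input files.
mkdir -p inputs
touch inputs/input{01..10}.dat

# Gather all task input paths into an explicit list file.
find ./inputs -name 'input*.dat' | sort > tasks.txt

# Pass the list file to parallel with ::::; each line becomes one
# task's argument (echo stands in for a real task command here).
parallel echo "running task on {}" :::: tasks.txt
```

Writing the task list to an explicit file, rather than piping it in, also scales better, as described in the pitfalls below.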
I/O Performance Pitfalls at Large Scale
If your work requires a large number (more than 1000) of tasks and input files, then additional consideration of the I/O systems may be needed:
- Directories containing more than 1000 files are less performant; distribute files across subdirectories to avoid this.
- When using a pipe to pass the task list to GNU Parallel, multiple temporary files are needed per task, which can cause parallel to fail by exceeding the OS limit (ulimit) on open file handles. Avoid this by explicitly creating a file containing the task list and passing it as an argument to parallel.
- At larger scales it is more important to read and write data on higher-performance file systems, such as the Lustre scratch file system.
- If all of your tasks read the same files, you can improve performance by making multiple copies of those files and assigning different tasks to read different copies.
Running Many Tasks Inside a Single Node Allocation¶
This Slurm batch script requests one CPU node in the regular QOS and then
runs parallel on that node. The parallel command runs up to six instances of
payload.sh at a time, one for each line in the file
input.txt. If the input file contains more than six lines,
the additional tasks wait until earlier tasks finish and a slot
becomes available. Each input line becomes an argument to its task
script.
single_node_many_task_with_parallel.sh
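The script body might look like the following sketch; the payload.sh and input.txt names follow the description above, and account and time directives are omitted:

```shell
#!/bin/bash
#SBATCH --qos=regular
#SBATCH --constraint=cpu
#SBATCH --nodes=1

# Run up to 6 tasks at a time; each line of input.txt becomes
# one task's argument to payload.sh.
parallel -j 6 payload.sh {} :::: input.txt
```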
This arrangement is a great alternative to submitting many individual jobs or a task array to the shared Slurm QOS. Current scheduling policy allows only two jobs per user to gain priority at a time, so a single job running many tasks will spend less time waiting in the queue than many jobs each running a single task. This pattern also requires much less interaction with the Slurm controller, making it less likely to cause, or be affected by, heavy load on the controller.
Many Tasks Inside a Multiple Node Allocation¶
This pattern is demonstrated using two scripts: a batch script submitted to Slurm and a driver script containing the parallel and payload commands.
The batch script requests two CPU nodes, then srun runs two
instances of driver.sh, with the $1 argument containing the task input
list.
multiple_nodes_many_tasks_parallel.sh
#!/bin/bash
#SBATCH --qos=regular
#SBATCH --nodes=2
#SBATCH --constraint=cpu
#SBATCH --ntasks-per-node 1
srun --no-kill --ntasks=2 --wait=0 driver.sh $1
The --no-kill argument keeps the Slurm allocation running if any of
the allocated nodes fail during the job. The --wait=0 argument prevents
the job from terminating the other driver instances when the first one
finishes.
The driver script uses environment variables set by Slurm inside a job to
distinguish each instance of parallel, and then round-robin distributes input
tasks to them using awk.
driver.sh
#!/bin/bash
if [[ -z "${SLURM_NODEID}" ]]; then
echo "need \$SLURM_NODEID set"
exit
fi
if [[ -z "${SLURM_NNODES}" ]]; then
echo "need \$SLURM_NNODES set"
exit
fi
cat $1 | \
awk -v NNODE="$SLURM_NNODES" -v NODEID="$SLURM_NODEID" \
'NR % NNODE == NODEID' | \
parallel payload.sh {}
The conditional statements make sure the needed Slurm environment variables
are in place. $SLURM_NNODES holds the total number of nodes in the
job and $SLURM_NODEID holds the unique ID number of this node. The
awk command uses the line number of each input and the two environment
variables to implement round-robin assignments of tasks to nodes. An
advantage of this method is the number of nodes requested by the job can be
freely changed without needing to adjust the task-to-node assignment logic.
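The round-robin assignment is easy to verify outside a job by setting the Slurm variables by hand; the tasks.txt file here is illustrative:

```shell
# Simulate the driver's awk filter for a 2-node job, viewed from
# the instance running on node 1.
seq 1 6 > tasks.txt
SLURM_NNODES=2
SLURM_NODEID=1
awk -v NNODE="$SLURM_NNODES" -v NODEID="$SLURM_NODEID" \
    'NR % NNODE == NODEID' tasks.txt
# Prints lines 1, 3, and 5: NR is 1-based, SLURM_NODEID is 0-based.
```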
Grouping Many One-Node MPI Jobs Into a Larger Job¶
This Slurm batch script demonstrates how parallel can be used to distribute
multiple single-node MPI tasks within a multi-node job. The batch script starts
a job with 4 CPU nodes, then tells parallel to run 4 simultaneous jobs.
Note here that the number of nodes requested and the number of processes to run
in parallel are the same, in order to run one instance of the MPI task script
on each node. This job script must request an appropriate number of tasks per
node as well (128 on a Perlmutter CPU node) so that same value may be used
in the MPI task script.
mpi-task-job.sub
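The batch script might look like the following sketch; mpi-task.sh and input.txt are assumed names, and account and time directives are omitted:

```shell
#!/bin/bash
#SBATCH --qos=regular
#SBATCH --constraint=cpu
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=128

# 4 nodes and 4 simultaneous parallel slots: one single-node
# MPI task script runs on each node at a time.
parallel -j 4 mpi-task.sh {} :::: input.txt
```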
Below is an example task script which would be used in combination with the
above job script. The srun command is used to launch the MPI executable, with
its argument passed in via parallel and the input.txt input file. Note that
the srun call specifies one node and 128 tasks on that node.
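A sketch of such a task script; the script and executable names are assumptions, and $1 holds the argument passed in by parallel from input.txt:

```shell
#!/bin/bash
# mpi-task.sh: launch one single-node MPI run per parallel slot,
# using all 128 tasks requested per node in the job script.
srun --nodes=1 --ntasks=128 ./mpi_program "$1"
```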
Scaling Parallel with --sshlogin is Not Recommended¶
GNU parallel includes a feature to distribute tasks to multiple machines using ssh connections. Though this allows work to balance between multiple nodes, our testing suggests that scaling is much less effective and it would be better to use a different task manager. More detail about this finding is available upon request.
Resuming Unfinished or Retrying Failed Tasks¶
If any tasks in a GNU parallel instance return a non-zero exit code, the parallel command will also return non-zero. Parallel can be configured to use a job log file which tracks failed or incomplete tasks so that they can be resumed or retried.
Add --resume-failed --joblog logfile.txt to the list of parallel arguments
and the state of tasks will be recorded. When that parallel instance is rerun
with the exact same command line, it will skip any tasks that are already
complete and re-run any tasks which failed. When using joblog it is good
practice to use the available Slurm environment variables to distinguish
files for each instance of parallel.
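For example, a sketch in which the joblog file name incorporates Slurm variables to keep each instance's log distinct (payload.sh and input.txt are illustrative):

```shell
# Record task state; on rerun with the identical command line,
# completed tasks are skipped and failed tasks are retried.
parallel --resume-failed \
    --joblog "joblog.${SLURM_JOB_ID}.${SLURM_NODEID}" \
    payload.sh {} :::: input.txt
```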
It is very important that the input file and command line arguments not be modified between runs and that only one instance of parallel per log file runs at a time.
Note that the --retries n parallel argument seems like it should allow
an instance of parallel to retry a failed task, but this feature
only works in combination with the --sshlogin feature.