Monitoring Jobs¶
Note
Continuously running squeue or sqs under watch, and especially running multiple instances of watch squeue or watch sqs, is not allowed. When many users do this at once, it degrades the performance of the job scheduler, which is a shared resource.
If you must monitor your workload, run only a single instance of
squeue or sqs, or use sacct. If watch is essential to your workflow, then
set the refresh interval to at least 1 minute (watch -n 60) and be sure to
terminate the process when you are not actively using it.
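If watch really is needed, the guidance above amounts to one instance with a 60-second refresh. A minimal sketch, in which the function name watch_my_jobs is hypothetical and the --me filter is an assumption about what you want to see:

```shell
# Hypothetical helper: a single watch instance refreshing every 60 seconds.
# Stop it with Ctrl-C as soon as you are no longer looking at it.
watch_my_jobs() {
    watch -n 60 squeue --me
}
```

Run watch_my_jobs in one terminal only, and close it when you step away.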
To monitor a job's resource usage while the job is running, see the section below on how to log in to compute nodes running your jobs.
sqs¶
sqs is a NERSC custom wrapper for the native Slurm squeue command, with a chosen default format,
for viewing job information in the batch queue managed by Slurm. The sqs command without any
flag displays queued jobs for the logged-in user. Invoking sqs -a displays the jobs of all users.
sqs is fully compatible with squeue in that it accepts any flag that squeue accepts,
which gives more flexibility in customizing the output. For
example, you could choose to see only running jobs with -t R, or you
could override the default format of sqs with the -o flag to provide the
list and format of the fields you are interested in.
Note
Please refer to sqs --help and the squeue man page for the available flags and more information.
$ sqs
JOBID ST USER NAME NODES TIME_LIMIT TIME SUBMIT_TIME QOS START_TIME FEATURES NODELIST(REASON
9992934 R elvis myjob1 1024 12:00:00 0:00 2023-06-05T05:05:12 regular_0 2023-06-05T06:00:00 cpu nid[004196-0041
9992980 PD elvis myjob2 1024 12:00:00 0:00 2023-06-05T05:19:59 regular_0 2023-06-05T06:00:00 cpu (ReqNodeNotAvai
9995272 PD elvis myjob3 48 6:00:00 0:00 2023-06-05T05:38:36 regular_1 N/A cpu (Dependency)
9992985 PD elvis myjob4 48 6:00:00 0:00 2023-06-05T05:51:06 regular_1 N/A cpu (Nodes required
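The two customizations mentioned above (filtering by state and overriding the output format) could look like the following sketch. The function names are hypothetical, and the format string is only one possible choice built from standard squeue format specifiers:

```shell
# Hypothetical helpers; sqs passes these flags through to squeue.

# Show only running jobs.
sqs_running() {
    sqs -t R
}

# Replace the default columns with a custom squeue format string:
# %i job ID, %j job name, %t compact state, %M time used, %D node count,
# %R node list (running jobs) or pending reason.
sqs_compact() {
    sqs -o "%.10i %.20j %.2t %.10M %.6D %R"
}
```

The number after each dot is a minimum field width, so widening a truncated column only requires changing that number.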
squeue¶
squeue provides information about jobs in the Slurm scheduling queue and is best
used for viewing jobs and job step information for active jobs (PENDING, RUNNING, SUSPENDED).
For more details on squeue, refer to the squeue manual
or run squeue --help or man squeue.
To view current user jobs:
The same output can be retrieved with the --me option, which is equivalent to --user=$USER.
To view all running jobs for the current user:
To view all pending jobs for current user:
To view all pending jobs in QOS shared:
To view all running jobs for current user in the shared QOS:
$ squeue --me -q shared -t RUNNING
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1000 shared netcdf_r user1 R 1:16:47 1 nid006504
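The other cases listed above (current user, running, pending, and pending in a given QOS) can be sketched with the same standard squeue flags. The function names below are hypothetical and exist only to label each variant:

```shell
# Hypothetical helper names; the flags are standard squeue options.

# Jobs of the current user (equivalent to: squeue --me).
my_jobs() {
    squeue -u "$USER"
}

# Running jobs of the current user.
my_running_jobs() {
    squeue --me -t RUNNING
}

# Pending jobs of the current user.
my_pending_jobs() {
    squeue --me -t PENDING
}

# Pending jobs of all users in a given QOS, for example "shared".
# Usage: pending_in_qos <qos>
pending_in_qos() {
    squeue -q "$1" -t PENDING
}
```

For example, pending_in_qos shared lists every pending job in the shared QOS.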
To view all jobs for a particular account (project), use -A <nersc_project>:
$ squeue -A <nersc_project>
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2000 regular_m tokio-ab admin1 PD 0:00 256 (Priority)
2001 regular_m mpi4py-i admin2 PD 0:00 150 (Priority)
2002 regular_m mpi4py-i admin3 PD 0:00 150 (Priority)
2003 regular_m preproce admin4 PD 0:00 1 (Priority)
To filter by job ID, use the -j option followed by the job ID. You can specify
multiple job IDs separated by commas.
$ squeue -j 2542,2560
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2542 shared_mi wrfpostp user1 PD 0:00 1 (Dependency)
2560 shared_mi netcdf_r user2 PD 0:00 1 (Resources)
To view a job step use the --steps option with the job step ID.
$ squeue --steps 1001.0
STEPID NAME PARTITION USER TIME NODELIST
1001.0 vasp_std regular_m elvis 5:19:26 nid004113
sacct¶
sacct is used to report job or job step accounting information about
active or completed jobs. You can directly invoke sacct without any arguments
and it will show jobs for the current user. sacct can be used for monitoring but
it is primarily used for Job Accounting.
For a complete list of sacct options please refer to the
sacct manual or run man sacct.
jobstats¶
Note
You must use Python 3.x in order to use jobstats; this can be done with
module load python.
jobstats provides Slurm
accounting and job details from sacct, sreport
and squeue. You can run jobstats without any arguments and it will show a report
for the current user from sreport for today. If you have any pending or running
jobs it will show that as well.
$ jobstats
User: XXXXXX
Default Account: YYYYY
User is part of the following slurm accounts ['YYYYY']
User Raw Share: 1
User Raw Usage: 0
Number of Pending Jobs: 0
Number of Running Jobs: 0
Total Jobs Completed: 0
Total Jobs Completed Successfully: 0
Total Jobs Failed: 0
Total Jobs Cancelled: 0
Total Jobs Timeout: 0
Today: 06/05/2023 12:13:37 sreport
--------------------------------------------------------------------------------
Top 10 Users 2020-06-14T00:00:00 - 2020-06-14T23:59:59 (86400 secs)
Usage reported in CPU Hours
--------------------------------------------------------------------------------
Cluster Login Proper Name Account Used Energy
--------- --------- --------------- --------------- ------------ -------------
Shown below is a list of options for the jobstats command.
$ jobstats --help
usage: jobstats [-h] [-u USER] [-S START] [-E END] [-j]
[--state {COMPLETED,FAILED,TIMEOUT,CANCELLED}] [-a]
slurm utility for display user job statistics, reporting, and account detail.
optional arguments:
-h, --help show this help message and exit
-u USER, --user USER Select a user
-S START, --start START
Start Date Format: YYYY-MM-DD
-E END, --end END End Date Format: YYYY-MM-DD
-j, --jobsummary Display job summary for user
--state {COMPLETED,FAILED,TIMEOUT,CANCELLED}
Filter by Job State
-a, --account Display information on account shares that user
belongs to
Developed by Shahzeb Siddiqui <shahzebmsiddiqui@gmail.com>
For more information see the jobstats documentation.
sstat¶
sstat is used to display various status information of a running job or job
step. For example, one may wish to see the maximum memory usage (resident set
size) of all tasks in a running job.
For a complete list of sstat options and examples, please see
the sstat manual.
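As a sketch of the memory-usage query described above, the following helper asks sstat for the peak resident set size of every step in a running job. The function name job_max_rss is hypothetical, and the chosen fields are just one reasonable selection from the sstat format fields:

```shell
# Hypothetical helper: peak resident set size for all steps of a running job.
# Usage: job_max_rss <jobid>
job_max_rss() {
    # MaxRSSNode and MaxRSSTask report where the peak was observed.
    sstat --allsteps -j "$1" --format=JobID,MaxRSS,MaxRSSNode,MaxRSSTask
}
```

sstat only reports on jobs that are currently running; for finished jobs, use sacct instead.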
Email notification¶
You can add directives to your job script to notify you when
your job starts, finishes, or fails. With the --mail-type
option, you can select one of begin, end, or fail
(respectively), or two or more in a comma-separated list.
Specify the email address to which the
notifications should go with the --mail-user option.
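A minimal sketch of a job script using these directives follows. The address user@domain.com is a placeholder, the payload line is illustrative only, and the other directives a real job needs (QOS, node count, time limit, and so on) are omitted:

```shell
#!/bin/bash
# Send mail when the job begins, ends, or fails.
#SBATCH --mail-type=begin,end,fail
# Placeholder address; replace it with your own.
#SBATCH --mail-user=user@domain.com

# Placeholder payload; replace with the real work of the job.
echo "job payload goes here"
```

To be notified about a single event only, list just that event, for example --mail-type=fail.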
How to log in to compute nodes running your jobs¶
It can be useful for troubleshooting or diagnostics to log in to compute nodes running one's job in order to observe activity on those nodes. Below is the series of steps required to log in to a compute node while one's job is running.
Access to compute nodes is enabled only while the job is running
A user's SSH access to compute nodes is enabled only during the lifetime of the job. When the job ends, the user's SSH connections to all compute nodes in the job will be disconnected.
1. Retrieve the list of nodes that your job is running on. This will print either a single host name nid***** or, if the job has more than one node, a range of host names in square brackets.
2. SSH into any nid***** node in the scontrol list generated in step 1.
Requesting the head-node ID
If you need only the head node (e.g., for DMTCP applications), use the BatchHost
field instead of NodeList.
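The node lookup in the steps above can be sketched as follows. The function names are hypothetical, and the parsing assumes that scontrol show job prints NodeList= and BatchHost= at the start of their own indented lines, which is worth checking against the output on your system:

```shell
# Hypothetical helpers that parse `scontrol show job <jobid>` output.

# Print the job's NodeList value: a single host or a bracketed range.
# The anchored pattern skips fields such as ReqNodeList and ExcNodeList.
job_nodelist() {
    scontrol show job "$1" | sed -n 's/^ *NodeList=//p'
}

# Print only the head node (the BatchHost field).
job_batchhost() {
    scontrol show job "$1" | sed -n 's/^ *BatchHost=//p'
}
```

A bracketed range can be expanded into one host name per line with scontrol show hostnames, and any of the resulting names can be passed to ssh while the job is running.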
Updating Jobs¶
Cancel jobs¶
To cancel a specific job:
You can also cancel more than one job in a single call to scancel:
To cancel all jobs owned by a user:
Warning
If you want to cancel several hundred jobs, do not perform this action as one bulk change; cancel jobs by subset instead.
Because scancel sends a remote procedure call to the Slurm daemon, a
degradation of service can result from many scancel calls happening
all at once. Therefore we recommend using as few individual calls to
this function as possible. In particular, do not wrap scancel in a
loop in a script or other function.
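The cancellation cases above can be sketched as follows. The function names are hypothetical; the point of the sketch is that each helper issues exactly one scancel call, however many job IDs it is given, rather than looping:

```shell
# Hypothetical helpers; each one issues exactly one scancel call.

# Cancel one or more jobs in a single call.
# Usage: cancel_jobs <jobid> [<jobid> ...]
cancel_jobs() {
    scancel "$@"
}

# Cancel all jobs owned by the current user.
cancel_all_my_jobs() {
    scancel -u "$USER"
}
```

For example, cancel_jobs 2542 2560 cancels both jobs with one request to the Slurm daemon.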
Change timelimit¶
Change QOS¶
Change account¶
Note
The new project must be eligible to run the job.
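The three updates above (time limit, QOS, and account) are commonly made with scontrol update. A sketch under that assumption, with hypothetical function names and with the job ID and new value supplied by the caller:

```shell
# Hypothetical helpers built on `scontrol update`.

# Usage: set_timelimit <jobid> <new_timelimit>   (e.g. 01:30:00)
set_timelimit() {
    scontrol update jobid="$1" timelimit="$2"
}

# Usage: set_qos <jobid> <new_qos>
set_qos() {
    scontrol update jobid="$1" qos="$2"
}

# Usage: set_account <jobid> <new_project>
set_account() {
    scontrol update jobid="$1" account="$2"
}
```

Whether a particular change is accepted depends on site policy and on the state of the job.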
Controlling Jobs¶
Prevent a pending job from being started:
Note
A held job will lose its accumulated wait time in the queue. Later, if this job is released, it will have the same priority as a newly submitted job.
Release a previously held job (one held with scontrol hold):
To requeue (cancel and rerun) a particular job:
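The hold, release, and requeue operations above map onto three scontrol subcommands. A minimal sketch, in which the function names are hypothetical:

```shell
# Hypothetical helpers; each takes a job ID as its only argument.

# Prevent a pending job from being started.
hold_job() {
    scontrol hold "$1"
}

# Release a job that was previously held.
release_job() {
    scontrol release "$1"
}

# Requeue (cancel and rerun) a job.
requeue_job() {
    scontrol requeue "$1"
}
```

Keep in mind the note above: a job that is held and later released is prioritized like a newly submitted job.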
Job Accounting¶
sacct example
$ sacct
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
10009775 sh regular_m proj1 256 FAILED 1:0
10009775.ex+ extern proj1 256 COMPLETED 0:0
10009775.0 bash proj1 1 FAILED 1:0
10009775.1 a.out proj1 256 COMPLETED 0:0
31171781 sh resv proj1 256 COMPLETED 0:0
31171781.ex+ extern proj1 256 COMPLETED 0:0
31171781.0 bash proj1 1 COMPLETED 0:0
31172253 sh resv proj1 256 TIMEOUT 0:0
31172253.ex+ extern proj1 256 COMPLETED 0:0
31172253.0 bash proj1 1 COMPLETED 0:0
You can choose which columns to display using the --format option. For example,
to show the User, JobName, State, and Submit columns:
sacct format example
$ sacct --format=User,JobName,State,Submit
User JobName State Submit
--------- ---------- ---------- -------------------
user1 sh FAILED 2023-05-27T07:49:18
extern COMPLETED 2023-05-27T07:49:18
bash FAILED 2023-05-27T07:49:41
a.out COMPLETED 2023-05-27T07:52:31
user1 sh COMPLETED 2023-05-27T08:28:34
extern COMPLETED 2023-05-27T08:28:34
bash COMPLETED 2023-05-27T08:28:42
user1 sh TIMEOUT 2023-05-27T08:51:43
extern COMPLETED 2023-05-27T08:51:43
bash COMPLETED 2023-05-27T08:51:52
We can retrieve historical data for any given user. For example, to list jobs
for user elvis between the start date 2023-05-20 and the end date 2023-05-27,
run the following:
sacct format example with Start and End Date
$ sacct -u elvis -S 2023-05-20 -E 2023-05-27
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
10009730 test_node+ system physics 4096 TIMEOUT 0:0
10009730.ba+ batch physics 64 CANCELLED 0:15
10009730.ex+ extern physics 4096 COMPLETED 0:0
10009730.0 test_node+ physics 128 FAILED 1:0
10009730.1 test_node+ physics 2048 FAILED 1:0
10009732 test_node+ system physics 512 PENDING 0:0
You can retrieve up to 31 days of job records within a given time window; this limit was implemented as a safety measure to prevent queries from bringing down the Slurm database. You will see the following error if you exceed the 31-day limit:
$ sacct --start 2023-01-04
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
sacct: error: slurmdbd: Too wide of a date range in query
$ date
Wed 07 Jun 2023 09:35:56 AM PDT
To query by job state, use the -s option (or the long option --state) plus
the abbreviated state code. For the complete list of job states and their
codes, see the JOB STATE CODES section
in the sacct manual. In the example below we query for all failed jobs. The
--start and --end options, which define the window of your query, are
required arguments.
sacct example with user, format fields and job states
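A query of this kind might look like the following sketch. The function name failed_jobs is hypothetical, the column selection is only one possible choice, and F is the abbreviated code for the FAILED state:

```shell
# Hypothetical helper: failed jobs for a user within a date window.
# Usage: failed_jobs <user> <start YYYY-MM-DD> <end YYYY-MM-DD>
failed_jobs() {
    sacct -u "$1" -S "$2" -E "$3" -s F \
        --format=User,JobID,JobName,State,ExitCode
}
```

For example, failed_jobs elvis 2023-05-20 2023-05-27 stays well inside the 31-day window described above.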
To filter output by JobID, you can specify the -j option with a list
of comma-separated job IDs.
sacct filter by jobs
$ sacct -j 9994271,9992980
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
9994271 sh regular_m+ proj2 256 FAILED 1:0
9994271.ex+ extern proj2 256 COMPLETED 0:0
9994271.0 bash proj2 1 FAILED 1:0
9994271.1 a.out proj2 256 COMPLETED 0:0
9992980 sh resv proj2 256 COMPLETED 0:0
9992980.ex+ extern proj2 256 COMPLETED 0:0
9992980.0 bash proj2 1 COMPLETED 0:0