Once your job has been submitted to PBS via the qsub command, multiple tools are available to monitor and control your job.

1 – Getting general job information

PBS native tools

The qstat command gives information about the current jobs queued or running on HPCaVe servers (more information with man qstat).

Without arguments, qstat will display the current jobs for all users:

user1@mesu2:~> qstat
Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----
293682.mesu2      run.sh           user1                0 Q f96c_b          
294059.mesu2      colmlge          user2         190:16:0 R f48c_a          
294060.mesu2      ibdmlge          user2         187:28:4 R f48c_a
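
On a busy cluster the full list can be long. The following is a minimal sketch, not an official MeSU tool, of a filter that keeps the two header lines and only the jobs of one user. The function name jobs_of_user is hypothetical, and the filter assumes the default column layout shown above, where the user is the third column. qstat also has a native -u option (see man qstat), which uses a different output layout.

```shell
# Hypothetical helper: reads default `qstat` output on stdin and prints
# the two header lines plus the job lines whose third column (User)
# equals the name given as first argument.
jobs_of_user() {
    awk -v user="$1" 'NR <= 2 || $3 == user'
}
```

It would typically be used as: qstat | jobs_of_user user2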

If you know your job identifier (printed by the qsub command, or found in the qstat output), you can get detailed information about that specific job with the qstat -f command (use qstat -fx if your job has already ended):

user1@mesu2:~> qstat -fx 294059.mesu2
Job Id: 294059.mesu2
    Job_Name = job
    Job_Owner = user1@mesu1.ib0.xa.dsi.upmc.fr
    resources_used.cpupercent = 949
    resources_used.cput = 190:24:04
    resources_used.mem = 147434480kb
    resources_used.ncpus = 24
    resources_used.vmem = 164753884kb
    resources_used.walltime = 47:13:50
    job_state = R
    queue = f48c_a
    server = mesu2
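
When only one attribute is of interest, the "name = value" layout above is easy to filter. The sketch below assumes that layout; the function name qstat_attr is hypothetical. It does not rejoin long values that qstat -f wraps over several lines, so it is best suited to short attributes such as job_state or resources_used.walltime.

```shell
# Hypothetical helper: reads `qstat -f` output on stdin and prints the
# value of the attribute named by the first argument.
# Limitation: values wrapped onto continuation lines are not rejoined.
qstat_attr() {
    awk -v key="$1" '
        {
            line = $0
            sub(/^[ \t]+/, "", line)        # drop leading indentation
            sep = index(line, " = ")        # position of the separator
            if (sep > 0 && substr(line, 1, sep - 1) == key)
                print substr(line, sep + 3) # text after " = "
        }'
}
```

It would typically be used as: qstat -f 294059.mesu2 | qstat_attr job_state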

If you want to display detailed status of job arrays, use qstat -tr.

MeSU-specific tools

In addition to qstat, MeSU offers further tools to help you monitor your jobs.
Their output is similar to that of the tools provided by Slurm, a resource management system like PBS that is available on many supercomputers.

The qqueue tool displays a summary of the jobs currently running or pending on MeSU:

user1@mesu2:~> qqueue
JOBID      PARTITION  NAME      USER   ST    TIME     NODES  CPUS  NODELIST(REASON)
150981     alpha     V2E       user1    R   19:55:09     1     128  mesu-uv
151036     alpha     IK        user2    R    1:16:28     1     128  mesu-uv
151038     alpha     n600K     user3    R    1:15:06     1     128  mesu-uv
150794     beta      m0p50     user4    R    1:17:00     8     192  r1i0n[0-7]
150850     beta      GA8_v     user1    R  2-12:10:59    4     96   r1i0n[22-25]
151044     beta      V3E       user1    PD               4     96  (Resources)

The qinfo tool displays a summary of the status of MeSU resources:

user1@mesu2:~> qinfo
PARTITION  STATE     NODES  NODELIST
beta       alloc       59   r1i0n[0-7,13-14,18-21,26-35],r1i3n[0-17,35],r1i2n[18-20,27-35],r1i1n[13-16]
alpha      mixed       1    mesu-uv
gamma      mixed       1    mesu3
beta       idle        85   r1i1n[0-12,17-35],r1i2n[0-17,21-26],r1i3n[18-34],r1i0n[8-12,15-17,22-25]
gamma      idle        1    mesu4
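
To get a quick total of nodes per state across all partitions, the qinfo table can be summed with awk. This is only a sketch that assumes the four-column layout shown above (PARTITION, STATE, NODES, NODELIST); the function name nodes_by_state is hypothetical.

```shell
# Hypothetical helper: reads `qinfo` output on stdin, skips the header
# line, and prints the total number of nodes for each state.
nodes_by_state() {
    awk 'NR > 1 && NF >= 3 { count[$2] += $3 }
         END { for (s in count) print s, count[s] }' | sort
}
```

It would typically be used as: qinfo | nodes_by_state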

2 – Deleting or stopping a job

At any time, you can delete a queued job or stop a running job with the qdel command:

user1@mesu2:~> qdel 298109.mesu2

If your job is running, PBS will take care of killing its processes.

3 – Connecting to a computing node for more details

The tracejob command prints the log messages recorded for a job:

user1@mesu2:~> tracejob 298109.mesu2
Job: 298109.mesu2
06/08/2018 12:58:41  S    dequeuing from alpha, state 1
06/08/2018 12:58:41  S    enqueuing into f32c_a, state 1 hop 1
06/08/2018 12:58:41  S    Job Queued at request of user1@mesu1.ib0.xa.dsi.upmc.fr, owner =
                          user1@mesu1.ib0.xa.dsi.upmc.fr, job name = tripfft, queue = f32c_a
06/08/2018 12:58:41  A    queue=f32c_a
06/08/2018 13:21:05  L    Node is in an ineligible state: job-busy
...
06/08/2018 13:25:21  S    Job Run at request of Scheduler@mesu2.ib0.xa.dsi.upmc.fr on exec_vnode
                          (mesu-uv[87]:ncpus=8+mesu-uv[91]:ncpus=8)+(mesu-uv[92]:ncpus=8+mesu-uv[93]:ncpus=8)
06/08/2018 13:25:21  L    Job run

Once your job is running (status R in qstat), you can connect with ssh to one of the nodes it uses, for instance to inspect the state of the node or to interact with your job while it runs. Identify the nodes your job has been dispatched to with tracejob or qstat -f, then run one of the following commands, depending on the server involved:

# To connect to node 72 of MeSU-alpha:
ssh mesu-uv[72]

# To connect to node 17 of MeSU-beta:
ssh r1i0n17
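
The node names can be pulled out of the exec_vnode string reported by tracejob or qstat -f. The sketch below assumes the format seen in the tracejob output above, that is, chunks joined by "+" with each vnode name followed by ":ncpus=..."; the function name vnodes_of is hypothetical.

```shell
# Hypothetical helper: takes an exec_vnode string as first argument and
# prints each distinct vnode name on its own line.
vnodes_of() {
    printf '%s\n' "$1" |
        tr '+' '\n' |                      # one vnode entry per line
        sed -e 's/^(//' -e 's/:.*$//' |    # strip "(" and ":ncpus=..." suffix
        sort -u
}
```

Quote the string when calling the function, since it contains parentheses and brackets: vnodes_of '(mesu-uv[87]:ncpus=8+mesu-uv[91]:ncpus=8)'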

Once connected to a node, you can run the same standard system monitoring tools (top, pstree, watch, etc.) that you would use on a desktop computer.