Chapter 6. Monitoring, Debugging and Statistics

Pegasus comes bundled with useful tools that help users debug workflows and generate useful statistics and plots about their workflow runs. Most of the tools query a runtime workflow database ( usually a sqllite in the workflow submit directory ) populated at runtime by pegasus-monitord. With the exception of pegasus-monitord (see below), all tools take in the submit directory as an argument. Users can invoke the tools listed in this chapter as follows:

$ pegasus-[toolname]   <path to the submit directory>

6.1. Workflow Status

As the number of jobs and tasks in workflows increase, the ability to track the progress and quickly debug a workflow becomes more and more important. Pegasus comes with a series of utilities that can be used to monitor and debug workflows both in real-time as well as after execution is already completed.

6.1.1. pegasus-status

To monitor the execution of the workflow run the pegasus-status command as suggested by the output of the pegasus-run command. pegasus-status shows the current status of the Condor Q as pertaining to the master workflow from the workflow directory you are pointing it to. In a second section, it will show a summary of the state of all jobs in the workflow and all of its sub-workflows.

The details of pegasus-status are described in its respective manual page. There are many options to help you gather the most out of this tool, including a watch-mode to repeatedly draw information, various modes to add more information, and legends if you are new to it, or need to present it.

$ pegasus-status /Workflow/dags/directory
STAT  IN_STATE  JOB
Run      05:08  level-3-0
Run      04:32   |-sleep_ID000005
Run      04:27   \_subdax_level-2_ID000004
Run      03:51      |-sleep_ID000003
Run      03:46      \_subdax_level-1_ID000002
Run      03:10         \_sleep_ID000001
Summary: 6 Condor jobs total (R:6)

UNREADY   READY     PRE  QUEUED    POST SUCCESS FAILURE %DONE
      0       0       0       6       0       3       0  33.3
Summary: 3 DAGs total (Running:3)

Without the -l option, the only a summary of the workflow statistics is shown under the current queue status. However, with the -l option, it will show each sub-workflow separately:

$ pegasus-status -l /Workflow/dags/directory
STAT  IN_STATE  JOB
Run      07:01  level-3-0
Run      06:25   |-sleep_ID000005
Run      06:20   \_subdax_level-2_ID000004
Run      05:44      |-sleep_ID000003
Run      05:39      \_subdax_level-1_ID000002
Run      05:03         \_sleep_ID000001
Summary: 6 Condor jobs total (R:6)

UNRDY READY   PRE  IN_Q  POST  DONE  FAIL %DONE STATE   DAGNAME
    0     0     0     1     0     1     0  50.0 Running level-2_ID000004/level-1_ID000002/level-1-0.dag
    0     0     0     2     0     1     0  33.3 Running level-2_ID000004/level-2-0.dag
    0     0     0     3     0     1     0  25.0 Running *level-3-0.dag
    0     0     0     6     0     3     0  33.3         TOTALS (9 jobs)
Summary: 3 DAGs total (Running:3)

The following output shows a successful workflow of workflow summary after it has finished.

$ pegasus-status work/2011080514
(no matching jobs found in Condor Q)
UNREADY   READY     PRE  QUEUED    POST SUCCESS FAILURE %DONE
      0       0       0       0       0   7,137       0 100.0
Summary: 44 DAGs total (Success:44)

Warning

For large workflows with many jobs, please note that pegasus-status will take time to compile state from all workflow files. This typically affects the initial run, and sub-sequent runs are faster due to the file system's buffer cache. However, on a low-RAM machine, thrashing is a possibility.

The following output show a failed workflow after no more jobs from it exist. Please note how no active jobs are shown, and the failure status of the total workflow.

$ pegasus-status work/submit
(no matching jobs found in Condor Q)
UNREADY   READY     PRE  QUEUED    POST SUCCESS FAILURE %DONE
     20       0       0       0       0       0       2   0.0
Summary: 1 DAG total (Failure:1)

6.1.2. pegasus-analyzer

Pegasus-analyzer is a command-line utility for parsing several files in the workflow directory and summarizing useful information to the user. It should be used after the workflow has already finished execution. pegasus-analyzer quickly goes through the jobstate.log file, and isolates jobs that did not complete successfully. It then parses their submit, and kickstart output files, printing to the user detailed information for helping the user debug what happened to his/her workflow.

The simplest way to invoke pegasus-analyzer is to simply give it a workflow run directory, like in the example below:

$ pegasus-analyzer  /home/user/run0004
pegasus-analyzer: initializing...

************************************Summary*************************************

 Total jobs         :     26 (100.00%)
 # jobs succeeded   :     25 (96.15%)
 # jobs failed      :      1 (3.84%)
 # jobs held        :      1 (3.84%)
 # jobs unsubmitted :      0 (0.00%)

*******************************Held jobs' details*******************************
      
================================sleep_ID0000001=================================
       
       submit file            : sleep_ID0000001.sub
       last_job_instance_id   : 7
       reason                 :  Error from slot1@corbusier.isi.edu:
                                 STARTER at 128.9.64.188 failed to
                                 send file(s) to 
                                 <128.9.64.188:62639>: error reading from
                                 /opt/condor/8.4.8/local.corbusier/execute/dir_76205/f.out:
                                 (errno 2) No such file or directory;
                                SHADOW failed to receive file(s) from <128.9.64.188:62653> 

******************************Failed jobs' details******************************

============================register_viz_glidein_7_0============================

 last state: POST_SCRIPT_FAILURE
       site: local
submit file: /home/user/run0004/register_viz_glidein_7_0.sub
output file: /home/user/run0004/register_viz_glidein_7_0.out.002
 error file: /home/user/run0004/register_viz_glidein_7_0.err.002

-------------------------------Task #1 - Summary--------------------------------

site        : local
executable  : /lfs1/software/install/pegasus/default/bin/rc-client
arguments   : -Dpegasus.user.properties=/lfs1/work/pegasus/run0004/pegasus.15181.properties \
-Dpegasus.catalog.replica.url=rlsn://smarty.isi.edu --insert register_viz_glidein_7_0.in
exitcode    : 1
working dir : /lfs1/work/pegasus/run0004

---------Task #1 - pegasus::rc-client - pegasus::rc-client:1.0 - stdout---------

2009-02-20 16:25:13.467 ERROR [root] You need to specify the pegasus.catalog.replica property
2009-02-20 16:25:13.468 WARN  [root] non-zero exit-code 1

In the case above, pegasus-analyzer's output contains a brief summary section, showing how many jobs have succeeded and how many have failed. If there are any held jobs, pegasus-analyzer will report the name of the job that was held, and the reason why , as determined from the dagman.out file for the workflow. The last_job_instance_id is the database id for the job in the job instance table of the monitoring database. After that, pegasus-analyzer will print information about each job that failed, showing its last known state, along with the location of its submit, output, and error files. pegasus-analyzer will also display any stdout and stderr from the job, as recorded in its kickstart record. Please consult pegasus-analyzer's man page for more examples and a detailed description of its various command-line options.

Note

Starting with 4.0 release, by default pegasus analyzer queries the database to debug the workflow. If you want it to use files in the submit directory , use the --files option.

6.1.3. pegasus-remove

If you want to abort your workflow for any reason you can use the pegasus-remove command listed in the output of pegasus-run invocation or by specifying the Dag directory for the workflow you want to terminate.

$ pegasus-remove /PATH/To/WORKFLOW DIRECTORY

6.1.4. Resubmitting failed workflows

Pegasus will remove the DAGMan and all the jobs related to the DAGMan from the condor queue. A rescue DAG will be generated in case you want to resubmit the same workflow and continue execution from where it last stopped. A rescue DAG only skips jobs that have completely finished. It does not continue a partially running job unless the executable supports checkpointing.

To resubmit an aborted or failed workflow with the same submit files and rescue Dag just rerun the pegasus-run command

$ pegasus-run /Path/To/Workflow/Directory