2.6. Command Line Tools for Monitoring and Debugging

Pegasus also comes with a set of command line tools that users can use to monitor and debug their workflows (basic invocations are sketched after the list below).

  • pegasus-status : monitor the status of the workflow

  • pegasus-analyzer : debug a failed workflow

  • pegasus-statistics : generate statistics from a workflow run
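
All three tools take the workflow's submit directory as their argument. As a quick orientation (each command is covered in detail in the subsections below), the invocations for the tutorial workflow look like this:

$ pegasus-status -l submit/tutorial/pegasus/split/run0001
$ pegasus-analyzer submit/tutorial/pegasus/split/run0001
$ pegasus-statistics -s all submit/tutorial/pegasus/split/run0001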

2.6.1. pegasus-status - monitoring the workflow

After the workflow has been submitted, you can monitor it using the pegasus-status command:

$ pegasus-status -l submit/tutorial/pegasus/split/run0001
STAT  IN_STATE  JOB
Run      00:39  split-0 ( /home/tutorial/split/submit/tutorial/pegasus/split/run0001 )
Idle     00:03   ┗━split_ID0000001
Summary: 2 Condor jobs total (I:1 R:1)

UNRDY READY   PRE  IN_Q  POST  DONE  FAIL %DONE STATE   DAGNAME
   14     0     0     1     0     2     0  11.8 Running *split-0.dag     

This command shows the state of the workflow and its jobs. The first table lists what is currently in the Condor queue: the DAGMan job that manages the workflow (split-0) and, indented beneath it, the workflow's own jobs (in the output above, the split job is idle). The second table gives the number of jobs in each state and the percentage of the jobs in the workflow that have finished successfully.

Use the watch option to continuously monitor the workflow:

$ pegasus-status -w submit/tutorial/pegasus/split/run0001
...

You should see all of the jobs in the workflow run one after the other. After a few minutes you will see:

(no matching jobs found in Condor Q)
UNRDY READY   PRE  IN_Q  POST  DONE  FAIL %DONE STATE   DAGNAME
    0     0     0     0     0    15     0 100.0 Success *split-0.dag      

That means the workflow has finished successfully.

If the workflow finished successfully, you should see the output count files in the output directory:

$ ls output/
count.txt.a  count.txt.b  count.txt.c  count.txt.d
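
Each count file should contain the output of the wc job for one piece of the split input. To spot-check a result, you can simply print one of the files (what the counts look like depends on your input, so no sample output is shown here):

$ cat output/count.txt.a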

2.6.2. pegasus-analyzer - debugging a failed workflow

If one or more jobs fail, the output of the pegasus-status command shown above will have a non-zero value in the FAIL column.

You can debug the failure using the pegasus-analyzer command, which identifies the jobs that failed and shows their output. Because run0001 succeeded, pegasus-analyzer only reports some basic statistics about the number of successful jobs:

$ pegasus-analyzer submit/tutorial/pegasus/split/run0001
pegasus-analyzer: initializing...

****************************Summary***************************

 Total jobs         :      15 (100.00%)
 # jobs succeeded   :      15 (100.00%)
 # jobs failed      :      0 (0.00%)
 # jobs unsubmitted :      0 (0.00%)

If the workflow had failed, you would see something like this:

$ pegasus-analyzer submit/tutorial/pegasus/split/run0002

************************************Summary*************************************

 Submit Directory   : submit/tutorial/pegasus/split/run0002
 Total jobs         :     17 (100.00%)
 # jobs succeeded   :      1 (5.88%)
 # jobs failed      :      1 (5.88%)
 # jobs unsubmitted :     15 (88.24%)

******************************Failed jobs' details******************************

==========================stage_in_local_PegasusVM_0_0==========================

 last state: POST_SCRIPT_FAILED
       site: local
submit file: stage_in_local_PegasusVM_0_0.sub
output file: stage_in_local_PegasusVM_0_0.out.001
 error file: stage_in_local_PegasusVM_0_0.err.001

-------------------------------Task #1 - Summary--------------------------------

site        : local
hostname    : unknown
executable  : /usr/bin/pegasus-transfer
arguments   :   --threads   2
exitcode    : 1
working dir : /home/tutorial/split/submit/tutorial/pegasus/split/run0002

------------------Task #1 - pegasus::transfer - None - stdout-------------------

2015-10-22 21:13:50,970    INFO:  Reading URL pairs from stdin
2015-10-22 21:13:50,970    INFO:  PATH=/usr/bin:/bin
2015-10-22 21:13:50,970    INFO:  LD_LIBRARY_PATH=
2015-10-22 21:13:50,972    INFO:  1 transfers loaded
2015-10-22 21:13:50,972    INFO:  Sorting the transfers based on transfer type and source/destination
2015-10-22 21:13:50,972    INFO:  --------------------------------------------------------------------------------
2015-10-22 21:13:50,972    INFO:  Starting transfers - attempt 1
2015-10-22 21:13:50,972    INFO:  Using 1 threads for this round of transfers
2015-10-22 21:13:53,845   ERROR:  Command exited with non-zero exit code (1): /usr/bin/scp -r -B -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -i /home/tutorial/.ssh/id_rsa -P 22 '/home/tutorial/examples/split/input/pegasus.html' 'tutorial@127.0.0.1:/home/tutorial/work/tutorial/pegasus/split/run0002/pegasus.html'
2015-10-22 21:15:55,911    INFO:  --------------------------------------------------------------------------------
2015-10-22 21:15:55,912    INFO:  Starting transfers - attempt 2
2015-10-22 21:15:55,912    INFO:  Using 1 threads for this round of transfers
2015-10-22 21:15:58,446   ERROR:  Command exited with non-zero exit code (1): /usr/bin/scp -r -B -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -i /home/tutorial/.ssh/id_rsa -P 22 '/home/tutorial/examples/split/input/pegasus.html' 'tutorial@127.0.0.1:/home/tutorial/work/tutorial/pegasus/split/run0002/pegasus.html'
2015-10-22 21:16:40,468    INFO:  --------------------------------------------------------------------------------
2015-10-22 21:16:40,469    INFO:  Starting transfers - attempt 3
2015-10-22 21:16:40,469    INFO:  Using 1 threads for this round of transfers
2015-10-22 21:16:43,168   ERROR:  Command exited with non-zero exit code (1): /usr/bin/scp -r -B -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -i /home/tutorial/.ssh/id_rsa -P 22 '/home/tutorial/examples/split/input/pegasus.html' 'tutorial@127.0.0.1:/home/tutorial/work/tutorial/pegasus/split/run0002/pegasus.html'
2015-10-22 21:16:43,173    INFO:  --------------------------------------------------------------------------------
2015-10-22 21:16:43,173    INFO:  Stats: no local files in the transfer set
2015-10-22 21:16:43,173 CRITICAL:  Some transfers failed! See above, and possibly stderr.


-------------Task #1 - pegasus::transfer - None - Kickstart stderr--------------

Warning: Permanently added '127.0.0.1' (RSA) to the list of known hosts.
/home/tutorial/split/input/pegasus.html: No such file or directory
..
/home/tutorial/split/input/pegasus.html: No such file or directory

In this example, we removed one of the input files; we will cover this scenario in more detail in the recovery section. The output of pegasus-analyzer indicates that the pegasus.html input file could not be found.
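
As a minimal sketch of the fix (the backup path below is hypothetical; recovery is covered in detail in the recovery section), you would restore the missing input file and then resubmit the workflow from its submit directory with pegasus-run, which continues from the rescue DAG instead of rerunning the jobs that already completed:

$ cp ~/backup/pegasus.html /home/tutorial/examples/split/input/pegasus.html
$ pegasus-run submit/tutorial/pegasus/split/run0002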

2.6.3. pegasus-statistics - collecting statistics about a workflow run

The pegasus-statistics command can be used to gather statistics about the runtime of the workflow and its jobs. The -s all argument tells the program to generate all statistics it knows how to calculate:

$ pegasus-statistics -s all submit/tutorial/pegasus/split/run0001

#
# Pegasus Workflow Management System - http://pegasus.isi.edu
#
# Workflow summary:
#   Summary of the workflow execution. It shows total
#   tasks/jobs/sub workflows run, how many succeeded/failed etc.
#   In case of hierarchical workflow the calculation shows the
#   statistics across all the sub workflows. It shows the following
#   statistics about tasks, jobs and sub workflows.
#     * Succeeded - total count of succeeded tasks/jobs/sub workflows.
#     * Failed - total count of failed tasks/jobs/sub workflows.
#     * Incomplete - total count of tasks/jobs/sub workflows that are
#       not in succeeded or failed state. This includes all the jobs
#       that are not submitted, submitted but not completed etc. This
#       is calculated as the difference between the 'total' count and
#       the sum of the 'succeeded' and 'failed' counts.
#     * Total - total count of tasks/jobs/sub workflows.
#     * Retries - total retry count of tasks/jobs/sub workflows.
#     * Total+Retries - total count of tasks/jobs/sub workflows executed
#       during workflow run. This is the cumulative of retries,
#       succeeded and failed count.
# Workflow wall time:
#   The wall time from the start of the workflow execution to the end as
#   reported by DAGMan. In case of a rescue DAG, the value is the
#   cumulative of all retries.
# Workflow cumulative job wall time:
#   The sum of the wall time of all jobs as reported by kickstart.
#   In case of job retries the value is the cumulative of all retries.
#   For workflows having sub workflow jobs (i.e SUBDAG and SUBDAX jobs),
#   the wall time value includes jobs from the sub workflows as well.
# Cumulative job wall time as seen from submit side:
#   The sum of the wall time of all jobs as reported by DAGMan.
#   This is similar to the regular cumulative job wall time, but includes
#   job management overhead and delays. In case of job retries the value
#   is the cumulative of all retries. For workflows having sub workflow
#   jobs (i.e SUBDAG and SUBDAX jobs), the wall time value includes jobs
#   from the sub workflows as well.
# Workflow cumulative job badput wall time:
#   The sum of the wall time of all failed jobs as reported by kickstart.
#   In case of job retries the value is the cumulative of all retries.
#   For workflows having sub workflow jobs (i.e SUBDAG and SUBDAX jobs),
#   the wall time value includes jobs from the sub workflows as well.
# Cumulative job badput wall time as seen from submit side:
#   The sum of the wall time of all failed jobs as reported by DAGMan.
#   This is similar to the regular cumulative job badput wall time, but includes
#   job management overhead and delays. In case of job retries the value
#   is the cumulative of all retries. For workflows having sub workflow
#   jobs (i.e SUBDAG and SUBDAX jobs), the wall time value includes jobs
#   from the sub workflows as well.

------------------------------------------------------------------------------
Type           Succeeded Failed  Incomplete  Total     Retries   Total+Retries
Tasks          5         0       0           5         0         5
Jobs           15        0       0           15        0         15
Sub-Workflows  0         0       0           0         0         0
------------------------------------------------------------------------------

Workflow wall time                                       : 2 mins, 6 secs
Workflow cumulative job wall time                        : 38 secs
Cumulative job wall time as seen from submit side        : 42 secs
Workflow cumulative job badput wall time                 :
Cumulative job badput wall time as seen from submit side :

Summary                       : submit/tutorial/pegasus/split/run0001/statistics/summary.txt
Workflow execution statistics : submit/tutorial/pegasus/split/run0001/statistics/workflow.txt
Job instance statistics       : submit/tutorial/pegasus/split/run0001/statistics/jobs.txt
Transformation statistics     : submit/tutorial/pegasus/split/run0001/statistics/breakdown.txt
Time statistics               : submit/tutorial/pegasus/split/run0001/statistics/time.txt


The output of pegasus-statistics begins with definitions of each value it reports. Among these are the workflow wall time, which is the time from when the workflow was submitted until it finished, and the workflow cumulative job wall time, which is the sum of the runtimes of all of the individual jobs.
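
To make the distinction concrete with the numbers from the run above: the jobs themselves consumed 38 secs in total as measured by kickstart on the execution side, and 42 secs as measured by DAGMan on the submit side (the extra 4 secs being job-management overhead), yet the workflow took 2 mins, 6 secs end to end, with the remainder spent mostly in scheduling delays between the short jobs. For a workflow whose jobs run in parallel, the relationship can flip: ten 30-second jobs running concurrently would accumulate 300 secs of cumulative job wall time in roughly 30 secs of workflow wall time.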

The pegasus-statistics command also writes out several reports in the statistics subdirectory of the workflow submit directory:

$ ls submit/tutorial/pegasus/split/run0001/statistics/
breakdown.txt          jobs.txt          summary.txt          time.txt          workflow.txt

The file breakdown.txt, for example, has min, max, and mean runtimes for each transformation:

$ more submit/tutorial/pegasus/split/run0001/statistics/breakdown.txt
# legends
# Transformation - name of the transformation.
# Count          - the number of times the invocations corresponding to
#                  the transformation was executed.
# Succeeded      - the count of the succeeded invocations corresponding
#                  to the transformation.
# Failed         - the count of the failed invocations corresponding to
#                  the transformation.
# Min(sec)       - the minimum invocation runtime value corresponding to
#                  the transformation.
# Max(sec)       - the maximum invocation runtime value corresponding to
#                  the transformation.
# Mean(sec)      - the mean of the invocation runtime corresponding to
#                  the transformation.
# Total(sec)     - the cumulative of invocation runtime corresponding to
#                  the transformation.

# 773d8fa3-8bff-4f75-8e2b-38e2c904f803 (split)
Transformation           Count     Succeeded Failed  Min       Max       Mean      Total
dagman::post             15        15        0       5.0       6.0       5.412     92.0
pegasus::cleanup         6         6         0       1.474     3.178     2.001     12.008
pegasus::dirmanager      1         1         0       2.405     2.405     2.405     2.405
pegasus::rc-client       2         2         0       2.382     7.406     4.894     9.788
pegasus::transfer        3         3         0       3.951     5.21      4.786     14.358
split                    1         1         0       0.009     0.009     0.009     0.009
wc                       4         4         0       0.005     0.029     0.012     0.047


# All (All)
Transformation           Count     Succeeded  Failed  Min       Max       Mean      Total
dagman::post             15        15         0       5.0       6.0       5.412     92.0
pegasus::cleanup         6         6          0       1.474     3.178     2.001     12.008
pegasus::dirmanager      1         1          0       2.405     2.405     2.405     2.405
pegasus::rc-client       2         2          0       2.382     7.406     4.894     9.788
pegasus::transfer        3         3          0       3.951     5.21      4.786     14.358
split                    1         1          0       0.009     0.009     0.009     0.009
wc                       4         4          0       0.005     0.029     0.012     0.047

In this case, the user-level transformations (split and wc) all finish in small fractions of a second, so their min, max, and mean runtimes are close to zero; most of the time reported in the table is spent in auxiliary jobs such as the post scripts (dagman::post) and transfers (pegasus::transfer).
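
Because these reports are plain whitespace-delimited text, they are easy to post-process with standard tools. As a minimal sketch (assuming the column layout shown above, where Mean is the seventh column), the following pulls the mean runtime of the wc transformation out of breakdown.txt; the exit keeps awk from also matching the duplicate row in the "All" section:

$ awk '$1 == "wc" { print "wc mean runtime:", $7; exit }' \
    submit/tutorial/pegasus/split/run0001/statistics/breakdown.txt
wc mean runtime: 0.012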