Pegasus comes bundled with useful tools that help users debug workflows and generate useful statistics and plots about their workflow runs. These tools internally parse the Condor log files and have a similar interface. With the exception of pegasus-monitord (see below), all tools take in the submit directory as an argument. Users can invoke the tools listed in this chapter as follows:
$ pegasus-[toolname] <path to the submit directory>
As the number of jobs and tasks in workflows increases, the ability to track progress and quickly debug a workflow becomes more and more important. Pegasus comes with a series of utilities that can be used to monitor and debug workflows, both in real time and after execution has completed.
Pegasus-monitord is used to follow workflows, parsing the output of DAGMan's dagman.out file. In addition to generating the jobstate.log file, which contains the various states that a job goes through during the workflow execution, pegasus-monitord can also be used to mine information from jobs' submit and output files, and either populate a database, or write a file with NetLogger events containing this information. Pegasus-monitord can also send notifications to users in real-time as it parses the workflow execution logs.
Pegasus-monitord is automatically invoked by pegasus-run, and tracks workflows in real-time. By default, it produces the jobstate.log file and a SQLite database, which contains all the information listed in the Stampede schema. When a workflow fails and is re-submitted with a rescue DAG, pegasus-monitord will automatically pick up from where it left off and continue to write the jobstate.log file and populate the database.
If, after the workflow has already finished, users need to re-create the jobstate.log file, or re-populate the database from scratch, pegasus-monitord's --replay option should be used when running it manually.
In addition to SQLite, pegasus-monitord supports other types of databases, such as MySQL and Postgres. Users will need to install the low-level database drivers, and can use the --dest command-line option, or the pegasus.monitord.output property to select where the logs should go.
As an example, the command:
$ pegasus-monitord -r diamond-0.dag.dagman.out
will launch pegasus-monitord in replay mode. In this case, if a jobstate.log file already exists, it will be rotated and a new file will be created. It will also create/use a SQLite database in the workflow's run directory, with the name of diamond-0.stampede.db. If the database already exists, it will make sure to remove any references to the current workflow before it populates the database. In this case, pegasus-monitord will process the workflow information from start to finish, including any restarts that may have happened.
Users can specify an alternative database for the events, as illustrated by the following examples:
$ pegasus-monitord -r -d mysql://username:userpass@hostname/database_name diamond-0.dag.dagman.out
$ pegasus-monitord -r -d sqlite:////tmp/diamond-0.db diamond-0.dag.dagman.out
In the first example, pegasus-monitord will send the data to the database_name database located at server hostname, using the username and userpass provided. In the second example, pegasus-monitord will store the data in the /tmp/diamond-0.db SQLite database.
Note
For absolute paths, four slashes are required when specifying an alternative database path in SQLite.
Users should also be aware that in all cases, with the exception of SQLite, the database should exist before pegasus-monitord is run (as it creates all needed tables but does not create the database itself).
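The four-slash rule in the note above can be read off the URL layout: the scheme prefix already ends in three slashes (an empty host followed by the path separator), and an absolute path adds one more. The following is a minimal Python sketch of that assembly. The helper name sqlite_url is hypothetical and is not part of Pegasus.

```python
def sqlite_url(path: str) -> str:
    """Build a SQLite connection URL of the form passed to --dest.

    Hypothetical helper, for illustration only. The prefix "sqlite:///"
    ends in three slashes, so an absolute path (which itself starts
    with "/") produces four slashes in total.
    """
    return "sqlite:///" + path


print(sqlite_url("/tmp/diamond-0.db"))  # absolute path
print(sqlite_url("diamond-0.db"))       # relative path
```

With the absolute path used in the example above, the helper reproduces the sqlite:////tmp/diamond-0.db form.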
Finally, the following example:
$ pegasus-monitord -r --dest diamond-0.bp diamond-0.dag.dagman.out
sends events to the diamond-0.bp file. (Please note that in replay mode, any data in the file will be overwritten.)
One important detail is that while processing a workflow, pegasus-monitord will automatically detect if/when sub-workflows are initiated, and will automatically track those sub-workflows as well. In this case, although pegasus-monitord will create a separate jobstate.log file in each workflow directory, the database at the top-level workflow will contain the information from not only the main workflow, but also from all sub-workflows.
Pegasus-monitord generates a number of files in each workflow directory:
jobstate.log: contains a summary of workflow and job execution.
monitord.log: contains any log messages generated by pegasus-monitord. It is not overwritten when it restarts. This file is not generated in replay mode, as all log messages from pegasus-monitord are output to the console. Also, when sub-workflows are involved, only the top-level workflow will have this log file.
monitord.started: contains a timestamp indicating when pegasus-monitord was started. This file gets overwritten every time pegasus-monitord starts.
monitord.done: contains a timestamp indicating when pegasus-monitord finished. This file is overwritten every time pegasus-monitord starts.
monitord.info: contains pegasus-monitord state information, which allows it to resume processing if a workflow does not finish properly and a rescue DAG is submitted. This file is erased when pegasus-monitord is executed in replay mode.
monitord.recover: contains pegasus-monitord state information that allows it to detect that a previous instance of pegasus-monitord failed (or was killed) midway through parsing a workflow's execution logs. This file is only present while pegasus-monitord is running, as it is deleted when it ends and the monitord.info file is generated.
monitord.subwf.db: contains information that aids pegasus-monitord to track when sub-workflows fail and are re-planned/re-tried. It is overwritten when pegasus-monitord is started in replay mode.
monitord-notifications.log: contains the log file for notification-related messages. Normally, this file only includes logs for failed notifications, but can be populated with all notification information when pegasus-monitord is run in verbose mode via the -v command-line option.
To monitor the execution of the workflow, run the pegasus-status command as suggested by the output of the pegasus-run command. pegasus-status shows the current status of the Condor Q as pertaining to the master workflow from the workflow directory you are pointing it to. In a second section, it will show a summary of the state of all jobs in the workflow and all of its sub-workflows.
The details of pegasus-status are described in its respective manual page. There are many options to help you gather the most out of this tool, including a watch-mode to repeatedly draw information, various modes to add more information, and legends if you are new to it, or need to present it.
$ pegasus-status /Workflow/dags/directory
STAT IN_STATE JOB
Run 05:08 level-3-0
Run 04:32 |-sleep_ID000005
Run 04:27 \_subdax_level-2_ID000004
Run 03:51 |-sleep_ID000003
Run 03:46 \_subdax_level-1_ID000002
Run 03:10 \_sleep_ID000001
Summary: 6 Condor jobs total (R:6)
UNREADY READY PRE QUEUED POST SUCCESS FAILURE %DONE
0 0 0 6 0 3 0 33.3
Summary: 3 DAGs total (Running:3)
Without the -l option, only a summary of the workflow statistics is shown under the current queue status. However, with the -l option, it will show each sub-workflow separately:
$ pegasus-status -l /Workflow/dags/directory
STAT IN_STATE JOB
Run 07:01 level-3-0
Run 06:25 |-sleep_ID000005
Run 06:20 \_subdax_level-2_ID000004
Run 05:44 |-sleep_ID000003
Run 05:39 \_subdax_level-1_ID000002
Run 05:03 \_sleep_ID000001
Summary: 6 Condor jobs total (R:6)
UNRDY READY PRE IN_Q POST DONE FAIL %DONE STATE DAGNAME
0 0 0 1 0 1 0 50.0 Running level-2_ID000004/level-1_ID000002/level-1-0.dag
0 0 0 2 0 1 0 33.3 Running level-2_ID000004/level-2-0.dag
0 0 0 3 0 1 0 25.0 Running *level-3-0.dag
0 0 0 6 0 3 0 33.3 TOTALS (9 jobs)
Summary: 3 DAGs total (Running:3)
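The %DONE column in the sample output above is consistent with the SUCCESS (DONE) count divided by the sum of all state counts. The following Python sketch checks that reading against the sample rows. The formula is inferred from the sample output, not taken from the pegasus-status source, and the function name percent_done is hypothetical.

```python
def percent_done(unready, ready, pre, queued, post, success, failure):
    """Return the percentage of jobs that finished successfully.

    Assumption: %DONE = SUCCESS / (sum of all state counts) * 100.
    This is inferred from the sample output, not from the tool itself.
    """
    total = unready + ready + pre + queued + post + success + failure
    if total == 0:
        return 0.0
    return round(100.0 * success / total, 1)


# State counts copied from the sample "pegasus-status -l" output.
rows = {
    "level-1-0.dag": (0, 0, 0, 1, 0, 1, 0),
    "level-2-0.dag": (0, 0, 0, 2, 0, 1, 0),
    "level-3-0.dag": (0, 0, 0, 3, 0, 1, 0),
    "TOTALS": (0, 0, 0, 6, 0, 3, 0),
}
for name, counts in rows.items():
    print(name, percent_done(*counts))
```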
The following output shows the summary of a successful workflow after it has finished.
$ pegasus-status work/2011080514
(no matching jobs found in Condor Q)
UNREADY READY PRE QUEUED POST SUCCESS FAILURE %DONE
0 0 0 0 0 7,137 0 100.0
Summary: 44 DAGs total (Success:44)
Warning
For large workflows with many jobs, please note that pegasus-status will take time to compile state from all workflow files. This typically affects the initial run, and subsequent runs are faster due to the file system's buffer cache. However, on a low-RAM machine, thrashing is a possibility.
The following output shows a failed workflow after no more jobs from it exist. Please note that no active jobs are shown, and that the overall workflow status is reported as a failure.
$ pegasus-status work/submit
(no matching jobs found in Condor Q)
UNREADY READY PRE QUEUED POST SUCCESS FAILURE %DONE
20 0 0 0 0 0 2 0.0
Summary: 1 DAG total (Failure:1)
Pegasus-analyzer is a command-line utility for parsing several files in the workflow directory and summarizing useful information to the user. It should be used after the workflow has already finished execution. pegasus-analyzer quickly goes through the jobstate.log file, and isolates jobs that did not complete successfully. It then parses their submit and kickstart output files, and prints detailed information to help the user debug what happened to the workflow.
The simplest way to invoke pegasus-analyzer is to give it a workflow run directory, as in the example below:
$ pegasus-analyzer -d /home/user/run0004
pegasus-analyzer: initializing...
************************************Summary*************************************
Total jobs : 26 (100.00%)
# jobs succeeded : 25 (96.15%)
# jobs failed : 1 (3.84%)
# jobs unsubmitted : 0 (0.00%)
******************************Failed jobs' details******************************
============================register_viz_glidein_7_0============================
last state: POST_SCRIPT_FAILURE
site: local
submit file: /home/user/run0004/register_viz_glidein_7_0.sub
output file: /home/user/run0004/register_viz_glidein_7_0.out.002
error file: /home/user/run0004/register_viz_glidein_7_0.err.002
-------------------------------Task #1 - Summary--------------------------------
site : local
executable : /lfs1/software/install/pegasus/default/bin/rc-client
arguments : -Dpegasus.user.properties=/lfs1/work/pegasus/run0004/pegasus.15181.properties \
-Dpegasus.catalog.replica.url=rlsn://smarty.isi.edu --insert register_viz_glidein_7_0.in
exitcode : 1
working dir : /lfs1/work/pegasus/run0004
---------Task #1 - pegasus::rc-client - pegasus::rc-client:1.0 - stdout---------
2009-02-20 16:25:13.467 ERROR [root] You need to specify the pegasus.catalog.replica property
2009-02-20 16:25:13.468 WARN [root] non-zero exit-code 1
In the case above, pegasus-analyzer's output contains a brief summary section, showing how many jobs have succeeded and how many have failed. After that, pegasus-analyzer will print information about each job that failed, showing its last known state, along with the location of its submit, output, and error files. pegasus-analyzer will also display any stdout and stderr from the job, as recorded in its kickstart record. Please consult pegasus-analyzer's man page for more examples and a detailed description of its various command-line options.
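The idea of isolating jobs that did not complete successfully can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the logic of pegasus-analyzer itself. It assumes the input is a time-ordered list of (job name, state) pairs, that the last state seen for a job is its final state, and that states ending in _FAILURE or _SUCCESS mark failed and succeeded jobs. Only POST_SCRIPT_FAILURE appears in the sample output above. The other state names and the function name summarize are hypothetical.

```python
def summarize(events):
    """Summarize final job states from (job_name, state) pairs in time order.

    Assumptions for this sketch: the last event for a job is its final
    state, "_FAILURE" suffixes mark failed jobs and "_SUCCESS" suffixes
    mark succeeded jobs. The real tool applies its own rules.
    """
    last_state = {}
    for job, state in events:
        last_state[job] = state  # later events overwrite earlier ones

    failed = sorted(j for j, s in last_state.items() if s.endswith("_FAILURE"))
    succeeded = sorted(j for j, s in last_state.items() if s.endswith("_SUCCESS"))
    return {"total": len(last_state), "succeeded": succeeded, "failed": failed}


# Illustrative events; SUBMIT and POST_SCRIPT_SUCCESS are assumed state names.
events = [
    ("stage_in_local_local_0", "SUBMIT"),
    ("stage_in_local_local_0", "POST_SCRIPT_SUCCESS"),
    ("register_viz_glidein_7_0", "SUBMIT"),
    ("register_viz_glidein_7_0", "POST_SCRIPT_FAILURE"),
]
report = summarize(events)
print("total:", report["total"])
print("failed:", report["failed"])
```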
If you want to abort your workflow for any reason, you can use the pegasus-remove command listed in the output of the pegasus-run invocation, or run it yourself and specify the DAG directory of the workflow you want to terminate.
$ pegasus-remove /PATH/To/WORKFLOW DIRECTORY
Pegasus will remove the DAGMan and all the jobs related to the DAGMan from the condor queue. A rescue DAG will be generated in case you want to resubmit the same workflow and continue execution from where it last stopped. A rescue DAG only skips jobs that have completely finished. It does not continue a partially running job unless the executable supports checkpointing.
To resubmit an aborted or failed workflow with the same submit files and rescue DAG, just rerun the pegasus-run command:
$ pegasus-run /Path/To/Workflow/Directory
The Pegasus plotting and statistics tools query the Stampede database created by pegasus-monitord to generate their output. The Stampede schema is described in the Pegasus documentation.
The statistics and plotting tools use the following terminology for tasks, jobs, etc. Pegasus takes in a DAX, which is composed of tasks. Pegasus plans it into a Condor DAG / executable workflow that consists of jobs. In case of clustering, multiple tasks in the DAX can be captured in a single job in the executable workflow. When DAGMan executes a job, a job instance is populated. Job instances capture information as seen by DAGMan. In case DAGMan retries a job on detecting a failure, a new job instance is populated. When DAGMan finds that a job instance has finished, an invocation is associated with the job instance. In case of a clustered job, multiple invocations will be associated with a single job instance. If a pre script or post script is associated with a job instance, then invocations are populated in the database for the corresponding job instance.
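The relationships between tasks, jobs, job instances, and invocations can be pictured with a small Python data model. This is a sketch for orientation only. The class and field names are hypothetical and do not mirror the Stampede schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    name: str  # a task as declared in the DAX


@dataclass
class Invocation:
    executable: str  # one executed program (task, pre script, or post script)
    exitcode: int


@dataclass
class JobInstance:
    try_number: int  # one instance per attempt made by DAGMan
    invocations: List[Invocation] = field(default_factory=list)


@dataclass
class Job:
    name: str
    tasks: List[Task]  # more than one task when clustering is used
    instances: List[JobInstance] = field(default_factory=list)

    def new_instance(self) -> JobInstance:
        """Record a new attempt, as happens when DAGMan runs or retries the job."""
        inst = JobInstance(try_number=len(self.instances) + 1)
        self.instances.append(inst)
        return inst


# A clustered job with two tasks: the first attempt fails, the retry succeeds.
job = Job(name="merged_job", tasks=[Task("findrange_a"), Task("findrange_b")])
first = job.new_instance()
first.invocations.append(Invocation("findrange", exitcode=1))
second = job.new_instance()
second.invocations += [Invocation("findrange", 0), Invocation("findrange", 0)]
print(len(job.tasks), len(job.instances), len(second.invocations))
```

In this model, a retry adds a job instance to the same job, and a clustered job instance carries one invocation per task.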
Pegasus-statistics generates workflow execution statistics. To generate statistics run the command as shown below.
$ pegasus-statistics /scratch/grid-setup/run0001/ -s all
...
******************************************** SUMMARY ********************************************
...
-----------------------------------------------------------------------------------------------------
Type Succeeded Failed Incomplete Total Retries Total Run (Retries Included)
Tasks 8 0 0 8 || 0 8
Jobs 27 0 0 27 || 0 27
Sub Workflows 2 0 0 2 || 0 2
-----------------------------------------------------------------------------------------------------
Workflow wall time : 21 mins, 9 secs, (total 1269 seconds)
Workflow cumulative job wall time : 8 mins, 4 secs, (total 484 seconds)
Cumulative job walltime as seen from submit side : 8 mins, 0 secs, (total 480 seconds)
Workflow execution statistics : /scratch/grid-setup/run0001/statistics/workflow.txt
Job instance statistics : /scratch/grid-setup/run0001/statistics/jobs.txt
Transformation statistics : /scratch/grid-setup/run0001/statistics/breakdown.txt
Time statistics : /scratch/grid-setup/run0001/statistics/time.txt
**************************************************************************************************
By default, the output is generated in a statistics folder inside the submit directory. The output generated by pegasus-statistics depends on the value set for the command-line option 's' (statistics_level). In the sample run, the option 's' is set to 'all' to generate all the statistics information for the workflow run. Please consult the pegasus-statistics man page for a detailed description of the various command-line options.
Note
In case of hierarchical workflows, the metrics that are displayed on stdout take into account all the jobs/tasks/sub workflows that make up the workflow by recursively iterating through each sub workflow.
The pegasus-statistics summary printed on stdout contains the following information.
Workflow summary - Summary of the workflow execution. In case of a hierarchical workflow, the calculation shows the statistics across all the sub workflows. It shows the following statistics about tasks, jobs and sub workflows.
Succeeded - total count of succeeded tasks/jobs/sub workflows.
Failed - total count of failed tasks/jobs/sub workflows.
Incomplete - total count of tasks/jobs/sub workflows that are not in succeeded or failed state. This includes all the jobs that are not submitted, submitted but not completed etc. This is calculated as the difference between the 'total' count and the sum of the 'succeeded' and 'failed' counts.
Total - total count of tasks/jobs/sub workflows.
Retries - total retry count of tasks/jobs/sub workflows.
Total Run - total count of tasks/jobs/sub workflows executed during workflow run. This is the sum of the retries, succeeded and failed counts.
Workflow wall time - The walltime from the start of the workflow execution to the end as reported by DAGMan. In case of a rescue DAG, the value is the cumulative of all retries.
Workflow cumulative job wall time - The sum of the walltime of all jobs as reported by kickstart. In case of job retries, the value is the cumulative of all retries. For workflows having sub workflow jobs (i.e., SUBDAG and SUBDAX jobs), the walltime value includes jobs from the sub workflows as well.
Cumulative job walltime as seen from submit side - The sum of the walltime of all jobs as reported by DAGMan. This is similar to the regular cumulative job walltime, but includes job management overhead and delays. In case of job retries, the value is the cumulative of all retries. For workflows having sub workflow jobs (i.e., SUBDAG and SUBDAX jobs), the walltime value includes jobs from the sub workflows as well.
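The two derived columns of the summary, Incomplete and Total Run, follow directly from the definitions above. A minimal Python sketch of that arithmetic is given here. The function name summary_row is hypothetical, and the second call uses made-up counts.

```python
def summary_row(succeeded, failed, total, retries):
    """Derive the Incomplete and Total Run columns from the other counts.

    Incomplete = Total - (Succeeded + Failed)
    Total Run  = Retries + Succeeded + Failed
    """
    return {
        "incomplete": total - (succeeded + failed),
        "total_run": retries + succeeded + failed,
    }


# Jobs row from the sample summary: 27 succeeded, 0 failed, 27 total, 0 retries.
print(summary_row(succeeded=27, failed=0, total=27, retries=0))

# Made-up counts for a run that stopped early after two retries.
print(summary_row(succeeded=25, failed=1, total=30, retries=2))
```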
pegasus-statistics generates the following statistics files based on the command line options set.
Workflow statistics file per workflow [workflow.txt]
The workflow statistics file contains the following information about each workflow run. In case of hierarchical workflows, the file contains a table for each sub workflow. The file also contains a 'Total' table at the bottom, which is the cumulative of all the individual statistics details.
A sample table is shown below. It shows the following statistics about tasks, jobs and sub workflows.
Workflow retries - number of times a workflow was retried.
Succeeded - total count of succeeded tasks/jobs/sub workflows.
Failed - total count of failed tasks/jobs/sub workflows.
Incomplete - total count of tasks/jobs/sub workflows that are not in succeeded or failed state. This includes all the jobs that are not submitted, submitted but not completed etc. This is calculated as the difference between the 'total' count and the sum of the 'succeeded' and 'failed' counts.
Total - total count of tasks/jobs/sub workflows.
Retries - total retry count of tasks/jobs/sub workflows.
Total Run - total count of tasks/jobs/sub workflows executed during workflow run. This is the sum of the retries, succeeded and failed counts.
Table 7.1. Workflow Statistics
| # | Type | Succeeded | Failed | Incomplete | Total | Retries | Total Run | Workflow Retries |
|---|---|---|---|---|---|---|---|---|
| 2a6df11b-9972-4ba0-b4ba-4fd39c357af4 | | | | | | | | 0 |
| | Tasks | 4 | 0 | 0 | 4 | 0 | 4 | |
| | Jobs | 13 | 0 | 0 | 13 | 0 | 13 | |
| | Sub Workflows | 0 | 0 | 0 | 0 | 0 | 0 | |
Job statistics file per workflow [jobs.txt]
Job statistics file per workflow contains the following details about the job instances in each workflow. A sample file is shown below.
Job - the name of the job instance
Try - the number representing the job instance run count.
Site - the site where the job instance ran
Kickstart(sec.) - the actual duration of the job instance in seconds on the remote compute node.
Post(sec.) - the postscript time as reported by DAGMan.
CondorQTime(sec.) - the time between submission by DAGMan and the remote Grid submission. It is an estimate of the time spent in the Condor queue on the submit node.
Resource(sec.) - the time between the remote Grid submission and start of remote execution. It is an estimate of the time the job instance spent in the remote queue.
Runtime(sec.) - the time spent on the resource as seen by Condor DAGMan. It is always >= kickstart.
Seqexec(sec.) - the time taken for the completion of a clustered job instance.
Seqexec-Delay(sec.) - the difference between the completion time of a clustered job instance and the sum of the kickstart times of all its individual tasks.
Table 7.2. Job statistics
| Job | Try | Site | Kickstart | Post | CondorQTime | Resource | Runtime | Seqexec | Seqexec-Delay |
|---|---|---|---|---|---|---|---|---|---|
| analyze_ID0000004 | 1 | local | 60.002 | 5.0 | 0.0 | - | 62.0 | - | - |
| create_dir_diamond_0_local | 1 | local | 0.027 | 5.0 | 5.0 | - | 0.0 | - | - |
| findrange_ID0000002 | 1 | local | 60.001 | 5.0 | 0.0 | - | 60.0 | - | - |
| findrange_ID0000003 | 1 | local | 60.002 | 5.0 | 10.0 | - | 61.0 | - | - |
| preprocess_ID0000001 | 1 | local | 60.002 | 5.0 | 5.0 | - | 60.0 | - | - |
| register_local_1_0 | 1 | local | 0.459 | 6.0 | 5.0 | - | 0.0 | - | - |
| register_local_1_1 | 1 | local | 0.338 | 5.0 | 5.0 | - | 0.0 | - | - |
| register_local_2_0 | 1 | local | 0.348 | 5.0 | 5.0 | - | 0.0 | - | - |
| stage_in_local_local_0 | 1 | local | 0.39 | 5.0 | 5.0 | - | 0.0 | - | - |
| stage_out_local_local_0_0 | 1 | local | 0.165 | 5.0 | 10.0 | - | 0.0 | - | - |
| stage_out_local_local_1_0 | 1 | local | 0.147 | 7.0 | 5.0 | - | 0.0 | - | - |
| stage_out_local_local_1_1 | 1 | local | 0.139 | 5.0 | 6.0 | - | 0.0 | - | - |
| stage_out_local_local_2_0 | 1 | local | 0.145 | 5.0 | 5.0 | - | 0.0 | - | - |
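The Seqexec-Delay column is a simple difference, as the definition above states. The Python sketch below spells it out. The function name seqexec_delay and the input values are hypothetical, since the sample table contains no clustered jobs.

```python
def seqexec_delay(seqexec_seconds, task_kickstart_seconds):
    """Overhead of a clustered job instance.

    Computed as the clustered job's completion time minus the sum of
    the kickstart times of its individual tasks.
    """
    return seqexec_seconds - sum(task_kickstart_seconds)


# Made-up example: a clustered job took 125.5 s and wrapped two 60 s tasks.
print(seqexec_delay(125.5, [60.0, 60.0]))
```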
Transformation statistics file per workflow [breakdown.txt]
Transformation statistics file per workflow contains information about the invocations in each workflow grouped by transformation name. A sample file is shown below.
Transformation - name of the transformation.
Count - the number of times invocations with a given transformation name were executed.
Succeeded - the count of succeeded invocations with a given logical transformation name.
Failed - the count of failed invocations with a given logical transformation name.
Min (sec.) - the minimum runtime value of invocations with a given logical transformation name.
Max (sec.) - the maximum runtime value of invocations with a given logical transformation name.
Mean (sec.) - the mean of the invocation runtimes with a given logical transformation name.
Total (sec.) - the cumulative runtime of invocations with a given logical transformation name.
Table 7.3. Transformation Statistics
| Transformation | Count | Succeeded | Failed | Min | Max | Mean | Total |
|---|---|---|---|---|---|---|---|
| dagman::post | 13 | 13 | 0 | 5.0 | 7.0 | 5.231 | 68.0 |
| diamond::analyze | 1 | 1 | 0 | 60.002 | 60.002 | 60.002 | 60.002 |
| diamond::findrange | 2 | 2 | 0 | 60.001 | 60.002 | 60.002 | 120.003 |
| diamond::preprocess | 1 | 1 | 0 | 60.002 | 60.002 | 60.002 | 60.002 |
| pegasus::dirmanager | 1 | 1 | 0 | 0.027 | 0.027 | 0.027 | 0.027 |
| pegasus::pegasus-transfer | 5 | 5 | 0 | 0.139 | 0.39 | 0.197 | 0.986 |
| pegasus::rc-client | 3 | 3 | 0 | 0.338 | 0.459 | 0.382 | 1.145 |
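The columns of the transformation table can be reproduced by grouping invocation runtimes by transformation name. The Python sketch below does this for the pegasus::rc-client and pegasus::dirmanager rows, using the kickstart values of the matching jobs in Table 7.2. The function name, the input tuple layout, and the rule that exit code 0 means success are assumptions of this sketch, not a description of how pegasus-statistics is implemented.

```python
from collections import defaultdict


def transformation_stats(invocations):
    """Group (transformation, runtime_seconds, exitcode) tuples by name.

    Assumption: an exit code of 0 counts as a succeeded invocation.
    """
    grouped = defaultdict(list)
    for name, runtime, exitcode in invocations:
        grouped[name].append((runtime, exitcode))

    stats = {}
    for name, rows in grouped.items():
        runtimes = [runtime for runtime, _ in rows]
        succeeded = sum(1 for _, code in rows if code == 0)
        stats[name] = {
            "count": len(rows),
            "succeeded": succeeded,
            "failed": len(rows) - succeeded,
            "min": min(runtimes),
            "max": max(runtimes),
            "mean": round(sum(runtimes) / len(runtimes), 3),
            "total": round(sum(runtimes), 3),
        }
    return stats


# Runtimes taken from the register_local_* and create_dir rows of Table 7.2.
invocations = [
    ("pegasus::rc-client", 0.459, 0),
    ("pegasus::rc-client", 0.338, 0),
    ("pegasus::rc-client", 0.348, 0),
    ("pegasus::dirmanager", 0.027, 0),
]
for name, row in transformation_stats(invocations).items():
    print(name, row)
```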
Time statistics file [time.txt]
The time statistics file contains job instance and invocation statistics grouped by time and host. The time grouping can be by day or by hour. The file contains the following tables: Job instance statistics per day/hour, Invocation statistics per day/hour, Job instance statistics by host per day/hour, and Invocation statistics by host per day/hour. A sample Invocation statistics by host per day table is shown below.
Job instance statistics per day/hour - the number of job instances run, total runtime sorted by day/hour.
Invocation statistics per day/hour - the number of invocations, total runtime sorted by day/hour.
Job instance statistics by host per day/hour - the number of job instances run, total runtime on each host sorted by day/hour.
Invocation statistics by host per day/hour - the number of invocations, total runtime on each host sorted by day/hour.
Table 7.4. Invocation statistics by host per day
| Date [YYYY-MM-DD] | Host | Count | Runtime (Sec.) |
|---|---|---|---|
| 2011-07-15 | butterfly.isi.edu | 54 | 625.094 |
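A table such as Table 7.4 amounts to grouping invocations by calendar day and host, then counting them and summing their runtimes. The Python sketch below shows that grouping. The function name, the input tuple layout, the timestamps, and the choice of UTC for the day boundary are assumptions of this sketch. The actual tool may group by local time.

```python
from collections import defaultdict
from datetime import datetime, timezone


def invocations_by_host_per_day(invocations):
    """Group (unix_timestamp, host, runtime_seconds) tuples by (day, host).

    Assumption: days are computed in UTC. Returns a dict mapping
    (YYYY-MM-DD, host) to (count, total_runtime).
    """
    table = defaultdict(lambda: [0, 0.0])
    for ts, host, runtime in invocations:
        day = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")
        entry = table[(day, host)]
        entry[0] += 1
        entry[1] += runtime
    return {
        key: (count, round(total, 3))
        for key, (count, total) in sorted(table.items())
    }


def utc_ts(year, month, day, hour):
    """Helper to build illustrative UTC timestamps."""
    return datetime(year, month, day, hour, tzinfo=timezone.utc).timestamp()


# Made-up invocations on the host named in the sample table.
sample = [
    (utc_ts(2011, 7, 15, 10), "butterfly.isi.edu", 60.002),
    (utc_ts(2011, 7, 15, 11), "butterfly.isi.edu", 0.027),
    (utc_ts(2011, 7, 16, 9), "butterfly.isi.edu", 0.459),
]
for (day, host), (count, runtime) in invocations_by_host_per_day(sample).items():
    print(day, host, count, runtime)
```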
Pegasus-plots generates graphs and charts to visualize workflow execution. To generate graphs and charts run the command as shown below.
$ pegasus-plots -p all /scratch/grid-setup/run0001/
...
******************************************** SUMMARY ********************************************
Graphs and charts generated by pegasus-plots can be viewed by opening the generated html file in the web browser :
/scratch/grid-setup/run0001/plots/index.html
**************************************************************************************************
By default, the output is generated in a plots folder inside the submit directory. The output generated by pegasus-plots depends on the value set for the command-line option 'p' (plotting_level). In the sample run, the option 'p' is set to 'all' to generate all the charts and graphs for the workflow run. Please consult the pegasus-plots man page for a detailed description of the various command-line options. pegasus-plots generates an index.html file which provides links to all the generated charts and plots. A sample index.html page is shown below.
pegasus-plots generates the following plots and charts.
Dax Graph
Graph representation of the DAX file. A sample page is shown below.
Dag Graph
Graph representation of the DAG file. A sample page is shown below.
Gantt workflow execution chart
Gantt chart of the workflow execution run. A sample page is shown below.
The toolbar at the top provides zoom in/out, pan left/right/top/bottom and show/hide job name functionality. The toolbar at the bottom can be used to show/hide job states. Failed job instances are shown with a red border in the chart. Clicking on a sub workflow job instance will take you to the corresponding sub workflow chart.
Host over time chart
Host over time chart of the workflow execution run. A sample page is shown below.
The toolbar at the top provides zoom in/out, pan left/right/top/bottom and show/hide host name functionality. The toolbar at the bottom can be used to show/hide job states. Failed job instances are shown with a red border in the chart. Clicking on a sub workflow job instance will take you to the corresponding sub workflow chart.
Time chart
Time chart shows job instance/invocation count and runtime of the workflow run over time. A sample page is shown below.
The toolbar at the top provides zoom in/out and pan left/right/top/bottom functionality. The toolbar at the bottom can be used to switch between job instances/invocations and day/hour filtering.
Breakdown chart
Breakdown chart shows invocation count and runtime of the workflow run grouped by transformation name. A sample page is shown below.
The toolbar at the bottom can be used to switch between invocation count and runtime filtering. Legends can be clicked to get more details.










