Pegasus comes bundled with tools that help users debug workflows and generate statistics and plots about their workflow runs. Most of the tools query a runtime workflow database (usually a SQLite database in the workflow submit directory) populated at runtime by pegasus-monitord. With the exception of pegasus-monitord (see below), all tools take the submit directory as an argument. Users can invoke the tools listed in this chapter as follows:
$ pegasus-[toolname] <path to the submit directory>
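When the tools are driven from a script, the invocation pattern above can be wrapped in a small helper. The following is a minimal Python sketch, not part of Pegasus itself: the function names tool_command and run_tool are hypothetical, and it assumes the Pegasus command-line tools are installed and on the PATH.

```python
import subprocess


def tool_command(toolname, submit_dir, *options):
    """Build the argument list for `pegasus-<toolname> [options] <submit dir>`."""
    return [f"pegasus-{toolname}", *options, str(submit_dir)]


def run_tool(toolname, submit_dir, *options):
    """Run a Pegasus tool against a submit directory and return its stdout.

    Assumption: the tool is on PATH. A non-zero exit status does not raise
    here; the caller is expected to inspect the returned text.
    """
    result = subprocess.run(
        tool_command(toolname, submit_dir, *options),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout
```

For example, tool_command("status", "/home/user/run0004") yields the argument list for pegasus-status on that run directory. pegasus-monitord is the exception noted above and is not covered by this pattern.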
As the number of jobs and tasks in a workflow increases, the ability to track progress and quickly debug the workflow becomes more and more important. Pegasus comes with a series of utilities that can be used to monitor and debug workflows, both in real time and after execution has completed.
To monitor the execution of the workflow, run the pegasus-status command as suggested by the output of the pegasus-run command. pegasus-status shows the current status of the Condor Q for the master workflow in the workflow directory you point it to. In a second section, it shows a summary of the state of all jobs in the workflow and all of its sub-workflows.
The details of pegasus-status are described in its manual page. There are many options to help you get the most out of this tool, including a watch mode that repeatedly redraws the information, various modes that add more information, and legends for users who are new to the tool or need to present its output.
$ pegasus-status /Workflow/dags/directory
STAT  IN_STATE  JOB
Run      05:08  level-3-0
Run      04:32   |-sleep_ID000005
Run      04:27   \_subdax_level-2_ID000004
Run      03:51      |-sleep_ID000003
Run      03:46      \_subdax_level-1_ID000002
Run      03:10         \_sleep_ID000001
Summary: 6 Condor jobs total (R:6)

UNREADY   READY     PRE  QUEUED    POST SUCCESS FAILURE %DONE
      0       0       0       6       0       3       0  33.3
Summary: 3 DAGs total (Running:3)
Without the -l option, only a summary of the workflow statistics is shown under the current queue status. However, with the -l option, it will show each sub-workflow separately:
$ pegasus-status -l /Workflow/dags/directory
STAT  IN_STATE  JOB
Run      07:01  level-3-0
Run      06:25   |-sleep_ID000005
Run      06:20   \_subdax_level-2_ID000004
Run      05:44      |-sleep_ID000003
Run      05:39      \_subdax_level-1_ID000002
Run      05:03         \_sleep_ID000001
Summary: 6 Condor jobs total (R:6)

UNRDY READY   PRE  IN_Q  POST  DONE  FAIL %DONE STATE   DAGNAME
    0     0     0     1     0     1     0  50.0 Running level-2_ID000004/level-1_ID000002/level-1-0.dag
    0     0     0     2     0     1     0  33.3 Running level-2_ID000004/level-2-0.dag
    0     0     0     3     0     1     0  25.0 Running *level-3-0.dag
    0     0     0     6     0     3     0  33.3         TOTALS (9 jobs)
Summary: 3 DAGs total (Running:3)
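The per-DAG table printed with the -l option is regular enough to post-process in a script. The sketch below parses it into one record per DAG. It is an illustration based only on the sample output above: the names parse_dag_rows, COUNT_COLUMNS and SAMPLE are hypothetical, the column layout is assumed to match the sample, and the leading asterisk on a DAG name is recorded as a flag without interpreting it (in the sample it appears on the top-level DAG).

```python
# Sample text copied from the `pegasus-status -l` output shown above.
SAMPLE = """\
UNRDY READY   PRE  IN_Q  POST  DONE  FAIL %DONE STATE   DAGNAME
    0     0     0     1     0     1     0  50.0 Running level-2_ID000004/level-1_ID000002/level-1-0.dag
    0     0     0     2     0     1     0  33.3 Running level-2_ID000004/level-2-0.dag
    0     0     0     3     0     1     0  25.0 Running *level-3-0.dag
    0     0     0     6     0     3     0  33.3         TOTALS (9 jobs)
Summary: 3 DAGs total (Running:3)
"""

# Assumed column order of the counters, taken from the sample header.
COUNT_COLUMNS = ["UNRDY", "READY", "PRE", "IN_Q", "POST", "DONE", "FAIL"]


def parse_dag_rows(status_text):
    """Return one dict per DAG row of the long-format table (TOTALS excluded)."""
    rows = []
    in_table = False
    for line in status_text.splitlines():
        tokens = line.split()
        if tokens[:7] == COUNT_COLUMNS:
            in_table = True  # header found; data rows follow
            continue
        if tokens and tokens[0] == "Summary:":
            in_table = False  # the summary line closes the table
            continue
        if not in_table or len(tokens) < 10 or tokens[8] == "TOTALS":
            continue
        row = dict(zip(COUNT_COLUMNS, (int(t.replace(",", "")) for t in tokens[:7])))
        row["percent_done"] = float(tokens[7])
        row["state"] = tokens[8]
        row["starred"] = tokens[9].startswith("*")
        row["dagname"] = tokens[9].lstrip("*")
        rows.append(row)
    return rows
```

A caller could then, for instance, pick the DAG that is furthest from completion with min(parse_dag_rows(text), key=lambda r: r["percent_done"]).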
The following output shows the summary of a successful workflow after it has finished.
$ pegasus-status work/2011080514
(no matching jobs found in Condor Q)
UNREADY   READY     PRE  QUEUED    POST SUCCESS FAILURE %DONE
      0       0       0       0       0   7,137       0 100.0
Summary: 44 DAGs total (Success:44)
For large workflows with many jobs, please note that pegasus-status will take time to compile state from all workflow files. This typically affects the initial run; subsequent runs are faster due to the file system's buffer cache. However, on a low-RAM machine, thrashing is a possibility.
The following output shows a failed workflow after no more of its jobs exist in the queue. Note that no active jobs are shown, and that the workflow as a whole is reported as failed.
$ pegasus-status work/submit
(no matching jobs found in Condor Q)
UNREADY   READY     PRE  QUEUED    POST SUCCESS FAILURE %DONE
     20       0       0       0       0       0       2   0.0
Summary: 1 DAG total (Failure:1)
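The finished-workflow summaries shown above can also be checked programmatically, for example in a wrapper script that decides whether to run pegasus-analyzer. The following Python sketch parses the job counters and the final "Summary: ... DAGs total" line. It is based only on the sample outputs in this section: the function names are hypothetical, and the assumption that several DAG states inside the parentheses each follow the Name:count form is not confirmed by the text.

```python
import re

# Column order as printed in the summary header above.
COLUMNS = ["UNREADY", "READY", "PRE", "QUEUED", "POST", "SUCCESS", "FAILURE", "%DONE"]

# Sample text copied from the failed-workflow output shown above.
FAILED_RUN = """\
(no matching jobs found in Condor Q)
UNREADY   READY     PRE  QUEUED    POST SUCCESS FAILURE %DONE
     20       0       0       0       0       0       2   0.0
Summary: 1 DAG total (Failure:1)
"""


def parse_job_summary(status_text):
    """Return the counters from the line following the summary header, or None."""
    lines = status_text.splitlines()
    for header, values in zip(lines, lines[1:]):
        if header.split() != COLUMNS:
            continue
        fields = values.split()
        if len(fields) != len(COLUMNS):
            return None
        # Counts may contain thousands separators, e.g. "7,137".
        summary = {name: int(field.replace(",", ""))
                   for name, field in zip(COLUMNS[:-1], fields)}
        summary["%DONE"] = float(fields[-1])
        return summary
    return None


def parse_dag_states(status_text):
    """Parse e.g. 'Summary: 1 DAG total (Failure:1)' into {'Failure': 1}."""
    match = re.search(r"Summary: \d+ DAGs? total \(([^)]*)\)", status_text)
    if match is None:
        return {}
    # Assumption: each state is printed as Name:count.
    return {state: int(count)
            for state, count in re.findall(r"([A-Za-z]+):(\d+)", match.group(1))}


def workflow_failed(status_text):
    """True if any job or any DAG is reported as failed."""
    summary = parse_job_summary(status_text)
    states = parse_dag_states(status_text)
    return bool(summary and summary["FAILURE"] > 0) or states.get("Failure", 0) > 0
```

Both the FAILURE counter and the DAG state are checked because the samples report them on separate lines.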
pegasus-analyzer is a command-line utility that parses several files in the workflow directory and summarizes useful information for the user. It should be used after the workflow has finished execution. pegasus-analyzer quickly goes through the jobstate.log file and isolates jobs that did not complete successfully. It then parses their submit and kickstart output files, and prints detailed information to help the user debug what happened to the workflow.
The simplest way to invoke pegasus-analyzer is to give it a workflow run directory, as in the example below:
$ pegasus-analyzer /home/user/run0004
pegasus-analyzer: initializing...

************************************Summary*************************************

 Total jobs         :     26 (100.00%)
 # jobs succeeded   :     25 (96.15%)
 # jobs failed      :      1 (3.84%)
 # jobs held        :      1 (3.84%)
 # jobs unsubmitted :      0 (0.00%)

*******************************Held jobs' details*******************************

================================sleep_ID0000001=================================

submit file          : sleep_ID0000001.sub
last_job_instance_id : 7
reason               : Error from email@example.com: STARTER at 184.108.40.206 failed to send file(s) to <220.127.116.11:62639>: error reading from /opt/condor/8.4.8/local.corbusier/execute/dir_76205/f.out: (errno 2) No such file or directory; SHADOW failed to receive file(s) from <18.104.22.168:62653>

******************************Failed jobs' details******************************

============================register_viz_glidein_7_0============================

 last state: POST_SCRIPT_FAILURE
       site: local
submit file: /home/user/run0004/register_viz_glidein_7_0.sub
output file: /home/user/run0004/register_viz_glidein_7_0.out.002
 error file: /home/user/run0004/register_viz_glidein_7_0.err.002

-------------------------------Task #1 - Summary--------------------------------

site        : local
executable  : /lfs1/software/install/pegasus/default/bin/rc-client
arguments   : -Dpegasus.user.properties=/lfs1/work/pegasus/run0004/pegasus.15181.properties \
-Dpegasus.catalog.replica.url=rlsn://smarty.isi.edu --insert register_viz_glidein_7_0.in
exitcode    : 1
working dir : /lfs1/work/pegasus/run0004

---------Task #1 - pegasus::rc-client - pegasus::rc-client:1.0 - stdout---------

2009-02-20 16:25:13.467 ERROR [root] You need to specify the pegasus.catalog.replica property
2009-02-20 16:25:13.468 WARN  [root] non-zero exit-code 1
In the case above, pegasus-analyzer's output contains a brief summary section showing how many jobs succeeded and how many failed. If there are any held jobs, pegasus-analyzer reports the name of each held job and the reason it was held, as determined from the dagman.out file for the workflow. The last_job_instance_id is the database id for the job in the job instance table of the monitoring database. After that, pegasus-analyzer prints information about each job that failed, showing its last known state along with the location of its submit, output, and error files. pegasus-analyzer also displays any stdout and stderr from the job, as recorded in its kickstart record. Please consult pegasus-analyzer's man page for more examples and a detailed description of its various command-line options.
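The summary section described above can be extracted from captured pegasus-analyzer output with a short parser. This is a minimal Python sketch that assumes the summary lines look exactly like those in the example output; the names parse_analyzer_summary, SUMMARY_LINE and SUMMARY are hypothetical, and the layout may differ between Pegasus versions.

```python
import re

# Sample text copied from the summary section of the example output.
SUMMARY = """\
 Total jobs         :     26 (100.00%)
 # jobs succeeded   :     25 (96.15%)
 # jobs failed      :      1 (3.84%)
 # jobs held        :      1 (3.84%)
 # jobs unsubmitted :      0 (0.00%)
"""

# Matches "Total jobs : N (..." and "# jobs <word> : N (..." lines.
SUMMARY_LINE = re.compile(
    r"^\s*(?:#\s*)?(Total jobs|jobs \w+)\s*:\s*([\d,]+)\s*\(",
    re.MULTILINE,
)


def parse_analyzer_summary(analyzer_text):
    """Return a dict such as {'Total jobs': 26, 'jobs failed': 1, ...}."""
    return {
        label: int(count.replace(",", ""))
        for label, count in SUMMARY_LINE.findall(analyzer_text)
    }
```

A script could use the "jobs failed" and "jobs held" entries to decide whether the detailed sections are worth printing or archiving.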
Starting with the 4.0 release, pegasus-analyzer queries the database by default to debug the workflow. If you want it to use the files in the submit directory instead, use the --files option.
If you want to abort your workflow for any reason, you can use the pegasus-remove command listed in the output of the pegasus-run invocation, or run it yourself and specify the DAG directory of the workflow you want to terminate.
$ pegasus-remove /Path/To/Workflow/Directory
Pegasus will remove the DAGMan and all the jobs related to the DAGMan from the Condor queue. A rescue DAG will be generated in case you want to resubmit the same workflow and continue execution from where it last stopped. A rescue DAG only skips jobs that have completely finished. It does not continue a partially running job unless the executable supports checkpointing.
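The effect of a rescue DAG can be pictured with a small model: jobs that completely finished are skipped, and everything else runs again, starting with the jobs whose parents have all finished. The Python sketch below is a conceptual illustration only, not how DAGMan is implemented; the function name resubmission_plan and the job names are hypothetical.

```python
def resubmission_plan(parents, finished):
    """Model which jobs a resubmitted workflow still has to run.

    parents  : dict mapping each job to the list of jobs it depends on
    finished : set of jobs that completely finished before the abort

    Returns (remaining, ready): all jobs that will run again, and the
    subset that can start immediately because every parent has finished.
    A job that was only partially run is not in `finished`, so it starts
    over from the beginning.
    """
    remaining = [job for job in parents if job not in finished]
    ready = [job for job in remaining
             if all(parent in finished for parent in parents[job])]
    return remaining, ready


# Hypothetical diamond-shaped workflow: a -> (b, c) -> d
workflow = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}

# Suppose a and b finished, and c was still running when the workflow was removed.
remaining, ready = resubmission_plan(workflow, finished={"a", "b"})
```

In this example, c and d remain to be run, and only c can start right away; d waits until c has finished.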
To resubmit an aborted or failed workflow with the same submit files and rescue DAG, just rerun the pegasus-run command:
$ pegasus-run /Path/To/Workflow/Directory