6.2. Plotting and Statistics

The Pegasus plotting and statistics tools query the Stampede database populated by pegasus-monitord to generate their output. The Stampede database schema is described in the schema documentation.

The statistics and plotting tools use the following terminology for defining tasks, jobs, etc. Pegasus takes in a DAX, which is composed of tasks. Pegasus plans it into a Condor DAG / executable workflow that consists of jobs. In the case of clustering, multiple tasks in the DAX can be mapped to a single job in the executable workflow. When DAGMan executes a job, a job instance is populated in the database. Job instances capture information as seen by DAGMan. If DAGMan retries a job on detecting a failure, a new job instance is populated. When DAGMan finds that a job instance has finished, an invocation is associated with the job instance. In the case of a clustered job, multiple invocations are associated with a single job instance. If a pre script or post script is associated with a job instance, then invocations are also populated in the database for the corresponding job instance.
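To make this hierarchy concrete, the Stampede database behind a run can be inspected directly. The sketch below is illustrative only and makes two assumptions that should be verified against your installation: the name of the sqlite database file that pegasus-monitord writes into the submit directory, and the Stampede table names task, job, job_instance and invocation.

# Illustrative only: the database file name and the table names are
# assumptions -- check your submit directory and the Stampede schema docs.
$ sqlite3 /scratch/grid-setup/run0001/diamond-0.stampede.db \
    "SELECT (SELECT COUNT(*) FROM task)         AS tasks,
            (SELECT COUNT(*) FROM job)          AS jobs,
            (SELECT COUNT(*) FROM job_instance) AS job_instances,
            (SELECT COUNT(*) FROM invocation)   AS invocations;"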

6.2.1. pegasus-statistics

pegasus-statistics can compute statistics over one or more workflow runs.

The command to generate statistics over a single run is shown below.

$ pegasus-statistics /scratch/grid-setup/run0001/ -s all 

#
# Pegasus Workflow Management System - http://pegasus.isi.edu
#
# Workflow summary:
#   Summary of the workflow execution. It shows total
#   tasks/jobs/sub workflows run, how many succeeded/failed etc.
#   In case of hierarchical workflow the calculation shows the
#   statistics across all the sub workflows.It shows the following
#   statistics about tasks, jobs and sub workflows.
#     * Succeeded - total count of succeeded tasks/jobs/sub workflows.
#     * Failed - total count of failed tasks/jobs/sub workflows.
#     * Incomplete - total count of tasks/jobs/sub workflows that are
#       not in succeeded or failed state. This includes all the jobs
#       that are not submitted, submitted but not completed etc. This
#       is calculated as  difference between 'total' count and sum of
#       'succeeded' and 'failed' count.
#     * Total - total count of tasks/jobs/sub workflows.
#     * Retries - total retry count of tasks/jobs/sub workflows.
#     * Total+Retries - total count of tasks/jobs/sub workflows executed
#       during workflow run. This is the cumulative of retries,
#       succeeded and failed count.
# Workflow wall time:
#   The wall time from the start of the workflow execution to the end as
#   reported by the DAGMAN.In case of rescue dag the value is the
#   cumulative of all retries.
# Cumulative job wall time:
#   The sum of the wall time of all jobs as reported by kickstart.
#   In case of job retries the value is the cumulative of all retries.
#   For workflows having sub workflow jobs (i.e SUBDAG and SUBDAX jobs),
#   the wall time value includes jobs from the sub workflows as well.
# Cumulative job wall time as seen from submit side:
#   The sum of the wall time of all jobs as reported by DAGMan.
#   This is similar to the regular cumulative job wall time, but includes
#   job management overhead and delays. In case of job retries the value
#   is the cumulative of all retries. For workflows having sub workflow
#   jobs (i.e SUBDAG and SUBDAX jobs), the wall time value includes jobs
#   from the sub workflows as well.
# Cumulative job badput wall time:
#   The sum of the wall time of all failed jobs as reported by kickstart.
#   In case of job retries the value is the cumulative of all retries.
#   For workflows having sub workflow jobs (i.e SUBDAG and SUBDAX jobs),
#   the wall time value includes jobs from the sub workflows as well.
# Cumulative job badput wall time as seen from submit side:
#   The sum of the wall time of all failed jobs as reported by DAGMan.
#   This is similar to the regular cumulative job badput wall time, but includes
#   job management overhead and delays. In case of job retries the value
#   is the cumulative of all retries. For workflows having sub workflow
#   jobs (i.e SUBDAG and SUBDAX jobs), the wall time value includes jobs
#   from the sub workflows as well.
------------------------------------------------------------------------------
Type           Succeeded Failed  Incomplete  Total     Retries   Total+Retries
Tasks          4         0       0           4         0         4            
Jobs           20        0       0           20        0         20           
Sub-Workflows  0         0       0           0         0         0            
------------------------------------------------------------------------------

Workflow wall time                                       : 6 mins, 55 secs
Cumulative job wall time                                 : 4 mins, 58 secs
Cumulative job wall time as seen from submit side        : 5 mins, 11 secs
Cumulative job badput wall time                          : 0.0 secs
Cumulative job badput wall time as seen from submit side : 0.0 secs

Integrity Metrics
5 files checksums compared with total duration of 0.439 secs
8 files checksums generated with total duration of 1.031 secs

Summary                       : ./statistics/summary.txt
Workflow execution statistics : ./statistics/workflow.txt
Job instance statistics       : ./statistics/jobs.txt
Transformation statistics     : ./statistics/breakdown.txt
Integrity statistics          : ./statistics/integrity.txt
Time statistics               : ./statistics/time.txt


By default the output is generated in a statistics folder inside the submit directory. The output that pegasus-statistics produces is controlled by the value of the command line option -s (statistics_level). In the sample run, -s is set to 'all' to generate all of the statistics information for the workflow run. Please consult the pegasus-statistics man page for a detailed description of the various command line options.
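For example, to write only the summary statistics to a directory other than the default, the -o and -s options shown elsewhere in this chapter can be combined as below (the output path is arbitrary):

$ pegasus-statistics -o /tmp/run0001-statistics -s summary /scratch/grid-setup/run0001/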

Note

In the case of hierarchical workflows, the metrics that are displayed on stdout take into account all the jobs/tasks/sub workflows that make up the workflow, by recursively iterating through each sub workflow.

The command to generate statistics over all workflow runs populated in a single database is shown below.

$ pegasus-statistics -Dpegasus.monitord.output='mysql://s_user:s_user123@127.0.0.1:3306/stampede' -o /scratch/workflow_1_2/statistics -s all --multiple-wf 


#
# Pegasus Workflow Management System - http://pegasus.isi.edu
#
# Workflow summary:
#   Summary of the workflow execution. It shows total
#   tasks/jobs/sub workflows run, how many succeeded/failed etc.
#   In case of hierarchical workflow the calculation shows the
#   statistics across all the sub workflows.It shows the following
#   statistics about tasks, jobs and sub workflows.
#     * Succeeded - total count of succeeded tasks/jobs/sub workflows.
#     * Failed - total count of failed tasks/jobs/sub workflows.
#     * Incomplete - total count of tasks/jobs/sub workflows that are
#       not in succeeded or failed state. This includes all the jobs
#       that are not submitted, submitted but not completed etc. This
#       is calculated as  difference between 'total' count and sum of
#       'succeeded' and 'failed' count.
#     * Total - total count of tasks/jobs/sub workflows.
#     * Retries - total retry count of tasks/jobs/sub workflows.
#     * Total+Retries - total count of tasks/jobs/sub workflows executed
#       during workflow run. This is the cumulative of retries,
#       succeeded and failed count.
# Workflow wall time:
#   The wall time from the start of the workflow execution to the end as
#   reported by the DAGMAN.In case of rescue dag the value is the
#   cumulative of all retries.
# Workflow cumulative job wall time:
#   The sum of the wall time of all jobs as reported by kickstart.
#   In case of job retries the value is the cumulative of all retries.
#   For workflows having sub workflow jobs (i.e SUBDAG and SUBDAX jobs),
#   the wall time value includes jobs from the sub workflows as well.
# Cumulative job wall time as seen from submit side:
#   The sum of the wall time of all jobs as reported by DAGMan.
#   This is similar to the regular cumulative job wall time, but includes
#   job management overhead and delays. In case of job retries the value
#   is the cumulative of all retries. For workflows having sub workflow
#   jobs (i.e SUBDAG and SUBDAX jobs), the wall time value includes jobs
#   from the sub workflows as well.
# Workflow cumulative job badput wall time:
#   The sum of the wall time of all failed jobs as reported by kickstart.
#   In case of job retries the value is the cumulative of all retries.
#   For workflows having sub workflow jobs (i.e SUBDAG and SUBDAX jobs),
#   the wall time value includes jobs from the sub workflows as well.
# Cumulative job badput wall time as seen from submit side:
#   The sum of the wall time of all failed jobs as reported by DAGMan.
#   This is similar to the regular cumulative job badput wall time, but includes
#   job management overhead and delays. In case of job retries the value
#   is the cumulative of all retries. For workflows having sub workflow
#   jobs (i.e SUBDAG and SUBDAX jobs), the wall time value includes jobs
#   from the sub workflows as well.

------------------------------------------------------------------------------
Type           Succeeded Failed  Incomplete  Total     Retries   Total+Retries
Tasks          8         0       0           8         0         8
Jobs           34        0       0           34        0         34
Sub-Workflows  0         0       0           0         0         0
------------------------------------------------------------------------------

Workflow cumulative job wall time                        : 8 mins, 5 secs
Cumulative job wall time as seen from submit side        : 8 mins, 35 secs
Workflow cumulative job badput wall time                 : 0
Cumulative job badput wall time as seen from submit side : 0

Note

When computing statistics over multiple workflows, note the following:

  1. All workflow run information should be populated in a single STAMPEDE database.

  2. The --output argument must be specified.

  3. Job statistics information is not computed.

  4. Workflow wall time information is not computed.

pegasus-statistics can also compute statistics over a set of specified workflow runs, by specifying either the submit directories or the workflow UUIDs.

pegasus-statistics -Dpegasus.monitord.output='<DB_URL>' -o <OUTPUT_DIR> <SUBMIT_DIR_1> <SUBMIT_DIR_2> .. <SUBMIT_DIR_n>

OR

pegasus-statistics -Dpegasus.monitord.output='<DB_URL>' -o <OUTPUT_DIR> --isuuid <UUID_1> <UUID_2> .. <UUID_n>
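For illustration, a filled-in version of the first form, reusing the database URL from the multi-workflow example above (the second submit directory is a placeholder for another run populated in the same database):

pegasus-statistics -Dpegasus.monitord.output='mysql://s_user:s_user123@127.0.0.1:3306/stampede' -o /scratch/workflow_1_2/statistics -s all /scratch/grid-setup/run0001/ /scratch/grid-setup/run0002/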

pegasus-statistics generates the following statistics files based on the command line options set.

6.2.1.1. Summary Statistics File [summary.txt]

The summary statistics are written to stdout by default, and can be written out to a file by providing the -s summary option.

  • Workflow summary - Summary of the workflow execution. In the case of a hierarchical workflow, the calculation shows the statistics across all the sub workflows. It shows the following statistics about tasks, jobs and sub workflows.

    • Succeeded - total count of succeeded tasks/jobs/sub workflows.

    • Failed - total count of failed tasks/jobs/sub workflows.

    • Incomplete - total count of tasks/jobs/sub workflows that are not in the succeeded or failed state. This includes all the jobs that are not submitted, submitted but not completed, etc. This is calculated as the difference between the 'total' count and the sum of the 'succeeded' and 'failed' counts.

    • Total - total count of tasks/jobs/sub workflows.

    • Retries - total retry count of tasks/jobs/sub workflows.

    • Total Run - total count of tasks/jobs/sub workflows executed during the workflow run. This is the sum of the succeeded, failed and retry counts (a worked example follows this list).

  • Workflow wall time - The wall time from the start of the workflow execution to the end, as reported by DAGMan. In the case of a rescue DAG, the value is the cumulative over all retries.

  • Workflow cumulative job wall time - The sum of the wall time of all jobs as reported by kickstart. In the case of job retries, the value is the cumulative over all retries. For workflows having sub workflow jobs (i.e. SUBDAG and SUBDAX jobs), the wall time value includes jobs from the sub workflows as well. This value is multiplied by the multiplier_factor in the job instance table.

  • Cumulative job wall time as seen from submit side - The sum of the wall time of all jobs as reported by DAGMan. This is similar to the regular cumulative job wall time, but includes job management overhead and delays. In case of job retries the value is the cumulative of all retries. For workflows having sub workflow jobs (i.e SUBDAG and SUBDAX jobs), the wall time value includes jobs from the sub workflows. This value is multiplied by the multiplier_factor in the job instance table.

  • Integrity Metrics

    • Number of files for which the checksum was compared against a previously computed or provided checksum, and the total duration in seconds spent doing so.

    • Number of files for which the checksum was generated during workflow execution, and the total duration in seconds spent doing so.
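As a worked example of the count columns, consider the Jobs row of the single-run summary shown earlier: Succeeded = 20, Failed = 0, Total = 20 and Retries = 0, so Incomplete = Total - (Succeeded + Failed) = 20 - (20 + 0) = 0, and Total Run (Total+Retries) = Succeeded + Failed + Retries = 20 + 0 + 0 = 20.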

6.2.1.2. Workflow statistics file per workflow [workflow.txt]

Workflow statistics file per workflow contains the following information about each workflow run. In the case of hierarchical workflows, the file contains a table for each sub workflow. The file also contains a 'Total' table at the bottom, which is the cumulative of all the individual statistics.

A sample table is shown below. It shows the following statistics about tasks, jobs and sub workflows.

  • Workflow retries - number of times a workflow was retried.

  • Succeeded - total count of succeeded tasks/jobs/sub workflows.

  • Failed - total count of failed tasks/jobs/sub workflows.

  • Incomplete - total count of tasks/jobs/sub workflows that are not in the succeeded or failed state. This includes all the jobs that are not submitted, submitted but not completed, etc. This is calculated as the difference between the 'total' count and the sum of the 'succeeded' and 'failed' counts.

  • Total - total count of tasks/jobs/sub workflows.

  • Retries - total retry count of tasks/jobs/sub workflows.

  • Total Run - total count of tasks/jobs/sub workflows executed during the workflow run. This is the sum of the succeeded, failed and retry counts.

Table 6.1. Workflow Statistics

# Type           Succeeded  Failed  Incomplete  Total  Retries  Total Run  Workflow Retries
2a6df11b-9972-4ba0-b4ba-4fd39c357af4                                       0
  Tasks          4          0       0           4      0        4
  Jobs           13         0       0           13     0        13
  Sub Workflows  0          0       0           0      0        0

6.2.1.3. Job statistics file per workflow [jobs.txt]

Job statistics file per workflow contains the following details about the job instances in each workflow. A sample file is shown below.

  • Job - the name of the job instance

  • Try - the number representing the job instance run count.

  • Site - the site where the job instance ran.

  • Kickstart(sec.) - the actual duration of the job instance in seconds on the remote compute node.

  • Mult - multiplier factor from the job instance table for the job.

  • Kickstart_Mult - value of the Kickstart column multiplied by Mult (a worked example follows Table 6.2).

  • CPU-Time - remote CPU time computed as the stime + utime (when Kickstart is not used, this is empty).

  • Post(sec.) - the postscript time as reported by DAGMan.

  • CondorQTime(sec.) - the time between submission by DAGMan and the remote Grid submission. It is an estimate of the time spent in the Condor queue on the submit node.

  • Resource(sec.) - the time between the remote Grid submission and the start of remote execution. It is an estimate of the time the job instance spent in the remote queue.

  • Runtime(sec.) - the time spent on the resource as seen by Condor DAGMan. It is always >= the Kickstart duration.

  • Seqexec(sec.) - the time taken for the completion of a clustered job instance.

  • Seqexec-Delay(sec.) - the difference between the completion time of a clustered job instance and the sum of the kickstart times of all its constituent tasks.

Table 6.2. Job statistics

Job Try Site Kickstart Mult Kickstart_Mult CPU-Time Post CondorQTime Resource Runtime Seqexec Seqexec-Delay
analyze_ID0000004 1 local 60.002 1 60.002 59.843 5.0 0.0 - 62.0 - -
create_dir_diamond_0_local 1 local 0.027 1 0.027 0.003 5.0 5.0 - 0.0 - -
findrange_ID0000002 1 local 60.001 10 600.01 59.921 5.0 0.0 - 60.0 - -
findrange_ID0000003 1 local 60.002 10 600.02 59.912 5.0 10.0 - 61.0 - -
preprocess_ID0000001 1 local 60.002 1 60.002 59.898 5.0 5.0 - 60.0 - -
register_local_1_0 1 local 0.459 1 0.459 0.432 6.0 5.0 - 0.0 - -
register_local_1_1 1 local 0.338 1 0.338 0.331 5.0 5.0 - 0.0 - -
register_local_2_0 1 local 0.348 1 0.348 0.342 5.0 5.0 - 0.0 - -
stage_in_local_local_0 1 local 0.39 1 0.39 0.032 5.0 5.0 - 0.0 - -
stage_out_local_local_0_0 1 local 0.165 1 0.165 0.108 5.0 10.0 - 0.0 - -
stage_out_local_local_1_0 1 local 0.147 1 0.147 0.098 7.0 5.0 - 0.0 - -
stage_out_local_local_1_1 1 local 0.139 1 0.139 0.089 5.0 6.0 - 0.0 - -
stage_out_local_local_2_0 1 local 0.145 1 0.145 0.101 5.0 5.0 - 0.0 - -
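For example, in Table 6.2 the job findrange_ID0000002 has Kickstart = 60.001 and Mult = 10, so Kickstart_Mult = 60.001 x 10 = 600.01, while analyze_ID0000004 shows Runtime = 62.0 >= Kickstart = 60.002, as expected. The Seqexec and Seqexec-Delay columns are '-' here because this sample run contains no clustered jobs; for a clustered job, Seqexec-Delay = Seqexec - (sum of the kickstart times of its constituent tasks).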

6.2.1.4. Transformation statistics file per workflow [breakdown.txt]

Transformation statistics file per workflow contains information about the invocations in each workflow grouped by transformation name. A sample file is shown below.

  • Transformation - name of the transformation.

  • Count - the number of times invocations with a given transformation name were executed.

  • Succeeded - the count of succeeded invocations with a given logical transformation name.

  • Failed - the count of failed invocations with a given logical transformation name.

  • Min (sec.) - the minimum runtime value of invocations with a given logical transformation name, times the multiplier_factor.

  • Max (sec.) - the maximum runtime value of invocations with a given logical transformation name, times the multiplier_factor.

  • Mean (sec.) - the mean of the invocation runtimes with a given logical transformation name, times the multiplier_factor.

  • Total (sec.) - the cumulative runtime of invocations with a given logical transformation name, times the multiplier_factor (a worked example follows Table 6.3).

Table 6.3. Transformation Statistics

Transformation Count Succeeded Failed Min Max Mean Total
dagman::post 13 13 0 5.0 7.0 5.231 68.0
diamond::analyze 1 1 0 60.002 60.002 60.002 60.002
diamond::findrange 2 2 0 600.01 600.02 600.02 1200.03
diamond::preprocess 1 1 0 60.002 60.002 60.002 60.002
pegasus::dirmanager 1 1 0 0.027 0.027 0.027 0.027
pegasus::pegasus-transfer 5 5 0 0.139 0.39 0.197 0.986
pegasus::rc-client 3 3 0 0.338 0.459 0.382 1.145
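As a worked example, the diamond::findrange row aggregates the two findrange invocations from Table 6.2: Total = 600.01 + 600.02 = 1200.03 and Mean = 1200.03 / 2 = 600.015, reported as 600.02. These values already include the multiplier_factor of 10 applied to the kickstart runtimes of 60.001 and 60.002 seconds.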

6.2.1.5. Time statistics file [time.txt]

Time statistics file contains job instance and invocation statistics information grouped by time and host. The time grouping can be by day or by hour. The file contains the following tables: Job instance statistics per day/hour, Invocation statistics per day/hour, Job instance statistics by host per day/hour and Invocation statistics by host per day/hour. A sample Invocation statistics by host per day table is shown below.

  • Job instance statistics per day/hour - the number of job instances run and the total runtime, sorted by day/hour.

  • Invocation statistics per day/hour - the number of invocations and the total runtime, sorted by day/hour.

  • Job instance statistics by host per day/hour - the number of job instances run and the total runtime on each host, sorted by day/hour.

  • Invocation statistics by host per day/hour - the number of invocations and the total runtime on each host, sorted by day/hour.

Table 6.4. Invocation statistics by host per day

Date [YYYY-MM-DD] Host Count Runtime (Sec.)
2011-07-15 butterfly.isi.edu 54 625.094

6.2.1.6. Integrity statistics file per workflow [integrity.txt]

Integrity statistics file contains integrity metrics grouped by file type (input or output) and integrity type (check or compute). A sample table is shown below. It shows the following statistics about integrity checks.

  • Type - the type of integrity metric. Check means checksum was compared for a file, and compute means a checksum was generated for a file.

  • File type - the type of file: input or output from a job perspective.

  • Count - the number of times an integrity check of the given type and file type was performed.

  • Total duration - the sum of the durations in seconds for the 'count' records matching the particular type and file-type combination.

Table 6.5. Integrity Statistics

# Type     File Type  Count  Total Duration
4555392d-1b37-407c-98d3-60fb86cb9d57
  check    input      5      0.164
  check    output     5      1.456
  compute  input      5      0.693
  compute  output     5      0.758


6.2.2. pegasus-plots

pegasus-plots generates graphs and charts to visualize the workflow execution. To generate the graphs and charts, run the command shown below.

$ pegasus-plots  -p all  /scratch/grid-setup/run0001/


...

******************************************** SUMMARY ********************************************

Graphs and charts generated by pegasus-plots can be viewed by opening the generated html file in the web browser  :
/scratch/grid-setup/run0001/plots/index.html

**************************************************************************************************

By default the output is generated in a plots folder inside the submit directory. The output that pegasus-plots produces is controlled by the value of the command line option -p (plotting_level). In the sample run, -p is set to 'all' to generate all the charts and graphs for the workflow run. Please consult the pegasus-plots man page for a detailed description of the various command line options. pegasus-plots generates an index.html file which provides links to all the generated charts and plots. A sample index.html page is shown below.

Figure 6.1. pegasus-plot index page

pegasus-plots generates the following plots and charts.

Dax Graph

Graph representation of the DAX file. A sample page is shown below.

Figure 6.2. DAX Graph

Dag Graph

Graph representation of the DAG file. A sample page is shown below.

Figure 6.3. DAG Graph

Gantt workflow execution chart

Gantt chart of the workflow execution run. A sample page is shown below.

Figure 6.4. Gantt Chart

The toolbar at the top provides zoom in/out, pan left/right/top/bottom, and show/hide job name functionality. The toolbar at the bottom can be used to show/hide job states. Failed job instances are shown with a red border in the chart. Clicking on a sub workflow job instance will take you to the corresponding sub workflow chart.

Host over time chart

Host over time chart of the workflow execution run. A sample page is shown below.

Figure 6.5. Host over time chart

The toolbar at the top provides zoom in/out, pan left/right/top/bottom, and show/hide host name functionality. The toolbar at the bottom can be used to show/hide job states. Failed job instances are shown with a red border in the chart. Clicking on a sub workflow job instance will take you to the corresponding sub workflow chart.

Time chart

Time chart shows job instance/invocation count and runtime of the workflow run over time. A sample page is shown below.

Figure 6.6. Time chart

The toolbar at the top provides zoom in/out and pan left/right/top/bottom functionality. The toolbar at the bottom can be used to switch between job instances/invocations and day/hour filtering.

Breakdown chart

Breakdown chart shows invocation count and runtime of the workflow run grouped by transformation name. A sample page is shown below.

Figure 6.7. Breakdown chart

The toolbar at the bottom can be used to switch between invocation count and runtime filtering. Legends can be clicked to get more details.