7. Submit Directory Details

This chapter describes the submit directory content after Pegasus has planned a workflow. Pegasus takes in an Abstract Workflow and generates an executable workflow (DAG) in the submit directory.

This document also describes the various Replica Selection Strategies in Pegasus.

7.1. Layout

Each executable workflow is associated with a submit directory, and includes the following:

  1. <wflabel-wfindex>.dag

    This is the Condor DAGMan dag file corresponding to the executable workflow generated by Pegasus. The dag file describes the edges in the DAG and information about the jobs in the DAG. A Pegasus-generated .dag file usually contains the following information for each job:

    1. The job submit file for each job in the DAG.

    2. The post script that is to be invoked when a job completes. This is usually located at $PEGASUS_HOME/bin/exitpost; it parses the kickstart record in the job’s .out file and determines the exitcode.

    3. JOB RETRY - the number of times the job is to be retried in case of failure. In Pegasus, the job postscript exits with a non-zero exitcode if it determines a failure occurred.

  2. <wflabel-wfindex>.dag.dagman.out

    When a DAG (.dag file) is executed by Condor DAGMan, DAGMan writes its output to the <wflabel-wfindex>.dag.dagman.out file. This file records the progress of the workflow and can be used to determine its status. Most Pegasus tools mine the dagman.out or jobstate.log file to determine the progress of workflows.

  3. <wflabel-wfindex>.static.bp

    This file contains netlogger events that link jobs in the DAG with the jobs in the Abstract Workflow. It is parsed by pegasus-monitord when a workflow starts, and its contents are populated into the stampede backend.

  4. <wflabel-wfindex>.stampede.db

    This is the stampede backend (a sqlite database) to which pegasus-monitord populates all the runtime provenance information from workflow dagman.out and job .out and .err files.

  5. <wflabel-wfindex>.replicas.db

    This is the default output replica catalog (a sqlite database) to which the registration jobs populate output file locations and associated metadata.

  6. <wflabel-wfindex>.replica.store

    This is a file based replica catalog that lists only the file locations mentioned in the Replica Catalog section of the Abstract Workflow, and is written out by the planner at mapping time. It is used to pass locations of files mentioned in the parent workflow to a sub workflow in case of hierarchical workflows.

  7. <wflabel-wfindex>.cache

    This is a cache file generated by the planner that records where all the files in the workflow will be placed on the staging site as the workflow executes. In hierarchical workflows, it is used to pass the locations of files in the parent workflow to the planner when it is invoked on the sub workflows.

  8. <wflabel-wfindex>.cache.meta

    This is a file populated by pegasus-exitcode to record the checksum information gleaned from parsing the kickstart output present in the job.out.* files. In hierarchical workflows, it is used to pass checksum information of files in the parent workflow to the planner when it is invoked on the sub workflows.

  9. <wflabel-wfindex>.metadata

    This is a workflow level metadata file in json format that is written out by the planner at mapping time. It lists all metadata for the workflow, jobs and files specified in the Abstract Workflow.

  10. <wflabel-wfindex>.metrics

    This is a json formatted file generated by the planner that includes some planning metrics and workflow level metrics, such as the different types of jobs and files.

  11. <wflabel-wfindex>.notify

    This file contains all the notifications that need to be set for the workflow and the jobs in the executable workflow. The format of the notify file is described here.

  12. <wflabel-wfindex>.dot

    Pegasus creates a dot file for the executable workflow in addition to the .dag file. This can be used to visualize the executable workflow using the dot program.

  13. jobstate.log

    The jobstate.log file is written out by the pegasus-monitord daemon that is launched when a workflow is submitted for execution by pegasus-run. The pegasus-monitord daemon parses the dagman.out file and writes out the jobstate.log that is easier to parse. The jobstate.log captures the various states through which a job goes during the workflow. There are other monitoring related files that are explained in the monitoring chapter.

  14. braindump.yml

    Contains information about pegasus version, dax file, dag file, dax label.

  15. <job>.sub

    Each job in the executable workflow is associated with its own submit file. The submit file tells Condor how to execute the job.

  16. <job>.out.00n

    The stdout of the executable referred to in the job submit file. In Pegasus, most jobs are launched via kickstart. Hence, this file contains the kickstart XML provenance record that captures runtime provenance on the remote node where the job was executed. n varies from 1-N where N is the JOB RETRY value in the .dag file. The exitpost executable is invoked on the <job>.out file and it moves the <job>.out to <job>.out.00n so that the job’s .out files are preserved across retries.

  17. <job>.err.00n

    The stderr of the executable referred to in the job submit file. In Pegasus, most jobs are launched via kickstart. Hence, this file contains the stderr of kickstart. It is usually empty unless there is an error in kickstart, e.g. kickstart segfaults, or the kickstart location specified in the submit file is incorrect. The exitpost executable is invoked on the <job>.out file and it moves the <job>.err to <job>.err.00n so that the job’s .err files are preserved across retries.

  18. <job>.meta

    This is a file created at runtime when pegasus-exitcode parses the kickstart output in the job.out file. This file records metadata and checksum information for output files created by the job and recorded by pegasus-kickstart.
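
The naming conventions in the list above can be sketched as a small lookup that labels a file found in a submit directory. This is only an illustration built from the names listed in this section; the function name classify and the description strings are hypothetical, and the three-digit retry suffix is an assumption based on the .00n pattern shown above.

```python
import re

# Files with a fixed name (descriptions paraphrase the list above).
FIXED_NAMES = {
    "jobstate.log": "job states written by pegasus-monitord",
    "braindump.yml": "workflow metadata written at planning time",
}

# Suffix -> description, taken from the layout list above.
SUFFIXES = {
    ".dag": "Condor DAGMan dag file",
    ".dag.dagman.out": "DAGMan progress log",
    ".static.bp": "netlogger events",
    ".stampede.db": "stampede backend (sqlite)",
    ".replicas.db": "output replica catalog (sqlite)",
    ".replica.store": "file based replica catalog",
    ".cache": "planner cache file",
    ".cache.meta": "checksum information for the cache",
    ".metadata": "workflow level metadata (json)",
    ".metrics": "planning metrics (json)",
    ".notify": "notifications",
    ".dot": "dot file of the executable workflow",
    ".sub": "job submit file",
    ".meta": "job output metadata and checksums",
}

# <job>.out.00n and <job>.err.00n (assumed: three digits).
RETRY_OUTPUT = re.compile(r"^(.+)\.(out|err)\.(\d{3})$")


def classify(filename):
    """Return a short description for a submit directory file name."""
    if filename in FIXED_NAMES:
        return FIXED_NAMES[filename]
    match = RETRY_OUTPUT.match(filename)
    if match:
        job, stream, attempt = match.groups()
        kind = "stdout" if stream == "out" else "stderr"
        return f"{kind} of job {job}, attempt suffix {attempt}"
    # Check the longest suffixes first so that ".cache.meta" wins over ".meta".
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if filename.endswith(suffix):
            return SUFFIXES[suffix]
    return "unrecognized"
```

Checking the longest suffix first matters because several names share an ending, for example <wflabel-wfindex>.cache.meta and <job>.meta.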

7.2. HTCondor DAGMan File

The Condor DAGMan file (.dag) is the input to Condor DAGMan (the workflow executor used by Pegasus).

A Pegasus-generated .dag file usually contains the following information for each job:

  1. The job submit file for each job in the DAG.

  2. The post script that is to be invoked when a job completes. This is usually found in $PEGASUS_HOME/bin/exitpost; it parses the kickstart record in the job’s .out file and determines the exitcode.

  3. JOB RETRY - the number of times the job is to be retried in case of failure. In Pegasus, the job postscript exits with a non-zero exitcode if it determines a failure occurred.

  4. The pre script to be invoked before running a job. This is usually present for the dax jobs in the DAX. The pre script is a pegasus-plan invocation for the subdax.

The last section of the DAG file lists the relations between the jobs, which identify the underlying DAG structure.

7.3. Sample Condor DAG File

#####################################################################
# PEGASUS WMS GENERATED DAG FILE
# DAG blackdiamond
# Index = 0, Count = 1
######################################################################

JOB create_dir_blackdiamond_0_isi_viz create_dir_blackdiamond_0_isi_viz.sub
SCRIPT POST create_dir_blackdiamond_0_isi_viz /pegasus/bin/pegasus-exitcode   \
                                   /submit-dir/create_dir_blackdiamond_0_isi_viz.out
RETRY create_dir_blackdiamond_0_isi_viz 3

JOB create_dir_blackdiamond_0_local create_dir_blackdiamond_0_local.sub
SCRIPT POST create_dir_blackdiamond_0_local /pegasus/bin/pegasus-exitcode   \
                                   /submit-dir/create_dir_blackdiamond_0_local.out

JOB pegasus_concat_blackdiamond_0 pegasus_concat_blackdiamond_0.sub

JOB stage_in_local_isi_viz_0 stage_in_local_isi_viz_0.sub
SCRIPT POST stage_in_local_isi_viz_0 /pegasus/bin/pegasus-exitcode   \
                                     /submit-dir/stage_in_local_isi_viz_0.out

JOB chmod_preprocess_ID000001_0 chmod_preprocess_ID000001_0.sub
SCRIPT POST chmod_preprocess_ID000001_0 /pegasus/bin/pegasus-exitcode \
                                        /submit-dir/chmod_preprocess_ID000001_0.out

JOB preprocess_ID000001 preprocess_ID000001.sub
SCRIPT POST preprocess_ID000001 /pegasus/bin/pegasus-exitcode   \
                                         /submit-dir/preprocess_ID000001.out

JOB subdax_black_ID000002 subdax_black_ID000002.sub
SCRIPT PRE subdax_black_ID000002 /pegasus/bin/pegasus-plan  \
      -Dpegasus.user.properties=/submit-dir/./dag_1/test_ID000002/pegasus.3862379342822189446.properties\
      -Dpegasus.log.*=/submit-dir/subdax_black_ID000002.pre.log \
      -Dpegasus.dir.exec=app_domain/app -Dpegasus.dir.storage=duncan -Xmx1024 -Xms512\
      --dir /pegasus-features/dax-3.2/dags \
      --relative-dir user/pegasus/blackdiamond/run0005/user/pegasus/blackdiamond/run0005/./dag_1 \
      --relative-submit-dir user/pegasus/blackdiamond/run0005/./dag_1/test_ID000002\
      --basename black --sites dax_site \
      --output local --force  --nocleanup  \
      --verbose  --verbose  --verbose  --verbose  --verbose  --verbose  --verbose \
      --verbose  --monitor  --deferred  --group pegasus --rescue 0 \
      --dax /submit-dir/./dag_1/test_ID000002/dax/blackdiamond_dax.xml

JOB stage_out_local_isi_viz_0_0 stage_out_local_isi_viz_0_0.sub
SCRIPT POST stage_out_local_isi_viz_0_0 /pegasus/bin/pegasus-exitcode   /submit-dir/stage_out_local_isi_viz_0_0.out

SUBDAG EXTERNAL subdag_black_ID000003 /Users/user/Pegasus/work/dax-3.2/black.dag DIR /duncan/test

JOB clean_up_stage_out_local_isi_viz_0_0 clean_up_stage_out_local_isi_viz_0_0.sub
SCRIPT POST clean_up_stage_out_local_isi_viz_0_0 /lfs1/devel/Pegasus/pegasus/bin/pegasus-exitcode  \
                                          /submit-dir/clean_up_stage_out_local_isi_viz_0_0.out

JOB clean_up_preprocess_ID000001 clean_up_preprocess_ID000001.sub
SCRIPT POST clean_up_preprocess_ID000001 /lfs1/devel/Pegasus/pegasus/bin/pegasus-exitcode  \
                                     /submit-dir/clean_up_preprocess_ID000001.out

PARENT create_dir_blackdiamond_0_isi_viz CHILD pegasus_concat_blackdiamond_0
PARENT create_dir_blackdiamond_0_local CHILD pegasus_concat_blackdiamond_0
PARENT stage_out_local_isi_viz_0_0 CHILD clean_up_stage_out_local_isi_viz_0_0
PARENT stage_out_local_isi_viz_0_0 CHILD clean_up_preprocess_ID000001
PARENT preprocess_ID000001 CHILD subdax_black_ID000002
PARENT preprocess_ID000001 CHILD stage_out_local_isi_viz_0_0
PARENT subdax_black_ID000002 CHILD subdag_black_ID000003
PARENT stage_in_local_isi_viz_0 CHILD chmod_preprocess_ID000001_0
PARENT stage_in_local_isi_viz_0 CHILD preprocess_ID000001
PARENT chmod_preprocess_ID000001_0 CHILD preprocess_ID000001
PARENT pegasus_concat_blackdiamond_0 CHILD stage_in_local_isi_viz_0
######################################################################
# End of DAG
######################################################################
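
To make the structure of the sample concrete, the sketch below reads the JOB, RETRY and PARENT ... CHILD lines of a DAG file into plain dictionaries. It is a minimal reader for the keywords shown in the sample, not a full DAGMan parser; the function name parse_dag and the shortened job names in SAMPLE are hypothetical. It assumes that a trailing backslash continues a line, as in the sample above.

```python
from collections import defaultdict

# Shortened, hypothetical job names; the layout follows the sample above.
SAMPLE = """\
JOB create_dir create_dir.sub
SCRIPT POST create_dir /pegasus/bin/pegasus-exitcode \\
    /submit-dir/create_dir.out
RETRY create_dir 3
JOB preprocess preprocess.sub
PARENT create_dir CHILD preprocess
"""


def parse_dag(text):
    """Return (jobs, retries, children) parsed from DAG file text."""
    # Join lines that end with a backslash into one logical line.
    logical_lines, pending = [], ""
    for raw in text.splitlines():
        line = raw.strip()
        if line.endswith("\\"):
            pending += line[:-1] + " "
            continue
        logical_lines.append(pending + line)
        pending = ""

    jobs = {}                     # job name -> submit file
    retries = {}                  # job name -> retry count
    children = defaultdict(list)  # parent job -> list of child jobs
    for line in logical_lines:
        tokens = line.split()
        if not tokens or tokens[0].startswith("#"):
            continue  # blank line or comment
        keyword = tokens[0]
        if keyword == "JOB":
            jobs[tokens[1]] = tokens[2]
        elif keyword == "RETRY":
            retries[tokens[1]] = int(tokens[2])
        elif keyword == "PARENT":
            split = tokens.index("CHILD")
            for parent in tokens[1:split]:
                children[parent].extend(tokens[split + 1:])
        # SCRIPT and SUBDAG lines are ignored by this sketch.
    return jobs, retries, dict(children)


jobs, retries, children = parse_dag(SAMPLE)
```

Jobs without a RETRY line are simply absent from retries, which mirrors the sample, where only some jobs carry one.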

7.4. Kickstart Record

Kickstart is a lightweight C executable that is shipped with the Pegasus worker package. All jobs are launched via Kickstart on the remote end, unless explicitly disabled at the time of running pegasus-plan.

Kickstart does not work with:

  1. Condor Standard Universe Jobs

  2. MPI Jobs

Pegasus automatically disables kickstart for the above jobs.

Kickstart captures useful runtime provenance information about the job it launches on the remote node, and puts it in a provenance record (XML in earlier releases, YAML starting with Pegasus 5.0) that it writes to its own stdout. The stdout appears in the workflow submit directory as <job>.out.00n. The following information is captured by kickstart and logged:

  1. The exitcode with which the job it launched exited.

  2. The duration of the job

  3. The start time for the job

  4. The node on which the job ran

  5. The stdout and stderr of the job

  6. The arguments with which it launched the job

  7. The environment that was set for the job before it was launched.

  8. The machine information about the node that the job ran on

Of the above, the job duration and start time can also be estimated from the dagman.out file, though only at a coarser granularity.

7.5. Reading a Kickstart Output File

Starting with Pegasus 5.0, pegasus-kickstart writes out the runtime provenance as a YAML document instead of the earlier XML formatted document. The kickstart file below has the following fields highlighted:

  1. The host on which the job executed and the IP address of that host

  2. The duration and start time of the job. The time here is in reference to the clock on the remote node where the job is executed.

  3. The exitcode with which the job executed

  4. The arguments with which the job was launched.

  5. The directory in which the job executed on the remote site

  6. The stdout of the job

  7. The stderr of the job

  8. The environment of the job

- invocation: True
  version: 3.0
  start: 2020-06-12T22:25:51.876-07:00
  duration: 60.039
  transformation: "diamond::preprocess:4.0"
  derivation: "ID0000001"
  resource: "CCG"
  wf-label: "blackdiamond"
  wf-stamp: "2020-06-12T22:24:09-07:00"
  interface: eth0
  hostaddr: 128.9.36.72
  hostname: compute-2.isi.edu
  pid: 10187
  uid: 579
  user: ptesting
  gid: 100
  group: users
  umask: 0o0022
  mainjob:
    start: 2020-06-12T22:25:51.913-07:00
    duration: 60.002
    pid: 10188
    usage:
      utime: 59.993
      stime: 0.002
      maxrss: 1312
      minflt: 394
      majflt: 0
      nswap: 0
      inblock: 0
      outblock: 16
      msgsnd: 0
      msgrcv: 0
      nsignals: 0
      nvcsw: 2
      nivcsw: 326
    status:
      raw: 0
      regular_exitcode: 0
    executable:
      file_name: /var/lib/condor/execute/dir_9997/pegasus.nInvqOjMu/diamond-preprocess-4_0
      mode: 0o100755
      size: 82976
      inode: 369207696
      nlink: 1
      blksize: 4096
      blocks: 168
      mtime: 2020-06-12T22:25:51-07:00
      atime: 2020-06-12T22:25:51-07:00
      ctime: 2020-06-12T22:25:51-07:00
      uid: 579
      user: ptesting
      gid: 100
      group: users
    argument_vector:
      - -a
      - preprocess
      - -T
      - 60
      - -i
      - f.a
      - -o
      - f.b1
      - f.b2
    procs:
  jobids:
    condor: 9774913.0
    gram: https://obelix.isi.edu:49384/16866322196481424206/5750061617434002842/
  cwd: /var/lib/condor/execute/dir_9997/pegasus.nInvqOjMu
  usage:
    utime: 0.004
    stime: 0.034
    maxrss: 816
    minflt: 1358
    majflt: 1
    nswap: 0
    inblock: 544
    outblock: 0
    msgsnd: 0
    msgrcv: 0
    nsignals: 0
    nvcsw: 4
    nivcsw: 3
  machine:
    page-size: 4096
    uname_system: linux
    uname_nodename: compute-2.isi.edu
    uname_release: 3.10.0-1062.4.1.el7.x86_64
    uname_machine: x86_64
    ram_total: 7990140
    ram_free: 3355064
    ram_shared: 0
    ram_buffer: 0
    swap_total: 0
    swap_free: 0
    cpu_count: 4
    cpu_speed: 2600
    cpu_vendor: GenuineIntel
    cpu_model: Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
    load_min1: 0.02
    load_min5: 0.06
    load_min15: 0.06
    procs_total: 215
    procs_running: 1
    procs_sleeping: 214
    procs_vmsize: 42446148
    procs_rss: 1722380
    task_total: 817
    task_running: 1
    task_sleeping: 816
  files:
    f.b2:
      lfn: "f.b2"
      file_name: /var/lib/condor/execute/dir_9997/pegasus.nInvqOjMu/f.b2
      mode: 0o100644
      size: 114
      inode: 369207699
      nlink: 1
      blksize: 4096
      blocks: 8
      mtime: 2020-06-12T22:25:51-07:00
      atime: 2020-06-12T22:25:51-07:00
      ctime: 2020-06-12T22:25:51-07:00
      uid: 579
      user: ptesting
      gid: 100
      group: users
      output: True
      sha256: deac67f380112ecfa4b65879846a5f27abd64c125c25f8958cb1be44decf567f
      checksum_timing: 0.019

    f.b1:
      lfn: "f.b1"
      file_name: /var/lib/condor/execute/dir_9997/pegasus.nInvqOjMu/f.b1
      mode: 0o100644
      size: 114
      inode: 369207698
      nlink: 1
      blksize: 4096
      blocks: 8
      mtime: 2020-06-12T22:25:51-07:00
      atime: 2020-06-12T22:25:51-07:00
      ctime: 2020-06-12T22:25:51-07:00
      uid: 579
      user: ptesting
      gid: 100
      group: users
      output: True
      sha256: deac67f380112ecfa4b65879846a5f27abd64c125c25f8958cb1be44decf567f
      checksum_timing: 0.018

    stdin:
      file_name: /dev/null
      mode: 0o20666
      size: 0
      inode: 1034
      nlink: 1
      blksize: 4096
      blocks: 0
      mtime: 2019-10-29T08:35:24-07:00
      atime: 2019-10-29T08:35:24-07:00
      ctime: 2019-10-29T08:35:24-07:00
      uid: 0
      user: root
      gid: 0
      group: root
    stdout:
      temporary_name: /var/lib/condor/execute/dir_9997/ks.out.1uMt3U
      descriptor: 3
      mode: 0o100600
      size: 0
      inode: 302035961
      nlink: 1
      blksize: 4096
      blocks: 0
      mtime: 2020-06-12T22:25:51-07:00
      atime: 2020-06-12T22:25:51-07:00
      ctime: 2020-06-12T22:25:51-07:00
      uid: 579
      user: ptesting
      gid: 100
      group: users
    data_truncated: false
    data: |
         Tue Oct  6 15:25:25 PDT 2020
    stderr:
      temporary_name: /var/lib/condor/execute/dir_9997/ks.err.ict5LD
      descriptor: 4
      mode: 0o100600
      size: 0
      inode: 302035962
      nlink: 1
      blksize: 4096
      blocks: 0
      mtime: 2020-06-12T22:25:51-07:00
      atime: 2020-06-12T22:25:51-07:00
      ctime: 2020-06-12T22:25:51-07:00
      uid: 579
      user: ptesting
      gid: 100
      group: users
    metadata:
      temporary_name: /var/lib/condor/execute/dir_9997/ks.meta.TplHum
      descriptor: 5
      mode: 0o100600
      size: 0
      inode: 302035963
      nlink: 1
      blksize: 4096
      blocks: 0
      mtime: 2020-06-12T22:25:51-07:00
      atime: 2020-06-12T22:25:51-07:00
      ctime: 2020-06-12T22:25:51-07:00
      uid: 579
      user: ptesting
      gid: 100
      group: users

For comparison, a kickstart record in the earlier XML format is shown below:

<?xml version="1.0" encoding="ISO-8859-1"?>

<invocation xmlns="http://pegasus.isi.edu/schema/invocation" \
     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" \
      xsi:schemaLocation="http://pegasus.isi.edu/schema/invocation http://pegasus.isi.edu/schema/iv-2.0.xsd" \
      version="2.0" start="2009-01-30T19:17:41.157-06:00" duration="0.321" transformation="pegasus::dirmanager"\
     derivation="pegasus::dirmanager:1.0" resource="cobalt" wf-label="scb" \
     wf-stamp="2009-01-30T17:12:55-08:00" hostaddr="141.142.30.219" hostname="co-login.ncsa.uiuc.edu"\
     pid="27714" uid="29548" user="vahi" gid="13872" group="bvr" umask="0022">

<mainjob start="2009-01-30T19:17:41.426-06:00" duration="0.052" pid="27783">

<usage utime="0.036" stime="0.004" minflt="739" majflt="0" nswap="0" nsignals="0" nvcsw="36" nivcsw="3"/>

<status raw="0"><regular exitcode="0"/></status>

<statcall error="0">
<!-- deferred flag: 0 -->
<file name="/u/ac/vahi/SOFTWARE/pegasus/default/bin/dirmanager">23212F7573722F62696E2F656E762070</file>
<statinfo mode="0100755" size="8202" inode="85904615883" nlink="1" blksize="16384" \
   blocks="24" mtime="2008-09-22T18:52:37-05:00" atime="2009-01-30T14:54:18-06:00" \
   ctime="2009-01-13T19:09:47-06:00" uid="29548" user="vahi" gid="13872" group="bvr"/>
</statcall>

<argument-vector>
<arg nr="1">--create</arg>
<arg nr="2">--dir</arg>
<arg nr="3">/u/ac/vahi/globus-test/EXEC/vahi/pegasus/scb/run0001</arg>
</argument-vector>

</mainjob>

<cwd>/u/ac/vahi/globus-test/EXEC</cwd>

<usage utime="0.012" stime="0.208" minflt="4232" majflt="0" nswap="0" nsignals="0" nvcsw="15" nivcsw="74"/>
<machine page-size="16384" provider="LINUX">
<stamp>2009-01-30T19:17:41.157-06:00</stamp>
<uname system="linux" nodename="co-login" release="2.6.16.54-0.2.5-default" machine="ia64">#1 SMP Mon Jan 21\
        13:29:51 UTC 2008</uname>
<ram total="148299268096" free="123371929600" shared="0" buffer="2801664"/>
<swap total="1179656486912" free="1179656486912"/>
<boot idle="1315786.920">2009-01-15T10:19:50.283-06:00</boot>
<cpu count="32" speed="1600" vendor=""></cpu>
<load min1="3.50" min5="3.50" min15="2.60"/>
<proc total="841" running="5" sleeping="828" stopped="5" vmsize="10025418752" rss="2524299264"/>
<task total="1125" running="6" sleeping="1114" stopped="5"/>
</machine>
<statcall error="0" id="stdin">
<!-- deferred flag: 0 -->
<file name="/dev/null"/>
<statinfo mode="020666" size="0" inode="68697" nlink="1" blksize="16384" blocks="0" \
    mtime="2007-05-04T05:54:02-05:00" atime="2007-05-04T05:54:02-05:00" \
  ctime="2009-01-15T10:21:54-06:00" uid="0" user="root" gid="0" group="root"/>
</statcall>

<statcall error="0" id="stdout">
<temporary name="/tmp/gs.out.s9rTJL" descriptor="3"/>
<statinfo mode="0100600" size="29" inode="203420686" nlink="1" blksize="16384" blocks="128" \
mtime="2009-01-30T19:17:41-06:00" atime="2009-01-30T19:17:41-06:00"\
ctime="2009-01-30T19:17:41-06:00" uid="29548" user="vahi" gid="13872" group="bvr"/>
<data>mkdir finished successfully.
</data>
</statcall>
<statcall error="0" id="stderr">
<temporary name="/tmp/gs.err.kobn3S" descriptor="5"/>
<statinfo mode="0100600" size="0" inode="203420689" nlink="1" blksize="16384" blocks="0" \
mtime="2009-01-30T19:17:41-06:00" atime="2009-01-30T19:17:41-06:00" \
ctime="2009-01-30T19:17:41-06:00" uid="29548" user="vahi" gid="13872" group="bvr"/>
</statcall>

<statcall error="0" id="gridstart">
<!-- deferred flag: 0 -->
<file name="/u/ac/vahi/SOFTWARE/pegasus/default/bin/kickstart">7F454C46020101000000000000000000</file>
<statinfo mode="0100755" size="255445" inode="85904615876" nlink="1" blksize="16384" blocks="504" \
 mtime="2009-01-30T18:06:28-06:00" atime="2009-01-30T19:17:41-06:00"\
ctime="2009-01-30T18:06:28-06:00" uid="29548" user="vahi" gid="13872" group="bvr"/>
</statcall>
<statcall error="0" id="logfile">
<descriptor number="1"/>
<statinfo mode="0100600" size="0" inode="53040253" nlink="1" blksize="16384" blocks="0" \
mtime="2009-01-30T19:17:39-06:00" atime="2009-01-30T19:17:39-06:00" \
ctime="2009-01-30T19:17:39-06:00" uid="29548" user="vahi" gid="13872" group="bvr"/>
</statcall>
<statcall error="0" id="channel">
<fifo name="/tmp/gs.app.Ien1m0" descriptor="7" count="0" rsize="0" wsize="0"/>
<statinfo mode="010640" size="0" inode="203420696" nlink="1" blksize="16384" blocks="0" \
 mtime="2009-01-30T19:17:41-06:00" atime="2009-01-30T19:17:41-06:00" \
ctime="2009-01-30T19:17:41-06:00" uid="29548" user="vahi" gid="13872" group="bvr"/>
</statcall>

<environment>
<env key="GLOBUS_GRAM_JOB_CONTACT">https://co-login.ncsa.uiuc.edu:50001/27456/1233364659/</env>
<env key="GLOBUS_GRAM_MYJOB_CONTACT">URLx-nexus://co-login.ncsa.uiuc.edu:50002/</env>
<env key="GLOBUS_LOCATION">/usr/local/prews-gram-4.0.7-r1/</env>
....
</environment>

<resource>
<soft id="RLIMIT_CPU">unlimited</soft>
<hard id="RLIMIT_CPU">unlimited</hard>
<soft id="RLIMIT_FSIZE">unlimited</soft>
....
</resource>
</invocation>

Note

pegasus-kickstart writes out the job environment only when the job exits with a failure (non-zero exitcode). To see the job environment for a successful job, pass the -f option to pegasus-kickstart.
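
As an illustration of reading the fields listed at the start of this section, the sketch below pulls the hostname, duration, exitcode, arguments and working directory out of an XML-format record using Python's standard library. RECORD is an abridged copy of the XML sample above with the display line continuations removed; the function name summarize is hypothetical. It does not cover the YAML format, which would need a YAML parser.

```python
import xml.etree.ElementTree as ET

# Namespace used by the XML-format kickstart record shown above.
NS = {"ks": "http://pegasus.isi.edu/schema/invocation"}

# Abridged from the XML sample above (most attributes and elements omitted).
RECORD = """
<invocation xmlns="http://pegasus.isi.edu/schema/invocation"
    version="2.0" start="2009-01-30T19:17:41.157-06:00" duration="0.321"
    transformation="pegasus::dirmanager" hostname="co-login.ncsa.uiuc.edu">
  <mainjob start="2009-01-30T19:17:41.426-06:00" duration="0.052" pid="27783">
    <status raw="0"><regular exitcode="0"/></status>
    <argument-vector>
      <arg nr="1">--create</arg>
      <arg nr="2">--dir</arg>
      <arg nr="3">/u/ac/vahi/globus-test/EXEC/vahi/pegasus/scb/run0001</arg>
    </argument-vector>
  </mainjob>
  <cwd>/u/ac/vahi/globus-test/EXEC</cwd>
</invocation>
"""


def summarize(xml_text):
    """Extract a few fields from an XML-format kickstart record."""
    root = ET.fromstring(xml_text.strip())
    mainjob = root.find("ks:mainjob", NS)
    regular = mainjob.find("ks:status/ks:regular", NS)
    return {
        "hostname": root.get("hostname"),
        "duration": float(mainjob.get("duration")),
        # No <regular> element found: report None rather than guess.
        "exitcode": int(regular.get("exitcode")) if regular is not None else None,
        "arguments": [a.text for a in mainjob.findall("ks:argument-vector/ks:arg", NS)],
        "cwd": root.findtext("ks:cwd", namespaces=NS),
    }


summary = summarize(RECORD)
```

The namespace mapping is needed because every element in the record lives in the invocation namespace declared on the root element.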

7.6. Jobstate.Log File

The jobstate.log file logs the various states that a job goes through during workflow execution. It is created by the pegasus-monitord daemon that is launched when a workflow is submitted to Condor DAGMan by pegasus-run. pegasus-monitord parses the dagman.out file and writes out the jobstate.log file, the format of which is more amenable to parsing.

Note

The jobstate.log file is not created if a user uses condor_submit_dag to submit a workflow to Condor DAGMan.

The jobstate.log file can be created after a workflow has finished executing by running pegasus-monitord on the .dagman.out file in the workflow submit directory.

Below is a snippet from the jobstate.log for a single job executed via Condor-G:

1239666049 create_dir_blackdiamond_0_isi_viz SUBMIT 3758.0 isi_viz - 1
1239666059 create_dir_blackdiamond_0_isi_viz EXECUTE 3758.0 isi_viz - 1
1239666059 create_dir_blackdiamond_0_isi_viz GLOBUS_SUBMIT 3758.0 isi_viz - 1
1239666059 create_dir_blackdiamond_0_isi_viz GRID_SUBMIT 3758.0 isi_viz - 1
1239666064 create_dir_blackdiamond_0_isi_viz JOB_TERMINATED 3758.0 isi_viz - 1
1239666064 create_dir_blackdiamond_0_isi_viz JOB_SUCCESS 0 isi_viz - 1
1239666064 create_dir_blackdiamond_0_isi_viz POST_SCRIPT_STARTED - isi_viz - 1
1239666069 create_dir_blackdiamond_0_isi_viz POST_SCRIPT_TERMINATED 3758.0 isi_viz - 1
1239666069 create_dir_blackdiamond_0_isi_viz POST_SCRIPT_SUCCESS - isi_viz - 1

Each entry in jobstate.log has the following:

  1. The timestamp at which the particular event happened, in seconds since the Unix epoch (for example, 1239666049).

  2. The name of the job.

  3. The event recorded by DAGMan for the job.

  4. The condor id of the job in the queue on the submit node.

  5. The pegasus site to which the job is mapped.

  6. The job time requirements from the submit file.

  7. The job submit sequence for this workflow.
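
The seven fields above map directly onto a whitespace split of each line. The sketch below is one way to read an entry into a named record; the class and field names are hypothetical labels for the list above, and lines that do not have exactly seven fields are skipped rather than interpreted.

```python
from typing import NamedTuple, Optional


class JobstateEntry(NamedTuple):
    timestamp: int         # seconds since the Unix epoch
    job: str               # name of the job
    event: str             # event recorded for the job
    condor_id: str         # condor id, or "-" when not applicable
    site: str              # pegasus site the job is mapped to
    time_requirement: str  # job time requirement, "-" when unset
    submit_sequence: int   # job submit sequence for this workflow


def parse_entry(line: str) -> Optional[JobstateEntry]:
    """Parse one jobstate.log line, or return None if it does not fit."""
    fields = line.split()
    if len(fields) != 7:
        return None
    ts, job, event, condor_id, site, time_req, seq = fields
    return JobstateEntry(int(ts), job, event, condor_id, site, time_req, int(seq))


entry = parse_entry(
    "1239666049 create_dir_blackdiamond_0_isi_viz SUBMIT 3758.0 isi_viz - 1"
)
```

The condor id is kept as a string because the same column holds "-" for some events and an exit code for others, as the snippet above shows.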

The job lifecycle when executed as part of the workflow

STATE/EVENT                                 DESCRIPTION
------------------------------------------  -----------
SUBMIT                                      job is submitted by condor schedd for execution.
EXECUTE                                     condor schedd detects that a job has started execution.
GLOBUS_SUBMIT                               the job has been submitted to the remote resource. It’s only written for GRAM jobs (i.e. gt2 and gt4).
GRID_SUBMIT                                 same as GLOBUS_SUBMIT event. The ULOG_GRID_SUBMIT event is written for all grid universe jobs.
JOB_TERMINATED                              job terminated on the remote node.
JOB_SUCCESS                                 job succeeded on the remote host, condor id will be zero (successful exit code).
JOB_FAILURE                                 job failed on the remote host, condor id will be the job’s exit code.
POST_SCRIPT_STARTED                         post script started by DAGMan on the submit host, usually to parse the kickstart output
POST_SCRIPT_TERMINATED                      post script finished on the submit node.
POST_SCRIPT_SUCCESS | POST_SCRIPT_FAILURE   post script succeeded or failed.

There are other monitoring related files that are explained in the monitoring chapter.

7.7. Pegasus Workflow Job States and Delays

The various job states that a job goes through (as captured in the dagman.out and jobstate.log files) during its lifecycle are illustrated below. The figure highlights the various local and remote delays during the job lifecycle.

[Figure: Pegasus workflow job states and delays]
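
Some of these delays can be approximated from jobstate.log by subtracting the timestamps of consecutive events. The sketch below does this for three intervals; the interval names (queue_delay, runtime, postscript) are illustrative labels chosen here and are not taken from the figure. It assumes one attempt per job, since it keeps only the first occurrence of each event.

```python
# Illustrative interval names mapped to (start event, end event).
INTERVALS = {
    "queue_delay": ("SUBMIT", "EXECUTE"),
    "runtime": ("EXECUTE", "JOB_TERMINATED"),
    "postscript": ("POST_SCRIPT_STARTED", "POST_SCRIPT_TERMINATED"),
}


def job_delays(lines):
    """Map each job name to its intervals in seconds."""
    first_seen = {}  # (job, event) -> timestamp of first occurrence
    for line in lines:
        fields = line.split()
        if len(fields) != 7:
            continue  # not a seven-field job entry
        first_seen.setdefault((fields[1], fields[2]), int(fields[0]))

    delays = {}
    for job in {job for job, _event in first_seen}:
        delays[job] = {
            name: first_seen[(job, end)] - first_seen[(job, start)]
            for name, (start, end) in INTERVALS.items()
            if (job, start) in first_seen and (job, end) in first_seen
        }
    return delays
```

Applied to the jobstate.log snippet in the previous section, this yields a queue delay of 10 seconds, a runtime of 5 seconds and a post script duration of 5 seconds for create_dir_blackdiamond_0_isi_viz.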

7.8. Braindump File

The braindump file is created per workflow in the submit directory and contains metadata about the workflow.

Information Captured in Braindump File

KEY                 DESCRIPTION
------------------  -----------
user                the username of the user that ran pegasus-plan
grid_dn             the Distinguished Name in the proxy
submit_hostname     the hostname of the submit host
root_wf_uuid        the workflow uuid of the root workflow
wf_uuid             the workflow uuid of the current workflow, i.e. the one in whose submit directory the braindump file resides
dax                 the path to the dax file
dax_label           the label attribute in the adag element of the dax
dax_index           the index in the dax
dax_version         the version of the DAX schema that the DAX refers to
pegasus_wf_name     the workflow name constructed by pegasus when planning
timestamp           the timestamp when planning occurred
basedir             the base submit directory
submit_dir          the full path for the submit directory
properties          the full path to the properties file in the submit directory
planner             the planner used to construct the executable workflow. always pegasus
planner_version     the version of the planner
pegasus_build       the build timestamp
planner_arguments   the arguments with which the planner is invoked
jsd                 the path to the jobstate file
rundir              the rundir in the numbering scheme for the submit directories
pegasushome         the root directory of the pegasus installation
vogroup             the vo group to which the user belongs. Defaults to pegasus
condor_log          the full path to condor common log in the submit directory
notify              the notify file that contains any notifications that need to be sent for the workflow
dag                 the basename of the dag file created
type                the type of executable workflow. Can be dag | shell | pmc

A Sample Braindump File is displayed below:

user vahi
grid_dn null
submit_hostname obelix
root_wf_uuid a4045eb6-317a-4710-9a73-96a745cb1fe8
wf_uuid a4045eb6-317a-4710-9a73-96a745cb1fe8
dax /data/scratch/vahi/examples/synthetic-scec/Test.dax
dax_label Stampede-Test
dax_index 0
dax_version 3.3
pegasus_wf_name Stampede-Test-0
timestamp 20110726T153746-0700
basedir /data/scratch/vahi/examples/synthetic-scec/dags
submit_dir /data/scratch/vahi/examples/synthetic-scec/dags/vahi/pegasus/Stampede-Test/run0005
properties pegasus.6923599674234553065.properties
planner /data/scratch/vahi/software/install/pegasus/default/bin/pegasus-plan
planner_version 3.1.0cvs
pegasus_build 20110726221240Z
planner_arguments "--conf ./conf/properties --dax Test.dax --sites local --output local --dir dags --force --submit "
jsd jobstate.log
rundir run0005
pegasushome /data/scratch/vahi/software/install/pegasus/default
vogroup pegasus
condor_log Stampede-Test-0.log
notify Stampede-Test-0.notify
dag Stampede-Test-0.dag
type dag
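
A file in the key/value layout of the sample above can be read with a few lines of code. The sketch below splits each line at the first space and strips surrounding double quotes; the function name parse_braindump is hypothetical, and the approach assumes the space-separated layout shown here (a braindump file written as YAML would need a YAML parser instead).

```python
# A few lines copied from the sample braindump file above.
SAMPLE = """\
user vahi
grid_dn null
root_wf_uuid a4045eb6-317a-4710-9a73-96a745cb1fe8
wf_uuid a4045eb6-317a-4710-9a73-96a745cb1fe8
dax_label Stampede-Test
planner_arguments "--conf ./conf/properties --dax Test.dax --sites local --output local --dir dags --force --submit "
dag Stampede-Test-0.dag
type dag
"""


def parse_braindump(text):
    """Return a dict of the key/value pairs in a braindump file."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # The key is everything up to the first space; the rest is the value.
        key, _sep, value = line.partition(" ")
        entries[key] = value.strip().strip('"')
    return entries


braindump = parse_braindump(SAMPLE)
# The current workflow is the root workflow when both uuids match.
is_root_workflow = braindump["root_wf_uuid"] == braindump["wf_uuid"]
```

Splitting only at the first space keeps values that themselves contain spaces, such as planner_arguments, in one piece.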

7.9. Pegasus static.bp File

Pegasus creates a workflow.static.bp file that links jobs in the DAG with the jobs in the DAX. The contents of the file are in netlogger format. The purpose of this file is to make it possible to link an invocation record of a task to the corresponding job in the DAX.

Here, workflow is replaced by the name of the workflow, i.e. the same prefix as the .dag file.

In the file there are five types of events:

  • task.info

    This event is used to capture information about all the tasks in the DAX (abstract workflow).

  • task.edge

    This event is used to capture information about the edges between the tasks in the DAX (abstract workflow).

  • job.info

    This event is used to capture information about the jobs in the DAG (executable workflow generated by Pegasus).

  • job.edge

    This event is used to capture information about edges between the jobs in the DAG (executable workflow).

  • wf.map.task_job

    This event is used to associate the tasks in the DAX with the corresponding jobs in the DAG.
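
A quick way to inspect such a file is to count the events of each type. The sketch below assumes that each line consists of whitespace-separated name=value pairs, with values containing spaces enclosed in double quotes, and that the event type is stored under the key event. Both points are assumptions about the netlogger layout, not something shown in this chapter. The attribute names and values in LINES (ts, task.id, job.id, note) are hypothetical, and real files may prefix the event names.

```python
import shlex
from collections import Counter

# Hypothetical lines for illustration only; not copied from a real static.bp file.
LINES = [
    'ts=2011-07-26T22:37:46Z event=task.info task.id=ID000001 note="two words"',
    "ts=2011-07-26T22:37:46Z event=job.info job.id=preprocess_ID000001",
    "ts=2011-07-26T22:37:46Z event=wf.map.task_job task.id=ID000001 job.id=preprocess_ID000001",
]


def parse_bp_line(line):
    """Split one line of name=value pairs into a dict."""
    # shlex.split keeps a quoted value with spaces together in one token.
    return dict(
        token.split("=", 1) for token in shlex.split(line) if "=" in token
    )


def count_events(lines):
    """Count how many times each event type occurs."""
    return Counter(
        parse_bp_line(line).get("event") for line in lines if line.strip()
    )


counts = count_events(LINES)
```

A count per event type gives a quick sanity check, for example that the number of wf.map.task_job events is plausible for the number of tasks in the abstract workflow.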