Chapter 1. Student notes for Pegasus WMS tutorial

1. Introduction

These are the student notes for the Pegasus WMS tutorial. They are designed to be used in conjunction with instructor presentation and support.

You will see two styles of machine text here:

Text like this is input that you should type.

Text like this is the output you should get.

For example:

$ date

Mon June 1 11:54:58 BST 2007

You will need to log into the tutorial machine, using an ssh client and the login name and password supplied separately.

For the purpose of this tutorial, replace any instance of trainXX with your viz-login username.

On Linux or Mac OS X, open a terminal window and type:

$ ssh trainXX@viz-login.isi.edu

[welcome message] 
trainXX@viz-login:~$

You will need to obtain grid credentials to run workflows on the grid.

You can generate your proxy using grid-proxy-init:

$ grid-proxy-init

Your identity: /O=edu/OU=ISI/OU=isi.edu/CN=Tutorial User 02
Creating proxy ........................................ Done
Your proxy is valid until: Sun Apr 27 08:05:18 2008

Check your proxy using grid-proxy-info.

$ grid-proxy-info

subject  : /O=edu/OU=ISI/OU=isi.edu/CN=Tutorial User 02/CN=2104110451
issuer   : /O=edu/OU=ISI/OU=isi.edu/CN=Tutorial User 02
identity : /O=edu/OU=ISI/OU=isi.edu/CN=Tutorial User 02
type     : Proxy draft (pre-RFC) compliant impersonation proxy
strength : 512 bits
path     : /tmp/x509up_u1044
timeleft : 11:57:20
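
Before starting a long workflow it can help to confirm that the proxy still has enough lifetime left. The following is a minimal sketch that reads the timeleft field from text shaped like the grid-proxy-info output above. The function name proxy_seconds_left and the two-hour threshold are illustrative assumptions and are not part of the Globus tools.

```python
def proxy_seconds_left(info_text):
    """Return the proxy's remaining lifetime in seconds, or None if absent."""
    for line in info_text.splitlines():
        # Split only at the first colon so the HH:MM:SS value stays intact.
        key, sep, value = line.partition(":")
        if sep and key.strip() == "timeleft":
            hours, minutes, seconds = (int(part) for part in value.strip().split(":"))
            return hours * 3600 + minutes * 60 + seconds
    return None


# Lines copied from the grid-proxy-info output shown above.
sample = """\
strength : 512 bits
path     : /tmp/x509up_u1044
timeleft : 11:57:20
"""

remaining = proxy_seconds_left(sample)
print(remaining)
# Assumed policy for illustration: require at least two hours of validity.
print(remaining is not None and remaining >= 2 * 3600)
```

In practice the text would come from running grid-proxy-info and capturing its standard output.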

2. Mapping and executing workflows using Pegasus WMS

In this chapter you will be introduced to planning and executing a workflow locally through Pegasus WMS. You will then plan and execute a larger Montage workflow on the grid.

All the exercises in this chapter will be run from the $HOME/pegasus-wms/ directory. All the required files reside in this directory.

$ cd $HOME/pegasus-wms

Files for the exercise are stored in subdirectories:

$ ls

config dags dax

You may also see some other files here.

2.1. Examine DAX

An abstract DAG has been generated for the Montage application and written in XML format to dax/montage.dax.

Open montage.dax in a file viewer:

$ cat dax/montage.dax

Inside the DAX, you should see three sections:

  1. A list of all the files used in the workflow.

  2. A definition of each job in the workflow.

  3. A list of control-flow dependencies. This section specifies the partial order in which the jobs are to be executed.
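
The three sections can also be pulled out programmatically. The sketch below is only an illustration: the embedded XML is a hypothetical, heavily trimmed DAX, and the element and attribute names it uses (filename, job, child, parent, file, id, name, ref) are assumptions about the schema that you should check against your own dax/montage.dax.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed DAX used only for illustration.
SAMPLE_DAX = """\
<adag name="diamond">
  <filename file="f.a" link="input"/>
  <filename file="f.d" link="output"/>
  <job id="ID000001" name="generate"/>
  <job id="ID000002" name="findrange"/>
  <child ref="ID000002">
    <parent ref="ID000001"/>
  </child>
</adag>
"""


def local_name(tag):
    """Reduce a namespaced tag such as '{uri}job' to 'job'."""
    return tag.rsplit("}", 1)[-1]


def summarize_dax(xml_text):
    """Return (files, jobs, edges) for the three sections of a DAX."""
    root = ET.fromstring(xml_text)
    files, jobs, edges = [], {}, []
    for element in root:
        kind = local_name(element.tag)
        if kind == "filename":
            files.append(element.get("file"))
        elif kind == "job":
            jobs[element.get("id")] = element.get("name")
        elif kind == "child":
            # Each <parent> inside a <child> is one parent -> child edge.
            for parent in element:
                if local_name(parent.tag) == "parent":
                    edges.append((parent.get("ref"), element.get("ref")))
    return files, jobs, edges


files, jobs, edges = summarize_dax(SAMPLE_DAX)
print(files)
print(jobs)
print(edges)
```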

2.2. Setting up the Replica Catalog

In this exercise you will insert entries into the Replica Catalog. The replica catalog that we will use today is a simple file-based catalog. For production runs, we also support and recommend Globus RLS or a JDBC implementation.

A Replica Catalog maintains the mapping from logical file names (LFNs) to physical file names (PFNs) for the input files of your workflow. Pegasus queries it to determine the locations of the raw input data files required by the workflow. Additionally, all the materialized data is registered in the replica catalog so that it can be reused later.

You can use the rc-client command to insert, query, and delete entries in the replica catalog.

The input data to be used for your workflow resides in the /scratch/tutorial/inputdata/0.2degree directory. We are going to insert entries into the replica catalog that point to the files in this directory.

The instructors have provided:

  • A file rc.in, the input data file for the rc-client that contains the mappings to be inserted into the replica catalog. The file is inside the config directory.

Instructions:

  • Let us see what the file looks like.

    $ cat config/rc.in
    
    # file-based replica catalog: 2007-06-02T13:11:35.954-07:00
    
    statfile_20070529_153243_22618.tbl
         gsiftp://viz-login.isi.edu/scratch/tutorial/inputdata/0.2degree/statfile.tbl
              pool="local"
    2mass-atlas-990502s-j1440198.fits
         gsiftp://viz-login.isi.edu/scratch/0.2degree/2mass-atlas-990502s-j1440198.fits
              pool="local"
    2mass-atlas-990502s-j1440186.fits
         gsiftp://viz-login.isi.edu/scratch/0.2degree/2mass-atlas-990502s-j1440186.fits
              pool="local"
    2mass-atlas-990502s-j1430092.fits
         gsiftp://viz-login.isi.edu/scratch/0.2degree/2mass-atlas-990502s-j1430092.fits
              pool="local"
    2mass-atlas-990502s-j1420198.fits
         gsiftp://viz-login.isi.edu/scratch/0.2degree/2mass-atlas-990502s-j1420198.fits
              pool="local"
    2mass-atlas-990502s-j1420186.fits
         gsiftp://viz-login.isi.edu/scratch/0.2degree/2mass-atlas-990502s-j1420186.fits
              pool="local"
    cimages_20070529_153243_22618.tbl
         gsiftp://viz-login.isi.edu/scratch/0.2degree/cimages.tbl pool="local"
    pimages_20070529_153243_22618.tbl
         gsiftp://viz-login.isi.edu/scratch/0.2degree/pimages.tbl pool="local"
    region_20070529_153243_22618.hdr
         gsiftp://viz-login.isi.edu/scratch/0.2degree/region.hdr pool="local"
    2mass-atlas-990502s-j1430080.fits
         gsiftp://viz-login.isi.edu/scratch/0.2degree/2mass-atlas-990502s-j1430080.fits
              pool="local"
  • Now we are ready to run rc-client and populate the catalog. Since each of you is registering unique LFNs, all 10 entries should be registered successfully.

    $ rc-client -Dpegasus.user.properties=config/properties --insert config/rc.in
    
    #Successfully worked on : 10 lines
    #Worked on total number of : 11 lines.
  • The entries have now been inserted into the Replica Catalog. Let us query the replica catalog for a particular LFN.

    $ rc-client -Dpegasus.user.properties=config/properties \
                       lookup pimages_20070529_153243_22618.tbl
    
    pimages_20070529_153243_22618.tbl
                 gsiftp://viz-login.isi.edu/scratch/0.2degree/pimages.tbl pool="local"

Congratulations! You have set up the replica catalog correctly. This is the catalog you will tinker with most while running Pegasus.
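
To make the LFN to PFN mapping concrete, here is a minimal sketch of what a lookup against a file-based catalog amounts to. It assumes one entry per line in the form LFN, PFN, then key="value" attributes (the listing above wraps each entry over several lines for display). The function name parse_replica_catalog is hypothetical and this is not how rc-client itself is implemented.

```python
import shlex


def parse_replica_catalog(text):
    """Map each LFN to a list of (PFN, attributes) pairs."""
    catalog = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # shlex removes the quotes, so pool="local" becomes pool=local.
        lfn, pfn, *attribute_tokens = shlex.split(line)
        attributes = dict(token.split("=", 1) for token in attribute_tokens)
        catalog.setdefault(lfn, []).append((pfn, attributes))
    return catalog


# Two entries taken from the rc.in listing above, joined onto single lines.
sample = """\
# file-based replica catalog
pimages_20070529_153243_22618.tbl gsiftp://viz-login.isi.edu/scratch/0.2degree/pimages.tbl pool="local"
region_20070529_153243_22618.hdr gsiftp://viz-login.isi.edu/scratch/0.2degree/region.hdr pool="local"
"""

catalog = parse_replica_catalog(sample)
for pfn, attributes in catalog["pimages_20070529_153243_22618.tbl"]:
    print(pfn, attributes)
```

A list of pairs is used per LFN because one logical file may have replicas at several sites.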

2.3. Setting up the Site Catalog and Transformation Catalog

In this exercise you will set up your Site Catalog and Transformation Catalog.

The instructors have provided:

  • A ready transformation catalog (tc.data) in the $HOME/pegasus-wms/config directory.

  • A partially complete properties file in the $HOME/pegasus-wms/config directory.

  • The site catalog contains information about the layout of the grid on which you want to run your workflows. For each site, it records information such as the work directory, the jobmanagers and GridFTP servers to use, and other site-wide information such as environment variables to be set.

    You can use the sc-client command to generate a site catalog from a handwritten sites.txt file.

    $ sc-client -f config/sites.txt -o config/sites.xml
    
    2007.06.02 17:06:12.215 PDT: [INFO] Reading config/sites.txt 
    2007.06.02 17:06:11.262 PDT: [INFO] Reading config/sites.txt (completed) 
    2007.06.02 17:06:11.276 PDT: [INFO] Written xml output to file : config/sites.xml
  • The transformation catalog maintains information about where the application code resides on the grid. In our case, it contains the locations where the Diamond or Montage code is installed on the various grid sites.

    Take a look at the Transformation Catalog:

    $ cat config/tc.data
    
    viz  bin/mDiff       
                 gsiftp://viz-login.isi.edu/nfs/software/montage/default/bin/mDiff       
                                    STATIC_BINARY   INTEL32::LINUX  ENV::MONTAGE_HOME="."
    viz   bin/mFitplane
                 gsiftp://viz-login.isi.edu/nfs/software/montage/default/bin/mFitplane
                                    STATIC_BINARY   INTEL32::LINUX  ENV::MONTAGE_HOME="."
    viz   mAdd:3.0  
                 gsiftp://viz-login.isi.edu/nfs/software/montage/default/bin/mAdd
                                    STATIC_BINARY   INTEL32::LINUX  ENV::MONTAGE_HOME="."
    viz   mBackground:3.0
                 gsiftp://viz-login.isi.edu/nfs/software/montage/default/bin/mBackground
                                    STATIC_BINARY   INTEL32::LINUX  ENV::MONTAGE_HOME="."
    viz   mBgModel:3.0
                 gsiftp://viz-login.isi.edu/nfs/software/montage/default/bin/mBgModel
                                    STATIC_BINARY   INTEL32::LINUX  ENV::MONTAGE_HOME="."
    viz   mConcatFit:3.0 
                 gsiftp://viz-login.isi.edu/nfs/software/montage/default/bin/mConcatFit
                                    STATIC_BINARY   INTEL32::LINUX  ENV::MONTAGE_HOME="."
    viz   mDiffFit:3.0
                 gsiftp://viz-login.isi.edu/nfs/software/montage/default/bin/mDiffFit 
                                    STATIC_BINARY   INTEL32::LINUX  ENV::MONTAGE_HOME="."
    viz   mImgtbl:3.0 
                 gsiftp://viz-login.isi.edu/nfs/software/montage/default/bin/mImgtbl  
                                    STATIC_BINARY   INTEL32::LINUX  ENV::MONTAGE_HOME="."
    viz   mJPEG:3.0       
                 gsiftp://viz-login.isi.edu/nfs/software/montage/default/bin/mJPEG  
                                    STATIC_BINARY   INTEL32::LINUX  ENV::MONTAGE_HOME="."
    viz   mProject:3.0 
                 gsiftp://viz-login.isi.edu/nfs/software/montage/default/bin/mProjectPP 
                                    STATIC_BINARY   INTEL32::LINUX  ENV::MONTAGE_HOME="."
    viz   mProjectPP:3.0
                 gsiftp://viz-login.isi.edu/nfs/software/montage/default/bin/mProjectPP
                                    STATIC_BINARY   INTEL32::LINUX  ENV::MONTAGE_HOME="."
    viz   mShrink:3.0
                 gsiftp://viz-login.isi.edu/nfs/software/montage/default/bin/mShrink
                                    STATIC_BINARY   INTEL32::LINUX  NULL
  • Open the properties file and check a few properties.

    $ cat config/properties
    
    ## SELECT THE REPLICA CATALOG MODE AND URL
    pegasus.catalog.replica = SimpleFile
    pegasus.catalog.replica.file = ${user.home}/pegasus-wms/config/rc.data
    
    ## SELECT THE SITE CATALOG MODE AND FILE
    pegasus.catalog.site = XML
    pegasus.catalog.site.file = ${user.home}/pegasus-wms/config/sites.xml
    
    ## SELECT THE TRANSFORMATION CATALOG MODE AND FILE
    pegasus.catalog.transformation = File
    pegasus.catalog.transformation.file = ${user.home}/pegasus-wms/config/tc.data
    
    ## SET UP THE WORK AND INVOCATION DATABASE
    pegasus.catalog.work = Database
    pegasus.catalog.provenance = InvocationSchema
    
    ## Database related properties
    pegasus.catalog.*.db.driver = MySQL
    pegasus.catalog.*.db.url = jdbc:mysql://smarty.isi.edu/tg2007
    pegasus.catalog.*.db.user = tg2007user
    pegasus.catalog.*.db.password = Teragrid2007
    
    ## USE DAGMAN RETRY FEATURE FOR FAILURES
    pegasus.dagman.retry=2
    
    ## STAGE ALL OUR EXECUTABLES
    pegasus.catalog.transformation.mapper = Staged
    
    ## CHECK JOB EXIT CODES FOR FAILURE
    pegasus.exitcode.scope=all
    
    ## OPTIMIZE DATA AND EXECUTABLE TRANSFERS
    pegasus.transfer.refiner=Bundle
    
    ## STAGE DATA AND EXECUTABLES USING GRIDFTP 3rd PARTY MODE
    pegasus.transfer.*.thirdparty.sites=*
    
    ## WORK AND STORAGE DIR
    
    pegasus.dir.storage = ${user.home}/storage
    pegasus.dir.exec = ${user.home}/exec
  • The pegasus-get-sites client can also be used to generate a site catalog and transformation catalog for the Open Science Grid.

    $ pegasus-get-sites --source vors --grid osg --vo osgedu --sc /nfs/home/trainXX/sc.xml \
                        --tc /nfs/home/trainXX/tc.data
    
    # using default transformation mappings.
    # Querying the source "vors" for site information
    # assembling information for grid "osg".
    # site BNL_ATLAS_1  33 is ACCESSIBLE
    # site NYSGRID-CCR-U2  11 is ACCESSIBLE
    # site WISC-OSG-EDU  342 is ACCESSIBLE
    # site CIT_CMS_T2:srm_v1  263 is ACCESSIBLE
    # substituting OSG_GRID value to  for site  
    # site UTA_SWT2  347 is ACCESSIBLE
    [...]
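
Several values in the properties file shown earlier in this section refer to ${user.home}. The sketch below shows one way such a file could be read and the references expanded. It is a simplification under stated assumptions: real Java properties files have additional rules (line continuations, other separators) that are ignored here, and the function name load_properties is hypothetical.

```python
import re


def load_properties(text, system_properties):
    """Parse 'key = value' lines and expand ${name} references."""
    properties = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        properties[key.strip()] = value.strip()

    def expand(value):
        # Replace ${name} with a known value; leave unknown names untouched.
        return re.sub(
            r"\$\{([^}]+)\}",
            lambda match: system_properties.get(match.group(1), match.group(0)),
            value,
        )

    return {key: expand(value) for key, value in properties.items()}


# A few lines in the style of config/properties.
sample = """\
## WORK AND STORAGE DIR
pegasus.catalog.site = XML
pegasus.dir.storage = ${user.home}/storage
pegasus.dagman.retry=2
"""

props = load_properties(sample, {"user.home": "/nfs/home/trainXX"})
print(props["pegasus.dir.storage"])
```

A reference is only expanded when it uses matching braces, which is why a malformed reference would be passed through as literal text.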

2.4. Planning a workflow using pegasus-plan and running it locally using pegasus-run

In this exercise we are going to run pegasus-plan to generate a concrete workflow from the abstract workflow (diamond.dax). The generated concrete workflow consists of Condor submit files, which are submitted locally using pegasus-run.

The instructors have provided:

  • A dax (diamond.dax) in the $HOME/pegasus-wms/dax directory.

You will need to carry out the following steps yourself, using the instructions below:

  • Run pegasus-plan to generate the Condor submit files from the DAX.

  • Run pegasus-run to submit the workflow locally.

Instructions:

  • Let us run pegasus-plan on the diamond dax.

    $ pegasus-plan -Dpegasus.user.properties=/nfs/home/trainXX/pegasus-wms/config/properties \
                   --dax /nfs/home/trainXX/pegasus-wms/dax/diamond.dax \
                   --dir dags -s local -o local --nocleanup

    The above command says that we need to plan the diamond DAX locally. The Condor submit files are to be generated in a directory structure whose base is dags. We are also requesting that no cleanup jobs be added, as we require the intermediate data to be saved. Here is the output of pegasus-plan:

    2008.01.28 15:00:49.536 PST: [INFO] Parsing the DAX 
    2008.01.28 15:00:50.063 PST: [INFO] Parsing the DAX (completed)
    2008.01.28 15:00:50.174 PST: [INFO] Parsing the site catalog
    2008.01.28 15:00:50.327 PST: [INFO] Parsing the site catalog (completed) 
    2008.01.28 15:00:50.394 PST: [INFO] Doing site selection
    2008.01.28 15:00:50.436 PST: [INFO] Doing site selection (completed)
    2008.01.28 15:00:50.437 PST: [INFO] Grafting transfer nodes in the workflow 
    2008.01.28 15:00:50.508 PST: [INFO] Grafting transfer nodes in the workflow (completed) 
    2008.01.28 15:00:50.523 PST: [INFO] Grafting the remote workdirectory creation jobs 
                                        in the workflow
    2008.01.28 15:00:50.537 PST: [INFO] Grafting the remote workdirectory creation jobs 
                                        in the workflow (completed) 
    2008.01.28 15:00:50.538 PST: [INFO] Generating the cleanup workflow 
    2008.01.28 15:00:50.542 PST: [INFO] Generating the cleanup workflow (completed)
    2008.01.28 15:00:50.563 PST: [INFO] Generating codes for the concrete workflow 
    2008.01.28 15:00:50.684 PST: [INFO] Generating codes for the concrete workflow (completed)
    2008.01.28 15:00:50.684 PST: [INFO] Generating code for the cleanup workflow 
    2008.01.28 15:00:50.718 PST: [INFO] Generating code for the cleanup workflow (completed) 
    2008.01.28 15:00:51.087 PST: [INFO] I have concretized your abstract workflow.
    
    The workflow has been entered into the workflow database with a state of "planned".
    The next step is to start or execute your workflow. The invocation required is
              
    pegasus-run -Dpegasus.user.properties=/home/trainXX/pegasus-wms/dags/trainXX/pegasus\
    /diamond/run0001/pegasus.7543.properties \
              /home/trainXX/pegasus-wms/dags/trainXX/pegasus/diamond/run0001
  • Now run pegasus-run as mentioned in the output of pegasus-plan. Do not copy the command below; it is shown for illustration only.

    $ pegasus-run -Dpegasus.user.properties=/nfs/home/trainXX/pegasus-wms/dags/trainXX\
    /pegasus/diamond/run0001/pegasus.7543.properties \
              /nfs/home/trainXX/pegasus-wms/dags/trainXX/pegasus/diamond/run0001
    
    Checking all your submit files for log file names.
    This might take a while... Done.
    
    -----------------------------------------------------------------------
    File for submitting this DAG to Condor : diamond-0.dag.condor.sub
    Log of DAGMan debugging messages : diamond-0.dag.dagman.out
    Log of Condor library output : diamond-0.dag.lib.out
    Log of Condor library error messages : diamond-0.dag.lib.err
    Log of the life of condor_dagman itself : diamond-0.dag.dagman.log
    Condor Log file for all jobs of this DAG : /tmp/diamond-07544.log
         -no_submit given, not submitting DAG to Condor.
    You can do this with: "condor_submit diamond-0.dag.condor.sub"
    -----------------------------------------------------------------------
    Submitting job(s). Logging submit event(s). 
    1 job(s) submitted to cluster 20068. 
    I have started your workflow, committed it to DAGMan,
    and updated its state in the work database. 
    A separate daemon was started to collect information about the progress of the workflow.
    The job state will soon be visible. Your workflow runs in base directory.
    
    cd /nfs/home/trainXX/pegasus-wms/dags/trainXX/pegasus/diamond/run0001
    
    *** To monitor the workflow you can run ***
    pegasus-status -w diamond-0 -t 20080128T150049-0800
      or
    pegasus-status /nfs/home/trainXX/pegasus-wms/dags/trainXX/pegasus/diamond/run0001 
    
    *** To remove your workflow run ***
    pegasus-remove -d 20068.0 
      or 
    pegasus-remove /nfs/home/trainXX/pegasus-wms/dags/trainXX/pegasus/diamond/run0001
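
If you plan the same DAX repeatedly, it can be convenient to assemble the pegasus-plan command from its parts. The following sketch only builds and prints the argument list; it uses just the options that appear in this tutorial, and the function name plan_command and its parameters are hypothetical.

```python
import shlex


def plan_command(properties, dax, base_dir, sites, output_site, cleanup=False):
    """Build the pegasus-plan argument list used in this exercise."""
    command = [
        "pegasus-plan",
        f"-Dpegasus.user.properties={properties}",
        "--dax", dax,
        "--dir", base_dir,
        "--sites", ",".join(sites),  # several sites become a comma-separated list
        "--output", output_site,
    ]
    if not cleanup:
        command.append("--nocleanup")  # keep the intermediate data
    return command


command = plan_command(
    properties="/nfs/home/trainXX/pegasus-wms/config/properties",
    dax="/nfs/home/trainXX/pegasus-wms/dax/diamond.dax",
    base_dir="dags",
    sites=["local"],
    output_site="local",
)
print(shlex.join(command))
```

To run the command rather than print it, the same list could be passed to subprocess.run on a machine where pegasus-plan is installed.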

2.5. Tracking the progress of the workflow and debugging workflows

In this exercise we are going to list ways to track your workflow, and give some debugging hints for when something goes wrong.

Change into the directory that was mentioned in the output of the pegasus-run command.

$ cd /nfs/home/trainXX/pegasus-wms/dags/trainXX/pegasus/diamond/runXXXX

In this directory you will see a large number of files. That should not scare you. Unless things go wrong, you need to look at only a few files to track the progress of the workflow.

  • Run the command pegasus-status as mentioned by pegasus-run above to check the status of your jobs. Use the watch command to auto repeat the command every 2 seconds.

    $ watch pegasus-status /nfs/home/trainXX/pegasus-wms/dags/trainXX/pegasus/diamond/runXXXX
    
    -- Submitter: viz-login.isi.edu : <128.9.72.178:46426> : viz-login.isi.edu 
    ID     OWNER/NODENAME  SUBMITTED RUN_TIME ST PRI SIZE CMD
    19982.0 train01 1/28 13:58 0+00:03:20 R 0 9.8 condor_dagman -f - 
    19986.0 |-findrange 1/28 14:02 0+00:00:00 I 0 9.8 kickstart -n pegas 
    19987.0 |-findrange 1/28 14:02 0+00:00:00 I 0 9.8 kickstart -n pegas

    The above output shows a couple of jobs queued under the main DAGMan process. Keep watching the output to track whether the workflow is still running. If you do not see any of your jobs in the output for some time (say 30 seconds), the workflow has finished. You need to wait because there may be a delay between one job finishing successfully and Condor DAGMan releasing the next job into the queue.

    If the output of pegasus-status is empty, then your workflow has either completed successfully or stopped midway because of a non-recoverable error.

  • Another way to monitor the workflow is to check the jobstate.log file. This is the output file of the monitoring daemon that is parsing all the condor log files to determine the status of the jobs. It logs the events seen by Condor into a more readable form for us.

    $ more jobstate.log
    
    1201557528 INTERNAL *** TAILSTATD_STARTED ***
    1201557528 INTERNAL *** DAGMAN_STARTED *** 
    1201557528 generate_ID000001_0 UN_READY - - - 
    1201557528 findrange_ID000003_0 UN_READY - - - 
    1201557528 findrange_ID000002_0 UN_READY - - - 
    1201557528 analyze_ID000004_0 UN_READY - - -
    [..]

    At the start of jobstate.log, when the workflow has just started running, you will see many entries with the status UN_READY. This means that DAGMan has just parsed the .dag file and has not yet started working on any job. Initially all the jobs in the workflow are listed as UN_READY. After some time you will see entries in jobstate.log showing that a job is being executed, and so on.

    1201557747 generate_ID000001_0 EXECUTE 19996.0 local - 
    1201557747 generate_ID000001_0 GLOBUS_SUBMIT 19996.0 local - 
    1201557812 generate_ID000001_0 JOB_TERMINATED 19996.0 local - 
    1201557812 generate_ID000001_0 POST_SCRIPT_STARTED - local - 
    1201557817 generate_ID000001_0 POST_SCRIPT_TERMINATED 19996.0 local - 
    1201557817 generate_ID000001_0 POST_SCRIPT_SUCCESS - local -

    The above shows the job being submitted and then executed on the grid. It also shows that the job is being run on the grid site local (which is your submit machine). The various states that the job passes through, from submission to execution to post-processing, are in UPPERCASE.

  • Successfully Completed : Let us again look at the jobstate.log. This time we need to look at the last few lines of jobstate.log

    $ tail jobstate.log
    
    1201559232 analyze_ID000004 JOB_TERMINATED 20023.0 local - 
    1201559232 analyze_ID000004 POST_SCRIPT_STARTED - local -
    1201559238 analyze_ID000004 POST_SCRIPT_TERMINATED 20023.0 local -
    1201559238 analyze_ID000004 POST_SCRIPT_SUCCESS - local -
    1201559238 INTERNAL *** DAGMAN_FINISHED ***
    1201559239 INTERNAL *** TAILSTATD_FINISHED 0 ***

    Looking at the last two lines we see that DAGMan finished, and tailstatd finished successfully with a status of 0. This means the workflow ran successfully. Congratulations, you ran your workflow on the local site successfully. The workflow generates a single output file f.d that resides in the directory /scratch/trainXX/storage, where trainXX is your user id.

    To view the file, you can copy f.d to your viz-login webspace, and view it in your web browser:

    $ cp /scratch/trainXX/storage/f.d ~/public_html

    Point your web browser to: http://viz-login.isi.edu/~trainXX/f.d where trainXX is your viz-login user id.

  • Unsuccessfully Completed (Workflow execution stopped midway) : Let us again look at the jobstate.log. Again we need to look at the last few lines of jobstate.log

    $ tail jobstate.log
    
    1180840233 analyze_ID000004_0 JOB_TERMINATED 2787.0 local -
    1180840233 analyze_ID000004_0 POST_SCRIPT_STARTED - local - 
    1180840238 analyze_ID000004_0 POST_SCRIPT_TERMINATED 2787.0 local - 
    1180840238 analyze_ID000004_0 POST_SCRIPT_FAILURE 1 local -
    1180840373 INTERNAL *** DAGMAN_FINISHED *** 
    1180840373 INTERNAL *** TAILSTATD_FINISHED 1 ***

    Looking at the last two lines we see that DAGMan finished, and tailstatd finished unsuccessfully with a status of 1. We can easily determine which job failed: it is analyze_ID000004_0 in this case, because it logged POST_SCRIPT_FAILURE. To determine the reason for the failure we need to look at its kickstart output file, which is named JOBNAME.out.NNN, where NNN is a three-digit number starting at 000.
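
The checks described in this section can be automated. The sketch below scans jobstate.log text for the TAILSTATD_FINISHED status and for jobs that logged POST_SCRIPT_FAILURE. The field layout is inferred from the excerpts above, and the function name summarize_jobstate is hypothetical. Note that it reports every job that ever logged a failure, including jobs that later succeeded on a retry.

```python
def summarize_jobstate(text):
    """Return (tailstatd_exit_status, failed_jobs) from jobstate.log text."""
    exit_status = None
    failed_jobs = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        _timestamp, job, state = fields[:3]
        if job == "INTERNAL":
            # e.g. "<time> INTERNAL *** TAILSTATD_FINISHED 1 ***"
            if "TAILSTATD_FINISHED" in fields:
                position = fields.index("TAILSTATD_FINISHED")
                exit_status = int(fields[position + 1])
        elif state == "POST_SCRIPT_FAILURE" and job not in failed_jobs:
            failed_jobs.append(job)
    return exit_status, failed_jobs


# The tail of the failed run shown above.
sample = """\
1180840233 analyze_ID000004_0 JOB_TERMINATED 2787.0 local -
1180840233 analyze_ID000004_0 POST_SCRIPT_STARTED - local -
1180840238 analyze_ID000004_0 POST_SCRIPT_TERMINATED 2787.0 local -
1180840238 analyze_ID000004_0 POST_SCRIPT_FAILURE 1 local -
1180840373 INTERNAL *** DAGMAN_FINISHED ***
1180840373 INTERNAL *** TAILSTATD_FINISHED 1 ***
"""

status, failed = summarize_jobstate(sample)
print(status, failed)
```

A status of None means the log contains no TAILSTATD_FINISHED line yet, which suggests the workflow is still running.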

2.6. Condor DAGMan format and log files etc.

In this exercise we will learn about the DAG file format and some of the log files generated when the DAG runs.

  • Now take a look at the DAG file...

    $ cat dags/trainXX/pegasus/diamond/run0001/diamond-0.dag
    
    ######################################################################
    # PEGASUS GENERATED SUBMIT FILE 
    # DAG diamond 
    # Index = 0, Count = 1
    ######################################################################
    JOB generate_ID000001 generate_ID000001.sub
    RETRY generate_ID000001 2
    
    JOB findrange_ID000002 findrange_ID000002.sub
    RETRY findrange_ID000002 2
    
    JOB findrange_ID000003 findrange_ID000003.sub
    RETRY findrange_ID000003 2
    
    JOB analyze_ID000004 analyze_ID000004.sub
    RETRY analyze_ID000004 2
    
    JOB diamond_0_pegasus_concat diamond_0_pegasus_concat.sub 
    
    JOB diamond_0_local_cdir diamond_0_local_cdir.sub 
    SCRIPT POST diamond_0_local_cdir /nfs/software/pegasus/default/bin/exitpost 
          -Dpegasus.user.properties=/nfs/home/trainXX/pegasus-wms/dags/trainXX/pegasus
    /diamond/run0001/pegasus.31433.properties
    /home/trainXX/pegasus-wms/dags/trainXX/pegasus/diamond/run0001/diamond_0_local_cdir.out
              
    RETRY diamond_0_local_cdir 2 
    
    PARENT generate_ID000001 CHILD findrange_ID000002 
    PARENT generate_ID000001 CHILD findrange_ID000003
    PARENT findrange_ID000002 CHILD analyze_ID000004
    PARENT findrange_ID000003 CHILD analyze_ID000004
    PARENT diamond_0_pegasus_concat CHILD generate_ID000001
    PARENT diamond_0_local_cdir CHILD diamond_0_pegasus_concat
    ######################################################################
    # End of DAG
    ######################################################################
  • ... and the dagman.out file.

    $ cat dags/trainXX/pegasus/diamond/run0001/diamond-0.dag.dagman.out
    
    1/29 10:32:14 ******************************************************
    1/29 10:32:14 ** condor_scheduniv_exec.20133.0 (CONDOR_DAGMAN) STARTING UP
    1/29 10:32:14 ** /nfs/software/condor/6.9.4/bin/condor_dagman
    1/29 10:32:14 ** $CondorVersion: 6.9.4 Aug 30 2007 $
    1/29 10:32:14 ** $CondorPlatform: I386-LINUX_RHEL3 $
    1/29 10:32:14 ** PID = 3202
    1/29 10:32:14 ** Log last touched time unavailable (No such file or directory)
    1/29 10:32:14 ******************************************************
    [....]
    1/29 10:32:27 Submitting Condor Node diamond_0_local_cdir job(s)...
    1/29 10:32:27 submitting: condor_submit 
                  -a dag_node_name' '=' 'diamond_0_local_cdir 
                  -a +DAGManJobId' '=' '20133 
                  -a DAGManJobId' '=' '20133
                  -a submit_event_notes' '=' 'DAG' 'Node:''diamond_0_local_cdir 
                  -a +DAGParentNodeNames' '=' '""diamond_0_local_cdir.sub
    1/29 10:32:27 From submit: Submitting job(s).
    1/29 10:32:27 From submit: Logging submit event(s).
    1/29 10:32:27 From submit: 1 job(s) submitted to cluster 20134.
    1/29 10:32:27 assigned Condor ID (20134.0)
    1/29 10:32:27 Just submitted 1 job this cycle...
    1/29 10:32:27 Event: ULOG_SUBMIT for Condor Node diamond_0_local_cdir (20134.0)
    1/29 10:32:27 Number of idle job procs: 1
    1/29 10:32:27 Of 6 nodes total:
    1/29 10:32:27 Done Pre Queued Post Ready Un-Ready Failed
    1/29 10:32:27 === === === === === === ===
    1/29 10:32:27 0 0 1 0 0 5 0
    [....]
    1/29 10:33:24 Done Pre Queued Post Ready Un-Ready Failed
    1/29 10:33:24 === === === === === === ===
    1/29 10:33:24 6 0 0 0 0 0 0
    1/29 10:33:24 All jobs Completed!
    1/29 10:33:24 Note: 0 total job deferrals because of -MaxJobs limit (0)
    1/29 10:33:24 Note: 0 total job deferrals because of -MaxIdle limit (0)
    1/29 10:33:24 Note: 0 total PRE script deferrals because of -MaxPre limit (20)
    1/29 10:33:24 Note: 0 total POST script deferrals because of -MaxPost limit (20)
    1/29 10:33:24 **** condor_scheduniv_exec.20133.0 (condor_DAGMAN) EXITING WITH STATUS 0
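
The PARENT ... CHILD lines in the DAG file define the order in which DAGMan can release jobs. As an illustration, the sketch below reads the JOB and PARENT/CHILD lines from the diamond DAG shown earlier in this section and prints one valid execution order. It ignores every other keyword (RETRY, SCRIPT, and so on), needs Python 3.9 or later for graphlib, and the function name parse_dag is hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def parse_dag(text):
    """Return {node: set_of_parents} from JOB and PARENT/CHILD lines."""
    parents = {}
    for line in text.splitlines():
        words = line.split()
        if not words or words[0].startswith("#"):
            continue
        if words[0] == "JOB":
            parents.setdefault(words[1], set())
        elif words[0] == "PARENT" and "CHILD" in words:
            split_at = words.index("CHILD")
            # Every node after CHILD depends on every node before it.
            for child in words[split_at + 1:]:
                parents.setdefault(child, set()).update(words[1:split_at])
    return parents


# JOB and dependency lines taken from diamond-0.dag above.
SAMPLE_DAG = """\
JOB generate_ID000001 generate_ID000001.sub
RETRY generate_ID000001 2
JOB findrange_ID000002 findrange_ID000002.sub
JOB findrange_ID000003 findrange_ID000003.sub
JOB analyze_ID000004 analyze_ID000004.sub
JOB diamond_0_pegasus_concat diamond_0_pegasus_concat.sub
JOB diamond_0_local_cdir diamond_0_local_cdir.sub
PARENT generate_ID000001 CHILD findrange_ID000002
PARENT generate_ID000001 CHILD findrange_ID000003
PARENT findrange_ID000002 CHILD analyze_ID000004
PARENT findrange_ID000003 CHILD analyze_ID000004
PARENT diamond_0_pegasus_concat CHILD generate_ID000001
PARENT diamond_0_local_cdir CHILD diamond_0_pegasus_concat
"""

parents = parse_dag(SAMPLE_DAG)
order = list(TopologicalSorter(parents).static_order())
print(order)
```

The two findrange jobs have no dependency on each other, so their relative position in the printed order carries no meaning; DAGMan may run them at the same time.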

2.7. Removing a running workflow

Sometimes you may want to halt the execution of a workflow or remove it permanently. You can stop a workflow by running the pegasus-remove command mentioned in the output of pegasus-run.

$ pegasus-remove /nfs/home/trainXX/pegasus-wms/dags/trainXX/pegasus/diamond/runXXXX

Job 2788.0 marked for removal

2.8. Planning a workflow using pegasus-plan and running pegasus-run to submit the workflow to a grid resource

In this exercise we are going to run pegasus-plan to generate a concrete workflow from the abstract workflow (montage.dax). The generated concrete workflow consists of Condor submit files, which are submitted to remote grid resources using pegasus-run.

The instructors have provided:

  • A dax (montage.dax) in the $HOME/pegasus-wms/dax/ directory.

You will need to carry out the following steps yourself, using the instructions below:

  • Run pegasus-plan to generate the Condor submit files from the DAX.

  • Run pegasus-run to submit the workflow to the grid.

Instructions:

  • Let us run pegasus-plan on the montage DAX for the Viz cluster. If multiple sites are available, you can provide them as a comma-separated list such as viz,uwm.

    $ pegasus-plan -Dpegasus.user.properties=/nfs/home/trainXX/pegasus-wms/config/properties \
                   --dir dags --sites viz --output local \
                   --nocleanup --dax /nfs/home/trainXX/pegasus-wms/dax/montage.dax

    The above command says that we need to plan the montage DAX on the viz site. The output data needs to be transferred back to the local host. The Condor submit files are to be generated in a directory structure whose base is dags. We are also requesting that no cleanup jobs be added, as we require the intermediate data on the remote host. Here is the output of pegasus-plan:

    2008.04.27 12:59:15.171 PDT: [INFO] Parsing the DAX
    2008.04.27 12:59:15.972 PDT: [INFO] Parsing the DAX (completed)
    2008.04.27 12:59:16.085 PDT: [INFO] Parsing the site catalog 
    2008.04.27 12:59:16.257 PDT: [INFO] Parsing the site catalog (completed)
    2008.04.27 12:59:16.323 PDT: [INFO] Doing site selection
    2008.04.27 12:59:16.427 PDT: [INFO] Doing site selection (completed)
    2008.04.27 12:59:16.428 PDT: [INFO] Grafting transfer nodes in the workflow
    2008.04.27 12:59:16.621 PDT: [INFO] Grafting transfer nodes in the workflow (completed)
    2008.04.27 12:59:16.628 PDT: [INFO] Grafting the remote workdirectory creation jobs 
                                        in the workflow
    2008.04.27 12:59:16.644 PDT: [INFO] Grafting the remote workdirectory creation jobs
                                        in the workflow (completed)
    2008.04.27 12:59:16.645 PDT: [INFO] Generating the cleanup workflow
    2008.04.27 12:59:16.649 PDT: [INFO] Generating the cleanup workflow (completed)
    2008.04.27 12:59:16.668 PDT: [INFO] Generating codes for the concrete workflow
    2008.04.27 12:59:17.330 PDT: [INFO] Generating codes for the concrete workflow (completed)
    2008.04.27 12:59:17.331 PDT: [INFO] Generating code for the cleanup workflow
    2008.04.27 12:59:17.414 PDT: [INFO] Generating code for the cleanup workflow (completed)
    2008.04.27 12:59:17.421 PDT: [INFO] 
    
    
    I have concretized your abstract workflow. The workflow has been entered 
    into the workflow database with a state of "planned". The next step is 
    to start or execute your workflow. The invocation required is
    
    
    pegasus-run -Dpegasus.user.properties=/nfs/home/train02/pegasus-wms/dags/train02\
    /pegasus/montage/run0001/pegasus.57010.properties \
      --nodatabase /nfs/home/train02/pegasus-wms/dags/train02/pegasus/montage/run0001
  • If you get any errors while running pegasus-plan, you can add -vvvvv to the pegasus-plan command to enable maximum verbosity.

  • Now run pegasus-run as mentioned in the output of pegasus-plan.

    $ pegasus-run -Dpegasus.user.properties=/nfs/home/trainXX/pegasus-wms/dags/trainXX\
    /pegasus/montage/run0001/pegasus.51773.properties \
              /nfs/home/trainXX/pegasus-wms/dags/trainXX/pegasus/montage/run0001
    
    Checking all your submit files for log file names.
    This might take a while... Done.
    
    -----------------------------------------------------------------------
    File for submitting this DAG to Condor : montage-0.dag.condor.sub
    Log of DAGMan debugging messages : montage-0.dag.dagman.out
    Log of Condor library output : montage-0.dag.lib.out
    Log of Condor library error messages : montage-0.dag.lib.err
    Log of the life of condor_dagman itself : montage-0.dag.dagman.log
    Condor Log file for all jobs of this DAG : /tmp/montage-051774.log
         -no_submit given, not submitting DAG to Condor.
    You can do this with: "condor_submit montage-0.dag.condor.sub"
    -----------------------------------------------------------------------
    Submitting job(s). Logging submit event(s).
    1 job(s) submitted to cluster 19968. 
    
    I have started your workflow, committed it to DAGMan,
    and updated its state in the work database. 
    A separate daemon was started to collect information about the progress of the workflow.
              
    The job state will soon be visible. Your workflow runs in base directory. 
    
    cd /nfs/home/trainXX/pegasus-wms/dags/trainXX/pegasus/montage/run0001
    
    *** To monitor the workflow you can run ***
    
    pegasus-status -w montage-0 -t 20080128T114525-0800 
      or
    pegasus-status /nfs/home/trainXX/pegasus-wms/dags/trainXX/pegasus/montage/run0001
    
    *** To remove your workflow run ***
    
    pegasus-remove -d 19968.0
      or
    pegasus-remove /nfs/home/trainXX/pegasus-wms/dags/trainXX/pegasus/montage/run0001

    The above command submits the workflow to Condor DAGMan/CondorG. After submitting, it starts a monitoring daemon, tailstatd, that parses the Condor log files to track the status of the jobs and pushes that status into a work database.

    Monitor the workflow using the commands provided in the output of the pegasus-run command and other commands explained earlier.

    If the workflow runs successfully, it generates a single output file, montage.jpg, which is written to /scratch/trainXX/storage/montage.jpg, where trainXX is your user id.

    To view the image, you can copy montage.jpg to your viz-login webspace, and view it in your web browser:

    $ cp /scratch/trainXX/storage/montage.jpg ~/public_html

    Point your web browser to: http://viz-login.isi.edu/~trainXX/montage.jpg where trainXX is your viz-login user id.
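
Before copying the image, it can help to confirm that the workflow actually produced it. The following is a minimal sketch of such a check; the function name publish_image is made up for this example and is not part of Pegasus or Condor.

```shell
# publish_image SRC DEST_DIR: copy SRC into DEST_DIR, but only if SRC
# exists and is not empty. (Hypothetical helper for this tutorial.)
publish_image() {
    src=$1
    dest_dir=$2
    if [ ! -s "$src" ]; then
        echo "missing or empty: $src" >&2
        return 1
    fi
    mkdir -p "$dest_dir" && cp "$src" "$dest_dir/"
}
```

With the paths used in this tutorial, the call would be: publish_image /scratch/trainXX/storage/montage.jpg ~/public_html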

3. Advanced Exercises

3.1. DAGMan node categories and category throttles, VARS, and scripts

In this exercise we will learn how to use node category throttles, VARS (to use a single submit file for multiple jobs), and PRE and POST scripts.
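
The BigProc nodes in the DAG file for this exercise all follow the same four-line pattern and differ only in the node name and the ARGS string; this is what lets several nodes share one submit file, which can refer to the value as $(ARGS). As an illustration only, the sketch below prints one such stanza. The helper name emit_node is an assumption for this example, and it hard-codes Setup as the parent node.

```shell
# emit_node NAME SUBMIT_FILE CATEGORY ARGS: print one DAG node stanza in
# the style of node_categories.dag, with Setup as the parent node.
# (Hypothetical helper; not part of DAGMan.)
emit_node() {
    printf 'JOB %s %s\n' "$1" "$2"
    printf 'VARS %s ARGS = "%s"\n' "$1" "$4"
    printf 'PARENT Setup CHILD %s\n' "$1"
    printf 'CATEGORY %s %s\n\n' "$1" "$3"
}

emit_node BigProc1 big_proc.submit BigProc "-sleep 60 -trials 10000 -seed 1234567"
```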

  • Take a look at the sample DAG file for this exercise:

    $ cat dagman/node_categories/node_categories.dag
    # DAG to illustrate node categories/category throttles, and VARS.
    
    JOB Setup setup.submit
    SCRIPT PRE Setup setup.pre
    
    JOB BigProc1 big_proc.submit
    VARS BigProc1 ARGS = "-sleep 60 -trials 10000 -seed 1234567"
    PARENT Setup CHILD BigProc1
    CATEGORY BigProc1 BigProc
    
    JOB BigProc2 big_proc.submit
    VARS BigProc2 ARGS = "-sleep 80 -trials 20000 -seed 7654321"
    PARENT Setup CHILD BigProc2
    CATEGORY BigProc2 BigProc
    ...
  • Now edit this DAG file:

    $ vi dagman/node_categories/node_categories.dag

    Insert the following lines that set node category throttles:

    MAXJOBS BigProc 1
    MAXJOBS SmallProc 4

    Also, change the two "YOUR ARGS HERE" strings to appropriate arguments, based on the other nodes with the same submit files (remember to keep the double quotes in place).

  • Now submit the dag:

    $ condor_submit_dag -f -usedagdir dagman/node_categories/node_categories.dag
    
    Checking all your submit files for log file names.
    This might take a while... 
    Done.
    -----------------------------------------------------------------------
    File for submitting this DAG to Condor      
                                      : dagman/node_categories/node_categories.dag.condor.sub
    Log of DAGMan debugging messages  
                                      : dagman/node_categories/node_categories.dag.dagman.out
    Log of Condor library output
                                      : dagman/node_categories/node_categories.dag.lib.out
    Log of Condor library error messages 
                                      : dagman/node_categories/node_categories.dag.lib.err
    Log of the life of condor_dagman itself
                                      : dagman/node_categories/node_categories.dag.dagman.log
    
    Condor Log file for all jobs of this DAG
                                 : /nfs/home/wenger/ccgrid08/dagman/node_categories/setup.log
    Submitting job(s).
    Logging submit event(s).
    1 job(s) submitted to cluster 21306.
    -----------------------------------------------------------------------
  • You can monitor the DAG using the condor_q command:

    $ condor_q -dag trainXX
    
    -- Submitter: viz-login.isi.edu : <128.9.72.178:40977> : viz-login.isi.edu
     ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD               
    21276.0   wenger          4/28 09:26   0+00:00:31 R  0   4.4  condor_dagman -f -
    21278.0    |-BigProc1     4/28 09:27   0+00:00:11 R  0   0.0  pi -sleep 60 -tria
    21279.0    |-SmallProc1   4/28 09:27   0+00:00:06 R  0   0.0  factor -sleep 20 -
    21280.0    |-SmallProc2   4/28 09:27   0+00:00:06 R  0   0.0  factor -sleep 20 -
    21281.0    |-SmallProc3   4/28 09:27   0+00:00:06 R  0   0.0  factor -sleep 20 -
    21282.0    |-SmallProc4   4/28 09:27   0+00:00:06 R  0   0.0  factor -sleep 20 -
    
    6 jobs; 0 idle, 6 running, 0 held 

    and by looking at the dagman.out file:

    $ tail -f dagman/node_categories/node_categories.dag.dagman.out
    
    ...
    4/28 09:24:08 Of 15 nodes total:
    4/28 09:24:08  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
    4/28 09:24:08   ===     ===      ===     ===     ===        ===      ===
    4/28 09:24:08    15       0        0       0       0          0        0
    4/28 09:24:08 Note: 82 total job deferrals because of node category throttles
    4/28 09:24:08 All jobs Completed!
    4/28 09:24:08 Note: 0 total job deferrals because of -MaxJobs limit (0)
    4/28 09:24:08 Note: 0 total job deferrals because of -MaxIdle limit (0)
    4/28 09:24:08 Note: 82 total job deferrals because of node category throttles
    4/28 09:24:08 Note: 0 total PRE script deferrals because of -MaxPre limit (0)
    4/28 09:24:08 Note: 0 total POST script deferrals because of -MaxPost limit (0)
    4/28 09:24:08 **** condor_scheduniv_exec.21260.0 (condor_DAGMAN) EXITING WITH STATUS 0
  • Also note that even though the Cleanup node Condor job failed, the POST script succeeded, and therefore the node succeeded overall:

    $ cat dagman/node_categories/node_categories.dag.dagman.out
    
    ...
    4/28 09:24:03 Event: ULOG_JOB_TERMINATED for Condor Node Cleanup (21275.0)
    4/28 09:24:03 Node Cleanup job proc (21275.0) failed with status 1.
    4/28 09:24:03 Node Cleanup job completed
    4/28 09:24:03 Running POST script of Node Cleanup...
    ...
    4/28 09:24:08 Event: ULOG_POST_SCRIPT_TERMINATED for Condor Node Cleanup (21275.0)
    4/28 09:24:08 POST Script of Node Cleanup completed successfully.
    ...
  • Once the DAG has completed, you should have two .result files in the DAG directory:

    $ cat dagman/node_categories/factor.result
    
    output/factor.out.21279:887766 = 2 * 3 * 11 * 13451
    output/factor.out.21280:234234 = 2 * 3^2 * 7 * 11 * 13^2
    output/factor.out.21281:9699690 = 2 * 3 * 5 * 7 * 11 * 13 * 17 * 19
    output/factor.out.21282:234238 = 2 * 117119
    output/factor.out.21283:9234973 = 11 * 61 * 13763
    output/factor.out.21284:230953 = 41 * 43 * 131
    output/factor.out.21285:6249832 = 2^3 * 781229
    output/factor.out.21286:321094 = 2 * 181 * 887
    output/factor.out.21287:723400 = 2^3 * 5^2 * 3617
    output/factor.out.21289:1934023 = 7 * 13 * 53 * 401
    
    $ cat dagman/node_categories/pi.result
    
    output/pi.out.21278:Estimate of pi: 3.160400 (10000 trials)
    output/pi.out.21288:Estimate of pi: 3.166600 (20000 trials)
    output/pi.out.21290:Estimate of pi: 3.145900 (40000 trials)
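
The dagman.out file repeats its deferral notes, so a quick way to see the effect of the throttles is to list each distinct note once. The following is a small sketch that assumes the line format shown in the excerpts above (date, time, then "Note:"); the function name deferral_notes is made up for this example.

```shell
# deferral_notes FILE: print each distinct "deferrals" note found in a
# dagman.out file once, with the leading date and time stamp removed.
# (Hypothetical helper; assumes lines such as
#  "4/28 09:24:08 Note: 82 total job deferrals because of ...".)
deferral_notes() {
    grep 'deferrals because of' "$1" |
        sed 's|^[0-9/]* [0-9:]* Note: ||' |
        sort -u
}
```

For this exercise the call would be: deferral_notes dagman/node_categories/node_categories.dag.dagman.out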

3.2. DAGMan nested DAGs, DAG config files, and overall job throttling

In this exercise we will learn how to use nested DAGs (DAGs as DAG nodes), DAG config files (allowing us to set any DAGMan-related configuration macros), and overall job throttling.

  • Because of some limitations of the current version of DAGMan, this exercise must be done in the directory containing the DAG files:

    $ cd dagman/nested_dags
  • Take a look at the top-level DAG file:

    $ cat top.dag
    
    # Top-level DAG for nested DAGs exercise.
    
    JOB Setup setup.submit
    SCRIPT PRE Setup setup.pre
    
    JOB Factor factor.dag.condor.sub
    
    INSERT Pi JOB HERE
    
    JOB Cleanup cleanup.submit
    SCRIPT POST Cleanup cleanup.post
    
    PARENT Setup CHILD Factor Pi
    PARENT Factor Pi CHILD Cleanup
  • Now edit the DAG file:

    $ vi top.dag

    Change the following line

    INSERT Pi JOB HERE

    to be a new job that's like the Factor job, except that the job name is Pi, and the submit file corresponds to pi.dag instead of factor.dag.

  • Take a look at the nested DAG files:

    $ cat factor.dag
    
    # First lower-level DAG for nested DAG exercise.
    
    JOB A factor.submit
    VARS A ARGS = "-sleep 20 -num 23489820"
    
    JOB B1 factor.submit
    VARS B1 ARGS = "-sleep 20 -num 77201856"
    
    JOB B2 factor.submit
    VARS B2 ARGS = "-sleep 20 -num 292342"
    
    JOB C factor.submit
    VARS C ARGS = "-sleep 20 -num 92340234"
    
    PARENT A CHILD B1 B2
    PARENT B1 B2 CHILD C
    
    $ cat pi.dag
    
    # Second lower-level DAG for nested DAG exercise.
    
    # This works only with version 6.9.2 and later.
    CONFIG pi.config
    
    JOB A pi.submit
    VARS A ARGS = "-sleep 20 -trials 10000 -seed 2348235"
    
    JOB B1 pi.submit
    VARS B1 ARGS = "-sleep 20 -trials 50000 -seed 99082348"
    
    JOB B2 pi.submit
    VARS B2 ARGS = "-sleep 20 -trials 50000 -seed 66899243"
    
    JOB C pi.submit
    VARS C ARGS = "-sleep 20 -trials 10000 -seed 66899243"
    
    PARENT A CHILD B1 B2
    PARENT B1 B2 CHILD C
  • Also examine the config file for the pi DAG:

    $ cat pi.config
    
    DAGMAN_MAX_JOBS_SUBMITTED = 1
  • Now create the .condor.sub files for the nested DAGs:

    $ condor_submit_dag -f -no_submit factor.dag
    
    Checking all your submit files for log file names.
    This might take a while... 
    Done.
    -----------------------------------------------------------------------
    File for submitting this DAG to Condor           : factor.dag.condor.sub
    Log of DAGMan debugging messages                 : factor.dag.dagman.out
    Log of Condor library output                     : factor.dag.lib.out
    Log of Condor library error messages             : factor.dag.lib.err
    Log of the life of condor_dagman itself          : factor.dag.dagman.log
    
    Condor Log file for all jobs of this DAG         :
    /nfs/home/wenger/ccgrid08/dagman/nested_dags/factor.log
    -no_submit given, not submitting DAG to Condor.  You can do this with:
    "condor_submit factor.dag.condor.sub"
    -----------------------------------------------------------------------
    
    $ condor_submit_dag -f -no_submit pi.dag
    Checking all your submit files for log file names.
    This might take a while... 
    Done.
    -----------------------------------------------------------------------
    File for submitting this DAG to Condor           : pi.dag.condor.sub
    Log of DAGMan debugging messages                 : pi.dag.dagman.out
    Log of Condor library output                     : pi.dag.lib.out
    Log of Condor library error messages             : pi.dag.lib.err
    Log of the life of condor_dagman itself          : pi.dag.dagman.log
    
    Condor Log file for all jobs of this DAG         :
    /nfs/home/wenger/ccgrid08/dagman/nested_dags/pi.log
    -no_submit given, not submitting DAG to Condor.  You can do this with:
    "condor_submit pi.dag.condor.sub"
    -----------------------------------------------------------------------
  • Finally, you are ready to submit the top-level DAG:

    $ condor_submit_dag -f top.dag
    
    Checking all your submit files for log file names.
    This might take a while... 
    Done.
    -----------------------------------------------------------------------
    File for submitting this DAG to Condor           : top.dag.condor.sub
    Log of DAGMan debugging messages                 : top.dag.dagman.out
    Log of Condor library output                     : top.dag.lib.out
    Log of Condor library error messages             : top.dag.lib.err
    Log of the life of condor_dagman itself          : top.dag.dagman.log
    
    Condor Log file for all jobs of this DAG         :
    /nfs/home/wenger/ccgrid08/dagman/nested_dags/top.log
    Submitting job(s).
    Logging submit event(s).
    1 job(s) submitted to cluster 21292.
    -----------------------------------------------------------------------
  • Again, you can monitor the DAG with condor_q or by looking at the dagman.out file.

  • Note that the dagman.out file for the pi DAG shows the effects of the throttle in the config file:

    $ cat pi.dag.dagman.out
    
    4/28 09:46:35 Of 4 nodes total:
    4/28 09:46:35  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
    4/28 09:46:35   ===     ===      ===     ===     ===        ===      ===
    4/28 09:46:35     4       0        0       0       0          0        0
    4/28 09:46:35 Note: 7 total job deferrals because of -MaxJobs limit (1)
    4/28 09:46:35 All jobs Completed!
    4/28 09:46:35 Note: 7 total job deferrals because of -MaxJobs limit (1)
    4/28 09:46:35 Note: 0 total job deferrals because of -MaxIdle limit (0)
    4/28 09:46:35 Note: 0 total job deferrals because of node category throttles
    4/28 09:46:35 Note: 0 total PRE script deferrals because of -MaxPre limit (0)
    4/28 09:46:35 Note: 0 total POST script deferrals because of -MaxPost limit (0)
    4/28 09:46:35 **** condor_scheduniv_exec.21296.0 (condor_DAGMAN) EXITING WITH STATUS 0
  • Again, you should have two .result files:

    $ cat factor.result
    
    output/factor.out.21297:23489820 = 2^2 * 3^2 * 5 * 37 * 3527
    output/factor.out.21299:77201856 = 2^6 * 3^3 * 43 * 1039
    output/factor.out.21300:292342 = 2 * 313 * 467
    output/factor.out.21302:92340234 = 2 * 3^2 * 7 * 29 * 37 * 683
    
    $ cat pi.result
    
    output/pi.out.21298:Estimate of pi: 3.170800 (10000 trials)
    output/pi.out.21301:Estimate of pi: 3.147280 (50000 trials)
    output/pi.out.21303:Estimate of pi: 3.141040 (50000 trials)
    output/pi.out.21304:Estimate of pi: 3.154000 (10000 trials)
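
The preparation step for the nested DAGs runs the same command once per DAG file, so it can be wrapped in a loop. The sketch below is one possible way to do that; it uses only the condor_submit_dag options shown in this exercise, and the function name prepare_nested_dags is an assumption for this example.

```shell
# prepare_nested_dags DAG...: create the .condor.sub file for each nested
# DAG without submitting it, stopping at the first failure.
# (Hypothetical helper wrapping the command used in this exercise.)
prepare_nested_dags() {
    for dag in "$@"; do
        condor_submit_dag -f -no_submit "$dag" || return 1
    done
}
```

From the dagman/nested_dags directory, the whole exercise could then be run as: prepare_nested_dags factor.dag pi.dag && condor_submit_dag -f top.dag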

3.3. Optimizing a workflow by clustering small jobs

Sometimes a workflow contains many jobs that each run for only a few seconds. In such cases the overhead of scheduling each job on the grid is large relative to the job's runtime, and the runtime of the entire workflow can be reduced by using Pegasus clustering techniques. One such technique is horizontal clustering, which groups jobs on the same level of the workflow into one or more jobs that run their members sequentially. To plan the Montage workflow with horizontal clustering, pass --cluster horizontal to pegasus-plan:

$ pegasus-plan -Dpegasus.user.properties=/nfs/home/trainXX/pegasus-wms/config/properties \
            --dir /nfs/home/trainXX/pegasus-wms/dags --sites viz --output local --nocleanup \
            --cluster horizontal --dax /nfs/home/trainXX/pegasus-wms/dax/montage.dax 
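
The reason clustering helps can be seen with a rough cost model. The sketch below is a back-of-the-envelope illustration, not something Pegasus computes: it assumes that every submitted grid job pays a fixed scheduling overhead and that all work runs one job after another. The function name workflow_cost and the numbers (100 tasks of 5 seconds each, 60 seconds of overhead, 4 clusters) are invented for this example.

```shell
# workflow_cost TASKS RUNTIME OVERHEAD SUBMITTED
# Rough serial cost in seconds: each of the TASKS runs for RUNTIME
# seconds, and each of the SUBMITTED grid jobs pays OVERHEAD seconds of
# scheduling delay. All figures are assumptions made for this sketch.
workflow_cost() {
    echo $(( $1 * $2 + $4 * $3 ))
}

workflow_cost 100 5 60 100   # unclustered: one grid job per task
workflow_cost 100 5 60 4     # clustered into 4 sequential jobs
```

Under these assumed numbers the model prints 6500 for the unclustered case and 740 for the clustered one; almost all of the difference is scheduling overhead.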

3.4. Executing the workflow in a non-shared filesystem environment (to be done offline)

All the jobs so far have been executed on the shared filesystem of the viz-cluster. The input data required by the Montage workflow was staged to a directory on the shared filesystem, and all the jobs were then executed in that directory.

A recent addition to Pegasus (still in the testing phase) allows you to execute each job in a temporary directory on the worker node's local filesystem. For this to happen, a Second Level Staging (SLS) step must transfer the data from the directory on the shared filesystem to a directory on the local filesystem of the worker node.

Set the following property in your properties file:

pegasus.execute.*.filesystem.local=true

Repeat Exercise 2.6.
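
If you prefer not to edit the properties file by hand, the property can be appended from the shell. The following is a minimal sketch; the function name set_property is made up for this example, and it assumes the properties file ends with a newline.

```shell
# set_property FILE LINE: append LINE to FILE unless an identical line is
# already present, so repeated runs leave a single copy.
# (Hypothetical helper; -F makes grep treat the "*" in LINE literally.)
set_property() {
    file=$1
    line=$2
    touch "$file"
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}
```

With the properties file used earlier in this tutorial, the call would be: set_property /nfs/home/trainXX/pegasus-wms/config/properties 'pegasus.execute.*.filesystem.local=true'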