2.4. Submitting an Example Workflow

All of the example workflows described in the previous section can be generated with the pegasus-init command. For this tutorial we will be using the split workflow, which can be created like this:

$ cd /home/tutorial
$ pegasus-init split
Do you want to generate a tutorial workflow? (y/n) [n]: y
1: Local Machine Condor Pool
2: USC HPCC Cluster
3: OSG from ISI submit node
4: XSEDE, with Bosco
5: Bluewaters, with Glite
6: TACC Wrangler with Glite
7: OLCF TITAN with Glite
What environment is tutorial to be setup for? (1-7) [1]: 1
1: Process
2: Pipeline
3: Split
4: Merge
5: EPA (requires R)
6: Population Modeling using Containers
7: Diamond
What tutorial workflow do you want? (1-7) [1]: 3
Pegasus Tutorial setup for example workflow - split for execution on submit-host in directory /home/tutorial/split
$ cd split
$ ls
README.md        sites.xml  tc.txt  bin                 daxgen.py  
generate_dax.sh  input      output  pegasus.properties  plan_cluster_dax.sh  
plan_dax.sh      rc.txt  


The pegasus-init tool generates workflow skeletons from templates by asking the user a series of questions, which is easier than starting a new workflow from scratch.

The split workflow looks like this:

Figure 2.6. Split Workflow

Split Workflow
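Conceptually, the split workflow stages in a file (pegasus.html), splits it into chunks, and runs wc on each chunk. As a rough, plain-Python sketch of that computation (run outside Pegasus, with a made-up sample text and chunk size), the split and wc steps behave roughly like this:

```python
# Toy stand-in for the split workflow's computation: split the input
# into fixed-size chunks, then count the words in each chunk. The
# sample text and chunk size below are invented for illustration.

def split_into_chunks(lines, lines_per_chunk):
    """Mimic `split -l`: group lines into chunks of at most lines_per_chunk."""
    return [lines[i:i + lines_per_chunk]
            for i in range(0, len(lines), lines_per_chunk)]

def word_count(chunk):
    """Mimic `wc -w`: count whitespace-separated words in a chunk."""
    return sum(len(line.split()) for line in chunk)

if __name__ == "__main__":
    text = ["alpha beta", "gamma", "delta epsilon zeta", "eta"]
    for i, chunk in enumerate(split_into_chunks(text, 2)):
        print("chunk", i, "words:", word_count(chunk))
```

In the real workflow, each of these steps runs as a separate job that Pegasus schedules through HTCondor, with the chunk files passed between jobs.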

The input workflow description for Pegasus is called the DAX. It can be generated by running the generate_dax.sh script from the split directory, like this:

$ ./generate_dax.sh split.dax
Generated dax split.dax

This script will run a small Python program (daxgen.py) that generates a file with a .dax extension using the Pegasus Python API. We will cover the details of creating a DAX programmatically later in the tutorial. Pegasus reads the DAX and generates an executable HTCondor workflow that is run on an execution site.
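The .dax file itself is an XML document listing jobs, the files they use, and parent-child dependencies. As a hand-rolled sketch of that structure (simplified from the DAX 3.x format; the element and attribute names here are illustrative, not the exact output of daxgen.py), a fragment of an abstract split workflow might be assembled like this:

```python
import xml.etree.ElementTree as ET

# Build a simplified DAX-like XML document by hand. This illustrates
# the general shape (jobs plus parent/child edges), not the exact
# schema the Pegasus Python API emits.
adag = ET.Element("adag", name="split")

split_job = ET.SubElement(adag, "job", id="ID0000001", name="split")
ET.SubElement(split_job, "uses", name="pegasus.html", link="input")

wc_job = ET.SubElement(adag, "job", id="ID0000002", name="wc")
ET.SubElement(wc_job, "uses", name="part.a", link="input")

# wc depends on split: split is listed as the parent of wc
child = ET.SubElement(adag, "child", ref="ID0000002")
ET.SubElement(child, "parent", ref="ID0000001")

print(ET.tostring(adag, encoding="unicode"))
```

In practice you never write this XML by hand; the Pegasus Python API (used by daxgen.py) generates it for you, which is covered later in the tutorial.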

The pegasus-plan command submits the workflow through Pegasus. It reads the input workflow (the DAX file specified by the --dax option), maps the abstract DAX to one or more execution sites, and submits the generated executable workflow to HTCondor. Among other things, the options to pegasus-plan tell Pegasus:

  • the workflow to run

  • where (what site) to run the workflow

  • the input directory where the inputs are placed

  • the output directory where the outputs are placed

By default, the workflow is set up to run on the compute sites (i.e., sites with a handle other than "local") defined in the sites.xml file. In our example, the workflow will run on a site named "condorpool" in the sites.xml file.


If there are multiple compute sites specified in your sites.xml and you want to choose a specific site, use the --sites option to pegasus-plan.

To plan the split workflow invoke the pegasus-plan command using the plan_dax.sh wrapper script as follows:

$ ./plan_dax.sh split.dax
2019.08.22 18:51:29.289 UTC:    
2019.08.22 18:51:29.295 UTC:   ----------------------------------------------------------------------- 
2019.08.22 18:51:29.300 UTC:   File for submitting this DAG to HTCondor           : split-0.dag.condor.sub 
2019.08.22 18:51:29.305 UTC:   Log of DAGMan debugging messages                 : split-0.dag.dagman.out 
2019.08.22 18:51:29.310 UTC:   Log of HTCondor library output                     : split-0.dag.lib.out 
2019.08.22 18:51:29.315 UTC:   Log of HTCondor library error messages             : split-0.dag.lib.err 
2019.08.22 18:51:29.321 UTC:   Log of the life of condor_dagman itself          : split-0.dag.dagman.log 
2019.08.22 18:51:29.326 UTC:    
2019.08.22 18:51:29.331 UTC:   -no_submit given, not submitting DAG to HTCondor.  You can do this with: 
2019.08.22 18:51:29.341 UTC:   ----------------------------------------------------------------------- 
2019.08.22 18:51:29.932 UTC:   Created Pegasus database in: sqlite:////home/tutorial/.pegasus/workflow.db 
2019.08.22 18:51:29.937 UTC:   Your database is compatible with Pegasus version: 4.9.2 
2019.08.22 18:51:29.997 UTC:   Submitting to condor split-0.dag.condor.sub 
2019.08.22 18:51:30.021 UTC:   Submitting job(s). 
2019.08.22 18:51:30.026 UTC:   1 job(s) submitted to cluster 1. 
2019.08.22 18:51:30.032 UTC:    
2019.08.22 18:51:30.037 UTC:   Your workflow has been started and is running in the base directory: 
2019.08.22 18:51:30.042 UTC:    
2019.08.22 18:51:30.047 UTC:     /home/tutorial/split/submit/tutorial/pegasus/split/run0001 
2019.08.22 18:51:30.052 UTC:    
2019.08.22 18:51:30.058 UTC:   *** To monitor the workflow you can run *** 
2019.08.22 18:51:30.063 UTC:    
2019.08.22 18:51:30.068 UTC:     pegasus-status -l /home/tutorial/split/submit/tutorial/pegasus/split/run0001 
2019.08.22 18:51:30.074 UTC:    
2019.08.22 18:51:30.079 UTC:   *** To remove your workflow run *** 
2019.08.22 18:51:30.084 UTC:    
2019.08.22 18:51:30.089 UTC:     pegasus-remove /home/tutorial/split/submit/tutorial/pegasus/split/run0001 
2019.08.22 18:51:30.095 UTC:    
2019.08.22 18:51:30.658 UTC:   Time taken to execute is 1.495 seconds 


The line in the output that starts with pegasus-status contains the command you can use to monitor the status of the workflow. The path in that command is the submit directory, where all of the files required to submit and monitor the workflow are stored.

This is what the split workflow looks like after Pegasus has finished planning the DAX:

Figure 2.7. Split DAG

Split DAG

For this workflow the only jobs Pegasus needs to add are a directory creation job, a stage-in job (for pegasus.html), and stage-out jobs (for the wc count outputs). The cleanup jobs remove data that is no longer required as the workflow executes.
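The jobs Pegasus adds fit into the same DAG: each one simply gains dependency edges on the jobs it must precede or follow. As a toy illustration (job names and edges invented to mirror the structure of the planned DAG, not taken from Pegasus output), a topological sort over such a graph yields a valid execution order:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical executable DAG: Pegasus-added jobs (create_dir, stage_in,
# stage_out, cleanup) wrapped around the user's split and wc jobs.
# Each entry maps a job to the set of jobs it depends on (its parents).
dag = {
    "create_dir": set(),
    "stage_in":   {"create_dir"},
    "split":      {"stage_in"},
    "wc_a":       {"split"},
    "wc_b":       {"split"},
    "stage_out":  {"wc_a", "wc_b"},
    "cleanup":    {"stage_out"},
}

# static_order() returns the jobs so that every parent precedes its children.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

HTCondor DAGMan performs this kind of dependency-respecting scheduling for the real workflow, releasing each job only once all of its parents have finished.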