2.4. Submitting an Example Workflow

All of the example workflows described in the previous section can be generated with the pegasus-init command. For this tutorial we will be using the split workflow, which can be created like this:

$ cd /home/tutorial
$ pegasus-init split
Do you want to generate a tutorial workflow? (y/n) [n]: y
1: Process
2: Pipeline
3: Split
4: Merge
5: Diamond
What tutorial workflow do you want? (1-5) [1]: 3
$ cd split
$ ls
README.md          input              plan_dax.sh        tc.txt
daxgen.py          output             rc.txt
generate_dax.sh    pegasus.properties sites.xml

Tip

The pegasus-init tool generates workflow skeletons from templates by asking the user a series of questions. It is easier to start from a pegasus-init skeleton than to write a new workflow from scratch.

The split workflow looks like this:

Figure 2.6. Split Workflow

The input workflow description for Pegasus is called the DAX (DAG in XML). It can be generated by running the generate_dax.sh script from the split directory, like this:

$ ./generate_dax.sh split.dax
Generated dax split.dax
    

This script will run a small Python program (daxgen.py) that generates a file with a .dax extension using the Pegasus Python API. We will cover the details of creating a DAX programmatically later in the tutorial. Pegasus reads the DAX and generates an executable HTCondor workflow that is run on an execution site.
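As a preview, a DAX generator written with the Pegasus DAX3 Python API looks roughly like the sketch below. The arguments, file names, and number of chunks are illustrative assumptions and may not match the daxgen.py shipped with the tutorial; the transformation names (split, wc) must correspond to entries in the transformation catalog (tc.txt).

    #!/usr/bin/env python
    # Illustrative sketch of a DAX generator using the Pegasus DAX3 Python API.
    # File names, arguments, and the number of chunks are assumptions and may
    # differ from the daxgen.py shipped with the tutorial.
    import sys
    from Pegasus.DAX3 import ADAG, Job, File, Link

    dax = ADAG("split")

    # Input web page that the split job breaks into chunks
    webpage = File("pegasus.html")

    split = Job("split")
    split.addArguments("-l", "100", "-a", "1", webpage, "part.")
    split.uses(webpage, link=Link.INPUT)
    dax.addJob(split)

    # One wc job per chunk, each counting the lines in its part
    for c in "abcd":
        part = File("part.%s" % c)
        count = File("count.txt.%s" % c)
        split.uses(part, link=Link.OUTPUT, transfer=False, register=False)

        wc = Job("wc")
        wc.addArguments("-l", part)
        wc.setStdout(count)
        wc.uses(part, link=Link.INPUT)
        wc.uses(count, link=Link.OUTPUT, transfer=True, register=False)
        dax.addJob(wc)
        dax.depends(parent=split, child=wc)

    # Write the abstract workflow to the file named on the command line
    with open(sys.argv[1], "w") as f:
        dax.writeXML(f)

Running generate_dax.sh produces split.dax, the abstract workflow that pegasus-plan consumes in the next step.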

The pegasus-plan command is used to submit the workflow through Pegasus. It reads the input workflow (the DAX file specified with the --dax option), maps the abstract DAX onto one or more execution sites, and submits the generated executable workflow to HTCondor. Among other things, the options to pegasus-plan tell Pegasus:

  • the workflow to run

  • where (what site) to run the workflow

  • the input directory where the inputs are placed

  • the output directory where the outputs are placed

By default, the workflow is set up to run on the compute sites (i.e., sites with a handle other than "local") defined in the sites.xml file. In our example, the workflow will run on the site named "condorpool" defined in sites.xml.

Note

If there are multiple compute sites specified in your sites.xml and you want to choose a specific site, use the --sites option to pegasus-plan.
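For reference, the plan_dax.sh wrapper assembles a pegasus-plan invocation roughly along the following lines; the exact options in your copy of the script may differ:

    pegasus-plan --conf pegasus.properties \
        --dax split.dax \
        --dir submit \
        --sites condorpool \
        --input-dir input \
        --output-dir output \
        --submit

Here --dir names the base submit directory, --input-dir and --output-dir point at the input and output directories created by pegasus-init, and --submit tells pegasus-plan to hand the planned workflow to HTCondor immediately rather than only planning it.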

To plan the split workflow, invoke the pegasus-plan command using the plan_dax.sh wrapper script as follows:

$ ./plan_dax.sh split.dax
2015.10.22 19:12:10.402 PDT:
2015.10.22 19:12:10.409 PDT:   -----------------------------------------------------------------------
2015.10.22 19:12:10.414 PDT:   File for submitting this DAG to Condor           : split-0.dag.condor.sub
2015.10.22 19:12:10.420 PDT:   Log of DAGMan debugging messages                 : split-0.dag.dagman.out
2015.10.22 19:12:10.426 PDT:   Log of Condor library output                     : split-0.dag.lib.out
2015.10.22 19:12:10.431 PDT:   Log of Condor library error messages             : split-0.dag.lib.err
2015.10.22 19:12:10.436 PDT:   Log of the life of condor_dagman itself          : split-0.dag.dagman.log
2015.10.22 19:12:10.442 PDT:
2015.10.22 19:12:10.458 PDT:   -----------------------------------------------------------------------
2015.10.22 19:12:14.292 PDT:   Your database is compatible with Pegasus version: 4.5.3
2015.10.22 19:12:15.198 PDT:   Submitting to condor split-0.dag.condor.sub
2015.10.22 19:12:15.997 PDT:   Submitting job(s).
2015.10.22 19:12:16.003 PDT:   1 job(s) submitted to cluster 111.
2015.10.22 19:12:16.035 PDT:
2015.10.22 19:12:16.055 PDT:   Your workflow has been started and is running in the base directory:
2015.10.22 19:12:16.061 PDT:
2015.10.22 19:12:16.071 PDT:     /home/tutorial/split/submit/tutorial/pegasus/split/run0001
2015.10.22 19:12:16.078 PDT:
2015.10.22 19:12:16.084 PDT:   *** To monitor the workflow you can run ***
2015.10.22 19:12:16.090 PDT:
2015.10.22 19:12:16.098 PDT:     pegasus-status -l /home/tutorial/split/submit/tutorial/pegasus/split/run0001
2015.10.22 19:12:16.114 PDT:
2015.10.22 19:12:16.119 PDT:   *** To remove your workflow run ***
2015.10.22 19:12:16.125 PDT:
2015.10.22 19:12:16.131 PDT:     pegasus-remove /home/tutorial/split/submit/tutorial/pegasus/split/run0001
2015.10.22 19:12:16.137 PDT:
2015.10.22 19:12:17.630 PDT:   Time taken to execute is 1.918 seconds

Note

The line in the output that starts with pegasus-status contains the command you can use to monitor the status of the workflow. The path it contains is the path to the submit directory, where all of the files required to submit and monitor the workflow are stored.

This is what the split workflow looks like after Pegasus has finished planning the DAX:

Figure 2.7. Split DAG

For this workflow the only jobs Pegasus needs to add are a directory creation job, a stage-in job (for pegasus.html), and stage-out jobs (for the wc count outputs). The cleanup jobs remove data that is no longer required as the workflow executes.