Note: There is a newer version of Pegasus available. Please see the main documentation page.

Chapter 2. New User Walkthrough

2.1. Walkthrough Objectives

This walkthrough is intended for new users who want to get a quick overview of the Pegasus concepts and system. A preconfigured virtual machine is provided so that no software installation (except Virtual Box) is required. The walkthrough covers creating a workflow, submitting it, monitoring it, debugging it, and generating run statistics. As concepts and tools are introduced, links to the main Pegasus documentation are provided.

2.2. Virtual Box Pegasus VM

Note

Virtual Box is required to run the virtual machine on your computer. If you do not already have it installed, download the binary version for your platform from the Virtual Box website and install it.

Download the corresponding disk image.

Virtual Box Pegasus Image

It is around 600 MB in size. The image is in zip format, so you will need to unzip it.

After unzipping, a folder named Pegasus-3.1-Debian-6-x86.vbox will be created that contains the vmdk files for the VM. Load this VM using Virtual Box. Once you see the simple Linux desktop, move on to the next step.

2.2.1. Running the VM with Virtual Box

Launch Virtual Box on your machine. Follow these steps to add the vmdk file to Virtual Box and create a virtual machine:

  1. In the Menu, click File and select Virtual Media Manager ( File > Virtual Media Manager )

  2. The Virtual Media Manager window opens.

  3. Click the "Add" button to add the Pegasus-3.1.0-Debian-6-x86.vbox/Debian-6-x86.vmdk file that you just downloaded and unzipped.

  4. You will now see Debian-6-x86.vmdk in the list of hard disks, with the actual size listed as around 3.0 GB.

  5. Close the Virtual Media Manager window.

We will now create a virtual machine in Virtual Box.

  1. In the Menu, click Machine and select New ( Machine > New )

  2. The New Virtual Machine Wizard opens. Click Continue.

  3. In the VM Name and OS Type window, specify the name as PegasusVM-3.1.0.

  4. Select the Operating System as Linux and the Version as Debian. Click Continue.

  5. Set the base memory to 384 MB. It defaults to 512 MB. Click Continue.

  6. We now select the virtual hard disk to use with the machine. Select the option box for Use Existing Hard Disk. Select Debian-6-x86.vmdk from the list. Click Continue.

  7. Click Done.

  8. Now, in Virtual Box, start the PegasusVM-3.1.0 machine.

2.3. Creating the Workflow (DAX)

Pegasus takes in an abstract workflow (DAX) and generates an executable workflow (DAG) that is run in an execution environment. For the purposes of this walkthrough, we will demonstrate the characteristics and structure of workflows by generating a DAX with a bit of Python code that uses the DAX API. For a detailed description of workflows, how to create them, and how they are used in Pegasus, see the Creating Workflows chapter.

The workflow we will be creating is called Black Diamond because of its shape. It is a made-up workflow with 4 jobs and files (f.*) passed between the jobs. It is an interesting example because it shows both data and job dependencies. The workflow looks like:

Figure 2.1. Black Diamond DAX

To create the DAX, open a new terminal:

Figure 2.2. Terminal Window


All the exercises in this chapter are run from the $HOME/walkthrough/ directory. All the required files reside in this directory.

Change the directory to $HOME/walkthrough:

$ cd $HOME/walkthrough

The piece of code that generates the DAX is called a DAX generator, and in this case the code is written in Python. APIs are also available for Java and Perl, and if none of these fit your tool chain, you can also write the DAX XML directly.

Open the file create_diamond_dax.py:

$ nano create_diamond_dax.py

The code has 6 logical sections:

  1. Imports and Pegasus location setup. We need to know the location of Pegasus because we need to import the DAX3 Pegasus Python module, and we need to know where on the file system the keg executable is.

  2. A new abstract DAG (DAX) is created. This is the main DAX object that we will add data, jobs and flow information to.

  3. A replica catalog is set up. Replica catalogs tell Pegasus where to find data. In this example workflow, we only have one input, f.a, and thus the replica catalog only has one entry. For larger workflows, it is not uncommon to have thousands of entries in the replica catalog, and sometimes multiple physical locations for the same logical filename so that Pegasus can pick which replica to use. Replica catalogs can either be included in the DAX (as in this example) or be standalone files or services. For more information about replica catalogs, see the Data Discovery chapter.

  4. Executables are added. Just as the replica catalog tells Pegasus where to find data, the transformation catalog tells Pegasus where to find the executables for the workflow. The transformation catalog can exist inside the DAX (as in this example) or in a standalone file. You can list the same logical executable as existing on multiple resources, and Pegasus will pick the appropriate one when the workflow is planned. More information can be found in the Executable Discovery chapter.

  5. Jobs are added. The 4 jobs of the Black Diamond in the picture above are added. Arguments are defined, and uses clauses are added to list input and output files. This is an important step, as it allows Pegasus to track the files and stage the data if necessary.

  6. Control flows are set up. These are the edges in the picture, and they define the parent/child relationships between the jobs. When the workflow is executing, a job runs only after all of its parents have finished.
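The six sections above can be illustrated with a minimal sketch that writes the DAX XML directly, using only the Python standard library. This is not the contents of create_diamond_dax.py and it does not use the DAX3 module; the function name build_diamond is hypothetical, and the sketch leaves out the argument elements, the XML namespace, and several attributes that the real generator emits.

```python
import xml.etree.ElementTree as ET

SITE = "PegasusVM"
KEG_URL = "file:///opt/pegasus/default/bin/keg"  # keg location used in this walkthrough


def build_diamond():
    """Build a simplified Black Diamond DAX tree (hypothetical helper)."""
    # Section 2: the top-level abstract workflow element
    adag = ET.Element("adag", {"version": "3.3", "name": "diamond"})

    # Section 3: replica catalog entry for the single input file
    f_a = ET.SubElement(adag, "file", {"name": "f.a"})
    ET.SubElement(f_a, "pfn", {"url": "file:///home/tutorial/walkthrough/f.a", "site": SITE})

    # Section 4: transformation catalog entries, all backed by keg
    for name in ("preprocess", "analyze", "findrange"):
        exe = ET.SubElement(adag, "executable",
                            {"name": name, "namespace": "diamond", "version": "4.0"})
        ET.SubElement(exe, "pfn", {"url": KEG_URL, "site": SITE})

    # Section 5: the four jobs with their input and output files
    jobs = [
        ("ID0000001", "preprocess", ["f.a"], ["f.b1", "f.b2"]),
        ("ID0000002", "findrange", ["f.b1"], ["f.c1"]),
        ("ID0000003", "findrange", ["f.b2"], ["f.c2"]),
        ("ID0000004", "analyze", ["f.c1", "f.c2"], ["f.d"]),
    ]
    for job_id, name, inputs, outputs in jobs:
        job = ET.SubElement(adag, "job", {"id": job_id, "namespace": "diamond",
                                          "name": name, "version": "4.0"})
        for fname in inputs:
            ET.SubElement(job, "uses", {"name": fname, "link": "input"})
        for fname in outputs:
            ET.SubElement(job, "uses", {"name": fname, "link": "output"})

    # Section 6: control flow, stored as child -> list of parents
    edges = {
        "ID0000002": ["ID0000001"],
        "ID0000003": ["ID0000001"],
        "ID0000004": ["ID0000002", "ID0000003"],
    }
    for child_id, parent_ids in edges.items():
        child = ET.SubElement(adag, "child", {"ref": child_id})
        for parent_id in parent_ids:
            ET.SubElement(child, "parent", {"ref": parent_id})
    return adag


if __name__ == "__main__":
    print(ET.tostring(build_diamond(), encoding="unicode"))
```

The sketch only mirrors the structure of the generator. For real workflows, the DAX3 API shipped with Pegasus is the better choice because it produces the complete, schema-conformant document.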

Close the file and execute it, giving the location of the Pegasus installation as an argument:

$ ./create_diamond_dax.py /opt/pegasus/default

The output is the DAX XML:

<?xml version="1.0" encoding="UTF-8"?>
<!-- generated: 2011-07-19 12:59:14.617059 -->
<!-- generated by: tutorial -->
<!-- generator: python -->
<adag xmlns="http://pegasus.isi.edu/schema/DAX"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://pegasus.isi.edu/schema/DAX
                          http://pegasus.isi.edu/schema/dax-3.3.xsd"
      version="3.3" name="diamond">
        <file name="f.a">
                <pfn url="file:///home/tutorial/walkthrough/f.a" site="PegasusVM"/>
        </file>
        <executable name="preprocess" namespace="diamond" version="4.0"
                    arch="x86_64" os="linux" installed="true">
                <pfn url="file:///opt/pegasus/default/bin/keg" site="PegasusVM"/>
        </executable>
        <executable name="analyze" namespace="diamond" version="4.0"
                    arch="x86_64" os="linux" installed="true">
                <pfn url="file:///opt/pegasus/default/bin/keg" site="PegasusVM"/>
        </executable>
        <executable name="findrange" namespace="diamond" version="4.0"
                    arch="x86_64" os="linux" installed="true">
                <pfn url="file:///opt/pegasus/default/bin/keg" site="PegasusVM"/>
        </executable>
        <job id="ID0000001" namespace="diamond" name="preprocess" version="4.0">
                <argument>-a preprocess -T60 -i <file name="f.a"/> 
                          -o <file name="f.b1"/> <file name="f.b2"/></argument>
                <uses name="f.b1" link="output" executable="false"/>
                <uses name="f.a" link="input" executable="false"/>
                <uses name="f.b2" link="output" executable="false"/>
        </job>
        <job id="ID0000002" namespace="diamond" name="findrange" version="4.0">
                <argument>-a findrange -T60 -i <file name="f.b1"/>
                          -o <file name="f.c1"/></argument>
                <uses name="f.c1" link="output" executable="false"/>
                <uses name="f.b1" link="input" executable="false"/>
        </job>
        <job id="ID0000003" namespace="diamond" name="findrange" version="4.0">
                <argument>-a findrange -T60 -i <file name="f.b2"/>
                          -o <file name="f.c2"/></argument>
                <uses name="f.c2" link="output" executable="false"/>
                <uses name="f.b2" link="input" executable="false"/>
        </job>
        <job id="ID0000004" namespace="diamond" name="analyze" version="4.0">
                <argument>-a analyze -T60 -i <file name="f.c1"/> 
                          <file name="f.c2"/> -o <file name="f.d"/></argument>
                <uses name="f.c2" link="input" executable="false"/>
                <uses name="f.d" link="output" register="true" executable="false"/>
                <uses name="f.c1" link="input" executable="false"/>
        </job>
        <child ref="ID0000002">
                <parent ref="ID0000001"/>
        </child>
        <child ref="ID0000003">
                <parent ref="ID0000001"/>
        </child>
        <child ref="ID0000004">
                <parent ref="ID0000002"/>
                <parent ref="ID0000003"/>
        </child>
</adag>
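To see how the child and parent elements determine execution order, the following sketch reads them with the Python standard library and derives one valid ordering of the jobs. It is illustrative only: the function name job_order and the trimmed DAX_SNIPPET are hypothetical, the graphlib module requires Python 3.9 or later, and this is not how Pegasus or DAGMan schedule jobs internally.

```python
import xml.etree.ElementTree as ET
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# The DAX uses a default XML namespace, so lookups must be namespace-qualified.
NS = {"dax": "http://pegasus.isi.edu/schema/DAX"}

# Trimmed copy of the DAX: only the job ids and the dependency elements.
DAX_SNIPPET = """<adag xmlns="http://pegasus.isi.edu/schema/DAX" version="3.3" name="diamond">
  <job id="ID0000001" namespace="diamond" name="preprocess" version="4.0"/>
  <job id="ID0000002" namespace="diamond" name="findrange" version="4.0"/>
  <job id="ID0000003" namespace="diamond" name="findrange" version="4.0"/>
  <job id="ID0000004" namespace="diamond" name="analyze" version="4.0"/>
  <child ref="ID0000002"><parent ref="ID0000001"/></child>
  <child ref="ID0000003"><parent ref="ID0000001"/></child>
  <child ref="ID0000004"><parent ref="ID0000002"/><parent ref="ID0000003"/></child>
</adag>"""


def job_order(dax_xml):
    """Return the job ids in an order where every parent precedes its children."""
    root = ET.fromstring(dax_xml)
    # Map each job id to the set of job ids it depends on.
    graph = {job.get("id"): set() for job in root.findall("dax:job", NS)}
    for child in root.findall("dax:child", NS):
        parents = graph.setdefault(child.get("ref"), set())
        for parent in child.findall("dax:parent", NS):
            parents.add(parent.get("ref"))
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    print(job_order(DAX_SNIPPET))
```

The preprocess job always comes first and the analyze job last. The two findrange jobs have no dependency on each other, so they may appear in either order and can run in parallel.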

We need the DAX in a file to give to Pegasus, so run the same command again, but redirect the output to a file:

$ ./create_diamond_dax.py /opt/pegasus/default > diamond.xml 

More information about creating workflows can be found in the Creating Workflows chapter.

2.4. Submitting the Workflow

Submitting a Pegasus workflow consists of two steps, planning and submitting, which are often performed with a single command for convenience. However, the planning stage is where Pegasus applies powerful transformations to your workflow, so it is important to have at least an idea of what is happening under the covers. Planning includes, but is not limited to:

  1. Adding remote create dir jobs

  2. Adding stage in jobs to transfer data into the remote work directory

  3. Adding cleanup jobs to clean up the work directory as the workflow progresses

  4. Adding stage out jobs to transfer data to the final output location

  5. Adding registration jobs to register the data in a replica catalog

  6. Clustering jobs together, which is useful if you have many short tasks

  7. Adding wrappers to the jobs to collect provenance information, so that statistics and plots can be created after a run

In the case of our black diamond workflow, here is what it looks like after the Pegasus planner has processed the DAX:

Figure 2.3. Black Diamond DAG Image

To plan and submit the workflow, run:

$ pegasus-plan --conf pegasusrc --sites PegasusVM --dir runs --output local --dax diamond.xml --submit
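If you want to drive the planner from a script, a small sketch like the following assembles the same command line. The function name plan_command and its parameter names are hypothetical; the flags and values are exactly the ones shown above. The sketch only builds and prints the command and does not run it.

```python
import shlex


def plan_command(dax="diamond.xml", site="PegasusVM", run_dir="runs",
                 output_site="local", conf="pegasusrc", submit=True):
    """Assemble the pegasus-plan argument list used in this walkthrough."""
    argv = [
        "pegasus-plan",
        "--conf", conf,           # properties file
        "--sites", site,          # execution site
        "--dir", run_dir,         # base directory for the run directories
        "--output", output_site,  # site where the final outputs are placed
        "--dax", dax,             # abstract workflow to plan
    ]
    if submit:
        argv.append("--submit")   # plan and submit in one step
    return argv


if __name__ == "__main__":
    # shlex.join quotes arguments safely for display (Python 3.8+).
    print(shlex.join(plan_command()))
```

Keeping the arguments in a list rather than a single string means the result can be passed to subprocess.run(argv, check=True) inside the VM without any shell quoting concerns.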

The output will look something like:

Submitting job(s). 
1 job(s) submitted to cluster 18. 

----------------------------------------------------------------------- 
File for submitting this DAG to Condor           : diamond-0.dag.condor.sub 
Log of DAGMan debugging messages                 : diamond-0.dag.dagman.out 
Log of Condor library output                     : diamond-0.dag.lib.out 
Log of Condor library error messages             : diamond-0.dag.lib.err 
Log of the life of condor_dagman itself          : diamond-0.dag.dagman.log 
----------------------------------------------------------------------- 

Your Workflow has been started and runs in base directory given below 

cd /home/tutorial/walkthrough/runs/tutorial/pegasus/diamond/run0001 

*** To monitor the workflow you can run *** 

pegasus-status -l /home/tutorial/walkthrough/runs/tutorial/pegasus/diamond/run0001 

*** To remove your workflow run *** 
pegasus-remove /home/tutorial/walkthrough/runs/tutorial/pegasus/diamond/run0001

Tip

The work directory created by Pegasus is where the concrete workflow exists, and the directory is also the handle for Pegasus commands acting on that instance. Using this handle will be covered in the next section.
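Because every later command needs this directory handle, a script may want to pick it out of the planner output. The sketch below does so with a regular expression. The function name find_run_dir is hypothetical, and the approach assumes the output keeps the "pegasus-status -l <directory>" hint line shown above, which may differ in other Pegasus versions.

```python
import re

# Excerpt of the pegasus-plan output shown in this section.
SAMPLE_OUTPUT = """\
Your Workflow has been started and runs in base directory given below

cd /home/tutorial/walkthrough/runs/tutorial/pegasus/diamond/run0001

*** To monitor the workflow you can run ***

pegasus-status -l /home/tutorial/walkthrough/runs/tutorial/pegasus/diamond/run0001
"""


def find_run_dir(plan_output):
    """Return the run directory named in the monitoring hint, or None."""
    match = re.search(r"^pegasus-status -l (\S+)\s*$", plan_output, flags=re.MULTILINE)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(find_run_dir(SAMPLE_OUTPUT))
```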

Further information about planning and submitting workflows can be found in the Running Workflows chapter. Information about the files in the work directory can be found in the Submit Directory Details chapter.

2.5. Monitoring, Debugging and Statistics

Once the workflow has been submitted, you can check its status with the pegasus-status tool. Use the directory handle from the previous step (it is different for every workflow you submit) and run the tool with the -l flag:

$ pegasus-status -l /home/tutorial/walkthrough/runs/tutorial/pegasus/diamond/run0001
STAT  IN_STATE  JOB                                               
Run      02:59  diamond-0                                         
Run      00:34   |-findrange_ID0000002                            
Run      00:32   |-findrange_ID0000003                            
Idle     00:19   \_clean_up_preprocess_ID0000001                  
Summary: 4 Condor jobs total (I:1 R:3)

UNRDY READY   PRE  IN_Q  POST  DONE  FAIL %DONE STATE   DAGNAME                                 
    8     0     0     4     0     4     0  25.0 Running *diamond-0.dag                          
Summary: 1 DAG total (Running:1)

The first section shows the jobs that have been released to Condor. The second section shows a summary of the state of the workflow. Keep checking the workflow with pegasus-status until it is 100% done.

Once the workflow has finished, you can use pegasus-analyzer to debug it. This is most useful when a workflow has failed: pegasus-analyzer shows which jobs failed and the output of those jobs. Our simple black diamond should finish successfully, and the pegasus-analyzer output should look like:
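The %DONE figure in the summary can be reproduced from the count columns. The sketch below assumes that the seven count columns (UNRDY through FAIL) together account for every job in the DAG, which is consistent with the sample row above but is not confirmed by this chapter. The function name percent_done is hypothetical.

```python
# Count columns of the DAG summary table, in the order shown above.
COLUMNS = ["UNRDY", "READY", "PRE", "IN_Q", "POST", "DONE", "FAIL"]

# The data row from the sample pegasus-status output.
STATUS_ROW = "    8     0     0     4     0     4     0  25.0 Running *diamond-0.dag"


def percent_done(row):
    """Compute the percentage of finished jobs from one summary row."""
    counts = dict(zip(COLUMNS, (int(field) for field in row.split()[:7])))
    total = sum(counts.values())  # assumption: the columns partition all jobs
    return 100.0 * counts["DONE"] / total if total else 0.0


if __name__ == "__main__":
    print(percent_done(STATUS_ROW))
```

For the sample row, 4 of the 16 counted jobs are done, which gives the 25.0 shown in the %DONE column.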

$ pegasus-analyzer --dir=/home/tutorial/walkthrough/runs/tutorial/pegasus/diamond/run0001
pegasus-analyzer: initializing...

************************************Summary*************************************

 Total jobs         :     15 (100.00%)
 # jobs succeeded   :     15 (100.00%)
 # jobs failed      :      0 (0.00%)
 # jobs unsubmitted :      0 (0.00%)

**************************************Done**************************************

pegasus-analyzer: end of status report

To get detailed run statistics, use the pegasus-statistics tool:

$ pegasus-statistics /home/tutorial/walkthrough/runs/tutorial/pegasus/diamond/run0001

Workflow summary - Summary of the workflow execution. It shows total
                tasks/jobs/sub workflows run, how many succeeded/failed etc.
                In case of hierarchical workflow the calculation shows the 
                statistics across all the sub workflow.

Workflow wall time - The walltime from the start of the workflow execution
                to the end as reported by the DAGMAN.In case of rescue dag the value
                is the cumulative of all retries.

Workflow cumulative job wall time - The sum of the walltime of all jobs as
                reported by kickstart. In case of job retries the value is the
                cumulative of all retries. For workflows having sub workflow jobs
                (i.e SUBDAG and SUBDAX jobs), the walltime value includes jobs from
                the sub workflows as well.

Cumulative job walltime as seen from submit side - The sum of the walltime of
                all jobs as reported by DAGMan. This is similar to the regular
                cumulative job walltime, but includes job management overhead and
                delays. In case of job retries the value is the cumulative of all
                retries. For workflows having sub workflow jobs (i.e SUBDAG and
                SUBDAX jobs), the walltime value includes jobs from the sub workflows
                as well.

-------------------------------------------------------------------------------------
Type                Succeeded           Failed              Unsubmitted         Total
Tasks               4                   0                   0                   4
Jobs                15                  0                   0                   15 
Sub Workflows       0                   0                   0                   0
--------------------------------------------------------------------------------------
Workflow wall time                               : 5 mins, 35 secs, (total 335 seconds)

Workflow cumulative job wall time                : 4 mins, 1 sec,   (total 241 seconds)

Cumulative job walltime as seen from submit side : 4 mins, 2 secs,  (total 242 seconds)
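As a worked example of reading these numbers, the sketch below extracts the totals in seconds from the three summary lines and subtracts the kickstart-reported job time from the submit-side job time. According to the descriptions above, that difference is the job management overhead and delays. The function name total_seconds is hypothetical, and the parsing assumes the "(total N seconds)" format shown in this output.

```python
import re

# The three timing lines from the sample pegasus-statistics output.
STATS = """\
Workflow wall time                               : 5 mins, 35 secs, (total 335 seconds)
Workflow cumulative job wall time                : 4 mins, 1 sec,   (total 241 seconds)
Cumulative job walltime as seen from submit side : 4 mins, 2 secs,  (total 242 seconds)
"""


def total_seconds(stats_text):
    """Map each statistic label to its total in seconds."""
    totals = {}
    for line in stats_text.splitlines():
        match = re.match(r"(.+?)\s*:\s.*\(total (\d+) seconds\)", line)
        if match:
            totals[match.group(1)] = int(match.group(2))
    return totals


if __name__ == "__main__":
    totals = total_seconds(STATS)
    overhead = (totals["Cumulative job walltime as seen from submit side"]
                - totals["Workflow cumulative job wall time"])
    print(overhead)
```

For this run the difference is 242 - 241 = 1 second, so almost all of the submit-side job time was spent actually running the jobs.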

More information about monitoring, debugging and statistics can be found in the Monitoring, Debugging and Statistics chapter.