Chapter 3. New User Walkthrough

3.1. Walkthrough Objectives

This walkthrough is intended for new users who want so get a quick overview of the Pegasus concepts and system. A preconfigured virtual machine is provided so that no software installation (except Nimbus cloud-client) is required. The walkthrough covers creating a workflow, submitting, monitoring, debugging, and generating run statistics at a high level. As concepts and tools are introduced, links to the main Pegasus documentation are provided.

3.2. Start Your Virtual Machine

Please refer to chapter "Starting The Virtual Machine" to start a virtual machine on FutureGrid.

That chapter also includes the first instruction, how to log into your virtual machine instance as user tutorial using ssh. Please do so now.

3.3. Creating the Workflow (DAX)

Pegasus takes in an abstract workflow (DAX) and generates an executable workflow (DAG) that is run in an environment. For the purposes of this walkthrough, we will demonstrate the characteristics and structure of workflows by generating a workflow with a bit of Python code that uses the DAX API to generate a DAX. For a detailed description of workflows, how to create them, and how they are used in Pegasus see Creating Workflows

The workflow we will be creating is called Black Diamond because its shape. It is a made up workflow which has 4 jobs, files (f.*) passed between the jobs, and it is an interesting example as it shows data and job dependencies. The workflow looks like:

Figure 3.1. Figure: A "black diamond" DAX is the logical representation of tasks and dependencies.

Figure: A "black diamond" DAX is the logical representation of tasks and dependencies.

All the exercises in this chapter will be run from the $HOME/walkthrough/ directory . All the files that are required reside in this directory.

Change the directory to $HOME/walkthrough:

$ cd $HOME/walkthrough

The piece of code which generates the DAX is called a DAX generator, and in this case the code is written in Python. (The proper version of Python is pre-installed on the VM you are on.) APIs are also available for Java and Perl, and if that does not fit in your tool chain, you can also write the DAX XML directly.

Open the file create_diamond_dax.py

$ nano create_diamond_dax.py

The code has 6 logical sections:

  1. Imports and Pegasus location setup. We need to know the location of Pegasus because we need to import the DAX3 Pegasus Python module, and we need to know where on the file system the pegasus-keg executable is.

  2. A new abstract dag (DAX) is created. This is the main DAX object that we will add data, jobs and flow information to.

  3. A replica catalog is set up. Replica catalogs tell Pegasus where to find data. In this example workflow, we only have one input, f.a, and thus the replica catalog only has one entry. For larger workflows, it is not uncommon to have thousands of entries in the replica catalog, and sometimes mulitple physical locations for the same logical filename so that Pegasus can pick which replica to use. Replica catalogs can either be included in the DAX (like in this example) or be standalone files or services. For more information about replica catalogs, see the Data Discovery chapter.

  4. Executables are added. Just like we the replica catalog informs Pegasus on where to find data, the transformation catalog tells Pegasus where to find the executables for the workflow. The transformation catalog can exist inside the DAX (as in this example) or in a standalone file. You can list the same logical exectuable existing on multiple resources, and Pegasus will pick the appropiate one when the workflow is planned. More information can be found in the Executable Discovery chapter.

  5. Jobs are added. The 4 jobs in the Black Diamond in the picture above are added. Arguments are defined, and uses clauses are added to list input and output files. This is an imporant step, as it allows Pegasus to track the files, and stage the data if necessary.

  6. Control flows are set up. This is the edges in the picture, and defines parent/child relationships between the jobs. When the workflow is executing, this is the order the jobs will run in.

Close the file (Ctrl+X) and execute it. Since Pegasus 4.0, all Pegasus tools are installed according to the LFS with a prefix of pegasus- before binaries.

$ ./create_diamond_dax.py

The output is the DAX XML:

<?xml version="1.0" encoding="UTF-8"?>
<!-- generated: 2012-03-08 18:00:14.935638 -->
<!-- generated by: tutorial -->
<!-- generator: python -->
<adag xmlns="http://pegasus.isi.edu/schema/DAX"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://pegasus.isi.edu/schema/DAX http://pegasus.isi.edu/schema/dax-3.3.xsd" version="3.3" name="diamond">
        <file name="f.a">
                <pfn url="file:///home/tutorial/walkthrough/f.a" site="PegasusVM"/>
        </file>
        <executable name="preprocess" namespace="diamond" version="4.0" arch="x86_64" os="linux" installed="true">
                <pfn url="file:///usr/bin/pegasus-keg" site="PegasusVM"/>
        </executable>
        <executable name="analyze" namespace="diamond" version="4.0" arch="x86_64" os="linux" installed="true">
                <pfn url="file:///usr/bin/pegasus-keg" site="PegasusVM"/>
        </executable>
        <executable name="findrange" namespace="diamond" version="4.0" arch="x86_64" os="linux" installed="true">
                <pfn url="file:///usr/bin/pegasus-keg" site="PegasusVM"/>
        </executable>
        <job id="ID0000001" namespace="diamond" name="preprocess" version="4.0">
                <argument>-a preprocess -T60 -i <file name="f.a"/> -o <file name="f.b1"/> <file name="f.b2"/></argument>
                <uses name="f.b1" link="output"/>
                <uses name="f.a" link="input"/>
                <uses name="f.b2" link="output"/>
        </job>
        <job id="ID0000002" namespace="diamond" name="findrange" version="4.0">
                <argument>-a findrange -T60 -i <file name="f.b1"/> -o <file name="f.c1"/></argument>
                <uses name="f.c1" link="output"/>
                <uses name="f.b1" link="input"/>
        </job>
        <job id="ID0000003" namespace="diamond" name="findrange" version="4.0">
                <argument>-a findrange -T60 -i <file name="f.b2"/> -o <file name="f.c2"/></argument>
                <uses name="f.c2" link="output"/>
                <uses name="f.b2" link="input"/>
        </job>
        <job id="ID0000004" namespace="diamond" name="analyze" version="4.0">
                <argument>-a analyze -T60 -i <file name="f.c1"/> <file name="f.c2"/> -o <file name="f.d"/></argument>
                <uses name="f.c2" link="input"/>
                <uses name="f.d" link="output" register="true"/>
                <uses name="f.c1" link="input"/>
        </job>
        <child ref="ID0000002">
                <parent ref="ID0000001"/>
        </child>
        <child ref="ID0000003">
                <parent ref="ID0000001"/>
        </child>
        <child ref="ID0000004">
                <parent ref="ID0000002"/>
                <parent ref="ID0000003"/>
        </child>
</adag>

Note

We broke the overly long line of the XML root element for this document.

We need the DAX in a file to give to Pegasus, so run the same command again, but redirect it to file:

$ ./create_diamond_dax.py > diamond.dax 

More information about creating workflows can be found in the Creating Workflows chapter.

3.4. Submitting the Workflow

Submitting a Pegasus workflow consists of two steps, planning and submitting, but are often made by one single command for convenience. However, the planning stage is where Pegasus is doing powerful transformations to your workflow, so it is important to at least have an idea on what is happening under the covers. Planning includes, but is not limited to:

  1. Adding remote create dir jobs

  2. Adding stage in jobs to transfer data into the remote work directory

  3. Adding cleanup jobs to clean up the work directory as the workflow progresses

  4. Adding stage out jobs to transfer data to the final output location

  5. Adding registration jobs to register the data in a replica catalog

  6. Clusters job together - useful if you have many short tasks

  7. Adds wrappers to the jobs to collect provenance information - this is so statistics and plots can be created after a run

In the case of our black diamond workflow, here is what it looks like after the Pegasus planner has processed the DAX into a directed acyclic graph (DAG) of jobs:

Figure 3.2. Figure: The "black diamond" is an example for a concrete, runnable DAG of jobs.

Figure: The "black diamond" is an example for a concrete, runnable DAG of jobs.

In the above image, find the tasks from the abstract diamond DAX as the purple nodes in this concrete DAG.

Please note that the exact shape of your DAG may be slightly different. This is an effect of certain randomization during job clustering that have no impact on the correctness of the workflow. The minor differences may also be expressed in the total number of jobs you see later in this chapter.

To plan and submit the workflow, run:

$ pegasus-plan --conf pegasusrc --dir runs --sites PegasusVM --output local \
               --dax diamond.dax --submit

The output will look something like:

2012.03.08 18:07:15.936 EST:   Submitting job(s).
2012.03.08 18:07:15.945 EST:   1 job(s) submitted to cluster 1.
2012.03.08 18:07:15.963 EST:
2012.03.08 18:07:15.969 EST:   -----------------------------------------------------------------------
2012.03.08 18:07:15.978 EST:   File for submitting this DAG to Condor           : diamond-0.dag.condor.sub
2012.03.08 18:07:15.989 EST:   Log of DAGMan debugging messages                 : diamond-0.dag.dagman.out
2012.03.08 18:07:16.001 EST:   Log of Condor library output                     : diamond-0.dag.lib.out
2012.03.08 18:07:16.025 EST:   Log of Condor library error messages             : diamond-0.dag.lib.err
2012.03.08 18:07:16.033 EST:   Log of the life of condor_dagman itself          : diamond-0.dag.dagman.log
2012.03.08 18:07:16.041 EST:
2012.03.08 18:07:16.049 EST:   -----------------------------------------------------------------------
2012.03.08 18:07:16.057 EST:
2012.03.08 18:07:16.065 EST:   Your Workflow has been started and runs in base directory given below
2012.03.08 18:07:16.073 EST:
2012.03.08 18:07:16.081 EST:   cd /home/tutorial/walkthrough/runs/tutorial/pegasus/diamond/run0001
2012.03.08 18:07:16.089 EST:
2012.03.08 18:07:16.097 EST:   *** To monitor the workflow you can run ***
2012.03.08 18:07:16.109 EST:
2012.03.08 18:07:16.117 EST:   pegasus-status -l /home/tutorial/walkthrough/runs/tutorial/pegasus/diamond/run0001
2012.03.08 18:07:16.125 EST:
2012.03.08 18:07:16.133 EST:   *** To remove your workflow run ***
2012.03.08 18:07:16.141 EST:   pegasus-remove /home/tutorial/walkthrough/runs/tutorial/pegasus/diamond/run0001
2012.03.08 18:07:16.149 EST:
2012.03.08 18:07:16.157 EST:   Time taken to execute is 2.213 seconds

Tip

The work directory created by Pegasus is where the concrete workflow exists, and the directory is also the handle for Pegasus commands acting on that instance. Using this handle will be covered in the next section.

Further information about planning and submitting workflows can be found in the Running Workflows chapter. Information about the files in the work directory can be found in the Submit Directory Details chapter.

3.5. Monitoring, Debugging and Statistics

Once the workflow has been submitted, you can check status of it with the pegasus-status tool. Use the directory handle (which is different for every workflow you submit) from the previous step and run it with the -l flag:

$ pegasus-status -l /home/tutorial/walkthrough/runs/tutorial/pegasus/diamond/run0001
STAT  IN_STATE  JOB
Run      03:06  diamond-0
Run      00:46   |-findrange_ID0000002
Run      00:44   |-findrange_ID0000003
Idle     00:32   \_clean_up_preprocess_ID0000001
Summary: 4 Condor jobs total (I:1 R:3)

UNRDY READY   PRE  IN_Q  POST  DONE  FAIL %DONE STATE   DAGNAME
    8     0     0     4     0     4     0  25.0 Running *diamond-0.dag
Summary: 1 DAG total (Running:1)

The first section shows jobs currently available to the Condor scheduling sub-system. Pegasus builds on top of Condor to manage jobs. The second section shows a summary of the state of the workflow. Keep on checking the workflow with pegasus-status until it is 100% done.

Once the workflow has finished, you may use the pegasus-analyzer to debug the workflow. This is obviously most useful when workflows have failed for some reason. pegasus-analyzer will show you which jobs failed and the output of those jobs. Our simple black diamond should finish successfully, and pegasus-analyzer output should look like:

$ pegasus-analyzer /home/tutorial/walkthrough/runs/tutorial/pegasus/diamond/run0001
pegasus-analyzer: initializing...

************************************Summary*************************************

 Total jobs         :     15 (100.00%)
 # jobs succeeded   :     15 (100.00%)
 # jobs failed      :      0 (0.00%)
 # jobs unsubmitted :      0 (0.00%)

To get detailed run statistics, use the pegasus-statistics tool:

$ pegasus-statistics /home/tutorial/walkthrough/runs/tutorial/pegasus/diamond/run0001

**********************************************SUMMARY***********************************************
#legends

Workflow summary - Summary of the workflow execution. It shows total
                tasks/jobs/sub workflows run, how many succeeded/failed etc.
                In case of hierarchical workflow the calculation shows the
                statistics across all the sub workflows.It shows the following
                statistics about tasks, jobs and sub workflows.
                * Succeeded - total count of succeeded tasks/jobs/sub workflows.
                * Failed - total count of failed tasks/jobs/sub workflows.
                * Incomplete - total count of tasks/jobs/sub workflows that are
                not in succeeded or failed state. This includes all the jobs
                that are not submitted, submitted but not completed etc. This
                is calculated as  difference between 'total' count and sum of
                'succeeded' and 'failed' count.
                * Total - total count of tasks/jobs/sub workflows.
                * Retries - total retry count of tasks/jobs/sub workflows.
                * Total Run - total count of tasks/jobs/sub workflows executed
                during workflow run. This is the cumulative of retries,
                succeeded and failed count.

Workflow wall time - The walltime from the start of the workflow execution
                to the end as reported by the DAGMAN.In case of rescue dag the value
                is the cumulative of all retries.

Workflow cumulative job wall time - The sum of the walltime of all jobs as
                reported by kickstart. In case of job retries the value is the
                cumulative of all retries. For workflows having sub workflow jobs
                (i.e SUBDAG and SUBDAX jobs), the walltime value includes jobs from
                the sub workflows as well.

Cumulative job walltime as seen from submit side - The sum of the walltime of
                all jobs as reported by DAGMan. This is similar to the regular
                cumulative job walltime, but includes job management overhead and
                delays. In case of job retries the value is the cumulative of all
                retries. For workflows having sub workflow jobs (i.e SUBDAG and
                SUBDAX jobs), the walltime value includes jobs from the sub workflows
                as well.

-------------------------------------------------------------------------------------------------------------------------------------------------
Type                Succeeded           Failed              Incomplete          Total                    Retries             Total Run (Retries Included)
Tasks               4                   0                   0                   4                   ||   0                   4
Jobs                15                  0                   0                   15                  ||   0                   15
Sub Workflows       0                   0                   0                   0                   ||   0                   0
-------------------------------------------------------------------------------------------------------------------------------------------------

Workflow wall time                               : 5 mins, 17 secs,     (total 317 seconds)

Workflow cumulative job wall time                : 4 mins, 0 secs,      (total 240 seconds)

Cumulative job walltime as seen from submit side : 4 mins, 7 secs,      (total 247 seconds)

Summary                           : /home/tutorial/walkthrough/runs/tutorial/pegasus/diamond/run0001/statistics/summary.txt

****************************************************************************************************

More information about monitoring, debugging and statistics can be found in the Monitoring, Debugging and Statistics chapter.

3.6. Terminate Your Virtual Machine

Please refer to chapter "Terminate Your Virtual Machine" to shut down the currently running virtual machine.

However, if you feel like going on, skip the shutdown, go to the next chapter, and skip the start-up and log-in at the beginning of the next chapter.