Student notes for Pegasus tutorial

Introduction

These are the student notes for the Pegasus tutorial. They are designed to be used together with the instructor's presentation and support.

You will see two styles of machine text here:

Text like this is input that you should type.
Text like this is the output you should get.

For example:

$ date
Mon June 1 11:54:58 BST 2007

You will need to log into the tutorial machine, using an ssh client and the login name and password supplied separately.

On Windows, PuTTY is recommended as an ssh client.

On Linux or Mac OS X, open a terminal window and type:

$ ssh viz-username@viz-login.isi.edu
[welcome message]
viz-username@viz-login:~$

You will need to obtain grid credentials to run the workflows on TeraGrid.
TeraGrid provides a facility for obtaining grid credentials using MyProxy:

$ myproxy-logon -s myproxy.teragrid.org -l tg-username
Enter MyProxy pass phrase:
A credential has been received for user train22 in /tmp/x509up_u1055

Chapter 1: Running on the GRID using Pegasus

In this chapter you will be introduced to planning and running a workflow through Pegasus on a cluster. You will take a previously generated Montage workflow and run it on the GRID.

All the exercises in this chapter are run from the $HOME/tutorial/ directory. All the required files reside in this directory.

$ cd $HOME/tutorial
$ 

Files for the exercise are stored in subdirectories:

$ ls

config dags data fmri.xml templates 

You may also see some other files here.

Exercise 1.1: DAX

An abstract DAG has been generated and written in XML format to fmri.xml. Display fmri.xml in the terminal:

$ cat fmri.xml

Inside the DAX, you should see two sections:

  1. Definition of all jobs: one entry for each job in the workflow.
  2. List of control-flow dependencies: this section specifies a partial order in which the jobs are to be executed.
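
To get a quick feel for these two sections without reading the whole file, the sketch below counts job definitions and lists dependencies. It assumes the DAX uses <job ...> elements for jobs and <child ref="..."> elements containing <parent ref="..."/> lines for dependencies, with one element per line. The helper names dax_count_jobs and dax_list_edges and the two-job sample file are hypothetical and are not part of the tutorial files.

```shell
# Count job definitions; assumes each <job ...> element starts on its own line.
dax_count_jobs() {
  grep -c '<job ' "$1"
}

# Print "parent -> child" for every dependency; assumes each <child ref="...">
# and <parent ref="..."/> element sits on its own line.
dax_list_edges() {
  awk '
    /<child /  { match($0, /ref="[^"]*"/); child = substr($0, RSTART + 5, RLENGTH - 6) }
    /<parent / { match($0, /ref="[^"]*"/); print substr($0, RSTART + 5, RLENGTH - 6) " -> " child }
  ' "$1"
}

# Hypothetical two-job DAX used only for this demonstration.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<adag name="sample">
  <job id="ID000001" name="first_step"/>
  <job id="ID000002" name="second_step"/>
  <child ref="ID000002">
    <parent ref="ID000001"/>
  </child>
</adag>
EOF

dax_count_jobs "$sample"   # prints 2
dax_list_edges "$sample"   # prints ID000001 -> ID000002
rm -f "$sample"
```

On the tutorial machine, the same helpers could be pointed at the real file, for example dax_count_jobs fmri.xml, provided the file follows the layout assumed above.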

Exercise 1.2: SETTING UP THE REPLICA CATALOG (RLS)

In this exercise you will insert entries into the Replica Catalog. The replica catalog that we use is GLOBUS RLS.

A Replica Catalog maintains the mapping from logical file names (lfn) to physical file names (pfn) for the input files of your workflow. Pegasus queries it to determine the locations of the raw input data files required by the workflow. Additionally, all the materialized data is registered into RLS for later reuse.

You can use the rc-client command to insert, query and delete entries in the replica catalog.

The input data to be used for your workflow resides in the $HOME/Data directory. We are going to insert entries into the replica catalog that point to the files in this directory.
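
As an illustration of what an lfn to pfn mapping looks like, the sketch below prints one "lfn pfn" pair for every file in a directory. It is not the rc-client procedure from the tutorial: the helper name make_rc_entries, the gsiftp host gridftp.example.org, the sample file names and the two-column output layout are all assumptions made for this example.

```shell
# make_rc_entries DIR URL_PREFIX
# Print "lfn pfn" for each regular file in DIR. The lfn is the bare file name;
# the pfn is URL_PREFIX followed by the full path of the file.
make_rc_entries() {
  dir=$1
  url_prefix=$2
  for f in "$dir"/*; do
    [ -f "$f" ] || continue          # skip subdirectories and unmatched globs
    lfn=$(basename "$f")
    printf '%s %s%s/%s\n' "$lfn" "$url_prefix" "$dir" "$lfn"
  done
}

# Demonstration on a throwaway directory with two hypothetical input files.
demo_dir=$(mktemp -d)
: > "$demo_dir/input1.img"
: > "$demo_dir/input2.img"
make_rc_entries "$demo_dir" "gsiftp://gridftp.example.org"
rm -rf "$demo_dir"
```

Each output line pairs a logical name with one physical location, which is the kind of entry the replica catalog stores.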

The instructors have provided:

You will need to write some things yourself, by following the instructions below:

Instructions:

Congratulations!! You have the replica catalog set up correctly for use. This is the catalog you will tinker with most while running Pegasus.

Exercise 1.3: SETTING UP THE SITE CATALOG AND THE TRANSFORMATION CATALOG

In this exercise you will set up your Site Catalog and your Transformation Catalog.

The transformation catalog maintains information about where the application code resides on the grid. In our case, it contains the locations where the fmri code is installed on the various grid sites.

The site catalog contains information about the layout of the grid on which you want to run your workflows. For each site it records the work directories, the jobmanagers and gridftp servers to use, and other site-wide information such as environment variables to be set.

The instructors have provided:

You can look at them to get an idea of what they look like. For now, we will move ahead and plan your workflow through Pegasus. We need to get running on the GRID fast :). Time is short!!

Exercise 1.4: Running vds-plan to generate the concrete workflow (Condor submit files) and vds-run to submit the workflow to a grid resource

In this exercise we are going to run vds-plan to generate a concrete workflow from the abstract workflow (fmri.dax). The generated concrete workflow consists of Condor submit files, which are submitted to remote grid resources using Condor DAGMan and Condor-G. We will then submit the workflow to the grid using vds-run.

First, we are going to change fmri.dax slightly. The changes ensure that each of your DAXes refers to unique lfns, so that you cannot clobber your fellow students' runs.
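
One way to make lfns unique is to put a per-student prefix in front of every file name in the DAX. The sketch below does this with sed, assuming lfns appear as file="..." attributes. The helper name prefix_lfns and the sample line are hypothetical, and the steps used in the tutorial may differ. Note that this renames every lfn, including those of input files, so replica catalog entries would have to use the same prefixed names.

```shell
# prefix_lfns PREFIX
# Read a DAX on stdin and write it to stdout with PREFIX- inserted in front of
# every value of a file="..." attribute. PREFIX should be a plain word such as
# a login name (no slashes or ampersands).
prefix_lfns() {
  sed "s/file=\"\([^\"]*\)\"/file=\"$1-\1\"/g"
}

# Demonstration on a single hypothetical DAX line.
printf '%s\n' '<uses file="f.a" link="input"/>' | prefix_lfns student1
# prints <uses file="student1-f.a" link="input"/>
```

A whole file could be converted with a redirect, for example prefix_lfns student1 < fmri.dax > fmri-unique.dax, where the output file name is again only an example.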

The instructors have provided:

You will need to write some things yourself, by following the instructions below:

Instructions:

Exercise 1.5: Tracking the progress of the workflow and debugging the workflows

In this exercise we are going to list ways to track your workflow, and give some debugging hints when something goes wrong.

We will change into the directory that was reported by the vds-run command.

$ cd /nfs/home/@user@/tutorial/dags/ivdgl1/fmri/run000X

In this directory you will see a whole lot of files. That should not scare you. Unless things go wrong, you need to look at only a few files to track the progress of the workflow.

At first, you should be concerned with only one file


To see which of your jobs is being executed, you can also use condor_q. By default condor_q lists all the jobs on the submit host. However, we are only interested in our own jobs, so we will use some ClassAd magic to narrow the results.

Run the command condor_q -const '(Owner == "@user@")', with @user@ replaced by your username.

 $ condor_q -const '(Owner == "vdsuser-4")'
 252.0   vdsuser-4       11/29  17:04   0+00:11:21 R  0   9.8  condor_dagman -f -
 260.0   vdsuser-4       11/29  17:15   0+00:00:00 I  0   9.8  kickstart -n convert
 

The above shows two jobs in the queue: the condor_dagman job is running (state R) and the convert job is idle (state I), waiting to start. The dagman job is the graph manager and convert is the application code.
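
When the queue holds many jobs, it can help to reduce the listing to job IDs and states. The sketch below does that with awk, assuming the column layout of the sample output above, where the job ID is the first field and the state is the sixth. The helper name queue_states is hypothetical.

```shell
# queue_states
# Read condor_q output on stdin and print "<job id> <state>" for each job line.
# Header and summary lines are skipped because their first field is not of the
# form <cluster>.<proc>.
queue_states() {
  awk '$1 ~ /^[0-9]+\.[0-9]+$/ { print $1, $6 }'
}

# Demonstration using the sample output shown above.
queue_states <<'EOF'
 252.0   vdsuser-4       11/29  17:04   0+00:11:21 R  0   9.8  condor_dagman -f -
 260.0   vdsuser-4       11/29  17:15   0+00:00:00 I  0   9.8  kickstart -n convert
EOF
# prints:
# 252.0 R
# 260.0 I
```

On the submit host the helper would sit at the end of a pipe, for example condor_q -const '(Owner == "vdsuser-4")' | queue_states.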


Keep an eye on condor_q to track whether a workflow is still running. If you do not see any of your jobs in condor_q for some time (say 30 seconds), the workflow has finished. We need to wait because there may be a delay before Condor DAGMan releases the next job into the queue after a job has finished successfully.

If condor_q is empty, then your workflow has either:
- successfully completed
- stopped midway due to a non-recoverable error
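
The waiting rule described above can be sketched as a small polling loop: check the queue, and only report the workflow as finished once two checks in a row, 30 seconds apart by default, show no jobs. The helper names count_queue_jobs and wait_for_empty_queue are hypothetical, and the parsing assumes that job lines start with an ID of the form <cluster>.<proc> as in the sample output earlier. The loop cannot tell a successful run from a failed one; it only reports that the queue has emptied.

```shell
# count_queue_jobs
# Read condor_q output on stdin and print the number of job lines.
count_queue_jobs() {
  awk '$1 ~ /^[0-9]+\.[0-9]+$/ { n++ } END { print n + 0 }'
}

# wait_for_empty_queue OWNER [INTERVAL_SECONDS]
# Poll condor_q until OWNER has no jobs on two consecutive checks.
wait_for_empty_queue() {
  owner=$1
  interval=${2:-30}
  empty_checks=0
  while [ "$empty_checks" -lt 2 ]; do
    jobs=$(condor_q -const "(Owner == \"$owner\")" | count_queue_jobs)
    if [ "$jobs" -eq 0 ]; then
      empty_checks=$((empty_checks + 1))
    else
      empty_checks=0                 # a job reappeared, start counting again
    fi
    if [ "$empty_checks" -lt 2 ]; then
      sleep "$interval"
    fi
  done
  echo "no jobs for $owner in the queue"
}

# Example call (needs condor_q on the submit host):
#   wait_for_empty_queue vdsuser-4
```

Requiring two empty checks covers the gap in which DAGMan has seen one job finish but has not yet released the next one into the queue.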

The End