Student notes for Pegasus WMS tutorial

Chapter 1: Introduction

These are the student notes for the Pegasus WMS tutorial. They are designed to be used together with the instructor's presentation and support.

You will see two styles of machine text here:

Text like this is input that you should type.
Text like this is the output you should get.

For example:

$ date

Mon June 1 11:54:58 BST 2007

You will need to log into the tutorial machine, using an ssh client and the login name and password supplied separately.

On Windows, PuTTY is recommended as an ssh client.

For the purpose of this tutorial, replace any instance of @trainXX@ with your viz-login username.

On Linux or Mac OS X, open a terminal window and type:

$ ssh @trainXX@@viz-login.isi.edu

[welcome message]
trainXX@viz-login:~$

You will need to obtain grid credentials to run the workflows on the Grid.
You can generate your proxy using grid-proxy-init:

$ grid-proxy-init

Your identity: /O=Grid/OU=GlobusTest/OU=simpleCA-smarty.isi.edu/OU=isi.edu/CN=Tutorial User 01
Enter GRID pass phrase for this identity:
Creating proxy ......................................... Done
Your proxy is valid until: Mon Jan 28 22:38:42 2008

Check your proxy using grid-proxy-info.

$ grid-proxy-info

subject  : /O=Grid/OU=GlobusTest/OU=simpleCA-smarty.isi.edu/OU=isi.edu/CN=Tutorial User 01/CN=378830928
issuer   : /O=Grid/OU=GlobusTest/OU=simpleCA-smarty.isi.edu/OU=isi.edu/CN=Tutorial User 01
identity : /O=Grid/OU=GlobusTest/OU=simpleCA-smarty.isi.edu/OU=isi.edu/CN=Tutorial User 01
type     : Proxy draft (pre-RFC) compliant impersonation proxy
strength : 512 bits
path     : /tmp/x509up_u1045
timeleft : 11:58:23
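Before submitting a long-running workflow it can help to check that the proxy has enough lifetime left. The following is a minimal sketch and not part of the tutorial material: the helper name hms_to_seconds and the four-hour threshold are illustrative assumptions. It converts a timeleft value of the form hours:minutes:seconds, as in the output above, to seconds.

```shell
# Hypothetical helper (not part of Globus or Pegasus): convert an
# H:M:S "timeleft" value, as printed by grid-proxy-info, to seconds.
hms_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Sample value copied from the grid-proxy-info output above.
timeleft="11:58:23"
seconds=$(hms_to_seconds "$timeleft")

# Assumed threshold for illustration only: warn below four hours.
min_seconds=$((4 * 3600))
if [ "$seconds" -lt "$min_seconds" ]; then
    echo "proxy expires soon; run grid-proxy-init again"
else
    echo "proxy has $seconds seconds left"
fi
```

In a real session you would paste the timeleft value from your own grid-proxy-info output in place of the sample.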

Chapter 2: Mapping and executing workflows using Pegasus WMS

In this chapter you will be introduced to planning and executing a workflow through Pegasus WMS locally. You will then plan and execute a larger Montage workflow on the Grid.

All the exercises in this chapter will be run from the $HOME/tutorial/ directory. All the required files reside in this directory.

$ cd $HOME/tutorial
$ 

Files for the exercise are stored in subdirectories:

$ ls

config dags dax

You may also see some other files here.

Exercise 2.1: DAX

An abstract DAG has been generated for the Montage application and written in XML format to dax/montage.dax. Display montage.dax in the terminal:

$ cat dax/montage.dax

Inside the DAX, you should see three sections:

  1. a list of all the files used in the workflow
  2. the definition of every job in the workflow
  3. a list of control-flow dependencies, which specifies a partial order in which the jobs are to be executed
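
To get a feel for the three sections, the sketch below counts the entries of each kind in a tiny stand-in DAX. The sample file, its job names, and the element names (filename, job, child, parent) are assumptions based on the DAX 2.x layout; check them against what you see in dax/montage.dax before reusing the patterns.

```shell
# Write a tiny, hypothetical DAX to a temporary file. The element names
# are assumed to match the tutorial's montage.dax; verify before reuse.
dax=$(mktemp)
cat > "$dax" <<'EOF'
<adag name="example">
  <!-- part 1: list of all files used -->
  <filename file="f.a" link="input"/>
  <filename file="f.b" link="output"/>
  <!-- part 2: definition of all jobs -->
  <job id="ID000001" name="first">
    <uses file="f.a" link="input"/>
  </job>
  <job id="ID000002" name="second">
    <uses file="f.b" link="output"/>
  </job>
  <!-- part 3: list of control-flow dependencies -->
  <child ref="ID000002">
    <parent ref="ID000001"/>
  </child>
</adag>
EOF

# Count the lines that open each kind of element.
n_files=$(grep -c '<filename ' "$dax")
n_jobs=$(grep -c '<job ' "$dax")
n_deps=$(grep -c '<parent ' "$dax")
echo "files=$n_files jobs=$n_jobs dependencies=$n_deps"
rm -f "$dax"
```

The same three grep commands, pointed at dax/montage.dax, should give a rough size of the Montage workflow if the element names match.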

Exercise 2.2: Setting up the Replica Catalog

In this exercise you will insert entries into the Replica Catalog. The replica catalog that we will use today is a simple file-based catalog.
We also support and recommend Globus RLS or a JDBC implementation for production runs.

A Replica Catalog maintains the mapping from logical file names (LFNs) to physical file names (PFNs) for the input files of your workflow. Pegasus queries it to determine the locations of the raw input data files required by the workflow. Additionally, all the materialized data is registered in the replica catalog for later reuse.

You can use the rc-client command to insert, query, and delete entries in the replica catalog.

The input data to be used for your workflow resides in the /scratch/tutorial/inputdata/0.2degree directory. We are going to insert entries into the replica catalog that point to the files in this directory.
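
As an illustration of the LFN to PFN mapping, the sketch below prints one catalog line per input file. It is not the instructors' material: the file names, the gsiftp URL prefix, the site handle, and the "lfn pfn pool=..." line layout are all assumptions about the file-based catalog, and the sketch runs against a temporary stand-in directory rather than the real input data.

```shell
# Stand-in for /scratch/tutorial/inputdata/0.2degree, with hypothetical
# file names, so the sketch runs anywhere.
input_dir=$(mktemp -d)
touch "$input_dir/image1.fits" "$input_dir/image2.fits"

# Assumed URL prefix and site handle; adjust to the real setup.
url_prefix="gsiftp://viz-login.isi.edu/scratch/tutorial/inputdata/0.2degree"
site="local"

# Emit one "lfn pfn attribute" line per file. This line layout is an
# assumption about the file-based catalog, not a documented guarantee.
make_entries() {
    for f in "$1"/*; do
        lfn=$(basename "$f")
        printf '%s %s/%s pool="%s"\n' "$lfn" "$url_prefix" "$lfn" "$site"
    done
}

entries=$(make_entries "$input_dir")
echo "$entries"
rm -rf "$input_dir"
```

Each printed line pairs a logical name (the bare file name) with a physical location (a full URL), which is the kind of lookup Pegasus performs when it plans the workflow.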

The instructors have provided:

Instructions:

Congratulations! You now have the replica catalog set up correctly. This is the catalog you will modify most often when running Pegasus.

Exercise 2.3: Setting up the Site Catalog and Transformation Catalog

In this exercise you will set up your Site Catalog and Transformation Catalog.

The transformation catalog maintains information about where the application code resides on the grid. In our case, it contains the locations where the Montage code is installed on the various grid sites.

The site catalog describes the layout of the grid on which you want to run your workflows. For each site it records information such as the work directories, the jobmanagers and GridFTP servers to use, and other site-wide settings such as the environment variables to be set.

The instructors have provided:

$ cat config/tc.data

local   bin/mDiff       gsiftp://sukhna.isi.edu/usr/sukhna/work/montage/software/default/bin/mDiff              STATIC_BINARY   INTEL32::LINUX  ENV::MONTAGE_HOME="."
local   bin/mDiff       gsiftp://viz-login.isi.edu/nfs/software/montage/montage-3.0_beta33-ia64/bin/mDiff       STATIC_BINARY   INTEL64::LINUX  ENV::MONTAGE_HOME="."
local   bin/mFitplane   gsiftp://sukhna.isi.edu/usr/sukhna/work/montage/software/default/bin/mFitplane          STATIC_BINARY   INTEL32::LINUX  NULL
local   bin/mFitplane   gsiftp://viz-login.isi.edu/nfs/software/montage/montage-3.0_beta33-ia64/bin/mFitplane   STATIC_BINARY   INTEL64::LINUX  NULL
local   mAdd:3.0        gsiftp://sukhna.isi.edu/usr/sukhna/work/montage/software/default/bin/mAdd               STATIC_BINARY   INTEL32::LINUX  NULL
local   mAdd:3.0        gsiftp://viz-login.isi.edu/nfs/software/montage/montage-3.0_beta33-ia64/bin/mAdd        STATIC_BINARY   INTEL64::LINUX  NULL
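
Each tc.data line appears to hold six whitespace-separated columns: site handle, logical transformation name, physical location, type, system information, and profiles. That reading of the columns is inferred from the listing above rather than stated in these notes. Under that assumption, a short awk sketch can list the entries for one architecture; the sample lines are copied from the listing, and the function name list_for_arch is made up for this example.

```shell
# Three lines copied from the tc.data listing above.
tc=$(mktemp)
cat > "$tc" <<'EOF'
local   bin/mDiff       gsiftp://sukhna.isi.edu/usr/sukhna/work/montage/software/default/bin/mDiff              STATIC_BINARY   INTEL32::LINUX  ENV::MONTAGE_HOME="."
local   bin/mDiff       gsiftp://viz-login.isi.edu/nfs/software/montage/montage-3.0_beta33-ia64/bin/mDiff       STATIC_BINARY   INTEL64::LINUX  ENV::MONTAGE_HOME="."
local   mAdd:3.0        gsiftp://viz-login.isi.edu/nfs/software/montage/montage-3.0_beta33-ia64/bin/mAdd        STATIC_BINARY   INTEL64::LINUX  NULL
EOF

# Print "transformation -> location" for entries whose fifth column
# (assumed to be the system information) matches the requested value.
list_for_arch() {
    awk -v arch="$1" '$5 == arch { print $2 " -> " $3 }' "$2"
}

ia64_entries=$(list_for_arch "INTEL64::LINUX" "$tc")
echo "$ia64_entries"
rm -f "$tc"
```

Running the same function against config/tc.data should show which executables are available for each architecture.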

Open the properties file and check a few properties.

$ cat config/properties

## SELECT THE REPLICAT CATALOG MODE AND URL
pegasus.catalog.replica = SimpleFile
pegasus.catalog.replica.file = ${user.home}/tutorial/config/rc.data
#pegasus.catalog.replica.url=rlsn://smarty.isi.edu

## SELECT THE SITE CATALOG MODE AND FILE
pegasus.catalog.site = XML
pegasus.catalog.site.file = ${user.home}/tutorial/config/sites.xml


## SELECT THE TRANSFORMATION CATALOG MODE AND FILE
pegasus.catalog.transformation = File
pegasus.catalog.transformation.file = ${user.home}/tutorial/config/tc.data

## SET UP THE WORK AND INVOCATION DATABASE
pegasus.catalog.work =  Database
pegasus.catalog.provenance = InvocationSchema

## Database related properties
pegasus.catalog.*.db.driver = MySQL
pegasus.catalog.*.db.url = jdbc:mysql://smarty.isi.edu/tg2007
pegasus.catalog.*.db.user = tg2007user
pegasus.catalog.*.db.password =  Teragrid2007

## USE DAGMAN RETRY FEATURE FOR FAILURES
pegasus.dagman.retry=2

## STAGE ALL OUR EXECUTABLES
pegasus.catalog.transformation.mapper = Staged

## CHECK JOB EXIT CODES FOR FAILURE
pegasus.exitcode.scope=all

## OPTIMZE DATA & EXECUTABLE TRANSFERS
pegasus.transfer.refiner=Bundle

#STAGE DATA AND EXECUTABLES USING GRIDFTP 3rd PARTY MODE
pegasus.transfer.*.thirdparty.sites=*

## WORK AND STORAGE DIR  
## CHANGE THESE TO YOUR TERAGRID USERNAME
pegasus.dir.storage = xxxxx/storage
pegasus.dir.exec = xxxxx/exec

Edit the properties pegasus.dir.storage and pegasus.dir.exec to specify relative paths for your data storage and workflow execution directories. Replace xxxxx with your @trainXX@ value.

$ vim config/properties
[...]
$ cat config/properties

pegasus.dir.storage = @trainXX@/storage
pegasus.dir.exec = @trainXX@/exec
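
If you prefer not to edit the file by hand, the same change can be sketched with sed. This is an illustration run on a temporary copy of the two relevant lines; the username train01 is a placeholder for your own @trainXX@ value, and the variable names are chosen for this example.

```shell
# Work on a temporary copy holding the two lines to be changed.
props=$(mktemp)
cat > "$props" <<'EOF'
pegasus.dir.storage = xxxxx/storage
pegasus.dir.exec = xxxxx/exec
EOF

# Placeholder username; substitute your own login name.
login_name="train01"

# Replace the xxxxx placeholder only on lines starting with pegasus.dir.
edited=$(sed "/^pegasus\.dir\./s/xxxxx/$login_name/" "$props")
echo "$edited"
rm -f "$props"
```

Restricting the substitution to the pegasus.dir. lines avoids touching any other line that might happen to contain the placeholder text.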

You can look at the catalog files to get an idea of what they look like. For now, we will move ahead and plan your workflow through Pegasus. We need to get running on the Grid fast :). Time is short!

In production mode, the sc-client interfaces with Globus MDS to retrieve information about the various sites.
The pegasus-get-sites client can also be used to generate a site catalog and transformation catalog for the Open Science Grid.

Exercise 2.4: Running pegasus-plan to generate an executable workflow (Condor submit files) and pegasus-run to submit the workflow locally

In this exercise we are going to run pegasus-plan to generate a concrete workflow from the abstract workflow (diamond.dax). The resulting concrete workflow consists of Condor submit files, which are submitted locally using pegasus-run.

The instructors have provided:

You will need to write some things yourself, by following the instructions below:

Instructions: