These are the student notes for the Pegasus WMS tutorial on the Virtual Machine that can be downloaded from the Pegasus Website. They are designed to be used in conjunction with instructor presentation and support.
You will see two styles of machine text here:
Text like this is input that you should type.
Text like this is the output you should get.
For example:
$ date
Fri Mar 18 12:50:05 PDT 2011
You will need to install Virtual Box to run the virtual machine on your computer. If you already have it installed, use that. Otherwise, download a binary version from the Virtual Box Website and install it.
The instructors have tested the image with Virtual Box 4.0.6.
Download the corresponding disk image.
It is around 576 MB in size. We recommend using a command line tool like wget to download the image, because downloading it with a browser may sometimes corrupt it. If you are running Windows, try downloading with Firefox instead of Internet Explorer.
$ wget http://pegasus.isi.edu/wms/download/3.0/Pegasus-3.0.2-Debian-6-x86.vbox.tar.bz2
--12:43:50-- http://pegasus.isi.edu/wms/download/3.0/Pegasus-3.0.2-Debian-6-x86.vbox.tar.bz2
=> `Pegasus-3.0.2-Debian-6-x86.vbox.tar.bz2'
Resolving pegasus.isi.edu... 128.9.64.219
Connecting to pegasus.isi.edu|128.9.64.219|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 604,017,271 (576M) [application/x-bzip2]
The image is a bzip2-compressed tar archive, so you will need to extract it. On Windows you may need WinZip or a similar tool to extract the VM files.
If you have GNU tar, you can do this directly:
$ gtar jxvf Pegasus-3.0.2-Debian-6-x86.vbox.tar.bz2
Otherwise, do the following:
$ bunzip2 Pegasus-3.0.2-Debian-6-x86.vbox.tar.bz2
$ tar xvf Pegasus-3.0.2-Debian-6-x86.vbox.tar
After untarring, a folder named Pegasus-3.0.2-Debian-6-x86.vbox will be created that contains the vmdk files for the VM.
Launch Virtual Box on your machine. Follow the steps below to add the vmdk file to Virtual Box and create a virtual machine.
In the menu, click Machine and select New (Machine > New).
This opens the New Virtual Machine Wizard. Click Continue.
In the VM Name and OS Type window, specify the name as PegasusVM-3.0.2. Select Linux as the Operating System and Debian as the Version. Click Continue.
Set the base memory to 384 MB. It defaults to 512 MB. If you have more RAM on your laptop/desktop, feel free to adjust this setting. Click Continue.
Now select the virtual hard disk to use with the machine. Select the option box for Use Existing Hard Disk. Click the folder icon next to the list and locate the file Debian-6-x86.vmdk in the folder Pegasus-3.0.2-Debian-6-x86.vbox. Click Continue.
Click Done.
Now, in Virtual Box, start the PegasusVM-3.0.2 machine.
In this chapter you will be introduced to planning and executing a workflow through Pegasus WMS locally. You will then plan and execute a larger Montage workflow on the GRID.
When the virtual machine starts, it will automatically log you in as user tutorial. The password for this account is pegasus.
After logging on, start a terminal. There is a shortcut on the desktop for the terminal.
$ pwd
/home/tutorial
In general, to run workflows on the Grid you will need to obtain Grid credentials. The VM already has a user certificate installed for the pegasus user. To generate the proxy (Grid credentials), run the grid-proxy-init command.
$ grid-proxy-init
Your identity: /O=edu/OU=ISI/OU=isi.edu/CN=Tutorial User
Creating proxy ............................................. Done
Your proxy is valid until: Thu Dec 23 22:41:36 2010
All the exercises in this chapter will be run from the $HOME/pegasus-wms/ directory. All the files that are required reside in this directory.
$ cd $HOME/pegasus-wms
Files for the exercise are stored in subdirectories:
$ ls
config dax
You may also see some other files here.
We will generate a 4-node diamond DAX. A small piece of Java code uses the DAX API to generate it. Open the file $HOME/pegasus-wms/dax/CreateDAX.java in a file editor:
$ vi dax/CreateDAX.java
There is a function Diamond( String site_handle, String pegasus_location ) that constructs the DAX. Towards the end of the function there is some commented out code.
// Add analyze job
//To be uncommented for exercise 2.1
Job j4 = new Job("j4", "pegasus", "analyze", "4.0");
j4.addArgument("-a analyze -T 60 -i ").addArgument(fc1);
j4.addArgument(" ").addArgument(fc2);
j4.addArgument("-o ").addArgument(fd);
j4.uses(fc1, File.LINK.INPUT);
j4.uses(fc2, File.LINK.INPUT);
j4.uses(fd, File.LINK.OUTPUT);
//add job to the DAX
dax.addJob(j4);
//analyze job is a child to the findrange jobs
dax.addDependency("j2", "j4");
dax.addDependency("j3", "j4");
//End of commented out code for Exercise 2.1
The above snippet of code adds a job with the ID j4 to the DAX. It illustrates how to specify:
the arguments for the job
the logical files used by the job
the dependencies to other jobs
adding the job to the dax
After uncommenting the code, compile and run the CreateDAX program.
$ cd dax
$ javac -classpath .:/opt/pegasus/default/lib/pegasus.jar CreateDAX.java
$ java -classpath .:/opt/pegasus/default/lib/pegasus.jar CreateDAX local /opt/pegasus/default ./diamond.dax
Let us view the generated diamond.dax.
$ cat diamond.dax
Inside the DAX, you should see four sections:
list of input file locations
list of executable locations
definition of all jobs - one entry for each job in the workflow, 4 jobs in total.
list of control-flow dependencies - this section specifies a partial order in which jobs are to be executed.
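The partial order expressed by the dependency section can be illustrated with a small stand-alone program. This is only a sketch in plain Java and does not use the Pegasus DAX API: the class DiamondOrder is hypothetical, and the edges from j1 to j2 and j3 are assumed from the diamond shape (the snippet above only shows the edges into j4). It derives one valid execution order with Kahn's algorithm.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Pegasus: models jobs and parent/child edges.
public class DiamondOrder {
    private final Map<String, List<String>> children = new LinkedHashMap<String, List<String>>();
    private final Map<String, Integer> parentCount = new LinkedHashMap<String, Integer>();

    public void addJob(String id) {
        children.put(id, new ArrayList<String>());
        parentCount.put(id, 0);
    }

    public void addDependency(String parent, String child) {
        children.get(parent).add(child);
        parentCount.put(child, parentCount.get(child) + 1);
    }

    // Kahn's algorithm: a job becomes ready once all of its parents have run.
    public List<String> executionOrder() {
        Map<String, Integer> waiting = new LinkedHashMap<String, Integer>(parentCount);
        Deque<String> ready = new ArrayDeque<String>();
        for (Map.Entry<String, Integer> e : waiting.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }
        List<String> order = new ArrayList<String>();
        while (!ready.isEmpty()) {
            String job = ready.remove();
            order.add(job);
            for (String child : children.get(job)) {
                int left = waiting.get(child) - 1;
                waiting.put(child, left);
                if (left == 0) {
                    ready.add(child);
                }
            }
        }
        if (order.size() != children.size()) {
            throw new IllegalStateException("dependencies contain a cycle");
        }
        return order;
    }

    // Builds the 4-job diamond; the j1 edges are an assumption for illustration.
    public static DiamondOrder diamond() {
        DiamondOrder d = new DiamondOrder();
        for (String id : new String[] {"j1", "j2", "j3", "j4"}) {
            d.addJob(id);
        }
        d.addDependency("j1", "j2");
        d.addDependency("j1", "j3");
        d.addDependency("j2", "j4");
        d.addDependency("j3", "j4");
        return d;
    }

    public static void main(String[] args) {
        System.out.println(diamond().executionOrder());
    }
}
```

Note that j2 and j3 have no edge between them, so a scheduler is free to run them in either order or at the same time; the sketch simply lists them in insertion order.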
First, let us change to the tutorial base directory.
$ cd $HOME/pegasus-wms
In this exercise you will insert entries into the Replica Catalog. The replica catalog that we will use today is a simple file-based catalog. We also support, and recommend for production runs, the following:
Globus RLS
JDBC implementation
A Replica Catalog maintains the LFN to PFN mapping for the input files of your workflow. Pegasus queries it to determine the locations of the raw input data files required by the workflow. Additionally, all the materialized data is registered into Replica Catalog for data reuse later on.
The instructors have provided a File based Replica Catalog configured for the tutorial exercises. The file is inside the config directory.
Let us see what the file looks like.
$ cat config/rc.data
statfile_20070529_153243_22618.tbl gsiftp://pegasus-vm/scratch/tutorial/inputdata/0.2degree/statfile.tbl pool="local"
2mass-atlas-990502s-j1440198.fits gsiftp://pegasus-vm/scratch/0.2degree/2mass-atlas-990502s-j1440198.fits pool="local"
2mass-atlas-990502s-j1440186.fits gsiftp://pegasus-vm/scratch/0.2degree/2mass-atlas-990502s-j1440186.fits pool="local"
2mass-atlas-990502s-j1430092.fits gsiftp://pegasus-vm/scratch/0.2degree/2mass-atlas-990502s-j1430092.fits pool="local"
2mass-atlas-990502s-j1420198.fits gsiftp://pegasus-vm/scratch/0.2degree/2mass-atlas-990502s-j1420198.fits pool="local"
2mass-atlas-990502s-j1420186.fits gsiftp://pegasus-vm/scratch/0.2degree/2mass-atlas-990502s-j1420186.fits pool="local"
cimages_20070529_153243_22618.tbl gsiftp://pegasus-vm/scratch/0.2degree/cimages.tbl pool="local"
pimages_20070529_153243_22618.tbl gsiftp://pegasus-vm/scratch/0.2degree/pimages.tbl pool="local"
region_20070529_153243_22618.hdr gsiftp://pegasus-vm/scratch/0.2degree/region.hdr pool="local"
2mass-atlas-990502s-j1430080.fits gsiftp://pegasus-vm/scratch/0.2degree/2mass-atlas-990502s-j1430080.fits pool="local"
...
You can use the pegasus-rc-client command to insert, query and delete entries in the replica catalog.
Before executing any of the pegasus-rc-client exercises, let us remove the pre-populated replica catalog.
$ rm $HOME/pegasus-wms/config/rc.data
To execute the diamond dax created in exercise 2.1, we will need to register input file f.a in the replica catalog. The file f.a resides at /scratch/tutorial/inputdata/diamond/f.a . Let us insert a single entry into the replica catalog.
$ pegasus-rc-client insert f.a \
gsiftp://pegasus-vm/scratch/tutorial/inputdata/diamond/f.a pool=local
Let us now verify that f.a has been registered successfully by querying the replica catalog using pegasus-rc-client.
$ pegasus-rc-client lookup f.a
f.a gsiftp://pegasus-vm/scratch/tutorial/inputdata/diamond/f.a pool="local"
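To make the LFN to PFN mapping concrete, here is a minimal sketch of an in-memory catalog that reads lines in the layout shown above (logical name, physical URL, then a pool attribute). The class FileReplicaCatalog and its methods are hypothetical and written for illustration only; this is not how pegasus-rc-client is implemented, and the real file format may accept more than this sketch handles.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of an LFN -> PFN mapping; not Pegasus code.
public class FileReplicaCatalog {
    /** One physical replica of a logical file. */
    public static final class Replica {
        public final String pfn;
        public final String site; // value of the pool attribute, or null if absent

        Replica(String pfn, String site) {
            this.pfn = pfn;
            this.site = site;
        }
    }

    private final Map<String, List<Replica>> entries =
            new LinkedHashMap<String, List<Replica>>();

    /** Parses one line of the form: LFN PFN pool="site". Returns false for blank lines. */
    public boolean insertLine(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return false;
        }
        String[] fields = trimmed.split("\\s+");
        if (fields.length < 2) {
            throw new IllegalArgumentException("expected an LFN and a PFN: " + line);
        }
        String site = null;
        for (int i = 2; i < fields.length; i++) {
            if (fields[i].startsWith("pool=")) {
                // Accept both pool=local and pool="local".
                site = fields[i].substring("pool=".length()).replace("\"", "");
            }
        }
        List<Replica> replicas = entries.get(fields[0]);
        if (replicas == null) {
            replicas = new ArrayList<Replica>();
            entries.put(fields[0], replicas);
        }
        replicas.add(new Replica(fields[1], site));
        return true;
    }

    /** Returns every known replica of the logical file, or an empty list. */
    public List<Replica> lookup(String lfn) {
        List<Replica> replicas = entries.get(lfn);
        if (replicas == null) {
            return Collections.<Replica>emptyList();
        }
        return Collections.unmodifiableList(replicas);
    }

    public static void main(String[] args) {
        FileReplicaCatalog rc = new FileReplicaCatalog();
        rc.insertLine("f.a gsiftp://pegasus-vm/scratch/tutorial/inputdata/diamond/f.a pool=\"local\"");
        for (Replica r : rc.lookup("f.a")) {
            System.out.println("f.a " + r.pfn + " pool=\"" + r.site + "\"");
        }
    }
}
```

A logical file maps to a list rather than a single value because the same LFN may have replicas on several sites.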
The pegasus-rc-client also allows for bulk insertion of entries. We will be inserting the entries for montage workflow using the bulk mode. The input data to be used for the montage workflow resides in the /scratch/tutorial/inputdata/0.2degree directory. We are going to insert entries into the replica catalog that point to the files in this directory.
The instructors have provided:
A file rc.in, the input data file for the pegasus-rc-client, that contains the mappings that need to be populated in the Replica Catalog. The file is inside the config directory.
Let us see what the file looks like.
$ cat config/rc.in
statfile_20070529_153243_22618.tbl gsiftp://pegasus-vm/scratch/tutorial/inputdata/0.2degree/statfile.tbl pool="local"
2mass-atlas-990502s-j1440198.fits gsiftp://pegasus-vm/scratch/0.2degree/2mass-atlas-990502s-j1440198.fits pool="local"
2mass-atlas-990502s-j1440186.fits gsiftp://pegasus-vm/scratch/0.2degree/2mass-atlas-990502s-j1440186.fits pool="local"
2mass-atlas-990502s-j1430092.fits gsiftp://pegasus-vm/scratch/0.2degree/2mass-atlas-990502s-j1430092.fits pool="local"
2mass-atlas-990502s-j1420198.fits gsiftp://pegasus-vm/scratch/0.2degree/2mass-atlas-990502s-j1420198.fits pool="local"
2mass-atlas-990502s-j1420186.fits gsiftp://pegasus-vm/scratch/0.2degree/2mass-atlas-990502s-j1420186.fits pool="local"
cimages_20070529_153243_22618.tbl gsiftp://pegasus-vm/scratch/0.2degree/cimages.tbl pool="local"
pimages_20070529_153243_22618.tbl gsiftp://pegasus-vm/scratch/0.2degree/pimages.tbl pool="local"
region_20070529_153243_22618.hdr gsiftp://pegasus-vm/scratch/0.2degree/region.hdr pool="local"
2mass-atlas-990502s-j1430080.fits gsiftp://pegasus-vm/scratch/0.2degree/2mass-atlas-990502s-j1430080.fits pool="local"
Now we are ready to run pegasus-rc-client and populate the data. Since each of you has an individual file-based replica catalog, all the entries should be registered successfully.
$ pegasus-rc-client --insert config/rc.in
#Successfully worked on : 12 lines
#Worked on total number of : 12 lines.
Now the entries have been successfully inserted into the Replica Catalog. We should query the replica catalog for a particular lfn.
$ pegasus-rc-client lookup pimages_20080505_143233_14944.tbl
pimages_20080505_143233_14944.tbl gsiftp://pegasus-vm/scratch/tutorial/inputdata/0.2degree/pimages.tbl pool="local"
The site catalog describes the layout of the grid on which you want to run your workflows. For each site, the following information is maintained:
grid gateways
head node filesystem
worker node filesystem
scratch and shared file systems on the head nodes and worker nodes
replica catalog URL for the site
site wide information like environment variables to be set when a job is run.
The instructors have provided a pre-populated site catalog for use in the tutorial in $HOME/pegasus-wms/config directory.
Let us look at the site catalog for the Pegasus VM. It refers to two sites, local and cluster.
$ cat $HOME/pegasus-wms/config/sites.xml3
<sitecatalog xmlns="http://pegasus.isi.edu/schema/sitecatalog" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://pegasus.isi.edu/schema/sitecatalog http://pegasus.isi.edu/schema/sc-3.0.xsd" version="3.0">
<site handle="cluster" arch="x86" os="LINUX" osrelease="" osversion="" glibc="">
<grid type="gt2" contact="pegasus-vm/jobmanager-fork" scheduler="Fork" jobtype="auxillary"/>
<grid type="gt2" contact="pegasus-vm/jobmanager-condor" scheduler="SGE" jobtype="compute"/>
<head-fs>
<scratch>
<shared>
<file-server protocol="gsiftp" url="gsiftp://pegasus" mount-point="/home/tutorial/cluster-scratch"/>
<internal-mount-point mount-point="/home/tutorial/cluster-scratch"/>
</shared>
</scratch>
<storage>
<shared>
<file-server protocol="gsiftp" url="gsiftp://pegasus" mount-point="/home/tutorial/cluster-storage"/>
<internal-mount-point mount-point="/home/tutorial/cluster-storage"/>
</shared>
</storage>
</head-fs>
<replica-catalog type="LRC" url="rlsn://localhost"/>
<profile namespace="env" key="GLOBUS_LOCATION" >/opt/globus/default</profile>
<profile namespace="env" key="JAVA_HOME" >/usr</profile>
<profile namespace="env" key="LD_LIBRARY_PATH" >/opt/globus/default/lib</profile>
<profile namespace="env" key="PEGASUS_HOME" >/opt/pegasus/default</profile>
<profile namespace="pegasus" key="clusters.num" >1</profile>
<profile namespace="pegasus" key="stagein.clusters" >1</profile>
</site>
<site handle="local" arch="x86" os="LINUX" osrelease="" osversion="" glibc="">
<grid type="gt2" contact="localhost/jobmanager-fork" scheduler="Fork" jobtype="auxillary"/>
<grid type="gt2" contact="localhost/jobmanager-fork" scheduler="Fork" jobtype="compute"/>
<head-fs>
<scratch>
<shared>
<file-server protocol="gsiftp" url="file://" mount-point="/home/tutorial/local-scratch"/>
<internal-mount-point mount-point="/home/tutorial/local-scratch"/>
</shared>
</scratch>
<storage>
<shared>
<file-server protocol="gsiftp" url="file://" mount-point="/home/tutorial/local-storage"/>
<internal-mount-point mount-point="/home/tutorial/local-storage"/>
</shared>
</storage>
</head-fs>
<replica-catalog type="LRC" url="rlsn://localhost"/>
<profile namespace="env" key="GLOBUS_LOCATION" >/opt/globus/default</profile>
<profile namespace="env" key="JAVA_HOME" >/usr</profile>
<profile namespace="env" key="LD_LIBRARY_PATH" >/opt/globus/default/lib</profile>
<profile namespace="env" key="PEGASUS_HOME" >/opt/pegasus/default</profile>
</site>
</sitecatalog>
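A quick way to see which gateways a site catalog defines is to walk the XML with the standard Java DOM parser. The following is a rough sketch only: the class SiteCatalogPeek is hypothetical, the embedded SAMPLE is a cut-down copy of the listing above (grid elements only), and the code has nothing to do with how the Pegasus planner itself reads the catalog.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration; not part of Pegasus.
public class SiteCatalogPeek {
    // Cut-down copy of the tutorial site catalog, keeping only the grid gateways.
    public static final String SAMPLE =
        "<sitecatalog xmlns=\"http://pegasus.isi.edu/schema/sitecatalog\" version=\"3.0\">"
        + "<site handle=\"cluster\" arch=\"x86\" os=\"LINUX\">"
        + "<grid type=\"gt2\" contact=\"pegasus-vm/jobmanager-fork\" scheduler=\"Fork\" jobtype=\"auxillary\"/>"
        + "<grid type=\"gt2\" contact=\"pegasus-vm/jobmanager-condor\" scheduler=\"SGE\" jobtype=\"compute\"/>"
        + "</site>"
        + "<site handle=\"local\" arch=\"x86\" os=\"LINUX\">"
        + "<grid type=\"gt2\" contact=\"localhost/jobmanager-fork\" scheduler=\"Fork\" jobtype=\"auxillary\"/>"
        + "<grid type=\"gt2\" contact=\"localhost/jobmanager-fork\" scheduler=\"Fork\" jobtype=\"compute\"/>"
        + "</site>"
        + "</sitecatalog>";

    /** Returns one "handle jobtype -> contact" row per grid gateway. */
    public static List<String> describeSites(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<String> rows = new ArrayList<String>();
            NodeList sites = doc.getElementsByTagName("site");
            for (int i = 0; i < sites.getLength(); i++) {
                Element site = (Element) sites.item(i);
                NodeList grids = site.getElementsByTagName("grid");
                for (int g = 0; g < grids.getLength(); g++) {
                    Element grid = (Element) grids.item(g);
                    rows.add(site.getAttribute("handle") + " "
                            + grid.getAttribute("jobtype") + " -> "
                            + grid.getAttribute("contact"));
                }
            }
            return rows;
        } catch (Exception e) {
            // Wrap parser errors so callers do not need checked exceptions.
            throw new IllegalStateException("could not parse site catalog", e);
        }
    }

    public static void main(String[] args) {
        for (String row : describeSites(SAMPLE)) {
            System.out.println(row);
        }
    }
}
```

The jobtype value is spelled "auxillary" here because that is how it appears in the catalog listing.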
The client pegasus-sc-client can be used to generate a site catalog and transformation catalog for the Open Science Grid.
$ pegasus-sc-client --vo engage --sc ./engage-osg-sc.xml \
--source OSGMM --grid OSG -vvvv
2010.11.24 18:00:46.410 PST: [INFO] Skipping site CIT_CMS_T2
2010.11.24 18:00:46.416 PST: [INFO] Adding site RENCI-Engagement
2010.11.24 18:00:46.475 PST: [INFO] Adding site Nebraska
2010.11.24 18:00:46.476 PST: [INFO] Adding site Prairiefire
2010.11.24 18:00:46.476 PST: [INFO] Adding site BNL-ATLAS
2010.11.24 18:00:46.477 PST: [INFO] Adding site BNL-ATLAS__1
2010.11.24 18:00:46.478 PST: [INFO] Adding site UFlorida-PG
2010.11.24 18:00:46.478 PST: [INFO] Skipping site CIT_CMS_T2__1
2010.11.24 18:00:46.478 PST: [INFO] Adding site RENCI-Blueridge
2010.11.24 18:00:46.479 PST: [INFO] Adding site Nebraska__1
2010.11.24 18:00:46.480 PST: [INFO] Adding site UMissHEP
2010.11.24 18:00:46.480 PST: [INFO] Adding site UCR-HEP
2010.11.24 18:00:46.481 PST: [INFO] Adding site LIGO_UWM_NEMO
2010.11.24 18:00:46.482 PST: [INFO] Adding site FNAL_FERMIGRID
2010.11.24 18:00:46.482 PST: [INFO] Adding site USCMS-FNAL-WC1
2010.11.24 18:00:46.483 PST: [INFO] Adding site UConn-OSG
2010.11.24 18:00:46.484 PST: [INFO] Adding site UFlorida-HPC
2010.11.24 18:00:46.484 PST: [INFO] Adding site GridUNESP_CENTRAL
2010.11.24 18:00:46.493 PST: [INFO] Adding site NWICG_NotreDame
2010.11.24 18:00:46.494 PST: [INFO] Site LOCAL . Creating default entry
2010.11.24 18:00:46.527 PST: [INFO] Loaded 19 sites
2010.11.24 18:00:46.527 PST: Writing out site catalog to /home/tutorial/pegasus-wms/./engage-osg-sc.xml
2010.11.24 18:00:46.959 PST: Number of SRM Properties retrieved 14
2010.11.24 18:00:46.970 PST: Writing out properties to /home/tutorial/pegasus-wms/./pegasus.6475454308491531036.properties
2010.11.24 18:00:46.972 PST: [INFO] Time taken to execute is 1.101 seconds
2010.11.24 18:00:46.972 PST: [INFO] event.pegasus.planner planner.version 3.0.0 - FINISHED
The transformation catalog maintains information about where the application code resides on the grid. It also records, for each transformation, which system it was compiled for and which profiles or environment variables need to be set when it is invoked.
The instructors have provided a ready transformation catalog (tc.data.text) in the $HOME/pegasus-wms/config directory.
In our case, it contains the locations where the Diamond and Montage codes are installed in the Pegasus VM. Let us look at the transformation catalog.
For each transformation, the following information is captured:
tr - A transformation identifier, normally of the form Namespace::Name:Version. The Namespace and Version are optional.
pfn - URL or file path for the location of the executable. The pfn is a file path if the transformation is of type INSTALLED, and generally a URL (file:/// or http:// or gridftp://) if it is of type STAGEABLE.
site - The site identifier for the site where the transformation is available.
type - The type of transformation: whether it is installed ("INSTALLED") on the remote site or is available to stage ("STAGEABLE").
arch, os, osrelease, osversion - The arch/os/osrelease/osversion of the transformation. osrelease and osversion are optional.
ARCH can have one of the following values: x86, x86_64, sparcv7, sparcv9, ppc, aix. The default value for arch is x86.
OS can have one of the following values: linux, sunos, macosx. The default value for OS, if none is specified, is linux.
Profiles - One or many profiles can be attached to a transformation for all sites or to a transformation on a particular site.
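The Namespace::Name:Version layout of the tr field can be taken apart with a few string operations. The sketch below is illustrative only: TransformationId is a hypothetical class, not a Pegasus type, and it assumes the namespace is delimited by "::" and the version by the last single ":" as described in the list above.

```java
// Hypothetical value class for illustration; not part of the Pegasus API.
public class TransformationId {
    public final String namespace; // null when the identifier has no namespace
    public final String name;
    public final String version;   // null when the identifier has no version

    private TransformationId(String namespace, String name, String version) {
        this.namespace = namespace;
        this.name = name;
        this.version = version;
    }

    /** Splits an identifier of the form [Namespace::]Name[:Version]. */
    public static TransformationId parse(String tr) {
        String rest = tr.trim();
        String namespace = null;
        int ns = rest.indexOf("::");
        if (ns >= 0) {
            namespace = rest.substring(0, ns);
            rest = rest.substring(ns + 2);
        }
        String version = null;
        int v = rest.lastIndexOf(':');
        if (v >= 0) {
            version = rest.substring(v + 1);
            rest = rest.substring(0, v);
        }
        if (rest.isEmpty()) {
            throw new IllegalArgumentException("missing transformation name: " + tr);
        }
        return new TransformationId(namespace, rest, version);
    }

    @Override
    public String toString() {
        return "namespace=" + namespace + ", name=" + name + ", version=" + version;
    }

    public static void main(String[] args) {
        String[] samples = {"diamond::findrange:2.0", "condor::dagman", "mAdd:3.0", "bin/mDiff"};
        for (String s : samples) {
            System.out.println(s + " -> " + parse(s));
        }
    }
}
```

The four sample identifiers are taken from the catalog listing that follows and cover all combinations of the optional parts.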
$ cat $HOME/pegasus-wms/config/tc.data.text
# multiple line text-based transformation catalog: 2010-11-24T20:46:41.710-08:00
tr bin/mDiff {
site local {
profile env "MONTAGE_HOME" "."
pfn "gsiftp://pegasus-vm/scratch/tutorial/software/montage/3.0/x86/bin/mDiff"
arch "x86"
os "LINUX"
type "STAGEABLE"
}
}
tr bin/mFitplane {
site local {
pfn "gsiftp://pegasus-vm/scratch/tutorial/software/montage/3.0/x86/bin/mFitplane"
arch "x86"
os "LINUX"
type "STAGEABLE"
}
}
tr condor::dagman {
site local {
pfn "/usr/bin/condor_dagman"
arch "x86"
os "LINUX"
type "INSTALLED"
}
}
tr diamond::findrange:2.0 {
site local {
pfn "/opt/pegasus/default/bin/keg"
arch "x86"
os "LINUX"
type "INSTALLED"
}
}
tr diamond::preprocess:2.0 {
site local {
pfn "/opt/pegasus/default/bin/keg"
arch "x86"
os "LINUX"
type "INSTALLED"
}
}
tr mAdd:3.0 {
site local {
pfn "gsiftp://pegasus-vm/scratch/tutorial/software/montage/3.0/x86/bin/mAdd"
arch "x86"
os "LINUX"
type "STAGEABLE"
}
}
tr mBackground:3.0 {
site local {
pfn "gsiftp://pegasus-vm/scratch/tutorial/software/montage/3.0/x86/bin/mBackground"
arch "x86"
os "LINUX"
type "STAGEABLE"
}
}
tr mBgModel:3.0 {
site local {
pfn "gsiftp://pegasus-vm/scratch/tutorial/software/montage/3.0/x86/bin/mBgModel"
arch "x86"
os "LINUX"
type "STAGEABLE"
}
}
tr mConcatFit:3.0 {
site local {
pfn "gsiftp://pegasus-vm/scratch/tutorial/software/montage/3.0/x86/bin/mConcatFit"
arch "x86"
os "LINUX"
type "STAGEABLE"
}
}
tr mDiffFit:3.0 {
site local {
profile env "MONTAGE_HOME" "."
pfn "gsiftp://pegasus-vm/scratch/tutorial/software/montage/3.0/x86/bin/mDiffFit"
arch "x86"
os "LINUX"
type "STAGEABLE"
}
}
tr mImgtbl:3.0 {
site local {
pfn "gsiftp://pegasus-vm/scratch/tutorial/software/montage/3.0/x86/bin/mImgtbl"
arch "x86"
os "LINUX"
type "STAGEABLE"
}
}
tr mJPEG:3.0 {
site local {
pfn "gsiftp://pegasus-vm/scratch/tutorial/software/montage/3.0/x86/bin/mJPEG"
arch "x86"
os "LINUX"
type "STAGEABLE"
}
}
tr mProjectPP:3.0 {
site local {
profile condor "priority" "25"
pfn "gsiftp://pegasus-vm/scratch/tutorial/software/montage/3.0/x86/bin/mProjectPP"
arch "x86"
os "LINUX"
type "STAGEABLE"
}
}
tr mShrink:3.0 {
site local {
pfn "gsiftp://pegasus-vm/scratch/tutorial/software/montage/3.0/x86/bin/mShrink"
arch "x86"
os "LINUX"
type "STAGEABLE"
}
}
We will use the pegasus-tc-client to add an entry for the transformation dummy to the transformation catalog.
$ pegasus-tc-client -a -l diamond::dummy:2.0 \
-p /opt/pegasus/default/bin/keg -r local -t INSTALLED -s INTEL32::LINUX
2008.04.30 15:11:59.313 PDT: [INFO] Added tc entry sucessfully
Let us query for the entry we inserted.
$ pegasus-tc-client -q -P -l diamond::dummy:2.0
#RESID LTX PFN TYPE SYSINFO
local diamond::analyze:2.0 /cluster-software/pegasus/current/bin/keg INSTALLED INTEL32::LINUX
Now let us query the transformation catalog for all of its entries to see what it looks like.
$ pegasus-tc-client -q -B
local mDiff gsiftp://pegasus-vm/scratch/tutorial/software/montage/3.0/x86/bin/mDiff STAGEABLE INTEL32::LINUX ENV::MONTAGE_HOME="."
local mFitplane gsiftp://pegasus-vm/scratch/tutorial/software/montage/3.0/x86/bin/mFitplane STAGEABLE INTEL32::LINUX ENV::MONTAGE_HOME="."
local mAdd:3.0 gsiftp://pegasus-vm/scratch/tutorial/software/montage/3.0/x86/bin/mAdd STAGEABLE INTEL32::LINUX ENV::MONTAGE_HOME="."
local mBackground:3.0 gsiftp://pegasus-vm/scratch/tutorial/software/montage/3.0/x86/bin/mBackground STAGEABLE INTEL32::LINUX ENV::MONTAGE_HOME="."
local mBgModel:3.0 gsiftp://pegasus-vm/scratch/tutorial/software/montage/3.0/x86/bin/mBgModel STAGEABLE INTEL32::LINUX ENV::MONTAGE_HOME="."
local mConcatFit:3.0 gsiftp://pegasus-vm/scratch/tutorial/software/montage/3.0/x86/bin/mConcatFit STAGEABLE INTEL32::LINUX ENV::MONTAGE_HOME="."
local mDiffFit:3.0 gsiftp://pegasus-vm/scratch/tutorial/software/montage/3.0/x86/bin/mDiffFit STAGEABLE INTEL32::LINUX ENV::MONTAGE_HOME="."
local mImgtbl:3.0 gsiftp://pegasus-vm/scratch/tutorial/software/montage/3.0/x86/bin/mImgtbl STAGEABLE INTEL32::LINUX ENV::MONTAGE_HOME="."
local mJPEG:3.0 gsiftp://pegasus-vm/scratch/tutorial/software/montage/3.0/x86/bin/mJPEG STAGEABLE INTEL32::LINUX ENV::MONTAGE_HOME="."
local mProject:3.0 gsiftp://pegasus-vm/scratch/tutorial/software/montage/3.0/x86/bin/mProjectPP STAGEABLE INTEL32::LINUX ENV::MONTAGE_HOME="."
local mProjectPP:3.0 gsiftp://pegasus-vm/scratch/tutorial/software/montage/3.0/x86/bin/mProjectPP STAGEABLE INTEL32::LINUX ENV::MONTAGE_HOME="."
local mShrink:3.0 gsiftp://pegasus-vm/scratch/tutorial/software/montage/3.0/x86/bin/mShrink STAGEABLE INTEL32::LINUX NULL
The Pegasus Workflow Planner is configured through Java properties. The instructors have provided a ready properties file at $HOME/.pegasusrc.
$ cat $HOME/.pegasusrc
##########################
# PEGASUS USER PROPERTIES
##########################
## SELECT THE REPLICAT CATALOG MODE AND URL
pegasus.catalog.replica = File
pegasus.catalog.replica.file = ${user.home}/pegasus-wms/config/rc.data
## SELECT THE SITE CATALOG MODE AND FILE
pegasus.catalog.site = XML3
pegasus.catalog.site.file = ${user.home}/pegasus-wms/config/sites.xml3
## SELECT THE TRANSFORMATION CATALOG MODE AND FILE
pegasus.catalog.transformation = Text
pegasus.catalog.transformation.file = ${user.home}/pegasus-wms/config/tc.data.text
## USE DAGMAN RETRY FEATURE FOR FAILURES
dagman.retry=2
## CHECK JOB EXIT CODES FOR FAILURE
dagman.post.scope=all
## STAGE ALL OUR EXECUTABLES OR USE INSTALLED ONES
pegasus.catalog.transformation.mapper = All
## WORK AND STORAGE DIR
pegasus.dir.storage = storage
pegasus.dir.exec = exec
#JOB CATEGORIES
dagman.projection.maxjobs 2
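Because the file uses the standard Java properties syntax, it can be inspected with java.util.Properties. The sketch below is an illustration under assumptions: PegasusRcPeek is a hypothetical class, the sample text is a short excerpt of the listing above, and the ${user.home} substitution is done here by a plain string replacement as a stand-in for whatever expansion the planner performs internally.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical helper for illustration; not part of Pegasus.
public class PegasusRcPeek {
    // Short excerpt of the properties file shown above.
    public static final String SAMPLE =
        "## SELECT THE REPLICA CATALOG MODE AND URL\n"
        + "pegasus.catalog.replica = File\n"
        + "pegasus.catalog.replica.file = ${user.home}/pegasus-wms/config/rc.data\n"
        + "dagman.retry=2\n"
        + "dagman.projection.maxjobs 2\n";

    /** Loads properties text; '=' or whitespace may separate key and value. */
    public static Properties load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new IllegalStateException("could not read properties", e);
        }
        return props;
    }

    /** Replaces the literal ${user.home} token with the given home directory. */
    public static String expandUserHome(String value, String home) {
        return value.replace("${user.home}", home);
    }

    public static void main(String[] args) {
        Properties props = load(SAMPLE);
        String home = System.getProperty("user.home");
        System.out.println(props.getProperty("pegasus.catalog.replica"));
        System.out.println(expandUserHome(props.getProperty("pegasus.catalog.replica.file"), home));
    }
}
```

Lines beginning with # are treated as comments by the properties parser, which is why the banner lines in the file do not show up as keys.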
In this exercise we are going to run pegasus-plan to generate an executable workflow from the abstract workflow (diamond.dax). The executable workflow consists of Condor submit files, which are submitted locally using pegasus-run.
The instructors have provided:
A dax (diamond.dax) in the $HOME/pegasus-wms/dax directory.
You will need to write some things yourself, by following the instructions below:
Run pegasus-plan to generate the condor submit files out of the dax.
Run pegasus-run to submit the workflow locally.
Instructions:
Let us run pegasus-plan on the diamond dax.
$ cd ~/pegasus-wms
$ pegasus-plan --dax `pwd`/dax/diamond.dax --force \
--dir dags -s local -o local --nocleanup -v
The above command says that we need to plan the diamond dax locally. The Condor submit files are to be generated in a directory structure whose base is dags. We are also requesting that no cleanup jobs be added, as we require the intermediate data to be saved. Here is the output of pegasus-plan.
2010.12.23 10:54:02.180 PST: [INFO] event.pegasus.refinement dax.id blackdiamond_0 - STARTED
2010.12.23 10:54:02.189 PST: [INFO] event.pegasus.siteselection dax.id blackdiamond_0 - STARTED
2010.12.23 10:54:02.203 PST: [INFO] event.pegasus.siteselection dax.id blackdiamond_0 - FINISHED
2010.12.23 10:54:02.317 PST: [INFO] Grafting transfer nodes in the workflow
2010.12.23 10:54:02.318 PST: [INFO] event.pegasus.generate.transfer-nodes dax.id blackdiamond_0 - STARTED
2010.12.23 10:54:02.447 PST: [INFO] event.pegasus.generate.transfer-nodes dax.id blackdiamond_0 - FINISHED
2010.12.23 10:54:02.449 PST: [INFO] event.pegasus.generate.workdir-nodes dax.id blackdiamond_0 - STARTED
2010.12.23 10:54:02.452 PST: [INFO] event.pegasus.generate.workdir-nodes dax.id blackdiamond_0 - FINISHED
2010.12.23 10:54:02.452 PST: [INFO] event.pegasus.generate.cleanup-wf dax.id blackdiamond_0 - STARTED
2010.12.23 10:54:02.453 PST: [INFO] event.pegasus.generate.cleanup-wf dax.id blackdiamond_0 - FINISHED
2010.12.23 10:54:02.453 PST: [INFO] event.pegasus.refinement dax.id blackdiamond_0 - FINISHED
2010.12.23 10:54:02.539 PST: [INFO] Generating codes for the concrete workflow
2010.12.23 10:54:03.340 PST: [INFO] Generating codes for the concrete workflow -DONE
2010.12.23 10:54:03.340 PST: [INFO] Generating code for the cleanup workflow
2010.12.23 10:54:03.482 PST: [INFO] Generating code for the cleanup workflow -DONE
2010.12.23 10:54:03.530 PST: I have concretized your abstract workflow. The workflow has been entered into the workflow database with a state of "planned". The next step is to start or execute your workflow. The invocation required is
pegasus-run -Dpegasus.user.properties=$HOME/.../blackdiamond/run0001/pegasus.7289539421670233327.properties /home/tutorial/pegasus-wms/dags/tutorial/pegasus/blackdiamond/run0001
2010.12.23 10:54:03.530 PST: Time taken to execute is 1.757 seconds
2010.12.23 10:54:03.530 PST: [INFO] event.pegasus.planner planner.version 3.0.2 - FINISHED
Now run pegasus-run as mentioned in the output of pegasus-plan. Do not copy the command below; it is for illustration purposes only.
$ pegasus-run \
-Dpegasus.user.properties=$HOME/.../blackdiamond/run0001/pegasus.350356687577055673.properties \
$HOME/pegasus-wms/dags/tutorial/pegasus/blackdiamond/run0001
-----------------------------------------------------------------------
File for submitting this DAG to Condor : blackdiamond-0.dag.condor.sub
Log of DAGMan debugging messages : blackdiamond-0.dag.dagman.out
Log of Condor library output : blackdiamond-0.dag.lib.out
Log of Condor library error messages : blackdiamond-0.dag.lib.err
Log of the life of condor_dagman itself : blackdiamond-0.dag.dagman.log
-no_submit given, not submitting DAG to Condor. You can do this with:
"condor_submit blackdiamond-0.dag.condor.sub"
-----------------------------------------------------------------------
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 320.
Your Workflow has been started and runs in base directory given below
cd /home/tutorial/pegasus-wms/dags/tutorial/pegasus/blackdiamond/run0001
*** To monitor the workflow you can run ***
pegasus-status -l /home/tutorial/pegasus-wms/dags/tutorial/pegasus/blackdiamond/run0001
*** To remove your workflow run ***
pegasus-remove -d 320.0
or
pegasus-remove /home/tutorial/pegasus-wms/dags/tutorial/pegasus/blackdiamond/run0001
In this section, we will show how to track your workflow, how to debug a failed workflow, and how to generate statistics and plots for a workflow run.
We will change into the directory that was mentioned in the output of the pegasus-run command.
$ cd /home/tutorial/pegasus-wms/dags/tutorial/pegasus/blackdiamond/run0001
In this directory you will see a large number of files. That should not scare you. Unless things go wrong, you need to look at only a few files to track the progress of the workflow.
Run the command pegasus-status as mentioned by pegasus-run above to check the status of your jobs. Use the watch command to auto repeat the command every 2 seconds.
$ watch pegasus-status /home/tutorial/pegasus-wms/dags/tutorial/pegasus/blackdiamond/run0001
-- Submitter: pegasus : <172.16.80.128:40195> : pegasus
ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD
84.0 pegasus 7/19 16:59 0+00:01:17 R 0 7.3 condor_dagman -f -
87.0 |-preprocess_ 7/19 17:00 0+00:00:31 R 10 0.1 kickstart -n diamo
Tip: watch does not end with ESC nor (q)uit, but with Ctrl+C.
The above output shows that a couple of jobs are running under the main DAGMan process. Keep watching the output to track whether the workflow is still running. If you do not see any of your jobs in the output for some time (say 30 seconds), the workflow has finished. We need to wait that long because there might be a delay in Condor DAGMan releasing the next job into the queue after a job has finished successfully.
If the output of pegasus-status is empty, then your workflow has either
completed successfully, or
stopped midway due to a non-recoverable error.
We can now run pegasus-analyzer to analyze the workflow.
Using pegasus-analyzer to analyze the workflow
$ pegasus-analyzer -i /home/tutorial/pegasus-wms/dags/tutorial/pegasus/blackdiamond/run0001
pegasus-analyzer: initializing...
************************************Summary*************************************
Total jobs : 8 (100.00%)
# jobs succeeded : 8 (100.00%)
# jobs failed : 0 (0.00%)
# jobs unsubmitted : 0 (0.00%)
**************************************Done**************************************
pegasus-analyzer: end of status report
Another way to monitor the workflow is to check the jobstate.log file. This is the output file of the monitoring daemon that is parsing all the condor log files to determine the status of the jobs. It logs the events seen by Condor into a more readable form for us.
$ more jobstate.log
1290676248 INTERNAL *** MONITORD_STARTED ***
1290676247 INTERNAL *** DAGMAN_STARTED 339.0 ***
[..]
At the start of jobstate.log, when the workflow has just started running, you will see a lot of entries with status UN_READY. This means that DAGMan has just parsed the .dag file and has not started working on any job yet. Initially all the jobs in the workflow are listed as UN_READY. After some time you will see entries in jobstate.log that show a job being executed, and so on.
1290676261 create_dir_blackdiamond_0_local SUBMIT 340.0 local - 1
1290676266 create_dir_blackdiamond_0_local EXECUTE 340.0 local - 1
1290676266 create_dir_blackdiamond_0_local JOB_TERMINATED 340.0 local - 1
1290676266 create_dir_blackdiamond_0_local JOB_SUCCESS 0 local - 1
1290676266 create_dir_blackdiamond_0_local POST_SCRIPT_STARTED 340.0 local - 1
1290676271 create_dir_blackdiamond_0_local POST_SCRIPT_TERMINATED 340.0 local - 1
1290676271 create_dir_blackdiamond_0_local POST_SCRIPT_SUCCESS 0 local - 1
The above shows the create_dir job being submitted and then executed on the grid. In addition, it shows that the job is being run on the grid site local (which is your submit machine). The various states of the job, as it goes from submission through execution to post-processing, are in UPPERCASE.
Successfully Completed : Let us again look at the jobstate.log. This time we need to look at the last few lines of jobstate.log
$ tail jobstate.log
1290676542 register_local_2_0 SUBMIT 347.0 local - 8
1290676547 register_local_2_0 EXECUTE 347.0 local - 8
1290676547 register_local_2_0 JOB_TERMINATED 347.0 local - 8
1290676547 register_local_2_0 JOB_SUCCESS 0 local - 8
1290676547 register_local_2_0 POST_SCRIPT_STARTED 347.0 local - 8
1290676552 register_local_2_0 POST_SCRIPT_TERMINATED 347.0 local - 8
1290676552 register_local_2_0 POST_SCRIPT_SUCCESS 0 local - 8
1290676552 INTERNAL *** DAGMAN_FINISHED 0 ***
1290676554 INTERNAL *** MONITORD_FINISHED 0 ***
Looking at the last two lines, we see that DAGMan finished and that pegasus-monitord finished successfully with a status of 0. This means the workflow ran successfully. Congratulations, you ran your workflow on the local site successfully. The workflow generates a final output file f.d that resides at /home/tutorial/local-storage/storage/f.d.
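The jobstate.log layout shown above is regular enough to summarize with a few lines of code. The following sketch is hypothetical (JobStateSummary is not a Pegasus tool) and assumes only the two line shapes visible in the excerpts: job lines whose second and third fields are the job name and state, and INTERNAL lines of the form *** EVENT status ***.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; pegasus-analyzer is the supported tool for this.
public class JobStateSummary {
    private final Map<String, String> lastState = new LinkedHashMap<String, String>();
    private Integer dagmanExit = null; // stays null until DAGMAN_FINISHED is seen

    /** Consumes one jobstate.log line; lines that are too short are ignored. */
    public void accept(String line) {
        String[] f = line.trim().split("\\s+");
        if (f.length < 3) {
            return;
        }
        if ("INTERNAL".equals(f[1])) {
            // Layout: <timestamp> INTERNAL *** <EVENT> <status> ***
            if (f.length >= 5 && "DAGMAN_FINISHED".equals(f[3])) {
                dagmanExit = Integer.valueOf(f[4]);
            }
            return;
        }
        // Layout: <timestamp> <job name> <STATE> ...; later lines overwrite earlier ones.
        lastState.put(f[1], f[2]);
    }

    /** Most recent state seen for each job, in order of first appearance. */
    public Map<String, String> lastStates() {
        return lastState;
    }

    /** True only if DAGMan reported that it finished with status 0. */
    public boolean finishedSuccessfully() {
        return dagmanExit != null && dagmanExit == 0;
    }

    public static void main(String[] args) {
        JobStateSummary summary = new JobStateSummary();
        String[] lines = {
            "1290676547 register_local_2_0 JOB_SUCCESS 0 local - 8",
            "1290676552 register_local_2_0 POST_SCRIPT_SUCCESS 0 local - 8",
            "1290676552 INTERNAL *** DAGMAN_FINISHED 0 ***"
        };
        for (String line : lines) {
            summary.accept(line);
        }
        System.out.println(summary.lastStates());
        System.out.println("success: " + summary.finishedSuccessfully());
    }
}
```

Keeping only the last state per job mirrors how the log is read by eye: the final entry for a job tells you where it ended up.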
To view the file, you can do the following
$ cat /home/tutorial/local-storage/storage/f.d
--- start f.c1 ----
--- start f.b1 ----
--- start f.a ----
Input File for the Diamond Workflow.
--- final f.a ----
Timestamp Today: 20101223T105659.955-08:00 (1293130619.955;60.002)
Applicationname: preprocess @ 10.0.2.15 (VPN)
Current Workdir: /home/tutorial/local-scratch/exec/tutorial/pegasus/blackdiamond/run0001
Systemenvironm.: i686-Linux 2.6.32-5-686
Processor Info.: 1 x Intel(R) Xeon(R) CPU E5462 @ 2.80GHz @ 2797.463
Load Averages : 0.646 0.192 0.060
Memory Usage MB: 502 total, 229 free, 0 shared, 39 buffered
Swap Usage MB: 397 total, 397 free
Filesystem Info: /media/cdrom0 udf,iso9660 31MB total, 0B avail
Filesystem Info: /media/floppy0 auto 7668MB total, 5436MB avail
Output Filename: f.b1
Input Filenames: f.a
--- final f.b1 ----
Timestamp Today: 20101223T105815.334-08:00 (1293130695.334;60.003)
Applicationname: findrange @ 10.0.2.15 (VPN)
Current Workdir: /home/tutorial/local-scratch/exec/tutorial/pegasus/blackdiamond/run0001
Systemenvironm.: i686-Linux 2.6.32-5-686
Processor Info.: 1 x Intel(R) Xeon(R) CPU E5462 @ 2.80GHz @ 2797.463
Load Averages : 1.444 0.509 0.177
Memory Usage MB: 502 total, 227 free, 0 shared, 39 buffered
Swap Usage MB: 397 total, 397 free
Filesystem Info: /media/cdrom0 udf,iso9660 31MB total, 0B avail
Filesystem Info: /media/floppy0 auto 7668MB total, 5436MB avail
Output Filename: f.c1
Input Filenames: f.b1
--- final f.c1 ----
--- start f.c2 ----
--- start f.b2 ----
--- start f.a ----
Input File for the Diamond Workflow.
--- final f.a ----
Timestamp Today: 20101223T105659.955-08:00 (1293130619.955;60.003)
Applicationname: preprocess @ 10.0.2.15 (VPN)
Current Workdir: /home/tutorial/local-scratch/exec/tutorial/pegasus/blackdiamond/run0001
Systemenvironm.: i686-Linux 2.6.32-5-686
Processor Info.: 1 x Intel(R) Xeon(R) CPU E5462 @ 2.80GHz @ 2797.463
Load Averages : 0.646 0.192 0.060
Memory Usage MB: 502 total, 229 free, 0 shared, 39 buffered
Swap Usage MB: 397 total, 397 free
Filesystem Info: /media/cdrom0 udf,iso9660 31MB total, 0B avail
Filesystem Info: /media/floppy0 auto 7668MB total, 5436MB avail
Output Filename: f.b2
Input Filenames: f.a
--- final f.b2 ----
Timestamp Today: 20101223T105820.478-08:00 (1293130700.478;60.001)
Applicationname: findrange @ 10.0.2.15 (VPN)
Current Workdir: /home/tutorial/local-scratch/exec/tutorial/pegasus/blackdiamond/run0001
Systemenvironm.: i686-Linux 2.6.32-5-686
Processor Info.: 1 x Intel(R) Xeon(R) CPU E5462 @ 2.80GHz @ 2797.463
Load Averages : 1.409 0.517 0.182
Memory Usage MB: 502 total, 228 free, 0 shared, 39 buffered
Swap Usage MB: 397 total, 397 free
Filesystem Info: /media/cdrom0 udf,iso9660 31MB total, 0B avail
Filesystem Info: /media/floppy0 auto 7668MB total, 5436MB avail
Output Filename: f.c2
Input Filenames: f.b2
--- final f.c2 ----
Timestamp Today: 20101223T105936.718-08:00 (1293130776.718;60.000)
Applicationname: analyze @ 10.0.2.15 (VPN)
Current Workdir: /home/tutorial/local-scratch/exec/tutorial/pegasus/blackdiamond/run0001
Systemenvironm.: i686-Linux 2.6.32-5-686
Processor Info.: 1 x Intel(R) Xeon(R) CPU E5462 @ 2.80GHz @ 2797.463
Load Averages : 1.033 0.581 0.226
Memory Usage MB: 502 total, 228 free, 0 shared, 40 buffered
Swap Usage MB: 397 total, 397 free
Filesystem Info: /media/cdrom0 udf,iso9660 31MB total, 0B avail
Filesystem Info: /media/floppy0 auto 7668MB total, 5436MB avail
Output Filename: f.d
Input Filenames: f.c1 f.c2
-
Unsuccessfully Completed (Workflow execution stopped midway) : Let us again look at the jobstate.log. Again we need to look at the last few lines of jobstate.log
$ tail jobstate.log
1290677127 stage_in_local_local_0 EXECUTE 352.0 local - 4
1290677127 stage_in_local_local_0 JOB_TERMINATED 352.0 local - 4
1290677127 stage_in_local_local_0 JOB_FAILURE 1 local - 4
1290677127 stage_in_local_local_0 POST_SCRIPT_STARTED 352.0 local - 4
1290677132 stage_in_local_local_0 POST_SCRIPT_TERMINATED 352.0 local - 4
1290677132 stage_in_local_local_0 POST_SCRIPT_FAILURE 1 local - 4
1290677132 INTERNAL *** DAGMAN_FINISHED 1 ***
1290677134 INTERNAL *** MONITORD_FINISHED 0 ***
Looking at the last two lines, we see that DAGMan finished unsuccessfully with a status of 1, while pegasus-monitord itself exited with a status of 0. We can easily determine which job failed: it is stage_in_local_local_0 in this case. To determine the reason for the failure, we need to look at its kickstart output file, which is named JOBNAME.out.NNN, where NNN is a three-digit number starting at 000 (for example 000, 001, 002).
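The check described above (read the DAGMAN_FINISHED status, then look for jobs in a failure state) can be sketched in a few lines of Python. This is only an illustration under the assumption that the log looks like the excerpt above; dagman_exit_status and failed_jobs are hypothetical helper names, not Pegasus tools. For real debugging, pegasus-analyzer, introduced next, does this work for you.

```python
import re
from typing import List, Optional

# Matches lines such as: 1290677132 INTERNAL *** DAGMAN_FINISHED 1 ***
FINISHED = re.compile(r"INTERNAL \*\*\* DAGMAN_FINISHED (\d+) \*\*\*")


def dagman_exit_status(lines: List[str]) -> Optional[int]:
    """Return the status on the last DAGMAN_FINISHED line, or None if absent."""
    for line in reversed(lines):
        match = FINISHED.search(line)
        if match:
            return int(match.group(1))
    return None


def failed_jobs(lines: List[str]) -> List[str]:
    """Return names of jobs that logged a state ending in _FAILURE, in order."""
    failed: List[str] = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 3 and fields[2].endswith("_FAILURE"):
            if fields[1] not in failed:
                failed.append(fields[1])
    return failed


tail = [
    "1290677127 stage_in_local_local_0 JOB_TERMINATED 352.0 local - 4",
    "1290677127 stage_in_local_local_0 JOB_FAILURE 1 local - 4",
    "1290677132 stage_in_local_local_0 POST_SCRIPT_FAILURE 1 local - 4",
    "1290677132 INTERNAL *** DAGMAN_FINISHED 1 ***",
    "1290677134 INTERNAL *** MONITORD_FINISHED 0 ***",
]
print(dagman_exit_status(tail), failed_jobs(tail))
```

Note that this simple version also lists a job that failed once and then succeeded on a retry, so treat its result as a starting point for inspection rather than a verdict.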
In this section, we will run the diamond workflow but remove the input file so that the workflow fails during execution. This is to highlight how to use pegasus-analyzer to debug a failed workflow.
First, let's rename the input file f.a
$ mv /scratch/tutorial/inputdata/diamond/f.a /scratch/tutorial/inputdata/diamond/f.a.old
$ cd $HOME/pegasus-wms
We will now repeat exercises 2.4 and 2.5 and submit the workflow again.
Plan and submit the diamond workflow. Pass --submit to pegasus-plan to submit the workflow if planning succeeds.
$ pegasus-plan --dax `pwd`/dax/diamond.dax --force \
  --dir dags -s local -o local --nocleanup --submit -v
Use pegasus-status to track the workflow and wait for it to fail.
$ watch pegasus-status /home/tutorial/pegasus-wms/dags/tutorial/pegasus/blackdiamond/run0002
-- Submitter: pegasus : <172.16.80.128:40195> : pegasus
ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD
96.0 pegasus 7/19 17:40 0+00:01:06 R 0 7.3 condor_dagman -f -
The --long option to pegasus-status gives more detail about a running workflow.
[pegasus@pegasus pegasus-wms]$ pegasus-status -l /home/tutorial/pegasus-wms/dags/tutorial/pegasus/blackdiamond/run0002
blackdiamond-0.dag is running.
11/25 01:25:06 Done Pre Queued Post Ready Un-Ready Failed
11/25 01:25:06 === === === === === === ===
11/25 01:25:06 1 0 1 0 0 6 0
WORKFLOW STATUS : RUNNING | 1/8 ( 12% ) | (condor processing workflow)
We can also use the --long option to pegasus-status to see the FINAL status of the workflow.
$ pegasus-status -l /home/tutorial/pegasus-wms/dags/tutorial/pegasus/blackdiamond/run0002
blackdiamond-0.dag FAILED (status 1)
11/25 01:25:32 Done Pre Queued Post Ready Un-Ready Failed
11/25 01:25:32 === === === === === === ===
11/25 01:25:32 1 0 0 0 0 6 1
WORKFLOW STATUS : FAILED | 1/8 ( 12% ) | (rescue needs to be submitted)
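The counts row printed by pegasus-status -l relates directly to the 1/8 figure on the WORKFLOW STATUS line: the seven columns add up to the total number of jobs, and Done divided by that total gives the completion percentage. A minimal sketch of that arithmetic follows; parse_counts and percent_done are hypothetical names, and how the tool itself rounds 12.5 to the displayed 12% is not stated in these notes.

```python
from typing import Dict

COLUMNS = ["Done", "Pre", "Queued", "Post", "Ready", "Un-Ready", "Failed"]


def parse_counts(row: str) -> Dict[str, int]:
    """Read the last seven fields of a counts row, skipping any date/time prefix."""
    fields = row.split()
    if len(fields) < len(COLUMNS):
        raise ValueError(f"expected at least {len(COLUMNS)} fields: {row!r}")
    return dict(zip(COLUMNS, (int(v) for v in fields[-len(COLUMNS):])))


def percent_done(counts: Dict[str, int]) -> float:
    """Share of jobs in the Done column, as a percentage of all jobs."""
    total = sum(counts.values())
    return 100.0 * counts["Done"] / total if total else 0.0


running = parse_counts("11/25 01:25:06 1 0 1 0 0 6 0")
failed = parse_counts("11/25 01:25:32 1 0 0 0 0 6 1")
print(sum(running.values()), percent_done(running))  # 8 12.5
print(failed["Failed"])  # 1
```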
We will now run pegasus-analyzer on the failed workflow submit directory to see what job failed.
$ pegasus-analyzer -i $HOME/pegasus-wms/dags/tutorial/pegasus/blackdiamond/run0002
pegasus-analyzer: initializing...
************************************Summary*************************************
Total jobs : 8 (100.00%)
# jobs succeeded : 1 (12.50%)
# jobs failed : 1 (12.50%)
# jobs unsubmitted : 6 (75.00%)
******************************Failed jobs' details******************************
=============================stage_in_local_local_0=============================
last state: POST_SCRIPT_FAILURE
site: local
submit file: /home/tutorial/pegasus-wms/dags/tutorial/pegasus/blackdiamond/run0002/stage_in_local_local_0.sub
output file: /home/tutorial/pegasus-wms/dags/tutorial/pegasus/blackdiamond/run0002/stage_in_local_local_0.out.002
error file: /home/tutorial/pegasus-wms/dags/tutorial/pegasus/blackdiamond/run0002/stage_in_local_local_0.err.002
-------------------------------Task #1 - Summary--------------------------------
site : local
hostname : pegasus
executable : /opt/pegasus/default/bin/pegasus-transfer
arguments :
exitcode : 1
working dir : /home/tutorial/pegasus-wms/dags/tutorial/pegasus/blackdiamond/run0002
--Task #1 - pegasus::pegasus-transfer - pegasus::pegasus-transfer:1.0 - stdout--
2010-11-25 01:25:22,320 INFO: Reading URL pairs from stdin
2010-11-25 01:25:22,321 INFO: PATH=/usr/local/globus/default/bin:/opt/pegasus/default/bin:/usr/bin:/bin
2010-11-25 01:25:22,321 INFO: LD_LIBRARY_PATH=/usr/local/globus/default/lib:/usr/java/jdk1.6.0_20/jre/lib/amd64
2010-11-25 01:25:22,321 INFO: Executing cp commands
/bin/cp: cannot stat `/scratch/tutorial/inputdata/diamond/f.a': No such file or directory
2010-11-25 01:25:22,331 CRITICAL: Command '/bin/cp -L "/scratch/tutorial/inputdata/diamond/f.a" "/home/tutorial/local-scratch/exec/pegasus/pegasus/blackdiamond/run0002/f.a"' failed with error code 1
**************************************Done**************************************
pegasus-analyzer: end of status report
[pegasus@pegasus pegasus-wms]$
The above tells us that the stage-in job for the workflow failed, and points us to the stdout of the job. By default, all jobs in Pegasus are launched via kickstart that captures runtime provenance of the job and helps in debugging. Hence, the stdout of the job is the kickstart stdout which is in XML.
The kickstart record includes the duration of the job, the start time for the job, the node on which the job ran, the stdout/stderr of the job, the arguments with which it launched the job, the environment that was set for the job before it was launched, and the machine information about the node that the job ran on. Of this information, the dagman.out file also gives a coarser-grained estimate of the job duration and start time.
This section explains how to read kickstart output and DAGMan Condor log files.
Kickstart is a lightweight C executable that is shipped with the Pegasus worker package. All jobs are launched via Kickstart on the remote end, unless this is explicitly disabled when running pegasus-plan.
Kickstart does not work with
Condor Standard Universe Jobs
MPI jobs
Pegasus automatically disables kickstart for the above jobs.
Kickstart captures useful runtime provenance information about the job it launches on the remote node and puts it in an XML record that it writes to its stdout. The stdout appears in the workflow submit directory as <job>.out.00n . Some of the useful information captured and logged by kickstart is as follows:
the exitcode with which the job it launched exited
the duration of the job
the start time for the job
the node on which the job ran
the directory in which the job ran
the stdout/stderr of the job
the arguments with which it launched the job
the environment that was set for the job before it was launched.
the machine information about the node that the job ran on
Let's look at the stdout of our failed job.
$ cat /home/tutorial/pegasus-wms/dags/tutorial/pegasus/blackdiamond/run0002/stage_in_local_local_0.out.002
<?xml version="1.0" encoding="ISO-8859-1"?>
<invocation xmlns="http://pegasus.isi.edu/schema/invocation" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://pegasus.isi.edu/schema/invocation http://pegasus.isi.edu/schema/iv-2.1.xsd" version="2.1" start="2010-11-29T19:10:23.862-08:00" duration="0.076" transformation="pegasus::pegasus-transfer" derivation="pegasus::pegasus-transfer:1.0" resource="local" wf-label="blackdiamond" wf-stamp="2010-11-29T18:57:59-08:00" interface="eth0" hostaddr="10.0.2.15" hostname="pegasus-vm.local" pid="5428" uid="501" user="pegasus" gid="501" group="pegasus" umask="0022">
  <mainjob start="2010-11-29T19:10:23.876-08:00" duration="0.063" pid="5429">
    <usage utime="0.040" stime="0.023" minflt="2758" majflt="0" nswap="0" nsignals="0" nvcsw="5" nivcsw="20"/>
    <status raw="256"><regular exitcode="1"/></status>
    <statcall error="0">
      <file name="/opt/pegasus/default/bin/pegasus-transfer">23212F7573722F62696E2F656E762070</file>
      <statinfo mode="0100775" size="25314" inode="2022205" nlink="1" blksize="4096" blocks="64" mtime="2010-11-23T13:14:52-08:00" atime="2010-11-29T19:10:07-08:00" ctime="2010-11-25T00:01:52-08:00" uid="501" user="pegasus" gid="501" group="pegasus"/>
    </statcall>
    <argument-vector/>
  </mainjob>
  <cwd>/home/tutorial/pegasus-wms/dags/tutorial/pegasus/blackdiamond/run0002</cwd>
  <usage utime="0.002" stime="0.013" minflt="475" majflt="0" nswap="0" nsignals="0" nvcsw="1" nivcsw="5"/>
  <machine page-size="4096">
    <stamp>2010-12-23T10:56:43.817-08:00</stamp>
    <uname system="linux" nodename="pegasus-vm" release="2.6.32-5-686" machine="i686">#1 SMP Fri Dec 10 16:12:40 UTC 2010</uname>
    <linux>
      <ram total="527044608" free="242290688" shared="0" buffer="41041920"/>
      <swap total="417325056" free="417325056"/>
      <boot idle="1597.500">2010-12-23T10:29:16.599-08:00</boot>
      <cpu count="1" speed="2797" vendor="GenuineIntel">Intel(R) Xeon(R) CPU E5462 @ 2.80GHz</cpu>
      <load min1="0.05" min5="0.02" min15="0.00"/>
      <proc total="88" running="1" sleeping="87" vmsize="344793088" rss="123768832"/>
      <task total="101" running="1" sleeping="100"/>
    </linux>
  </machine>
  <statcall error="0" id="stdin">
    <descriptor number="0"/>
    <statinfo mode="0100664" size="142" inode="2250032" nlink="1" blksize="4096" blocks="16" mtime="2010-11-29T19:09:20-08:00" atime="2010-11-29T19:10:07-08:00" ctime="2010-11-29T19:09:20-08:00" uid="501" user="pegasus" gid="501" group="pegasus"/>
  </statcall>
  <statcall error="0" id="stdout">
    <temporary name="/opt/condor/local.pegasus/spool/local_univ_execute/dir_5427/gs.out.awOX6p" descriptor="3"/>
    <statinfo mode="0100600" size="762" inode="2054511" nlink="1" blksize="4096" blocks="16" mtime="2010-11-29T19:10:23-08:00" atime="2010-11-29T19:10:23-08:00" ctime="2010-11-29T19:10:23-08:00" uid="501" user="pegasus" gid="501" group="pegasus"/>
    <data>2010-11-29 19:10:23,920 INFO: Reading URL pairs from stdin
2010-11-29 19:10:23,921 INFO: PATH=/usr/local/globus/default/bin:/opt/pegasus/default/bin:/usr/bin:/bin
2010-11-29 19:10:23,921 INFO: LD_LIBRARY_PATH=/usr/local/globus/default/lib:/usr/java/jdk1.6.0_20/jre/lib/amd64/
2010-11-29 19:10:23,921 INFO: Executing cp commands
/bin/cp: cannot stat `/scratch/tutorial/inputdata/diamond/f.a': No such file or directory
2010-11-29 19:10:23,932 CRITICAL: Command '/bin/cp -L "/scratch/tutorial/inputdata/diamond/f.a" "/home/tutorial/local-scratch/exec/pegasus/pegasus/blackdiamond/run0002/f.a"' failed with error code 1
    </data>
  </statcall>
  <statcall error="0" id="stderr">
    <temporary name="/opt/condor/local.pegasus/spool/local_univ_execute/dir_5427/gs.err.oz9MOG" descriptor="4"/>
    <statinfo mode="0100600" size="0" inode="2054512" nlink="1" blksize="4096" blocks="8" mtime="2010-11-29T19:10:23-08:00" atime="2010-11-29T19:10:23-08:00" ctime="2010-11-29T19:10:23-08:00" uid="501" user="pegasus" gid="501" group="pegasus"/>
  </statcall>
  <statcall error="2" id="gridstart">
    <!-- ignore above error -->
    <file name="condor_exec.exe"/>
  </statcall>
  <statcall error="0" id="logfile">
    <descriptor number="1"/>
    <statinfo mode="0100644" size="0" inode="2250072" nlink="1" blksize="4096" blocks="8" mtime="2010-11-29T19:10:23-08:00" atime="2010-11-29T19:10:23-08:00" ctime="2010-11-29T19:10:23-08:00" uid="501" user="pegasus" gid="501" group="pegasus"/>
  </statcall>
  <statcall error="0" id="channel">
    <fifo name="/opt/condor/local.pegasus/spool/local_univ_execute/dir_5427/gs.app.qCOCwX" descriptor="5" count="0" rsize="0" wsize="0"/>
    <statinfo mode="010640" size="0" inode="2054524" nlink="1" blksize="4096" blocks="8" mtime="2010-11-29T19:10:23-08:00" atime="2010-11-29T19:10:23-08:00" ctime="2010-11-29T19:10:23-08:00" uid="501" user="pegasus" gid="501" group="pegasus"/>
  </statcall>
  <environment>
    <env key="GLOBUS_LOCATION">/usr/local/globus/default</env>
    <env key="GRIDSTART_CHANNEL">/opt/condor/local.pegasus/spool/local_univ_execute/dir_5427/gs.app.qCOCwX</env>
    <env key="JAVA_HOME">/usr</env>
    <env key="LD_LIBRARY_PATH">/usr/java/jdk1.6.0_20/jre/lib/amd64/server:/usr/java/jdk1.6.0_20/jre/lib/amd64:</env>
    <env key="PEGASUS_HOME">/opt/pegasus/default</env>
    <env key="TEMP">/opt/condor/local.pegasus/spool/local_univ_execute/dir_5427</env>
    <env key="TMP">/opt/condor/local.pegasus/spool/local_univ_execute/dir_5427</env>
    <env key="TMPDIR">/opt/condor/local.pegasus/spool/local_univ_execute/dir_5427</env>
    <env key="_CONDOR_ANCESTOR_4843">4862:1291085504:2790807554</env>
    <env key="_CONDOR_ANCESTOR_4862">5427:1291086623:1798288782</env>
    <env key="_CONDOR_ANCESTOR_5427">5428:1291086623:2750667008</env>
    <env key="_CONDOR_HIGHPORT">41000</env>
    <env key="_CONDOR_JOB_AD">/opt/condor/local.pegasus/spool/local_univ_execute/dir_5427/.job.ad</env>
    <env key="_CONDOR_LOWPORT">40000</env>
    <env key="_CONDOR_MACHINE_AD">/opt/condor/local.pegasus/spool/local_univ_execute/dir_5427/.machine.ad</env>
    <env key="_CONDOR_SCRATCH_DIR">/opt/condor/local.pegasus/spool/local_univ_execute/dir_5427</env>
    <env key="_CONDOR_SLOT">1</env>
  </environment>
  <resource>
    <soft id="RLIMIT_CPU">unlimited</soft> <hard id="RLIMIT_CPU">unlimited</hard>
    <soft id="RLIMIT_FSIZE">unlimited</soft> <hard id="RLIMIT_FSIZE">unlimited</hard>
    <soft id="RLIMIT_DATA">unlimited</soft> <hard id="RLIMIT_DATA">unlimited</hard>
    <soft id="RLIMIT_STACK">unlimited</soft> <hard id="RLIMIT_STACK">unlimited</hard>
    <soft id="RLIMIT_CORE">0</soft> <hard id="RLIMIT_CORE">0</hard>
    <soft id="RESOURCE_5">unlimited</soft> <hard id="RESOURCE_5">unlimited</hard>
    <soft id="RLIMIT_NPROC">unlimited</soft> <hard id="RLIMIT_NPROC">unlimited</hard>
    <soft id="RLIMIT_NOFILE">1024</soft> <hard id="RLIMIT_NOFILE">1024</hard>
    <soft id="RLIMIT_MEMLOCK">32768</soft> <hard id="RLIMIT_MEMLOCK">32768</hard>
    <soft id="RLIMIT_AS">unlimited</soft> <hard id="RLIMIT_AS">unlimited</hard>
    <soft id="RLIMIT_LOCKS">unlimited</soft> <hard id="RLIMIT_LOCKS">unlimited</hard>
    <soft id="RLIMIT_SIGPENDING">8192</soft> <hard id="RLIMIT_SIGPENDING">8192</hard>
    <soft id="RLIMIT_MSGQUEUE">819200</soft> <hard id="RLIMIT_MSGQUEUE">819200</hard>
    <soft id="RLIMIT_NICE">0</soft> <hard id="RLIMIT_NICE">0</hard>
    <soft id="RLIMIT_RTPRIO">0</soft> <hard id="RLIMIT_RTPRIO">0</hard>
  </resource>
</invocation>
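Since the kickstart record is XML, the fields listed earlier (exit code, duration, host, working directory, stdout) can be pulled out with a standard XML parser instead of being read by eye. The sketch below uses Python's xml.etree.ElementTree on a trimmed copy of the record above; the element and attribute names are taken from that record, while summarize is a hypothetical helper name and not part of Pegasus.

```python
import xml.etree.ElementTree as ET
from typing import Any, Dict

# Namespace declared on the <invocation> root element of the record.
NS = {"inv": "http://pegasus.isi.edu/schema/invocation"}


def summarize(xml_text: str) -> Dict[str, Any]:
    """Extract a few debugging fields from a kickstart invocation record."""
    root = ET.fromstring(xml_text)
    regular = root.find("inv:mainjob/inv:status/inv:regular", NS)
    stdout = root.find("inv:statcall[@id='stdout']/inv:data", NS)
    return {
        "transformation": root.get("transformation"),
        "hostname": root.get("hostname"),
        "start": root.get("start"),
        "duration": float(root.get("duration")),
        "exitcode": int(regular.get("exitcode")) if regular is not None else None,
        "cwd": root.findtext("inv:cwd", namespaces=NS),
        "stdout": stdout.text.strip() if stdout is not None and stdout.text else "",
    }


# Trimmed copy of the record shown above (most elements omitted).
sample = """<invocation xmlns="http://pegasus.isi.edu/schema/invocation"
    start="2010-11-29T19:10:23.862-08:00" duration="0.076"
    transformation="pegasus::pegasus-transfer" hostname="pegasus-vm.local">
  <mainjob start="2010-11-29T19:10:23.876-08:00" duration="0.063" pid="5429">
    <status raw="256"><regular exitcode="1"/></status>
  </mainjob>
  <cwd>/home/tutorial/pegasus-wms/dags/tutorial/pegasus/blackdiamond/run0002</cwd>
  <statcall error="0" id="stdout">
    <data>/bin/cp: cannot stat `/scratch/tutorial/inputdata/diamond/f.a': No such file or directory</data>
  </statcall>
</invocation>"""

for key, value in summarize(sample).items():
    print(f"{key}: {value}")
```

On the virtual machine you would pass the contents of stage_in_local_local_0.out.002 instead of the trimmed sample. This assumes the file holds a single invocation record, as it does here.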
In this exercise we will learn about the DAG file format and some of the log files generated when the DAG runs.
-
Now take a look at the DAG file...
$ cat $HOME/pegasus-wms/dags/tutorial/pegasus/blackdiamond/run0001/blackdiamond-0.dag
######################################################################
# PEGASUS WMS GENERATED DAG FILE
# DAG blackdiamond
# Index = 0, Count = 1
######################################################################
MAXJOBS projection 2
JOB create_dir_blackdiamond_0_local create_dir_blackdiamond_0_local.sub
SCRIPT POST create_dir_blackdiamond_0_local /opt/pegasus/default/bin/pegasus-exitcode /home/tutorial/pegasus-wms/dags/tutorial/pegasus/blackdiamond/run0001/create_dir_blackdiamond_0_local.out
RETRY create_dir_blackdiamond_0_local 2
JOB stage_in_local_local_0 stage_in_local_local_0.sub
SCRIPT POST stage_in_local_local_0 /opt/pegasus/default/bin/pegasus-exitcode /home/tutorial/pegasus-wms/dags/tutorial/pegasus/blackdiamond/run0001/stage_in_local_local_0.out
RETRY stage_in_local_local_0 2
JOB preprocess_j1 preprocess_j1.sub
SCRIPT POST preprocess_j1 /opt/pegasus/default/bin/pegasus-exitcode /home/tutorial/pegasus-wms/dags/tutorial/pegasus/blackdiamond/run0001/preprocess_j1.out
RETRY preprocess_j1 2
JOB findrange_j2 findrange_j2.sub
SCRIPT POST findrange_j2 /opt/pegasus/default/bin/pegasus-exitcode /home/tutorial/pegasus-wms/dags/tutorial/pegasus/blackdiamond/run0001/findrange_j2.out
RETRY findrange_j2 2
JOB findrange_j3 findrange_j3.sub
SCRIPT POST findrange_j3 /opt/pegasus/default/bin/pegasus-exitcode /home/tutorial/pegasus-wms/dags/tutorial/pegasus/blackdiamond/run0001/findrange_j3.out
RETRY findrange_j3 2
JOB analyze_j4 analyze_j4.sub
SCRIPT POST analyze_j4 /opt/pegasus/default/bin/pegasus-exitcode /home/tutorial/pegasus-wms/dags/tutorial/pegasus/blackdiamond/run0001/analyze_j4.out
RETRY analyze_j4 2
JOB stage_out_local_local_2_0 stage_out_local_local_2_0.sub
SCRIPT POST stage_out_local_local_2_0 /opt/pegasus/default/bin/pegasus-exitcode /home/tutorial/pegasus-wms/dags/tutorial/pegasus/blackdiamond/run0001/stage_out_local_local_2_0.out
RETRY stage_out_local_local_2_0 2
JOB register_local_2_0 register_local_2_0.sub
SCRIPT POST register_local_2_0 /opt/pegasus/default/bin/pegasus-exitcode /home/tutorial/pegasus-wms/dags/tutorial/pegasus/blackdiamond/run0001/register_local_2_0.out
RETRY register_local_2_0 2
PARENT findrange_j2 CHILD analyze_j4
PARENT preprocess_j1 CHILD findrange_j2
PARENT preprocess_j1 CHILD findrange_j3
PARENT findrange_j3 CHILD analyze_j4
PARENT analyze_j4 CHILD stage_out_local_local_2_0
PARENT stage_in_local_local_0 CHILD preprocess_j1
PARENT stage_out_local_local_2_0 CHILD register_local_2_0
PARENT create_dir_blackdiamond_0_local CHILD analyze_j4
PARENT create_dir_blackdiamond_0_local CHILD findrange_j2
PARENT create_dir_blackdiamond_0_local CHILD preprocess_j1
PARENT create_dir_blackdiamond_0_local CHILD findrange_j3
PARENT create_dir_blackdiamond_0_local CHILD stage_in_local_local_0
######################################################################
# End of DAG
##################################################################
-
... and the dagman.out file.
$ cat $HOME/pegasus-wms/dags/tutorial/pegasus/blackdiamond/run0001/blackdiamond-0.dag.dagman.out
11/25 01:10:47 ******************************************************
11/25 01:10:47 ** condor_scheduniv_exec.339.0 (CONDOR_DAGMAN) STARTING UP
11/25 01:10:47 ** /opt/condor/7.4.2/bin/condor_dagman
11/25 01:10:47 ** SubsystemInfo: name=DAGMAN type=DAGMAN(10) class=DAEMON(1)
11/25 01:10:47 ** Configuration: subsystem:DAGMAN local:<NONE> class:DAEMON
11/25 01:10:47 ** $CondorVersion: 7.4.2 Mar 29 2010 BuildID: 227044 $
11/25 01:10:47 ** $CondorPlatform: X86_64-LINUX_RHEL5 $
11/25 01:10:47 ** PID = 7844
11/25 01:10:47 ** Log last touched time unavailable (No such file or directory)
11/25 01:10:47 ******************************************************
11/25 01:10:47 Using config source: /opt/condor/config/condor_config
11/25 01:10:47 Using local config sources:
11/25 01:10:47 /opt/condor/config/condor_config.local
11/25 01:10:47 DaemonCore: Command Socket at <172.16.80.129:40035>
11/25 01:10:47 DAGMAN_DEBUG_CACHE_SIZE setting: 5242880
11/25 01:10:47 DAGMAN_DEBUG_CACHE_ENABLE setting: False
11/25 01:10:47 DAGMAN_SUBMIT_DELAY setting: 0
11/25 01:10:47 DAGMAN_MAX_SUBMIT_ATTEMPTS setting: 6
11/25 01:10:47 DAGMAN_STARTUP_CYCLE_DETECT setting: 0
11/25 01:10:47 DAGMAN_MAX_SUBMITS_PER_INTERVAL setting: 5
11/25 01:10:47 DAGMAN_USER_LOG_SCAN_INTERVAL setting: 5
11/25 01:10:47 allow_events (DAGMAN_IGNORE_DUPLICATE_JOB_EXECUTION, DAGMAN_ALLOW_EVENTS) setting: 114
11/25 01:10:47 DAGMAN_RETRY_SUBMIT_FIRST setting: 1
11/25 01:10:47 DAGMAN_RETRY_NODE_FIRST setting: 0
11/25 01:10:47 DAGMAN_MAX_JOBS_IDLE setting: 0
11/25 01:10:47 DAGMAN_MAX_JOBS_SUBMITTED setting: 0
11/25 01:10:47 DAGMAN_MUNGE_NODE_NAMES setting: 1
11/25 01:10:47 DAGMAN_PROHIBIT_MULTI_JOBS setting: 0
11/25 01:10:47 DAGMAN_SUBMIT_DEPTH_FIRST setting: 0
11/25 01:10:47 DAGMAN_ABORT_DUPLICATES setting: 1
11/25 01:10:47 DAGMAN_ABORT_ON_SCARY_SUBMIT setting: 1
11/25 01:10:47 DAGMAN_PENDING_REPORT_INTERVAL setting: 600
11/25 01:10:47 DAGMAN_AUTO_RESCUE setting: 1
11/25 01:10:47 DAGMAN_MAX_RESCUE_NUM setting: 100
11/25 01:10:47 DAGMAN_DEFAULT_NODE_LOG setting: null
11/25 01:10:47 ALL_DEBUG setting:
11/25 01:10:47 DAGMAN_DEBUG setting:
....
11/25 01:10:47 Default node log file is: </home/tutorial/pegasus-wms/dags/tutorial/pegasus/blackdiamond/run0001/blackdiamond-0.dag.nodes.log>
11/25 01:10:47 DAG Lockfile will be written to blackdiamond-0.dag.lock
11/25 01:10:47 DAG Input file is blackdiamond-0.dag
11/25 01:10:47 Parsing 1 dagfiles
11/25 01:10:47 Parsing blackdiamond-0.dag ...
11/25 01:10:47 Dag contains 8 total jobs
11/25 01:10:47 Sleeping for 12 seconds to ensure ProcessId uniqueness
11/25 01:10:59 Bootstrapping...
11/25 01:10:59 Number of pre-completed nodes: 0
11/25 01:10:59 Registering condor_event_timer...
11/25 01:11:00 Sleeping for one second for log file consistency
11/25 01:11:01 Submitting Condor Node create_dir_blackdiamond_0_local job(s)...
11/25 01:11:01 submitting: condor_submit -a dag_node_name' '=' 'create_dir_blackdiamond_0_local -a +DAGManJobId' '=' '339 -a DAGManJobId' '=' '339 -a submit_event_notes' '=' 'DAG' 'Node:' ' create_dir_blackdiamond_0_local -a +DAGParentNodeNames' '=' '"" create_dir_blackdiamond_0_local.sub
11/25 01:11:01 From submit: Submitting job(s).
11/25 01:11:01 From submit: Logging submit event(s).
11/25 01:11:01 From submit: 1 job(s) submitted to cluster 340.
11/25 01:11:01 assigned Condor ID (340.0)
11/25 01:11:01 Just submitted 1 job this cycle...
11/25 01:11:01 Currently monitoring 1 Condor log file(s)
11/25 01:11:01 Event: ULOG_SUBMIT for Condor Node create_dir_blackdiamond_0_local (340.0)
11/25 01:11:01 Number of idle job procs: 1
11/25 01:11:01 Of 8 nodes total:
11/25 01:11:01 Done Pre Queued Post Ready Un-Ready Failed
11/25 01:11:01 === === === === === === ===
11/25 01:11:01 0 0 1 0 0 7 0
....
11/25 01:11:06 Currently monitoring 1 Condor log file(s)
11/25 01:11:06 Event: ULOG_EXECUTE for Condor Node create_dir_blackdiamond_0_local (340.0)
11/25 01:11:06 Number of idle job procs: 0
11/25 01:11:06 Event: ULOG_JOB_TERMINATED for Condor Node create_dir_blackdiamond_0_local (340.0)
11/25 01:11:06 Node create_dir_blackdiamond_0_local job proc (340.0) completed successfully.
11/25 01:11:06 Node create_dir_blackdiamond_0_local job completed
11/25 01:11:06 Running POST script of Node create_dir_blackdiamond_0_local...
11/25 01:11:06 Number of idle job procs: 0
11/25 01:11:06 Of 8 nodes total:
11/25 01:11:06 Done Pre Queued Post Ready Un-Ready Failed
11/25 01:11:06 === === === === === === ===
11/25 01:11:06 0 0 0 1 0 7 0
11/25 01:11:11 Currently monitoring 1 Condor log file(s)
11/25 01:11:11 Event: ULOG_POST_SCRIPT_TERMINATED for Condor Node create_dir_blackdiamond_0_local (340.0)
11/25 01:11:11 POST Script of Node create_dir_blackdiamond_0_local completed successfully.
11/25 01:11:11 Of 8 nodes total:
11/25 01:11:11 Done Pre Queued Post Ready Un-Ready Failed
11/25 01:11:11 === === === === === === ===
11/25 01:11:11 1 0 0 0 1 6 0
....
11/25 01:15:52 Event: ULOG_POST_SCRIPT_TERMINATED for Condor Node register_local_2_0 (347.0)
11/25 01:15:52 POST Script of Node register_local_2_0 completed successfully.
11/25 01:15:52 Of 8 nodes total:
11/25 01:15:52 Done Pre Queued Post Ready Un-Ready Failed
11/25 01:15:52 === === === === === === ===
11/25 01:15:52 8 0 0 0 0 0 0
11/25 01:15:52 All jobs Completed!
11/25 01:15:52 Note: 0 total job deferrals because of -MaxJobs limit (0)
11/25 01:15:52 Note: 0 total job deferrals because of -MaxIdle limit (0)
11/25 01:15:52 Note: 0 total job deferrals because of node category throttles
11/25 01:15:52 Note: 0 total PRE script deferrals because of -MaxPre limit (20)
11/25 01:15:52 Note: 0 total POST script deferrals because of -MaxPost limit (20)
11/25 01:15:52 **** condor_scheduniv_exec.339.0 (condor_DAGMAN) pid 7844 EXITING WITH STATUS 0
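The PARENT ... CHILD ... lines in the DAG file shown earlier in this exercise are what determine the order in which DAGMan can release jobs: a job becomes ready only once all of its parents are done. As an illustration, the sketch below reads those lines and computes one order that respects every dependency, using Kahn's topological sort. The function names are hypothetical, and the order it prints is only one valid order, not necessarily the one DAGMan picks.

```python
from collections import defaultdict, deque
from typing import Dict, List, Set


def parse_edges(dag_text: str) -> Dict[str, Set[str]]:
    """Map each parent node to its children, read from PARENT ... CHILD ... lines."""
    children: Dict[str, Set[str]] = defaultdict(set)
    for line in dag_text.splitlines():
        words = line.split()
        if len(words) >= 4 and words[0] == "PARENT" and "CHILD" in words:
            split = words.index("CHILD")
            for parent in words[1:split]:
                children[parent].update(words[split + 1:])
    return children


def topological_order(children: Dict[str, Set[str]]) -> List[str]:
    """Kahn's algorithm: list nodes so that every parent precedes its children."""
    indegree: Dict[str, int] = defaultdict(int)
    nodes = set(children)
    for kids in children.values():
        nodes.update(kids)
        for kid in kids:
            indegree[kid] += 1
    ready = deque(sorted(n for n in nodes if indegree[n] == 0))
    order: List[str] = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for kid in sorted(children.get(node, ())):
            indegree[kid] -= 1
            if indegree[kid] == 0:
                ready.append(kid)
    if len(order) != len(nodes):
        raise ValueError("dependency cycle detected")
    return order


dag = """
PARENT findrange_j2 CHILD analyze_j4
PARENT preprocess_j1 CHILD findrange_j2
PARENT preprocess_j1 CHILD findrange_j3
PARENT findrange_j3 CHILD analyze_j4
PARENT analyze_j4 CHILD stage_out_local_local_2_0
PARENT stage_in_local_local_0 CHILD preprocess_j1
PARENT stage_out_local_local_2_0 CHILD register_local_2_0
PARENT create_dir_blackdiamond_0_local CHILD analyze_j4
PARENT create_dir_blackdiamond_0_local CHILD findrange_j2
PARENT create_dir_blackdiamond_0_local CHILD preprocess_j1
PARENT create_dir_blackdiamond_0_local CHILD findrange_j3
PARENT create_dir_blackdiamond_0_local CHILD stage_in_local_local_0
"""
for name in topological_order(parse_edges(dag)):
    print(name)
```

The result starts with create_dir_blackdiamond_0_local and ends with register_local_2_0, which agrees with the first and last jobs seen in the dagman.out excerpt above.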
Sometimes you may want to halt the execution of the workflow or just permanently remove it. You can stop/halt a workflow by running the pegasus-remove command mentioned in the output of pegasus-run
$ pegasus-remove $HOME/pegasus-wms/dags/tutorial/pegasus/blackdiamond/runXXXX
Job 2788.0 marked for removal
In this section, we will generate statistics and plots of the diamond workflow we ran using pegasus-statistics and pegasus-plots
pegasus-statistics generates workflow execution statistics. To generate statistics run the command as shown below
$ cd $HOME/pegasus-wms
$ pegasus-statistics dags/tutorial/pegasus/blackdiamond/run0001/
tutorial@pegasus-vm:~/pegasus-wms$ pegasus-statistics dags/tutorial/pegasus/blackdiamond/run0001/
******************************************** SUMMARY ********************************************
#Legends
#Workflow runtime (min,sec) - the waltime from the start of the workflow execution to the end as
reported by the DAGMAN.In case of rescue dag the value is the cumulative
of all retries.
#Cumulative workflow runtime (min,sec) - the sum of the walltime of all jobs as reported by the DAGMan .
In case of job retries the value is the cumulative of all retries.
Job summary
Total - the total number of jobs in the workflow. The total number of jobs is calculated by parsing
the .dag file. For workflows having SUBDAX jobs , the SUDBAX job is skipped , but the
calculation takes into account all the jobs that make up the SUBDAX sub workflow.
For workflows having SUBDAG jobs , the SUBDAG jobs are treated like regular jobs.
Succeeded - the total number of succeeded jobs in the workflow .
Failed - the total number of failed jobs in the workflow .
Unsubmitted - the total number of unsubmitted jobs in the workflow .
Unknown - the total number of jobs that are submitted, but has not completed execution or the state
is unknown in the workflow.
SUBDAX summary
Total - the total number of SUBDAX jobs in the workflow
Succeeded - the total number of succeeded SUBDAX jobs in the workflow.
Failed - the total number of failed SUBDAX jobs in the workflow.
Unsubmitted - the total number of unsubmitted SUBDAX jobs in the workflow.
Unknown - the total number of SUBDAX jobs that are submitted, but has not completed execution or
the state is unknown in the workflow.
Workflow runtime : 5 min. 5 sec.
Cumulative workflow runtime : 4 min. 0 sec.
Total Succeeded Failed Unsubmitted Unknown
Jobs 8 8 0 0 0
SUBDAX 0 0 0 0 0
Workflow execution statistics :
/home/tutorial/pegasus-wms/dags/tutorial/pegasus/blackdiamond/run0001/statistics/workflow.txt
Job statistics :
/home/tutorial/pegasus-wms/dags/tutorial/pegasus/blackdiamond/run0001/statistics/jobs.txt
Workflow events with time starting from zero :
/home/tutorial/pegasus-wms/dags/tutorial/pegasus/blackdiamond/run0001/statistics/jobstate.txt
Logical transformation statistics :
/home/tutorial/pegasus-wms/dags/tutorial/pegasus/blackdiamond/run0001/statistics/breakdown.txt
**************************************************************************************************
Workflow statistics table
The workflow statistics table contains information about the workflow run, such as the total execution time and the number of jobs that failed.
Table 1.1. Workflow Statistics
| Statistic | Value |
|---|---|
| Workflow runtime | 5 min. 5 sec. |
| Cumulative workflow runtime | 4 min. 0 sec. |
| Total jobs | 8 |
| # jobs succeeded | 8 |
| # jobs failed | 0 |
| # jobs unsubmitted | 0 |
| # jobs unknown | 0 |
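The workflow runtime of 5 min. 5 sec. appears consistent with the DAGMAN_STARTED and DAGMAN_FINISHED markers seen earlier in jobstate.log for the successful run (1290676247 and 1290676552, a difference of 305 seconds). The sketch below reproduces that arithmetic. It is only an illustration of the definition given in the legend; the helper names are hypothetical, and pegasus-statistics may compute the value differently internally.

```python
import re
from typing import List, Optional

# Matches the INTERNAL markers written when DAGMan starts and finishes.
MARKER = re.compile(r"^(\d+) INTERNAL \*\*\* (DAGMAN_STARTED|DAGMAN_FINISHED)\b")


def workflow_runtime_seconds(lines: List[str]) -> Optional[int]:
    """Seconds between the first DAGMAN_STARTED and the last DAGMAN_FINISHED marker."""
    started = finished = None
    for line in lines:
        match = MARKER.match(line.strip())
        if not match:
            continue
        stamp, marker = int(match.group(1)), match.group(2)
        if marker == "DAGMAN_STARTED" and started is None:
            started = stamp
        elif marker == "DAGMAN_FINISHED":
            finished = stamp
    if started is None or finished is None:
        return None
    return finished - started


def as_min_sec(seconds: int) -> str:
    """Format a number of seconds in the style used by the summary."""
    minutes, rest = divmod(seconds, 60)
    return f"{minutes} min. {rest} sec."


log = [
    "1290676248 INTERNAL *** MONITORD_STARTED ***",
    "1290676247 INTERNAL *** DAGMAN_STARTED 339.0 ***",
    "1290676552 INTERNAL *** DAGMAN_FINISHED 0 ***",
    "1290676554 INTERNAL *** MONITORD_FINISHED 0 ***",
]
print(as_min_sec(workflow_runtime_seconds(log)))  # 5 min. 5 sec.
```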
Job statistics table
Job statistics table contains the following details about the jobs in the workflow. A sample table is shown below.
Job - the name of the job
Site - the site where the job ran
Kickstart(sec.) - the actual duration of the job in seconds on the remote compute node. In case of retries, the value is cumulative over all retries.
Post(sec.) - the postscript time as reported by DAGMan. In case of retries, the value is cumulative over all retries.
DAGMan(sec.) - the time between the completion of the job's last parent job and the submission of the job. In case of retries, the value of the last retry is used for the calculation.
CondorQTime(sec.) - the time between submission by DAGMan and the remote Grid submission. It is an estimate of the time spent in the Condor queue on the submit node. In case of retries, the value is cumulative over all retries.
Resource(sec.) - the time between the remote Grid submission and the start of remote execution. It is an estimate of the time the job spent in the remote queue. In case of retries, the value is cumulative over all retries.
Runtime(sec.) - the time spent on the resource as seen by Condor DAGMan. It is always >= the Kickstart time. In case of retries, the value is cumulative over all retries.
Seqexec(sec.) - the time taken for the completion of a clustered job. In case of retries, the value is cumulative over all retries.
Seqexec-Delay(sec.) - the difference between the completion time of a clustered job and the sum of the kickstart times of its individual tasks. In case of retries, the value is cumulative over all retries.
Table 1.2. Table Job Statistics
| Job | Site | Kickstart | Post | DAGMan | CondorQTime | Resource | Runtime | CondorQLen | Seqexec | Seqexec-Delay |
|---|---|---|---|---|---|---|---|---|---|---|
| analyze_j4 | local | 60.03 | 6.00 | 6.00 | 0.00 | 0.00 | 60.00 | 0 | - | - |
| create_dir_blackdiamond_0_local | local | 0.04 | 5.00 | 14.00 | 0.00 | 0.00 | 0.06 | 0 | - | - |
| findrange_j2 | local | 60.03 | 5.00 | 6.00 | 0.00 | 0.00 | 65.00 | 0 | - | - |
| findrange_j3 | local | 60.03 | 5.00 | 6.00 | 0.00 | 0.00 | 60.00 | 0 | - | - |
| preprocess_j1 | local | 60.03 | 5.00 | 6.00 | 0.00 | 0.00 | 60.00 | 0 | - | - |
| register_local_2_0 | local | 0.50 | 5.00 | 6.00 | 0.00 | 0.00 | 0.05 | 0 | - | - |
| stage_in_local_local_0 | local | 0.08 | 6.00 | 6.00 | 0.00 | 0.00 | 0.04 | 0 | - | - |
| stage_out_local_local_2_0 | local | 0.08 | 5.00 | 6.00 | 0.00 | 0.00 | 0.03 | 0 | - | - |
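The delay columns can be read as differences between successive events in the life of a job. The sketch below is illustrative only: the JobEvents class, its field names, and the timestamps are hypothetical and are not part of pegasus-statistics. The sample values were chosen so that the derived numbers match the analyze_j4 row above.

```python
from dataclasses import dataclass


@dataclass
class JobEvents:
    """Hypothetical event timestamps for one job, in seconds from workflow start."""
    last_parent_done: float  # the last parent job completed
    dagman_submit: float     # DAGMan submitted the job to Condor
    grid_submit: float       # the job was submitted to the remote Grid resource
    execute_start: float     # the job started executing remotely
    job_terminated: float    # Condor reported the job as finished

    def dagman_delay(self) -> float:
        # DAGMan(sec.): last parent completion until submission by DAGMan
        return self.dagman_submit - self.last_parent_done

    def condor_q_time(self) -> float:
        # CondorQTime(sec.): submission by DAGMan until remote Grid submission
        return self.grid_submit - self.dagman_submit

    def resource_delay(self) -> float:
        # Resource(sec.): remote Grid submission until start of remote execution
        return self.execute_start - self.grid_submit

    def runtime(self) -> float:
        # Runtime(sec.): time spent on the resource as seen by DAGMan
        return self.job_terminated - self.execute_start


# Made-up timestamps for a job that was submitted 6 seconds after its parent finished
events = JobEvents(
    last_parent_done=100.0,
    dagman_submit=106.0,
    grid_submit=106.0,
    execute_start=106.0,
    job_terminated=166.0,
)
print(events.dagman_delay(), events.condor_q_time(),
      events.resource_delay(), events.runtime())
```

On the local site the CondorQTime and Resource delays are zero because no remote submission is involved, which is why those columns read 0.00 for every job in the table.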
Logical transformation statistics table
Logical transformation statistics table contains information about each type of transformation in the workflow.
Table 1.3. Table: Logical Transformation Statistics
| Transformation | Count | Mean | Variance | Min | Max | Total |
|---|---|---|---|---|---|---|
| diamond::analyze:2.0 | 1 | 60.1600 | 0.0000 | 60.1600 | 60.1600 | 60.1600 |
| diamond::findrange:2.0 | 2 | 60.3100 | 0.0100 | 60.2500 | 60.3700 | 120.6200 |
| diamond::preprocess:2.0 | 1 | 60.4800 | 0.0000 | 60.4800 | 60.4800 | 60.4800 |
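The columns of this table are ordinary summary statistics over the runtimes of all jobs that share a transformation. The sketch below recomputes the findrange row; since that row has a count of 2, its two runtimes must be the Min and Max values. The summarize function is a hypothetical helper, and the use of the sample variance is an assumption: it rounds to the 0.01 shown in the table, but the source does not say which variance formula pegasus-statistics uses.

```python
import statistics


def summarize(runtimes):
    """Return count, mean, variance, min, max and total for a list of runtimes."""
    count = len(runtimes)
    # Assumption: sample variance; a single run has no spread, so report 0.0
    variance = statistics.variance(runtimes) if count > 1 else 0.0
    return {
        "count": count,
        "mean": statistics.mean(runtimes),
        "variance": variance,
        "min": min(runtimes),
        "max": max(runtimes),
        "total": sum(runtimes),
    }


# The two findrange runtimes, taken from the Min and Max columns
findrange = summarize([60.25, 60.37])
print({key: round(value, 4) for key, value in findrange.items()})
```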
pegasus-plots generates graphs and charts to visualize workflow execution. To generate graphs and charts run the command as shown below.
$ cd $HOME/pegasus-wms
$ pegasus-plots dags/tutorial/pegasus/blackdiamond/run0001/
****** show-job *****
Please wait, this may take a few minutes ...
****** Finished executing show-job *****
****** show-run *****
Please wait, this may take a few minutes ...
****** Finished executing show-run *****
****** dag2dot *****
Please wait, this may take a few minutes ...
****** Finished executing dag2dot *****
****** dot *****
****** Finished executing dot2png *****
****** dax2dot *****
Please wait, this may take a few minutes ...
****** Finished executing dax2dot *****
****** dot *****
****** Finished executing dot2png *****
******************************************** SUMMARY ********************************************
DAX graph -
png format : /home/tutorial/pegasus-wms/dags/tutorial/pegasus/blackdiamond/run0001/graph/diamond-dax.png
eps format : /home/tutorial/pegasus-wms/dags/tutorial/pegasus/blackdiamond/run0001/graph/diamond-dax.eps
DAG graph -
png format : /home/tutorial/pegasus-wms/dags/tutorial/pegasus/blackdiamond/run0001/graph/blackdiamond-dag.png
eps format : /home/tutorial/pegasus-wms/dags/tutorial/pegasus/blackdiamond/run0001/graph/blackdiamond-dag.eps
Workflow execution Gantt chart -
png format : Failed to generate png format.Application 'convert' not available.
eps format : /home/tutorial/pegasus-wms/dags/tutorial/pegasus/blackdiamond/run0001/graph/blackdiamond-2.eps
Host over time chart -
png format : Failed to generate png format.Application 'convert' not available.
eps format : /home/tutorial/pegasus-wms/dags/tutorial/pegasus/blackdiamond/run0001/graph/blackdiamond-host.eps
**************************************************************************************************
[pegasus@pegasus pegasus-wms]$
/home/tutorial/pegasus-wms/dags/tutorial/pegasus/blackdiamond/run0001/graph/diamond-dax.png
/home/tutorial/pegasus-wms/dags/tutorial/pegasus/blackdiamond/run0001/graph/blackdiamond-dag.png
/home/tutorial/pegasus-wms/dags/tutorial/pegasus/blackdiamond/run0001/graph/blackdiamond-2.png
X axis - time in seconds. Each tick is 60 seconds.
Y axis - job number.
In this exercise we are going to run pegasus-plan to generate an executable workflow from the abstract workflow (montage.dax). The executable workflow consists of Condor submit files that are submitted to remote grid resources using pegasus-run.
The instructors have provided:
A dax (montage.dax) in the $HOME/pegasus-wms/dax/ directory.
You will need to write some things yourself, by following the instructions below:
Run pegasus-plan to generate the condor submit files out of the dax.
Instructions:
Let us run pegasus-plan on the montage dax on the cluster site. If multiple sites are available, you can provide them as a comma-separated list, such as tg_ncsa,viz.
$ cd $HOME/pegasus-wms
$ pegasus-plan -Dpegasus.schema.dax=/opt/pegasus/default/etc/dax-2.1.xsd \
--dir dags --sites cluster --output local --force \
--nocleanup --dax `pwd`/dax/montage.dax --submit -v
The above command says that we need to plan the montage dax on the cluster site. The cluster site in the VM is managed by SGE that is running in the VM. The jobs for this workflow will be submitted to jobmanager-condor in the VM. The output data needs to be transferred back to the local host. The condor submit files are to be generated in a directory structure whose base is dags. We also are requesting that no cleanup jobs be added as we require the intermediate data on the remote host. Here is the output of pegasus-plan.
2010.11.24 18:20:10.948 PST: [INFO] event.pegasus.parse.dax dax.id /home/tutorial/pegasus-wms/dax/montage.dax
2010.11.24 18:20:11.309 PST: [INFO] event.pegasus.parse.dax dax.id /home/tutorial/pegasus-wms/dax/montage.dax
2010.11.24 18:20:11.350 PST: [INFO] event.pegasus.refinement dax.id montage_0 - STARTED
2010.11.24 18:20:11.360 PST: [INFO] event.pegasus.siteselection dax.id montage_0 - STARTED
2010.11.24 18:20:11.416 PST: [INFO] event.pegasus.siteselection dax.id montage_0 - FINISHED
2010.11.24 18:20:11.504 PST: [INFO] Grafting transfer nodes in the workflow
2010.11.24 18:20:11.505 PST: [INFO] event.pegasus.generate.transfer-nodes dax.id montage_0 - STARTED
2010.11.24 18:20:11.655 PST: [INFO] event.pegasus.generate.transfer-nodes dax.id montage_0 - FINISHED
2010.11.24 18:20:11.657 PST: [INFO] event.pegasus.generate.workdir-nodes dax.id montage_0 - STARTED
2010.11.24 18:20:11.660 PST: [INFO] event.pegasus.generate.workdir-nodes dax.id montage_0 - FINISHED
2010.11.24 18:20:11.660 PST: [INFO] event.pegasus.generate.cleanup-wf dax.id montage_0 - STARTED
2010.11.24 18:20:11.661 PST: [INFO] event.pegasus.generate.cleanup-wf dax.id montage_0 - FINISHED
2010.11.24 18:20:11.661 PST: [INFO] event.pegasus.refinement dax.id montage_0 - FINISHED
2010.11.24 18:20:11.715 PST: [INFO] Generating codes for the concrete workflow
2010.11.24 18:20:12.406 PST: [INFO] Generating codes for the concrete workflow -DONE
2010.11.24 18:20:12.406 PST: [INFO] Generating code for the cleanup workflow
2010.11.24 18:20:12.528 PST: [INFO] Generating code for the cleanup workflow -DONE
2010.11.24 18:20:12.672 PST:
2010.11.24 18:20:12.679 PST: -----------------------------------------------------------------------
2010.11.24 18:20:12.685 PST: File for submitting this DAG to Condor : montage-0.dag.condor.sub
2010.11.24 18:20:12.691 PST: Log of DAGMan debugging messages : montage-0.dag.dagman.out
2010.11.24 18:20:12.704 PST: Log of Condor library output : montage-0.dag.lib.out
2010.11.24 18:20:12.711 PST: Log of Condor library error messages : montage-0.dag.lib.err
2010.11.24 18:20:12.726 PST: Log of the life of condor_dagman itself : montage-0.dag.dagman.log
2010.11.24 18:20:12.731 PST:
2010.11.24 18:20:12.762 PST: -no_submit given, not submitting DAG to Condor. You can do this with:
2010.11.24 18:20:12.792 PST: "condor_submit montage-0.dag.condor.sub"
2010.11.24 18:20:12.798 PST: -----------------------------------------------------------------------
2010.11.24 18:20:12.804 PST: Submitting job(s).
2010.11.24 18:20:12.815 PST: Logging submit event(s).
2010.11.24 18:20:12.821 PST: 1 job(s) submitted to cluster 275.
2010.11.24 18:20:13.504 PST:
2010.11.24 18:20:13.510 PST: Your Workflow has been started and runs in base directory given below
2010.11.24 18:20:13.519 PST:
2010.11.24 18:20:13.530 PST: cd /home/tutorial/pegasus-wms/dags/tutorial/pegasus/montage/run0001
2010.11.24 18:20:13.535 PST:
2010.11.24 18:20:13.542 PST: *** To monitor the workflow you can run ***
2010.11.24 18:20:13.555 PST:
2010.11.24 18:20:13.562 PST: pegasus-status -l /home/tutorial/pegasus-wms/dags/tutorial/pegasus/montage/run0001
2010.11.24 18:20:13.570 PST:
2010.11.24 18:20:13.578 PST: *** To remove your workflow run ***
2010.11.24 18:20:13.585 PST: pegasus-remove -d 275.0
2010.11.24 18:20:13.592 PST: or
2010.11.24 18:20:13.604 PST: pegasus-remove /home/tutorial/pegasus-wms/dags/tutorial/pegasus/montage/run0001
2010.11.24 18:20:13.610 PST:
2010.11.24 18:20:13.617 PST: Time taken to execute is 3.76 seconds
2010.11.24 18:20:13.617 PST: [INFO] event.pegasus.planner planner.version 3.0.0 - FINISHED
If you get any errors above while running pegasus-plan, you can add -vvvvv to the pegasus-plan command to enable maximum verbosity.
The above command submits the workflow to Condor DAGMan/CondorG. After submission, it starts a monitoring daemon, pegasus-monitord, that parses the Condor log files to update the status of the jobs and pushes it to a work database.
Monitor the workflow using the commands provided in the output of the pegasus-run command and other commands explained earlier.
If the workflow runs successfully, it generates a single output file, montage.jpg, at /home/tutorial/local-storage/storage/montage.jpg.
The grid workflow will take time to execute on the VM. On the instructor's Mac Pro desktop it took about 30 minutes to run.
Sometimes a workflow has many jobs that each run for only a few seconds. In such cases the overhead of scheduling each job on a grid is large compared to the job itself, and the runtime of the entire workflow can be reduced by using Pegasus clustering techniques. One such technique is to cluster jobs horizontally, that is, to group jobs on the same level into one or more sequential jobs.
$ cd $HOME/pegasus-wms
$ pegasus-plan -Dpegasus.schema.dax=/opt/pegasus/default/etc/dax-2.1.xsd \
--dir `pwd`/dags --sites cluster --output local --nocleanup --force \
--cluster horizontal --dax `pwd`/dax/montage.dax -v
After clustering, the executable workflow will contain 26 jobs, compared to 44 in the non-clustered mode.
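The idea behind horizontal clustering can be sketched in a few lines. This is a toy model, not the Pegasus implementation: the function name, the job tuples, the task names, and the cluster_size parameter are all hypothetical, and grouping by transformation as well as by level is an assumption made here for illustration.

```python
from collections import defaultdict


def cluster_horizontally(jobs, cluster_size):
    """Group jobs on the same level into sequential clusters.

    jobs is a list of (job_id, level, transformation) tuples. Jobs that share
    a level and a transformation are bundled, at most cluster_size per bundle.
    """
    groups = defaultdict(list)
    for job_id, level, transformation in jobs:
        groups[(level, transformation)].append(job_id)

    clusters = []
    for key in sorted(groups):
        members = groups[key]
        # Slice each group into chunks of at most cluster_size jobs
        for start in range(0, len(members), cluster_size):
            clusters.append(members[start:start + cluster_size])
    return clusters


# A made-up workflow: four jobs on level 1, three on level 2, one on level 3
jobs = [
    ("a1", 1, "taskA"), ("a2", 1, "taskA"), ("a3", 1, "taskA"), ("a4", 1, "taskA"),
    ("b1", 2, "taskB"), ("b2", 2, "taskB"), ("b3", 2, "taskB"),
    ("c1", 3, "taskC"),
]
clusters = cluster_horizontally(jobs, cluster_size=2)
print(len(jobs), "jobs become", len(clusters), "clustered jobs")
```

Each cluster is scheduled once and runs its members one after another, so the scheduling overhead is paid per cluster rather than per job.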
In the DAX you can specify what output data products you want to track in the replica catalog. This is done by setting the register flags with the output files for a job. For our tutorial, we only register the final output data products. So if you were able to execute the diamond or the montage workflow successfully, we can do data reuse. Let us run pegasus-plan on the diamond workflow again. However, this time we will remove the --force option.
$ cd $HOME/pegasus-wms
$ pegasus-plan --dax `pwd`/dax/diamond.dax --dir `pwd`/dags -s local -o local --nocleanup -v
2010.11.25 01:35:11.186 PST: [INFO] event.pegasus.refinement dax.id blackdiamond_0 - STARTED
2010.11.25 01:35:11.210 PST: [INFO] event.pegasus.reduce dax.id blackdiamond_0 - STARTED
2010.11.25 01:35:11.211 PST: [INFO] Nodes/Jobs Deleted from the Workflow during reduction
2010.11.25 01:35:11.211 PST: [INFO] analyze_j4
2010.11.25 01:35:11.211 PST: [INFO] findrange_j2
2010.11.25 01:35:11.211 PST: [INFO] findrange_j3
2010.11.25 01:35:11.211 PST: [INFO] preprocess_j1
2010.11.25 01:35:11.211 PST: [INFO] Nodes/Jobs Deleted from the Workflow during reduction - DONE
2010.11.25 01:35:11.212 PST: [INFO] event.pegasus.reduce dax.id blackdiamond_0 - FINISHED
2010.11.25 01:35:11.212 PST: [INFO] event.pegasus.siteselection dax.id blackdiamond_0 - STARTED
2010.11.25 01:35:11.219 PST: [INFO] event.pegasus.siteselection dax.id blackdiamond_0 - FINISHED
2010.11.25 01:35:11.289 PST: [INFO] Grafting transfer nodes in the workflow
2010.11.25 01:35:11.290 PST: [INFO] event.pegasus.generate.transfer-nodes dax.id blackdiamond_0 - STARTED
2010.11.25 01:35:11.370 PST: [INFO] Adding stage out jobs for jobs deleted from the workflow
2010.11.25 01:35:11.370 PST: [INFO] The leaf file f.d is already at the output pool local
2010.11.25 01:35:11.371 PST: [INFO] event.pegasus.generate.transfer-nodes dax.id blackdiamond_0 - FINISHED
2010.11.25 01:35:11.372 PST: [INFO] event.pegasus.generate.workdir-nodes dax.id blackdiamond_0 - STARTED
2010.11.25 01:35:11.374 PST: [INFO] event.pegasus.generate.workdir-nodes dax.id blackdiamond_0 - FINISHED
2010.11.25 01:35:11.374 PST: [INFO] event.pegasus.generate.cleanup-wf dax.id blackdiamond_0 - STARTED
2010.11.25 01:35:11.375 PST: [INFO] event.pegasus.generate.cleanup-wf dax.id blackdiamond_0 - FINISHED
2010.11.25 01:35:11.375 PST: [INFO] event.pegasus.refinement dax.id blackdiamond_0 - FINISHED
2010.11.25 01:35:11.426 PST: [INFO] Generating codes for the concrete workflow
2010.11.25 01:35:12.078 PST: [INFO] Generating codes for the concrete workflow -DONE
2010.11.25 01:35:12.083 PST:
The executable workflow generated contains only a single NOOP job.
It seems that the output files are already at the output site.
To regenerate the output data from scratch specify --force option.
pegasus-run -Dpegasus.user.properties=$HOME/.../blackdiamond/run0003/pegasus.4078026914028890643.properties\
/home/tutorial/pegasus-wms/dags/tutorial/pegasus/blackdiamond/run0003
2010.11.25 01:35:12.083 PST: Time taken to execute is 1.508 seconds
2010.11.25 01:35:12.083 PST: [INFO] event.pegasus.planner planner.version 3.0.0 - FINISHED
You can increase the debug level to see how Pegasus deletes jobs from the bottom of the workflow upward. Pass -vvvv to the pegasus-plan command.
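The reduction step can be pictured as follows: a job is dropped when every file it produces is either already registered in the replica catalog or is needed only by jobs that have themselves been dropped. The sketch below is a simplified model of that idea, not the planner's code. The reduce_workflow function is hypothetical, and the intermediate file names (f.b1, f.b2, f.c1, f.c2) are assumed for illustration; only f.a and f.d appear in this tutorial.

```python
def reduce_workflow(jobs, replica_catalog):
    """Return the set of jobs that can be deleted because their data exists."""
    # Map each file to the set of jobs that read it
    consumers = {}
    for name, job in jobs.items():
        for filename in job["inputs"]:
            consumers.setdefault(filename, set()).add(name)

    def not_needed(filename, deleted):
        users = consumers.get(filename, set())
        # Needed only by jobs that are already deleted (and by at least one job)
        return bool(users) and users <= deleted

    deleted = set()
    changed = True
    while changed:  # repeat until no further job can be removed
        changed = False
        for name, job in jobs.items():
            if name in deleted:
                continue
            if all(f in replica_catalog or not_needed(f, deleted)
                   for f in job["outputs"]):
                deleted.add(name)
                changed = True
    return deleted


diamond = {
    "preprocess_j1": {"inputs": {"f.a"}, "outputs": {"f.b1", "f.b2"}},
    "findrange_j2": {"inputs": {"f.b1"}, "outputs": {"f.c1"}},
    "findrange_j3": {"inputs": {"f.b2"}, "outputs": {"f.c2"}},
    "analyze_j4": {"inputs": {"f.c1", "f.c2"}, "outputs": {"f.d"}},
}

# f.d was registered by the earlier run, so every job becomes unnecessary
print(sorted(reduce_workflow(diamond, {"f.a", "f.d"})))
# Without f.d in the catalog nothing can be removed
print(sorted(reduce_workflow(diamond, {"f.a"})))
```

With f.d registered, the model deletes the same four jobs that the planner lists in the log above.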
Pegasus 3.0 allows you to create workflows of workflows, i.e. your workflow can contain dax jobs that refer to sub-workflows. In this exercise, we will execute a workflow, superdiamond, that executes two diamond workflows.
Let us look at superdiamond.dax in the dax directory
$ cat $HOME/pegasus-wms/dax/superdiamond.dax
<?xml version="1.0" encoding="UTF-8"?>
<!-- generated on: 2010-11-25T08:42:30-08:00 -->
<!-- generated by: pegasus [ ?? ] -->
<adag xmlns="http://pegasus.isi.edu/schema/DAX" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://pegasus.isi.edu/schema/DAX http://pegasus.isi.edu/schema/dax-3.2.xsd" version="3.2" name="superdiamond" index="0" count="1">
<!-- Section 1: Files - Acts as a Replica Catalog (can be empty) -->
   <file name="f.a">
      <pfn url="file:///scratch/tutorial/inputdata/diamond/f.a" site="local"/>
   </file>
   <file name="black-1.dax">
      <pfn url="/home/tutorial/pegasus-wms/dax/black-1.dax" site="local"/>
   </file>
   <file name="black-2.dax">
      <pfn url="/home/tutorial/pegasus-wms/dax/black-2.dax" site="local"/>
   </file>
<!-- Section 2: Executables - Acts as a Transformaton Catalog (can be empty) -->
<!-- Section 3: Transformations - Aggregates executables and Files (can be empty) -->
<!-- Section 4: Job's, DAX's or Dag's - Defines a JOB or DAX or DAG (Atleast 1 required) -->
   <dax id="d1" file="black-1.dax" >
      <argument>-s local --force -o local</argument>
   </dax>
   <dax id="d2" file="black-2.dax" >
      <argument>-s local --force -o local</argument>
   </dax>
<!-- Section 5: Dependencies - Parent Child relationships (can be empty) -->
   <child ref="d2">
      <parent ref="d1"/>
   </child>
</adag>
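To see which sub-workflows the DAX declares and in what order they run, the file can be read with any XML parser. The sketch below uses Python's standard library on a trimmed copy of the DAX above (file entries, comments, and schema attributes are omitted); the variable names are illustrative, and this is not how the planner itself reads the file.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of superdiamond.dax: only the dax jobs and their dependency
DAX = """
<adag xmlns="http://pegasus.isi.edu/schema/DAX" version="3.2"
      name="superdiamond" index="0" count="1">
   <dax id="d1" file="black-1.dax">
      <argument>-s local --force -o local</argument>
   </dax>
   <dax id="d2" file="black-2.dax">
      <argument>-s local --force -o local</argument>
   </dax>
   <child ref="d2">
      <parent ref="d1"/>
   </child>
</adag>
"""

# All elements live in the DAX namespace declared on <adag>
NS = {"dax": "http://pegasus.isi.edu/schema/DAX"}
root = ET.fromstring(DAX)

# Each <dax> job names a sub-workflow file and the planner arguments for it
sub_workflows = {
    job.get("id"): (job.get("file"), job.findtext("dax:argument", namespaces=NS))
    for job in root.findall("dax:dax", NS)
}

# Each <child>/<parent> pair is one edge: the parent must finish first
edges = [
    (parent.get("ref"), child.get("ref"))
    for child in root.findall("dax:child", NS)
    for parent in child.findall("dax:parent", NS)
]

print(sub_workflows)
print(edges)
```

The single edge from d1 to d2 means that black-2 is planned and run only after black-1 has finished.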
Now let us submit this superdiamond workflow.
$ pegasus-plan --dax `pwd`/dax/superdiamond.dax --force --submit \
--dir dags -s local -o local --nocleanup -v
2010.11.29 21:15:49.110 PST: [INFO] event.pegasus.refinement dax.id superdiamond_0 - STARTED
2010.11.29 21:15:49.123 PST: [INFO] event.pegasus.siteselection dax.id superdiamond_0 - STARTED
2010.11.29 21:15:49.142 PST: [INFO] event.pegasus.siteselection dax.id superdiamond_0 - FINISHED
2010.11.29 21:15:49.220 PST: [INFO] Grafting transfer nodes in the workflow
2010.11.29 21:15:49.221 PST: [INFO] event.pegasus.generate.transfer-nodes dax.id superdiamond_0 - STARTED
2010.11.29 21:15:49.305 PST: [INFO] event.pegasus.generate.transfer-nodes dax.id superdiamond_0 - FINISHED
2010.11.29 21:15:49.307 PST: [INFO] event.pegasus.generate.workdir-nodes dax.id superdiamond_0 - STARTED
2010.11.29 21:15:49.312 PST: [INFO] event.pegasus.generate.workdir-nodes dax.id superdiamond_0 - FINISHED
2010.11.29 21:15:49.312 PST: [INFO] event.pegasus.generate.cleanup-wf dax.id superdiamond_0 - STARTED
2010.11.29 21:15:49.314 PST: [INFO] event.pegasus.generate.cleanup-wf dax.id superdiamond_0 - FINISHED
2010.11.29 21:15:49.314 PST: [INFO] event.pegasus.refinement dax.id superdiamond_0 - FINISHED
2010.11.29 21:15:49.371 PST: [INFO] Generating codes for the concrete workflow
2010.11.29 21:15:50.200 PST: [INFO] Generating codes for the concrete workflow -DONE
2010.11.29 21:15:50.200 PST: [INFO] Generating code for the cleanup workflow
2010.11.29 21:15:50.323 PST: [INFO] Generating code for the cleanup workflow -DONE
2010.11.29 21:15:50.496 PST:
2010.11.29 21:15:50.502 PST: -----------------------------------------------------------------------
2010.11.29 21:15:50.508 PST: File for submitting this DAG to Condor : superdiamond-0.dag.condor.sub
2010.11.29 21:15:50.514 PST: Log of DAGMan debugging messages : superdiamond-0.dag.dagman.out
2010.11.29 21:15:50.521 PST: Log of Condor library output : superdiamond-0.dag.lib.out
2010.11.29 21:15:50.528 PST: Log of Condor library error messages : superdiamond-0.dag.lib.err
2010.11.29 21:15:50.559 PST: Log of the life of condor_dagman itself : superdiamond-0.dag.dagman.log
2010.11.29 21:15:50.578 PST:
2010.11.29 21:15:50.588 PST: -no_submit given, not submitting DAG to Condor. You can do this with:
2010.11.29 21:15:50.601 PST: "condor_submit superdiamond-0.dag.condor.sub"
2010.11.29 21:15:50.618 PST: -----------------------------------------------------------------------
2010.11.29 21:15:50.625 PST: Submitting job(s).
2010.11.29 21:15:50.637 PST: Logging submit event(s).
2010.11.29 21:15:50.642 PST: 1 job(s) submitted to cluster 1.
2010.11.29 21:15:51.179 PST:
2010.11.29 21:15:51.185 PST: Your Workflow has been started and runs in base directory given below
2010.11.29 21:15:51.191 PST:
2010.11.29 21:15:51.197 PST: cd /home/tutorial/pegasus-wms/dags/tutorial/pegasus/superdiamond/run0001
2010.11.29 21:15:51.208 PST:
2010.11.29 21:15:51.214 PST: *** To monitor the workflow you can run ***
2010.11.29 21:15:51.220 PST:
2010.11.29 21:15:51.227 PST: pegasus-status -l /home/tutorial/pegasus-wms/dags/tutorial/pegasus/superdiamond/run0001
2010.11.29 21:15:51.234 PST:
2010.11.29 21:15:51.240 PST: *** To remove your workflow run ***
2010.11.29 21:15:51.245 PST: pegasus-remove -d 1.0
2010.11.29 21:15:51.253 PST: or
2010.11.29 21:15:51.261 PST: pegasus-remove /home/tutorial/pegasus-wms/dags/tutorial/pegasus/superdiamond/run0001
2010.11.29 21:15:51.268 PST:
2010.11.29 21:15:51.277 PST: Time taken to execute is 2.745 seconds
2010.11.29 21:15:51.277 PST: [INFO] event.pegasus.planner planner.version 3.0.0 - FINISHED
You can track the workflow using the pegasus-status command.
$ watch pegasus-status -l /home/tutorial/pegasus-wms/dags/tutorial/pegasus/superdiamond/run0001
After the workflow has completed, you will see the files black-1-f.d and black-2-f.d in the storage directory.
$ ls -lh /home/tutorial/local-storage/storage/black-*
-rw-r--r-- 1 pegasus pegasus 3.6K Nov 29 21:36 /home/tutorial/local-storage/storage/black-1-f.d
-rw-r--r-- 1 pegasus pegasus 3.6K Nov 29 21:41 /home/tutorial/local-storage/storage/black-2-f.d
Pegasus ensures that each workflow has its own submit directory and its own execution directory.
The table below lists the submit directories for all the workflows in this exercise.
Table 1.4. Table: Submit Directory Structure for Hierarchical Workflows
| Workflow | Submit directory |
|---|---|
| superdiamond ( the outer level workflow ) | /home/tutorial/pegasus-wms/dags/tutorial/pegasus/superdiamond/run0001 |
| black-1 ( the first sub workflow ) | /home/tutorial/pegasus-wms/dags/tutorial/pegasus/superdiamond/run0001/black-1_d1 |
| black-2 ( the second sub workflow ) | /home/tutorial/pegasus-wms/dags/tutorial/pegasus/superdiamond/run0001/black-2_d2 |
The table below lists the execution directories (one per workflow) in this exercise.
Table 1.5. Table: Execution Directory Structure for Hierarchical Workflows
| Workflow | Execution directory |
|---|---|
| superdiamond ( the outer level workflow ) | /home/tutorial/local-scratch/exec/tutorial/pegasus/superdiamond/run0001 |
| black-1 ( the first sub workflow ) | /home/tutorial/local-scratch/exec/tutorial/pegasus/superdiamond/run0001/black-1_d1 |
| black-2 ( the second sub workflow ) | /home/tutorial/local-scratch/exec/tutorial/pegasus/superdiamond/run0001/black-2_d2 |
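In both tables the sub-workflow directory appears to be the outer run directory plus the DAX file name (without its extension), an underscore, and the id of the dax job. The helper below reproduces that pattern; it is inferred from these two tables only, and sub_workflow_dir is a hypothetical function, not a Pegasus API.

```python
from pathlib import PurePosixPath


def sub_workflow_dir(run_dir, dax_file, job_id):
    """Build the directory of a sub-workflow below the outer run directory."""
    stem = PurePosixPath(dax_file).stem  # "black-1.dax" -> "black-1"
    return PurePosixPath(run_dir) / f"{stem}_{job_id}"


submit_run = "/home/tutorial/pegasus-wms/dags/tutorial/pegasus/superdiamond/run0001"
exec_run = "/home/tutorial/local-scratch/exec/tutorial/pegasus/superdiamond/run0001"

print(sub_workflow_dir(submit_run, "black-1.dax", "d1"))
print(sub_workflow_dir(exec_run, "black-2.dax", "d2"))
```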



