These are the student notes for the Pegasus WMS tutorial. They are designed to be used in conjunction with the instructor's presentation and support.
You will see two styles of machine text here:
Text like this is input that you should type.
Text like this is the output you should get.
For example:
$ date
Mon June 1 11:54:58 BST 2007
You will need to log into the tutorial machine using an ssh client and the login name and password supplied separately.
For the purposes of this tutorial, replace every instance of trainXX with your viz-login username.
On Linux or Mac OS X, open a terminal window and type:
$ ssh trainXX@viz-login.isi.edu
[welcome message]
trainXX@viz-login:~$
You will need to obtain Grid credentials to run workflows on the Grid.
You can generate your proxy using grid-proxy-init:
$ grid-proxy-init
Your identity: /O=edu/OU=ISI/OU=isi.edu/CN=Tutorial User 02
Creating proxy ........................................ Done
Your proxy is valid until: Sun Apr 27 08:05:18 2008
Check your proxy using grid-proxy-info.
$ grid-proxy-info
subject : /O=edu/OU=ISI/OU=isi.edu/CN=Tutorial User 02/CN=2104110451
issuer : /O=edu/OU=ISI/OU=isi.edu/CN=Tutorial User 02
identity : /O=edu/OU=ISI/OU=isi.edu/CN=Tutorial User 02
type : Proxy draft (pre-RFC) compliant impersonation proxy
strength : 512 bits
path : /tmp/x509up_u1044
timeleft : 11:57:20
In this chapter you will be introduced to planning and executing a workflow locally through Pegasus WMS. You will then plan and execute a larger Montage workflow on the Grid.
All the exercises in this chapter are run from the $HOME/pegasus-wms/ directory. All the required files reside in this directory.
$ cd $HOME/pegasus-wms
Files for the exercise are stored in subdirectories:
$ ls
config dags dax
You may also see some other files here.
An abstract DAG has been generated for the Montage application and written in XML format to dax/montage.dax.
Open montage.dax in a file viewer:
$ cat dax/montage.dax
Inside the DAX, you should see three sections:
A list of all the files used in the workflow.
A definition of each job in the workflow.
A list of control-flow dependencies. This section specifies a partial order in which jobs are to be executed.
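To make the idea of a partial order concrete, here is a minimal Python sketch that turns a list of parent/child dependencies into one valid execution order. It is only an illustration, not how Pegasus or DAGMan is implemented. The function name and the edge list are assumptions; the job names mirror the diamond workflow used later in this tutorial.

```python
from collections import deque


def execution_order(dependencies):
    """Return one valid job order for (parent, child) pairs (Kahn's algorithm)."""
    children = {}
    pending_parents = {}
    for parent, child in dependencies:
        children.setdefault(parent, []).append(child)
        children.setdefault(child, [])
        pending_parents.setdefault(parent, 0)
        pending_parents[child] = pending_parents.get(child, 0) + 1

    # Jobs without unfinished parents can run immediately.
    ready = deque(sorted(job for job, n in pending_parents.items() if n == 0))
    order = []
    while ready:
        job = ready.popleft()
        order.append(job)
        for child in children[job]:
            pending_parents[child] -= 1
            if pending_parents[child] == 0:
                ready.append(child)

    if len(order) != len(pending_parents):
        raise ValueError("dependencies contain a cycle")
    return order


# Hypothetical edge list for a diamond-shaped workflow.
DIAMOND = [
    ("generate_ID000001", "findrange_ID000002"),
    ("generate_ID000001", "findrange_ID000003"),
    ("findrange_ID000002", "analyze_ID000004"),
    ("findrange_ID000003", "analyze_ID000004"),
]

if __name__ == "__main__":
    for job in execution_order(DIAMOND):
        print(job)
```

The two findrange jobs have no dependency on each other, so either may run first (or both at once). That freedom is what "partial order" means here.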
In this exercise you will insert entries into the Replica Catalog. The replica catalog that we will use today is a simple file-based catalog. We also support and recommend Globus RLS or a JDBC implementation for production runs.
A Replica Catalog maintains the mapping from logical file names (LFNs) to physical file names (PFNs) for the input files of your workflow. Pegasus queries it to determine the locations of the raw input data files required by the workflow. Additionally, all materialized data is registered in the replica catalog for later data reuse.
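As a rough illustration of the LFN to PFN mapping, the following Python sketch loads a small text catalog into a dictionary and looks up one logical name. It assumes one entry per line, laid out as an LFN, a PFN, and then key="value" attributes; that layout, the function names, and the sample host names are assumptions made for this example and are not taken from Pegasus itself.

```python
import shlex


def parse_replica_line(line):
    """Split one 'LFN PFN key="value" ...' line into (lfn, pfn, attributes)."""
    lfn, pfn, *attrs = shlex.split(line)
    return lfn, pfn, dict(a.split("=", 1) for a in attrs)


def load_catalog(text):
    """Build an LFN -> [(PFN, attributes), ...] mapping, skipping comments and blanks."""
    catalog = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        lfn, pfn, attrs = parse_replica_line(line)
        catalog.setdefault(lfn, []).append((pfn, attrs))
    return catalog


# Illustrative entries only (hypothetical hosts and paths).
SAMPLE = '''\
# sample catalog
f.a gsiftp://example.org/data/f.a pool="local"
f.a gsiftp://mirror.example.org/data/f.a pool="remote"
'''

if __name__ == "__main__":
    for pfn, attrs in load_catalog(SAMPLE)["f.a"]:
        print(pfn, attrs)
```

One logical name may map to several physical copies, which is why each LFN holds a list of replicas rather than a single PFN.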
You can use the rc-client command to insert, query, and delete entries in the replica catalog.
The input data to be used for your workflow resides in the /scratch/tutorial/inputdata/0.2degree directory. We are going to insert entries into the replica catalog that point to the files in this directory.
The instructors have provided:
A file rc.in, the input data file for rc-client, which contains the mappings that need to be inserted into the replica catalog. The file is inside the config directory.
Instructions:
Let us see what the file looks like.
$ cat config/rc.in
# file-based replica catalog: 2007-06-02T13:11:35.954-07:00
statfile_20070529_153243_22618.tbl gsiftp://viz-login.isi.edu/scratch/tutorial/inputdata/0.2degree/statfile.tbl pool="local"
2mass-atlas-990502s-j1440198.fits gsiftp://viz-login.isi.edu/scratch/0.2degree/2mass-atlas-990502s-j1440198.fits pool="local"
2mass-atlas-990502s-j1440186.fits gsiftp://viz-login.isi.edu/scratch/0.2degree/2mass-atlas-990502s-j1440186.fits pool="local"
2mass-atlas-990502s-j1430092.fits gsiftp://viz-login.isi.edu/scratch/0.2degree/2mass-atlas-990502s-j1430092.fits pool="local"
2mass-atlas-990502s-j1420198.fits gsiftp://viz-login.isi.edu/scratch/0.2degree/2mass-atlas-990502s-j1420198.fits pool="local"
2mass-atlas-990502s-j1420186.fits gsiftp://viz-login.isi.edu/scratch/0.2degree/2mass-atlas-990502s-j1420186.fits pool="local"
cimages_20070529_153243_22618.tbl gsiftp://viz-login.isi.edu/scratch/0.2degree/cimages.tbl pool="local"
pimages_20070529_153243_22618.tbl gsiftp://viz-login.isi.edu/scratch/0.2degree/pimages.tbl pool="local"
region_20070529_153243_22618.hdr gsiftp://viz-login.isi.edu/scratch/0.2degree/region.hdr pool="local"
2mass-atlas-990502s-j1430080.fits gsiftp://viz-login.isi.edu/scratch/0.2degree/2mass-atlas-990502s-j1430080.fits pool="local"
Now we are ready to run rc-client and populate the catalog. Since each of you registers unique LFNs, all 10 entries should be registered successfully.
$ rc-client -Dpegasus.user.properties=config/properties --insert config/rc.in
#Successfully worked on : 10 lines
#Worked on total number of : 11 lines.
Now the entries have been successfully inserted into the Replica Catalog. Let us query the replica catalog for a particular LFN.
$ rc-client -Dpegasus.user.properties=config/properties \
lookup pimages_20070529_153243_22618.tbl
pimages_20070529_153243_22618.tbl gsiftp://viz-login.isi.edu/scratch/0.2degree/pimages.tbl pool="local"
Congratulations! You have the replica catalog set up correctly for use. This is the catalog you will tinker with most while running Pegasus.
In this exercise you will set up your Site Catalog and the Transformation Catalog.
The instructors have provided:
A ready transformation catalog (tc.data) in the $HOME/pegasus-wms/config directory.
A semi-ready properties file in the $HOME/pegasus-wms/config directory.
The site catalog contains information about the layout of the grid on which you want to run your workflows. For each site, it records information such as the work directory, the jobmanagers and GridFTP servers to use, and other site-wide settings such as environment variables to be set.
You can use the sc-client command to generate a site catalog from a hand-written sites.txt file.
$ sc-client -f config/sites.txt -o config/sites.xml
2007.06.02 17:06:12.215 PDT: [INFO] Reading config/sites.txt
2007.06.02 17:06:11.262 PDT: [INFO] Reading config/sites.txt (completed)
2007.06.02 17:06:11.276 PDT: [INFO] Written xml output to file : config/sites.xml
The transformation catalog maintains information about where the application code resides on the grid. In our case, it contains the locations where the Diamond or Montage code is installed on the various grid sites.
Take a look at the Transformation Catalog:
$ cat config/tc.data
viz bin/mDiff gsiftp://viz-login.isi.edu/nfs/software/montage/default/bin/mDiff STATIC_BINARY INTEL32::LINUX ENV::MONTAGE_HOME="."
viz bin/mFitplane gsiftp://viz-login.isi.edu/nfs/software/montage/default/bin/mFitplane STATIC_BINARY INTEL32::LINUX ENV::MONTAGE_HOME="."
viz mAdd:3.0 gsiftp://viz-login.isi.edu/nfs/software/montage/default/bin/mAdd STATIC_BINARY INTEL32::LINUX ENV::MONTAGE_HOME="."
viz mBackground:3.0 gsiftp://viz-login.isi.edu/nfs/software/montage/default/bin/mBackground STATIC_BINARY INTEL32::LINUX ENV::MONTAGE_HOME="."
viz mBgModel:3.0 gsiftp://viz-login.isi.edu/nfs/software/montage/default/bin/mBgModel STATIC_BINARY INTEL32::LINUX ENV::MONTAGE_HOME="."
viz mConcatFit:3.0 gsiftp://viz-login.isi.edu/nfs/software/montage/default/bin/mConcatFit STATIC_BINARY INTEL32::LINUX ENV::MONTAGE_HOME="."
viz mDiffFit:3.0 gsiftp://viz-login.isi.edu/nfs/software/montage/default/bin/mDiffFit STATIC_BINARY INTEL32::LINUX ENV::MONTAGE_HOME="."
viz mImgtbl:3.0 gsiftp://viz-login.isi.edu/nfs/software/montage/default/bin/mImgtbl STATIC_BINARY INTEL32::LINUX ENV::MONTAGE_HOME="."
viz mJPEG:3.0 gsiftp://viz-login.isi.edu/nfs/software/montage/default/bin/mJPEG STATIC_BINARY INTEL32::LINUX ENV::MONTAGE_HOME="."
viz mProject:3.0 gsiftp://viz-login.isi.edu/nfs/software/montage/default/bin/mProjectPP STATIC_BINARY INTEL32::LINUX ENV::MONTAGE_HOME="."
viz mProjectPP:3.0 gsiftp://viz-login.isi.edu/nfs/software/montage/default/bin/mProjectPP STATIC_BINARY INTEL32::LINUX ENV::MONTAGE_HOME="."
viz mShrink:3.0 gsiftp://viz-login.isi.edu/nfs/software/montage/default/bin/mShrink STATIC_BINARY INTEL32::LINUX NULL
Open the properties file and check a few properties.
$ cat config/properties
## SELECT THE REPLICA CATALOG MODE AND URL
pegasus.catalog.replica = SimpleFile
pegasus.catalog.replica.file = ${user.home}/pegasus-wms/config/rc.data
## SELECT THE SITE CATALOG MODE AND FILE
pegasus.catalog.site = XML
pegasus.catalog.site.file = ${user.home}/pegasus-wms/config/sites.xml
## SELECT THE TRANSFORMATION CATALOG MODE AND FILE
pegasus.catalog.transformation = File
pegasus.catalog.transformation.file = ${user.home}/pegasus-wms/config/tc.data
## SET UP THE WORK AND INVOCATION DATABASE
pegasus.catalog.work = Database
pegasus.catalog.provenance = InvocationSchema
## Database related properties
pegasus.catalog.*.db.driver = MySQL
pegasus.catalog.*.db.url = jdbc:mysql://smarty.isi.edu/tg2007
pegasus.catalog.*.db.user = tg2007user
pegasus.catalog.*.db.password = Teragrid2007
## USE DAGMAN RETRY FEATURE FOR FAILURES
pegasus.dagman.retry=2
## STAGE ALL OUR EXECUTABLES
pegasus.catalog.transformation.mapper = Staged
## CHECK JOB EXIT CODES FOR FAILURE
pegasus.exitcode.scope=all
## OPTIMIZE DATA AND EXECUTABLE TRANSFERS
pegasus.transfer.refiner=Bundle
#STAGE DATA AND EXECUTABLES USING GRIDFTP 3rd PARTY MODE
pegasus.transfer.*.thirdparty.sites=*
## WORK AND STORAGE DIR
pegasus.dir.storage = ${user.home}/storage
pegasus.dir.exec = ${user.home}/exec
The pegasus-get-sites client can also be used to generate a site catalog and a transformation catalog for the Open Science Grid.
$ pegasus-get-sites --source vors --grid osg --vo osgedu --sc /nfs/home/trainXX/sc.xml \
--tc /nfs/home/trainXX/tc.data
# using default transformation mappings.
# Querying the source "vors" for site information
# assembling information for grid "osg".
# site BNL_ATLAS_1 33 is ACCESSIBLE
# site NYSGRID-CCR-U2 11 is ACCESSIBLE
# site WISC-OSG-EDU 342 is ACCESSIBLE
# site CIT_CMS_T2:srm_v1 263 is ACCESSIBLE
# substituting OSG_GRID value to for site
# site UTA_SWT2 347 is ACCESSIBLE
[...]
In this exercise we will run pegasus-plan to generate a concrete workflow from the abstract workflow (diamond.dax). The generated concrete workflow consists of Condor submit files, which are submitted locally using pegasus-run.
The instructors have provided:
A dax (diamond.dax) in the $HOME/pegasus-wms/dax directory.
You will need to write some things yourself, by following the instructions below:
Run pegasus-plan to generate the condor submit files out of the dax.
Run pegasus-run to submit the workflow locally.
Instructions:
Let us run pegasus-plan on the diamond dax.
$ pegasus-plan -Dpegasus.user.properties=/nfs/home/trainXX/pegasus-wms/config/properties \
--dax /nfs/home/trainXX/pegasus-wms/dax/diamond.dax \
--dir dags -s local -o local --nocleanup
The above command says that we need to plan the diamond dax locally. The Condor submit files are to be generated in a directory structure whose base is dags. We also request that no cleanup jobs be added, as we require the intermediate data to be saved. Here is the output of pegasus-plan.
2008.01.28 15:00:49.536 PST: [INFO] Parsing the DAX
2008.01.28 15:00:50.063 PST: [INFO] Parsing the DAX (completed)
2008.01.28 15:00:50.174 PST: [INFO] Parsing the site catalog
2008.01.28 15:00:50.327 PST: [INFO] Parsing the site catalog (completed)
2008.01.28 15:00:50.394 PST: [INFO] Doing site selection
2008.01.28 15:00:50.436 PST: [INFO] Doing site selection (completed)
2008.01.28 15:00:50.437 PST: [INFO] Grafting transfer nodes in the workflow
2008.01.28 15:00:50.508 PST: [INFO] Grafting transfer nodes in the workflow (completed)
2008.01.28 15:00:50.523 PST: [INFO] Grafting the remote workdirectory creation jobs
in the workflow
2008.01.28 15:00:50.537 PST: [INFO] Grafting the remote workdirectory creation jobs
in the workflow (completed)
2008.01.28 15:00:50.538 PST: [INFO] Generating the cleanup workflow
2008.01.28 15:00:50.542 PST: [INFO] Generating the cleanup workflow (completed)
2008.01.28 15:00:50.563 PST: [INFO] Generating codes for the concrete workflow
2008.01.28 15:00:50.684 PST: [INFO] Generating codes for the concrete workflow (completed)
2008.01.28 15:00:50.684 PST: [INFO] Generating code for the cleanup workflow
2008.01.28 15:00:50.718 PST: [INFO] Generating code for the cleanup workflow (completed)
2008.01.28 15:00:51.087 PST: [INFO] I have concretized your abstract workflow.
The workflow has been entered into the workflow database with a state of "planned".
The next step is to start or execute your workflow. The invocation required is
pegasus-run -Dpegasus.user.properties=/home/trainXX/pegasus-wms/dags/trainXX/pegasus\
/diamond/run0001/pegasus.7543.properties \
/home/trainXX/pegasus-wms/dags/trainXX/pegasus/diamond/run0001
Now run pegasus-run as mentioned in the output of pegasus-plan. Do not copy the command below; it is for illustration only.
$ pegasus-run -Dpegasus.user.properties=/nfs/home/trainXX/pegasus-wms/dags/trainXX\
/pegasus/diamond/run0001/pegasus.7543.properties \
/nfs/home/trainXX/pegasus-wms/dags/trainXX/pegasus/diamond/run0001
Checking all your submit files for log file names.
This might take a while... Done.
-----------------------------------------------------------------------
File for submitting this DAG to Condor : diamond-0.dag.condor.sub
Log of DAGMan debugging messages : diamond-0.dag.dagman.out
Log of Condor library output : diamond-0.dag.lib.out
Log of Condor library error messages : diamond-0.dag.lib.err
Log of the life of condor_dagman itself : diamond-0.dag.dagman.log
Condor Log file for all jobs of this DAG : /tmp/diamond-07544.log
-no_submit given, not submitting DAG to Condor.
You can do this with: "condor_submit diamond-0.dag.condor.sub"
-----------------------------------------------------------------------
Submitting job(s). Logging submit event(s).
1 job(s) submitted to cluster 20068.
I have started your workflow, committed it to DAGMan,
and updated its state in the work database.
A separate daemon was started to collect information about the progress of the workflow.
The job state will soon be visible. Your workflow runs in base directory.
cd /nfs/home/trainXX/pegasus-wms/dags/trainXX/pegasus/diamond/run0001
*** To monitor the workflow you can run ***
pegasus-status -w diamond-0 -t 20080128T150049-0800
or
pegasus-status /nfs/home/trainXX/pegasus-wms/dags/trainXX/pegasus/diamond/run0001
*** To remove your workflow run ***
pegasus-remove -d 20068.0
or
pegasus-remove /nfs/home/trainXX/pegasus-wms/dags/trainXX/pegasus/diamond/run0001
In this exercise we will list ways to track your workflow and give some debugging hints for when something goes wrong.
We will change into the directory that was mentioned in the output of the pegasus-run command.
$ cd /nfs/home/trainXX/pegasus-wms/dags/trainXX/pegasus/diamond/runXXXX
In this directory you will see a large number of files. That should not scare you. Unless things go wrong, you need to look at only a few files to track the progress of the workflow.
Run the command pegasus-status as mentioned by pegasus-run above to check the status of your jobs. Use the watch command to repeat the command automatically every 2 seconds.
$ watch pegasus-status /nfs/home/trainXX/pegasus-wms/dags/trainXX/pegasus/diamond/runXXXX
-- Submitter: viz-login.isi.edu : <128.9.72.178:46426> : viz-login.isi.edu
ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD
19982.0 train01 1/28 13:58 0+00:03:20 R 0 9.8 condor_dagman -f -
19986.0 |-findrange 1/28 14:02 0+00:00:00 I 0 9.8 kickstart -n pegas
19987.0 |-findrange 1/28 14:02 0+00:00:00 I 0 9.8 kickstart -n pegas
The above output shows a couple of jobs queued under the main DAGMan process. Keep watching this output to track whether the workflow is still running. If you do not see any of your jobs in the output for some time (say 30 seconds), the workflow has finished. We need to wait because there may be a delay before Condor DAGMan releases the next job into the queue after a job has finished successfully.
If the output of pegasus-status is empty, then your workflow has either:
completed successfully
stopped midway due to a non-recoverable error
Another way to monitor the workflow is to check the jobstate.log file. This is the output file of the monitoring daemon, which parses all the Condor log files to determine the status of the jobs. It logs the events seen by Condor in a more readable form.
$ more jobstate.log
1201557528 INTERNAL *** TAILSTATD_STARTED ***
1201557528 INTERNAL *** DAGMAN_STARTED ***
1201557528 generate_ID000001_0 UN_READY - - -
1201557528 findrange_ID000003_0 UN_READY - - -
1201557528 findrange_ID000002_0 UN_READY - - -
1201557528 analyze_ID000004_0 UN_READY - - -
[..]
At the start of jobstate.log, when the workflow has just started running, you will see many entries with status UN_READY. This indicates that DAGMan has just parsed the .dag file and has not yet started working on any job. Initially all the jobs in the workflow are listed as UN_READY. After some time you will see entries in jobstate.log showing that a job is being executed, and so on.
1201557747 generate_ID000001_0 EXECUTE 19996.0 local -
1201557747 generate_ID000001_0 GLOBUS_SUBMIT 19996.0 local -
1201557812 generate_ID000001_0 JOB_TERMINATED 19996.0 local -
1201557812 generate_ID000001_0 POST_SCRIPT_STARTED - local -
1201557817 generate_ID000001_0 POST_SCRIPT_TERMINATED 19996.0 local -
1201557817 generate_ID000001_0 POST_SCRIPT_SUCCESS - local -
The above shows the job being submitted and then executed on the grid. It also shows that the job is being run on the grid site local (which is your submit machine). The various states of the job as it goes from submission through execution to post-processing are in UPPERCASE.
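If you want to summarize jobstate.log rather than read it line by line, a small script can keep only the most recent state of each job. The following Python sketch assumes the whitespace-separated layout seen in the excerpts above (timestamp, job name, state, and further fields); the function name is hypothetical and this is not a Pegasus tool.

```python
def latest_states(lines):
    """Return {job name: most recent state} from jobstate.log-style lines."""
    states = {}
    for line in lines:
        fields = line.split()
        # Expect at least: timestamp, job name, state. Skip daemon records.
        if len(fields) < 3 or fields[1] == "INTERNAL":
            continue
        states[fields[1]] = fields[2]
    return states


# A few lines copied from the excerpts shown in this exercise.
SAMPLE = """\
1201557528 INTERNAL *** DAGMAN_STARTED ***
1201557528 generate_ID000001_0 UN_READY - - -
1201557528 findrange_ID000002_0 UN_READY - - -
1201557747 generate_ID000001_0 EXECUTE 19996.0 local -
1201557812 generate_ID000001_0 JOB_TERMINATED 19996.0 local -
1201557817 generate_ID000001_0 POST_SCRIPT_SUCCESS - local -
""".splitlines()

if __name__ == "__main__":
    for job, state in latest_states(SAMPLE).items():
        print(job, state)
```

To use it on a real run, you could pass an open file instead of SAMPLE, for example latest_states(open("jobstate.log")).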
Successfully Completed : Let us again look at the jobstate.log. This time we need to look at the last few lines of jobstate.log
$ tail jobstate.log
1201559232 analyze_ID000004 JOB_TERMINATED 20023.0 local -
1201559232 analyze_ID000004 POST_SCRIPT_STARTED - local -
1201559238 analyze_ID000004 POST_SCRIPT_TERMINATED 20023.0 local -
1201559238 analyze_ID000004 POST_SCRIPT_SUCCESS - local -
1201559238 INTERNAL *** DAGMAN_FINISHED ***
1201559239 INTERNAL *** TAILSTATD_FINISHED 0 ***
Looking at the last two lines, we see that DAGMan finished and tailstatd finished successfully with status 0. This means the workflow ran successfully. Congratulations, you ran your workflow on the local site. The workflow generates a single output file, f.d, which resides at /scratch/trainXX/storage/f.d, where trainXX is your user id.
To view the file, you can copy f.d to your viz-login webspace and view it in your web browser:
$ cp /scratch/trainXX/storage/f.d ~/public_html
Point your web browser to: http://viz-login.isi.edu/~trainXX/f.d where trainXX is your viz-login user id.
Unsuccessfully Completed (Workflow execution stopped midway) : Let us again look at the jobstate.log. Again we need to look at the last few lines of jobstate.log
$ tail jobstate.log
1180840233 analyze_ID000004_0 JOB_TERMINATED 2787.0 local -
1180840233 analyze_ID000004_0 POST_SCRIPT_STARTED - local -
1180840238 analyze_ID000004_0 POST_SCRIPT_TERMINATED 2787.0 local -
1180840238 analyze_ID000004_0 POST_SCRIPT_FAILURE 1 local -
1180840373 INTERNAL *** DAGMAN_FINISHED ***
1180840373 INTERNAL *** TAILSTATD_FINISHED 1 ***
Looking at the last two lines, we see that DAGMan finished and tailstatd finished unsuccessfully with status 1. We can easily determine which job failed: it is analyze_ID000004_0 in this case, the job that logged POST_SCRIPT_FAILURE. To determine the reason for the failure, we need to look at its kickstart output file, which is named JOBNAME.out.NNN, where NNN is a three-digit number starting at 000.
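The success and failure checks described above can also be sketched in Python. The function below reads jobstate.log-style lines, reports the status recorded on the TAILSTATD_FINISHED line, and lists the jobs whose last post-script result was POST_SCRIPT_FAILURE. The function name is hypothetical and the parsing relies only on the layout visible in the excerpts; treat it as a sketch, not as part of Pegasus.

```python
def workflow_outcome(lines):
    """Return (exit_status, failed_jobs) from jobstate.log-style lines.

    exit_status is the integer after TAILSTATD_FINISHED, or None when that
    record has not been written yet.
    """
    exit_status = None
    last_post = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue
        if fields[1] == "INTERNAL":
            if "TAILSTATD_FINISHED" in fields:
                idx = fields.index("TAILSTATD_FINISHED")
                if idx + 1 < len(fields):
                    exit_status = int(fields[idx + 1])
        elif fields[2] in ("POST_SCRIPT_SUCCESS", "POST_SCRIPT_FAILURE"):
            # A retried job may fail first and succeed later; keep the last result.
            last_post[fields[1]] = fields[2]
    failed = [job for job, state in last_post.items()
              if state == "POST_SCRIPT_FAILURE"]
    return exit_status, failed


# Lines copied from the failed run shown above.
FAILED_RUN = """\
1180840233 analyze_ID000004_0 JOB_TERMINATED 2787.0 local -
1180840233 analyze_ID000004_0 POST_SCRIPT_STARTED - local -
1180840238 analyze_ID000004_0 POST_SCRIPT_TERMINATED 2787.0 local -
1180840238 analyze_ID000004_0 POST_SCRIPT_FAILURE 1 local -
1180840373 INTERNAL *** DAGMAN_FINISHED ***
1180840373 INTERNAL *** TAILSTATD_FINISHED 1 ***
""".splitlines()

if __name__ == "__main__":
    status, failed = workflow_outcome(FAILED_RUN)
    print("tailstatd status:", status)
    print("failed jobs:", failed)
```

A status of None means the workflow is probably still running, since tailstatd has not yet written its final record.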
In this exercise we will learn about the DAG file format and some of the log files generated when the DAG runs.
Now take a look at the DAG file...
$ cat dags/trainXX/pegasus/diamond/run0001/diamond-0.dag
######################################################################
# PEGASUS GENERATED SUBMIT FILE
# DAG diamond
# Index = 0, Count = 1
######################################################################
JOB generate_ID000001 generate_ID000001.sub
RETRY generate_ID000001 2
JOB findrange_ID000002 findrange_ID000002.sub
RETRY findrange_ID000002 2
JOB findrange_ID000003 findrange_ID000003.sub
RETRY findrange_ID000003 2
JOB analyze_ID000004 analyze_ID000004.sub
RETRY analyze_ID000004 2
JOB diamond_0_pegasus_concat diamond_0_pegasus_concat.sub
JOB diamond_0_local_cdir diamond_0_local_cdir.sub
SCRIPT POST diamond_0_local_cdir /nfs/software/pegasus/default/bin/exitpost
-Dpegasus.user.properties=/nfs/home/trainXX/pegasus-wms/dags/trainXX/pegasus
/diamond/run0001/pegasus.31433.properties
/home/trainXX/pegasus-wms/dags/trainXX/pegasus/diamond/run0001/diamond_0_local_cdir.out
RETRY diamond_0_local_cdir 2
PARENT generate_ID000001 CHILD findrange_ID000002
PARENT generate_ID000001 CHILD findrange_ID000003
PARENT findrange_ID000002 CHILD analyze_ID000004
PARENT findrange_ID000003 CHILD analyze_ID000004
PARENT diamond_0_pegasus_concat CHILD generate_ID000001
PARENT diamond_0_local_cdir CHILD diamond_0_pegasus_concat
######################################################################
# End of DAG
######################################################################
... and the dagman.out file.
$ cat dags/trainXX/pegasus/diamond/run0001/diamond-0.dag.dagman.out
1/29 10:32:14 ******************************************************
1/29 10:32:14 ** condor_scheduniv_exec.20133.0 (CONDOR_DAGMAN) STARTING UP
1/29 10:32:14 ** /nfs/software/condor/6.9.4/bin/condor_dagman
1/29 10:32:14 ** $CondorVersion: 6.9.4 Aug 30 2007 $
1/29 10:32:14 ** $CondorPlatform: I386-LINUX_RHEL3 $
1/29 10:32:14 ** PID = 3202
1/29 10:32:14 ** Log last touched time unavailable (No such file or directory)
1/29 10:32:14 ******************************************************
[....]
1/29 10:32:27 Submitting Condor Node diamond_0_local_cdir job(s)...
1/29 10:32:27 submitting: condor_submit
-a dag_node_name' '=' 'diamond_0_local_cdir
-a +DAGManJobId' '=' '20133
-a DAGManJobId' '=' '20133
-a submit_event_notes' '=' 'DAG' 'Node:''diamond_0_local_cdir
-a +DAGParentNodeNames' '=' '""diamond_0_local_cdir.sub
1/29 10:32:27 From submit: Submitting job(s).
1/29 10:32:27 From submit: Logging submit event(s).
1/29 10:32:27 From submit: 1 job(s) submitted to cluster 20134.
1/29 10:32:27 assigned Condor ID (20134.0)
1/29 10:32:27 Just submitted 1 job this cycle...
1/29 10:32:27 Event: ULOG_SUBMIT for Condor Node diamond_0_local_cdir (20134.0)
1/29 10:32:27 Number of idle job procs: 1
1/29 10:32:27 Of 6 nodes total:
1/29 10:32:27 Done Pre Queued Post Ready Un-Ready Failed
1/29 10:32:27 === === === === === === ===
1/29 10:32:27 0 0 1 0 0 5 0
[....]
1/29 10:33:24 Done Pre Queued Post Ready Un-Ready Failed
1/29 10:33:24 === === === === === === ===
1/29 10:33:24 6 0 0 0 0 0 0
1/29 10:33:24 All jobs Completed!
1/29 10:33:24 Note: 0 total job deferrals because of -MaxJobs limit (0)
1/29 10:33:24 Note: 0 total job deferrals because of -MaxIdle limit (0)
1/29 10:33:24 Note: 0 total PRE script deferrals because of -MaxPre limit (20)
1/29 10:33:24 Note: 0 total POST script deferrals because of -MaxPost limit (20)
1/29 10:33:24 **** condor_scheduniv_exec.20133.0 (condor_DAGMAN) EXITING WITH STATUS 0
Sometimes you may want to halt the execution of the workflow or remove it permanently. You can stop a workflow by running the pegasus-remove command mentioned in the output of pegasus-run:
$ pegasus-remove /nfs/home/trainXX/pegasus-wms/dags/trainXX/pegasus/diamond/runXXXX
Job 2788.0 marked for removal
In this exercise we will run pegasus-plan to generate a concrete workflow from the abstract workflow (montage.dax). The generated concrete workflow consists of Condor submit files, which are submitted to remote grid resources using pegasus-run.
The instructors have provided:
A dax (montage.dax) in the $HOME/pegasus-wms/dax/ directory.
You will need to write some things yourself, by following the instructions below:
Run pegasus-plan to generate the condor submit files out of the dax.
Run pegasus-run to submit the workflow to the grid.
Instructions:
Let us run pegasus-plan on the montage dax on the Viz cluster. If multiple sites are available you could provide the sites using a comma "," separated list like viz,uwm etc.
$ pegasus-plan -Dpegasus.user.properties=/nfs/home/trainXX/pegasus-wms/config/properties \
--dir dags --sites viz --output local \
--nocleanup --dax /nfs/home/trainXX/pegasus-wms/dax/montage.dax
The above command says that we need to plan the montage dax on the viz site. The output data needs to be transferred back to the local host. The Condor submit files are to be generated in a directory structure whose base is dags. We also request that no cleanup jobs be added, as we require the intermediate data to remain on the remote host. Here is the output of pegasus-plan.
2008.04.27 12:59:15.171 PDT: [INFO] Parsing the DAX
2008.04.27 12:59:15.972 PDT: [INFO] Parsing the DAX (completed)
2008.04.27 12:59:16.085 PDT: [INFO] Parsing the site catalog
2008.04.27 12:59:16.257 PDT: [INFO] Parsing the site catalog (completed)
2008.04.27 12:59:16.323 PDT: [INFO] Doing site selection
2008.04.27 12:59:16.427 PDT: [INFO] Doing site selection (completed)
2008.04.27 12:59:16.428 PDT: [INFO] Grafting transfer nodes in the workflow
2008.04.27 12:59:16.621 PDT: [INFO] Grafting transfer nodes in the workflow (completed)
2008.04.27 12:59:16.628 PDT: [INFO] Grafting the remote workdirectory creation jobs
in the workflow
2008.04.27 12:59:16.644 PDT: [INFO] Grafting the remote workdirectory creation jobs
in the workflow (completed)
2008.04.27 12:59:16.645 PDT: [INFO] Generating the cleanup workflow
2008.04.27 12:59:16.649 PDT: [INFO] Generating the cleanup workflow (completed)
2008.04.27 12:59:16.668 PDT: [INFO] Generating codes for the concrete workflow
2008.04.27 12:59:17.330 PDT: [INFO] Generating codes for the concrete workflow (completed)
2008.04.27 12:59:17.331 PDT: [INFO] Generating code for the cleanup workflow
2008.04.27 12:59:17.414 PDT: [INFO] Generating code for the cleanup workflow (completed)
2008.04.27 12:59:17.421 PDT: [INFO]
I have concretized your abstract workflow. The workflow has been entered
into the workflow database with a state of "planned". The next step is
to start or execute your workflow. The invocation required is
pegasus-run -Dpegasus.user.properties=/nfs/home/train02/pegasus-wms/dags/train02\
/pegasus/montage/run0001/pegasus.57010.properties \
--nodatabase /nfs/home/train02/pegasus-wms/dags/train02/pegasus/montage/run0001
If you get any errors while running pegasus-plan, you can add -vvvvv to the pegasus-plan command to enable maximum verbosity.
Now run pegasus-run as mentioned in the output of pegasus-plan.
$ pegasus-run -Dpegasus.user.properties=/nfs/home/trainXX/pegasus-wms/dags/trainXX\
/pegasus/montage/run0001/pegasus.51773.properties \
/nfs/home/trainXX/pegasus-wms/dags/trainXX/pegasus/montage/run0001
Checking all your submit files for log file names.
This might take a while... Done.
-----------------------------------------------------------------------
File for submitting this DAG to Condor : montage-0.dag.condor.sub
Log of DAGMan debugging messages : montage-0.dag.dagman.out
Log of Condor library output : montage-0.dag.lib.out
Log of Condor library error messages : montage-0.dag.lib.err
Log of the life of condor_dagman itself : montage-0.dag.dagman.log
Condor Log file for all jobs of this DAG : /tmp/montage-051774.log
-no_submit given, not submitting DAG to Condor.
You can do this with: "condor_submit montage-0.dag.condor.sub"
-----------------------------------------------------------------------
Submitting job(s). Logging submit event(s).
1 job(s) submitted to cluster 19968.
I have started your workflow, committed it to DAGMan,
and updated its state in the work database.
A separate daemon was started to collect information about the progress of the workflow.
The job state will soon be visible. Your workflow runs in base directory.
cd /nfs/home/trainXX/pegasus-wms/dags/trainXX/pegasus/montage/run0001
*** To monitor the workflow you can run ***
pegasus-status -w montage-0 -t 20080128T114525-0800
or
pegasus-status /nfs/home/trainXX/pegasus-wms/dags/trainXX/pegasus/montage/run0001
*** To remove your workflow run ***
pegasus-remove -d 19968.0
or
pegasus-remove /nfs/home/trainXX/pegasus-wms/dags/trainXX/pegasus/montage/run0001
The above command submits the workflow to Condor DAGMan/CondorG. After submitting, it starts a monitoring daemon, tailstatd, which parses the Condor log files to update the status of the jobs and pushes it into a work database.
Monitor the workflow using the commands provided in the output of the pegasus-run command and other commands explained earlier.
If it runs successfully, the workflow generates a single output file, montage.jpg, which resides at /scratch/trainXX/storage/montage.jpg, where trainXX is your user id.
To view the image, you can copy montage.jpg to your viz-login webspace and view it in your web browser:
$ cp /scratch/trainXX/storage/montage.jpg ~/public_html
Point your web browser to: http://viz-login.isi.edu/~trainXX/montage.jpg where trainXX is your viz-login user id.
In this exercise we will learn how to use node category throttles, VARS (to use a single submit file for multiple jobs), and PRE and POST scripts.
Take a look at the sample DAG file for this exercise:
$ cat dagman/node_categories/node_categories.dag
# DAG to illustrate node categories/category throttles, and VARS.
JOB Setup setup.submit
SCRIPT PRE Setup setup.pre
JOB BigProc1 big_proc.submit
VARS BigProc1 ARGS = "-sleep 60 -trials 10000 -seed 1234567"
PARENT Setup CHILD BigProc1
CATEGORY BigProc1 BigProc
JOB BigProc2 big_proc.submit
VARS BigProc2 ARGS = "-sleep 80 -trials 20000 -seed 7654321"
PARENT Setup CHILD BigProc2
CATEGORY BigProc2 BigProc
...
Now edit this DAG file:
$ vi dagman/node_categories/node_categories.dag
Insert the following lines that set node category throttles:
MAXJOBS BigProc 1
MAXJOBS SmallProc 4
Also, change the two "YOUR ARGS HERE" strings to appropriate arguments, based on the other nodes with the same submit files (remember to keep the double quotes in place).
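To see what a category throttle does, consider the following Python sketch. It is a simplified model written for this tutorial, not DAGMan's actual scheduling code: given the ready nodes, the number of jobs already queued per category, and the MAXJOBS limits, it decides which nodes may be submitted now and which are deferred. The function name is hypothetical, and the node SmallProc5 is assumed for the example.

```python
def select_submittable(ready, running, maxjobs):
    """Pick ready nodes that can start without exceeding any category limit.

    ready   -- list of (node name, category) in submission order
    running -- dict mapping category to the number of jobs already queued
    maxjobs -- dict mapping category to its limit; a missing key means no limit
    """
    counts = dict(running)
    submit, deferred = [], []
    for node, category in ready:
        limit = maxjobs.get(category)
        if limit is not None and counts.get(category, 0) >= limit:
            deferred.append(node)  # throttle reached, try again later
            continue
        counts[category] = counts.get(category, 0) + 1
        submit.append(node)
    return submit, deferred


# Limits matching the MAXJOBS lines above.
LIMITS = {"BigProc": 1, "SmallProc": 4}

# Ready nodes after Setup has finished (SmallProc5 is an assumed name).
READY = [("BigProc1", "BigProc"), ("BigProc2", "BigProc")] + [
    ("SmallProc%d" % i, "SmallProc") for i in range(1, 6)
]

if __name__ == "__main__":
    submit, deferred = select_submittable(READY, {}, LIMITS)
    print("submit now:", submit)
    print("deferred:  ", deferred)
```

With these limits, only one BigProc node and four SmallProc nodes are submitted at a time; the rest wait until a running job in the same category finishes.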
Now submit the dag:
$ condor_submit_dag -f -usedagdir dagman/node_categories/node_categories.dag
Checking all your submit files for log file names.
This might take a while...
Done.
-----------------------------------------------------------------------
File for submitting this DAG to Condor
: dagman/node_categories/node_categories.dag.condor.sub
Log of DAGMan debugging messages
: dagman/node_categories/node_categories.dag.dagman.out
Log of Condor library output
: dagman/node_categories/node_categories.dag.lib.out
Log of Condor library error messages
: dagman/node_categories/node_categories.dag.lib.err
Log of the life of condor_dagman itself
: dagman/node_categories/node_categories.dag.dagman.log
Condor Log file for all jobs of this DAG
: /nfs/home/wenger/ccgrid08/dagman/node_categories/setup.log
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 21306.
-----------------------------------------------------------------------
You can monitor the DAG using the condor_q command:
$ condor_q -dag trainXX
-- Submitter: viz-login.isi.edu : <128.9.72.178:40977> : viz-login.isi.edu
ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD
21276.0 wenger 4/28 09:26 0+00:00:31 R 0 4.4 condor_dagman -f -
21278.0 |-BigProc1 4/28 09:27 0+00:00:11 R 0 0.0 pi -sleep 60 -tria
21279.0 |-SmallProc1 4/28 09:27 0+00:00:06 R 0 0.0 factor -sleep 20 -
21280.0 |-SmallProc2 4/28 09:27 0+00:00:06 R 0 0.0 factor -sleep 20 -
21281.0 |-SmallProc3 4/28 09:27 0+00:00:06 R 0 0.0 factor -sleep 20 -
21282.0 |-SmallProc4 4/28 09:27 0+00:00:06 R 0 0.0 factor -sleep 20 -
6 jobs; 0 idle, 6 running, 0 held
You can also monitor it by looking at the dagman.out file:
$ tail -f dagman/node_categories/node_categories.dag.dagman.out
...
4/28 09:24:08 Of 15 nodes total:
4/28 09:24:08 Done Pre Queued Post Ready Un-Ready Failed
4/28 09:24:08 === === === === === === ===
4/28 09:24:08 15 0 0 0 0 0 0
4/28 09:24:08 Note: 82 total job deferrals because of node category throttles
4/28 09:24:08 All jobs Completed!
4/28 09:24:08 Note: 0 total job deferrals because of -MaxJobs limit (0)
4/28 09:24:08 Note: 0 total job deferrals because of -MaxIdle limit (0)
4/28 09:24:08 Note: 82 total job deferrals because of node category throttles
4/28 09:24:08 Note: 0 total PRE script deferrals because of -MaxPre limit (0)
4/28 09:24:08 Note: 0 total POST script deferrals because of -MaxPost limit (0)
4/28 09:24:08 **** condor_scheduniv_exec.21260.0 (condor_DAGMAN) EXITING WITH STATUS 0
Also note that even though the Condor job of the Cleanup node failed, its POST script succeeded, and therefore the node succeeded overall:
$ cat dagman/node_categories/node_categories.dag.dagman.out
...
4/28 09:24:03 Event: ULOG_JOB_TERMINATED for Condor Node Cleanup (21275.0)
4/28 09:24:03 Node Cleanup job proc (21275.0) failed with status 1.
4/28 09:24:03 Node Cleanup job completed
4/28 09:24:03 Running POST script of Node Cleanup...
...
4/28 09:24:08 Event: ULOG_POST_SCRIPT_TERMINATED for Condor Node Cleanup (21275.0)
4/28 09:24:08 POST Script of Node Cleanup completed successfully.
...
Once the DAG has completed, you should have two .result files in the DAG directory:
$ cat dagman/node_categories/factor.result
output/factor.out.21279:887766 = 2 * 3 * 11 * 13451
output/factor.out.21280:234234 = 2 * 3^2 * 7 * 11 * 13^2
output/factor.out.21281:9699690 = 2 * 3 * 5 * 7 * 11 * 13 * 17 * 19
output/factor.out.21282:234238 = 2 * 117119
output/factor.out.21283:9234973 = 11 * 61 * 13763
output/factor.out.21284:230953 = 41 * 43 * 131
output/factor.out.21285:6249832 = 2^3 * 781229
output/factor.out.21286:321094 = 2 * 181 * 887
output/factor.out.21287:723400 = 2^3 * 5^2 * 3617
output/factor.out.21289:1934023 = 7 * 13 * 53 * 401
$ cat dagman/node_categories/pi.result
output/pi.out.21278:Estimate of pi: 3.160400 (10000 trials)
output/pi.out.21288:Estimate of pi: 3.166600 (20000 trials)
output/pi.out.21290:Estimate of pi: 3.145900 (40000 trials)
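The deferral notes that DAGMan writes to dagman.out show how often each throttle held a job back. As a minimal sketch, the following Python snippet extracts those counts from log lines. The function name deferral_notes is hypothetical, and the regular expression assumes the note format seen in the excerpt earlier in this exercise; other DAGMan versions may word these lines differently.

```python
import re

# Matches lines such as:
#   4/28 09:24:08 Note: 82 total job deferrals because of node category throttles
DEFERRAL = re.compile(
    r"Note: (\d+) total (job|PRE script|POST script) deferrals because of (.+)$"
)


def deferral_notes(lines):
    """Return {reason: count}, keeping the last occurrence of each note."""
    notes = {}
    for line in lines:
        match = DEFERRAL.search(line.rstrip())
        if match:
            count, _kind, reason = match.groups()
            notes[reason] = int(count)
    return notes


# Lines copied from the dagman.out excerpt in this exercise.
sample = [
    "4/28 09:24:08 Note: 0 total job deferrals because of -MaxJobs limit (0)",
    "4/28 09:24:08 Note: 0 total job deferrals because of -MaxIdle limit (0)",
    "4/28 09:24:08 Note: 82 total job deferrals because of node category throttles",
    "4/28 09:24:08 Note: 0 total PRE script deferrals because of -MaxPre limit (0)",
]

notes = deferral_notes(sample)
print(notes["node category throttles"])  # 82
```

To run it against your own log, pass an open file instead of the sample list, for example deferral_notes(open("dagman/node_categories/node_categories.dag.dagman.out")).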
In this exercise we will learn how to use nested DAGs (DAGs as DAG nodes), DAG config files (allowing us to set any DAGMan-related configuration macros), and overall job throttling.
Because of some limitations of the current version of DAGMan, this exercise must be done in the directory containing the DAG files:
$ cd dagman/nested_dags
Take a look at the top-level DAG file:
$ cat top.dag
# Top-level DAG for nested DAGs exercise.
JOB Setup setup.submit
SCRIPT PRE Setup setup.pre
JOB Factor factor.dag.condor.sub
INSERT Pi JOB HERE
JOB Cleanup cleanup.submit
SCRIPT POST Cleanup cleanup.post
PARENT Setup CHILD Factor Pi
PARENT Factor Pi CHILD Cleanup
Now edit the DAG file:
$ vi top.dag
Change the following line
INSERT Pi JOB HERE
to be a new job that's like the Factor job, except that the job name is Pi, and the submit file corresponds to pi.dag instead of factor.dag.
Take a look at the nested DAG files:
$ cat factor.dag
# First lower-level DAG for nested DAG exercise.
JOB A factor.submit
VARS A ARGS = "-sleep 20 -num 23489820"
JOB B1 factor.submit
VARS B1 ARGS = "-sleep 20 -num 77201856"
JOB B2 factor.submit
VARS B2 ARGS = "-sleep 20 -num 292342"
JOB C factor.submit
VARS C ARGS = "-sleep 20 -num 92340234"
PARENT A CHILD B1 B2
PARENT B1 B2 CHILD C
$ cat pi.dag
# Second lower-level DAG for nested DAG exercise.
# This works only with version 6.9.2 and later.
CONFIG pi.config
JOB A pi.submit
VARS A ARGS = "-sleep 20 -trials 10000 -seed 2348235"
JOB B1 pi.submit
VARS B1 ARGS = "-sleep 20 -trials 50000 -seed 99082348"
JOB B2 pi.submit
VARS B2 ARGS = "-sleep 20 -trials 50000 -seed 66899243"
JOB C pi.submit
VARS C ARGS = "-sleep 20 -trials 10000 -seed 66899243"
PARENT A CHILD B1 B2
PARENT B1 B2 CHILD C
Also examine the config file for the pi DAG:
$ cat pi.config
DAGMAN_MAX_JOBS_SUBMITTED = 1
Now create the .condor.sub files for the nested DAGs:
$ condor_submit_dag -f -no_submit factor.dag
Checking all your submit files for log file names.
This might take a while...
Done.
-----------------------------------------------------------------------
File for submitting this DAG to Condor : factor.dag.condor.sub
Log of DAGMan debugging messages : factor.dag.dagman.out
Log of Condor library output : factor.dag.lib.out
Log of Condor library error messages : factor.dag.lib.err
Log of the life of condor_dagman itself : factor.dag.dagman.log
Condor Log file for all jobs of this DAG : /nfs/home/wenger/ccgrid08/dagman/nested_dags/factor.log
-no_submit given, not submitting DAG to Condor. You can do this with:
"condor_submit factor.dag.condor.sub"
-----------------------------------------------------------------------
$ condor_submit_dag -f -no_submit pi.dag
Checking all your submit files for log file names.
This might take a while...
Done.
-----------------------------------------------------------------------
File for submitting this DAG to Condor : pi.dag.condor.sub
Log of DAGMan debugging messages : pi.dag.dagman.out
Log of Condor library output : pi.dag.lib.out
Log of Condor library error messages : pi.dag.lib.err
Log of the life of condor_dagman itself : pi.dag.dagman.log
Condor Log file for all jobs of this DAG : /nfs/home/wenger/ccgrid08/dagman/nested_dags/pi.log
-no_submit given, not submitting DAG to Condor. You can do this with:
"condor_submit pi.dag.condor.sub"
-----------------------------------------------------------------------
Finally, you are ready to submit the top-level DAG:
$ condor_submit_dag -f top.dag
Checking all your submit files for log file names.
This might take a while...
Done.
-----------------------------------------------------------------------
File for submitting this DAG to Condor : top.dag.condor.sub
Log of DAGMan debugging messages : top.dag.dagman.out
Log of Condor library output : top.dag.lib.out
Log of Condor library error messages : top.dag.lib.err
Log of the life of condor_dagman itself : top.dag.dagman.log
Condor Log file for all jobs of this DAG : /nfs/home/wenger/ccgrid08/dagman/nested_dags/top.log
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 21292.
-----------------------------------------------------------------------
Again, you can monitor the DAG with condor_q or by looking at the dagman.out file.
Note that the dagman.out file for the pi DAG shows the effects of the throttle in the config file:
$ cat pi.dag.dagman.out
4/28 09:46:35 Of 4 nodes total:
4/28 09:46:35 Done Pre Queued Post Ready Un-Ready Failed
4/28 09:46:35 === === === === === === ===
4/28 09:46:35 4 0 0 0 0 0 0
4/28 09:46:35 Note: 7 total job deferrals because of -MaxJobs limit (1)
4/28 09:46:35 All jobs Completed!
4/28 09:46:35 Note: 7 total job deferrals because of -MaxJobs limit (1)
4/28 09:46:35 Note: 0 total job deferrals because of -MaxIdle limit (0)
4/28 09:46:35 Note: 0 total job deferrals because of node category throttles
4/28 09:46:35 Note: 0 total PRE script deferrals because of -MaxPre limit (0)
4/28 09:46:35 Note: 0 total POST script deferrals because of -MaxPost limit (0)
4/28 09:46:35 **** condor_scheduniv_exec.21296.0 (condor_DAGMAN) EXITING WITH STATUS 0
Again, you should have two .result files:
$ cat factor.result
output/factor.out.21297:23489820 = 2^2 * 3^2 * 5 * 37 * 3527
output/factor.out.21299:77201856 = 2^6 * 3^3 * 43 * 1039
output/factor.out.21300:292342 = 2 * 313 * 467
output/factor.out.21302:92340234 = 2 * 3^2 * 7 * 29 * 37 * 683
$ cat pi.result
output/pi.out.21298:Estimate of pi: 3.170800 (10000 trials)
output/pi.out.21301:Estimate of pi: 3.147280 (50000 trials)
output/pi.out.21303:Estimate of pi: 3.141040 (50000 trials)
output/pi.out.21304:Estimate of pi: 3.154000 (10000 trials)
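To see why a limit of one submitted job changes how the pi DAG runs, the following Python sketch simulates a DAG in rounds. It is a simplified model, not DAGMan's scheduler: it assumes every job takes exactly one round, and the function name run_rounds and its max_jobs parameter are hypothetical. It does not attempt to reproduce the deferral counts that DAGMan reports.

```python
def run_rounds(parents, max_jobs=0):
    """Simulate a DAG in rounds; return the batch of nodes run in each round.

    parents maps each node to the nodes it depends on.
    max_jobs limits how many nodes run in one round (0 means no limit).
    """
    done = set()
    rounds = []
    while len(done) < len(parents):
        # A node is ready once all of its parents have finished.
        ready = sorted(n for n in parents
                       if n not in done and all(p in done for p in parents[n]))
        if not ready:
            raise ValueError("cycle in DAG")
        batch = ready[:max_jobs] if max_jobs > 0 else ready
        rounds.append(batch)
        done.update(batch)
    return rounds


# Shape of pi.dag: A -> B1, B2 -> C
pi_dag = {"A": [], "B1": ["A"], "B2": ["A"], "C": ["B1", "B2"]}

print(run_rounds(pi_dag))              # [['A'], ['B1', 'B2'], ['C']]
print(run_rounds(pi_dag, max_jobs=1))  # [['A'], ['B1'], ['B2'], ['C']]
```

Without a limit, B1 and B2 run side by side. With a limit of one, as set by DAGMAN_MAX_JOBS_SUBMITTED = 1 in pi.config, they are forced to run one after the other.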
A workflow may contain many jobs that each run for only a few seconds. In such cases the overhead of scheduling each job on the grid is large compared to the job's runtime, and the runtime of the entire workflow can be reduced by using Pegasus clustering techniques. One such technique is horizontal clustering, which groups jobs on the same level of the workflow into one or more clustered jobs, each of which runs its constituent jobs sequentially.
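The idea of grouping jobs level by level can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not Pegasus's implementation: the functions levels and cluster_horizontally, the cluster_size parameter, and the example workflow are all hypothetical, and the real planner takes more into account when deciding which jobs may be grouped together.

```python
from collections import defaultdict


def levels(parents):
    """Return {node: level}; roots are level 1, children one below their deepest parent."""
    memo = {}

    def level(node):
        if node not in memo:
            memo[node] = 1 + max((level(p) for p in parents[node]), default=0)
        return memo[node]

    for node in parents:
        level(node)
    return memo


def cluster_horizontally(parents, cluster_size):
    """Group the jobs on each level into chunks of at most cluster_size jobs."""
    by_level = defaultdict(list)
    for node, lvl in sorted(levels(parents).items()):
        by_level[lvl].append(node)
    clusters = []
    for lvl in sorted(by_level):
        jobs = by_level[lvl]
        for i in range(0, len(jobs), cluster_size):
            clusters.append(jobs[i:i + cluster_size])
    return clusters


# Hypothetical workflow: one setup job, five short jobs on level 2, one final job.
workflow = {
    "setup": [],
    "p1": ["setup"], "p2": ["setup"], "p3": ["setup"],
    "p4": ["setup"], "p5": ["setup"],
    "merge": ["p1", "p2", "p3", "p4", "p5"],
}

clusters = cluster_horizontally(workflow, cluster_size=2)
print(clusters)
# [['setup'], ['p1', 'p2'], ['p3', 'p4'], ['p5'], ['merge']]
```

In this example seven individually scheduled jobs become five, so the scheduling overhead is paid five times instead of seven.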
Plan the workflow with horizontal clustering by passing --cluster horizontal to pegasus-plan:
$ pegasus-plan -Dpegasus.user.properties=/nfs/home/trainXX/pegasus-wms/config/properties \
--dir /nfs/home/trainXX/pegasus-wms/dags --sites viz --output local --nocleanup \
--cluster horizontal --dax /nfs/home/trainXX/pegasus-wms/dax/montage.dax
All the jobs so far have been executed on the shared filesystem of the viz cluster. The input data required by the Montage workflow was staged to a directory on the shared filesystem, and all the jobs were then executed in that directory.
A recent addition to Pegasus (still in the testing phase) allows you to execute each job in a temporary directory on the worker node's local filesystem. For this to happen, Second Level Staging (SLS) must occur, which transfers the data from the directory on the shared filesystem to a directory on the local filesystem of the worker node.
Set the property
pegasus.execute.*.filesystem.local=true
in your properties file.
Repeat Exercise 2.6.