These are the student notes for the Pegasus WMS tutorial. They are designed to be used in conjunction with instructor presentation and support.
You will see two styles of machine text here:
Text like this is input that you should type.
Text like this is the output you should get.
For example:
$ date
Mon June 1 11:54:58 BST 2007
You will need to log into the tutorial machine, using an ssh client and the login name and password supplied separately.
For the purpose of this tutorial replace any instance of @trainXX@ with your viz-login username.
On Windows, PuTTY is recommended as an ssh client.
On Linux or Mac OS X, open a terminal window and type:
$ ssh @trainXX@@viz-login.isi.edu
[welcome message]
trainXX@viz-login:~$
You will need to obtain Grid Credentials to run the workflows on the Grid.
You can generate your proxy using grid-proxy-init.
$ grid-proxy-init
Your identity: /O=Grid/OU=GlobusTest/OU=simpleCA-smarty.isi.edu/OU=isi.edu/CN=Tutorial User 01
Enter GRID pass phrase for this identity:
Creating proxy ......................................... Done
Your proxy is valid until: Mon Jan 28 22:38:42 2008
Check your proxy using grid-proxy-info.
$ grid-proxy-info
subject : /O=Grid/OU=GlobusTest/OU=simpleCA-smarty.isi.edu/OU=isi.edu/CN=Tutorial User 01/CN=378830928
issuer : /O=Grid/OU=GlobusTest/OU=simpleCA-smarty.isi.edu/OU=isi.edu/CN=Tutorial User 01
identity : /O=Grid/OU=GlobusTest/OU=simpleCA-smarty.isi.edu/OU=isi.edu/CN=Tutorial User 01
type : Proxy draft (pre-RFC) compliant impersonation proxy
strength : 512 bits
path : /tmp/x509up_u1045
timeleft : 11:58:23
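Before starting a long workflow it can help to confirm that the proxy will outlive it. The following is a minimal sketch, assuming the timeleft line has the H:MM:SS form shown in the output above; the helper name proxy_seconds_left is hypothetical and is not part of the Globus tools.

```shell
# Convert the "timeleft : H:MM:SS" line of grid-proxy-info output (read from
# stdin) into seconds. proxy_seconds_left is an illustrative helper name.
proxy_seconds_left() {
    awk '$1 == "timeleft" {
        n = split($3, t, ":")          # t[1]=hours, t[2]=minutes, t[3]=seconds
        if (n == 3) print t[1] * 3600 + t[2] * 60 + t[3]
    }'
}

# Demo on the sample line from this tutorial (no grid tools needed here).
echo "timeleft : 11:58:23" | proxy_seconds_left   # prints 43103
```

On the tutorial machine the same function could be fed real output with: grid-proxy-info | proxy_seconds_left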
In this chapter you will be introduced to planning and executing a workflow through Pegasus WMS locally. You will then plan and execute a larger Montage workflow on the GRID.
All the exercises in this Chapter will be run from the $HOME/tutorial/ directory. All the files that are required reside in this directory.
$ cd $HOME/tutorial
$
Files for the exercise are stored in subdirectories:
$ ls
config dags dax
You may also see some other files here.
An abstract DAG has been generated for the Montage application and written in XML format to dax/montage.dax. Open montage.dax in a file viewer:
$ cat dax/montage.dax
Inside the DAX, you should see three sections.
In this exercise you will insert entries into the Replica Catalog.
The replica catalog that we will use today is a simple file based catalog.
We also support and recommend GLOBUS RLS or a JDBC implementation for production runs.
A Replica Catalog maintains the mapping from logical file names (LFNs) to physical file names (PFNs) for the input files of your workflow. Pegasus queries it to determine the locations of the raw input data files required by the workflow. Additionally, all the materialized data is registered into the replica catalog for data reuse later on.
You can use the rc-client command to insert, query and delete entries in the replica catalog.
The input data to be used for your workflow resides in the /scratch/tutorial/inputdata/0.2degree directory. We are going to insert entries into the replica catalog that point to the files in this directory.
The instructors have provided:
Instructions:
$ cat config/rc.in
# file-based replica catalog: 2007-06-02T13:11:35.954-07:00
statfile_20070529_153243_22618.tbl gsiftp://viz-login.isi.edu/scratch/tutorial/inputdata/0.2degree/statfile.tbl pool="local"
2mass-atlas-990502s-j1440198.fits gsiftp://viz-login.isi.edu/scratch/tutorial/inputdata/0.2degree/2mass-atlas-990502s-j1440198.fits pool="local"
2mass-atlas-990502s-j1440186.fits gsiftp://viz-login.isi.edu/scratch/tutorial/inputdata/0.2degree/2mass-atlas-990502s-j1440186.fits pool="local"
2mass-atlas-990502s-j1430092.fits gsiftp://viz-login.isi.edu/scratch/tutorial/inputdata/0.2degree/2mass-atlas-990502s-j1430092.fits pool="local"
2mass-atlas-990502s-j1420198.fits gsiftp://viz-login.isi.edu/scratch/tutorial/inputdata/0.2degree/2mass-atlas-990502s-j1420198.fits pool="local"
2mass-atlas-990502s-j1420186.fits gsiftp://viz-login.isi.edu/scratch/tutorial/inputdata/0.2degree/2mass-atlas-990502s-j1420186.fits pool="local"
cimages_20070529_153243_22618.tbl gsiftp://viz-login.isi.edu/scratch/tutorial/inputdata/0.2degree/cimages.tbl pool="local"
pimages_20070529_153243_22618.tbl gsiftp://viz-login.isi.edu/scratch/tutorial/inputdata/0.2degree/pimages.tbl pool="local"
region_20070529_153243_22618.hdr gsiftp://viz-login.isi.edu/scratch/tutorial/inputdata/0.2degree/region.hdr pool="local"
2mass-atlas-990502s-j1430080.fits gsiftp://viz-login.isi.edu/scratch/tutorial/inputdata/0.2degree/2mass-atlas-990502s-j1430080.fits pool="local"
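For your own data you would have to write such a file yourself. The sketch below prints one entry per file in a directory, in the same three-column layout as rc.in (LFN, PFN, pool attribute). It is only an illustration: make_rc_entries is a hypothetical helper, it simply uses each file name as the LFN (the tutorial file gives some files unique LFNs instead), and the URL prefix and site handle are arguments you supply.

```shell
# Print one replica-catalog line per regular file in a directory, in the
# "LFN PFN pool=..." layout of rc.in. make_rc_entries, the URL prefix and
# the site handle are illustrative arguments, not Pegasus options.
make_rc_entries() {
    dir=$1; url_prefix=$2; site=$3
    for f in "$dir"/*; do
        [ -f "$f" ] || continue                  # skip subdirectories
        name=$(basename "$f")
        printf '%s %s/%s pool="%s"\n' "$name" "$url_prefix" "$name" "$site"
    done
}
```

For example: make_rc_entries /scratch/tutorial/inputdata/0.2degree gsiftp://viz-login.isi.edu/scratch/tutorial/inputdata/0.2degree local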
Run rc-client to populate the replica catalog with this data. Since each of you has unique LFNs to register, all 10 entries should be registered successfully.
$ rc-client -Dpegasus.user.properties=config/properties --insert config/rc.in
#Successfully worked on : 10 lines
#Worked on total number of : 11 lines.
$ rc-client -Dpegasus.user.properties=config/properties lookup pimages_20070529_153243_22618.tbl
pimages_20070529_153243_22618.tbl gsiftp://viz-login.isi.edu/scratch/tutorial/inputdata/0.2degree/pimages.tbl pool="local"
Congratulations! You have set up the replica catalog correctly. This is the catalog you will tinker with most while running Pegasus.
In this exercise you will set up your Site Catalog and the Transformation Catalog.
The transformation catalog maintains information about where the application code resides on the grid. In our case, it contains the locations where the Montage code is installed on the various grid sites.
The site catalog contains information about the layout of the grid where you want to run your workflows. For each site it maintains information such as work directories, the jobmanagers and GridFTP servers to use, and other site-wide settings such as environment variables to be set.
Use the sc-client command to generate a site catalog from a hand-written sites.txt file.
$ sc-client -f config/sites.txt -o config/sites.xml
2007.06.02 17:06:12.215 PDT: [INFO] Reading config/sites.txt
2007.06.02 17:06:11.262 PDT: [INFO] Reading config/sites.txt (completed)
2007.06.02 17:06:11.276 PDT: [INFO] Written xml output to file : config/sites.xml
The instructors have provided:
$ cat config/tc.data
local bin/mDiff gsiftp://sukhna.isi.edu/usr/sukhna/work/montage/software/default/bin/mDiff STATIC_BINARY INTEL32::LINUX ENV::MONTAGE_HOME="."
local bin/mDiff gsiftp://viz-login.isi.edu/nfs/software/montage/montage-3.0_beta33-ia64/bin/mDiff STATIC_BINARY INTEL64::LINUX ENV::MONTAGE_HOME="."
local bin/mFitplane gsiftp://sukhna.isi.edu/usr/sukhna/work/montage/software/default/bin/mFitplane STATIC_BINARY INTEL32::LINUX NULL
local bin/mFitplane gsiftp://viz-login.isi.edu/nfs/software/montage/montage-3.0_beta33-ia64/bin/mFitplane STATIC_BINARY INTEL64::LINUX NULL
local mAdd:3.0 gsiftp://sukhna.isi.edu/usr/sukhna/work/montage/software/default/bin/mAdd STATIC_BINARY INTEL32::LINUX NULL
local mAdd:3.0 gsiftp://viz-login.isi.edu/nfs/software/montage/montage-3.0_beta33-ia64/bin/mAdd STATIC_BINARY INTEL64::LINUX NULL
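Each tc.data entry is a whitespace-separated record: site, logical transformation name, physical location, type, architecture and profiles. As a rough sketch of how such a file can be queried from the shell (tc_lookup is a hypothetical helper, not a Pegasus command, and it assumes the six-column layout of the listing above):

```shell
# List the site and physical location registered for a logical transformation
# name, optionally restricted to one architecture such as INTEL64::LINUX.
# Assumed columns: site, logical name, physical name, type, architecture,
# profiles. tc_lookup is an illustrative helper name.
tc_lookup() {
    awk -v name="$2" -v arch="$3" '
        /^#/ || NF < 5 { next }                 # skip comments and short lines
        $2 == name && (arch == "" || $5 == arch) { print $1, $3 }
    ' "$1"
}
```

For example, tc_lookup config/tc.data bin/mDiff INTEL64::LINUX would print only the viz-login entry for mDiff, while omitting the third argument lists every architecture.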
Open the properties file and check a few properties.
$ cat config/properties
## SELECT THE REPLICAT CATALOG MODE AND URL
pegasus.catalog.replica = SimpleFile
pegasus.catalog.replica.file = ${user.home}/tutorial/config/rc.data
#pegasus.catalog.replica.url=rlsn://smarty.isi.edu
## SELECT THE SITE CATALOG MODE AND FILE
pegasus.catalog.site = XML
pegasus.catalog.site.file = ${user.home}/tutorial/config/sites.xml
## SELECT THE TRANSFORMATION CATALOG MODE AND FILE
pegasus.catalog.transformation = File
pegasus.catalog.transformation.file = ${user.home}/tutorial/config/tc.data
## SET UP THE WORK AND INVOCATION DATABASE
pegasus.catalog.work = Database
pegasus.catalog.provenance = InvocationSchema
## Database related properties
pegasus.catalog.*.db.driver = MySQL
pegasus.catalog.*.db.url = jdbc:mysql://smarty.isi.edu/tg2007
pegasus.catalog.*.db.user = tg2007user
pegasus.catalog.*.db.password = Teragrid2007
## USE DAGMAN RETRY FEATURE FOR FAILURES
pegasus.dagman.retry=2
## STAGE ALL OUR EXECUTABLES
pegasus.catalog.transformation.mapper = Staged
## CHECK JOB EXIT CODES FOR FAILURE
pegasus.exitcode.scope=all
## OPTIMZE DATA & EXECUTABLE TRANSFERS
pegasus.transfer.refiner=Bundle
#STAGE DATA AND EXECUTABLES USING GRIDFTP 3rd PARTY MODE
pegasus.transfer.*.thirdparty.sites=*
## WORK AND STORAGE DIR
## CHANGE THESE TO YOUR TERAGRID USERNAME
pegasus.dir.storage = xxxxx/storage
pegasus.dir.exec = xxxxx/exec
Edit the properties pegasus.dir.storage and pegasus.dir.exec to specify relative paths for your workflow execution and data storage directory. Change the xxxxx value to your @trainXX@ value.
$ vim config/properties
[...]
$ cat config/properties
pegasus.dir.storage = @trainXX@/storage
pegasus.dir.exec = @trainXX@/exec
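If you prefer not to edit the file by hand, the same change can be scripted. This is a minimal sketch, assuming the two properties still carry the xxxxx placeholder exactly as in the provided file; set_pegasus_dirs is a hypothetical helper name, and user names containing the characters | or & would need extra escaping.

```shell
# Replace the xxxxx placeholder in the two directory properties with a user
# name. set_pegasus_dirs is an illustrative helper; it rewrites the file via a
# temporary copy because in-place sed flags differ between platforms.
set_pegasus_dirs() {
    props=$1; user=$2
    tmp=$(mktemp) || return 1
    if sed -e "s|^\(pegasus\.dir\.storage *= *\)xxxxx/|\1$user/|" \
           -e "s|^\(pegasus\.dir\.exec *= *\)xxxxx/|\1$user/|" \
           "$props" > "$tmp"; then
        cat "$tmp" > "$props"                    # keep the original file's permissions
    fi
    rm -f "$tmp"
}
```

For example: set_pegasus_dirs config/properties @trainXX@ (with @trainXX@ replaced by your username as usual).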
You can look at these files to get an idea of what they look like. For now, though, we will move ahead and plan your workflow through Pegasus. We need to get running on the grid fast :). Time is short!
In production mode the sc-client interfaces with Globus MDS to retrieve the information about various sites.
Also the client pegasus-get-sites can be used to generate a site catalog and transformation catalog for the Open Science Grid.
pegasus-plan to generate executable workflow (condor submit files) and pegasus-run to submit the workflow locally
In this exercise we are going to run pegasus-plan to generate a concrete workflow from the abstract workflow (diamond.dax). The generated concrete workflow consists of Condor submit files, which are submitted locally using pegasus-run.
The instructors have provided:
You will need to write some things yourself, by following the instructions below:
Instructions:
$ pegasus-plan -Dpegasus.user.properties=`pwd`/config/properties --dax `pwd`/dax/diamond.dax --dir `pwd`/dags \
    -s local --nocleanup
The above command says that we need to plan the diamond DAX locally.
The output data needs to be transferred back to the local host. The Condor submit files are to be generated in a directory structure whose base is dags.
We are also requesting that no cleanup jobs be added, as we require the intermediate data on the remote host.
2008.01.28 15:00:49.536 PST: [INFO] Parsing the DAX
2008.01.28 15:00:50.063 PST: [INFO] Parsing the DAX (completed)
2008.01.28 15:00:50.174 PST: [INFO] Parsing the site catalog
2008.01.28 15:00:50.327 PST: [INFO] Parsing the site catalog (completed)
2008.01.28 15:00:50.394 PST: [INFO] Doing site selection
2008.01.28 15:00:50.436 PST: [INFO] Doing site selection (completed)
2008.01.28 15:00:50.437 PST: [INFO] Grafting transfer nodes in the workflow
2008.01.28 15:00:50.508 PST: [INFO] Grafting transfer nodes in the workflow (completed)
2008.01.28 15:00:50.523 PST: [INFO] Grafting the remote workdirectory creation jobs in the workflow
2008.01.28 15:00:50.537 PST: [INFO] Grafting the remote workdirectory creation jobs in the workflow (completed)
2008.01.28 15:00:50.538 PST: [INFO] Generating the cleanup workflow
2008.01.28 15:00:50.542 PST: [INFO] Generating the cleanup workflow (completed)
2008.01.28 15:00:50.563 PST: [INFO] Generating codes for the concrete workflow
2008.01.28 15:00:50.684 PST: [INFO] Generating codes for the concrete workflow (completed)
2008.01.28 15:00:50.684 PST: [INFO] Generating code for the cleanup workflow
2008.01.28 15:00:50.718 PST: [INFO] Generating code for the cleanup workflow (completed)
2008.01.28 15:00:51.087 PST: [INFO]
I have concretized your abstract workflow. The workflow has been entered
into the workflow database with a state of "planned". The next step is
to start or execute your workflow. The invocation required is
pegasus-run -Dpegasus.user.properties=/nfs/home/train01/tutorial/dags/train01/pegasus/diamond/run0001/pegasus.7543.properties \
/nfs/home/train01/tutorial/dags/train01/pegasus/diamond/run0001
$ pegasus-run -Dpegasus.user.properties=/nfs/home/train01/tutorial/dags/train01/pegasus/diamond/run0001/pegasus.7543.properties \
    /nfs/home/train01/tutorial/dags/train01/pegasus/diamond/run0001
Checking all your submit files for log file names.
This might take a while...
Done.
-----------------------------------------------------------------------
File for submitting this DAG to Condor : diamond-0.dag.condor.sub
Log of DAGMan debugging messages : diamond-0.dag.dagman.out
Log of Condor library output : diamond-0.dag.lib.out
Log of Condor library error messages : diamond-0.dag.lib.err
Log of the life of condor_dagman itself : diamond-0.dag.dagman.log
Condor Log file for all jobs of this DAG : /tmp/diamond-07544.log
-no_submit given, not submitting DAG to Condor. You can do this with:
"condor_submit diamond-0.dag.condor.sub"
-----------------------------------------------------------------------
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 20068.
I have started your workflow, committed it to DAGMan, and updated its
state in the work database. A separate daemon was started to collect
information about the progress of the workflow. The job state will soon
be visible. Your workflow runs in base directory.
cd /nfs/home/train01/tutorial/dags/train01/pegasus/diamond/run0001
*** To monitor the workflow you can run ***
pegasus-status -w diamond-0 -t 20080128T150049-0800
or
pegasus-status /nfs/home/train01/tutorial/dags/train01/pegasus/diamond/run0001
*** To remove your workflow run ***
pegasus-remove -d 20068.0
or
pegasus-remove /nfs/home/train01/tutorial/dags/train01/pegasus/diamond/run0001
In this exercise we will learn about the DAG file format and some of the log files generated when the DAG runs.
$ cat dags/train02/pegasus/diamond/run0001/diamond-0.dag
######################################################################
# PEGASUS GENERATED SUBMIT FILE
# DAG diamond
# Index = 0, Count = 1
######################################################################
JOB generate_ID000001 generate_ID000001.sub
RETRY generate_ID000001 2
JOB findrange_ID000002 findrange_ID000002.sub
RETRY findrange_ID000002 2
JOB findrange_ID000003 findrange_ID000003.sub
RETRY findrange_ID000003 2
JOB analyze_ID000004 analyze_ID000004.sub
RETRY analyze_ID000004 2
JOB diamond_0_pegasus_concat diamond_0_pegasus_concat.sub
JOB diamond_0_local_cdir diamond_0_local_cdir.sub
SCRIPT POST diamond_0_local_cdir /nfs/software/pegasus/default/bin/exitpost -Dpegasus.user.properties=/nfs/home/train02/tutorial/dags/train02/pegasus/diamond/run0001/pegasus.31433.properties /nfs/home/train02/tutorial/dags/train02/pegasus/diamond/run0001/diamond_0_local_cdir.out
RETRY diamond_0_local_cdir 2
PARENT generate_ID000001 CHILD findrange_ID000002
PARENT generate_ID000001 CHILD findrange_ID000003
PARENT findrange_ID000002 CHILD analyze_ID000004
PARENT findrange_ID000003 CHILD analyze_ID000004
PARENT diamond_0_pegasus_concat CHILD generate_ID000001
PARENT diamond_0_local_cdir CHILD diamond_0_pegasus_concat
######################################################################
# End of DAG
######################################################################
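The PARENT ... CHILD lines are what give the workflow its shape. To see the dependencies at a glance, a short awk filter can list them as edges. This is only a sketch; dag_edges is a hypothetical helper name, and it handles just the plain PARENT/CHILD syntax that appears in this tutorial's DAG files.

```shell
# Print one "parent -> child" line for every dependency in a DAGMan file.
# A single PARENT line may name several parents and several children.
# dag_edges is an illustrative helper name.
dag_edges() {
    awk '$1 == "PARENT" {
        np = 0; nc = 0; in_children = 0
        for (i = 2; i <= NF; i++) {
            if ($i == "CHILD") { in_children = 1; continue }
            if (in_children) c[++nc] = $i; else p[++np] = $i
        }
        for (i = 1; i <= np; i++)
            for (j = 1; j <= nc; j++)
                print p[i], "->", c[j]
    }' "$1"
}
```

Run against the diamond DAG, this would print six edges, starting with generate_ID000001 -> findrange_ID000002.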
$ cat dags/train02/pegasus/diamond/run0001/diamond-0.dag.dagman.out
1/29 10:32:14 ******************************************************
1/29 10:32:14 ** condor_scheduniv_exec.20133.0 (CONDOR_DAGMAN) STARTING UP
1/29 10:32:14 ** /nfs/software/condor/6.9.4/bin/condor_dagman
1/29 10:32:14 ** $CondorVersion: 6.9.4 Aug 30 2007 $
1/29 10:32:14 ** $CondorPlatform: I386-LINUX_RHEL3 $
1/29 10:32:14 ** PID = 3202
1/29 10:32:14 ** Log last touched time unavailable (No such file or directory)
1/29 10:32:14 ******************************************************
[....]
1/29 10:32:27 Submitting Condor Node diamond_0_local_cdir job(s)...
1/29 10:32:27 submitting: condor_submit -a dag_node_name' '=' 'diamond_0_local_cdir -a +DAGManJobId' '=' '20133 -a DAGManJobId' '=' '20133 -a submit_event_notes' '=' 'DAG' 'Node:' 'diamond_0_local_cdir -a +DAGParentNodeNames' '=' '"" diamond_0_local_cdir.sub
1/29 10:32:27 From submit: Submitting job(s).
1/29 10:32:27 From submit: Logging submit event(s).
1/29 10:32:27 From submit: 1 job(s) submitted to cluster 20134.
1/29 10:32:27 assigned Condor ID (20134.0)
1/29 10:32:27 Just submitted 1 job this cycle...
1/29 10:32:27 Event: ULOG_SUBMIT for Condor Node diamond_0_local_cdir (20134.0)
1/29 10:32:27 Number of idle job procs: 1
1/29 10:32:27 Of 6 nodes total:
1/29 10:32:27   Done     Pre   Queued    Post   Ready   Un-Ready   Failed
1/29 10:32:27    ===     ===      ===     ===     ===        ===      ===
1/29 10:32:27      0       0        1       0       0          5        0
[....]
1/29 10:33:24   Done     Pre   Queued    Post   Ready   Un-Ready   Failed
1/29 10:33:24    ===     ===      ===     ===     ===        ===      ===
1/29 10:33:24      6       0        0       0       0          0        0
1/29 10:33:24 All jobs Completed!
1/29 10:33:24 Note: 0 total job deferrals because of -MaxJobs limit (0)
1/29 10:33:24 Note: 0 total job deferrals because of -MaxIdle limit (0)
1/29 10:33:24 Note: 0 total PRE script deferrals because of -MaxPre limit (20)
1/29 10:33:24 Note: 0 total POST script deferrals because of -MaxPost limit (20)
1/29 10:33:24 **** condor_scheduniv_exec.20133.0 (condor_DAGMAN) EXITING WITH STATUS 0
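The last line of dagman.out records how DAGMan exited, which is often the first thing worth checking. A small sketch for pulling that value out, assuming the EXITING WITH STATUS wording shown above (dagman_exit_status is a hypothetical helper name):

```shell
# Print the exit status recorded on the last "EXITING WITH STATUS" line of a
# dagman.out file; prints nothing if DAGMan has not exited yet.
# dagman_exit_status is an illustrative helper name.
dagman_exit_status() {
    awk '/EXITING WITH STATUS/ { status = $NF; found = 1 }
         END { if (found) print status }' "$1"
}
```

For example: dagman_exit_status dags/train02/pegasus/diamond/run0001/diamond-0.dag.dagman.out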
pegasus-plan to generate executable workflow (condor submit files) and pegasus-run to submit the workflow to a grid resource
In this exercise we are going to run pegasus-plan to generate a concrete workflow from the abstract workflow (montage.dax). The generated concrete workflow consists of Condor submit files, which are submitted to remote grid resources using pegasus-run.
The instructors have provided:
You will need to write some things yourself, by following the instructions below:
Instructions:
$ pegasus-plan -Dpegasus.user.properties=`pwd`/config/properties --dir `pwd`/dags --sites isi_viz --output local \
    --nocleanup --dax `pwd`/dax/montage.dax
The above command says that we need to plan the montage dax on the isi_viz site.
The output data needs to be transferred back to the local host. The condor submit files are to be generated in a directory structure whose base is dags.
We also are requesting that no cleanup jobs be added as we require the intermediate data on the remote host.
2008.01.28 11:45:26.150 PST: [INFO] Parsing the DAX
2008.01.28 11:45:26.860 PST: [INFO] Parsing the DAX (completed)
2008.01.28 11:45:26.922 PST: [INFO] Parsing the site catalog
2008.01.28 11:45:27.161 PST: [INFO] Parsing the site catalog (completed)
2008.01.28 11:45:27.288 PST: [INFO] Doing site selection
2008.01.28 11:45:27.359 PST: [INFO] Doing site selection (completed)
2008.01.28 11:45:27.360 PST: [INFO] Grafting transfer nodes in the workflow
2008.01.28 11:45:27.611 PST: [INFO] Grafting transfer nodes in the workflow (completed)
2008.01.28 11:45:27.619 PST: [INFO] Grafting the remote workdirectory creation jobs in the workflow
2008.01.28 11:45:27.639 PST: [INFO] Grafting the remote workdirectory creation jobs in the workflow (completed)
2008.01.28 11:45:27.639 PST: [INFO] Generating the cleanup workflow
2008.01.28 11:45:27.643 PST: [INFO] Generating the cleanup workflow (completed)
2008.01.28 11:45:27.662 PST: [INFO] Generating codes for the concrete workflow
2008.01.28 11:45:28.442 PST: [INFO] Generating codes for the concrete workflow (completed)
2008.01.28 11:45:28.443 PST: [INFO] Generating code for the cleanup workflow
2008.01.28 11:45:28.641 PST: [INFO] Generating code for the cleanup workflow (completed)
2008.01.28 11:45:30.407 PST: [INFO]
I have concretized your abstract workflow. The workflow has been entered
into the workflow database with a state of "planned". The next step is
to start or execute your workflow. The invocation required is
pegasus-run -Dpegasus.user.properties=/nfs/home/train01/tutorial/dags/train01/pegasus/montage/run0001/pegasus.51773.properties \
/nfs/home/train01/tutorial/dags/train01/pegasus/montage/run0001
2008.01.28 11:45:30.408 PST: [INFO] Time taken to execute is 4.571 seconds
$ pegasus-run -Dpegasus.user.properties=/nfs/home/train01/tutorial/dags/train01/pegasus/montage/run0001/pegasus.51773.properties \
    /nfs/home/train01/tutorial/dags/train01/pegasus/montage/run0001
The above command submits the workflow to Condor DAGMan/Condor-G. After submitting, it starts a monitoring daemon, tailstatd, that parses the Condor log files to update the status of the jobs and push it to a work database.
Checking all your submit files for log file names.
This might take a while...
Done.
-----------------------------------------------------------------------
File for submitting this DAG to Condor : montage-0.dag.condor.sub
Log of DAGMan debugging messages : montage-0.dag.dagman.out
Log of Condor library output : montage-0.dag.lib.out
Log of Condor library error messages : montage-0.dag.lib.err
Log of the life of condor_dagman itself : montage-0.dag.dagman.log
Condor Log file for all jobs of this DAG : /tmp/montage-051774.log
-no_submit given, not submitting DAG to Condor. You can do this with:
"condor_submit montage-0.dag.condor.sub"
-----------------------------------------------------------------------
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 19968.
I have started your workflow, committed it to DAGMan, and updated its
state in the work database. A separate daemon was started to collect
information about the progress of the workflow. The job state will soon
be visible. Your workflow runs in base directory.
cd /nfs/home/train01/tutorial/dags/train01/pegasus/montage/run0001
*** To monitor the workflow you can run ***
pegasus-status -w montage-0 -t 20080128T114525-0800
or
pegasus-status /nfs/home/train01/tutorial/dags/train01/pegasus/montage/run0001
*** To remove your workflow run ***
pegasus-remove -d 19968.0
or
pegasus-remove /nfs/home/train01/tutorial/dags/train01/pegasus/montage/run0001
In this exercise we are going to list ways to track your workflow, and give some debugging hints when something goes wrong.
We will change into the directory, that was mentioned by the pegasus-run command.
$ cd /nfs/home/@trainXX@/tutorial/dags/@trainXX@/pegasus/montage/run0001
In this directory you will see a large number of files. That should not scare you. Unless things go wrong, you need to look at only a few files to track the progress of the workflow.
$ pegasus-status `pwd`
-- Submitter: viz-login.isi.edu : <128.9.72.178:46426> : viz-login.isi.edu
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD
19982.0   train01         1/28 13:58   0+00:03:20 R  0   9.8  condor_dagman -f -
19986.0    |-chmod_mProj  1/28 14:02   0+00:00:00 I  0   9.8  kickstart -n pegas
19987.0    |-chmod_mDiff  1/28 14:02   0+00:00:00 I  0   9.8  kickstart -n pegas
19988.0    |-chmod_mDiff  1/28 14:02   0+00:00:00 I  0   9.8  kickstart -n pegas
19989.0    |-chmod_mDiff  1/28 14:02   0+00:00:00 I  0   9.8  kickstart -n pegas
19990.0    |-chmod_mConc  1/28 14:02   0+00:00:00 I  0   9.8  kickstart -n pegas
The above output shows several jobs queued under the main DAGMan process. Keep watching this output to track whether the workflow is still running. If you do not see any of your jobs in the output for some time (say 30 seconds), the workflow has finished. We need to wait because there may be a delay between one job finishing successfully and Condor DAGMan releasing the next job into the queue.
If the output of pegasus-status is empty, then your workflow has either
- successfully completed, or
- stopped midway due to a non-recoverable error.
$ more jobstate.log
1201557528 INTERNAL *** TAILSTATD_STARTED ***
1201557528 INTERNAL *** DAGMAN_STARTED ***
1201557528 chmod_mBackground_ID000018_0 UN_READY - - -
1201557528 chmod_mJPEG_ID000027_0 UN_READY - - -
1201557528 mAdd_ID000025 UN_READY - - -
1201557528 mImgtbl_ID000024 UN_READY - - -
1201557528 chmod_mAdd_ID000025_0 UN_READY - - -
1201557528 montage_0_isi_viz_cdir UN_READY - - -
1201557528 mDiffFit_ID000013 UN_READY - - -
[..]
At the start of jobstate.log, when the workflow has just begun running, you will see many entries with status UN_READY. This means that DAGMan has just parsed the .dag file and has not yet started working on any job. Initially all the jobs in the workflow are listed as UN_READY.
After some time you will see entries in jobstate.log showing that jobs are being submitted, executed, and so on:
1201557747 chmod_mJPEG_ID000027_0 EXECUTE 19996.0 isi_viz -
1201557747 chmod_mJPEG_ID000027_0 GLOBUS_SUBMIT 19996.0 isi_viz -
1201557747 chmod_mJPEG_ID000027_0 GRID_SUBMIT 19996.0 isi_viz -
1201557812 chmod_mConcatFit_ID000016_0 JOB_TERMINATED 19990.0 isi_viz -
1201557812 chmod_mConcatFit_ID000016_0 POST_SCRIPT_STARTED - isi_viz -
1201557817 chmod_mConcatFit_ID000016_0 POST_SCRIPT_TERMINATED 19990.0 isi_viz -
1201557817 chmod_mConcatFit_ID000016_0 POST_SCRIPT_SUCCESS - isi_viz -
The above shows jobs being submitted and then executed on the grid. In addition, each entry lists the grid site on which the job runs: isi_viz in this excerpt, whereas local would denote your submit machine. The various states a job passes through, from submission to execution to postprocessing, are in UPPERCASE.
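Because jobstate.log appends a new line for every state change, the current state of a job is simply its last entry. A rough sketch of a progress summary along those lines, assuming the column order seen in the excerpts above (timestamp, job name, state); latest_job_states is a hypothetical helper name:

```shell
# Print the most recent state recorded for each job in a jobstate.log file.
# Assumed columns: timestamp, job name, state, then further details.
# INTERNAL marker lines are skipped. latest_job_states is an illustrative name.
latest_job_states() {
    awk '$2 != "INTERNAL" && NF >= 3 { state[$2] = $3 }
         END { for (job in state) print job, state[job] }' "$1" | sort
}
```

For example, latest_job_states jobstate.log | grep -c UN_READY gives a quick count of jobs DAGMan has not yet touched.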
Let us again look at the jobstate.log. This time we need to look at the last few lines of jobstate.log
$ tail jobstate.log
1201559232 mJPEG_ID000027 JOB_TERMINATED 20023.0 isi_viz -
1201559232 mJPEG_ID000027 POST_SCRIPT_STARTED - isi_viz -
1201559238 mJPEG_ID000027 POST_SCRIPT_TERMINATED 20023.0 isi_viz -
1201559238 mJPEG_ID000027 POST_SCRIPT_SUCCESS - isi_viz -
1201559238 INTERNAL *** DAGMAN_FINISHED ***
1201559239 INTERNAL *** TAILSTATD_FINISHED 0 ***
Looking at the last two lines, we see that DAGMan finished and tailstatd finished successfully with status 0. This means the workflow ran successfully. Congratulations, you ran your workflow on the grid successfully!
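The same check can be scripted by reading the status number from the TAILSTATD_FINISHED line. This is a minimal sketch that assumes the line layout shown in this tutorial; workflow_result is a hypothetical helper name.

```shell
# Report the outcome of a workflow from the TAILSTATD_FINISHED marker in
# jobstate.log: status 0 means success, anything else means failure.
# workflow_result is an illustrative helper name.
workflow_result() {
    awk '$2 == "INTERNAL" && $4 == "TAILSTATD_FINISHED" { code = $5; seen = 1 }
         END {
             if (!seen)          print "still running or incomplete"
             else if (code == 0) print "success"
             else                print "failed (status " code ")"
         }' "$1"
}
```

For example: workflow_result jobstate.log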
The workflow generates a single output file, montage.jpg, that resides at /scratch/@trainXX@/storage/montage.jpg, where @trainXX@ is your teragrid user id.
To view the image, you can copy montage.jpg to your viz-login webspace and view it in your web browser:
$ cp /scratch/@trainXX@/storage/montage.jpg ~/public_html
$
Point your web browser to: http://viz-login.isi.edu/~@trainXX@/montage.jpg where @trainXX@ is your viz-login userid
Let us look at jobstate.log again, this time for a workflow that did not complete successfully. Again we need to look at the last few lines of jobstate.log.
$ tail jobstate.log
1180840228 inter_tx_mDiffFit_ID000011_0 POST_SCRIPT_STARTED - local -
1180840233 inter_tx_mDiffFit_ID000011_0 POST_SCRIPT_TERMINATED 2786.0 local -
1180840233 inter_tx_mDiffFit_ID000011_0 POST_SCRIPT_FAILURE 1 local -
1180840233 inter_tx_mDiffFit_ID000007_0 JOB_TERMINATED 2787.0 local -
1180840233 inter_tx_mDiffFit_ID000007_0 POST_SCRIPT_STARTED - local -
1180840238 inter_tx_mDiffFit_ID000007_0 POST_SCRIPT_TERMINATED 2787.0 local -
1180840238 inter_tx_mDiffFit_ID000007_0 POST_SCRIPT_FAILURE 1 local -
1180840373 INTERNAL *** DAGMAN_FINISHED ***
1180840373 INTERNAL *** TAILSTATD_FINISHED 1 ***
Looking at the last two lines, we see that DAGMan finished and tailstatd finished unsuccessfully with status 1.
We can easily determine which jobs failed: they are the ones with a POST_SCRIPT_FAILURE entry, such as inter_tx_mDiffFit_ID000007_0 in this case.
To determine the reason for a failure we need to look at the job's kickstart output file, which is named $JOBNAME.out.NNN, where NNN ranges from 000 upward.
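When the log is long, the failed jobs can be pulled out mechanically. The sketch below lists every job that ever logged POST_SCRIPT_FAILURE, so it may include jobs that later succeeded on a retry; failed_jobs is a hypothetical helper name.

```shell
# List the names of jobs that logged a POST_SCRIPT_FAILURE in jobstate.log,
# one per line, without duplicates. failed_jobs is an illustrative helper name.
failed_jobs() {
    awk '$3 == "POST_SCRIPT_FAILURE" { print $2 }' "$1" | sort -u
}
```

The result can then be used to find the matching kickstart files, for example: for j in $(failed_jobs jobstate.log); do ls "$j".out.*; done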
$ pegasus-remove /nfs/home/@trainXX@/tutorial/dags/@trainXX@/pegasus/montage/run0001
Job 2788.0 marked for removal
Sometimes a workflow may have too many jobs whose execution time is a few seconds long. In such instances the overhead of scheduling each job on a grid is too large and the runtime of the entire workflow can be optimized by using pegasus clustering techniques. One such technique is to cluster jobs horizontally on the same level into one or more sequential jobs.
$ pegasus-plan -Dpegasus.user.properties=`pwd`/config/properties --dir `pwd`/dags --sites isi_viz --output local \
    --nocleanup --cluster horizontal --dax `pwd`/dax/montage.dax
[....]
All the jobs till now have been executed on the shared filesystem on the viz-cluster. The input data required by the Montage workflow was staged to a directory on the shared filesystem. All the jobs were then executed in that directory.
A recent addition to Pegasus (still in the testing phase) allows you to execute each job in a tmp directory on the worker node's filesystem. For this to happen, a Second Level Staging (SLS) step needs to occur that transfers the data from the directory on the shared filesystem to a directory on the local filesystem of the worker node.
Set the property pegasus.execute.*.filesystem.local to true in your properties file.
Repeat Exercise 2.6. Refer to slide 76 in the tutorial notes.
In this exercise we will learn how to throttle job submission from DAGMan.
$ cat dagman/throttling/throttling.dag
# DAG with lots of siblings to illustrate throttling.
# This only works with version 6.9.2 or later.
# CONFIG dagman/throttling/throttling.config
JOB Setup dagman/throttling/setup.submit
JOB Proc1 dagman/throttling/proc1.submit
PARENT Setup CHILD Proc1
JOB Proc2 dagman/throttling/proc2.submit
PARENT Setup CHILD Proc2
[....]
JOB Proc10 dagman/throttling/proc10.submit
PARENT Setup CHILD Proc10
JOB Cleanup dagman/throttling/cleanup.submit
PARENT Proc1 CHILD Cleanup
PARENT Proc2 CHILD Cleanup
PARENT Proc3 CHILD Cleanup
PARENT Proc4 CHILD Cleanup
PARENT Proc5 CHILD Cleanup
PARENT Proc6 CHILD Cleanup
PARENT Proc7 CHILD Cleanup
PARENT Proc8 CHILD Cleanup
PARENT Proc9 CHILD Cleanup
PARENT Proc10 CHILD Cleanup
$ cat dagman/throttling/throttling.config
DAGMAN_MAX_JOBS_SUBMITTED = 4
$ condor_submit_dag -f -maxjobs 4 dagman/throttling/throttling.dag
Checking all your submit files for log file names.
This might take a while...
Done.
-----------------------------------------------------------------------
File for submitting this DAG to Condor : dagman/throttling/throttling.dag.condor.sub
Log of DAGMan debugging messages : dagman/throttling/throttling.dag.dagman.out
Log of Condor library output : dagman/throttling/throttling.dag.lib.out
Log of Condor library error messages : dagman/throttling/throttling.dag.lib.err
Log of the life of condor_dagman itself : dagman/throttling/throttling.dag.dagman.log
Condor Log file for all jobs of this DAG : /nfs/home/train02/tutorial/dagman/throttling/job.log
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 20230.
-----------------------------------------------------------------------
$ condor_q -dag train02
-- Submitter: viz-login.isi.edu : <128.9.72.178:46426> : viz-login.isi.edu
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD
20230.0   train02         1/29 14:20   0+00:01:26 R  0   9.8  condor_dagman -f -
20237.0    |-Proc6        1/29 14:21   0+00:00:12 R  0   9.8  nodejob Processing
20238.0    |-Proc7        1/29 14:21   0+00:00:07 R  0   9.8  nodejob Processing
20239.0    |-Proc8        1/29 14:21   0+00:00:01 R  0   9.8  nodejob Processing
20240.0    |-Proc9        1/29 14:21   0+00:00:01 R  0   9.8  nodejob Processing
$ tail dagman/throttling/throttling.dag.dagman.out
1/29 16:40:07   Done     Pre   Queued    Post   Ready   Un-Ready   Failed
1/29 16:40:07    ===     ===      ===     ===     ===        ===      ===
1/29 16:40:07     12       0        0       0       0          0        0
1/29 16:40:07 Note: 41 total job deferrals because of -MaxJobs limit (4)
1/29 16:40:07 All jobs Completed!
1/29 16:40:07 Note: 41 total job deferrals because of -MaxJobs limit (4)
1/29 16:40:07 Note: 0 total job deferrals because of -MaxIdle limit (0)
1/29 16:40:07 Note: 0 total PRE script deferrals because of -MaxPre limit (0)
1/29 16:40:07 Note: 0 total POST script deferrals because of -MaxPost limit (0)
1/29 16:40:07 **** condor_scheduniv_exec.20263.0 (condor_DAGMAN) EXITING WITH STATUS 0
In this exercise we will learn about PRE and POST scripts in DAGMan.
$ cat dagman/scripts/scripts.dag
# DAG with PRE and POST scripts.
JOB Setup setup.submit
SCRIPT PRE Setup pre_script $JOB
SCRIPT POST Setup post_script $JOB $RETURN
JOB Proc1 proc1.submit
SCRIPT PRE Proc1 pre_script $JOB
SCRIPT POST Proc1 post_script $JOB $RETURN
JOB Proc2 proc2.submit
SCRIPT PRE Proc2 pre_script $JOB
SCRIPT POST Proc2 post_script $JOB $RETURN
JOB Cleanup cleanup.submit
SCRIPT PRE Cleanup pre_script $JOB
SCRIPT POST Cleanup post_script $JOB $RETURN
PARENT Setup CHILD Proc1 Proc2
PARENT Proc1 Proc2 CHILD Cleanup
$ condor_submit_dag -f -usedagdir dagman/scripts/scripts.dag
[....]
$ tail -f dagman/scripts/scripts.dag.dagman.out
[....]
1/29 16:44:24 Event: ULOG_JOB_TERMINATED for Condor Node Proc2 (20279.0)
1/29 16:44:24 Node Proc2 job proc (20279.0) failed with status 1.
1/29 16:44:24 Node Proc2 job completed
1/29 16:44:24 Running POST script of Node Proc2...
[....]
1/29 16:44:29 Event: ULOG_POST_SCRIPT_TERMINATED for Condor Node Proc2 (20279.0)
1/29 16:44:29 POST Script of Node Proc2 completed successfully.
[....]
1/29 16:45:05 **** condor_scheduniv_exec.20276.0 (condor_DAGMAN) EXITING WITH STATUS 0
In this exercise we will learn how the DAGMan VARS feature can be used to re-use a single submit file for many DAG nodes.
$ cat dagman/vars/vars.dag
# DAG with lots of similar nodes to illustrate VARS.
JOB Setup setup.submit
JOB Proc1 proc.submit
VARS Proc1 ARGS = "Proc1 Alpe_dHuez"
PARENT Setup CHILD Proc1
JOB Proc2 proc.submit
VARS Proc2 ARGS = "Proc2 Col_du_Glandon"
PARENT Setup CHILD Proc2
JOB Proc3 proc.submit
VARS Proc3 ARGS = "Proc3 Col_de_la_Madeleine"
PARENT Setup CHILD Proc3
JOB Proc4 proc.submit
VARS Proc4 ARGS = "Proc4 Col_de_la_Forclaz"
PARENT Setup CHILD Proc4
JOB Proc5 proc.submit
VARS Proc5 ARGS = "Proc5 Col_de_Romeyere"
PARENT Setup CHILD Proc5
[....]
$ condor_submit_dag -f -usedagdir dagman/vars/vars.dag
[....]
$ cat dagman/vars/proc.out.*
Sleeping for 20 seconds...
Arguments: condor_scheduniv_exec.20283.0 Processing node -sleep 20 Proc1 Alpe_dHuez
Sleeping for 20 seconds...
Arguments: condor_scheduniv_exec.20284.0 Processing node -sleep 20 Proc2 Col_du_Glandon
Sleeping for 20 seconds...
Arguments: condor_scheduniv_exec.20285.0 Processing node -sleep 20 Proc3 Col_de_la_Madeleine
[....]
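Because every Proc node in vars.dag follows the same three-line pattern (JOB, VARS, PARENT), a DAG like this lends itself to being generated rather than typed by hand. The following is a minimal sketch under that assumption; gen_vars_dag is a hypothetical helper and not part of Condor or the tutorial files.

```shell
#!/bin/sh
# Sketch: print a VARS-style DAG like vars.dag for a list of names.
# gen_vars_dag is a hypothetical helper, not a Condor tool.
gen_vars_dag() {
    echo "JOB Setup setup.submit"
    i=1
    for name in "$@"; do
        # Every node re-uses proc.submit; only the ARGS value differs.
        echo "JOB Proc$i proc.submit"
        echo "VARS Proc$i ARGS = \"Proc$i $name\""
        echo "PARENT Setup CHILD Proc$i"
        i=$((i + 1))
    done
}

gen_vars_dag Alpe_dHuez Col_du_Glandon Col_de_la_Madeleine
```

Redirecting the output to a file gives a DAG that could be passed to condor_submit_dag in the same way as vars.dag.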
In this exercise we will learn about DAG recovery mode and node retries.
$ cat dagman/recovery/recovery.dag
# DAG illustrating node retries.
JOB Setup setup.submit
SCRIPT PRE Setup pre_script $JOB
SCRIPT POST Setup post_script $JOB $RETURN
JOB Proc proc.submit
SCRIPT PRE Proc pre_script $JOB
SCRIPT POST Proc post_script $JOB $RETURN
RETRY Proc 2 UNLESS-EXIT 2
PARENT Setup CHILD Proc
$ condor_submit_dag -f -usedagdir dagman/recovery/recovery.dag
Checking all your submit files for log file names.
This might take a while...
Done.
-----------------------------------------------------------------------
File for submitting this DAG to Condor   : dagman/recovery/recovery.dag.condor.sub
Log of DAGMan debugging messages         : dagman/recovery/recovery.dag.dagman.out
Log of Condor library output             : dagman/recovery/recovery.dag.lib.out
Log of Condor library error messages     : dagman/recovery/recovery.dag.lib.err
Log of the life of condor_dagman itself  : dagman/recovery/recovery.dag.dagman.log
Condor Log file for all jobs of this DAG : /nfs/home/train02/test_tar/tutorial/dagman/recovery/job.log
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 20294.
-----------------------------------------------------------------------
$ condor_hold 20294
Cluster 20294 held.
$ condor_q -dag train02
-- Submitter: viz-login.isi.edu : <128.9.72.178:46426> : viz-login.isi.edu
 ID       OWNER/NODENAME   SUBMITTED    RUN_TIME   ST PRI SIZE CMD
20294.0   train02          1/29 17:15   0+00:00:13 H  0   9.8  condor_dagman -f -
20295.0    |-Setup         1/29 17:15   0+00:00:07 R  0   9.8  nodejob Setup node
2 jobs; 0 idle, 1 running, 1 held
$ condor_release 20294
Cluster 20294 released.
$ cat dagman/recovery/recovery.dag.dagman.out
[....]
1/29 17:19:53 Bootstrapping...
1/29 17:19:53 Number of pre-completed nodes: 0
1/29 17:19:53 Running in RECOVERY mode...
1/29 17:19:53 Event: ULOG_SUBMIT for Condor Node Setup (20295.0)
1/29 17:19:53 Number of idle job procs: 1
1/29 17:19:53 Event: ULOG_EXECUTE for Condor Node Setup (20295.0)
1/29 17:19:53 Number of idle job procs: 0
1/29 17:19:53 Event: ULOG_JOB_TERMINATED for Condor Node Setup (20295.0)
1/29 17:19:53 Node Setup job proc (20295.0) completed successfully.
1/29 17:19:53 Node Setup job completed
1/29 17:19:53 Number of idle job procs: 0
1/29 17:19:53 ------------------------------
1/29 17:19:53 Condor Recovery Complete
1/29 17:19:53 ------------------------------
[....]
$ cat dagman/recovery/recovery.dag.dagman.out
[....]
1/29 17:20:30 Event: ULOG_POST_SCRIPT_TERMINATED for Condor Node Proc (20296.0)
1/29 17:20:30 POST Script of Node Proc failed with status 1
1/29 17:20:30 Retrying node Proc (retry #1 of 2)...
[....]
1/29 17:21:05 Event: ULOG_POST_SCRIPT_TERMINATED for Condor Node Proc (20297.0)
1/29 17:21:05 POST Script of Node Proc failed with status 1
1/29 17:21:05 Retrying node Proc (retry #2 of 2)...
[....]
1/29 17:21:40 ERROR: the following job(s) failed:
1/29 17:21:40 ---------------------- Job ----------------------
1/29 17:21:40 Node Name: Proc
1/29 17:21:40 NodeID: 1
1/29 17:21:40 Node Status: STATUS_ERROR
1/29 17:21:40 Node return val: 1
1/29 17:21:40 Error: Job exited with status 1 and POST Script failed with status 1 (after 2 node retries)
1/29 17:21:40 Job Submit File: proc.submit
1/29 17:21:40 PRE Script: pre_script $JOB
1/29 17:21:40 POST Script: post_script $JOB $RETURN
1/29 17:21:40 Retry: 2
1/29 17:21:40 Condor Job ID: (20298)
1/29 17:21:40 Q_PARENTS: 0,
1/29 17:21:40 Q_WAITING:
1/29 17:21:40 Q_CHILDREN:
1/29 17:21:40 ---------------------------------------
[....]
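The line RETRY Proc 2 UNLESS-EXIT 2 in recovery.dag means that a failed Proc node is retried up to 2 times (3 attempts in total), unless the node exits with value 2, in which case it is not retried at all. The log above shows the first case: two retries, then a final failure. The following sketch imitates that rule in plain shell so that you can experiment with it; run_with_retries, always_fail_1 and always_fail_2 are hypothetical names, and none of this is DAGMan's own code.

```shell
#!/bin/sh
# Sketch of DAGMan-style node retries; not part of Condor.
# Usage: run_with_retries MAX_RETRIES UNLESS_EXIT COMMAND [ARGS...]
ATTEMPTS=0   # how many times the last call ran its command

run_with_retries() {
    max_retries=$1
    unless_exit=$2
    shift 2
    ATTEMPTS=0
    retry=0
    while :; do
        ATTEMPTS=$((ATTEMPTS + 1))
        if "$@"; then status=0; else status=$?; fi
        # Success: the node is done.
        if [ "$status" -eq 0 ]; then return 0; fi
        # The UNLESS-EXIT value stops retrying at once.
        if [ "$status" -eq "$unless_exit" ]; then return "$status"; fi
        # No retries left: report the last failure.
        if [ "$retry" -ge "$max_retries" ]; then return "$status"; fi
        retry=$((retry + 1))
        echo "Retrying node (retry #$retry of $max_retries)..."
    done
}

# Two stand-in "nodes" that always fail, with status 1 and 2.
always_fail_1() { return 1; }
always_fail_2() { return 2; }

rc=0
run_with_retries 2 2 always_fail_1 || rc=$?
echo "exit value 1: status=$rc attempts=$ATTEMPTS"

rc=0
run_with_retries 2 2 always_fail_2 || rc=$?
echo "exit value 2: status=$rc attempts=$ATTEMPTS"
```

The first call runs its command three times before giving up, matching the two retries in the log; the second call stops after a single attempt because its exit value equals the UNLESS-EXIT value.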
The End