Student notes for Pegasus tutorial
Introduction
These are the student notes for the Pegasus tutorial. They are designed to be used in conjunction with the instructor's presentation and support.
You will see two styles of machine text here:
Text like this is input that you should type.
Text like this is the output you should get.
For example:
$ date
Mon June 1 11:54:58 BST 2007
You will need to log into the tutorial machine, using an ssh client and the login name and password supplied separately.
For the purpose of this tutorial, replace every instance of @viz-user@ with your viz-login username and @tg-user@ with your teragrid username. Use your teragrid password with your teragrid username, and your viz password with your viz-login username.
On Linux or Mac OS X, open a terminal window and type:
$ ssh @viz-user@@viz-login.isi.edu
[welcome message]
viz-username@viz-login:~$
On Windows, PuTTY is recommended as an ssh client.
You will need to obtain Grid Credentials to run the workflows on Teragrid.
Teragrid provides a facility to obtain grid credentials using MyProxy.
$ myproxy-logon -s myproxy.teragrid.org -l @tg-user@
Enter MyProxy pass phrase:
A credential has been received for user xxxxx in /tmp/x509up_u1055
Check your proxy using grid-proxy-info.
$ grid-proxy-info
subject  : /C=US/O=National Center for Supercomputing Applications/CN=Training - trainxxxx REL
issuer   : /C=US/O=National Center for Supercomputing Applications/CN=Certification Authority
identity : /C=US/O=National Center for Supercomputing Applications/CN=Training - trainxxxx REL
type     : end entity credential
strength : 1024 bits
path     : /tmp/x509up_u1055
timeleft : 2:59:24
Chapter 2: Running on the GRID using Pegasus
In this chapter you will be introduced to planning and running a workflow through Pegasus on a cluster. You will take a previously generated Montage workflow and run it on the GRID.
All the exercises in this chapter will be run from the $HOME/tutorial/ directory. All the files that are required reside in this directory.
$ cd $HOME/tutorial
$
Files for the exercise are stored in subdirectories:
$ ls
config dags dax
You may also see some other files here.
Exercise 2.1: DAX
An abstract DAG has been generated for the Montage application and written in XML format to dax/montage.dax. Open montage.dax in a file viewer:
$ cat dax/montage.dax
Inside the DAX, you should see three sections.
- a list of all the files used in the workflow
- a definition of each job in the workflow
- a list of control-flow dependencies, which specifies a partial order in which the jobs are to be executed
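To make the three sections concrete, here is a minimal Python sketch that reads a tiny DAX-like document and pulls out the files, the jobs, and the parent/child dependencies. The sample XML is hypothetical and heavily simplified; the element and attribute names (filename, job, child, parent, ref) are assumptions for illustration, so compare them with what you actually see in dax/montage.dax.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified DAX-like document (NOT the real montage.dax).
SAMPLE_DAX = """
<adag name="example">
  <filename file="region.hdr" link="input"/>
  <filename file="mosaic.fits" link="output"/>
  <job id="ID000001" name="mProject"/>
  <job id="ID000002" name="mDiffFit"/>
  <job id="ID000003" name="mAdd"/>
  <child ref="ID000002"><parent ref="ID000001"/></child>
  <child ref="ID000003"><parent ref="ID000002"/></child>
</adag>
"""


def summarize_dax(xml_text):
    """Return (files, jobs, edges) for the three sections of the document."""
    root = ET.fromstring(xml_text)
    # Section 1: files used in the workflow.
    files = [f.get("file") for f in root.findall("filename")]
    # Section 2: job definitions, keyed by job id.
    jobs = {j.get("id"): j.get("name") for j in root.findall("job")}
    # Section 3: control-flow dependencies as (parent, child) pairs.
    edges = [
        (parent.get("ref"), child.get("ref"))
        for child in root.findall("child")
        for parent in child.findall("parent")
    ]
    return files, jobs, edges


files, jobs, edges = summarize_dax(SAMPLE_DAX)
print("files:", files)
print("jobs:", jobs)
print("dependencies (parent, child):", edges)
```

The (parent, child) pairs only constrain the order between connected jobs, which is why the dependencies define a partial order rather than a fixed sequence.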
Exercise 2.2 SETTING UP THE REPLICA CATALOG
In this exercise you will insert entries into the Replica Catalog.
The replica catalog that we will use today is a simple file based catalog.
We also support and recommend GLOBUS RLS or a JDBC implementation for production runs.
A Replica Catalog maintains the mapping from logical file names (lfn) to physical file names (pfn) for the input files of your workflow. Pegasus queries it to determine the locations of the raw input data files required by the workflow. Additionally, all the materialized data is registered in the replica catalog for later reuse.
You can use the rc-client command to insert, query and delete entries in the replica catalog.
The input data to be used for your workflow resides in the /scratch/tutorial/inputdata/0.2degree directory. We are going to insert entries into the replica catalog that point to the files in this directory.
The instructors have provided:
- A file rc.in, the input data file for the rc-client, that contains the mappings that need to be populated in the replica catalog. The file is inside the config directory.
Instructions:
- Let us see what the file looks like.
$ cat config/rc.in
# file-based replica catalog: 2007-06-02T13:11:35.954-07:00
statfile_20070529_153243_22618.tbl gsiftp://viz-login.isi.edu/scratch/tutorial/inputdata/0.2degree/statfile.tbl pool="local"
2mass-atlas-990502s-j1440198.fits gsiftp://viz-login.isi.edu/scratch/tutorial/inputdata/0.2degree/2mass-atlas-990502s-j1440198.fits pool="local"
2mass-atlas-990502s-j1440186.fits gsiftp://viz-login.isi.edu/scratch/tutorial/inputdata/0.2degree/2mass-atlas-990502s-j1440186.fits pool="local"
2mass-atlas-990502s-j1430092.fits gsiftp://viz-login.isi.edu/scratch/tutorial/inputdata/0.2degree/2mass-atlas-990502s-j1430092.fits pool="local"
2mass-atlas-990502s-j1420198.fits gsiftp://viz-login.isi.edu/scratch/tutorial/inputdata/0.2degree/2mass-atlas-990502s-j1420198.fits pool="local"
2mass-atlas-990502s-j1420186.fits gsiftp://viz-login.isi.edu/scratch/tutorial/inputdata/0.2degree/2mass-atlas-990502s-j1420186.fits pool="local"
cimages_20070529_153243_22618.tbl gsiftp://viz-login.isi.edu/scratch/tutorial/inputdata/0.2degree/cimages.tbl pool="local"
pimages_20070529_153243_22618.tbl gsiftp://viz-login.isi.edu/scratch/tutorial/inputdata/0.2degree/pimages.tbl pool="local"
region_20070529_153243_22618.hdr gsiftp://viz-login.isi.edu/scratch/tutorial/inputdata/0.2degree/region.hdr pool="local"
2mass-atlas-990502s-j1430080.fits gsiftp://viz-login.isi.edu/scratch/tutorial/inputdata/0.2degree/2mass-atlas-990502s-j1430080.fits pool="local"
- Now we are ready to run rc-client and populate the data. Since each of you has unique lfns that are being registered, all 10 entries should be registered successfully.
$ rc-client -Dpegasus.user.properties=config/properties --insert config/rc.in
#Successfully worked on : 10 lines
#Worked on total number of : 11 lines.
- Now the entries have been successfully inserted into the Replica Catalog.
We should query the replica catalog for a particular lfn.
$ rc-client -Dpegasus.user.properties=config/properties lookup pimages_20070529_153243_22618.tbl
pimages_20070529_153243_22618.tbl gsiftp://viz-login.isi.edu/scratch/tutorial/inputdata/0.2degree/pimages.tbl pool="local"
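To illustrate what the lookup does conceptually, here is a minimal Python sketch that reads lines in the format shown by cat config/rc.in (lfn, pfn, then key="value" attributes) into a dictionary and looks up one lfn. It is only an illustration of the lfn-to-pfn mapping; the function names are hypothetical and this is not how rc-client itself is implemented.

```python
def parse_replica_line(line):
    """Parse one 'lfn pfn key="value" ...' line; return None for comments."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    parts = line.split()
    lfn, pfn = parts[0], parts[1]
    attrs = {}
    for item in parts[2:]:
        key, _, value = item.partition("=")
        attrs[key] = value.strip('"')
    return lfn, pfn, attrs


def load_catalog(text):
    """Map each lfn to a list of (pfn, attributes) replicas."""
    catalog = {}
    for line in text.splitlines():
        entry = parse_replica_line(line)
        if entry is not None:
            lfn, pfn, attrs = entry
            catalog.setdefault(lfn, []).append((pfn, attrs))
    return catalog


# Two entries copied from the rc.in listing, plus its comment line.
SAMPLE_RC = """# file-based replica catalog: 2007-06-02T13:11:35.954-07:00
pimages_20070529_153243_22618.tbl gsiftp://viz-login.isi.edu/scratch/tutorial/inputdata/0.2degree/pimages.tbl pool="local"
region_20070529_153243_22618.hdr gsiftp://viz-login.isi.edu/scratch/tutorial/inputdata/0.2degree/region.hdr pool="local"
"""

catalog = load_catalog(SAMPLE_RC)
for pfn, attrs in catalog["pimages_20070529_153243_22618.tbl"]:
    print(pfn, attrs)
```

An lfn maps to a list here because a logical file may have more than one physical replica.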
Congratulations!! You have set up the replica catalog correctly. This is the catalog you will tinker with most while running Pegasus.
Exercise 2.3 SETTING UP THE SITE CATALOG AND THE TRANSFORMATION CATALOG
In this exercise you will set up your Site Catalog and the Transformation Catalog.
The transformation catalog maintains information about where the application code resides on the grid. In our case, it contains the locations where the Montage code is installed on the various grid sites.
The site catalog contains information about the layout of the grid on which you want to run your workflows. For each site it maintains information such as the work directories, the jobmanagers and gridftp servers to use, and other site-wide settings such as environment variables.
- You can use the sc-client command to generate a site catalog from a hand written sites.txt file.
$ sc-client -f config/sites.txt -o config/sites.xml
2007.06.02 17:06:12.215 PDT: [INFO] Reading config/sites.txt
2007.06.02 17:06:11.262 PDT: [INFO] Reading config/sites.txt (completed)
2007.06.02 17:06:11.276 PDT: [INFO] Written xml output to file : config/sites.xml
The instructors have provided:
- A ready transformation catalog (tc.data) in the $HOME/tutorial/config directory.
- A semi ready properties file in the $HOME/tutorial/config directory.
$ cat config/tc.data
local bin/mDiff gsiftp://sukhna.isi.edu/usr/sukhna/work/montage/software/default/bin/mDiff STATIC_BINARY INTEL32::LINUX ENV::MONTAGE_HOME="."
local bin/mDiff gsiftp://viz-login.isi.edu/nfs/software/montage/montage-3.0_beta33-ia64/bin/mDiff STATIC_BINARY INTEL64::LINUX ENV::MONTAGE_HOME="."
local bin/mFitplane gsiftp://sukhna.isi.edu/usr/sukhna/work/montage/software/default/bin/mFitplane STATIC_BINARY INTEL32::LINUX NULL
local bin/mFitplane gsiftp://viz-login.isi.edu/nfs/software/montage/montage-3.0_beta33-ia64/bin/mFitplane STATIC_BINARY INTEL64::LINUX NULL
local mAdd:3.0 gsiftp://sukhna.isi.edu/usr/sukhna/work/montage/software/default/bin/mAdd STATIC_BINARY INTEL32::LINUX NULL
local mAdd:3.0 gsiftp://viz-login.isi.edu/nfs/software/montage/montage-3.0_beta33-ia64/bin/mAdd STATIC_BINARY INTEL64::LINUX NULL
Open the properties file and check a few properties.
$ cat config/properties
## SELECT THE REPLICAT CATALOG MODE AND URL
pegasus.catalog.replica = SimpleFile
pegasus.catalog.replica.file = ${user.home}/tutorial/config/rc.data
#pegasus.catalog.replica.url=rlsn://smarty.isi.edu
## SELECT THE SITE CATALOG MODE AND FILE
pegasus.catalog.site = XML
pegasus.catalog.site.file = ${user.home}/tutorial/config/sites.xml
## SELECT THE TRANSFORMATION CATALOG MODE AND FILE
pegasus.catalog.transformation = File
pegasus.catalog.transformation.file = ${user.home}/tutorial/config/tc.data
## SET UP THE WORK AND INVOCATION DATABASE
pegasus.catalog.work = Database
pegasus.catalog.provenance = InvocationSchema
## Database related properties
pegasus.catalog.*.db.driver = MySQL
pegasus.catalog.*.db.url = jdbc:mysql://smarty.isi.edu/tg2007
pegasus.catalog.*.db.user = tg2007user
pegasus.catalog.*.db.password = Teragrid2007
## USE DAGMAN RETRY FEATURE FOR FAILURES
pegasus.dagman.retry=2
## STAGE ALL OUR EXECUTABLES
pegasus.catalog.transformation.mapper = Staged
## CHECK JOB EXIT CODES FOR FAILURE
pegasus.exitcode.scope=all
## OPTIMZE DATA & EXECUTABLE TRANSFERS
pegasus.transfer.refiner=Bundle
#STAGE DATA AND EXECUTABLES USING GRIDFTP 3rd PARTY MODE
pegasus.transfer.*.thirdparty.sites=*
## WORK AND STORAGE DIR
## CHANGE THESE TO YOUR TERAGRID USERNAME
pegasus.dir.storage = xxxxx/storage
pegasus.dir.exec = xxxxx/exec
Edit the properties pegasus.dir.storage and pegasus.dir.exec to specify relative paths for your workflow execution and data storage directories. Change the xxxxx value to your @tg-user@ value.
$ vim config/properties
[...]
$ cat config/properties
pegasus.dir.storage = @tg-user@/storage
pegasus.dir.exec = @tg-user@/exec
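If you want to double-check that no xxxxx placeholder is left behind, the following Python sketch parses key = value lines of the kind shown in config/properties and reports the keys whose values still contain the placeholder. The helper names are hypothetical, and the parser is a simplification that only handles the plain comment and key = value lines shown in this tutorial.

```python
def parse_properties(text):
    """Parse simple 'key = value' lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def unedited_keys(props, placeholder="xxxxx"):
    """Return the keys whose value still contains the placeholder."""
    return sorted(key for key, value in props.items() if placeholder in value)


# A few lines copied from the properties file before editing.
SAMPLE_PROPERTIES = """## USE DAGMAN RETRY FEATURE FOR FAILURES
pegasus.dagman.retry=2
pegasus.transfer.*.thirdparty.sites=*
## CHANGE THESE TO YOUR TERAGRID USERNAME
pegasus.dir.storage = xxxxx/storage
pegasus.dir.exec = xxxxx/exec
"""

props = parse_properties(SAMPLE_PROPERTIES)
print("still to edit:", unedited_keys(props))
```

To check your real file, you could pass the contents of config/properties to parse_properties instead of the sample string.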
You can look at the catalog files to get an idea of what they contain. But for now we will move ahead and plan your workflow through Pegasus. We need to get running on the GRID fast :). Time is short!!
In production mode the sc-client interfaces with Globus MDS to retrieve the information about various sites.
Also the client pegasus-get-sites can be used to generate a site catalog and transformation catalog for the Open Science Grid.
Exercise 2.4 Running pegasus-plan to generate concrete workflow (condor submit files) and pegasus-run to submit the workflow to a grid resource
In this exercise we are going to run pegasus-plan to generate a concrete workflow from the abstract workflow (montage.dax). The concrete workflow consists of condor submit files, which are submitted to remote grid resources using pegasus-run.
The instructors have provided:
- A dax (montage.dax) in the $HOME/tutorial/dax/ directory.
You will need to write some things yourself, by following the instructions below:
- Run pegasus-plan to generate the condor submit files out of the dax.
- Run pegasus-run to submit the workflow to the grid.
Instructions:
- Let us run pegasus-plan on the montage dax.
$ pegasus-plan -Dpegasus.user.properties=`pwd`/config/properties --dir `pwd`/dags --sites tg_ncsa --output local \
  --nocleanup --dax `pwd`/dax/montage.dax
The above command says that we need to plan the montage dax on the teragrid site named by --sites (tg_ncsa here; tg_sdsc is the other teragrid site used in this tutorial).
The output data needs to be transferred back to the local host. The condor submit files are to be generated in a directory structure whose base is dags.
We are also requesting that no cleanup jobs be added, as we require the intermediate data on the remote host.
Here is the output of pegasus-plan.
2007.06.02 19:31:38.912 PDT: [INFO] Parsing the DAX
2007.06.02 19:31:39.569 PDT: [INFO] Parsing the DAX (completed)
2007.06.02 19:31:39.669 PDT: [INFO] Parsing the site catalog
2007.06.02 19:31:39.876 PDT: [INFO] Parsing the site catalog (completed)
2007.06.02 19:31:39.947 PDT: [INFO] Doing site selection
2007.06.02 19:31:40.014 PDT: [INFO] Doing site selection (completed)
2007.06.02 19:31:40.015 PDT: [INFO] Grafting transfer nodes in the workflow
2007.06.02 19:31:40.264 PDT: [INFO] Grafting transfer nodes in the workflow (completed)
2007.06.02 19:31:40.272 PDT: [INFO] Grafting the remote workdirectory creation jobs in the workflow
2007.06.02 19:31:40.291 PDT: [INFO] Grafting the remote workdirectory creation jobs in the workflow (completed)
2007.06.02 19:31:40.292 PDT: [INFO] Generating the cleanup workflow
2007.06.02 19:31:40.296 PDT: [INFO] Generating the cleanup workflow (completed)
2007.06.02 19:31:40.407 PDT: [INFO] Generating codes for the concrete workflow
2007.06.02 19:31:41.022 PDT: [INFO] Generating codes for the concrete workflow (completed)
2007.06.02 19:31:41.023 PDT: [INFO] Generating code for the cleanup workflow
2007.06.02 19:31:41.064 PDT: [INFO] Generating code for the cleanup workflow (completed)
I have concretized your abstract workflow. The workflow has been entered into the workflow database with a state of "planned". The next step is to start or execute your workflow. The invocation required is
pegasus-run -Dpegasus.user.properties=/nfs/home/@viz-user@/tutorial/dags/@viz-user@/pegasus/montage/run0001/pegasus.60465.properties /nfs/home/@viz-user@/tutorial/dags/@viz-user@/pegasus/montage/run0001
2007.06.02 19:31:41.409 PDT: [INFO] Time taken to execute is 2.76 seconds
- If you get any errors above while running pegasus-plan, you can add -vvvvv to the pegasus-plan command to enable maximum verbosity.
- Now run pegasus-run as mentioned in the output of pegasus-plan.
$ pegasus-run -Dpegasus.user.properties=/nfs/home/@viz-user@/tutorial/dags/@viz-user@/pegasus/montage/run0001/pegasus.60465.properties \
  /nfs/home/@viz-user@/tutorial/dags/@viz-user@/pegasus/montage/run0001
The above command submits the workflow to Condor DAGMAN/CondorG. After submitting, it starts a monitoring daemon, tailstatd, that parses the condor log files to update the status of the jobs and pushes it into a work database.
Checking all your submit files for log file names.
This might take a while...
checking /tmp instead...
Done.
-----------------------------------------------------------------------
File for submitting this DAG to Condor : montage-0.dag.condor.sub
Log of DAGMan debugging messages : montage-0.dag.dagman.out
Log of Condor library debug messages : montage-0.dag.lib.out
Log of the life of condor_dagman itself : montage-0.dag.dagman.log
Condor Log file for all jobs of this DAG : /tmp/montage-060466.log
-no_submit given, not submitting DAG to Condor. You can do this with:
"condor_submit montage-0.dag.condor.sub"
-----------------------------------------------------------------------
Submitting job(s)
WARNING: Log file /nfs/home/@viz-user@/tutorial/dags/@viz-user@/pegasus/montage/run0001/montage-0.dag.dagman.log is on NFS.
This could cause log file corruption and is _not_ recommended.
.
Logging submit event(s).
1 job(s) submitted to cluster 2758.
I have started your workflow, committed it to DAGMan, and updated its state in the work database. A separate daemon was started to collect information about the progress of the workflow. The job state will soon be visible. Your workflow runs in base directory.
cd /nfs/home/@viz-user@/tutorial/dags/@viz-user@/pegasus/montage/run0001
*** To monitor the workflow you can run ***
pegasus-status -w montage-0 -t 20070602T193139-0700
or
pegasus-status /nfs/home/@viz-user@/tutorial/dags/@viz-user@/pegasus/montage/run0001
*** To remove your workflow run ***
pegasus-remove -d 2758.0
or
pegasus-remove /nfs/home/@viz-user@/tutorial/dags/@viz-user@/pegasus/montage/run0001
Exercise 2.5 Tracking the progress of the workflow and debugging the workflows.
In this exercise we are going to list ways to track your workflow, and give some debugging hints when something goes wrong.
We will change into the directory that was mentioned by the pegasus-run command.
$ cd /nfs/home/@viz-user@/tutorial/dags/@viz-user@/pegasus/montage/run0001
In this directory you will see a whole lot of files. That should not scare you. Unless things go wrong, you need to look at only a few files to track the progress of the workflow.
- Run the command pegasus-status as mentioned by pegasus-run above to check the status of your jobs.
$ pegasus-status `pwd`
-- Submitter: viz-login.isi.edu : <128.9.72.178:43684> : viz-login.isi.edu
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD
2758.0   @viz-user@      6/2  19:54   0+00:05:32 R  0   9.8  condor_dagman -f -
2764.0    |-chmod_mProj  6/2  19:58   0+00:01:56 R  0   0.0  kickstart -n pegas
2765.0    |-chmod_mDiff  6/2  19:58   0+00:01:56 R  0   0.0  kickstart -n pegas
2766.0    |-chmod_mDiff  6/2  19:58   0+00:01:56 R  0   0.0  kickstart -n pegas
2767.0    |-chmod_mDiff  6/2  19:58   0+00:01:56 R  0   0.0  kickstart -n pegas
2769.0    |-chmod_mBgMo  6/2  19:58   0+00:01:51 R  0   0.0  kickstart -n pegas
2771.0    |-chmod_mJPEG  6/2  19:58   0+00:01:51 R  0   0.0  kickstart -n pegas
2773.0    |-rc_tx_tg_nc  6/2  19:59   0+00:00:21 R  0   9.8  kickstart -n pegas
The above output shows that several jobs are running under the main dagman process. Keep checking this output to track whether the workflow is still running. If you do not see any of your jobs in the output for some time (say 30 seconds), the workflow has finished. We need to wait that long because there can be a delay before Condor DAGMan releases the next job into the queue after a job has finished successfully.
If the output of pegasus-status is empty, then your workflow has either
- successfully completed, or
- stopped midway due to a non-recoverable error.
- Another way to monitor the workflow is to check the jobstate.log file.
This is the output file of the monitoring daemon that is parsing all the condor log files to determine the status of the jobs.
It logs the events seen by Condor in a more readable form for us.
$ more jobstate.log
1180839289 INTERNAL *** TAILSTATD_STARTED ***
1180839288 INTERNAL *** DAGMAN_STARTED ***
1180839288 chmod_mBackground_ID000018_0 UN_READY - - -
1180839288 chmod_mJPEG_ID000027_0 UN_READY - - -
1180839288 mAdd_ID000025 UN_READY - - -
1180839288 mImgtbl_ID000024 UN_READY - - -
1180839288 chmod_mAdd_ID000025_0 UN_READY - - -
1180839288 mDiffFit_ID000013 UN_READY - - -
1180839288 mDiffFit_ID000008 UN_READY - - -
[..]
At the start of jobstate.log, when the workflow has just started running, you will see a lot of entries with status UN_READY. This means that DAGMan has just parsed the .dag file and has not started working on any job yet. Initially all the jobs in the workflow are listed as UN_READY.
After some time you will see entries in jobstate.log that show a job being submitted, executed, and so on.
1180839463 rc_tx_tg_sdsc_0 SUBMIT 2763.0 local -
1180839473 rc_tx_tg_sdsc_0 EXECUTE 2763.0 local -
1180839478 rc_tx_tg_sdsc_0 JOB_TERMINATED 2763.0 local -
1180839478 rc_tx_tg_sdsc_0 POST_SCRIPT_STARTED - local -
1180839483 rc_tx_tg_sdsc_0 POST_SCRIPT_TERMINATED 2763.0 local -
1180839483 rc_tx_tg_sdsc_0 POST_SCRIPT_SUCCESS - local -
The above shows the data transfer job being submitted and then executed on the grid. In addition, it shows that the job is being run on the grid site local (which is your submit machine). The various states of the job, as it goes from submission through execution to postprocessing, are in UPPERCASE.
- Successfully Completed
Let us again look at the jobstate.log. This time we need to look at the last few lines of jobstate.log
$ tail jobstate.log
1180840373 INTERNAL *** DAGMAN_FINISHED ***
1180840373 INTERNAL *** TAILSTATD_FINISHED 0 ***
Looking at the last two lines we see that DAGMan finished, and tailstatd finished successfully with a status of 0. This means the workflow ran successfully. Congratulations, you ran your workflow on the grid successfully.
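The same check can be scripted. The Python sketch below scans jobstate.log lines for the TAILSTATD_FINISHED entry and returns the status that follows it; it assumes the whitespace-separated layout shown in the excerpts in these notes, and the function name is hypothetical.

```python
def workflow_exit_status(lines):
    """Return the status after TAILSTATD_FINISHED, or None if not finished."""
    status = None
    for line in lines:
        fields = line.split()
        # Expected layout: <timestamp> INTERNAL *** TAILSTATD_FINISHED <status> ***
        if (
            len(fields) >= 5
            and fields[1] == "INTERNAL"
            and fields[3] == "TAILSTATD_FINISHED"
        ):
            status = int(fields[4])
    return status


# Lines copied from the jobstate.log excerpts in these notes.
finished_ok = [
    "1180840373 INTERNAL *** DAGMAN_FINISHED ***",
    "1180840373 INTERNAL *** TAILSTATD_FINISHED 0 ***",
]
still_running = [
    "1180839289 INTERNAL *** TAILSTATD_STARTED ***",
    "1180839288 INTERNAL *** DAGMAN_STARTED ***",
]

print(workflow_exit_status(finished_ok))    # status of a finished run
print(workflow_exit_status(still_running))  # None while the run is in progress
```

To apply it to a real run, you could pass the open jobstate.log file object in place of the sample lists.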
The workflow generates a single output file, montage.jpg, that resides at /scratch/@tg-user@/storage/montage.jpg, where @tg-user@ is your teragrid user id.
To view the image, you can copy montage.jpg to your viz-login webspace and view it in your web browser:
$ cp /scratch/@tg-user@/storage/montage.jpg ~/public_html
$
Point your web browser to: http://viz-login.isi.edu/~@viz-user@/montage.jpg where @viz-user@ is your viz-login userid.
- Unsuccessfully Completed (Workflow execution stopped midway)
Let us again look at the jobstate.log. Again we need to look at the last few lines of jobstate.log
$ tail jobstate.log
1180840228 inter_tx_mDiffFit_ID000011_0 POST_SCRIPT_STARTED - local -
1180840233 inter_tx_mDiffFit_ID000011_0 POST_SCRIPT_TERMINATED 2786.0 local -
1180840233 inter_tx_mDiffFit_ID000011_0 POST_SCRIPT_FAILURE 1 local -
1180840233 inter_tx_mDiffFit_ID000007_0 JOB_TERMINATED 2787.0 local -
1180840233 inter_tx_mDiffFit_ID000007_0 POST_SCRIPT_STARTED - local -
1180840238 inter_tx_mDiffFit_ID000007_0 POST_SCRIPT_TERMINATED 2787.0 local -
1180840238 inter_tx_mDiffFit_ID000007_0 POST_SCRIPT_FAILURE 1 local -
1180840373 INTERNAL *** DAGMAN_FINISHED ***
1180840373 INTERNAL *** TAILSTATD_FINISHED 1 ***
Looking at the last two lines we see that DAGMan finished, and tailstatd finished unsuccessfully with a status of 1.
We can easily determine which jobs failed: they are the ones marked POST_SCRIPT_FAILURE. In this excerpt both inter_tx_mDiffFit_ID000011_0 and inter_tx_mDiffFit_ID000007_0 are marked this way; the last job to fail is inter_tx_mDiffFit_ID000007_0.
To determine the reason for a failure we need to look at the job's kickstart output file, which is named $JOBNAME.out.NNN, where NNN is a three-digit number counting up from 000.
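Finding the failed jobs by eye gets tedious in a long log. As a rough aid, the Python sketch below records the last state seen for each job in jobstate.log and lists the jobs whose last state is POST_SCRIPT_FAILURE. It assumes the whitespace-separated layout shown in the excerpt above; the function names are hypothetical.

```python
def last_states(lines):
    """Map each job name to the last state recorded for it."""
    states = {}
    for line in lines:
        fields = line.split()
        # Skip short lines and the INTERNAL daemon markers.
        if len(fields) < 3 or fields[1] == "INTERNAL":
            continue
        states[fields[1]] = fields[2]
    return states


def failed_jobs(lines):
    """Return the jobs whose last recorded state is POST_SCRIPT_FAILURE."""
    return sorted(
        job for job, state in last_states(lines).items()
        if state == "POST_SCRIPT_FAILURE"
    )


# Lines copied from the failing jobstate.log excerpt in these notes.
SAMPLE_LOG = [
    "1180840228 inter_tx_mDiffFit_ID000011_0 POST_SCRIPT_STARTED - local -",
    "1180840233 inter_tx_mDiffFit_ID000011_0 POST_SCRIPT_TERMINATED 2786.0 local -",
    "1180840233 inter_tx_mDiffFit_ID000011_0 POST_SCRIPT_FAILURE 1 local -",
    "1180840233 inter_tx_mDiffFit_ID000007_0 JOB_TERMINATED 2787.0 local -",
    "1180840233 inter_tx_mDiffFit_ID000007_0 POST_SCRIPT_STARTED - local -",
    "1180840238 inter_tx_mDiffFit_ID000007_0 POST_SCRIPT_TERMINATED 2787.0 local -",
    "1180840238 inter_tx_mDiffFit_ID000007_0 POST_SCRIPT_FAILURE 1 local -",
    "1180840373 INTERNAL *** DAGMAN_FINISHED ***",
    "1180840373 INTERNAL *** TAILSTATD_FINISHED 1 ***",
]

for job in failed_jobs(SAMPLE_LOG):
    print(job)
```

Using the last state per job, rather than any failure ever seen, means a job that failed once and then succeeded on a DAGMan retry is not reported.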
Exercise 2.6 Removing a running workflow
Sometimes you may want to halt the execution of the workflow or just permanently remove it. You can stop/halt a workflow by running the pegasus-remove command mentioned in the output of pegasus-run.
$ pegasus-remove /nfs/home/@viz-user@/tutorial/dags/@viz-user@/pegasus/montage/run0001
Job 2788.0 marked for removal
Exercise 2.7 Optimizing a workflow by clustering small jobs
Sometimes a workflow has many jobs that each run for only a few seconds. In such cases the overhead of scheduling each job on a grid is large compared with the work done, and the runtime of the entire workflow can be reduced by using Pegasus clustering techniques. One such technique is horizontal clustering, which groups jobs on the same level of the workflow into one or more clustered jobs whose members run sequentially.
$ pegasus-plan -Dpegasus.user.properties=`pwd`/config/properties --dir `pwd`/dags --sites tg_sdsc --output local \
  --nocleanup --cluster horizontal --dax `pwd`/dax/montage.dax
[....]
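To give a feel for the idea of horizontal clustering, here is a small Python sketch that computes the level of each job in a dependency graph and then groups jobs on the same level into clusters of a fixed size. This is a simplified illustration only, not the algorithm Pegasus actually uses; the job names, the cluster_size parameter and the function names are all hypothetical.

```python
from collections import defaultdict


def job_levels(jobs, edges):
    """Level 1 for jobs without parents; otherwise 1 + the deepest parent."""
    parents = defaultdict(list)
    for parent, child in edges:
        parents[child].append(parent)
    levels = {}

    def level(job):
        if job not in levels:
            levels[job] = 1 + max((level(p) for p in parents[job]), default=0)
        return levels[job]

    for job in jobs:
        level(job)
    return levels


def cluster_horizontally(jobs, edges, cluster_size):
    """Group jobs on the same level into clusters of at most cluster_size."""
    by_level = defaultdict(list)
    for job, lvl in job_levels(jobs, edges).items():
        by_level[lvl].append(job)
    clusters = []
    for lvl in sorted(by_level):
        same_level = sorted(by_level[lvl])
        for i in range(0, len(same_level), cluster_size):
            clusters.append((lvl, same_level[i:i + cluster_size]))
    return clusters


# Hypothetical toy workflow: four projections, two diffs, one final add.
jobs = ["p1", "p2", "p3", "p4", "d1", "d2", "add"]
edges = [("p1", "d1"), ("p2", "d1"), ("p3", "d2"), ("p4", "d2"),
         ("d1", "add"), ("d2", "add")]

clusters = cluster_horizontally(jobs, edges, cluster_size=2)
for lvl, members in clusters:
    print("level", lvl, "->", members)
```

In this toy example seven jobs become four clustered jobs, so fewer jobs have to be scheduled on the grid.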
Exercise 2.8 Filling out the Teragrid tutorial Survey
http://tinyurl.com/23u8bc
