These are the student notes for the Pegasus tutorial. They are designed to be used in conjunction with the instructor's presentation and support.
You will see two styles of machine text here:
Text like this is input that you should type.
Text like this is the output you should get.
For example:
$ date
Mon June 1 11:54:58 BST 2007
You will need to log into the tutorial machine, using an ssh client and the login name and password supplied separately.
On Linux or Mac OS X, open a terminal window and type:
$ ssh viz-username@viz-login.isi.edu
[welcome message]
viz-username@viz-login:~$
On Windows, PuTTY is recommended as an ssh client.
You will need to obtain grid credentials to run the workflows on TeraGrid.
TeraGrid provides a facility to obtain grid credentials using MyProxy.
$ myproxy-logon -s myproxy.teragrid.org -l tg-username
Enter MyProxy pass phrase:
A credential has been received for user train22 in /tmp/x509up_u1055
In this chapter you will be introduced to planning and running a workflow through Pegasus on a cluster. You will take a generated Montage workflow and run it on the grid.
All the exercises in this chapter will be run from the $HOME/tutorial/ directory. All the files that are required reside in this subdirectory.
$ cd $HOME/tutorial
$
Files for the exercise are stored in subdirectories:
$ ls
config dags data fmri.xml templates
You may also see some other files here.
An abstract DAG has been generated and output in XML format into fmri.xml. Open fmri.xml in a file viewer:
$ cat fmri.xml
Inside the DAX, you should see two sections.
In this exercise you will insert entries into the Replica Catalog. The replica catalog that we use is Globus RLS.
A Replica Catalog maintains the mapping from logical file names (LFNs) to physical file names (PFNs) for the input files of your workflow. Pegasus queries it to determine the locations of the raw input data files required by the workflow. Additionally, all the materialized data is registered into RLS so that it can be reused later.
You can use the rc-client command to insert, query and delete from the replica catalog.
The input data to be used for your workflow resides in the $HOME/tutorial/data directory. We are going to insert entries into the replica catalog that point to the files in this directory.
The instructors have provided:
You will need to write some things yourself, by following the instructions below:
Instructions:
$ cd config
$ perl -pi -e 's/\@user\@/vdsuser-4/g' replicas.in
$ cat replicas.in
vdsuser-4-anatomy4.img gsiftp://skynet-login/nfs/home/vdsuser-4/tutorial/data/anatomy4.img pool=isi
vdsuser-4-anatomy4.hdr gsiftp://skynet-login/nfs/home/vdsuser-4/tutorial/data/anatomy4.hdr pool=isi
vdsuser-4-anatomy3.img gsiftp://skynet-login/nfs/home/vdsuser-4/tutorial/data/anatomy3.img pool=isi
vdsuser-4-anatomy3.hdr gsiftp://skynet-login/nfs/home/vdsuser-4/tutorial/data/anatomy3.hdr pool=isi
vdsuser-4-anatomy2.img gsiftp://skynet-login/nfs/home/vdsuser-4/tutorial/data/anatomy2.img pool=isi
vdsuser-4-anatomy2.hdr gsiftp://skynet-login/nfs/home/vdsuser-4/tutorial/data/anatomy2.hdr pool=isi
vdsuser-4-anatomy1.img gsiftp://skynet-login/nfs/home/vdsuser-4/tutorial/data/anatomy1.img pool=isi
vdsuser-4-anatomy1.hdr gsiftp://skynet-login/nfs/home/vdsuser-4/tutorial/data/anatomy1.hdr pool=isi
vdsuser-4-reference.img gsiftp://skynet-login/nfs/home/vdsuser-4/tutorial/data/reference.img pool=isi
vdsuser-4-reference.hdr gsiftp://skynet-login/nfs/home/vdsuser-4/tutorial/data/reference.hdr pool=isi
Run rc-client and populate the data. Since each of you is registering unique LFNs, all 10 entries should be registered successfully.
$ rc-client --insert replicas.in
#Successfully worked on : 10 lines
#Worked on total number of : 10 lines.
$ rc-client lookup vdsuser-4-anatomy1.img
vdsuser-4-anatomy1.img gsiftp://skynet-login/nfs/home/vdsuser-4/tutorial/data/anatomy1.img pool="isi"
Congratulations!! You have the replica catalog set up correctly for use. This is the catalog you will tinker with most while running Pegasus.
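The replicas.in format used in this exercise (one logical file name, one physical URL and a pool attribute per line) can be pictured as a simple lookup table. The Python sketch below only illustrates the LFN-to-PFN mapping idea; it is not how rc-client or RLS is implemented. The function names parse_replicas and lookup are hypothetical, and skipping lines that start with # is an assumption.

```python
from collections import defaultdict


def parse_replicas(text):
    """Parse replicas.in-style lines into {lfn: [(pfn, attributes), ...]}.

    Assumed line format (read off the replicas.in listing in this exercise):
        <lfn> <pfn> key=value [key=value ...]
    """
    catalog = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # assumption: blank lines and comment lines are ignored
        lfn, pfn, *attrs = line.split()
        attributes = dict(a.split("=", 1) for a in attrs)
        catalog[lfn].append((pfn, attributes))
    return dict(catalog)


def lookup(catalog, lfn):
    """Return all known physical locations for a logical file name."""
    return catalog.get(lfn, [])


sample = """\
vdsuser-4-anatomy1.img gsiftp://skynet-login/nfs/home/vdsuser-4/tutorial/data/anatomy1.img pool=isi
vdsuser-4-anatomy1.hdr gsiftp://skynet-login/nfs/home/vdsuser-4/tutorial/data/anatomy1.hdr pool=isi
"""
catalog = parse_replicas(sample)
for pfn, attrs in lookup(catalog, "vdsuser-4-anatomy1.img"):
    print(pfn, attrs)
```

A logical name that was never registered simply returns an empty list, which is the situation a planner has to handle when input data cannot be located.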
In this exercise you will set up your Site Catalog and the Transformation Catalog.
The transformation catalog maintains information about where the application code resides on the grid. In our case, it contains the locations where the fmri code is installed on the various grid sites.
The site catalog contains information about the layout of the grid where you want to run your workflows. For each site, it maintains information such as the work directories, the jobmanagers and GridFTP servers to use, and other site-wide settings such as environment variables to be set.
Use the vds-get-sites command to generate a site catalog and a transformation catalog to be used.
$ vds-get-sites -t $HOME/my.tc.data -s $HOME/my.sites.xml --default-rls rlsn://smarty.isi.edu
# using default transformation mappings.
# assembling information for grid "tg".
# using URI dbi:SQLite2:dbname=/nfs/software/vds/default/contrib/OurGrids/tg.db
# processing isi-skynet, 93 CPUs
# processing tg_ncsa, 10 CPUs
# processing tg_uc, 10 CPUs
# processing tg_sdsc, 10 CPUs
# adding myself as local site
# dumping catalogs...
# dumping SC into my.sites.xml...
# backup /nfs/home/vdsuser-4/tc.data
# -> /nfs/home/vdsuser-4/tc.data.0
# dumping TC into /nfs/home/vdsuser-4/tc.data..
The instructors have provided:
You can look at them to get an idea of what they look like. But for now we will move ahead and plan your workflow through Pegasus. We need to get running on the grid fast :). Time is short!!
Use vds-plan to generate a concrete workflow (Condor submit files) and vds-run to submit the workflow to a grid resource
In this exercise we are going to run vds-plan to generate a concrete workflow from the abstract workflow (the DAX, fmri.xml). The generated concrete workflow consists of Condor submit files that are submitted to remote grid resources using Condor DAGMan and Condor-G. Then we will submit the workflow to the grid using vds-run.
First, we are going to change fmri.xml slightly. The changes ensure that each student's DAX refers to unique LFNs. This way you cannot clobber your fellow students' runs.
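The idea of making names unique per student can be sketched in Python. This mirrors what the perl one-liner in these exercises does, namely replacing every @user@ placeholder with a login name. The function name personalize is hypothetical, and the template line is inferred from the replicas.in listing rather than copied from the real template file.

```python
def personalize(text, username):
    """Replace every @user@ placeholder with the given login name."""
    return text.replace("@user@", username)


# Template line inferred from the replicas.in listing (an assumption).
template = (
    "@user@-anatomy1.img "
    "gsiftp://skynet-login/nfs/home/@user@/tutorial/data/anatomy1.img pool=isi"
)

# Two students get two different logical file names from the same template.
print(personalize(template, "vdsuser-4"))
print(personalize(template, "vdsuser-5"))
```

Because the login name becomes part of every logical file name, two students working from the same template never register or overwrite the same entry.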
The instructors have provided:
You will need to write some things yourself, by following the instructions below:
Instructions:
$ cd $HOME/tutorial
$ perl -pi -e 's/\@user\@/vdsuser-4/g' fmri.xml
$ more fmri.xml
$ vds-plan --pegasus --base dags --option pools=skynet --option output=skynet-data --option force fmri.xml
The above command says that we need to plan the fmri DAX using Pegasus on the grid site skynet (pools=skynet). The output data needs to be transferred back to the skynet-data host (output=skynet-data). In addition, we want the Condor submit files to be generated in a directory structure whose base is dags (--base dags).
Here is the output of vds-plan running on site skynet.
$ vds-plan --pegasus --base dags --option pools=skynet --option output=skynet-data --option force fmri.xml
2006.11.29 20:42:50.707 PDT: [INFO] Parsing of the DAX
2006.11.29 20:42:51.217 PDT: [INFO] Parsing of the DAX (completed)
2006.11.29 20:42:51.238 PDT: [INFO] Parsing the site catalog
2006.11.29 20:42:51.316 PDT: [INFO] Parsing the site catalog (completed)
2006.11.29 20:42:51.687 PDT: [INFO] Querying Replica Catalog
2006.11.29 20:42:51.697 PDT: [INFO] Querying Replica Catalog (completed)
2006.11.29 20:42:51.697 PDT: [INFO] Doing site selection
[..]
2006.11.29 20:42:51.7
I have concretized your abstract workflow. The workflow has been entered into the workflow database with a state of "planned". The next step is to start or execute your workflow. The invocation required is
vds-run /nfs/home/vdsuser-4/tutorial/dags/ivdgl1/fmri/run0001
$ vds-run /nfs/home/vdsuser-4/tutorial/dags/ivdgl1/fmri/run0001
[..]
# parsing properties in /nfs/home/vdsuser-4/.wfrc...
# slurped /nfs/home/vdsuser-4/tutorial/dags/ivdgl1/fmri/run0001/braindump.txt
I have started your workflow, committed it to DAGMan, and updated its state in the work database. A separate daemon was started to collect information about the progress of the workflow. The job state will soon be visible. Your workflow runs in base directory
cd /nfs/home/vdsuser-4/tutorial/dags/ivdgl1/fmri/run0001
The above command submits the workflow to Condor DAGMan/Condor-G. After submitting, it starts a monitoring daemon, tailstatd, that parses the Condor log files to update the status of the jobs and push it into a work database.
In this exercise we are going to list ways to track your workflow, and give some debugging hints when something goes wrong.
We will change into the directory that was mentioned by the vds-run command.
$ cd /nfs/home/@user@/tutorial/dags/ivdgl1/fmri/run000X
In this directory you will see a whole lot of files. That should not scare you. Unless things go wrong, you need to look at only a few files to track the progress of the workflow.
At first, you need to be concerned with only one file, jobstate.log:
$ more jobstate.log
1157589711 INTERNAL *** TAILSTATD_STARTED ***
1157589711 INTERNAL *** DAGMAN_STARTED ***
convert_Node_convertX UN_READY - - -
1157589711 new_rc_tx_softmean_Node_softmean_0 UN_READY - - -
1157589711 new_rc_register_slicer_Node_slicerX UN_READY - - -
1157589711 new_rc_register_align_warp_Node_align_warp_collection_4 UN_READY - - -
[..]
At the start of jobstate.log, when the workflow has just started running, you will see many entries with status UN_READY. This means that DAGMan has parsed the .dag file but has not yet started working on any job. Initially all the jobs in the workflow are listed as UN_READY.
After some time you will see entries in jobstate.log showing that jobs are being submitted, executed and so on.
1157589873 rc_tx_skynet_0 SUBMIT 21053.0 local -
1157589878 rc_tx_skynet_0 EXECUTE 21053.0 local -
1157589878 rc_tx_skynet_0 JOB_TERMINATED 21053.0 local -
1157589878 rc_tx_skynet_0 POST_SCRIPT_STARTED - local -
1157589883 rc_tx_skynet_0 POST_SCRIPT_TERMINATED 21053.0 local -
1157589883 rc_tx_skynet_0 POST_SCRIPT_SUCCESS - local -
The above shows the data transfer job being submitted and then executed on the grid. It also shows that the job is being run on the grid site local (which is your submit machine). The states the job passes through, from submission to execution to postprocessing, are in UPPERCASE.
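To follow these entries programmatically, each job line can be split into its fields. The Python sketch below assumes the layout seen in the excerpts (epoch timestamp, job name, state, an id column, site, and a trailing dash). The names JobEvent and parse_job_line are hypothetical, and reading the fourth column as a Condor job id is an assumption.

```python
from typing import NamedTuple, Optional


class JobEvent(NamedTuple):
    timestamp: int
    job: str
    state: str
    condor_id: Optional[str]  # assumption: fourth column is the Condor job id
    site: Optional[str]


def parse_job_line(line):
    """Parse one job line of jobstate.log; return None for other lines.

    Assumed layout, read off the excerpts in these notes:
        <epoch seconds> <job name> <STATE> <id or -> <site or -> -
    """
    fields = line.split()
    if len(fields) < 5 or not fields[0].isdigit() or fields[1] == "INTERNAL":
        return None
    timestamp, job, state, condor_id, site = fields[:5]
    return JobEvent(
        int(timestamp),
        job,
        state,
        None if condor_id == "-" else condor_id,
        None if site == "-" else site,
    )


log = """\
1157589873 rc_tx_skynet_0 SUBMIT 21053.0 local -
1157589878 rc_tx_skynet_0 EXECUTE 21053.0 local -
1157589878 rc_tx_skynet_0 JOB_TERMINATED 21053.0 local -
1157589883 rc_tx_skynet_0 POST_SCRIPT_SUCCESS - local -
"""
events = [e for e in map(parse_job_line, log.splitlines()) if e]
for e in events:
    print(e.timestamp, e.job, e.state, e.site)
```

Lines written by the monitoring daemon itself (the INTERNAL entries) are skipped here, since they describe the workflow as a whole rather than a single job.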
To see which of your jobs is being executed, you can also use condor_q. By default condor_q lists all the jobs on the submit host. However, we are interested only in our own jobs, so we will use some ClassAd magic to narrow the results.
Run the command condor_q -const '(Owner == "@user@")' with @user@ replaced by your username.
$ condor_q -const '(Owner == "vdsuser-4")'
252.0 vdsuser-4 11/29 17:04 0+00:11:21 R 0 9.8 condor_dagman -f -
260.0 vdsuser-4 11/29 17:15 0+00:00:00 I 0 9.8 kickstart -n convert
The above indicates that we currently have one job running (status R) and one job idle (status I). The dagman job is the graph manager and convert is the application code.
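If you capture the condor_q output in a file or a string, the status column can be tallied with a few lines of Python. This is a sketch under the assumption that the job lines have the column order shown in the excerpt (id, owner, submit date, submit time, run time, status, priority, size, command); the function name count_by_status is hypothetical.

```python
from collections import Counter


def count_by_status(condor_q_output):
    """Count jobs per status letter in condor_q job lines.

    Assumes the column order of the excerpt in these notes:
        ID OWNER SUBMIT-DATE SUBMIT-TIME RUN_TIME ST PRI SIZE CMD...
    Lines that do not start with a <cluster>.<proc> job id are ignored.
    """
    counts = Counter()
    for line in condor_q_output.splitlines():
        fields = line.split()
        if len(fields) < 9:
            continue
        cluster, dot, proc = fields[0].partition(".")
        if not (dot and cluster.isdigit() and proc.isdigit()):
            continue
        counts[fields[5]] += 1  # sixth column is the status letter
    return counts


sample = """\
252.0 vdsuser-4 11/29 17:04 0+00:11:21 R 0 9.8 condor_dagman -f -
260.0 vdsuser-4 11/29 17:15 0+00:00:00 I 0 9.8 kickstart -n convert
"""
counts = count_by_status(sample)
print("running:", counts["R"], "idle:", counts["I"])
```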
Keep an eye on condor_q to track whether a workflow is still running. If you do not see any of your jobs in condor_q for some time (say 30 seconds), the workflow has finished. We need to wait because there may be a delay before Condor DAGMan releases the next job into the queue after a job has finished successfully.
If condor_q is empty, then your workflow has either
- successfully completed, or
- stopped midway due to a non-recoverable error.
Let us again look at jobstate.log. This time we need to look at the last few lines of the file.
$ tail jobstate.log
1157590861 new_rc_register_convert_Node_convertZ POST_SCRIPT_TERMINATED 21098.0 local -
1157590861 new_rc_register_convert_Node_convertZ POST_SCRIPT_SUCCESS - local -
1157590861 INTERNAL *** DAGMAN_FINISHED ***
1157590866 INTERNAL *** TAILSTATD_FINISHED 0 ***
Looking at the last two lines, we see that DAGMan finished and tailstatd finished successfully with a status of 0. This means the workflow ran successfully. Congratulations, you ran your workflow on the grid successfully.
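The same check can be scripted. The Python sketch below looks for the TAILSTATD_FINISHED entry and reports its status. The function name workflow_exit_status is hypothetical, and treating every non-zero status as a failure is an assumption based on the two cases shown in these notes (0 and 1).

```python
import re

# Matches the final daemon entry, e.g. "*** TAILSTATD_FINISHED 0 ***".
FINISHED = re.compile(r"\*\*\* TAILSTATD_FINISHED (\d+) \*\*\*")


def workflow_exit_status(jobstate_text):
    """Return the TAILSTATD_FINISHED status, or None if not finished yet."""
    status = None
    for line in jobstate_text.splitlines():
        match = FINISHED.search(line)
        if match:
            status = int(match.group(1))
    return status


tail = """\
1157590861 new_rc_register_convert_Node_convertZ POST_SCRIPT_SUCCESS - local -
1157590861 INTERNAL *** DAGMAN_FINISHED ***
1157590866 INTERNAL *** TAILSTATD_FINISHED 0 ***
"""
status = workflow_exit_status(tail)
if status is None:
    print("workflow still running")
elif status == 0:
    print("workflow finished successfully")
else:
    print("workflow failed with status", status)
```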
The 3 output images generated by the workflow are .gif files with paths like /nfs/storage01/@user@/@user@-atlas-x.gif, where @user@ is your user ID. For example, if the user is vdsuser-4, the path will be /nfs/storage01/vdsuser-4/vdsuser-4-atlas-x.gif
To view the images, you can copy the *.gif files to your skynet webspace and view them in your web browser:
$ cp /nfs/storage01/vdsuser-4/*.gif ~/public_html
$
Point your web browser to: http://skynet-login.isi.edu/~@user@/@user@-atlas-x.gif where @user@ is your user ID.
Let us again look at jobstate.log, this time for a workflow that failed. Again we need to look at the last few lines of the file.
$ tail jobstate.log
1149912756 Frequency_ID000006 SUBMIT 277.0 isi_skynet -
1149912766 Frequency_ID000006 GLOBUS_SUBMIT 277.0 isi_skynet -
1149912766 Frequency_ID000006 GRID_SUBMIT 277.0 isi_skynet -
1149912861 Frequency_ID000006 EXECUTE 277.0 isi_skynet -
1149912901 Frequency_ID000006 JOB_TERMINATED 277.0 isi_skynet -
1149912901 Frequency_ID000006 POST_SCRIPT_STARTED - isi_skynet -
1149912906 Frequency_ID000006 POST_SCRIPT_TERMINATED 277.0 isi_skynet -
1149912906 Frequency_ID000006 POST_SCRIPT_FAILURE 1 isi_skynet -
1149912906 INTERNAL *** DAGMAN_FINISHED ***
1149912911 INTERNAL *** TAILSTATD_FINISHED 1 ***
Looking at the last two lines, we see that DAGMan finished and tailstatd finished unsuccessfully with a status of 1.
We can easily determine which job failed: it is the one with the POST_SCRIPT_FAILURE entry, Frequency_ID000006 in this case.
To determine the reason for the failure, we need to look at its kickstart output file, which is $JOBNAME.out.NN or $JOBNAME.out depending on the version of Pegasus.
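Both steps, finding the failed job and locating its kickstart output, can be sketched in Python. The function names failed_jobs and kickstart_outputs are hypothetical. The sketch assumes that a failed job is marked by a POST_SCRIPT_FAILURE entry in jobstate.log and that the kickstart files sit in the run directory next to it.

```python
import glob
import os


def failed_jobs(jobstate_text):
    """Return names of jobs that logged POST_SCRIPT_FAILURE, in order."""
    failed = []
    for line in jobstate_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[2] == "POST_SCRIPT_FAILURE":
            if fields[1] not in failed:
                failed.append(fields[1])
    return failed


def kickstart_outputs(run_dir, job):
    """List <job>.out and <job>.out.NN files found in the run directory."""
    exact = os.path.join(run_dir, job + ".out")
    found = [exact] if os.path.exists(exact) else []
    pattern = os.path.join(glob.escape(run_dir), glob.escape(job) + ".out.*")
    found += sorted(glob.glob(pattern))
    return found


tail = """\
1149912901 Frequency_ID000006 JOB_TERMINATED 277.0 isi_skynet -
1149912906 Frequency_ID000006 POST_SCRIPT_TERMINATED 277.0 isi_skynet -
1149912906 Frequency_ID000006 POST_SCRIPT_FAILURE 1 isi_skynet -
1149912911 INTERNAL *** TAILSTATD_FINISHED 1 ***
"""
for job in failed_jobs(tail):
    # "." stands for the run directory you changed into earlier.
    print(job, kickstart_outputs(".", job))
```

Run from inside the workflow's run directory, this prints each failed job together with whichever kickstart output files exist for it.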
The End