NEW FEATURES
- Support for Pegasus DAX 3.0
Pegasus can now also accept DAXes in the Pegasus 3.0 format.
Some of the salient features of the new format are:
- Users can specify locations of the files in the DAX
- Users can specify what executables to use in the DAX
- Users can specify sub-DAXes in the DAX using the dax element. Each dax job results in a separate sub-workflow being launched, with the appropriate pegasus-plan command as the prescript
- Users can specify Condor DAGs in the DAX using the dag element. The dag job is passed on to Condor DAGMan as a SUBDAG for execution.
A sample 3.0 DAX can be found at http://pegasus.isi.edu/mapper/docs/schemas/dax-3.0/two_node_dax-3.0_v6.xml
In the next Pegasus release (Pegasus 3.0) a Java DAX API will be made available, and further extensions will be added to the schema. For feature requests email pegasus@isi.edu
- Support for running workflows on EC2 using S3 for storage
Users running workflows on Amazon EC2 can use S3 as the storage backend for workflow execution. The details below assume that a user configures a Condor pool on the nodes allocated from EC2.
To enable Pegasus for S3, the following properties need to be set:
- pegasus.execute.*.filesystem.local = true
- pegasus.transfer.*.impl = S3
- pegasus.transfer.sls.*.impl = S3
- pegasus.dir.create.impl = S3
- pegasus.file.cleanup.impl = S3
- pegasus.gridstart = SeqExec
- pegasus.transfer.sls.s3.stage.sls.file = false
For staging in data and creating S3 buckets for workflows, Pegasus relies on the s3cmd command line client.
Pegasus looks for a transformation with namespace amazon and logical name s3cmd in the transformation catalog to figure out the location of the s3cmd client. For example, in the File based Transformation Catalog the fully qualified name of the transformation will be amazon::s3cmd.
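For illustration, a hypothetical entry in a File based Transformation Catalog (the site handle, path and sysinfo below are made up; the columns are site, logical transformation, physical path, type, sysinfo and profiles):

# site   lfn            pfn             type       sysinfo         profiles
ec2      amazon::s3cmd  /usr/bin/s3cmd  INSTALLED  INTEL32::LINUX  NULL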
In order to enable stdout and stderr streaming correctly from Condor on EC2, we recommend adding certain profiles in the site catalog for the cloud site. Here is a sample site catalog entry:
<pool handle="ec2">
  <profile namespace="env" key="PEGASUS_HOME">/usr/local/pegasus/default</profile>
  <profile namespace="env" key="GLOBUS_LOCATION">/usr/local/globus/default</profile>
  <profile namespace="env" key="LD_LIBRARY_PATH">/usr/local/globus/default/lib</profile>
  <!-- the directory where a user wants to run the jobs on the nodes retrieved from ec2 -->
  <profile namespace="env" key="wntmp">/mnt</profile>
  <profile namespace="pegasus" key="style">condor</profile>
  <!-- to be set to ensure condor streams stdout and stderr back to the submit host -->
  <profile namespace="condor" key="should_transfer_files">YES</profile>
  <profile namespace="condor" key="stream_output">true</profile>
  <profile namespace="condor" key="stream_error">true</profile>
  <profile namespace="condor" key="when_to_transfer_output">ON_EXIT</profile>
  <profile namespace="condor" key="universe">vanilla</profile>
  <profile namespace="condor" key="requirements">(Arch==Arch)&&(Disk!=0)&&(Memory!=0)&&(OpSys==OpSys)&&(FileSystemDomain!="")</profile>
  <profile namespace="condor" key="rank">SlotID</profile>
  <lrc url="rls://example.com"/>
  <gridftp url="s3://" storage="" major="2" minor="4" patch="3"/>
  <jobmanager universe="vanilla" url="example.com/jobmanager-pbs" major="2" minor="4" patch="3"/>
  <jobmanager universe="transfer" url="example.com/jobmanager-fork" major="2" minor="4" patch="3"/>
  <!-- create a new bucket for each wf
  <workdirectory>/</workdirectory>
  -->
  <!-- use an existing bucket -->
  <workdirectory>existing-bucket</workdirectory>
</pool>
Relevant JIRA links
- pegasus-analyzer
There is a new tool called pegasus-analyzer. It helps users analyze a workflow after it has finished executing.
It is not meant to be run while the workflow is still running. To track the status of a running workflow, users are recommended to use pegasus-status.
pegasus-analyzer looks at the workflow submit directory and parses the Condor DAGMan logs and the job .out files to print a summary of the workflow execution.
The tool prints out the following summary of the workflow:
- total jobs
- jobs succeeded
- jobs failed
- jobs unsubmitted
For all the failed jobs, the tool prints out the contents of the job .out and .err files.
Users can use the --quiet option to display only the paths to the .out and .err files. This is useful when the job output is particularly big or when kickstart is used to launch the jobs.
For Pegasus 3.0 the tool will be updated to parse kickstart output files and provide a concise view rather than displaying the whole output.
- Support for Condor Glite
Pegasus now supports a new style named glite for generating submit files. This allows Pegasus to create submit files for a gLite environment, where the gLite BLAHP talks to the scheduler instead of GRAM. At a minimum, the following profiles need to be associated with the job:
- pegasus profile style - value set to glite
- condor profile grid_resource - value set to the remote scheduler that the gLite BLAHP talks to
This style should only be used when the Condor installation on the submit host can directly talk to the scheduler running on the cluster. The Pegasus site catalog should have a separate compute site that has this style associated with it. This style should not be specified for the local site.
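For illustration, a minimal sketch of these profiles in the site catalog entry for such a compute site (the pbs value is an assumption; set grid_resource to whatever scheduler the gLite BLAHP on the cluster talks to):

<!-- hypothetical profiles for a glite style compute site -->
<profile namespace="pegasus" key="style">glite</profile>
<profile namespace="condor" key="grid_resource">pbs</profile>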
As part of applying the style to the job, this style adds the following classad expressions to the job description:
- +remote_queue - value picked up from the globus profile queue
- +remote_cerequirements - see below
The remote CE requirements are constructed from the following profiles associated with the job. The profiles for a job are derived from various sources
- user properties
- transformation catalog
- site catalog
- DAX
Note that it is up to the user to specify all of these or only a subset of them.
The following globus profiles, if associated with the job, are picked up:
- hostcount -> PROCS
- count -> NODES
- maxwalltime-> WALLTIME
The following condor profiles, if associated with the job, are picked up:
- priority -> PRIORITY
All the env profiles are translated to MYENV
For example, the expression in the submit file may look like
+remote_cerequirements = "PROCS==18 && NODES==1 && PRIORITY==10 && WALLTIME==3600
&& PASSENV==1 && JOBNAME==\"TEST JOB\" && MYENV ==\"FOO=BAR,HOME=/home/user\""
All the jobs that have this style applied do not have a remote directory specified in the submit file. They rely on kickstart to change to the working directory when the job is launched on the remote node.
- Generating a site catalog for OSG using OSGMM
The pegasus-get-sites tool has been modified to query OSGMM (OSG Match Maker) to generate a site catalog for a VO.
It builds upon the earlier Engage implementation, which has now been generalized and renamed to OSGMM.
The --source option to pegasus-get-sites now needs to be OSGMM instead of Engage.
Some of the changes are:
- The Condor collector host can be specified on the command line or via the property pegasus.catalog.site.osgmm.collector.host. It defaults to ligo-osgmm.renci.org.
- Users who are part of the Engage VO should set pegasus.catalog.site.osgmm.collector.host=engage-central.renci.org.
- The default VO used is LIGO. It can be overridden by specifying the --vo option to pegasus-get-sites, or the property pegasus.catalog.site.osgmm.vo.
- By default the implementation always returns validated sites. To retrieve all sites for a VO, set pegasus.catalog.site.osgmm.retrieve.validated.sites to false.
- If multiple gatekeepers are associated with the same OSG site, multiple site catalog entries are created in the site catalog. A suffix (__index, where index starts from 1) is added to the extra sites.
Sample Usage
pegasus-get-sites --source OSGMM --sc osg-sites.xml --vo LIGO --grid OSG
Tracked in JIRA at PM-67 #188
Currently, there is no way to filter sites according to the grid ( OSG|OSG-ITB ) in OSGMM
The site catalog generated has storage directories that have a VO component in them.
- Generating a site catalog for OSG using MyOSG
pegasus-get-sites has been modified to generate a site catalog by querying MyOSG.
To use MyOSG as the backend, the --source option needs to be set to MYOSG.
Sample usage
pegasus-get-sites --source MYOSG --sc myosg-sites-new.xml -vvvvv --vo ligo --grid osg
This was tracked in JIRA PM-61 #182
The Pegasus team recommends using OSGMM for generating a site catalog.
- Separation of Symlink and Stagein Transfer Jobs
The following transfer refiners now support the separation of the symlink jobs from the stage-in jobs:
- Default
- Bundle
- Cluster
While using these refiners, the files that need to be symlinked against existing files on a compute site will have a separate symlink job. The files that actually need to be copied to a remote site will appear in the stage_in_ jobs.
This distinction allows users to stage in data using third party transfers that run on the submit host, while still being able to symlink against existing datasets.
The symlink jobs run on the remote compute sites. Earlier this was not possible, and hence to use symlinking a user had to turn off third party transfers. This resulted in an increased load on the head node, as the stage-in jobs executed there.
By default, Pegasus will use the transfer executable shipped with the worker package to do the symbolic linking.
If the user wants to change the executable used, they can set the following property
pegasus.transfer.symlink.impl
The above also allows separate executables to be used for staging in data and for symbolic linking. For example, GUC can be used to stage in data by setting
pegasus.transfer.stagein.impl GUC
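For illustration, a minimal properties sketch that stages in data with GUC while symlinking with the default Transfer implementation (the combination shown here is just one example of mixing implementations):

# hypothetical properties file snippet
pegasus.transfer.stagein.impl  GUC
pegasus.transfer.symlink.impl  Transfer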
To control the symlinking granularity in the Bundle and Cluster refiners, the following Pegasus profile keys can be associated (a sketch follows the list):
- bundle.symlink
- cluster.symlink
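For illustration, assuming these keys take an integer bundling factor analogous to the existing stage-in bundling keys, a hypothetical site catalog profile could be:

<!-- hypothetical: bundle symlink creation into 2 symlink jobs for this site -->
<profile namespace="pegasus" key="bundle.symlink">2</profile>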
The feature implementation was tracked in JIRA at PM-54 #175
- Bypassing First Level Staging of Files for worker node execution
Pegasus now has the capability to bypass first level staging if the input files in the replica catalog have a pool attribute matching the site at which a job is being run. This applies in the case of worker node execution.
The cache file generated in the submit directory is the transient replica catalog. It now also lists the locations where the input files are staged on the remote sites. Earlier it contained only the files that were generated by the workflow.
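For illustration, a hypothetical entry in a file based replica catalog whose pool attribute matches the execution site (the LFN, path and site handle are made up):

# lfn       pfn                        site attribute
test.gwf    file:///data/S5/test.gwf   pool="ligo-cit"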
Tracked in JIRA here
- Resolving SRM URLs to file URLs on a filesystem
There is now support to resolve the SRM URLs in the replica catalog to file URLs on a site. The user needs to specify the URL prefix and the mount point of the filesystem.
This can be done by specifying the properties
pegasus.transfer.srm.[sitename].service.url
pegasus.transfer.srm.[sitename].service.mountpoint
Pegasus will then map SRM URLs associated with the site to a path on the filesystem by replacing the service URL component with the mount point.
For example, if a user has specified
pegasus.transfer.srm.ligo-cit.service.url srm://osg-se.ligo.caltech.edu:10443/srm/v2/server?SFN=/mnt/hadoop
pegasus.transfer.srm.ligo-cit.service.mountpoint /mnt/hadoop/
then the URL srm://osg-se.ligo.caltech.edu:10443/srm/v2/server?SFN=/mnt/hadoop/ligo/frames/S5/test.gwf will resolve to
/mnt/hadoop/ligo/frames/S5/test.gwf
- New Transfer implementation Symlink
Pegasus now ships a Perl executable called symlink with the Pegasus worker package, which can be used to create multiple symlinks against input datasets in a single invocation.
The Transfer implementation that uses the transfer executable also has the same functionality. However, the transfer executable complains if it cannot find the Globus client libraries.
In order to use this executable for the symlink jobs, users need to set the following property
pegasus.transfer.symlink.impl Symlink
Later on (from the Pegasus 3.0 release onwards), this will be made the default executable for symlinking jobs.
- Passing options forward to pegasus-run in pegasus-plan
Users can now forward options to the pegasus-run invocation that is used to submit the workflow after a successful mapping.
pegasus-plan has a --forward option[=value] that allows a user to forward options to pegasus-run. For example, the nogrid option can be passed to pegasus-run as follows: pegasus-plan --forward nogrid
The option can be repeated multiple times to forward multiple options to pegasus-run. The long option version of the forwarded pegasus-run option should always be specified.
- Passing extra arguments to SLS transfer implementations
Users can now specify pegasus.transfer.sls.arguments to pass extra options at runtime to the SLS implementations used by Pegasus. The following SLS transfer implementations accept the above property:
- S3
- Transfer
- Passing non-standard Java options to dax jobs in DAX 3.0
Non-standard JVM options (-X[option]) can now be specified for the sub workflows in the arguments section of the dax jobs.
For example, for a DAX job a user can set the Java maximum heap size to 1024m by specifying -Xmx1024m in the arguments for the DAX job.
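For illustration, a hypothetical dax job entry in a 3.0 DAX carrying such an argument (the id and file values are made up; consult the sample DAX linked above for the exact element syntax):

<!-- hypothetical dax job launching a sub workflow with a larger JVM heap -->
<dax id="ID000002" file="sub.dax">
   <argument>-Xmx1024m</argument>
</dax>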
- Location of Condor Logs directory on the submit host
By default, Pegasus designates that the Condor logs be created in the /tmp directory. This is done to ensure that the logs are created in a local directory even though the submit directory may be on NFS.
The submit directory contains a symbolic link to the appropriate log file in /tmp. However, since /tmp is automatically purged in most cases, users may want to preserve their Condor logs in a directory on the local filesystem other than /tmp.
The new property
pegasus.dir.submit.logs
allows a user to designate the logs directory on the submit host for condor logs.
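For illustration, a hypothetical setting that keeps the logs under a persistent local directory (the path is made up):

# hypothetical local, non-purged directory for condor logs
pegasus.dir.submit.logs  /local/pegasus/condor-logs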
- Removing profile keys as part of overriding profiles
There is now a notion of empty profile key values in Pegasus. The default action on an empty key value is to remove the key. Currently the following namespaces follow this convention:
- Condor
- Globus
- Pegasus
This allows a user to unset values as part of overriding profiles. Normally a user can only update a profile value, i.e. they can update the value of a key, but the key remains associated with the job. This feature allows the user to remove the key from the profile namespace altogether.
For example, a user may have a profile X set in the site catalog. For a particular job the user does not want that profile key to be used. They can now specify the same profile X with an empty value in the transformation catalog for that job. This results in the profile key X being removed from the job.
- Constructing paths to Condor DAGMan for recursive/hierarchical workflows
The entry for condor::dagman is no longer required for site local in the transformation catalog. Instead, Pegasus constructs the path from the following environment variables: CONDOR_HOME and CONDOR_LOCATION.
The priority order is as follows
- CONDOR_HOME defined in the environment
- CONDOR_LOCATION defined in the environment
- entry for condor::dagman for site local
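For illustration, a minimal sketch of setting one of these variables in the submit host environment (the install path is made up):

# hypothetical: let Pegasus locate DAGMan without a condor::dagman entry
export CONDOR_HOME=/opt/condor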
This is useful when running workflows that refer to sub workflows as in the new DAX 3.0 format.
This was tracked in JIRA PM-50 #171
- Constructing path to kickstart
By default the path to kickstart is determined on the basis of the environment variable PEGASUS_HOME associated with a site entry in the site catalog.
However, in some cases a user might want to use their own modified version of kickstart.
To enable that, the path to kickstart is now constructed according to the following rules:
- the pegasus profile gridstart.path specified in the site catalog for the site in question (see the sketch after this list)
- if the above is not specified, a path constructed on the basis of the environment variable PEGASUS_HOME for the site in the site catalog
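For illustration, a hypothetical site catalog profile pointing to a user-modified kickstart (the path is made up):

<!-- hypothetical profile selecting a custom kickstart for this site -->
<profile namespace="pegasus" key="gridstart.path">/home/user/bin/kickstart</profile>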
The above was tracked in JIRA PM-60 #181
- Bulk Lookups to Replica Catalog using rc-client
rc-client can now do bulk lookups, similar to how it does bulk inserts and deletes.
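For illustration, assuming the bulk operations are driven through the rc-client -f option with a command file (as for bulk inserts and deletes), a hypothetical bulk lookup might look like:

# lookups.txt, one operation per line (hypothetical LFNs)
lookup f.a
lookup f.b

rc-client -f lookups.txt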
Details at PM-75 #196
- Additions to show-job workflow visualization script
show-job now has a --title option to add a user-provided title to the generated Gantt chart.
show-job can also visualize workflows of workflows.
- Absolute paths for certain properties in the properties file
The properties file that is written out in the submit directory now has absolute paths specified for the following property values:
- pegasus.catalog.replica.file
- pegasus.catalog.transformation.file
- pegasus.catalog.site.file
This is done even if the user specified relative paths in the properties file.
- The default horizontal clustering factor
The default horizontal clustering (collapse) factor has been updated to 1, instead of the earlier value of 3.
This ensures that users can cluster only jobs of certain types, and let others remain unclustered. Previously, the workaround was to explicitly specify a collapse factor of 1 for the jobs that users did not want clustered.
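For illustration, a hypothetical pegasus profile associating a clustering factor with a particular transformation, so that only jobs of that type get clustered (the value 10 is made up):

<!-- hypothetical: collapse 10 jobs of this transformation into one clustered job -->
<profile namespace="pegasus" key="collapse">10</profile>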
BUGS FIXED
- Handling of standard universe in condor style
In the Condor style, the standard universe, if specified for a job, is ONLY associated with compute jobs. This ensures that Pegasus auxiliary jobs never execute in the standard universe.
- Bug fix for replica selection bug 43
Checked in the fix for JIRA bug PM-43 #164.
The ReplicaLocation class now has a clone method that does a shallow clone. This clone method is called in the selectReplica methods of the replica selectors.
- rc-client did not implement the pegasus.catalog.replica.lrc.ignore property
This is now fixed. This bug was tracked in JIRA PM-42 #163
- DAXes created while partitioning a workflow
During the partitioning of workflows, the DAX for a partition was created incorrectly, as the register flags were not correctly parsed by the VDL DAX parser.
This was tracked in JIRA PM-48 #169
- Handling of initialdir and remote_initialdir keys
Changed the internal handling of the initialdir and remote_initialdir keys. The initialdir key is now only associated with standard universe jobs. For the glidein and condor styles, remote_initialdir is now associated instead, unless it is a standard universe job.
This was tracked in JIRA PM-58 #179
- Querying RLI for a non existent LFN using rc-client
rc-client had inconsistent behavior when querying the RLI for an LFN that does not exist in the RLS. This affected the rc-client lookup command option.
Details at PM-74 #195
- Running clustered jobs on the cloud in a directory other than /tmp
There was a bug whereby clustered jobs executing in worker node execution mode did not honor the wntmp environment variable specified in the Site Catalog for the site.
The bug fix was tracked through JIRA at PM-83 #204
- Bug fix for worker package deployment
The regex employed to determine the Pegasus version from the URL of a worker package was insufficient. It only took care of x86 builds.
For example, it could not parse the following URL from a transformation catalog entry (STATIC_BINARY INTEL64::LINUX NULL): http://pegasus.isi.edu/mapper/download/nightly/pegasus-worker-2.4.0cvs-ia64_rhas_3.tar.gz
This is now fixed. Related to JIRA PM-33
- Destination URL construction for worker package staging
Earlier, the worker package input files always had third party URLs, even if the worker package deployment job executed on the remote site (in push/pull mode).
Now, third party URLs are only constructed if the worker package deployment job actually runs in third party mode. In push-pull mode, the destination URLs are file URLs.
Tracked in JIRA at PM-89 #207
DOCUMENTATION
- User Guides
The Running on Different Grids guide now has information on how to run workflows using gLite.
- Pegasus Replica Selection
The guides are checked in under $PEGASUS_HOME/doc/guides
They can be found online at http://pegasus.isi.edu/mapper/doc.php
- The Properties document was updated with the newly introduced properties.