Pegasus 2.4.0 Released

===============================
Release Notes for PEGASUS 2.4.0
===============================

NEW FEATURES
--------------

1) Support for Pegasus DAX 3.0

Pegasus can now also accept DAXes in the Pegasus DAX 3.0 format.

Some salient features of the new format are
- Users can specify locations of the files in the DAX
- Users can specify what executables to use in the DAX
- Users can specify sub DAXes in the DAX using the dax element. The
dax jobs result in a separate subworkflow being launched with the
appropriate pegasus-plan command as the prescript
- Users can specify Condor DAGs in the DAX using the dag
element. The dag job is passed on to Condor DAGMan as a SUBDAG
for execution.

A sample 3.0 DAX can be found at
http://pegasus.isi.edu/mapper/docs/schemas/dax-3.0/two_node_dax-3.0_v6.xml

In the next Pegasus release (Pegasus 3.0) a Java DAX API will be
made available. Further extensions will be added to the
schema. For feature requests email pegasus@isi.edu

2) Support for running workflows on EC2 using S3 for storage

Users running on Amazon EC2 can use S3 as the storage backend
for workflow execution. The details below assume that the user
configures a Condor pool on the nodes allocated from EC2.

To enable Pegasus for S3 the following properties need to be set.

- pegasus.execute.*.filesystem.local = true
- pegasus.transfer.*.impl = S3
- pegasus.transfer.sls.*.impl = S3
- pegasus.dir.create.impl = S3
- pegasus.file.cleanup.impl = S3
- pegasus.gridstart = SeqExec
- pegasus.transfer.sls.s3.stage.sls.file = false

For data stage-in and for creating S3 buckets for workflows, Pegasus
relies on the s3cmd command line client for Amazon S3.

Pegasus looks for a transformation with namespace amazon and
logical name s3cmd in the transformation catalog to determine
the location of the s3cmd client. For example, in the file based
Transformation Catalog the full name of the transformation will be
amazon::s3cmd

In order to enable stdout and stderr streaming correctly from
Condor on EC2, we recommend adding certain profiles in the site
catalog for the cloud site.
Here is a sample site catalog entry

<site handle="ec2" sysinfo="INTEL32::LINUX">
<profile namespace="env" key="PEGASUS_HOME">/usr/local/pegasus/default</profile>
<profile namespace="env" key="GLOBUS_LOCATION">/usr/local/globus/default</profile>
<profile namespace="env" key="LD_LIBRARY_PATH">/usr/local/globus/default/lib</profile>

<profile namespace="env" key="wntmp">/mnt</profile>

<profile namespace="pegasus" key="style">condor</profile>

<profile namespace="condor" key="should_transfer_files">YES</profile>
<profile namespace="condor" key="transfer_output">true</profile>
<profile namespace="condor" key="transfer_error">true</profile>
<profile namespace="condor" key="WhenToTransferOutput">ON_EXIT</profile>

<profile namespace="condor" key="universe">vanilla</profile>

<profile namespace="condor" key="requirements">(Arch==Arch)&amp;&amp;(Disk!=0)&amp;&amp;(Memory!=0)&amp;&amp;(OpSys==OpSys)&amp;&amp;(FileSystemDomain!="")</profile>
<profile namespace="condor" key="rank">SlotID</profile>

<workdirectory>existing-bucket</workdirectory>
</site>

Relevant JIRA links
http://jira.pegasus.isi.edu/browse/PM-68
http://jira.pegasus.isi.edu/browse/PM-20
http://jira.pegasus.isi.edu/browse/PM-85

3) pegasus-analyzer

There is a new tool called pegasus-analyzer. It helps users
analyze a workflow after it has finished executing.

It is not meant to be run while the workflow is still running. To
track the status of a running workflow, users are for now
advised to use pegasus-status.

pegasus-analyzer looks at the workflow submit directory and parses
the condor dagman logs and the job.out files to print a summary of
the workflow execution.

The tool prints out the following summary of the workflow

Total jobs
jobs succeeded
jobs failed
jobs unsubmitted

For all the failed jobs the tool prints out the contents of the
job.out and job.err files.

The user can use the --quiet option to display only the paths to
the .out and .err files. This is useful when the job output is
particularly big or when kickstart is used to launch the jobs.

For Pegasus 3.0 the tool will be updated to parse kickstart output
files and provide a concise view rather than displaying the whole
output.

4) Support for Condor Glite

Pegasus now supports a new style named glite for generating the submit
files. This allows pegasus to create submit files for a glite
environment where a glite blahp talks to the scheduler instead of
GRAM. At a minimum the following profiles need to be associated with
the job.

pegasus profile style - value set to glite
condor profile grid_resource - value set to the remote scheduler to
which the glite blahp talks.

This style should only be used when the Condor on the submit host
can talk directly to the scheduler running on the cluster. In the
Pegasus site catalog there should be a separate compute site that
has this style associated with it. This style should not be specified
for the local site.

As part of applying the style to the job, this style adds the
following classad expressions to the job description

+remote_queue - value picked up from globus profile queue
+remote_cerequirements - See below

The remote CE requirements are constructed from the following
profiles associated with the job. The profiles for a job are
derived from various sources

- user properties
- transformation catalog
- site catalog
- DAX

Note that it is up to the user to specify these or a subset of them.

The following globus profiles if associated with the job are picked up

hostcount -> PROCS
count -> NODES
maxwalltime -> WALLTIME

The following condor profiles if associated with the job are picked up

priority -> PRIORITY

All the env profiles are translated to MYENV

For e.g. the expression in the submit file may look like

+remote_cerequirements = "PROCS==18 && NODES==1 && PRIORITY==10 && WALLTIME==3600
&& PASSENV==1 && JOBNAME==\"TEST JOB\" && MYENV ==\"FOO=BAR,HOME=/home/user\""
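
As a rough illustration of the mapping above, the following Python
sketch assembles such an expression from a job's profiles. The
function name build_cerequirements and the dictionary layout are
hypothetical and not part of Pegasus. PASSENV and JOBNAME are left
out because these notes do not say how they are derived, and the
ordering of the terms is arbitrary.

```python
def build_cerequirements(globus, condor, env):
    """Assemble a +remote_cerequirements line from job profiles (illustrative only)."""
    # Profile key -> CE requirement name, as listed in these release notes.
    globus_map = [("hostcount", "PROCS"), ("count", "NODES"), ("maxwalltime", "WALLTIME")]
    condor_map = [("priority", "PRIORITY")]

    terms = []
    for key, name in globus_map:
        if key in globus:
            terms.append("%s==%s" % (name, globus[key]))
    for key, name in condor_map:
        if key in condor:
            terms.append("%s==%s" % (name, condor[key]))
    if env:
        # All env profiles are folded into one comma-separated MYENV string.
        pairs = ",".join("%s=%s" % (k, v) for k, v in sorted(env.items()))
        terms.append('MYENV==\\"%s\\"' % pairs)
    return '+remote_cerequirements = "%s"' % " && ".join(terms)


line = build_cerequirements(
    {"hostcount": 18, "count": 1, "maxwalltime": 3600},
    {"priority": 10},
    {"FOO": "BAR", "HOME": "/home/user"},
)
print(line)
```

Only the profiles that are actually present contribute a term, which
matches the note that users may specify any subset of them.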

All the jobs that have this style applied do not have a remote
directory specified in the submit directory. They rely on
kickstart to change to the working directory when the job is
launched on the remote node.

5) Generating a site catalog for OSG using OSGMM
The pegasus-get-sites tool has been modified to query OSGMM
(OSG Match Maker) to generate a site catalog for a VO.

It builds upon the earlier Engage implementation. It has now been
generalized and renamed to OSGMM

The source option to pegasus-get-sites now needs to be OSGMM
instead of Engage

Some of the changes are

The condor collector host can be specified on the command line or
in the properties file by setting the property
pegasus.catalog.site.osgmm.collector.host
It defaults to ligo-osgmm.renci.org

If a user is part of the Engage VO they should set
pegasus.catalog.site.osgmm.collector.host=engage-central.renci.org

The default VO used is LIGO. It can be overridden by specifying the
--vo option to pegasus-get-sites, or by setting the property
pegasus.catalog.site.osgmm.vo

By default the implementation always returns validated sites.
To retrieve all sites for a VO set
pegasus.catalog.site.osgmm.retrieve.validated.sites to false.

If multiple gatekeepers are associated with the same OSG site,
multiple entries are created in the site catalog. A suffix is
added to the extra sites (__index, where index starts from 1)
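
The naming scheme can be sketched in Python as follows. The function
site_handles is hypothetical, and the sketch assumes that the first
gatekeeper keeps the plain site name while only the extra entries
receive the suffix, which is one reading of the description above.

```python
def site_handles(site, gatekeeper_count):
    """Return one site handle per gatekeeper (illustrative naming only)."""
    if gatekeeper_count < 1:
        return []
    # Assumption: the first entry keeps the plain name; extras get __1, __2, ...
    handles = [site]
    for index in range(1, gatekeeper_count):
        handles.append("%s__%d" % (site, index))
    return handles


print(site_handles("EXAMPLE_SITE", 3))
```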

Sample Usage
pegasus-get-sites --source OSGMM --sc osg-sites.xml --vo LIGO --grid OSG

Tracked in JIRA at http://pegasus.isi.edu/jira/browse/PM-67

Currently, there is no way to filter sites according to the grid ( OSG|OSG-ITB ) in OSGMM

The site catalog generated has storage directories that have a VO component in them.

6) Generating a site catalog for OSG using MYOSG
pegasus-get-sites has now been modified to generate a site catalog by querying MyOSG

To use MYOSG as the backend the source option needs to be set to MYOSG

Sample usage

pegasus-get-sites --source MYOSG --sc myosg-sites-new.xml -vvvvv --vo ligo --grid osg

This was tracked in JIRA
http://pegasus.isi.edu/jira/browse/PM-61

Pegasus Team recommends using OSGMM for generating a site catalog.

7) Separation of Symlink and Stagein Transfer Jobs

The following transfer refiners
- Default
- Bundle
- Cluster
now support the separation of the symlink jobs from the stage in
jobs. While using these refiners, the files that need to be
symlinked against existing files on a compute site will have a
separate symlink job. The files that actually need to be copied to
a remote site will appear in the stage_in_ jobs.

This distinction allows users to stage in data using third
party transfers that run on the submit host, and at the same time
symlink against existing datasets.

The symlink jobs run on the remote compute sites. Earlier this was
not possible, and hence for a user to use symlinking they had to
turn off third party transfers. This resulted in an increased load
on the head node as the stage in jobs executed there.

By default, Pegasus will use the transfer executable shipped with
the worker package to do the symbolic linking.

If the user wants to change the executable used, they can set the following property

pegasus.transfer.symlink.impl

The above also allows us to use separate executables for staging in
data and for symbolic linking.
For e.g. we can use GUC to stage in data by setting

pegasus.transfer.stagein.impl GUC

To control the symlinking granularity in the Bundle and Cluster
refiners the following Pegasus profile keys can be associated

bundle.symlink
cluster.symlink

The feature implementation was tracked in JIRA at
http://pegasus.isi.edu/jira/browse/PM-54

8) Bypassing First Level Staging of Files for worker node execution

Pegasus now has the capability to bypass first level staging if the
input files in the replica catalog have a pool attribute matching
the site at which a job is being run. This applies in case of
worker node execution.

The cache file generated in the submit directory is the transient
replica catalog. It now also has the locations where the input
files are staged on the remote sites. Earlier it only had the files
that were generated by the workflow.

Tracked in JIRA here
http://pegasus.isi.edu/jira/browse/PM-20
http://pegasus.isi.edu/jira/browse/PM-62

9) Resolving SRM URL's for file URL's on a filesystem

There is now support to resolve the SRM urls in the replica catalog to
the file url on a site. The user needs to specify the URL prefix
and the mount point of the filesystem.

This can be done by specifying the properties

pegasus.transfer.srm.[sitename].service.url
pegasus.transfer.srm.[sitename].service.mountpoint

Pegasus will then map SRM URLs associated with the site to a path on
the filesystem by replacing the service URL component with the mount
point.

For example, if the user has specified

pegasus.transfer.srm.ligo-cit.service.url srm://osg-se.ligo.caltech.edu:10443/srm/v2/server?SFN=/mnt/hadoop
pegasus.transfer.srm.ligo-cit.service.mountpoint /mnt/hadoop/

then url
srm://osg-se.ligo.caltech.edu:10443/srm/v2/server?SFN=/mnt/hadoop/ligo/frames/S5/test.gwf
will resolve to

/mnt/hadoop/ligo/frames/S5/test.gwf
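
The substitution can be sketched in Python as below. This is only an
illustration of the prefix replacement described above, not the
Pegasus implementation; the function name resolve_srm_url is
hypothetical, and the slash handling is an assumption made so that
the example above yields a clean path.

```python
def resolve_srm_url(url, service_url, mount_point):
    """Map an SRM URL to a local path by swapping the service URL prefix
    for the mount point (illustrative only)."""
    if not url.startswith(service_url):
        return None  # the URL does not belong to this site's SRM service
    remainder = url[len(service_url):]
    # Join with exactly one slash, whatever slashes the inputs carry.
    return mount_point.rstrip("/") + "/" + remainder.lstrip("/")


service = "srm://osg-se.ligo.caltech.edu:10443/srm/v2/server?SFN=/mnt/hadoop"
mount = "/mnt/hadoop/"
print(resolve_srm_url(service + "/ligo/frames/S5/test.gwf", service, mount))
```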

10) New Transfer implementation Symlink

Pegasus now has support for a perl executable called symlink,
shipped with the Pegasus worker package, that can be used to create
multiple symlinks against input datasets in a single invocation.

The Transfer implementation that uses the transfer executable also
has the same functionality.
However, the transfer executable complains if it cannot find the
Globus client libraries.

In order to use this executable for the symlink jobs, users need to
set the following property

pegasus.transfer.symlink.impl Symlink

Later on ( pegasus 3.0 release onwards ) this will be made the
default executable to be used for symlinking jobs.

11) Passing options forward to pegasus-run in pegasus-plan

Users can now forward options to the pegasus-run invocation that is
used to submit the workflow in case of successful mapping.

This is done with the --forward option[=value] argument to
pegasus-plan.
For e.g. the nogrid option can be passed to pegasus-run as follows
pegasus-plan --forward nogrid

The option can be repeated multiple times to forward multiple
options to pegasus-run. The longopt version should always be
specified for pegasus-run.
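
A minimal Python sketch of assembling such an invocation is shown
below. The helper plan_command is hypothetical, other pegasus-plan
arguments are omitted, and the option name "someopt" with a value is
a made-up placeholder used only to show the option[=value] form.

```python
def plan_command(forwarded):
    """Build a pegasus-plan argument list that forwards options to pegasus-run.

    forwarded is a list of (long_option_name, value_or_None) pairs.
    """
    argv = ["pegasus-plan"]
    for name, value in forwarded:
        # One --forward per option; append =value only when a value is given.
        argv += ["--forward", name if value is None else "%s=%s" % (name, value)]
    return argv


print(" ".join(plan_command([("nogrid", None)])))
print(" ".join(plan_command([("nogrid", None), ("someopt", "x")])))  # "someopt" is a placeholder
```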

12) Passing extra arguments to SLS transfer implementations

Users can now specify pegasus.transfer.sls.arguments to pass extra
options at runtime to the SLS implementations used by Pegasus.
The following SLS transfer implementations accept the above property.

S3
Transfer

13) Passing non standard java options to dax jobs in DAX 3.0

The non standard jvm options (-X[option]) can now be specified for
the sub workflows in the arguments section for the dax jobs.

For example, for a DAX job a user can set the java max heap size
to 1024m by specifying -Xmx1024m in the arguments for the DAX job

14) Location of Condor Logs directory on the submit host

By default, Pegasus designates the condor logs to be created in the
/tmp directory. This is done to ensure that the logs are created in
a local directory even though the submit directory may be on NFS.

The submit directory contains a symbolic link to the appropriate
log file in /tmp. However, since /tmp is automatically
purged in most cases, users may want to preserve their condor logs
in a directory on the local filesystem other than /tmp

The new property

pegasus.dir.submit.logs

allows a user to designate the logs directory on the submit host
for condor logs.

15) Removing profile keys as part of overriding profiles

There is now a notion of empty profile key values in Pegasus. The
default action on an empty key value is to remove the key. Currently
the following namespaces follow this convention

- Condor
- Globus
- Pegasus

This allows a user to unset values as part of overriding
profiles. Normally a user can only update a profile value, i.e. they
can update the value of a key, but the key remains associated with
the job. The empty value allows the user to remove the key from the
profile namespace.

For e.g.

A user may have a profile X set in the site catalog.

Now for a particular job a user does not want that profile key to
be used. They can now specify the same profile X with an empty value
in the transformation catalog for that job. This results in the
profile key X being removed from the job.
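
The override behaviour can be sketched in Python as follows. The
function override_profiles and the (namespace, key) dictionary layout
are hypothetical; the sketch only illustrates the rule that an empty
value removes the key in the three namespaces listed above. The key
names in the usage lines are taken from elsewhere in these notes.

```python
REMOVABLE_NAMESPACES = {"condor", "globus", "pegasus"}


def override_profiles(base, overrides):
    """Apply overriding profiles; an empty value removes the key (illustrative only)."""
    merged = dict(base)
    for (namespace, key), value in overrides.items():
        if value == "" and namespace in REMOVABLE_NAMESPACES:
            merged.pop((namespace, key), None)  # unset rather than store an empty value
        else:
            merged[(namespace, key)] = value
    return merged


site_catalog = {("globus", "queue"): "long", ("condor", "universe"): "vanilla"}
transformation_catalog = {("globus", "queue"): ""}
print(override_profiles(site_catalog, transformation_catalog))
```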

16) Constructing Paths to Condor DAGMan for recursive/hierarchical
workflows

The entry for condor::dagman is no longer required for site local
in the transformation catalog.
Instead, Pegasus constructs the path from the following environment
variables: CONDOR_HOME, CONDOR_LOCATION

The priority order is as follows

1) CONDOR_HOME defined in the environment
2) CONDOR_LOCATION defined in the environment
3) entry for condor::dagman for site local
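
The lookup order above can be sketched in Python as follows. The
function dagman_path is hypothetical, and the bin/condor_dagman
layout under the Condor installation directory is an assumption, not
something stated in these notes.

```python
import posixpath


def dagman_path(env, tc_entry=None):
    """Pick the condor_dagman path using the priority order above (illustrative only)."""
    for var in ("CONDOR_HOME", "CONDOR_LOCATION"):
        if env.get(var):
            # Assumption: the binary lives in bin/ under the Condor installation.
            return posixpath.join(env[var], "bin", "condor_dagman")
    # Fall back to the condor::dagman entry for site local, if one exists.
    return tc_entry


print(dagman_path({"CONDOR_LOCATION": "/opt/condor"}))
```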

This is useful when running workflows that refer to sub workflows
as in the new DAX 3.0 format.

This was tracked in JIRA
http://pegasus.isi.edu/jira/browse/PM-50

17) Constructing path to kickstart

By default the path to kickstart is determined on the basis of the
environment variable PEGASUS_HOME associated with a site entry in
the site catalog.

However, in some cases a user might want to use their own modified
version of kickstart.

To enable that, the path to kickstart is now constructed according
to the following rules
1) pegasus profile gridstart.path specified in the site catalog for
the site in question.
2) If 1 is not specified, then a path is constructed on the basis
of the environment variable PEGASUS_HOME for the site in the site
catalog.
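
A small Python sketch of these two rules follows. The function
kickstart_path is hypothetical, and the bin/kickstart location under
PEGASUS_HOME is an assumption made for illustration.

```python
import posixpath


def kickstart_path(pegasus_profiles, env_profiles):
    """Resolve the kickstart path for a site (illustrative only)."""
    # Rule 1: an explicit gridstart.path profile in the site catalog wins.
    explicit = pegasus_profiles.get("gridstart.path")
    if explicit:
        return explicit
    # Rule 2: otherwise derive the path from the site's PEGASUS_HOME.
    home = env_profiles.get("PEGASUS_HOME")
    # Assumption: kickstart sits in bin/ under PEGASUS_HOME.
    return posixpath.join(home, "bin", "kickstart") if home else None


print(kickstart_path({}, {"PEGASUS_HOME": "/usr/local/pegasus/default"}))
```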

The above was tracked in JIRA
http://pegasus.isi.edu/jira/browse/PM-60

18) Bulk Lookups to Replica Catalog using rc-client

rc-client now can do bulk lookups similar to how it does bulk
inserts and deletes

Details at
http://jira.pegasus.isi.edu/browse/PM-75

19) Additions to show-job workflow visualization script

show-job now has a --title option to add a user provided title to the generated gantt chart.

show-job can also visualize workflows of workflows.

20) Absolute paths for certain properties in the properties file

The properties file that is written out now in the submit directory
has absolute paths specified for the following property values.

pegasus.catalog.replica.file
pegasus.catalog.transformation.file
pegasus.catalog.site.file

This is the case even if the user specified relative paths in the properties file.
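
The effect can be sketched in Python as below. The function
absolutize is hypothetical, and which directory Pegasus resolves
relative paths against is not stated in these notes, so base_dir is
an assumption supplied by the caller.

```python
import posixpath

PATH_PROPERTIES = (
    "pegasus.catalog.replica.file",
    "pegasus.catalog.transformation.file",
    "pegasus.catalog.site.file",
)


def absolutize(props, base_dir):
    """Return a copy of props with the catalog file paths made absolute."""
    result = dict(props)
    for key in PATH_PROPERTIES:
        if key in result:
            # join() leaves already-absolute values untouched.
            result[key] = posixpath.normpath(posixpath.join(base_dir, result[key]))
    return result


print(absolutize({"pegasus.catalog.site.file": "conf/sites.xml"}, "/home/user/run"))
```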

21) The default horizontal clustering factor

The default clustering factor (collapse) is now 1, instead of the earlier value of 3.

This ensures that users can cluster only jobs of certain types,
and let others remain unclustered. Previously, users had to specify
the collapse factor as 1 explicitly for jobs they did not want
clustered.

BUGS FIXED
----------
1) Handling of standard universe in condor style
In Condor style, the standard universe, if specified for a job, is
ONLY associated with compute jobs. This ensures that Pegasus
auxiliary jobs never execute in the standard universe.

2) Bug Fix for replica selection bug 43
Checked in the fix for JIRA bug 43 http://pegasus.isi.edu/jira/browse/PM-43

The ReplicaLocation class now has a clone method that does a
shallow clone
This clone method is called in the selectReplica methods in the replica selectors.

3) rc-client did not implement pegasus.catalog.replica.lrc.ignore property
This is now fixed.
This bug was tracked in JIRA http://pegasus.isi.edu/jira/browse/PM-42

4) DAX'es created while partitioning a workflow
During the partitioning of the workflows , the DAX for a partition
was created incorrectly as the register flags were not correctly
parsed by the VDL DAX parser.

This was tracked in JIRA
http://pegasus.isi.edu/jira/browse/PM-48

5) Handling of initialdir and remote_initialdir keys
Changed the internal handling for the initialdir and
remote_initialdir keys. The initialdir key is now only associated
for standard universe jobs. For glidein and condor style we now
associate remote_initialdir unless it is a standard universe job.

This was tracked in JIRA
http://pegasus.isi.edu/jira/browse/PM-58

6) Querying RLI for non existent LFN using rc-client
rc-client had inconsistent behavior when querying RLI for a LFN
that does not exist in the RLS.
This affected the rc-client lookup command option.

Details at
http://jira.pegasus.isi.edu/browse/PM-74

7) Running clustered jobs on the cloud in directory other than /tmp
There was a bug whereby the clustered jobs executing in worker node
execution mode did not honor the wntmp environment variable
specified in the Site Catalog for the site.

The bug fix was tracked through JIRA
http://jira.isi.edu/browse/PM-83

8) Bug Fix for worker package deployment
The regex employed to determine the pegasus version from a URL to a
worker package was insufficient. It only took care of x86 builds.

For e.g. it could not parse the following URL
http://pegasus.isi.edu/mapper/download/nightly/pegasus-worker-2.4.0cvs-ia64_rhas_3.tar.gz
STATIC_BINARY INTEL64::LINUX NULL

This is now fixed.
Related to JIRA PM-33

9) Destination URL construction for worker package staging
Earlier the worker package input files always had third party
URL's, even if the worker package deployment job executed on the
remote site ( in push / pull mode ).

Now, the third party URL's are only constructed if the worker
package deployment job is actually run in third party mode.
In push-pull mode, the destination URL's are file URLs

Tracked in JIRA at http://jira.isi.edu/browse/PM-89

Documentation
--------------

1) User Guides
The Running on different Grids Guide now has information on how to
run workflows using glite.
- Pegasus Replica Selection

The guides are checked in $PEGASUS_HOME/doc/guides

They can be found online at
http://pegasus.isi.edu/mapper/doc.php

2) Property Document was updated with the new properties introduced.