===============================
Release Notes for PEGASUS 2.4.0
===============================

NEW FEATURES
--------------

1) Support for Pegasus DAX 3.0

   Pegasus can now also accept DAXes in the Pegasus 3.0 format.

   Some salient features of the new format are
   - Users can specify the locations of the files in the DAX.
   - Users can specify what executables to use in the DAX.
   - Users can specify a sub DAX in the DAX using the dax element.
     Each dax job results in a separate subworkflow being launched,
     with the appropriate pegasus-plan command as the prescript.
   - Users can specify Condor DAGs in the DAX using the dag element.
     The dag job is passed on to Condor DAGMan as a SUBDAG for
     execution.

   A sample 3.0 DAX can be found at
   http://pegasus.isi.edu/mapper/docs/schemas/dax-3.0/two_node_dax-3.0_v6.xml

   In the next Pegasus release ( Pegasus 3.0 ) a Java DAX API will be
   made available, and further extensions will be added to the schema.
   For feature requests email firstname.lastname@example.org

2) Support for running workflows on EC2 using S3 for storage

   Users running on Amazon EC2 can use S3 as the storage backend for
   workflow execution. The details below assume that the user
   configures a Condor pool on the nodes allocated from EC2.

   To enable Pegasus for S3 the following properties need to be set.
   - pegasus.execute.*.filesystem.local = true
   - pegasus.transfer.*.impl = S3
   - pegasus.transfer.sls.*.impl = S3
   - pegasus.dir.create.impl = S3
   - pegasus.file.cleanup.impl = S3
   - pegasus.gridstart = SeqExec
   - pegasus.transfer.sls.s3.stage.sls.file = false

   For data stage-in and for creating S3 buckets for workflows, Pegasus
   relies on the Amazon provided s3cmd command line client. Pegasus
   looks for a transformation with namespace amazon and logical name
   s3cmd in the transformation catalog to determine the location of the
   s3cmd client.
   For example, in the file-based Transformation Catalog the full name
   of the transformation is amazon::s3cmd

   To enable stdout and stderr streaming correctly from Condor on EC2,
   we recommend adding certain profiles in the site catalog for the
   cloud site. Here is a sample site catalog entry

   <site handle="ec2" sysinfo="INTEL32::LINUX">
     <profile namespace="env" key="PEGASUS_HOME">/usr/local/pegasus/default</profile>
     <profile namespace="env" key="GLOBUS_LOCATION">/usr/local/globus/default</profile>
     <profile namespace="env" key="LD_LIBRARY_PATH">/usr/local/globus/default/lib</profile>

     <!-- the directory where a user wants to run the jobs on the nodes retrieved from ec2 -->
     <profile namespace="env" key="wntmp">/mnt</profile>

     <profile namespace="pegasus" key="style">condor</profile>

     <!-- to be set to ensure condor streams stdout and stderr back to submit host -->
     <profile namespace="condor" key="should_transfer_files">YES</profile>
     <profile namespace="condor" key="transfer_output">true</profile>
     <profile namespace="condor" key="transfer_error">true</profile>
     <profile namespace="condor" key="WhenToTransferOutput">ON_EXIT</profile>

     <profile namespace="condor" key="universe">vanilla</profile>
     <profile namespace="condor" key="requirements">(Arch==Arch)&amp;&amp;(Disk!=0)&amp;&amp;(Memory!=0)&amp;&amp;(OpSys==OpSys)&amp;&amp;(FileSystemDomain!="")</profile>
     <profile namespace="condor" key="rank">SlotID</profile>

     <lrc url="rls://example.com"/>
     <gridftp url="s3://" storage="" major="2" minor="4" patch="3"/>
     <jobmanager universe="vanilla" url="example.com/jobmanager-pbs" major="2" minor="4" patch="3"/>
     <jobmanager universe="transfer" url="example.com/jobmanager-fork" major="2" minor="4" patch="3"/>

     <!-- create a new bucket for each wf
     <workdirectory >/</workdirectory>
     -->

     <!-- use an existing bucket -->
     <workdirectory>existing-bucket</workdirectory>
   </site>

   Relevant JIRA links
   http://jira.pegasus.isi.edu/browse/PM-68
   http://jira.pegasus.isi.edu/browse/PM-20
   http://jira.pegasus.isi.edu/browse/PM-85

3)
pegasus-analyzer

   There is a new tool called pegasus-analyzer. It helps users analyze
   a workflow after it has finished executing. It is not meant to be
   run while the workflow is still running. To track the status of a
   running workflow, users are for now advised to use pegasus-status.

   pegasus-analyzer looks at the workflow submit directory and parses
   the Condor DAGMan logs and the job.out files to print a summary of
   the workflow execution.

   The tool prints out the following summary of the workflow
   - Total jobs
   - jobs succeeded
   - jobs failed
   - jobs unsubmitted

   For all the failed jobs the tool prints out the contents of the
   job.out and job.err files.

   The user can use the --quiet option to display only the paths to
   the .out and .err files. This is useful when the job output is
   particularly big or when kickstart is used to launch the jobs.

   For Pegasus 3.0 the tool will be updated to parse kickstart output
   files and provide a concise view rather than displaying the whole
   output.

4) Support for Condor Glite

   Pegasus now supports a new style named glite for generating the
   submit files. This allows Pegasus to create submit files for a
   glite environment where a glite blahp talks to the scheduler
   instead of GRAM. At a minimum the following profiles need to be
   associated with the job.
   - pegasus profile style - value set to glite
   - condor profile grid_resource - value set to the remote scheduler
     that the glite blahp talks to

   This style should only be used when the Condor on the submit host
   can talk directly to the scheduler running on the cluster. In the
   Pegasus site catalog there should be a separate compute site that
   has this style associated with it. This style should not be
   specified for the local site.
   As part of applying the style to the job, this style adds the
   following classad expressions to the job description
   - +remote_queue - value picked up from globus profile queue
   - +remote_cerequirements - See below

   The remote CE requirements are constructed from the following
   profiles associated with the job. The profiles for a job are derived
   from various sources
   - user properties
   - transformation catalog
   - site catalog
   - DAX

   Note that it is up to the user to specify these or a subset of them.

   The following globus profiles, if associated with the job, are
   picked up
   - hostcount   -> PROCS
   - count       -> NODES
   - maxwalltime -> WALLTIME

   The following condor profiles, if associated with the job, are
   picked up
   - priority    -> PRIORITY

   All the env profiles are translated to MYENV

   For example, the expression in the submit file may look like

   +remote_cerequirements = "PROCS==18 && NODES==1 && PRIORITY==10 && WALLTIME==3600 && PASSENV==1 && JOBNAME==\"TEST JOB\" && MYENV ==\"FOO=BAR,HOME=/home/user\""

   Jobs that have this style applied don't have a remote directory
   specified in the submit directory. They rely on kickstart to change
   to the working directory when the job is launched on the remote
   node.

5) Generating a site catalog for OSG using OSGMM

   The pegasus-get-sites tool has been modified to query the OSGMM
   ( OSG Match Maker ) to generate a site catalog for a VO.

   It builds upon the earlier Engage implementation, which has now been
   generalized and renamed to OSGMM. The source option to
   pegasus-get-sites now needs to be OSGMM instead of Engage.

   Some of the changes are
   - The Condor collector host can be specified on the command line or
     in properties by specifying the property
     pegasus.catalog.site.osgmm.collector.host . It defaults to
     ligo-osgmm.renci.org
     If a user is part of the Engage VO they should set
     pegasus.catalog.site.osgmm.collector.host=engage-central.renci.org
   - The default VO used is LIGO.
     It can be overridden by specifying the --vo option to
     pegasus-get-sites, or by specifying the property
     pegasus.catalog.site.osgmm.vo
   - By default the implementation always returns validated sites. To
     retrieve all sites for a VO set
     pegasus.catalog.site.osgmm.retrieve.validated.sites to false.
   - If multiple gatekeepers are associated with the same OSG site,
     multiple site catalog entries are created in the site catalog. A
     suffix is added to the extra sites (__index , where index starts
     from 1)

   Sample usage
   pegasus-get-sites --source OSGMM --sc osg-sites.xml --vo LIGO --grid OSG

   Tracked in JIRA at http://pegasus.isi.edu/jira/browse/PM-67

   Currently, there is no way to filter sites according to the grid
   ( OSG|OSG-ITB ) in OSGMM.

   The generated site catalog has storage directories that have a VO
   component in them.

6) Generating a site catalog for OSG using MYOSG

   pegasus-get-sites has now been modified to generate a site catalog
   by querying MyOSG. To use MYOSG as the backend, the source option
   needs to be set to MYOSG.

   Sample usage
   pegasus-get-sites --source MYOSG --sc myosg-sites-new.xml -vvvvv --vo ligo --grid osg

   This was tracked in JIRA http://pegasus.isi.edu/jira/browse/PM-61

   The Pegasus Team recommends using OSGMM for generating a site
   catalog.

7) Separation of Symlink and Stagein Transfer Jobs

   The following transfer refiners
   - Default
   - Bundle
   - Cluster
   now support the separation of the symlink jobs from the stage-in
   jobs. When using these refiners, the files that need to be symlinked
   against existing files on a compute site get a separate symlink job.
   The files that actually need to be copied to a remote site appear in
   the stage_in_ jobs.

   This distinction allows users to stage in data using third party
   transfers that run on the submit host, and at the same time symlink
   against existing datasets.

   The symlink jobs run on the remote compute sites.
   Earlier this was not possible, so a user who wanted symlinking had
   to turn off third party transfers. This resulted in an increased
   load on the head node, as the stage-in jobs executed there.

   By default, Pegasus uses the transfer executable shipped with the
   worker package to do the symbolic linking. If the user wants to
   change the executable used, they can set the following property
   pegasus.transfer.symlink.impl

   The above also allows separate executables to be used for staging in
   data and for symbolic linking. For example, GUC can be used to stage
   in data by setting
   pegasus.transfer.stagein.impl GUC

   To control the symlinking granularity in the Bundle and Cluster
   refiners, the following Pegasus profile keys can be associated
   bundle.symlink
   cluster.symlink

   The feature implementation was tracked in JIRA at
   http://pegasus.isi.edu/jira/browse/PM-54

8) Bypassing First Level Staging of Files for worker node execution

   Pegasus can now bypass first level staging if the input files in
   the replica catalog have a pool attribute matching the site at which
   a job is being run. This applies in the case of worker node
   execution.

   The cache file generated in the submit directory is the transient
   replica catalog. It now also records the locations where the input
   files are staged on the remote sites. Earlier it listed only the
   files that were generated by the workflow.

   Tracked in JIRA here
   http://pegasus.isi.edu/jira/browse/PM-20
   http://pegasus.isi.edu/jira/browse/PM-62

9) Resolving SRM URLs to file URLs on a filesystem

   There is now support for resolving the SRM URLs in the replica
   catalog to the file URL on a site. The user needs to specify the URL
   prefix and the mount point of the filesystem.

   This can be done by specifying the properties
   pegasus.transfer.srm.[sitename].service.url
   pegasus.transfer.srm.[sitename].service.mountpoint

   Pegasus will then map SRM URLs associated with the site to a path
   on the filesystem by replacing the service URL component with the
   mount point.
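   The prefix replacement described above can be sketched in a few
   lines of Python. This is only an illustration of the mapping rule,
   not the Pegasus implementation: the function name resolve_srm_url
   and its handling of slashes and of non-matching URLs are
   assumptions, and the values are the ligo-cit settings used in this
   section of the notes.

```python
# Hypothetical sketch of the SRM URL -> filesystem path mapping.
# Only the two property values come from the release notes; the
# function itself is an assumption for illustration.

SERVICE_URL = "srm://osg-se.ligo.caltech.edu:10443/srm/v2/server?SFN=/mnt/hadoop"
MOUNT_POINT = "/mnt/hadoop/"


def resolve_srm_url(url, service_url, mount_point):
    """Replace the service URL prefix of an SRM URL with the mount point."""
    if not url.startswith(service_url):
        # URL is not covered by this site's mapping; leave it untouched.
        return url
    remainder = url[len(service_url):]
    # Join with exactly one slash between mount point and remainder.
    return mount_point.rstrip("/") + "/" + remainder.lstrip("/")


resolved = resolve_srm_url(
    SERVICE_URL + "/ligo/frames/S5/test.gwf", SERVICE_URL, MOUNT_POINT
)
print(resolved)
```

   A URL that does not start with the configured service URL is
   returned unchanged in this sketch, so it would still be handled as
   an ordinary remote URL.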
   For example, if the user has specified
   pegasus.transfer.srm.ligo-cit.service.url srm://osg-se.ligo.caltech.edu:10443/srm/v2/server?SFN=/mnt/hadoop
   pegasus.transfer.srm.ligo-cit.service.mountpoint /mnt/hadoop/

   then the URL
   srm://osg-se.ligo.caltech.edu:10443/srm/v2/server?SFN=/mnt/hadoop/ligo/frames/S5/test.gwf
   will resolve to
   /mnt/hadoop/ligo/frames/S5/test.gwf

10) New Transfer implementation Symlink

   Pegasus now supports a perl executable called symlink, shipped with
   the Pegasus worker package, that can be used to create multiple
   symlinks against input datasets in a single invocation.

   The Transfer implementation that uses the transfer executable has
   the same functionality. However, the transfer executable complains
   if it cannot find the Globus client libraries.

   To use this executable for the symlink jobs, users need to set the
   following property
   pegasus.transfer.symlink.impl Symlink

   Later on ( Pegasus 3.0 release onwards ) this will be made the
   default executable for symlinking jobs.

11) Passing options forward to pegasus-run in pegasus-plan

   Users can now forward options to the pegasus-run invocation that is
   used to submit the workflow after a successful mapping.

   pegasus-plan has a --forward option[=value] option for this purpose.
   For example, the nogrid option can be passed to pegasus-run as
   follows
   pegasus-plan --forward nogrid

   The option can be repeated to forward multiple options to
   pegasus-run. The long option name should always be used for the
   pegasus-run option being forwarded.

12) Passing extra arguments to SLS transfer implementations

   Users can now specify pegasus.transfer.sls.arguments to pass extra
   options at runtime to the SLS implementations used by Pegasus. The
   following SLS transfer implementations accept the above property.
   - S3
   - Transfer

13) Passing non standard java options to dax jobs in DAX 3.0

   The non standard JVM options ( -X[option] ) can now be specified for
   the sub workflows in the arguments section of the dax jobs.

   For example, for a dax job the user can set the java max heap size
   to 1024m by specifying -Xmx1024m in the arguments for the dax job.

14) Location of Condor Logs directory on the submit host

   By default, Pegasus designates the Condor logs to be created in the
   /tmp directory. This is done to ensure that the logs are created in
   a local directory even though the submit directory may be on NFS.
   The submit directory contains a symbolic link to the appropriate log
   file in /tmp.

   However, since /tmp is automatically purged in most cases, users may
   want to preserve their Condor logs in a directory on the local
   filesystem other than /tmp.

   The new property pegasus.dir.submit.logs allows a user to designate
   the logs directory on the submit host for Condor logs.

15) Removing profile keys as part of overriding profiles

   There is now a notion of empty profile key values in Pegasus. The
   default action on an empty key value is to remove the key. Currently
   the following namespaces follow this convention
   - Condor
   - Globus
   - Pegasus

   This allows a user to unset values as part of overriding profiles.
   Normally a user can only update a profile value, i.e. they can
   change the value of a key, but the key remains associated with the
   job. An empty value allows the user to remove the key from the
   profile namespace altogether.

   For example, a user may have a profile X set in the site catalog,
   but not want that profile key to be used for a particular job. They
   can now specify the same profile X with an empty value in the
   transformation catalog for that job. This results in the profile key
   X being removed from the job.

16) Constructing Paths to Condor DAGMan for recursive/hierarchical workflows

   The entry for condor::dagman is no longer required for site local in
   the transformation catalog.
   Instead, Pegasus constructs the path from the following environment
   variables: CONDOR_HOME, CONDOR_LOCATION

   The priority order is as follows
   1) CONDOR_HOME defined in the environment
   2) CONDOR_LOCATION defined in the environment
   3) entry for condor::dagman for site local

   This is useful when running workflows that refer to sub workflows,
   as in the new DAX 3.0 format.

   This was tracked in JIRA http://pegasus.isi.edu/jira/browse/PM-50

17) Constructing path to kickstart

   By default the path to kickstart is determined on the basis of the
   environment variable PEGASUS_HOME associated with a site entry in
   the site catalog. However, in some cases a user might want to use
   their own modified version of kickstart.

   To enable that, the path to kickstart is constructed according to
   the following rules
   1) The pegasus profile gridstart.path specified in the site catalog
      for the site in question is used.
   2) If 1 is not specified, a path is constructed on the basis of the
      environment variable PEGASUS_HOME for the site in the site
      catalog.

   The above was tracked in JIRA
   http://pegasus.isi.edu/jira/browse/PM-60

18) Bulk Lookups to Replica Catalog using rc-client

   rc-client can now do bulk lookups, similar to how it does bulk
   inserts and deletes.

   Details at http://jira.pegasus.isi.edu/browse/PM-75

19) Additions to show-job workflow visualization script

   show-job now has a --title option to add a user provided title to
   the generated gantt chart.

   show-job can also visualize workflows of workflows.

20) Absolute paths for certain properties in the properties file

   The properties file that is written out in the submit directory now
   has absolute paths specified for the following property values.
   pegasus.catalog.replica.file
   pegasus.catalog.transformation.file
   pegasus.catalog.site.file

   This holds even if the user specified relative paths in their
   properties file.
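   The effect on the written properties can be illustrated with a small
   Python sketch. It is not the Pegasus code: the function name
   absolutize_catalog_paths is hypothetical, and the base directory
   against which relative paths are resolved is an assumption, since
   these notes do not state which directory Pegasus uses.

```python
import posixpath  # submit hosts are assumed to use POSIX style paths

# The three property keys named in the release notes.
CATALOG_FILE_KEYS = (
    "pegasus.catalog.replica.file",
    "pegasus.catalog.transformation.file",
    "pegasus.catalog.site.file",
)


def absolutize_catalog_paths(props, base_dir):
    """Return a copy of props with the catalog file paths made absolute.

    base_dir is a hypothetical parameter: the directory that relative
    paths are taken to be relative to.
    """
    result = dict(props)
    for key in CATALOG_FILE_KEYS:
        value = result.get(key)
        if value:
            # join() leaves an already absolute value unchanged.
            result[key] = posixpath.normpath(posixpath.join(base_dir, value))
    return result


user_props = {
    "pegasus.catalog.site.file": "conf/sites.xml",
    "pegasus.catalog.replica.file": "/data/rc.data",
    "pegasus.dir.submit.logs": "logs",
}
written = absolutize_catalog_paths(user_props, "/home/user/run")
for key in sorted(written):
    print(key, written[key])
```

   Only the three catalog file properties are rewritten in this sketch;
   every other property is copied through unchanged.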
21) The default horizontal clustering factor

   The default clustering factor ( collapse ) is now 1, instead of the
   earlier value of 3. This ensures that users can cluster only jobs of
   certain types and let the others remain unclustered. The alternative
   was to specify the collapse factor as 1 explicitly for jobs that
   users did not want clustered.

BUGS FIXED
----------

1) Handling of standard universe in condor style

   In Condor style, the standard universe, if specified for a job, is
   ONLY associated with compute jobs. This ensures that Pegasus
   auxiliary jobs never execute in the standard universe.

2) Bug Fix for replica selection bug 43

   Checked in the fix for JIRA bug 43
   http://pegasus.isi.edu/jira/browse/PM-43

   The ReplicaLocation class now has a clone method that does a shallow
   clone. This clone method is called in the selectReplica methods in
   the replica selectors.

3) rc-client did not implement pegasus.catalog.replica.lrc.ignore property

   This is now fixed. This bug was tracked in JIRA
   http://pegasus.isi.edu/jira/browse/PM-42

4) DAXes created while partitioning a workflow

   During the partitioning of workflows, the DAX for a partition was
   created incorrectly because the register flags were not correctly
   parsed by the VDL DAX parser.

   This was tracked in JIRA
   http://pegasus.isi.edu/jira/browse/PM-48

5) Handling of initialdir and remote_initialdir keys

   Changed the internal handling of the initialdir and
   remote_initialdir keys. The initialdir key is now only associated
   with standard universe jobs. For glidein and condor style we now
   associate remote_initialdir unless it is a standard universe job.

   This was tracked in JIRA
   http://pegasus.isi.edu/jira/browse/PM-58

6) Querying RLI for non existent LFN using rc-client

   rc-client had inconsistent behavior when querying the RLI for an LFN
   that does not exist in the RLS. This affected the rc-client lookup
   command option.
   Details at http://jira.pegasus.isi.edu/browse/PM-74

7) Running clustered jobs on the cloud in directory other than /tmp

   There was a bug whereby clustered jobs executing in worker node
   execution mode did not honor the wntmp environment variable
   specified in the Site Catalog for the site.

   The bug fix was tracked through JIRA
   http://jira.isi.edu/browse/PM-83

8) Bug Fix for worker package deployment

   The regex employed to determine the Pegasus version from a URL to a
   worker package was insufficient. It only took care of x86 builds.

   For example, it could not parse the following URL
   http://pegasus.isi.edu/mapper/download/nightly/pegasus-worker-2.4.0cvs-ia64_rhas_3.tar.gz STATIC_BINARY INTEL64::LINUX NULL

   This is now fixed. Related to JIRA PM-33

9) Destination URL construction for worker package staging

   Earlier the worker package input files always had third party URLs,
   even if the worker package deployment job executed on the remote
   site ( in push / pull mode ).

   Now, the third party URLs are only constructed if the worker package
   deployment job is actually run in third party mode. In push / pull
   mode, the destination URLs are file URLs.

   Tracked in JIRA at http://jira.isi.edu/browse/PM-89

Documentation
--------------

1) User Guides

   The Running on different Grids Guide now has information on how to
   run workflows using glite.

   - Pegasus Replica Selection

   The guides are checked in at $PEGASUS_HOME/doc/guides
   They can be found online at http://pegasus.isi.edu/mapper/doc.php

2) The Property Document was updated with the new properties
   introduced.