- substantial planner performance improvements for large workflows
- leaf cleanup jobs in the workflow
- new default transfer refiner
- ability to automatically add data flow dependencies
- new mode for runtime clustering
- pegasus-transfer is now multithreaded
- updates to replica catalog backends
- Improved Planner Performance
This release has major performance improvements to the planner that help in planning larger DAXes than before. Additionally, the planner can now optionally log JAVA heap memory usage at the INFO log level at the end of the planning process, if the property pegasus.log.memory.usage is set to true.
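For example, a minimal properties file entry that turns this on (only the property named above is involved; placing it in pegasus.properties is assumed):

    # log JAVA heap usage at INFO level at the end of planning
    pegasus.log.memory.usage = true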
- Leaf Cleanup Jobs
Pegasus now has a new cleanup option called Leaf that adds leaf cleanup jobs symmetric to the create dir jobs. At the end of the workflow, the leaf cleanup jobs remove the directories on the staging site that the create dir jobs created. Leaf cleanup is turned on by passing --cleanup Leaf to pegasus-plan.
Care should be taken while enabling this option for hierarchical workflows. Leaf cleanup jobs will create problems if there are data dependencies between sub workflows in a hierarchical workflow. In that case, the cleanup option needs to be explicitly set to None for the pegasus-plan invocations of the dax jobs in the hierarchical DAX, as sketched below.
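A sketch of the two invocations (the DAX file names, site names and other arguments are placeholders for illustration, not taken from the release notes):

    # top level workflow: enable leaf cleanup on the staging site
    pegasus-plan --dax workflow.dax --sites condorpool --cleanup Leaf --submit

    # pegasus-plan arguments for a dax job in a hierarchical workflow with
    # data dependencies between sub workflows: disable cleanup explicitly
    pegasus-plan --dax sub.dax --sites condorpool --cleanup None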
- New Default Transfer Refiner
This release has a new default transfer refiner called BalancedCluster that does round-robin distribution at the file level instead of the job level while creating clustered stage-in and stage-out jobs. By default, this refiner adds two stage-in and two stage-out jobs per level of the workflow.
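If you need to select a refiner explicitly, a properties sketch (assuming the refiner is chosen via the pegasus.transfer.refiner property; that property name is an assumption here, not stated in the release notes):

    # explicitly request the new default transfer refiner
    pegasus.transfer.refiner = BalancedCluster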
- Planner can automatically infer and add data flow dependencies in the DAG
The planner can now automatically add dependencies on the basis of data dependencies implied by the input and output files of jobs. For example, if job A creates an output file X and job B consumes it, the planner automatically adds a dependency from A to B if one does not already exist.
This feature is turned on by default and can be turned off by setting the property pegasus.parser.dax.data.dependencies to false. More details at https://jira.isi.edu/browse/PM-746
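For example, to rely only on the explicitly declared edges in the DAX, disable the inference in the properties file (property name and value as given above):

    # turn off automatic data flow dependency inference
    pegasus.parser.dax.data.dependencies = false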
- Update to Replica Catalog Backends
The replica catalog backends (File, Regex and JDBCRC) have been updated to treat lfn, pfn mappings that differ only in their pool/handle attribute as distinct entries.
For JDBCRC the database schema has been updated. To migrate an existing JDBCRC backend, users are recommended to use the alter-my-rc.py script located in 'share/pegasus/sql'. Note that you will need to edit the script to update the database name, host, user, and password. Details at https://jira.isi.edu/browse/PM-732
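A sketch of the migration step, assuming the script is run with Python after the connection details have been edited inside it (the exact invocation may differ in your installation):

    # edit database name, host, user and password in the script before running
    python share/pegasus/sql/alter-my-rc.py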
- Improved Credential Handling for data transfers
For data transfer jobs, it is now possible to associate different credentials with a single file transfer (one for the source server and the other for the destination server). This is useful, for example, when leveraging GridFTP transfers between two sites that accept different grid credentials, such as the XSEDE Stampede site and NCSA Bluewaters. In that case, Pegasus picks up the associated credentials from the site catalog entries for the source and destination sites of the transfer.
Also, starting with 4.4, credentials should be associated as Pegasus profiles with the site entries in the site catalog if you want them transferred with the job to the remote site. Details about credential handling in Pegasus can be found in the documentation, and an associated JIRA item describes the improvement. The credential handling support in pegasus-transfer, pegasus-createdir and pegasus-cleanup was also updated.
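As an illustration, a site catalog fragment associating a credential with a site via a Pegasus profile (the X509_USER_PROXY profile key and the proxy path are assumptions for illustration, not taken from the release notes):

    <site handle="stampede" arch="x86_64" os="LINUX">
        <!-- assumed profile key: grid proxy to use for transfers to/from this site -->
        <profile namespace="pegasus" key="X509_USER_PROXY">/path/to/user/proxy</profile>
    </site>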
- New mode for runtime clustering
This release adds a new mode for runtime clustering.
Mode 1: The module groups tasks into clustered jobs such that no clustered job runs longer than the maxruntime input parameter to the module.
Mode 2 (New): The new mode allows users to group tasks into a fixed number of clustered jobs. The module distributes tasks evenly (based on job runtime) across jobs, such that each clustered job takes approximately the same time. This mode is helpful when users know the number of resources available to them at the time of execution.
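For reference, a sketch of the profiles that typically drive runtime clustering in Mode 1 (the runtime and clusters.maxruntime profile keys are assumptions based on the clustering documentation, and the knob selecting the new fixed-count mode is not shown; check the clustering guide for the exact keys):

    <!-- assumed profile key: expected runtime of the task, in seconds -->
    <profile namespace="pegasus" key="runtime">100</profile>
    <!-- assumed profile key: cap on the runtime of a clustered job, in seconds -->
    <profile namespace="pegasus" key="clusters.maxruntime">600</profile>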
- pegasus-transfer is now threaded
pegasus-transfer is now multithreaded. Pegasus exposes two knobs to control the number of threads pegasus-transfer can use, depending on whether you want to control standard transfer jobs or transfers that happen as part of a PegasusLite job. For the former, see the pegasus.transfer.threads property, and for the latter the pegasus.transfer.lite.threads property. In 4.4.0, pegasus.transfer.threads defaults to 2 and pegasus.transfer.lite.threads defaults to 1.
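For example, a properties sketch overriding both defaults (property names as given above; the values are just illustrative):

    # threads used by standard clustered transfer jobs
    pegasus.transfer.threads = 4
    # threads used by transfers that run inside PegasusLite jobs
    pegasus.transfer.lite.threads = 2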
- pegasus-analyzer recurses into subworkflows
pegasus-analyzer has a --recurse option that makes it automatically recurse into failed sub workflows. By default, if a workflow has a sub workflow in it and that sub workflow fails, pegasus-analyzer reports that the sub workflow node failed and lists a command invocation that the user must execute to determine which jobs in the sub workflow failed. If this option is set, the analyzer automatically issues that command invocation and, in addition, displays the failed jobs in the sub workflow. Details at https://jira.isi.edu/browse/PM-730
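For example (the submit directory path is a placeholder):

    # analyze a failed hierarchical workflow, descending into failed sub workflows
    pegasus-analyzer --recurse /path/to/submit/dir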
- Support for Fixed Output Mapper
Using this output mapper, users can specify an externally accessible URL in the properties file, pointing to a directory where the output files need to be transferred. To use this mapper, set the following properties:

    pegasus.dir.storage.mapper Fixed
    pegasus.dir.storage.mapper.fixed.url <url to the storage directory e.g. gsiftp://outputs.isi.edu/shared/outputs>
- Extra ways for user applications to flag errors
CondorG does not propagate exit codes correctly from GRAM. As a result, a job in a Pegasus workflow that is not launched via pegasus-kickstart may not have the right exit code propagated from the user application -> GRAM -> CondorG -> workflow. For example, MPI jobs in Pegasus are never launched using pegasus-kickstart. The usual way of handling this is to have a wrapper script that detects failure, and to have the postscript fail on the basis of the message logged.
Starting with 4.4.0, Pegasus provides a mechanism for the job to log something on stdout/stderr that designates failure, which obviates the need for a wrapper script. Users can associate two pegasus profiles with the jobs:
- exitcode.failuremsg - the message string that pegasus-exitcode searches for in the stdout and stderr of the job to flag failures.
- exitcode.successmsg - the message string that pegasus-exitcode searches for in the stdout and stderr of the job to determine whether the job logged its success message. Note that this value is used to check whether a job failed, i.e. if this profile is specified and pegasus-exitcode DOES NOT find the string in the job stdout or stderr, the job is flagged as failed. The complete rules for determining failure are described in the man page for pegasus-exitcode. An example is sketched below.
More details at http://jira.isi.edu/browse/PM-737
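For illustration, a DAX fragment associating these profiles with a job (the job element, the identifiers and the message strings are illustrative assumptions, not taken from the release notes):

    <job id="ID0000001" namespace="example" name="mpi_app" version="1.0">
        <!-- flag the job as failed if this string appears in stdout/stderr -->
        <profile namespace="pegasus" key="exitcode.failuremsg">FATAL ERROR</profile>
        <!-- flag the job as failed if this string is NOT found in stdout/stderr -->
        <profile namespace="pegasus" key="exitcode.successmsg">Execution completed</profile>
    </job>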
- Updated examples for Glite submission directly to local PBS
The 4.4.0 release has improvements for the submission of workflows directly to local PBS using the Condor Glite interfaces. How to use this through Pegasus is documented at
http://pegasus.isi.edu/wms/docs/4.4.0/execution_environments.php#glite
It is important to note that to use this, you need to take the pbs_local_attributes.sh file shipped with Pegasus in the share/pegasus/htcondor/glite directory and put it in the glite bin directory of your Condor installation. Additionally, there is a new example in the examples directory that illustrates how to execute an MPI job using this submission mechanism through Pegasus.
- Finer grained specification of linux versions for worker package staging
The planner now has added logic that lets users specify finer grained linux versions for which to stage the worker package.
Users can now specify the osrelease and osversion attributes in the site catalog, e.g.

    <site handle="exec-site" arch="x86_64" os="LINUX" osrelease="deb" osversion="7">

If a supported release and version combination is not specified, the planner throws a warning and defaults to the default combination for the OS. More details at https://jira.isi.edu/browse/PM-732
- pegasus-kickstart can now copy all of the application's stdio if -B all is passed
Added an option to capture all stdio; this is a feature that HUBzero requested. Kickstart will now copy all stdout and stderr of the job into the invocation record if the user specifies '-B all'.
- Tutorial includes pegasus-dashboard
The tutorial comes configured with pegasus-dashboard.
- Improved formatting of extra long values for pegasus-statistics
More details at https://jira.isi.edu/browse/PM-744
- Changed timeout parameters for pegasus-gridftp
Increased the timeout parameter for GridFTPClient to 60 seconds; the globus jar defaults to 30 seconds. The timeout was increased to ensure that transfers don't fail against heavily loaded GridFTP servers.
Bugs Fixed
- IRODS support in pegasus-transfer and pegasus-createdir was broken
The irods mkdir command got the wrong path when invoked by pegasus-transfer. This is now fixed.
- Data reuse algorithm does not cascade the deletion upwards
In certain cases, the cascading of deletion in data reuse did not happen completely. This is now fixed. More details at https://jira.isi.edu/browse/PM-742
- Improved argument management for PMC
This was done to address the case where a task has quoted arguments with spaces.
- Clusters of size 1 should be allowed when using PMC
For label based clustering with PMC, single node clusters are now allowed. This is important because in some cases PMC jobs may have been set up to work with the relevant Globus profiles.
- Non-ASCII characters in application stdout broke parsing in monitord
The URL quoting logic was updated to encode unicode strings as UTF-8 before the string is passed to the quote function. More details at
- Removing a workflow using pegasus-remove does not update the stampede database
If you removed a running workflow using pegasus-remove, the stampede database was not updated to reflect that the workflow failed. Changes were made to pegasus-dagman to ensure that pegasus-monitord gets 100 seconds to complete the population before being sent a kill signal.