## Pegasus 4.4.x Series

### Pegasus 4.4.2

**Release Date:** March 16, 2015

We are happy to announce the release of Pegasus 4.4.2. Pegasus 4.4.2 is a minor release, which contains minor enhancements and bug fixes to the Pegasus 4.4.1 release.

#### Enhancements

1) Support for recursive clustering

Pegasus now supports recursive clustering, where users can employ multiple clustering techniques on the same graph. For example, a user can do label based clustering on the graph and then do level based clustering. More details at PM-817 [\#935](https://github.com/pegasus-isi/pegasus/issues/935)

2) Planner reports file breakdowns in the metrics sent to the metrics server

The planner now sends file breakdowns (number of input, intermediate and output files) as part of the metrics message to the metrics server. This is also reported in the metrics file left in the submit directory.

3) pegasus-transfer does not hide scp errors

4) A more helpful message is thrown if the user does not set the pegasus.catalog.site.file property

#### Bugs Fixed

1) work dir in the job instance table was populated incorrectly

The work directory in the job instance table of the monitoring database was populated with the submit directory instead of the directory in which the job was executed. This resulted in pegasus-analyzer displaying the submit directory for a failed job instead of the directory in which the job actually ran on the remote node. More details at PM-817 [\#935](https://github.com/pegasus-isi/pegasus/issues/935)

2) pegasus-status showed the master dag job as failed as well

When a job in a workflow failed, pegasus-status also included the corresponding dag job as failed. This led to it reporting one more than the actual number of user compute jobs that failed. This is now fixed. More details at PM-811 [\#929](https://github.com/pegasus-isi/pegasus/issues/929)

3) local-scratch directory not picked up for PegasusLite jobs

Users can specify a local-scratch directory for a site in the site catalog to designate the local directory on the worker node where a PegasusLite job should run. However, this was not picked up by the planner and set for the jobs. This is now fixed. This only works when the user is executing workflows in nonsharedfs mode.

4) pegasus dashboard tables were not updated in real time

Fixed an off by one error in the flush command, where we decide whether to flush an event to the database or batch it up. This off by one error affected the pegasus dashboard, as the workflow start and end events were not pushed to the database by pegasus-monitord when they happened. This is now fixed.

5) Input files in the DAX whose transfer flag is set to false should not be considered for cleanup, as they are never staged to the scratch directory on the staging site by the stage-in jobs

6) pegasus.gridstart.arguments was not set for all clustered jobs

User provided extra arguments for the kickstart invocation were not passed to all the constituent jobs making up a job cluster. This is now fixed (see the property sketch after this list). More details at PM-823 [\#941](https://github.com/pegasus-isi/pegasus/issues/941)

7) MPI_ERR_TRUNCATE: message truncated in PMC

This error was encountered under certain conditions and was the result of a mismatched tag/source between an MPI_Probe/MPI_Iprobe and MPI_Recv. This is now fixed. More details at PM-848 [\#966](https://github.com/pegasus-isi/pegasus/issues/966)

8) Setting pegasus.catalog.site to XML4 raised an error

Introduced backward compatibility for this. More details at PM-815 [\#933](https://github.com/pegasus-isi/pegasus/issues/933)

9) pegasus-plan --help resulted in metrics being sent

This is now fixed. More details at PM-816 [\#934](https://github.com/pegasus-isi/pegasus/issues/934)
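The two properties touched by enhancement 4 and bug 6 above are plain entries in the Pegasus properties file. Below is a minimal, illustrative sketch; the site catalog path is a placeholder, and the kickstart argument shown is just one example value (the -B all option described in the 4.4.0 notes further down):

```bash
# Hypothetical pegasus.properties entries (paths and values are placeholders)
cat >> pegasus.properties <<'EOF'
# Enhancement 4: point the planner explicitly at the site catalog
pegasus.catalog.site.file = /path/to/sites.xml

# Bug 6: extra arguments for pegasus-kickstart, now propagated to every
# constituent job of a clustered job
pegasus.gridstart.arguments = -B all
EOF
```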
### Pegasus 4.4.1

**Release Date:** December 19, 2014

We are happy to announce the release of Pegasus 4.4.1. Pegasus 4.4.1 is a minor release, which contains minor enhancements and bug fixes to the Pegasus 4.4.0 release.

#### Enhancements

1) Leaf cleanup job failures don't trigger workflow failures

2) Finer grained capturing of GridFTP errors

Moved to only ignoring common failures of GridFTP removals, instead of ignoring all errors.

3) pegasus-transfer threading enhancements

Allow two retries with threading before falling back on single-threaded transfers. This prevents pegasus-transfer from overwhelming remote file servers when failures happen.

4) Support for MPI jobs when submitting using Glite to PBS

For user specified MPI jobs in the DAX, the only way to ensure that the MPI job launches in the right directory through Glite and blahp is to have a wrapper around the user MPI job and refer to that in the transformation catalog. The wrapper should cd into the directory set by Pegasus in the job's environment through the _PEGASUS_SCRATCH_DIR environment variable (a wrapper sketch appears after this list).

5) Updated quoting support for glite jobs

Quoting in the blahp layer in Condor for glite jobs is broken. Fixes were made to the planner and the pbs_local_submit_attributes.sh file so that environment variable values can contain spaces or double quotes. The fix relies on users putting the pbs_local_submit_attributes.sh from the Pegasus distribution into the Condor glite bin directory. More details at PM-802 [\#920](https://github.com/pegasus-isi/pegasus/issues/920)

6) pegasus-s3 now has support for copying objects larger than 5GB

7) pegasus-tc-converter code was cleaned up; support for a database backed transformation catalog was dropped

8) The planner now complains about deep LFNs when using Condor file transfers

9) The planner stack trace is enabled for errors with a single -v (i.e. INFO message level or higher)

More details at PM-800 [\#918](https://github.com/pegasus-isi/pegasus/issues/918)
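A minimal sketch of such a wrapper for item 4. The MPI launcher and executable name are placeholders; this is the script you would reference in the transformation catalog instead of the MPI binary itself:

```bash
#!/bin/bash
# Hypothetical wrapper around a user MPI code for Glite/PBS submission.
# Pegasus sets _PEGASUS_SCRATCH_DIR in the job's environment (item 4 above);
# the wrapper changes into that directory before launching the real binary.
cd "$_PEGASUS_SCRATCH_DIR" || exit 1
exec mpiexec ./my-mpi-app "$@"   # my-mpi-app is a placeholder for the user's MPI executable
```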
#### Bugs Fixed

1) Change in how monitord parses job output and error files

Earlier, pegasus-monitord had a race condition, as it tried to parse the .out and .err files when a JOB_FAILURE or JOB_SUCCESS happened instead of doing it at the POST_SCRIPT_SUCCESS or POST_SCRIPT_FAILURE message if a postscript was associated with the job. This resulted in it detecting empty kickstart output files, as the postscript might have moved the file before monitord opened a file handle to it. The fix changed the monitord logic to parse the files on JOB_FAILURE or JOB_SUCCESS only if no postscript is associated with the job. More details at PM-793 [\#911](https://github.com/pegasus-isi/pegasus/issues/911)

2) pegasus-monitord did not handle aborted jobs well

For aborted jobs that failed with a signal, monitord did not parse the job status. Because of that, no corresponding JOB_FAILURE was recorded, and hence the exitcode for the inv.end event was not recorded. PM-805 [\#923](https://github.com/pegasus-isi/pegasus/issues/923)

3) A set of portability fixes from the Debian packaging were incorporated into Pegasus builds

4) Clusters of size 1 should be allowed when using PMC

An earlier fix for 4.4.0 allowed single jobs to be clustered using PMC. However, this also caused regular MPI jobs that should not be clustered to be clustered using PMC. The logic was updated to only wrap a single job with PMC if label based clustering is turned on and the job is associated with a label. More details at PM-745 [\#863](https://github.com/pegasus-isi/pegasus/issues/863)

5) Round robin site selector did not do correct distribution

The selector was not distributing the jobs round robin at each level as it was supposed to. More details at PM-775 [\#893](https://github.com/pegasus-isi/pegasus/issues/893)

6) Based on user configuration, the leaf cleanup jobs tried to delete the submit directory for the workflow

A user can configure a workflow such that the workflow submit directory and the workflow scratch directory are the same on the local site. This can result in stuck workflows if leaf cleanup jobs are enabled. The planner now throws an error during planning if it detects that the directories are the same. More details at PM-773 [\#891](https://github.com/pegasus-isi/pegasus/issues/891)

7) pegasus-cleanup needs to add wildcards to s3:// URLs when --recursive is used

More details at PM-790 [\#908](https://github.com/pegasus-isi/pegasus/issues/908)

8) Leaf cleanup jobs delete a directory that a workflow corresponding to a dax job may require

For hierarchical workflows, there may be a case where the jobs that make up the workflow referred to by a subdax job run in a child directory of the scratch directory in which the jobs of the top level workflow are running. With leaf cleanup enabled, the parent scratch directory may be cleaned before the subdax job has completed. The fix involved putting explicit dependencies between the leaf cleanup jobs and the subdax jobs. More details at PM-795 [\#913](https://github.com/pegasus-isi/pegasus/issues/913)

9) pegasus-analyzer did not show the planner prescript log for failed subdax jobs

For prescript failures for subdax jobs (i.e. the failure of the planning operation on the sub workflow), pegasus-analyzer never showed the content of the log. It only pointed to the location of the log in the submit directory. This is now fixed. PM-808 [\#926](https://github.com/pegasus-isi/pegasus/issues/926)

10) pegasus-analyzer shows job stderr for failed PegasusLite jobs

When a PegasusLite job fails, pegasus-analyzer showed stderr from both the kickstart record and the job stderr. This was pretty confusing, as stderr for those jobs is used to log all kinds of PegasusLite information and usually has nothing to do with the failure. To make these jobs easier to debug, we added logic to only show the kickstart stderr in these cases. More details at PM-798 [\#916](https://github.com/pegasus-isi/pegasus/issues/916)

11) Planner did not validate the pegasus.data.configuration value

As a result, a typo in the properties file made the planner fail with an NPE. More details at PM-799 [\#917](https://github.com/pegasus-isi/pegasus/issues/917)

12) pegasus-statistics output padding

Value padding is done only for text output files so that they are human readable. However, due to a bug, the value padding computation was at one point in the code also done for CSV files. This caused an exception when the output file type for job statistics was CSV.
### Pegasus 4.4.0

**Release Date:** July 9, 2014

We are happy to announce the release of Pegasus 4.4.0. Pegasus 4.4.0 is a major release of Pegasus which contains all the enhancements and bugfixes in 4.3.2.

New features and improvements in 4.4.0 include

- substantial performance improvements in the planner for large workflows
- leaf cleanup jobs in the workflow
- a new default transfer refiner
- the ability to automatically add data flow dependencies
- a new mode for runtime clustering
- pegasus-transfer is now multithreaded
- updates to the replica catalog backends

#### New Features

1) Improved Planner Performance

This release has major performance improvements to the planner that should help in planning larger DAXes than before. Additionally, the planner can now optionally log Java heap memory usage on the INFO log at the end of the planning process, if the property pegasus.log.memory.usage is set to true.

2) Leaf Cleanup Jobs

Pegasus now has a new cleanup option called Leaf that adds leaf cleanup jobs symmetric to the create dir jobs. The leaf cleanup jobs remove, at the end of the workflow, the directory on the staging site that the create dir jobs create. Leaf cleanup is turned on by passing --cleanup Leaf to pegasus-plan (see the invocation sketch after item 4).

Care should be taken when enabling this option for hierarchical workflows. Leaf cleanup jobs will create problems if there are data dependencies between sub workflows in a hierarchical workflow. In that case, the cleanup option needs to be explicitly set to None for the pegasus-plan invocations for the dax jobs in the hierarchical DAX.

3) New Default Transfer Refiner

This release has a new default transfer refiner called BalancedCluster that does round robin distribution at the file level instead of the job level while creating clustered stage-in and stage-out jobs. By default this refiner adds two stage-in and two stage-out jobs per level of the workflow.

4) Planner can automatically infer data flow dependencies in the DAG

The planner can now automatically add dependencies on the basis of the data dependencies implied by the input and output files of jobs. For example, if job A creates an output file X and job B consumes it, then the planner automatically adds a dependency A -> B if it does not already exist. This feature is turned on by default and can be turned off by setting the property pegasus.parser.dax.data.dependencies to false. More details at PM-746 [\#864](https://github.com/pegasus-isi/pegasus/issues/864)
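A hypothetical pegasus-plan invocation tying items 1, 2 and 4 together. The properties file, DAX file and site name are placeholders, and the -D options are shown first because the planner expects any command line properties before its other options:

```bash
#   pegasus.log.memory.usage=true               item 1: log Java heap usage at INFO level
#   pegasus.parser.dax.data.dependencies=false  item 4: turn off automatic data dependencies
#   --cleanup Leaf                              item 2: add leaf cleanup jobs
pegasus-plan -Dpegasus.log.memory.usage=true \
             -Dpegasus.parser.dax.data.dependencies=false \
             --conf pegasus.properties \
             --dax workflow.dax \
             --sites condorpool \
             --cleanup Leaf \
             --submit
```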
5) Update to Replica Catalog Backends

The replica catalog backends (File, Regex and JDBCRC) have been updated to consider an lfn, pfn mapping with a different pool/handle as a different entry. For JDBCRC the database schema has been updated. To migrate an existing JDBCRC backend, users are recommended to use the alter-my-rc.py script located in 'share/pegasus/sql' to migrate the database. Note that you will need to edit the script to update the database name, host, user, and password. Details at PM-732 [\#850](https://github.com/pegasus-isi/pegasus/issues/850)

6) Improved Credential Handling for data transfers

For data transfer jobs, it is now possible to associate different credentials with a single file transfer (one for the source server and the other for the destination server). For example, this applies when leveraging GridFTP transfers between two sites that accept different grid credentials, such as the XSEDE Stampede site and NCSA Bluewaters. In that case, Pegasus picks up the associated credentials from the site catalog entries for the source and the destination sites associated with the transfer.

Also, starting with 4.4, the credentials should be associated as Pegasus profiles with the site entries in the site catalog if you want them transferred with the job to the remote site. Details about credential handling in Pegasus can be found at https://pegasus.isi.edu/wms/docs/4.4.0cvs/reference.php#cred_staging

Associated JIRA item for the improvement: PM-731 [\#849](https://github.com/pegasus-isi/pegasus/issues/849)

The credential handling support in pegasus-transfer, pegasus-createdir and pegasus-cleanup was also updated.

7) New mode for runtime clustering

This release adds a new mode for runtime clustering.

Mode 1: The module groups tasks into clustered jobs such that no clustered job runs longer than the maxruntime input parameter to the module.

Mode 2 (new): The new mode allows users to group tasks into a fixed number of clustered jobs. The module distributes tasks evenly (based on job runtime) across jobs, such that each clustered job takes approximately the same time. This mode is helpful when users are aware of the number of resources available to them at the time of execution.

8) pegasus-transfer is now threaded

pegasus-transfer is now multithreaded. Pegasus exposes two knobs to control the number of threads pegasus-transfer can use, depending on whether you want to control standard transfer jobs or transfers that happen as part of a PegasusLite job. For the former, see the pegasus.transfer.threads property, and for the latter the pegasus.transfer.lite.threads property. For 4.4.0, pegasus.transfer.threads defaults to 2 and pegasus.transfer.lite.threads defaults to 1 (see the sketch after item 10).

9) pegasus-analyzer recurses into subworkflows

pegasus-analyzer has a --recurse option that sets it to automatically recurse into failed sub workflows. By default, if a workflow has a sub workflow in it and that sub workflow fails, pegasus-analyzer reports that the sub workflow node failed and lists a command invocation that the user must execute to determine which jobs in the sub workflow failed. If this option is set, the analyzer automatically issues the command invocation and, in addition, displays the failed jobs in the sub workflow. Details at PM-730 [\#848](https://github.com/pegasus-isi/pegasus/issues/848)

10) Support for a Fixed Output Mapper

Using this output mapper, users can specify an externally accessible URL in the properties file pointing to a directory where the output files need to be transferred. To use this mapper, set pegasus.dir.storage.mapper to Fixed and point pegasus.dir.storage.mapper.fixed.url at the externally accessible directory URL (see the sketch after this item).
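Illustrative settings for items 8, 9 and 10. The storage URL and submit directory are placeholders, and the thread counts shown are just the 4.4.0 defaults:

```bash
# Items 8 and 10: transfer thread knobs and the Fixed output mapper
cat >> pegasus.properties <<'EOF'
# threads for standard transfer jobs vs. transfers inside PegasusLite jobs
pegasus.transfer.threads = 2
pegasus.transfer.lite.threads = 1

# send all outputs to a single externally accessible directory URL (placeholder)
pegasus.dir.storage.mapper = Fixed
pegasus.dir.storage.mapper.fixed.url = gsiftp://outputs.example.org/data/outputs
EOF

# Item 9: let pegasus-analyzer descend into failed sub workflows automatically
pegasus-analyzer --recurse /path/to/submit/dir
```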
11) Extra ways for a user application to flag errors

CondorG does not propagate exit codes correctly from GRAM. As a result, a job in a Pegasus workflow that is not launched via pegasus-kickstart may not have the right exit code propagated from user application -> GRAM -> CondorG -> workflow. For example, in Pegasus, MPI jobs are never launched using pegasus-kickstart. The usual way of handling this is to have a wrapper script that detects failure and then have the postscript fail on the basis of the message logged. Starting with 4.4.0, Pegasus provides a mechanism for logging something on stdout/stderr that can be used to designate failures. This obviates the need for users to have a wrapper script.

Users can associate two pegasus profiles with their jobs:

exitcode.failuremsg - The message string that pegasus-exitcode searches for in the stdout and stderr of the job to flag failures.

exitcode.successmsg - The message string that pegasus-exitcode searches for in the stdout and stderr of the job to determine whether the job logged its success message or not. Note that this value is used to check whether a job failed or not, i.e. if this profile is specified and pegasus-exitcode DOES NOT find the string in the job stdout or stderr, the job is flagged as failed.

The complete rules for determining failure are described in the man page for pegasus-exitcode. More details at PM-737 [\#855](https://github.com/pegasus-isi/pegasus/issues/855)

12) Updated examples for Glite submission directly to local PBS

The 4.4.0 release has improvements for the submission of workflows directly to local PBS using the Condor Glite interfaces. How to use this through Pegasus is documented at http://pegasus.isi.edu/wms/docs/4.4.0/execution_environments.php#glite

It is important to note that to use this, you need to take the pbs_local_submit_attributes.sh file shipped with Pegasus in the share/pegasus/htcondor/glite directory and put it in the glite bin directory of your Condor installation (see the sketch below). Additionally, there is a new example in the examples directory that illustrates how to execute an MPI job using this submission mechanism through Pegasus.
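A one-time setup sketch for item 12. The destination is an assumption based on a typical HTCondor layout (the GLITE_LOCATION configuration value points at the glite directory), and the source path assumes a native package install of Pegasus; adjust both to your installation:

```bash
# Copy Pegasus' PBS submit attributes script into HTCondor's glite bin directory.
# Both paths below are assumptions; check where your installs actually live.
cp /usr/share/pegasus/htcondor/glite/pbs_local_submit_attributes.sh \
   "$(condor_config_val GLITE_LOCATION)/bin/"
```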
13) Finer grained specification of Linux versions for worker package staging

The planner now has logic that lets users specify finer grained Linux versions for which to stage the worker package. Users can now specify the osrelease and osversion attributes for a site in the site catalog. If a supported release/version combination is not specified, the planner throws a warning and defaults to the default combination for the OS. More details at PM-732 [\#850](https://github.com/pegasus-isi/pegasus/issues/850)

14) pegasus-kickstart can now copy all of an application's stdio if -B all is passed

Added an option to capture all stdio. This is a feature that HUBzero requested. Kickstart will now copy all stdout and stderr of the job to the invocation record if the user specifies '-B all'.

15) Tutorial includes pegasus-dashboard

The tutorial comes configured with pegasus-dashboard.

16) Improved formatting of extra long values for pegasus-statistics

More details at PM-744 [\#862](https://github.com/pegasus-isi/pegasus/issues/862)

17) Changed timeout parameters for pegasus-gridftp

Increased the timeout parameter for GridFTPClient to 60 seconds. The Globus jar defaults to 30 seconds. The timeout was increased to ensure that transfers don't fail against heavily loaded GridFTP servers.

#### Bugs Fixed

1) iRODS support in pegasus-transfer and pegasus-createdir was broken

The irods mkdir command got the wrong path when invoked by pegasus-transfer. This is now fixed.

2) Data reuse algorithm did not cascade the deletion upwards

In certain cases, the cascading of deletion in data reuse did not happen completely. This is now fixed. More details at PM-742 [\#860](https://github.com/pegasus-isi/pegasus/issues/860)

3) Improved argument management for PMC

This was done to address the case where a task has quoted arguments with spaces.

4) Clusters of size 1 should be allowed when using PMC

For label based clustering with PMC, single node clusters are allowed. This is important as in some cases PMC jobs might have been set to work with the relevant globus profiles. PM-745 [\#863](https://github.com/pegasus-isi/pegasus/issues/863)

5) Non-ASCII characters in application stdout broke parsing in monitord

The URL quoting logic was updated to encode unicode strings as UTF-8 before the string is passed to the quote function. More details at PM-757 [\#875](https://github.com/pegasus-isi/pegasus/issues/875)

6) Removing a workflow using pegasus-remove does not update the stampede database

If you remove a running workflow using pegasus-remove, the stampede database is not updated to reflect that the workflow failed. Changes were made to pegasus-dagman to ensure that pegasus-monitord gets 100 seconds to complete the population before being sent a kill signal.

7) Translation of values from days to years/days was broken in pegasus-statistics

This is now fixed.