- ensemble manager for managing collections of workflows
- support for job checkpoint files
- support for Google Cloud Storage
- improvements to pegasus-dashboard
- data management improvements
- new tools pegasus-db-admin, pegasus-submitdir, pegasus-halt and pegasus-graphviz
- Ensemble manager for managing collections of workflows
The ensemble manager is a service that manages collections of workflows called ensembles. It is useful when you have a set of workflows that you need to run over a long period of time. It can throttle the number of workflows being concurrently planned and run, and plan and run workflows in priority order. A typical use case is a user with 100 workflows to run, who needs no more than one to be planned at a time, and no more than two to be running concurrently. The ensemble manager also allows workflows to be submitted and monitored programmatically through its RESTful interface. Details about the ensemble manager can be found at https://pegasus.isi.edu/wms/docs/4.5.0/service.php
- Support for Google Cloud Storage
Pegasus now supports running workflows in the Google cloud. When running workflows in the Google cloud, users can specify Google storage to act as the staging site. More details on how to configure Pegasus to use Google storage can be found at pegasus.isi.edu/wms/docs/4.5.0/cloud.php#google_cloud. All the Pegasus auxiliary clients (pegasus-transfer, pegasus-create-dir and pegasus-cleanup) were updated to handle Google storage URLs (starting with gs://). The tools call out to the Google command line tool gsutil.
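As a rough illustration (the bucket and object names here are hypothetical), the kind of operation that pegasus-transfer delegates to gsutil when staging a file into Google storage looks like:
gsutil cp /local/scratch/job-output.dat gs://my-staging-bucket/run0001/job-output.dat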
- Support for job checkpoint files
Pegasus now supports checkpoint files created by jobs. This allows users to run long running jobs (where the runtime of a job exceeds the maxwalltime supported on a compute site) to completion, provided the jobs generate a checkpoint file periodically. To use this, checkpoint files need to be specified for the jobs in the DAX with link set to checkpoint. Additionally, the jobs need to specify the pegasus profile checkpoint.time, which indicates the number of minutes after which pegasus-kickstart sends a TERM signal to the job, signalling it to start generating the checkpoint file. Details on this can be found in the user guide at https://pegasus.isi.edu/wms/docs/4.5.0/transfer.php#staging_job_checkpoi…
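A minimal sketch of how this looks for a job in the DAX (the job id, names and checkpoint.time value are illustrative, not taken from a real workflow):
<job id="ID0000001" name="long-running-app">
  <profile namespace="pegasus" key="checkpoint.time">120</profile>
  <uses name="app.state" link="checkpoint"/>
</job>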
- Pegasus Dashboard Improvements
Pegasus dashboard can now be deployed in multiuser mode. It is now started by the pegasus-service command. Instructions for starting the pegasus service can be found at https://pegasus.isi.edu/wms/docs/4.5.0/service.php#idp2043968
The look and feel of the dashboard has changed. Users can now track all job instances (retries) of a job through the dashboard. Earlier, only the latest job retry was shown.
There is a new tab called failing jobs on the workflows page. The tab lists jobs that have failed at least once and are currently being retried.
The submit host is displayed on the workflow’s main page.
The job details page now shows information about the host where the job ran, and all the states that the job has gone through.
The dashboard also has a file browser that allows users to view files in the workflow submit directory directly from the dashboard.
- Data configuration is now supported per site
Starting with the 4.5.0 release, users can now associate the pegasus profile key data.configuration per site in the site catalog to specify the data configuration mode (sharedfs, nonsharedfs or condorio) to use for jobs executed on that site. Earlier this was a global configuration that applied to the whole workflow and had to be specified in the properties file.
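For example, a site entry in the site catalog can carry the profile as shown below (the site handle and chosen mode are illustrative):
<site handle="condorpool" arch="x86_64" os="LINUX">
  <profile namespace="pegasus" key="data.configuration">condorio</profile>
</site>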
More details at https://jira.isi.edu/browse/PM-810
- Support for sqlite JDBCRC
Users can now specify a sqlite backend for their JDBCRC replica catalog. To create the database for the sqlite based replica catalog, use the pegasus-db-admin command:
pegasus-db-admin create jdbc:sqlite:/shared/jdbcrc.db
To set up Pegasus to use the sqlite JDBCRC, set the following properties:

pegasus.catalog.replica            JDBCRC
pegasus.catalog.replica.db.driver  sqlite
pegasus.catalog.replica.db.url     jdbc:sqlite:/shared/jdbcrc.db

Users can use the tool pegasus-rc-client to insert, query and delete entries from the catalog.
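As a hedged sketch (the LFN, PFN and site name are hypothetical; consult the pegasus-rc-client man page for the exact syntax), inserting and looking up an entry would look something like:
pegasus-rc-client insert f.a file:///shared/data/f.a site=local
pegasus-rc-client lookup f.a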
- New database management tool called pegasus-db-admin
Depending on configuration, Pegasus can refer to three different types of databases during the various stages of workflow planning and execution:
master – Usually a sqlite database located at $HOME/.pegasus/workflow.db. This is always populated by pegasus-monitord and is used by pegasus-dashboard to track users' top level workflows.
workflow – Usually a sqlite database created by pegasus-monitord in the workflow submit directory. This contains detailed information about the workflow execution.
jdbcrc – if a user has configured a JDBCRC replica catalog.
The tool is automatically invoked by the planner to check for compatibility, and it updates the master database if required. The jdbcrc is checked if a user has it configured at planning time, or when using the pegasus-rc-client command line tool.
Users should use this tool when setting up new database catalogs, or to check for compatibility. For more details refer to the migration guide at https://pegasus.isi.edu/wms/docs/4.5.0cvs/useful_tips.php#migrating_from…
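As a hedged sketch of typical usage (the exact subcommand behavior and connection URLs should be verified against pegasus-db-admin --help), checking and updating the JDBCRC database created above would look something like:
pegasus-db-admin check jdbc:sqlite:/shared/jdbcrc.db
pegasus-db-admin update jdbc:sqlite:/shared/jdbcrc.db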
- pegasus-kickstart allows for system call interposition
pegasus-kickstart has new options -z and -Z that are enabled on Linux platforms. When enabled, pegasus-kickstart captures information about the files opened and the I/O performed by user applications, and includes it in the proc section of its output. The -z flag causes kickstart to use ptrace() to intercept system calls and report a list of files accessed and I/O performed. The -Z flag causes kickstart to use LD_PRELOAD to intercept library calls and report a list of files accessed and I/O performed.
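For illustration (the application path and argument are hypothetical), a job launched with system call tracing enabled would be invoked roughly as:
pegasus-kickstart -z /path/to/my-app input.dat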
- pegasus-kickstart now captures condor job id and LRMS job ids
pegasus-kickstart now captures both the Condor job id and the local LRMS job id (the LRMS is the system through which the job is executed) in the invocation record for the job.
- pegasus-transfer has support for SSHFTP
pegasus-transfer now has support for GridFTP over SSH. More details at
https://pegasus.isi.edu/wms/docs/4.5.0/transfer.php#idp17066608
- pegasus-s3 has support for bulk deletes
pegasus-s3 now supports batched deletion of keys from an S3 bucket. This improves performance when deleting keys from a large bucket.
- DAGMan metrics reporting enabled
Pegasus workflows now have DAGMan metrics reporting turned on. Details on the Pegasus usage tracking policy can be found here
As part of this effort, the planner now invokes condor_submit_dag at planning time to generate the DAGMan submit file, which is then modified to enable metrics reporting. More details at https://jira.isi.edu/browse/PM-797
- Planner reports file distribution counts in metrics report
The planner now reports file distribution counts (the number of input, intermediate and output files) in its metrics report.
- Notion of scope for data reuse
Users can now enable partial data reuse, where only the output files of certain jobs are checked for existence in the replica catalog to trigger data reuse. Three scopes are supported; a configuration sketch follows the list.
full – full data reuse, as implemented in 4.4
none – no data reuse, i.e. the same as the --force option to the planner
partial – in this case, only certain jobs (those that have the pegasus profile key enable_for_data_reuse set to true) are checked for presence of output files in the replica catalog
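A hedged sketch of wiring this up, assuming the scope is selected through a property named pegasus.data.reuse.scope (the property name is an assumption; verify it against the properties reference): set
pegasus.data.reuse.scope  partial
in the properties file, and mark the relevant jobs in the DAX with
<profile namespace="pegasus" key="enable_for_data_reuse">true</profile>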
- New tool called pegasus-submitdir
There is a new tool called pegasus-submitdir that allows users to archive, extract, move and delete a workflow submit directory. The tool ensures that the master database (usually in $HOME/.pegasus/workflow.db) is updated accordingly.
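For illustration (the subcommand form and path are assumptions; check pegasus-submitdir --help for the exact interface), archiving a submit directory might look like:
pegasus-submitdir archive /scratch/user/submit/run0001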
- New tool called pegasus-halt
There is a new tool called pegasus-halt that allows users to gracefully halt running workflows. The tool places DAGMan .halt files (http://research.cs.wisc.edu/htcondor/manual/v8.2/2_10DAGMan_Applications…) for all the DAGs in a workflow. More details at https://jira.isi.edu/browse/PM-702
- New tool called pegasus-graphviz
Pegasus now has a tool called pegasus-graphviz that allows you to visualize DAX and DAG files. It creates a dot file as output.
- New canonical executable pegasus-mpi-keg
There is a new executable called pegasus-mpi-keg that can be compiled from source. It is useful for creating synthetic workflows containing MPI jobs. It is similar to pegasus-keg and accepts the same command line arguments; the only difference is that it is an MPI program.
- Change in default values
By default, pegasus-transfer now launches a maximum of 8 threads to manage the transfers of multiple files. The default number of job retries in case of failure is now 1 instead of 3. The time after which a job in the HELD state is removed has been reduced from 1 hour to 30 minutes.
- Support for DAGMan ABORT-DAG-ON feature
Pegasus now supports a dagman profile key named ABORT-DAG-ON that can be associated with a job. This job can then cause the whole workflow to be aborted if it fails or exits with a specific value.
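As a hedged illustration, the profile can be associated with a job in the DAX as below; the value format here is an assumption that mirrors DAGMan's "AbortExitValue [RETURN DAGReturnValue]" syntax:
<profile namespace="dagman" key="ABORT-DAG-ON">1 RETURN 1</profile>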
More details at https://jira.isi.edu/browse/PM-819
- Deprecated pool attribute in replica catalog
Users can now associate a site attribute in their file based replica catalogs to indicate the site where a file resides. The old attribute pool has been deprecated.
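For example, an entry in a file based replica catalog now looks like this (the LFN, PFN and site handle are illustrative):
f.a file:///shared/data/f.a site="local"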
More details at https://jira.isi.edu/browse/PM-813
- Support for pegasus profile glite.arguments
Users can now specify a pegasus profile key glite.arguments that gets added to the corresponding PBS qsub file generated by the Glite layer in HTCondor. For example, you can set the value to "-N testjob -l walltime=01:23:45 -l nodes=2". This will get translated to the following in the PBS file:

#PBS -N testjob -l walltime=01:23:45 -l nodes=2

The values specified for this profile override any other conflicting directives that are created on the basis of the globus profiles associated with the jobs. More details at https://jira.isi.edu/browse/PM-880
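Expressed as a profile (for example, associated with a site in the site catalog), the setting above would look roughly like:
<profile namespace="pegasus" key="glite.arguments">-N testjob -l walltime=01:23:45 -l nodes=2</profile>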
- Reorganized documentation
The user guide has been reorganized to make it easier for users to identify the chapter they need to navigate to. The configuration documentation has been streamlined and put into a single chapter, rather than having separate chapters for profiles and properties.
- Support for hints namespace
Users can now specify the following hints profile keys to control the behavior of the planner; a profile sketch follows the list.
execution.site – the execution site where a job should execute
pfn – the path to the remote executable to be picked up
grid.jobtype – the job type to be used while selecting the grid gateway / jobmanager for the job
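As an illustrative sketch (the site handle, executable path and job type value are assumptions), these keys can be associated with a job in the DAX as profiles:
<profile namespace="hints" key="execution.site">condorpool</profile>
<profile namespace="hints" key="pfn">/usr/local/bin/my-app</profile>
<profile namespace="hints" key="grid.jobtype">compute</profile>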
More details at https://jira.isi.edu/browse/PM-828
- Added support for HubZero Distribute job wrapper
Added support for the HubZero specific job launcher Distribute, which submits jobs to a remote PBS cluster. The compute jobs are set up by Pegasus to run in the local universe, and are wrapped with the Distribute job wrapper, which takes care of the submission and monitoring of the job. More details at https://jira.isi.edu/browse/PM-796
- New classad populated for dagman jobs
Pegasus now populates a +pegasus_execution_sites classad in the dagman submit file. The value is the list of execution sites for which the workflow was planned.
More details at https://jira.isi.edu/browse/PM-846
- Python DAX API now bins files by link type when rendering the workflow
The Python DAX API now groups a job's files by their link type before rendering them to XML. This improves the readability of the generated DAX.
More details at https://jira.isi.edu/browse/PM-874
- Better demarcation of various stages in PegasusLite logs
In PegasusLite mode, a job's .err file captures the logs from the PegasusLite wrapper that launches user jobs on the remote nodes. This log is now clearly demarcated to identify the various stages of job execution performed by PegasusLite.
- Dropped support for Globus RLS replica catalog backends
- pegasus-plots is deprecated and will be removed in 4.6
- Fixed kickstart handling of environment variables with quotes
If an environment variable contained quotes, pegasus-kickstart produced invalid XML output. This is now fixed. More details at
- Leaking file descriptors for two stage transfers
pegasus-transfer opens a temporary file for each two stage transfer it has to execute; these files were not being closed explicitly.
- Disabling of chmod jobs triggered an exception
Disabling the chmod jobs results in the creation of noop jobs in place of the chmod jobs. However, this resulted in planner exceptions when adding create dir and leaf cleanup nodes. This is now fixed. More details at https://jira.isi.edu/browse/PM-845
- Incorrect binning of file transfers amongst transfer jobs
By default, the planner only considered the destination URL of a transfer pair to determine whether the associated transfer job has to run locally on the submit host or on the remote staging site. However, this logic broke when a user had input files catalogued in the replica catalog with file URLs for files on the submit site and remote execution sites. The logic has now been updated to take the source URLs into account as well. More details at https://jira.isi.edu/browse/PM-829
- Pegasus auxiliary jobs are never launched with the pegasus-kickstart invoke capability
For compute jobs with long command line arguments, the planner triggers the pegasus-kickstart invoke capability in addition to the -w option. However, this cannot be applied to Pegasus auxiliary jobs as it interferes with credential handling.
More details at https://jira.isi.edu/browse/PM-851
- Everything in the remote job directory gets staged back in condorio mode if a job has no output files
If a job has no output files associated with it in the DAX, then in the condorio data configuration mode the planner added an empty value for the classad key transfer_output_files in the job submit file. This resulted in Condor staging all the contents of the remote job directory back to the submit host. This is now fixed: the planner now adds a special key +TransferOutput="" that prevents Condor from staging everything back.
More details at https://jira.isi.edu/browse/PM-820
- Setting multiple strings for exitcode.successmsg and exitcode.failuremsg
Users can now specify multiple pegasus profiles with the key exitcode.successmsg or exitcode.failuremsg. Each value gets translated to a corresponding -s or -f argument to the pegasus-exitcode invocation for the job.
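For illustration (the message strings are hypothetical), a job carrying two such profiles:
<profile namespace="pegasus" key="exitcode.successmsg">Job completed</profile>
<profile namespace="pegasus" key="exitcode.successmsg">Output written</profile>
would result in pegasus-exitcode being invoked with -s "Job completed" -s "Output written" for that job.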
More details at https://jira.isi.edu/browse/PM-826
- pegasus-monitord failed when job submission fails
The events SUBMIT_FAILED, GRID_SUBMIT_FAILED, GLOBUS_SUBMIT_FAILED were not handled correctly by pegasus-monitord. As a result, subsequent event insertions for the job resulted in integrity errors. This is now fixed.
More details at https://jira.isi.edu/browse/PM-877