Pegasus 4.1.0 Released

This is a major release of Pegasus that adds support for PMC (pegasus-mpi-cluster), which can be used to run the tasks in a clustered job in parallel on remote machines using MPI. As part of this release, the support for submitting workflows using CondorC has been updated. The Pegasus Tutorial has also been updated and is available to run on

   – Amazon EC2
   – FutureGrid
   – Local machine using VirtualBox
NEW FEATURES
—————————–
  1. pegasus-mpi-cluster
    Pegasus has support for a new clustering executable called pegasus-mpi-cluster (PMC) that allows users to run the tasks in a clustered job in parallel using MPI on the remote node. The input format for PMC is a DAG based format similar to Condor DAGMan's. PMC follows the dependencies specified in the DAG to release the jobs in the right order and executes parallel jobs via the workers when possible. The input file for PMC is automatically generated by the Pegasus Planner when generating the executable workflow.
    To use PMC, set the property pegasus.clusterer.job.aggregator to mpiexec.
    Also, you may need to put an entry in your transformation catalog for pegasus::mpiexec to point to the location of the PMC executable on the remote side.
     More details can be found in the man page for pegasus-mpi-cluster and in the clustering chapter.
     There is an XSEDE example in the examples directory that shows how to use PMC on XSEDE.
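     A minimal sketch of the two pieces of configuration described above, assuming PMC is installed at /usr/bin/pegasus-mpi-cluster on a remote site named hpc-cluster (the path and site name are placeholders), could look like this:

        # properties file: cluster jobs using PMC
        pegasus.clusterer.job.aggregator = mpiexec

        # transformation catalog (text format): where PMC lives on the remote site
        tr pegasus::mpiexec {
            site hpc-cluster {
                pfn "/usr/bin/pegasus-mpi-cluster"
                arch "x86_64"
                os "LINUX"
                type "INSTALLED"
            }
        }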
  2. Use of new client pegasus-gridftp in pegasus-create-dir and pegasus-cleanup
    Starting with release 4.1, the pegasus create dir and cleanup clients use a Java based client called pegasus-gridftp to create directories on, and remove files from, a GridFTP server. Pegasus by default now adds a DAGMan category named cleanup for all cleanup jobs in the workflow. The maxjobs for this category is set to 4 by default.
    This can be overridden by specifying the property
        dagman.cleanup.maxjobs
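     For example, to allow up to 10 cleanup jobs to run at a time (the value 10 is only an illustration), set the following in the properties file:

        dagman.cleanup.maxjobs = 10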
  3. Support for CondorC
    The support for CondorC in Pegasus has been updated. Users can associate a pegasus profile named style with value condorc with a site in the site catalog to indicate that submission to the site has to be achieved using CondorC.
    The site catalog entry should mention the grid gateways to indicate the remote schedd to which the jobs need to be submitted, and the Condor collector for the CondorC site. Specifying the Condor collector is optional; if it is not specified, Pegasus will use the contact mentioned in the grid gateway.
       An example snippet with the relevant entries is below:
          <site handle="isi-condorc" arch="x86" os="LINUX">
              <grid type="condor" contact="ccg-testing1.isi.edu" scheduler="Condor" jobtype="compute" total-nodes="50"/>
              <grid type="condor" contact="ccg-testing1.isi.edu" scheduler="Condor" jobtype="auxillary" total-nodes="50"/>
              <head-fs>
                  <scratch>
                      <shared>
                          <file-server protocol="file" url="file://" mount-point="/nfs/ccg3/scratch/bamboo/scratch/"/>
                          <internal-mount-point mount-point="/nfs/ccg3/scratch/bamboo/scratch/"/>
                      </shared>
                  </scratch>
                  <storage>
                      <shared>
                          <file-server protocol="file" url="file://" mount-point="/nfs/ccg3/scratch/bamboo/storage/"/>
                          <internal-mount-point mount-point="/nfs/ccg3/scratch/testing/bamboo/storage"/>
                      </shared>
                  </storage>
              </head-fs>
              <replica-catalog type="LRC" url="rlsn://dummyValue.url.edu" />

              <!-- specify which condor collector to use -->
              <profile namespace="condor" key="condor_collector">ccg-testing1.isi.edu</profile>

              <!-- submission to this site is using condorc -->
              <profile namespace="pegasus" key="style">condorc</profile>
              <profile namespace="condor" key="should_transfer_files">Yes</profile>
              <profile namespace="condor" key="when_to_transfer_output">ON_EXIT</profile>
              <profile namespace="env" key="PEGASUS_HOME">/usr</profile>
              <profile namespace="condor" key="universe">vanilla</profile>
          </site>
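     Once such an entry is in place, the workflow is planned against the CondorC site like any other site. A hedged example invocation (the DAX file name and output site are placeholders; adjust to your setup):

        pegasus-plan --conf pegasus.properties --dax workflow.dax --sites isi-condorc --output local --submit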
  4. Updated the Pegasus Tutorial
       The Pegasus Tutorial has now been updated and is available to run on
       – Amazon EC2
       – FutureGrid
       – Local machine using VirtualBox
  5. Changed the default transfer refiner for Pegasus
    The default transfer refiner in Pegasus now clusters both stagein and stageout jobs per level of the workflow. The previous version clustered stagein jobs per workflow and stageout jobs per level of the workflow.
    More details can be found at
  6.  pegasus-statistics has a new -f option
     The -f option can be used to specify the output format for pegasus-statistics. The valid supported formats are txt and csv.
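     For instance, to generate the statistics in CSV format (the submit directory path is a placeholder):

        pegasus-statistics -s all -f csv /path/to/workflow/submit/dir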
  7. Updated condor periodic_release and periodic_remove expressions
    Earlier, Pegasus set the default periodic_release and periodic_remove expressions as follows:
    periodic_release = (NumSystemHolds <= 3)
    periodic_remove = (NumSystemHolds > 3)
    This had the effect of removing the jobs as soon as they went into the held state.
    Starting with 4.1, the expressions have been updated to:
    periodic_release = False
    periodic_remove = (JobStatus == 5) && ((CurrentTime - EnteredCurrentStatus) > 14400)
    With this, a job remains in the held state for 4 hours (14400 seconds) before being removed. The idea is that this is a long enough time for users to debug held jobs.
    If users wish to use the previous expressions, they can do so by specifying the condor profile keys periodic_release and periodic_remove.
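     For example, the previous behavior could be restored by associating the corresponding Condor profiles, e.g. in the properties file (profiles can also be set in the site catalog or DAX; the properties form is shown here as a sketch):

        condor.periodic_release = (NumSystemHolds <= 3)
        condor.periodic_remove = (NumSystemHolds > 3)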
  8. Property to turn off registration jobs
    Pegasus now exposes a boolean property pegasus.register that can be used to turn off the creation of registration jobs.
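     For example, to disable registration jobs add the following to the properties file:

        pegasus.register = false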
  9. More descriptive errors if incomplete site catalog specified
    Earlier, an incomplete site catalog caused NPEs (NullPointerExceptions) when running pegasus-plan. These have been replaced by more descriptive errors that give users enough information to figure out the missing entries in the site catalog.
    More details at
  10. Change in DAX schema
    The DAX schema version is now 3.4. The schema now allows file sizes to be specified as a size attribute on the uses element that lists the input and output files for a job.
    The DAX Generator APIs have been updated accordingly.
     This is useful for users extending the Pegasus code for their specific research use cases.
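     A hedged sketch of what the size attribute might look like on the uses elements of a job in a DAX 3.4 document (the job, file names and sizes are made up for illustration):

        <job id="ID0000001" name="analyze">
            <uses name="input.dat" link="input" size="1048576"/>
            <uses name="output.dat" link="output" size="2097152"/>
        </job>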
  11. Prototype support for SHIWA bundles
    pegasus-plan has a new option --shiwa-bundle that allows users to pass a Pegasus SHIWA bundle for execution. A Pegasus SHIWA bundle is a bundle that has been generated using the Pegasus Plugin for the SHIWA Desktop.
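     A hedged example of the option (the bundle file name is a placeholder, and additional pegasus-plan options may still be required):

        pegasus-plan --conf pegasus.properties --shiwa-bundle myworkflow.bundle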
  12. Improved performance for the expunge operation against a MySQL database
    When monitord is run in replay mode, the database is first expunged of all the information related to that workflow. In the case of a MySQL backend, where the same database may be used to track multiple hierarchical workflows, the expunge operation has to be careful to delete only the relevant entries from the various tables.
    In earlier versions, this expunge operation was implemented at the ORM level in SQLAlchemy, which led to a large number of select and delete statements being executed (one per entry). This blew up the memory footprint of monitord and prevented workflow population in the case of large databases. For 4.1, we changed the schema to add cascaded delete clauses, and set the passive delete option to true in SQLAlchemy.
    More details
  13. Runtime Clustering picks up pegasus profile key named runtime
    Starting with 4.1, runtime clustering in Pegasus picks up the pegasus profile key runtime instead of job.runtime. job.runtime is deprecated, and a message is logged if a user has it specified.

    The planner picks up job.runtime only if runtime is not specified for a job.
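    A hedged example of attaching the profile to a job in the DAX (the value is the expected runtime of the job in seconds; the job and value are placeholders):

        <job id="ID0000002" name="simulate">
            <profile namespace="pegasus" key="runtime">600</profile>
        </job>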

BUGS FIXED
—————-
  1. pegasus-lite-local.sh made assumptions on PATH
    The pegasus-lite-local wrapper, which is invoked when a PegasusLite job runs in the local universe, made assumptions about the PATH variable to determine the location of the Pegasus tools.
    This is now fixed. More details at
  2. Overwriting of entries with the file based replica catalog
    Earlier, inserting an entry for the same LFN and PFN with a different pool attribute overwrote the existing entry:
    pegasus-rc-client lfn pfn pool="local" # Inserts new entry in RC file
    pegasus-rc-client lfn pfn pool="usc" # Overwrites pool="local" to pool="usc"
    The uniqueness constraint in the file based RC has been updated to also consider the site attribute.
    More details at
  3. pegasus-statistics failed on workflows with a large number of sub workflows
    pegasus-statistics failed if a workflow had more than 1000 sub workflows. This was due to an SQLAlchemy issue.
    More details at
  4. Properties propagation for sub workflows
    There was a bug with properties propagation for hierarchical workflows when using PegasusLite for some sub workflows and sharedfs for others.
    This is partially fixed.