This is a major release of Pegasus that includes support for PMC (pegasus-mpi-cluster), which can be used to run the tasks in a clustered job in parallel on remote machines using MPI. As part of this release, support for submitting workflows using CondorC has been updated. The Pegasus Tutorial has also been updated and is available to run on
– Amazon EC2
– Futuregrid
– Local machine using Virtual Box
NEW FEATURES
—————————–
- pegasus-mpi-cluster
Pegasus has support for a new clustering executable called pegasus-mpi-cluster (PMC) that allows users to run the tasks in a clustered job in parallel using MPI on the remote node. The input format for PMC is a DAG based format similar to Condor DAGMan's. PMC follows the dependencies specified in the DAG to release jobs in the right order and executes parallel jobs via its workers when possible. The input file for PMC is automatically generated by the Pegasus Planner when generating the executable workflow. In order to use PMC, set the following in your properties: pegasus.clusterer.job.aggregator = mpiexec. You may also need to put an entry in your transformation catalog for pegasus::mpiexec that points to the location of the PMC executable on the remote side; a sketch is shown below. More details can be found in the man page for pegasus-mpi-cluster and in the clustering chapter. There is an XSEDE example in the examples directory that shows how to use PMC on XSEDE.
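A minimal sketch of this configuration; the remote site handle and the PMC install path below are assumptions for illustration, not actual values:

    # properties: use PMC as the job aggregator for clustered jobs
    pegasus.clusterer.job.aggregator = mpiexec

    # transformation catalog (text format): location of the PMC executable
    # on the remote side (site handle and path are hypothetical)
    tr pegasus::mpiexec {
        site remote-cluster {
            pfn "/usr/local/pegasus/bin/pegasus-mpi-cluster"
            arch "x86_64"
            os "LINUX"
            type "INSTALLED"
        }
    }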
- Use of new client pegasus-gridftp in pegasus-create-dir and pegasus-cleanup
Starting with release 4.1, the pegasus-create-dir and pegasus-cleanup clients use a Java based client called pegasus-gridftp to create directories on and remove files from a GridFTP server. Pegasus by default now adds a DAGMan category named cleanup for all cleanup jobs in the workflow. The maxjobs for this category is set to 4 by default. This can be overridden by specifying the property dagman.cleanup.maxjobs.
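A one-line sketch of this override in the properties file; the value 10 is illustrative:

    # raise the concurrency cap for cleanup jobs (default is 4)
    dagman.cleanup.maxjobs = 10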
- Support for CondorC
Support for CondorC in Pegasus has been updated. Users can associate a pegasus profile named style with value condorc with a site in the site catalog to indicate that submission to that site has to be done using CondorC. The site catalog entry should mention the grid gateways to indicate the remote schedd to which the jobs need to be submitted, and the Condor collector for the CondorC site. Specifying the Condor collector is optional; if it is not specified, Pegasus will use the contact mentioned in the grid gateway. An example snippet with the relevant entries is below:

    <site handle="isi-condorc" arch="x86" os="LINUX">
        <grid type="condor" contact="ccg-testing1.isi.edu" scheduler="Condor" jobtype="compute" total-nodes="50"/>
        <grid type="condor" contact="ccg-testing1.isi.edu" scheduler="Condor" jobtype="auxillary" total-nodes="50"/>
        <head-fs>
            <scratch>
                <shared>
                    <file-server protocol="file" url="file://" mount-point="/nfs/ccg3/scratch/bamboo/scratch/"/>
                    <internal-mount-point mount-point="/nfs/ccg3/scratch/bamboo/scratch/"/>
                </shared>
            </scratch>
            <storage>
                <shared>
                    <file-server protocol="file" url="file://" mount-point="/nfs/ccg3/scratch/bamboo/storage/"/>
                    <internal-mount-point mount-point="/nfs/ccg3/scratch/testing/bamboo/storage"/>
                </shared>
            </storage>
        </head-fs>
        <replica-catalog type="LRC" url="rlsn://dummyValue.url.edu"/>

        <!-- specify which condor collector to use -->
        <profile namespace="condor" key="condor_collector">ccg-testing1.isi.edu</profile>

        <!-- submission to this site is using condorc -->
        <profile namespace="pegasus" key="style">condorc</profile>
        <profile namespace="condor" key="should_transfer_files">Yes</profile>
        <profile namespace="condor" key="when_to_transfer_output">ON_EXIT</profile>
        <profile namespace="env" key="PEGASUS_HOME">/usr</profile>
        <profile namespace="condor" key="universe">vanilla</profile>
    </site>
- Updated the Pegasus Tutorial
The Pegasus Tutorial has been updated and is available to run on
– Amazon EC2
– Futuregrid
– Local machine using Virtual Box
- Changed the default transfer refiner for Pegasus
The default transfer refiner in Pegasus now clusters both stage-in and stage-out jobs per level of the workflow. The previous version clustered stage-in jobs per workflow and stage-out jobs per level of the workflow. More details can be found at
- pegasus-statistics has a new -f option
The -f option can be used to specify the output format for pegasus-statistics. Valid supported formats are txt and csv.
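A short sketch of an invocation; the submit directory path is an assumption for illustration:

    # write the workflow statistics in CSV format
    pegasus-statistics -f csv /path/to/workflow/submit/dir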
- Updated condor periodic_release and periodic_remove expressions
Earlier, Pegasus set the default periodic_release and periodic_remove expressions as follows:

    periodic_release = (NumSystemHolds <= 3)
    periodic_remove = (NumSystemHolds > 3)

This had the effect of removing jobs as soon as they went into the held state. Starting with 4.1 the expressions have been updated to:

    periodic_release = False
    periodic_remove = (JobStatus == 5) && ((CurrentTime - EnteredCurrentStatus) > 14400)

With this, a job remains in the held state for 4 hours before being removed, which should be long enough for users to debug held jobs. If users wish to use the previous expressions, they can do so by specifying the condor profile keys periodic_release and periodic_remove.
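For example, behavior similar to the old defaults could be restored by setting the corresponding condor profiles on a site in the site catalog; the expressions below simply mirror the pre-4.1 defaults and are shown only as an illustration (note the XML-escaped comparison operators):

    <profile namespace="condor" key="periodic_release">(NumSystemHolds &lt;= 3)</profile>
    <profile namespace="condor" key="periodic_remove">(NumSystemHolds &gt; 3)</profile>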
- Property to turn off registration jobs
Pegasus now exposes a boolean property pegasus.register that can be used to turn off the registration jobs in the workflow.
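For example, in the properties file:

    # do not add registration jobs to the workflow
    pegasus.register = false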
- More descriptive errors if an incomplete site catalog is specified
Earlier, an incomplete site catalog caused NPEs when running pegasus-plan. These have been replaced by more descriptive errors that give the user enough information to figure out the missing entries in the site catalog. More details at
- Change in DAX schema
The DAX schema version is now 3.4. The schema now allows file sizes to be specified as a size attribute on the uses element that lists the input and output files for a job. The DAX Generator APIs have been updated accordingly. This is useful for users extending the Pegasus code for their specific research use cases.
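A sketch of what the size attribute might look like on the uses elements of a DAX 3.4 job; the job, file names and sizes are illustrative:

    <job id="ID0000001" namespace="example" name="preprocess" version="1.0">
        <uses name="f.a" link="input" size="20480"/>
        <uses name="f.b" link="output" size="1048576" transfer="true" register="false"/>
    </job>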
- Prototype support for SHIWA bundles
pegasus-plan has a new option --shiwa-bundle that allows users to pass a Pegasus SHIWA bundle for execution. A Pegasus SHIWA bundle is a bundle that has been generated using the Pegasus plugin for the SHIWA Desktop.
- Improved performance for the expunge operation against a MySQL database
When monitord is run in replay mode, the database is first expunged of all the information related to that workflow. In the case of a MySQL backend, where the same database may be used to track multiple hierarchical workflows, the expunge operation has to be careful to delete only the relevant entries from the various tables. In earlier versions, this expunge operation was implemented at the ORM level in SQLAlchemy, which led to a large number of SELECT and DELETE statements being executed (one per entry). This blew up the memory footprint of monitord and prevented workflow population in the case of large databases. For 4.1, we changed the schema to add cascaded delete clauses and set the passive delete option to true in SQLAlchemy. More details
- Runtime clustering picks up the pegasus profile key named runtime
Starting with 4.1, runtime clustering in Pegasus picks up the pegasus profile key runtime instead of job.runtime. job.runtime is deprecated and a message is logged if a user has it specified.
The planner picks up job.runtime only if runtime is not specified for a job.
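A sketch of how this profile might be associated with a job, e.g. in the DAX; the value (an expected runtime, here assumed to be in seconds) is illustrative:

    <profile namespace="pegasus" key="runtime">600</profile>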
BUGS FIXED
—————-
- pegasus-lite-local.sh made assumptions on PATH
The pegasus-lite-local wrapper, which is invoked if a PegasusLite job runs in the local universe, made assumptions about the PATH variable when locating the Pegasus tools. This is now fixed. More details at
- Overwriting of entries with file based replica catalog

    pegasus-rc-client lfn pfn pool="local"   # inserts a new entry in the RC file
    pegasus-rc-client lfn pfn pool="usc"     # overwrites pool="local" with pool="usc"

The uniqueness constraint in the file RC has been updated to also consider the site attribute. More details at
- pegasus-statistics failed on workflows with a large number of sub workflows
pegasus-statistics failed if a workflow had more than 1000 sub workflows. This was due to a SQLAlchemy issue. More details at
- Properties propagation for sub workflows
There was a bug with properties propagation for hierarchical workflows when using PegasusLite for some sub workflows and sharedfs for others. This is partially fixed.