Pegasus 4.1.0 Released

This is a major release of Pegasus that adds support for PMC (pegasus-mpi-cluster), which can be used to run the tasks in a clustered job in parallel on remote machines using MPI. As part of this release, the support for submitting workflows using CondorC has been updated. The Pegasus Tutorial has also been updated and is available to run on

   – Amazon EC2
   – FutureGrid
   – Local machine using VirtualBox
NEW FEATURES
—————————–
  1. pegasus-mpi-cluster
    Pegasus has support for a new clustering executable called pegasus-mpi-cluster (PMC) that allows users to run the tasks in a clustered job in parallel using MPI on the remote node. The input format for PMC is a DAG based format similar to Condor DAGMan's. PMC follows the dependencies specified in the DAG to release the jobs in the right order and executes parallel jobs via the workers when possible. The input file for PMC is automatically generated by the Pegasus Planner when generating the executable workflow.
    To use PMC, set the property pegasus.clusterer.job.aggregator to mpiexec.
    Also, you may need to put an entry in your transformation catalog for pegasus::mpiexec to point to the location of the PMC executable on the remote side.
     More details can be found in the man page for pegasus-mpi-cluster and in the clustering chapter.
     There is an XSEDE example in the examples directory that shows how to use PMC on XSEDE.
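     A minimal sketch of the two pieces of configuration described above, assuming PMC is installed at /usr/bin/pegasus-mpi-cluster on a remote site named hpc-cluster (the path and site name are placeholders), could look like this:

        # properties file: cluster jobs using PMC
        pegasus.clusterer.job.aggregator = mpiexec

        # transformation catalog (text format): where PMC lives on the remote site
        tr pegasus::mpiexec {
            site hpc-cluster {
                pfn "/usr/bin/pegasus-mpi-cluster"
                arch "x86_64"
                os "LINUX"
                type "INSTALLED"
            }
        }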
  2. Use of new client pegasus-gridftp in pegasus-create-dir and pegasus-cleanup
    Starting with release 4.1, the pegasus create dir and cleanup clients use a Java based client called pegasus-gridftp to create directories on, and remove files from, a GridFTP server. Pegasus by default now adds a DAGMan category named cleanup for all cleanup jobs in the workflow. The maxjobs for this category is set to 4 by default.
    This can be overridden by specifying the property
        dagman.cleanup.maxjobs
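     For example, to allow up to 10 cleanup jobs to run at a time (the value 10 is only an illustration), set the following in the properties file:

        dagman.cleanup.maxjobs = 10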
  3. Support for CondorC
    The support for CondorC in Pegasus has been updated. Users can associate a pegasus profile named style with value condorc with a site in the site catalog to indicate that submission to the site has to be achieved using CondorC.
    The site catalog entry should mention the grid gateways to indicate the remote schedd to which the jobs need to be submitted, and the Condor collector for the CondorC site. Specifying the Condor collector is optional; if it is not specified, Pegasus will use the contact mentioned in the grid gateway.
       An example snippet with the relevant entries is below:
          <site handle="isi-condorc" arch="x86" os="LINUX">
              <grid type="condor" contact="ccg-testing1.isi.edu" scheduler="Condor" jobtype="compute" total-nodes="50"/>
              <grid type="condor" contact="ccg-testing1.isi.edu" scheduler="Condor" jobtype="auxillary" total-nodes="50"/>
              <head-fs>
                  <scratch>
                      <shared>
                          <file-server protocol="file" url="file://" mount-point="/nfs/ccg3/scratch/bamboo/scratch/"/>
                          <internal-mount-point mount-point="/nfs/ccg3/scratch/bamboo/scratch/"/>
                      </shared>
                  </scratch>
                  <storage>
                      <shared>
                          <file-server protocol="file" url="file://" mount-point="/nfs/ccg3/scratch/bamboo/storage/"/>
                          <internal-mount-point mount-point="/nfs/ccg3/scratch/testing/bamboo/storage"/>
                      </shared>
                  </storage>
              </head-fs>
              <replica-catalog type="LRC" url="rlsn://dummyValue.url.edu" />

              <!-- specify which condor collector to use -->
              <profile namespace="condor" key="condor_collector">ccg-testing1.isi.edu</profile>

              <!-- submission to this site is using condorc -->
              <profile namespace="pegasus" key="style">condorc</profile>
              <profile namespace="condor" key="should_transfer_files">Yes</profile>
              <profile namespace="condor" key="when_to_transfer_output">ON_EXIT</profile>
              <profile namespace="env" key="PEGASUS_HOME">/usr</profile>
              <profile namespace="condor" key="universe">vanilla</profile>
          </site>
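     Once such an entry is in place, the workflow is planned against the CondorC site like any other site. A hedged example invocation (the DAX file name and output site are placeholders; adjust to your setup):

        pegasus-plan --conf pegasus.properties --dax workflow.dax --sites isi-condorc --output local --submit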
  4. Updated the Pegasus Tutorial
       The Pegasus Tutorial has now been updated and is available to run on
       – Amazon EC2
       – FutureGrid
       – Local machine using VirtualBox
  5. Changed the default transfer refiner for Pegasus
    The default transfer refiner in Pegasus now clusters both stagein and stageout jobs per level of the workflow. The previous version clustered stagein jobs per workflow and stageout jobs per level of the workflow.
    More details can be found at
  6.  pegasus-statistics has a new -f option
     The -f option can be used to specify the output format for pegasus-statistics. The valid supported formats are txt and csv.
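     For instance, to generate the statistics in CSV format (the submit directory path is a placeholder):

        pegasus-statistics -s all -f csv /path/to/workflow/submit/dir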
  7. Updated condor periodic_release and periodic_remove expressions
    Earlier, Pegasus set the default periodic_release and periodic_remove expressions as follows:
    periodic_release = (NumSystemHolds <= 3)
    periodic_remove = (NumSystemHolds > 3)
    This had the effect of removing the jobs as soon as they went into the held state.
    Starting with 4.1, the expressions have been updated to:
    periodic_release = False
    periodic_remove = (JobStatus == 5) && ((CurrentTime - EnteredCurrentStatus) > 14400)
    With this, a job remains in the held state for 4 hours (14400 seconds) before being removed. The idea is that this is a long enough time for users to debug held jobs.
    If users wish to use the previous expressions, they can do so by specifying the condor profile keys periodic_release and periodic_remove.
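     For example, the previous behavior could be restored by associating the corresponding Condor profiles, e.g. in the properties file (profiles can also be set in the site catalog or DAX; the properties form is shown here as a sketch):

        condor.periodic_release = (NumSystemHolds <= 3)
        condor.periodic_remove = (NumSystemHolds > 3)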
  8. Property to turn off registration jobs
    Pegasus now exposes a boolean property pegasus.register that can be used to turn off the creation of registration jobs.
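     For example, to disable registration jobs add the following to the properties file:

        pegasus.register = false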
  9. More descriptive errors if incomplete site catalog specified
    Earlier, an incomplete site catalog caused NPEs (NullPointerExceptions) when running pegasus-plan. These have been replaced by more descriptive errors that give users enough information to figure out the missing entries in the site catalog.
    More details at
  10. Change in DAX schema
    The DAX schema version is now 3.4. The schema now allows file sizes to be specified as a size attribute on the uses element that lists the input and output files for a job.
    The DAX Generator APIs have been updated accordingly.
     This is useful for users extending the Pegasus code for their specific research use cases.
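     A hedged sketch of what the size attribute might look like on the uses elements of a job in a DAX 3.4 document (the job, file names and sizes are made up for illustration):

        <job id="ID0000001" name="analyze">
            <uses name="input.dat" link="input" size="1048576"/>
            <uses name="output.dat" link="output" size="2097152"/>
        </job>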
  11. Prototype support for SHIWA bundles
    pegasus-plan has a new option --shiwa-bundle that allows users to pass a Pegasus SHIWA bundle for execution. A Pegasus SHIWA bundle is a bundle that has been generated using the Pegasus Plugin for the SHIWA Desktop.
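     A hedged example of the option (the bundle file name is a placeholder, and additional pegasus-plan options may still be required):

        pegasus-plan --conf pegasus.properties --shiwa-bundle myworkflow.bundle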
  12. Improved performance for the expunge operation against a MySQL database
    When monitord is run in replay mode, the database is first expunged of all the information related to that workflow. In the case of a MySQL backend, where the same database may be used to track multiple hierarchical workflows, the expunge operation has to be careful to delete only the relevant entries from the various tables.
    In earlier versions, this expunge operation was implemented at the ORM level in SQLAlchemy, which led to a large number of select and delete statements being executed (one per entry). This blew up the memory footprint of monitord and prevented workflow population in the case of large databases. For 4.1, we changed the schema to add cascaded delete clauses, and set the passive delete option to true in SQLAlchemy.
    More details
  13. Runtime Clustering picks up pegasus profile key named runtime
    Starting with 4.1, runtime clustering in Pegasus picks up the pegasus profile key runtime instead of job.runtime. job.runtime is deprecated, and a message is logged if a user has it specified.

    The planner picks up job.runtime only if runtime is not specified for a job.
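    A hedged example of attaching the profile to a job in the DAX (the value is the expected runtime of the job in seconds; the job and value are placeholders):

        <job id="ID0000002" name="simulate">
            <profile namespace="pegasus" key="runtime">600</profile>
        </job>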

BUGS FIXED
—————-
  1. pegasus-lite-local.sh made assumptions on PATH
    The pegasus-lite-local wrapper, which is invoked when a PegasusLite job runs in the local universe, made assumptions about the PATH variable to determine the location of the Pegasus tools.
    This is now fixed. More details at
  2. Overwriting of entries with the file based replica catalog
    Earlier, inserting an entry for the same LFN and PFN with a different pool attribute overwrote the existing entry:
    pegasus-rc-client lfn pfn pool="local" # Inserts new entry in RC file
    pegasus-rc-client lfn pfn pool="usc" # Overwrites pool="local" to pool="usc"
    The uniqueness constraint in the file based RC has been updated to also consider the site attribute.
    More details at
  3. pegasus-statistics failed on workflows with a large number of sub workflows
    pegasus-statistics failed if a workflow had more than 1000 sub workflows. This was due to an SQLAlchemy issue.
    More details at
  4. Properties propagation for sub workflows
    There was a bug with properties propagation for hierarchical workflows when using PegasusLite for some sub workflows and sharedfs for others.
    This is partially fixed.