2.9. Information Catalogs

The workflow description (DAX) that you specify to Pegasus is portable: it usually does not contain the locations of physical input files, executables, or the cluster endpoints where jobs are executed. Pegasus resolves these using three information catalogs during the planning process.

Figure 2.14. Information Catalogs used by Pegasus


2.9.1. The Site Catalog

The site catalog describes the sites where the workflow jobs are to be executed. In this tutorial we assume that you have a Personal Condor pool running on localhost. If you are using one of the tutorial VMs, this has already been set up for you. The site catalog for the tutorial examples is in sites.xml:

$ more sites.xml
...
   <!-- The local site contains information about the submit host -->
    <!-- The arch and os keywords are used to match binaries in the transformation catalog -->
    <site handle="local" arch="x86_64" os="LINUX">

        <!-- These are the paths on the submit host where Pegasus stores data -->
        <!-- Scratch is where temporary files go -->
        <directory type="shared-scratch" path="/home/tutorial/scratch">
            <file-server operation="all" url="file:///home/tutorial/scratch"/>
        </directory>

        <!-- Storage is where pegasus stores output files -->
        <directory type="local-storage" path="/home/tutorial/output">
            <file-server operation="all" url="file:///home/tutorial/output"/>
        </directory>
    </site>

...
      

Note

By default (unless specified in properties), Pegasus picks up the site catalog from an XML file named sites.xml in the current working directory from where pegasus-plan is invoked.
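
If your site catalog lives elsewhere, you can point Pegasus at it via the properties file. A minimal sketch, assuming the pegasus.catalog.site.file property and a hypothetical path:

# pegasus.properties
# point Pegasus at a site catalog outside the current working directory
pegasus.catalog.site.file = /home/tutorial/config/sites.xml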

There are two sites defined in the site catalog: "local" and "condorpool". The "local" site is used by Pegasus to learn about the submit host where the workflow management system runs. The "condorpool" site is the Condor pool configured on your submit machine. In the case of the tutorial VM, the local site and the condorpool site refer to the same machine, but they are logically separate as far as Pegasus is concerned.

  1. The local site is configured with a "storage" file system that is mounted on the submit host (indicated by the file:// URL). This file system is where the output data from the workflow will be stored. When the workflow is planned we will tell Pegasus that the output site is "local".

  2. The condorpool site is also configured with a "scratch" file system. This file system is where the working directory will be created. When we plan the workflow we will tell Pegasus that the execution site is "condorpool".

Pegasus supports many different file transfer protocols. In this case the Pegasus configuration is set up so that input and output files are transferred to/from the condorpool site by Condor. This is done by setting pegasus.data.configuration = condorio in the properties file. In a normal Condor pool, this causes job input/output files to be transferred between the submit host and the worker nodes by Condor. In the case of the tutorial VM, this configuration is just a fancy way to copy files from the workflow scratch directory to the job scratch directory.
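
For reference, the property line that enables this looks like the following in the properties file (surrounding settings omitted, comment illustrative):

# pegasus.properties
# let Condor stage job input/output files between the submit host and the workers
pegasus.data.configuration = condorio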

Finally, the condorpool site is configured with two profiles that tell Pegasus that it is a plain Condor pool. Pegasus supports many ways of submitting tasks to a remote cluster. In this configuration it will submit vanilla Condor jobs.
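
The condorpool entry itself is elided from the sites.xml listing above. As a sketch, assuming the usual profiles for a plain Condor pool (Condor style, vanilla universe) and illustrative paths, such an entry looks roughly like:

    <site handle="condorpool" arch="x86_64" os="LINUX">

        <!-- scratch space on the pool where the workflow working directory is created -->
        <directory type="shared-scratch" path="/home/tutorial/scratch">
            <file-server operation="all" url="file:///home/tutorial/scratch"/>
        </directory>

        <!-- these two profiles tell Pegasus to submit plain (vanilla) Condor jobs -->
        <profile namespace="pegasus" key="style">condor</profile>
        <profile namespace="condor" key="universe">vanilla</profile>
    </site>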

2.9.1.1. HPC Clusters

Typically the sites in the site catalog describe remote clusters, such as PBS clusters or Condor pools.

A typical deployment of an HPC cluster is illustrated below. For each cluster (site), the site catalog captures:

  • directories that can be used for executing jobs

  • whether a shared file system is available

  • file servers to use for staging input data and staging out output data

  • headnode of the cluster to which jobs can be submitted.

Figure 2.15. Sample HPC Cluster Setup


Below is a sample site catalog entry for an HPC cluster at SDSC that is part of XSEDE:

<site  handle="sdsc-gordon" arch="x86_64" os="LINUX">
        <grid  type="gt5" contact="gordon-ln4.sdsc.xsede.org:2119/jobmanager-fork" scheduler="Fork" jobtype="auxillary"/>
        <grid  type="gt5" contact="gordon-ln4.sdsc.xsede.org:2119/jobmanager-pbs" scheduler="unknown" jobtype="compute"/>

        <!-- the base directory where workflow jobs will execute for the site -->
        <directory type="shared-scratch" path="/oasis/scratch/ux454281/temp_project">
            <file-server operation="all" url="gsiftp://oasis-dm.sdsc.xsede.org:2811/oasis/scratch/ux454281/temp_project"/>
        </directory>

        <profile namespace="globus" key="project">TG-STA110014S</profile>
        <profile namespace="env" key="PEGASUS_HOME">/home/ux454281/software/pegasus/pegasus-4.5.0</profile>
</site>

2.9.2. The Transformation Catalog

The transformation catalog describes all of the executables (called "transformations") used by the workflow. This description includes the site(s) where they are located, the architecture and operating system they are compiled for, and any other information required to properly transfer them to the execution site and run them.

For this tutorial, the transformation catalog is in the file tc.txt:

$ more tc.txt
tr wc {
    site condorpool {
        pfn "/usr/bin/wc"
        arch "x86_64"
        os "linux"
        type "INSTALLED"
    }
}
...

Note

By default (unless specified in properties), Pegasus picks up the transformation catalog from a text file named tc.txt in the current working directory from where pegasus-plan is invoked.
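
As with the site catalog, this default location can be overridden in the properties file. A minimal sketch, assuming the pegasus.catalog.transformation.file property and a hypothetical path:

# pegasus.properties
# point Pegasus at a transformation catalog outside the current working directory
pegasus.catalog.transformation.file = /home/tutorial/config/tc.txt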

The tc.txt file contains information about two transformations: wc, and split. These two transformations are referenced in the split DAX. The transformation catalog indicates that both transformations are installed on the condorpool site, and are compiled for x86_64 Linux.
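
The split entry is elided from the listing above; it follows the same pattern as wc. Assuming split is installed at its usual system location, the entry would look roughly like this:

tr split {
    site condorpool {
        pfn "/usr/bin/split"
        arch "x86_64"
        os "linux"
        type "INSTALLED"
    }
}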

2.9.3. The Replica Catalog

Note: Replica Catalog configuration is not required for the tutorial setup. It is only required if you want to refer to input files on external servers.

The example that you ran was configured with the inputs already present in a directory on the submit host (where Pegasus is installed). If you have inputs on external servers, then you can specify the URLs to those input files in the Replica Catalog. This catalog tells Pegasus where to find each of the input files for the workflow.

All files in a Pegasus workflow are referred to in the DAX using their Logical File Name (LFN). These LFNs are mapped to Physical File Names (PFNs) when Pegasus plans the workflow. This level of indirection enables Pegasus to map abstract DAXes to different execution sites and plan out the required file transfers automatically.

The Replica Catalog for the split workflow is in the rc.txt file:

$ more rc.txt
# This is the replica catalog. It lists information about each of the
# input files used by the workflow. You can use this to specify locations to input files present on external servers.

# The format is:
# LFN     PFN    site="SITE"
#
# For example:
#data.txt  file:///tmp/data.txt         site="local"
#data.txt  http://example.org/data.txt  site="example"
pegasus.html file:///home/tutorial/split/input/pegasus.html   site="local"

Note

By default (unless specified in properties), Pegasus picks up the replica catalog from a text file named rc.txt in the current working directory from where pegasus-plan is invoked. In our tutorial, the input files are on the submit host and we used the --input-dir option to pegasus-plan to specify where they are located.
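
A sketch of how such a pegasus-plan invocation might look, using the 4.x command-line options (file, directory, and site names follow the tutorial layout and are illustrative):

$ pegasus-plan --conf pegasus.properties \
      --dax split.dax \
      --sites condorpool \
      --output-site local \
      --input-dir input \
      --submit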

This replica catalog contains only one entry for the split workflow’s only input file. This entry has an LFN of "pegasus.html" with a PFN of "file:///home/tutorial/split/input/pegasus.html" and the file is stored on the local site, which implies that it will need to be transferred to the condorpool site when the workflow runs.