Note: There is a newer version of Pegasus available. Please see the main documentation page.

Chapter 4. Creating Workflows

4.1. Abstract Workflows (DAX)

The DAX is a description of an abstract workflow in XML format that is used as the primary input into Pegasus. The DAX schema is described in dax-3.2.xsd. The documentation of the schema and its elements can be found in dax-3.2.html.

Any user can create a DAX with the DAX generating API, which is available in Java, Perl, and Python.

Note

We highly recommend using the DAX API.

Advanced users who can read XML schema definitions can also generate a DAX directly from a script.

The sample workflow below incorporates some of the elementary graph structures used in all abstract workflows.

  • fan-out, scatter, and diverge all describe the fact that multiple siblings are dependent on fewer parents.

    The example shows how the Job 2 and Job 3 nodes depend on the Job 1 node.

  • fan-in, gather, join, and converge describe how multiple siblings are merged into fewer dependent child nodes.

    The example shows how the Job 4 node depends on both Job 2 and Job 3 nodes.

  • serial execution implies that nodes are dependent on one another, like pearls on a string.

  • parallel execution implies that nodes can be executed in parallel.
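As a rough illustration of these structures, the diamond shape can be written down as a plain parent-to-children mapping. The sketch below is ordinary Python and does not use the Pegasus DAX API; the names children, parents_of and levels are made up for this example.

```python
# Hypothetical representation of the diamond workflow: parent -> children.
children = {
    "job1": ["job2", "job3"],  # fan-out: two siblings depend on one parent
    "job2": ["job4"],
    "job3": ["job4"],          # fan-in: job4 depends on both job2 and job3
    "job4": [],
}


def parents_of(graph):
    """Invert the parent -> children mapping into child -> parents."""
    parents = {node: [] for node in graph}
    for parent, kids in graph.items():
        for kid in kids:
            parents[kid].append(parent)
    return parents


def levels(graph):
    """Group nodes into levels; nodes within one level can run in parallel."""
    parents = parents_of(graph)
    done, result = set(), []
    while len(done) < len(graph):
        ready = sorted(n for n in graph
                       if n not in done and all(p in done for p in parents[n]))
        if not ready:
            raise ValueError("cycle detected; a workflow must be acyclic")
        result.append(ready)
        done.update(ready)
    return result


if __name__ == "__main__":
    print(levels(children))
```

Each inner list returned by levels holds jobs that have no dependencies on one another, so the second level of the diamond contains the two jobs that can execute in parallel.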

Figure 4.1. Sample Workflow

The example diamond workflow consists of four nodes representing jobs, which are linked by six files.

  • Required input files must be registered with the Replica Catalog in order for Pegasus to find them and integrate them into the workflow.

  • Leaf files are a product or output of a workflow. Output files can be collected at a location.

  • The remaining files all have lines leading to them and originating from them. These files are products of some job steps (lines leading to them), and consumed by other job steps (lines leading out of them). Often, these files represent intermediary results that can be cleaned.

There are two main ways of generating a DAX:

  1. Using a DAX generating API in Java, Perl or Python.

    Note: We recommend this option.

  2. Generating XML directly from your script.

    Note: This option should only be considered by advanced users who can also read XML schema definitions.

A DAX representing the example workflow can look like the following:

<?xml version="1.0" encoding="UTF-8"?>
<!-- generated: 2010-11-22T22:55:08Z -->
<adag xmlns="http://pegasus.isi.edu/schema/DAX"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://pegasus.isi.edu/schema/DAX http://pegasus.isi.edu/schema/dax-3.2.xsd"
      version="3.2" name="diamond" index="0" count="1">
  <!-- part 2: definition of all jobs (at least one) -->
  <job namespace="diamond" name="preprocess" version="2.0" id="ID000001">
    <argument>-a preprocess -T60 -i <file name="f.a" /> -o <file name="f.b1" /> <file name="f.b2" /></argument>
    <uses name="f.b2" link="output" register="false" transfer="false" />
    <uses name="f.b1" link="output" register="false" transfer="false" />
    <uses name="f.a" link="input" />
  </job>
  <job namespace="diamond" name="findrange" version="2.0" id="ID000002">
    <argument>-a findrange -T60 -i <file name="f.b1" /> -o <file name="f.c1" /></argument>
    <uses name="f.b1" link="input" register="false" transfer="false" />
    <uses name="f.c1" link="output" register="false" transfer="false" />
  </job>
  <job namespace="diamond" name="findrange" version="2.0" id="ID000003">
    <argument>-a findrange -T60 -i <file name="f.b2" /> -o <file name="f.c2" /></argument>
    <uses name="f.c2" link="output" register="false" transfer="false" />
    <uses name="f.b2" link="input" register="false" transfer="false" />
  </job>
  <job namespace="diamond" name="analyze" version="2.0" id="ID000004">
    <argument>-a analyze -T60 -i <file name="f.c1" /> <file name="f.c2" /> -o <file name="f.d" /></argument>
    <uses name="f.c2" link="input" register="false" transfer="false" />
    <uses name="f.d" link="output" register="false" transfer="true" />
    <uses name="f.c1" link="input" register="false" transfer="false" />
  </job>
  <!-- part 3: list of control-flow dependencies -->
  <child ref="ID000002">
    <parent ref="ID000001" />
  </child>
  <child ref="ID000003">
    <parent ref="ID000001" />
  </child>
  <child ref="ID000004">
    <parent ref="ID000002" />
    <parent ref="ID000003" />
  </child>
</adag>

The DAX representation of the example workflow requires external catalogs: a transformation catalog (TC) to resolve the logical job names (such as diamond::preprocess:2.0), and a replica catalog (RC) to resolve the input file f.a. The above workflow defines the four jobs just like the example picture, and the files that flow between the jobs. The intermediary files are neither registered nor staged out, and can be considered transient. Only the final result file f.d is staged out.
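For readers who choose to generate the XML directly from a script, the following is a minimal sketch using only the Python standard library. It is not the Pegasus DAX API, which remains the recommended route. The helper name build_dax and the JOBS table are made up for this example, the argument elements are left out for brevity, and the rule that only files consumed by no other job get transfer="true" is an assumption modelled on the example above.

```python
import xml.etree.ElementTree as ET

DAX_NS = "http://pegasus.isi.edu/schema/DAX"

# (id, transformation name, input files, output files), as in the example DAX.
JOBS = [
    ("ID000001", "preprocess", ["f.a"], ["f.b1", "f.b2"]),
    ("ID000002", "findrange", ["f.b1"], ["f.c1"]),
    ("ID000003", "findrange", ["f.b2"], ["f.c2"]),
    ("ID000004", "analyze", ["f.c1", "f.c2"], ["f.d"]),
]


def build_dax(jobs, name="diamond"):
    """Build an <adag> element; <argument> elements are omitted for brevity."""
    adag = ET.Element("adag", {"xmlns": DAX_NS, "version": "3.2",
                               "name": name, "index": "0", "count": "1"})
    consumed = {f for _, _, inputs, _ in jobs for f in inputs}
    producer = {}  # file name -> id of the job that writes it
    for job_id, tr_name, inputs, outputs in jobs:
        job = ET.SubElement(adag, "job", {"namespace": name, "name": tr_name,
                                          "version": "2.0", "id": job_id})
        for f in inputs:
            ET.SubElement(job, "uses", {"name": f, "link": "input"})
        for f in outputs:
            # Assumption: only files that no other job consumes are staged out.
            transfer = "false" if f in consumed else "true"
            ET.SubElement(job, "uses", {"name": f, "link": "output",
                                        "register": "false",
                                        "transfer": transfer})
            producer[f] = job_id
    # Derive the control-flow edges from the data flow: a job depends on
    # the producers of its input files.
    for job_id, _, inputs, _ in jobs:
        parent_ids = sorted({producer[f] for f in inputs if f in producer})
        if parent_ids:
            child = ET.SubElement(adag, "child", {"ref": job_id})
            for parent_id in parent_ids:
                ET.SubElement(child, "parent", {"ref": parent_id})
    return adag


if __name__ == "__main__":
    print(ET.tostring(build_dax(JOBS), encoding="unicode"))
```

The output should still be checked against dax-3.2.xsd before it is handed to Pegasus, since this sketch does not validate anything itself.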

4.2. Data Discovery (Replica Catalog)

The Replica Catalog keeps mappings of logical file ids/names (LFNs) to physical file ids/names (PFNs). A single LFN can map to several PFNs. A PFN consists of a URL with protocol, host and port information and a path to a file. Additional key/value attributes can be stored along with each PFN.

Pegasus supports the following implementations of the Replica Catalog:

  1. File (Default)

  2. Database via JDBC

  3. Replica Location Service

    • RLS

    • LRC

  4. MRC

4.2.1. File

In this mode, Pegasus queries a file-based replica catalog. The file uses a simple multicolumn format. It is not transactionally safe and is not recommended for production use: multiple concurrent instances will conflict with each other. The site attribute should be specified whenever possible. The attribute key for the site attribute is "pool".

LFN PFN
LFN PFN a=b [..]
LFN PFN a="b" [..]
"LFN w/LWS" "PFN w/LWS" [..]
      

The LFN may or may not be quoted. If it contains linear whitespace, quotes, a backslash or an equal sign, it must be quoted and escaped. The same conditions apply to the PFN. The attribute key-value pairs are separated by an equal sign without any whitespace. The value may be quoted; the quoting rules for the LFN apply to it as well.
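To make the line format more concrete, here is a small sketch of a reader for such a file. It is not the parser Pegasus uses. It relies on Python's shlex module as an approximation of the quoting rules described above (shlex also treats single quotes as quoting characters, which the catalog format may not), and the function name parse_rc_line is made up for this example.

```python
import shlex


def parse_rc_line(line):
    """Parse one line into (lfn, pfn, attributes); None for blanks/comments."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    # shlex.split handles "quoted tokens" and backslash escapes.
    tokens = shlex.split(line)
    if len(tokens) < 2:
        raise ValueError("expected at least an LFN and a PFN: %r" % line)
    lfn, pfn = tokens[0], tokens[1]
    attrs = {}
    for token in tokens[2:]:
        key, sep, value = token.partition("=")
        if not sep or not key:
            raise ValueError("expected key=value, got %r" % token)
        attrs[key] = value
    return lfn, pfn, attrs


if __name__ == "__main__":
    sample = 'f.a gsiftp://somehost:port/path/to/file/f.a pool=local'
    print(parse_rc_line(sample))
    print(parse_rc_line('"LFN w/LWS" "PFN w/LWS" a="b"'))
```

The site of a replica can then be read from the "pool" attribute of the returned dictionary.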

The File mode is the default mode. To use the File mode, set the following properties:

  1. pegasus.catalog.replica=File

  2. pegasus.catalog.replica.file=<path to the replica catalog file>

4.2.2. JDBCRC

In this mode, Pegasus queries a SQL-based replica catalog that is accessed via JDBC. The SQL schemas for this catalog can be found in the $PEGASUS_HOME/sql directory. You will have to install the schema into either PostgreSQL or MySQL by running the appropriate commands to load the two schemas create-XX-init.sql and create-XX-rc.sql, where XX is either my (for MySQL) or pg (for PostgreSQL).

To use JDBCRC, the user additionally needs to set the following properties:

  1. pegasus.catalog.replica.db.url=<jdbc url to the database>

  2. pegasus.catalog.replica.db.user=<database user>

  3. pegasus.catalog.replica.db.password=<database password>

4.2.3. Replica Location Service

Replica Location Service (RLS) is a distributed replica catalog that ships with Globus. There is an index service called the Replica Location Index (RLI), to which one or more Local Replica Catalogs (LRCs) report. Each LRC can contain all or a subset of the mappings.

Details about RLS can be found at http://www.globus.org/toolkit/data/rls/

4.2.3.1. RLS

In this mode, Pegasus queries the central RLI to discover in which LRCs the mappings for an LFN reside. It then queries the individual LRCs for the PFNs. To use this mode, the following properties need to be set:

  1. pegasus.catalog.replica=RLS

  2. pegasus.catalog.replica.url=<url to the globus RLI>

4.2.3.2. LRC

This mode is available if the user does not want to query the RLI (Replica Location Index), but instead wishes to directly query a single Local Replica Catalog. To use the LRC mode, the following properties need to be set:

  1. pegasus.catalog.replica=LRC

  2. pegasus.catalog.replica.url=<url to the globus LRC>

Details about Globus Replica Catalog and LRC can be found at http://www.globus.org/toolkit/data/rls/

4.2.4. MRC

In this mode, Pegasus queries multiple replica catalogs to discover the file locations on the grid.

To use it, set:

  1. pegasus.catalog.replica=MRC

Each associated replica catalog can be configured via properties as follows.

The user associates a variable name, referred to as [value], with each of the catalogs, where [value] is any legal identifier (concretely [A-Za-z][_A-Za-z0-9]*). For each associated replica catalog, the user specifies the following properties:

  • pegasus.catalog.replica.mrc.[value] - specifies the type of replica catalog.

  • pegasus.catalog.replica.mrc.[value].key - specifies a property name key for a particular catalog

For example, to query two LRCs at the same time, specify the following:

  • pegasus.catalog.replica.mrc.lrc1=LRC

  • pegasus.catalog.replica.mrc.lrc1.url=<url to the 1st globus LRC>

  • pegasus.catalog.replica.mrc.lrc2=LRC

  • pegasus.catalog.replica.mrc.lrc2.url=<url to the 2nd globus LRC>

In the above example, lrc1 and lrc2 can be any valid identifier names, and url is the property key that needs to be specified.
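The naming scheme can be illustrated with a short sketch that groups such properties by catalog name. This is only an illustration of the key layout, not how Pegasus reads its properties; the function name group_mrc and the example URLs are hypothetical.

```python
import re

PREFIX = "pegasus.catalog.replica.mrc."
IDENT = re.compile(r"[A-Za-z][_A-Za-z0-9]*")  # legal identifier, as above


def group_mrc(props):
    """Return {name: {"type": ..., "keys": {...}}} from a flat property dict."""
    catalogs = {}
    for full_key, value in props.items():
        if not full_key.startswith(PREFIX):
            continue
        # "lrc1" -> catalog type, "lrc1.url" -> property key "url" for lrc1
        name, _, key = full_key[len(PREFIX):].partition(".")
        if not IDENT.fullmatch(name):
            raise ValueError("not a legal identifier: %r" % name)
        entry = catalogs.setdefault(name, {"type": None, "keys": {}})
        if key:
            entry["keys"][key] = value
        else:
            entry["type"] = value
    return catalogs


if __name__ == "__main__":
    props = {
        "pegasus.catalog.replica": "MRC",
        "pegasus.catalog.replica.mrc.lrc1": "LRC",
        "pegasus.catalog.replica.mrc.lrc1.url": "rlsn://lrc1.example.org",
        "pegasus.catalog.replica.mrc.lrc2": "LRC",
        "pegasus.catalog.replica.mrc.lrc2.url": "rlsn://lrc2.example.org",
    }
    for name, entry in sorted(group_mrc(props).items()):
        print(name, entry["type"], entry["keys"])
```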

4.2.4.1. Replica Catalog Client pegasus-rc-client

The client used to interact with the Replica Catalogs is pegasus-rc-client. The implementation that the client talks to is configured using Pegasus properties.

Let's assume you create a file f.a in your home directory as shown below.

$ date > $HOME/f.a 

We now need to register this file in the File replica catalog located in $HOME/rc using the pegasus-rc-client. Replace the gsiftp://url with the appropriate parameters for your grid site.

$ pegasus-rc-client -Dpegasus.catalog.replica=File -Dpegasus.catalog.replica.file=$HOME/rc insert \
 f.a gsiftp://somehost:port/path/to/file/f.a pool=local

You may first want to verify that the file registration is in the replica catalog. Since we are using a File catalog, we can look at the file $HOME/rc to view the entries.

$ cat $HOME/rc
    
# file-based replica catalog: 2010-11-10T17:52:53.405-07:00
f.a gsiftp://somehost:port/path/to/file/f.a pool=local

The above line shows that the entry for file f.a was made correctly.

You can also use the pegasus-rc-client to look for entries.

$ pegasus-rc-client -Dpegasus.catalog.replica=File -Dpegasus.catalog.replica.file=$HOME/rc lookup LFN f.a

f.a gsiftp://somehost:port/path/to/file/f.a pool=local

4.3. Resource Discovery (Site Catalog)

The Site Catalog describes the compute resources (which are often clusters) that we intend to run the workflow upon. A site is a homogeneous part of a cluster that has at least a single GRAM gatekeeper with a jobmanager-fork and jobmanager-<scheduler> interface and at least one gridftp server along with a shared file system. The GRAM gatekeeper can be either WS GRAM or Pre-WS GRAM. A site can also be a condor pool or glidein pool with a shared file system.

Pegasus currently supports three implementations of the Site Catalog:

  1. XML3 (Default)

  2. XML (Deprecated)

  3. File (Deprecated)

4.3.1. XML3

This is the default format for Pegasus 3.0. This format allows defining file systems of shared as well as local type on the head node of the remote cluster as well as on the backend nodes.

Figure 4.2. Schema Image of the Site Catalog XML 3

Below is an example of the XML3 site catalog:

<sitecatalog xmlns="http://pegasus.isi.edu/schema/sitecatalog" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://pegasus.isi.edu/schema/sitecatalog 
http://pegasus.isi.edu/schema/sc-3.0.xsd" version="3.0">
  <site  handle="isi" arch="x86" os="LINUX" osrelease="" osversion="" glibc="">
      <grid  type="gt2" contact="smarty.isi.edu/jobmanager-pbs" scheduler="PBS" jobtype="auxillary"/>
      <grid  type="gt2" contact="smarty.isi.edu/jobmanager-pbs" scheduler="PBS" jobtype="compute"/>
          <head-fs>
               <scratch>
                  <shared>
                     <file-server protocol="gsiftp" url="gsiftp://skynet-data.isi.edu"
                                  mount-point="/nfs/scratch01" />
                     <internal-mount-point mount-point="/nfs/scratch01"/>
                  </shared>
               </scratch>
               <storage>
                  <shared>
                     <file-server protocol="gsiftp" url="gsiftp://skynet-data.isi.edu" 
                                  mount-point="/exports/storage01"/>
                     <internal-mount-point mount-point="/exports/storage01"/>
                  </shared>
               </storage>
          </head-fs>
      <replica-catalog  type="LRC" url="rlsn://smarty.isi.edu"/>
      <profile namespace="env" key="PEGASUS_HOME" >/nfs/vdt/pegasus</profile>
      <profile namespace="env" key="GLOBUS_LOCATION" >/vdt/globus</profile>
  </site>
</sitecatalog>

Described below are some of the entries in the site catalog.

  1. site - A site identifier.

  2. replica-catalog - URL for a local replica catalog (LRC) to register your files in. Only used for the RLS implementation of the RC. This is optional.

  3. File Systems - Information about the file systems mounted on the remote cluster's head node or worker nodes. There are several configurations:

    • head-fs/scratch - This describes the scratch file systems (temporary for execution) available on the head node.

    • head-fs/storage - This describes the storage file systems (long term) available on the head node.

    • worker-fs/scratch - This describes the scratch file systems (temporary for execution) available on the worker node.

    • worker-fs/storage - This describes the storage file systems (long term) available on the worker node.

    Each scratch and storage entry can contain two sub-entries:

    • SHARED for shared file systems like NFS, LUSTRE etc.

    • LOCAL for local file systems (local to the node/machine)

    Each of the file systems is defined by using a file-server element. The protocol attribute defines the protocol used to access the files, url defines the URL prefix to obtain the files from, and mount-point is the mount point exposed by the file server.

    Along with this, an internal-mount-point needs to be defined to access the files directly from the machine without any file servers.

  4. arch, os, osrelease, osversion, glibc - The arch/os/osrelease/osversion/glibc of the site. OSRELEASE, OSVERSION and GLIBC are optional.

    ARCH can have one of the following values: X86, X86_64, SPARCV7, SPARCV9, AIX, PPC.

    OS can have one of the following values: LINUX, SUNOS, MACOSX. The default value for sysinfo, if none is specified, is X86::LINUX.

  5. Profiles - One or many profiles can be attached to a pool.

    One example is the environment to be set on a remote pool.

To use this site catalog, the following properties need to be set:

  1. pegasus.catalog.site=XML3

  2. pegasus.catalog.site.file=<path to the site catalog file>
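When checking a site catalog by hand, it can help to list what each site declares. The sketch below reads a trimmed copy of the XML3 example above with Python's standard XML parser. It is not a Pegasus tool, and the function name summarize is made up for this example.

```python
import xml.etree.ElementTree as ET

SC_NS = "http://pegasus.isi.edu/schema/sitecatalog"
NS = {"sc": SC_NS}

# Trimmed copy of the XML3 example shown earlier in this section.
SITE_CATALOG = """\
<sitecatalog xmlns="http://pegasus.isi.edu/schema/sitecatalog" version="3.0">
  <site handle="isi" arch="x86" os="LINUX">
    <grid type="gt2" contact="smarty.isi.edu/jobmanager-pbs"
          scheduler="PBS" jobtype="compute"/>
    <head-fs>
      <scratch>
        <shared>
          <file-server protocol="gsiftp" url="gsiftp://skynet-data.isi.edu"
                       mount-point="/nfs/scratch01"/>
          <internal-mount-point mount-point="/nfs/scratch01"/>
        </shared>
      </scratch>
    </head-fs>
    <profile namespace="env" key="PEGASUS_HOME">/nfs/vdt/pegasus</profile>
  </site>
</sitecatalog>
"""


def summarize(xml_text):
    """Return {handle: {...}} with the file servers and profiles per site."""
    root = ET.fromstring(xml_text)
    sites = {}
    for site in root.findall("sc:site", NS):
        # Collect (url prefix, mount point) pairs from every file-server.
        servers = [(fs.get("url"), fs.get("mount-point"))
                   for fs in site.iter("{%s}file-server" % SC_NS)]
        profiles = {(p.get("namespace"), p.get("key")): (p.text or "").strip()
                    for p in site.findall("sc:profile", NS)}
        sites[site.get("handle")] = {"arch": site.get("arch"),
                                     "os": site.get("os"),
                                     "file_servers": servers,
                                     "profiles": profiles}
    return sites


if __name__ == "__main__":
    for handle, info in summarize(SITE_CATALOG).items():
        print(handle, info)
```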

4.3.2. XML

Warning

This format is now deprecated in favor of the XML3 format. If you are still using the XML or File format, you should convert it to the XML3 format using the pegasus-sc-converter client.

$ cat $HOME/sites.xml

<sitecatalog xmlns="http://pegasus.isi.edu/schema/sitecatalog"
  xsi:schemaLocation="http://pegasus.isi.edu/schema/sitecatalog
  http://pegasus.isi.edu/schema/sc-2.0.xsd"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="2.0">
  <site handle="local" gridlaunch="/nfs/vdt/pegasus/bin/kickstart"
   sysinfo="INTEL32::LINUX">
    <profile namespace="env" key="PEGASUS_HOME" >/nfs/vdt/pegasus</profile>
    <profile namespace="env" key="GLOBUS_LOCATION" >/vdt/globus</profile>
    <profile namespace="env" key="LD_LIBRARY_PATH" >/vdt/globus/lib</profile>
    <profile namespace="env" key="JAVA_HOME" >/vdt/java</profile>
    <lrc url="rlsn://localhost" />
    <gridftp  url="gsiftp://localhost" storage="/$HOME/storage" major="4" minor="0"
     patch="5">
    </gridftp>
    <jobmanager universe="transfer" url="localhost/jobmanager-fork" major="4" minor="0"
     patch="5" />
    <jobmanager universe="vanilla" url="localhost/jobmanager-fork" major="4" minor="0"
     patch="5" />
    <workdirectory >$HOME/workdir</workdirectory>
  </site>
  <site handle="clus1" gridlaunch="/opt/nfs/vdt/pegasus/bin/kickstart"
   sysinfo="INTEL32::LINUX">
    <profile namespace="env" key="PEGASUS_HOME" >/opt/nfs/vdt/pegasus</profile>
    <profile namespace="env" key="GLOBUS_LOCATION" >/opt/vdt/globus</profile>
    <profile namespace="env" key="LD_LIBRARY_PATH" >/opt/vdt/globus/lib</profile>
    <lrc url="rlsn://clus1.com" />
    <gridftp  url="gsiftp://clus1.com" storage="/jobmanager-fork" major="4" minor="0"
     patch="3">
    </gridftp>
    <jobmanager universe="transfer" url="clus1.com/jobmanager-fork" major="4" minor="0"
     patch="3" />
    <jobmanager universe="vanilla" url="clus1.com/jobmanager-pbs" major="4" minor="0"
     patch="3" />
    <workdirectory >$HOME/workdir-clus1</workdirectory>
  </site>
</sitecatalog>

The entries in this format have the following meaning:

  1. site - A site identifier.

  2. lrc - URL for a local replica catalog (LRC) to register your files in. Only used for RLS implementation of the RC

  3. workdirectory - A remote working directory (Should be on a shared file system)

  4. gridftp - A URL prefix for a remote storage location, and a path to the storage location.

  5. jobmanager - URL to the jobmanager entry points for the remote grid. Different universes are supported, which map to different batch jobmanagers.

    "vanilla" for compute jobs and "transfer" for transfer jobs are mandatory. Generally a transfer universe should map to the fork jobmanager.

  6. gridlaunch - Path to the remote kickstart tool (provenance tracking)

  7. sysinfo - The arch/os/osversion/glibc of the site. The format is ARCH::OS:OSVER:GLIBC where OSVERSION and GLIBC are optional.

    ARCH can have one of the following values INTEL32, INTEL64, SPARCV7, SPARCV9, AIX, AMD64. OS can have one of the following values LINUX,SUNOS. The default value for sysinfo if none specified is INTEL32::LINUX

  8. Profiles - One or many profiles can be attached to a pool.

    One example is the environment to be set on a remote pool.
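The sysinfo string is easy to take apart mechanically. The following sketch splits a value of the form ARCH::OS:OSVER:GLIBC and checks it against the values listed above. It is an illustration only, and the function name parse_sysinfo is made up for this example.

```python
ARCHS = {"INTEL32", "INTEL64", "SPARCV7", "SPARCV9", "AIX", "AMD64"}
OSES = {"LINUX", "SUNOS"}


def parse_sysinfo(text="INTEL32::LINUX"):
    """Split ARCH::OS:OSVER:GLIBC; OSVER and GLIBC are optional."""
    arch, sep, rest = text.partition("::")
    if not sep:
        raise ValueError("missing '::' between ARCH and OS: %r" % text)
    parts = rest.split(":")
    if len(parts) > 3:
        raise ValueError("too many fields: %r" % text)
    parts += [""] * (3 - len(parts))  # pad the optional fields
    os_name, osversion, glibc = parts
    if arch not in ARCHS:
        raise ValueError("unknown ARCH: %r" % arch)
    if os_name not in OSES:
        raise ValueError("unknown OS: %r" % os_name)
    return {"arch": arch, "os": os_name,
            "osversion": osversion or None, "glibc": glibc or None}


if __name__ == "__main__":
    print(parse_sysinfo("INTEL32::LINUX:FC4.2:3.6"))
    print(parse_sysinfo())  # falls back to the documented default
```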

To use this format you need to set the following properties:

  1. pegasus.catalog.site=XML

  2. pegasus.catalog.site.file=<path to the site catalog file>

4.3.3. Text

Warning

This format is now deprecated in favor of the XML3 format. If you are still using the File format, you should convert it to the XML3 format using the pegasus-sc-converter client.

The format of the file is as follows:

site site_id {
  #required. Can be a dummy value if using Simple File RC
  lrc "rls://someurl"

  #required on a shared file system
  workdir "path/to/a/tmp/shared/file/system/"

  #required one or more entries
  gridftp "gsiftp://hostname/mountpoint" "GLOBUS VERSION"

  #required one or more entries
  universe transfer "hostname/jobmanager-<scheduler>" "GLOBUS VERSION"

  #required one or more entries
  universe vanilla "hostname/jobmanager-<scheduler>" "GLOBUS VERSION"

  #optional
  sysinfo  "ARCH::OS:OSVER:GLIBC"

  #optional
  gridlaunch "/path/to/gridlaunch/executable"

  #optional zero or more entries
  profile namespace "key" "value"
} 

The sysinfo, gridlaunch and profile entries are optional. All the rest are required for each pool. The transfer and vanilla universes are both mandatory. You can add multiple transfer and vanilla universes if you have more than one head node on the cluster. The entries in the Site Catalog have the following meaning:

  1. site - A site identifier.

  2. lrc - URL for a local replica catalog (LRC) to register your files in. Only used for RLS implementation of the RC

  3. workdir - A remote working directory (Should be on a shared file system)

  4. gridftp - A URL prefix for a remote storage location.

  5. universe - Different universes are supported which map to different batch jobmanagers.

    "vanilla" for compute jobs and "transfer" for transfer jobs are mandatory. Generally a transfer universe should map to the fork jobmanager.

  6. gridlaunch - Path to the remote kickstart tool (provenance tracking)

  7. sysinfo - The arch/os/osversion/glibc of the site. The format is ARCH::OS:OSVER:GLIBC where OSVERSION and GLIBC are optional.

    ARCH can have one of the following values INTEL32, INTEL64, SPARCV7, SPARCV9, AIX, AMD64. OS can have one of the following values LINUX,SUNOS. The default value for sysinfo if none specified is INTEL32::LINUX

  8. Profiles - One or many profiles can be attached to a pool.

    One example is the environment to be set on a remote pool.

To use this format you need to set the following properties:

  1. pegasus.catalog.site=Text

  2. pegasus.catalog.site.file=<path to the site catalog file>

4.3.4. Site Catalog Client pegasus-sc-client

The pegasus-sc-client can be used to generate a site catalog for the Open Science Grid (OSG) by querying its monitoring interfaces, such as VORS or OSGMM. See pegasus-sc-client --help for more details.

4.3.5. Site Catalog Converter pegasus-sc-converter

Pegasus 3.0 by default now parses a Site Catalog format conforming to the SC schema 3.0 (XML3). The schema is available at http://pegasus.isi.edu/schema/sc-3.0.xsd, and the format is explained in detail in the Catalog Properties section of Running Workflows.

Pegasus 3.0 comes with a pegasus-sc-converter that will convert a user's old site catalog (XML) to the XML3 format. Sample usage is given below.

$ pegasus-sc-converter -i sample.sites.xml -I XML -o sample.sites.xml3 -O XML3

2010.11.22 12:55:14.169 PST:   Written out the converted file to sample.sites.xml3 

To use the converted site catalog, in the properties do the following:

  1. unset pegasus.catalog.site or set pegasus.catalog.site to XML3

  2. point pegasus.catalog.site.file to the converted site catalog

4.4. Executable Discovery (Transformation Catalog)

The Transformation Catalog maps logical transformations to physical executables on the system. It also provides additional information about the transformation as to what system they are compiled for, what profiles or environment variables need to be set when the transformation is invoked etc.

Pegasus currently supports three implementations of the Transformation Catalog:

  1. Text: A multiline text based Transformation Catalog (DEFAULT)

  2. File: A simple multi column text based Transformation Catalog

  3. Database: A database backend (MySQL or PostgreSQL) via JDBC

In this guide we will look at the format of the Multiline Text based TC.

4.4.1. MultiLine Text based TC (Text)

The multiline text based TC is the new default TC in Pegasus. This format allows you to define the transformations.

The file is read and cached in memory. Any modification, such as adding or deleting an entry, causes an update of the in-memory copy and hence of the underlying file. All queries are done against the in-memory representation. The file sample.tc.text in the etc directory contains an example:

tr example::keg:1.0 { 

#specify profiles that apply for all the sites for the transformation
#in each site entry the profile can be overridden

  profile env "APP_HOME" "/tmp/myscratch"
  profile env "JAVA_HOME" "/opt/java/1.6"

  site isi {
    profile env "HELLo" "WORLD"
    profile condor "FOO" "bar"
    profile env "JAVA_HOME" "/bin/java.1.6"
    pfn "/path/to/keg"
    arch "x86"
    os "linux"
    osrelease "fc"
    osversion "4"
    type "INSTALLED"
  }

  site wind {
    profile env "CPATH" "/usr/cpath"
    profile condor "universe" "condor"
    pfn "file:///path/to/keg"
    arch "x86"
    os "linux"
    osrelease "fc"
    osversion "4"
    type "STAGEABLE"
  }
}

The entries in this catalog have the following meaning:

  1. tr - A transformation identifier, normally of the form Namespace::Name:Version. The Namespace and Version are optional.

  2. pfn - URL or file path for the location of the executable. The pfn is a file path if the transformation is of type INSTALLED and generally a url (file:/// or http:// or gridftp://) if of type STAGEABLE

  3. site - The site identifier for the site where the transformation is available

  4. type - The type of transformation: whether it is installed ("INSTALLED") on the remote site or is available to stage ("STAGEABLE").

  5. arch, os, osrelease, osversion - The arch/os/osrelease/osversion of the transformation. osrelease and osversion are optional.

    ARCH can have one of the following values: x86, x86_64, sparcv7, sparcv9, ppc, aix. The default value for arch is x86.

    OS can have one of the following values: linux, sunos, macosx. The default value for OS, if none is specified, is linux.

  6. Profiles - One or many profiles can be attached to a transformation for all sites or to a transformation on a particular site.

To use this format of the Transformation Catalog you need to set the following properties:

  1. pegasus.catalog.transformation=Text

  2. pegasus.catalog.transformation.file=<path to the transformation catalog file>

4.4.2. Singleline Text based TC (File)

Warning

This format is now deprecated in favor of the multiline TC. If you are still using the single line TC, you should convert it to multiline using the pegasus-tc-converter client.

The format of this TC is as follows.

#site  logicaltr   physicaltr   type  system  profiles(NS::KEY="VALUE")

site1 sys::date:1.0 /usr/bin/date  INSTALLED INTEL32::LINUX:FC4.2:3.6 ENV::PATH="/usr/bin";PEGASUS_HOME="/usr/local/pegasus"

The system and profile entries are optional and will use default values if not specified. The entries in the file format have the following meaning:

  1. site - A site identifier.

  2. logicaltr - The logical transformation name. The format is NAMESPACE::NAME:VERSION where NAMESPACE and VERSION are optional.

  3. physicaltr - The physical transformation path or URL.

    If the transformation type is INSTALLED, it needs to be an absolute path to the executable. If the type is STAGEABLE, the path needs to be an HTTP, FTP or gsiftp URL.

  4. type - The type of transformation. Can have one of two values:

    • INSTALLED: This means that the transformation is installed on the remote site

    • STAGEABLE: This means that the transformation is available as a static binary and can be staged to a remote site.

  5. system - The system for which the transformation is compiled.

    The format of the system is ARCH::OS:OSVERSION:GLIBC, where the GLIBC and OSVERSION are optional. ARCH can have one of the following values: INTEL32, INTEL64, SPARCV7, SPARCV9, AIX, AMD64. OS can have one of the following values: LINUX, SUNOS. The default value for system, if none is specified, is INTEL32::LINUX.

  6. Profiles - The profiles associated with the transformation. For in-depth information about profiles and their priorities, read the Profile Guide.

    The format for profiles is NS::KEY="VALUE", where NS is the namespace of the profile, e.g. Pegasus, condor, DAGMan, env, globus. The key and value can be any strings. Remember to quote the value with double quotes. If you need to specify several profiles, you can do it in several ways:

    • NS1::KEY1="VALUE1",KEY2="VALUE2";NS2::KEY3="VALUE3",KEY4="VALUE4"

      This is the most optimized form. Multiple key values for the same namespace are separated by a comma "," and different namespaces are separated by a semicolon ";"

    • NS1::KEY1="VALUE1";NS1::KEY2="VALUE2";NS2::KEY3="VALUE3";NS2::KEY4="VALUE4"

      You can also just repeat the triple of NS::KEY="VALUE" separated by semicolons for a simple format.
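Both ways of writing the profile column describe the same set of triples, which the sketch below illustrates. It is not the parser Pegasus uses. It assumes that a key written without a namespace inherits the most recent one, it treats "," and ";" alike as separators, and it does not handle escaped quotes inside values. The function name parse_profiles is made up for this example.

```python
import re

# One NS::KEY="VALUE" or KEY="VALUE" item; the namespace prefix is optional.
ITEM = re.compile(r'(?:([A-Za-z]\w*)::)?(\w+)="([^"]*)"')


def parse_profiles(text):
    """Return a list of (namespace, key, value) triples."""
    profiles, namespace, pos = [], None, 0
    while pos < len(text):
        match = ITEM.match(text, pos)
        if not match:
            raise ValueError("cannot parse profile at offset %d" % pos)
        # Assumption: a bare KEY reuses the namespace seen most recently.
        namespace = match.group(1) or namespace
        if namespace is None:
            raise ValueError("the first profile needs a namespace")
        profiles.append((namespace, match.group(2), match.group(3)))
        pos = match.end()
        if pos < len(text):
            if text[pos] not in ",;":
                raise ValueError("expected ',' or ';' at offset %d" % pos)
            pos += 1
    return profiles


if __name__ == "__main__":
    compact = 'NS1::KEY1="VALUE1",KEY2="VALUE2";NS2::KEY3="VALUE3",KEY4="VALUE4"'
    simple = ('NS1::KEY1="VALUE1";NS1::KEY2="VALUE2";'
              'NS2::KEY3="VALUE3";NS2::KEY4="VALUE4"')
    print(parse_profiles(compact))
    print(parse_profiles(compact) == parse_profiles(simple))
```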

To use this format of the Transformation Catalog you need to set the following properties:

  1. pegasus.catalog.transformation=File

  2. pegasus.catalog.transformation.file=<path to the transformation catalog file>

4.4.3. Database TC (Database)

The database TC allows you to use a relational database. To use the database TC you need to have installed a MySQL or PostgreSQL server. The schema for the database is available in the $PEGASUS_HOME/sql directory. You will have to install the schema into either PostgreSQL or MySQL by running the appropriate commands to load the two schemas create-XX-init.sql and create-XX-tc.sql, where XX is either my (for MySQL) or pg (for PostgreSQL).

To use the Database TC you need to set the following properties:

  1. pegasus.catalog.transformation.db.driver=MySQL | Postgres

  2. pegasus.catalog.transformation.db.url=<jdbc url to the database>

  3. pegasus.catalog.transformation.db.user=<database user>

  4. pegasus.catalog.transformation.db.password=<database password>

4.4.4. TC Client pegasus-tc-client

We need to map our declared transformations (preprocess, findrange, and analyze) from the example DAX above to a simple "mock application" named "keg" ("canonical example for the grid"), which reads input files designated by arguments, writes them back onto output files, and produces on STDOUT a summary of where and when it was run. Keg ships with Pegasus in the bin directory. Run keg on the command line to see how it works.

$ keg -o /dev/fd/1

Timestamp Today: 20040624T054607-05:00 (1088073967.418;0.022)
Applicationname: keg @ 10.10.0.11 (VPN)
Current Workdir: /home/unique-name
Systemenvironm.: i686-Linux 2.4.18-3
Processor Info.: 1 x Pentium III (Coppermine) @ 797.425
Output Filename: /dev/fd/1

Now we need to map all 3 transformations onto the "keg" executable. We place these mappings in our File transformation catalog for site clus1.

Note

In earlier versions of Pegasus, users had to define entries for Pegasus executables such as transfer, replica client, dirmanager, etc. on each site as well as on site "local". This is no longer required. Pegasus versions 2.0 and later automatically pick up the paths for these binaries from the environment profile PEGASUS_HOME set in the site catalog for each site.

A single entry needs to be on one line. The above example is just formatted for convenience.

Alternatively, you can also use the pegasus-tc-client to add entries to any implementation of the transformation catalog. The following example shows the addition of the last entry to the File based transformation catalog.

$ pegasus-tc-client -Dpegasus.catalog.transformation=File \
-Dpegasus.catalog.transformation.file=$HOME/tc -a -r clus1 -l black::analyze:1.0 \
-p gsiftp://clus1.com/opt/nfs/vdt/pegasus/bin/keg  -t STAGEABLE -s INTEL32::LINUX \
-e ENV::KEY3="VALUE3"

2007.07.11 16:12:03.712 PDT: [INFO] Added tc entry sucessfully

To verify if the entry was correctly added to the transformation catalog you can use the pegasus-tc-client to query.

$ pegasus-tc-client -Dpegasus.catalog.transformation=File \
-Dpegasus.catalog.transformation.file=$HOME/tc -q -P -l black::analyze:1.0

#RESID     LTX          PFN                  TYPE              SYSINFO

clus1    black::analyze:1.0    gsiftp://clus1.com/opt/nfs/vdt/pegasus/bin/keg
                STAGEABLE    INTEL32::LINUX

4.4.5. TC Converter Client pegasus-tc-converter

Pegasus 3.0 by default now parses a file based multiline textual format of a Transformation Catalog. The new Text format is explained in detail in the chapter on Catalogs.

Pegasus 3.0 comes with a pegasus-tc-converter that will convert a user's old transformation catalog (File) to the Text format. Sample usage is given below.

$ pegasus-tc-converter -i sample.tc.data -I File -o sample.tc.text -O Text

2010.11.22 12:53:16.661 PST:   Successfully converted Transformation Catalog from File to Text 
2010.11.22 12:53:16.666 PST:   The output transfomation catalog is in file  /lfs1/software/install/pegasus/pegasus-3.0.0cvs/etc/sample.tc.text 

To use the converted transformation catalog, in the properties do the following:

  1. unset pegasus.catalog.transformation or set pegasus.catalog.transformation to Text

  2. point pegasus.catalog.transformation.file to the converted transformation catalog