The DAX is a description of an abstract workflow in XML format that is used as the primary input into Pegasus. The DAX schema is described in dax-3.2.xsd. The documentation of the schema and its elements can be found in dax-3.2.html.
A DAX can be created by all users with the DAX generating API in Java, Perl, or Python.
Note
We highly recommend using the DAX API. Advanced users who can read XML schema definitions can generate a DAX directly from a script.
The sample workflow below incorporates some of the elementary graph structures used in all abstract workflows.
-
fan-out, scatter, and diverge all describe the fact that multiple siblings are dependent on fewer parents.
The example shows how the Job 2 and Job 3 nodes depend on the Job 1 node.
-
fan-in, gather, join, and converge describe how multiple siblings are merged into fewer dependent child nodes.
The example shows how the Job 4 node depends on both Job 2 and Job 3 nodes.
serial execution implies that nodes are dependent on one another, like pearls on a string.
parallel execution implies that nodes can be executed in parallel.
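The structures above can be sketched in a few lines of Python. This is only an illustration and not part of Pegasus: the node names job1 to job4 and the function execution_levels are hypothetical. It groups the nodes of the diamond workflow into levels, where the nodes within one level have no dependencies on each other and can therefore run in parallel, while the levels themselves run serially.

```python
from collections import defaultdict

# Parent -> child edges of the diamond workflow (hypothetical node names).
EDGES = [("job1", "job2"), ("job1", "job3"), ("job2", "job4"), ("job3", "job4")]


def execution_levels(edges):
    """Group nodes into levels; nodes within one level can run in parallel."""
    children = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1
        nodes.update((parent, child))

    # Start with the nodes that have no parents.
    current = sorted(n for n in nodes if indegree[n] == 0)
    levels = []
    while current:
        levels.append(current)
        ready = []
        for node in current:
            for child in children[node]:
                indegree[child] -= 1
                if indegree[child] == 0:  # all parents have finished
                    ready.append(child)
        current = sorted(ready)
    return levels


if __name__ == "__main__":
    for depth, level in enumerate(execution_levels(EDGES)):
        print(depth, level)
```

The middle level holds both job2 and job3 (fan-out from job1), and the last level holds job4 (fan-in from job2 and job3).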
The example diamond workflow consists of four nodes representing jobs, which are linked by six files.
Required input files must be registered with the Replica Catalog in order for Pegasus to find them and integrate them into the workflow.
Leaf files are a product or output of a workflow. Output files can be collected at a location.
The remaining files all have lines leading to them and originating from them. These files are products of some job steps (lines leading to them), and consumed by other job steps (lines leading out of them). Often, these files represent intermediary results that can be cleaned.
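The three kinds of files can be derived mechanically from what each job consumes and produces. The following sketch is an illustration under stated assumptions: the file names are those of the diamond example in this section, while the dictionary keys such as findrange_1 and the function classify_files are hypothetical names chosen here.

```python
# Inputs and outputs of each job in the diamond workflow.
# The job keys are hypothetical labels; the file names follow the example.
JOBS = {
    "preprocess": {"inputs": {"f.a"}, "outputs": {"f.b1", "f.b2"}},
    "findrange_1": {"inputs": {"f.b1"}, "outputs": {"f.c1"}},
    "findrange_2": {"inputs": {"f.b2"}, "outputs": {"f.c2"}},
    "analyze": {"inputs": {"f.c1", "f.c2"}, "outputs": {"f.d"}},
}


def classify_files(jobs):
    """Split workflow files into required inputs, intermediates and leaf outputs."""
    consumed = set().union(*(job["inputs"] for job in jobs.values()))
    produced = set().union(*(job["outputs"] for job in jobs.values()))
    return {
        # Consumed but never produced: must come from the Replica Catalog.
        "required_inputs": consumed - produced,
        # Produced by one job and consumed by another.
        "intermediate": consumed & produced,
        # Produced but never consumed: products of the workflow.
        "leaf_outputs": produced - consumed,
    }


if __name__ == "__main__":
    for kind, files in classify_files(JOBS).items():
        print(kind, sorted(files))
```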
There are two main ways of generating a DAX: using the DAX generating API, or generating the XML directly from a script.
One example for a DAX representing the example workflow can look like the following:
<?xml version="1.0" encoding="UTF-8"?>
<!-- generated: 2010-11-22T22:55:08Z -->
<adag xmlns="http://pegasus.isi.edu/schema/DAX"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://pegasus.isi.edu/schema/DAX http://pegasus.isi.edu/schema/dax-3.2.xsd"
version="3.2" name="diamond" index="0" count="1">
<!-- part 2: definition of all jobs (at least one) -->
<job namespace="diamond" name="preprocess" version="2.0" id="ID000001">
<argument>-a preprocess -T60 -i <file name="f.a" /> -o <file name="f.b1" /> <file name="f.b2" /></argument>
<uses name="f.b2" link="output" register="false" transfer="false" />
<uses name="f.b1" link="output" register="false" transfer="false" />
<uses name="f.a" link="input" />
</job>
<job namespace="diamond" name="findrange" version="2.0" id="ID000002">
<argument>-a findrange -T60 -i <file name="f.b1" /> -o <file name="f.c1" /></argument>
<uses name="f.b1" link="input" register="false" transfer="false" />
<uses name="f.c1" link="output" register="false" transfer="false" />
</job>
<job namespace="diamond" name="findrange" version="2.0" id="ID000003">
<argument>-a findrange -T60 -i <file name="f.b2" /> -o <file name="f.c2" /></argument>
<uses name="f.c2" link="output" register="false" transfer="false" />
<uses name="f.b2" link="input" register="false" transfer="false" />
</job>
<job namespace="diamond" name="analyze" version="2.0" id="ID000004">
<argument>-a analyze -T60 -i <file name="f.c1" /> <file name="f.c2" /> -o <file name="f.d" /></argument>
<uses name="f.c2" link="input" register="false" transfer="false" />
<uses name="f.d" link="output" register="false" transfer="true" />
<uses name="f.c1" link="input" register="false" transfer="false" />
</job>
<!-- part 3: list of control-flow dependencies -->
<child ref="ID000002">
<parent ref="ID000001" />
</child>
<child ref="ID000003">
<parent ref="ID000001" />
</child>
<child ref="ID000004">
<parent ref="ID000002" />
<parent ref="ID000003" />
</child>
</adag>
The example workflow representation in the form of a DAX requires
external catalogs, such as a transformation catalog (TC) to resolve the
logical job names (such as diamond::preprocess:2.0), and a replica catalog
(RC) to resolve the input file f.a. The above
workflow defines the four jobs just like the example picture, and the
files that flow between the jobs. The intermediary files are neither
registered nor staged out, and can be considered transient. Only the final
result file f.d is staged out.
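For users who generate the XML directly from a script, the sketch below shows one possible approach using only the Python standard library. It is not the Pegasus DAX API, and the function name build_diamond_dax is hypothetical. As a simplification it omits the argument elements and the xsi:schemaLocation attribute, and it only writes the register and transfer attributes on output files.

```python
import xml.etree.ElementTree as ET

DAX_NS = "http://pegasus.isi.edu/schema/DAX"

# (job id, transformation name, input files, output files)
JOBS = [
    ("ID000001", "preprocess", ["f.a"], ["f.b1", "f.b2"]),
    ("ID000002", "findrange", ["f.b1"], ["f.c1"]),
    ("ID000003", "findrange", ["f.b2"], ["f.c2"]),
    ("ID000004", "analyze", ["f.c1", "f.c2"], ["f.d"]),
]

# child job id -> parent job ids
DEPENDENCIES = {
    "ID000002": ["ID000001"],
    "ID000003": ["ID000001"],
    "ID000004": ["ID000002", "ID000003"],
}


def tag(name):
    """Qualify an element name with the DAX namespace."""
    return f"{{{DAX_NS}}}{name}"


def build_diamond_dax():
    """Return a simplified DAX for the diamond workflow as an XML string."""
    ET.register_namespace("", DAX_NS)
    adag = ET.Element(
        tag("adag"),
        {"version": "3.2", "name": "diamond", "index": "0", "count": "1"},
    )
    for job_id, name, inputs, outputs in JOBS:
        job = ET.SubElement(
            adag,
            tag("job"),
            {"namespace": "diamond", "name": name, "version": "2.0", "id": job_id},
        )
        for lfn in inputs:
            ET.SubElement(job, tag("uses"), {"name": lfn, "link": "input"})
        for lfn in outputs:
            # Only the final result f.d is staged out.
            transfer = "true" if lfn == "f.d" else "false"
            ET.SubElement(
                job,
                tag("uses"),
                {"name": lfn, "link": "output", "register": "false", "transfer": transfer},
            )
    for child_id, parent_ids in DEPENDENCIES.items():
        child = ET.SubElement(adag, tag("child"), {"ref": child_id})
        for parent_id in parent_ids:
            ET.SubElement(child, tag("parent"), {"ref": parent_id})
    return ET.tostring(adag, encoding="unicode")


if __name__ == "__main__":
    print(build_diamond_dax())
```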
The Replica Catalog keeps mappings of logical file ids/names (LFN's) to physical file ids/names (PFN's). A single LFN can map to several PFN's. A PFN consists of a URL with protocol, host and port information and a path to a file. Along with the PFN one can also store additional key/value attributes to be associated with a PFN.
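The mapping described above can be pictured as a small in-memory data structure. The class below is a sketch for illustration only and is not how Pegasus implements its catalogs; the class name, its methods and the example URLs are all hypothetical.

```python
from collections import defaultdict


class InMemoryReplicaCatalog:
    """Toy LFN -> [(PFN, attributes)] mapping, for illustration only."""

    def __init__(self):
        self._entries = defaultdict(list)

    def insert(self, lfn, pfn, **attributes):
        """Register one more PFN, with optional key/value attributes, for an LFN."""
        self._entries[lfn].append((pfn, dict(attributes)))

    def lookup(self, lfn, **required):
        """Return the PFNs of an LFN whose attributes match all required values."""
        return [
            (pfn, attrs)
            for pfn, attrs in self._entries.get(lfn, [])
            if all(attrs.get(key) == value for key, value in required.items())
        ]


if __name__ == "__main__":
    rc = InMemoryReplicaCatalog()
    # Hypothetical hosts; a single LFN maps to several PFNs.
    rc.insert("f.a", "gsiftp://host1.example.org/data/f.a", pool="local")
    rc.insert("f.a", "gsiftp://host2.example.org/data/f.a", pool="isi")
    print(rc.lookup("f.a"))
    print(rc.lookup("f.a", pool="isi"))
```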
Pegasus supports 3 different implementations of the Replica Catalog.
File (Default)
Database via JDBC
-
Replica Location Service
RLS
LRC
MRC
In this mode, Pegasus queries a file based replica catalog. The file format is a simple multicolumn format. It is neither transactionally safe, nor advised to use for production purposes in any way. Multiple concurrent instances will conflict with each other. The site attribute should be specified whenever possible. The attribute key for the site attribute is "pool".
LFN PFN
LFN PFN a=b [..]
LFN PFN a="b" [..]
"LFN w/LWS" "PFN w/LWS" [..]
The LFN may or may not be quoted. If it contains linear whitespace, quotes, a backslash or an equals sign, it must be quoted and escaped. The same conditions apply to the PFN. Each attribute key and its value are separated by an equals sign without any surrounding whitespace. The value may be quoted; the quoting rules for the LFN apply to it as well.
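To make the line format concrete, here is a minimal parsing sketch. It assumes that the POSIX-style quoting rules of Python's shlex module are a close enough approximation of the catalog's quoting rules, which may not hold for every corner case. The function name parse_rc_line is hypothetical and this is not the parser Pegasus uses.

```python
import shlex


def parse_rc_line(line):
    """Parse one 'LFN PFN key=value ...' line into (lfn, pfn, attributes).

    Returns None for blank lines and for comment lines starting with '#'.
    """
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return None
    # shlex handles quoted tokens and backslash escapes (an approximation).
    tokens = shlex.split(stripped, comments=False, posix=True)
    if len(tokens) < 2:
        raise ValueError(f"expected at least an LFN and a PFN: {line!r}")
    lfn, pfn, *pairs = tokens
    attributes = {}
    for pair in pairs:
        key, separator, value = pair.partition("=")
        if not separator or not key:
            raise ValueError(f"attribute is not of the form key=value: {pair!r}")
        attributes[key] = value
    return lfn, pfn, attributes


if __name__ == "__main__":
    print(parse_rc_line("f.a gsiftp://somehost:port/path/to/file/f.a pool=local"))
    print(parse_rc_line('"my file" "file:///tmp/my file" pool="local"'))
```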
The File mode is the default mode. To use it, set the following properties:
pegasus.catalog.replica=File
pegasus.catalog.replica.file=<path to the replica catalog file>
In this mode, Pegasus queries a SQL based replica catalog that is accessed via JDBC. The SQL schemas for this catalog can be found in the $PEGASUS_HOME/sql directory. You will have to install the schema into either PostgreSQL or MySQL by running the appropriate commands to load the two schemas create-XX-init.sql and create-XX-rc.sql, where XX is either my (for MySQL) or pg (for PostgreSQL).
To use JDBCRC, the user additionally needs to set the following properties:
pegasus.catalog.replica.db.url=<jdbc url to the database>
pegasus.catalog.replica.db.user=<database user>
pegasus.catalog.replica.db.password=<database password>
Replica Location Service (RLS) is a distributed replica catalog that ships with Globus. There is an index service called Replica Location Index (RLI) to which one or more Local Replica Catalogs (LRC) report. Each LRC can contain all or a subset of mappings.
Details about RLS can be found at http://www.globus.org/toolkit/data/rls/
In this mode, Pegasus queries the central RLI to discover in which LRC’s the mappings for a LFN reside. It then queries the individual LRC’s for the PFN’s. To use this mode the following properties need to be set:
pegasus.catalog.replica=RLS
pegasus.catalog.replica.url=<url to the globus LRC>
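The two-step lookup in RLS mode can be sketched with plain dictionaries. This only illustrates the query pattern described above and does not talk to a real RLS server: the function rls_lookup, the catalog names and the URLs are hypothetical.

```python
def rls_lookup(lfn, rli, lrcs):
    """Resolve an LFN by asking the index (RLI) first, then each listed LRC."""
    pfns = []
    # Step 1: the RLI only knows which LRCs hold mappings for the LFN.
    for lrc_name in sorted(rli.get(lfn, ())):
        # Step 2: each LRC returns the actual PFNs.
        pfns.extend(lrcs[lrc_name].get(lfn, []))
    return pfns


if __name__ == "__main__":
    # Hypothetical index and local catalogs.
    rli = {"f.a": {"lrc_isi", "lrc_clus1"}}
    lrcs = {
        "lrc_isi": {"f.a": ["gsiftp://isi.example.org/data/f.a"]},
        "lrc_clus1": {"f.a": ["gsiftp://clus1.example.org/data/f.a"]},
    }
    print(rls_lookup("f.a", rli, lrcs))
```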
This mode is available if the user does not want to query the RLI (Replica Location Index), but instead wishes to directly query a single Local Replica Catalog. To use the LRC mode, the following properties need to be set:
pegasus.catalog.replica=LRC
pegasus.catalog.replica.url=<url to the globus LRC>
Details about Globus Replica Catalog and LRC can be found at http://www.globus.org/toolkit/data/rls/
In this mode, Pegasus queries multiple replica catalogs to discover the file locations on the grid.
To use it, set:
pegasus.catalog.replica=MRC
Each associated replica catalog can be configured via properties as follows.
The user associates a variable name, referred to as [value], with each of the catalogs, where [value] is any legal identifier (concretely [A-Za-z][_A-Za-z0-9]*). For each associated replica catalog the user specifies the following properties:
pegasus.catalog.replica.mrc.[value] - specifies the type of replica catalog.
pegasus.catalog.replica.mrc.[value].key - specifies a property name key for a particular catalog
For example, to query two LRCs at the same time, specify the following:
pegasus.catalog.replica.mrc.lrc1=LRC
pegasus.catalog.replica.mrc.lrc1.url=<url to the 1st globus LRC>
pegasus.catalog.replica.mrc.lrc2=LRC
pegasus.catalog.replica.mrc.lrc2.url=<url to the 2nd globus LRC>
In the above example, lrc1 and lrc2 are arbitrary valid identifiers and url is the property key that needs to be specified.
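The naming scheme of the MRC properties can be illustrated with a short sketch that groups them per catalog. It is not Pegasus code: the function group_mrc_properties and the example URLs are hypothetical, and the identifier check simply applies the pattern quoted above.

```python
import re

MRC_PREFIX = "pegasus.catalog.replica.mrc."
IDENTIFIER = re.compile(r"[A-Za-z][_A-Za-z0-9]*")


def group_mrc_properties(properties):
    """Group pegasus.catalog.replica.mrc.* properties by catalog identifier."""
    catalogs = {}
    for name, value in properties.items():
        if not name.startswith(MRC_PREFIX):
            continue
        # "lrc1.url" -> identifier "lrc1", key "url"; "lrc1" -> key "".
        identifier, _, key = name[len(MRC_PREFIX):].partition(".")
        if not IDENTIFIER.fullmatch(identifier):
            raise ValueError(f"not a legal identifier: {identifier!r}")
        entry = catalogs.setdefault(identifier, {"type": None, "keys": {}})
        if key:
            entry["keys"][key] = value
        else:
            entry["type"] = value
    return catalogs


if __name__ == "__main__":
    # Hypothetical URLs for two LRCs.
    props = {
        "pegasus.catalog.replica": "MRC",
        "pegasus.catalog.replica.mrc.lrc1": "LRC",
        "pegasus.catalog.replica.mrc.lrc1.url": "rls://lrc1.example.org",
        "pegasus.catalog.replica.mrc.lrc2": "LRC",
        "pegasus.catalog.replica.mrc.lrc2.url": "rls://lrc2.example.org",
    }
    print(group_mrc_properties(props))
```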
The client used to interact with the Replica Catalogs is pegasus-rc-client. The implementation that the client talks to is configured using Pegasus properties.
Let's assume we create a file f.a in your home directory as shown below.
$ date > $HOME/f.a
We now need to register this file in the File replica catalog located in $HOME/rc using the pegasus-rc-client. Replace the gsiftp://url with the appropriate parameters for your grid site.
$ pegasus-rc-client -Dpegasus.catalog.replica=File -Dpegasus.catalog.replica.file=$HOME/rc insert \
f.a gsiftp://somehost:port/path/to/file/f.a pool=local
You may first want to verify that the file registration is in the replica catalog. Since we are using a File catalog, we can look at the file $HOME/rc to view the entries.
$ cat $HOME/rc
# file-based replica catalog: 2010-11-10T17:52:53.405-07:00
f.a gsiftp://somehost:port/path/to/file/f.a pool=local
The above line shows that the entry for file f.a was made correctly.
You can also use the pegasus-rc-client to look for entries.
$ pegasus-rc-client -Dpegasus.catalog.replica=File -Dpegasus.catalog.replica.file=$HOME/rc lookup LFN f.a
f.a gsiftp://somehost:port/path/to/file/f.a pool=local
The Site Catalog describes the compute resources (which are often clusters) that we intend to run the workflow upon. A site is a homogeneous part of a cluster that has at least a single GRAM gatekeeper with a jobmanager-fork and jobmanager-<scheduler> interface and at least one gridftp server along with a shared file system. The GRAM gatekeeper can be either WS GRAM or Pre-WS GRAM. A site can also be a condor pool or glidein pool with a shared file system.
Pegasus currently supports three implementations of the Site Catalog, two of which are deprecated:
XML3 (Default)
XML (Deprecated)
File (Deprecated)
This is the default format for Pegasus 3.0. This format allows defining file systems of shared as well as local type on the head node of the remote cluster as well as on the backend nodes.
Below is an example of the XML3 site catalog:
<sitecatalog xmlns="http://pegasus.isi.edu/schema/sitecatalog"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://pegasus.isi.edu/schema/sitecatalog
http://pegasus.isi.edu/schema/sc-3.0.xsd" version="3.0">
<site handle="isi" arch="x86" os="LINUX" osrelease="" osversion="" glibc="">
<grid type="gt2" contact="smarty.isi.edu/jobmanager-pbs" scheduler="PBS" jobtype="auxillary"/>
<grid type="gt2" contact="smarty.isi.edu/jobmanager-pbs" scheduler="PBS" jobtype="compute"/>
<head-fs>
<scratch>
<shared>
<file-server protocol="gsiftp" url="gsiftp://skynet-data.isi.edu"
mount-point="/nfs/scratch01" />
<internal-mount-point mount-point="/nfs/scratch01"/>
</shared>
</scratch>
<storage>
<shared>
<file-server protocol="gsiftp" url="gsiftp://skynet-data.isi.edu"
mount-point="/exports/storage01"/>
<internal-mount-point mount-point="/exports/storage01"/>
</shared>
</storage>
</head-fs>
<replica-catalog type="LRC" url="rlsn://smarty.isi.edu"/>
<profile namespace="env" key="PEGASUS_HOME" >/nfs/vdt/pegasus</profile>
<profile namespace="env" key="GLOBUS_LOCATION" >/vdt/globus</profile>
</site>
</sitecatalog>
Described below are some of the entries in the site catalog.
site - A site identifier.
replica-catalog - URL for a local replica catalog (LRC) to register your files in. Only used for the RLS implementation of the RC. This is optional.
-
File Systems - Info about file systems mounted on the remote cluster's head node or worker nodes. It has several configurations:
head-fs/scratch - This describes the scratch file systems (temporary for execution) available on the head node
head-fs/storage - This describes the storage file systems (long term) available on the head node
worker-fs/scratch - This describes the scratch file systems (temporary for execution) available on the worker node
worker-fs/storage - This describes the storage file systems (long term) available on the worker node
Each scratch and storage entry can contain two sub-entries:
SHARED for shared file systems like NFS, LUSTRE etc.
LOCAL for local file systems (local to the node/machine)
Each of the file systems is defined by a file-server element. Protocol defines the protocol used to access the files, URL defines the URL prefix to obtain the files from, and mount-point is the mount point exposed by the file server.
Along with this, an internal-mount-point needs to be defined to access the files directly from the machine without any file servers.
-
arch, os, osrelease, osversion, glibc - The arch/os/osrelease/osversion/glibc of the site. OSRELEASE, OSVERSION and GLIBC are optional.
ARCH can have one of the following values: X86, X86_64, SPARCV7, SPARCV9, AIX, PPC.
OS can have one of the following values: LINUX, SUNOS, MACOSX. The default value for sysinfo if none is specified is X86::LINUX.
-
Profiles - One or many profiles can be attached to a pool.
One example is the environments to be set on a remote pool.
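The entries above can be read back with any namespace-aware XML parser. The sketch below uses Python's standard library on a trimmed copy of the example catalog. The function summarize_sites is hypothetical, and joining the file server URL and mount-point into one scratch URL is an assumption made here for illustration, not documented Pegasus behavior.

```python
import xml.etree.ElementTree as ET

SC_NS = {"sc": "http://pegasus.isi.edu/schema/sitecatalog"}

# Trimmed copy of the XML3 example above.
SITE_CATALOG = """<sitecatalog xmlns="http://pegasus.isi.edu/schema/sitecatalog" version="3.0">
  <site handle="isi" arch="x86" os="LINUX">
    <grid type="gt2" contact="smarty.isi.edu/jobmanager-pbs" scheduler="PBS" jobtype="compute"/>
    <head-fs>
      <scratch>
        <shared>
          <file-server protocol="gsiftp" url="gsiftp://skynet-data.isi.edu" mount-point="/nfs/scratch01"/>
          <internal-mount-point mount-point="/nfs/scratch01"/>
        </shared>
      </scratch>
    </head-fs>
    <profile namespace="env" key="PEGASUS_HOME">/nfs/vdt/pegasus</profile>
  </site>
</sitecatalog>"""


def summarize_sites(xml_text):
    """Collect grid contacts, shared head-node scratch URL and env profiles per site."""
    root = ET.fromstring(xml_text)
    summary = {}
    for site in root.findall("sc:site", SC_NS):
        server = site.find("sc:head-fs/sc:scratch/sc:shared/sc:file-server", SC_NS)
        scratch_url = None
        if server is not None:
            # Assumption: URL prefix followed by the exposed mount point.
            scratch_url = server.get("url") + server.get("mount-point")
        summary[site.get("handle")] = {
            "contacts": [grid.get("contact") for grid in site.findall("sc:grid", SC_NS)],
            "scratch_url": scratch_url,
            "env": {
                profile.get("key"): profile.text
                for profile in site.findall("sc:profile", SC_NS)
                if profile.get("namespace") == "env"
            },
        }
    return summary


if __name__ == "__main__":
    print(summarize_sites(SITE_CATALOG))
```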
To use this site catalog, the following properties need to be set:
pegasus.catalog.site=XML3
pegasus.catalog.site.file=<path to the site catalog file>
Warning
This format is now deprecated in favor of the XML3 format. If you are still using the XML or File format you should convert it to the XML3 format using the pegasus-sc-converter client.
$ cat $HOME/sites.xml
<sitecatalog xmlns="http://pegasus.isi.edu/schema/sitecatalog"
xsi:schemaLocation="http://pegasus.isi.edu/schema/sitecatalog
http://pegasus.isi.edu/schema/sc-2.0.xsd"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="2.0">
<site handle="local" gridlaunch="/nfs/vdt/pegasus/bin/kickstart"
sysinfo="INTEL32::LINUX">
<profile namespace="env" key="PEGASUS_HOME" >/nfs/vdt/pegasus</profile>
<profile namespace="env" key="GLOBUS_LOCATION" >/vdt/globus</profile>
<profile namespace="env" key="LD_LIBRARY_PATH" >/vdt/globus/lib</profile>
<profile namespace="env" key="JAVA_HOME" >/vdt/java</profile>
<lrc url="rlsn://localhost" />
<gridftp url="gsiftp://localhost" storage="/$HOME/storage" major="4" minor="0"
patch="5">
</gridftp>
<jobmanager universe="transfer" url="localhost/jobmanager-fork" major="4" minor="0"
patch="5" />
<jobmanager universe="vanilla" url="localhost/jobmanager-fork" major="4" minor="0"
patch="5" />
<workdirectory >$HOME/workdir</workdirectory>
</site>
<site handle="clus1" gridlaunch="/opt/nfs/vdt/pegasus/bin/kickstart"
sysinfo="INTEL32::LINUX">
<profile namespace="env" key="PEGASUS_HOME" >/opt/nfs/vdt/pegasus</profile>
<profile namespace="env" key="GLOBUS_LOCATION" >/opt/vdt/globus</profile>
<profile namespace="env" key="LD_LIBRARY_PATH" >/opt/vdt/globus/lib</profile>
<lrc url="rlsn://clus1.com" />
<gridftp url="gsiftp://clus1.com" storage="/jobmanager-fork" major="4" minor="0"
patch="3">
</gridftp>
<jobmanager universe="transfer" url="clus1.com/jobmanager-fork" major="4" minor="0"
patch="3" />
<jobmanager universe="vanilla" url="clus1.com/jobmanager-pbs" major="4" minor="0"
patch="3" />
<workdirectory >$HOME/workdir-clus1</workdirectory>
</site>
</sitecatalog>
site - A site identifier.
lrc - URL for a local replica catalog (LRC) to register your files in. Only used for RLS implementation of the RC
workdirectory - A remote working directory (Should be on a shared file system)
gridftp - A URL prefix for a remote storage location, and a path to the storage location
-
jobmanager - Url to the jobmanager entrypoints for the remote grid. Different universes are supported which map to different batch jobmanagers.
"vanilla" for compute jobs and "transfer" for transfer jobs are mandatory. Generally a transfer universe should map to the fork jobmanager.
gridlaunch - Path to the remote kickstart tool (provenance tracking)
-
sysinfo - The arch/os/osversion/glibc of the site. The format is ARCH::OS:OSVER:GLIBC where OSVER and GLIBC are optional.
ARCH can have one of the following values: INTEL32, INTEL64, SPARCV7, SPARCV9, AIX, AMD64. OS can have one of the following values: LINUX, SUNOS. The default value for sysinfo if none is specified is INTEL32::LINUX.
-
Profiles - One or many profiles can be attached to a pool.
Profiles such as the environments to be set on a remote pool.
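The sysinfo string used by this deprecated format is simple enough to take apart by hand. The following sketch follows the ARCH::OS:OSVER:GLIBC layout and the value lists given above; the function name parse_sysinfo and the names of the dictionary keys are hypothetical.

```python
ARCHITECTURES = {"INTEL32", "INTEL64", "SPARCV7", "SPARCV9", "AIX", "AMD64"}
OPERATING_SYSTEMS = {"LINUX", "SUNOS"}


def parse_sysinfo(sysinfo="INTEL32::LINUX"):
    """Split ARCH::OS:OSVER:GLIBC into its parts; OSVER and GLIBC are optional."""
    arch, separator, rest = sysinfo.partition("::")
    if not separator or arch not in ARCHITECTURES:
        raise ValueError(f"unknown or missing architecture in {sysinfo!r}")
    os_name, *optional = rest.split(":")
    if os_name not in OPERATING_SYSTEMS:
        raise ValueError(f"unknown or missing OS in {sysinfo!r}")
    if len(optional) > 2:
        raise ValueError(f"too many fields in {sysinfo!r}")
    # Pad the optional fields so that both keys are always present.
    optional += [None] * (2 - len(optional))
    return {"arch": arch, "os": os_name, "osversion": optional[0], "glibc": optional[1]}


if __name__ == "__main__":
    print(parse_sysinfo())
    print(parse_sysinfo("INTEL32::LINUX:FC4.2:3.6"))
```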
To use this format you need to set the following properties:
pegasus.catalog.site=XML
pegasus.catalog.site.file=<path to the site catalog file>
Warning
This format is now deprecated in favor of the XML3 format. If you are still using the File format you should convert it to the XML3 format using the pegasus-sc-converter client.
The File format is as follows:
site site_id {
#required. Can be a dummy value if using Simple File RC
lrc "rls://someurl"
#required on a shared file system
workdir "path/to/a/tmp/shared/file/system/"
#required one or more entries
gridftp "gsiftp://hostname/mountpoint" "GLOBUS VERSION"
#required one or more entries
universe transfer "hostname/jobmanager-<scheduler>" "GLOBUS VERSION"
#required one or more entries
universe vanilla "hostname/jobmanager-<scheduler>" "GLOBUS VERSION"
#optional
sysinfo "ARCH::OS:OSVER:GLIBC"
#optional
gridlaunch "/path/to/gridlaunch/executable"
#optional zero or more entries
profile namespace "key" "value"
}
The gridlaunch and profile entries are optional. All the rest are required for each pool. The transfer and vanilla universes are also mandatory. You can add multiple transfer and vanilla universes if you have more than one head node on the cluster. The entries in the Site Catalog have the following meaning:
site - A site identifier.
lrc - URL for a local replica catalog (LRC) to register your files in. Only used for RLS implementation of the RC
workdir - A remote working directory (Should be on a shared file system)
gridftp - A URL prefix for a remote storage location.
-
universe - Different universes are supported which map to different batch jobmanagers.
"vanilla" for compute jobs and "transfer" for transfer jobs are mandatory. Generally a transfer universe should map to the fork jobmanager.
gridlaunch - Path to the remote kickstart tool (provenance tracking)
-
sysinfo - The arch/os/osversion/glibc of the site. The format is ARCH::OS:OSVER:GLIBC where OSVER and GLIBC are optional.
ARCH can have one of the following values: INTEL32, INTEL64, SPARCV7, SPARCV9, AIX, AMD64. OS can have one of the following values: LINUX, SUNOS. The default value for sysinfo if none is specified is INTEL32::LINUX.
-
Profiles - One or many profiles can be attached to a pool.
Profiles such as the environments to be set on a remote pool.
To use this format you need to set the following properties:
pegasus.catalog.site=Text
pegasus.catalog.site.file=<path to the site catalog file>
The pegasus-sc-client can be used to generate a site catalog for Open Science Grid (OSG) by querying their monitoring interfaces such as VORS or OSGMM. See pegasus-sc-client --help for more details.
By default, Pegasus 3.0 now parses a Site Catalog format conforming to the SC schema 3.0 (XML3), which is explained in detail in the Catalog Properties section of Running Workflows.
Pegasus 3.0 comes with a pegasus-sc-converter that will convert a user's old site catalog (XML) to the XML3 format. Sample usage is given below.
$ pegasus-sc-converter -i sample.sites.xml -I XML -o sample.sites.xml3 -O XML3
2010.11.22 12:55:14.169 PST: Written out the converted file to sample.sites.xml3
To use the converted site catalog, in the properties do the following:
unset pegasus.catalog.site or set pegasus.catalog.site to XML3
point pegasus.catalog.site.file to the converted site catalog
The Transformation Catalog maps logical transformations to physical executables on the system. It also provides additional information about the transformation as to what system they are compiled for, what profiles or environment variables need to be set when the transformation is invoked etc.
Pegasus currently supports three implementations of the Transformation Catalog:
Text: A multiline text based Transformation Catalog (DEFAULT)
File: A simple multi column text based Transformation Catalog
Database: A database backend (MySQL or PostgreSQL) via JDBC
In this guide we will look at the format of the Multiline Text based TC.
The multiline text based TC is the new default TC in Pegasus. This format allows you to define the transformations.
The file is read and cached in memory. Any modification, such as adding or deleting an entry, causes an update of the in-memory copy and hence of the underlying file. All queries are done against the in-memory representation. The file sample.tc.text in the etc directory contains an example.
tr example::keg:1.0 {
#specify profiles that apply for all the sites for the transformation
#in each site entry the profile can be overridden
profile env "APP_HOME" "/tmp/myscratch"
profile env "JAVA_HOME" "/opt/java/1.6"
site isi {
profile env "HELLo" "WORLD"
profile condor "FOO" "bar"
profile env "JAVA_HOME" "/bin/java.1.6"
pfn "/path/to/keg"
arch "x86"
os "linux"
osrelease "fc"
osversion "4"
type "INSTALLED"
}
site wind {
profile env "CPATH" "/usr/cpath"
profile condor "universe" "condor"
pfn "file:///path/to/keg"
arch "x86"
os "linux"
osrelease "fc"
osversion "4"
type "STAGEABLE"
}
}
The entries in this catalog have the following meaning:
tr - A transformation identifier, normally of the form Namespace::Name:Version. The Namespace and Version are optional.
pfn - URL or file path for the location of the executable. The pfn is a file path if the transformation is of type INSTALLED and generally a url (file:/// or http:// or gridftp://) if of type STAGEABLE
site - The site identifier for the site where the transformation is available
type - The type of transformation: whether it is installed ("INSTALLED") on the remote site or is available to stage ("STAGEABLE").
-
arch, os, osrelease, osversion - The arch/os/osrelease/osversion of the transformation. osrelease and osversion are optional.
ARCH can have one of the following values x86, x86_64, sparcv7, sparcv9, ppc, aix. The default value for arch is x86
OS can have one of the following values linux,sunos,macosx. The default value for OS if none specified is linux
Profiles - One or many profiles can be attached to a transformation for all sites or to a transformation on a particular site.
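As a small illustration of the tr identifier described above, the sketch below splits a Namespace::Name:Version string into its parts, treating the namespace and the version as optional. The function name parse_transformation is hypothetical and this is not the parser used by Pegasus.

```python
def parse_transformation(identifier):
    """Split 'Namespace::Name:Version' into (namespace, name, version).

    Namespace and version are optional and returned as None when absent.
    """
    namespace, separator, rest = identifier.partition("::")
    if not separator:
        # No "::" present: the whole string is "Name" or "Name:Version".
        namespace, rest = None, identifier
    name, separator, version = rest.partition(":")
    if not name:
        raise ValueError(f"missing transformation name in {identifier!r}")
    return namespace, name, (version if separator else None)


if __name__ == "__main__":
    print(parse_transformation("example::keg:1.0"))
    print(parse_transformation("keg"))
```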
To use this format of the Transformation Catalog you need to set the following properties:
pegasus.catalog.transformation=Text
pegasus.catalog.transformation.file=<path to the transformation catalog file>
Warning
This format is now deprecated in favor of the multiline TC. If you are still using the single line TC you should convert it to multiline using the pegasus-tc-converter client.
The format of this TC is as follows.
#site logicaltr physicaltr type system profiles(NS::KEY="VALUE")
site1 sys::date:1.0 /usr/bin/date INSTALLED INTEL32::LINUX:FC4.2:3.6 ENV::PATH="/usr/bin";PEGASUS_HOME="/usr/local/pegasus"
The system and profile entries are optional and will use default values if not specified. The entries in the file format have the following meaning:
site - A site identifier.
logicaltr - The logical transformation name. The format is NAMESPACE::NAME:VERSION where NAMESPACE and VERSION are optional.
-
physicaltr - The physical transformation path or URL.
If the transformation type is INSTALLED then it needs to be an absolute path to the executable. If the type is STAGEABLE then the path needs to be an HTTP, FTP or gsiftp URL.
-
type - The type of transformation. Can have one of two values:
INSTALLED: This means that the transformation is installed on the remote site
STAGEABLE: This means that the transformation is available as a static binary and can be staged to a remote site.
-
system - The system for which the transformation is compiled.
The format of the system is ARCH::OS:OSVERSION:GLIBC where the OSVERSION and GLIBC are optional. ARCH can have one of the following values: INTEL32, INTEL64, SPARCV7, SPARCV9, AIX, AMD64. OS can have one of the following values: LINUX, SUNOS. The default value for system if none is specified is INTEL32::LINUX.
-
Profiles - The profiles associated with the transformation. For in-depth information about profiles and their priorities read the Profile Guide.
The format for profiles is NS::KEY="VALUE" where NS is the namespace of the profile, e.g. Pegasus, condor, DAGMan, env, globus. The key and value can be any strings. Remember to quote the value with double quotes. If you need to specify several profiles you can do it in several ways:
-
NS1::KEY1="VALUE1",KEY2="VALUE2";NS2::KEY3="VALUE3",KEY4="VALUE4"
This is the most optimized form. Multiple key values for the same namespace are separated by a comma "," and different namespaces are separated by a semicolon ";"
-
NS1::KEY1="VALUE1";NS1::KEY2="VALUE2";NS2::KEY3="VALUE3";NS2::KEY4="VALUE4"
You can also just repeat the triple of NS::KEY="VALUE" separated by semicolons for a simple format.
-
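Both ways of writing several profiles carry the same information, which a short parsing sketch can make visible. The code below is an illustration under stated assumptions and not the parser used by Pegasus: it assumes that namespaces and keys consist of letters, digits, underscores, dots or hyphens, that values contain no double quotes, and that a key without a namespace belongs to the most recently named namespace. The function name parse_profiles is hypothetical.

```python
import re

# Optional "NS::" prefix, then KEY="VALUE" (assumed character sets, see above).
PROFILE = re.compile(r'(?:([\w.-]+)::)?([\w.-]+)="([^"]*)"')


def parse_profiles(text):
    """Parse a profile string into {namespace: {key: value}}."""
    profiles = {}
    namespace = None
    position = 0
    while position < len(text):
        match = PROFILE.match(text, position)
        if match is None:
            raise ValueError(f"cannot parse profile at offset {position}")
        found_namespace, key, value = match.groups()
        namespace = found_namespace or namespace
        if namespace is None:
            raise ValueError("first profile must name a namespace")
        profiles.setdefault(namespace, {})[key] = value
        position = match.end()
        if position < len(text):
            # Profiles are separated by "," (same namespace) or ";".
            if text[position] not in ",;":
                raise ValueError(f"expected ',' or ';' at offset {position}")
            position += 1
    return profiles


if __name__ == "__main__":
    compact = 'NS1::KEY1="VALUE1",KEY2="VALUE2";NS2::KEY3="VALUE3",KEY4="VALUE4"'
    repeated = 'NS1::KEY1="VALUE1";NS1::KEY2="VALUE2";NS2::KEY3="VALUE3";NS2::KEY4="VALUE4"'
    print(parse_profiles(compact) == parse_profiles(repeated))
```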
To use this format of the Transformation Catalog you need to set the following properties:
pegasus.catalog.transformation=File
pegasus.catalog.transformation.file=<path to the transformation catalog file>
The database TC allows you to use a relational database. To use the database TC you need to have installed a MySQL or PostgreSQL server. The schema for the database is available in the $PEGASUS_HOME/sql directory. You will have to install the schema into either PostgreSQL or MySQL by running the appropriate commands to load the two schemas create-XX-init.sql and create-XX-tc.sql, where XX is either my (for MySQL) or pg (for PostgreSQL).
To use the Database TC you need to set the following properties:
pegasus.catalog.transformation.db.driver=MySQL | Postgres
pegasus.catalog.transformation.db.url=<jdbc url to the database>
pegasus.catalog.transformation.db.user=<database user>
pegasus.catalog.transformation.db.password=<database password>
We need to map our declared transformations (preprocess, findrange, and analyze) from the example DAX above to a simple "mock application" named "keg" ("canonical example for the grid"), which reads input files designated by arguments, writes them back onto output files, and produces on STDOUT a summary of where and when it was run. Keg ships with Pegasus in the bin directory. Run keg on the command line to see how it works.
$ keg -o /dev/fd/1
Timestamp Today: 20040624T054607-05:00 (1088073967.418;0.022)
Applicationname: keg @ 10.10.0.11 (VPN)
Current Workdir: /home/unique-name
Systemenvironm.: i686-Linux 2.4.18-3
Processor Info.: 1 x Pentium III (Coppermine) @ 797.425
Output Filename: /dev/fd/1
Now we need to map all 3 transformations onto the "keg" executable. We place these mappings in our File transformation catalog for site clus1.
Note
In earlier versions of Pegasus, users had to define entries for Pegasus executables such as transfer, replica client, dirmanager, etc. on each site as well as on site "local". This is no longer required. Pegasus versions 2.0 and later automatically pick up the paths for these binaries from the environment profile PEGASUS_HOME set in the site catalog for each site.
A single entry needs to be on one line. The above example is just formatted for convenience.
Alternatively, you can use the pegasus-tc-client to add entries to any implementation of the transformation catalog. The following example shows the addition of the last entry to the File based transformation catalog.
$ pegasus-tc-client -Dpegasus.catalog.transformation=File \
-Dpegasus.catalog.transformation.file=$HOME/tc -a -r clus1 -l black::analyze:1.0 \
-p gsiftp://clus1.com/opt/nfs/vdt/pegasus/bin/keg -t STAGEABLE -s INTEL32::LINUX \
-e ENV::KEY3="VALUE3"
2007.07.11 16:12:03.712 PDT: [INFO] Added tc entry sucessfully
To verify if the entry was correctly added to the transformation catalog you can use the pegasus-tc-client to query.
$ pegasus-tc-client -Dpegasus.catalog.transformation=File \
-Dpegasus.catalog.transformation.file=$HOME/tc -q -P -l black::analyze:1.0
#RESID LTX PFN TYPE SYSINFO
clus1 black::analyze:1.0 gsiftp://clus1.com/opt/nfs/vdt/pegasus/bin/keg STAGEABLE INTEL32::LINUX
Pegasus 3.0 by default now parses a file based multiline textual format of a Transformation Catalog. The new Text format is explained in detail in the chapter on Catalogs.
Pegasus 3.0 comes with a pegasus-tc-converter that will convert a user's old transformation catalog (File) to the Text format. Sample usage is given below.
$ pegasus-tc-converter -i sample.tc.data -I File -o sample.tc.text -O Text
2010.11.22 12:53:16.661 PST: Successfully converted Transformation Catalog from File to Text
2010.11.22 12:53:16.666 PST: The output transfomation catalog is in file /lfs1/software/install/pegasus/pegasus-3.0.0cvs/etc/sample.tc.text
To use the converted transformation catalog, in the properties do the following:
unset pegasus.catalog.transformation or set pegasus.catalog.transformation to Text
point pegasus.catalog.transformation.file to the converted transformation catalog




