4.2. Data Discovery (Replica Catalog)

The Replica Catalog keeps mappings of logical file names (LFN's) to physical file names (PFN's). A single LFN can map to several PFN's. A PFN consists of a URL with protocol, host and port information and a path to a file. Additional key/value attributes can be stored with each PFN.
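The mapping described above can be pictured as a small in-memory data structure. This is only an illustrative sketch, not how Pegasus stores its catalog: the class names `Replica` and `ReplicaCatalog`, the method names, and the example URLs in the usage note are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One physical copy of a file: a PFN plus optional key/value attributes."""
    pfn: str
    attributes: dict = field(default_factory=dict)


class ReplicaCatalog:
    """Hypothetical in-memory catalog: one LFN maps to a list of replicas."""

    def __init__(self):
        self._entries = defaultdict(list)

    def insert(self, lfn, pfn, **attributes):
        # A single LFN may be registered with several PFN's.
        self._entries[lfn].append(Replica(pfn, dict(attributes)))

    def lookup(self, lfn, site=None):
        # Return all replicas for the LFN, optionally filtered by site attribute.
        replicas = self._entries.get(lfn, [])
        if site is None:
            return list(replicas)
        return [r for r in replicas if r.attributes.get("site") == site]
```

For instance, registering `f.a` once with `site="local"` and once with `site="remote"` yields two replicas for the same LFN, and `lookup("f.a", site="local")` returns only the first.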

Pegasus supports the following implementations of the Replica Catalog.

  1. File (Default)

  2. Regex

  3. Directory

  4. Database via JDBC

  5. MRC

4.2.1. File

In this mode, Pegasus queries a file-based replica catalog. The file format is a simple multicolumn format. The catalog is not transactionally safe and is not advised for production use: multiple concurrent instances will conflict with each other. The site attribute should be specified whenever possible. The attribute key for the site attribute is "site".

LFN PFN
LFN PFN a=b [..]
LFN PFN a="b" [..]
"LFN w/LWS" "PFN w/LWS" [..]
      

The LFN may or may not be quoted. If it contains linear whitespace, quotes, a backslash or an equal sign, it must be quoted and escaped. The same conditions apply to the PFN. Attribute key-value pairs are separated by an equal sign without any surrounding whitespace. The value may be quoted, following the same quoting rules as the LFN.
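The quoting rules above can be sketched as a small helper that formats one catalog line. This is a minimal sketch under stated assumptions, not the Pegasus writer: it assumes "quotes" means double quotes and that escaping is done with a backslash, and the function names `quote` and `format_entry` are hypothetical.

```python
import re

# Characters that force quoting per the rules above:
# whitespace, double quote, backslash, equal sign.
_NEEDS_QUOTING = re.compile(r'[\s"\\=]')


def quote(token):
    """Quote and escape a token only when the format requires it."""
    if token and not _NEEDS_QUOTING.search(token):
        return token
    # Assumption: backslash-escape backslashes and double quotes.
    escaped = token.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def format_entry(lfn, pfn, **attributes):
    """Build one 'LFN PFN key=value ...' line for a file-based catalog."""
    parts = [quote(lfn), quote(pfn)]
    parts += [f"{key}={quote(str(value))}" for key, value in attributes.items()]
    return " ".join(parts)
```

With these assumptions, `format_entry("my file", "file:///data/my file")` quotes both columns, while a plain name such as `f.a` is written unquoted.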

File is the default mode. To use the File mode, set the following properties:

  1. pegasus.catalog.replica=File

  2. pegasus.catalog.replica.file=<path to the replica catalog file>

4.2.2. Regex

In this mode, Pegasus queries a file-based replica catalog. The file format is a simple multicolumn format. The catalog is not transactionally safe and is not advised for production use: multiple concurrent instances will conflict with each other. The site attribute should be specified whenever possible. The attribute key for the site attribute is "site".

In addition, users can specify regular expression based LFN's. A regular expression based entry must be qualified with an attribute named 'regex'. Setting the regex attribute to true identifies the catalog entry as a regular expression based entry. Regular expressions should follow Java regular expression syntax.

For example, consider a replica catalog as shown below.

Entry 1 is an entry that does not use a regular expression. This entry matches only a file named 'f.a', and nothing else.

Entry 2 is an entry that uses a regular expression. In this entry, f.a matches any file named f<any-character>a, i.e. faa, f.a, f0a, etc.

#1
f.a file:///Volumes/data/input/f.a site="local"
#2
f.a file:///Volumes/data/input/f.a site="local" regex="true"
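The difference between entries 1 and 2 can be illustrated with a short sketch. It uses Python's `re` module as a stand-in for Java regular expressions (the two agree for a simple pattern like `f.a`) and assumes the expression must match the whole LFN, as the description of entry 2 suggests. The function name `entry_matches` is hypothetical.

```python
import re


def entry_matches(entry_lfn, lookup_lfn, is_regex):
    """Return True if a catalog entry applies to the LFN being looked up."""
    if is_regex:
        # Assumption: the expression has to match the entire LFN.
        return re.fullmatch(entry_lfn, lookup_lfn) is not None
    # Without regex="true" the LFN column is compared literally.
    return entry_lfn == lookup_lfn
```

Under these assumptions, entry 1 applies only to `f.a`, while entry 2 also applies to `faa` and `f0a`.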

Regular expression based entries also support substitutions. For example, consider the regular expression based entry shown below.

Entry 3 will match files with name alpha.csv, alpha.txt, alpha.xml. In addition, values matched in the expression can be used to generate a PFN.

For the entry below, if the file being looked up is alpha.csv, the PFN is generated as file:///Volumes/data/input/csv/alpha.csv. Similarly, if the file being looked up is alpha.xml, the PFN is generated as file:///Volumes/data/input/xml/alpha.xml, i.e. the sections [0] and [1] are replaced. Section [0] refers to the entire matched string, e.g. alpha.csv. Section [1] refers to a partial match in the input, i.e. csv, txt, or xml. Users can use as many sections as they wish.

#3
alpha\.(csv|txt|xml) file:///Volumes/data/input/[1]/[0] site="local" regex="true"
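The substitution in entry 3 can be sketched as follows. This is an illustration under assumptions rather than the Pegasus implementation: Python's `re` stands in for Java regular expressions, section [n] is taken to be capture group n of a whole-name match, and the function name `expand_pfn` is hypothetical.

```python
import re

# A section placeholder such as [0] or [1] in the PFN column.
_SECTION = re.compile(r"\[(\d+)\]")


def expand_pfn(lfn_pattern, pfn_template, lfn):
    """Return the PFN generated for lfn, or None if the entry does not match."""
    match = re.fullmatch(lfn_pattern, lfn)
    if match is None:
        return None

    def substitute(section):
        # [0] is the entire match, [n] is the n-th parenthesised group.
        # A section number with no corresponding group raises IndexError.
        return match.group(int(section.group(1))) or ""

    return _SECTION.sub(substitute, pfn_template)
```

Calling `expand_pfn(r"alpha\.(csv|txt|xml)", "file:///Volumes/data/input/[1]/[0]", "alpha.csv")` reproduces the PFN given in the text, file:///Volumes/data/input/csv/alpha.csv.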

If an LFN matches multiple entries in the file, the implementation picks the first matching regular expression as it appears in the file. To specify a default location for all LFN's that don't match any other regular expression, add an entry like the following as the last entry in your file.

#4 all unmatched LFN's reside in the same input directory.

.*     file:///Volumes/data/input/[0] site="local" regex="true"
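The first-match rule and the catch-all entry can be sketched together. As before this is only an illustration: Python's `re` stands in for Java regular expressions, the list `ENTRIES` holds just the regex entries 3 and 4 from this section, and the function name `lookup` is hypothetical.

```python
import re

# Regex entries in file order: (LFN expression, PFN template).
ENTRIES = [
    (r"alpha\.(csv|txt|xml)", "file:///Volumes/data/input/[1]/[0]"),
    (r".*", "file:///Volumes/data/input/[0]"),  # catch-all, kept last
]


def lookup(lfn, entries=ENTRIES):
    """Return the PFN from the first entry whose expression matches lfn."""
    for pattern, template in entries:
        match = re.fullmatch(pattern, lfn)
        if match:
            # Replace each [n] section with the n-th matched group.
            return re.sub(
                r"\[(\d+)\]",
                lambda section: match.group(int(section.group(1))) or "",
                template,
            )
    return None
```

Because the catch-all matches every name, placing it first would shadow the more specific alpha entry, which is why it belongs at the end of the file.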

4.2.3. Directory

In this mode, Pegasus does a directory listing on an input directory to create the LFN to PFN mappings. The directory listing is performed recursively, resulting in deep LFN mappings. For example, if an input directory $input is specified with the following structure

$input
$input/f.1
$input/f.2
$input/D1
$input/D1/f.3

Pegasus will create the following LFN to PFN mappings internally

f.1 file://$input/f.1  site="local"
f.2 file://$input/f.2  site="local"
D1/f.3 file://$input/D1/f.3 site="local"

Users can optionally specify additional properties to configure the behavior of this implementation.

  1. pegasus.catalog.replica.directory to specify the path to the directory where the files exist.

  2. pegasus.catalog.replica.directory.site to specify a site attribute other than local to associate with the mappings.

  3. pegasus.catalog.replica.directory.flat.lfn to specify whether you want deep LFN's to be constructed or not. If not specified, value defaults to false i.e. deep lfn's are constructed for the mappings.

  4. pegasus.catalog.replica.directory.url.prefix to associate a URL prefix for the PFN's constructed. If not specified, the URL defaults to file://
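The directory scan and the options above can be sketched in a few lines. This is a rough approximation for illustration, not the Pegasus implementation. The function name `directory_mappings` and its parameters are hypothetical; they mirror the site, flat.lfn and url.prefix properties.

```python
import os


def directory_mappings(input_dir, site="local", flat_lfn=False, url_prefix="file://"):
    """Recursively list input_dir and return (LFN, PFN, attributes) tuples."""
    input_dir = os.path.abspath(input_dir)
    mappings = []
    for current_dir, _subdirs, filenames in os.walk(input_dir):
        for name in filenames:
            path = os.path.join(current_dir, name)
            if flat_lfn:
                # Flat LFN: only the file name, without sub-directories.
                lfn = name
            else:
                # Deep LFN: path relative to the input directory.
                lfn = os.path.relpath(path, input_dir).replace(os.sep, "/")
            mappings.append((lfn, url_prefix + path, {"site": site}))
    # Sort by LFN so the result does not depend on directory listing order.
    return sorted(mappings, key=lambda mapping: mapping[0])
```

For the $input layout shown earlier, the default call yields the LFN's D1/f.3, f.1 and f.2, while `flat_lfn=True` yields f.1, f.2 and f.3.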

Tip

pegasus-plan has an --input-dir option that can be used to specify an input directory on the command line. This lets you use a separate replica catalog to record the locations of output files.

4.2.4. JDBCRC

In this mode, Pegasus queries a SQL based replica catalog that is accessed via JDBC. To create the schema for JDBCRC use the pegasus-db-admin command line tool.

Note

A site attribute was added to the SQL schema as a unique key in the 4.4 release. To update an existing database schema, use the pegasus-db-admin tool.

Figure 4.2. Schema Image of the JDBCRC.


To use JDBCRC, the user additionally needs to set the following properties

  1. pegasus.catalog.replica=JDBCRC

  2. pegasus.catalog.replica.db.driver=mysql | postgres | sqlite

  3. pegasus.catalog.replica.db.url=<jdbc url to the database>, e.g. jdbc:mysql://database-host.isi.edu/database-name | jdbc:sqlite:/shared/jdbcrc.db

  4. pegasus.catalog.replica.db.user=<database user>

  5. pegasus.catalog.replica.db.password=<database password>

Users can use the command line client pegasus-rc-client to query, insert and remove entries in the JDBCRC backend. Starting with the 4.5 release, sqlite databases are also supported; specify a JDBC URL that refers to the sqlite database.

4.2.5. MRC

In this mode, Pegasus queries multiple replica catalogs to discover the file locations on the grid.

To use it, set

  1. pegasus.catalog.replica=MRC

Each associated replica catalog can be configured via properties as follows.

The user associates a variable name, referred to as [value], with each of the catalogs, where [value] is any legal identifier (concretely [A-Za-z][_A-Za-z0-9]*). For each associated replica catalog the user specifies the following properties

  • pegasus.catalog.replica.mrc.[value] - specifies the type of replica catalog.

  • pegasus.catalog.replica.mrc.[value].key - specifies a property name key for a particular catalog

For example, to query a File catalog and JDBCRC at the same time specify the following:

  • pegasus.catalog.replica=MRC

  • pegasus.catalog.replica.mrc.jdbcrc=JDBCRC

  • pegasus.catalog.replica.mrc.jdbcrc.url=<jdbc url>

  • pegasus.catalog.replica.mrc.file1=File

  • pegasus.catalog.replica.mrc.file1.url=<path to file based replica catalog>

In the above example, jdbcrc and file1 are arbitrary valid identifier names, and url is the property key that needs to be specified.

Another example is to use MRC with multiple input directories. Sample properties for that configuration are listed below

  • pegasus.catalog.replica=MRC

  • pegasus.catalog.replica.mrc.directory1=Directory

  • pegasus.catalog.replica.mrc.directory1.directory=/path/to/dir1

  • pegasus.catalog.replica.mrc.directory1.directory.site=obelix

  • pegasus.catalog.replica.mrc.directory2=Directory

  • pegasus.catalog.replica.mrc.directory2.directory=/path/to/dir2

  • pegasus.catalog.replica.mrc.directory2.directory.site=corbusier
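The naming scheme pegasus.catalog.replica.mrc.[value] and pegasus.catalog.replica.mrc.[value].key can be made concrete with a sketch that groups such properties by identifier. It only illustrates how the property names decompose and is not the Pegasus property loader. The function name `group_mrc_properties` and the result layout are hypothetical.

```python
MRC_PREFIX = "pegasus.catalog.replica.mrc."


def group_mrc_properties(properties):
    """Group MRC properties by catalog identifier ([value])."""
    catalogs = {}
    for name, value in properties.items():
        if not name.startswith(MRC_PREFIX):
            continue  # e.g. pegasus.catalog.replica=MRC itself
        # Split "<identifier>" or "<identifier>.<key>" at the first dot.
        identifier, _, key = name[len(MRC_PREFIX):].partition(".")
        catalog = catalogs.setdefault(identifier, {"type": None, "keys": {}})
        if key:
            catalog["keys"][key] = value  # per-catalog property key
        else:
            catalog["type"] = value  # type of replica catalog
    return catalogs
```

Applied to the multiple-directory example above, this yields two catalogs, directory1 and directory2, each of type Directory with the keys directory and directory.site.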

4.2.5.1. Replica Catalog Client pegasus-rc-client

The client used to interact with the Replica Catalogs is pegasus-rc-client. The implementation that the client talks to is configured using Pegasus properties.

Let's assume you create a file f.a in your home directory as shown below.

$ date > $HOME/f.a 

We now need to register this file in the File replica catalog located in $HOME/rc using pegasus-rc-client. Replace the gsiftp:// URL with the appropriate parameters for your grid site.

$ pegasus-rc-client -Dpegasus.catalog.replica=File -Dpegasus.catalog.replica.file=$HOME/rc insert \
 f.a gsiftp://somehost:port/path/to/file/f.a site=local

You may first want to verify that the file registration is in the replica catalog. Since we are using a File catalog, we can look at the file $HOME/rc to view the entries.

$ cat $HOME/rc
    
# file-based replica catalog: 2010-11-10T17:52:53.405-07:00
f.a gsiftp://somehost:port/path/to/file/f.a site=local

The above line shows that the entry for file f.a was made correctly.

You can also use the pegasus-rc-client to look for entries.

$ pegasus-rc-client -Dpegasus.catalog.replica=File -Dpegasus.catalog.replica.file=$HOME/rc lookup LFN f.a

f.a gsiftp://somehost:port/path/to/file/f.a site=local