10.3. Supported Transfer Protocols

Pegasus refers to a Python script called pegasus-transfer as the executable in the transfer jobs to transfer the data. pegasus-transfer looks at the source and destination URLs and automatically figures out which underlying client to use. pegasus-transfer is distributed with Pegasus and can be found at $PEGASUS_HOME/bin/pegasus-transfer.
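
The transfer jobs hand pegasus-transfer a list of transfers to perform. The following is a minimal sketch of such an input, assuming the JSON list format used by recent 4.x releases (the file names, site labels, and host are made up for illustration; the exact fields may vary by Pegasus version):

[
 { "type": "transfer",
   "linkage": "input",
   "id": 1,
   "src_urls": [ { "site_label": "local", "url": "file:///data/inputs/f.a" } ],
   "dest_urls": [ { "site_label": "staging", "url": "gsiftp://staging.example.edu/scratch/f.a" } ]
 }
]

Fed to the client on stdin (pegasus-transfer < transfers.in), this would cause pegasus-transfer to select a GridFTP-capable client such as gfal-copy based on the destination URL scheme.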

Currently, pegasus-transfer interfaces with the following transfer clients:

Table 10.4. Transfer Clients interfaced to by pegasus-transfer

Transfer Client   Used For
gfal-copy         staging files to and from GridFTP servers
globus-url-copy   staging files to and from GridFTP servers, only if gfal is not detected in the path
gfal-copy         staging files to and from SRM or XRootD servers
wget              staging files from an HTTP server
cp                copying files from a POSIX filesystem
ln                symlinking against input files
pegasus-s3        staging files to and from S3 buckets in Amazon Web Services
gsutil            staging files to and from Google Storage buckets
scp               staging files using scp
gsiscp            staging files using gsiscp and X.509 credentials
iget              staging files to and from iRODS servers
htar              retrieving input files from HPSS tape storage

For remote sites, Pegasus constructs the default path to pegasus-transfer on the basis of the PEGASUS_HOME environment profile specified in the site catalog. To specify a different path to the pegasus-transfer client, users can add an entry to the transformation catalog with the fully qualified logical name pegasus::pegasus-transfer.
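
For example, a minimal entry in the text-format transformation catalog might look like the following (the site handle and installation path are assumptions for illustration):

tr pegasus::pegasus-transfer {
    site condorpool {
        pfn "/opt/pegasus/default/bin/pegasus-transfer"
        arch "x86_64"
        os "linux"
        type "INSTALLED"
    }
}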

10.3.1. Amazon S3 (s3://)

Pegasus can be configured to use Amazon S3 as a staging site. In this mode, Pegasus transfers workflow inputs from the input site to S3. When a job runs, the inputs for that job are fetched from S3 to the worker node, the job is executed, then the output files are transferred from the worker node back to S3. When the jobs are complete, Pegasus transfers the output data from S3 to the output site.

In order to use S3, it is necessary to create a config file for the S3 transfer client, pegasus-s3. See the man page for details on how to create the config file. You also need to specify S3 as a staging site.
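
A minimal sketch of such a config file, assuming a single Amazon identity matching the s3://user@amazon URL used below (the man page is the authoritative reference for the keys; pegasus-s3 also expects the file to be readable only by you, e.g. mode 0600):

[amazon]
endpoint = https://s3.amazonaws.com/

[user@amazon]
access_key = <your AWS access key id>
secret_key = <your AWS secret access key>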

Next, you need to modify your site catalog to specify the location of your s3cfg file. See the section on credential staging.

The following site catalog shows how to specify the location of the s3cfg file on the local site and how to specify an Amazon S3 staging site:

<sitecatalog xmlns="http://pegasus.isi.edu/schema/sitecatalog"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://pegasus.isi.edu/schema/sitecatalog
             http://pegasus.isi.edu/schema/sc-3.0.xsd" version="3.0">
    <site handle="local" arch="x86_64" os="LINUX">
        <head-fs>
            <scratch>
                <shared>
                    <file-server protocol="file" url="file://" mount-point="/tmp/wf/work"/>
                    <internal-mount-point mount-point="/tmp/wf/work"/>
                </shared>
            </scratch>
            <storage>
                <shared>
                    <file-server protocol="file" url="file://" mount-point="/tmp/wf/storage"/>
                    <internal-mount-point mount-point="/tmp/wf/storage"/>
                </shared>
            </storage>
        </head-fs>
        <profile namespace="env" key="S3CFG">/home/username/.s3cfg</profile>
    </site>
    <site handle="s3" arch="x86_64" os="LINUX">
        <head-fs>
            <scratch>
                <shared>
                    <!-- wf-scratch is the name of the S3 bucket that will be used -->
                    <file-server protocol="s3" url="s3://user@amazon" mount-point="/wf-scratch"/>
                    <internal-mount-point mount-point="/wf-scratch"/>
                </shared>
            </scratch>
        </head-fs>
    </site>
    <site handle="condorpool" arch="x86_64" os="LINUX">
        <head-fs>
            <scratch/>
            <storage/>
        </head-fs>
        <profile namespace="pegasus" key="style">condor</profile>
        <profile namespace="condor" key="universe">vanilla</profile>
        <profile namespace="condor" key="requirements">(Target.Arch == "X86_64")</profile>
    </site>
</sitecatalog>
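
With a catalog like the one above, you would then plan the workflow with the s3 site as the staging site. A minimal sketch, assuming the site handles above and a DAX file named workflow.dax:

pegasus-plan --dax workflow.dax \
             --sites condorpool \
             --staging-site s3 \
             --output-site local \
             --submit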

10.3.2. Docker (docker://)

10.3.3. File / Symlink (file:// , symlink://)

10.3.4. GridFTP (gsiftp://)

10.3.4.1. Preference of GFAL over GUC

JGlobus is no longer actively supported and is not in compliance with RFC 2818. As a result, cleanup jobs using the pegasus-gridftp client would fail against servers operating in strict mode. We have removed the pegasus-gridftp client and now use the gfal clients, as globus-url-copy does not support removes. If gfal is not available, globus-url-copy is used for cleanup by writing out zero-byte files instead of removing them.

If you want to force globus-url-copy to be preferred over GFAL, set the PEGASUS_FORCE_GUC=1 environment variable in the site catalog for the sites on which you want the preference enforced. Please note that we expect globus-url-copy support to be completely removed in future releases of Pegasus due to the end of life of the Globus Toolkit (see announcement).
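
For example, the following profile in a site entry would enforce the preference (same syntax as the S3CFG profile shown earlier):

<profile namespace="env" key="PEGASUS_FORCE_GUC">1</profile>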

10.3.5. GridFTP over SSH (sshftp://)

Instead of using X.509 based security, newer versions of Globus GridFTP can be configured to set up transfers over SSH. See the Globus Documentation for details on installing and setting up this feature.

Pegasus requires the ability to specify which SSH key is to be used at runtime, and thus a small modification to the default Globus configuration is necessary. On the hosts where Pegasus initiates transfers (which depends on the data configuration of the workflow), please replace gridftp-ssh, usually located under /usr/share/globus/gridftp-ssh, with:

#!/bin/bash

url_string=$1
remote_host=$2
port=$3
user=$4

port_str=""
if  [ "X" = "X$port" ]; then
    port_str=""
else
    port_str=" -p $port "
fi

if  [ "X" != "X$user" ]; then
    remote_host="$user@$remote_host"
fi

remote_default1=.globus/sshftp
remote_default2=/etc/grid-security/sshftp
remote_fail="echo -e 500 Server is not configured for SSHFTP connections.\\\r\\\n"
remote_program=$GLOBUS_REMOTE_SSHFTP
if  [ "X" = "X$remote_program" ]; then
    remote_program="(( test -f $remote_default1 && $remote_default1 ) || ( test -f $remote_default2 && $remote_default2 ) || $remote_fail )"
fi

if [ "X" != "X$GLOBUS_SSHFTP_PRINT_ON_CONNECT" ]; then
    echo "Connecting to $1 ..." >/dev/tty
fi

# for pegasus-transfer
extra_opts=" -o StrictHostKeyChecking=no"
if [ "x$SSH_PRIVATE_KEY" != "x" ]; then
    extra_opts="$extra_opts -i $SSH_PRIVATE_KEY"
fi

exec /usr/bin/ssh $extra_opts $port_str $remote_host $remote_program

Once configured, you should be able to use URLs such as sshftp://username@host/foo/bar.txt in your workflows.
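
For example, you could create a dedicated passwordless key pair for workflow transfers and expose it through the SSH_PRIVATE_KEY environment variable that the wrapper above checks (the key path and remote host here are assumptions; see the section on credential staging for how Pegasus passes the key to jobs):

# generate a key used only for workflow sshftp transfers
ssh-keygen -t rsa -b 4096 -N '' -f $HOME/.ssh/wf_sshftp_key
# install the public key on the GridFTP host
ssh-copy-id -i $HOME/.ssh/wf_sshftp_key.pub username@host
# tell the wrapper script which key to use
export SSH_PRIVATE_KEY=$HOME/.ssh/wf_sshftp_key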

10.3.6. Google Storage (gs://)

10.3.7. HTTP (http:// , https://)

10.3.8. HPSS (hpss://)

We support retrieval of input files from a tar file in HPSS storage using the htar command. The naming convention to describe the tar file and the file to retrieve from the tar file is as follows:

hpss:///some-name.tar/path/in-tar-to/file.txt

For example: hpss:///test.tar/set1/f.a

For efficient retrieval, pegasus-transfer bins all the HPSS transfers in the .in file:

  • first by the tar file, and then

  • by the destination directory.

Binning by destination directory is done to support deep LFNs. Also note that the htar command returns success even if a file does not exist in the archive, so pegasus-transfer verifies after the transfer that the destination file exists and is readable.
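
As an illustration, the retrieval for the earlier example URL roughly corresponds to the following (the destination directory is made up; the last line mirrors the verification described above):

cd /scratch/wf/run0001
# extract one member from the tar file stored in HPSS
htar -x -f test.tar set1/f.a
# htar exits 0 even for missing members, so check the file explicitly
test -r set1/f.a || exit 1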

HPSS requires a token to be generated for retrieval. Information on how to specify the token location can be found here.

10.3.9. iRODS (irods://)

iRODS can be used as an input data location, a storage site for intermediate data during workflow execution, or a location for final output data. Pegasus uses a URL notation to identify iRODS files. Example:

irods://some-host.org/path/to/file.txt

The path to the file is relative to the internal iRODS location. In the example above, the path used to refer to the file in iRODS is path/to/file.txt (no leading /).

See the section on credential staging for information on how to set up an irodsEnv file to be used by Pegasus.
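
As a sketch, fetching the example URL above with the iRODS icommands client would roughly correspond to the following (the local destination path is made up):

# paths are relative to the iRODS home collection, hence no leading /
iget path/to/file.txt /local/scratch/file.txt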

10.3.10. SCP (scp://)

10.3.11. OSG Stash / stashcp (stash://)

Open Science Grid provides a data service called Stash, and the command line tool stashcp for interacting with Stash data. An example of how to set up the site catalog and URLs can be found in the OSG User Support Pegasus tutorial.
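
As a rough sketch, a stashcp invocation for a single file looks like the following (the Stash path and local destination here are hypothetical):

stashcp /osgconnect/public/username/inputs/f.a /local/scratch/f.a

The corresponding Pegasus URL for such a file would use the stash:// scheme, e.g. stash:///osgconnect/public/username/inputs/f.a.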

10.3.12. Globus Online (go://)

Globus Online is a transfer service with features such as policy based connection management and automatic failure detection and recovery. Pegasus has limited support for Globus Online transfers.

If you want to use Globus Online in your workflow, all data has to be accessible via a Globus Online endpoint. You cannot mix Globus Online endpoints with other protocols. For most users, this means they will have to create an endpoint for their submit host and probably modify both the replica catalog and the DAX generator so that all URLs in the workflow refer to Globus Online endpoints.

There are two levels of credentials required. One is for the workflow to use the Globus Online API, which is handled by OAuth tokens provided by the Globus Auth service. The second level is for the endpoints, which the user will have to manage via the Globus Online web interface. The required steps are:

  1. Using pegasus-globus-online-init, provide authorization to Pegasus and retrieve your transfer access tokens (see the example command after this list). By default Pegasus acquires temporary tokens that expire within a few days. Using the --permanent option you can request refreshable tokens that last indefinitely (or until access is revoked).

  2. In the Globus Online web interface, under Endpoints, find the endpoints you need for the workflow, and activate them. Note that you should activate them for the whole duration of the workflow or you will have to regularly log in and re-activate the endpoints during workflow execution.
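
For step 1, the initialization is a single command run on the submit host:

pegasus-globus-online-init --permanent

Drop --permanent if temporary tokens that expire within a few days are acceptable for your workflow.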

URLs for Globus Online endpoint data follow the scheme go://[endpoint]/[path]. For example, for a user with the Globus Online private endpoint bob#researchdata and a file /home/bsmith/experiment/1.dat, the URL would be: go://bob#researchdata/home/bsmith/experiment/1.dat