Pegasus 4.7.4 User Guide

1. Introduction
1.1. Overview and Features
1.2. Workflow Gallery
1.3. About this Document
1.4. Document Formats (Web, PDF)
2. Tutorial
2.1. Introduction
2.2. Getting Started
2.3. What are Scientific Workflows?
2.4. Submitting an Example Workflow
2.5. Workflow Dashboard for Monitoring and Debugging
2.6. Command line tools for Monitoring and Debugging
2.7. Recovery from Failures
2.8. Generating the Workflow
2.9. Information Catalogs
2.10. Configuring Pegasus
2.11. Conclusion
3. Installation
3.1. Prerequisites
3.2. Optional Software
3.3. Environment
3.4. RHEL / CentOS / Scientific Linux
3.5. Ubuntu
3.6. Debian
3.7. Mac OS X
3.8. Pegasus from Tarballs
4. Creating Workflows
4.1. Abstract Workflows (DAX)
4.2. Data Discovery (Replica Catalog)
4.3. Resource Discovery (Site Catalog)
4.4. Executable Discovery (Transformation Catalog)
4.5. Variable Expansion
5. Running Workflows
5.1. Executable Workflows (DAG)
5.2. Mapping Refinement Steps
5.3. Data Staging Configuration
5.4. PegasusLite
5.5. Pegasus-Plan
5.6. Basic Properties
6. Monitoring, Debugging and Statistics
6.1. Workflow Status
6.2. Plotting and Statistics
6.3. Dashboard
6.4. Notifications
6.5. Monitoring Database
6.6. Stampede Workflow Events
7. Execution Environments
7.1. Localhost
7.2. Condor Pool
7.3. Cloud (Amazon EC2/S3, Google Cloud, ...)
7.4. Remote Cluster using Globus GRAM
7.5. Remote Cluster using CREAMCE
7.6. Local PBS Cluster Using Glite
7.7. SDSC Comet with Bosco glideins
7.8. Remote PBS Cluster using Bosco and SSH
7.9. Campus Cluster
7.10. XSEDE
7.11. Open Science Grid Using glideinWMS
8. Example Workflows
8.1. Grid Examples
8.2. Condor Examples
8.3. Local Shell Examples
8.4. Notifications Example
8.5. Workflow of Workflows
9. Data Management
9.1. Replica Selection
9.2. Data Transfers
9.3. Credentials Management
9.4. Staging Mappers
9.5. Output Mappers
9.6. Data Cleanup
9.7. Metadata
10. Optimizing Workflows for Efficiency and Scalability
10.1. Optimizing Short Jobs / Scheduling Delays
10.2. Job Clustering
10.3. How to Scale Large Workflows
10.4. Hierarchical Workflows
10.5. Optimizing Data Transfers
10.6. Job Throttling
11. Pegasus Service
11.1. Service Administration
11.2. Dashboard
11.3. Running Pegasus Service under Apache HTTPD
11.4. Ensemble Manager
12. Configuration
12.1. Differences between Profiles and Properties
12.2. Profiles
12.3. Properties
13. Submit Directory Details
13.1. Layout
13.2. Condor DAGMan File
13.3. Kickstart XML Record
13.4. Jobstate.Log File
13.5. Braindump File
13.6. Pegasus static.bp File
14. API Reference
14.1. DAX XML Schema
14.2. DAX Generator API
14.3. DAX Generator without a Pegasus DAX API
14.4. Monitoring
15. Command Line Tools
pegasus-analyzer — debugs a workflow
pegasus-cluster — runs a list of applications
pegasus-configure-glite — installs Pegasus-specific Glite configuration
pegasus-config — finds installed Pegasus tools and libraries
pegasus-dagman — wrapper around *condor_dagman*; not intended to be run by users
pegasus-dax-validator — determines whether a given DAX file is valid
pegasus-db-admin — manages Pegasus databases
pegasus-em — submits and monitors ensembles of workflows
pegasus-exitcode — checks a job's stdout/stderr for errors after the job completes
pegasus-globus-online — interfaces with Globus Online for managed transfers
pegasus-graphviz — converts a DAX or DAG into a Graphviz dot file
pegasus-gridftp — performs file and directory operations on remote GridFTP servers
pegasus-halt — stops a workflow gracefully, allowing current jobs to finish
pegasus-init — creates a new workflow configuration
pegasus-invoke — invokes a command from a file
pegasus-keg — kanonical executable for grids
pegasus-kickstart — remote job wrapper
pegasus-metadata — queries metadata collected for Pegasus workflows
pegasus-monitord — tracks workflow progress, mining information
pegasus-mpi-cluster — runs DAGs (Directed Acyclic Graphs) on clusters using MPI
pegasus-mpi-keg — MPI version of KEG
pegasus-plan — runs Pegasus to generate the executable workflow
pegasus-plots — generates graphs and charts to visualize a workflow run
pegasus-rc-client — shell client for replica catalog implementations
pegasus-remove — removes a workflow that has been planned and submitted using pegasus-plan and pegasus-run
pegasus-run — executes a workflow that has been planned using *pegasus-plan*
pegasus-s3 — uploads, downloads, and deletes objects in Amazon S3
pegasus-sc-converter — converts a site catalog from one format to another
pegasus-service — runs the Pegasus Service server
pegasus-statistics — generates statistics about a workflow run
pegasus-status — reports Pegasus workflow and run-time status
pegasus-submit-dag — wrapper around *condor_submit_dag*; not intended to be run by users
pegasus-submitdir — manages a workflow submit directory
pegasus-tc-client — a full-featured generic client for adds, deletes, and queries to the Transformation Catalog (TC)
pegasus-tc-converter — converts a transformation catalog from one format to another
pegasus-transfer — handles data transfers for Pegasus workflows
pegasus-version — prints or matches the version of the toolkit
16. Useful Tips
16.1. Migrating From Pegasus 4.5.X to the Current Pegasus Version
16.2. Migrating From Pegasus Versions Earlier Than 4.5 to Pegasus 4.5.X
16.3. Migrating From Pegasus 3.1 to Pegasus 4.X
16.4. Migrating From Pegasus 2.X to Pegasus 3.X
16.5. Best Practices For Developing Portable Code
16.6. Slot Partitioning and CPU Affinity in Condor
17. Funding, citing, and anonymous usage statistics
17.1. Citing Pegasus in Academic Works
17.2. Usage Statistics Collection
18. Glossary
A. Tutorial VM
A.1. Introduction
A.2. VirtualBox
A.3. Amazon EC2