Pegasus 4.8.0 User Guide

1. Introduction
1.1. Overview and Features
1.2. Workflow Gallery
1.3. About this Document
1.4. Document Formats (Web, PDF)
2. Tutorial
2.1. Introduction
2.2. Getting Started
2.3. What are Scientific Workflows
2.4. Submitting an Example Workflow
2.5. Workflow Dashboard for Monitoring and Debugging
2.6. Command line tools for Monitoring and Debugging
2.7. Recovery from Failures
2.8. Generating the Workflow
2.9. Information Catalogs
2.10. Configuring Pegasus
2.11. Conclusion
3. Installation
3.1. Prerequisites
3.2. Optional Software
3.3. Environment
3.4. RHEL / CentOS / Scientific Linux
3.5. Ubuntu
3.6. Debian
3.7. Mac OS X
3.8. Pegasus from Tarballs
4. Creating Workflows
4.1. Abstract Workflows (DAX)
4.2. Data Discovery (Replica Catalog)
4.3. Resource Discovery (Site Catalog)
4.4. Executable Discovery (Transformation Catalog)
4.5. Variable Expansion
5. Running Workflows
5.1. Executable Workflows (DAG)
5.2. Mapping Refinement Steps
5.3. Data Staging Configuration
5.4. PegasusLite
5.5. Pegasus-Plan
5.6. Basic Properties
6. Monitoring, Debugging and Statistics
6.1. Workflow Status
6.2. Plotting and Statistics
6.3. Dashboard
6.4. Notifications
6.5. Monitoring Database
6.6. Stampede Workflow Events
7. Execution Environments
7.1. Localhost
7.2. Condor Pool
7.3. Cloud (Amazon EC2/S3, Google Cloud, ...)
7.4. Remote Cluster using PyGlidein
7.5. Remote Cluster using Globus GRAM
7.6. Remote Cluster using CREAMCE
7.7. Local PBS Cluster Using Glite
7.8. SDSC Comet with BOSCO glideins
7.9. Remote PBS Cluster using BOSCO and SSH
7.10. Campus Cluster
7.11. XSEDE
7.12. Open Science Grid Using glideinWMS
8. Containers
8.1. Overview
8.2. Configuring Workflows To Use Containers
8.3. Container Execution Model
8.4. Staging of Application Containers
8.5. Container Example - Montage Workflow
9. Example Workflows
9.1. Grid Examples
9.2. Condor Examples
9.3. Container Examples
9.4. Local Shell Examples
9.5. Notifications Example
9.6. Workflow of Workflows
10. Data Management
10.1. Replica Selection
10.2. Data Transfers
10.3. Credentials Management
10.4. Staging Mappers
10.5. Output Mappers
10.6. Data Cleanup
10.7. Metadata
11. Optimizing Workflows for Efficiency and Scalability
11.1. Optimizing Short Jobs / Scheduling Delays
11.2. Job Clustering
11.3. How to Scale Large Workflows
11.4. Hierarchical Workflows
11.5. Optimizing Data Transfers
11.6. Job Throttling
12. Pegasus Service
12.1. Service Administration
12.2. Dashboard
12.3. Running Pegasus Service under Apache HTTPD
12.4. Ensemble Manager
13. Configuration
13.1. Differences between Profiles and Properties
13.2. Profiles
13.3. Properties
14. Submit Directory Details
14.1. Layout
14.2. Condor DAGMan File
14.3. Kickstart XML Record
14.4. Jobstate.Log File
14.5. Braindump File
14.6. Pegasus static.bp File
15. Jupyter Notebooks
15.1. Introduction
15.2. Requirements
15.3. The Pegasus DAX and Jupyter Python APIs
15.4. JupyterHub
15.5. API Reference
15.6. Tutorial Example Notebook
16. API Reference
16.1. DAX XML Schema
16.2. DAX Generator API
16.3. DAX Generator without a Pegasus DAX API
16.4. Monitoring
17. Command Line Tools
pegasus-analyzer — debugs a workflow.
pegasus-cluster — run a list of applications
pegasus-configure-glite — install Pegasus-specific glite configuration
pegasus-config — Can be used to find installed Pegasus tools and libraries.
pegasus-dagman — Wrapper around *condor_dagman*. Not to be run by user.
pegasus-dax-validator — determines if a given DAX file is valid.
pegasus-db-admin — Manage Pegasus databases.
pegasus-em — Submit and monitor ensembles of workflows
pegasus-exitcode — Used post-job to check the stdout/stderr for errors
pegasus-globus-online — Interfaces with Globus Online for managed transfers.
pegasus-graphviz — Convert a DAX or DAG into a graphviz dot file
pegasus-gridftp — Perform file and directory operations on remote GridFTP servers
pegasus-halt — stops a workflow gracefully; current jobs will finish
pegasus-init — create a new workflow configuration
pegasus-integrity — Generates and verifies data integrity with checksums
pegasus-invoke — invokes a command from a file
pegasus-keg — kanonical executable for grids
pegasus-kickstart — remote job wrapper
pegasus-metadata — Query metadata collected for Pegasus workflows
pegasus-monitord — tracks a workflow's progress, mining information
pegasus-mpi-cluster — Enables running DAGs (Directed Acyclic Graphs) on clusters using MPI.
pegasus-mpi-keg — MPI version of KEG
pegasus-plan — runs Pegasus to generate the executable workflow
pegasus-plots — A tool to generate graphs and charts to visualize a workflow run.
pegasus-rc-client — shell client for replica implementations
pegasus-remove — removes a workflow that has been planned and submitted using pegasus-plan and pegasus-run
pegasus-run — executes a workflow that has been planned using *pegasus-plan*.
pegasus-s3 — Upload, download, delete objects in Amazon S3
pegasus-sc-converter — A client to convert a site catalog from one format to another.
pegasus-service — Runs the Pegasus Service server
pegasus-statistics — A tool to generate statistics about the workflow run.
pegasus-status — Pegasus workflow- and run-time status
pegasus-submit-dag — Wrapper around *condor_submit_dag*. Not to be run by user.
pegasus-submitdir — Manage a workflow submit directory.
pegasus-tc-client — A full-featured generic client to handle adds, deletes and queries to the Transformation Catalog (TC).
pegasus-tc-converter — A client to convert a transformation catalog from one format to another.
pegasus-transfer — Handles data transfers for Pegasus workflows.
pegasus-version — print or match the version of the toolkit.
18. Useful Tips
18.1. Migrating From Pegasus 4.5.X to the Current Pegasus Version
18.2. Migrating From Pegasus <4.5 to Pegasus 4.5.X
18.3. Migrating From Pegasus 3.1 to Pegasus 4.X
18.4. Migrating From Pegasus 2.X to Pegasus 3.X
18.5. Best Practices For Developing Portable Code
18.6. Slot Partitioning and CPU Affinity in Condor
19. Funding, citing, and anonymous usage statistics
19.1. Citing Pegasus in Academic Works
19.2. Usage Statistics Collection
20. Glossary
A. Tutorial VM
A.1. Introduction
A.2. VirtualBox
A.3. Amazon EC2