14. Glossary

Abstract Workflow (known as DAX prior to version 5.0)

The workflow input, in YAML format, given to Pegasus, in which transformations and files are represented as logical names. It is an execution-independent specification of computations. The abstract workflow can optionally include catalog descriptions that tell Pegasus the locations of files (Replica Catalog), executables (Transformation Catalog), and sites (Site Catalog).
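
For illustration, a minimal abstract workflow written with the Pegasus Python API might look like the following sketch; the transformation name, file names, and arguments are hypothetical.

```python
from Pegasus.api import *

# Logical file names; Pegasus resolves their physical locations at planning time.
fa = File("f.a")
fb = File("f.b")

# A job that refers to a transformation by its logical name.
preprocess = (
    Job("preprocess")
    .add_args("-i", fa, "-o", fb)
    .add_inputs(fa)
    .add_outputs(fb)
)

# The abstract workflow, serialized as YAML for the planner.
wf = Workflow("example-workflow")
wf.add_jobs(preprocess)
wf.write("workflow.yml")
```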

Clustering

The process of grouping short-running jobs together into a larger job. This is done to minimize the scheduling overhead for the jobs, since the overhead is then incurred only once, for the clustered job. For example, if the scheduling overhead is x seconds and 10 jobs are clustered into a single larger job, the total scheduling overhead for the 10 jobs is x instead of 10x.
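
As a sketch, horizontal clustering can be requested through the Pegasus clusters.size profile; the transformation name and cluster size below are illustrative, and clustering is typically enabled at planning time (e.g. via pegasus-plan's --cluster option).

```python
from Pegasus.api import *

# Hypothetical short-running job. With clusters.size set to 10, the planner
# groups jobs of this transformation into clusters of 10, so the scheduling
# overhead is paid once per cluster instead of once per job.
job = Job("quick-analysis").add_profiles(
    Namespace.PEGASUS, key="clusters.size", value=10
)
```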

Compute Site

The logical handle of the computational resource in the Site Catalog on which the executable workflow is executed by Pegasus.

Concrete Workflow or Executable Workflow

The output workflow generated by Pegasus in which files are represented by physical filenames, transformations are represented by paths to executables, and sites or hosts have been selected for running each task. By default, Pegasus generates an executable workflow represented as a HTCondor DAGMan workflow.

Deferred Planning

A planning mode in Pegasus. In this mode, instead of a job being mapped to a site at submit time, the mapping decision is deferred until a later point, i.e. when the job is about to run. Deferred planning applies only to pegasusWorkflow jobs in hierarchical workflows, where such jobs refer to another abstract workflow to be executed.

Directed Acyclic Graph (DAG)

A graph in which all the arcs (connections) are unidirectional and which has no loops (cycles).

Full Ahead Planning

A planning mode in Pegasus. In this mode, all the jobs are mapped to execution sites before the workflow is submitted to the underlying execution resources.

Globus

The Globus Alliance is a community of organizations and individuals developing fundamental technologies behind the “Grid,” which lets people share computing power, databases, instruments, and other online tools securely across corporate, institutional, and geographic boundaries without sacrificing local autonomy.

See Globus Toolkit

Globus Toolkit

Globus Toolkit is an open source software toolkit used for building Grid systems and applications.

GRAM

A Globus service that enables users to locate, submit, monitor and cancel remote jobs on Grid-based compute resources. It provides a single protocol for communicating with different batch/cluster job schedulers.

Grid

A collection of compute resources, each under a different administrative domain, connected via a network (usually the Internet).

GridFTP

A high-performance, secure, reliable data transfer protocol optimized for high-bandwidth wide-area networks. It is based upon the Internet FTP protocol, and uses basic Grid security on both control (command) and data channels.

Hierarchical Workflow

An abstract workflow in which some jobs, instead of referring to compute jobs, point to another workflow that needs to be executed. Hierarchical workflows are one way for users to compose very large workflows that can contain hundreds of thousands of tasks across all the sub workflows. A minimal sketch follows.
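
The sketch below builds a hierarchical workflow with the Python API's SubWorkflow job type, assuming a previously written inner workflow file (here inner.yml) and illustrative planner arguments.

```python
from Pegasus.api import *

# Outer workflow containing a pegasusWorkflow job that points to
# another abstract workflow file rather than to a compute job.
outer = Workflow("outer-workflow")

inner = SubWorkflow("inner.yml", is_planned=False).add_args(
    "--output-sites", "local"
)

outer.add_jobs(inner)
outer.write("outer.yml")
```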

HTCondor DAGMan

The workflow execution engine used by Pegasus to manage the execution of the executable workflow.

HTCondor-G

A task broker that manages jobs to run at various distributed sites, using Globus GRAM to launch jobs on the remote sites. More information can be found on the HTCondor website.

Input Site

The logical handle to the storage resource described in the Site Catalog, where input data required by a workflow resides.

Invocation Record

Where possible, jobs in Pegasus are launched using Kickstart, which captures runtime provenance about the Tasks in a job (such as exit code, duration, hostname, and the directory where the job ran) in a YAML-formatted record called the Invocation Record. For clustered jobs, there are multiple Invocation Records associated with the job in the Stampede Database.

See Task.

Job

A node in the workflow. For recording purposes, the Stampede monitoring database differentiates between jobs in the input Abstract Workflow and jobs in the Executable Workflow (HTCondor DAG): Pegasus takes in an Abstract Workflow composed of Tasks and plans it into an Executable Workflow (HTCondor DAG) consisting of Jobs. With Clustering, multiple tasks in the Abstract Workflow can be captured in a single job in the Executable Workflow.

Kickstart

A lightweight C executable that Pegasus uses to launch user executables and gather metrics about the execution of each job.

Logical File Name (LFN)

The unique logical identifier for a data file or an executable. Each LFN is associated with a set of PFNs that are the physical instantiations of the file.

See Physical File Name

Metadata

Any attributes of a dataset that are explicitly represented in the workflow system. These may include provenance information (e.g., which component was used to generate the dataset), execution information (e.g., time of creation of the dataset), and properties of the dataset (e.g., density of a node type).
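
As an illustration, the Python API lets metadata be attached to files and jobs; the keys and values below are hypothetical.

```python
from Pegasus.api import *

# Provenance and dataset properties recorded as key-value metadata.
dataset = File("observations.dat").add_metadata(
    creator="survey-pipeline", created="2023-01-15"
)

# Jobs can carry metadata as well.
job = Job("analyze").add_inputs(dataset).add_metadata(priority="high")
```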

Open Science Grid (OSG)

The Open Science Grid consists of computing and storage elements at over 100 individual sites spanning the United States. Researchers can submit batch jobs from their home institution, or from OSG-provided submit points, to access their local resources and elastically expand out to the OSG, leveraging the distributed nature of the consortium. More information can be found on the OSG website.

Output Replica Catalog

A catalog in which the registration jobs in the executable workflow record the locations of the generated outputs that are staged to the output site. By default, this is an SQLite database in the submit directory of the workflow.

Output Site

The logical handle to the data staging storage resource described in the Site Catalog, that identifies where the final outputs of the workflow are to be placed.

Pegasus

Pegasus is a workflow system that takes in an abstract workflow and generates an executable workflow that can be executed on a set of distributed execution resources. It automatically locates the input data and computational resources necessary for workflow execution. Pegasus allows workflow-based applications to execute in a number of different environments, including desktops, campus clusters, computational grids, and clouds.

Physical File Name (PFN)

The physical filename (URL) for an LFN that points to an actual file on a particular resource. A physical filename is usually associated with a “site” attribute in the Pegasus catalogs, which tells Pegasus on what site the file pointed to by the PFN resides.

Replica Catalog

A catalog that maps LFNs on to PFNs. Pegasus uses this catalog to discover locations of datasets referred to in the abstract workflow.
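
A minimal replica catalog sketch using the Python API; the site handle and path are assumptions.

```python
from Pegasus.api import *

# Map the logical file name "f.a" to a physical location on site "local".
rc = ReplicaCatalog()
rc.add_replica("local", "f.a", "/data/inputs/f.a")
rc.write("replicas.yml")
```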

Site

A set of compute resources under a single administrative domain.

Site Catalog

A catalog indexed by logical site identifiers that maintains information about the various computational sites.
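
A sketch of a site catalog entry with the Python API, assuming a site named "local" with a hypothetical shared scratch directory served over file:// URLs.

```python
from Pegasus.api import *

# A "local" site with a shared scratch directory and its file server.
local = Site("local", arch=Arch.X86_64, os_type=OS.LINUX).add_directories(
    Directory(Directory.SHARED_SCRATCH, "/scratch/wf")
    .add_file_servers(FileServer("file:///scratch/wf", Operation.ALL))
)

sc = SiteCatalog()
sc.add_sites(local)
sc.write("sites.yml")
```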

Staging Site

The logical handle to the data staging storage resource described in the Site Catalog, which is used by Pegasus to stage input data required for jobs in the workflow, and store the intermediate datasets generated by the jobs in the workflow.

Stampede Database

The database in which all the runtime provenance about the execution of workflows is recorded. The Pegasus Dashboard also pulls information from this database. By default, this is an SQLite database in the submit directory of the workflow.

Sub Workflow

The workflow referred to by a pegasusWorkflow job in a hierarchical workflow.

See Hierarchical Workflow.

Submit Directory

The directory where Pegasus writes out the executable workflow. It usually contains all the files required by HTCondor DAGMan to execute the executable workflow.

Task

A node in the input Abstract Workflow. The monitoring layer in Pegasus differentiates between jobs in the input Abstract Workflow (Tasks) and jobs in the Executable Workflow (HTCondor DAG): Pegasus takes in an Abstract Workflow composed of Tasks and plans it into an Executable Workflow consisting of Jobs. With Clustering, multiple tasks in the Abstract Workflow can be captured in a single job in the Executable Workflow.

Transformation

Any executable or code that is run as a task in the workflow.

Transformation Catalog

A catalog that maps transformation names onto the physical pathnames of the transformation at a given compute site.
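
A sketch of a transformation catalog entry with the Python API; the executable name, site handle, and path are illustrative.

```python
from Pegasus.api import *

# Map the logical transformation "preprocess" to a physical executable
# already installed on the "condorpool" site.
preprocess = Transformation(
    "preprocess",
    site="condorpool",
    pfn="/usr/local/bin/preprocess",
    is_stageable=False,
)

tc = TransformationCatalog()
tc.add_transformations(preprocess)
tc.write("transformations.yml")
```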

XSEDE

The Extreme Science and Engineering Discovery Environment (XSEDE) is a collection of supercomputing clusters and academic clouds, largely available in the United States, for use by researchers in various fields. More information can be found on the XSEDE website.