Schemas
Workflow YAML Schema
The workflow format (formerly called DAX) is described by the YAML schema instance document wf-5.0.yml.
Workflow YAML Schema In Detail
The abstract workflow file format has four major sections, with the second section divided into further sub-sections. The abstract workflow format works on the abstract or logical level, letting you focus on the shape of the workflow: what to do, and what data to work upon.
Workflow-level Metadata
Metadata that is associated with the whole workflow. These are defined in the Metadata section.
Workflow-level Notifications
Very simple workflow-level notifications. These are defined in the Notification section.
Catalogs
The first section deals with included catalogs. While we recommend using external replica and transformation catalogs, it is possible to include some replicas and transformations in the abstract workflow file itself. Any workflow-included entry takes precedence over regular replica catalog (RC) and transformation catalog (TC) entries.
The first section (and any of its sub-sections) is completely optional.
The first sub-section deals with included replica descriptions.
The second sub-section deals with included transformation descriptions.
The third sub-section declares compound transformations, i.e. multi-item executables.
Job List
The jobs section defines the job or task descriptions. For each task to conduct, a three-part logical name declares the task and aids in identifying it in the transformation catalog or one of the executable sections above. During planning, the logical name is translated into the physical executable location on the chosen target site. By declaring jobs abstractly, physical layout considerations of the target sites do not matter. The job's id uniquely identifies the job within this workflow.
The arguments declare what command-line arguments to pass to the job. If you are passing filenames, you should refer to the logical filename using the file element in the argument list.
Important for properly planning the task is the list of files the task consumes (its input files) and the files it produces (its output files). Each file is described with a uses element inside the task.
Elements exist to link a logical file to any of the stdio file descriptors. The profile element is Pegasus’s way to abstract site-specific data.
Jobs are nodes in the workflow graph. Other nodes include unplanned workflows (DAX), which are planned and then run when the node runs, and planned workflows (DAG), which are simply executed.
Control-flow Dependencies
The third section lists the dependencies between the tasks. The relationships are defined as child-parent relationships, and thus determine the order in which tasks are run. No cyclic dependencies are permitted.
Dependencies are directed edges in the workflow graph.
YAML Intro
Abstract workflows generated by the Workflow API always list the following optional attributes under the x-pegasus extension.
x-pegasus:
apiLang: python
createdBy: vahi
createdOn: 07-24-20T10:08:48Z
pegasus: "5.0"
name: diamond
The following attributes are supported at the root of the workflow instance document.
| attribute | optional? | type | meaning |
|---|---|---|---|
| pegasus | required | VersionPattern | Version number of the YAML instance document. Must be 5.0. |
| name | required | string | name of this abstract workflow. |
| apiLang | optional | string | the language of the workflow API used to generate this workflow. |
| createdBy | optional | string | the user who created this workflow. |
| createdOn | optional | string | the timestamp when it was created. |
The version attribute is restricted to the regular expression \d+(\.\d+(\.\d+)?)?. This expression represents the VersionPattern type that is used in other places, too. It is a more restrictive expression than before, but allows us to compute comparable version numbers using the following formula:

given version1 = a.b.c and version2 = d.e.f:
n = a * 1,000,000 + b * 1,000 + c
m = d * 1,000,000 + e * 1,000 + f
version1 > version2 if n > m

For example, version 5.0.4 maps to n = 5,000,004 and version 4.9.3 maps to m = 4,009,003, so 5.0.4 > 4.9.3.
Workflow-level Metadata
Metadata associated with the whole workflow.
metadata:
creator: vahi
The workflow-level metadata may be used to control the Pegasus Mapper behavior at planning time, or may be propagated to external services while querying for job characteristics.
Workflow-level Notifications
Notifications that are generated when workflow level events happened.
hooks:
shell:
- _on: start
cmd: /pegasus/libexec/notification/email -t notify@example.com
- _on: end
cmd: /pegasus/libexec/notification/email -t notify@example.com
The above snippet sends an email notification when the workflow starts and when it ends. The notifications are acted upon by the pegasus-monitord instance monitoring the workflow.
The Catalogs Section
The initial section features three sub-sections:
a catalog of files used,
a catalog of transformations used, and
compound transformation declarations.
The Replica Catalog Section
The replicas section acts as an in-file replica catalog (RC). Any files declared in this section take precedence over files in external replica catalogs during planning.
replicaCatalog:
replicas:
- lfn: input.txt
pfns:
- {site: local, pfn: 'http://example.com/pegasus/input/input.txt'}
checksum: {sha256: 66a42b4be204c824a7533d2c677ff7cc5c44526300ecd6b450602e06128063f9}
The first replicas entry above is an example of a data file with a single replica. The lfn attribute signifies the logical file name. Each logical filename may have additional information associated with it, enumerated by profile elements. Each entry in the replicas section may have zero or more metadata entries associated with it.
Each entry in the replicas section can provide zero or more pfn locations, taking precedence over the replica catalog. Multiple locations constitute replicas of the same file, and are assumed to be usable interchangeably. The pfn attribute is mandatory, and typically uses a file scheme URL. The site attribute is optional, and defaults to the value local if missing. A pfns entry may have profile child elements, which refer to attributes of the physical file. The file-level profiles refer to attributes of the logical file.
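As a minimal sketch (the site names and URLs are hypothetical), a replica entry with two interchangeable pfn locations and a metadata entry could look like this:

replicaCatalog:
  replicas:
    - lfn: input.txt
      pfns:
        # two interchangeable copies of the same file on different sites
        - {site: local, pfn: 'file:///data/inputs/input.txt'}
        - {site: condorpool, pfn: 'http://mirror.example.com/inputs/input.txt'}
      metadata:
        creator: vahi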
The Transformation Catalog Section
The transformations section acts as an in-file transformation catalog (TC). Any transformations declared in this section take precedence over the external transformation catalog during planning.
transformationCatalog:
transformations:
- name: keg
sites:
- {name: condorpool, pfn: /usr/bin/pegasus-keg, type: installed}
profiles:
env: {APP_HOME: /tmp/myscratch, JAVA_HOME: /opt/java/1.6}
Logical filenames pertaining to a single executable in the transformation catalog use the transformations element. Any transformation entry features an optional namespace attribute, a mandatory name attribute, and an optional version attribute. The version attribute defaults to "1.0" when absent. An executable typically needs additional attributes to describe it properly, like the architecture, OS release and other flags typically seen with transformations. They are described in the sites entries under each transformations entry.
| attribute | optional? | type | meaning |
|---|---|---|---|
| name | required | string | logical transformation name. |
| namespace | optional | string | namespace of the logical transformation, defaults to null. |
| version | optional | VersionPattern | version of the logical transformation, defaults to "1.0". |
| sites | required | yaml array | details about where the transformation resides on 1 or more sites. |
| profiles | optional | yaml array | details about the profiles associated with the transformation. |
| hooks | optional | yaml array | details about the shell hooks to be invoked. |
| requires | optional | yaml array | details about dependent transformations required. |
The following attributes are supported for each entry in the sites array.

| attribute | optional? | type | meaning |
|---|---|---|---|
| name | required | string | the site on which the transformation resides. |
| type | required | string | whether the executable is installed or stageable. |
| pfn | required | string | the pfn of where the executable resides. |
| arch | optional | Architecture | restricted set of tokens such as x86, x86_64, etc. |
| os.type | optional | OSType | restricted set of tokens such as linux, macosx, etc. |
| os.release | optional | string | the OS release, such as deb, rhel, etc. |
| os.version | optional | VersionPattern | the OS version. |
| bypass | optional | boolean | whether to bypass staging of the executable. |
| profiles | optional | yaml | details about the profiles associated with the sites entry. |
| metadata | optional | yaml | details about the metadata associated with the sites entry. |
| container | optional | string | the name of the container in which the job should run. |
Similar to the replica catalog, each sites entry may have zero or more profile elements abstracting away site-specific details, zero or more metadata elements, and a required pfn entry. If there is no sites entry, the transformation must still be searched for in the external transformation catalog.
Each transformations entry may also feature hooks entry. These enable notifications at the appropriate point when every job that uses this executable reaches the point of notification. Please refer to the notification section for details and caveats.
The example above comes from the black diamond example workflow, and presents the kind and extent of attributes you are most likely to see and use in your own workflows.
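As a hedged sketch (the download URL, site name, and container name are hypothetical), a fuller transformations entry exercising the site-level attributes from the table above might look like this:

transformationCatalog:
  transformations:
    - namespace: example
      name: keg
      version: "1.0"
      sites:
        - name: condorpool
          type: stageable                 # staged to the site at run time
          pfn: http://download.example.com/bin/pegasus-keg
          arch: x86_64
          os.type: linux
          bypass: true                    # bypass staging through the staging site
          container: centos-pegasus       # refers to a container defined elsewhere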
The Compound Transformation Section
The compound transformation section declares a transformation that comprises multiple plain transformations. You can think of a compound transformation as a script interpreter and the script itself: in order to properly run the application, you must start both the script interpreter and the script passed to it. The compound transformation helps Pegasus to properly deal with this case, especially when it needs to stage executables.
transformationCatalog:
transformations:
- name: mDiffFit
namespace: montage
version: 2.0
requires: [mDiff, mFitplane]
A transformations entry declares a set of purely logical entities, executables and config (data) files, that are all required together for the same job. Being purely logical entities, the lookup happens only when the transformation is referenced (or instantiated) by a job element later on.
The namespace and version attributes of the transformations entry are optional, and provide the defaults for the transformations it requires. They are also essential for matching the transformation with a job.
The transformations entry can have a requires element listing zero or more required transformations. Each entry in requires is a string of the format Namespace::Name:Version, where Namespace:: and :Version may be omitted. The name is mandatory.
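A sketch of the fully qualified form (the version shown is illustrative):

transformationCatalog:
  transformations:
    - name: mDiffFit
      namespace: montage
      version: "2.0"
      # mDiff is resolved using the entry's own namespace and version defaults;
      # mFitplane is fully qualified as Namespace::Name:Version
      requires: [mDiff, "montage::mFitplane:2.0"]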
Graph Nodes
The nodes in the abstract workflow comprise regular job nodes, already instantiated sub-workflows as condorWorkflow nodes, and still to be instantiated abstract workflows as pegasusWorkflow nodes. Each of the graph nodes has a mandatory id attribute. The id attribute is currently a restriction of the NodeIdentifierPattern type, which restricts the id to letters, digits, hyphens and underscores.
The level attribute is deprecated, as the planner will trust its own re-computation more than user input. Please neither use nor produce any level attribute.
The node-label attribute is optional. It applies to the use-case when every transformation has the same name, but its arguments determine what it really does. In the presence of a node-label value, a workflow grapher could use the label value to show graph nodes to the user. It may also come in handy while debugging.
Any job-like graph node has the following set of children elements, as defined in the AbstractJobType declaration in the schema definition:
0 or 1 argument element to declare the command-line of the job’s invocation.
0 or more profile array to abstract away site-specific or job-specific details.
0 or 1 stdin element to link a logical file to the job’s standard input.
0 or 1 stdout element to link a logical file to the job’s standard output.
0 or 1 stderr element to link a logical file to the job’s standard error.
0 or more uses array to declare consumed data files and produced data files.
0 or more hooks array to solicit notifications when a job reaches a certain state in its life-cycle.
Job Nodes
A job element has a number of attributes. In addition to the id and node-label described in Graph Nodes above, the optional namespace, mandatory name and optional version identify the transformation, and provide the look-up handle: first among the workflow's included transformations, then in the in-file transformation catalog entries, and finally in an external transformation catalog. For comparison, the legacy DAX 3 XML representation of a job looked like this:
<!-- part 2: definition of all jobs (at least one) -->
<job id="ID000001" namespace="example" name="mDiffFit" version="1.0"
node-label="preprocess" >
<argument>-a top -T 6 -i <file name="f.a"/> -o <file name="f.b1"/></argument>
<!-- profiles are optional -->
<profile namespace="execution" key="site">isi_viz</profile>
<profile namespace="condor" key="getenv">true</profile>
<uses name="f.a" link="input" transfer="true" register="true">
<metadata key="size">1024</metadata>
</uses>
<uses name="f.b" link="output" register="false" transfer="true" type="data" />
<!-- 'WHEN' enumeration: never, start, on_error, on_success, at_end, all -->
<!-- PEGASUS_* env-vars: event, status, submit dir, wf/job id, stdout, stderr -->
<invoke when="start">/path/to arg arg</invoke>
<invoke when="on_success"><![CDATA[/path/to arg arg]]></invoke>
<invoke when="at_end"><![CDATA[/path/to arg arg]]></invoke>
</job>
The same job expressed in the current YAML format:

jobs:
- type: job
name: preprocess
id: ID0000001
arguments: [-a, preprocess, -T, "3", -i, f.a, -o, f.b1, f.b2]
uses:
- lfn: f.a
metadata:
creator: vahi
type: input
- lfn: f.b1
type: output
stageOut: true
registerReplica: true
- lfn: f.b2
type: output
stageOut: true
registerReplica: true
profiles:
env: {APP_HOME: /tmp/myscratch, JAVA_HOME: /opt/java/1.6}
metadata:
time: "60"
hooks:
shell:
- _on: start
cmd: /pegasus/libexec/notification/email -t notify@example.com
- _on: end
cmd: /pegasus/libexec/notification/email -t notify@example.com
The type attribute indicates whether the node is a
job: a compute job in the workflow,
pegasusWorkflow: an abstract workflow embedded as a node in the workflow, or
condorWorkflow: a Condor DAG workflow embedded as a node in the workflow.
The arguments array contains the complete command line needed to invoke the executable. The only variable components are logical filenames, referenced by their lfn.
The profiles array lets you encapsulate profiles for various namespaces.
The stdin, stdout and stderr elements permit you to connect a stdio file descriptor to a logical filename. Note that you will still have to declare these files in the uses section below.
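As a minimal sketch (the job name and file names are hypothetical), a job that reads its standard input from a logical file and captures its standard output into another, with both files also declared in uses:

jobs:
  - type: job
    name: wordcount
    id: ID0000009
    stdin: f.txt     # logical file connected to standard input
    stdout: f.count  # standard output captured into a logical file
    uses:
      - {lfn: f.txt, type: input}
      - {lfn: f.count, type: output, stageOut: true, registerReplica: false}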
The uses element enumerates all the files that the task consumes or produces. While it is neither necessary nor required to have all files appear on the command line, it is imperative that you declare even hidden files that your task requires in this section, so that the proper ancillary staging and clean-up tasks can be generated during planning.
The hooks entry may contain multiple shell hooks, as needed. The currently supported shell hooks have a mandatory _on attribute with the following value set:
| keyword | job life-cycle state | meaning |
|---|---|---|
| never | never | (default). Never notify of anything. This is useful to temporarily disable an existing notification. |
| start | submit | create a notification when the job is submitted. |
| on_error | end | after a job finishes with failure (exitcode != 0). |
| on_success | end | after a job finishes with success (exitcode == 0). |
| at_end | end | after a job finishes, regardless of exitcode. |
| all | always | like start and at_end combined. |
Warning
In clustered jobs, a notification can only be sent at the start or end of the clustered job, not for each member.
Each shell hook is a simple local invocation of an executable or script with the specified arguments. The executable invoked by the hook will see the following environment variables:
| variable | job life-cycle state | meaning |
|---|---|---|
| PEGASUS_EVENT | always | The value of the _on attribute that triggered this notification. |
| PEGASUS_STATUS | end | The exit status of the graph node. Only available for end notifications. |
| PEGASUS_SUBMIT_DIR | always | In which directory to find the job (or workflow). |
| PEGASUS_JOBID | always | The job (or workflow) identifier. This is potentially more than merely the value of the id attribute. |
| PEGASUS_STDOUT | always | The filename where stdout goes. Empty and possibly non-existent at submit time (though we still have the filename). For job nodes, this is the kickstart record. |
| PEGASUS_STDERR | always | The filename where stderr goes. Empty and possibly non-existent at submit time (though we still have the filename). |
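As a sketch (the script path is hypothetical), a hook that delegates to a custom script, which can then inspect these variables from its environment:

hooks:
  shell:
    - _on: at_end
      # my-notify.sh is a placeholder; it receives PEGASUS_EVENT, PEGASUS_STATUS,
      # PEGASUS_SUBMIT_DIR, PEGASUS_JOBID, PEGASUS_STDOUT and PEGASUS_STDERR
      cmd: /path/to/my-notify.sh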
Condor Workflow Nodes
A workflow that has already been concretized, either by an earlier run of Pegasus, or otherwise constructed for DAGMan execution, can be included into the current workflow as a node of type condorWorkflow.
jobs:
- type: condorWorkflow
file: black.dag
id: ID0000001
arguments: []
uses:
- {lfn: black.dag, type: input}
profiles:
dagman: {MAXJOBS: '10', dir: /dag-dir/test}
The id and node-label attributes were described previously. The file attribute refers to a file from the replica catalog that provides the actual DAGMan DAG as data content. The condorWorkflow job features optional profiles. These would most likely pertain to the dagman and env profile namespaces. It should be possible to specify hooks in the same manner as for jobs.
A graph node that is a condorWorkflow instead of a job simply uses a different submit file generator to create a DAGMan invocation. The arguments entry can modify the command line passed to DAGMan.
Pegasus Workflow Nodes
A workflow that is still to be planned incurs an invocation of the Pegasus planner as part of the outer workflow. This still abstract sub-workflow uses a node of type pegasusWorkflow.
jobs:
- type: pegasusWorkflow
file: blackdiamond.yml
id: ID0000001
arguments: [--input-dir, input, --output-sites, local, -vvv, --force]
uses:
- {lfn: blackdiamond.yml, type: input}
- {lfn: f.d, type: output, stageOut: true, registerReplica: true}
profiles:
dagman: {MAXJOBS: '10'}
The id and node-label attributes were described in Graph Nodes. The file attribute refers to a file that provides the to-be-planned workflow as external file data content. The pegasusWorkflow job features optional profiles, most likely pertaining to the pegasus, dagman and env profile namespaces. Hooks may be specified in the same manner as for jobs.
A graph node of type pegasusWorkflow just uses yet another submit file and pre-script generator to create a DAGMan invocation. The arguments entry pertains to the command line of the pegasus-plan invocation to be generated.
The Dependency Section
This section describes the dependencies between the jobs.
jobDependencies:
- id: ID0000001
children:
- ID0000002
- ID0000003
Under jobDependencies you can list an array of job id elements. For each id you can specify a children sub-array that lists the ids of its dependent jobs.
Abstract Workflow YAML Schema Example
The following code example shows the YAML instance document representing the diamond workflow.
x-pegasus:
apiLang: python
createdBy: ryantanaka
createdOn: 07-24-20T10:08:48Z
pegasus: "5.0"
name: diamond
hooks:
shell:
- _on: start
cmd: /pegasus/libexec/notification/email -t notify@example.com
- _on: end
cmd: /pegasus/libexec/notification/email -t notify@example.com
jobs:
- type: job
name: preprocess
id: ID0000001
arguments: [-a, preprocess, -T, "3", -i, f.a, -o, f.b1, f.b2]
uses:
- lfn: f.a
metadata:
creator: ryan
type: input
- lfn: f.b1
type: output
stageOut: true
registerReplica: true
- lfn: f.b2
type: output
stageOut: true
registerReplica: true
metadata:
time: "60"
hooks:
shell:
- _on: start
cmd: /pegasus/libexec/notification/email -t notify@example.com
- _on: end
cmd: /pegasus/libexec/notification/email -t notify@example.com
- type: job
name: findrange
id: ID0000002
arguments: [-a, findrange, -T, "3", -i, f.b1, -o, f.c1]
uses:
- lfn: f.b1
type: input
- lfn: f.c1
type: output
stageOut: true
registerReplica: true
metadata:
time: "60"
hooks:
shell:
- _on: start
cmd: /pegasus/libexec/notification/email -t notify@example.com
- _on: end
cmd: /pegasus/libexec/notification/email -t notify@example.com
- type: job
name: findrange
id: ID0000003
arguments: [-a, findrange, -T, "3", -i, f.b2, -o, f.c2]
uses:
- lfn: f.c2
type: output
stageOut: true
registerReplica: true
- lfn: f.b2
type: input
metadata:
time: "60"
hooks:
shell:
- _on: start
cmd: /pegasus/libexec/notification/email -t notify@example.com
- _on: end
cmd: /pegasus/libexec/notification/email -t notify@example.com
- type: job
name: analyze
id: ID0000004
arguments: [-a, analyze, -T, "3", -i, f.c1, f.c2, -o, f.d]
uses:
- lfn: f.d
metadata:
final_output: "true"
type: output
stageOut: true
registerReplica: true
- lfn: f.c2
type: input
- lfn: f.c1
type: input
metadata:
time: "60"
hooks:
shell:
- _on: start
cmd: /pegasus/libexec/notification/email -t notify@example.com
- _on: end
cmd: /pegasus/libexec/notification/email -t notify@example.com
jobDependencies:
- id: ID0000001
children:
- ID0000002
- ID0000003
- id: ID0000002
children:
- ID0000004
- id: ID0000003
children:
- ID0000004
The above workflow defines the black diamond from the abstract workflow section of the Introduction chapter. It will require minimal configuration, because the catalog sections include all necessary declarations.
The uses entries declare the required input file “f.a” in terms of the local machine. Please note that if you plan the workflow for a remote site, there has to be some way for the file to be staged from the local site to the remote site. While Pegasus will augment the workflow with such ancillary jobs, the site catalog as well as the local and remote sites have to be set up properly. For a locally run workflow you don’t need to do anything.
The jobs array defines the workflow’s logical constituents: the way to invoke the keg command, where to put filenames on the command line, and what files are consumed or produced. In addition to the direction of files, further attributes determine whether to register the file with a replica catalog and whether to transfer it to the output site in case of a product. We are only interested in the final data product “f.d” in this workflow, and not in any intermediary files. Typically, you would also want to register the data products in the replica catalog, especially in larger scenarios. The jobDependencies entries define the control flow between the jobs.