.. _schemas:
=======
Schemas
=======
DAX XML Schema
==============
The DAX format is described by the XML schema instance document
:download:`dax-3.6.xsd <../../schemas/dax-3.6/dax-3.6.xsd>`. A local copy of the
schema definition is provided in the “etc” directory. The documentation
of the XML schema and its elements can be found in
:download:`dax-3.6.html <../../schemas/dax-3.6/dax-3.6.html>` as well as locally in
``doc/schemas/dax-3.6/dax-3.6.html`` in your Pegasus distribution.
DAX XML Schema In Detail
------------------------
The DAX file format has five major sections, with the third section
(the catalogs) divided into further sub-sections. The DAX format works on
the abstract or logical level, letting you focus on the shape of the
workflow: what to do and what to work upon.
1. Workflow-level Metadata
Metadata that is associated with the whole workflow. These are
defined in the Metadata section.
2. Workflow-level Notifications
Very simple workflow-level notifications. These are defined in the
`Notification <#notifications>`__ section.
3. Catalogs
This section deals with included catalogs. While we recommend
using external replica and transformation catalogs, it is possible
to include some replicas and transformations in the DAX file
itself. Any DAX-included entry takes precedence over regular replica
catalog (RC) and transformation catalog (TC) entries.
This section (and any of its sub-sections) is completely
optional.
1. The first sub-section deals with included replica descriptions.
2. The second sub-section deals with included transformation
descriptions.
3. The third sub-section declares multi-item executables.
4. Job List
The jobs section defines the job or task descriptions. For each task
to conduct, a three-part logical name declares the task and aids in
identifying it in the transformation catalog or one of the
*executable* sections above. During planning, the logical name is
translated into the physical executable location on the chosen target
site. By declaring jobs abstractly, physical layout considerations of
the target sites do not matter. The job's *id* uniquely identifies
the job within this workflow.
The arguments declare what command-line arguments to pass to the job.
If you are passing filenames, you should refer to the logical
filename using the *file* element in the argument list.
Important for properly planning the task is the list of files
consumed by the task, its input files, and the files produced by the
task, its output files. Each file is described with a *uses* element
inside the task.
Elements exist to link a logical file to any of the stdio file
descriptors. The *profile* element is Pegasus's way to abstract
site-specific data.
Jobs are nodes in the workflow graph. Other nodes include unplanned
workflows (DAX), which are planned and then run when the node runs,
and planned workflows (DAG), which are simply executed.
5. Control-flow Dependencies
This section lists the dependencies between the tasks. The
relationships are defined as child-parent relationships, and thus
determine the order in which tasks are run. No cyclic dependencies are
permitted.
Dependencies are directed edges in the workflow graph.
XML Intro
~~~~~~~~~
If you have seen the DAX schema before, you will not find many new items
in the root element. *However*, we did retire the (old) attributes ending in
*Count*.
::
The following attributes are supported for the root element *adag*.
.. table:: Root element attributes

   ========== ========= ================== ======================================================
   attribute  optional? type               meaning
   ========== ========= ================== ======================================================
   version    required  *VersionPattern*   Version number of DAX instance document. Must be 3.6.
   name       required  string             name of this DAX (or set of DAXes).
   count      optional  positiveInteger    size of list of DAXes with this *name*. Defaults to 1.
   index      optional  nonNegativeInteger current index of DAX with same *name*. Defaults to 0.
   fileCount  removed   nonNegativeInteger Old 2.1 attribute, removed, do not use.
   jobCount   removed   positiveInteger    Old 2.1 attribute, removed, do not use.
   childCount removed   nonNegativeInteger Old 2.1 attribute, removed, do not use.
   ========== ========= ================== ======================================================

The *version* attribute is restricted to the regular expression
``\d+(\.\d+(\.\d+)?)?``. This expression represents the *VersionPattern*
type that is used in other places, too. It is a more restrictive
expression than before, but allows us to compute a comparable version
number using the following formula:

=================================== ===================================
version1: a.b.c                     version2: d.e.f
n = a \* 1,000,000 + b \* 1,000 + c m = d \* 1,000,000 + e \* 1,000 + f
version1 > version2                 if n > m
=================================== ===================================

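
The formula above can be sketched in Python as follows. This is only an
illustrative sketch: the function name ``comparable_version`` is
hypothetical, missing minor and patch components are assumed to count as
0, and each component is assumed to stay below 1,000 so that the weights
do not overlap.

.. code-block:: python

   import re

   # Regular expression quoted above for the VersionPattern type.
   VERSION_PATTERN = re.compile(r"\d+(\.\d+(\.\d+)?)?")


   def comparable_version(version: str) -> int:
       """Map "a.b.c" to a * 1,000,000 + b * 1,000 + c."""
       if VERSION_PATTERN.fullmatch(version) is None:
           raise ValueError(f"not a valid version string: {version!r}")
       parts = [int(part) for part in version.split(".")]
       # Assumption: missing components ("3" or "3.6") count as 0.
       major, minor, patch = parts + [0] * (3 - len(parts))
       return major * 1_000_000 + minor * 1_000 + patch

Two version strings can then be compared through their integer values,
for example ``comparable_version("3.6") > comparable_version("2.1")``.
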
Workflow-level Metadata
~~~~~~~~~~~~~~~~~~~~~~~
Metadata associated with the whole workflow.
::
diamond
Karan Vahi
The workflow-level metadata may be used to control the Pegasus Mapper
behaviour at planning time, or may be propagated to external services
while querying for job characteristics.
Workflow-level Notifications
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Notifications that are generated when workflow-level events happen.
::
/bin/date -Ins >> my.log
The above snippet appends the current time to a log file in the
current directory, that is, the current directory of the pegasus-monitord
instance acting on the `notification <#notifications>`__.
The Catalogs Section
~~~~~~~~~~~~~~~~~~~~
The initial section features three sub-sections:
1. a catalog of files used,
2. a catalog of transformations used, and
3. compound transformation declarations.
.. _dax-replica-catalog:
The Replica Catalog Section
^^^^^^^^^^^^^^^^^^^^^^^^^^^
The file section acts as an in-file replica catalog (RC). Any files
declared in this section take precedence over files in external replica
catalogs during planning.
::
/* integer to be defined */
/* 32 char hex string */
/* ISO-8601 timestamp */
/* ISO-8601 *or* 20100417134523:int */
ocean
voeckler
The first *file* entry above is an example of a data file with two
replicas. The *file* element requires a logical file *name*. Each
logical filename may have additional information associated with it,
enumerated by *profile* elements. Each file entry may have 0 or more
*metadata* associated with it. Each piece of metadata has a *key* string
and *type* attribute describing the element's value.
**Warning**
The *metadata* element is not supported as of this writing! Details may
change in the future.
The *file* element can provide 0 or more *pfn* locations, taking
precedence over the replica catalog. A *file* element that does not name
any *pfn* child elements will still require look-ups in external
replica catalogs. Each *pfn* element names a concrete location of a
file. Multiple locations constitute replicas of the same file, and are
assumed to be usable interchangeably. The *url* attribute is mandatory,
and typically would use a file scheme URL. The *site* attribute is
optional, and defaults to the value *local* if missing. A *pfn* element may
have *profile* child elements, which refer to attributes of the
physical file. The file-level profiles refer to attributes of the
logical file.
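
The *site* default described above can be illustrated with a short Python
sketch. The logical file name, the URLs and the site name
``remote_site`` are made up for illustration, the XML namespace
declaration of a real DAX document is left out, and the helper
``replicas`` is hypothetical.

.. code-block:: python

   import xml.etree.ElementTree as ET

   # Hypothetical file entry with two replicas; the first pfn has no site.
   SNIPPET = """
   <file name="example.dat">
     <pfn url="file:///tmp/example.dat"/>
     <pfn url="file:///data/example.dat" site="remote_site"/>
   </file>
   """


   def replicas(file_element):
       """Return (url, site) pairs; a missing site attribute means "local"."""
       return [(pfn.get("url"), pfn.get("site", "local"))
               for pfn in file_element.findall("pfn")]


   file_element = ET.fromstring(SNIPPET)
   print(file_element.get("name"), replicas(file_element))

A *file* element without any *pfn* children yields an empty list here,
which corresponds to the case where an external replica catalog has to be
consulted.
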
.. note::
The ``stat`` profile namespace is only an example, and details about
stat are not yet implemented. The proper namespaces ``pegasus``,
``condor``, ``dagman``, ``env``, ``hints``, ``globus`` and
``selector`` enjoy full support.
The second *file* entry above shows a usage example from the
black-diamond example workflow that you are more likely to encounter or
write.
The presence of an in-file replica catalog lets you declare a couple of
interesting advanced features. The DAG and DAX file declarations are
just files for all practical purposes. For deferred planning, the
location of the site catalog (SC) can be captured in a file, too, that
is passed to the job dealing with the deferred planning as logical
filename.
::
.. _dax-transformation-catalog:
The Transformation Catalog Section
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The executable section acts as an in-file transformation catalog (TC).
Any transformations declared in this section take precedence over the
external transformation catalog during planning.
::
5000
AB454DSSDA4646DS
2010-11-22T10:05:55.470606000-0800
/* see above */
ocean
0a9c38b919c7809cb645fc09011588a6
/path/to/my_send_email some args
Logical filenames pertaining to a single executable in the
transformation catalog use the *executable* element. Any *executable*
element features the optional *namespace* attribute, a mandatory *name*
attribute, and an optional *version* attribute. The *version* attribute
defaults to "1.0" when absent. An executable typically needs additional
attributes to describe it properly, like the architecture, OS release
and other flags typically seen with transformations, or found in the
transformation catalog.
.. table:: executable element attributes

   ========= ========= ============== =============================================================
   attribute optional? type           meaning
   ========= ========= ============== =============================================================
   name      required  string         logical transformation name
   namespace optional  string         namespace of logical transformation, defaults to *null* value.
   version   optional  VersionPattern version of logical transformation, defaults to "1.0".
   installed optional  boolean        whether to stage the file (false), or not (true, default).
   arch      optional  Architecture   restricted set of tokens, see schema definition file.
   os        optional  OSType         restricted set of tokens, see schema definition file.
   osversion optional  VersionPattern kernel version, as the beginning of ``uname -r`` output.
   glibc     optional  VersionPattern version of libc.
   ========= ========= ============== =============================================================

The rationale for giving these flags in the *executable* element header
is that PFNs are just identical replicas or instances of a given LFN. If
you need a different word size (32-bit or 64-bit) or OS release, the
underlying PFN would be different, and thus the LFN for it should be
different, too.
.. note::
We are still discussing some details and implications of this
decision.
The initial examples come with the same caveats as for the included
replica catalog.
**Warning**
The *metadata* element is not supported as of this writing! Details may
change in the future.
Similar to the replica catalog, each *executable* element may have 0 or
more *profile* elements abstracting away site-specific details, zero or
more *metadata* elements, and zero or more *pfn* elements. If there are
no *pfn* elements, the transformation must still be searched for in the
external transformation catalog. As before, the *pfn* element may have
*profile* child elements, referring to attributes of the physical
filename itself.
Each *executable* element may also feature *invoke* elements. These
enable notifications at the appropriate point when every job that uses
this executable reaches the point of notification. Please refer to the
`notification section <#notifications>`__ for details and caveats.
The last example above comes from the black diamond example workflow,
and presents the kind and extent of attributes you are most likely to
see and use in your own workflows.
The Compound Transformation Section
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The compound transformation section declares a transformation that
comprises multiple plain transformations. You can think of a compound
transformation like a script interpreter and the script itself. In order
to properly run the application, you must start both, the script
interpreter and the script passed to it. The compound transformation
helps Pegasus to properly deal with this case, especially when it needs
to stage executables.
::
A *transformation* element declares a set of purely logical entities,
executables and config (data) files, that are all required together for
the same job. Being purely logical entities, the lookup happens only
when the transformation element is referenced (or instantiated) by a job
element later on.
The *namespace* and *version* attributes of the transformation element
are optional, and provide the defaults for the inner uses elements. They
are also essential for matching the transformation with a job.
The *transformation* is made up of 1 or more *uses* elements. Each *uses*
has a boolean attribute *executable*, ``true`` by default, or ``false``
to indicate a data file. The *name* is a mandatory attribute, referring
to an LFN declared previously in the File Catalog (*executable* is
``false``), Executable Catalog (*executable* is ``true``), or to be
looked up as necessary at instantiation time. The lookup catalog is
determined by the *executable* attribute.
After the *uses* elements, any number of *invoke* elements may occur to
add a `notification <#notifications>`__ each time this transformation is
instantiated.
The *namespace* and *version* attributes' default values inside *uses*
elements are inherited from the *transformation* attributes of the same
name. There is no such inheritance for *uses* elements with *executable*
attribute of ``false``.
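
The inheritance rule above can be sketched in Python as follows. All
names in the snippet are made up for illustration, the XML namespace
declaration of a real DAX document is left out, and the helper
``resolve_uses`` is hypothetical; it only mirrors the rule as described
in this section.

.. code-block:: python

   import xml.etree.ElementTree as ET

   # Hypothetical compound transformation: interpreter, script, config file.
   SNIPPET = """
   <transformation namespace="example" name="wrapper" version="1.0">
     <uses name="interpreter"/>
     <uses name="script" version="2.0"/>
     <uses name="settings.conf" executable="false"/>
   </transformation>
   """


   def resolve_uses(transformation):
       """Return (namespace, name, version, is_executable) per uses element."""
       resolved = []
       for uses in transformation.findall("uses"):
           # The executable attribute is true unless stated otherwise.
           is_executable = uses.get("executable", "true") in ("true", "1")
           namespace = uses.get("namespace")
           version = uses.get("version")
           if is_executable:
               # Only executables inherit from the enclosing transformation.
               if namespace is None:
                   namespace = transformation.get("namespace")
               if version is None:
                   version = transformation.get("version")
           resolved.append(
               (namespace, uses.get("name"), version, is_executable))
       return resolved

In this sketch the data file ``settings.conf`` keeps an empty namespace
and version, while both executables pick up whatever they do not declare
themselves.
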
.. _api-graph-nodes:
Graph Nodes
~~~~~~~~~~~
The nodes in the DAX comprise regular job nodes, already instantiated
sub-workflows as dag nodes, and still to be instantiated dax nodes. Each
of the graph nodes has a mandatory *id* attribute. The *id*
attribute is currently of type *NodeIdentifierPattern*,
which is a restriction of the ``xs:NMTOKEN`` type to letters,
digits, hyphen and underscore.
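
A generator could pre-check identifiers before writing them out. The
pattern below is an assumed approximation of *NodeIdentifierPattern*
based on the description above (ASCII letters, digits, hyphen and
underscore); the schema definition file holds the authoritative pattern,
and the function name ``is_valid_node_id`` is hypothetical.

.. code-block:: python

   import re

   # Assumption: one or more ASCII letters, digits, hyphens or underscores.
   NODE_ID = re.compile(r"[A-Za-z0-9_-]+")


   def is_valid_node_id(node_id: str) -> bool:
       """Check a candidate id attribute against the assumed pattern."""
       return NODE_ID.fullmatch(node_id) is not None
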
The *level* attribute is deprecated, as the planner will trust its own
re-computation more than user input. Please do not use nor produce any
*level* attribute.
The *node-label* attribute is optional. It applies to the use-case when
every transformation has the same name, but its arguments determine what
it really does. In the presence of a *node-label* value, a workflow
grapher could use the label value to show graph nodes to the user. It
may also come in handy while debugging.
Any job-like graph node has the following set of children elements, as
defined in the *AbstractJobType* declaration in the schema definition:
- 0 or 1 *argument* element to declare the command-line of the job's
invocation.
- 0 or more *profile* elements to abstract away site-specific or
job-specific details.
- 0 or 1 *stdin* element to link a logical file to the job's standard
input.
- 0 or 1 *stdout* element to link a logical file to the job's standard
output.
- 0 or 1 *stderr* element to link a logical file to the job's standard
error.
- 0 or more *uses* elements to declare consumed data files and produced
data files.
- 0 or more *invoke* elements to solicit
`notifications <#notifications>`__ when a job reaches a certain
state in its life-cycle.
.. _api-job-nodes:
Job Nodes
^^^^^^^^^
A job element has a number of attributes. In addition to the *id* and
*node-label* described in `Graph Nodes <#api-graph-nodes>`__ above, the
optional *namespace*, mandatory *name* and optional *version* identify the
transformation, and provide the look-up handle: first in the DAX's
*transformation* elements, then in the *executable* elements, and finally
in an external transformation catalog.
::
-a top -T 6 -i -o
isi_viz
true
1024
/path/to arg arg
The *argument* element contains the complete command-line that is needed
to invoke the executable. The only variable components are logical
filenames, as included *file* elements.
The *profile* element lets you encapsulate site-specific knowledge.
The *stdin*, *stdout* and *stderr* elements permit you to connect a
stdio file descriptor to a logical filename. Note that you will still
have to declare these files in the *uses* section below.
The *uses* element enumerates all the files that the task consumes or
produces. While it is not required to have all files
appear on the command-line, it is imperative that you declare even
hidden files that your task requires in this section, so that the proper
ancillary staging and clean-up tasks can be generated during planning.
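
As a rough consistency check, a generator could verify that every logical
file referenced in the *argument* element is also declared through a
*uses* element. The sketch below is an assumption-laden illustration: the
job identifier and file names are made up, attributes of *uses* other
than *name* are left out, the XML namespace declaration is omitted, and
the helper ``undeclared_files`` is hypothetical.

.. code-block:: python

   import xml.etree.ElementTree as ET

   # Hypothetical job that forgets to declare its output file.
   SNIPPET = """
   <job id="ID000001" name="example">
     <argument>-i <file name="in.dat"/> -o <file name="out.dat"/></argument>
     <uses name="in.dat"/>
   </job>
   """


   def undeclared_files(job):
       """Return logical file names used in argument but missing from uses."""
       declared = {uses.get("name") for uses in job.findall("uses")}
       referenced = [f.get("name") for f in job.findall("argument/file")]
       return [name for name in referenced if name not in declared]

The reverse check is deliberately not made, since files may be declared
in *uses* without ever appearing on the command-line.
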
The *invoke* element may be specified multiple times, as needed. It has
a mandatory *when* attribute with the following value set:
.. table:: invoke element attributes

   ========== ==================== =====================================================================================================
   keyword    job life-cycle state meaning
   ========== ==================== =====================================================================================================
   never      never                (default). Never notify of anything. This is useful to temporarily disable an existing notification.
   start      submit               create a notification when the job is submitted.
   on_error   end                  after a job finishes with failure (exitcode != 0).
   on_success end                  after a job finishes with success (exitcode == 0).
   at_end     end                  after a job finishes, regardless of exitcode.
   all        always               like start and at_end combined.
   ========== ==================== =====================================================================================================

..
**Warning**
In clustered jobs, a notification can only be sent at the start or
end of the clustered job, not for each member.
Each *invoke* is a simple local invocation of an executable or script
with the specified arguments. The executable inside the invoke body will
see the following environment variables:
.. table:: invoke/executable environment variables

   ================== ==================== =========================================================================================================================================================
   variable           job life-cycle state meaning
   ================== ==================== =========================================================================================================================================================
   PEGASUS_EVENT      always               The value of the ``when`` attribute
   PEGASUS_STATUS     end                  The exit status of the graph node. Only available for end notifications.
   PEGASUS_SUBMIT_DIR always               In which directory to find the job (or workflow).
   PEGASUS_JOBID      always               The job (or workflow) identifier. This is potentially more than merely the value of the *id* attribute.
   PEGASUS_STDOUT     always               The filename where *stdout* goes. Empty and possibly non-existent at submit time (though we still have the filename). The kickstart record for job nodes.
   PEGASUS_STDERR     always               The filename where *stderr* goes. Empty and possibly non-existent at submit time (though we still have the filename).
   ================== ==================== =========================================================================================================================================================

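
A notification script could read the variables from the table above as
sketched below. This is a hypothetical handler, not something shipped
with Pegasus: the function name ``describe_event`` and the output format
are made up, and only the variable names are taken from the table.

.. code-block:: python

   import os


   def describe_event(env=os.environ):
       """Build a one-line summary from the variables listed above."""
       event = env.get("PEGASUS_EVENT", "unknown")
       job_id = env.get("PEGASUS_JOBID", "unknown")
       # PEGASUS_STATUS is only set for end notifications.
       status = env.get("PEGASUS_STATUS")
       line = f"{job_id}: {event}"
       if status is not None:
           line += f" (exit status {status})"
       return line


   if __name__ == "__main__":
       print(describe_event())

Passing the environment as a parameter keeps the function easy to try
out with a plain dictionary instead of a real notification.
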
Generators should wrap the contents of the invoke element in a CDATA
section to minimize interference. Unfortunately, CDATA sections cannot be
nested, so if the user invocation itself contains a CDATA section, we
suggest using carefully XML-entity-escaped strings instead. The `notifications
section <#notifications>`__ describes these in further detail.
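
One way a generator might choose between the two encodings is sketched
below. The helper ``invoke_body`` is hypothetical; it assumes that the
only problematic input is a command containing the CDATA terminator
``]]>``, and falls back to entity escaping with the standard library in
that case.

.. code-block:: python

   from xml.sax.saxutils import escape


   def invoke_body(command: str) -> str:
       """Encode a command for use as the text of an invoke element."""
       if "]]>" in command:
           # "]]>" would end a CDATA section early, so escape entities.
           return escape(command)
       return f"<![CDATA[{command}]]>"

Either result can be placed between the opening and closing *invoke*
tags, and an XML parser reading the document recovers the original
command string.
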
DAG Nodes
^^^^^^^^^
A workflow that has already been concretized, either by an earlier run
of Pegasus, or otherwise constructed for DAGMan execution, can be
included into the current workflow using the *dag* element.
::
/dag-dir/test
The *id* and *node-label* attributes were described
`previously <#api-graph-nodes>`__. The *name* attribute refers to a file
from the File Catalog that provides the actual DAGMan DAG as data
content. The *dag* element features optional *profile* elements. These
would most likely pertain to the ``dagman`` and ``env`` profile
namespaces. It should be possible to have the optional *notify* element
in the same manner as for jobs.
A graph node that is a dag instead of a job would just use a different
submit file generator to create a DAGMan invocation. There can be an
*argument* element to modify the command-line passed to DAGMan.
DAX Nodes
^^^^^^^^^
A still to be planned workflow incurs an invocation of the Pegasus
planner as part of the workflow. This still abstract sub-workflow uses
the *dax* element.
::
bar
-Xmx1024 -Xms512 -Dpegasus.dir.storage=storagedir -Dpegasus.dir.exec=execdir -o local --dir ./datafind -vvvvv --force -s dax_site
The *id* and *node-label* attributes are described in `Graph
Nodes <#api-graph-nodes>`__. The *name* attribute refers to a file from
the File Catalog that provides the to-be-planned DAX as external file
data content. The *dax* element features optional *profile* elements.
These would most likely pertain to the ``pegasus``, ``dagman`` and
``env`` profile namespaces. It may be possible to have the optional
*notify* element in the same manner as for jobs.
A graph node that is a *dax* instead of a job would just use yet another
submit file and pre-script generator to create a DAGMan invocation. The
*argument* string pertains to the command line of the to-be-generated
DAGMan invocation.
Inner ADAG Nodes
^^^^^^^^^^^^^^^^
While completeness would argue to have a recursive nesting of *adag*
elements, such recursive nestings are currently not supported, not even
in the schema. If you need to nest workflows, please use the *dax* or
*dag* element to achieve the same goal.
The Dependency Section
~~~~~~~~~~~~~~~~~~~~~~
This section describes the dependencies between the jobs.
::
Each *child* element contains one or more *parent* elements. Either
element refers to a *job*, *dag* or *dax* element id attribute using the
*ref* attribute. In this version, we relaxed the ``xs:IDREF`` constraint
in favor of a restriction on the ``xs:NMTOKEN`` type to permit a larger
set of identifiers.
The *parent* element has an optional *edge-label* attribute.
**Warning**
The *edge-label* attribute is currently unused.
Its goal is to annotate edges when drawing workflow graphs.
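
Because cyclic dependencies are not permitted, a generator may want to
check the dependency section before handing the DAX to the planner. The
sketch below is an illustration only: the node identifiers are made up,
the XML namespace declaration is left out, the helper ``run_order`` is
hypothetical, and ``graphlib`` requires Python 3.9 or later.

.. code-block:: python

   import xml.etree.ElementTree as ET
   from graphlib import CycleError, TopologicalSorter

   # Hypothetical dependency section of a diamond-shaped workflow.
   SNIPPET = """
   <adag>
     <child ref="B"><parent ref="A"/></child>
     <child ref="C"><parent ref="A"/></child>
     <child ref="D"><parent ref="B"/><parent ref="C"/></child>
   </adag>
   """


   def run_order(adag):
       """Return node ids with parents first; raise ValueError on a cycle."""
       parents = {}
       for child in adag.findall("child"):
           refs = parents.setdefault(child.get("ref"), set())
           refs.update(p.get("ref") for p in child.findall("parent"))
       try:
           return list(TopologicalSorter(parents).static_order())
       except CycleError as error:
           raise ValueError(f"cyclic dependency: {error.args[1]}") from error

For the snippet above, any order returned starts with ``A`` and ends with
``D``; the relative order of ``B`` and ``C`` is not fixed.
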
Closing
~~~~~~~
Like any XML element, the root element needs to be closed.

::

   </adag>

DAX XML Schema Example
----------------------
The following code example shows the XML instance document representing
the diamond workflow.
::
/bin/mailx -s 'diamond failed' use@some.domain
2
3
2
3
2
3
-a preprocess -T60 -i -o
-a findrange -T60 -i -o
-a findrange -T60 -i -o
-a analyze -T60 -i -o
The above workflow defines the black diamond from the abstract workflow
section of the `Introduction <#about>`__ chapter. It will require
minimal configuration, because the catalog sections include all
necessary declarations.
Please note the following:
- The **file** element declares the required input file "f.a" in terms
of the local machine. Please note that if you plan the workflow for a
remote site, there has to be some way for the file to be staged from
the local site to the remote site. While Pegasus will augment the
workflow with such ancillary jobs, the site catalog as well as the local
and remote sites have to be set up properly. For a locally run
workflow you don't need to do anything.
- The **executable** elements declare the same executable keg that is
to be run for each logical transformation in terms of the remote
site *futuregrid*. To declare it for a local site, you would have to
adjust the *site* attribute's value to ``local``. This section also
shows that the same executable may come in different guises as
transformations.
- The **job** elements define the workflow's logical constituents, the
way to invoke the ``keg`` command, where to put filenames on the
command line, and what files are consumed or produced. In addition to
the direction of files, further attributes determine whether to
register the file with a replica catalog and whether to transfer it
to the output site in case of a product. We are only interested in
the final data product "f.d" in this workflow, and not any
intermediary files. Typically, you would also want to register the
data products in the replica catalog, especially in larger scenarios.
- The **child** elements define the control flow between the jobs.