.. _schemas:

=======
Schemas
=======

DAX XML Schema
==============

The DAX format is described by the XML schema instance document
:download:`dax-3.6.xsd <../../schemas/dax-3.6/dax-3.6.xsd>`. A local copy
of the schema definition is provided in the “etc” directory. The
documentation of the XML schema and its elements can be found in
:download:`dax-3.6.html <../../schemas/dax-3.6/dax-3.6.html>` as well as
locally in ``doc/schemas/dax-3.6/dax-3.6.html`` in your Pegasus
distribution.

DAX XML Schema In Detail
------------------------

The DAX file format has five major sections, with the third section (the
catalogs) divided into further sub-sections. The DAX format works on the
abstract or logical level, letting you focus on the shape of the
workflows, what to do and what to work upon.

1. Workflow-level Metadata

   Metadata that is associated with the whole workflow. These are defined
   in the Metadata section.

2. Workflow-level Notifications

   Very simple workflow-level notifications. These are defined in the
   `Notification <#notifications>`__ section.

3. Catalogs

   This section deals with included catalogs. While we recommend using
   external replica and transformation catalogs, it is possible to include
   some replicas and transformations in the DAX file itself. Any
   DAX-included entry takes precedence over regular replica catalog (RC)
   and transformation catalog (TC) entries.

   This section (and any of its sub-sections) is completely optional.

   1. The first sub-section deals with included replica descriptions.

   2. The second sub-section deals with included transformation
      descriptions.

   3. The third sub-section declares multi-item executables.

4. Job List

   The jobs section defines the job or task descriptions. For each task to
   conduct, a three-part logical name declares the task and aids in
   identifying it in the transformation catalog or in one of the
   *executable* sections above. During planning, the logical name is
   translated into the physical executable location on the chosen target
   site. Because jobs are declared abstractly, the physical layout of the
   target sites does not matter. The job's *id* uniquely identifies the
   job within this workflow.

   The arguments declare what command-line arguments to pass to the job.
   If you are passing filenames, you should refer to the logical filename
   using the *file* element in the argument list.

   To plan the task properly, the planner needs the list of files the task
   consumes (its input files) and the list of files it produces (its
   output files). Each file is described with a *uses* element inside the
   task.

   Elements exist to link a logical file to any of the stdio file
   descriptors. The *profile* element is Pegasus's way to abstract
   site-specific data.

   Jobs are nodes in the workflow graph. Other nodes include unplanned
   workflows (DAX), which are planned and then run when the node runs, and
   planned workflows (DAG), which are simply executed.

5. Control-flow Dependencies

   This section lists the dependencies between the tasks. The
   relationships are defined as child-parent relationships, and thus
   affect the order in which tasks are run. No cyclic dependencies are
   permitted.

   Dependencies are directed edges in the workflow graph.

XML Intro
~~~~~~~~~

If you have seen the DAX schema before, there are not a lot of new items
in the root element. *However*, we did retire the (old) attributes ending
in *Count*.

::

The following attributes are supported for the root element *adag*.

.. table:: Root element attributes

   ========== ========= ================== ======================================================
   attribute  optional? type               meaning
   ========== ========= ================== ======================================================
   version    required  *VersionPattern*   Version number of DAX instance document. Must be 3.6.
   name       required  string             name of this DAX (or set of DAXes).
   count      optional  positiveInteger    size of list of DAXes with this *name*. Defaults to 1.
   index      optional  nonNegativeInteger current index of DAX with same *name*. Defaults to 0.
   fileCount  removed   nonNegativeInteger Old 2.1 attribute, removed, do not use.
   jobCount   removed   positiveInteger    Old 2.1 attribute, removed, do not use.
   childCount removed   nonNegativeInteger Old 2.1 attribute, removed, do not use.
   ========== ========= ================== ======================================================

The *version* attribute is restricted to the regular expression
``\d+(\.\d+(\.\d+)?)?``. This expression represents the *VersionPattern*
type that is used in other places, too. It is a more restrictive
expression than before, but allows us to compute comparable version
numbers using the following formula:

=================================== ===================================
version1: a.b.c                     version2: d.e.f
n = a \* 1,000,000 + b \* 1,000 + c m = d \* 1,000,000 + e \* 1,000 + f
version1 > version2                 if n > m
=================================== ===================================

Workflow-level Metadata
~~~~~~~~~~~~~~~~~~~~~~~

Metadata associated with the whole workflow.

::

   diamond Karan Vahi

The workflow-level metadata may be used to control the Pegasus Mapper
behaviour at planning time, or may be propagated to external services
while querying for job characteristics.

Workflow-level Notifications
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Notifications that are generated when workflow-level events happen.

::

   /bin/date -Ins >> my.log

The above snippet will append the current time to a log file in the
current directory. Here, "current directory" is relative to the
pegasus-monitord instance acting on the
`notification <#notifications>`__.

The Catalogs Section
~~~~~~~~~~~~~~~~~~~~

The catalogs section features three sub-sections:

1. a catalog of files used,

2. a catalog of transformations used, and

3. compound transformation declarations.

.. _dax-replica-catalog:

The Replica Catalog Section
^^^^^^^^^^^^^^^^^^^^^^^^^^^

The file section acts as an in-file replica catalog (RC).
Any files declared in this section take precedence over files in external
replica catalogs during planning.

::

   /* integer to be defined */ /* 32 char hex string */ /* ISO-8601 timestamp */ /* ISO-8601 *or* 20100417134523:int */ ocean voeckler

The first *file* entry above is an example of a data file with two
replicas. The *file* element requires a logical file *name*. Each logical
filename may have additional information associated with it, enumerated by
*profile* elements. Each file entry may have 0 or more *metadata* elements
associated with it. Each piece of metadata has a *key* string and a *type*
attribute describing the element's value.

**Warning**

The *metadata* element is not supported as of this writing! Details may
change in the future.

The *file* element can provide 0 or more *pfn* locations, taking
precedence over the replica catalog. A *file* element that does not name
any *pfn* child elements will still require look-ups in external replica
catalogs. Each *pfn* element names a concrete location of a file. Multiple
locations constitute replicas of the same file, and are assumed to be
usable interchangeably. The *url* attribute is mandatory, and typically
would use a file scheme URL. The *site* attribute is optional, and
defaults to the value *local* if missing. A *pfn* element may have
*profile* child elements, which refer to attributes of the physical file.
The file-level profiles refer to attributes of the logical file.

.. note::

   The ``stat`` profile namespace is only an example, and details about
   stat are not yet implemented. The proper namespaces ``pegasus``,
   ``condor``, ``dagman``, ``env``, ``hints``, ``globus`` and ``selector``
   enjoy full support.

The second *file* entry above shows a usage example from the black-diamond
example workflow that you are more likely to encounter or write.

The presence of an in-file replica catalog lets you declare a couple of
interesting advanced features. The DAG and DAX file declarations are just
files for all practical purposes.
For deferred planning, the location of the site catalog (SC) can be
captured in a file, too, which is passed as a logical filename to the job
dealing with the deferred planning.

::

.. _dax-transformation-catalog:

The Transformation Catalog Section
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The executable section acts as an in-file transformation catalog (TC). Any
transformations declared in this section take precedence over the external
transformation catalog during planning.

::

   5000 AB454DSSDA4646DS 2010-11-22T10:05:55.470606000-0800 /* see above */ ocean 0a9c38b919c7809cb645fc09011588a6 /path/to/my_send_email some args

Logical filenames pertaining to a single executable in the transformation
catalog use the *executable* element. Any *executable* element features
the optional *namespace* attribute, a mandatory *name* attribute, and an
optional *version* attribute. The *version* attribute defaults to "1.0"
when absent. An executable typically needs additional attributes to
describe it properly, like the architecture, OS release and other flags
typically seen with transformations, or found in the transformation
catalog.

.. table:: executable element attributes

   ========= ========= ============== ==============================================================
   attribute optional? type           meaning
   ========= ========= ============== ==============================================================
   name      required  string         logical transformation name
   namespace optional  string         namespace of logical transformation, defaults to *null* value.
   version   optional  VersionPattern version of logical transformation, defaults to "1.0".
   installed optional  boolean        whether to stage the file (false), or not (true, default).
   arch      optional  Architecture   restricted set of tokens, see schema definition file.
   os        optional  OSType         restricted set of tokens, see schema definition file.
   osversion optional  VersionPattern kernel version as beginning of ``uname -r``.
   glibc     optional  VersionPattern version of libc.
   ========= ========= ============== ==============================================================

The rationale for giving these flags in the *executable* element header is
that PFNs are just identical replicas or instances of a given LFN. If you
need a different 32/64 bit-ed-ness or OS release, the underlying PFN would
be different, and thus the LFN for it should be different, too.

.. note::

   We are still discussing some details and implications of this decision.

The initial examples come with the same caveats as for the included
replica catalog.

**Warning**

The *metadata* element is not supported as of this writing! Details may
change in the future.

Similar to the replica catalog, each *executable* element may have 0 or
more *profile* elements abstracting away site-specific details, zero or
more *metadata* elements, and zero or more *pfn* elements. If there are no
*pfn* elements, the transformation must still be searched for in the
external transformation catalog. As before, the *pfn* element may have
*profile* child elements, referring to attributes of the physical filename
itself.

Each *executable* element may also feature *invoke* elements. These enable
notifications at the appropriate point when every job that uses this
executable reaches the point of notification. Please refer to the
`notification section <#notifications>`__ for details and caveats.

The last example above comes from the black diamond example workflow, and
presents the kind and extent of attributes you are most likely to see and
use in your own workflows.

The Compound Transformation Section
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The compound transformation section declares a transformation that
comprises multiple plain transformations. You can think of a compound
transformation like a script interpreter and the script itself. In order
to properly run the application, you must start both the script
interpreter and the script passed to it.
The compound transformation helps Pegasus deal with this case properly,
especially when it needs to stage executables.

::

A *transformation* element declares a set of purely logical entities,
executables and config (data) files, that are all required together for
the same job. Being purely logical entities, the lookup happens only when
the transformation element is referenced (or instantiated) by a job
element later on. The *namespace* and *version* attributes of the
transformation element are optional, and provide the defaults for the
inner *uses* elements. They are also essential for matching the
transformation with a job.

The *transformation* is made up of 1 or more *uses* elements. Each *uses*
has a boolean attribute *executable*, ``true`` by default, or ``false`` to
indicate a data file. The *name* is a mandatory attribute, referring to an
LFN declared previously in the File Catalog (*executable* is ``false``),
the Executable Catalog (*executable* is ``true``), or to be looked up as
necessary at instantiation time. The lookup catalog is determined by the
*executable* attribute.

After the *uses* elements, any number of *invoke* elements may occur. Each
adds a `notification <#notifications>`__ that applies whenever this
transformation is instantiated.

The *namespace* and *version* attributes' default values inside *uses*
elements are inherited from the *transformation* attributes of the same
name. There is no such inheritance for *uses* elements with an
*executable* attribute of ``false``.

.. _api-graph-nodes:

Graph Nodes
~~~~~~~~~~~

The nodes in the DAX comprise regular job nodes, already instantiated
sub-workflows as dag nodes, and still to be instantiated dax nodes. Each
of the graph nodes has a mandatory *id* attribute. The *id* attribute is
currently of type *NodeIdentifierPattern*, which is a restriction of the
``xs:NMTOKEN`` type to letters, digits, hyphen and underscore.
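
The identifier restriction described above can be approximated with a
regular expression. The following is a minimal sketch, assuming the
pattern amounts to one or more ASCII letters, digits, hyphens or
underscores; the authoritative definition of *NodeIdentifierPattern* is
the one in the schema file, and the helper name ``is_valid_node_id`` is
hypothetical.

.. code-block:: python

   import re

   # Assumption: approximates NodeIdentifierPattern as described in the text
   # (letters, digits, hyphen, underscore). Check the XSD for the real pattern.
   NODE_ID_RE = re.compile(r"[A-Za-z0-9_-]+")


   def is_valid_node_id(node_id: str) -> bool:
       """Return True if node_id uses only the characters named in the text."""
       return NODE_ID_RE.fullmatch(node_id) is not None


   print(is_valid_node_id("ID000001"))
   print(is_valid_node_id("job 1"))

A generator could run such a check before writing *id* and *ref*
attributes, so that problems surface before schema validation.
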
The *level* attribute is deprecated, as the planner will trust its own
re-computation more than user input. Please neither use nor produce any
*level* attribute.

The *node-label* attribute is optional. It applies to the use-case when
every transformation has the same name, but its arguments determine what
it really does. In the presence of a *node-label* value, a workflow
grapher could use the label value to show graph nodes to the user. It may
also come in handy while debugging.

Any job-like graph node has the following set of child elements, as
defined in the *AbstractJobType* declaration in the schema definition:

- 0 or 1 *argument* element to declare the command-line of the job's
  invocation.

- 0 or more *profile* elements to abstract away site-specific or
  job-specific details.

- 0 or 1 *stdin* element to link a logical file to the job's standard
  input.

- 0 or 1 *stdout* element to link a logical file to the job's standard
  output.

- 0 or 1 *stderr* element to link a logical file to the job's standard
  error.

- 0 or more *uses* elements to declare consumed data files and produced
  data files.

- 0 or more *invoke* elements to solicit
  `notifications <#notifications>`__ when a job reaches a certain state in
  its life-cycle.

.. _api-job-nodes:

Job Nodes
^^^^^^^^^

A job element has a number of attributes. In addition to the *id* and
*node-label* described in `Graph Nodes <#api-graph-nodes>`__ above, the
optional *namespace*, mandatory *name* and optional *version* identify the
transformation, and provide the look-up handle: first in the DAX's
*transformation* elements, then in the *executable* elements, and finally
in an external transformation catalog.

::

   -a top -T 6 -i -o isi_viz true 1024 /path/to arg arg

The *argument* element contains the complete command-line that is needed
to invoke the executable. The only variable components are logical
filenames, as included *file* elements.

The *profile* element lets you encapsulate site-specific knowledge.
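
The mixed content of the *argument* element, literal option text
interleaved with *file* elements, can be sketched with Python's standard
``xml.etree.ElementTree`` module. This is only an illustration under
stated assumptions: the job id, job name and the output file name ``f.b``
are hypothetical, the DAX XML namespace declaration is omitted for
brevity, and this is not the Pegasus DAX generator API.

.. code-block:: python

   import xml.etree.ElementTree as ET

   # Hypothetical job values, loosely modeled on the diamond example.
   job = ET.Element("job", {"id": "ID000001", "name": "preprocess"})

   # Mixed content: option text goes into .text and .tail, while each
   # logical filename becomes a <file> child element.
   argument = ET.SubElement(job, "argument")
   argument.text = "-a preprocess -T60 -i "
   infile = ET.SubElement(argument, "file", {"name": "f.a"})
   infile.tail = " -o "
   ET.SubElement(argument, "file", {"name": "f.b"})  # hypothetical output name

   xml_text = ET.tostring(job, encoding="unicode")
   print(xml_text)

Keeping filenames as *file* elements rather than plain text is what lets
the planner substitute physical locations for the logical names.
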
The *stdin*, *stdout* and *stderr* elements permit you to connect a stdio
file descriptor to a logical filename. Note that you will still have to
declare these files in the *uses* section below.

The *uses* element enumerates all the files that the task consumes or
produces. While not every file needs to appear on the command line, it is
imperative that you declare even hidden files that your task requires in
this section, so that the proper ancillary staging and clean-up tasks can
be generated during planning.

The *invoke* element may be specified multiple times, as needed. It has a
mandatory *when* attribute with the following value set:

.. table:: invoke element attributes

   ========== ==================== =====================================================================================================
   keyword    job life-cycle state meaning
   ========== ==================== =====================================================================================================
   never      never                (default). Never notify of anything. This is useful to temporarily disable an existing notification.
   start      submit               create a notification when the job is submitted.
   on_error   end                  after a job finishes with failure (exitcode != 0).
   on_success end                  after a job finishes with success (exitcode == 0).
   at_end     end                  after a job finishes, regardless of exitcode.
   all        always               like start and at_end combined.
   ========== ==================== =====================================================================================================

**Warning**

In clustered jobs, a notification can only be sent at the start or end of
the clustered job, not for each member.

Each *invoke* is a simple local invocation of an executable or script with
the specified arguments. The executable inside the invoke body will see
the following environment variables:

.. table:: invoke/executable environment variables

   ================== ==================== =========================================================================================================================================================
   variable           job life-cycle state meaning
   ================== ==================== =========================================================================================================================================================
   PEGASUS_EVENT      always               The value of the ``when`` attribute
   PEGASUS_STATUS     end                  The exit status of the graph node. Only available for end notifications.
   PEGASUS_SUBMIT_DIR always               In which directory to find the job (or workflow).
   PEGASUS_JOBID      always               The job (or workflow) identifier. This is potentially more than merely the value of the *id* attribute.
   PEGASUS_STDOUT     always               The filename where *stdout* goes. Empty and possibly non-existent at submit time (though we still have the filename). The kickstart record for job nodes.
   PEGASUS_STDERR     always               The filename where *stderr* goes. Empty and possibly non-existent at submit time (though we still have the filename).
   ================== ==================== =========================================================================================================================================================

Generators should pass CDATA-encapsulated values to the invoke element to
minimize interference. Unfortunately, CDATA sections cannot be nested, so
if the user invocation itself contains a CDATA section, we suggest using
carefully XML-entity-escaped strings instead. The
`notifications section <#notifications>`__ describes these in further
detail.

DAG Nodes
^^^^^^^^^

A workflow that has already been concretized, either by an earlier run of
Pegasus, or otherwise constructed for DAGMan execution, can be included
into the current workflow using the *dag* element.

::

   /dag-dir/test

The *id* and *node-label* attributes were described
`previously <#api-graph-nodes>`__.
The *name* attribute refers to a file from the File Catalog that provides
the actual DAGMan DAG as data content. The *dag* element features optional
*profile* elements. These would most likely pertain to the ``dagman`` and
``env`` profile namespaces. It should be possible to have the optional
*notify* element in the same manner as for jobs.

A graph node that is a dag instead of a job would just use a different
submit file generator to create a DAGMan invocation. There can be an
*argument* element to modify the command-line passed to DAGMan.

DAX Nodes
^^^^^^^^^

A still to be planned workflow incurs an invocation of the Pegasus planner
as part of the workflow. This still abstract sub-workflow uses the *dax*
element.

::

   bar -Xmx1024 -Xms512 -Dpegasus.dir.storage=storagedir -Dpegasus.dir.exec=execdir -o local --dir ./datafind -vvvvv --force -s dax_site

For the *id* and *node-label* attributes, see
`Graph Nodes <#api-graph-nodes>`__. The *name* attribute refers to a file
from the File Catalog that provides the to-be-planned DAX as external file
data content. The *dax* element features optional *profile* elements.
These would most likely pertain to the ``pegasus``, ``dagman`` and ``env``
profile namespaces. It may be possible to have the optional *notify*
element in the same manner as for jobs.

A graph node that is a *dax* instead of a job would just use yet another
submit file and pre-script generator to create a DAGMan invocation. The
*argument* string pertains to the command line of the to-be-generated
DAGMan invocation.

Inner ADAG Nodes
^^^^^^^^^^^^^^^^

While completeness would argue for a recursive nesting of *adag* elements,
such recursive nestings are currently not supported, not even in the
schema. If you need to nest workflows, please use the *dax* or *dag*
element to achieve the same goal.

The Dependency Section
~~~~~~~~~~~~~~~~~~~~~~

This section describes the dependencies between the jobs.
::

Each *child* element contains one or more *parent* elements. Either
element refers to a *job*, *dag* or *dax* element's *id* attribute using
the *ref* attribute. In this version, we relaxed the ``xs:IDREF``
constraint in favor of a restriction on the ``xs:NMTOKEN`` type to permit
a larger set of identifiers.

The *parent* element has an optional *edge-label* attribute.

**Warning**

The *edge-label* attribute is currently unused. Its goal is to annotate
edges when drawing workflow graphs.

Closing
~~~~~~~

Like any XML element, the root element needs to be closed.

::

DAX XML Schema Example
----------------------

The following code example shows the XML instance document representing
the diamond workflow.

::

   /bin/mailx -s 'diamond failed' use@some.domain 2 3 2 3 2 3 -a preprocess -T60 -i -o -a findrange -T60 -i -o -a findrange -T60 -i -o -a analyze -T60 -i -o

The above workflow defines the black diamond from the abstract workflow
section of the `Introduction <#about>`__ chapter. It will require minimal
configuration, because the catalog sections include all necessary
declarations.

Please note the following:

- The **file** element declares the required input file "f.a" in terms of
  the local machine. Please note that if you plan the workflow for a
  remote site, there has to be some way for the file to be staged from the
  local site to the remote site. While Pegasus will augment the workflow
  with such ancillary jobs, the site catalog as well as the local and
  remote sites have to be set up properly. For a locally run workflow you
  don't need to do anything.

- The **executable** elements declare the same executable keg that is to
  be run for each logical transformation in terms of the remote site
  *futuregrid*. To declare it for a local site, you would have to adjust
  the *site* attribute's value to ``local``. This section also shows that
  the same executable may come in different guises as transformations.

- The **job** elements define the workflow's logical constituents, the way
  to invoke the ``keg`` command, where to put filenames on the command
  line, and what files are consumed or produced. In addition to the
  direction of files, further attributes determine whether to register the
  file with a replica catalog and whether to transfer it to the output
  site in case of a product. We are only interested in the final data
  product "f.d" in this workflow, and not any intermediary files.
  Typically, you would also want to register the data products in the
  replica catalog, especially in larger scenarios.

- The **child** elements define the control flow between the jobs.
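
Because the child and parent references form a directed graph in which no
cycles are permitted, a generator can check a workflow before writing it
out. The following is a minimal sketch using Kahn's algorithm; the mapping
``parents_of`` mirrors the child/parent elements, and the function name
and the four job ids are hypothetical, not part of any Pegasus API.

.. code-block:: python

   from collections import deque


   def topological_order(parents_of):
       """Order node ids so that every parent precedes its children.

       parents_of maps a child id to the ids of its parents. Raises
       ValueError if the dependencies contain a cycle.
       """
       nodes = set(parents_of)
       for parents in parents_of.values():
           nodes.update(parents)

       children_of = {node: [] for node in nodes}
       pending = {node: 0 for node in nodes}  # unresolved parents per node
       for child, parents in parents_of.items():
           for parent in set(parents):
               children_of[parent].append(child)
               pending[child] += 1

       ready = deque(sorted(node for node in nodes if pending[node] == 0))
       order = []
       while ready:
           node = ready.popleft()
           order.append(node)
           for child in sorted(children_of[node]):
               pending[child] -= 1
               if pending[child] == 0:
                   ready.append(child)

       if len(order) != len(nodes):
           raise ValueError("cyclic dependency detected")
       return order


   # Hypothetical ids for the four jobs of the diamond workflow.
   diamond = {
       "ID2": ["ID1"],
       "ID3": ["ID1"],
       "ID4": ["ID2", "ID3"],
   }
   print(topological_order(diamond))

The returned order is one valid execution sequence; the two middle jobs of
the diamond do not depend on each other and could run in parallel.
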