10.7. Metadata

Pegasus allows users to associate metadata at

  • Workflow Level in the DAX

  • Task level in the DAX and the Transformation Catalog

  • File level in the DAX and Replica Catalog

Metadata is specified as a key value tuple, where both key and values are of type String.

All the metadata ( user specified and auto-generated) gets populated into the workflow database ( usually in the workflow submit directory) by pegasus-monitord. The metadata in this database can be be queried for using the pegasus-metadata command line tool, or is also shown in the Pegasus Dashboard.

10.7.1. Metadata in the DAX

In the DAX, metadata can be associated with the workflow, tasks, files and executable. For details on how to associate metadata in the DAX using the DAX API refer to the DAX API chapter. Below is an example DAX that illustrates metadata associations at workflow, task and file level.

<?xml version="1.0" encoding="UTF-8"?>
<!-- generated on: 2016-01-21T10:36:39-08:00 -->
<!-- generated by: vahi [ ?? ] -->
<adag xmlns="http://pegasus.isi.edu/schema/DAX" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://pegasus.isi.edu/schema/DAX http://pegasus.isi.edu/schema/dax-3.6.xsd" version="3.6" name="diamond" index="0" count="1">

<!-- Section 1: Metadata attributes for the workflow (can be empty)  -->

   <metadata key="name">diamond</metadata>
   <metadata key="createdBy">Karan Vahi</metadata>

<!-- Section 2: Invokes - Adds notifications for a workflow (can be empty) -->

   <invoke when="start">/pegasus/libexec/notification/email -t notify@example.com</invoke>
   <invoke when="at_end">/pegasus/libexec/notification/email -t notify@example.com</invoke>

<!-- Section 3: Files - Acts as a Replica Catalog (can be empty) -->

   <file name="f.a">
      <metadata key="size">1024</metadata>
      <pfn url="file:///Volumes/Work/lfs1/work/pegasus-features/PM-902/f.a" site="local"/>
   </file>

<!-- Section 4: Executables - Acts as a Transformaton Catalog (can be empty) -->

   <executable namespace="pegasus" name="preprocess" version="4.0" installed="true" arch="x86" os="linux">
      <metadata key="size">2048</metadata>
      <pfn url="file:///usr/bin/keg" site="TestCluster"/>
   </executable>
   <executable namespace="pegasus" name="findrange" version="4.0" installed="true" arch="x86" os="linux">
      <pfn url="file:///usr/bin/keg" site="TestCluster"/>
   </executable>
   <executable namespace="pegasus" name="analyze" version="4.0" installed="true" arch="x86" os="linux">
      <pfn url="file:///usr/bin/keg" site="TestCluster"/>
   </executable>

<!-- Section 5: Transformations - Aggregates executables and Files (can be empty) -->


<!-- Section 6: Job's, DAX's or Dag's - Defines a JOB or DAX or DAG (Atleast 1 required) -->

   <job id="j1" namespace="pegasus" name="preprocess" version="4.0">
      <metadata key="time">60</metadata>
      <argument>-a preprocess -T 60 -i  <file name="f.a"/> -o  <file name="f.b1"/>   <file name="f.b2"/></argument>
      <uses name="f.a" link="input">
         <metadata key="size">1024</metadata>
      </uses>
      <uses name="f.b1" link="output" transfer="true" register="true"/>
      <uses name="f.b2" link="output" transfer="true" register="true"/>
      <invoke when="start">/pegasus/libexec/notification/email -t notify@example.com</invoke>
      <invoke when="at_end">/pegasus/libexec/notification/email -t notify@example.com</invoke>
   </job>
   <job id="j2" namespace="pegasus" name="findrange" version="4.0">
      <metadata key="time">60</metadata>
      <argument>-a findrange -T 60 -i  <file name="f.b1"/> -o  <file name="f.c1"/></argument>
      <uses name="f.b1" link="input"/>
      <uses name="f.c1" link="output" transfer="true" register="true"/>
      <invoke when="start">/pegasus/libexec/notification/email -t notify@example.com</invoke>
      <invoke when="at_end">/pegasus/libexec/notification/email -t notify@example.com</invoke>
   </job>
   <job id="j3" namespace="pegasus" name="findrange" version="4.0">
      <metadata key="time">60</metadata>
      <argument>-a findrange -T 60 -i  <file name="f.b2"/> -o  <file name="f.c2"/></argument>
      <uses name="f.b2" link="input"/>
      <uses name="f.c2" link="output" transfer="true" register="true"/>
      <invoke when="start">/pegasus/libexec/notification/email -t notify@example.com</invoke>
      <invoke when="at_end">/pegasus/libexec/notification/email -t notify@example.com</invoke>
   </job>
   <job id="j4" namespace="pegasus" name="analyze" version="4.0">
      <metadata key="time">60</metadata>
      <argument>-a analyze -T 60 -i  <file name="f.c1"/>   <file name="f.c2"/> -o  <file name="f.d"/></argument>
      <uses name="f.c1" link="input"/>
      <uses name="f.c2" link="input"/>
      <uses name="f.d" link="output" transfer="true" register="true"/>
      <invoke when="start">/pegasus/libexec/notification/email -t notify@example.com</invoke>
      <invoke when="at_end">/pegasus/libexec/notification/email -t notify@example.com</invoke>
   </job>

<!-- Section 7: Dependencies - Parent Child relationships (can be empty) -->

   <child ref="j2">
      <parent ref="j1"/>
   </child>
   <child ref="j3">
      <parent ref="j1"/>
   </child>
   <child ref="j4">
      <parent ref="j2"/>
      <parent ref="j3"/>
   </child>
</adag>

10.7.2. Workflow Level Metadata

Workflow level metadata can be associated only in the DAX under the root element adag. Below is a snippet that illustrates this

<?xml version="1.0" encoding="UTF-8"?>
<!-- generated on: 2016-01-21T10:36:39-08:00 -->
<!-- generated by: vahi [ ?? ] -->
<adag xmlns="http://pegasus.isi.edu/schema/DAX" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://pegasus.isi.edu/schema/DAX http://pegasus.isi.edu/schema/dax-3.6.xsd" version="3.6" name="diamond" index="0" count="1">

<!-- Section 1: Metadata attributes for the workflow (can be empty)  -->

   <metadata key="name">diamond</metadata>
   <metadata key="createdBy">Karan Vahi</metadata>

...

</adag>

10.7.3. Task Level Metadata

Metadata for the tasks is picked up from

  • metadata associated with the job element in the DAX

  • metadata associated with the corresponding transformation. The transformation for a task is picked up from either a matching executable entry in the DAX ( if exists ) or the Transformation Catalog.

Below is a snippet that illustrates metadata for a task specified in the job element in the DAX

<?xml version="1.0" encoding="UTF-8"?>
<!-- generated on: 2016-01-21T10:36:39-08:00 -->
<!-- generated by: vahi [ ?? ] -->
<adag xmlns="http://pegasus.isi.edu/schema/DAX" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://pegasus.isi.edu/schema/DAX http://pegasus.isi.edu/schema/dax-3.6.xsd" version="3.6" name="diamond" index="0" count="1">

...
    <job id="j2" namespace="pegasus" name="findrange" version="4.0">
      <metadata key="time">60</metadata>
      <argument>-a findrange -T 60 -i  <file name="f.b1"/> -o  <file name="f.c1"/></argument>
      <uses name="f.b1" link="input"/>
      <uses name="f.c1" link="output" transfer="true" register="true"/>
      <invoke when="start">/pegasus/libexec/notification/email -t notify@example.com</invoke>
      <invoke when="at_end">/pegasus/libexec/notification/email -t notify@example.com</invoke>
   </job>

...

</adag>

Below is a snippet that illustrates metadata for a task specified in the executable element in the DAX

<?xml version="1.0" encoding="UTF-8"?>
<!-- generated on: 2016-01-21T10:36:39-08:00 -->
<!-- generated by: vahi [ ?? ] -->
<adag xmlns="http://pegasus.isi.edu/schema/DAX" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://pegasus.isi.edu/schema/DAX http://pegasus.isi.edu/schema/dax-3.6.xsd" version="3.6" name="diamond" index="0" count="1">

...
    <!-- Section 4: Executables - Acts as a Transformaton Catalog (can be empty) -->

   <executable namespace="pegasus" name="findrange" version="4.0" installed="true" arch="x86" os="linux">
      <metadata key="size">2048</metadata>
      <pfn url="file:///usr/bin/keg" site="TestCluster"/>
   </executable>

...

</adag>

Metadata can be associated with the transformation in the transformation catalog. The metadata specified in the transformation catalog gets automatically associated with the task level metadata for the corresponding task ( that uses that executable). This resolution is similar to how profiles associated in the Transformation Catalog get associated with the tasks. Below is an example Transformation Catalog that illustrates metadata associated with the executable.

tr pegasus::findrange:4.0 { 
    site TestCluster { 
        pfn "/usr/bin/pegasus-keg" 
        arch "x86_64" 
        os "linux" 
        type "INSTALLED" 
        profile pegasus "clusters.size" "20" 
        metadata "key" "value" 
        metadata "appmodel" "myxform.aspen" 
        metadata "version" "3.0" 
    } 
}

10.7.4. File Level Metadata

Metadata for the files is picked up from

  • metadata associated with the file element in the DAX. File elements are optionally used to record the locations of input files for the workflow in the DAX.

  • metadata associated with the files in the uses section of the job element in the DAX

  • metadata associated with the file in the Replica Catalog.

Below is a snippet that illustrates metadata for a file specified in the file element in the DAX

<?xml version="1.0" encoding="UTF-8"?>
<!-- generated on: 2016-01-21T10:36:39-08:00 -->
<!-- generated by: vahi [ ?? ] -->
<adag xmlns="http://pegasus.isi.edu/schema/DAX" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://pegasus.isi.edu/schema/DAX http://pegasus.isi.edu/schema/dax-3.6.xsd" version="3.6" name="diamond" index="0" count="1">

...
    <!-- Section 3: Files - Acts as a Replica Catalog (can be empty) -->

   <file name="f.a">
      <metadata key="size">1024</metadata>
      <pfn url="file:///Volumes/Work/lfs1/work/pegasus-features/PM-902/f.a" site="local"/>
   </file>


...

</adag>

Below is a snippet that illustrates metadata for a file in the uses section of the job element

<?xml version="1.0" encoding="UTF-8"?>
<!-- generated on: 2016-01-21T10:36:39-08:00 -->
<!-- generated by: vahi [ ?? ] -->
<adag xmlns="http://pegasus.isi.edu/schema/DAX" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://pegasus.isi.edu/schema/DAX http://pegasus.isi.edu/schema/dax-3.6.xsd" version="3.6" name="diamond" index="0" count="1">

...
    <job id="j1" namespace="pegasus" name="preprocess" version="4.0">
      <argument>-a preprocess -T 60 -i  <file name="f.a"/> -o  <file name="f.b1"/>   <file name="f.b2"/></argument>
      <uses name="f.a" link="input">
         <metadata key="size">1024</metadata>
         <metadata key="source">DAX</metadata>
      </uses>
      <uses name="f.b1" link="output" transfer="true" register="true"/>
      <uses name="f.b2" link="output" transfer="true" register="true"/>
   </job>

...

</adag>

Below is a snippet that illustrates metadata for an input file in the Replica Catalog entry for the file

# File Based Replica Catalog
f.a file://$PWD/production_200.conf site="local" source="replica_catalog"

10.7.5. Automatically Generated Metadata attributes

Pegasus captures certain metadata attributes as output files are generated and associates them at the file level in the database. Currently, the following attributes for the output files are automatically captured from the kickstart record and stored in the workflow database.

  • pfn - the physical file location

  • ctime - creation time

  • size - size of the file in bytes

  • user - the linux user as who the process ran that generated the output file.

Note

The automatic collection of the metadata attributes for output files is only triggered if the output file is marked to be registered in the replica catalog, and --output-site option to pegasus-plan is specified.

10.7.6. Tracing Metadata for an output file

The command line client pegasus-metadata allows a user to trace all the metadata associated with the file. The client will display metadata for the output file, the task that generated the file, the workflow which contains the task, and the root workflow which contains the task. Below is a sample illustration of it.

$ pegasus-metadata file --file-name f.d --trace /path/to/submit-dir

Workflow 493dda63-c6d0-4e62-bc36-26e5629449ad
    createdby : Test user
    name      : diamond

Task ID0000004
    size           : 2048
    time           : 60
    transformation : analyze

File f.d
    ctime        : 2016-01-20T19:02:14-08:00
    final_output : true
    size         : 582
    user         : bamboo