2.8. Generating the Workflow

The example that you ran earlier already had the workflow description (split.dax) generated. Pegasus reads workflow descriptions from DAX files. "DAX" is short for "Directed Acyclic Graph in XML". It is an XML file format with syntax for expressing jobs, arguments, files, and dependencies. We will now create the split workflow that we just ran, using the DAX API provided by Pegasus:

Figure 2.13. Split Workflow

In this diagram, the ovals represent computational jobs, the dog-eared squares are files, and the arrows are dependencies.

To create a DAX, you write a program called a DAX generator. Pegasus comes with Perl, Java, and Python libraries for writing DAX generators. This tutorial uses the Python library.

The DAX generator for the split workflow is in the file daxgen.py. Look at the file by typing:

$ more daxgen.py


We will be using the more command to inspect several files in this tutorial. more is a pager application, meaning that it splits text files into pages and displays the pages one at a time. You can view the next page of a file by pressing the spacebar. Type 'h' to get help on using more. When you are done, you can type 'q' to close the file.

The code has 3 main sections:

  1. A new ADAG object is created. This is the main object to which jobs and dependencies are added.

    # Create an abstract DAG
    dax = ADAG("split")

  2. Jobs and files are added. The 5 jobs in the diagram above are added and 9 files are referenced. Arguments are defined using strings and File objects. The input and output files are defined for each job. This is an important step, as it allows Pegasus to track the files and stage the data if necessary. Workflow outputs are tagged with "transfer=true".

    # the split job that splits the webpage into smaller chunks
    webpage = File("pegasus.html")
    split = Job("split")
    split.uses(webpage, link=Link.INPUT)

  3. Dependencies are added. These are shown as arrows in the diagram above. They define the parent/child relationships between the jobs. When the workflow is executing, the order in which the jobs will be run is determined by the dependencies between them.

    # Add control-flow dependencies
    # (first argument is the child, second is the parent: wc runs after split)
    dax.depends(wc, split)
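
To illustrate how dependencies determine the order in which jobs run, here is a minimal sketch that uses only the Python standard library (graphlib, available in Python 3.9 and later) instead of the Pegasus DAX API. It is not how Pegasus schedules jobs internally. The job labels wc_a through wc_d are hypothetical, and the shape of the graph (one split job feeding four wc jobs) is an assumption based on the five jobs mentioned above.

```python
from graphlib import TopologicalSorter

# Map each child job to the set of parent jobs it depends on.
# Assumption: one split job feeds four wc jobs (labels are hypothetical).
dependencies = {
    "split": set(),
    "wc_a": {"split"},
    "wc_b": {"split"},
    "wc_c": {"split"},
    "wc_d": {"split"},
}


def execution_waves(deps):
    """Group jobs into waves; a job is placed in a wave only after all of its parents are done."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # jobs whose parents are all done
        waves.append(ready)
        sorter.done(*ready)  # mark them finished so their children become ready
    return waves


waves = execution_waves(dependencies)
for number, jobs in enumerate(waves, start=1):
    print(f"wave {number}: {', '.join(jobs)}")
```

Under these assumptions, split is the only job in the first wave, and the four wc jobs become ready together once it finishes, since none of them depends on another.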

Generate a DAX file named split.dax by typing:

$ ./generate_dax.sh split.dax
Generated dax split.dax

The split.dax file should contain an XML representation of the split workflow. You can inspect it by typing:

$ more split.dax
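
If you would rather check the structure programmatically, the following rough sketch uses Python's xml.etree.ElementTree to list jobs, files, and dependencies. The embedded XML is a hand-written, heavily simplified stand-in and not the actual contents of split.dax: the element and attribute names (job, uses, child, parent, ref) and the file names part.a and count.txt.a are assumptions for illustration. A real DAX file may declare an XML namespace and carry additional attributes, in which case the lookups below would need to be adapted.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified DAX-like document (NOT real split.dax output).
SAMPLE = """
<adag name="split">
  <job id="ID1" name="split">
    <uses name="pegasus.html" link="input"/>
    <uses name="part.a" link="output"/>
  </job>
  <job id="ID2" name="wc">
    <uses name="part.a" link="input"/>
    <uses name="count.txt.a" link="output"/>
  </job>
  <child ref="ID2">
    <parent ref="ID1"/>
  </child>
</adag>
"""


def summarize(xml_text):
    """Return (jobs, files, edges) found in a simplified DAX-like document."""
    root = ET.fromstring(xml_text.strip())
    # Job id -> job name
    jobs = {job.get("id"): job.get("name") for job in root.findall("job")}
    # Every distinct file name referenced by any job
    files = sorted({use.get("name") for use in root.iter("uses")})
    # (parent id, child id) pairs
    edges = [
        (parent.get("ref"), child.get("ref"))
        for child in root.findall("child")
        for parent in child.findall("parent")
    ]
    return jobs, files, edges


jobs, files, edges = summarize(SAMPLE)
print("jobs:", jobs)
print("files:", files)
print("parent -> child:", edges)
```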