The example that you ran earlier used a workflow description (split.dax) that had already been generated. Pegasus reads workflow descriptions from DAX files. The term "DAX" is short for "Directed Acyclic Graph in XML". DAX is an XML file format that has syntax for expressing jobs, arguments, files, and dependencies. We will now create the split workflow that we just ran, using the DAX API provided by Pegasus:
In this diagram, the ovals represent computational jobs, the dog-eared squares are files, and the arrows are dependencies.
In order to create a DAX it is necessary to write code for a DAX generator. Pegasus comes with Perl, Java, and Python libraries for writing DAX generators. In this tutorial we will show how to use the Python library.
The DAX generator for the split workflow is in the file daxgen.py. Look at the file by typing:
$ more daxgen.py ...
We will be using the more command to inspect several files in this tutorial. more is a pager application, meaning that it splits text files into pages and displays the pages one at a time. You can view the next page of a file by pressing the spacebar. Type 'h' to get help on using more. When you are done, you can type 'q' to close the file.
The code has 3 main sections:
A new ADAG object is created. This is the main object to which jobs and dependencies are added.
# Create an abstract DAG
dax = ADAG("split")
...
Jobs and files are added. The 5 jobs in the diagram above are added and 9 files are referenced. Arguments are defined using strings and File objects. The input and output files are defined for each job. This is an important step, as it allows Pegasus to track the files, and stage the data if necessary. Workflow outputs are tagged with "transfer=true".
# the split job that splits the webpage into smaller chunks
webpage = File("pegasus.html")
split = Job("split")
split.addArguments("-l","100","-a","1",webpage,"part.")
split.uses(webpage, link=Link.INPUT)
dax.addJob(split)
...
Dependencies are added. These are shown as arrows in the diagram above. They define the parent/child relationships between the jobs. When the workflow is executing, the order in which the jobs will be run is determined by the dependencies between them.
# Add control-flow dependencies
dax.depends(wc, split)
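To make the relationship between jobs, files, and dependencies concrete, here is a minimal sketch in plain Python that does not use the Pegasus DAX API at all. It models the split workflow as dictionaries and uses the standard library's graphlib module (Python 3.9 or later) to derive one valid execution order. The job names (wc_a, wc_b, ...) and the output file names (count.txt.a, ...) are assumptions made for illustration. The four chunks are inferred from the 5 jobs and 9 files mentioned above and from the one-letter suffix requested by "-a 1".

```python
from graphlib import TopologicalSorter

# Assumed chunk suffixes: "split -a 1" yields one-letter suffixes, and four
# chunks would account for the 5 jobs (1 split + 4 wc) in the diagram.
suffixes = ["a", "b", "c", "d"]

# Map each job to its (inputs, outputs). File names are illustrative.
jobs = {"split": (["pegasus.html"], [f"part.{s}" for s in suffixes])}
for s in suffixes:
    jobs[f"wc_{s}"] = ([f"part.{s}"], [f"count.txt.{s}"])

# Record dependencies as child -> set of parents, which mirrors the
# argument order of dax.depends(wc, split).
parents = {name: set() for name in jobs}
for s in suffixes:
    parents[f"wc_{s}"].add("split")

# Collect every distinct file that some job reads or writes.
files = {f for inputs, outputs in jobs.values() for f in inputs + outputs}

# A topological order lists every parent before its children.
order = list(TopologicalSorter(parents).static_order())

print(len(jobs), "jobs,", len(files), "files")
print("one valid order:", order)
```

The split job always comes first in the resulting order, while the relative order of the wc jobs is not fixed, because nothing makes one wc job depend on another. This is only a model of the idea. Pegasus itself decides when and where each job actually runs.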
Generate a DAX file named split.dax by typing:
$ ./generate_dax.sh split.dax
Generated dax split.dax
The split.dax file should now contain an XML representation of the split workflow. You can inspect it by typing:
$ more split.dax ...
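If paging through the XML by hand is tedious, a short script can summarize it instead. The sketch below uses Python's standard xml.etree.ElementTree module to list jobs and parent/child edges. It assumes that the DAX file contains job elements with id and name attributes, and child elements with a ref attribute that wrap parent elements. Those names are assumptions about the format, so check them against your own split.dax. The embedded SAMPLE string is a hand-written stand-in, not real Pegasus output.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for a DAX file; NOT real Pegasus output.
# The element and attribute names used here are assumptions.
SAMPLE = """<adag name="split">
  <job id="ID1" name="split"/>
  <job id="ID2" name="wc"/>
  <child ref="ID2"><parent ref="ID1"/></child>
</adag>"""


def local(tag):
    """Drop any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def summarize(root):
    """Return ({job id: job name}, [(parent id, child id), ...])."""
    jobs, edges = {}, []
    for elem in root.iter():
        if local(elem.tag) == "job":
            jobs[elem.get("id")] = elem.get("name")
        elif local(elem.tag) == "child":
            for sub in elem:
                if local(sub.tag) == "parent":
                    edges.append((sub.get("ref"), elem.get("ref")))
    return jobs, edges


# To read the generated file instead:
#   root = ET.parse("split.dax").getroot()
root = ET.fromstring(SAMPLE)
jobs, edges = summarize(root)
for parent, child in edges:
    print(f"{jobs[parent]} -> {jobs[child]}")
```

Stripping the namespace prefix in local() keeps the script usable whether or not the file declares an XML namespace on its root element.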