5. Creating Workflows¶
5.1. Abstract Workflows¶
The Abstract Workflow is a description of a user workflow, usually in YAML format (before the 5.0 release, it was an XML-based format called the DAX), that is used as the primary input into Pegasus. The workflow schema is described using JSON schemas in wf-5.0.yml. We recommend that users use the Workflow API to generate abstract workflows. The documentation of the APIs can be found at Workflow API. The Workflow API is available in Python, Java, and R.
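For instance, a minimal sketch with the Python Workflow API might look like the following (the transformation and file names are illustrative):

```python
from Pegasus.api import File, Job, Workflow

# Declare a workflow with a single job that turns f.a into f.b
wf = Workflow("example-workflow")

fa = File("f.a")
fb = File("f.b")

job = (
    Job("preprocess")
    .add_args("-i", fa, "-o", fb)
    .add_inputs(fa)
    .add_outputs(fb)
)
wf.add_jobs(job)

# Serialize the abstract workflow to YAML (workflow.yml by default)
wf.write()
```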
The sample workflow below incorporates some of the elementary graph structures used in all abstract workflows.
fan-out, scatter, and diverge all describe the fact that multiple siblings are dependent on fewer parents.
The example shows how the Job 2 and Job 3 nodes depend on the Job 1 node.
fan-in, gather, join, and converge describe how multiple siblings are merged into fewer dependent child nodes.
The example shows how the Job 4 node depends on both Job 2 and Job 3 nodes.
serial execution implies that nodes are dependent on one another, like pearls on a string.
parallel execution implies that nodes can be executed in parallel.
The example diamond workflow consists of four nodes representing jobs, linked by six files.
Required input files must be registered with the Replica Catalog in order for Pegasus to find them and integrate them into the workflow.
Leaf files are products or outputs of a workflow. Output files can be collected at a location.
The remaining files all have lines leading to them and originating from them. These files are products of some job steps (lines leading to them), and consumed by other job steps (lines leading out of them). Often, these files represent intermediary results that can be cleaned.
The representation of the example workflow as an abstract workflow requires external catalogs, such as the
replica catalog (RC) to resolve the input file f.a,
transformation catalog (TC) to resolve the logical job names (such as diamond::preprocess:2.0), and
site catalog (SC) to resolve the compute resources on which the jobs will execute.
The workflow below defines the four jobs just like the example picture, and the files that flow between the jobs. The intermediary files are neither registered nor staged out, and can be considered transient. Only the final result file f.d is staged out.
There are two main ways of generating the abstract workflow:
1. Using a workflow generating API in Python, Java, or R.
2. Generating YAML directly from your script.
Note: This option should only be considered by advanced users who can also read YAML schema definitions. This process can be error prone given YAML's sensitivity to indentation and whitespace.
One example of an Abstract Workflow representing the example workflow can look like the following:
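The sketch below follows the wf-5.0.yml schema; the job IDs and argument strings are illustrative:

```yaml
pegasus: "5.0"
name: diamond
jobs:
  - type: job
    name: preprocess
    id: ID0000001
    arguments: ["-i", "f.a", "-o", "f.b1", "f.b2"]
    uses:
      - {lfn: f.a, type: input}
      - {lfn: f.b1, type: output, stageOut: false, registerReplica: false}
      - {lfn: f.b2, type: output, stageOut: false, registerReplica: false}
  - type: job
    name: findrange
    id: ID0000002
    arguments: ["-i", "f.b1", "-o", "f.c1"]
    uses:
      - {lfn: f.b1, type: input}
      - {lfn: f.c1, type: output, stageOut: false, registerReplica: false}
  - type: job
    name: findrange
    id: ID0000003
    arguments: ["-i", "f.b2", "-o", "f.c2"]
    uses:
      - {lfn: f.b2, type: input}
      - {lfn: f.c2, type: output, stageOut: false, registerReplica: false}
  - type: job
    name: analyze
    id: ID0000004
    arguments: ["-i", "f.c1", "f.c2", "-o", "f.d"]
    uses:
      - {lfn: f.c1, type: input}
      - {lfn: f.c2, type: input}
      - {lfn: f.d, type: output, stageOut: true, registerReplica: false}
jobDependencies:
  - id: ID0000001
    children: [ID0000002, ID0000003]
  - id: ID0000002
    children: [ID0000004]
  - id: ID0000003
    children: [ID0000004]
```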
5.2. Catalogs¶
The Abstract Workflow description that you specify to Pegasus is portable, and usually does not contain any locations of physical input files, executables, or cluster endpoints where jobs are executed. Pegasus uses three information catalogs during the planning process.
5.2.1. Replica Catalog¶
To discover locations of files referred to in the workflow. At a minimum, you need to specify locations of all the raw input files of the workflow. These are the files that are not generated by any job in the workflow. In the example Abstract Workflow above, that would be file f.a.
You can use the Python Workflow API to generate a replica catalog. By default, Pegasus will pick up a file named replicas.yml from the directory where the planner is invoked from.
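A minimal sketch of doing this with the Python API (assuming the input f.a resides in the current working directory on the local site):

```python
from pathlib import Path

from Pegasus.api import ReplicaCatalog

rc = ReplicaCatalog()

# Map the logical file name f.a to a physical location on the "local" site
rc.add_replica("local", "f.a", Path(".").resolve() / "f.a")

# Written out as replicas.yml by default, where the planner picks it up
rc.write()
```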
You can find more details about Replica Catalog in the reference guide here.
5.2.2. Transformation Catalog¶
To discover locations of executables that are invoked by the jobs in the workflow. The transformation catalog is used to map logical job names to actual executables that can be invoked on the various sites where the jobs are launched. In the example Abstract Workflow above, the transformation catalog will map the transformations preprocess, findrange, and analyze to an executable available on a particular site.
You can use the Python Workflow API to generate a transformation catalog. By default, Pegasus will pick up a file named transformations.yml from the directory where the planner is invoked from.
The following illustrates how Pegasus.api.transformation_catalog.TransformationCatalog can be used to generate a new Transformation Catalog programmatically.
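The sketch below is illustrative; the site name and executable paths are assumptions:

```python
from Pegasus.api import Transformation, TransformationCatalog

tc = TransformationCatalog()

# Map each logical transformation to an installed executable on a site;
# the site name and paths below are illustrative
preprocess = Transformation(
    "preprocess", site="condorpool", pfn="/usr/local/bin/preprocess", is_stageable=False
)
findrange = Transformation(
    "findrange", site="condorpool", pfn="/usr/local/bin/findrange", is_stageable=False
)
analyze = Transformation(
    "analyze", site="condorpool", pfn="/usr/local/bin/analyze", is_stageable=False
)
tc.add_transformations(preprocess, findrange, analyze)

# Written out as transformations.yml by default
tc.write()
```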
You can find more details about Transformation Catalog in the reference guide here.
5.2.3. Site Catalog¶
To discover what directories and file servers to use for staging in data and placing outputs. Pegasus by default constructs two sites automatically for a user:
The local site is used by Pegasus to learn about the submit host where Pegasus is installed and executed from.
The condorpool site is the Condor pool configured on your submit machine.
You can use the Python Workflow API to generate a site catalog. By default, Pegasus will pick up a file named sites.yml from the directory where the planner is invoked from. If you want to override the default sites created, or use other sites representing HPC clusters and so forth, refer to the Site Catalog in the reference guide here.
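If you do define your own site catalog, a sketch with the Python API might look like this (the site name, directory paths, and URLs are illustrative):

```python
from Pegasus.api import (
    OS,
    Arch,
    Directory,
    FileServer,
    Operation,
    Site,
    SiteCatalog,
)

sc = SiteCatalog()

# An HPC cluster site; name, paths, and URLs are assumptions for illustration
cluster = Site("cluster", arch=Arch.X86_64, os_type=OS.LINUX)
cluster.add_directories(
    Directory(Directory.SHARED_SCRATCH, "/lustre/scratch").add_file_servers(
        FileServer("file:///lustre/scratch", Operation.ALL)
    ),
    Directory(Directory.SHARED_STORAGE, "/lustre/storage").add_file_servers(
        FileServer("file:///lustre/storage", Operation.ALL)
    ),
)
sc.add_sites(cluster)

# Written out as sites.yml by default
sc.write()
```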
5.3. Best Practices For Developing Portable Code¶
This section lists out issues for application developers to keep in mind while developing code that will be run by Pegasus in a distributed computing environment.
5.3.1. Applications cannot specify the directory in which they should be run¶
Application codes are either installed in some standard location at the compute sites or staged on demand. When they are invoked, they are not invoked from the directories where they are installed. Therefore, they should work when invoked from any directory.
5.3.2. No hard-coded paths¶
The applications should not hard-code directory paths, as such paths may become unusable when the application runs on different sites. Rather, these paths should be passed via command-line arguments to the job or picked up from environment variables, to increase portability.
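For example, a small Python sketch that takes its paths from the command line, falling back to environment variables (the variable names are illustrative):

```python
import argparse
import os

# Take paths from arguments or the environment instead of hard-coding them
parser = argparse.ArgumentParser()
parser.add_argument("--input", default=os.environ.get("INPUT_FILE"))
parser.add_argument("--output", default=os.environ.get("OUTPUT_FILE"))
args = parser.parse_args()

with open(args.input) as src, open(args.output, "w") as dst:
    dst.write(src.read().upper())
```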
5.3.3. Propagating back the right exitcode¶
A job in the workflow is only released for execution if its parents have executed successfully. Hence, it is very important that applications exit with the correct error code in case of success and failure. The application should exit with a status of 0 to indicate a successful execution, or a non-zero status to indicate that an error occurred. Failure to do so will result in erroneous workflow execution, where jobs may be released for execution even though their parents exited with an error.
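For example, a Python application can make its exit status explicit (run_computation below is a placeholder for the real application logic):

```python
import sys

def run_computation() -> None:
    # placeholder for the real application logic
    pass

def main() -> int:
    try:
        run_computation()
    except Exception as err:
        print(f"error: {err}", file=sys.stderr)
        return 1  # non-zero exit: Pegasus marks the job as failed
    return 0      # zero exit: success, child jobs can be released

if __name__ == "__main__":
    sys.exit(main())
```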
Successful execution of the application code can only be determined by an exitcode of 0. The application code should not rely upon something being written to stdout to designate success. For example, if the application writes SUCCESS to stdout and exits with a non-zero status, the job will still be marked as failed.
In *nix, a quick way to see if a code is exiting with the correct code is to execute the code and then execute echo $?.
```
$ component-x input-file.lisp
... some output ...
$ echo $?
0
```
If the code is not exiting correctly, it is necessary to wrap the code in a script that tests some final condition (such as the presence or format of a result file) and uses exit to return correctly.
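A sketch of such a wrapper in Python, reusing the component-x example from above (the result file name is an assumption):

```python
#!/usr/bin/env python3
import subprocess
import sys
from pathlib import Path

# Run the real application ...
result = subprocess.run(["component-x", "input-file.lisp"])

# ... then test a final condition instead of trusting its exit status
if result.returncode != 0 or not Path("result.out").is_file():
    sys.exit(1)  # propagate failure to Pegasus
sys.exit(0)
```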
5.3.5. Setting the Job Environment¶
Pegasus allows users to associate env profiles with jobs, which specify the environment variables that need to be set when the job executes. Sometimes this may be insufficient, as you may need to run a script at runtime on the compute node to determine the environment in which your job can execute.
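With the Python Workflow API, env profiles can be attached to a job like this (the variable names and values are illustrative):

```python
from Pegasus.api import File, Job

# Associate environment variables with a job via env profiles
job = Job("preprocess").add_inputs(File("f.a"))
job.add_env(DATA_DIR="/scratch/data", OMP_NUM_THREADS="4")
```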
If your job runs with PegasusLite (i.e. your data configuration is either condorio or nonsharedfs), Pegasus allows you to specify an environment setup script file that is sourced in the PegasusLite wrapper before your job is invoked. This setup script can be used to set up the environment for your job. Details on how to configure this can be found in the PegasusLite chapter.