Scientific workflows allow users to easily express multi-step computational tasks, for example retrieving data from an instrument or a database, reformatting the data, and running an analysis. A scientific workflow describes the dependencies between tasks, and in most cases the workflow is expressed as a directed acyclic graph (DAG), where the nodes are tasks and the edges denote the dependencies between them. A defining property of a scientific workflow is that it manages the flow of data between tasks. The tasks in a scientific workflow can range from short serial tasks to very large parallel tasks (MPI, for example), often surrounded by a large number of small serial tasks used for pre- and post-processing.
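Because the dependencies form a DAG, any topological order of the tasks is a valid serial execution order. As a minimal sketch (not part of any workflow system), the standard `tsort` utility computes such an order from a list of dependency edges; the task names here are hypothetical:

```shell
# Each line "A B" is an edge meaning task A must finish before task B.
# fetch, reformat, and analyze are made-up task names for illustration.
printf '%s\n' "fetch reformat" "reformat analyze" > deps.txt
tsort deps.txt    # prints fetch, reformat, analyze: a valid execution order
```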
Workflows can vary from simple to complex; some examples follow. In the figures below, tasks are represented by circles/ellipses, the files created by the tasks by rectangles, and task dependencies by arrows.
Process

The process workflow consists of a single task that runs the `ls` command and generates a listing of the files in the `/` directory.
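This single-task workflow maps onto one shell command; a minimal sketch capturing both the task and the output file it produces (the output file name is hypothetical):

```shell
# One task: list the contents of / and capture the listing
# in an output file (listing.txt is an illustrative name).
ls / > listing.txt
```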
Pipeline of Tasks
The pipeline workflow consists of two tasks linked together in a pipeline. The first job runs the `curl` command to fetch the Pegasus home page and store it as an HTML file. The result is passed to the `wc` command, which counts the number of lines in the HTML file.
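The two pipeline jobs can be sketched in plain shell. To keep the sketch runnable offline, a `printf` stands in for the `curl` fetch; the real first job would run something like `curl -o pegasus.html https://pegasus.isi.edu`:

```shell
# First job (stand-in for the curl fetch): produce the HTML file.
printf '<html>\n<body>Pegasus home page</body>\n</html>\n' > pegasus.html
# Second job: count the lines in the fetched page.
wc -l < pegasus.html    # -> 3 for this stand-in page
```

The second job depends on the first only through the file `pegasus.html`, which is exactly the data-flow dependency the workflow DAG records.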
Split

The split workflow downloads the Pegasus home page using the `curl` command, then uses the `split` command to divide the page into 4 pieces. Each piece is passed to the `wc` command to count its lines.
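A plain-shell sketch of the split pattern, again with a local stand-in file in place of the `curl` download so it runs offline:

```shell
# Stand-in for the downloaded page: 8 lines of throwaway content.
printf 'line %s\n' 1 2 3 4 5 6 7 8 > page.html
# Divide into pieces of 2 lines each: 4 files named piece.aa .. piece.ad.
split -l 2 page.html piece.
# One wc job per piece, counting the lines in each.
wc -l piece.*
```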
Merge

The merge workflow runs the `ls` command on several `*/bin` directories and passes the results to the `cat` command, which merges the individual listings into a single one. The merge workflow is an example of a parameter sweep over arguments.
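The merge pattern can be sketched as one `ls` job per directory followed by a single `cat` job; the two directories chosen here are just examples of the swept argument:

```shell
# Parameter sweep: the same ls task run with different directory arguments.
ls /bin     > listing_bin.txt
ls /usr/bin > listing_usr_bin.txt
# Merge job: concatenate the per-directory listings into one file.
cat listing_bin.txt listing_usr_bin.txt > merged.txt
```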
Diamond

The diamond workflow combines the split and merge patterns to create a more complex workflow.
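A minimal sketch of the diamond shape: a split fans one input out into independent per-piece jobs, and a merge fans their results back in (all file names are illustrative):

```shell
# Single input file for the whole workflow.
printf 'line %s\n' 1 2 3 4 > input.txt
# Fan out (split pattern): two pieces, piece.aa and piece.ab.
split -l 2 input.txt piece.
# Independent per-piece jobs that could run in parallel.
wc -l < piece.aa > count.aa
wc -l < piece.ab > count.ab
# Fan in (merge pattern): combine the per-piece results.
cat count.aa count.ab > summary.txt
```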
The above examples can be used as building blocks for much more complex workflows. Some of these are showcased on the Pegasus Applications page.