10.6. Data Cleanup

When executing large workflows, users may often run out of disk space on the remote clusters / staging site. Pegasus provides a couple of ways of enabling automated data cleanup on the staging site (i.e. the scratch space used by the workflows). This is achieved by adding data cleanup jobs to the executable workflow that the Pegasus Mapper generates. These cleanup jobs are responsible for removing files and directories during the workflow execution. To enable data cleanup you can pass the --cleanup option to pegasus-plan (an example invocation follows the list below). The value passed decides the cleanup strategy implemented:

  1. none disables cleanup altogether. The planner does not add any cleanup jobs to the executable workflow whatsoever.

  2. leaf the planner adds a leaf cleanup node per staging site that removes the directory created by the create dir job in the workflow.

  3. inplace the mapper adds cleanup nodes per level of the workflow in addition to leaf cleanup nodes. The nodes remove files no longer required during execution. For example, an added cleanup node will remove the input files for a particular compute job after the job has finished successfully. Starting with the 4.8.0 release, the number of cleanup nodes created by this algorithm on a particular level is dictated by the number of nodes it encounters on that level of the workflow.

  4. constraint the mapper adds cleanup nodes to constrain the amount of storage space used by a workflow, in addition to leaf cleanup nodes. The nodes remove files no longer required during execution. The added cleanup nodes guarantee limits on disk usage. File sizes are read from the size flag in the DAX, or from a CSV file (pegasus.file.cleanup.constraint.csv).
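
For example, a minimal sketch of a pegasus-plan invocation that enables the inplace strategy is shown below. The properties file, DAX file, and site names are placeholders for your own setup:

    # plan the workflow, adding per-level and leaf cleanup jobs to the executable workflow
    pegasus-plan --conf pegasus.properties \
                 --dax workflow.dax \
                 --sites condorpool \
                 --cleanup inplace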

Note

For large workflows with lots of files, the inplace strategy may take a long time, as the algorithm works at a per-file level to figure out when it is safe to remove a file.

The behaviour of the cleanup strategies implemented in the Pegasus Mapper can be controlled by the properties described here.
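
As a sketch, such properties go into the properties file passed to pegasus-plan via --conf. The snippet below only illustrates the two cleanup properties mentioned in this section; the CSV path is a placeholder:

    # control whether the cleanup setting is applied to sub workflows (see 10.6.1)
    pegasus.file.cleanup.scope = deferred

    # file sizes used by the constraint cleanup algorithm
    pegasus.file.cleanup.constraint.csv = /path/to/file-sizes.csv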

10.6.1. Data Cleanup in Hierarchical Workflows

By default, for hierarchical workflows the inplace cleanup is always turned off. This is because the cleanup algorithm (InPlace) does not work across sub workflows. For example, if you have two DAX jobs in your top level workflow and the child DAX job refers to a file generated during the execution of the parent DAX job, the InPlace cleanup algorithm, when applied to the parent DAX job, will result in the file being deleted while the sub workflow corresponding to the parent DAX job is executed. This would result in failure of the sub workflow corresponding to the child DAX job, as the deleted file is required to be present during its execution.

If there are no data dependencies across the DAX jobs, then you can enable the InPlace algorithm for the sub DAXes. To do this you can set the property

  • pegasus.file.cleanup.scope deferred

This will result in the cleanup option being picked up from the arguments for the DAX job in the top level DAX.
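
As an illustrative sketch, the cleanup option can then be carried on each sub workflow job in the top level DAX (shown here in the XML DAX format; the job id, file name, and cleanup value are placeholders, and the exact attributes depend on your DAX schema version):

    <!-- sub workflow job in the top level DAX; its arguments are passed to
         the pegasus-plan invocation for that sub workflow -->
    <dax id="ID0000001" file="child.dax">
        <argument>--cleanup inplace</argument>
    </dax>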