2.7. Recovery from Failures

Executing workflows in a distributed environment can lead to failures. Often these are the result of the underlying infrastructure being temporarily unavailable, or of errors in the workflow setup, such as an incorrect executable being specified or an input file being unavailable.

In the case of transient infrastructure failures, such as a node in a cluster being temporarily down, Pegasus will automatically retry failed jobs. After a set number of retries (usually one), the failure becomes permanent, and the workflow eventually fails.
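The retry count can be tuned. A minimal sketch, assuming the planner reads a pegasus.properties file in the current directory and that the dagman profile namespace is usable there with its retry key (check your Pegasus version's documentation for the exact profile syntax); the value 3 is only an example:

```shell
# Assumption: profiles can be set in pegasus.properties as <namespace>.<key>;
# here we ask DAGMan to retry each failed job up to 3 times before giving up.
printf 'dagman.retry = 3\n' >> pegasus.properties

# Confirm the property was recorded:
grep 'dagman.retry' pegasus.properties
```

With a higher retry count, short outages are more likely to be absorbed without the workflow ever failing, at the cost of wasted attempts when the error is permanent.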

In most cases, these errors are correctable: either the resource comes back online or the application error is fixed. Once the errors are fixed, you may not want to start a new workflow, but instead resume from the point of failure. To do this, you can submit the rescue workflow that is automatically created when a failure occurs. A rescue workflow contains a description of only the work that remains to be done.
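Under the hood, rescue workflows follow the usual HTCondor DAGMan convention: each failed run leaves a numbered .rescue file next to the original .dag, and a resubmission picks up the highest-numbered one. The listing below is a simulated illustration of that naming scheme (the file names are created by hand, not taken from a real run):

```shell
# Simulate a submit directory after two failed attempts of the tutorial's
# split workflow (hypothetical files, for illustration only):
mkdir -p /tmp/rescue-demo && cd /tmp/rescue-demo
touch split-0.dag split-0.dag.rescue001 split-0.dag.rescue002

# Rescue files sort lexically, so the newest one is last; that is the
# one a resubmission would use:
ls split-0.dag.rescue* | tail -n 1
# → split-0.dag.rescue002
```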

2.7.1. Submitting Rescue Workflows

In this example, we will take our previously run workflow and introduce an error so that the workflow fails at runtime.

First, we will "hide" the input file by renaming it, which will cause a failure:

$ mv input/pegasus.html input/pegasus.html.bak
      

Now submit the workflow again:

$ ./plan_dax.sh split.dax
2015.10.22 20:20:08.299 PDT:
2015.10.22 20:20:08.307 PDT:   -----------------------------------------------------------------------
2015.10.22 20:20:08.312 PDT:   File for submitting this DAG to Condor           : split-0.dag.condor.sub
2015.10.22 20:20:08.323 PDT:   Log of DAGMan debugging messages                 : split-0.dag.dagman.out
2015.10.22 20:20:08.330 PDT:   Log of Condor library output                     : split-0.dag.lib.out
2015.10.22 20:20:08.339 PDT:   Log of Condor library error messages             : split-0.dag.lib.err
2015.10.22 20:20:08.346 PDT:   Log of the life of condor_dagman itself          : split-0.dag.dagman.log
2015.10.22 20:20:08.352 PDT:
2015.10.22 20:20:08.368 PDT:   -----------------------------------------------------------------------
2015.10.22 20:20:12.331 PDT:   Your database is compatible with Pegasus version: 4.5.3
2015.10.22 20:20:13.326 PDT:   Submitting to condor split-0.dag.condor.sub
2015.10.22 20:20:14.224 PDT:   Submitting job(s).
2015.10.22 20:20:14.254 PDT:   1 job(s) submitted to cluster 168.
2015.10.22 20:20:14.288 PDT:
2015.10.22 20:20:14.297 PDT:   Your workflow has been started and is running in the base directory:
2015.10.22 20:20:14.303 PDT:
2015.10.22 20:20:14.309 PDT:     /home/tutorial/split/submit/tutorial/pegasus/split/run0002
2015.10.22 20:20:14.315 PDT:
2015.10.22 20:20:14.321 PDT:   *** To monitor the workflow you can run ***
2015.10.22 20:20:14.326 PDT:
2015.10.22 20:20:14.332 PDT:     pegasus-status -l /home/tutorial/split/submit/tutorial/pegasus/split/run0002
2015.10.22 20:20:14.351 PDT:
2015.10.22 20:20:14.369 PDT:   *** To remove your workflow run ***
2015.10.22 20:20:14.376 PDT:
2015.10.22 20:20:14.388 PDT:     pegasus-remove /home/tutorial/split/submit/tutorial/pegasus/split/run0002
2015.10.22 20:20:14.397 PDT:
2015.10.22 20:20:16.146 PDT:   Time taken to execute is 10.292 seconds

We will now monitor the workflow using the pegasus-status command until it fails. The -w option tells pegasus-status to watch the workflow automatically until it finishes:

$ pegasus-status -w submit/tutorial/pegasus/split/run0002
(no matching jobs found in Condor Q)
UNREADY   READY     PRE  QUEUED    POST SUCCESS FAILURE %DONE
      8       0       0       0       0       2       1  18.2
Summary: 1 DAG total (Failure:1)

Now we can use the pegasus-analyzer command to determine what went wrong:

$ pegasus-analyzer submit/tutorial/pegasus/split/run0002

************************************Summary*************************************

 Submit Directory   : submit/tutorial/pegasus/split/run0002
 Total jobs         :     11 (100.00%)
 # jobs succeeded   :      2 (18.18%)
 # jobs failed      :      1 (9.09%)
 # jobs unsubmitted :      8 (72.73%)

******************************Failed jobs' details******************************

===========================stage_in_remote_local_0_0============================

 last state: POST_SCRIPT_FAILED
       site: local
submit file: stage_in_remote_local_0_0.sub
output file: stage_in_remote_local_0_0.out.001
 error file: stage_in_remote_local_0_0.err.001

-------------------------------Task #1 - Summary--------------------------------

site        : local
hostname    : unknown
executable  : /usr/local/bin/pegasus-transfer
arguments   :   --threads   2
exitcode    : 1
working dir : /home/tutorial/split/submit/tutorial/pegasus/split/run0002

------------------Task #1 - pegasus::transfer - None - stdout-------------------

2016-02-18 11:52:58,189    INFO:  Reading URL pairs from stdin
2016-02-18 11:52:58,189    INFO:  PATH=/usr/local/bin:/usr/bin:/bin
2016-02-18 11:52:58,189    INFO:  LD_LIBRARY_PATH=
2016-02-18 11:52:58,189    INFO:  1 transfers loaded
2016-02-18 11:52:58,189    INFO:  Sorting the tranfers based on transfer type and source/destination
2016-02-18 11:52:58,190    INFO:  --------------------------------------------------------------------------------
2016-02-18 11:52:58,190    INFO:  Starting transfers - attempt 1
2016-02-18 11:52:58,190    INFO:  Using 1 threads for this round of transfers
2016-02-18 11:53:00,205   ERROR:  Command exited with non-zero exit code (1): /bin/cp -f -R -L '/home/tutorial/split/input/pegasus.html' '/home/tutorial/split/scratch/tutorial/pegasus/split/run0002/pegasus.html'
2016-02-18 11:54:46,205    INFO:  --------------------------------------------------------------------------------
2016-02-18 11:54:46,205    INFO:  Starting transfers - attempt 2
2016-02-18 11:54:46,205    INFO:  Using 1 threads for this round of transfers
2016-02-18 11:54:48,220   ERROR:  Command exited with non-zero exit code (1): /bin/cp -f -R -L '/home/tutorial/split/input/pegasus.html' '/home/tutorial/split/scratch/tutorial/pegasus/split/run0002/pegasus.html'
2016-02-18 11:55:24,224    INFO:  --------------------------------------------------------------------------------
2016-02-18 11:55:24,224    INFO:  Starting transfers - attempt 3
2016-02-18 11:55:24,224    INFO:  Using 1 threads for this round of transfers
2016-02-18 11:55:26,240   ERROR:  Command exited with non-zero exit code (1): /bin/cp -f -R -L '/home/tutorial/split/input/pegasus.html' '/home/tutorial/split/scratch/tutorial/pegasus/split/run0002/pegasus.html'
2016-02-18 11:55:26,240    INFO:  --------------------------------------------------------------------------------
2016-02-18 11:55:26,240    INFO:  Stats: no local files in the transfer set
2016-02-18 11:55:26,240 CRITICAL:  Some transfers failed! See above, and possibly stderr.


-------------Task #1 - pegasus::transfer - None - Kickstart stderr--------------

cp: /home/tutorial/split/input/pegasus.html: No such file or directory
cp: /home/tutorial/split/input/pegasus.html: No such file or directory
cp: /home/tutorial/split/input/pegasus.html: No such file or directory

The listing above indicates that the stage-in job could not transfer pegasus.html. Let's correct the error by restoring the pegasus.html file:

$ mv input/pegasus.html.bak input/pegasus.html
      

Now, to restart the workflow from where it left off, we will not run pegasus-plan again; instead, we run pegasus-run on the submit directory of our previous, failed workflow run:

$ pegasus-run submit/tutorial/pegasus/split/run0002/
Rescued /home/tutorial/split/submit/tutorial/pegasus/split/run0002/split-0.log as /home/tutorial/split/submit/tutorial/pegasus/split/run0002/split-0.log.000
Submitting to condor split-0.dag.condor.sub
Submitting job(s).
1 job(s) submitted to cluster 181.

Your workflow has been started and is running in the base directory:

  submit/tutorial/pegasus/split/run0002/

*** To monitor the workflow you can run ***

  pegasus-status -l submit/tutorial/pegasus/split/run0002/

*** To remove your workflow run ***

  pegasus-remove submit/tutorial/pegasus/split/run0002/

The workflow will now run to completion and succeed.

$ pegasus-status -l submit/tutorial/pegasus/split/run0002/
(no matching jobs found in Condor Q)
UNRDY READY   PRE  IN_Q  POST  DONE  FAIL %DONE STATE   DAGNAME
    0     0     0     0     0    11     0 100.0 Success *split-0.dag
Summary: 1 DAG total (Success:1)