11.1. pegasus-analyzer

pegasus-analyzer - debugs a workflow.

pegasus-analyzer [--help|-h] [--quiet|-q] [--strict|-s]
                 [--summary|-S] [--monitord|-m|-t] [--verbose|-v]
                 [--output-dir|-o output_dir]
                 [--dag dag_filename] [--dir|-d|-i input_dir]
                 [--print|-p print_options] [--type workflow_type]
                 [--debug-job job] [--debug-dir debug_dir]
                 [--local-executable local user executable]
                 [--conf|-c property_file] [--files]
                 [--top-dir dir_name] [--traverse-all|-T] [--recurse|-r]
                 [--indent|-I indent_length] [--json|-j]
                 [workflow_directory]

11.1.1. Description

pegasus-analyzer is a command-line utility for parsing the jobstate.log file and reporting successful and failed jobs. When executed without any options, it will query the SQLite or MySQL database and retrieve failed job information for the particular workflow. When invoked with the --files option, it will retrieve information from several log files, isolating jobs that did not complete successfully, and printing their stdout and stderr so that users can get detailed information about their workflow runs.

11.1.2. Options

-h; --help: Prints a usage summary with all the available command-line options.
-q; --quiet: Only print the output and error filenames instead of their contents.
-s; --strict: Get jobs’ output and error filenames from the job’s submit file.
-S; --summary: Just print the summary of the job status breakdown and exit.
-m; -t; --monitord: Invoke pegasus-monitord before analyzing the jobstate.log file. Although pegasus-analyzer can be executed during the workflow execution as well as after the workflow has already completed execution, pegasus-monitord is always invoked with the --replay option. Since multiple instances of pegasus-monitord should not be executed simultaneously in the same workflow directory, the user should ensure that no other instances of pegasus-monitord are running. If the run_directory is writable, pegasus-analyzer will create a jobstate.log file there, rotating an older log if one is found. If the run_directory is not writable (e.g. when the user debugging the workflow is not the same user that ran the workflow), pegasus-analyzer will exit and ask the user to provide the --output-dir option, in order to provide an alternative location for pegasus-monitord log files.
-v; --verbose: Sets the log level for pegasus-analyzer. If omitted, the default level will be set to WARNING. When this option is given, the log level is changed to INFO. If this option is repeated, the log level will be changed to DEBUG.
-o output_dir; --output-dir output_dir: This option provides an alternative location for all monitoring log files for a particular workflow. It is mainly used when a user does not have write privileges to a workflow directory and needs to generate the log files needed by pegasus-analyzer. If this option is used in conjunction with the --monitord option, it will invoke pegasus-monitord using output_dir to store all output files. Because workflows can have sub-workflows, pegasus-monitord will create its files prepending the workflow wf_uuid to each filename. This way, multiple workflow files can be stored in the same directory. pegasus-analyzer has built-in logic to find the specific jobstate.log file by looking at the workflow braindump.txt file first and figuring out the corresponding wf_uuid. If output_dir does not exist, it will be created.
--dag dag_filename: In this option, dag_filename specifies the path to the DAG file to use. pegasus-analyzer will get the directory information from the dag_filename. This option overrides the --dir option below.
-d input_dir; -i input_dir; --dir input_dir: Makes pegasus-analyzer look for the jobstate.log file in the input_dir directory. If this option is omitted, pegasus-analyzer will look in the current directory.
-p print_options; --print print_options: Tells pegasus-analyzer what extra information it should print for failed jobs. print_options is a comma-delimited list that can include pre, invocation, and/or all (which activates all printing options). With the pre option, pegasus-analyzer will print the pre-script information for failed jobs. With the invocation option, pegasus-analyzer will print the invocation command, so users can manually run the failed job.
--debug-job job: When given this option, pegasus-analyzer turns on its debug_mode, in which it can be used to debug a particular Pegasus Lite job. In this mode, pegasus-analyzer will create a shell script in the debug_dir (see below for how to specify it), copy all necessary files to this local directory, and then execute the job locally.
--debug-dir debug_dir: When in debug_mode, pegasus-analyzer will create a temporary debug directory. Users can give this option in order to specify a particular debug_dir directory to be used instead.
--local-executable local user executable: When in debug job mode for Pegasus Lite jobs, pegasus-analyzer creates a shell script to execute the Pegasus Lite job locally in a debug directory. The Pegasus Lite script refers to the remote user executable path. This option can be used to pass the local path to the user executable on the submit host. It is not needed if the path to the user executable in the Pegasus Lite job is the same as in the local installation.
--type workflow_type: With this option, users specify what workflow_type they want to debug. At this moment, the only workflow_type available is condor, and it is the default value if this option is not specified.
-c property_file; --conf property_file: This option is used to specify an alternative property file, which may contain the path to the database to be used by pegasus-analyzer. If this option is not specified, the config file specified in the braindump.txt file will take precedence.
--files: This option allows users to run pegasus-analyzer using the files in the workflow directory instead of the database as the source of information. pegasus-analyzer will output the same information; this option only changes where the data comes from.
--top-dir dir_name: This option enables pegasus-analyzer to show information about sub-workflows when using the database mode. When debugging a top-level workflow with failures in sub-workflows, the analyzer will automatically print the command users should use to debug a failed sub-workflow. This allows the analyzer to find the database it needs to access.
-T; --traverse-all: This option sets pegasus-analyzer to go through all the descendant workflows of the workflow running in the submit directory passed, regardless of whether the workflow has succeeded or failed. This option is useful when running pegasus-analyzer on a running hierarchical workflow, to detect failures in sub-workflows that are currently running. This option is mutually exclusive with the --recurse option, which recurses through only failed sub-workflow jobs.
-r; --recurse: This option sets pegasus-analyzer to automatically recurse into sub-workflows in case of failure. By default, if a workflow has a sub-workflow in it, and that sub-workflow fails, pegasus-analyzer reports that the sub-workflow node failed, and lists a command invocation that the user must execute to determine what jobs in the sub-workflow failed. If this option is set, then the analyzer automatically issues the command invocation and in addition displays the failed jobs in the sub-workflow. This option is mutually exclusive with the --traverse-all option, which traverses through all descendant workflows.
-I; --indent: This option sets the indent length to use when displaying results from invoking the command on a hierarchical workflow using the -r|--recurse option. It dictates the number of white spaces to use when indenting the pegasus-analyzer output for a sub-workflow.
-j; --json: This option returns the output from the analyzer in a JSON-serializable data structure (a Python dict). A sample of this structure is shown below; its keys are:

root_wf_uuid: uuid of the root workflow
submit_directory: submit directory of the root workflow
workflows: a dict containing Workflow objects
  root: key used for root workflow
    jobs: a dict containing Jobs objects
      total: total number of jobs
      success: number of jobs completed
      failed: number of jobs failed
      held: number of jobs held
      unsubmitted: number of jobs unsubmitted
      job_details: a dict containing details of all jobs
        job_type: failed_jobs or unknown_jobs or failing_jobs or held_jobs
          job: name of a specific job, contains JobInstance objects
            tasks: a dict containing Task objects

{
  "root_wf_uuid": "f84f05fc-a8d0-42b5-bac5-52d6f41a77e3",
  "submit_directory": "/home/mzalam/processwf/process-workflow/submit/mzalam/pegasus/process/run0001",
  "workflows": {
    "root": {
      "wf_uuid": "f84f05fc-a8d0-42b5-bac5-52d6f41a77e3",
      "dag_file_name": "process-0.dag",
      "submit_hostname": "workflow.isi.edu",
      "submit_dir": "/process-workflow/submit/mzalam/pegasus/process/run0001",
      "user": "mzalam",
      "planner_version": "5.0.5",
      "wf_name": "process",
      "wf_status": "failure",
      "parent_wf_name": "-",
      "parent_wf_uuid": "-",
      "jobs": {
        "total": 5,
        "success": 1,
        "failed": 1,
        "held": 0,
        "unsubmitted": 3,
        "job_details": {
          "failed_jobs_details": {
            "ls_ID0000001": {
              "job_name": "ls_ID0000001",
              "state": "POST_SCRIPT_FAILURE",
              "site": "condorpool",
              "hostname": "workflow.isi.edu",
              "work_dir": "/wf/condor/local/execute/dir_148537",
              "submit_file": "/process_wf_failure/00/00/ls_ID0000001.sub",
              "stdout_file": "/process_wf_failure/00/00/ls_ID0000001.out",
              "stderr_file": "/process_wf_failure/00/00/ls_ID0000001.err",
              "executable": "/process-workflow/submit/mzalam/pegasus/process/run0001/00/00/ls_ID0000001.sh",
              "argv": "",
              "pre_executable": "",
              "pre_argv": null,
              "submit_dir": null,
              "subwf_dir": "-",
              "stdout_text": "-",
              "stderr_text": "/bin/ls: invalid option -- 'z'\nTry '/bin/ls --help' for more information.\n",
              "tasks": {
                "1": {
                  "task_submit_seq": 1,
                  "exitcode": 2,
                  "executable": "/usr/bin/ls",
                  "arguments": "-",
                  "transformation": "ls",
                  "abs_task_id": "ID0000001"
                }
              }
            }
          }
        }
      }
    }
  }
}
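
As an illustration of how this structure might be consumed, here is a minimal Python sketch that walks a trimmed copy of the sample above and lists the failed jobs. The helper name failed_job_summaries is hypothetical and not part of Pegasus, and the key names are taken only from the sample shown here, so treat them as assumptions for other versions or job types.

```python
import json

# Trimmed copy of the sample structure above; only the keys used below are kept.
SAMPLE = """
{
  "root_wf_uuid": "f84f05fc-a8d0-42b5-bac5-52d6f41a77e3",
  "workflows": {
    "root": {
      "wf_name": "process",
      "wf_status": "failure",
      "jobs": {
        "total": 5, "success": 1, "failed": 1, "held": 0, "unsubmitted": 3,
        "job_details": {
          "failed_jobs_details": {
            "ls_ID0000001": {
              "state": "POST_SCRIPT_FAILURE",
              "stderr_text": "/bin/ls: invalid option -- 'z'\\n",
              "tasks": {"1": {"exitcode": 2, "transformation": "ls"}}
            }
          }
        }
      }
    }
  }
}
"""


def failed_job_summaries(report):
    """Yield (workflow_key, job_name, state, task_exit_codes) per failed job.

    Hypothetical helper: key names follow the sample structure only.
    """
    for wf_key, wf in report["workflows"].items():
        details = wf["jobs"].get("job_details", {})
        for job_name, job in details.get("failed_jobs_details", {}).items():
            exit_codes = [t["exitcode"] for t in job.get("tasks", {}).values()]
            yield wf_key, job_name, job["state"], exit_codes


report = json.loads(SAMPLE)
summaries = list(failed_job_summaries(report))
for wf_key, job_name, state, exit_codes in summaries:
    print(f"{wf_key}: {job_name} ended in {state}, task exit codes {exit_codes}")
```

Using .get() with empty defaults keeps the sketch tolerant of workflows that report no failed jobs.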

11.1.3. Environment Variables

pegasus-analyzer does not require any environment variables to be set. It locates its required Python modules based on its own location, and therefore should not be moved outside of Pegasus’ bin directory.

11.1.4. Example

The simplest way to use pegasus-analyzer is to go to the run_directory and invoke the analyzer:

$ pegasus-analyzer .

which will cause pegasus-analyzer to print information about the workflow in the current directory.
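
When the analyzer is driven from a script, the invocation can be assembled programmatically. The following is a minimal sketch, assuming only the long option names listed in the Options section; the function build_analyzer_command and its parameters are hypothetical, and the sketch only builds the argument list rather than running the tool.

```python
import shlex


def build_analyzer_command(submit_dir, quiet=False, files=False,
                           recurse=False, traverse_all=False, verbosity=0):
    """Return an argv list for pegasus-analyzer (hypothetical helper)."""
    if recurse and traverse_all:
        # The Options section describes these two as mutually exclusive.
        raise ValueError("--recurse and --traverse-all are mutually exclusive")
    argv = ["pegasus-analyzer"]
    if quiet:
        argv.append("--quiet")
    if files:
        argv.append("--files")
    if recurse:
        argv.append("--recurse")
    if traverse_all:
        argv.append("--traverse-all")
    # Repeating --verbose raises the log level (INFO, then DEBUG).
    argv.extend(["--verbose"] * verbosity)
    argv.append(submit_dir)
    return argv


cmd = build_analyzer_command(".", quiet=True, recurse=True)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run on a host where pegasus-analyzer is installed; checking the mutual exclusion up front avoids launching the tool with conflicting options.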

The pegasus-analyzer output contains a summary, followed by detailed information about each job that either failed or is in an unknown state. Here is the summary section of the output:

**************************Summary***************************

 Total jobs         :     75 (100.00%)
 # jobs succeeded   :     41 (54.67%)
 # jobs failed      :      0 (0.00%)
 # jobs held        :      1 (1.33%)
 # jobs unsubmitted :     33 (44.00%)
 # jobs unknown     :      1 (1.33%)

jobs_succeeded are jobs that have completed successfully. jobs_failed are jobs that have finished but did not complete successfully. jobs_unsubmitted are jobs that are listed in the dag_file but for which no information was found in the jobstate.log file. jobs_held are jobs that were in the HTCondor HELD state on the last retry of the job. Because Pegasus adds a periodic_remove expression to jobs by default, a held job can eventually fail; in that case, the held job also appears as a failed job. Finally, jobs_unknown are jobs that have started but have not reached completion.
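
The percentages in the summary above are consistent with each count being divided by the total number of jobs (for example, 41 / 75 = 54.67%). A minimal sketch of that arithmetic follows; summary_lines is a hypothetical helper that mimics the layout of the sample, not the analyzer's own code.

```python
def summary_lines(total, counts):
    """Format counts as 'label : count (percent%)' relative to the total.

    Hypothetical helper that imitates the sample summary layout.
    """
    lines = [f" {'Total jobs':<19}: {total:>6} (100.00%)"]
    for label, count in counts.items():
        # Guard against an empty workflow to avoid dividing by zero.
        pct = 100.0 * count / total if total else 0.0
        lines.append(f" {'# jobs ' + label:<19}: {count:>6} ({pct:.2f}%)")
    return lines


counts = {"succeeded": 41, "failed": 0, "held": 1, "unsubmitted": 33, "unknown": 1}
lines = summary_lines(75, counts)
print("\n".join(lines))
```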

After the summary section, pegasus-analyzer will display information about each job in the jobs_failed and jobs_unknown categories.

*******************************Held jobs' details*******************************

====================================sleep_j2====================================

        submit file            : sleep_j2.sub
        last_job_instance_id   : 7
        reason                 :  Error from slot1@corbusier.isi.edu:
                                  STARTER at 128.9.64.188 failed to
                                  send file(s) to
                                  <128.9.64.188:62639>: error reading from
                                  /opt/condor/8.4.8/local.corbusier/execute/dir_76205/f.out:
                                  (errno 2) No such file or directory;
                                 SHADOW failed to receive file(s) from <128.9.64.188:62653>

In the above example, the sleep_j2 job was held, and the analyzer displays the reason why it was held, as determined from the dagman.out file for the workflow. The last_job_instance_id is the database id for the job in the job instance table of the monitoring database.

******************Failed jobs' details**********************

=======================findrange_j3=========================

  last state: POST_SCRIPT_FAILURE
        site: local
 submit file: /home/user/diamond-submit/findrange_j3.sub
 output file: /home/user/diamond-submit/findrange_j3.out.000
  error file: /home/user/diamond-submit/findrange_j3.err.000

--------------------Task #1 - Summary-----------------------

 site        : local
 hostname    : server-machine.domain.com
 executable  : (null)
 arguments   : -a findrange -T 60 -i f.b2 -o f.c2
 error       : 2
 working dir :

In the example above, the findrange_j3 job has failed, and the analyzer displays information about the job, showing that the job finished with a POST_SCRIPT_FAILURE, and lists the submit, output and error files for this job. Whenever pegasus-analyzer detects that the output file contains a kickstart record, it will display the breakdown containing each task in the job (in this case we only have one task). Because pegasus-analyzer was not invoked with the --quiet flag, it will also display the contents of the output and error files (or the stdout and stderr sections of the kickstart record), which in this case are both empty.

For SUBDAG and subdax jobs, pegasus-analyzer will indicate that the job contains sub-workflows and show the command needed for the user to debug that sub-workflow. For example:

=================subdax_black_ID000009=====================

  last state: JOB_FAILURE
        site: local
 submit file: /home/user/run1/subdax_black_ID000009.sub
 output file: /home/user/run1/subdax_black_ID000009.out
  error file: /home/user/run1/subdax_black_ID000009.err
  This job contains sub workflows!
  Please run the command below for more information:
  pegasus-analyzer -d /home/user/run1/blackdiamond_ID000009.000

-----------------subdax_black_ID000009.out-----------------

Executing condor dagman ...

-----------------subdax_black_ID000009.err-----------------

This output tells the user that the subdax_black_ID000009 sub-workflow failed, and that it can be debugged by using the indicated pegasus-analyzer command.
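
If the text output is captured by a script, the suggested follow-up command can be pulled out with a regular expression. This is a minimal sketch that assumes the suggestion appears on a line of its own starting with pegasus-analyzer, as in the sample above; suggested_commands is a hypothetical helper, and the exact output layout may differ between versions.

```python
import re

# Excerpt of the sample output shown above.
SAMPLE_OUTPUT = """\
  This job contains sub workflows!
  Please run the command below for more information:
  pegasus-analyzer -d /home/user/run1/blackdiamond_ID000009.000
"""


def suggested_commands(text):
    """Return lines of analyzer output that are pegasus-analyzer commands.

    Hypothetical helper; assumes each suggestion sits on its own line.
    """
    pattern = r"^[ \t]*(pegasus-analyzer[ \t]+\S.*?)[ \t]*$"
    return re.findall(pattern, text, flags=re.MULTILINE)


commands = suggested_commands(SAMPLE_OUTPUT)
for command in commands:
    print(command)
```

The --recurse option described earlier makes the analyzer issue these commands itself, so a sketch like this is mainly useful when post-processing saved output.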

11.1.5. See Also

pegasus-status(1), pegasus-monitord(1), pegasus-statistics(1).