11.6. Job Throttling

Issue: For large workflows you may want to control the number of jobs released by DAGMan into the local HTCondor queue, or the number of remote jobs submitted.

Solution: HTCondor DAGMan has knobs that can be tuned at a per-workflow level to control its behavior. These knobs control how it interacts with the local HTCondor Schedd, to which it submits the jobs that are ready to run in a particular DAG. The knobs are exposed as DAGMan profiles (maxidle, maxjobs, maxpre and maxpost) that you can set in your properties file.

Table 11.4. Useful dagman Commands that can be specified in the properties file.

Property Key Description

Property Key: dagman.maxpre
Profile Key : MAXPRE
Scope       : Properties
Since       : 2.0
Type        : String

Sets the maximum number of PRE scripts within the DAG that may be running at one time.

Property Key: dagman.maxpost
Profile Key : MAXPOST
Scope       : Properties
Since       : 2.0
Type        : String

Sets the maximum number of POST scripts within the DAG that may be running at one time.

Property Key: dagman.maxjobs
Profile Key : MAXJOBS
Scope       : Properties
Since       : 2.0
Type        : String

Sets the maximum number of jobs within the DAG that will be submitted to HTCondor at one time.

Property Key: dagman.maxidle
Profile Key : MAXIDLE
Scope       : Properties
Since       : 2.0
Type        : String

Sets the maximum number of idle jobs allowed before HTCondor DAGMan stops submitting more jobs. Once idle jobs start to run, HTCondor DAGMan will resume submitting jobs. If the option is omitted, the number of idle jobs is unlimited.

Property Key: dagman.[CATEGORY-NAME].maxjobs
Profile Key : [CATEGORY-NAME].MAXJOBS
Scope       : Properties
Since       : 2.0
Type        : String

Sets the value of maxjobs for a particular category. Users can associate different categories with jobs on a per-job basis. However, the value of a DAGMan knob for a category can only be specified on a per-workflow basis in the properties.

Property Key: dagman.post.scope
Profile Key : POST.SCOPE
Scope       : Properties
Since       : 2.0
Type        : String

Scope for the postscripts.
  1. If set to all, each job in the workflow will have a postscript associated with it.

  2. If set to none, no job has a postscript associated with it. The none scope should be used if you are running vanilla, standard, or local universe jobs, as in those cases HTCondor traps the remote exit code correctly. The none scope is not recommended for grid universe jobs.

  3. If set to essential, only essential jobs have postscripts associated with them. At present the only non-essential job is the replica registration job.
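
To make the interaction between maxjobs and maxidle concrete, here is a minimal Python sketch of a throttled submit decision. It is a simplified model, not DAGMan's actual algorithm: the function name jobs_to_submit and its parameters are hypothetical, and treating a limit of 0 as "unlimited" is an assumption of the sketch.

```python
def jobs_to_submit(ready, queued, idle, maxjobs=0, maxidle=0):
    """Return how many ready jobs a throttled submitter would release now.

    ready   -- jobs whose dependencies are satisfied
    queued  -- jobs from this workflow already in the queue (idle or running)
    idle    -- jobs from this workflow in the queue that are not yet running
    maxjobs -- cap on queued jobs; 0 means no cap (assumption of this sketch)
    maxidle -- cap on idle jobs; 0 means no cap (assumption of this sketch)
    """
    allowed = ready
    if maxjobs > 0:
        # Room left before the total number of submitted jobs hits maxjobs.
        allowed = min(allowed, max(maxjobs - queued, 0))
    if maxidle > 0:
        # Room left before the number of idle jobs hits maxidle.
        allowed = min(allowed, max(maxidle - idle, 0))
    return allowed


if __name__ == "__main__":
    # 500 ready jobs, 180 already queued of which 40 are idle.
    print(jobs_to_submit(ready=500, queued=180, idle=40, maxjobs=200, maxidle=50))
```

In this model the tighter of the two limits wins: with maxjobs at 200 there is room for 20 more jobs, but with maxidle at 50 only 10 more idle jobs are allowed, so 10 jobs are released.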


Within a single workflow, you can also control the number of jobs submitted per type (or category) of jobs. To associate categories, you need to associate the dagman profile key named category with the jobs and specify the property dagman.[CATEGORY-NAME].* in the properties file. More information about HTCondor DAGMan categories can be found in the HTCondor documentation.

By default, Pegasus associates default category names with the following types of auxiliary jobs:

Table 11.5. Default Category names associated by Pegasus

DAGMan Category Name   Auxiliary Job Applied To         Default Value Assigned in Generated DAG File

stage-in               data stage-in jobs               10
stage-out              data stage-out jobs              10
stage-inter            inter site data transfer jobs    -
cleanup                data cleanup jobs                4
registration           registration jobs                1 (for file based RC)

Below is a sample properties file snippet that illustrates how category limits can be specified:

# pegasus properties file snippet illustrating 
# how to specify dagman categories for different types of jobs

dagman.stage-in.maxjobs 4
dagman.stage-out.maxjobs 1
dagman.cleanup.maxjobs 2
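
To check which category throttles a properties file actually sets, a short Python sketch like the following can extract them. The helper name category_limits is hypothetical, and the parser only handles the simple "key value" and "key = value" line forms shown here, not the full properties file syntax. The dagman.maxjobs line in the sample text is added for illustration only.

```python
import re

# Matches lines such as "dagman.stage-in.maxjobs 4" or "dagman.cleanup.maxjobs = 2".
# The workflow-wide key "dagman.maxjobs" has no category part and is not matched.
_CATEGORY_MAXJOBS = re.compile(
    r"^dagman\.([^.\s=:]+)\.maxjobs\s*[=:\s]\s*(\d+)\s*$"
)


def category_limits(properties_text):
    """Return {category name: maxjobs value} for every category throttle found."""
    limits = {}
    for raw in properties_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = _CATEGORY_MAXJOBS.match(line)
        if match:
            limits[match.group(1)] = int(match.group(2))
    return limits


sample = """\
# pegasus properties file snippet
dagman.maxjobs 100
dagman.stage-in.maxjobs 4
dagman.stage-out.maxjobs 1
dagman.cleanup.maxjobs 2
"""

if __name__ == "__main__":
    print(category_limits(sample))
```

The workflow-wide dagman.maxjobs entry is deliberately ignored, so the result contains only the per-category limits.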

HTCondor also exposes useful configuration parameters that can be specified in its configuration file (condor_config_val -config lists the HTCondor configuration files in use) to control job submission across workflows. Some of the parameters that you may want to tune are:

Table 11.6. Useful HTCondor Job Throttling Configuration Parameters

HTCondor Configuration Parameter Description

Parameter Name: START_LOCAL_UNIVERSE
Sample Value  : TotalLocalJobsRunning < 20

Most of the auxiliary jobs added by Pegasus (createdir, cleanup, registration and data cleanup) run in the local universe on the submit host. If you have a lot of workflows running, HTCondor may try to start too many local universe jobs, which may bring down your submit host. This global parameter is used to configure HTCondor not to launch too many local universe jobs.

Parameter Name: GRIDMANAGER_MAX_JOBMANAGERS_PER_RESOURCE
Sample Value  : Integer

For grid jobs of type gt2, limits the number of globus-job-manager processes that the condor_gridmanager lets run at a time on the remote head node. Allowing too many globus-job-managers to run causes severe load on the head node, possibly making it non-functional. The default value in HTCondor (as of version 8.3.5) is 10.

This parameter is useful when you are doing remote job submissions using HTCondor-G.

Parameter Name: GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE
Sample Value  : Integer

An integer value that limits the number of jobs that a condor_gridmanager daemon will submit to a resource. A comma-separated list of pairs that follows this integer limit specifies limits for specific remote resources. Each pair is a host name and the job limit for that host. Consider the example

GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE = 200, foo.edu, 50, bar.com, 100

In this example, all resources have a job limit of 200, except foo.edu, which has a limit of 50, and bar.com, which has a limit of 100. Limits specific to grid types can be set by appending the name of the grid type to the configuration variable name, as in the example

GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE_CREAM = 300

In this example, the job limit for all CREAM resources is 300. The default is 1000 (as of version 8.3.5).

This parameter is useful when you are doing remote job submissions using HTCondor-G.
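
The "default limit followed by host and limit pairs" format described for GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE can be illustrated with a small Python sketch that resolves the limit for a given host. It only mirrors the rule as described in the table; it is not how HTCondor parses the value internally, and the function name resource_limit is hypothetical.

```python
def resource_limit(value, host):
    """Resolve the submitted-jobs limit for `host` from a setting such as
    "200, foo.edu, 50, bar.com, 100".

    The first item is the default limit; the remaining items are
    (host name, limit) pairs that override it for specific hosts.
    """
    parts = [p.strip() for p in value.split(",") if p.strip()]
    if not parts or len(parts) % 2 == 0:
        # Expect one default plus zero or more complete (host, limit) pairs.
        raise ValueError("expected: <default>[, <host>, <limit>]...")
    default = int(parts[0])
    overrides = dict(zip(parts[1::2], (int(n) for n in parts[2::2])))
    return overrides.get(host, default)


if __name__ == "__main__":
    setting = "200, foo.edu, 50, bar.com, 100"
    for name in ("foo.edu", "bar.com", "other.org"):
        print(name, resource_limit(setting, name))
```

With the setting from the example, foo.edu resolves to 50, bar.com to 100, and any other host to the default of 200.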


11.6.1. Job Throttling Across Workflows

Issue: DAGMan throttling knobs are per workflow, and don't work across workflows. Is there a way to control how many jobs of each type run at a time across workflows?

Solution: While not possible in all cases, it is possible to throttle different types of jobs across workflows by leveraging HTCondor concurrency limits, provided you configure the jobs to run in the vanilla universe. Most of the Pegasus generated jobs (data transfer jobs and auxiliary jobs such as create dir, cleanup and registration) execute in the local universe, where concurrency limits don't work. To use this, you need to do the following:

  1. Get the local universe jobs to run locally in the vanilla universe. You can do this by associating the condor profiles universe and requirements with the local site in the site catalog, or individually with each Pegasus executable in the transformation catalog. Here is an example local site catalog entry:

     <site handle="local" arch="x86_64" os="LINUX">
          <directory type="shared-scratch" path="/shared-scratch/local">
             <file-server operation="all" url="file:///shared-scratch/local"/>
          </directory>
          <directory type="local-storage" path="/storage/local">
             <file-server operation="all" url="file:///storage/local"/>
          </directory>
    
          <!-- keys to make jobs scheduled to local site run on local site in vanilla universe -->
          <profile namespace="condor" key="universe">vanilla</profile>
          <profile namespace="condor" key="requirements">(Machine=="submit.example.com")</profile>
     </site>
    

    Replace the Machine value in requirements with the hostname of your submit host.

  2. Copy the condor_config.pegasus file from the share/pegasus/htcondor directory to your HTCondor config.d directory.

Starting with the Pegasus 4.5.1 release, the following concurrency limits can be associated with the different types of jobs Pegasus creates. To enable the generation of concurrency limits with the jobs, set the following property in your properties file:

pegasus.condor.concurrency.limits   true

Table 11.7. Pegasus Job Types To Condor Concurrency Limits

Pegasus Job Type               HTCondor Concurrency Limit (compatible with the distributed condor_config.pegasus)

Data Stagein Job               pegasus_transfer.stagein
Data Stageout Job              pegasus_transfer.stageout
Inter Site Data Transfer Job   pegasus_transfer.inter
Worker Package Staging Job     pegasus_transfer.worker
Create Directory Job           pegasus_auxillary.createdir
Data Cleanup Job               pegasus_auxillary.cleanup
Replica Registration Job       pegasus_auxillary.registration
Set XBit Job                   pegasus_auxillary.chmod
User Compute Job               pegasus_compute

Note

It is not recommended to set a limit for compute jobs unless you know what you are doing.
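
The mapping in Table 11.7 can also be written as a small lookup, for example when auditing generated submit files. The Python sketch below is illustrative only: the dictionary and the helper concurrency_limits_line are hypothetical names, and it assumes the limit reaches the job through HTCondor's concurrency_limits submit command. How Pegasus itself writes the limit into the submit file is not shown in this section.

```python
# Job type -> concurrency limit name, copied from Table 11.7.
# The "pegasus_auxillary" spelling is the identifier as listed in the table.
CONCURRENCY_LIMITS = {
    "Data Stagein Job": "pegasus_transfer.stagein",
    "Data Stageout Job": "pegasus_transfer.stageout",
    "Inter Site Data Transfer Job": "pegasus_transfer.inter",
    "Worker Package Staging Job": "pegasus_transfer.worker",
    "Create Directory Job": "pegasus_auxillary.createdir",
    "Data Cleanup Job": "pegasus_auxillary.cleanup",
    "Replica Registration Job": "pegasus_auxillary.registration",
    "Set XBit Job": "pegasus_auxillary.chmod",
    "User Compute Job": "pegasus_compute",
}


def concurrency_limits_line(job_type):
    """Return the submit-file line expected for a job type (assumed format)."""
    return "concurrency_limits = " + CONCURRENCY_LIMITS[job_type]


if __name__ == "__main__":
    print(concurrency_limits_line("Data Stagein Job"))
```

An unknown job type raises KeyError, which is a deliberate choice in this sketch so that a typo in a job type name is not silently ignored.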