7.2. Condor Pool

An HTCondor pool is a set of machines that use HTCondor for resource management. An HTCondor pool can be a cluster of dedicated machines or a set of distributively owned machines. Pegasus can generate concrete workflows that can be executed on an HTCondor pool.

Figure 7.1. The distributed resources appear to be part of an HTCondor pool.

The workflow is submitted using DAGMan from one of the job submission machines in the HTCondor pool. It is the responsibility of the Central Manager of the pool to match the tasks in the workflow submitted by DAGMan to the execution machines in the pool. This matching process can be guided by including HTCondor-specific attributes in the submit files of the tasks. If the user wants to execute the workflow on the execution machines (worker nodes) in an HTCondor pool, there should be a resource defined in the site catalog which represents these execution machines. The universe attribute of the resource should be vanilla. There can be multiple resources associated with a single HTCondor pool, where each resource identifies a subset of machines (worker nodes) in the pool.
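
For reference, a workflow is typically planned and handed off to DAGMan with pegasus-plan from one of the submit machines. The following is a minimal sketch, assuming a workflow description named workflow.dax, a properties file named pegasus.properties, and an execution site with handle condorpool in the site catalog (exact option names can vary between Pegasus versions):

pegasus-plan --conf pegasus.properties \
             --dax workflow.dax \
             --sites condorpool \
             --output-site local \
             --dir work \
             --submit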

When running on an HTCondor pool, the user has to decide how Pegasus should transfer data. Please see the Data Staging Configuration section for the options. The easiest option is to use condorio, as that mode does not require any extra setup - HTCondor will do the transfers using the existing HTCondor daemons. For an example of this mode, see the example workflow in share/pegasus/examples/condor-blackdiamond-condorio/ . In condorio mode, the site catalog for the execution site is very simple, as storage is provided by HTCondor:

<?xml version="1.0" encoding="UTF-8"?>
<sitecatalog xmlns="http://pegasus.isi.edu/schema/sitecatalog"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://pegasus.isi.edu/schema/sitecatalog http://pegasus.isi.edu/schema/sc-4.0.xsd"
             version="4.0">

    <site  handle="local" arch="x86_64" os="LINUX">
        <directory type="shared-scratch" path="/tmp/wf/work">
            <file-server operation="all" url="file:///tmp/wf/work"/>
        </directory>
        <directory type="local-storage" path="/tmp/wf/storage">
            <file-server operation="all" url="file:///tmp/wf/storage"/>
        </directory>
    </site>

    <site  handle="condorpool" arch="x86_64" os="LINUX">
        <profile namespace="pegasus" key="style" >condor</profile>
        <profile namespace="condor" key="universe" >vanilla</profile>
    </site>

</sitecatalog>
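
The data staging mode itself is selected in the properties file rather than in the site catalog. A minimal sketch to accompany the site catalog above (the site catalog file name is just an example):

# pegasus properties
# path to the site catalog shown above (example file name)
pegasus.catalog.site.file     sites.xml
# let HTCondor stage the data with its own file transfer mechanism
pegasus.data.configuration    condorio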

There is a set of HTCondor profiles that are commonly used when running Pegasus workflows. You may have to set some or all of these, depending on the setup of the HTCondor pool:

  <!-- Change the style to HTCondor for jobs to be executed in the HTCondor Pool.
       By default, Pegasus creates jobs suitable for grid execution. -->
  <profile namespace="pegasus" key="style">condor</profile>

  <!-- Change the universe to vanilla to make the jobs go to remote compute
       nodes. The default is local which will only run jobs on the submit host -->
  <profile namespace="condor" key="universe" >vanilla</profhile>

  <!-- The requirements expression allows you to limit where your jobs go -->
  <profile namespace="condor" key="requirements">(Target.FileSystemDomain != &quot;yggdrasil.isi.edu&quot;)</profile>

  <!-- The following two profiles force HTCondor to always transfer files. This
       has to be used if the pool does not have a shared filesystem -->
  <profile namespace="condor" key="should_transfer_files">True</profile>
  <profile namespace="condor" key="when_to_transfer_output">ON_EXIT</profile>

7.2.1. Glideins

In this section we describe how machines from different administrative domains and supercomputing centers can be dynamically added to an HTCondor pool for a certain timeframe. These machines join the HTCondor pool temporarily and can be used to execute jobs in a non-preemptive manner. This functionality is achieved using an HTCondor feature called glideins (see http://cs.wisc.edu/condor/glidein). The startd daemon is the HTCondor daemon which provides the compute slots and runs the jobs. In the glidein case, the submit machine is usually a static machine, and the glideins are configured to report to that submit machine. The glideins can be submitted to any type of resource: a GRAM enabled cluster, a campus cluster, a cloud environment such as Amazon AWS, or even another HTCondor cluster.

Tip

As glideins usually come from different compute resources, and/or run in an administrative domain different from the submit node, there is usually no shared filesystem available. Thus the most common data staging modes are condorio and nonsharedfs.
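
In nonsharedfs mode, the data is placed on a staging site that both the submit host and the glideins can reach over the network. A minimal sketch of such a staging site entry, assuming a host named staging.example.edu that is reachable via scp (hostname, path and protocol are examples only; the staging site is then selected at planning time, for example with pegasus-plan's --staging-site option):

    <site  handle="staging" arch="x86_64" os="LINUX">
        <directory type="shared-scratch" path="/data/scratch">
            <file-server operation="all" url="scp://user@staging.example.edu/data/scratch"/>
        </directory>
    </site>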

There are many useful tools that submit and manage glideins for you:

  • GlideinWMS is a tool and host environment used mostly on the Open Science Grid.

  • CorralWMS is a personal frontend for GlideinWMS. CorralWMS was developed by the Pegasus team and works very well for high throughput workflows.

  • condor_glidein is a simple glidein tool for Globus GRAM clusters. condor_glidein is shipped with HTCondor.

  • Glideins can also be created by hand or with scripts. This is a useful solution, for example, for clusters which have no external job submission mechanisms or do not allow outside networking; a minimal configuration sketch is shown below.
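
As an illustration of the by-hand approach, a glidein is essentially the HTCondor worker daemons started with a small configuration that points them back at the static submit machine. The following is a minimal configuration sketch, assuming the central manager / submit machine is named submit.example.edu (all values are examples; a real setup also has to handle security settings and the glidein lifetime):

# minimal glidein configuration sketch - adjust for your pool
CONDOR_HOST = submit.example.edu         # machine the glidein reports to
DAEMON_LIST = MASTER, STARTD             # run only the master and the startd on the glidein node
START = True                             # accept jobs as soon as the slot comes up
STARTD_NOCLAIM_SHUTDOWN = 1800           # shut down after 30 minutes without a claim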

7.2.2. CondorC

Using HTCondorC, users can submit workflows to remote HTCondor pools. HTCondorC is an HTCondor-specific solution for remote submission that does not involve setting up GRAM on the headnode. To enable HTCondorC submission to a site, the user needs to associate the Pegasus profile key named style with the value condorc. If the remote HTCondor pool does not have a shared filesystem between the nodes making up the pool, users should run Pegasus in the condorio data configuration. In this mode, all the data is staged to the remote node in the HTCondor pool using HTCondor file transfers, and the jobs are executed using PegasusLite.

A sample site catalog for submission to an HTCondorC-enabled site is listed below:

<sitecatalog xmlns="http://pegasus.isi.edu/schema/sitecatalog"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://pegasus.isi.edu/schema/sitecatalog http://pegasus.isi.edu/schema/sc-4.0.xsd"
             version="4.0">
      
    <site  handle="local" arch="x86_64" os="LINUX">
        <directory type="shared-scratch" path="/tmp/wf/work">
            <file-server operation="all" url="file:///tmp/wf/work"/>
        </directory>
        <directory type="local-storage" path="/tmp/wf/storage">
            <file-server operation="all" url="file:///tmp/wf/storage"/>
        </directory>
    </site>

    <site  handle="condorcpool" arch="x86_86" os="LINUX">
         <!-- the grid gateway entries are used to designate
              the remote schedd for the HTCondorC pool -->
         <grid type="condor" contact="ccg-condorctest.isi.edu" scheduler="Condor" jobtype="compute" />
         <grid type="condor" contact="ccg-condorctest.isi.edu" scheduler="Condor" jobtype="auxillary" />
        
        <!-- enable submission using HTCondorC -->
        <profile namespace="pegasus" key="style">condorc</profile>

        <!-- specify which HTCondor collector to use.
             If not specified, it defaults to the remote schedd specified in the grid gateway -->
        <profile namespace="condor" key="condor_collector">condorc-collector.isi.edu</profile>
        
        <profile namespace="condor" key="should_transfer_files">Yes</profile>
        <profile namespace="condor" key="when_to_transfer_output">ON_EXIT</profile>
        <profile namespace="env" key="PEGASUS_HOME" >/usr</profile>
        <profile namespace="condor" key="universe">vanilla</profile>

    </site>

</sitecatalog>

To enable PegasusLite in condorio mode, users should set the following in their properties:

# pegasus properties
pegasus.data.configuration    condorio