7.5. Remote Cluster using PyGlidein

Glideins (HTCondor pilot jobs) provide an efficient solution for high-throughput workflows. The glideins are submitted to the remote cluster scheduler, and once they start up, they make it appear as if your HTCondor pool extends into the remote cluster. HTCondor can then schedule jobs to the remote compute nodes in the same way it schedules jobs to local compute nodes.

Some infrastructures, such as the Open Science Grid, provide infrastructure-level glidein solutions, such as GlideinWMS. Another solution is BOSCO. For more custom setups, pyglidein from the IceCube project provides a nice framework. The architecture consists of a server on the submit host, whose job it is to determine the demand for glideins. On the remote resource, the client can be invoked, for example via cron, and submits directly to HTCondor, SLURM, or PBS schedulers. This makes pyglidein very flexible, and it works well, for example, when the resource requires two-factor authentication.

Figure 7.4. pyglidein overview


To get started with pyglidein, check out a copy of the Git repository on both your submit host and the cluster you want to glidein to. Starting with the submit host, first make sure you have HTCondor configured for PASSWORD authentication. Make a copy of the HTCondor pool password file. You will need it later in the configuration, and because it is a binary file, copy it with cp rather than copy-and-pasting the file contents.
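
If PASSWORD authentication is not already enabled in your pool, the HTCondor configuration on the central manager and submit host typically looks something like the sketch below. This is only an example; the file locations are assumptions and should be adapted to your installation.

# Example HTCondor configuration snippet enabling PASSWORD authentication
# (e.g. placed in /etc/condor/config.d/50-password.conf - path is an example)
SEC_PASSWORD_FILE                 = /etc/condor/pool_password
SEC_DAEMON_AUTHENTICATION         = REQUIRED
SEC_DAEMON_AUTHENTICATION_METHODS = PASSWORD
SEC_DAEMON_INTEGRITY              = REQUIRED

The pool password file itself can typically be created with condor_store_cred -f /etc/condor/pool_password, which prompts for the password and writes the binary file.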

Follow the installation instructions provided in the PyGlidein repo. Note that you can use virtualenv if you do not want to do a system-wide install:

$ module load python2   # might not be needed on your system
$ virtualenv pyglidein
New python executable in /home/user/pyglidein/bin/python
Installing setuptools, pip, wheel...done.
$ . pyglidein/bin/activate
$ pip install pyglidein
...
    

Then, to get the server started:

pyglidein_server --port 22001  
    

By default, the pyglidein server considers all jobs in the system when determining whether glideins are needed. If you want user jobs to explicitly indicate that they want glideins, you can pass a constraint for the server to use. For example, jobs could have a +WantPSCBridges = True attribute, and then we could start the server with:

pyglidein_server --port 22001 --constraint "'WantPSCBridges == True'"  
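
In a plain HTCondor submit file, such a custom attribute is set with a leading + sign. A minimal, hypothetical example (the executable and arguments are just placeholders; Pegasus users would typically attach the attribute to their jobs via a condor namespace profile):

# example-job.sub - illustrative submit file carrying the custom attribute
executable = /bin/sleep
arguments  = 600
+WantPSCBridges = True
queue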
    

Once the server is running, you can check its status by pointing a web browser at it.
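
For example, assuming the server runs on workflow.isi.edu (the host used in the configuration below), the same status information should be reachable from the command line as well:

$ curl http://workflow.isi.edu:22001/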

The client (running on the cluster you want glideins on) requires a few configuration files and a glidein.tar.gz file containing the HTCondor binaries, your pool password file, and a modified job wrapper script. This glidein.tar.gz file can be created using the provided create_glidein_tarball.py script, but an easier way is to download the already prepared tarball and inject your pool password file. For example:

$ wget https://download.pegasus.isi.edu/pyglidein/glidein.tar.gz
$ mkdir glidein
$ cd glidein
$ tar xzf ../glidein.tar.gz
$ cp /some/path/to/poolpasswd passwdfile
$ tar czf ../glidein.tar.gz .
$ cd ..
$ rm -rf glidein
    

You can serve this file over HTTP, for example, but as it now contains your pool password, we recommend copying glidein.tar.gz to the remote cluster via scp.
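
For example (the remote host and destination directory are illustrative; the destination should match the tarball path used in the client configuration below):

$ scp glidein.tar.gz username@cluster.example.edu:~/pyglidein-config/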

Create a configuration file for your glidein. Here is an example for PSC Bridges (other config file examples available under configs/ in the PyGlidein GitHub repo):

[Mode]
debug = True

[Glidein]
address = http://workflow.isi.edu:22001/jsonrpc
site = PSC-Bridges
tarball = /home/rynge/pyglidein-config/glidein.tar.gz

[Cluster]
user = rynge
os = RHEL7
scheduler = slurm
max_idle_jobs = 1
limit_per_submit = 2
walltime_hrs = 48
partitions = RM

[RM]
gpu_only = False
whole_node = True
whole_node_memory = 120000
whole_node_cpus = 28
whole_node_disk = 8000000
whole_node_gpus = 0
partition = RM
group_jobs = False
submit_command = sbatch
running_cmd = squeue -u $USER -t RUNNING -h -p RM | wc -l
idle_cmd = squeue -u $USER -t PENDING -h -p RM | wc -l

[SubmitFile]
filename = submit.slurm
local_dir = $LOCAL
sbatch = #SBATCH
custom_header = #SBATCH -C EGRESS
    #SBATCH --account=ABC123
cvmfs_job_wrapper = False

[StartdLogging]
send_startd_logs = False
url = s3.amazonaws.com
bucket = pyglidein-logging-bridges

[StartdChecks]
enable_startd_checks = True

[CustomEnv]
CLUSTER = workflow.isi.edu
    

This configuration will obviously look different for different clusters. A few things to note:

  • address is the location of the server we started earlier

  • tarball is the full path to our custom glidein.tar.gz file we created above.

  • CLUSTER is the location of your HTCondor central manager. In many cases this is the same host you started the server on. Please note that if you do not set this variable, the glideins will try to register into the IceCube infrastructure.

  • #SBATCH -C EGRESS is PSC Bridges specific and enables outbound network connectivity from the compute nodes.

  • #SBATCH --account=ABC123 specifies which allocation to charge the job to. This is a required setting on many, but not all, HPC systems. On PSC Bridges, you can get a list of your allocations by running the projects command and looking for the Charge ID field.

We also need a secrets file. We are not using any remote logging in this example, but the file still has to exist with the following content:

[StartdLogging]
access_key = 
secret_key =
    

At this point we can try our first glidein:

pyglidein_client --config=bridges.config --secrets=secrets
    

Once we have seen a successful glidein, we can add the client to the crontab:

# m  h  dom mon dow   command
*/10 *   *   *   *    (cd ~/pyglidein/ && pyglidein_client --config=bridges.config --secrets=secrets) >~/cron-pyglidein.log 2>&1
    

With this setup, glideins will now appear automatically based on the demand in the local HTCondor queue.
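
As glideins start on the remote cluster, the new slots register with your central manager and appear alongside your local resources. You can verify this from the submit host, for example with:

$ condor_status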