13. Migration Notes
13.1. Migrating From Pegasus 4.9.X to Pegasus 5.0
Pegasus 5.0 is a major release that adopts YAML as the representation for the abstract workflow and all major catalogs. In this release, the following are now represented in YAML:
Abstract Workflow
Replica Catalog
Transformation Catalog
Site Catalog
Kickstart Provenance Records
In addition, 5.0 has a new Python API developed from the ground up that, besides generating the abstract workflow and all the catalogs, allows you to plan, submit, monitor, analyze, and generate statistics for your workflow.
To get started with the new API, refer to Moving From DAX3 to Pegasus.api.
If you are an existing user, please follow these instructions to upgrade.
Python3 Support
All Pegasus tools are Python 3 compliant.
The 5.0 release requires Python 3 on the workflow submit node
Python PIP packages are provided for workflow composition and monitoring
Change in default data configuration
In Pegasus 5.0, the default data configuration has been changed to condorio. Up to the 4.9.x releases, the default configuration was sharedfs.
Note
Your current workflows may not plan successfully if you do not have the property pegasus.data.configuration set. If you were relying on the earlier default, explicitly set pegasus.data.configuration to sharedfs in your properties file.
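For example, to retain the pre-5.0 behavior, add the following line to your properties file:
pegasus.data.configuration sharedfs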
Output Replica Catalog
In Pegasus 5.0, your outputs are now registered into an SQLite database in your root workflow submit directory. This is different from the input replica catalog, which you specify using the pegasus.catalog.replica properties. See the chapter on the Replica Catalog for more details.
Database upgrade
In Pegasus 5.0, the database schema has been revised to remove inconsistent/unused indexes, and extended to enable tracking of the maxrss and avg_cpu properties.
To migrate your 4.9.x databases to 5.0, refer to the section below on how to use pegasus-db-admin.
13.1.1. Database Upgrades From Pegasus 4.9.x to Pegasus current version
In Pegasus all databases are managed by a single tool: pegasus-db-admin. Databases will be automatically updated when pegasus-plan is invoked, but WORKFLOW databases from past runs may not be updated accordingly. The pegasus-db-admin tool provides an option to automatically update all databases from completed workflows in the MASTER database. To enable this option, run the following command:
$ pegasus-db-admin update -a
Your database has been updated.
Your database is compatible with Pegasus version: 5.0.0
Verifying and updating workflow databases:
15/15
Summary:
Verified/Updated: 15/15
Failed: 0/15
Unable to connect: 0/15
Unable to update (active workflows): 0/15
Log files:
20200721T113456-dbadmin.out (Succeeded operations)
20200721T113456-dbadmin.err (Failed operations)
This option generates a log file for succeeded operations, and a log file for failed operations. Each file contains the list of URLs of the succeeded/failed databases.
Note that, if no URL is provided, the tool will create/use a SQLite database in the user’s home directory: ${HOME}/.pegasus/workflow.db.
For a complete description of pegasus-db-admin, see its documentation.
13.2. Moving From DAX3 to Pegasus.api
13.2.1. Overview
In Pegasus 5.0, a new YAML based workflow format has been introduced to replace the older DAX3 XML format (Pegasus is still capable of running DAX3 based workflows; however, it is encouraged that the new workflow generators be used to create the YAML based workflows). Additionally, the Site Catalog, Transformation Catalog, and Replica Catalog formats have all been updated to a YAML based format. Using the respective modules in the Pegasus.api package, these catalogs can be generated programmatically. Furthermore, the pegasus.properties file used for configuration can be generated with this API. If you have existing catalogs that need to be converted to the newer format, usage of the Pegasus catalog conversion tools is covered in the subsequent sections.
Attention
The Pegasus.api package requires Python 3.5+.
13.2.2. Properties
The pegasus.properties file format remains the same in this release; however, you can now programmatically generate this file with Properties. The following illustrates how this can be done:
props = Properties()
props["globus.maxtime"] = 900
props["globus.maxwalltime"] = 1000
props["dagman.retry"] = 4
props.write()
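The same dictionary-style assignment shown above works for other Pegasus properties as well; for instance, the data configuration discussed in the migration notes earlier in this chapter could be set programmatically (value illustrative):
props["pegasus.data.configuration"] = "condorio"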
13.2.3. Catalogs
13.2.3.1. Site Catalog
Prior to the 5.0 release, the Site Catalog was written in XML. Although the format has changed from XML to YAML, the overall structure of this catalog remains unchanged.
To convert an existing Site Catalog from XML to YAML, use pegasus-sc-converter. For example, to convert a Site Catalog file, sites.xml, to YAML, use the following command:
pegasus-sc-converter -i sites.xml -o sites.yml
The following illustrates how Pegasus.api.site_catalog.SiteCatalog can be used to generate a new Site Catalog programmatically, based on an existing XML based Site Catalog.
from Pegasus.api import *
# create a SiteCatalog object
sc = SiteCatalog()
# create a "local" site
local = Site("local", arch=Arch.X86_64, os_type=OS.LINUX)
# create and add a shared scratch and local storage directories to the site "local"
local_shared_scratch_dir = Directory(Directory.SHARED_SCRATCH, path="/tmp/workflows/scratch")\
    .add_file_servers(FileServer("file:///tmp/workflows/scratch", Operation.ALL))
local_local_storage_dir = Directory(Directory.LOCAL_STORAGE, path="/tmp/workflows/outputs")\
    .add_file_servers(FileServer("file:///tmp/workflows/outputs", Operation.ALL))
local.add_directories(local_shared_scratch_dir, local_local_storage_dir)
# create a "condorpool" site
condorpool = Site("condorpool", arch=Arch.X86_64, os_type=OS.LINUX)
# create and add job managers to the site "condorpool"
condorpool.add_grids(
    Grid(Grid.GT5, contact="smarty.isi.edu/jobmanager-pbs", scheduler_type=Scheduler.PBS, job_type=SupportedJobs.AUXILLARY),
    Grid(Grid.GT5, contact="smarty.isi.edu/jobmanager-pbs", scheduler_type=Scheduler.PBS, job_type=SupportedJobs.COMPUTE)
)
# create and add a shared scratch directory to the site "condorpool"
condorpool_shared_scratch_dir = Directory(Directory.SHARED_SCRATCH, path="/lustre")\
    .add_file_servers(FileServer("gsiftp://smarty.isi.edu/lustre", Operation.ALL))
condorpool.add_directories(condorpool_shared_scratch_dir)
# create a "staging_site" site
staging_site = Site("staging_site", arch=Arch.X86_64, os_type=OS.LINUX)
# create and add a shared scratch directory to the site "staging_site"
staging_site_shared_scratch_dir = Directory(Directory.SHARED_SCRATCH, path="/data")\
    .add_file_servers(
        FileServer("scp://obelix.isi.edu/data", Operation.PUT),
        FileServer("http://obelix.isi.edu/data", Operation.GET)
    )
staging_site.add_directories(staging_site_shared_scratch_dir)
# add all the sites to the site catalog object
sc.add_sites(
    local,
    condorpool,
    staging_site
)
# write the site catalog to the default path "./sites.yml"
sc.write()
<?xml version="1.0" encoding="UTF-8"?>
<sitecatalog xmlns="http://pegasus.isi.edu/schema/sitecatalog"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://pegasus.isi.edu/schema/sitecatalog http://pegasus.isi.edu/schema/sc-4.0.xsd"
version="4.0">
<site handle="local" arch="x86_64" os="LINUX">
<directory type="shared-scratch" path="/tmp/workflows/scratch">
<file-server operation="all" url="file:///tmp/workflows/scratch"/>
</directory>
<directory type="local-storage" path="/tmp/workflows/outputs">
<file-server operation="all" url="file:///tmp/workflows/outputs"/>
</directory>
</site>
<site handle="condor_pool" arch="x86_64" os="LINUX">
<grid type="gt5" contact="smarty.isi.edu/jobmanager-pbs" scheduler="PBS" jobtype="auxillary"/>
<grid type="gt5" contact="smarty.isi.edu/jobmanager-pbs" scheduler="PBS" jobtype="compute"/>
<directory type="shared-scratch" path="/lustre">
<file-server operation="all" url="gsiftp://smarty.isi.edu/lustre"/>
</directory>
</site>
<site handle="staging_site" arch="x86_64" os="LINUX">
<directory type="shared-scratch" path="/data">
<file-server operation="put" url="scp://obelix.isi.edu/data"/>
<file-server operation="get" url="http://obelix.isi.edu/data"/>
</directory>
</site>
</sitecatalog>
13.2.3.2. Replica Catalog
The Replica Catalog has been moved from a text based file format to YAML. To convert an existing Replica Catalog from the text based File format to YAML, use pegasus-rc-converter. For example, to convert a Replica Catalog file, rc.txt, to YAML, use the following command:
pegasus-rc-converter -I File -O YAML -i rc.txt -o replicas.yml
The following illustrates how Pegasus.api.replica_catalog.ReplicaCatalog can be used to generate a new Replica Catalog programmatically, based on an existing text based Replica Catalog.
from Pegasus.api import *
rc = ReplicaCatalog()\
    .add_replica("local", "f.a", "/Volumes/data/inputs/f.a")\
    .add_replica("local", "f.b", "/Volumes/data/inputs/f.b")\
    .write()
# the Replica Catalog will be written to the default path "./replicas.yml"
f.a file:///Volumes/data/inputs/f.a site="local"
f.b file:///Volumes/data/inputs/f.b site="local"
13.2.3.3. Transformation Catalog
The Transformation Catalog has been moved from a text based format to YAML. To convert
an existing Transformation Catalog from the text based file format to YAML, use
pegasus-tc-converter. For example, to convert a Transformation Catalog
file, tc.txt
, to YAML, use the following command:
pegasus-tc-converter -i tc.txt -I Text -O YAML -o transformations.yml
The following illustrates how Pegasus.api.transformation_catalog.TransformationCatalog can be used to generate a new Transformation Catalog programmatically, based on an existing text based Transformation Catalog.
from Pegasus.api import *
# create the TransformationCatalog object
tc = TransformationCatalog()
# create and add the centos-pegasus container
centos_cont = Container(
    "centos-pegasus",
    Container.DOCKER,
    "docker:///rynge/montage:latest",
    mounts=["/Volumes/Work/lfs1:/shared-data/:ro"]
).add_profiles(Namespace.ENV, JAVA_HOME="/opt/java/1.6")
tc.add_containers(centos_cont)
# create and add the transformation
keg = Transformation(
    "keg",
    namespace="example",
    version="1.0",
    site="isi",
    pfn="/path/to/keg",
    is_stageable=False,
    container=centos_cont
).add_profiles(Namespace.ENV, APP_HOME="/tmp/myscratch", JAVA_HOME="/opt/java/1.6")
tc.add_transformations(keg)
# write the transformation catalog to the default file path "./transformations.yml"
tc.write()
tr example::keg:1.0 {
profile env "APP_HOME" "/tmp/myscratch"
profile env "JAVA_HOME" "/opt/java/1.6"
site isi {
pfn "/path/to/keg
arch "x86"
os "linux"
type "INSTALLED"
container "centos-pegasus"
}
}
cont centos-pegasus{
type "docker"
image "docker:///rynge/montage:latest"
mount "/Volumes/Work/lfs1:/shared-data/:ro"
profile env "JAVA_HOME" "/opt/java/1.6"
}
13.2.4. Workflow (formerly DAX)
Pegasus 5.0 brings major API changes to our most widely used DAX3 Python API. Moving forward, users should use the Pegasus.api package described in the Python API reference. The following section shows both the DAX3 and Pegasus.api representations of the classic diamond workflow.
Note
Method signatures in the Java DAX API remain exactly the same as they were prior to the 5.0 release, with the exception that the API can now generate YAML. It is recommended to use the Python API moving forward, as it supports more features such as catalog generation and access to the Pegasus command line tools.
#!/usr/bin/env python
import logging
from pathlib import Path
from Pegasus.api import *
logging.basicConfig(level=logging.DEBUG)
# --- Replicas -----------------------------------------------------------------
with open("f.a", "w") as f:
f.write("This is sample input to KEG")
fa = File("f.a").add_metadata(creator="ryan")
rc = ReplicaCatalog().add_replica("local", fa, Path(".").resolve() / "f.a")
# --- Transformations ----------------------------------------------------------
preprocess = Transformation(
    "preprocess",
    site="condorpool",
    pfn="/usr/bin/pegasus-keg",
    is_stageable=False,
    arch=Arch.X86_64,
    os_type=OS.LINUX
)
findrange = Transformation(
    "findrange",
    site="condorpool",
    pfn="/usr/bin/pegasus-keg",
    is_stageable=False,
    arch=Arch.X86_64,
    os_type=OS.LINUX
)
analyze = Transformation(
    "analyze",
    site="condorpool",
    pfn="/usr/bin/pegasus-keg",
    is_stageable=False,
    arch=Arch.X86_64,
    os_type=OS.LINUX
)
tc = TransformationCatalog().add_transformations(preprocess, findrange, analyze)
# --- Workflow -----------------------------------------------------------------
'''
                     [f.b1] - (findrange) - [f.c1]
                    /                             \
[f.a] - (preprocess)                               (analyze) - [f.d]
                    \                             /
                     [f.b2] - (findrange) - [f.c2]
'''
wf = Workflow("diamond")
fb1 = File("f.b1")
fb2 = File("f.b2")
job_preprocess = Job(preprocess)\
    .add_args("-a", "preprocess", "-T", "3", "-i", fa, "-o", fb1, fb2)\
    .add_inputs(fa)\
    .add_outputs(fb1, fb2)
fc1 = File("f.c1")
job_findrange_1 = Job(findrange)\
    .add_args("-a", "findrange", "-T", "3", "-i", fb1, "-o", fc1)\
    .add_inputs(fb1)\
    .add_outputs(fc1)
fc2 = File("f.c2")
job_findrange_2 = Job(findrange)\
    .add_args("-a", "findrange", "-T", "3", "-i", fb2, "-o", fc2)\
    .add_inputs(fb2)\
    .add_outputs(fc2)
fd = File("f.d")
job_analyze = Job(analyze)\
    .add_args("-a", "analyze", "-T", "3", "-i", fc1, fc2, "-o", fd)\
    .add_inputs(fc1, fc2)\
    .add_outputs(fd)
wf.add_jobs(job_preprocess, job_findrange_1, job_findrange_2, job_analyze)
wf.add_replica_catalog(rc)
wf.add_transformation_catalog(tc)
try:
    wf.plan(submit=True)\
        .wait()\
        .analyze()\
        .statistics()
except PegasusClientError as e:
    print(e)
#!/usr/bin/env python
from Pegasus.DAX3 import *
import sys
import os
if len(sys.argv) != 3:
    print "Usage: %s PEGASUS_HOME SHARED_SCRATCH" % (sys.argv[0])
    sys.exit(1)
# Create a abstract dag
diamond = ADAG("diamond")
# Add input file to the DAX-level replica catalog
a = File("f.a")
a.addPFN(PFN("file://" + os.getcwd() + "/f.a", "local"))
diamond.addFile(a)
a1 = File("f.a1")
a1.addPFN(PFN("file://" + sys.argv[2] + "/f.a1", "condorpool"))
diamond.addFile(a1)
# Add executables to the DAX-level replica catalog
# In this case the binary is pegasus-keg, which is shipped with Pegasus, so we use
# the remote PEGASUS_HOME to build the path.
e_preprocess = Executable(namespace="diamond", name="preprocess", version="4.0", os="linux", arch="x86_64", installed=False)
e_preprocess.addPFN(PFN("file://" + sys.argv[1] + "/bin/pegasus-keg", "condorpool"))
diamond.addExecutable(e_preprocess)
e_findrange = Executable(namespace="diamond", name="findrange", version="4.0", os="linux", arch="x86_64", installed=False)
e_findrange.addPFN(PFN("file://" + sys.argv[1] + "/bin/pegasus-keg", "condorpool"))
diamond.addExecutable(e_findrange)
e_analyze = Executable(namespace="diamond", name="analyze", version="4.0", os="linux", arch="x86_64", installed=False)
e_analyze.addPFN(PFN("file://" + sys.argv[1] + "/bin/pegasus-keg", "condorpool"))
diamond.addExecutable(e_analyze)
# Add a preprocess job
preprocess = Job(namespace="diamond", name="preprocess", version="4.0")
b1 = File("f.b1")
b2 = File("f.b2")
preprocess.addArguments("-a preprocess","-T60","-i",a,"-o",b1,b2)
preprocess.uses(a, link=Link.INPUT)
preprocess.uses(a1, link=Link.INPUT)
preprocess.uses(b1, link=Link.OUTPUT)
preprocess.uses(b2, link=Link.OUTPUT)
diamond.addJob(preprocess)
# Add left Findrange job
frl = Job(namespace="diamond", name="findrange", version="4.0")
c1 = File("f.c1")
frl.addArguments("-a findrange","-T6-","-i",b1,"-o",c1)
frl.uses(b1, link=Link.INPUT)
frl.uses(c1, link=Link.OUTPUT)
diamond.addJob(frl)
# Add right Findrange job
frr = Job(namespace="diamond", name="findrange", version="4.0")
c2 = File("f.c2")
frr.addArguments("-a findrange","-T60","-i",b2,"-o",c2)
frr.uses(b2, link=Link.INPUT)
frr.uses(c2, link=Link.OUTPUT)
diamond.addJob(frr)
# Add Analyze job
analyze = Job(namespace="diamond", name="analyze", version="4.0")
d = File("f.d")
analyze.addArguments("-a analyze","-T60","-i",c1,c2,"-o",d)
analyze.uses(c1, link=Link.INPUT)
analyze.uses(c2, link=Link.INPUT)
analyze.uses(d, link=Link.OUTPUT, register=True)
diamond.addJob(analyze)
# Add control-flow dependencies
diamond.addDependency(Dependency(parent=preprocess, child=frl))
diamond.addDependency(Dependency(parent=preprocess, child=frr))
diamond.addDependency(Dependency(parent=frl, child=analyze))
diamond.addDependency(Dependency(parent=frr, child=analyze))
# Write the DAX to stdout
diamond.writeXML(sys.stdout)
To begin creating a workflow, you will first need to import the classes made available in Pegasus.api. Simply replace DAX3 with api.
from Pegasus.api import *
from Pegasus.DAX3 import *
The workflow object has been changed from ADAG to Workflow. By default, job dependencies will be inferred based on job input and output files (an explicit alternative is sketched after the snippets below).
wf = Workflow("diamond")
diamond = ADAG("diamond")
In DAX3, you were able to add files directly to the ADAG object. With the newer 5.0 API, any file that has a physical file name (i.e. any initial input file to the workflow) should be added to the ReplicaCatalog.
In this example, we add the replica catalog to the workflow after all input files have been added to it. You also have the option to write this out to a separate file for pegasus-plan to pick up.
fa = File("f.a").add_metadata(creator="ryan")
rc = ReplicaCatalog().add_replica("local", fa, Path(".").resolve() / "f.a")
wf.add_replica_catalog(rc)
a = File("f.a")
a.addPFN(PFN("file://"+ os.getcwd() + "/f.a", "local"))
diamond.addFile(a)
In DAX3, you were also able to add executables directly to the ADAG object. In 5.0, the way to do this is to first add them to a TransformationCatalog and then add that catalog to the workflow, as shown below. Note that we now refer to executables as transformations. In DAX3, you were not able to add containers directly to the ADAG object; they would instead need to be cataloged in the text based transformation catalog file. With the new API, you may create containers and add them to the workflow through the transformation catalog. For more information, see Containers. Just as with the replica catalog, you have the option to write this catalog out to a separate file for pegasus-plan to pick up.
tc = TransformationCatalog()
preprocess = Transformation(
    "preprocess",
    site="condorpool",
    pfn="/usr/bin/pegasus-keg",
    is_stageable=False,
    arch=Arch.X86_64,
    os_type=OS.LINUX
)
tc.add_transformations(preprocess)
wf.add_transformation_catalog(tc)
e_preprocess = Executable(namespace="diamond", name="preprocess", version="4.0", os="linux", arch="x86_64", installed=False)
e_preprocess.addPFN(PFN("file://" + sys.argv[1] + "/bin/pegasus-keg", "condorpool"))
diamond.addExecutable(e_preprocess)
When specifying AbstractJob inputs and outputs, simply add the File objects as inputs or outputs. Unlike DAX3, you do not need to specify job.uses(..), as seen below.
fb1 = File("f.b1")
fb2 = File("f.b2")
job_preprocess = Job(preprocess)\
    .add_args("-a", "preprocess", "-T", "3", "-i", fa, "-o", fb1, fb2)\
    .add_inputs(fa)\
    .add_outputs(fb1, fb2)
wf.add_jobs(job_preprocess)
preprocess = Job(namespace="diamond", name="preprocess", version="4.0")
b1 = File("f.b1")
b2 = File("f.b2")
preprocess.addArguments("-a preprocess","-T60","-i",a,"-o",b1,b2)
preprocess.uses(a, link=Link.INPUT)
preprocess.uses(b1, link=Link.OUTPUT)
preprocess.uses(b2, link=Link.OUTPUT)
diamond.addJob(preprocess)
Hierarchical workflows can be created by adding SubWorkflow jobs. The second argument, is_planned, in SubWorkflow specifies whether or not the sub-workflow has already been planned by the Pegasus planner. When is_planned=False, this is the equivalent of using the DAX object in Pegasus.DAX3. When is_planned=True, this is the equivalent of using the DAG object in Pegasus.DAX3.
blackdiamond_wf = SubWorkflow("blackdiamond.yml", is_planned=False)\
    .add_args("--input-dir", "input", "--output-sites", "local", "-vvv")
sleep_wf = SubWorkflow("sleep.yml", is_planned=False)\
    .add_args("--output-sites", "local", "-vvv")
wf.add_jobs(blackdiamond_wf, sleep_wf)
# Create a abstract dag
adag = ADAG('local-hierarchy')
daxfile = File('blackdiamond.dax')
dax1 = DAX(daxfile)
#DAX jobs are called with same arguments passed, while planning the root level dax
dax1.addArguments('--output-site local')
dax1.addArguments('-vvv')
adag.addJob(dax1)
# this dax job uses a pre-existing dax file
# that has to be present in the replica catalog
daxfile2 = File('sleep.dax')
dax2 = DAX(daxfile2)
dax2.addArguments('--output-site local')
dax2.addArguments( '-vvv')
adag.addJob(dax2)
Profile functionality remains the same in Pegasus 5.0 (see ProfileMixin). Profiles can be added as shown below:
job.add_env(PATH="/bin")
job.add_condor_profile(universe="vanilla")
# Alternatively you can use:
job.add_profiles(Namespace.ENV, PATH="/bin")
job.add_profiles(Namespace.CONDOR, universe="vanilla")
# When profile keys contain non-alphanumeric characters, you can use:
job.add_profiles(Namespace.CONDOR, key="+KeyName", value="val")
job.addProfile(Profile(Namespace.ENV,'PATH','/bin'))
job.profile(Namespace.CONDOR, "universe", "vanilla")
Metadata functionality also remains the same in Pegasus 5.0 (see MetadataMixin). Metadata can be added as shown below:
preprocess.add_metadata(time=60, created_by="ryan")
preprocess.metadata("time", "60")
preprocess.metadata("created_by", "ryan")
13.2.5. Running Workflows
Using the Python API, you can run the workflow directly from the
Workflow
you have just created. This is done
by calling plan
on the Workflow
after all jobs have been added to it. If submit=True
is given to wf.plan
,
the workflow will be planned and submitted for execution. At that point, wf.plan()
will return. If you would like to block until the actual workflow execution is called
then wf.plan(submit=True).wait()
can be used.
Attention
To use this feature, the Pegasus binaries must be added to your PATH. It is only supported in the new Python API.
#!/usr/bin/env python3
# set this if you would like to see output from the underlying pegasus command line tools
import logging
logging.basicConfig(level=logging.INFO)
from Pegasus.api import *
wf = Workflow("diamond")
# Add properties
# ..
# .
# Add files, transformations, and jobs here
# ....
# ...
# ..
# .
try:
    # plan and submit the workflow for execution
    wf.plan(submit=True)
    # braindump becomes accessible following a call to wf.plan()
    print(wf.braindump.submit_dir)
    # wait for workflow execution to complete
    wf.wait()
    # workflow debugging and statistics
    wf.analyze()
    wf.statistics()
except PegasusClientError as e:
    print(e)
Tip
Because the properties file, catalogs, and the workflow can all be generated and run programmatically, it is recommended to keep everything in a single script so that a wrapper shell script is not needed.
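If you prefer to keep workflow generation separate from planning, the workflow can instead be written out to a file and handed to pegasus-plan later; a minimal sketch, assuming the default output file name workflow.yml:
# write the abstract workflow to the default path "./workflow.yml"
wf.write()
The generated file can then be passed to pegasus-plan on the command line.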
13.3. Migrating From Pegasus 4.5.X to Pegasus 4.9.x
Most of the migrations from one version to another are related to database upgrades, which are addressed by running the tool pegasus-db-admin.
13.3.1. Database Upgrades From Pegasus 4.5.X to Pegasus current version
Since Pegasus 4.5 all databases are managed by a single tool: pegasus-db-admin. Databases will be automatically updated when pegasus-plan is invoked, but WORKFLOW databases from past runs may not be updated accordingly. Since Pegasus 4.6.0, the pegasus-db-admin tool provides an option to automatically update all databases from completed workflows in the MASTER database. To enable this option, run the following command:
$ pegasus-db-admin update -a
Your database has been updated.
Your database is compatible with Pegasus version: 4.7.0
Verifying and updating workflow databases:
21/21
Summary:
Verified/Updated: 21/21
Failed: 0/21
Unable to connect: 0/21
Unable to update (active workflows): 0/21
Log files:
20161006T134415-dbadmin.out (Succeeded operations)
20161006T134415-dbadmin.err (Failed operations)
This option generates a log file for succeeded operations, and a log file for failed operations. Each file contains the list of URLs of the succeeded/failed databases.
Note that, if no URL is provided, the tool will create/use a SQLite database in the user’s home directory: ${HOME}/.pegasus/workflow.db.
For a complete description of pegasus-db-admin, see the man page.
13.3.2. Migration from Pegasus 4.6 to 4.7
In addition to the database changes, Pegasus 4.7 changed the default submit directory layout. The earlier default was a flat structure, in which all submit files appeared in a single directory regardless of the number of jobs in the workflow. For 4.7, the default is a hierarchical directory structure two levels deep. To use the earlier layout, set the following property:
pegasus.dir.submit.mapper Flat
13.4. Migrating From Pegasus <4.5 to Pegasus 4.5.X
Since Pegasus 4.5, all databases are managed by a single tool: pegasus-db-admin. Databases will be automatically updated when pegasus-plan is invoked, but manual invocation of pegasus-db-admin may be required for other Pegasus tools.
The check command verifies whether the database is compatible with the latest Pegasus version. If the database is not compatible, it will print the following message:
$ pegasus-db-admin check
Your database is NOT compatible with version 4.5.0
If you are running the check command for the first time, the tool will prompt the following message:
Missing database tables or tables are not updated:
dbversion
Run 'pegasus-db-admin update <path_to_database>' to create/update your database.
To update the database, run the following command:
$ pegasus-db-admin update
Your database has been updated.
Your database is compatible with Pegasus version: 4.5.0
The pegasus-db-admin tool can operate directly over a database URL, or can read configuration parameters from the properties file or a submit directory. In the latter case, a database type should be provided to indicate which properties should be used to connect to the database. For example, the tool will look for the pegasus.catalog.replica.db.* properties to connect to the JDBCRC database; for the pegasus.catalog.master.url property (or pegasus.dashboard.output, which is deprecated) to connect to the MASTER database; or for the pegasus.catalog.workflow.url property (or pegasus.monitord.output, which is deprecated) to connect to the WORKFLOW database. If none of these properties are found, the tool will connect to the default database in the user’s home directory (sqlite:///${HOME}/.pegasus/workflow.db).
Example: connection by providing the URL to the database:
$ pegasus-db-admin create sqlite:///${HOME}/.pegasus/workflow.db
$ pegasus-db-admin update sqlite:///${HOME}/.pegasus/workflow.db
Example: connection by providing a properties file that contains the information to connect to the database. Note that a database type (MASTER, WORKFLOW, or JDBCRC) should be provided:
$ pegasus-db-admin update -c pegasus.properties -t MASTER
$ pegasus-db-admin update -c pegasus.properties -t JDBCRC
$ pegasus-db-admin update -c pegasus.properties -t WORKFLOW
Example: connection by providing the path to the submit directory containing the braindump.txt file, from which the information to connect to the database can be extracted. Note that a database type (MASTER, WORKFLOW, or JDBCRC) should also be provided:
$ pegasus-db-admin update -s /path/to/submitdir -t WORKFLOW
$ pegasus-db-admin update -s /path/to/submitdir -t MASTER
$ pegasus-db-admin update -s /path/to/submitdir -t JDBCRC
Note that, if no URL is provided, the tool will create/use a SQLite database in the user’s home directory: ${HOME}/.pegasus/workflow.db.
For a complete description of pegasus-db-admin, see the man page.
13.5. Migrating From Pegasus 3.1 to Pegasus 4.X
With Pegasus 4.0, an effort has been made to make the Pegasus installation FHS compliant, and to make workflows run better in cloud environments and distributed grid environments. This chapter is for existing users of Pegasus who use Pegasus 3.1 to run their workflows, and walks through the steps to move to Pegasus 4.0.
13.5.1. Move to FHS layout
Pegasus 4.0 is the first release of Pegasus that is Filesystem Hierarchy Standard (FHS) compliant. The native packages no longer install under /opt. Instead, the pegasus-* binaries are in /usr/bin/ and example workflows can be found under /usr/share/pegasus/examples/.
To find Pegasus system components, a pegasus-config tool is provided. pegasus-config supports setting up the environment for
Python
Perl
Java
Shell
For example, to find the PYTHONPATH for the DAX API, run:
export PYTHONPATH=`pegasus-config --python`
For a complete description of pegasus-config, see the man page.
13.5.2. Stampede Schema Upgrade Tool
Starting with Pegasus 4.x, the monitoring and statistics database schema has changed. If you want to use pegasus-statistics, pegasus-analyzer, and pegasus-plots against a 3.x database, you will need to upgrade the schema first using the schema upgrade tool /usr/share/pegasus/sql/schema_tool.py or /path/to/pegasus-4.x/share/pegasus/sql/schema_tool.py.
Upgrading the schema is required for people using the MySQL database for storing their monitoring information, if it was set up with 3.x monitoring tools.
If your setup uses the default SQLite database, then the databases for new runs with Pegasus 4.x are automatically created with the correct schema. In this case, you only need to upgrade the SQLite databases from older runs if you wish to query them with the newer clients.
To upgrade the database
For SQLite Database
cd /to/the/workflow/directory/with/3.x.monitord.db
Check the db version
/usr/share/pegasus/sql/schema_tool.py -c connString=sqlite:////to/the/workflow/directory/with/workflow.stampede.db
2012-02-29T01:29:43.330476Z INFO netlogger.analysis.schema.schema_check.SchemaCheck.init |
2012-02-29T01:29:43.330708Z INFO netlogger.analysis.schema.schema_check.SchemaCheck.check_schema.start |
2012-02-29T01:29:43.348995Z INFO netlogger.analysis.schema.schema_check.SchemaCheck.check_schema
| Current version set to: 3.1.
2012-02-29T01:29:43.349133Z ERROR netlogger.analysis.schema.schema_check.SchemaCheck.check_schema
| Schema version 3.1 found - expecting 4.0 - database admin will need to run upgrade tool.
Convert the Database to be version 4.x compliant
/usr/share/pegasus/sql/schema_tool.py -u connString=sqlite:////to/the/workflow/directory/with/workflow.stampede.db
2012-02-29T01:35:35.046317Z INFO netlogger.analysis.schema.schema_check.SchemaCheck.init |
2012-02-29T01:35:35.046554Z INFO netlogger.analysis.schema.schema_check.SchemaCheck.check_schema.start |
2012-02-29T01:35:35.064762Z INFO netlogger.analysis.schema.schema_check.SchemaCheck.check_schema
| Current version set to: 3.1.
2012-02-29T01:35:35.064902Z ERROR netlogger.analysis.schema.schema_check.SchemaCheck.check_schema
| Schema version 3.1 found - expecting 4.0 - database admin will need to run upgrade tool.
2012-02-29T01:35:35.065001Z INFO netlogger.analysis.schema.schema_check.SchemaCheck.upgrade_to_4_0
| Upgrading to schema version 4.0.
Verify if the database has been converted to Version 4.x
/usr/share/pegasus/sql/schema_tool.py -c connString=sqlite:////to/the/workflow/directory/with/workflow.stampede.db
2012-02-29T01:39:17.218902Z INFO netlogger.analysis.schema.schema_check.SchemaCheck.init |
2012-02-29T01:39:17.219141Z INFO netlogger.analysis.schema.schema_check.SchemaCheck.check_schema.start |
2012-02-29T01:39:17.237492Z INFO netlogger.analysis.schema.schema_check.SchemaCheck.check_schema | Current version set to: 4.0.
2012-02-29T01:39:17.237624Z INFO netlogger.analysis.schema.schema_check.SchemaCheck.check_schema | Schema up to date.
For upgrading a MySQL database the steps remain the same. The only thing that changes is the connection string to the database, e.g.:
/usr/share/pegasus/sql/schema_tool.py -u connString=mysql://username:password@server:port/dbname
After the database has been upgraded, you can use either the 3.x or 4.x clients to query the database with pegasus-statistics, as well as pegasus-plots and pegasus-analyzer.
13.6. Migrating From Pegasus 2.X to Pegasus 3.X
With Pegasus 3.0, an effort has been made to simplify configuration. This chapter is for existing users of Pegasus who use Pegasus 2.x to run their workflows, and walks through the steps to move to Pegasus 3.0.
13.6.1. PEGASUS_HOME and Setup Scripts
Earlier versions of Pegasus required users to have the environment variable PEGASUS_HOME set and to source a setup file ($PEGASUS_HOME/setup.sh or $PEGASUS_HOME/setup.csh) before running Pegasus, to set up CLASSPATH and other variables.
Starting with Pegasus 3.0 this is no longer required. The above paths are automatically determined by the Pegasus tools when they are invoked.
All users need to do is set the PATH variable to pick up the Pegasus executables from the bin directory.
$ export PATH=/some/install/pegasus-3.0.0/bin:$PATH
13.6.2. DAX Schema
Pegasus 3.0 by default now parses DAX documents conforming to the DAX Schema 3.2, available here, which is explained in detail in the chapter on API references.
Starting with Pegasus 3.0, DAX generation APIs are provided in Java, Python, and Perl for users to use in their DAX generators. The use of these APIs is highly encouraged. Support for the old DAX schemas has been deprecated and will be removed in a future version.
Users who still want to run using the old DAX formats, i.e. 3.0 or earlier, can for the time being set the following property and point it to the dax-3.0 XSD of the installation.
pegasus.schema.dax /some/install/pegasus-3.0/etc/dax-3.0.xsd
13.6.3. Site Catalog Format
Pegasus 3.0 by default now parses a Site Catalog format conforming to the SC schema 3.0 (XML3), available here, which is explained in detail in the chapter on Catalogs.
Pegasus 3.0 comes with a pegasus-sc-converter that will convert users' old site catalogs (XML) to the XML3 format. Sample usage is given below.
$ pegasus-sc-converter -i sample.sites.xml -I XML -o sample.sites.xml3 -O XML3
2010.11.22 12:55:14.169 PST: Written out the converted file to sample.sites.xml3
To use the converted site catalog, do the following in the properties:
unset pegasus.catalog.site or set pegasus.catalog.site to XML3
point pegasus.catalog.site.file to the converted site catalog
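For example, the corresponding entries in the properties file might look like the following (file path illustrative):
pegasus.catalog.site XML3
pegasus.catalog.site.file /path/to/sample.sites.xml3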
13.6.4. Transformation Catalog Format
Pegasus 3.0 by default now parses a file based, multi-line textual format of the Transformation Catalog. The new Text format is explained in detail in the chapter on Catalogs.
Pegasus 3.0 comes with a pegasus-tc-converter that will convert users' old transformation catalogs (File) to the Text format. Sample usage is given below.
$ pegasus-tc-converter -i sample.tc.data -I File -o sample.tc.text -O Text
2010.11.22 12:53:16.661 PST: Successfully converted Transformation Catalog from File to Text
2010.11.22 12:53:16.666 PST: The output transfomation catalog is in file /lfs1/software/install/pegasus/pegasus-3.0.0cvs/etc/sample.tc.text
To use the converted transformation catalog, do the following in the properties:
unset pegasus.catalog.transformation or set pegasus.catalog.transformation to Text
point pegasus.catalog.transformation.file to the converted transformation catalog
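For example, the corresponding entries in the properties file might look like the following (file path illustrative):
pegasus.catalog.transformation Text
pegasus.catalog.transformation.file /path/to/sample.tc.text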
13.6.5. Properties and Profiles Simplification
Starting with Pegasus 3.0, all profiles can be specified in the properties file. Profiles specified in the properties file have the lowest priority. Profiles are explained in detail in the configuration chapter. As a result, a number of existing Pegasus properties were replaced by profiles. The table below lists the properties removed and the new profile based names.
Old Property Key | New Property Key
pegasus.local.env | no replacement; specify env profiles for the local site in the site catalog
pegasus.condor.release | condor.periodic_release
pegasus.condor.remove | condor.periodic_remove
pegasus.job.priority | condor.priority
pegasus.condor.output.stream | condor.stream_output
pegasus.condor.error.stream | condor.stream_error
pegasus.dagman.retry | dagman.retry
pegasus.exitcode.impl | dagman.post
pegasus.exitcode.scope | dagman.post.scope
pegasus.exitcode.arguments | dagman.post.arguments
pegasus.exitcode.path.* | dagman.post.path.*
pegasus.dagman.maxpre | dagman.maxpre
pegasus.dagman.maxpost | dagman.maxpost
pegasus.dagman.maxidle | dagman.maxidle
pegasus.dagman.maxjobs | dagman.maxjobs
pegasus.remote.scheduler.min.maxwalltime | globus.maxwalltime
pegasus.remote.scheduler.min.maxtime | globus.maxtime
pegasus.remote.scheduler.min.maxcputime | globus.maxcputime
pegasus.remote.scheduler.queues | globus.queue
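For example, where a 2.x configuration used pegasus.dagman.retry, a 3.0 properties file would instead carry the profile-based key from the table (retry count illustrative):
dagman.retry 3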
13.6.6. Profile Keys for Clustering
The pegasus profile keys for job clustering were renamed. The following table lists the old and the new names for the profile keys.
Old Pegasus Profile Key | New Pegasus Profile Key
collapse | clusters.size
bundle | clusters.num
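As an illustration, a clustering profile using the new key name can be attached, for example, to a job in the DAX or to an entry in the site or transformation catalog; the XML profile element form (value illustrative) is:
<profile namespace="pegasus" key="clusters.size">20</profile>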
13.6.7. Transfers Simplification
Pegasus 3.0 has a new default transfer client, pegasus-transfer, which is invoked by default for first level and second level staging. The pegasus-transfer client is a Python based wrapper around various transfer clients like globus-url-copy, lcg-copy, wget, cp, and ln. pegasus-transfer looks at the source and destination URLs and automatically figures out which underlying client to use. pegasus-transfer is distributed with Pegasus and can be found in the bin subdirectory.
Also, the Bundle Transfer refiner has been made the default for Pegasus 3.0. Most users no longer need to set any transfer related properties. The names of the profile keys that control the bundle transfers have been changed. The following table lists the old and the new names for the Pegasus profile keys; they are explained in detail in the Profiles chapter.
Old Pegasus Profile Key | New Pegasus Profile Keys
bundle.stagein | stagein.clusters, stagein.local.clusters, stagein.remote.clusters
bundle.stageout | stageout.clusters, stageout.local.clusters, stageout.remote.clusters
13.6.8. Worker Package Staging
Starting with Pegasus 3.0, there is a separate boolean property, pegasus.transfer.worker.package, to enable worker package staging to the remote compute sites. Earlier, it was bundled with the staging of user executables, i.e. if the pegasus.catalog.transformation.mapper property was set to Staged.
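For example, to enable worker package staging independently of executable staging, add the following to your properties file:
pegasus.transfer.worker.package true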
13.6.9. Clients in bin directory
Starting with Pegasus 3.0, the Pegasus clients in the bin directory have a pegasus prefix. The table below lists the old client names and the new names of the clients that replaced them.
Old Client | New Client
rc-client | pegasus-rc-client
tc-client | pegasus-tc-client
pegasus-get-sites | pegasus-sc-client
sc-client | pegasus-sc-converter
tailstatd | pegasus-monitord
genstats and genstats-breakdown | pegasus-statistics
show-job | pegasus-plots
dirmanager | pegasus-dirmanager
exitcode | pegasus-exitcode
rank-dax | pegasus-rank-dax
transfer | pegasus-transfer