Although much work has been done in developing the national cyberinfrastructure in support of science, there is still a gap between the needs of scientific applications and the capabilities provided by the resources. Leadership-class systems are optimized for highly parallel, tightly coupled applications. Many scientific applications, however, are composed of a large number of loosely coupled individual components, many with data and control dependencies. Running these complex, many-step workflows robustly and easily still poses difficulties on today's cyberinfrastructure. One effective solution that allows applications to use the current cyberinfrastructure efficiently is resource provisioning using Condor glideins.
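To make the glidein idea concrete, the sketch below shows the essence of what a pilot job does once it lands on a remote batch node: it writes a minimal HTCondor configuration that points a local condor_startd back at the submitter's pool and then starts condor_master, so the node temporarily joins the user's pool as an ordinary execute slot. This is a conceptual sketch only, not the glideinWMS pilot itself; the hostnames, paths, and timeout values are placeholders, and it assumes the HTCondor binaries are already available on the node (the real pilot downloads and unpacks them, and also handles credentials, monitoring, and cleanup).

```python
#!/usr/bin/env python
"""Conceptual glidein pilot sketch (not the actual glideinWMS pilot).

A glidein works by starting HTCondor daemons on a remote batch node and
pointing them back at the submitter's pool, so the node appears there as
an execute slot for the lifetime of the batch allocation.
"""
import os
import subprocess
import tempfile

# Placeholder values -- in a real deployment these come from the
# factory/frontend configuration, not hard-coded strings.
USER_COLLECTOR = "collector.example.org:9618"   # the submitter's pool
IDLE_SHUTDOWN_SECS = 1200                       # leave if no work arrives


def main():
    workdir = tempfile.mkdtemp(prefix="glidein-")

    # Minimal HTCondor configuration: run only master + startd, report to
    # the user's collector, and shut down after a period of idleness so
    # the batch allocation is not wasted.
    condor_config = os.path.join(workdir, "condor_config")
    with open(condor_config, "w") as f:
        f.write(f"""
CONDOR_HOST = {USER_COLLECTOR}
COLLECTOR_HOST = {USER_COLLECTOR}
DAEMON_LIST = MASTER, STARTD
LOCAL_DIR = {workdir}
STARTD_NOCLAIM_SHUTDOWN = {IDLE_SHUTDOWN_SECS}
""")

    env = dict(os.environ, CONDOR_CONFIG=condor_config)
    # condor_master runs in the foreground (-f) and exits when the startd
    # shuts down, which ends the batch job cleanly.
    subprocess.call(["condor_master", "-f"], env=env)


if __name__ == "__main__":
    main()
```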
Corral and glideinWMS currently operate as standalone resource provisioning systems. GlideinWMS was initially developed to meet the needs of the CMS (Compact Muon Solenoid) experiment at the Large Hadron Collider (LHC) at CERN. It generalizes a Condor glidein system developed for CDF (the Collider Detector at Fermilab) and first deployed in production in 2003. GlideinWMS has been in production across the Worldwide LHC Computing Grid (WLCG), with major contributions from the Open Science Grid (OSG) in support of CMS for the past two years, and has recently been adopted for user analysis. It is also used by the CDF, DZero, and MINOS experiments, and serves the NEBioGrid and Holland Computing Center communities. GlideinWMS has been used in production with more than 8,000 concurrently running jobs; the CMS use alone totals over 45 million hours.
Corral, a tool developed to complement the Pegasus Workflow Management System, was built to meet the needs of workflow-based applications running on the TeraGrid. It is being used today by the Southern California Earthquake Center (SCEC) CyberShake application. Over a period of 10 days in May 2009, SCEC used Corral to provision a total of 33,600 cores and used them to execute 50 workflows, each containing approximately 800,000 application tasks, which corresponded to 852,120 individual jobs executed on the TeraGrid Ranger system. The roughly 50-fold reduction from the number of workflow tasks to the number of jobs is due to job-clustering features within Pegasus designed to improve overall performance for workflows with many short-duration tasks. The integrated CorralWMS system will provide a robust and scalable resource provisioning service that supports a broad set of domain application workflow and workload execution environments. The aim is to integrate and enable these services across local and distributed computing resources, the major national cyberinfrastructure providers (Open Science Grid and TeraGrid), as well as emerging commercial and community cloud environments.
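The task-to-job reduction quoted above can be checked with a little arithmetic; the short calculation below simply reproduces the figures from the SCEC run (the per-workflow task count is approximate, so the effective cluster size comes out at roughly 47, i.e. on the order of 50).

```python
# Back-of-the-envelope check of the task-to-job reduction quoted above.
workflows = 50
tasks_per_workflow = 800_000     # approximate figure from the SCEC run
jobs_executed = 852_120          # jobs actually executed on Ranger

total_tasks = workflows * tasks_per_workflow   # ~40 million tasks
avg_cluster_size = total_tasks / jobs_executed

print(f"total tasks:       {total_tasks:,}")
print(f"jobs executed:     {jobs_executed:,}")
print(f"avg tasks per job: {avg_cluster_size:.0f}")   # ~47, i.e. ~50-fold
```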
After the initial development phase of the CorralWMS project, we realized that the work being done also benefited the glideinWMS project. There was no need to maintain separate codebases or to develop functionality specific to CorralWMS. We decided to brand the entire system as glideinWMS, which now has two frontend services for users to choose from depending on their needs. The original VO Frontend continues to serve virtual organizations, which may contain many users, providing on-demand resources and scheduling; the administrator manages the service, and possibly the credentials for all of the users in the VO. The Corral Frontend is run by individual users to schedule the resources needed for their workflows; each Corral user manages their own service and credentials. Both frontends communicate with the glideinWMS Factory through the same protocol and can affect glidein behavior in the same manner.
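One way to picture what the two frontends have in common is the request each ultimately places with the Factory: for every factory entry point (i.e. each target site or queue), how many glideins to keep queued, how many may run at once, and which credential to use. The sketch below is a hypothetical model of such a request, not the actual glideinWMS ClassAd protocol; the class, function, and attribute names are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class GlideinRequest:
    """Hypothetical model of a frontend-to-factory provisioning request.

    Real glideinWMS frontends publish their requests as ClassAds; the
    attribute names used here are illustrative, not the real schema.
    """
    entry_name: str        # factory entry, i.e. a specific target site/queue
    idle_glideins: int     # how many unclaimed glideins to keep queued
    max_glideins: int      # upper bound on concurrently running glideins
    credential_id: str     # which proxy/credential the factory should use


def request_for_pending_jobs(entry_name, pending_jobs, credential_id,
                             pressure=0.1, cap=1000):
    """Size a request from current demand (VO Frontend style): keep a
    fraction of the pending jobs' worth of glideins idle, up to a cap."""
    idle = min(cap, max(1, int(pending_jobs * pressure)))
    return GlideinRequest(entry_name, idle, cap, credential_id)


# Example: a VO Frontend sizing a request from 5,000 idle user jobs, and a
# Corral-style request for a fixed block of resources for one workflow.
print(request_for_pending_jobs("TG_Ranger", 5000, "vo_proxy"))
print(GlideinRequest("TG_Ranger", idle_glideins=400, max_glideins=400,
                     credential_id="user_proxy"))
```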
Experiences Using GlideinWMS and the Corral Frontend Across Cyberinfrastructures, Mats Rynge, Gideon Juve, Gaurang Mehta, Ewa Deelman, Krista Larson, Burt Holzman, Igor Sfiligoi, Frank Würthwein, G. Bruce Berriman, Scott Callaghan, Proceedings of the 7th IEEE International Conference on e-Science (e-Science 2011), December 2011.
Enabling Large-scale Scientific Workflows on Petascale Resources Using MPI Master/Worker, Mats Rynge, Gideon Juve, Karan Vahi, Scott Callaghan, Gaurang Mehta, Philip J. Maechling, Ewa Deelman, XSEDE'12, July 2012.
Funded by the National Science Foundation under the OCI SDCI program, grant #0943725.