Pegasus Users Group Meeting 2021

PUG 2021 - February 23rd and February 25th 2021 – Virtual Event

2020 brought major improvements to Pegasus with the 5.0 release. To keep up the momentum in 2021 and ensure Pegasus is heading in the most beneficial direction for our users, we held our first-ever Pegasus Users Group Meeting (PUG 2021). Because of the pandemic, the event was a virtual workshop spread over two days.

PUG 2021 aimed to give Pegasus users and collaborators an opportunity to interact with Pegasus developers, share ideas, and provide feedback. The meeting was a mix of user experience talks, technical Pegasus talks, tutorials, and office hours.

If you have any questions/issues, please contact us at pegasus@isi.edu.

We use Slack as our main channel for communication and community discussions.

Agenda

Day 1: February 23rd, 2021

Time	Title of Talk	Presenter
9:00-9:10am PST noon-12:10pm EST	Introduction and Meeting Logistics	Wendy Whitcup (USC/ISI)
9:10-9:20am PST 12:10-12:20pm EST	Welcome to the Workshop	Ewa Deelman (USC/ISI)
9:10-9:20am PST 12:10-12:20pm EST	Pegasus in 5 Minutes	Ryan Tanaka (USC/ISI)
9:20-9:35am PST 12:20-12:35pm EST	Pegasus 5.0 Pegasus 5.0 is the latest stable release of Pegasus, released in November 2020. A key highlight of this release is a brand-new Python 3-based Pegasus API that allows users to compose workflows and to control their execution programmatically. This talk will give an overview of the new API and highlight key improvements that address system usability (including comprehensive yet easy-to-navigate documentation and training), the development of core functionalities for improving the management and processing of large, distributed data sets, and the management of experiment campaigns defined as ensembles.	Karan Vahi (USC/ISI)
9:35-9:50am PST 12:35-12:50pm EST	Ensemble Manager Data processing and analytics pipelines often comprise multiple workflows which are executed over an extended period of time. As such, it is necessary to be able to run and monitor groups of similar workflows, also known as ensembles. The Pegasus Ensemble Manager is a tool provided by Pegasus which enables users to create ensembles, dynamically add and stop workflows, and monitor workflow runs. Furthermore, the Pegasus Ensemble Manager can trigger workflow runs based on the arrival of new data. In this talk, we dive into the features provided by the Pegasus Ensemble Manager, discuss potential use cases, and update you on new features, soon to be released.	Ryan Tanaka (USC/ISI)
9:50-10:10am PST 12:50-1:10pm EST	Building an Integrated Assessment Model with Pegasus Integrated assessment models (IAMs) have become important tools to explore the interactions between different modeled components of socio-environmental systems (SES). One common application is projecting the impacts of potential climate change mitigation policies. Vermont EPSCoR has built an IAM framework using Pegasus for the Lake Champlain basin. The current iteration of the Vermont EPSCoR IAM includes a regional climate model, a land use land change agent-based model, two available hydrology models, and a coupled lake hydrodynamic and water quality model. The Vermont EPSCoR IAM framework is designed to allow bidirectional asynchronous feedbacks between the models and is modular to allow component models to be exchanged for another component model from the same domain. This modular design also allows for smoother upgrades between different versions of the same component model. In this session, the design decisions and tradeoffs made along the way to implement these features will be discussed.	Patrick J Clemins (University of Vermont)
10:10-10:30am PST 1:10-1:30pm EST	Weather Feature Extraction on the Academic Cloud for Drone Route Planning The concept of Urban Air Mobility (UAM) is rapidly advancing. The skies will soon be filled with small aircraft performing delivery missions, video surveillance, atmospheric sensing, and even providing ad-hoc communications networks for disaster relief efforts. However, these vehicles are sensitive to winds, precipitation, and temperature, and their abilities to tolerate these conditions can change as fast as the weather itself. For this reason, we have developed a system for monitoring observed and predicted meteorological conditions, extracting areas of risk tailored to the specific and dynamic vehicle parameters of particular flights, and generating suggested routes to efficiently avoid areas of risk as conditions evolve. This is a compute-intensive process, with many types of observations and forecast products contributing to the information base. We use the ExoGENI academic cloud as the basis for a scalable infrastructure, and the Pegasus WMS to orchestrate the various product generation and extraction routines which ultimately inform the flight pathing algorithm. Here we describe the overall system and the role of Pegasus in managing the load, and propose interfaces that may be useful going forward.	Eric Lyons (University of Massachusetts Amherst)
10:30-11:00am PST 1:30-2:00pm EST	Break
11:00-11:20am PST 2:00-2:20pm EST	Leveraging Pegasus to find colliding black holes in the data from the LIGO and Virgo gravitational-wave observatories In the last 5 years, the Advanced LIGO and Advanced Virgo observatories have opened the field of gravitational-wave astronomy by observing over 50 colliding black holes and neutron stars. These collisions are among the most extreme and energetic events in the Universe, and understanding the properties of such collisions, and the environments in which they happen, offers us a new window to understand the Universe's formation and evolution. Analysing the data taken by these observatories to find compact binary mergers relies on a technique called matched-filtering, which is embarrassingly parallel and well suited for high-throughput computing. The PyCBC codebase is one of the primary analysis toolkits used inside (and outside) the LIGO and Virgo collaborations to observe these objects. In this talk we will discuss how Pegasus is used within PyCBC to create the workflows that are used to detect compact binary mergers, and briefly explain the science that this has enabled.	Ian Harry (LIGO, University of Portsmouth)
11:20-11:40am PST 2:20-2:40pm EST	Connecting Observation and Theory: Black Hole Modeling with Large Scale Synthetic Data Generation The Event Horizon Telescope (EHT) is a Very-Long-Baseline Interferometry (VLBI) experiment that links together multiple radio telescopes around the globe and captured the first event horizon scale resolution images of a black hole in 2019. Given the sparsity of the array, EHT image reconstruction is not unique, which poses a significant challenge in interpreting EHT data. In this talk, I will present how the EHT uses Pegasus to deploy a large scale synthetic data generation pipeline on the Open Science Grid. These synthetic data sets play an important role in EHT science. First, they allow us to verify and validate our data processing and image reconstruction algorithms. Second, they help us to understand how the nature of black hole accretion flows can bias our measurements. Third, they provide a forward modeling pathway for us to compare simulations with observations. I will conclude the talk with a plan on how to use Pegasus for some other EHT workflows.	Chi-kwan Chan, Michael Janssen (Event Horizon Telescope, University of Arizona)
11:40-12:00pm PST 2:40-3:00pm EST	Running Rubin LSST Science pipelines on AWS The Legacy Survey of Space and Time (LSST, www.lsst.org), operated by the Vera C. Rubin Observatory, is a 10-year astronomical survey due to start operations in 2022 that will image half the sky every three nights. LSST will produce ~20 TB of raw data per night, which will be calibrated and analyzed in almost real time. Rubin estimates that the total amount of data collected during operations will be about 60 petabytes (PB), from which a 20 PB catalog will be produced. At these data volumes, reprocessing even a relatively small subset of data requires significant infrastructure. We describe how, by using HTCondor and Pegasus WMS and by integrating Amazon Web Services functionality into Rubin's Middleware code, we were able to execute Rubin LSST Science Pipelines at scale. We discuss the challenges and benefits of such a system and describe the performance, scalability, and cost of deploying it in the cloud.	Dino Bektešević (LSST, University of Washington)
12:00-12:20pm PST 3:00-3:20pm EST	Searching for Dark Matter with XENONnT and Pegasus XENONnT is a ton-scale liquid xenon time projection chamber that will very soon search for dark matter deep under the Gran Sasso mountains in central Italy. Despite the ultra-low background level we expect to reach in XENONnT, the low-energy threshold required to search for dark matter results in a data volume of O(PB) per year, much too large to store on a single site. We thus use Rucio for data management across multiple storage sites around the world and Pegasus for the distributed processing workflow on the Open Science Grid, which produces the high-level data used for final analyses. At the same time, Pegasus is also used for the Monte Carlo data production, a critical step in the analysis of XENONnT data. In this talk, I will summarize the science goals of XENONnT and how Pegasus is used to help us reach those goals.	Evan Shockley (University of California, San Diego)
12:20-12:35pm PST 3:20-3:35pm EST	Coffee Break
12:35-1:30pm PST 3:35-4:30pm EST	Pegasus 5.0 Tutorial Tutorial on Pegasus 5.0 using Jupyter notebooks	Mats Rynge

Day 2: February 25th, 2021

Time	Title of Talk	Presenter
9:00-9:10am PST noon-12:10pm EST	Pegasus 101	Rafael Ferreira da Silva (USC/ISI)
9:10-9:30am PST 12:10-12:30pm EST	ProkEvo: an automated, reproducible, and scalable framework for high-throughput bacterial population genomics analyses using Pegasus WMS Whole Genome Sequence (WGS) data from bacterial species is used for a variety of applications ranging from basic microbiological research to diagnostics and epidemiological surveillance. The availability of WGS data from hundreds of thousands of individual isolates of individual microbial species poses a tremendous opportunity for discovery and hypothesis-generating research into ecology and evolution of these microorganisms. Scalability and user-friendliness of existing pipelines for population-scale inquiry, however, limit applications of systematic, population-scale approaches. Here, we present ProkEvo, an automated, scalable, and open-source framework for bacterial population genomics analyses using WGS data. ProkEvo was specifically developed to achieve the following goals: 1) Automation and scaling of complex combinations of computational analyses for many thousands of bacterial genomes from inputs of raw Illumina paired-end sequence reads; 2) Use of workflow management systems (WMS) such as Pegasus WMS to ensure reproducibility, scalability, modularity, fault-tolerance, and robust file management throughout the process; 3) Use of high-performance and high-throughput computational platforms; 4) Generation of hierarchical population-based genotypes at different scales of resolution based on combinations of multi-locus and Bayesian statistical approaches for classification; 5) Detection of antimicrobial resistance (AMR) genes, putative virulence factors, and plasmids from curated databases and association with genotypic classifications; and 6) Production of pan-genome annotations and data compilation that can be utilized for downstream analysis. 
The scalability of ProkEvo was measured with two datasets comprising significantly different numbers of input genomes (one with ~2,400 genomes, and the second with ~23,000 genomes). Depending on the dataset and the computational platform used, the running time of ProkEvo varied from ~3-26 days. ProkEvo can be used with virtually any bacterial species and the Pegasus WMS facilitates addition or removal of programs from the workflow or modification of options within them. All the dependencies of ProkEvo can be distributed via conda environment or Docker image. To demonstrate versatility of the ProkEvo platform, we performed population-based analyses from available genomes of three distinct pathogenic bacterial species as individual case studies (three serovars of Salmonella enterica, as well as Campylobacter jejuni and Staphylococcus aureus). The specific case studies used reproducible Python and R scripts documented in Jupyter Notebooks and collectively illustrate how hierarchical analyses of population structures, genotype frequencies, and distribution of specific gene functions can be used to generate novel hypotheses about the evolutionary history and ecological characteristics of specific populations of each pathogen. Collectively, our study shows that ProkEvo presents a viable option for scalable, automated analyses of bacterial populations with powerful applications for basic microbiology research, clinical microbiological diagnostics, and epidemiological surveillance.	Natasha Pavlovikj (University of Nebraska-Lincoln)
9:30-9:50am PST 12:30-12:50pm EST	Performing large-scale seismic hazard analysis using Pegasus workflows Predicting the location, size, and origin times of future earthquakes is arguably the most fundamental problem in earthquake science. Modern earthquake forecasts make probabilistic statements about the future occurrence of ground motions, given a region and time period. Accurately estimating these ground motions is key to societal mitigation and preparedness efforts. To accomplish this, the Southern California Earthquake Center (SCEC) has developed a physics-based probabilistic seismic hazard analysis method, called CyberShake, that quantifies future ground motions using suites of 3D wave propagation simulations. The CyberShake code base involves over 15 codes written by 7 different developers, including large parallel CPU and GPU jobs and small serial jobs. Pegasus-WMS has enabled SCEC to flexibly distribute large production CyberShake workflows between NSF, DOE, and USC supercomputers. In this talk, we will discuss how we have utilized Pegasus to create and manage CyberShake workflows, and how Pegasus has supported the execution of large-scale suites of seismic hazard calculations running for months on thousands of nodes and generating over a petabyte of data.	Scott Callaghan (SCEC)
9:50-10:10am PST 12:50-1:10pm EST	GeoEDF: A Framework for Geospatial Research Workflows Earth science research typically involves a complex workflow of data acquisition from various remote data sources, followed by preprocessing, data preparation and fusion, and finally, simulation or analysis and validation. In practice, such workflows are still a mix of custom-built, non-reusable code, often requiring the use of desktop tools and intermediate data transfers at various stages. Researchers consequently end up spending inordinate amounts of time on data wrangling rather than focusing on actual science. GeoEDF is designed to address these inefficiencies by making remote datasets directly usable in computational code and facilitating earth science workflows that are based entirely in a science gateway. GeoEDF enables the composition of declarative, end-to-end, “plug-and-play” workflows in the earth sciences that can be executed on diverse computational resources via close integration with the Pegasus workflow management system.	Rajesh Kalyanam (Purdue University)
10:10-10:30am PST 1:10-1:30pm EST	NLP SAGA Workflows I will describe how we are using Pegasus to create repeatable, checkpointed workflows that address common patterns in natural language processing (NLP), both when training models and when decoding with interconnected pipelines. I will describe the VISTA wrapper, a wrapper for Pegasus that simplifies writing such pipelines.	Marjorie Freedman (USC/ISI)
10:30-11:00am PST 1:30-2:00pm EST	Break
11:00-11:10am PST 2:00-2:10pm EST	ML Workflows Using Pegasus Over the last decade, rapid advances in ML, and most notably its subfield of Deep Learning (DL), have resulted in enormous progress in computer vision, speech recognition, and natural language processing. The process of developing a DL model to solve a problem inherently consists of a number of steps. A typical workflow includes data acquisition, data preprocessing, model selection, hyperparameter optimization, inference, and model evaluation. Training a deep learning model is iterative, computationally expensive, and usually requires vast amounts of data. With the increase in data and in the complexity of problems, there is a need for distributed learning in HPC. In this talk, I will give an overview of the steps of a typical machine learning workflow and talk about our lab’s experience of porting scientific machine learning experiments to Pegasus WMS.	Patrycja Krawczuk (USC/ISI)
11:10-11:20am PST 2:10-2:20pm EST	Advanced Task Monitoring with Panorama With the increased prevalence of employing workflows for scientific computing and a push towards exascale computing, it has become paramount that we are able to collect and analyze characteristics of scientific applications to better understand their impact on the underlying infrastructure and vice-versa. In this talk we present the Panorama360 online monitoring architecture that collects end-to-end workflow execution and infrastructure statistics, from distributed and heterogeneous resources. The collected statistics are stored in Elasticsearch using a flexible schema format and by using Kibana and Grafana one can dive into the statistics on a workflow and a job level.	George Papadimitriou (USC/ISI)
11:20-11:30am PST 2:20-2:30pm EST	ML Analysis of Workflow Data As the scale of today's workflows rapidly increases, detecting anomalous behaviors in workflow executions has become critical to ensure timely and accurate science products. This talk presents machine learning methods that use different levels of workflow execution data collected from Pegasus WMS. It will present anomaly analysis on both workflow-level and task-level datasets collected from real workflow executions on a distributed cloud testbed, and show the workflow-level analysis with k-means clustering and three classifiers (decision tree, Naive Bayes, and Isolation Forest).	Cong Wang (RENCI)
11:30-11:40am PST 2:30-2:40pm EST	Pegasus and Open Science Grid - a perfect match! Open Science Grid (OSG) is an excellent execution environment for Pegasus workflows. In this talk we will explore the technical aspects of both OSG and Pegasus that make them work so well together. Topics will include HTCondor features, data management in distributed environments with tools like stashcp and SciTokens, and OSG Singularity containers.	Mats Rynge (USC/ISI)
11:40-11:50am PST 2:40-2:50pm EST	Pegasus HUB PegasusHub provides a curated collection of open source Pegasus workflow repositories hosted on GitHub. The main goal of this framework is to showcase community efforts for advancing science. We invite all users from the community to share their workflow repositories through this framework, which may inspire other users in their quest to tackle their scientific problems using workflows, and help Pegasus developers broaden their understanding of community needs and software usage.	Rafael Ferreira da Silva (USC/ISI)
11:50-12:00pm PST 2:50-3:00pm EST	Automated Processing of Phenotypic Data Submissions using Pegasus The NIMH Repository and Genomics Resource (NRGR) maintains biomaterials, demographic, and phenotypic data from over 200,000 well-characterized individuals with a range of psychiatric illnesses, their family members, and unaffected controls. NRGR receives these data from principal investigators of NIMH-funded studies. The center is then responsible for curating the clinical data submitted and creating collections of well-characterized, high-quality patient and control data and biosamples that are widely used for psychiatric research. Previously, the curation effort was largely manual, with ad-hoc harmonization procedures in place that led to inconsistencies and variance across studies and disorders. To streamline and formalize this process, we developed a web-based automated quality control system (AutoQC) for phenotypic data submissions. Each user submission is implemented as a distributed workflow managed by Pegasus WMS. This facilitates the addition of new curation checks into the framework. If new checks are later added, we can easily re-submit the existing, accepted distributions, resulting in improved quality of the overall datasets.	Rajiv Mayani (USC/ISI)
12:00-12:15pm PST 3:00-3:15pm EST	Q/A for Lightning Talks / Discussion Question and Answer Session for the Lightning Talks followed by discussion.	All
12:15-12:30pm PST 3:15-3:30pm EST	Closing Remarks / Coffee Break	-
12:30-1:30pm PST 3:30-4:30pm EST	Pegasus Office Hours	Pegasus Team

Code of Conduct

This workshop is dedicated to providing a welcoming and supportive environment for all people, regardless of background or identity. By participating in this workshop, participants agree to abide by this Code of Conduct and accept the procedures by which any Code of Conduct incidents are resolved. We do not tolerate behavior that is disrespectful or that excludes, intimidates, or causes discomfort to others. We do not tolerate discrimination or harassment based on characteristics that include, but are not limited to, gender identity and expression, sexual orientation, disability, physical appearance, body size, citizenship, nationality, ethnic or social origin, pregnancy, familial status, veteran status, genetic information, religion or belief (or lack thereof), membership of a national minority, property, age, education, socio-economic status, technical choices, and experience level.

Everyone who participates in workshop activities is required to conform to this Code of Conduct. It applies to all spaces managed by or affiliated with the workshop. Workshop hosts are expected to assist with the enforcement of the Code of Conduct. By participating, participants indicate their acceptance of the procedures by which the workshop resolves any Code of Conduct incidents.

Expected behavior

All participants in PUG 2021 events and communications are expected to show respect and courtesy to others. All interactions should be professional regardless of platform, whether online or in person. In order to foster a positive and professional learning environment, we encourage the following kinds of behaviors in all workshop events and platforms:

Use welcoming and inclusive language
Be respectful of different viewpoints and experiences
Gracefully accept constructive criticism
Focus on what is best for the community
Show courtesy and respect towards other community members

Unacceptable behavior

Examples of unacceptable behavior by participants at any workshop event/platform include:

written or verbal comments which have the effect of excluding people on the basis of membership of any specific group
causing someone to fear for their safety, such as through stalking, following, or intimidation
violent threats or language directed against another person
the display of sexual or violent images
unwelcome sexual attention
nonconsensual or unwelcome physical contact
sustained disruption of talks, events, or communications
insults or put-downs
sexist, racist, homophobic, transphobic, ableist, or exclusionary jokes
excessive swearing
incitement to violence, suicide, or self-harm
continuing to initiate interaction (including photography or recording) with someone after being asked to stop
publication of private communication without consent

Consequences of unacceptable behavior

If you believe someone is violating the Code of Conduct, we ask that you report it to any of the workshop organizers. Participants who are asked to stop any inappropriate behavior are expected to comply immediately. If a participant engages in behavior that violates this Code of Conduct, the organizers may warn the offender, ask them to leave the event or platform, or investigate the Code of Conduct violation and impose appropriate sanctions.

This Code of Conduct is adapted from the FABRIC Community Workshop and the excellent code of conduct articulated by Software Carpentry.