Accelerating Edge-to-Cloud Science Workflows with Pegasus on Sage


What is Sage?

Sage is a National Science Foundation (NSF)-funded national-scale cyberinfrastructure to support Artificial Intelligence (AI) linked to distributed instrumentation. The Sage project provides a robust, scalable edge-to-cloud environment designed specifically to handle diverse AI workloads and facilitate efficient storage and management of AI-analyzed time-series data via the Sage cloud environment, nicknamed “the Beehive”.

The Pegasus Workflow Management System (Pegasus WMS) significantly enhances the capabilities of Sage by providing powerful, adaptable data processing workflows that address various scientific, environmental, and analytical research needs.

The integration of Pegasus and Sage allows users to balance computation between the edge and the cloud according to the application's needs: the urgency of the computation, the size of the dataset to be processed, or the complexity of the processing.

Detailed Exploration of Pegasus WMS Use Cases within Sage

Edge-based Preprocessing and Data Reduction

In environments where immediate data filtering and preprocessing are essential, Pegasus workflows offer a highly efficient solution. Workflows can be encapsulated into containerized applications deployed directly onto the Sage edge nodes, enabling real-time data analysis at the data generation source. This approach is particularly valuable in scenarios that require quick decisions or autonomous action, or that need to identify the most important data to save or process.

For example, workflows processing video streams can select and retain frames featuring specific objects such as wildlife, vehicles, or people. This reduces data storage requirements and minimizes the bandwidth needed to transfer large data volumes, thus enhancing overall system performance and enabling quicker downstream analyses.

The example below shows a Pegasus workflow containing data-processing tasks packaged as a Sage app, which is then scheduled to run on Sage edge nodes to process sensor data and store the processed results in the Sage cloud data store. With this approach, the workflow is executed entirely within one of Sage’s edge nodes, which suits scenarios where preprocessing is needed to filter out data of no interest, for example, preserving only those frames from a video stream that contain an object of interest.

[Figure: Edge-based Preprocessing]
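The frame-selection idea described above can be illustrated with a minimal Python sketch. It is not code from the pegasus-sage repository: the `Detection` class, the label set, and the confidence threshold are hypothetical, and the detections are assumed to come from whatever object detector runs on the edge node.

```python
from dataclasses import dataclass

# Hypothetical set of labels worth keeping; adjust to the application.
LABELS_OF_INTEREST = {"bird", "vehicle", "person"}


@dataclass
class Detection:
    """One detector result for a frame (assumed shape, for illustration)."""
    label: str
    confidence: float


def keep_frame(detections, min_confidence=0.5):
    """Return True if any detection is a label of interest above the threshold."""
    return any(
        d.label in LABELS_OF_INTEREST and d.confidence >= min_confidence
        for d in detections
    )


def filter_frames(frames):
    """frames: iterable of (frame_id, detections); returns the ids to retain."""
    return [frame_id for frame_id, detections in frames if keep_frame(detections)]


# Made-up detector output for four frames.
frames = [
    ("frame-001", [Detection("bird", 0.91)]),
    ("frame-002", []),
    ("frame-003", [Detection("tree", 0.88)]),
    ("frame-004", [Detection("person", 0.32), Detection("vehicle", 0.77)]),
]

print(filter_frames(frames))  # ['frame-001', 'frame-004']
```

Only the retained frames would then be uploaded, which is where the storage and bandwidth savings come from.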

Cloud-based Intensive Analysis and Post-processing

Pegasus workflows also facilitate more computationally intensive tasks by leveraging high-performance computing (HPC) resources, academic and commercial clouds, or local computing infrastructure. These workflows are designed to efficiently retrieve pre-filtered datasets from the Sage cloud’s Beehive datastore for complex analyses. For example, researchers can utilize workflows to classify images filtered by edge nodes, categorizing them into specific biological families or species. Additionally, Pegasus workflows can support extensive statistical analyses, such as tracking and analyzing the frequency of detection of particular species over time. Researchers can then correlate this information with environmental data (e.g., temperature, humidity, rainfall) to study the impact of climate change on migration patterns and ecosystem dynamics.

In the example below, Pegasus workflows run on HPC systems, commercial or academic clouds, or your laptop, and fetch data from the Sage cloud’s Beehive data store for processing or post-processing. This approach suits tasks that are time-consuming, that are a poor fit for edge nodes, or that require data from multiple nodes. One example is classifying the images that were filtered on the edge nodes: photos that were kept because they contain birds are now processed to assign each bird to its family. Another example is computing statistics such as the number of times per day an object of interest, say a wood stork, was detected over several years. These counts can be correlated with other data such as temperature and humidity to study the climate conditions that affect migratory patterns.

[Figure: Cloud-based Postprocessing]
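As a rough illustration of the statistical post-processing described above, the sketch below counts detections per day and computes a Pearson correlation against daily temperature. The timestamps and temperatures are made up for the example, and the function names are hypothetical; a real workflow would read both series from the Beehive data store.

```python
from collections import Counter
from datetime import datetime
from math import sqrt


def daily_counts(timestamps):
    """Count detections per calendar day from ISO 8601 timestamp strings."""
    return Counter(datetime.fromisoformat(ts).date().isoformat() for ts in timestamps)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up detection times for a species of interest.
detections = [
    "2024-03-01T06:15:00", "2024-03-01T07:40:00",
    "2024-03-02T06:05:00",
    "2024-03-03T05:55:00", "2024-03-03T06:30:00", "2024-03-03T18:10:00",
]
# Made-up daily mean temperatures in degrees Celsius.
daily_temperature_c = {"2024-03-01": 20.0, "2024-03-02": 18.0, "2024-03-03": 22.0}

counts = daily_counts(detections)
days = sorted(daily_temperature_c)
r = pearson([counts[d] for d in days], [daily_temperature_c[d] for d in days])
print(dict(counts), round(r, 3))
```

With three data points the coefficient means little; the point is only the shape of the computation, which scales to years of detections.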

Technical Aspects and Implementation

Code for both scenarios has been developed and is available via the Pegasus-SAGE GitHub repository. The workflows leverage containerization technologies like Docker for consistent and portable deployment across the diverse infrastructure available within Sage. The repository includes example scripts, configurations, and detailed documentation to help users replicate and extend these workflows.

Technically, the workflows are defined using Pegasus’ Python API, allowing precise specification of task dependencies, resource requirements, and data management strategies. This structured approach simplifies workflow development, execution, and reproducibility across different computing environments.
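To make the notion of task dependencies concrete, here is a small standard-library sketch, deliberately not written against the Pegasus API. Each task declares the files it reads and writes, and the ordering is derived from that data flow, which is similar in spirit to how Pegasus can infer job dependencies from declared inputs and outputs. The task and file names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks; each declares the files it consumes and produces.
tasks = {
    "extract_frames": {"inputs": {"video.mp4"}, "outputs": {"frames.tar"}},
    "detect_objects": {"inputs": {"frames.tar"}, "outputs": {"detections.json"}},
    "select_frames": {
        "inputs": {"frames.tar", "detections.json"},
        "outputs": {"kept.tar"},
    },
}


def infer_dependencies(tasks):
    """Map each task to the set of tasks that produce one of its inputs."""
    producer = {f: name for name, t in tasks.items() for f in t["outputs"]}
    return {
        name: {producer[f] for f in t["inputs"] if f in producer}
        for name, t in tasks.items()
    }


deps = infer_dependencies(tasks)
# static_order() yields each task only after all of its predecessors.
order = list(TopologicalSorter(deps).static_order())
print(order)  # ['extract_frames', 'detect_objects', 'select_frames']
```

Files with no producer, such as `video.mp4` here, are treated as external inputs that must already exist before the workflow starts.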

By combining Pegasus WMS’s advanced workflow capabilities with the flexible edge-to-cloud framework offered by Sage, researchers and data scientists gain an integrated platform that can handle both rapid, localized data preprocessing and comprehensive, large-scale cloud-based analytics. This integration significantly accelerates the pace of discovery, enabling deeper and more meaningful insights into complex environmental and scientific phenomena.

Try It Yourself: Running Pegasus on Sage

To replicate or extend these workflows, follow these steps:

  1. Clone the Repository:
    git clone https://github.com/pegasus-isi/pegasus-sage.git
    cd pegasus-sage
  2. Review the Examples: The repo includes both edge and cloud examples with Docker configs, Pegasus abstract workflow files, and setup scripts.
  3. Deploy a Workflow:
    • Use pegasus-plan and pegasus-run to deploy workflows.
    • Modify inputs to apply your own sensor data.
  4. Monitor Output:
    • Results will be stored in the Beehive data store or your local/cloud output directory.
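The plan-and-run step can be sketched as a small Python helper that assembles the command lines. The workflow file name, submit directory, site names, and run directory below are assumptions for illustration; check the repository's setup scripts for the values each example actually uses. The helper only prints the commands, so it is safe to run on a machine without Pegasus installed.

```python
import shlex


def plan_command(workflow_file, submit_dir, exec_site, output_site):
    """Build a pegasus-plan invocation for an abstract workflow file."""
    return [
        "pegasus-plan",
        "--dir", submit_dir,            # base directory for generated submit files
        "--sites", exec_site,           # execution site from the site catalog
        "--output-sites", output_site,  # site where final outputs are staged
        workflow_file,
    ]


def run_command(run_dir):
    """Build a pegasus-run invocation; run_dir is the directory pegasus-plan reports."""
    return ["pegasus-run", run_dir]


def status_command(run_dir):
    """Build a pegasus-status invocation to monitor the submitted workflow."""
    return ["pegasus-status", run_dir]


# Assumed names: workflow.yml, submit, condorpool, local, and the run directory.
plan = plan_command("workflow.yml", "submit", "condorpool", "local")
print(shlex.join(plan))
print(shlex.join(run_command("submit/run0001")))
print(shlex.join(status_command("submit/run0001")))
```

To execute the commands for real, each list can be passed to `subprocess.run(cmd, check=True)` once Pegasus is installed and the run directory printed by `pegasus-plan` is known.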

Example Applications

  • Ecologists detecting animal presence in field videos.
  • Climate scientists studying species migration trends.
  • Environmental researchers correlating sensor data across distributed geographies.

Relevant Links

Pegasus Sage GitHub: https://github.com/pegasus-isi/pegasus-sage
Sage: https://sagecontinuum.org/
Pegasus WMS: https://pegasus.isi.edu/

Acknowledgements

This work is funded by the National Science Foundation under grant 2403051.