The USC Epigenome Center is currently using the Illumina Genome Analyzer (GA) system to generate high-throughput DNA sequence data (up to 8 billion nucleotides per week) to map the epigenetic state of human cells on a genome-wide scale.
Epigenomic workflow (computational jobs are shown as circles, data transfer jobs as rhomboids).
The Center has implemented an automated analysis pipeline using Pegasus-WMS to support these sequencing efforts. The workflow shown above consists of seven basic steps:
- transfer sequence data to the cluster storage system,
- split sequence files into multiple parts to be processed in parallel,
- convert sequence files to the appropriate file format,
- filter out noisy and contaminating sequences,
- map sequences to their genomic locations,
- merge output from individual mapping steps into a single global map, and
- use sequence maps to calculate the sequence density at each position in the genome.
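The split, filter, map, merge, and density steps above can be illustrated with a small Python sketch. This is a toy under stated assumptions, not the Center's pipeline and not Pegasus-WMS code: the function names, the tiny `GENOME` string, exact-match mapping, and the rule that a read containing `N` counts as noisy are all hypothetical choices made for illustration. The data transfer and file format conversion steps are omitted.

```python
from typing import Iterable, List, Tuple

# Hypothetical toy reference sequence (assumption, for illustration only).
GENOME = "ACGTACGTTTGACCA"


def split_reads(reads: List[str], n_parts: int) -> List[List[str]]:
    """Split the reads into n_parts chunks that could be processed in parallel."""
    return [reads[i::n_parts] for i in range(n_parts)]


def filter_reads(reads: Iterable[str]) -> List[str]:
    """Drop noisy reads; here 'noisy' simply means containing an ambiguous base N."""
    return [r for r in reads if "N" not in r]


def map_reads(reads: Iterable[str], genome: str) -> List[Tuple[int, int]]:
    """Map each read to its first exact match, recorded as (start, length)."""
    hits = []
    for read in reads:
        pos = genome.find(read)
        if pos >= 0:  # reads with no exact match are discarded
            hits.append((pos, len(read)))
    return hits


def merge_maps(parts: Iterable[List[Tuple[int, int]]]) -> List[Tuple[int, int]]:
    """Merge the per-chunk mappings into one global, position-sorted map."""
    return sorted(hit for part in parts for hit in part)


def density(hits: Iterable[Tuple[int, int]], genome_length: int) -> List[int]:
    """Count how many mapped reads cover each position in the genome."""
    depth = [0] * genome_length
    for start, length in hits:
        for i in range(start, start + length):
            depth[i] += 1
    return depth


def run_pipeline(reads: List[str], genome: str = GENOME, n_parts: int = 2) -> List[int]:
    """Run split -> filter -> map -> merge -> density on a list of reads."""
    chunks = split_reads(reads, n_parts)
    mapped = [map_reads(filter_reads(chunk), genome) for chunk in chunks]
    return density(merge_maps(mapped), len(genome))


if __name__ == "__main__":
    print(run_pipeline(["ACGT", "TTTG", "ACNT", "GACC", "CGTA"]))
```

In the sketch each chunk is filtered and mapped independently, which is what lets a workflow system schedule those jobs in parallel before a single merge job combines the results.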
The Epigenome Center uses this workflow to process its production DNA methylation and histone modification data. The workflow currently implements only the minimum steps needed to analyze the data effectively; we are working to add quality control and checkpoint steps to make the pipeline more robust.
Scientists: Ben Berman and Peter Laird, USC Epigenome Center
Publications: Gideon Juve, Ewa Deelman, Karan Vahi, Gaurang Mehta, Bruce Berriman, Benjamin P. Berman and Phil Maechling. Scientific Workflow Applications on Amazon EC2. Workshop on Cloud-based Services and Applications in conjunction with the 5th IEEE International Conference on e-Science (e-Science 2009), Oxford, UK, December 9-11, 2009.