PACT: A pipeline for analysis of circulating tumor DNA

Developed by the Christopher Maher Lab at Washington University in St. Louis in collaboration with the labs of Dr. Aadel Chaudhuri and Dr. Russell Pachynski.

Overview

PACT provides standardized workflows for sensitive and reproducible detection of both small and large genomic alterations using targeted ctDNA sequencing, shared as Common Workflow Language (CWL) pipelines.

For additional details and benchmarking, see: Jace Webster, Ha X Dang, Pradeep S Chauhan, Wenjia Feng, Alex Shiang, Peter K Harris, Russell K Pachynski, Aadel A Chaudhuri, Christopher A Maher. 2023. PACT: A pipeline for analysis of circulating tumor DNA. Bioinformatics. 39(8). doi:10.1093/bioinformatics/btad489

Quick Start

Download the repository with `git clone https://github.com/ChrisMaherLab/PACT.git`.

A number of tools exist for running CWL pipelines. In our benchmarking analysis, all pipelines were run using the Cromwell CWL interpreter (v54), which is available for download from the Cromwell project. For additional information about using Cromwell, we suggest the Cromwell user guide and configuration tutorials.

PACT is designed for use in high performance computing (HPC) environments. Because HPC configurations vary widely between institutions, a comprehensive guide to configuring different CWL interpreters for specific HPCs is not possible here. We highly recommend reviewing the above documentation (or the documentation for your preferred CWL interpreter) to ensure correct integration with your HPC.

After installation and configuration of Cromwell (if that is your preferred interpreter), the pipeline(s) can be run using:

`java -Dconfig.file=<config.file> -jar <cromwell.jar> run -t cwl -i <input_yaml> pipelines/<pipeline>.cwl`
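
When launching runs from a script, it can help to assemble the command above as an argument list rather than a single string. The following is a minimal Python sketch of that idea; the function name `build_cromwell_command` and its parameter names are hypothetical and not part of PACT or Cromwell. It only mirrors the flags shown in the command above.

```python
from typing import List


def build_cromwell_command(config_file: str, cromwell_jar: str,
                           input_yaml: str, pipeline_cwl: str) -> List[str]:
    """Return the Cromwell invocation shown above as an argument list.

    All four arguments are paths chosen by the user; nothing is validated here.
    """
    return [
        "java",
        f"-Dconfig.file={config_file}",  # Cromwell configuration for your HPC
        "-jar", cromwell_jar,
        "run",
        "-t", "cwl",                     # workflow type
        "-i", input_yaml,                # input yaml for the chosen pipeline
        pipeline_cwl,                    # a CWL file from the pipelines directory
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems when paths contain spaces.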

For additional information about writing, reading and using CWL files, see the official CWL user guide.

To help ensure proper installation and setup, example files (sample bam, matched control bam, healthy bam, targeted regions bed, blacklist bed, low complexity regions bed) are located in the `example_data` folder. Note that, due to file size, git lfs may be required for download. To run these files with the SV pipeline, the hg19 reference genome and annotation are also needed (see instructions at the bottom of this page for installation). If run correctly, the output from the SV pipeline should be consistent with the output file at `example_data/example.out.bedpe`, which describes a single translocation between chromosomes 10 and 13. The `example_ymls/sv_example.yml` file can be used to run this analysis, but the file paths in the yml will need to be updated to reflect your PACT installation and the locations of your genome reference and annotation data.
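
One rough way to check a test run against the expected output is to compare the chromosome pairs reported in the two BEDPE files. The Python sketch below assumes the conventional BEDPE column layout (first chromosome in column 1, second chromosome in column 4, tab-delimited) and that any header lines start with `#`; both are assumptions about the output file rather than documented PACT behavior. The function name `chromosome_pairs` is hypothetical.

```python
from typing import Set, Tuple


def chromosome_pairs(bedpe_text: str) -> Set[Tuple[str, str]]:
    """Collect the unordered chromosome pair of every record in BEDPE text."""
    pairs = set()
    for line in bedpe_text.splitlines():
        # Skip blank lines and comment/header lines (assumed to start with '#').
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 6:
            raise ValueError(f"expected at least 6 BEDPE columns, found {len(fields)}")
        # Sort so that (chr13, chr10) and (chr10, chr13) compare equal.
        pairs.add(tuple(sorted((fields[0], fields[3]))))
    return pairs
```

Reading your own output file and `example_data/example.out.bedpe` into strings and comparing the two returned sets gives a quick consistency check. It ignores breakpoint coordinates, so it does not replace a manual look at the files.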

Structure

This repository is organized as follows:

Directory	Description
pipelines	Full workflows, which rely on subworkflows and tools
subworkflows	Workflows called by pipelines that combine tools to form intermediate files
tools	Individual steps in the workflow containing single commands or scripts
example_ymls	Example format for input yml files using minimal inputs
example_data	Example input and output data for setup and testing purposes

Inputs

The provided workflows accept a variety of required and optional input files. Example input yaml files are provided in the example_ymls directory; they contain all required inputs and a brief description of expected values. Further optional inputs allow additional customization of the pipeline(s) and are listed in the inputs section of the corresponding CWL file in the pipelines directory.

Common/required inputs are described below, including how to label the information in an input yaml file, the workflows the file is used in, and a brief description.

Reference Genome Inputs

Input label	Applicable workflow(s)	Description
reference	All workflows (required)	Reference genome fasta file. A .fai index file made using `samtools faidx` and a .dict file made using Picard's `CreateSequenceDictionary` command should be present in the same directory as the fasta file.
ref_genome	SV and CNA workflows (required)	Name of the reference genome used. Should match the name used by any applicable annotation databases (e.g. hg19)
ref_flat	CNA workflow (required)	Genome annotation file in refFlat format
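
Before starting a run, it may be worth confirming that the index files sit next to the reference fasta. The Python sketch below assumes the common naming conventions `<fasta>.fai` (the default output of `samtools faidx`) and `<fasta without extension>.dict`; the .dict name in particular is an assumption and may differ depending on how `CreateSequenceDictionary` was invoked. The function name `missing_reference_indexes` is hypothetical.

```python
from pathlib import Path
from typing import List


def missing_reference_indexes(fasta: str) -> List[Path]:
    """Return the expected index files that do not exist next to the fasta."""
    fasta_path = Path(fasta)
    expected = [
        # samtools faidx appends .fai to the full fasta file name
        fasta_path.with_name(fasta_path.name + ".fai"),
        # assumed convention: the fasta extension is replaced by .dict
        fasta_path.with_suffix(".dict"),
    ]
    return [path for path in expected if not path.exists()]
```

An empty list means both files were found under the assumed names. The check says nothing about whether the indexes were built from the same fasta.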

Annotation Information

Input label	Applicable workflow(s)	Description
snpEff_data	SV workflow (required)	snpEff annotation database directory. This can be downloaded using snpEff's download command: `java -jar snpEff.jar download <database>`.
vep_cache_dir	SNV workflow (required)	VEP annotation cache directory. See the Ensembl website (https://useast.ensembl.org/info/docs/tools/vep/script/vep_cache.html) for information about downloading the cache.
vep_ensembl_assembly	SNV workflow (required)	A string containing the name of the genome assembly associated with the provided VEP cache (e.g. GRCh37)
vep_ensembl_version	SNV workflow (required)	A string containing the version number of the provided cache (e.g. 106)
all_genes	CNA workflow (required)	Bed file of all annotated genes. The first three columns are standard bed format, the 4th column has the gene name, the 5th column has a score value (an arbitrary number that is not used), and the 6th column has the +/- strand. No header is expected.
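
The column layout described for all_genes can be checked with a few lines of Python before a run. This is a minimal sketch that assumes a tab-delimited file; the function name `check_all_genes_bed` is hypothetical, and the checks are limited to the column count, integer coordinates, and the strand symbol.

```python
from typing import List


def check_all_genes_bed(lines: List[str]) -> List[str]:
    """Return a list of layout problems; an empty list means none were found."""
    problems = []
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            problems.append(f"line {number}: expected 6 columns, found {len(fields)}")
            continue
        _chrom, start, end, _gene, _score, strand = fields[:6]
        # A header row typically fails this check, since no header is expected.
        if not (start.isdigit() and end.isdigit()):
            problems.append(f"line {number}: start/end are not integers (header line?)")
        if strand not in ("+", "-"):
            problems.append(f"line {number}: strand is {strand!r}, expected + or -")
    return problems
```

Calling it as `check_all_genes_bed(open(path).readlines())` on the file you plan to pass as all_genes lists any offending lines by number.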

Region and Variant Information

Input label	Applicable workflow(s)	Description
target_regions	All workflows (required)	A bed file containing the genomic regions covered by the targeted panel used for sequencing
neither_region	SV workflow (required)	A bed file. All SVs that contain a breakpoint within these regions will be discarded. We recommend the blacklist regions provided by 10xgenomics. Their hg19 bed file can be found here: http://cf.10xgenomics.com/supp/genome/hg19/sv_blacklist.bed.
notboth_region	SV workflow (required)	A bed file. SVs with >1 breakpoint within these regions will be discarded. We recommend Heng Li's low complexity regions, found here: https://github.com/lh3/varcmp/raw/master/scripts
sv_whitelist	SV workflow (optional)	A bed file. Contains regions that include expected SV breakpoint sites. This will reduce the read support requirement for SVs from these regions, which will allow the user to manually review variants of interest.
whitelist_vcf	SNV workflow (required)	VCF and accompanying .tbi index file (created using the `tabix -p vcf` command). The VCF lists any whitelisted SNVs/Indels. The VCF file may be empty (but still properly formatted) if desired
target_genes	CNA workflow (required)	Bed file describing all genes targeted by the target panel. First three columns are standard bed format, 4th column is gene name, 5th column is description. Copy number control genes should be labeled as 'CN-control' in the description, all others can use any desired description
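
To confirm that copy number control genes are labeled as intended in a target_genes file, the rows can be grouped by their description column. The Python sketch below assumes a tab-delimited file with no header; the function name `split_target_genes` and the `"other"` group key are hypothetical.

```python
from typing import Dict, List


def split_target_genes(lines: List[str]) -> Dict[str, List[str]]:
    """Group gene names into copy number controls and all other targets."""
    groups: Dict[str, List[str]] = {"CN-control": [], "other": []}
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            raise ValueError(f"line {number}: expected 5 columns, found {len(fields)}")
        gene, description = fields[3], fields[4]
        # Only an exact 'CN-control' description marks a control gene.
        key = "CN-control" if description == "CN-control" else "other"
        groups[key].append(gene)
    return groups
```

An empty `"CN-control"` list is a sign that the label was misspelled or omitted in the bed file.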

Samples and Controls

Input label	Applicable workflow(s)	Description
sample_bams	All workflows (required)	An array of bam files that contain reads generated from targeted sequencing of cfDNA. Arrays can be provided in the input .yaml file as described by the CWL user guide or as shown in our example input .yamls
matched_control_bams	All workflows (required)	An array of matched control bam files. The order of this array should match the order of the sample_bams array (e.g. the nth entry in both arrays should correspond to the nth patient)
panel_of_normal_bams	All workflows (required)	An array of bam files containing reads from healthy, normal samples sequenced using the same targeted panel used on the samples/matched controls. If such a panel is unavailable, this panel can instead be composed of any available matched control samples.
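
Because sample_bams and matched_control_bams are matched by position, generating both arrays from one list of patient pairs reduces the chance of misordering them. The Python sketch below writes the arrays in the CWL input-object style (`class: File` with a `path`). The function names are hypothetical, and the paths are assumed to be plain strings that need no YAML quoting. Compare the result with the files in example_ymls before relying on it.

```python
from typing import List


def render_file_array(label: str, paths: List[str]) -> str:
    """Render one array of CWL File entries as yaml text."""
    lines = [f"{label}:"]
    for path in paths:
        lines.append("  - class: File")
        lines.append(f"    path: {path}")
    return "\n".join(lines)


def render_paired_inputs(sample_bams: List[str],
                         matched_control_bams: List[str]) -> str:
    """Render both arrays, refusing input whose lengths differ."""
    if len(sample_bams) != len(matched_control_bams):
        raise ValueError(
            f"{len(sample_bams)} sample bams but "
            f"{len(matched_control_bams)} matched control bams"
        )
    return "\n".join([
        render_file_array("sample_bams", sample_bams),
        render_file_array("matched_control_bams", matched_control_bams),
    ])
```

The length check only catches missing entries. It cannot detect two patients whose controls were swapped, so the pairing itself still needs to be correct in the source list.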
