PACT: A pipeline for analysis of circulating tumor DNA

Developed by the Christopher Maher Lab at Washington University in St. Louis in collaboration with the labs of Dr. Aadel Chaudhuri and Dr. Russell Pachynski.

Overview

Standardized workflows for sensitive and reproducible detection of both small and large genomic alterations using targeted ctDNA sequencing, shared in a Common Workflow Language (CWL) pipeline.

For additional details and benchmarking, see: Jace Webster, Ha X Dang, Pradeep S Chauhan, Wenjia Feng, Alex Shiang, Peter K Harris, Russell K Pachynski, Aadel A Chaudhuri, Christopher A Maher. 2023. PACT: A pipeline for analysis of circulating tumor DNA. Bioinformatics. 39(8). doi:10.1093/bioinformatics/btad489

Quick Start

Download the repository with git clone https://github.com/ChrisMaherLab/PACT.git

Several tools can run CWL pipelines. In our benchmarking analysis, all pipelines were run with the Cromwell CWL interpreter (v54), which is available from the Cromwell releases page on GitHub. For more information about using Cromwell, see its user guide and configuration tutorials.

PACT is designed for use in high-performance computing (HPC) environments. Because HPC setups vary widely between institutions, a comprehensive guide to configuring each CWL interpreter for each HPC is not possible here. We highly recommend reviewing the documentation above (or the documentation for your preferred CWL interpreter) to ensure correct integration with your HPC.

After installation and configuration of Cromwell (if that is your preferred interpreter), the pipeline(s) can be run using:

java -Dconfig.file=<config.file> -jar <cromwell.jar> run -t cwl -i <input_yaml> pipelines/<pipeline>.cwl
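
When the pipeline is launched from a script rather than typed by hand, the same command can be assembled programmatically. The following is a minimal sketch, assuming Python 3.8+ is available on the submitting machine; the helper names `build_cromwell_command` and `missing_files` are hypothetical and not part of PACT, and only the flags shown in the command above are used.

```python
import shlex
from pathlib import Path


def build_cromwell_command(config_file, cromwell_jar, input_yaml, pipeline_cwl):
    """Return the Cromwell invocation shown above as an argument list."""
    return [
        "java",
        f"-Dconfig.file={config_file}",
        "-jar", str(cromwell_jar),
        "run",
        "-t", "cwl",
        "-i", str(input_yaml),
        str(pipeline_cwl),
    ]


def missing_files(*paths):
    """List the given paths that do not exist, so typos surface before Cromwell starts."""
    return [str(p) for p in paths if not Path(p).is_file()]


if __name__ == "__main__":
    # Every path below is a placeholder; substitute your own files.
    paths = ("<config.file>", "<cromwell.jar>", "<input_yaml>", "pipelines/<pipeline>.cwl")
    print("Missing:", missing_files(*paths))
    # Print a copy-pasteable, shell-quoted command instead of running it.
    print(shlex.join(build_cromwell_command(*paths)))
```

Passing the returned list to `subprocess.run` avoids shell quoting problems with paths that contain spaces.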

For additional information about writing, reading and using CWL files, see the official CWL user guide.

To help verify installation and setup, example files (sample bam, matched control bam, healthy bam, targeted regions bed, blacklist bed, low complexity regions bed) are provided in the example_data folder. Because of their size, git lfs may be required to download them. Running these files through the SV pipeline also requires the hg19 reference genome and annotation (see the instructions at the bottom of this page for installation). If the run is correct, the output from the SV pipeline should be consistent with example_data/example.out.bedpe, which describes a single translocation between chromosomes 10 and 13. The example_ymls/sv_example.yml file can be used to run this analysis, but the file paths in the yml must be updated to reflect your PACT installation and the locations of your genome reference and annotation data.
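
One way to compare a test run against the example output is to summarize which chromosome pairs each BEDPE file reports. The sketch below is only illustrative. It assumes the standard BEDPE layout (chrom1, start1, end1, chrom2, start2, end2 in the first six tab-separated columns) and that any header line starts with `#`; the function names and the script name are hypothetical, and PACT's actual output may carry extra columns.

```python
def parse_bedpe(lines):
    """Parse the first six BEDPE columns of each record, skipping blanks and '#' lines."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 6:
            raise ValueError(f"expected at least 6 columns, got {len(fields)}: {line!r}")
        chrom1, start1, end1, chrom2, start2, end2 = fields[:6]
        records.append((chrom1, int(start1), int(end1), chrom2, int(start2), int(end2)))
    return records


def chromosome_pairs(records):
    """Return the sorted, order-independent chromosome pairs joined by the SVs."""
    return sorted({tuple(sorted((rec[0], rec[3]))) for rec in records})


if __name__ == "__main__":
    import sys

    # Hypothetical usage:
    #   python compare_bedpe.py <your_output.bedpe> example_data/example.out.bedpe
    for path in sys.argv[1:]:
        with open(path) as handle:
            print(path, chromosome_pairs(parse_bedpe(handle)))
```

If both files report the same chromosome pairs, the breakpoint coordinates are worth comparing next.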

Structure

This repository is organized as follows:

| Directory | Description |
| --- | --- |
| pipelines | Full workflows, which rely on subworkflows and tools |
| subworkflows | Workflows called by pipelines that combine tools to form intermediate files |
| tools | Individual steps in the workflow containing single commands or scripts |
| example_ymls | Example format for input yml files using minimal inputs |
| example_data | Example input and output data for setup and testing purposes |

Inputs

The provided workflows accept a mix of required and optional input files. Example input yaml files in the example_ymls directory contain all required inputs, each with a brief description of expected values. Further inputs for customizing the pipeline(s) are listed in the inputs section of the corresponding CWL file in the pipelines directory.

Common/required inputs are described below, including how to label the information in an input yaml file, the workflows the file is used in, and a brief description.

Reference Genome Inputs

| Input label | Applicable workflow(s) | Description |
| --- | --- | --- |
| reference | All workflows (required) | Reference genome fasta file. A .fai index file made using samtools faidx and a .dict file made using Picard's CreateSequenceDictionary command should be present in the same directory as the fasta file. |
| ref_genome | SV and CNA workflows (required) | Name of the reference genome used. Should match the name used by any applicable annotation databases (e.g. hg19). |
| ref_flat | CNA workflow (required) | Genome annotation file in refFlat format. |
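
Before launching a run, it can help to confirm that the index files for the reference are in place. The sketch below is a hedged example rather than anything PACT ships. It assumes the `.fai` file is named `<fasta>.fai` (the name samtools faidx writes by default) and that the `.dict` file is named either `<basename>.dict` or `<fasta>.dict`. Both function names are hypothetical.

```python
import os


def expected_index_paths(fasta_path):
    """Return candidate index file names for a reference fasta (naming is an assumption)."""
    base, _ext = os.path.splitext(fasta_path)
    return {
        "fai": [fasta_path + ".fai"],
        "dict": [base + ".dict", fasta_path + ".dict"],
    }


def missing_reference_indexes(fasta_path):
    """Return the index kinds ('fai', 'dict') for which no candidate file exists."""
    missing = []
    for kind, candidates in expected_index_paths(fasta_path).items():
        if not any(os.path.isfile(path) for path in candidates):
            missing.append(kind)
    return missing


if __name__ == "__main__":
    # Placeholder path; point this at your own reference fasta.
    print(missing_reference_indexes("/path/to/hg19.fa"))
```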

Annotation Information

| Input label | Applicable workflow(s) | Description |
| --- | --- | --- |
| snpEff_data | SV workflow (required) | snpEff annotation database directory. This can be downloaded using snpEff's download command: java -jar snpEff.jar download <database>. |
| vep_cache_dir | SNV workflow (required) | VEP annotation cache directory. See the Ensembl website (https://useast.ensembl.org/info/docs/tools/vep/script/vep_cache.html) for information about downloading the cache. |
| vep_ensembl_assembly | SNV workflow (required) | A string containing the name of the genome assembly associated with the provided VEP cache (e.g. GRCh37). |
| vep_ensembl_version | SNV workflow (required) | A string containing the version number of the provided cache (e.g. 106). |
| all_genes | CNA workflow (required) | Bed file of all annotated genes. The first three columns are standard bed format, the 4th column is the gene name, the 5th column is a score value (an arbitrary number that is not used), and the 6th column is the +/- strand. No header is expected. |
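
The six-column layout described for all_genes can be checked before a CNA run. The following is a minimal sketch, assuming tab-separated columns. It reports problems rather than fixing them, and the function name `check_all_genes_bed` is hypothetical.

```python
def check_all_genes_bed(lines):
    """Return a list of human-readable problems; an empty list means the file looks fine."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            problems.append(f"line {number}: expected 6 columns, found {len(fields)}")
            continue
        chrom, start, end, gene, score, strand = fields[:6]
        try:
            start, end = int(start), int(end)
        except ValueError:
            # A header row lands here, since its coordinates are not integers.
            problems.append(f"line {number}: start/end are not integers (header row?)")
            continue
        if start < 0 or end < start:
            problems.append(f"line {number}: invalid interval {start}-{end}")
        if strand not in ("+", "-"):
            problems.append(f"line {number}: strand must be '+' or '-', found {strand!r}")
    return problems


if __name__ == "__main__":
    # Made-up record, for illustration only.
    print(check_all_genes_bed(["chr1\t1000\t2000\tGENE_A\t0\t+\n"]))
```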

Region and Variant Information

| Input label | Applicable workflow(s) | Description |
| --- | --- | --- |
| target_regions | All workflows (required) | A bed file containing the genomic regions covered by the targeted panel used for sequencing. |
| neither_region | SV workflow (required) | A bed file. All SVs that contain a breakpoint within these regions will be discarded. We recommend the blacklist regions provided by 10xgenomics. Their hg19 bed file can be found here: http://cf.10xgenomics.com/supp/genome/hg19/sv_blacklist.bed. |
| notboth_region | SV workflow (required) | A bed file. SVs with >1 breakpoint within these regions will be discarded. We recommend Heng Li's low complexity regions, found here: https://github.com/lh3/varcmp/raw/master/scripts |
| sv_whitelist | SV workflow (optional) | A bed file containing regions that include expected SV breakpoint sites. This reduces the read support requirement for SVs from these regions, which allows the user to manually review variants of interest. |
| whitelist_vcf | SNV workflow (required) | VCF and accompanying .tbi index file (created with the tabix -p vcf command). The VCF lists any whitelisted SNVs/indels. It may be empty (but still properly formatted) if desired. |
| target_genes | CNA workflow (required) | Bed file describing all genes targeted by the target panel. The first three columns are standard bed format, the 4th column is the gene name, and the 5th column is a description. Copy number control genes should be labeled as 'CN-control' in the description; all others can use any desired description. |
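
The difference between neither_region and notboth_region is easiest to see in code. The sketch below illustrates the filtering rule as described in the table. It is not PACT's implementation. It assumes each SV has exactly two breakpoints given as (chromosome, 0-based position) and treats bed intervals as half-open; all function names are hypothetical.

```python
from collections import defaultdict


def load_regions(bed_lines):
    """Group bed intervals by chromosome, using the first three columns only."""
    regions = defaultdict(list)
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions[chrom].append((int(start), int(end)))
    return regions


def in_regions(chrom, pos, regions):
    """True if pos falls inside any half-open [start, end) interval on chrom."""
    return any(start <= pos < end for start, end in regions.get(chrom, ()))


def keep_sv(breakpoints, neither, notboth):
    """Apply both region rules to one SV given as two (chrom, pos) breakpoints."""
    # neither_region: discard if ANY breakpoint lies in these regions.
    if any(in_regions(chrom, pos, neither) for chrom, pos in breakpoints):
        return False
    # notboth_region: discard only if MORE THAN ONE breakpoint lies in these regions.
    hits = sum(in_regions(chrom, pos, notboth) for chrom, pos in breakpoints)
    return hits <= 1


if __name__ == "__main__":
    # Made-up intervals and breakpoints, for illustration only.
    neither = load_regions(["chr1\t100\t200\n"])
    notboth = load_regions(["chr2\t0\t50\n", "chr3\t0\t50\n"])
    print(keep_sv([("chr2", 10), ("chr5", 10)], neither, notboth))
```

An SV with a single breakpoint in a low complexity region survives, while one with both breakpoints in such regions is dropped.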

Samples and Controls

| Input label | Applicable workflow(s) | Description |
| --- | --- | --- |
| sample_bams | All workflows (required) | An array of bam files that contain reads generated from targeted sequencing of cfDNA. Arrays can be provided in the input .yaml file as described in the CWL user guide or as shown in our example input .yamls. |
| matched_control_bams | All workflows (required) | An array of matched control bam files. The order of the array should be the same as the order of the sample_bams array (e.g. the nth entry in both arrays should correspond to the nth patient). |
| panel_of_normal_bams | All workflows (required) | An array of bam files containing reads from healthy, normal samples sequenced using the same targeted panel used on the samples/matched controls. If such a panel is unavailable, it can instead be composed of any available matched control samples. |
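
Because sample_bams and matched_control_bams are paired by position, a small helper can catch mismatched arrays before a run. The sketch below is a hedged example. It builds the three arrays as CWL `File` objects (`class: File` with a `path`) and prints them as JSON, which is also valid YAML. All bam paths are placeholders, the helper names are hypothetical, and the provided example input .yamls remain the reference for the exact layout your interpreter expects.

```python
import json


def cwl_file(path):
    """Describe a path as a CWL File object."""
    return {"class": "File", "path": path}


def build_sample_inputs(sample_bams, matched_control_bams, panel_of_normal_bams):
    """Build the sample/control input arrays, enforcing one matched control per sample."""
    if len(sample_bams) != len(matched_control_bams):
        raise ValueError(
            f"{len(sample_bams)} sample bams but {len(matched_control_bams)} matched "
            "controls; the nth entries must belong to the same patient"
        )
    return {
        "sample_bams": [cwl_file(p) for p in sample_bams],
        "matched_control_bams": [cwl_file(p) for p in matched_control_bams],
        "panel_of_normal_bams": [cwl_file(p) for p in panel_of_normal_bams],
    }


if __name__ == "__main__":
    # Placeholder paths; replace with your own bam files.
    inputs = build_sample_inputs(
        ["/data/patient1.plasma.bam", "/data/patient2.plasma.bam"],
        ["/data/patient1.control.bam", "/data/patient2.control.bam"],
        ["/data/healthy1.bam", "/data/healthy2.bam"],
    )
    print(json.dumps(inputs, indent=2))
```

The remaining required inputs (reference, target_regions, and so on) would be merged into the same mapping before it is written to the input file.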
