README for ASAP v2.1

Introduction

ASAP is a flexible bioinformatics pipeline for ATAC-seq data analysis. Starting from raw ATAC-seq sequencing reads, ASAP outputs raw and filtered mapping files, coverage files (read coverage and Tn5 insertion event coverage), the fragment length distribution, reads extracted by fragment length, and peak calling results.

Overview of major steps

  • Mapping
  • Post-mapping processing and filtering:
    • Filter (or not) reads that fall into user-defined blacklisted regions
    • Select reads that carry no more than maxMis mismatches and that meet the minimum mapping quality (MAPQ)
    • Mark duplicated pairs
    • Select concordant, non-duplicated pairs.
    • Shift reads by 4 bp toward the center of the transposition event, as described in Schep et al., 2015
  • Compute read coverage
  • Compute insertion events coverage
  • Fragment length distribution
  • Extract read pairs based on a fragment length range and compute arcs between fragment extremities (protection visualization)
  • Peak calling
ASAP is:

User-friendly: ASAP requires a single configuration file, so only one option is needed on the command line (see Usage of ASAP).

Flexible: any step can be skipped, and specific post-processing steps can be run on their own.

Dependencies

Usage of ASAP

A configuration file is required to execute the pipeline.

bash ASAP.sh [-h] [-v] [-c CONFIGFILE]

Options

-c CONFIGFILE

This is the only REQUIRED parameter for ASAP. The configuration file is a text file that gathers the full set of parameters required to execute the pipeline (see the example ASAP_configFile_example.conf in the distribution).

-h/-v

Print the help message (-h) or the current version (-v).

About the configuration file:

The configuration file gathers the parameters of each step. When the pipeline runs, only the "turned on" steps are performed. A step is turned on or off with a yes/no argument.
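
As an illustration of the yes/no switches, here is a minimal Python sketch that reads a configuration text and lists the steps that are turned on. The `KEY=value` syntax with `#` comments is an assumption (this README does not show the file syntax; ASAP_configFile_example.conf is the authoritative reference), and the functions `parse_config` and `enabled_steps`, as well as all example values, are hypothetical. Only the parameter names come from this README.

```python
# Step switches named in this README (each takes "yes" or "no").
STEP_SWITCHES = ["map", "filter", "readCoverage", "ieventsCoverage",
                 "extractReads", "fragDist", "callpeak"]


def parse_config(text):
    """Parse KEY=value lines into a dict (assumed syntax, '#' starts a comment)."""
    params = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        params[key.strip()] = value.strip().strip('"')
    return params


def enabled_steps(params):
    """Return the switches set to 'yes'; a missing switch is treated as 'no'."""
    return [s for s in STEP_SWITCHES if params.get(s, "no").lower() == "yes"]


# Hypothetical configuration: skip mapping, filter a provided BAM file,
# and compute the fragment length distribution.
example = """
OUTDIR=results
sampleName=my_sample
map=no
BAM=my_sample.bam
filter=yes
fragDist=yes
callpeak=no
"""

params = parse_config(example)
print(enabled_steps(params))
```

With this hypothetical configuration, only the filtering and fragment length steps would be reported as turned on.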

The sets of parameters to fill in the configuration file are listed below:

General parameters

General information about the run. These parameters must always be filled in.

OUTDIR:Main output directory where results are written. OUTDIR is created if it does not exist
sampleName:Name of the processed sample. Spaces are not allowed; use _ or - instead if needed
CHRLEN:Chromosome info file (tab-delimited format: <Chr name><chr length>)
path:Full path to the different dependencies, if not already added to $PATH

Mapping step parameters

map:Set to "yes/no". If mapping is skipped (map=no), a BAM file must be provided to proceed (see Post mapping steps).
FASTQ1:FASTQ file (R1). The file can be gzipped
FASTQ2:FASTQ file (R2). The file can be gzipped
bowtieIndex:Prefix of bowtie2 indexes
mappingParameters:Bowtie2 mapping parameters. Default: --very-sensitive -X 2000 -p 10

Post mapping steps

It is possible to skip the mapping step (map=no) and perform any of the post-mapping steps. To do so, aligned reads must be provided in a BAM file. If map=yes, the "turned on" post-mapping steps are performed on the internal mapping results.

BAM:Alignment file in BAM format

Filtering parameters

filter:Set to "yes/no". If map=yes, filtering is performed on the internal mapping results. If map=no, filtering is performed on the alignment file provided in the BAM option.
maxMis:Maximum number of mismatches allowed per read.
blacklist:Set to "yes/no" if reads should be filtered based on a list of blacklisted regions. If blacklist=yes, blacklisted regions must be provided in the next parameter.
blacklistedRegions:Regions used to filter reads (tab-delimited format: <Chr name><start><end>)
shift:Set to "yes/no". If shift=yes, reads are shifted by 4 bp so that read starts reflect the center of the Tn5 transposition event
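
The shift can be pictured with a small Python sketch. It assumes that "toward the center" means +4 bp for reads on the forward strand and -4 bp for reads on the reverse strand, applied to the 5' end of each read, and that coordinates are 0-based and half-open. These conventions and the function name `shift_five_prime` are assumptions made for illustration; the README does not state how ASAP implements the shift internally.

```python
def shift_five_prime(start, end, strand, offset=4):
    """Return the shifted 5' position of an aligned read.

    start, end: 0-based, half-open alignment interval (assumed convention).
    strand: '+' for forward reads, '-' for reverse reads.
    offset: shift in bp (4, as stated in this README).
    """
    if strand == "+":
        # The 5' end of a forward read is its leftmost base.
        return start + offset
    if strand == "-":
        # The 5' end of a reverse read is its rightmost base.
        return (end - 1) - offset
    raise ValueError(f"unknown strand: {strand!r}")


print(shift_five_prime(100, 150, "+"))  # 104
print(shift_five_prime(300, 350, "-"))  # 345
```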

Coverage

readCoverage:Set to "yes/no" if read coverage should be computed or not
ieventsCoverage:Set to "yes/no" if Tn5 insertion events coverage should be computed or not
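
To make the difference between the two coverage types concrete, the following Python sketch counts both on a toy input. Read coverage counts every base spanned by a read, whereas insertion event coverage counts only the single position at the 5' end of each read. The tuple layout, the 0-based half-open coordinates, and the function names are assumptions for illustration, not the way ASAP computes its bigWig files.

```python
from collections import Counter


def read_coverage(reads):
    """Count, for each position, how many reads span it."""
    cov = Counter()
    for chrom, start, end, _strand in reads:
        for pos in range(start, end):
            cov[(chrom, pos)] += 1
    return cov


def insertion_coverage(reads):
    """Count only the 5' end of each read (one insertion event per read)."""
    cov = Counter()
    for chrom, start, end, strand in reads:
        pos = start if strand == "+" else end - 1
        cov[(chrom, pos)] += 1
    return cov


# Two toy reads as (chrom, start, end, strand), 0-based and half-open.
reads = [("Chr1", 10, 13, "+"), ("Chr1", 11, 14, "-")]
print(sorted(read_coverage(reads).items()))
print(sorted(insertion_coverage(reads).items()))
```

In this toy case the read coverage spans positions 10 to 13, while the insertion event coverage is non-zero only at positions 10 and 13.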

Read extraction

extractReads:Set to "yes/no" if read pairs should be extracted based on a given range of fragment length
lowBoundary:Lower boundary of the range: [lowBoundary,upBoundary]. Default=100
upBoundary:Upper boundary of the range: [lowBoundary,upBoundary]. Default=250
arcs:Set to "yes/no" if extracted fragments should be represented as arcs (linked extremities)
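
A minimal Python sketch of the selection rule follows. It treats the range as inclusive on both sides, as the bracket notation [lowBoundary,upBoundary] suggests, and measures fragment length as the absolute value of the SAM TLEN field. The three-column BED line used for an arc is an assumption; the exact layout of the arcs file written by ASAP is not described here. Both function names are hypothetical.

```python
def in_fragment_range(tlen, low=100, up=250):
    """True if the fragment length |TLEN| lies in [low, up] (inclusive)."""
    return low <= abs(tlen) <= up


def arc_bed_line(chrom, frag_start, frag_end):
    """Format one fragment as a tab-delimited BED line linking its extremities.

    The 3-column layout is an assumption made for this sketch.
    """
    return f"{chrom}\t{frag_start}\t{frag_end}"


# Toy template lengths; negative values belong to the reverse mate of a pair.
for tlen in (99, 100, -180, 250, 251):
    print(tlen, in_fragment_range(tlen))

print(arc_bed_line("Chr1", 1000, 1180))
```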

Fragment length

fragDist:Set to "yes/no" if fragment length distribution should be computed or not
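
The kind of table this step produces (counts and frequencies per fragment length) can be sketched in Python as below. The sketch assumes fragment lengths are taken from the SAM TLEN field and that only positive values are kept, so that each pair is counted once. The function name is hypothetical and this is not ASAP's own code.

```python
from collections import Counter


def fragment_length_distribution(tlens):
    """Return (length, count, frequency) rows sorted by fragment length.

    Only positive TLEN values are used so that each pair contributes once
    (assumption: the mate with negative TLEN describes the same fragment).
    """
    lengths = [t for t in tlens if t > 0]
    counts = Counter(lengths)
    total = len(lengths)
    return [(length, n, n / total) for length, n in sorted(counts.items())]


# Toy input: two pairs of length 50, one of length 200, one unpaired read (0).
for row in fragment_length_distribution([50, -50, 200, -200, 50, -50, 0]):
    print(row)
```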

Peak calling

callpeak:Set to "yes/no" if peak calling should be computed or not.
control:Control BAM file. Peak calling can be performed without a control; if desired, a control such as ATAC-seq on genomic DNA can be provided. Leave this option empty if no control is used.
MODE:Peak calling mode: <broad/narrow>. Default=broad
modelParameters:MACS2 shifting options
fdr:Cutoff for peak detection. Default=0.01
gsize:Effective genome size. Example for TAIR10: gsize=10e7
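
For orientation, the sketch below shows one way the parameters above could map onto a MACS2 command line. The `macs2 callpeak` options used (`-t`, `-c`, `-n`, `--outdir`, `-g`, `-q`, `--broad`) exist in MACS2, but how ASAP actually assembles its call is not documented in this README, so the mapping, the function name `build_macs2_command`, and the example values are assumptions.

```python
import shlex


def build_macs2_command(treatment_bam, sample_name, mode="broad", fdr="0.01",
                        gsize="10e7", control=None, model_parameters=""):
    """Assemble a macs2 callpeak argument list from ASAP-style parameters."""
    cmd = ["macs2", "callpeak",
           "-t", treatment_bam,
           "-n", sample_name,
           "--outdir", f"peak_calling_{sample_name}",
           "-g", gsize,
           "-q", fdr]
    if control:
        # An empty control option means peak calling without a control.
        cmd += ["-c", control]
    if mode == "broad":
        cmd.append("--broad")
    # modelParameters is passed through as extra MACS2 options.
    cmd += shlex.split(model_parameters)
    return cmd


# Hypothetical file and sample names.
print(" ".join(build_macs2_command("my_sample.masked.shifted.bam", "my_sample")))
```

The function only builds the argument list; it does not run MACS2.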

Output files

ASAP outputs mapping files, coverage files, a fragment length distribution table and plot, and MACS2 peak calling results.

Mapping output

*.mapped.sorted.bam:Contains mapped reads (bowtie2 raw mapping results)

Filtering/post-processing outputs

*.(un)masked.(un)shifted.bam:Contains the selected set of reads after filtering. Ideally, accessible peaks are called using this file.
*.csv:Summary of the filtering step in CSV format

Coverage outputs

*.(un)masked.(un)shifted.ievent.bam:Contains Tn5 insertion events. Instead of full reads, only the positions corresponding to Tn5 insertion events are shown.
*.(un)masked.(un)shifted.bw:Genome-wide coverage of ATAC reads
*.(un)masked.(un)shifted.ievent.bw:Genome-wide coverage of Tn5 insertion events

Read extraction

*.subReads.f3.frag*.bam:Contains the set of extracted reads based on the given range of fragment length
*.subReads.f3.frag*.bw:Genome-wide coverage of the set of extracted reads based on the given range of fragment length
*.subReads.f3.frag*.arcs.bed:Arcs between fragment extremities. This file can be visualized in IGV

Fragment length distribution

*.TLEN.f3F16.txt:Counts/frequencies of fragment lengths
*.TLEN.f3F16.png:Plot of fragment length distribution

Peak calling outputs

Outputs are stored in the directory peak_calling_<sampleName>. See the MACS2 documentation for the list of output files.

Contributors

akramdi
