Code Monkey home page Code Monkey logo

test's Introduction

#Overview

Major applications of RNA-seq data include studies of how the transcriptome is modulated at the levels of gene expression and RNA processing, and how these events are related to cellular identity, environmental condition, and/or disease status. While many excellent tools have been developed to analyze RNA-seq data, these generally have limited efficacy for annotating 3' UTRs. Existing assembly strategies often fragment long 3' UTRs ( illustrated below, left), and importantly, no de-novo (reference annotation independent) transcript assembly of the algorithms in popular use can identify tandem 3' UTR isoforms generated by alternative cleavage and polyadenylation (APA) using only RNA-seq data (below, right). Consequently, it is often not possible to identify patterns of differential APA using existing assembly tools.

motivation image

To address these limitations, we have developed Isoform Structural Change Model (IsoSCM), a new method for transcript assembly that incorporates change-point analysis to improve the 3' UTR annotation process. If we assume that sequenced reads are distributed proportional to isoform abundance, the boundaries of transcription will be marked by a change in the level of coverage. In instances where short and long isoforms are co-expressed there can be a “step-like” pattern of coverage at the boundary of the nested exon model. To identify terminal exon boundaries, we thus seek change-points that mark transitions in RNA-seq coverage, as illustrated below.

changepoint image

Real RNA-seq data contains biases that can cause the sampling of reads across the transcript body to deviate from a uniform distribution. These biases typically cause short segments of the transcript to be sequenced at a lower frequency than the rest of the transcript, resulting in a local drop in the level of coverage. The conventional (unconstrained) change-point detection procedure identifies these local drops in coverage as distinct segments. However, in the context of annotating transcripts we wish to disregard these local aberrations as they do not correspond to terminal exon boundaries. In order to minimize the effect of local biases on transcript model inference we introduce additional constraints to distinguish localized changes from change-points that mark a sustained decrease in the level of coverage. The difference between constrained and unconstrained formulations is contrasted using a toy example below.

constrained changepoint image

We estimate relative usage of a polyadenylation site by considering the read density in the segments up and downstream of a change-point, such that

usage = 1 - ( covdn / covup )

where covdn/up are estimates of read density downstream/upstream of a change point. In addition to identifying change-points that mark the boundaries of 3'UTRs, IsoSCM can also be used to estimate the relative usage of 3'UTR isoforms, and identify sites with differential usage between two conditions. Conveniently, the change-point detection framework used by IsoSCM can also be used to identify differential usage of tandem 3' UTRs between two samples by performing a joint segmentation, as illustrated below.

differential image

The resulting joint segmentation can be used to quantitate differential usage of tandem 3'UTR isoforms. For the example shown above we can calculate covup and covdn, the level of coverage in the segments up and downstream of the identified change-point, and use these quantities to define differential site usage in each tissue:

usagebrain = 1 - ( covdownstream,brain / covupstream,brain ) ≅ 0.68

usageliver = 1 - ( covdownstream,liver / covupstream,liver ) ≅ 0.91

Δusagebrain-liver = usagebrain-usageliver ≅ -0.23

As for the example above, polyadenylation sites that are differentially used between a pair of conditions will be marked by a change-point that has a non-zero Δusage.

The functions of IsoSCM described above are performed by the assemble, enumerate, compare commands, detailed below.

#Running IsoSCM

##assemble command

java -Xmx2048m -jar IsoSCM.jar assemble [optional arguments]

Taking a BAM file as input, the assemble command will construct a splice-graph of exons, and segment terminal exons using the constrained change-point detection procedure. The minimal arguments to run the assemble command require the BAM file, a base-name for generated output files, and the strandedness of the RNA-seq data. IsoSCM is compatible BAM files generated by many short read mappers, and only requires that spliced reads include the XS flag specifying the inferred strandedness of the spliced read. IsoSCM is a Java application, and requires a JRE version >=1.6 to be installed. We recommend running IsoSCM with at least 2GB of memory via the java -Xmx argument, although should be increased as necessary if OutOfMemoryError exceptions are encountered.

Required arguments:

          -bam                    BAM file to be assembled
          
          -base                   The output basename. This will be used as the
                                  prefix for output files.
                                  
          -s                      the strandedness, can be "reverse_forward" or
                                  "unstranded"

IsoSCM has a number of optional arguments with default values pertaining to either splice-graph assembly or change-point detection.

Assembly arguments:

          -c                      bin width for evaluating the confidence of a
                                  change point
                                  Default: 2
                                  
          -coverage               whether or not to calculate coverage of
                                  generated models
                                  Default: true
                                  
          -dir                    The output directory
                                  Default: ./isoscm
                                  
          -insert_size_quantile   If the data is paired, and this value is specified
                                  IsoSCM will attempt to scaffold gaps between segments
                                  spanned by mates separated by at most this quantile
                                  
          -internal               whether or not to identify change-points within 
                                  internal (non-terminal) exons
                                  Default: false
                                  
          -jnct_alpha             The significance level for binomial test to
                                  accept splice junction
                                  Default: 0.05
                                  
          -merge_radius           gaps smaller than this distance (nt) will be merged
                                  Default: 100
                                  
          -merge_segments         Optional: bed file of regions to merge across

Segmentation arguments:

          -min_fold               the minimum fold change between neighboring
                                  segments expressed as a ratio of coverage downstream divided by
                                  coverage upstream. acceptable values in the range [0.0-1.0]
                                  Default: 0.5
                                  
          -min_terminal           terminal segments are extended by
                                  this amount before segmentation
                                  Default: 300
                                  
          -nb_r                   r parameter of the NB(p,r) prior on RNA-seq coverage within a segment.
                                  The p parameter is not specified as it is integrated over.
                                  Default: 10
                                  
          -segment_p              p parameter of NB(p,r) prior on segment length (between change-points)
                                  Default: 0.95
                                  
          -segment_r              r parameter of NB(p,r) prior on segment length (between change-points)
                                  Default: 10
                                  
          -t                      the threshold above which a segment is
                                  considered expressed
                                  Default: 1
                                  
          -w                      the width of window to be used by the SCM
                                  Default: 20

If the BAM file to process is called brain.bam and the sample was sequenced with the stranded dUTP protocol, the minimal call to IsoSCM assemble would look like

java -Xmx2048m -jar IsoSCM.jar assemble -bam brain.bam -base brain -s reverse_forward

##enumerate command

java -Xmx2048m -jar IsoSCM.jar enumerate [arguments]

Taking a splice-graph generated by the assemble command as input, the enumerate command will list transcript isoforms that are compatible with the splice graph. Since the splice-graph may imply a large number of isoforms, there is a parameter to specify the maximum number of isoforms that will be reported for a single transcript. Loci with more than this number of isoforms are reported as "skipped". This function is not required for assessing differential 3'UTR isoform usage, and is provided as a means to visualize assembled models.

Required arguments:

         -max_isoforms            loci with more than this number of isoforms will be
                                  skipped
         -x                       configuration XML file from the assembly step

##compare command

java -Xmx2048m -jar IsoSCM.jar compare [arguments]

Taking the splice-graphs generated by the assemble for two samples as input, the compare command performs joint segmentation of the terminal exons of coexpressed transcripts, and reports the differential usage of each identified change-point.

Required arguments:

          -base                   The output basename. This will be used as the
                                  prefix for output files.
          -x1                     XML configuration file for the assembly step from sample 1
          -x2                     XML configuration file for the assembly step from sample 2

Optional arguments:

          -dir                    The output directory
                                  Default: ./compare

#IsoSCM Output

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.