#Overview
Major applications of RNA-seq data include studies of how the transcriptome is modulated at the levels of gene expression and RNA processing, and how these events are related to cellular identity, environmental condition, and/or disease status. While many excellent tools have been developed to analyze RNA-seq data, these generally have limited efficacy for annotating 3' UTRs. Existing assembly strategies often fragment long 3' UTRs ( illustrated below, left), and importantly, no de-novo (reference annotation independent) transcript assembly of the algorithms in popular use can identify tandem 3' UTR isoforms generated by alternative cleavage and polyadenylation (APA) using only RNA-seq data (below, right). Consequently, it is often not possible to identify patterns of differential APA using existing assembly tools.
To address these limitations, we have developed Isoform Structural Change Model (IsoSCM), a new method for transcript assembly that incorporates change-point analysis to improve the 3' UTR annotation process. If we assume that sequenced reads are distributed proportional to isoform abundance, the boundaries of transcription will be marked by a change in the level of coverage. In instances where short and long isoforms are co-expressed there can be a “step-like” pattern of coverage at the boundary of the nested exon model. To identify terminal exon boundaries, we thus seek change-points that mark transitions in RNA-seq coverage, as illustrated below.
Real RNA-seq data contains biases that can cause the sampling of reads across the transcript body to deviate from a uniform distribution. These biases typically cause short segments of the transcript to be sequenced at a lower frequency than the rest of the transcript, resulting in a local drop in the level of coverage. The conventional (unconstrained) change-point detection procedure identifies these local drops in coverage as distinct segments. However, in the context of annotating transcripts we wish to disregard these local aberrations as they do not correspond to terminal exon boundaries. In order to minimize the effect of local biases on transcript model inference we introduce additional constraints to distinguish localized changes from change-points that mark a sustained decrease in the level of coverage. The difference between constrained and unconstrained formulations is contrasted using a toy example below.
We estimate relative usage of a polyadenylation site by considering the read density in the segments up and downstream of a change-point, such that
usage = 1 - ( covdn / covup )
where covdn/up are estimates of read density downstream/upstream of a change point. In addition to identifying change-points that mark the boundaries of 3'UTRs, IsoSCM can also be used to estimate the relative usage of 3'UTR isoforms, and identify sites with differential usage between two conditions. Conveniently, the change-point detection framework used by IsoSCM can also be used to identify differential usage of tandem 3' UTRs between two samples by performing a joint segmentation, as illustrated below.
The resulting joint segmentation can be used to quantitate differential usage of tandem 3'UTR isoforms. For the example shown above we can calculate covup and covdn, the level of coverage in the segments up and downstream of the identified change-point, and use these quantities to define differential site usage in each tissue:
usagebrain = 1 - ( covdownstream,brain / covupstream,brain ) ≅ 0.68
usageliver = 1 - ( covdownstream,liver / covupstream,liver ) ≅ 0.91
Δusagebrain-liver = usagebrain-usageliver ≅ -0.23
As for the example above, polyadenylation sites that are differentially used between a pair of conditions will be marked by a change-point that has a non-zero Δusage.
The functions of IsoSCM described above are performed by the assemble, enumerate, compare commands, detailed below.
#Running IsoSCM
##assemble command
java -Xmx2048m -jar IsoSCM.jar assemble [optional arguments]
Taking a BAM file as input, the assemble command will construct a splice-graph of exons, and segment terminal exons using the constrained change-point detection procedure. The minimal arguments to run the assemble command require the BAM file, a base-name for generated output files, and the strandedness of the RNA-seq data. IsoSCM is compatible BAM files generated by many short read mappers, and only requires that spliced reads include the XS flag specifying the inferred strandedness of the spliced read. IsoSCM is a Java application, and requires a JRE version >=1.6 to be installed. We recommend running IsoSCM with at least 2GB of memory via the java -Xmx argument, although should be increased as necessary if OutOfMemoryError exceptions are encountered.
Required arguments:
-bam BAM file to be assembled
-base The output basename. This will be used as the
prefix for output files.
-s the strandedness, can be "reverse_forward" or
"unstranded"
IsoSCM has a number of optional arguments with default values pertaining to either splice-graph assembly or change-point detection.
Assembly arguments:
-c bin width for evaluating the confidence of a
change point
Default: 2
-coverage whether or not to calculate coverage of
generated models
Default: true
-dir The output directory
Default: ./isoscm
-insert_size_quantile If the data is paired, and this value is specified
IsoSCM will attempt to scaffold gaps between segments
spanned by mates separated by at most this quantile
-internal whether or not to identify change-points within
internal (non-terminal) exons
Default: false
-jnct_alpha The significance level for binomial test to
accept splice junction
Default: 0.05
-merge_radius gaps smaller than this distance (nt) will be merged
Default: 100
-merge_segments Optional: bed file of regions to merge across
Segmentation arguments:
-min_fold the minimum fold change between neighboring
segments expressed as a ratio of coverage downstream divided by
coverage upstream. acceptable values in the range [0.0-1.0]
Default: 0.5
-min_terminal terminal segments are extended by
this amount before segmentation
Default: 300
-nb_r r parameter of the NB(p,r) prior on RNA-seq coverage within a segment.
The p parameter is not specified as it is integrated over.
Default: 10
-segment_p p parameter of NB(p,r) prior on segment length (between change-points)
Default: 0.95
-segment_r r parameter of NB(p,r) prior on segment length (between change-points)
Default: 10
-t the threshold above which a segment is
considered expressed
Default: 1
-w the width of window to be used by the SCM
Default: 20
If the BAM file to process is called brain.bam and the sample was sequenced with the stranded dUTP protocol, the minimal call to IsoSCM assemble would look like
java -Xmx2048m -jar IsoSCM.jar assemble -bam brain.bam -base brain -s reverse_forward
##enumerate command
java -Xmx2048m -jar IsoSCM.jar enumerate [arguments]
Taking a splice-graph generated by the assemble command as input, the enumerate command will list transcript isoforms that are compatible with the splice graph. Since the splice-graph may imply a large number of isoforms, there is a parameter to specify the maximum number of isoforms that will be reported for a single transcript. Loci with more than this number of isoforms are reported as "skipped". This function is not required for assessing differential 3'UTR isoform usage, and is provided as a means to visualize assembled models.
Required arguments:
-max_isoforms loci with more than this number of isoforms will be
skipped
-x configuration XML file from the assembly step
##compare command
java -Xmx2048m -jar IsoSCM.jar compare [arguments]
Taking the splice-graphs generated by the assemble for two samples as input, the compare command performs joint segmentation of the terminal exons of coexpressed transcripts, and reports the differential usage of each identified change-point.
Required arguments:
-base The output basename. This will be used as the
prefix for output files.
-x1 XML configuration file for the assembly step from sample 1
-x2 XML configuration file for the assembly step from sample 2
Optional arguments:
-dir The output directory
Default: ./compare
#IsoSCM Output