Code Monkey home page Code Monkey logo

qsvar's Introduction

qsvaR

Lifecycle: stable Bioc release status Bioc devel status Bioc downloads rank Bioc support Bioc history Bioc last commit Bioc dependencies Codecov test coverage R build status GitHub issues GitHub pulls DOI

Differential expressions analysis requires the ability to normalize complex datasets. In the case of postmortem brain tissue we are tasked with removing the effects of bench degradation. The qsvaR package combines an established method for removing the effects of degradation from RNA-seq data with easy to use functions. It is the second iteration of the qSVA framework (Jaffe et al, PNAS, 2017).

The first step in the qsvaR workflow is to create an RangedSummarizedExperiment object with the transcripts identified in our qSVA experiment. If you already have a RangedSummarizedExperiment of transcripts we can do this with the getDegTx() function as shown below.If not this can be generated with the SPEAQeasy (a RNA-seq pipeline maintained by our lab) pipeline using the --qsva flag. If you already have a RangedSummarizedExperiment object with transcripts then you do not need to run SPEAQeasy. This flag requires a full path to a text file, containing one Ensembl transcript ID per line for each transcript desired in the final transcripts R output object (called rse_tx). The sig_transcripts argument in this package should contain the same Ensembl transcript IDs as the text file for the --qsva flag.The goal of qsvaR is to provide software that can remove the effects of bench degradation from RNA-seq data.

Installation Instructions

Get the latest stable R release from CRAN. Then install qsvaR using from Bioconductor the following code:

if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

BiocManager::install("qsvaR")

And the development version from GitHub with:

BiocManager::install("LieberInstitute/qsvaR")

Example

This is a basic example which shows how to obtain the quality surrogate variables (qSVs) for the brainseq phase II dataset. qSVs are essentially principal components from an rna-seq experiment designed to model bench degradation. For more on principal components you can read and introductory article here. At the start of this script we will have an RangedSummarizedExperiment and a list of all the transcripts found in our degradation study. At the end we will have a table with differential expression results that is adjusted for qSVs.

## R packages we'll use
library("qsvaR")
library("limma")
library("qsvaR")

## We'll download example data from the BrainSeq Phase II project
## described at http://eqtl.brainseq.org/phase2/.
##
## We'll use BiocFileCache to cache these files so you don't have to download
## them again for other examples.
bfc <- BiocFileCache::BiocFileCache()
rse_file <- BiocFileCache::bfcrpath(
    "https://s3.us-east-2.amazonaws.com/libd-brainseq2/rse_tx_unfiltered.Rdata",
    x = bfc
)

## Now that we have the data in our computer, we can load it.
load(rse_file, verbose = TRUE)
#> Loading objects:
#>   rse_tx

In this next step we subset for the transcripts associated with degradation. These were determined by Joshua M. Stolz et al, 2022. We have provided three models to choose from. Here the names "cell_component", "top1500", and "standard" refer to models that were determined to be effective in removing degradation effects. The "standard" model involves taking the union of the top 1000 transcripts associated with degradation from the interaction model and the main effect model. The "top1500" model is the same as the "standard" model except the union of the top 1500 genes associated with degradation is selected. The most effective of our models, "cell_component", involved deconvolution of the degradation matrix to determine the proportion of cell types within our studied tissue. These proportions were then added to our model.matrix() and the union of the top 1000 transcripts in the interaction model, the main effect model, and the cell proportions model were used to generate this model of qSVs. In this example we will choose "cell_component" when using the getDegTx() and select_transcripts() functions.

The above venn diagram shows the overlap between transcripts in each of the previously mentioned models.

The above venn diagram shows the overlap between transcripts in each of the previously mentioned models.

## Next we get the degraded transcripts for qSVA from the "cell_component"
## model
DegTx <- getDegTx(rse_tx, type = "cell_component")

## Now we can compute the Principal Components (PCs) of the degraded
## transcripts
pcTx <- getPCs(DegTx, "tpm")

Next we use the k_qsvs() function to calculate how many PCs we will need to account for the variation. A model matrix accounting for relevant variables should be used. Common variables such as Age, Sex, Race and Religion are often included in the model. Again we are using our RangedSummarizedExperiment DegTx as the rse_tx option. Next we specify the mod with our model.matrix(). model.matrix() creates a design (or model) matrix, e.g., by expanding factors to a set of dummy variables (depending on the contrasts) and expanding interactions similarly. For more information on creating a design matrix for your experiment see the documentation here. Again we use the assayname option to specify the we are using the tpm assay, where TPM stands for transcripts per million.

## Using a simple statistical model we determine the number of PCs needed (k)
mod <- model.matrix(~ Dx + Age + Sex + Race + Region,
    data = colData(rse_tx)
)
k <- k_qsvs(DegTx, mod, "tpm")
print(k)
#> [1] 34

Now that we have our PCs and the number we need we can generate our qSVs.

## Obtain the k qSVs
qsvs <- get_qsvs(pcTx, k)
dim(qsvs)
#> [1] 900  34

This can be done in one step with our wrapper function qSVA which just combinds all the previous mentioned functions.

## Example use of the wrapper function qSVA()
qsvs_wrapper <- qSVA(rse_tx = rse_tx, type = "cell_component", mod = mod, assayname = "tpm")
dim(qsvs_wrapper)
#> [1] 900  34

Differential Expression

Next we can use a standard limma package approach to do differential expression on the data. The key here is that we add our qSVs to the statistical model we use through model.matrix(). Here we input our Ranged SummarizedExperiment object and our model.matrix with qSVs. Note here that the Ranged SummarizedExperiment object is the original object loaded with the full list of transcripts, not the the one we subsetted for qSVs. This is because while PCs can be generated from a subset of genes, differential expression is best done on the full dataset. The expected output is a sigTx object that shows the results of differential expression.

library("limma")

## Add the qSVs to our statistical model
mod_qSVA <- cbind(
    mod,
    qsvs
)

## Extract the transcript expression values and put them in the
## log2(TPM + 1) scale
txExprs <- log2(assays(rse_tx)$tpm + 1)

## Run the standard linear model for differential expression
fitTx <- lmFit(txExprs, mod_qSVA)
eBTx <- eBayes(fitTx)

## Extract the differential expression results
sigTx <- topTable(eBTx,
    coef = 2,
    p.value = 1, number = nrow(rse_tx)
)

## Explore the top results
head(sigTx)
#>                         logFC   AveExpr         t      P.Value    adj.P.Val
#> ENST00000553142.5 -0.06547988 2.0390889 -5.999145 2.921045e-09 0.0005786386
#> ENST00000552074.5 -0.12911383 2.4347985 -5.370828 1.009549e-07 0.0099992338
#> ENST00000510632.1  0.08994392 0.9073516  4.920042 1.037016e-06 0.0473143146
#> ENST00000530589.1 -0.10297938 2.4271711 -4.918806 1.043399e-06 0.0473143146
#> ENST00000572236.1 -0.05358333 0.6254025 -4.819980 1.697403e-06 0.0473143146
#> ENST00000450454.6  0.08446871 1.0042440  4.816539 1.726143e-06 0.0473143146
#>                           B
#> ENST00000553142.5 10.200286
#> ENST00000552074.5  6.767821
#> ENST00000510632.1  4.524039
#> ENST00000530589.1  4.518145
#> ENST00000572236.1  4.051142
#> ENST00000450454.6  4.035041

Finally, you can compare the resulting t-statistics from your differential expression model against the degradation time t-statistics adjusting for the six different brain regions. This type of plot is called DEqual plot and was shown in the initial qSVA framework paper (Jaffe et al, PNAS, 2017). We are really looking for two patterns exemplified here in Figure 1 (cartoon shown earlier). A direct positive correlation with degradation shown in Figure 1 on the right tells us that there is signal in the data associated with qSVs. An example of nonconfounded data or data that has been modeled can be seen in Figure 1 on the right with its lack of relationship between the x and y variables.

Cartoon showing patterns in DEqual plots

Cartoon showing patterns in DEqual plots

## Generate a DEqual() plot using the model results with qSVs
DEqual(sigTx)
Result of Differential Expression with qSVA normalization.

Result of Differential Expression with qSVA normalization.

For comparison, here is the DEqual() plot for the model without qSVs.

## Generate a DEqual() plot using the model results without qSVs
DEqual(topTable(eBayes(lmFit(txExprs, mod)), coef = 2, p.value = 1, number = nrow(rse_tx)))
Result of Differential Expression without qSVA normalization.

Result of Differential Expression without qSVA normalization.

In these two DEqual plots we can see that the first is much better. With a correlation of -0.014 we can effectively conclude that we have removed the effects of degradation from the data. In the second plot after modeling for several common variables we still have a correlation of 0.5 with the degradation experiment. This high correlation shows we still have a large amount of signal from degradation in our data potentially confounding our case-control (SCZD vs neurotypical controls) differential expression results.

qsvar's People

Contributors

hediatnani avatar joshstolz avatar jwokaty avatar lcolladotor avatar nturaga avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar

qsvar's Issues

getBonfTx todos

Expand functionality for downstream analyses

We plan to add helper functions for visualizing the relationship between qSVs and other covariates, as is typically done in several analyses. This will likely involve making heatmaps with ComplexHeatmap. We will study what qsvaR users have done for other analyses when designing these helper functions.

Complete select_transcripts()

You might want to use code like this to create the vector of transcript IDs.

x <- letters
x
#>  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
#> [20] "t" "u" "v" "w" "x" "y" "z"
cat(paste0('c("', paste(x, collapse = '", "'), '")'))
#> c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z")

Created on 2022-03-15 by the reprex package (v2.0.1)

Then you can copy paste it into

return("TODO")
.

Document how to use Salmon and SPEAQeasy output in the vignette

Explore whether we can use the example data from tximport to create an RSE from Salmon transcript count output files and run the qSVA functions on it.

See:

@Nick-Eagles might be able to help compare the code from tximport and/or rnaseqDTU with what we have in SPEAQeasy.

You could also use the SPEAQeasy-example data (see https://github.com/LieberInstitute/SPEAQeasy-example/blob/master/pipeline_outputs/count_objects/rse_tx_Jlab_experiment_n42.Rdata) to document how to import those files and use them for this package.

Package data

will need help making the data in data/ available to the package so that the examples work.

k_qsvs todos

get_qsvs todos

> seq_len(0)
integer(0)
> 1:0
[1] 1 0
> seq_len(-3)
Error in seq_len(-3) : argument must be coercible to non-negative integer
> 1:(-3)
[1]  1  0 -1 -2 -3

Video guides for qsvaR

We want to create a collection of short videos (likely shared on YouTube) demonstrating how to use the different features of qsvaR. We will collate these videos into a new user guide (a new vignette). These videos will be helpful to explain the different components related to qsvaR such as the experimental design for RNA degradation experiments, the selection of transcripts associated with degradation (related to #36 #37), computing the qSVs, and using the qSVs in downstream analyses (related to #35).

Some of these videos will be inspired by actual use cases we hear from our users using publicly available data.

This will be an evolving process as new videos will have to be made to reflect new features added to qsvaR.

getDegTx todos

Screen Shot 2021-11-02 at 11 27 14 AM

See https://github.com/LieberInstitute/recount3/blob/master/R/create_rse_manual.R#L3-L6 for another example of how I use it in a sentence.

Helper functions for reproducing statistical modeling results that can be applied to subsets of the available data or new degradation datasets

We will write helper functions for reproducing the "main" and "interaction" model results based on the data provided in #32. These functions can then be applied to either subsets of the data to compute new statistical results that users might want to use as input (related to #33). Alternatively these functions could be applied to new degradation data from either new experiments carried at LIBD or data generated elsewhere for other tissues / organisms.

K_qsvs coverage

  • a test that shows full rank error

  • a test that shows low expression error (sva doesn't work)

Document concepts and link to external resources for more detail

qsvaR workflow overview image

We plan to improve the qsvaR workflow image and further expand the related background documentation for understanding the different steps of the process, what data you need, what are the outputs of qsvaR, and how you can use them for downstream analyses.

Provide access to all statistical model results

We will likely deposit to Bioconductor's ExperimentHub the full output from limma and the full RangedSummarizedExperiment object for the 119 degradation samples used in this study. Doing so will allow users to re-use the data from this study in different ways.

We will also add an equivalent function to spatialLIBD::fetch_data() in order to download this data.

Figure 3

cord equal
facet wrap
black line is hard to see (maybe red)
correlation value moved on to plot
density graphs need a scale
MAIN VS INT
old version of plot is clearer.
A bar plot is clearer to show region distribution

Design a 2 day workshop / short course

Similar to LieberInstitute/spatialLIBD#68

Develop a long format workshop to teach in 2 days the basics of differential expression analysis using RNA-seq data from postmortem human brain, which is affected by degradation. To make this workshop more general, it will also cover how you can select transcripts associated with degradation if you were to generate the relevant data.

We might want to learn about the format used by https://carpentries.org/index.html for short courses prior to designing this short course / long workshop. The target audience would be users who have some basic familiarity with Bioconductor, SummarizedExperiment, limma and/or similar tools.

Translate course and intro documentation to Spanish

Translate to Spanish the short workshop #38 and other intro level materials to help increase access to these technologies to Spanish-speaking individuals. This will help increase diversity of our user base and stimulate use of these analytical technologies in other parts of the world.

[BUG] Complete qSVA() wrapper

It seems like a bug that sig_transcripts at

qsvaR/R/qSVA.R

Line 23 in 498143f

qSVA <- function(rse_tx, type = "cell_component", sig_transcripts = select_transcripts(type), mod, assayname) {
is not used later on the qSVA() function.

Interactive tool for assessing confounding of DE results with RNA degradation

We will build an interactive website, likely powered by shiny, such that users can upload their differential expression results at the gene-level and make DEqual plots to assess if their results appear to be confounded by RNA degradation.

This tool (website) will enable users to check publicly available differential expression results and identify genes which could potentially be false positives. The website will generate an automated report that can then be shared with collaborators. The function for making this report will also be available as an R function that can be used in an non-interactive way.

If possible (due to memory constraints), this tool will also support exon-level, transcript-level, and/or exon-exon junction level DE results.

Figure 1.

add abcd
move correlation value inside the plot

Add DEqual plots to package

Use code for creating DEqual plot in /dcl01/lieber/ajaffe/lab/degradation_experiments/Joint/all/SCRIPTS/qsva_purr.R to create a helper function for assessing the extent of degradation.

[Feature Request] Support ENSEMBL IDs

We should add a is_gencode = TRUE default argument such that when it's set to FALSE, it matches using ENSEMBL IDs instead of Gencode IDs. Aka, it removes the trailing .[0-9] in the IDs.

This should include a unit test that checks the results using the same data with Gencode IDs, then manually makes them ENSEMBL IDs, and checks that with is_gencode = FALSE we get exactly the same results (might have to use set.seed() on this unit test).

Guide for downstream analyses

We will write a user guide (vignette) for downstream analyses using qSVs generated with qsvaR. This guide will showcase the helper functions from #34. Potentially not all the code will be evaluated given testing restrictions on Bioconductor. Though we will explore the use of "long tests". The goal of this user guide is to help orient users into what analyses they should likely run once they have estimated qSVs. This will also help users determine whether they should use qSVs in their analyses or not, for example, if they are analyzing data from a brain region or another tissue for which no degradation data exists.

Update DESCRIPTION

  • Title
  • Authors info: check other packages like megadepth for example.
  • Description: about 3 sentences. Should end with a period. Note the spacing (see other packages).
  • Add other biocViews terms. Aim for at least 5 terms.

Figure 2

telegraph subsetting
Main and interaction should be horizontal
label arrows
pair line plot with matrices

Figure 4

facet grid
labels
tile plot
variable names need to be cleaned up

Figure 5

add dot size to legend
move to ggplot

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.