- Included Data
- Finding Data
- Installing R & RStudio
- Alternatives to RStudio
- Installing Packages
- Creating a Project
- Downloading recount Data
- Downloading ARCHS4 Data
- Running An Analysis
This RNA-seq transcriptomics analysis will be carried out using R, a statistical programming language and interactive environment. I recommend using the RStudio integrated development environment (IDE), because it allows us to easily interact with R, and sets up “projects” that control where data is found.
Note that I’ve included some of the smaller files under the data_files directory so that it is easy to see what happens with some of the code down below. The easiest way to get them is either by downloading directly, or cloning the entire project.
We want to avoid having to do everything from scratch in this project for various reasons. Therefore, I recommend picking a dataset from either the ARCHS4 or recount2 projects, where all of the data has been preprocessed, and we simply need to download it and start working with it.
You should think about a disease or condition you might be interested in, and see if there are any datasets in ARCHS4 or recount2 that may be suitable. Make sure to make a list of experiment IDs or links to share with your mentor!
- ARCHS4 link: https://maayanlab.cloud/archs4/
- Recount2 link: https://jhubiostatistics.shinyapps.io/recount/
Discuss the possible data sets with your mentor before proceeding to attempt to download any data.
We are going to use R, a data analysis programming language. If you want to learn about using it in general, there are several good tutorials you can check out.
- swirl: an interactive tutorial that runs within R itself.
- R for Data Science: a book on general data processing. The HTML version of the book is free.
- Teacup Giraffes: an introductory course that introduces running things in R and how to get some simple statistics out.
On Windows, install R by downloading and running this .exe file from CRAN. Also, please install the RStudio IDE.
On macOS, install R by downloading and running this .pkg file from CRAN. Also, please install the RStudio IDE.
On Linux, you can download the binary files for your distribution from CRAN, or you can use your package manager (e.g. for Debian/Ubuntu run sudo apt-get install r-base, and for Fedora run sudo dnf install R). Also, please install the RStudio IDE.
Please make sure your R installation works by starting RStudio. You should see a screen that looks like this:
This provides the Console, where R code is actually executed; Environment, displaying which objects are present; History, providing a history of which commands have been run; and several other panes you can read about elsewhere.
The only recommendation I’ve seen for another editor for working in R is VSCode (unless you already use Emacs or Vim for everything; if that’s you, you probably don’t need my guidance). There are some guidelines to using R in VSCode written by people who do it all the time. I recommend checking out the above link or searching for “VSCode and R” to find more information.
R makes code available as packages. We will be using several hosted on CRAN (the official R package repository), as well as some from Bioconductor, and some written by Dr. Flight. To install a package, we use the command install.packages("packageName").
For example, we can start by installing the “here” and “remotes” packages, which will be used to install some other packages we will use. You can type the commands below into the R console pane of RStudio:
install.packages("here")
install.packages("remotes")
Please type the command yourself, and don’t copy paste! Typing it helps you to build proficiency and a muscle memory that will become further ingrained the more you do it.
Hopefully you didn’t get any errors when you typed those in and pressed enter. If you did get errors, they were likely something like:
Warning message:
package ‘rmotes’ is not available (for R version 4.0.0)
Or
Error in install.package("remotes") :
could not find function "install.package"
The first one means you spelled the package name wrong. The other one is telling you that you spelled the command wrong. These are very common errors especially when installing things. Some of the error messages from R are helpful, others less so. However, when there is an error, R is generally trying to be helpful, so do read them carefully, and google them if they are not obvious. Some are also collected here: https://rmflight.github.io/rerrors/
If the installation executed correctly, R will tell you so at the end of its output.
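If you want to check explicitly, base R can also tell you whether a package is installed. This is a quick sketch; swap in whatever package name you care about:

```r
# look for the package among all installed packages
"remotes" %in% rownames(installed.packages())

# or try to load its namespace; requireNamespace() returns FALSE
# instead of raising an error when the package is missing
requireNamespace("remotes", quietly = TRUE)
```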
Other packages we will need to install include:
# provides really nice plotting abilities
install.packages("ggplot2")
# provides access to biologically related packages in Bioconductor
install.packages("BiocManager")
install.packages("dplyr")
We can then use BiocManager to install biologically related packages.
# loads BiocManager so you can use it
library(BiocManager)
# runs the BiocManager command install
BiocManager::install("recount")
# installs the DESeq2 package
BiocManager::install("DESeq2")
BiocManager::install("circlize")
install.packages("viridis")
install.packages("rmarkdown")
# installs a package used for quality control and analysis
remotes::install_github("MoseleyBioinformaticsLab/visualizationQualityControl")
remotes::install_github("MoseleyBioinformaticsLab/ICIKendallTau")
remotes::install_github("rmflight/categoryCompare2")
While these are installing, you should notice lots of other packages being installed as well. Hopefully none of them are generating errors while they are installing.
Some information on what we did above:

- library() is a function for loading up an installed package. The command library is the function, and you give the function its arguments inside the brackets (). If you call the function name without (), R will actually print the function definition.
- BiocManager is a package for managing packages from the Bioconductor project.
- BiocManager::install() is calling the install() function from the BiocManager package. It is slightly faster than doing this:

library(BiocManager)
install("DESeq2")

- The package::function() form only works if a package and its dependencies are installed, obviously.
- It can be very useful when you know exactly what function you need, and you are only going to need one.
- It’s also useful when there are several packages that have the same function name; it makes sure you are calling the correct one.
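You can see the with-and-without-brackets behavior using any base R function, for example sum:

```r
# with brackets: the function runs on its arguments
sum(1:3)
# returns 6

# without brackets: R prints the function definition instead of running it
sum
```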
Now we need to create our analysis project, where all of the data and code are going to live. You can create a new project using the “Projects” option on the global toolbar, it should be at the furthest right corner. More on using RStudio projects is available here: https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects
I recommend naming the project something memorable and informative.
Also, you should change two of the project options under Project -> Project Options:

- Restore .RData into workspace at startup -- “Never”
- Save workspace to .RData on exit -- “Never”
Many errors in analyses are caused by having old data loaded instead of starting from scratch. Therefore, to avoid having those kinds of issues, the options above should be enacted either globally or at the very least on each project.
If you are not using RStudio, then you should create a file named “.here” (see the help for the here package) in the directory where your project will be, and always start your session in that directory so that R knows your relative file paths.
Recount provides instructions on downloading data at https://jhubiostatistics.shinyapps.io/recount/
In short, you choose an identifier that corresponds to the data you want, say “SRX10000”, and ask recount to download it for you.
library('recount')
# make sure to change the STUDY ID to a real one!
url = download_study('SRX10000')
url
# loading the data to work with it
load(file.path("SRX10000", "rse_gene.RData"))
Alternatively, you can download a data file corresponding to the lung data, and use it. We will use this example file for all of our other example data processing.
download.file("http://duffel.rail.bio/recount/v2/TCGA/rse_gene_lung.Rdata", destfile = here::here("data_files/rse_gene_lung.Rdata"))
Alternatively, you can download it using this link: http://duffel.rail.bio/recount/v2/TCGA/rse_gene_lung.Rdata
We want to extract the original gene-level counts, the scaled counts, and information about the samples in the data. We do it this way to make some other things easier.
library(recount)
load(here::here("data_files/rse_gene_lung.Rdata"))
# we need to scale the counts closer to what we would normally expect
# before we extract them.
rse_raw = read_counts(rse_gene, round = TRUE)
raw_counts = assays(rse_raw)$counts
sample_data = colData(rse_gene)
# some of these are going to be specific to the TCGA data, I think
sample_info = data.frame(project = sample_data$project,
sample_id = sample_data$gdc_file_id,
gender = sample_data$gdc_cases.demographic.gender,
project_name = sample_data$gdc_cases.project.name,
race = sample_data$gdc_cases.demographic.race,
sample_type = sample_data$gdc_cases.samples.sample_type,
primary_site = sample_data$gdc_cases.project.primary_site,
tumor_stage = sample_data$gdc_cases.diagnoses.tumor_stage,
disease_type = sample_data$cgc_file_disease_type)
gene_data = rowRanges(rse_gene)
gene_info = as.data.frame(mcols(gene_data))
rse_scaled = scale_counts(rse_gene)
scaled_counts = assays(rse_scaled)$counts
saveRDS(raw_counts, file = here::here("data_files/recount_lung_raw_counts.rds"))
saveRDS(sample_info, file = here::here("data_files/recount_lung_sample_info.rds"))
saveRDS(scaled_counts, file = here::here("data_files/recount_lung_scaled_counts.rds"))
saveRDS(gene_info, file = here::here("data_files/recount_lung_gene_info.rds"))
For ARCHS4, we download the full data set (human is 12 GB, mouse is probably larger), and then subset it by samples of interest.
destination_file = here::here("data_files/archs4_human_matrix_v9.h5")
extracted_expression_file = here::here("data_files/archs4_LUNG_expression_matrix.tsv")
url = "https://s3.amazonaws.com/mssm-seq-matrix/human_matrix_v9.h5"
# Check if the gene expression file was already downloaded; if not, download the file from the repository
if(!file.exists(destination_file)){
print("Downloading compressed gene expression matrix.")
download.file(url, destination_file, quiet = FALSE, mode = 'wb')
}
lung_samples = readLines(here::here("data_files/archs4_lung_samplelist.txt"))
library("rhdf5")
human_file = here::here("data_files/archs4_human_matrix_v9.h5")
# you can see what is in the file
h5ls(human_file)
samples = h5read(human_file, "meta/samples/geo_accession")
genes = h5read(human_file, "meta/genes/genes")
titles = h5read(human_file, "meta/samples/title")
series = h5read(human_file, "meta/samples/series_id")
sample_locations = which(samples %in% lung_samples)
expression = t(h5read(human_file, "data/expression", index=list(sample_locations, 1:length(genes))))
colnames(expression) = samples[sample_locations]
rownames(expression) = genes
sample_info = data.frame(sample_id = samples[sample_locations],
title = titles[sample_locations],
series = series[sample_locations])
saveRDS(expression, here::here("data_files/archs4_lung_counts.rds"))
saveRDS(sample_info, here::here("data_files/archs4_lung_sample_info.rds"))
In contrast to lots of tutorials where they recommend saving data using save(), I prefer saveRDS(). The reason is that readRDS() makes you assign the object to a new name, and that name can be whatever makes sense to you. load() will load the data with whatever name it had previously, which can be very, very annoying, especially for an analysis like this. So for example, if we want, we can load the sample_info we saved above with a different name:
info = readRDS(here::here("data_files/archs4_lung_sample_info.rds"))
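Here is a minimal round trip you can run anywhere (it uses a temporary file rather than the project’s data_files directory), showing that readRDS() lets you pick the name on the way back in:

```r
# save a small object to a temporary .rds file
original = data.frame(gene = c("A", "B"), count = c(10, 20))
tmp_file = tempfile(fileext = ".rds")
saveRDS(original, file = tmp_file)

# readRDS() returns the object, so we choose a brand new name for it
renamed = readRDS(tmp_file)
identical(original, renamed)
# returns TRUE
```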
OK, so we have data from RNA-seq transcriptomics experiments. What exactly happened to get this far?
- Messenger ribonucleic acid (mRNA) was extracted from cells
- High abundance RNAs were removed (probably)
- Why would we do this? Which RNA would be in high abundance? (Hint: the machinery to translate mRNA also contains RNA)
- Converted to DNA
- Amplified using PCR
- Sequenced, probably using some form of next-generation sequencing
- Those sequences are then aligned to a reference genome
- And then how many sequences align to each part of the genome gives us the counts above
Depending on your computer’s RAM (random access memory), you may or may not be able to analyze all genes or samples at the same time, and if you tried to run the ARCHS4 extraction code above, you may have run out of RAM, and either had the process take forever, or had your computer crash completely.
Let your mentor know if your computer crashed and you still want to use ARCHS4 data!!
One thing we can do to reduce the compute requirements, and at least work out what code is going to work, is to:
- Subset to a more relevant set of samples, which is likely necessary unless you are working with a rather simple single study.
- Subset to a more manageable set of genes and/or samples.
Let’s see if we can subset the recount samples to something more reasonable than all of the samples.
library(dplyr)
lung_info = readRDS(here::here("data_files/recount_lung_sample_info.rds"))
knitr::kable(head(lung_info))
project | sample_id | gender | project_name | race | sample_type | primary_site | tumor_stage | disease_type |
---|---|---|---|---|---|---|---|---|
TCGA | 191fe3d1-febf-4585-b6f4-263bfad4dd7e | male | Lung Squamous Cell Carcinoma | not reported | Primary Tumor | Lung | stage iia | Lung Squamous Cell Carcinoma |
TCGA | 672afeb7-e9aa-4a44-aa9c-ef6344ae5c5c | male | Lung Adenocarcinoma | white | Primary Tumor | Lung | stage iib | Lung Adenocarcinoma |
TCGA | 670d8333-6723-4b4f-b533-d2bff803a9bf | female | Lung Adenocarcinoma | white | Primary Tumor | Lung | stage ib | Lung Adenocarcinoma |
TCGA | ab6c1203-4d5d-484a-abd9-9b333017b1ed | male | Lung Squamous Cell Carcinoma | white | Primary Tumor | Lung | stage iib | Lung Squamous Cell Carcinoma |
TCGA | 88860792-6084-41df-b3a1-7d36a2502b5a | male | Lung Squamous Cell Carcinoma | white | Primary Tumor | Lung | stage iia | Lung Squamous Cell Carcinoma |
TCGA | 4b3d8b07-8b44-45f3-be30-a0b6edbf8265 | male | Lung Adenocarcinoma | black or african american | Primary Tumor | Lung | stage iiia | Lung Adenocarcinoma |
We can see from the table that we have a bunch of useful information:
- sample_id: a sample ID that corresponds to the column names of our expression data
- gender: what gender the sample was from
- project_name: some information about the project
- race: what race of person the sample is from
- sample_type: what type of sample it is
- primary_site: where the primary tumor is thought to be from
- tumor_stage: what stage the tumor is at
- disease_type: what type of disease it is
We can also get an idea of what is in the data by asking what the unique values of each column are. The data we have is a data.frame, which is really a list underneath, so we can iterate over specific pieces using purrr.
dplyr::select(lung_info, gender, project_name, race, sample_type, primary_site, tumor_stage, disease_type) %>%
purrr::iwalk(., function(.x, .y){
message(.y)
print(unique(.x))
})
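If you don’t want the purrr dependency, the same iteration works in base R with a loop over the column names; here is a self-contained sketch using a toy data.frame in place of lung_info:

```r
# toy stand-in for lung_info, just for illustration
toy_info = data.frame(gender = c("male", "female", "male"),
                      sample_type = c("Primary Tumor", "Solid Tissue Normal",
                                      "Primary Tumor"))

# a data.frame is a list of columns, so we can loop over its names
for (column in names(toy_info)) {
  message(column)
  print(unique(toy_info[[column]]))
}
```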
## [1] "male" "female"
## [1] "Lung Squamous Cell Carcinoma"
## [2] "Lung Adenocarcinoma"
## [1] "not reported"
## [2] "white"
## [3] "black or african american"
## [4] "asian"
## [5] "american indian or alaska native"
## [1] "Primary Tumor" "Solid Tissue Normal"
## [3] "Recurrent Tumor"
## [1] "Lung"
## [1] "stage iia" "stage iib" "stage ib"
## [4] "stage iiia" "stage iv" "stage iiib"
## [7] "stage ia" "not reported" "stage i"
## [10] "stage ii" "stage iii"
## [1] "Lung Squamous Cell Carcinoma"
## [2] "Lung Adenocarcinoma"
Using this information, we can start to think about how to slice and dice the data. For example, we probably want to use only one type of lung cancer, and there are two types here. We also want to work with primary tumors only, and also those that are from a higher stage.
We start with 1156 total samples.
adeno_info = dplyr::filter(lung_info,
disease_type %in% "Lung Adenocarcinoma",
!(tumor_stage %in% c("not reported", "stage ia", "stage i", "stage ib")),
!(sample_type %in% c("Recurrent Tumor")))
This gives us 264 samples. Let’s verify that we only have what we want:
dplyr::select(adeno_info, disease_type, tumor_stage, sample_type) %>%
purrr::iwalk(., function(.x, .y){
message(.y)
print(unique(.x))
})
## [1] "Lung Adenocarcinoma"
## [1] "stage iib" "stage iiia" "stage iia" "stage iiib"
## [5] "stage iv" "stage ii"
## [1] "Primary Tumor" "Solid Tissue Normal"
We also need to add something that is a bit more useful as an identifier of “normal” and “cancerous” tissue.
adeno_info = dplyr::mutate(
adeno_info,
disease = dplyr::case_when(
grepl("Tumor", sample_type) ~ "cancer",
grepl("Normal", sample_type) ~ "normal"
))
unique(adeno_info$disease)
## [1] "cancer" "normal"
table(adeno_info$disease)
##
## cancer normal
## 236 28
So, severely unbalanced, with 236 cancer samples and only 28 normal samples.
But now we can make a smaller version of the lung data with just these samples.
# we have to transform this to upper because that is what is on the matrix
adeno_info = dplyr::mutate(adeno_info, sample_id2 = toupper(sample_id))
lung_matrix = readRDS(here::here("data_files/recount_lung_raw_counts.rds"))
adeno_matrix = lung_matrix[, adeno_info$sample_id2]
dim(adeno_info)
## [1] 264 11
saveRDS(adeno_info, file = here::here("data_files/adeno_info.rds"))
saveRDS(adeno_matrix, file = here::here("data_files/adeno_raw_counts.rds"))
In addition to using the smaller set of samples, we can also select a smaller set of genes. We will look first for those that have a non-zero value in at least one sample. And then we will take a random sample of those.
set.seed(1234)
is_1 = purrr::map_lgl(seq(1, nrow(adeno_matrix)), function(in_row){
sum(adeno_matrix[in_row, ] > 0) > 0
})
use_rows = sample(which(is_1), 6000)
adeno_raw_sub = adeno_matrix[use_rows, ]
saveRDS(adeno_raw_sub, file = here::here("data_files/adeno_raw_sub_counts.rds"))
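As an aside, the same non-zero filter can be written without an explicit loop using rowSums(); here is a sketch on a toy matrix (the real code above operates on adeno_matrix):

```r
toy_counts = matrix(c(0, 0, 0,
                      5, 0, 2,
                      0, 1, 0),
                    nrow = 3, byrow = TRUE)

# TRUE for every row with at least one non-zero count
keep = rowSums(toy_counts > 0) > 0
keep
# returns FALSE TRUE TRUE
```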
scaled_counts = readRDS(here::here("data_files/recount_lung_scaled_counts.rds"))
adeno_scaled = scaled_counts[, adeno_info$sample_id2]
adeno_scaled_sub = adeno_scaled[use_rows, ]
saveRDS(adeno_scaled, file = here::here("data_files/adeno_scaled_counts.rds"))
saveRDS(adeno_scaled_sub, file = here::here("data_files/adeno_scaled_sub_counts.rds"))
This is really, really useful, because if you are having memory problems with the full set, then you can use the much smaller subset to test your code with, work it all out, and then run the code somewhere else with more memory available.
Quality control and quality assurance means we are looking for things that don’t fit with the others. We can use correlation amongst the samples to check if they match each other and see if something doesn’t fit.
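Here is the idea in miniature, using base R’s cor() on a made-up genes-by-samples matrix (the tutorial itself uses the ICI-Kendall-tau correlation, which also handles missing values): samples that share a signal correlate highly with each other, while an unrelated sample stands out.

```r
set.seed(42)
shared_signal = rnorm(100, mean = 10)

# samples A, B, C share the signal plus a little noise;
# sample D is pure noise and should not match the others
toy_mat = cbind(A = shared_signal + rnorm(100, sd = 0.5),
                B = shared_signal + rnorm(100, sd = 0.5),
                C = shared_signal + rnorm(100, sd = 0.5),
                D = rnorm(100, mean = 10))

# pairwise sample-sample rank correlations; D's column will be
# noticeably lower than the others
round(cor(toy_mat, method = "spearman"), 2)
```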
library(visualizationQualityControl)
library(ggplot2)
adeno_scaled_counts = readRDS(here::here("data_files/adeno_scaled_counts.rds"))
adeno_raw_counts = readRDS(here::here("data_files/adeno_raw_sub_counts.rds"))
adeno_info = readRDS(here::here("data_files/adeno_info.rds"))
We use a special correlation that is able to incorporate missing values when it calculates a pairwise ranked correlation. You can read more about it here.

Notice that here we used the sub-matrix of 6000 genes so that it will actually calculate. The correlations for this group are also available in the GitHub repo, under data_files/adeno_cor_6K.rds.
library(furrr)
plan(multicore)
library(ICIKendallTau)
sample_cor = ici_kendalltau(t(adeno_raw_counts))
saveRDS(sample_cor, file = here::here("data_files/adeno_cor_6K.rds"))
sample_cor = readRDS(here::here("data_files/adeno_cor_6K.rds"))
all.equal(adeno_info$sample_id2, colnames(sample_cor$cor))
## [1] TRUE
med_cor = median_correlations(sample_cor$cor, adeno_info$disease)
ggplot(med_cor, aes(x = med_cor)) + geom_histogram() +
facet_wrap(~ sample_class, ncol = 1)
In this plot, we’ve plotted the distributions of the median sample-sample correlations for both the cancer and the normal samples. Notice that in each of these, there is at least one outlier sample. We can also see that the “normal” distribution is in general higher than the “cancer” distribution.
We could also look at these in a heatmap, and we will see that each of these would have different correlations to the others.
use_cor = sample_cor$cor
# make a short id, because our sample_id's are really, really long
adeno_info$short_id = paste0("S", seq(1, nrow(adeno_info)))
# we have to change them here too, because we don't want them to
# overwhelm the heatmap
colnames(use_cor) = rownames(use_cor) = adeno_info$short_id
rownames(adeno_info) = adeno_info$short_id
cor_order = similarity_reorderbyclass(use_cor, adeno_info[, c("disease"), drop = FALSE], transform = "sub_1")
library(circlize)
colormap = colorRamp2(seq(0.5, 1, length.out = 20), viridis::viridis(20))
data_legend = generate_group_colors(2)
names(data_legend) = c("cancer", "normal")
row_data = adeno_info[, "disease", drop = FALSE]
row_annotation = list(disease = data_legend)
visqc_heatmap(use_cor, colormap, "Sample-Sample Correlation",
name = "ICI-Kt", row_color_data = row_data,
row_color_list = row_annotation, col_color_data = row_data,
col_color_list = row_annotation, row_order = cor_order$indices,
column_order = cor_order$indices)
For this one we will use the full matrix, because it is still quick.
out_frac = outlier_fraction(t(adeno_scaled_counts), adeno_info$disease)
ggplot(out_frac, aes(x = frac)) + geom_histogram() +
facet_wrap(~ sample_class, ncol = 1)
These are not nearly as clear cut.
We can combine these two scores and look for outliers within the combined score, for each of “normal” and “cancer”.
outliers = determine_outliers(med_cor, out_frac)
ggplot(outliers, aes(x = score, fill = outlier)) +
geom_histogram(position = "identity") +
facet_wrap(~ sample_class.frac, ncol = 1)
Now we can combine the outlier information with the previous information we had.
names(adeno_info)
## [1] "project" "sample_id" "gender"
## [4] "project_name" "race" "sample_type"
## [7] "primary_site" "tumor_stage" "disease_type"
## [10] "disease" "sample_id2" "short_id"
names(outliers)
## [1] "sample_id" "med_cor"
## [3] "sample_class.cor" "sample_class.frac"
## [5] "frac" "score"
## [7] "outlier"
adeno_info_outliers = dplyr::left_join(adeno_info,
outliers[, c("sample_id", "score", "outlier")],
by = c("sample_id2" = "sample_id"))
saveRDS(adeno_info_outliers, file = here::here("data_files/adeno_info_outliers.rds"))
And we can check what principal components analysis (PCA) shows us after removing the outliers. What we are looking for is that either PC1 or PC2 separate the two types of samples.
We use the scaled counts because recount has taken care of the “normalization” aspect for us. We also use the full data set with all of the genes. We also log-transform because PCA doesn’t like the error structure in the raw data. log1p is used because we have zero values, so we want to actually transform using log(value + 1) to make it work, and this is a built-in function that makes sure 1 is added in a sane way for large and small values.
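A quick illustration of why plain log() fails on counts and log1p() doesn’t, using some made-up values:

```r
counts = c(0, 1, 10, 1000)

# log() of a zero count is -Inf, which PCA cannot handle
log(counts)

# log1p(x) computes log(x + 1), so zeros map cleanly to 0
log1p(counts)

# it agrees with the naive log(x + 1), but is more accurate for tiny x
all.equal(log1p(counts), log(counts + 1))
# returns TRUE
```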
kept_info = dplyr::filter(adeno_info_outliers, !outlier)
kept_counts = log1p(adeno_scaled_counts[, kept_info$sample_id2])
kept_pca = prcomp(t(kept_counts), center = TRUE, scale. = FALSE)
kept_pca2 = cbind(as.data.frame(kept_pca$x), kept_info)
ggplot(kept_pca2, aes(x = PC1, y = PC2, color = disease)) +
geom_point() +
theme(legend.position = c(0.9, 0.1))
Yes! Everything looks A-OK! It might look like this with the outliers left in too, but why keep things that don’t seem to look quite like the others? Those will only introduce variance that we don’t want in our statistical calculations.
Now we need to run statistics on the genes and see what is differentially expressed between these two groups of samples.
We will use the sample_info to select the samples, and we will use the raw counts we extracted previously.
library(DESeq2)
adeno_info_outliers = readRDS(here::here("data_files/adeno_info_outliers.rds"))
count_data = readRDS(here::here("data_files/adeno_raw_counts.rds"))
adeno_info_outliers = dplyr::filter(adeno_info_outliers, !outlier)
count_data = count_data[, adeno_info_outliers$sample_id2]
And now we create the object that DESeq2 needs for the analysis.
# we need to convert the disease to a factor for the design of the comparisons
adeno_info_outliers$disease = factor(adeno_info_outliers$disease, levels = c("normal", "cancer"))
dds = DESeqDataSetFromMatrix(countData = count_data,
colData = adeno_info_outliers,
design = ~disease)
# this takes a few minutes to run, so I didn't run it
# directly in the tutorial
dds = DESeq(dds)
res = results(dds)
saveRDS(res, file = here::here("data_files/adeno_deseq_results.rds"))
res = readRDS(here::here("data_files/adeno_deseq_results.rds"))
res
## log2 fold change (MLE): disease cancer vs normal
## Wald test p-value: disease cancer vs normal
## DataFrame with 58037 rows and 6 columns
## baseMean log2FoldChange lfcSE
## <numeric> <numeric> <numeric>
## ENSG00000000003.14 2980.04372 1.047969 0.1410078
## ENSG00000000005.5 7.60756 2.285406 0.7518177
## ENSG00000000419.12 1501.37515 0.154944 0.1097474
## ENSG00000000457.13 1121.24995 0.416828 0.0892688
## ENSG00000000460.16 759.36940 1.117129 0.1017870
## ... ... ... ...
## ENSG00000283695.1 0.0657895 -0.2853105 1.932684
## ENSG00000283696.1 37.1191671 -0.1679310 0.166336
## ENSG00000283697.1 42.4001328 0.1553250 0.116881
## ENSG00000283698.1 0.5227232 0.0909751 0.498562
## ENSG00000283699.1 0.0690275 -0.3883752 1.670033
## stat pvalue padj
## <numeric> <numeric> <numeric>
## ENSG00000000003.14 7.43199 1.06973e-13 1.22651e-12
## ENSG00000000005.5 3.03984 2.36704e-03 5.73466e-03
## ENSG00000000419.12 1.41183 1.58001e-01 2.37857e-01
## ENSG00000000457.13 4.66936 3.02136e-06 1.24007e-05
## ENSG00000000460.16 10.97516 5.03176e-28 2.60668e-26
## ... ... ... ...
## ENSG00000283695.1 -0.147624 0.882640 NA
## ENSG00000283696.1 -1.009589 0.312692 0.417595
## ENSG00000283697.1 1.328910 0.183878 0.269469
## ENSG00000283698.1 0.182475 0.855210 0.900477
## ENSG00000283699.1 -0.232555 0.816107 NA
# we convert to a data.frame,
# and then filter down to most significant and biggest changes
res = as.data.frame(res)
sig_res = dplyr::filter(res, padj <= 0.001, abs(log2FoldChange) >= 2)
saveRDS(sig_res, here::here("data_files/adeno_deseq_sig.rds"))
So you got a list of significant genes! Now what?? Ideally, we want to be able to look for groups of genes with shared functionality. One simple way to do that is to ask:
- For some shared functionality among the differential genes, how many of the differential genes have it?
- If I select the same number of differential genes at random from the genome, how many of those genes will have the shared functionality?
These values can be approximated using a hypergeometric distribution and test, which is what we can easily do using the categoryCompare2 package. We are using it because it has facilities for some really neat things. However, if you don’t like it, you can easily use something else.
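Under the hood this kind of test boils down to base R’s phyper(); here is a sketch with made-up numbers: 1000 genes in the universe, 50 of them carrying some annotation, 100 significant genes, 12 of which carry that annotation.

```r
universe_size = 1000  # total genes measured
annotated = 50        # genes in the universe with this annotation
significant = 100     # number of differential genes
overlap = 12          # differential genes carrying the annotation

# expected overlap if the significant genes were drawn at random
significant * annotated / universe_size
# returns 5

# probability of seeing 12 or more annotated genes by chance alone
phyper(overlap - 1, annotated, universe_size - annotated, significant,
       lower.tail = FALSE)
```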
If you want some more background on the subject, I definitely recommend checking out one of the first papers on this topic by Gavin Sherlock. I also recommend reading up on what the Gene Ontology is.
So the data we need for this are the significant gene lists, as well as the total genes measured in this experiment, and their annotations.
We are going to use the gene info object from way, way back at the beginning of this tutorial to get the actual gene names.
gene_info = readRDS(here::here("data_files/recount_lung_gene_info.rds"))
all_genes = unique(unlist(gene_info$symbol))
sig_res = readRDS(here::here("data_files/adeno_deseq_sig.rds")) %>%
dplyr::mutate(gene_id = rownames(.)) %>%
dplyr::left_join(., gene_info, by = "gene_id")
sig_up = dplyr::filter(sig_res, log2FoldChange > 0) %>%
dplyr::pull(symbol) %>% unlist(.) %>% unique()
sig_down = dplyr::filter(sig_res, log2FoldChange < 0) %>%
dplyr::pull(symbol) %>% unlist(.) %>% unique()
And now we will get the Gene Ontology annotations for those genes directly from Bioconductor.
library(org.Hs.eg.db)
library(GO.db)
library(categoryCompare2)
go_all_gene = select(org.Hs.eg.db, keys = all_genes, keytype = "SYMBOL", columns = c("GOALL", "ONTOLOGYALL"))
go_2_gene = split(go_all_gene$SYMBOL, go_all_gene$GOALL)
go_2_gene = purrr::map(go_2_gene, unique)
go_desc = select(GO.db, keys = names(go_2_gene), columns = "TERM", keytype = "GOID")$TERM
names(go_desc) = names(go_2_gene)
Now we create the categoryCompare2 annotation object.
go_annotation = annotation(annotation_features = go_2_gene,
description = go_desc,
annotation_type = "GO")
go_annotation
## Annotation Type: GO
## Feature Type: UNKNOWN
## Number of Annotations: 22687
## Number of Genes: 19307
up_enrich = hypergeometric_feature_enrichment(
new("hypergeom_features", significant = sig_up,
universe = all_genes, annotation = go_annotation),
p_adjust = "BH"
)
down_enrich = hypergeometric_feature_enrichment(
new("hypergeom_features", significant = sig_down,
universe = all_genes, annotation = go_annotation),
p_adjust = "BH"
)
comb_enrich = combine_enrichments(up = up_enrich,
down = down_enrich)
comb_sig = get_significant_annotations(comb_enrich, padjust <= 0.01, counts >= 2)
comb_sig
## An object of class "combined_enrichment"
## Slot "enriched":
## $up
## Enrichment Method:
## Annotation Type: GO
## Significant Features: 1441
## Universe Size: 19307
##
## $down
## Enrichment Method:
## Annotation Type: GO
## Significant Features: 672
## Universe Size: 19307
##
##
## Slot "annotation":
## Annotation Type: GO
## Feature Type: UNKNOWN
## Number of Annotations: 22687
## Number of Genes: 19307
##
## Slot "statistics":
## Signficance Cutoffs:
## padjust <= 0.01
## counts >= 2
##
## Counts:
## up down counts
## G1 1 1 23
## G2 1 0 175
## G3 0 1 340
## G4 0 0 22149
Very cool, we have shared annotations to both up and down genes, as well as some that are specific to each. Now, how do we see what they are?
One way is to use a graph, where each annotation is linked to others based on their shared genes.
graph_sig = generate_annotation_graph(comb_sig, low_cut = 0, hi_cut = Inf)
graph_sig
## A cc_graph with
## Number of Nodes = 538
## Number of Edges = 110852
## up down counts
## G1 1 1 23
## G2 1 0 175
## G3 0 1 340
That is a lot of edges! Like, way, way too many. Let’s remove a bunch of them. What we are going to do is remove those edges where less than 80% (the 0.8) of the genes are shared between the annotations.
graph_sig = remove_edges(graph_sig, 0.8)
graph_sig
## A cc_graph with
## Number of Nodes = 538
## Number of Edges = 577
## up down counts
## G1 1 1 23
## G2 1 0 175
## G3 0 1 340
Much better!
Now let’s generate colors for each group, and do some grouping.
assign_sig = annotation_combinations(graph_sig)
assign_sig = assign_colors(assign_sig)
communities_sig = assign_communities(graph_sig)
community_labels = label_communities(communities_sig, go_annotation)
We can look at these in a table. We will only display the first 20 rows here; you can look at the full table in a tab-delimited file.
table_sig = table_from_graph(graph_sig, assign_sig, community_labels)
write.table(table_sig, file = here::here("data_files/enrichment_table.txt"),
sep = "\t", row.names = FALSE, col.names = TRUE)
knitr::kable(table_sig[1:20, ], digits = 2)
name | description | sig_group | up.p | up.odds | up.expected | up.counts | up.padjust | down.p | down.odds | down.expected | down.counts | down.padjust | group |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
secretion | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1 | ||
GO:0002274 | myeloid leukocyte activation | down | 1.00 | 0.55 | 48.81 | 28 | 1.00 | 0 | 2.02 | 22.76 | 43 | 0.00 | 1 |
GO:0002275 | myeloid cell activation involved in immune response | down | 1.00 | 0.59 | 40.45 | 25 | 1.00 | 0 | 2.09 | 18.86 | 37 | 0.01 | 1 |
GO:0002283 | neutrophil activation involved in immune response | down | 1.00 | 0.58 | 36.12 | 22 | 1.00 | 0 | 2.15 | 16.85 | 34 | 0.01 | 1 |
GO:0002444 | myeloid leukocyte mediated immunity | down | 1.00 | 0.56 | 40.98 | 24 | 1.00 | 0 | 2.13 | 19.11 | 38 | 0.00 | 1 |
GO:0002446 | neutrophil mediated immunity | down | 1.00 | 0.60 | 36.94 | 23 | 1.00 | 0 | 2.24 | 17.23 | 36 | 0.00 | 1 |
GO:0006887 | exocytosis | down | 0.96 | 0.79 | 66.95 | 54 | 1.00 | 0 | 1.96 | 31.22 | 57 | 0.00 | 1 |
GO:0032940 | secretion by cell | down | 0.61 | 0.97 | 105.39 | 103 | 1.00 | 0 | 1.92 | 49.15 | 86 | 0.00 | 1 |
GO:0042119 | neutrophil activation | down | 1.00 | 0.57 | 37.02 | 22 | 1.00 | 0 | 2.10 | 17.26 | 34 | 0.01 | 1 |
GO:0045055 | regulated exocytosis | down | 0.94 | 0.80 | 58.66 | 48 | 1.00 | 0 | 2.14 | 27.36 | 54 | 0.00 | 1 |
GO:0046903 | secretion | down | 0.42 | 1.02 | 115.61 | 118 | 1.00 | 0 | 2.00 | 53.91 | 97 | 0.00 | 1 |
GO:0140352 | export from cell | down | 0.61 | 0.98 | 109.34 | 107 | 1.00 | 0 | 1.99 | 50.99 | 92 | 0.00 | 1 |
circulatory system development | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 2 |
GO:0001525 | angiogenesis | down | 0.90 | 0.81 | 44.56 | 37 | 1.00 | 0 | 3.24 | 20.78 | 59 | 0.00 | 2 |
GO:0001568 | blood vessel development | down | 0.73 | 0.92 | 58.07 | 54 | 1.00 | 0 | 2.89 | 27.08 | 69 | 0.00 | 2 |
GO:0001944 | vasculature development | down | 0.79 | 0.90 | 60.53 | 55 | 1.00 | 0 | 2.91 | 28.23 | 72 | 0.00 | 2 |
GO:0035239 | tube morphogenesis | down | 0.36 | 1.05 | 69.86 | 73 | 1.00 | 0 | 2.59 | 32.58 | 75 | 0.00 | 2 |
GO:0035295 | tube development | down | 0.08 | 1.17 | 84.41 | 97 | 0.91 | 0 | 2.54 | 39.37 | 88 | 0.00 | 2 |
GO:0045765 | regulation of angiogenesis | down | 0.92 | 0.76 | 29.48 | 23 | 1.00 | 0 | 2.70 | 13.75 | 34 | 0.00 | 2 |
GO:0048514 | blood vessel morphogenesis | down | 0.73 | 0.92 | 51.65 | 48 | 1.00 | 0 | 2.96 | 24.09 | 63 | 0.00 | 2 |
And put them in a network widget:
network_sig = graph_to_visnetwork(graph_sig, assign_sig, community_labels)
# we don't run this in the tutorial, because it results in a big
# json object
vis_visnetwork(network_sig)
generate_legend(assign_sig)
A lot of the time, we want to be able to query our original data set, asking which genes are annotated to the GO terms, along with the information about those genes. For this to work, we first have to create the correct mapping of gene SYMBOL to gene ID.
all_gene_symbol = sig_res %>%
dplyr::select(gene_id, symbol)
symbol_2_gene = purrr::map_df(seq(1, nrow(all_gene_symbol)),
function(in_row){
data.frame(gene_id = all_gene_symbol$gene_id[in_row],
symbol = unlist(all_gene_symbol$symbol[in_row]))
})
And then merge this back with the original data.
sig_res2 = dplyr::left_join(
  sig_res %>%
    dplyr::select(-symbol), symbol_2_gene, by = "gene_id"
)
Nice. Now we can filter this back down to what is specific to the up and down enrichments we had earlier.
up_data = sig_res2 %>%
dplyr::filter(symbol %in% sig_up)
down_data = sig_res2 %>%
dplyr::filter(symbol %in% sig_down)
Instead of categoryCompare, let's try something else.
We extract the statistical results, and only keep those GO terms that were significant in either set of genes.
go_gene_count = purrr::imap_dfr(go_2_gene, function(.x, .y){
data.frame(GO = .y, n_gene = length(.x))
})
go_desc_df = data.frame(GO = names(go_desc),
definition = go_desc,
row.names = NULL)
stats = comb_sig@statistics@statistic_data
sig = comb_sig@statistics@significant@significant
sig_any = rowSums(sig) > 0
keep_stats = stats[sig_any, ]
keep_stats$GO = rownames(keep_stats)
dim(keep_stats)
## [1] 538 11
keep_stats = dplyr::left_join(keep_stats, unique(go_all_gene[, c("GOALL", "ONTOLOGYALL")]), by = c("GO" = "GOALL"))
keep_stats = dplyr::left_join(keep_stats, go_gene_count, by = "GO")
keep_stats = dplyr::left_join(keep_stats, go_desc_df, by = "GO")
saveRDS(keep_stats, file = "data_files/go_stats.rds")
Now let's break those GO terms into groups by semantic similarity. We start with Biological Process, because it's the biggest ontology, and usually the most informative.
# this needs to be run in a base R session, NOT RStudio!!
library(simplifyEnrichment)
library(ComplexHeatmap)
library(magrittr)
# first test that this will actually run.
# If the code below takes > 10 seconds (seriously, count to 10), kill it,
# and download and install the 2.1.0 version of the
# cluster package
# https://cran.r-project.org/src/contrib/Archive/cluster/cluster_2.1.0.tar.gz
# remotes::install_local("path_2_downloaded_cluster_2.1.0.tar.gz")
test_go = random_GO(500)
test_mat = GO_similarity(test_go, "BP")
test_cluster = cluster_terms(test_mat)
## Cluster 500 terms by 'binary_cut'... 17 clusters, used 1.237396 secs.
You want to run this code outside of RStudio (i.e. open plain "R"). The RStudio plot viewer does odd things that prevent you from enlarging the plot.
library(simplifyEnrichment)
library(ComplexHeatmap)
library(magrittr)
# now we can run the actual data we have
go_stats = readRDS("data_files/go_stats.rds")
bp_sig = go_stats %>%
dplyr::filter(ONTOLOGYALL %in% "BP", n_gene <= 500, n_gene >= 5)
bp_go = bp_sig$GO
bp_mat = GO_similarity(bp_go, "BP")
bp_clusters = cluster_terms(bp_mat)
## Cluster 230 terms by 'binary_cut'... 15 clusters, used 0.4439886 secs.
bp_sig = bp_sig %>%
dplyr::mutate(up.fdr = -1 * log10(up.padjust),
down.fdr = -1 * log10(down.padjust))
bp_fdr = as.matrix(bp_sig[, c("up.fdr", "down.fdr")])
rownames(bp_fdr) = bp_sig$GO
library(circlize)
col_map = colorRamp2(seq(0, 3, length.out = 20), viridis::viridis(20))
fdr_heatmap = Heatmap(bp_fdr, col_map, "-Log10(FDR)", cluster_rows = FALSE, cluster_columns = FALSE, width = unit(6, "cm"))
# if we order by size, then we can see the clusters, and pull them out
# using another bit of data.
bp_heatmap = ht_clusters(bp_mat, bp_clusters, ht_list = fdr_heatmap,
exclude_words = c("regulation", "process", "pathway"),
order_by_size = TRUE)
cluster_membership = data.frame(GO = rownames(bp_mat),
cluster = bp_clusters)
bp_sig = dplyr::left_join(bp_sig, cluster_membership, by = "GO")
cluster_sizes = bp_sig %>%
dplyr::group_by(cluster) %>%
dplyr::summarise(n_go = length(GO)) %>%
dplyr::arrange(dplyr::desc(n_go))
knitr::kable(cluster_sizes)
cluster | n_go |
---|---|
2 | 81 |
4 | 48 |
1 | 46 |
8 | 13 |
3 | 8 |
6 | 6 |
7 | 6 |
13 | 6 |
10 | 3 |
11 | 3 |
14 | 3 |
5 | 2 |
9 | 2 |
12 | 2 |
15 | 1 |
Now we can see that the third cluster down is actually the terms in cluster 1. If we are interested in those terms, because they have some of the best separation between the two gene lists, we can get those terms and the genes annotated to them.
cluster_1 = bp_sig %>%
dplyr::filter(cluster %in% 1)
Now, with this information, can we find the genes that were annotated to these terms, and their fold-changes? Hint: you have the GO terms and their annotated genes in go_2_gene, and the up and down data available.
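If you want a nudge on the shape of the answer (without giving it away for the real data), the lookup pattern looks something like this toy sketch; go_2_gene_toy, up_data_toy, the GO IDs, and the gene names are all invented for illustration, standing in for the real go_2_gene and up_data/down_data objects:

```r
# Made-up GO-term-to-gene mapping, mimicking the structure of go_2_gene.
go_2_gene_toy = list("GO:0000001" = c("TP53", "MYC"),
                     "GO:0000002" = c("MYC", "EGFR"))
# Made-up fold-change table, mimicking up_data.
up_data_toy = data.frame(symbol = c("TP53", "MYC", "BRCA1"),
                         logFC = c(1.2, 2.3, 0.8))

# Genes annotated to one term of interest.
term_genes = go_2_gene_toy[["GO:0000001"]]

# Subset the fold-change table to just those genes.
matched_up = up_data_toy[up_data_toy$symbol %in% term_genes, ]
matched_up
```

With the real data you would loop this over the GO terms in cluster_1, pulling fold-changes from both up_data and down_data.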