hms-dbmi / dseqr Goto Github PK

View Code? Open in Web Editor NEW

20.0 8.0 4.0 37.74 MB

single-cell and bulk RNA-seq analyses from counts → pathways → drug candidates.

Home Page: https://docs.dseqr.com

License: Other

R 96.30% JavaScript 1.26% Shell 0.12% CSS 1.76% Dockerfile 0.44% HTML 0.11%

single-cell-rna-seq rna-seq drug-repurposing bulk-rna-seq

dseqr's Introduction

Dseqr

End-to-End RNA-Seq Analysis

Dseqr is a web application that helps you run 10X single-cell and bulk RNA-seq analyses from fastq → pathways → drug candidates.

💡 Read the Docs and Open Dseqr →

Local setup

# install
install.packages('remotes')
remotes::install_github('hms-dbmi/dseqr')

# initialize and run new project
library(dseqr)
project_name <- 'example'

# directory to store application and project files in
data_dir <- './dseqr'

run_dseqr(project_name, data_dir)

To enable bulk fastq.gz import, first build a kallisto index for quantification. To do so run:

# default as used by run_dseqr
indices_dir <- file.path(data_dir, '.indices_dir')

rkal::build_kallisto_index(indices_dir)

scRNA-seq fastqs

dseqr can directly import cellranger formatted count matrices. If you are starting from fastq files, first install kb-python:

# install kallisto|bustools wrapper (required)
pip install kb-python

Then run pseudo-quantification:

# download pre-built index (mouse or human)
dseqr::download_kb_index(indices_dir, species = 'human')

# run pseudo-quantification
data_dir <- 'path/to/folder/with/fastqs'
dseqr::run_kb_scseq(indices_dir, data_dir, species = 'human')

# clean intermediate files produced by kb
dseqr::clean_kb_scseq(data_dir)

The resulting cellranger formatted count matrix files will be in the data_dir subdirectory bus_output/counts_unfiltered/cellranger.

Prefer docker?

# pull image
docker pull alexvpickering/dseqr

# run at http://0.0.0.0:3838/ and keep data on exit
docker run -v /full/path/to/data_dir:/srv/dseqr \
-p 3838:3838 \
alexvpickering/dseqr R -e 'library(dseqr); run_dseqr("example", "/srv/dseqr")'

Host it

To spin up your own AWS infrastructure to host dseqr, see dseqr.aws →

dseqr's People

Contributors

Stargazers

Watchers

Forkers

saisaitian nailouzhang qindan2008 huizhong1993

dseqr's Issues

generate some small sample data for testing/dev/code sample purposes

IBD data is too large for github.

Going to just repackage first 1000 or so reads from each sample from IBD data.

break up L1000/CMAP02 pdata title into components

allow user to select control/test groups

get ancillary info

Go ahead with:

GRAS
SIDER links
Drugbank links
Wikipedia links

Also see #4 for further discussion

MDS plot showing on counts scale, not log counts

Probably required exprs_sva to be run through limma::voom.

detect paired samples for pair-ended experiments

See issue #2 (detect single/paired end experiments) for some related discussion.

Will put this on hold until we have some example paired data to work with.

show top CMAP02/L1000 compounds

annotate loaded rna-seq quants

Needs to match CMAP02/L1000 annotation as I currently have it (HGNC symbol).

show histogram of correlations across all drugs

figure out why Alex and Sam's IBD analyses differ

I got about a 0.45 pearson correlation between the logFC values for my analysis run with the current drugseqr pipeline and your Biojupies analysis. This seems kind of low for two different pipelines on the same dataset (maybe not?).

Here are possible differences/unknowns that I can gather based on looking at Sam's report that might contribute:

quantification of raw data (Alex - salmon, Sam - ?)
library normalization (Alex - edgeR::calcNormFactors, Sam - logCPM?)
surrogate variable analysis (Alex - Yes, Sam - No)
limma analysis (Alex - limma::voom transformation, Sam - ?)

I'll try running a couple variations on my pipeline (e.g. without sva) to see how much difference it makes.

Let me know if you have any ideas samuelfinlayson

┆Attachments: mds-sva.png

implement study selection for explore_results

Should show CMAP02/L1000 results based on selection

select_contrast rows increases height when add group

match sample annotations to raw data files

find opposing signatures in CMAP02/L1000

expand ancillary info

Started with Broad Repurposing Hub Data (Thanks samuelfinlayson!)

Near term:

SIDER side effects link out
Drugbank linkout
GRAS

Longer term:

info on approved drug with closest RDKIT morgan similarity

benchmark L1000 similarity measures across cell lines

samuelfinlayson suggested assessing similarity measures like in my previous blog post but instead of using external GEO datasets, using LINCS data from one cell line to try and retrieve the same perturbation in a different cell line.

This approach has some nice advantages:

no need to gather/label external data
creates a larger training dataset for possible neural networks

make run_salmon handle paired-end and single-end reads

group color/names erase on sort and don't work on initially sorted table

Related to #9

NA cids in L1000 pdata

causes malformed links for #6

incorrect pubchem cids in L1000 pdata

Incorrect values include:

0
-666
MLS003116075

Like #36, plan to fix in private l1000 package and propagate.

show ellipses for wide columns in explore_results

e.g. moa column

DT tolower(x) alerts on filtering

started happening after 03346dc

Seems to be related to UTF-8 issues

nnet based connectivity mapping

samuelfinlayson expressed interest in training a Siamese neural network that could take as input a LINCS/CMAP02 signature and a query signature and produce a ranked list of compounds. This would contrast with more classical signature-similarity approaches (pearson correlation, cosine similarity, KS).

He mentioned this in the context of his PhD work as well.

validate user selections for select_pairs and add unit tests

L1000 is slow in explore_results

The problem is that it takes a bit of time to summarise the ~230,000 signatures by compound name. To resolve this I am going to first subset the results, keeping all entries for compounds that have a correlation value less than the 100th top result with a clinical phase. This way, when filtering by clinical phase, you will still see 100 results. When not filtering by clinical phase, you will also see all compounds that have a correlation value less than (stronger negative correlation) the 100th top results with a clinical phase.

collapse results for same drug

To do:

show all correlations for a compound
sort based on all signatures for same pert (see #40)?
show a tool tip on hover that indicates cell line, dose, etc.

adapt pipeline for single cell RNA-seq data

unit test load_seq helpers

get dprime values from result of diff_expr

l1000_es and cmap_es are dprime effect size values so they should be queried using equivalent measures.

For same L1000 compound, some entries have missing cids while others don't

e.g. 'WZ-4-145' has cid 73707531 for some entries but none for others.

Fill in the missing cids using non-missing entries. This will be done in my private l1000 package so that the fix propagates to other code that uses it.

use a within-same-drug measure to sort L1000/CMAP02 results

Sorting measures can be evaluated using the same benchmark/data that I have used in the past.

I have mentioned to samuelfinlayson that performing differential expression on L1000/CMAP02 at the pert level (ignoring dose, time, and cell line) results in worse benchmark results. This suggests that any sort of resorting should be carefully benchmarked as it could also decrease performance.

rnama.com does something like the following:

perts are first sorted by decreasing percentage of consistent signatures
ties are first broken by number of total signatures
ties are then broken by largest absolute correlation value

This sorting on rnama.com has not been benchmarked.

isaackohane1 mentioned this stackexchange as well but I'm not totally clear on the how the stats work when you wants to get a confidence interval across multiple correlations (for the same compound).

Empirical fdr/pvals can also be derived across multiple correlations using fdrtool. This approach could be used to derive pvals and then do some sort of p-value meta analysis (lots of methods in metap package). There are also correlation value meta-analysis methods (e.g. meta::metacor).

I think that unless we benchmark and find that resorting improves connectivity mapping, the cautious approach is to maintain the original sort (i.e. minimum correlation).

run differential expression analysis

Requires porting/adapting code from crossmeta::diff_expr and private repo rnama::diff_expr (initial rnama.com website which would run sva::svaseq on RNA-seq data).

cmap_es and l1000_es HGNC symbols don't always match up with tx2gene

For example, in l1000_es, Cell Migration Inducing Hyaluronidase 2 is annotated as TMEM2 whereas get_tx2gene returns CEMIP2. This is likely related to changes in official HGNC symbols (TMEM2 used to be the official HGNC symbol for what is now CEMIP2).

I'm not sure what the best approach is to resolve this. Possibilities:

save a reference entrezid --> HGNC mapping and always use that. The downside of this approach would be that over time the official symbols may differ. I think I will go with this and use the mappings that I used for cmap_es and l1000_es. This was downloaded from NCBI ftp://ftp.ncbi.nih.gov/gene/DATA/gene2accession.gz on 2017-08-12.
save everything at the entrezid level and only map to HGNC as needed. The downsides to this are increased complexity and loss of interpretability. Specifically, mapping from entrezid --> HGNC would have to occur at multiple different locations instead of once and all the data would be annotated with entrezid numbers instead of gene symbols. There would also be a loss of integration with other code I have developed that may be useful in the future. If everything was from scratch I might do this but I think it is probably a poor choice now.

rank-based differential expression

This could be useful for datasets where we only have disease samples (e.g. chronic fatigue patient).

isaackohane1 mentioned some previous work where control samples from a different study could be used successfully if gene rank-based differential expression was used.

show ancillary info

grab CMAP02/L1000 data

use most advanced clinical phase for L1000/CMAP02

Some compounds with the same Pubchem CID have a different clinical_phase. I think it makes the most sense to show the most advanced phase only. This will simplify sorting for #37.

load rna-seq quants

filter CMAP02/L1000 results based on clinical status

do monoclonal ab's cause similar gex changes as crispr?

isaackohane1 and samuelfinlayson were discussing the contrast between the specificity of cripr and antibodies as compared to small molecules.

The question came up regarding whether targeting a gene at the protein level through a monoclonal anti-body would produce similar changes in gene expression as those caused by cripr at the genetic level.

isaackohane1 mentioned the possibility of applying for funding to create a CMAP-like dataset for monoclonal antibodies.

add ability to pair samples (e.g. sibling controls or self control) to select_contrast

Rashmi mentioned something like 26 SJIA patients that blood samples have been taken from (expecting ~40 at upcoming family day) including some unaffected sibling control samples.

If running an analysis using multiple patients with controls that are specific to each patient this can be specified in the model matrix.

Some L1000 compounds that should have a clinical status do not

e.g. atorvastatin, canrenoic-acid

download transcriptome/build index for salmon

Pubchem CIDs and DrugBank IDs are not informative

Seeing the value of these ids (Pubchem CIDs used for links to both Pubchem and SIDER) tells you nothing useful other than that they exist.

I think it makes more sense to just have a single External Links column with an image linking to the respective data source. This will reduce table width (single column vs 3), be visually appealing, and be just as informative.

Related to #43

get CMAP02/L1000 annotations

raw rna-seq quants using salmon

Possible sources:

FASTQ files of bulk rna-seq reads
single-cell rna-seq data
affy microarrays

Will initially assume bulk rna-seq with biological reps as a starting point.

detect single/paired end experiments

Need a raw data format standard for distinguishing if multiple files belong to the same sample and if the data is single-end or paired-end.

The SRA (GEO) uses the format SRR_1.fastq.gz where:

SRR is the sample name
_1 indicates that file contains reads for one end in a paired-end experiment
absence of either _1 or _2 indicates single-end reads.

Some standard like above would be useful for auto-determination of quantification flags.

diff_expr mds plot could be more informative

Would be nice to see:

before and after sva plot (side by side)
ability to hover over samples to inspect outliers
use ggplot2/plotly to improve appearance/fix legend positioning (e.g. here)
add some jitter/alpha so that overlapping points are visible.

sva in diff_setup fails if not all samples are selected

Related to #8

Caused by difference on mod0 (uses selected and non-selected samples) and mod (uses selected samples only).