Code Monkey home page Code Monkey logo

dseqr's Introduction

CI DOI

Dseqr

End-to-End RNA-Seq Analysis

Dseqr is a web application that helps you run 10X single-cell and bulk RNA-seq analyses from fastq → pathways → drug candidates.

💡 Read the Docs and Open Dseqr →

Local setup

# install
install.packages('remotes')
remotes::install_github('hms-dbmi/dseqr')

# initialize and run new project
library(dseqr)
project_name <- 'example'

# directory to store application and project files in
data_dir <- './dseqr'

run_dseqr(project_name, data_dir)

To enable bulk fastq.gz import, first build a kallisto index for quantification. To do so run:

# default as used by run_dseqr
indices_dir <- file.path(data_dir, '.indices_dir')

rkal::build_kallisto_index(indices_dir)

scRNA-seq fastqs

dseqr can directly import cellranger formatted count matrices. If you are starting from fastq files, first install kb-python:

# install kallisto|bustools wrapper (required)
pip install kb-python

Then run pseudo-quantification:

# download pre-built index (mouse or human)
dseqr::download_kb_index(indices_dir, species = 'human')

# run pseudo-quantification
data_dir <- 'path/to/folder/with/fastqs'
dseqr::run_kb_scseq(indices_dir, data_dir, species = 'human')

# clean intermediate files produced by kb
dseqr::clean_kb_scseq(data_dir)

The resulting cellranger formatted count matrix files will be in the data_dir subdirectory bus_output/counts_unfiltered/cellranger.

Prefer docker?

# pull image
docker pull alexvpickering/dseqr

# run at http://0.0.0.0:3838/ and keep data on exit
docker run -v /full/path/to/data_dir:/srv/dseqr \
-p 3838:3838 \
alexvpickering/dseqr R -e 'library(dseqr); run_dseqr("example", "/srv/dseqr")'

Host it

To spin up your own AWS infrastructure to host dseqr, see dseqr.aws →

dseqr's People

Contributors

alex-free avatar alexvpickering avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

dseqr's Issues

get ancillary info

Go ahead with:

  • GRAS
  • SIDER links
  • Drugbank links
  • Wikipedia links

Also see #4 for further discussion

figure out why Alex and Sam's IBD analyses differ

I got about a 0.45 pearson correlation between the logFC values for my analysis run with the current drugseqr pipeline and your Biojupies analysis. This seems kind of low for two different pipelines on the same dataset (maybe not?).

Here are possible differences/unknowns that I can gather based on looking at Sam's report that might contribute:

  • quantification of raw data (Alex - salmon, Sam - ?)
  • library normalization (Alex - edgeR::calcNormFactors, Sam - logCPM?)
  • surrogate variable analysis (Alex - Yes, Sam - No)
  • limma analysis (Alex - limma::voom transformation, Sam - ?)

I'll try running a couple variations on my pipeline (e.g. without sva) to see how much difference it makes.

Let me know if you have any ideas samuelfinlayson

┆Attachments: mds-sva.png

benchmark L1000 similarity measures across cell lines

samuelfinlayson suggested assessing similarity measures like in my previous blog post but instead of using external GEO datasets, using LINCS data from one cell line to try and retrieve the same perturbation in a different cell line.


This approach has some nice advantages:

  • no need to gather/label external data
  • creates a larger training dataset for possible neural networks

nnet based connectivity mapping

samuelfinlayson expressed interest in training a Siamese neural network that could take as input a LINCS/CMAP02 signature and a query signature and produce a ranked list of compounds. This would contrast with more classical signature-similarity approaches (pearson correlation, cosine similarity, KS).

He mentioned this in the context of his PhD work as well.

L1000 is slow in explore_results

The problem is that it takes a bit of time to summarise the ~230,000 signatures by compound name. To resolve this I am going to first subset the results, keeping all entries for compounds that have a correlation value less than the 100th top result with a clinical phase. This way, when filtering by clinical phase, you will still see 100 results. When not filtering by clinical phase, you will also see all compounds that have a correlation value less than (stronger negative correlation) the 100th top results with a clinical phase.

collapse results for same drug

To do:

  • show all correlations for a compound
  • sort based on all signatures for same pert (see #40)?
  • show a tool tip on hover that indicates cell line, dose, etc.

use a within-same-drug measure to sort L1000/CMAP02 results

Sorting measures can be evaluated using the same benchmark/data that I have used in the past.

I have mentioned to samuelfinlayson that performing differential expression on L1000/CMAP02 at the pert level (ignoring dose, time, and cell line) results in worse benchmark results. This suggests that any sort of resorting should be carefully benchmarked as it could also decrease performance.


rnama.com does something like the following:

  • perts are first sorted by decreasing percentage of consistent signatures
  • ties are first broken by number of total signatures
  • ties are then broken by largest absolute correlation value

This sorting on rnama.com has not been benchmarked.

isaackohane1 mentioned this stackexchange as well but I'm not totally clear on the how the stats work when you wants to get a confidence interval across multiple correlations (for the same compound).

Empirical fdr/pvals can also be derived across multiple correlations using fdrtool. This approach could be used to derive pvals and then do some sort of p-value meta analysis (lots of methods in metap package). There are also correlation value meta-analysis methods (e.g. meta::metacor).


I think that unless we benchmark and find that resorting improves connectivity mapping, the cautious approach is to maintain the original sort (i.e. minimum correlation).

run differential expression analysis

Requires porting/adapting code from crossmeta::diff_expr and private repo rnama::diff_expr (initial rnama.com website which would run sva::svaseq on RNA-seq data).

cmap_es and l1000_es HGNC symbols don't always match up with tx2gene

For example, in l1000_es, Cell Migration Inducing Hyaluronidase 2 is annotated as TMEM2 whereas get_tx2gene returns CEMIP2. This is likely related to changes in official HGNC symbols (TMEM2 used to be the official HGNC symbol for what is now CEMIP2).

I'm not sure what the best approach is to resolve this. Possibilities:

  1. save a reference entrezid --> HGNC mapping and always use that. The downside of this approach would be that over time the official symbols may differ. I think I will go with this and use the mappings that I used for cmap_es and l1000_es. This was downloaded from NCBI ftp://ftp.ncbi.nih.gov/gene/DATA/gene2accession.gz on 2017-08-12.

  2. save everything at the entrezid level and only map to HGNC as needed. The downsides to this are increased complexity and loss of interpretability. Specifically, mapping from entrezid --> HGNC would have to occur at multiple different locations instead of once and all the data would be annotated with entrezid numbers instead of gene symbols. There would also be a loss of integration with other code I have developed that may be useful in the future. If everything was from scratch I might do this but I think it is probably a poor choice now.

rank-based differential expression

This could be useful for datasets where we only have disease samples (e.g. chronic fatigue patient).

isaackohane1 mentioned some previous work where control samples from a different study could be used successfully if gene rank-based differential expression was used.

do monoclonal ab's cause similar gex changes as crispr?

isaackohane1 and samuelfinlayson were discussing the contrast between the specificity of cripr and antibodies as compared to small molecules.

The question came up regarding whether targeting a gene at the protein level through a monoclonal anti-body would produce similar changes in gene expression as those caused by cripr at the genetic level.

isaackohane1 mentioned the possibility of applying for funding to create a CMAP-like dataset for monoclonal antibodies.

Pubchem CIDs and DrugBank IDs are not informative

Seeing the value of these ids (Pubchem CIDs used for links to both Pubchem and SIDER) tells you nothing useful other than that they exist.

I think it makes more sense to just have a single External Links column with an image linking to the respective data source. This will reduce table width (single column vs 3), be visually appealing, and be just as informative.

Related to #43

raw rna-seq quants using salmon

Possible sources:

  • FASTQ files of bulk rna-seq reads
  • single-cell rna-seq data
  • affy microarrays

Will initially assume bulk rna-seq with biological reps as a starting point.

detect single/paired end experiments

Need a raw data format standard for distinguishing if multiple files belong to the same sample and if the data is single-end or paired-end.

The SRA (GEO) uses the format SRR_1.fastq.gz where:

  • SRR is the sample name
  • _1 indicates that file contains reads for one end in a paired-end experiment
  • absence of either _1 or _2 indicates single-end reads.

Some standard like above would be useful for auto-determination of quantification flags.

diff_expr mds plot could be more informative

Would be nice to see:

  • before and after sva plot (side by side)
  • ability to hover over samples to inspect outliers
  • use ggplot2/plotly to improve appearance/fix legend positioning (e.g. here)
  • add some jitter/alpha so that overlapping points are visible.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.