

demux

A pipeline for running single-cell demultiplexing simulations with demuxlet.

Introduction

demux is a Snakemake pipeline for simulating a multiplexed droplet scRNA-seq (dscRNA-seq) experiment using data from individual scRNA-seq samples and quantifying how effectively demuxlet can deconvolute the sample identity of each cell in the simulated dataset. Such an analysis is helpful for reducing the cost of library preparations for dscRNA-seq experiments.

Here is an example flowchart depicting the demux pipeline with five input samples (see the flowchart image in the repository).

Each step is briefly described below:

  1. unique_barcodes: aggregate cell barcodes across all samples provided as input and remove any cell barcodes that appear more than once
  2. simulate: simulate a multiplexed dscRNA-seq experiment with a specified doublet rate (default: 0.3). The doublet rate specifies the percentage of cells from the aggregate dataset expected to be found in doublets. We define two types of doublets: (1) doublets containing cells from different samples, and (2) doublets containing cells from the same sample
  3. table: create a reference table mapping the original (ground truth) barcodes to the new barcodes (for analyzing demuxlet performance)
  4. new_bam: edit the BAM files corresponding to each sample provided as input to reflect simulated doublets. For every pair of cells randomly selected to be in a doublet, we change the cell barcode of one cell in the pair to match that of the other cell (see the sketch after this list)
  5. merge: merge the edited BAM files into one BAM file to reflect a multiplexed experiment
  6. sort: sort the merged BAM file
  7. demux: run demuxlet with the sorted BAM file as input
  8. results: analyze demuxlet performance
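
The exact logic lives in the pipeline's scripts, but the core idea behind the unique_barcodes, simulate, and new_bam steps can be sketched in a few lines of Python. This is only an illustration under assumed data structures (barcodes kept as plain lists per sample); the function names and selection details are hypothetical, not the pipeline's actual code.

import random
from collections import Counter

def unique_barcodes(barcodes_per_sample):
    """Keep only cell barcodes that appear in exactly one sample."""
    counts = Counter(bc for bcs in barcodes_per_sample.values() for bc in bcs)
    return {
        sample: [bc for bc in bcs if counts[bc] == 1]
        for sample, bcs in barcodes_per_sample.items()
    }

def simulate_doublets(barcodes_per_sample, rate=0.3, seed=42):
    """Randomly assign `rate` of all cells to doublets.

    Returns a mapping old_barcode -> new_barcode; for each selected pair, one
    cell's barcode is rewritten to match its partner's, so both cells later
    appear to demuxlet as a single droplet.
    """
    rng = random.Random(seed)
    pool = [bc for bcs in barcodes_per_sample.values() for bc in bcs]
    rng.shuffle(pool)
    n_doublet_cells = int(rate * len(pool))
    n_doublet_cells -= n_doublet_cells % 2  # pairs need an even number of cells
    selected = pool[:n_doublet_cells]
    # adjacent cells in the shuffled pool may come from different samples or,
    # by chance, from the same sample (the two doublet types described above)
    return {bc1: bc2 for bc1, bc2 in zip(selected[::2], selected[1::2])}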

Download

Execute the following command.

git clone https://github.com/zrcjessica/demux.git

Setup

Dependencies

The pipeline is written as a Snakefile which can be executed via Snakemake. We recommend installing version 5.18.0:

conda create -n snakemake -c bioconda -c conda-forge 'snakemake==5.18.0' --no-channel-priority

We highly recommend you install Snakemake via conda like this so that you can use the --use-conda flag when calling snakemake to let it automatically handle all dependencies of the pipeline. Otherwise, you must manually install the dependencies listed in the env files.

Input

demux minimally requires the following inputs, which must be specified in the config.yml file: a BAM file (reads) and a filtered barcodes file (barcodes) for each sample (Cell Ranger output), and a VCF file containing genotypes for all of the samples.

See below for additional input parameters.

It is recommended to symlink your data into the gitignored data/ folder:

ln -s /path/to/your/data data

If you ever need to switch the input to a different dataset, you can just change the symlink path.

Output

demux returns a table summarizing the performance of demuxlet on the simulated data and a plot showing the precision-recall curves.
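
For orientation, here is a hypothetical sketch of how the results step might score demuxlet's calls against the ground-truth table to produce such a curve. The file paths and column names (new_barcode, is_doublet, doublet_posterior) are placeholders, not the pipeline's actual output format.

import pandas as pd
from sklearn.metrics import precision_recall_curve

truth = pd.read_csv("out/barcode_table.tsv", sep="\t")  # ground-truth table from the `table` step (hypothetical path)
calls = pd.read_csv("out/demuxlet.best", sep="\t")      # demuxlet's per-barcode calls (hypothetical path)

merged = truth.merge(calls, left_on="new_barcode", right_on="BARCODE")
# score doublet detection: treat "is a simulated doublet" as the positive class
# and rank cells by a demuxlet doublet score/posterior
y_true = merged["is_doublet"].astype(int)
y_score = merged["doublet_posterior"]
precision, recall, thresholds = precision_recall_curve(y_true, y_score)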

You can also symlink your output, if you think you might want to change it in the future:

ln -s /iblm/netapp/data1/jezhou/Telese_Rat_Amygdala/demultiplex_simulation/out out

Execution

Locally:

./run &

or on a SGE cluster:

qsub run

Executing the pipeline on your own data

You must modify the config.yml file to specify paths to your data. The config file is currently configured to run the pipeline on our data (in the git-ignored data/ folder). The config file contains the following variables:

data*

The data variable contains nested variables for each of your samples, with the paths to their corresponding BAM (reads) and filtered barcodes (barcodes) files (Cell Ranger output) as well as the sample's vcf_id.

vcf*

The path to the VCF file containing genotypes for all samples nested in the data variable.

samples

List the samples from those nested in the data variable that you want to include as input to the demultiplexing simulation. If this variable is not provided or is commented out, all samples from the data variable will be used.

rate

Doublet rate to be used for demultiplexing simulations. Defaults to 0.3.

out

Path to directory in which to write output files. If not provided, defaults to out. The directory will be created if it does not already exist.

* Inputs required
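
Putting these together, a config.yml for two samples might look roughly like the sketch below. Sample names and paths are placeholders; check the config.yml shipped with the repository for the exact key names and formatting.

data:
  sampleA:
    reads: data/sampleA/sampleA.bam
    barcodes: data/sampleA/barcodes.tsv.gz
    vcf_id: sampleA_genotype_id
  sampleB:
    reads: data/sampleB/sampleB.bam
    barcodes: data/sampleB/barcodes.tsv.gz
    vcf_id: sampleB_genotype_id
vcf: data/genotypes.vcf.gz
samples: [sampleA, sampleB]  # optional; defaults to all samples under data
rate: 0.3                    # optional; doublet rate, defaults to 0.3
out: out                     # optional; output directory, defaults to out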

Files and directories

Snakefile: A Snakemake pipeline for running the demultiplexing simulation.

config.yml: Config file that defines options and input for the pipeline.

scripts/: Various scripts used by the pipeline. See the script README for more information.

envs/: The dependencies of our pipeline, specified as conda environment files. These are used by Snakemake to automatically install our dependencies at runtime.

run: An example bash script for executing the pipeline using snakemake and conda. Any arguments to this script are passed directly to snakemake.


Issues

a snakemake pipeline

It'd be great if we could execute this workflow as a Snakemake pipeline, so that we can run it multiple times for different inputs. And parallelize steps. And organize things.

a script for summarizing simulation statistics

We need a script to run at the end of the pipeline after demuxlet. It should calculate summary statistics for the results of the simulation:

  • percent of cells that map to correct animal
    • how many mistakes are there?
  • what is the ideal number of animals to use per experiment?
  • how many doublets get detected correctly as doublets?
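
One hypothetical way to compute the per-animal accuracy and doublet-detection statistics from these, assuming an intermediate table joining the ground truth with demuxlet's calls (all paths and column names here are placeholders):

import pandas as pd

merged = pd.read_csv("out/truth_vs_calls.tsv", sep="\t")  # hypothetical intermediate table

singlets = merged[~merged["is_doublet"]]
pct_correct = (singlets["called_sample"] == singlets["true_sample"]).mean() * 100
n_mistakes = int((singlets["called_sample"] != singlets["true_sample"]).sum())

doublets = merged[merged["is_doublet"]]
pct_doublets_detected = (doublets["called_type"] == "DBL").mean() * 100

print(f"{pct_correct:.1f}% of singlets assigned to the correct animal ({n_mistakes} mistakes)")
print(f"{pct_doublets_detected:.1f}% of simulated doublets called as doublets")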

sample-specific information in each sample's BAM file

new_bam.py should remove sample-specific information from each sample's BAM file
and probably replace it with something universal across all of the samples

What do we need to change?

  1. The RG tags (both in the header and in each read)
  2. The PG tags (both in the header and in each read)
  3. The CO tags (from the header only)
  4. The read IDs?
  5. The header, in its entirety
    Like, we need to somehow prepare the headers for merging
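
A minimal pysam sketch of one way to do this, assuming the goal is to collapse every per-sample read group into a single shared one and drop PG/CO records before merging; new_bam.py may end up handling this differently.

import pysam

def normalize_bam(in_path, out_path, rg_id="merged"):
    inbam = pysam.AlignmentFile(in_path, "rb")
    header = inbam.header.to_dict()
    header["RG"] = [{"ID": rg_id, "SM": rg_id}]  # one shared read group for every sample
    header.pop("PG", None)                       # drop sample-specific program records
    header.pop("CO", None)                       # drop free-text comment lines
    with pysam.AlignmentFile(out_path, "wb", header=header) as outbam:
        for read in inbam:
            if read.has_tag("RG"):
                read.set_tag("RG", rg_id)        # point every read at the shared group
            outbam.write(read)
    inbam.close()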

detecting doublets whose cells came from the same sample

I like to use Github issues as TODO lists. First item of business: conflicting UMIs.

The problem

UMIs are used by demuxlet and appear in the UB tags within the BAM file. When we create doublets, there is the possibility of duplicate UMIs appearing within the same doublet.

Is this a cause for concern?

Probably. demuxlet might discard duplicate UMIs or otherwise treat duplicates differently.
But it's hard to say how prevalent this problem will be until we actually encounter it.

The solution

Not sure about this yet. Ideally, the new_bam.py script that changes the CB tags could change the UMIs in a way that prevents them from conflicting. But I'm not sure how we would do this given our current workflow design: new_bam.py is called once for each sample's BAM file; they aren't considered in tandem.
Another idea would be to merge the BAM files together and then fix the duplicate UMIs after that. That might be fastest, but it would probably involve extra steps, like sorting the merged BAM by UB tag (ugh).
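
A rough sketch of the second idea (scan the merged BAM for conflicts), assuming each read's original cell can still be recovered from some tag. The RG tag is used here as a stand-in for that marker; it only catches conflicts between cells from different samples and would break if the RG tags are made uniform (see the issue above), so this illustrates the bookkeeping rather than the final approach.

import pysam
from collections import defaultdict

def find_conflicting_umis(merged_bam_path):
    """Return (cell barcode, UMI) pairs whose reads come from more than one source."""
    origins = defaultdict(set)
    with pysam.AlignmentFile(merged_bam_path, "rb") as bam:
        for read in bam:
            if read.has_tag("CB") and read.has_tag("UB") and read.has_tag("RG"):
                key = (read.get_tag("CB"), read.get_tag("UB"))
                origins[key].add(read.get_tag("RG"))
    return {key for key, sources in origins.items() if len(sources) > 1}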
