
cellSNP


cellSNP piles up the expressed alleles in single-cell or bulk RNA-seq data. The output can be used directly for donor deconvolution in multiplexed single-cell RNA-seq data, particularly with vireo, which assigns cells to donors and detects doublets, even without a genotyping reference.

cellSNP depends heavily on pysam, a Python interface for samtools and bcftools. It should give results very similar to samtools/bcftools mpileup, with two major differences from bcftools mpileup:

  1. cellSNP can pile up either the whole genome or a list of positions, and splits the reads directly by a list of cell barcodes, e.g., for 10x Genomics data. With bcftools, you may need to manipulate the RG tag in the BAM file if you want to divide reads into cell barcode groups.
  2. cellSNP applies only simple filters when outputting SNPs, i.e., total UMIs or counts and minor allele fraction. The idea is to keep most of the SNP information so that the downstream statistical model can make full use of it.
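
As a rough illustration of the simple filtering in point 2, the sketch below applies a total-count threshold and a minor allele fraction threshold to the per-base counts at one site. It is not cellSNP's actual code: the function name passes_filter is hypothetical, and treating the second most frequent base as the minor allele is an assumption made for this example. The default thresholds mirror the --minCOUNT 20 and --minMAF 0.1 values suggested for mode 1.

```python
def passes_filter(allele_counts, min_count=20, min_maf=0.1):
    """Return True if a site passes both simple thresholds.

    allele_counts maps a base (e.g. "A") to its UMI or read count.
    Assumption: the minor allele is the second most frequent base.
    """
    total = sum(allele_counts.values())
    # Reject sites with too little coverage (this also covers total == 0).
    if total == 0 or total < min_count:
        return False
    ranked = sorted(allele_counts.values(), reverse=True)
    minor = ranked[1] if len(ranked) > 1 else 0
    return minor / total >= min_maf


if __name__ == "__main__":
    print(passes_filter({"A": 30, "G": 10}))  # enough counts, minor fraction 0.25
    print(passes_filter({"A": 95, "G": 5}))   # minor fraction 0.05 is below 0.1
```

Because only these two thresholds are applied, sites with uncertain genotypes are kept and it is left to the downstream model to weigh them.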

Installation

cellSNP is available through PyPI. To install it, run the following command (add -U to upgrade):

pip install cellSNP

Alternatively, you can download or clone this repository and run python setup.py install. In either case, add --user if you don't have root permission or write permission for your Python environment.

From v0.1.0, cellSNP requires pysam>=0.15.2, so make sure you are using a suitable version of pysam. If needed, run pip uninstall pysam and then reinstall with pip install -U pysam.
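
To check whether an installed pysam meets the pysam>=0.15.2 requirement, a small sketch like the following may help. The helper names version_tuple and meets_minimum are hypothetical and not part of cellSNP or pysam; the only pysam attribute used is pysam.__version__. Versions are compared as tuples of integers, since comparing them as strings would rank "0.9.1" above "0.15.2".

```python
def version_tuple(version):
    """Turn '0.15.2' into (0, 15, 2); parsing stops at the first non-numeric piece."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if digits == "":
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum="0.15.2"):
    """Compare two version strings numerically, component by component."""
    return version_tuple(installed) >= version_tuple(minimum)


if __name__ == "__main__":
    try:
        import pysam
        print(pysam.__version__, meets_minimum(pysam.__version__))
    except ImportError:
        print("pysam is not installed")
```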

Quick usage

Note 1: cellSNP now supports saving data as sparse matrices. When genotyping at the single-cell level (mode 1 or 2), please use -O OUT_DIR instead of -o OUT_FILE.vcf.gz, though the latter is still supported.

Note 2: by default, cellSNP counts UMIs instead of reads. If your BAM file doesn't have UMIs, please add --UMItag None.

Once installed, check all arguments by typing cellSNP -h (see a snapshot). There are three modes of cellSNP:

  • Mode 1: pileup a list of SNPs for single cells in a big BAM/SAM file

Requires: a single BAM/SAM file, e.g., from cellranger, and a VCF file with a list of common SNPs. This mode is recommended over mode 2 if a list of common SNPs is known, e.g., for human (see List of candidate SNPs below).

cellSNP -s $BAM -b $BARCODE -O $OUT_DIR -R $REGION_VCF -p 20

We recommend filtering out SNPs with <20 UMIs or <10% minor allele fraction for downstream donor deconvolution, by adding --minMAF 0.1 --minCOUNT 20

  • Mode 2: pileup the whole genome for single cells in a big BAM/SAM file

cellSNP -s $BAM -b $BARCODE -O $OUT_DIR -p 22

We recommend filtering out SNPs with <100 UMIs or <10% minor allele fraction to save space and speed up inference when piling up the whole genome: --minMAF 0.1 --minCOUNT 100

Note that this mode may output false positive SNPs, for example somatic variants or false calls caused by RNA editing. These false SNPs are probably not consistent across all cells within one individual and hence confound the demultiplexing. Nevertheless, for species without a good list of common SNPs, e.g., zebrafish, this strategy is still worth trying, and it does not take much more time than mode 1.

  • Mode 3: pileup a list of SNPs for one or multiple bulk BAM/SAM files

Requires: one or multiple BAM/SAM files, their corresponding sample ids, and a VCF file with a list of common SNPs.

cellSNP -s $BAM1,$BAM2,$BAM3 -I sample_id1,sample_id2,sample_id3 -o $OUT_FILE -R $REGION_VCF -p 20

Set filtering thresholds according to the downstream analysis.
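
When many bulk samples are involved, the comma-separated -s and -I arguments for mode 3 can be assembled programmatically. The sketch below is only an illustration: mode3_command is a hypothetical helper, the file names are placeholders, and the requirement that the two lists have the same length is an assumption inferred from the pairing of BAM files with sample ids. It uses only the flags shown in the mode 3 command above.

```python
import shlex


def mode3_command(bam_files, sample_ids, out_file, region_vcf, nproc=20):
    """Build the mode 3 argument list from paired BAM files and sample ids."""
    # Assumption: each BAM file is paired with exactly one sample id.
    if len(bam_files) != len(sample_ids):
        raise ValueError("need exactly one sample id per BAM file")
    return [
        "cellSNP",
        "-s", ",".join(bam_files),
        "-I", ",".join(sample_ids),
        "-o", out_file,
        "-R", region_vcf,
        "-p", str(nproc),
    ]


if __name__ == "__main__":
    # Placeholder file names, for illustration only.
    cmd = mode3_command(
        ["bulk1.bam", "bulk2.bam"],
        ["sample_id1", "sample_id2"],
        "bulk.vcf.gz",
        "common_snps.vcf.gz",
    )
    print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) to launch the pileup without going through a shell.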

List of candidate SNPs

A high-quality list of candidate SNPs (usually common SNPs) is important for mode 1 and mode 3. If a list of genotyped SNPs is available, it can be used for the pileup. Alternatively, for human, common SNPs in the population that have been identified by consortia can also be very good candidates, e.g., gnomAD and the 1000 Genomes Project. For the latter, we have compiled a list of 37 million common variants with this bash script and stored it in this folder.

In case you want to lift over SNP positions in a VCF file from one genome build to another, see our LiftOver_vcf wrapper function.

FAQ and releases

For troubleshooting, please have a look at FAQ.rst; we welcome reports of any issues.

All releases are available on PyPI. Notes for each release are recorded in release.rst.
