mg2sc-docker's Introduction

mg2sc

Analyzing metagenomics of single cell RNA-seq

The aim of the project was to develop a method that extracts unmapped reads from single-cell RNA-seq alignments and uses metagenomic tools to classify their taxonomy at the single-cell level. The classifications are quantified for each transcript and cell, resulting in a count matrix of cell IDs by organisms, where each entry is the number of transcripts assigned to that organism in that cell.

Usage

Developed on Python 3.8.

Required Python packages

  • pysam v0.16.0.1
  • scipy v1.6.2
  • regex v2021.4.4

Required command line packages

Setting up kraken2

In addition to these packages, you'll need to set up a reference database for kraken2. Please see the kraken2 manual for how to do this (https://github.com/DerrickWood/kraken2/wiki/Manual). Alternatively, you can download a pre-built Kraken 2 database (e.g. Standard from 12/2/2020, 36GB) from https://benlangmead.github.io/aws-indexes/k2.

Please ensure that you have enough memory available for reading in the kraken2 database, e.g. for the example database, more than 40 GB of memory is recommended. If you have less memory available, you could go with a smaller reference database.
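As a rough pre-flight check, the on-disk size of the database can be compared with the memory you plan to give the run. The sketch below assumes that the database folder holds kraken2's *.k2d files and that their combined size approximates the memory needed to load them. The function names and the 10% headroom are illustrative choices, not part of mg2sc or kraken2.

```python
from pathlib import Path


def kraken2_db_size_gb(db_path):
    """Sum the sizes of the *.k2d files in a kraken2 database folder, in GB."""
    total_bytes = sum(f.stat().st_size for f in Path(db_path).glob("*.k2d"))
    return total_bytes / 1e9


def fits_in_memory(db_path, available_gb, headroom=1.1):
    """Return True if the database, plus some headroom, fits in available_gb.

    headroom is an illustrative safety factor, not a kraken2 setting.
    """
    return kraken2_db_size_gb(db_path) * headroom <= available_gb
```

For the 36GB example database, fits_in_memory(db_path, 40) would be the kind of check the recommendation above implies.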

Execution

Both files, scMeG-kraken.py and k2sc.py, are required for the full workflow and need to be in the same folder.

command:
python scMeG-kraken.py --input [bamfile, e.g. starsolo/Aligned.sortedByCoord.out.bam] --outdir [output directory] --DBpath [path to kraken database] --threads [#, e.g. 8] --prefix [preferred file prefix] --verbosity [error/warning/info/debug]

successful run:

  • "Sparse matrix with single cell information created"
  • "Run finished."

Interpretation of results

The pipeline produces an output folder in cellranger2-style format, which can be easily imported for downstream analysis. The observation space is the complete list of barcodes with reads identified by the pipeline, and the feature space is the complete list of organisms (and other elements of the kraken2 hierarchy) that were identified.
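A minimal sketch of reading such a folder with scipy is shown here. It assumes the usual cellranger v2 file names (matrix.mtx, barcodes.tsv, genes.tsv) and the cellranger orientation of features as rows and barcodes as columns; check the actual output folder, since the names written by the pipeline may differ. load_counts is a hypothetical helper, not part of mg2sc.

```python
import csv
from pathlib import Path

from scipy.io import mmread


def load_counts(folder):
    """Load a cellranger2-style folder into (matrix, barcodes, features).

    The three file names below are assumed, not confirmed by the pipeline.
    """
    folder = Path(folder)
    # Matrix Market file; assumed to hold features as rows, barcodes as columns
    matrix = mmread(str(folder / "matrix.mtx")).tocsr()
    with open(folder / "barcodes.tsv") as handle:
        barcodes = [line.strip() for line in handle if line.strip()]
    with open(folder / "genes.tsv") as handle:
        # each row is a list of tab-separated fields describing one feature
        features = [row for row in csv.reader(handle, delimiter="\t") if row]
    return matrix, barcodes, features
```

Transposing the returned matrix (matrix.T) gives the cells-by-organisms layout that most single-cell tools expect.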

It can be useful to collapse this count matrix to more general categories, such as family or genus. You can use src/collapse_taxonomy.py for that, with a quick demo shown in a notebook.
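The idea of collapsing can be illustrated with a small sketch. This is not the code in src/collapse_taxonomy.py; it assumes you already have a cells-by-taxa sparse matrix and one group label (for example a genus name) per column, and collapse_columns is a hypothetical helper that sums the columns sharing a label.

```python
import numpy as np
from scipy.sparse import csr_matrix


def collapse_columns(counts, labels):
    """Sum the columns of a (cells x taxa) matrix that share a label.

    counts : scipy sparse matrix with one column per taxon
    labels : list with one group label (e.g. genus name) per column
    Returns the collapsed matrix and the sorted list of group labels.
    """
    groups = sorted(set(labels))
    group_index = {group: i for i, group in enumerate(groups)}
    rows = np.arange(len(labels))
    cols = np.array([group_index[label] for label in labels])
    # indicator[t, g] is 1 when taxon t belongs to group g
    indicator = csr_matrix(
        (np.ones(len(labels), dtype=counts.dtype), (rows, cols)),
        shape=(len(labels), len(groups)),
    )
    # multiplying by the indicator adds up all columns within each group
    return counts @ indicator, groups
```

How the per-column labels are derived from the kraken2 hierarchy is left open here; the repository script and notebook are the reference for that step.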

Background

The first step of the workflow is file preparation. The user input most importantly includes the read alignment file and the corresponding reference database. From the alignment file, the unaligned reads are extracted and converted into a FASTQ file.
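The extraction step can be sketched with pysam as follows. This is an illustration under stated assumptions, not the code in scMeG-kraken.py: the function names are hypothetical, and reads without a stored sequence or qualities are simply skipped.

```python
def fastq_record(name, sequence, quality):
    """Format one read as a four-line FASTQ record."""
    return f"@{name}\n{sequence}\n+\n{quality}\n"


def write_unmapped_fastq(bam_path, fastq_path):
    """Write every unmapped read in bam_path to fastq_path; return the count."""
    import pysam  # imported here so fastq_record can be used without pysam

    written = 0
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        with open(fastq_path, "w") as out:
            # until_eof=True iterates over all reads, including unmapped ones
            for read in bam.fetch(until_eof=True):
                if not read.is_unmapped:
                    continue
                if read.query_sequence is None or read.query_qualities is None:
                    continue
                # convert numeric Phred scores to the FASTQ quality string
                quality = pysam.qualities_to_qualitystring(read.query_qualities)
                out.write(fastq_record(read.query_name, read.query_sequence, quality))
                written += 1
    return written
```

The read name is kept as the FASTQ identifier so that the classifier output can later be matched back to the reads in the alignment file.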

The FASTQ file serves as input for the metagenomic analysis. The output, sequencing read IDs with assigned taxonomy, can be combined with the cell barcode and transcript UMI from the alignment file with unmapped reads to assign each cell and each transcript a taxonomy. If a transcript has different taxonomies assigned, the taxonomy ID with the highest read count is chosen.
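The per-transcript vote described above can be sketched in plain Python. The input is assumed to be one (cell barcode, UMI, taxonomy ID) tuple per classified read; in STARsolo or Cell Ranger BAM files the barcode and UMI are typically stored in the CB and UB tags, but that is an assumption about your input. assign_transcript_taxonomy is a hypothetical name, and how ties are resolved is a choice of this sketch, not something the source specifies.

```python
from collections import Counter, defaultdict


def assign_transcript_taxonomy(reads):
    """Pick one taxonomy ID per (cell barcode, UMI) by majority vote.

    reads : iterable of (cell_barcode, umi, taxonomy_id) tuples, one per read
    """
    votes = defaultdict(Counter)
    for barcode, umi, taxid in reads:
        votes[(barcode, umi)][taxid] += 1
    # most_common(1) returns the taxonomy ID with the highest read count;
    # ties go to the ID seen first, an arbitrary choice in this sketch
    return {key: counter.most_common(1)[0][0] for key, counter in votes.items()}
```

For example, a transcript with two reads classified as taxonomy ID 562 and one as 9606 is assigned 562.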

Finally, the results can be exported as a sparse matrix, which facilitates the integration with the differential gene expression data and cell type annotation into AnnData objects. For Kraken2, the read is assigned to the lowest common ancestor that it maps to. Therefore, the sparse matrix contains the lowest mapped taxonomic level.
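A minimal sketch of building such a matrix with scipy is given below. It assumes a dictionary that maps each (cell barcode, UMI) pair to its chosen taxonomy ID and counts transcripts per cell and taxonomy ID. The cells-as-rows orientation and the name build_count_matrix are choices made for this illustration, not the layout the pipeline necessarily writes.

```python
from collections import Counter

from scipy.sparse import coo_matrix


def build_count_matrix(transcript_taxonomy):
    """Count transcripts per (cell, taxonomy ID) and return a sparse matrix.

    transcript_taxonomy : dict mapping (cell_barcode, umi) -> taxonomy_id
    Returns (matrix, barcodes, taxids) with cells as rows, taxa as columns.
    """
    pair_counts = Counter(
        (barcode, taxid) for (barcode, _umi), taxid in transcript_taxonomy.items()
    )
    barcodes = sorted({barcode for barcode, _ in pair_counts})
    taxids = sorted({taxid for _, taxid in pair_counts})
    row_of = {barcode: i for i, barcode in enumerate(barcodes)}
    col_of = {taxid: j for j, taxid in enumerate(taxids)}
    # one matrix entry per distinct (cell, taxonomy ID) pair
    rows = [row_of[barcode] for barcode, _ in pair_counts]
    cols = [col_of[taxid] for _, taxid in pair_counts]
    data = list(pair_counts.values())
    matrix = coo_matrix((data, (rows, cols)), shape=(len(barcodes), len(taxids)))
    return matrix.tocsr(), barcodes, taxids
```

Because each transcript carries only its lowest assigned taxonomic level, counts for a genus or family column do not include the transcripts assigned to species below it.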


mg2sc-docker's People

Contributors

sinanugur, julianeweller, ktpolanski
