MAP2B


MetAgenomic Profiler based on type IIB restriction site

Why do I need to run MAP2B?

Accurate species identification and abundance estimation are critical for the interpretation of whole metagenome sequencing (WMS) data. Numerous computational methods, broadly referred to as metagenomic profilers, have been developed to identify species in microbiome samples by classifying sequencing reads and quantifying their relative abundances. Yet existing metagenomic profilers typically suffer from false-positive identifications and, consequently, biased relative abundance estimates. Indeed, false positives can account for more than 90% of the total identified species. Here, we present a new metagenomic profiler, MAP2B, to resolve these issues.

For more details, please see "Eliminate false positives in metagenomic profiling based on type IIB restriction sites".

The workflow of MAP2B

Instead of directly estimating the relative abundances of species by aligning reads against whole microbial genomes or marker genes, as existing metagenomic profilers do, we use the following two-round read alignment strategy:

(A) For any input WMS data, 2b tags can be extracted by 2B in silico digestion.
(B) WMS-originated 2b tags will be mapped against a preconstructed unique 2b tag database, which contains ~50,000 identifiable species.
(C) In the first round of read alignment, genome coverage, taxonomic count, sequence count, and G score are calculated for each species.
(D) The four features above will be passed into a preconstructed false positive recognition model.
(E) A high-precision species identification result will be generated.
(F) A sample-dependent unique 2b tag database will be constructed based on the species identification result.
(G) In the second round of read alignment, the taxonomic abundance of each species is estimated.
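Step (A) can be illustrated with a minimal sketch of in silico digestion. It assumes a type IIB enzyme can be modeled as a recognition motif plus a fixed number of flanking bases on each side. The function name `extract_2b_tags`, the toy motif, and the flank length are illustrative placeholders; they are not the CjePI parameters or the code MAP2B actually uses, and a real digestion would also scan the reverse-complement strand.

```python
import re
from typing import List


def extract_2b_tags(sequence: str, motif: str, flank: int) -> List[str]:
    """Return each motif occurrence together with `flank` bases on both sides.

    `motif` is a regular expression describing the recognition site
    (a hypothetical pattern here). Sites too close to either end of the
    sequence to yield a full-length tag are skipped.
    """
    sequence = sequence.upper()
    tags = []
    # A lookahead with a capture group also reports overlapping sites.
    for match in re.finditer(f"(?=({motif}))", sequence):
        start = match.start(1) - flank
        end = match.end(1) + flank
        if start >= 0 and end <= len(sequence):
            tags.append(sequence[start:end])
    return tags


# Toy example: "GA", two arbitrary bases, then "TC", with 2 flanking bases.
toy_tags = extract_2b_tags("TTGAAATCGG", "GA[ACGT]{2}TC", flank=2)
print(toy_tags)
```

Because every tag produced this way has the same length, tags from a sample can be compared directly against a database of species-specific tags, which is the idea behind step (B).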

Figure: The workflow of MAP2B.

Installation

System requirements

Dependencies

All scripts in MAP2B are written in Perl and Python, and we recommend running MAP2B in a conda environment. The program should work properly on Unix systems and macOS, provided that all required packages can be downloaded and installed.

Memory usage

At least 14G of RAM is required to run this pipeline.

Download the pipeline

  • Clone the latest version from GitHub (recommended):

    git clone https://github.com/sunzhengCDNM/MAP2B/
    cd MAP2B

    This makes it easy to update the software in the future using git pull as bugs are fixed and features are added.

  • Alternatively, download the whole GitHub repo directly without installing Git:

    wget https://github.com/sunzhengCDNM/MAP2B/archive/master.zip
    unzip master.zip
    cd MAP2B-master

Install MAP2B in a conda environment

  • Conda installation
    Miniconda provides the conda environment and package manager, and is the recommended way to install MAP2B.

  • Create a conda environment for MAP2B pipeline:
    After installing Miniconda and opening a new terminal, make sure you’re running the latest version of conda:

    conda update conda

    Once you have conda installed, create a conda environment with the yml file config/MAP2B-20230420-conda.yml.

    conda env create -n MAP2B.1.5 --file config/MAP2B-20230420-conda.yml

  • Activate the MAP2B conda environment by running the following command:

    conda activate MAP2B.1.5
    # or, with older conda installations:
    source activate MAP2B.1.5

    Make sure the MAP2B conda environment has been activated with the above command every time before you run MAP2B.

  • The workflow begins by checking whether the database exists. If it is not found, the corresponding database is downloaded automatically to the software installation path. This download may take some time. Alternatively, you can download the GTDB and RefSeq databases yourself using the following commands:

    • for GTDB database

      python3 scripts/DownloadDB.py -l config/GTDB.CjePI.database.list -d database/GTDB

    • for RefSeq database

      python3 scripts/DownloadDB.py -l config/RefSeq.CjePI.database.list -d database/RefSeq

    Now everything is ready for MAP2B :) Let's get started.

Using MAP2B

Quick start

MAP2B is a highly automated pipeline that requires only a few parameters.

  • We provide real paired-end sequencing data from a mock community:

    cd example
    mkdir -p data/
    wget -t 0 -O data/shotgun_MSA-1002_1.fq.gz https://figshare.com/ndownloader/files/38346149/shotgun_MSA-1002_1.fq.gz
    wget -t 0 -O data/shotgun_MSA-1002_2.fq.gz https://figshare.com/ndownloader/files/38346155/shotgun_MSA-1002_2.fq.gz

  • After downloading the sequencing data, we can finally run MAP2B:

    python3 ../bin/MAP2B.py -i data.list

    The file data.list shows how to prepare your input data. Both single-end and paired-end data can be used as input, with one sample per line:

    sample1 <tab> shotgun1_left.fastq(.gz) <tab> shotgun1_right.fastq(.gz)
    sample2 <tab> shotgun2.fastq(.gz)
    sample3 ...
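Before launching a long run, it can help to check that a sample list follows this layout. The sketch below is one possible validator, not part of MAP2B: the function name `parse_sample_list` and the sample ID `MSA-1002` are assumptions made for illustration. It accepts two or three tab-separated fields per line and skips lines beginning with `#`, which the `-i` help text says are ignored.

```python
from typing import List, Tuple


def parse_sample_list(text: str) -> List[Tuple[str, List[str]]]:
    """Parse sample-list text into (sample_id, [sequence files]) pairs."""
    samples = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        # Blank lines and comment lines carry no sample information.
        if not line or line.startswith("#"):
            continue
        fields = [field.strip() for field in line.split("\t")]
        # One sample ID plus one file (single-end) or two files (paired-end).
        if len(fields) not in (2, 3):
            raise ValueError(
                f"line {line_number}: expected 2 or 3 tab-separated fields, "
                f"got {len(fields)}"
            )
        samples.append((fields[0], fields[1:]))
    return samples


example = (
    "# mock community, paired-end\n"
    "MSA-1002\tdata/shotgun_MSA-1002_1.fq.gz\tdata/shotgun_MSA-1002_2.fq.gz\n"
    "sample2\tshotgun2.fastq.gz\n"
)
for sample_id, files in parse_sample_list(example):
    layout = "paired-end" if len(files) == 2 else "single-end"
    print(sample_id, layout)
```

A line separated by spaces instead of tabs raises a `ValueError` that names the offending line, which is a common mistake when the list is edited by hand.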

Parameters

The main program is bin/MAP2B.py in this repo. You can check out the usage by printing the help information via python3 bin/MAP2B.py -h.

usage: MAP2B.py [-h] -i INPUT [-o OUTPUT] [-d DATABASE] [-p PROCESSES]
                [-g GSCORE]

optional arguments:
  -h, --help    show this help message and exit
  -i INPUT      The filepath of the sample list. Each line includes an input sample ID and the file path of corresponding DNA sequence data where each field should be separated by <tab>. The line in this file that begins with # will be ignored. 
                  sample <tab> shotgun.1.fq(.gz) (<tab> shotgun.2.fq.gz)
  -o OUTPUT     Output directory, default ./MAP2B_result
  -s {GTDB,RefSeq}  Data source, choose from GTDB or RefSeq, default GTDB
  -d DATABASE   Database path for MAP2B pipeline, MAP2B_path/database
  -p PROCESSES  Number of processes, note that more threads may require more memory, default 1
  -g GSCORE     Using G score as the threshold for species identification, -g 5 is recommended. Enabling G score will automatically shutdown false positive recognition model, default none

author: Liu Jiang, Zheng Sun
mail: [email protected], [email protected]
last update: 2023/04/20 20:03:47
version:  1.5

  • If you are dealing with low-biomass samples, we recommend using -g 3 or -g 5 to keep as many species as possible. Although false-positive detection remains a challenge for low-biomass samples, keep in mind that the G score ranking is highly relevant to the likelihood that a species is a true positive, so you can set a G score threshold based on your understanding of the samples.
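
The options above can be combined in a single call. As a minimal sketch, the helper below assembles such a command line from Python using only the flags listed in the help text. The function `build_map2b_command` and its keyword names are hypothetical conveniences, not part of MAP2B, and the helper only builds the argument list without running anything.

```python
from typing import List, Optional


def build_map2b_command(
    sample_list: str,
    output: Optional[str] = None,
    source: Optional[str] = None,
    processes: Optional[int] = None,
    gscore: Optional[float] = None,
    script: str = "bin/MAP2B.py",
) -> List[str]:
    """Build an argument list for MAP2B.py from the documented options."""
    command = ["python3", script, "-i", sample_list]
    if output is not None:
        command += ["-o", output]
    if source is not None:
        # The help text lists exactly these two data sources.
        if source not in ("GTDB", "RefSeq"):
            raise ValueError("source must be 'GTDB' or 'RefSeq'")
        command += ["-s", source]
    if processes is not None:
        command += ["-p", str(processes)]
    if gscore is not None:
        # Per the help text, setting -g turns off the false positive
        # recognition model.
        command += ["-g", str(gscore)]
    return command


low_biomass_run = build_map2b_command(
    "data.list", output="MAP2B_result", source="GTDB", processes=4, gscore=5
)
print(" ".join(low_biomass_run))
```

The resulting list can be passed to `subprocess.run(low_biomass_run, check=True)` from the repository root once the MAP2B conda environment is active.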

Reference

Acknowledgement

This work was supported by the National Institutes of Health grant numbers R01AI141529, R01HD093761, RF1AG067744, UH3OD023268, U19AI095219, and U01HL089856, and by the Charles A. King Trust Postdoctoral Fellowship.

What's new

Version 1.5 2023-04-20 by Zheng and Jiang

  • Minor bug fixes

Version 1.4 2023-02-27 by Zheng and Jiang

  • Add optional database: GTDB or RefSeq
  • Minor bug fixes

Version 1.3 2023-01-17 by Zheng and Jiang

  • We have simplified our database and modified the main body program to speed up the execution time
  • MAP2B is laptop friendly now! The minimum RAM required is only 14G and the space for the database is reduced to ~15G
  • Removed the -e option

Version 1.2 2022-12-02 by Zheng and Jiang

  • Minor bug fixes
  • You can set up a G score -g directly and ignore the false positive recognition model
  • Coverage information will be generated together with the abundance table
