mimic2's Introduction

What is MiMiC2?

MiMiC2 is a bioinformatic pipeline that selects a small set of microbial genomes, termed a synthetic community (SynCom), that functionally represents an entire ecosystem.

If you want to install MiMiC2, go to the installation instructions.

If you want help running MiMiC2, go to the running MiMiC2 guide.

If you want to understand all the options, go to the options list.

If you want premade datasets, look at our datasets provided section.

MiMiC2 workflow

The general process can be seen in the workflow:

MiMiC2-workflow

Guide to run MiMiC2

MiMiC2 consists of a few major steps, but before they can begin you must prepare your data.

Data Preparation

You must use the MiMiC2-BUTLER.py script to convert a folder containing the Pfam annotations of multiple genomes or samples into a single Pfam profile file for the entire dataset.

MiMiC2-BUTLER.py needs a few pieces of information to run, detailed under the options list. We have provided example data to help you understand the process. The HiBC example can be run using the code below:

MiMiC2-BUTLER.py -s datasets/isolate_collections/HiBC/Pfams/ -p datasets/core/Pfam-A.clans.tsv -t hmmscan -e .hmmer -o HiBC-0.6-profile.txt
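
Before running the command above, it can be worth confirming that the folder passed to -s really contains files ending in the extension passed to -e. The sketch below is a hypothetical pre-flight helper and is not part of MiMiC2; the function name list_annotation_files is illustrative, and it assumes one annotation file per genome or sample stored directly inside the folder.

```python
from pathlib import Path


def list_annotation_files(folder, extension):
    """Return sorted annotation files in `folder` whose names end with `extension`.

    Hypothetical helper (not part of MiMiC2): assumes one annotation file
    per genome/sample, stored directly in `folder` with no sub-folders.
    """
    root = Path(folder)
    if not root.is_dir():
        raise NotADirectoryError(f"{folder} is not a directory")
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.name.endswith(extension)
    )
```

For the HiBC example, list_annotation_files("datasets/isolate_collections/HiBC/Pfams/", ".hmmer") should return one path per annotated genome; an empty list suggests the -e value does not match the files on disk.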

Running MiMiC2

Once you have a Pfam profile for both your environmental samples and your genome collection, you can run MiMiC2. The options for running MiMiC2 are detailed in the options list section.

MiMiC2.py -g /PATH/TO/GENOME-COLLECTION -t /PATH/TO/TAXONOMIC-FILE --taxonomiclevel s -s /PATH/TO/SAMPLES -m /PATH/TO/METADATA --group GROUP --models /PATH/TO/MODELS/FOLDER -c 10 -o OUTPUT-PREFIX
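
When the same run has to be repeated for several groups or SynCom sizes, the command can be assembled programmatically. The following is a minimal sketch of a hypothetical wrapper, not part of MiMiC2: build_mimic2_command is an illustrative name, it uses only the flags shown in the command above, and the accepted level codes are taken from the --taxonomiclevel entry in the options list.

```python
import shlex

# Level codes as listed for --taxonomiclevel in the MiMiC2.py options.
TAXONOMIC_LEVELS = {"s", "g", "c", "o", "p"}


def build_mimic2_command(genomes, taxonomy, samples, metadata, group,
                         models, output, level="s", size=10):
    """Assemble the MiMiC2.py argument list (hypothetical wrapper)."""
    if level not in TAXONOMIC_LEVELS:
        raise ValueError(f"unknown taxonomic level: {level!r}")
    if size < 1:
        raise ValueError("SynCom size must be a positive integer")
    return [
        "MiMiC2.py",
        "-g", genomes,
        "-t", taxonomy,
        "--taxonomiclevel", level,
        "-s", samples,
        "-m", metadata,
        "--group", group,
        "--models", models,
        "-c", str(size),
        "-o", output,
    ]


# The placeholder paths below are the same ones used in the command above.
cmd = build_mimic2_command(
    genomes="/PATH/TO/GENOME-COLLECTION",
    taxonomy="/PATH/TO/TAXONOMIC-FILE",
    samples="/PATH/TO/SAMPLES",
    metadata="/PATH/TO/METADATA",
    group="GROUP",
    models="/PATH/TO/MODELS/FOLDER",
    output="OUTPUT-PREFIX",
)
print(shlex.join(cmd))
```

The returned list can be handed to subprocess.run(cmd, check=True) once the placeholders are replaced with real paths. Building a list rather than a single string avoids quoting problems when paths contain spaces.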

Example Run of MiMiC2

As an example of how MiMiC2 can be used, we will repeat the creation of the IBD SynCom from the MiMiC2 paper.

Run the code below, which uses the HiBC collection and the IBDMDB sample collection, with genome-scale metabolic models (GEMs) made with GapSeq:

MiMiC2.py -g datasets/isolate_collections/HiBC/HiBC_profile.txt -t /PATH/TO/TAXONOMIC-FILE --taxonomiclevel s -s /PATH/TO/SAMPLES -m /PATH/TO/METADATA --group GROUP --models /PATH/TO/MODELS/FOLDER -c 10 -o Test-IBD-SynCom

The output can be found under ./Test-IBD-SynCom/.

Installation Instructions

  1. Clone the repository.
git clone https://github.com/thh32/MiMiC2.git
  2. Enter the MiMiC2 folder.
cd MiMiC2
  3. Create the environment with mamba.
mamba create --no-channel-priority -n mimic2 \
    -c bioconda -c conda-forge \
    "python=3.11" "numpy=1.24.3" "scipy=1.10.1" \
    "conda-forge::matplotlib-base" "seaborn=0.13.0" \
    "pandas=1.5.3" "tqdm"
  4. Activate the environment.
mamba activate mimic2
  5. Install glpk and the required R packages.

Warning

Sudo access is required for this step. If you do not have sudo access, please talk to your system administrator.

Install glpk-dev:

sudo apt-get -y install libglpk-dev

Next, install the R packages required for metabolic modelling:

Rscript -e 'remotes::install_github("SysBioChalmers/sybil")'
Rscript -e 'remotes::install_github("euba/bacarena")'
  6. Add the MiMiC2 folder to your PATH in ~/.bashrc and apply the changes.
  • Add to .bashrc: export PATH="/PATH/TO/MiMiC2:$PATH"
  • Enter in terminal:
source ~/.bashrc
  • Reactivate the mamba environment:
mamba activate mimic2
  7. Run MiMiC2 on your chosen genome collection and metagenomic samples. Basic usage is shown in the Running MiMiC2 section.
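
Before the first run, one way to confirm that the Python side of the environment is usable is to check that the packages requested in the mamba command can be found. This is a minimal sketch, not part of MiMiC2: missing_modules is a hypothetical helper, the module list is taken from the install command in step 3 (with the progress-bar package under its import name, tqdm), and it does not check glpk or the R packages.

```python
from importlib.util import find_spec

# Import names for the Python packages requested in the mamba command.
# Assumption: the progress-bar package is tqdm.
REQUIRED_MODULES = ["numpy", "scipy", "matplotlib", "seaborn", "pandas", "tqdm"]


def missing_modules(names):
    """Return the subset of `names` that cannot be located for import."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("missing:", ", ".join(missing) if missing else "none")
```

Run it inside the activated mimic2 environment; any names it prints were not installed into that environment.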

Options

MiMiC2-BUTLER.py options

  -h, --help            show this help message and exit
  -s {INPUT}, --samples {INPUT}
                        Provide a folder which contains all of your Pfam
                        annotated genomes/metagenomes.
  -p {INPUT}, --pfam {INPUT}
                        Pfam file e.g. Pfam-A.clans.csv, provided for Pfam v32
                        in `datasets/core/`
  -t {TEXT}, --tool {TEXT}
                        State the tool used to annotate the genomes against the
                        Pfam database: `hmmsearch` or `hmmscan`
  -o {OUTPUT}, --output {OUTPUT}
                        Prefix for the Pfam-profile file e.g. HuSynCom.
  -e {TEXT}, --extension {TEXT}
                        Provide the extension for your Pfam annotation files.

MiMiC2.py options

  -h, --help            show this help message and exit
  -s {INPUT}, --samples {INPUT}
                        Pfam vector file of all metagenomic samples to be
                        studied.
  -m {INPUT}, --metadata {INPUT}
                        Metadata file detailing the group assignment of each
                        sample.
  --group {INPUT}       Name of the group of interest for SynCom creation.
  -p {INPUT}, --pfam {INPUT}
                        Pfam file e.g. Pfam-A.clans.csv, provided for Pfam v32
                        in `datasets/core/`
  -g {INPUT}, --genomes {INPUT}
                        Pfam vector file of genome collection.
  --models {INPUT}      Folder containing metabolic models for each genome.
                        Must be provided as RDS files such as those provided
                        by GapSeq.
  -t {INPUT}, --taxonomy {INPUT}
                        Taxonomic assignment of each genome.
  --taxonomiclevel {INPUT}
                        Taxonomic level for filtering (species = s, genus = g,
                        class = c, order = o, phylum = p).
  -c {INT}, --consortiasize {INT}
                        Define the SynCom size the user is after.
  --corebias {FLOAT}    The additional weighting provided to Pfams core to the
                        studied group (default = 0.0005).
  --groupbias {FLOAT}   The additional weighting provided to Pfams
                        significantly enriched in the studied group (default =
                        0.0012).
  --prevfilt {FLOAT}    Prevalence filtering threshold for shortlisting
                        genomes for inclusion in the final SynCom selection
                        (default = 33.3).
  -o {OUTPUT}, --output {OUTPUT}
                        Prefix for all output files e.g. HuSynCom.
  --iterations {INT}    Change the number of iterations to select sample
                        specific strains in step 1.
  --exclusion {INPUT}   Provide file which includes a csv list of genome names
                        to be excluded during SynCom selection.
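
The help text for --prevfilt does not spell out how the threshold is applied. The sketch below illustrates one plausible reading, assuming the value is the minimum percentage of samples in which a genome must have been selected during step 1 to be shortlisted. It is an illustration of prevalence filtering in general, not MiMiC2's actual implementation; the function name, the input structure, and the use of an inclusive threshold are all assumptions.

```python
def prevalence_filter(selections, threshold=33.3):
    """Keep genomes selected in at least `threshold` percent of samples.

    `selections` maps a sample name to the set of genome names chosen
    for that sample. Illustrative only; not MiMiC2's actual code.
    Returns a dict of genome name -> prevalence (percent of samples).
    """
    if not selections:
        return {}
    n_samples = len(selections)
    counts = {}
    for genomes in selections.values():
        # set() guards against a genome being listed twice for one sample.
        for genome in set(genomes):
            counts[genome] = counts.get(genome, 0) + 1
    prevalence = {g: 100.0 * c / n_samples for g, c in counts.items()}
    return {g: p for g, p in prevalence.items() if p >= threshold}


# Hypothetical per-sample selections for three samples.
example = {
    "sample1": {"genomeA", "genomeB"},
    "sample2": {"genomeA"},
    "sample3": {"genomeA", "genomeC"},
}
shortlist = prevalence_filter(example, threshold=50.0)
```

With these made-up selections, only genomeA is present in at least half of the samples, so it is the only genome left in shortlist. Under this reading, raising the threshold shrinks the pool of candidate genomes, and lowering it keeps more sample-specific genomes in play.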

Datasets Provided

Genome Collection Datasets

In the datasets/isolate_collections/ folder we provide preprocessed input files for publicly available isolate collections, allowing any SynCom designed from them to be constructed experimentally.

We have also processed two large collections that include MAGs of uncultured taxa, provided in datasets/mag_collections/. Because these taxa have not been cultured, SynComs designed from these collections cannot be used experimentally. They are provided to allow initial study of SynComs prior to isolation, so that researchers can target isolation efforts towards microbes of particular interest.

For each collection, the Pfam profile and taxonomic assignment files are provided, allowing direct use without further processing.

| Collection Name | Environment | Number of Genomes | Isolates or Genomes |
|---|---|---|---|
| PiBAC | Pig gut | 117 | Isolates |
| HiBC | Human gut | 229 | Isolates |
| miBCII | Mouse gut | 211 | Isolates |
| Hungate1000 | Rumen | 410 | Isolates |
| GTDB r202 | N/A | 47,893 | Genomes |
| Pasolli et al, 2019 | Human microbiome | 4,930 | Genomes |

Environmental Datasets

In the Datasets/Environmental folder we provide Pfam profiles for all samples studied within the MiMiC2 paper.

| Study | Environment | Condition |
|---|---|---|
| Shabat et al, 2016 | Bovine rumen | N/A |
| Wylensek et al, 2020 | Pig gut | N/A |
| Lesker et al, 2020 | Mouse gut | N/A |
| Lloyd-Price et al, 2019 | Human gut | Ulcerative colitis vs nonIBD |
