Code Monkey home page Code Monkey logo

amromics's Introduction

AMRomics

Welcome to AMRomics

AMRomics is a software package to analyze microbial genomics data collections. It provides a pipeline that bundles the current best practice for multiple aspects of AMR analyses. The pipeline analysis results can be represented and visualized via a web application. The web application also provides efficient data management.

AMRomics is written in python, it includes the followings dependencies:

  • blast (known to work with 2.10.1+)
  • samtools (1.11)
  • trimmomatic (0.39)
  • spades (3.15.2)
  • shovill (1.1.0)
  • flye (2.9)
  • prokka (1.14.6)
  • mlst (2.19.6)
  • abricate (1.0.1 | Database: vfdb ecoli_vf ecoh card megares resfinder argannot ncbi plasmidfinder)
  • roary (3.13.0)
  • iqtree (2.1.2)
  • fasttree

Installation

Conda

The simplest method is installed via conda:

  1. Make sure a conda version is installed in the computer. If not done already, download and install the appropriate conda, such as anaconda from https://repo.anaconda.com/archive/

  2. Create a conda environment with all the necessary dependencies: From the repository directory run

git clone --recursive https://github.com/amromics/amromics.git
cd amromics
conda create -y -c conda-forge -c defaults --name amromics python=3.10 mamba
source activate amromics
mamba install -y -c conda-forge -c bioconda -c anaconda -c etetoolkit -c rpetit3 -c defaults --file requirements.txt
pip install panta
  1. Setup MLST database: AMRomics requires a copy of pubMLST database set up on the folder that AMRomics pipeline is run from. We make available the database in the accompanied file db.tar.gz updated on Feb 1, 2024. Just unzip the tarball
tar zxvf db.tar.gz

Alternatively, you can update the latest database by running the following command line. Note the running make take some time depending on the bandwidth network and the responsiveness of pubMLST server

./amr-analysis.py download_db

Docker

The pipeline can be installed via Docker as well.

git clone --recursive https://github.com/amromics/amromics.git
cd amromics
docker built -t amromics .

The working directory from the container is /tmp/amromics, user can run amromics commands by mounting the host working directory (where the git cloned into, e.g. ~/workspace/amromics) into this destination (by using -v). For example, if user want to update the latest database:

chmod 777 ~/workspace/amromics
docker run -v ~/workspace/amromics/:/tmp/amromics/ amromics amr-analysis.py download_db

Note that from the container working directory, a db is already available by unzipping the file db.tar.gz before hand.

Usage

Input preparation

AMRomics takes in as input a collection of bacterial samples in various data types: sequencing reads (fastq format), assembly (fasta format) or assembly with annotations (gff3 where annotations are followed by assembly in fasta). The list of samples in the collection is summarized in a tsv file with the following column headers:

  • sample_id: An unique ID for each sample. sample_id has to be one word i.e., no space and contains only alphanumerical, _, and - characters
  • sample_name: A description of the sample
  • input_type: can take in one of gff (for annotations in gff3 format, expecting the assembly at the end of the gff file as in the input from prokka), asm, assembly (for assembly in fasta format), Illumina (for Illumina sequencing reads in fastq format), pacbio-raw, pacbio-hifi, pacbio-corr (for Pacbio sequencing reads in fastq format), nano-raw, nano-hq, nano-corr (for Nanopore sequencing reads in fastq format).
  • files: path to the input data file(s). For Illumina input type, two fastq files can be supplied if paired-end sequencing is used, and they are separated by a semicolon. For all other input type, only one input file is expected.
  • trim: if the value is True or yes, the read data will be trimmed - only applied for Illumina sequencing reads
  • genus: the Genus of the sample, eg Escherichia
  • species: the species of the sample, eg. coli
  • gsize: an estimate of the genome size, in number, eg 5000000, needed only for subsample Illumina sequencing data to 100x
  • metadata: any relevant information for the sample, in the format key1:value1; key2:value2. For example: Geographic Location:Houston,USA;Insert Date:8/8/2017;Host Name:Human, Homo sapiens;ampicillin:Resistant;aztreonam:Resistant;ciprofloxacin:Resistant;gentamicin:Susceptible;tetracycline:Susceptible

We provide several examples in the folder examples - See below.

AMRomics pipeline can be invoked with one command line ./amr-analysis.py. Its usage is as follows:

./amr-analysis.py  pg -h
usage: amromics pg [-h] [-t THREADS] [-m MEMORY] -c COLLECTION_ID [-n COLLECTION_NAME] -i INPUT [--work-dir WORK_DIR] [--time-log TIME_LOG]
                   [--method METHOD] [--genetree GENETREE] [--progressive PROGRESSIVE] [--tree TREE] [--overwrite OVERWRITE] [--initdb {True,False}]

Pan-genome analysis of a collection

options:
  -h, --help            show this help message and exit
  -t THREADS, --threads THREADS
                        Number of threads to use, 0 for all (default: 0)
  -m MEMORY, --memory MEMORY
                        Amount of memory in Gb to use (default: 30)
  -c COLLECTION_ID, --collection-id COLLECTION_ID
                        Collection ID (default: None)
  -n COLLECTION_NAME, --collection-name COLLECTION_NAME
                        Collection name (default: )
  -i INPUT, --input INPUT
                        Input file (default: None)
  --work-dir WORK_DIR   Working directory (default: data/work)
  --time-log TIME_LOG   Time log file (default: None)
  --method METHOD       panta or roary (default: panta)
  --genetree GENETREE   Run phylogenty for each gene cluster or not (default: False)
  --progressive PROGRESSIVE
                        Run pangenome in progressive mode (default: False)
  --tree TREE           fasttree or iqtree (default: fasttree)
  --overwrite OVERWRITE
                        Force overwrite exist results (default: False)
  --initdb {True,False}
                        Init full database (default: False)

Examples

We prepare several collections from public data. To download the raw data for these collections

To download the miniature dataset that consists of 5 samples: 1 in sequence assembly, 2 with Illumina sequencing reads, 1 with Nanopore and 1 with Pacbio sequencing reads.

cd examples/Kp24/raw
./download_kp4.sh
cd ../../

To download the 24 sample collection

cd examples/Kp24/raw
./download_kp24.sh
cd ../../

To download the 90 sample collection

cd examples/Kp24/raw
./download_kp24.sh
cd ../../

The following command will run that 24 samples through the pipeline, and import the results to the web-app for visualization:

./amr-analysis.py pg --time-log k24_time.log  -t 7 -m 25 -c KpClinicalGRBZ -i examples/Kp24/Kp24.tsv --work-dir data/work  -n "Collection of 24 clinical isolates from Greek and Brazil"

or if using docker image amromics built as above:

docker run -v ~/workspace/amromics/examples/Kp24:/tmp/amromics/examples/Kp24 amromics amr-analysis.py pg --time-log examples/Kp24/k24_time.log  -t 16 -m 25 -c KpClinicalGRBZ -i examples/Kp24/config_Kp24.tsv --work-dir examples/Kp24/data/work  -n "Collection of 24 clinical isolates from Greek and Brazil"

Run with progressive mode:

./amr-analysis.py pg --time-log k24_time.log  -t 7 -m 25 -c KpClinicalGRBZ --progressive True -i examples/Kp89/Kp89.tsv --work-dir data/work  -n "Collection of 24+89 clinical isolates from Greek and Brazil"

Output

Output from the pipeline is generated under the directory specified by --work-dir. The results include 2 output sub-folders corresponding to 2 stages of the pipeline: samples/ for isolate analysis of each individual sample and collections/ for pan-genome analysis results of the whole collection.

work-dir/
├── samples/
│   ├── sample1/
│   ├── sample2/
│   ├── ...
├── collections/<col_name>/
│   ├── alignments/
│   ├── pangenome/
│   ├── phylogeny/
│   ├── VCFs/
│   ├── sample_set.json

Individual samples

Results for each sample are written in its dedicated sub-folder, e.g. sample1/ as below. It stores final output from various modules, such as assembly, MLST, annotation, resistome/virolome detection using different databases:

sample1/
├── sample1_assembly.fasta
├── sample1_dump.json
├── sample1.faa
├── sample1.ffn
├── sample1.gff
├── sample1_mlst.tsv
├── sample1_plasmid.tsv
├── sample1_resistome_argannot.tsv
├── sample1_resistome_card.tsv
├── sample1_resistome_ecoh.tsv
├── sample1_resistome_megares.tsv
├── sample1_resistome_ncbi.tsv
├── sample1_resistome_resfinder.tsv
├── sample1_resistome.tsv
├── sample1_virulome.tsv

Collection pangenome

Analysis for the collection from AMRomics returns output in sub-folders below:

  • pangenome/: pangenome results from panta or roary (gene clusters, representative sequences, gene presence/absence matrix...)
  • alignments/: alignments of each gene cluster from the generated pangenome
  • phylogeny/: core gene alignment and the corresponding phylogenetic tree
  • VCFs/: variant calling for each sample from the pangenome (DNA and protein). A pangenome reference, pangenome_reference.fasta, is built by concatenation of representative sequences for each and every gene clusters. This reference is used to generate variant profile for each sample, which is the concatenation of the variations of all its genes, in a VCF file.

License

MIT License

Citations

(added soon)

Authors

amromics's People

Contributors

leducquangpm avatar mdcao avatar trangthnguyen avatar ntanh1999 avatar hsnguyen avatar amromics avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.