VIral and Circular content from metAgenomes (VICA)

find_circular.py

Finds circular contigs in metagenome assemblies by identifying forward read overlaps between the start and end of each contig. Tested successfully on circular contigs from metagenome assemblies produced by SOAPdenovo2, IDBA-UD, and SPAdes.

Requirements: (1) LASTZ, with the lastz executable in /usr/bin; (2) Biopython.

usage: find_circular.py [-h] -i INPUT [-l READ_LENGTH] [-m MIN_CONTIG_SIZE]

Finds circular contigs using lastz.

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        Input assembly filename
  -l READ_LENGTH, --read_length READ_LENGTH
                        Read length (default: 101 bp)
  -m MIN_CONTIG_SIZE, --min_contig_size MIN_CONTIG_SIZE
                        Minimum contig size to check for (default: 3 kbp)
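
The idea behind the overlap check can be illustrated without LASTZ. The following is a minimal sketch, not the script's actual implementation: it looks for an exact match between the first and last bases of a contig, whereas find_circular.py aligns with lastz and may tolerate mismatches. The function names and the min_overlap parameter are hypothetical; max_overlap is only loosely modeled on the READ_LENGTH option, under the assumption that an assembler-produced overlap is no longer than one read.

```python
def terminal_overlap(seq, min_overlap=20, max_overlap=101):
    """Return the length of the longest exact start/end overlap, or 0 if none.

    Hypothetical helper: exact string matching stands in for a lastz alignment.
    """
    seq = seq.upper()
    # An overlap longer than half the contig is not meaningful here.
    longest = min(max_overlap, len(seq) // 2)
    for k in range(longest, min_overlap - 1, -1):
        if seq[:k] == seq[-k:]:
            return k
    return 0


def is_circular(seq, min_contig_size=3000, **overlap_kwargs):
    """Flag a contig as putatively circular if it is long enough and its ends overlap."""
    if len(seq) < min_contig_size:
        return False
    return terminal_overlap(seq, **overlap_kwargs) > 0
```

With Biopython, contigs could be fed to is_circular by iterating over SeqIO.parse("assembly.fa", "fasta") and passing str(record.seq); the file name is a placeholder.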

classify.py

Classifies contigs based on gene size, strandedness, and intergenic region length. Works best on contigs longer than 20 kbp, and only attempts to classify contigs with at least 10 coding regions.

Requirements: Biopython, scikit-learn, Prodigal.

python classify.py -i metagenome_assembly.fa
usage: classify.py [-h] -i INPUT [-m MIN_CONTIG_SIZE] [-p PRODIGAL_PATH]

Classifies viral metagenomic contigs using a random forest classifier built on
public datasets. Features are gene length, intergenic space length,
strandedness, and prodigal calling gene confidence.

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        Input assembly filename
  -m MIN_CONTIG_SIZE, --min_contig_size MIN_CONTIG_SIZE
                        Minimum contig size to use (default: 10 kbp)
  -p PRODIGAL_PATH, --prodigal_path PRODIGAL_PATH
                        Path to prodigal (default: prodigal)
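
To make the feature set more concrete, here is a rough sketch of how such per-contig features could be derived from gene coordinates. It is an illustration under assumptions, not the code in classify.py: the input format (a list of start, end, strand tuples, as might be parsed from Prodigal output), the function name, and the definition of strandedness as the fraction of adjacent gene pairs that switch strand are all hypothetical. The min_genes default mirrors the 10 coding region threshold mentioned above.

```python
from statistics import mean


def contig_features(genes, min_genes=10):
    """Summarize gene predictions for one contig.

    genes: list of (start, end, strand) tuples with 1-based inclusive
    coordinates and strand given as '+' or '-'. This input format is an
    assumption made for illustration.
    Returns None when the contig has fewer than min_genes genes.
    """
    if len(genes) < min_genes:
        return None
    genes = sorted(genes)
    pairs = list(zip(genes, genes[1:]))
    lengths = [end - start + 1 for start, end, _ in genes]
    # Overlapping genes are treated as having an intergenic gap of zero.
    gaps = [max(0, nxt[0] - cur[1] - 1) for cur, nxt in pairs]
    switches = sum(cur[2] != nxt[2] for cur, nxt in pairs)
    return {
        "n_genes": len(genes),
        "mean_gene_length": mean(lengths),
        "mean_intergenic_length": mean(gaps) if gaps else 0.0,
        # Hypothetical definition of "strandedness".
        "strand_switch_rate": switches / len(pairs) if pairs else 0.0,
    }
```

A feature dictionary like this, one per contig, is the kind of input that could be passed to a random forest classifier such as scikit-learn's RandomForestClassifier.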

identify_host.py

Identifies putative bacterial and archaeal hosts for provided viral genomes by comparing tetranucleotide frequencies of viral genomes to (a) known GenBank genomes, (b) metagenomic assembled contigs with marker proteins, or (c) GenBank genomes which have been identified in the provided metagenome. Visualizes putative phage-host relationships.

Requirements: scikit-learn, HMMER3 (hmmsearch), Biopython, blastp.

Three sets of results are calculated:

*_genbank.txt: a naive host prediction in which the viral contigs are compared to all available GenBank genomes and the closest cellular genomes are reported.

*_markercontigs.txt: the entire provided metagenome assembly is searched for contigs carrying core ribosomal marker proteins, and these proteins are searched with BLASTP against a marker protein database. Viral contig tetranucleotide frequencies are then compared to those of the marker-bearing host contigs, and the closest host contigs, with their closest BLASTP annotation, are reported for each viral contig.

*_markergenbank*: similar to the second approach, but instead of comparing viral contigs directly to metagenomic host contigs, they are compared to the full GenBank genomes of the bacterial and archaeal genera identified in the metagenomic assembly (from the marker protein BLASTP results).

If no metagenome assembly is provided, only the results of the first (GenBank) approach are produced. Which method to use depends on your data, but in general use the markercontigs results if your metagenomic assembly is large enough to contain many host contigs > 20 kbp.

python identify_host.py -i ./phage_contigs.fa -a assembled_contigs.fa -v -o ./host_identification/
usage: identify_host_markers.py [-h] -i INPUT [-a ASSEMBLY] [-v] [-b] [-hm]
                                [-o]

Predicts host(s) for a contig through tetranucleotide similarity.

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        Viral contig(s) fasta file
  -a ASSEMBLY, --assembly ASSEMBLY
                        Cellular metagenome assembly FASTA file (to search for
                        host marker genes).
  -v, --visualize       Visualizes NMDS of tetranucleotide frequencies for
                        host contigs and viral contigs.
  -b, --blast_path      Path to the BLASTP executable (default: blastp).
  -hm, --hmmsearch_path
                        Path to the hmmsearch executable (default: hmmsearch).
  -o, --output          Path to the directory to store output (will create if
                        does not exist).
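
The "closest host" step shared by all three result sets amounts to a nearest-neighbor search over tetranucleotide frequency vectors. The sketch below assumes those vectors have already been computed and simply ranks candidate hosts by Euclidean distance for each viral contig. The function name, the dictionary inputs, and the top_n parameter are hypothetical, and the script's own distance measure and reporting format may differ.

```python
import math


def closest_hosts(viral_profiles, host_profiles, top_n=3):
    """Rank candidate hosts for each viral contig by Euclidean distance.

    Both arguments map a contig or genome name to a frequency vector
    (a sequence of floats of equal length). Returns a dict mapping each
    viral name to a list of (host_name, distance) pairs, closest first.
    """
    results = {}
    for virus, v_vec in viral_profiles.items():
        ranked = sorted(
            (math.dist(v_vec, h_vec), host)
            for host, h_vec in host_profiles.items()
        )
        results[virus] = [(host, dist) for dist, host in ranked[:top_n]]
    return results


# Toy three-dimensional vectors; real tetranucleotide profiles have 256 entries.
viral = {"phage_1": [0.5, 0.3, 0.2]}
hosts = {"host_A": [0.5, 0.3, 0.2], "host_B": [0.1, 0.1, 0.8]}
print(closest_hosts(viral, hosts, top_n=2))
```

Note that math.dist requires Python 3.8 or later.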

compare_tetramers.py

Calculates the Euclidean distance between the tetranucleotide frequencies of two given contigs or sets of contigs.

Requirements: scikit-learn.

python compare_tetramers.py contigs1.fa contigs2.fa
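
As a rough illustration of the calculation, the sketch below counts all overlapping 4-mers in each sequence, normalizes the counts to frequencies, and takes the Euclidean distance between the two 256-element vectors. It is a simplified stand-in for the script: the function names are hypothetical, and whether compare_tetramers.py normalizes counts, merges reverse complements, or delegates the distance to scikit-learn is not stated here.

```python
import math
from itertools import product

# All 256 possible tetranucleotides in a fixed order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetramer_frequencies(seq):
    """Return a 256-element list of tetranucleotide frequencies for seq."""
    seq = seq.upper()
    counts = dict.fromkeys(TETRAMERS, 0)
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if kmer in counts:  # windows containing N or other symbols are skipped
            counts[kmer] += 1
    total = sum(counts.values())
    return [counts[k] / total if total else 0.0 for k in TETRAMERS]


def tetramer_distance(seq_a, seq_b):
    """Euclidean distance between the tetranucleotide profiles of two sequences."""
    freqs_a = tetramer_frequencies(seq_a)
    freqs_b = tetramer_frequencies(seq_b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(freqs_a, freqs_b)))
```

Identical sequences give a distance of 0, and the distance grows as the two profiles diverge.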

crispr_matches.py

Finds CRISPR loci in an assembled metagenome using minced, then searches the rest of the assembled metagenome for matches to the identified spacers.

Requirements: BLAST+, minced.

python crispr_matches.py -i contigs.fa
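
The spacer search step can be sketched as follows, assuming the spacers have already been extracted. This is a deliberately simplified illustration: it finds only exact matches on either strand, whereas the script presumably relies on BLAST+ (listed in its requirements) and may accept imperfect matches. The function names, the dictionary inputs, and the source_contig parameter are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA sequence."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def find_spacer_matches(spacers, contigs, source_contig=None):
    """Find exact spacer matches on both strands of every contig.

    spacers and contigs map names to sequences. The contig named by
    source_contig (where the CRISPR locus itself lies) is skipped.
    Returns a list of (spacer_name, contig_name, position, strand) tuples
    with 0-based positions on the forward strand of the contig.
    """
    hits = []
    for s_name, spacer in spacers.items():
        spacer = spacer.upper()
        queries = (("+", spacer), ("-", reverse_complement(spacer)))
        for strand, query in queries:
            for c_name, contig in contigs.items():
                if c_name == source_contig:
                    continue
                contig = contig.upper()
                pos = contig.find(query)
                while pos != -1:
                    hits.append((s_name, c_name, pos, strand))
                    pos = contig.find(query, pos + 1)
    return hits
```

A spacer that matches a contig other than the one carrying its CRISPR array is a candidate link between a host and the element it has encountered.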

Contributors

alexcritschristoph
