ggMatch

Greedy Gene Matching tool.

ggMatch finds reciprocal best BLAST/DIAMOND hits across a large number of genomes in an iterative fashion. The iterative approach removes the need for the dreaded all-vs-all BLAST between all genomes. The figure below shows an example gene matching process for a single gene against 298 fungal genomes extracted from the JGI. The black node represents the original query sequence, and edges between other nodes represent high-quality reciprocal best BLAST hits. Each node color represents a different iteration. In the first iteration, we discover the yellow genes, and in later iterations we discover further genes based on the newly discovered ones. A traditional reciprocal best hit search would have revealed only the yellow nodes in this network.

Example gene graph created by ggMatch
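
Since the method description is not yet available, the following is only a rough Python sketch of the general idea of iterative reciprocal best hit discovery, not ggMatch's actual implementation. The `BEST_HIT` table is a hypothetical stand-in for real BLAST/DIAMOND searches, and all gene and genome names are made up for illustration.

```python
# Toy stand-in for BLAST/DIAMOND: (gene, target genome) -> best hit in that genome.
# All names here are hypothetical and exist only for this illustration.
BEST_HIT = {
    ("a1", "B"): "b1", ("b1", "A"): "a1",  # a1 <-> b1 is reciprocal
    ("a1", "C"): "c9", ("c9", "A"): "a7",  # a1 -> c9 is NOT reciprocal
    ("b1", "C"): "c1", ("c1", "B"): "b1",  # b1 <-> c1 is reciprocal
}
GENOMES = ["A", "B", "C"]


def iterative_rbh(seed, seed_genome, genomes, best_hit):
    """Return {gene: iteration in which it was discovered}, seed at 0."""
    genome_of = {seed: seed_genome}
    found = {seed: 0}
    frontier = [seed]
    iteration = 0
    while frontier:
        iteration += 1
        next_frontier = []
        for gene in frontier:
            for genome in genomes:
                if genome == genome_of[gene]:
                    continue  # do not search a gene against its own genome
                candidate = best_hit.get((gene, genome))
                if candidate is None or candidate in found:
                    continue
                # Keep the candidate only if its best hit points back at `gene`.
                if best_hit.get((candidate, genome_of[gene])) == gene:
                    found[candidate] = iteration
                    genome_of[candidate] = genome
                    next_frontier.append(candidate)
        # The next iteration searches only from the newly discovered genes.
        frontier = next_frontier
    return found


result = iterative_rbh("a1", "A", GENOMES, BEST_HIT)
print(result)
```

In this toy data, a single reciprocal best hit search from `a1` finds only `b1`. The gene `c1` is reached in the second iteration through `b1`, which mirrors how the later-iteration nodes in the figure are discovered.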

Description of method

COMING SOON

Usage

Dependencies

  • Snakemake
  • Conda
  • Interproscan (optional)

Getting started

Simple example

You can download and run the example dataset with the following commands:

  git clone https://github.com/thiesgehrmann/ggMatch.git
  cd ggMatch
  ./ggMatch examples/basic/config.json

Yeast Gene Order Browser example

Also included is an example based on the Yeast Gene Order Browser (YGOB):

  git clone https://github.com/thiesgehrmann/ggMatch.git
  cd ggMatch/examples/ygob
  ./prepare_ygob_example.sh
  cd ../..
  ./ggMatch examples/ygob/config.json

Output

The tool outputs three files:

  • outdir/run/compare/cmpTable.tsv : A matrix of similarity scores to the original query for each species (-1 if absent)

This score is determined by the number of positive scoring matches in the alignment divided by the length of the query

  • outdir/run/graph/nodes.tsv : A file describing the genes in the discovery network (can be loaded with Gephi or Cytoscape)

Format:

Id query label iteration
node1 query3 "QuerySet:182893" 0
node2 query3 "93_Amamu1:182893" 1
node3 query3 "92_Aspsy1:40389" 1
node7 query3 "QUERY:query3" -1
node4 query2 "QuerySet:201797" 0
node5 query2 "93_Amamu1:201797" 1
node6 query2 "QUERY:query2" -1
node8 query1 "QUERY:query1" -1
node9 query1 "92_Aspsy1:61310" 0

  • outdir/run/graph/edges.tsv : A file describing the discovery order in the discovery network

Format:

source target validated
node1 node2 1
node1 node3 1
node7 node1 1
node4 node5 1
node6 node4 1
node8 node9 1
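
The similarity score stored in cmpTable.tsv (positive-scoring matches divided by query length, -1 if absent) can be written out as a small Python sketch. The function name is hypothetical, and the assumption that the numerator corresponds to the "positives" count of a BLAST-style alignment is ours, not confirmed by the source.

```python
ABSENT = -1  # value used in cmpTable.tsv when a species has no match


def similarity_score(positives, query_length):
    """Positive-scoring alignment positions divided by the query length.

    `positives` is assumed to be the number of positive-scoring matches in
    the alignment; pass None when no match was found for the species.
    """
    if positives is None:
        return ABSENT
    if query_length <= 0:
        raise ValueError("query_length must be positive")
    return positives / query_length


print(similarity_score(150, 200))   # 150 positive positions, query of length 200
print(similarity_score(None, 200))  # no match found in this species
```

Because the denominator is the query length rather than the alignment length, a short but perfect local alignment still receives a low score under this definition.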

Loading the edges.tsv and nodes.tsv files into Cytoscape produces the following network:

Discovery graph created by ggMatch for the example dataset
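
The two graph files can also be inspected without Cytoscape. The sketch below assumes the files are tab-separated with the header rows shown in the Format blocks, and uses a subset of the example rows embedded as strings so that it runs on its own. The helper names are hypothetical.

```python
import csv
import io
from collections import defaultdict, deque

# A subset of the example rows, joined with tabs (assumed delimiter).
NODES_TSV = "\n".join("\t".join(row) for row in [
    ("Id", "query", "label", "iteration"),
    ("node1", "query3", '"QuerySet:182893"', "0"),
    ("node2", "query3", '"93_Amamu1:182893"', "1"),
    ("node3", "query3", '"92_Aspsy1:40389"', "1"),
    ("node7", "query3", '"QUERY:query3"', "-1"),
])
EDGES_TSV = "\n".join("\t".join(row) for row in [
    ("source", "target", "validated"),
    ("node1", "node2", "1"),
    ("node1", "node3", "1"),
    ("node7", "node1", "1"),
])


def read_tsv(handle):
    """Read a tab-separated file with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def discovery_order(root, children):
    """Breadth-first walk of the discovery network starting at `root`."""
    order, seen, queue = [], {root}, deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return order


nodes = read_tsv(io.StringIO(NODES_TSV))
edges = read_tsv(io.StringIO(EDGES_TSV))

labels = {n["Id"]: n["label"] for n in nodes}
children = defaultdict(list)
for edge in edges:
    children[edge["source"]].append(edge["target"])

# In the example, the original query node is the one with iteration -1.
root = next(n["Id"] for n in nodes if int(n["iteration"]) == -1)
for node_id in discovery_order(root, children):
    print(node_id, labels[node_id])
```

To read real output, replace the `io.StringIO(...)` arguments with `open("outdir/run/graph/nodes.tsv", newline="")` and the corresponding path for edges.tsv.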

Running your own problem

Please look at the example JSON file and the default parameters in pipeline_components/defaults.json. The general format is:

{
  "genomes" : {  },
  "queries" : {  },
  "outdir" : "./output"
}

You can modify a number of parameters; they are listed, with their default values, in pipeline_components/defaults.json.

You can validate your configuration file with ggMatch -v config.json
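
As a complement, a configuration file can be sanity-checked by hand before running the pipeline. The sketch below is not what `ggMatch -v` does internally (that is not documented here). It only assumes that the three top-level keys shown in the general format are expected and that each genome entry carries a "prots" field. The function name is hypothetical.

```python
import json

# Keys taken from the general format; treating all three as required is an assumption.
EXPECTED_KEYS = ("genomes", "queries", "outdir")


def check_config(text):
    """Return a list of human-readable problems found in a config string."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"invalid JSON: {err.msg} (line {err.lineno})"]
    if not isinstance(config, dict):
        return ["top level must be a JSON object"]

    problems = [f"missing key: {key}" for key in EXPECTED_KEYS if key not in config]
    genomes = config.get("genomes", {})
    if isinstance(genomes, dict):
        for name, entry in genomes.items():
            if not isinstance(entry, dict) or "prots" not in entry:
                problems.append(f"genome {name!r} has no 'prots' entry")
    else:
        problems.append("'genomes' must be a JSON object")
    return problems


GOOD = '{"genomes": {"genome1": {"prots": "p1.fasta"}}, "queries": {}, "outdir": "./output"}'
print(check_config(GOOD))
print(check_config('{"genomes": {"g1": {}}, "queries": {}}'))
```

Note that strict JSON parsers such as Python's `json` module reject a trailing comma after the last entry of an object, so a hand-edited file is worth checking for that first.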

Adding genomes

A genome is a set of proteins. For each genome, provide a multisequence FASTA file of protein sequences and add its location to the JSON file, indexed by a genome ID, under the "genomes" heading. For example:

"genomes": {
  "genome1" : { "prots" : "proteins_genome1.fasta" },
  "genome2" : { "prots" : "proteins_genome2.fasta" }
}
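
With many genomes, writing this block by hand becomes tedious. The following sketch generates it from a list of FASTA paths. Using the file name without its extension as the genome ID is only an assumption for illustration; any unique ID should work. The function name is hypothetical.

```python
import json
from pathlib import Path


def genomes_block(fasta_paths):
    """Build the "genomes" mapping: genome ID -> {"prots": path}."""
    block = {}
    for path in sorted(fasta_paths):
        genome_id = Path(path).stem  # assumed ID: file name without extension
        if genome_id in block:
            raise ValueError(f"duplicate genome ID: {genome_id}")
        block[genome_id] = {"prots": str(path)}
    return block


paths = ["data/genome2.fasta", "data/genome1.fasta"]
print(json.dumps({"genomes": genomes_block(paths)}, indent=2))
```

In practice the path list could come from something like `Path("data").glob("*.fasta")`.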

Defining a query

A query takes the form of a multisequence fasta file. A simple query is a single seed sequence, but if you have pre-existing knowledge of another ortholog, you can provide multiple seed sequences in the same query file.

If the genomes from which these queries originate are also included in your set of genomes, you can prefix the sequence description in the FASTA file with "genomeID:". This links the query sequence to that genome. If the query is then identified as a match against any other genome, the reciprocal BLAST is performed against the linked genome rather than against the set of query sequences.

For example:

>genome1:gene1
MPDDVWSGSSTCSLSSDGMSVRKDMKPEFHRAWPRCTAKAMDLEINEKMPHNETTEVAGVTKIKAVEAVG
GKTGKYIMYAGLAMVMVIYELDNSTVGTYRNFASSDFHQLGKLATLNTAASIITAIFKPPIAKLSDVLGR
GEAYVVTLTFYILSYILC
>prot2
MVAHNFSPRDAQFLTYTNGVSQALMGMGTGLLMYRYRTYKWIGVAGAVIRLVGYGVMVRLRTNESSIAEL
FIVQLVQGIGSGIIETIIIVAAQISVPHAELAQVTSLVMLGTFLGNGIGSAVAGAIYTNQLRDRLEIHLG
PGAAEGQLATLYNSITDRLPEWGTAERTAVNQALGDGHNLVQVTPDSSRSDSLDIEKPKARCF
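
To check which query sequences would be linked to a genome, the headers can be scanned for the "genomeID:" prefix. The sketch below reflects one reading of the convention described above, not ggMatch's own parser: a prefix only counts when it matches a known genome ID. The function name is hypothetical.

```python
def linked_queries(fasta_text, genome_ids):
    """Return (genome ID or None, sequence name) for each FASTA record."""
    records = []
    for line in fasta_text.splitlines():
        if not line.startswith(">"):
            continue  # sequence line, not a header
        description = line[1:].strip()
        prefix, sep, rest = description.partition(":")
        if sep and prefix in genome_ids:
            records.append((prefix, rest))  # linked to a configured genome
        else:
            records.append((None, description))  # plain seed sequence
    return records


# Sequences are shortened placeholders, not real proteins.
EXAMPLE = """>genome1:gene1
MPDDVWSGSS
>prot2
MVAHNFSPRD
"""
print(linked_queries(EXAMPLE, {"genome1", "genome2"}))
```

Here `gene1` is linked to `genome1`, while `prot2` has no prefix and is treated as an unlinked seed sequence.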

These queries need to be added to the configuration JSON file:

  "queries" : {
    "query1" : "query_prots.fasta"
  }

ggmatch's Issues

Confirm that use of BLAST's `-max_target_seqs` is intentional

Hi there,

This is a semi-automated message from a fellow bioinformatician. Through a GitHub search, I found that the following source files make use of BLAST's -max_target_seqs parameter:

Based on the recently published report, Misunderstood parameter of NCBI BLAST impacts the correctness of bioinformatics workflows, there is a strong chance that this parameter is misused in your repository.

If the use of this parameter was intentional, please feel free to ignore and close this issue but I would highly recommend to add a comment to your source code to notify others about this use case. If this is a duplicate issue, please accept my apologies for the redundancy as this simple automation is not smart enough to identify such issues.

Thank you!
-- Arman (armish/blast-patrol)
