ggMatch

Greedy Gene Matching tool.

ggMatch iteratively finds reciprocal best BLAST/DIAMOND hits across a large number of genomes. Working iteratively removes the need for the dreaded all-vs-all BLAST between all genomes. The figure below shows an example gene matching process for a single gene against 298 fungal genomes extracted from the JGI. The black node represents the original query sequence, and edges between other nodes represent high-quality reciprocal best BLAST hits. Each node color represents a different iteration. In the first iteration, we discover the yellow genes; in later iterations, we discover further genes by searching with the newly discovered ones. A traditional reciprocal best hit search would have revealed only the yellow nodes in this network.
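The iterative idea can be illustrated with a small sketch. This is not ggMatch's actual implementation (the method description is still pending); it is a toy model under stated assumptions: gene names are written as "genome:gene", the hypothetical `best_hits` dictionary stands in for a real BLAST/DIAMOND search, and at most one gene is accepted per genome.

```python
def genome_of(gene):
    """Genes are named 'genome:gene' in this toy example."""
    return gene.split(":", 1)[0]


def greedy_match(seed, genomes, best_hits):
    """Iteratively collect reciprocal best hits, starting from one seed gene.

    best_hits[(gene, genome)] is the best hit of `gene` in `genome`
    (a hypothetical lookup table replacing a real BLAST/DIAMOND run).
    Returns a dict mapping each discovered gene to its iteration number.
    """
    found = {seed: 0}
    frontier = [seed]
    iteration = 0
    while frontier:
        iteration += 1
        covered = {genome_of(g) for g in found}
        next_frontier = []
        for gene in frontier:
            for genome in genomes:
                if genome in covered:
                    continue  # assumption: one match per genome
                hit = best_hits.get((gene, genome))
                if hit is None:
                    continue
                # Reciprocal check: the hit must point back at the gene.
                if best_hits.get((hit, genome_of(gene))) == gene:
                    found[hit] = iteration
                    covered.add(genome)
                    next_frontier.append(hit)
        frontier = next_frontier
    return found


toy_hits = {
    ("A:g1", "B"): "B:g5", ("B:g5", "A"): "A:g1",  # reciprocal
    ("A:g1", "C"): "C:g9", ("C:g9", "A"): "A:g2",  # not reciprocal
    ("B:g5", "C"): "C:g9", ("C:g9", "B"): "B:g5",  # reciprocal via B
}
result = greedy_match("A:g1", ["A", "B", "C"], toy_hits)
print(result)
```

In this toy data, the gene in genome C is not a reciprocal best hit of the seed, but it is found in the second iteration through the gene discovered in genome B.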

Description of method

COMING SOON

Usage

Dependencies

Snakemake
Conda
Interproscan (optional)

Getting started

Simple example

You can download and run the example dataset with the following commands:

  git clone https://github.com/thiesgehrmann/ggMatch.git
  cd ggMatch
  ./ggMatch examples/basic/config.json

Yeast Gene Order Browser example

Also included is an example from the Yeast Gene Order Browser:

  git clone https://github.com/thiesgehrmann/ggMatch.git
  cd ggMatch/examples/ygob
  ./prepare_ygob_example.sh
  cd ../..
  ./ggMatch examples/ygob/config.json

Output

The tool outputs three files:

outdir/run/compare/cmpTable.tsv : A matrix of similarity scores to the original query for each species (-1 if absent)

This score is the number of positive-scoring matches in the alignment divided by the length of the query.
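As a worked illustration of the score described above, the sketch below divides a count of positive-scoring matches by the query length and returns -1 when there is no match. The function and argument names are hypothetical and are not taken from the ggMatch code.

```python
ABSENT = -1.0  # value reported when a species has no match


def similarity_score(positives, query_length):
    """Positive-scoring matches divided by query length.

    `positives` is None when no match was found (hypothetical convention
    used only in this sketch).
    """
    if positives is None:
        return ABSENT
    if query_length <= 0:
        raise ValueError("query_length must be positive")
    return positives / query_length


print(similarity_score(150, 200))   # 150 positives over a 200-residue query
print(similarity_score(None, 200))  # no match in this species
```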

outdir/run/graph/nodes.tsv : A file describing the genes in the discovery network (can be loaded with gephi or cytoscape)

Format:

Id	query	label	iteration
node1	query3	"QuerySet:182893"	0
node2	query3	"93_Amamu1:182893"	1
node3	query3	"92_Aspsy1:40389"	1
node7	query3	"QUERY:query3"	-1
node4	query2	"QuerySet:201797"	0
node5	query2	"93_Amamu1:201797"	1
node6	query2	"QUERY:query2"	-1
node8	query1	"QUERY:query1"	-1
node9	query1	"92_Aspsy1:61310"	0
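Besides Gephi or Cytoscape, the node table can be read with a few lines of Python. The sketch below assumes the tab-separated layout shown above and uses an inline copy of a few rows instead of a real `nodes.tsv`; the helper names are illustrative only.

```python
import csv
import io
from collections import defaultdict

# A few rows copied from the format example above (tab-separated).
NODES_TSV = (
    'Id\tquery\tlabel\titeration\n'
    'node1\tquery3\t"QuerySet:182893"\t0\n'
    'node2\tquery3\t"93_Amamu1:182893"\t1\n'
    'node3\tquery3\t"92_Aspsy1:40389"\t1\n'
    'node7\tquery3\t"QUERY:query3"\t-1\n'
)


def load_nodes(handle):
    """Return {node_id: {"query", "label", "iteration"}} from a nodes table."""
    nodes = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        nodes[row["Id"]] = {
            "query": row["query"],
            "label": row["label"],  # csv strips the surrounding quotes
            "iteration": int(row["iteration"]),
        }
    return nodes


def labels_by_iteration(nodes):
    """Group node labels by (query, iteration)."""
    groups = defaultdict(list)
    for node in nodes.values():
        groups[(node["query"], node["iteration"])].append(node["label"])
    return dict(groups)


nodes = load_nodes(io.StringIO(NODES_TSV))
groups = labels_by_iteration(nodes)
for key in sorted(groups):
    print(key, groups[key])
```

To read a real run, replace the `io.StringIO` object with `open("outdir/run/graph/nodes.tsv", newline="")`.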

outdir/run/graph/edges.tsv : A file describing the discovery order in the discovery network

Format:

source	target	validated
node1	node2	1
node1	node3	1
node7	node1	1
node4	node5	1
node6	node4	1
node8	node9	1
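The edge table can be turned into an adjacency list to follow the discovery order from a query node. The sketch below assumes the tab-separated layout shown above and uses an inline copy of a few rows; it does not interpret the `validated` column, whose exact meaning is not documented here.

```python
import csv
import io
from collections import defaultdict, deque

# A few rows copied from the format example above (tab-separated).
EDGES_TSV = (
    "source\ttarget\tvalidated\n"
    "node1\tnode2\t1\n"
    "node1\tnode3\t1\n"
    "node7\tnode1\t1\n"
)


def load_edges(handle):
    """Return {source: [targets]} from an edges table."""
    children = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        children[row["source"]].append(row["target"])
    return children


def discovery_order(children, root):
    """Breadth-first walk over the discovery network, starting at `root`."""
    order, seen, queue = [], {root}, deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return order


children = load_edges(io.StringIO(EDGES_TSV))
order = discovery_order(children, "node7")
print(order)
```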

Loading the edges.tsv and nodes.tsv files into cytoscape produces the following network:

Running your own problem

Please look at the example JSON file and at the default parameters in pipeline_components/defaults.json. The general format takes this form:

{
  "genomes" : {  },
  "queries" : {  },
  "outdir" : "./output"
}

You can modify a number of parameters; they are listed in pipeline_components/defaults.json.

You can validate your configuration file with ggMatch -v config.json
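Before running the built-in validation, a quick syntax pre-check in Python can catch common JSON mistakes such as trailing commas, which strict parsers reject. The sketch below is not a replacement for `ggMatch -v`; the list of required keys is an assumption drawn from the general format shown above, and the function name is hypothetical.

```python
import json

# Assumption: these top-level keys mirror the general format shown above.
EXPECTED_KEYS = ("genomes", "queries", "outdir")


def precheck_config(text):
    """Return a list of problems found in the config text (empty if none)."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"invalid JSON: {err.msg} (line {err.lineno})"]
    if not isinstance(config, dict):
        return ["top level must be a JSON object"]
    return [f"missing key: {key}" for key in EXPECTED_KEYS if key not in config]


good = '{"genomes": {}, "queries": {}, "outdir": "./output"}'
bad = '{"genomes": {}, "queries": {}, "outdir": "./output",}'
print(precheck_config(good))
print(precheck_config(bad))
```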

Adding genomes

A genome is a set of proteins. For each genome, provide a multisequence fasta file of protein sequences, and add the location of that file to the JSON file, indexed by a genome ID, under the "genomes" heading. For example:

"genomes": {
  "genome1" : { "prots" : "proteins_genome1.fasta" },
  "genome2" : { "prots" : "proteins_genome2.fasta" }
}

Defining a query

A query takes the form of a multisequence fasta file. A simple query is a single seed sequence, but if you have pre-existing knowledge of another ortholog, you can provide multiple seed sequences in the same query file.

If the genomes from which these queries originate are also among your genomes, you can prefix the sequence description in the fasta file with "genomeID:". This links the query sequence to that genome. If that query is then identified as a match against any other genome, the reciprocal blast is performed against the linked genome rather than against the set of query sequences.

For example:

>genome1:gene1
MPDDVWSGSSTCSLSSDGMSVRKDMKPEFHRAWPRCTAKAMDLEINEKMPHNETTEVAGVTKIKAVEAVG
GKTGKYIMYAGLAMVMVIYELDNSTVGTYRNFASSDFHQLGKLATLNTAASIITAIFKPPIAKLSDVLGR
GEAYVVTLTFYILSYILC
>prot2
MVAHNFSPRDAQFLTYTNGVSQALMGMGTGLLMYRYRTYKWIGVAGAVIRLVGYGVMVRLRTNESSIAEL
FIVQLVQGIGSGIIETIIIVAAQISVPHAELAQVTSLVMLGTFLGNGIGSAVAGAIYTNQLRDRLEIHLG
PGAAEGQLATLYNSITDRLPEWGTAERTAVNQALGDGHNLVQVTPDSSRSDSLDIEKPKARCF
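To check which query sequences carry a genome prefix before a run, the fasta headers can be inspected as sketched below. How ggMatch itself parses the prefix is not shown here, so the splitting rule (text before the first ":" must equal a known genome ID) is an assumption, and the function name is hypothetical.

```python
def parse_query_headers(fasta_text, genome_ids):
    """Map each query sequence name to its linked genome ID, or None."""
    links = {}
    for line in fasta_text.splitlines():
        if not line.startswith(">"):
            continue  # skip sequence lines
        description = line[1:].strip()
        prefix, sep, rest = description.partition(":")
        if sep and prefix in genome_ids:
            links[rest] = prefix        # linked to a configured genome
        else:
            links[description] = None   # stand-alone query sequence
    return links


example = ">genome1:gene1\nMPDDVWSGSS\n>prot2\nMVAHNFSPRD\n"
links = parse_query_headers(example, {"genome1", "genome2"})
print(links)
```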

These queries need to be added to the configuration JSON file:

  "queries" : {
    "query1" : "query_prots.fasta"
  }
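When there are many genomes, it can be easier to generate the configuration file than to write it by hand. The sketch below assembles the three sections described above and serializes them with Python's `json` module, which always emits strict JSON. The helper name `build_config` is hypothetical, and the file names are the placeholders used in the examples above.

```python
import json


def build_config(genome_fastas, query_fastas, outdir="./output"):
    """Assemble a config dict in the general format described above."""
    return {
        "genomes": {gid: {"prots": path} for gid, path in genome_fastas.items()},
        "queries": dict(query_fastas),
        "outdir": outdir,
    }


config = build_config(
    {"genome1": "proteins_genome1.fasta", "genome2": "proteins_genome2.fasta"},
    {"query1": "query_prots.fasta"},
)
config_text = json.dumps(config, indent=2)
print(config_text)
```

Writing `config_text` to a file such as `config.json` gives a file that can then be passed to ggMatch.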
