Quota synteny alignment

Author: Haibao Tang (tanghaibao), Brent Pedersen (brentp)
Email: [email protected]
License: BSD

Introduction

Typically in comparative genomics, we can identify anchors, chain them into syntenic blocks and interpret these blocks as derived from a common descent. However, when comparing two genomes undergone ancient genome duplications (plant genomes in particular), we have large number of blocks that are not orthologous, but are paralogous. This has forced us sometimes to use ad-hoc rules to screen these blocks. So the question is: given the expected depth (quota) along both x- and y-axis, select a subset of the anchors with maximized total score.

This program tries to screen the clusters based on the depth constraints enforced by the user. For example, between rice-sorghum comparison, we can enforce 1:1 ratio to get all the orthologous blocks; or maybe 4:2 to grab orthologous blocks between athaliana-poplar. But the quota has to be given by the user. The program than tries to optimize the scores of these blocks globally.

To see the algorithm in action without installation, please go to CoGe SynMap tool. Select "Analysis Options", select algorithm options for "Merge Syntenic Blocks" (quota_align.py--merge) and/or "Syntenic Depth" (quota_align.py --quota).

Installation

Download the most recent codes at:

git clone http://github.com/tanghaibao/quota-alignment.git

Required dependencies

Python version >=2.7
GNU linear programming kit GLPK.

Please put the executable glpsol on the PATH.

Optional dependencies

SCIP.

Faster integer programming solver, choose the binary (32-bit, 64-bit) that fits your machine. SCIP might have dependency on LAPACK, which needs to be installed too. Please rename the executable scip and put it on the PATH, for example:

sudo cp scip-x.x.0.linux.x86_64 /usr/local/bin/scip
sudo chmod +x !$

bx-python package.

This is only required when user wants to analyze .maf formatted data (use maf_utils.py):

easy_install bx-python

BCBio package.

This is only required when user wants to convert .gff file to .bed format, see section Pre- and post-processing.

Cookbook

Default package comes with the test data for case 1 and 2 in run.sh. More test data set can be downloaded here. Unpack into the folder, and execute run.sh.

BLAST anchors chaining and quota-based screening

First you need to figure out a way to convert the BLAST result into the following format (called .raw format), see section Pre- and post-processing, in particular on filtering BLAST output:

1       6       1       4848    12
1       7       1       4847    10
1       8       1       4847    50
1       9       1       4846    14

Where the five columns correspond to chr1, pos1, chr2, pos2, and -log10(E-value). Then we can do something like:

quota_align.py --format=raw --merge --Dm=20 --min_size=5 --quota=2:1 maize_sorghum.qa

--merge asks for chaining, distance cutoff --Dm=20 for extending the chain, --min_size=5 for keeping the chains that are long enought; --quota=2:1 turns on the quota-based screening (and asks for two-to-one match, in this case, lineage specific WGD in maize genome, make every 2 maize region matching 1 sorghum region). Note that if you set the quota wrong, e.g. suppose you don't know the quota ratio between maize and sorghum, and you typed 1:1, you will see the coverage reports to be too low:

write (134) clusters to 'data/maize_sorghum.qa.filtered'
genome X coverage: 62.6%
genome Y coverage: 97.5%

In this case, genome X (maize) has only slightly over half of the genome aligned, missing the duplicated counterpart.

BLASTZ anchors chaining and quota-based screening

Most often you will have the .maf file. First convert it to .qa format:

cluster_utils.py --format=maf athaliana_lyrata.maf athaliana_lyrata.qa

Then you want to do the chaining and the screening in one step:

quota_align.py --merge --Dm=20000 --quota=1:1 --Nm=40000 athaliana_lyrata.qa

--merge asks for chaining, and the distance cutoff --Dm=20000 for extending the chain; --quota=1:1 turns on the quota-based screening (and asks for one-to-one match), and the overlap cutoff --Nm=40000. The reason to specify an overlap cutoff is because the quota-based screening is based on 1D block overlap. Sometimes due to the over-chaining, two blocks will only slightly overlap. Therefore the distance 40000 is how much slight overlap we tolerate.

Finally you can get the screened .maf file by doing:

maf_utils.py athaliana_lyrata.qa athaliana_lyrata.maf

Your final screened .maf file is called athaliana_lyrata.maf.filtered. Hint: you can compare the original and filtered .maf using Miller lab's Gmaj tool.

Find quota-screened paralogous blocks

First we need to figure out how to get the input data. See the last two sections for preparing data from BLAST and BLASTZ. Then we can do something like the following:

quota_align.py --format=raw --merge --Dm=20 --min_size=5 --self --quota=2:2 grape_grape.raw

The reason for setting up --quota=2:2 is because grape has paleo-hexaploidy event. Therefore many regions will have 3 copies, but we need to remove the self match. Therefore we should do 2:2 instead. --self option may be turned on for finding paralogous blocks, when you have reduced the redundancies in your .qa file (note that self-match is symmetric across diagonal). The reason for that is in the self-matching case, the constraints on the union of the constraints on both axis, rather than on each axis separately.

For a lineage that has tetraploidy event (genome doubling), using the example of brachypodium (which has undergone an ancient pan-grass tetraploidy), we can do:

quota_align.py --format=raw --merge --Dm=20 --self --quota=1:1 brachy_brachy.raw

Note in this case, --quota=1:1 since we have most regions in 2 copies, but we need to ignore the self match. Therefore the rule is when searching paralogous blocks (always do --quota=x:x, where x is the multiplicity minus 1).

Format block order for GRIMM analysis

This is so far only supported when --quota=1:1. For example:

quota_align.py --merge --quota=1:1 athaliana_lyrata.qa
cluster_utils.py --print_grimm athaliana_lyrata.qa.filtered

The script will print this:

>genome X
1 2 3 4 5 6 7 8 9 10 11$
12 13 14 15 16 17 18 19$
20 21 22 23 24 25 26 27 28 29 30 31$
32 33 34 35 36$
37 38 39 40 41$
42 43 44 45 46 47 48 49 50$
51 52 53 54 55 56 57 58$
59 60 61$
62 63$
>genome Y
-1 2 -3 4 -6 -7 5 8 10 9 11 -14 13 -12 15 16 17 18 -19$
37 38 24 -25 26 29 28 -30 -27 31 32 33 -34 35 36$
-21 -20 22 23 39 40 41$
-50 49 -48 44 46 -45 47 63 -62 -55 -54 53 -52 51$
-42 43 56 57 -58 -59 60 -61$

This is the input format for Glenn Tesler's GRIMM software. You can either run it locally or on their website.

Pre- and post-processing

There are a few utility scripts included in scripts/ folder.

GFF to BED

Most annotation groups only provide .gff file (see gff format) for the annotation of gene models. I often convert the .gff file to a simpler .bed format (see bed format). You can do the following to create the .bed file (BCBio module required):

gff_to_bed.py athaliana.gff >athaliana.bed

This will get protein-coding models and put these in the .bed format. .gff file must be gff3-compatible, otherwise you have to write customized parser (in fact, this is recommended as most .gff file for genome projects are not compatible). .bed format is required for doing BLAST filtering, see below.

BLAST filtering

The integer programming solver cannot solve large problem instance (say >60000 variables), this mostly will not happen if we filter our anchors carefully (removing redundant and weak anchors). To filter the BLAST results before chaining, using the blast_to_raw.py shipped in this package. Say you have BLAST file (tabular format) ready. You need to do:

blast_to_raw.py athaliana_grape.blastp --qbed=athaliana.bed --sbed=grape.bed --tandem_Nmax=10 --cscore=.5

This will convert the BLAST file into the .raw formatted file that quota_align.py can understand (use --format=raw). For your convenience, several BLAST filters are also implemented in blast_to_raw.py. Notice these BLAST filters are optional.

Remove local dups

Option --tandem_Nmax=10 will group the local dups that are within 10 gene distance. When this option is on, blast_to_raw.py will write new .nolocaldups.bed file, these will substitute your original .bed file from now on.

Remove repetitive matches

For genes that have many hits, we will adjust the evalue:

adjusted_evalue(A, B) = evalue(A, B) ** ((counts_of_blast / counts_of_genes) / (counts(A) + counts(B)))

Use the cscore filtering

Option --cscore=.5 will keep only the hits that have a good score. See reference for cscore in the supplementary of sea anemone paper. C-score between gene A and B is defined:

cscore(A, B) = score(A, B)/max(best score of A, best score of B)

Typically, after the blast_to_raw.py, we can do the quota_align.py directly:

quota_align.py --format=raw --merge --Dm=20 --min_size=5 --quota=4:1 athaliana_grape.raw

Plot dot plot

To visualize the quota-align.py result, all you need is the .qa.filtered result, and two .bed file (remember if you have removed local dups above, make sure you use the .nolocaldups.bed). As an example:

qa_plot.py --qbed=athaliana.nolocaldups.bed --sbed=grape.nolocaldups.bed athaliana_grape.qa.filtered

This will generate a dot plot that you can stare to spot any problem. Below is an example of athaliana-grape dot plot when quota of 4:1 is enforced (meaning that there are expected 4 athaliana regions mapping to 1 grape region).

The result of quota-based screening can be compared to the raw blast result. Using the blast_plot.py in script folder. The syntax is similar to qa_plot, only on differernt input format:

blast_plot.py --qbed=athaliana.bed --sbed=grape.bed athaliana_grape.blastp

Summary

The following is shell script run.sh that can be used from a BLAST output to the dot plot figure. Please note that you need to modify the path and params to make it work on your machine:

#!/bin/bash

# quota-alignment folder
QA=${HOME}/projects/quota-alignment/
# query species
SA=brapa
### target species
SB=athaliana

# filter blast results (note that it needs to be tab-delimited blast m8 format)
${QA}/scripts/blast_to_raw.py ../blast/${SA}_${SB}.blastz --qbed=${SA}.bed --sbed=${SB}.bed --tandem_Nmax=10 --cscore=0.7
# run the quota-based screening
${QA}/quota_align.py --format=raw --merge --Dm=30 --min_size=5 --quota=3:1 ../blast/${SA}_${SB}.raw
# visualize result as dot plot
${QA}/scripts/qa_plot.py --qbed=${SA}.nolocaldups.bed --sbed=${SB}.nolocaldups.bed ../blast/${SA}_${SB}.raw.filtered

Reference

Tang et al. (2011) Screening synteny blocks in pairwise genome comparisons through integer programming. [ BMC Bioinformatics ]

Is it possible/appropriate to determine the WGD events from the genome coverages?

In many whole genome de novo sequencing project, the scaffolds and contigs are not assembled into pseudomolecules. As a result, it's very difficult to determine exact ratio between subject genome and reference genome from a synteny map.

According to your description in README.rst, quota_align.py can calculate the genome coverages from a specified ratio, and the coverage will be two low if a wrong ratio is specified. So my question is, is it possible or appropriate to determine the exact ratio between two genomes?

Here are two real examples

Example 1:

sp1: the species we studied (unknown)
Cca: Coffea canephora (no WGD event after γ)
Vvi: Vitis vinifera (no WGD event after γ)
Sly: Solanum lycopersicum (genome triplicated after γ)

sp1 vs Cca

--quota	genome X coverage (sp1)	genome Y coverage (Cca)
1:1	55.6%	95.6%
2:1	85.4%	97.0%
3:1	95.6%	96.9%
4:1	95.7%	96.9%
6:1	95.7%	96.9%

sp1 vs Vvi

--quota	genome X coverage (sp1)	genome Y coverage (Vvi)
1:1	58.7%	93.3%
2:1	84.4%	94.9%
3:1	93.3%	94.5%
4:1	93.5%	94.2%
6:1	93.5%	94.2%

sp1 vs Sly

--quota	genome X coverage (sp1)	genome Y coverage (Vvi)
1:3	72.5%	95.5%
2:3	92.6%	99.5%
3:3	99.6%	99.6%

Question:

Can I infer that sp1 genome underwent a whole genome triplication after γ?

Example 2:

sp2: the species we studied (unknown)
Cca: Coffea canephora (no WGD event after γ)

sp2 vs Cca

--quota	genome X coverage (sp2)	genome Y coverage (Cca)
1:1	37.5%	95.0%
2:1	60.5%	97.1%
3:1	74.3%	95.9%
4:1	83.4%	95.9%
6:1	89.6%	95.6%
8:1	90.3%	95.5%

Question:

Can I infer that sp2 genome underwent a round of whole genome triplication and a round of whole genome duplication (3 * 2 = 6) after γ?

I examined this method in Arabidopsis vs grape, Arabidopsis vs Brassica rapa and poplar vs peach. It seemed to work well.

Thanks for your attention!
Xiaofei Zeng

tanghaibao / quota-alignment Goto Github PK

quota-alignment's Introduction

Quota synteny alignment

Introduction

Installation

Cookbook

BLAST anchors chaining and quota-based screening

BLASTZ anchors chaining and quota-based screening

Find quota-screened paralogous blocks

Format block order for GRIMM analysis

Pre- and post-processing

GFF to BED

BLAST filtering

Plot dot plot

Summary

Reference

quota-alignment's People

Contributors

Stargazers

Watchers

Forkers

quota-alignment's Issues

Example 1:

sp1 vs Cca

sp1 vs Vvi

sp1 vs Sly

Question:

Example 2:

sp2 vs Cca

Question:

Recommend Projects

Recommend Topics

Recommend Org