guyleonard / orthomcl_tools

A couple of tools I find indispensable for post orthomcl down-stream analysis.

License: GNU General Public License v2.0

Perl 96.55% R 3.45%

orthomcl_tools

A couple of tools I made for downstream analysis of orthoMCL/orthAgogue output. The first tool computes a set of CSV files with presence/absence or 'count' information relating to your orthologue groups and taxa. The second tool takes the output of a DOLLOP analysis along with the orthologue groups and maps them to locations on the tree topology, as a list and a set of alignments for each node. The final two scripts allow you to plot this information in an image created in R using ggtree.

Citation

orthomcl_groups_analysis.pl

Run this script to:

- Collate all sequences from each ortholog group into separate FASTA files.
- Generate a Presence/Absence Grid
  - Is the taxon/genome represented in the ortholog group? 0 or 1
  - Also generates a transposed grid in a phylip-like format, which is useful for Dollo Parsimony / ML analyses.
- Generate a Count (tally) Grid
  - How many representatives (genes) are present in each ortholog group? 0...n
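
As an illustration of what the two grids contain, here is a minimal Python sketch. It is not the Perl script from this repository. It assumes the usual OrthoMCL `groups.txt` layout (`group_name: taxon|protein taxon|protein ...`); the function names `parse_groups` and `build_grids`, and the example taxon codes, are hypothetical.

```python
from collections import Counter


def parse_groups(lines):
    """Map each group name to a Counter of taxon codes.

    Assumes lines of the form "grp1: hsap|P1 mmus|P3".
    """
    groups = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        name, members = line.split(":", 1)
        # Each member is "taxon|protein_id"; only the taxon code is counted.
        groups[name.strip()] = Counter(m.split("|", 1)[0] for m in members.split())
    return groups


def build_grids(groups, taxa):
    """Return (presence, counts): group name -> list aligned with `taxa`."""
    counts = {g: [c[t] for t in taxa] for g, c in groups.items()}
    presence = {g: [1 if n > 0 else 0 for n in row] for g, row in counts.items()}
    return presence, counts


# Made-up example data, for illustration only.
example = [
    "grp1: hsap|P1 hsap|P2 mmus|P3",
    "grp2: mmus|P4 scer|P5",
]
taxa = ["hsap", "mmus", "scer"]
presence, counts = build_grids(parse_groups(example), taxa)
print(counts["grp1"])    # [2, 1, 0]
print(presence["grp1"])  # [1, 1, 0]
```

Each row has exactly one value per taxon, so every group line in either grid should have as many columns as there are taxa.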

You will need these files:

- from orthomcl: goodProteins.fasta
- from orthomcl: groups.txt
- from orthomcl: compliantFasta directory

You will need to convert the presence_absense_grid.csv file (the spelling the script uses for its output) into a phylip-like format. To do this you must do four things:

transpose the data,
remove the first line (header information),
insert spaces between the 'taxa' names and 0/1s and
add number of taxa and number of 'sites' to the top of the file.

To do this:

 1. perl -F, -lane 'for ( 0 .. $#F ) { $rows[$_] .= $F[$_] }; eof && print map "$_\n", @rows' presence_absense_grid.csv > presence_absense_grid_transposed.csv
 2. perl -ni -e 'print unless $. == 1' presence_absense_grid_transposed.csv
 3. cp presence_absense_grid_transposed.csv presence_absense_grid.phy
 4. sed -i 's/\(\w\{4\}\)\(.*\)/\1      \t\2/g' presence_absense_grid.phy
 5. Open the file in your favourite text editor and add " XX YYYY" as the first line, where XX = number of taxa and YYYY = number of ortholog groups
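
The same conversion can be sketched in Python as a single function, which also writes the header line automatically. This is a hedged alternative, not part of the repository. It assumes the CSV has one header row of taxon names (with the first cell labelling the group column) followed by one row per ortholog group. The function name `grid_to_phylip` is hypothetical. Names are padded to ten characters because PHYLIP reads the species name from the first ten characters of each line.

```python
import csv
import io


def grid_to_phylip(csv_text, name_width=10):
    """Transpose a groups-by-taxa 0/1 grid into phylip-like text.

    Assumed CSV layout: header = [label, taxon1, taxon2, ...],
    then one row per group = [group_name, 0/1, 0/1, ...].
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    taxa = header[1:]
    # First line: number of taxa, then number of 'sites' (ortholog groups).
    lines = [f" {len(taxa)} {len(body)}"]
    for col, taxon in enumerate(taxa, start=1):
        states = "".join(row[col] for row in body)
        # Truncate or pad the name to a fixed width, then append the states.
        lines.append(f"{taxon[:name_width]:<{name_width}}{states}")
    return "\n".join(lines) + "\n"


# Made-up example grid, for illustration only.
example = "group,hsap,mmus,scer\ngrp1,1,1,0\ngrp2,0,1,1\n"
print(grid_to_phylip(example), end="")
```

Unlike the `sed` step above, this does not depend on taxon names being four characters long, but names longer than ten characters are truncated and could collide.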

Before running extract_dollop_output_sequences_v2-fast.pl you will need to run 'dollop' from the PHYLIP package. This requires the phylip-like file you created in the step above, and possibly a user-specified tree topology.

extract_dollop_output_sequences_v2-fast.pl

A program to take the "outfile" from a PHYLIP DOLLOP run and parse the output into a more useful format. It requires at least two pieces of information:

- the "outfile" from DOLLOP
- the number of states being tested (i.e. number of ortholog groups)

For running with OrthoMCL data you will also need:

- a list of orthogroup names (edit groups.txt accordingly)
- a directory with the *.fasta sequences of each orthogroup (from the previous script)

The program offers four output options:

- A phylip-like output file, parsed from the state information in the "outfile" from dollop.
- A tab-separated report listing the node and the number of gain, loss and 'core' states, in either of two styles:
  - 'New-style': the node column is a single value, either an internal node number or a leaf label. Numbering starts from '1' at the first leaf.
  - 'Old-style': the node column is split into two values, corresponding to positions across the tree using the node values from the dollop outfile.
- A directory for each node transition, named in the 'old-style' format (e.g. root__1 or 3__label), each containing three list files: one each for loss, gain and core.
- Copies of the *.fasta files for each of the lists in the previous option, taken from a given location and placed in their appropriate directory.
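
To make the gain/loss/core lists concrete, the following Python sketch classifies ortholog groups along a single branch. It is not the logic of the Perl script. It assumes each node's states have already been resolved to a plain string of 0s and 1s, one character per group, and it assumes that 'core' means present at both ends of the branch. The function name `classify_branch` is hypothetical.

```python
def classify_branch(parent, child, group_names):
    """Sort groups into gain, loss and core lists for one parent -> child branch.

    `parent` and `child` are equal-length strings of '0'/'1' states.
    """
    if not (len(parent) == len(child) == len(group_names)):
        raise ValueError("state strings and group list must have equal length")
    result = {"gain": [], "loss": [], "core": []}
    for name, p, c in zip(group_names, parent, child):
        if p == "0" and c == "1":
            result["gain"].append(name)   # absent in parent, present in child
        elif p == "1" and c == "0":
            result["loss"].append(name)   # present in parent, absent in child
        elif p == "1" and c == "1":
            result["core"].append(name)   # assumption: 'core' = present in both
        # Anything else (0 -> 0, or unresolved symbols) is left unclassified.
    return result


# Made-up example states, for illustration only.
groups = ["grp1", "grp2", "grp3", "grp4"]
report = classify_branch("0110", "1010", groups)
print(report)  # {'gain': ['grp1'], 'loss': ['grp2'], 'core': ['grp3']}
```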

Mandatory Input:
	-i Dollop outfile
	-s Number of states (orthogroups)
	-o Output Directory
Other Options (one or all required):
outfile to phylip-like:
	-c Convert to Phylip-like File (not needed if using -p)
Report Tables:
	-n New-style Report (includes internal tree node numbers)
	-r Old-style report (between-nodes from dollop outfile)
Structured Lists:
	-l Lists of Core/Losses/Gains (requires -g)
	-g Ortholog Group List
	-p phylip-like file (unless -c)
Sequence Collation:
	-f Get .fasta files for groups from directory (requires -d)
	-d The *.fasta directory
	-n Nodes directory (unless -l)
	-x Exclude "core" marked ortholog groups
e.g. Equivalent: program.pl -i input -s number -o output -cr -l -g list.txt or program.pl -i input -s number -o output -rl -g list.txt -p phylip-like.phy

add_internal_node_labels.pl

For the R script below to work, your tree needs to have its internal nodes labelled as well as its leaves. This script adds numbers to all the internal nodes; you need to specify how many leaf nodes (i.e. taxa) you have.
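
The idea can be sketched in a few lines of Python. This is an illustration only, and the numbering order used by add_internal_node_labels.pl may differ. The sketch assumes a Newick string with no quoted labels, numbers internal nodes in the order their closing parentheses appear, and starts counting at the number of leaves plus one. The function name `label_internal_nodes` is hypothetical.

```python
def label_internal_nodes(newick, n_leaves):
    """Insert a numeric label after each unlabelled ')' in a Newick string.

    Labels start at n_leaves + 1 (an assumption made for this sketch).
    """
    out = []
    next_label = n_leaves + 1
    for i, ch in enumerate(newick):
        out.append(ch)
        if ch == ")":
            following = newick[i + 1] if i + 1 < len(newick) else ";"
            # Only label nodes that have no label yet: the next character is
            # then a branch length, a sibling separator, a close, or the end.
            if following in ":,);":
                out.append(str(next_label))
                next_label += 1
    return "".join(out)


# Made-up example tree with four leaves, for illustration only.
tree = "((A,B),(C,D));"
print(label_internal_nodes(tree, 4))  # ((A,B)5,(C,D)6)7;
```

Branch lengths are preserved, since the label is inserted between the closing parenthesis and the colon.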

plot_tree_gains_loss_in_R.r

Requires

ggplot2
ggtree
internally labelled tree
'new-style' output from extract_dollop_output_sequences_v2-fast.pl

redundancy_check.pl

orthomcl_tools's Issues

Does not generate output directory

When using the -o option with -ac it does not create the output directory, failing with this message:
Generating: Presence\Absense Grid
Can't open 'agrp_1e5/presence_absense_grid.csv' for writing: 'No such file or directory' at orthomcl_groups_analysis.pl line 98

questions about results from orthomcl_groups_analysis.pl

Hi @guyleonard

Thanks for your nice tools. I have some questions about the results from orthomcl_groups_analysis.pl. I have attached the first 10 lines from my groups.txt, count_list.csv and presence_absense_grid.csv. Please check them out.

My first question is: in total I have 16 taxa, which are listed in the header of count_list.csv. My understanding is that count_list.csv contains how many proteins are present in each group for each taxon, so there should be 16 columns for each group. However, I found more than 16 columns for many groups in count_list.csv. For example, the first group has 21 numbers instead of 16. Is there something wrong?

My second question is: while orthomcl_groups_analysis.pl was running, I got error messages like this:

Use of uninitialized value $accession in hash element at /data/chen/software/orthomcl_tools-1.0/orthomcl_groups_analysis.pl line 155, <$_[...]> line 36816.
Use of uninitialized value $accession in hash element at /data/chen/software/orthomcl_tools-1.0/orthomcl_groups_analysis.pl line 155, <$_[...]> line 36817.
Use of uninitialized value $accession in hash element at /data/chen/software/orthomcl_tools-1.0/orthomcl_groups_analysis.pl line 155, <$_[...]> line 36818.
Use of uninitialized value $accession in hash element at /data/chen/software/orthomcl_tools-1.0/orthomcl_groups_analysis.pl line 155, <$_[...]> line 36819.
Use of uninitialized value $accession in hash element at /data/chen/software/orthomcl_tools-1.0/orthomcl_groups_analysis.pl line 155, <$_[...]> line 36820.
Use of uninitialized value $accession in hash element at /data/chen/software/orthomcl_tools-1.0/orthomcl_groups_analysis.pl line 155, <$_[...]> line 36821.
Use of uninitialized value $accession in hash element at /data/chen/software/orthomcl_tools-1.0/orthomcl_groups_analysis.pl line 155, <$_[...]> line 36822.

There are lots of these messages. What happened? Is there something wrong? Thanks.

Best regards,
Chongjing Xia
groups_head.txt

count_list_head.txt
presence_absense_grid_head.txt

ran into error

Hi guyleonard,
I ran into this error when running orthomcl_groups_analysis.pl.

'Retreiving: Family1 with 1215 sequences
Can't call method "seq" on an undefined value at ../orthomcl_tools/orthomcl_groups_analysis.pl line 204, <$_[...]> line 61419.'

I used this command:
perl ../orthomcl_tools/orthomcl_groups_analysis.pl -o dollopinput -g groups.txt -p compliantFasta/goodProteins.fasta -a Presence_Absense.grid.txt -c tally.grid -f compliantFasta/

Can you help me out with this?
Thanks a lot for your help and for developing this utility.

Kind Regards,
Huanlee

what is the orthoMCL groups.txt

Hi @guyleonard

I am trying to use your orthomcl_groups_analysis.pl to analyze orthomcl output. What is the orthoMCL groups.txt? From OrthoMCL I got the output mclOutput, and I also found orthologs.txt, coorthologs.txt, and inparalogs.txt in the /pairs directory. Which one is the input for orthomcl_groups_analysis.pl? Or do I have to prepare groups.txt myself? Thanks.

Best,
Chongjing

extract_dollop_output

Hi,

I was trying to use extract_dollop_output_sequences_v2-fast.pl on the outfile from dollop, but in my output files "*newstyle_report.txt" the same node number is repeated several times.
When I ran the script, the following message was shown for many lines:
"Argument "Cpe29" isn't numeric in addition (+) at extract_dollop_output_sequences_v2-fast.pl line 375, <$[...]> line 248"
Can you please help me with that?
[My data consist of 124 species and 87 states]
with regards
Kavitha

identify the lost gene

Hi Leonard,

I have run the dollop gene gain/loss analysis for a set of genes. I have not used ortholog grouping, as they mostly all belong to the same family. I have got the numbers of gene gains/losses. I also need to identify which genes are lost along the phylogeny, including the internal branches. Is it possible to obtain information about which genes are lost?

regards
Kavitha