Hi Filipe,
PIRATE first clusters sequences down to 98% identity (the default) using CD-HIT, then performs an all-vs-all BLAST/DIAMOND search followed by hierarchical MCL clustering at a series of amino acid % identity cutoffs. The MCL clusters at the lowest % identity cutoff are used as the gene_families/clusters in the PIRATE output. The highest % identity at which all of the isolates in that cluster are still clustered together is reported as the 'threshold' value in the output file. For example, if all 4 sequences in geneA cluster together at 50, 60, 70, 80 and 90% identity but split into two clusters of 2 sequences at 95%, then 90% is reported as the threshold value in gene_families.tsv.
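To make the 'threshold' rule concrete, here is a minimal Python sketch of the geneA example. It is an illustration only, not PIRATE's actual code: the function name `family_threshold`, the sequence names and the dictionary layout are all hypothetical.

```python
def family_threshold(clusters_by_threshold, members):
    """Return the highest % identity at which all members share one cluster.

    clusters_by_threshold maps a % identity cutoff to a list of clusters
    (each cluster is a list of sequence names). Hypothetical layout, not
    a PIRATE data structure.
    """
    wanted = set(members)
    best = None
    for threshold in sorted(clusters_by_threshold):
        clusters = clusters_by_threshold[threshold]
        if any(wanted <= set(cluster) for cluster in clusters):
            best = threshold  # family is still intact at this cutoff
        else:
            break  # family has split; higher cutoffs cannot rejoin it
    return best


# geneA: 4 sequences together at 50-90%, split into 2 + 2 at 95%.
members = ["seq1", "seq2", "seq3", "seq4"]
clusters_by_threshold = {t: [members] for t in (50, 60, 70, 80, 90)}
clusters_by_threshold[95] = [["seq1", "seq2"], ["seq3", "seq4"]]

threshold = family_threshold(clusters_by_threshold, members)
print(threshold)  # 90
```

The loop stops at the first cutoff where the family splits, which assumes the clustering is hierarchical, as described above.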
As PIRATE uses a much lower default % identity value than roary, it will almost always report a smaller number of genes (gene_families). At higher thresholds (98%) the counts will be more in line with default roary. PIRATE reports these higher % identity clusters as 'alleles' of a gene family.
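The effect of the cutoff on the number of reported clusters can be seen in a toy example. Note that this sketch swaps MCL for simple single-linkage grouping (connected components via union-find), and the pairwise % identity values are invented for illustration; `count_clusters` is a hypothetical helper, not part of PIRATE or roary.

```python
from itertools import combinations


def count_clusters(identity, names, threshold):
    """Count single-linkage clusters: link any pair with identity >= threshold.

    This is a stand-in for MCL, used only to show the trend.
    """
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the cluster representative.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(names, 2):
        if identity.get((a, b), 0) >= threshold:
            parent[find(a)] = find(b)
    return len({find(name) for name in names})


# Invented pairwise % identities for four sequences.
names = ["a1", "a2", "b1", "b2"]
identity = {
    ("a1", "a2"): 99, ("b1", "b2"): 96,
    ("a1", "b1"): 72, ("a1", "b2"): 70,
    ("a2", "b1"): 71, ("a2", "b2"): 69,
}

counts = {t: count_clusters(identity, names, t) for t in (50, 95, 98)}
print(counts)  # {50: 1, 95: 2, 98: 3}
```

At 50% all four sequences form one family, while at 98% they separate into three clusters, which is the same trend as fewer gene families at a low default cutoff and more 'alleles' at higher ones.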
PIRATE may report some genes with a very high copy number per isolate. These may represent related gene families that are poorly resolved by sequence similarity and by the rudimentary paralog splitting algorithm used in PIRATE. If this is the case, there are a few things I can suggest to improve the resolution of the clustering.
The choice of thresholds depends on whether you have a clonal or a diverse collection of isolates. If you have a reasonably clonal collection, I would increase the default % identity threshold values, as lower thresholds provide little useful information and may lead to clustering of closely related genes (Note: you can also increase the CD-HIT clustering cutoff to provide finer scale allelic info for clonal collections). For diverse collections lower thresholds would be more suitable, but comparison to tools like roary, which are designed for relatively clonal collections/species, may not be productive.
All the best,
Sion