

SionBayliss avatar SionBayliss commented on August 15, 2024

Hi Filipe,

PIRATE first clusters sequences down to 98% identity (the default) using CD-HIT, then performs an all-vs-all BLAST/DIAMOND search followed by hierarchical MCL clustering at a series of amino acid similarity cutoffs. The MCL clusters from the lowest % identity cutoff are used as the gene_families/clusters in the output of PIRATE. For each cluster, the highest %id value at which all of its sequences still cluster together is reported as the 'threshold' value in the output file (e.g. if all 4 sequences in geneA cluster together at 50, 60, 70, 80 and 90% identity but split into two clusters of 2 sequences at 95%, then 90% is reported as the threshold value in gene_families.tsv).
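
To make the 'threshold' rule concrete, here is a rough Python sketch of the geneA example. It is not PIRATE's actual code: the function name `family_threshold`, the input layout (a dict mapping each % identity cutoff to per-sequence cluster labels) and the sequence/cluster names are all made up for illustration.

```python
def family_threshold(clusters_by_cutoff):
    """Return the highest %id cutoff at which every sequence in the
    family shares a single cluster, or None if that never happens.

    clusters_by_cutoff is a hypothetical structure:
    {cutoff: {sequence_id: cluster_label}}
    """
    best = None
    # Walk the cutoffs from lowest to highest. The clustering is
    # hierarchical, so once a family splits it stays split.
    for cutoff in sorted(clusters_by_cutoff):
        labels = set(clusters_by_cutoff[cutoff].values())
        if len(labels) == 1:
            best = cutoff
        else:
            break
    return best


# geneA: four sequences that stay together up to 90% and split 2/2 at 95%.
together = {"seq1": "c1", "seq2": "c1", "seq3": "c1", "seq4": "c1"}
split = {"seq1": "c1", "seq2": "c1", "seq3": "c2", "seq4": "c2"}
gene_a = {50: together, 60: together, 70: together,
          80: together, 90: together, 95: split}

print(family_threshold(gene_a))  # 90
```

Here 90 is what would appear as the threshold for geneA, while the two clusters at 95% would correspond to what PIRATE calls 'alleles' of the family.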

As PIRATE uses a much lower default %id value than roary, it will almost always report a smaller number of genes (gene_families). At higher thresholds (98%) the numbers will be more in line with default roary. PIRATE reports these higher %id clusters as 'alleles' of a gene family.

PIRATE may report some genes with a very high copy number per isolate. These may represent related gene families that are poorly resolved by sequence similarity and by the rudimentary paralog splitting algorithm used in PIRATE. If this is the case, there are a few things I can suggest to improve the resolution of the clustering.

As for the selection of thresholds, it depends on whether you have a clonal or a diverse collection of isolates. If you have a reasonably clonal collection, I would increase the default % identity threshold values, as the lower thresholds provide little useful information and may lead to closely related genes being clustered together (Note: you can also increase the CD-HIT clustering cutoff to provide finer scale allelic info for clonal collections). For diverse collections lower thresholds would be more suitable, but comparison to tools like roary, which are designed for relatively clonal collections/species, may not be productive.

All the best,
Sion

from pirate.
