
Comments (4)

axbazin commented on June 27, 2024

Thank you for your kind words!

Indeed, I do not think we have such a file among the possible outputs.
The closest would probably be the "matrix.csv" file, which has all the information you want but lists the genes rather than the raw counts. Transforming this file may be your easiest way out.
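A minimal sketch of that transformation, assuming the matrix is a comma-separated file with a header row, the family name in the first column, some leading metadata columns, and then one column per genome whose cells hold the gene IDs. The function name, the number of metadata columns (`n_meta_cols`) and the separator between gene IDs in a cell (`gene_sep`) are assumptions for illustration; check them against your own file before relying on the counts.

```python
import csv
import io


def matrix_to_counts(csv_text, n_meta_cols=14, gene_sep=" "):
    """Convert a gene presence matrix (gene IDs per cell) into gene counts.

    n_meta_cols and gene_sep are assumptions about the file layout,
    not values confirmed by the thread; adjust them to your matrix.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    genomes = header[n_meta_cols:]
    counts = {}
    for row in reader:
        family = row[0]
        cells = row[n_meta_cols:]
        # An empty cell means the family is absent from that genome.
        counts[family] = {
            genome: len([gene for gene in cell.split(gene_sep) if gene])
            for genome, cell in zip(genomes, cells)
        }
    return counts


# Toy input with a single metadata column (the family name).
toy = "Gene,genomeX,genomeY\nfamA,a1 a2,b1\nfamB,,b2\n"
toy_counts = matrix_to_counts(toy, n_meta_cols=1)
print(toy_counts)
```

With a real file, the text could be read with `open(path).read()` and passed in the same way.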

Otherwise, using the gene_families.tsv file along with one of the files that link genes to genomes (e.g., the gff or the gene annotation table files) is a good solution too.
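This second route could be sketched as follows. It assumes gene_families.tsv is tab-separated with the family in the first column and the gene ID in the second, and that the gene-to-genome link has already been extracted into a two-column table (gene ID, genome name). That mapping table and the function names are hypothetical; in practice the mapping would have to be built from the gff or annotation files.

```python
from collections import Counter, defaultdict


def parse_tsv(lines):
    """Split non-empty tab-separated lines into lists of fields."""
    return [line.rstrip("\n").split("\t") for line in lines if line.strip()]


def count_genes_per_genome(family_lines, mapping_lines):
    """Count genes per (family, genome) pair.

    family_lines: lines of gene_families.tsv (family, gene, ...).
    mapping_lines: lines of a hypothetical gene -> genome table.
    """
    gene_to_genome = {row[0]: row[1] for row in parse_tsv(mapping_lines)}
    counts = defaultdict(Counter)
    for row in parse_tsv(family_lines):
        family, gene = row[0], row[1]
        counts[family][gene_to_genome[gene]] += 1
    return counts


# Toy data standing in for the real files.
families = ["famA\tgene1\t\n", "famA\tgene2\tF\n", "famB\tgene3\t\n"]
mapping = ["gene1\tgenomeX\n", "gene2\tgenomeX\n", "gene3\tgenomeY\n"]
table = count_genes_per_genome(families, mapping)

# Sanity check: the counts should add up to the number of family lines.
total = sum(sum(per_genome.values()) for per_genome in table.values())
print(total == len(families))
```

The final check compares the total count against the number of lines in the family file, which is a cheap way to catch genes that were dropped or counted twice.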

Adelme

from ppanggolin.

aababc1 commented on June 27, 2024

Thank you for your prompt response and suggestion!

I have two more questions.

As you suggested, I wrote a custom script for that. The total number of genes in the result file matches the number of lines in gene_families.tsv. I would be grateful if you could check whether the attached file can be routinely used alongside ppanggolin in my analysis pipeline.
make_count_table.py.txt

The first one: ppanggolin marks fragmented genes with an F in the third column of the gene_families.tsv file. I included them in the analysis, since I thought the fragmented gene sequences could still be annotated functionally and used in downstream analysis. I would like your opinion on including fragmented genes in downstream analysis.

The second one is about the gene family thresholds. In the paper, 80% coverage and 80% identity were used for gene family construction. These values are frequently used for gene family clustering, but I have a question about adjusting the coverage and identity at the species level in microbial comparative genomic analysis. If the species are different, should users choose different clustering criteria, or can the default values be used? And if someone lowers the identity to 50%, could that introduce a severe bias in downstream analyses that rely on functional annotation of the pangenome reference sequences?

Thank you very much, Adelme.



axbazin commented on June 27, 2024

Your script looks fine to me; it does seem to be doing what you want.

About the fragmented genes, it depends on the "downstream analysis" and the biological question. From a technical point of view, if you are annotating genes independently from their gene families, then I think it is fine. If you are doing functional annotations at the scope of gene families, I'd remove them, as they may not be able to carry out the "function" that they would be annotated with.
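If you do want to leave fragments out for family-level annotation, a sketch along these lines could separate them first. It relies only on what was described above, namely that fragmented genes carry an F in the third column of gene_families.tsv; the function name and the toy lines are made up for illustration.

```python
def split_fragments(family_lines):
    """Separate complete genes from fragmented ones.

    Assumes tab-separated lines of (family, gene, flag), where the
    flag is "F" for a fragmented gene and empty or absent otherwise.
    """
    complete, fragmented = [], []
    for line in family_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        family, gene = fields[0], fields[1]
        is_fragment = len(fields) > 2 and fields[2].strip() == "F"
        target = fragmented if is_fragment else complete
        target.append((family, gene))
    return complete, fragmented


# Toy lines standing in for gene_families.tsv.
lines = ["famA\tgene1\t\n", "famA\tgene2\tF\n", "famB\tgene3\n"]
kept, dropped = split_fragments(lines)
print(len(kept), "complete genes,", len(dropped), "fragments")
```

Keeping the fragments in a separate list rather than discarding them makes it easy to run the gene-level annotation on everything and the family-level annotation on the complete genes only.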

For the question of gene family thresholds, I would indeed recommend lowering the identity threshold for clustering. If they are "close" sister species (e.g. Neisseria meningitidis and Neisseria gonorrhoeae), 80% is fine, but in general lowering it is better. However, you are correct: it may generate a strong bias if you are annotating your gene families, as some paralogs with different functions may then be annotated exactly the same way. That will only be true for some families, though. It is a balance between wrongly clustered paralogs and wrongly split orthologs; you can adjust the threshold depending on what is important to your own analysis/biological question.

In my opinion, while annotating gene families is "practical" and much faster, annotating genes directly is still best if you want to avoid mistakes as much as possible.

Have a nice day!
Adelme


aababc1 commented on June 27, 2024

Thank you so much for your very detailed explanation.

I asked about the gene family clustering threshold and fragmented genes because I am handling fragmented genomes such as MAGs. As you commented, I think annotating genomes individually will give the best accuracy. I will test some things based on your advice. My questions are all resolved. Thank you once again.

Have a nice day!
