Comments (4)
Thank you for your kind words!
Indeed I do not think we have such a file among the possible outputs.
The closest would probably be the "matrix.csv" file (see this), which has all the information you want but lists the genes themselves rather than the raw counts. Transforming this file may be your easiest way out.
Otherwise, using the gene_families.tsv file along with one of the files that link genes to genomes (e.g., the gff or the gene annotation table files) is a good solution too.
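The second approach above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the script attached later in this thread: it assumes gene_families.tsv is tab-separated with the family ID in the first column and the gene ID in the second, and it takes the gene-to-genome link as a plain dictionary (a hypothetical input that you would fill from the gff or annotation tables). The names `build_count_table` and `write_count_table` are made up for illustration.

```python
import csv
import io
from collections import Counter, defaultdict


def build_count_table(gene_families, gene_to_genome):
    """Count genes per (family, genome).

    gene_families: iterable of tab-separated lines, assumed to hold
                   family ID, gene ID, then optional extra columns.
    gene_to_genome: dict mapping gene ID -> genome name (hypothetical input).
    Returns {family: Counter({genome: number_of_genes})}.
    """
    counts = defaultdict(Counter)
    for row in csv.reader(gene_families, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        family, gene = row[0], row[1]
        genome = gene_to_genome.get(gene)
        if genome is None:
            continue  # gene absent from the mapping
        counts[family][genome] += 1
    return counts


def write_count_table(counts, genomes, handle):
    """Write one row per family and one column per genome."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["family", *genomes])
    for family in sorted(counts):
        writer.writerow([family, *(counts[family][g] for g in genomes)])


# Toy data standing in for the real files.
demo_families = ["famA\tg1", "famA\tg2", "famA\tg3", "famB\tg4"]
demo_mapping = {"g1": "genome1", "g2": "genome2", "g3": "genome2", "g4": "genome1"}

demo_counts = build_count_table(demo_families, demo_mapping)
out = io.StringIO()
write_count_table(demo_counts, ["genome1", "genome2"], out)
print(out.getvalue())
```

With real data, an open file handle for gene_families.tsv can be passed as `gene_families`, and the genome list fixes the column order of the output.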
Adelme
from ppanggolin.
Thank you for your prompt response and suggestion!
I have two more questions.
Following your suggestion, I wrote a custom script for this. The total number of genes in the result file matches the number of lines in gene_families.tsv. I would be grateful if you could check whether the attached file can be routinely integrated with ppanggolin in my analysis pipeline.
make_count_table.py.txt
First, ppanggolin marks fragmented genes with an F in the third column of the gene_families.tsv file. I included them in the analysis, on the reasoning that fragmented gene sequences can still be functionally annotated and used in downstream analysis. What is your opinion on including fragmented genes in downstream analysis?
Second, about the gene family threshold. In the paper, 80% coverage and 80% identity were used for gene family construction. These values are frequently used for gene family clustering, but I have a question about adjusting coverage and identity at the species level in microbial comparative genomics. If the species differ, should users choose different clustering criteria, or can the default values be used? And if someone lowers the identity to 50%, could that introduce severe bias into downstream analyses that rely on functional annotation of the pangenome reference sequences?
Thank you very much Adelme
Your script looks fine to me; it does seem to be doing what you want.
About the fragmented genes, it depends on the "downstream analysis" and the biological question. From the technical point of view, if you are annotating genes independently from their gene families, then I think it is fine. If you are doing functional annotations at the scope of gene families, I'd remove them as they may not be able to realize the "function" that they would be annotated with.
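For family-level annotation, one way to act on this is to separate the fragmented genes before annotating. The sketch below assumes the layout described earlier in the thread (an F in the third column of gene_families.tsv marks a fragmented gene) and additionally assumes the first two columns are family ID and gene ID; `split_by_fragment_flag` is a hypothetical helper name, not part of ppanggolin.

```python
import csv


def split_by_fragment_flag(lines):
    """Split genes into complete and fragmented groups, keyed by family.

    lines: iterable of tab-separated rows, assumed to hold
           family ID, gene ID, and an optional flag ("F" = fragmented).
    Returns (complete, fragmented), each a dict {family: [gene, ...]}.
    """
    complete, fragmented = {}, {}
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        family, gene = row[0], row[1]
        flag = row[2].strip() if len(row) > 2 else ""
        target = fragmented if flag == "F" else complete
        target.setdefault(family, []).append(gene)
    return complete, fragmented


# Toy rows: g2 and g3 are flagged as fragments.
demo_rows = ["famA\tg1\t", "famA\tg2\tF", "famB\tg3\tF"]
complete, fragmented = split_by_fragment_flag(demo_rows)
print(complete)
print(fragmented)
```

A family that appears only in `fragmented` (famB in the toy rows) has no complete member, so any family-level annotation for it probably deserves a closer look.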
On the question of the gene family threshold, I would indeed recommend lowering the identity threshold for clustering. If they are "close" sister species (e.g. Neisseria meningitidis and Neisseria gonorrhoeae), 80% is fine, but in general lowering it is better. However, you are correct: it may generate a strong bias if you are annotating your gene families, because some paralogs with different functions may then receive exactly the same annotation. That will only be true for some families, though. It is a balance between wrongly clustered paralogs and wrongly split orthologs, and you can adjust the threshold depending on what matters most for your own analysis or biological question.
In my opinion, while annotating gene families is "practical" and much faster, annotating genes directly is still best if you want to avoid mistakes as much as possible.
Have a nice day!
Adelme
Thank you so much for your very detailed explanation.
I asked about the gene family clustering threshold and fragmented genes because I am handling fragmented genomes such as MAGs. As you commented, annotating genomes individually will give the best accuracy, I think. I will test a few things based on your advice. My questions are all resolved. Thank you once again.
Have a nice day!