Comments (7)
I have found the problem. After using the workaround and generating new clusters from the GFF files, I compared ppanggolin_result/gene_families.tsv to my own clustering file, and there were indeed some proteins in my clustering file that were not in any of the GFF files.
So the problem was the way I generated my GFFs, which omitted some proteins, and not ppanggolin or the protein IDs.
Thanks, and sorry for this mistake. I can add another comment once I succeed with the new GFF files.
from ppanggolin.
The error you got is quite misleading.
We've already identified some issues with how external clustering files are processed, and have patched them and improved error handling in PR #278, so the error messages should be clearer in the next release.
Thank you as well for pointing out the inconsistencies in the documentation; I'll fix them. (I've also noticed that the documentation for the family_tsv file is not up to date, so I'll address that too.)
from ppanggolin.
It ended up working with my new GFF files. Thanks for the workaround, which helped me debug!
from ppanggolin.
On my machine, the example in the git repository works perfectly like this:
ppanggolin workflow --anno genomes.gbff.list -c 15 -o mock_test --clusters clusters.tsv --infer_singletons
So the problem really seems to come from my clustering file.
I tried reformatting mine like the example (representative-tab-gene) and I still get the same error.
Thanks,
Eric
from ppanggolin.
Hi,
Thanks for raising this issue!
It seems like the problem might be due to how the genes are named in your clustering table, which doesn’t match the way PPanGGOLiN expects them.
PPanGGOLiN uses the gene ID from the CDS line of the GFF file. For example, with the following gene:
NZ_CALCQZ010000140.1 RefSeq gene 11218 11601 . - . ID=gene-QDP31_RS09520;Name=QDP31_RS09520
NZ_CALCQZ010000140.1 Protein Homology CDS 11218 11601 . - 0 ID=cds-WP_279012685.1;Parent=gene-QDP31_RS09520
The ID would be cds-WP_279012685.1.
At the end of the annotation step, PPanGGOLiN checks whether all gene IDs are unique. If they aren't, it uses internal IDs in the format <genome>_CDS_<id>, such as GCF_000173495.1_CDS_0759. In this case you get the following log: INFO: gene identifiers used in the provided annotation files were not unique, PPanGGOLiN will use self-generated identifiers.
To check if PPanGGOLiN used the annotation file's IDs or generated its own, you can run this command:
ppanggolin info -p myannopang/pangenome.h5 --parameters
If you see # used_local_identifiers: False, it means PPanGGOLiN used internal IDs instead of those from the annotation file.
In your case, it looks like the genes in your clustering table follow the pattern <genomeID>:<contigID>_<id>, which PPanGGOLiN doesn't recognize and can't map back to the pangenome genes.
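Before rerunning, it can help to check the IDs yourself. Here is a minimal sketch in plain Python (not part of PPanGGOLiN) that collects the ID attribute of every CDS line in a GFF, lists IDs that occur more than once, and reports genes from a clustering table that have no matching CDS ID. The function names are hypothetical, and the table layout (tab-separated, gene ID in the second column) is an assumption based on the representative-tab-gene example.

```python
import re
from collections import Counter


def cds_ids_from_gff(lines):
    """Return the ID attribute of every CDS feature, in file order."""
    ids = []
    for line in lines:
        if line.startswith("##FASTA"):
            break  # embedded sequences follow; no more feature lines
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "CDS":
            continue
        # Match "ID=" only at the start of an attribute, not inside "Parent=..."
        match = re.search(r"(?:^|;)ID=([^;]+)", cols[8])
        if match:
            ids.append(match.group(1))
    return ids


def duplicated(ids):
    """IDs seen more than once. Note: a CDS split over several GFF lines
    that share one ID is also reported here."""
    return sorted(i for i, n in Counter(ids).items() if n > 1)


def missing_from_annotation(cluster_lines, gff_ids):
    """Genes in the clustering table (assumed: second column) with no CDS ID."""
    known = set(gff_ids)
    missing = []
    for line in cluster_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 2 and cols[1] not in known:
            missing.append(cols[1])
    return missing


# Small in-memory demo; the clustering rows are made up for illustration.
gff_demo = [
    "NZ_CALCQZ010000140.1\tRefSeq\tgene\t11218\t11601\t.\t-\t.\t"
    "ID=gene-QDP31_RS09520;Name=QDP31_RS09520\n",
    "NZ_CALCQZ010000140.1\tProtein Homology\tCDS\t11218\t11601\t.\t-\t0\t"
    "ID=cds-WP_279012685.1;Parent=gene-QDP31_RS09520\n",
]
clusters_demo = [
    "famA\tcds-WP_279012685.1\n",
    "famA\tsome_gene_not_in_gff\n",
]
demo_ids = cds_ids_from_gff(gff_demo)
print(demo_ids)
print(duplicated(demo_ids))
print(missing_from_annotation(clusters_demo, demo_ids))
```

For real files, pass open file handles instead of the demo lists and merge the IDs from all GFFs before calling missing_from_annotation.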
from ppanggolin.
I understand that working with external clustering files can be tricky, especially when PPanGGOLiN uses its own internal IDs. A possible workaround is to run the clustering step with PPanGGOLiN and then generate the family_tsv file using the write_pangenome command.
This file will list the gene family ID, gene ID, and local ID (which corresponds to the ID in the GFF file). Essentially, the second and third columns will help you map the internal IDs to the CDS IDs from the annotation file.
To sum up, the commands would be:
ppanggolin annotate --anno list_gff.tsv -o ppanggolin_result
ppanggolin cluster -p ppanggolin_result/pangenome.h5
ppanggolin write_pangenome --families_tsv -o ppanggolin_result -f
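Once the file is written, the mapping can be loaded with a few lines of Python. This is only a sketch under stated assumptions: the rows are tab-separated with the family ID, gene ID, and local ID in the first three columns, an empty local ID is treated as "the gene ID is already the annotation ID" (an assumption, not confirmed here), and the two function names are hypothetical. The second helper lists genes from your own clustering table (assumed: gene ID in the second column) that PPanGGOLiN does not know about.

```python
import csv


def read_id_map(rows):
    """Map PPanGGOLiN gene ID -> local (annotation) ID from families TSV rows."""
    id_map = {}
    for row in rows:
        if len(row) < 3:
            continue  # skip blank or malformed rows
        gene_id, local_id = row[1], row[2]
        # Assumption: an empty third column means gene_id is the annotation ID.
        id_map[gene_id] = local_id or gene_id
    return id_map


def unknown_genes(cluster_rows, id_map):
    """Genes in an external clustering table that match no annotation ID."""
    known = set(id_map.values())
    return [row[1] for row in cluster_rows if len(row) >= 2 and row[1] not in known]


# In-memory demo with made-up rows; parsed with csv.reader as a real file would be.
families_demo = csv.reader(
    [
        "fam1\tGCF_000173495.1_CDS_0759\tcds-WP_279012685.1",
        "fam1\tGCF_000173495.1_CDS_0760\tcds-WP_000000001.1",
    ],
    delimiter="\t",
)
demo_map = read_id_map(families_demo)
my_clusters_demo = [["rep1", "cds-WP_279012685.1"], ["rep1", "cds-not-in-gff"]]
print(demo_map)
print(unknown_genes(my_clusters_demo, demo_map))
```

With real data, open ppanggolin_result/gene_families.tsv and pass csv.reader(handle, delimiter="\t") to read_id_map. If the file starts with a header row, it ends up as one extra harmless entry in the map.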
from ppanggolin.
Hi,
Thanks for your quick reply!
I renamed my proteins to the pattern <genomeID>:<contigID>_<id> because in my dataset some contigs from different genomes have redundant names, and I edited the GFF files accordingly, after the ID= flag.
Maybe something is wrong with the format of the names? My IDs are recognized as unique by ppanggolin according to the log:
2024-09-03 14:11:02 annotate.py:l1084 INFO gene identifiers used in the provided annotation files were unique, PPanGGOLiN will use them.
I can try your workaround and let you know if it works. As a last resort I can just run it without providing my clustering results, but I'm trying to save time as I have a huge dataset.
Thanks a lot!
from ppanggolin.