Hello ! I'm trying to use my external clustering results with my dat

Can't use external clustering: "Exception: Representative gene has not been set" about ppanggolin HOT 7 CLOSED

ericolo commented on September 27, 2024

Can't use external clustering: "Exception: Representative gene has not been set"

from ppanggolin.

Comments (7)

ericolo commented on September 27, 2024 2

I found the problem. After using the workaround and generating new clusters from the GFF files, I compared ppanggolin_result/gene_families.tsv with my own clustering file, and there were indeed some proteins in my clustering file that were not in any of the GFF files.

So the problem was the way I generated my GFFs, which omitted some proteins, and not ppanggolin or the protein IDs.

Thanks, and sorry for this mistake. I can add another comment once I succeed with the new GFF files.
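For anyone hitting the same error, the comparison described above can be sketched in a few lines of Python. This is only an illustration, not part of PPanGGOLiN: the function name and the sample IDs are hypothetical, and it assumes a tab-separated clustering file whose second column holds the gene ID.

```python
def genes_missing_from_annotations(cluster_lines, annotated_ids):
    """Return gene IDs listed in the clustering table but absent from the annotations.

    Assumption: each clustering line is tab-separated and its second
    column is the gene ID (the first being the family or representative).
    """
    annotated = set(annotated_ids)
    missing = set()
    for line in cluster_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 2:
            continue  # skip blank or malformed lines
        if cols[1] not in annotated:
            missing.add(cols[1])
    return sorted(missing)


# Hypothetical IDs following the <genomeID>:<contigID>_<id> pattern.
clusters = [
    "genomeA:contig1_1\tgenomeA:contig1_1",
    "genomeA:contig1_1\tgenomeB:contig1_2",
    "genomeA:contig1_3\tgenomeA:contig1_3",
]
gff_ids = ["genomeA:contig1_1", "genomeB:contig1_2"]
missing = genes_missing_from_annotations(clusters, gff_ids)
print(missing)
```

An empty result means every clustered gene has a matching ID in the annotations; anything listed is a gene the annotation files do not contain.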

JeanMainguy commented on September 27, 2024 1

The error you got is quite misleading.
We've already identified some issues with how external clustering files are processed, and we have patched them and improved error handling in PR #278. So the error messages should be clearer in the next release.

Thank you as well for pointing out the inconsistencies in the documentation; I'll fix them. (I've also noticed that the documentation for the family_tsv file is not up to date, so I'll address that too.)

ericolo commented on September 27, 2024 1

It ended up working with my new GFF files. Thanks for the workaround, which helped me debug!

ericolo commented on September 27, 2024

On my machine, the example in the git repository works perfectly like this:
ppanggolin workflow --anno genomes.gbff.list -c 15 -o mock_test --clusters clusters.tsv --infer_singletons

So the problem really does seem to come from my clustering file.
I tried reformatting mine like the example (representative-tab-gene) and I still get the same error.

Thanks,
Eric

JeanMainguy commented on September 27, 2024

Hi,

Thanks for raising this issue!

It seems like the problem might be due to how the genes are named in your clustering table, which doesn’t match the way PPanGGOLiN expects them.

PPanGGOLiN uses the gene ID from the CDS line of the GFF file. For example, with the following gene:

NZ_CALCQZ010000140.1 RefSeq gene 11218 11601 . - . ID=gene-QDP31_RS09520;Name=QDP31_RS09520
NZ_CALCQZ010000140.1 Protein Homology CDS 11218 11601 . - 0 ID=cds-WP_279012685.1;Parent=gene-QDP31_RS09520

The ID would be cds-WP_279012685.1.
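As a rough illustration of that rule, here is a minimal Python sketch that pulls the ID attribute out of CDS lines in a tab-separated GFF3 file. It is not PPanGGOLiN's own parser; the function name is hypothetical and the sample lines are the two records shown above, with tabs restored.

```python
def cds_ids_from_gff(lines):
    """Yield the ID attribute of every CDS feature found in GFF3 lines."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue  # only well-formed CDS feature lines are of interest
        for field in cols[8].split(";"):
            key, _, value = field.partition("=")
            if key.strip() == "ID":
                yield value.strip()
                break


example = [
    "NZ_CALCQZ010000140.1\tRefSeq\tgene\t11218\t11601\t.\t-\t.\t"
    "ID=gene-QDP31_RS09520;Name=QDP31_RS09520",
    "NZ_CALCQZ010000140.1\tProtein Homology\tCDS\t11218\t11601\t.\t-\t0\t"
    "ID=cds-WP_279012685.1;Parent=gene-QDP31_RS09520",
]
print(list(cds_ids_from_gff(example)))  # ['cds-WP_279012685.1']
```

The IDs collected this way are the ones a clustering table would need to use when the annotation IDs are kept.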

At the end of the annotation step, PPanGGOLiN checks if all gene IDs are unique. If they aren’t, it uses internal IDs in the format <genome>_CDS_<id>, such as GCF_000173495.1_CDS_0759. In this case you get the following log: INFO: gene identifiers used in the provided annotation files were not unique, PPanGGOLiN will use self-generated identifiers.
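To anticipate which case you will fall into, you can look for repeated IDs across genomes before running the annotation step. The sketch below is only a pre-check written for illustration; it does not reproduce PPanGGOLiN's internal test, and the function name and sample IDs are hypothetical.

```python
from collections import Counter


def duplicated_ids(ids_by_genome):
    """Return the gene IDs that occur more than once across all genomes."""
    counts = Counter(
        gene_id for ids in ids_by_genome.values() for gene_id in ids
    )
    return sorted(gene_id for gene_id, n in counts.items() if n > 1)


# Hypothetical data: two genomes that share a contig-derived gene name.
ids_by_genome = {
    "genomeA": ["contig1_1", "contig1_2"],
    "genomeB": ["contig1_1", "contig2_1"],
}
print(duplicated_ids(ids_by_genome))  # ['contig1_1']
```

An empty list suggests the annotation IDs are unique across the dataset.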

To check if PPanGGOLiN used the annotation file's IDs or generated its own, you can run this command:

ppanggolin info -p myannopang/pangenome.h5 --parameters

If you see # used_local_identifiers: False, it means PPanGGOLiN used internal IDs instead of those from the annotation file.

In your case, it looks like the genes in your clustering table follow the pattern <genomeID>:<contigID>_<id>, which PPanGGOLiN doesn’t recognize and can’t map back to the pangenome genes.

JeanMainguy commented on September 27, 2024

I understand that working with external clustering files can be tricky, especially when PPanGGOLiN uses its own internal IDs. A possible workaround is to run the clustering step with PPanGGOLiN and then generate the family_tsv file using the write_pangenome command.

This file will list the gene family ID, gene ID, and local ID (which corresponds to the ID in the GFF file). Essentially, the second and third columns will help you map the internal IDs to the CDS IDs from the annotation file.
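A small Python sketch of that mapping step follows. It assumes the column order described above (family ID, gene ID, local ID, tab-separated); the function name and the sample rows are hypothetical, so check the actual file layout in your PPanGGOLiN version before relying on it.

```python
def internal_to_local_ids(family_lines):
    """Map PPanGGOLiN gene IDs (column 2) to annotation-file IDs (column 3).

    Assumption: tab-separated rows of family ID, gene ID, local ID.
    Rows with an empty or missing local ID are skipped.
    """
    mapping = {}
    for line in family_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 3 or not cols[2]:
            continue
        mapping[cols[1]] = cols[2]
    return mapping


# Hypothetical rows in the assumed layout.
rows = [
    "fam_1\tgenomeA_CDS_0001\tcds-example-1",
    "fam_1\tgenomeB_CDS_0042\tcds-example-2",
    "fam_2\tgenomeA_CDS_0002\t",
]
id_map = internal_to_local_ids(rows)
print(id_map)
```

Inverting the dictionary gives the opposite direction, which is what you need to rewrite an external clustering table with PPanGGOLiN's internal IDs.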

To sum up, the commands would be:

ppanggolin annotate --anno list_gff.tsv -o ppanggolin_result
ppanggolin cluster -p ppanggolin_result/pangenome.h5
ppanggolin write_pangenome -p ppanggolin_result/pangenome.h5 --families_tsv -o ppanggolin_result -f

ericolo commented on September 27, 2024

Hi,

Thanks for your quick reply !

I renamed my proteins to follow the pattern <genomeID>:<contigID>_<id> because, in my dataset, some contigs from different genomes have redundant names. I also edited the GFF files accordingly, in the ID= attribute.

Maybe something is wrong with the format of the names? My IDs are recognized as unique by ppanggolin, according to the log:
2024-09-03 14:11:02 annotate.py:l1084 INFO gene identifiers used in the provided annotation files were unique, PPanGGOLiN will use them.

I can try your workaround and will let you know if it works. As a last resort I can just run it without providing my clustering results, but I'm trying to save time as I have a huge dataset.

Thanks a lot !
