Comments (7)

ericolo commented on September 27, 2024

I found the problem. After using the workaround and generating new clusters from the GFF files, I compared ppanggolin_result/gene_families.tsv to my own clustering file, and there were indeed some proteins in my clustering file that were not in any of the GFF files.

So the problem was the way I generated my GFFs, which omitted some proteins, and not ppanggolin or the protein IDs.
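
The comparison described above can be sketched as a set difference. This is only an illustration, not part of PPanGGOLiN: the function names are hypothetical, and it assumes both files are tab-separated with the gene ID in the second column (adjust the column indexes if your files differ, and skip any header row first).

```python
import csv


def column_values(lines, col):
    """Return the set of non-empty values found in one column of a TSV."""
    values = set()
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) > col and row[col]:
            values.add(row[col])
    return values


def genes_unknown_to_pangenome(cluster_lines, family_lines,
                               cluster_gene_col=1, family_gene_col=1):
    """List genes from an external clustering table that are absent from
    the gene family table written by PPanGGOLiN.

    Column positions are assumptions (0-based); both default to column 2.
    """
    mine = column_values(cluster_lines, cluster_gene_col)
    known = column_values(family_lines, family_gene_col)
    return sorted(mine - known)
```

Both arguments can be open file handles, for example `open("clusters.tsv", newline="")` and `open("ppanggolin_result/gene_families.tsv", newline="")`. Any ID that is returned is a gene the clustering file mentions but the pangenome does not contain.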

Thanks, and sorry for the mistake. I can add another comment once I succeed with the new GFF files.

from ppanggolin.

JeanMainguy commented on September 27, 2024

The error you got is quite misleading.
We’ve already identified some issues with how external clustering files are processed, and we patched them and improved error handling in PR #278, so the error messages should be clearer in the next release.

Thank you as well for pointing out the inconsistencies in the documentation; I'll fix them. (I've also noticed that the documentation for the family_tsv file is not up to date, so I’ll address that too.)


ericolo commented on September 27, 2024

It ended up working with my new GFF files. Thanks for the workaround, which helped me debug!


ericolo commented on September 27, 2024

On my machine, the example in the git repository works perfectly like this:
ppanggolin workflow --anno genomes.gbff.list -c 15 -o mock_test --clusters clusters.tsv --infer_singletons

So the problem really seems to come from my clustering file.
I tried reformatting mine like the example (representative-tab-gene) and I still get the same error.

Thanks,
Eric


JeanMainguy commented on September 27, 2024

Hi,

Thanks for raising this issue!

It seems like the problem might be due to how the genes are named in your clustering table, which doesn’t match the way PPanGGOLiN expects them.

PPanGGOLiN uses the gene ID from the CDS line of the GFF file. For example, with the following gene:

NZ_CALCQZ010000140.1 RefSeq gene 11218 11601 . - . ID=gene-QDP31_RS09520;Name=QDP31_RS09520
NZ_CALCQZ010000140.1 Protein Homology CDS 11218 11601 . - 0 ID=cds-WP_279012685.1;Parent=gene-QDP31_RS09520

The ID would be cds-WP_279012685.1.

At the end of the annotation step, PPanGGOLiN checks if all gene IDs are unique. If they aren’t, it uses internal IDs in the format <genome>_CDS_<id>, such as GCF_000173495.1_CDS_0759. In this case you get the following log: INFO: gene identifiers used in the provided annotation files were not unique, PPanGGOLiN will use self-generated identifiers.
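
To see ahead of time whether your CDS IDs would pass such a uniqueness check, something like the following could be used. It is a rough sketch and not PPanGGOLiN's own implementation: the function names are hypothetical, and it assumes standard tab-separated GFF3 with the ID stored in the ninth column as `ID=...`.

```python
from collections import Counter


def cds_ids(gff_lines):
    """Yield the ID attribute of each CDS feature in a GFF3 file."""
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence section, no more features
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        attributes = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        if "ID" in attributes:
            yield attributes["ID"]


def duplicated_ids(gff_files):
    """Return CDS IDs that occur in more than one genome.

    gff_files maps a genome name to an iterable of GFF lines. IDs are
    deduplicated per file first, because a CDS split over several lines
    repeats its ID within one file.
    """
    counts = Counter()
    for lines in gff_files.values():
        counts.update(set(cds_ids(lines)))
    return sorted(gene_id for gene_id, n in counts.items() if n > 1)
```

An empty result suggests the annotation IDs are unique across genomes, in which case the IDs in a clustering table should match the CDS IDs exactly.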

To check if PPanGGOLiN used the annotation file's IDs or generated its own, you can run this command:

ppanggolin info -p myannopang/pangenome.h5 --parameters

If you see # used_local_identifiers: False, it means PPanGGOLiN used internal IDs instead of those from the annotation file.

In your case, it looks like the genes in your clustering table follow the pattern <genomeID>:<contigID>_<id>, which PPanGGOLiN doesn’t recognize and can’t map back to the pangenome genes.


JeanMainguy commented on September 27, 2024

I understand that working with external clustering files can be tricky, especially when PPanGGOLiN uses its own internal IDs. A possible workaround is to run the clustering step with PPanGGOLiN and then generate the family_tsv file using the write_pangenome command.

This file will list the gene family ID, gene ID, and local ID (which corresponds to the ID in the GFF file). Essentially, the second and third columns will help you map the internal IDs to the CDS IDs from the annotation file.

To sum up, the commands would be:


ppanggolin annotate --anno list_gff.tsv -o ppanggolin_result
ppanggolin cluster -p ppanggolin_result/pangenome.h5

ppanggolin write_pangenome -p ppanggolin_result/pangenome.h5 --families_tsv -o ppanggolin_result -f
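
Once the family file is written, the mapping between the two ID systems could be built along these lines. This is a hedged sketch with hypothetical function names. It assumes the three-column layout described above (family ID, gene ID, local ID, tab-separated) and an external clustering table laid out as cluster, tab, gene. Check both against your files, and skip a header row if there is one.

```python
import csv


def local_to_internal(family_lines):
    """Map each local (annotation) ID to the gene ID PPanGGOLiN uses."""
    mapping = {}
    for row in csv.reader(family_lines, delimiter="\t"):
        if len(row) < 3 or not row[2]:
            continue  # no local ID recorded on this row
        mapping[row[2]] = row[1]
    return mapping


def translate_clusters(cluster_lines, mapping):
    """Rewrite (cluster, gene) rows so genes use PPanGGOLiN's IDs.

    Returns the translated rows and the genes that could not be mapped.
    """
    translated, unmapped = [], []
    for row in csv.reader(cluster_lines, delimiter="\t"):
        if len(row) < 2:
            continue
        cluster, gene = row[0], row[1]
        if gene in mapping:
            translated.append((cluster, mapping[gene]))
        else:
            unmapped.append(gene)
    return translated, unmapped
```

A non-empty `unmapped` list points to genes in the clustering table that have no counterpart in the annotation files.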


ericolo commented on September 27, 2024

Hi,

Thanks for your quick reply!

So I renamed my proteins to follow the pattern <genomeID>:<contigID>_<id> because in my dataset some contigs from different genomes share the same names, and I edited the GFF files accordingly, after the ID= flag.

Maybe something is wrong with the format of the names? My IDs are recognized as unique by ppanggolin according to the log:
2024-09-03 14:11:02 annotate.py:l1084 INFO gene identifiers used in the provided annotation files were unique, PPanGGOLiN will use them.

I can try your workaround and I'll let you know if it works. As a last resort I can just run it without providing my clustering results, but I'm trying to save time since I have a huge dataset.

Thanks a lot!
