labgem / ppanggolin

Build a partitioned pangenome graph from microbial genomes

Home Page: https://ppanggolin.readthedocs.io/en/latest/

License: Other

C 22.78% Python 77.18% Cython 0.04%

comparative-genomics bioinformatics microbial-genomics microbiology bacteria pangenome

ppanggolin's Issues

Option to extract the RGPs

Hey guys,

It would be perfect if you could provide the option to extract the RGPs. From what I understood, right now panrgp provides the start and stop coordinates, but no option to extract the delimited regions.
This tool looks quite interesting so far, good work!
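
Until such an option exists, a region can be cut out of a contig by hand once its coordinates are known. The sketch below is only an illustration: `extract_region` is a hypothetical helper, not part of panrgp, and it assumes the reported start and stop are 1-based and inclusive, which should be checked against the actual output.

```python
def extract_region(contig_sequence: str, start: int, stop: int) -> str:
    """Return the subsequence for a 1-based, inclusive [start, stop] interval.

    The coordinate convention is an assumption; adjust it if the
    coordinates in your output file are 0-based or half-open.
    """
    if start < 1 or stop > len(contig_sequence) or start > stop:
        raise ValueError(
            f"invalid region {start}-{stop} for a contig of length "
            f"{len(contig_sequence)}"
        )
    # Python slices are 0-based and exclude the end index.
    return contig_sequence[start - 1:stop]
```

For example, `extract_region("ACGTACGT", 2, 4)` returns the three bases at positions 2 to 4.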

Error: annotate step with GFF3 inputs.

Hello,
Thanks so much for making such a comprehensive tool; I am especially appreciative of the workflow and step-by-step options. I have a pangenome of interest that I'd like to use ppanggolin with. I have already clustered my genes into gene families, so I have gone the step-by-step route with the goal of providing ppanggolin with a cluster file.
Out of my ~200 or so annotations, I was able to get all but 3 to load successfully. Below is an example of the error that I'm getting:

$ ppanggolin annotate --anno erroring.lines.newdownloads 
2020-04-08 08:15:08 main.py:l166 INFO   Command: /home/fwhelan/miniconda3/bin/ppanggolin annotate --anno erroring.lines.newdownloads
2020-04-08 08:15:08 main.py:l167 INFO   PPanGGOLiN version: 1.1.82
2020-04-08 08:15:08 annotate.py:l318 INFO       Reading erroring.lines.newdownloads the list of organism files ...
  0%|                                                                                                                                                                                                                                                   | 0/1 [00:00<?, ?file/s]multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/fwhelan/miniconda3/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/home/fwhelan/miniconda3/lib/python3.7/site-packages/ppanggolin/annotate/annotate.py", line 308, in launchReadAnno
    return readAnnoFile(*args)
  File "/home/fwhelan/miniconda3/lib/python3.7/site-packages/ppanggolin/annotate/annotate.py", line 313, in readAnnoFile
    return read_org_gff(organism_name, filename, circular_contigs, getSeq, pseudo)
  File "/home/fwhelan/miniconda3/lib/python3.7/site-packages/ppanggolin/annotate/annotate.py", line 239, in read_org_gff
    attributes = getGffAttributes(gff_fields)
  File "/home/fwhelan/miniconda3/lib/python3.7/site-packages/ppanggolin/annotate/annotate.py", line 199, in getGffAttributes
    (key, value) = att.strip().split('=')
ValueError: not enough values to unpack (expected 2, got 1)
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/fwhelan/miniconda3/bin/ppanggolin", line 8, in <module>
    sys.exit(main())
  File "/home/fwhelan/miniconda3/lib/python3.7/site-packages/ppanggolin/main.py", line 169, in main
    ppanggolin.annotate.launch(args)
  File "/home/fwhelan/miniconda3/lib/python3.7/site-packages/ppanggolin/annotate/annotate.py", line 409, in launch
    readAnnotations(pangenome, args.anno, cpu = args.cpu, pseudo = args.use_pseudo)
  File "/home/fwhelan/miniconda3/lib/python3.7/site-packages/ppanggolin/annotate/annotate.py", line 330, in readAnnotations
    for org, flag in p.imap_unordered(launchReadAnno, args):
  File "/home/fwhelan/miniconda3/lib/python3.7/multiprocessing/pool.py", line 748, in next
    raise value
ValueError: not enough values to unpack (expected 2, got 1)
  0%|                                                                                                                                                                                                                                                   | 0/1 [00:00<?, ?file/s]
$ head erroring.lines.newdownloads 
Pseudomonas_aeruginosa_LESB58_125.gff   Pseudomonas_aeruginosa_LESB58_125.gff.gz

I ran this with a freshly downloaded GFF3 file from here (http://pseudomonas.com/strain/download?c1=organism&v1=Pseudomonas+aeruginosa+LESB58+&v2=&c2=assemblyLevel) just to be sure I hadn't inadvertently manipulated it in some way.

Thanks for your help!
--Fiona
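
The traceback shows `att.strip().split('=')` yielding a single value, which happens when one `;`-separated entry of the attributes column contains no `=`. A more tolerant parser could look like the sketch below. `parse_gff_attributes` is a hypothetical function written for illustration, not the project's `getGffAttributes`, and keeping bare tokens as empty-valued flags is an assumption about the desired behaviour.

```python
def parse_gff_attributes(attribute_column: str) -> dict:
    """Parse the 9th GFF3 column into a dict without failing on odd entries."""
    attributes = {}
    for field in attribute_column.strip().split(";"):
        field = field.strip()
        if not field:
            continue  # skip empty entries such as a trailing ';'
        # partition() splits on the first '=' only, so values that
        # themselves contain '=' are kept whole.
        key, separator, value = field.partition("=")
        if not separator:
            # Entry without '=': keep it as a flag (assumed behaviour).
            attributes[key] = ""
        else:
            attributes[key] = value
    return attributes
```

With this approach a line such as `ID=gene1;Name=dnaA;partial;` parses to three keys, where `partial` maps to an empty string.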

requirement file

When using pip with the requirements file, I got this error:

$ pip install -r requirements.txt 
ERROR: Invalid requirement: 'tqdm=4.*' (from line 2 of requirements.txt)
Hint: = is not a valid operator. Did you mean == ?

$ pip --version
pip 19.2.3

Do you use this file with pip? The syntax does not seem to be correct.
I'm using miniconda, which may be the cause.
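
The single `=` is the conda way of pinning a version, while pip expects `==`, as its hint says. As a rough sketch, the conversion can be automated as below; `to_pip_requirement` is a hypothetical helper and only handles simple `name=version` lines.

```python
import re

# A package name, one '=' that is not followed by another '=', then a version.
_CONDA_PIN = re.compile(r"^([A-Za-z0-9._-]+)=(?!=)(.+)$")


def to_pip_requirement(line: str) -> str:
    """Rewrite a conda-style 'name=version' pin as a pip 'name==version' pin.

    Lines that already use a pip operator (==, >=, <=, ...) are returned
    unchanged apart from surrounding whitespace.
    """
    line = line.strip()
    match = _CONDA_PIN.match(line)
    if match:
        return f"{match.group(1)}=={match.group(2)}"
    return line
```

Applied to the failing line, `to_pip_requirement("tqdm=4.*")` gives `tqdm==4.*`, which pip accepts.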

compatibility with newest compiler

Hi, I'm using GCC version 10.2.0 to compile this project and I've run into these errors:

  gcc -shared -L/apps/gent/CO7/skylake-ib/software/libffi/3.3-GCCcore-10.2.0/lib64 -L/apps/gent/CO7/skylake-ib/software/libffi/3.3-GCCcore-10.2.0/lib -L/apps/gent/CO7/skylake-ib/software/GMP/6.2.0-GCCcore-10.2.0/lib64 -L/apps/gent/CO7/skylake-ib/software/GMP/6.2.0-GCCcore-10.2.0/lib -L/apps/gent/CO7/skylake-ib/software/XZ/5.2.5-GCCcore-10.2.0/lib64 -L/apps/gent/CO7/skylake-ib/software/XZ/5.2.5-GCCcore-10.2.0/lib -L/apps/gent/CO7/skylake-ib/software/SQLite/3.33.0-GCCcore-10.2.0/lib64 -L/apps/gent/CO7/skylake-ib/software/SQLite/3.33.0-GCCcore-10.2.0/lib -L/apps/gent/CO7/skylake-ib/software/ncurses/6.2-GCCcore-10.2.0/lib64 -L/apps/gent/CO7/skylake-ib/software/ncurses/6.2-GCCcore-10.2.0/lib -L/apps/gent/CO7/skylake-ib/software/libreadline/8.0-GCCcore-10.2.0/lib64 -L/apps/gent/CO7/skylake-ib/software/libreadline/8.0-GCCcore-10.2.0/lib -L/apps/gent/CO7/skylake-ib/software/zlib/1.2.11-GCCcore-10.2.0/lib64 -L/apps/gent/CO7/skylake-ib/software/zlib/1.2.11-GCCcore-10.2.0/lib -L/apps/gent/CO7/skylake-ib/software/bzip2/1.0.8-GCCcore-10.2.0/lib64 -L/apps/gent/CO7/skylake-ib/software/bzip2/1.0.8-GCCcore-10.2.0/lib -L/apps/gent/CO7/skylake-ib/software/binutils/2.35-GCCcore-10.2.0/lib64 -L/apps/gent/CO7/skylake-ib/software/binutils/2.35-GCCcore-10.2.0/lib -L/apps/gent/CO7/skylake-ib/software/GCCcore/10.2.0/lib64 -L/apps/gent/CO7/skylake-ib/software/GCCcore/10.2.0/lib -L/apps/gent/CO7/skylake-ib/software/libffi/3.3-GCCcore-10.2.0/lib64 -L/apps/gent/CO7/skylake-ib/software/libffi/3.3-GCCcore-10.2.0/lib -L/apps/gent/CO7/skylake-ib/software/GMP/6.2.0-GCCcore-10.2.0/lib64 -L/apps/gent/CO7/skylake-ib/software/GMP/6.2.0-GCCcore-10.2.0/lib -L/apps/gent/CO7/skylake-ib/software/XZ/5.2.5-GCCcore-10.2.0/lib64 -L/apps/gent/CO7/skylake-ib/software/XZ/5.2.5-GCCcore-10.2.0/lib -L/apps/gent/CO7/skylake-ib/software/SQLite/3.33.0-GCCcore-10.2.0/lib64 -L/apps/gent/CO7/skylake-ib/software/SQLite/3.33.0-GCCcore-10.2.0/lib -L/apps/gent/CO7/skylake-ib/software/ncurses/6.2-GCCcore-10.2.0/lib64 
-L/apps/gent/CO7/skylake-ib/software/ncurses/6.2-GCCcore-10.2.0/lib -L/apps/gent/CO7/skylake-ib/software/libreadline/8.0-GCCcore-10.2.0/lib64 -L/apps/gent/CO7/skylake-ib/software/libreadline/8.0-GCCcore-10.2.0/lib -L/apps/gent/CO7/skylake-ib/software/zlib/1.2.11-GCCcore-10.2.0/lib64 -L/apps/gent/CO7/skylake-ib/software/zlib/1.2.11-GCCcore-10.2.0/lib -L/apps/gent/CO7/skylake-ib/software/bzip2/1.0.8-GCCcore-10.2.0/lib64 -L/apps/gent/CO7/skylake-ib/software/bzip2/1.0.8-GCCcore-10.2.0/lib -L/apps/gent/CO7/skylake-ib/software/binutils/2.35-GCCcore-10.2.0/lib64 -L/apps/gent/CO7/skylake-ib/software/binutils/2.35-GCCcore-10.2.0/lib -L/apps/gent/CO7/skylake-ib/software/GCCcore/10.2.0/lib64 -L/apps/gent/CO7/skylake-ib/software/GCCcore/10.2.0/lib -L/scratch/gent/vo/001/gvo00117/easybuild/CO7/skylake-eth/software/rpy2/3.4.5-foss-2020b/lib64 -L/scratch/gent/vo/001/gvo00117/easybuild/CO7/skylake-eth/software/rpy2/3.4.5-foss-2020b/lib -L/apps/gent/CO7/skylake-ib/software/gmpy2/2.1.0b5-GCC-10.2.0/lib64 -L/apps/gent/CO7/skylake-ib/software/gmpy2/2.1.0b5-GCC-10.2.0/lib -L/apps/gent/CO7/skylake-ib/software/plotly.py/4.14.3-GCCcore-10.2.0/lib64 -L/apps/gent/CO7/skylake-ib/software/plotly.py/4.14.3-GCCcore-10.2.0/lib -L/apps/gent/CO7/skylake-ib/software/SciPy-bundle/2020.11-foss-2020b/lib64 -L/apps/gent/CO7/skylake-ib/software/SciPy-bundle/2020.11-foss-2020b/lib -L/apps/gent/CO7/skylake-ib/software/networkx/2.5-foss-2020b/lib64 -L/apps/gent/CO7/skylake-ib/software/networkx/2.5-foss-2020b/lib -L/apps/gent/CO7/skylake-ib/software/PyTables/3.6.1-foss-2020b/lib64 -L/apps/gent/CO7/skylake-ib/software/PyTables/3.6.1-foss-2020b/lib -L/apps/gent/CO7/skylake-ib/software/tqdm/4.56.2-GCCcore-10.2.0/lib64 -L/apps/gent/CO7/skylake-ib/software/tqdm/4.56.2-GCCcore-10.2.0/lib -L/apps/gent/CO7/skylake-ib/software/Python/3.8.6-GCCcore-10.2.0/lib64 -L/apps/gent/CO7/skylake-ib/software/Python/3.8.6-GCCcore-10.2.0/lib -L/apps/gent/CO7/skylake-ib/software/FFTW/3.3.8-gompi-2020b/lib64 
-L/apps/gent/CO7/skylake-ib/software/FFTW/3.3.8-gompi-2020b/lib -L/apps/gent/CO7/skylake-ib/software/ScaLAPACK/2.1.0-gompi-2020b/lib64 -L/apps/gent/CO7/skylake-ib/software/ScaLAPACK/2.1.0-gompi-2020b/lib -L/apps/gent/CO7/skylake-ib/software/OpenBLAS/0.3.12-GCC-10.2.0/lib64 -L/apps/gent/CO7/skylake-ib/software/OpenBLAS/0.3.12-GCC-10.2.0/lib -L/apps/gent/CO7/skylake-ib/software/GCCcore/10.2.0/lib64 -L/apps/gent/CO7/skylake-ib/software/GCCcore/10.2.0/lib -O2 -ftree-vectorize -march=native -fno-math-errno -I/apps/gent/CO7/skylake-ib/software/prodigal/2.6.3-GCCcore-10.2.0/include -I/apps/gent/CO7/skylake-ib/software/Python/3.8.6-GCCcore-10.2.0/include -I/apps/gent/CO7/skylake-ib/software/FFTW/3.3.8-gompi-2020b/include -I/apps/gent/CO7/skylake-ib/software/OpenBLAS/0.3.12-GCC-10.2.0/include build/temp.linux-x86_64-3.8/ppanggolin/nem/NEM/nem_stats.o build/temp.linux-x86_64-3.8/ppanggolin/nem/NEM/nem_exe.o build/temp.linux-x86_64-3.8/ppanggolin/nem/NEM/nem_alg.o build/temp.linux-x86_64-3.8/ppanggolin/nem/NEM/nem_nei.o build/temp.linux-x86_64-3.8/ppanggolin/nem/NEM/nem_mod.o build/temp.linux-x86_64-3.8/ppanggolin/nem/NEM/nem_rnd.o build/temp.linux-x86_64-3.8/ppanggolin/nem/NEM/lib_io.o build/temp.linux-x86_64-3.8/ppanggolin/nem/NEM/nem_hlp.o build/temp.linux-x86_64-3.8/ppanggolin/nem/NEM/genmemo.o -L/apps/gent/CO7/skylake-ib/software/Python/3.8.6-GCCcore-10.2.0/lib -o build/lib.linux-x86_64-3.8/nem_stats.cpython-38-x86_64-linux-gnu.so
  /apps/gent/CO7/skylake-ib/software/binutils/2.35-GCCcore-10.2.0/bin/ld.gold: error: build/temp.linux-x86_64-3.8/ppanggolin/nem/NEM/nem_exe.o: multiple definition of 'out_stderr'
  /apps/gent/CO7/skylake-ib/software/binutils/2.35-GCCcore-10.2.0/bin/ld.gold: build/temp.linux-x86_64-3.8/ppanggolin/nem/NEM/nem_stats.o: previous definition here
  /apps/gent/CO7/skylake-ib/software/binutils/2.35-GCCcore-10.2.0/bin/ld.gold: error: build/temp.linux-x86_64-3.8/ppanggolin/nem/NEM/nem_alg.o: multiple definition of 'out_stderr'
  /apps/gent/CO7/skylake-ib/software/binutils/2.35-GCCcore-10.2.0/bin/ld.gold: build/temp.linux-x86_64-3.8/ppanggolin/nem/NEM/nem_stats.o: previous definition here
  /apps/gent/CO7/skylake-ib/software/binutils/2.35-GCCcore-10.2.0/bin/ld.gold: error: build/temp.linux-x86_64-3.8/ppanggolin/nem/NEM/nem_nei.o: multiple definition of 'out_stderr'
  /apps/gent/CO7/skylake-ib/software/binutils/2.35-GCCcore-10.2.0/bin/ld.gold: build/temp.linux-x86_64-3.8/ppanggolin/nem/NEM/nem_stats.o: previous definition here
  /apps/gent/CO7/skylake-ib/software/binutils/2.35-GCCcore-10.2.0/bin/ld.gold: error: build/temp.linux-x86_64-3.8/ppanggolin/nem/NEM/nem_mod.o: multiple definition of 'out_stderr'
  /apps/gent/CO7/skylake-ib/software/binutils/2.35-GCCcore-10.2.0/bin/ld.gold: build/temp.linux-x86_64-3.8/ppanggolin/nem/NEM/nem_stats.o: previous definition here

I found out that GCC 10 changed the default flag from -fcommon to -fno-common, so I had to patch setup.py: I added
extra_compile_args=['-fcommon'], to ext_modules and the main thing.
Maybe you could set that explicitly too, so others don't run into the same problem?
Thanks

Lowercase FastA

Currently PPanggolin does not work on FastA input with lowercase sequence. This should be a quick fix, though.

It fails on reverse_complement (maybe other places, but this fails first).
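
One way to make a reverse complement case-insensitive is a translation table that covers both cases. This is a minimal sketch, not the project's own `reverse_complement`, and it assumes only the symbols A, C, G, T and N need to be handled; any other character is passed through unchanged.

```python
# Translation table covering upper- and lowercase bases.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence: str) -> str:
    """Return the reverse complement, preserving the case of each base."""
    return sequence.translate(_COMPLEMENT)[::-1]
```

An alternative design is to call `sequence.upper()` once when the FastA file is read, which also protects any later step that expects uppercase input, at the cost of losing soft-masking information.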

Unable to Render graph using RAST Annotations

Hi All,

I am currently having an issue when I try using my own annotations that were generated using RAST. Is RAST a supported annotator? I am noticing some differences in the .gbk file formatting for my RAST annotations when compared to NCBI .gbk files (which I have gotten to work successfully).

I can provide the files if it would be helpful.

Thank you!

coverage badge and no test ?

Hi, I like this software.
What is the coverage badge for? I didn't see any tests.

I'd like to do my #Hacktoberfest2019 contributions on your project; would you accept unit tests? Any other simple ideas or needs?

str(Gene) should return string

While debugging my code, I got this error:
TypeError: __str__ returned non-string (type int)

My code looked like this:
print("gene : {}".format(gene)) and of course I used an integer as the gene's id.
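
Python requires `__str__` to return a `str`, so returning an integer identifier directly raises exactly this `TypeError`. A minimal sketch of the fix follows; this `Gene` class is a simplified stand-in, and the attribute name `ID` is an assumption rather than the project's actual class.

```python
class Gene:
    """Simplified stand-in for a gene object with an arbitrary identifier."""

    def __init__(self, gene_id):
        self.ID = gene_id

    def __str__(self) -> str:
        # Convert explicitly so integer identifiers are also accepted.
        return str(self.ID)
```

With this change, `"gene : {}".format(Gene(42))` formats without error.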

File format and list

Hi,
I am new to using code from GitHub to run a pangenome analysis. I am using more than 50 genomes, and some of them are from JGI, so for those I have .gff and .fna files. Most of my genomes are from NCBI, for which I could easily get .gbff and .fna files. I updated the gbff and fasta files and the list of organisms. Now, should I list the genome files from JGI in a separate file, or can I put them in the same gbff list? If I make another list, I think I may have to include this information somewhere in the code. I am sorry for the very basic questions; I am still learning. Thank you so much for your help.
Sincerely,
Shailabh

Request: provide line number when cluster file input errors.

Hello again,

I am running into an error when I try to provide my own cluster file at the ppanggolin cluster step. Would it be possible for ppanggolin to output the line number in my input file that raises the error so that I can debug it? When I was having issues with the annotate step, the marker on the right (# of organisms) really helped me narrow down my issue, but the number at the cluster step doesn't seem to match my number of genes.

Thank you!
--Fiona

$ ppanggolin cluster -p pangenome.h5 --clusters ppanggolin_clustersfile.tsv 
2020-04-08 16:14:49 main.py:l166 INFO   Command: /home/fwhelan/miniconda3/bin/ppanggolin cluster -p pangenome.h5 --clusters ppanggolin_clustersfile.tsv
2020-04-08 16:14:49 main.py:l167 INFO   PPanGGOLiN version: 1.1.83
2020-04-08 16:14:49 readBinaries.py:l37 INFO    Getting the current pangenome's status
2020-04-08 16:14:49 readBinaries.py:l283 INFO   Reading pangenome annotations...
100%| 1175179/1175179 [00:07<00:00, 165412.45gene/s]
100%| 209/209 [00:39<00:00,  5.31organism/s]
2020-04-08 16:15:37 cluster.py:l233 INFO        Reading ppanggolin_clustersfile.tsv the gene families file ...
 20%| 5514030/26997601 [00:06<00:13, 1545234.37bytes/s]
Traceback (most recent call last):
  File "/home/fwhelan/miniconda3/bin/ppanggolin", line 8, in <module>
    sys.exit(main())
  File "/home/fwhelan/miniconda3/lib/python3.7/site-packages/ppanggolin/main.py", line 171, in main
    ppanggolin.cluster.launch(args)
  File "/home/fwhelan/miniconda3/lib/python3.7/site-packages/ppanggolin/cluster/cluster.py", line 284, in launch
    readClustering(pangenome, args.clusters, args.infer_singletons, args.force)
  File "/home/fwhelan/miniconda3/lib/python3.7/site-packages/ppanggolin/cluster/cluster.py", line 247, in readClustering
    (fam_id, gene_id, is_frag) = elements if len(elements) == 3 else elements+[None]
ValueError: too many values to unpack (expected 3)
 21%| 5636655/26997601 [00:11<00:42, 501539.00bytes/s]
$ wc -l ppanggolin_clustersfile.tsv 
1161210 ppanggolin_clustersfile.tsv
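
The failing statement in the traceback accepts two or three fields per line, so "too many values to unpack" points at a line with four or more fields. Until the tool reports line numbers itself, such lines can be located with a small pre-check. This is a sketch: `find_bad_cluster_lines` is a hypothetical helper, and splitting on any whitespace is an assumption about how the file is read.

```python
def find_bad_cluster_lines(lines, min_fields=2, max_fields=3):
    """Return (line_number, field_count, text) for every malformed line.

    Line numbers are 1-based. Blank lines are reported too, since they
    have zero fields.
    """
    problems = []
    for line_number, line in enumerate(lines, start=1):
        fields = line.split()  # assumed separator: any whitespace
        if not min_fields <= len(fields) <= max_fields:
            problems.append((line_number, len(fields), line.rstrip("\n")))
    return problems
```

It can be run on an open file handle, for example `with open("ppanggolin_clustersfile.tsv") as handle: print(find_bad_cluster_lines(handle))`. A gene or family name containing a space is one common way for a line to end up with an extra field.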

Add the parameter file in the case of partitioning by chunk

mu and epsilon vectors need to be computed

Annotated file input

I'm trying to run the program with my own annotation files, but I can't seem to get it working right. So far I've tried:

1. To use only gbff files, so I wrote an input file that goes

which I input to the annot parameter. This resulted in an error in makeGraph.py as "NoneType object has no attribute removed"
2. To use both .fasta and .gff files so the input was:

again, using only the annot parameter
3. Finally I tried giving fasta and gbff files separately to --fasta and --anno options but I ended up with the same error as in 1.

I have gone over the example input files in the testingDataset directory, but I can't see where I went wrong. Also, apart from this issue, I was wondering if it is possible to retain the annotations your own pipeline generates?

Package ppanggolin-v0.3.88-py37h01d97ff_0 requires pytables 3.5.*, but none of the providers can be installed

I am trying to install Ppanggolin on my MacOS Catalina (OSx64) through mamba but I keep getting this error. Please let me know how I could address this issue. Thank you so much for your time.

Data of gexf/light_gexf I can get

Hi,

I'm using PPanGGOLiN to visualize the physical location of gene families. I tried to make light_gexf networks in Gephi with 3 different datasets, and I found that one of the light_gexf files contains node information (Number of triangles, Degree, Weighted degree, ...) that is not included in the other light_gexf files, even though they were all produced with the same commands and parameters (I confirmed that they have the same parameters with the command: ppanggolin info -p pangenome.h5 --parameters).

This information (Number of triangles, Degree, ...) is exactly what I'm looking for, so how can I get these data for the other datasets too?

Below are the commands for the 2 datasets that do not contain this information:

ppanggolin annotate --anno MAGs_gff_Input.txt --output MAGs_all --cpu 30
ppanggolin cluster -p MAGs_all/pangenome.h5 --clusters MAG_GeneFamily_all.txt --infer_singletons --cpu 30
ppanggolin graph -p MAGs_all/pangenome.h5 --cpu 30
ppanggolin partition -p MAGs_all/pangenome.h5 --cpu 30
ppanggolin rgp -p MAGs_all/pangenome.h5
ppanggolin write -p MAGs_all/pangenome.h5 --regions --output -f MAGs_all
ppanggolin spot -p MAGs_all/pangenome.h5
ppanggolin draw -p MAGs_all/pangenome.h5 --cpu 30 --output MAGs_all --tile_plot --ucurve --spots all -f
ppanggolin write -p MAGs_all/pangenome.h5 --cpu 30 --light_gexf --output MAGs_all -f
ppanggolin write -p MAGs_all/pangenome.h5 --cpu 30 --gexf --output MAGs_all -f
ppanggolin module -p MAGs_all/pangenome.h5
ppanggolin write -p MAGs_all/pangenome.h5 --modules --output MAGs_all -f
ppanggolin write -p MAGs_all/pangenome.h5 --spot_modules --output MAGs_all -f
ppanggolin write -p MAGs_all/pangenome.h5 --csv --output MAGs_all -f
ppanggolin write -p MAGs_all/pangenome.h5 --partitions --output MAGs_all -f

And these are the commands for the dataset that contains the information:

ppanggolin annotate --anno MAGs_gff_Input_Case.txt --output MAGs_Case --cpu 30
ppanggolin cluster -p MAGs_Case/pangenome.h5 --clusters MAG_GeneFamily_Ac.txt --infer_singletons --cpu 30
ppanggolin graph -p MAGs_Case/pangenome.h5 --cpu 30
ppanggolin partition -p MAGs_Case/pangenome.h5 --cpu 30
ppanggolin rgp -p MAGs_Case/pangenome.h5
ppanggolin write -p MAGs_Case/pangenome.h5 --regions --output -f MAGs_Case
ppanggolin spot -p MAGs_Case/pangenome.h5
ppanggolin draw -p MAGs_Case/pangenome.h5 --cpu 30 --output MAGs_Case --tile_plot --ucurve --spots all -f
ppanggolin write -p MAGs_Case/pangenome.h5 --cpu 30 --light_gexf --output MAGs_Case -f
ppanggolin write -p MAGs_Case/pangenome.h5 --cpu 30 --gexf --output MAGs_Case -f
ppanggolin module -p MAGs_Case/pangenome.h5 --cpu 30
ppanggolin write -p MAGs_Case/pangenome.h5 --modules --output MAGs_Case -f
mkdir MAGs_Case/spot_plots
ppanggolin write -p MAGs_Case/pangenome.h5 --spot_modules --output MAGs_Case/spot_plots -f --cpu 30
ppanggolin write -p MAGs_Case/pangenome.h5 --csv --output MAGs_Case -f --cpu 30
ppanggolin write -p MAGs_Case/pangenome.h5 --partitions --output MAGs_Case -f --cpu 30
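
Degree and weighted degree depend only on the edge list, so they can be recomputed from any GEXF file regardless of which columns it already carries. The column names in question match those Gephi adds when its statistics are run, but whether that explains the difference between the files here is an assumption. The sketch below uses only the standard library; `node_degrees` is a hypothetical helper that treats the graph as undirected and gives edges without a `weight` attribute a weight of 1.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def node_degrees(gexf_text):
    """Return (degree, weighted_degree) dicts keyed by node id."""
    root = ET.fromstring(gexf_text)
    degree = Counter()
    weighted = Counter()
    for element in root.iter():
        # Drop the XML namespace so any GEXF version is accepted.
        tag = element.tag.rsplit("}", 1)[-1]
        if tag == "node":
            # Register the node so isolated nodes get a degree of 0.
            degree[element.get("id")] += 0
            weighted[element.get("id")] += 0.0
        elif tag == "edge":
            weight = float(element.get("weight", 1.0))
            for endpoint in (element.get("source"), element.get("target")):
                degree[endpoint] += 1
                weighted[endpoint] += weight
    return dict(degree), dict(weighted)
```

`ET.fromstring` also accepts bytes, so the content of a file read in binary mode can be passed in directly.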

Merge pangenome graphs

Hi there,

Is it possible to merge pangenome graphs from independent runs? I know panaroo has that option, and would like to know if it would be possible to do so with ppanggolin.
If not, could you please provide me alternatives to compare the pangenome of independent runs?

Thanks!

Problem of file format with Bakta

I tested PPanGGOLiN with Bakta outputs.

With .gbff ==> .gbk, it works as expected. +1
But with .gff3 ==> .gff, there seems to be a format problem. =(

singularity exec --no-home --cleanenv -B /myhome:/myhome ~/Bureau/Tools/PPanGGOLin/PPanGGOLin.simg ppanggolin workflow --anno organism_list.tsv --output ppanggolin_dir
2021-09-27 10:23:56 main.py:l180 INFO	Command: /usr/local/bin/ppanggolin workflow --anno organism_list.tsv --output ppanggolin_dir
2021-09-27 10:23:56 main.py:l181 INFO	PPanGGOLiN version: 1.1.159
2021-09-27 10:23:56 annotate.py:l340 INFO	Reading organism_list.tsv the list of organism files ...
  0%|                                                                          | 0/183 [00:00<?, ?file/s]multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/ppanggolin-1.1.159-py3.6-linux-x86_64.egg/ppanggolin/annotate/annotate.py", line 328, in readAnnoFile
    return read_org_gff(organism_name, filename, circular_contigs, getSeq, pseudo)
  File "/usr/local/lib/python3.6/dist-packages/ppanggolin-1.1.159-py3.6-linux-x86_64.egg/ppanggolin/annotate/annotate.py", line 253, in read_org_gff
    attributes = getGffAttributes(gff_fields)
  File "/usr/local/lib/python3.6/dist-packages/ppanggolin-1.1.159-py3.6-linux-x86_64.egg/ppanggolin/annotate/annotate.py", line 204, in getGffAttributes
    attributes_field = [f for f in gff_fields[GFF_attribute].strip().split(';') if len(f)>0]
IndexError: list index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.6/dist-packages/ppanggolin-1.1.159-py3.6-linux-x86_64.egg/ppanggolin/annotate/annotate.py", line 322, in launchReadAnno
    return readAnnoFile(*args)
  File "/usr/local/lib/python3.6/dist-packages/ppanggolin-1.1.159-py3.6-linux-x86_64.egg/ppanggolin/annotate/annotate.py", line 330, in readAnnoFile
    raise Exception(f"Reading the gff3 file '{filename}' raised an error.")
Exception: Reading the gff3 file '/myhome/Bureau/test/all_gff/174C3.gff3' raised an error.
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/ppanggolin", line 33, in <module>
    sys.exit(load_entry_point('ppanggolin==1.1.159', 'console_scripts', 'ppanggolin')())
  File "/usr/local/lib/python3.6/dist-packages/ppanggolin-1.1.159-py3.6-linux-x86_64.egg/ppanggolin/main.py", line 191, in main
    ppanggolin.workflow.workflow.launch(args)
  File "/usr/local/lib/python3.6/dist-packages/ppanggolin-1.1.159-py3.6-linux-x86_64.egg/ppanggolin/workflow/workflow.py", line 29, in launch
    readAnnotations(pangenome, args.anno, cpu = args.cpu, getSeq = getSeq, show_bar=args.show_prog_bars)
  File "/usr/local/lib/python3.6/dist-packages/ppanggolin-1.1.159-py3.6-linux-x86_64.egg/ppanggolin/annotate/annotate.py", line 351, in readAnnotations
    for org, flag in p.imap_unordered(launchReadAnno, args):
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 735, in next
    raise value
Exception: Reading the gff3 file '/myhome/Bureau/test/all_gff/174C3.gff3' raised an error.
  0%|                                                                          | 0/183 [00:00<?, ?file/s]

1 - Would it be possible to have a format check with a clear error message (like 'format is not correct.')?
2 - Would it be possible to accept Bakta's .gff3 files as input?

Thanks in advance.
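
The `IndexError` is raised while reading the attributes column, which suggests that some feature line has fewer than the nine tab-separated columns GFF3 defines. A format check with a clear message could look like the sketch below. `check_gff3_columns` is a hypothetical helper, and the wording of the message is only an example.

```python
def check_gff3_columns(lines):
    """Raise ValueError naming the first feature line without 9 columns."""
    for line_number, line in enumerate(lines, start=1):
        stripped = line.rstrip("\n")
        if stripped.startswith("##FASTA"):
            break  # the embedded sequence section follows; stop checking
        if not stripped or stripped.startswith("#"):
            continue  # blank lines, comments and directives
        column_count = len(stripped.split("\t"))
        if column_count != 9:
            raise ValueError(
                f"line {line_number}: expected 9 tab-separated columns, "
                f"found {column_count}"
            )
```

Running it on the failing file before `ppanggolin workflow` would show which line the parser cannot handle.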

Unable to Render graph using RAST Annotations #44--FOLLOW UP

Hi All,

I am trying to use some RAST annotations that seem to be generating a file that will not open in Gephi. Here is an example of the file and also an example of the pangenomeGraph file I am getting (it is blank when opening in Gephi).

Thank you! Awesome work on PPanGGOLiN
FN_KCOM1281 - Copy.zip

pangenomeGraph-RAST.zip

Importing Anvio Pangenomic gene clusters into PPAnGGOLiN

Hello @labgem and @merenlab (@meren),

I've been using both PPAnGGOLiN and Anvio pangenomic pipelines and really loving both tools. I've used both tools independently with Prokka annotated .gff files without a problem. But now I want to import Anvio clusters into PPAnGGOLiN instead of using the default MMseqs2 clustering. The reason for this is that being able to visualize the same gene clusters with both methods would be really useful for our research. It is difficult to make sense of the data with 2 independent clustering methods since it is like comparing apples and oranges. We are able to make really cool observations with each method but we can't compare with the other one.

In order to import the Anvio clustering into PPanGGOLiN I need:

A .tsv file listing in the first column the gene family names, and in the second column the gene ID that is used in the annotation files. Using anvi-summarize I got the info needed to generate the .tsv file.
The annotated genomes with gene IDs that match the ones listed in the previous .tsv file. The problem is that the gene_callers_id values provided by Anvio don't match the original ones in the Prokka annotation. The Prokka annotated genomes were parsed into two text files, one for gene calls and one for annotations, with the script gff_parser.py. By default, Prokka also annotates tRNAs, rRNAs and CRISPR regions. However, gff_parser.py will only utilize open reading frames reported by Prodigal in the Prokka output in order to be compatible with the pangenomic Anvio pipeline. During parsing, new gene_callers_id values were generated only for the ORFs that were imported into Anvio. I found out that anvi-get-sequences-for-gene-calls can be used to export new .fasta and .gff files with only the ORFs that match the gene IDs in the .tsv file. But I think there is an issue with the formatting of these .gff files, which may not be compatible with the .gff files PPanGGOLiN expects.
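
Since both requirements hinge on the gene IDs agreeing between the two files, a quick consistency check can help before running the tool. The sketch below lists the gene IDs from the cluster .tsv that are absent from the annotations. `unmatched_gene_ids` is a hypothetical helper, and it assumes the .tsv is tab-separated with the family name first and the gene ID second, as described above.

```python
def unmatched_gene_ids(cluster_lines, annotation_gene_ids):
    """Return the gene IDs in the cluster file that the annotations lack."""
    known = set(annotation_gene_ids)
    missing = []
    for line in cluster_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or incomplete lines
        gene_id = fields[1]
        if gene_id not in known:
            missing.append(gene_id)
    return missing
```

An empty result means every clustered gene has a counterpart in the annotation files; collecting `annotation_gene_ids` from the .gff files is left out here because it depends on which attribute holds the ID.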

I tried running:
ppanggolin annotate --anno Anvio7GenomesAnno.txt --fasta Anvio7GenomesFasta.txt

And I got an error that I am not sure if it is related to PPanGGOLiN or to the format of the .gff/.fasta files obtained from Anvio.

2021-02-19 13:37:26 main.py:l180 INFO	Command: /Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/bin/ppanggolin annotate --anno analysis_PPanGGOLiN/Anvio7GenomesAnno.txt --fasta analysis_PPanGGOLiN/Anvio7GenomesFasta.txt -o analysis_PPanGGOLiN/OutputFromAnvio7 --basename FromAnvio7
2021-02-19 13:37:26 main.py:l181 INFO	PPanGGOLiN version: 1.1.131
2021-02-19 13:37:26 annotate.py:l337 INFO	Reading analysis_PPanGGOLiN/Anvio7GenomesAnno.txt the list of organism files ...
  0%|                                                                                                                                 | 0/28 [00:00<?, ?file/s]multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/lib/python3.6/site-packages/ppanggolin/annotate/annotate.py", line 325, in readAnnoFile
    return read_org_gff(organism_name, filename, circular_contigs, getSeq, pseudo)
  File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/lib/python3.6/site-packages/ppanggolin/annotate/annotate.py", line 277, in read_org_gff
    if contig.name != gff_fields[GFF_seqname]:
UnboundLocalError: local variable 'contig' referenced before assignment

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/lib/python3.6/site-packages/ppanggolin/annotate/annotate.py", line 319, in launchReadAnno
    return readAnnoFile(*args)
  File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/lib/python3.6/site-packages/ppanggolin/annotate/annotate.py", line 327, in readAnnoFile
    raise Exception(f"Reading the gff3 file '{filename}' raised an error.")
Exception: Reading the gff3 file 'analysis_PPanGGOLiN/Anvio7_Exported_Genomes/Anvio7_dpi_genomes_forncbi_kpl3033_c.current.gff' raised an error.
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/bin/ppanggolin", line 8, in <module>
    sys.exit(main())
  File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/lib/python3.6/site-packages/ppanggolin/main.py", line 183, in main
    ppanggolin.annotate.launch(args)
  File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/lib/python3.6/site-packages/ppanggolin/annotate/annotate.py", line 428, in launch
    readAnnotations(pangenome, args.anno, cpu = args.cpu, pseudo = args.use_pseudo, show_bar=args.show_prog_bars)
  File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/lib/python3.6/site-packages/ppanggolin/annotate/annotate.py", line 349, in readAnnotations
    for org, flag in p.imap_unordered(launchReadAnno, args):
  File "/Users/isabelfe/opt/anaconda3/envs/PPanGGOLiN/lib/python3.6/multiprocessing/pool.py", line 761, in next
    raise value
Exception: Reading the gff3 file 'analysis_PPanGGOLiN/Anvio7_Exported_Genomes/Anvio7_dpi_genomes_forncbi_kpl3033_c.current.gff' raised an error.
  0%|

I hope someone from one of your teams can help me with this. It would be really cool to have both tools working on the same set of gene clusters.

Thanks a lot,

Isabel

add the possibility to use 'align' even when the clustering is not performed through PPanGGOLiN

Mentioned in #56 (34th comment)

Being able to use 'align' when using an external clustering would be quite useful.

This could be done either by somehow choosing a representative sequence for each cluster, or by using all genes instead of a representative for the alignment. The latter should not take too long, as this is not traditionally used with a lot of protein sequences as input.
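
The first option could be sketched as follows, taking the longest sequence of each cluster as its representative. This criterion is only one possible choice and is an assumption, not what PPanGGOLiN does; `pick_representatives` and its input layout are hypothetical.

```python
def pick_representatives(clusters, sequences):
    """Pick the gene with the longest sequence in each cluster.

    clusters:  dict mapping a family name to a list of gene IDs
    sequences: dict mapping a gene ID to its protein sequence
    On ties, the gene listed first in the cluster is kept.
    """
    return {
        family: max(gene_ids, key=lambda gene_id: len(sequences[gene_id]))
        for family, gene_ids in clusters.items()
    }
```

The returned mapping gives one gene ID per family, whose sequence can then be written out as the alignment target.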

Unable to Render Hotspots--Running on WSL

Hi All,

I am currently trying to use PPanGGOLiN's draw_hotspots subcommand, but am getting no result in my output. I would expect there to be some result on the set of genomes I am using, but am also wondering if there are any known issues with running PPanGGOLiN on Windows via Windows Subsystem for Linux(WSL) and an Ubuntu distribution?

Essentially, I am just curious if this is a real result or the result of using PPanGGOLiN via WSL.

User-Defined Filtering and Annotation

Hello!

I am not sure if I have missed something in the wiki or if this is possible in some way I haven't quite figured out, so I am reaching out to see if you can point me in the correct direction. I have a pangenome graph that I am viewing in Gephi, and I would love to be able to take a defined subset of ORFs from the underlying collection (e.g. found enriched in a specific niche) and ask the question "which nodes on the pangenome image do these ORFs fall into, and are those nodes forming syntenic blocks across the pangenome?" Further, it would be incredibly advantageous to be able to do this en masse with collections of genes (user-defined subsets of ORFs) and create something like a 'user-defined' partition that can be handled in the same way as ppanggolin-defined partitions (highlight all on the Gephi image, color change, etc.).

Thanks!

Gff3 and gbk files will not work but fasta does

Hi there,

I am trying to do an analysis of 33 Prochlorococcus and Synechococcus genomes using ppanggolin. However, I keep getting an error when I run it on gff3 and gbk files. I am sure that the format of my input file is correct, since it works with fasta sequences, so there has to be an issue with the formatting of these gff3 and gbk files. The gff3 files begin with "##gff-version 3" as the program requires, and the gbk files begin with the locus as the program requires. Also, I am running the command "ppanggolin workflow --anno", so it should work with gff3 and gbk files as stated on the website. So I am unsure why ppanggolin keeps producing an error. Here is the error message I keep receiving:
-----start
2021-02-06 16:16:33 main.py:l169 INFO Command: /usr/local/source/miniconda3/bin/ppanggolin workflow --anno test_input
2021-02-06 16:16:33 main.py:l170 INFO PPanGGOLiN version: 1.1.96
2021-02-06 16:16:33 annotate.py:l322 INFO Reading test_input the list of organism files ...
0%| | 0/2 [00:00<?, ?file/s]multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/local/source/miniconda3/lib/python3.6/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/usr/local/source/miniconda3/lib/python3.6/site-packages/ppanggolin/annotate/annotate.py", line 312, in launchReadAnno
return readAnnoFile(*args)
File "/usr/local/source/miniconda3/lib/python3.6/site-packages/ppanggolin/annotate/annotate.py", line 317, in readAnnoFile
return read_org_gff(organism_name, filename, circular_contigs, getSeq, pseudo)
File "/usr/local/source/miniconda3/lib/python3.6/site-packages/ppanggolin/annotate/annotate.py", line 243, in read_org_gff
attributes = getGffAttributes(gff_fields)
File "/usr/local/source/miniconda3/lib/python3.6/site-packages/ppanggolin/annotate/annotate.py", line 197, in getGffAttributes
attributes_field = [f for f in gff_fields[GFF_attribute].strip().split(';') if len(f)>0]
IndexError: list index out of range
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/usr/local/source/miniconda3/bin/ppanggolin", line 10, in <module>
sys.exit(main())
File "/usr/local/source/miniconda3/lib/python3.6/site-packages/ppanggolin/main.py", line 180, in main
ppanggolin.workflow.workflow.launch(args)
File "/usr/local/source/miniconda3/lib/python3.6/site-packages/ppanggolin/workflow/workflow.py", line 29, in launch
readAnnotations(pangenome, args.anno, cpu = args.cpu, getSeq = getSeq)
File "/usr/local/source/miniconda3/lib/python3.6/site-packages/ppanggolin/annotate/annotate.py", line 334, in readAnnotations
for org, flag in p.imap_unordered(launchReadAnno, args):
File "/usr/local/source/miniconda3/lib/python3.6/multiprocessing/pool.py", line 761, in next
raise value
IndexError: list index out of range
0%|
-------- end

I would really appreciate it if you could help me troubleshoot this problem. Thank you for your time in advance.

Sincerely,
Manveer
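The traceback above fails while indexing `gff_fields[GFF_attribute]`, which suggests that at least one feature line has fewer tab-separated columns than expected. A minimal sketch for locating such lines, assuming the attribute field is the ninth column as in the GFF3 specification (the function name `find_short_gff_lines` is hypothetical and not part of PPanGGOLiN):

```python
import io


def find_short_gff_lines(handle, expected_columns=9):
    """Yield (line_number, column_count) for feature lines with too few columns."""
    for line_number, line in enumerate(handle, start=1):
        stripped = line.rstrip("\n")
        # Stop at the embedded sequence section, if any.
        if stripped.startswith("##FASTA"):
            break
        # Skip blank lines, directives and comments.
        if not stripped or stripped.startswith("#"):
            continue
        columns = stripped.split("\t")
        if len(columns) < expected_columns:
            yield line_number, len(columns)


# Tiny in-memory example: the second feature line lacks its attribute column.
example = io.StringIO(
    "##gff-version 3\n"
    "contig1\tProkka\tCDS\t1\t90\t.\t+\t0\tID=gene1\n"
    "contig1\tProkka\tCDS\t100\t190\t.\t+\t0\n"
)
bad_lines = list(find_short_gff_lines(example))
print(bad_lines)
```

Running the same function on `open("my_genome.gff")` (a placeholder file name) would list the lines worth inspecting by hand.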

Suggestion: Add gene length to the Bokeh spot frame

Hello =)

Would it be possible to add the length of the gene to the Bokeh dynamic "gene" frame, just below the start and stop information field (and above the gene name)?

This information can be very useful for spotting truncated genes or for comparing gene variants with alternative start/stop codons.

Thanks in advance.

Have a good day.

Format for MY_CLUSTERS_FILE

I want to import clusters from another pangenomic analysis tool (Anvio). I know that I need "a .tsv file listing, in the first column the gene family names, and in the second column the gene ID that is used in the annotation files."

Could you provide an example of this MY_CLUSTERS_FILE? How should the IDs be separated in the second column?

Thanks
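One reading of the quoted description is a two-column, tab-separated file with one gene per line, so that a family name is repeated for each of its genes. The sketch below writes such a file under that assumption; the family names and gene IDs are invented placeholders, and `write_clusters_tsv` is a hypothetical helper rather than anything shipped with PPanGGOLiN or Anvio.

```python
import csv
import io


def write_clusters_tsv(families, handle):
    """Write one 'family<TAB>gene_id' row per gene (assumed layout)."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for family_name, gene_ids in families.items():
        for gene_id in gene_ids:
            writer.writerow([family_name, gene_id])


# Placeholder clusters, e.g. converted from another tool's output.
families = {
    "family_A": ["GENOME1_00001", "GENOME2_00417"],
    "family_B": ["GENOME1_00002"],
}
buffer = io.StringIO()
write_clusters_tsv(families, buffer)
clusters_text = buffer.getvalue()
print(clusters_text)
```

The gene IDs in the second column would need to match the IDs used in the annotation files, as the quoted sentence states.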

gbk file does not work!

Hello,
I tried using a gbff file and it worked well. But when I used a gbk file, it did not work.
Thanks,
Fuyou

Any Tips on Making a Pangenome on Genetically Disparate Species?

Hi folks,

Thanks for making this easy-to-follow tool. I have a collection of genomes from the same family and I would like to make a pangenome out of them. However, these genomes are quite different from one another, and the software is unable to finish the partitioning step of pangenome creation. I keep getting this error:

Exception: Statistical partitionning does not work on your data. This usually happens because you used very few (<15) genomes.

Any tips on how to overcome this? Is this even possible? One of the endpoints I'd like to get to is the presence/absence table.

error while running ppanggolin workflow --fasta testingDataset/organisms.fasta.list command.

I have installed ppanggolin on my computer running Ubuntu 18.04.5 LTS (64-bit). I followed all the instructions and the installation succeeded. Checking the version of ppanggolin shows the following:
(base) dinesh@dinesh7k:~$ ppanggolin -v
ppanggolin 1.1.96
I ran the following command to check the test data provided in the 1.1.96 source code:
ppanggolin workflow --fasta testingDataset/organisms.fasta.list
The above command displays the following error:
(base) dinesh@dinesh7k:~/Documents/tools/PPanGGOLiN-1.1.96$ ppanggolin workflow --fasta testingDataset/organisms.fasta.list
Traceback (most recent call last):
  File "/home/dinesh/anaconda3/bin/ppanggolin", line 10, in <module>
    sys.exit(main())
  File "/home/dinesh/anaconda3/lib/python3.7/site-packages/ppanggolin/main.py", line 157, in main
    checkInputFiles(fasta = args.fasta)
  File "/home/dinesh/anaconda3/lib/python3.7/site-packages/ppanggolin/main.py", line 70, in checkInputFiles
    checkTsvSanity(fasta)
  File "/home/dinesh/anaconda3/lib/python3.7/site-packages/ppanggolin/main.py", line 50, in checkTsvSanity
    raise Exception(f"Some of the given files do not exist. The non-existing files are the following : '{' '.join(nonExistingFiles)}'")
Exception: Some of the given files do not exist. The non-existing files are the following : 'FASTA/GCF_001317785.1_7396_3_13_genomic.fna.gz FASTA/GCF_001729845.1_ASM172984v1_genomic.fna.gz FASTA/GCF_003788895.1_sc110_genomic.fna.gz FASTA/GCF_000318785.1_ASM31878v1_genomic.fna.gz FASTA/GCF_002777155.1_ASM277715v1_genomic.fna.gz FASTA/GCF_001183765.1_ASM118376v1_genomic.fna.gz FASTA/GCF_000026905.1_ASM2690v1_genomic.fna.gz FASTA/GCF_000220105.1_ASM22010v1_genomic.fna.gz FASTA/GCF_001398215.1_7501_6_50_genomic.fna.gz FASTA/GCF_000318825.1_ASM31882v1_genomic.fna.gz FASTA/GCF_002777095.1_ASM277709v1_genomic.fna.gz FASTA/GCF_000318545.1_ASM31854v1_genomic.fna.gz FASTA/GCF_000318805.1_ASM31880v1_genomic.fna.gz FASTA/GCF_000092685.1_ASM9268v1_genomic.fna.gz FASTA/GCF_001183825.1_ASM118382v1_genomic.fna.gz FASTA/GCF_002776955.1_ASM277695v1_genomic.fna.gz FASTA/GCF_001213045.1_5082_8_5_genomic.fna.gz FASTA/GCF_000304515.1_Cm_FSW4_genomic.fna.gz FASTA/GCF_000092665.1_ASM9266v1_genomic.fna.gz FASTA/GCF_001398135.1_7501_6_52_genomic.fna.gz FASTA/GCF_000472205.1_E_CS88_f__genomic.fna.gz FASTA/GCF_000318945.1_ASM31894v1_genomic.fna.gz FASTA/GCF_000318865.1_ASM31886v1_genomic.fna.gz FASTA/GCF_002776935.1_ASM277693v1_genomic.fna.gz 
FASTA/GCF_001293965.1_ASM129396v1_genomic.fna.gz FASTA/GCF_001398295.1_7396_3_21_genomic.fna.gz FASTA/GCF_006508235.1_ASM650823v1_genomic.fna.gz FASTA/GCF_000093005.1_ASM9300v1_genomic.fna.gz FASTA/GCF_000318925.1_ASM31892v1_genomic.fna.gz FASTA/GCF_000590575.1_ASM59057v1_genomic.fna.gz FASTA/GCF_001729905.1_ASM172990v1_genomic.fna.gz FASTA/GCF_000226605.1_ASM22660v1_genomic.fna.gz FASTA/GCF_000590695.1_ASM59069v1_genomic.fna.gz FASTA/GCF_001183845.1_ASM118384v1_genomic.fna.gz FASTA/GCF_001183805.1_ASM118380v1_genomic.fna.gz FASTA/GCF_002192615.1_ASM219261v1_genomic.fna.gz FASTA/GCF_002777115.1_ASM277711v1_genomic.fna.gz FASTA/GCF_002776885.1_ASM277688v1_genomic.fna.gz FASTA/GCF_003788785.1_ct114V1_genomic.fna.gz FASTA/GCF_000319105.1_ASM31910v1_genomic.fna.gz FASTA/GCF_000210495.1_ASM21049v1_genomic.fna.gz FASTA/GCF_000441775.1_ASM44177v1_genomic.fna.gz FASTA/GCF_000173495.1_ASM17349v1_genomic.fna.gz FASTA/GCF_006508185.1_ASM650818v1_genomic.fna.gz FASTA/GCF_002777015.1_ASM277701v1_genomic.fna.gz FASTA/GCF_002088315.1_ASM208831v1_genomic.fna.gz FASTA/GCF_002776845.1_ASM277684v1_genomic.fna.gz FASTA/GCF_000441655.1_ASM44165v1_genomic.fna.gz FASTA/GCF_000590635.1_ASM59063v1_genomic.fna.gz FASTA/GCF_006508265.1_ASM650826v1_genomic.fna.gz'
Please help me to fix this issue.
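The paths reported as missing are relative (`FASTA/...`), so whether they resolve depends on the directory the command is run from. A minimal sketch of a pre-flight check, assuming the list file is tab-separated with the file path in the second column (the helper `missing_paths` and the demo file names are hypothetical):

```python
import os
import tempfile


def missing_paths(list_file, base_dir):
    """Return the listed paths that do not exist when resolved against base_dir."""
    missing = []
    with open(list_file) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue  # not a 'name<TAB>path' row
            path = fields[1]
            if not os.path.exists(os.path.join(base_dir, path)):
                missing.append(path)
    return missing


# Demo layout mimicking the issue: the list lives next to a FASTA/ folder.
with tempfile.TemporaryDirectory() as root:
    dataset_dir = os.path.join(root, "testingDataset")
    os.makedirs(os.path.join(dataset_dir, "FASTA"))
    open(os.path.join(dataset_dir, "FASTA", "genome1.fna.gz"), "w").close()
    list_file = os.path.join(dataset_dir, "organisms.fasta.list")
    with open(list_file, "w") as handle:
        handle.write("genome1\tFASTA/genome1.fna.gz\n")

    # Resolved from the parent directory the file is "missing" ...
    from_parent = missing_paths(list_file, root)
    # ... but it is found when resolved from the dataset directory itself.
    from_dataset = missing_paths(list_file, dataset_dir)

print(from_parent, from_dataset)
```

If the second call returns an empty list on real data, running the command from inside the dataset directory (or using absolute paths in the list) may be worth trying.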

Error RGP - Two regions had an identical name

version PPanGGOLiN

ppanggolin 1.1.85

PPanGGOLiN rgp command

ppanggolin rgp -p pangenome.h5

Log

2020-05-28 00:12:32 main.py:l166 INFO	Command: /usr/local/bin/ppanggolin rgp -p pangenome.h5
2020-05-28 00:12:32 main.py:l167 INFO	PPanGGOLiN version: 1.1.85
2020-05-28 00:12:32 readBinaries.py:l37 INFO	Getting the current pangenome's status
2020-05-28 00:12:32 readBinaries.py:l283 INFO	Reading pangenome annotations...
100%|████████████████████████████████████████████████████████| 34334/34334 [00:00<00:00, 116881.11gene/s]
100%|████████████████████████████████████████████████████████████████| 9/9 [00:01<00:00,  8.24organism/s]
2020-05-28 00:12:34 readBinaries.py:l289 INFO	Reading pangenome gene families...
100%|█████████████████████████████████████████████████████████| 33572/33572 [00:00<00:00, 75337.28gene/s]
100%|████████████████████████████████████████████████████| 3797/3797 [00:00<00:00, 83620.05gene family/s]
2020-05-28 00:12:34 genomicIsland.py:l183 INFO	Detecting multigenic families...
2020-05-28 00:12:34 pangenome.py:l296 INFO	30 gene families are defined as being multigenic. (duplicated in more than 0.05 of the genomes)
2020-05-28 00:12:34 genomicIsland.py:l185 INFO	Compute Regions of Genomic Plasticity ...
 44%|████████████████████████████▉                                    | 4/9 [00:00<00:00, 64.99genomes/s]
Traceback (most recent call last):
  File "/usr/local/bin/ppanggolin", line 11, in <module>
    load_entry_point('ppanggolin==1.1.85', 'console_scripts', 'ppanggolin')()
  File "/usr/local/lib/python3.6/dist-packages/ppanggolin-1.1.85-py3.6-linux-x86_64.egg/ppanggolin/main.py", line 189, in main
    ppanggolin.RGP.genomicIsland.launch(args)
  File "/usr/local/lib/python3.6/dist-packages/ppanggolin-1.1.85-py3.6-linux-x86_64.egg/ppanggolin/RGP/genomicIsland.py", line 203, in launch
    predictRGP(pangenome, force = args.force, persistent_penalty=args.persistent_penalty, variable_gain=args.variable_gain, min_length=args.min_length, min_score=args.min_score, dup_margin=args.dup_margin, cpu=args.cpu)
  File "/usr/local/lib/python3.6/dist-packages/ppanggolin-1.1.85-py3.6-linux-x86_64.egg/ppanggolin/RGP/genomicIsland.py", line 188, in predictRGP
    pangenome.addRegions(compute_org_rgp(org, persistent_penalty, variable_gain, min_length, min_score, multigenics))
  File "/usr/local/lib/python3.6/dist-packages/ppanggolin-1.1.85-py3.6-linux-x86_64.egg/ppanggolin/pangenome.py", line 311, in addRegions
    raise Exception("Two regions had an identical name, which was unexpected.")
Exception: Two regions had an identical name, which was unexpected.

The GFF files used in the workflow were produced by Prokka.

End of my hacktober

Hi,

My last PR ends my contributions for this Hacktoberfest (I already won my t-shirt!).
I'd like to encourage you to continue this interesting project.
I also want to share my impression that parts of the implementation are confusing: for example, edges point to genes that point back to the edges, and a gene can belong to an organism that has no genes. That said, I must admit that no obvious fix came to mind. I suppose things will be reorganized as the project matures.

Good luck.

Persistent gene families query

Hi, I generated the data using 109 genomes. As I understand the theory, persistent genes must be present in all the genomes under study. I got more than 1600 persistent gene families; however, the matrix file shows those persistent genes to be present in only a few genomes. I am unable to explain this. All the genome files used were annotated with Prokka. Kindly help. Thank you.

About plastic_regions.tsv

Regards.
First of all, thanks for ppanggolin.
I have a question about the plastic_regions.tsv file. What do "true" and "false" mean in the contigBorder and wholeContig columns?

I searched the documentation and found nothing about it, and I am not an expert in this area.

Thank you in advance for your help.

panrgp command not working

Hello,

I've installed the latest version 1.1.108 and ran the workflow command which had no issues. However, when trying to run the panrgp command as described:
ppanggolin panrgp --fasta genomes_list.txt
I get the error message:
ppanggolin: error: argument : invalid choice: 'panRGP' (choose from 'annotate', 'cluster', 'graph', 'partition', 'rarefaction', 'workflow', 'draw', 'write', 'align', 'info')

I even went to my pangenome files generated from the workflow command and tried:
ppanggolin rgp -p pangenome.h5

but got the same error message.
ppanggolin: error: argument : invalid choice: 'rgp' (choose from 'annotate', 'cluster', 'graph', 'partition', 'rarefaction', 'workflow', 'draw', 'write', 'align', 'info')

I also tried running the panRGP.py script in the workflow dir but that didn't work either. Can you please let me know how I can go about running the panRGP command?

Many thanks!

Runtime warning rpy2 image not found on macOS

Hello,
If you get an error like the one below, it means there is a problem with R and LAPACK. This has happened from time to time on macOS when installing with conda.

This happens when calling the --draw_hotspots option of the spot subcommand, as follows:

ppanggolin spot -p pangenome.h5 --cpu 1 --spot_graph --draw_hotspots

Using other versions of R / LAPACK could help. However, I have failed to replicate this bug on actual macOS machines; so far it has only happened in the CI workflow.
If you use macOS and this happens to you, please add a comment with your specs and installation method, and hopefully I can do something about it.

/usr/local/miniconda/envs/test/lib/python3.6/site-packages/rpy2/rinterface/__init__.py:146: RRuntimeWarning:

Error: package or namespace load failed for ‘ade4’ in dyn.load(file, DLLpath = DLLpath, ...):
 unable to load shared object '/usr/local/miniconda/envs/test/lib/R/library/ade4/libs/ade4.dylib':
  dlopen(/usr/local/miniconda/envs/test/lib/R/library/ade4/libs/ade4.dylib, 6): Library not loaded: @rpath/R/lib/libRlapack.dylib
  Referenced from: /usr/local/miniconda/envs/test/lib/R/library/ade4/libs/ade4.dylib
  Reason: image not found


/usr/local/miniconda/envs/test/lib/python3.6/site-packages/rpy2/rinterface/__init__.py:146: RRuntimeWarning:

Failed with error:  

/usr/local/miniconda/envs/test/lib/python3.6/site-packages/rpy2/rinterface/__init__.py:146: RRuntimeWarning:

/usr/local/miniconda/envs/test/lib/python3.6/site-packages/rpy2/rinterface/__init__.py:146: RRuntimeWarning:

‘package ‘ade4’ could not be loaded’

/usr/local/miniconda/envs/test/lib/python3.6/site-packages/rpy2/rinterface/__init__.py:146: RRuntimeWarning:

/usr/local/miniconda/envs/test/lib/python3.6/site-packages/rpy2/rinterface/__init__.py:146: RRuntimeWarning:

Error in dyn.load(file, DLLpath = DLLpath, ...) : 
  unable to load shared object '/usr/local/miniconda/envs/test/lib/R/library/ade4/libs/ade4.dylib':
  dlopen(/usr/local/miniconda/envs/test/lib/R/library/ade4/libs/ade4.dylib, 6): Library not loaded: @rpath/R/lib/libRlapack.dylib
  Referenced from: /usr/local/miniconda/envs/test/lib/R/library/ade4/libs/ade4.dylib
  Reason: image not found


Traceback (most recent call last):
  File "/usr/local/miniconda/envs/test/bin/ppanggolin", line 8, in <module>
    sys.exit(main())
  File "/usr/local/miniconda/envs/test/lib/python3.6/site-packages/ppanggolin/main.py", line 208, in main
    ppanggolin.RGP.spot.launch(args)
  File "/usr/local/miniconda/envs/test/lib/python3.6/site-packages/ppanggolin/RGP/spot.py", line 459, in launch
    predictHotspots(pangenome, args.output, force=args.force, cpu = args.cpu, spot_graph=args.spot_graph, overlapping_match=args.overlapping_match, set_size=args.set_size, exact_match=args.exact_match_size, draw_hotspot=args.draw_hotspots, interest=args.interest, show_bar=args.show_prog_bars)
  File "/usr/local/miniconda/envs/test/lib/python3.6/site-packages/ppanggolin/RGP/spot.py", line 151, in predictHotspots
    draw_spots(drawn_spots, output, cpu, overlapping_match, exact_match, set_size, multigenics, elements, show_bar=show_bar)
  File "/usr/local/miniconda/envs/test/lib/python3.6/site-packages/ppanggolin/RGP/spot.py", line 446, in draw_spots
    spots_to_draw.append(drawCurrSpot(GeneLists, ordered_counts, elements, famcol, fname))#make R dataframes, and plot them using genoPlotR.
  File "/usr/local/miniconda/envs/test/lib/python3.6/site-packages/ppanggolin/RGP/spot.py", line 326, in drawCurrSpot
    importr("genoPlotR")
  File "/usr/local/miniconda/envs/test/lib/python3.6/site-packages/rpy2/robjects/packages.py", line 453, in importr
    env = _get_namespace(rname)
rpy2.rinterface.RRuntimeError: Error in dyn.load(file, DLLpath = DLLpath, ...) : 
  unable to load shared object '/usr/local/miniconda/envs/test/lib/R/library/ade4/libs/ade4.dylib':
  dlopen(/usr/local/miniconda/envs/test/lib/R/library/ade4/libs/ade4.dylib, 6): Library not loaded: @rpath/R/lib/libRlapack.dylib
  Referenced from: /usr/local/miniconda/envs/test/lib/R/library/ade4/libs/ade4.dylib
  Reason: image not found

Citation info please

Great package and really interesting work (saw you at GI2018 in Sanger)

Is there a citation for the package and/or example dataset please?

Minimum and maximum number of organism for rarefaction curve

Hi there,

I am not sure I fully understand the partitioning and sampling used to build the rarefaction curves with the "ppanggolin rarefaction" command. By default, --min is set to 1 and --max to 100. I am running a dataset with >3000 genomes from many different species of the same genus, so I am not sure whether I should increase the --max value up to the number of genomes in my analysis. Typically, these curves have as many genomes on the X axis as there are genomes in the analysis. However, since the processes used by ppanggolin are quite different, I am not sure about that. Moreover, when running this command with default settings, the log says:

2021-01-26 06:36:57 rarefaction.py:l285 INFO Done sampling organisms in the pangenome, there are 2970 samples

Why 2970 samples? I have almost 3400 genomes in the dataset...
Could anyone advise me on this, please?

Thank you in advance

Install without Conda

Hello,

I'm trying to use PPanGGOLiN and asked my IT service to install it on our grid, but they answered that Conda is not used on our system. So my question is: is there another way to install PPanGGOLiN, and if so, where can I find the documentation for it, please?
Thank you very much

Have a nice day
Cheers
C.

Edge.remove doesn't work

>       self.source._edges[self.target].discard(self)
E       AttributeError: 'Edge' object has no attribute 'discard'

discard is a set method.
We could resolve this by using defaultdict(set) for GeneFamily _edges attribute.
But as the function is not used, I suggest deleting it.
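For reference, the `defaultdict(set)` option mentioned above could look roughly like the sketch below. The `GeneFamily` and `Edge` classes here are stripped-down stand-ins written for illustration only; they are not the project's actual classes, and the way edges are registered in `__init__` is an assumption.

```python
from collections import defaultdict


class GeneFamily:
    """Stand-in family: maps each neighbouring family to a set of edges."""

    def __init__(self, name):
        self.name = name
        self._edges = defaultdict(set)


class Edge:
    """Stand-in edge that registers itself on both of its families."""

    def __init__(self, source, target):
        self.source = source
        self.target = target
        source._edges[target].add(self)
        target._edges[source].add(self)

    def remove(self):
        # set.discard never raises, even if the edge was already removed.
        self.source._edges[self.target].discard(self)
        self.target._edges[self.source].discard(self)


fam_a, fam_b = GeneFamily("A"), GeneFamily("B")
edge = Edge(fam_a, fam_b)
edge.remove()
edge.remove()  # second call is a harmless no-op
remaining = len(fam_a._edges[fam_b]) + len(fam_b._edges[fam_a])
print(remaining)
```

Because each `_edges` value is a set, `discard` is available and the `AttributeError` from the report would not occur in this layout.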

Locus Tags vs Gene_ids

Hi!

I've been using PPanGGOLiN for some time and I'm really loving the tool and its ease of use.

I searched for this issue but wasn't able to find it, so here it goes.

We get the locus tags when using GenBank files; however, is there a way to obtain the gene identifiers directly?

Locus tags are identified with /locus_tag and genes with /gene in GenBank.

Getting errors with bad ORGANISM_FASTA_LIST

Hi,

I was having some trouble running the workflow and graph commands.

After removing spaces from my ORGANISM_FASTA_LIST file used as input, the workflow worked smoothly.

I thought maybe people would benefit if that was somehow explained in the README, as I spent quite some time trying to debug.

Thank you for any assistance you can provide.

bug: Pangenome.addOrganism cannot take string param.

PPanGGOLiN/ppanggolin/pangenome.py

Line 241 in a3c107e

self._orgGetter[newOrg.name] = newOrg

Here, 'newOrg' is a string, so it has no 'name' attribute.

This can easily be solved by using only 'newOrg' or 'org.name'.
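A slightly different option from the two named above is to let the method accept either a name or an object and normalise the argument first. The sketch below uses minimal stand-in `Organism` and `Pangenome` classes invented for illustration; only the `_orgGetter` attribute and the `addOrganism` name come from the issue.

```python
class Organism:
    """Stand-in organism holding only a name."""

    def __init__(self, name):
        self.name = name


class Pangenome:
    """Stand-in pangenome keeping organisms in a name-indexed dict."""

    def __init__(self):
        self._orgGetter = {}

    def addOrganism(self, newOrg):
        # Accept a plain string by wrapping it in an Organism first.
        if isinstance(newOrg, str):
            newOrg = Organism(newOrg)
        # Return the already registered organism if the name is known.
        existing = self._orgGetter.get(newOrg.name)
        if existing is not None:
            return existing
        self._orgGetter[newOrg.name] = newOrg
        return newOrg


pangenome = Pangenome()
first = pangenome.addOrganism("orgA")             # string argument
second = pangenome.addOrganism(Organism("orgA"))  # object with the same name
print(first.name, first is second)
```

Whether returning the existing organism on a name clash is the desired behaviour is a design choice made for this sketch, not something stated in the issue.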

Refseq GFF file reading and pseudogenes

When reading files downloaded from the RefSeq reference, a few elements raise errors when the GFF is read:

The comment lines starting with '#!', which RefSeq uses to indicate the software version and assembly version, raise an error.
The actual protein IDs are stored in ";protein_ID= ..." in the attributes field rather than in "ID=...". In RefSeq GFF files, ID=... is generic (cdsXXX).
When a CDS's Parent is a pseudogene, the CDS is not listed among the proteins of the organism (as it should be), but the program searches for it anyway and fails to find it.
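The three points above can be sketched as a small, tolerant reader. This is an illustration under several assumptions, not the project's parser: attribute keys are compared case-insensitively, pseudogenes are assumed to appear as features of type `pseudogene` listed before their child CDS, and the accession in the example is a made-up placeholder.

```python
def parse_gff_attributes(attribute_field):
    """Turn 'key=value;key=value' into a dict with lower-cased keys."""
    attributes = {}
    for item in attribute_field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attributes[key.strip().lower()] = value.strip()
    return attributes


def read_cds_ids(lines, id_key="protein_id"):
    """Collect CDS identifiers, skipping comments and pseudogene children."""
    pseudogene_ids = set()
    cds_ids = []
    for line in lines:
        # '#', '##' and '#!' lines are all treated as comments or directives.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        attributes = parse_gff_attributes(fields[8])
        if fields[2] == "pseudogene":
            pseudogene_ids.add(attributes.get("id"))
        elif fields[2] == "CDS":
            if attributes.get("parent") in pseudogene_ids:
                continue  # no protein is expected for this CDS
            # Prefer the protein identifier, fall back to the generic ID.
            cds_ids.append(attributes.get(id_key, attributes.get("id")))
    return cds_ids


example_lines = [
    "#!processor NCBI annotwriter\n",
    "chr\tRefSeq\tgene\t1\t90\t.\t+\t.\tID=gene0\n",
    "chr\tRefSeq\tCDS\t1\t90\t.\t+\t0\tID=cds0;Parent=gene0;protein_id=WP_000000001.1\n",
    "chr\tRefSeq\tpseudogene\t100\t190\t.\t+\t.\tID=gene1\n",
    "chr\tRefSeq\tCDS\t100\t190\t.\t+\t0\tID=cds1;Parent=gene1\n",
]
cds_ids = read_cds_ids(example_lines)
print(cds_ids)
```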

nucleotide or protein clustering?

Hi there,

Based on the article and the descriptions on GitHub, I thought that ppanggolin performed the clustering step on nucleotide gene sequences. However, I am confused by this sentence in this section:

PanGGOLiN will call MMseqs2 to run the clustering on all the gene sequences using their greedy set cover algorithm for the clustering step. You can tune its parameters using --identity(default 0.8) and --coverage(default 0.8). Both proteins have to be covered by at least the proportion indicated by --coverage

Here it says "Both proteins", which may be confusing...
Could you please confirm what kind of clustering ppanggolin performs?

Thank you in advance

Statistics PPanGGolin

Hi,

Is it possible to have a small statistics file (like "summary_statistics.txt" from Roary) or a command to extract this information from the .h5 file?

Bon courage =)

conda/mamba python_abi problems

Hi,
I'm not able to get a version of ppanggolin newer than 1.0.13 installed with either conda or mamba in a completely new conda env.
With mamba, python-3.9.2 and python_abi 3.9-1 are installed. But then mamba complains about the python_abi (Problem: nothing provides python_abi 3.6.* _cp36m needed by ppanggolin-1.1.136-py36h4c5857e_0). However, when downgrading to python 3.6 and python_abi 3.6, it complains again and requires python_abi 3.7 (Problem: nothing provides python_abi 3.7. *_cp37m needed by ppanggolin-1.1.136-py37hf01694f_0), and this circles back to python_abi 3.6 when upgrading to python 3.7. I've tried several times with different versions of both python and ppanggolin without any luck.
Is there anything I can do, or should I try installing from source?
Sofie

Update MMseqs command

In the latest release of MMseqs2, the createdb option --dont-shuffle has been replaced by --shuffle (createdb is called in ppanggolin/cluster/cluster.py).

How can the search parameters be modified?

Regards,
I am analyzing some bacterial strains in which I am sure there are RGPs, and so far PPanGGOLiN has worked wonders. However, I get many RGPs. Is there a way to make the search more stringent? Could the threshold be modified so as to obtain fewer RGPs?

Thanks in advance

missing annotations in graph

Hi All,

New user of PPanGGOLiN here. Really great work! However, I am having some issues with the --anno workflow. I provided a list of .gbk files that were annotated using NCBI, but the graph only displayed the generic CDS_xx labels.

It seems that the annotations I provided were not considered. Is Prokka the only annotator that works?

Hacktober -> more unit tests

Hello,
Hacktober brings me back to your project. :)

As proposed in the "Ideas & improvements", I'll make PRs to update the unit tests and improve coverage.

Enhance drawing spot in plot

Function

ppanggolin spot -p pangenome.h5 --draw_hotspots

Proposition 1

In the label column, the plot shows information about the genome(s) that contain the gene structure drawn on each line.

I propose replacing the number of genomes that contain the gene structure (1X, 2X, 3X, ...) in the label with either the complete list of genomes or a new label (like 'Structure_1', 'Structure_2') for each line, and writing the corresponding genome associations to a .tsv file for each spot.

Proposition 2

With annotations produced by Prokka, the gene annotation in PPanGGOLiN looks like MYSTRAIN_CDS_151354. This tag is a bit longer than what the image can draw, so the gene annotation is truncated in the plot.

Maybe it would be possible to make the top margin larger.

Have a good day.

Doubt: Unique genes

Hi,

Thank you for such cool software.
I have recently built a pangenome for 300 genomes and it runs nicely.

However I wonder if it is possible to query the pangenome to discover genes that are present uniquely in one genome.

Additionally, is it possible to create a gene absence/presence matrix for a single genome based on the pangenome's gene content?

Thank you
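Both questions can be approached from a presence/absence table once one is available. The sketch below assumes a hypothetical tab-separated layout (header row of genome names, first column holding the gene family, `0` or empty meaning absent); the real file written by PPanGGOLiN may differ, so the parsing would need adapting.

```python
import csv
import io


def families_unique_to_one_genome(handle):
    """Map each genome to the families found in that genome only."""
    reader = csv.reader(handle, delimiter="\t")
    genomes = next(reader)[1:]  # header: first cell labels the family column
    unique = {}
    for row in reader:
        family, values = row[0], row[1:]
        present = [
            genome
            for genome, value in zip(genomes, values)
            if value.strip() not in ("", "0")
        ]
        if len(present) == 1:
            unique.setdefault(present[0], []).append(family)
    return unique


# Toy table with three genomes and four families.
table = io.StringIO(
    "family\tg1\tg2\tg3\n"
    "famA\t1\t1\t1\n"
    "famB\t0\t1\t0\n"
    "famC\t1\t0\t0\n"
    "famD\t0\t1\t0\n"
)
unique_families = families_unique_to_one_genome(table)
print(unique_families)
```

A presence/absence vector for a single genome is then just the column for that genome in the same table.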
