ecogenomics / gtdbtk

GTDB-Tk: a toolkit for assigning objective taxonomic classifications to bacterial and archaeal genomes.

Home Page: https://ecogenomics.github.io/GTDBTk/

License: GNU General Public License v3.0

Python 99.53% Dockerfile 0.43% Shell 0.04%

archaea bacteria bioinformatics metagenomics nomenclature phylogenetics species-assignments taxonomy

gtdbtk's Issues

Consider if PhyloRank plots are required as output

I noticed that right now we create the PhyloRank outlier plots as part of GTDB-Tk. I'm not sure how informative these are, since they will never contain new user-defined groups. Generating them makes us dependent on mpld3 and jinja2, which probably isn't a big deal since both are on PyPI. We should think about whether generating these plots is useful or just confusing to the user.

inconsistent naming in prodigal-generated .faa file

When gtdbtk calls prodigal it calls the resulting .faa file <genome_basename>_protein.faa, but then the next step in the workflow searches for <genome_basename>_genes.faa and can't find it! I can't simply rename the file and then run the workflow again because the file is deleted automatically.

HMMER then seems to have issues with integer division or modulo by zero when looking for TIGRfams and pfams (related to inability to find correct file or something separate?).

error in "variable row[]: isn't properly declared"

Hi,

I have installed all the prerequisite packages and software and then used pip to install gtdbtk. But when I run gtdbtk, it fails after identifying Pfam protein families:

Any ideas what is wrong?

Thanks

gtdbtk classify_wf --cpus 2 --genome_dir /mnt/nfs/sharknado/gtdbtk_bin_analysis/gtdbtk_input/ --out_dir /mnt/nfs/sharknado/gtdbtk_bin_analysis/gtdbtk_output/ --debug
[2018-07-31 15:51:05] INFO: GTDB-Tk v0.0.7
[2018-07-31 15:51:05] INFO: gtdbtk classify_wf --cpus 2 --genome_dir /mnt/nfs/sharknado/gtdbtk_bin_analysis/gtdbtk_input/ --out_dir /mnt/nfs/sharknado/gtdbtk_bin_analysis/gtdbtk_output/ --debug
[2018-07-31 15:51:05] WARNING: Results are still being validated and taxonomic assignments may be incorrect! Use at your own risk!
[2018-07-31 15:51:05] INFO: Identifying markers in 9 genomes with 2 threads.
[2018-07-31 15:51:05] INFO: Running Prodigal to identify genes.
==> Finished processing 9 of 9 (100.0%) genomes.
[2018-07-31 15:51:34] INFO: Identifying TIGRFAM protein families.
==> Finished processing 9 of 9 (100.0%) genomes.
[2018-07-31 15:51:56] INFO: Identifying Pfam protein families.
==> Finished processing 9 of 9 (100.0%) genomes.
[2018-07-31 15:52:02] INFO: Done.
[2018-07-31 15:52:02] INFO: Aligning markers in 9 genomes with 2 threads.
('Unexpected error:', <type 'exceptions.UnboundLocalError'>)
Traceback (most recent call last):
File "/usr/bin/gtdbtk", line 328, in <module>
gt_parser.parse_options(args)
File "/usr/lib/python2.7/site-packages/gtdbtk/main.py", line 295, in parse_options
self.align(options)
File "/usr/lib/python2.7/site-packages/gtdbtk/main.py", line 173, in align
options.outgroup_taxon)
File "/usr/lib/python2.7/site-packages/gtdbtk/markers.py", line 380, in align
gtdb_taxonomy = Taxonomy().read(self.taxonomy_file)
File "/usr/lib/python2.7/site-packages/biolib/taxonomy.py", line 804, in read
self.logger.error('Failed to parse taxonomy file on line %d' % (row+1))
UnboundLocalError: local variable 'row' referenced before assignment
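
The traceback suggests that the error handler itself refers to `row` before any loop iteration has assigned it, which can happen when the failure occurs before the first line is parsed. A minimal sketch of that pattern and one way to avoid it follows; `read_taxonomy` is a hypothetical stand-in written for illustration, not biolib's actual `Taxonomy.read`:

```python
def read_taxonomy(lines):
    """Parse 'genome_id<TAB>taxonomy' lines (hypothetical stand-in)."""
    taxonomy = {}
    line_no = 0  # defined before the loop so the error path can always report it
    try:
        for line_no, line in enumerate(lines, start=1):
            # Unpacking raises ValueError if the line does not have exactly two fields.
            genome_id, taxa = line.rstrip("\n").split("\t")
            taxonomy[genome_id] = [t.strip() for t in taxa.split(";")]
    except ValueError:
        raise ValueError("Failed to parse taxonomy file on line %d" % line_no)
    return taxonomy
```

Because the counter is initialised up front, a malformed or empty file produces a readable parse error instead of an `UnboundLocalError`.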

Add script to rename the taxonomy in the phylogenetic tree

Hi,
Would it be possible to add/create a script to replace the GCF_XXXX/UBAXXX in the trees with the taxonomy, such as d__XXX; p__XXX; c__XXX; o__XXX etc...
Thanks!
Greg
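
One way such a script could work is sketched below. It is not part of GTDB-Tk. It assumes a tab-separated file of genome ID and taxonomy string, a Newick tree without whitespace around labels, and tip labels that are not already quoted; the function names are hypothetical.

```python
import re

def load_taxonomy(tsv_lines):
    """Map genome ID -> taxonomy string from 'id<TAB>taxonomy' lines."""
    mapping = {}
    for line in tsv_lines:
        line = line.strip()
        if line:
            genome_id, taxonomy = line.split("\t", 1)
            mapping[genome_id] = taxonomy
    return mapping

def relabel_newick(newick, taxonomy):
    """Replace tip labels found in `taxonomy` with a quoted 'id | taxonomy' label."""
    def substitute(match):
        opener, label = match.group(1), match.group(2)
        if label not in taxonomy:
            return match.group(0)  # leave unknown tips untouched
        # Quote the new label: taxonomy strings contain ';' and spaces,
        # which are special in Newick. (Assumes no single quotes in the taxonomy.)
        return "%s'%s | %s'" % (opener, label, taxonomy[label])
    # A tip label follows '(' or ',' and runs up to ':', ',', ')' or ';'.
    return re.sub(r"([(,])([^(),:;']+)", substitute, newick)
```

Internal node labels follow a closing parenthesis, so the pattern leaves them alone.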

Config to run with 64gb RAM?

Hi,

I'm interested in using this software for both genome classification and de novo phylogeny. However, the most powerful machine I have available is a desktop with Ubuntu 18.04 64gb RAM and 12 cores. Is it possible to run the GTDB-Tk with an alternative config to suit that machine?

Thank you,
V

Incomplete implementation of classify?

There is currently the following warning in the classify code:


"THERE RESULTS SHOULD BE REFINED TO SEE IF A GENOME CAN BE ASSIGNED TO ANY SISTER TAXON"
 "EX: PERHAPS THIS GENOME BELONGS TO A SISTER CLASS!"
 "NEED TO GET FIXED PhyloRank THRESHOLDS INTO THE MIX."

If this is still an outstanding issue, it needs to be addressed!

Ensure outgroup is not filtered

The outgroup taxa must always be retained. The user might specify a command like:

> gtdbtk de_novo_wf --genome_dir ./genomes --bac120_ms --outgroup_taxon p__Acetothermia --taxa_filter p__Firmicutes --out_dir de_novo_output

In this case, the outgroup taxon p__Acetothermia must be retained even though it isn't listed in the taxa filter.
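
A minimal sketch of the intended behaviour, assuming the taxonomy is held as a mapping from genome ID to a list of taxa; the function name and data layout are hypothetical and not taken from the GTDB-Tk code:

```python
def filter_genomes(taxonomy, taxa_filter, outgroup_taxon=None):
    """Return the IDs of genomes to keep.

    taxonomy: dict mapping genome ID -> list of taxa, e.g. ['d__Bacteria', 'p__Firmicutes']
    taxa_filter: iterable of taxa the user asked to retain
    outgroup_taxon: taxon that must survive filtering regardless of the filter
    """
    keep_taxa = set(taxa_filter)
    if outgroup_taxon:
        keep_taxa.add(outgroup_taxon)  # the outgroup is always retained
    return {gid for gid, taxa in taxonomy.items() if keep_taxa.intersection(taxa)}
```

Adding the outgroup to the set of retained taxa before filtering means the user does not have to repeat it in `--taxa_filter`.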

Pfam Error

Hi,

The GENERIC_PATH variable was changed to the absolute path of the directory containing the data downloaded from https://data.ace.uq.edu.au/public/gtdbtk/, as follows:
(GENERIC_PATH = "/projects/b1052/pythonenvs/python2.7/gtdb_database/release83/")
However, the error message in the log file suggests that the path was not read properly and the Pfam database could not be reached:

" summary_stats = summary_stats[summary_stats.keys()[0]]
AttributeError: 'NoneType' object has no attribute 'keys'

Error: File existence/permissions problem in trying to open HMM file /PATH/To/GtdbTK/data/markers/tigrfam/tigrfam.hmm.
HMM file /PATH/To/GtdbTK/data/markers/tigrfam/tigrfam.hmm not found (nor an .h3m binary of it); "

The path in the error above should be "/projects/b1052/pythonenvs/python2.7/GTDB/gtdb_database/release83/", instead of "/PATH/To/GtdbTK/data/"

Thanks!

gtdbtk check_install issues warnings

Currently check_install issues warnings even when the installation is correct:

[2018-09-18 14:45:27] WARNING: Check file Taxonomy: /srv/bio_data/gtdbtk/r86/release86/taxonomy/gtdb_taxonomy.tsv
[2018-09-18 14:45:27] WARNING: Taxonomy file..........OK

These should be reported as "info" unless something is actually wrong.
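
A small sketch of the requested behaviour using Python's standard `logging` module. The helper name `check_file` and the message wording are assumptions for illustration, not the actual check_install code:

```python
import logging
import os

def check_file(logger, label, path):
    """Log at INFO when the file exists and at WARNING only when it is missing."""
    if os.path.isfile(path):
        logger.info("%s file..........OK", label)
        return True
    logger.warning("%s file..........MISSING: %s", label, path)
    return False
```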

reference genomes in the database

Flag to allow genomes to be supplied as called genes

It would be good if users could supply genomes as called genes in amino acid space. Basically just a flag (--genes) to indicate the FASTA file contains genes and then Prodigal can be skipped. This isn't critical, but can be useful if you have already called genes and want to make sure these exact genes are being used.
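
As a rough sketch of how such a flag could be wired up with `argparse`: the `--genes` option is the one proposed above, while the parser name and the step names are hypothetical placeholders rather than GTDB-Tk internals.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(prog="identify_sketch")
    parser.add_argument("--genome_dir", required=True,
                        help="directory containing FASTA files")
    parser.add_argument("--genes", action="store_true",
                        help="input files contain called genes (amino acids); skip gene calling")
    return parser

def plan_steps(args):
    """Return the ordered list of steps to run; gene calling is skipped with --genes."""
    steps = [] if args.genes else ["prodigal"]
    steps += ["tigrfam_search", "pfam_search"]
    return steps
```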

Getting the following error when using de_novo_workflow.

[2018-05-13 00:26:35] INFO: Identifying TIGRFAM protein families.
==> Finished processing 5 of 5 (100.0%) genomes.
[2018-05-13 00:27:40] INFO: Identifying Pfam protein families.
==> Finished processing 5 of 5 (100.0%) genomes.
[2018-05-13 00:27:49] INFO: Done.
[2018-05-13 00:27:49] INFO: Aligning markers in 5 genomes with 1 threads.
[2018-05-13 00:27:50] INFO: Processing 5 genomes identified as bacterial.
[2018-05-13 00:27:53] INFO: Read concatenated alignment for 18109 GTDB genomes.
[2018-05-13 00:27:53] INFO: Filtered 18109 taxa based on assigned taxonomy.
[2018-05-13 00:28:34] INFO: Masking columns of multiple sequence alignment.
('Unexpected error:', <type 'exceptions.IndexError'>)
Traceback (most recent call last):
File "/home/ubuntu/.conda/envs/snowflakes/bin/gtdbtk", line 328, in <module>
gt_parser.parse_options(args)
File "/home/ubuntu/.conda/envs/snowflakes/lib/python2.7/site-packages/gtdbtk/main.py", line 266, in parse_options
self.align(options)
File "/home/ubuntu/.conda/envs/snowflakes/lib/python2.7/site-packages/gtdbtk/main.py", line 173, in align
options.outgroup_taxon)
File "/home/ubuntu/.conda/envs/snowflakes/lib/python2.7/site-packages/gtdbtk/markers.py", line 440, in align
self.logger.info('Masked alignment from %d to %d AA.' % (len(gtdb_msa.values()[0]),
IndexError: list index out of range

GTDB-Tk v0.1.1 does not generate *.bac120.classification.tsv file

GTDB-Tk v0.1.1 seems to not generate a main output file (*.bac120.classification.tsv).

I've used GTDB-Tk v0.0.7 for the last several months. The program worked very nicely and I always got output files as explained in the manual.

Recently, I upgraded GTDB-Tk from v0.0.7 to v0.1.1 by "pip install gtdbtk --upgrade", and also updated the reference data from R83 to R86.

After upgrading to v0.1.1, I tried analyzing some of my genomes that had already been analyzed by v0.0.7, just for comparison. But, GTDB-Tk v0.1.1 seems to not generate a main output file (.bac120.classification.tsv) and also the fastani result file (.bac120.fastani_results.tsv). The other output files are generated without any problem.

I attach the log files from v0.0.7 and v0.1.1. There were no errors in either version.

What can I do to get a main output file from v0.1.1?

Thanks.

gtdbtk_v0.0.7.log
gtdbtk_v0.1.1.log

File indicating why each classification was made

It would be extremely useful to end users if the 'classify' command produced a file indicating why each classification was made. In particular, there are a few different cases as indicated in the GTDB-Tk application note. We should identify these in the code and then inform the user appropriately. In particular, it is critical to know when a species assignment was or was not made based on Mash distances, and when a particular level of taxonomic novelty was determined based on RED values.

error in config_template.py with release86?

I think there might be an error with the config_template (line 74):

# MSA file names
CONCAT_BAC120 = MSA_FOLDER + "gtdb_r83_bac120.faa"
CONCAT_AR122 = MSA_FOLDER + "gtdb_r83_ar122.faa"

I quickly checked config_template.py, and there are other sections where r83 appears as well.

No genomes found in directory: path contains bins in FASTA format; --extension flag used to identify genomes

Hi, I encountered a problem: the software can't seem to find the genomes, which confuses me. Does anyone have an idea what's wrong? Thanks!

Unable to get mask from hmm align result file

[2018-09-12 13:52:07] INFO: GTDB-Tk v0.1.1
[2018-09-12 13:52:07] INFO: gtdbtk classify_wf --genome_dir ./genomeTest --out_dir gtdbtkOutput --extension .fasta
[2018-09-12 13:52:07] WARNING: Results are still being validated and taxonomic assignments may be incorrect! Use at your own risk!
[2018-09-12 13:52:07] INFO: Identifying markers in 11 genomes with 1 threads.
[2018-09-12 13:52:07] INFO: Running Prodigal to identify genes.
==> Finished processing 11 of 11 (100.0%) genomes.
[2018-09-12 13:59:58] INFO: Identifying TIGRFAM protein families.
==> Finished processing 11 of 11 (100.0%) genomes.
[2018-09-12 14:01:51] INFO: Identifying Pfam protein families.
==> Finished processing 11 of 11 (100.0%) genomes.
[2018-09-12 14:02:15] INFO: Done.
[2018-09-12 14:02:15] INFO: Aligning markers in 11 genomes with 1 threads.
[2018-09-12 14:02:15] INFO: Processing 11 genomes identified as bacterial.
[2018-09-12 14:02:30] INFO: Read concatenated alignment for 21263 GTDB genomes.
Process Process-9:
Traceback (most recent call last):
File "/nfs/vedanta/sw/packages/miniconda2/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
self.run()
File "/nfs/vedanta/sw/packages/miniconda2/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/nfs/vedanta/sw/packages/miniconda2/lib/python2.7/site-packages/gtdbtk/external/hmm_aligner.py", line 96, in _worker
marker_set_id)
File "/nfs/vedanta/sw/packages/miniconda2/lib/python2.7/site-packages/gtdbtk/external/hmm_aligner.py", line 192, in _run_multi_align
result_aligns.get(db_genome_id).update(self._run_align(gene_dict, db_genome_id))
File "/nfs/vedanta/sw/packages/miniconda2/lib/python2.7/site-packages/gtdbtk/external/hmm_aligner.py", line 229, in _run_align
result = self._get_aligned_marker(marker_info.get("gene"), proc.stdout)
File "/nfs/vedanta/sw/packages/miniconda2/lib/python2.7/site-packages/gtdbtk/external/hmm_aligner.py", line 268, in _get_aligned_marker
raise Exception("Unable to get mask from hmm align result file")
Exception: Unable to get mask from hmm align result file

pfam_search.pl needs to be added to project

The pfam_search.pl script is a custom script, so we need to add it to the GTDB-Tk project and make sure it is called appropriately.

bioconda recipe

Request to have a bioconda recipe created for ease of installation.

https://bioconda.github.io/index.html

Fail at gene calling

Hello Pierre,
I most likely have an issue with my installation, but just wanted to check if there is something else happening. I'd appreciate any ideas you might have, many thanks! My run fails at prodigal, with the following log:

[2018-08-30 22:54:59] INFO: Running Prodigal to identify genes.
Process Process-3:1:1:
Traceback (most recent call last):
File "/bioinf/home/pyilmaz/miniconda3/envs/gtdb_toolkit/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
self.run()
File "/bioinf/home/pyilmaz/miniconda3/envs/gtdb_toolkit/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/bioinf/home/pyilmaz/miniconda3/envs/gtdb_toolkit/lib/python2.7/site-packages/biolib/parallel.py", line 107, in __producer
rtn = producer_callback(dataItem)
File "/bioinf/home/pyilmaz/miniconda3/envs/gtdb_toolkit/lib/python2.7/site-packages/biolib/external/prodigal.py", line 122, in _producer
prodigalParser = ProdigalGeneFeatureParser(gff_file_tmp)
File "/bioinf/home/pyilmaz/miniconda3/envs/gtdb_toolkit/lib/python2.7/site-packages/biolib/external/prodigal.py", line 285, in __init__
self.__parseGFF(filename)
File "/bioinf/home/pyilmaz/miniconda3/envs/gtdb_toolkit/lib/python2.7/site-packages/biolib/external/prodigal.py", line 302, in __parseGFF
self.translationTable = line.split(';')[4]
IndexError: list index out of range
Process Process-3:
Traceback (most recent call last):
File "/bioinf/home/pyilmaz/miniconda3/envs/gtdb_toolkit/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
self.run()
File "/bioinf/home/pyilmaz/miniconda3/envs/gtdb_toolkit/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/bioinf/home/pyilmaz/miniconda3/envs/gtdb_toolkit/lib/python2.7/site-packages/gtdbtk/external/prodigal.py", line 100, in _worker
rtn_files = self._run_prodigal(genome_id, file_path)
File "/bioinf/home/pyilmaz/miniconda3/envs/gtdb_toolkit/lib/python2.7/site-packages/gtdbtk/external/prodigal.py", line 64, in _run_prodigal
summary_stats = summary_stats[summary_stats.keys()[0]]
AttributeError: 'NoneType' object has no attribute 'keys'
Process Process-2:1:1:
Traceback (most recent call last):
File "/bioinf/home/pyilmaz/miniconda3/envs/gtdb_toolkit/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
self.run()
File "/bioinf/home/pyilmaz/miniconda3/envs/gtdb_toolkit/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/bioinf/home/pyilmaz/miniconda3/envs/gtdb_toolkit/lib/python2.7/site-packages/biolib/parallel.py", line 107, in __producer
rtn = producer_callback(dataItem)
File "/bioinf/home/pyilmaz/miniconda3/envs/gtdb_toolkit/lib/python2.7/site-packages/biolib/external/prodigal.py", line 122, in _producer
prodigalParser = ProdigalGeneFeatureParser(gff_file_tmp)
File "/bioinf/home/pyilmaz/miniconda3/envs/gtdb_toolkit/lib/python2.7/site-packages/biolib/external/prodigal.py", line 285, in __init__
self.__parseGFF(filename)
File "/bioinf/home/pyilmaz/miniconda3/envs/gtdb_toolkit/lib/python2.7/site-packages/biolib/external/prodigal.py", line 302, in __parseGFF
self.translationTable = line.split(';')[4]
IndexError: list index out of range
Process Process-2:
Traceback (most recent call last):
File "/bioinf/home/pyilmaz/miniconda3/envs/gtdb_toolkit/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
self.run()
File "/bioinf/home/pyilmaz/miniconda3/envs/gtdb_toolkit/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/bioinf/home/pyilmaz/miniconda3/envs/gtdb_toolkit/lib/python2.7/site-packages/gtdbtk/external/prodigal.py", line 100, in _worker
rtn_files = self._run_prodigal(genome_id, file_path)
File "/bioinf/home/pyilmaz/miniconda3/envs/gtdb_toolkit/lib/python2.7/site-packages/gtdbtk/external/prodigal.py", line 64, in _run_prodigal
summary_stats = summary_stats[summary_stats.keys()[0]]
AttributeError: 'NoneType' object has no attribute 'keys'

[2018-08-30 22:55:13] INFO: Identifying TIGRFAM protein families.
[2018-08-30 22:55:13] ERROR: integer division or modulo by zero
('Unexpected error:', <type 'exceptions.ZeroDivisionError'>)
Traceback (most recent call last):
File "/bioinf/home/pyilmaz/miniconda3/envs/gtdb_toolkit/bin/gtdbtk", line 362, in <module>
gt_parser.parse_options(args)
File "/bioinf/home/pyilmaz/miniconda3/envs/gtdb_toolkit/lib/python2.7/site-packages/gtdbtk/main.py", line 319, in parse_options
self.identify(options)
File "/bioinf/home/pyilmaz/miniconda3/envs/gtdb_toolkit/lib/python2.7/site-packages/gtdbtk/main.py", line 161, in identify
options.prefix)
File "/bioinf/home/pyilmaz/miniconda3/envs/gtdb_toolkit/lib/python2.7/site-packages/gtdbtk/markers.py", line 232, in identify
tigr_search.run(gene_files)
File "/bioinf/home/pyilmaz/miniconda3/envs/gtdb_toolkit/lib/python2.7/site-packages/gtdbtk/external/tigrfam_search.py", line 152, in run
self.cpus_per_genome = max(1, self.threads / len(gene_files))
ZeroDivisionError: integer division or modulo by zero
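
The final traceback shows the division failing because `gene_files` is empty, presumably since gene calling failed for every genome in the earlier steps. A minimal sketch of a guard that reports this directly; the helper name is hypothetical:

```python
def cpus_per_genome(threads, gene_files):
    """Split the available threads across genomes, always using at least one per genome."""
    if not gene_files:
        # Fail with an explicit message instead of a ZeroDivisionError further down.
        raise ValueError("No gene files to search; gene calling may have failed for all genomes.")
    return max(1, threads // len(gene_files))
```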

Is it possible to get alignment of the submitted genomes that has the same length as genomes in GTDB ARB file?

Hi,

I'd like to try inserting my genomes into the GTDB ARB tree using several functions available in ARB program. To be able to do this, I need alignment of my genomes that has the same length as genomes in the GTDB ARB file.
But, the alignment length of GTDBTk output file (*.user_msa.fasta) is ~5,000 aa, while the alignment length of GTDB ARB file is ~35,000 aa.
(This difference seems to occur because GTDB and GTDBTk use different masks.)

Is there any method to get an alignment of my genomes that has the same length as the genomes in the GTDB ARB file?

Thanks.

ImportError: No module named config

Hi,
I am interested in running your tool for taking a bin, aligning it and doing the taxonomy. I followed the instructions for installation, and now that I am checking if it runs, I have this error that I do not know how to fix:
File "/home/diana/miniconda3/envs/binner/lib/python2.7/site-packages/gtdbtk/markers.py", line 35, in <module>
import config.config as Config
ImportError: No module named config

Naively, I tried doing this: pip install config, but it did not work. I also tried renaming the config folder I have, but that did not work either. I am sorry if this is a dumb question, but I am really interested in using your tool.

Kind regards,
Diana

FastTree is not on the system path.

Hi
When I ran the de novo workflow, it showed the error messages below:

[2018-03-16 23:30:12] INFO: GTDB-Tk v0.0.4b3
[2018-03-16 23:30:12] INFO: gtdbtk de_novo_wf --genome_dir my_genome --bac120_ms --outgroup_taxon p__Acetothermia --out_dir de_novo_test
[2018-03-16 23:30:12] WARNING: Results are still being validated and taxonomic assignments may be incorrect! Use at your own risk!
FastTree is not on the system path.
Controlled exit resulting from an unrecoverable error or warning.

Could you help me fix this problem? Thanks a lot!

Looking forward to your reply.
Regards,
Alice

Add option to skip the FastANI step

Hi,
I wondered if it would be possible to add an option to skip the fastANI step as this would allow avoiding downloading the 25 Gb compressed file that seems to be only used for this step?
Best
Greg

Need to validate Genome IDs

The classify_wf and identify commands allow genome files to be specified through a batch file, with each genome given a specified identifier. Since these identifiers will ultimately end up in the Newick tree, they need to be compatible with the Newick format. We should explicitly check this before running the genomes. Namely, identifiers should contain none of the following characters, which have special meaning in the Newick format: ,;()"

We should also verify they don't start with a GB_ or RS_.
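
A minimal sketch of such a check. It tests only the characters and prefixes listed above; colons, square brackets, single quotes and whitespace are also special in Newick and could be added to the set. The names are hypothetical.

```python
NEWICK_SPECIAL = set(',;()"')        # characters listed in this issue
RESERVED_PREFIXES = ("GB_", "RS_")   # prefixes used for GTDB reference genomes

def validate_genome_id(genome_id):
    """Return a list of problems; an empty list means the ID looks usable."""
    problems = []
    bad_chars = sorted(NEWICK_SPECIAL.intersection(genome_id))
    if bad_chars:
        problems.append("contains reserved characters: %s" % " ".join(bad_chars))
    if genome_id.startswith(RESERVED_PREFIXES):
        problems.append("starts with a reserved prefix (GB_ or RS_)")
    return problems
```

Running this over every identifier in the batch file before gene calling starts would let the run stop early with a clear message.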

How can I get the bac/ar120.classification.tsv?

Hi, I used the code below, but I only got the following files: 1. gtdbtk_ar122_markers_summary.tsv 2. gtdbtk_bac120_markers_summary.tsv 3. gtdbtk.log 4. marker_genes. Where can I find the detailed species information? Thanks!

Scaled PhyloRank trees should NOT be provided to Users

The scaled PhyloRank tree is meant for internal use to help with curation. I'd rather not make these available since they can have negative branch lengths as indicated by the following message. We should make sure this tree is not being created and suppress this warning message.


WARNING: Not all branches are positive after scaling.

Unable to get mask from hmm align result file... still problematic

[2018-09-13 13:10:29] INFO: GTDB-Tk v0.1.1
[2018-09-13 13:10:29] INFO: gtdbtk classify_wf --genome_dir ./sampleGenome --out_dir gtdbtkOutput -x fasta
[2018-09-13 13:10:29] WARNING: Results are still being validated and taxonomic assignments may be incorrect! Use at your own risk!
[2018-09-13 13:10:29] INFO: Identifying markers in 1 genomes with 1 threads.
[2018-09-13 13:10:29] INFO: Running Prodigal to identify genes.
==> Finished processing 1 of 1 (100.0%) genomes.
[2018-09-13 13:10:46] INFO: Identifying TIGRFAM protein families.
==> Finished processing 1 of 1 (100.0%) genomes.
[2018-09-13 13:10:55] INFO: Identifying Pfam protein families.
==> Finished processing 1 of 1 (100.0%) genomes.
[2018-09-13 13:11:00] INFO: Done.
[2018-09-13 13:11:00] INFO: Aligning markers in 1 genomes with 1 threads.
[2018-09-13 13:11:00] INFO: Processing 1 genomes identified as bacterial.
[2018-09-13 13:11:08] INFO: Read concatenated alignment for 21263 GTDB genomes.
Process Process-9:
Traceback (most recent call last):
File "/nfs/vedanta/sw/packages/miniconda2/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
self.run()
File "/nfs/vedanta/sw/packages/miniconda2/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/nfs/vedanta/sw/packages/miniconda2/lib/python2.7/site-packages/gtdbtk/external/hmm_aligner.py", line 96, in _worker
marker_set_id)
File "/nfs/vedanta/sw/packages/miniconda2/lib/python2.7/site-packages/gtdbtk/external/hmm_aligner.py", line 192, in _run_multi_align
result_aligns.get(db_genome_id).update(self._run_align(gene_dict, db_genome_id))
File "/nfs/vedanta/sw/packages/miniconda2/lib/python2.7/site-packages/gtdbtk/external/hmm_aligner.py", line 229, in _run_align
result = self._get_aligned_marker(marker_info.get("gene"), proc.stdout)
File "/nfs/vedanta/sw/packages/miniconda2/lib/python2.7/site-packages/gtdbtk/external/hmm_aligner.py", line 268, in _get_aligned_marker
raise Exception("Unable to get mask from hmm align result file")
Exception: Unable to get mask from hmm align result file

[Errno 2] No such file or directory: 'gtdbtk_outdir/gtdbtk_bac120_markers_summary.tsv'

When I run gtdbtk classify_wf, it gives the error message below:

[2018-03-07 10:42:06] INFO: GTDB-Tk v0.0.3
[2018-03-07 10:42:06] INFO: gtdbtk classify_wf --genome_dir ./test1/ --out_dir gtdbtk_outdir -x fa --min_perc_aa 50 --prefix metabat --cpus 12
[2018-03-07 10:42:06] WARNING: Results are still being validated and taxonomic assignments may be incorrect! Use at your own risk!
[2018-03-07 10:42:06] INFO: Identifying markers in 1 genomes with 12 threads.
[2018-03-07 10:42:06] INFO: Running Prodigal to identify genes.
==> Finished processing 1 of 1 (100.0%) genomes.
[2018-03-07 10:42:18] INFO: Identifying TIGRfam protein families.
==> Finished processing 1 of 1 (100.0%) genomes.
[2018-03-07 10:42:26] INFO: Identifying Pfam protein families.
==> Finished processing 1 of 1 (100.0%) genomes.
[2018-03-07 10:42:28] INFO: Done.
[2018-03-07 10:42:28] INFO: Aligning markers in 1 genomes with 12 threads.
[2018-03-07 10:42:28] ERROR: [Errno 2] No such file or directory: 'gtdbtk_outdir/gtdbtk_bac120_markers_summary.tsv'
[2018-03-07 10:42:28] ERROR: GTDB-Tk has encountered an error.
[2018-03-07 10:42:28] INFO: Done.
[2018-03-07 10:42:28] INFO: Done.

As you can see, the --prefix parameter didn't work. When I changed "--prefix metabat" to "--prefix gtdbtk", it worked fine, so maybe it's a bug.
Also, GTDB-Tk needs Mash, but there is no mention of it in the README.md file of this repository.
Thanks.

Unable to get mask from hmm align result file - Why would this error occur?

help is not working

zhang@thinker:~/miniconda3/lib/python3.6/site-packages/gtdbtk/config$ gtdbtk classify_wf -h
File "/home/zhang/miniconda3/bin/gtdbtk", line 64
''' % version()
^
SyntaxError: invalid syntax

GTDB genomes are being filtered

When doing the alignment, the output indicates that GTDB genomes are being filtered:
Pruned 27 taxa with amino acids in <50.0% of columns in filtered MSA.

It doesn't really make sense to have GTDB genomes that will just be filtered. We should check if there are better representatives that won't get filtered. If not, then we should decide if we want to keep these despite their low number of columns or just accept that these species will be missing from the tree.

importing results of align into ARB tree.

Hi,
I currently don't have access to 90 GB of RAM to run the whole tree module, but I was able to run identify and align. I have the big GTDB tree loaded into ARB and I was wondering what the best way is to quickly add my aligned sequences to ARB. Which format should I select for the alignment?

Thanks for your help.

Cris

some gtdb classifications result in loss of species classification?

Specifically I am concerned with the following genome:

ncbi gid: GCF_000158075.1
ncbi taxonomy:
d__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Lachnospiraceae; g__Lachnoclostridium; s__[Clostridium] asparagiforme
gtdb taxonomy:
d__Bacteria; p__Firmicutes_A; c__Clostridia; o__Lachnospirales; f__Lachnospiraceae; g__Clostridium_M; s__

Keeping the 'asparagiforme' species classification would not have overlapped with any other classifications so I'm confused as to why it was dropped with the gtdb classification? Are there any other instances of this that you are aware of and is there any underlying pattern and reasoning behind this type of change?

explanation of incongruent classifications?

Why do fastANI and pplacer sometimes give conflicting results? I have cases where fastANI gives > 99% ANI, but all pplacer results are N/A. What does that mean?

Also, I have at least one case where we think we have a novel species based on our own fastANI calculations and tree inferences, and gtdbtk seems to agree that it's novel, but it doesn't report as much information as I would like. gtdbtk reports 'None' for all fastANI and pplacer fields, when I know through my own analyses that its closest relative by fastANI ANI is ~90%. I assume gtdbtk only reports relatives with > 95% ANI? I think it may still be useful to report the closest relative even if it is not a same-species relative.

But am I correct in assuming that I may have a novel species with the following line:
ve800_81A6_hybrid_assembly_10Apr2018 d__Bacteria;p__Fusobacteriota;c__Fusobacteriia;o__Fusobacteriales;f__Fusobacteriaceae;g__Fusobacterium_A;s__ None None None None None None None None taxonomic classification fully defined by topology None None

ANI results with type strains of species

This will be relevant once the GTDB reference tree is composed entirely of type strains of species and nominated GTDB type genomes for species without type material. At this point in time we should ensure that a genome is only assigned to a species if: i) the type genome of the species has the highest ANI, ii) the ANI >= 95%, and importantly, iii) the ANI is larger than the ANI from the selected type genome to any other type genome. Point iii is critical to ensure assignments aren't made that are likely to result in polyphyly in a phylogenetic tree.
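
The three conditions can be sketched as follows. This is an illustration of the rule as stated here, not the GTDB-Tk implementation; the function name and the dictionary layout are assumptions.

```python
def assign_species(query_ani, type_to_type_ani, min_ani=95.0):
    """Return the species to assign, or None if the criteria are not met.

    query_ani: dict mapping species -> ANI between the query and that species' type genome
    type_to_type_ani: dict mapping species -> {other species: ANI between the two type genomes}
    """
    if not query_ani:
        return None
    # (i) the candidate is the type genome with the highest ANI to the query
    best = max(query_ani, key=query_ani.get)
    best_ani = query_ani[best]
    # (ii) the ANI must reach the species threshold
    if best_ani < min_ani:
        return None
    # (iii) the ANI must exceed the ANI from that type genome to any other type genome
    closest_other = max(type_to_type_ani.get(best, {}).values(), default=0.0)
    if best_ani <= closest_other:
        return None
    return best
```

Condition (iii) is what rejects an assignment when two type genomes are closer to each other than the query is to its best hit.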

Interpretation of the results

Hi @dparks1134 ,

In the file ‘gtdbtk.ar122.summary.tsv’, the RED values are as follows:

user_genome classification_method red_value
D6_st10_Refined_2 RED 0.922680053818
B06_st5_Refined_1 RED 0.922131231105
H8_st7_Refined_2 RED 0.921282265825
H13_st5_Refined_3 RED 0.921264499811

And I also found this info in the file of 'gtdbtk.ar122.red_dictionary.tsv’:
Phylum 0.227534847987
Class 0.35320600176
Order 0.553905413445
Family 0.732652342427
Genus 0.910973866402

Does that mean these four genomes (D6, B06, H8 and H13) belong to the same genus, but it is not certain whether they belong to the same species?

Thank you very much!

Information about filtered genomes

If a genome is filtered since it doesn't meet one of the filtering criteria, it would be good to report this in an output file similar to what is done in the GTDB. In particular, it is nice to know if a User genome is being filtered because it has just less than 50% of bases in alignment columns or is way under this threshold. See the "GTDB tree create" output for an example of what could/should be provided to users.
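
A sketch of what such a report could be built from, assuming the alignment is available as a mapping from genome ID to aligned sequence, with '-' and '.' as gap characters. The function names are hypothetical, and the 50% default mirrors the threshold mentioned above.

```python
def percent_aa(seq):
    """Percentage of alignment columns holding an amino acid (not '-' or '.')."""
    if not seq:
        return 0.0
    aa_count = sum(1 for ch in seq if ch not in "-.")
    return 100.0 * aa_count / len(seq)

def filter_report(msa, min_perc_aa=50.0):
    """Split an MSA into kept sequences and a report of filtered genomes.

    Returns (kept, filtered), where `filtered` maps genome ID -> percent of
    columns with an amino acid, so users can see how far below the threshold it fell.
    """
    kept, filtered = {}, {}
    for genome_id, seq in msa.items():
        perc = percent_aa(seq)
        if perc < min_perc_aa:
            filtered[genome_id] = perc
        else:
            kept[genome_id] = seq
    return kept, filtered
```

Writing the `filtered` mapping to a TSV would show whether a genome sits just under the threshold or far below it.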

IndexError with FastANI

Hi,

currently all my samples are failing and I am getting the same error for all samples.

[2018-07-26 18:13:03] INFO: Calculating Average Nucleotide Identity using FastANI.
list index out of range
Process Process-111:
Traceback (most recent call last):
  File "/data/miniconda3/envs/gtdbtk/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
    self.run()
  File "/data/miniconda3/envs/gtdbtk/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/data/miniconda3/envs/gtdbtk/lib/python2.7/site-packages/gtdbtk/classify.py", line 389, in _fastaniWorker
    dict_parser_distance = self._calculate_fastani_distance(items, genomes)
  File "/data/miniconda3/envs/gtdbtk/lib/python2.7/site-packages/gtdbtk/classify.py", line 618, in _calculate_fastani_distance
    dict_parser_distance = self._parse_fastani_results(os.path.join(self.tmp_output_dir,'results.tab'),list_leaf)
  File "/data/miniconda3/envs/gtdbtk/lib/python2.7/site-packages/gtdbtk/classify.py", line 648, in _parse_fastani_results
    ref_genome = os.path.basename(info[1]).replace(Config.FASTANI_GENOMES_EXT,"")
IndexError: list index out of range
GTDB-Tk has stopped before finishing

I looked into the fastANI result file and for some reason the files are empty:

$ cat gtdbtk_BS090.bac120.fastani_results.tsv 
User genome     Reference genome        ANI

Do you have any idea where this error can come from?

fastani is not on the system path

Hi
I have installed GTDB-Tk following the directions, but when I ran "gtdbtk classify_wf", it showed the error messages below:

INFO: GTDB-TK v0.0.4b1
INFO: gtdbtk classify_wf --cpus 24 --genome_dir my_genomes --out_dir gtdbtk_output
WARNING: Results are still being validated and taxonomic assigmnts may be incorrect!Use at your own risk!
fastani is not on the system path.
controled exit resulting from an unrecoverable error or warning.

I have tried my best to solve the problem, but the error persists. Could you give me some suggestions on how to fix it?

Thanks a lot.
Looking forward to your reply.
Regards,
Alice
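A quick way to check whether an executable is visible from the environment that runs gtdbtk is sketched below. It uses the standard-library `shutil.which` (Python 3.3 or later). The helper name is illustrative, and both spellings of the FastANI binary are checked because the exact name GTDB-Tk looks for is an assumption here.

```python
import shutil


def missing_executables(names):
    """Return the names that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # Illustrative list only; this is not GTDB-Tk's full dependency list.
    missing = missing_executables(["fastANI", "fastani"])
    print("Not on PATH:", missing or "none")
```

If the binary is reported as missing, adding its install directory to PATH in the same shell that launches gtdbtk is the usual remedy.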

de_novo_wf Bootstrap values

Hi, I've been using de_novo_wf and it has been very helpful for visualising the data! Is there a way to incorporate bootstrapping into this workflow? Any suggestions?

Example dataset

Could you please make an example dataset available, together with the relevant output files, to give an idea of what the final .tsv looks like? Thanks!

Output names based on input name for infer, root, and decorate.

The infer command takes an output directory and an input MSA. The name of the output files should be based on the name of the input MSA. For example, gtdbtk.bac120.msa.faa should produce a tree called gtdbtk.bac120.msa.tree. The reason for this is just in case the User runs the infer command twice (once for bacteria and once for archaea) and specifies the same output directory. It also makes it very explicit where the files come from.

A similar approach should be used for the 'root' and 'decorate' commands.
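The naming rule proposed above could be sketched as follows. This is only an illustration of the idea; the function name and the `.tree` suffix default are assumptions, not existing GTDB-Tk code.

```python
import os


def tree_path_for_msa(msa_file, out_dir, suffix=".tree"):
    """Derive the output tree path from the input MSA file name."""
    base = os.path.basename(msa_file)
    # Drop only the final extension, e.g. ".faa" or ".fasta".
    stem, _ext = os.path.splitext(base)
    return os.path.join(out_dir, stem + suffix)
```

Because the prefix comes from the input, bacterial and archaeal runs written to the same directory produce different file names.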

Respect genome IDs in batch file

The genome batch file provided to the identify command specifies the location of a FASTA file and a desired genome ID. Right now, I don't think this genome ID is being respected in all output files produced by the GTDB-Tk. This is probably going to be a bit hard to address, since genome IDs are often taken straight from the FASTA file names at the moment. Either we need to modify the code to handle these requested genome IDs, or simplify the genome batch file so that it takes only the path to each FASTA file.
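One way to keep the requested IDs available throughout the pipeline is to read the batch file once into a mapping that later steps consult. The sketch below assumes one genome per line, with the FASTA path and the genome ID separated by a tab; that layout and the function name are assumptions for illustration.

```python
def read_batch_file(lines):
    """Map each requested genome ID to its FASTA path.

    Assumes one genome per line: FASTA path, a tab, then the genome ID.
    """
    genomes = {}
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError("line %d: expected 2 tab-separated fields" % line_no)
        fasta_path, genome_id = fields
        if genome_id in genomes:
            raise ValueError("line %d: duplicate genome ID %s" % (line_no, genome_id))
        genomes[genome_id] = fasta_path
    return genomes
```

Output files can then be named from the dictionary keys rather than from the FASTA file names.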

Always report 7 rank taxonomy strings

It would be helpful if unassigned ranks were still indicated in the taxonomy strings produced by GTDB-Tk. Many downstream programs assume a complete 7 rank taxonomy string with missing rank information explicitly indicated by the rank prefix.

For example,
d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__;g__;s__

is preferable to the current output of:
d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales
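Padding a truncated string out to seven ranks could be sketched as follows. The function name is illustrative, and the code assumes the ranks appear in the standard d, p, c, o, f, g, s order with no gaps in the middle.

```python
RANK_PREFIXES = ("d__", "p__", "c__", "o__", "f__", "g__", "s__")


def pad_taxonomy(taxonomy):
    """Return a 7-rank taxonomy string, adding empty ranks where missing."""
    ranks = [r.strip() for r in taxonomy.split(";") if r.strip()]
    if len(ranks) > len(RANK_PREFIXES):
        raise ValueError("more than 7 ranks: %s" % taxonomy)
    # Every rank that is present must carry the prefix expected at its position.
    for rank, prefix in zip(ranks, RANK_PREFIXES):
        if not rank.startswith(prefix):
            raise ValueError("unexpected rank order: %s" % rank)
    # Append the bare prefixes for the ranks that are missing.
    return ";".join(ranks + list(RANK_PREFIXES[len(ranks):]))
```

Strings that already have seven ranks pass through unchanged.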

ImportError: No module named gtdblib.trimming

Hi,

I have installed all the prerequisite packages and software and then used pip to install gtdbtk. But when I run gtdbtk, it cannot find gtdblib.trimming. The contents of the gtdbtk lib directory are listed below. What's wrong with my installation?

Thanks for your help.

ls /usr/local/lib/python2.7/dist-packages/gtdbtk -lh
-rw-r--r-- 1 root root 33K Aug 17 06:38 classify.py
-rw-r--r-- 1 root root 23K Oct 7 23:18 classify.pyc
drwxr-sr-x 2 root root 4.0K Oct 7 23:18 config
drwxr-sr-x 2 root root 4.0K Oct 7 23:18 external
-rw-r--r-- 1 root root 1.5K Aug 23 04:22 __init__.py
-rw-r--r-- 1 root root 503 Oct 7 23:18 __init__.pyc
-rw-r--r-- 1 root root 9.4K Aug 23 04:27 main.py
-rw-r--r-- 1 root root 6.4K Oct 7 23:18 main.pyc
-rw-r--r-- 1 root root 25K Aug 17 06:38 markers.py
-rw-r--r-- 1 root root 14K Oct 7 23:18 markers.pyc
-rw-r--r-- 1 root root 34K Aug 23 02:27 relative_distance.py
-rw-r--r-- 1 root root 23K Oct 7 23:18 relative_distance.pyc
-rw-r--r-- 1 root root 5.8K Mar 29 2017 reroot_tree.py
-rw-r--r-- 1 root root 3.8K Oct 7 23:18 reroot_tree.pyc
-rw-r--r-- 1 root root 4.0K Aug 17 06:38 tools.py
-rw-r--r-- 1 root root 4.6K Oct 7 23:18 tools.pyc
-rw-r--r-- 1 root root 24 Aug 23 04:20 VERSION

Sequence XX doesn't overlap any reference sequences

Hi
While analysing different MAGs with potential plasmids, I get this error:

[2018-05-30 12:27:01] INFO: Placing 355 bacterial genomes into reference tree with pplacer (be patient).
Uncaught exception: Failure("Sequence N2F3_MBin.34 doesn't overlap any reference sequences.")
Fatal error: exception Failure("Sequence N2F3_MBin.34 doesn't overlap any reference sequences.")
Uncaught exception: Sys_error("All_out/pplacer/pplacer.bac120.json: No such file or directory")
Fatal error: exception Sys_error("All_out/pplacer/pplacer.bac120.json: No such file or directory")
GTDB-Tk has stopped before finishing

Would it be possible to skip the sequence without shutting down the software?
Cheers
Greg
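One plausible cause of this pplacer failure is a query whose aligned sequence contains only gap characters, so it shares no columns with the reference alignment. Under that assumption, such genomes could be set aside before placement, as in the sketch below. The function name and the dict-based MSA representation are illustrative and are not GTDB-Tk's internal API.

```python
def split_placeable(msa, gap_chars="-."):
    """Split an MSA dict into sequences with residues and all-gap sequences."""
    keep, skipped = {}, []
    for seq_id, seq in msa.items():
        # Stripping gap characters leaves an empty string for all-gap rows.
        if seq.strip(gap_chars):
            keep[seq_id] = seq
        else:
            skipped.append(seq_id)
    return keep, skipped
```

The skipped IDs could be reported to the user while the remaining genomes are still placed.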

de novo workflow not working

I am trying to run the de novo workflow, but the unrooted tree is not being generated. This is my run:

:/mnt/FastTreeMP$ gtdbtk de_novo_wf --genome_dir /mnt/genomes --bac120_ms --outgroup_taxon c__Betaproteobacteria --out_dir filtered_tree --taxa_filter o__Rhodospirillales --cpus 4 -x fasta
[2018-05-10 08:17:40] INFO: GTDB-Tk v0.0.7
[2018-05-10 08:17:40] INFO: gtdbtk de_novo_wf --genome_dir /mnt/genomes --bac120_ms --outgroup_taxon c__Betaproteobacteria --out_dir filtered_tree --taxa_filter o__Rhodospirillales --cpus 4 -x fasta
[2018-05-10 08:17:40] WARNING: Results are still being validated and taxonomic assignments may be incorrect! Use at your own risk!
[2018-05-10 08:17:40] INFO: Identifying markers in 5 genomes with 4 threads.
[2018-05-10 08:17:40] INFO: Running Prodigal to identify genes.
==> Finished processing 5 of 5 (100.0%) genomes.
[2018-05-10 08:18:46] INFO: Identifying TIGRFAM protein families.
==> Finished processing 5 of 5 (100.0%) genomes.
[2018-05-10 08:19:15] INFO: Identifying Pfam protein families.
==> Finished processing 5 of 5 (100.0%) genomes.
[2018-05-10 08:19:20] INFO: Done.
[2018-05-10 08:19:20] INFO: Aligning markers in 5 genomes with 4 threads.
[2018-05-10 08:19:20] INFO: Processing 5 genomes identified as bacterial.
[2018-05-10 08:19:24] INFO: Read concatenated alignment for 18109 GTDB genomes.
[2018-05-10 08:19:24] INFO: Filtered 18048 taxa based on assigned taxonomy.
[2018-05-10 08:19:56] INFO: Masking columns of multiple sequence alignment.
[2018-05-10 08:19:56] INFO: Masked alignment from 41155 to 5036 AA.
[2018-05-10 08:19:56] INFO: Creating concatenated alignment for 66 taxa.
[2018-05-10 08:19:56] INFO: Done.
[2018-05-10 08:19:56] WARNING: Tree inference is still under development!
[2018-05-10 08:19:56] INFO: Inferring tree with FastTree using WAG+GAMMA.
[2018-05-10 08:20:29] INFO: Done.
[2018-05-10 08:20:29] WARNING: Tree rooting is still under development!
[2018-05-10 08:20:29] ERROR: Input file does not exists: filtered_tree/gtdbtk.bac120.unrooted.tree

Any help would be greatly appreciated!

no module named config.

Hi,
I think I have installed all the dependencies properly; however, when trying to run gtdbtk I get the following error:

gtdbtk -h
Traceback (most recent call last):
  File "/usr/local/bin/gtdbtk", line 37, in <module>
    from gtdbtk.main import OptionsParser
  File "/home/ubuntu/.local/lib/python2.7/site-packages/gtdbtk/main.py", line 22, in <module>
    from markers import Markers
  File "/home/ubuntu/.local/lib/python2.7/site-packages/gtdbtk/markers.py", line 36, in <module>
    import config.config as Config
ImportError: No module named config

Any ideas?

Thanks

command to set_root for data

Hello,

I have built a Docker image for gtdbtk and all its dependencies; however, due to the size of the database files, I can't include them in the Docker image. The ultimate goal is to pull the Docker image on VMs or a cluster so that installation is easier. Specifically, I was wondering whether there could be a command, as in CheckM, to set the root directory where the data files are located, instead of manually creating and editing the config.py file. Manually editing this file isn't really possible when submitting jobs to a cluster. I'm not sure if this is possible, but I thought I would ask, since a Docker implementation could make it easier for a lot of people to use this tool.
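One common alternative to editing a config file is to read the data location from an environment variable, which can be set per job or per container. The sketch below illustrates that idea only: the function name is hypothetical, and the variable name `GTDBTK_DATA_PATH` is an assumption for this version of the tool.

```python
import os


def resolve_data_root(default=None, env_var="GTDBTK_DATA_PATH", environ=None):
    """Return the reference data directory, preferring the environment.

    `env_var` is an assumed name; `default` stands in for a value that
    would otherwise come from config.py.
    """
    if environ is None:
        environ = os.environ
    root = environ.get(env_var) or default
    if not root:
        raise RuntimeError("set %s to the reference data directory" % env_var)
    return root
```

With Docker, the data directory could then be mounted as a volume and the variable passed with `docker run -e`, leaving the image itself unchanged.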
