
metacache's Issues

Results Output Visualization

Hi,

I built a custom database and have used the following command to query my paired end fastq files:
./metacache query Metacache_DB ${INPUT1} ${INPUT2} -pairfiles -abundances test_abundance.txt -lowest species -highest domain -out test_results.txt -threads ${THREAD_NUM} -omit-ranks -taxids-only -separate-cols -lineages

This produces an abundance text file and a results file. The results file looks something like below:

# TABLE_LAYOUT: query_header	|	taxid	|	taxid	|	taxid	|	taxid	|	taxid	|	taxid	|	taxid	|	taxid	|	taxid	|	taxid	|	taxid	|	taxid	|	taxid	|	taxid	|	taxid	|	taxid
READ:INFO:12701/1	|	10115	|	0	|	8909	|	0	|	0	|	0	|	8103	|	0	|	6795	|	5817	|	0	|	3719	|	0	|	0	|	2

I know that the first taxid column contains the lowest possible rank for that read, so I am thinking of going through each read in the results file, extracting the first taxid, obtaining its corresponding higher ranks (genus, family, order, etc.) from my custom taxonomic hierarchy file, and adding a count to each of those ranks. In this particular example, the taxid 10115 is the species Acetobacter persici. The genus would be Acetobacter, family Acetobacteraceae, order Acetobacterales, class Alphaproteobacteria, phylum Proteobacteria, and domain Bacteria. I would add one count to each of these names, so by the end of my script run I should have the read counts for each of the taxonomic ranks; a sketch of this approach is shown below.
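A minimal sketch of this counting approach in Python, assuming the pipe-separated layout shown above and a hypothetical lineages lookup built beforehand from the custom taxonomy file:

from collections import defaultdict

# Hypothetical lookup built from the custom taxonomy file: taxid -> {rank: name}
lineages = {
    10115: {"species": "Acetobacter persici", "genus": "Acetobacter",
            "family": "Acetobacteraceae", "order": "Acetobacterales",
            "class": "Alphaproteobacteria", "phylum": "Proteobacteria",
            "domain": "Bacteria"},
}

counts = defaultdict(lambda: defaultdict(int))  # rank -> name -> read count

with open("test_results.txt") as results:
    for line in results:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the TABLE_LAYOUT header
        fields = [f.strip() for f in line.split("|")]
        first_taxid = int(fields[1])  # fields[0] is the query header
        if first_taxid == 0:
            continue  # read unclassified at the lowest rank
        for rank, name in lineages.get(first_taxid, {}).items():
            counts[rank][name] += 1

The per-rank tables in counts could then be written out as OTU-style tables for import into phyloseq.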

Is this the right way of going about this if I want to visualize my samples across various taxonomic ranks using an R package like phyloseq?

Thank you.

Problem building DB with metacache-build-refseq

Hi,

I'm experiencing some issues building a RefSeq reference database for MetaCache v2.1.1.

The first issue is that the NCBI assembly_summary.txt files now have a few entries where the ftp_path field is set to na. This breaks the metacache-build-refseq script. It is easy enough to work around once you know the problem (see the sketch below), but it takes a bit of exploring to figure this out.
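For reference, a minimal sketch of that workaround in Python, assuming ftp_path is the 20th tab-separated column of assembly_summary.txt:

# Drop entries whose ftp_path field is "na" (ftp_path assumed to be column 20, i.e. index 19)
with open("assembly_summary.txt") as src, open("assembly_summary.filtered.txt", "w") as dst:
    for line in src:
        if line.startswith("#"):
            dst.write(line)  # keep comment/header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 20 and fields[19] != "na":
            dst.write(line)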

The more major issue is that the metacache build command has been stuck at 98% for several hours now and all the metacache processes are in a sleep state. I am running this on a 16-CPU machine with 126 GB of RAM. Only 82 GB of RAM is in use at the moment, so this doesn't appear to be a memory issue. I can also verify I have plenty of disk space.

Have you experienced this problem? Can anyone verify that they have recently been able to build a MetaCache DB from complete RefSeq genomes?

Thanks,
Donovan

Ion Xpress (Ion Torrent)

Hi,

I wonder if MetaCache can also work with Ion Torrent fastq files (single-end)? Do these types of files need any pre-processing?

Many thanks,

Jaime

Building a custom DB

Hi,

I'm aiming to build a custom MetaCache DB by specifying the NCBI Taxon IDs for my genomes through the assembly_summary.txt file. However, it appears MetaCache might make some assumptions about either the genome filenames or the contig IDs. I have the following test setup with 2 genomes:

Genome files:

A.fna
B.fna

assembly_summary.txt:

# Created by build_custom_metacache_db.py
# assembly_accession	taxid
A.fna	1423
B.fna	1423

MetaCache results:

> metacache build test_db genomes -taxonomy /data/ncbi_taxonomy
Building new database 'test_db' from reference sequences.
Max locations per feature set to 254
Reading taxon names ... done.
Reading taxonomic node mergers ... done.
Reading taxonomic tree ... 2450976 taxa read.
Taxonomy applied to database.
------------------------------------------------
MetaCache version    2.2.1 (20220112)
database version     20200820
------------------------------------------------
sequence type        mc::char_sequence
target id type       unsigned int 32 bits
target limit         4294967295
------------------------------------------------
window id type       unsigned int 32 bits
window limit         4294967295
window length        127
window stride        112
------------------------------------------------
sketcher type        mc::single_function_unique_min_hasher<unsigned int, mc::same_size_hash<unsigned int> >
feature type         unsigned int 32 bits
feature hash         mc::same_size_hash<unsigned int>
kmer size            16
kmer limit           16
sketch size          16
------------------------------------------------
bucket size type     unsigned char 8 bits
max. locations       254
location limit       254
------------------------------------------------
Reading sequence to taxon mappings from genomes/assembly_summary.txt
Reading sequence to taxon mappings from /data/wgs_pipeline/metacache_db/prok_viral/20211207/db_build_data/ncbi_taxonomy/assembly_summary_refseq.txt
Reading sequence to taxon mappings from /data/wgs_pipeline/metacache_db/prok_viral/20211207/db_build_data/ncbi_taxonomy/assembly_summary_refseq_historical.txt
Processing reference sequences.
Added 45 reference sequences in 3.623 s
targets              45
ranked targets       0
taxa in tree         2450976
------------------------------------------------
buckets              1500007
bucket size          max: 1 mean: 1 +/- 0 <> -nan
features             5472
dead features        0
locations            5472
------------------------------------------------
1 targets are unranked.
1 targets remain unranked.
Writing database to file ... Writing database metadata to file 'test_db.meta' ... done.
Writing database part to file 'test_db.cache0' ... done.
done.
Total build time: 4.477 s

The output suggests it is unable to identify my genomes, since ranked targets is 0 and it indicates that 1 target remains unranked. Do the input genome files need names in a certain format? Or do the contig IDs need a specific format? I've also tried changing the assembly_summary.txt file to use just the assembly accessions A and B without the .fna extension.

The two genome files and the assembly_summary.txt file are in the genomes directory.

Any insight into what may be happening would be much appreciated.

Thanks,
Donovan

Strain Level Database

Hello,

I noticed that MetaCache can classify reads at the sequence level. If I had my own strain level database, would I be able to use MetaCache to classify reads at strain level?

Thank you.

Interpreting abundance profile

Hi,

I'm a little confused about how to interpret the -abundances file generated by MetaCache. In particular, I'm confused about how to interpret the results with -lowest set to subspecies. In a simple mock community I have, if I run MetaCache with -lowest at the default, I get Salmonella enterica subsp. enterica reported at 8.9%. However, if I run MetaCache with -lowest set to subspecies, I get Salmonella enterica subsp. enterica reported at 1.9%.

I'm unclear why this discrepancy occurs. Which result is generally considered more accurate?

Thanks,
Donovan

Guidance for working with large reference data sets

Hi,

I am trying to build a database from RefSeq and GenBank genomes.
The total size of the ~1.9 million compressed genomes is ~8.5 TB. Since the data set contains many genomes, some of which have extremely long chromosomes, I built MetaCache with:
make MACROS="-DMC_TARGET_ID_TYPE=uint64_t -DMC_WINDOW_ID_TYPE=uint64_t -DMC_KMER_TYPE=uint64_t"

  1. What will the peak memory consumption during the build be when partitioning into partitions of x MB each?
    At the moment I am running ${p2mc}/metacache-partition-genomes ${p2g} 1600000, since I have at most ~1.9 TB of memory available.
    Is the partition size reasonable?
    Does it matter for the partition-size calculation whether the genomes are compressed or not?

  2. Would it be beneficial for build time and memory consumption to create more, smaller partitions instead of fewer large ones? There was a similar question in #33 (comment), with advice to build fewer partitions to keep the merging time in check.
    Should I then try to find the maximum partition size that will fit into memory during building?
    Since I am partitioning anyway, do I actually need to compile with uint64_t, or should I check the largest partition, sequence-count-wise, and see if I can get away with uint32_t?

  3. Would you expect performance differences between querying a single db, few partitions, and many partitions, using the merge functionality with the latter two?

  4. I chose -kmerlen 20 based on Figure SF1 from the publication. Would you advise against this in favor of the default value of 16, perhaps to keep the computational resource demands and query speed at a reasonable level?
    Should other sketching parameters be adjusted as well for a value of 20, or are the defaults fine?

  5. Since the reference data set is large, should some of the advanced options be set/adjusted, e.g. -remove-overpopulated-features? If so, what values should be chosen, based on the reference data?

I would be very grateful for any guidance you have.

Best

Oskar

Form and variety

Hi. Quick question. What do the form and variety attributes for -abundance-per mean? I've always taken variety to be a synonym of subspecies and am not clear what form means in terms of taxonomy. Thanks.

separator not properly handled in the output

I tried to use a tab as the separator by setting the '-separator "\t"' parameter for query. However, the control sequence '\t' is printed literally in the output instead of being converted into a tab, e.g.:

ERR024951.132\tspecies\tKlebsiella pneumoniae\t573
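Until this is fixed, a blunt post-processing workaround in Python, assuming the literal two-character sequence \t never occurs inside real field content:

# Replace literal backslash-t sequences with real tab characters
with open("results.txt") as src, open("results.tsv", "w") as dst:
    for line in src:
        dst.write(line.replace("\\t", "\t"))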

Using DB partition and MERGE does not match single DB abundance results

Hi, I am testing the validity of using smaller databases and then merging the results. However, the results do not match those of querying a single DB. For example, I have a DB with 40 species and created two DBs with 20 species each. When I use the MERGE function, the results of merging the two 20-species DBs do not match the abundance results from the full 40-species DB. Here are the commands I am using:

metacache build 20sp_DB1 /test_merge_DB/DB1 -taxonomy ncbi_taxonomy -remove-overpopulated-features

metacache build 20sp_DB2 /test_merge_DB/DB2 -taxonomy ncbi_taxonomy -remove-overpopulated-features

metacache query 20sp_DB1 MixA_1.fastq.gz MixA_2.fastq.gz -pairfiles -tophits -queryids -lowest species -out res1.txt

metacache query 20sp_DB2 MixA_1.fastq.gz MixA_2.fastq.gz -pairfiles -tophits -queryids -lowest species -out res2.txt

metacache merge res1.txt res2.txt -lowest species -taxonomy ncbi_taxonomy -max-cand 4 -hitmin 2 -hitdiff 2 -mapped-only -abundances test_abundance.txt -abundance-per species > out_metacache_merge.txt

Issues with partitioned DB and merging

Dear Andre,

Why does MetaCache classify almost all reads to the species level when querying an extensive database (1000 genomes partitioned into 28 DBs)?

Is there any inherent error bias introduced by partitioning databases and posterior merging?

Can you also explain the difference between -hitmin and -hitdiff? (On the GitHub page the descriptions of both are the same.)

Additionally, is there a way to set a threshold for classifying at the species level based on the number of reads (is this what you refer to as a feature)? For example, a species would be considered "present" only if at least five reads are assigned to it.

We are contemplating the number of possible k-mers one can make with k = 16: 4^16 ≈ 4.3 billion, which would mean ~4.3 million available/possible k-mers for each species in a 1000-genome database (correct?) if they were all in the same DB.

For a 1 Gb genome, there are about 7.9 million 127 bp windows, and if 16 k-mers are selected from each, that is roughly 126 million k-mers from each genome, without any threshold and with just one genome. This is 29 times the ~4.3 million available/possible k-mers per species. I'm not sure of these numbers. Could you let me know if they sound right?
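A quick back-of-the-envelope check of these figures in Python, assuming the default window length 127, window stride 112, and sketch size 16 shown in the build logs:

kmer_space = 4 ** 16                  # all possible 16-mers: 4,294,967,296 (~4.3 billion)
per_species = kmer_space / 1000       # ~4.3 million if split evenly over 1000 genomes
genome_len = 1_000_000_000            # 1 Gb genome
windows = genome_len // 127           # ~7.9 million non-overlapping 127 bp windows
windows_strided = genome_len // 112   # ~8.9 million windows with the default stride of 112
kmers_sketched = windows * 16         # ~126 million sketched k-mers (not necessarily distinct)
print(kmers_sketched / per_species)   # ~29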

We were trying to extract info from the databases using the info command and found that featureCounts produces:

63168104 -> 69
805452997 -> 2
303063736 -> 10

But we are not able to relate these numbers to species. Is this a feature number? Is a feature different from a k-mer?
Are targets synonymous with sequences?

We want statistics on how many k-mers, and how many unique k-mers, represent each species within the database. Is it possible to retrieve this using the current commands?

Scientific notation in abundance file result in rounding errors

Hi. We've run into a small issue that we are hoping can be fixed in the next release. The abundance profile produced with the -abundances flag reports pair counts in scientific notation when the numbers get large, e.g.:

# query summary: number of queries mapped per taxon
# rank 	name 	taxid 	number of reads 	abundance
domain 	Archaea 	439684927 	461 	0.0130496%
domain 	Bacteria 	609216830 	1.05068e+06 	29.7417%

This can result in small errors due to rounding. For example, in this case there are really 1,050,675 bacterial read pairs, but the count gets rounded up to 1,050,680. While 5 extra read pairs is minor in terms of the resulting abundance estimates, it makes it challenging to track the fate of all reads. In our code, we have a check that the number of input reads equals the number of reads in the MetaCache abundance profile (including unclassified). This is just a unit test to ensure our parsing is correct and that no reads are lost during any manipulation of the data but, more generally, not being able to account for all reads is a bit scary.
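For illustration, this is exactly what printing a large count as floating point with the default six significant digits produces (whether that is the actual culprit in MetaCache is an assumption); a minimal reproduction in Python:

count = 1050675
print("%g" % count)          # -> 1.05068e+06 (six significant digits)
print(float("1.05068e+06"))  # -> 1050680.0, five read pairs too many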

I imagine the intent is for this profile to produce integers, so I am hoping this can be fixed in the next release. Thanks.

error compiling metacache

Hi

I ran into this error trying to compile metacache:

mkdir build_release
g++ -std=c++11 -O3 -Wall -Wextra -Wpedantic -c src/args_handling.cpp -o build_release/args_handling.o
g++ -std=c++11 -O3 -Wall -Wextra -Wpedantic -c src/cmdline_utility.cpp -o build_release/cmdline_utility.o
g++ -std=c++11 -O3 -Wall -Wextra -Wpedantic -c src/filesys_utility.cpp -o build_release/filesys_utility.o
g++ -std=c++11 -O3 -Wall -Wextra -Wpedantic -c src/main.cpp -o build_release/main.o
src/main.cpp: In function ‘int main(int, char**)’:
src/main.cpp:72:16: error: ‘runtime_error’ in namespace ‘std’ does not name a type
catch(std::runtime_error& e) {
^
src/main.cpp:73:37: error: ‘e’ was not declared in this scope
std::cerr << "\nABORT: " << e.what() << "!" << std::endl;
^
src/main.cpp:75:16: error: ‘invalid_argument’ in namespace ‘std’ does not name a type
catch(std::invalid_argument& e) {
^
src/main.cpp:76:37: error: ‘e’ was not declared in this scope
std::cerr << "\nABORT: " << e.what() << "!" << std::endl;
^
make: *** [build_release/main.o] Error 1

Any advice on how to resolve it would be great! Thank you.
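For what it's worth, this error pattern usually means <stdexcept> is not included, directly or transitively, in that translation unit with this particular compiler; a likely fix, offered as an assumption rather than a confirmed patch:

// at the top of src/main.cpp
#include <stdexcept>  // declares std::runtime_error and std::invalid_argument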

Support for gzipped reads

Hi.

Are there plans to support reads in compressed gzipped format (i.e. my_reads.fq.gz)? This would be a major help for incorporating MetaCache into workflows.

Cheers,
Donovan

Make throws the Invalid Option error if multiple macros are passed

Hi there!

When I try building the metacache package (v2.2.3) with multiple macros, make throws an invalid option error.

Seems like $(MACROS) should be quoted here:

$(MAKE) release_dummy DIR=$(REL_DIR) ARTIFACT=$(REL_ARTIFACT) MACROS=$(MACROS)
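With the quoting applied, that recipe line would presumably read:

$(MAKE) release_dummy DIR=$(REL_DIR) ARTIFACT=$(REL_ARTIFACT) MACROS="$(MACROS)"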

For now, I build metacache by running this command directly:
make release_dummy DIR=build_release ARTIFACT=metacache MACROS="-DMC_TARGET_ID_TYPE=uint32_t -DMC_WINDOW_ID_TYPE=uint32_t"

BR,
Zulfat

Classification abundances output not showing all ranks

Hi again,

I just ran MetaCache on a sample dataset. I used the following command:
./metacache query mydb.db test_genomes/ -out test_results.txt -lowest subspecies -abundances test_abundances.txt
When I take a look at the test_abundances.txt file, I see only rows for domain, family, and subspecies abundances, like below:

domain:Bacteria	|	3	|	56893	|	0.123854%
family:Actinomycetaceae	|	101	|	319	|	0.00069445%
subspecies:strain_158623	|	8058	|	1	|	2.17696e-06%
subspecies:strain_287365	|	8174	|	4422	|	0.00962652%
subspecies:strain_627946	|	8239	|	167	|	0.000363552%
subspecies:strain_623479	|	10643	|	246807	|	0.537289%

I do not see species, genus, order, class, or phylum abundances. However, when I ran the above command with -abundance-per phylum, I was able to get the abundances for all the phylum ranks. Then I decided to use the -allranks option mentioned in the documentation to get all ranks, but this always brings up the help text on my command line. I am not sure how to get -allranks working. My main goal is to classify all my reads at all the different ranks. Please let me know how I can do this or whether I am making any mistakes here.

Edit: I suppose -allranks only works with -precision, and for that I would need the header information set correctly.

Thank you.

Segmentation fault (core dumped)

I am getting this error every time I run a query to the database.

Reading database metadata ...
Reading 1 database part(s) ...
Completed database reading. %
custom query sketching settings: -sketchlen 32 -winlen 127 -winstride 112
Classifying query sequences.
Per-Read mappings will be written to file: /local/workdir/metacache/results_DBsimulated/reduced_32/sample_kmer32.txt_sample_10_1m_R1.fq_sample_10_1m_R2.fq.txt
Per-Taxon mappings will be written to file: /local/workdir/metacache/results_DBsimulated/reduced_32/abund_extraction.txt_sample_10_1m_R1.fq_sample_10_1m_R2.fq.txt
[> ] 0%Segmentation fault (core dumped)

Custom Database Creation

Hello,

I just need a little clarification. Regarding the custom database documentation: would I be able to use my own taxonomy and IDs rather than the NCBI taxonomy and taxids if I provide all the necessary files? I have my own custom database and would like to use my own taxonomy.

Thank you.

GPU version on bioconda

Dear all,

is there a GPU version of metacache on bioconda?

Can we also define MC_KMER_TYPE for GPU? Because from the code it doesn't look like it:

metacache/src/config.h

Lines 44 to 48 in 4c6db91

#if !defined GPU_MODE && defined MC_KMER_TYPE
using kmer_type = MC_KMER_TYPE ;
#else
using kmer_type = std::uint32_t;
#endif

Argument handling

Argument handling could use some improvements:

$ ./metacache query
terminate called after throwing an instance of 'std::length_error'
  what():  vector::reserve
Aborted

Out file interpretation

Hello,
I'm having trouble interpreting the per-read mapping report, specifically the taxids. I noticed there are some negative ones, and when plotting with KronaTools most are not identified, but some are classified as Eukaryota. Could you please clarify what they are?

Is there a recommended tool for plotting the results?

Thank you.

Database Build Files

Hi,

I built a database from the GTDB, but when I check the database, all I see are the following files in my main (non-database) directory: GTDB_DB.cache0 and GTDB_DB.meta.
I am wondering if these are the files I will need to use to classify my fastq files?

This is the command I used: ./metacache build GTDB_DB /path/to/genomes/ -taxonomy /path/to/custom/taxonomy/

Thank you.

long reads

Can the use of long genomic reads improve the accuracy?

Thanks!!

Unranked targets when building custom DB

Hi,

I'm building a custom DB from a large set of genome files. I'm indicating the TaxonId of each sequence using NCBI-style accession2taxid tab-separated files. When I build the DB, it appears some sequences are not given a rank. Specifically, the output of metacache build indicates 262383 targets remain unranked. Does this mean that MetaCache identified and placed sequences in the DB that do not have a taxonomic assignment (i.e. an associated TaxonId)? Is there any way to determine which sequences remain unranked? One possibility is sketched below.
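In case it is useful, a rough sketch in Python of one way to find them, assuming uncompressed FASTA files, headers whose first whitespace-separated token is the accession, and a mapping file whose first tab-separated column is that accession:

import glob

# Accessions that have a taxid mapping (first column of the mapping file)
mapped = set()
with open("accn_to_taxid.tsv") as f:
    for line in f:
        if line.strip() and not line.startswith("#"):
            mapped.add(line.split("\t")[0].strip())

# Report FASTA records whose accession has no mapping
for path in glob.glob("./gtdb_reps/genomes/*.fna"):
    with open(path) as f:
        for line in f:
            if line.startswith(">"):
                accession = line[1:].split()[0]
                if accession not in mapped:
                    print(path, accession)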

Thanks,
Donovan

>metacache build gtdb_r207_db ./gtdb_reps/genomes/ -taxonomy ./gtdb_taxonomy/R207 -reset-taxa -taxpostmap accn_to_taxid.tsv

Building new database 'gtdb_r207_db' from reference sequences.
Max locations per feature set to 254
Reading taxon names ... done.
Reading taxonomic node mergers ... done.
Reading taxonomic tree ... 401816 taxa read.
Taxonomy applied to database.
------------------------------------------------
MetaCache version    2.2.3 (20220708)
database version     20200820
------------------------------------------------
sequence type        mc::char_sequence
target id type       unsigned int 32 bits
target limit         4294967295
------------------------------------------------
window id type       unsigned int 32 bits
window limit         4294967295
window length        127
window stride        112
------------------------------------------------
sketcher type        mc::single_function_unique_min_hasher<unsigned int, mc::same_size_hash<unsigned int> >
feature type         unsigned int 32 bits
feature hash         mc::same_size_hash<unsigned int>
kmer size            16
kmer limit           16
sketch size          16
------------------------------------------------
bucket size type     unsigned char 8 bits
max. locations       254
location limit       254
------------------------------------------------
Processing reference sequences.
Added 8302685 reference sequences in 14348.7 s
targets              8302685
ranked targets       0
taxa in tree         401816
------------------------------------------------
buckets              964032481
bucket size          max: 254 mean: 46.7784 +/- 61.2667 <> 2.0026
features             518555067
dead features        0
locations            24257165919
------------------------------------------------
8302685 targets are unranked.
Try to map sequences to taxa using 'accn_to_taxid.tsv' (381 MB)
262383 targets remain unranked.
Writing database to file ... Writing database metadata to file 'gtdb_r207_db.meta' ... done.
Writing database part to file 'gtdb_r207_db.cache0' ... done.
done.

metacache-build-refseq and download-ncbi-genomes not downloading genomes

Just tried to build the database for MetaCache, but neither of these commands actually downloads the genomes. I tried metacache-build-refseq first; then, when it didn't download any, I tried download-ncbi-genomes, thinking I might need to download them 'manually', as it were. Neither command gives an error per se; both download the assembly_summary.txt file and then, when that finishes, print the following line and return to the prompt:

cat: ftpdirpaths: input file is output file

Any ideas what I'm doing wrong? I was really intrigued by the paper and wanted to see how the tool performed on my data.

Report MetaCache version

Hi,

Not urgent, but it would be useful to have a CLI option that indicates the software version. Alternatively, the version number could simply be indicated somewhere in the help output so it could be parsed out.

Thanks,
Donovan

Database load times highly and unexpectedly non-linear

Hi,

Thanks again for your work on MetaCache. I have noticed something a bit strange with the new version (MetaCache 2.0): database loading takes much longer than I would expect, without any signs of I/O contention in iotop. The problem seems especially bad once the database loading percentage exceeds 90%; loading from 0 to 90% seems to take only ~20% of the load time, and the remaining 10% of the database takes the rest. I am basing these percentages on the stderr log; see the attached representative log.

The actual analysis seems fast once it gets going. As it stands, I think I am spending about 75% of the time per sample just reading the database, but I am not sure why. Do you have any suggestions for improving sample throughput? I know completion percentages are usually off, but I'm not sure how to debug further.

metacache_stderr.log

AFS20 database construction

Hi, 

I am trying to reproduce some of the experimental results on my own machine, but I got stuck in the database construction.

It seems that the default database script only builds NCBI RefSeq (Release 90). Correct? Assuming that is true, I decided to build AFS20 using the commands for custom database construction, since I want to see those results. So I downloaded the first 20 reference genomes of Table 1 from the NCBI database, decompressed them, and put all the .fna files in one folder. Then I simply ran 1) make MACROS="-DMC_TARGET_ID_TYPE=uint32_t" and 2) ./metacache build AFS20 Data/AFS20 -taxonomy genomes/taxonomy.

However, the program has been stuck in the "Processing reference sequences." phase for more than 10 hours. Can you tell me what I am doing wrong here?

I really appreciate any help you can provide.

Segmentation fault

Hello,

If I use the query option with a fastq file containing more than 15 Ts in a row, I get a segmentation fault. If I replace the 16 Ts with 16 As (or more), there is no segmentation fault.

This fastq works properly:
@QHOT2:10403:11886
CTGACAGCATGTTGACTTTTGCTTCCTTTGTCAAGCACAGAAAAACAGGAAGCACAAAATCATTTTTTTTTTTTTTT
+

But if I add an extra T at the end, I get a segmentation fault.
@QHOT2:10403:11886
CTGACAGCATGTTGACTTTTGCTTCCTTTGTCAAGCACAGAAAAACAGGAAGCACAAAATCATTTTTTTTTTTTTTTT
+

thank you

metacache-build-refseq gets killed at 46% every time I try to build the database

When I try to build the RefSeq database, the process gets killed at 46%. This happens at exactly the same percentage every time. There is enough storage available. Hardware specs: 64 GB RAM, 16 cores.

Building new database 'refseq' from reference sequences.
Max locations per feature set to 254
Reading taxon names ... done.
Reading taxonomic node mergers ... done.
Reading taxonomic tree ... 2429955 taxa read.
Taxonomy applied to database.
------------------------------------------------
MetaCache version    2.0.1 (20210305)
database version     20200820
------------------------------------------------
sequence type        mc::char_sequence
target id type       unsigned short int 16 bits
target limit         65535
------------------------------------------------
window id type       unsigned int 32 bits
window limit         4294967295
window length        127
window stride        112
------------------------------------------------
sketcher type        mc::single_function_unique_min_hasher<unsigned int, mc::same_size_hash<unsigned int> >
feature type         unsigned int 32 bits
feature hash         mc::same_size_hash<unsigned int>
kmer size            16
kmer limit           16
sketch size          16
------------------------------------------------
bucket size type     unsigned char 8 bits
max. locations       254
location limit       254
------------------------------------------------
Reading sequence to taxon mappings from genomes/refseq/archaea/assembly_summary.txt
Reading sequence to taxon mappings from genomes/refseq/bacteria/assembly_summary.txt
Reading sequence to taxon mappings from genomes/refseq/viral/assembly_summary.txt
Reading sequence to taxon mappings from genomes/taxonomy/assembly_summary_refseq.txt
Reading sequence to taxon mappings from genomes/taxonomy/assembly_summary_refseq_historical.txt
Processing reference sequences.
[=================>                                                        ] 24%
[=================>                                                        ] 24%

[==================================>                                       ] 46%Killed

Segmentation fault when using --lowest species

Hi,

I've run into an issue where MetaCache runs as expected using the following parameters, but crashes with "Command terminated by signal 11" when the --lowest species flag is added:

-pairfiles -no-map -taxids -lineage -separate-cols -threads 32 -abundances profile.tsv -abundance-per species -out classification.log

Is there a set of incompatible flags I'm using or is it possible that using the -lowest flag has uncovered a bug?

Thanks,
Donovan

Segmentation fault (core dump) issue on custom DB

Hi,

It appears a number of users are experiencing segmentation faults under different conditions (issues 27, 28). I am experiencing this with a custom DB I just built. The DB builds without issue; I've posted the DB build output below. The segmentation fault occurs even when running MetaCache v2.2.3 with default settings, e.g.:

> metacache query my_db my_seqs.fq.gz -out results.txt

The segmentation fault occurs immediately after the DB is loaded:

> metacache query my_db my_seqs.fq.gz -out results.txt
Reading database metadata ...
Reading 1 database part(s) ...
Completed database reading.
Classifying query sequences.
Per-Read mappings will be written to file: results.txt
[>                                                                         ] 0%Segmentation fault (core dumped)

This seems to be the same issue as previously reported. I'm running this on a machine with plenty of memory (512 GB; the reference DB is ~223 GB on disk and looks to require about the same amount of RAM) and disk space (>100 GB free). Notably, the following command also causes a segmentation fault:

> metacache info my_db lin > lineages.tsv
Reading database from file 'gtdb_r214_db_ext' ... Reading database metadata ...
Completed database reading.
done.
Segmentation fault (core dumped)

Any insights into what might be causing this segmentation fault?

Thanks,
Donovan

Output from building custom DB:

Max locations per feature set to 254
Reading taxon names ... done.
Reading taxonomic node mergers ... done.
Reading taxonomic tree ... 184936 taxa read.
Taxonomy applied to database.
------------------------------------------------
MetaCache version    2.3.2
database version     20200820
------------------------------------------------
sequence type        mc::char_sequence
target id type       unsigned int 32 bits
target limit         4294967295
------------------------------------------------
window id type       unsigned int 32 bits
window limit         4294967295
window length        127
window stride        112
------------------------------------------------
sketcher type        mc::single_function_unique_min_hasher<unsigned int, mc::same_size_hash<unsigned int> >
feature type         unsigned int 32 bits
feature hash         mc::same_size_hash<unsigned int>
kmer size            16
kmer limit           16
sketch size          16
------------------------------------------------
bucket size type     unsigned char 8 bits
max. locations       254
location limit       254
------------------------------------------------
Reading sequence to taxon mappings from ./genomes/assembly_summary.txt
Processing reference sequences.
Added 10919768 reference sequences in 7244.87 s
targets              10919768
ranked targets       10411403
taxa in tree         184936
------------------------------------------------
buckets              964032481
bucket size          max: 254 mean: 55.5022 +/- 68.7783 <> 1.65723
features             531291666
dead features        0
locations            29487841493
------------------------------------------------
All targets are ranked.
Writing database to file ... Writing database metadata to file 'gtdb_r214_db_ext.meta' ... done.
Writing database part to file 'gtdb_r214_db_ext.cache0' ... done.
done.
Total build time: 9440.64 s

speedup refseq downloads - suggestion not issue

Downloading from NCBI using ascp is much, much faster than wget. A change like the following to your download script will speed it up considerably:

FTPURL="ftp://ftp.ncbi.nlm.nih.gov/genomes"
wget $FTPURL/refseq/bacteria/assembly_summary.txt
# complete genomes
awk -F "\t" '$12=="Complete Genome" && $11=="latest"{print $20}' assembly_summary.txt > ftpdirpaths
awk 'BEGIN{FS=OFS="/";filesuffix="genomic.fna.gz"}{ftpdir=$0;asm=$10;file=asm"_"filesuffix;print ftpdir,file}' ftpdirpaths | sed 's@ftp://ftp.ncbi.nlm.nih.gov@@' > ftpfilepaths
# ascp instead of wget
ascp -T -k2 -l 1000M -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh --user=anonftp --file-list=ftpfilepaths --mode=recv --host=ftp.ncbi.nlm.nih.gov .

Interpretation of reported abundance table

Hi. I'm running MetaCache query with the -abundances profile.tsv and -abundance-per species flags. It appears this writes two profiling results to profile.tsv: the full taxon profile and a species profile. However, these profiles do not appear to agree. For example, the taxon profile reports Streptococcus dysgalactiae subsp. equisimilis at 1.15% and no other results for S. dysgalactiae. The species profile reports S. dysgalactiae at 1.93%. Why is there a discrepancy?

Relevant lines from profile.tsv:

# query summary: number of queries mapped per taxon
# rank:name | taxid | number of reads | abundance
...
subspecies:Streptococcus dysgalactiae subsp. equisimilis	119602	80740	1.15343%
...
# estimated abundance (number of queries) per species
# rank:name | taxid | number of reads | abundance
...
species:Streptococcus dysgalactiae	1334	135610	1.93728%
...

Is the best prediction of the abundance of S. dysgalactiae by MetaCache 1.15% or 1.93%?

Thanks,
Donovan

Allow for more threads?

Hi,

I tried to run the database build, but it seems that my machine is only running MetaCache on a single thread. Is there a way to increase this? I do not see an argument for specifying threads in MetaCache.

Thanks.

Compiling with uint64_t

Hi, I am trying to compile MetaCache for use with a large reference data set.
I am running it like this:

git clone https://github.com/muellan/metacache.git
cd metacache
mamba activate compile

LD_LIBRARY_PATH=${CONDA_PREFIX}/lib:${LD_LIBRARY_PATH}
export LD_LIBRARY_PATH

CPLUS_INCLUDE_PATH=${CONDA_PREFIX}/include:${CPLUS_INCLUDE_PATH}
export CPLUS_INCLUDE_PATH

make MACROS="-DMC_TARGET_ID_TYPE=uint64_t -DMC_WINDOW_ID_TYPE=uint64_t -DMC_KMER_TYPE=uint64_t"

I get the following error:

make release_dummy DIR=build_release ARTIFACT=metacache MACROS="-DMC_TARGET_ID_TYPE=uint64_t -DMC_WINDOW_ID_TYPE=uint64_t -DMC_KMER_TYPE=uint64_t"
make[1]: Entering directory '/mnt/data/local_tools/metacache'
.../miniconda3/envs/compile/bin/x86_64-conda-linux-gnu-c++  -DMC_TARGET_ID_TYPE=uint64_t -DMC_WINDOW_ID_TYPE=uint64_t -DMC_KMER_TYPE=uint64_t -std=c++14 -Wall -Wextra -Wpedantic -I/include -O3 -c src/building.cpp -o build_release/building.o
In file included from src/options.h:31,
                 from src/candidate_structs.h:28,
                 from src/candidate_generation.h:27,
                 from src/database.h:27,
                 from src/building.h:27,
                 from src/building.cpp:24:
src/taxonomy.h: In member function 'void mc::ranked_lineages_of_targets::update(mc::target_id)':
src/taxonomy.h:980:45: error: no matching function for call to 'min(unsigned int, long unsigned int)'
  980 |         const unsigned numThreads = std::min(4U, numNewTargets / (1U << 10));
      |                                     ~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from .../miniconda3/envs/compile/x86_64-conda-linux-gnu/include/c++/11.3.0/bits/char_traits.h:39,
                 from .../miniconda3/envs/compile/x86_64-conda-linux-gnu/include/c++/11.3.0/string:40,
                 from src/bitmanip.h:29,
                 from src/dna_encoding.h:27,
                 from src/hash_dna.h:27,
                 from src/config.h:34,
                 from src/candidate_structs.h:27,
                 from src/candidate_generation.h:27,
                 from src/database.h:27,
                 from src/building.h:27,
                 from src/building.cpp:24:
.../miniconda3/envs/compile/x86_64-conda-linux-gnu/include/c++/11.3.0/bits/stl_algobase.h:230:5: note: candidate: 'template<class _Tp> constexpr const _Tp& std::min(const _Tp&, const _Tp&)'
  230 |     min(const _Tp& __a, const _Tp& __b)
      |     ^~~
.../miniconda3/envs/compile/x86_64-conda-linux-gnu/include/c++/11.3.0/bits/stl_algobase.h:230:5: note:   template argument deduction/substitution failed:
In file included from src/options.h:31,
                 from src/candidate_structs.h:28,
                 from src/candidate_generation.h:27,
                 from src/database.h:27,
                 from src/building.h:27,
                 from src/building.cpp:24:
src/taxonomy.h:980:45: note:   deduced conflicting types for parameter 'const _Tp' ('unsigned int' and 'long unsigned int')
  980 |         const unsigned numThreads = std::min(4U, numNewTargets / (1U << 10));
      |                                     ~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from .../miniconda3/envs/compile/x86_64-conda-linux-gnu/include/c++/11.3.0/bits/char_traits.h:39,
                 from .../miniconda3/envs/compile/x86_64-conda-linux-gnu/include/c++/11.3.0/string:40,
                 from src/bitmanip.h:29,
                 from src/dna_encoding.h:27,
                 from src/hash_dna.h:27,
                 from src/config.h:34,
                 from src/candidate_structs.h:27,
                 from src/candidate_generation.h:27,
                 from src/database.h:27,
                 from src/building.h:27,
                 from src/building.cpp:24:
.../miniconda3/envs/compile/x86_64-conda-linux-gnu/include/c++/11.3.0/bits/stl_algobase.h:278:5: note: candidate: 'template<class _Tp, class _Compare> constexpr const _Tp& std::min(const _Tp&, const _Tp&, _Compare)'
  278 |     min(const _Tp& __a, const _Tp& __b, _Compare __comp)
      |     ^~~
.../miniconda3/envs/compile/x86_64-conda-linux-gnu/include/c++/11.3.0/bits/stl_algobase.h:278:5: note:   template argument deduction/substitution failed:
In file included from src/options.h:31,
                 from src/candidate_structs.h:28,
                 from src/candidate_generation.h:27,
                 from src/database.h:27,
                 from src/building.h:27,
                 from src/building.cpp:24:
src/taxonomy.h:980:45: note:   deduced conflicting types for parameter 'const _Tp' ('unsigned int' and 'long unsigned int')
  980 |         const unsigned numThreads = std::min(4U, numNewTargets / (1U << 10));
      |                                     ~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from .../miniconda3/envs/compile/x86_64-conda-linux-gnu/include/c++/11.3.0/algorithm:62,
                 from src/../dep/hpc_helpers/include/cuda_helpers.cuh:7,
                 from src/dna_encoding.h:30,
                 from src/hash_dna.h:27,
                 from src/config.h:34,
                 from src/candidate_structs.h:27,
                 from src/candidate_generation.h:27,
                 from src/database.h:27,
                 from src/building.h:27,
                 from src/building.cpp:24:
.../miniconda3/envs/compile/x86_64-conda-linux-gnu/include/c++/11.3.0/bits/stl_algo.h:3449:5: note: candidate: 'template<class _Tp> constexpr _Tp std::min(std::initializer_list<_Tp>)'
 3449 |     min(initializer_list<_Tp> __l)
      |     ^~~
.../miniconda3/envs/compile/x86_64-conda-linux-gnu/include/c++/11.3.0/bits/stl_algo.h:3449:5: note:   template argument deduction/substitution failed:
In file included from src/options.h:31,
                 from src/candidate_structs.h:28,
                 from src/candidate_generation.h:27,
                 from src/database.h:27,
                 from src/building.h:27,
                 from src/building.cpp:24:
src/taxonomy.h:980:45: note:   mismatched types 'std::initializer_list<_Tp>' and 'unsigned int'
  980 |         const unsigned numThreads = std::min(4U, numNewTargets / (1U << 10));
      |                                     ~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from .../miniconda3/envs/compile/x86_64-conda-linux-gnu/include/c++/11.3.0/algorithm:62,
                 from src/../dep/hpc_helpers/include/cuda_helpers.cuh:7,
                 from src/dna_encoding.h:30,
                 from src/hash_dna.h:27,
                 from src/config.h:34,
                 from src/candidate_structs.h:27,
                 from src/candidate_generation.h:27,
                 from src/database.h:27,
                 from src/building.h:27,
                 from src/building.cpp:24:
.../miniconda3/envs/compile/x86_64-conda-linux-gnu/include/c++/11.3.0/bits/stl_algo.h:3455:5: note: candidate: 'template<class _Tp, class _Compare> constexpr _Tp std::min(std::initializer_list<_Tp>, _Compare)'
 3455 |     min(initializer_list<_Tp> __l, _Compare __comp)
      |     ^~~
.../miniconda3/envs/compile/x86_64-conda-linux-gnu/include/c++/11.3.0/bits/stl_algo.h:3455:5: note:   template argument deduction/substitution failed:
In file included from src/options.h:31,
                 from src/candidate_structs.h:28,
                 from src/candidate_generation.h:27,
                 from src/database.h:27,
                 from src/building.h:27,
                 from src/building.cpp:24:
src/taxonomy.h:980:45: note:   mismatched types 'std::initializer_list<_Tp>' and 'unsigned int'
  980 |         const unsigned numThreads = std::min(4U, numNewTargets / (1U << 10));
      |                                     ~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
make[1]: *** [Makefile:215: build_release/building.o] Error 1
make[1]: Leaving directory '/mnt/data/local_tools/metacache'
make: *** [Makefile:137: release] Error 2

Just make, or e.g. make MACROS="-DMC_TARGET_ID_TYPE=uint32_t -DMC_WINDOW_ID_TYPE=uint32_t", works.
Am I calling it somehow wrong? make MACROS="-DMC_KMER_TYPE=uint64_t" from the example also fails in the same manner.
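For reference, the deduction failure at src/taxonomy.h:980 happens because numNewTargets becomes 64-bit under these macros while the literal 4U stays 32-bit, so std::min sees two different argument types; a likely fix, offered as an assumption rather than a confirmed patch:

// src/taxonomy.h:980 -- pin the template argument so both operands
// of std::min have the same type when target_id is 64 bits wide
const unsigned numThreads = std::min<std::uint64_t>(4, numNewTargets / (1U << 10));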

Best

Oskar

Using as a library

Thanks for this piece of work. I am very interested in using the hashing method in TGS read analysis and wonder how I would use this as a CMake module, etc. Are you planning to support this? Any direction on this would be highly appreciated.

Example use case: hash a million sequences (using LSH) and compare them with another set of sequences; finally, estimate similarity, etc.

Merging results from querying partitioned database

Dear Andre,

I am working with a partitioned database (44 chunks in total, each 16 GB). Querying the database works well (I believe) and is reasonably fast. However, once I have the 44 txt files and am merging them to get abundance-per-species, the computation takes a long time (it has now been running for three days). I am unsure if I am doing something wrong or if this is normal behavior (each txt file is around 100-300 GB, and the paired-end fastq file contains 100 species with 10,000 reads each, a total of 1 million reads). Below is the command line I used to merge the text files:

metacache merge results_mix05 -taxonomy ncbi_taxonomy -lowest species -abundances 05_abund_extraction.txt -abundance-per species

Also, I am using the 64-bit version of metacache to generate k-mers of size 32.

Do you think partitioning the database into even smaller chunks would make merging the final results faster?

I will appreciate any guidance you could provide.

Best wishes,

Jaime
