
kmerdb's Introduction

README - kmerdb

Python CLI and module for k-mer profiles, similarities, and graph files

NOTE: Beta-stage .bgzf and zlib compatible k-mer count vector and De Bruijn graph edge-list formats.

Development Status

Status badges: PyPI downloads, GitHub downloads, PyPI version, Python versions, CircleCI, codecov, ReadTheDocs.

Summary

  • [ x ] Homepage
  • [ x ] Quick Start guide
  • [ x ] kmerdb usage <subcommand_name>
    • profile - Make k-mer count vectors/profiles, calculate unique k-mer counts, total k-mer counts, nullomer counts. Import to read/write NumPy arrays from profile object attributes.
    • graph - Make a weighted edge list of kmer-to-kmer relationships, akin to a De Bruijn graph.
    • usage - Display verbose input file/parameter and algorithm details of subcommands.
    • help - Display verbose input file/parameter and algorithm details of subcommands.
    • view - View .tsv count/frequency vectors with/without preamble.
    • header - View YAML formatted header and aggregate counts
    • matrix - Collate multiple profiles into a count matrix for dimensionality reduction, etc.
    • kmeans - k-means clustering on a distance matrix via Scikit-learn or BioPython with kcluster distances
    • hierarchical - hierarchical clustering on a distance matrix via BioPython with linkage choices
    • distance - Distance matrices (from kmer count matrices) including SciPy distances, a Pearson correlation coefficient implemented in Cython, and Spearman rank correlation included as additional distances.
    • index - Create an index file for the kmer profile (Delayed:)
    • shuf - Shuffle a k-mer count vector/profile (Delayed:)
    • version - Display kmerdb version number
    • citation - Silence citation suggestion
  • [ x ] kmerdb subcommand -h|--help

k-mer counts from .fa(.gz)/.fq(.gz) sequence data can be computed and stored for access to metadata and count aggregation facilities. For those familiar with .bam, view and header functions are provided. The file is compatible with zlib.

Install with pip install kmerdb

Please see the Quickstart guide for more information about the format, the library, and the project.

Usage

# Usage    --help option    --debug mode
kmerdb --help # [+ --debug mode]
kmerdb usage profile


# [ 3 main features: ]     [ 1. k-mer counts ]

# Create a [composite] profile of k-mer counts from sequence files. (.fasta|.fastq|.fa.gz|.fq.gz)
kmerdb profile -vv -k 8 --output-name sample_1 sample_1_rep1.fq.gz [sample_1_rep2.fq.gz]
# Creates k-mer count vector/profile in sample_1.8.kdb. This is the input to other steps, including count matrix aggregation. --minK and --maxK options can be specified to create multiple k-mer profiles at once.
<!-- # Alternatively, can also take a plain-text samplesheet.txt with one filepath on each line. -->

#          De Bruijn graphs (not a main feature yet, delayed)
# Build a weighted edge list (+ node ids/counts = De Bruijn graph)
kmerdb graph -vv -k 12 example_1.fq.gz example_2.fq.gz edges_1.kdbg

# View k-mer count vector
kmerdb view -vv profile_1.8.kdb # -H for full header

# Note: zlib compatibility
#zcat profile_1.8.kdb

# View header (config.py[kdb_metadata_schema#L84])
kmerdb header -vv profile_1.8.kdb

## [ 3 main features: ]   [ 2. Optional normalization, PCA/tSNE, and distance metrics ]

# K-mer count matrix - Cython Pearson coefficient of correlation [ ssxy/sqrt(ssxx*ssyy) ]
kmerdb matrix -vv from *.8.kdb | kmerdb distance pearson STDIN
# 
# kmerdb matrix -vv DESeq2 *.8.kdb
# kmerdb matrix -vv PCA *.8.kdb
# kmerdb matrix -vv tSNE *.8.kdb
#   # <from> just makes a k-mer count matrix from k-mer count vectors.
# 

# Distances on count matrices: SciPy pdist metrics plus a Cython Pearson correlation; SciPy Spearman and correlation pdist calculations are also available
kmerdb distance -h
# 
#
# usage: kmerdb distance [-h] [-v] [--debug] [-l LOG_FILE] [--output-delimiter OUTPUT_DELIMITER] [-p PARALLEL] [--column-names COLUMN_NAMES] [--delimiter DELIMITER] [-k K]
#                       {braycurtis,canberra,chebyshev,cityblock,correlation,cosine,dice,euclidean,hamming,jaccard,jensenshannon,kulsinski,mahalanobis,matching,minkowski,pearson,rogerstanimoto,russellrao,seuclidean,sokalmichener,sokalsneath,spearman,sqeuclidean,yule} [<kdbfile1 kdbfile2 ...|input.tsv|STDIN> ...]

# [ 3 main features: ]      [ 3. Clustering: k-means and hierarchical with matplotlib ]

#    Kmeans (sklearn, BioPython)
kmerdb kmeans -vv -k 4 -i dist.tsv
#    BioPython Phylip tree + upgma
kmerdb hierarchical -vv -i dist.tsv
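
The Cython Pearson correlation mentioned above follows the classic formula r = ss_xy / sqrt(ss_xx * ss_yy). Below is a minimal NumPy sketch of that formula applied to two k-mer count vectors; it is only a reference illustration, not kmerdb's Cython code.

import numpy as np

def pearson(x, y):
    # center both count vectors, then apply r = ss_xy / sqrt(ss_xx * ss_yy)
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    dx, dy = x - x.mean(), y - y.mean()
    ss_xy = np.dot(dx, dy)
    ss_xx = np.dot(dx, dx)
    ss_yy = np.dot(dy, dy)
    return ss_xy / np.sqrt(ss_xx * ss_yy)

# counts_a and counts_b would be two 4**k-length k-mer count vectors
# print(pearson(counts_a, counts_b))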

kmerdb is a Python CLI designed for k-mer counting and k-mer graph edge-lists. It addresses the 'k-mer' problem (substrings of length k) in a simple and performant manner. It stores the k-mer counts in a columnar format (input checksums, total and unique k-mer counts, nullomers, mononucleotide counts) with a YAML formatted metadata header in the first block of a bgzf formatted file.
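
Because the file is zlib compatible and the metadata header is plain YAML, the header can also be inspected directly from Python. Here is a minimal sketch, assuming (not guaranteed by the format) that the header occupies the leading lines of the decompressed stream and that the count rows which follow begin with a numeric k-mer id; kmerdb header remains the supported interface.

import gzip
import yaml  # PyYAML

def read_kdb_header(path):
    # .kdb is bgzf, hence readable with gzip/zlib
    header_lines = []
    with gzip.open(path, "rt") as fh:
        for line in fh:
            # assumption: count rows start with a digit and are tab-separated
            if line[:1].isdigit() and "\t" in line:
                break
            header_lines.append(line)
    return yaml.safe_load("".join(header_lines))

# metadata = read_kdb_header("profile_1.8.kdb")
# print(metadata.get("version"), metadata.get("k"))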

Usage example

kmerdb.gif (animated usage demonstration)

Installation

OSX and Linux release:

pip install kmerdb

Optional DESeq2 normalization

DESeq2 is an optional R dependency for rpy2-mediated normalization. Make sure development libraries are installed from the repository.

pip install -r requirements-dev.txt

Next, install DESeq2 via bioconductor.

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("DESeq2")

IUPAC support:

kmerdb profile -k $k -o output input.fa # This simply discards non-IUPAC characters.

IUPAC residues (ATCG+RYSWKM+BDHV) are kept throughout the k-mer counting, but 'N' residues and other non-IUPAC characters are trimmed from the sequences prior to k-mer counting. Non-standard IUPAC residues are counted as doublets or triplets.
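
To illustrate the doublet/triplet idea, here is a hedged sketch of expanding ambiguous IUPAC bases into their possible nucleotides before counting. The mapping is standard IUPAC; how kmerdb weights the expanded k-mers internally is not shown here.

from itertools import product

IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}

def expand_kmer(kmer):
    # return every unambiguous k-mer an ambiguous k-mer could represent
    return ["".join(p) for p in product(*(IUPAC[b] for b in kmer))]

# expand_kmer("ART") -> ["AAT", "AGT"]   (R is a doublet: A or G)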

Documentation

Check out the main webpage and the Readthedocs documentation, with examples and descriptions of the module usage.

Some usage features may be important yet not fully documented, as the project is in beta.

For example, the IUPAC treatment is largely custom and does the sensible thing when ambiguous bases are found in fasta files, but it could use some polishing. In particular, rejecting the 'N' residue creates gaps in the k-mer profile relative to the real dataset by admittedly omitting certain k-mer counts. This is one method for counting k-mers and handling ambiguity. Fork it and play with it a bit.

Also, the parallel handling may not always be smooth if you're trying to load dozens of 12+-mer profiles into memory. This especially matters in the matrix command, before the matrix is generated. You can use a single core if your machine can't collate that much into main memory at once, depending on how deep the fastq dataset is. Even when handling small-ish k-mer profiles, you may bump into memory overheads rather quickly.

Besides that, I'd suggest reading the source, the different elements of the main page, or the RTD documentation.

Development

https://matthewralston.github.io/kmerdb/developing

python setup.py test

License

Created by Matthew Ralston - Scientist, Programmer, Musician - Email

Distributed under the Apache license. See LICENSE.txt for the copy distributed with this project. Open source software is not for everyone, and I'm the author and maintainer. Cheers, on me. You may use and distribute this software, gratis, so long as the original LICENSE.txt is distributed along with the software. This software is distributed AS IS and provides no warranties of any kind.

   Copyright 2020 Matthew Ralston

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.

Contributing

  1. Fork it (https://github.com/MatthewRalston/kmerdb/fork)
  2. Create your feature branch (git checkout -b feature/fooBar)
  3. Commit your changes (git commit -am 'Add some fooBar')
  4. Push to the branch (git push origin feature/fooBar)
  5. Create a new Pull Request

Acknowledgements

Thanks mom and dad and my hometown, awesome hs, and the University of Delaware faculty for support and encouragement. Thanks to my former mentors, bosses, and coworkers. It's been really enjoyable anticipating what the metagenomics community might want from a tool that can handle microbial k-mers well.

Thank you to the authors of kPAL and Jellyfish for the inspiration and bit shifting trick. And thank you to others for the encouragement along the way, who shall remain nameless.
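
For readers curious about the bit-shifting trick, the idea is to encode each base in two bits so that a k-mer maps to an integer in [0, 4**k), which is how a flat count array of length 4**k can be indexed. A small illustrative sketch follows; the base ordering is an assumption and not necessarily the one kmerdb uses.

BASE2BITS = {"A": 0, "C": 1, "G": 2, "T": 3}

def kmer_to_id(kmer):
    # pack each base into 2 bits, most 5' base in the highest bits
    kid = 0
    for base in kmer:
        kid = (kid << 2) | BASE2BITS[base]
    return kid

def id_to_kmer(kid, k):
    # unpack 2 bits at a time, then reverse to restore 5'->3' order
    bases = "ACGT"
    out = []
    for _ in range(k):
        out.append(bases[kid & 3])
        kid >>= 2
    return "".join(reversed(out))

# kmer_to_id("ACGT") == 0b00011011 == 27; id_to_kmer(27, 4) == "ACGT"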

The intention is that more developers will want to add functionality to the codebase, or even just utilize it downstream, and that it builds directly on NumPy and SciPy/scikit-learn as needed to provide the basic infrastructure for the ML problems and modeling approaches that could be applied to such datasets. This project began under GPL v3.0 and was relicensed under Apache v2. Hopefully the project gains some interest. I have so much fun working on this project; there's more to it than meets the eye. I'm working on a preprint, and the draft is included in some of the latest versions of the codebase, specifically the .Rmd files.

More on the flip-side. It's so complex with technology these days...


kmerdb's Issues

kdb profile doesn't print the header in OrderedDict properly.

Recently, I altered the codebase to include multi-block headers. In the process of doing so, I needed to keep track of the number of metadata blocks in the first block of the metadata blocks, i.e. the first block of the file. It should include the version and metadata_blocks fields/properties in the first two lines of the first block of the file. This means that the files block should never occur before that. I recently created a kdb file with the current master branch and could not observe the desired behavior.

Submit your edits for this evening by 11:30.

We should add MatthewRalston/LocalConfigurations to our developer dependencies. No, seriously.

We spent all day making hardly any modifications to the peerj.latex file; we did rename it, and it eventually got fixed enough that it can be built stably. But it still has the pandoc 2.11.2 (our version) bug with the CSLReferences environment requirement. If we downgrade, we might be able to circumvent this temporarily. We also changed the emacs theme today for a different contrast, set our .Rmd files to autodetect poly-markdown+r-mode (I think the answer is (add-to-list 'auto-mode-alist '("\\.Rmd" . poly-markdown+r-mode))), and added a new shortcut, C-c r, in case we're pissed with RStudio. This will evaluate the spa/rmd-render function defined by John Geddes.

We have to take a moment to acknowledge the role that Vitalie Spinu has played in my past success at UD and also what I'll currently be able to do thanks to emacs polymode. And the RStudio developers for making this possible as well. Thank you for the amazing software. I'd say so far it's been about 100% RStudio until today, I encountered some rendering problems, and wanted to switch back to RStudio. But I remember how good their software was and wanted to thank both groups for their work on the interactive R experience...

And then I have to thank Yihui in the first place for KnitR. Really, really cool concept and now the RStudio team has so much to do. I mean how can we not go down the rabbit hole now and thank the R community and then the GNU community. And then I guess I'd also thank the Numpy, Scipy, and SKLearn teams for making the other parts of this project so easy to implement in a short amount of time. And I'd like to also thank a number of previous mentors and a current one who I don't quite get along with.

But anyways yeah I'm gonna go back and work some more and thanks again for everything, goodnight.

Today we learned a lot about the biblatex style I want to implement for a clean citation style, and I'm still thinking about a different style, like the plain citation module.

rstudio/rmarkdown#1649
rstudio/rmarkdown#1649
mpark/wg21#54
https://stackoverflow.com/questions/59193797/pandocs-environment-cslreferences-undefined-when-knitting-rmarkdown-to-pdf-in-r
https://gist.github.com/MatthewRalston/68f1e4f262712b0f2a60b3b617a3c8cb#file-peerj-tex-L290

The two biggest issues at the moment are:

  • the title and author information will not print at the top of the first page
  • the bibliography is printed stupidly because of the wrong pandoc version.

Figures needed concerning the distribution fitting of the general case versus individual genomes

Explain how the individual genomes need separate fits of the model, but the model selected remains the same regardless of genome.

  • Cullen Frey analysis to determine discrete distribution shape and similarity to Poissonian behavior.
  • Distribution similarities and different fits. Compare the skewness and kurtosis between each Cullen Frey plot.
  • Reminder: if the NB model is selected and DESeq2 normalization is applied, the user accepts normalization by factors other than size.
  • Now we could log-transform and then normalize and it would likely be fine, but we can continue to use the transformed space for clustering

Memory multiplication bug

Currently, the __init__.py file duplicates the array across parallelization pools, even when -p = 1.

Moving this to the bottom should fix some memory issues with larger choices of k.

No parallelization for fastq data

The parallelization for fastq data should be implemented correctly: the pool should be utilized properly, even when accumulating k-mers from .fastq(.gz) data files.
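
A rough sketch of what proper pool usage could look like: map a per-batch counting function over chunks of reads and sum the resulting count arrays. This is illustrative only; kmerdb's parse and kmer modules organize this differently.

from multiprocessing import Pool
import numpy as np

def count_batch(args):
    # count canonical A/C/G/T k-mers in one batch of read strings
    reads, k = args
    counts = np.zeros(4 ** k, dtype=np.uint64)
    encode = {"A": 0, "C": 1, "G": 2, "T": 3}
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if all(b in encode for b in kmer):
                kid = 0
                for b in kmer:
                    kid = (kid << 2) | encode[b]
                counts[kid] += 1
    return counts

def count_reads(read_batches, k, processes=4):
    # one partial count array per batch, summed element-wise at the end
    with Pool(processes) as pool:
        partials = pool.map(count_batch, [(batch, k) for batch in read_batches])
    return np.sum(partials, axis=0)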

postgres database connection is severely lowering calculation speed.

Average memory per processor was low before (<< 30 GB). It will scale supralinearly in memory with k, but I don't plan on analyzing many human genomes at k>20 right now. The performance tradeoff is too important right now to consider high values of k in a database framework. The goal of this project is to perform data science operations on microbial ecological projects, not to inspect the level of complexity of a sparse N x m (m samples at 4**k = N k-mer profile depth) matrix for human genome comparisons. If you're interested in this, contact me or maybe submit a pull request. Heck, make your own fork! Just recall the original License and make an effort to contact me for academic collaborations!

The goal of this issue is to highlight a problem in the database strategy: It seems to be CPU bound, not IO bound with this level of adoption of high-speed NVMe protocols. Postgres 13.4 is not maximizing the throughput of my NVMe RAID0 array (max clocked at roughly 21Gbps).

For this reason, my calculations have been severely CPU bound, with potentially other bottlenecks in the <=0.0.10 versions of this software. So I am returning once more to in-memory comparisons, in this already ridiculously complex variable space.

It is my opinion that not even UMAP can address the complexity of point density in arbitrarily complex k-mer profile spaces. The distances are ridiculously complex in the high-dimensional space and the Pearson and Spearman correlations seem to be preferable in some cases. At this stage, I am getting stuck during dimensionality reduction. I would like to visualize clusters of organisms in 2D/3D space, and Python seems to be my best option.

My next strategy will be dimensionality reduction with UMAP on a spearman distance matrix. The Spearman + PCA approach seemed appropriate around this time, and produced good clustering results. However, there aren't enough sample points in the distance matrix yet. I'll need to construct some more complex artificial metagenomes if I want to succeed.

Clarify null model revision and document

We use a first-order Markov model as part of the null model, and a uniform model as the remainder. The two are multiplied together, and thus their log10-transformed terms add. So it's going down in the log-odds ratio right now, but not in the probability of the sequence.
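
As a rough sketch of that relationship (the transition probabilities below are placeholders, not kmerdb's actual parameterization): the uniform term covers the initial residue and the Markov transitions cover the rest, all added in log10 space.

import math

def log10_null_probability(seq, transition, n_states=4):
    # uniform probability for the first residue
    logp = math.log10(1.0 / n_states)
    # first-order Markov transitions for the remaining residues
    for prev, cur in zip(seq, seq[1:]):
        logp += math.log10(transition[(prev, cur)])
    return logp

# transition[(a, b)] = P(next base is b | current base is a), rows summing to 1
uniform_transition = {(a, b): 0.25 for a in "ACGT" for b in "ACGT"}
# print(log10_null_probability("ACGTAC", uniform_transition))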

NumPy array dtype inference is performed on the array from the file metadata["dtype"]

This fully specifies the memory profile of the in memory version of the k-mer counter. Choices : ["uint32", "uint64", "float32", "float64"]
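
A minimal sketch of what that inference could look like, assuming a hypothetical default of uint64 when the field is absent:

import numpy as np

ALLOWED_DTYPES = ("uint32", "uint64", "float32", "float64")

def counts_array(metadata, k):
    # pick the numpy dtype named in the file metadata, then allocate 4**k cells
    dtype = metadata.get("dtype", "uint64")   # "uint64" is an assumed default
    if dtype not in ALLOWED_DTYPES:
        raise TypeError("Unsupported dtype in metadata: {0}".format(dtype))
    return np.zeros(4 ** k, dtype=dtype)

# counts = counts_array({"dtype": "uint32"}, k=8)   # 65536-element uint32 array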

It's worth stating that after some changes in the Pearson correlation coefficient code, my correlation coefficients showed up. The Pearson noise depends on what you're describing: a depth and/or error rate. The model can't be specified. Ratios. I can't justify the 0.999-level accuracy that I'm seeing; it seems less than optimal, but the real order should be specified as a Spearman correlation. I'm just doing that now. Well, I reran the Pearson and suddenly my Pearsons show up and I'm like what. The last column is

0.9986406358838387      0.99870468338501        0.9987507311026925      0.998772078581407       0.9987762730886994      0.9987902044527077      0.9987863177029056   0.998805286411466        0.9988063501684376      0.9984427576272187      1.0

Alright here go my spearmans:

... Still blank. spearmanr RTFM targeted.

Urgent help needed in matrix pass

I'm trying to identify a TypeError involving the default dtype for one or more files:

Traceback (most recent call last):
  File "/home/matt/.pyenv/versions/kdb/bin/kmerdb", line 33, in <module>
    sys.exit(load_entry_point('kmerdb==0.5.2', 'console_scripts', 'kmerdb')())
  File "/home/matt/.pyenv/versions/kdb/lib/python3.10/site-packages/kmerdb-0.5.2-py3.10.egg/kmerdb/__init__.py", line 1613, in cli
    args.func(args)
  File "/home/matt/.pyenv/versions/kdb/lib/python3.10/site-packages/kmerdb-0.5.2-py3.10.egg/kmerdb/__init__.py", line 649, in get_matrix
    raise TypeError("One or more files did not have the default dtype or the default dtype did not match one or more files...")
TypeError: One or more files did not have the default dtype or the default dtype did not match one or more files...

Exchange header for metadata throughout the codebase.

I want to refactor the code to use 'metadata' as the keyword when referring to the file through the UI, instead of 'header', which is a little non-specific and ambiguous. Historically, I think it made sense to use the term header very often, but I would prefer to use the phrase metadata to refer to the context. And the first step towards this refactor is replacing the k-mer line-specific metadata or 'neighbors' array with kmer_metadata.

What if k is bigger than the sequences?

If the input sequences are smaller than k, i.e. no k-mers can be derived from them, which module handles this? Oh, it wasn't handled. It shoots through Kmer.shred without checking.
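
A hedged sketch of the missing guard; Kmer.shred's real signature may differ from this illustrative helper.

def shred(seq, k):
    # refuse sequences shorter than k instead of silently passing them through
    if len(seq) < k:
        raise ValueError(
            "sequence length {0} is shorter than k={1}; no k-mers can be derived".format(len(seq), k)
        )
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# shred("ACGTAC", 4) -> ["ACGT", "CGTA", "GTAC"]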

Retain sqlite3 databases optionally as output.sqlite3

Previously we retained the databases, but their location was only printed in a debug-level logging statement, which is not the most obvious place to advertise where the sqlite databases retained under the --keep-sqlite flag of kmerdb profile end up. In this issue, I'll move the temporary databases to a more obviously retained location, and I'll print the retention statement to sys.stderr instead of a debug-level log.

Did not catch the .fna suffix for fasta

A modest name for a more substantial bug. Prior to this, I had been adhering to my own naming convention for plain nucleic acid fasta files as .fasta or .fa and not the more explicit .fna, and I had been assuming that everything that wasn't .fq or .fastq was not only fasta-formatted input (which would get caught by SeqIO), but also that the suffix would be .fa or .fasta. The proper fix, of course, is to capture all three fasta suffixes and both fastq suffixes, and to raise a ValueError (less harsh than an IOError and clearer, I think) about the format issue if the file does not belong to the five. Prior to this, this was not a feature of the kmerdb.seqparser module.
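
A sketch of the intended suffix handling; the exact suffix set accepted is the assumption here.

FASTA_SUFFIXES = (".fa", ".fasta", ".fna")
FASTQ_SUFFIXES = (".fq", ".fastq")

def classify_sequence_file(path):
    # strip an optional .gz, then match against the five accepted suffixes
    name = path[:-3] if path.endswith(".gz") else path
    if name.endswith(FASTA_SUFFIXES):
        return "fasta"
    elif name.endswith(FASTQ_SUFFIXES):
        return "fastq"
    raise ValueError("Unrecognized sequence file suffix: {0}".format(path))

# classify_sequence_file("genome.fna.gz") -> "fasta"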

Create a first release for kdb

Steer the code towards resolving the following

  • standardize the report on the tsv filenames you expect them to follow from the documentation.
  • The k-mer report by kdb.
  • dendrogram
  • pca and/or tsne plus kmeans
  • pca elbow + kmeans elbow

Larger look into the uint64 space, choosing the best k for my machine

The dimensions of biological data that we are able to investigate are formed by some choice of k, and we can represent biological scenarios in terms of .kdb files. By selecting some choice of k, we are able to concentrate the problem-solving capacities of the computer, the logical imperative to mesh with the grammatical, and produce a readable, semantic, organic codebase.

I don't spend enough time in the README, but it's a source of great anxiety.

Anyways yeah I'm going to consider this uint space for a moment to really get a grasp of the potential feature space.

Needs a suitable citation feature similar to GNU parallels

Even if the CITATION text isn't completely finished because we don't have URLs, publish dates etc from a preprint or an article, it is important to have the essential process for showing the citation information and silencing it if needed.

The milestone "suitable for academic use" means that the methodologies have all been documented in a preprint; it does not mean that the methods are necessarily sound. No warranties are made about how you use this software. There is no guarantee that my formula for the Markov probabilities is correct, no guarantee that a first-order Markov model is a suitable null model, and no guarantee that a uniform random distribution is suitable for the frequency (p(x)) of the 0th k-mer in the null model. The counts are accurate and the methods applied are honest, but I can't talk about when and how this should be used until I better understand my audience and the peers who will review this work.

No dtype option for memory conservation

Currently there is no command-line option for specifying the integer size (NumPy's dtype parameter) when accumulating counts for the k-mer profile. This should be in both kmerdb/parse.py and kmerdb/__init__.py numpy array instantiations (np.zeros()).
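
One way the missing option could look, sketched with argparse; the --dtype flag here is hypothetical, not a current kmerdb flag.

import argparse
import numpy as np

parser = argparse.ArgumentParser(prog="kmerdb profile (sketch)")
parser.add_argument("-k", type=int, required=True)
parser.add_argument("--dtype", choices=("uint32", "uint64", "float32", "float64"),
                    default="uint64", help="integer/float width of the count array")
args = parser.parse_args(["-k", "8", "--dtype", "uint32"])

# the chosen dtype string is handed straight to the np.zeros() instantiation
counts = np.zeros(4 ** args.k, dtype=args.dtype)   # 65536 cells for k=8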

New regression tests needed

After the most recent pull request regarding PostgreSQL and parallelism, the regression test coverage has hit a new low of < 5%. I need better regression test coverage.

We don't have a symbol for the end of the metadata block.

While reading the header in kmerdb.fileutil.KDBReader.__init__(), we need to rstrip() the tag from each block; the tag should be defined in kmerdb/config.py and also be used by the writer when writing the entire header to the file.

This can be used by kmerdb view to delineate the end of the header when reading the file. The real reason to use this is that if the kdb input is being read from stdin, we need to gzip-decompress it and iterate using with gzip.open(sys.stdin) as stdin and stdin.readline() to append to a header string until the delimiter is encountered. Throw everything read up to that point into yaml.safe_load, followed by schema validation. Then either construct the writer to write the profile to the new file or to stdout.

    kdb : can someone get me some sentries
========================
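
A rough sketch of that stdin path, with a placeholder delimiter standing in for whatever symbol config.py ends up defining.

import gzip
import sys
import yaml

END_OF_METADATA = "========================\n"   # placeholder delimiter, not the real config value

def read_header_from_stdin():
    # wrap the binary stdin stream in gzip and accumulate header lines
    header = []
    with gzip.open(sys.stdin.buffer, "rt") as stdin:
        for line in stdin:
            if line == END_OF_METADATA:
                break
            header.append(line)
    metadata = yaml.safe_load("".join(header))
    # ...schema validation, then hand off to the reader/writer...
    return metadata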

K-mer db doesn't store read/fasta index location

Chase this tuple (seq_id, starting_position, strand)

For k-mer db to be a full graph database format, it would need to store the following data:

  1. seq_id: the full qualified read ID from the fastq or fasta input
  2. starting_position, offset from the most 5' position
  3. reverse: forward strand assumed.

What is the desired behavior for the metadata index? What's really the next step here?

The user wants to traverse back to the corresponding sequences from within the graph API. This merely stores it in the database, and should be an addition of oooff... potentially a lot of text, making the database size much bigger than anticipated. Hopefully the annotations end with just the sequence id.

Major refactor of kdb module

The code commenting level is poor, functions are too monolithic, and np.array functionality is not easily substitutable.

The KDB | writer | reader structure seems clunky.

Print a 2x2 distance matrix out as a single value

In prior versions of the distance method in bin/kdb, the distance function reported singleton distances as single printed values instead of as a TSV distance matrix. If we want to print single distances as single values instead of as distance matrices, we can restore this feature even with the new scipy.spatial.distance.pdist support.
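
A small sketch of that behavior using SciPy's condensed pdist output: when only two profiles are compared, report the single pairwise distance rather than a 2x2 matrix.

import numpy as np
from scipy.spatial.distance import pdist, squareform

def report_distances(count_matrix, metric="correlation"):
    condensed = pdist(count_matrix, metric=metric)
    if count_matrix.shape[0] == 2:
        return float(condensed[0])          # a single value for a single pair
    return squareform(condensed)            # otherwise the full square matrix

# report_distances(np.array([[1, 2, 3], [2, 4, 6]]))  -> ~0.0 (perfectly correlated rows)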

The new version makes some changes to the metadata

The metadata of the file now contains a sorting order that we're just starting to work on, to prioritize the elements of a list. We have to refine the algorithm to specify whether, via Pearson or Spearman correlation coefficients, we focus our search on high-affinity targets or low-affinity targets.

We also have to prioritize a method to, at times, de-link or de-couple the types of the numbers flowing through the program.

IndexError out of bounds from self.kmer_ids[i] = kmer_id

This IndexError is coming up in fileutil.py

Traceback (most recent call last):
  File "/home/matt/.pyenv/versions/kdb/bin/kmerdb", line 33, in <module>
    sys.exit(load_entry_point('kmerdb==0.5.0', 'console_scripts', 'kmerdb')())
  File "/home/matt/.pyenv/versions/kdb/lib/python3.10/site-packages/kmerdb-0.5.0-py3.10.egg/kmerdb/__init__.py", line 1625, in cli
    args.func(args)
  File "/home/matt/.pyenv/versions/kdb/lib/python3.10/site-packages/kmerdb-0.5.0-py3.10.egg/kmerdb/__init__.py", line 1200, in view
    with fileutil.open(arguments.kdb_in, mode='r') as kdb_in:
  File "/home/matt/.pyenv/versions/kdb/lib/python3.10/site-packages/kmerdb-0.5.0-py3.10.egg/kmerdb/fileutil.py", line 189, in open
    return KDBReader(filename=filepath, mode=mode)
  File "/home/matt/.pyenv/versions/kdb/lib/python3.10/site-packages/kmerdb-0.5.0-py3.10.egg/kmerdb/fileutil.py", line 326, in __init__
    self.slurp(dtype=dtype)
  File "/home/matt/.pyenv/versions/kdb/lib/python3.10/site-packages/kmerdb-0.5.0-py3.10.egg/kmerdb/fileutil.py", line 477, in slurp
    self._slurp(dtype=dtype)
  File "/home/matt/.pyenv/versions/kdb/lib/python3.10/site-packages/kmerdb-0.5.0-py3.10.egg/kmerdb/fileutil.py", line 452, in _slurp
    self.kmer_ids[i] = kmer_id
IndexError: index 65536 is out of bounds for axis 0 with size 65536

Slurp profile

The other goal that can be started is the experimentation of a get_profile function or similar in fileutils that would slurp the entire profile into memory. I'm sick of not being able to have direct access to the data in memory if I'm sure it will fit, and having a function like this would be helpful for experimentation with the similarity metrics.

All metadata feature is broken

After the most recent pull request, the feature for inserting data into the database has changed. I need to test this on a small value of k to see how it goes.

Create a function to shuffle the nodes of a kmer database amongst the ids

kmerdb shuf

This command shuffles the lines of a kdb file, much like the command-line tool shuf. It requires the file to be indexed, since it does the shuffling on disk through a single stream, behind the index. This makes the shuffling very memory efficient, which will be necessary as the size of the nodes/kmer_metadata grows.
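
A hedged sketch of the idea, with an invented offset-index format for illustration; the real implementation would presumably seek on bgzf virtual offsets (e.g. via Bio.bgzf) rather than on a raw uncompressed stream.

import random

def shuffled_records(path, offsets, seed=None):
    # only the list of byte offsets lives in memory; records are re-read on demand
    rng = random.Random(seed)
    order = list(offsets)
    rng.shuffle(order)
    with open(path, "rb") as fh:
        for offset in order:
            fh.seek(offset)
            yield fh.readline()

# for record in shuffled_records("profile.8.kdb.uncompressed", index_offsets):
#     ...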

view needs to be able to read uncompressed input

view needs to be able to read and write STDIN/STDOUT and to read and write bgzf (un)compressed input/output, except that we won't actually write uncompressed bgzf output to a file. Just redirect it from STDOUT.

kmerdb matrix and distance don't assume tsv/csv with no inputs supplied.

Minor syntactic sugar on the command line.

Old usage (still supported)

kmerdb matrix Unnormalized *.kdb | kmerdb distance spearman STDIN #or '/dev/stdin'
kmerdb matrix Unnormalized *.kdb | kmerdb matrix Normalized STDIN

New usage should be

kmerdb matrix Unnormalized *.kdb | kmerdb matrix Normalized
