dkoslicki / CMash
Fast and accurate set similarity estimation via containment min hash
License: BSD 3-Clause "New" or "Revised" License
Otherwise you will get a **kwargs error about labels being an invalid keyword argument when you call df.reindex.
The current implementation of multiple k-mer size containment index computation may have some bugs in it, as noted here. In particular, training with 30-60-10 and training with 60-60-10 result in different 60-mer containment index values (which they shouldn't).
Currently this is single threaded, but embarrassingly parallel, so it could be parallelized for large reference databases.
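Since each reference sketch is compared to the query independently, the work splits into one task per reference with no shared state. A minimal sketch of that split (hypothetical helper names, not the actual CMash code; a thread pool is used here for simplicity, while a ProcessPoolExecutor would sidestep the GIL for CPU-bound hashing):

```python
from concurrent.futures import ThreadPoolExecutor

def containment_for_reference(query_kmers, ref_kmers):
    # Hypothetical per-reference worker: containment of the reference in the query.
    if not ref_kmers:
        return 0.0
    return len(query_kmers & ref_kmers) / len(ref_kmers)

def parallel_containment(query_kmers, reference_sketches, workers=4):
    # Embarrassingly parallel: one independent task per reference sketch.
    with ThreadPoolExecutor(max_workers=workers) as ex:
        return list(ex.map(lambda ref: containment_for_reference(query_kmers, ref),
                           reference_sketches))

print(parallel_containment({"ACG", "CGT", "GTA"}, [{"ACG", "CGT"}, {"TTT"}]))
# [1.0, 0.0]
```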
In the current CMash/GroundTruth.py, since it's using a naive Python set calculation, this is much too slow to get results in a reasonable amount of time on larger data sets. Will need to switch to using KMC many times over to calculate the actual ground truth. Work being done on the groundtruth branch.
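The naive calculation being replaced here is essentially exact set containment over the full k-mer sets. A minimal sketch of that computation (hypothetical helper names, not the actual GroundTruth.py code):

```python
def kmers(seq, k):
    # All k-mers of a sequence as a set (no canonicalization; a simplification).
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def ground_truth_containment(query_seq, ref_seq, k):
    # Exact containment index |A ∩ B| / |A| over full k-mer sets.
    a, b = kmers(query_seq, k), kmers(ref_seq, k)
    return len(a & b) / len(a)

print(ground_truth_containment("ACGTACGT", "ACGTTTTT", 3))  # 0.5
```

Both set constructions enumerate every k-mer of every sequence, which is exactly why this approach doesn't scale and a disk-based k-mer counter like KMC is needed instead.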
Use kmc_tools intersect and divide by the total number of distinct training database k-mers.

Scripts should be brief and basically just call other modules. StreamingQueryDNADatabase.py is a bloated mess right now and is very difficult to test (this will affect #14). The bulk of what's in this script should live in a different module, and StreamingQuery... should only call it. This imagined module should contain unit tests.
Move the bulk of StreamingQueryDNADatabase.py to a separate module. This will also serve to assist with the installation of Metalign.
Add canonical kmers to the sketch? Or just standard kmers? Add canonical/std kmers to the tree? To the prefilter? For prefix searches?
Add tests, hopefully get on conda, etc.
Try creating a pre-filter consisting of all kmers (for all kmer sizes) of all the hashes and their prefixes, so I don't have to hit the trie so frequently.
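A sketch of that idea, with a plain Python set standing in for the pre-filter (a real implementation would use a Bloom filter to bound memory; names are hypothetical):

```python
def build_prefilter(sketch_kmers, k_sizes):
    # Set of every k-length prefix of every sketch k-mer, for all k in k_sizes.
    pre = set()
    for kmer in sketch_kmers:
        for k in k_sizes:
            pre.add(kmer[:k])
    return pre

def maybe_in_trie(query_kmer, prefilter):
    # Cheap membership check; only on a hit do we pay for the trie lookup.
    return query_kmer in prefilter

sketch = ["ACGTAC", "TTGGCC"]
pre = build_prefilter(sketch, k_sizes=[2, 4, 6])
print(maybe_in_trie("ACGT", pre), maybe_in_trie("GGGG", pre))  # True False
```

Because the pre-filter never produces false negatives, every real trie hit still gets through; the payoff is skipping the trie for the (typically vast) majority of query k-mers that match nothing.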
Haven't looked closely, but it appears that nowhere am I making sure everything is upper case. While the BF doesn't care (new option set), the TST definitely does care about case.
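A minimal illustration of why case matters, with a plain Python set standing in for the TST (which, like marisa-trie, does exact byte-string matching):

```python
def normalize(seq):
    # Upper-case once at ingest so downstream lookups are case-consistent.
    return seq.upper()

# A case-sensitive structure misses soft-masked (lowercase) sequence
# unless it is normalized first:
tst_keys = {"ACGT"}
print("acgt" in tst_keys)             # False: case mismatch
print(normalize("acgt") in tst_keys)  # True
```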
The goal of this is to create a fast way to re-create k-mer count distributions (similar to what publications like this, and references therein, are trying to do).
Current code base already keeps track of sketch k-mer counts (see here, and here).
Project would be:
Optional:
This would be sufficient for a conference paper.
For a journal paper would need to:
Hence if you run 12 threads, it will overcount by up to 12. For example, test with k-mer size 1 and 12 threads, and note how the containment index is 12 (instead of 1).
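The overcount can be reproduced deterministically without threads: if each worker reports a hit for its own chunk and the per-chunk flags are summed rather than deduplicated, the count inflates by up to the number of workers. A sketch (not the actual CMash code):

```python
def chunked_hits(seq, kmer, n_chunks):
    # Each "thread" scans its own chunk and reports whether the k-mer was seen;
    # summing the per-chunk flags (instead of OR-ing them) overcounts.
    step = max(1, len(seq) // n_chunks)
    chunks = [seq[i:i + step] for i in range(0, len(seq), step)]
    return sum(kmer in chunk for chunk in chunks)

seq = "A" * 24
print(chunked_hits(seq, "A", 12))          # 12: the 1-mer is "found" once per chunk
print(min(chunked_hits(seq, "A", 12), 1))  # 1: deduplicated count
```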
Cloned master branch today; install fails with this error (all on one line)
Download error on https://pypi.org/simple/pytest-runner/: [SSL:
TLSV1_ALERT_PROTOCOL_VERSION] tlsv1 alert protocol version (_ssl.c:590) --
Some packages may not be found!
Because of that, it fails to find a suitable version of pytest-runner:
distutils.errors.DistutilsError: Could not find suitable distribution
for Requirement.parse('pytest-runner<3dev,>=2.0')
I'm aware of what TLSV1_ALERT_PROTOCOL_VERSION is, and have updated my own versions of pip and virtualenv (today).
I'm parroting the installation steps pretty closely:
cd somepath/cmash
source somepath/cmash/cmash_venv/bin/activate
pip install --upgrade pip
Requirement already up-to-date: pip in ./cmash_venv/lib/python2.7/site-packages (18.0)
git clone https://github.com/dkoslicki/CMash.git
mv CMash cmash-master.aug_2_2018
cd somepath/cmash/cmash-master.aug_2_2018
pip install -r requirements.txt
The install is what fails. Full log below.
I'm a little baffled. I'm running the pip inside the VE (which ought to be the correct one). And if I search for pytest-runner it finds one, without getting any protocol error.
source somepath/cmash/cmash_venv/bin/activate
pip --version
pip 18.0 from somepath/cmash/cmash_venv/lib/python2.7/site-packages/pip (python 2.7)
pip search pytest-runner
(many other packages)
pytest-runner (4.2) - Invoke py.test as distutils command with dependency resolution
Thanks for any suggestions.
Looking in indexes: https://pypi.python.org/simple/
Obtaining file://somepath/cmash/cmash-master.aug_2_2018 (from -r requirements.txt (line 3))
Collecting khmer>=2.1.1 (from CMash==0.3.0->-r requirements.txt (line 3))
Using cached https://files.pythonhosted.org/packages/78/c9/caee0a310f249c86fd351706d2513e197e036231f4db6f3bda7a363ea1c4/khmer-2.1.1.tar.gz
Complete output from command python setup.py egg_info:
Download error on https://pypi.org/simple/pytest-runner/: [SSL: TLSV1_ALERT_PROTOCOL_VERSION] tlsv1 alert protocol version (_ssl.c:590) -- Some packages may not be found!
Couldn't find index page for 'pytest-runner' (maybe misspelled?)
Download error on https://pypi.org/simple/: [SSL: TLSV1_ALERT_PROTOCOL_VERSION] tlsv1 alert protocol version (_ssl.c:590) -- Some packages may not be found!
No local packages or working download links found for pytest-runner<3dev,>=2.0
Traceback (most recent call last):
File "", line 1, in
File "/private/var/folders/p3/6sl53f_s4yz5fm1n3mbbd57w0000gp/T/pip-install-aJCcSA/khmer/setup.py", line 304, in
setup(cmdclass=CMDCLASS, **SETUP_METADATA)
File "somepath/cmash/cmash_venv/lib/python2.7/site-packages/setuptools/__init__.py", line 130, in setup
_install_setup_requires(attrs)
File "somepath/cmash/cmash_venv/lib/python2.7/site-packages/setuptools/__init__.py", line 125, in _install_setup_requires
dist.fetch_build_eggs(dist.setup_requires)
File "somepath/cmash/cmash_venv/lib/python2.7/site-packages/setuptools/dist.py", line 514, in fetch_build_eggs
replace_conflicting=True,
File "somepath/cmash/cmash_venv/lib/python2.7/site-packages/pkg_resources/__init__.py", line 773, in resolve
replace_conflicting=replace_conflicting
File "somepath/cmash/cmash_venv/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1056, in best_match
return self.obtain(req, installer)
File "somepath/cmash/cmash_venv/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1068, in obtain
return installer(requirement)
File "somepath/cmash/cmash_venv/lib/python2.7/site-packages/setuptools/dist.py", line 581, in fetch_build_egg
return cmd.easy_install(req)
File "somepath/cmash/cmash_venv/lib/python2.7/site-packages/setuptools/command/easy_install.py", line 670, in easy_install
raise DistutilsError(msg)
distutils.errors.DistutilsError: Could not find suitable distribution for Requirement.parse('pytest-runner<3dev,>=2.0')
i.e. do something more intelligent than:
containment_indices[:, k_size_loc] = (hit_matrices[k_size_loc].sum(axis=1).ravel()/float(num_hashes))
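If the per-genome, per-k-size denominators are precomputed as a matrix of the same shape, the per-column loop collapses into one elementwise division. A hedged numpy sketch with made-up numbers (not the actual CMash arrays):

```python
import numpy as np

hits = np.array([[10., 5.],
                 [ 0., 2.]])        # genomes x k-sizes: raw hit counts
num_unique = np.array([[20., 10.],
                       [ 8.,  4.]]) # matching per-genome, per-k denominators

# One elementwise division replaces the per-k-size Python loop:
containment_indices = hits / num_unique
print(containment_indices)
```

The same broadcasting trick works when the denominator is a single row or column vector, so the constant-denominator case in the quoted line is covered too.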
See this issue with Metalign. In short,
So I figured out the probable source of the program hanging: python3 vs python2 for CMash:
After setting up the data, this hangs:
python3 -m venv VE3
source VE3/bin/activate
./setup_libraries.sh
python3 metalign.py test/RL_S001__insert_270_1M_subset.fq data/ --output test/RL_S001__insert_270_1M_subset_results.tsv
python3 StreamingQueryDNADatabase.py ../../data/r7aqo9zw/60mers_intersection_dump.fa ../../data/cmash_db_n1000_k60.h5 ../../test/CMash_out.csv 30-60-10 -c 0 -r 10000 -v -f ../../data/cmash_filter_n1000_k60_30-60-10.bf --sensitive
So instead, try python2, and it doesn't hang:
virtualenv VE2
source VE2/bin/activate
cd CMash
pip install -r requirements.txt
python StreamingQueryDNADatabase.py ../../data/r7aqo9zw/60mers_intersection_dump.fa ../../data/cmash_db_n1000_k60.h5 ../../test/CMash_out.csv 30-60-10 -c 0 -r 10000 -v -f ../../data/cmash_filter_n1000_k60_30-60-10.bf --sensitive
python3 metalign.py test/RL_S001__insert_270_1M_subset.fq data/ --output test/RL_S001__insert_270_1M_subset_results.tsv
python metalign.py test/RL_S001__insert_270_1M_subset.fq data/ --output test/RL_S001__insert_270_1M_subset_results.tsv
So possible solutions (with my assessment of ease of implementation):
Make setup_libraries.sh use python2 when installing CMash (probably via a virtualenv) (easy)
See why marisa-trie isn't working with python3 (their repo says it's python3 compatible) (medium)
Refactor CMash so it's python3 compliant (hard)
When running StreamingQueryDNADatabase.py, in reality we need only the k-mers in the sample that exist in the training database sketches. As such, it's possible to:
StreamingQueryDNADatabase.py
Steps 2-5 are basically what's done here; I noted this approach to Nathan LaPierre, but never got around to implementing it in CMash.
Current tests are end-to-end integration tests that make sure scripts execute successfully. There is much more testing that could be done, including a tests folder (lots can be copied from CMash/MinHash.py).

Will need to:
MakeStreamingDNADatabase.py
mt.Trie()
Work will happen in the branch ModularizeMake
E.g. if

python MakeStreamingDNADatabase.py ${trainingFiles} ${outputDir}/${cmashDatabase} -n ${numHashes} -k ${maxKsize} -v

and ${outputDir} doesn't exist, then MakeStreamingDNADatabase.py should create it!
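A one-liner covers this: os.makedirs with exist_ok=True creates the directory (and any missing parents) and is a no-op when it already exists, so it's safe to call unconditionally at script start. Sketch (hypothetical function name):

```python
import os

def ensure_output_dir(path):
    # Create the output directory (and parents) if missing; no-op if present.
    os.makedirs(path, exist_ok=True)

import tempfile
out_dir = os.path.join(tempfile.mkdtemp(), "cmash_out")
ensure_output_dir(out_dir)
print(os.path.isdir(out_dir))  # True
```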
in yaml and docs
The Bloom filter approach (presented in this paper) corresponds to:
MakeDNADatabase.py
QueryDNADatabase.py
MakeNodeGraph.py
The new (as of yet unpublished) implementation that uses ternary search trees and streaming consists of
MakeStreamingDNADatabase.py
MakeStreamingPrefilter.py
StreamingQueryDNADatabase.py
StreamingQueryDNADatabase_queue.py (deprecated)

Need to clarify this in the repo structure, as well as update the documentation.
In the ModularizeStreamingQuery branch, the second half of the Query.py method find_non_unique_kmers needs to be moved to create_post_containment_indicies. See the #FIXME.
This is more of an observation than an issue. I'm not sure whether this would be considered a bug or a feature.
The kmerization (at least in MinHash.py lines 758-763) is agnostic to the alphabet. That's a feature if you're trying to compare, e.g., sets of proteins. But if you're trying to compare DNA that has some Ns in it, each N can produce k k-mers that are unlikely to exist in the other set(s), reducing the similarity counts.
Might be worth letting the user know this happens, either in the readme, or maybe by giving a warning when a sequence with non-ACGT characters is encountered (assuming ACGT is the intended use).
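The effect is easy to see with a sliding-window k-merization like the one described (a simplified stand-in, not the actual MinHash.py code): a single N lands in up to k consecutive windows.

```python
def kmers(seq, k):
    # Alphabet-agnostic k-merization: Ns pass straight through into the k-mers.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

print(kmers("ACNGT", 3))  # ['ACN', 'CNG', 'NGT']: one N taints all three 3-mers
```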
I didn't test the following, but if a reference genome is softmasked, are the kmers containing lowercase going to match the same kmer containing uppercase?
In my case, admittedly atypical, I had several sets of kmers that I wanted to compare. So I converted them to fasta files that had a single fasta header (for each file), and with each line containing an N followed by a distinct kmer. Using MakeDNADatabase and QueryDNADatabase as shown in the README (except with --containment_threshold 0.0), what I got in the end was a csv file with nothing in it but the headers. This confused me, since my query file was one of the 10 files in my database -- I expected to at least see that fact reflected in the output.
The simple solution was to get rid of the Ns and give every kmer its own fasta header. That appears to have worked fine.
The node graph file was much bigger in the attempt that used Ns, consistent with the hypothesis that Ns are included in the kmers. But I still can't explain why it failed to report the perfect match.
For some really weird reason, the output queue is showing non-deterministic behavior: the sleep time affects the number of hits returned.
Basically, show an end-to-end way to:
Work happening on Retraining branch.
For example, using the Metalign default training database (199807 genomes) and running
python MakeStreamingDNADatabase.py ${trainingFiles} ${outputDir}/${cmashDatabase} -n ${numHashes} -k 60 -v
python MakeStreamingPrefilter.py ${outputDir}/${cmashDatabase} ${outputDir}/${prefilterName} 30-60-10
results in uncompressed:
16G Mar 22 03:39 cmash_db_n1000_k60.h5
9.3G Mar 22 08:07 cmash_db_n1000_k60_30-60-10.bf
6.9G Mar 22 04:34 cmash_db_n1000_k60.tst
yet
4.6G Mar 22 03:39 cmash_db_n1000_k60.h5.gz
3.6G Mar 22 08:07 cmash_db_n1000_k60_30-60-10.bf.gz
3.6G Mar 22 04:34 cmash_db_n1000_k60.tst.gz
so roughly 2-3.5x compression.
Would need to either modify MakeStreamingDNADatabase.py and MakeStreamingPrefilter.py to detect compressed training data and decompress it in the script, or (better yet)

Specifically:
# then normalize by the number of unique k-mers (to get the containment index)
# In essence, this is the containment index, restricted to unique k-mers. This effectively increases the specificity,
# but also increases the variance/confidence interval, since this decreases the size of the sketch.
for k_size_loc in range(len(k_range)):
k_size = k_range[k_size_loc]
for hash_loc in np.where(containment_indices[:, k_size_loc])[0]: # find the genomes with non-zero containment
unique_kmers = set()
for kmer in CEs[hash_loc]._kmers:
unique_kmers.add(kmer[:k_size]) # find the unique k-mers
#containment_indices[hash_loc, k_size_loc] /= float(len(unique_kmers)) # FIXME: this doesn't seem like the right way to normalize, but apparently it is!
containment_indices[hash_loc, k_size_loc] /= float(num_unique[hash_loc, k_size_loc]) # FIXME: in small tests, this seems to give better results. To be revisted.
which denominator in the last two lines is correct?
As the current version is quite old
Definitions:
"new method" = use a very large k-mer size, put in ternary search trie, use prefix matches to infer smaller k-mer size containment values
"old method" = train and re-run CMash on each individual k-mer size.
Tasks:
Run StreamingQueryDNADatabase.py with a multiple k-mer size training database and compare to running StreamingQueryDNADatabase.py many times with training databases trained at a specific k-mer size. This would be sufficient for a conference paper. More details can follow depending on interest.
For a journal publication, would need to:
Dear CMash team,
I am wondering whether CMash is a metric: is CMash(A, B) the same as CMash(B, A), and does CMash(A, B) + CMash(B, C) >= CMash(A, C) hold? Those are the last two axioms of a metric space: https://en.wikipedia.org/wiki/Metric_space. Since MinHash is an unbiased estimate of the Jaccard index (a metric), it is a metric. But I am not sure whether the CMash-estimated mutation rate is also a metric distance.
Many Thanks,
Jianshu
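On the symmetry question, note that the containment index itself is not symmetric, which a tiny sketch makes concrete (plain Python sets, not the CMash estimator):

```python
def containment(a, b):
    # Containment index |A ∩ B| / |A|; the denominator is |A| only.
    return len(a & b) / len(a)

A = {"ACG", "CGT"}                    # small set, fully inside B
B = {"ACG", "CGT", "GTA", "TAC"}
print(containment(A, B))  # 1.0
print(containment(B, A))  # 0.5
```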