CMash's Issues

Multiple k-mer sizes bug

The current implementation of the multiple k-mer size containment index computation may have a bug, as noted here. In particular, training with 30-60-10 and training with 60-60-10 produce different 60-mer containment index values (which they shouldn't).

Ground truth computation is too slow for realistic data set sizes

The current CMash/GroundTruth.py uses a naive Python set calculation, which is much too slow to produce results in a reasonable amount of time on larger data sets. We will need to switch to calling KMC many times over to calculate the actual ground truth.

Work being done on groundtruth branch.

  • Use KMC to count the k-mers of all training database genomes for all k-mer sizes (and return the total number of distinct k-mers)
  • Use KMC to count the k-mers of the query genome for all k-mer sizes
  • Calculate the containment index via kmc_tools intersect and divide by the total number of distinct training database k-mers (sketched below)
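
A minimal sketch of this workflow, assuming kmc and kmc_tools are on the PATH; the database and file names, and the helper functions, are illustrative rather than CMash code:

import subprocess

def kmc_count(fasta, db, k, workdir="."):
    # -fm: multi-FASTA input; -ci1: keep every k-mer occurring at least once
    subprocess.check_call(["kmc", "-k%d" % k, "-ci1", "-fm", fasta, db, workdir])

def distinct_kmers(db, dump_file):
    # dump one "<kmer><TAB><count>" pair per line, then count the lines
    subprocess.check_call(["kmc_tools", "transform", db, "dump", dump_file])
    with open(dump_file) as fh:
        return sum(1 for _ in fh)

def ground_truth_containment(train_fasta, query_fasta, k):
    kmc_count(train_fasta, "train_db", k)
    kmc_count(query_fasta, "query_db", k)
    # intersect the two k-mer databases
    subprocess.check_call(["kmc_tools", "simple", "train_db", "query_db", "intersect", "inter_db"])
    # containment index = |train ∩ query| / (number of distinct training k-mers)
    return distinct_kmers("inter_db", "inter.txt") / float(distinct_kmers("train_db", "train.txt"))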

Modularize StreamingQueryDNADatabase.py

Scripts should be brief and basically just call other modules. StreamingQueryDNADatabase.py is a bloated mess right now and is very difficult to test (this will affect #14). The bulk of this script should live in a separate module, and StreamingQuery... should only call it; a sketch of the layout follows the checklist below. This imagined module should have unit tests.

  • move bulk of StreamingQueryDNADatabase.py to separate module
  • test integration with simple wrapper StreamingQueryDNADatabase.py
  • write unit tests for the module
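
A sketch of the intended split; the module name StreamingQuery and its functions are hypothetical, not existing CMash API:

# CMash/StreamingQuery.py -- the importable, unit-testable module
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="streaming query of a training database")
    parser.add_argument("reads")
    parser.add_argument("database")
    parser.add_argument("output_csv")
    return parser.parse_args(argv)

def run(args):
    # the bulk of the current StreamingQueryDNADatabase.py logic moves here,
    # broken into small functions that unit tests can exercise directly
    pass

# StreamingQueryDNADatabase.py -- thin wrapper, nothing in it to test
if __name__ == "__main__":
    import sys
    from CMash import StreamingQuery
    StreamingQuery.run(StreamingQuery.parse_args(sys.argv[1:]))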

Bloom filter as a pre-filter for k-mers

Try creating a pre-filter consisting of all k-mers (for all k-mer sizes) of all the hashes and their prefixes, so I don't have to hit the trie so frequently.
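
A toy sketch of the idea (not CMash's implementation): insert every prefix of every sketch k-mer into a Bloom filter, and consult it before each ternary search tree lookup. The filter size, hash count, and example k-mer range below are placeholders.

import hashlib

class Bloom(object):
    def __init__(self, n_bits=1 << 24, n_hashes=3):
        self.bits = bytearray(n_bits // 8)
        self.n_bits, self.n_hashes = n_bits, n_hashes

    def _positions(self, item):
        for i in range(self.n_hashes):
            h = hashlib.blake2b(item.encode(), salt=bytes([i])).digest()
            yield int.from_bytes(h[:8], "big") % self.n_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

prefilter = Bloom()
for kmer in ["ACGTACGTACGT"]:              # stand-in for the sketch k-mers
    for k in range(4, len(kmer) + 1, 2):   # stand-in for the k-mer size range
        prefilter.add(kmer[:k])

if "ACGTAC" in prefilter:
    pass  # only now pay for the ternary search tree lookup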

Test k-mer frequency distribution idea

The goal is to create a fast way to re-create k-mer count distributions (similar to what publications like this one, and the references therein, attempt).

The current code base already keeps track of sketch k-mer counts (see here, and here).

Project would be:

  • compare the histogram of sketch k-mer counts with the actual k-mer count histogram
  • implement a metric to compare the difference between the distributions (probably the total variation metric, the Wasserstein metric, or a simple L1 metric; see the sketch after this list)
  • compare how these metrics change as the sketch size is increased
  • compare to existing methods that re-create k-mer count distributions (e.g. this one and the methods it compares to)
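
A minimal sketch of the metric comparison using numpy/scipy; the two histograms below are toy stand-ins for the sketch and true k-mer count spectra:

import numpy as np
from scipy.stats import wasserstein_distance

true_hist   = np.array([0.60, 0.25, 0.10, 0.05])  # P(count = 1, 2, 3, 4)
sketch_hist = np.array([0.55, 0.28, 0.12, 0.05])

l1 = np.abs(true_hist - sketch_hist).sum()   # simple L1 metric
tv = 0.5 * l1                                # total variation distance
counts = np.arange(1, 5)
w1 = wasserstein_distance(counts, counts, true_hist, sketch_hist)

print("L1=%.3f  TV=%.3f  W1=%.3f" % (l1, tv, w1))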

Optional:

This would be sufficient for a conference paper.

For a journal paper, one would also need to:

  • characterize/prove the convergence between the true and estimated distributions as a function of sketch size (not too difficult, but it would take a bit of probability work)

StreamingQuery isn't sharing the set

Hence if you run 12 threads, it will overcount by up to a factor of 12. For example, test with k-mer size 1 and 12 threads, and note how the containment index comes out as 12 (instead of 1). A sketch of one possible fix follows.
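
A minimal sketch of one possible fix, assuming multiprocessing-style workers: share the "already seen" set across workers via multiprocessing.Manager (a dict proxy stands in for a set, since Manager provides no set proxy); all names here are hypothetical.

import multiprocessing as mp

def count_new(kmers, seen, lock, hits):
    # record each k-mer at most once across ALL workers
    for km in kmers:
        with lock:
            if km not in seen:
                seen[km] = True
                hits.value += 1

if __name__ == "__main__":
    mgr = mp.Manager()
    seen = mgr.dict()                      # the shared "set"
    lock = mgr.Lock()
    hits = mgr.Value("i", 0)
    chunks = [["ACGT", "AAAA"]] * 12       # 12 workers all see the same k-mers
    procs = [mp.Process(target=count_new, args=(c, seen, lock, hits)) for c in chunks]
    for p in procs: p.start()
    for p in procs: p.join()
    print(hits.value)                      # 2, not 24: no more overcounting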

Installation woes -- TLSV1_ALERT_PROTOCOL_VERSION and pytest-runner

Cloned the master branch today; the install fails with this error (all on one line):
Download error on https://pypi.org/simple/pytest-runner/: [SSL:
TLSV1_ALERT_PROTOCOL_VERSION] tlsv1 alert protocol version (_ssl.c:590) --
Some packages may not be found!
Because of that, it fails to find a suitable version of pytest-runner:
distutils.errors.DistutilsError: Could not find suitable distribution
for Requirement.parse('pytest-runner<3dev,>=2.0')

I'm aware of what TLSV1_ALERT_PROTOCOL_VERSION is, and have updated my own versions of pip and virtualenv (today).

I'm parroting the installation steps pretty closely:
cd somepath/cmash
source somepath/cmash/cmash_venv/bin/activate
pip install --upgrade pip
Requirement already up-to-date: pip in ./cmash_venv/lib/python2.7/site-packages (18.0)
git clone https://github.com/dkoslicki/CMash.git
mv CMash cmash-master.aug_2_2018
cd somepath/cmash/cmash-master.aug_2_2018
pip install -r requirements.txt

The install is what fails. Full log below.

I'm a little baffled. I'm running the pip inside the VE (which ought to be the correct one). And if I search for pytest-runner it finds one, without getting any protocol error.
source somepath/cmash/cmash_venv/bin/activate
pip --version
pip 18.0 from somepath/cmash/cmash_venv/lib/python2.7/site-packages/pip (python 2.7)
pip search pytest-runner
(many other packages)
pytest-runner (4.2) - Invoke py.test as distutils command with dependency resolution

Thanks for any suggestions.

Looking in indexes: https://pypi.python.org/simple/
Obtaining file://somepath/cmash/cmash-master.aug_2_2018 (from -r requirements.txt (line 3))
Collecting khmer>=2.1.1 (from CMash==0.3.0->-r requirements.txt (line 3))
Using cached https://files.pythonhosted.org/packages/78/c9/caee0a310f249c86fd351706d2513e197e036231f4db6f3bda7a363ea1c4/khmer-2.1.1.tar.gz
Complete output from command python setup.py egg_info:
Download error on https://pypi.org/simple/pytest-runner/: [SSL: TLSV1_ALERT_PROTOCOL_VERSION] tlsv1 alert protocol version (_ssl.c:590) -- Some packages may not be found!
Couldn't find index page for 'pytest-runner' (maybe misspelled?)
Download error on https://pypi.org/simple/: [SSL: TLSV1_ALERT_PROTOCOL_VERSION] tlsv1 alert protocol version (_ssl.c:590) -- Some packages may not be found!
No local packages or working download links found for pytest-runner<3dev,>=2.0
Traceback (most recent call last):
File "", line 1, in
File "/private/var/folders/p3/6sl53f_s4yz5fm1n3mbbd57w0000gp/T/pip-install-aJCcSA/khmer/setup.py", line 304, in
setup(cmdclass=CMDCLASS, **SETUP_METADATA)
File "somepath/cmash/cmash_venv/lib/python2.7/site-packages/setuptools/init.py", line 130, in setup
_install_setup_requires(attrs)
File "somepath/cmash/cmash_venv/lib/python2.7/site-packages/setuptools/init.py", line 125, in _install_setup_requires
dist.fetch_build_eggs(dist.setup_requires)
File "somepath/cmash/cmash_venv/lib/python2.7/site-packages/setuptools/dist.py", line 514, in fetch_build_eggs
replace_conflicting=True,
File "somepath/cmash/cmash_venv/lib/python2.7/site-packages/pkg_resources/init.py", line 773, in resolve
replace_conflicting=replace_conflicting
File "somepath/cmash/cmash_venv/lib/python2.7/site-packages/pkg_resources/init.py", line 1056, in best_match
return self.obtain(req, installer)
File "somepath/cmash/cmash_venv/lib/python2.7/site-packages/pkg_resources/init.py", line 1068, in obtain
return installer(requirement)
File "somepath/cmash/cmash_venv/lib/python2.7/site-packages/setuptools/dist.py", line 581, in fetch_build_egg
return cmd.easy_install(req)
File "somepath/cmash/cmash_venv/lib/python2.7/site-packages/setuptools/command/easy_install.py", line 670, in easy_install
raise DistutilsError(msg)
distutils.errors.DistutilsError: Could not find suitable distribution for Requirement.parse('pytest-runner<3dev,>=2.0')

Python3 hanging

See this issue with Metalign. In short,
So I figured out the probable source of the program hanging: python3 vs python2 for CMash.
After setting up the data and the environment with:

python3 -m venv VE3
source VE3/bin/activate
./setup_libraries.sh

This hangs:

python3 metalign.py test/RL_S001__insert_270_1M_subset.fq data/ --output test/RL_S001__insert_270_1M_subset_results.tsv

This is the thing causing the hang:

python3 StreamingQueryDNADatabase.py ../../data/r7aqo9zw/60mers_intersection_dump.fa ../../data/cmash_db_n1000_k60.h5 ../../test/CMash_out.csv 30-60-10 -c 0 -r 10000 -v -f ../../data/cmash_filter_n1000_k60_30-60-10.bf --sensitive
So instead, try python2, and it doesn't hang:

virtualenv VE2
source VE2/bin/activate
cd CMash
pip install -r requirements.txt

This runs just fine and does not hang:

python StreamingQueryDNADatabase.py ../../data/r7aqo9zw/60mers_intersection_dump.fa ../../data/cmash_db_n1000_k60.h5 ../../test/CMash_out.csv 30-60-10 -c 0 -r 10000 -v -f ../../data/cmash_filter_n1000_k60_30-60-10.bf --sensitive

This works too (oddly enough, since it's being called with python3; so it looks like only the CMash installation requires python2):

python3 metalign.py test/RL_S001__insert_270_1M_subset.fq data/ --output test/RL_S001__insert_270_1M_subset_results.tsv

This also works (as it appears metalign.py is python2/3 compliant):

python metalign.py test/RL_S001__insert_270_1M_subset.fq data/ --output test/RL_S001__insert_270_1M_subset_results.tsv
So possible solutions (with my assessment of ease of implementation):

  • Make setup_libraries.sh use python2 when installing CMash (probably via a virtualenv) (easy)
  • See why marisa-trie isn't working with python3 (their repo says it's python3 compatible) (medium)
  • Refactor CMash so it's python3 compliant (hard)

Improved classification time with KMC

When running StreamingQueryDNADatabase.py, we really need only the k-mers in the sample that also exist in the training database sketches. As such, it's possible to:

  1. Dump all the training database sketch k-mers using KMC (like it's done here)
  2. Use KMC to count the k-mers in the sample
  3. Intersect these with the k-mers in the training database sketches and dump these to a file
  4. Reformat these dumped k-mers into a FASTA-looking file
  5. Feed that into StreamingQueryDNADatabase.py

Steps 2-5 are basically what's done here; I noted this approach to Nathan LaPierre, but never got around to implementing it in CMash. A sketch of the reformatting step (step 4) follows.
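
A sketch of step 4, assuming the intersection from step 3 has already been dumped with kmc_tools transform (one tab-separated k-mer/count pair per line); the file names are illustrative:

def kmc_dump_to_fasta(dump_path, fasta_path):
    with open(dump_path) as src, open(fasta_path, "w") as dst:
        for i, line in enumerate(src):
            kmer = line.split("\t")[0]
            dst.write(">seq%d\n%s\n" % (i, kmer))   # one FASTA record per k-mer

kmc_dump_to_fasta("inter_dump.txt", "60mers_intersection_dump.fa")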

Testing environment

Current tests are end-to-end integration tests that make sure the scripts execute successfully. There is much more testing that could be done, including:

  • add unit tests to the tests folder (lots can be copied from CMash/MinHash.py)
  • add additional unit tests, as code coverage is rather low at this point
  • test the validity of the Query module results
  • automate the script tests so manual checking is no longer required
  • brute force calculate containment indices and check that the results of all scripts are within acceptable error ranges (see the sketch after this list)
  • add tests for the multiple k-mer size features in test_scripts
  • set up Travis CI
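
For the brute-force item, a pytest-style sketch; the helper below is illustrative and not CMash's API:

def brute_force_containment(a_seq, b_seq, k):
    a = set(a_seq[i:i + k] for i in range(len(a_seq) - k + 1))
    b = set(b_seq[i:i + k] for i in range(len(b_seq) - k + 1))
    return len(a & b) / float(len(a))   # C(A, B) = |A intersect B| / |A|

def test_identical_sequences_have_containment_one():
    seq = "ACGTACGTTGCA"
    assert brute_force_containment(seq, seq, 5) == 1.0

def test_disjoint_sequences_have_containment_zero():
    assert brute_force_containment("AAAAAA", "CCCCCC", 3) == 0.0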

Restructure the repo so it's clear which approach uses streaming and which uses Bloom filters

The Bloom filter approach (presented in this paper) corresponds to:

  1. MakeDNADatabase.py
  2. QueryDNADatabase.py
  3. MakeNodeGraph.py

The new (as yet unpublished) implementation that uses ternary search trees and streaming consists of:

  1. MakeStreamingDNADatabase.py
  2. MakeStreamingPrefilter.py
  3. StreamingQueryDNADatabase.py
  4. StreamingQueryDNADatabase_queue.py (deprecated)

We need to clarify this in the repo structure, as well as update the documentation.

Minor refactor of Query.py

In the ModularizeStreamingQuery branch, the second half of the Query.py method find_non_unique_kmers needs to be moved to create_post_containment_indicies. See the #FIXME.

k-merization makes no attempt to avoid Ns

This is more of an observation than an issue. I'm not sure whether this would be considered a bug or a feature.

The k-merization (at least in MinHash.py lines 758-763) is agnostic to the alphabet. That's a feature if you're trying to compare, e.g., sets of proteins. But if you're trying to compare DNA that has some Ns in it, each N can produce up to k k-mers that are unlikely to exist in the other set(s), reducing the similarity counts.

It might be worth letting the user know this happens, either in the README or via a warning when a sequence with non-ACGT characters is encountered (assuming ACGT is the intended use); a sketch of such a check follows.
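
A minimal sketch of what an alphabet-aware k-merization with a warning might look like (this is not how MinHash.py currently behaves):

import warnings

def kmers_skipping_non_acgt(seq, k):
    seq = seq.upper()   # so softmasked (lowercase) bases still match
    warned = False
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            yield kmer
        elif not warned:
            warnings.warn("non-ACGT character seen; k-mers containing it are skipped")
            warned = True

print(list(kmers_skipping_non_acgt("ACGNTACG", 3)))   # ['ACG', 'TAC', 'ACG']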

I didn't test the following, but if a reference genome is softmasked, will k-mers containing lowercase match the same k-mers in uppercase?

In my case, admittedly atypical, I had several sets of k-mers that I wanted to compare. So I converted them to FASTA files, each with a single FASTA header and with each line containing an N followed by a distinct k-mer. Using MakeDNADatabase and QueryDNADatabase as shown in the README (except with --containment_threshold 0.0), what I got in the end was a CSV file with nothing in it but the headers. This confused me, since my query file was one of the 10 files in my database; I expected to at least see that fact reflected in the output.

The simple solution was to get rid of the Ns and give every k-mer its own FASTA header. That appears to have worked fine.

The node graph file was much bigger in the attempt that used Ns, consistent with the hypothesis that Ns are included in the k-mers. But I still can't explain why it failed to report the perfect match.

Gzipping all training files results in a nice size reduction: add a feature that allows scripts/modules to handle this

For example, using the Metalign default training database (199807 genomes) and running

python MakeStreamingDNADatabase.py ${trainingFiles} ${outputDir}/${cmashDatabase} -n ${numHashes} -k 60 -v
python MakeStreamingPrefilter.py ${outputDir}/${cmashDatabase} ${outputDir}/${prefilterName} 30-60-10

results in the following uncompressed sizes:

16G Mar 22 03:39 cmash_db_n1000_k60.h5
9.3G Mar 22 08:07 cmash_db_n1000_k60_30-60-10.bf
6.9G Mar 22 04:34 cmash_db_n1000_k60.tst

yet gzipped:

4.6G Mar 22 03:39 cmash_db_n1000_k60.h5.gz
3.6G Mar 22 08:07 cmash_db_n1000_k60_30-60-10.bf.gz
3.6G Mar 22 04:34 cmash_db_n1000_k60.tst.gz

so ~2-4x compression.

We would need to either:

  • enable MakeStreamingDNADatabase.py and MakeStreamingPrefilter.py to detect compressed training data and decompress it in the script, or (better yet)
  • enable decompression in the modules MinHash.py and Query.py themselves (sketched below).
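
A minimal sketch of the second option: detect gzip by its two magic bytes and open transparently (the helper name is hypothetical):

import gzip

def open_maybe_gzipped(path, mode="rt"):
    with open(path, "rb") as fh:
        magic = fh.read(2)
    if magic == b"\x1f\x8b":   # gzip magic number
        return gzip.open(path, mode)
    return open(path, mode)

# works for training.fna and training.fna.gz alike; pass mode="rb" for
# binary artifacts such as the .h5/.tst/.bf files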

In post-processing, find correct denominator

Specifically:

# then normalize by the number of unique k-mers (to get the containment index)
# In essence, this is the containment index, restricted to unique k-mers. This effectively increases the specificity,
# but also increases the variance/confidence interval, since this decreases the size of the sketch.
for k_size_loc in range(len(k_range)):
    k_size = k_range[k_size_loc]
    for hash_loc in np.where(containment_indices[:, k_size_loc])[0]:  # find the genomes with non-zero containment
        unique_kmers = set()
        for kmer in CEs[hash_loc]._kmers:
            unique_kmers.add(kmer[:k_size])  # find the unique k-mers
        #containment_indices[hash_loc, k_size_loc] /= float(len(unique_kmers))  # FIXME: this doesn't seem like the right way to normalize, but apparently it is!
        containment_indices[hash_loc, k_size_loc] /= float(num_unique[hash_loc, k_size_loc])  # FIXME: in small tests, this seems to give better results. To be revisited.

Which denominator in the last two lines is correct?

Multiple k-mer sizes confirmation and testing

Definitions:
"new method" = use a very large k-mer size, put in ternary search trie, use prefix matches to infer smaller k-mer size containment values
"old method" = train and re-run CMash on each individual k-mer size.

Tasks:

  • Address #19 so we know nothing funky is happening with the current implementation.
  • Create a testing environment to prepare for comparing the old method to the new method.
  • Compare the old and new methods in estimating containment indices between individual genomes (i.e. using this kind of command). Repeat over many different genomes to get an idea of the difference between the old and new methods.
  • Compare the old and new methods in estimating the presence/absence of training database organisms in many simulated metagenomes (i.e. run StreamingQueryDNADatabase.py with a multiple-k-mer-size training database and compare to running StreamingQueryDNADatabase.py many times with training databases each trained at a specific k-mer size).

This would be sufficient for a conference paper. More details can follow depending on interest.

For a journal publication, one would also need to:

  • Understand the theory behind the bias in the k-mer prefix truncation (@dkoslicki has already written it up)
  • Test the magnitude of the bias factor over many genomes (relatively straightforward task).

Is CMash a metric?

Dear CMash team,

I am wondering whether CMash is a metric; that is, is CMash(A, B) the same as CMash(B, A), and does CMash(A, B) + CMash(B, C) >= CMash(A, C) hold? These are the last two axioms of a metric space: https://en.wikipedia.org/wiki/Metric_space. Since MinHash is an unbiased estimator of the Jaccard index (a metric), it is a metric. But I am not sure whether the CMash-estimated mutation rate is also a metric distance.

Many Thanks,

Jianshu
