dkoslicki / CMash
Fast and accurate set similarity estimation via containment min hash
License: BSD 3-Clause "New" or "Revised" License
Otherwise you will get a **kwargs error about labels being an invalid keyword argument when you call df.reindex.
The current implementation of multiple k-mer size containment index computation may have some bugs in it, as noted here. In particular, training with 30-60-10 and training with 60-60-10 result in different 60-mer containment index values (which they shouldn't).
Currently this is single threaded, but embarrassingly parallel, so it could be parallelized for large reference databases.
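Since each reference sketch is compared to the query independently, the work splits into one task per reference with no shared state. A minimal sketch of that split (hypothetical helper names, not the actual CMash code; a thread pool is used here for simplicity, while a ProcessPoolExecutor would sidestep the GIL for CPU-bound hashing):

```python
from concurrent.futures import ThreadPoolExecutor

def containment_for_reference(query_kmers, ref_kmers):
    # Hypothetical per-reference worker: containment of the reference in the query.
    if not ref_kmers:
        return 0.0
    return len(query_kmers & ref_kmers) / len(ref_kmers)

def parallel_containment(query_kmers, reference_sketches, workers=4):
    # Embarrassingly parallel: one independent task per reference sketch.
    with ThreadPoolExecutor(max_workers=workers) as ex:
        return list(ex.map(lambda ref: containment_for_reference(query_kmers, ref),
                           reference_sketches))

print(parallel_containment({"ACG", "CGT", "GTA"}, [{"ACG", "CGT"}, {"TTT"}]))
# [1.0, 0.0]
```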
In the current CMash/GroundTruth.py, since it's using a naive Python set calculation, this is much too slow to get results in a reasonable amount of time on larger data sets. Will need to switch to using KMC many times over to calculate the actual ground truth. Work being done on the groundtruth branch.
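The naive calculation being replaced here is essentially exact set containment over the full k-mer sets. A minimal sketch of that computation (hypothetical helper names, not the actual GroundTruth.py code):

```python
def kmers(seq, k):
    # All k-mers of a sequence as a set (no canonicalization; a simplification).
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def ground_truth_containment(query_seq, ref_seq, k):
    # Exact containment index |A ∩ B| / |A| over full k-mer sets.
    a, b = kmers(query_seq, k), kmers(ref_seq, k)
    return len(a & b) / len(a)

print(ground_truth_containment("ACGTACGT", "ACGTTTTT", 3))  # 0.5
```

Both set constructions enumerate every k-mer of every sequence, which is exactly why this approach doesn't scale and a disk-based k-mer counter like KMC is needed instead.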
Use kmc_tools intersect and divide by the total number of distinct training database k-mers.

Scripts should be brief and basically just call other modules. StreamingQueryDNADatabase.py is a bloated mess right now and is very difficult to test (this will affect #14). The bulk of what's in this script should live in a different module, and StreamingQuery... should only call it. This imagined module should contain unit tests.
Move the bulk of StreamingQueryDNADatabase.py to a separate module. This will also serve to assist with the installation of Metalign.
Add canonical kmers to the sketch? Or just standard kmers? Add canonical/std kmers to the tree? To the prefilter? For prefix searches?
Add tests, hopefully get on conda, etc.
Try creating a pre-filter consisting of all kmers (for all kmer sizes) of all the hashes and their prefixes, so I don't have to hit the trie so frequently.
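A sketch of that idea, with a plain Python set standing in for the pre-filter (a real implementation would use a Bloom filter to bound memory; names are hypothetical):

```python
def build_prefilter(sketch_kmers, k_sizes):
    # Set of every k-length prefix of every sketch k-mer, for all k in k_sizes.
    pre = set()
    for kmer in sketch_kmers:
        for k in k_sizes:
            pre.add(kmer[:k])
    return pre

def maybe_in_trie(query_kmer, prefilter):
    # Cheap membership check; only on a hit do we pay for the trie lookup.
    return query_kmer in prefilter

sketch = ["ACGTAC", "TTGGCC"]
pre = build_prefilter(sketch, k_sizes=[2, 4, 6])
print(maybe_in_trie("ACGT", pre), maybe_in_trie("GGGG", pre))  # True False
```

Because the pre-filter never produces false negatives, every real trie hit still gets through; the payoff is skipping the trie for the (typically vast) majority of query k-mers that match nothing.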
Haven't looked closely, but it appears that nowhere am I making sure everything is upper case. While the BF doesn't care (new option set), the TST definitely does care about case.
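A minimal illustration of why case matters, with a plain Python set standing in for the TST (which, like marisa-trie, does exact byte-string matching):

```python
def normalize(seq):
    # Upper-case once at ingest so downstream lookups are case-consistent.
    return seq.upper()

# A case-sensitive structure misses soft-masked (lowercase) sequence
# unless it is normalized first:
tst_keys = {"ACGT"}
print("acgt" in tst_keys)             # False: case mismatch
print(normalize("acgt") in tst_keys)  # True
```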
The goal of this is to create a fast way to re-create k-mer count distributions (similar to what publications like this, and references therein, are trying to do).
Current code base already keeps track of sketch k-mer counts (see here, and here).
Project would be:
Optional:
This would be sufficient for a conference paper.
For a journal paper would need to:
Hence if you run 12 threads, it will overcount by up to 12. For example, test with k-mer size 1 and 12 threads, and note how the containment index is 12 (instead of 1).
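The overcount can be reproduced deterministically without threads: if each worker reports a hit for its own chunk and the per-chunk flags are summed rather than deduplicated, the count inflates by up to the number of workers. A sketch (not the actual CMash code):

```python
def chunked_hits(seq, kmer, n_chunks):
    # Each "thread" scans its own chunk and reports whether the k-mer was seen;
    # summing the per-chunk flags (instead of OR-ing them) overcounts.
    step = max(1, len(seq) // n_chunks)
    chunks = [seq[i:i + step] for i in range(0, len(seq), step)]
    return sum(kmer in chunk for chunk in chunks)

seq = "A" * 24
print(chunked_hits(seq, "A", 12))          # 12: the 1-mer is "found" once per chunk
print(min(chunked_hits(seq, "A", 12), 1))  # 1: deduplicated count
```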
Cloned master branch today; install fails with this error (all on one line)
Download error on https://pypi.org/simple/pytest-runner/: [SSL:
TLSV1_ALERT_PROTOCOL_VERSION] tlsv1 alert protocol version (_ssl.c:590) --
Some packages may not be found!
Because of that, it fails to find a suitable version of pytest-runner:
distutils.errors.DistutilsError: Could not find suitable distribution
for Requirement.parse('pytest-runner<3dev,>=2.0')
I'm aware of what TLSV1_ALERT_PROTOCOL_VERSION is, and have updated my own versions of pip and virtualenv (today).
I'm parroting the installation steps pretty closely:
cd somepath/cmash
source somepath/cmash/cmash_venv/bin/activate
pip install --upgrade pip
Requirement already up-to-date: pip in ./cmash_venv/lib/python2.7/site-packages (18.0)
git clone https://github.com/dkoslicki/CMash.git
mv CMash cmash-master.aug_2_2018
cd somepath/cmash/cmash-master.aug_2_2018
pip install -r requirements.txt
The install is what fails. Full log below.
I'm a little baffled. I'm running the pip inside the VE (which ought to be the correct one). And if I search for pytest-runner it finds one, without getting any protocol error.
source somepath/cmash/cmash_venv/bin/activate
pip --version
pip 18.0 from somepath/cmash/cmash_venv/lib/python2.7/site-packages/pip (python 2.7)
pip search pytest-runner
(many other packages)
pytest-runner (4.2) - Invoke py.test as distutils command with dependency resolution
Thanks for any suggestions.
Looking in indexes: https://pypi.python.org/simple/
Obtaining file://somepath/cmash/cmash-master.aug_2_2018 (from -r requirements.txt (line 3))
Collecting khmer>=2.1.1 (from CMash==0.3.0->-r requirements.txt (line 3))
Using cached https://files.pythonhosted.org/packages/78/c9/caee0a310f249c86fd351706d2513e197e036231f4db6f3bda7a363ea1c4/khmer-2.1.1.tar.gz
Complete output from command python setup.py egg_info:
Download error on https://pypi.org/simple/pytest-runner/: [SSL: TLSV1_ALERT_PROTOCOL_VERSION] tlsv1 alert protocol version (_ssl.c:590) -- Some packages may not be found!
Couldn't find index page for 'pytest-runner' (maybe misspelled?)
Download error on https://pypi.org/simple/: [SSL: TLSV1_ALERT_PROTOCOL_VERSION] tlsv1 alert protocol version (_ssl.c:590) -- Some packages may not be found!
No local packages or working download links found for pytest-runner<3dev,>=2.0
Traceback (most recent call last):
File "", line 1, in
File "/private/var/folders/p3/6sl53f_s4yz5fm1n3mbbd57w0000gp/T/pip-install-aJCcSA/khmer/setup.py", line 304, in
setup(cmdclass=CMDCLASS, **SETUP_METADATA)
File "somepath/cmash/cmash_venv/lib/python2.7/site-packages/setuptools/__init__.py", line 130, in setup
_install_setup_requires(attrs)
File "somepath/cmash/cmash_venv/lib/python2.7/site-packages/setuptools/__init__.py", line 125, in _install_setup_requires
dist.fetch_build_eggs(dist.setup_requires)
File "somepath/cmash/cmash_venv/lib/python2.7/site-packages/setuptools/dist.py", line 514, in fetch_build_eggs
replace_conflicting=True,
File "somepath/cmash/cmash_venv/lib/python2.7/site-packages/pkg_resources/__init__.py", line 773, in resolve
replace_conflicting=replace_conflicting
File "somepath/cmash/cmash_venv/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1056, in best_match
return self.obtain(req, installer)
File "somepath/cmash/cmash_venv/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1068, in obtain
return installer(requirement)
File "somepath/cmash/cmash_venv/lib/python2.7/site-packages/setuptools/dist.py", line 581, in fetch_build_egg
return cmd.easy_install(req)
File "somepath/cmash/cmash_venv/lib/python2.7/site-packages/setuptools/command/easy_install.py", line 670, in easy_install
raise DistutilsError(msg)
distutils.errors.DistutilsError: Could not find suitable distribution for Requirement.parse('pytest-runner<3dev,>=2.0')
i.e. do something more intelligent than:
containment_indices[:, k_size_loc] = (hit_matrices[k_size_loc].sum(axis=1).ravel()/float(num_hashes))
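If the per-genome, per-k-size denominators are precomputed as a matrix of the same shape, the per-column loop collapses into one elementwise division. A hedged numpy sketch with made-up numbers (not the actual CMash arrays):

```python
import numpy as np

hits = np.array([[10., 5.],
                 [ 0., 2.]])        # genomes x k-sizes: raw hit counts
num_unique = np.array([[20., 10.],
                       [ 8.,  4.]]) # matching per-genome, per-k denominators

# One elementwise division replaces the per-k-size Python loop:
containment_indices = hits / num_unique
print(containment_indices)
```

The same broadcasting trick works when the denominator is a single row or column vector, so the constant-denominator case in the quoted line is covered too.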
See this issue with Metalign. In short,
So I figured out the probable source of the program hanging: python3 vs python2 for CMash:
After setting up the data, this hangs:
python3 -m venv VE3
source VE3/bin/activate
./setup_libraries.sh
python3 metalign.py test/RL_S001__insert_270_1M_subset.fq data/ --output test/RL_S001__insert_270_1M_subset_results.tsv
python3 StreamingQueryDNADatabase.py ../../data/r7aqo9zw/60mers_intersection_dump.fa ../../data/cmash_db_n1000_k60.h5 ../../test/CMash_out.csv 30-60-10 -c 0 -r 10000 -v -f ../../data/cmash_filter_n1000_k60_30-60-10.bf --sensitive
So instead, try python2, and it doesn't hang:
virtualenv VE2
source VE2/bin/activate
cd CMash
pip install -r requirements.txt
python StreamingQueryDNADatabase.py ../../data/r7aqo9zw/60mers_intersection_dump.fa ../../data/cmash_db_n1000_k60.h5 ../../test/CMash_out.csv 30-60-10 -c 0 -r 10000 -v -f ../../data/cmash_filter_n1000_k60_30-60-10.bf --sensitive
python3 metalign.py test/RL_S001__insert_270_1M_subset.fq data/ --output test/RL_S001__insert_270_1M_subset_results.tsv
python metalign.py test/RL_S001__insert_270_1M_subset.fq data/ --output test/RL_S001__insert_270_1M_subset_results.tsv
So possible solutions (with my assessment of ease of implementation):
Make setup_libraries.sh use python2 when installing CMash (probably via a virtualenv) (easy)
See why marisa-trie isn't working with python3 (their repo says it's python3 compatible) (medium)
Refactor CMash so it's python3 compliant (hard)
When running StreamingQueryDNADatabase.py, in reality we need only the k-mers in the sample that exist in the training database sketches. As such, it's possible to:
StreamingQueryDNADatabase.py
Steps 2-5 are basically what's done here; I noted this approach to Nathan LaPierre, but never got around to implementing it in CMash.
Current tests are end-to-end integration tests that make sure scripts execute successfully. There is much more testing that could be done, including a tests folder (lots can be copied from CMash/MinHash.py).

Will need to:
MakeStreamingDNADatabase.py
mt.Trie()
Work will happen in the branch ModularizeMake
E.g. if

python MakeStreamingDNADatabase.py ${trainingFiles} ${outputDir}/${cmashDatabase} -n ${numHashes} -k ${maxKsize} -v

and ${outputDir} doesn't exist, then MakeStreamingDNADatabase.py should create it!
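A one-liner covers this: os.makedirs with exist_ok=True creates the directory (and any missing parents) and is a no-op when it already exists, so it's safe to call unconditionally at script start. Sketch (hypothetical function name):

```python
import os

def ensure_output_dir(path):
    # Create the output directory (and parents) if missing; no-op if present.
    os.makedirs(path, exist_ok=True)

import tempfile
out_dir = os.path.join(tempfile.mkdtemp(), "cmash_out")
ensure_output_dir(out_dir)
print(os.path.isdir(out_dir))  # True
```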
in yaml and docs
The Bloom filter approach (presented in this paper) corresponds to:
MakeDNADatabase.py
QueryDNADatabase.py
MakeNodeGraph.py
The new (as of yet unpublished) implementation that uses ternary search trees and streaming consists of
MakeStreamingDNADatabase.py
MakeStreamingPrefilter.py
StreamingQueryDNADatabase.py
StreamingQueryDNADatabase_queue.py (deprecated)

Need to clarify this in the repo structure, as well as update the documentation.
In the ModularizeStreamingQuery branch, the second half of the Query.py method find_non_unique_kmers needs to be moved to create_post_containment_indicies. See the #FIXME.
This is more of an observation than an issue. I'm not sure whether this would be considered a bug or a feature.
The kmerization (at least in MinHash.py lines 758-763) is agnostic to the alphabet. That's a feature if you're trying to compare, e.g., sets of proteins. But if you're trying to compare DNA that has some Ns in it, each N can produce k k-mers that are unlikely to exist in the other set(s), reducing the similarity counts.
Might be worth letting the user know this happens, either in the readme, or maybe by giving a warning when a sequence with non-ACGT characters is encountered (assuming ACGT is the intended use).
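The effect is easy to see with a sliding-window k-merization like the one described (a simplified stand-in, not the actual MinHash.py code): a single N lands in up to k consecutive windows.

```python
def kmers(seq, k):
    # Alphabet-agnostic k-merization: Ns pass straight through into the k-mers.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

print(kmers("ACNGT", 3))  # ['ACN', 'CNG', 'NGT']: one N taints all three 3-mers
```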
I didn't test the following, but if a reference genome is softmasked, are the kmers containing lowercase going to match the same kmer containing uppercase?
In my case, admittedly atypical, I had several sets of kmers that I wanted to compare. So I converted them to fasta files that had a single fasta header (for each file), and with each line containing an N followed by a distinct kmer. Using MakeDNADatabase and QueryDNADatabase as shown in the README (except with --containment_threshold 0.0), what I got in the end was a csv file with nothing in it but the headers. This confused me, since my query file was one of the 10 files in my database -- I expected to at least see that fact reflected in the output.
The simple solution was to get rid of the Ns and give every kmer its own fasta header. That appears to have worked fine.
The node graph file was much bigger in the attempt that used Ns, consistent with the hypothesis that Ns are included in the kmers. But I still can't explain why it failed to report the perfect match.
For some really weird reason, the output queue is showing non-deterministic behavior: the sleep time affects the number of hits returned.
Basically, show an end-to-end way to:
Work happening on Retraining branch.
For example, using the Metalign default training database (199807 genomes) and running
python MakeStreamingDNADatabase.py ${trainingFiles} ${outputDir}/${cmashDatabase} -n ${numHashes} -k 60 -v
python MakeStreamingPrefilter.py ${outputDir}/${cmashDatabase} ${outputDir}/${prefilterName} 30-60-10
results in uncompressed:
16G Mar 22 03:39 cmash_db_n1000_k60.h5
9.3G Mar 22 08:07 cmash_db_n1000_k60_30-60-10.bf
6.9G Mar 22 04:34 cmash_db_n1000_k60.tst
yet
4.6G Mar 22 03:39 cmash_db_n1000_k60.h5.gz
3.6G Mar 22 08:07 cmash_db_n1000_k60_30-60-10.bf.gz
3.6G Mar 22 04:34 cmash_db_n1000_k60.tst.gz
so roughly 2-3.5x compression.
Would need to either modify MakeStreamingDNADatabase.py and MakeStreamingPrefilter.py to detect compressed training data and decompress it in the script, or (better yet)

Specifically:
# then normalize by the number of unique k-mers (to get the containment index)
# In essence, this is the containment index, restricted to unique k-mers. This effectively increases the specificity,
# but also increases the variance/confidence interval, since this decreases the size of the sketch.
for k_size_loc in range(len(k_range)):
k_size = k_range[k_size_loc]
for hash_loc in np.where(containment_indices[:, k_size_loc])[0]: # find the genomes with non-zero containment
unique_kmers = set()
for kmer in CEs[hash_loc]._kmers:
unique_kmers.add(kmer[:k_size]) # find the unique k-mers
#containment_indices[hash_loc, k_size_loc] /= float(len(unique_kmers)) # FIXME: this doesn't seem like the right way to normalize, but apparently it is!
containment_indices[hash_loc, k_size_loc] /= float(num_unique[hash_loc, k_size_loc]) # FIXME: in small tests, this seems to give better results. To be revisted.
which denominator in the last two lines is correct?
As the current version is quite old
Definitions:
"new method" = use a very large k-mer size, put in ternary search trie, use prefix matches to infer smaller k-mer size containment values
"old method" = train and re-run CMash on each individual k-mer size.
Tasks:
Run StreamingQueryDNADatabase.py with a multiple k-mer size training database and compare to running StreamingQueryDNADatabase.py many times with training databases trained at a specific k-mer size. This would be sufficient for a conference paper. More details can follow depending on interest.
For a journal publication, would need to:
Dear CMash team,
I am wondering whether CMash is a metric: is CMash(A, B) the same as CMash(B, A), and does CMash(A, B) + CMash(B, C) >= CMash(A, C) hold? Those are the last two axioms of a metric space: https://en.wikipedia.org/wiki/Metric_space. Since MinHash is an unbiased estimate of the Jaccard index (a metric), it is a metric. But I am not sure whether the CMash-estimated mutation rate is also a metric distance.
Many Thanks,
Jianshu
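On the symmetry question, note that the containment index itself is not symmetric, which a tiny sketch makes concrete (plain Python sets, not the CMash estimator):

```python
def containment(a, b):
    # Containment index |A ∩ B| / |A|; the denominator is |A| only.
    return len(a & b) / len(a)

A = {"ACG", "CGT"}                    # small set, fully inside B
B = {"ACG", "CGT", "GTA", "TAC"}
print(containment(A, B))  # 1.0
print(containment(B, A))  # 0.5
```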