ariesk's Issues
Build database from FASTA
Add Testing Rig
- make a distance table
- build a database
- search a database
Implement Remote Homology Search
A modification of Contig Search. Key differences:
- larger centroid radius
- fewer, larger centroids per contig
Make DB Build Faster
Building the database is currently quite slow, likely because of the Python interface to SQLite.
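One common source of overhead when writing to SQLite from Python is issuing one insert (or one commit) per row. Whether that is the actual bottleneck here has not been confirmed. The sketch below only illustrates batching rows with `executemany` inside a single transaction per batch. The table name `kmers`, its columns, and the function name `add_kmers_batched` are hypothetical and are not taken from the ariesk schema.

```python
import sqlite3


def add_kmers_batched(conn, rows, batch_size=10000):
    """Insert (centroid_id, seq) rows in batches, one transaction per batch.

    The `kmers` table and its columns are hypothetical, for illustration only.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS kmers (centroid_id INTEGER, seq TEXT)"
    )
    batch = []
    n_added = 0
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            # `with conn` commits once for the whole batch.
            with conn:
                conn.executemany("INSERT INTO kmers VALUES (?, ?)", batch)
            n_added += len(batch)
            batch = []
    if batch:
        # Flush the final partial batch.
        with conn:
            conn.executemany("INSERT INTO kmers VALUES (?, ?)", batch)
        n_added += len(batch)
    return n_added
```

Profiling the build would be needed before assuming this is where the time goes.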
Preliminary Benchmarking
Four stages I want to test:
- Communication
- Coarse Search
- Retrieval
- Alignment
All testing on Hippo, database with 10^7 kmers and r=0.2 in /dev/shm
Convert to Unscaled Ram Distances
Unscaled Ramanujan distances seem preferable to scaled ones. A useful property is that they are good at discriminating very distant sequences, while sub-kmer distance is only useful for discriminating more similar sequences. In combination, we should be able to search < 10% of a dataset and still get near-perfect recall.
Benchmark Recall
Profile Search
Figure out what is making the filtering step of search slow
Implement Smith-Waterman in Contig Search
Implement SW or similar for fine stage of contig search
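For reference, a minimal pure-Python sketch of Smith-Waterman local alignment scoring with a linear gap penalty. The function name and the scoring values (`match=2`, `mismatch=-1`, `gap=-1`) are assumptions chosen for illustration, not parameters from ariesk, and a real fine stage would likely live in Cython rather than plain Python.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                  # start a new local alignment
                prev[j - 1] + sub,  # match or mismatch
                prev[j] + gap,      # gap in b
                curr[j - 1] + gap,  # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best
```

This keeps only two rows of the matrix, so it returns the score but not the alignment itself; recovering the alignment would require storing the full matrix and tracing back.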
Add Bloom Filters for Cluster Fast Reject
Quickly reject candidate clusters using a bloom filter of sub-kmers.
Approach: given stored k-mers of size k and sub-mers of size m, if two k-mers have n_e mismatches/indels they will have at least k + 1 - m(n_e + 1) matching sub-mers.
Exploit this fact to build a bloom filter that checks if a hit is possible in a cluster. Can also use a Bloom Grid with reads hashed to 1 of N bloom filters
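The bound above follows because a k-mer has k - m + 1 sub-mers and each edit can destroy at most m of them. A minimal sketch of the fast-reject idea, assuming one Bloom filter per cluster that holds every sub-mer of every k-mer in that cluster. The class and function names, the filter size, and the hashing scheme are all illustrative assumptions, not the ariesk implementation.

```python
import hashlib


def min_shared_submers(k, m, n_e):
    """Lower bound on sub-mers shared by two k-mers that differ by n_e edits."""
    return max(0, k + 1 - m * (n_e + 1))


def submers(kmer, m):
    """All overlapping sub-mers of length m in kmer."""
    return [kmer[i:i + m] for i in range(len(kmer) - m + 1)]


class BloomFilter:
    """Minimal Bloom filter backed by a Python int used as a bit array."""

    def __init__(self, n_bits=1 << 16, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive n_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def build_cluster_filter(cluster_kmers, m):
    """Bloom filter over every sub-mer of every k-mer in one cluster."""
    bloom = BloomFilter()
    for kmer in cluster_kmers:
        for sub in submers(kmer, m):
            bloom.add(sub)
    return bloom


def may_contain_hit(bloom, query, m, n_e):
    """False means no k-mer in the cluster can be within n_e edits of query."""
    needed = min_shared_submers(len(query), m, n_e)
    found = sum(1 for sub in submers(query, m) if sub in bloom)
    return found >= needed
```

Because the filter pools sub-mers from all k-mers in the cluster and Bloom filters can give false positives, a `True` result only means the cluster cannot be ruled out. A `False` result is a safe rejection.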
Benchmark Caching
Using a (too) simple cache, benchmark performance with the db in /dev/shm and with the db on disk.
All testing on Hippo, database with 10^7 kmers and r=0.2
Test Command
ariesk search-seq -p 5431 --search-mode full --inner-metric none -r 0 -i 0.1 <kmer> | wc -l
Queries
CCCCCCCCCCCCCCGGGGGGGGGGGGGGGGG
ACCGCAGTATTATGATGTTGAAAACATGGAT
AGGGCCAGTTCGAAGCGATGTACTCAAAACT
CAGATGTGCCGACGATTTTGCGCCCCGGAGG
AATAATCCAATGCACGCTCTACTTCTACTAT
Run each query twice
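The procedure above could be scripted roughly as follows. The command-line flags are copied from the test command; the helper names (`build_command`, `run_search`, `benchmark`) are hypothetical, and counting lines of captured stdout is used as an approximate stand-in for `| wc -l`. The `runner` argument exists only so the timing loop can be exercised without a running server.

```python
import subprocess
import time

QUERIES = [
    "CCCCCCCCCCCCCCGGGGGGGGGGGGGGGGG",
    "ACCGCAGTATTATGATGTTGAAAACATGGAT",
    "AGGGCCAGTTCGAAGCGATGTACTCAAAACT",
    "CAGATGTGCCGACGATTTTGCGCCCCGGAGG",
    "AATAATCCAATGCACGCTCTACTTCTACTAT",
]


def build_command(kmer, port=5431):
    """Argument list for the test command, without the `| wc -l` pipe."""
    return [
        "ariesk", "search-seq", "-p", str(port),
        "--search-mode", "full", "--inner-metric", "none",
        "-r", "0", "-i", "0.1", kmer,
    ]


def run_search(kmer):
    """Run one search and return the number of output lines."""
    out = subprocess.run(
        build_command(kmer), capture_output=True, text=True, check=True
    )
    return len(out.stdout.splitlines())


def benchmark(queries, runner=run_search, repeats=2):
    """Time each query `repeats` times; returns (kmer, attempt, hits, seconds)."""
    rows = []
    for kmer in queries:
        for attempt in range(1, repeats + 1):
            start = time.perf_counter()
            n_hits = runner(kmer)
            elapsed = time.perf_counter() - start
            rows.append((kmer, attempt, n_hits, elapsed))
    return rows
```

Calling `benchmark(QUERIES)` assumes a search server is already listening on the port used in the test command. Comparing attempt 1 against attempt 2 for the same query gives a rough view of the cache effect.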
Build from pre-db
Build a real db from a pre-db
Find mystery build bug
chunks/chunk_dz
[#######-----------------------------] 20% 00:18:32
Traceback (most recent call last):
File "/home/dcd3001/miniconda3/envs/ariesk/bin/ariesk", line 11, in <module>
load_entry_point('ariesk', 'console_scripts', 'ariesk')()
File "/home/dcd3001/miniconda3/envs/ariesk/lib/python3.7/site-packages/click/core.py", line 764, in __call__
return self.main(*args, **kwargs)
File "/home/dcd3001/miniconda3/envs/ariesk/lib/python3.7/site-packages/click/core.py", line 717, in main
rv = self.invoke(ctx)
File "/home/dcd3001/miniconda3/envs/ariesk/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/dcd3001/miniconda3/envs/ariesk/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/dcd3001/miniconda3/envs/ariesk/lib/python3.7/site-packages/click/core.py", line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/dcd3001/miniconda3/envs/ariesk/lib/python3.7/site-packages/click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "/pbtech_mounts/homes039/dcd3001/Dev/AriesK/ariesk/cli/cli_build.py", line 111, in build_grid_cover_fasta
n_added = predb.fast_add_kmers_from_fasta(fasta_filename)
File "ariesk/pre_db.pyx", line 148, in ariesk.pre_db.PreDB.fast_add_kmers_from_fasta
File "ariesk/pre_db.pyx", line 49, in ariesk.pre_db.PreDB.c_add_kmer
File "ariesk/ram.pyx", line 58, in ariesk.ram.RotatingRamifier.c_ramify
File "ariesk/ram.pyx", line 36, in ariesk.ram.Ramifier.c_ramify
IndexError: Out of bounds on buffer access (axis 1)
This happened in 18/106 chunks.
Contig Pre-DB
Add a pre-db for contigs
Cythonize KDTree
Use a KD Tree directly without going through python