Comments (13)
There's nothing wrong with loading the file, it is read entirely, but the linear PRG file is broken. If you print out the linear PRG starting with character 4294967000, you get ....132794503G132794504C132794503AGTTCTGGACTGCAAGCTCCTGGGGCATGGACCTCATTCCTG1CCCCCCCCCC....... That 1 breaks things because the parser assumes any number found is bigger than the previous. Also it looks like after this 1 it starts the numbering from the beginning. If you look further down in the linear PRG file you find sites marked with 3,5,7 etc. Any idea how your script could have triggered this?
from gramtools.
er.. wow that is cool
from gramtools.
This is happening at a point where the reference genome contains a lot of Ns, which are converted to all those C's after that errant 1. Can you figure out where this is (position in the genome/VCF)?
from gramtools.
The restarting of numbering is very odd. There's only one line where $nextvar is assigned to anything, and it is set to 5, at the start. After that it just gets incremented.
from gramtools.
@sm0179 said it went wrong here
...132794503G132794504C132794503AGTTCTGGACTGCAAGCTCCTGGGGCATGGACCTCATTCCTG1CCCCCCCCCC
So, which-th variant is that....
132794503 -4 = 132794497
2n-1 = 132794497 => n=66397250
Unfortunately, I can't just pull out the 66397250-th variant, because it really means the 66397250-th cluster of variants
If variants are v closely positioned, the script clusters them, and prints all possible haplotypes as alternale alleles.
from gramtools.
I'm tryign to reproduce this on current code
from gramtools.
pwd
/data2/users/zam/analyses/2017/0623_reproduce_gramtools_bug_19
-bash-4.1$
Run this, from master
perl ~/dev/git/gramtools/utils/vcf_to_linear_prg.pl --vcf /Net/banyan/data0/users/zam/results/20150429_build_1000g_for_gramtools/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5a.20130502.sites.vcf --ref /data3/cortex/testsets/na12878/ref/Homo_sapiens.GRCh37.60.dna.WHOLE_GENOME.fa --outfile human_prg_all
Finished printing linear PRG. Final number in alphabet is 158548838
head -c 4294967320 human_prg_all | tail -c 300
97ACGATGG132794498A132794497CAGTGAAGGAGGGGGAGGGGAATGGGGTAAAGGAGT132794499G132794500A132794499CTGAAGTGCAGAGATCTGTGGCCCCAGCTCACCCCA132794501A132794502T132794501TGTTGAGCAGAGTGAGCCTGGACCAAGACGCCTCTTTGCCAGCATG132794503G132794504C132794503AGTTCTGGACTGCAAGCTCCTGGGGCATGGACCTCATTCCTG132794505T132794506C13279
I can't see all these C's
from gramtools.
Now I'm just going to run Robyn's --build and generate a sim haplotype and then do --quasimap
from gramtools.
source /home/sm0179/bwt_search/wt-env/bin/activate
(wt-env)bash-4.1$ python ~/dev/git/gramtools/gramtools.py --build --reference /data3/cortex/testsets/na12878/ref/Homo_sapiens.GRCh37.60.dna.WHOLE_GENOME.fa --vcf /Net/banyan/data
0/users/zam/results/20150429_build_1000g_for_gramtools/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5a.20130502.sites.vcf --ksize 13
2017-06-24 00:22:14,888 gramtools INFO Start process: build
2017-06-24 01:08:46,845 gramtools INFO stdout:
Finished printing linear PRG. Final number in alphabet is 158548838
2017-06-24 01:08:47,412 gramtools INFO End process: build
Note that this time it did print out kmers (unlike the bug I reported yesterday), I assume because I activated sorina's venv
Make a simulated haplotype and 10000 reads
python ~/dev/git/gramtools/utils/variantKmers.py -r 10000 -k 13 -f ALL/cache/perl_generated_fa > deleteme
This is still running, currently using 23Gb of RAM and rising!
from gramtools.
And while it's running just to confirm:
head -c 4294967320 ALL/prg | tail -c 300
97ACGATGG132794498A132794497CAGTGAAGGAGGGGGAGGGGAATGGGGTAAAGGAGT132794499G132794500A132794499CTGAAGTGCAGAGATCTGTGGCCCCAGCTCACCCCA132794501A132794502T132794501TGTTGAGCAGAGTGAGCCTGGACCAAGACGCCTCTTTGCCAGCATG132794503G132794504C132794503AGTTCTGGACTGCAAGCTCCTGGGGCATGGACCTCATTCCTG132794505T132794506C13279
again cnanot reproduce the 1CCCCC just after 132794503
from gramtools.
2 hours later, is using 100Gb of RAM - this is the haplotype and read generating script!
from gramtools.
Generating the reads worked, and I removed those reads which were all C's. I then started quasimapping, and this failed
2017-06-24 18:23:14,664 gramtools INFO Start process: quasimap
2017-06-24 18:23:14,700 gramtools INFO stdout:
2017-06-25 08:25:07,591 gramtools ERROR Error code != 0
2017-06-25 08:25:07,604 gramtools INFO Output run directory:
/data2/users/zam/analyses/2017/0623_reproduce_gramtools_bug_19/ALL_output/1498324994_ALL_ksize13
2017-06-25 08:25:07,604 gramtools INFO End process: quasimap
Constructing FM-index
Parsing sites and allele masks
Pre-calculating kmers
Precalculated kmers not found, calculating them using 25 threads
from gramtools.
The code has changed significantly since this issue.
from gramtools.
Related Issues (20)
- Per base coverage recording error
- Pf benchmarking HOT 1
- Handle no variants after clustering HOT 8
- Release 1.6.0 HOT 5
- Release 1.7 HOT 1
- 'N' in reference genome fasta not supported
- Error Installing gramtools HOT 8
- Boost download link is dead, breaks gramtools compilation process HOT 4
- Retire variant-aware kmer indexing code
- Unable to run the geamtools discover command successfully HOT 4
- Install error HOT 3
- Stop installing py-cortex-api by default
- Ubuntu 18.04 install woes HOT 3
- Migrate to pyproject.toml
- Error while running gramtools HOT 2
- gramtools build error HOT 1
- install error HOT 1
- gramtools gramtools backend expected at: /home/tools/anaconda3/lib/python3.7/site-packages/gramtools/bin/gram but not found HOT 2
- make[1]: *** [CMakeFiles/Makefile2:260: libgramtools/CMakeFiles/gram.dir/rule] Error 2 make: *** [Makefile:186: gram] Error 2 HOT 6
- How to use Gramtools build based on multiple VCF files? HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from gramtools.