Code Monkey home page Code Monkey logo

Comments (13)

sm0179 avatar sm0179 commented on June 16, 2024

There's nothing wrong with loading the file, it is read entirely, but the linear PRG file is broken. If you print out the linear PRG starting with character 4294967000, you get ....132794503G132794504C132794503AGTTCTGGACTGCAAGCTCCTGGGGCATGGACCTCATTCCTG1CCCCCCCCCC....... That 1 breaks things because the parser assumes any number found is bigger than the previous. Also it looks like after this 1 it starts the numbering from the beginning. If you look further down in the linear PRG file you find sites marked with 3,5,7 etc. Any idea how your script could have triggered this?

from gramtools.

iqbal-lab avatar iqbal-lab commented on June 16, 2024

er.. wow that is cool

from gramtools.

iqbal-lab avatar iqbal-lab commented on June 16, 2024

This is happening at a point where the reference genome contains a lot of Ns, which are converted to all those C's after that errant 1. Can you figure out where this is (position in the genome/VCF)?

from gramtools.

iqbal-lab avatar iqbal-lab commented on June 16, 2024

The restarting of numbering is very odd. There's only one line where $nextvar is assigned to anything, and it is set to 5, at the start. After that it just gets incremented.

from gramtools.

iqbal-lab avatar iqbal-lab commented on June 16, 2024

@sm0179 said it went wrong here
...132794503G132794504C132794503AGTTCTGGACTGCAAGCTCCTGGGGCATGGACCTCATTCCTG1CCCCCCCCCC

So, which-th variant is that....

132794503 -4 = 132794497
2n-1 = 132794497 => n=66397250
Unfortunately, I can't just pull out the 66397250-th variant, because it really means the 66397250-th cluster of variants

If variants are v closely positioned, the script clusters them, and prints all possible haplotypes as alternale alleles.

from gramtools.

iqbal-lab avatar iqbal-lab commented on June 16, 2024

I'm tryign to reproduce this on current code

from gramtools.

iqbal-lab avatar iqbal-lab commented on June 16, 2024

pwd
/data2/users/zam/analyses/2017/0623_reproduce_gramtools_bug_19
-bash-4.1$

Run this, from master

perl ~/dev/git/gramtools/utils/vcf_to_linear_prg.pl --vcf /Net/banyan/data0/users/zam/results/20150429_build_1000g_for_gramtools/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5a.20130502.sites.vcf --ref /data3/cortex/testsets/na12878/ref/Homo_sapiens.GRCh37.60.dna.WHOLE_GENOME.fa --outfile human_prg_all

Finished printing linear PRG. Final number in alphabet is 158548838

head -c 4294967320 human_prg_all | tail -c 300
97ACGATGG132794498A132794497CAGTGAAGGAGGGGGAGGGGAATGGGGTAAAGGAGT132794499G132794500A132794499CTGAAGTGCAGAGATCTGTGGCCCCAGCTCACCCCA132794501A132794502T132794501TGTTGAGCAGAGTGAGCCTGGACCAAGACGCCTCTTTGCCAGCATG132794503G132794504C132794503AGTTCTGGACTGCAAGCTCCTGGGGCATGGACCTCATTCCTG132794505T132794506C13279

I can't see all these C's

from gramtools.

iqbal-lab avatar iqbal-lab commented on June 16, 2024

Now I'm just going to run Robyn's --build and generate a sim haplotype and then do --quasimap

from gramtools.

iqbal-lab avatar iqbal-lab commented on June 16, 2024

source /home/sm0179/bwt_search/wt-env/bin/activate

(wt-env)bash-4.1$ python ~/dev/git/gramtools/gramtools.py --build --reference /data3/cortex/testsets/na12878/ref/Homo_sapiens.GRCh37.60.dna.WHOLE_GENOME.fa --vcf /Net/banyan/data
0/users/zam/results/20150429_build_1000g_for_gramtools/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5a.20130502.sites.vcf --ksize 13
2017-06-24 00:22:14,888 gramtools INFO Start process: build
2017-06-24 01:08:46,845 gramtools INFO stdout:

Finished printing linear PRG. Final number in alphabet is 158548838

2017-06-24 01:08:47,412 gramtools INFO End process: build

Note that this time it did print out kmers (unlike the bug I reported yesterday), I assume because I activated sorina's venv
Make a simulated haplotype and 10000 reads
python ~/dev/git/gramtools/utils/variantKmers.py -r 10000 -k 13 -f ALL/cache/perl_generated_fa > deleteme

This is still running, currently using 23Gb of RAM and rising!

from gramtools.

iqbal-lab avatar iqbal-lab commented on June 16, 2024

And while it's running just to confirm:
head -c 4294967320 ALL/prg | tail -c 300
97ACGATGG132794498A132794497CAGTGAAGGAGGGGGAGGGGAATGGGGTAAAGGAGT132794499G132794500A132794499CTGAAGTGCAGAGATCTGTGGCCCCAGCTCACCCCA132794501A132794502T132794501TGTTGAGCAGAGTGAGCCTGGACCAAGACGCCTCTTTGCCAGCATG132794503G132794504C132794503AGTTCTGGACTGCAAGCTCCTGGGGCATGGACCTCATTCCTG132794505T132794506C13279
again cnanot reproduce the 1CCCCC just after 132794503

from gramtools.

iqbal-lab avatar iqbal-lab commented on June 16, 2024

2 hours later, is using 100Gb of RAM - this is the haplotype and read generating script!

from gramtools.

iqbal-lab avatar iqbal-lab commented on June 16, 2024

Generating the reads worked, and I removed those reads which were all C's. I then started quasimapping, and this failed

2017-06-24 18:23:14,664 gramtools INFO Start process: quasimap
2017-06-24 18:23:14,700 gramtools INFO stdout:

2017-06-25 08:25:07,591 gramtools ERROR Error code != 0
2017-06-25 08:25:07,604 gramtools INFO Output run directory:
/data2/users/zam/analyses/2017/0623_reproduce_gramtools_bug_19/ALL_output/1498324994_ALL_ksize13
2017-06-25 08:25:07,604 gramtools INFO End process: quasimap
Constructing FM-index
Parsing sites and allele masks
Pre-calculating kmers
Precalculated kmers not found, calculating them using 25 threads

from gramtools.

ffranr avatar ffranr commented on June 16, 2024

The code has changed significantly since this issue.

from gramtools.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.