Code Monkey home page Code Monkey logo

Comments (3)

singing-scientist avatar singing-scientist commented on May 20, 2024

Greetings, Dan! Thanks for your interest in SNPGenie. It is difficult to pinpoint the problem without a full example that includes a reference FASTA sequence and a GTF file. However, upon examining the file you sent, two things I notice are (1) the file seems to be a hybrid CSV/VCF format in that the "sample" columns are surrounded by quotation marks; and (2) the file uses Windows (CRLF) line endings, not Unix (LF) line endings. I am attaching a version where quotations have been removed and line endings have been changed to Unix (LF). If either of these two issues is the culprit, this should work. Otherwise please provide a full reproducible example (all input and exact command line call) and we can take it from there!

Chase

snp_genie_prep.vcf.zip

from snpgenie.

padpadpadpad avatar padpadpadpad commented on May 20, 2024

Hi Chase

Sorry I feel I was a bit flustered yesterday and my initial issue comment was not very informative.

Thank you for your response. A few hours later I managed to parse a better looking VCF that did indeed work with SNPGenie. Did this in R from a data frame which I think may be a nice inclusion to help anyone else with the same problem. I think SNPGenie is a really nice tool. I ended up shortening the GTF file to just include the genes where my 12 variants were. I did then stumble into the problem of forward and reverse strands but that is for another day.

I do have some questions about the calculation of pi in poolseq data. I cannot for the life of me find a strict definition of how it is calculated at a per site basis.

So for a given sample I have the proportion of each variant. These were high confidence SNPs in pool-seq data that were confirmed via sequencing of individual clones. So to comply with the format, I then arbitrarily made everything have a depth of 1000 and then calculated the correct number of reads for each allele.

#CHROM POS REF ALT QUAL FORMAT SAMPLE-Compost_1
NC_012660 1816564 GT G 1000 AD:DP 1000,0:1000
NC_012660 1823344 T G 1000 AD:DP 900,100:1000
NC_012660 1824609 TA T 1000 AD:DP 1000,0:1000
NC_012660 1829303 T TC 1000 AD:DP 500,500:1000
NC_012660 2474703 C T 1000 AD:DP 1000,0:1000
NC_012660 4496109 C T 1000 AD:DP 1000,0:1000
NC_012660 4560835 G A 1000 AD:DP 1000,0:1000
NC_012660 4568375 T C 1000 AD:DP 1000,0:1000
NC_012660 4823073 C T 1000 AD:DP 900,100:1000
NC_012660 4915904 TCTG T 1000 AD:DP 1000,0:1000
NC_012660 6041420 T C 1000 AD:DP 1000,0:1000
NC_012660 6249053 G A 1000 AD:DP 1000,0:1000

For each site, how exactly is pi calculated? Would be interested in helping write up some documentation and a workflow to improve understanding of how these metrics are calculated for pool-seq data. I find the documentation in both Popoolation2 and PopGenome rather high level, whereas in reality I think everyone would benefit from working through a real world, simple example.

Sorry this has gone very off piece. I am happy to close this Issue and start a new one if needs be.

from snpgenie.

singing-scientist avatar singing-scientist commented on May 20, 2024

Thanks for these details! I am glad to hear it's working. Regarding pi, it is simply calculated as the sum of the mean number of differences per site divided by the number of sites. For example, suppose there are 4 sequences of length 10 bp. Site 1 is polymorphic, with seq1 having C and the other 3 seqs having T, and the remaining 9 sites are not polymorphic. Because there are 4 sequences, there are 4choose2, or (4^2-4)/2=6, possible comparisons between them — that is, seq1 vs. seq2, seq1 vs. seq3, seq1 vs. seq4, seq2 vs. seq3, seq2 vs. seq4, and seq3 vs. seq4. Because seq1 differs from the others, comparisons involving seq1 (bold) are mismatches, while the other comparisons are matches. Thus, of the 6 possible comparisons, 3 represent differences. As a result, the mean number of pairwise differences at this site is 3/6=0.5. If you were only interested in this site, the value of pi would be 0.5/1=0.5 for this site. If you were interested in the whole sequence of 10 bp, the value of pi would be (0.5+0+0+0+0+0+0+0+0+0+0)/10 = 0.5/10 = 0.05.

When it comes to coding regions, this is done on a codon-by-codon basis, where nonsynonymous and synonymous differences and sites are partitioned into piN and piS, respectively. We give a full explanation in Nelson et al. 2015. Please don't hesitate to write again if there are any questions not addressed therein.

from snpgenie.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.