Hi SNPGenie team I am trying to get snpgen

Making a vcf file by hand: ## WARNING: ## The SNP Report table_1.csv has no header. about snpgenie HOT 3 CLOSED

padpadpadpad commented on May 20, 2024

Making a vcf file by hand: ## WARNING: ## The SNP Report table_1.csv has no header.

from snpgenie.

Comments (3)

singing-scientist commented on May 20, 2024

Greetings, Dan! Thanks for your interest in SNPGenie. It is difficult to pinpoint the problem without a full example that includes a reference FASTA sequence and a GTF file. However, upon examining the file you sent, two things I notice are (1) the file seems to be a hybrid CSV/VCF format in that the "sample" columns are surrounded by quotation marks; and (2) the file uses Windows (CRLF) line endings, not Unix (LF) line endings. I am attaching a version where quotations have been removed and line endings have been changed to Unix (LF). If either of these two issues is the culprit, this should work. Otherwise please provide a full reproducible example (all input and exact command line call) and we can take it from there!

Chase

snp_genie_prep.vcf.zip

from snpgenie.

padpadpadpad commented on May 20, 2024

Hi Chase

Sorry I feel I was a bit flustered yesterday and my initial issue comment was not very informative.

Thank you for your response. A few hours later I managed to parse a better looking VCF that did indeed work with SNPGenie. Did this in R from a data frame which I think may be a nice inclusion to help anyone else with the same problem. I think SNPGenie is a really nice tool. I ended up shortening the GTF file to just include the genes where my 12 variants were. I did then stumble into the problem of forward and reverse strands but that is for another day.

I do have some questions about the calculation of pi in poolseq data. I cannot for the life of me find a strict definition of how it is calculated at a per site basis.

So for a given sample I have the proportion of each variant. These were high confidence SNPs in pool-seq data that were confirmed via sequencing of individual clones. So to comply with the format, I then arbitrarily made everything have a depth of 1000 and then calculated the correct number of reads for each allele.

#CHROM	POS	REF	ALT	QUAL	FORMAT	SAMPLE-Compost_1
NC_012660	1816564	GT	G	1000	AD:DP	1000,0:1000
NC_012660	1823344	T	G	1000	AD:DP	900,100:1000
NC_012660	1824609	TA	T	1000	AD:DP	1000,0:1000
NC_012660	1829303	T	TC	1000	AD:DP	500,500:1000
NC_012660	2474703	C	T	1000	AD:DP	1000,0:1000
NC_012660	4496109	C	T	1000	AD:DP	1000,0:1000
NC_012660	4560835	G	A	1000	AD:DP	1000,0:1000
NC_012660	4568375	T	C	1000	AD:DP	1000,0:1000
NC_012660	4823073	C	T	1000	AD:DP	900,100:1000
NC_012660	4915904	TCTG	T	1000	AD:DP	1000,0:1000
NC_012660	6041420	T	C	1000	AD:DP	1000,0:1000
NC_012660	6249053	G	A	1000	AD:DP	1000,0:1000

For each site, how exactly is pi calculated? Would be interested in helping write up some documentation and a workflow to improve understanding of how these metrics are calculated for pool-seq data. I find the documentation in both Popoolation2 and PopGenome rather high level, whereas in reality I think everyone would benefit from working through a real world, simple example.

Sorry this has gone very off piece. I am happy to close this Issue and start a new one if needs be.

from snpgenie.

singing-scientist commented on May 20, 2024

Thanks for these details! I am glad to hear it's working. Regarding pi, it is simply calculated as the sum of the mean number of differences per site divided by the number of sites. For example, suppose there are 4 sequences of length 10 bp. Site 1 is polymorphic, with seq1 having C and the other 3 seqs having T, and the remaining 9 sites are not polymorphic. Because there are 4 sequences, there are 4choose2, or (4^2-4)/2=6, possible comparisons between them — that is, seq1 vs. seq2, seq1 vs. seq3, seq1 vs. seq4, seq2 vs. seq3, seq2 vs. seq4, and seq3 vs. seq4. Because seq1 differs from the others, comparisons involving seq1 (bold) are mismatches, while the other comparisons are matches. Thus, of the 6 possible comparisons, 3 represent differences. As a result, the mean number of pairwise differences at this site is 3/6=0.5. If you were only interested in this site, the value of pi would be 0.5/1=0.5 for this site. If you were interested in the whole sequence of 10 bp, the value of pi would be (0.5+0+0+0+0+0+0+0+0+0+0)/10 = 0.5/10 = 0.05.

When it comes to coding regions, this is done on a codon-by-codon basis, where nonsynonymous and synonymous differences and sites are partitioned into piN and piS, respectively. We give a full explanation in Nelson et al. 2015. Please don't hesitate to write again if there are any questions not addressed therein.

from snpgenie.

Making a vcf file by hand: ## WARNING: ## The SNP Report table_1.csv has no header. about snpgenie HOT 3 CLOSED

Comments (3)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent