Hello, I am using SNPGenie for a Genome-wide SNPs dataset of 70 indi

How to join the output for a whole genome analysis about snpgenie HOT 5 OPEN

pabloangulo7 commented on June 11, 2024

How to join the output for a whole genome analysis

from snpgenie.

Comments (5)

singing-scientist commented on June 11, 2024

Greetings, Pablo! Thanks very much for the question. That all sounds correct; the requirement to run twice (once for each strand) is a bit of a hack, unfortunately necessary because SNPGenie was originally written for one strand only. The genes of each strand can simply be combined as you say -- but be aware that, if there are any overlapping antisense genes (i.e., genes in reverse complement reading frames that overlap the same genome positions), then some genome positions/regions will be represented more than once. Also, some of the individual SITES results will be redundant. Thus, the answer to how the data should be combined depends in part on how you want to treat overlapping genes from separate strands (if they exist). Certainly for non-protein-coding sites, you'll only want to include them once, and results will be redundant between the strands.

For the whole (coding) genome, the question is whether you want the mean value for all sites (codon is the unit) or of all genes (gene is the unit). For the first, you could simply sum numerators and denominators, i.e., mean_dN = sum(N_diffs)/sum(N_sites) and mean_dS = sum(S_diffs)/sum(S_sites). This would reflect the "typical protein-coding site". Contrarily, if you're looking for an average of all genes, you could simply take the means of genes, e.g., mean_dN = mean(vector of all genes' dN). This would reflect the "typical gene". Let me know if that makes sense!

Yours,
Chase

from snpgenie.

pabloangulo7 commented on June 11, 2024

Thank you very much, sorry for the late reply but my data is very large and it took me a long time to finish the analysis. I already have the results for each of the chromosomes and for the minus and plus strand. Then following the instructions you gave me in the previous answer, I was about to calculate the dN/dS ratio for each chromosome and for the whole genome, but to make sure I do it correctly, to do the calculation, for example, for all the sites, I would have to use the file called "codon_results.txt" and use the columns of N_diffs, S_diffs, N_Sites and S_Sites, right? And to do the calculation by genes, would it be the same but with the file "product_results.txt"?

Thanks in advance

from snpgenie.

singing-scientist commented on June 11, 2024

Hello! Both codon_results and product_results should yield the same results, i.e., product_results are just the sum of codon_results, for each gene. The question is whether you want to add all the N_diffs, S_diffs, N_Sites and S_Sites together, then calculate dN/dS (mean of all coding sites); or first calculate dN/dS for each product, and take the mean of those values (mean of all genes). Let me know if that helps!

from snpgenie.

pabloangulo7 commented on June 11, 2024

Thank you very much, I think I understand it better now. On the other hand, I was reviewing the results and there is one thing that doesn't add up. In one of the "site_results.txt" files, one of the SNPs is missing, which for some reason the program has filtered out, but I don't quite know why. In the temporary file of the sample (generated by SNPGenie) "temp_snp_report_VY02.txt" this variant does appear. So, there would be a total of 37 SNPs although only 36 SNPs are analyzed in the results. I attach the files, and specifically it is the variant at site 1580099 (line 27 in the file). Let me know if you need more info.

Thanks in advance
temp_snp_report_VY02.txt
site_results.txt

from snpgenie.

singing-scientist commented on June 11, 2024

Greetings, Pablo! Indeed I see what you mean and confirm the SNP at 1580099 is present in temp but not the sites file. It is not really possible for me to troubleshoot this without the input; if you'd like to supply the 3 files and command that give rise to this result I'd be happy to check it out and determine what's happening.

Regarding dN or dS equal to 0, ideally one would perform a statistical test to determine if dN=dS can be rejected with significance. Normally this is done with bootstrapping, but when dS=0 there's no point. Thus as a rule of thumb, I generally consider that if dS (which reflects mutation rate) is 0, there's not enough variation to have statistical power to detect a difference between dN and dS. I'd report the ratio as undefined (certainly don't infer dN>dS or positive selection usually). If dN is 0 that's a different story (again because dS reflects mutation rate) and one can infer strong purifying selection given a proper statistical test. Let me know if that helps!

from snpgenie.

How to join the output for a whole genome analysis about snpgenie HOT 5 OPEN

Comments (5)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent