chasewnelson / snpgenie Goto Github PK

Program for estimating πN/πS, dN/dS, and other diversity measures from next-generation sequencing data

License: GNU General Public License v3.0

Perl 96.42% R 3.58%

snp-report fasta-sequences next-generation-sequencing dnds-estimation dnds nucleotide diversity-measurement nucleotide-diversity sequence-analysis population-genetics

snpgenie's People

Contributors

Stargazers

Watchers

snpgenie's Issues

vcf2revcom.pl - No modifications to the fasta and gtf file

Hello,
I'm trying to run SNPgenie-vcf2revcom.pl on a vcf file, the corresponding reference fasta and gtf file as per the instructions on your github page. However, just running the script with the three input files throws out an error as below suggesting the input should be just the SNP report (vcf in this case) and the corresponding total number of sites in the fasta file (See Below). If I am understanding correctly you are just modifying the SNP report and not the gff or the fasta for the negative strand? Did you modify your SNPgenie script to account for the negative strand or am I missing something here. I would really appreciate it if you could provide any clarification to this.

mac0167359:SNPGenie$ perl vcf2revcom.pl ref.fa ref.gtf All2.vcf

WARNING: The SNPGenie script vcfformat1_to_revcom needs exactly 2 arguments, in this order:

## (1) A '+' strand SNP report in VCF format 1 (SNPGenie descriptions).

(2) The exact length of the sequence in the FASTA file.

For example: vcfformat1_to_revcom.pl my_snp_report.vcf 248956422

mac0167359:SNPGenie $ perl vcf2revcom.pl All2.vcf 2872915
Provided seq length is 2872915

Converting All2.vcf to reverse complement format...

REVERSE COMPLEMENT VCF FILE has been written to:

VCF: All2_revcom.vcf

SNPGenie with CRISP output

Dear Chase,

I am working on pooled data that and I hope to use SNPGenie to generate the population parameters. I have a question for you regarding how to enter the VCF file I have.

The variant calling was done using CRISP:
https://github.com/vibansal/crisp
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2881398/pdf/btq214.pdf

After converted to VCF, the output from CRISP look like this:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE1 SAMPLE2 ...
2 321989 SNPID G A 9420 PASS AC=37;AF=0.389474;AN=95;CT=-31.4;DP=545,492,10;EMstats=942.07:-398.73;FLANKSEQ=cgagggagga:G:caaggcaagg;HWEstats=12.4;MQS=31,18,119,1005;NP=38;NS=19;VF=EMpass;VP=19;VT=SNV GT:GQ:DP:ADf:ADr:ADb 0/0/0/1/1:7:53:15,8:7,14:1,0 0/0/0/0/1:4:57:20,3:15,11:0,0

Can SNPGenie use this output? It is not clear to me what VCF input format option to use (--vcfformat=2 or 4), since CRISP output (ADf:ADr:ADb) does not have a regular AD format.

My goal is to generate diversity statistics by sample. When I tried to use --vcfformat=2 option, the SNPGenie considered the whole population. However, with --vcfformat=4 option, I am not sure if the AD format is appropriately recognized.

Your help is much appreciated.

Best regards,
Barbara.

snpgenie\_within\_group.pl: Question about input file formats

I have several thousand fasta files, each fasta file represents a single gene containing all sequenced individuals from a single population. Each fasta file is a nucleotide alignment, which I have attempted to framecode align using TranslatorX. I would like to calculate piN/piS for each gene using snpgenie_within_group.pl

I do not have a gtf file. Can I run snpgenie_within_group.pl without the gtf file? If not, can you offer any guidance on how would I format this type of sequence data and generate the correctly formatted gtf file?

Thanks

mid-sequence STOP codon in the sequence

Hi Chase,

Thanks for the useful tool,

I see this warning,

WARNING: Please be aware that there is a mid-sequence STOP codon in the sequence GU280_gp01|13483.

Please check your annotations for: (1) incorrect frame; or (2) incorrect starting or ending coordinates.

A premature STOP codon may also indicate a pseudogene, for which piN vs. piS analysis may not be appropriate.

Please advise how to deal with,

Tag a release?

I would like to add SNPGenie to Bioconda for easy installation but to do that I would need a tagged release.

mulifasta genome questions

Hi Chase,

I am working with a genome with multiple scaffolds. I followed the instruction to split the fasta file into individual single sequence fasta file now. I am curious what the next step is. Should I also split the gtf file and vcf file into separate files or they can be left unchanged?

I tried to grep the information from the gtf and vcf file for a specific scaffold. SNPGenie seems to be running OK and then killed displaying "Performing final calculations, noncoding overlap analysis, and output... Killed" for unknown reason. I have the test files attached here: https://drive.google.com/open?id=1HphfXSw7QPJRuXTGf1nO98Cjd_agP6ok

WARNING: AD must list numbers

Hi Chase,
thank you for this useful tool!
I am having some issues with a set of data. I got a warning, and SNPGenie terminates. I am running SNPGenie with --vcfformat=4; I got the vcf by merging individual vcfs with bcftools.
The warning is

WARNING: the line is:

NC_005967.2 4009 . T A,G 687.24 . BaseQRankSum=-0.486;Dels=0;FS=0;HRun=1;HaplotypeScore=1.6229;MQ=13.77;MQ0=7;MQRankSum=4.129;QD=7.29;ReadPosRankSum=-1.301;DP=1156;AF=1,0.5;AN=64;AC=38,21 GT:AD:DP:GQ:PL 1/1:36,25:61:9.09:477,9,0,.,.,.

For vcfformat==4, AD must list numbers of reads for two SNPs as follows: ,,. SNPGenie TERMINATED.

AD is actually a list of numbers (in the warning= 36,25), so I don't get the message... Could you please help me find out what's the problem?

I have another question: in the SNPGenie_parameters.txt file that is generated over the run, it is reported "COMPLEMENT MODE", which is "Yes" when running the software on "fw" data and is "No" when running it on revcom data. However, to my understanding (and according to the results of the few run), SNPGenie does not reverse-complement the data, so the results are only relative to one strand. Am I correct?

thank you very much!!
best wishes
Irene

Watterson's theta and Tajima's D estimates

Hi Chase, thanks for the brilliant software. We are trying to get the Watterson's theta and Tajima's D estimates on the synonymous sites for our pooled sequence data. I wonder whether the software could provide these information? Thanks in advance!

NONREF_NONPOLY and population_summary.txt empty

Hello

I run your tool and I find it really useful nevertheless there are some outcomes that I do not completely understand.

One is about NONREF_NONPOLY under the column class in the file site_results.txt. what does NONREF_NONPOLY means ?

The second one is the output of population_summary. Even if I have SNPs this file is always empty whatever chromosome I run :

file sites sites_coding sites_noncoding pi pi_coding pi_noncoding N_sites S_sites piN piS mean_dN_vs_ref mean_dS_vs_ref mean_gdiv_polymorphic mean_N_gdiv mean_S_gdiv mean_gdiv sites_polymorphic mean_gdiv_coding_poly sites_coding_poly mean_gdiv_noncoding_poly sites_noncoding_poly
temp_vcf4_AfricanBeer.vcf 0 0 0 * * * 0 0 * * * * * * * * 0 * 0 * 0

The command I run is :

snpgenie.pl --vcfformat=4 --snpreport=./vcf/chrIV_Scc.AfricanBeer.vcf --fastafile=./fasta/chrIV.Scc.fa --gtffile=./gft/chrIV_Scc.Scc.gft --outdir chrIV

so each file is specific for chr 4 (chrIV_Scc) and I modified the vcf as suggested in #35. here an example :

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT AfricanBeer
chrIV_Scc 26094 . C A 159.76 . INFO AD:DP 4,4:39
chrIV_Scc 27939 . T C 222 . INFO AD:DP 0,4:39
chrIV_Scc 28094 . C T 222 . INFO AD:DP 0,4:39
chrIV_Scc 28654 . C T 225.01 . INFO AD:DP 0,1:39
chrIV_Scc 28968 . C T 225.01 . INFO AD:DP 0,1:39

few notes:
I replaced the field INFO with INFO to make it readable.
In contrast to #35 the DP field disagree with the sum of AD values because some positions were missing in the original vcf files.

I am sorry for the format of the text but I am new in git-hub.

Thank you in advance

Best,

Nicolò Tellini

Empty output / wrong VCF file?

Hi,

I'm trying to run SNPgenie on a couple of sequences that I've mapped from different samples. I've called the SNPs using freebayes, and produced the GTF file using prodigal. I needed to tinker a little bit with the files, but I think at this point they should be ok to use. However when I run SNPgenie with following command:

snpgenie.pl --vcfformat=4 --snpreport=vcffile.vcf --fastafile=fastafile.fna --gtffile=gtffile.gtf

The Program runs, and I get immediattely following output:

################################################################################
##                                                                            ##
##                             SNPGenie Initiated!                            ##
##                                                                            ##
################################################################################

  ###############################  LICENSE:  #################################
  ##            SNPGenie Copyright (C) 2015-18 Chase W. Nelson              ##
  ##            This program comes with ABSOLUTELY NO WARRANTY;             ##
  ##     This is free software, and you are welcome to redistribute it      ##
  ##               under certain conditions; see LICENSE.txt.               ##
  ############################################################################

You have not selected a MIN. MINOR ALLELE FREQ. All variants in the SNP report(s) will be included.

Reading in FASTA file... COMPLETED.

Indexing sequence... COMPLETED.

################################################################################
##                      SNPGenie completed successfully.                      ##
##             Please find results in the SNPGenie_Results folder.            ##
################################################################################

However when I look into the results output, all the files are empty..
I'm a bit puzzled at what could be causing this, and I think the VCF file could be the problem? I've inserted the files in here for troubleshooting.

Thanks in advance!

Warddeb

sliding window error with new version compared to old version.

Hi Chase

I'm very happy with SNPGenie and that you wrote in a way that has minimal requirements.
I have encountered an anomaly when using the current version compared to an old one I used previously.

The sliding window result file looks like this with the current version:
file product first_site first_codon last_site last_codon sum_nonsyn_diffs sum_syn_diffs sum_nonsyn_sites sum_syn_sites piN piS piN/piS piN_vs_ref piS_vs_ref piN/piS_vs_ref
WGS572_S13.vcf ORF 1 ATG 150 AGG 0 0 0 0 * * * * * *
WGS572_S13.vcf ORF 4 AGC 153 AAG 0 0 0 0 * * * * * *
WGS572_S13.vcf ORF 7 ACA 156 ACT 0 0 0 0 * * * * * *
WGS572_S13.vcf ORF 10 AAT 159 TCC 0 0 0 0 * * * * * *
WGS572_S13.vcf ORF 13 CCT 162 GAG 0 0 0 0 * * * * * *

The sliding window result file looks like this with the old version:
file product first_site first_codon last_site last_codon sum_nonsyn_diffs sum_syn_diffs sum_nonsyn_sites sum_syn_sites piN piS piN/piS piN_vs_ref piS_vs_ref piN/piS_vs_ref
WGS572_S13.vcf ORF 1 ATG 150 AGG 0 0.0228922932049934 111.5 38.5 0 0.000594605018311517 0 0 0.000300779220779221 0
WGS572_S13.vcf ORF 4 AGC 153 AAG 0 0.0228922932049934 111.166666666667 38.8333333333333 0 0.000589501112574938 0 0 0.000298197424892704 0
WGS572_S13.vcf ORF 7 ACA 156 ACT 0 0.0228922932049934 110.5 39.5 0 0.000579551726708694 0 0 0.000293164556962025 0

I am running this command:
snpgenie.pl --vcfformat=2 --snpreport=WGS572_S13.vcf --fastafile=WGS572_S13.consensus-map.fasta --minfreq=0.01 --slidingwindow=50 --gtffile=SNPGenie_orf.gff3

I have attached the files.
SNPGenie-error.zip

Am I missing something obvious?

Best,
Martin

how does SNPGenie deal with invariant sites?

Should I output invariant sites to my VCF or does it assume all sites are genotyped?

gtf -strand convert to + strand

Hi, I have a problem when I convert my gtf file to + strand file. I do not what length I should enter. Total length of reference gene or whole length of genome. Because my gtf file is download for whole genome. when I enter reference gene length, the location in + strand file is minus "-". Here is my gtf file. Thanks.
AT5G13480_2_cds.txt

memory usage

Hi,

I am analyzing the polymorphism data (vcf files) from a bird species (genome size ~1.2G). I split the genome into 10 parts, but still the memory usage reached ~30G. Do you have an idea how to split/process the input files to reduce the memory usage?

Thanks a lot!

Best,
Luohao

Invalid FASTA file

Call me nitpicky, but your example FASTA file is invalid.

> FASTA Reference Sequence

The space between > and FASTA makes the sequence have an empty name, and I guess thats not what you want.

Making a vcf file by hand: ## WARNING: ## The SNP Report table_1.csv has no header.

Hi SNPGenie team

I am trying to get snpgenie.pl to work, but am getting stuck at The SNP Report table_1.csv has no header.

I think is likely because I am using a SNP report table that I made by hand, as the way I picked SNPs was a bit long winded and I ended up filtering them in R and want to now put them into SNPGenie.

I created a vcf file that I think follows the specifications of a VCF file and those documented in the README of SNPGenie, but maybe there is something SNPGenie does that means it does not work.

Any thoughts would be much appreciated.

Many thanks
Dan

snp_genie_prep.vcf.zip

Problem with VCF

Hi,

I was trying to use your software with a SNPeff annotated VCF file that I built using freebayes from several (pooled) samples ...

I am aware this is probably not the default format your software is expecting, but I was wondering whether you could give me a hint as to how I should modify the VCF to make it work.

I attached a sample.vcf.txt

The error(s) I got are below:

cheers,

dom

>> snpgenie-1.2.pl --sepfiles --minfreq=0.01 --snpreport=Gr_pops.vs.nGr.v1.1.GQ.vcf.norm.filtered.final.snpeff.vcf --fastafile=nGr.v1.1.fa --gtffile=nGr.v1.1.augustus.manual.2015_03_23.gtf 

################################################################################
##                                                                            ##
##                             SNPGenie Initiated!                            ##
##                                                                            ##
################################################################################

  ###############################  LICENSE:  #################################
  ##              SNPGenie Copyright (C) 2015 Chase W. Nelson               ##
  ##            This program comes with ABSOLUTELY NO WARRANTY;             ##
  ##     This is free software, and you are welcome to redistribute it      ##
  ##               under certain conditions; see LICENSE.txt.               ##
  ############################################################################

There is 1 FASTA file in the working directory. ONE-SEQUENCE MODE activated...

Your MIN. MINOR ALLELE FREQ. is 0.01. All variants falling below this frequency will be ignored...


###########################  CURRENTLY PROCESSING:   ###########################
../../final/Gr_pops.vs.nGr.v1.1.GQ.vcf.norm.filtered.final.snpeff.vcf... VCF format detected

Converting ../../final/Gr_pops.vs.nGr.v1.1.GQ.vcf.norm.filtered.final.snpeff.vcf to SNPGenie format... 

In file ../../final/Gr_pops.vs.nGr.v1.1.GQ.vcf.norm.filtered.final.snpeff.vcf, the newline type is: Unix (LF, \n)


### FILE TYPE NOT FULLY SUPPORTED###

### FILE TYPE NOT FULLY SUPPORTED###

### FILE TYPE NOT FULLY SUPPORTED###

### FILE TYPE NOT FULLY SUPPORTED###

### FILE TYPE NOT FULLY SUPPORTED###

### FILE TYPE NOT FULLY SUPPORTED###

Troubleshooting and STOP codon

Dear Chase,

I am trying to run the SNPGenie and I got several warnings related to "No SNPs in >=1000 contiguous codons" (SNPGenie_LOG.txt ):

temp_vcf4_D710_D501.vcf Cre01.g000100 35174 No SNPs in >=1000 contiguous codons. If this was unexpected, you may need to specify Unix (\n) newline characters! See Troubleshooting
temp_vcf4_D710_D501.vcf Cre01.g001150 146734 No SNPs in >=1000 contiguous codons. If this was unexpected, you may need to specify Unix (\n) newline characters! See Troubleshooting
temp_vcf4_D710_D501.vcf Cre01.g001200 159483 No SNPs in >=1000 contiguous codons. If this was unexpected, you may need to specify Unix (\n) newline characters! See Troubleshooting

Also, I got some warnings related to "Mid-sequence STOP codon" (SNPGenie_LOG.txt ):

N/A Cre01.g001678 284241 Mid-sequence STOP codon. Please check your annotations for: (1) incorrect frame; or (2) incorrect starting or ending coordinates. A premature STOP codon may also indicate a pseudogene, for which piN vs. piS analysis may not be appropriate.

Can I ignore these warnings? Or the software will not generate correct results?

Thank you!
Barbara

score in GTF file, parsing problem

Hi,

I've used prodigal to generate some GTF files that I use in SNPgenie downstream. I've noticed that if the score field has an actual value, SNPgenie fails saying that there are no + strand products in the GTF file (the score values are for example 120.5). If I remove the score values (by '.') this problem seems to be fixed.

Kind regards,

Warddeb

specify genetic code?

Hi,

I would like to specify invertebrate mitochondrial genetic code (translation table = 5, I think). It looks like your program assumes standard genetic code and I see no where to alter that. Can I get it specified?

error in running the SNPGenie

hi, i run the SNPGenie and got warning like that :

WARNING: The CDS coordinates for gene Glyma.01G115000.3.Wm82.a2.v1.CDS.2 in the gtf file do not yield a set of complete codons,

or are absent from the file. The number of nucleotides must be a multiple of 3.

SNPGenie terminated.

When I remove the wrong CDS manually , the SNPGenie works again .but my gft file is too huge to remove all wrong CDSs manually. could you tell me how to judge a CDS in a gft file could be translated into complete codons or not ,so that I can write a script and remove them from my gft file once at all.
Thank you very.

GTF problems and empty outputs

Hello,
Im trying to use SNPGenie to calculate my dN/dS in a bacteria genome. But Im in trouble and probably is because my gtf file. I started to manipulate this type of files recently and I`m in the beginner level on bioinformatic world.
So I will describe what I exactly did.

First I used gffread to convert the gff file in the gtf file. Here is the command that I used:
./gffcompare -E Aus0004.gff -T -o- | more
And the program created the file gffcmp.combined.gtf, I just rename to Aus0004.gtf (Just to know, because this is the name that I will start to use in the following commands)

Than I went to SNPGenie, I did the download, used the command chmod +x to make de script executable. Then I used this command to run:
./snpgenie.pl --vcfformat=3 --snpreport=CL9800.vcf —fastafile=Ef_Aus0004.fasta --gtffile=Aus0004.gtf

And this message appeared:
**_**you have not selected a MIN. MINOR ALLELE FREQ. All variants in the SNP report(s) will be included.

WARNING: Aus0004.gtf does not contain any sense (+) strand products. SNPGenie terminated._

I dont know what is wrong and what I need manipulate in the gtf file! Im attaching the gtf file if you need to look.
Aus004gtf.zip

Thank you!

SNPgenie for one bacterial strain

Dear Chase,
I have a vcf file for snps of one strain against a ref. Can I use SNPgenie to calculate Pi trough the whole genome using sliding window?
Sincerely

Input error gtf

Hi. Thank you so much for developing scripts for Biologist!.

I have issue with gtf file input. I am trying to calculate dN/dS ratio between particular strain of Japanese rice fish with samples which were gathered in wild. However I got error message several times and have no idea to fix the problem. (I am totally new to bioinformatics)

I obtained gtf file from here (Ensembl)
And genome fasta file from same ensembl page. For VCF file, I got the file from [here](url https://www.ebi.ac.uk/ena/data/view/PRJEB5473)

and I retrieve only CDS data from that gtf file by following command.
`awk 'BEGIN{FS="\t"}$3 ~ /CDS/{print $0}' Medaka.gtf

Then, I conducted following command and got error message for 5 or 6 times.
`Perl snpgenie.pl --vcfformat=4 --snpreport=<kiyosu.vcf>.vcf --fastafile=<wholegonomedrR.fa>.fa --gtffile=<drR.gtf>.gtf

I eliminated eliminated the error genes by my hand through searching it to confirm whether it is fixed or not, But, it did not fix. Do you have any idea to fix this problem?

WARNING: The CDS coordinates for gene ENSORLG00000003078 in the gtf file do not yield a set of complete codons,

or are absent from the file. The number of nucleotides must be a multiple of 3.

SNPGenie terminated.

Sincerely
Hiroki Tomida

Missing License

The abstract claims that SNPGenie is free and open-source. However, to make SNPGenie really free (as in freedom) you should http://choosealicense.com/.

vcf2revcom.pl - wrong gtf output?

Dear @cwnelson88,

Thank you a lot for your tool, it is great and really helpful.
I calculated the pn/ps ratio using "+" genes from GTF file. But I also need to calculate metrics with "-" genes. I used your script vcf2revcom from Additional scripts window. And how I see, it gives reversed fasta file from end to beginning and the same with vcf file, but when I look on resulting gtf file it still the same as gtf before launching script and when I start calculation with snpgenie script, it gives the same result as with "+" genes but wrong pn/ps in that case.

Briefly, normal gtf and gtf_revcom are the same. What can I do with that? Maybe your script need something specific?

analysing within-host diversity

Hi
I have read your exhaustive introduction and paper and I think it can be useful for my purpose. My interest is to study Ecoli within-host diversity for each sample. Basically, each sample is a fastq and after proper analysis, I found that there are major and minor variants for the Ecoli, indicating a population of same Ecoli but different strains. My idea is to define the diversity per sample, i.e. nucleotide diversity, and the compare the diversity between samples.
Said so, my idea is to generate a multi fasta file containing the sequences per each Ecoli strains coming from the same host, in order to get the nucleotide diversity per host. Then I will create a matrix nxn hosts and run some other analysis.
In order to use SNPGenie, as far as I understood I should use for this within-group analysis, right? Does my approach make sense to you?

Question on interpreting the output parameters

Hi Chase!

Thanks for this awesome tool! I have a question regards on an output parameter named "mean_gdiv". I saw it defined as the mean gene diversity but it's still hard for me to differentiate it from the pi. Could you give me a more detailed definition of mean_gdiv, please? Also, does the mean_dN_vs_ref/mean_dS_vs_ref means the same with dN/dS? Thank you so much for your help!

Regards,

between pops with vcf?

Hi Chase,

Is it possible to run the snpgenie_between_group.pl with vcf files rather than fasta files? I have two populations with variant data for 6 diploids per population and would like to estimate dn/ds between populations. If if cannot be run with vcfs do you have a tool you recommend for converting diploid vcf data to fasta format?

Thanks,
Karen

Missing sites in FASTA files

Dear Chase,

Thank you for writing this tool, It is very helpful and I learned much from it.

If you please, I have a question regarding missing sites representation. I have a multi-sample VCF file generated with GATK in which some samples have missing sites reported for some variants. I want to generate a multi-sequence alignment in FASTA format in order to use it with "snpgenie_within_group.pl" to calculate dN/dS per gene. Do I have to use the REF base to substitute missing sites (not recommended) or is there a way for SNPGenie to not take those sites into consideration (e.g. using "-" to denote a missing site)?

If my samples are diploid and come from one population, will there be a difference between reported dN/dS from "snpgenie_within_group.pl" and piN/piS if I wanted to use "snpgenie.pl" with --vcfformat=1 instead?

Sincerely,
Samer

pi cluster within-pool

Hi Chase -
I have performed the within-pool using the vcf report as input. Now I have several output and I am trying to understand (after reading your papers and references reported in the README.md) how to use them for my purpose. I have been reading previous issues #48 and #49 since they are strongly correlated to my purpose.
My aim is to characterize each sample by its nucleotide diversity (i.e. pi, piS, piN, pi_coding, pi_noncoding) as function of variant frequency and then compare between samples. At given frequency, I built a pairwise distance matrix between samples using as metric Euclidean distance between pi, pi_coding and pi_noncoding. On that distance matrix I have carried out either a linkage clustering or a dimension reduction as UMAP, to perform a clustering analysis. Interestingly, pi_coding is giving few clusters.
Said so, now I'd like to link those cluster to the effect of the mutations, i.e. piN/piS which is identical to dN/dS as far as I understood in my case. I will make an average/variance of piN/piS per cluster.
Any comment on that?
The thread #48 could be useful in my case too, running SNPGenie_sliding_windows.R on my codon_results.txt to get relationship between samples?

Then, reading further I read the McDonald-Kreitman test:
[pN/pS] / [dN/dS]
where
dS: the number of synonymous substitutions per gene
dN: the number of non-synonymous substitutions per gene
pS: the number of synonymous polymorphisms per gene
pN: the number of non-synonymous polymorphisms per gene

Now I got lost/confused and I ask your kind help.
pS and pN are not piS and piN, right? I am asking because the definition are very similar between them.

thank for any comment on that! very appreciate
Best

error: gtf does not contain any sense (+) strand products. SNPGenie terminated.

Hi, Chase.

I have been trying to work with SNPGenie and can't figure out how to troubleshoot this error -- CDS_EXAMPLE.gtf does not contain any sense (+) strand products. SNPGenie terminated..

I am trying to run snpgenie.pl using the following command:
perl snpgenie.pl --minfreq=0.01 --vcfformat=4 --snpreport=A_GUANGDONG_17SF003_2016_H7.vcf --fastafile=A_GUANGDONG_17SF003_2016_H7.fasta --gtffile=A_GUANGDONG_17SF003_2016_H7.gtf --slidingwindow=50

I assume this is a GTF formatting error, but I can't identify it. Additionally, I tried running snpgenie.pl using your example files and got the same error.

I am attaching my GTF, fasta, and VCF for your reference.

Any help would be greatly appreciated! Thanks in advance.

files.zip

sliding_window script requires different headers than between_group output file generates

Hi!
Have been running SNPgenie for a while and now testing the sliding_window script however have come across a few issues. See attached the file I'm attempting to use for testing, with the following command:
Rscript SNPGenie_sliding_windows.R between_group_codon_results.txt N S 40 1 1000 50 NONE 6 > btw_test.txt
between_group_codon_results.txt

The first two issues are labeling bugs, the third I can't work out but I'm wondering if it just doesn't like the between_group output file? I'd be interested to know if you can reproduce this, and what could be going on? Thanks!

1: Unknown or uninitialised column: codon_num.

This occurs because the column in the file outputted by between_group_ testing is named just 'codon'. Renaming the column in the output file passes this.

Error: Problem with `filter()` input `..1`.
ℹ Input `..1` is `num_defined_seqs >= MIN_DEFINED_CODONS`.
✖ object 'num_defined_seqs' not found

Again, column name issue. I changed the line to match on heading, but not a proper fix

#codon_data_filtered <- filter(codon_data, num_defined_seqs >= MIN_DEFINED_CODONS)
codon_data_filtered <- filter(codon_data, num_defined_codons_g1 >= MIN_DEFINED_CODONS)

Here I can't identify the issue.

Error: Assigned data `*vtmp*` must be compatible with existing data.
ℹ Error occurred for column `sw_start`.
✖ Can't convert <character> to <double>.
Backtrace:
     █
  1. ├─base::`[<-`(...)
  2. ├─tibble:::`[<-.tbl_df`(...)
  3. │ └─tibble:::tbl_subassign(x, i, j, value, i_arg, j_arg, substitute(value))
  4. │   └─tibble:::tbl_subassign_row(x, i, value, value_arg)
  5. │     ├─base::withCallingHandlers(...)
  6. │     └─vctrs::`vec_slice<-`(`*tmp*`, i, value = value[[j]])
  7. │       └─(function () ...
  8. │         └─vctrs::vec_default_cast(...)
  9. │           └─vctrs::stop_incompatible_cast(...)
 10. │             └─vctrs::stop_incompatible_type(...)
 11. │               └─vctrs:::stop_incompatible(...)
 12. │                 └─vctrs:::stop_vctrs(...)
 13. │                   └─rlang::abort(message, class = c(class, "vctrs_error"), ...)
 14. │                     └─rlang:::signal_abort(cnd)
 15. │                       └─base::signalCondition(cnd)
 16. └─(function (cnd) ...
Execution halted

Results _ File/ Just one CDS

[Hi](url
SNPGenie_ISSUES.zip
) Chase,
Its me again! I thought that my problems are solved, in fact, im not having problems to run the program, but when I started to look to my results and the output files, something seems weird.
Seems that the program just recognized my first CDS, and just print the results for this.
I`m attaching the results folder, the gtf, vcf and fasta file.
The command line that i used was:
./snpgenie.pl --vcfformat=3 --snpreport=CL9800.vcf —fastafile=Ef_Aus0004.fasta --gtffile=Aus004.gtf

Regards,

Multi-Fasta Reference Issue & Output Nucleotide Diversity for all Sites (Not Only Variant)

Hi there,

I have been experiencing issues getting nucleotide diversity (π) for every site in the whole genome for a segmented virus population. The reference fasta has 10 segments and the vcf file was generated with samtools mpileup and bcftools call -c (both v1.3.1). This returned the error:
WARNING: There are multiple sequences in the FASTA file. There must be only one reference genome. SNPGenie terminated.
I have worked around this by splitting the gtf, fasta, and vcf file (header needs to be present) and am writing this into a loop. Before proceeding I would like to output the "within-group" π at every site instead of only for variants (even if it is 0). Does anyone know how to do this?

Thanks!

Question about calculating pi within a window

Hi! Thanks for making such a a great tool. I am using SNPGenie to examine some intra-host virus SNV data. I was hoping you might be able to help me with two issues I am having.

I am looking to calculate pi but only in specific regions and not across the genome as a whole. As I understand it the population_summary value is genome wide. Is there a way to do this with the outputs of the classic 'within pool' analyses? I have the codon, product, site and population results files.

I also seem to be able to get the sliding window analyses running. Can that run on within pool data? I would like to look at a the general direction of selection at a finer scale than each ORF. Is it best to call consensus sequences and look at the consensus level? Or is there another approach that SNPGenie can do but I have missed?

Thanks for the help!

reverse compliment

Hey Chase, this is copied from our email exchange:

I was running into two issues last night and was hoping you could help.

First, I am running from VCFs of individuals sequenced individually (format 1). I have separated out the fasta, gtf, and vcfs into chromosomes. When I run snpgenie on each chromosome separately, it begins by stating ''There are antisense ('-') strand records in the gtf file. COMPLEMENT MODE activated..." it then proceeds and creates output for all genes in the gtf I have provided (- and +). Are the - stranded gene estimates here believable or do I need to also run snpgenie again for the same data reverse-complemented?

Second, if we do need to run the reverse-complemented data, I am struggling a bit with the vcf2revcom.pl pipeline. I'm not entirely sure how I need to trim my data for this script to work. It also takes some time to figure out which sites are throwing errors because the offending sites are reported in the reverse-complement orientation. My first round of this revealed that vcf2revcom.pl doesn't like missing data (./.). Right now, I am getting the following error and I am not sure what the issue is, I'd love any advice:

seq length is 27754200

## WARNING: In NC_037638.1.S.AFAN.vcf site 26819924,
## the reads total should equal the coverage (29) but is instead: 0.
## The reads total has been used. Please verify your data.
Illegal division by zero at vcf2revcom.pl line 998, <ORIGINAL_SNP_REPORT> line 20695.

Here (I think) is the offending SNP (VCF v4.2):

NC_037638.1 934277 . T C 541.33 PASS AF=0.0909091;AN=22 GT:AD:DP:GQ:PGT:PID:PL:PS 0/0:0,0:29:99:.:.:0,138,1298:. 0/0:20,0:20:51:.:.:0,51,765:. 0|1:11,12:23:99:1|0:934263_TA_T:394,0,423:934263 0/0:35,0:35:81:.:.:0,81,1215:. 0/0:22,0:22:60:.:.:0,60,900:. 0/0:29,0:29:60:.:.:0,60,900:. 0/0:33,0:33:90:.:.:0,90,1350:. 0/0:32,0:32:72:.:.:0,72,1080:. 0|1:11,8:19:99:0|1:934277_T_C:105,0,466:934277 0/0:16,0:16:33:.:.:0,33,495:. 0/0:15,0:15:36:.:.:0,36,540:.

Thanks so much for building the software, it's really a great resource.

Advice on multi-sample, multi-species, multi-contig analyses for both πN/πS and dN/dS

Dear @cwnelson88 ,

Thanks for implementing these solutions in these scripts! These are hugely time-saving for anyone interested in accomplishing gene-centered population genomics!

I have some questions which I think were partly explained in #34 and #35 but I will take the liberty to ask again, just to be sure.

Does SNPGenie consider every "SAMPLE" in a VCF file to be a separate pool/population for the within-pool analyses?

Lets say I work with discrete libraries from individuals sampled from one population (one library, one read-group sample and one column per individual following the "FORMAT" column on every line with a variant) and produce a VCF with a tool like Freebayes (VCFv4.2). The intention is to compute πN/πS across all these individuals, i.e. to treat them as a single population.

If run with --vcfformat=4, will these samples be treated as a single or separate populations?

Freebayes also outputs the global DP and AF tags in the INFO column. Will SNPGenie use these instead of the "SAMPLE" columns? Can I specify --vcfformat=2 to enforce using these tags and effectively parse the multi-sample VCF as a single-population VCF?

When looking across species to compute dN/dS, what is a practical way to accomplish these computations using the same gene models as for a within-population scan?

I suspect one answer to this is to map data from other species to the same reference, and generate one VCF per species. Depending on genetic distance, this will contain quite a few gene regions with variants fixed for a non-reference alternate allele. Using the script from #34, species-specific sequences can then be exported and easily concatenated. This mapping approach side-steps the issue of orthology-detection using external tools.

Does this sound reasonable to you?

For dN/dS, is there any correction for multiple substitutions when estimating dN and dS?

Corrected dN and dS could possibly be computed separately using PAML. Would it be possible for you to implement a way of optionally side-loading those stats from a simple table, such that they can replace or complement the internal computation of dN/dS and be incorporated in the standard summary statistics computed by the program?

Lets say that I work with a highly fragmented reference sequence, spanning many thousands of contigs. Are there any special considerations that need to be taken into account? For instance:

Must all included contigs have a gene model?
Will the reverse-complement script handle multiple accessions as is, or will I need to use alternative solutions?

Cheers!

Understanding how within-between group works.

When using the normal snpgenie (WITHIN-POOL ANALYSIS), it uses a reference fasta and a list of variants to calculate the Nonsynonymous/Synonymous. However I'm a bit confused with the within-groups and between-groups because the inputs are only a fasta alignment and the annotation file (which contains only the CDS annotations), without a reference. It's comparing every sequence vs all for each position?
Many thanks,

Two products have the same starting position, causing an error.

Hi,

I'm trying to run snpgenie-vcf2revcom.pl on a vcf file generated using GATK HaplotypeCaller, a fasta reference, and a gtf file generated using snpgenie-gff2gtf.pl. When I try to run the script I get the following error:

seq length is 129543483

Two products have the same starting position, causing an error.
Please contact script author for a revision.

Please let me know what info would be needed to help trouble shoot this error.

Ignore genes that are not multiples of 3

Hello Chase,

I'm using SNPGenie on some genome-wide pooled sequencing. SNPGenie stops at the start when it finds loci in the gtf that are not multiples of 3. Is it possible for a SNPGenie option to ignore these loci (with a note in the log) and carry on with an analysis?

This would speed up batch-processing considerably and not affect downstream analyses in my case.

Best wishes,
Jack

gtf error, SNPGenie terminated

Hi, I tried to run SNPGenie on a whole genome to calculate the piN piS. I noticed for some genes SNPGenine will report something like this:

"The CDS coordinates for gene g25564 in the gtf file do not yield a set of complete codons, or are absent from the file. The number of nucleotides must be a multiple of 3."

It seems something wrong with the gtf file. I tried both the gtf file from the official reference of our species and one made by myself using augustus. It always have the same issue. Is there any way to fix this? Thank you!

Variance input format of vcf

Hi, I notice that snpgenie takes CLC or Geneious format for input, how about take vcf format into consideration, since human and fly have well data in this format. thanks

column header "ID"

Hello Chase,
I am trying to run SNPGenie and I got the following error:
temp_vcf4_POS.vcf does not contain the standard VCF column header "ID". SNPGenie terminated.
I checked the .vcf file (produced by LoFreq) and the "ID" header is there and it is also present in the temp_vcf4_ID.vcf file. So up to that point the program is able to parse the "ID" column.
I am not sure what can be wrong.
I hope you can help me.

Thanks

Lucas

How to properly format VCF for snpgenie.pl (when using VCFs from multiple isolates of a population)

Hello!

First I want to say thank you for maintaining this awesome software :)

I have a question regarding the input VCF format when using snpgenie.pl.

For context, I am interested in analyzing gene diversity across a set of whole genome bacterial assemblies I have generated with PacBio (1 contig and circular).

I have aligned each assembly to the reference genome and inferred SNPs using Minimap2. This results in a single VCF for each sample. I have 31 individual VCFs, which I have also merged using bcftools. I am interested in estimating πN/πS, dN/dS, and gene diversity across this population of 31 assemblies.

My first attempt at using snpgenie.pl was to simply merge all 31 VCFs to a multi-sample VCF.

The multisample VCF is formatted as followed:

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	M0011368_9	M0014888_3	
NC_000962.3	309	.	G	C	60	.	QNAME=contig_1;QSTART=309;QSTRAND=+	GT:AD:DP	1/1:0,1:1	0/0:1,0:1
NC_000962.3	490	.	G	A	60	.	QNAME=contig_1;QSTART=490;QSTRAND=+	GT:AD:DP	0/0:1,0:1	0/0:1,0:1
NC_000962.3	558	.	G	T	60	.	QNAME=contig_1;QSTART=558;QSTRAND=+	GT:AD:DP	0/0:1,0:1	0/0:1,0:1
NC_000962.3	1088	.	G	A	60	.	QNAME=contig_1;QSTART=1088;QSTRAND=+	GT:AD:DP	0/0:1,0:1	0/0:1,0:1

NOTE 1: For the purpose of this example I left only 2 samples within the VCF
NOTE 2: I encoded each sample as having an AD that has one count either towards the reference or alternative allele(s). Thus the depth (DP) of each sample is also 1.

I ran snpgenie.pl with the following command

snpgenie.pl --fastafile=./GCF_000195955.2_ASM19595v2_genomic.fasta --gtffile=./GCF_000195955.2_ASM19595v2_genomic.gtf --snpreport=./PMP_31CI_TestVCF_pyvcf.vcf --vcfformat=4 --outdir=./OutDir

As SNPgenie ran (and when it finished) I noticed that it evaluating each sample in the merged VCF individually. Producing files such as
temp_vcf4_M0011368_9.vcf and temp_vcf4_M0014888_3.vcf

In the population_summary.txt you can see that the stats were calculated for each sample individually.

file	sites	sites_coding	sites_noncoding	pi	pi_coding	pi_noncoding	N_sites	S_sites	piN	piS	mean_dN_vs_ref	mean_dS_vs_ref	mean_gdiv_polymorphic	mean_N_gdiv	mean_S_gdiv	mean_gdiv	sites_polymorphic	mean_gdiv_coding_poly	sites_coding_poly	mean_gdiv_noncoding_poly	sites_noncoding_poly
temp_vcf4_M0011368_9.vcf	4411532	3947191	464341	0	0	0	1412442.16666667	516764.833333334	0	0	2.54877692337374e-05	4.45076725744689e-05	0	*	*	0	138	0	114	0	24
temp_vcf4_M0014888_3.vcf	4411532	3947191	464341	0	0	0	1412444.83333333	516762.166666667	0	0	2.97356746322482e-05	5.22484069879986e-05	0	*	*	0	173	0	146	0	27

I now realize that I was misunderstanding the needed format of the input VCF. The pi is 0 for these two samples, which is unsurprising if each sample is evaluated individually.

So my question is, what is the most appropriate way to evaluate gene diversity across my "population" of 27 isolates? (When I have individual VCFs for each isolate)

Should I format a VCF which has a single "sample" that represents the population of 27 isolates?

Quick example of this idea below:

DP always is 27
AD corresponds to the number of REF vs ALT alleles across 27 samples.

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	MyPopulationSample
NC_000962.3	309	.	G	C	60	.	QNAME=contig_1;QSTART=309;QSTRAND=+	GT:AD:DP	1/1:3,24:27	
NC_000962.3	490	.	G	A	60	.	QNAME=contig_1;QSTART=490;QSTRAND=+	GT:AD:DP	1/1:2,25:27
NC_000962.3	558	.	G	T	60	.	QNAME=contig_1;QSTART=558;QSTRAND=+	GT:AD:DP	1/1:22,5:27

Also, is SNPgenie even the appropriate tool for this? Is there something inappropriate in how I am thinking about this analysis?

Thank you so much!
Max

Batch processing

Hello,

I am trying to use SNPgenie in a for loop to batch process. However it is unclear how I change output directory. The software processes the first sample, creates the standard output directory then throws an error for the rest for the samples because the output folder already exists.
Any help would be appreciated.
Thank you,
Matt

comments on dN/dS and pi

Hi all
I report here a very kind email from Chase on few question about dN/dS and pi. I hope it could be useful for the community.

I am using as "within-pool analysis" using vcf format as you know in order to calculate nucleotide diversity (pi) and ratio dN/dS.
In site_results.txt it says:
pi. Nucleotide diversity at this site.
whereas in population_summary.txt:
pi. Mean number of pairwise differences per site in the pooled sample across the whole genome.
For definition, they should be the same thing, isn't? You also used the same keyword. pi_coding and pi_noncoding then refer to the coding and noncoding region I suppose.

The site_results pi is π for a single site in the genome. Most will hopefully be 0. For population_summary, as stated in the definition, it is the mean for the whole genome (all sites), which is not the same. Correct: coding is for regions annotated as protein-coding (in the GTF) and noncoding is for regions NOT annotated as protein-coding (in the GTF).

Then the ratio piN/piS is not automatically calculated but only presented as piN and piS, right?

In cases where it is not calculated for you, you can calculate it yourself as piN/piS (or dN/dS, depending on application).

Finally, dN/dS is not calculated and only mean_dN_vs_ref (and dS) is presented. Is it the same thing?

Good question! They are not the same. piN is the mean of all pairs, i.e., every sequence (or read) to every other. Contrarily, mean dN vs. the reference sequence is the mean of comparisons of the reference sequence (provided in the FASTA) to all other sequences; non-reference sequences are not compared to one another for this (they are only compared to the reference). This result shows just the mean of variation that differs from the reference. It is typically useful when you are interested in divergence from the reference only, not overall diversity in the population.

Gene diversity (observed heterozygosity)

Dear Chase,

What does the asterisk (*) symbol mean in the output results of gene diversity (mean_gdiv_polymorphic, mean_N_gdiv and mean_S_gdiv)?

What is the difference between zero values (0) to the asterisks (*) symbols?

For some products (product_result) I have zero values for nucleotide diversity; but, I do get non-zero values for the number of sites (N_sites and S_sites). How can SNPGenie assess the number of sites, but not the other estimates (e.g. number of nucleotide substitutions per site [d] = N_diffs and S_diffs; and nucleotide diversity [pi] = piN and piS) for a specific product?

I appreciate your help!
Barbara.

Hello, I am trying to run snpgenie_wthin_group.pl after installing Parallel::ForkManager (with -force option) and get the error message: MacBookPro:SNPGenie-master julia$ perl snpgenie_within_group.pl --fasta_file_name=compl_cons_ORF --gtf_file_name=GGS.gtf Can't locate Parallel/ForkManager.pm in @INC (@INC contains: /Library/Perl/5.16/darwin-thread-multi-2level /Library/Perl/5.16 /Network/Library/Perl/5.16/darwin-thread-multi-2level /Network/Library/Perl/5.16 /Library/Perl/Updates/5.16.2/darwin-thread-multi-2level /Library/Perl/Updates/5.16.2 /System/Library/Perl/5.16/darwin-thread-multi-2level /System/Library/Perl/5.16 /System/Library/Perl/Extras/5.16/darwin-thread-multi-2level /System/Library/Perl/Extras/5.16 .) at snpgenie_within_group.pl line 62. BEGIN failed--compilation aborted at snpgenie_within_group.pl line 62. Do you have any idea what I can do to fix it? Thanks!!!

SNPGenie for bacterial samples from the same lineage

Hi @cwnelson88 ,

Thanks for the fantastic software and the detailed wiki page. I would like you to ask for advice on my setup, just to see if it makes sense. I am quite a newbie in this field.

I am analysing 138 bacterial genomes (E. coli) from the same lineage (closely-related). I used snippy which uses FreeBayes to compute which SNPs are present in these genomes against a reference genome. My first question is more theoretical: if my final goal is to identify which genes are under positive or purifying selection, should I select a reference genome from the same lineage or from a distinct (ancestral) lineage?

Following the advice given at the issue #35, I am planning to merge the 138 individual *vcf files into a single *vcf file that represents the entire lineage. Is there any other alternative that I should consider instead?

Thanks in advance! and it is super handy the conda installation :)

Sergio

chasewnelson / snpgenie Goto Github PK

snpgenie's People

Contributors

Stargazers

Watchers

Forkers

snpgenie's Issues

WARNING: The SNPGenie script vcfformat1_to_revcom needs exactly 2 arguments, in this order:

## (1) A '+' strand SNP report in VCF format 1 (SNPGenie descriptions).

(2) The exact length of the sequence in the FASTA file.

For example: vcfformat1_to_revcom.pl my_snp_report.vcf 248956422

Converting All2.vcf to reverse complement format...

REVERSE COMPLEMENT VCF FILE has been written to:

VCF: All2_revcom.vcf

WARNING: Please be aware that there is a mid-sequence STOP codon in the sequence GU280_gp01|13483.

Please check your annotations for: (1) incorrect frame; or (2) incorrect starting or ending coordinates.

A premature STOP codon may also indicate a pseudogene, for which piN vs. piS analysis may not be appropriate.

WARNING: the line is:

For vcfformat==4, AD must list numbers of reads for two SNPs as follows: ,,. SNPGenie TERMINATED.

WARNING: The CDS coordinates for gene Glyma.01G115000.3.Wm82.a2.v1.CDS.2 in the gtf file do not yield a set of complete codons,

or are absent from the file. The number of nucleotides must be a multiple of 3.

SNPGenie terminated.

WARNING: Aus0004.gtf does not contain any sense (+) strand products. SNPGenie terminated._

WARNING: The CDS coordinates for gene ENSORLG00000003078 in the gtf file do not yield a set of complete codons,

or are absent from the file. The number of nucleotides must be a multiple of 3.

SNPGenie terminated.

Recommend Projects

Recommend Topics

Recommend Org