angsd's Introduction

Program for analysing NGS data.

http://www.popgen.dk/angsd

Installation:

  1. Using a local folder containing htslib
#download htslib
git clone --recurse-submodules https://github.com/samtools/htslib.git;
#download angsd
git clone https://github.com/angsd/angsd.git;

#install htslib
cd htslib
make

#install angsd
cd ../angsd
make HTSSRC=../htslib
  2. Systemwide installation of htslib
git clone https://github.com/angsd/angsd.git;
cd angsd; make HTSSRC=systemwide
  3. Using htslib submodule
git clone https://github.com/angsd/angsd.git;
cd angsd; make

Notes

  • I've switched over to using htslib for parsing single reads (to allow for CRAM reading while avoiding having to write my own CRAM parser). I'm still using my own readpools. Users should therefore also download and install htslib.
  • If you are on a Mac and the compilation process complains about a missing cryptolib library, then run 'make CRYPTOLIB=""'.

Program has a paper

http://www.biomedcentral.com/1471-2105/15/356/abstract

angsd's People

Contributors

aalbrechtsen, abigailramsoe, albrechtsen, angsd, e-jorsboe, fgvieira, isinaltinkaya, khanghoj, kullrich, nspope, samuelesoraggi, wookietreiber, youreprettygood

angsd's Issues

The names in haploid calling output

Hi all,

When I do haploid calling from a bam file, the outfile looks like below:
chr pos major ind0 ind1 ind2 ind3
sc0000012 2416 T T T T N
sc0000012 2417 G G G G G
sc0000012 2418 A A A N A
sc0000012 2420 G G G N G

It seems the output can't automatically substitute the @SM sample names for ind0/1/2/3, so I am wondering whether the index of each ind column (ind0/1/2/3) follows the order of the bam.list file.

For example: I have the bam.list file looks like below:
bar.Adif.bam
bar.Aech.bam
bar.Agem.bam
bar.Asub.bam

In this case, ind0 should be Adif, ind1 should be Aech, ind2 should be Agem, and ind3 should be Asub, right?

Thanks a lot

Best,
Yafei
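Assuming the columns follow bam.list order (as the question suggests), the substitution can be sketched as a small, hypothetical post-processing step; the helper name and the suffix handling below are mine, not ANGSD's:

```python
# Hypothetical sketch: rename the generic ind0..indN columns of a
# -dohaplocall output using the order of filenames in bam.list.
# Assumption: column order follows bam.list order.

def rename_haplo_header(header_fields, bam_list):
    """Replace ind0..indN with sample names derived from the bam list."""
    names = [path.rsplit("/", 1)[-1].removesuffix(".bam") for path in bam_list]
    out = []
    for field in header_fields:
        if field.startswith("ind") and field[3:].isdigit():
            out.append(names[int(field[3:])])
        else:
            out.append(field)
    return out

bams = ["bar.Adif.bam", "bar.Aech.bam", "bar.Agem.bam", "bar.Asub.bam"]
header = ["chr", "pos", "major", "ind0", "ind1", "ind2", "ind3"]
print(rename_haplo_header(header, bams))
# ['chr', 'pos', 'major', 'bar.Adif', 'bar.Aech', 'bar.Agem', 'bar.Asub']
```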

VCF genotype likelihood input

As VCF is able to encode genotype likelihoods, is there a way to use these in ANGSD?

Some callers have rather sophisticated methods to calculate these, which may be somewhat context dependent. This would also allow working with indels, SVs, and other non-SNP variation.

I'm happy to write a converter to go from VCF to a format that ANGSD can understand. But is there documentation to explain the input format?
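As background for such a converter, the VCF PL field stores phred-scaled genotype likelihoods, so recovering plain likelihoods is a simple transform; a minimal sketch (the normalisation and any eventual output layout are assumptions, not ANGSD's documented input spec):

```python
# Sketch: convert the comma-separated phred-scaled PL values of one
# genotype field into normalised genotype likelihoods, as a starting
# point for writing a GL file that ANGSD could consume.

def pl_to_likelihoods(pl_string):
    """PL = -10*log10(likelihood); return likelihoods rescaled to sum to 1."""
    phred = [int(x) for x in pl_string.split(",")]
    lik = [10 ** (-p / 10) for p in phred]
    total = sum(lik)
    return [l / total for l in lik]

print(pl_to_likelihoods("0,30,300"))  # strongly favours the first genotype
```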

realSFS 2DSFS Segfault

Hello,

I'm trying to generate a 2DSFS and I keep getting a segfault:

hoffmanp@login03 [~/scratch/angsd/misc] % gdb realSFS
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-83.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /panfs/roc/scratch/hoffmanp/angsd/misc/realSFS...done.
(gdb) run ~/scratch/SCNAngsd2k/Bloomington.Huxley/Fst/Bloomington_Intergenic.saf.idx ~/scratch/SCNAngsd2k/Bloomington.Huxley/Fst/Huxley_Intergenic.saf.idx > ~/scratch/SCNAngsd2k/Bloomington.Huxley/Fst/out
Starting program: /panfs/roc/scratch/hoffmanp/angsd/misc/realSFS ~/scratch/SCNAngsd2k/Bloomington.Huxley/Fst/Bloomington_Intergenic.saf.idx ~/scratch/SCNAngsd2k/Bloomington.Huxley/Fst/Huxley_Intergenic.saf.idx > ~/scratch/SCNAngsd2k/Bloomington.Huxley/Fst/out
[Thread debugging using libthread_db enabled]
    -> Version of fname:/home/morrellp/hoffmanp/scratch/SCNAngsd2k/Bloomington.Huxley/Fst/Bloomington_Intergenic.saf.idx is:2
    -> Assuming .saf.gz file: /home/morrellp/hoffmanp/scratch/SCNAngsd2k/Bloomington.Huxley/Fst/Bloomington_Intergenic.saf.gz
    -> Assuming .saf.pos.gz: /home/morrellp/hoffmanp/scratch/SCNAngsd2k/Bloomington.Huxley/Fst/Bloomington_Intergenic.saf.pos.gz
    -> Version of fname:/home/morrellp/hoffmanp/scratch/SCNAngsd2k/Bloomington.Huxley/Fst/Huxley_Intergenic.saf.idx is:2
    -> Assuming .saf.gz file: /home/morrellp/hoffmanp/scratch/SCNAngsd2k/Bloomington.Huxley/Fst/Huxley_Intergenic.saf.gz
    -> Assuming .saf.pos.gz: /home/morrellp/hoffmanp/scratch/SCNAngsd2k/Bloomington.Huxley/Fst/Huxley_Intergenic.saf.pos.gz
    -> args: tole:0.000001 nthreads:4 maxiter:100 nsites:0 start:(null) chr:(null) start:-1 stop:-1 fname:/home/morrellp/hoffmanp/scratch/SCNAngsd2k/Bloomington.Huxley/Fst/Huxley_Intergenic.saf.idx fstout:(null) oldout:0 seed:1462801289 bootstrap:0
    -> Multi SFS is 'still' under development. Please report strange behaviour
    -> nSites: 396873
    -> The choice of -nSites will require atleast: 42.390610 megabyte memory, that is at least: 0.18% of total memory
    -> dim(/home/morrellp/hoffmanp/scratch/SCNAngsd2k/Bloomington.Huxley/Fst/Bloomington_Intergenic.saf.idx):5
    -> dim(/home/morrellp/hoffmanp/scratch/SCNAngsd2k/Bloomington.Huxley/Fst/Huxley_Intergenic.saf.idx):19
    -> Dimension of parameter space: 95
    -> Done reading data from chromosome will prepare next chromosome
    -> Is in multi sfs, will now read data from chr:HEGLY_SF0
    -> hello Im the master merge part of realSFS. and I'll now do a tripple bypass to find intersect 
    -> 1) Will set iter according to chooseChr and start and stop, and possibly using -sites
    -> Sites to keep[HEGLY_SF0] from pop0:  1340
    -> Sites to keep[HEGLY_SF0] from pop1:  1340
    -> [readdata] lastread:1340 posi:3170
    -> Comparing positions: 1 with 0 has:1340
    -> Only read nSites: 1340 will therefore prepare next chromosome (or exit)
    -> Done reading data from chromosome will prepare next chromosome
    -> Is in multi sfs, will now read data from chr:HEGLY_SF100
    -> hello Im the master merge part of realSFS. and I'll now do a tripple bypass to find intersect 
    -> 1) Will set iter according to chooseChr and start and stop, and possibly using -sites
    -> Sites to keep[HEGLY_SF100] from pop0:    206
    -> Sites to keep[HEGLY_SF100] from pop1:    206
    -> [readdata] lastread:206 posi:3170
    -> Comparing positions: 1 with 0 has:1546
    -> Only read nSites: 1546 will therefore prepare next chromosome (or exit)
    -> Done reading data from chromosome will prepare next chromosome
    -> Is in multi sfs, will now read data from chr:HEGLY_SF10003
    -> hello Im the master merge part of realSFS. and I'll now do a tripple bypass to find intersect 
    -> 1) Will set iter according to chooseChr and start and stop, and possibly using -sites
    -> Sites to keep[HEGLY_SF10003] from pop0:  129
    -> Sites to keep[HEGLY_SF10003] from pop1:  129
    -> [readdata] lastread:129 posi:3170
    -> Comparing positions: 1 with 0 has:1675
    -> Only read nSites: 1675 will therefore prepare next chromosome (or exit)
    -> Done reading data from chromosome will prepare next chromosome
    -> Is in multi sfs, will now read data from chr:HEGLY_SF1001
    -> hello Im the master merge part of realSFS. and I'll now do a tripple bypass to find intersect 
    -> 1) Will set iter according to chooseChr and start and stop, and possibly using -sites
    -> Sites to keep[HEGLY_SF1001] from pop0:   87
    -> Sites to keep[HEGLY_SF1001] from pop1:   87
    -> [readdata] lastread:87 posi:3170
    -> Comparing positions: 1 with 0 has:1762
    -> Only read nSites: 1762 will therefore prepare next chromosome (or exit)
    -> Done reading data from chromosome will prepare next chromosome
    -> Is in multi sfs, will now read data from chr:HEGLY_SF10016
    -> hello Im the master merge part of realSFS. and I'll now do a tripple bypass to find intersect 
    -> 1) Will set iter according to chooseChr and start and stop, and possibly using -sites
    -> Sites to keep[HEGLY_SF10016] from pop0:  159
    -> Sites to keep[HEGLY_SF10016] from pop1:  159
    -> [readdata] lastread:159 posi:3170
    -> Comparing positions: 1 with 0 has:1921
    -> Only read nSites: 1921 will therefore prepare next chromosome (or exit)
    -> Done reading data from chromosome will prepare next chromosome
    -> Is in multi sfs, will now read data from chr:HEGLY_SF1004
    -> hello Im the master merge part of realSFS. and I'll now do a tripple bypass to find intersect 
    -> 1) Will set iter according to chooseChr and start and stop, and possibly using -sites
    -> Sites to keep[HEGLY_SF1004] from pop0:   176
    -> Sites to keep[HEGLY_SF1004] from pop1:   176
    -> [readdata] lastread:176 posi:3170
    -> Comparing positions: 1 with 0 has:2097
    -> Only read nSites: 2097 will therefore prepare next chromosome (or exit)
    -> Done reading data from chromosome will prepare next chromosome
    -> Is in multi sfs, will now read data from chr:HEGLY_SF1005
    -> hello Im the master merge part of realSFS. and I'll now do a tripple bypass to find intersect 
    -> 1) Will set iter according to chooseChr and start and stop, and possibly using -sites
    -> Sites to keep[HEGLY_SF1005] from pop0:   376
    -> Sites to keep[HEGLY_SF1005] from pop1:   376
    -> [readdata] lastread:376 posi:3170
    -> Comparing positions: 1 with 0 has:2473
    -> Only read nSites: 2473 will therefore prepare next chromosome (or exit)
    -> Done reading data from chromosome will prepare next chromosome
    -> Is in multi sfs, will now read data from chr:HEGLY_SF10063
    -> hello Im the master merge part of realSFS. and I'll now do a tripple bypass to find intersect 
    -> 1) Will set iter according to chooseChr and start and stop, and possibly using -sites
    -> Sites to keep[HEGLY_SF10063] from pop0:  149
    -> Sites to keep[HEGLY_SF10063] from pop1:  149
    -> [readdata] lastread:149 posi:3170
    -> Comparing positions: 1 with 0 has:2622
    -> Only read nSites: 2622 will therefore prepare next chromosome (or exit)
    -> Done reading data from chromosome will prepare next chromosome
    -> Is in multi sfs, will now read data from chr:HEGLY_SF10070
    -> hello Im the master merge part of realSFS. and I'll now do a tripple bypass to find intersect 
    -> 1) Will set iter according to chooseChr and start and stop, and possibly using -sites
    -> Sites to keep[HEGLY_SF10070] from pop0:  123
    -> Sites to keep[HEGLY_SF10070] from pop1:  123
    -> [readdata] lastread:123 posi:3170
    -> Comparing positions: 1 with 0 has:2745
    -> Only read nSites: 2745 will therefore prepare next chromosome (or exit)
    -> Done reading data from chromosome will prepare next chromosome
    -> Is in multi sfs, will now read data from chr:HEGLY_SF10082
    -> hello Im the master merge part of realSFS. and I'll now do a tripple bypass to find intersect 
    -> 1) Will set iter according to chooseChr and start and stop, and possibly using -sites
    -> Sites to keep[HEGLY_SF10082] from pop0:  49
    -> Sites to keep[HEGLY_SF10082] from pop1:  49
    -> [readdata] lastread:49 posi:3170
    -> Comparing positions: 1 with 0 has:2794
    -> Only read nSites: 2794 will therefore prepare next chromosome (or exit)
    -> Done reading data from chromosome will prepare next chromosome
    -> Is in multi sfs, will now read data from chr:HEGLY_SF1013
    -> hello Im the master merge part of realSFS. and I'll now do a tripple bypass to find intersect 
    -> 1) Will set iter according to chooseChr and start and stop, and possibly using -sites
    -> Sites to keep[HEGLY_SF1013] from pop0:   170
    -> Sites to keep[HEGLY_SF1013] from pop1:   170
    -> [readdata] lastread:170 posi:3170
    -> Comparing positions: 1 with 0 has:2964
    -> Only read nSites: 2964 will therefore prepare next chromosome (or exit)
    -> Done reading data from chromosome will prepare next chromosome
    -> Is in multi sfs, will now read data from chr:HEGLY_SF10131
    -> hello Im the master merge part of realSFS. and I'll now do a tripple bypass to find intersect 
    -> 1) Will set iter according to chooseChr and start and stop, and possibly using -sites
    -> Sites to keep[HEGLY_SF10131] from pop0:  90
    -> Sites to keep[HEGLY_SF10131] from pop1:  90
    -> [readdata] lastread:90 posi:3170
    -> Comparing positions: 1 with 0 has:3054
    -> Only read nSites: 3054 will therefore prepare next chromosome (or exit)
    -> Done reading data from chromosome will prepare next chromosome
    -> Is in multi sfs, will now read data from chr:HEGLY_SF1016
    -> hello Im the master merge part of realSFS. and I'll now do a tripple bypass to find intersect 
    -> 1) Will set iter according to chooseChr and start and stop, and possibly using -sites
    -> Sites to keep[HEGLY_SF1016] from pop0:   170
    -> Sites to keep[HEGLY_SF1016] from pop1:   170
    -> [readdata] lastread:170 posi:3170
    -> Comparing positions: 1 with 0 has:3224
    -> Only read nSites: 3224 will therefore prepare next chromosome (or exit)
    -> Done reading data from chromosome will prepare next chromosome
    -> Is in multi sfs, will now read data from chr:HEGLY_SF10168
    -> hello Im the master merge part of realSFS. and I'll now do a tripple bypass to find intersect 
    -> 1) Will set iter according to chooseChr and start and stop, and possibly using -sites
    -> Sites to keep[HEGLY_SF10168] from pop0:  376
    -> Sites to keep[HEGLY_SF10168] from pop1:  376
    -> [readdata] lastread:376 posi:3170
    -> Comparing positions: 1 with 0 has:3600
    -> Only read nSites: 3600 will therefore prepare next chromosome (or exit)
    -> Done reading data from chromosome will prepare next chromosome
    -> Is in multi sfs, will now read data from chr:HEGLY_SF10171
    -> hello Im the master merge part of realSFS. and I'll now do a tripple bypass to find intersect 
    -> 1) Will set iter according to chooseChr and start and stop, and possibly using -sites
    -> Sites to keep[HEGLY_SF10171] from pop0:  146
    -> Sites to keep[HEGLY_SF10171] from pop1:  146
    -> [readdata] lastread:146 posi:3170
    -> Comparing positions: 1 with 0 has:3746
    -> Only read nSites: 3746 will therefore prepare next chromosome (or exit)
    -> Done reading data from chromosome will prepare next chromosome
    -> Is in multi sfs, will now read data from chr:HEGLY_SF10178
    -> hello Im the master merge part of realSFS. and I'll now do a tripple bypass to find intersect 
    -> 1) Will set iter according to chooseChr and start and stop, and possibly using -sites
    -> Sites to keep[HEGLY_SF10178] from pop0:  128
    -> Sites to keep[HEGLY_SF10178] from pop1:  128
    -> [readdata] lastread:128 posi:3170
    -> Comparing positions: 1 with 0 has:3874
    -> Only read nSites: 3874 will therefore prepare next chromosome (or exit)
    -> Done reading data from chromosome will prepare next chromosome
    -> Is in multi sfs, will now read data from chr:HEGLY_SF1020
    -> hello Im the master merge part of realSFS. and I'll now do a tripple bypass to find intersect 
    -> 1) Will set iter according to chooseChr and start and stop, and possibly using -sites
    -> Sites to keep[HEGLY_SF1020] from pop0:   230
    -> Sites to keep[HEGLY_SF1020] from pop1:   230
    -> [readdata] lastread:230 posi:3170
    -> Comparing positions: 1 with 0 has:4104
    -> Only read nSites: 4104 will therefore prepare next chromosome (or exit)
    -> Done reading data from chromosome will prepare next chromosome
    -> Is in multi sfs, will now read data from chr:HEGLY_SF1021
    -> hello Im the master merge part of realSFS. and I'll now do a tripple bypass to find intersect 
    -> 1) Will set iter according to chooseChr and start and stop, and possibly using -sites

Program received signal SIGSEGV, Segmentation fault.
0x00000038fba78f8d in _int_free (av=0x38fbd8fe80, p=0x37dcf00, have_lock=0) at malloc.c:5011
5011          unlink(av, p, bck, fwd);
Missing separate debuginfos, use: debuginfo-install zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0  0x00000038fba78f8d in _int_free (av=0x38fbd8fe80, p=0x37dcf00, have_lock=0) at malloc.c:5011
#1  0x000000000040c95d in iter_init(persaf*, char*, int, int) ()
#2  0x000000000040444b in set_intersect_pos(std::vector<persaf*, std::allocator<persaf*> >&, char*, int, int, char**, filt*) ()
#3  0x00000000004095d5 in int readdata<float>(std::vector<persaf*, std::allocator<persaf*> >&, std::vector<Matrix<float>*, std::allocator<Matrix<float>*> >&, unsigned long, char*, int, int, int*, char**, filt*) ()
#4  0x000000000040b0df in int main_opt<float>(args*) ()
#5  0x0000000000405f8a in main ()

Problems using the inbreeding coefficient for unfoldedSFS calculation

I am using the options -doSaf 2 -indF inbreeding.txt; my samples are highly inbred, specified as 0.99 in the inbreeding.txt file. However, the results for the unfolded SFS are what I would expect if the samples had an inbreeding coefficient of 0; I verified this by changing the values in inbreeding.txt to 0. Should I do something different?

Segmentation fault problem

I am trying to run 32 samples with the following commands

./angsd -bam angsd.persicabam.txt -anc PC01.fa -ref Prunus_persica_v1.0_scaffolds.fa -nInd 32 -doMajorMinor 1 -doMaf 1 -doSaf 2 -GL 1 -doPost 1 -doGeno 32 -indF PP_r.indF -nThreads 4 -nQueueSize 20 -minMapQ 30 -minQ 20 -r scaffold_8: -out PP_s8 

in angsd version: 0.900-3-g100443c (htslib: 1.2.1-71-g30fb9ee) build(May 14 2015 10:16:45), but end up with the following error

        -> Done reading data waiting for calculations to finish
        -> Done waiting for threads
/var/spool/slurmd/job3693807/slurm_script: line 59: 14620 Segmentation fault      (core dumped) 

(line 59 contains the above angsd commands).

Output files that appear to be of expected size are .arg, .geno.gz, .mafs.gz, .saf.gz, and .saf.pos.gz. The .saf.idx file is of size zero bytes. When I run the same commands with a different group of only 14 samples (with or without -nQueueSize) this problem does not occur. (It also did not occur in the previous version of ANGSD.)

Grateful for any thoughts or suggestions you may have.

doMaf gzipped output error

Hi there,

I'm attempting to use ANGSD for genotype calling, and using doMaf to see which sites to focus on. The output of doMaf is a gzipped file, but when I try to open it I get the error:

"gzip: sites.mafs.gz: invalid compressed data--crc error"

My command line is:
../angsd0.614/angsd -b bam.list -rf regions.txt -doMaf 1 -doMajorMinor 1 -snp_pval 0.01 -GL 2 -P 4 -baq 0 -minQ 13 -minMapQ 10 -out sites

I'm on Ubuntu 12.04, if that matters. I've tried redownloading and compiling with no effect. Is there any way to output the file not gzipped?

Cheers

unweighted and weighted Fst

After much searching, including reading through Fumagalli et al. (2013) and the documentation at http://www.popgen.dk/angsd, I have been unable to find a clear explanation of the difference between the unweighted and weighted Fst values output by ANGSD realSFS. Are the unweighted values equation 4 and the weighted values equation 12 in Fumagalli et al. (2013)? I ask because the unweighted and weighted Fst values for my two study populations are very different (0.029449 and 0.668699, respectively), and I am trying to understand why the difference is so large and what it means. Any clarification is greatly appreciated; thank you.
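A toy illustration of how two such summaries can diverge, using the generic mean-of-ratios versus ratio-of-sums distinction (this is illustrative only, not ANGSD's exact estimator or code):

```python
# Per-site Fst is a ratio alpha_i / beta_i. An "unweighted" genome-wide
# value averages those ratios; a "weighted" value takes sum(alpha)/sum(beta),
# so sites with larger denominators (more variation) dominate. Many
# low-information sites can therefore pull the unweighted average far
# below the weighted estimate.

def unweighted_fst(alphas, betas):
    return sum(a / b for a, b in zip(alphas, betas)) / len(alphas)

def weighted_fst(alphas, betas):
    return sum(alphas) / sum(betas)

alphas = [0.0001, 0.0001, 0.9]  # per-site numerators (between-pop component)
betas = [0.01, 0.01, 1.0]       # per-site denominators (total variance)
print(unweighted_fst(alphas, betas), weighted_fst(alphas, betas))
```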

regions file error if fasta not sorted

Region files don't work unless the chromosome headers in the fasta reference and/or ancestral reference are in numerical order. Since most fastas don't come in numerical order and there is no single quick way to re-sort them, could this be fixed in angsd, or at least mentioned in the documentation?
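As a possible workaround, a natural-order sort of the fasta headers can be sketched in a few lines; the in-memory record layout here is hypothetical, and a real script would also need to read and rewrite the fasta itself:

```python
# Sketch: order fasta headers "naturally" (chr2 before chr10), which is
# one reading of the numerical ordering that the region files expect.
import re

def natural_key(name):
    """Split a header into text/number runs so numeric parts sort numerically."""
    return [int(t) if t.isdigit() else t for t in re.split(r"(\d+)", name)]

def sort_fasta_records(records):
    """records: dict of header -> sequence; return headers in natural order."""
    return sorted(records, key=natural_key)

recs = {"chr10": "ACGT", "chr2": "GGCC", "chr1": "TTAA"}
print(sort_fasta_records(recs))  # ['chr1', 'chr2', 'chr10']
```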

problem with using GLF format as input

Hi,
I used ANGSD's genotype likelihood function to dump results in binary GLF files, and then tried to perform analyses using ANGSD with such a GLF file as input. For example:
angsd -glf binary.glf.gz -fold 1 -nInd 2 -anc ucsc.hg19.fasta -doSaf 1 -out test -fai ucsc.hg19.fasta.fai
however, I got lots of errors, such as:

Trying to access fasta efter end of chromsome+200:chrM/chrM pos=2171349 ref_len=16571
Trying to access fasta efter end of chromsome+200:chrM/chrM pos=2171399 ref_len=16571
Trying to access fasta efter end of chromsome+200:chrM/chrM pos=2171449 ref_len=16571
Trying to access fasta efter end of chromsome+200:chrM/chrM pos=2171499 ref_len=16571
Trying to access fasta efter end of chromsome+200:chrM/chrM pos=2171549 ref_len=16571
Trying to access fasta efter end of chromsome+200:chrM/chrM pos=2171599 ref_len=16571
Trying to access fasta efter end of chromsome+200:chrM/chrM pos=2171649 ref_len=16571
Trying to access fasta efter end of chromsome+200:chrM/chrM pos=2171699 ref_len=16571
Trying to access fasta efter end of chromsome+200:chrM/chrM pos=2171749 ref_len=16571
Trying to access fasta efter end of chromsome+200:chrM/chrM pos=2171799 ref_len=16571
Trying to access fasta efter end of chromsome+200:chrM/chrM pos=2171849 ref_len=16571
Trying to access fasta efter end of chromsome+200:chrM/chrM pos=2171899 ref_len=16571
Trying to access fasta efter end of chromsome+200:chrM/chrM pos=2171949 ref_len=16571
Trying to access fasta efter end of chromsome+200:chrM/chrM pos=2171974 ref_len=16571
-> Done reading data waiting for calculations to finish
-> Done waiting for threads

    -> npools:0 unfreed tnodes before clean:0
    -> Output filenames:
            ->"test.arg"
            ->"test.saf.gz"
            ->"test.saf.pos.gz"
            ->"test.saf.idx"
    -> Thu Dec 17 14:39:35 2015
    -> Arguments and parameters for all analysis are located in .arg file
    [ALL done] cpu-time used =  2.43 sec
    [ALL done] walltime used =  3.00 sec

I would like to know whether there is a reason for this problem.

Thanks very much

Segfault when using doMaf -1

I get a segfault when using -doMaf -1 on:

./angsd -glf testF.glf.gz -fai testAF.ANC.fai -nInd 20 -doMajorMinor 1 -doGlf 2 -doMaf -1 -SNP_pval 1e-4 -out testF

thanks,

Make error

I'm having trouble getting make to complete for the latest version. Below is the make command and its output up to the failure on bams.h.

make HTSDIR=../htslib
make -C /home/dmvelasc/Software/htslib
make[1]: Entering directory `/home/dmvelasc/Software/htslib'
make[1]: Nothing to be done for `all'.
make[1]: Leaving directory `/home/dmvelasc/Software/htslib'
make: *** No rule to make target `bams.h', needed by `abcAncError.o'.  Stop.

Or am I doing something dumb?

fst nsites output option?

The angsd fst function for sliding windows works nicely, but the output doesn't contain a column listing the number of sites within each window used to calculate that window's average. It would be useful to have this in the output, similar to the thetas output. That would allow users to drop windows containing fewer than some threshold number of sites; I think quite a few windows in my output have only a handful of sites post-filtering, and I'd like to know which ones.
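Until such a column exists, filtering can be sketched post hoc over any window table that does carry an nSites column (as the thetas output does); the column positions below are hypothetical:

```python
# Sketch: drop sliding windows whose site count falls below a threshold.
# Assumes rows of (chrom, win_start, win_end, nSites, statistic); adapt
# nsites_col to the actual table layout.

def filter_windows(rows, min_sites, nsites_col=3):
    """Keep only windows whose nSites value meets the threshold."""
    return [r for r in rows if int(r[nsites_col]) >= min_sites]

windows = [
    ("chr1", 0, 50000, 1200, 0.012),
    ("chr1", 50000, 100000, 3, 0.250),   # nearly empty window
    ("chr1", 100000, 150000, 800, 0.020),
]
print(filter_windows(windows, min_sites=100))  # keeps the two well-covered windows
```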

Contamination

Hello,

I am trying to make a contamination check with ANGSD on a mitochondrial genome. But I receive a segmentation fault message. Any suggestions?

Best

Niall

incongruence result in ABBA-BABA test

Hi all,

I am trying to do ABBA-BABA test in two ways:

  1. using -doAbbababa directly,
  2. calling haplotypes and calculating the D-value myself.
I set the same parameters for both runs, but the two approaches give hugely different results.

I did the ABBA-BABA test with -doAbbababa 1, using bam files as input:

angsd -doAbbababa 1 -blockSize 5000000 -anc Aten.fa -doCounts 1 -minQ 30 -minMapQ 60 -P 24 -setMinDepth 100 -setMaxDepth 1200 -out five_abba_30 -setMinDepth 100 -setMaxDepth 1200 -bam Bam.list # Aten.fa was assembled by myself and should be the outgroup

Results:

Aech_group_realigned_reads.bam Adif_group_realigned_reads.bam Asub_group_realigned_reads.bam 348168 353385 -0.007436359 -0.007436359 0.002436614 -3.051924

I also tried calling haplotypes and doing the ABBA-BABA calculation myself:

/work/student/yafei-mao/angsd_dep/angsd -minQ 30 -doCounts 1 -setMinDepth 100 -setMaxDepth 1200 -minMapQ 60 -dohaplocall 1 -maxMis 0 -out aln_des_30 -bam bam_aln_des.list
less aln_des_30.haplo.gz |cat >aln_des.txt

here aln_des.txt looks like

chr pos major ind0 ind1 ind2 ind3

sc0000012 2474 C C C C C

sc0000012 2477 C C C C C

...

ind0: Adif_group_realigned_reads.bam

ind1: Aech_group_realigned_reads.bam

ind2: Asub_group_realigned_reads.bam

ind3: Aten_group_realigned_reads.bam (outgroup)

python code

infile = open('aln_des.txt', 'r')
outfile = open('block_file', 'w')
step = 5000000  # block size
ABBA = 0
BABA = 0
mark = 'sc0000012'  # same as first chr in 'aln_des.txt'
n = 1
for line in infile:
    if not line.strip() or line.strip()[0] == 'c':  # skip blanks and the header
        continue
    L = line.strip().split('\t')
    if mark == L[0]:
        if int(L[1]) <= step * n:
            if L[6] == L[3] != 'N' and L[4] == L[5] != 'N' and L[3] != L[4]:
                ABBA = ABBA + 1
            elif L[6] == L[4] != 'N' and L[3] == L[5] != 'N' and L[4] != L[3]:
                BABA = BABA + 1
        else:
            outfile.write(L[0] + '\t' + str(step * (n - 1)) + '\t' + str(step * n) + '\t' + str(ABBA) + '\t' + str(BABA) + '\n')
            n = n + 1
            ABBA = 0
            BABA = 0
    else:
        outfile.write(mark + '\t' + str(step * (n - 1)) + '\t' + str(step * n) + '\t' + str(ABBA) + '\t' + str(BABA) + '\n')
        n = 1
        ABBA = 0
        BABA = 0
        mark = L[0]
        continue
infile.close()
outfile.close()

here, the block_file looks like:

sc0000012 0 5000000 1403 756

sc0000002 0 5000000 1418 775

sc0000013 0 5000000 1530 832

sc0000007 0 5000000 1580 1008

cat block_file|awk '{sum+=$4} END {print "Sum = ", sum}'

Sum = 287075

cat block_file|awk '{sum+=$5} END {print "Sum = ", sum}'

Sum = 206104

Results:
Aech_group_realigned_reads.bam Adif_group_realigned_reads.bam Asub_group_realigned_reads.bam 206104 287075 0.19

Here you will find that the two results are hugely different, and I am not sure why. A simple idea:

  1. Using bam files to do the ABBA-BABA test directly, I used Aten.fa as the reference outgroup, whose genome has more information (including mate pairs), so I could detect more ABBA/BABA patterns. However, when calling haplotypes I lost lots of mate-pair reads in Aten.bam (much missing data), so I got a lower number of ABBA/BABA patterns.

Do you think the second way (haplotyping and doing ABBA-BABA myself) is reasonable? Also, how could I calculate the D-value for each scaffold (-Chr) directly using -doAbbababa?

Thanks a lot~
Yafei
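As a sanity check on the manual route, the D value that -doAbbababa reports can be reproduced from its nABBA/nBABA counts with the standard formula; a minimal sketch using the counts from the first run above (the block-jackknife variance that -doAbbababa also reports is omitted here):

```python
# Standard D statistic: D = (nABBA - nBABA) / (nABBA + nBABA).

def d_stat(abba, baba):
    return (abba - baba) / (abba + baba)

# Counts from the first run above (nABBA=348168, nBABA=353385):
print(round(d_stat(348168, 353385), 6))  # -0.007436, matching -doAbbababa
```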

New segfault error: angsd sites index

I ran into a new segfault error while running ANGSD. In this case I am setting up a 2dSFS estimation to use as a prior for Fst. The first two lines

gunzip -c PD.saf.pos.gz PP.saf.pos.gz | sort -S 50% | uniq -d | sort -k1,1 -S 50% | gzip > PE.saf.pos.gz
gunzip -c PE.saf.pos.gz PE.2pops.saf.pos.gz | sort -S 50% | uniq -d | sort -k1,1 -S 50% > intersect.txt

appear to function properly, producing a 64KB file, but

bin/angsd sites index intersect.txt

gives the following error

indexing intersect.txt and will add '0' to pos column
        -> Filterfile: intersect.txt supplied will generate binary representations... 
/var/spool/slurmd/job3702808/slurm_script: line 100: 21399 Segmentation fault      (core dumped)

(line 100 being the angsd line).

Thanks in advance for any help with this situation.

-SNP_pval parameter

Hi all,

Just a quick question. What method is implemented for the -SNP_pval parameter?

I was reading your paper (and the wiki) and I think that parameter relates to the LRT for SNP discovery described in Table 1 of the paper, but I am not quite sure about that.

Thanks in advance
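For background, a generic likelihood-ratio SNP test compares the maximised likelihood with a free allele frequency against a frequency of zero, and refers 2*(lnL_alt - lnL_null) to a one-degree-of-freedom chi-square distribution. The sketch below is the textbook recipe, not ANGSD's internal code:

```python
# Generic LRT p-value with 1 degree of freedom, so a -SNP_pval cutoff
# corresponds to a threshold on the LRT statistic (e.g. p=0.01 <-> ~6.63).
import math

def lrt_pvalue(lnl_alt, lnl_null):
    """One-df chi-square p-value for a likelihood-ratio test."""
    lrt = 2.0 * (lnl_alt - lnl_null)
    # For chi-square with 1 df: P(X > lrt) = erfc(sqrt(lrt / 2))
    return math.erfc(math.sqrt(max(lrt, 0.0) / 2.0))

print(lrt_pvalue(-100.0, -103.3))  # LRT = 6.6, p just above 0.01
```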

Masking regions in the genome

Hello there,

I have information about low-complexity and repeated regions in my genome, and I would like to use it to increase the accuracy of my analyses (SFS, theta calculation, ...). Removing the regions from all my BAMs would be problematic (time and space consuming). I tried masking those regions in the -ref that I supplied, but as far as I can tell that does not exclude them when computing statistics. So I was wondering: if I estimate the SFS including those regions, calculate thetas, and then ignore those positions in the window analysis, will that have a major impact on my estimates? Any other solution or recommendation? Sorry if this is not the right forum for this question.
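One alternative to masking the reference is to build an ANGSD -sites keep-list that excludes the repeat/low-complexity intervals; a sketch assuming BED-style (0-based, half-open) intervals for the mask:

```python
# Sketch: enumerate the 1-based positions of one chromosome that fall
# outside a set of masked BED intervals. Writing "chrom<TAB>pos" lines
# for these positions gives a file that `angsd sites index` can take.

def positions_outside(chrom_len, masked):
    """Yield 1-based positions not covered by masked (start, end) BED intervals."""
    masked = sorted(masked)
    pos = 0  # 0-based cursor over the chromosome
    for start, end in masked:
        for p in range(pos, start):
            yield p + 1
        pos = max(pos, end)
    for p in range(pos, chrom_len):
        yield p + 1

kept = list(positions_outside(10, [(2, 4), (7, 9)]))
print(kept)  # [1, 2, 5, 6, 7, 10]
```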

question on the input file

Hi,
I recently came across ANGSD; it is a great tool for population genetics studies. I would like to know whether it is necessary to perform BQSR on the bam files before running ANGSD.
Thanks very much!

chromosome names realSFS restrictions?

Hi,

realSFS seems to allow only certain chromosome names, as I get the following error:
$ realSFS print ESCP.saf.idx
$ -> Version of fname:ESCP.saf.idx is:2
$ Problem with chr: chrX, key already exists

What are the restrictions on chromosome names (1...n, X, Y?) and is there a workaround? The reference genome I am using is labelled with Roman numerals.

Thank you very much for your help,
David

multiSFS with realSFS fails for reasons unknown

Hi there,

I'm encountering an error that I can't seem to work around. I'm trying to make a 2d SFS (joint MAF SFS) from two populations. The two ANGSD commands seem to work without issue, but when I run the realSFS command on the two .saf.idx files I get an error that I don't understand. Any ideas what's happening? This happens regardless of whether I change the way I create my 1d SFS (-GL 1 or -GL 2, folded or unfolded -- although I really do need to keep it folded). Below are the commands that I've run, and the error. Any help you can give would be great -- I'm not even sure what went wrong...

Thanks!

-Alex

angsd -b ../PAon_bamlist -anc ../Reference.fasta -Sites ../Sites_passed_filters.keep -out 2dattempt_paon -GL 2 -minMapQ 1 -minQ 20 -doSaf 1

angsd -b ../PAoff_bamlist -anc ../Reference.fasta -Sites ../Sites_passed_filters.keep -out 2dattempt_paoff -GL 2 -minMapQ 1 -minQ 20 -doSaf 1

realSFS 2dattempt_paoff.saf.idx 2dattempt_paon.saf.idx > 2attempt

    -> 1) Will set iter according to chooseChr and start and stop
    -> Sites to keep[EBRARK001_BG118_locus100956] from pop0:        90
    -> Sites to keep[EBRARK001_BG118_locus100956] from pop1:        90
    -> [readdata] lastread:90 posi:4
    -> Comparing positions: 1 with 0 has:16779
    -> Only read nSites: 16779 will therefore prepare next chromosome (or exit)
    -> Done reading data from chromosome will prepare next chromosome
    -> Is in multi sfs, will now read data from chr:EBRARK001_BG118_locus100963
    -> hello Im the master merge part of realSFS. and I'll now do a tripple bypass to find intersect
    -> 1) Will set iter according to chooseChr and start and stop
    -> Sites to keep[EBRARK001_BG118_locus100963] from pop0:        66
    -> Sites to keep[EBRARK001_BG118_locus100963] from pop1:        66
    -> [readdata] lastread:66 posi:4
    -> Comparing positions: 1 with 0 has:16845
    -> Only read nSites: 16845 will therefore prepare next chromosome (or exit)
    -> Done reading data from chromosome will prepare next chromosome
    -> Is in multi sfs, will now read data from chr:EBRARK001_BG118_locus100965
    -> hello Im the master merge part of realSFS. and I'll now do a tripple bypass to find intersect
    -> 1) Will set iter according to chooseChr and start and stop
    -> Sites to keep[EBRARK001_BG118_locus100965] from pop0:        90
    -> Sites to keep[EBRARK001_BG118_locus100965] from pop1:        90
    -> [readdata] lastread:90 posi:4
    -> Comparing positions: 1 with 0 has:16935
    -> Only read nSites: 16935 will therefore prepare next chromosome (or exit)
    -> Done reading data from chromosome will prepare next chromosome
    -> Is in multi sfs, will now read data from chr:EBRARK001_BG118_locus10097
    -> hello Im the master merge part of realSFS. and I'll now do a tripple bypass to find intersect
    -> 1) Will set iter according to chooseChr and start and stop

realSFS: multisafreader.hpp:117: int set_intersect_pos(std::vector<persaf *>&, char *, int, int, char **): Assertion `hit->m>0' failed.
Aborted (core dumped)

dstat requires refactoring

@aalbrechtsen

  1. There is an implicit assumption that regions specified by -rf have to come in the same order as the reference names in the header.

  2. The expected output when specifying -r chromosomename is to output counts only for that chromosome, not for all chromosome names given by the header of the BAM.

  3. This code is NOT threadsafe if assumption 1) above doesn't hold. Instead, use the threadsafe changeChr function if background structures need to be dumped/updated before we change to a new chromosome.

Valgrind reports numerous problems with access to uninitialized memory, even when no -rf has been specified:

    ==12478== Invalid read of size 8
    ==12478==    at 0x413F91: abcDstat::printAndEmpty() (abcDstat.cpp:205)
    ==12478==    by 0x413D84: abcDstat::~abcDstat() (abcDstat.cpp:177)
    ==12478==    by 0x413E71: abcDstat::~abcDstat() (abcDstat.cpp:185)
    ==12478==    by 0x478844: destroy_shared() (shared.cpp:114)
    ==12478==    by 0x443AA7: main (angsd.cpp:256)
    ==12478== Address 0xc572038 is 8 bytes before a block of size 21,365,904 alloc'd
    ==12478==    at 0x4C2DC90: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
    ==12478==    by 0x52969BC: bam_hdr_read (sam.c:155)
    ==12478==    by 0x5297E58: sam_hdr_read (sam.c:728)
    ==12478==    by 0x468F5E: multiReader::multiReader(int, char**) (multiReader.cpp:336)
    ==12478==    by 0x443978: main (angsd.cpp:222)
    ==12478==
    ==12478== Conditional jump or move depends on uninitialised value(s)
    ==12478==    at 0x5D807E0: vfprintf (vfprintf.c:1641)
    ==12478==    by 0x5D87B96: fprintf (fprintf.c:32)
    ==12478==    by 0x414010: abcDstat::printAndEmpty() (abcDstat.cpp:207)
    ==12478==    by 0x413D84: abcDstat::~abcDstat() (abcDstat.cpp:177)
    ==12478==    by 0x413E71: abcDstat::~abcDstat() (abcDstat.cpp:185)
    ==12478==    by 0x478844: destroy_shared() (shared.cpp:114)
    ==12478==    by 0x443AA7: main (angsd.cpp:256)

  1. Use kstring_t for buffering output and write directly into a gzFile instead of fprintf directly into a FILE
  2. This part of the angsd code uses WAY too much memory (even when threading has been disabled). Consider using a memory pool for the tmp int** structs.

question about ploidy or pooled samples

I was just wondering if there is a way currently, or possibly a planned extension, to allow for non-diploid samples or pooled samples to be analyzed within the ANGSD framework? Thanks!

BAM header error

Hello,

ANGSD fails with an error when the BAM headers differ in trivial places. We have a large set of BAM files we are going to analyze with ANGSD, but some of them have slightly different values in their @RG PL: fields.

The header for one is

@HD VN:1.4  SO:coordinate
@SQ SN:1    LN:44628078
@SQ SN:10   LN:78891986
@SQ SN:11   LN:77806621
@SQ SN:12   LN:114493404
@SQ SN:13   LN:90638451
@SQ SN:14   LN:107695634
@SQ SN:15   LN:211075956
@SQ SN:16   LN:236119348
@SQ SN:17   LN:247448774
@SQ SN:18   LN:153404915
@SQ SN:19   LN:44469704
@SQ SN:2    LN:25614509
@SQ SN:20   LN:74042087
@SQ SN:21   LN:336444398
@SQ SN:22   LN:423973857
@SQ SN:23   LN:177023008
@SQ SN:24   LN:134276247
@SQ SN:25   LN:166082040
@SQ SN:3    LN:31008187
@SQ SN:4    LN:34072844
@SQ SN:5    LN:36686259
@SQ SN:6    LN:47845093
@SQ SN:7    LN:63559575
@SQ SN:8    LN:132814111
@SQ SN:9    LN:113902069
@RG ID:Sample_170457    PL:hiseq2000    PU:Sample_170457    SM:Sample_170457
@PG ID:MarkDuplicates   PN:MarkDuplicates   VN:1.100(1571)  CL:net.sf.picard.sam.MarkDuplicates INPUT=[Sample_170457/Sample_170457.bam] OUTPUT=Sample_170457/Sample_170457_rmdup.bam METRICS_FILE=Sample_170457/Sample_170457_picarddup.txt SORTING_COLLECTION_SIZE_RATIO=0.05 VALIDATION_STRINGENCY=SILENT CREATE_INDEX=true    PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates REMOVE_DUPLICATES=false ASSUME_SORTED=false MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_MD5_FILE=false

And the header for another is

@HD VN:1.4  SO:coordinate
@SQ SN:1    LN:44628078
@SQ SN:10   LN:78891986
@SQ SN:11   LN:77806621
@SQ SN:12   LN:114493404
@SQ SN:13   LN:90638451
@SQ SN:14   LN:107695634
@SQ SN:15   LN:211075956
@SQ SN:16   LN:236119348
@SQ SN:17   LN:247448774
@SQ SN:18   LN:153404915
@SQ SN:19   LN:44469704
@SQ SN:2    LN:25614509
@SQ SN:20   LN:74042087
@SQ SN:21   LN:336444398
@SQ SN:22   LN:423973857
@SQ SN:23   LN:177023008
@SQ SN:24   LN:134276247
@SQ SN:25   LN:166082040
@SQ SN:3    LN:31008187
@SQ SN:4    LN:34072844
@SQ SN:5    LN:36686259
@SQ SN:6    LN:47845093
@SQ SN:7    LN:63559575
@SQ SN:8    LN:132814111
@SQ SN:9    LN:113902069
@RG ID:Sample_204281    PL:ILLUMINA PU:Sample_204281    SM:Sample_204281
@PG ID:MarkDuplicates   PN:MarkDuplicates   VN:1.100(1571)  CL:net.sf.picard.sam.MarkDuplicates INPUT=[Sample_204281/Sample_204281.bam] OUTPUT=Sample_204281/Sample_204281_rmdup.bam METRICS_FILE=Sample_204281/Sample_204281_picarddup.txt SORTING_COLLECTION_SIZE_RATIO=0.05 VALIDATION_STRINGENCY=SILENT CREATE_INDEX=true    PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates REMOVE_DUPLICATES=false ASSUME_SORTED=false MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_MD5_FILE=false

The only difference besides sample ID in these headers is that one has PL:hiseq2000 and the other has PL:ILLUMINA.
ANGSD gives us this error message:

->(Using Filipe modification of: angsd_realSFS.cpp)

Difference in BAM headers: Problem with number of chromosomes in header
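Since the error mentions the number of chromosomes, it is worth verifying that the headers really do agree on their @SQ lines and differ only in @RG. A minimal sketch of such a comparison (an illustration, not ANGSD's own check; the headers here are shortened versions of the ones above):

```python
def sq_lines(header_text):
    """Extract the @SQ (sequence dictionary) lines of a SAM header."""
    return sorted(l for l in header_text.splitlines() if l.startswith("@SQ"))

h1 = "@HD\tVN:1.4\tSO:coordinate\n@SQ\tSN:1\tLN:44628078\n@RG\tID:Sample_170457\tPL:hiseq2000"
h2 = "@HD\tVN:1.4\tSO:coordinate\n@SQ\tSN:1\tLN:44628078\n@RG\tID:Sample_204281\tPL:ILLUMINA"

# True: both headers describe the same chromosomes; only @RG differs.
print(sq_lines(h1) == sq_lines(h2))  # True
```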

Compilation error in bam_likes.cpp

The definition of errmod_t has been commented out in recent releases. Because of this I cannot build angsd versions later than 0.910.
I've already tried using HTSlib 1.3.2 to see whether this would solve the issue.

The error message using gcc 4.9.3 is:

bam_likes.cpp:271:1: error: ‘errmod_t’ does not name a type
 errmod_t *mod = NULL;

minQ doesn't seem to filter

Filtering without setting any minQ:

/opt/angsd/angsd \
    -P 4 \
    -b my_bamlist.bamlist \
    -ref myref_genome.fa \
    -out test_1.qc \
    -uniqueOnly 1 \
    -remove_bads 1 \
    -only_proper_pairs 1 \
    -baq 1 \
    -C 50 \
    -minMapQ 20 \
    -minInd 10 \
    -doQsDist 1 \
    -doDepth 1 \
    -doCounts 1

Result --> test_1.qc.qs:
qscore counts perc
0 238923500 0.000933898301529085
1 17250164 0.0010013253183805
2 62039382 0.00124382332915681
3 53694141 0.00145370166712676
4 83444136 0.00177986605398264
5 343915629 0.00312415500752805
6 260831145 0.00414368537311112
7 303137118 0.00532858028562756
8 336742720 0.00664483194885987
9 416247587 0.00827185028114983
10 458819334 0.0100652720112488
11 468760041 0.0118975497324215
12 397065639 0.0134495901380646
13 278482917 0.0145381173159737
14 249902172 0.0155149287806443
15 303371074 0.0167007381746204
16 382181559 0.0181946000543679
17 471625377 0.020038077730437
18 586967965 0.0223324036781378
19 753900165 0.0252792301042365
20 871126635 0.0286842684736026
21 1086066241 0.0329294574894089
22 1451773493 0.0386041140188739
23 2201277526 0.0472084134817796
24 2351913369 0.0564015146235788
25 2832976237 0.0674749824741944
26 4405280569 0.0846942348338333
27 7155142714 0.112662080725103
28 13338473918 0.164799179574778
29 24793478649 0.261711319231608
30 40255135801 0.419059604152834
31 56025276812 0.638049828460714
32 52545623732 0.843438870376227
33 27902958736 0.952505269333159
34 9638182002 0.990178758133951
35 2259030756 0.999008802002094
36 253582723 0.999999999914007
37 22 1

Filtering with minQ = 20:

/opt/angsd/angsd \
    -P 4 \
    -b my_bamlist.bamlist \
    -ref myref_genome.fa \
    -out test_2.qc \
    -uniqueOnly 1 \
    -remove_bads 1 \
    -only_proper_pairs 1 \
    -baq 1 \
    -C 50 \
    -minMapQ 20 \
    -minQ 20 \
    -minInd 10 \
    -doQsDist 1 \
    -doDepth 1 \
    -doCounts 1

Result --> test_2.qc.qs:

qscore counts perc
0 232658276 0.000909539560816795
1 17142450 0.000976555171490038
2 61584284 0.00121730887939844
3 52194312 0.00142135402419103
4 81802495 0.00174114752844625
5 342144303 0.00307870484399427
6 259730904 0.00409408042579695
7 302201649 0.00527548845406572
8 335965486 0.00658889069291855
9 415592002 0.00821357980124003
10 458110202 0.0100044868736056
11 468067737 0.0118343213040107
12 396357974 0.0133838181188078
13 277605518 0.0144690715869545
14 249001544 0.0154425024868399
15 302439375 0.0166248398661095
16 381165065 0.018114942482613
17 470470469 0.0199541700016527
18 585502492 0.022243096441694
19 752175545 0.0251836039818445
20 870242210 0.0285856738806503
21 1085098460 0.0328276892248686
22 1451075433 0.0385004318185723
23 2200543571 0.0471030977866015
24 2351279079 0.0562950396235271
25 2832592548 0.0673685979184439
26 4404551408 0.0845874728370706
27 7154630120 0.112557331674808
28 13337770090 0.16469916715344
29 24792360040 0.261620852683309
30 40253999875 0.418987295850053
31 56024030134 0.638004098687814
32 52544597326 0.843418626796779
33 27902517850 0.952498966715449
34 9638071465 0.990177434200513
35 2259010528 0.999008667195284
36 253580791 0.999999999913995
37 22 1

Sorry if I misunderstood something.
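Reading the two tables: the perc column looks like the cumulative fraction of all counted bases up to that quality score, which is why comparing the two runs row by row is the clearest way to see whether -minQ changed anything. A minimal sketch of that cumulative computation (an illustration of how the column appears to be derived, not ANGSD code):

```python
def cumulative_perc(counts):
    """Cumulative fraction of total counts up to each quality score."""
    total = sum(counts)
    running, out = 0, []
    for c in counts:
        running += c
        out.append(running / total)
    return out

print(cumulative_perc([2, 3, 5]))  # [0.2, 0.5, 1.0]
```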

svd - matrix singularity

ANGSD -doAsso 2

does not allow a covariate matrix of non-full rank at ANY site. It should instead output NA for those sites.
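The requested behaviour amounts to a rank check before the per-site test. A sketch with hypothetical names (not ANGSD internals), using a pure-Python Gaussian-elimination rank for self-containedness:

```python
def matrix_rank(rows):
    """Rank via Gaussian elimination (pure Python, for illustration)."""
    m = [list(map(float, r)) for r in rows]
    rank = 0
    for col in range(len(m[0])):
        pivot = next((i for i in range(rank, len(m)) if abs(m[i][col]) > 1e-9), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for i in range(len(m)):
            if i != rank and abs(m[i][col]) > 1e-9:
                f = m[i][col] / m[rank][col]
                m[i] = [a - f * b for a, b in zip(m[i], m[rank])]
        rank += 1
    return rank

def assoc_stat_or_na(covariates):
    """Hypothetical guard: return None (printed as NA) for a rank-deficient
    covariate matrix instead of aborting the whole run."""
    if matrix_rank(covariates) < len(covariates[0]):
        return None
    return 0.0  # placeholder for the real association statistic

# Second column is twice the first, so the matrix is rank-deficient.
print(assoc_stat_or_na([[1, 2], [2, 4], [3, 6]]))  # None
```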

Wrong error message when fai can't be found

When angsd can't find a corresponding fasta index file to a fasta file, it returns with the error message:

-> Reading fasta: ../data/TRIP.fa.gz
-> fai index file: '../data/TRIP.fa.gz.fai' looks older than corresponding fastafile: '../data/TRIP.fa.gz'.
-> Please reindex fasta file

However, this error message doesn't tell the user that the fasta file doesn't exist, so reindexing the fasta will not solve the problem.

angsd fst window drops first window

When I run an Fst sliding-window analysis, for instance with a 1000-base window, the first reported window spans bases 1000-2000; the first window is dropped. I rebuilt angsd just yesterday, and I think this may be a new bug in the latest build? Sorry if I am mistaken, and thanks for any help!

Noah
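For comparison, a naive non-overlapping windowing scheme that keeps the first window would start at position 0 (an illustration of the expected behaviour, not angsd's code):

```python
def windows(chrom_len, size):
    """Non-overlapping windows covering a chromosome from position 0."""
    return [(start, min(start + size, chrom_len))
            for start in range(0, chrom_len, size)]

print(windows(3500, 1000))  # [(0, 1000), (1000, 2000), (2000, 3000), (3000, 3500)]
```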

Segmentation fault GL

Hello,

I am trying to run angsd on two high-coverage genomes (BAM files of 57 GB and 83 GB) to estimate genotype likelihoods and produce MAF and Beagle files. It runs for a few hours and then fails with a segmentation fault (copied below). I've previously run the program on another, smaller dataset and it worked.

    -> Printing at chr: JH477403 pos:34226 chunknumber 2438200
    -> Printing at chr: JH477403 pos:69742 chunknumber 2438300
    -> Printing at chr: JH477403 pos:105524 chunknumber 2438400

/opt/sge/default/spool/node5/job_scripts/100345: line 33: 32322 Segmentation fault (core dumped) angsd -GL 1 -out genolike_YimBaylor10072016 -nThreads 12 -doGlf 2 -doMajorMinor 1 -SNP_pval 1e-6 -doMaf 2 -bam YimBaylor.bamlist

I looked online and it might be a memory limit problem: https://kb.iu.edu/d/aqsj (maybe linked to the size of the BAM files?). But when I type ulimit, it says "unlimited".

I would greatly appreciate any help to solve this.

Thanks,

Marie

ABBA BABA Segfault

ANGSD segfaults whenever I run it for ABBA BABA. I'm using 10 samples, ranging in size from 2 to 3 GB, and running it with 50 regions. My full command is:

% angsd \
>   -doAbbababa 1 \
>   -rmTrans 0 \
>   -blockSize 1000 \
>   -doCounts 1 \
>   -anc ~/barley_samples/Morex_Reference.fasta \
>   -bam ~/barley_samples/sample_list.txt \
>   -uniqueOnly 0 \
>   -minMapQ 30 \
>   -minInd 1 \
>   -nThreads 32 \
>   -checkBamHeaders 0 \
>   -rf ~/barley_samples/regions.txt \
>   -out ~/barley_samples/barley.D

I get the following as output

    -> Reading fasta: Morex_Reference.fasta
    -> Parsing 10 number of samples 
    -> Region lookup 1/50
    -> Region lookup 2/50
    -> Region lookup 3/50
    -> Region lookup 4/50
    -> Region lookup 5/50
    -> Region lookup 6/50
    -> Region lookup 7/50
16576 Segmentation fault      (core dumped)

Some system info:

% head -n1 /etc/issue
CentOS release 6.6 (Final)

% grep MemTotal /proc/meminfo
MemTotal:       16331076 kB

% grep MemFree /proc/meminfo 
MemFree:         7101612 kB

% grep "model name" /proc/cpuinfo
model name  : Intel(R) Xeon(R) CPU           X5355  @ 2.66GHz
model name  : Intel(R) Xeon(R) CPU           X5355  @ 2.66GHz
model name  : Intel(R) Xeon(R) CPU           X5355  @ 2.66GHz
model name  : Intel(R) Xeon(R) CPU           X5355  @ 2.66GHz
model name  : Intel(R) Xeon(R) CPU           X5355  @ 2.66GHz
model name  : Intel(R) Xeon(R) CPU           X5355  @ 2.66GHz
model name  : Intel(R) Xeon(R) CPU           X5355  @ 2.66GHz
model name  : Intel(R) Xeon(R) CPU           X5355  @ 2.66GHz

What am I doing wrong?

Adjust Makefile so that ANGSD compiles on systems with site-wide installed htslib

Hi,
I am trying to compile this package but find it very awkward that it contains hardcoded paths to ../htslib. I managed to work around this by editing:

    sed -e 's@FLAGS=-O3@FLAGS=@' \
        -e 's@HTSLIB = .*@HTSLIB = -lhts@' \
        -e 's@all: htshook angsd misc@all: angsd misc@' -i Makefile
    sed -e 's@HTSLIB = .*@HTSLIB = -lhts@' \
        -e 's@HTSDIR = .*@HTSDIR = /usr/include@' -i misc/Makefile

but then I ran into

g++ -O3 thetaStat.cpp  -I/usr/include/htslib/htslib  -lz -o thetaStat -lpthread -lhts
thetaStat.cpp:7:28: fatal error: ../htslib/bgzf.h: No such file or directory
 #include "../htslib/bgzf.h"
                            ^
compilation terminated.
Makefile:41: recipe for target 'thetaStat' failed
make[1]: *** [thetaStat] Error 1
make[1]: Leaving directory '/var/tmp/portage/sci-biology/ANGSD-9999/work/ANGSD-9999/misc'

thetaStat.cpp should include just "bgzf.h" and let the Makefile set the proper '-I /path'.

I propose you use HTS_LIBDIR and HTS_INCDIR variables in your Makefiles, and optionally HTS_SRC if you want to use the path to a local copy of the htslib source tree. However, it should be enough for a user to provide $(HTS_LIBDIR) and $(HTS_INCDIR).

Finally, you should allow users to link against the shared library with -lhts instead of hardcoding htslib.a, which is probably a good reason to introduce an $(HTS_LIBS) variable as well.

Just some ideas on how to make the package more sysadmin-friendly. ;-)
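Concretely, the proposal above might look something like this Makefile fragment (the variable names are the ones suggested; the default values and the build rule are assumptions, not the actual angsd Makefile):

```
# Hypothetical fragment implementing the proposed HTS_* variables.
HTS_INCDIR ?= /usr/include
HTS_LIBDIR ?= /usr/lib
HTS_LIBS   ?= -lhts        # or a path to libhts.a for a static link

CXXFLAGS += -I$(HTS_INCDIR)
LDFLAGS  += -L$(HTS_LIBDIR)

thetaStat: thetaStat.cpp
	$(CXX) $(CXXFLAGS) $< -o $@ $(LDFLAGS) $(HTS_LIBS) -lz -lpthread
```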

Filter -minInd not working properly

I've recently noticed that the -minInd filter in ANGSD is not working properly. For example, using -minInd 76 on a test run still produces a MAF file containing a site with an nInd of only 75:

chromo position major minor ref knownEM pK-EM nInd
chr10 90625430 C G C 0.135223 0.000000e+00 76
chr10 90625447 G A G 0.031195 1.791576e-10 75
chr10 90625462 C A A 0.319601 0.000000e+00 76

best,
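Until the filter is fixed, the offending rows can be dropped from the MAF output afterwards. A minimal workaround sketch, assuming the column layout shown above (nInd last):

```python
def filter_min_ind(maf_lines, min_ind):
    """Keep the header plus rows whose last (nInd) column meets the threshold."""
    header, *rows = maf_lines
    return [header] + [r for r in rows if int(r.split()[-1]) >= min_ind]

maf = [
    "chromo position major minor ref knownEM pK-EM nInd",
    "chr10 90625430 C G C 0.135223 0.000000e+00 76",
    "chr10 90625447 G A G 0.031195 1.791576e-10 75",
    "chr10 90625462 C A A 0.319601 0.000000e+00 76",
]
print(len(filter_min_ind(maf, 76)))  # 3: the header plus the two nInd=76 sites
```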

ZLIB_VERSION

I downloaded the sample set and have angsd installed on my computer but I keep getting this error:
-> Problem with difference in zlib version used for compiling and linking
-> ZLIB_VERSION: 1.2.8 zlibversion: 1.2.5

I have zlib version 1.2.5 installed on my computer, but angsd doesn't seem to recognize it. I added zlib 1.2.5 to my path, but that didn't fix the issue. I even deleted the newer version of zlib and I still get the same error. How do I tell angsd to use the zlib 1.2.5 that is installed on my computer?

Thank you!

./realSFS : No such file or directory

Hi there,
I am encountering a slight issue with angsd. I recently downloaded and installed angsd and htslib. Running -doSaf worked perfectly, but when I try to run ./realSFS it is not found (-bash: ./realSFS: No such file or directory).
Here is the log of the installation of angsd.

make HTSSRC=../htslib/
HTSSRC defined
make -C misc/ HTSSRC=/Users/stephaniearnoux/Downloads/proj/htslib
HTSSRC defined
/Users/stephaniearnoux/Downloads/proj/htslib/libhts.a
g++ -O3 -c safreader.cpp -I /Users/stephaniearnoux/Downloads/proj/htslib
g++ -O3 -c safstat.cpp -I /Users/stephaniearnoux/Downloads/proj/htslib
g++ -O3 -c realSFS_args.cpp -I /Users/stephaniearnoux/Downloads/proj/htslib
g++ -O3 -c header.cpp -I /Users/stephaniearnoux/Downloads/proj/htslib
g++ -O3 -c fstreader.cpp -I /Users/stephaniearnoux/Downloads/proj/htslib
g++ -O3 -c safcat.cpp -I /Users/stephaniearnoux/Downloads/proj/htslib
g++ -O3 realSFS.cpp -o realSFS safreader.o realSFS_args.o safstat.o fstreader.o header.o safcat.o ../prep_sites.o ../analysisFunction.o -I /Users/stephaniearnoux/Downloads/proj/htslib /Users/stephaniearnoux/Downloads/proj/htslib/libhts.a -lz -lpthread
make install
HTSSRC not defined, assuming systemwide installation -lhts
make: Nothing to be done for `install'.

any suggestion?

angsd -rf combined with -sites seems to fail

I'm running angsd like this:

$angsd -bam $bam -GL 1 -doMaf 2 -doMajorMinor 1 -doGeno 9 -ref $ref -sites $sites -doPost 1 -out $out -rf $rf

I get this error message despite my bam being indexed:

[main_samview] random alignment retrieval only works for indexed BAM or CRAM files. (file: 'bamse_og_kylling.bam' )

Is it because I used an older version of tabix to index my file? What's going on?

I just tried to index one of the bam files with samtools 1.3 and it seems to work. Can you somehow make it compatible with index files generated using an older version of samtools?

beagle genotype posteriors as input (-doSaf 4 and -beagle)

Dear angsd-developers,

I tried to create an saf file from genotype likelihoods in a beagle file:
angsd -beagle file_BEAGLE.input -doSaf 4 -doGeno 32 -out out -fai file.fai -anc file.fas -underFlowProtect 1 -fold 0

This worked nicely from version 0.610 up to version 0.614, but versions 0.800 through 0.911 ask in addition for -GL and BAM files to work on, instead of accepting beagle genotype posteriors. Is -beagle no longer supported in this context, or is this a bug?

Addition: when providing a -glf file (and -nInd) instead of a beagle file as input, angsd produces a segfault, so that is likely a bug as well.

Thank you for your help,
David

GL fields in output VCF don't sum to 1

Hello,
I've been trying to output a VCF file to look at IBD between samples, and I was curious whether the GLs in the VCF file are scaled in any way. The VCF header says they have been scaled to sum to 1, but they don't (the GP fields do sum to 1).

Here's the command I used:
angsd0.902/angsd -bam bamlist.txt -out OUTPUT -uniqueOnly 1 -only_proper_pairs 1 -remove_bads 1 -doCounts 1 -nThreads 27 -doMaf 1 -doMajorMinor 2 -GL 2 -minMapQ 20 -minQ 30 -dovcf 1 -doPost 2

And here's what my VCF looks like:

##fileformat=VCFv4.2(angsd version)
##FORMAT=<ID=GT,Number=1,Type=Integer,Description="Genotype">
##FORMAT=<ID=GP,Number=G,Type=Float,Description="Genotype Probabilities">
##FORMAT=<ID=PL,Number=G,Type=Float,Description="Phred-scaled Genotype Likelihoods">
##FORMAT=<ID=GL,Number=G,Type=Float,Description="scaled Genotype Likelihoods (these are really llh eventhough they sum to one)">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT ind0 ind1
scaffold_0 1616 . T A . PASS . GP:GL 0.666631,0.333333,0.000035:0.000000,-0.301007,-4.277052 0.666578,0.333333,0.000088:0.000000,-0.300972,-3.876948

Thanks very much for any insight!
Evan
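For reference, the GP values in the line above can be reproduced by exponentiating the GL values (base 10) and normalizing, consistent with the header's note that they are really log-likelihoods. A quick check using the numbers shown (an observation about this output, not a statement about ANGSD internals):

```python
# ind0's GL and GP fields from the VCF line above.
gl = [0.000000, -0.301007, -4.277052]
gp = [0.666631, 0.333333, 0.000035]

lik = [10 ** x for x in gl]       # back-transform log10-likelihoods
total = sum(lik)
norm = [x / total for x in lik]   # normalize so they sum to 1

print(all(abs(a - b) < 1e-4 for a, b in zip(norm, gp)))  # True
```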
